Sounding Out Reconstruction Error-Based Evaluation of Generative Models of Expressive Performance

Generative models of expressive piano performance are usually assessed by comparing their predictions to a reference human performance. A generative algorithm is taken to be better than competing ones if it produces performances that are closer to a human reference performance. However, expert human performers can (and do) interpret music in different ways, making for different possible references, and quantitative closeness is not necessarily aligned with perceptual similarity, raising concerns about the validity of this evaluation approach. In this work, we present a number of experiments that shed light on this problem. Using precisely measured high-quality performances of classical piano music, we carry out a listening test indicating that listeners can sometimes perceive subtle performance differences that go unnoticed under quantitative evaluation. We further present tests indicating that such evaluation frameworks vary considerably in reliability and validity across different reference performances and pieces. We discuss these results and their implications for quantitative evaluation, and hope to foster a critical appreciation of the uncertainties involved in quantitative assessments of such performances within the wider music information retrieval (MIR) community.


INTRODUCTION
Recent years have seen the creation and publication of several corpora of precisely measured and score-aligned piano performances within the MIR and digital musicology communities [17,23,29]. This has renewed interest in computational models of expressive piano performance, in particular the data-driven kind. Yet it has also rekindled concerns surrounding the direct applicability of large-scale data processing and machine learning techniques to this type of data.
This paper addresses one such concern, namely the quantitative evaluation of generative models of expressive piano performance (GMEPP) at scale. Quantitative evaluation in itself is nothing new: GMEPPs are routinely evaluated in terms of how close their predictions are to actual human expert performances. This closeness is generally estimated with figures of merit such as reconstruction errors [7,17] or likelihood functions [10,15,18].
One issue with this type of evaluation is that a model producing a performance that is numerically close (in some aspect yet to be clarified) to an expert piano performance, i.e., a model that does well according to generally accepted figures of merit, possibly misses the actual goal of GMEPP: producing convincing, musical, and consistent performances for human listeners. Evaluating by asking such listeners is, however, only an option in a minority of situations, and most of the time the training, development, and evaluation of GMEPP requires scalable, automated metrics. This potential goal misalignment raises at least two questions: Is a measured distance to a human reference performance related to the perceptual similarity of performances? And is the choice of an arbitrary human reference performance immaterial for the evaluative outcome?
These questions tap into profound epistemic, perceptual, and axiological issues beyond the scope of this article. What we can and do address in the following are three smaller, but nevertheless operatively useful questions about current reconstruction error-based evaluation (REE) techniques:
• Can listeners discern performances that are indistinguishable under REE?
• To what extent does REE reliably favor [performances by] the same model under different reference and piece conditions?
• To what extent does REE validly identify the [performances by] expert pianists under different reference and piece conditions?
To assess these issues, we set up two experiments. First, a listening test asks participants to identify expert performances in pairs of expert and artificially generated performances. Second, we investigate the reliability and validity of REE, using the previously assessed artificially generated performances as negatives. We discuss the results of these experiments in the context of the literature on the production and perception of expressive performance, and we identify potential steps to improve quantitative evaluation of GMEPP. With more, larger, and ecologically valid (i.e., stemming from realistic performance scenarios) datasets of expressive piano performance becoming publicly available and used by the wider MIR community, we hope this discussion fosters a critical appreciation of the uncertainties involved in quantitative assessments of such performances.
The rest of this paper is structured as follows: Section 2 details the framework of quantitative GMEPP evaluation as investigated in this article. Section 3 describes how we extract and preprocess expressive parameters from recordings of expressive expert performances. Section 4 describes the performance discernment listening test and Section 5 details the reliability and validity experiments. Finally, Section 6 discusses these results for the evaluation of GMEPP and concludes this article. The audio files, code, and data are available at https://github.com/CPJKU/performance_similarity_dlfm23.

A FRAMEWORK OF QUANTITATIVE EVALUATION
To aid the description of our experiments, we formalize reconstruction error-based evaluation as the following framework, shown schematically in Figure 1. We consider a two-model evaluation framework which asks the question: is performance P1 produced by Model1 "better" than P2 produced by Model2? Concretely, the framework takes two performances P1 and P2 of the same piece, one generated by each model, and computes their reconstruction errors with respect to (wrt) a third, expert reference performance (RP) of the same piece. The standard evaluative argument of GMEPP is as follows: the model which produced the performance with the smaller reconstruction error is favored and its performance is taken to be more musical.
This seemingly overly formal description of a simple and widely used evaluation technique allows us to formulate experiments about evaluation by controlling specific elements. Specifically, we use the framework to evaluate models with controlled ground truth wrt evaluation, i.e., models which are known to be better or worse. Getting performances of good, musical models is straightforward: any human expert performance can be taken as such. However, finding unmusical performances faces the performance research version of the Anna Karenina principle: all musical performances are (potentially) alike, but all unmusical performances are unmusical in their own way. To mitigate this complexity, we opt for a type of randomization to create unmusical performances.
Before we describe our process to create (and validate our choice of) unmusical performances in Section 2.2, we briefly introduce the numerical representation of expressive performances in Section 2.1.
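The two-model comparison described above can be illustrated with a minimal sketch. It assumes that each performance is already given as an equally long sequence of feature values; the function names and the toy beat-period values are hypothetical and not taken from the accompanying repository.

```python
from statistics import fmean


def mse(a, b):
    """Mean squared error between two equally long feature sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return fmean([(x - y) ** 2 for x, y in zip(a, b)])


def favors_model1(p1, p2, rp):
    """Return True if P1 has the smaller reconstruction error wrt RP."""
    return mse(p1, rp) < mse(p2, rp)


# Invented toy beat-period curves (seconds per beat), one value per score onset.
rp = [0.50, 0.52, 0.55, 0.60, 0.58]  # expert reference performance
p1 = [0.51, 0.52, 0.54, 0.61, 0.57]  # performance produced by Model1
p2 = [0.45, 0.60, 0.50, 0.70, 0.50]  # performance produced by Model2

decision = favors_model1(p1, p2, rp)
```

Here the decision depends on the single reference `rp`; swapping in another expert performance as reference may change it, which is the effect studied in the experiments.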

Figure 1: Schematic representation of our framework for two-model evaluation. These frameworks are commonly used for the comparison of two or more candidate models of expressive performance. In our experiments, however, the models are specifically designed for their known ground truth wrt evaluation (in the sense discussed in Section 2): Model 1 only produces expert performances (purple), Model 2 only randomly sampled performances (orange), i.e., Model 1 is the musically valid one. The two models produce a performance each (P1 and P2). The MSE of each performance with respect to an expert reference performance (RP) is measured (one error term per performance, row 3). The comparison of the error terms (row 4) outputs a Boolean decision value (red).
We then connect our two main experiments to the framework in Section 2.3.

Numerical Representation of Performances
To capture nuances and deviations from the score in performances, we use numerical features. Every performance yields sequences of measurements, each encoding an expressively relevant attribute, e.g., tempo. A sequence contains one value (e.g., the current beat period) for each note or score onset (from now on broadly referred to as dimensions), i.e., performances can differ from one another in each of these dimensions, and distance metrics aggregate the differences across dimensions into a single value. An example of the numerical sequence representation of performances in terms of beat period is illustrated in Figure 2. Performances differ from each other (vertically) in each dimension, i.e., at each score onset on the horizontal axis.

Randomization within the Ball of Expert Performances
Given a number of expert performances of the same piece, any one of them can be chosen as a reference performance. This means that any other expert performance sits at some distance from the reference (in a high-dimensional space), some closer, some further away. If we are able to synthesize performances with an expected distance no greater than this expert performance's distance from the reference, our performances would, in expectation and according to the framework, seem as good as an expert performance. In other words, any performance in a (high-dimensional) ball around the average expert performance looks musical to this evaluation framework, given that the ball diameter is the average distance between pairs of expert performances.
Our aim is thus to randomize performances that stay within this ball in expectation. We approximate this using a mixture of Gaussian random variables, set at the mean of quantiles of the average expert performance. Figure 3 illustrates this process from top to bottom. We start by computing the average expression feature curve across performers of a chosen excerpt. We then split the dimensions (horizontal axis) according to quantiles of the expression feature (vertical axis). Finally, we define a Gaussian random variable for each quantile, defined by the mean of the expression curves within the quantile and a configurable standard deviation, which we refer to as noise level. Setting the noise level allows for (probabilistic) control over the (expected) distance to an average expert performance. Generally, the higher the noise level, the further away the performance.
A possible result of randomization is shown in Figure 2 using an excerpt of a Mozart piano sonata. Expert performances are shown in gray and a generated randomized performance in red. Note that the shaded area where many expert performances come to lie (mean performance and one standard deviation above and below) is not an illustration of the high-dimensional ball defined by curves that do not exceed an average reconstruction error wrt the average performance. Such high-dimensional balls are difficult to visualize, but intuitively, large deviations in a few dimensions are possible if the values in a majority of other dimensions fall very close to the reference.
Furthermore, note that this Gaussian mixture is not guaranteed to stay within this ball for general sequences or even for every possible expression feature sequence. The performances of one excerpt might lie very close together, narrowing down the possibilities such that even noise level zero, i.e., a quantile-wise deadpan performance, is beyond the ball. However, we never see this happen on our data.
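The sampling steps described above (average curve, quantile split, one Gaussian per quantile) can be sketched as follows. This is an illustrative reading of the description rather than the implementation used for the experiments: the function name `randomize`, the equal-sized rank-based quantile split, and the parameter names are assumptions.

```python
import random
from statistics import fmean


def randomize(curves, n_quantiles=4, noise_level=0.05, rng=None):
    """Sample one randomized curve from expert curves of equal length.

    curves: list of expert feature sequences (one per performer).
    noise_level: standard deviation shared by all quantile-wise Gaussians.
    """
    rng = rng or random.Random()
    n_dims = len(curves[0])
    # Average expert curve: mean across performers at every dimension.
    avg = [fmean([c[i] for c in curves]) for i in range(n_dims)]
    # Rank dimensions by their average value and split the ranking
    # into n_quantiles groups of (nearly) equal size.
    order = sorted(range(n_dims), key=lambda i: avg[i])
    sample = [0.0] * n_dims
    for k in range(n_quantiles):
        group = order[k * n_dims // n_quantiles:(k + 1) * n_dims // n_quantiles]
        if not group:
            continue
        # The Gaussian of this quantile is centered on the quantile mean.
        mu = fmean([avg[i] for i in group])
        for i in group:
            sample[i] = rng.gauss(mu, noise_level)
    return sample


# Noise level zero yields the quantile-wise deadpan performance.
deadpan = randomize([[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]],
                    n_quantiles=2, noise_level=0.0)
```

Raising `noise_level` increases the expected distance of the sample from the average expert curve, which is the control the text refers to.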

Experiments
In the first experiment, we are interested in the capacity of listeners to discern slightly randomized performances that look similar to expert performances under the quantitative framework. Do listeners perceive these randomizations or are they too fine? In a listening test, we present participants with several pairs of performance excerpts, each pair consisting of one expert and one randomized performance, and ask them to identify the expert performance in each pair. The randomizations are created with precisely controlled error rates of the framework for each excerpt.
In the second experiment, we use the same kind of randomized and expert performance pairs as in the previous one, however, with increased randomization strength and no excerpt-wise configuration of the randomization. In this scenario, a listener should be overwhelmingly likely to identify the randomization, at the cost of the randomization also being more visible to the evaluation framework, i.e., the framework should identify more than 50 % of human performances correctly. This experiment addresses the second and third of our guiding questions: the reliability, i.e., the evaluative consistency, and the validity, i.e., the evaluative correctness, of the quantitative evaluation framework under various reference performances and pieces.

METHODS
The previous discussion of the framework remained abstract. In this section we discuss the concrete datasets, expression features, standardizations, metrics, and randomizations used in the two experiments.

Datasets
For our analysis we use excerpts of MIDI or MIDI-like recordings with performed notes matched to their corresponding score notes, extracted from two datasets:
Vienna 4x22: This dataset was originally compiled by Goebl [12] and consists of 4 excerpts of solo piano pieces, each performed by 22 pianists. The excerpts are the first 21 bars of Chopin's Etude Op. 10 No. 3, the first 45 bars of Chopin's Ballade in F Op. 38, the first 36 bars (i.e., the theme) of Mozart's Piano Sonata in A K 331, and all 32 bars of Schubert D783 No. 15 (with repeats played). All performances were recorded on a Bösendorfer 290 SE Grand Piano as MIDI-like data, and each played note was subsequently matched to its respective score note.
KAIST / International Piano-e-Competition: This dataset consists of MIDI recordings of performances from several editions of the International Piano-e-Competition, for a number of which researchers at KAIST [17] collected and corrected scores in MusicXML format. All performances were recorded on Yamaha Disklavier instruments. The scores and performances have been aligned by KAIST using Nakamura et al.'s HMM-based alignment tool [21]. We converted these alignments to Matchfile format [11], extracted the pieces for which more than eight performances exist (or more than five in the case of Bach's Well-Tempered Clavier), and cleaned up the alignments.
Taken together, this yields 33 pieces or excerpts thereof (16 by Frédéric Chopin, 8 by Johann Sebastian Bach, 5 by Ludwig van Beethoven, 2 by Franz Liszt, 1 by Wolfgang Amadeus Mozart, and 1 by Franz Schubert) with 40786 unique score onsets, each played at least 6 and at most 34 times, for a total of 476 performances.

Expression Features
In order to compare expressive parameters, we do not work directly with note-wise onsets, dynamics, etc., but compute four expression features: two onset-wise features (tempo and velocity), and two note-wise features (timing and articulation). These features are defined as follows:
• A tempo curve is derived by dividing the performed inter-onset interval (IOI) by the score IOI for every score onset, where the performed onsets are first averaged across notes sharing a score onset. Tempo curves thus encode the local tempo in seconds per beat (aka beat period).
• Likewise, dynamics curves are computed as the average MIDI velocity of the individual notes at each score onset.
• We define timing as the note-wise deviation, in milliseconds, from the average onset time of the notes at a common score onset (as used in the tempo computation above). The timings of notes sharing a score onset thus sum to zero; the timing of a note that is alone at its onset is also zero.
• We define articulation as the base-two logarithm of the played duration divided by the product of the notated duration and the beat period.
These definitions are by no means universal; however, these or equivalent expression features are commonly used (see Section 4.1 in [5]).
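As an illustration of the definitions above, the following sketch computes beat periods, timing deviations, and articulation from plain lists. The function names, argument units (score positions in beats, performed times in seconds), and example values are assumptions made for this example; note that the last score onset has no following IOI, so the sketch returns one beat period fewer than there are onsets.

```python
import math


def beat_periods(score_onsets, performed_onsets):
    """Performed IOI divided by score IOI, in seconds per beat.

    score_onsets: unique score onset positions in beats (ascending).
    performed_onsets: performed onset times in seconds, already averaged
    across the notes sharing each score onset.
    """
    return [
        (performed_onsets[i + 1] - performed_onsets[i])
        / (score_onsets[i + 1] - score_onsets[i])
        for i in range(len(score_onsets) - 1)
    ]


def timing_deviations_ms(note_onsets_sec):
    """Deviation of each note from the mean onset of its score onset, in ms."""
    mean_onset = sum(note_onsets_sec) / len(note_onsets_sec)
    return [1000.0 * (t - mean_onset) for t in note_onsets_sec]


def articulation(played_duration_sec, notated_duration_beats, beat_period):
    """Base-two log of played duration over (notated duration x beat period)."""
    return math.log2(played_duration_sec / (notated_duration_beats * beat_period))


bp = beat_periods([0, 1, 2, 4], [0.0, 0.5, 1.0, 2.5])
chord_timing = timing_deviations_ms([0.99, 1.01, 1.00])
staccato = articulation(0.25, 1.0, 0.5)  # note held for half its notated length
```

Under this definition an articulation of 0 corresponds to a note held for exactly its notated duration at the current tempo, negative values to shorter and positive values to longer notes.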

Standardization and Metric
The literature provides many examples of standardization, factoring, and smoothing of expressive parameter curves (e.g., [8,9,20,27]). Li et al. [20] proposed a number of standardization techniques which they compared as parameters in a model selection test. We evaluate four standardization techniques: mean standardization, mean-log standardization, mean/variance standardization (aka sampling standard score), and no standardization, under mean squared error (MSE). In the middle plot of Figure 2 non-standardized tempo curves are shown, and in the bottom plot we show the same curves, but mean-log standardized. Note that the MSE of two series of data points, i.e., performances, $p_1$ and $p_2$ under mean/variance standardization is equivalent to $2 - 2 \times r(p_1, p_2)$, where $r$ is the Pearson correlation coefficient. The given test results hence also imply evaluation under correlation, another commonly used metric.
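The stated equivalence between MSE under mean/variance standardization and Pearson correlation can be checked numerically with a small sketch. It assumes standardization with the population standard deviation (dividing by the number of points rather than by one less), under which the identity holds exactly; the helper names and the toy curves are invented for this example.

```python
import math
from statistics import fmean, pstdev


def standardize(x):
    """Mean/variance standardization using the population standard deviation."""
    m, s = fmean(x), pstdev(x)
    return [(v - m) / s for v in x]


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return fmean([(u - v) ** 2 for u, v in zip(a, b)])


def pearson(a, b):
    """Pearson correlation coefficient computed from raw values."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Invented beat-period curves of two performances of the same excerpt.
p1 = [0.50, 0.55, 0.61, 0.58, 0.52]
p2 = [0.48, 0.57, 0.66, 0.55, 0.50]

lhs = mse(standardize(p1), standardize(p2))
rhs = 2 - 2 * pearson(p1, p2)
```

The two quantities agree up to floating-point error, so ranking performances by standardized MSE is the same as ranking them by decreasing correlation with the reference.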

Randomization
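For our experiments we use the following quantile and noise level settings. The listening test uses quartiles ($Q_1$, $Q_2$, $Q_3$, $Q_4$); the noise level $\sigma$ is set for each excerpt individually to control the evaluative validity of the framework. Formally, the randomizations are sampled from:
\[
\tilde{x}_i \sim \mathcal{N}\left(\mu_{Q_k}, \sigma^2\right) \quad \text{for } i \in Q_k,\; k \in \{1, 2, 3, 4\}.
\]
The second experiment uses unequal quantiles (lowest 5%, center 90%, and highest 5%); the noise level is set to the average standard deviation across performances $\bar{\sigma}$. Formally, the randomizations are sampled from:
\[
\tilde{x}_i \sim \mathcal{N}\left(\mu_{Q}, \bar{\sigma}^2\right) \quad \text{for } i \in Q,\; Q \in \{Q_{\mathrm{low}}, Q_{\mathrm{mid}}, Q_{\mathrm{high}}\},
\]
\[
Q_{\mathrm{low}} = \{i : p(x_i) \le 0.05\}, \quad Q_{\mathrm{mid}} = \{i : 0.05 < p(x_i) \le 0.95\}, \quad Q_{\mathrm{high}} = \{i : p(x_i) > 0.95\},
\]
where $\mu_Q$ is the mean of the average expert curve over the dimensions in quantile $Q$, and the quantiles of dimensions $i$ are shown as sets constrained by the probabilities $p$ of curve values $x$ at these dimensions.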

LISTENER DISCERNMENT EXPERIMENT
Using a listening test we estimate the degree to which listeners are capable of discerning differences in performance expression features that look similar under the quantitative evaluation framework.

Data
For the listening test we extract excerpts of pieces of the Vienna 4x22 dataset. We use two excerpts per expression feature, with all four expression features (tempo, timing, articulation, and velocity) being investigated, making for a total of eight excerpts. The excerpts are chosen based on two considerations. First, they need to cover enough musical material to be able to judge phrasing and timbre, but not be too long for the listeners; we opt for 8 to 10 measures. Second, we extract all excerpts fulfilling the length criterion and measure their inter-performance correlation. For each expression feature, we choose the excerpts with the highest and the lowest correlations, respectively. For high-correlation excerpts, performances are very consistent across performers; we thus expect the randomization ball to be small, and identification of randomized performances correspondingly harder. We further double the number of test pairs by using two noise levels. Noise level 50 refers to standard deviations set in the randomization such that the framework identifies 50 % of the pairs correctly, i.e., the framework evaluates at chance level and cannot distinguish the randomization from an expert performance. At noise level 90, the framework identifies 90 % of randomizations. We expect listeners to be able to identify the stronger randomizations (noise level 90) with greater ease. Each of the eight excerpts is matched with 44 randomized performances: 22 at noise level 50 and 22 at noise level 90.

Listening Test
Participants are provided with an online questionnaire of 16 pairs of performances, one for each test case, randomly sampled from the 22 × 22 possible (random × expert) pairings. On the first page, listeners are instructed on the task (listening to the two audio files and identifying the expert performance among them) and presented with the five items of the short Musical Training subsection of the Goldsmiths Musical Sophistication self-assessment Index (GMSI). Of the participants that completed the GMSI questions, 56% engaged in regular practice of a musical instrument for 4 or more years and 69% reported practicing their primary instrument for at least 2 hours per day. Listeners can start, pause, stop, or rewind the audio excerpts at their leisure. The possible answers are: performance 1 is the expert performance, performance 2 is the expert performance, and undecided.

Results
More than 250 listeners participated in the online study, with usable (unskipped) answers per (noise level × feature) configuration ranging from 185 to 240. Table 1 presents the results of the listening test. The table breaks down the answers hierarchically, with the top row identifying the four expression features studied. The next five rows divide each feature into two noise levels and report from top to bottom: the noise level used, the total number of answers, the number of correct expert performer identifications, the ratio of correct identifications as a percentage, and finally the probability (as a percentage) of this outcome under a binomial distribution with success probability 0.5, the distribution corresponding to the null hypothesis that listeners cannot discern the expert performances. The next six rows report the same values again, albeit further broken down by excerpt. For each excerpt we further note the starting point and duration in measures.
Most apparent from the results is the inconsistency of listener discernment across features. Listeners largely fail to perform better than chance for timing and velocity, yet show clear (and statistically significant at p=0.01) discernment for articulation and tempo. Furthermore, the noise level influences the results as assumed for tempo and articulation, but fails to influence the judgment of the other two in a significant way. Addressing our first guiding question, listeners discerned randomizations in both articulation and tempo which are indistinguishable under the evaluative framework (noise level 50). However, the framework readily identifies stronger randomizations (noise level 90) in velocity and timing, which escape the listeners.
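The probability of a listening test outcome under the null hypothesis can be computed directly from the binomial distribution. The sketch below assumes that the reported probability is the upper tail, i.e., the chance of observing at least the given number of correct identifications; this reading, the function name, and the example counts are assumptions and not taken from Table 1.

```python
from math import comb


def binomial_upper_tail(n_correct, n_answers, p_success=0.5):
    """P(X >= n_correct) for X ~ Binomial(n_answers, p_success)."""
    return sum(
        comb(n_answers, j) * p_success**j * (1 - p_success) ** (n_answers - j)
        for j in range(n_correct, n_answers + 1)
    )


# Hypothetical outcome: 130 correct identifications out of 200 answers.
p_value = binomial_upper_tail(130, 200)
significant = p_value < 0.01
```

A small tail probability means that the observed number of correct identifications would be unlikely if listeners were merely guessing.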

VALIDITY AND RELIABILITY EXPERIMENT
This experiment addresses our guiding questions two and three, concerning the reliability and validity of reconstruction error-based evaluations under different reference performances, respectively. All experiments are carried out with respect to two of the performances' expressive parameters, namely onset-wise tempo and dynamics curves.
Table 1: Results of the listening test broken down hierarchically, with the top row identifying the four expression features. The next five rows divide each feature into two noise levels and report from top to bottom: the noise level used, the total number of answers, the number of correct expert performer identifications, the ratio of correct identifications as a percentage, and finally the probability (as a percentage) of this outcome under the null hypothesis. The next six rows report the same values again, albeit broken down by excerpt.
See Figure 1 for a schematic representation of the framework. We use this framework to evaluate expert performances against randomized ones. For each piece in the combined datasets described above (see Section 3.1), we create 64 randomized performances. The randomization starts from the average expert performance and follows the process described in Section 2.2 and used in the listening test, albeit with one major difference: the randomization follows a mixture of three Gaussians corresponding to the top 5 %, bottom 5 %, and center 90 % quantiles, and the noise level is set to the overall average standard deviation of the expression features for each piece. Given the results of the listening test, we assume a listener to be overwhelmingly likely to identify the randomization, at least for the tempo curves.

Reliability and Validity
Using the described ground truth models, we compute validity and reliability values for the given evaluation framework. We define reliability as the consistency of the evaluation framework under changes of reference and across a variety of pieces, independent of the correctness of the result. Given a human expert performance (produced by the 'musical' model) and a random sequence (produced by the 'unmusical' model), does the framework consistently favor the same model wrt different reference performances? This consistency is quantified as the average correlation of the binary output of the two-model evaluation (0 = model 1 has smaller MSE, 1 = model 2 has smaller MSE) wrt different targets. This is interpretable as the inter-reference-performance correlation of the evaluation framework. A perfectly reliable evaluation always favors the same model independent of the RP.
We define the validity of the framework as the extent to which it accurately recovers the ground truth. A perfectly valid evaluation will always favor the expert performance and reject the randomized sequence. Numerically, validity is estimated by the ratio of tests that erroneously favor the randomized performance over all possible reference, test, and random performance combinations. As for reliability, we compute and compare this number across a variety of pieces.
All in all, then, both tempo and dynamics under four standardizations are evaluated in two tests over 33 pieces. This amounts to a total of 2 × 4 × 2 × 33 = 528 experiments. Every test is carried out for the $n$ reference performances, $n - 1$ test expert performances, and 64 randomly sampled performances, where $n$ is the number of expert performances available for the respective piece.
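The reliability and validity estimates defined above can be sketched for a single piece as follows. This is one possible reading of the definitions, not the code used for the reported results: the function names are hypothetical, the decisions of two references are compared only on the test performances that are neither of the two references, and two constant but identical decision vectors are counted as perfectly correlated (an assumption, since the Pearson correlation is undefined in that case).

```python
import math
from itertools import combinations
from statistics import fmean


def mse(a, b):
    return fmean([(x - y) ** 2 for x, y in zip(a, b)])


def decisions(experts, randoms, ref_idx, skip=()):
    """Binary outputs for one reference: 1 if the random curve has smaller MSE."""
    ref = experts[ref_idx]
    out = []
    for e_idx, expert in enumerate(experts):
        if e_idx == ref_idx or e_idx in skip:
            continue
        for rand in randoms:
            out.append(1 if mse(rand, ref) < mse(expert, ref) else 0)
    return out


def validity_error(experts, randoms):
    """Ratio of tests that favor the randomized performance (0.0 is best)."""
    all_d = [d for r in range(len(experts)) for d in decisions(experts, randoms, r)]
    return fmean(all_d)


def binary_correlation(a, b):
    """Pearson correlation of two 0/1 vectors, or None if it is undefined."""
    ma, mb = fmean(a), fmean(b)
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Assumption: identical constant outputs count as perfect agreement.
        return 1.0 if a == b else None
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(var_a * var_b)


def reliability(experts, randoms):
    """Average correlation of decisions across pairs of reference performances."""
    corrs = []
    for r1, r2 in combinations(range(len(experts)), 2):
        d1 = decisions(experts, randoms, r1, skip=(r2,))
        d2 = decisions(experts, randoms, r2, skip=(r1,))
        c = binary_correlation(d1, d2)
        if c is not None:
            corrs.append(c)
    return fmean(corrs) if corrs else None


# Invented toy data: three near-identical experts, two obviously random curves.
experts = [[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0]]
randoms = [[5.0, 5.0, 5.0, 5.0], [-5.0, 5.0, -5.0, 5.0]]
err = validity_error(experts, randoms)
rel = reliability(experts, randoms)
```

The sketch needs at least three expert performances per piece, since comparing two references requires at least one remaining test performance.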

Results
In this section, we present the results of selected tests. Results are reported in Table 2, one part for dynamics curves and another for tempo curves. MSE between expression features under mean/variance standardization, i.e., the standard score per performance excerpt, proved most beneficial for the framework's discernment capacity and is hence used throughout the experiments. Each row in Table 2 represents a piece. The first four columns give the name and opus number of the piece, its composer, the number of expert performances, and the number of their shared onsets.
The following four columns are given once for tempo curves and once for dynamics curves. They report the means of three MSE distributions: the inter-expert-performance MSEs, the MSEs between expert performances and randomized performances, and the MSEs among randomized performances. The next column reports the reliability of the two-model evaluation as the mean of correlations among the two-model tests over different target performances. Lastly, the validity of the framework is given as the percentage of randomly sampled performances with lower MSE than a given expert performance.

Reliability. Values in Table 2 relating to important aspects of reliability are colored in red. The average correlation of all two-model evaluations wrt dynamics curves is 0.85, the highest value being 1.0 and the lowest 0.09 (Table 2, col. 8). The average correlation of all two-model evaluations wrt tempo curves is 0.73, the highest value being 1.0 and the lowest 0.13 (col. 14). Generally, there is agreement in a majority of pieces and less reliability in a minority. For 13 pieces, the correlation of evaluations drops below 0.5 wrt tempo or dynamics, highlighting high variation across pieces. The pieces exhibiting low reliability differ between the tempo curve and dynamics curve tests. Only one piece (Grande Etude de Paganini S.141 No 1) shows correlation below 0.5 in both tempo and dynamics curves.

Validity. Values in Table 2 relating to the validity of the two-model evaluation are colored in blue. The two-model evaluation validity tests show an average of 5.3 % of comparisons wrt dynamics, and an average of 14.0 % of comparisons wrt tempo, favoring the randomly sampled performance (col. 9/14). The average for all pieces is not weighted by the number of expert performances or tests. The percentages vary greatly, from perfect recovery of all expert performances to 88.3 % of evaluations favoring random performances. Again, valid evaluation wrt tempo does not imply valid evaluation wrt dynamics and vice versa. 15 pieces exhibit good performance of the framework, with rejection of random performances in more than 90 % of cases for both tempo and dynamics.

DISCUSSION AND CONCLUSIONS
Performance data is complex and sometimes more opaque than apparent at first glance. Not without reason have researchers interested in performance practice and computational performance modelling spent decades dissecting the minutiae of phrasing, melody lead, and pedalling, to name just a few aspects. To better appreciate the breadth of issues, we briefly discuss several research directions.
Directly implied in our investigations are computational models of expressive performance; we refer to [5] and [19] for a comprehensive overview. For an overview of methods for evaluating computational models of expressive performance, we refer the reader to [4].
Other cues come from performance research related to listener judgments, e.g., the seminal work by Repp [25], which presents evidence suggesting that listeners prefer average performances. Wesolowski et al. [28] present a critical view of listeners' aesthetic judgments as a methodological tool for evaluating the differences in jazz ensemble performances by analyzing their ratings' variability. The music psychology literature provides evidence showing that the assessment of the (aesthetic) quality of a performance depends not only on the auditory component of a performance (e.g., [24]).
Performance practice research is also interested in an entirely different type of perceptual classification of performance, namely semantic descriptors of expressive performance, or, more commonly, instrumental timbre. By way of example, we refer to the sequence of studies undertaken by Bernays et al. [1,2,3], or more recently and from within the MIR community [6]. Besides verbal descriptions, quantitative performance research often takes the form of detailed analyses of expression features in specific contexts. Exemplary work was carried out by Goebl et al. [12,14], e.g., their work on the sources of melody lead [13].
From a music education perspective, Gururani et al. [16] investigate quantitative descriptors for assessing the quality of performance. Pati et al. [22] present a deep learning-based approach to assess student music performance.
Our tests add to the knowledge surrounding measured expressive performances and their generative models. They indicate that MSE-based model evaluation does not necessarily favor the same performance reliably wrt different targets. Furthermore, MSE-based model evaluation is not dependably capable of discerning expert performances from randomized performances. The pieces under examination show great variability both wrt the tests and wrt the closeness of expert performances. Listeners perceive randomizations in articulation and tempo that escape the evaluation framework, but they do not notice randomizations in velocity and microtiming with the same acuity. Reasons for this can be sought both in the perception and in the production of expressive performances.
How then can automatic, quantitative evaluation be improved? Our experiments and experience allow only for tentative answers, but answers they still are: Most settings seem to benefit from more fine-grained evaluations. Shorter excerpts tend to give more reliable and valid results and are better suited to localize errors. If multiple performances are available, test excerpts can be chosen which have high internal consistency, i.e., high inter-performer correlation or low inter-performer MSE, respectively. Ideally these excerpts can relate to specific and discussed performance issues like phrasing, clear voices, specific timbre, etc. Formulated in the negative, researchers should avoid resting their evaluative arguments on aggregated absolute errors across large, undocumented test dataset splits. Such numbers carry too little information about the models under scrutiny.
Even better evaluation could plausibly be achieved with distributional metrics, e.g., the probability of generated performances under a Gaussian process (GP) regressor fitted with expert performances or, inversely, the likelihood of a generative GP model, like the model Teramura et al. [26] proposed, for test performances. In a similar vein, trained neural network (NN) discriminators seem a promising avenue for future research. However, neither tractable (GP) nor intractable (NN) approaches are a priori connected to listener judgment. This is by no means an exhaustive discussion of issues surrounding the perception, characterization, and quantification of expressive performance, but we hope it serves to gain an appreciation of the intricacies of this data. Prospective as well as seasoned researchers in the field of GMEPP do well to remind themselves of these facts: piano performances are aesthetically, culturally, and axiologically rich, dynamic, and complex musical objects.
Can listeners discern performances that are indistinguishable under REE? • To what extent does REE reliably favor [performances by] the same model under different reference and piece conditions?

Figure 3: Illustration of the sampling process approximating the ball of expert performances with a mixture of three Gaussian random variables. The average performance (opaque blue, top) is computed from expert performances (translucent blue, top) and segmented into quantiles (red boxes). A randomized performance (orange, bottom) is then sampled from a Gaussian distribution for each quantile, with a standard deviation controlled by the noise level parameter.
22 pianists.The excerpts are the first 21 bars of Chopin's Etude Op. 10 No. 3, the first 45 bars of Chopin's Ballade in F Op. 38, the first 36 bars (i.e., the theme) of Mozart's Piano Sonata in A K 331, and all 32 bars of Schubert D783 No. 15 (with repeats played).All performances were recorded on a Bösendorfer 290 SE Grand Piano as MIDI-like data and subsequently each played note matched to its respective score note.