
INTRA-NOTE FEATURES PREDICTION MODEL FOR JAZZ SAXOPHONE PERFORMANCE

Rafael Ramirez    Amaury Hazan    Esteban Maestre

Pompeu Fabra University
Music Technology Group - IUA
Ocata 1, 08003 Barcelona, Spain
{rafael, ahazan, emaestre}@iua.upf.es

ABSTRACT

Expressive performance is an important issue in music which has been studied from different perspectives. In this paper we describe an approach to investigating musical expressive performance based on inductive machine learning. In particular, we focus on the study of variations on intra-note features (e.g. attack) that a saxophone interpreter introduces in order to expressively perform a Jazz standard. The study of these features is intended to build on our current system, which predicts expressive deviations on note duration, note onset and note energy.

1. INTRODUCTION

Modeling expressive music performance is one of the most challenging aspects of computer music. The focus of this paper is the study of how skilled musicians (saxophone Jazz players in particular) express and communicate their view of the musical and emotional content of musical pieces by introducing deviations and changes of various parameters. In the past, we have studied expressive deviations on note duration, note onset and note energy [11, 10]. We have used this study as the basis of an inductive content-based transformation system for performing expressive transformations on musical phrases.

In this paper we focus on the study of the intra-note features (i.e. attack, sustain, release) that the interpreter introduces in order to expressively perform a piece. In particular, we apply machine learning techniques to induce a predictive model for the type of attack, sustain and release of a note according to the context in which the note appears.

The rest of the paper is organized as follows: Section 2 briefly describes how we extract a melodic description from audio recordings. Section 3 describes the approach we have followed for the induction of the predictive model of the mentioned intra-note features. Section 4 reports on some related work, and finally Section 5 presents some conclusions and indicates some areas of future research.

Figure 1. Overview of the description system

2. AUDIO DESCRIPTION

In this section, we summarize how the audio description is extracted from a set of monophonic recordings. We compute descriptors related to three different temporal scopes: some of them related to an analysis frame, some to a note segment, and others related to an intra-note segment. All the descriptors are stored in an XML document. Figure 1 represents a schematic view of our description scheme, including different levels of abstraction. Roughly, the procedure for audio description is as follows: first, the audio signal is divided into analysis frames, and a set of low-level descriptors is computed for each analysis frame. Then, we perform a note segmentation using low-level descriptor values and fundamental frequency. Using note boundaries and low-level descriptors, we carry out an energy-based intra-note segmentation, followed by an amplitude envelope characterization of each intra-note segment.

2.1. Low-level descriptors computation

The main low-level descriptors used to characterize expressive performance are instantaneous energy and fundamental frequency. Energy is computed in the spectral domain, using the values of the amplitude spectrum. For the estimation of the instantaneous fundamental frequency we use a harmonic matching model, the Two-Way Mismatch procedure (TWM) [8]. After a first test of this implementation, some improvements to the original algorithm were implemented and reported in [6].
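As a rough illustration of this frame-level analysis, the following sketch computes per-frame energy from the amplitude spectrum; the frame size, hop size and Hann window are illustrative choices rather than the exact parameters used here, and the TWM fundamental frequency estimator [8] is not reproduced.

```python
import numpy as np

def frame_energy(signal, frame_size=1024, hop_size=512):
    """Per-frame energy from the amplitude spectrum (sketch).

    Frame/hop sizes and the Hann window are illustrative choices,
    not necessarily those used in the paper.
    """
    window = np.hanning(frame_size)
    energies = []
    for start in range(0, len(signal) - frame_size + 1, hop_size):
        frame = signal[start:start + frame_size] * window
        spectrum = np.abs(np.fft.rfft(frame))
        # Energy taken as the sum of squared spectral amplitudes.
        energies.append(float(np.sum(spectrum ** 2)))
    return np.array(energies)
```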

2.2. Note segmentation

Note segmentation is performed using a set of frame descriptors, namely the energy computed in different frequency bands and the fundamental frequency. Energy onsets are first detected following a band-wise algorithm that uses some psycho-acoustical knowledge [7]. In a second step, fundamental frequency transitions are also detected. Finally, both results are merged to find the note boundaries.
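The band-wise onset detector of [7] is beyond the scope of a short sketch; assuming candidate times from the energy-based and fundamental-frequency-based detectors are already available, the merging step could look as follows (the minimum gap between boundaries is an illustrative parameter):

```python
import numpy as np

def merge_boundaries(energy_onsets, f0_transitions, min_gap=0.05):
    """Merge energy-onset times and f0-transition times (in seconds)
    into note boundaries, collapsing candidates closer than min_gap.

    A simplified stand-in for the merging step; the two detectors are
    assumed to have produced the input lists of candidate times.
    """
    candidates = np.sort(np.concatenate([energy_onsets, f0_transitions]))
    boundaries = []
    for t in candidates:
        if not boundaries or t - boundaries[-1] >= min_gap:
            boundaries.append(float(t))
    return boundaries
```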

2.3. Note descriptor computation

We compute note descriptors using the note boundaries and the low-level descriptor values. The low-level descriptors associated with a note segment are computed by averaging the frame values within this note segment. Pitch histograms are used to compute the note pitch and the fundamental frequency that represents each note segment, as described in [9].
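A minimal sketch of this step, assuming per-frame fundamental frequency and energy arrays and note boundaries expressed as frame indices; the semitone quantization used for the histogram is an illustrative simplification of the procedure in [9].

```python
import numpy as np

def note_descriptors(frame_f0, frame_energy, start, end):
    """Average frame descriptors over the note segment [start, end) and
    pick the note pitch from a histogram of frame f0 values (sketch).
    """
    f0 = frame_f0[start:end]
    energy = frame_energy[start:end]
    voiced = f0[f0 > 0]                      # ignore unvoiced frames
    if len(voiced) == 0:
        return None
    # Histogram of frame pitches quantized to MIDI note numbers
    # (an illustrative stand-in for the pitch histogram of [9]).
    midi = np.round(69 + 12 * np.log2(voiced / 440.0)).astype(int)
    pitch_note = int(np.bincount(midi).argmax())
    return {
        "mean_energy": float(np.mean(energy)),
        "mean_f0": float(np.mean(voiced)),
        "pitch_note": pitch_note,
    }
```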

2.4. Intra-note description

We perform intra-note segmentation based on the energy envelope contour. Once note boundaries are found, the energy envelopes of the notes are extracted from the recordings and divided into three segments, namely attack, sustain and release. Figure 2 shows the representation of these segments for a melody fragment.

The procedure for intra-note energy-based description is outlined as follows: we consider each audio note segment as a differentiable function over time. We compute the zero-crossings of the third derivative [Jenssen] in order to select the segment characteristic points, i.e. maximum curvature points. Then, we join these points by straight lines, forming a set of consecutive linear segments. We compute the slopes and durations (i.e. the projection on the time axis) of the linear fragments, and finally define an attack, a sustain and a release section. We characterize the complete envelope of a note by the following (self-explanatory) six descriptors: Attack Relative Duration (ARD), Attack Regression Slope (ARS), Attack Relative End Value (AREV), Sustain Regression Slope (SS), Release Relative Duration (RRD), and Release Regression Slope (RRS). We obtained a correlation coefficient of 0.83 for our data set of 4360 notes.
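The following sketch illustrates this characterization on a single note's energy envelope (one value per frame). The smoothing width and the rule that takes the first and last characteristic points as the end of the attack and the start of the release are illustrative assumptions; the exact rules are not fixed in the paper.

```python
import numpy as np

def envelope_descriptors(env):
    """Six intra-note descriptors for one note's energy envelope (sketch).

    Characteristic points are approximated as sign changes of the third
    difference of a lightly smoothed envelope; treating the first one as
    the attack end and the last one as the release start is an
    illustrative heuristic, not necessarily the authors' exact rule.
    """
    env = np.convolve(env, np.ones(5) / 5.0, mode="same")   # light smoothing
    d3 = np.diff(env, n=3)
    points = np.where(np.diff(np.sign(d3)) != 0)[0] + 2     # zero-crossings of 3rd derivative
    if len(points) < 2:
        points = np.array([len(env) // 3, 2 * len(env) // 3])
    attack_end, release_start = int(points[0]), int(points[-1])
    n = len(env)

    def slope(a, b):
        # Regression slope of the envelope over frames [a, b].
        if b <= a:
            return 0.0
        x = np.arange(a, b + 1)
        return float(np.polyfit(x, env[a:b + 1], 1)[0])

    peak = float(np.max(env)) or 1.0
    return {
        "ARD": attack_end / n,                    # Attack Relative Duration
        "ARS": slope(0, attack_end),              # Attack Regression Slope
        "AREV": float(env[attack_end]) / peak,    # Attack Relative End Value
        "SS": slope(attack_end, release_start),   # Sustain Regression Slope
        "RRD": (n - release_start) / n,           # Release Relative Duration
        "RRS": slope(release_start, n - 1),       # Release Regression Slope
    }
```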

Figure 2. Example of linear envelope approximation of a sequence of saxophone notes

3. INTRA-NOTE PREDICTIVE MODEL

In this section, we describe our inductive approach for learning the intra-note predictive model from performances of Jazz standards by a skilled saxophone player. Our aim is to find an intra-note-level model which predicts how a particular note in a particular context should be played (i.e. predicts its envelope). This, we believe, will improve our current model, which predicts note-level deviations on note duration, note onset and note energy. As we have pointed out in the past, we are aware that not all the expressive transformations performed by a musician can be predicted at a local note level. Musicians perform music considering a number of abstract structures (e.g. musical phrases), which makes expressive performance a multi-level phenomenon. In this context, our ultimate aim is to obtain an integrated model of expressive performance which combines note-level knowledge with structure-level knowledge. Thus, the work presented in this paper may be seen as a starting step towards this ultimate aim.

Training data. The training data used in our experimental investigations are monophonic recordings of four Jazz standards (Body and Soul, Once I Loved, Like Someone in Love and Up Jumped Spring) performed by a professional musician at 11 different tempos around the nominal tempo (apart from the tempo requirements, the musician was not given any particular instructions on how to perform the pieces). A set of melodic features is extracted from the recordings and stored in a structured format. The performances are then compared with their corresponding scores in order to automatically compute the transformations performed.
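For illustration, and assuming the performed notes have already been aligned with their score counterparts (the alignment itself is described in [11]), the note-level transformations could be computed along the following lines; the field names and the ratio/difference choices are assumptions, not the exact measures used here.

```python
def note_deviations(score_note, perf_note):
    """Note-level expressive deviations for one aligned score/performance
    note pair (sketch). Each note is a dict with 'onset', 'duration' and
    'energy'; these names and the ratio/difference choices are
    illustrative assumptions.
    """
    return {
        "duration_ratio": perf_note["duration"] / score_note["duration"],
        "onset_deviation": perf_note["onset"] - score_note["onset"],
        "energy": perf_note["energy"],
    }
```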

Descriptors. In this paper, we are concerned with intra-note (in particular, the note's envelope) expressive transformations. Given a note's musical context in the score, we are interested in inducing a model for predicting aspects of the note's envelope. The musical context of each note is defined in a structured way using first order logic predicates. The predicate melo/10 specifies information both about the note itself and about the local context in which it appears. Information about intrinsic properties of the note includes the note duration and the note's metrical position, while information about its context includes the duration of the previous and following notes, and the extension and direction of the intervals between the note and both the previous and the subsequent note. The predicate contextnarmour/4 specifies information about the Narmour groups a particular note belongs to. The temporal aspect of music is encoded via the predicate succ/4. For instance, succ(A,B,C,D) indicates that the note in position D in the excerpt indexed by the tuple (A,B) follows note C.
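To make the encoding concrete, the sketch below emits succ/4 facts and melo-style facts from a list of note records. The exact argument layout of melo/10 is not spelled out here, so the one used below (position, pitch, duration, metrical position, neighbouring durations and signed intervals) is an illustrative assumption.

```python
def note_facts(piece_id, tempo_id, notes):
    """Emit Prolog-style facts for the ILP learner (sketch). Each element
    of `notes` is a dict describing one note and its local context; the
    argument layout of melo/10 below is an assumption, not the authors'
    exact encoding.
    """
    facts = []
    for i, n in enumerate(notes):
        facts.append("melo({},{},{},{},{},{},{},{},{},{}).".format(
            piece_id, tempo_id, i,
            n["pitch"], n["dur"], n["metrical_pos"],
            n["prev_dur"], n["next_dur"],
            n["prev_interval"], n["next_interval"]))
        if i > 0:
            # succ(A,B,C,D): in the excerpt indexed by (A,B), the note at
            # position D follows the note at position C.
            facts.append("succ({},{},{},{}).".format(piece_id, tempo_id, i - 1, i))
    return facts
```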

The use of first order logic for specifying the musical context of each note is much more convenient than using traditional attribute-value (propositional) representations. Encoding both the notion of successor notes and Narmour group membership would be very difficult with a propositional representation. In order to mine the structured data we used Tilde's top-down decision tree induction algorithm [2]. Tilde can be considered a first order logic extension of the C4.5 decision tree algorithm: instead of testing attribute values at the nodes of the tree, Tilde tests logical predicates. This provides the advantages of both propositional decision trees (i.e. efficiency and pruning techniques) and first order logic (i.e. increased expressiveness). The increased expressiveness of first order logic not only provides a more elegant and efficient specification of the musical context of a note, but also yields a more accurate predictive model (more on this later). Tilde can also be used to build multivariate regression trees, i.e. trees able to predict vectors. In our case the predicted vectors are the amplitude envelope linear approximation descriptors, which can be seen as a first step towards a more complete (e.g. pitch and centroid envelopes) and more refined (e.g. using spline regression) intra-note description.

Table 1 compares the correlation coefficients obtained by 10-fold cross-validation of a propositional regression tree model and our first order logic regression tree model for each descriptor in the amplitude envelope prediction task. On the one hand, the first order logic model takes into account a wider musical context of the note by considering an arbitrary temporal window around it (via the succ/4 predicate). On the other hand, the propositional model only considers the note's local context, i.e. the temporal information is restricted to the duration and pitch of the previous and following notes. In Table 1, ARD refers to Attack Relative Duration, ARS to Attack Regression Slope, AREV to Attack Relative End Value, SS to Sustain Regression Slope, RRD to Release Relative Duration, and RRS to Release Regression Slope. The pruned tree models have an average size of 69 nodes for the propositional model and 123 nodes for the first order logic model. We also obtained correlation coefficients of 0.71 and 0.58 for the complete amplitude envelope prediction task by performing a 10-fold cross-validation for the first order logic model and the propositional model, respectively. Table 2 shows the average absolute error (AAE) and the root mean squared error (RMSE) for each descriptor in the amplitude envelope prediction task.
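For reference, a propositional baseline of this kind can be reproduced with an off-the-shelf multi-output regression tree evaluated by 10-fold cross-validation, reporting a Pearson correlation per envelope descriptor. The sketch below assumes scikit-learn and SciPy, a feature matrix X of note-context attributes and a target matrix Y with the six envelope descriptors (one row per note); it mirrors only the attribute-value baseline, not Tilde's first order logic trees.

```python
from scipy.stats import pearsonr
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

def evaluate_propositional_baseline(X, Y,
                                    names=("ARD", "ARS", "AREV", "SS", "RRD", "RRS")):
    """10-fold cross-validated Pearson correlation per envelope descriptor
    for a propositional multi-output regression tree (a stand-in for the
    attribute-value baseline). X: note-context attributes, Y: the six
    envelope descriptors, one row per note (NumPy arrays).
    """
    tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=0)
    Y_pred = cross_val_predict(tree, X, Y, cv=10)
    return {name: float(pearsonr(Y[:, i], Y_pred[:, i])[0])
            for i, name in enumerate(names)}
```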

Morphological attribute    C.C. (Prop)    C.C. (FOL)
ARD                        0.19           0.27
ARS                        0.39           0.51
AREV                       0.51           0.65
SS                         0.22           0.24
RRD                        0.24           0.31
RRS                        0.31           0.40

Table 1. Comparison of Pearson correlation coefficients obtained by 10-fold cross-validation for the propositional and first order logic models.

Morphological attribute    AAE     RMSE
ARD                        0.11    0.19
ARS                        0.05    0.08
AREV                       0.65    0.91
SS                         0.01    0.01
RRD                        0.20    0.26
RRS                        0.05    0.11

Table 2. Errors for each descriptor in the amplitude envelope prediction task by the first order logic model, obtained by 10-fold cross-validation.

4. RELATED WORK

Previous research in building performance models in a somewhat structured musical context has included a broad spectrum of music domains.

Widmer et al. [12] have focused on the task of discovering general rules of expressive classical piano performance, as well as generating them from real performance data, via inductive machine learning. The performance data used for the study are MIDI and audio recordings of piano sonatas by W.A. Mozart performed by a skilled pianist. In addition to these data, the music score, along with a hierarchical phrase structure description done by hand, was also encoded. The resulting substantial data set consists of information about the nominal note onsets, durations, metrical information and annotations. However, given that they are interested in classical piano performances, they do not study intra-note expressive variations. Here, we are interested in saxophone recordings of Jazz standards, and thus we have studied deviations on local duration, onset and energy, as well as ornamentations [11]. In [10], we generate saxophone expressive performances by predicting local note deviations.

In [4], Dannenberg et al. study trumpet envelopes by computing amplitude envelope descriptors for a total of 125 contours (i.e. three-note sequences varying in interval size, direction, and articulation). Statistical analysis techniques allowed them to find significant groupings of the envelopes by interval or direction types, or by more specific intentions (e.g. staccato, legato). This work is extended to a system that combines instrument and performance models [3]. The authors do not take into account duration or onset deviations, or ornamentations.

Dubnov and Rodet [5] have followed a similar line. The behavior of the sound as it occurs in the course of actual performances of several solo works is analyzed in order to build a model able to reproduce aspects of the sound texture originated by the expressive inflexions of the performer. Correlations between pitch, energy and spectral envelope variations are studied, and the phase coherence between pitch and energy, as well as the decomposition into periodic and noise components, are investigated. To our knowledge, these models are devised after a preliminary statistical analysis rather than being induced from the training data. A possible reason may be the difficulty of parameterizing continuous data from real-world acoustic recordings and feeding a machine learning component with the parameterized data.

A first step towards continuous signal parameterization is taken in [1], where pitch-continuous voice signals, namely Indian gamakas, are parameterized with Bezier splines. An approximation of the amplitude, pitch and centroid curves is obtained, and the reduced data can be used to render similar signals. Nevertheless, the proposed representation lacks a higher-level context (e.g. attack, sustain) that could be used to analyze and synthesize specific parts of the audio signal more accurately.

5. CONCLUSION

This paper describes an inductive logic programming approach for learning expressive performance transformations at the intra-note level. In particular, we have focused on the study of variations on intra-note features that a saxophone interpreter introduces in order to expressively perform Jazz standards. We have compared the induced first order logic model with a propositional model and concluded that the increased expressiveness of first order logic not only provides a more elegant and efficient specification of the musical context of a note, but also yields a more accurate predictive model than the one obtained with propositional machine learning techniques.

Future work: This paper presents work in progress, so there is future work in several directions. We plan to explore different envelope approximations, e.g. pitch and centroid envelopes or splines, for characterizing the notes' envelopes. Another short-term goal is to incorporate the intra-note model described in this paper into our current system, which predicts expressive deviations on note duration, note onset and note energy. This will give us a better validation of the intra-note model. We also plan to increase the amount of training data as well as experiment with different information encoded in it. Increasing the training data, extending the information in it and combining it with background musical knowledge will certainly lead to a more complete model.

Acknowledgments: This work is supported by the Spanish TIC project ProMusic (TIC 2003-07776-C02-01). We would like to thank Emilia Gómez and Maarten Grachten for pre-processing and providing the data.

6. REFERENCES

[1] B. Battey. Bezier spline modeling of pitch-continuous melodic expression and ornamentation. Computer Music Journal, 28(4), 2004.

[2] H. Blockeel, L. De Raedt, and J. Ramon. Top-down induction of clustering trees. In J. Shavlik, editor, Proceedings of the 15th International Conference on Machine Learning, pages 53-63, Madison, Wisconsin, USA, 1998. Morgan Kaufmann.

[3] R.B. Dannenberg and I. Derenyi. Combining instrument and performance models for high quality music synthesis. Journal of New Music Research, 27(3), 1998.

[4] R.B. Dannenberg, H. Pellerin, and I. Derenyi. A study of trumpet envelopes. In Proceedings of the International Computer Music Conference, San Francisco: International Computer Music Association, 1998.

[5] S. Dubnov and X. Rodet. Study of spectro-temporal parameters in musical performance, with applications for expressive instrument synthesis. In 1998 IEEE International Conference on Systems, Man and Cybernetics, San Diego, USA, November 1998.

[6] E. Gómez. Melodic Description of Audio Signals for Music Content Processing. PhD thesis, Pompeu Fabra University, 2002.

[7] A. Klapuri. Sound onset detection by applying psychoacoustic knowledge. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1999.

[8] R.C. Maher and J.W. Beauchamp. Fundamental frequency estimation of musical signals using a two-way mismatch procedure. Journal of the Acoustical Society of America, 95, 1994.

[9] R.J. McNab, L.A. Smith, and I.H. Witten. Signal processing for melody transcription. SIG working paper, volume 95-22, 1996.

[10] R. Ramirez and A. Hazan. Modeling expressive music performance in jazz. In Proceedings of the 18th Florida Artificial Intelligence Research Society Conference (FLAIRS 2005), Clearwater Beach, FL, 2005.

[11] R. Ramirez, A. Hazan, E. Gómez, and E. Maestre. Understanding expressive transformations in saxophone jazz performances using inductive machine learning. In Proceedings of Sound and Music Computing '04, Paris, France, 2004.

[12] G. Widmer and A. Tobudic. Playing Mozart by analogy: Learning multi-level timing and dynamics strategies. Journal of New Music Research, 2003.
