
2 Audio-based Similarity Measures

The ZCR is the average number of times the audio signal crosses the zero amplitude line per time unit. Figure 2.1 illustrates this. The ZCR is higher if the signal is noisier. Examples are given in Figure 2.2. The first and second excerpts (both jazz) are close to each other with values below 1.5/ms, and the third and fourth (both hard pop) are close to each other with values above 2.5/ms. Thus the ZCR seems capable of distinguishing jazz from hard pop (at least on these four examples).
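
To make the definition concrete, the following minimal sketch (not part of the original text; all names are chosen here for illustration) counts sign changes between consecutive samples and normalizes by the duration in milliseconds, matching the /ms units used above:

```python
import numpy as np

def zero_crossing_rate(signal, sample_rate):
    """Average number of zero crossings per millisecond.

    A crossing is counted whenever two consecutive samples
    differ in sign.
    """
    crossings = np.sum(np.abs(np.diff(np.signbit(signal).astype(int))))
    duration_ms = 1000.0 * len(signal) / sample_rate
    return crossings / duration_ms

# White noise changes sign on about half of all sample pairs,
# while a low-frequency tone crosses zero only twice per period.
sr = 44100
t = np.arange(sr) / sr                  # one second of audio
noise = np.random.randn(sr)
tone = np.sin(2 * np.pi * 100 * t)      # 100 Hz sine

print(zero_crossing_rate(noise, sr))    # ~22 crossings/ms
print(zero_crossing_rate(tone, sr))     # ~0.2 crossings/ms (2*100/1000)
```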

The fifth excerpt is electronic dance music with very fast and strong percussive beats. However, the ZCR value is very low. The sixth excerpt is a soft, slow, and bright orchestra piece in which wind instruments play a dominant role. Considering that the ZCR is supposed to measure noisiness, its value here is surprisingly high.

The limitations of the ZCR as a noisiness feature for music become obvious when one considers that any ZCR value can be generated with a single sine wave (by varying its frequency). In particular, the ZCR measures not only noise but also pitch. The fifth excerpt has relatively low values because its energy is mainly in the low frequency bands (bass beats). The sixth excerpt, on the other hand, has a lot of energy in the higher frequency bands.
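
This pitch dependence is easy to check numerically: a pure sine wave at f Hz crosses zero 2f times per second, i.e. 2f/1000 crossings per millisecond. Reusing the zero_crossing_rate sketch from above:

```python
# A pure tone's ZCR depends only on its frequency, not on noisiness.
for f in [440, 750, 1250]:
    tone = np.sin(2 * np.pi * f * t)    # t, sr as defined above
    print(f, zero_crossing_rate(tone, sr))
# 440 Hz -> ~0.88/ms, 750 Hz -> ~1.5/ms, 1250 Hz -> ~2.5/ms:
# the same range that separated the jazz and hard pop excerpts,
# produced here without any noise at all.
```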

This leads to the question of how to evaluate features and ensure that they perform well in general, not only on the subset used for testing. Obviously it is necessary to use larger test collections. In addition, however, it is also helpful to understand what the features describe and how this relates to a human listener's perception (in terms of similarity). Evaluation procedures will be discussed in detail in Section 2.3.

Another question this leads to is how to extract additional information from the audio signal which might solve problems the ZCR cannot. Other features which can be extracted directly from the audio signal include, for example, the Root Mean Square (RMS) energy, which is correlated with loudness (see Subsection 2.2.5). In general, however, it is standard procedure to first transform the audio signal from the time domain to the spectral domain before extracting any further features. This will be discussed in the next subsection.
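
For reference, both steps can be sketched in a few lines (again not from the original text; the frame length of 1024 samples is an arbitrary choice for illustration):

```python
def rms_energy(signal, frame_length=1024):
    """Root Mean Square energy of each non-overlapping frame."""
    n = len(signal) // frame_length
    frames = signal[:n * frame_length].reshape(n, frame_length)
    return np.sqrt(np.mean(frames ** 2, axis=1))

# The usual first step for further features: move each frame to the
# spectral domain, here via the magnitude of the real-valued FFT.
def magnitude_spectra(signal, frame_length=1024):
    n = len(signal) // frame_length
    frames = signal[:n * frame_length].reshape(n, frame_length)
    return np.abs(np.fft.rfft(frames, axis=1))
```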

2.2.2 Preprocessing (MFCCs)

The basic idea of preprocessing is to transform the raw data so that the interesting information is more easily accessible. In the case of computational models of music similarity this means drastically reducing the overall amount of data. However, this needs to be achieved without losing too much of the information which is critical to a human listener's perception of similarity.
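
To give a feel for the scale of this reduction, the sketch below uses the librosa library (an assumption of this example, not a tool referenced in the text) to compute MFCCs for a 30-second excerpt; the file name is a placeholder:

```python
import librosa

# Load a 30-second excerpt; librosa resamples to 22050 Hz by default.
y, sr = librosa.load("example.wav", duration=30.0)  # placeholder file

# 20 coefficients per frame is a common, but arbitrary, choice.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

print(y.size)     # ~661,500 raw samples
print(mfcc.size)  # ~25,840 MFCC values, over 25x fewer numbers
```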
