Computational Models of Music Similarity and their ... - OFAI
2 Audio-based Similarity Measures
The ZCR is the average number of times the audio signal crosses the zero-amplitude line per time unit. Figure 2.1 illustrates this. The ZCR is higher if the signal is noisier. Examples are given in Figure 2.2. The first and second excerpts (both jazz) are close to each other with values below 1.5/ms, and the third and fourth (both hard pop) are close to each other with values above 2.5/ms. Thus the ZCR seems capable of distinguishing jazz from hard pop (at least on these four examples).
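As an illustration (not code from this chapter), a minimal ZCR computation could look like the following sketch, assuming NumPy and a mono signal stored as a 1-D array of samples; the function name is my own.

```python
import numpy as np

def zero_crossing_rate(signal, sample_rate):
    """Average number of zero crossings per millisecond.

    `signal` is a 1-D array of amplitude samples; `sample_rate` is in Hz.
    """
    # A zero crossing occurs wherever consecutive samples differ in sign.
    signs = np.signbit(signal).astype(int)
    crossings = np.sum(np.abs(np.diff(signs)))
    duration_ms = 1000.0 * len(signal) / sample_rate
    return crossings / duration_ms

# A 440 Hz sine sampled at 44.1 kHz crosses zero twice per cycle,
# i.e. roughly 0.88 crossings per millisecond.
t = np.arange(44100) / 44100.0
print(zero_crossing_rate(np.sin(2 * np.pi * 440 * t), 44100))
```

In practice the ZCR would be computed per short frame (e.g. a few tens of milliseconds) and averaged, rather than over a whole excerpt at once.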
The fifth excerpt is electronic dance music with very fast and strong percussive beats. However, the ZCR value is very low. The sixth excerpt is a soft, slow, and bright orchestral piece in which wind instruments play a dominant role. Considering that the ZCR is supposed to measure noisiness, the ZCR value is surprisingly high.
The limitations of the ZCR as a noisiness feature for music are obvious when considering that any ZCR value can be generated with a single sine wave (by varying its frequency). In particular, the ZCR measures not only noise but also pitch. The fifth excerpt has relatively low values because its energy is mainly in the low frequency bands (bass beats). On the other hand, the sixth excerpt has a lot of energy in the higher frequency bands.
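This can be checked numerically: a pure sine of frequency f crosses zero 2f times per second, so its ZCR is 2f/1000 per millisecond, and any target value can be produced by choosing f. A small sketch, assuming NumPy (the helper function is illustrative, not from the text):

```python
import numpy as np

def zero_crossing_rate(signal, sample_rate):
    """ZCR in crossings per millisecond (illustrative helper)."""
    signs = np.signbit(signal).astype(int)
    crossings = np.sum(np.abs(np.diff(signs)))
    return crossings / (1000.0 * len(signal) / sample_rate)

sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
# A sine of frequency f yields a ZCR of about 2f/1000 per ms:
# 750 Hz falls at the 1.5/ms jazz boundary, 1500 Hz above the
# 2.5/ms hard-pop boundary mentioned in the text.
for f in (200, 750, 1500):
    print(f, zero_crossing_rate(np.sin(2 * np.pi * f * t), sample_rate))
```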
This leads to the question of how to evaluate features and to ensure that they perform well in general and not only on the subset used for testing. Obviously it is necessary to use larger test collections. However, in addition it is also helpful to understand what the features describe and how this relates to a human listener's perception (in terms of similarity). Evaluation procedures will be discussed in detail in Section 2.3.
Another question this leads to is how to extract additional information from the audio signal which might solve problems the ZCR cannot. Other features which can be extracted directly from the audio signal include, for example, the Root Mean Square (RMS) energy, which is correlated with loudness (see Subsection 2.2.5). However, in general it is a standard procedure to first transform the audio signal from the time domain to the spectral domain before extracting any further features. This will be discussed in the next subsection.
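As a sketch of the RMS energy just mentioned, assuming NumPy and a signal normalized to [-1, 1] (the function name is my own, not the chapter's):

```python
import numpy as np

def rms_energy(signal):
    """Root Mean Square energy of an audio frame."""
    return np.sqrt(np.mean(np.square(signal)))

# The RMS of a full-scale sine is 1/sqrt(2), about 0.707; halving
# the amplitude halves the RMS, which tracks perceived loudness.
t = np.arange(44100) / 44100.0
loud = np.sin(2 * np.pi * 440 * t)
print(rms_energy(loud), rms_energy(0.5 * loud))
```

Like the ZCR, RMS would typically be computed per frame so that loudness variation over time is preserved.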
2.2.2 Preprocessing (MFCCs)
The basic idea of preprocessing is to transform the raw data so that the interesting information is more easily accessible. In the case of computational models of music similarity this means drastically reducing the overall amount of data. However, this needs to be achieved without losing too much of the information which is critical to a human listener's perception of similarity.