12.07.2015 Views

A computationally efficient multipitch analysis model ... - IEEE Xplore


TOLONEN AND KARJALAINEN: COMPUTATIONALLY EFFICIENT MULTIPITCH ANALYSIS MODEL 711

TABLE I. SUGGESTED PARAMETER VALUES FOR THE PROPOSED TWO-CHANNEL MODEL

Fig. 4. An example of multipitch analysis. A test signal with three clarinet tones with fundamental frequencies 147, 185, and 220 Hz, and relative rms values of 0.4236, 0.7844, and 1, respectively, was analyzed. Top: two-channel SACF, bottom: two-channel ESACF.

which peaks are true pitch peaks. For instance, the autocorrelation function generates peaks at all integer multiples of the fundamental period. Furthermore, in case of musical chords the root tone often appears very strong though in most cases it should not be considered as the fundamental period of any source sound. To be more selective, a peak pruning technique similar to [21], but computationally more straightforward, is used in our model.

The technique is the following. The original SACF curve, as demonstrated above, is first clipped to positive values and then time-scaled (expanded in time) by a factor of two and subtracted from the original clipped SACF function, and again the result is clipped to have positive values only. This removes repetitive peaks with double the time lag where the basic peak is higher than the duplicate. It also removes the near-zero time lag part of the SACF curve. This operation can be repeated for time lag scaling with factors of three, four, five, etc., as far as desired, in order to remove higher multiples of each peak. The resulting function is called here the enhanced summary autocorrelation (ESACF).

An illustrative example of the enhanced SACF analysis is shown in Fig. 4 for a signal consisting of three clarinet tones. The fundamental frequencies of the tones are 147, 185, and 220 Hz. The SACF is depicted on the top and the enhanced SACF curve on the bottom, showing clear indication of the three fundamental periodicities and no other peaks.
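The pruning technique described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `enhance_sacf`, the `max_factor` parameter, the use of linear interpolation for the time scaling, and the choice to stretch the result of the previous pass (rather than the original clipped SACF) when repeating with higher factors are all assumptions made here.

```python
import numpy as np

def enhance_sacf(sacf, max_factor=2):
    """Prune peaks at integer multiples of a lag from a summary
    autocorrelation curve (hypothetical helper, see assumptions above)."""
    sacf = np.asarray(sacf, dtype=float)
    lags = np.arange(len(sacf))
    # Step 1: clip to positive values.
    enhanced = np.maximum(sacf, 0.0)
    for factor in range(2, max_factor + 1):
        # Step 2: expand the curve in time by `factor`. The stretched
        # curve at lag m takes the value of the current curve at
        # lag m / factor (linear interpolation is an assumption).
        stretched = np.interp(lags / factor, lags, enhanced)
        # Step 3: subtract and clip to positive values again.
        enhanced = np.maximum(enhanced - stretched, 0.0)
    return enhanced

# Toy curve: a zero-lag peak plus a peak at lag 100 and its multiples.
lags = np.arange(512)
peaks = [(0, 1.0), (100, 0.9), (200, 0.8), (300, 0.7), (400, 0.6)]
toy_sacf = sum(h * np.exp(-0.5 * ((lags - m) / 3.0) ** 2) for m, h in peaks)

pruned = enhance_sacf(toy_sacf, max_factor=3)
print("lags with remaining energy:", np.nonzero(pruned > 0.1)[0])
```

In this toy curve each multiple is lower than the basic peak, which is the condition the text gives for a duplicate to be removed; the zero-lag region is removed as well because the stretched curve coincides with the original there.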
We have experimented with different musical chords and source instrument sounds. In most cases, sound combinations of two to three sources are resolved quite easily if the amplitude levels of the sources are not too different. For chords with four or more sources, the subsignals easily mask each other so that some of the sources are not resolved reliably. One further idea to improve the pitch resolution with complex mixtures, especially with relatively different amplitudes, is to use an iterative algorithm, whereby the most prominent sounds are first detected and filtered out (see Section VI) or attenuated properly, and then the pitch analysis is repeated for the residual.

IV. MODEL PARAMETERS

The model of Fig. 2 has several parameters that affect the behavior of pitch analysis. In the following, we show with examples the effect of each parameter on the SACF and ESACF representations. In most cases, it is difficult to obtain the correct value for a parameter from the theory of human perception. Rather, we attempt to obtain model performance that is similar to that of the human perception or approximately optimal based on visual inspection of analysis results. The suggested parameter values are collected in Table I.

In the following section, there are two types of illustrations. In Figs. 5 and 7, the SACFs of one frame of the signal are plotted on the top, and one or two ESACFs that correspond to the parameter values that we found best suited are depicted on the bottom. In Figs. 6 and 8, consecutive ESACFs are illustrated as a function of time. This representation is somewhat similar to the spectrogram: time is shown on the horizontal axis, ESACF lag on the vertical axis, and the gray level of a point corresponds to the value of the ESACF. Test signals are mixtures of synthetic harmonic tones, noise, and speech signals.
Each synthetic tone has the amplitude of the first harmonic equal to 1.0, and the amplitude of the nth harmonic equal to [...]. The initial phases of the harmonics are 0. The noise that is added in some examples is white Gaussian noise. In this work, we have used speech samples that have been recorded in an anechoic chamber and in normal office conditions. The sampling rate of all the examples is 22 050 Hz.

A. Compression of Magnitude in Transform Domain

In Section II we motivated the use of transform-domain computation of the “generalized autocorrelation” by two considerations: 1) it allows nonlinear compression of the spectral representation and 2) it is computationally more efficient. The following examples concentrate on magnitude compression and suggest that the normal autocorrelation function ([...]) is suboptimal for our periodicity detection model. Exponential compression is easily available by adjusting parameter [...] in (1).

While we only consider exponential compression in this context, other nonlinear functions may be applied as well. A common choice in the speech processing community is to use the natural logarithm, which results in the cepstrum. Nonlinear compression in the transform domain has been studied in the context of pitch analysis of speech signals [31]. In that study, the periodicity measure was computed using the wideband signal directly without dividing it into channels. It was reported that the compression parameter [...] gave the best results. It was also shown that pre-whitening of the signal spectrum tended to improve the pitch analysis accuracy.

Fig. 5 illustrates the effect of magnitude compression on the SACF. The upper four plots depict the SACFs that are obtained using [...], [...], [...], and [...], whereas the bottom plot shows the ESACF that is obtained from the SACF with [...].
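The transform-domain computation discussed in this subsection can be sketched as follows. This is a minimal single-band illustration under stated assumptions, not the model of (1) itself: the two-channel split is omitted, and the function name `generalized_autocorrelation`, the zero-padding to twice the frame length, and the exponent values tried are all assumptions. The test tone uses ten harmonics with a 1/n amplitude roll-off, which is likewise an assumption made for the demo. An exponent of 2 applied to the magnitude spectrum gives the ordinary autocorrelation; smaller exponents compress the spectral magnitude.

```python
import numpy as np

def generalized_autocorrelation(frame, exponent):
    """Inverse DFT of the magnitude spectrum raised to `exponent`
    (hypothetical helper; exponent 2 is the ordinary autocorrelation)."""
    frame = np.asarray(frame, dtype=float)
    n_fft = 2 * len(frame)  # zero-pad to avoid circular wrap-around (assumption)
    spectrum = np.fft.rfft(frame, n=n_fft)
    compressed = np.abs(spectrum) ** exponent
    # Keep the non-negative lags that fit inside the frame.
    return np.fft.irfft(compressed, n=n_fft)[:len(frame)]

# Demo signal: one harmonic tone at 140 Hz, sampled at 22 050 Hz,
# windowed with a 1024-sample Hamming window.
fs = 22050
n = np.arange(1024)
tone = sum(np.sin(2 * np.pi * 140.0 * h * n / fs) / h for h in range(1, 11))
frame = tone * np.hamming(len(n))

for exponent in (2.0, 1.0, 0.5):  # illustrative values only
    acf = generalized_autocorrelation(frame, exponent)
    peak_lag = 100 + int(np.argmax(acf[100:200]))
    print(f"exponent {exponent}: peak lag {peak_lag} samples, "
          f"about {fs / peak_lag:.1f} Hz")
```

The search range of 100 to 200 samples is chosen only so that the demo picks the peak near the fundamental period of the 140 Hz tone (22 050 / 140 = 157.5 samples) rather than the zero-lag maximum.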
A Hamming window of 1024 samples (46.4 ms with a sampling frequency of 22 050 Hz) is used.

The test signal consists of two synthetic harmonic tones with fundamental frequencies 140.0 and 148.3 Hz and white Gaussian noise with signal-to-noise ratio (SNR) of 2.2 dB. The fundamental frequencies of the harmonic tones are separated
