INTERSPEECH 2011, 28-31 August 2011, Florence, Italy. Copyright © 2011 ISCA.

Monaural Voiced Speech Segregation Based on Pitch and Comb Filter

Xueliang Zhang 1 and Wenju Liu 2
1 Computer Science Department, Inner Mongolia University, Huhhot, China, 010021
2 National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China, 100190
cszxl@imu.edu.cn, lwj@nlpr.ia.ac.cn

* This work was supported in part by the NSF (no. 60675026, no. 60121302, and no. 90820011), the 863 Program (no. 20060101Z4073, no. 2006AA01Z194), the 973 Program (no. 2004CB318105), and SPH-IMU (no. Z20100112).

ABSTRACT

The correlogram is an important mid-level representation of periodic sounds and is widely used in sound source separation and pitch detection. However, it is very time-consuming to compute. In this paper, we present a novel scheme for monaural voiced speech separation that does not compute correlograms. The noisy speech is first decomposed into time-frequency units. The pitch contour of the target speech is extracted from the zero-crossing rates of the units. A comb filter is then applied to label each unit as target speech or intrusion. Compared with a previous correlogram-based method, the proposed algorithm saves computing time and also yields better performance.

Index Terms: sound separation, computational auditory scene analysis, correlogram

1. INTRODUCTION

The human auditory system has a superior capacity to focus on a single talker among a mixture of conversations and background noises. By exploring how humans perceive sound, Bregman proposed a theory called Auditory Scene Analysis (ASA) [1]. ASA inspired research in the domain referred to as Computational Auditory Scene Analysis (CASA) [2].

The general framework of CASA-based separation systems has two main stages: segmentation and grouping. In segmentation, the acoustic input is decomposed into sensory segments, each of which should originate from a single source. In grouping, segments that likely come from the same source are grouped together according to grouping cues. For voiced speech separation, pitch or fundamental frequency (F0) is one of the most important cues. Given the F0, a system can use the harmonicity principle to group segments in different frequency regions. The harmonicity principle states that an F0 and its overtones are perceived by human listeners as a single source.

A well-established representation of harmonic structure is the correlogram [3], which has been adopted by many CASA systems. The input signal is first decomposed into multiple channels by an auditory filterbank. The correlogram is a running autocorrelation of the signal within a certain period of time in each filter channel. The periodicity of the signal is represented by the corresponding autocorrelation function (ACF). Separation systems [4][5] have also employed the cross-channel correlation of the correlogram for segmentation.

Although CASA still faces many difficulties (such as sequential organization and unvoiced speech segregation), it can be used in some practical applications, e.g., extracting the singing voice from music, or separating musical instruments given the ground-truth F0, which can be obtained from the user's singing or from MIDI files. However, computing the correlogram is very time-consuming, which limits the use of CASA in these kinds of applications.

In this paper, we propose a novel scheme to separate voiced speech from intrusions in the monaural situation without computing correlograms. First, the input signal is decomposed into time-frequency units. Then the units are merged into several segments, and the pitch of the target is extracted by a segment-based method. After that, each unit is labeled as target or intrusion according to the harmonicity principle and the amplitude modulation (AM) criterion [1], which will be introduced later. Finally, the labeled units are separated into foreground and background. Unlike previous systems, the critical parts of the proposed system (segmentation, pitch estimation, and unit labeling) are not based on correlograms and run much faster.

The rest of the paper is organized as follows. In Section 2, we describe the details of each stage. In Section 3, the time complexity of the algorithms is analyzed. The signal-to-noise ratio (SNR) of the separated speech and the computing time are reported in Section 4. A conclusion is given in Section 5.

2. SYSTEM DESCRIPTION

The proposed system has four stages, as shown in Figure 1.

[Fig. 1: Systematic diagram of the proposed four-stage system.]
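As an illustration only, the four-stage flow of Figure 1 can be summarized in a short Python skeleton. Every function name here is a hypothetical placeholder for the corresponding stage in Sections 2.1-2.4, stubbed so that the sketch runs; it is not the authors' implementation.

```python
# Hypothetical skeleton of the four-stage system in Fig. 1; the helpers
# are stubs standing in for the stages described in Sections 2.1-2.4.
def front_end(x, fs):                 # Sec. 2.1: filterbank -> T-F units
    return [], []

def estimate_pitch(units, segments):  # Sec. 2.2: segment-based pitch track
    return []

def label_units(units, pitch):        # Sec. 2.3: comb-filter labeling
    return []

def synthesize(units, labels):        # Sec. 2.4: resynthesize foreground
    return 0.0

def separate(x, fs=16000):
    units, segments = front_end(x, fs)
    pitch = estimate_pitch(units, segments)
    labels = label_units(units, pitch)
    return synthesize(units, labels)
```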


2.1. Front-end processing

In typical front-end processing (e.g., in [4]), the input signal x(t) is decomposed by a gammatone filterbank with 128 channels whose center frequencies are quasi-logarithmically spaced from 80 Hz to 5 kHz and whose bandwidths are set to the equivalent rectangular bandwidth (ERB). The proposed system instead decomposes the signal by forward-backward filtering, for a reason given later. Specifically, x(t) is first passed through the gammatone filterbank; the outputs are time-reversed and re-filtered by the gammatone filterbank, and the filter outputs are then time-reversed again. In this way the phase delay is compensated in each channel. Because forward-backward filtering makes the actual bandwidth narrower than that of the standard gammatone filter, we enlarge the bandwidths to 1.6 times the ERB.

The AM of each gammatone filter output is extracted by bandpass filtering its Hilbert envelope, which is a conventional method [2]. Considering the plausible pitch range of speech, the passband is set from 50 Hz to 550 Hz. The gammatone filter output and its AM at channel c are denoted g(c,t) and e(c,t), respectively.

The output of each channel is then divided into 20 ms time frames with a 10 ms time shift. In the following, u_cm denotes the time-frequency (T-F) unit at frequency channel c and time frame m. As in the Hu and Wang model [4], we use different methods to segregate resolved and unresolved units: a resolved unit is dominated by a single harmonic, while an unresolved unit is dominated by several harmonics. To discriminate resolved from unresolved units, we introduce a feature called the carrier-to-envelope energy ratio (CER), calculated as

$R_{eng}(c,m) = \log \frac{\sum_{n=0}^{W} g^2(c, mT+n)}{\sum_{n=0}^{W} e^2(c, mT+n)}$   (1)

where T = 160 and W = 320, corresponding to the 10 ms time shift and the 20 ms time frame at the 16 kHz sampling rate. u_cm is termed resolved if R_eng(c,m) > θ_R, and unresolved otherwise. The motivation is that when a unit is dominated by several harmonics, its AM is relatively strong, which leads to a small value of R_eng.
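As a rough sketch of the forward-backward trick and the CER feature of eq. (1), assuming each gammatone channel is approximated by an IIR transfer function (b, a), which is one common approximation (all names here are ours, not the paper's):

```python
import numpy as np
from scipy.signal import lfilter

def forward_backward(x, b, a):
    """Zero-phase filtering: filter, time-reverse, filter again, reverse.
    Applying the filter twice cancels the phase delay but narrows the
    bandwidth, which is why the paper widens each channel to 1.6 ERB."""
    y = lfilter(b, a, x)
    y = lfilter(b, a, y[::-1])
    return y[::-1]

def cer(g, e, m, T=160, W=320):
    """Carrier-to-envelope energy ratio, eq. (1), for frame m of one channel,
    given the filter output g(c,.) and its AM envelope e(c,.)."""
    sl = slice(m * T, m * T + W)
    return np.log(np.sum(g[sl] ** 2) / np.sum(e[sl] ** 2))
```

SciPy's `filtfilt` implements the same forward-backward scheme with additional edge handling.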
Segmentation plays an important role in a CASA system. Each segment consists of spatially (in time and frequency) continuous units, generated according to the cross-channel correlation. Instead of computing this from the correlogram (as in [4][5]), we compute the cross-channel correlation directly on the gammatone filter outputs by (2); this is why forward-backward filtering is used to compensate the phase delay:

$C_R(c,m) = \sum_{n=0}^{W-1} \hat{g}(c, mT+n)\, \hat{g}(c+1, mT+n)$   (2)

where $\hat{g}(c,t)$ is the zero-mean, unit-variance version of the filter response at channel c within window m.

In addition, we compute the zero-crossing rate of the gammatone filter output with positive slope in each unit, termed Z(c,m) for u_cm. The ZCR is used for pitch detection in the following subsection.
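A small sketch of the cross-channel correlation of eq. (2); we average the product of the standardized windows so the value lies in [-1, 1], which matches the 0.98 threshold used in Section 2.2 (the normalization is our reading of the equation, and the function names are ours):

```python
import numpy as np

def cross_channel_corr(g_c, g_c1, m, T=160, W=320):
    """Eq. (2): correlation between adjacent channels for frame m, computed
    on phase-compensated filter outputs rather than on correlograms."""
    a = np.asarray(g_c[m * T : m * T + W], dtype=float)
    b = np.asarray(g_c1[m * T : m * T + W], dtype=float)
    a = (a - a.mean()) / (a.std() + 1e-12)   # zero-mean, unit-variance
    b = (b - b.mean()) / (b.std() + 1e-12)
    return float(np.mean(a * b))             # Pearson correlation in [-1, 1]
```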
2.2. Pitch estimation

In this subsection, the continuous pitch contour is extracted based on the segments and the ZCRs of the units. The ZCR does not work well for complex waveforms, so the units used in pitch estimation should each be dominated by a single harmonic; resolved units tend to meet this requirement. To improve accuracy, the resolved units are further selected by segmentation. Specifically, units are first selected as candidates when their CER exceeds the threshold θ_R = 1.82 and their cross-channel correlation is larger than 0.98. Neighboring candidates are then merged into segments, and segments shorter than 30 ms are removed, since they are unlikely to arise from target speech. The remaining units, in the longer segments, are used for pitch estimation.

For pitch detection we use a cosine function, with its frequency set to the unit's ZCR, as a substitute for the autocorrelation function; within the selected units the cosine has a shape similar to the ACF. The rest of the pitch detection resembles the Hu and Wang model: the dominant pitch at each frame is indicated by the maximum peak of the summary cosine function, and the longest segment together with the dominant pitch serves as the criterion for segregating each segment into foreground and background (details can be found in [4]). Different from the process in [4], the pitch estimate is then based on the longest segment in the foreground and its harmonic order, i.e., the harmonic by which that segment is dominated, given by

$n_H = \arg\max_{n}\, SCF\!\left(m, \frac{n}{Z(c,m)}\right), \quad u_{cm} \in S_{long}$   (3)

where S_long stands for the longest segment in the foreground, Z(c,m) is the zero-crossing rate of unit u_cm, and SCF(m,τ) is the summary cosine function of the foreground units at frame m. The pitch period at frame m is determined by

$P(m) = \arg\max_{\tau}\, SCF'(m, \tau)$   (4)

where τ ∈ [2 ms, 12.5 ms], corresponding to the pitch range [80 Hz, 500 Hz], and SCF'(m,τ) is the summary cosine function of the units in the longest segment, restricted to the harmonic range [n_H - 1/2, n_H + 1/2].
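To make the cosine-as-ACF substitution concrete, here is a minimal sketch (naming is ours): each selected unit contributes cos(2πZτ), so the summary peaks where the lag τ is a common period of the units' dominant harmonics.

```python
import numpy as np

def summary_cosine(zcrs_hz, taus_s):
    """Summary cosine function over the selected units of one frame: each
    unit, whose dominant-harmonic frequency is ~ its ZCR, contributes
    cos(2*pi*Z*tau) at every candidate lag tau."""
    return np.cos(2 * np.pi * np.outer(taus_s, zcrs_hz)).sum(axis=1)

# Hypothetical frame: resolved units dominated by harmonics of 200 Hz.
zcrs = np.array([200.0, 400.0, 600.0])
taus = np.arange(0.002, 0.0125, 1.0 / 16000)      # 2-12.5 ms lag grid
scf = summary_cosine(zcrs, taus)
# Take the smallest lag achieving the maximum to avoid period doubling;
# the paper instead restricts the search via the harmonic order, eq. (4).
p = taus[np.argmax(scf >= scf.max() - 1e-9)]
print(round(1.0 / p))                             # -> 200 (Hz)
```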


2.3. Unit labeling

The estimated pitch period is used to label each unit as "target" or "intrusion"; the labeled units are then segregated into foreground and background in the next subsection. As in the Hu and Wang model, resolved and unresolved T-F units are treated differently. A resolved unit is labeled according to the harmonicity principle: if its response frequency is a multiple of the estimated pitch, it is labeled as target. An unresolved unit is labeled according to the AM criterion, which points out that when a filter responds to multiple harmonics of a single harmonic sound source, its response envelope fluctuates at the rate of the F0 of that source. Therefore, an unresolved T-F unit is labeled as target if its AM rate equals the estimated pitch.

To measure these two criteria, we employ an IIR comb filter with sieves at the F0 and its overtones, applied to the flattened gammatone filter output g_N(c,t) or its flattened envelope e_N(c,t). The flattened signals are obtained by (5); this process can be viewed as a simplified simulation of automatic gain control, one of the functions of the cochlea:

$r_N(c,t) = \operatorname{sgn}\big(r(c,t)\big)\, \big|r(c,t)\big|^{\alpha}$   (5)

where r(c,t) stands for the output of the gammatone filter or its envelope at channel c, r_N(c,t) is the flattened signal, and α is the compression rate, here α = 0.1. The flattened signals are then passed through the comb filter

$r_R(c,t) = r_N(c,t) + \beta\, r_R\big(c, t - P(t)\big)$   (6)

where P(t) is the pitch period at time t, obtained by linear interpolation of the frame pitch periods, r_R(c,t) stands for the comb filter output, and β = 0.55.

Unit u_cm is labeled as target if most of the flattened signal passes through the comb filter, i.e., if the relative energy of the comb filter output is above a threshold θ_c; otherwise u_cm is labeled as intrusion:

$\log \frac{\sum_{n=0}^{W-1} r_R^2(c, mT+n)}{\sum_{n=0}^{W-1} r_N^2(c, mT+n)} > \theta_c$   (7)

For a resolved T-F unit, r(c,t) in (5) stands for the gammatone filter output g(c,t), and r_N(c,t) and r_R(c,t) stand for g_N(c,t) and g_R(c,t), respectively. For an unresolved T-F unit, r(c,t) in (5) stands for the envelope e(c,t) of the gammatone filter output. The threshold is θ_c = 0.55 for resolved units and θ_c = 0.45 for unresolved units.
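A compact sketch of the flattening and comb filtering of eqs. (5)-(7); the names are ours, and the per-sample pitch period is assumed to be supplied in samples:

```python
import numpy as np

def flatten(r, alpha=0.1):
    """Eq. (5): sign-preserving amplitude compression (a crude AGC)."""
    return np.sign(r) * np.abs(r) ** alpha

def comb_filter(r_n, period, beta=0.55):
    """Eq. (6): IIR comb filter y[t] = x[t] + beta*y[t - P(t)], with
    passbands at F0 and its overtones. `period` holds the per-sample
    pitch period in samples (e.g., interpolated from frame estimates)."""
    y = np.zeros_like(r_n, dtype=float)
    for t in range(len(r_n)):
        p = int(round(period[t]))
        y[t] = r_n[t] + (beta * y[t - p] if t - p >= 0 else 0.0)
    return y

def is_target(r_r, r_n, theta_c):
    """Eq. (7): relative energy of the comb output within one unit."""
    return np.log(np.sum(r_r ** 2) / np.sum(r_n ** 2)) > theta_c
```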
2.4. Separation and synthesis

Using the labeling information directly as the final decision would lead to errors. Hence, the previous method [4] provided a separation procedure based on segmentation; we follow a similar process with different details.

a) Resolved T-F units are separated based on the segments generated in Section 2.2. Each segment is first marked as matched or mismatched on each frame: if more than 50% of its units on a frame are labeled as target, the segment is called matched on that frame. If more than half of a segment's frames are marked as matched, the segment is grouped into the foreground; otherwise it is grouped into the background. Within the foreground, the units labeled as intrusion are merged into new segments, and those segments longer than 30 ms are moved to the background.

b) Unresolved T-F units respond to several frequency components; if dominated by the target voiced speech, their AM rate equals the F0. Therefore, the flattened envelope is used in (5) and (6). However, a large value in (6) does not guarantee that the AM rate equals the F0: a unit may be dominated by noise with a fractional F0 (e.g., F0/2, F0/3, ...). To eliminate such errors, we compute the ZCR of each unresolved T-F unit on the comb filter output. Segments of unresolved T-F units are formed from spatially continuous candidates whose distance between ZCR and F0 is less than 50%, and the segments longer than 30 ms are grouped into the foreground.

c) Units labeled as target that are not yet in the foreground are iteratively merged into adjacent foreground segments; the rest are grouped into the background.

Finally, the units in the foreground are used to synthesize the waveform of the separated target speech.

3. ANALYSIS OF RUN-TIME COMPLEXITY

In this section, we analyze the run-time complexity of the proposed algorithm and compare it with the Hu and Wang model. As a typical correlogram-based speech separation system, the Hu and Wang model [4] performs much better than earlier systems. Its innovations are: 1) different separation methods for resolved and unresolved harmonics; 2) separation based on segmentation; 3) pitch detection in noisy environments. The correlogram plays a vital role in each of its stages.

Because both separation systems are relatively complicated, we only compare the major processes in each stage. From Table 1, it can be seen that computing correlograms is the bottleneck of the Hu and Wang system. To accelerate the computation, the autocorrelation can be done in the frequency domain [2], which reduces the complexity of computing correlograms to O(CL log W). Another part of the Hu and Wang system that can be accelerated is unit labeling, where the bandpass filtering can be conducted in the frequency domain with complexity O(CL log L). In our algorithm, the complexity of these two counterparts is O(CL). The specific running times of the algorithms are given in the next section.
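As an aside, the frequency-domain acceleration mentioned above amounts to computing each frame's autocorrelation with an FFT; a minimal sketch:

```python
import numpy as np

def acf_fft(frame, max_lag):
    """Autocorrelation of one frame via FFT: O(W log W) per frame instead
    of O(W * D) for the direct sum over D candidate lags."""
    n = len(frame)
    spec = np.fft.rfft(frame, 2 * n)   # zero-pad to avoid circular wrap
    acf = np.fft.irfft(spec * np.conj(spec))
    return acf[:max_lag]
```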


4. SYSTEM EVALUATION

The proposed scheme is evaluated on a corpus of 100 mixtures, composed of ten voiced utterances mixed with ten different kinds of intrusions, collected by Cooke [6]; this corpus is widely used to evaluate separation systems. The ten voiced utterances, which are regarded as the targets, have continuous pitch through nearly their whole duration. The intrusions are ten different kinds of sounds: N0, 1 kHz pure tone; N1, white noise; N2, noise bursts; N3, "cocktail party" noise; N4, rock music; N5, siren; N6, trill telephone; N7, female speech; N8, male speech; and N9, female speech. The sampling rate of the corpus is 16 kHz.

As a commonly used objective performance measure for separation systems [4][5], the signal-to-noise ratio (SNR) is chosen. It is computed as

$SNR = 10 \log_{10} \frac{\sum_t R^2(t)}{\sum_t \big[R(t) - S(t)\big]^2}$   (8)

where R(t) is the clean speech and S(t) is the waveform synthesized by the segregation system.
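For reference, a direct transcription of eq. (8) in code (the function name is ours):

```python
import numpy as np

def snr_db(clean, separated):
    """Eq. (8): SNR of the separated waveform against the clean target."""
    err = clean - separated
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(err ** 2))
```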
In Table 2, each value is the average SNR for one intrusion mixed with the ten target utterances; the average over all intrusions is shown in the last row. As the table shows, our algorithm improves the SNR for most intrusions and produces a gain of 0.7 dB over the Hu and Wang model.

For further comparison, we replace the estimated pitch with the true pitch in both the Hu and Wang model (termed TP-HW) and the proposed algorithm (termed TP-Pro); the true pitch is obtained by running each algorithm on clean speech. The results are also listed in Table 2. TP-HW produces a gain of 0.35 dB over the original Hu and Wang model, while TP-Pro gains 0.17 dB over the proposed algorithm. Although we did not compare the pitch estimation of the two algorithms directly, this suggests that the proposed pitch estimation is at least no worse than the method in the Hu and Wang model. The SNR gap between TP-HW and TP-Pro is 0.53 dB.

To compare computing time, both the proposed algorithm and the Hu and Wang model are implemented in the C language and run on a PC with a 1.6 GHz CPU and 3 GB of memory. The implementation of the Hu and Wang model was provided by Prof. DeLiang Wang. We also accelerate the Hu and Wang model by computing the correlograms and the bandpass filtering in the frequency domain; this version is termed accHW. The computing times are listed in Table 3.

From Table 3, the total duration of the 100 mixtures is 168.3 seconds. The computing time of the Hu and Wang model is 14.6 times real time. The accelerated Hu and Wang model runs at 6.33 times real time, saving 57% of the computing time, while the proposed system runs at 2.23 times real time. Compared with the original and the accelerated Hu and Wang models, the proposed method saves 84.8% and 64.8% of the computing time, respectively.

5. DISCUSSION AND CONCLUSION

In this paper, we propose a novel algorithm for monaural voiced speech separation that avoids computing correlograms. Segmentation, pitch detection, and segregation are all implemented efficiently by the novel scheme. Compared with the Hu and Wang model, a typical correlogram-based algorithm, the proposed scheme achieves better performance and saves computing time.

6. REFERENCES

[1] A. S. Bregman, Auditory Scene Analysis, Cambridge, MA: MIT Press, 1990.
[2] D. L. Wang and G. J. Brown, Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, Wiley-IEEE Press, 2006.
[3] J. C. R. Licklider, "A duplex theory of pitch perception," Experientia, vol. 7, no. 4, pp. 128–134, 1951.
[4] G. N. Hu and D. L. Wang, "Monaural speech segregation based on pitch tracking and amplitude modulation," IEEE Trans. Neural Networks, vol. 15, no. 5, pp. 1135–1150, 2004.
[5] D. L. Wang and G. J. Brown, "Separation of speech from interfering sounds based on oscillatory correlation," IEEE Trans. Neural Networks, vol. 10, no. 3, pp. 684–697, 1999.
[6] M. P. Cooke, Modeling Auditory Processing and Organization, Cambridge, U.K.: Cambridge University Press, 1993.

TABLE 1. COMPARISON OF TIME COMPLEXITY

Stage             | Process              | HW          | Proposed
Front-end proc.   | Signal decomposition | O(CL)       | O(CL)
                  | Envelope extraction  | O(CL log L) | O(CL log L)
                  | Correlograms         | O(CLD)      | –
                  | ZCR                  | –           | O(CL)
Pitch estimation  | Segmentation         | O(CL/T)     | O(CL/T)
                  | Pitch estimation     | O(CL)       | O(CL)
Unit labeling     | Bandpass filtering   | O(CLF)      | –
                  | Comb filtering       | –           | O(CL)
Sep. & synthesis  |                      | O(CL)       | O(CL)

C: number of channels; L: length of the input signal; T: time shift; D: maximum pitch period; F: length of the FIR bandpass filter.

TABLE 2. SNR RESULTS (dB)

Intrusion | Mixture | HW    | Proposed | TP-HW | TP-Pro
N0        | -3.27   | 16.44 | 17.86    | 16.41 | 17.87
N1        | -4.08   | 7.80  | 8.16     | 8.15  | 8.32
N2        | 10.18   | 16.71 | 18.27    | 16.55 | 18.46
N3        | 4.34    | 8.24  | 8.26     | 8.66  | 8.76
N4        | 3.98    | 10.77 | 11.28    | 11.11 | 11.28
N5        | -5.83   | 14.87 | 16.04    | 14.83 | 16.04
N6        | 1.89    | 16.66 | 17.46    | 17.07 | 17.59
N7        | 6.61    | 11.99 | 11.93    | 12.15 | 11.87
N8        | 10.36   | 14.27 | 14.84    | 14.87 | 15.15
N9        | 0.72    | 4.25  | 4.98     | 5.69  | 5.48
AVG       | 2.49    | 12.20 | 12.91    | 12.55 | 13.08

TABLE 3. COMPUTING TIME

         | Total duration | Run time | Real-time factor
Mixture  | 168 s          | –        | –
HW       | –              | 2460 s   | 14.6×RT
accHW    | –              | 1064 s   | 6.33×RT
Proposed | –              | 375 s    | 2.23×RT
