Monaural Speech Segregation Based on Pitch Tracking and ...

Monaural Speech Segregation Based on Pitch Tracking and Amplitude Modulation
Written by: Guoning Hu and DeLiang Wang
Presented by: Eli Yazovitsky

Topics
• Introduction
• Definitions
• Assumption and innovations
• Model Overview
• Results and Comparison
• Summary
• My Opinion

Introduction
• Speech almost always occurs together with acoustic interference (the "cocktail party" problem)
• Systems that use more than one microphone: blind source separation, or a sensor array for spatial filtering
• Some systems need a monaural (one-microphone) solution
• CASA (computational auditory scene analysis) systems try to simulate the human auditory system
• CASA systems have two main stages: segmentation (analysis) and grouping (synthesis)

Definitions
• Filter-bank view: a bank of auditory band-pass filters decomposes the input into time-frequency (T-F) units
• The auditory filters used are "gammatone" filters:
$$g(t) = \begin{cases} t^{l-1} \exp(-2\pi b t) \cos(2\pi f t), & t \ge 0 \\ 0, & \text{else} \end{cases}$$
• Resolved and unresolved harmonics
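As a minimal sketch of the gammatone definition above, the impulse response can be sampled directly. The filter order l = 4, the sampling rate, and the values chosen for f and b are illustrative assumptions, not values stated on the slides.

```python
import math

def gammatone(t, f, b, order=4):
    """g(t) = t^(l-1) * exp(-2*pi*b*t) * cos(2*pi*f*t) for t >= 0, else 0."""
    if t < 0:
        return 0.0
    envelope = t ** (order - 1) * math.exp(-2 * math.pi * b * t)
    return envelope * math.cos(2 * math.pi * f * t)

# Sample the first 10 ms of one filter's impulse response.
fs = 16000              # sampling rate in Hz (assumed)
f, b = 1000.0, 125.0    # center frequency and bandwidth parameter in Hz (assumed)
g = [gammatone(n / fs, f, b) for n in range(fs // 100)]
```

Convolving the input with one such response per channel would give the filter outputs h(c, n) used in the later formulas.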

Definitions (cont.)
• Low-order harmonics: one per auditory filter (resolved)
• High-order harmonics: two or more per auditory filter (unresolved)
• Amplitude modulation
• Autocorrelation:
$$A_H(c, m, \tau) = \frac{1}{N_c} \sum_{n=0}^{N_c-1} h(c, mT - n)\, h(c, mT - n - \tau)$$

Model Overview
Signal decomposition:
• 128-channel filter bank
• Quasi-logarithmically spaced from 80 Hz to 5000 Hz
• The output is divided into 20-ms frames with 10-ms overlap
• The output is thereby decomposed into two-dimensional T-F units
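The framing step can be worked through numerically. This is a small sketch, assuming a 16 kHz sampling rate (not stated on the slide); `frame_starts` is a hypothetical helper name.

```python
def frame_starts(num_samples, fs, frame_ms=20, overlap_ms=10):
    """Start index of every full frame, plus the frame length in samples."""
    frame_len = fs * frame_ms // 1000
    hop = frame_len - fs * overlap_ms // 1000   # shift between frames
    starts = list(range(0, num_samples - frame_len + 1, hop))
    return starts, frame_len

# One second of audio at the assumed rate.
starts, frame_len = frame_starts(num_samples=16000, fs=16000)
```

Under these assumptions each frame holds 320 samples and consecutive frames start 160 samples apart, so one second of signal yields 99 full frames per channel.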

Model Overview (cont.)
Auditory feature extraction:
Correlogram:
$$A_H(c, m, \tau) = \frac{1}{N_c} \sum_{n=0}^{N_c-1} h(c, mT - n)\, h(c, mT - n - \tau)$$
Dominant pitch (find the lag $\tau_D(m)$ that maximizes $S$):
$$S(m, \tau) = \sum_c A_H(c, m, \tau)$$
Envelope correlogram:
$$A_E(c, m, \tau) = \frac{1}{N_c} \sum_{n=0}^{N_c-1} h_E(c, mT - n)\, h_E(c, mT - n - \tau)$$
Cross-channel correlation:
$$C_H(c, m) = \sum_{\tau=0}^{L-1} A_H(c, m, \tau)\, A_H(c+1, m, \tau)$$
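The correlogram and dominant-pitch formulas can be sketched on a toy signal. The two synthetic "channels", the window length, and the lag range below are assumptions chosen for illustration; a real system would use the gammatone filter outputs.

```python
import math

def autocorr(h, end, lag, n_c):
    """A_H for one T-F unit: mean of h[end - n] * h[end - n - lag], n = 0..n_c-1."""
    return sum(h[end - n] * h[end - n - lag] for n in range(n_c)) / n_c

def dominant_lag(channels, end, n_c, lags):
    """Lag that maximizes the summary correlogram S(m, tau) = sum over c of A_H."""
    def summary(lag):
        return sum(autocorr(h, end, lag, n_c) for h in channels)
    return max(lags, key=summary)

# Toy input: the first and second harmonic of a source with a 50-sample period.
channels = [
    [math.cos(2 * math.pi * n / 50) for n in range(400)],
    [math.cos(2 * math.pi * n / 25) for n in range(400)],
]
tau_d = dominant_lag(channels, end=399, n_c=100, lags=range(20, 81))
```

Here `end` plays the role of mT, and `tau_d` lands on the common period of 50 samples because that is the only lag in the range where both channels peak together.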

Graphs

Initial Segregation
• Units with sufficient response energy and high cross-channel correlation are considered:
$$A_H(c, m, 0) > \theta_H^2; \quad C_H(c, m) > \theta_C; \quad \theta_H = 50; \quad \theta_C = 0.985$$
• Segments shorter than 30 ms are removed
• At this point, the segments formed for unresolved units are very small
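The two thresholds can be applied per unit as in the sketch below. Scaling the correlation so that it lies in [-1, 1] is an assumption here (a threshold of 0.985 suggests some such normalization, but the slide does not spell it out), and both function names are hypothetical.

```python
import math

def normalized_correlation(a, b):
    """Correlation of two sequences after removing means and scaling to unit norm."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0

def keep_unit(energy, cross_corr, theta_h=50.0, theta_c=0.985):
    """Initial test: A_H(c, m, 0) > theta_H^2 and C_H(c, m) > theta_C."""
    return energy > theta_h ** 2 and cross_corr > theta_c
```

For example, `keep_unit(2600.0, 0.99)` passes both tests, while a unit with energy 2400.0 fails the energy test because 2400 is below 50 squared.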

Initial Grouping
• Segments are grouped into two streams:
  $S_F^0$, which corresponds to the target speech
  $S_B^0$, which corresponds to the intrusion
• A segment is assigned to $S_F^0$ when, at each frame, at least half of its units agree with the dominant pitch
• A unit agrees with the dominant pitch when it satisfies:
$$\frac{A_H(c, m, \tau_D(m))}{A_H(c, m, \tau_P(c, m))} > \theta_P; \quad \theta_P = 0.95$$
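The agreement test and the "at least half" vote can be sketched as follows. Treating tau_P(c, m) as the lag of the unit's own autocorrelation peak within a range of plausible pitch lags is an assumption, as are the lag range and the synthetic autocorrelation functions.

```python
import math

def unit_agrees(acf, tau_d, pitch_lags, theta_p=0.95):
    """True if A_H at the dominant lag is close to the unit's own peak value."""
    peak = max(acf[lag] for lag in pitch_lags)
    return peak > 0 and acf[tau_d] / peak > theta_p

def segment_agrees(unit_acfs, tau_d, pitch_lags):
    """True if at least half of a segment's units in one frame agree."""
    votes = sum(unit_agrees(acf, tau_d, pitch_lags) for acf in unit_acfs)
    return 2 * votes >= len(unit_acfs)

pitch_lags = range(20, 101)   # assumed range of plausible pitch lags, in samples
on_pitch = [math.cos(2 * math.pi * lag / 50) for lag in range(101)]
off_pitch = [math.cos(2 * math.pi * lag / 40) for lag in range(101)]
```

With a dominant lag of 50 samples, the `on_pitch` unit agrees and the `off_pitch` unit does not, so a frame holding two of the former and one of the latter would still count toward the foreground stream.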

Illustration

Pitch Tracking
• After initial segregation we have $S_F^0$, which probably describes the target speech; to be more accurate, we repeat the same algorithm on it. We employ two constraints to check the reliability of the units in $S_F^0$
• $\tau_D(m)$ and $\tau_P(c, m)$ are now obtained from $S_F^0$
Constraint 1:
$$\frac{A_H(c, m, \tau_D(m))}{A_H(c, m, \tau_P(c, m))} > \theta_P; \quad \theta_P = 0.95$$
Constraint 2: the pitch contour changes slowly, by less than 20% from frame m to frame m+1
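Constraint 2 alone can be sketched as a search for the longest run of frames whose pitch period changes by less than 20% from one frame to the next. Measuring the change relative to the smaller of the two periods is an assumption, and `longest_streak` is a hypothetical helper; Constraint 1 is not modeled here.

```python
def longest_streak(periods, max_change=0.2):
    """Half-open (start, end) index range of the longest smoothly varying run."""
    best = (0, 0)
    start = 0
    for m in range(1, len(periods) + 1):
        # A run ends at the last frame or when the jump is 20% or more.
        broken = m == len(periods) or (
            abs(periods[m] - periods[m - 1])
            >= max_change * min(periods[m], periods[m - 1])
        )
        if broken:
            if m - start > best[1] - best[0]:
                best = (start, m)
            start = m
    return best

# Pitch periods in samples for eight frames, with one large jump after frame 2.
periods = [100, 102, 105, 160, 158, 156, 155, 154]
streak = longest_streak(periods)
```

In this toy contour the jump from 105 to 160 breaks the run, so the longest streak covers frames 3 through 7.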

Unit Labeling
• We now compute the pitch streak, which contains the pitch periods from $S_F^0$ that satisfy both constraints. Written $\tau_S(m)$, this pitch is the most reliable estimate of the target pitch
• We divide $S_F^0$ into two streams, $S_F^1$ and $S_B^1$:
  $S_F^1$: the foreground stream, whose units satisfy the criterion below
  $S_B^1$: the rest of the segments
$$\frac{A_H(c, m, \tau_S(m))}{A_H(c, m, \tau_P(c, m))} > \theta_T; \quad \theta_T = 0.85$$

Illustration

Labeling Unresolved Harmonics
• For such units, the pitch of the target speech does not correspond to the global maximum of the autocorrelation
• The filter response is strongly amplitude modulated
• The filter response envelope fluctuates at the F0 rate of the source
• We therefore define a new criterion, using:
  $\hat{r}(c, n)$: the normalized band-pass-filtered response, which carries the periodicity of the envelope

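Labeling Unresolved Harmonics (cont.)
$$\phi_{cm} = \arg\min_{\phi} \sum_{n=0}^{2T-1} \left| \hat{r}(c, mT - n) - \exp\left[ j \left( \frac{2\pi n}{\tau_S(m) f_S} + \phi \right) \right] \right|^2$$
$$\frac{\sum_{n=0}^{2T-1} \left[ \hat{r}(c, mT - n) - \cos\left( \frac{2\pi n}{\tau_S(m) f_S} + \phi_{cm} \right) \right]^2}{\sum_{n=0}^{2T-1} \hat{r}(c, mT - n)^2} < \theta_{AM}; \quad \theta_{AM} = 0.2$$
Unresolved units are labeled using the criterion above.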
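The AM criterion can be sketched as a sinusoid fit. The version below finds the phase by a plain grid search over the cosine form, which is a simplification chosen for clarity rather than the method on the slide; the period is given directly in samples (the product of pitch period and sampling rate in the formula), and the test signal is synthetic.

```python
import math

def am_error(r, period_samples, steps=360):
    """Smallest normalized squared error between r and a unit cosine of that period."""
    omega = 2 * math.pi / period_samples
    energy = sum(x * x for x in r)
    if energy == 0:
        return math.inf
    best = min(
        sum((x - math.cos(omega * n + 2 * math.pi * k / steps)) ** 2
            for n, x in enumerate(r))
        for k in range(steps)
    )
    return best / energy

# Synthetic envelope fluctuating with an 80-sample period and arbitrary phase.
r = [math.cos(2 * math.pi * n / 80 + 1.0) for n in range(320)]
matches_target = am_error(r, 80) < 0.2     # candidate period equals the true one
matches_other = am_error(r, 50) < 0.2      # candidate period is wrong
```

A unit whose envelope fluctuates at the candidate pitch period gives an error near zero and is labeled as target; a mismatched period leaves a large residual and fails the 0.2 threshold.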

Final Segregation
• Segments corresponding to unresolved harmonics are generated based on temporal continuity and cross-channel envelope correlation
• Segments are grouped again into the foreground stream $S_F^2$ and the background stream $S_B^2$
• The grouping criteria are:
  duration of no less than 50 ms
  the fundamental frequency of the unresolved harmonics is close to the target pitch

Illustration

Results and Comparison
• SNR criterion: the SNR of the signal before and after segregation (using the target speech before mixing as the reference)
• Intrusions:
  ‣ N0: 1-kHz pure tone
  ‣ N1: white noise
  ‣ N3: "cocktail party" noise
  ‣ N4: rock music
  ‣ N8: male voice
  ‣ N9: female voice
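One common way to compute such an SNR, with the clean target speech as the reference, is sketched below. The exact formula used in the comparison is not shown on the slide, so this form is an assumption, and the sample values are made up.

```python
import math

def snr_db(reference, estimate):
    """10*log10 of reference energy over the energy of (reference - estimate)."""
    signal = sum(x * x for x in reference)
    noise = sum((x - y) ** 2 for x, y in zip(reference, estimate))
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)

clean = [1.0, -1.0, 1.0, -1.0]        # target speech before mixing (toy values)
segregated = [0.9, -0.9, 0.9, -0.9]   # system output (toy values)
gain = snr_db(clean, segregated)
```

Calling the same function on the unprocessed mixture and on the segregated output gives the "before" and "after" figures that the comparison relies on.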

SNR Comparison

Ideal Binary Mask Criteria
• The ideal binary mask is constructed as follows: a T-F unit is assigned 1 if the target energy in the unit is greater than the intrusion energy, and 0 otherwise
• This mask uses a priori information about the speech and the intrusion
• We define new criteria in terms of:
  $O(n)$: speech from our system
  $I(n)$: speech from the ideal mask
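The mask construction follows directly from the rule above. In this sketch the per-unit energies are given as nested lists indexed by channel and frame, a layout chosen for illustration, and the values are made up.

```python
def ideal_binary_mask(target_energy, intrusion_energy):
    """1 where the target energy exceeds the intrusion energy, else 0."""
    return [
        [1 if t > i else 0 for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(target_energy, intrusion_energy)
    ]

# Two channels by three frames of toy energies.
target = [[4.0, 1.0, 0.5], [2.0, 2.0, 3.0]]
intrusion = [[1.0, 2.0, 0.5], [0.1, 5.0, 1.0]]
mask = ideal_binary_mask(target, intrusion)
```

Units where the two energies are equal receive 0, since the rule requires the target energy to be strictly greater.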

Ideal Binary Mask Criteria (cont.)
$e_1(n)$: signal present in $I(n)$ and missing in $O(n)$
$e_2(n)$: signal present in $O(n)$ and missing in $I(n)$
$$P_{\text{energy-loss}} = \frac{\sum_n e_1^2(n)}{\sum_n I^2(n)}; \quad P_{\text{noise-residue}} = \frac{\sum_n e_2^2(n)}{\sum_n O^2(n)}$$
These criteria are more informative about how the segregated speech differs from the original.
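The two ratios can be approximated at the level of T-F units. This sketch assumes the energy of a resynthesized signal is simply the sum of the energies of the units its mask keeps, which ignores overlap between channels and frames; it is a rough stand-in, not the resynthesis-based computation, and the function name is hypothetical.

```python
def energy_loss_and_noise_residue(ideal, estimated, unit_energy):
    """Approximate (P_energy-loss, P_noise-residue) from flat 0/1 masks."""
    units = list(zip(ideal, estimated, unit_energy))
    energy_i = sum(e for i, o, e in units if i)            # stands in for sum I^2
    energy_o = sum(e for i, o, e in units if o)            # stands in for sum O^2
    lost = sum(e for i, o, e in units if i and not o)      # stands in for sum e1^2
    residue = sum(e for i, o, e in units if o and not i)   # stands in for sum e2^2
    return lost / energy_i, residue / energy_o

ideal = [1, 1, 0, 0]
estimated = [1, 0, 1, 0]
unit_energy = [4.0, 1.0, 1.0, 2.0]
p_el, p_nr = energy_loss_and_noise_residue(ideal, estimated, unit_energy)
```

In this toy case one target unit is missed and one intrusion unit is let through, so both ratios come out at one fifth of the respective stream energy.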

Results

Results (cont.)

Summary
• The model applies different mechanisms to deal with resolved and unresolved harmonics
• The model uses temporal structure to segregate speech from a mixture (autocorrelation, pitch detection, and AM)
• The model uses an iterative process
• The "ideal mask" is a good criterion for measuring performance

My Opinion
• The results show good performance
• The algorithm could be applied iteratively to find less dominant pitches (more sources)
• The algorithm may be complex to implement
• The algorithm is designed for voiced speech
