11.07.2015 Views

Monaural Speech Segregation Based on Pitch Tracking and ...


Monaural Speech Segregation Based on Pitch Tracking and Amplitude Modulation
Written by: Guoning Hu and DeLiang Wang
Presented by: Eli Yazovitsky


Topics
• Introduction
• Definitions
• Assumptions and innovations
• Model Overview
• Results and Comparison
• Summary
• My Opinion


Introduction
• Speech almost always appears together with acoustic interference (the "cocktail party" problem)
• Systems that use more than one microphone: blind source separation, or a sensor array for spatial filtering
• Some systems need a monaural (one-microphone) solution
• CASA (computational auditory scene analysis) systems try to simulate the human auditory system
• CASA systems have two main stages: segmentation (analysis) and grouping (synthesis)


Definitions
• Filter bank view: an auditory band-pass filter bank decomposes the input into time-frequency (T-F) units
• The auditory filters used are "gammatone" filters:
$$g(t) = \begin{cases} t^{l-1} \exp(-2\pi b t) \cos(2\pi f t), & t \ge 0 \\ 0, & \text{else} \end{cases}$$
• Resolved and unresolved harmonics
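
The gammatone definition above can be sampled directly. The sketch below is a minimal illustration, not the presenters' implementation: the filter order `order=4`, the ERB-based bandwidth (Glasberg and Moore approximation, scaled by 1.019), the 16 kHz sampling rate, and the function name `gammatone_ir` are all assumptions that the slide does not state.

```python
import numpy as np

def gammatone_ir(fc, fs, duration=0.05, order=4, bandwidth=None):
    """Sample g(t) = t^(l-1) * exp(-2*pi*b*t) * cos(2*pi*f*t) for t >= 0.

    fc: center frequency f in Hz, fs: sampling rate in Hz.
    order (l) and the default bandwidth (b) are assumptions, see lead-in.
    """
    t = np.arange(int(round(duration * fs))) / fs
    if bandwidth is None:
        # Assumed bandwidth: 1.019 * ERB(fc), ERB(f) = 24.7 * (4.37 f / 1000 + 1)
        bandwidth = 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)
    envelope = t ** (order - 1) * np.exp(-2.0 * np.pi * bandwidth * t)
    return envelope * np.cos(2.0 * np.pi * fc * t)

# Hypothetical usage: one channel centered at 1 kHz, sampled at 16 kHz.
g = gammatone_ir(fc=1000.0, fs=16000.0)
```

Convolving the input with one such impulse response per channel gives the channel responses h(c, n) used in the later slides.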


Definitions (cont.)
• Low-order harmonics: one per auditory filter (resolved)
• High-order harmonics: two or more per auditory filter (unresolved)
• Amplitude modulation
• Autocorrelation:
$$A(c, m, \tau) = \frac{1}{N_c} \sum_{n=0}^{N_c-1} h(c, mT - n)\, h(c, mT - n - \tau)$$
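
As a rough illustration of the autocorrelation defined above, the following sketch evaluates A(c, m, τ) for a single channel response `h`. The function name, argument names, and the test signal are hypothetical; the window length `n_c` and the lag range are free parameters here because the slide does not give their values.

```python
import numpy as np

def unit_autocorrelation(h, m, T, n_c, max_lag):
    """A(c, m, tau) = (1/N_c) * sum_n h(mT - n) * h(mT - n - tau)
    for one channel response h and lags tau = 0 .. max_lag - 1."""
    end = m * T
    if end >= len(h) or end - (n_c - 1) - (max_lag - 1) < 0:
        raise ValueError("window or lag range falls outside the signal")
    idx = end - np.arange(n_c)          # sample indices mT - n
    acf = np.empty(max_lag)
    for tau in range(max_lag):
        acf[tau] = np.mean(h[idx] * h[idx - tau])
    return acf

# Hypothetical usage: a response with a period of 20 samples.
h = np.cos(2.0 * np.pi * np.arange(500) / 20.0)
acf = unit_autocorrelation(h, m=10, T=40, n_c=200, max_lag=50)
```

For a periodic response the autocorrelation peaks at lags equal to multiples of the period, which is what the pitch analysis in the later slides relies on.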


Model Overview
Signal decomposition:
• 128-channel filter bank
• Center frequencies quasi-logarithmically spaced from 80 to 5000 Hz
• Output divided into 20-ms frames with 10-ms overlap
• The output is decomposed into two-dimensional T-F units
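
The framing step can be sketched as follows: 20-ms frames with 10-ms overlap imply a 10-ms hop between frame starts. The 16 kHz sampling rate in the usage line and the helper name `frame_starts` are assumptions for illustration only.

```python
import numpy as np

def frame_starts(num_samples, fs, frame_ms=20.0, hop_ms=10.0):
    """Return the start index of every full frame and the frame length
    in samples, for 20-ms frames advanced by 10 ms (10-ms overlap)."""
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    if num_samples < frame_len:
        return np.array([], dtype=int), frame_len
    return np.arange(0, num_samples - frame_len + 1, hop), frame_len

# Hypothetical usage: one second of signal at an assumed 16 kHz rate.
starts, frame_len = frame_starts(num_samples=16000, fs=16000)
```

Each (channel, frame) pair then indexes one T-F unit.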


Model Overview (cont.)
Auditory feature extraction:
Correlogram:
$$A_H(c, m, \tau) = \frac{1}{N_c} \sum_{n=0}^{N_c-1} h(c, mT - n)\, h(c, mT - n - \tau)$$
Dominant pitch (find the lag $\tau_D(m)$ that maximizes the sum over channels):
$$s(m, \tau) = \sum_c A_H(c, m, \tau)$$
Envelope correlogram:
$$A_E(c, m, \tau) = \frac{1}{N_c} \sum_{n=0}^{N_c-1} h_E(c, mT - n)\, h_E(c, mT - n - \tau)$$
Cross-channel correlation:
$$C_H(c, m) = \sum_{\tau=0}^{L-1} A_H(c, m, \tau)\, A_H(c+1, m, \tau)$$
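
A small sketch of two of these features, given one frame of the correlogram as a (channels x lags) array. Two points are assumptions rather than something the slide states: the search for the dominant lag is restricted to a caller-supplied lag range, and the cross-channel correlation is computed on rows normalized to zero mean and unit variance and averaged over the L lags, so that its value lies in [-1, 1] and can be compared with a threshold close to 1.

```python
import numpy as np

def dominant_lag(corr, min_lag, max_lag):
    """corr: A_H(:, m, :) with shape (channels, lags).
    Returns tau_D(m), the lag in [min_lag, max_lag] that maximizes
    s(m, tau) = sum_c A_H(c, m, tau)."""
    summary = corr.sum(axis=0)
    return min_lag + int(np.argmax(summary[min_lag:max_lag + 1]))

def cross_channel_correlation(corr):
    """C_H(c, m) for each pair of adjacent channels (c, c+1).
    Assumption: rows are normalized to zero mean and unit variance
    and the product is averaged over lags."""
    z = corr - corr.mean(axis=1, keepdims=True)
    z = z / z.std(axis=1, keepdims=True)
    return np.mean(z[:-1] * z[1:], axis=1)
```

Adjacent channels that respond to the same harmonic have nearly identical autocorrelation shapes, so their normalized correlation is close to 1.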


Graphs


Initial Segregation
• Units with sufficient response energy and high cross-channel correlation are considered:
$$A_H(c, m, 0) > \theta_H^2; \quad C_H(c, m) > \theta_C; \quad \theta_H = 50; \quad \theta_C = 0.985$$
• Segments shorter than 30 ms are removed
• At this point, the segments formed from unresolved units are very small
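
The unit selection rule can be sketched as a boolean mask over T-F units. The thresholds are the ones on the slide; the array layout, the function name, and the choice to mark the top channel as not selected (it has no upper neighbor to correlate with) are assumptions made for this illustration.

```python
import numpy as np

THETA_H = 50.0     # energy threshold from the slide (compared as theta_H ** 2)
THETA_C = 0.985    # cross-channel correlation threshold from the slide

def candidate_units(energy, cross_corr):
    """energy: A_H(c, m, 0), shape (C, M).
    cross_corr: C_H(c, m) between channels c and c+1, shape (C-1, M).
    Returns a boolean (C, M) mask of units that pass both tests."""
    energy = np.asarray(energy, dtype=float)
    cross_corr = np.asarray(cross_corr, dtype=float)
    mask = np.zeros(energy.shape, dtype=bool)
    mask[:-1] = (energy[:-1] > THETA_H ** 2) & (cross_corr > THETA_C)
    return mask

# Hypothetical usage with 3 channels and 2 frames.
mask = candidate_units(
    [[3000.0, 100.0], [3000.0, 3000.0], [3000.0, 3000.0]],
    [[0.99, 0.99], [0.5, 0.99]],
)
```

Neighboring selected units would then be merged into segments, and segments shorter than 30 ms (three frame hops at a 10-ms hop) dropped; that merging step is not shown here.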


Initial Grouping
• Segments are grouped into two streams:
  $S_F^0$, which corresponds to the target speech
  $S_B^0$, which corresponds to the intrusion
• A segment in which at least half of the units agree with the dominant pitch at each frame is assigned to $S_F^0$
• A unit agrees with the dominant pitch when:
$$\frac{A_H(c, m, \tau_D(m))}{A_H(c, m, \tau_P(c, m))} > \theta_P; \quad \theta_P = 0.95$$
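
The agreement test can be sketched for a single unit as below. The slide does not define τ_P(c, m); this sketch assumes it is the lag of the largest autocorrelation value inside a plausible pitch range supplied by the caller. The segment-level vote is also simplified to one vote over all units of a segment rather than a per-frame vote, and both function names are hypothetical.

```python
import numpy as np

THETA_P = 0.95  # threshold from the slide

def agrees_with_pitch(acf, tau_d, min_lag, max_lag, theta=THETA_P):
    """acf: A_H(c, m, :) for one unit; tau_d: dominant lag tau_D(m).
    Assumption: tau_P(c, m) is the lag of the maximum of acf within
    [min_lag, max_lag], so acf at tau_P is simply that maximum."""
    peak = acf[min_lag:max_lag + 1].max()
    return bool(peak > 0 and acf[tau_d] / peak > theta)

def segment_is_foreground(agreement_flags):
    """Simplified vote: at least half of the segment's units agree."""
    return bool(np.mean(agreement_flags) >= 0.5)
```

A unit whose own strongest periodicity coincides with the dominant pitch gives a ratio near 1; a unit dominated by another source gives a clearly smaller ratio.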


Illustration


Pitch Tracking
• After initial segregation, $S_F^0$ probably describes the target speech, but to be more accurate we repeat the same algorithm. We employ constraints to check the reliability of the units in $S_F^0$
• $\tau_D(m)$ and $\tau_P(c, m)$ are now obtained from $S_F^0$
Constraint 1:
$$\frac{A_H(c, m, \tau_D(m))}{A_H(c, m, \tau_P(c, m))} > \theta_P; \quad \theta_P = 0.95$$
Constraint 2: the pitch contour changes slowly, by less than 20% from frame m to m+1
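
Constraint 2 is easy to express numerically. The sketch below flags each pair of consecutive frames whose pitch period changes by less than 20%, measuring the change relative to the earlier frame; that choice of reference, the function name, and the example values are assumptions.

```python
import numpy as np

def slow_change_flags(tau, max_rel_change=0.2):
    """tau: pitch period per frame, tau(m). Returns one flag per pair
    (m, m+1): True when |tau(m+1) - tau(m)| / tau(m) < max_rel_change."""
    tau = np.asarray(tau, dtype=float)
    rel = np.abs(np.diff(tau)) / tau[:-1]
    return rel < max_rel_change

# Hypothetical usage: the jump from 105 to 150 samples violates the constraint.
flags = slow_change_flags([100, 105, 150, 152])
```

Frames on either side of a violated pair would be treated as unreliable pitch estimates.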


Unit Labeling
• Now we compute the pitch streak $\tau_S(m)$, made up of the pitch estimates from $S_F^0$ that satisfy both constraints. This pitch is the most reliable estimate of the target pitch
• Now we divide $S_F^0$ into two streams, $S_F^1$ and $S_B^1$:
  $S_F^1$: foreground stream, whose units satisfy the equation below
  $S_B^1$: the rest of the segments
$$\frac{A_H(c, m, \tau_S(m))}{A_H(c, m, \tau_P(c, m))} > \theta_T; \quad \theta_T = 0.85$$


Illustration


Labeling Unresolved Harmonics
• The pitch of the target speech does not correspond to the global maximum of the autocorrelation of such units
• The filter response is strongly amplitude-modulated
• The filter response envelope fluctuates at the F0 rate of the source
• Now we define a new criterion:
  $\hat{r}(c, n)$: normalized band-pass function with the periodicity of the envelope


Labeling Unresolved Harmonics (cont.)
$$\phi_{cm} = \arg\min_{\phi} \sum_{n=0}^{2T-1} \left| \hat{r}(c, mT - n) - \exp\left[ j \left( \frac{2\pi n}{\tau_S(m) f_S} + \phi \right) \right] \right|^2$$
$$\frac{\sum_{n=0}^{2T-1} \left[ \hat{r}(c, mT - n) - \cos\left( \frac{2\pi n}{\tau_S(m) f_S} + \phi_{cm} \right) \right]^2}{\sum_{n=0}^{2T-1} \hat{r}^2(c, mT - n)} < \theta_{AM}^2; \quad \theta_{AM} = 0.2$$
We label unresolved units by the criterion above
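
The amplitude-modulation test can be sketched as a sinusoid fit followed by a relative-error check. This is an illustration under several assumptions: the best phase is found by a simple grid search over the cosine model rather than in closed form, the product τ_S(m) f_S is passed in directly as the target pitch period in samples, the input window is already ordered as r̂(c, mT − n) for n = 0 .. 2T − 1 and normalized to roughly unit amplitude, and the error ratio is compared with θ_AM squared.

```python
import numpy as np

THETA_AM = 0.2  # threshold from the slide

def am_error(r_hat, period_samples, num_phases=360):
    """Relative squared error between r_hat and the best-phase sinusoid
    cos(2*pi*n / period_samples + phi), phase chosen by grid search."""
    r_hat = np.asarray(r_hat, dtype=float)
    n = np.arange(len(r_hat))
    phases = np.linspace(-np.pi, np.pi, num_phases, endpoint=False)
    model = np.cos(2.0 * np.pi * n[None, :] / period_samples + phases[:, None])
    errors = np.sum((r_hat[None, :] - model) ** 2, axis=1)
    return errors.min() / np.sum(r_hat ** 2)

def is_target_am(r_hat, period_samples, theta=THETA_AM):
    """Label the unit as target when the fit error is below theta ** 2."""
    return bool(am_error(r_hat, period_samples) < theta ** 2)
```

An envelope that fluctuates at the target pitch period is fitted almost perfectly, while one that fluctuates at a different rate leaves a large residual and is rejected.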


Final Segregation
• Segments corresponding to unresolved harmonics are generated based on temporal continuity and cross-channel envelope correlation
• Segments are grouped again into a foreground stream $S_F^2$ and a background stream $S_B^2$
• The grouping criteria are:
  duration of no less than 50 ms
  the fundamental frequency of the unresolved units is close to the target pitch


Illustration


Results and Comparison
• SNR criterion: compare the signal before and after segregation (using the target speech before mixing as the reference)
• Intrusions:
‣ N0 – 1 kHz pure tone
‣ N1 – White noise
‣ N3 – "Cocktail party"
‣ N4 – Rock music
‣ N8 – Male voice
‣ N9 – Female voice
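
The slide does not spell out the SNR formula. One common form, assumed here, treats the clean target as the signal and the difference between the target and the evaluated waveform as the noise. Calling the function once with the mixture and once with the segregated output gives the "before" and "after" values.

```python
import numpy as np

def snr_db(target, estimate):
    """Assumed definition: 10 * log10( sum(target^2) / sum((target - estimate)^2) ).
    Not defined when estimate equals target exactly (zero noise energy)."""
    target = np.asarray(target, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    noise = target - estimate
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2))

# Hypothetical usage: an estimate that is 10% too quiet.
target = np.array([1.0, -1.0, 1.0, -1.0])
value = snr_db(target, 0.9 * target)
```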


SNR Comparison


Ideal Binary Mask Criteria
• The ideal binary mask is constructed as follows: a T-F unit is assigned 1 if the target energy in the unit is greater than the intrusion energy, otherwise 0
• This mask uses a priori information about the speech and the intrusion
• We define new criteria as follows:
  $O(n)$: speech from our system
  $I(n)$: speech from the ideal mask
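
The mask construction is a per-unit comparison. A minimal sketch, assuming the target and intrusion energies are available as arrays of the same (channels x frames) shape; the function name and example values are hypothetical.

```python
import numpy as np

def ideal_binary_mask(target_energy, intrusion_energy):
    """1 where the target energy in a T-F unit exceeds the intrusion
    energy, 0 otherwise (ties go to 0)."""
    target_energy = np.asarray(target_energy, dtype=float)
    intrusion_energy = np.asarray(intrusion_energy, dtype=float)
    return (target_energy > intrusion_energy).astype(int)

# Hypothetical usage with 2 channels and 2 frames.
ibm = ideal_binary_mask([[4.0, 1.0], [2.0, 2.0]], [[1.0, 3.0], [2.0, 1.0]])
```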


Ideal Binary Mask Criteria (cont.)
$e_1(n)$: signal present in $I(n)$ and missing in $O(n)$.
$e_2(n)$: signal present in $O(n)$ and missing in $I(n)$.
$$P_{\text{energy-loss}} = \frac{\sum_n e_1^2(n)}{\sum_n I^2(n)}$$
$$P_{\text{noise-residue}} = \frac{\sum_n e_2^2(n)}{\sum_n O^2(n)}$$
These criteria give better information when the segregated speech differs from the original
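
Given the four waveforms, both ratios are direct energy quotients. The sketch assumes e1 and e2 have already been resynthesized, and takes e2 to be the part of O(n) that is absent from I(n), which is what the name "noise residue" implies. The function name and the toy signals are hypothetical.

```python
import numpy as np

def energy_loss_and_noise_residue(ideal, output, e1, e2):
    """P_energy-loss = sum(e1^2) / sum(I^2): target energy the system lost.
    P_noise-residue = sum(e2^2) / sum(O^2): intrusion energy the system kept."""
    ideal, output = np.asarray(ideal, float), np.asarray(output, float)
    e1, e2 = np.asarray(e1, float), np.asarray(e2, float)
    p_el = np.sum(e1 ** 2) / np.sum(ideal ** 2)
    p_nr = np.sum(e2 ** 2) / np.sum(output ** 2)
    return p_el, p_nr

# Toy usage: the output misses one ideal sample and adds one spurious sample.
p_el, p_nr = energy_loss_and_noise_residue(
    ideal=[1, 1, 1, 0], output=[1, 1, 0, 1], e1=[0, 0, 1, 0], e2=[0, 0, 0, 1]
)
```

Both values are 0 for an output identical to the ideal-mask signal and grow toward 1 as more energy is lost or more intrusion remains.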


Results


Results (cont.)


Summary
• The model applies different mechanisms to deal with resolved and unresolved harmonics
• The model uses temporal structure to segregate speech from the mixture (autocorrelation, pitch detection and AM)
• The model uses an iterative process
• The ideal binary mask is a good criterion for measuring performance


My Opinion
• The results show good performance
• We can use this algorithm iteratively to find less dominant pitches (more sources)
• This algorithm may be complex to implement
• This algorithm is designed for voiced speech
