Time Difference of Arrival Estimation of Speech Source in a Noisy ...

EE049035 – Spring 2007 

Time Difference of Arrival Estimation of Speech Source in a Noisy and Reverberant Environment

Tsvi Dvorkind and Sharon Gannot 

2005 

Presented by Ronen Talmon


Outline 

Introduction 

Problem Formulation 

Ideal Model 

Reverberation Model 

Common Algorithms 

Cross Correlation Method 

Generalized Cross Correlation Method (GCC) 

Time Difference of Arrival (TDOA) Estimation 

Discussion and Proposed Extension 



Introduction 

Source Localization:

Determining the spatial position of a speaker.

Motivation:

Automated camera steering and tracking are required in video conferences.


Introduction 

Microphone arrays are used for this task

Blind problem

Dual-step approach:

Time Difference of Arrival (TDOA) estimation.

Determining the spatial position of the source.



Ideal Model 

$$z_1(t) = \alpha_1 s(t) + n_1(t)$$

$$z_2(t) = \alpha_2 s(t + \tau) + n_2(t)$$

$s(t)$, $n_1(t)$, $n_2(t)$ uncorrelated

Single-path propagation

TDOA Estimation:

Estimate $\hat{\tau}$ given $\{z_1, z_2\}$.



Reverberation Model: 

$$z_m(t) = a_m(t) * s(t) + n_m(t)$$

$a_m(t)$ - impulse response from the source to the $m$th microphone

$$n_m(t) = b_m(t) * n(t)$$

where $b_m(t)$ is the impulse response between the noise source $n(t)$ and the $m$th microphone

Multi-path propagation


Acoustic Impulse Response 

TDOA – between the direct paths

[Figure: acoustic impulse responses $a_1(t)$ and $a_2(t)$, both with T60 = 0.9 sec; horizontal axis: samples, 0 to 4000]



Cross Correlation Method

Based on the ideal model

The delay estimate is the lag time that maximizes the CC function:

$$\hat{\tau}_{CC} = \arg\max_{\tau} R_{z_1 z_2}(\tau)$$

$$R_{z_1 z_2}(\tau) = E\{ z_1(t) z_2(t - \tau) \}$$

[Figure: correlation as a function of the lag $\tau$]
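The cross-correlation estimator above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function name `cc_tdoa`, the white-noise stand-in for the source, the gains, the noise level, and the 3-sample delay are all hypothetical choices, not values from the paper. The expectation is replaced by a time average over the overlapping samples.

```python
import numpy as np

def cc_tdoa(z1, z2, max_lag):
    """Return the lag (in samples) that maximizes the sample cross-correlation
    R(lag) ~ mean over t of z1[t] * z2[t - lag]."""
    n = len(z1)
    lags = np.arange(-max_lag, max_lag + 1)
    r = np.empty(len(lags))
    for k, lag in enumerate(lags):
        if lag >= 0:
            # average of z1[t] * z2[t - lag] over t = lag .. n-1
            r[k] = np.dot(z1[lag:], z2[:n - lag]) / (n - lag)
        else:
            # average of z1[t] * z2[t - lag] over t = 0 .. n-1+lag
            r[k] = np.dot(z1[:n + lag], z2[-lag:]) / (n + lag)
    return int(lags[np.argmax(r)]), r

# Hypothetical test signals following the ideal model (all values are assumptions).
rng = np.random.default_rng(0)
n, true_delay, pad = 4000, 3, 8
s = rng.standard_normal(n + 2 * pad)                     # white stand-in for s(t)
z1 = 1.0 * s[pad:pad + n] + 0.3 * rng.standard_normal(n)                  # z1(t) = a1 s(t) + n1(t)
z2 = 0.8 * s[pad + true_delay:pad + true_delay + n] + 0.3 * rng.standard_normal(n)  # z2(t) = a2 s(t + tau) + n2(t)

tau_hat, _ = cc_tdoa(z1, z2, max_lag=5)
print(tau_hat)
```

With a single propagation path and uncorrelated noise the peak is unambiguous; the reverberant case discussed later is where this simple estimator breaks down.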


Generalized CC Method (GCC) 

Given an observation interval $T$, we have to estimate the CC function $\hat{R}_{z_1 z_2}(\tau)$:

$$\hat{R}_{z_1 z_2}(\tau) = \frac{1}{T - \tau} \int_{\tau}^{T} z_1(t) z_2(t - \tau) \, dt$$

In order to improve estimation, pre-filtering is suggested:

$$\hat{R}^{(g)}_{y_1 y_2}(\tau) = \int_{-\infty}^{\infty} \psi_g(w) \hat{P}_{z_1 z_2}(w) e^{jw\tau} \, dw$$

$\psi_g(w)$ should be chosen to ensure a sharp peak.

Knapp and Carter (1976) [2].
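As a concrete instance of the weighted form above, the sketch below uses the phase transform (PHAT) weighting, $\psi_g(w) = 1/|\hat{P}_{z_1 z_2}(w)|$, which is one of the weightings discussed by Knapp and Carter [2]. The slides do not say which weighting was used in the simulations, so PHAT is an assumption here, as are the function name `gcc_phat`, the single-frame cross-spectrum estimate, and the test signals.

```python
import numpy as np

def gcc_phat(z1, z2, max_lag):
    """GCC with PHAT weighting; returns the estimated lag and the GCC values
    for lags -max_lag..max_lag (same sign convention as E{z1(t) z2(t - tau)})."""
    n = len(z1) + len(z2)                      # zero-pad to avoid circular wrap-around
    Z1 = np.fft.rfft(z1, n)
    Z2 = np.fft.rfft(z2, n)
    cross = Z1 * np.conj(Z2)                   # single-frame cross-spectrum estimate
    cross /= np.maximum(np.abs(cross), 1e-12)  # PHAT weight: keep phase only
    r = np.fft.irfft(cross, n)
    r = np.concatenate((r[-max_lag:], r[:max_lag + 1]))  # reorder to lags -max_lag..max_lag
    lags = np.arange(-max_lag, max_lag + 1)
    return int(lags[np.argmax(r)]), r

# Hypothetical test signals (assumed values, not from the paper).
rng = np.random.default_rng(0)
n, true_delay, pad = 4000, 3, 8
s = rng.standard_normal(n + 2 * pad)
z1 = s[pad:pad + n] + 0.3 * rng.standard_normal(n)
z2 = 0.8 * s[pad + true_delay:pad + true_delay + n] + 0.3 * rng.standard_normal(n)

tau_gcc, _ = gcc_phat(z1, z2, max_lag=5)
print(tau_gcc)
```

Discarding the magnitude of the cross-spectrum is what sharpens the peak, at the cost of giving equal weight to low-SNR frequency bins.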


GCC – Simulation Results 

[Figure: three panels of GCC results, horizontal axis TDOA [samples] from -5 to 5. Panel titles: GCC (SNR = 5 [dB]; T60 = 0.11 [s]), GCC (SNR = 5 [dB]; T60 = 0.9 [s]), GCC (SNR = -5 [dB]; T60 = 0.11 [s])]


Time Difference of Arrival Estimation 

Features:

Based on the reverberation model

Assumes a speech source and exploits speech quasi-stationarity

Works in the frequency domain


ATF-s Ratio for TDOA Extraction 

Denote the Acoustic Transfer Functions (ATF-s):

$$A_m(w) = \alpha^m_{n_0} e^{-jw n_0} + \sum_{i=1}^{L} \alpha^m_{n_i} e^{-jw n_i}, \quad m = 2, \ldots, M$$

$$A_1(w) = \beta_{p_0} e^{-jw p_0} + \sum_{i=1}^{L} \beta_{p_i} e^{-jw p_i}$$

The ATF-s ratio (RTF):

$$H_m(w) = \frac{A_m(w)}{A_1(w)} = \frac{\alpha^m_{n_0} e^{-jw n_0}}{\beta_{p_0} e^{-jw p_0}} \, e_m(w);$$

where, at low reverberations, $\alpha^m_{n_0} \gg \alpha^m_{n_i}$ and $\beta_{p_0} \gg \beta_{p_i}$ for $i \neq 0$:

$$e_m(w) \approx 1$$

The peak of the corresponding $h_m(t)$ can be used to determine the TDOA.
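A small numerical sketch of this idea: build two sparse impulse responses with a dominant direct path, form the ratio of their transfer functions, and read the TDOA off the peak of the inverse transform. The tap positions, amplitudes, and FFT length are hypothetical values chosen for illustration, and the responses are assumed known here, whereas the methods in the following slides estimate $H_m(w)$ from the noisy microphone signals.

```python
import numpy as np

n_fft = 1024
p0, n0 = 10, 13                       # assumed direct-path delays (samples) to mic 1 and mic m

a1 = np.zeros(n_fft)
a1[[p0, 40, 75]] = [1.0, 0.3, 0.2]    # direct path plus two weaker reflections (assumed)
am = np.zeros(n_fft)
am[[n0, 50, 90]] = [0.9, 0.25, 0.15]

A1 = np.fft.fft(a1)
Am = np.fft.fft(am)
H = Am / A1                           # RTF: H_m(w) = A_m(w) / A_1(w)
h = np.fft.ifft(H).real               # corresponding h_m(t), real because a1 and am are real

peak = int(np.argmax(np.abs(h)))
# Indices above n_fft/2 represent negative lags of the circular inverse transform.
tdoa = peak - n_fft if peak > n_fft // 2 else peak
print(tdoa)
```

The peak sits at $n_0 - p_0$ because the direct paths dominate; with stronger reflections the term $e_m(w)$ departs from 1 and secondary peaks in $h_m(t)$ grow.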


Spectrum Analysis 

Using that the speech and noise are uncorrelated we get the PSD:

$$\Phi_{z_i z_j}(w) = A_i(w) A_j^*(w) \Phi_{ss}(w) + B_i(w) B_j^*(w) \Phi_{nn}(w)$$

The connection between $\Phi_{z_m z_1}(w)$ and $\Phi_{z_1 z_1}(w)$:

$$\Phi_{z_m z_1}(w) - H_m(w) \Phi_{z_1 z_1}(w) = \Phi^1_{b_m}(w)$$

where $\Phi^1_{b_m}(w)$ is a noise only term:

$$\Phi^1_{b_m}(w) = \left( G_m(w) - H_m(w) \right) |B_1(w)|^2 \Phi_{nn}(w); \quad G_m(w) \triangleq \frac{B_m(w)}{B_1(w)}$$

Problem: speech is non-stationary, so $\Phi_{ss}(w)$ is time-varying.
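For reference, the noise-only term can be checked directly from the PSD expression. The short derivation below uses only the definitions on these slides, $H_m(w) = A_m(w)/A_1(w)$ and $G_m(w) = B_m(w)/B_1(w)$, with the argument $w$ dropped for brevity.

```latex
\begin{aligned}
\Phi_{z_m z_1} &= A_m A_1^* \Phi_{ss} + B_m B_1^* \Phi_{nn}, \\
H_m \Phi_{z_1 z_1} &= \frac{A_m}{A_1}\left( |A_1|^2 \Phi_{ss} + |B_1|^2 \Phi_{nn} \right)
                    = A_m A_1^* \Phi_{ss} + H_m |B_1|^2 \Phi_{nn}, \\
\Phi_{z_m z_1} - H_m \Phi_{z_1 z_1} &= \left( B_m B_1^* - H_m |B_1|^2 \right) \Phi_{nn}
                    = \left( G_m - H_m \right) |B_1|^2 \Phi_{nn}.
\end{aligned}
```

The speech PSD cancels exactly, which is why the remaining term depends on the noise field only.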


Speech Quasi-Stationarity 

Consider the observation time interval of length $NP$:

The noise is stationary.

The speech statistics are changing.

By dividing the observation interval into $N$ frames of length $P$, the speech is stationary within each frame.

Assume the analysis window of length $P$ is much larger than the support of $a_m(t)$, $b_m(t)$ (MTF assumption):

$$Z_m(n, w) = A_m(w) S(n, w) + B_m(w) N(w)$$

Therefore, for each frame, $n = 1, \ldots, N$:

$$\hat{\Phi}_{z_m z_1}(n, w) = H_m(w) \hat{\Phi}_{z_1 z_1}(n, w) + \hat{\Phi}^1_{b_m}(n, w)$$


Speech Quasi-Stationarity 

Defining an error term, we get the first form of stationarity (S1):

$$\hat{\Phi}_{z_m z_1}(n, w) = H_m(w) \hat{\Phi}_{z_1 z_1}(n, w) + \Phi^1_{b_m}(w) + \xi(n, w)$$

The weighted LS solution is given by:

$$\begin{bmatrix} \hat{H}_m(w) \\ \hat{\Phi}^1_{b_m}(w) \end{bmatrix} = \left( A^H W A \right)^{-1} A^H W \hat{\Phi}_{z_m z_1}(w),$$

$$A = \begin{bmatrix} \hat{\Phi}_{z_1 z_1}(1, w) & 1 \\ \vdots & \vdots \\ \hat{\Phi}_{z_1 z_1}(N, w) & 1 \end{bmatrix}; \quad \hat{\Phi}_{z_m z_1}(w) = \begin{bmatrix} \hat{\Phi}_{z_m z_1}(1, w) \\ \vdots \\ \hat{\Phi}_{z_m z_1}(N, w) \end{bmatrix}$$

Similarly, using the connection between $\Phi_{z_m z_m}$ and $\Phi_{z_1 z_m}$, we get an estimate of $\hat{H}_m(w)$, $\hat{\Phi}^2_{b_m}(w)$ (S2).
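The weighted LS step for a single frequency bin can be sketched as follows. The per-frame PSD values are synthetic and noise-like placeholders rather than estimates from real signals, and the function name `wls_s1`, the identity default for $W$, and all numerical values are assumptions made for illustration.

```python
import numpy as np

def wls_s1(phi_z1z1, phi_zmz1, weights=None):
    """Solve [H_m, Phi_b]^T = (A^H W A)^{-1} A^H W phi_zmz1 for one frequency bin.

    phi_z1z1, phi_zmz1: length-N arrays of per-frame PSD estimates.
    weights: optional length-N diagonal of W (identity if omitted, an assumption).
    """
    n_frames = len(phi_z1z1)
    A = np.column_stack((phi_z1z1, np.ones(n_frames))).astype(complex)
    W = np.eye(n_frames) if weights is None else np.diag(weights)
    AhW = A.conj().T @ W
    theta = np.linalg.solve(AhW @ A, AhW @ phi_zmz1)
    return theta[0], theta[1]

# Synthetic data for one bin (hypothetical values).
rng = np.random.default_rng(1)
n_frames = 50
H_true = 0.7 * np.exp(-1j * 0.4)
phi_b_true = 0.2 + 0.1j
phi_z1z1 = rng.uniform(1.0, 5.0, n_frames)          # speech power varies from frame to frame
xi = 1e-3 * (rng.standard_normal(n_frames) + 1j * rng.standard_normal(n_frames))
phi_zmz1 = H_true * phi_z1z1 + phi_b_true + xi       # relation (S1) with a small error term

H_hat, phi_b_hat = wls_s1(phi_z1z1, phi_zmz1)
print(abs(H_hat - H_true), abs(phi_b_hat - phi_b_true))
```

The fit is only well conditioned because $\hat{\Phi}_{z_1 z_1}(n, w)$ changes across frames; if the speech were stationary, the two columns of $A$ would be proportional and $H_m$ could not be separated from the noise term.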


Decorrelation Criterion 

The cross PSD matrix of the 1st and $m$th microphones:

$$P(w) = \begin{bmatrix} \Phi_{z_1 z_1}(w) & \Phi_{z_1 z_m}(w) \\ \Phi_{z_m z_1}(w) & \Phi_{z_m z_m}(w) \end{bmatrix}$$

Impose the fact that the speech and noise are uncorrelated

Search for a decorrelation transformation:

$$\Lambda(w) = U(w) P(w) U^H(w); \quad U(w) = \begin{bmatrix} u_1(w) & -1 \\ -u_2(w) & 1 \end{bmatrix}$$

$$\{ u_1(w) = H_m(w), \; u_2(w) = G_m(w) \} \quad \text{or} \quad \{ u_1(w) = G_m(w), \; u_2(w) = H_m(w) \}$$
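A quick numerical check that $u_1 = H_m$, $u_2 = G_m$ diagonalizes the cross PSD matrix at one frequency. The transfer-function values and PSD levels below are arbitrary illustrative numbers (assumptions), and $P(w)$ is built from the PSD model $\Phi_{z_i z_j} = A_i A_j^* \Phi_{ss} + B_i B_j^* \Phi_{nn}$ given earlier.

```python
import numpy as np

# Arbitrary values at a single frequency (assumed for illustration).
A1, Am = 1.0 + 0.5j, 0.6 - 0.3j      # speech ATFs to mic 1 and mic m
B1, Bm = 0.8 - 0.2j, -0.4 + 0.7j     # noise ATFs to mic 1 and mic m
phi_ss, phi_nn = 2.0, 0.5            # speech and noise PSDs

A = np.array([A1, Am])
B = np.array([B1, Bm])
# P[i, j] = A_i A_j^* phi_ss + B_i B_j^* phi_nn
P = phi_ss * np.outer(A, A.conj()) + phi_nn * np.outer(B, B.conj())

Hm, Gm = Am / A1, Bm / B1
U = np.array([[Hm, -1.0],
              [-Gm, 1.0]])
Lam = U @ P @ U.conj().T

off_diag = abs(Lam[0, 1])
print(off_diag)
```

The first row of $U$ cancels the speech component and the second cancels the noise component, so the two outputs are uncorrelated and the off-diagonal entries of $\Lambda$ vanish up to rounding error.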


Decorrelation Criterion 

For low SNR we get a poor estimation of $H_m(w)$ from (S1) and (S2), but a good estimation of the noise terms:

$$\hat{G}_m(w) = \frac{\hat{\Phi}^2_{b_m}(w)}{\hat{\Phi}^1_{b_m}(w)}$$

Using the initialization $u_2(w) = \hat{G}_m(w)$, decorrelation becomes a linear set, with the LS solution (LD):

$$\hat{H}_m(w) = \left( V^H V \right)^{-1} V^H \left[ \hat{\Phi}_{z_m z_m}(w) - u_2(w) \hat{\Phi}_{z_m z_1}(w) \right]; \quad V \triangleq \hat{\Phi}_{z_1 z_m}(w) - u_2(w) \hat{\Phi}_{z_1 z_1}(w)$$


Other Algorithms 

Also in the paper: 

Iterative solution that combines decorrelation and the first form of stationarity, using Gauss iterations (GS1)

Recursive algorithms based on steepest descent, one for each batch algorithm, used for tracking moving speakers.


Simulation Results

Setup:

Room dimensions = [4, 7, 2.75]

Mic1 location = [2, 3.5, 1.375]

Mic2 location = [1.7, 3.5, 1.375]

Speech source location = [2.53, 4.03, 2.67]

Noise source location = [1.5, 4, 2.08]

Speech TDOA = 3.07 [samples]

Noise TDOA = −2.62 [samples]

[Figure: Acoustic Room Setup, showing the positions of mic1, mic2, source, and noise in the room]
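The two TDOA values in the setup follow from the geometry alone. The sketch below reproduces them under two assumptions that the slides do not state: a speed of sound of 340 m/s and a sampling rate of 8 kHz, with coordinates taken to be in meters. With these values the path-length difference (mic2 minus mic1) matches the listed TDOAs to two decimals; the sign convention is chosen to agree with the slide.

```python
import numpy as np

C = 340.0     # assumed speed of sound [m/s]
FS = 8000.0   # assumed sampling rate [Hz]

mic1 = np.array([2.0, 3.5, 1.375])
mic2 = np.array([1.7, 3.5, 1.375])
speech = np.array([2.53, 4.03, 2.67])
noise = np.array([1.5, 4.0, 2.08])

def tdoa_samples(src):
    """Direct-path TDOA in samples: (distance to mic2 - distance to mic1) * FS / C."""
    d1 = np.linalg.norm(src - mic1)
    d2 = np.linalg.norm(src - mic2)
    return (d2 - d1) * FS / C

speech_tdoa = tdoa_samples(speech)
noise_tdoa = tdoa_samples(noise)
print(round(speech_tdoa, 2), round(noise_tdoa, 2))
```

Both TDOAs are fractional and of opposite sign, so an estimator that locks onto the directional noise source instead of the speaker produces an estimate several samples away from the true value.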


SNR = 5 [dB]; T60 = 0.9 [s]; AIR length = 512; P = 256

Method   RMSE     Anomaly
GCC      0.1      89%
S1       0.1713   7%
S2       0.1883   7%
LD       0.1917   7%
GS1      0.187    7%

[Figure: one panel per method (GCC, S1, S2, LD, GS1); horizontal axis from -5 to 5]


Discussion 

First form of stationarity (S1):

$$\begin{bmatrix} \hat{H}_m(w) \\ \hat{\Phi}^1_{b_m}(w) \end{bmatrix} = \left( A^H W A \right)^{-1} A^H W \hat{\Phi}_{z_m z_1}(w)$$

Two conflicting requirements:

Frames with higher SNR – good estimation of $\hat{H}_m(w)$

Frames with lower SNR – good estimation of $\hat{\Phi}^1_{b_m}(w)$

Two approaches:

Use (S1) for $\hat{\Phi}^1_{b_m}(w)$ estimation only, as advised in the paper.

Cohen [3] has proposed to use a voice activity detector (VAD) and to separate the $\hat{\Phi}^1_{b_m}(w)$ and $\hat{H}_m(w)$ estimation.


Discussion 

Work Assumptions: 

Speech is stationary only in short periods of time.

Tracking objects requires a dynamic acoustic impulse response: the acoustic impulse response is static only for short periods of time.

Real room acoustic impulse response:

Reverberant environment (long T60)

Long impulse response – takes into account late reverberations

Using MTF approx.:

The analysis window length (time frame length) is much larger than the room acoustic impulse response.


Analysis Window Length 

Real room (long) acoustic impulse response 

Long Analysis Window (time frame) 

Speech is non-stationary (violates a fundamental assumption).

Tracking is unavailable.

MTF assumption holds.

Few observations – large estimation variance.


Analysis Window Length 

Real room (long) acoustic impulse response 

Short Analysis Window (time frame) 

Speech is stationary 

Tracking is available. 

Many observations – small estimation variance 

MTF assumption doesn’t hold (violates a fundamental assumption).
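The cost of a short analysis window can be illustrated numerically by comparing a frame of the reverberated signal with the MTF approximation $A(w) S(n, w)$. This is a rough sketch under stated assumptions: a rectangular window, a white-noise stand-in for speech, and two synthetic exponentially decaying impulse responses (8 taps and 512 taps) that are hypothetical and not taken from the paper. The frame length $P = 256$ matches the value used in the simulations.

```python
import numpy as np

def mtf_relative_error(a, s, frame_len, start):
    """Relative error of the MTF approximation Z(n, w) ~ A(w) S(n, w)
    for one rectangular-windowed frame beginning at sample `start`."""
    z = np.convolve(s, a)[:len(s)]                     # reverberated signal z = a * s
    Z = np.fft.fft(z[start:start + frame_len])
    S = np.fft.fft(s[start:start + frame_len])
    # Sample the transfer function of `a` on the frame's frequency grid,
    # even when `a` is longer than the frame.
    L = frame_len * int(np.ceil(len(a) / frame_len))
    A = np.fft.fft(a, L)[::L // frame_len]
    return np.linalg.norm(Z - A * S) / np.linalg.norm(Z)

rng = np.random.default_rng(3)
P = 256                                                # frame length
s = rng.standard_normal(8 * P)                         # white stand-in for speech
a_short = rng.standard_normal(8) * np.exp(-np.arange(8) / 3.0)        # much shorter than P
a_long = rng.standard_normal(512) * np.exp(-np.arange(512) / 150.0)   # longer than P

err_short = mtf_relative_error(a_short, s, P, start=4 * P)
err_long = mtf_relative_error(a_long, s, P, start=4 * P)
print(err_short, err_long)
```

When the impulse response is longer than the frame, much of the energy in a frame comes from source samples outside it, so the multiplicative model no longer describes the frame well.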


SNR = 5 [dB]; T60 = 0.9 [s]; AIR length = 4096; P = 256

Method   RMSE     Anomaly
GCC      ---      100%
S1       0.2326   54%
S2       0.3344   57%
LD       0.2118   39%
GS1      0.4221   57%

[Figure: one panel per method (GCC, S1, S2, LD, GS1); horizontal axis from -5 to 5]


Proposed Extension 

Using short analysis window (short frames) 

Assume the following: 

Speech is stationary in each frame. 

Static acoustic impulse response (enables tracking objects). 

Real room acoustic impulse response: 

Reverberant environment (long T60) 

Long impulse response – takes into account late reverberations 

Discard MTF assumption. 

Instead, work in STFT domain [4]. 

Combine with Cohen's method [3].


References 

[1] T. G. Dvorkind and S. Gannot, Time Difference of Arrival Estimation of Speech Source in a Noisy and Reverberant Environment, Signal Processing, vol. 85, no. 1, pp. 177-204, 2005.

[2] C. H. Knapp and G. C. Carter, The Generalized Correlation Method for Estimation of Time Delay, IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 24, no. 4, pp. 320-327, 1976.

[3] I. Cohen, Relative Transfer Function Identification Using Speech Signals, IEEE Trans. on Speech and Audio Processing, vol. 12, no. 5, 2004.

[4] Y. Avargel and I. Cohen, System Identification in the Short Time Fourier Transform Domain with Crossband Filtering, IEEE Trans. on Audio, Speech and Language Processing, in a future issue.

[5] National Institute of Standards and Technology, The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus.
