
EE049035 – Spring 2007

Time Difference of Arrival Estimation of Speech Source in a Noisy and Reverberant Environment

Tsvi Dvorkind and Sharon Gannot

2005

Presented by Ronen Talmon



Outline

Introduction

Problem Formulation

Ideal Model

Reverberation Model

Common Algorithms

Cross Correlation Method

Generalized Cross Correlation Method (GCC)

Time Difference of Arrival (TDOA) Estimation

Discussion and Proposed Extension


Introduction

Source Localization:

Determining the spatial position of a speaker.

Motivation:

Automated camera steering and tracking are required in video conferences.


Introduction

Microphone arrays are used for this task

Blind problem

Dual step approach:

Time Difference of Arrival (TDOA) estimation.

Determining the spatial position of the source.


Problem Formulation

Ideal Model

$$z_1(t) = \alpha_1 s(t) + n_1(t)$$

$$z_2(t) = \alpha_2 s(t + \tau) + n_2(t)$$

$s(t)$, $n_1(t)$, $n_2(t)$ uncorrelated

Single-path propagation

TDOA Estimation:

Estimate $\hat{\tau}$ given $\{z_1, z_2\}$.
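To make the ideal model concrete, here is a minimal Python sketch that generates the two microphone signals for an integer delay. The function names, the gains, the noise level and the white-noise stand-in for speech are all assumptions made for illustration; they do not come from the slides.

```python
import numpy as np

def shift(x, d):
    """Return y[t] = x[t + d] for an integer d, zero-padded at the edges."""
    y = np.zeros_like(x)
    if d > 0:
        y[:-d] = x[d:]
    elif d < 0:
        y[-d:] = x[:d]
    else:
        y[:] = x
    return y

def simulate_ideal_model(s, tau, alpha1=1.0, alpha2=0.8, noise_std=0.1, seed=0):
    """Sampled z1(t) = a1 s(t) + n1(t) and z2(t) = a2 s(t + tau) + n2(t)."""
    rng = np.random.default_rng(seed)
    n1 = noise_std * rng.standard_normal(len(s))  # independent sensor noises
    n2 = noise_std * rng.standard_normal(len(s))
    z1 = alpha1 * s + n1
    z2 = alpha2 * shift(s, tau) + n2
    return z1, z2

# White noise used as a stand-in for a speech signal (assumption).
s = np.random.default_rng(1).standard_normal(4096)
z1, z2 = simulate_ideal_model(s, tau=3)
```

Only integer delays are handled here; a fractional delay would need interpolation, which this sketch leaves out.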


Problem Formulation

Reverberation Model:

$$z_m(t) = a_m(t) \ast s(t) + n_m(t)$$

$a_m(t)$ - impulse response from the source to the mth microphone

$$n_m(t) = b_m(t) \ast n(t)$$

where $b_m(t)$ is the impulse response between the noise $n(t)$ and the mth microphone

Multi-path propagation
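The reverberation model is two convolutions followed by a sum. A small Python sketch under assumed toy impulse responses (the tap positions and gains below are hypothetical, as is the white-noise stand-in for speech):

```python
import numpy as np

def reverberant_mic(s, n, a_m, b_m):
    """z_m(t) = a_m(t) * s(t) + b_m(t) * n(t), truncated to len(s) samples."""
    speech_part = np.convolve(a_m, s)[: len(s)]
    noise_part = np.convolve(b_m, n)[: len(s)]
    return speech_part + noise_part

# Toy impulse responses: a direct path plus a few echoes (hypothetical values).
a_m = np.zeros(64)
a_m[[5, 20, 41]] = [1.0, 0.4, -0.2]
b_m = np.zeros(64)
b_m[[8, 30]] = [0.7, 0.3]

rng = np.random.default_rng(0)
s = rng.standard_normal(1000)   # stand-in for the speech source
n = rng.standard_normal(1000)   # directional noise source
z_m = reverberant_mic(s, n, a_m, b_m)
```

Each nonzero tap of `a_m` is one propagation path, which is what "multi-path" refers to; with a single tap the model reduces to the ideal one.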


Acoustic Impulse Response

The TDOA is the delay between the direct paths of the two impulse responses.

[Figure: acoustic impulse responses $a_1(t)$ and $a_2(t)$ for T60 = 0.9 sec, plotted against samples (0 to 4000).]


Cross Correlation Method

Based on the ideal model

The delay estimate is the lag that maximizes the CC function:

$$\hat{\tau}_{CC} = \arg\max_{\tau} R_{z_1 z_2}(\tau)$$

$$R_{z_1 z_2}(\tau) = E\{ z_1(t)\, z_2(t - \tau) \}$$
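A sketch of the cross-correlation estimator on sampled signals, with the expectation replaced by a sum over the available samples. The function name, the `max_lag` search range and the synthetic test signals are assumptions for illustration.

```python
import numpy as np

def cc_tdoa(z1, z2, max_lag):
    """Return the integer lag k that maximizes sum_t z1[t] * z2[t - k]."""
    full = np.correlate(z1, z2, mode="full")      # lags -(len(z2)-1) .. len(z1)-1
    lags = np.arange(-(len(z2) - 1), len(z1))
    keep = np.abs(lags) <= max_lag                # search a plausible range only
    return lags[keep][np.argmax(full[keep])]

rng = np.random.default_rng(0)
s = rng.standard_normal(2048)                     # stand-in for speech
true_tau = 3
z1 = s[:-true_tau] + 0.1 * rng.standard_normal(2048 - true_tau)
z2 = s[true_tau:] + 0.1 * rng.standard_normal(2048 - true_tau)  # z2[t] = s[t + tau]
tau_hat = cc_tdoa(z1, z2, max_lag=10)
```

With the ideal-model convention $z_2(t) = \alpha_2 s(t + \tau) + n_2(t)$, the peak of $R_{z_1 z_2}$ falls at the true $\tau$, so `tau_hat` should match `true_tau` in this high-SNR, single-path case.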


Cross Correlation Method

[Figure: correlation plotted against the lag $\tau$.]


Generalized CC Method (GCC)

Given an observation interval $T$, we have to estimate the CC function $\hat{R}_{z_1 z_2}(\tau)$:

$$\hat{R}_{z_1 z_2}(\tau) = \frac{1}{T - \tau} \int_{\tau}^{T} z_1(t)\, z_2(t - \tau)\, dt$$

In order to improve estimation, pre-filtering is suggested:

$$\hat{R}^{(g)}_{y_1 y_2}(\tau) = \int_{-\infty}^{\infty} \psi_g(w)\, \hat{P}_{z_1 z_2}(w)\, e^{jw\tau}\, dw$$

$\psi_g(w)$ should be chosen to ensure a sharp peak.

Knapp and Carter (1976) [2].
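The slide leaves the weighting $\psi_g(w)$ open. As one possible choice, the sketch below uses the phase transform (PHAT) weighting $\psi_g(w) = 1/|\hat{P}_{z_1 z_2}(w)|$, which is one of the weightings discussed in [2]; picking it here is an assumption, not something the slides prescribe. The function name, the zero-padding length and the test signals are also illustrative.

```python
import numpy as np

def gcc_tdoa(z1, z2, max_lag, phat=True):
    """GCC delay estimate; phat=True applies psi_g(w) = 1 / |P_z1z2(w)|."""
    nfft = 2 * max(len(z1), len(z2))              # zero-pad against wrap-around
    Z1 = np.fft.rfft(z1, nfft)
    Z2 = np.fft.rfft(z2, nfft)
    P = Z1 * np.conj(Z2)                          # raw cross-spectrum estimate
    if phat:
        P = P / np.maximum(np.abs(P), 1e-12)      # keep the phase, flatten the magnitude
    r = np.fft.irfft(P, nfft)                     # r[k] ~ sum_t z1[t] * z2[t - k]
    r = np.concatenate((r[-max_lag:], r[: max_lag + 1]))  # lags -max_lag .. max_lag
    lags = np.arange(-max_lag, max_lag + 1)
    return lags[np.argmax(r)], r

rng = np.random.default_rng(0)
s = rng.standard_normal(2048)                     # stand-in for speech
z1 = s[:-3] + 0.1 * rng.standard_normal(2045)
z2 = s[3:] + 0.1 * rng.standard_normal(2045)      # z2[t] = s[t + 3]
tau_hat, r = gcc_tdoa(z1, z2, max_lag=10)
```

Setting `phat=False` gives the unweighted case $\psi_g(w) = 1$, which is the plain cross-correlation computed in the frequency domain.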


GCC – Simulation Results

[Figure: three panels over TDOA [samples] from -5 to 5: GCC (SNR = 5 [dB]; T60 = 0.11 [s]), GCC (SNR = 5 [dB]; T60 = 0.9 [s]), GCC (SNR = -5 [dB]; T60 = 0.11 [s]).]

Time Difference of Arrival Estimation

Features:

Based on the reverberation model

Assumes a speech source and exploits speech quasi-stationarity

Works in the frequency domain.


ATF-s Ratio for TDOA Extraction

Denote the Acoustic Transfer Functions (ATF-s):

$$A_m(w) = \alpha^m_{n_0} e^{-jwn_0} + \sum_{i=1}^{L} \alpha^m_{n_i} e^{-jwn_i}, \quad m = 2, \ldots, M$$

$$A_1(w) = \beta_{p_0} e^{-jwp_0} + \sum_{i=1}^{L} \beta_{p_i} e^{-jwp_i}$$

The ATF-s ratio (RTF):

$$H_m(w) = \frac{A_m(w)}{A_1(w)} = \frac{\alpha^m_{n_0} e^{-jwn_0}}{\beta_{p_0} e^{-jwp_0}}\, e_m(w);$$

where, at low reverberations, $\alpha^m_{n_0} \gg \alpha^m_{n_i}$, $\beta_{p_0} \gg \beta_{p_i}$, $i \neq 0$:

$$e_m(w) \approx 1$$

The peak of the corresponding $h_m(t)$ can be used to determine the TDOA.
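A numerical illustration of the last statement: for two toy impulse responses with a dominant direct path, the inverse DFT of the ATF ratio peaks at the difference of the direct-path delays. All tap positions and gains are hypothetical, and the DFT-based inverse gives a circular (wrapped) version of $h_m(t)$, which is a simplification.

```python
import numpy as np

N = 256
# Hypothetical impulse responses: a dominant direct path plus weak reflections.
a_1 = np.zeros(N)
a_1[[9, 40, 75]] = [1.0, 0.2, -0.1]      # direct path at p_0 = 9
a_m = np.zeros(N)
a_m[[12, 50, 90]] = [0.9, 0.15, 0.1]     # direct path at n_0 = 12

H_m = np.fft.fft(a_m) / np.fft.fft(a_1)  # ATF ratio on the DFT grid
h_m = np.fft.ifft(H_m).real              # circular impulse response of the ratio

peak = int(np.argmax(np.abs(h_m)))
tdoa = peak if peak < N // 2 else peak - N   # map wrapped indices to negative lags
print(tdoa)
```

Here the peak sits at $n_0 - p_0 = 3$ samples. The division is safe in this example because the direct-path tap of `a_1` is larger than the sum of its reflections, so its DFT has no zeros; a real room response gives no such guarantee.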


Spectrum Analysis

Using that the speech and noise are uncorrelated, we get the PSD:

$$\Phi_{z_i z_j}(w) = A_i(w) A_j^*(w) \Phi_{ss}(w) + B_i(w) B_j^*(w) \Phi_{nn}(w)$$

The connection between $\Phi_{z_m z_1}(w)$ and $\Phi_{z_1 z_1}(w)$:

$$\Phi_{z_m z_1}(w) - H_m(w) \Phi_{z_1 z_1}(w) = \Phi^1_{b_m}(w)$$

where $\Phi^1_{b_m}(w)$ is a noise only term:

$$\Phi^1_{b_m}(w) = \left( G_m(w) - H_m(w) \right) |B_1(w)|^2 \Phi_{nn}(w); \quad G_m(w) \triangleq \frac{B_m(w)}{B_1(w)}$$

Problem: speech is non-stationary, so $\Phi_{ss}(w)$ changes over time.
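The relation between the two spectra follows in a few lines from the PSD expression. Written out with the argument $w$ dropped, and using only the definitions $H_m = A_m / A_1$ and $G_m = B_m / B_1$ from these slides:

```latex
\begin{align*}
\Phi_{z_m z_1} &= A_m A_1^* \Phi_{ss} + B_m B_1^* \Phi_{nn}, \\
H_m \Phi_{z_1 z_1} &= \frac{A_m}{A_1} \left( |A_1|^2 \Phi_{ss} + |B_1|^2 \Phi_{nn} \right)
                    = A_m A_1^* \Phi_{ss} + H_m |B_1|^2 \Phi_{nn}, \\
\Phi_{z_m z_1} - H_m \Phi_{z_1 z_1} &= \left( B_m B_1^* - H_m |B_1|^2 \right) \Phi_{nn}
                    = \left( G_m - H_m \right) |B_1|^2 \Phi_{nn}.
\end{align*}
```

The last step uses $B_m B_1^* = G_m |B_1|^2$. The speech term cancels exactly, which is why the remainder depends on the noise only.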

Speech Quasi-Stationarity

Consider the observation time interval of length $NP$:

The noise is stationary

The speech statistics are changing.

By dividing the observation interval into $N$ frames of length $P$, the speech is stationary within each frame.

Assume the analysis window of length $P$ is much larger than the support of $a_m(t)$, $b_m(t)$ (MTF assumption):

$$Z_m(n, w) = A_m(w) S(n, w) + B_m(w) N(w)$$

Therefore, for each frame, $n = 1, \ldots, N$:

$$\hat{\Phi}_{z_m z_1}(n, w) = H_m(w) \hat{\Phi}_{z_1 z_1}(n, w) + \hat{\Phi}^1_{b_m}(n, w)$$
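One simple way to obtain per-frame spectral estimates is a single periodogram per non-overlapping rectangular frame. This is a sketch under that assumption; the paper's actual estimator (windowing, overlap, averaging) is not described in the slides, and the function name and test signals are illustrative.

```python
import numpy as np

def framewise_cross_psd(x, y, P):
    """Per-frame cross-PSD estimates of x and y.

    Returns an array of shape (N, P // 2 + 1); row n holds
    X(n, w) * conj(Y(n, w)) / P, a one-periodogram estimate for frame n.
    """
    N = min(len(x), len(y)) // P                   # number of whole frames
    X = np.fft.rfft(x[: N * P].reshape(N, P), axis=1)
    Y = np.fft.rfft(y[: N * P].reshape(N, P), axis=1)
    return X * np.conj(Y) / P

rng = np.random.default_rng(0)
z1 = rng.standard_normal(256 * 40)                 # stand-in microphone signals
zm = np.roll(z1, 3) + 0.1 * rng.standard_normal(z1.size)
phi_m1 = framewise_cross_psd(zm, z1, P=256)        # estimates of Phi_{z_m z_1}(n, w)
phi_11 = framewise_cross_psd(z1, z1, P=256)        # estimates of Phi_{z_1 z_1}(n, w)
```

Each row of the result is one frame $n$, so the frame-wise relation above can be stacked over rows for a fixed frequency bin.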

Speech Quasi-Stationarity

Defining an error term, we get the first form of stationarity (S1):

$$\hat{\Phi}_{z_m z_1}(n, w) = H_m(w) \hat{\Phi}_{z_1 z_1}(n, w) + \Phi^1_{b_m}(w) + \xi(n, w)$$

Weighted LS Solution is given by:

$$\begin{bmatrix} \hat{H}_m(w) \\ \hat{\Phi}^1_{b_m}(w) \end{bmatrix} = \left( A^H W A \right)^{-1} A^H W \hat{\Phi}_{z_m z_1}(w),$$

$$A = \begin{bmatrix} \hat{\Phi}_{z_1 z_1}(1, w) & 1 \\ \vdots & \vdots \\ \hat{\Phi}_{z_1 z_1}(N, w) & 1 \end{bmatrix}; \quad \hat{\Phi}_{z_m z_1}(w) = \begin{bmatrix} \hat{\Phi}_{z_m z_1}(1, w) \\ \vdots \\ \hat{\Phi}_{z_m z_1}(N, w) \end{bmatrix}$$

Similarly, using the connection between $\Phi_{z_m z_m}$ and $\Phi_{z_1 z_m}$, we get an estimate of $\hat{H}_m(w)$, $\hat{\Phi}^2_{b_m}(w)$ (S2)
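The weighted LS solution of (S1) can be computed independently for each frequency bin. A minimal sketch, assuming a diagonal weight matrix $W$ (identity by default, since the slides do not specify it) and per-frame spectral estimates stored as arrays of shape (frames, bins); the function and variable names are hypothetical.

```python
import numpy as np

def s1_wls(phi_m1, phi_11, weights=None):
    """Solve (S1) per frequency bin.

    phi_m1, phi_11: arrays of shape (N frames, K bins).
    weights: optional length-N diagonal of W (defaults to all ones).
    Returns (H_hat, phi_b_hat), each of shape (K,).
    """
    N, K = phi_m1.shape
    w = np.ones(N) if weights is None else np.asarray(weights, dtype=float)
    H_hat = np.empty(K, dtype=complex)
    phi_b_hat = np.empty(K, dtype=complex)
    for k in range(K):
        A = np.column_stack((phi_11[:, k], np.ones(N)))   # rows [Phi_z1z1(n, w), 1]
        AhW = A.conj().T * w                               # A^H W for diagonal W
        theta = np.linalg.solve(AhW @ A, AhW @ phi_m1[:, k])
        H_hat[k], phi_b_hat[k] = theta
    return H_hat, phi_b_hat

# Noise-free synthetic check: the estimator should recover the true parameters.
rng = np.random.default_rng(1)
phi_11 = rng.uniform(0.5, 2.0, size=(30, 5))               # varies across frames
H_true = rng.standard_normal(5) + 1j * rng.standard_normal(5)
b_true = rng.standard_normal(5) + 1j * rng.standard_normal(5)
phi_m1 = H_true * phi_11 + b_true
H_hat, phi_b_hat = s1_wls(phi_m1, phi_11)
```

The system is solvable only if $\hat{\Phi}_{z_1 z_1}(n, w)$ actually varies across frames; with a stationary input the two columns of $A$ become proportional, which is why the method relies on the speech statistics changing from frame to frame.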


Decorrelation Criterion

The cross PSD matrix of the 1st and mth microphones:

$$P(w) = \begin{bmatrix} \Phi_{z_1 z_1}(w) & \Phi_{z_1 z_m}(w) \\ \Phi_{z_m z_1}(w) & \Phi_{z_m z_m}(w) \end{bmatrix}$$

Impose the fact that the speech and noise are uncorrelated

Searching for a decorrelating transformation:

$$\Lambda(w) = U(w) P(w) U^H(w); \quad U(w) = \begin{bmatrix} u_1(w) & -1 \\ -u_2(w) & 1 \end{bmatrix}$$

$$\{ u_1(w) = H_m(w),\ u_2(w) = G_m(w) \}; \quad \{ u_1(w) = G_m(w),\ u_2(w) = H_m(w) \}$$
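A quick numerical check that both listed solutions make $\Lambda(w)$ diagonal, using the PSD model $\Phi_{z_i z_j} = A_i A_j^* \Phi_{ss} + B_i B_j^* \Phi_{nn}$ at a single frequency. The transfer-function values and spectral levels are random or arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_complex():
    return rng.standard_normal() + 1j * rng.standard_normal()

# Hypothetical values of the transfer functions at one frequency w.
A1, Am, B1, Bm = (random_complex() for _ in range(4))
phi_ss, phi_nn = 2.0, 0.5

a = np.array([A1, Am])
b = np.array([B1, Bm])
# P[i, j] = A_i A_j^* Phi_ss + B_i B_j^* Phi_nn
P = phi_ss * np.outer(a, a.conj()) + phi_nn * np.outer(b, b.conj())

H_m, G_m = Am / A1, Bm / B1

def decorrelate(u1, u2):
    """Return Lambda = U P U^H for U = [[u1, -1], [-u2, 1]]."""
    U = np.array([[u1, -1.0], [-u2, 1.0]])
    return U @ P @ U.conj().T

lam_a = decorrelate(H_m, G_m)   # first solution
lam_b = decorrelate(G_m, H_m)   # second solution
```

In both cases one output of $U$ contains only noise and the other only speech, so the off-diagonal entries of $\Lambda$ vanish up to rounding.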


Decorrelation Criterion

For low SNR we get a poor estimation of $H_m(w)$ from (S1) and (S2), but a good estimation of the noise terms:

$$\hat{G}_m(w) = \frac{\hat{\Phi}^2_{b_m}(w)}{\hat{\Phi}^1_{b_m}(w)}$$

Using the initialization $u_2(w) = \hat{G}_m(w)$, the decorrelation criterion becomes a set of linear equations, with the LS solution (LD):

$$H_m(w) = \left( V^H V \right)^{-1} V^H \left[ \hat{\Phi}_{z_m z_m}(w) - u_2(w) \hat{\Phi}_{z_m z_1}(w) \right];$$

$$V \triangleq \hat{\Phi}_{z_1 z_m}(w) - u_2(w) \hat{\Phi}_{z_1 z_1}(w)$$


Other Algorithms

Also in the paper:

Iterative solution that combines decorrelation and the first form of stationarity, using Gauss iterations (GS1)

Recursive algorithms based on steepest descent, for each batch algorithm, used for tracking moving speakers.


Simulation Results

Setup:

Room dimensions = [4, 7, 2.75]

Mic1 location = [2, 3.5, 1.375]

Mic2 location = [1.7, 3.5, 1.375]

Speech source location = [2.53, 4.03, 2.67]

Noise source location = [1.5, 4, 2.08]

Speech TDOA = 3.07 [samples]

Noise TDOA = −2.62 [samples]

[Figure: Acoustic Room Setup, a 3-D view of the room marking mic1, mic2, source and noise.]
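The two TDOA values follow from the geometry alone. The slide does not state the units, the sampling rate or the speed of sound; the sketch below assumes coordinates in meters, a sampling rate of 8 kHz and a speed of sound of 340 m/s, a combination that reproduces both listed values to two decimals. The sign convention (path to mic2 minus path to mic1) is likewise chosen to match the listed signs.

```python
import numpy as np

C = 340.0    # assumed speed of sound [m/s]; not stated on the slide
FS = 8000.0  # assumed sampling rate [Hz]; not stated on the slide

def geometric_tdoa(src, mic1, mic2, c=C, fs=FS):
    """Direct-path TDOA in samples: (|src - mic2| - |src - mic1|) / c * fs."""
    src, mic1, mic2 = (np.asarray(p, dtype=float) for p in (src, mic1, mic2))
    d1 = np.linalg.norm(src - mic1)
    d2 = np.linalg.norm(src - mic2)
    return (d2 - d1) / c * fs

mic1 = [2.0, 3.5, 1.375]
mic2 = [1.7, 3.5, 1.375]
speech_tdoa = geometric_tdoa([2.53, 4.03, 2.67], mic1, mic2)
noise_tdoa = geometric_tdoa([1.5, 4.0, 2.08], mic1, mic2)
print(f"speech: {speech_tdoa:.2f} samples, noise: {noise_tdoa:.2f} samples")
```

Both values are fractional, so an estimator that searches integer lags only cannot hit them exactly; this matters when interpreting the RMSE figures.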


Simulation Results

SNR = 5 [dB], T60 = 0.9 [s], AIR length = 512, P = 256

Method   RMSE     Anomaly
GCC      0.1      89%
S1       0.1713   7%
S2       0.1883   7%
LD       0.1917   7%
GS1      0.187    7%

[Figure: five panels (GCC, S1, S2, LD, GS1), each over TDOA [samples] from -5 to 5.]


Discussion

First form of stationarity (S1):

$$\begin{bmatrix} \hat{H}_m(w) \\ \hat{\Phi}^1_{b_m}(w) \end{bmatrix} = \left( A^H W A \right)^{-1} A^H W \hat{\Phi}_{z_m z_1}(w)$$

Two conflicting requirements:

Frames with higher SNR – good estimation of $\hat{H}_m(w)$

Frames with lower SNR – good estimation of $\hat{\Phi}^1_{b_m}(w)$

Two approaches:

Use (S1) for $\hat{\Phi}^1_{b_m}(w)$ estimation only, as advised in the paper.

Cohen [3] has proposed to use a voice activity detector (VAD) and to separate the $\hat{\Phi}^1_{b_m}(w)$ and $\hat{H}_m(w)$ estimation.


Discussion

Work Assumptions:

Speech is stationary only in short periods of time.

Tracking objects requires a dynamic acoustic impulse response: the acoustic impulse response is static only for short periods of time.

Real room acoustic impulse response:

Reverberant environment (long T60)

Long impulse response – takes into account late reverberations

Using MTF approx.:

The analysis window length (time frame length) is much larger than the room acoustic impulse response.


Analysis Window Length

Real room (long) acoustic impulse response

Long Analysis Window (time frame)

Speech is non-stationary (fundamental assumption).

Tracking is unavailable.

MTF assumption holds.

Few observations – large estimation variance.


Analysis Window Length

Real room (long) acoustic impulse response

Short Analysis Window (time frame)

Speech is stationary

Tracking is available.

Many observations – small estimation variance

MTF assumption doesn’t hold (fundamental assumption).


Simulation Results

SNR = 5 [dB], T60 = 0.9 [s], AIR length = 4096, P = 256

Method   RMSE     Anomaly
GCC      ---      100%
S1       0.2326   54%
S2       0.3344   57%
LD       0.2118   39%
GS1      0.4221   57%

[Figure: five panels (GCC, S1, S2, LD, GS1), each over TDOA [samples] from -5 to 5.]


Proposed Extension

Using short analysis window (short frames)

Assume the following:

Speech is stationary in each frame.

Static acoustic impulse response (enables tracking objects).

Real room acoustic impulse response:

Reverberant environment (long T60)

Long impulse response – takes into account late reverberations

Discard MTF assumption.

Instead, work in STFT domain [4].

Combine with Cohen method [3].


References

[1] T.G. Dvorkind and S. Gannot, Time Difference of Arrival Estimation of Speech Source in a Noisy and Reverberant Environment, Signal Processing, vol. 85, no. 1, pp. 177-204, 2005

[2] C.H. Knapp and G.C. Carter, The Generalized Correlation Method for Estimation of Time Delay, IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 24, no. 4, pp. 320-327, 1976

[3] I. Cohen, Relative Transfer Function Identification Using Speech Signals, IEEE Trans. on Speech and Audio Processing, vol. 12, no. 5, 2004

[4] Y. Avargel and I. Cohen, System Identification in the Short Time Fourier Transform Domain with Crossband Filtering, IEEE Trans. on Audio, Speech and Language Processing, in future issue

[5] National Institute of Standards and Technology, The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus
