EE049035 – Spring 2007
Time Difference of Arrival Estimation of Speech Source in a Noisy and Reverberant Environment
Tsvi Dvorkind and Sharon Gannot, 2005
Presented by Ronen Talmon

Outline
  Introduction
  Problem Formulation
    Ideal Model
    Reverberation Model
  Common Algorithms
    Cross Correlation Method
    Generalized Cross Correlation Method (GCC)
  Time Difference of Arrival (TDOA) Estimation
  Discussion and Proposed Extension

Introduction
Source localization: determining the spatial position of a speaker.
Motivation: automated camera steering and tracking are required in video conferences.

Introduction
Microphone arrays are used for this task.
Blind problem: the source signal itself is not known.
Dual-step approach:
  1. Time Difference of Arrival (TDOA) estimation.
  2. Determining the spatial position of the source.

Problem Formulation
Ideal Model
$$z_1(t) = \alpha_1 s(t) + n_1(t)$$
$$z_2(t) = \alpha_2 s(t + \tau) + n_2(t)$$
$s(t)$, $n_1(t)$, $n_2(t)$ are uncorrelated.
Single-path propagation.
TDOA estimation: estimate $\hat{\tau}$ given $\{z_1, z_2\}$.

Problem Formulation
Reverberation Model:
$$z_m(t) = a_m(t) \ast s(t) + n_m(t)$$
$a_m(t)$ is the impulse response from the source to the $m$th microphone.
$$n_m(t) = b_m(t) \ast n(t)$$
where $b_m(t)$ is the impulse response between the noise $n(t)$ and the $m$th microphone.
Multi-path propagation.

Acoustic Impulse Response
The TDOA is the delay between the direct paths of the two impulse responses.
[Figure: impulse responses $a_1(t)$ and $a_2(t)$, both with T60 = 0.9 sec; horizontal axis: samples, 0 to 4000]

Cross Correlation Method
Based on the ideal model.
The delay estimate is the lag that maximizes the cross-correlation (CC) function:
$$\hat{\tau}_{CC} = \arg\max_{\tau} R_{z_1 z_2}(\tau)$$
$$R_{z_1 z_2}(\tau) = E\{ z_1(t)\, z_2(t - \tau) \}$$
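A minimal sketch of the cross-correlation estimator, assuming discrete-time signals and replacing the expectation with a sample sum. The function name `cc_tdoa`, the white-noise stand-in for speech, the gains, the noise level and the lag range are all illustrative assumptions, not values from the paper.

```python
import random

def cc_tdoa(z1, z2, max_lag):
    """Return the integer lag tau maximizing sum_t z1[t] * z2[t - tau]."""
    n = len(z1)
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Sample estimate of E{z1(t) z2(t - lag)} over the overlapping samples.
        acc = 0.0
        for t in range(max(0, lag), min(n, n + lag)):
            acc += z1[t] * z2[t - lag]
        if acc > best_val:
            best_lag, best_val = lag, acc
    return best_lag

# Toy data following the ideal model z1 = a1*s(t) + n1, z2 = a2*s(t + tau) + n2.
# White Gaussian noise stands in for speech (an assumption of this sketch).
rng = random.Random(0)
true_tau, pad, n = 3, 10, 2000
s = [rng.gauss(0.0, 1.0) for _ in range(n + 2 * pad)]
z1 = [1.0 * s[t + pad] + 0.3 * rng.gauss(0.0, 1.0) for t in range(n)]
z2 = [0.8 * s[t + pad + true_tau] + 0.3 * rng.gauss(0.0, 1.0) for t in range(n)]

tau_hat = cc_tdoa(z1, z2, max_lag=5)
print("estimated lag:", tau_hat)
```

With the sign convention of the ideal model, the correlation peaks at the lag equal to the delay applied to the second channel.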

Cross Correlation Method
[Figure: correlation as a function of the lag $\tau$]

Generalized CC Method (GCC)
Given an observation interval $T$, we have to estimate the CC function $\hat{R}_{z_1 z_2}(\tau)$:
$$\hat{R}_{z_1 z_2}(\tau) = \frac{1}{T - \tau} \int_{\tau}^{T} z_1(t)\, z_2(t - \tau)\, dt$$
In order to improve the estimation, pre-filtering is suggested:
$$\hat{R}^{(g)}_{y_1 y_2}(\tau) = \int_{-\infty}^{\infty} \psi_g(w)\, \hat{P}_{z_1 z_2}(w)\, e^{jw\tau}\, dw$$
$\psi_g(w)$ should be chosen to ensure a sharp peak.
Knapp and Carter (1976) [2].
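The weighted inverse transform can be sketched for one specific choice of $\psi_g(w)$. The example below uses the phase transform (PHAT) weighting $\psi(w) = 1/|\hat{P}_{z_1 z_2}(w)|$, one of the weightings discussed in [2]; the slides do not say which weighting was used in the simulations, so this choice is an assumption. The naive DFT, the circular shift and the signal length are simplifications made to keep the sketch dependency-free.

```python
import cmath
import math
import random

def dft(x):
    """Naive O(n^2) discrete Fourier transform (adequate for short toy signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def gcc_phat_tdoa(z1, z2, max_lag):
    """Lag maximizing the PHAT-weighted circular cross-correlation of z1 and z2."""
    n = len(z1)
    Z1, Z2 = dft(z1), dft(z2)
    # Cross-spectrum estimate, then PHAT weighting: keep only the phase.
    cross = [a * b.conjugate() for a, b in zip(Z1, Z2)]
    weighted = [p / abs(p) if abs(p) > 1e-12 else 0j for p in cross]
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Inverse transform evaluated only at the candidate lags.
        r = sum(weighted[k] * cmath.exp(2j * math.pi * k * lag / n)
                for k in range(n)).real / n
        if r > best_val:
            best_lag, best_val = lag, r
    return best_lag

# Toy data: z2(t) = s(t + tau) plus noise, using a circular shift for simplicity.
rng = random.Random(1)
n, true_tau = 128, 3
s = [rng.gauss(0.0, 1.0) for _ in range(n)]
z1 = s
z2 = [s[(t + true_tau) % n] + 0.1 * rng.gauss(0.0, 1.0) for t in range(n)]

tau_hat = gcc_phat_tdoa(z1, z2, max_lag=5)
print("estimated lag:", tau_hat)
```

Because PHAT discards the magnitude of the cross-spectrum, a pure delay yields an impulse-like correlation, which is the "sharp peak" the pre-filter is meant to produce.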

GCC – Simulation Results
[Figure: three GCC panels, horizontal axis TDOA [samples] from -5 to 5: (1) SNR = 5 [dB], T60 = 0.11 [s]; (2) SNR = 5 [dB], T60 = 0.9 [s]; (3) SNR = -5 [dB], T60 = 0.11 [s]]

Time Difference of Arrival Estimation
Features:
  Based on the reverberation model.
  Assumes a speech source and exploits the quasi-stationarity of speech.
  Works in the frequency domain.

ATF-s Ratio for TDOA Extraction
Denote the Acoustic Transfer Functions (ATF-s):
$$A_1(w) = \beta_{p_0} e^{-jwp_0} + \sum_{i=1}^{L} \beta_{p_i} e^{-jwp_i}$$
$$A_m(w) = \alpha^m_{n_0} e^{-jwn_0} + \sum_{i=1}^{L} \alpha^m_{n_i} e^{-jwn_i}, \quad m = 2, \ldots, M$$
The ATF-s ratio (RTF):
$$H_m(w) = \frac{A_m(w)}{A_1(w)} = \frac{\alpha^m_{n_0} e^{-jwn_0}}{\beta_{p_0} e^{-jwp_0}}\, e_m(w)$$
where, at low reverberation, $\alpha^m_{n_0} \gg \alpha^m_{n_i}$ and $\beta_{p_0} \gg \beta_{p_i}$ for $i \neq 0$, so that
$$e_m(w) \approx 1$$
The peak of the corresponding $h_m(t)$ can be used to determine the TDOA.
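To make the last statement concrete, consider the idealized limit in which the direct paths dominate completely, so that $e_m(w) = 1$ exactly. This limit is an assumption made here for illustration; in a real room $e_m(w)$ only approximates 1.

```latex
H_m(w) \approx \frac{\alpha^m_{n_0}}{\beta_{p_0}}\, e^{-jw(n_0 - p_0)}
\quad\Longrightarrow\quad
h_m(t) \approx \frac{\alpha^m_{n_0}}{\beta_{p_0}}\, \delta\big(t - (n_0 - p_0)\big)
```

In this limit the peak of $h_m(t)$ sits at the lag $n_0 - p_0$, the difference between the direct-path delays to microphone $m$ and to microphone 1.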

Spectrum Analysis
Using the fact that the speech and noise are uncorrelated, we get the PSD:
$$\Phi_{z_i z_j}(w) = A_i(w) A_j^*(w) \Phi_{ss}(w) + B_i(w) B_j^*(w) \Phi_{nn}(w)$$
The connection between $\Phi_{z_m z_1}(w)$ and $\Phi_{z_1 z_1}(w)$:
$$\Phi_{z_m z_1}(w) - H_m(w) \Phi_{z_1 z_1}(w) = \Phi^1_{b_m}(w)$$
where $\Phi^1_{b_m}(w)$ is a noise-only term:
$$\Phi^1_{b_m}(w) = \big(G_m(w) - H_m(w)\big) |B_1(w)|^2 \Phi_{nn}(w); \quad G_m(w) \triangleq \frac{B_m(w)}{B_1(w)}$$
Problem: speech is non-stationary, so $\Phi_{ss}(w)$ changes over time.
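The connection follows in a few lines from the PSD expression, using $H_m = A_m / A_1$ and $G_m = B_m / B_1$. The steps below are a worked sketch with the frequency argument $(w)$ dropped for brevity:

```latex
\begin{aligned}
\Phi_{z_m z_1} - H_m \Phi_{z_1 z_1}
 &= A_m A_1^* \Phi_{ss} + B_m B_1^* \Phi_{nn}
    - \frac{A_m}{A_1}\left(|A_1|^2 \Phi_{ss} + |B_1|^2 \Phi_{nn}\right) \\
 &= \left(B_m B_1^* - H_m |B_1|^2\right)\Phi_{nn} \\
 &= \left(G_m - H_m\right)|B_1|^2\, \Phi_{nn} = \Phi^1_{b_m}
\end{aligned}
```

The speech terms cancel because $\frac{A_m}{A_1}|A_1|^2 = A_m A_1^*$, which is why the remainder depends on the noise PSD only.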

Speech Quasi-Stationarity
Consider an observation interval of length $NP$:
  The noise is stationary.
  The speech statistics are changing.
Dividing the observation interval into $N$ frames of length $P$, the speech is taken to be stationary within each frame.
Assume the analysis window of length $P$ is much larger than the support of $a_m(t)$, $b_m(t)$ (MTF assumption):
$$Z_m(n, w) = A_m(w) S(n, w) + B_m(w) N(w)$$
Therefore, for each frame $n = 1, \ldots, N$:
$$\hat{\Phi}_{z_m z_1}(n, w) = H_m(w) \hat{\Phi}_{z_1 z_1}(n, w) + \hat{\Phi}^1_{b_m}(n, w)$$

Speech Quasi-Stationarity
Defining an error term, we get the first form of stationarity (S1):
$$\hat{\Phi}_{z_m z_1}(n, w) = H_m(w) \hat{\Phi}_{z_1 z_1}(n, w) + \Phi^1_{b_m}(w) + \xi(n, w)$$
The weighted LS solution is given by:
$$\begin{bmatrix} \hat{H}_m(w) \\ \hat{\Phi}^1_{b_m}(w) \end{bmatrix} = (A^H W A)^{-1} A^H W \hat{\Phi}_{z_m z_1}(w),$$
$$A = \begin{bmatrix} \hat{\Phi}_{z_1 z_1}(1, w) & 1 \\ \vdots & \vdots \\ \hat{\Phi}_{z_1 z_1}(N, w) & 1 \end{bmatrix}; \quad \hat{\Phi}_{z_m z_1}(w) = \begin{bmatrix} \hat{\Phi}_{z_m z_1}(1, w) \\ \vdots \\ \hat{\Phi}_{z_m z_1}(N, w) \end{bmatrix}$$
Similarly, using the connection between $\Phi_{z_m z_m}$ and $\Phi_{z_1 z_m}$, we get an estimate of $\hat{H}_m(w)$, $\hat{\Phi}^2_{b_m}(w)$ (S2).
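A minimal sketch of the (S1) weighted least-squares step at a single frequency bin, solving the 2x2 normal equations in closed form. The function name `wls_s1`, the default identity weighting and the synthetic PSD values are assumptions made for illustration; the slides do not specify how $W$ is chosen.

```python
def wls_s1(phi_11, phi_m1, weights=None):
    """Solve (A^H W A) x = A^H W y for x = [H_m, Phi_b] at one frequency.

    phi_11[n] is the auto-PSD estimate of microphone 1 in frame n,
    phi_m1[n] is the cross-PSD estimate between microphones m and 1,
    and row n of A is [phi_11[n], 1].
    """
    n = len(phi_11)
    w = [1.0] * n if weights is None else weights  # identity weighting by default
    # Entries of A^H W A.
    g00 = sum(wi * abs(p) ** 2 for wi, p in zip(w, phi_11))
    g01 = sum(wi * complex(p).conjugate() for wi, p in zip(w, phi_11))
    g10 = sum(wi * p for wi, p in zip(w, phi_11))
    g11 = sum(w)
    # Entries of A^H W y.
    r0 = sum(wi * complex(p).conjugate() * y for wi, p, y in zip(w, phi_11, phi_m1))
    r1 = sum(wi * y for wi, y in zip(w, phi_m1))
    # Closed-form inverse of the 2x2 system.
    det = g00 * g11 - g01 * g10
    h_hat = (g11 * r0 - g01 * r1) / det
    phi_b_hat = (g00 * r1 - g10 * r0) / det
    return h_hat, phi_b_hat

# Synthetic, noiseless example: the speech power changes from frame to frame.
h_true, phi_b_true = 0.5 - 0.2j, 0.1 + 0.05j
phi_11 = [1.0, 2.0, 0.5, 3.0, 1.5]
phi_m1 = [h_true * p + phi_b_true for p in phi_11]

h_hat, phi_b_hat = wls_s1(phi_11, phi_m1)
print(h_hat, phi_b_hat)
```

If `phi_11` were the same in every frame, the determinant would vanish and the two unknowns could not be separated. This is where the non-stationarity of speech is actually exploited.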

Decorrelation Criterion
The cross PSD matrix of the 1st and $m$th microphones:
$$P(w) = \begin{bmatrix} \Phi_{z_1 z_1}(w) & \Phi_{z_1 z_m}(w) \\ \Phi_{z_m z_1}(w) & \Phi_{z_m z_m}(w) \end{bmatrix}$$
Impose the fact that the speech and noise are uncorrelated.
Search for a decorrelating transformation, that is, one for which $\Lambda(w)$ is diagonal:
$$\Lambda(w) = U(w) P(w) U^H(w); \quad U(w) = \begin{bmatrix} u_1(w) & -1 \\ -u_2(w) & 1 \end{bmatrix}$$
Solutions:
$$\{u_1(w) = H_m(w),\ u_2(w) = G_m(w)\} \quad \text{or} \quad \{u_1(w) = G_m(w),\ u_2(w) = H_m(w)\}$$
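A small numerical check of the decorrelation property at one frequency: with $u_1 = H_m$ and $u_2 = G_m$ (or the swapped pair), the off-diagonal entries of $U P U^H$ vanish. The transfer-function values and PSD levels below are arbitrary numbers chosen for illustration, and the helper names are hypothetical.

```python
def cross_psd(a_i, a_j, b_i, b_j, phi_ss, phi_nn):
    """Phi_{z_i z_j} = A_i A_j^* Phi_ss + B_i B_j^* Phi_nn at one frequency."""
    return a_i * a_j.conjugate() * phi_ss + b_i * b_j.conjugate() * phi_nn

def transform(u1, u2, P):
    """Return Lambda = U P U^H for U = [[u1, -1], [-u2, 1]]."""
    U = [[u1, -1.0], [-u2, 1.0]]
    UP = [[sum(U[i][k] * P[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    return [[sum(UP[i][k] * complex(U[j][k]).conjugate() for k in range(2))
             for j in range(2)] for i in range(2)]

# Arbitrary single-frequency values (assumptions of this sketch).
a1, am = 1.0 + 0.2j, 0.7 - 0.4j     # speech ATFs to microphones 1 and m
b1, bm = 0.5 + 0.1j, -0.3 + 0.6j    # noise ATFs to microphones 1 and m
phi_ss, phi_nn = 2.0, 0.5

P = [[cross_psd(a1, a1, b1, b1, phi_ss, phi_nn), cross_psd(a1, am, b1, bm, phi_ss, phi_nn)],
     [cross_psd(am, a1, bm, b1, phi_ss, phi_nn), cross_psd(am, am, bm, bm, phi_ss, phi_nn)]]

H = am / a1   # speech RTF
G = bm / b1   # noise RTF

lam = transform(H, G, P)          # first solution
lam_swapped = transform(G, H, P)  # second solution
lam_wrong = transform(H, H, P)    # not a solution unless H == G
print(abs(lam[0][1]), abs(lam_swapped[0][1]), abs(lam_wrong[0][1]))
```

The reason is that $H_m z_1 - z_m$ contains noise only while $z_m - G_m z_1$ contains speech only, and the two are uncorrelated by assumption.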

Decorrelation Criterion
For low SNR we get a poor estimate of $H_m(w)$ from (S1) and (S2), but a good estimate of the noise terms:
$$\hat{G}_m(w) = \frac{\hat{\Phi}^2_{b_m}(w)}{\hat{\Phi}^1_{b_m}(w)}$$
Using the initialization $u_2(w) = \hat{G}_m(w)$, decorrelation becomes a linear set, with the LS solution (LD):
$$H_m(w) = (V^H V)^{-1} V^H \left[ \hat{\Phi}_{z_m z_m}(w) - u_2(w) \hat{\Phi}_{z_m z_1}(w) \right]; \quad V \triangleq \hat{\Phi}_{z_1 z_m}(w) - u_2(w) \hat{\Phi}_{z_1 z_1}(w)$$

Other Algorithms
Also in the paper:
  An iterative solution that combines decorrelation with the first form of stationarity, using Gauss iterations (GS1).
  Recursive algorithms based on steepest descent, one for each batch algorithm, used for tracking moving speakers.

Simulation Results
Setup:
  Room dimensions = [4, 7, 2.75]
  Mic1 location = [2, 3.5, 1.375]
  Mic2 location = [1.7, 3.5, 1.375]
  Speech source location = [2.53, 4.03, 2.67]
  Noise source location = [1.5, 4, 2.08]
  Speech TDOA = 3.07 [samples]
  Noise TDOA = −2.62 [samples]
[Figure: Acoustic Room Setup, a 3-D plot marking mic1, mic2, source and noise]
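The two TDOA values can be reproduced from the geometry alone. The sketch below assumes coordinates in metres, a speed of sound of 340 m/s and a sampling rate of 8 kHz; none of these is stated on the slide, and they were chosen here because together they match the listed figures. The TDOA is taken as the path to Mic2 minus the path to Mic1, which is the sign convention that agrees with the slide.

```python
import math

SPEED_OF_SOUND = 340.0   # m/s, assumed (not stated on the slide)
SAMPLE_RATE = 8000.0     # Hz, assumed (not stated on the slide)

def tdoa_samples(source, mic1, mic2):
    """Direct-path TDOA in samples: (distance to mic2 - distance to mic1) * fs / c."""
    d1 = math.dist(source, mic1)
    d2 = math.dist(source, mic2)
    return (d2 - d1) / SPEED_OF_SOUND * SAMPLE_RATE

mic1 = (2.0, 3.5, 1.375)
mic2 = (1.7, 3.5, 1.375)
speech = (2.53, 4.03, 2.67)
noise = (1.5, 4.0, 2.08)

speech_tdoa = tdoa_samples(speech, mic1, mic2)
noise_tdoa = tdoa_samples(noise, mic1, mic2)
print(round(speech_tdoa, 2), round(noise_tdoa, 2))
```

Both values are fractional, so an estimator restricted to integer lags can at best return the nearest sample.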

Simulation Results
SNR = 5 dB, T60 = 0.9 [s], AIR length = 512, P = 256

Method   RMSE     Anomaly
GCC      0.1      89%
S1       0.1713   7%
S2       0.1883   7%
LD       0.1917   7%
GS1      0.187    7%

[Figure: five panels (GCC, S1, S2, LD, GS1); horizontal axis: TDOA [samples], -5 to 5]

Discussion
First form of stationarity (S1):
$$\begin{bmatrix} \hat{H}_m(w) \\ \hat{\Phi}^1_{b_m}(w) \end{bmatrix} = (A^H W A)^{-1} A^H W \hat{\Phi}_{z_m z_1}(w)$$
Two conflicting requirements:
  Frames with higher SNR: good estimation of $\hat{H}_m(w)$.
  Frames with lower SNR: good estimation of $\hat{\Phi}^1_{b_m}(w)$.
Two approaches:
  Use (S1) for $\hat{\Phi}^1_{b_m}(w)$ estimation only, as advised in the paper.
  Cohen [3] has proposed to use a voice activity detector (VAD) and separate the $\hat{\Phi}^1_{b_m}(w)$ and $\hat{H}_m(w)$ estimation.

Discussion
Work assumptions:
  Speech is stationary only over short periods of time.
  Tracking objects requires a dynamic acoustic impulse response; the impulse response is taken as static only over short periods of time.
  Real room acoustic impulse response:
    Reverberant environment (long T60).
    Long impulse response that takes late reverberations into account.
  Using the MTF approximation:
    The analysis window length (time frame length) is much larger than the room acoustic impulse response.

Analysis Window Length
Real room (long) acoustic impulse response.
Long analysis window (time frame):
  The MTF assumption holds.
  Speech is non-stationary within a frame (violates a fundamental assumption).
  Tracking is unavailable.
  Few observations: large estimation variance.

Analysis Window Length
Real room (long) acoustic impulse response.
Short analysis window (time frame):
  Speech is stationary within a frame.
  Tracking is available.
  Many observations: small estimation variance.
  The MTF assumption doesn't hold (violates a fundamental assumption).
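The breakdown of the MTF approximation can be illustrated numerically by comparing the per-frame transform of a filtered signal with the product $A(w) S(n, w)$, once for an impulse response much shorter than the frame and once for one as long as the frame. Everything in this sketch is an assumption made for illustration: white noise stands in for speech, frames are rectangular and non-overlapping, the frame length is tiny, and both impulse responses are synthetic.

```python
import cmath
import math
import random

def dft(x):
    """Naive O(n^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def convolve(x, h):
    """Full linear convolution of two real sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def mtf_relative_error(h, frame_len, rng, n_frames=4):
    """Relative energy of Z(n, k) - A(k) S(n, k), accumulated over a few frames."""
    n = frame_len * (n_frames + 1)
    s = [rng.gauss(0.0, 1.0) for _ in range(n)]
    z = convolve(s, h)[:n]
    A = dft(list(h) + [0.0] * (frame_len - len(h)))  # zero-padded transfer function
    err = tot = 0.0
    for f in range(1, n_frames + 1):  # skip frame 0 so every frame has a past
        lo = f * frame_len
        S = dft(s[lo:lo + frame_len])
        Z = dft(z[lo:lo + frame_len])
        for k in range(frame_len):
            err += abs(Z[k] - A[k] * S[k]) ** 2
            tot += abs(Z[k]) ** 2
    return err / tot

frame_len = 32
short_h = [1.0, 0.5, 0.25]                     # support much shorter than the frame
long_h = [0.95 ** j for j in range(frame_len)]  # support as long as the frame

err_short = mtf_relative_error(short_h, frame_len, random.Random(0))
err_long = mtf_relative_error(long_h, frame_len, random.Random(0))
print(err_short, err_long)
```

The error comes from samples at the start of each frame that depend on the previous frame; the longer the impulse response relative to the frame, the larger the share of the frame that is affected.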

Simulation Results
SNR = 5 dB, T60 = 0.9 [s], AIR length = 4096, P = 256

Method   RMSE     Anomaly
GCC      ---      100%
S1       0.2326   54%
S2       0.3344   57%
LD       0.2118   39%
GS1      0.4221   57%

[Figure: five panels (GCC, S1, S2, LD, GS1); horizontal axis: TDOA [samples], -5 to 5]

Proposed Extension
Use a short analysis window (short frames).
Assume the following:
  Speech is stationary in each frame.
  Static acoustic impulse response (enables tracking objects).
  Real room acoustic impulse response:
    Reverberant environment (long T60).
    Long impulse response that takes late reverberations into account.
Discard the MTF assumption.
Instead, work in the STFT domain [4].
Combine with Cohen's method [3].

References
[1] T.G. Dvorkind and S. Gannot, Time Difference of Arrival Estimation of Speech Source in a Noisy and Reverberant Environment, Signal Processing, vol. 85, no. 1, pp. 177-204, 2005
[2] C.H. Knapp and G.C. Carter, The Generalized Correlation Method for Estimation of Time Delay, IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 24, no. 4, pp. 320-327, 1976
[3] I. Cohen, Relative Transfer Function Identification Using Speech Signals, IEEE Trans. on Speech and Audio Processing, vol. 12, no. 5, 2004
[4] Y. Avargel and I. Cohen, System Identification in the Short Time Fourier Transform Domain with Crossband Filtering, IEEE Trans. on Audio, Speech and Language Processing, in a future issue
[5] National Institute of Standards and Technology, The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus