05.03.2015 Views

a statistical model-based double-talk detection incorporating soft ...

a statistical model-based double-talk detection incorporating soft ...

a statistical model-based double-talk detection incorporating soft ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

of the far-end signal as shown in Fig. 1. The background<br />

noise is not taken into account since we assume that near-end<br />

speech absence is not correlated with the background noise.<br />

Under the assumption that D(i, k) and S(i, k) are characterized<br />

by separate zero-mean complex Gaussian distributions,<br />

the following is obtained [7].<br />

p(Y (i, k)|H 0 ) =<br />

p(Y (i, k)|H 1 ) =<br />

1<br />

[<br />

πλ d (i, k) exp −<br />

|Y (i,<br />

]<br />

k)|2<br />

λ d (i, k)<br />

1<br />

π(λ s (i, k)+λ d (i, k))<br />

[<br />

|Y (i, k)| 2 ]<br />

exp −<br />

λ s (i, k)+λ d (i, k)<br />

where λ s (i, k) and λ d (i, k) are the variance of the near-end<br />

speech and estimated echo, respectively. Accordingly, the<br />

GNSPP p(H 1 |Y(i)) is derived from Bayes’ rule, such that<br />

[7]:<br />

p(Y(i)|H 1 )p(H 1 )<br />

p(H 1 |Y(i)) =<br />

p(Y(i)|H 0 )p(H 0 )+p(Y(i)|H 1 )p(H 1 )<br />

(4)<br />

where p(H 0 )(= 1−p(H 1 )) represents the a priori probability<br />

of near-end speech absence. Since the spectral component in<br />

each frequency bin is assumed to be <strong>statistical</strong>ly independent,<br />

(4) can be rewritten as [7]<br />

p(H 1 |Y(i)) (5)<br />

M∏<br />

p(H 1 ) p(Y (i, k)|H 1 )<br />

k=1<br />

=<br />

M∏<br />

M∏<br />

p(H 0 ) p(Y (i, k)|H 0 )+p(H 1 ) p(Y (i, k)|H 1 )<br />

k=1<br />

M∏<br />

q Λ k (Y (i, k))<br />

k=1<br />

=<br />

M∏<br />

1+q Λ k (Y (i, k))<br />

k=1<br />

k=1<br />

in which q = p(H 1 )/p(H 0 )(= 1) which is determined by<br />

the rough estimate of the ratio of absence time duration and<br />

presence time duration for near-end speech and Λ k (Y (i, k))<br />

is the likelihood ratio computed in the kth frequency bin, as<br />

given by [7]:<br />

Λ k (Y (i, k)) = p(Y (i, k)|H 1)<br />

(6)<br />

p(Y (i, k)|H 0 )<br />

1<br />

[ γ(i, k)ξ(i, k)<br />

]<br />

=<br />

1+ξ(i, k) exp 1+ξ(i, k)<br />

where the a posteriori signal-to-echo ratio (SER) γ(i, k) and<br />

the a priori SER ξ(i, k) are defined by<br />

(2)<br />

(3)<br />

Microphone<br />

Near-end<br />

Loudspeaker<br />

DFT<br />

IDFT<br />

VAD<br />

Decision<br />

GNSPP<br />

DTD<br />

SER Esimation<br />

Echo Path Response<br />

Estimation<br />

x<br />

G<br />

Send path<br />

IDFT<br />

DFT<br />

Far-end<br />

Receive path<br />

Fig. 1. Block diagram of the proposed DTD algorithm.<br />

ξ(i, k) ≡ λ s(i, k)<br />

(8)<br />

λ d (i, k)<br />

where λ d (i, k) is estimated by ˆλ d (i, k). The power spectrum<br />

of the echo signal is obtained in the case of the absence of the<br />

near-end speech signal, as given by<br />

ˆλ d (i, k) =ζ Dˆλd (i − 1,k)+(1− ζ D )|Ŷ (i, k)|2 (9)<br />

in which ζ D (= 0.93) is the smoothing parameter. Also, in (8),<br />

ξ(i, k) is estimated with the help of the well-known decisiondirected<br />

approach with α DD =0.6 [11]. Then,<br />

|Ŝ(i − 1,k)|2<br />

ˆξ(i, k) =α DD<br />

λ d (i − 1,k) +(1− α DD)P {γ(i, k) − 1}<br />

(10)<br />

where P {z} = z if z ≥ 0, and P {z} =0otherwise. As<br />

specified in (9), the robust estimation of the echo magnitude<br />

spectrum Ŷ (i, k) plays an essential role in the performance.<br />

In our approach, we follow the parameter estimation procedure<br />

proposed in [10] as follows:<br />

|Ŷ (i, k)| = Ĥ(i, k)|X(i, k)| (11)<br />

where Ĥ(i, k) is the estimate for the echo path response mimicking<br />

the actual echo path. Specifically, Ĥ opt (i, k) is obtained<br />

<strong>based</strong> on the magnitude of the least squares estimator<br />

as follows [10]:<br />

Ĥ opt (i, k) =<br />

E[X ∗ (i, k)Y (i, k)]<br />

∣E[X ∗ (i, k)X(i, k)] ∣ (12)<br />

where ∗ denotes the complex conjugate and E[·] represents<br />

the expected value. Note that there exist some delay between<br />

the far-end speech X(i, k) and the microphone input signal<br />

Y (i, k) (due to a digital amplifier, e.g.). In our approach, it is<br />

assumed that the echo time-delay is separately estimated and<br />

compensated (i.e., no delay) at the near-end. Since the echo<br />

path is time varying, the estimated echo path response Ĥ(i, k)<br />

is obtained using the iterative procedure such that[10]<br />

γ(i, k) ≡<br />

|Y (i, k)|2<br />

λ d (i, k)<br />

(7)<br />

Ĥ(i, k) =<br />

C(i, k)<br />

R(i, k)<br />

(13)<br />

5083

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!