a statistical model-based double-talk detection incorporating soft ...
of the far-end signal as shown in Fig. 1. The background
noise is not taken into account since we assume that near-end
speech absence is not correlated with the background noise.
Under the assumption that $D(i,k)$ and $S(i,k)$ are characterized
by separate zero-mean complex Gaussian distributions,
the following is obtained [7]:
$$p(Y(i,k) \mid H_0) = \frac{1}{\pi\lambda_d(i,k)} \exp\left[-\frac{|Y(i,k)|^2}{\lambda_d(i,k)}\right] \qquad (2)$$
$$p(Y(i,k) \mid H_1) = \frac{1}{\pi(\lambda_s(i,k)+\lambda_d(i,k))} \exp\left[-\frac{|Y(i,k)|^2}{\lambda_s(i,k)+\lambda_d(i,k)}\right] \qquad (3)$$
where $\lambda_s(i,k)$ and $\lambda_d(i,k)$ are the variances of the near-end
speech and the estimated echo, respectively. Accordingly, the
GNSPP $p(H_1 \mid \mathbf{Y}(i))$ is derived from Bayes' rule, such that
[7]:
$$p(H_1 \mid \mathbf{Y}(i)) = \frac{p(\mathbf{Y}(i) \mid H_1)\,p(H_1)}{p(\mathbf{Y}(i) \mid H_0)\,p(H_0) + p(\mathbf{Y}(i) \mid H_1)\,p(H_1)} \qquad (4)$$
where $p(H_0)$ $(= 1 - p(H_1))$ represents the a priori probability
of near-end speech absence. Since the spectral component in
each frequency bin is assumed to be statistically independent,
(4) can be rewritten as [7]
$$p(H_1 \mid \mathbf{Y}(i)) = \frac{p(H_1)\prod_{k=1}^{M} p(Y(i,k) \mid H_1)}{p(H_0)\prod_{k=1}^{M} p(Y(i,k) \mid H_0) + p(H_1)\prod_{k=1}^{M} p(Y(i,k) \mid H_1)} = \frac{q\prod_{k=1}^{M} \Lambda_k(Y(i,k))}{1 + q\prod_{k=1}^{M} \Lambda_k(Y(i,k))} \qquad (5)$$
in which $q = p(H_1)/p(H_0)$ (= 1) is set from a rough estimate of
the ratio between the presence and absence durations of
near-end speech, and $\Lambda_k(Y(i,k))$
is the likelihood ratio computed in the $k$th frequency bin, as
given by [7]:
$$\Lambda_k(Y(i,k)) = \frac{p(Y(i,k) \mid H_1)}{p(Y(i,k) \mid H_0)} = \frac{1}{1+\xi(i,k)} \exp\left[\frac{\gamma(i,k)\,\xi(i,k)}{1+\xi(i,k)}\right] \qquad (6)$$
where the a posteriori signal-to-echo ratio (SER) $\gamma(i,k)$ and
the a priori SER $\xi(i,k)$ are defined by
$$\gamma(i,k) \equiv \frac{|Y(i,k)|^2}{\lambda_d(i,k)} \qquad (7)$$
$$\xi(i,k) \equiv \frac{\lambda_s(i,k)}{\lambda_d(i,k)} \qquad (8)$$
where $\lambda_d(i,k)$ is estimated by $\hat{\lambda}_d(i,k)$. The power spectrum
of the echo signal is obtained in the absence of the
near-end speech signal, as given by
$$\hat{\lambda}_d(i,k) = \zeta_D\,\hat{\lambda}_d(i-1,k) + (1-\zeta_D)\,|\hat{Y}(i,k)|^2 \qquad (9)$$
in which $\zeta_D$ (= 0.93) is the smoothing parameter. Also, in (8),
$\xi(i,k)$ is estimated with the well-known decision-directed
approach with $\alpha_{DD} = 0.6$ [11]:
$$\hat{\xi}(i,k) = \alpha_{DD}\,\frac{|\hat{S}(i-1,k)|^2}{\lambda_d(i-1,k)} + (1-\alpha_{DD})\,P\{\gamma(i,k)-1\} \qquad (10)$$
where $P\{z\} = z$ if $z \ge 0$, and $P\{z\} = 0$ otherwise. As
specified in (9), the robust estimation of the echo magnitude
spectrum $\hat{Y}(i,k)$ plays an essential role in the performance.
In our approach, we follow the parameter estimation procedure
proposed in [10] as follows:
$$|\hat{Y}(i,k)| = \hat{H}(i,k)\,|X(i,k)| \qquad (11)$$
where $\hat{H}(i,k)$ is the estimate of the echo path response, mimicking
the actual echo path. Specifically, $\hat{H}_{opt}(i,k)$ is obtained
based on the magnitude of the least squares estimator
as follows [10]:
$$\hat{H}_{opt}(i,k) = \left|\frac{E[X^{*}(i,k)\,Y(i,k)]}{E[X^{*}(i,k)\,X(i,k)]}\right| \qquad (12)$$
where $*$ denotes the complex conjugate and $E[\cdot]$ represents
the expected value. Note that there exists some delay between
the far-end speech $X(i,k)$ and the microphone input signal
$Y(i,k)$ (due to a digital amplifier, for example). In our approach, it is
assumed that the echo time delay is separately estimated and
compensated (i.e., no delay) at the near-end. Since the echo
path is time varying, the estimated echo path response $\hat{H}(i,k)$
is obtained using an iterative procedure such that [10]
$$\hat{H}(i,k) = \frac{C(i,k)}{R(i,k)} \qquad (13)$$
Fig. 1. Block diagram of the proposed DTD algorithm. (Labels in the diagram: Microphone, Near-end, Send path, Loudspeaker, Far-end, Receive path, DFT, IDFT, VAD, Decision, GNSPP, DTD, SER Estimation, Echo Path Response Estimation, x, G.)
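As an illustration of how (7) and (9)-(11) chain together for a single frequency bin, the following minimal Python sketch may help. It is not the authors' implementation. The function name `update_ser`, its argument names, the small `floor` guarding against division by zero, and the choice to hold the echo power unchanged when near-end speech is present are all assumptions made for this example. The echo path estimate $\hat{H}(i,k)$ and the previous near-end speech estimate $|\hat{S}(i-1,k)|$ are taken as given inputs.

```python
def update_ser(x_mag, y_mag, h_hat, lam_d_prev, s_prev_mag,
               near_end_absent=True, zeta_d=0.93, alpha_dd=0.6,
               floor=1e-12):
    """One-bin update of the echo power and the two SERs (hypothetical helper).

    x_mag       -- |X(i,k)|, far-end magnitude
    y_mag       -- |Y(i,k)|, microphone magnitude
    h_hat       -- echo path response estimate H^(i,k), assumed given
    lam_d_prev  -- echo power estimate from frame i-1
    s_prev_mag  -- |S^(i-1,k)|, previous near-end speech estimate, assumed given
    """
    # Eq. (11): estimated echo magnitude.
    echo_mag = h_hat * x_mag

    # Eq. (9): recursive smoothing of the echo power. The text says it is
    # obtained when near-end speech is absent; holding the previous value
    # otherwise is an assumption of this sketch.
    if near_end_absent:
        lam_d = zeta_d * lam_d_prev + (1.0 - zeta_d) * echo_mag ** 2
    else:
        lam_d = lam_d_prev

    # Eq. (7): a posteriori SER (floor added here to avoid dividing by zero).
    gamma = y_mag ** 2 / max(lam_d, floor)

    # Eq. (10): decision-directed a priori SER, with P{z} = max(z, 0).
    xi = (alpha_dd * s_prev_mag ** 2 / max(lam_d_prev, floor)
          + (1.0 - alpha_dd) * max(gamma - 1.0, 0.0))

    return lam_d, gamma, xi


if __name__ == "__main__":
    lam_d, gamma, xi = update_ser(x_mag=2.0, y_mag=2.0, h_hat=0.5,
                                  lam_d_prev=1.0, s_prev_mag=1.0)
    print(lam_d, gamma, xi)
```

The defaults for `zeta_d` and `alpha_dd` are the values quoted in the text (0.93 and 0.6); everything else is illustrative.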
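Once $\gamma(i,k)$ and $\xi(i,k)$ are available for every bin, (5) and (6) can be evaluated directly. The sketch below is one possible way to do so, not the authors' code. It works in the log domain, a design choice of this example, because a product of $M$ likelihood ratios can easily overflow or underflow in floating point. The function names `log_likelihood_ratio` and `gnspp` are hypothetical.

```python
import math


def log_likelihood_ratio(gamma, xi):
    """Natural log of Eq. (6): log(1/(1+xi)) + gamma*xi/(1+xi)."""
    return gamma * xi / (1.0 + xi) - math.log1p(xi)


def gnspp(gammas, xis, q=1.0):
    """Eq. (5): p(H1 | Y(i)) = q*prod(Lambda_k) / (1 + q*prod(Lambda_k)).

    gammas, xis -- per-bin a posteriori and a priori SERs for one frame
    q           -- prior ratio p(H1)/p(H0); the text uses q = 1
    """
    # Sum of logs replaces the product over the M bins (log-domain choice
    # made in this sketch for numerical safety).
    log_odds = math.log(q) + sum(
        log_likelihood_ratio(g, x) for g, x in zip(gammas, xis)
    )
    # Numerically stable evaluation of z / (1 + z) with z = exp(log_odds).
    if log_odds >= 0.0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1.0 + z)


if __name__ == "__main__":
    # Three bins with illustrative SER values.
    print(gnspp([2.0, 1.5, 3.0], [1.0, 0.5, 2.0]))
```

With $\xi(i,k) = 0$ in every bin, each likelihood ratio equals 1 and the result reduces to $q/(1+q)$, which is 0.5 for the $q = 1$ used in the text.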