a statistical model-based double-talk detection incorporating soft ...
[Fig. 2 panels: far-end echo signal, near-end speech signal, microphone input signal (amplitude −1 to 1), and $p(H_1|Y(i))$ (0 to 1). Time segments: noise, far-end echo, double-talk, near-end speech, noise.]
Table 1. Comparison of the ERLE and SA test results obtained from the proposed DTD algorithm based on a soft decision with those yielded by the conventional hard decision method during double-talk.

Noise          SNR (dB)   ERLE (dB)                 SA (dB)
                          Park-based   Proposed     Park-based   Proposed
White          10         3.23         2.96         1.75         1.64
White          20         3.46         3.30         1.99         1.86
White          30         3.51         3.36         2.03         1.90
Babble         10         3.36         3.23         1.81         1.71
Babble         20         3.48         3.33         1.20         1.87
Babble         30         3.52         3.36         2.03         1.90
Vehicle        10         2.95         2.84         1.56         1.47
Vehicle        20         3.40         3.25         1.96         1.83
Vehicle        30         3.50         3.35         2.02         1.90
Clean speech   ∞          3.52         3.37         2.03         1.91
Fig. 2. DTD results for the acoustic echo signal under the vehicular noise condition (SNR = 20 dB). Horizontal axis: time (sec), 0 to 4.
where

$$C(i,k) = \zeta_C\, C(i-1,k) + (1-\zeta_C)\,|X^{*}(i,k)\,Y(i,k)| \tag{14}$$

$$R(i,k) = \zeta_R\, R(i-1,k) + (1-\zeta_R)\,|X^{*}(i,k)\,X(i,k)|, \tag{15}$$

and $\zeta_C$ (= 0.998) and $\zeta_R$ (= 0.998) are smoothing parameters.
Note that this recursive update allows the estimate to track changes in the room.
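As an illustration, the recursions (14) and (15) can be sketched in a few lines of NumPy. This is only a minimal sketch: the function name `update_statistics` and the array layout (one complex value per frequency bin k) are assumptions made for the example and do not come from the paper.

```python
import numpy as np

def update_statistics(C_prev, R_prev, X, Y, zeta_c=0.998, zeta_r=0.998):
    """One step of the recursive smoothing in (14) and (15).

    C_prev, R_prev : smoothed statistics from frame i-1, shape (K,)
    X, Y           : complex far-end and microphone spectra of frame i, shape (K,)
    zeta_c, zeta_r : smoothing parameters (0.998 in the text)
    """
    cross = np.abs(np.conj(X) * Y)  # |X*(i,k) Y(i,k)|
    auto = np.abs(np.conj(X) * X)   # |X*(i,k) X(i,k)|, i.e. |X(i,k)|^2
    C = zeta_c * C_prev + (1.0 - zeta_c) * cross
    R = zeta_r * R_prev + (1.0 - zeta_r) * auto
    return C, R
```

With smoothing parameters this close to 1, each new frame contributes only 0.2% of the updated value, so the statistics follow slow changes while averaging out frame-to-frame fluctuations.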
3. PROPOSED DOUBLE-TALK DETECTION BASED AES
As noted earlier, the update of the echo path response must be frozen during double-talk. To this end, we propose a DTD technique that incorporates the newly derived GNSPP, $p(H_1|Y(i))$, together with the VAD results for the near-end and far-end signals, as shown in Fig. 1. We inherently consider near-end speech presence in the case of far-end signal presence, where the GNSPP largely determines the double-talk situation and is used to update the echo path response based on (12). Note that the VAD affects only the decisions on near-end speech presence and far-end speech presence. Specifically, we derive a novel update routine for the echo path response that uses the soft decision as follows:
$$\check{\hat{H}}(i,k) = \begin{cases} p(H_1|Y(i))\,\check{\hat{H}}(i-1,k) + \bigl(1 - p(H_1|Y(i))\bigr)\,\hat{H}_{opt}(i,k), & \text{if } I(Y(i)) = 1 \text{ and } I(X(i)) = 1 \\ \hat{H}_{opt}(i,k), & \text{otherwise} \end{cases} \tag{16}$$
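The update rule (16) can be illustrated with the short sketch below. The function name `soft_update` and its argument names are hypothetical, and the GNSPP is assumed to be available as one scalar per frame, as the notation p(H_1|Y(i)) suggests.

```python
import numpy as np

def soft_update(H_prev, H_opt, p_h1, vad_near, vad_far):
    """Soft-decision echo path update of (16) for one frame.

    H_prev   : previous estimate for frame i-1, shape (K,)
    H_opt    : current estimate H_opt(i, k), shape (K,)
    p_h1     : GNSPP p(H_1|Y(i)), a value in [0, 1] (assumed scalar per frame)
    vad_near : VAD indicator I(Y(i)), 0 or 1
    vad_far  : VAD indicator I(X(i)), 0 or 1
    """
    H_prev = np.asarray(H_prev)
    H_opt = np.asarray(H_opt)
    if vad_near == 1 and vad_far == 1:
        # Both signals active: blend according to the near-end speech probability.
        return p_h1 * H_prev + (1.0 - p_h1) * H_opt
    # Otherwise: take the current estimate unchanged.
    return H_opt.copy()
```

For example, with `p_h1 = 1.0` and both indicators equal to 1 the function returns `H_prev`, which corresponds to freezing the update, whereas `p_h1 = 0.0` returns `H_opt`.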
where $I(\cdot)$ denotes an indicator function of the VAD result provided by the IS-127 noise suppression algorithm, which is known to perform robustly under various noise environments [12]. Furthermore, we modified the VAD algorithm to reduce false decisions. For example, $I(Y(i)) = 1$ if the near-end signal $Y(i)$ is present at the $i$th frame and $I(Y(i)) = 0$ otherwise. Consequently, $\check{\hat{H}}(i,k)$ is set to $\check{\hat{H}}(i-1,k)$ (i.e., no update) within the double-talk regions in each frequency bin, and follows (12) in the case of single-talk. In particular, during abrupt transitions between double-talk and single-talk, as shown in Fig. 2, the GNSPP can take a soft value between 0 and 1. This explains why the soft decision scheme is less sensitive to detection errors than conventional hard decision methods.
Finally, we apply the proposed DTD method to the AES algorithm proposed by Faller et al. [10] as follows:
$$\hat{S}(i,k) = G(i,k)\,Y(i,k) \tag{17}$$

where the Wiener filter gain $G(i,k)$ is given by [10]

$$G(i,k) = \left[\frac{\max\bigl(|Y(i,k)| - |\hat{Y}(i,k)|,\, 0\bigr)}{|Y(i,k)|}\right]. \tag{18}$$
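A minimal sketch of (17) and (18) follows. The function names and the small constant `eps`, which guards against division by zero in silent bins, are assumptions added for the example. The echo magnitude estimate |Ŷ(i, k)| is taken as a given input here.

```python
import numpy as np

def aes_gain(Y, Y_hat_mag, eps=1e-12):
    """Gain G(i, k) of (18) for one frame.

    Y         : complex microphone spectrum, shape (K,)
    Y_hat_mag : estimated echo magnitude |Y_hat(i, k)|, shape (K,)
    eps       : hypothetical floor to avoid dividing by zero
    """
    Y_mag = np.abs(Y)
    return np.maximum(Y_mag - Y_hat_mag, 0.0) / np.maximum(Y_mag, eps)

def suppress_echo(Y, Y_hat_mag):
    """Echo-suppressed spectrum S_hat(i, k) = G(i, k) Y(i, k) of (17)."""
    return aes_gain(Y, Y_hat_mag) * Y
```

Because the gain is real and lies between 0 and 1, it scales the magnitude of each bin and leaves the phase of Y(i, k) unchanged. A bin whose estimated echo magnitude exceeds the microphone magnitude is set to zero.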
4. EXPERIMENTAL RESULTS
To verify the performance of the proposed DTD algorithm, we conducted objective comparison experiments under various noise conditions. Twenty test phrases, spoken by seven speakers and sampled at 8 kHz, were used as the experimental data. To assess the performance of the proposed method, we artificially created 20 data files, each produced by mixing the far-end signal with the near-end signal. Each frame of the windowed signal was transformed into its corresponding spectrum through a 128-point DFT after zero padding. We then constructed 16 frequency bands by combining subbands to cover the full frequency range (∼4 kHz) of the narrowband speech signal,