05.03.2015 Views

a statistical model-based double-talk detection incorporating soft ...


[Fig. 2 panels, top to bottom: far-end echo signal; near-end speech signal; microphone input signal; p(H_1|Y(i)). Segments along the time axis: noise, far-end echo, double-talk, near-end speech, noise.]

Table 1. Comparison of the ERLE and SA test results obtained from the proposed DTD algorithm based on a soft decision with those yielded by the conventional hard decision method during double-talk.

Environments                ERLE (dB)                 SA (dB)
Noise          SNR (dB)     Park-based   Proposed     Park-based   Proposed
White          10           3.23         2.96         1.75         1.64
White          20           3.46         3.30         1.99         1.86
White          30           3.51         3.36         2.03         1.90
Babble         10           3.36         3.23         1.81         1.71
Babble         20           3.48         3.33         1.20         1.87
Babble         30           3.52         3.36         2.03         1.90
Vehicle        10           2.95         2.84         1.56         1.47
Vehicle        20           3.40         3.25         1.96         1.83
Vehicle        30           3.50         3.35         2.02         1.90
Clean speech   ∞            3.52         3.37         2.03         1.91

Fig. 2. DTD results for the acoustic echo signal under the vehicular noise condition (SNR = 20 dB). Horizontal axis: time (sec), 0 to 4.

where

$$C(i,k) = \zeta_C\, C(i-1,k) + (1-\zeta_C)\,|X^{*}(i,k)\,Y(i,k)| \tag{14}$$

$$R(i,k) = \zeta_R\, R(i-1,k) + (1-\zeta_R)\,|X^{*}(i,k)\,X(i,k)|, \tag{15}$$

and $\zeta_C$ (= 0.998) and $\zeta_R$ (= 0.998) are smoothing parameters. Note that this recursive update makes it possible to track room changes.
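The recursions (14) and (15) can be sketched per frame as follows. This is only an illustrative sketch, not the authors' implementation: the function name `smooth_statistics` and the representation of one frame as a Python list of complex DFT coefficients (one entry per bin k) are assumptions made for the example.

```python
def smooth_statistics(C_prev, R_prev, X, Y, zeta_C=0.998, zeta_R=0.998):
    """Apply one frame of the first-order recursions (14) and (15).

    C_prev, R_prev: smoothed statistics from frame i-1 (one float per bin).
    X, Y: complex far-end and microphone spectra of frame i (one value per bin).
    The list-based layout and the function name are hypothetical.
    """
    C_new, R_new = [], []
    for c, r, x, y in zip(C_prev, R_prev, X, Y):
        # (14): smoothed magnitude of the cross-spectrum X*(i,k) Y(i,k)
        C_new.append(zeta_C * c + (1.0 - zeta_C) * abs(x.conjugate() * y))
        # (15): smoothed magnitude of the auto-spectrum X*(i,k) X(i,k)
        R_new.append(zeta_R * r + (1.0 - zeta_R) * abs(x.conjugate() * x))
    return C_new, R_new
```

With smoothing parameters this close to 1, each new frame contributes only a fraction 0.002 of its instantaneous value, so the statistics change slowly while still following gradual changes in the echo path.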

3. PROPOSED DOUBLE TALK DETECTION BASED AES

As noted earlier, the update of the echo path response must be frozen during double-talk. To this end, we propose a DTD technique that incorporates the newly derived GNSPP, $p(H_1 \mid Y(i))$, together with the VAD results for the near-end and far-end signals, as shown in Fig. 1. We consider near-end speech presence in the case of far-end signal presence, where the GNSPP largely determines the double-talk situation and is used to update the echo path response based on (12). Note that the VAD affects only the near-end speech presence and the far-end speech presence decisions. Specifically, we derive a novel update routine for the echo path response that uses the soft decision as follows:

$$
\check{\hat{H}}(i,k) =
\begin{cases}
p(H_1 \mid Y(i))\,\check{\hat{H}}(i-1,k) + \bigl(1 - p(H_1 \mid Y(i))\bigr)\,\hat{H}_{\mathrm{opt}}(i,k), & \text{if } I(Y(i)) = 1 \text{ and } I(X(i)) = 1 \\
\hat{H}_{\mathrm{opt}}(i,k), & \text{otherwise}
\end{cases}
\tag{16}
$$

where $I(\cdot)$ denotes an indicator function of the VAD result provided by the IS-127 noise suppression algorithm, which is known to perform robustly under various noise environments [12]. Furthermore, we modified the VAD algorithm to reduce false decisions. For example, $I(Y(i)) = 1$ if the near-end signal $Y(i)$ exists at the $i$th frame and $I(Y(i)) = 0$ otherwise. Consequently, $\check{\hat{H}}(i,k)$ is set to $\check{\hat{H}}(i-1,k)$ (i.e., no update) within the double-talk regions on each frequency bin and follows (12) in the case of single-talk. In particular, during abrupt transitions between double-talk and single-talk, as shown in Fig. 2, the GNSPP can take a soft value between 0 and 1. This explains why the soft decision scheme is less sensitive to detection errors than conventional hard decision methods.
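The soft-decision rule (16) amounts to a convex combination of the previous estimate and the newly computed one, gated by the two VAD flags. A minimal sketch is given below. It assumes that the GNSPP is a single scalar per frame and that one frame of the echo path response is stored as a list of complex values per bin; the function name `update_echo_path` and its argument names are hypothetical, and the computation of the optimal estimate via (12) is taken as given.

```python
def update_echo_path(H_prev, H_opt, p_h1, vad_near, vad_far):
    """Soft-decision echo path update following (16).

    H_prev: estimate from frame i-1, one complex value per bin k.
    H_opt:  newly computed estimate for frame i (from (12), not shown here).
    p_h1:   GNSPP p(H_1 | Y(i)), assumed to be one scalar in [0, 1] per frame.
    vad_near, vad_far: VAD indicators I(Y(i)) and I(X(i)) as booleans.
    """
    if vad_near and vad_far:
        # p_h1 near 1 (double-talk): keep the previous estimate, i.e. freeze.
        # p_h1 near 0 (single-talk): adopt the new estimate.
        return [p_h1 * hp + (1.0 - p_h1) * ho for hp, ho in zip(H_prev, H_opt)]
    # Otherwise the new estimate is used directly.
    return list(H_opt)
```

For intermediate probabilities, such as those arising at transitions between single-talk and double-talk, the estimate moves only part of the way toward the new value, which limits the damage done by a wrong hard decision.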

Finally, we apply the proposed DTD method to the AES algorithm proposed by Faller et al. [10] as follows:

$$\hat{S}(i,k) = G(i,k)\,Y(i,k) \tag{17}$$

where the Wiener filter gain $G(i,k)$ is given by [10]

$$G(i,k) = \left[\frac{\max\bigl(|Y(i,k)| - |\hat{Y}(i,k)|,\, 0\bigr)}{|Y(i,k)|}\right]. \tag{18}$$
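Equations (17) and (18) describe a magnitude-subtraction gain applied per bin. A minimal sketch follows, under these assumptions: the echo magnitude estimate $|\hat{Y}(i,k)|$ is supplied by the caller, the function name `suppress_echo` is hypothetical, and the small threshold `eps` that guards against division by zero is an addition for the example, not part of (18).

```python
def suppress_echo(Y, Y_hat_mag, eps=1e-12):
    """Apply the gain of (18) to the microphone spectrum as in (17).

    Y:         complex microphone spectrum of one frame, one value per bin k.
    Y_hat_mag: estimated echo magnitudes |Y_hat(i,k)|, one float per bin.
    eps:       assumed guard for bins where |Y(i,k)| is (near) zero.
    """
    S = []
    for y, echo_mag in zip(Y, Y_hat_mag):
        mag = abs(y)
        if mag > eps:
            # (18): subtract the echo magnitude, floor at zero, normalize.
            gain = max(mag - echo_mag, 0.0) / mag
        else:
            gain = 0.0
        # (17): the gain is real, so the phase of Y(i,k) is kept.
        S.append(gain * y)
    return S
```

Because the gain lies between 0 and 1, a bin in which the estimated echo exceeds the microphone magnitude is fully suppressed, while a bin with no estimated echo passes unchanged.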

4. EXPERIMENTAL RESULTS

In order to verify the performance of the proposed DTD algorithm, we conducted objective comparison experiments under various noise conditions. Twenty test phrases, spoken by seven speakers and sampled at 8 kHz, were used as the experimental data. For assessing the performance of the proposed method, we artificially created 20 data files, where each file was produced by mixing the far-end signal with the near-end signal. Each frame of the windowed signal was transformed into its corresponding spectrum through a 128-point DFT after zero padding. We then constructed 16 frequency bands through combination of subbands to cover all frequency ranges (∼4 kHz) of the narrow band speech signal,
