05.03.2015 Views

a statistical model-based double-talk detection incorporating soft ...

a statistical model-based double-talk detection incorporating soft ...

a statistical model-based double-talk detection incorporating soft ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

A STATISTICAL MODEL-BASED DOUBLE-TALK DETECTION<br />

INCORPORATING SOFT DECISION<br />

Yun-Sik Park, Ji-Hyun Song, Sang-Ick Kang, Woojung Lee and Joon-Hyuk Chang<br />

School of Electronic Engineering<br />

Inha University, Incheon, Korea<br />

E-mail: [yspark, jhsong, sikang, wjlee]@dsp.inha.ac.kr, changjh@inha.ac.kr<br />

ABSTRACT<br />

In this paper, we propose a novel <strong>double</strong>-<strong>talk</strong> <strong>detection</strong><br />

(DTD) technique <strong>based</strong> on a <strong>soft</strong> decision in the frequency<br />

domain. The proposed method provides an efficient procedure<br />

to detect the <strong>double</strong>-<strong>talk</strong> situation by the use of the<br />

global near-end speech presence probability (GNSPP) and<br />

voice activity <strong>detection</strong> (VAD) of the near-end and far-end<br />

signal. Specifically, the GNSPP is derived <strong>based</strong> on a <strong>statistical</strong><br />

method of speech and is employed to determine the<br />

<strong>double</strong>-<strong>talk</strong> presence in a given frame. The performance of<br />

our approach is evaluated by objective tests under different<br />

environments, and it is found that the suggested method yields<br />

better results compared with the conventional scheme.<br />

Index Terms— Double-Talk Detection, Speech Presence<br />

Probability, Voice Activity Detection<br />

1. INTRODUCTION<br />

In most hands-free mobile communication systems, since<br />

the loudspeaker and microphone are acoustically coupled,<br />

acoustic echoes occur. In efforts to address this problem,<br />

numerous acoustic echo cancellation (AEC) techniques <strong>incorporating</strong><br />

an adaptive filter such as the least mean square<br />

(LMS) and normalized LMS (NLMS) have been reported [1]-<br />

[3]. One of the major problems of AEC techniques, however,<br />

is that the performance significantly degrades during<br />

the <strong>double</strong>-<strong>talk</strong> periods, in which signals from both the nearend<br />

and far-end coexist because the <strong>double</strong>-<strong>talk</strong> acts as very<br />

large interference to the adaptive filter. The problem can be<br />

alleviated by freezing the adaptive filter coefficients through<br />

the use of a <strong>double</strong> <strong>talk</strong> <strong>detection</strong> (DTD) algorithm [4]. In<br />

this regard, many studies have been dedicated to the problem<br />

of DTD. In practice, cross-correlation and coherence-<strong>based</strong><br />

algorithms are relevant, as they present straightforward approaches.<br />

Adopting hard decisions [4]-[6], these schemes<br />

classify each frame into one of two (i.e., <strong>double</strong>-<strong>talk</strong> or not)<br />

This work was supported by National Research Foundation of Korea<br />

(NRF) grant funded by the Korean Government (MEST) (NRF-2009-<br />

0072319) and This work was supported by the IT R&D program of<br />

MKE/KEIT. [2008-F-045-02].<br />

cases by comparing decision statistics and given threshold<br />

values. However, they are sensitive to optimized parameters<br />

and do not always provide reliable performance under various<br />

conditions.<br />

In this paper, we propose a novel DTD algorithm <strong>based</strong><br />

on a global <strong>soft</strong> decision [7], where the term ‘global’ means<br />

that DTD is performed globally in a given frame and ‘<strong>soft</strong> decision’<br />

[8], [9] denotes that the probability of <strong>double</strong>-<strong>talk</strong> is<br />

introduced as a decision and is applied to update the adaptive<br />

filter in the acoustic echo suppressor (AES) algorithm [10].<br />

Specifically, the global near-end speech presence probability<br />

(GNSPP) <strong>based</strong> on a <strong>statistical</strong> <strong>model</strong> is computed in each<br />

frame to apply the proposed DTD algorithm in conjunction<br />

with results of voice activity <strong>detection</strong> (VAD) of the near-end<br />

and far-end signal. It is worth noting that our approach provides<br />

for the first time an effective framework of DTD <strong>based</strong><br />

on a <strong>soft</strong> decision by taking advantage of a <strong>statistical</strong> <strong>model</strong>,<br />

in contrast with the conventional hard decision-<strong>based</strong> method.<br />

The performance of the proposed algorithm is evaluated by<br />

echo return loss enhancement (ERLE) and speech attenuation<br />

(SA) tests during <strong>double</strong>-<strong>talk</strong> and is demonstrated to be better<br />

than that of the conventional method.<br />

2. GLOBAL NEAR-END SPEECH PRESENCE<br />

PROBABILITY<br />

In this section, we consider how to derive the global nearend<br />

speech presence probability (GNSPP) in the frequency<br />

domain. To this end, we first assume that two hypotheses,<br />

H 0 and H 1 , indicate near-end speech absence and presence<br />

as follows:<br />

H 0 : near-end speech absent : Y(i) =D(i) (1)<br />

: near-end speech present : Y(i) =D(i)+S(i)<br />

H 1<br />

where D(i) = [D(i, 1),D(i, 2),...,D(i, M)], S(i) =<br />

[S(i, 1),S(i, 2),...,S(i, M)] and Y(i) =[Y (i, 1),Y(i, 2),<br />

...,Y(i, M)], respectively, represent the Fourier domain<br />

spectra of the echo signal, the near-end speech and the microphone<br />

input signal with a frame index i. Also, X(i) =<br />

[X(i, 1),X(i, 2),...,X(i, M)] denote the Fourier spectrum<br />

978-1-4244-4296-6/10/$25.00 ©2010 IEEE 5082<br />

ICASSP 2010


of the far-end signal as shown in Fig. 1. The background<br />

noise is not taken into account since we assume that near-end<br />

speech absence is not correlated with the background noise.<br />

Under the assumption that D(i, k) and S(i, k) are characterized<br />

by separate zero-mean complex Gaussian distributions,<br />

the following is obtained [7].<br />

p(Y (i, k)|H 0 ) =<br />

p(Y (i, k)|H 1 ) =<br />

1<br />

[<br />

πλ d (i, k) exp −<br />

|Y (i,<br />

]<br />

k)|2<br />

λ d (i, k)<br />

1<br />

π(λ s (i, k)+λ d (i, k))<br />

[<br />

|Y (i, k)| 2 ]<br />

exp −<br />

λ s (i, k)+λ d (i, k)<br />

where λ s (i, k) and λ d (i, k) are the variance of the near-end<br />

speech and estimated echo, respectively. Accordingly, the<br />

GNSPP p(H 1 |Y(i)) is derived from Bayes’ rule, such that<br />

[7]:<br />

p(Y(i)|H 1 )p(H 1 )<br />

p(H 1 |Y(i)) =<br />

p(Y(i)|H 0 )p(H 0 )+p(Y(i)|H 1 )p(H 1 )<br />

(4)<br />

where p(H 0 )(= 1−p(H 1 )) represents the a priori probability<br />

of near-end speech absence. Since the spectral component in<br />

each frequency bin is assumed to be <strong>statistical</strong>ly independent,<br />

(4) can be rewritten as [7]<br />

p(H 1 |Y(i)) (5)<br />

M∏<br />

p(H 1 ) p(Y (i, k)|H 1 )<br />

k=1<br />

=<br />

M∏<br />

M∏<br />

p(H 0 ) p(Y (i, k)|H 0 )+p(H 1 ) p(Y (i, k)|H 1 )<br />

k=1<br />

M∏<br />

q Λ k (Y (i, k))<br />

k=1<br />

=<br />

M∏<br />

1+q Λ k (Y (i, k))<br />

k=1<br />

k=1<br />

in which q = p(H 1 )/p(H 0 )(= 1) which is determined by<br />

the rough estimate of the ratio of absence time duration and<br />

presence time duration for near-end speech and Λ k (Y (i, k))<br />

is the likelihood ratio computed in the kth frequency bin, as<br />

given by [7]:<br />

Λ k (Y (i, k)) = p(Y (i, k)|H 1)<br />

(6)<br />

p(Y (i, k)|H 0 )<br />

1<br />

[ γ(i, k)ξ(i, k)<br />

]<br />

=<br />

1+ξ(i, k) exp 1+ξ(i, k)<br />

where the a posteriori signal-to-echo ratio (SER) γ(i, k) and<br />

the a priori SER ξ(i, k) are defined by<br />

(2)<br />

(3)<br />

Microphone<br />

Near-end<br />

Loudspeaker<br />

DFT<br />

IDFT<br />

VAD<br />

Decision<br />

GNSPP<br />

DTD<br />

SER Esimation<br />

Echo Path Response<br />

Estimation<br />

x<br />

G<br />

Send path<br />

IDFT<br />

DFT<br />

Far-end<br />

Receive path<br />

Fig. 1. Block diagram of the proposed DTD algorithm.<br />

ξ(i, k) ≡ λ s(i, k)<br />

(8)<br />

λ d (i, k)<br />

where λ d (i, k) is estimated by ˆλ d (i, k). The power spectrum<br />

of the echo signal is obtained in the case of the absence of the<br />

near-end speech signal, as given by<br />

ˆλ d (i, k) =ζ Dˆλd (i − 1,k)+(1− ζ D )|Ŷ (i, k)|2 (9)<br />

in which ζ D (= 0.93) is the smoothing parameter. Also, in (8),<br />

ξ(i, k) is estimated with the help of the well-known decisiondirected<br />

approach with α DD =0.6 [11]. Then,<br />

|Ŝ(i − 1,k)|2<br />

ˆξ(i, k) =α DD<br />

λ d (i − 1,k) +(1− α DD)P {γ(i, k) − 1}<br />

(10)<br />

where P {z} = z if z ≥ 0, and P {z} =0otherwise. As<br />

specified in (9), the robust estimation of the echo magnitude<br />

spectrum Ŷ (i, k) plays an essential role in the performance.<br />

In our approach, we follow the parameter estimation procedure<br />

proposed in [10] as follows:<br />

|Ŷ (i, k)| = Ĥ(i, k)|X(i, k)| (11)<br />

where Ĥ(i, k) is the estimate for the echo path response mimicking<br />

the actual echo path. Specifically, Ĥ opt (i, k) is obtained<br />

<strong>based</strong> on the magnitude of the least squares estimator<br />

as follows [10]:<br />

Ĥ opt (i, k) =<br />

E[X ∗ (i, k)Y (i, k)]<br />

∣E[X ∗ (i, k)X(i, k)] ∣ (12)<br />

where ∗ denotes the complex conjugate and E[·] represents<br />

the expected value. Note that there exist some delay between<br />

the far-end speech X(i, k) and the microphone input signal<br />

Y (i, k) (due to a digital amplifier, e.g.). In our approach, it is<br />

assumed that the echo time-delay is separately estimated and<br />

compensated (i.e., no delay) at the near-end. Since the echo<br />

path is time varying, the estimated echo path response Ĥ(i, k)<br />

is obtained using the iterative procedure such that[10]<br />

γ(i, k) ≡<br />

|Y (i, k)|2<br />

λ d (i, k)<br />

(7)<br />

Ĥ(i, k) =<br />

C(i, k)<br />

R(i, k)<br />

(13)<br />

5083


Far−end Echo<br />

Signal<br />

Near−end Speech<br />

Signal<br />

Microphone Input<br />

Signal<br />

p(H 1<br />

|Y(i))<br />

1<br />

0<br />

−1<br />

1<br />

0<br />

−1<br />

1<br />

0<br />

−1<br />

1.0<br />

0.5<br />

0.0<br />

Noise Far−end echo Double−Talk Near−end Speech Noise<br />

Table 1. ERLE and SA test results obtained from the proposed<br />

DTD algorithm <strong>based</strong> on a <strong>soft</strong> decision with those<br />

yielded by the conventional hard decision method during<br />

<strong>double</strong>-<strong>talk</strong>.<br />

Environments ERLE (dB) SA (dB)<br />

Noise SNR (dB) Park-<strong>based</strong> proposed Park-<strong>based</strong> proposed<br />

10 3.23 2.96 1.75 1.64<br />

White 20 3.46 3.30 1.99 1.86<br />

30 3.51 3.36 2.03 1.90<br />

10 3.36 3.23 1.81 1.71<br />

Babble 20 3.48 3.33 1.20 1.87<br />

30 3.52 3.36 2.03 1.90<br />

10 2.95 2.84 1.56 1.47<br />

Vehicle 20 3.40 3.25 1.96 1.83<br />

30 3.50 3.35 2.02 1.90<br />

Clean speech ∞ 3.52 3.37 2.03 1.91<br />

0 1 2 3 4<br />

Time (sec)<br />

Fig. 2. DTD results for the acoustic echo signal under the<br />

vehicular noise condition (SNR=20 dB).<br />

where<br />

C(i, k) =ζ C C(i − 1,k)+(1− ζ C )|X ∗ (i, k)Y (i, k)| (14)<br />

R(i, k) =ζ R R(i − 1,k)+(1− ζ R )|X ∗ (i, k)X(i, k)|, (15)<br />

and ζ C (= 0.998) and ζ R (= 0.998) are smoothing parameters.<br />

Note that this update iteration achieves the room change<br />

tracking.<br />

3. PROPOSED DOUBLE TALK DETECTION<br />

BASED AES<br />

As noted earlier, the update of the echo path response must<br />

be frozen in the case of the <strong>double</strong>-<strong>talk</strong>. For this, we propose<br />

the DTD technique to incorporate the newly derived GNSPP,<br />

p(H 1 |Y(i)), with the help of the VAD results of the near-end<br />

and far-end signal, as shown in Fig. 1. We inherently consider<br />

the near-end speech presence in the case of far-end signal<br />

presence, where the GNSPP substantially determine the<br />

<strong>double</strong>-<strong>talk</strong> situation and is used to update the echo path response<br />

<strong>based</strong> on (12). Note that the VAD has an impact on<br />

the near-end speech presence and the far-end speech presence<br />

only. Specifically, we derive a novel update routine of the<br />

echo path response by utilizing the <strong>soft</strong> decision as follows:<br />

⎧<br />

⎪⎨<br />

ˇĤ(i, k) =<br />

p(H 1 |Y(i)) ˇĤ(i − 1,k)<br />

+(1 − p(H 1 |Y(i)))Ĥopt(i, k)<br />

, if I(Y(i)) = 1 and I(X(i)) = 1<br />

⎪⎩<br />

Ĥ opt (i, k), otherwise<br />

(16)<br />

where I(·) denotes an indicator function of the VAD result<br />

provided by the IS-127 noise suppression algorithm since it<br />

is known that it gives us a robust performance under various<br />

noise environments [12]. Furthermore, we modified the<br />

VAD algorithm to reduce the false decisions. For example,<br />

I(Y(i)) = 1 if the near-end signal Y(i) exists at the ith frame<br />

and I(Y(i)) = 0 otherwise. Therefore, the update of ˇĤ(i, k)<br />

is finally addressed such that<br />

ˇĤ(i, k) replaces ˇĤ(i − 1,k)<br />

(i.e., no update) within the <strong>double</strong>-<strong>talk</strong> regions on each frequency<br />

bin and (12) in the case of single-<strong>talk</strong>. In particular, in<br />

the case of abrupt transient periods between <strong>double</strong>-<strong>talk</strong> and<br />

single-<strong>talk</strong>, as shown in Fig. 2, the GNSPP could be a <strong>soft</strong><br />

value between 0 and 1. This accounts for why the <strong>soft</strong> decision<br />

scheme is more insensitive to <strong>detection</strong> error compared<br />

to conventional hard decision methods.<br />

Based on this proposed DTD method, we finally apply it to<br />

the AES algorithm proposed by Faller et al. [10] as follows:<br />

Ŝ(i, k) =G(i, k)Y (i, k) (17)<br />

where the Wiener filter gain G(i, k) is given by [10]<br />

[ max(|Y (i, k)|−| Ŷ (i, k)|, 0)<br />

]<br />

G(i, k) =<br />

. (18)<br />

|Y (i, k)|<br />

4. EXPERIMENTAL RESULTS<br />

In order to verify the performance of the proposed DTD algorithm,<br />

we conducted objective comparison experiments under<br />

various noise conditions. Twenty test phrases, spoken by<br />

seven speakers and sampled at 8 kHz, were used as the experimental<br />

data. For assessing the performance of the proposed<br />

method, we artificially created 20 data files, where<br />

each file was produced by mixing the far-end signal with the<br />

near-end signal. Each frame of the windowed signal was<br />

transformed into its corresponding spectrum through a 128-<br />

point DFT after zero padding. We then constructed 16 frequency<br />

bands through combination of subbands to cover all<br />

frequency ranges (∼4 kHz) of the narrow band speech signal,<br />

5084


which is analogous to that of the IS-127 noise suppression algorithm<br />

[12]. The far-end speech signal was passed through<br />

a filter simulating the acoustic echo path <strong>model</strong>ed by a timeinvariant<br />

FIR filter <strong>based</strong> on the analysis of room acoustics<br />

before being mixed electrically [13], [14]. The simulation environment<br />

was designed to fit a small office room having a<br />

size of 5×4×3 m 3 . The echo level measured at the input microphone<br />

was 3.5 dB lower than that of the input near-end<br />

speech on average. In order to create noisy conditions, white,<br />

babble and vehicular noises from the NOISEX-92 database<br />

were added to clean near-end speech signals at signal-to-noise<br />

ratios (SNRs) of 10, 20, and 30 dB. For the purpose of an objective<br />

comparison, we evaluated the performance of the proposed<br />

scheme and that of the conventional DTD algorithm<br />

proposed by Park et al. [6], wherein the cross-correlation<br />

coefficients-<strong>based</strong> <strong>double</strong>-<strong>talk</strong> <strong>detection</strong> method is used. The<br />

performances of the approaches were measured in terms of<br />

echo return loss enhancement (ERLE) and speech attenuation<br />

(SA) during <strong>double</strong>-<strong>talk</strong> [14].<br />

Given the three types of noise environments, the ERLE and<br />

SAs scores were averaged to give final mean score results,<br />

as presented in Table 1. From Table 1, it is evident that in<br />

most noisy conditions, the proposed DTD algorithm <strong>based</strong> on<br />

a <strong>soft</strong> decision yielded a lower ERLE compared to the hard<br />

decision-<strong>based</strong> conventional technique. Also, from Table 2,<br />

we can observe that the SAs (measured during <strong>double</strong>-<strong>talk</strong> periods)<br />

of the proposed scheme <strong>based</strong> on a <strong>soft</strong> decision were<br />

better than those of the previous method [6] in all the tested<br />

conditions. It is noted that the performance gain of the proposed<br />

method becomes smaller as the SNR becomes lower.<br />

This is attributed to imperfection of the GNSPP under adverse<br />

noise conditions. Summarizing the overall results, the<br />

proposed approach is found to be effective in the AES technique.<br />

5. CONCLUSIONS<br />

In this paper, we have proposed a novel DTD algorithm<br />

<strong>based</strong> on a <strong>soft</strong> decision scheme in the frequency domain.<br />

The GNSPP <strong>based</strong> on a <strong>statistical</strong> <strong>model</strong> of the near-end and<br />

far-end signal is applied to the DTD algorithm in conjunction<br />

with VAD decisions for effective echo suppression. The<br />

performance of the proposed algorithm has been found to be<br />

superior to that of the conventional technique through objective<br />

evaluation tests.<br />

6. REFERENCES<br />

[1] P. S. R. Diniz, Adaptive Filtering: Algorithm and Practical<br />

Implementation. Norwell, MA: Kluwer, 1997.<br />

[3] H. Ye and B. X. Wu, “A new <strong>double</strong>-<strong>talk</strong> <strong>detection</strong> algorithm<br />

<strong>based</strong> on the orthogonality theorem,” IEEE Trans.<br />

Commun., vol. 39, pp. 1542-1545, Nov. 1991.<br />

[4] T. Gänsler, M. Hansson and C. J. Ivarsson, “A <strong>double</strong><strong>talk</strong><br />

detector <strong>based</strong> on coherence,” IEEE Trans. Commun.,<br />

vol. 44, pp. 1421-1427, Nov. 1996.<br />

[5] J. Benesty, D. R. Morgan and J. H. Cho, “A new class of<br />

<strong>double</strong><strong>talk</strong> detectors <strong>based</strong> on cross-correlation,” IEEE<br />

Trans. Speech Audio Processing, vol. 8, no. 10, pp. 168-<br />

172, Mar. 2000.<br />

[6] S. J. Park, C. G. Cho, C. Lee and D. H. Youn, “Integrated<br />

echo and noise canceler for hands-free applications,”<br />

IEEE Trans. on Circuits and Systems II, vol. 49,<br />

issue 3, pp. 186-195, Mar. 2002.<br />

[7] N. S. Kim and J.-H. Chang, “Spectral enhancement<br />

<strong>based</strong> on global <strong>soft</strong> decision,” IEEE Signal Processing<br />

Letters, vol. 7, no. 5, pp. 108-110, May 2000.<br />

[8] Y.-S. Park and J.-H. Chang, “A novel approach to a robust<br />

a priori SNR estimator in speech enhancement,”<br />

IEICE Trans. on Communications, vol. E90-B, no. 8, pp.<br />

2182-2185, Aug 2007.<br />

[9] Y.-S. Park and J.-H. Chang, “A probabilistic combination<br />

method of minimum statistics and <strong>soft</strong> decision for<br />

robust noise power estimation in speech enhancement,”<br />

IEEE Signal Processing Letters, vol. 15, no. 5, pp. 95-<br />

98, Jan 2008.<br />

[10] C. Faller and C. Tournery, “Robust echo control using<br />

a simple echo path <strong>model</strong>,” in Proc. IEEE Int. Conf.<br />

Acoust., Speech Signal Processing, vol. 5, pp. V281-<br />

V284, 2006.<br />

[11] Y. Ephraim and D. Malah, “Speech enhancement using<br />

a minimum mean-square error short-time spectral amplitude<br />

estimator,” IEEE Trans. Acoust., Speech, Signal<br />

Process., vol. ASSP-32, no. 6, pp. 1109-1121, Dec.<br />

1984.<br />

[12] TIA/EIA/IS-127, “Enhanced variable rate codec, speech<br />

service option 3 for wideband spread spectrum digital<br />

systems,” 1996.<br />

[13] S. McGovern, A Model for Room Acoustics, 2003 [Online].<br />

Available: http://2pi.us/rir.html<br />

[14] S. Y. Lee and N. S. Kim, “A <strong>statistical</strong> <strong>model</strong> <strong>based</strong><br />

residual echo suppression,” IEEE Signal Processing Letters,<br />

vol. 14, no. 10, pp. 758-761, Oct. 2007.<br />

[2] C. Avendano, “Acoustic echo suppression in the STFT<br />

domain,” in Proc. IEEE Workshop on Appl. of Sig. Proc.<br />

to Audio and Acoust., Oct. 2001.<br />

5085

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!