a statistical model-based double-talk detection incorporating soft ...

A STATISTICAL MODEL-BASED DOUBLE-TALK DETECTION 

INCORPORATING SOFT DECISION 

Yun-Sik Park, Ji-Hyun Song, Sang-Ick Kang, Woojung Lee and Joon-Hyuk Chang 

School of Electronic Engineering 

Inha University, Incheon, Korea 

E-mail: [yspark, jhsong, sikang, wjlee]@dsp.inha.ac.kr, changjh@inha.ac.kr 

ABSTRACT 

In this paper, we propose a novel double-talk detection 

(DTD) technique based on a soft decision in the frequency 

domain. The proposed method provides an efficient procedure 

to detect the double-talk situation by the use of the 

global near-end speech presence probability (GNSPP) and 

voice activity detection (VAD) of the near-end and far-end 

signal. Specifically, the GNSPP is derived based on a statistical 

method of speech and is employed to determine the 

double-talk presence in a given frame. The performance of 

our approach is evaluated by objective tests under different 

environments, and it is found that the suggested method yields 

better results compared with the conventional scheme. 

Index Terms— Double-Talk Detection, Speech Presence 

Probability, Voice Activity Detection 

1. INTRODUCTION 

In most hands-free mobile communication systems, since 

the loudspeaker and microphone are acoustically coupled, 

acoustic echoes occur. In efforts to address this problem, 

numerous acoustic echo cancellation (AEC) techniques incorporating 

an adaptive filter such as the least mean square 

(LMS) and normalized LMS (NLMS) have been reported [1]- 

[3]. One of the major problems of AEC techniques, however, 

is that the performance significantly degrades during 

the double-talk periods, in which signals from both the nearend 

and far-end coexist because the double-talk acts as very 

large interference to the adaptive filter. The problem can be 

alleviated by freezing the adaptive filter coefficients through 

the use of a double talk detection (DTD) algorithm [4]. In 

this regard, many studies have been dedicated to the problem 

of DTD. In practice, cross-correlation and coherence-based 

algorithms are relevant, as they present straightforward approaches. 

Adopting hard decisions [4]-[6], these schemes 

classify each frame into one of two (i.e., double-talk or not) 

This work was supported by National Research Foundation of Korea 

(NRF) grant funded by the Korean Government (MEST) (NRF-2009- 

0072319) and This work was supported by the IT R&D program of 

MKE/KEIT. [2008-F-045-02]. 

cases by comparing decision statistics and given threshold 

values. However, they are sensitive to optimized parameters 

and do not always provide reliable performance under various 

conditions. 

In this paper, we propose a novel DTD algorithm based 

on a global soft decision [7], where the term ‘global’ means 

that DTD is performed globally in a given frame and ‘soft decision’ 

[8], [9] denotes that the probability of double-talk is 

introduced as a decision and is applied to update the adaptive 

filter in the acoustic echo suppressor (AES) algorithm [10]. 

Specifically, the global near-end speech presence probability 

(GNSPP) based on a statistical model is computed in each 

frame to apply the proposed DTD algorithm in conjunction 

with results of voice activity detection (VAD) of the near-end 

and far-end signal. It is worth noting that our approach provides 

for the first time an effective framework of DTD based 

on a soft decision by taking advantage of a statistical model, 

in contrast with the conventional hard decision-based method. 

The performance of the proposed algorithm is evaluated by 

echo return loss enhancement (ERLE) and speech attenuation 

(SA) tests during double-talk and is demonstrated to be better 

than that of the conventional method. 

2. GLOBAL NEAR-END SPEECH PRESENCE 

PROBABILITY 

In this section, we consider how to derive the global nearend 

speech presence probability (GNSPP) in the frequency 

domain. To this end, we first assume that two hypotheses, 

H 0 and H 1 , indicate near-end speech absence and presence 

as follows: 

H 0 : near-end speech absent : Y(i) =D(i) (1) 

: near-end speech present : Y(i) =D(i)+S(i) 

H 1 

where D(i) = [D(i, 1),D(i, 2),...,D(i, M)], S(i) = 

[S(i, 1),S(i, 2),...,S(i, M)] and Y(i) =[Y (i, 1),Y(i, 2), 

...,Y(i, M)], respectively, represent the Fourier domain 

spectra of the echo signal, the near-end speech and the microphone 

input signal with a frame index i. Also, X(i) = 

[X(i, 1),X(i, 2),...,X(i, M)] denote the Fourier spectrum 

978-1-4244-4296-6/10/$25.00 ©2010 IEEE 5082 

ICASSP 2010

of the far-end signal as shown in Fig. 1. The background 

noise is not taken into account since we assume that near-end 

speech absence is not correlated with the background noise. 

Under the assumption that D(i, k) and S(i, k) are characterized 

by separate zero-mean complex Gaussian distributions, 

the following is obtained [7]. 

p(Y (i, k)|H 0 ) = 

p(Y (i, k)|H 1 ) = 

1 

[ 

πλ d (i, k) exp − 

|Y (i, 

] 

k)|2 

λ d (i, k) 

1 

π(λ s (i, k)+λ d (i, k)) 

[ 

|Y (i, k)| 2 ] 

exp − 

λ s (i, k)+λ d (i, k) 

where λ s (i, k) and λ d (i, k) are the variance of the near-end 

speech and estimated echo, respectively. Accordingly, the 

GNSPP p(H 1 |Y(i)) is derived from Bayes’ rule, such that 

[7]: 

p(Y(i)|H 1 )p(H 1 ) 

p(H 1 |Y(i)) = 

p(Y(i)|H 0 )p(H 0 )+p(Y(i)|H 1 )p(H 1 ) 

(4) 

where p(H 0 )(= 1−p(H 1 )) represents the a priori probability 

of near-end speech absence. Since the spectral component in 

each frequency bin is assumed to be statistically independent, 

(4) can be rewritten as [7] 

p(H 1 |Y(i)) (5) 

M∏ 

p(H 1 ) p(Y (i, k)|H 1 ) 

k=1 

= 

M∏ 

M∏ 

p(H 0 ) p(Y (i, k)|H 0 )+p(H 1 ) p(Y (i, k)|H 1 ) 

k=1 

M∏ 

q Λ k (Y (i, k)) 

k=1 

= 

M∏ 

1+q Λ k (Y (i, k)) 

k=1 

k=1 

in which q = p(H 1 )/p(H 0 )(= 1) which is determined by 

the rough estimate of the ratio of absence time duration and 

presence time duration for near-end speech and Λ k (Y (i, k)) 

is the likelihood ratio computed in the kth frequency bin, as 

given by [7]: 

Λ k (Y (i, k)) = p(Y (i, k)|H 1) 

(6) 

p(Y (i, k)|H 0 ) 

1 

[ γ(i, k)ξ(i, k) 

] 

= 

1+ξ(i, k) exp 1+ξ(i, k) 

where the a posteriori signal-to-echo ratio (SER) γ(i, k) and 

the a priori SER ξ(i, k) are defined by 

(2) 

(3) 

Microphone 

Near-end 

Loudspeaker 

DFT 

IDFT 

VAD 

Decision 

GNSPP 

DTD 

SER Esimation 

Echo Path Response 

Estimation 

x 

G 

Send path 

IDFT 

DFT 

Far-end 

Receive path 

Fig. 1. Block diagram of the proposed DTD algorithm. 

ξ(i, k) ≡ λ s(i, k) 

(8) 

λ d (i, k) 

where λ d (i, k) is estimated by ˆλ d (i, k). The power spectrum 

of the echo signal is obtained in the case of the absence of the 

near-end speech signal, as given by 

ˆλ d (i, k) =ζ Dˆλd (i − 1,k)+(1− ζ D )|Ŷ (i, k)|2 (9) 

in which ζ D (= 0.93) is the smoothing parameter. Also, in (8), 

ξ(i, k) is estimated with the help of the well-known decisiondirected 

approach with α DD =0.6 [11]. Then, 

|Ŝ(i − 1,k)|2 

ˆξ(i, k) =α DD 

λ d (i − 1,k) +(1− α DD)P {γ(i, k) − 1} 

(10) 

where P {z} = z if z ≥ 0, and P {z} =0otherwise. As 

specified in (9), the robust estimation of the echo magnitude 

spectrum Ŷ (i, k) plays an essential role in the performance. 

In our approach, we follow the parameter estimation procedure 

proposed in [10] as follows: 

|Ŷ (i, k)| = Ĥ(i, k)|X(i, k)| (11) 

where Ĥ(i, k) is the estimate for the echo path response mimicking 

the actual echo path. Specifically, Ĥ opt (i, k) is obtained 

based on the magnitude of the least squares estimator 

as follows [10]: 

Ĥ opt (i, k) = 

E[X ∗ (i, k)Y (i, k)] 

∣E[X ∗ (i, k)X(i, k)] ∣ (12) 

where ∗ denotes the complex conjugate and E[·] represents 

the expected value. Note that there exist some delay between 

the far-end speech X(i, k) and the microphone input signal 

Y (i, k) (due to a digital amplifier, e.g.). In our approach, it is 

assumed that the echo time-delay is separately estimated and 

compensated (i.e., no delay) at the near-end. Since the echo 

path is time varying, the estimated echo path response Ĥ(i, k) 

is obtained using the iterative procedure such that[10] 

γ(i, k) ≡ 

|Y (i, k)|2 

λ d (i, k) 

(7) 

Ĥ(i, k) = 

C(i, k) 

R(i, k) 

(13) 

5083

Far−end Echo 

Signal 

Near−end Speech 

Signal 

Microphone Input 

Signal 

p(H 1 

|Y(i)) 

1 

0 

−1 

1 

0 

−1 

1 

0 

−1 

1.0 

0.5 

0.0 

Noise Far−end echo Double−Talk Near−end Speech Noise 

Table 1. ERLE and SA test results obtained from the proposed 

DTD algorithm based on a soft decision with those 

yielded by the conventional hard decision method during 

double-talk. 

Environments ERLE (dB) SA (dB) 

Noise SNR (dB) Park-based proposed Park-based proposed 

10 3.23 2.96 1.75 1.64 

White 20 3.46 3.30 1.99 1.86 

30 3.51 3.36 2.03 1.90 

10 3.36 3.23 1.81 1.71 

Babble 20 3.48 3.33 1.20 1.87 

30 3.52 3.36 2.03 1.90 

10 2.95 2.84 1.56 1.47 

Vehicle 20 3.40 3.25 1.96 1.83 

30 3.50 3.35 2.02 1.90 

Clean speech ∞ 3.52 3.37 2.03 1.91 

0 1 2 3 4 

Time (sec) 

Fig. 2. DTD results for the acoustic echo signal under the 

vehicular noise condition (SNR=20 dB). 

where 

C(i, k) =ζ C C(i − 1,k)+(1− ζ C )|X ∗ (i, k)Y (i, k)| (14) 

R(i, k) =ζ R R(i − 1,k)+(1− ζ R )|X ∗ (i, k)X(i, k)|, (15) 

and ζ C (= 0.998) and ζ R (= 0.998) are smoothing parameters. 

Note that this update iteration achieves the room change 

tracking. 

3. PROPOSED DOUBLE TALK DETECTION 

BASED AES 

As noted earlier, the update of the echo path response must 

be frozen in the case of the double-talk. For this, we propose 

the DTD technique to incorporate the newly derived GNSPP, 

p(H 1 |Y(i)), with the help of the VAD results of the near-end 

and far-end signal, as shown in Fig. 1. We inherently consider 

the near-end speech presence in the case of far-end signal 

presence, where the GNSPP substantially determine the 

double-talk situation and is used to update the echo path response 

based on (12). Note that the VAD has an impact on 

the near-end speech presence and the far-end speech presence 

only. Specifically, we derive a novel update routine of the 

echo path response by utilizing the soft decision as follows: 

⎧ 

⎪⎨ 

ˇĤ(i, k) = 

p(H 1 |Y(i)) ˇĤ(i − 1,k) 

+(1 − p(H 1 |Y(i)))Ĥopt(i, k) 

, if I(Y(i)) = 1 and I(X(i)) = 1 

⎪⎩ 

Ĥ opt (i, k), otherwise 

(16) 

where I(·) denotes an indicator function of the VAD result 

provided by the IS-127 noise suppression algorithm since it 

is known that it gives us a robust performance under various 

noise environments [12]. Furthermore, we modified the 

VAD algorithm to reduce the false decisions. For example, 

I(Y(i)) = 1 if the near-end signal Y(i) exists at the ith frame 

and I(Y(i)) = 0 otherwise. Therefore, the update of ˇĤ(i, k) 

is finally addressed such that 

ˇĤ(i, k) replaces ˇĤ(i − 1,k) 

(i.e., no update) within the double-talk regions on each frequency 

bin and (12) in the case of single-talk. In particular, in 

the case of abrupt transient periods between double-talk and 

single-talk, as shown in Fig. 2, the GNSPP could be a soft 

value between 0 and 1. This accounts for why the soft decision 

scheme is more insensitive to detection error compared 

to conventional hard decision methods. 

Based on this proposed DTD method, we finally apply it to 

the AES algorithm proposed by Faller et al. [10] as follows: 

Ŝ(i, k) =G(i, k)Y (i, k) (17) 

where the Wiener filter gain G(i, k) is given by [10] 

[ max(|Y (i, k)|−| Ŷ (i, k)|, 0) 

] 

G(i, k) = 

. (18) 

|Y (i, k)| 

4. EXPERIMENTAL RESULTS 

In order to verify the performance of the proposed DTD algorithm, 

we conducted objective comparison experiments under 

various noise conditions. Twenty test phrases, spoken by 

seven speakers and sampled at 8 kHz, were used as the experimental 

data. For assessing the performance of the proposed 

method, we artificially created 20 data files, where 

each file was produced by mixing the far-end signal with the 

near-end signal. Each frame of the windowed signal was 

transformed into its corresponding spectrum through a 128- 

point DFT after zero padding. We then constructed 16 frequency 

bands through combination of subbands to cover all 

frequency ranges (∼4 kHz) of the narrow band speech signal, 

5084

which is analogous to that of the IS-127 noise suppression algorithm 

[12]. The far-end speech signal was passed through 

a filter simulating the acoustic echo path modeled by a timeinvariant 

FIR filter based on the analysis of room acoustics 

before being mixed electrically [13], [14]. The simulation environment 

was designed to fit a small office room having a 

size of 5×4×3 m 3 . The echo level measured at the input microphone 

was 3.5 dB lower than that of the input near-end 

speech on average. In order to create noisy conditions, white, 

babble and vehicular noises from the NOISEX-92 database 

were added to clean near-end speech signals at signal-to-noise 

ratios (SNRs) of 10, 20, and 30 dB. For the purpose of an objective 

comparison, we evaluated the performance of the proposed 

scheme and that of the conventional DTD algorithm 

proposed by Park et al. [6], wherein the cross-correlation 

coefficients-based double-talk detection method is used. The 

performances of the approaches were measured in terms of 

echo return loss enhancement (ERLE) and speech attenuation 

(SA) during double-talk [14]. 

Given the three types of noise environments, the ERLE and 

SAs scores were averaged to give final mean score results, 

as presented in Table 1. From Table 1, it is evident that in 

most noisy conditions, the proposed DTD algorithm based on 

a soft decision yielded a lower ERLE compared to the hard 

decision-based conventional technique. Also, from Table 2, 

we can observe that the SAs (measured during double-talk periods) 

of the proposed scheme based on a soft decision were 

better than those of the previous method [6] in all the tested 

conditions. It is noted that the performance gain of the proposed 

method becomes smaller as the SNR becomes lower. 

This is attributed to imperfection of the GNSPP under adverse 

noise conditions. Summarizing the overall results, the 

proposed approach is found to be effective in the AES technique. 

5. CONCLUSIONS 

In this paper, we have proposed a novel DTD algorithm 

based on a soft decision scheme in the frequency domain. 

The GNSPP based on a statistical model of the near-end and 

far-end signal is applied to the DTD algorithm in conjunction 

with VAD decisions for effective echo suppression. The 

performance of the proposed algorithm has been found to be 

superior to that of the conventional technique through objective 

evaluation tests. 

6. REFERENCES 

[1] P. S. R. Diniz, Adaptive Filtering: Algorithm and Practical 

Implementation. Norwell, MA: Kluwer, 1997. 

[3] H. Ye and B. X. Wu, “A new double-talk detection algorithm 

based on the orthogonality theorem,” IEEE Trans. 

Commun., vol. 39, pp. 1542-1545, Nov. 1991. 

[4] T. Gänsler, M. Hansson and C. J. Ivarsson, “A doubletalk 

detector based on coherence,” IEEE Trans. Commun., 

vol. 44, pp. 1421-1427, Nov. 1996. 

[5] J. Benesty, D. R. Morgan and J. H. Cho, “A new class of 

doubletalk detectors based on cross-correlation,” IEEE 

Trans. Speech Audio Processing, vol. 8, no. 10, pp. 168- 

172, Mar. 2000. 

[6] S. J. Park, C. G. Cho, C. Lee and D. H. Youn, “Integrated 

echo and noise canceler for hands-free applications,” 

IEEE Trans. on Circuits and Systems II, vol. 49, 

issue 3, pp. 186-195, Mar. 2002. 

[7] N. S. Kim and J.-H. Chang, “Spectral enhancement 

based on global soft decision,” IEEE Signal Processing 

Letters, vol. 7, no. 5, pp. 108-110, May 2000. 

[8] Y.-S. Park and J.-H. Chang, “A novel approach to a robust 

a priori SNR estimator in speech enhancement,” 

IEICE Trans. on Communications, vol. E90-B, no. 8, pp. 

2182-2185, Aug 2007. 

[9] Y.-S. Park and J.-H. Chang, “A probabilistic combination 

method of minimum statistics and soft decision for 

robust noise power estimation in speech enhancement,” 

IEEE Signal Processing Letters, vol. 15, no. 5, pp. 95- 

98, Jan 2008. 

[10] C. Faller and C. Tournery, “Robust echo control using 

a simple echo path model,” in Proc. IEEE Int. Conf. 

Acoust., Speech Signal Processing, vol. 5, pp. V281- 

V284, 2006. 

[11] Y. Ephraim and D. Malah, “Speech enhancement using 

a minimum mean-square error short-time spectral amplitude 

estimator,” IEEE Trans. Acoust., Speech, Signal 

Process., vol. ASSP-32, no. 6, pp. 1109-1121, Dec. 

1984. 

[12] TIA/EIA/IS-127, “Enhanced variable rate codec, speech 

service option 3 for wideband spread spectrum digital 

systems,” 1996. 

[13] S. McGovern, A Model for Room Acoustics, 2003 [Online]. 

Available: http://2pi.us/rir.html 

[14] S. Y. Lee and N. S. Kim, “A statistical model based 

residual echo suppression,” IEEE Signal Processing Letters, 

vol. 14, no. 10, pp. 758-761, Oct. 2007. 

[2] C. Avendano, “Acoustic echo suppression in the STFT 

domain,” in Proc. IEEE Workshop on Appl. of Sig. Proc. 

to Audio and Acoust., Oct. 2001. 

5085

a statistical model-based double-talk detection incorporating soft ...

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?