a statistical model-based double-talk detection incorporating soft ...
a statistical model-based double-talk detection incorporating soft ...
a statistical model-based double-talk detection incorporating soft ...
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
A STATISTICAL MODEL-BASED DOUBLE-TALK DETECTION<br />
INCORPORATING SOFT DECISION<br />
Yun-Sik Park, Ji-Hyun Song, Sang-Ick Kang, Woojung Lee and Joon-Hyuk Chang<br />
School of Electronic Engineering<br />
Inha University, Incheon, Korea<br />
E-mail: [yspark, jhsong, sikang, wjlee]@dsp.inha.ac.kr, changjh@inha.ac.kr<br />
ABSTRACT<br />
In this paper, we propose a novel <strong>double</strong>-<strong>talk</strong> <strong>detection</strong><br />
(DTD) technique <strong>based</strong> on a <strong>soft</strong> decision in the frequency<br />
domain. The proposed method provides an efficient procedure<br />
to detect the <strong>double</strong>-<strong>talk</strong> situation by the use of the<br />
global near-end speech presence probability (GNSPP) and<br />
voice activity <strong>detection</strong> (VAD) of the near-end and far-end<br />
signal. Specifically, the GNSPP is derived <strong>based</strong> on a <strong>statistical</strong><br />
method of speech and is employed to determine the<br />
<strong>double</strong>-<strong>talk</strong> presence in a given frame. The performance of<br />
our approach is evaluated by objective tests under different<br />
environments, and it is found that the suggested method yields<br />
better results compared with the conventional scheme.<br />
Index Terms— Double-Talk Detection, Speech Presence<br />
Probability, Voice Activity Detection<br />
1. INTRODUCTION<br />
In most hands-free mobile communication systems, since<br />
the loudspeaker and microphone are acoustically coupled,<br />
acoustic echoes occur. In efforts to address this problem,<br />
numerous acoustic echo cancellation (AEC) techniques <strong>incorporating</strong><br />
an adaptive filter such as the least mean square<br />
(LMS) and normalized LMS (NLMS) have been reported [1]-<br />
[3]. One of the major problems of AEC techniques, however,<br />
is that the performance significantly degrades during<br />
the <strong>double</strong>-<strong>talk</strong> periods, in which signals from both the nearend<br />
and far-end coexist because the <strong>double</strong>-<strong>talk</strong> acts as very<br />
large interference to the adaptive filter. The problem can be<br />
alleviated by freezing the adaptive filter coefficients through<br />
the use of a <strong>double</strong> <strong>talk</strong> <strong>detection</strong> (DTD) algorithm [4]. In<br />
this regard, many studies have been dedicated to the problem<br />
of DTD. In practice, cross-correlation and coherence-<strong>based</strong><br />
algorithms are relevant, as they present straightforward approaches.<br />
Adopting hard decisions [4]-[6], these schemes<br />
classify each frame into one of two (i.e., <strong>double</strong>-<strong>talk</strong> or not)<br />
This work was supported by National Research Foundation of Korea<br />
(NRF) grant funded by the Korean Government (MEST) (NRF-2009-<br />
0072319) and This work was supported by the IT R&D program of<br />
MKE/KEIT. [2008-F-045-02].<br />
cases by comparing decision statistics and given threshold<br />
values. However, they are sensitive to optimized parameters<br />
and do not always provide reliable performance under various<br />
conditions.<br />
In this paper, we propose a novel DTD algorithm <strong>based</strong><br />
on a global <strong>soft</strong> decision [7], where the term ‘global’ means<br />
that DTD is performed globally in a given frame and ‘<strong>soft</strong> decision’<br />
[8], [9] denotes that the probability of <strong>double</strong>-<strong>talk</strong> is<br />
introduced as a decision and is applied to update the adaptive<br />
filter in the acoustic echo suppressor (AES) algorithm [10].<br />
Specifically, the global near-end speech presence probability<br />
(GNSPP) <strong>based</strong> on a <strong>statistical</strong> <strong>model</strong> is computed in each<br />
frame to apply the proposed DTD algorithm in conjunction<br />
with results of voice activity <strong>detection</strong> (VAD) of the near-end<br />
and far-end signal. It is worth noting that our approach provides<br />
for the first time an effective framework of DTD <strong>based</strong><br />
on a <strong>soft</strong> decision by taking advantage of a <strong>statistical</strong> <strong>model</strong>,<br />
in contrast with the conventional hard decision-<strong>based</strong> method.<br />
The performance of the proposed algorithm is evaluated by<br />
echo return loss enhancement (ERLE) and speech attenuation<br />
(SA) tests during <strong>double</strong>-<strong>talk</strong> and is demonstrated to be better<br />
than that of the conventional method.<br />
2. GLOBAL NEAR-END SPEECH PRESENCE<br />
PROBABILITY<br />
In this section, we consider how to derive the global nearend<br />
speech presence probability (GNSPP) in the frequency<br />
domain. To this end, we first assume that two hypotheses,<br />
H 0 and H 1 , indicate near-end speech absence and presence<br />
as follows:<br />
H 0 : near-end speech absent : Y(i) =D(i) (1)<br />
: near-end speech present : Y(i) =D(i)+S(i)<br />
H 1<br />
where D(i) = [D(i, 1),D(i, 2),...,D(i, M)], S(i) =<br />
[S(i, 1),S(i, 2),...,S(i, M)] and Y(i) =[Y (i, 1),Y(i, 2),<br />
...,Y(i, M)], respectively, represent the Fourier domain<br />
spectra of the echo signal, the near-end speech and the microphone<br />
input signal with a frame index i. Also, X(i) =<br />
[X(i, 1),X(i, 2),...,X(i, M)] denote the Fourier spectrum<br />
978-1-4244-4296-6/10/$25.00 ©2010 IEEE 5082<br />
ICASSP 2010
of the far-end signal as shown in Fig. 1. The background<br />
noise is not taken into account since we assume that near-end<br />
speech absence is not correlated with the background noise.<br />
Under the assumption that D(i, k) and S(i, k) are characterized<br />
by separate zero-mean complex Gaussian distributions,<br />
the following is obtained [7].<br />
p(Y (i, k)|H 0 ) =<br />
p(Y (i, k)|H 1 ) =<br />
1<br />
[<br />
πλ d (i, k) exp −<br />
|Y (i,<br />
]<br />
k)|2<br />
λ d (i, k)<br />
1<br />
π(λ s (i, k)+λ d (i, k))<br />
[<br />
|Y (i, k)| 2 ]<br />
exp −<br />
λ s (i, k)+λ d (i, k)<br />
where λ s (i, k) and λ d (i, k) are the variance of the near-end<br />
speech and estimated echo, respectively. Accordingly, the<br />
GNSPP p(H 1 |Y(i)) is derived from Bayes’ rule, such that<br />
[7]:<br />
p(Y(i)|H 1 )p(H 1 )<br />
p(H 1 |Y(i)) =<br />
p(Y(i)|H 0 )p(H 0 )+p(Y(i)|H 1 )p(H 1 )<br />
(4)<br />
where p(H 0 )(= 1−p(H 1 )) represents the a priori probability<br />
of near-end speech absence. Since the spectral component in<br />
each frequency bin is assumed to be <strong>statistical</strong>ly independent,<br />
(4) can be rewritten as [7]<br />
p(H 1 |Y(i)) (5)<br />
M∏<br />
p(H 1 ) p(Y (i, k)|H 1 )<br />
k=1<br />
=<br />
M∏<br />
M∏<br />
p(H 0 ) p(Y (i, k)|H 0 )+p(H 1 ) p(Y (i, k)|H 1 )<br />
k=1<br />
M∏<br />
q Λ k (Y (i, k))<br />
k=1<br />
=<br />
M∏<br />
1+q Λ k (Y (i, k))<br />
k=1<br />
k=1<br />
in which q = p(H 1 )/p(H 0 )(= 1) which is determined by<br />
the rough estimate of the ratio of absence time duration and<br />
presence time duration for near-end speech and Λ k (Y (i, k))<br />
is the likelihood ratio computed in the kth frequency bin, as<br />
given by [7]:<br />
Λ k (Y (i, k)) = p(Y (i, k)|H 1)<br />
(6)<br />
p(Y (i, k)|H 0 )<br />
1<br />
[ γ(i, k)ξ(i, k)<br />
]<br />
=<br />
1+ξ(i, k) exp 1+ξ(i, k)<br />
where the a posteriori signal-to-echo ratio (SER) γ(i, k) and<br />
the a priori SER ξ(i, k) are defined by<br />
(2)<br />
(3)<br />
Microphone<br />
Near-end<br />
Loudspeaker<br />
DFT<br />
IDFT<br />
VAD<br />
Decision<br />
GNSPP<br />
DTD<br />
SER Esimation<br />
Echo Path Response<br />
Estimation<br />
x<br />
G<br />
Send path<br />
IDFT<br />
DFT<br />
Far-end<br />
Receive path<br />
Fig. 1. Block diagram of the proposed DTD algorithm.<br />
ξ(i, k) ≡ λ s(i, k)<br />
(8)<br />
λ d (i, k)<br />
where λ d (i, k) is estimated by ˆλ d (i, k). The power spectrum<br />
of the echo signal is obtained in the case of the absence of the<br />
near-end speech signal, as given by<br />
ˆλ d (i, k) =ζ Dˆλd (i − 1,k)+(1− ζ D )|Ŷ (i, k)|2 (9)<br />
in which ζ D (= 0.93) is the smoothing parameter. Also, in (8),<br />
ξ(i, k) is estimated with the help of the well-known decisiondirected<br />
approach with α DD =0.6 [11]. Then,<br />
|Ŝ(i − 1,k)|2<br />
ˆξ(i, k) =α DD<br />
λ d (i − 1,k) +(1− α DD)P {γ(i, k) − 1}<br />
(10)<br />
where P {z} = z if z ≥ 0, and P {z} =0otherwise. As<br />
specified in (9), the robust estimation of the echo magnitude<br />
spectrum Ŷ (i, k) plays an essential role in the performance.<br />
In our approach, we follow the parameter estimation procedure<br />
proposed in [10] as follows:<br />
|Ŷ (i, k)| = Ĥ(i, k)|X(i, k)| (11)<br />
where Ĥ(i, k) is the estimate for the echo path response mimicking<br />
the actual echo path. Specifically, Ĥ opt (i, k) is obtained<br />
<strong>based</strong> on the magnitude of the least squares estimator<br />
as follows [10]:<br />
Ĥ opt (i, k) =<br />
E[X ∗ (i, k)Y (i, k)]<br />
∣E[X ∗ (i, k)X(i, k)] ∣ (12)<br />
where ∗ denotes the complex conjugate and E[·] represents<br />
the expected value. Note that there exist some delay between<br />
the far-end speech X(i, k) and the microphone input signal<br />
Y (i, k) (due to a digital amplifier, e.g.). In our approach, it is<br />
assumed that the echo time-delay is separately estimated and<br />
compensated (i.e., no delay) at the near-end. Since the echo<br />
path is time varying, the estimated echo path response Ĥ(i, k)<br />
is obtained using the iterative procedure such that[10]<br />
γ(i, k) ≡<br />
|Y (i, k)|2<br />
λ d (i, k)<br />
(7)<br />
Ĥ(i, k) =<br />
C(i, k)<br />
R(i, k)<br />
(13)<br />
5083
Far−end Echo<br />
Signal<br />
Near−end Speech<br />
Signal<br />
Microphone Input<br />
Signal<br />
p(H 1<br />
|Y(i))<br />
1<br />
0<br />
−1<br />
1<br />
0<br />
−1<br />
1<br />
0<br />
−1<br />
1.0<br />
0.5<br />
0.0<br />
Noise Far−end echo Double−Talk Near−end Speech Noise<br />
Table 1. ERLE and SA test results obtained from the proposed<br />
DTD algorithm <strong>based</strong> on a <strong>soft</strong> decision with those<br />
yielded by the conventional hard decision method during<br />
<strong>double</strong>-<strong>talk</strong>.<br />
Environments ERLE (dB) SA (dB)<br />
Noise SNR (dB) Park-<strong>based</strong> proposed Park-<strong>based</strong> proposed<br />
10 3.23 2.96 1.75 1.64<br />
White 20 3.46 3.30 1.99 1.86<br />
30 3.51 3.36 2.03 1.90<br />
10 3.36 3.23 1.81 1.71<br />
Babble 20 3.48 3.33 1.20 1.87<br />
30 3.52 3.36 2.03 1.90<br />
10 2.95 2.84 1.56 1.47<br />
Vehicle 20 3.40 3.25 1.96 1.83<br />
30 3.50 3.35 2.02 1.90<br />
Clean speech ∞ 3.52 3.37 2.03 1.91<br />
0 1 2 3 4<br />
Time (sec)<br />
Fig. 2. DTD results for the acoustic echo signal under the<br />
vehicular noise condition (SNR=20 dB).<br />
where<br />
C(i, k) =ζ C C(i − 1,k)+(1− ζ C )|X ∗ (i, k)Y (i, k)| (14)<br />
R(i, k) =ζ R R(i − 1,k)+(1− ζ R )|X ∗ (i, k)X(i, k)|, (15)<br />
and ζ C (= 0.998) and ζ R (= 0.998) are smoothing parameters.<br />
Note that this update iteration achieves the room change<br />
tracking.<br />
3. PROPOSED DOUBLE TALK DETECTION<br />
BASED AES<br />
As noted earlier, the update of the echo path response must<br />
be frozen in the case of the <strong>double</strong>-<strong>talk</strong>. For this, we propose<br />
the DTD technique to incorporate the newly derived GNSPP,<br />
p(H 1 |Y(i)), with the help of the VAD results of the near-end<br />
and far-end signal, as shown in Fig. 1. We inherently consider<br />
the near-end speech presence in the case of far-end signal<br />
presence, where the GNSPP substantially determine the<br />
<strong>double</strong>-<strong>talk</strong> situation and is used to update the echo path response<br />
<strong>based</strong> on (12). Note that the VAD has an impact on<br />
the near-end speech presence and the far-end speech presence<br />
only. Specifically, we derive a novel update routine of the<br />
echo path response by utilizing the <strong>soft</strong> decision as follows:<br />
⎧<br />
⎪⎨<br />
ˇĤ(i, k) =<br />
p(H 1 |Y(i)) ˇĤ(i − 1,k)<br />
+(1 − p(H 1 |Y(i)))Ĥopt(i, k)<br />
, if I(Y(i)) = 1 and I(X(i)) = 1<br />
⎪⎩<br />
Ĥ opt (i, k), otherwise<br />
(16)<br />
where I(·) denotes an indicator function of the VAD result<br />
provided by the IS-127 noise suppression algorithm since it<br />
is known that it gives us a robust performance under various<br />
noise environments [12]. Furthermore, we modified the<br />
VAD algorithm to reduce the false decisions. For example,<br />
I(Y(i)) = 1 if the near-end signal Y(i) exists at the ith frame<br />
and I(Y(i)) = 0 otherwise. Therefore, the update of ˇĤ(i, k)<br />
is finally addressed such that<br />
ˇĤ(i, k) replaces ˇĤ(i − 1,k)<br />
(i.e., no update) within the <strong>double</strong>-<strong>talk</strong> regions on each frequency<br />
bin and (12) in the case of single-<strong>talk</strong>. In particular, in<br />
the case of abrupt transient periods between <strong>double</strong>-<strong>talk</strong> and<br />
single-<strong>talk</strong>, as shown in Fig. 2, the GNSPP could be a <strong>soft</strong><br />
value between 0 and 1. This accounts for why the <strong>soft</strong> decision<br />
scheme is more insensitive to <strong>detection</strong> error compared<br />
to conventional hard decision methods.<br />
Based on this proposed DTD method, we finally apply it to<br />
the AES algorithm proposed by Faller et al. [10] as follows:<br />
Ŝ(i, k) =G(i, k)Y (i, k) (17)<br />
where the Wiener filter gain G(i, k) is given by [10]<br />
[ max(|Y (i, k)|−| Ŷ (i, k)|, 0)<br />
]<br />
G(i, k) =<br />
. (18)<br />
|Y (i, k)|<br />
4. EXPERIMENTAL RESULTS<br />
In order to verify the performance of the proposed DTD algorithm,<br />
we conducted objective comparison experiments under<br />
various noise conditions. Twenty test phrases, spoken by<br />
seven speakers and sampled at 8 kHz, were used as the experimental<br />
data. For assessing the performance of the proposed<br />
method, we artificially created 20 data files, where<br />
each file was produced by mixing the far-end signal with the<br />
near-end signal. Each frame of the windowed signal was<br />
transformed into its corresponding spectrum through a 128-<br />
point DFT after zero padding. We then constructed 16 frequency<br />
bands through combination of subbands to cover all<br />
frequency ranges (∼4 kHz) of the narrow band speech signal,<br />
5084
which is analogous to that of the IS-127 noise suppression algorithm<br />
[12]. The far-end speech signal was passed through<br />
a filter simulating the acoustic echo path <strong>model</strong>ed by a timeinvariant<br />
FIR filter <strong>based</strong> on the analysis of room acoustics<br />
before being mixed electrically [13], [14]. The simulation environment<br />
was designed to fit a small office room having a<br />
size of 5×4×3 m 3 . The echo level measured at the input microphone<br />
was 3.5 dB lower than that of the input near-end<br />
speech on average. In order to create noisy conditions, white,<br />
babble and vehicular noises from the NOISEX-92 database<br />
were added to clean near-end speech signals at signal-to-noise<br />
ratios (SNRs) of 10, 20, and 30 dB. For the purpose of an objective<br />
comparison, we evaluated the performance of the proposed<br />
scheme and that of the conventional DTD algorithm<br />
proposed by Park et al. [6], wherein the cross-correlation<br />
coefficients-<strong>based</strong> <strong>double</strong>-<strong>talk</strong> <strong>detection</strong> method is used. The<br />
performances of the approaches were measured in terms of<br />
echo return loss enhancement (ERLE) and speech attenuation<br />
(SA) during <strong>double</strong>-<strong>talk</strong> [14].<br />
Given the three types of noise environments, the ERLE and<br />
SAs scores were averaged to give final mean score results,<br />
as presented in Table 1. From Table 1, it is evident that in<br />
most noisy conditions, the proposed DTD algorithm <strong>based</strong> on<br />
a <strong>soft</strong> decision yielded a lower ERLE compared to the hard<br />
decision-<strong>based</strong> conventional technique. Also, from Table 2,<br />
we can observe that the SAs (measured during <strong>double</strong>-<strong>talk</strong> periods)<br />
of the proposed scheme <strong>based</strong> on a <strong>soft</strong> decision were<br />
better than those of the previous method [6] in all the tested<br />
conditions. It is noted that the performance gain of the proposed<br />
method becomes smaller as the SNR becomes lower.<br />
This is attributed to imperfection of the GNSPP under adverse<br />
noise conditions. Summarizing the overall results, the<br />
proposed approach is found to be effective in the AES technique.<br />
5. CONCLUSIONS<br />
In this paper, we have proposed a novel DTD algorithm<br />
<strong>based</strong> on a <strong>soft</strong> decision scheme in the frequency domain.<br />
The GNSPP <strong>based</strong> on a <strong>statistical</strong> <strong>model</strong> of the near-end and<br />
far-end signal is applied to the DTD algorithm in conjunction<br />
with VAD decisions for effective echo suppression. The<br />
performance of the proposed algorithm has been found to be<br />
superior to that of the conventional technique through objective<br />
evaluation tests.<br />
6. REFERENCES<br />
[1] P. S. R. Diniz, Adaptive Filtering: Algorithm and Practical<br />
Implementation. Norwell, MA: Kluwer, 1997.<br />
[3] H. Ye and B. X. Wu, “A new <strong>double</strong>-<strong>talk</strong> <strong>detection</strong> algorithm<br />
<strong>based</strong> on the orthogonality theorem,” IEEE Trans.<br />
Commun., vol. 39, pp. 1542-1545, Nov. 1991.<br />
[4] T. Gänsler, M. Hansson and C. J. Ivarsson, “A <strong>double</strong><strong>talk</strong><br />
detector <strong>based</strong> on coherence,” IEEE Trans. Commun.,<br />
vol. 44, pp. 1421-1427, Nov. 1996.<br />
[5] J. Benesty, D. R. Morgan and J. H. Cho, “A new class of<br />
<strong>double</strong><strong>talk</strong> detectors <strong>based</strong> on cross-correlation,” IEEE<br />
Trans. Speech Audio Processing, vol. 8, no. 10, pp. 168-<br />
172, Mar. 2000.<br />
[6] S. J. Park, C. G. Cho, C. Lee and D. H. Youn, “Integrated<br />
echo and noise canceler for hands-free applications,”<br />
IEEE Trans. on Circuits and Systems II, vol. 49,<br />
issue 3, pp. 186-195, Mar. 2002.<br />
[7] N. S. Kim and J.-H. Chang, “Spectral enhancement<br />
<strong>based</strong> on global <strong>soft</strong> decision,” IEEE Signal Processing<br />
Letters, vol. 7, no. 5, pp. 108-110, May 2000.<br />
[8] Y.-S. Park and J.-H. Chang, “A novel approach to a robust<br />
a priori SNR estimator in speech enhancement,”<br />
IEICE Trans. on Communications, vol. E90-B, no. 8, pp.<br />
2182-2185, Aug 2007.<br />
[9] Y.-S. Park and J.-H. Chang, “A probabilistic combination<br />
method of minimum statistics and <strong>soft</strong> decision for<br />
robust noise power estimation in speech enhancement,”<br />
IEEE Signal Processing Letters, vol. 15, no. 5, pp. 95-<br />
98, Jan 2008.<br />
[10] C. Faller and C. Tournery, “Robust echo control using<br />
a simple echo path <strong>model</strong>,” in Proc. IEEE Int. Conf.<br />
Acoust., Speech Signal Processing, vol. 5, pp. V281-<br />
V284, 2006.<br />
[11] Y. Ephraim and D. Malah, “Speech enhancement using<br />
a minimum mean-square error short-time spectral amplitude<br />
estimator,” IEEE Trans. Acoust., Speech, Signal<br />
Process., vol. ASSP-32, no. 6, pp. 1109-1121, Dec.<br />
1984.<br />
[12] TIA/EIA/IS-127, “Enhanced variable rate codec, speech<br />
service option 3 for wideband spread spectrum digital<br />
systems,” 1996.<br />
[13] S. McGovern, A Model for Room Acoustics, 2003 [Online].<br />
Available: http://2pi.us/rir.html<br />
[14] S. Y. Lee and N. S. Kim, “A <strong>statistical</strong> <strong>model</strong> <strong>based</strong><br />
residual echo suppression,” IEEE Signal Processing Letters,<br />
vol. 14, no. 10, pp. 758-761, Oct. 2007.<br />
[2] C. Avendano, “Acoustic echo suppression in the STFT<br />
domain,” in Proc. IEEE Workshop on Appl. of Sig. Proc.<br />
to Audio and Acoust., Oct. 2001.<br />
5085