a statistical model-based double-talk detection incorporating soft ...

More documents

Recommendations

Info

of the far-end signal as shown in Fig. 1. The background noise is not taken into account since we assume that near-end speech absence is not correlated with the background noise. Under the assumption that D(i, k) and S(i, k) are characterized by separate zero-mean complex Gaussian distributions, the following is obtained [7]. p(Y (i, k)|H 0 ) = p(Y (i, k)|H 1 ) = 1 [ πλ d (i, k) exp − |Y (i, ] k)|2 λ d (i, k) 1 π(λ s (i, k)+λ d (i, k)) [ |Y (i, k)| 2 ] exp − λ s (i, k)+λ d (i, k) where λ s (i, k) and λ d (i, k) are the variance of the near-end speech and estimated echo, respectively. Accordingly, the GNSPP p(H 1 |Y(i)) is derived from Bayes’ rule, such that [7]: p(Y(i)|H 1 )p(H 1 ) p(H 1 |Y(i)) = p(Y(i)|H 0 )p(H 0 )+p(Y(i)|H 1 )p(H 1 ) (4) where p(H 0 )(= 1−p(H 1 )) represents the a priori probability of near-end speech absence. Since the spectral component in each frequency bin is assumed to be statistically independent, (4) can be rewritten as [7] p(H 1 |Y(i)) (5) M∏ p(H 1 ) p(Y (i, k)|H 1 ) k=1 = M∏ M∏ p(H 0 ) p(Y (i, k)|H 0 )+p(H 1 ) p(Y (i, k)|H 1 ) k=1 M∏ q Λ k (Y (i, k)) k=1 = M∏ 1+q Λ k (Y (i, k)) k=1 k=1 in which q = p(H 1 )/p(H 0 )(= 1) which is determined by the rough estimate of the ratio of absence time duration and presence time duration for near-end speech and Λ k (Y (i, k)) is the likelihood ratio computed in the kth frequency bin, as given by [7]: Λ k (Y (i, k)) = p(Y (i, k)|H 1) (6) p(Y (i, k)|H 0 ) 1 [ γ(i, k)ξ(i, k) ] = 1+ξ(i, k) exp 1+ξ(i, k) where the a posteriori signal-to-echo ratio (SER) γ(i, k) and the a priori SER ξ(i, k) are defined by (2) (3) Microphone Near-end Loudspeaker DFT IDFT VAD Decision GNSPP DTD SER Esimation Echo Path Response Estimation x G Send path IDFT DFT Far-end Receive path Fig. 1. Block diagram of the proposed DTD algorithm. ξ(i, k) ≡ λ s(i, k) (8) λ d (i, k) where λ d (i, k) is estimated by ˆλ d (i, k). The power spectrum of the echo signal is obtained in the case of the absence of the near-end speech signal, as given by ˆλ d (i, k) =ζ Dˆλd (i − 1,k)+(1− ζ D )|Ŷ (i, k)|2 (9) in which ζ D (= 0.93) is the smoothing parameter. Also, in (8), ξ(i, k) is estimated with the help of the well-known decisiondirected approach with α DD =0.6 [11]. Then, |Ŝ(i − 1,k)|2 ˆξ(i, k) =α DD λ d (i − 1,k) +(1− α DD)P {γ(i, k) − 1} (10) where P {z} = z if z ≥ 0, and P {z} =0otherwise. As specified in (9), the robust estimation of the echo magnitude spectrum Ŷ (i, k) plays an essential role in the performance. In our approach, we follow the parameter estimation procedure proposed in [10] as follows: |Ŷ (i, k)| = Ĥ(i, k)|X(i, k)| (11) where Ĥ(i, k) is the estimate for the echo path response mimicking the actual echo path. Specifically, Ĥ opt (i, k) is obtained based on the magnitude of the least squares estimator as follows [10]: Ĥ opt (i, k) = E[X ∗ (i, k)Y (i, k)] ∣E[X ∗ (i, k)X(i, k)] ∣ (12) where ∗ denotes the complex conjugate and E[·] represents the expected value. Note that there exist some delay between the far-end speech X(i, k) and the microphone input signal Y (i, k) (due to a digital amplifier, e.g.). In our approach, it is assumed that the echo time-delay is separately estimated and compensated (i.e., no delay) at the near-end. Since the echo path is time varying, the estimated echo path response Ĥ(i, k) is obtained using the iterative procedure such that[10] γ(i, k) ≡ |Y (i, k)|2 λ d (i, k) (7) Ĥ(i, k) = C(i, k) R(i, k) (13) 5083
Far−end Echo Signal Near−end Speech Signal Microphone Input Signal p(H 1 |Y(i)) 1 0 −1 1 0 −1 1 0 −1 1.0 0.5 0.0 Noise Far−end echo Double−Talk Near−end Speech Noise Table 1. ERLE and SA test results obtained from the proposed DTD algorithm based on a soft decision with those yielded by the conventional hard decision method during double-talk. Environments ERLE (dB) SA (dB) Noise SNR (dB) Park-based proposed Park-based proposed 10 3.23 2.96 1.75 1.64 White 20 3.46 3.30 1.99 1.86 30 3.51 3.36 2.03 1.90 10 3.36 3.23 1.81 1.71 Babble 20 3.48 3.33 1.20 1.87 30 3.52 3.36 2.03 1.90 10 2.95 2.84 1.56 1.47 Vehicle 20 3.40 3.25 1.96 1.83 30 3.50 3.35 2.02 1.90 Clean speech ∞ 3.52 3.37 2.03 1.91 0 1 2 3 4 Time (sec) Fig. 2. DTD results for the acoustic echo signal under the vehicular noise condition (SNR=20 dB). where C(i, k) =ζ C C(i − 1,k)+(1− ζ C )|X ∗ (i, k)Y (i, k)| (14) R(i, k) =ζ R R(i − 1,k)+(1− ζ R )|X ∗ (i, k)X(i, k)|, (15) and ζ C (= 0.998) and ζ R (= 0.998) are smoothing parameters. Note that this update iteration achieves the room change tracking. 3. PROPOSED DOUBLE TALK DETECTION BASED AES As noted earlier, the update of the echo path response must be frozen in the case of the double-talk. For this, we propose the DTD technique to incorporate the newly derived GNSPP, p(H 1 |Y(i)), with the help of the VAD results of the near-end and far-end signal, as shown in Fig. 1. We inherently consider the near-end speech presence in the case of far-end signal presence, where the GNSPP substantially determine the double-talk situation and is used to update the echo path response based on (12). Note that the VAD has an impact on the near-end speech presence and the far-end speech presence only. Specifically, we derive a novel update routine of the echo path response by utilizing the soft decision as follows: ⎧ ⎪⎨ ˇĤ(i, k) = p(H 1 |Y(i)) ˇĤ(i − 1,k) +(1 − p(H 1 |Y(i)))Ĥopt(i, k) , if I(Y(i)) = 1 and I(X(i)) = 1 ⎪⎩ Ĥ opt (i, k), otherwise (16) where I(·) denotes an indicator function of the VAD result provided by the IS-127 noise suppression algorithm since it is known that it gives us a robust performance under various noise environments [12]. Furthermore, we modified the VAD algorithm to reduce the false decisions. For example, I(Y(i)) = 1 if the near-end signal Y(i) exists at the ith frame and I(Y(i)) = 0 otherwise. Therefore, the update of ˇĤ(i, k) is finally addressed such that ˇĤ(i, k) replaces ˇĤ(i − 1,k) (i.e., no update) within the double-talk regions on each frequency bin and (12) in the case of single-talk. In particular, in the case of abrupt transient periods between double-talk and single-talk, as shown in Fig. 2, the GNSPP could be a soft value between 0 and 1. This accounts for why the soft decision scheme is more insensitive to detection error compared to conventional hard decision methods. Based on this proposed DTD method, we finally apply it to the AES algorithm proposed by Faller et al. [10] as follows: Ŝ(i, k) =G(i, k)Y (i, k) (17) where the Wiener filter gain G(i, k) is given by [10] [ max(|Y (i, k)|−| Ŷ (i, k)|, 0) ] G(i, k) = . (18) |Y (i, k)| 4. EXPERIMENTAL RESULTS In order to verify the performance of the proposed DTD algorithm, we conducted objective comparison experiments under various noise conditions. Twenty test phrases, spoken by seven speakers and sampled at 8 kHz, were used as the experimental data. For assessing the performance of the proposed method, we artificially created 20 data files, where each file was produced by mixing the far-end signal with the near-end signal. Each frame of the windowed signal was transformed into its corresponding spectrum through a 128- point DFT after zero padding. We then constructed 16 frequency bands through combination of subbands to cover all frequency ranges (∼4 kHz) of the narrow band speech signal, 5084
Page 1: A STATISTICAL MODEL-BASED DOUBLE-TA

a statistical model-based double-talk detection incorporating soft ...

Create successful ePaper yourself

Delete template?

Save as template?