Downstream speech enhancement in a low directivity binaural ...


23–27 August 2010, Sydney, Australia. Proceedings of the 20th International Congress on Acoustics, ICA 2010

4. From the total number of time frames, the algorithm used 37.3%, 31.7% and 27.4% for the localization process in the anechoic condition, the cafeteria noise and the reverberant environment, respectively.

[Polar plots, one per panel: "Anechoic", "Cafeteria noise" and "Reverberant"; azimuth axis in 15° steps from −165° to ±180°, radial axis with ticks from 0.5 to 2.5 (×10^2).]

Figure 4: Localization performance of 3 speakers in an anechoic, cafeteria noise (5 dB SNR) and reverberant (T60 = 1.0 s) environment.

Mask-based Post-processor

The formation of masks is based on a local selection criterion that decides whether a certain time-frequency bin belongs to the target or to the noise. An example based on a simple local SNR criterion is given in Figure 5. A class of successful binaural speech enhancement algorithms employs the deviation of acoustic sources from the mid-line, i.e., their binaural differences, as a selection criterion. A weighting function with a preferential listening direction is defined: bins that correspond to this direction are weighted with one, and bins that do not are attenuated with a value smaller than one. Wittkop et al. [19, 2] give an overview of these algorithms, which capture psychoacoustical principles of the spatial masking release.
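The local SNR criterion can be illustrated with a minimal Python sketch. It assumes that the target and noise powers are known separately for every time-frequency bin, which holds only for an ideal (oracle) mask as in Figure 5. The function names, the example values and the 0 dB default threshold are illustrative and not taken from the paper.

```python
import math


def local_snr_db(target_power, noise_power, eps=1e-12):
    """Local SNR in dB for a single time-frequency bin (eps avoids log(0))."""
    return 10.0 * math.log10((target_power + eps) / (noise_power + eps))


def binary_mask(target_power, noise_power, threshold_db=0.0):
    """Return a 0/1 mask: 1 where the local SNR exceeds threshold_db.

    Both inputs are nested lists indexed as [frequency][time].
    """
    return [
        [1 if local_snr_db(t, n) > threshold_db else 0
         for t, n in zip(t_row, n_row)]
        for t_row, n_row in zip(target_power, noise_power)
    ]


def apply_mask(mixture, mask):
    """Multiply every bin of the mixture with the corresponding mask value."""
    return [
        [x * m for x, m in zip(x_row, m_row)]
        for x_row, m_row in zip(mixture, mask)
    ]


# Hypothetical bin powers: 2 frequency bands x 2 time frames.
target = [[4.0, 0.1], [0.5, 9.0]]
noise = [[1.0, 1.0], [1.0, 1.0]]
mixture = [[2.0, 3.0], [4.0, 5.0]]

mask = binary_mask(target, noise, threshold_db=0.0)
masked = apply_mask(mixture, mask)
```

With these example values only the two bins whose local SNR exceeds 0 dB are retained. Passing `threshold_db=-7.0` would mirror the criterion described for Figure 5.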

Generally, these algorithms calculate the binaural cues from a comparison of the left and the right STFT signals. As an extension to short-time spectra, the speech enhancement algorithm of Kollmeier and Koch [5] uses modulation spectra to analyze binaural disparity. The algorithm was designed on the grounds of physiological and psychoacoustical results, in which a decomposition of stimuli into different modulation frequencies has been found to be independent of center frequencies, i.e. the tonotopic representation of the cochlear frequency decomposition, as well as of the spatial decomposition. Psychoacoustically, this particular processing of the auditory system is also known as co-modulation masking release. For speech intelligibility this generally means that if two speakers differ in their fundamental frequencies, which modulate their respective spectra, they are each more intelligible than if their fundamentals coincide. The Kollmeier and Koch speech enhancement algorithm has been shown to be robust in different acoustic situations and to improve the SNR over a wide dynamic range by approximately 2 dB [5, 19].
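As a hedged illustration of how binaural cues may be derived from the left and right STFT signals, the sketch below computes an interaural level difference (ILD) and an interaural phase difference (IPD) for one pair of complex bins. The choice of these two cues, the function name and the regularization constant `eps` are assumptions made for illustration; the modulation-spectrum processing of [5] is not reproduced here.

```python
import cmath
import math


def interaural_cues(left_bin, right_bin, eps=1e-12):
    """Return (ILD in dB, IPD in radians) for one pair of complex STFT bins.

    ILD compares the magnitudes of the two bins; IPD is the phase of the
    cross-spectrum left * conj(right), which lies in [-pi, pi].
    """
    ild_db = 20.0 * math.log10((abs(left_bin) + eps) / (abs(right_bin) + eps))
    ipd_rad = cmath.phase(left_bin * right_bin.conjugate())
    return ild_db, ipd_rad


# Hypothetical bins: the left signal is twice as strong and leads by 0.3 rad.
left = 2.0 * cmath.exp(1j * 0.5)
right = 1.0 * cmath.exp(1j * 0.2)
ild, ipd = interaural_cues(left, right)
```

In this example the ILD is about 6 dB and the IPD about 0.3 rad, i.e. the source would be judged to lie on the left side of the mid-line.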

[Spectrogram panels: "Signal", "Noise", "Mixture", "Ideal Mask" and "Masked mixture"; vertical axis Centre Frequency [Hz] from 80 to 5000, horizontal axis Time [s] from 0.3 to 1.2.]

Figure 5: Illustration of the masking principle. The spectrogram of the sentence “We kwamen net te laat terug.” (Dutch for "We came back just too late."), spoken by a male speaker, is shown in the top left panel (red denotes low energy and violet high energy). The sentence is mixed with noise, and the aim is to filter the sentence by applying a mask (in the spectrotemporal domain). The noise was recorded in a bus with an artificial dummy head. The spectrogram of the noise is shown in the top right panel. The spectrogram of the speech signal mixed with the noise is shown in the middle left panel (SNR = −7 dB). The middle right panel shows the mask: all time-frequency bins with an SNR above −7 dB are white, and those with an SNR below −7 dB are black. The bottom panel shows the spectrogram of the mixture after multiplication with the mask.

Here, a variation of this algorithm is applied to the binaural output of the MVDR beamformer. The variation concerns the weighting function. First, a frequency-independent directionality is applied, which is adopted from the lateral filter function of Wittkop and Hohmann [20]. Second, the reference values for the mask generation are data-driven. To this end, the parametric output of the localizer is used to calculate the a posteriori probability for the presence of the target source at a certain azimuthal position. In this way, the filter function is constantly updated to the scene and the effects of distortion are reduced. The concept for source separation was introduced by Madhu [21, 22]. By way of example, he has shown that reverberation increases the variation of the binaural cues and that an a posteriori estimation responds to that change and opens the spatial aperture in mask-based filters.
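The effect of opening the spatial aperture can be sketched with a simple direction-dependent gain. This is not the lateral filter function of [20]; the rectangular aperture, the gain floor of 0.1 and all names are hypothetical choices for illustration. Bins whose estimated azimuth lies within the aperture around the target keep a weight of one, and all other bins are attenuated.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def directional_gain(bin_azimuth, target_azimuth, aperture, floor=0.1):
    """Gain of 1 inside the aperture around the target, `floor` outside.

    All angles are in radians; `aperture` is the half-width of the pass
    region. The rectangular shape is an illustrative assumption.
    """
    deviation = abs(wrap_angle(bin_azimuth - target_azimuth))
    return 1.0 if deviation <= aperture else floor


# A bin estimated at 25 degrees, target straight ahead (0 degrees).
narrow = directional_gain(math.radians(25), 0.0, aperture=math.radians(15))
wide = directional_gain(math.radians(25), 0.0, aperture=math.radians(30))
```

The bin at 25° is attenuated with a 15° aperture but passed with a 30° aperture, which is the qualitative behaviour expected when reverberation widens the distribution of the binaural cues and the filter adapts to it.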

For the most part, the Kollmeier and Koch speech processor applied here was implemented according to the algorithmic features given in [5]. We therefore restrict this report to the details in which it deviates from the original implementation.

The parametric output of the localizer is first used to calculate an azimuthal non-symmetric aperture of the filter toward the target speaker. To this end, the parametric output of the localizer is used as the input to an expectation-maximization (EM) routine in order to approximate the localization through a mixture of Gaussians (MoG). This MoG fit $\hat{\theta}(n)$ is then employed to calculate the angular a posteriori probability of a target source $P_{t|\hat{\theta}}(n)$ in a time frame $n$. The approximation through a MoG is based on an a priori probability $P_i$, the location $\theta_i$ with $\theta \in [-\pi, \pi]$ and a variance $\sigma_i^2$ of each source $i$, with $Q$ the total number of sources. The EM algorithm of Feder and Weinstein (see [23]) is applied to iteratively approximate the MoG estimation:

$$\hat{\theta}(n) = \sum_{i=1}^{Q} P_i \exp\left(-\frac{(\theta(n) - \theta_i)^2}{2\sigma_i^2}\right). \tag{14}$$
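Equation (14) can be evaluated directly once the mixture parameters are available. The sketch below computes the mixture value and, as an assumption, takes the a posteriori probability of the target to be the share of the mixture contributed by the target component; this excerpt does not spell out that step, and the EM fit of [23] is not reproduced. Function names and the example parameters are hypothetical.

```python
import math


def component_value(theta, prior, mean, variance):
    """One weighted exponential term of Eq. (14), without normalization."""
    return prior * math.exp(-((theta - mean) ** 2) / (2.0 * variance))


def mog_value(theta, priors, means, variances):
    """Evaluate the mixture of Eq. (14) at azimuth theta (radians).

    Angles are not wrapped here; inputs are assumed to lie in [-pi, pi].
    """
    return sum(
        component_value(theta, p, m, v)
        for p, m, v in zip(priors, means, variances)
    )


def target_posterior(theta, target_index, priors, means, variances):
    """Assumed posterior: target component divided by the whole mixture."""
    total = mog_value(theta, priors, means, variances)
    if total == 0.0:
        return 0.0  # no component contributes at this azimuth
    target = component_value(
        theta, priors[target_index], means[target_index], variances[target_index]
    )
    return target / total


# Hypothetical scene: two equally likely sources at 0 and 90 degrees.
priors = [0.5, 0.5]
means = [0.0, math.pi / 2.0]
variances = [0.1, 0.1]

p_front = target_posterior(0.0, 0, priors, means, variances)
p_between = target_posterior(math.pi / 4.0, 0, priors, means, variances)
```

At the target position the assumed posterior is close to one, and halfway between the two sources it is 0.5. Larger variances, as expected under reverberation, flatten this transition and thereby widen the region that is attributed to the target.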
