Blind Adaptive Principal Eigenvector Beamforming for Acoustical Source Separation

Ernst Warsitz, Reinhold Haeb-Umbach, Dang Hai Tran Vu
University of Paderborn, Dept. of Communications Engineering, 33098 Paderborn, Germany
{warsitz,haeb,tran}@nt.uni-paderborn.de

Abstract

For separating multiple speech signals given a convolutive mixture, the time-frequency sparseness of the speech sources can be exploited. In this paper we present a multi-channel source separation method based on the concept of approximate disjoint orthogonality of speech signals. Unlike binary masking of single-channel signals as applied, e.g., in the DUET algorithm, we use a likelihood mask to control the adaptation of blind principal eigenvector beamformers. Furthermore, orthogonal projection of the adapted beamformer filters leads to mutually orthogonal filter coefficients, thus enhancing the demixing performance. Experimental results in terms of the achievable signal-to-interference ratio (SIR) and a perceptual speech quality measure are given for the proposed method and are compared to the DUET algorithm.

Index Terms: BSS, DUET, PCA, Beamforming

1. Introduction

For hands-free multi-channel telecommunication or auditory scene analysis in a scenario of simultaneously active sources, it can be of interest to estimate not only one source signal, as is usually done in adaptive beamforming (ABF) [1], but all source signals from their mixtures at the sensors. ABF is a form of spatial filtering which enhances signals from the look direction and attenuates signals from other directions. Its performance depends heavily on the estimation of the direction of arrival (DoA), and the adaptation to an individual source is rather difficult if sources are simultaneously active. On the other hand, blind source separation (BSS) methods need no information about the array geometry and speaker directions, and demixing is possible even when the sources are simultaneously active. In the last decade, BSS research has focused on methods based on independent component analysis (ICA). These involve higher-order statistics and non-linear cost functions and are usually computationally quite expensive. To deal with convolutive sound mixtures, ICA can be employed in the frequency domain for each frequency bin separately [2], however at the expense of a permutation ambiguity. To overcome the problems of BSS and ABF, combined methods have been proposed [3, 4]: geometrically constrained BSS algorithms solve the frequency permutation problem of BSS and are less sensitive to DoA estimation errors than ABF. It should be noted that geometrically constrained separation is no longer blind.

From a physical point of view, the separation of two sources by frequency-domain BSS can be shown to be equivalent to null-beamforming by two frequency-domain ABFs. In both cases the algorithms attempt to reduce the interference signal by forming a spatial null in the interference direction [5].
The performance of BSS is limited by that of perfectly adapted beamformers [6].

Here we propose to separate multiple speech signals by exploiting the time-frequency sparseness (disjoint orthogonality) of the speech sources [7] to control the adaptation of blind principal eigenvector beamformers. The dominant eigenvector or principal component generates a beam towards the direction of maximum power, which can also be interpreted as matched filtering for the unknown mixing system [8, 9]. Adaptation is only carried out when the source of interest is dominant. We derive an adaptation control mechanism for the case of two sources by use of a so-called likelihood mask. In contrast to the hard decision of binary masking applied in the DUET algorithm [7, 10], we use a kind of soft decision to control the principal eigenvector adaptation.

In eigenvector beamforming the filter coefficients are optimized to form a spatial pattern with a maximum in the direction of the dominant source, however without spatial minima in the interferer directions. Therefore we additionally apply mutual orthogonal projection (MOP) of the eigenvector filters to enforce orthogonality and to implicitly place a null, or at least a minimum, in the interferer direction.

2. Principal Eigenvector Beamforming

We are given an array of M microphones. The DFTs of the microphone signals are denoted by X_i(k, m), i = 1, ..., M, where m and k denote the frame index and frequency bin, respectively. The output of a first filter-and-sum beamformer is

Y_1(k,m) = \sum_{i=1}^{M} F_{1,i}^*(k,m)\, X_i(k,m) = F_1^H(k,m)\, X(k,m),    (1)

where F_1(k, m) = (F_{1,1}(k, m), ..., F_{1,M}(k, m))^T is the filter coefficient vector and X(k, m) = (X_1(k, m), ..., X_M(k, m))^T is the vector of input signals. Let us assume for the moment that a single speech source signal U_1(k, m) is present, which is connected to the i-th microphone by the transfer function H_{1,i}(k). The vector of microphone signals is then given by

X(k,m) = H_1(k)\, U_1(k,m) + N(k,m),    (2)

where H_1(k) = (H_{1,1}(k), ..., H_{1,M}(k))^T, and N(k, m) is a vector of stationary noise terms which are assumed to be spatially white. Our goal is to determine a vector of filter coefficients F_1(k, m) such that the signal-to-noise ratio at the output Y_1(k, m) is maximized. The frequency-dependent SNR is maximized by the eigenvector v_1^{max}(k) corresponding to the largest eigenvalue of the cross power spectral density matrix of the microphone signals, which turns out to be

v_1^{max}(k) = \zeta_1(k) \cdot H_1(k),    (3)

where \zeta_1(k) \neq 0 is an arbitrary complex scalar. We can compute F_1(k, m) iteratively to be the desired eigenvector (3) by employing a two-step stochastic gradient ascent algorithm (see also [9]):

\tilde{F}_1(k,m+1) = F_1(k,m) + \mu_1(k,m)\, Y_1^*(k,m) \left[ X(k,m) - F_1(k,m)\, Y_1(k,m) \right]    (4)

F_1(k,m+1) = \tilde{F}_1(k,m+1)\, \frac{1 + \tilde{F}_1^H(k,m+1)\, \tilde{F}_1(k,m+1)}{2\, \tilde{F}_1^H(k,m+1)\, \tilde{F}_1(k,m+1)}.

Due to the block frequency nature of the algorithm, a frequency-dependent step size \mu_1(k, m), which is inversely proportional to the power levels in the DFT frequency bins, can be used. Since (4) performs a principal component analysis (PCA), the resulting multi-channel filtering can be called PCA beamforming.
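As a concrete illustration of the two-step update (4), the following Python sketch performs one iteration for a single frequency bin. It is our own minimal rendering, not code from the paper, and the function and variable names are ours.

```python
import numpy as np

def pca_bf_update(F, X, mu):
    """One two-step stochastic gradient ascent update of the principal
    eigenvector beamformer, Eq. (4), for a single frequency bin.
    F: (M,) complex filter coefficients; X: (M,) microphone DFT values."""
    Y = np.vdot(F, X)                        # beamformer output, Eq. (1): F^H X
    F_t = F + mu * np.conj(Y) * (X - F * Y)  # gradient ascent step of Eq. (4)
    n = np.real(np.vdot(F_t, F_t))           # squared norm of the updated filter
    return F_t * (1.0 + n) / (2.0 * n)       # normalization step, drives ||F|| to 1
```

In a complete system this update runs independently for every frequency bin k and frame m.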


3. Time-Frequency Masking

Now we extend the problem to a multi-speaker scenario with P sources U(k, m) = (U_1(k, m), ..., U_P(k, m))^T, M ≥ P microphones, and P beamformers. The relation between the source signals and the microphone signals is given by

X(k,m) = \sum_{l=1}^{P} H_l(k)\, U_l(k,m) + N(k,m).    (5)

Denoting the mixing system as

H(k) = [H_1(k), H_2(k), ..., H_P(k)],    (6)

(5) can be written as

X(k,m) = H(k)\, U(k,m) + N(k,m).    (7)

The concept of disjoint orthogonality for estimating the source signals U_l(k, m) given the mixtures X(k, m) is based on the time-frequency sparseness of the speech sources [7]. The underlying assumption is that

U_l(k,m)\, U_\lambda(k,m) = 0, \quad \forall k, m;\ l \neq \lambda,    (8)

i.e., at no time-frequency bin are two sources simultaneously active.

3.1. Binary Masking (BM)

Based on assumption (8), a binary mask for approximate disjoint orthogonality is introduced:

\alpha_l^{(BM)}(k,m) = \begin{cases} 1, & \text{if } |U_l(k,m)| > \beta\, |U_\lambda(k,m)| \\ 0, & \text{else,} \end{cases}    (9)

where β is a heuristic parameter. The demixed output is then obtained as

\hat{U}_l(k,m) = \alpha_l^{(BM)}(k,m)\, X_1(k,m).    (10)

In the DUET algorithm the mask is estimated by exploiting phase and amplitude information of two sensors [7]. Unfortunately, reliable estimates of α_l^(BM)(k, m) are hard to obtain in a reverberant environment. Hence, the demixed speech sources will suffer from serious signal distortion.

3.2. Likelihood-based Masking (LM)

As opposed to (9), we propose to use a soft-valued mask based on a kind of source activity probability:

\alpha_l^{(LM)}(k,m) \approx p\left( |U_l(k,m)| \gg |U_\lambda(k,m)| \mid X \right), \quad \forall \lambda;\ l \neq \lambda.    (11)

This so-called likelihood mask can assume any value between zero and one and can be used to control the adaptation of the corresponding l-th beamformer by replacing the step size μ_l(k, m) of the gradient update rule by

\tilde{\mu}_l(k,m) = \mu_l(k,m)\, \alpha_l^{(LM)}(k,m).    (12)

Note that the adaptation stops if sources other than the target source are too strong (α_l^(LM)(k, m) → 0) or if the beamformer has adapted towards its dominant source (X(k, m) − F_l(k, m)Y_l(k, m) → 0). As the optimal beamformer is a matched filter, see Eq. (3), the mixing system (6) is thus estimated column-wise (by adapting to the currently dominant source) in every time-frequency bin.
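The following sketch contrasts the two masking variants. It is our own illustration of Eqs. (9), (10) and (12), with `pca_bf_update` referring to the sketch in Section 2; all names are ours.

```python
import numpy as np

def oracle_binary_mask(U_l, U_other, beta=1.0):
    """Hard mask of Eq. (9) for one time-frequency bin. Computable only
    when the source spectra are known, as in the perfect-mask experiment
    of Sec. 5; beta is the heuristic threshold."""
    return 1.0 if np.abs(U_l) > beta * np.abs(U_other) else 0.0

# Eq. (10): hard demixing keeps or discards the first microphone signal,
#   U_hat_l = alpha_BM * X_1.
# Eq. (12): the soft likelihood mask alpha_LM in [0, 1] instead throttles
# the beamformer adaptation by scaling the step size,
#   mu_eff = mu * alpha_LM
#   F_l = pca_bf_update(F_l, X, mu_eff)
```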
3.3. Symmetric Adaptive Decorrelation (SAD)

For the case of two sources, P = 2 and l ∈ {1, 2}, a simple but effective likelihood estimator α_l^(LM)(k, m) can be obtained as follows. Motivated by [11], symmetric adaptive cross filters D_{i,j}(k, m) between two neighboring microphones i and j, where i = 1, ..., M − 1 and j = i + 1, are used to decorrelate the two neighboring microphone signals:

Z_{j,i}(k,m) = X_i(k,m) - D_{j,i}(k,m)\, X_j(k,m)    (13)

Z_{i,j}(k,m) = X_j(k,m) - D_{i,j}(k,m)\, X_i(k,m).    (14)

The adaptation of (13) and (14) can be performed by normalized least mean square (NLMS) algorithms,

D_{j,i}(k,m+1) = D_{j,i}(k,m) + \mu_{j,i}(k,m)\, Z_{j,i}(k,m)\, Z_{i,j}^*(k,m),    (15)

from which it can be seen that adaptation stops once Z_{j,i} and Z_{i,j} are decorrelated. The ratio

\alpha_{i,j}(k,m) = \frac{|Z_{i,j}(k,m)|^2}{|Z_{i,j}(k,m)|^2 + |Z_{j,i}(k,m)|^2}    (16)

serves as an estimate of the dominance of one speech source over the other, based on the microphone pair i and j = i + 1. These estimates are combined to form the "likelihood" mask

\alpha_l^{(LM)}(k,m) = \prod_{i=1}^{M-1} \alpha_{i,i+1}(k,m),    (17)

where l ∈ {1, 2} is the one of the two sources which is spatially closer to the i-th than to the (i + 1)-st microphone. The more microphone pairs are used, the more reliable the likelihood estimates become. In the limit of infinitely many microphones, (17) becomes a binary mask.
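A per-bin sketch of the SAD estimator (13)-(16) follows. Single-tap per-bin cross filters and the exact NLMS power normalization are our assumptions; the paper does not spell them out.

```python
import numpy as np

def sad_step(D_ji, D_ij, X_i, X_j, mu=0.5, eps=1e-12):
    """One symmetric adaptive decorrelation step for one microphone pair
    and one frequency bin, Eqs. (13)-(16)."""
    Z_ji = X_i - D_ji * X_j                    # Eq. (13)
    Z_ij = X_j - D_ij * X_i                    # Eq. (14)
    # NLMS-type updates, Eq. (15); normalization by input power is assumed
    D_ji = D_ji + mu / (np.abs(X_j)**2 + eps) * Z_ji * np.conj(Z_ij)
    D_ij = D_ij + mu / (np.abs(X_i)**2 + eps) * Z_ij * np.conj(Z_ji)
    # dominance estimate, Eq. (16)
    alpha = np.abs(Z_ij)**2 / (np.abs(Z_ij)**2 + np.abs(Z_ji)**2 + eps)
    return D_ji, D_ij, alpha

# Eq. (17): multiply alpha over the M-1 neighboring pairs to obtain the
# likelihood mask of one source; for the two-source case we read the
# complementary product of (1 - alpha) terms as the mask of the other source.
```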


4. PCA-Beamforming Post-Processing

While the optimal filter coefficients F_l(k, m) form a spatial pattern with the main lobe in the direction of the l-th source, no spatial selectivity is enforced regarding the interferers. Thus the l-th output Y_l(k, m) = F_l^H(k, m)X(k, m) has good speech quality but also contains strong interfering signals from other sources, especially at low frequencies. To gain in SIR, we apply mutual orthogonal projection of the eigenvector filters,

\tilde{W}_l(k,m) = \left( \prod_{\lambda;\, \lambda \neq l} \left[ I - F_\lambda(k,m)\, F_\lambda^H(k,m) \right] \right) F_l(k,m),    (18)

with a subsequent normalization step

W_l(k,m) = \frac{\tilde{W}_l(k,m)}{\| \tilde{W}_l(k,m) \|}.    (19)

The beamformer outputs Y_MOP = (Y_{1,MOP}, ..., Y_{P,MOP})^T are then given by

Y_{\mathrm{MOP}}(k,m) = W^H(k,m)\, X(k,m),    (20)

where W(k, m) = [W_1(k, m), ..., W_P(k, m)]. Every output filter is thus kept orthogonal to the subspace spanned by the remaining beamformer filters, which is ensured by the mutual orthogonal projection (hence the subscript MOP).

Beamforming approaches in general suffer from cross-talk in a multi-speaker scenario, even though spatial minima are placed in the directions of the interferers. This is due to multipath signal propagation in a reverberant environment, where reflected interference components are captured by the main lobe. As the interfering components can be interpreted as noise, the idea is to use a post-filter similar to the Wiener post-filter (WP) in beamforming [1] or speech enhancement. As an estimate of the interferers' power present in the l-th beamformer output, we use the squared outputs of the other beamformers. The post-filtering completes the proposed demixing process, and the final source estimates are

\hat{U}_l(k,m) = \frac{|Y_{l,\mathrm{MOP}}(k,m)|^2}{Y_{\mathrm{MOP}}^H(k,m)\, Y_{\mathrm{MOP}}(k,m)}\, Y_{l,\mathrm{MOP}}(k,m).    (21)

Fig. 1 shows the overall structure of the proposed demixing system, including the corresponding equation numbers.

[Figure 1: Blind PCA-based source separation system. Signal flow: microphone signals X → SAD, Eqs. (13)-(17), yielding the masks α_1^(LM), α_2^(LM) → PCA beamformers F_1, F_2, Eqs. (4), (12) → MOP, Eqs. (18)-(20) → Wiener post-filter, Eq. (21), yielding Û_1, Û_2.]
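Eqs. (18)-(21) can be condensed into a short per-bin routine. The following is a sketch under our own naming, assuming the filter columns are (approximately) unit-norm as enforced by (4) and (19).

```python
import numpy as np

def mop_postfilter(F, X, eps=1e-12):
    """Mutual orthogonal projection, normalization, and Wiener-like
    post-filter for one frequency bin, Eqs. (18)-(21).
    F: (M, P) matrix whose columns are the P beamformer filters;
    X: (M,) microphone DFT coefficients."""
    M, P = F.shape
    W = np.empty_like(F)
    for l in range(P):
        W_l = F[:, l].copy()
        for lam in range(P):
            if lam == l:
                continue
            F_lam = F[:, lam]
            # apply the projector [I - F_lam F_lam^H], Eq. (18)
            W_l = W_l - F_lam * np.vdot(F_lam, W_l)
        W[:, l] = W_l / (np.linalg.norm(W_l) + eps)  # Eq. (19)
    Y_mop = W.conj().T @ X                           # Eq. (20)
    total = np.real(np.vdot(Y_mop, Y_mop)) + eps     # sum of |Y_l,MOP|^2
    return (np.abs(Y_mop)**2 / total) * Y_mop        # post-filter, Eq. (21)
```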
5. Simulation Results

In this section we experimentally evaluate the proposed blind eigenbeamformer source separation method for the case of two simultaneously active sources in a simulated reverberant enclosure of size 6 m x 5 m x 3 m. Ten utterances from different speakers (5 male, 5 female), sampled at 16 kHz, were used as source speech signals. Taking 2 out of the 10 speakers gives 45 mixing setups for each of 9 analyzed reverberation times T_60 between 0 s and 0.5 s, totalling 405 audio mixtures. One speech source was placed at 45° and the other at −30° relative to broadside, each at a distance of 2 m. The power ratio of the two sources was about 0 dB. To every microphone speech signal, white noise was added at a signal-to-noise ratio of about 25 dB.

For the PCA-based source separation method, a linear array of M = 8 equally spaced microphones (4 cm spacing) with 2 beamformers and 7 SADs was used. The FIR filters had a length of 256 taps each, and the DFT length was set to 512 taps. All simulation results are given for steady-state filter coefficients after convergence. For the online DUET algorithm [10], two microphones with a distance of 2.4 cm were employed.

Since we mix the signals artificially, we know each of the reverberated microphone speech components and can compute a perfect binary mask. The optimum achievable performance obtained this way is analyzed in terms of the signal-to-interference power ratio (SIR) and the perceptual similarity measure (PSM) [12]. PSM has been shown to give objective perceptual quality results comparable to the well-known PESQ measure [13]. The PSM(*) and SIR values are shown in the upper and lower subplots of Fig. 2, respectively. The reference signal for PSM was the target speech signal at one microphone (without noise and interference), and the test signal to be evaluated was the processed target speech signal according to (9). Since we want to separate two sources, two estimates Û_1(k, m) and Û_2(k, m) are available. For each utterance, the better and the worse of the two (as measured by PSM) were collected in separate sets. In all figures, "PSM-high" and "PSM-low" denote the averaged PSM of the outputs with the better and the lower PSM scores, respectively, and "PSM-mean" is the mean of the two. All results are averages over 45 utterances per T_60. The SIR scores were also evaluated separately for each of the two sets.

(*) PSM results are generated with level alignment and lowpass filtering, and without assimilation.

[Figure 2: Perfect binary masking. Upper plot: outputs with high and low PSM values collected separately, and the overall PSM result (PSM-mean), over T_60. Lower plot: SIR separately for the outputs with high and low PSM, and the overall SIR (SIR PSM-mean).]
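The paper reports SIR as a power ratio but does not spell out its computation. A generic sketch, assuming the target and interference components of each output can be obtained separately (possible here since the mixing is artificial), is:

```python
import numpy as np

def sir_db(target, interference, eps=1e-12):
    """Signal-to-interference power ratio in dB, the kind of measure
    reported in Figs. 2-4; assumes the separately filtered target and
    interference components of an output are available."""
    p_t = np.sum(np.abs(target)**2)
    p_i = np.sum(np.abs(interference)**2) + eps
    return 10.0 * np.log10(p_t / p_i)
```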


Remarkably, the perfect masking process works well for all simulated reverberation times and for both outputs. This shows that disjoint orthogonality (8) is indeed a promising approach. The crucial issue, however, is how to estimate the mask in a reverberant scenario.

Next, the performance degradation is examined when the binary mask is estimated online by the DUET algorithm [10]; see Fig. 3. With increasing T_60 the speech quality differs dramatically between the two outputs (PSM-high, PSM-low). This is due to the unequal demixing performance: the binary decision (9) is estimated to be 1 much more often for one output than for the other. Consequently, the target speech quality of the output through which the signal is passed more often is very good, however at the cost of a reduced SIR compared to the other output. For that other output the speech quality is bad, because it was "switched off" very often; at the same time, however, its SIR is higher.

[Figure 3: Binary masking with DUET. Upper plot: outputs with high and low PSM values collected separately, and the overall PSM result (PSM-mean), over T_60. Lower plot: SIR separately for the outputs with high and low PSM, and the overall SIR (SIR PSM-mean).]

To overcome this unbalanced performance between the outputs of the DUET algorithm, the PCA-based source separation method is proposed; see Fig. 4. It can be seen that the speech quality is high for both outputs, and the SIRs of the two are also similar.

[Figure 4: Blind PCA-based separation method with likelihood masking. Upper plot: outputs with high and low PSM values collected separately, and the overall PSM result (PSM-mean), over T_60. Lower plot: SIR separately for the outputs with high and low PSM, and the overall SIR (SIR PSM-mean).]

To give an impression of the adaptation speed, Fig. 5 plots the SIR of the proposed method over time. We chose the case of no reverberation (T_60 = 0 s), as it has the highest dynamic range of the SIR. The initial filter coefficients of the SAD were set to zero, and the PCA coefficients were chosen randomly. Fig. 5 shows the fast adaptation of the proposed separation method.

[Figure 5: Adaptation speed of the principal eigenvector separation (SIR over time), averaged over all runs for T_60 = 0 s.]

6. Conclusions

A blind source separation method based on principal eigenvector beamforming, exploiting the time-frequency sparseness of speech signals, has been presented. The high speech quality obtained by adaptive beamforming is combined with blind source classification. Important properties of the proposed method are as follows. First, the beamformer makes no a priori assumptions about the acoustic enclosure, and neither the source directions nor the array geometry need to be known. Second, the separation is carried out in two steps: a frequency-bin-wise likelihood estimation of the dominant source, followed by principal eigenvector beamforming. This results in robust, fast adaptation behavior and a flexible structure. Third, the interference rejection is increased by mutually orthogonal projection of the filter coefficients and by post-filtering.

7. Acknowledgement

This work was in part supported by Deutsche Forschungsgemeinschaft under contract number HA 3455/4-1.

8. References

[1] H. L. Van Trees, Optimum Array Processing, John Wiley, New York, 2001.
[2] H. Sawada, R. Mukai, S. Araki, and S. Makino, "Frequency-Domain Blind Source Separation", in Speech Enhancement, Springer, 2005.
[3] L. Parra and C. V. Alvino, "Geometric Source Separation: Merging Convolutive Source Separation with Geometric Beamforming", IEEE Trans. on Speech and Audio Processing, vol. 10, no. 6, Sep. 2002.
[4] M. Knaak, S. Araki, and S. Makino, "Geometrically Constrained Independent Component Analysis", IEEE Trans. Audio, Speech, Lang. Proc., vol. 15, no. 2, pp. 715-726, Feb. 2007.
[5] S. Araki, S. Makino, Y. Hinamoto, R. Mukai, T. Nishikawa, and H. Saruwatari, "Equivalence between Frequency-Domain Blind Source Separation and Frequency-Domain Adaptive Beamforming for Convolutive Mixtures", EURASIP Journal on Applied Signal Processing, vol. 2003, no. 11, pp. 1157-1166, 2003.
[6] S. Makino, "Blind Source Separation of Convolutive Mixtures of Speech", in Adaptive Signal Processing, Springer, 2003.
[7] O. Yilmaz and S. Rickard, "Blind Separation of Speech Mixtures via Time-Frequency Masking", IEEE Trans. on Signal Processing, vol. 52, pp. 1830-1847, July 2004.
[8] S. Affes and Y. Grenier, "A Signal Subspace Tracking Algorithm for Microphone Array Processing of Speech", IEEE Trans. Speech and Audio Processing, vol. 5, pp. 425-437, Sep. 1997.
[9] E. Warsitz and R. Haeb-Umbach, "Acoustic Filter-and-Sum Beamforming by Adaptive Principal Component Analysis", in Proc. IEEE ICASSP, Philadelphia, Mar. 2005.
[10] S. Rickard, R. Balan, and J. Rosca, "Real-Time Time-Frequency Based Blind Source Separation", in Proc. ICA 2001, San Diego, Dec. 2001.
[11] B. Schölling, M. Heckmann, F. Joublin, and C. Goerick, "From Source Localization to Blind Source Separation: An Intuitive Route to Blind Source Separation", in Proc. IWAENC, Paris, Oct. 2006.
[12] R. Huber, "Objective Assessment of Audio Quality Using an Auditory Processing Model", Ph.D. thesis, University of Oldenburg, Oldenburg, 2003.
[13] T. Rohdenburg, V. Hohmann, and B. Kollmeier, "Objective Perceptual Quality Measures for the Evaluation of Noise Reduction Schemes", in Proc. IWAENC, Eindhoven, Netherlands, Sep. 2005.
