Blind Adaptive Principal Eigenvector Beamforming for Acoustical Source Separation

Ernst Warsitz, Reinhold Haeb-Umbach, Dang Hai Tran Vu
University of Paderborn, Dept. of Communications Engineering, 33098 Paderborn, Germany
{warsitz,haeb,tran}@nt.uni-paderborn.de

Abstract

For separating multiple speech signals given a convolutive mixture, the time-frequency sparseness of the speech sources can be exploited. In this paper we present a multi-channel source separation method based on the concept of approximate disjoint orthogonality of speech signals. Unlike binary masking of single-channel signals as applied, e.g., in the DUET algorithm, we use a likelihood mask to control the adaptation of blind principal eigenvector beamformers. Furthermore, orthogonal projection of the adapted beamformer filters leads to mutually orthogonal filter coefficients, thus enhancing the demixing performance. Experimental results in terms of the achievable signal-to-interference ratio (SIR) and a perceptual speech quality measure are given for the proposed method and are compared to the DUET algorithm.

Index Terms: BSS, DUET, PCA, Beamforming

1. Introduction

For hands-free multi-channel telecommunication or auditory scene analysis in a scenario of simultaneously active sources, it can be of interest to estimate not only one source signal, as is usually done in adaptive beamforming (ABF) [1], but all source signals from their mixtures at the sensors. ABF is a form of spatial filtering which enhances signals from the look direction and attenuates signals from other directions. Its performance depends heavily on the estimation of the direction of arrival (DoA), and the adaptation to an individual source is rather difficult if sources are simultaneously active. On the other hand, blind source separation (BSS) methods need no information about the array geometry and speaker directions, and demixing is possible even when the sources are simultaneously active. In the last decade, BSS research has focused on methods based on independent component analysis (ICA). These involve higher-order statistics and non-linear cost functions and are usually computationally quite expensive. To deal with convolutive sound mixtures, ICA can be employed in the frequency domain for each frequency bin separately [2], however at the expense of a permutation ambiguity. To overcome the problems of BSS and ABF, combined methods have been proposed [3, 4]: geometrically constrained BSS algorithms solve the frequency permutation problem of BSS and are less sensitive to DoA estimation errors than ABF. It should be noted that geometrically constrained separation is no longer blind.

From a physical point of view, the separation of two sources by frequency-domain BSS can be shown to be equivalent to null-beamforming by two frequency-domain ABFs. In both cases the algorithms attempt to reduce the interference signal by forming a spatial null in the interference direction [5].
The performance of BSS is limited by that of perfectly adapted beamformers [6].

Here we propose to separate multiple speech signals by exploiting the time-frequency sparseness (disjoint orthogonality) of the speech sources [7] to control the adaptation of blind principal eigenvector beamformers. The dominant eigenvector or principal component generates a beam towards the direction of maximum power, which can also be interpreted as matched filtering for the unknown mixing system [8, 9]. Adaptation is only carried out when the source of interest is dominant. We derive an adaptation control mechanism for the case of two sources by use of a so-called likelihood mask. In contrast to the hard decision of binary masking applied in the DUET algorithm [7, 10], we use a kind of soft decision to control the principal eigenvector adaptation.

In eigenvector beamforming the filter coefficients are optimized to form a spatial pattern with a maximum in the direction of the dominant source, however without spatial minima in the interferer directions. Therefore we additionally apply mutual orthogonal projection (MOP) of the eigenvector filters to enforce orthogonality and to implicitly place a null, or at least a minimum, in the interferer direction.

2. Principal Eigenvector Beamforming

We are given an array of M microphones. The DFTs of the microphone signals are denoted by X_i(k, m), i = 1, ..., M, where m and k denote the frame index and frequency bin, respectively. The output of a first filter-and-sum beamformer is

Y_1(k,m) = \sum_{i=1}^{M} F_{1,i}^*(k,m)\, X_i(k,m) = F_1^H(k,m)\, X(k,m),    (1)

where F_1(k, m) = (F_{1,1}(k, m), ..., F_{1,M}(k, m))^T is the filter coefficient vector and X(k, m) = (X_1(k, m), ..., X_M(k, m))^T is the vector of input signals. Let us assume for the moment that a single speech source signal U_1(k, m) is present, which is connected to the i-th microphone by the transfer function H_{1,i}(k). The vector of microphone signals is then given by

X(k,m) = H_1(k)\, U_1(k,m) + N(k,m),    (2)

where H_1(k) = (H_{1,1}(k), ..., H_{1,M}(k))^T, and N(k, m) is a vector of stationary noise terms which are assumed to be spatially white. Our goal is to determine a vector of filter coefficients F_1(k, m) such that the signal-to-noise ratio at the output Y_1(k, m) is maximized. The frequency-dependent SNR is maximized by the eigenvector v_1^{max}(k) corresponding to the largest eigenvalue of the cross power spectral density matrix of the microphone signals, which turns out to be

v_1^{max}(k) = \zeta_1(k) \cdot H_1(k),    (3)

where \zeta_1(k) \neq 0 is an arbitrary complex scalar. We can compute F_1(k, m) iteratively to be the desired eigenvector (3) by employing a two-step stochastic gradient ascent algorithm (see also [9]):

\tilde{F}_1(k,m+1) = F_1(k,m) + \mu_1(k,m)\, Y_1^*(k,m) \left[ X(k,m) - F_1(k,m)\, Y_1(k,m) \right]    (4)

F_1(k,m+1) = \tilde{F}_1(k,m+1)\, \frac{1 + \tilde{F}_1^H(k,m+1)\, \tilde{F}_1(k,m+1)}{2\, \tilde{F}_1^H(k,m+1)\, \tilde{F}_1(k,m+1)}.

Due to the block frequency nature of the algorithm, a frequency-dependent step size \mu_1(k, m), which is inversely proportional to the power levels in the DFT frequency bins, can be used. Since (4) performs a principal component analysis (PCA), the resulting multi-channel filtering can be called PCA beamforming.
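As a concrete illustration of the two-step update (4), the following Python sketch performs one iteration for a single frequency bin. It is our own minimal rendering, not code from the paper, and the function and variable names are ours.

```python
import numpy as np

def pca_bf_update(F, X, mu):
    """One two-step stochastic gradient ascent update of the principal
    eigenvector beamformer, Eq. (4), for a single frequency bin.
    F: (M,) complex filter coefficients; X: (M,) microphone DFT values."""
    Y = np.vdot(F, X)                        # beamformer output, Eq. (1): F^H X
    F_t = F + mu * np.conj(Y) * (X - F * Y)  # gradient ascent step of Eq. (4)
    n = np.real(np.vdot(F_t, F_t))           # squared norm of the updated filter
    return F_t * (1.0 + n) / (2.0 * n)       # normalization step, drives ||F|| to 1
```

In a complete system this update runs independently for every frequency bin k and frame m.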


3. Time-Frequency Masking

Now we extend the problem to a multi-speaker scenario with P sources U(k, m) = (U_1(k, m), ..., U_P(k, m))^T, M ≥ P microphones, and P beamformers. The relation between the source signals and the microphone signals is given by

X(k,m) = \sum_{l=1}^{P} H_l(k)\, U_l(k,m) + N(k,m).    (5)

Denoting the mixing system as

H(k) = [H_1(k), H_2(k), ..., H_P(k)],    (6)

(5) can be written as

X(k,m) = H(k)\, U(k,m) + N(k,m).    (7)

The concept of disjoint orthogonality for estimating the source signals U_l(k, m) given the mixtures X(k, m) is based on the time-frequency sparseness of the speech sources [7]. The underlying assumption is that

U_l(k,m)\, U_\lambda(k,m) = 0, \quad \forall k, m;\ l \neq \lambda,    (8)

i.e., at no time-frequency bin are two sources simultaneously active.

3.1. Binary Masking (BM)

Based on assumption (8), a binary mask for approximate disjoint orthogonality is introduced:

\alpha_l^{(BM)}(k,m) = \begin{cases} 1, & \text{if } |U_l(k,m)| > \beta\, |U_\lambda(k,m)| \\ 0, & \text{else,} \end{cases}    (9)

where β is a heuristic parameter. The demixed output is then obtained as

\hat{U}_l(k,m) = \alpha_l^{(BM)}(k,m)\, X_1(k,m).    (10)

In the DUET algorithm the mask is estimated by exploiting phase and amplitude information of two sensors [7]. Unfortunately, reliable estimates of α_l^(BM)(k, m) are hard to obtain in a reverberant environment. Hence, the demixed speech sources will suffer from serious signal distortion.

3.2. Likelihood-based Masking (LM)

As opposed to (9), we propose to use a soft-valued mask based on a kind of source activity probability:

\alpha_l^{(LM)}(k,m) \approx p\left( |U_l(k,m)| \gg |U_\lambda(k,m)| \mid X \right), \quad \forall \lambda;\ l \neq \lambda.    (11)

This so-called likelihood mask can assume any value between zero and one and can be used to control the adaptation of the corresponding l-th beamformer by replacing the step size μ_l(k, m) of the gradient update rule by

\tilde{\mu}_l(k,m) = \mu_l(k,m)\, \alpha_l^{(LM)}(k,m).    (12)

Note that the adaptation stops if sources other than the target source are too strong (α_l^(LM)(k, m) → 0) or if the beamformer has adapted towards its dominant source (X(k, m) − F_l(k, m)Y_l(k, m) → 0). As the optimal beamformer is a matched filter, see Eq. (3), the mixing system (6) is thus estimated column-wise (by adapting to the currently dominant source) in every time-frequency bin.
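The following sketch contrasts the two masking variants. It is our own illustration of Eqs. (9), (10) and (12), with `pca_bf_update` referring to the sketch in Section 2; all names are ours.

```python
import numpy as np

def oracle_binary_mask(U_l, U_other, beta=1.0):
    """Hard mask of Eq. (9) for one time-frequency bin. Computable only
    when the source spectra are known, as in the perfect-mask experiment
    of Sec. 5; beta is the heuristic threshold."""
    return 1.0 if np.abs(U_l) > beta * np.abs(U_other) else 0.0

# Eq. (10): hard demixing keeps or discards the first microphone signal,
#   U_hat_l = alpha_BM * X_1.
# Eq. (12): the soft likelihood mask alpha_LM in [0, 1] instead throttles
# the beamformer adaptation by scaling the step size,
#   mu_eff = mu * alpha_LM
#   F_l = pca_bf_update(F_l, X, mu_eff)
```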
3.3. Symmetric Adaptive Decorrelation (SAD)

For the case of two sources, P = 2 and l ∈ {1, 2}, a simple but effective likelihood estimator α_l^(LM)(k, m) can be obtained as follows. Motivated by [11], symmetric adaptive cross filters D_{i,j}(k, m) between two neighboring microphones i and j, where i = 1, ..., M − 1 and j = i + 1, are used to decorrelate the two neighboring microphone signals:

Z_{j,i}(k,m) = X_i(k,m) - D_{j,i}(k,m)\, X_j(k,m)    (13)

Z_{i,j}(k,m) = X_j(k,m) - D_{i,j}(k,m)\, X_i(k,m).    (14)

The adaptation of (13) and (14) can be performed by normalized least mean square (NLMS) algorithms,

D_{j,i}(k,m+1) = D_{j,i}(k,m) + \mu_{j,i}(k,m)\, Z_{j,i}(k,m)\, Z_{i,j}^*(k,m),    (15)

from which it can be seen that adaptation stops once Z_{j,i} and Z_{i,j} are decorrelated. The ratio

\alpha_{i,j}(k,m) = \frac{|Z_{i,j}(k,m)|^2}{|Z_{i,j}(k,m)|^2 + |Z_{j,i}(k,m)|^2}    (16)

serves as an estimate of the dominance of one speech source over the other, based on the microphone pair i and j = i + 1. These estimates are combined to form the "likelihood" mask

\alpha_l^{(LM)}(k,m) = \prod_{i=1}^{M-1} \alpha_{i,i+1}(k,m),    (17)

where l ∈ {1, 2} is the one of the two sources which is spatially closer to the i-th than to the (i + 1)-st microphone. The more microphone pairs are used, the more reliable the likelihood estimates become. In the limit of infinitely many microphones, (17) becomes a binary mask.
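A per-bin sketch of the SAD estimator (13)-(16) follows. Single-tap per-bin cross filters and the exact NLMS power normalization are our assumptions; the paper does not spell them out.

```python
import numpy as np

def sad_step(D_ji, D_ij, X_i, X_j, mu=0.5, eps=1e-12):
    """One symmetric adaptive decorrelation step for one microphone pair
    and one frequency bin, Eqs. (13)-(16)."""
    Z_ji = X_i - D_ji * X_j                    # Eq. (13)
    Z_ij = X_j - D_ij * X_i                    # Eq. (14)
    # NLMS-type updates, Eq. (15); normalization by input power is assumed
    D_ji = D_ji + mu / (np.abs(X_j)**2 + eps) * Z_ji * np.conj(Z_ij)
    D_ij = D_ij + mu / (np.abs(X_i)**2 + eps) * Z_ij * np.conj(Z_ji)
    # dominance estimate, Eq. (16)
    alpha = np.abs(Z_ij)**2 / (np.abs(Z_ij)**2 + np.abs(Z_ji)**2 + eps)
    return D_ji, D_ij, alpha

# Eq. (17): multiply alpha over the M-1 neighboring pairs to obtain the
# likelihood mask of one source; for the two-source case we read the
# complementary product of (1 - alpha) terms as the mask of the other source.
```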


4. PCA-Beamforming Post-Processing

While the optimal filter coefficients F_l(k, m) form a spatial pattern with the main lobe in the direction of the l-th source, no spatial selectivity is enforced regarding the interferers. Thus the l-th output Y_l(k, m) = F_l^H(k, m)X(k, m) has good speech quality but also contains strong interfering signals from other sources, especially at low frequencies. To gain in SIR, we apply mutual orthogonal projection of the eigenvector filters,

\tilde{W}_l(k,m) = \left( \prod_{\lambda;\, \lambda \neq l} \left[ I - F_\lambda(k,m)\, F_\lambda^H(k,m) \right] \right) F_l(k,m),    (18)

with a subsequent normalization step

W_l(k,m) = \frac{\tilde{W}_l(k,m)}{\| \tilde{W}_l(k,m) \|}.    (19)

The beamformer outputs Y_MOP = (Y_{1,MOP}, ..., Y_{P,MOP})^T are then given by

Y_{\mathrm{MOP}}(k,m) = W^H(k,m)\, X(k,m),    (20)

where W(k, m) = [W_1(k, m), ..., W_P(k, m)]. Every output filter is thus kept orthogonal to the subspace spanned by the remaining beamformer filters, which is ensured by the mutual orthogonal projection (hence the subscript MOP).

Beamforming approaches in general suffer from cross-talk in a multi-speaker scenario, even though spatial minima are placed in the directions of the interferers. This is due to multipath signal propagation in a reverberant environment, where reflected interference components are captured by the main lobe. As the interfering components can be interpreted as noise, the idea is to use a post-filter similar to the Wiener post-filter (WP) in beamforming [1] or speech enhancement. As an estimate of the interferers' power present in the l-th beamformer output, we use the squared outputs of the other beamformers. The post-filtering completes the proposed demixing process, and the final source estimates are

\hat{U}_l(k,m) = \frac{|Y_{l,\mathrm{MOP}}(k,m)|^2}{Y_{\mathrm{MOP}}^H(k,m)\, Y_{\mathrm{MOP}}(k,m)}\, Y_{l,\mathrm{MOP}}(k,m).    (21)

Fig. 1 shows the overall structure of the proposed demixing system, including the corresponding equation numbers.

[Figure 1: Blind PCA-based source separation system. Signal flow: microphone signals X → SAD, Eqs. (13)-(17), yielding the masks α_1^(LM), α_2^(LM) → PCA beamformers F_1, F_2, Eqs. (4), (12) → MOP, Eqs. (18)-(20) → Wiener post-filter, Eq. (21), yielding Û_1, Û_2.]
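Eqs. (18)-(21) can be condensed into a short per-bin routine. The following is a sketch under our own naming, assuming the filter columns are (approximately) unit-norm as enforced by (4) and (19).

```python
import numpy as np

def mop_postfilter(F, X, eps=1e-12):
    """Mutual orthogonal projection, normalization, and Wiener-like
    post-filter for one frequency bin, Eqs. (18)-(21).
    F: (M, P) matrix whose columns are the P beamformer filters;
    X: (M,) microphone DFT coefficients."""
    M, P = F.shape
    W = np.empty_like(F)
    for l in range(P):
        W_l = F[:, l].copy()
        for lam in range(P):
            if lam == l:
                continue
            F_lam = F[:, lam]
            # apply the projector [I - F_lam F_lam^H], Eq. (18)
            W_l = W_l - F_lam * np.vdot(F_lam, W_l)
        W[:, l] = W_l / (np.linalg.norm(W_l) + eps)  # Eq. (19)
    Y_mop = W.conj().T @ X                           # Eq. (20)
    total = np.real(np.vdot(Y_mop, Y_mop)) + eps     # sum of |Y_l,MOP|^2
    return (np.abs(Y_mop)**2 / total) * Y_mop        # post-filter, Eq. (21)
```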
5. Simulation Results

In this section we experimentally evaluate the proposed blind eigenbeamformer source separation method for the case of two simultaneously active sources in a simulated reverberant enclosure of size 6 m x 5 m x 3 m. Ten utterances from different speakers (5 male, 5 female), sampled at 16 kHz, were used as source speech signals. Taking 2 out of the 10 speakers gives 45 mixing setups for each of 9 analyzed reverberation times T_60 between 0 s and 0.5 s, totalling 405 audio mixtures. One speech source was placed at 45° and the other at −30° relative to broadside, each at a distance of 2 m. The power ratio of the two sources was about 0 dB. To every microphone speech signal, white noise was added at a signal-to-noise ratio of about 25 dB.

For the PCA-based source separation method, a linear array of M = 8 equally spaced microphones (4 cm spacing) with 2 beamformers and 7 SADs was used. The FIR filters had a length of 256 taps each, and the DFT length was set to 512 taps. All simulation results are given for steady-state filter coefficients after convergence. For the online DUET algorithm [10], two microphones with a distance of 2.4 cm were employed.

Since we mix the signals artificially, we know each of the reverberated microphone speech components and can compute a perfect binary mask. The optimum achievable performance obtained this way is analyzed in terms of the signal-to-interference power ratio (SIR) and the perceptual similarity measure (PSM) [12]. PSM has been shown to give objective perceptual quality results comparable to the well-known PESQ measure [13]. The PSM(*) and SIR values are shown in the upper and lower subplots of Fig. 2, respectively. The reference signal for PSM was the target speech signal at one microphone (without noise and interference), and the test signal to be evaluated was the processed target speech signal according to (9). Since we want to separate two sources, two estimates Û_1(k, m) and Û_2(k, m) are available. For each utterance, the better and the worse of the two (as measured by PSM) were collected in separate sets. In all figures, "PSM-high" and "PSM-low" denote the averaged PSM of the outputs with the better and the lower PSM scores, respectively, and "PSM-mean" is the mean of the two. All results are averages over 45 utterances per T_60. The SIR scores were also evaluated separately for each of the two sets.

(*) PSM results are generated with level alignment and lowpass filtering, and without assimilation.

[Figure 2: Perfect binary masking. Upper plot: outputs with high and low PSM values collected separately, and the overall PSM result (PSM-mean), over T_60. Lower plot: SIR separately for the outputs with high and low PSM, and the overall SIR (SIR PSM-mean).]
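The paper reports SIR as a power ratio but does not spell out its computation. A generic sketch, assuming the target and interference components of each output can be obtained separately (possible here since the mixing is artificial), is:

```python
import numpy as np

def sir_db(target, interference, eps=1e-12):
    """Signal-to-interference power ratio in dB, the kind of measure
    reported in Figs. 2-4; assumes the separately filtered target and
    interference components of an output are available."""
    p_t = np.sum(np.abs(target)**2)
    p_i = np.sum(np.abs(interference)**2) + eps
    return 10.0 * np.log10(p_t / p_i)
```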


Remarkably, the perfect masking process works well for all simulated reverberation times and for both outputs. This shows that disjoint orthogonality (8) is indeed a promising approach. The crucial issue, however, is how to estimate the mask in a reverberant scenario.

Next, the performance degradation is examined when the binary mask is estimated online by the DUET algorithm [10]; see Fig. 3. With increasing T_60 the speech quality differs dramatically between the two outputs (PSM-high, PSM-low). This is due to the unequal demixing performance: the binary decision (9) is estimated to be 1 much more often for one output than for the other. Consequently, the target speech quality of the output through which the signal is passed more often is very good, however at the cost of a reduced SIR compared to the other output. For that other output the speech quality is bad, because it was "switched off" very often; at the same time, however, its SIR is higher.

[Figure 3: Binary masking with DUET. Upper plot: outputs with high and low PSM values collected separately, and the overall PSM result (PSM-mean), over T_60. Lower plot: SIR separately for the outputs with high and low PSM, and the overall SIR (SIR PSM-mean).]

To overcome this unbalanced performance between the outputs of the DUET algorithm, the PCA-based source separation method is proposed; see Fig. 4. It can be seen that the speech quality is high for both outputs, and the SIRs of the two are also similar.

[Figure 4: Blind PCA-based separation method with likelihood masking. Upper plot: outputs with high and low PSM values collected separately, and the overall PSM result (PSM-mean), over T_60. Lower plot: SIR separately for the outputs with high and low PSM, and the overall SIR (SIR PSM-mean).]

To give an impression of the adaptation speed, Fig. 5 plots the SIR of the proposed method over time. We chose the case of no reverberation (T_60 = 0 s), as it has the highest dynamic range of the SIR. The initial filter coefficients of the SAD were set to zero, and the PCA coefficients were chosen randomly. Fig. 5 shows the fast adaptation of the proposed separation method.

[Figure 5: Adaptation speed of the principal eigenvector separation (SIR over time), averaged over all runs for T_60 = 0 s.]

6. Conclusions

A blind source separation method based on principal eigenvector beamforming, exploiting the time-frequency sparseness of speech signals, has been presented. The high speech quality obtained by adaptive beamforming is combined with blind source classification. Important properties of the proposed method are as follows. First, the beamformer makes no a priori assumptions about the acoustic enclosure, and neither the source directions nor the array geometry need to be known. Second, the separation is carried out in two steps: a frequency-bin-wise likelihood estimation of the dominant source, followed by principal eigenvector beamforming. This results in robust, fast adaptation behavior and a flexible structure. Third, the interference rejection is increased by mutually orthogonal projection of the filter coefficients and by post-filtering.

7. Acknowledgement

This work was in part supported by Deutsche Forschungsgemeinschaft under contract number HA 3455/4-1.

8. References

[1] H. L. Van Trees, Optimum Array Processing, John Wiley, New York, 2001.
[2] H. Sawada, R. Mukai, S. Araki, and S. Makino, "Frequency-Domain Blind Source Separation", in Speech Enhancement, Springer, 2005.
[3] L. Parra and C. V. Alvino, "Geometric Source Separation: Merging Convolutive Source Separation with Geometric Beamforming", IEEE Trans. on Speech and Audio Processing, vol. 10, no. 6, Sep. 2002.
[4] M. Knaak, S. Araki, and S. Makino, "Geometrically Constrained Independent Component Analysis", IEEE Trans. Audio, Speech, Lang. Proc., vol. 15, no. 2, pp. 715-726, Feb. 2007.
[5] S. Araki, S. Makino, Y. Hinamoto, R. Mukai, T. Nishikawa, and H. Saruwatari, "Equivalence between Frequency-Domain Blind Source Separation and Frequency-Domain Adaptive Beamforming for Convolutive Mixtures", EURASIP Journal on Applied Signal Processing, vol. 2003, no. 11, pp. 1157-1166, 2003.
[6] S. Makino, "Blind Source Separation of Convolutive Mixtures of Speech", in Adaptive Signal Processing, Springer, 2003.
[7] O. Yilmaz and S. Rickard, "Blind Separation of Speech Mixtures via Time-Frequency Masking", IEEE Trans. on Signal Processing, vol. 52, pp. 1830-1847, July 2004.
[8] S. Affes and Y. Grenier, "A Signal Subspace Tracking Algorithm for Microphone Array Processing of Speech", IEEE Trans. Speech and Audio Processing, vol. 5, pp. 425-437, Sep. 1997.
[9] E. Warsitz and R. Haeb-Umbach, "Acoustic Filter-and-Sum Beamforming by Adaptive Principal Component Analysis", in Proc. IEEE ICASSP, Philadelphia, Mar. 2005.
[10] S. Rickard, R. Balan, and J. Rosca, "Real-Time Time-Frequency Based Blind Source Separation", in Proc. ICA 2001, San Diego, Dec. 2001.
[11] B. Schölling, M. Heckmann, F. Joublin, and C. Goerick, "From Source Localization to Blind Source Separation: An Intuitive Route to Blind Source Separation", in Proc. IWAENC, Paris, Oct. 2006.
[12] R. Huber, "Objective Assessment of Audio Quality Using an Auditory Processing Model", Ph.D. thesis, University of Oldenburg, Oldenburg, 2003.
[13] T. Rohdenburg, V. Hohmann, and B. Kollmeier, "Objective Perceptual Quality Measures for the Evaluation of Noise Reduction Schemes", in Proc. IWAENC, Eindhoven, Netherlands, Sep. 2005.
