Robustness Techniques for Feature Extraction - Berlin Chen

Robustness Techniques 

for Speech Recognition 

Berlin Chen, 2003 

References: 

1. X. Huang et. al., Spoken Language Processing (2001), Chapter 10 

2. J. C. Junqua and J. P. Haton, Robustness in Automatic Speech Recognition (1996), 

Chapters 5, 8-9 

3. T. F. Quatieri, Discrete-Time Speech Signal Processing, Principles and Practice (2002), 

Chapter 13

Introduction 

• Classification of Speech Variability in Five Categories 

Pronunciation 

Variation 

Intra-speaker 

variability 

Linguistic 

variability 

Inter-speaker 

variability 

Speaker-independency 

Speaker-adaptation 

Speaker-dependency 

Robustness 

Enhancement 

Variability caused 

by the environment 

Variability caused 

by the context 

Context-Dependent 

Acoustic Modeling 

2

Introduction 

• The Diagram for Speech Recognition 

Acoustic Processing Linguistic Processing 

Speech 

signal 

Feature 

Extraction 

Likelihood 

computation 

Linguistic Network 

Decoding 

Recognition 

results 

Acoustic 

model model 

Language 

model model 

Lexicon 

• Importance of the robustness in speech recognition 

– Speech recognition systems must operate in situations with 

uncontrollable acoustic environments 

– The recognition performance is often degraded due to the 

mismatch in the training and testing conditions 

• Varying environmental noises, different speaker characteristics 

(sex, age, dialects), different speaking modes (stylistic, Lombard 

effect), etc. 

3

Introduction 

• If a speech recognition system’s accuracy doesn’t degrade 

very much under mismatch conditions, the system is called 

E 

s 

E 

s 2.5 

robust 

25dB 

= 10 log 

10 

=> = 10 ≈ 

– ASR performance is rather uniform for SNRs greater than 25dB, but 

there is a very steep degradation as the noise level increases 

• Variant noises exist in varying real-world environments 

(periordic, impulsive, or wide/narrow band) 

• Therefore, several possible robustness approaches have 

been developed to enhance the speech signal, its 

spectrum, and the acoustic models as well 

– Environment compensation processing (feature-based) 

– Environment model adaptation (model-based) 

– Inherently robust acoustic features (both model- and feature-based) 

• Discriminatively trained acoustic features 

E 

N 

E 

N 

316 

4

The Noise Types 

s[m] 

h[m] 

n[m] 

x[m] 

A model of the environment. 

x 

[ m] = s[ m] ∗ h[ m] + n[ m] 

X ( ω ) = S( ω ) H ( ω ) + N ( ω ) 

2 

2 

2 

2 

* 

X ( ω ) = S( ω ) H ( ω ) + N ( ω ) + 2 Re{ S( ω ) H ( ω ) N ( ω )} 

2 

2 

2 

= S( ω ) H ( ω ) + N ( ω ) + 2 S( ω ) H ( ω ) N ( ω ) cos 

2 

2 

2 

≈ S( ω ) H ( ω ) + N ( ω ) 

or P ( ) = P ( ) P ( ) + P ( ) , P() 

X 

ω 

S 

ω 

H 

ω 

N 

ω ⋅ : power spectrum 

or S ( ω ) = S ( ω ) S ( ω ) + S ( ω ) , S (): 

power spectrum 

⇔ 

⇔ 

xx 

ss 

hh 

nn 

-- 

⋅ 

θ 

ω

Additive Noises 

• Additive noises can be stationary or non-stationary 

– Stationary noises 

• Such as computer fan, air conditioning, car noise: the power 

spectral density does not change over time (the above noises are 

also narrow-band noises) 

– Non-stationary noises 

• Machine gun, door slams, keyboard clicks, radio/TV, and other 

speakers’ voices (babble noise, wide band nose, most difficult): the 

statistical properties 

change over time 

6

Additive Noises 

7

Convolutional Noises 

• Convolutional noises are mainly resulted from channel 

distortion (sometimes called “channel noises”) and are 

stationary for most cases 

– Reverberation, the frequency response of microphone, 

transmission lines, etc. 

8

Noise Characteristics 

• White Noise 

– The power spectrum is flat S nn 

ω = ,a condition equivalent to 

different samples being uncorrelated, R nn 

[ m] = qδ 

[ m] 

– White noise has a zero mean, but can have different distributions 

– We are often interested in the white Gaussian noise, as it 

resembles better the noise that tends to occur in practice 

• Colored Noise 

( ) q 

– The spectrum is not flat (like the noise captured by a microphone) 

– Pink noise 

• A particular type of colored nose that has a low-pass nature, as it 

has more energy at the low frequencies and rolls off at high 

frequency 

• E.g., the noise generated by a computer fan, an air conditioner, or 

an automobile 

9

• Musical Noise 

Noise Characteristics 

– Musical noise is short sinusoids (tones) randomly distributed 

over time and frequency that occur due to the drawback of 

original spectral subtraction technique and statistical inaccuracy 

in estimating noise magnitude spectrum 

• Lombard effect 

– A phenomenon by which a speaker increases his vocal effect in 

the presence of background noise (the additive noise) 

– When a large amount of noise is present, the speaker tends to 

shout, which entails not only a high amplitude, but also often 

higher pitch, slightly different formants, and a different coloring 

(shape) of the spectrum 

– The vowel portion of the words will be overemphasized by the 

speakers 

10

Robustness Approaches

Three Basic Categories of Approaches 

• Speech Enhancement Techniques 

– Eliminating or reducing the noisy effect on the speech signals, 

thus better accuracy with the originally trained models 

(Restore the clean speech signals or compensate for distortions) 

– The feature part is modified while the model part remains 

unchanged 

• Model-based Noise Compensation Techniques 

– Adjusting (changing) the recognition model parameters (means 

and variances) for better matching the testing noisy conditions 

– The model part is modified while the feature part remains 

unchanged 

• Inherently Robust Parameters for Speech 

– Finding robust representation of speech signals less influenced 

by additive or channel noise 

– Both of the feature and model parts are changed 

12

Three Basic Categories of Approaches 

• General Assumptions for the Noise 

– The noise is uncorrelated with the speech signal 

– The noise characteristics are fixed during the speech utterance 

or vary very slowly (the noise is said to be stationary) 

• The estimates of the noise characteristics can be obtained during 

non-speech activity 

– The noise is supposed to be additive or convolutional 

• Performance Evaluation 

– Intelligibility, quality (subjective assessment) 

– Distortion between clean and recovered speech (objective 

assessment) 

– Speech recognition accuracy 

13

Spectral Subtraction (SS) S. F. Boll, 1979 

• A Speech Enhancement Technique 

• Estimate the magnitude (or the power) of clean speech by 

explicitly subtracting the noise magnitude (or the power) 

spectrum from the noisy magnitude (or power) spectrum 

• Basic Assumption of Spectral Subtraction 

– The clean speech s [ m] 

is corrupted by additive noise n[ m] 

– Different frequencies are uncorrelated from each other 

– s [ m] 

and n[ m] 

are statistically independent, so that the power 

spectrum of the noisy speech x[ m] 

can be expressed as: 

P ( ω ) = P ( ω ) + P ( ω ) 

X 

S 

N 

– To eliminate the additive noise: P ( ω ) = P ( ω ) − P ( ω ) 

S 

X 

N 

– We can obtain an estimate of P N 

( ω ) using the average period of M 

frames that known to be just noise: 

Pˆ 

1 

M 

M 

( ) ∑ − 1 

ω = ( ω ) 

P N N ,i i= 

0 

14

Spectral Subtraction (SS) 

• Problems of Spectral Subtraction 

– s [ m] 

and n[ m] 

are not statistically independent such that the cross 

term in power spectrum can not be eliminated 

( ) 

– ω is possibly less than zero 

PˆS 

– Introduce “musical noise” when 

( ω ) ≈ P ( ω ) 

– Need a robust endpoint (speech/noise/silence) detector 

P 

X 

N 

15


• Modification: Nonlinear Spectral Subtraction (NSS) 

Pˆ 

P 

S 

⎧PX 

( ω ) − PN 

( ω ), 

if PX 

( ω ) ≥ PN 

( ω ) 

( ω ) = ⎨ 

⎩PN 

( ω ), 

otherwise 

( ω ) and P ( ω ): smoothed noisy and noise spectrum 

X 

N 

or 

Pˆ 

P 

φ 

⎧P 

( ) ( ) ( ) ( ) ( ) 

X 

ω − φ ω , if PX 

ω > φ ω + β ⋅ PN 

ω 

( ) 

S 

ω = ⎨ 

⎩β 

⋅ P ( ω ) 

N 

, otherwise 

( ω ) and P ( ω ) 

X 

N 

: smoothed noisy and noise 

( ω ): 

a non - linear function according to SNR 

spectrum 

16


• Spectral Subtraction can be viewed as a filtering 

operation 

Pˆ 

S 

( ω ) = P ( ) ( ) 

X 

ω − PN 

ω 

⎡ P 

( ) 

( ) 

N 

ω 

= PX 

ω ⎢1 

− 

P ( ω ) 

Power Spectrum 

P ( ) 

S 

ω 

( ω ) + P ( ω ) 

⎤ ⎡ 

⎤ 

⎥ = P ( ) 

X 

ω ⎢ 

⎥ (supposed that PX 

S 

+ 

⎣ X ⎦ ⎣ PS 

N ⎦ 

−1 

⎡ 1 ⎤ 

P 

( ) 

( ) 

( ) 

N 

ω 

= PX 

ω ⎢1 

+ 

( R = :instantaneous SNR ) 

R( ) 

⎥ ω 

⎣ ω ⎦ 

P ( ) 

S 

ω 

∴The time varyingsuppression filter is given approximately by : 

( ω) ≈ P ( ω) P ( ω) 

N 

) 

H 

( ω) 

⎢ 

⎡ = 1 + 

⎣ 

1 

R 

( ω) 

⎤ 

⎥ 

⎦ 

−1 / 2 

Spectrum Domain 

17

Wiener Filtering 


• From the Statistical Point of View 

– The process x[ m] 

is the sum of the random process s[ m] 

and the 

additive noise process n[ m] 

x [ m] = s[ m] + n[ m] 

– Find a linear estimate ŝ [ m] 

in terms of the process x[ m] 

: 

• Or to find a linear filter h [ m] 

such that the sequence ŝ[ m] = x[ m] ∗ h[ m] 

minimizes the expected value of ( ŝ[ m] − s[ m] 

) 2 

x [ m ] 

ŝ [ m ] 

Noisy Speech 

[ m ] 

s ˆ = x [ m ] ∗ h [ m ] 

= 

A linear filter 

h[n] 

=∑ ∞ 

l −∞ 

h 

[] l x [ m − l ] 

Clean Speech 

18


• Minimize the expectation of the squared error (MMSE 

estimate) 

Minimize F 

∀ 

k 

⇒ ∀ 

⇒ 

⇒ 

⇒ 

⇒ 

⇒ 

∂F 

h 

∞ 

[ m] − ∑ h[ l] x[ m − l] 

[] k 

s[ m] x[ m k] = ⎜ 

⎛ ∞ 

− ∑ h[ l] x[ m − l] ⎟ 

⎞x[ m − k] 

k 

k =−∞ 

k =−∞ 

k =−∞ 

R 

S 

∞ 

∑ 

∞ 

∑ 

∞ 

∑ 

s 

ss 

s 

s 

= 0 

⎧ 

= E⎨ 

⎩ 

⎡s 

⎢⎣ 

∞ ∞ 

[ m] x[ m − k] = ∑ h[ l] ∑x[ m − l] x[ m − k] 

∞ ∞ 

[ m] ( s[ m − k] + n[ m − k] 

) = ∑ h[] l ∑x[ m − l] x[ m − k] 

∞ 

∞ 

s[ m] s[ m − k] + ∑ s[ m] n[ m − k] = ∑ h[ l] Rx[ k − l] 

k =−∞ 

l =−∞ 

[] k = h[] k ∗ Rx[] 

k 

Rs 

[ n] and Rx 

[ n] 

( ω) = H ( ω) S ( ω) 

sequences of s n 

xx 

⎝ 

l =−∞ 

l=−∞ 

l =−∞ 

k=−∞ 

l=−∞ 

⎠ 

⎤ 

⎥⎦ 

2 

⎫ 

⎬ 

⎭ 

k=−∞ 

and 

Take Fourier transform 

Take summation for k 

[] x[ 

n] 

s 

[ m] n[ m] 

and are 

statistically independent! 

: are respective ly the autocorrel ation 

19


• Minimize the expectation of the squared error (MMSE 

estimate) 

Q S 

ss 

( ω) = H ( ω) S ( ω) 

xx 

⇒ 

H 

( ω) 

= 

S 

S 

ss 

xx 

( ω) 

( ω) 

= 

S 

ss 

S 

ss 

( ω) 

( ω) + S ( ω) 

nn 

, is 

called the noncausal 

Wiener filter 

(where 

S 

xx 

( ω) = S ( ω) + S ( ω) 

ss 

nn 

) 

20


• The time varying Wiener Filter also can be expressed in 

a similar form as the spectral subtraction 

H 

( ω ) 

S ( ) 

ss 

ω 

( ω ) + S ( ω ) 

= 

= 

S 

ss 

nn 

PS 

+ 

-1 

⎡ P ( ) 

N 

ω ⎤ ⎡ 1 ⎤ 

= ⎢1 

+ ⎥ = 1 

P ( ) R( ) 

S 

ω 

⎢ + 

ω 

⎥ 

⎣ ⎦ ⎣ ⎦ 

P ( ) 

S 

ω 

( ω ) P ( ω ) 

-1 

N 

, 

( R 

( ω ) 

= 

P 

P 

S 

N 

( ω ) 

( ω ) 

: instantane 

ous 

SNR 

) 

SS vs. Wiener Filter: 

1. Wiener filter has stronger attenuation 

at low SNR region 

2. Wiener filter does not invoke an 

absolute thresholding 

10 

log 

P 

P 

S 

N 

( ω ) 

( ω ) 

21


• Wiener Filtering can be realized only if we know the 

power spectra of both the noise and the signal 

– A chicken-and-egg problem 

• Approach - I : Ephraim(1992) proposed the use of an 

HMM where, if we know the current frame falls under, we 

can use it’s mean spectrum as S ( ω) or P ( ω) 

– In practice, we do not know what state each frame falls into 

either 

• Weigh the filters for each state by a posterior probability that frame 

falls into each state 

ss 

S 

22

• Approach - II : 


– The background/noise is stationary and its power spectrum can 

be estimated by averaging spectra over a known background 

region 

– For the non-stationary speech signal, its time-varying power 

spectrum can be estimated using the past Wiener filter (of 

previous frame) 

Pˆ 

∴H 

~ 

P 

S 

( t, 

ω) = PX 

( t, 

ω) H( t −1, 

ω) 

Pˆ 

S 

( ) 

( t, 

ω) 

t, 

ω = 

Pˆ 

( t, 

ω) + P ( ω) 

S 

S 

( t, 

ω) = P ( t, 

ω) H( t, 

ω) 

X 

N 

, 

( t :frameindex, H( ⋅ 

): 

Wiener filter) 

• The initial estimate of the speech spectrum can be derived from 

spectral subtraction 

– Sometimes introduce musical noise 

23


• Approach - III : 

– Slow down the rapid frame-to-frame movement of the object 

speech power spectrum estimate by apply temporal smoothing 

) 

P 

S 

~ 

( t, 

ω ) = α ⋅ P ( t −1, 

ω ) + ( 1− 

α ) ⋅ Pˆ 

( t, 

ω ) 

S 

S 

Then use 

H 

( t, 

ω ) 

= 

) 

P 

S 

Pˆ 

S 

( t, 

ω ) to replace Pˆ 

S 

( t, 

ω ) in 

Pˆ 

S 

( t, 

ω ) 

⇒ H ( t, 

ω ) 

( t, 

ω ) + P ( ω ) 

N 

= 

) 

P 

S 

) 

PS 

( t, 

ω ) 

( t, 

ω ) + P ( ω ) 

N 

24


Clean Speech 

Noisy Speech 

Enhanced Noise Speech 

Using Approach – III 

τ = 0.85 

Other more complicate 

Wiener filters 

25

The Effectives of Active Noise 

26

Cepstral Mean Normalization (CMN) 

• A Speech Enhancement Technique and sometimes 

called Cepstral Mean Subtraction (CMS) 

• CMN is a powerful and simple technique designed to 

handle conventional (Time-invariant linear filtering) 

distortions x[ n] = s[ n] ∗ h[ n] 

( ω ) S ( ω ) H ( ω ) 

X = 

l 

2 

2 

2 l 

X = log SH = log S + log H = S + 

H 

l l 

l 

l 

( S + H ) = CS CH 

l 1 T −1 

l 

l 

l 

= ( CS t ) l 

+ CH = CS CH 

l 

CX = C 

+ 

Time Domain 

Spectral Domain 

l 1 T −1 

CS = ∑ CS 

l 

t and CX ∑ 

+ 

t = 0 

t = 0 

T 

T 

if the training and testing speech materials were recored from two different channels 

l 

l 

l 

l 

l 

l 

l 

l 

l 

l 

() 1 = C( S + H(1) ) = CS + CH(1) , Testing : CX ( 2) = C( S + H( 2 ) ) = CS CH( 2 ) 

Training : CX + 

l 

Log Power Spectral Domain 

Cepstral Domain 

CX 

CX 

l 

l 

l 

l 

( 1) − CX ( 1) 

= CS − CS 

l 

l 

l 

l 

( 2 ) − CX ( 2 ) = CS − CS 

The spectral characteristics of the microphone 

and room acoustics thus can be removed ! 

Can be eliminated if the assumption of zero-mean speech contribution! 

27


• Some Findings 

– Interesting, CMN has been found effective even the testing and 

training utterances are within the same microphone and 

environment 

• Variations for the distance between the mouth and the microphone 

for different utterances and speakers 

– Be careful that the duration/period used to estimate the mean 

of noisy speech 

• Why? 

28


• Performance 

– For telephone recordings, where each call has different 

frequency response, the use of CMN has been shown to provide 

as much as 30 % relative decrease in error rate 

– When a system is trained on one microphone and tested on 

another, CMN can provide significant robustness 

Temporal (Modulation) 

Frequency 

29


• CMN has been shown to improve the robustness not 

only to varying channels but also to the noise 

– White noise added at different SNRs 

– System trained with speech with the same SNR (matched 

Condition) 

Cepstral delta and delta-delta 

features are computed prior to the 

CMN operation so that they are 

unaffected. 

30


• From the other perspective 

– We can interpret CMN as the operation of subtracting a low-pass 

temporal filter d[ n] 

, where all the T coefficients are identical and 

equal to 1 , which is a high-pass temporal filter 

T 

– Alleviate the effect of conventional noise introduced in the 

channel 

• Real-time Cepstral Normalization 

– CMN requires the complete utterance to compute the cepstral 

mean; thus, it cannot be used in a real-time system, and an 

approximation needs to be used 

– Based on the above perspective, we can implement other types 

of high-pass filters 

CX 

l 

t 

= α ⋅ CX 

l 

t + α 

l 

( 1 − ) ⋅ CX 

l 

t −1 

, ( CX t : cepstral mean) 

31

RASTA Temporal Filter Hyneck Hermansky, 1991 


• RASTA (Relative Spectral) 

Assumption 

– The linguistic message is coded into movements of the vocal 

tract (i.e., the change of spectral characteristics) 

– The rate of change of non-linguistic components in speech often 

lies outside the typical rate of change of the vocal tact shape 

• E.g. fix or slow time-varying linear communication channels 

– A great sensitivity of human hearing to modulation frequencies 

around 4Hz than to lower or higher modulation frequencies 

Effect 

– RASTA Suppresses the spectral components that change more 

slowly or quickly than the typical rate of change of speech 

32

RASTA Temporal Filter 

• The IIR transfer function 

1 

C 

( ) 

~ 

− 

( ) 

x 

z 

4 2 + z 

H z = = 0.1z ⋅ 

C ( z) 

1− 

MFCC stream 

x 

−3 

− z − 2z 

−1 

0.98z 

c [ t] 

c ~ [ t] 

H(z) 

H(z) 

−4 

New 

MFCC stream 

Frame index 

H(z) 

• An other version 

c 

~ 

H 

( z) 

−1 

−3 

2 + z − z − 2 z 

= 0.1⋅ 

−1 

1 − 0.98 z 

RASTA has a peak at about 

4Hz (modulation frequency) 

[ t] = 0.98 ⋅ c 

~ 

[ t − 1] + 0.2 ⋅ c[ t] + 0.1⋅ 

c[ t − 1] 

− 0.1⋅ 

c[ t − 2] + 0.2 ⋅ c[ t − 4] 

−4 

modulation frequency 100 Hz 

33

Retraining on Corrupted Speech 

• A Model-based Noise Compensation Technique 

• Matched-Conditions Training 

– Take a noise waveform from the new environment, add it to all 

the utterance in the training database, and retrain the system 

– If the noise characteristics are known ahead of time, this method 

allow as to adapt the model to the new environment with 

relatively small amount of data from the new environment, yet 

use a large amount of training data 

34

Retraining on Corrupted Speech 

• Multi-style Training 

– Create a number of artificial acoustical environments by 

corrupting the clean training database with noise samples of 

varying levels (30dB, 20dB, etc.) and types (white, babble, etc.), 

as well as varying the channels 

– All those waveforms (copies of training database) from multiple 

acoustical environments can be used in training 

35

Model Adaptation 


• The standard adaptation methods for speaker adaptation 

can be used for adapting speech recognizers to noisy 

environments 

– MAP (Maximum a Posteriori) can offer results similar to those of 

matched conditions, but it requires a significant amount of 

adaptation data 

– MLLR (Maximum Likelihood Regression) can achieve 

reasonable performance with about a minute of speech for minor 

mismatch. For severe mismatches, MLLR also requires a larger 

amount of adaptation data 

36

Signal Decomposition Using HMMs 


• Recognize concurrent signals (speech and noise) 

simultaneously 

– Parallel HMMs are used to model the concurrent signals and the 

composite signal is modeled as a function of their combined 

outputs 

• Three-dimensional Viterbi Search 

Noise HMM 

(especially for 

non-stationary noise) 

Computationally Expensive 

for both Training and Decoding ! 

Clean speech HMM 

37

Parallel Model Combination (PMC) 


• By using the clean-speech models and a noise model, 

we can approximate the distributions obtained by training 

a HMM with corrupted speech 

38


• The steps of Standard Parallel Model Combination (Log- 

Normal Approximation) 

Cepstral domain 

µ 

Σ 

c 

c 

Σ 

µ 

Clean speech HMM’s 

Noisy speech HMM’s 

µ ˆ 

c 

Σˆ 

c 

l 

l 

= C 

−1 

µ 

c 

= C Σ ( C 

−1 c −1 

) 

T 

c l 

µ ˆ = C µ ˆ 

ˆ c l T 

= C Σˆ 

C 

Σ 

Constraint: the estimate of 

variance is positive 

Log-spectral domain 

µ 

Σ 

l 

µ ˆ 

l 

l 

Σˆ 

l 

µ = exp µ 

i 

Σij = µ 

i 

µ 

j 

l l 

( + Σ 2) 

In linear spectral domain, 

the distribution is lognormal 

i 

ii 

l 

[ exp( Σ ) −1] 

Because speech and noise are 

independent and additive in the 

linear spectral domain 

ij 

1 Σˆ 

ii 

( ˆ µ 

i 

) − log( + 1) 

l 

ˆ µ 

i 

= log 

2 

ˆ µ i 

2 

Σˆ 

= log 

l 

ij 

ij ˆ µ ˆ µ 

( ) 

Σˆ 

+ 1 

i 

j 

Σ 

Linear spectral domain 

µ 

Σ 

µ ˆ = gµ 

+ µ ~ 

ˆ 2 

= 

g 

µ ˆ 

Σˆ 

Σ 

~ 

+ Σ 

Noise HMM’s 

µ ~ 

~ 

Σ 

Log-normal 

approximation 

(Assume the new 

distribution is lognormal) 

39


• Modification-I: Perform the model combination in the Log- 

Spectral Domain (the simplest approximation) 

– Log-Add Approximation: (without compensation of variances) 

( ( 

l 

exp µ ) exp( µ 

l 

) 

l 

ˆ µ = log + ~ 

• The variances are assumed to be small 

– A simplified version of Log-Normal approximation 

• Reduction in computational load 

• Modification-II: Perform the model combination in the 

Linear Spectral Domain (Data-Driven PMC, DPMC, or 

Iterative PMC) 

– Use the speech models to generate noisy samples (corrupted 

speech observations) and then compute a maximum likelihood of 

these noisy samples 

– This method is less computationally expensive than standard 

PMC with comparable performance 

40


• Modification-II: Perform the model combination in the 

Linear Spectral Domain (Data-Driven PMC, DPMC) 

Clean Speech HMM 

Noise HMM 

Noisy Speech HMM 

Cepstral domain 

Apply Monte Carlo 

simulation to draw random 

cepstral vectors 

(for example, at least 100 for 

each distribution) 

Generating 

samples 

Linear spectral domain 

Domain 

transform 

41


• Data-Driven PMC 

42

Vector Taylor Series (VTS) P. J. Moreno,1995 


• VTS Approach 

– Similar to PMC, the noisy-speech-like models is generated by 

combining of clean speech HMM’s and the noise HMM 

– Unlike PMC, the VTS approach combines the parameters of 

clean speech HMM’s and the noise HMM linearly in the logspectral 

domain 

P ω = P ω P ω + P ω 

Power spectrum 

X 

X 

( ) 

S 

( ) 

H 

( ) 

N 

( ) 

= log( PS 

( ω) PH 

( ω) + PN 

( ω) 

) 

⎛ ⎛ P 

( ) ( ) 

( ) ⎞ 

⎜ 

N 

ω ⎞ 

= log PS 

ω PH 

ω ⎜ 

⎟⎟ 

1 + 

PS 

( ) PH 

( ) 

⎝ ⎝ ω ω ⎠⎠ 

logPN 

( ω) −logPS 

( ω) −logPH 

( ω) 

= logP 

( ω) + logP 

( ω) + log1+ 

e 

l 

= S 

= S 

l 

l 

S 

+ H 

+ H 

l 

l 

+ 

f 

H 

Log Power spectrum 

( ) 

( ) 

N −S 

−H 

+ e 

Non-linear function 

( ) ( ) ( ) 

l l l 

l l l 

N −S 

−H 

S , H , N , wheref 

S , H , N = log1+ 

e 

+ log1 

l 

l 

l 

l 

l 

l 

Is a vector 

function 

43

Vector Taylor Series (VTS) 

• The Taylor series provides a polynomial representation 

of a function in terms of the function and its derivatives at 

a point 

– Application often arises when nonlinear functions are employed 

and we desire to obtain a linear approximation 

– The function is represented as an offset and a linear term 

f 

f 

: R → R 

( x ) = f ( x ) + f ′( x )( x − x ) + f ′′( x )( x − x ) 

0 

0 

1 

+ .... + f 

n! 

0 

( ) 

( )( ) ( ) 

n 

n 

n 

x x − x + o x − x 

0 

1 

2 

0 

0 

0 

0 

44


• Apply Taylor Series Approximation 

f 

l l l 

l l l 

l l l df 

( ) ( ) ( S 

0 

, H 

0 

, N 

0 

) 

, , 

, , 

( 

l l 

N S H ≅ f S 

0 

H 

0 

N 

0 

+ 

S − S 

l 

0 

dS 

) 

l l l 

l l l 

df ( S 

0 

, H 

0 

, N 

0 

) ( 

l 

l 

) 

df ( S 

0 

, H 

0 

, N 

0 

) 

+ 

− + 

( 

l 

l 

H H 

N − N ) + ..... 

dH 

– VTS-0: use only the 0th-order terms of Taylor Series 

– VTS-1: use only the 0th- and 1th-order terms of Taylor Series 

l l l 

– f S 

0 

, H 

0 

, N 

0 is the vector function evaluated at a particular 

vector point 

• If VTS-0 is used 

E 

u 

Σ 

l 

l l 

l l l 

[ X ] = E[ S + H + f ( S ,H ,N )] 

l l 

l l l 

= u [ ( )] 

s 

+ uh 

+ E f S ,H ,N 

l l 

l l l 

≅ u + u + E[ f ( u ,u ,u )] 

s h 

s h n 

l l 

l l l 

l 

≅ u + u + f ( u ,u ,u ) ( X 

l 

x 

l 

x 

s 

≅ Σ 

l 

s 

h 

+ Σ 

( ) 

l 

h 

(if 

s 

S 

l 

h 

n 

and H 

To get the clean speech statistics 

l 

l 

is also Gaussian) 

u 

Σ 

are independent) 

0 

l 

x 

l 

x 

≅ u 

≅ 

l 

s 

Σ 

+ g + 

l 

s 

dN 

If the channel filter is linear - time invariant, 

0-th order VTS 

we can regard it as a bias (constant) , g, 

in the log power spectrum domain 

f 

l 

l l 

l 

( u ,g ,u ) ( X is also Gaussian) 

s 

n 

0 

45


46

Retraining on Compensated Features 

• A Model-based Noise Compensation Technique that also 

Uses enhanced Features (processed by SS, CMN, etc.) 

– Combine speech enhancement and model compensation 

47

Principal Component Analysis 

• Principal Component Analysis (PCA) : 

– Widely applied for the data analysis and dimensionality reduction 

in order to derive the most “expressive” feature 

– Criterion: 

for a zero mean r.v. x∈R N , find k (k≤N) orthonormal vectors 

{e 1 , e 2 ,…, e k } so that 

– (1) var(e 1 

T 

x)=max 1 

(2) var(e iT x)=max i 

subject to e i ⊥ e i-1 ⊥…… ⊥e 1 1≤ i ≤k 

– {e 1 , e 2 ,…, e k } are in fact the eigenvectors 

of the covariance matrix (Σ x ) for x 

corresponding to the largest k eigenvalues 

– Final r.v y ∈R k : the linear transform 

(projection) of the original r.v., y=A T x 

A=[e 1 e 2 …… e k ] 

data 

Principal axis 

48


49


• Properties of PCA 

– The components of y are mutually uncorrelated 

E{y i y j }=E{(e iT x) (e jT x) T }=E{(e iT x) (x T e j )}=e iT E{xx T } e j =e iT Σ x e j 

= λ j e iT e j =0 , if i≠j 

∴ the covariance of y is diagonal 

– The error power (mean-squared error) between the original vector x 

and the projected x’ is minimum 

x=(e 1T x)e 1 + (e 2T x)e 2 + ……+(e kT x)e k + ……+(e NT x)e N 

x’=(e 1T x)e 1 + (e 2T x)e 2 + ……+(e kT x)e k (Note : x’∈R N ) 

error r.v : 

x-x’= (e k+1T x)e k+1 + (e k+2T x)e k+2 + ……+(e NT x)e N 

E((x-x’) T (x-x’))=E((e k+1T x) e k+1T e k+1 (e k+1T x))+……+E((e NT x) e NT e N 

(e NT x)) 

=var(e k+1T x)+ var(e k+2T x)+…… var(e NT x) 

= λ k+1 + λ k+2 +…… +λ N minimum 

50

PCA Applied in Inherently Robust Features 

• Application 1 : the linear transform of the original 

features (in the spatial domain) 

Original feature stream x t 

Frame index 

z t 

= A T x t 

A T A T A T A T 

The columns of A are the 

“first k” eigenvectors of Σ x 

transformed feature 

stream z t 

Frame index 

51


• Application 2 : PCA-derived temporal filter 

(in the temporal domain) 

– The effect of the temporal filter is equivalent to the weighted sum of 

sequence of a specific MFCC coefficient with length L slid along the 

frame index 

⎡ x(1,1) 

⎤ ⎡ x(2,1) 

⎤ ⎡ x(3,1) 

⎤ ⎡ x( 

n,1) 

⎤ ⎡ x( 

N ,1) ⎤ → y 

1 

( m ) 

B 1 

(z) ⎢ 

x(1,2) 

⎥ ⎢ 

x(2,2) 

⎥ ⎢ 

x(3,2) 

⎥ ⎢ 

x( 

n,2) 

⎥ ⎢ 

x( 

N ,2) 

⎥ 

→ y ( m ) 

⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ 

2 

quefrency 

⎢ M ⎥ ⎢ M ⎥ ⎢ M ⎥ ⎢ M ⎥ ⎢ M ⎥ M 

B 2 

(z) ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ L ⎢ ⎥ L ⎢ ⎥ 

⎢ x(1, 

k ) ⎥ ⎢ x(2, 

k ) ⎥ ⎢ x(3, 

k ) ⎥ ⎢ x( 

n, 

k ) ⎥ ⎢ x( 

N , k ) ⎥ → y 

k 

( m ) 

⎢ M ⎥ ⎢ M ⎥ ⎢ M ⎥ ⎢ M ⎥ ⎢ M ⎥ M 

⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ 

x(1, 

K ) x(2, 

K ) x(3, 

K ) x( 

n, 

K ) x( 

N , K ) → y 

K 

( m ) 

Original feature 

⎣ ⎦ ⎣ ⎦ ⎣ ⎦ ⎣ ⎦ ⎣ ⎦ 

x () 1 x( 2) x() 3 L x( n ) L x( N ) 

stream x B n 

(z) 

t 

z k 

(n)=[ y k 

(n) y k 

(n+1) y k 

(n+2) …… y k 

(n+L-1)] T 

N 

Frame index 

The impulse response of B k (z) is one of the 

∑ − L + 1 

µ 1 

= 

k 

( n) 

N − L + 1 

z 

zk n= 

1 

N L 1 

eigenvectors of the covariance for z k 1 T 

Σ = 

z ( n) 

− µ z n − µ 

L 

z k (1) 

z k (2) 

z k (3) 

z 

k 

N − L + 1 

The element in the new feature vector 

T 

( n, 

k) e ( 1) z ( n) 

xˆ = 

k 

k 

∑ − + 

n= 

1 

( )( ( ) ) 

k 

From Dr. Jei-wei Hung 

z 

k 

k 

z 

k 

52


The frequency responses of the 15 PCA-derived temporal filters 


53


• Application 2 : PCA-derived temporal filter 

Mismatched 

condition 

Filter length 

L=10 

Matched 

condition 


54


• Application 3 : PCA-derived filter bank 

x 1 

x 2 

x 3 

Power spectrum 

obtained by DFT 

h 1 

h 2 

h 3 

h k 

is one of the 

eigenvectors 

of the covariance 

for x k 


55


• Application 3 : PCA-derived filter bank 


56

Linear Discriminative Analysis 

• Linear Discriminative Analysis (LDA) 

– Widely applied for the pattern classification 

– In order to derive the most “discriminative” feature 

– Criterion : assume w j , µ j and Σ j are the weight, mean and 

covariance of class j, j=1……N. Two matrices are defined as: 

Between - class covariance : 

Within - class covariance : S 

Find W=[w 1 w 2 ……w k ] 

such that 

T 

W SbW 

Wˆ = arg max 

W T 

W S W 

– The columns w j of W are the 

eigenvectors of S w 

-1 

S B 

having the largest eigenvalues 

w 

( µ − µ )( µ − µ ) 

N T 

Sb 

= ∑ j = 1w 

j j 

j 

N 

w 

= ∑ j = 1w 

j 

Σ j 

57

Linear Discriminative Analysis 

The frequency responses of the 15 LDA-derived temporal filters 


58

Minimum Classification Error 

• Minimum Classification Error (MCE): 

– General Objective : find an optimal feature presentation or an 

optimal recognition model to minimize the expected error of 

classification 

– The recognizer is often operated under the following decision rule : 

C(X)=C i if g i (X,Λ)=max j g j (X,Λ) 

Λ={λ (i) } i=1……M (M models, classes), X : observations, 

g i (X,Λ): class conditioned likelihood function, for example, 

g i (X,Λ)=P(X|λ (i) ) 

– Traditional Training Criterion : 

find λ (i) such that P(X|λ (i) ) is maximum (Maximum Likelihood) if X 

∈C i 

• This criterion does not always lead to minimum classification error, 

since it doesn't consider the mutual relationship between 

different classes 

• For example, it’s possible that P(X|λ (i) ) is maximum but X ∉C i 

59


Threshold 

τ 

P ( LR ( k ) KW k 

∉ C k 

) 

P LR ( k ) 

k 

( ∈ ) 

KW k 

C k 

Type I 

error 

Type II 

error 

LR 

( k ) 

Example showing histograms of the likelihood ratio 

when keyword KW ∈ and KW ∉ 

k 

C k 

k 

C k 

LR 

( k ) 

Type I error: False Rejection 

Type II error: False Alarm/False Acceptance 

60


• Minimum Classification Error (MCE) (Cont.): 

– One form of the class misclassification measure : 

d 

d 

d 

i 

i 

i 

() i 

( X ) −g 

X , λ 

⎡ 1 

⎢ 

⎣ M −1 

( ) ( ( 

() i 

+ log exp g X , λ ) α ) 

= ∑ 

( X ) ≥ 0 implies a misclassification (error = 1) 

( X ) < 0 implies a correct classification (error = 0) 

– A continuous loss function is defined as follows : 

l 

i 

( X , Λ) = l( d ( X )) 

i 

X ∈ C 

where the sigmoid function 

( d ) 

j≠i 

– Classifier performance measure : 

i 

l 

1 

= 

1 + exp 

( − γ d + θ ) 

⎤ 

⎥ 

⎦ 

1 

α 

X 

∈C 

i 

L 

( Λ) 

= E [ L( X , Λ) 

] = l ( X , Λ) δ( X ∈ C ) 

X 

∑∑ 

X 

M 

i= 

1 

i 

i 

61


• Using MCE in model training : 

– Find Λ such that 

Λ ˆ = arg min L 

Λ 

( Λ) = arg min E [ L( X , Λ) 

] 

Λ 

the above objective function in general cannot be minimized 

directly but the local minimum can be achieved using gradient 

decent algorithm 

( ) 

∂L 

Λ 

wt+ = wt 

− ε , w : an arbitrary parameter of 

∂w 

• Using MCE in robust feature representation 

ˆ 

( f ) 

f = argminE 

L( f ( X ), 

Λ ) 

f 

Note: 

X 

1 Λ 

f 

X 

[ ] 

: a transform of the originalfeature X 

whilefeature presentation is changed, the modelis also changed accordingly 

62

Robustness Techniques for Feature Extraction - Berlin Chen

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?