18.05.2015 Views

Robustness Techniques for Feature Extraction - Berlin Chen

Robustness Techniques for Feature Extraction - Berlin Chen

Robustness Techniques for Feature Extraction - Berlin Chen

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Robustness</strong> <strong>Techniques</strong><br />

<strong>for</strong> Speech Recognition<br />

<strong>Berlin</strong> <strong>Chen</strong>, 2003<br />

References:<br />

1. X. Huang et. al., Spoken Language Processing (2001), Chapter 10<br />

2. J. C. Junqua and J. P. Haton, <strong>Robustness</strong> in Automatic Speech Recognition (1996),<br />

Chapters 5, 8-9<br />

3. T. F. Quatieri, Discrete-Time Speech Signal Processing, Principles and Practice (2002),<br />

Chapter 13


Introduction<br />

• Classification of Speech Variability in Five Categories<br />

Pronunciation<br />

Variation<br />

Intra-speaker<br />

variability<br />

Linguistic<br />

variability<br />

Inter-speaker<br />

variability<br />

Speaker-independency<br />

Speaker-adaptation<br />

Speaker-dependency<br />

<strong>Robustness</strong><br />

Enhancement<br />

Variability caused<br />

by the environment<br />

Variability caused<br />

by the context<br />

Context-Dependent<br />

Acoustic Modeling<br />

2


Introduction<br />

• The Diagram <strong>for</strong> Speech Recognition<br />

Acoustic Processing Linguistic Processing<br />

Speech<br />

signal<br />

<strong>Feature</strong><br />

<strong>Extraction</strong><br />

Likelihood<br />

computation<br />

Linguistic Network<br />

Decoding<br />

Recognition<br />

results<br />

Acoustic<br />

model model<br />

Language<br />

model model<br />

Lexicon<br />

• Importance of the robustness in speech recognition<br />

– Speech recognition systems must operate in situations with<br />

uncontrollable acoustic environments<br />

– The recognition per<strong>for</strong>mance is often degraded due to the<br />

mismatch in the training and testing conditions<br />

• Varying environmental noises, different speaker characteristics<br />

(sex, age, dialects), different speaking modes (stylistic, Lombard<br />

effect), etc.<br />

3


Introduction<br />

• If a speech recognition system’s accuracy doesn’t degrade<br />

very much under mismatch conditions, the system is called<br />

E<br />

s<br />

E<br />

s 2.5<br />

robust<br />

25dB<br />

= 10 log<br />

10<br />

=> = 10 ≈<br />

– ASR per<strong>for</strong>mance is rather uni<strong>for</strong>m <strong>for</strong> SNRs greater than 25dB, but<br />

there is a very steep degradation as the noise level increases<br />

• Variant noises exist in varying real-world environments<br />

(periordic, impulsive, or wide/narrow band)<br />

• There<strong>for</strong>e, several possible robustness approaches have<br />

been developed to enhance the speech signal, its<br />

spectrum, and the acoustic models as well<br />

– Environment compensation processing (feature-based)<br />

– Environment model adaptation (model-based)<br />

– Inherently robust acoustic features (both model- and feature-based)<br />

• Discriminatively trained acoustic features<br />

E<br />

N<br />

E<br />

N<br />

316<br />

4


The Noise Types<br />

s[m]<br />

h[m]<br />

n[m]<br />

x[m]<br />

A model of the environment.<br />

x<br />

[ m] = s[ m] ∗ h[ m] + n[ m]<br />

X ( ω ) = S( ω ) H ( ω ) + N ( ω )<br />

2<br />

2<br />

2<br />

2<br />

*<br />

X ( ω ) = S( ω ) H ( ω ) + N ( ω ) + 2 Re{ S( ω ) H ( ω ) N ( ω )}<br />

2<br />

2<br />

2<br />

= S( ω ) H ( ω ) + N ( ω ) + 2 S( ω ) H ( ω ) N ( ω ) cos<br />

2<br />

2<br />

2<br />

≈ S( ω ) H ( ω ) + N ( ω )<br />

or P ( ) = P ( ) P ( ) + P ( ) , P()<br />

X<br />

ω<br />

S<br />

ω<br />

H<br />

ω<br />

N<br />

ω ⋅ : power spectrum<br />

or S ( ω ) = S ( ω ) S ( ω ) + S ( ω ) , S ():<br />

power spectrum<br />

⇔<br />

⇔<br />

xx<br />

ss<br />

hh<br />

nn<br />

--<br />

⋅<br />

θ<br />

ω


Additive Noises<br />

• Additive noises can be stationary or non-stationary<br />

– Stationary noises<br />

• Such as computer fan, air conditioning, car noise: the power<br />

spectral density does not change over time (the above noises are<br />

also narrow-band noises)<br />

– Non-stationary noises<br />

• Machine gun, door slams, keyboard clicks, radio/TV, and other<br />

speakers’ voices (babble noise, wide band nose, most difficult): the<br />

statistical properties<br />

change over time<br />

6


Additive Noises<br />

7


Convolutional Noises<br />

• Convolutional noises are mainly resulted from channel<br />

distortion (sometimes called “channel noises”) and are<br />

stationary <strong>for</strong> most cases<br />

– Reverberation, the frequency response of microphone,<br />

transmission lines, etc.<br />

8


Noise Characteristics<br />

• White Noise<br />

– The power spectrum is flat S nn<br />

ω = ,a condition equivalent to<br />

different samples being uncorrelated, R nn<br />

[ m] = qδ<br />

[ m]<br />

– White noise has a zero mean, but can have different distributions<br />

– We are often interested in the white Gaussian noise, as it<br />

resembles better the noise that tends to occur in practice<br />

• Colored Noise<br />

( ) q<br />

– The spectrum is not flat (like the noise captured by a microphone)<br />

– Pink noise<br />

• A particular type of colored nose that has a low-pass nature, as it<br />

has more energy at the low frequencies and rolls off at high<br />

frequency<br />

• E.g., the noise generated by a computer fan, an air conditioner, or<br />

an automobile<br />

9


• Musical Noise<br />

Noise Characteristics<br />

– Musical noise is short sinusoids (tones) randomly distributed<br />

over time and frequency that occur due to the drawback of<br />

original spectral subtraction technique and statistical inaccuracy<br />

in estimating noise magnitude spectrum<br />

• Lombard effect<br />

– A phenomenon by which a speaker increases his vocal effect in<br />

the presence of background noise (the additive noise)<br />

– When a large amount of noise is present, the speaker tends to<br />

shout, which entails not only a high amplitude, but also often<br />

higher pitch, slightly different <strong>for</strong>mants, and a different coloring<br />

(shape) of the spectrum<br />

– The vowel portion of the words will be overemphasized by the<br />

speakers<br />

10


<strong>Robustness</strong> Approaches


Three Basic Categories of Approaches<br />

• Speech Enhancement <strong>Techniques</strong><br />

– Eliminating or reducing the noisy effect on the speech signals,<br />

thus better accuracy with the originally trained models<br />

(Restore the clean speech signals or compensate <strong>for</strong> distortions)<br />

– The feature part is modified while the model part remains<br />

unchanged<br />

• Model-based Noise Compensation <strong>Techniques</strong><br />

– Adjusting (changing) the recognition model parameters (means<br />

and variances) <strong>for</strong> better matching the testing noisy conditions<br />

– The model part is modified while the feature part remains<br />

unchanged<br />

• Inherently Robust Parameters <strong>for</strong> Speech<br />

– Finding robust representation of speech signals less influenced<br />

by additive or channel noise<br />

– Both of the feature and model parts are changed<br />

12


Three Basic Categories of Approaches<br />

• General Assumptions <strong>for</strong> the Noise<br />

– The noise is uncorrelated with the speech signal<br />

– The noise characteristics are fixed during the speech utterance<br />

or vary very slowly (the noise is said to be stationary)<br />

• The estimates of the noise characteristics can be obtained during<br />

non-speech activity<br />

– The noise is supposed to be additive or convolutional<br />

• Per<strong>for</strong>mance Evaluation<br />

– Intelligibility, quality (subjective assessment)<br />

– Distortion between clean and recovered speech (objective<br />

assessment)<br />

– Speech recognition accuracy<br />

13


Spectral Subtraction (SS) S. F. Boll, 1979<br />

• A Speech Enhancement Technique<br />

• Estimate the magnitude (or the power) of clean speech by<br />

explicitly subtracting the noise magnitude (or the power)<br />

spectrum from the noisy magnitude (or power) spectrum<br />

• Basic Assumption of Spectral Subtraction<br />

– The clean speech s [ m]<br />

is corrupted by additive noise n[ m]<br />

– Different frequencies are uncorrelated from each other<br />

– s [ m]<br />

and n[ m]<br />

are statistically independent, so that the power<br />

spectrum of the noisy speech x[ m]<br />

can be expressed as:<br />

P ( ω ) = P ( ω ) + P ( ω )<br />

X<br />

S<br />

N<br />

– To eliminate the additive noise: P ( ω ) = P ( ω ) − P ( ω )<br />

S<br />

X<br />

N<br />

– We can obtain an estimate of P N<br />

( ω ) using the average period of M<br />

frames that known to be just noise:<br />

Pˆ<br />

1<br />

M<br />

M<br />

( ) ∑ − 1<br />

ω = ( ω )<br />

P N N ,i i=<br />

0<br />

14


Spectral Subtraction (SS)<br />

• Problems of Spectral Subtraction<br />

– s [ m]<br />

and n[ m]<br />

are not statistically independent such that the cross<br />

term in power spectrum can not be eliminated<br />

( )<br />

– ω is possibly less than zero<br />

PˆS<br />

– Introduce “musical noise” when<br />

( ω ) ≈ P ( ω )<br />

– Need a robust endpoint (speech/noise/silence) detector<br />

P<br />

X<br />

N<br />

15


Spectral Subtraction (SS)<br />

• Modification: Nonlinear Spectral Subtraction (NSS)<br />

Pˆ<br />

P<br />

S<br />

⎧PX<br />

( ω ) − PN<br />

( ω ),<br />

if PX<br />

( ω ) ≥ PN<br />

( ω )<br />

( ω ) = ⎨<br />

⎩PN<br />

( ω ),<br />

otherwise<br />

( ω ) and P ( ω ): smoothed noisy and noise spectrum<br />

X<br />

N<br />

or<br />

Pˆ<br />

P<br />

φ<br />

⎧P<br />

( ) ( ) ( ) ( ) ( )<br />

X<br />

ω − φ ω , if PX<br />

ω > φ ω + β ⋅ PN<br />

ω<br />

( )<br />

S<br />

ω = ⎨<br />

⎩β<br />

⋅ P ( ω )<br />

N<br />

, otherwise<br />

( ω ) and P ( ω )<br />

X<br />

N<br />

: smoothed noisy and noise<br />

( ω ):<br />

a non - linear function according to SNR<br />

spectrum<br />

16


Spectral Subtraction (SS)<br />

• Spectral Subtraction can be viewed as a filtering<br />

operation<br />

Pˆ<br />

S<br />

( ω ) = P ( ) ( )<br />

X<br />

ω − PN<br />

ω<br />

⎡ P<br />

( )<br />

( )<br />

N<br />

ω<br />

= PX<br />

ω ⎢1<br />

−<br />

P ( ω )<br />

Power Spectrum<br />

P ( )<br />

S<br />

ω<br />

( ω ) + P ( ω )<br />

⎤ ⎡<br />

⎤<br />

⎥ = P ( )<br />

X<br />

ω ⎢<br />

⎥ (supposed that PX<br />

S<br />

+<br />

⎣ X ⎦ ⎣ PS<br />

N ⎦<br />

−1<br />

⎡ 1 ⎤<br />

P<br />

( )<br />

( )<br />

( )<br />

N<br />

ω<br />

= PX<br />

ω ⎢1<br />

+<br />

( R = :instantaneous SNR )<br />

R( )<br />

⎥ ω<br />

⎣ ω ⎦<br />

P ( )<br />

S<br />

ω<br />

∴The time varyingsuppression filter is given approximately by :<br />

( ω) ≈ P ( ω) P ( ω)<br />

N<br />

)<br />

H<br />

( ω)<br />

⎢<br />

⎡ = 1 +<br />

⎣<br />

1<br />

R<br />

( ω)<br />

⎤<br />

⎥<br />

⎦<br />

−1 / 2<br />

Spectrum Domain<br />

17


Wiener Filtering<br />

• A Speech Enhancement Technique<br />

• From the Statistical Point of View<br />

– The process x[ m]<br />

is the sum of the random process s[ m]<br />

and the<br />

additive noise process n[ m]<br />

x [ m] = s[ m] + n[ m]<br />

– Find a linear estimate ŝ [ m]<br />

in terms of the process x[ m]<br />

:<br />

• Or to find a linear filter h [ m]<br />

such that the sequence ŝ[ m] = x[ m] ∗ h[ m]<br />

minimizes the expected value of ( ŝ[ m] − s[ m]<br />

) 2<br />

x [ m ]<br />

ŝ [ m ]<br />

Noisy Speech<br />

[ m ]<br />

s ˆ = x [ m ] ∗ h [ m ]<br />

=<br />

A linear filter<br />

h[n]<br />

=∑ ∞<br />

l −∞<br />

h<br />

[] l x [ m − l ]<br />

Clean Speech<br />

18


Wiener Filtering<br />

• Minimize the expectation of the squared error (MMSE<br />

estimate)<br />

Minimize F<br />

∀<br />

k<br />

⇒ ∀<br />

⇒<br />

⇒<br />

⇒<br />

⇒<br />

⇒<br />

∂F<br />

h<br />

∞<br />

[ m] − ∑ h[ l] x[ m − l]<br />

[] k<br />

s[ m] x[ m k] = ⎜<br />

⎛ ∞<br />

− ∑ h[ l] x[ m − l] ⎟<br />

⎞x[ m − k]<br />

k<br />

k =−∞<br />

k =−∞<br />

k =−∞<br />

R<br />

S<br />

∞<br />

∑<br />

∞<br />

∑<br />

∞<br />

∑<br />

s<br />

ss<br />

s<br />

s<br />

= 0<br />

⎧<br />

= E⎨<br />

⎩<br />

⎡s<br />

⎢⎣<br />

∞ ∞<br />

[ m] x[ m − k] = ∑ h[ l] ∑x[ m − l] x[ m − k]<br />

∞ ∞<br />

[ m] ( s[ m − k] + n[ m − k]<br />

) = ∑ h[] l ∑x[ m − l] x[ m − k]<br />

∞<br />

∞<br />

s[ m] s[ m − k] + ∑ s[ m] n[ m − k] = ∑ h[ l] Rx[ k − l]<br />

k =−∞<br />

l =−∞<br />

[] k = h[] k ∗ Rx[]<br />

k<br />

Rs<br />

[ n] and Rx<br />

[ n]<br />

( ω) = H ( ω) S ( ω)<br />

sequences of s n<br />

xx<br />

⎝<br />

l =−∞<br />

l=−∞<br />

l =−∞<br />

k=−∞<br />

l=−∞<br />

⎠<br />

⎤<br />

⎥⎦<br />

2<br />

⎫<br />

⎬<br />

⎭<br />

k=−∞<br />

and<br />

Take Fourier trans<strong>for</strong>m<br />

Take summation <strong>for</strong> k<br />

[] x[<br />

n]<br />

s<br />

[ m] n[ m]<br />

and are<br />

statistically independent!<br />

: are respective ly the autocorrel ation<br />

19


Wiener Filtering<br />

• Minimize the expectation of the squared error (MMSE<br />

estimate)<br />

Q S<br />

ss<br />

( ω) = H ( ω) S ( ω)<br />

xx<br />

⇒<br />

H<br />

( ω)<br />

=<br />

S<br />

S<br />

ss<br />

xx<br />

( ω)<br />

( ω)<br />

=<br />

S<br />

ss<br />

S<br />

ss<br />

( ω)<br />

( ω) + S ( ω)<br />

nn<br />

, is<br />

called the noncausal<br />

Wiener filter<br />

(where<br />

S<br />

xx<br />

( ω) = S ( ω) + S ( ω)<br />

ss<br />

nn<br />

)<br />

20


Wiener Filtering<br />

• The time varying Wiener Filter also can be expressed in<br />

a similar <strong>for</strong>m as the spectral subtraction<br />

H<br />

( ω )<br />

S ( )<br />

ss<br />

ω<br />

( ω ) + S ( ω )<br />

=<br />

=<br />

S<br />

ss<br />

nn<br />

PS<br />

+<br />

-1<br />

⎡ P ( )<br />

N<br />

ω ⎤ ⎡ 1 ⎤<br />

= ⎢1<br />

+ ⎥ = 1<br />

P ( ) R( )<br />

S<br />

ω<br />

⎢ +<br />

ω<br />

⎥<br />

⎣ ⎦ ⎣ ⎦<br />

P ( )<br />

S<br />

ω<br />

( ω ) P ( ω )<br />

-1<br />

N<br />

,<br />

( R<br />

( ω )<br />

=<br />

P<br />

P<br />

S<br />

N<br />

( ω )<br />

( ω )<br />

: instantane<br />

ous<br />

SNR<br />

)<br />

SS vs. Wiener Filter:<br />

1. Wiener filter has stronger attenuation<br />

at low SNR region<br />

2. Wiener filter does not invoke an<br />

absolute thresholding<br />

10<br />

log<br />

P<br />

P<br />

S<br />

N<br />

( ω )<br />

( ω )<br />

21


Wiener Filtering<br />

• Wiener Filtering can be realized only if we know the<br />

power spectra of both the noise and the signal<br />

– A chicken-and-egg problem<br />

• Approach - I : Ephraim(1992) proposed the use of an<br />

HMM where, if we know the current frame falls under, we<br />

can use it’s mean spectrum as S ( ω) or P ( ω)<br />

– In practice, we do not know what state each frame falls into<br />

either<br />

• Weigh the filters <strong>for</strong> each state by a posterior probability that frame<br />

falls into each state<br />

ss<br />

S<br />

22


• Approach - II :<br />

Wiener Filtering<br />

– The background/noise is stationary and its power spectrum can<br />

be estimated by averaging spectra over a known background<br />

region<br />

– For the non-stationary speech signal, its time-varying power<br />

spectrum can be estimated using the past Wiener filter (of<br />

previous frame)<br />

Pˆ<br />

∴H<br />

~<br />

P<br />

S<br />

( t,<br />

ω) = PX<br />

( t,<br />

ω) H( t −1,<br />

ω)<br />

Pˆ<br />

S<br />

( )<br />

( t,<br />

ω)<br />

t,<br />

ω =<br />

Pˆ<br />

( t,<br />

ω) + P ( ω)<br />

S<br />

S<br />

( t,<br />

ω) = P ( t,<br />

ω) H( t,<br />

ω)<br />

X<br />

N<br />

,<br />

( t :frameindex, H( ⋅<br />

):<br />

Wiener filter)<br />

• The initial estimate of the speech spectrum can be derived from<br />

spectral subtraction<br />

– Sometimes introduce musical noise<br />

23


Wiener Filtering<br />

• Approach - III :<br />

– Slow down the rapid frame-to-frame movement of the object<br />

speech power spectrum estimate by apply temporal smoothing<br />

)<br />

P<br />

S<br />

~<br />

( t,<br />

ω ) = α ⋅ P ( t −1,<br />

ω ) + ( 1−<br />

α ) ⋅ Pˆ<br />

( t,<br />

ω )<br />

S<br />

S<br />

Then use<br />

H<br />

( t,<br />

ω )<br />

=<br />

)<br />

P<br />

S<br />

Pˆ<br />

S<br />

( t,<br />

ω ) to replace Pˆ<br />

S<br />

( t,<br />

ω ) in<br />

Pˆ<br />

S<br />

( t,<br />

ω )<br />

⇒ H ( t,<br />

ω )<br />

( t,<br />

ω ) + P ( ω )<br />

N<br />

=<br />

)<br />

P<br />

S<br />

)<br />

PS<br />

( t,<br />

ω )<br />

( t,<br />

ω ) + P ( ω )<br />

N<br />

24


Wiener Filtering<br />

Clean Speech<br />

Noisy Speech<br />

Enhanced Noise Speech<br />

Using Approach – III<br />

τ = 0.85<br />

Other more complicate<br />

Wiener filters<br />

25


The Effectives of Active Noise<br />

26


Cepstral Mean Normalization (CMN)<br />

• A Speech Enhancement Technique and sometimes<br />

called Cepstral Mean Subtraction (CMS)<br />

• CMN is a powerful and simple technique designed to<br />

handle conventional (Time-invariant linear filtering)<br />

distortions x[ n] = s[ n] ∗ h[ n]<br />

( ω ) S ( ω ) H ( ω )<br />

X =<br />

l<br />

2<br />

2<br />

2 l<br />

X = log SH = log S + log H = S +<br />

H<br />

l l<br />

l<br />

l<br />

( S + H ) = CS CH<br />

l 1 T −1<br />

l<br />

l<br />

l<br />

= ( CS t ) l<br />

+ CH = CS CH<br />

l<br />

CX = C<br />

+<br />

Time Domain<br />

Spectral Domain<br />

l 1 T −1<br />

CS = ∑ CS<br />

l<br />

t and CX ∑<br />

+<br />

t = 0<br />

t = 0<br />

T<br />

T<br />

if the training and testing speech materials were recored from two different channels<br />

l<br />

l<br />

l<br />

l<br />

l<br />

l<br />

l<br />

l<br />

l<br />

l<br />

() 1 = C( S + H(1) ) = CS + CH(1) , Testing : CX ( 2) = C( S + H( 2 ) ) = CS CH( 2 )<br />

Training : CX +<br />

l<br />

Log Power Spectral Domain<br />

Cepstral Domain<br />

CX<br />

CX<br />

l<br />

l<br />

l<br />

l<br />

( 1) − CX ( 1)<br />

= CS − CS<br />

l<br />

l<br />

l<br />

l<br />

( 2 ) − CX ( 2 ) = CS − CS<br />

The spectral characteristics of the microphone<br />

and room acoustics thus can be removed !<br />

Can be eliminated if the assumption of zero-mean speech contribution!<br />

27


Cepstral Mean Normalization (CMN)<br />

• Some Findings<br />

– Interesting, CMN has been found effective even the testing and<br />

training utterances are within the same microphone and<br />

environment<br />

• Variations <strong>for</strong> the distance between the mouth and the microphone<br />

<strong>for</strong> different utterances and speakers<br />

– Be careful that the duration/period used to estimate the mean<br />

of noisy speech<br />

• Why?<br />

28


Cepstral Mean Normalization (CMN)<br />

• Per<strong>for</strong>mance<br />

– For telephone recordings, where each call has different<br />

frequency response, the use of CMN has been shown to provide<br />

as much as 30 % relative decrease in error rate<br />

– When a system is trained on one microphone and tested on<br />

another, CMN can provide significant robustness<br />

Temporal (Modulation)<br />

Frequency<br />

29


Cepstral Mean Normalization (CMN)<br />

• CMN has been shown to improve the robustness not<br />

only to varying channels but also to the noise<br />

– White noise added at different SNRs<br />

– System trained with speech with the same SNR (matched<br />

Condition)<br />

Cepstral delta and delta-delta<br />

features are computed prior to the<br />

CMN operation so that they are<br />

unaffected.<br />

30


Cepstral Mean Normalization (CMN)<br />

• From the other perspective<br />

– We can interpret CMN as the operation of subtracting a low-pass<br />

temporal filter d[ n]<br />

, where all the T coefficients are identical and<br />

equal to 1 , which is a high-pass temporal filter<br />

T<br />

– Alleviate the effect of conventional noise introduced in the<br />

channel<br />

• Real-time Cepstral Normalization<br />

– CMN requires the complete utterance to compute the cepstral<br />

mean; thus, it cannot be used in a real-time system, and an<br />

approximation needs to be used<br />

– Based on the above perspective, we can implement other types<br />

of high-pass filters<br />

CX<br />

l<br />

t<br />

= α ⋅ CX<br />

l<br />

t + α<br />

l<br />

( 1 − ) ⋅ CX<br />

l<br />

t −1<br />

, ( CX t : cepstral mean)<br />

31


RASTA Temporal Filter Hyneck Hermansky, 1991<br />

• A Speech Enhancement Technique<br />

• RASTA (Relative Spectral)<br />

Assumption<br />

– The linguistic message is coded into movements of the vocal<br />

tract (i.e., the change of spectral characteristics)<br />

– The rate of change of non-linguistic components in speech often<br />

lies outside the typical rate of change of the vocal tact shape<br />

• E.g. fix or slow time-varying linear communication channels<br />

– A great sensitivity of human hearing to modulation frequencies<br />

around 4Hz than to lower or higher modulation frequencies<br />

Effect<br />

– RASTA Suppresses the spectral components that change more<br />

slowly or quickly than the typical rate of change of speech<br />

32


RASTA Temporal Filter<br />

• The IIR transfer function<br />

1<br />

C<br />

( )<br />

~<br />

−<br />

( )<br />

x<br />

z<br />

4 2 + z<br />

H z = = 0.1z ⋅<br />

C ( z)<br />

1−<br />

MFCC stream<br />

x<br />

−3<br />

− z − 2z<br />

−1<br />

0.98z<br />

c [ t]<br />

c ~ [ t]<br />

H(z)<br />

H(z)<br />

−4<br />

New<br />

MFCC stream<br />

Frame index<br />

H(z)<br />

• An other version<br />

c<br />

~<br />

H<br />

( z)<br />

−1<br />

−3<br />

2 + z − z − 2 z<br />

= 0.1⋅<br />

−1<br />

1 − 0.98 z<br />

RASTA has a peak at about<br />

4Hz (modulation frequency)<br />

[ t] = 0.98 ⋅ c<br />

~<br />

[ t − 1] + 0.2 ⋅ c[ t] + 0.1⋅<br />

c[ t − 1]<br />

− 0.1⋅<br />

c[ t − 2] + 0.2 ⋅ c[ t − 4]<br />

−4<br />

modulation frequency 100 Hz<br />

33


Retraining on Corrupted Speech<br />

• A Model-based Noise Compensation Technique<br />

• Matched-Conditions Training<br />

– Take a noise wave<strong>for</strong>m from the new environment, add it to all<br />

the utterance in the training database, and retrain the system<br />

– If the noise characteristics are known ahead of time, this method<br />

allow as to adapt the model to the new environment with<br />

relatively small amount of data from the new environment, yet<br />

use a large amount of training data<br />

34


Retraining on Corrupted Speech<br />

• Multi-style Training<br />

– Create a number of artificial acoustical environments by<br />

corrupting the clean training database with noise samples of<br />

varying levels (30dB, 20dB, etc.) and types (white, babble, etc.),<br />

as well as varying the channels<br />

– All those wave<strong>for</strong>ms (copies of training database) from multiple<br />

acoustical environments can be used in training<br />

35


Model Adaptation<br />

• A Model-based Noise Compensation Technique<br />

• The standard adaptation methods <strong>for</strong> speaker adaptation<br />

can be used <strong>for</strong> adapting speech recognizers to noisy<br />

environments<br />

– MAP (Maximum a Posteriori) can offer results similar to those of<br />

matched conditions, but it requires a significant amount of<br />

adaptation data<br />

– MLLR (Maximum Likelihood Regression) can achieve<br />

reasonable per<strong>for</strong>mance with about a minute of speech <strong>for</strong> minor<br />

mismatch. For severe mismatches, MLLR also requires a larger<br />

amount of adaptation data<br />

36


Signal Decomposition Using HMMs<br />

• A Model-based Noise Compensation Technique<br />

• Recognize concurrent signals (speech and noise)<br />

simultaneously<br />

– Parallel HMMs are used to model the concurrent signals and the<br />

composite signal is modeled as a function of their combined<br />

outputs<br />

• Three-dimensional Viterbi Search<br />

Noise HMM<br />

(especially <strong>for</strong><br />

non-stationary noise)<br />

Computationally Expensive<br />

<strong>for</strong> both Training and Decoding !<br />

Clean speech HMM<br />

37


Parallel Model Combination (PMC)<br />

• A Model-based Noise Compensation Technique<br />

• By using the clean-speech models and a noise model,<br />

we can approximate the distributions obtained by training<br />

a HMM with corrupted speech<br />

38


Parallel Model Combination (PMC)<br />

• The steps of Standard Parallel Model Combination (Log-<br />

Normal Approximation)<br />

Cepstral domain<br />

µ<br />

Σ<br />

c<br />

c<br />

Σ<br />

µ<br />

Clean speech HMM’s<br />

Noisy speech HMM’s<br />

µ ˆ<br />

c<br />

Σˆ<br />

c<br />

l<br />

l<br />

= C<br />

−1<br />

µ<br />

c<br />

= C Σ ( C<br />

−1 c −1<br />

)<br />

T<br />

c l<br />

µ ˆ = C µ ˆ<br />

ˆ c l T<br />

= C Σˆ<br />

C<br />

Σ<br />

Constraint: the estimate of<br />

variance is positive<br />

Log-spectral domain<br />

µ<br />

Σ<br />

l<br />

µ ˆ<br />

l<br />

l<br />

Σˆ<br />

l<br />

µ = exp µ<br />

i<br />

Σij = µ<br />

i<br />

µ<br />

j<br />

l l<br />

( + Σ 2)<br />

In linear spectral domain,<br />

the distribution is lognormal<br />

i<br />

ii<br />

l<br />

[ exp( Σ ) −1]<br />

Because speech and noise are<br />

independent and additive in the<br />

linear spectral domain<br />

ij<br />

1 Σˆ<br />

ii<br />

( ˆ µ<br />

i<br />

) − log( + 1)<br />

l<br />

ˆ µ<br />

i<br />

= log<br />

2<br />

ˆ µ i<br />

2<br />

Σˆ<br />

= log<br />

l<br />

ij<br />

ij ˆ µ ˆ µ<br />

( )<br />

Σˆ<br />

+ 1<br />

i<br />

j<br />

Σ<br />

Linear spectral domain<br />

µ<br />

Σ<br />

µ ˆ = gµ<br />

+ µ ~<br />

ˆ 2<br />

=<br />

g<br />

µ ˆ<br />

Σˆ<br />

Σ<br />

~<br />

+ Σ<br />

Noise HMM’s<br />

µ ~<br />

~<br />

Σ<br />

Log-normal<br />

approximation<br />

(Assume the new<br />

distribution is lognormal)<br />

39


Parallel Model Combination (PMC)<br />

• Modification-I: Per<strong>for</strong>m the model combination in the Log-<br />

Spectral Domain (the simplest approximation)<br />

– Log-Add Approximation: (without compensation of variances)<br />

( (<br />

l<br />

exp µ ) exp( µ<br />

l<br />

)<br />

l<br />

ˆ µ = log + ~<br />

• The variances are assumed to be small<br />

– A simplified version of Log-Normal approximation<br />

• Reduction in computational load<br />

• Modification-II: Per<strong>for</strong>m the model combination in the<br />

Linear Spectral Domain (Data-Driven PMC, DPMC, or<br />

Iterative PMC)<br />

– Use the speech models to generate noisy samples (corrupted<br />

speech observations) and then compute a maximum likelihood of<br />

these noisy samples<br />

– This method is less computationally expensive than standard<br />

PMC with comparable per<strong>for</strong>mance<br />

40


Parallel Model Combination (PMC)<br />

• Modification-II: Per<strong>for</strong>m the model combination in the<br />

Linear Spectral Domain (Data-Driven PMC, DPMC)<br />

Clean Speech HMM<br />

Noise HMM<br />

Noisy Speech HMM<br />

Cepstral domain<br />

Apply Monte Carlo<br />

simulation to draw random<br />

cepstral vectors<br />

(<strong>for</strong> example, at least 100 <strong>for</strong><br />

each distribution)<br />

Generating<br />

samples<br />

Linear spectral domain<br />

Domain<br />

trans<strong>for</strong>m<br />

41


Parallel Model Combination (PMC)<br />

• Data-Driven PMC<br />

42


Vector Taylor Series (VTS) P. J. Moreno,1995<br />

• A Model-based Noise Compensation Technique<br />

• VTS Approach<br />

– Similar to PMC, the noisy-speech-like models is generated by<br />

combining of clean speech HMM’s and the noise HMM<br />

– Unlike PMC, the VTS approach combines the parameters of<br />

clean speech HMM’s and the noise HMM linearly in the logspectral<br />

domain<br />

P ω = P ω P ω + P ω<br />

Power spectrum<br />

X<br />

X<br />

( )<br />

S<br />

( )<br />

H<br />

( )<br />

N<br />

( )<br />

= log( PS<br />

( ω) PH<br />

( ω) + PN<br />

( ω)<br />

)<br />

⎛ ⎛ P<br />

( ) ( )<br />

( ) ⎞<br />

⎜<br />

N<br />

ω ⎞<br />

= log PS<br />

ω PH<br />

ω ⎜<br />

⎟⎟<br />

1 +<br />

PS<br />

( ) PH<br />

( )<br />

⎝ ⎝ ω ω ⎠⎠<br />

logPN<br />

( ω) −logPS<br />

( ω) −logPH<br />

( ω)<br />

= logP<br />

( ω) + logP<br />

( ω) + log1+<br />

e<br />

l<br />

= S<br />

= S<br />

l<br />

l<br />

S<br />

+ H<br />

+ H<br />

l<br />

l<br />

+<br />

f<br />

H<br />

Log Power spectrum<br />

( )<br />

( )<br />

N −S<br />

−H<br />

+ e<br />

Non-linear function<br />

( ) ( ) ( )<br />

l l l<br />

l l l<br />

N −S<br />

−H<br />

S , H , N , wheref<br />

S , H , N = log1+<br />

e<br />

+ log1<br />

l<br />

l<br />

l<br />

l<br />

l<br />

l<br />

Is a vector<br />

function<br />

43


Vector Taylor Series (VTS)<br />

• The Taylor series provides a polynomial representation<br />

of a function in terms of the function and its derivatives at<br />

a point<br />

– Application often arises when nonlinear functions are employed<br />

and we desire to obtain a linear approximation<br />

– The function is represented as an offset and a linear term<br />

f<br />

f<br />

: R → R<br />

( x ) = f ( x ) + f ′( x )( x − x ) + f ′′( x )( x − x )<br />

0<br />

0<br />

1<br />

+ .... + f<br />

n!<br />

0<br />

( )<br />

( )( ) ( )<br />

n<br />

n<br />

n<br />

x x − x + o x − x<br />

0<br />

1<br />

2<br />

0<br />

0<br />

0<br />

0<br />

44


Vector Taylor Series (VTS)<br />

• Apply Taylor Series Approximation<br />

f<br />

l l l<br />

l l l<br />

l l l df<br />

( ) ( ) ( S<br />

0<br />

, H<br />

0<br />

, N<br />

0<br />

)<br />

, ,<br />

, ,<br />

(<br />

l l<br />

N S H ≅ f S<br />

0<br />

H<br />

0<br />

N<br />

0<br />

+<br />

S − S<br />

l<br />

0<br />

dS<br />

)<br />

l l l<br />

l l l<br />

df ( S<br />

0<br />

, H<br />

0<br />

, N<br />

0<br />

) (<br />

l<br />

l<br />

)<br />

df ( S<br />

0<br />

, H<br />

0<br />

, N<br />

0<br />

)<br />

+<br />

− +<br />

(<br />

l<br />

l<br />

H H<br />

N − N ) + .....<br />

dH<br />

– VTS-0: use only the 0th-order terms of Taylor Series<br />

– VTS-1: use only the 0th- and 1th-order terms of Taylor Series<br />

l l l<br />

– f S<br />

0<br />

, H<br />

0<br />

, N<br />

0 is the vector function evaluated at a particular<br />

vector point<br />

• If VTS-0 is used<br />

E<br />

u<br />

Σ<br />

l<br />

l l<br />

l l l<br />

[ X ] = E[ S + H + f ( S ,H ,N )]<br />

l l<br />

l l l<br />

= u [ ( )]<br />

s<br />

+ uh<br />

+ E f S ,H ,N<br />

l l<br />

l l l<br />

≅ u + u + E[ f ( u ,u ,u )]<br />

s h<br />

s h n<br />

l l<br />

l l l<br />

l<br />

≅ u + u + f ( u ,u ,u ) ( X<br />

l<br />

x<br />

l<br />

x<br />

s<br />

≅ Σ<br />

l<br />

s<br />

h<br />

+ Σ<br />

( )<br />

l<br />

h<br />

(if<br />

s<br />

S<br />

l<br />

h<br />

n<br />

and H<br />

To get the clean speech statistics<br />

l<br />

l<br />

is also Gaussian)<br />

u<br />

Σ<br />

are independent)<br />

0<br />

l<br />

x<br />

l<br />

x<br />

≅ u<br />

≅<br />

l<br />

s<br />

Σ<br />

+ g +<br />

l<br />

s<br />

dN<br />

If the channel filter is linear - time invariant,<br />

0-th order VTS<br />

we can regard it as a bias (constant) , g,<br />

in the log power spectrum domain<br />

f<br />

l<br />

l l<br />

l<br />

( u ,g ,u ) ( X is also Gaussian)<br />

s<br />

n<br />

0<br />

45


Vector Taylor Series (VTS)<br />

46


Retraining on Compensated <strong>Feature</strong>s<br />

• A Model-based Noise Compensation Technique that also<br />

Uses enhanced <strong>Feature</strong>s (processed by SS, CMN, etc.)<br />

– Combine speech enhancement and model compensation<br />

47


Principal Component Analysis<br />

• Principal Component Analysis (PCA) :<br />

– Widely applied <strong>for</strong> the data analysis and dimensionality reduction<br />

in order to derive the most “expressive” feature<br />

– Criterion:<br />

<strong>for</strong> a zero mean r.v. x∈R N , find k (k≤N) orthonormal vectors<br />

{e 1 , e 2 ,…, e k } so that<br />

– (1) var(e 1<br />

T<br />

x)=max 1<br />

(2) var(e iT x)=max i<br />

subject to e i ⊥ e i-1 ⊥…… ⊥e 1 1≤ i ≤k<br />

– {e 1 , e 2 ,…, e k } are in fact the eigenvectors<br />

of the covariance matrix (Σ x ) <strong>for</strong> x<br />

corresponding to the largest k eigenvalues<br />

– Final r.v y ∈R k : the linear trans<strong>for</strong>m<br />

(projection) of the original r.v., y=A T x<br />

A=[e 1 e 2 …… e k ]<br />

data<br />

Principal axis<br />

48


Principal Component Analysis<br />

49


Principal Component Analysis<br />

• Properties of PCA<br />

– The components of y are mutually uncorrelated<br />

E{y i y j }=E{(e iT x) (e jT x) T }=E{(e iT x) (x T e j )}=e iT E{xx T } e j =e iT Σ x e j<br />

= λ j e iT e j =0 , if i≠j<br />

∴ the covariance of y is diagonal<br />

– The error power (mean-squared error) between the original vector x<br />

and the projected x’ is minimum<br />

x=(e 1T x)e 1 + (e 2T x)e 2 + ……+(e kT x)e k + ……+(e NT x)e N<br />

x’=(e 1T x)e 1 + (e 2T x)e 2 + ……+(e kT x)e k (Note : x’∈R N )<br />

error r.v :<br />

x-x’= (e k+1T x)e k+1 + (e k+2T x)e k+2 + ……+(e NT x)e N<br />

E((x-x’) T (x-x’))=E((e k+1T x) e k+1T e k+1 (e k+1T x))+……+E((e NT x) e NT e N<br />

(e NT x))<br />

=var(e k+1T x)+ var(e k+2T x)+…… var(e NT x)<br />

= λ k+1 + λ k+2 +…… +λ N minimum<br />

50


PCA Applied in Inherently Robust <strong>Feature</strong>s<br />

• Application 1 : the linear trans<strong>for</strong>m of the original<br />

features (in the spatial domain)<br />

Original feature stream x t<br />

Frame index<br />

z t<br />

= A T x t<br />

A T A T A T A T<br />

The columns of A are the<br />

“first k” eigenvectors of Σ x<br />

trans<strong>for</strong>med feature<br />

stream z t<br />

Frame index<br />

51


PCA Applied in Inherently Robust <strong>Feature</strong>s<br />

• Application 2 : PCA-derived temporal filter<br />

(in the temporal domain)<br />

– The effect of the temporal filter is equivalent to the weighted sum of<br />

sequence of a specific MFCC coefficient with length L slid along the<br />

frame index<br />

⎡ x(1,1)<br />

⎤ ⎡ x(2,1)<br />

⎤ ⎡ x(3,1)<br />

⎤ ⎡ x(<br />

n,1)<br />

⎤ ⎡ x(<br />

N ,1) ⎤ → y<br />

1<br />

( m )<br />

B 1<br />

(z) ⎢<br />

x(1,2)<br />

⎥ ⎢<br />

x(2,2)<br />

⎥ ⎢<br />

x(3,2)<br />

⎥ ⎢<br />

x(<br />

n,2)<br />

⎥ ⎢<br />

x(<br />

N ,2)<br />

⎥<br />

→ y ( m )<br />

⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥<br />

2<br />

quefrency<br />

⎢ M ⎥ ⎢ M ⎥ ⎢ M ⎥ ⎢ M ⎥ ⎢ M ⎥ M<br />

B 2<br />

(z) ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ L ⎢ ⎥ L ⎢ ⎥<br />

⎢ x(1,<br />

k ) ⎥ ⎢ x(2,<br />

k ) ⎥ ⎢ x(3,<br />

k ) ⎥ ⎢ x(<br />

n,<br />

k ) ⎥ ⎢ x(<br />

N , k ) ⎥ → y<br />

k<br />

( m )<br />

⎢ M ⎥ ⎢ M ⎥ ⎢ M ⎥ ⎢ M ⎥ ⎢ M ⎥ M<br />

⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥<br />

x(1,<br />

K ) x(2,<br />

K ) x(3,<br />

K ) x(<br />

n,<br />

K ) x(<br />

N , K ) → y<br />

K<br />

( m )<br />

Original feature<br />

⎣ ⎦ ⎣ ⎦ ⎣ ⎦ ⎣ ⎦ ⎣ ⎦<br />

x () 1 x( 2) x() 3 L x( n ) L x( N )<br />

stream x B n<br />

(z)<br />

t<br />

z k<br />

(n)=[ y k<br />

(n) y k<br />

(n+1) y k<br />

(n+2) …… y k<br />

(n+L-1)] T<br />

N<br />

Frame index<br />

The impulse response of B k (z) is one of the<br />

∑ − L + 1<br />

µ 1<br />

=<br />

k<br />

( n)<br />

N − L + 1<br />

z<br />

zk n=<br />

1<br />

N L 1<br />

eigenvectors of the covariance <strong>for</strong> z k 1 T<br />

Σ =<br />

z ( n)<br />

− µ z n − µ<br />

L<br />

z k (1)<br />

z k (2)<br />

z k (3)<br />

z<br />

k<br />

N − L + 1<br />

The element in the new feature vector<br />

T<br />

( n,<br />

k) e ( 1) z ( n)<br />

xˆ =<br />

k<br />

k<br />

∑ − +<br />

n=<br />

1<br />

( )( ( ) )<br />

k<br />

From Dr. Jei-wei Hung<br />

z<br />

k<br />

k<br />

z<br />

k<br />

52


PCA Applied in Inherently Robust <strong>Feature</strong>s<br />

The frequency responses of the 15 PCA-derived temporal filters<br />

From Dr. Jei-wei Hung<br />

53


PCA Applied in Inherently Robust <strong>Feature</strong>s<br />

• Application 2 : PCA-derived temporal filter<br />

Mismatched<br />

condition<br />

Filter length<br />

L=10<br />

Matched<br />

condition<br />

From Dr. Jei-wei Hung<br />

54


PCA Applied in Inherently Robust <strong>Feature</strong>s<br />

• Application 3 : PCA-derived filter bank<br />

x 1<br />

x 2<br />

x 3<br />

Power spectrum<br />

obtained by DFT<br />

h 1<br />

h 2<br />

h 3<br />

h k<br />

is one of the<br />

eigenvectors<br />

of the covariance<br />

<strong>for</strong> x k<br />

From Dr. Jei-wei Hung<br />

55


PCA Applied in Inherently Robust <strong>Feature</strong>s<br />

• Application 3 : PCA-derived filter bank<br />

From Dr. Jei-wei Hung<br />

56


Linear Discriminative Analysis<br />

• Linear Discriminative Analysis (LDA)<br />

– Widely applied <strong>for</strong> the pattern classification<br />

– In order to derive the most “discriminative” feature<br />

– Criterion : assume w j , µ j and Σ j are the weight, mean and<br />

covariance of class j, j=1……N. Two matrices are defined as:<br />

Between - class covariance :<br />

Within - class covariance : S<br />

Find W=[w 1 w 2 ……w k ]<br />

such that<br />

T<br />

W SbW<br />

Wˆ = arg max<br />

W T<br />

W S W<br />

– The columns w j of W are the<br />

eigenvectors of S w<br />

-1<br />

S B<br />

having the largest eigenvalues<br />

w<br />

( µ − µ )( µ − µ )<br />

N T<br />

Sb<br />

= ∑ j = 1w<br />

j j<br />

j<br />

N<br />

w<br />

= ∑ j = 1w<br />

j<br />

Σ j<br />

57


Linear Discriminative Analysis<br />

The frequency responses of the 15 LDA-derived temporal filters<br />

From Dr. Jei-wei Hung<br />

58


Minimum Classification Error<br />

• Minimum Classification Error (MCE):<br />

– General Objective : find an optimal feature presentation or an<br />

optimal recognition model to minimize the expected error of<br />

classification<br />

– The recognizer is often operated under the following decision rule :<br />

C(X)=C i if g i (X,Λ)=max j g j (X,Λ)<br />

Λ={λ (i) } i=1……M (M models, classes), X : observations,<br />

g i (X,Λ): class conditioned likelihood function, <strong>for</strong> example,<br />

g i (X,Λ)=P(X|λ (i) )<br />

– Traditional Training Criterion :<br />

find λ (i) such that P(X|λ (i) ) is maximum (Maximum Likelihood) if X<br />

∈C i<br />

• This criterion does not always lead to minimum classification error,<br />

since it doesn't consider the mutual relationship between<br />

different classes<br />

• For example, it’s possible that P(X|λ (i) ) is maximum but X ∉C i<br />

59


Minimum Classification Error<br />

Threshold<br />

τ<br />

P ( LR ( k ) KW k<br />

∉ C k<br />

)<br />

P LR ( k )<br />

k<br />

( ∈ )<br />

KW k<br />

C k<br />

Type I<br />

error<br />

Type II<br />

error<br />

LR<br />

( k )<br />

Example showing histograms of the likelihood ratio<br />

when keyword KW ∈ and KW ∉<br />

k<br />

C k<br />

k<br />

C k<br />

LR<br />

( k )<br />

Type I error: False Rejection<br />

Type II error: False Alarm/False Acceptance<br />

60


Minimum Classification Error<br />

• Minimum Classification Error (MCE) (Cont.):<br />

– One <strong>for</strong>m of the class misclassification measure :<br />

d<br />

d<br />

d<br />

i<br />

i<br />

i<br />

() i<br />

( X ) −g<br />

X , λ<br />

⎡ 1<br />

⎢<br />

⎣ M −1<br />

( ) ( (<br />

() i<br />

+ log exp g X , λ ) α )<br />

= ∑<br />

( X ) ≥ 0 implies a misclassification (error = 1)<br />

( X ) < 0 implies a correct classification (error = 0)<br />

– A continuous loss function is defined as follows :<br />

l<br />

i<br />

( X , Λ) = l( d ( X ))<br />

i<br />

X ∈ C<br />

where the sigmoid function<br />

( d )<br />

j≠i<br />

– Classifier per<strong>for</strong>mance measure :<br />

i<br />

l<br />

1<br />

=<br />

1 + exp<br />

( − γ d + θ )<br />

⎤<br />

⎥<br />

⎦<br />

1<br />

α<br />

X<br />

∈C<br />

i<br />

L<br />

( Λ)<br />

= E [ L( X , Λ)<br />

] = l ( X , Λ) δ( X ∈ C )<br />

X<br />

∑∑<br />

X<br />

M<br />

i=<br />

1<br />

i<br />

i<br />

61


Minimum Classification Error<br />

• Using MCE in model training :<br />

– Find Λ such that<br />

Λ ˆ = arg min L<br />

Λ<br />

( Λ) = arg min E [ L( X , Λ)<br />

]<br />

Λ<br />

the above objective function in general cannot be minimized<br />

directly but the local minimum can be achieved using gradient<br />

decent algorithm<br />

( )<br />

∂L<br />

Λ<br />

wt+ = wt<br />

− ε , w : an arbitrary parameter of<br />

∂w<br />

• Using MCE in robust feature representation<br />

ˆ<br />

( f )<br />

f = argminE<br />

L( f ( X ),<br />

Λ )<br />

f<br />

Note:<br />

X<br />

1 Λ<br />

f<br />

X<br />

[ ]<br />

: a trans<strong>for</strong>m of the originalfeature X<br />

whilefeature presentation is changed, the modelis also changed accordingly<br />

62

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!