
Dynamic Time Warping (DTW) for Single Word and Sentence Recognizers

Reference: Huang et al., Ch. 8.2.1; Waibel/Lee, Ch. 4
(some papers online at http://csl.ira.uka.de > MMMK)


Simplified Decoding

[Pipeline diagram] Speech → Feature extraction → Speech features → Trained phoneme classifiers → Decision (e.g. /h/, ...) → Hypotheses (phonemes), e.g.
/h/ /e/ /l/ /o/  /w/ /o/ /r/ /l/ /d/


Simplified Classifier Training

[Pipeline diagram] Aligned speech (e.g. /h/ /e/ /l/ /o/) → Feature extraction → Aligned speech features → Train classifier (e.g. /e/) → Improved classifiers

Train classifier: use the aligned speech vectors (e.g. all frames of phoneme /e/) to train the reference vectors of /e/ (= codebook):
- k-means
- LVQ


Classification

Xuedong Huang, Alex Acero, Hsiao-Wuen Hon: Spoken Language Processing, Chapter 4
&
R. Duda and P. Hart: Pattern Classification and Scene Analysis. John Wiley & Sons, 2000 (2nd edition)


Pattern Recognition Review

• Static patterns, no dependence on time or sequential order
• Approaches
  – Knowledge-based approaches:
    1. Compile knowledge
    2. Build decision trees
  – Connectionist approaches:
    1. Automatic knowledge acquisition, "black-box" behavior
    2. Simulation of biological processes
  – Statistical approaches:
    1. Build a statistical model of the "real world"
    2. Compute probabilities according to the models


Pattern Recognition

[Taxonomy diagram] Pattern recognition splits into statistical and knowledge-based / connectionist approaches; the statistical approaches are further divided into supervised vs. unsupervised, parametric vs. nonparametric, and linear vs. nonlinear.


Pattern Recognition

• Important notions
  – Supervised vs. unsupervised classifiers
  – Parametric vs. non-parametric classifiers
  – Linear vs. non-linear classifiers
• Classical statistical methods
  – Bayes classifier
  – K-nearest neighbor
• Connectionist methods
  – Perceptron
  – Multilayer perceptrons


Supervised vs. Unsupervised

• Supervised training: the class to be recognized is known for each sample in the training data. Requires a priori knowledge of useful features and knowledge/labeling of each training token (cost!).
• Unsupervised training: the class is not known and the structure is to be discovered automatically. Feature-space-reduction examples: clustering, auto-associative nets.


Unsupervised Classification

[Scatter plot of unlabeled data points in the F1-F2 feature plane]

• Classification:
  – Classes not known: find structure


Unsupervised Classification

[Scatter plot of unlabeled data points in the F1-F2 feature plane]

• Classification:
  – Classes not known: find structure
  – Clustering
  – How to cluster? How many clusters?


Supervised Classification

[Scatter plot over the features Age (x-axis) and Income (y-axis), with credit-worthy and non-credit-worthy samples]

• Classification:
  – Classes known: creditworthiness: yes/no
  – Features: income, age
  – Classifiers


Classifier Design in Practice

• Need the a priori probability P(ω_i) (not too bad)
• Need the class-conditional PDF p(x|ω_i)
• Problems:
  – limited training data
  – limited computation
  – class-labeling is potentially costly and prone to error
  – classes may not be known
  – good features not known
• Parametric solution:
  – Assume that p(x|ω_i) has a particular parametric form
  – Most common representative: multivariate normal density
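For reference, these two quantities feed into the standard Bayes decision rule (a textbook result, stated here for completeness):

```latex
\hat{\omega} \;=\; \arg\max_{\omega_i} P(\omega_i \mid x)
          \;=\; \arg\max_{\omega_i} p(x \mid \omega_i)\, P(\omega_i)
```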


Mixtures of Gaussian Densities

Often the shape of the set of vectors that belong to one class does not look like what can be modeled by a single Gaussian.
A (weighted) sum of Gaussians can approximate many more densities.
In general, a class can be modeled as a mixture of Gaussians:
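The standard mixture form being referred to is:

```latex
p(x \mid \omega_i) \;=\; \sum_{k=1}^{K} c_k \,\mathcal{N}(x;\, \mu_k, \Sigma_k),
\qquad c_k \ge 0,\quad \sum_{k=1}^{K} c_k = 1
```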


Gaussian Densities

• The most frequently used model for (preprocessed) speech signals is the Gaussian density.
• Often the "size" of the parameter space is measured in the "number of densities".
• A multivariate Gaussian density has the parameters:
  • The mean vector µ (a vector with d coefficients)
  • The covariance matrix Σ (a symmetric d×d matrix); if the components of x are independent, Σ is diagonal
  • The determinant of the covariance matrix, |Σ|
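The standard d-dimensional Gaussian density being referred to is:

```latex
\mathcal{N}(x;\, \mu, \Sigma) \;=\; \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
\exp\!\Big(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\Big)
```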


Codebooks of Reference Vectors

Definition: a collection of reference vectors is called a codebook.

Ways to find (good) codebooks:
• A) Simple but suboptimal: use the centers of equally sized hypercubes
• B) Possibly independent of the data: build a tree structure of the feature space
• C) Run the k-means (or LBG) algorithm
• D) Use the "learning vector quantization" (LVQ) algorithm


Finding Codebooks of Reference Vectors

Tree-Structured Feature Space
1. Initialize one node for all data
2. Compute the direction of the largest deviation for all nodes
3. Split the node with the largest deviation in two (hyperplane through the mean)
4. If not yet satisfied, go to step 2

Quantization: starting at the root node, compare vector x with the current node's hyperplane and descend to the node that corresponds to the side of the hyperplane on which x lies.


Finding Codebooks of Reference Vectors

The k-Means Algorithm
1. Initialize: given a value of k and sample vectors v_1, ..., v_T, initialize any k means (e.g. µ_i = v_i)
2. Nearest-neighbor classification: assign every vector v_i to its nearest mean (class centroid) µ_f(i)
3. Codebook update: replace every mean µ_i by the mean of all sample vectors that have been assigned to it
4. Iteration: if not yet satisfied, go to step 2

Possible stop criteria:
• A fixed number of iterations
• The average (or maximum) distance |v_i - µ_f(i)| is below a fixed value
• The change of the distance is below a fixed value (nothing happens any more)

Notes:
• Theoretically, k-means can only converge to a local optimum
• Initialization is often critical - repeat k-means for several initial codebook sets
• OR: Linde-Buzo-Gray (LBG) algorithm, i.e. start with a 1-vector codebook and use a splitting algorithm to obtain a 2-vector, ..., M-vector codebook
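A minimal Python/NumPy sketch of this codebook training loop (function and variable names are mine, not from the slides; Euclidean distance and a fixed iteration count as the stop criterion):

```python
import numpy as np

def kmeans_codebook(vectors, k, n_iter=20, seed=0):
    """Train a codebook of k reference vectors with plain k-means.

    vectors: array of shape (T, d), the aligned speech feature vectors
    returns: array of shape (k, d), the reference vectors (codebook)
    """
    rng = np.random.default_rng(seed)
    # Step 1: initialize the k means with randomly chosen sample vectors
    means = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()

    for _ in range(n_iter):
        # Step 2: nearest-neighbor classification of every vector
        dists = np.linalg.norm(vectors[:, None, :] - means[None, :, :], axis=2)
        assign = dists.argmin(axis=1)

        # Step 3: codebook update, replace each mean by the mean of its vectors
        for i in range(k):
            members = vectors[assign == i]
            if len(members) > 0:
                means[i] = members.mean(axis=0)
        # Step 4: iterate (here: a fixed number of iterations)
    return means
```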


Problems of Classifier Design

• Features:
  – What and how many features should be selected?
  – Any features?
  – The more the better?
  – If additional features are not useful (same mean and covariance), will the classifier automatically ignore them?


Curse of Dimensionality

• Adding more features
  – Adding independent features may help
  – BUT: adding non-discriminative features may lead to worse performance!
• Reason:
  – Training data vs. number of parameters
  – Limited training data
• Solution:
  – Select features carefully
  – Reduce dimensionality
  – Principal Component Analysis


Trainability

• Two-phoneme classification example (Huang et al.), phonemes modeled by Gaussian mixtures
• Parameters are trained with a varied set of training samples


Simplified Decoding and Training

[Combined pipeline diagram]
Decoding: Speech → Feature extraction → Speech features → Decision (apply trained classifiers, e.g. /h/, ...) → Hypotheses (phonemes), e.g. /h/ /e/ /l/ /o/  /w/ /o/ /r/ /l/ /d/
Training: Aligned speech (e.g. /h/ /e/ /l/ /o/) → Feature extraction → Aligned speech features → Train classifier (e.g. /e/) → Improved classifiers
Train codebook: k-means, LVQ


Where do we go from here?

Dynamic Programming and Single Word Recognition
• Comparing complete utterances
• Problems: endpoint detection and speech detection
• Time warping
• The minimal editing distance problem
• Dynamic programming
• Dynamic time warping (DTW)
• Constraints for the DTW path
• The DTW search space
• DTW with beam search
• The principles of building speech recognizers
• Isolated word recognition with template matching


Compare Complete Utterances

What we had so far:
• Record a sound signal
• Compute a frequency representation
• Quantize/classify vectors

We now have:
• A sequence of pattern vectors

What we want:
• The similarity between two such sequences

Obviously: the order of the vectors is important!


Comparing Complete Utterances

Comparing speech vector sequences has to overcome three problems:
1) Speaking rate characterizes speakers (speaker dependent!): if the speaker speaks faster, we get fewer vectors
2) Changing the speaking rate on purpose: e.g. talking to a foreign person
3) Changing the speaking rate unintentionally: speech disfluencies

So we have to find a way to decide which vectors to compare with one another:
• Impose some constraints (comparing every vector to all others is too costly)


Endpoint Detection

When comparing two recorded utterances we face two problems:
1) When does the speech begin?
  • We might not have any mechanism to signal the recognizer when it should listen.
2) Varying length of utterances
  • Utterances might be of different lengths (speaking rate, ...)
  • One or both utterances can be preceded or followed by a period of (possibly unintentionally recorded) silence


Speech Detection

Solution to problem (1): when does speech begin?
A: Push-to-talk scenario: only listen when the user pushes a button
B: Always-on scenario: always listen, only consider speech regions

Selecting speech regions:
Use the SNR: works well if SNR > 30 dB, otherwise problematic
Compute the signal power p[i..j] = Σ_{k=i..j} s[k]², then apply a threshold t to detect speech
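A minimal sketch of this power-threshold detector in Python/NumPy (frame length, hop size and threshold are illustrative choices, not from the slides):

```python
import numpy as np

def detect_speech(signal, frame_len=400, hop=160, threshold=1e-3):
    """Mark frames whose short-time power exceeds a fixed threshold t.

    signal: 1-D array of samples s[k]
    returns: list of (start_sample, end_sample) pairs judged to be speech
    """
    starts = range(0, max(len(signal) - frame_len, 1), hop)
    # per-frame power: mean of s[k]^2 over the frame (normalized sum)
    power = np.array([np.mean(signal[s:s + frame_len] ** 2) for s in starts])
    voiced = power > threshold

    regions, begin = [], None
    for idx, v in enumerate(voiced):
        if v and begin is None:
            begin = idx * hop
        elif not v and begin is not None:
            regions.append((begin, idx * hop + frame_len))
            begin = None
    if begin is not None:
        regions.append((begin, len(signal)))
    return regions
```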


Approaches to Alignment of Vector Sequences

A first idea to overcome problem (2), the varying length of utterances:
1. Normalize the sequence length
2. Make a linear alignment between the two sequences

Linear alignment can handle the problem of different speaking rates.
But ... what about varying speaking rates within an utterance?


Approaches to Alignment of Vector Sequences

• Linear alignment can handle the problem of different speaking rates.
• But: it cannot handle the problem of varying speaking rates within the same utterance.
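For illustration, a tiny sketch of such a linear (proportional) alignment between sequences of length n and m (names are mine, not from the slides):

```python
def linear_alignment(n, m):
    """Map each index i of an n-frame sequence to an index j of an m-frame
    sequence by stretching or compressing the time axis linearly."""
    if n == 1:
        return [(0, 0)]
    return [(i, round(i * (m - 1) / (n - 1))) for i in range(n)]

# Example: aligning a 5-frame utterance to an 8-frame utterance
print(linear_alignment(5, 8))  # [(0, 0), (1, 2), (2, 4), (3, 5), (4, 7)]
```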


Alignment of Speech Vectors May Be Non-Bijective

Given: two sequences x_1, x_2, ..., x_n and y_1, y_2, ..., y_m
Wanted: an alignment relation R (not a function), where (i,j) is in R iff x_i is aligned with y_j.

[Figure: the two sequences drawn over index axes 1 2 3 4 5 6 7 8 9 ... with alignment links]

It is possible that ...
... more than one x is aligned to the same y (e.g. x_3, x_4)
... more than one y is aligned to the same x (e.g. y_8, y_9)
... an x or a y has no alignment partner at all (e.g. y_6)


Solution: Time Warping

Given: two sequences x_1, x_2, ..., x_n and y_1, y_2, ..., y_m
Wanted: an alignment relation R, where (i,j) is in R iff x_i is aligned with y_j, i.e. we are looking for a common time axis.

[Figure: both sequences warped onto a common time axis; y_8, y_9 align to the same x; y_6 has no partner; x_3, x_4 align to the same y]


What is the best alignment relation R?

Distance measure between two utterances:
For a given path R(i,j), the distance between x and y is the sum of all local distances d(x_i, y_j).

In our example:
d(x_1,y_1) + d(x_2,y_2) + d(x_3,y_3) + d(x_4,y_3) + d(x_5,y_4) + d(x_6,y_5) + d(x_7,y_7) + ...

Question: how can we find a path that gives the minimal overall distance?


Utterance Comparison by Dynamic Time Warping

How can we apply the DP algorithm for the minimal editing distance to the utterance comparison problem?

Differences and questions:
• What do editing steps correspond to?
• We "never" really get two identical vectors.
• We are dealing with continuous, not discrete, signals here.

Answers:
• We can delete/insert/substitute vectors. Define a cost for deletion/insertion; define the substitution cost as the distance between the vectors.
• No two vectors are the same? So what.
• Continuous signals => we get continuous distances (no big deal).

The DTW algorithm:
• Works like the minimal editing distance algorithm
• Minor modification: allow different kinds of steps (different predecessors of a state)
• Use a vector-vector distance measure as the cost function


Dynamic Programming

How can we find the minimal editing distance?

Greedy algorithm?
• Always perform the step that is currently the cheapest.
• If there is more than one cheapest step ...
• ... take any one of them.
Obvious: this cannot be guaranteed to lead to the optimal solution.

Solution: Dynamic Programming (DP)
DP is frequently used in operations research, where consecutive decisions depend on each other and their sequence must lead to an optimal result.


Key Idea of Dynamic Programming

The key idea of DP is:
If we would like to take our system into a state s_i, and
• we know the costs c_1, ..., c_k for the optimal ways to get from the start to all states q_1, ..., q_k from which we can go to s_i,
• then the optimal way to s_i goes over the state q_l where l = argmin_j c_j.


The Dynamic Programming Matrix (1)

To find the minimal editing distance from x_1, x_2, ..., x_n to y_1, y_2, ..., y_m we can define an algorithm inductively:

Let C(i,j) denote the minimal editing distance from x_1, x_2, ..., x_i to y_1, y_2, ..., y_j.

Then we get:
• C(0,0) = 0 (no characters, no editing)
• C(i,j) is the smallest of:
  • C(i-1, j-1) plus the cost for replacing x_i with y_j
  • C(i-1, j) plus the cost for deleting x_i
  • C(i, j-1) plus the cost for inserting y_j


The Dynamic Programming Matrix (2)

Usually, for the minimal editing distance:
• The cost for deleting or inserting a character is 1
• The cost for replacing x_i with y_j is 0 (if x_i = y_j) or 1 (else)
• It might be useful to define other costs for special purposes

Eventually:
• Remember for each state which predecessor was the best (backpointer)
• Find the sequence of editing steps by backtracking the predecessor pointers from the final state
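A minimal Python sketch of this recursion with backpointers and the unit costs described above (names are mine):

```python
def min_edit_distance(x, y):
    """Fill the DP matrix C and backtrace the editing steps from x to y."""
    n, m = len(x), len(y)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):          # only deletions in the first column
        C[i][0], back[i][0] = i, ("del", i - 1, 0)
    for j in range(1, m + 1):          # only insertions in the first row
        C[0][j], back[0][j] = j, ("ins", 0, j - 1)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = C[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else 1)
            dele = C[i - 1][j] + 1
            ins = C[i][j - 1] + 1
            C[i][j], back[i][j] = min(
                (sub, ("sub", i - 1, j - 1)),
                (dele, ("del", i - 1, j)),
                (ins, ("ins", i, j - 1)),
            )

    # backtrace the predecessor pointers from the final state (n, m)
    steps, i, j = [], n, m
    while (i, j) != (0, 0):
        op, i, j = back[i][j]
        steps.append(op)
    return C[n][m], list(reversed(steps))

print(min_edit_distance("BRAKES", "BAKERY"))  # distance 3
```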


The Minimal Editing Distance Problem

Given: two character sequences (words) x_1, x_2, ..., x_n and y_1, y_2, ..., y_m
Wanted: the minimal number (and sequence) of editing steps that are needed to convert x to y

The editing cursor starts at the beginning of x; an editing step can be one of:
• Delete the character x_i under the cursor
• Insert a character at the cursor position
• Replace the character x_i at the cursor position with y_j
• Move the cursor to the next character (no editing); we can't go back

Example: convert x = "BRAKES" to y = "BAKERY".
One possible solution:
• B = B, move the cursor to the next character
• Delete character x_2 = R
• A = A, move the cursor to the next character; K = K, move; E = E, move
• Replace character x_6 = S with character y_5 = R
• Insert character y_6 = Y (the sequence is not necessarily unique)

This is often referred to as the Levenshtein distance. Keep it in mind; we will revisit it when we talk about how to measure the performance (word accuracy) of a recognizer.


Levenshtein

[DP matrix: columns B R A K E S, rows B A K E R Y]
Step: B = B, move to next


Levenshtein

[DP matrix: columns B R A K E S, rows B A K E R Y]
Step: delete character x_2 = R


Levenshtein

[DP matrix: columns B R A K E S, rows B A K E R Y]
Step: A = A, move to next


Levenshtein

[DP matrix: columns B R A K E S, rows B A K E R Y]
Step: K = K, move to next


Levenshtein

[DP matrix: columns B R A K E S, rows B A K E R Y]
Step: E = E, move to next


Levenshtein

[DP matrix: columns B R A K E S, rows B A K E R Y]
Step: replace character x_6 = S with character y_5 = R


Levenshtein

[DP matrix: columns B R A K E S, rows B A K E R Y]
Step: insert character y_6 = Y


Levenshtein

The sequence is not necessarily unique!
[DP matrix: columns B R A K E S, rows B A K E R Y]
Alternative: insert character y_5 = R, then replace character x_6 = S with y_6 = Y


What we could do already

• We can build a first simple isolated-word recognizer using DTW.
• We can build a preprocessor such that recorded speech can be processed by the recognizer.
• We can recognize speech using DTW and print the score for each of its reference patterns.

Example:
• Build a recognizer that can recognize two words w_1 and w_2
• Collect training examples (in real life: a lot of data)
• Skip the optimization phase (no development set needed)
• Collect evaluation data (a few examples per word)
• Run tests on the evaluation data and report the results


Compare Complete Utterances

[Figure: the hypothesis (unknown utterance) is compared against the reference patterns for word 1 and word 2]


Isolated Word Recognition with Template Matching

• For each word in the vocabulary, store at least one reference pattern
• When multiple reference patterns are available, either use all of them or compute an average
• During recognition:
  • Record a spoken word
  • Perform pattern matching with all stored patterns (or at least with those that can be used in the current context)
  • Compute a DTW score for every vocabulary word (when using multiple references, compute one score out of many, e.g. average or max)
  • Recognize the word with the best DTW score
• This approach only works for very small vocabularies and/or for speaker-dependent recognition
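A minimal sketch of this recognition loop, assuming some DTW scoring function dtw_distance(a, b) is available (for instance the one sketched further below); all names are illustrative:

```python
def recognize(utterance, templates, dtw_distance):
    """Template matching: return the vocabulary word whose reference
    pattern(s) give the best (smallest) DTW score.

    utterance:    (N, d) array of feature vectors of the spoken word
    templates:    dict mapping word -> list of (M, d) reference patterns
    dtw_distance: function(pattern_a, pattern_b) -> accumulated distance
    """
    scores = {}
    for word, refs in templates.items():
        # one score out of many references; here: the minimum (best match)
        scores[word] = min(dtw_distance(utterance, ref) for ref in refs)
    best_word = min(scores, key=scores.get)
    return best_word, scores
```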


What we can't do yet
OR: Problems with Pattern Matching

• Need endpoint detection
• Need a collection of reference patterns (inconvenient for the user)
• Only works well for speaker-dependent recognition (difficult to cover variations)
• High computational effort (esp. for large vocabularies), proportional to vocabulary size
• Large vocabulary also means: need a huge amount of training data, since we need training samples for each word
• Difficult to train suitable references (or sets of references)
• Poor performance when the environment changes
• Unsuitable where the speaker is unknown and no training is feasible
• Unsuitable for continuous speech, coarticulation (combinatorial explosion of possible patterns)
• Impossible to recognize untrained words
• Difficult to train/recognize subword units


Make a Wish

• We would like to work with speech units shorter than words
  -> each subword unit occurs often, training is easier, less data needed
• We want to recognize speech from any speaker, without prior training
  -> store "speaker-independent" references
• We want to recognize continuous speech, not only isolated words
  -> handle coarticulation effects, handle sequences of words
• We would like to be able to recognize words that have not been trained
  -> train subword units and compose any word out of these (vocabulary independence)
• We would prefer a sound mathematical foundation
• Solution (particularly successful for ASR): Hidden Markov Models


Approaches to Alignment of Vector Sequences

A first idea to overcome problem (2), the varying length of utterances:
1. Normalize the sequence length
2. Make a linear alignment between the two sequences

Linear alignment can handle the problem of different speaking rates.
But ... what about varying speaking rates within an utterance?


Approaches to Alignment of Vector Sequences

Linear alignment can handle the problem of different speaking rates.
But: it cannot handle the problem of varying speaking rates within the same utterance.


Alignment of Speech Vectors May Be Non-Bijective

Given: two sequences x_1, x_2, ..., x_n and y_1, y_2, ..., y_m
Wanted: an alignment relation R (not a function), where (i,j) is in R iff x_i is aligned with y_j.

[Figure: the two sequences drawn over index axes 1 2 3 4 5 6 7 8 9 ... with alignment links]

It is possible that ...
... more than one x is aligned to the same y (e.g. x_3, x_4)
... more than one y is aligned to the same x (e.g. y_8, y_9)
... an x or a y has no alignment partner at all (e.g. y_6)


Solution: Time Warping

Given: two sequences x_1, x_2, ..., x_n and y_1, y_2, ..., y_m
Wanted: an alignment relation R, where (i,j) is in R iff x_i is aligned with y_j, i.e. we are looking for a common time axis.

[Figure: both sequences warped onto a common time axis; y_8, y_9 align to the same x; y_6 has no partner; x_3, x_4 align to the same y]


What is the best alignment relation R?

Distance measure between two utterances:
For a given path R(i,j), the distance between x and y is the sum of all local distances d(x_i, y_j).

In our example:
d(x_1,y_1) + d(x_2,y_2) + d(x_3,y_3) + d(x_4,y_3) + d(x_5,y_4) + d(x_6,y_5) + d(x_7,y_7) + ...

Question: how can we find a path that gives the minimal overall distance?


Dynamic Programming

How can we find the minimal editing distance?

Greedy algorithm?
• Always perform the step that is currently the cheapest.
• If there is more than one cheapest step ...
• ... take any one of them.
Obvious: this cannot be guaranteed to lead to the optimal solution.

Solution: Dynamic Programming (DP)
DP is frequently used in operations research, where consecutive decisions depend on each other and their sequence must lead to an optimal result.


Key Idea of Dynamic Programming

The key idea of DP is:
If we would like to take our system into a state s_i, and
• we know the costs c_1, ..., c_k for the optimal ways to get from the start to all states q_1, ..., q_k from which we can go to s_i,
• then the optimal way to s_i goes over the state q_l where l = argmin_j c_j.


DTW Review

Goal: identify the example pattern that is most similar to the unknown input; compare patterns of different lengths.
Note: all patterns are preprocessed into 100 vectors per second of speech.

DTW: find the alignment between the unknown input and the example pattern that minimizes the overall distance.
Find the average vector distance, but between which frame pairs?

[Figure: DTW grid with one example pattern (frames t_1 ... t_M) on one axis and the unknown input pattern (frames t_1 ... t_N) on the other; local frame-to-frame costs are Euclidean distances]


DTW Review

Apply a transition function.

[Figure: DTW grid between Example Pattern 1 (t_1 ... t_M) and the unknown pattern (t_1 ... t_N)]

Editing distance, as used for spell-checkers:
C(x,y) = min(
    C(x-1, y)   + d_ins(x,y),
    C(x-1, y-1) + d_sub(x,y),
    C(x, y-1)   + d_del(x,y)
)

Weighted distance for speech recognition: the same recursion, but the transitions carry weights (the slide shows a step pattern with weights 1 on the diagonal step and 2 on the other steps).
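A minimal Python/NumPy sketch of DTW with this recursion, using Euclidean local distances and the unweighted symmetric step pattern (all names are mine):

```python
import numpy as np

def dtw_distance(a, b):
    """Accumulated DTW distance between two feature-vector sequences.

    a: (N, d) array, b: (M, d) array; local cost = Euclidean distance.
    """
    N, M = len(a), len(b)
    # local distance matrix d(x_i, y_j)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    C = np.full((N + 1, M + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            # cumulative cost = local distance + best predecessor
            C[i, j] = d[i - 1, j - 1] + min(C[i - 1, j],      # "insertion"
                                            C[i - 1, j - 1],  # "substitution"
                                            C[i, j - 1])      # "deletion"
    return C[N, M]

# Example: two short random "utterances" of different lengths
rng = np.random.default_rng(0)
x, y = rng.normal(size=(12, 3)), rng.normal(size=(9, 3))
print(dtw_distance(x, y))
```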


The Dynamic Programming Matrix (2)

Usually, for the minimal editing distance:
• The cost for deleting or inserting a character is 1
• The cost for replacing x_i with y_j is 0 (if x_i = y_j) or 1 (else)
• It might be useful to define other costs for special purposes

Eventually:
• Remember for each state which predecessor was the best (backpointer)
• Find the sequence of editing steps by backtracking the predecessor pointers from the final state
60


<strong>DTW</strong> <strong>for</strong> <strong>Single</strong> <strong>Word</strong> <strong>and</strong> <strong>Sentence</strong> Recognizers<br />

<strong>DTW</strong>-Steps<br />

Many different warping steps are possible <strong>and</strong> have been used.<br />

Examples:<br />

General rule is:<br />

Cumulative cost of<br />

destination =<br />

best-of (cumulative cost<br />

of source<br />

+ cost of step<br />

+ distance in destination)<br />

Myers <strong>and</strong> Sakoe/Chiba<br />

evaluated per<strong>for</strong>mance<br />

<strong>and</strong> speed of <strong>DTW</strong> <strong>for</strong> a<br />

wide range of<br />

parameters: only small<br />

differences in<br />

per<strong>for</strong>mance <strong>for</strong> a fairly<br />

wide range<br />

symmetric<br />

(editing distance)<br />

61<br />

Bakis<br />

Itakura weighted


DTW Optimization

How to save computation time? Solution: add constraints
- Define start and end points
- The path should be close to the diagonal
- No two successive horizontal or vertical steps

[Figure: DTW grid between Example Pattern 1 (t_1 ... t_M) and the unknown pattern (t_1 ... t_N); shaded fields cannot be reached and do not need to be computed (inactive)]


Constraints for the DTW Path

• Endpoint constraints: we want the path not to skip a part at the beginning or end of an utterance
• Monotonicity conditions: we can't go back in time (for either utterance)
• Local continuity: possible types of motion (slope and direction) of the path, no jumps etc.
• Global path constraints: the path should be close to the diagonal
• Slope weighting: the DTW path should be somewhat "smooth"

BTW: for most word recognition applications the warping path itself need not be computed; only the accumulated distance D is required.


DTW with Beam Search

Idea: do not consider steps out of states that have "too high" cumulative distances.

Approaches:
• "Expand" only a fixed number of states per column of the DTW matrix
• Expand only states whose cumulative distance is less than a factor (the beam) times the best distance so far
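A minimal sketch of the second variant, pruning every cell whose cumulative distance exceeds the beam factor times the best distance in the current column (the beam value and all names are illustrative):

```python
import numpy as np

def dtw_beam(a, b, beam=2.0):
    """Column-synchronous DTW where, in every column, cells worse than
    beam * (best score in that column) are deactivated (set to inf)."""
    N, M = len(a), len(b)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    col = np.cumsum(d[0, :])              # first frame may align to a run of b-frames
    for i in range(1, N):
        new = np.full(M, np.inf)
        for j in range(M):
            preds = [col[j]]               # horizontal step
            if j > 0:
                preds += [col[j - 1], new[j - 1]]   # diagonal and vertical steps
            p = min(preds)
            if np.isfinite(p):             # skip cells whose predecessors were pruned
                new[j] = d[i, j] + p
        # beam pruning: deactivate everything far worse than the column's best
        new[new > beam * new.min()] = np.inf
        col = new
    return col[M - 1]
```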


What we could do already

• We can build a first simple isolated-word recognizer using DTW.
• We can build a preprocessor such that recorded speech can be processed by the recognizer.
• We can recognize speech using DTW and print the score for each of its reference patterns.

Example:
• Build a recognizer that can recognize two words w_1 and w_2
• Collect training examples (in real life: a lot of data)
• Skip the optimization phase (no development set needed)
• Collect evaluation data (a few examples per word)
• Run tests on the evaluation data and report the results


Single Word Recognizer

[Figure: the hypothesis (unknown utterance) is compared against the reference patterns for word 1 and word 2; the better-matching reference determines the recognized word]


Search in Automatic Speech Recognition

Isolated word recognition:
• For every applicable vocabulary word:
  • Compute its score (DTW)
  • Compute its likelihood (forward algorithm)
• Report the word with the best likelihood / score
Problems, optimization, drawbacks

Continuous speech recognition:
• Not only find the state sequence within words ...
  ... BUT ...
• ... also find the word sequence (i.e. a sequence through the huge state graph)
Problems, strategies, optimization, drawbacks


DTW Optimization (2)

Pruning beam: eliminate losers (score > best score + beam width)

[Figure: DTW matrices of the unknown pattern against reference patterns 1-3. In each frame, the best-scoring field is marked; fields with such a bad score that it is safe to "deactivate" them are pruned; fields with no active predecessors do not need to be computed; a word with no remaining active fields is no longer "active".]


DTW Summary

• Optimization of DTW:
  • Usually we are only interested in the final score
  • The algorithm only requires the values in the current and previous frame
  • Keep it simple: do not allocate new storage, but overwrite what is no longer needed
  • For most transition patterns, one previous frame is enough
• Drawbacks of DTW:
  • Does not generalize
  • Speaker dependent
  • Need example(s) for each word from each speaker
  • Gets computationally expensive for large vocabularies


Isolated Word Recognition with Template Matching

• For each word in the vocabulary, store at least one reference pattern
• When multiple reference patterns are available, either use all of them or compute an average
• During recognition:
  • Record a spoken word
  • Perform pattern matching with all stored patterns (or at least with those that can be used in the current context)
  • Compute a DTW score for every vocabulary word (when using multiple references, compute one score out of many, e.g. average or max)
  • Recognize the word with the best DTW score
• This approach only works for very small vocabularies and/or for speaker-dependent recognition


From Isolated to Continuous Speech

• Sloppier articulation
• Higher speaking rate
• Combinatorial explosion of things that can be said
• Spontaneous effects: restarts, fragments, noise
• Co-articulation: "Did you" becomes "dija" ...
• Segmentation: how to find word boundaries
• Solution: reduce to known problems


Plan 1: Cut Continuous Speech into Single Words

• Write a magic algorithm that segments speech into 1-word chunks
• Run DTW/Viterbi on each chunk
• BUT: where are the boundaries????
• There is no reliable segmentation algorithm for detecting word boundaries other than doing the recognition itself, due to:
  – Co-articulation between words
  – Hesitations within words
  – Hard decisions lead to accumulating errors
• An integrated approach works better


Plan 2: Two-Level DTW

Level 1 DTW: for each point in time (or each remotely likely word begin time), compute DTW with an open end within all words.
Level 2 DTW: for each word end and start point from the first DTW, combine words in a second DTW based on the calculated within-word scores.

Problem: extremely costly for larger vocabularies and longer words.

[Figure: for an assumed word-begin frame, level-1 DTW of patterns 1 and 2 yields the active corresponding word-end frames (with scores)]


Plan 3: Depth-First Search

Well-known search strategies: depth-first search vs. breadth-first search.
In speech recognition, this corresponds roughly to: time-asynchronous vs. time-synchronous search.
In time-synchronous search, we expand every hypothesis simultaneously, frame by frame.
In time-asynchronous search, we expand partial hypotheses that have different durations.


Plan 4: One-Stage Dynamic Programming

• Within words:
  – Viterbi
• Between words:
  – For the first state of each word, consider the best last state of all words as a possible predecessor.
  – Either this last state or the first state of the current word is the predecessor.
• "First state" means any state that can be the first of a word
• "Last state" means any state that has a transition out of the word


Search with vs. without a Language Model

[Figure: word-transition networks without grammar and with grammar]

• Without grammar: the best predecessor state is the same for all word-initial states
  => expand only the word-final state that has the best score in a frame
• With grammar: the best predecessor state also depends on the word transition probability/penalty


Compare Complete Utterances / Words

[Figure: the hypothesis (recognized sentence) is compared against a reference sentence as a whole]


Compare Smaller Units

[Figure: the hypothesis (recognized sentence) is compared against the reference sentence in terms of smaller units]


What we can't do yet
OR: Problems with Pattern Matching

• Need endpoint detection
• Need a collection of reference patterns (inconvenient for the user)
• Only works well for speaker-dependent recognition (difficult to cover variations)
• High computational effort (esp. for large vocabularies), proportional to vocabulary size
• Large vocabulary also means: need a huge amount of training data, since we need training samples for each word
• Difficult to train suitable references (or sets of references)
• Poor performance when the environment changes
• Unsuitable where the speaker is unknown and no training is feasible
• Unsuitable for continuous speech, coarticulation (combinatorial explosion of possible patterns)
• Impossible to recognize untrained words
• Difficult to train/recognize subword units


Make a Wish

• We would like to work with speech units shorter than words
  -> each subword unit occurs often, training is easier, less data needed
• We want to recognize speech from any speaker, without prior training
  -> store "speaker-independent" references
• We want to recognize continuous speech, not only isolated words
  -> handle coarticulation effects, handle sequences of words
• We would like to be able to recognize words that have not been trained
  -> train subword units and compose any word out of these (vocabulary independence)
• We would prefer a sound mathematical foundation
• Solution (particularly successful for ASR): Hidden Markov Models


Speech Production as a Stochastic Process

• The same word / phoneme / sound sounds different every time it is uttered
• We can regard words / phonemes as states of a speech production process
• In a given state we can observe different acoustic sounds
• Not all sounds are possible / likely in every state
• We say: in a given state, the speech process "emits" sounds according to some probability distribution/density
• The production process can make transitions from one state into another
• Not all transitions are possible, and transitions have different probabilities

When we specify the probabilities for the sound emissions (emission probabilities) and for the state transitions, we call this a model.


What's different?

The reference is now given in terms of a state sequence of statistical models; the models consist of prototypical reference vectors.

[Figure: the hypothesis (recognized sentence) is aligned against this state sequence]


Formal Definition of Hidden Markov Models

A Hidden Markov Model λ = (A, B, π) is commonly described by a five-tuple consisting of:

S   The set of states S = {s_1, s_2, ..., s_n}; n is the number of states
π   The initial probability distribution, π(s_i) = P(q_1 = s_i),
    the probability of s_i being the first state of a sequence
A   The matrix of state transition probabilities, A = (a_ij) with
    a_ij = P(q_{t+1} = s_j | q_t = s_i), 1 ≤ i, j ≤ n (going from state s_i to s_j)
B   The set of emission probability distributions/densities, B = {b_1, b_2, ..., b_n},
    where b_i(x) = P(o_t = x | q_t = s_i) is the probability of observing x
    when the system is in state s_i
V   The set of symbols; v is the number of distinct symbols.
    The observable feature space can be discrete, V = {x_1, x_2, ..., x_v}, or continuous, V = R^d


Generating an Observation Sequence

The term "hidden" refers to seeing the observations and drawing conclusions without knowing the hidden sequence of states.
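A minimal NumPy sketch of generating an observation sequence from a discrete HMM λ = (A, B, π) as defined above (the function name and the toy numbers are mine):

```python
import numpy as np

def generate(pi, A, B, T, seed=0):
    """Sample a state sequence q_1..q_T and an observation sequence o_1..o_T
    from a discrete HMM with initial distribution pi, transition matrix A,
    and emission matrix B (B[i, k] = P(o_t = x_k | q_t = s_i))."""
    rng = np.random.default_rng(seed)
    states, obs = [], []
    q = rng.choice(len(pi), p=pi)                   # draw the first state from pi
    for _ in range(T):
        states.append(q)
        obs.append(rng.choice(B.shape[1], p=B[q]))  # emit a symbol from b_q
        q = rng.choice(len(pi), p=A[q])             # transition according to A
    return states, obs

# Tiny 2-state example with 3 observable symbols
pi = np.array([1.0, 0.0])
A  = np.array([[0.7, 0.3], [0.0, 1.0]])
B  = np.array([[0.8, 0.2, 0.0], [0.1, 0.4, 0.5]])
print(generate(pi, A, B, T=5))
```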


Keep in Mind for Next Session

[Figure: the hypothesis (recognized sentence) aligned against the reference, to be picked up again next session]
