
Perplexity of n-gram and dependency language models

Martin Popel, David Mareček
ÚFAL, Charles University in Prague

TSD, 13th International Conference on Text, Speech and Dialogue
September 8, 2010, Brno


Outline

● Language Models (LM)
  – basics
  – design decisions
● Post-ngram LM
● Dependency LM
● Evaluation
● Conclusion & future plans

2


Language Models – basics

P(s) = ?

P( The dog barked again )

3


Language Models – basics

P(s) = ?

P( The dog barked again ) > P( The dock barked again )

4


Language Models – basics

P(s) = P(w1, w2, ... wm)

P( The dog barked again ) =
P(w1 = The, w2 = dog, w3 = barked, w4 = again)

5


Language Models – basics

P(s) = P(w1, w2, ... wm) = P(w1) · P(w2 | w1) · … · P(wm | w1, ..., wm-1)

Chain rule:
P( The dog barked again ) =
P(w1 = The) ·
P(w2 = dog | w1 = The) ·
P(w3 = barked | w1 = The, w2 = dog) ·
P(w4 = again | w1 = The, w2 = dog, w3 = barked)

6


Language Models – basics

P(s) = P(w1, w2, ... wm) = P(w1) · P(w2 | w1) · … · P(wm | w1, ..., wm-1)

Changed notation:
P( The dog barked again ) =
P(wi = The | i=1) ·
P(wi = dog | i=2, wi-1 = The) ·
P(wi = barked | i=3, wi-2 = The, wi-1 = dog) ·
P(wi = again | i=4, wi-3 = The, wi-2 = dog, wi-1 = barked)

7


Language Models – basics

P(s) = P(w1, w2, ... wm) = P(w1) · P(w2 | w1) · … · P(wm | w1, ..., wm-1)

Artificial start-of-sentence token:
P( The dog barked again ) =
P(wi = The | i=1, wi-1 = NONE) ·
P(wi = dog | i=2, wi-2 = NONE, wi-1 = The) ·
P(wi = barked | i=3, wi-3 = NONE, wi-2 = The, wi-1 = dog) ·
P(wi = again | i=4, wi-4 = NONE, wi-3 = The, wi-2 = dog, wi-1 = barked)

8


Language Models – basics

P(s) = P(w1, w2, ... wm) = P(w1) · P(w2 | w1) · … · P(wm | w1, ..., wm-1)

Position backoff:
P( The dog barked again ) ≈
P(wi = The | wi-1 = NONE) ·
P(wi = dog | wi-2 = NONE, wi-1 = The) ·
P(wi = barked | wi-3 = NONE, wi-2 = The, wi-1 = dog) ·
P(wi = again | wi-4 = NONE, wi-3 = The, wi-2 = dog, wi-1 = barked)

9


Language Models – basics

P(s) = P(w1, w2, ... wm) ≈ Π i=1..m P(wi | wi-1)

History backoff (bigram LM):
P( The dog barked again ) ≈
P(wi = The | wi-1 = NONE) ·
P(wi = dog | wi-1 = The) ·
P(wi = barked | wi-1 = dog) ·
P(wi = again | wi-1 = barked)

10


Language Models – basics

P(s) = P(w1, w2, ... wm) ≈ Π i=1..m P(wi | wi-2, wi-1)

History backoff (trigram LM):
P( The dog barked again ) ≈
P(wi = The | wi-2 = NONE, wi-1 = NONE) ·
P(wi = dog | wi-2 = NONE, wi-1 = The) ·
P(wi = barked | wi-2 = The, wi-1 = dog) ·
P(wi = again | wi-2 = dog, wi-1 = barked)

11
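The trigram factorization above can be estimated directly from counts. The following is a minimal sketch, not the authors' implementation: it assumes tokenized training sentences, uses a NONE token for the artificial start of sentence, and applies no smoothing yet (smoothing is design decision 3 below).

from collections import defaultdict

NONE = "<NONE>"

def train_trigram_counts(sentences):
    """sentences: iterable of token lists, e.g. [["The", "dog", "barked", "again"]]."""
    tri = defaultdict(int)   # counts of (w_{i-2}, w_{i-1}, w_i)
    bi = defaultdict(int)    # counts of (w_{i-2}, w_{i-1})
    for tokens in sentences:
        padded = [NONE, NONE] + tokens
        for i in range(2, len(padded)):
            tri[(padded[i-2], padded[i-1], padded[i])] += 1
            bi[(padded[i-2], padded[i-1])] += 1
    return tri, bi

def trigram_prob(w, w2, w1, tri, bi):
    """MLE estimate of P(w | w2, w1); zero for unseen trigrams (no smoothing)."""
    denom = bi[(w2, w1)]
    return tri[(w2, w1, w)] / denom if denom else 0.0

def sentence_prob(tokens, tri, bi):
    """P(s) ≈ Π P(wi | wi-2, wi-1), as on the trigram slide."""
    p = 1.0
    padded = [NONE, NONE] + tokens
    for i in range(2, len(padded)):
        p *= trigram_prob(padded[i], padded[i-2], padded[i-1], tri, bi)
    return p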


Language Models – basics

P(s) = P(w1, w2, ... wm) ≈ Π i=1..m P(wi | wi-2, wi-1)

In general: P(s) ≈ Π i=1..m P(wi | hi)
hi … context (history) of word wi

12
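Only the choice of hi differs between the models compared in this talk, so the whole family can be scored with one generic function that takes a history extractor and a conditional probability estimate. A sketch under that assumption (the names and the NONE token are illustrative, not from the paper):

def score_sentence(tokens, history_of, cond_prob):
    """P(s) ≈ Π_i cond_prob(w_i, history_of(tokens, i))."""
    p = 1.0
    for i, w in enumerate(tokens):
        p *= cond_prob(w, history_of(tokens, i))
    return p

def bigram_history(tokens, i):
    """h_i = the previous word (NONE at the sentence start)."""
    return (tokens[i-1] if i > 0 else "<NONE>",)

def trigram_history(tokens, i):
    """h_i = the two previous words."""
    padded = ["<NONE>", "<NONE>"] + tokens
    return (padded[i], padded[i+1])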


Language Models – design decisions

1) How to factorize P(w1, w2, ... wm) into Π i=1..m P(wi | hi),
   i.e. what word-positions will be used as the context hi?

2) What additional context information will be used
   (apart from word forms),
   e.g. stems, lemmata, POS tags, word classes, ...?

3) How to estimate P(wi | hi) from the training data?
   Which smoothing technique will be used?
   (Good-Turing, Jelinek-Mercer, Katz, Kneser-Ney, ...)
   Generalized Parallel Backoff etc.

13


Language Models – design decisions<br />

1) How to factorize P(w1, w2, ... wm) into Πi=1..m P(wi | hi), this<br />

i.e. what word-positions will be used as the context hi ?<br />

work<br />

2) What additional context information will be used<br />

(apart from word forms),<br />

e.g. stems, lemmata, POS tags, word classes,... ?<br />

3) How to estimate P(w i | h i ) from the training data?<br />

Which Linear smoothing interpolation technique will be used?<br />

Weights (Good-Turing, trained Jelinek-Mercer, by EM<br />

Katz, Kneser-Ney,...)<br />

Generalized Parallel Back<strong>of</strong>f etc.<br />

14
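The smoothing fixed in this work is linear interpolation with weights trained by EM. Below is a hedged sketch of what that could look like for a trigram model; the component estimators p_uni, p_bi, p_tri and the held-out triples are assumptions for illustration, not the paper's exact setup.

def interp_prob(w, w2, w1, lam, p_uni, p_bi, p_tri):
    """P(w | w2, w1) = λ1·P_uni(w) + λ2·P_bi(w|w1) + λ3·P_tri(w|w2,w1), λ1+λ2+λ3 = 1."""
    return (lam[0] * p_uni(w)
            + lam[1] * p_bi(w, w1)
            + lam[2] * p_tri(w, w2, w1))

def em_step(heldout, lam, p_uni, p_bi, p_tri):
    """One EM iteration over held-out (w2, w1, w) triples; returns updated weights."""
    expected = [0.0, 0.0, 0.0]
    for w2, w1, w in heldout:
        parts = [lam[0] * p_uni(w), lam[1] * p_bi(w, w1), lam[2] * p_tri(w, w2, w1)]
        total = sum(parts) or 1e-12
        for k in range(3):
            expected[k] += parts[k] / total   # responsibility of component k
    z = sum(expected)
    return [e / z for e in expected]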


Language Models – design decisions<br />

1) How to factorize P(w1, w2, ... wm) hinto Πi=1..m P(wi | hi), i = wi-n+1 , ..., w<br />

this<br />

i-1<br />

i.e. what word-positions will be (n-<strong>gram</strong>-based used as the context LMs) hi ?<br />

work<br />

2) What additional context information will be used<br />

(apart from word forms),<br />

e.g. stems, lemmata, POS tags, word classes,... ?<br />

3) How to estimate P(w i | h i ) from the training data?<br />

Which Linear smoothing interpolation technique will be used?<br />

Weights (Good-Turing, trained Jelinek-Mercer, by EM<br />

Katz, Kneser-Ney,...)<br />

Generalized Parallel Back<strong>of</strong>f etc.<br />

other<br />

papers<br />

15


Outline

● Language Models (LM)
  – basics
  – design decisions
● Post-ngram LM
● Dependency LM
● Evaluation
● Conclusion & future plans

16


Post-ngram LM

In general: P(s) = P(w1, w2, ... wm) ≈ Π i=1..m P(wi | hi)
hi … context (history) of word wi

Left-to-right factorization order:
Bigram LM: hi = wi-1 (one previous word)
Trigram LM: hi = wi-2, wi-1 (two previous words)

17


Post-ngram LM

In general: P(s) = P(w1, w2, ... wm) ≈ Π i=1..m P(wi | hi)
hi … context (history) of word wi

Left-to-right factorization order:
Bigram LM: hi = wi-1 (one previous word)
Trigram LM: hi = wi-2, wi-1 (two previous words)

Right-to-left factorization order:
Post-bigram LM: hi = wi+1 (one following word)
Post-trigram LM: hi = wi+1, wi+2 (two following words)

18


Post-ngram LM

In general: P(s) = P(w1, w2, ... wm) ≈ Π i=1..m P(wi | hi)
hi … context (history) of word wi

Left-to-right factorization order:
Bigram LM: hi = wi-1 (one previous word)
Trigram LM: hi = wi-2, wi-1 (two previous words)

Right-to-left factorization order:
Post-bigram LM: hi = wi+1 (one following word)

P( The dog barked again ) = P( again | NONE ) · P( barked | again ) ·
P( dog | barked ) · P( The | dog )

19
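A post-bigram history function for the generic scorer sketched earlier: only the direction of the context changes, so the same machinery applies (illustrative code, not the authors').

def post_bigram_history(tokens, i):
    """h_i = the FOLLOWING word w_{i+1} (NONE after the last word)."""
    return (tokens[i + 1] if i + 1 < len(tokens) else "<NONE>",)

# For "The dog barked again" this yields the product
#   P(again | NONE) · P(barked | again) · P(dog | barked) · P(The | dog)
# exactly as on the slide (the order of the factors does not matter).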


Outline

● Language Models (LM)
  – basics
  – design decisions
● Post-ngram LM
● Dependency LM
● Evaluation
● Conclusion & future plans

20


Dependency LM

● exploit the topology of dependency trees

The dog barked again

21


Dependency LM

● exploit the topology of dependency trees

[Figure: dependency tree of "The dog barked again" produced by the MALT parser; "barked" is the root, "dog" and "again" depend on "barked", and "The" depends on "dog".]

22


Dependency LM

● exploit the topology of dependency trees

hi = parent(wi)

P( The dog barked again ) = P( The | dog ) · P( dog | barked ) ·
P( barked | NONE ) · P( again | barked )

23
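With hi = parent(wi) the product runs over tree edges instead of surface neighbours. A small sketch follows, assuming the parse is available as a list of head indices (for example from a dependency parser such as MALT); this representation is an assumption for illustration only.

def dependency_lm_prob(tokens, heads, cond_prob):
    """tokens: words; heads[i]: 0-based index of the parent of tokens[i], -1 for the root."""
    p = 1.0
    for i, w in enumerate(tokens):
        parent = tokens[heads[i]] if heads[i] >= 0 else "<NONE>"
        p *= cond_prob(w, (parent,))
    return p

# "The dog barked again" with heads [1, 2, -1, 2] gives
#   P(The | dog) · P(dog | barked) · P(barked | NONE) · P(again | barked)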


Dependency LM

Long distance dependencies

The dog I heard last night barked again

[Figure: dependency tree of the sentence; "dog" depends directly on its head "barked", so the dependency LM links them even though several words intervene.]

24


Dependency LM

Motivation for usage

● How can we know the dependency structure
  without knowing the word-forms?

25


Dependency LM

Motivation for usage

● How can we know the dependency structure
  without knowing the word-forms?
● For example in tree-to-tree machine translation.

[Diagram: ANALYSIS of the Czech source "Ten pes štěkal znovu" into a dependency tree, TRANSFER into an English dependency tree (The, dog, barked, again), SYNTHESIS of the target sentence "The dog barked again".]

26


Dependency LM

Examples

● Model wp
  word form of parent

P( The dog barked again ) =
P( The | dog ) ·
P( dog | barked ) ·
P( barked | NONE ) ·
P( again | barked )

27


Dependency LM

Examples

● Model wp,wg
  word form of parent, word form of grandparent

P( The dog barked again ) =
P( The | dog, barked ) ·
P( dog | barked, NONE ) ·
P( barked | NONE, NONE ) ·
P( again | barked, NONE )

28


Dependency LM

Examples

● Model E,wp
  edge direction, word form of parent

P( The dog barked again ) =
P( The | right, dog ) ·
P( dog | right, barked ) ·
P( barked | left, NONE ) ·
P( again | left, barked )

29


Dependency LM

Examples

● Model C,wp
  number of children, word form of parent

P( The dog barked again ) =
P( The | 0, dog ) ·
P( dog | 1, barked ) ·
P( barked | 2, NONE ) ·
P( again | 0, barked )

30


Dependency LM

Examples

● Model N,wp
  the word is the Nth child of its parent, word form of parent

P( The dog barked again ) =
P( The | 1, dog ) ·
P( dog | 1, barked ) ·
P( barked | 1, NONE ) ·
P( again | 2, barked )

31


Dependency LM

Examples of additional context information

● Model tp,wp
  POS tag of parent, word form of parent

P( The dog barked again ) =
P( The | NN, dog ) ·
P( dog | VBD, barked ) ·
P( barked | NONE, NONE ) ·
P( again | VBD, barked )

32


Dependency LM

Examples of additional context information

● Model tp,wp
  POS tag of parent, word form of parent

P( The dog barked again ) =
P( The | NN, dog ) ·
P( dog | VBD, barked ) ·
P( barked | NONE, NONE ) ·
P( again | VBD, barked )

A naïve tagger assigns the most frequent tag for a given word.

33
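The naïve tagger mentioned on this slide can be sketched in a few lines: count word-tag pairs in the training data and always return the most frequent tag for each word (the fallback tag for unseen words is an assumption, not specified here).

from collections import Counter, defaultdict

def train_naive_tagger(tagged_sentences, fallback="NN"):
    """tagged_sentences: iterable of lists of (word, tag) pairs."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda word: most_frequent.get(word, fallback)

# tag = train_naive_tagger(training_data)
# tag("dog")  ->  the tag "dog" was most often seen with, e.g. "NN"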


Dependency LM

Examples of additional context information

● Model Tp,wp
  coarse-grained POS tag of parent, word form of parent

P( The dog barked again ) =
P( The | N, dog ) ·
P( dog | V, barked ) ·
P( barked | x, NONE ) ·
P( again | V, barked )

34


Dependency LM

Examples of additional context information

● Model E,C,wp,N
  edge direction, number of children, word form of parent,
  the word is the Nth child of its parent

P( The dog barked again ) =
P( The | right, 0, dog, 1 ) ·
P( dog | right, 1, barked, 1 ) ·
P( barked | left, 2, NONE, 1 ) ·
P( again | left, 0, barked, 2 )

35
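The richer dependency contexts combine several features of the tree. The sketch below derives the E,C,wp,N context from head indices; it reproduces the values shown on this slide, but the exact feature extraction in the paper may differ (in particular, the edge direction assigned to the root is copied from the slide).

def context_ECwpN(tokens, heads, i):
    """Context (edge direction, #children, parent word form, Nth child) of tokens[i]."""
    head = heads[i]
    if head < 0:                                   # root word
        edge = "left"                              # value used for "barked" on the slide
        parent_form = "<NONE>"
        siblings = [j for j, h in enumerate(heads) if h < 0]
    else:
        edge = "right" if i < head else "left"     # "right" if the head is to the right
        parent_form = tokens[head]
        siblings = [j for j, h in enumerate(heads) if h == head]
    n_children = sum(1 for h in heads if h == i)   # children of the word itself
    nth_child = siblings.index(i) + 1              # 1-based order among its siblings
    return (edge, n_children, parent_form, nth_child)

# "The dog barked again", heads [1, 2, -1, 2]:
#   context_ECwpN(tokens, heads, 0) == ("right", 0, "dog", 1)     for "The"
#   context_ECwpN(tokens, heads, 3) == ("left", 0, "barked", 2)   for "again"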


Outline

● Language Models (LM)
  – basics
  – design decisions
● Post-ngram LM
● Dependency LM
● Evaluation
● Conclusion & future plans

36


Evaluation

● Train and test data from CoNLL 2007 shared task
● 7 languages: Arabic, Catalan, Czech,
  English (450 000 tokens, 3 % OOV), Hungarian,
  Italian (75 000 tokens), and Turkish (26 % OOV)
● Cross-entropy = −(1/|T|) Σ i=1..|T| log2 P(wi | hi),
  measured on the test data T
● Perplexity = 2^cross-entropy
● Lower perplexity ~ better LM
● Baseline … trigram LM
● 4 experimental settings: PLAIN, TAGS, DEP, DEP+TAGS

37
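The evaluation measures translate directly into code. A minimal sketch of cross-entropy and perplexity as defined above, assuming a smoothed model so that no test probability is zero:

import math

def cross_entropy(test_sentences, history_of, cond_prob):
    """-(1/|T|) Σ log2 P(wi | hi) over all tokens of the test data T."""
    log_sum, n_tokens = 0.0, 0
    for tokens in test_sentences:
        for i, w in enumerate(tokens):
            p = cond_prob(w, history_of(tokens, i))
            log_sum += math.log2(p)    # requires p > 0, i.e. a smoothed model
            n_tokens += 1
    return -log_sum / n_tokens

def perplexity(test_sentences, history_of, cond_prob):
    """Perplexity = 2^cross-entropy; lower is better."""
    return 2 ** cross_entropy(test_sentences, history_of, cond_prob)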


Evaluation

[Chart: normalized perplexity (y-axis 40 %–110 %) for each language (ar, ca, cs, en, hu, it, tr) under five models:
w-1,w-2 (BASELINE); w+1,w+2 (PLAIN); T+1,t+1,l+1,w+1,T+2,t+2,l+2,w+2 (TAGS); E,C,wp,N,wg (DEP); E,C,Tp,tp,N,lp,wp,Tg,tg,lg (DEP+TAGS).]

38


Conclusion

Findings confirmed for all seven languages. Improvement over baseline for English:

● Post-trigram better than trigram,
  post-bigram better than bigram (PLAIN: 8 %)
● Additional context (POS & lemma) helps (TAGS: 20 %)
● Dependency structure helps even more (DEP: 24 %)
● The best perplexity achieved with DEP+TAGS (31 %)

39


Future plans

● Investigate the reason for better post-ngram LM perplexity
● Extrinsic evaluation
● Post-ngram LM in speech recognition
● Dependency LM in tree-to-tree machine translation
● Better smoothing using Generalized Parallel Backoff
● Bigger LM for real applications

40


Thank you

41
