Perplexity of n-gram and dependency language models
Martin Popel, David Mareček
ÚFAL, Charles University in Prague
TSD, 13th International Conference on Text, Speech and Dialogue
September 8, 2010, Brno
Outline
● Language Models (LM)
  – basics
  – design decisions
● Post-ngram LM
● Dependency LM
● Evaluation
● Conclusion & future plans
Language Models – basics
P(s) = ?
P( The dog barked again )
Language Models – basics
P(s) = ?
P( The dog barked again ) > P( The dock barked again )
Language Models – basics
P(s) = P(w1, w2, ... wm)
P( The dog barked again ) =
P(w1 = The , w2 = dog , w3 = barked , w4 = again )
Language Models – basics
Chain rule:
P(s) = P(w1, w2, ... wm) = P(w1) P(w2|w1) … P(wm|w1,...,wm-1)
P( The dog barked again ) =
P(w1 = The ) ·
P(w2 = dog | w1 = The ) ·
P(w3 = barked | w1 = The , w2 = dog ) ·
P(w4 = again | w1 = The , w2 = dog , w3 = barked )
Language Models – basics
Changed notation:
P(s) = P(w1, w2, ... wm) = P(w1) P(w2|w1) … P(wm|w1,...,wm-1)
P( The dog barked again ) =
P(wi = The | i=1) ·
P(wi = dog | i=2, wi-1 = The ) ·
P(wi = barked | i=3, wi-2 = The , wi-1 = dog ) ·
P(wi = again | i=4, wi-3 = The , wi-2 = dog , wi-1 = barked )
Language Models – basics
Artificial start-of-sentence token:
P(s) = P(w1, w2, ... wm) = P(w1) P(w2|w1) … P(wm|w1,...,wm-1)
P( The dog barked again ) =
P(wi = The | i=1, wi-1 = NONE ) ·
P(wi = dog | i=2, wi-2 = NONE , wi-1 = The ) ·
P(wi = barked | i=3, wi-3 = NONE , wi-2 = The , wi-1 = dog ) ·
P(wi = again | i=4, wi-4 = NONE , wi-3 = The , wi-2 = dog , wi-1 = barked )
Language Models – basics
Position backoff:
P(s) = P(w1, w2, ... wm) = P(w1) P(w2|w1) … P(wm|w1,...,wm-1)
P( The dog barked again ) ≈
P(wi = The | wi-1 = NONE ) ·
P(wi = dog | wi-2 = NONE , wi-1 = The ) ·
P(wi = barked | wi-3 = NONE , wi-2 = The , wi-1 = dog ) ·
P(wi = again | wi-4 = NONE , wi-3 = The , wi-2 = dog , wi-1 = barked )
Language Models – basics
History backoff (bigram LM):
P(s) = P(w1, w2, ... wm) ≈ Π i=1..m P(wi | wi-1)
P( The dog barked again ) ≈
P(wi = The | wi-1 = NONE ) ·
P(wi = dog | wi-1 = The ) ·
P(wi = barked | wi-1 = dog ) ·
P(wi = again | wi-1 = barked )
Language Models – basics
History backoff (trigram LM):
P(s) = P(w1, w2, ... wm) ≈ Π i=1..m P(wi | wi-2, wi-1)
P( The dog barked again ) ≈
P(wi = The | wi-2 = NONE , wi-1 = NONE ) ·
P(wi = dog | wi-2 = NONE , wi-1 = The ) ·
P(wi = barked | wi-2 = The , wi-1 = dog ) ·
P(wi = again | wi-2 = dog , wi-1 = barked )
Language Models – basics
P(s) = P(w1, w2, ... wm) ≈ Π i=1..m P(wi | wi-2, wi-1)
In general: Π i=1..m P(wi | hi)
hi … context (history) of word wi
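The general factorization above can be sketched directly in code. Below is a minimal maximum-likelihood bigram model: counts are collected from a toy corpus and P(s) is computed as the product Π P(wi | wi-1), with NONE as the artificial start token. The corpus and function names are illustrative, not from the talk.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Collect unigram (context) and bigram counts; NONE marks sentence start."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        prev = "NONE"
        uni["NONE"] += 1
        for w in words:
            uni[w] += 1
            bi[(prev, w)] += 1
            prev = w
    return uni, bi

def bigram_prob(uni, bi, prev, w):
    """Maximum-likelihood estimate P(w | prev) = count(prev, w) / count(prev)."""
    return bi[(prev, w)] / uni[prev] if uni[prev] else 0.0

def sentence_prob(uni, bi, words):
    """P(s) ≈ Π P(w_i | w_{i-1}) -- the bigram factorization from the slide."""
    p, prev = 1.0, "NONE"
    for w in words:
        p *= bigram_prob(uni, bi, prev, w)
        prev = w
    return p
```

In practice the product is computed in log space to avoid underflow on long sentences; the plain product keeps the sketch closest to the slide's notation.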
Language Models – design decisions
1) How to factorize P(w1, w2, ... wm) into Π i=1..m P(wi | hi),
   i.e. what word positions will be used as the context hi?
2) What additional context information will be used
   (apart from word forms),
   e.g. stems, lemmata, POS tags, word classes,...?
3) How to estimate P(wi | hi) from the training data?
   Which smoothing technique will be used?
   (Good-Turing, Jelinek-Mercer, Katz, Kneser-Ney,
   Generalized Parallel Backoff etc.)
Language Models – design decisions
1) How to factorize P(w1, w2, ... wm) into Π i=1..m P(wi | hi),
   i.e. what word positions will be used as the context hi?
   → the focus of this work; n-gram-based LMs use hi = wi-n+1, ..., wi-1
2) What additional context information will be used
   (apart from word forms),
   e.g. stems, lemmata, POS tags, word classes,...?
3) How to estimate P(wi | hi) from the training data?
   Which smoothing technique will be used?
   (Good-Turing, Jelinek-Mercer, Katz, Kneser-Ney,
   Generalized Parallel Backoff etc. are covered in other papers)
   → here: linear interpolation, weights trained by EM
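The smoothing choice named on the slide, linear interpolation with EM-trained weights, can be sketched as follows. The function assumes the component probabilities (e.g. unigram and bigram MLE) have already been computed for each held-out token; the EM loop re-estimates the interpolation weights λ to maximize held-out likelihood. The function name and input layout are mine, for illustration.

```python
def em_interpolation_weights(probs_list, iters=50):
    """probs_list: for each held-out token, a tuple of component probabilities
    (e.g. (unigram MLE, bigram MLE)). Returns interpolation weights lambda_k,
    re-estimated by EM: E-step computes each component's responsibility for
    each token, M-step normalizes the accumulated responsibilities."""
    k = len(probs_list[0])
    lam = [1.0 / k] * k
    for _ in range(iters):
        expected = [0.0] * k
        for probs in probs_list:
            total = sum(l * p for l, p in zip(lam, probs))
            if total == 0.0:
                continue  # token unseen by every component; skip it
            for j in range(k):
                expected[j] += lam[j] * probs[j] / total
        z = sum(expected)
        lam = [e / z for e in expected]
    return lam
```

If one component consistently assigns higher probability on held-out data, EM drives nearly all the weight to it; identical components keep uniform weights.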
Post-ngram LM
In general: P(s) = P(w1, w2, ... wm) ≈ Π i=1..m P(wi | hi)
hi … context (history) of word wi
Left-to-right factorization order:
  Bigram LM: hi = wi-1 (one previous word)
  Trigram LM: hi = wi-2, wi-1 (two previous words)
Right-to-left factorization order:
  Post-bigram LM: hi = wi+1 (one following word)
  Post-trigram LM: hi = wi+1, wi+2 (two following words)
Post-bigram example:
P( The dog barked again ) ≈ P( again | NONE ) · P( barked | again ) ·
P( dog | barked ) · P( The | dog )
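The right-to-left factorization has a convenient consequence: a post-bigram LM over a sentence is exactly a bigram LM over the reversed sentence, with NONE now acting as an end-of-sentence context. A sketch built on that observation (toy corpus and function names are illustrative):

```python
from collections import Counter

def train_bigram(sentences):
    """MLE bigram counts; NONE is the artificial context token."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        prev = "NONE"
        uni["NONE"] += 1
        for w in words:
            uni[w] += 1
            bi[(prev, w)] += 1
            prev = w
    return uni, bi

def score_bigram(uni, bi, words):
    """P(s) ≈ Π P(w_i | w_{i-1}) under the MLE counts."""
    p, prev = 1.0, "NONE"
    for w in words:
        p *= bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
        prev = w
    return p

def train_post_bigram(sentences):
    """Post-bigram LM = bigram LM over reversed sentences: the last word is
    predicted from NONE, each earlier word from the word that follows it."""
    return train_bigram([list(reversed(s)) for s in sentences])

def score_post_bigram(uni, bi, words):
    return score_bigram(uni, bi, list(reversed(words)))
```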
Dependency LM
● exploit the topology of dependency trees
The dog barked again
[dependency tree from the MALT parser: barked is the root, with children dog and again; The depends on dog]
Dependency LM
hi = parent(wi)
P( The dog barked again ) = P( The | dog ) · P( dog | barked ) ·
P( barked | NONE ) · P( again | barked )
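The factorization hi = parent(wi) can be sketched with a parent-index array, where parents[i] = -1 marks the root (conditioned on NONE). The conditional table and its probability values below are made up purely for illustration.

```python
def dependency_lm_prob(words, parents, cond_prob):
    """P(s) ≈ Π P(w_i | parent(w_i)); the root is conditioned on NONE."""
    p = 1.0
    for i, w in enumerate(words):
        head = "NONE" if parents[i] < 0 else words[parents[i]]
        p *= cond_prob(w, head)
    return p

# Toy conditional probabilities P(word | parent) -- illustrative values only.
TABLE = {
    ("The", "dog"): 0.4,
    ("dog", "barked"): 0.3,
    ("barked", "NONE"): 0.2,
    ("again", "barked"): 0.5,
}

def toy_prob(w, head):
    return TABLE.get((w, head), 0.0)
```

For the slide's tree, parents = [1, 2, -1, 2]: The depends on dog, dog and again on barked, and barked is the root.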
Dependency LM
Long-distance dependencies:
The dog I heard last night barked again
[tree: barked is the root; dog depends directly on barked across the intervening relative clause "I heard last night"]
Dependency LM
Motivation for usage
● How can we know the dependency structure
without knowing the word-forms?
● For example in tree-to-tree machine translation:
[figure: "Ten pes štěkal znovu" → ANALYSIS → source tree → TRANSFER → target tree → SYNTHESIS → "The dog barked again"]
Dependency LM
Examples
● Model wp: word form of parent
P( The dog barked again ) =
P( The | dog ) ·
P( dog | barked ) ·
P( barked | NONE ) ·
P( again | barked )
Dependency LM
Examples
● Model wp,wg: word form of parent, word form of grandparent
P( The dog barked again ) =
P( The | dog , barked ) ·
P( dog | barked , NONE ) ·
P( barked | NONE , NONE ) ·
P( again | barked , NONE )
Dependency LM
Examples
● Model E,wp: edge direction, word form of parent
P( The dog barked again ) =
P( The | right, dog ) ·
P( dog | right, barked ) ·
P( barked | left, NONE ) ·
P( again | left, barked )
Dependency LM
Examples
● Model C,wp: number of children, word form of parent
P( The dog barked again ) =
P( The | 0, dog ) ·
P( dog | 1, barked ) ·
P( barked | 2, NONE ) ·
P( again | 0, barked )
Dependency LM
Examples
● Model N,wp: the word is the N-th child of its parent, word form of parent
P( The dog barked again ) =
P( The | 1, dog ) ·
P( dog | 1, barked ) ·
P( barked | 1, NONE ) ·
P( again | 2, barked )
Dependency LM
Examples of additional context information
● Model tp,wp: POS tag of parent, word form of parent
  (tags from a naïve tagger that assigns each word its most frequent tag)
P( The dog barked again ) =
P( The | NN, dog ) ·
P( dog | VBD, barked ) ·
P( barked | NONE, NONE ) ·
P( again | VBD, barked )
Dependency LM
Examples of additional context information
● Model Tp,wp: coarse-grained POS tag of parent, word form of parent
P( The dog barked again ) =
P( The | N, dog ) ·
P( dog | V, barked ) ·
P( barked | x, NONE ) ·
P( again | V, barked )
Dependency LM
Examples of additional context information
● Model E,C,wp,N: edge direction, number of children, word form of parent,
  word is the N-th child of its parent
P( The dog barked again ) =
P( The | right, 0, dog , 1) ·
P( dog | right, 1, barked , 1) ·
P( barked | left, 2, NONE , 1) ·
P( again | left, 0, barked , 2)
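All three structural contexts used above (E, C, N) can be read off a parent-index array in one pass. A sketch matching the slides' conventions (the root's edge direction counts as "left", and the root is the 1st child of an artificial super-root); the function name and dict layout are mine, not the talk's.

```python
def tree_features(parents):
    """For each word, derive: E = edge direction (is the parent to the right
    or left?), C = the word's own number of children, N = the word's rank
    among its parent's children, left to right. parents[i] = -1 is the root."""
    n = len(parents)
    children = [[] for _ in range(n)]
    root_children = []
    for i, p in enumerate(parents):
        # i iterates left to right, so each child list stays in surface order.
        (children[p] if p >= 0 else root_children).append(i)
    feats = []
    for i, p in enumerate(parents):
        sibs = children[p] if p >= 0 else root_children
        feats.append({
            "E": "right" if p > i else "left",  # root (p = -1) counts as "left"
            "C": len(children[i]),
            "N": sibs.index(i) + 1,
        })
    return feats
```

On the slides' tree (parents = [1, 2, -1, 2] for "The dog barked again") this reproduces the contexts shown in the E,C,wp,N example.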
Evaluation
● Train and test data from the CoNLL 2007 shared task
● 7 languages: Arabic, Catalan, Czech,
  English (450 000 tokens, 3 % OOV), Hungarian,
  Italian (75 000 tokens), and Turkish (26 % OOV)
● Cross-entropy = −(1/|T|) Σ i=1..|T| log2 P(wi | hi),
  measured on the test data T
● Perplexity = 2^Cross-entropy
● Lower perplexity ~ better LM
● Baseline … trigram LM
● 4 experimental settings: PLAIN, TAGS, DEP, DEP+TAGS
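The two evaluation formulas translate directly to code. In this minimal sketch, word_probs stands for the per-token probabilities P(wi | hi) that a model assigns on the test data; the function names are illustrative.

```python
import math

def cross_entropy(word_probs):
    """Cross-entropy = -(1/|T|) * sum of log2 P(w_i | h_i) over the test data."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

def perplexity(word_probs):
    """Perplexity = 2 ** cross-entropy; lower means a better LM."""
    return 2 ** cross_entropy(word_probs)
```

Intuitively, perplexity is the model's average branching factor: a model that assigns every test token probability 1/k has perplexity exactly k.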
Evaluation
[chart: normalized perplexity (40 %–110 % of baseline) per language (ar, ca, cs, en, hu, it, tr), comparing:
  w-1,w-2 (BASELINE);
  w+1,w+2 (PLAIN);
  T+1,t+1,l+1,w+1,T+2,t+2,l+2,w+2 (TAGS);
  E,C,wp,N,wg (DEP);
  E,C,Tp,tp,N,lp,wp,Tg,tg,lg (DEP+TAGS)]
Conclusion
(findings confirmed for all seven languages; improvements over baseline given for English)
● Post-trigram LM better than trigram LM,
  post-bigram better than bigram (PLAIN: 8 %)
● Additional context (POS & lemma) helps (TAGS: 20 %)
● Dependency structure helps even more (DEP: 24 %)
● The best perplexity achieved with DEP+TAGS (31 %)
Future plans
● Investigate the reason for the better post-ngram LM perplexity
● Extrinsic evaluation:
  – Post-ngram LM in speech recognition
  – Dependency LM in tree-to-tree machine translation
● Better smoothing using Generalized Parallel Backoff
● Bigger LMs for real applications
Thank you