Perplexity of n-gram and dependency language models
Martin Popel, David Mareček
ÚFAL, Charles University in Prague
TSD, 13th International Conference on Text, Speech and Dialogue
September 8, 2010, Brno
Outline
● Language Models (LM)
  – basics
  – design decisions
● Post-ngram LM
● Dependency LM
● Evaluation
● Conclusion & future plans
Language Models – basics
P(s) = ?
P( The dog barked again )
Language Models – basics
P(s) = ?
P( The dog barked again ) > P( The dock barked again )
Language Models – basics
P(s) = P(w1, w2, ... wm)
P( The dog barked again ) =
P(w1 = The , w2 = dog , w3 = barked , w4 = again )
Language Models – basics
Chain rule:
P(s) = P(w1, w2, ... wm) = P(w1) P(w2|w1) … P(wm|w1,...,wm-1)
P( The dog barked again ) =
P(w1 = The ) ·
P(w2 = dog | w1 = The ) ·
P(w3 = barked | w1 = The , w2 = dog ) ·
P(w4 = again | w1 = The , w2 = dog , w3 = barked )
Language Models – basics
Changed notation:
P(s) = P(w1, w2, ... wm) = P(w1) P(w2|w1) … P(wm|w1,...,wm-1)
P( The dog barked again ) =
P(wi = The | i=1) ·
P(wi = dog | i=2, wi-1 = The ) ·
P(wi = barked | i=3, wi-2 = The , wi-1 = dog ) ·
P(wi = again | i=4, wi-3 = The , wi-2 = dog , wi-1 = barked )
Language Models – basics
Artificial start-of-sentence token:
P(s) = P(w1, w2, ... wm) = P(w1) P(w2|w1) … P(wm|w1,...,wm-1)
P( The dog barked again ) =
P(wi = The | i=1, wi-1 = NONE ) ·
P(wi = dog | i=2, wi-2 = NONE , wi-1 = The ) ·
P(wi = barked | i=3, wi-3 = NONE , wi-2 = The , wi-1 = dog ) ·
P(wi = again | i=4, wi-4 = NONE , wi-3 = The , wi-2 = dog , wi-1 = barked )
Language Models – basics
Position backoff:
P(s) = P(w1, w2, ... wm) = P(w1) P(w2|w1) … P(wm|w1,...,wm-1)
P( The dog barked again ) ≈
P(wi = The | wi-1 = NONE ) ·
P(wi = dog | wi-2 = NONE , wi-1 = The ) ·
P(wi = barked | wi-3 = NONE , wi-2 = The , wi-1 = dog ) ·
P(wi = again | wi-4 = NONE , wi-3 = The , wi-2 = dog , wi-1 = barked )
Language Models – basics
History backoff (bigram LM):
P(s) = P(w1, w2, ... wm) ≈ Π i=1..m P(wi | wi-1)
P( The dog barked again ) ≈
P(wi = The | wi-1 = NONE ) ·
P(wi = dog | wi-1 = The ) ·
P(wi = barked | wi-1 = dog ) ·
P(wi = again | wi-1 = barked )
Language Models – basics
History backoff (trigram LM):
P(s) = P(w1, w2, ... wm) ≈ Π i=1..m P(wi | wi-2, wi-1)
P( The dog barked again ) ≈
P(wi = The | wi-2 = NONE , wi-1 = NONE ) ·
P(wi = dog | wi-2 = NONE , wi-1 = The ) ·
P(wi = barked | wi-2 = The , wi-1 = dog ) ·
P(wi = again | wi-2 = dog , wi-1 = barked )
Language Models – basics
P(s) = P(w1, w2, ... wm) ≈ Π i=1..m P(wi | wi-2, wi-1)
In general: Π i=1..m P(wi | hi)
hi … context (history) of word wi
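The general factorization above can be sketched directly in code. Below is a minimal maximum-likelihood bigram model: counts are collected from a toy corpus and P(s) is computed as the product Π P(wi | wi-1), with NONE as the artificial start token. The corpus and function names are illustrative, not from the talk.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Collect unigram (context) and bigram counts; NONE marks sentence start."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        prev = "NONE"
        uni["NONE"] += 1
        for w in words:
            uni[w] += 1
            bi[(prev, w)] += 1
            prev = w
    return uni, bi

def bigram_prob(uni, bi, prev, w):
    """Maximum-likelihood estimate P(w | prev) = count(prev, w) / count(prev)."""
    return bi[(prev, w)] / uni[prev] if uni[prev] else 0.0

def sentence_prob(uni, bi, words):
    """P(s) ≈ Π P(w_i | w_{i-1}) -- the bigram factorization from the slide."""
    p, prev = 1.0, "NONE"
    for w in words:
        p *= bigram_prob(uni, bi, prev, w)
        prev = w
    return p
```

In practice the product is computed in log space to avoid underflow on long sentences; the plain product keeps the sketch closest to the slide's notation.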
Language Models – design decisions
1) How to factorize P(w1, w2, ... wm) into Π i=1..m P(wi | hi),
   i.e. what word positions will be used as the context hi?
2) What additional context information will be used
   (apart from word forms),
   e.g. stems, lemmata, POS tags, word classes,...?
3) How to estimate P(wi | hi) from the training data?
   Which smoothing technique will be used?
   (Good-Turing, Jelinek-Mercer, Katz, Kneser-Ney,
   Generalized Parallel Backoff etc.)
Language Models – design decisions
1) How to factorize P(w1, w2, ... wm) into Π i=1..m P(wi | hi),
   i.e. what word positions will be used as the context hi?
   → the focus of this work; n-gram-based LMs use hi = wi-n+1, ..., wi-1
2) What additional context information will be used
   (apart from word forms),
   e.g. stems, lemmata, POS tags, word classes,...?
3) How to estimate P(wi | hi) from the training data?
   Which smoothing technique will be used?
   (Good-Turing, Jelinek-Mercer, Katz, Kneser-Ney,
   Generalized Parallel Backoff etc. are covered in other papers)
   → here: linear interpolation, weights trained by EM
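The smoothing choice named on the slide, linear interpolation with EM-trained weights, can be sketched as follows. The function assumes the component probabilities (e.g. unigram and bigram MLE) have already been computed for each held-out token; the EM loop re-estimates the interpolation weights λ to maximize held-out likelihood. The function name and input layout are mine, for illustration.

```python
def em_interpolation_weights(probs_list, iters=50):
    """probs_list: for each held-out token, a tuple of component probabilities
    (e.g. (unigram MLE, bigram MLE)). Returns interpolation weights lambda_k,
    re-estimated by EM: E-step computes each component's responsibility for
    each token, M-step normalizes the accumulated responsibilities."""
    k = len(probs_list[0])
    lam = [1.0 / k] * k
    for _ in range(iters):
        expected = [0.0] * k
        for probs in probs_list:
            total = sum(l * p for l, p in zip(lam, probs))
            if total == 0.0:
                continue  # token unseen by every component; skip it
            for j in range(k):
                expected[j] += lam[j] * probs[j] / total
        z = sum(expected)
        lam = [e / z for e in expected]
    return lam
```

If one component consistently assigns higher probability on held-out data, EM drives nearly all the weight to it; identical components keep uniform weights.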
Post-ngram LM
In general: P(s) = P(w1, w2, ... wm) ≈ Π i=1..m P(wi | hi)
hi … context (history) of word wi
Left-to-right factorization order:
  Bigram LM: hi = wi-1 (one previous word)
  Trigram LM: hi = wi-2, wi-1 (two previous words)
Right-to-left factorization order:
  Post-bigram LM: hi = wi+1 (one following word)
  Post-trigram LM: hi = wi+1, wi+2 (two following words)
Post-bigram example:
P( The dog barked again ) ≈ P( again | NONE ) · P( barked | again ) ·
P( dog | barked ) · P( The | dog )
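The right-to-left factorization has a convenient consequence: a post-bigram LM over a sentence is exactly a bigram LM over the reversed sentence, with NONE now acting as an end-of-sentence context. A sketch built on that observation (toy corpus and function names are illustrative):

```python
from collections import Counter

def train_bigram(sentences):
    """MLE bigram counts; NONE is the artificial context token."""
    uni, bi = Counter(), Counter()
    for words in sentences:
        prev = "NONE"
        uni["NONE"] += 1
        for w in words:
            uni[w] += 1
            bi[(prev, w)] += 1
            prev = w
    return uni, bi

def score_bigram(uni, bi, words):
    """P(s) ≈ Π P(w_i | w_{i-1}) under the MLE counts."""
    p, prev = 1.0, "NONE"
    for w in words:
        p *= bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
        prev = w
    return p

def train_post_bigram(sentences):
    """Post-bigram LM = bigram LM over reversed sentences: the last word is
    predicted from NONE, each earlier word from the word that follows it."""
    return train_bigram([list(reversed(s)) for s in sentences])

def score_post_bigram(uni, bi, words):
    return score_bigram(uni, bi, list(reversed(words)))
```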
Dependency LM
● exploit the topology of dependency trees
The dog barked again
[dependency tree from the MALT parser: barked is the root, with children dog and again; The depends on dog]
Dependency LM
hi = parent(wi)
P( The dog barked again ) = P( The | dog ) · P( dog | barked ) ·
P( barked | NONE ) · P( again | barked )
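The factorization hi = parent(wi) can be sketched with a parent-index array, where parents[i] = -1 marks the root (conditioned on NONE). The conditional table and its probability values below are made up purely for illustration.

```python
def dependency_lm_prob(words, parents, cond_prob):
    """P(s) ≈ Π P(w_i | parent(w_i)); the root is conditioned on NONE."""
    p = 1.0
    for i, w in enumerate(words):
        head = "NONE" if parents[i] < 0 else words[parents[i]]
        p *= cond_prob(w, head)
    return p

# Toy conditional probabilities P(word | parent) -- illustrative values only.
TABLE = {
    ("The", "dog"): 0.4,
    ("dog", "barked"): 0.3,
    ("barked", "NONE"): 0.2,
    ("again", "barked"): 0.5,
}

def toy_prob(w, head):
    return TABLE.get((w, head), 0.0)
```

For the slide's tree, parents = [1, 2, -1, 2]: The depends on dog, dog and again on barked, and barked is the root.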
Dependency LM
Long-distance dependencies:
The dog I heard last night barked again
[tree: barked is the root; dog depends directly on barked across the intervening relative clause "I heard last night"]
Dependency LM
Motivation for usage
● How can we know the dependency structure
without knowing the word-forms?
● For example in tree-to-tree machine translation:
[figure: "Ten pes štěkal znovu" → ANALYSIS → source tree → TRANSFER → target tree → SYNTHESIS → "The dog barked again"]
Dependency LM
Examples
● Model wp: word form of parent
P( The dog barked again ) =
P( The | dog ) ·
P( dog | barked ) ·
P( barked | NONE ) ·
P( again | barked )
Dependency LM
Examples
● Model wp,wg: word form of parent, word form of grandparent
P( The dog barked again ) =
P( The | dog , barked ) ·
P( dog | barked , NONE ) ·
P( barked | NONE , NONE ) ·
P( again | barked , NONE )
Dependency LM
Examples
● Model E,wp: edge direction, word form of parent
P( The dog barked again ) =
P( The | right, dog ) ·
P( dog | right, barked ) ·
P( barked | left, NONE ) ·
P( again | left, barked )
Dependency LM
Examples
● Model C,wp: number of children, word form of parent
P( The dog barked again ) =
P( The | 0, dog ) ·
P( dog | 1, barked ) ·
P( barked | 2, NONE ) ·
P( again | 0, barked )
Dependency LM
Examples
● Model N,wp: the word is the N-th child of its parent, word form of parent
P( The dog barked again ) =
P( The | 1, dog ) ·
P( dog | 1, barked ) ·
P( barked | 1, NONE ) ·
P( again | 2, barked )
Dependency LM
Examples of additional context information
● Model tp,wp: POS tag of parent, word form of parent
  (tags from a naïve tagger that assigns each word its most frequent tag)
P( The dog barked again ) =
P( The | NN, dog ) ·
P( dog | VBD, barked ) ·
P( barked | NONE, NONE ) ·
P( again | VBD, barked )
Dependency LM
Examples of additional context information
● Model Tp,wp: coarse-grained POS tag of parent, word form of parent
P( The dog barked again ) =
P( The | N, dog ) ·
P( dog | V, barked ) ·
P( barked | x, NONE ) ·
P( again | V, barked )
Dependency LM
Examples of additional context information
● Model E,C,wp,N: edge direction, number of children, word form of parent,
  word is the N-th child of its parent
P( The dog barked again ) =
P( The | right, 0, dog , 1) ·
P( dog | right, 1, barked , 1) ·
P( barked | left, 2, NONE , 1) ·
P( again | left, 0, barked , 2)
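All three structural contexts used above (E, C, N) can be read off a parent-index array in one pass. A sketch matching the slides' conventions (the root's edge direction counts as "left", and the root is the 1st child of an artificial super-root); the function name and dict layout are mine, not the talk's.

```python
def tree_features(parents):
    """For each word, derive: E = edge direction (is the parent to the right
    or left?), C = the word's own number of children, N = the word's rank
    among its parent's children, left to right. parents[i] = -1 is the root."""
    n = len(parents)
    children = [[] for _ in range(n)]
    root_children = []
    for i, p in enumerate(parents):
        # i iterates left to right, so each child list stays in surface order.
        (children[p] if p >= 0 else root_children).append(i)
    feats = []
    for i, p in enumerate(parents):
        sibs = children[p] if p >= 0 else root_children
        feats.append({
            "E": "right" if p > i else "left",  # root (p = -1) counts as "left"
            "C": len(children[i]),
            "N": sibs.index(i) + 1,
        })
    return feats
```

On the slides' tree (parents = [1, 2, -1, 2] for "The dog barked again") this reproduces the contexts shown in the E,C,wp,N example.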
Evaluation
● Train and test data from the CoNLL 2007 shared task
● 7 languages: Arabic, Catalan, Czech,
  English (450 000 tokens, 3 % OOV), Hungarian,
  Italian (75 000 tokens), and Turkish (26 % OOV)
● Cross-entropy = −(1/|T|) Σ i=1..|T| log2 P(wi | hi),
  measured on the test data T
● Perplexity = 2^Cross-entropy
● Lower perplexity ~ better LM
● Baseline … trigram LM
● 4 experimental settings: PLAIN, TAGS, DEP, DEP+TAGS
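The two evaluation formulas translate directly to code. In this minimal sketch, word_probs stands for the per-token probabilities P(wi | hi) that a model assigns on the test data; the function names are illustrative.

```python
import math

def cross_entropy(word_probs):
    """Cross-entropy = -(1/|T|) * sum of log2 P(w_i | h_i) over the test data."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

def perplexity(word_probs):
    """Perplexity = 2 ** cross-entropy; lower means a better LM."""
    return 2 ** cross_entropy(word_probs)
```

Intuitively, perplexity is the model's average branching factor: a model that assigns every test token probability 1/k has perplexity exactly k.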
Evaluation
[chart: normalized perplexity (40 %–110 % of baseline) per language (ar, ca, cs, en, hu, it, tr), comparing:
  w-1,w-2 (BASELINE);
  w+1,w+2 (PLAIN);
  T+1,t+1,l+1,w+1,T+2,t+2,l+2,w+2 (TAGS);
  E,C,wp,N,wg (DEP);
  E,C,Tp,tp,N,lp,wp,Tg,tg,lg (DEP+TAGS)]
Conclusion
(findings confirmed for all seven languages; improvements over baseline given for English)
● Post-trigram LM better than trigram LM,
  post-bigram better than bigram (PLAIN: 8 %)
● Additional context (POS & lemma) helps (TAGS: 20 %)
● Dependency structure helps even more (DEP: 24 %)
● The best perplexity achieved with DEP+TAGS (31 %)
Future plans
● Investigate the reason for the better post-ngram LM perplexity
● Extrinsic evaluation:
  – Post-ngram LM in speech recognition
  – Dependency LM in tree-to-tree machine translation
● Better smoothing using Generalized Parallel Backoff
● Bigger LMs for real applications
Thank you