Language Models
Doctor Lecture 2004


Daichi Mochihashi

daiti-m@is.naist.jp

daichi.mochihashi@atr.jp

NAIST / ATR SLT Dept.2 (SLR)

2004.07.16



Overview

What is a Language Model?

N-gram models (including various smoothing)

Variable N-grams

N-gram Language Model evaluation

Whole sentence maximum entropy

Long distance models

(Old) cache/trigger models

Latent variable models



What is a Language Model?

Assign a probability p(w) to a word sequence

    w = w_1 w_2 w_3 ··· w_n

    w    : a sentence, a document, a word sequence, ...
    p(w) : how probable (how natural) that sequence is, ...

Assigning probabilities ⇔ generating sequences (a dual problem).



Why Language Models?

Speech recognition

Statistical machine translation (Brown et al. 1990)

    p(J|E) ∝ p(E|J) p(J)    (translation model × language model)

Information retrieval (Zhai & Lafferty 2001, Berger & Lafferty 1999)

Predictive text entry (HCI), OCR / handwriting recognition

Shannon's prediction / compression of text, ...



How to model p(w)?

How should p(w) = p(w_1^n) be modeled?

Conditional method (N-gram):

    p(w) = ∏_{t=1}^{T} p(w_t | w_1^{t−1})                                    (1)

"Whole sentence" method:

    p(w) ∝ p_0(w) · exp ( ∑_i λ_i f_i(w) )                                   (2)

Note: (2) includes (1) as the special case p_0(w) = p_cond(w).



N-gram approximation model

p(w_1^n) = ∏_{t=1}^{T} p(w_t | w_{t−1} w_{t−2} ··· w_1)                      (3)

         ≈ ∏_{t=1}^{T} p(w_t | w_{t−1} ··· w_{t−(n−1)})                       (4)
                            (condition on the last n−1 words only)

           ⎧ ∏_{t=1}^{T} p(w_t | w_{t−1}, w_{t−2})   (n = 3)
         = ⎨ ∏_{t=1}^{T} p(w_t | w_{t−1})            (n = 2)                  (5)
           ⎩ ∏_{t=1}^{T} p(w_t)                      (n = 1)

Note: (3) follows by repeatedly applying p(X, Y|Z) = p(X|Y, Z) p(Y|Z).
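To make the factorization concrete, here is a minimal bigram (n = 2) sketch in Python, not from the original slides: it uses raw maximum-likelihood counts, and the toy corpus and function names are invented for illustration. It also exposes the zero-frequency problem that the smoothing slides address next.

```python
from collections import defaultdict
import math

def train_bigram_counts(sentences):
    """Collect unigram and bigram counts, with <s>/<\\s>-style boundary markers."""
    uni, bi = defaultdict(int), defaultdict(int)
    for words in sentences:
        toks = ["<s>"] + words + ["</s>"]
        for t in range(1, len(toks)):
            uni[toks[t - 1]] += 1
            bi[(toks[t - 1], toks[t])] += 1
    return uni, bi

def bigram_logprob(words, uni, bi):
    """log p(w) = sum_t log p(w_t | w_{t-1}) with ML estimates (no smoothing)."""
    toks = ["<s>"] + words + ["</s>"]
    logp = 0.0
    for t in range(1, len(toks)):
        num, den = bi[(toks[t - 1], toks[t])], uni[toks[t - 1]]
        if num == 0:            # unseen bigram: p = 0, log p = -infinity
            return float("-inf")
        logp += math.log(num / den)
    return logp

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
uni, bi = train_bigram_counts(corpus)
print(bigram_logprob(["the", "cat", "ran"], uni, bi))    # finite
print(bigram_logprob(["the", "cat", "slept"], uni, bi))  # -inf: needs smoothing
```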




Language Model Smoothing


f i|j , f j 〈w j →w i 〉, w j ,

:

ˆp(i|j) = f i|j

f j

. (6)

n-gram 0 ()


1. n-gram

2. ()

3. n-gram



Katz’s Backing-Off

What if f_{i|j} = 0 (or f_{i|j} ≤ k for some small threshold k)?

    p(i|j) = (1 − α(j)) · p̂(i|j)    when f_{i|j} > 0
           = α(j) · p(i)            when f_{i|j} = 0                          (7)

    f_{i|j} > 0 : use the (discounted) higher-order estimate;
    f_{i|j} = 0 : fall back to the lower-order distribution p(i).

(Back-off): when the higher-order estimate is unreliable, back off to the
lower-order model. (In Katz's method the discount comes from the Good-Turing
estimate.)
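A sketch of the back-off structure of (7), assuming a fixed absolute discount in place of Katz's Good-Turing-based discount; the function names and toy data are invented for illustration.

```python
def make_backoff_bigram(bigram_counts, unigram_counts, discount=0.5):
    """Structure of Katz back-off (eq. 7), with a fixed absolute discount
    standing in for the Good-Turing discount of the original method."""
    total = sum(unigram_counts.values())
    context_total = {}            # f_j
    seen_next = {}                # words i observed after context j
    for (j, i), c in bigram_counts.items():
        context_total[j] = context_total.get(j, 0) + c
        seen_next.setdefault(j, set()).add(i)

    def p_unigram(i):
        return unigram_counts.get(i, 0) / total

    def p(i, j):
        c = bigram_counts.get((j, i), 0)
        f_j = context_total.get(j, 0)
        if f_j == 0:                          # unknown context: pure unigram
            return p_unigram(i)
        if c > 0:                             # seen: discounted ML estimate
            return (c - discount) / f_j
        # alpha(j): mass freed by discounting, spread over unseen words
        # in proportion to their unigram probability
        alpha = discount * len(seen_next[j]) / f_j
        unseen_unigram_mass = 1.0 - sum(p_unigram(i2) for i2 in seen_next[j])
        return alpha * p_unigram(i) / unseen_unigram_mass
    return p

bi = {("the", "cat"): 2, ("the", "dog"): 1}
uni = {"the": 3, "cat": 2, "dog": 1, "fish": 1}
p = make_backoff_bigram(bi, uni)
print(p("cat", "the"), p("fish", "the"))      # seen vs. backed-off
```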



Class-based n-gram


→ (like HMM)

w 1 ,w 2 ,... ( 1 ) c 1 ,c 2 ,...


T∏

p(w1 T )= p(w t |c t )p(c t |c t−1 ) (8)

t=1

,

〈log p(w1 T )〉 = 1 T [log p(w t|c t )+logp(c t |c t−1 )] (9)


T →∞


c 1 ,c 2

p(c 1 ,c 2 )log p(c 1,c 2 )

p(c 1 )p(c 2 ) + ∑ w

p(w)logp(w)

= ∑ c 1 ,c 2

I(c 1 ,c 2 ) − H(w). ()
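A toy illustration of (8) with a hand-made word → class map; every number and name here is hypothetical, only the structure of the computation follows the slide.

```python
import math

# Hypothetical toy model: word -> class map and the two component distributions.
word2class = {"the": "DET", "a": "DET", "cat": "N", "dog": "N", "runs": "V", "sleeps": "V"}
p_w_given_c = {("the", "DET"): 0.6, ("a", "DET"): 0.4,
               ("cat", "N"): 0.5, ("dog", "N"): 0.5,
               ("runs", "V"): 0.5, ("sleeps", "V"): 0.5}
p_c_given_c = {("<s>", "DET"): 0.9, ("<s>", "N"): 0.05, ("<s>", "V"): 0.05,
               ("DET", "N"): 0.9, ("DET", "DET"): 0.05, ("DET", "V"): 0.05,
               ("N", "V"): 0.8, ("N", "N"): 0.1, ("N", "DET"): 0.1,
               ("V", "DET"): 0.4, ("V", "N"): 0.3, ("V", "V"): 0.3}

def class_bigram_logprob(words):
    """log p(w_1^T) = sum_t [ log p(w_t|c_t) + log p(c_t|c_{t-1}) ]   (eq. 8)"""
    prev = "<s>"
    logp = 0.0
    for w in words:
        c = word2class[w]        # each word belongs to exactly one class
        logp += math.log(p_c_given_c[(prev, c)]) + math.log(p_w_given_c[(w, c)])
        prev = c
    return logp

print(class_bigram_logprob(["the", "cat", "runs"]))
```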



Language Model Smoothing

Simple Smoothing Methods

Laplace smoothing, Lidstone’s law


Extended Smoothing Methods

Good-Turing smoothing, Kneser-Ney Smoothing,

Bayes smoothing

In NLP, the extended methods are the ones used in practice.



Simple Smoothing Methods

Laplace smoothing

    p(i|j) = (f_{i|j} + 1) / ∑_{i=1}^{W} (f_{i|j} + 1) = (f_{i|j} + 1) / (f_j + W)        (10)

Lidstone's law : add λ instead of 1 (classically, λ = 1/2)

    p(i|j) = (f_{i|j} + λ) / ∑_{i=1}^{W} (f_{i|j} + λ) = (f_{i|j} + λ) / (f_j + Wλ)        (11)

           = µ · f_{i|j}/f_j + (1 − µ) · 1/W,    where µ = f_j / (f_j + Wλ)                 (12)

i.e. nothing more than a linear interpolation between the ML estimate and the
uniform distribution 1/W (!)
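Equations (11) and (12) transcribed directly into code (not from the slides); the assertion checks that the two forms agree.

```python
def lidstone(f_ij, f_j, W, lam=0.5):
    """Lidstone estimate (eq. 11) and its interpolation form (eq. 12)."""
    direct = (f_ij + lam) / (f_j + W * lam)
    mu = f_j / (f_j + W * lam)
    interpolated = mu * (f_ij / f_j) + (1 - mu) * (1.0 / W)
    assert abs(direct - interpolated) < 1e-12      # the two forms agree
    return direct

# Example: count 3 out of 10 in context j, vocabulary of 1000 words
print(lidstone(3, 10, 1000))         # lambda = 1/2
print(lidstone(3, 10, 1000, lam=1))  # lambda = 1  ->  Laplace smoothing (eq. 10)
```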



Extended Smoothing methods

Good-Turing smoothing


(Modified) Kneser-Ney smoothing
    : found to perform best in the empirical comparison of Chen and Goodman (1998)

Hierarchical Bayes optimal smoothing (MacKay 1994)



Good-Turing smoothing

    p(i|j) = (f_{i|j} + 1) · N(f_{i|j} + 1) / N(f_{i|j}) / f_j                              (13)

where N(r) is the number of n-gram types occurring exactly r times
(applied only when f_{i|j} is small).
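A rough sketch of (13) using the count-of-counts N(r), not from the slides; real implementations additionally smooth N(r) (e.g. Gale & Sampson's Simple Good-Turing), which is omitted here.

```python
from collections import Counter

def good_turing_probs(counts):
    """Good-Turing: replace a count r by r* = (r+1) N(r+1)/N(r)   (cf. eq. 13).

    `counts` maps an event (e.g. a bigram) to its frequency.  N(r) is the
    number of event types seen exactly r times.  Here we simply keep the ML
    estimate whenever N(r+1) = 0; the adjusted values are then typically
    renormalized."""
    N = Counter(counts.values())                   # count-of-counts
    total = sum(counts.values())
    adjusted = {}
    for event, r in counts.items():
        if N[r + 1] > 0:
            adjusted[event] = (r + 1) * N[r + 1] / N[r] / total
        else:                                      # no N(r+1): fall back to ML
            adjusted[event] = r / total
    p_unseen_total = N[1] / total                  # mass reserved for unseen events
    return adjusted, p_unseen_total

counts = {"a b": 1, "a c": 1, "b c": 2, "c a": 3}
print(good_turing_probs(counts))
```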


(Modified) Kneser-Ney smoothing

Absolute discounting (subtract a small constant from every observed count):

    p(i|j) = ( f_{i|j} − D(f_{i|j}) ) / f_j + γ(j) p(i)                        (14)

The discount D(n) depends on the count n (with Y = N(1) / (N(1) + 2 N(2))):

    n = 1 :   D(1)  = 1 − 2 · Y · N(2)/N(1)
    n = 2 :   D(2)  = 2 − 3 · Y · N(3)/N(2)
    n ≥ 3 :   D(3+) = 3 − 4 · Y · N(4)/N(3)

γ(j) is the normalizing weight of the back-off part for context j.
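The discount formulas above transcribed into code, plus the shape of (14); this is a sketch, not from the slides. The lower-order distribution p(i) and the normalizer γ(j) are passed in as functions, and the count-of-counts N(1)..N(4) are assumed to be nonzero.

```python
from collections import Counter

def modified_kn_discounts(bigram_counts):
    """D(1), D(2), D(3+) from the count-of-counts N(r), as on the slide."""
    N = Counter(bigram_counts.values())
    Y = N[1] / (N[1] + 2 * N[2])
    D1 = 1 - 2 * Y * N[2] / N[1]
    D2 = 2 - 3 * Y * N[3] / N[2]
    D3 = 3 - 4 * Y * N[4] / N[3]
    return D1, D2, D3

def kn_prob(i, j, bigram_counts, context_totals, gamma, p_lower):
    """Absolute discounting with a count-dependent discount (eq. 14)."""
    c = bigram_counts.get((j, i), 0)
    D1, D2, D3 = modified_kn_discounts(bigram_counts)
    D = 0.0 if c == 0 else (D1 if c == 1 else D2 if c == 2 else D3)
    return max(c - D, 0.0) / context_totals[j] + gamma(j) * p_lower(i)

print(modified_kn_discounts({("a", "b"): 1, ("a", "c"): 2, ("b", "c"): 3, ("c", "a"): 4}))
```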



Hierarchical Bayes Optimal Smoothing (MacKay 1994)

Put a Dirichlet prior Dir(α) on each conditional distribution p(·|j). The posterior mean is

    E[p(i|j)] = (f_{i|j} + α_i) / ∑_i (f_{i|j} + α_i)                          (15)

              = f_j/(f_j + α_0) · p̂(i|j)  +  α_0/(f_j + α_0) · ᾱ_i             (16)

    where α_0 = ∑_k α_k and ᾱ_i = α_i / α_0.

α = (α_1, α_2, ..., α_W) : the hyperparameters of the Dirichlet prior.

Note that ᾱ_i ≠ p(i) : the distribution backed off to is not simply the unigram distribution.
(But how should α be chosen?)



Hierarchical Bayes Optimal Smoothing (2)

The evidence (marginal likelihood) of the whole count table F given α is

    p(F|α) = ∏_{j=1}^{W} [ Γ(α_0) / ∏_{i=1}^{W} Γ(α_i) · ∏_{i=1}^{W} Γ(f_{i|j} + α_i) / Γ(f_j + α_0) ]     (17)

There is no closed-form maximizer for α, but a simple fixed-point iteration converges (Minka 2003):

    α_i^{(t+1)} = α_i^{(t)} · [ ∑_j ( Ψ(f_{i|j} + α_i) − Ψ(α_i) ) ] / [ ∑_j ( Ψ(f_j + α_0) − Ψ(α_0) ) ]     (18)

(Implementable in about 45 lines of MATLAB; cf. SVM2004.)
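A sketch of the fixed-point iteration (18) using NumPy/SciPy (digamma = Ψ), not from the slides; the stopping rule and initialization are arbitrary choices for illustration.

```python
import numpy as np
from scipy.special import digamma

def fit_dirichlet(F, n_iter=200, tol=1e-8):
    """Fixed-point update (eq. 18) for the Dirichlet hyperparameters alpha
    maximizing p(F|alpha) in eq. (17).  F is a (contexts x vocabulary) count
    matrix; row j holds the counts f_{i|j}."""
    F = np.asarray(F, dtype=float)
    alpha = np.ones(F.shape[1])                  # simple initialization
    row_totals = F.sum(axis=1)                   # f_j
    for _ in range(n_iter):
        a0 = alpha.sum()
        num = (digamma(F + alpha) - digamma(alpha)).sum(axis=0)
        den = (digamma(row_totals + a0) - digamma(a0)).sum()
        new_alpha = alpha * num / den
        if np.max(np.abs(new_alpha - alpha)) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha

# Smoothed estimate, eq. (15): E[p(i|j)] = (f_{i|j} + alpha_i) / (f_j + alpha_0)
F = np.array([[5, 1, 0, 0], [2, 2, 1, 0], [0, 0, 3, 4]])
alpha = fit_dirichlet(F)
print((F + alpha) / (F.sum(axis=1, keepdims=True) + alpha.sum()))
```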



Variable Order n-grams (1)

A fixed context length (e.g. 2 words) is arbitrary:
    some predictions need a longer history, others a shorter one;
    typical multi-word units (collocations, Named Entities) behave like single words.

Can the model choose the context length by itself?


Variable Order n-grams (2)

Pereira et al. (1995) “Beyond Word N-Grams”

Builds on:

Ron, Singer, Tishby (1994) "The Power of Amnesia"
    Prediction Suffix Tree (PST)

Willems et al. (1995)
    Context Tree Weighting method (CTW), from the data-compression literature

(cf. Sadakane 2000)


Ron, Singer, Tishby (1994)

[Figure: an example Prediction Suffix Tree (PST) over the alphabet {0, 1}.
Each node is labeled with a suffix (ε, 1, 0, 10, 00, 110, 010) and stores its
next-symbol probabilities g(0), g(1); e.g. the node 110 has g(0) = 0.6, g(1) = 0.4.]

Prediction uses the deepest node whose label is a suffix of the history:

    h = 00100101110  →  p(0|h) = 0.6, p(1|h) = 0.4    (deepest matching suffix: 110)
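A lookup sketch for the PST above, not from the slides; the node probabilities are only partly recoverable from the figure, so treat the table below as hypothetical.

```python
# Hypothetical PST loosely based on the figure: each node (a suffix string)
# stores its next-symbol distribution g_s.  Values not recoverable from the
# figure are made up.
pst = {
    "":    {"0": 0.5, "1": 0.5},
    "0":   {"0": 0.6, "1": 0.4},
    "10":  {"0": 0.3, "1": 0.7},
    "00":  {"0": 0.9, "1": 0.1},
    "110": {"0": 0.6, "1": 0.4},
    "010": {"0": 0.8, "1": 0.2},
}

def pst_predict(history, pst):
    """Use the longest suffix of the history that is a node of the PST."""
    for length in range(len(history), -1, -1):
        suffix = history[len(history) - length:]
        if suffix in pst:
            return pst[suffix]
    raise KeyError("the empty context should always be present")

print(pst_predict("00100101110", pst))   # deepest match "110" -> {0: 0.6, 1: 0.4}
```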



Ron, Singer, Tishby (1994)(2)

How the PST is grown:

A node s (e.g. 010) is extended to σs, σ ∈ {0, 1} (e.g. 010 → 0010),
only if the longer context changes the prediction enough:

    p(σs) · D( p(·|σs) || p(·|s) ) > ε                                        (19)

(= Yodo (1998))

Experimental setting: ε = 0.001, contexts up to N = 30 (30-gram).
Resulting tree: on the order of ≤ 3000 nodes; learned long contexts include
'shall be', 'there was'.



Pereira et al. (1995)

PST Mixture (cf. CTW, Willems et al. (1995))

For a long sequence w = w_1 w_2 w_3 ··· w_N (N very large), a single PST T assigns

    p(w_1 ··· w_N | T) = ∏_{i=1}^{N} γ_{C_T(w_1···w_{i−1})}(w_i)               (20)

where C_T(·) is the node of T selected by the context. Prediction then mixes over all trees:

    p(w_{N+1} | w_1 ··· w_N) = p(w_1 ··· w_{N+1}) / p(w_1 ··· w_N)              (21)

        = [ ∑_{T∈𝒯} p(T) p(w_1 ··· w_{N+1} | T) ] / [ ∑_{T∈𝒯} p(T) p(w_1 ··· w_N | T) ]   (22)

    p(T) : prior over PSTs T.



Pereira et al. (1995) (2)

Tree Mixture: the sum over all trees T ∈ 𝒯 can be computed recursively.

    ∑_{T∈𝒯} p(T) p(w_1 ··· w_N | T) = L_mix(ε)                                 (23)

    L_mix(s) = α · L_n(s)  +  (1 − α) · ∏_{σ∈W} L_mix(σs)                      (24)
                   (emission at node s)    (mixture over deeper contexts)

    α : tree prior (0 ≤ α ≤ 1); ε is the empty (root) context.
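A structural sketch of the recursion (23)-(24) over a binary alphabet, not from the slides: the per-node likelihoods L_n(s) are given as data (in CTW they would be Krichevsky-Trofimov estimates of the symbols routed to node s), and the depth bound, α and all numbers are made up.

```python
ALPHABET = ["0", "1"]

def l_mix(s, l_node, alpha=0.5, max_depth=3):
    """L_mix(s) = alpha * L_n(s) + (1 - alpha) * prod_sigma L_mix(sigma s)   (eq. 24)."""
    if len(s) == max_depth:                   # leaves: no deeper contexts to mix in
        return l_node.get(s, 1.0)
    deeper = 1.0
    for sigma in ALPHABET:
        deeper *= l_mix(sigma + s, l_node, alpha, max_depth)
    # a node with no observations has likelihood 1.0 (probability of empty data)
    return alpha * l_node.get(s, 1.0) + (1 - alpha) * deeper

# Made-up node likelihoods; the mixture over all trees rooted at the empty
# context is l_mix("") as in eq. (23).
l_node = {"": 1e-3, "0": 2e-3, "1": 1e-3, "00": 5e-3, "10": 1e-3}
print(l_mix("", l_node, alpha=0.5, max_depth=2))
```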



Pereira et al. (1995) (3)

In practice the prediction is computed online:

    p(w_n | w_1 ··· w_{n−1}) = γ̃_ε(w_n)                                        (25)

For a PST node s matching the current context, the smoothed output γ̃_s(w_n)
is defined recursively:

    γ̃_s(w_n) = γ_s(w_n)                                         if s has no deeper extension
             = q_n γ_s(w_n) + (1 − q_n) γ̃_{w_{n−|s|} s}(w_n)     otherwise       (26)

    q_n : the mixture weight of stopping at node s; γ̃(w) is the resulting smoothed prediction.



Language Model evaluation

How should a language model be evaluated?

Perplexity

Cross Entropy

...



Basics of Basics

If each outcome has probability p, then

    p = 1/(number of possibilities)   ∴   (number of possibilities) = 1/p

Shannon information (self-information):

    log [ 1/p(x) ] = − log p(x)                                               (27)

Entropy = the expected Shannon information:

    H(p) = − ∑_x p(x) log p(x)                                                (28)
         = ⟨ − log p ⟩_p                                                       (29)



Perplexity

Perplexity = the (geometric) average number of word choices per position:

    PPL(w_1^T) = ( ∏_{t=1}^{T} 1/p(w_t | w_1^{t−1}) )^{1/T}                    (30)

               = exp ( log ( ∏_{t=1}^{T} 1/p(w_t | w_1^{t−1}) )^{1/T} )         (31)

               = exp ( − (1/T) ∑_{t=1}^{T} log p(w_t | w_1^{t−1}) ).            (32)
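Perplexity as in (32), computed from per-token log probabilities (a minimal sketch, not from the slides).

```python
import math

def perplexity(log_probs):
    """PPL = exp( -(1/T) * sum_t log p(w_t | w_1^{t-1}) ),  eq. (32).

    `log_probs` holds the natural-log conditional probability the model
    assigned to each of the T test tokens."""
    T = len(log_probs)
    return math.exp(-sum(log_probs) / T)

# If the model assigned probability 1/100 to every token, PPL = 100:
print(perplexity([math.log(1 / 100)] * 50))   # -> 100.0 (up to rounding)
```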



Cross Entropy (1)

Kullback-Leibler divergence:

    D(p||q) = ∑_x p(x) log [ p(x)/q(x) ]                                       (33)
            = ⟨ log p(x) − log q(x) ⟩_p                                         (34)
            = ⟨ (− log q(x)) − (− log p(x)) ⟩_p                                 (35)

∴ the expected extra code length incurred by using q(x) in place of the true p(x).

→ To bring the model q close to the true distribution p, minimize D(p||q).



Cross Entropy (2)

    D(p||q) = ∑ p log (p/q)                                                    (36)
            = ∑ p log p − ∑ p log q                                             (37)
            = −H(p) − ∑ p log q   →   minimize the second term.                 (38)

This term is the cross entropy H(p, q):

    H(p, q) = − ∑_x p(x) log q(x)    ( = ⟨− log q(x)⟩_p )                       (39)

So minimizing the cross entropy is the same as minimizing the KL divergence,
since H(p) does not depend on the model q.
(In practice p(x) is the empirical distribution of the test data, i.e. uniform
over the test samples.)


Whole sentence maximum entropy (1)

The model:

    p(s) ∝ p_0(s) · exp( Λ · F(s) )                                            (40)

    p_0(s) : a baseline distribution over whole sentences s
    Λ = (λ_1 λ_2 ... λ_n) : ME weights
    F(s) = (f_1(s) f_2(s) ... f_n(s)) : feature values of the sentence s

Taking p_0(s) to be an n-gram model, whole-sentence ME corrects it with
arbitrary sentence-level features that an n-gram cannot express.
The result is a (whole-sentence) Random Field model.
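A sketch of scoring with (40), not from the slides; the baseline p_0, the features and the weights are all invented stand-ins. Because the normalizer is a sum over all sentences, only unnormalized scores (and hence ratios, e.g. for reranking) are computed.

```python
import math

# Hypothetical sentence-level features and weights; p0 would normally be an
# n-gram model, here just a stub.
def f_length(s):
    return float(len(s.split()))

def f_has_verb(s):
    return 1.0 if any(w in ("is", "are", "runs") for w in s.split()) else 0.0

features = [f_length, f_has_verb]
lambdas  = [-0.05, 0.7]

def p0(s):
    return 0.1 ** len(s.split())     # stand-in baseline "n-gram" score

def unnormalized_score(s):
    """p(s) up to the (intractable) global constant: p0(s) * exp(Lambda . F(s)),  eq. (40)."""
    return p0(s) * math.exp(sum(l * f(s) for l, f in zip(lambdas, features)))

# Ratios of scores are usable even without the normalizer (e.g. for reranking):
print(unnormalized_score("the cat runs") / unnormalized_score("cat the the"))
```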




Whole sentence maximum entropy (2)

Training (the constraints / gradient):

    ∑_s p(s) f_i(s) = ⟨f_i⟩_p̂      (model feature expectations = empirical ones)    (41)

    (equivalently, the gradient is ⟨f_i⟩_p̂ − ⟨f_i⟩_p)

⇒ The left-hand side cannot be computed exactly, so sentences are sampled
from p(s) in (40):

    Gibbs sampling (Pietra, Lafferty 1995)
    Independent Metropolis sampler
    Importance sampling from an n-gram proposal



Whole sentence maximum entropy (3)

Sample sentences from the model (Rosenfeld et al. 2000):

    What do you have to live los angeles
    A. B. C. N. N. business news tokyo
    Be of says I'm not at this it
    Bill Dorman been well I think the most

Sampled words from the random field of (Pietra, Lafferty 1995):

    was, reaser, in, there, to, will, ,, was, by, homes,
    thing, be, reloverated, ther, which, conists, at,
    fores, anditing, with, Mr., proveral, the, ***, ...

    (obtained by Gibbs sampling)



Whole sentence maximum entropy (4)

Still, the gain is small...

    Baseline (n-gram) p_0 : PPL = 81.37
    Whole Sentence ME     : PPL = 80.49 ± .02

Why??

For a (binary) feature f, the most the model can gain is the KL divergence

    D(p̂(f)||p(f)) = p̂(f) log [ p̂(f)/p(f) ] + (1 − p̂(f)) log [ (1 − p̂(f))/(1 − p(f)) ]

which is roughly p̂(f) · [ log p̂(f) − log p(f) ]: small unless the feature is
both frequent and badly mispredicted by the baseline.

ME + ...? (e.g. ME = HMM (Goodman 2002/2004), LME (Wang 2003))



Long Distance Models

Dependencies beyond the n-gram "window":

(Old) cache / trigger models

Latent variable models


Cache/trigger models

Cache model

    Mix in the empirical distribution of the most recent k words:
    words that were just used tend to be used again.
    How should k (and the mixing weight) be set? (Beeferman 1997a)

Trigger model

    A word such as 'hospital' makes related words like 'nurse' and 'disease'
    more likely later in the document.
    Trigger pairs are added as features of an ME model (Beeferman 1997b).
    The number of candidate pairs is ∼ W × W, far too many (!)



Latent Variable Models



Mixture model over latent topics: the topic is not observed!

PLSI (Gildea & Hofmann 1998)

LDA (Blei et al. 2002, 2003)



PLSI Language model

Train PLSI on a document collection to obtain K topic-wise unigram
distributions p(w|z_1), p(w|z_2), ..., p(w|z_K) by EM.

Given a history w = w_1 w_2 ··· w_n, find the mixing weights λ = (λ_1, ..., λ_K)
that maximize

    p(w|λ) = ∏_j ∑_{i=1}^{K} λ_i p(w_j | z_i)                                  (42)

→ with this λ, predict the next word by

    p(w | w_1 ··· w_n) = ∑_{i=1}^{K} λ_i p(w | z_i) .                           (43)
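A sketch of (42)-(43), not from the slides: fold-in EM estimates λ on the history with the topic-word matrix held fixed, then the mixture predicts the next word. The toy topic-word matrix is made up.

```python
import numpy as np

def plsi_lm_predict(history_ids, topic_word, n_iter=50):
    """Fold-in EM for eq. (42): fit the mixing weights lambda on the history
    (topic_word is the K x V matrix p(w|z) already trained by PLSI), then
    predict the next word by eq. (43)."""
    K, V = topic_word.shape
    lam = np.full(K, 1.0 / K)
    probs = topic_word[:, history_ids]             # K x n: p(w_j | z_i)
    for _ in range(n_iter):
        r = lam[:, None] * probs                   # E-step: lambda_i * p(w_j|z_i)
        r /= r.sum(axis=0, keepdims=True)          #         responsibilities
        lam = r.mean(axis=1)                       # M-step: new mixing weights
    return lam @ topic_word                        # eq. (43): p(w | history), length V

# Toy example with K = 2 topics over a vocabulary of 4 words
topic_word = np.array([[0.4, 0.4, 0.1, 0.1],       # topic 1: words 0, 1
                       [0.1, 0.1, 0.4, 0.4]])      # topic 2: words 2, 3
print(plsi_lm_predict([0, 1, 0], topic_word))      # history about topic 1
```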



LDA Language model

In the PLSI LM, λ is a maximum-likelihood point estimate fitted to the history
→ unreliable for short histories / new documents.
Instead treat λ itself as a random variable and integrate it out
(writing w = w_1 ··· w_n for the history):

    p(w|w) = ∫ p(w|λ) p(λ|w) dλ                                                (44)

           = ∫ ∑_{i=1}^{K} λ_i p(w|z_i) · p(λ|w) dλ                             (45)

           = ∑_{i=1}^{K} ⟨λ_i | w⟩ p(w|z_i)                                     (46)

where ⟨λ_i | w⟩ is the posterior mean of λ_i given the history w.
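A Monte-Carlo sketch of (44)-(46), not from the slides: the posterior mean ⟨λ_i|w⟩ is approximated by importance sampling with the Dirichlet prior as proposal. This is only an illustrative stand-in (the LDA papers use variational inference), and the prior parameter alpha and topic-word matrix here are made up.

```python
import numpy as np

def lda_lm_predict(history_ids, topic_word, alpha=0.5, n_samples=20000, seed=0):
    """Average p(w|lambda) over the posterior p(lambda | history), eqs. (44)-(46),
    with <lambda_i | w> approximated by importance sampling from the prior."""
    rng = np.random.default_rng(seed)
    K, V = topic_word.shape
    lam = rng.dirichlet([alpha] * K, size=n_samples)          # samples from the prior
    probs = topic_word[:, history_ids]                        # K x n
    loglik = np.log(lam @ probs).sum(axis=1)                  # log p(history | lambda)
    w = np.exp(loglik - loglik.max())                         # importance weights
    post_mean_lam = (w[:, None] * lam).sum(axis=0) / w.sum()  # <lambda_i | history>
    return post_mean_lam @ topic_word                         # eq. (46)

topic_word = np.array([[0.4, 0.4, 0.1, 0.1],
                       [0.1, 0.1, 0.4, 0.4]])
print(lda_lm_predict([0, 1, 0], topic_word))
```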



Future of Long distance models

Much remains to be done:

Hierarchical Mixture (NIPS 2003)?

Combination with Maxent?

Truly capturing long distance dependencies.



Conclusion

Let’s model a language!

Generative models and Bayesian methods make us happy.. :-)


