SRI-LM toolkit

SRI-LM toolkit 

2003/12/16 

Louis Tsai 

Speech Lab, CSIE, NTNU 

louis@csie.ntnu.edu.tw 

1

SRILM 

• 由 SRI International 所開發出來的 tool, 用 

在 Statistical Language Models 

• http://www.speech.sri.com/projects/srilm 

• 目前版本 Ver 1.3.3 2003/03 

2

Create language model 

~ 

W 

= 

arg max 

W 

P( 

W 

| 

X 

) 

≅ 

arg max 

W 

P( 

X 

| W 

) P( 

W 

) 

•SRI-LM toolkit : 

•Step 1. Calculate ngram count from training data 

•Step 2. Create LM from ngram count 

3

training data 

• 已斷好詞的文件 

(ssp.train) 

關心孩子學業 

可是又苦無時間為孩子挑選參考書的家長們 

從今天起 

多了一個輕鬆的選擇 

只要趁工作之餘逛到巷口的 

填個國小參考書預購單 

就能以八五折的價格在限定時間內取貨了 

這項服務明年還將擴展到國中參考書 

由於國小教科書版本不同 

參考書也相當多 

各書店進書狀況和數量也不相同 

4

Vocabulary 

(Lexicon2003-72k.txt) 

巴 

八 

扒 

叭 

… 

巴巴 

叭叭 

八寶 

八方 

八德 

八堵 

吧女 

… 

共有 71695 個 words 

5

ngram count file 

在整個 training 

data 中出現的次數 

(ssp.n3.count) 

Unigram 

Bigram 

Trigram 

想像得到 4 

想像得到的 3 

想像得到的事 2 

想像得到的 1 

想像得到球員 1 

想像得到球員可以 1 

鳳凰 123 

鳳凰騋 4 

鳳凰騋 4 

鳳凰花 5 

鳳凰花開 1 

鳳凰花 4 

鳳凰 20 

鳳凰自行車 1 

鳳凰自行車 1 

鳳凰城 20 

鳳凰城 8 

鳳凰城大學 1 

鳳凰城的 1 

鳳凰城警方 1 

鳳凰城太陽 5 

鳳凰城電 3 

鳳凰城水星 1 

鳳凰機場 1 

鳳凰機場供 1 

鳳凰一樣 1 

鳳凰一樣快速 1 

鳳凰中隊 1 

鳳凰中隊 1 

鳳凰自然 1 

鳳凰自然來 1 

鳳凰般 3 

鳳凰般重 1 

鳳凰般再度 2 

鳳凰科技 6 

…… 

6

以 \data\ 開頭 

LM Format 

(ssp.n3.kn.lm) 

最高的 ngram 沒 

有 backoff weight 

\data\ 

ngram 1=71697 

ngram 2=2795822 

ngram 3=1700527 

各種 ngram 的個數 

\1-grams: 是 

unigram 的開頭 

\1-grams: 

-1.804081 

-99 -0.5467538 

-2.548589 一 -0.6464233 

-3.989615 一一 -0.2506604 

-6.685257 一一恐怖 -0.1400223 

-6.685257 一一恐怖攻擊 -0.1128442 

-6.075799 一丁點 -0.1827688 

-4.105376 一九九 -1.257224 

-5.808214 一了百了 -0.308177 

-5.921716 一刀兩斷 -0.3545529 

-3.940887 一下 -0.367599 

… 

-0.4833062 糊裏糊塗 

-0.6593975 餐廳裏往 

-0.6593975 館裏大家 

-0.2190889 鎮裏不同 

-0.4833062 鐵窗裏竄 

-1.223669 恒大 

-1.223669 恒生 

-0.3077702 恒春 

-0.3159989 小型恒指 

-0.4833062 屏東恒春 

-0.2190889 范恒山 

-0.4833062 從恒春 

-0.4833062 邊恒雄 

-0.6593975 上粧 

-0.2759993 化粧品 

\end\ 

Log probability 

(Base 10) 

Log of backoff 

weight (Base 10) 

以 \end\ 結尾 

7

ngram-count 

• 計算 training data 中每個出現的 ngram 的 

出現次數 

3 表示 trigram 

要計算 count 的 Training data 

詞典檔 

有 -unk: 把 OOV 當作一個符號處理 

無 -unk: 遇到 OOV 就移除 

輸出的 ngram count 檔名 

8

Good-Turing Discounting and 

預設的 

discounting 跟 

backoff 方法 

Katz Backoff 

前一步 ngram count 所產生的檔 

要輸出的 LM 檔 

9

Compute perplexity 

LM 檔 

要算 perplexity 的 test data 

輸出範 

例 : 

− 

logprob 

句數 + 字數 

10 

字數 

10 −logprob 

10

Good-Turing Discounting Parameters 

maxcount 

mincount 

自行指定 mincount 跟 maxcount : 

11

Modified Kneser-Ney Discounting 

12

• Take the conditional probabilities of bigrams (m-gram, 

m=2) for example: 

0 ≤ D ≤1 

⎧max{ C[ wi− 

1, 

wi 

] − D, 

0} 

⎪ 

if C[ wi− 

1, 

wi 

] > 0 

P ( ) = ⎨ [ 

− 

] 

KN 

wi 

w 

C w 

i−1 

i 1 

⎪ 

⎩α 

( wi− 

1) PKN 

( wi 

) 

otherwise 

1. 

P 

KN 

( w ) C[ • w ]/ 

C[ • w ] 

i 

= 

i ∑ 

w 

C 

[• 

w ] 

i 

is 

j 

the 

j 

, 

unique 

words 

preceding 

w 

i 

2. normalizing constant 

α 

( w ) 

i −1 

1 − 

= 

∑ 

w 

i 

: C 

[ w w ] 

i −1 

∑ 

w 

i 

i 

> 0 

: C 

max 

[ w w ] 

i −1 

i 

{ C [ w 

i −1w 

i 

] − D , 0} 

C [ w 

i −1 

] 

P ( w ) 

= 0 

KN 

i 

13

Witten-Bell Discounting 

14


• Key Concept—Things Seen Once: Use the count 

of things you’ve seen once to help estimate the 

count of things you’ve never seen 

• So we estimate the total probability mass of all the 

zero N-grams with the number of types divided by 

the number of tokens plus observed types: 

i: 

∑ 

c i 

p 

= 

T 

* 

i 

= 0 N + 

T 

N : the number of tokens 

T : observed types 

15


• T/(N+T) gives the total “probability of unseen N- 

grams”, we need to divide this up among all the 

zero N-grams 

• We could just choose to divide it equally 

Z 

= 

i: 

∑ 

c i 

= 0 

1 

Z is the total number of 

N-grams with count zero 

p * T 

= i Z( 

N + T 

) 

16


p 

* 

i 

ci 

= if ( ci 

N + T 

> 

0) 

Alternatively, we can represent the smoothed counts directly as: 

c 

* 

i 

= 

⎧T 

⎪Z 

⎨ 

⎪ci 

⎩ 

N 

N + T 

N 

N + T 

, 

, 

if 

if 

c 

c 

i 

i 

= 

> 

0 

0 

17

18 


N 

T 

N 

T 

N 

N 

T 

N 

N 

T 

N 

NT 

T 

N 

N 

c 

Z 

T 

N 

N 

Z 

T 

c i 

i 

i 

= 

+ 

+ 

= 

+ 

+ 

+ 

= 

+ 

+ 

⋅ 

+ ∑ 

> 

) 

( 

2 

0 

: 

1 

) 

( 0 

: 

= 

+ 

+ 

+ 

= 

+ 

+ 

⋅ 

+ ∑ 

> 

T 

N 

N 

T 

N 

T 

T 

N 

c 

Z 

T 

N 

Z 

T 

c i 

i 

i

• For bigram 

∑ 

p 


( w 

| 

w 

) 

T ( wx 

) 

N( 

w ) T ( w 

* 

i x 

i: 

c( 

wx wi 

) = 0 

x 

+ 

= 

T: the number of bigram types, N: the number of bigram token 

x 

) 

Z( 

p 

* 

w x 

) = 

∑ 

1 

i: 

c( 

w x w i ) = 0 

T ( wi 

−1) 

( wi 

| wi 

−1) 

= if( c 

−1 

Z( 

w )( N( 

w ) + T ( w )) 

i−1 

i−1 

i−1 

w i w 

i 

= 

0) 

∑ 

p 

( w 

| 

w 

) 

= 

c( 

w 

c( 

w ) 

* 

x 

i x 

i: 

c( 

wx wi 

) > 0 

x 

+ 

wi 

) 

T ( w 

x 

) 

19

Absolute Discounting 

20

P 

abs 

Smoothing 

absolute discounting 

• Absolute discounting subtracting a fixed 

discount D

Ristad's Natural Discounting 

Eric Ristad's Natural Law of Succession -- The discount factor d is identical 

for all counts,. 

d = ( n(n+1) + q(1- q) ) / ( n^2 + n + 2q ) 

where n is the total number events tokens, q is the number of observed 

event types. If q equals the vocabulary size no discounting is performed. 

22

SRI-LM toolkit

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?