
Introduction to SRILM Toolkit


Available Web Resources
• SRILM: “http://www.speech.sri.com/projects/srilm/”
  – A toolkit for building and applying various statistical language models (LMs)
  – Current version: 1.5.1 (stable)
  – Can be executed in a Linux environment
• Cygwin: “http://www.cygwin.com/”
  – Cygwin is a Linux-like environment for Windows
  – Current version: 1.5.21-1


Steps for Installing Cygwin (cont.)
7. Note that if you want to compile the original source code:
   – Change Category “View” to Full
   – Check that the packages “binutils”, “gawk”, “gcc”, “gzip”, “make”, “tcltk”, and “tcsh” are selected
   – If not, use the default setting


Steps for Installing Cygwin (cont.)
8. After installation, run Cygwin
   – It will generate “.bash_profile”, “.bashrc”, and “.inputrc” in “c:\cygwin\home\yourname\”


Steps for Installing SRILM Toolkit
Now we install “SRILM” into the “Cygwin” environment
1. Copy “srilm.tgz” to “c:\cygwin\srilm\”
   – Create the “srilm” directory if it doesn’t exist
   – Or, merely copy “srilm.zip” to “c:\cygwin\”
2. Extract “srilm.tgz” (source files) or “srilm.zip” (executable files)
   Commands in Cygwin:
   $ cd /
   $ mkdir srilm          // create the “srilm” directory
   $ cd srilm
   $ tar zxvf srilm.tgz   // extract srilm.tgz
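The extraction step can be rehearsed on a scratch archive before unpacking the real srilm.tgz; every path and file name in this sketch is a throwaway stand-in invented for the demo:

```shell
# Rehearse the "tar zxvf" step with a throwaway archive
# (/tmp/srilm_demo and its contents are invented; use srilm.tgz in practice)
mkdir -p /tmp/srilm_demo/srilm
cd /tmp/srilm_demo
echo "dummy content" > README
tar czf demo.tgz README      # pack a tiny gzip'd tarball, like srilm.tgz
cd srilm
tar zxvf ../demo.tgz         # z = gunzip, x = extract, v = verbose, f = archive file
ls README
```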


Steps for Installing SRILM Toolkit (cont.)
3. Edit “c:\cygwin\home\yourname\.bashrc”
   – Add the following lines to this file:
     export SRILM=/srilm
     export MACHINE_TYPE=cygwin
     export PATH=$PATH:$PWD:$SRILM/bin/cygwin
     export MANPATH=$MANPATH:$SRILM/man
4. Restart “Cygwin”
   – We can start to use SRILM if the precompiled files (e.g., those extracted from “srilm.zip”) are installed/copied into the desired directory
   – Otherwise, we have to compile the associated source code files (e.g., those extracted from “srilm.tgz”) manually (see Step 5)
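Whether the .bashrc additions took effect can be confirmed in a fresh shell; this sketch simply sets the same variables directly and inspects them, assuming the /srilm install path used on the slide:

```shell
# Set the variables exactly as the .bashrc lines would, then inspect them
# (/srilm is the install path assumed throughout the slides)
export SRILM=/srilm
export MACHINE_TYPE=cygwin
export PATH=$PATH:$SRILM/bin/cygwin
export MANPATH=$MANPATH:$SRILM/man
echo "SRILM=$SRILM"
echo "MACHINE_TYPE=$MACHINE_TYPE"
```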


Steps for Installing SRILM Toolkit (cont.)
5. Compile the SRILM source code files
   – Run Cygwin
   – Modify “/srilm/Makefile”
     • Add the line “SRILM = /srilm” to this file
   – Switch the current directory to “/srilm”
   – Execute the following commands:
     $ make World
     $ make all
     $ make cleanest
   – Check “INSTALL” or “srilm/doc/README.windows” for more detailed information


Environmental Setups
• Change Cygwin’s maximum memory:
  regtool -i set /HKLM/Software/Cygnus\ Solutions/Cygwin/heap_chunk_in_mb 2048
  – Reference: “http://cygwin.com/cygwin-ug-net/setup-maxmem.html”
• Use Chinese input in Cygwin
  – Manually edit the “.inputrc” and “.bashrc” files:
    .inputrc:
      set meta-flag on
      set convert-meta off
      set output-meta on
      set input-meta on
    .bashrc:
      export LESSCHARSET=latin1
      alias ls="ls --show-control-chars"
  – Reference: “http://cygwin.com/faq/faq_3.html#SEC48”


Functionalities of SRILM
• Three main functionalities
  – Generate the n-gram count file from the corpus
  – Train the language model from the n-gram count file
  – Calculate the test data perplexity using the trained language model
• Workflow (from the slide’s diagram):
  – Step 1: Training corpus → ngram-count → count file
  – Step 2: Count file + lexicon → ngram-count → LM
  – Step 3: Test data + LM → ngram → perplexity (ppl)


Format of the Training Corpus
• Corpus: e.g., “CNA0001-2M.Train” (56.7 MB)
  – Newswire texts with tokenized Chinese words, e.g.:
    中 華 民 國 八 十 九 年 一 月 一 日
    黃 兆 平
    面 對 這 個 歷 史 性 的 時 刻
    由 中 國 電 視 公 司
    昨 晚 在 中 正 紀 念 堂 吸 引 了 超 過 十 萬 人 潮
    共 同 迎 接 千 禧 年
    勤 奮 努 力
    欣 欣 向 榮 外
    ……


Format of the Lexicon
• Lexicon: “Lexicon2003-72k.txt”
  – Vocabulary size: 71695
  – Maximum character-length of a word: 10
  – Sample entries (one word per line):
    巴
    八
    扒
    叭
    ……
    墨竹
    默祝
    末梢
    沒收
    墨守
    陌生
    ……
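The two statistics quoted for the lexicon (entry count and maximum word length) are easy to recompute with a one-line awk scan; the toy lexicon below uses invented ASCII entries to sidestep multibyte length() differences between awk implementations:

```shell
# Count entries and the longest entry in a small fabricated lexicon
# (a real run would scan Lexicon2003-72k.txt instead)
printf '%s\n' "a" "bb" "ccc" "dddd" > /tmp/toy.lex
awk '{ n++; if (length($0) > max) max = length($0) }
     END { print n " entries, max length " max }' /tmp/toy.lex
# prints "4 entries, max length 4"
```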


Generating the N-gram Count File
• Command
  ngram-count -vocab Lexicon2003-72k.txt
              -text CNA0001-2M.Train
              -order 3
              -write CNA0001-2M.count
              -unk
– Parameter settings
  -vocab: lexicon file name
  -text: training corpus name
  -order: order of the n-grams to count
  -write: output count file name
  -unk: mark OOV words as <unk>


Format of the N-gram Count File
• E.g., “CNA0001-2M.count”
  – Each line holds one n-gram and its count in the training corpus; unigram, bigram, and trigram entries appear in the same file
  – Sample entries:
    想像 得到 1
    想像 得到 的 1
    鳳凰 162
    鳳凰 花 5
    鳳凰 花 開 4
    鳳凰城 41
    鳳凰城 太陽 28
    業界 希望 2
    業界 希望 迫切 1
    ……
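Since the count file is just “n-gram count” lines, the counting itself can be mimicked with a small awk sketch; this is a toy stand-in for illustration only (it handles unigrams and bigrams over a two-line invented corpus), not SRILM’s ngram-count:

```shell
# Toy unigram/bigram counter over a tiny whitespace-tokenized corpus
# (the corpus lines are invented; real input would be CNA0001-2M.Train)
printf '%s\n' "a b a" "a b" > /tmp/toy.train
awk '{
  for (i = 1; i <= NF; i++) uni[$i]++              # unigram counts
  for (i = 1; i <  NF; i++) bi[$i " " $(i+1)]++    # bigram counts
} END {
  for (g in uni) print g, uni[g]
  for (g in bi)  print g, bi[g]
}' /tmp/toy.train | LC_ALL=C sort > /tmp/toy.count
cat /tmp/toy.count
```

This yields “a 3”, “b 2”, “a b 2”, and “b a 1”, the same one-n-gram-per-line shape as the file above.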


Generating the N-gram Language Model
• Command
  ngram-count -read CNA0001-2M.count
              -order 3
              -lm CNA0001-2M_N3_GT3-7.lm
              -gt1min 3 -gt1max 7
              -gt2min 3 -gt2max 7
              -gt3min 3 -gt3max 7
– Parameter settings
  -read: input count file name
  -lm: output LM file name
  -gtNmin/-gtNmax: lower/upper count thresholds for Good-Turing discounting of N-grams


Format of the N-gram Language Model File
• E.g., “CNA0001-2M_N3_GT3-7.lm” (ARPA format)
  – Each entry lists the log probability (base 10), the n-gram itself, and, when present, the log of its backoff weight (base 10)
  – Excerpt:
    \data\
    ngram 1=71697
    ngram 2=2933381
    ngram 3=1205445

    \1-grams:
    -0.8424806 </s>
    -99 <s> -1.291354
    -2.041174 一 -1.287858
    -9.690391 一丁點
    -7.180892 一了百了 -0.1229095
    -4.802495 一下 -0.4828814
    ……
    \end\
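Because the ARPA file is plain text delimited by \data\, \N-grams:, and \end\ markers, a single section can be pulled out with sed; the here-document below builds a fabricated two-word LM purely for illustration, not a real SRILM output:

```shell
# Extract the unigram section from a miniature, made-up ARPA-format LM
cat > /tmp/mini.lm <<'EOF'
\data\
ngram 1=2
ngram 2=1

\1-grams:
-0.30103 hello -0.1
-0.30103 world -0.2

\2-grams:
-0.1 hello world

\end\
EOF
# print from the "\1-grams:" header up to the next blank line
sed -n '/^\\1-grams:/,/^$/p' /tmp/mini.lm
```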


Calculating the Test Data Perplexity
• Command
  ngram -ppl 506.pureText
        -order 3
        -lm CNA0001-2M_N3_GT3-7.lm
– Parameter settings
  -ppl: calculate the perplexity of the test data
• Sample output
  file 506.pureText: 506 sentences, 38307 words, 0 OOVs
  0 zeroprobs, logprob= -117172 ppl= 1044.42 ppl1= 1144.86
• Definitions
  ppl = 10^(-logprob / (#Sentences + #Words))
  ppl1 = 10^(-logprob / #Words)
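The two definitions can be sanity-checked against the sample output with awk, plugging in the reported logprob = -117172, 38307 words, and 506 sentences; the recomputed values land within a few hundredths of the printed 1044.42 and 1144.86 because the true logprob is rounded in the printout:

```shell
# Recompute ppl and ppl1 from the figures in the sample ngram -ppl output
logprob=-117172
words=38307
sents=506
ppl=$(awk  -v lp="$logprob" -v n="$((words + sents))" 'BEGIN { printf "%.2f", 10^(-lp/n) }')
ppl1=$(awk -v lp="$logprob" -v n="$words"             'BEGIN { printf "%.2f", 10^(-lp/n) }')
echo "ppl=$ppl ppl1=$ppl1"
```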


Other Discounting Techniques
• Absolute discounting
  ngram-count -read CNA0001-2M.count
              -order 3
              -lm CNA0001-2M_N3_AD.lm
              -cdiscount1 0.5
              -cdiscount2 0.5
              -cdiscount3 0.5
• Witten-Bell discounting
  ngram-count -read CNA0001-2M.count
              -order 3
              -lm CNA0001-2M_N3_WB.lm
              -wbdiscount1
              -wbdiscount2
              -wbdiscount3


Other Discounting Techniques (cont.)
• Modified Kneser-Ney discounting
  ngram-count -read CNA0001-2M.count
              -order 3
              -lm CNA0001-2M_N3_KN.lm
              -kndiscount1
              -kndiscount2
              -kndiscount3
• Available online documentation:
  “http://www.speech.sri.com/projects/srilm/manpages/”
