Slide 1: SRILM Language Model
Student: Chia-Ho Ling
Instructor: Dr. Veton Z. Këpuska

Slide 2: Objective
Use the SRI Language Modeling (SRILM) toolkit to build four different 3-gram language models, one per smoothing method:
- Good-Turing smoothing
- Absolute discounting
- Witten-Bell smoothing
- Modified Kneser-Ney smoothing
Then decide which of these four 3-gram language models is best.

Slide 3: Linux or Linux-like Environment
Choose the Linux-like environment Cygwin. Download Cygwin for free from the following link: http://www.cygwin.com/

Slide 4: Cygwin Installation
- Download the Cygwin installation file
- Execute setup.exe
- Choose "Install from Internet"
- Select root install directory "C:\cygwin"
- Choose a download site from the mirrors

Slide 5: Cygwin Installation
The following packages should be selected when installing Cygwin for SRILM:
- gcc version 3.4.3 or higher
- GNU make
- John Ousterhout's TCL toolkit, version 7.3 or higher
- Tcsh
- gzip: to read/write compressed files
- GNU awk (gawk): to interpret many of the utility scripts
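
As a quick sanity check that these tools are installed and on the path, their versions can be queried from the Cygwin shell (a minimal sketch; the exact version strings will vary by installation):

$ gcc --version                      # expect 3.4.3 or higher
$ make --version                     # GNU make
$ gawk --version                     # GNU awk
$ tcsh --version
$ gzip --version
$ echo 'puts $tcl_version' | tclsh   # TCL 7.3 or higher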

Slide 6: SRILM Installation
Download the SRILM toolkit, srilm.tgz, from the following link: http://www.speech.sri.com/projects/srilm
Run cygwin.bat, then unzip srilm.tgz with the following commands:
$ cd /cygdrive/c/cygwin/srilm
$ tar zxvf srilm.tgz

Slide 7: SRILM Installation
After unpacking SRILM, edit the Makefile in the SRILM directory under the Cygwin folder. Add the following lines to set up the install directory and machine type:
SRILM=/cygdrive/c/cygwin/srilm
MACHINE_TYPE=cygwin
Then run Cygwin and type the following command to build SRILM:
$ make World

Slide 8: Main Functions of SRILM
- Generate N-gram counts from a corpus
- Train a language model based on the N-gram count file
- Use the trained language model to calculate the perplexity of test data
These three steps are sketched as commands below and as a flow chart on the next slide.
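
The whole pipeline reduces to three commands (a minimal sketch using the file names adopted later in this presentation; ngram-count's default discounting is Good-Turing, and test.txt stands in for any test file):

$ ngram-count -vocab lexicon.txt -text corpus.txt -order 3 -write count.txt -unk   # 1. count N-grams
$ ngram-count -read count.txt -order 3 -lm lm.txt                                  # 2. train the language model
$ ngram -ppl test.txt -order 3 -lm lm.txt                                          # 3. score the test data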

Slide 9: Flow Chart
[Diagram: training corpus + lexicon -> ngram-count -> count file; count file + lexicon -> ngram-count -> language model; language model + test data -> ngram -> perplexity (ppl)]

Slide 10: Training Corpus
Download the manually audited CallHome English conversation transcripts, "CallHome_English_trans970711", for our training corpus.

Slide 11: Lexicon
Use "wordtokenization.pl" to generate our lexicon. Because the training corpus consists of conversation transcripts, we have to remove timestamps, speaker information, all kinds of brackets, and interjections from our lexicon. Therefore, we need to add some code to wordtokenization.pl.

Slide 12: Lexicon
Adding the following Perl code removes the time and speaker information (each transcript line begins with a "time speaker:" prefix, so splitting on the colon keeps only the utterance):

# remove time and the speaker information
($time_and_speaker, $sentence) = split (/:/);
$_ = $sentence;
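
The effect of this split can be checked with a one-liner (an illustrative example; the sample line is hypothetical but follows the CallHome transcript layout):

$ echo "21.17 23.52 A: yeah I know what you mean" | perl -ne '($t, $s) = split(/:/); print $s;'
 yeah I know what you mean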

Slide 13: Lexicon
Adding the following Perl code removes all kinds of brackets and other non-word tokens:

# strip angle brackets around a word
$word =~ s/\>$//;
$word =~ s/^\<//;
$word =~ s/\>.$//;

# skip purely numeric tokens
if (($word =~ /[0-9]+/) && ($word !~ /[a-zA-Z]+/)) {
    next;
}

# a token that opens with a bracket or punctuation starts a non-word region
if (($word =~ /^{/)  || ($word =~ /^\[/) || ($word =~ /^\*/) ||
    ($word =~ /^\#/) || ($word =~ /^\&/) || ($word =~ /^\-/) ||
    ($word =~ /^\%/) || ($word =~ /^\!/) || ($word =~ /^\ /) ||
    ($word =~ /^\+/) || ($word =~ /^\./) || ($word =~ /^\,/) ||
    ($word =~ /^\//) || ($word =~ /^\?/) || ($word =~ /^\'/) ||
    ($word =~ /^\)/) || ($word =~ /^\(/)) {
    $not_word_flag = 1;
}

# print only tokens outside a non-word region
if (not $not_word_flag) {
    print $word, "\n";
}

# a token that ends with a closing bracket closes the non-word region
if ($not_word_flag) {
    if (($word =~ /}$/)  || ($word =~ /\]$/) || ($word =~ /\*$/) ||
        ($word =~ /\ $/) || ($word =~ /\+$/)) {
        $not_word_flag = 0;
    }
    print "\n";
}

Slide 14: Lexicon
Generate our lexicon by using the following commands (the modified tokenizer is saved as wordtokenization2.pl):
$ cat train/en_*.txt > corpus.txt
$ perl wordtokenization2.pl < corpus.txt > lexicon.txt
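
To spot-check the result before moving on, the generated lexicon can be inspected from the shell (an illustrative check, not part of the original workflow):

$ head lexicon.txt    # look at the first few entries
$ wc -l lexicon.txt   # count the lexicon entries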

Slide 15: Lexicon
[Screenshot of the generated lexicon.txt]

Slide 16: Count File
Generate the 3-gram count file by using the following command:
$ ./ngram-count -vocab lexicon.txt -text corpus.txt -order 3 -write count.txt -unk

Slide 17: Count File
[Screenshot of the generated count.txt]

Slide 18: Count File
ngram-count: count N-grams and estimate language models.
-vocab file: Read a vocabulary from file. Subsequently, out-of-vocabulary words in both counts and text are replaced with the unknown-word token. If this option is not specified, all words found are implicitly added to the vocabulary.
-text textfile: Generate N-gram counts from textfile, which should contain one sentence unit per line. Begin/end sentence tokens are added if not already present. Empty lines are ignored.
-order n: Set the maximal order (length) of N-grams to count. This also determines the order of the estimated LM, if any. The default order is 3.
-write file: Write total counts to file.
-unk: Build an "open vocabulary" LM, i.e., one that contains the unknown-word token as a regular word. The default is to remove the unknown word.
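
For reference, the count file written by -write is plain text with one N-gram per line followed by its count, for every order up to -order (the words and counts below are made up, shown only to illustrate the layout):

<s> i           412
<s> i was       187
i was           503
i was going     61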

Slide 19: Good-Turing Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm project/gtlm.txt -gt1min 1 -gt1max 3 -gt2min 1 -gt2max 3 -gt3min 1 -gt3max 3

Slide 20: Good-Turing Language Model
-read countsfile: Read N-gram counts from a file. ASCII count files contain one N-gram of words per line, followed by an integer count, all separated by whitespace. Repeated counts for the same N-gram are added.
-lm lmfile: Estimate a backoff N-gram model from the total counts and write it to lmfile.
-gtnmin count: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the minimal count of N-grams of order n that will be included in the LM. All N-grams with frequency lower than that will effectively be discounted to 0. If n is omitted, the parameter for N-grams of order > 9 is set. NOTE: this option affects not only the default Good-Turing discounting but also the alternative discounting methods described below.
-gtnmax count: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the maximal count of N-grams of order n that are discounted under Good-Turing. All N-grams more frequent than that will receive maximum-likelihood estimates. Discounting can be effectively disabled by setting this to 0. If n is omitted, the parameter for N-grams of order > 9 is set.
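
The Good-Turing estimate behind these options replaces each raw count r with a discounted count r* computed from the count-of-counts n_r (the number of distinct N-grams occurring exactly r times); with -gtmax 3 as above, only counts r <= 3 are discounted and higher counts keep their maximum-likelihood estimates:

$$ r^{*} = (r+1)\,\frac{n_{r+1}}{n_r}, \qquad P_{GT} = \frac{r^{*}}{N} $$

where N is the total number of N-gram tokens observed in training.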

Slide 21: Absolute Discounting Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm adlm.txt -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5

Slide 22: Absolute Discounting Language Model
-cdiscountn discount: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Ney's absolute discounting for N-grams of order n, using discount as the constant to subtract.
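
Concretely, absolute discounting subtracts the fixed constant D (here 0.5) from every nonzero count and gives the freed mass to the backoff distribution (a standard backoff formulation of Ney's method, shown for bigrams):

$$ P_{abs}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{c(w_{i-1} w_i) - D}{c(w_{i-1})} & \text{if } c(w_{i-1} w_i) > 0 \\
\mathrm{bow}(w_{i-1})\, P_{abs}(w_i) & \text{otherwise}
\end{cases} $$

where bow(w_{i-1}) is the backoff weight chosen so the distribution sums to one.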

Slide 23: Witten-Bell Discounting Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm wblm.txt -wbdiscount1 -wbdiscount2 -wbdiscount3

Slide 24: Witten-Bell Discounting Language Model
-wbdiscountn: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Witten-Bell discounting for N-grams of order n. (This is the estimator where the first occurrence of each word is taken to be a sample for the "unseen" event.)
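
Concretely, if c(h) is the total count of history h and T(h) is the number of distinct word types seen after h, Witten-Bell reserves T(h)/(c(h)+T(h)) of the probability mass for unseen events, which matches the "first occurrence counts as a sample of the unseen event" reading above:

$$ P_{WB}(w \mid h) = \frac{c(h\,w)}{c(h) + T(h)} \ \text{for seen } w, \qquad
\sum_{\text{unseen } w} P_{WB}(w \mid h) = \frac{T(h)}{c(h) + T(h)} $$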

Slide 25: Modified Kneser-Ney Discounting Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm knlm.txt -kndiscount1 -kndiscount2 -kndiscount3

Slide 26: Modified Kneser-Ney Discounting Language Model
-kndiscountn: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Chen and Goodman's modified Kneser-Ney discounting for N-grams of order n.
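
Two ideas distinguish this method. First, lower-order distributions use continuation counts rather than raw counts: a word's unigram probability reflects how many distinct contexts it follows. Second, "modified" means three separate discounts D1, D2, and D3+ are applied to N-grams with count 1, 2, and 3 or more (Chen and Goodman estimate them from the count-of-counts), instead of a single constant:

$$ P_{cont}(w) = \frac{\lvert \{ v : c(v\,w) > 0 \} \rvert}{\lvert \{ (v, w') : c(v\,w') > 0 \} \rvert}, \qquad
D(c) = \begin{cases} D_1 & c = 1 \\ D_2 & c = 2 \\ D_{3+} & c \ge 3 \end{cases} $$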

Slide 27: Four Language Models
[Screenshots of the four generated language model files]

Slide 28: Test Data Perplexity
Randomly choose three news articles from the Internet as test data. Commands for the four different 3-gram language models (shown here for test1.txt; test2.txt and test3.txt are scored the same way):
$ ./ngram -ppl project/test1.txt -order 3 -lm project/gtlm.txt
$ ./ngram -ppl project/test1.txt -order 3 -lm project/adlm.txt
$ ./ngram -ppl project/test1.txt -order 3 -lm project/wblm.txt
$ ./ngram -ppl project/test1.txt -order 3 -lm project/knlm.txt
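
For orientation, ngram -ppl prints a short summary per test file in roughly the following shape (all numbers here are made up for illustration; the real values appear in the result slides below):

file project/test1.txt: 42 sentences, 850 words, 12 OOVs
0 zeroprobs, logprob= -2446.7 ppl= 602.9 ppl1= 831.3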

Slide 29: Result of test1.txt
[Screenshot of the perplexity results for test1.txt; values summarized in the conclusion table]

Slide 30: Result of test2.txt
[Screenshot of the perplexity results for test2.txt; values summarized in the conclusion table]

Slide 31: Result of test3.txt
[Screenshot of the perplexity results for test3.txt; values summarized in the conclusion table]

Slide 32: Test Data Perplexity
-ppl textfile: Compute sentence scores (log probabilities) and perplexities from the sentences in textfile, which should contain one sentence per line.
-lm file: Read the (main) N-gram model from file. This option is always required, unless -null was chosen.
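
The reported perplexity follows from the total log probability (base 10): if the model assigns logprob to a test set of N scored tokens, then

$$ PPL = 10^{-\,\mathrm{logprob}/N} $$

where, by SRILM's convention, N counts words plus end-of-sentence tokens and excludes OOVs. A lower perplexity means the model predicts the test data better.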

Slide 33: Conclusion

          Good-Turing   Absolute Discounting   Witten-Bell   Kneser-Ney
test1     602.936       635.381                573.032       504.988
test2     470.316       478.307                425.725       353.042
test3     268.165       271.759                251.203       252.803

Modified Kneser-Ney gives the lowest perplexity on test1 and test2 and is within two points of the best model on test3, so it is the best of the four 3-gram language models.

Slide 34: References
- SRI International, "The SRI Language Modeling Toolkit", http://www.speech.sri.com/projects/srilm/, Dec. 2007.
- Cygwin Information and Installation, "Installing and Updating Cygwin", http://www.cygwin.com/, Dec. 2007.
- Daniel Jurafsky and James H. Martin, "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition", Fourth Indian Reprint, 2005.
- Manually audited CallHome English conversations, "CallHome_English_trans970711", http://my.fit.edu/~vkepuska/ece5527/CallHome/, Dec. 2007.
- Dr. Veton Z. Këpuska, "wordtokenization.pl", http://my.fit.edu/~vkepuska/ece5527/Example%20Code/wordtokenization.pl

