Presentation transcript:

SRILM Language Model
Student: Chia-Ho Ling
Instructor: Dr. Veton Z. Këpuska
12/13/2007

Objective
Use the SRI Language Modeling (SRILM) toolkit to build four different 3-gram language models, one for each of the following smoothing methods:
- Good-Turing smoothing
- Absolute discounting
- Witten-Bell smoothing
- Modified Kneser-Ney smoothing
Then decide which of these four 3-gram language models is the best.

Linux or Linux-like Environment
Choose the Linux-like environment "Cygwin".
Download Cygwin for free from the following link:

Cygwin Installation
- Download the Cygwin installation file
- Execute setup.exe
- Choose "Install from Internet"
- Select the root install directory "C:\cygwin"
- Choose a download site from the list of mirrors

Cygwin Installation
The following packages should be selected in order to install SRILM:
- gcc (a recent version)
- GNU make
- John Ousterhout's Tcl toolkit, version 7.3 or higher
- Tcsh
- gzip: to read/write compressed files
- GNU awk (gawk): to interpret many of the utility scripts
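Once Cygwin is installed, a quick check that the required tools are on the PATH (a sketch; exact package names may differ slightly between Cygwin versions):

# print version information for the build tools SRILM needs
$ gcc --version && make --version && gawk --version && gzip --version
# confirm the Tcl and Tcsh shells were installed
$ which tclsh tcsh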

SRILM Installation
Download the SRILM toolkit, srilm.tgz, from the following link:
Run cygwin.bat
Unpack srilm.tgz with the following commands:
$ cd /cygdrive/c/cygwin/srilm
$ tar zxvf srilm.tgz

SRILM Installation
After unpacking SRILM, edit the Makefile in the SRILM folder. Add the following lines to set the paths:
SRILM=/cygdrive/c/cygwin/srilm
MACHINE_TYPE=cygwin
Then run Cygwin and type the following command to build and install SRILM:
$ make World
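As a quick sanity check after the build (a sketch; the bin/cygwin subdirectory name is an assumption based on MACHINE_TYPE=cygwin), confirm that the executables exist and, optionally, add them to the PATH:

# "make World" is expected to place the binaries under $SRILM/bin/$MACHINE_TYPE
$ ls /cygdrive/c/cygwin/srilm/bin/cygwin      # expect ngram, ngram-count, ...
$ export PATH=$PATH:/cygdrive/c/cygwin/srilm/bin/cygwin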

Main Functions of SRILM
- Generate N-gram counts from a corpus
- Train a language model based on the N-gram count file
- Use the trained language model to calculate the perplexity of test data

Flow Chart
[Flow chart: the training corpus and lexicon go into ngram-count, which produces a count file; the count file and lexicon go into ngram-count again, which produces the language model; the language model and test data go into ngram, which reports perplexity (ppl).]
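The flow chart corresponds to three commands; a minimal sketch, assuming the file names used later in these slides (lexicon.txt, corpus.txt, count.txt, gtlm.txt, test1.txt):

# 1) count 3-grams from the training corpus (open vocabulary)
$ ./ngram-count -vocab lexicon.txt -text corpus.txt -order 3 -write count.txt -unk
# 2) estimate a smoothed 3-gram language model from the counts
$ ./ngram-count -read count.txt -order 3 -lm gtlm.txt
# 3) compute the perplexity of the test data with the trained model
$ ./ngram -ppl test1.txt -order 3 -lm gtlm.txt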

Training Corpus
Download the manually audited CallHome English conversation transcripts, "CallHome_English_trans", as our training corpus.

Lexicon
Use "wordtokenization.pl" to generate our lexicon. Because the training corpus consists of conversation transcripts, we have to remove timestamps, speaker information, all kinds of brackets, and interjections from the lexicon. Therefore, we need to add some code to wordtokenization.pl.

Lexicon
Adding the following Perl code removes the timestamp and speaker information:
# remove time and the speaker information
($time_and_speaker, $sentence) = split (/:/);
$_ = $sentence;

Lexicon
Adding the following Perl code removes any kind of brackets:
# strip angle brackets around words
$word =~ s/\>$//;
$word =~ s/^\<//;
$word =~ s/\>.$//;
# skip purely numeric tokens
if (($word =~ /[0-9]+/) && ($word !~ /[a-zA-Z]+/)) { next; }
# flag tokens that open a bracketed or other non-word region
if (($word =~ /^{/) || ($word =~ /^\[/) || ($word =~ /^\*/) || ($word =~ /^\#/) ||
    ($word =~ /^\&/) || ($word =~ /^\-/) || ($word =~ /^\%/) || ($word =~ /^\!/) ||
    ($word =~ /^\ /) || ($word =~ /^\+/) || ($word =~ /^\./) || ($word =~ /^\,/) ||
    ($word =~ /^\//) || ($word =~ /^\?/) || ($word =~ /^\'/) || ($word =~ /^\)/) ||
    ($word =~ /^\(/)) {
    $not_word_flag = 1;
    #print "Beginning: ", $word, "\n";
}
if (not $not_word_flag) {
    print $word, "\n";
}
if ($not_word_flag) {
    # clear the flag when the token closes the bracketed region
    if (($word =~ /}$/) || ($word =~ /\]$/) || ($word =~ /\*$/) || ($word =~ /\ $/) || ($word =~ /\+$/)) {
        $not_word_flag = 0;
    }
    print "\n";
}
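To see the effect of these additions, a hypothetical transcript line can be piped through the modified script (the sample line and its exact format are assumptions, not taken from the CallHome data):

# hypothetical input with timestamps, a speaker tag, and bracketed noise markers
$ echo "837.32 841.57 A: {laugh} yeah I mean [[noise]] we could do that" | perl wordtokenization2.pl
# expected: plain words printed one per line; the time/speaker prefix is stripped
# and bracketed items are replaced by blank lines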

Lexicon
Generate the lexicon with the following commands:
$ cat train/en_*.txt > corpus.txt
$ perl wordtokenization2.pl < corpus.txt > lexicon.txt
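An optional cleanup sketch (an assumption, not a step shown in these slides): sort the word list and drop duplicates and blank lines before using it as the vocabulary.

# write the cleaned list to a separate file so lexicon.txt itself is untouched
$ sort -u lexicon.txt | grep -v '^$' > lexicon.sorted.txt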

Lexicon
[Screenshot of the generated lexicon file.]

Count File
Generate the 3-gram count file with the following command:
$ ./ngram-count -vocab lexicon.txt -text corpus.txt -order 3 -write count.txt -unk
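To verify the counts were written, one can peek at the file (a sketch; each line holds the N-gram words followed by an integer count, as described under -read below):

# first few counts
$ head count.txt
# only trigram lines: three words plus the count = four whitespace-separated fields
$ awk 'NF == 4' count.txt | head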

Count File
[Screenshot of the generated count file.]

Count File
ngram-count: count N-grams and estimate language models.
-vocab file: Read a vocabulary from file. Subsequently, out-of-vocabulary words in both counts or text are replaced with the unknown-word token. If this option is not specified, all words found are implicitly added to the vocabulary.
-text textfile: Generate N-gram counts from the text file. textfile should contain one sentence unit per line. Begin/end sentence tokens are added if not already present. Empty lines are ignored.
-order n: Set the maximal order (length) of N-grams to count. This also determines the order of the estimated LM, if any. The default order is 3.
-write file: Write total counts to file.
-unk: Build an "open vocabulary" LM, i.e., one that contains the unknown-word token as a regular word. The default is to remove the unknown word.

Good-Turing Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm project/gtlm.txt \
    -gt1min 1 -gt1max 3 -gt2min 1 -gt2max 3 -gt3min 1 -gt3max 3

Good-Turing Language Model
-read countsfile: Read N-gram counts from a file. ASCII count files contain one N-gram of words per line, followed by an integer count, all separated by whitespace. Repeated counts for the same N-gram are added.
-lm lmfile: Estimate a backoff N-gram model from the total counts, and write it to lmfile.
-gtnmin count: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the minimal count of N-grams of order n that will be included in the LM. All N-grams with frequency lower than that will effectively be discounted to 0. If n is omitted, the parameter for N-grams of order > 9 is set. NOTE: This option affects not only the default Good-Turing discounting but the alternative discounting methods described below as well.
-gtnmax count: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Set the maximal count of N-grams of order n that are discounted under Good-Turing. All N-grams more frequent than that will receive maximum likelihood estimates. Discounting can be effectively disabled by setting this to 0. If n is omitted, the parameter for N-grams of order > 9 is set.

Absolute Discounting Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm project/adlm.txt \
    -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5

Absolute Discounting Language Model
-cdiscountn discount: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Ney's absolute discounting for N-grams of order n, using discount as the constant to subtract.
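The constant 0.5 used above is one possible choice; a sketch of trying several discount constants (the adlm_$d.txt file names are illustrative assumptions):

# build one absolute-discounting LM per candidate constant
for d in 0.3 0.5 0.7; do
    ./ngram-count -read project/count.txt -order 3 -lm project/adlm_$d.txt \
        -cdiscount1 $d -cdiscount2 $d -cdiscount3 $d
done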

Witten-Bell Discounting Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm project/wblm.txt -wbdiscount1 -wbdiscount2 -wbdiscount3

Witten-Bell Discounting Language Model
-wbdiscountn: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Witten-Bell discounting for N-grams of order n. (This is the estimator where the first occurrence of each word is taken to be a sample for the "unseen" event.)

Modified Kneser-Ney Discounting Language Model
$ ./ngram-count -read project/count.txt -order 3 -lm project/knlm.txt -kndiscount1 -kndiscount2 -kndiscount3

Modified Kneser-Ney Discounting Language Model
-kndiscountn: where n is 1, 2, 3, 4, 5, 6, 7, 8, or 9. Use Chen and Goodman's modified Kneser-Ney discounting for N-grams of order n.

Four Language Models
[Screenshot of the four trained language model files.]

Test Data Perplexity
Randomly choose three news articles from the Internet as test data.
Commands for the four different 3-gram language models:
$ ./ngram -ppl project/test1.txt -order 3 -lm project/gtlm.txt
$ ./ngram -ppl project/test1.txt -order 3 -lm project/adlm.txt
$ ./ngram -ppl project/test1.txt -order 3 -lm project/wblm.txt
$ ./ngram -ppl project/test1.txt -order 3 -lm project/knlm.txt
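Since each of the three test files is scored against each of the four models, a small loop is convenient (a sketch using the file names above):

# score every model on every test file
for lm in gtlm adlm wblm knlm; do
    for t in test1 test2 test3; do
        echo "=== $lm on $t.txt ==="
        ./ngram -ppl project/$t.txt -order 3 -lm project/$lm.txt
    done
done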

Result of test1.txt
[Screenshot of the ngram -ppl output for test1.txt.]

Result of test2.txt
[Screenshot of the ngram -ppl output for test2.txt.]

Result of test3.txt
[Screenshot of the ngram -ppl output for test3.txt.]

Test Data Perplexity
-ppl textfile: Compute sentence scores (log probabilities) and perplexities from the sentences in textfile, which should contain one sentence per line.
-lm file: Read the (main) N-gram model from file. This option is always required, unless -null was chosen.

Conclusion
Perplexity of each 3-gram language model on each test file:
         Good-Turing   Absolute Discounting   Witten-Bell   Kneser-Ney
test1        …                  …                  …             …
test2        …                  …                  …             …
test3        …                  …                  …             …

References
SRI International, "The SRI Language Modeling Toolkit", Dec 2007.
Cygwin Information and Installation, "Installing and Updating Cygwin", Dec 2007.
Daniel Jurafsky and James H. Martin, "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition", Fourth Indian Reprint, 2005.
Manually audited CallHome English conversation transcripts, "CallHome_English_trans", Dec 2007.
Dr. Veton Z. Këpuska, "wordtokenization.pl".