Name: Venkata Subramanyan Sundaresan. Instructor: Dr. Veton Kepuska.


1 Name: Venkata Subramanyan Sundaresan. Instructor: Dr. Veton Kepuska

2 N-GRAM Concept The idea of word prediction is formalized with a probabilistic model called the N-gram. Statistical models of word sequences are also called language models, or LMs. The idea of the N-gram model is to approximate the history by just the last few words.
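
To make the "last few words" approximation concrete, here is a minimal Python sketch of a bigram model, where the history is only the single previous word. The function name and the toy sentence are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def bigram_probabilities(tokens):
    """Estimate P(w2 | w1) by relative frequency from a list of tokens."""
    # Count how often each word occurs as a history (every token but the last).
    history_counts = Counter(tokens[:-1])
    # Count each adjacent word pair.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (w1, w2): count / history_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }

tokens = "the cat sat on the mat".split()
probs = bigram_probabilities(tokens)
print(probs[("the", "cat")])  # "the" is followed by "cat" once out of two times
```

A trigram model would condition on the previous two words in the same way, at the cost of sparser counts.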

3 CORPUS Counting things in natural language is based on a corpus. What is a corpus? It is an online collection of text or speech. There are two popular corpora: Brown (a 1-million-word collection) and Switchboard (a collection of 2,430 telephone conversations).

4 Perplexity Perplexity is interpreted as the weighted average branching factor of a language. The branching factor of a language is the number of possible next words that can follow any word. Perplexity is the most common evaluation metric for N-gram language models. An improvement in perplexity does not guarantee an improvement in speech recognition performance, so it is commonly used as a quick check of an algorithm.
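
The branching-factor reading of perplexity can be checked with a small Python sketch. It assumes the model has already assigned a probability to each word of a test sequence; the function name and the toy probabilities are illustrative, not taken from the slides.

```python
import math

def perplexity(word_probs):
    """Perplexity of a test sequence, given the model probability of each word."""
    n = len(word_probs)
    # Average log2 probability per word, then invert the sign and exponentiate.
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / n)

# If every word is one of 10 equally likely choices, perplexity is about 10,
# which matches the branching factor.
pp_uniform = perplexity([1 / 10] * 5)

# A model that is more confident about the test words gets a lower perplexity.
pp_skewed = perplexity([0.5, 0.25, 0.25])
print(pp_uniform, pp_skewed)
```

Lower perplexity means the model found the test data less surprising.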

5 SMOOTHING Smoothing is the process of flattening the probability distribution implied by a language model, so that all reasonable word sequences can occur with some probability.
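
As a rough illustration of "flattening", the sketch below uses add-one (Laplace) smoothing. This is a simpler technique swapped in for clarity; it is not one of the methods this deck builds with SRILM. The function name, vocabulary, and counts are assumptions for the example, and every token is assumed to be in the vocabulary.

```python
from collections import Counter

def add_one_probabilities(tokens, vocabulary):
    """Unigram probabilities with one extra count added to every vocabulary word."""
    counts = Counter(tokens)
    # Each of the len(vocabulary) words receives one extra count.
    total = len(tokens) + len(vocabulary)
    return {w: (counts[w] + 1) / total for w in vocabulary}

vocab = ["a", "b", "c"]
smoothed = add_one_probabilities(["a", "a", "b"], vocab)
print(smoothed)  # "c" was never seen but still gets a nonzero probability
```

Probability mass moves from the frequent word "a" to the unseen word "c", and the distribution still sums to one.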

6 Aspiration To use the SRILM (SRI Language Modeling) toolkit to build different language models. The language models are: Good-Turing smoothing and absolute discounting.

7 Linux Environment in Windows To set up a Linux-like environment on the Windows operating system, we have to install Cygwin. It is open-source software and can be downloaded from www.cygwin.com. Another main reason for installing Cygwin is that SRILM can be built and run on the Cygwin platform.

8 Installation Procedure for Cygwin Go to the provided webpage. Download the setup file. Select “Install from Internet”. Choose the destination folder where Cygwin will be installed. The installer will offer a long list of download sites; select one site and install all the packages.

9 SRILM Download the SRILM toolkit, srilm.tgz, from the following source: http://www.speech.sri.com/projects/srilm/ The toolkit is downloaded as a compressed archive (a gzipped tar file). Open the Cygwin terminal window and unpack the archive inside the Cygwin environment with the following command:
    tar zxvf srilm.tgz

10 SRILM Installation Once the previous steps are completed, we have to edit the SRILM Makefile inside the Cygwin folder. Once the editing is done, run the following command in the Cygwin terminal to build and install SRILM:
    $ make World

11 Function of SRILM Generate N-gram counts from the corpus. Train a language model based on the N-gram count file. Use the trained language model to calculate the perplexity of test data.

12 Lexicon A lexicon is a container of words belonging to the same language. Reference: Wikipedia

13 Lexicon Generation Use the “wordtokenization.pl” script to generate the lexicon we need, with the following commands:
    cat train/en_*.txt > corpus.txt
    perl wordtokenization.pl lexicon.txt
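
The contents of wordtokenization.pl are not shown in the slides. The Python sketch below is only an assumption about what such a step might do: collect the distinct word tokens of a text into a sorted word list. The tokenization rule (lowercase letters and apostrophes) is a hypothetical choice for illustration.

```python
import re

def build_lexicon(text):
    """Return the sorted list of distinct lowercase word tokens found in text."""
    # Hypothetical rule: a word is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return sorted(set(words))

lexicon = build_lexicon("The cat sat. The cat's mat!")
print("\n".join(lexicon))  # one word per line, as a vocabulary file might list them
```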

14 Count File Generate the 3-gram count file by using the following command:
    $ ./ngram-count -vocab lexicon.txt -text corpus.txt -order 3 -write count.txt -unk
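
To show what an N-gram count file conceptually contains, here is a small Python sketch that counts all N-grams up to a given order, with sentence-boundary markers added. It is an illustration under assumptions, not a reimplementation of ngram-count; the function name and the toy sentences are hypothetical, and the tab-separated line layout only approximates a count file.

```python
from collections import Counter

def ngram_counts(sentences, order):
    """Count every n-gram of length 1..order, with <s> and </s> sentence markers."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            # Slide a window of width n across the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts(["the cat sat", "the cat ran"], order=3)

# One n-gram per line, followed by a tab and its count.
lines = [f"{' '.join(gram)}\t{c}" for gram, c in sorted(counts.items())]
print("\n".join(lines))
```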

15 Good-Turing Language Model
    $ ./ngram-count -read project/count.txt -order 3 -lm project/gtlm.txt -gt1min 1 -gt1max 3 -gt2min 1 -gt2max 3 -gt3min 1 -gt3max 3
Type this command in the terminal window. The option "-lm lmfile" estimates a backoff N-gram model from the total counts and writes it to lmfile.
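
The basic Good-Turing idea can be sketched in Python as follows: an n-gram seen c times gets the adjusted count c* = (c + 1) N(c+1) / N(c), where N(c) is the number of distinct n-grams seen exactly c times. This is the textbook formula only; SRILM's own estimator includes refinements not reproduced here. The function name, the max_count parameter (loosely mirroring the role of a maximum-count threshold), and the toy counts are assumptions.

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts, max_count):
    """Apply c* = (c + 1) * N_{c+1} / N_c to counts up to max_count."""
    # N_c: how many distinct n-grams were seen exactly c times.
    freq_of_freq = Counter(ngram_counts.values())
    adjusted = {}
    for gram, c in ngram_counts.items():
        if c <= max_count and freq_of_freq[c + 1] > 0:
            adjusted[gram] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            # Leave the count alone when it is large or N_{c+1} is zero.
            adjusted[gram] = float(c)
    return adjusted

toy = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 2, "f g": 3}
adjusted = good_turing_adjusted_counts(toy, max_count=3)
print(adjusted)
```

The mass removed from the low counts is what a backoff model redistributes to unseen n-grams.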

16 Absolute Discounting Language Model
    $ ./ngram-count -read project/count.txt -order 3 -lm adlm.txt -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5
Here the order N can be anything between 1 and 9.
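
Absolute discounting subtracts a fixed constant D from every nonzero count and reserves the freed probability mass for unseen words. The Python sketch below shows this for a single history, using D = 0.5 as in the command above; the function name and the toy counts are illustrative assumptions, and how the reserved mass is shared among unseen words is not shown.

```python
def absolute_discount(word_counts, discount):
    """Discounted probabilities for the words following one history."""
    total = sum(word_counts.values())
    # Subtract the same constant from every observed count.
    discounted = {
        w: max(c - discount, 0) / total for w, c in word_counts.items()
    }
    # Mass left over for words never seen after this history.
    reserved = 1.0 - sum(discounted.values())
    return discounted, reserved

followers = {"cat": 3, "dog": 1}
probs, reserved = absolute_discount(followers, discount=0.5)
print(probs, reserved)
```

With two observed word types and D = 0.5, one full count out of four is set aside for unseen continuations.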

