1 Natural Language Processing (6) Zhao Hai 赵海 Department of Computer Science and Engineering Shanghai Jiao Tong University Revised from Joshua Goodman (Microsoft Research) and Michael Collins (MIT)

2 Outline: (Statistical) Language Model

3 A bad language model

7 Really Quick Overview: Humor; → What is a language model?; Really quick overview – two minute probability overview, how language models work (trigrams)

8 What's a Language Model? A language model is a probability distribution over word sequences: P(“And nothing but the truth”) should be relatively high, while P(“And nuts sing on the roof”) ≈ 0.

9 What's a language model for? Speech recognition, handwriting recognition, spelling correction, optical character recognition, machine translation (and anyone doing statistical modeling).

10 Really Quick Overview: Humor; What is a language model?; → Really quick overview – two minute probability overview, how language models work (trigrams)

11 Everything you need to know about probability – definition P(X) means the probability that X is true: P(baby is a boy) ≈ 0.5 (fraction of babies that are boys); P(baby is named John) is much smaller (fraction of babies named John). (Venn diagram: Babies ⊃ baby boys ⊃ babies named John)

12 Everything about probability: joint probabilities P(X, Y) means the probability that X and Y are both true, e.g. P(brown eyes, boy).

13 Everything about probability: conditional probabilities P(X|Y) means the probability that X is true when we already know Y is true: P(baby is named John | baby is a boy) is small, while P(baby is a boy | baby is named John) ≈ 1.

14 Everything about probabilities: math P(X|Y) = P(X, Y) / P(Y). For example, P(baby is named John | baby is a boy) = P(baby is named John, baby is a boy) / P(baby is a boy), i.e. the joint probability divided by 0.5.

15 Everything about probabilities: Bayes' rule P(X|Y) = P(Y|X) × P(X) / P(Y). For example, P(named John | boy) = P(boy | named John) × P(named John) / P(boy).

16 Really Quick Overview: Humor; What is a language model?; → Really quick overview – two minute probability overview, how language models work (trigrams)

17 THE Equation
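The slide body is an image in the transcript. In this context it is presumably the standard noisy-channel decoding equation of speech recognition; a sketch:

```latex
% Pick the word sequence W that is most probable given the acoustics A.
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \arg\max_{W} \underbrace{P(A \mid W)}_{\text{acoustic model}} \;
                       \underbrace{P(W)}_{\text{language model}}
```

The language model P(W) is the piece this lecture is about.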

18 How Language Models Work It is hard to compute P(“And nothing but the truth”) directly. Step 1: decompose the probability with the chain rule: P(“And nothing but the truth”) = P(“And”) × P(“nothing” | “And”) × P(“but” | “And nothing”) × P(“the” | “And nothing but”) × P(“truth” | “And nothing but the”).

19 The Trigram Approximation Step 2: Make Markov independence assumptions. Assume each word depends only on the previous two words (three words total – tri means three, gram means writing): P(“the” | “… whole truth and nothing but”) ≈ P(“the” | “nothing but”), P(“truth” | “… whole truth and nothing but the”) ≈ P(“truth” | “but the”).

20 Trigrams, continued How do we find probabilities? Get real text, and start counting! P(“the” | “nothing but”) ≈ C(“nothing but the”) / C(“nothing but”)
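As a concrete illustration (not from the slides), here is a minimal Python sketch of maximum-likelihood trigram estimation by counting; the toy corpus and function name are made up for the example:

```python
from collections import defaultdict

def train_trigram_mle(tokens):
    """Estimate P(z | x y) = C(x y z) / C(x y) by counting a token list."""
    trigram_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for x, y, z in zip(tokens, tokens[1:], tokens[2:]):
        trigram_counts[(x, y, z)] += 1
        bigram_counts[(x, y)] += 1

    def prob(z, x, y):
        if bigram_counts[(x, y)] == 0:
            return 0.0          # unseen history: MLE gives zero (hence the need for smoothing)
        return trigram_counts[(x, y, z)] / bigram_counts[(x, y)]

    return prob

# Toy usage
corpus = "and nothing but the truth and nothing but the facts".split()
p = train_trigram_mle(corpus)
print(p("the", "nothing", "but"))   # C("nothing but the") / C("nothing but") = 2/2 = 1.0
```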

21 Real Overview: Overview (basics: probability, language model definition); Real Overview; → Evaluation; Smoothing; More techniques (Caching, Skipping, Clustering, Sentence-mixture models, Structured language models); Tools

22 Evaluation How can you tell a good language model from a bad one? Run a speech recognizer (or your application of choice), calculate word error rate –Slow –Specific to your recognizer

23 Evaluation: Perplexity Intuition Ask a speech recognizer to recognize digits, “0, 1, 2, 3, 4, 5, 6, 7, 8, 9” – easy – perplexity 10. Ask a speech recognizer to recognize one of 30,000 names at Microsoft – hard – perplexity 30,000. Ask a speech recognizer to recognize “Operator” (1 in 4), “Technical support” (1 in 4), “sales” (1 in 4), or one of 30,000 names (1 in 120,000 each) – perplexity 54. Perplexity is a weighted equivalent branching factor.

24 Evaluation: perplexity “A, B, C, D, E, F, G…Z”: –perplexity is 26 “Alpha, bravo, charlie, delta…yankee, zulu”: –perplexity is 26 Perplexity measures language model difficulty, not acoustic difficulty.

25 Perplexity: Math Perplexity is the geometric average inverse probability. Imagine a model: “Operator” (1 in 4), “Technical support” (1 in 4), “sales” (1 in 4), 30,000 names (1 in 120,000 each). Imagine data: all 30,004 outcomes equally likely. Example: the perplexity of the test data, given the model, is 119,829. Remarkable fact: the true model for the data has the lowest possible perplexity.

26 Perplexity: Math Imagine a model: “Operator” (1 in 4), “Technical support” (1 in 4), “sales” (1 in 4), 30,000 names (1 in 120,000 each). Imagine data: all 30,004 outcomes equally likely. We can compute three different perplexities – model (ignoring test data): 54; test data (ignoring model): 30,004; model on test data: 119,829. When we say perplexity, we mean “model on test”. Remarkable fact: the true model for the data has the lowest possible perplexity.
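The slide's formula is an image; the standard definition it relies on, as a sketch:

```latex
% Perplexity of a model P on test data w_1 ... w_N ("model on test"):
\mathrm{PP} = P(w_1 \dots w_N)^{-1/N}
            = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})} \right)^{1/N}
            = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1 \dots w_{i-1})}
```

Plugging in the example above – test words uniform over the 30,004 outcomes, model giving 1/4 to three of them and 1/120,000 to each name – yields roughly the 119,829 figure quoted on the slide.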

27 Perplexity: Is lower better? Remarkable fact: the true model for the data has the lowest possible perplexity, so the lower the perplexity, the closer we are to the true model. Typically, perplexity correlates well with speech recognition word error rate – it correlates better when both models are trained on the same data, and doesn't correlate well when the training data changes.

28 Perplexity: The Shannon Game Ask people to guess the next letter, given context. Compute perplexity. –(when we get to entropy, the “100” column corresponds to the “1 bit per character” estimate)

29 Evaluation: Cross Entropy Entropy = log2(perplexity). It should be called “cross-entropy of model on test data.” Remarkable fact: entropy is the average number of bits per word required to encode the test data using this probability model and an optimal coder; it is measured in bits.

30 Real Overview: Overview (basics: probability, language model definition); Real Overview; Evaluation; → Smoothing; More techniques (Caching, Skipping, Clustering, Sentence-mixture models, Structured language models); Tools

31 Smoothing: None This is the maximum likelihood estimate, P(z|xy) = C(xyz) / C(xy). It has the lowest perplexity of any trigram on the training data, but is terrible on test data: if C(xyz) = 0, the probability is 0.

32 Smoothing: Add One What is P(sing | nuts)? Zero? That leads to infinite perplexity! Add-one smoothing (see the sketch below): works very badly. DO NOT DO THIS. Add-delta smoothing: still very bad. DO NOT DO THIS.
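The formulas on this slide and the previous one are images in the transcript; the standard forms, with V the vocabulary size, are:

```latex
% Maximum likelihood, add-one (Laplace) and add-delta smoothing for trigrams:
P_{\mathrm{ML}}(z \mid xy) = \frac{C(xyz)}{C(xy)},
\qquad
P_{+1}(z \mid xy) = \frac{C(xyz) + 1}{C(xy) + V},
\qquad
P_{+\delta}(z \mid xy) = \frac{C(xyz) + \delta}{C(xy) + \delta V}
```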

33 Smoothing: Simple Interpolation The trigram is very context-specific and very noisy; the unigram is context-independent and smooth. Interpolate trigram, bigram, and unigram for the best combination (see the sketch below). Find the weights λ, 0 < λ < 1, by optimizing on “held-out” data. Almost good enough.
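A sketch of the interpolation the slide describes (standard form; the slide's own equation is an image):

```latex
% Linear interpolation of trigram, bigram and unigram maximum-likelihood estimates:
P_{\text{interp}}(z \mid xy)
  = \lambda\, P_{\mathrm{ML}}(z \mid xy)
  + \mu\, P_{\mathrm{ML}}(z \mid y)
  + (1 - \lambda - \mu)\, P_{\mathrm{ML}}(z)
```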

34 Smoothing: Finding parameter values Split the data into training, “held-out”, and test sets. Try lots of different values for λ on the held-out data and pick the best; then test on the test data. Sometimes you can use tricks like EM (expectation maximization) to find the values. Goodman suggests using a generalized search algorithm, “Powell search” – see Numerical Recipes in C.

35 An Iterative Method Initialization: pick arbitrary/random values for the λ's. Step 1: calculate, on held-out data, the expected contribution of each component model. Step 2: re-estimate the λ's from these quantities. Step 3: if the λ's have not converged, go to Step 1. (A sketch of this EM-style procedure is given below.)
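The slide's update formulas are images; here is a minimal Python sketch of this kind of EM re-estimation for interpolation weights, assuming we already have the component probabilities (function names and inputs are hypothetical):

```python
def em_interpolation_weights(heldout, components, iters=20):
    """Estimate interpolation weights for a list of component models.

    heldout:    list of (history, word) pairs from held-out data
    components: list of functions p_k(word, history) -> probability
    Returns one weight per component (they sum to 1).
    """
    k = len(components)
    lambdas = [1.0 / k] * k                     # arbitrary starting values
    for _ in range(iters):
        expected = [0.0] * k
        for history, word in heldout:
            probs = [lam * p(word, history) for lam, p in zip(lambdas, components)]
            total = sum(probs)
            if total == 0.0:
                continue
            for j in range(k):                  # E-step: fractional counts per component
                expected[j] += probs[j] / total
        norm = sum(expected)
        lambdas = [e / norm for e in expected]  # M-step: re-estimate the weights
    return lambdas
```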

36 Smoothing digression: Splitting data How much data for training, held-out, and test? Some people say things like “1/3, 1/3, 1/3” or “80%, 10%, 10%”. They are WRONG. Held-out data needs enough words per parameter being estimated. Answer: enough test data to be statistically significant (thousands of words, perhaps).

37 Smoothing digression: Splitting data Be careful: WSJ data is divided into stories. Some are easy (lots of numbers, financial), others much harder; use enough data to cover many stories. Be careful: some stories are repeated in the data sets. You can take held-out and test data from the end – better – or randomly from within the training data.

38 Smoothing: Jelinek-Mercer Simple interpolation uses the same λ's everywhere. Better: let the λ's depend on the context – smooth a little after “The Dow”, a lot after “Adobe acquired”.

39 Smoothing: Jelinek-Mercer continued Find the λ's by cross-validation on held-out data. Also called “deleted interpolation”.

40 Smoothing: Good Turing Invented during WWII by Alan Turing (and Good?), later published by Good; frequency estimates were needed within the Enigma code-breaking effort. Define n_r = the number of elements x for which Count(x) = r. The modified count for any x with Count(x) = r, r > 0, is (r+1) n_{r+1} / n_r. This leads to the following estimate of the “missing mass”: n_1 / N, where N is the size of the sample – the estimated probability of seeing a new element x on the (N+1)-th draw.

41 Smoothing: Good Turing Imagine you are fishing You have caught 10 Carp, 3 Cod, 2 tuna, 1 trout, 1 salmon, 1 eel. How likely is it that next species is new? 3/18 How likely is it that next is tuna? Less than 2/18

42 Smoothing: Good Turing The number of species (words) seen once is used as an estimate of how many are unseen. All other estimates are adjusted (down) to leave probability mass for the unseen ones.

43 Smoothing: Good Turing Example 10 carp, 3 cod, 2 tuna, 1 trout, 1 salmon, 1 eel. How likely is new data (p_0)? Let n_1 be the number of species occurring once (3) and N the total count (18): p_0 = n_1 / N = 3/18. How likely is eel? Its adjusted count is 1* = 2 × n_2/n_1 = 2 × 1/3 = 2/3, so P(eel) = 1*/N = (2/3)/18 = 1/27.
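A small Python sketch of these Good-Turing numbers for the fishing example (purely illustrative):

```python
from collections import Counter

catch = ["carp"] * 10 + ["cod"] * 3 + ["tuna"] * 2 + ["trout", "salmon", "eel"]
counts = Counter(catch)                 # species -> count
N = sum(counts.values())                # 18 fish total
n = Counter(counts.values())            # n[r] = number of species seen r times

p_new = n[1] / N                        # missing mass: 3/18

def adjusted_count(r):
    """Good-Turing modified count r* = (r+1) * n_{r+1} / n_r."""
    return (r + 1) * n[r + 1] / n[r]

p_eel = adjusted_count(1) / N           # (2 * 1/3) / 18 = 1/27
print(p_new, p_eel)                     # 0.1666..., 0.0370...
```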

44 Smoothing: Katz Use the Good-Turing estimate for seen n-grams and back off for unseen ones (see the sketch below). Works pretty well, but not good for 1-counts. α is calculated so that the probabilities sum to 1.
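The slide's formula is an image; as a sketch, the standard Katz back-off has the form:

```latex
% Katz back-off (standard form): discounted ML estimate for seen trigrams,
% back off to the bigram model otherwise; \alpha(xy) normalizes, and
% C^{*} is the Good-Turing discounted count.
P_{\text{Katz}}(z \mid xy) =
\begin{cases}
  \dfrac{C^{*}(xyz)}{C(xy)} & \text{if } C(xyz) > 0,\\[1.5ex]
  \alpha(xy)\, P_{\text{Katz}}(z \mid y) & \text{otherwise.}
\end{cases}
```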

45 Smoothing: Absolute Discounting Assume a fixed discount D subtracted from each count. Works pretty well and is easier than Katz, but not so good for 1-counts.

46 Smoothing: Interpolated Absolute Discount Backoff: ignore the bigram if we have a trigram. Interpolated: always combine the bigram and the trigram (see the sketch below).
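As a sketch of the two variants on the last two slides (standard formulations; the slides' own equations are images):

```latex
% Absolute discounting with a fixed discount D.
% Back-off form:
P_{\text{abs}}(z \mid xy) =
\begin{cases}
  \dfrac{C(xyz) - D}{C(xy)} & \text{if } C(xyz) > 0,\\[1.5ex]
  \alpha(xy)\, P_{\text{abs}}(z \mid y) & \text{otherwise.}
\end{cases}
%
% Interpolated form: always mix in the lower-order model, with weight equal
% to the total discounted mass; N_{1+}(xy\,\cdot) = number of distinct words seen after xy.
P_{\text{int}}(z \mid xy) =
  \frac{\max\bigl(C(xyz) - D,\, 0\bigr)}{C(xy)}
  + \frac{D \cdot N_{1+}(xy\,\cdot)}{C(xy)}\; P_{\text{int}}(z \mid y)
```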

47 Smoothing: Interpolated Multiple Absolute Discounts One discount is good; different discounts for different counts are better. Multiple discounts: one for 1-counts, one for 2-counts, one for counts > 2.

48 Smoothing: Kneser-Ney Compare P(Francisco | eggplant) vs. P(stew | eggplant). “Francisco” is common, so backoff and interpolated methods say it is likely – but it only occurs in the context of “San”. “Stew” is common and appears in many contexts. So weight the backoff distribution by the number of contexts the word occurs in.

49 Smoothing: Kneser-Ney Interpolated, with absolute discounting and a modified backoff distribution (sketched below). Consistently the best technique.
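A sketch of interpolated Kneser-Ney in its common bigram form (standard formulation, not taken from the slide):

```latex
% Interpolated Kneser-Ney, shown for the bigram case. The lower-order
% distribution uses continuation counts: N_{1+}(\cdot z) is the number of
% distinct words that precede z, and N_{1+}(\cdot\,\cdot) is their total.
P_{\text{KN}}(z \mid y) =
  \frac{\max\bigl(C(yz) - D,\, 0\bigr)}{C(y)}
  + \frac{D \cdot N_{1+}(y\,\cdot)}{C(y)}\,
    \frac{N_{1+}(\cdot\, z)}{N_{1+}(\cdot\,\cdot)}
```

This is exactly the "modified backoff distribution" idea: a word that appears in many different contexts gets a large continuation probability, regardless of its raw frequency.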

50 Smoothing: Chart

51 Real Overview: Overview (basics: probability, language model definition); Real Overview; Evaluation; Smoothing; → More techniques (Caching, Skipping, Clustering, Sentence-mixture models, Structured language models); Tools

52 Caching If you say something, you are likely to say it again later. Interpolate the trigram with a cache model (sketched below).
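The slide's formula is an image; one standard way to combine a smoothed trigram with a unigram cache, as a sketch:

```latex
% Trigram interpolated with a unigram cache built from the recent history h,
% where C_h(z) is the count of z in h and |h| is the history length.
P_{\text{cache}}(z \mid xy, h)
  = \lambda\, P_{\text{smooth}}(z \mid xy)
  + (1 - \lambda)\, \frac{C_h(z)}{|h|}
```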

53 Caching: Real Life Someone says “I swear to tell the truth”; the system hears “I swerve to smell the soup”. The cache remembers! The person then says “The whole truth”, and, with the cache, the system hears “The whole soup” – errors are locked in. Caching works well when users correct as they go; it works poorly, or even hurts, without correction.

54 Caching: Variations N-gram caches: cache bigrams and trigrams, not just single words. Conditional n-gram cache: use the n-gram cache only if the context xy appears in the history. Remove function words like “the” and “to” from the cache.

55 5-grams Why stop at 3-grams? If P(z|…rstuvwxy) ≈ P(z|xy) is good, then P(z|…rstuvwxy) ≈ P(z|vwxy) is better! It is very important to smooth well: interpolated Kneser-Ney works much better than Katz on 5-grams, more so than on 3-grams.

56 N-gram versus smoothing algorithm

57 Speech recognizer mechanics Keep many hypotheses alive; find acoustic and language model scores: P(acoustics | truth) = .3, P(truth | tell the) = .1; P(acoustics | soup) = .2, P(soup | smell the) = .01. Hypothesis scores: “…tell the” (.01), “…smell the” (.01), “…tell the truth” (.01 × .3 × .1), “…smell the soup” (.01 × .2 × .01).

58 Speech recognizer slowdowns The speech recognizer uses tricks (dynamic programming) to merge hypotheses. With a trigram, only the last two words matter, so hypotheses like “…tell the” and “…smell the” are kept. With a five-gram, many more distinct histories must be kept: “…swear to tell the”, “…swerve to smell the”, “swear too tell the”, “swerve too smell the”, “swerve to tell the”, “swerve too tell the”, …

59 Speech recognizer vs. n-gram The recognizer can threshold out bad hypotheses. A trigram works so much better than a bigram that thresholding improves and there is no slow-down; 4-grams and 5-grams start to become expensive.

60 Real Overview: Overview (basics: probability, language model definition); Real Overview; Evaluation; Smoothing; → More techniques (Caching, Skipping, Clustering, Sentence-mixture models, Structured language models); Tools

61 Skipping P(z|…rstuvwxy) ≈ P(z|vwxy). Why not P(z|v_xy) – a “skipping” n-gram that skips the value of the 3-back word? Example: P(time | show John a good) -> P(time | show ____ a good). Combine them: P(z|…rstuvwxy) ≈ λ P(z|vwxy) + μ P(z|vw_y) + (1 − λ − μ) P(z|v_xy).

62 Real Overview: Overview (basics: probability, language model definition); Real Overview; Evaluation; Smoothing; → More techniques (Caching, Skipping, Clustering, Sentence-mixture models, Structured language models); Tools

63 Clustering CLUSTERING = CLASSES (same thing). What is P(Tuesday | party on)? It is similar to P(Monday | party on), and similar to P(Tuesday | celebration on). Put words in clusters: WEEKDAY = Sunday, Monday, Tuesday, …; EVENT = party, celebration, birthday, …

64 Clustering overview Major topic, useful in many fields Kinds of clustering –Predictive clustering –Conditional clustering –IBM-style clustering How to get clusters –Be clever or it takes forever!

65 Predictive clustering Let z be a word and Z be its cluster. One cluster per word: hard clustering, e.g. WEEKDAY = Sunday, Monday, Tuesday, …; MONTH = January, February, April, May, June, … Then P(z|xy) = P(Z|xy) × P(z|xyZ), e.g. P(Tuesday | party on) = P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY). Smoothed version: P_smooth(z|xy) ≈ P_smooth(Z|xy) × P_smooth(z|xyZ).

66 Predictive clustering example Find P(Tuesday | party on) ≈ P_smooth(WEEKDAY | party on) × P_smooth(Tuesday | party on WEEKDAY). Suppose C(party on Tuesday) = 0, C(party on Wednesday) = 10, C(arriving on Tuesday) = 10, C(on Tuesday) = 100. Then P_smooth(WEEKDAY | party on) is high, and P_smooth(Tuesday | party on WEEKDAY) backs off to P_smooth(Tuesday | on WEEKDAY).

67 Conditional clustering P(z|xy) = P(z|xXyY), e.g. P(Tuesday | party on) = P(Tuesday | party EVENT on PREPOSITION). Smoothed version: P_smooth(z|xy) ≈ P_smooth(z|xXyY) ≈ λ P_ML(Tuesday | party EVENT on PREPOSITION) + μ P_ML(Tuesday | EVENT on PREPOSITION) + ν P_ML(Tuesday | on PREPOSITION) + κ P_ML(Tuesday | PREPOSITION) + (1 − λ − μ − ν − κ) P_ML(Tuesday).

68 Conditional clustering example Clustered backoff chain: λ P(Tuesday | party EVENT on PREPOSITION) + μ P(Tuesday | EVENT on PREPOSITION) + ν P(Tuesday | on PREPOSITION) + κ P(Tuesday | PREPOSITION) + (1 − λ − μ − ν − κ) P(Tuesday). Compare with the chain that drops words instead of clusters: λ P(Tuesday | party on) + μ P(Tuesday | EVENT on) + ν P(Tuesday | on) + κ P(Tuesday | PREPOSITION) + (1 − λ − μ − ν − κ) P(Tuesday).

69 Combined clustering P(z|xy) ≈ P_smooth(Z|xXyY) × P_smooth(z|xXyYZ), e.g. P(Tuesday | party on) ≈ P_smooth(WEEKDAY | party EVENT on PREPOSITION) × P_smooth(Tuesday | party EVENT on PREPOSITION WEEKDAY). A much larger model than the unclustered one, with somewhat lower perplexity.

70 IBM Clustering P(z|xy) ≈ P_smooth(Z|XY) × P(z|Z), e.g. P(WEEKDAY | EVENT PREPOSITION) × P(Tuesday | WEEKDAY). Small, very smooth, mediocre perplexity. Interpolated version: P(z|xy) ≈ λ P_smooth(z|xy) + (1 − λ) P_smooth(Z|XY) × P(z|Z). Bigger, better than no clusters, better than combined clustering. Improvement: use P(z|XYZ) instead of P(z|Z).

71 Clustering by Position “A” and “AN”: same cluster or different cluster? Same cluster for predictive clustering Different clusters for conditional clustering Small improvement by using different clusters for conditional and predictive

72 Clustering: how to get them Build them by hand –Works ok when almost no data Part of Speech (POS) tags –Tends not to work as well as automatic Automatic Clustering –Swap words between clusters to minimize perplexity

73 Clustering: automatic Minimize perplexity of P(z|Y) Mathematical tricks speed it up Use top-down splitting, not bottom up merging!

74 Real Overview: Overview (basics: probability, language model definition); Real Overview; Evaluation; Smoothing; → More techniques (Caching, Skipping, Clustering, Sentence-mixture models, Structured language models); Tools

75 Sentence Mixture Models Lots of different sentence types: Numbers (“The Dow rose one hundred seventy three points”); Quotations (“Officials said quote we deny all wrongdoing quote”); Mergers (“AOL and Time Warner, in an attempt to control the media and the internet, will merge”). Model each sentence type separately.

76 Sentence Mixture Models Roll a die to pick the sentence type s_k, with probability λ_k. Compute the probability of the sentence given s_k, then the probability of the sentence across types (see the sketch below).
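The slide's equations are images; a standard way to write them, with λ_k as the mixture weights, is:

```latex
% Per-type sentence probability (a trigram model trained on sentence type s_k),
% and the mixture over types (type 0 is the overall model, per the next slide).
P(w_1 \dots w_n \mid s_k) = \prod_{i=1}^{n} P(w_i \mid w_{i-2}\, w_{i-1},\, s_k),
\qquad
P(w_1 \dots w_n) = \sum_{k=0}^{K} \lambda_k\, P(w_1 \dots w_n \mid s_k)
```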

77 Sentence Model Smoothing Each topic model is smoothed with overall model. Sentence mixture model is smoothed with overall model (sentence type 0).

78 Sentence Mixture Results

79 Sentence Clustering Same algorithm as word clustering: assign each sentence to a type s_k and minimize the perplexity of P(z|s_k) instead of P(z|Y).

80 Real Overview: Overview (basics: probability, language model definition); Real Overview; Evaluation; Smoothing; → More techniques (Caching, Skipping, Clustering, Sentence-mixture models, Structured language models); Tools

81 Structured Language Model “The contract ended with a loss of 7 cents after”

82 How to get structured data? Use a treebank (a collection of sentences with hand-annotated structure), like the Wall Street Journal portion of the Penn Treebank. Problem: you need a treebank. Or use a treebank (WSJ) to train a parser, then parse new training data (e.g. Broadcast News) and re-estimate parameters to get lower-perplexity models.

83 Structured Language Models Use the structure of language to capture long-distance information. Promising results, but time-consuming. Replacement: 5-grams and skipping capture similar information.

84 Real Overview: Overview (basics: probability, language model definition); Real Overview; Evaluation; Smoothing; More techniques (Caching, Skipping, Clustering, Sentence-mixture models, Structured language models); → Tools

85 Tools: CMU Language Modeling Toolkit Can handle bigrams, trigrams, and more, with different smoothing schemes. Many separate tools – the output of one tool is the input to the next: easy to use. Free for research purposes.

86 Tools: SRI Language Modeling Toolkit More powerful than the CMU toolkit: can handle clusters, lattices, n-best lists, and hidden tags. Free for research use.

87 Tools: IRSTLM Toolkit Friendlier copyright terms. Recommended by the standard SMT package, Moses.

88 Tools: Text normalization What about “$3,100,000”? Convert it to “three million one hundred thousand dollars”, etc. You need to do this for dates, numbers, and maybe abbreviations. Some text-normalization tools come with the Wall Street Journal corpus from the LDC (Linguistic Data Consortium). Not much is available – write your own (use Perl!).

89 Small enough Real language models are often huge; 5-gram models are typically larger than the training data (consider Google’s web language model). Use count cutoffs (eliminate parameters with few counts) or, better, use Stolcke pruning – it finds the n-grams that contribute least to perplexity reduction, e.g. P(City | New York) ≈ P(City | York), P(Friday | God it’s) ≈ P(Friday | it’s). Remember, Kneser-Ney helped most when there were lots of 1-counts.
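A minimal Python sketch of the count-cutoff idea (illustrative only; toolkits such as SRILM implement cutoffs and Stolcke pruning for you):

```python
def apply_count_cutoff(ngram_counts, cutoff=1):
    """Drop n-grams whose count is <= cutoff; their probability mass is then
    handled by the model's back-off/smoothing machinery instead of explicit parameters."""
    return {ng: c for ng, c in ngram_counts.items() if c > cutoff}

# Toy usage: only the frequent trigram survives the cutoff.
counts = {("nothing", "but", "the"): 12, ("nuts", "sing", "on"): 1}
print(apply_count_cutoff(counts, cutoff=1))
```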

90 Some Experiments Goodman re-implemented all of the techniques, trained on 260,000,000 words of WSJ, optimized parameters on held-out data, and tested on a separate test section. Some combinations are extremely time-consuming (days of CPU time) – don’t try this at home, or in anything you want to ship. N-best lists were rescored to get results; the maximum possible improvement from rescoring is from a 10% word error rate to 5% absolute.

91 Overall Results: Perplexity

92 Overall Results: Word Accuracy

93 Conclusions Use trigram models. Use any reasonable smoothing algorithm (Katz, Kneser-Ney). Use caching if you have correction information; clustering, sentence mixtures, and skipping are usually not worth the effort.

94 References Joshua Goodman’s web page (smoothing, introduction, more): contains the smoothing technical report – a good introduction to smoothing with lots of details – and the journal-paper version of this talk with updated results. Books (all are OK, none focus on language models): Speech and Language Processing by Dan Jurafsky and Jim Martin (especially Chapter 6); Foundations of Statistical Natural Language Processing by Chris Manning and Hinrich Schütze; Statistical Methods for Speech Recognition by Frederick Jelinek.

95 References Structured language models: Ciprian Chelba’s web page. Maximum entropy: Roni Rosenfeld’s home page and thesis. Stolcke pruning: A. Stolcke (1998), “Entropy-based pruning of backoff language models,” Proc. DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA (note: get the corrected version).

96 References: Further Reading “An Empirical Study of Smoothing Techniques for Language Modeling.” Stanley Chen and Joshua Goodman, Harvard Computer Science technical report. (Gives a very thorough evaluation and description of a number of methods.) “On the Convergence Rate of Good-Turing Estimators.” David McAllester and Robert E. Schapire, in Proceedings of COLT. (A pretty technical paper, giving confidence intervals on Good-Turing estimators. Theorems 1, 3 and 9 are useful in understanding the motivation for Good-Turing discounting.)