
1 Statistical NLP Spring 2011 Lecture 3: Language Models II Dan Klein – UC Berkeley

2 Smoothing  We often want to make estimates from sparse statistics:  Smoothing flattens spiky distributions so they generalize better  Very important all over NLP, but easy to do badly!  We’ll illustrate with bigrams today (h = previous word, could be anything).
(Figure: two count histograms for P(w | denied the). Observed: allegations 3, reports 2, claims 1, request 1, 7 total. Smoothed: allegations 2.5, reports 1.5, claims 0.5, request 0.5, other 2, 7 total.)
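The smoothed counts in the figure match what absolute discounting with d = 0.5 would produce: each of the four seen types gives up 0.5 counts, and the freed mass (2.0) goes to the unseen words. A minimal Python sketch (mine, not the lecture's code) reproducing those numbers:

```python
# Reproduce the slide's smoothed distribution by absolute discounting with d = 0.5.
counts = {"allegations": 3, "reports": 2, "claims": 1, "request": 1}
d = 0.5
total = sum(counts.values())                      # 7 observed tokens
smoothed = {w: c - d for w, c in counts.items()}  # 2.5, 1.5, 0.5, 0.5
smoothed["other"] = d * len(counts)               # 4 seen types * 0.5 = 2.0 reserved for unseen words
probs = {w: c / total for w, c in smoothed.items()}
assert abs(sum(probs.values()) - 1.0) < 1e-9      # still a proper distribution
```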

3 Kneser-Ney  Kneser-Ney smoothing combines two ideas:  Absolute discounting  Lower-order continuation probabilities  KN smoothing has repeatedly been proven effective  Why should things work like this?
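For concreteness, a minimal sketch of interpolated Kneser-Ney for bigrams (my own illustration, not the lecture's code; the function name and discount d = 0.75 are assumptions): the discounted bigram estimate is interpolated with a continuation unigram that asks in how many distinct contexts a word has appeared.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    """Interpolated Kneser-Ney for bigrams: absolute discounting plus a
    lower-order 'continuation' unigram (how many contexts each word follows)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_total = defaultdict(int)   # c(h)
    context_types = defaultdict(int)   # number of distinct words seen after h
    continuation = defaultdict(set)    # contexts observed before each word
    for (h, w), c in bigrams.items():
        context_total[h] += c
        context_types[h] += 1
        continuation[w].add(h)
    num_bigram_types = len(bigrams)

    def prob(w, h):
        p_cont = len(continuation.get(w, ())) / num_bigram_types
        c_h = context_total.get(h, 0)
        if c_h == 0:
            return p_cont                     # unseen context: continuation unigram only
        lam = d * context_types[h] / c_h      # mass freed by discounting
        return max(bigrams.get((h, w), 0) - d, 0) / c_h + lam * p_cont

    return prob
```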

4 Predictive Distributions  Parameter estimation: from the corpus "a b c a", the estimate is θ = P(w) = [a: 0.5, b: 0.25, c: 0.25]  With parameter variable: treat the parameter as a random variable Θ with the corpus as evidence  Predictive distribution: predict the next word W by averaging over Θ
(Figure: three graphical models over the corpus "a b c a", the second adding the parameter node Θ and the third adding the next-word node W.)
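The predictive distribution integrates out Θ; under a symmetric Dirichlet prior (my assumption for illustration, the slide only shows the graphical models) it has a simple closed form, add-alpha on the counts:

```python
from collections import Counter

def dirichlet_predictive(tokens, vocab, alpha=1.0):
    """Posterior predictive P(next word = w | data) with a symmetric
    Dirichlet(alpha) prior on the multinomial parameter theta:
        P(w | data) = (count(w) + alpha) / (n + alpha * |V|)."""
    counts = Counter(tokens)
    denom = len(tokens) + alpha * len(vocab)
    return {w: (counts.get(w, 0) + alpha) / denom for w in vocab}

# The slide's corpus "a b c a": the point estimate is [a: 0.5, b: 0.25, c: 0.25],
# while the predictive distribution with alpha = 1 is [a: 3/7, b: 2/7, c: 2/7].
print(dirichlet_predictive(list("abca"), vocab="abc"))
```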

5 “Chinese Restaurant” Processes [Teh, 06, diagrams from Teh] (Figures: restaurant seating diagrams for the Dirichlet Process and the Pitman-Yor Process.)
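To give a feel for the process behind the diagrams, a small sampler sketch (mine, not from the lecture; the parameter names are illustrative). Each customer joins an existing table in proportion to its size minus a discount, or starts a new table; with discount = 0 this is the Dirichlet Process, otherwise the Pitman-Yor process, whose power-law table sizes resemble word frequencies.

```python
import random

def pitman_yor_crp(n_customers, discount=0.5, strength=1.0, seed=0):
    """Seat customers by the Pitman-Yor Chinese Restaurant Process: customer i
    joins existing table k with prob (c_k - discount) / (i + strength) and opens
    a new table with prob (strength + discount * K) / (i + strength)."""
    rng = random.Random(seed)
    tables = []  # number of customers at each table
    for i in range(n_customers):
        k = len(tables)
        weights = [c - discount for c in tables] + [strength + discount * k]
        r = rng.random() * (i + strength)     # the weights sum to i + strength
        for j, w in enumerate(weights):
            r -= w
            if r < 0:
                break
        if j == len(tables):
            tables.append(1)                  # open a new table
        else:
            tables[j] += 1                    # join table j
    return tables

print(sorted(pitman_yor_crp(1000), reverse=True)[:10])  # a few big tables, many tiny ones
```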

6 Hierarchical Models [MacKay and Peto, 94, Teh 06] (Figure: a hierarchy of distributions in which each context-specific Θ (Θa, Θb, …) is drawn from a shared parent Θ0, and the observed words are drawn from the context-specific distributions.)
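One simple instance of the hierarchical idea, in the spirit of MacKay and Peto (the specific form below is my sketch, not the lecture's): give each context's distribution a Dirichlet prior whose mean is the shared parent (unigram) distribution, so context estimates shrink toward the parent when data is scarce.

```python
from collections import Counter

def hierarchical_bigram(tokens, alpha=10.0):
    """Sketch of a two-level hierarchical model: each context h has a Dirichlet
    prior centered on the shared parent (unigram) distribution P0, giving
        P(w | h) = (c(h, w) + alpha * P0(w)) / (c(h) + alpha)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_total = Counter()
    for (h, _), c in bigrams.items():
        context_total[h] += c
    total = sum(unigrams.values())

    def prob(w, h):
        p0 = unigrams.get(w, 0) / total
        return (bigrams.get((h, w), 0) + alpha * p0) / (context_total[h] + alpha)

    return prob
```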

7 What Actually Works?  Trigrams and beyond:  Unigrams, bigrams generally useless  Trigrams much better (when there’s enough data)  4-, 5-grams really useful in MT, but not so much for speech  Discounting  Absolute discounting, Good-Turing, held-out estimation, Witten-Bell  Context counting  Kneser-Ney construction of lower-order models  See [Chen+Goodman] reading for tons of graphs! [Graphs from Joshua Goodman]

8 Data >> Method?  Having more data is better…  … but so is using a better estimator  Another issue: N > 3 has huge costs in speech and MT decoders

9 Tons of Data? [Brants et al, 2007]

10 Large Scale Methods  Language models get big, fast  English Gigaword corpus: 2G tokens, 0.3G trigrams, 1.2G 5-grams  Google N-grams: 13M unigrams, 0.3G bigrams, ~1G 3-, 4-, 5-grams  Need to access entries very often, ideally in memory  What do you do when language models get too big?  Distributing LMs across machines  Quantizing probabilities  Random hashing (e.g. Bloom filters) [Talbot and Osborne 07]
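Of the tricks listed, quantizing probabilities is easy to sketch (the uniform 8-bit binning below is my illustration; the lecture does not commit to a scheme): store each log-probability as a one-byte code into a shared codebook instead of a 4- or 8-byte float.

```python
import numpy as np

def quantize_logprobs(logprobs, bits=8):
    """Quantize float log-probabilities into small integer codes plus a codebook
    (uniform binning over the observed range)."""
    lo, hi = float(np.min(logprobs)), float(np.max(logprobs))
    levels = 2 ** bits
    codebook = np.linspace(lo, hi, levels)                               # code -> representative value
    codes = np.round((logprobs - lo) / (hi - lo) * (levels - 1)).astype(np.uint8)
    return codes, codebook

logprobs = np.log(np.random.dirichlet(np.ones(1000)))                    # stand-in LM scores
codes, codebook = quantize_logprobs(logprobs)
print(np.max(np.abs(codebook[codes] - logprobs)))                        # worst-case quantization error
```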

11 A Simple Java Hashmap? Per 3-gram: 1 pointer = 8 bytes; 1 Map.Entry = 8 bytes (obj) + 3x8 bytes (pointers); 1 Double = 8 bytes (obj) + 8 bytes (double); 1 String[] = 8 bytes (obj) + 3x8 bytes (pointers) … and that is at best, assuming the Strings themselves are canonicalized (shared). Total: > 88 bytes per 3-gram. Obvious alternatives: sorted arrays, open addressing.
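As a sketch of the sorted-array alternative (in Python rather than Java, and with a 20-bit word-id packing that is my assumption): encode each trigram of word ids as a single integer key, keep the keys sorted, and look values up by binary search, so each entry costs one key and one value rather than a web of objects and pointers.

```python
import bisect

class SortedArrayTrigramStore:
    """Sorted-array n-gram store: one packed integer key per trigram, values
    aligned by index, lookup by binary search."""

    def __init__(self, trigram_values):
        # trigram_values: dict mapping (id1, id2, id3) -> score
        items = sorted((self._encode(t), v) for t, v in trigram_values.items())
        self.keys = [k for k, _ in items]
        self.values = [v for _, v in items]

    @staticmethod
    def _encode(trigram):
        a, b, c = trigram
        return (a << 40) | (b << 20) | c      # pack three 20-bit word ids (vocab up to ~1M)

    def get(self, trigram, default=None):
        key = self._encode(trigram)
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return default

store = SortedArrayTrigramStore({(1, 2, 3): -0.5, (1, 2, 4): -1.2})
print(store.get((1, 2, 4)))   # -1.2
```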

12 Word+Context Encodings

13

14 Compression

15 Memory Requirements

16 Speed and Caching Full LM

17 LM Interfaces

18 Approximate LMs  Simplest option: hash-and-hope  Array of size K ~ N  (optional) store hash of keys  Store values in direct-address or open addressing  Collisions: store the max  What kind of errors can there be?  More complex options, like Bloom filters (originally for membership, but see Talbot and Osborne 07), perfect hashing, etc.
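A minimal hash-and-hope sketch following the bullets above (my illustration; the array size, fingerprinting, and hash choice are assumptions). It also answers the question about errors: an n-gram that loses a collision looks unseen, and an unseen n-gram whose slot and fingerprint happen to match a stored entry gets a false, possibly inflated, score.

```python
import zlib

class HashAndHopeLM:
    """Hash each n-gram into a fixed-size array, keep only a value and a short
    key fingerprint, and resolve collisions by storing the max value."""

    def __init__(self, size):
        self.size = size
        self.values = [None] * size
        self.fingerprints = [None] * size

    def _slot_and_fp(self, ngram):
        h = zlib.crc32(" ".join(ngram).encode("utf8"))
        return h % self.size, h >> 20          # array slot and a short key fingerprint

    def put(self, ngram, value):
        slot, fp = self._slot_and_fp(ngram)
        if self.values[slot] is None or value > self.values[slot]:
            self.values[slot] = value           # on collision, keep the max
            self.fingerprints[slot] = fp

    def get(self, ngram, default=None):
        slot, fp = self._slot_and_fp(ngram)
        if self.values[slot] is not None and self.fingerprints[slot] == fp:
            return self.values[slot]
        return default

lm = HashAndHopeLM(size=1 << 20)
lm.put(("denied", "the", "allegations"), -0.5)
print(lm.get(("denied", "the", "allegations")))   # -0.5
```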

19 Beyond N-Gram LMs  Lots of ideas we won’t have time to discuss:  Caching models: recent words more likely to appear again  Trigger models: recent words trigger other words  Topic models  A few other classes of ideas  Syntactic models: use tree models to capture long-distance syntactic effects [Chelba and Jelinek, 98]  Discriminative models: set n-gram weights to improve final task accuracy rather than fit training set density [Roark, 05, for ASR; Liang et al., 06, for MT]  Structural zeros: some n-grams are syntactically forbidden, keep estimates at zero if they look like real zeros [Mohri and Roark, 06]  Bayesian document and IR models [Daume 06]

