
1 Collocations

2 Definition of Collocation (wrt Corpus Literature) A collocation is defined as a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. [Choueka, 1988]

3 Word Collocations Collocation –Firth: "a word is characterized by the company it keeps"; collocations of a given word are statements of the habitual or customary places of that word. –Non-compositionality: the meaning cannot be derived directly from the meaning of the parts (heavy rain) –Non-substitutability in context for the parts (make a decision) –Non-modifiability (and non-transformability): *kick the yellow bucket; *take exceptions to

4 Collocations Collocations are not necessarily adjacent Collocations cannot be directly translated into other languages.

5 Example Classes Names Technical Terms “Light” Verb Constructions Phrasal verbs Noun Phrases

6 Linguistic Subclasses of Collocations Light verbs: verbs with little semantic content, like make, take, do Terminological expressions: concepts and objects in technical domains (e.g., hard drive) Idioms: fixed phrases (kick the bucket, birds-of-a-feather, run for office) Proper names: difficult to recognize even with lists (Tuesday (person's name), May, Winston Churchill, IBM, Inc.) Numerical expressions –containing "ordinary" words: Monday Oct 04 1999, two thousand seven hundred fifty Verb–particle constructions or phrasal verbs –separable parts: look up, take off, tell off

7 Collocation Detection Techniques Selection of collocations by frequency Selection of collocations based on the mean and variance of the distance between focal word and collocating word Hypothesis testing Pointwise mutual information

8 Frequency Technique: –Count the number of times each bigram occurs –Extract the top counts and report them as candidates Results: –Corpus: New York Times, August–November 1990 –Extremely uninteresting: the most frequent bigrams are dominated by function-word pairs (of the, in the, ...)
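A minimal Python sketch of this counting step (the tiny corpus and cutoff below are invented for illustration, not the New York Times setup behind the reported results):

```python
from collections import Counter

def top_bigrams_by_frequency(tokens, k=5):
    """Count adjacent word pairs and return the k most frequent ones."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return bigram_counts.most_common(k)

tokens = "the price of the stock rose as the price of oil fell".split()
print(top_bigrams_by_frequency(tokens))
```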

9 Frequency with Tag Filters Technique: –Count the number of times each bigram occurs –Tag the candidates for part of speech –Pass all candidates through a POS filter, keeping only those that match a pattern (e.g., adjective–noun, noun–noun) –Extract the top counts and report them as candidates
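A sketch of the same counting step with a POS filter added; the input is assumed to be already tagged as (word, tag) pairs with Penn-style tags, and the pattern set below is only an illustrative subset of such a filter:

```python
from collections import Counter

# Illustrative subset of kept tag patterns (adjective-noun, noun-noun, ...)
KEEP_PATTERNS = {("JJ", "NN"), ("NN", "NN"), ("JJ", "NNS"), ("NN", "NNS")}

def filtered_bigrams(tagged_tokens, k=5):
    """Count adjacent word pairs whose tag pair matches the filter."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in KEEP_PATTERNS:
            counts[(w1, w2)] += 1
    return counts.most_common(k)

tagged = [("strong", "JJ"), ("tea", "NN"), ("is", "VBZ"), ("really", "RB"),
          ("strong", "JJ"), ("tea", "NN")]
print(filtered_bigrams(tagged))   # [(('strong', 'tea'), 2)]
```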

10 Frequency with Tag Filters Results

11 Mean and Variance (Smadja et al., 1993) Frequency-based search works well for fixed phrases. However, many collocations consist of two words in more flexible (although regular) relationships. For example, –Knock and door may not occur at a fixed distance from each other One method of detecting these flexible relationships uses the mean and variance of the offset (signed distance) between the two words in the corpus.

12 Mean, Sample Variance, and Standard Deviation For offsets d1, ..., dn: mean = (1/n) Σi di; sample variance s² = Σi (di − mean)² / (n − 1); standard deviation s = √s²

13 Example: Knock and Door 1. She knocked on his door. 2. They knocked at the door. 3. 100 women knocked on the big red door. 4. A man knocked on the metal front door. Average offset between knock and door: (3 + 3 + 5 + 5)/4 = 4 Sample variance: ((3−4)² + (3−4)² + (5−4)² + (5−4)²)/(4−1) = 4/3 ≈ 1.33; standard deviation ≈ 1.15

14 Mean and Variance Technique (bigram at a distance) –Produce all possible word pairs within a window and treat them as candidates –Record the offset (signed distance) of one word from the other –Count the number of times each candidate occurs Measures: –Mean: average offset (possibly negative); indicates the typical positional relationship between the two words –Standard deviation σ(offset): variability in the relative position of the two words; a low value means the pair tends to occur at a fixed distance
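A sketch of collecting offsets within a window and computing their mean and sample standard deviation, reusing the knock/door sentences from the previous slide (window size is an assumption):

```python
from math import sqrt

def offsets(tokens, w1, w2, window=5):
    """Signed distances (position of w2 minus position of w1) within a window."""
    found = []
    for i, tok in enumerate(tokens):
        if tok != w1:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        found += [j - i for j in range(lo, hi) if j != i and tokens[j] == w2]
    return found

sentences = ["she knocked on his door",
             "they knocked at the door",
             "100 women knocked on the big red door",
             "a man knocked on the metal front door"]
d = []
for s in sentences:
    d += offsets(s.split(), "knocked", "door")

mean = sum(d) / len(d)
sd = sqrt(sum((x - mean) ** 2 for x in d) / (len(d) - 1))
print(d, mean, round(sd, 2))   # [3, 3, 5, 5] 4.0 1.15
```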

15 Mean and Variance Illustration Candidate Generation example: –Window: 3 Used to find collocations with long- distance relationships

16 Mean and Variance Collocations

17 Hypothesis Testing: Overview Two (or more) words co-occur a lot Is a candidate a true collocation, or a (not-at-all-interesting) phantom?

18 The t test Intuition Intuition: –Compute how often the pair would co-occur by chance and check whether the observed count is significantly higher –Conceptually: over all possible orderings (permutations) of the words in the corpus, how often would the pair co-occur, and is the observed count significantly higher than that? Assumptions: –H0 is the null hypothesis (the words occur independently): P(w1, w2) = P(w1) P(w2) –The counts are approximately normally distributed

19 The t test Formula t = (x̄ − μ) / √(s²/N) Measures: –x̄ = observed mean = bigram count / N (the MLE bigram probability) –μ = expected mean under H0 = P(w1) P(w2) –s² ≈ x̄(1 − x̄) ≈ x̄, since p(1 − p) ≈ p for small p –N = total number of bigrams Result: –A number to look up in a t table –Gives the confidence (α) with which one can reject H0, i.e., that the collocation is not created by chance
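A sketch of the t score computed from counts, under the slide's approximation s² ≈ x̄; the counts below roughly follow the familiar "new companies" textbook example and should be treated as illustrative numbers:

```python
from math import sqrt

def t_score(c1, c2, c12, N):
    """t statistic for the bigram w1 w2.
    c1, c2: unigram counts of w1 and w2; c12: bigram count; N: total bigrams."""
    x_bar = c12 / N                    # observed bigram probability
    mu = (c1 / N) * (c2 / N)           # expected probability under H0
    s2 = x_bar                         # p(1 - p) ~ p for small p
    return (x_bar - mu) / sqrt(s2 / N)

# c(new) = 15828, c(companies) = 4675, c(new companies) = 8, N ~ 14.3M
print(round(t_score(15828, 4675, 8, 14_307_668), 4))   # ~1.0, not significant
```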

20 The t test Sample Findings

21 The t test Criticism Words are not normally distributed –Can reject valid collocations Not good on sparse data

22 χ² Intuition Pearson's chi-square test Intuition: –Compare observed frequencies to the frequencies expected under independence Assumptions: –Does not assume normally distributed probabilities; appropriate as long as the sample (the expected counts) is not too small

23 χ² General Formula χ² = Σij (Oij − Eij)² / Eij Measures: –Eij = expected count for cell (i, j) under independence –Oij = observed count for cell (i, j) Result: –A number to look up in a table (like the t test) –Gives the confidence (α) with which H0 can be rejected

24 χ² Bigram Method and Formula Technique for bigrams: –Arrange the bigram counts in a 2×2 contingency table: one dimension for w1 vs. ¬w1, the other for w2 vs. ¬w2, with Oij the observed count in the corresponding cell –For a 2×2 table the statistic reduces to: χ² = N (O11 O22 − O12 O21)² / ((O11 + O12)(O11 + O21)(O12 + O22)(O21 + O22))
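A sketch of the 2×2 computation, reusing the same illustrative counts as the t-test example above:

```python
def chi_square_2x2(c12, c1, c2, N):
    """Pearson chi-square for bigram w1 w2 from its 2x2 contingency table."""
    o11 = c12                 # w1 w2
    o12 = c1 - c12            # w1 followed by something other than w2
    o21 = c2 - c12            # w2 preceded by something other than w1
    o22 = N - c1 - c2 + c12   # neither w1 first nor w2 second
    numerator = N * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

print(round(chi_square_2x2(8, 15828, 4675, 14_307_668), 2))   # ~1.55
```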

25 χ² Sample Findings Comparing corpora: –Machine translation: comparing (English) "cow" and (French) "vache" across aligned corpora gives χ² = 456400, strong evidence that they are translation pairs –Measuring the similarity of two corpora

26 χ² Criticism Not good for small datasets

27 Likelihood Ratios Within a Single Corpus (Dunning, 1993) Likelihood ratios are more appropriate for sparse data than the chi-square test, and they are easier to interpret than the chi-square statistic. In applying the likelihood ratio test to collocation discovery, we examine two alternative explanations for the occurrence frequency of a bigram w1 w2: –H1: The occurrence of w2 is independent of the previous occurrence of w1: P(w2 | w1) = P(w2 | ¬w1) = p –H2: The occurrence of w2 is dependent on the previous occurrence of w1: p1 = P(w2 | w1) ≠ P(w2 | ¬w1) = p2

28 Likelihood Ratios Within a Single Corpus Use MLE probabilities for p, p1, and p2 and assume the binomial distribution b(k; n, x): –Under H1: p = P(w2 | w1) = P(w2 | ¬w1) = c2/N –Under H2: p1 = P(w2 | w1) = c12/c1, p2 = P(w2 | ¬w1) = (c2 − c12)/(N − c1) –Under H1: b(c12; c1, p) is the likelihood that c12 of the c1 bigrams starting with w1 are w1 w2, and b(c2 − c12; N − c1, p) is the likelihood that c2 − c12 of the N − c1 bigrams not starting with w1 are ¬w1 w2 –Under H2: the same two terms, with p1 and p2 in place of p: b(c12; c1, p1) and b(c2 − c12; N − c1, p2)

29 Likelihood Ratios Within a Single Corpus The likelihood of H1 (independence): –L(H1) = b(c12; c1, p) · b(c2 − c12; N − c1, p) The likelihood of H2 (dependence): –L(H2) = b(c12; c1, p1) · b(c2 − c12; N − c1, p2) The log likelihood ratio: –log λ = log [L(H1)/L(H2)] = log b(c12; c1, p) + log b(c2 − c12; N − c1, p) − log b(c12; c1, p1) − log b(c2 − c12; N − c1, p2) The quantity −2 log λ is asymptotically χ²-distributed, so we can test for significance.
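A sketch of Dunning's statistic computed directly from the formulas above; the binomial coefficients cancel in the ratio, so only the x^k (1 − x)^(n−k) terms are kept, and the counts are invented for illustration:

```python
from math import log

def log_l(k, n, x):
    """log of x^k * (1 - x)^(n - k), treating 0 * log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * log(x)
    if n - k > 0:
        total += (n - k) * log(1 - x)
    return total

def neg2_log_lambda(c1, c2, c12, N):
    """-2 log lambda for the bigram w1 w2 (higher = stronger association)."""
    p = c2 / N
    p1 = c12 / c1
    p2 = (c2 - c12) / (N - c1)
    log_lambda = (log_l(c12, c1, p) + log_l(c2 - c12, N - c1, p)
                  - log_l(c12, c1, p1) - log_l(c2 - c12, N - c1, p2))
    return -2 * log_lambda

print(round(neg2_log_lambda(2000, 1000, 150, 1_000_000), 1))
```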

30 [Pointwise] Mutual Information (I) Intuition: –Given a candidate pair (w1, w2) and an observation of w1 –I(w1; w2) indicates how much more likely we are to see w2 –The same measure also works in reverse (observe w2) Assumptions: –Data is not sparse

31 Mutual Information Formula I(w1; w2) = log2 [ P(w1 w2) / (P(w1) P(w2)) ] = log2 [ P(w2 | w1) / P(w2) ] Measures: –P(w1) = unigram probability –P(w1 w2) = bigram probability –P(w2 | w1) = probability of w2 given that we have just seen w1 Result: –A number indicating how much more confident we are of seeing w2 after w1, compared to its base rate
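A sketch of the PMI computation from counts; the numbers loosely follow the textbook "Ayatollah Ruhollah" example and are for illustration only:

```python
from math import log2

def pmi(c1, c2, c12, N):
    """I(w1; w2) = log2 [ P(w1 w2) / (P(w1) P(w2)) ]."""
    return log2((c12 / N) / ((c1 / N) * (c2 / N)))

# c(w1) = 42, c(w2) = 20, c(w1 w2) = 20, N ~ 14.3M
print(round(pmi(42, 20, 20, 14_307_668), 2))   # ~18.4 bits
```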

32 Mutual Information Criticism A better measure of the independence of two words than of the dependence of one word on another Performs badly on sparse data: low-count pairs receive inflated scores

33 Applications Collocations are useful in: –Comparison of Corpora –Parsing –New Topic Detection –Computational Lexicography –Natural Language Generation –Machine Translation

34 Comparison of Corpora Compare corpora to support: –Document clustering (for information retrieval) –Plagiarism detection Comparison technique: –Competing hypotheses: the documents are dependent vs. the documents are independent –Compare the hypotheses using a test statistic (χ², likelihood ratio, etc.)

35 Parsing When parsing, we may get more accurate results by treating a collocation as a unit (rather than as individual words) –Example: treating [hand to hand] as a unit avoids incorrect analyses such as: (S (NP They) (VP engaged (PP in hand) (PP to (NP hand combat))))

36 New Topic Detection When new topics are reported, the count of collocations associated with those topics increases –When topics become old, the count drops

37 Computational Lexicography As new multi-word expressions become part of the language, they can be detected –Existing collocations can be acquired Can also be used for cultural identification –Examples: My friend got an A in his class My friend took an A in his class My friend made an A in his class My friend earned an A in his class

38 Natural Language Generation Problem: –Given two (or more) possible productions, which is more feasible? –Productions usually involve synonyms or near-synonyms –Languages generally favour one production

39 Machine Translation Collocation-complete problem? –Must find all used collocations –Must parse collocation as a unit –Must translate collocation as a unit –In target language production, must select among many plausible alternatives

40 Thanks! Questions?

41 Statistical Inference: n-gram Model over Sparse Data

42 Statistical inference Statistical inference consists of taking some data (generated in accordance with some unknown probability distribution) and then making some inferences about its distribution.

43 Language Models Predict the next word, given the previous words (this sort of task is often referred to as a Shannon game) A language model can take the context into account. Determine the probability of different sequences by examining a training corpus Applications: –OCR / speech recognition: resolve ambiguity –Spelling correction –Machine translation –etc.

44 Statistical Estimators Example: Corpus: five Jane Austen novels, N = 617,091 words, V = 14,585 unique words Task: predict the next word of the trigram "inferior to ___" from test data (Persuasion): "[In person, she was] inferior to both [sisters.]" Given the observed training data, how do you develop a model (probability distribution) to predict future events?

45 The Perfect Language Model Sequence of word forms; notation: W = (w1, w2, w3, ..., wn) The big (modeling) question: what is p(W)? Well, we know (Bayes/chain rule): p(W) = p(w1, w2, w3, ..., wn) = p(w1) p(w2|w1) p(w3|w1,w2) ··· p(wn|w1,w2,...,wn−1) Not practical: too many parameters, even for short W

46 Markov Chain Unlimited memory (cf. previous foil): –for wi, we know all its predecessors w1, w2, w3, ..., wi−1 Limited memory: –we disregard predecessors that are "too old" –remember only the k previous words: wi−k, wi−k+1, ..., wi−1 –called a "kth order Markov approximation" Stationary character (no change over time): p(W) ≈ ∏ i=1..n p(wi | wi−k, wi−k+1, ..., wi−1), n = |W|

47 N-gram Language Models (n−1)th order Markov approximation → n-gram LM: p(W) ≈ ∏ i=1..n p(wi | wi−n+1, wi−n+2, ..., wi−1) In particular (assume vocabulary |V| = 20k): 0-gram LM: uniform model, p(w) = 1/|V|: 1 parameter 1-gram LM: unigram model, p(w): 2 × 10^4 parameters 2-gram LM: bigram model, p(wi|wi−1): 4 × 10^8 parameters 3-gram LM: trigram model, p(wi|wi−2,wi−1): 8 × 10^12 parameters 4-gram LM: tetragram model, p(wi|wi−3,wi−2,wi−1): 1.6 × 10^17 parameters
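A quick check of these parameter counts, treating the number of parameters of an n-gram model as roughly |V|^n with |V| = 20,000 (the exact count of free parameters is slightly smaller):

```python
V = 20_000
for n in range(1, 5):
    print(f"{n}-gram model: ~{float(V ** n):.3g} parameters")
# 2e+04, 4e+08, 8e+12, 1.6e+17
```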

48 Reliability vs. Discrimination "large green ___________": tree? mountain? frog? car? "swallowed the large green ________": pill? tidbit? Larger n: more information about the context of the specific instance (greater discrimination) Smaller n: more instances in the training data, better statistical estimates (more reliability)

49 LM Observations How large should n be? –No n is really enough (theoretically) –But anyway: as much as possible (as close to the "perfect" model as possible) –Empirically: 3 parameter estimation? (reliability, data availability, storage space, ...) 4 is too much: |V| = 60k → 1.296 × 10^19 parameters but 6–7 would be (almost) ideal (if we had enough data) For now, word forms only (no "linguistic" processing)

50 Parameter Estimation Parameter: a numerical value needed to compute p(w|h) Estimated from data (how else?) Data preparation: –get rid of formatting etc. ("text cleaning") –define words (separate but include punctuation, call it a "word"), unless working with speech –define sentence boundaries (insert the boundary "words" <s> and </s>) –letter case: keep, discard, or be smart: name recognition, number type identification

51 Maximum Likelihood Estimate MLE: relative frequency... –...best predicts the data at hand (the "training data") –See (Ney et al., 1997) for a proof that the relative frequency really is the maximum likelihood estimate. Trigrams from training data T: –count sequences of three words in T: C3(wi−2, wi−1, wi) –count sequences of two words in T: C2(wi−2, wi−1) P_MLE(wi−2, wi−1, wi) = C3(wi−2, wi−1, wi) / N P_MLE(wi | wi−2, wi−1) = C3(wi−2, wi−1, wi) / C2(wi−2, wi−1)
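A sketch of these relative-frequency estimates; the toy sentence (borrowed from a later slide, without boundary markers) is just for illustration:

```python
from collections import Counter

def mle_trigram(tokens):
    """P_MLE(w_i | w_{i-2}, w_{i-1}) = C3(w_{i-2}, w_{i-1}, w_i) / C2(w_{i-2}, w_{i-1})."""
    c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))
    c2 = Counter(zip(tokens, tokens[1:]))
    return {(u, v, w): n / c2[(u, v)] for (u, v, w), n in c3.items()}

tokens = "he can buy you the can of soda".split()
model = mle_trigram(tokens)
print(model[("he", "can", "buy")])   # 1.0
print(model[("the", "can", "of")])   # 1.0
```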

52 Character Language Model Use individual characters instead of words: p(W) =def ∏ i=1..n p(ci | ci−n+1, ci−n+2, ..., ci−1) Same formulas and methods Might consider 4-grams, 5-grams or even more Good for cross-language comparisons Converting cross-entropy between letter- and word-based models: HS(pc) = HS(pw) / (average number of characters per word in S)

53 LM: an Example Training data: <s> He can buy you the can of soda </s> –Unigram (8 word tokens): p1(He) = p1(buy) = p1(you) = p1(the) = p1(of) = p1(soda) = .125, p1(can) = .25 –Bigram: p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5, p2(of|can) = .5, p2(you|buy) = 1, ... –Trigram: p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1, p3(buy|He,can) = 1, p3(of|the,can) = 1, ..., p3(</s>|of,soda) = 1 –Entropy: H(p1) = 2.75, H(p2) = 1, H(p3) = 0

54 LM: an Example (The Problem) Cross-entropy on test data: S = It was the greatest buy of all Even HS(p1) fails (= HS(p2) = HS(p3) = ∞), because: –all unigrams but p1(the), p1(buy), and p1(of) are 0 –all bigram probabilities are 0 –all trigram probabilities are 0 We need to make all "theoretically possible" probabilities non-zero.

55 LM: Another Example Training data S: |V| = 11 (not counting <s> and </s>) –<s> John read Moby Dick </s> –<s> Mary read a different book </s> –<s> She read a book by Cher </s> Bigram estimates: –P(She | <s>) = C(<s> She)/Σw C(<s> w) = 1/3 –P(read | She) = C(She read)/Σw C(She w) = 1 –P(Moby | read) = C(read Moby)/Σw C(read w) = 1/3 –P(Dick | Moby) = C(Moby Dick)/Σw C(Moby w) = 1 –P(</s> | Dick) = C(Dick </s>)/Σw C(Dick w) = 1 p(She read Moby Dick) = P(She | <s>) · P(read | She) · P(Moby | read) · P(Dick | Moby) · P(</s> | Dick) = 1/3 · 1 · 1/3 · 1 · 1 = 1/9
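A sketch that reproduces this bigram calculation, with the sentence markers <s> and </s> added explicitly:

```python
from collections import Counter

sentences = ["John read Moby Dick",
             "Mary read a different book",
             "She read a book by Cher"]

bigrams, histories = Counter(), Counter()
for s in sentences:
    words = ["<s>"] + s.split() + ["</s>"]
    histories.update(words[:-1])           # every token that serves as a history
    bigrams.update(zip(words, words[1:]))

def p(w, h):
    return bigrams[(h, w)] / histories[h]

test = ["<s>", "She", "read", "Moby", "Dick", "</s>"]
prob = 1.0
for h, w in zip(test, test[1:]):
    prob *= p(w, h)
print(prob)   # 0.111... = 1/9
```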

56 The Zero Problem "Raw" n-gram language model estimates: –necessarily, there will be some zeros A trigram model often has ~2.16 × 10^14 parameters, but data ~ 10^9 words –which are true zeros? Ideally, even the least frequent trigram would be seen several times, so that its probability can be distinguished from that of other trigrams (hapax legomena = terms seen only once) That ideal situation cannot happen, unfortunately (question: how much data would we need?) –We don't know which zeros are real, so we eliminate them all. Different kinds of zeros: p(w|h) = 0, p(w) = 0

57 Why do we need non-zero probabilities? To avoid infinite cross-entropy: –happens when an event found in the test data has not been seen in the training data To make the system more robust: –low-count estimates are "detailed" but cover relatively rare, unreliable appearances –high-count estimates are reliable but less "detailed"

58 Eliminating the Zero Probabilities: Smoothing Get a new p'(w) (over the same sample space): almost p(w), except for eliminating zeros Discount some w with p(w) > 0: new p'(w) < p(w); the total discount D = Σ over discounted w of (p(w) − p'(w)) Distribute D among the w with p(w) = 0: new p'(w) > p(w) –possibly also to other w with low p(w) For some w (possibly): p'(w) = p(w) Make sure Σw p'(w) = 1 There are many ways of smoothing

59 59 Smoothing: an Example

60 Laplace's Law: Smoothing by Adding 1 Laplace's law: –P_LAP(w1,..,wn) = (C(w1,..,wn) + 1)/(N + B), where C(w1,..,wn) is the frequency of the n-gram w1,..,wn, N is the number of training instances, and B is the number of bins the training instances are divided into (vocabulary size) –Problem if B > C(W) (can be the case; even B >> C(W)) –Conditional form: P_LAP(w | h) = (C(h,w) + 1) / (C(h) + B) The idea is to give a little bit of the probability space to unseen events.

61 Add-1 Smoothing Example p_MLE(Cher read Moby Dick) = p(Cher | <s>) · p(read | Cher) · p(Moby | read) · p(Dick | Moby) · p(</s> | Dick) = 0 · 0 · 1/3 · 1 · 1 = 0 –p(Cher | <s>) = (1 + C(<s> Cher))/(11 + C(<s>)) = (1 + 0)/(11 + 3) = 1/14 = .0714 –p(read | Cher) = (1 + C(Cher read))/(11 + C(Cher)) = (1 + 0)/(11 + 1) = 1/12 = .0833 –p(Moby | read) = (1 + C(read Moby))/(11 + C(read)) = (1 + 1)/(11 + 3) = 2/14 = .1429 –p(Dick | Moby) = (1 + C(Moby Dick))/(11 + C(Moby)) = (1 + 1)/(11 + 1) = 2/12 = .1667 –p(</s> | Dick) = (1 + C(Dick </s>))/(11 + C(Dick)) = (1 + 1)/(11 + 1) = 2/12 = .1667 p'(Cher read Moby Dick) = 1/14 · 1/12 · 2/14 · 2/12 · 2/12 ≈ 2.36e-5

62 Objections to Laplace's Law For NLP applications, which are very sparse, Laplace's law actually gives far too much of the probability space to unseen events. It is worse at predicting the actual probabilities of bigrams with zero counts than other methods. The variances of the adjusted counts are actually greater than those of the MLE.

63 Lidstone's Law P = probability of a specific n-gram C = count of that n-gram in the training data N = total n-grams in the training data B = number of "bins" (possible n-grams) λ = a small positive number M.L.E.: λ = 0; Laplace's law: λ = 1; Jeffreys-Perks law: λ = 1/2 P_Lid(w | h) = (C(h,w) + λ) / (C(h) + Bλ)
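A small sketch of the add-λ estimate, applied to the bigram "<s> Cher" from the earlier toy corpus (C(<s> Cher) = 0, C(<s>) = 3, B = 11):

```python
def p_lid(c_hw, c_h, B, lam):
    """P_Lid(w | h) = (C(h,w) + lambda) / (C(h) + B * lambda)."""
    return (c_hw + lam) / (c_h + B * lam)

print(p_lid(0, 3, 11, 1.0))   # Laplace (lambda = 1):         1/14   ~ 0.071
print(p_lid(0, 3, 11, 0.5))   # Jeffreys-Perks (lambda = .5): .5/8.5 ~ 0.059
```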

64 Objections to Lidstone's Law Need an a priori way to determine λ. Predicts all unseen events to be equally likely. Gives probability estimates linear in the M.L.E. frequency.

65 Lidstone's Law with λ = .5 p_MLE(Cher read Moby Dick) = p(Cher | <s>) · p(read | Cher) · p(Moby | read) · p(Dick | Moby) · p(</s> | Dick) = 0 · 0 · 1/3 · 1 · 1 = 0 –p(Cher | <s>) = (.5 + C(<s> Cher))/(.5·11 + C(<s>)) = (.5 + 0)/(5.5 + 3) = .5/8.5 = .0588 –p(read | Cher) = (.5 + C(Cher read))/(.5·11 + C(Cher)) = (.5 + 0)/(5.5 + 1) = .5/6.5 = .0769 –p(Moby | read) = (.5 + C(read Moby))/(.5·11 + C(read)) = (.5 + 1)/(5.5 + 3) = 1.5/8.5 = .1765 –p(Dick | Moby) = (.5 + C(Moby Dick))/(.5·11 + C(Moby)) = (.5 + 1)/(5.5 + 1) = 1.5/6.5 = .2308 –p(</s> | Dick) = (.5 + C(Dick </s>))/(.5·11 + C(Dick)) = (.5 + 1)/(5.5 + 1) = 1.5/6.5 = .2308 p'(Cher read Moby Dick) = .5/8.5 · .5/6.5 · 1.5/8.5 · 1.5/6.5 · 1.5/6.5 ≈ 4.25e-5

66 Held-Out Estimator How much of the probability distribution should be reserved to allow for previously unseen events? Can validate choice by holding out part of the training data. How often do events seen (or not seen) in training data occur in validation data? Held out estimator by Jelinek and Mercer (1985)

67 Held Out Estimator For each n-gram w1,..,wn, compute C1(w1,..,wn) and C2(w1,..,wn), its frequencies in the training and held-out data, respectively. –Let Nr be the number of n-grams with frequency r in the training text. –Let Tr be the total number of times that all n-grams that appeared r times in the training text appear in the held-out data. –Then the average held-out frequency of an n-gram that appeared r times in training is Tr/Nr. An estimate for the probability of one of these n-grams is: P_ho(w1,..,wn) = (Tr/Nr)/N, where C(w1,..,wn) = r
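A sketch of the held-out computation over bigrams; the tiny training/held-out lists are invented, and for simplicity the r = 0 bin below only covers bigram types that happen to appear in the held-out data (a full treatment would count all unseen types in the vocabulary):

```python
from collections import Counter

def held_out_probs(train_bigrams, heldout_bigrams):
    """P_ho(ngram with training count r) = (T_r / N_r) / N_heldout."""
    c_train = Counter(train_bigrams)
    c_held = Counter(heldout_bigrams)
    N = sum(c_held.values())
    N_r, T_r = Counter(), Counter()
    for bg in set(c_train) | set(c_held):
        r = c_train[bg]
        N_r[r] += 1              # number of types with training count r
        T_r[r] += c_held[bg]     # their total held-out occurrences
    return {r: (T_r[r] / N_r[r]) / N for r in sorted(N_r)}

train = [("a", "b"), ("a", "b"), ("b", "c"), ("c", "a")]
held = [("a", "b"), ("b", "c"), ("b", "d")]
print(held_out_probs(train, held))
```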

68 Testing Models Divide data into training and testing sets. Training data: divide into normal training plus validation (smoothing) sets: around 10% for validation (fewer parameters typically) Testing data: distinguish between the “ real ” test set and a development set.

69 Cross-Validation Held out estimation is useful if there is a lot of data available. If not, it may be better to use each part of the data both as training data and held out data. –Deleted Estimation [Jelinek & Mercer, 1985] –Leave-One-Out [Ney et al., 1997]

70 Deleted Estimation Use the data for both training and validation –Divide the training data into 2 parts, A and B (1) Train on A, validate on B (2) Train on B, validate on A –Combine the two resulting models into a final model

71 Cross-Validation Notation: N_r^a = number of n-grams occurring r times in the a-th part of the training set; T_r^ab = total number of occurrences in part b of the n-grams that occur r times in part a Two estimates (for an n-gram with C(w1,..,wn) = r): –P = T_r^01 / (N_r^0 N) and P = T_r^10 / (N_r^1 N) Combined estimate (arithmetic mean of the counts): –P_del(w1,..,wn) = (T_r^01 + T_r^10) / (N (N_r^0 + N_r^1))

72 Leave One Out The primary training corpus is of size N−1 tokens; 1 token is used as held-out data for a sort of simulated testing. The process is repeated N times so that each piece of data is left out in turn. Advantage: it explores how the model changes if any particular piece of data had not been observed.

73 Good-Turing Estimation Intuition: re-estimate the amount of probability mass assigned to n-grams with low (or zero) counts using the number of n-grams with higher counts. For any n-gram that occurs r times, we should assume that it occurs r* times, where r* = (r + 1) N_{r+1} / N_r and N_r is the number of n-grams occurring precisely r times in the training data. To convert the adjusted count to a probability, we normalize: an n-gram with r counts gets P_GT = r*/N.
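A sketch of the unsmoothed Good-Turing adjustment on a toy word-count example (r* is simply left undefined where N_{r+1} = 0, which is why the N_r values are smoothed in practice):

```python
from collections import Counter

def good_turing(counts):
    """Return adjusted probabilities r*/N plus the total mass left for unseen items."""
    N = sum(counts.values())
    N_r = Counter(counts.values())          # N_r: how many types occur exactly r times
    probs = {}
    for item, r in counts.items():
        if N_r[r + 1] > 0:                  # r* undefined when N_{r+1} = 0
            r_star = (r + 1) * N_r[r + 1] / N_r[r]
            probs[item] = r_star / N
    unseen_mass = N_r[1] / N                # probability mass reserved for unseen items
    return probs, unseen_mass

counts = Counter("the cat sat on the mat the cat ran".split())
print(good_turing(counts))
```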

74 Good-Turing Estimation Note that N is equal to the original number of counts in the distribution. Makes the assumption of a binomial distribution, which works well for large amounts of data and a large vocabulary despite the fact that words and n-grams do not have that distribution.

75 Good-Turing Estimation Note that the estimate cannot be used if N_r = 0; hence, it is necessary to smooth the N_r values. The estimate can be written as: –If C(w1,..,wn) = r > 0: P_GT(w1,..,wn) = r*/N, where r* = ((r+1) S(r+1))/S(r) and S(r) is a smoothed estimate of the expectation of N_r. –If C(w1,..,wn) = 0: P_GT(w1,..,wn) ≈ (N_1/N_0)/N In practice, counts with a frequency greater than five are assumed reliable, as suggested by Katz. In practice, this method is not used by itself, because it does not use lower-order information to estimate probabilities of higher-order n-grams.

76 Good-Turing Estimation N-grams with low counts are often treated as if they had a count of 0. In practice r* is used only for small counts; counts greater than k = 5 are assumed to be reliable: r* = r if r > k; otherwise (1 ≤ r ≤ k): r* = [ (r+1) N_{r+1}/N_r − r (k+1) N_{k+1}/N_1 ] / [ 1 − (k+1) N_{k+1}/N_1 ]

77 Discounting Methods Absolute discounting: decrease the probability of each observed n-gram by subtracting a small constant δ when C(w1, w2, …, wn) = r: P_abs(w1,..,wn) = (r − δ)/N if r > 0 Linear discounting: decrease the probability of each observed n-gram by multiplying it by the same proportion when C(w1, w2, …, wn) = r: P_lin(w1,..,wn) = (1 − α) r/N if r > 0 In both cases, the freed-up probability mass is distributed over the unseen n-grams.

78 Combining Estimators: Overview If we have several models of how the history predicts what comes next, then we might wish to combine them in the hope of producing an even better model. Some combination methods: –Katz ’ s Back Off –Simple Linear Interpolation –General Linear Interpolation

79 Backoff Back off to a lower-order n-gram if we have no evidence for the higher-order form. Trigram backoff: –P_bo(wi | wi−2, wi−1) = P(wi | wi−2, wi−1), if C(wi−2, wi−1, wi) > 0 –= α1 P(wi | wi−1), if C(wi−2, wi−1, wi) = 0 and C(wi−1, wi) > 0 –= α2 P(wi), otherwise
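A deliberately simplified backoff sketch ("stupid backoff" style, with a fixed constant α rather than the properly normalized discounts of Katz's scheme on the next slides), just to show the control flow; the toy counts are invented:

```python
def backoff_prob(w, h2, h1, tri, bi, uni, alpha=0.4):
    """Use the trigram estimate if seen, else a scaled bigram, else a scaled unigram.
    Note: with a fixed alpha this is a score, not a proper probability distribution."""
    if tri.get((h2, h1, w), 0) > 0:
        return tri[(h2, h1, w)] / bi[(h2, h1)]
    if bi.get((h1, w), 0) > 0:
        return alpha * bi[(h1, w)] / uni[h1]
    return alpha * alpha * uni.get(w, 0) / sum(uni.values())

uni = {"she": 1, "read": 3, "a": 2, "book": 2}
bi = {("she", "read"): 1, ("read", "a"): 2}
tri = {("she", "read", "a"): 1}
print(backoff_prob("a", "she", "read", tri, bi, uni))      # trigram seen: 1.0
print(backoff_prob("book", "she", "read", tri, bi, uni))   # backs off to the unigram
```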

80 Katz ’ s Back Off Model If the n-gram of concern has appeared more than k times, then an n-gram estimate is used but an amount of the MLE estimate gets discounted (it is reserved for unseen n-grams). If the n-gram occurred k times or less, then we will use an estimate from a shorter n-gram (back-off probability), normalized by the amount of probability remaining and the amount of data covered by this estimate. The process continues recursively.

81 Katz ’ s Back Off Model Katz used Good-Turing estimates when an n-gram appeared k or fewer times.

82 Problems with Backing-Off If bigram w 1 w 2 is common, but trigram w 1 w 2 w 3 is unseen, it may be a meaningful gap, rather than a gap due to chance and scarce data. –i.e., a “ grammatical null ” In that case, it may be inappropriate to back-off to lower-order probability.

83 Linear Interpolation One way of solving the sparseness in a trigram model is to mix that model with bigram and unigram models that suffer less from data sparseness. This can be done by linear interpolation (also called finite mixture models). The weights can be set using the Expectation- Maximization (EM) algorithm.

84 Simple Interpolated Smoothing Add information from less detailed distributions using weights λ = (λ0, λ1, λ2, λ3): p'(wi | wi−2, wi−1) = λ3 p3(wi | wi−2, wi−1) + λ2 p2(wi | wi−1) + λ1 p1(wi) + λ0/|V| Normalize: λi ≥ 0, Σ i=0..n λi = 1 is sufficient (λ0 = 1 − Σ i=1..n λi) (n = 3) Estimation using MLE: –fix the p3, p2, p1 and |V| parameters as estimated from the training data –then find the {λi} that minimize the cross entropy (maximize the probability of the data): −(1/|D|) Σ i=1..|D| log2 (p'(wi|hi))

85 Held Out Data What data should we use? –Try the training data T: but then we will always get λ3 = 1 Why? (Let p_iT be the i-gram distribution estimated from T.) We are minimizing H_T(p') over λ, with p' = λ3 p3T + λ2 p2T + λ1 p1T + λ0/|V| –remember: H_T(p') = H(p3T) + D(p3T || p'); p3T is fixed, so H(p3T) is fixed and the minimum is at p' = p3T, i.e., λ3 = 1 –thus: do not use the training data to estimate λ We must hold out part of the training data (held-out data, H)... ...and call the remaining data the (true/raw) training data, T The test data S (e.g., for evaluation purposes) is still different data!

86 The Formulas Repeat: minimize −(1/|H|) Σ i=1..|H| log2 (p'(wi|hi)) over λ, where p'(wi | hi) = p'(wi | wi−2, wi−1) = λ3 p3(wi | wi−2, wi−1) + λ2 p2(wi | wi−1) + λ1 p1(wi) + λ0/|V| "Expected counts" of the lambdas, for j = 0..3: c(λj) = Σ i=1..|H| λj pj(wi|hi) / p'(wi|hi) "Next" values, for j = 0..3: λj,next = c(λj) / Σ k=0..3 c(λk)

87 The (Smoothing) EM Algorithm 1. Start with some λ such that λj > 0 for all j in 0..3. 2. Compute the "expected counts" for each λj. 3. Compute the new set of λj, using the "next" formula. 4. Start over at step 2, unless a termination condition is met. Termination condition: convergence of λ. –Simply set an ε, and finish if |λj − λj,next| < ε for each j (step 3). Guaranteed to converge: follows from Jensen's inequality, plus a technical proof.

88 Example Raw distribution (unigram; smooth with uniform): p(a) = .25, p(b) = .5, p(α) = 1/64 for α in {c..r}, p(α) = 0 for the rest: s, t, u, v, w, x, y, z Held-out data: baby; use one pair of weights (λ1: unigram, λ0: uniform), i.e., the model being tuned is p'(w) = λ1 p1(w) + λ0/|V| Start with λ1 = .5: p'(b) = .5 × .5 + .5/26 = .27, p'(a) = .5 × .25 + .5/26 = .14, p'(y) = .5 × 0 + .5/26 = .02 c(λ1) = .5×.5/.27 + .5×.25/.14 + .5×.5/.27 + .5×0/.02 = 2.72 c(λ0) = .5×.04/.27 + .5×.04/.14 + .5×.04/.27 + .5×.04/.02 = 1.28 Normalize: λ1,next = .68, λ0,next = .32. Repeat from step 2 (recompute p' first for efficiency, then the c(λi), ...). Finish when the new lambdas differ little from the previous ones (say, by less than 0.01).
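A short sketch that reruns this example with the EM updates from the previous slides (same raw unigram, uniform component over |V| = 26, held-out text "baby"):

```python
p1 = {"a": 0.25, "b": 0.5}
for ch in "cdefghijklmnopqr":
    p1[ch] = 1 / 64                         # everything else (s..z) stays at 0
V = 26
heldout = "baby"

lam1, lam0 = 0.5, 0.5                       # unigram weight, uniform weight
for _ in range(50):
    c1 = c0 = 0.0
    for ch in heldout:
        p_prime = lam1 * p1.get(ch, 0.0) + lam0 / V
        c1 += lam1 * p1.get(ch, 0.0) / p_prime   # expected count of lambda_1
        c0 += (lam0 / V) / p_prime               # expected count of lambda_0
    new1, new0 = c1 / (c1 + c0), c0 / (c1 + c0)
    done = abs(new1 - lam1) < 0.01 and abs(new0 - lam0) < 0.01
    lam1, lam0 = new1, new0
    if done:
        break
print(round(lam1, 2), round(lam0, 2))       # first iteration gives 0.68 / 0.32
```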

89 Witten-Bell Smoothing The nth-order smoothed model is defined recursively as an interpolation of the nth-order MLE and the (n−1)th-order smoothed model: p_WB(wi | wi−n+1..wi−1) = λ(wi−n+1..wi−1) p_MLE(wi | wi−n+1..wi−1) + (1 − λ(wi−n+1..wi−1)) p_WB(wi | wi−n+2..wi−1) To compute λ(wi−n+1..wi−1), we need the number of unique words that follow that history.

90 Witten-Bell Smoothing The number of words that follow the history and have one or more counts is: N1+(wi−n+1..wi−1 •) = |{wi : C(wi−n+1..wi−1 wi) > 0}| We can assign the parameters such that: 1 − λ(wi−n+1..wi−1) = N1+(wi−n+1..wi−1 •) / (N1+(wi−n+1..wi−1 •) + Σwi C(wi−n+1..wi−1 wi))

91 Witten-Bell Smoothing Substituting into the first equation, we get: p_WB(wi | wi−n+1..wi−1) = (C(wi−n+1..wi) + N1+(wi−n+1..wi−1 •) p_WB(wi | wi−n+2..wi−1)) / (Σwi C(wi−n+1..wi) + N1+(wi−n+1..wi−1 •))
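A bigram-only sketch of this formula, backing off to the unigram MLE instead of recursing further (a simplification of the full recursive definition above); the toy sentence is invented:

```python
from collections import Counter

def witten_bell_bigram(tokens):
    """p_WB(w | h) = (C(h w) + N1+(h .) * p_uni(w)) / (C(h .) + N1+(h .))."""
    bi = Counter(zip(tokens, tokens[1:]))
    uni = Counter(tokens)
    N = len(tokens)
    n1plus = Counter(h for (h, _w) in bi)    # distinct continuations of each history
    c_h = Counter()
    for (h, _w), c in bi.items():
        c_h[h] += c                          # total bigram count for each history

    def p(w, h):
        if c_h[h] == 0:
            return uni[w] / N                # unseen history: fall back to the unigram
        return (bi[(h, w)] + n1plus[h] * uni[w] / N) / (c_h[h] + n1plus[h])

    return p

p = witten_bell_bigram("the cat sat on the mat the cat ran".split())
print(round(p("cat", "the"), 3), round(p("ran", "the"), 3))   # seen vs. unseen bigram
```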

92 General Linear Interpolation In simple linear interpolation, the weights are just a single number, but one can define a more general and powerful model where the weights are a function of the history. Need some way to group or bucket lambda histories.

93 Q & A

94 References www.cs.tau.ac.il/~nachumd/NLP http://www.cs.sfu.ca http://min.ecn.purdue.edu Manning and Schütze, Foundations of Statistical Natural Language Processing Jurafsky and Martin, Speech and Language Processing

