Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012.


1 Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 31, 2012

2 Maximum Entropy 2

3 Roadmap Maximum entropy: Overview Generative & Discriminative models: Naïve Bayes vs. MaxEnt Maximum entropy principle Feature functions Modeling Decoding Summary

4 Maximum Entropy “MaxEnt”: Popular machine learning technique for NLP First uses in NLP circa 1996 – Rosenfeld, Berger Applied to a wide range of tasks Sentence boundary detection (MxTerminator, Ratnaparkhi), POS tagging (Ratnaparkhi, Berger), topic segmentation (Berger), Language modeling (Rosenfeld), prosody labeling, etc…. 4

5 Readings & Comments Several readings: (Berger, 1996), (Ratnaparkhi, 1997) (Klein & Manning, 2003): Tutorial 5

6 Readings & Comments Several readings: (Berger, 1996), (Ratnaparkhi, 1997) (Klein & Manning, 2003): Tutorial Note: Some of these are very ‘dense’ Don’t spend huge amounts of time on every detail Take a first pass before class, review after lecture 6

7 Readings & Comments Several readings: (Berger, 1996), (Ratnaparkhi, 1997) (Klein & Manning, 2003): Tutorial Note: Some of these are very ‘dense’ Don’t spend huge amounts of time on every detail Take a first pass before class, review after lecture Going forward: Techniques more complex Goal: Understand basic model, concepts Training esp. complex – we’ll discuss, but not implement 7

8 Notation Note Not entirely consistent: We’ll use: input = x; output=y; pair = (x,y) Consistent with Berger, 1996 Ratnaparkhi, 1996: input = h; output=t; pair = (h,t) Klein/Manning, ‘03: input = d; output=c; pair = (c,d) 8

9 Joint vs Conditional Models Assuming some training data {(x,y)}, need to learn a model Θ s.t. given a new x, can predict label y. 9

10 Joint vs Conditional Models Assuming some training data {(x,y)}, need to learn a model Θ s.t. given a new x, can predict label y. Different types of models: Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ) 10

11 Joint vs Conditional Models Assuming some training data {(x,y)}, need to learn a model Θ s.t. given a new x, can predict label y. Different types of models: Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ) Most models so far: n-gram, Naïve Bayes, HMM, etc 11

12 Joint vs Conditional Models Assuming some training data {(x,y)}, need to learn a model Θ s.t. given a new x, can predict label y. Different types of models: Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ) Most models so far: n-gram, Naïve Bayes, HMM, etc Conceptually easy to compute weights: relative frequency 12

13 Joint vs Conditional Models Assuming some training data {(x,y)}, need to learn a model Θ s.t. given a new x, can predict label y. Different types of models: Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ) Most models so far: n-gram, Naïve Bayes, HMM, etc Conceptually easy to compute weights: relative frequency Conditional (aka discriminative) models estimate P(y|x), by maximizing P(Y|X, Θ) Models going forward: MaxEnt, SVM, CRF, … 13

14 Joint vs Conditional Models Assuming some training data {(x,y)}, need to learn a model Θ s.t. given a new x, can predict label y. Different types of models: Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ) Most models so far: n-gram, Naïve Bayes, HMM, etc Conceptually easy to compute weights: relative frequency Conditional (aka discriminative) models estimate P(y|x), by maximizing P(Y|X, Θ) Models going forward: MaxEnt, SVM, CRF, … Computing weights more complex 14

15 Naïve Bayes Model Naïve Bayes Model assumes features f are independent of each other, given the class C [Diagram: class node c with arrows to feature nodes f_1, f_2, f_3, …, f_k]

16 Naïve Bayes Model Makes assumption of conditional independence of features given class 16

17 Naïve Bayes Model Makes assumption of conditional independence of features given class However, this is generally unrealistic 17

18 Naïve Bayes Model Makes assumption of conditional independence of features given class However, this is generally unrealistic P(“cuts” | politics) = p_cuts 18

19 Naïve Bayes Model Makes assumption of conditional independence of features given class However, this is generally unrealistic P(“cuts” | politics) = p_cuts What about P(“cuts” | politics, “budget”)? Is it still equal to p_cuts? 19

20 Naïve Bayes Model Makes assumption of conditional independence of features given class However, this is generally unrealistic P(“cuts” | politics) = p_cuts What about P(“cuts” | politics, “budget”)? Is it still equal to p_cuts? We would like a model that doesn’t make this assumption 20

21 Model Parameters Our model: c* = argmax_c P(c) Π_j P(f_j|c) Two types of parameters: P(C): Class priors P(f_j|c): Class-conditional feature probabilities |C| + |V||C| parameters in total, if features are words in vocabulary V
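
As a concrete illustration (a minimal sketch with made-up parameter values, not code from the course), the decoding rule c* = argmax_c P(c) Π_j P(f_j|c) can be computed in log space:

```python
# Hypothetical Naive Bayes decoder for c* = argmax_c P(c) * prod_j P(f_j|c).
# priors and cond_probs are assumed to have been estimated by relative frequency.
import math

def nb_classify(features, priors, cond_probs):
    """features: observed feature list; priors: {c: P(c)}; cond_probs: {c: {f: P(f|c)}}."""
    def log_score(c):
        # Sum of logs instead of a product of probabilities, for numerical stability.
        return math.log(priors[c]) + sum(math.log(cond_probs[c][f]) for f in features)
    return max(priors, key=log_score)

# Toy usage with invented parameters:
priors = {"politics": 0.6, "sports": 0.4}
cond_probs = {"politics": {"cuts": 0.02, "budget": 0.03},
              "sports":   {"cuts": 0.005, "budget": 0.001}}
print(nb_classify(["cuts", "budget"], priors, cond_probs))  # politics
```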

22 Weights in Naïve Bayes (rows = features f_1 … f_|V|, columns = classes c_1 … c_k; each cell holds P(f_i|c_j)):

         c_1           c_2           c_3          …   c_k
  f_1    P(f_1|c_1)    P(f_1|c_2)    P(f_1|c_3)   …   P(f_1|c_k)
  f_2    P(f_2|c_1)    P(f_2|c_2)    …
  …
  f_|V|  P(f_|V||c_1)  …                              P(f_|V||c_k)

23 Weights in Naïve Bayes and Maximum Entropy Naïve Bayes: the weights P(f|y) are probabilities in [0,1] 23

24–26 Weights in Naïve Bayes and Maximum Entropy Naïve Bayes: P(f|y) are probabilities in [0,1], combined as P(y|x) = P(y) Π_j P(f_j|y) / Σ_y' P(y') Π_j P(f_j|y') 26

27 Weights in Naïve Bayes and Maximum Entropy Naïve Bayes: P(f|y) are probabilities in [0,1] MaxEnt: Weights are real numbers; any magnitude, sign 27

28 Weights in Naïve Bayes and Maximum Entropy Naïve Bayes: P(f|y) are probabilities in [0,1] MaxEnt: Weights are real numbers; any magnitude, sign P(y|x) = exp(Σ_j λ_j f_j(x,y)) / Z(x), where Z(x) = Σ_y' exp(Σ_j λ_j f_j(x,y')) 28

29 MaxEnt Overview Prediction: P(y|x) 29

30 MaxEnt Overview Prediction: P(y|x) f_j(x,y): binary feature function, indicating presence of feature in instance x of class y 30

31 MaxEnt Overview Prediction: P(y|x) f_j(x,y): binary feature function, indicating presence of feature in instance x of class y λ_j: feature weights, learned in training 31

32 MaxEnt Overview Prediction: P(y|x) f_j(x,y): binary feature function, indicating presence of feature in instance x of class y λ_j: feature weights, learned in training Prediction: Compute P(y|x), pick highest y 32
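
A minimal sketch of this prediction step (function and variable names are my own, not the lecture's): compute P(y|x) = exp(Σ_j λ_j f_j(x,y)) / Z(x) for each class and pick the best one.

```python
# Hypothetical MaxEnt decoder: score each class with exp(sum_j lambda_j * f_j(x, y)),
# normalize by Z(x), and return the most probable class with the full distribution.
import math

def maxent_predict(x, classes, feature_funcs, weights):
    """feature_funcs: list of binary f_j(x, y); weights: matching list of lambda_j."""
    scores = {y: math.exp(sum(w * f(x, y) for f, w in zip(feature_funcs, weights)))
              for y in classes}
    Z = sum(scores.values())                       # normalizer Z(x)
    probs = {y: s / Z for y, s in scores.items()}
    return max(probs, key=probs.get), probs
```

Calling it with the feature functions and weights produced by training gives both the predicted class and the distribution P(y|x).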

33 Weights in MaxEnt (one weight λ_j per (feature, class) pair; rows = features f_1 … f_|V|, columns = classes c_1 … c_k):

         c_1    c_2    c_3   …   c_k
  f_1    λ_1    λ_8    …
  f_2    λ_2    …
  …
  f_|V|  λ_6    …

34 Maximum Entropy Principle 34

35 Maximum Entropy Principle Intuitively, model all that is known, and assume as little as possible about what is unknown 35

36 Maximum Entropy Principle Intuitively, model all that is known, and assume as little as possible about what is unknown Maximum entropy = minimum commitment 36

37 Maximum Entropy Principle Intuitively, model all that is known, and assume as little as possible about what is unknown Maximum entropy = minimum commitment Related to concepts like Occam’s razor 37

38 Maximum Entropy Principle Intuitively, model all that is known, and assume as little as possible about what is unknown Maximum entropy = minimum commitment Related to concepts like Occam’s razor Laplace’s “Principle of Insufficient Reason”: When one has no information to distinguish between the probability of two events, the best strategy is to consider them equally likely 38

39 Example I: (K&M 2003) Consider a coin flip H(X) 39

40 Example I: (K&M 2003) Consider a coin flip H(X) What values of P(X=H), P(X=T) maximize H(X)? 40

41 Example I: (K&M 2003) Consider a coin flip H(X) What values of P(X=H), P(X=T) maximize H(X)? P(X=H)=P(X=T)=1/2 If no prior information, best guess is fair coin 41

42 Example I: (K&M 2003) Consider a coin flip H(X) What values of P(X=H), P(X=T) maximize H(X)? P(X=H)=P(X=T)=1/2 If no prior information, best guess is fair coin What if you know P(X=H) =0.3? 42

43 Example I: (K&M 2003) Consider a coin flip H(X) What values of P(X=H), P(X=T) maximize H(X)? P(X=H)=P(X=T)=1/2 If no prior information, best guess is fair coin What if you know P(X=H) =0.3? P(X=T)=0.7 43
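
A small numerical check of this example, as a sketch in Python (not part of the slides): the entropy of the coin as a function of P(X=H).

```python
# Entropy (in bits) of a coin with P(X=H) = p; it is maximized by the fair coin.
import math

def coin_entropy(p):
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)

print(coin_entropy(0.5))  # 1.0 bit: the maximum, so the fair coin is the maxent guess
print(coin_entropy(0.3))  # ~0.881 bits: once P(X=H)=0.3 is known, P(X=T)=0.7 is forced
```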

44 Example II: MT (Berger, 1996) Task: English → French machine translation Specifically, translating ‘in’ Suppose we’ve seen ‘in’ translated as: {dans, en, à, au cours de, pendant} Constraint: 44

45 Example II: MT (Berger, 1996) Task: English → French machine translation Specifically, translating ‘in’ Suppose we’ve seen ‘in’ translated as: {dans, en, à, au cours de, pendant} Constraint: p(dans)+p(en)+p(à)+p(au cours de)+p(pendant)=1 If no other constraint, what is the maxent model? 45

46 Example II: MT (Berger, 1996) Task: English → French machine translation Specifically, translating ‘in’ Suppose we’ve seen ‘in’ translated as: {dans, en, à, au cours de, pendant} Constraint: p(dans)+p(en)+p(à)+p(au cours de)+p(pendant)=1 If no other constraint, what is the maxent model? p(dans)=p(en)=p(à)=p(au cours de)=p(pendant)=1/5 46

47 Example II: MT (Berger, 1996) What if we find out that the translator uses dans or en 30% of the time? Constraint 47

48 Example II: MT (Berger, 1996) What if we find out that the translator uses dans or en 30% of the time? Constraint: p(dans)+p(en)=3/10 Now what is the maxent model? 48

49 Example II: MT (Berger, 1996) What if we find out that the translator uses dans or en 30% of the time? Constraint: p(dans)+p(en)=3/10 Now what is the maxent model? p(dans)=p(en)= 49

50 Example II: MT (Berger, 1996) What if we find out that the translator uses dans or en 30% of the time? Constraint: p(dans)+p(en)=3/10 Now what is the maxent model? p(dans)=p(en)=3/20 p(à)=p(au cours de)=p(pendant)= 50

51 Example II: MT (Berger, 1996) What if we find out that the translator uses dans or en 30% of the time? Constraint: p(dans)+p(en)=3/10 Now what is the maxent model? p(dans)=p(en)=3/20 p(à)=p(au cours de)=p(pendant)=7/30 What if we also know the translator picks à or dans 50% of the time? Add new constraint: p(à)+p(dans)=0.5 Now what is the maxent model? 51

52 Example II: MT (Berger, 1996) What if we find out that the translator uses dans or en 30% of the time? Constraint: p(dans)+p(en)=3/10 Now what is the maxent model? p(dans)=p(en)=3/20 p(à)=p(au cours de)=p(pendant)=7/30 What if we also know the translator picks à or dans 50% of the time? Add new constraint: p(à)+p(dans)=0.5 Now what is the maxent model? Not intuitively obvious… 52
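
Because the three-constraint case has no obvious closed form, one way to see the answer is to maximize the entropy numerically. The sketch below is my own illustration (scipy's SLSQP solver, variable names assumed), not part of the lecture.

```python
# Maximize H(p) subject to Berger's constraints: probabilities sum to 1,
# p(dans)+p(en)=3/10, and p(à)+p(dans)=1/2.  We minimize -H(p) with SLSQP.
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "à", "au cours de", "pendant"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)           # guard against log(0)
    return float(np.sum(p * np.log(p)))  # -H(p)

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},   # p(dans) + p(en) = 3/10
    {"type": "eq", "fun": lambda p: p[2] + p[0] - 0.5},   # p(à) + p(dans) = 1/2
]
result = minimize(neg_entropy, x0=np.full(5, 0.2),
                  bounds=[(0.0, 1.0)] * 5, constraints=constraints, method="SLSQP")
for w, prob in zip(words, result.x):
    print(f"p({w}) = {prob:.4f}")
```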

53 Example III: POS (K&M, 2003) 53

54 Example III: POS (K&M, 2003) 54

55 Example III: POS (K&M, 2003) 55

56 Example III: POS (K&M, 2003) 56

57 Example III Problem: Too uniform What else do we know? Nouns more common than verbs 57

58 Example III Problem: Too uniform What else do we know? Nouns more common than verbs So f_N = {NN, NNS, NNP, NNPS}, and E[f_N] = 32/36 Also, proper nouns more frequent than common, so E[NNP, NNPS] = 24/36 Etc. 58

59 Maximum Entropy Principle: Summary Among all probability distributions p in P that satisfy the set of constraints, select the p* that maximizes the entropy H(p) = -Σ_x p(x) log p(x)

60 Maximum Entropy Principle: Summary Among all probability distributions p in P that satisfy the set of constraints, select the p* that maximizes the entropy H(p) = -Σ_x p(x) log p(x) Questions: 1) How do we model the constraints? 2) How do we select the distribution?

61 MaxEnt Modeling

62 Basic Approach Given some training data: Collect training examples (x,y) in (X,Y), where

63 Basic Approach Given some training data: Collect training examples (x,y) in (X,Y), where x in X: the training data samples y in Y: the labels (classes) to predict

64 Basic Approach Given some training data: Collect training examples (x,y) in (X,Y), where x in X: the training data samples y in Y: the labels (classes) to predict For example in text classification: X: words in documents Y: text categories: guns, mideast, misc… Compute P(y|x) Select y with highest P(y|x) as classification

65 Basic Approach Given some training data: Collect training examples (x,y) in (X,Y), where x in X: the training data samples y in Y: the labels (classes) to predict For example in text classification: X

66 Basic Approach Given some training data: Collect training examples (x,y) in (X,Y), where x in X: the training data samples y in Y: the labels (classes) to predict For example in text classification: X: words in documents Y:

67 Basic Approach Given some training data: Collect training examples (x,y) in (X,Y), where x in X: the training data samples y in Y: the labels (classes) to predict For example in text classification: X: words in documents Y: text categories: guns, mideast, misc… Compute P(y|x) Select y with highest P(y|x) as classification

68 Maximum Entropy Model Compute P(y|x) Select the distribution that maximizes entropy subject to the constraints from the training data: p* = argmax_{p in P} H(p), subject to the model feature expectations matching the empirical feature expectations, E_p(f_j) = E_p̃(f_j) for every feature function f_j

69 Modeling Components Feature functions Computing expectations Calculating P(y|x) and P(x,y)

70 Feature Functions

71 A feature function is a binary-valued indicator function:

72 Feature Functions A feature function is a binary-valued indicator function: In text classification, j refers to a specific (feature, class) pair s.t. the feature is present when y is that class.

73 Feature Functions A feature function is a binary-valued indicator function: In text classification, j refers to a specific (feature, class) pair s.t. the feature is present when y is that class. f_j(x,y) = 1 if y=“guns” and x includes “rifle”

74 Feature Functions A feature function is a binary-valued indicator function: In text classification, j refers to a specific (feature, class) pair s.t. the feature is present when y is that class. f_j(x,y) = 1 if y=“guns” and x includes “rifle”; 0 otherwise
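
In code, such a feature function is just an indicator. A minimal sketch, assuming x is a list of tokens (names are illustrative, not from the lecture):

```python
# Binary feature function for the (feature, class) pair ("rifle", "guns").
def f_rifle_guns(x, y):
    """1 if the class is 'guns' and the document x contains 'rifle', else 0."""
    return 1 if y == "guns" and "rifle" in x else 0

print(f_rifle_guns(["allow", "rifle", "sales"], "guns"))     # 1
print(f_rifle_guns(["allow", "rifle", "sales"], "mideast"))  # 0
```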

75 Feature Functions Feature functions can be complex:

76 Feature Functions Feature functions can be complex: For example: Features could be: class=y and a conjunction of features e.g. class=“guns” and currwd=“rifles” and prevwd=“allow”

77 Feature Functions Feature functions can be complex: For example: Features could be: class=y and a conjunction of features e.g. class=“guns” and currwd=“rifles” and prevwd=“allow” class=y and a disjunction of features e.g. class=“guns” and currwd=“rifles” or currwd=“guns” and so on

78 Feature Functions Feature functions can be complex: For example: Features could be: class=y and a conjunction of features e.g. class=“guns” and currwd=“rifles” and prevwd=“allow” class=y and a disjunction of features e.g. class=“guns” and currwd=“rifles” or currwd=“guns” and so on Many are simple

79 Feature Functions Feature functions can be complex: For example: Features could be: class=y and a conjunction of features e.g. class=“guns” and currwd=“rifles” and prevwd=“allow” class=y and a disjunction of features e.g. class=“guns” and currwd=“rifles” or currwd=“guns” and so on Many are simple Feature selection will be an issue

80 Key Points Feature functions are: binary correspond to a (feature,class) pair

81 Key Points Feature functions are: binary correspond to a (feature,class) pair MaxEnt training learns weights associated with feature functions
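
Training itself is covered later and not implemented in this course unit, but as a rough sketch of the idea (my own simplification, not the lecture's algorithm): gradient ascent on the conditional log-likelihood, where each weight's gradient is the empirical expectation of its feature function minus the model's expectation, the two quantities developed in the next slides.

```python
# Simplified MaxEnt training; grad_j = E_empirical[f_j] - E_model[f_j].
import math

def maxent_train(data, classes, feature_funcs, epochs=200, lr=0.5):
    """data: list of (x, y) pairs; returns one weight lambda_j per feature function."""
    weights = [0.0] * len(feature_funcs)
    for _ in range(epochs):
        grad = [0.0] * len(feature_funcs)
        for x, y in data:
            # Model distribution P(c|x) under the current weights
            scores = {c: math.exp(sum(w * f(x, c) for w, f in zip(weights, feature_funcs)))
                      for c in classes}
            Z = sum(scores.values())
            for j, f in enumerate(feature_funcs):
                grad[j] += f(x, y)                                        # empirical count
                grad[j] -= sum(scores[c] / Z * f(x, c) for c in classes)  # model expectation
        weights = [w + lr * g / len(data) for w, g in zip(weights, grad)]
    return weights
```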

82 Computing Expectations

83 1: Consider a coin-flipping experiment Heads: win 100 Tails: lose 50

84 Computing Expectations 1: Consider a coin-flipping experiment Heads: win 100 Tails: lose 50 What is the expected yield?

85 Computing Expectations 1: Consider a coin-flipping experiment Heads: win 100 Tails: lose 50 What is the expected yield? P(X=H)*100 + P(X=T)*(-50) 2: Consider the more general case, outcome i: yield = v_i What is the expected yield?

86 Computing Expectations 1: Consider a coin-flipping experiment Heads: win 100 Tails: lose 50 What is the expected yield? P(X=H)*100 + P(X=T)*(-50) 2: Consider the more general case, outcome i: yield = v_i Expected yield: Σ_i P(X=i) * v_i

87 Calculating Expectations Let P(X=i) be a distribution of a random variable X Let f(x) be a function of x. Let E_p(f) be the expectation of f(x) based on P(x): E_p(f) = Σ_x P(X=x) f(x)

88 Empirical Expectation Empirical distribution: denoted p̃ Example: toss a coin 4 times: Result: H, T, H, H Average return: (100-50+100+100)/4 = 62.5 Example due to F. Xia

89 Empirical Expectation Empirical distribution: denoted p̃ Example: toss a coin 4 times: Result: H, T, H, H Average return: (100-50+100+100)/4 = 62.5 Empirical distribution: p̃(X=H) = 3/4, p̃(X=T) = 1/4 Example due to F. Xia

90 Empirical Expectation Empirical distribution: denoted p̃ Example: toss a coin 4 times: Result: H, T, H, H Average return: (100-50+100+100)/4 = 62.5 Empirical distribution: p̃(X=H) = 3/4, p̃(X=T) = 1/4 Empirical expectation: Example due to F. Xia

91 Empirical Expectation Empirical distribution: denoted p̃ Example: toss a coin 4 times: Result: H, T, H, H Average return: (100-50+100+100)/4 = 62.5 Empirical distribution: p̃(X=H) = 3/4, p̃(X=T) = 1/4 Empirical expectation: ¾*100 + ¼*(-50) = 62.5 Example due to F. Xia

92 Model Expectation Model: p(x) Assume a fair coin

93 Model Expectation Model: p(x) Assume a fair coin P(X=H)= ½; P(X=T)= ½; Model expectation:

94 Model Expectation Model: p(x) Assume a fair coin P(X=H)= ½; P(X=T)= ½; Model expectation: ½*100 + ½*(-50)=25
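
Putting the two notions side by side in code (the values are from the slides; the variable names are mine):

```python
# Empirical expectation from the observed tosses vs. model expectation under a fair coin.
tosses = ["H", "T", "H", "H"]
payoff = {"H": 100, "T": -50}

empirical = sum(payoff[t] for t in tosses) / len(tosses)
print(empirical)   # 62.5  (= 3/4 * 100 + 1/4 * -50)

model = 0.5 * payoff["H"] + 0.5 * payoff["T"]
print(model)       # 25.0  (= 1/2 * 100 + 1/2 * -50)
```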

95 Empirical Expectation E_p̃(f) = Σ_x p̃(x) f(x)

96 Example Training data: x1 c1 t1 t2 t3 x2 c2 t1 t4 x3 c1 t3 t4 x4 c3 t1 t3 Raw counts:

       t1  t2  t3  t4
  c1    1   1   2   1
  c2    1   0   0   1
  c3    1   0   1   0

Example due to F. Xia

97 Example Training data: x1 c1 t1 t2 t3 x2 c2 t1 t4 x3 c1 t3 t4 x4 c3 t1 t3 Empirical distribution (raw counts divided by N = 4):

       t1   t2   t3   t4
  c1   1/4  1/4  2/4  1/4
  c2   1/4   0    0   1/4
  c3   1/4   0   1/4   0

Example due to F. Xia
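
The same tables can be produced mechanically. A short sketch (data as in the slide, code names mine) that derives the counts and the empirical distribution:

```python
# Count (class, feature) pairs in the toy training set and divide by N
# to get the empirical distribution shown above.
from collections import Counter

data = [("x1", "c1", ["t1", "t2", "t3"]),
        ("x2", "c2", ["t1", "t4"]),
        ("x3", "c1", ["t3", "t4"]),
        ("x4", "c3", ["t1", "t3"])]

counts = Counter((c, t) for _, c, feats in data for t in feats)
N = len(data)
for (c, t), n in sorted(counts.items()):
    print(f"count({c},{t}) = {n}   empirical = {n}/{N} = {n / N}")
```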


