
1 1 CS546: Machine Learning and Natural Language. Probabilistic Classification. Feb 24 & 26, 2009.

2 2 Model the problem of text correction as that of generating correct sentences. Goal: learn a model of the language; use it to predict. PARADIGM: learn a probability distribution over all sentences; use it to estimate which sentence is more likely, e.g. Pr(I saw the girl it the park) vs. Pr(I saw the girl in the park). [In the same paradigm we sometimes learn a conditional probability distribution.] In practice: make assumptions on the distribution's type. In practice: a decision policy depends on the assumptions. Generative Model
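To make the paradigm concrete, here is a minimal sketch of scoring the two candidate sentences under a learned distribution; the bigram model, the tiny corpus, and the add-one smoothing are illustrative assumptions, not part of the lecture.

```python
import math
from collections import Counter

# Toy corpus and counts are purely illustrative; in practice the distribution
# is estimated from a large corpus.
corpus = ("I saw the girl in the park . I walked in the park . "
          "the girl saw me .").split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab_size = len(set(corpus))

def bigram_logprob(sentence):
    """Add-one smoothed bigram log-probability of a tokenized sentence."""
    logp = 0.0
    for w1, w2 in zip(sentence, sentence[1:]):
        logp += math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size))
    return logp

good = "I saw the girl in the park".split()
bad = "I saw the girl it the park".split()
print(bigram_logprob(good) > bigram_logprob(bad))  # expected: True
```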

3 3 Consider a distribution D over the space X × Y, where X is the instance space and Y is the set of labels (e.g., +/-1). Given a sample {(x_i, y_i)}, i = 1..m, and a loss function L(h(x), y), find h ∈ H that minimizes Σ_{i=1..m} L(h(x_i), y_i). L can be: L(h(x), y) = 1 if h(x) ≠ y and 0 otherwise (0-1 loss); L(h(x), y) = (h(x) - y)^2 (L_2 loss); L(h(x), y) = exp{-y h(x)}. Find an algorithm that minimizes the average loss; then we know that things will be okay (as a function of H). Before: Error Driven Learning. Discriminative Learning
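A small sketch (not from the slides) of the three losses and the empirical risk they induce on a sample; the sample, the hypothesis h, and all names here are hypothetical.

```python
import math

def zero_one_loss(pred, y):
    # 1 if the prediction disagrees with the label, 0 otherwise
    return 1.0 if pred != y else 0.0

def squared_loss(pred, y):
    return (pred - y) ** 2

def exp_loss(score, y):
    # score is a real-valued h(x); y is +1 or -1
    return math.exp(-y * score)

# Hypothetical labeled sample and a hypothesis h(x) = sign of the feature sum
sample = [([1, -2], -1), ([3, 1], 1), ([0, 2], 1), ([-1, -1], 1)]
h = lambda x: 1 if sum(x) >= 0 else -1

empirical_risk = sum(zero_one_loss(h(x), y) for x, y in sample) / len(sample)
print(empirical_risk)  # average 0-1 loss of h on this sample (here 0.25)
```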

4 4 Goal: find the best hypothesis from some space H of hypotheses, given the observed data D. Define "best" to be: the most probable hypothesis in H. In order to do that, we need to assume a probability distribution over the class H. In addition, we need to know something about the relation between the observed data and the hypotheses (e.g., a coin problem). As we will see, we will be Bayesian about other things, e.g., the parameters of the model. Bayesian Decision Theory

5 5 P(h) - the prior probability of a hypothesis h (a prior over H). Reflects background knowledge, before any data is observed; if there is no information, use a uniform distribution. P(D) - the probability that this sample of the data is observed (with no knowledge of the hypothesis). P(D|h) - the probability of observing the sample D, given that hypothesis h holds. P(h|D) - the posterior probability of h: the probability that h holds, given that D has been observed. Basics of Bayesian Learning

6 6 Bayes theorem: P(h|D) = P(D|h) P(h) / P(D). P(h|D) increases with P(h) and with P(D|h); P(h|D) decreases with P(D). Bayes Theorem

7 7 The learner considers a set of candidate hypotheses H (models), and attempts to find the most probable one, h ∈ H, given the observed data. Such a maximally probable hypothesis is called the maximum a posteriori (MAP) hypothesis; Bayes theorem is used to compute it: h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h). Learning Scenario

8 8 We may assume that, a priori, hypotheses are equally probable: P(h_i) = P(h_j) for all h_i, h_j ∈ H. We then get the Maximum Likelihood hypothesis: h_ML = argmax_{h ∈ H} P(D|h). Here we just look for the hypothesis that best explains the data. Learning Scenario (2)
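For the coin example mentioned two slides back, a minimal sketch of the ML and MAP estimates of the bias; the Beta(2,2) prior and the particular counts are assumptions made only for illustration.

```python
# Coin with unknown bias theta; the data D is a sequence of coin flips.
heads, tails = 7, 3            # hypothetical observations

# Maximum Likelihood: the theta that maximizes P(D | theta)
theta_ml = heads / (heads + tails)

# MAP with an assumed Beta(a, b) prior over theta
# (the prior acts like pseudo-counts; the posterior mode is used)
a, b = 2, 2
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)

print(theta_ml, theta_map)     # 0.7 and 0.666...
```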

9 9 How should we use the general formalism? What should H be? H can be a collection of functions: given the training data, choose an optimal function; then, given new data, evaluate the selected function on it. H can be a collection of possible predictions: given the data, try to directly choose the optimal prediction. H can be a collection of (conditional) probability distributions. These could be different! Specific examples we will discuss: Naive Bayes (a maximum likelihood based algorithm); Max Entropy (seemingly, a different selection criterion); Hidden Markov Models. Bayes Optimal Classifier

10 10 f: X → V, a finite set of values. Instances x ∈ X can be described as a collection of features. Given an example, assign it the most probable value in V. Bayesian Classifier

11 11 f: X → V, a finite set of values. Instances x ∈ X can be described as a collection of features. Given an example, assign it the most probable value in V. Bayes rule: v_MAP = argmax_{v ∈ V} P(v|x) = argmax_{v ∈ V} P(x|v) P(v) / P(x) = argmax_{v ∈ V} P(x|v) P(v). Notational convention: P(y) means P(Y = y). Bayesian Classifier

12 12 Given training data we can estimate the two terms. Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data. However, it is not feasible to estimate P(x_1, …, x_n | v_j) directly: we would have to estimate, for each target value, the probability of each instance (most of which will not occur). In order to use a Bayesian classifier in practice, we need to make assumptions that will allow us to estimate these quantities. Bayesian Classifier

13 13 Assumption: feature values are independent given the target value, i.e. P(x_1, …, x_n | v_j) = Π_k P(x_k | v_j). Naive Bayes

14 14 Assumption: feature values are independent given the target value. Generative model: first choose a value v_j ∈ V according to P(v_j); then, given v_j, choose x_1, x_2, …, x_n according to P(x_k | v_j). Naive Bayes

15 15 Assumption: feature values are independent given the target value Learning method: Estimate n|V| parameters and use them to compute the new value. (how to estimate?) Naive Bayes

16 16 Assumption: feature values are independent given the target value. Learning method: estimate n|V| parameters and use them to compute the new value. This is learning without search: given a collection of training examples, you just compute the best hypothesis (given the assumptions). This is learning without trying to achieve consistency or even approximate consistency. Why does it work? Naive Bayes
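A minimal sketch of the "learning without search" step: the parameters are just normalized counts over the training set. The Laplace smoothing and all names here are illustrative choices, not the estimator prescribed by the slides.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (feature_tuple, label). Returns prior and conditional estimates."""
    label_counts = Counter(label for _, label in examples)
    feat_counts = defaultdict(Counter)   # (position, label) -> Counter over feature values
    values = defaultdict(set)            # position -> set of values seen at that position
    for x, label in examples:
        for k, xk in enumerate(x):
            feat_counts[(k, label)][xk] += 1
            values[k].add(xk)

    prior = {v: c / len(examples) for v, c in label_counts.items()}

    def cond_prob(k, xk, v):
        # Laplace-smoothed estimate of P(x_k = xk | v)
        return (feat_counts[(k, v)][xk] + 1) / (label_counts[v] + len(values[k]))

    return prior, cond_prob

def predict(prior, cond_prob, x):
    # argmax_v P(v) * prod_k P(x_k | v)
    def score(v):
        s = prior[v]
        for k, xk in enumerate(x):
            s *= cond_prob(k, xk, v)
        return s
    return max(prior, key=score)
```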

17 17 Notice that the feature values are conditionally independent given the target value, and are not required to be independent. Example: f(x,y) = x ∧ y over the product distribution defined by p(x=0) = p(x=1) = 1/2 and p(y=0) = p(y=1) = 1/2. The distribution is defined so that x and y are independent: p(x,y) = p(x) p(y) (interpretation: for every value of x and y). But, given that f(x,y) = 0: p(x=1|f=0) = p(y=1|f=0) = 1/3, while p(x=1, y=1 | f=0) = 0, so x and y are not conditionally independent. Conditional Independence

18 18 The other direction also does not hold. x and y can be conditionally independent but not independent. f=0: p(x=1|f=0) =1, p(y=1|f=0) = 0 f=1: p(x=1|f=1) =0, p(y=1|f=1) = 1 and assume, say, that p(f=0) = p(f=1)=1/2 Given the value of f, x and y are independent. What about unconditional independence ? Conditional Independence

19 19 The other direction also does not hold: x and y can be conditionally independent but not independent. Example: f=0: p(x=1|f=0) = 1, p(y=1|f=0) = 0; f=1: p(x=1|f=1) = 0, p(y=1|f=1) = 1; and assume, say, that p(f=0) = p(f=1) = 1/2. Given the value of f, x and y are independent. What about unconditional independence? p(x=1) = p(x=1|f=0)p(f=0) + p(x=1|f=1)p(f=1) = 0.5 + 0 = 0.5. p(y=1) = p(y=1|f=0)p(f=0) + p(y=1|f=1)p(f=1) = 0 + 0.5 = 0.5. But p(x=1, y=1) = p(x=1,y=1|f=0)p(f=0) + p(x=1,y=1|f=1)p(f=1) = 0, so x and y are not independent. Conditional Independence
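Both conditional-independence examples can be checked by brute-force enumeration; a small illustrative sketch (the AND reading of f in the first example is inferred from the numbers on slide 17).

```python
from itertools import product

# Example 1 (slide 17): f(x, y) = x AND y under the uniform distribution on {0,1}^2.
joint = {(x, y): 0.25 for x, y in product([0, 1], repeat=2)}
f = lambda x, y: x & y
pf0 = sum(p for (x, y), p in joint.items() if f(x, y) == 0)
px1_f0 = sum(p for (x, y), p in joint.items() if f(x, y) == 0 and x == 1) / pf0
pxy_f0 = sum(p for (x, y), p in joint.items() if f(x, y) == 0 and x == 1 and y == 1) / pf0
print(px1_f0, pxy_f0)   # 1/3 and 0: independent, but not conditionally independent given f

# Example 2 (slides 18-19): x, y determined by f, with p(f=0) = p(f=1) = 1/2.
# Joint over (f, x, y): f=0 -> (x=1, y=0), f=1 -> (x=0, y=1)
joint2 = {(0, 1, 0): 0.5, (1, 0, 1): 0.5}
px1 = sum(p for (g, x, y), p in joint2.items() if x == 1)
py1 = sum(p for (g, x, y), p in joint2.items() if y == 1)
pxy = sum(p for (g, x, y), p in joint2.items() if x == 1 and y == 1)
print(px1, py1, pxy)    # 0.5, 0.5, 0.0: conditionally independent given f, but not independent
```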

20 20 Example
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

21 21 How do we estimate P(observation | v) ? Estimating Probabilities

22 22 Compute P(PlayTennis = yes) and P(PlayTennis = no). Compute P(outlook = sunny/overcast/rain | PlayTennis = yes/no) (6 numbers). Compute P(Temp = hot/mild/cool | PlayTennis = yes/no) (6 numbers). Compute P(humidity = high/normal | PlayTennis = yes/no) (4 numbers). Compute P(wind = weak/strong | PlayTennis = yes/no) (4 numbers). Example

23 23 Compute P(PlayTennis = yes) and P(PlayTennis = no). Compute P(outlook = sunny/overcast/rain | PlayTennis = yes/no) (6 numbers). Compute P(Temp = hot/mild/cool | PlayTennis = yes/no) (6 numbers). Compute P(humidity = high/normal | PlayTennis = yes/no) (4 numbers). Compute P(wind = weak/strong | PlayTennis = yes/no) (4 numbers). Given a new instance: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong). Predict: PlayTennis = ? Example

24 24 Given: (Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong) P( PlayTennis= yes) =9/14=0.64 P( PlayTennis= no) =5/14=0.36 P(outlook = sunny|yes)= 2/9 P(outlook = sunny|no)= 3/5 P(temp = cool | yes) = 3/9 P(temp = cool | no) = 1/5 P(humidity = hi |yes) = 3/9 P(humidity = hi |no) = 4/5 P(wind = strong | yes) = 3/9 P(wind = strong | no)= 3/5 P(yes|…..) ~ 0.0053 P(no|…..) ~ 0.0206 Example

25 25 Given: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong). P(PlayTennis = yes) = 9/14 = 0.64; P(PlayTennis = no) = 5/14 = 0.36. P(outlook = sunny | yes) = 2/9, P(outlook = sunny | no) = 3/5; P(temp = cool | yes) = 3/9, P(temp = cool | no) = 1/5; P(humidity = hi | yes) = 3/9, P(humidity = hi | no) = 4/5; P(wind = strong | yes) = 3/9, P(wind = strong | no) = 3/5. P(yes | …) ~ 0.0053, P(no | …) ~ 0.0206. What if we were asked about Outlook = overcast? Example

26 26 Given: (Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong) P( PlayTennis= yes) =9/14=0.64 P( PlayTennis= no) =5/14=0.36 P(outlook = sunny|yes)= 2/9 P(outlook = sunny|no)= 3/5 P(temp = cool | yes) = 3/9 P(temp = cool | no) = 1/5 P(humidity = hi |yes) = 3/9 P(humidity = hi |no) = 4/5 P(wind = strong | yes) = 3/9 P(wind = strong | no)= 3/5 P(yes|…..) ~ 0.0053 P(no|…..) ~ 0.0206 P(no|instance) = 0.0206/(0.0053+0.0206)=0.795 Example
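The numbers on this slide can be reproduced directly from the counts in the table on slide 20; a short sketch:

```python
# Estimates read off the training table (slide 20)
p_yes, p_no = 9/14, 5/14
cond = {
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}

def score(label, prior):
    # prior * product of the conditional probabilities of the observed features
    s = prior
    for feat in ("sunny", "cool", "high", "strong"):
        s *= cond[label][feat]
    return s

s_yes, s_no = score("yes", p_yes), score("no", p_no)
print(round(s_yes, 4), round(s_no, 4))     # ~0.0053  ~0.0206
print(round(s_no / (s_yes + s_no), 3))     # ~0.795 -> predict PlayTennis = no
```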

27 27 Notice that the naïve Bayes method seems to be giving a method for predicting rather than an explicit classifier. In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff: Naive Bayes: Two Classes

28 28 Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier. In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff: Naive Bayes: Two Classes

29 29 In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff: Naive Bayes: Two Classes

30 30 In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff: Naïve Bayes: Two Classes

31 31 In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff: We get that the optimal Bayes behavior is given by a linear separator with: Naïve Bayes: Two Classes
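The linear separator this slide refers to (the formula itself appears only in the original figure) has, for binary features x_i ∈ {0,1}, the standard form below; this is the usual naive Bayes derivation, with p_i = P(x_i=1 | v=1) and q_i = P(x_i=1 | v=0) introduced here as notation.

```latex
\log\frac{P(v{=}1\mid x)}{P(v{=}0\mid x)}
 = \log\frac{P(v{=}1)}{P(v{=}0)}
 + \sum_i\Bigl[x_i\log\frac{p_i}{q_i} + (1-x_i)\log\frac{1-p_i}{1-q_i}\Bigr]
 = \mathbf{w}\cdot\mathbf{x} + b,
\qquad
w_i = \log\frac{p_i(1-q_i)}{q_i(1-p_i)},\quad
b = \log\frac{P(v{=}1)}{P(v{=}0)} + \sum_i\log\frac{1-p_i}{1-q_i},
```

and we predict v = 1 iff w·x + b ≥ 0.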

32 32 We have not addressed the question of why this classifier performs well, given that the assumptions are unlikely to be satisfied. The linear form of the classifiers provides some hints. Why does it work?

33 33 In the case of two classes we have that: Naïve Bayes: Two Classes

34 34 In the case of two classes we have that [equation (1) on the slide]; but since [equation (2) on the slide], we get (plugging (2) into (1), with some algebra) an expression which is simply the logistic (sigmoid) function used in the neural network representation. Naïve Bayes: Two Classes
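A hedged reconstruction of the algebra the slide alludes to: writing the two-class log-odds as the linear function w·x + b above and using P(v=0|x) = 1 - P(v=1|x) gives the logistic form.

```latex
\log\frac{P(v{=}1\mid x)}{1 - P(v{=}1\mid x)} = \mathbf{w}\cdot\mathbf{x} + b
\;\;\Longrightarrow\;\;
P(v{=}1\mid x) = \frac{1}{1 + e^{-(\mathbf{w}\cdot\mathbf{x} + b)}}.
```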

35 35 Another look at Naive Bayes Graphical model. It encodes the NB independence assumption in the edge structure (siblings are independent given parents) Linear Statistical Queries Model

36 36 Hidden Markov Model (HMM). An HMM is a probabilistic generative model: it models how an observed sequence is generated. Let's call each position in a sequence a time step. At each time step, there are two variables: the current state (hidden) and the observation.

37 37 HMM Elements: initial state probability P(s_1); transition probability P(s_t | s_{t-1}); observation probability P(o_t | s_t). As before, the graphical model is an encoding of the independence assumptions. Consider POS tagging: O – words, S – POS tags. [Figure: state chain s_1 → s_2 → … → s_6, with each s_t emitting o_t.]

38 38 HMM for Shallow Parsing. States: {B, I, O}. Observations: actual words and/or part-of-speech tags. Example: s_1=B (o_1=Mr.), s_2=I (o_2=Brown), s_3=O (o_3=blamed), s_4=B (o_4=Mr.), s_5=I (o_5=Bob), s_6=O (o_6=for).

39 39 HMM for Shallow Parsing. Given a sentence, we can ask what the most likely state sequence is. Initial state probability: P(s_1=B), P(s_1=I), P(s_1=O). Transition probability: P(s_t=B|s_{t-1}=B), P(s_t=I|s_{t-1}=B), P(s_t=O|s_{t-1}=B), P(s_t=B|s_{t-1}=I), P(s_t=I|s_{t-1}=I), P(s_t=O|s_{t-1}=I), … Observation probability: P(o_t='Mr.'|s_t=B), P(o_t='Brown'|s_t=B), …, P(o_t='Mr.'|s_t=I), P(o_t='Brown'|s_t=I), … Example: s_1=B (o_1=Mr.), s_2=I (o_2=Brown), s_3=O (o_3=blamed), s_4=B (o_4=Mr.), s_5=I (o_5=Bob), s_6=O (o_6=for).

40 40 Finding most likely state sequence in HMM (1): P(s_k, s_{k-1}, …, s_1, o_k, o_{k-1}, …, o_1)

41 41 Finding most likely state sequence in HMM (2): argmax_{s_k, s_{k-1}, …, s_1} P(s_k, s_{k-1}, …, s_1 | o_k, o_{k-1}, …, o_1)

42 42 Finding most likely state sequence in HMM (3):
max_{s_k, …, s_1} P(o_k|s_k) · [ ∏_{t=1..k-1} P(s_{t+1}|s_t) · P(o_t|s_t) ] · P(s_1)
= max_{s_k} P(o_k|s_k) · max_{s_{k-1}, …, s_1} [ ∏_{t=1..k-1} P(s_{t+1}|s_t) · P(o_t|s_t) ] · P(s_1)
= max_{s_k} P(o_k|s_k) · max_{s_{k-1}} [ P(s_k|s_{k-1}) · P(o_{k-1}|s_{k-1}) ] · max_{s_{k-2}, …, s_1} [ ∏_{t=1..k-2} P(s_{t+1}|s_t) · P(o_t|s_t) ] · P(s_1)
= max_{s_k} P(o_k|s_k) · max_{s_{k-1}} [ P(s_k|s_{k-1}) · P(o_{k-1}|s_{k-1}) ] · max_{s_{k-2}} [ P(s_{k-1}|s_{k-2}) · P(o_{k-2}|s_{k-2}) ] · … · max_{s_1} [ P(s_2|s_1) · P(o_1|s_1) ] · P(s_1)
(the bracketed inner maximization at each step is a function of the state one position ahead, e.g. a function of s_k)

43 43 Finding most likely state sequence in HMM (4) Viterbi’s Algorithm Dynamic Programming
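A compact sketch of the dynamic program; the dictionary-based parameterization and the tiny floor for unseen observations are illustrative choices, not the notation of the slides.

```python
import math

def viterbi(obs, states, init, trans, emit):
    """init[s], trans[prev][s], emit[s][o] are probabilities; returns the most likely state sequence."""
    # delta[s]: best log-score of any state sequence ending in state s at the current step
    delta = {s: math.log(init[s]) + math.log(emit[s].get(obs[0], 1e-12)) for s in states}
    backpointers = []
    for o in obs[1:]:
        prev_delta, delta, pointers = delta, {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev_delta[r] + math.log(trans[r][s]))
            delta[s] = (prev_delta[best_prev] + math.log(trans[best_prev][s])
                        + math.log(emit[s].get(o, 1e-12)))
            pointers[s] = best_prev
        backpointers.append(pointers)
    # Recover the path by following back-pointers from the best final state
    path = [max(states, key=lambda s: delta[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))
```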

44 44 Learning the Model. Estimate: initial state probability P(s_1); transition probability P(s_t | s_{t-1}); observation probability P(o_t | s_t). Unsupervised learning (states are not observed): EM algorithm. Supervised learning (states are observed; more common): ML estimate of the above terms directly from data. Notice that this is completely analogous to the case of naive Bayes, and essentially all other models.
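For the supervised case, the ML estimates are again normalized counts over the labeled sequences, just as in naive Bayes; a minimal sketch (no smoothing, names illustrative):

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sequences):
    """tagged_sequences: list of [(observation, state), ...] lists. Returns ML estimates."""
    init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in tagged_sequences:
        states = [s for _, s in seq]
        init[states[0]] += 1                      # counts for P(s_1)
        for o, s in seq:
            emit[s][o] += 1                       # counts for P(o_t | s_t)
        for s_prev, s in zip(states, states[1:]):
            trans[s_prev][s] += 1                 # counts for P(s_t | s_{t-1})

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    return (normalize(init),
            {s: normalize(c) for s, c in trans.items()},
            {s: normalize(c) for s, c in emit.items()})
```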

45 45 Another view of Markov Models. Input: observations W; states T. [Slide figure: the graphical model, its independence assumptions, and the prediction rule.] Prediction: predict the t ∈ T that maximizes the expression shown on the slide.

46 46 Another View of Markov Models. As for NB: features are pairs and singletons of t's and w's; only 3 active features. Input: observations W; states T. This can be extended to an argmax that maximizes the prediction of the whole state sequence and can be computed, as before, via Viterbi.

47 47 Learning with Probabilistic Classifiers. Learning theory: we showed that probabilistic predictions can be viewed as predictions via Linear Statistical Queries models. The low expressivity explains generalization and robustness. Is that all? It does not explain why it is possible to (approximately) fit the data with these models. Namely, is there a reason to believe that these hypotheses minimize the empirical error on the sample? In general, no. (Unless it corresponds to some probabilistic assumptions that hold.)

48 48 Learning Protocol LSQ hypotheses are computed directly, w/o assumptions on the underlying distribution: - Choose features - Compute coefficients Is there a reason to believe that an LSQ hypothesis minimizes the empirical error on the sample? In general, no. (Unless it corresponds to some probabilistic assumptions that hold).

49 49 Learning Protocol: Practice. LSQ hypotheses are computed directly: choose features; compute coefficients. If the hypothesis does not fit the training data, augment the set of features (forget your original assumption).

50 50 Example: probabilistic classifiers. Features are pairs and singletons of t's and w's; additional features are included. Input: observations W; states T. If the hypothesis does not fit the training data, augment the set of features (forget assumptions).

51 51 Why is it relatively easy to fit the data? Consider all distributions with the same marginals (e.g., a naïve Bayes classifier will predict the same regardless of which of these distributions generated the data). (Garg & Roth, ECML'01): In most cases (i.e., for most such distributions), the resulting predictor's error is close to that of the optimal classifier (the one given the correct distribution). Robustness of Probabilistic Predictors

52 52 Summary: Probabilistic Modeling. Classifiers derived from probability density estimation models were viewed as LSQ hypotheses. Probabilistic assumptions: (+) guide feature selection, but also (-) do not allow the use of more general features.

53 53 A Unified Approach. Most methods blow up the original feature space and make predictions using a linear representation over the new feature space. Note: methods do not have to actually do that, but they produce the same decision as a hypothesis that does. (Roth 98; 99, 00)

54 54 A Unified Approach. Most methods blow up the original feature space and make predictions using a linear representation over the new feature space: probabilistic methods; rule based methods (TBL; decision lists; exponentially decreasing weights); linear representations (SNoW; Perceptron; SVM; Boosting); memory based methods (subset features).

55 55 A Unified Approach. Most methods blow up the original feature space and make predictions using a linear representation over the new feature space. Q1: How are the weights determined? Q2: How is the new feature space determined? Implications? Restrictions?

56 56 What's Next? (1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically? Yes. (2) If probabilistic hypotheses are actually like other linear functions, can we actually train them similarly (that is, discriminatively)? Yes. Single classification: logistic regression / Max Entropy; HMM. (3) General sequential processes: first, multi-class classification (what's the relation?)

57 57 Conditional models: Introduction. Starting from the early 2000s, conditional and discriminative models became very popular in the NLP community. Why? They often give better accuracy than generative models; it is often easier to integrate complex features in these models; generalization results suggest these methods. Does it make sense to distinguish probabilistic conditional models from discriminative classifiers (i.e., models which output conditional probabilities vs. methods learned to minimize the expected error)? Usually not (recall Zhang's paper presented by Sarah): they often can be regarded as both (either optimizing a bound on error or estimating parameters of some conditional model).

58 58 Joint vs Conditional (Probabilistic View). Joint (or generative) models model the joint probability P(x,y) of both the input and the output. Conditional models model P(y | x): there is no need to make assumptions on the form of P(x). Joint examples: n-gram language models, Naive Bayes, HMMs, Probabilistic Context Free Grammars (PCFGs) in parsing. Conditional examples: logistic regression, MaxEnt, perceptron, SNoW, etc. Many of the following slides are based on / taken from the Klein & Manning tutorial at ACL 03.

59 59 Joint vs Conditional (Probabilistic View). Bayesian networks for conditional and generative models; recall: circles (nodes) correspond to variables; each node is dependent on its parents in the graph; a table of probabilities corresponds to each node in the graph; you can think of each node as a classifier which predicts its value given the values of its parents.

60 60 Example: Word Sense Disambiguation. We can use the same features (i.e., the edges in the graphical model are the same, but their direction may be different) and get very different results. Word Sense Disambiguation (supervised): the same features (i.e., the same class of functions) but a different estimation method (Klein & Manning 02).

61 61 Joint vs. Conditional. However, this is not always the case; it depends on how wrong the assumptions in the generative model are, and on the amount of training data (we plan to discuss the Ng & Jordan 03 paper about this). Other reasons why we may want to use generative models: sometimes it is easier to integrate non-labeled data in a generative model (semi-supervised learning); it is often easier to estimate parameters (to train); ...

62 62 Maximum Entropy Models. Known by different names: MaxEnt, CRF (local vs. global normalization), logistic regression, log-linear models. Plan: recall feature based models; max-entropy motivation for the model; ML motivation (logistic regression); learning (parameter estimation); bound minimization view.

63 63 Features: Predicates on variables

64 64 Example Tasks

65 65 Conditional vs. Joint. Joint: given data {(c, d)} we want to construct a model P(c,d) and maximize likelihood. In the simplest case we just compute counts and normalize, and you have a model. In fact it is not so simple: smoothing, etc. Nor are the counts always trivial: what if you believe that there is some unknown variable h, so that P(c,d) = Σ_h P(h, c, d)? Not trivial counts any more... Conditional: maximize the conditional likelihood (probabilistic view). Harder to do, but closer to minimization of error.

66 66 Feature-Based Classifier

67 67 Feature-Based Classifier

68 68 Maximizing Conditional Likelihood. Given the model and the data, find the parameters which maximize the conditional likelihood L(λ) = ∏_i P(c_i | d_i, λ). For the log-linear case we consider, this means finding λ that maximizes:
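For the log-linear case, the gradient of the conditional log-likelihood takes the familiar "empirical minus expected feature counts" form; a sketch for one training example (the feature-function interface is an illustrative choice, not the notation of the slides):

```python
import math
from collections import defaultdict

def loglikelihood_gradient(lam, features, classes, d, c_gold):
    """features(c, d) -> {feature_name: value}; returns this example's gradient w.r.t. lambda."""
    # Model distribution P(c | d, lambda) of the log-linear model
    scores = {c: sum(lam.get(f, 0.0) * v for f, v in features(c, d).items()) for c in classes}
    z = sum(math.exp(s) for s in scores.values())
    probs = {c: math.exp(s) / z for c, s in scores.items()}

    grad = defaultdict(float)
    # Empirical feature counts: features of the observed (gold) class
    for f, v in features(c_gold, d).items():
        grad[f] += v
    # Minus the expected feature counts under the current model
    for c in classes:
        for f, v in features(c, d).items():
            grad[f] -= probs[c] * v
    return dict(grad)
```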

69 69 MaxEnt model
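The model on this slide (shown only graphically in the original) is presumably the standard conditional log-linear / MaxEnt form over features f_i(c, d):

```latex
P(c \mid d, \lambda)
  = \frac{\exp\bigl(\sum_i \lambda_i f_i(c, d)\bigr)}
         {\sum_{c'} \exp\bigl(\sum_i \lambda_i f_i(c', d)\bigr)}.
```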

70 70 Learning

71 71 Learning

72 72 Learning

73 73 Learning

74 74 Relation to Max Ent Principle. Recall that we discussed the MaxEnt principle before: we know the marginals and now we want to fill in the table. We argued that we want as uniform a distribution as possible, but one which has the same feature expectations as we observed. So, we want the MaxEnt distribution.

75 75 Relation to Max Ent Principle. It is possible to show (see the additional notes on the lectures page) that the MaxEnt distribution has log-linear form. And we already observed that (conditional) ML training corresponds to finding parameters such that the empirical distribution of the features matches the distribution given by the model. So, they are the same! Important: a log-linear model trained to maximize likelihood is equivalent to a MaxEnt model with constraints on the expectations for the same set of features.

76 76 Relation to Max Ent Principle

77 77 Relation to Max Ent Principle

78 78 Examples

79 79 Examples

80 80 Properties of MaxEnt. Unlike naive Bayes, there is no double counting: even a frequent feature may have small weights if it co-occurs with some other feature(s). What if our counts are not reliable? Our prior belief can be that the function is not spiky (the weights are small): min_λ f(λ) = ||λ||² / 2 - log P(C | D, λ).
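The smoothed objective above corresponds to MAP estimation with a Gaussian prior on the weights; written with the prior variance sigma^2 made explicit (an assumption, since the slide leaves it out):

```latex
\min_{\lambda}\; f(\lambda)
  = \frac{\lVert \lambda \rVert^2}{2\sigma^2}
  \;-\; \sum_i \log P(c_i \mid d_i, \lambda).
```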

81 81 MaxEnt vs. Naive Bayes

82 82 MaxEnt vs. Naive Bayes

83 83 Summary. Today we reviewed the max-ent framework, considering both the ML and the MaxEnt motivations. We also discussed HMMs (you can imagine conditional models instead of HMMs in a similar way). Next time we will probably start talking about the term project. You may want to start reading about Semantic Role Labeling and Dependency Parsing.

