CS598: Machine Learning and Natural Language. Lecture 7: Probabilistic Classification. Oct. 19, 21, 2006. Dan Roth, University of Illinois, Urbana-Champaign.

1 CS598: Machine Learning and Natural Language. Lecture 7: Probabilistic Classification. Oct. 19, 21, 2006. Dan Roth, University of Illinois, Urbana-Champaign

2 Model the problem of text correction as that of generating correct sentences. Goal: learn a model of the language; use it to predict. PARADIGM: Learn a probability distribution over all sentences. Use it to estimate which sentence is more likely: Pr(I saw the girl it the park) vs. Pr(I saw the girl in the park). [In the same paradigm we sometimes learn a conditional probability distribution.] In practice: make assumptions on the distribution's type. In practice: a decision policy depends on the assumptions. 2: Generative Model
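A minimal sketch of this paradigm, assuming a toy add-one-smoothed bigram model (the model choice, corpus, and function names are illustrative, not from the lecture): learn a distribution over sentences, then prefer the more probable candidate.

```python
from collections import defaultdict
import math

def train_bigram(corpus):
    """Estimate add-one-smoothed bigram log-probabilities from a list of sentences."""
    unigram, bigram, vocab = defaultdict(int), defaultdict(int), set()
    for sent in corpus:
        toks = ["<s>"] + sent.lower().split() + ["</s>"]
        vocab.update(toks)
        for a, b in zip(toks, toks[1:]):
            unigram[a] += 1
            bigram[(a, b)] += 1
    V = len(vocab)

    def log_prob(sentence):
        toks = ["<s>"] + sentence.lower().split() + ["</s>"]
        return sum(math.log((bigram[(a, b)] + 1) / (unigram[a] + V))
                   for a, b in zip(toks, toks[1:]))
    return log_prob

# Toy corpus; in practice the distribution is estimated from a large text collection.
log_prob = train_bigram(["I saw the girl in the park", "the girl sat in the park"])
s1, s2 = "I saw the girl it the park", "I saw the girl in the park"
print(s2 if log_prob(s2) > log_prob(s1) else s1)   # the correct sentence scores higher
```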

3 Consider a distribution D over space X × Y. X is the instance space; Y is the set of labels (e.g., ±1). Given a sample {(x_i, y_i)}, i = 1..m, and a loss function L(h(x), y), find h ∈ H that minimizes Σ_{i=1..m} L(h(x_i), y_i). L can be: L(h(x), y) = 1 if h(x) ≠ y and 0 otherwise (0-1 loss); L(h(x), y) = (h(x) - y)^2 (L_2 loss); L(h(x), y) = exp{-y h(x)} (exponential loss). Find an algorithm that minimizes average loss; then, we know that things will be okay (as a function of H). Before: Error Driven Learning
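A small illustration (my own, not from the slides) of these losses and the empirical risk they define on a sample:

```python
import math

def zero_one(pred, y):      # 0-1 loss: 1 iff the prediction disagrees with the label
    return 0.0 if pred == y else 1.0

def squared(pred, y):       # L_2 loss
    return (pred - y) ** 2

def exponential(pred, y):   # exp{-y h(x)}, for labels y in {-1, +1}
    return math.exp(-y * pred)

def empirical_risk(h, sample, loss):
    """Average loss of hypothesis h over a sample of (x, y) pairs."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

# Example: a 1-d threshold hypothesis with labels in {-1, +1}.
sample = [(0.2, -1), (0.8, 1), (0.5, 1), (0.1, -1)]
h = lambda x: 1 if x > 0.4 else -1
print(empirical_risk(h, sample, zero_one))      # 0.0: h is consistent with this sample
```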

4 Goal: find the best hypothesis from some space H of hypotheses, given the observed data D. Define best to be: most probable hypothesis in H In order to do that, we need to assume a probability distribution over the class H. In addition, we need to know something about the relation between the data observed and the hypotheses (E.g., a coin problem.) As we will see, we will be Bayesian about other things, e.g., the parameters of the model Basics of Bayesian Learning

5 P(h) - the prior probability of a hypothesis h Reflects background knowledge; before data is observed. If no information - uniform distribution. P(D) - The probability that this sample of the Data is observed. (No knowledge of the hypothesis) P(D|h): The probability of observing the sample D, given that the hypothesis h holds P(h|D): The posterior probability of h. The probability h holds, given that D has been observed. Basics of Bayesian Learning

6 Bayes Theorem: P(h|D) = P(D|h) P(h) / P(D). P(h|D) increases with P(h) and with P(D|h); P(h|D) decreases with P(D).

7 The learner considers a set of candidate hypotheses H (models), and attempts to find the most probable one h ∈ H, given the observed data. Such a maximally probable hypothesis is called the maximum a posteriori (MAP) hypothesis; Bayes theorem is used to compute it: h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h) / P(D) = argmax_{h ∈ H} P(D|h) P(h). Learning Scenario

8 We may assume that, a priori, hypotheses are equally probable: P(h_i) = P(h_j) for all h_i, h_j ∈ H. We then get the Maximum Likelihood hypothesis: h_ML = argmax_{h ∈ H} P(D|h). Here we just look for the hypothesis that best explains the data. Learning Scenario (2)
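The coin problem mentioned on slide 4 makes the MAP/ML distinction concrete. The sketch below uses an assumed discrete hypothesis space and an assumed prior (both are illustrative choices, not from the lecture):

```python
# MAP vs. ML for a coin: hypotheses are candidate biases P(heads).
hypotheses = [0.3, 0.5, 0.7]
prior = {0.3: 0.1, 0.5: 0.8, 0.7: 0.1}    # assumed prior: strong belief in a fair coin
data = ["H", "H", "H", "T", "H"]          # observed flips

def likelihood(h, flips):
    heads = flips.count("H")
    return h ** heads * (1 - h) ** (len(flips) - heads)

h_ml = max(hypotheses, key=lambda h: likelihood(h, data))
h_map = max(hypotheses, key=lambda h: likelihood(h, data) * prior[h])
print(h_ml, h_map)    # ML picks 0.7; the prior pulls MAP back to 0.5
```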

9 How should we use the general formalism? What should H be? H can be a collection of functions: given the training data, choose an optimal function; then, given new data, evaluate the selected function on it. H can be a collection of possible predictions: given the data, try to directly choose the optimal prediction. H can be a collection of (conditional) probability distributions. These choices could be different! Specific examples we will discuss: Naive Bayes, a maximum-likelihood-based algorithm; Max Entropy, seemingly a different selection criterion; Hidden Markov Models. Bayes Optimal Classifier

10 f: X → V, where V is a finite set of values. Instances x ∈ X can be described as a collection of features. Given an example, assign it the most probable value in V. Bayesian Classifier

11 f: X → V, where V is a finite set of values. Instances x ∈ X can be described as a collection of features. Given an example, assign it the most probable value in V. Bayes Rule (see the sketch below). Notational convention: P(y) means P(Y = y). Bayesian Classifier
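The formula that followed "Bayes Rule:" on the original slide is not in the transcript; what it presumably showed is the standard MAP prediction rule (the evidence term P(x_1, …, x_n) can be dropped since it does not depend on v):

```latex
v_{MAP} = \arg\max_{v \in V} P(v \mid x_1, \ldots, x_n)
        = \arg\max_{v \in V} \frac{P(x_1, \ldots, x_n \mid v)\, P(v)}{P(x_1, \ldots, x_n)}
        = \arg\max_{v \in V} P(x_1, \ldots, x_n \mid v)\, P(v)
```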

12 Given training data we can estimate the two terms. Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data. However, it is not feasible to estimate P(x_1, x_2, …, x_n | v_j). Bayesian Classifier

13 Given training data we can estimate the two terms. Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data. However, it is not feasible to estimate P(x_1, x_2, …, x_n | v_j): in this case we would have to estimate, for each target value, the probability of each instance (most of which will not occur). Bayesian Classifier

14 Given training data we can estimate the two terms. Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data. However, it is not feasible to estimate P(x_1, x_2, …, x_n | v_j): in this case we would have to estimate, for each target value, the probability of each instance (most of which will not occur). In order to use a Bayesian classifier in practice, we need to make assumptions that will allow us to estimate these quantities. Bayesian Classifier

15 Assumption: feature values are independent given the target value, i.e., P(x_1, x_2, …, x_n | v_j) = ∏_k P(x_k | v_j). Naive Bayes

16 Assumption: feature values are independent given the target value. Generative model: first choose a value v_j ∈ V according to P(v_j); then, given v_j, choose x_1, x_2, …, x_n, each x_k according to P(x_k | v_j). Naive Bayes

17 Assumption: feature values are independent given the target value Learning method: Estimate n|V| parameters and use them to compute the new value. (how to estimate?) Naive Bayes

18 Assumption: feature values are independent given the target value Learning method: Estimate n|V| parameters and use them to compute the new value. This is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions) This is learning without trying to achieve consistency or even approximate consistency. Naive Bayes

19 Assumption: feature values are independent given the target value Learning method: Estimate n|V| parameters and use them to compute the new value. This is learning without search. Given a collection of training examples, you just compute the best hypothesis (given the assumptions) This is learning without trying to achieve consistency or even approximate consistency. Why does it work? Naive Bayes
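A minimal sketch of the "learning without search" point: the parameters are just counts, and prediction is an argmax over them. The function names and dictionary-based data format are my own, and smoothing is omitted to stay close to the slide.

```python
from collections import defaultdict

def train_naive_bayes(examples):
    """examples: list of (features, label) pairs, where features is a dict {name: value}."""
    label_counts = defaultdict(int)
    feat_counts = defaultdict(int)     # (label, feature_name, feature_value) -> count
    for feats, label in examples:
        label_counts[label] += 1
        for name, value in feats.items():
            feat_counts[(label, name, value)] += 1
    total = len(examples)

    def predict(feats):
        def score(label):              # P(v) * prod_k P(x_k | v), from raw counts
            p = label_counts[label] / total
            for name, value in feats.items():
                p *= feat_counts[(label, name, value)] / label_counts[label]
            return p
        return max(label_counts, key=score)

    return predict
```

There is no search over hypotheses here: one pass over the data fixes every parameter, and nothing in the procedure tries to make the resulting hypothesis consistent with the training examples.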

20 Notice that the feature values are conditionally independent given the target value, and are not required to be independent. Example: f(x,y) = x ∧ y over the product distribution defined by p(x=0) = p(x=1) = 1/2 and p(y=0) = p(y=1) = 1/2. The distribution is defined so that x and y are independent: p(x,y) = p(x)p(y) (interpretation: for every value of x and y). But, given that f(x,y) = 0: p(x=1|f=0) = p(y=1|f=0) = 1/3, while p(x=1, y=1 | f=0) = 0, so x and y are not conditionally independent. Conditional Independence

21 The other direction also does not hold. x and y can be conditionally independent but not independent. f=0: p(x=1|f=0) =1, p(y=1|f=0) = 0 f=1: p(x=1|f=1) =0, p(y=1|f=1) = 1 and assume, say, that p(f=0) = p(f=1)=1/2 Given the value of f, x and y are independent. What about unconditional independence ? Conditional Independence

22 The other direction also does not hold: x and y can be conditionally independent but not independent. Example: f=0: p(x=1|f=0) = 1, p(y=1|f=0) = 0; f=1: p(x=1|f=1) = 0, p(y=1|f=1) = 1; and assume, say, that p(f=0) = p(f=1) = 1/2. Given the value of f, x and y are independent. What about unconditional independence? p(x=1) = p(x=1|f=0)p(f=0) + p(x=1|f=1)p(f=1) = 0.5 + 0 = 0.5; p(y=1) = p(y=1|f=0)p(f=0) + p(y=1|f=1)p(f=1) = 0 + 0.5 = 0.5. But p(x=1, y=1) = p(x=1,y=1|f=0)p(f=0) + p(x=1,y=1|f=1)p(f=1) = 0, so x and y are not independent. Conditional Independence
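A quick numerical check of the first example (slide 20, f(x, y) = x ∧ y over the uniform product distribution); the code and names are mine:

```python
from itertools import product

# Uniform product distribution over (x, y); f is the conjunction x AND y.
points = [(x, y, 0.25) for x, y in product([0, 1], repeat=2)]
f = lambda x, y: x & y

def p(event):
    """Probability of an event given as a predicate over (x, y)."""
    return sum(w for x, y, w in points if event(x, y))

# x and y are independent under this distribution ...
assert p(lambda x, y: x == 1 and y == 1) == p(lambda x, y: x == 1) * p(lambda x, y: y == 1)

# ... but not conditionally independent given f = 0:
pf0 = p(lambda x, y: f(x, y) == 0)                                  # 3/4
px1_f0 = p(lambda x, y: x == 1 and f(x, y) == 0) / pf0              # 1/3
py1_f0 = p(lambda x, y: y == 1 and f(x, y) == 0) / pf0              # 1/3
pxy_f0 = p(lambda x, y: x == 1 and y == 1 and f(x, y) == 0) / pf0   # 0
print(px1_f0, py1_f0, pxy_f0)    # 1/3, 1/3, 0 != 1/3 * 1/3
```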

23 Example
Day Outlook Temperature Humidity Wind PlayTennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

24 How do we estimate P(observation | v) ? Estimating Probabilities

25 Compute P(PlayTennis = yes) and P(PlayTennis = no) (2 numbers). Compute P(outlook = s/oc/r | PlayTennis = yes/no) (6 numbers). Compute P(Temp = h/mild/cool | PlayTennis = yes/no) (6 numbers). Compute P(humidity = hi/nor | PlayTennis = yes/no) (4 numbers). Compute P(wind = w/st | PlayTennis = yes/no) (4 numbers). Example

26 Compute P(PlayTennis = yes) and P(PlayTennis = no) (2 numbers). Compute P(outlook = s/oc/r | PlayTennis = yes/no) (6 numbers). Compute P(Temp = h/mild/cool | PlayTennis = yes/no) (6 numbers). Compute P(humidity = hi/nor | PlayTennis = yes/no) (4 numbers). Compute P(wind = w/st | PlayTennis = yes/no) (4 numbers). Given a new instance: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong). Predict: PlayTennis = ? Example

27 Given: (Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong) P( PlayTennis= yes) =9/14=0.64 P( PlayTennis= no) =5/14=0.36 P(outlook = sunny|yes)= 2/9 P(outlook = sunny|no)= 3/5 P(temp = cool | yes) = 3/9 P(temp = cool | no) = 1/5 P(humidity = hi |yes) = 3/9 P(humidity = hi |no) = 4/5 P(wind = strong | yes) = 3/9 P(wind = strong | no)= 3/5 P(yes|…..) ~ P(no|…..) ~ Example

28 Given: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong). P(PlayTennis = yes) = 9/14 = 0.64; P(PlayTennis = no) = 5/14 = 0.36. P(outlook = sunny | yes) = 2/9, P(outlook = sunny | no) = 3/5; P(temp = cool | yes) = 3/9, P(temp = cool | no) = 1/5; P(humidity = hi | yes) = 3/9, P(humidity = hi | no) = 4/5; P(wind = strong | yes) = 3/9, P(wind = strong | no) = 3/5. P(yes|…..) ~ P(no|…..) ~ What if we were asked about Outlook = OC? Example

29 Given: (Outlook = sunny; Temperature = cool; Humidity = high; Wind = strong). P(PlayTennis = yes) = 9/14 = 0.64; P(PlayTennis = no) = 5/14 = 0.36. P(outlook = sunny | yes) = 2/9, P(outlook = sunny | no) = 3/5; P(temp = cool | yes) = 3/9, P(temp = cool | no) = 1/5; P(humidity = hi | yes) = 3/9, P(humidity = hi | no) = 4/5; P(wind = strong | yes) = 3/9, P(wind = strong | no) = 3/5. P(yes|…) ~ 0.64 · (2/9)(3/9)(3/9)(3/9) ≈ 0.0053; P(no|…) ~ 0.36 · (3/5)(1/5)(4/5)(3/5) ≈ 0.0206. P(no|instance) = 0.0206 / (0.0053 + 0.0206) = 0.795, so predict PlayTennis = no. Example
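Checking the arithmetic above in code (a standalone computation using the estimates read off the table on slide 23):

```python
# Unnormalized Naive Bayes scores for (sunny, cool, high, strong).
p_yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)    # ~ 0.0053
p_no  = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)    # ~ 0.0206
print(round(p_yes, 4), round(p_no, 4))
print(round(p_no / (p_yes + p_no), 3))            # ~ 0.795 -> predict PlayTennis = no
```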

30 Notice that the naive Bayes method gives a method for predicting rather than an explicit classifier. In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff: Naive Bayes: Two Classes

31 Notice that the naive Bayes method gives a method for predicting rather than an explicit classifier. In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff: Naive Bayes: Two Classes

32 In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff: Naive Bayes: Two Classes

33 In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff: Naïve Bayes: Two Classes

34 In the case of two classes, v ∈ {0,1}, we predict that v = 1 iff the (log of the) ratio of the two class scores is positive. We get that the optimal Bayes behavior is given by a linear separator, with weights as sketched below. Naïve Bayes: Two Classes
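The formulas on slides 30-34 are images that did not survive the transcript; the following is a sketch of the standard derivation they build up to, for binary features x_i ∈ {0,1}, writing p_i = P(x_i = 1 | v = 1) and q_i = P(x_i = 1 | v = 0) (this notation is mine):

```latex
\text{Predict } v = 1 \;\iff\;
P(v{=}1)\prod_i P(x_i \mid v{=}1) > P(v{=}0)\prod_i P(x_i \mid v{=}0)
\;\iff\; \sum_i w_i x_i + b > 0,
\quad\text{with } w_i = \log\frac{p_i\,(1-q_i)}{q_i\,(1-p_i)},\;\;
b = \log\frac{P(v{=}1)}{P(v{=}0)} + \sum_i \log\frac{1-p_i}{1-q_i}.
```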

35 We have not addressed the question of why this classifier performs well, given that the assumptions are unlikely to be satisfied. The linear form of the classifier provides some hints. (More on that later; see also Roth '99 and Garg & Roth, ECML'02.) One of the presented papers will also partly address this. Why does it work?

36 In the case of two classes we have that: Naïve Bayes: Two Classes

37 In the case of two classes we have relation (1) from the linear form above; but since the two posteriors sum to one (2), we get (plug (2) into (1); some algebra) that the posterior is simply the logistic (sigmoid) function used in the neural network representation, as sketched below. Naïve Bayes: Two Classes
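The missing formulas presumably carry out the standard step: with w and b as on slide 34,

```latex
\log\frac{P(v{=}1 \mid x)}{P(v{=}0 \mid x)} = \mathbf{w}\cdot\mathbf{x} + b
\quad\text{and}\quad
P(v{=}0 \mid x) = 1 - P(v{=}1 \mid x)
\;\;\Longrightarrow\;\;
P(v{=}1 \mid x) = \frac{1}{1 + e^{-(\mathbf{w}\cdot\mathbf{x} + b)}} .
```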

38 Another look at Naive Bayes. Note this is a bit different from the previous linearization: rather than a single function, here we have an argmax over several different functions. Graphical model: it encodes the NB independence assumption in the edge structure (siblings are independent given their parent).

39 Hidden Markov Model (HMM). An HMM is a probabilistic generative model: it models how an observed sequence is generated. Let's call each position in a sequence a time step. At each time step, there are two variables: the current state (hidden) and the observation.

40 HMM Elements. Initial state probability P(s_1). Transition probability P(s_t | s_{t-1}). Observation probability P(o_t | s_t). As before, the graphical model is an encoding of the independence assumptions. Note that we have seen this in the context of POS tagging. [Figure: chain-structured graphical model over states s_1, …, s_6 with observations o_1, …, o_6]
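These three kinds of parameters, together with the chain structure, give the usual factorization of the joint probability of a state sequence and an observation sequence (the formula itself is not in the transcript, but this is the standard form):

```latex
P(s_1,\ldots,s_T,\; o_1,\ldots,o_T) \;=\;
P(s_1)\,P(o_1 \mid s_1)\,\prod_{t=2}^{T} P(s_t \mid s_{t-1})\,P(o_t \mid s_t).
```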

41 HMM for Shallow Parsing. States: {B, I, O}. Observations: actual words and/or part-of-speech tags. Example sequence: s_1 = B (o_1 = Mr.), s_2 = I (o_2 = Brown), s_3 = O (o_3 = blamed), s_4 = B (o_4 = Mr.), s_5 = I (o_5 = Bob), s_6 = O (o_6 = for).

42 HMM for Shallow Parsing. Given a sentence, we can ask what the most likely state sequence is. Initial state probability: P(s_1 = B), P(s_1 = I), P(s_1 = O). Transition probability: P(s_t = B | s_{t-1} = B), P(s_t = I | s_{t-1} = B), P(s_t = O | s_{t-1} = B), P(s_t = B | s_{t-1} = I), P(s_t = I | s_{t-1} = I), P(s_t = O | s_{t-1} = I), … Observation probability: P(o_t = 'Mr.' | s_t = B), P(o_t = 'Brown' | s_t = B), …, P(o_t = 'Mr.' | s_t = I), P(o_t = 'Brown' | s_t = I), … [Figure: the same B/I/O example sequence as on slide 41]

43 Finding most likely state sequence in HMM (1)

44 Finding most likely state sequence in HMM (2)

45 Finding most likely state sequence in HMM (3). A function of s_k.

46 Finding most likely state sequence in HMM (4) Viterbi’s Algorithm Dynamic Programming
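Slides 43-46 derive the recursion; the formulas are not in the transcript, so the following is a generic dynamic-programming sketch of Viterbi (log-space, dictionary-based parameters; the tiny B/I/O numbers are made up for illustration, not taken from the lecture):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for an observation sequence (log-space DP)."""
    # Each entry maps a state to (best log-score, best path ending in that state).
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-12)), [s])
          for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            score, path = max(
                (V[-1][prev][0] + math.log(trans_p[prev][s])
                 + math.log(emit_p[s].get(o, 1e-12)), V[-1][prev][1])
                for prev in states)
            layer[s] = (score, path + [s])
        V.append(layer)
    return max(V[-1].values())[1]

# Made-up parameters for a {B, I, O} shallow-parsing HMM.
states = ["B", "I", "O"]
start = {"B": 0.5, "I": 0.1, "O": 0.4}
trans = {"B": {"B": 0.1, "I": 0.7, "O": 0.2},
         "I": {"B": 0.1, "I": 0.3, "O": 0.6},
         "O": {"B": 0.5, "I": 0.1, "O": 0.4}}
emit = {"B": {"Mr.": 0.6, "Bob": 0.2},
        "I": {"Brown": 0.5, "Bob": 0.4},
        "O": {"blamed": 0.5, "for": 0.5}}
print(viterbi(["Mr.", "Brown", "blamed"], states, start, trans, emit))  # ['B', 'I', 'O']
```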

47 Learning the Model. Estimate: initial state probability P(s_1), transition probability P(s_t | s_{t-1}), observation probability P(o_t | s_t). Unsupervised learning (states are not observed): EM algorithm. Supervised learning (states are observed; more common): ML estimates of the above terms directly from the data. Notice that this is completely analogous to the case of naive Bayes, and essentially all other models.
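For the supervised case, the ML estimates are just normalized counts, in direct analogy to naive Bayes. A sketch with an assumed data format (a list of (state, observation) sequences; names are mine, and the transition denominator uses total state counts, a common simplification):

```python
from collections import Counter

def mle_hmm(tagged_sequences):
    """tagged_sequences: list of [(state, observation), ...]; returns count-ratio estimates."""
    init, trans, emit, state_tot = Counter(), Counter(), Counter(), Counter()
    for seq in tagged_sequences:
        init[seq[0][0]] += 1
        for i, (s, o) in enumerate(seq):
            state_tot[s] += 1
            emit[(s, o)] += 1
            if i > 0:
                trans[(seq[i - 1][0], s)] += 1
    start_p = {s: c / len(tagged_sequences) for s, c in init.items()}
    trans_p = {(a, b): c / state_tot[a] for (a, b), c in trans.items()}
    emit_p = {(s, o): c / state_tot[s] for (s, o), c in emit.items()}
    return start_p, trans_p, emit_p

data = [[("B", "Mr."), ("I", "Brown"), ("O", "blamed"),
         ("B", "Mr."), ("I", "Bob"), ("O", "for")]]
print(mle_hmm(data)[1])    # e.g. P(I|B) = 1.0, P(O|I) = 1.0, P(B|O) = 0.5 on this toy data
```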

48 Another view of Markov Models. Assumptions; Prediction: predict the t ∈ T that maximizes the probability under the model. [Figure: input sequence with states T and observations W]

49 Another View of Markov Models. As for NB: features are pairs and singletons of t's and w's; only 3 features are active per decision. [Figure: input sequence with states T and observations W.] This can be extended to an argmax that maximizes the prediction of the whole state sequence, and can be computed, as before, via Viterbi.

50 Learning with Probabilistic Classifiers. Learning Theory: we showed that probabilistic predictions can be viewed as predictions via Linear Statistical Queries Models (Roth'99). The low expressivity explains generalization and robustness. Is that all? It does not explain why it is possible to (approximately) fit the data with these models. Namely, is there a reason to believe that these hypotheses minimize the empirical error on the sample? In general, no (unless it corresponds to some probabilistic assumptions that hold).

51 Learning Protocol LSQ hypotheses are computed directly, w/o assumptions on the underlying distribution: - Choose features - Compute coefficients Is there a reason to believe that an LSQ hypothesis minimizes the empirical error on the sample? In general, no. (Unless it corresponds to some probabilistic assumptions that hold).

52 Learning Protocol: Practice. LSQ hypotheses are computed directly: choose features, compute coefficients. If the hypothesis does not fit the training data, augment the set of features (forget your original assumption).

53 Example: probabilistic classifiers. Features are pairs and singletons of t's and w's; additional features are included. [Figure: input sequence with states T and observations W.] If the hypothesis does not fit the training data, augment the set of features (forget assumptions).

54 Why is it relatively easy to fit the data? Consider all distributions with the same marginals. (E.g., a naïve Bayes classifier will predict the same regardless of which of these distributions generated the data.) (Garg & Roth, ECML'01): in most cases (i.e., for most such distributions), the resulting predictor's error is close to that of the optimal classifier (the one given the correct distribution). Robustness of Probabilistic Predictors

55 Summary: Probabilistic Modeling. Classifiers derived from probability density estimation models were viewed as LSQ hypotheses. Probabilistic assumptions: (+) guide feature selection, but also (-) do not allow the use of more general features.

56 A Unified Approach. Most methods blow up the original feature space and make predictions using a linear representation over the new feature space. Note: methods do not have to actually do that, but they produce the same decision as a hypothesis that does. (Roth '98, '99, '00)

57 A Unified Approach. Most methods blow up the original feature space and make predictions using a linear representation over the new feature space: probabilistic methods; rule-based methods (TBL; decision lists; exponentially decreasing weights); linear representations (SNoW; Perceptron; SVM; Boosting); memory-based methods (subset features).

58 A Unified Approach. Most methods blow up the original feature space and make predictions using a linear representation over the new feature space. Q1: How are weights determined? Q2: How is the new feature space determined? Implications? Restrictions?