Machine Learning in Practice Lecture 6 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute

Plan for the Day
- Announcements: answer keys for past quizzes and assignments posted; quiz feedback; Assignment 3 handed out
- Finish Naïve Bayes
- Start Linear Models

Quiz Notes
- Most people did well! The most frequent issue was the last question, where we compared likelihoods and probabilities.
- The main difference is scaling: the probabilities of all possible events must sum to 1, and that is what gives statistical models their nice formal properties.
- Note the difference between technical and everyday usages of terms like "likelihood" and "concept".
- Likelihood that play = yes given Outlook = rainy: Count(yes & rainy)/Count(yes) * Count(yes)/Count(yes or no)
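
To make the distinction concrete, here is a minimal Python sketch of the computation above, using counts in the style of the classic weather dataset (the numbers are assumed for illustration):

    # Assumed counts: 9 days play = yes, 5 days play = no;
    # of those, 3 yes days and 2 no days have Outlook = rainy.
    n_yes, n_no = 9, 5
    n_rainy_yes, n_rainy_no = 3, 2
    total = n_yes + n_no

    # Likelihoods: Count(class & rainy)/Count(class) * Count(class)/Count(all)
    like_yes = (n_rainy_yes / n_yes) * (n_yes / total)
    like_no = (n_rainy_no / n_no) * (n_no / total)

    # Normalizing rescales the likelihoods into probabilities that sum to 1.
    p_yes = like_yes / (like_yes + like_no)
    print(round(like_yes, 3), round(p_yes, 3))  # 0.214 0.6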

Finishing Naïve Bayes

Bayes Theorem How would you compute the likelihood that a person was a bagpipe major given that they had red hair?

Bayes Theorem How would you compute the likelihood that a person was a bagpipe major given that they had red hair? Could you compute the likelihood that a person has red hair given that they were a bagpipe major?
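
Both questions turn on Bayes' theorem, which relates the two conditional probabilities:

    P(A|B) = P(B|A) * P(A) / P(B)

Here A is "is a bagpipe major" and B is "has red hair": P(red hair | bagpipe major) can be estimated from a small sample of bagpipe majors, and Bayes' theorem then gives you the harder direction.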

Another Example

Dataset:
    ice-cream {chocolate, vanilla, coffee, rocky-road, strawberry}
    cake {chocolate, vanilla}
    is-yummy {yum, good, ok}

    chocolate, chocolate, yum
    vanilla, chocolate, good
    coffee, chocolate, yum
    coffee, vanilla, ok
    rocky-road, chocolate, yum
    strawberry, vanilla, yum

Compute conditional probabilities for each attribute value/class pair:
- P(B|A) = Count(B & A)/Count(A)
- P(coffee ice-cream | yum) = .25
- P(vanilla ice-cream | yum) = 0
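
A minimal Python sketch of these conditional-probability counts over the toy dataset above:

    from collections import Counter

    # (ice-cream, cake, is-yummy) rows from the dataset above
    data = [("chocolate", "chocolate", "yum"),
            ("vanilla", "chocolate", "good"),
            ("coffee", "chocolate", "yum"),
            ("coffee", "vanilla", "ok"),
            ("rocky-road", "chocolate", "yum"),
            ("strawberry", "vanilla", "yum")]

    class_counts = Counter(row[2] for row in data)

    def p_given(attr, value, cls):
        """P(attribute = value | class = cls) = Count(value & cls) / Count(cls)."""
        joint = sum(1 for row in data if row[attr] == value and row[2] == cls)
        return joint / class_counts[cls]

    print(p_given(0, "coffee", "yum"))   # 0.25
    print(p_given(0, "vanilla", "yum"))  # 0.0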

Another Example

(Same dataset as above.) What class would you assign to strawberry ice cream with chocolate cake?
- Compute likelihoods and then normalize.
- Note: this model cannot take into account that the class might depend on how well the cake and ice cream "go together".

Likelihood of yum: P(strawberry|yum) = .25 and P(chocolate cake|yum) = .75, so .25 * .75 * .66 = .124
Likelihood of good: P(strawberry|good) = 0 and P(chocolate cake|good) = 1, so 0 * 1 * .17 = 0
Likelihood of ok: P(strawberry|ok) = 0 and P(chocolate cake|ok) = 0, so 0 * 0 * .17 = 0
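
Normalizing these likelihoods just divides each by their sum: P(yum) = .124 / (.124 + 0 + 0) = 1, P(good) = 0, P(ok) = 0. The model is completely certain the answer is yum, and that suspicious certainty comes from the zero counts that smoothing (below) is designed to fix.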

Another Example

(Same dataset as above.) What about vanilla ice cream and vanilla cake? Intuitively, there is more evidence that the selected category should be good.

Likelihood of yum: P(vanilla|yum) = 0 and P(vanilla cake|yum) = .25, so 0 * .25 * .66 = 0
Likelihood of good: P(vanilla|good) = 1 and P(vanilla cake|good) = 0, so 1 * 0 * .17 = 0
Likelihood of ok: P(vanilla|ok) = 0 and P(vanilla cake|ok) = 1, so 0 * 1 * .17 = 0

Statistical Modeling with Small Datasets
- When you train your model, how many probabilities are you trying to estimate?
- This statistical modeling approach has problems with small datasets, where not every class is observed in combination with every attribute value.
- What potential problem occurs when you never observe coffee ice-cream with class ok?
- When is this not a problem?
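
As a worked count for the toy dataset above (my arithmetic, assuming one conditional probability per attribute value/class pair): ice-cream contributes 5 values x 3 classes = 15 probabilities, cake contributes 2 x 3 = 6, and the class priors add 3 more, so even this tiny model estimates 24 probabilities from only 6 instances.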

Smoothing
- One way to compensate for 0 counts is to add 1 to every count.
- Then you never have 0 probabilities.
- But what problem might you still have on small datasets?
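
A minimal sketch of this add-one (Laplace) smoothing, extending the p_given helper from the earlier sketch (the helper and names are mine, not from the slides):

    def p_given_smoothed(attr, value, cls, num_values):
        """Add-one smoothing: (Count(value & cls) + 1) / (Count(cls) + number of attribute values)."""
        joint = sum(1 for row in data if row[attr] == value and row[2] == cls)
        return (joint + 1) / (class_counts[cls] + num_values)

    # ice-cream has 5 possible values; vanilla never occurs with yum,
    # yet its smoothed probability is (0 + 1) / (4 + 5) = 0.11, not 0.
    print(round(p_given_smoothed(0, "vanilla", "yum", 5), 2))  # 0.11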

Naïve Bayes with Smoothing

(Same dataset as above, with 1 added to every count.) Vanilla ice cream and vanilla cake again:

Likelihood of yum: P(vanilla|yum) = .11 and P(vanilla cake|yum) = .33, so .11 * .33 * .66 = .03
Likelihood of good: P(vanilla|good) = .33 and P(vanilla cake|good) = .33, so .33 * .33 * .17 = .02
Likelihood of ok: P(vanilla|ok) = .17 and P(vanilla cake|ok) = .66, so .17 * .66 * .17 = .02

Now yum wins, even though intuitively good has more evidence: with this little data, the prior probability dominates.

Scenario

[Diagram: a math story problem linked to a bank of skills, Math Skill 1 through Math Skill 14.]

- Each problem may be associated with more than one skill.
- Each skill may be associated with more than one problem.

How to address the problem?
- In reality there is a many-to-many mapping between math problems and skills.
- Ideally, we should be able to assign any subset of the full set of skills to any problem. But can we do that accurately?
- If we can't, it may be good enough to assign the single most important skill.
- In that case, we will not accomplish the whole task.

How to address the problem? But if we can do that part of the task more accurately, then we might accomplish more overall than if we try to achieve the more ambitious goal

Remember this discussion from Lecture 2? Low resolution gives more information if the accuracy is higher.

Which of these approaches is better? You have a corpus of math problem texts and you are trying to learn models that assign skill labels.
- Approach one: 91 binary prediction models, each of which makes an independent decision about each math text.
- Approach two: one multi-class classifier that assigns one out of the same 91 skill labels.

Approach 1

[Diagram: the math story problem feeds 14 separate skill predictors.]
- Each skill corresponds to a separate binary predictor.
- Each of the 91 binary predictors is applied to each text.
- 91 separate predictions are made for each text.

Approach 2

[Diagram: the math story problem feeds one predictor that chooses among the skills.]
- Each skill corresponds to a separate class value.
- A single multi-class predictor is applied to each text.
- Only 1 prediction is made for each text.
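
To make the contrast concrete, here is a small scikit-learn sketch (my illustration, not the course's Weka/TagHelper setup; the texts and labels are placeholders):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.naive_bayes import MultinomialNB

    texts = ["Ann has 3 apples and buys 2 more...",        # placeholder problem texts
             "A train leaves the station at 40 mph..."]
    labels = ["skill5", "skill12"]                          # placeholder skill labels
    X = CountVectorizer().fit_transform(texts)

    # Approach 1: one binary yes/no predictor per skill (one-vs-rest).
    approach1 = OneVsRestClassifier(MultinomialNB()).fit(X, labels)

    # Approach 2: a single multi-class predictor choosing among all skills at once.
    approach2 = MultinomialNB().fit(X, labels)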

Which of these approaches is better?
- Approach one (91 binary prediction models): more power, but more opportunity for error.
- Approach two (one multi-class classifier): less power, but fewer opportunities for error.

Approach 1: One versus all
- Assume you have 80 example texts, and 4 of them have skill5 associated with them.
- Assume you are using some form of smoothing: 0 counts become 1.
- Let's say WordX occurs with skill5 75% of the time (3 of the 4 skill5 texts) and only 5% of the time (4 of 76 texts) with the majority class, making it the best predictor for skill5.
- After smoothing, P(WordX|Skill5) = (3+1)/(4+2) = 2/3 and P(WordX|majority) = (4+1)/(76+2) = 5/78.

Counts Without Smoothing

80 math problem texts; 7 instances of WordX, 3 of them with skill5 (75% of skill5 texts contain WordX, making it the best predictor for skill5).

              Skill5   Majority Class
    WordX        3           4
    WordY        1          13

Counts With Smoothing

The same 80 math problem texts after adding 1 to every count:

              Skill5   Majority Class
    WordX        4           5
    WordY        2          14

Approach 1
- Let's say WordY occurs 17% of the time with the majority class and 25% of the time with skill5, so it's a moderately good predictor.
- In reality, 13 counts of WordY with majority and 1 with skill5; with smoothing, we get 14 counts with majority and 2 with skill5.
- P(WordY|Skill5) = 2/6 = 1/3 and P(WordY|Majority) = 14/78.
- Because you multiply the conditional probabilities and the prior probabilities together, it's nearly impossible to predict the minority class when the data is this skewed.
- For "WordX WordY" you would get .66 * .33 * .05 = .01 for skill5 and .06 * .18 * .95 = .01 for majority.
- What would you predict without smoothing?
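
Checking that arithmetic with the exact fractions (values as derived above):

    p_skill5 = (2/3) * (1/3) * (4/80)        # P(WordX|skill5) * P(WordY|skill5) * P(skill5)
    p_majority = (5/78) * (14/78) * (76/80)  # the same three factors for the majority class
    print(round(p_skill5, 3), round(p_majority, 3))  # 0.011 0.011

A dead heat: even with its two best predictors present, the minority class barely ties the majority class.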

Counts Without Smoothing

80 math problem texts, 4 of them skill5. WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (a moderately good predictor).

              Skill5   Majority Class
    WordY        1          13

Counts With Smoothing

The same counts after adding 1:

              Skill5   Majority Class
    WordY        2          14

Linear Models

Remember this: What do concepts look like?

Review: Concepts as Lines

[Series of diagrams: data points from two classes plotted in a two-dimensional feature space, with a candidate line separating them; each successive slide adjusts the line.]

What will be the prediction for this new data point?

What are we learning?
- We're learning to draw a line through a multidimensional space, really a "hyperplane".
- Each function we learn is like a single split in a decision tree, but it can take many features into account at one time rather than just one.
- F(x) = C0 + C1*X1 + C2*X2 + C3*X3, where X1 through Xn are our attributes and C0 through Cn are coefficients.
- We're learning the coefficients, which are weights.
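
A minimal sketch of such a learned function at prediction time (the coefficients are invented for illustration):

    # Hypothetical learned weights: C0 (intercept) plus one coefficient per attribute.
    coefficients = [0.5, 2.0, -1.0, 0.25]   # C0, C1, C2, C3

    def f(x):
        """Evaluate F(x) = C0 + C1*x1 + C2*x2 + C3*x3 for an attribute vector x."""
        return coefficients[0] + sum(c * xi for c, xi in zip(coefficients[1:], x))

    # Classify by which side of the hyperplane F(x) = 0 the point falls on.
    print("positive class" if f([1.0, 3.0, 2.0]) > 0 else "negative class")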

Taking a Step Back
- We started out with tree learning algorithms that learn symbolic rules, with the goal of achieving the highest accuracy: 0R, 1R, Decision Trees (J48).
- Then we talked about statistical models that make decisions based on probability: Naïve Bayes.
- The "rules" look different: we just store counts. There is no explicit focus on accuracy during learning.
- What are the implications of the contrast between an accuracy focus and a probability focus?

Performing well with skewed class distributions
- Naïve Bayes has trouble with skewed class distributions because of the contribution of the prior probabilities. Remember our math problem case.
- Linear models can compensate for this: they have no notion of prior probability per se, so if a good split of the data exists, they will find it wherever it is.
- It's a problem if there is no good split.

Skewed but clean separation

Skewed but no clean separation

Taking a Step Back
- The models we will look at now have rules composed of numbers, so they "look" more like Naïve Bayes than like decision trees.
- But the numbers are obtained through a focus on achieving accuracy, so the learning process is more like decision trees.
- Given these two properties, what can you say about the assumptions these models make about the form of the solution and about the world?