
1 Machine Learning in Practice Lecture 6 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute

2 Plan for the Day Announcements: answer keys for past quizzes and assignments posted; quiz feedback; Assignment 3 handed out. Then finish Naïve Bayes and start Linear Models.

3 Quiz Notes Most people did well! The most frequent issue was the last question, where we compared likelihoods and probabilities. The main difference is scaling: the sum of the probabilities of all possible events should come out to 1, and that is what gives statistical models their nice formal properties. There was also a comment about technical versus common usages of terms like likelihood, concept, etc. The likelihood that play = yes given Outlook = rainy is Count(yes & rainy)/Count(yes) * Count(yes)/Count(yes or no), i.e., P(rainy | yes) * P(yes).
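
To make that concrete, here is a minimal sketch of the computation, assuming the counts from the standard Weka weather.nominal example (9 yes, 5 no, 14 instances, with Outlook = rainy co-occurring with yes 3 times); the counts are an assumption, not taken from the quiz:

    count_yes_and_rainy = 3
    count_yes = 9
    count_total = 14

    p_rainy_given_yes = count_yes_and_rainy / count_yes   # Count(yes & rainy) / Count(yes)
    p_yes = count_yes / count_total                       # Count(yes) / Count(yes or no)
    likelihood = p_rainy_given_yes * p_yes
    print(round(likelihood, 3))  # 0.214; unlike probabilities, likelihoods over the classes need not sum to 1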

4 Finishing Naïve Bayes

5 Bayes Theorem How would you compute the likelihood that a person was a bagpipe major given that they had red hair?

6 Bayes Theorem How would you compute the likelihood that a person was a bagpipe major given that they had red hair? Could you compute the likelihood that a person has red hair given that they were a bagpipe major?
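
For reference, the relationship the question points at is Bayes' theorem (standard form, not quoted from the slides), written in the notation the slides already use:

    P(bagpipe major | red hair) = P(red hair | bagpipe major) * P(bagpipe major) / P(red hair)

In other words, the conditional that is easy to estimate (red hair given bagpipe major), together with the two priors, gives you the one that is hard to estimate directly.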

8 Another Example Model
@relation is-yummy
@attribute ice-cream {chocolate, vanilla, coffee, rocky-road, strawberry}
@attribute cake {chocolate, vanilla}
@attribute yummy {yum,good,ok}
@data
chocolate,chocolate,yum
vanilla,chocolate,good
coffee,chocolate,yum
coffee,vanilla,ok
rocky-road,chocolate,yum
strawberry,vanilla,yum
Compute conditional probabilities for each attribute value/class pair: P(B|A) = Count(B&A)/Count(A). For example, P(coffee ice-cream | yum) = .25 and P(vanilla ice-cream | yum) = 0.
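
As a concrete illustration (a minimal sketch of the count-based estimate, not Weka's Naïve Bayes implementation), these conditional probabilities can be computed directly from the toy data on the slide:

    from collections import Counter

    # Toy is-yummy data from the slide: (ice_cream, cake, yummy)
    data = [
        ("chocolate",  "chocolate", "yum"),
        ("vanilla",    "chocolate", "good"),
        ("coffee",     "chocolate", "yum"),
        ("coffee",     "vanilla",   "ok"),
        ("rocky-road", "chocolate", "yum"),
        ("strawberry", "vanilla",   "yum"),
    ]
    class_counts = Counter(row[2] for row in data)

    def p_given(attr_index, value, cls):
        """P(attribute value | class) = Count(value & class) / Count(class)."""
        joint = sum(1 for row in data if row[attr_index] == value and row[2] == cls)
        return joint / class_counts[cls]

    print(p_given(0, "coffee", "yum"))   # 0.25
    print(p_given(0, "vanilla", "yum"))  # 0.0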

9 Another Example Model
@relation is-yummy
@attribute ice-cream {chocolate, vanilla, coffee, rocky-road, strawberry}
@attribute cake {chocolate, vanilla}
@attribute yummy {yum,good,ok}
@data
chocolate,chocolate,yum
vanilla,chocolate,good
coffee,chocolate,yum
coffee,vanilla,ok
rocky-road,chocolate,yum
strawberry,vanilla,yum
What class would you assign to strawberry ice cream with chocolate cake? Compute likelihoods and then normalize. Note: this model cannot take into account that the class might depend on how well the cake and ice cream "go together".
What is the likelihood that the answer is yum? P(strawberry|yum) = .25, P(chocolate cake|yum) = .75; .25 * .75 * .66 ≈ .124
What is the likelihood that the answer is good? P(strawberry|good) = 0, P(chocolate cake|good) = 1; 0 * 1 * .17 = 0
What is the likelihood that the answer is ok? P(strawberry|ok) = 0, P(chocolate cake|ok) = 0; 0 * 0 * .17 = 0

11 Another Example Model
@relation is-yummy
@attribute ice-cream {chocolate, vanilla, coffee, rocky-road, strawberry}
@attribute cake {chocolate, vanilla}
@attribute yummy {yum,good,ok}
@data
chocolate,chocolate,yum
vanilla,chocolate,good
coffee,chocolate,yum
coffee,vanilla,ok
rocky-road,chocolate,yum
strawberry,vanilla,yum
What about vanilla ice cream and vanilla cake? Intuitively, there is more evidence that the selected category should be good.
What is the likelihood that the answer is yum? P(vanilla|yum) = 0, P(vanilla cake|yum) = .25; 0 * .25 * .66 = 0
What is the likelihood that the answer is good? P(vanilla|good) = 1, P(vanilla cake|good) = 0; 1 * 0 * .17 = 0
What is the likelihood that the answer is ok? P(vanilla|ok) = 0, P(vanilla cake|ok) = 1; 0 * 1 * .17 = 0

13 Statistical Modeling with Small Datasets When you train your model, how many probabilities are you trying to estimate? This statistical modeling approach has problems with small datasets where not every class is observed in combination with every attribute value  What potential problem occurs when you never observe coffee ice-cream with class ok?  When is this not a problem?

14 Smoothing One way to compensate for 0 counts is to add 1 to every count. Then you never have 0 probabilities. But what might be the problem you still have on small data sets?

15 Naïve Bayes with smoothing
@relation is-yummy
@attribute ice-cream {chocolate, vanilla, coffee, rocky-road, strawberry}
@attribute cake {chocolate, vanilla}
@attribute yummy {yum,good,ok}
@data
chocolate,chocolate,yum
vanilla,chocolate,good
coffee,chocolate,yum
coffee,vanilla,ok
rocky-road,chocolate,yum
strawberry,vanilla,yum
What is the likelihood that the answer is yum? P(vanilla|yum) = .11, P(vanilla cake|yum) = .33; .11 * .33 * .66 ≈ .025
What is the likelihood that the answer is good? P(vanilla|good) = .33, P(vanilla cake|good) = .33; .33 * .33 * .17 ≈ .019
What is the likelihood that the answer is ok? P(vanilla|ok) = .17, P(vanilla cake|ok) = .66; .17 * .66 * .17 ≈ .019
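
Here is a minimal sketch of the add-one (Laplace) smoothing used above, on the same toy data; as on the slide, the class priors are left unsmoothed:

    from collections import Counter

    data = [("chocolate", "chocolate", "yum"), ("vanilla", "chocolate", "good"),
            ("coffee", "chocolate", "yum"), ("coffee", "vanilla", "ok"),
            ("rocky-road", "chocolate", "yum"), ("strawberry", "vanilla", "yum")]
    class_counts = Counter(row[2] for row in data)

    def p_given_smoothed(attr_index, value, cls, n_values):
        """(Count(value & class) + 1) / (Count(class) + number of possible attribute values)."""
        joint = sum(1 for row in data if row[attr_index] == value and row[2] == cls)
        return (joint + 1) / (class_counts[cls] + n_values)

    # 5 possible ice-cream values, 2 possible cake values; prior left unsmoothed as on the slide.
    p_yum = class_counts["yum"] / len(data)                      # 4/6, about .66
    print(p_given_smoothed(0, "vanilla", "yum", 5))              # 1/9, about .11
    print(p_given_smoothed(1, "vanilla", "yum", 2))              # 2/6, about .33
    print(p_given_smoothed(0, "vanilla", "yum", 5) *
          p_given_smoothed(1, "vanilla", "yum", 2) * p_yum)      # about .025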

17 Scenario [diagram: a math story problem connected to Math Skills 1 through 14]

18 Scenario [diagram: the same math story problem and skills] Each problem may be associated with more than one skill.

19 Scenario [diagram: the same math story problem and skills] Each skill may be associated with more than one problem.

20 How to address the problem? In reality there is a many-to-many mapping between math problems and skills

21 How to address the problem? In reality there is a many-to-many mapping between math problems and skills Ideally, we should be able to assign any subset of the full set of skills to any problem  But can we do that accurately?

22 How to address the problem? In reality there is a many-to-many mapping between math problems and skills Ideally, we should be able to assign any subset of the full set of skills to any problem  But can we do that accurately? If we can’t do that, it may be good enough to assign the single most important skill

23 How to address the problem? In reality there is a many-to-many mapping between math problems and skills Ideally, we should be able to assign any subset of the full set of skills to any problem  But can we do that accurately? If we can’t do that, it may be good enough to assign the single most important skill In that case, we will not accomplish the whole task

24 How to address the problem? But if we can do that part of the task more accurately, then we might accomplish more overall than if we try to achieve the more ambitious goal

25 Low resolution gives more information if the accuracy is higher. Remember this discussion from Lecture 2?

26 Which of these approaches is better? You have a corpus of math problem texts and you are trying to learn models that assign skill labels. Approach one: you have 91 binary prediction models, each of which makes an independent decision about each math text. Approach two: you have one multi-class classifier that assigns one out of the same 91 skill labels.

27 Approach 1 [diagram: the math story problem and the skills] Each skill corresponds to a separate binary predictor. Each of 91 binary predictors is applied to each text, so 91 separate predictions are made for each text.

28 Approach 2 [diagram: the math story problem and the skills] Each skill corresponds to a separate class value. A single multi-class predictor is applied to each text, so only 1 prediction is made for each text.

29 Which of these approaches is better? You have a corpus of math problem texts and you are trying to learn models that assign skill labels. Approach one: you have 91 binary prediction models, each of which makes an independent decision about each math text (more power, but more opportunity for error). Approach two: you have one multi-class classifier that assigns one out of the same 91 skill labels.

30 Which of these approaches is better? You have a corpus of math problem texts and you are trying to learn models that assign skill labels. Approach one: you have 91 binary prediction models, each of which makes an independent decision about each math text (more power, but more opportunity for error). Approach two: you have one multi-class classifier that assigns one out of the same 91 skill labels (less power, but fewer opportunities for error).
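
A hedged sketch of how the two setups might look with scikit-learn (assuming its standard estimators; the tiny texts and the three stand-in skills below are made up for illustration, in place of the 91 real skill labels):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.multiclass import OneVsRestClassifier

    texts = ["sum of two fractions", "area of a rectangle", "fraction of a set"]
    X = CountVectorizer().fit_transform(texts)

    # Approach 1: one binary predictor per skill (each column is one skill's yes/no labels).
    Y_binary = np.array([[1, 0, 1],
                         [0, 1, 0],
                         [1, 0, 0]])
    OneVsRestClassifier(MultinomialNB()).fit(X, Y_binary)

    # Approach 2: one multi-class predictor choosing a single skill label per text.
    y_multi = ["fractions", "geometry", "fractions"]
    MultinomialNB().fit(X, y_multi)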

31 Approach 1: One versus all Assume you have 80 example texts, and 4 of them have skill5 associated with them. Assume you are using some form of smoothing, so 0 counts become 1. Let's say WordX occurs with skill5 75% of the time and only about 5% of the time (4 times out of 76) with the majority class; it's the best predictor for skill5. After smoothing, P(WordX|Skill5) = 2/3 and P(WordX|majority) = 5/78.

32 Counts Without Smoothing 80 math problem texts; 7 instances of WordX; 3 of them are in skill5 texts (75% of the skill5 texts), making WordX the best predictor for skill5.
              WordX   WordY
  Skill5        3
  Majority      4

33 Counts With Smoothing 80 math problem texts; 7 instances of WordX; 3 of them are in skill5 texts (75% of the skill5 texts), making WordX the best predictor for skill5.
              WordX   WordY
  Skill5        4
  Majority      5

35 Approach 1 Let's say WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it's a moderately good predictor). In reality, that is 13 counts of WordY with majority and 1 with skill5; with smoothing, we get 14 counts of WordY with majority and 2 with skill5, so P(WordY|Skill5) = 1/3 and P(WordY|Majority) = 14/78. Because you multiply the conditional probabilities and the prior probabilities together, it's nearly impossible to predict the minority class when the data is this skewed. For "WordX WordY" you would get .66 * .33 * .05 ≈ .01 for skill5 and .06 * .18 * .95 ≈ .01 for majority. What would you predict without smoothing?

36 Counts Without Smoothing 80 math problem texts; 4 of them are skill5. WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it's a moderately good predictor).
              WordX   WordY
  Skill5        3       1
  Majority      4      13

37 Counts With Smoothing 80 math problem texts; 4 of them are skill5. WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it's a moderately good predictor).
              WordX   WordY
  Skill5        4       2
  Majority      5      14
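
A minimal sketch that reproduces this comparison from the smoothed counts above (all counts and word names are the hypothetical example from these slides, not a real dataset):

    # Smoothed conditionals; "+ 2" is add-one smoothing over the two possible values
    # of a binary word feature (present / absent).
    p_wordx_skill5   = (3 + 1) / (4 + 2)     # 2/3, about .66
    p_wordy_skill5   = (1 + 1) / (4 + 2)     # 1/3, about .33
    p_wordx_majority = (4 + 1) / (76 + 2)    # 5/78, about .06
    p_wordy_majority = (13 + 1) / (76 + 2)   # 14/78, about .18

    prior_skill5   = 4 / 80                  # .05
    prior_majority = 76 / 80                 # .95

    # Class scores for a text containing "WordX WordY".
    score_skill5   = p_wordx_skill5 * p_wordy_skill5 * prior_skill5         # about .011
    score_majority = p_wordx_majority * p_wordy_majority * prior_majority   # about .011
    print(score_skill5, score_majority)  # nearly tied, even though both words favor skill5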

40 Linear Models

41 Remember this: What do concepts look like?

43 Review: Concepts as Lines [figure: plotted data points]

47 [figure: the same plot with a new data point added] What will be the prediction for this new data point?

48 What are we learning? We're learning to draw a line through a multidimensional space (really a "hyperplane"). Each function we learn is like a single split in a decision tree, but it can take many features into account at one time rather than just one. F(x) = C0 + C1*X1 + C2*X2 + C3*X3, where X1 through Xn are our attributes and C0 through Cn are coefficients. We're learning the coefficients, which are weights.
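
A minimal sketch of evaluating such a linear function, with made-up coefficients (the names and numbers below are illustrative, not produced by any particular learner):

    def linear_predict(x, coefficients, intercept):
        """Return a class based on which side of the hyperplane x falls on."""
        score = intercept + sum(c * xi for c, xi in zip(coefficients, x))
        return "positive" if score >= 0 else "negative"

    # Three attributes, hypothetical learned weights.
    print(linear_predict([1.0, 0.5, 2.0], coefficients=[0.4, -1.2, 0.3], intercept=-0.1))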

49 Taking a Step Back We started out with tree learning algorithms that learn symbolic rules with the goal of achieving the highest accuracy (0R, 1R, Decision Trees (J48)). Then we talked about statistical models that make decisions based on probability (Naïve Bayes): the rules look different, since we just store counts, and there is no explicit focus on accuracy during learning. What are the implications of the contrast between an accuracy focus and a probability focus?

50 Performing well with skewed class distributions Naïve Bayes has trouble with skewed class distributions because of the contribution of prior probabilities (remember our math problem case). Linear models can compensate for this: they don't have any notion of prior probability per se, so if they can find a good split on the data, they will find it wherever it is. It is a problem, though, if there is not a good split.

51 Skewed but clean separation

53 Skewed but no clean separation

55 Taking a Step Back The models we will look at now have rules composed of numbers  So they “look” more like Naïve Bayes than like Decision Trees But the numbers are obtained through a focus on achieving accuracy  So the learning process is more like Decision Trees Given these two properties, what can you say about assumptions about the form of the solution and assumptions about the world that are made?

