1 Machine Learning in Practice Lecture 10 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute

2 http://www.theallusionist.com/wordpress/wp-content/uploads/gambling8.jpg

3 Plan for the Day
Announcements
- Questions?
- Quiz answer key posted
Today’s Data Set: Prevalence of Gambling
Exploring the Concept of Cost
http://www.casino-gambling-dictionary.com/

4 Quiz Notes

5 Leave-one-out cross validation
On each fold, train on all but 1 data point, test on 1 data point
- Pro: Maximizes the amount of training data used on each fold
- Con: Not stratified
- Con: Takes a long time on large sets
Best to use only when you have a very small amount of training data
- Only needed when 10-fold cross validation is not feasible because of lack of data
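
The lecture’s examples use Weka, but the idea is easy to sketch in Python. A minimal illustrative sketch (assuming scikit-learn and a made-up toy dataset, not the lecture’s data):

```python
# Illustrative sketch: leave-one-out cross validation on a toy dataset.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.naive_bayes import GaussianNB

X = np.array([[0.1], [0.4], [0.35], [0.8], [0.9], [0.75]])  # toy features
y = np.array([0, 0, 0, 1, 1, 1])                            # toy labels

correct = 0
for train_idx, test_idx in LeaveOneOut().split(X):
    model = GaussianNB().fit(X[train_idx], y[train_idx])  # train on n-1 points
    correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])  # test on 1

print("LOO accuracy:", correct / len(y))  # one held-out prediction per instance
```

Note that each fold’s test set is a single instance, so stratification is impossible and n models must be trained.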

6 632 Bootstrap
A method for estimating performance when you have a small data set
- Consider it an alternative to leave-one-out cross validation
Sample n times with replacement to create the training set
- Some instances will be repeated
- Some will be left out – this will be your test set
- About 63% of the instances in the original set will end up in the training set

7 632 Bootstrap
Estimating error over the training set will give an optimistic estimate of performance
- Because you trained on these examples
Estimating error over the test set will give a pessimistic estimate of the error
- Because the 63.2/36.8 split gives you less training data than a 90/10 split
Estimate error by combining the optimistic and pessimistic estimates:
0.632 × pessimistic_estimate + 0.368 × optimistic_estimate
Iterate several times and average the performance estimates
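
A minimal sketch of one bootstrap iteration (numpy and scikit-learn on made-up data; in practice you would repeat this several times and average):

```python
# Illustrative sketch: one iteration of the .632 bootstrap estimate.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))             # toy features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels

n = len(X)
boot = rng.integers(0, n, size=n)        # sample n times with replacement
oob = np.setdiff1d(np.arange(n), boot)   # instances left out become the test set

model = GaussianNB().fit(X[boot], y[boot])
optimistic = np.mean(model.predict(X[boot]) != y[boot])  # error on training set
pessimistic = np.mean(model.predict(X[oob]) != y[oob])   # error on held-out set

print("estimate:", 0.632 * pessimistic + 0.368 * optimistic)
```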

8 Prevalence of Gambling

9 Gambling Prevalence
Goal is to predict how often people…
- who fit in a particular demographic group (i.e., male versus female; White versus Black versus Hispanic versus other)
- are classified as having a particular level of gambling risk (at risk, problem, or pathological)
- either during one specific year or in their lifetime

14 Gambling Prevalence * Risk is the most predictive feature.

15 Gambling Prevalence

16

17 * Demographic is the least predictive feature.

18 Which algorithm will perform best? http://www.albanycitizenscouncil.org/Pictures/Gambling2.jpg

19 Which algorithm will perform best?
- Decision Trees: .26 Kappa
- Naïve Bayes: .31 Kappa
- SMO: .53 Kappa
http://www.albanycitizenscouncil.org/Pictures/Gambling2.jpg

20 Decision Trees * What’s it ignoring and why?

21 With Binary Splits – Kappa .41

22 What was different with SMO?
Trained a model for all pairs
- The features that were important for one pairwise distinction were different from those for other pairwise distinctions
- Characteristic=Black was most important for High versus Low (ignored by decision trees)
- When and Risk were most important for High versus Medium
Decision Trees pay attention to all distinctions at once
- Totally ignored a feature that was important for some pairwise distinctions

23 What was wrong with Naïve Bayes?
- Probably just learned noisy probabilities because the data set is small
- Hard to distinguish Low and Medium

24 Back to Chapter 5

25 Thinking about the cost of an error – A Theoretical Foundation for Machine Learning
Cost
- Making the right choice doesn’t cost you anything
- Making an error comes with a cost
- Some errors cost more than others
- Rather than evaluating your model in terms of accuracy, which treats every error as though it were the same, you can think about average cost
- The real cost is determined by your application

26 Unified Framework
Connection between optimization techniques and evaluation methods
- Think about what function you are optimizing – that’s what learning is
- Evaluation measures how well you did that optimization
- So it makes sense for there to be a deep connection between the learning technique and the evaluation
New machine learning algorithms are often motivated by modifications to the conceptualization of the cost of an error

27 What’s the cost of a gambling mistake? http://imagecache2.allposters.com/images/pic/PTGPOD/321587~Pile-of-American-Money-Posters.jpg

28 Thinking About the Practical Cost of an Error
In document retrieval, precision is more important than recall
- You’re picking from the whole web, so if you miss some relevant documents it’s not a big deal
- Precision is more important – you don’t want to have to slog through lots of irrelevant stuff

29 Thinking About the Practical Cost of an Error
What if you are trying to predict whether someone will be late?
- Is it worse to predict someone will be late when they won’t, or vice versa?

30 Thinking About the Practical Cost of an Error
What if you’re trying to predict whether a message will get a response or not?

31 Thinking About the Practical Cost of an Error
Let’s say you are picking out errors in student essays
- If you detect an error, you offer the student a correction for their error
- What are the implications of missing an error?
- What are the implications of imagining an error that doesn’t exist?

32 Cost Sensitive Classification
An example of the connection between the notion of the cost of an error and the training method
Say you manipulate the cost of different types of errors
- The cost of a decision is computed based on the expected cost
- That affects the function the algorithm is “trying” to optimize: minimize expected cost rather than maximize accuracy

33 Cost Sensitive Classification
Cost sensitive classifiers work in two ways
- Manipulate the composition of the training data (by either changing the weight of some instances or by artificially boosting the number of instances of some types by strategically including some duplicates), as sketched below
- Manipulate the way predictions are made: select the option that minimizes cost rather than the most likely choice
In practice it’s hard to use cost-sensitive classification in a useful way
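
A minimal sketch of the first approach (pure numpy, with hypothetical per-class costs): instances of a class that is expensive to misclassify are duplicated so the learner implicitly pays more attention to them:

```python
# Illustrative sketch: biasing the training data composition by duplicating
# instances of classes that are expensive to get wrong.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))        # toy features
y = rng.integers(0, 3, size=12)     # toy labels for classes 0, 1, 2

class_cost = {0: 1, 1: 1, 2: 10}    # hypothetical relative misclassification costs

repeats = np.array([class_cost[label] for label in y])  # copies per instance
X_weighted = np.repeat(X, repeats, axis=0)
y_weighted = np.repeat(y, repeats)

print(len(X), "instances ->", len(X_weighted), "after cost-based duplication")
```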

34 Cost Sensitive Classification
What if it’s 10 times more expensive to make a mistake when selecting Class C?
Expected cost of a decision: Σ_j C_j × p_j
- The cost of predicting class C_j is computed by multiplying the j column of the cost matrix by the corresponding probabilities
- The expected cost of selecting C, if the probabilities are computed as A=75%, B=10%, C=15%, is .75×10 + .1×1 = 7.6

Cost matrix (rows = actual class, columns = predicted class):
       A  B   C
  A:   0  1  10
  B:   1  0   1
  C:   1  1   0

35 Cost Sensitive Classification
- The expected cost of selecting B, if the probabilities are computed as A=75%, B=10%, C=15%, is .75×1 + .15×1 = .90
- If A is selected, the expected cost is .1×1 + .15×1 = .25
- You can make a choice by minimizing the expected cost of an error
- So in this case, the expected cost is lowest when selecting A, the class with the highest probability
(same cost matrix as slide 34)
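
The arithmetic on slides 34 and 35 can be written out directly. A small sketch (numpy; the cost matrix and probabilities are the ones on the slides) that picks the class with the lowest expected cost:

```python
# Illustrative sketch: minimum-expected-cost prediction from slides 34-35.
import numpy as np

# cost[actual][predicted]; mistakes made when predicting C cost 10.
cost = np.array([[0, 1, 10],
                 [1, 0,  1],
                 [1, 1,  0]])
classes = ["A", "B", "C"]
p = np.array([0.75, 0.10, 0.15])  # predicted class probabilities

expected = p @ cost               # expected cost of predicting each class
for c, e in zip(classes, expected):
    print(f"expected cost of predicting {c}: {e:.2f}")  # A: 0.25, B: 0.90, C: 7.60

print("choose:", classes[int(np.argmin(expected))])     # A, the cheapest choice
```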

37 Using Cost Sensitive Classification

38

39 * Set up the cost matrix * Assign a high penalty to the largest error cell

40 Results
- Without Cost Sensitive Classification: .53
- Using Cost Sensitive Classification increased performance to .55
  - Tiny difference because SMO assigns probability 1 to all predictions
  - Not statistically significant
- SMO with default settings normally predicts one class with confidence 1 and the others with confidence 0, so cost sensitive classification does not have a big effect

41 What is the cost of an error?
Assume first that all errors have the same cost
Quadratic loss: Σ_j (p_j − a_j)²
- Cost of a decision
- j iterates over the classes (A, B, C)
- p_j is the predicted probability of class j; a_j is 1 for the actual class and 0 otherwise
Penalizes you for putting high confidence on a wrong prediction and/or low confidence on a right prediction

Uniform cost matrix (rows = actual class, columns = predicted class):
       A  B  C
  A:   0  1  1
  B:   1  0  1
  C:   1  1  0

42 What is the cost of an error?
Assume first that all errors have the same cost
Quadratic loss: Σ_j (p_j − a_j)²
- Cost of a decision
- j iterates over the classes (A, B, C)
If C is right and you say A=75%, B=10%, C=15%:
- (.75 − 0)² + (.1 − 0)² + (.15 − 1)² ≈ 1.3
If A is right and you say A=75%, B=10%, C=15%:
- (.75 − 1)² + (.1 − 0)² + (.15 − 0)² ≈ .095
Lower cost if the highest probability is on the correct choice
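
The same numbers, as a minimal sketch (numpy):

```python
# Illustrative sketch: the quadratic-loss arithmetic from slide 42.
import numpy as np

p = np.array([0.75, 0.10, 0.15])    # predicted probabilities for A, B, C

def quadratic_loss(p, actual_index):
    a = np.zeros_like(p)
    a[actual_index] = 1.0           # a_j is 1 for the true class, 0 otherwise
    return np.sum((p - a) ** 2)

print(quadratic_loss(p, 2))  # C is the true class: ~1.295
print(quadratic_loss(p, 0))  # A is the true class: ~0.095
```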

43 What is the cost of an error?
Assume all errors have the same cost
Informational loss: −log_k(p_i)
- k is the number of classes
- i is the correct class
- p_i is the probability assigned to class i
If C is right and you say A=75%, B=10%, C=15%: −log_3(.15) ≈ 1.73
If A is right and you say A=75%, B=10%, C=15%: −log_3(.75) ≈ .26
Lower cost if the highest probability is on the correct choice
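
And the informational-loss numbers, as a minimal sketch (standard-library Python, using log base k as the slide defines it):

```python
# Illustrative sketch: the informational-loss arithmetic from slide 43.
import math

p = [0.75, 0.10, 0.15]   # predicted probabilities for A, B, C
k = len(p)               # number of classes, used as the log base

def informational_loss(p, actual_index, k):
    return -math.log(p[actual_index], k)  # only the true class's probability matters

print(round(informational_loss(p, 2, k), 2))  # C is the true class: 1.73
print(round(informational_loss(p, 0, k), 2))  # A is the true class: 0.26
```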

44 Trade Offs Between Quadratic Loss and Informational Loss
Quadratic loss pays attention to the probabilities placed on all classes
- So you can get partial credit if you put really low probabilities on some of the wrong choices
- Bounded (max value is 2)
Informational loss only pays attention to how you treated the correct prediction
- More like gambling
- Not bounded

45 Minimum Description Length Principle
Another way of viewing the connection between optimization and evaluation
- Based on information theory
Training minimizes how much information you encode in the model
- How much information does it take to determine what class an instance belongs to?
- Information is encoded in your feature space
Evaluation measures how much information is lost in the classification
Tension between the complexity of the model at training time and the information loss at testing time

46 Take Home Message
Different types of errors have different costs
- Costs are associated with cells in the confusion matrix
- Costs may also be associated with the level of confidence with which decisions are made
Connection between the concept of the cost of an error and the learning method
- Machine learning algorithms are optimizing a cost function
- The cost function should reflect the real cost in the world
In cost sensitive classification, the notion of which types of errors cost more can influence classification performance

