1 Data Mining – Algorithms: Naïve Bayes Chapter 4, Section 4.2

2 More Simplicity
Direct contrast to OneR – use all attributes
Assume that all attributes are equally important
Assume that all (non-predicted) attributes are independent of each other
This is clearly naïve! But it works pretty well

3 Again, let's make this a little more realistic than the book does
Divide into training and test data
Let's save the last record as a test instance (using my weather, nominal …

4 Determine Distribution of Attributes

Outlook \ Play    Yes  No
Sunny              4    1
Overcast           2    2
Rainy              0    4

Temperature \ Play  Yes  No
Hot                  1    3
Mild                 3    2
Cool                 2    2

5 Determine Distribution of Attributes

Humidity \ Play   Yes  No
High               3    3
Normal             3    4

Windy \ Play   Yes  No
False           2    6
True            4    1

6 Also, the attribute to be predicted …

        Yes  No
Play     6    7

7 Inferring Probabilities from Observed

Outlook \ Play    Yes   No
Sunny             4/6   1/7
Overcast          2/6   2/7
Rainy             0/6   4/7

Temperature \ Play  Yes   No
Hot                 1/6   3/7
Mild                3/6   2/7
Cool                2/6   2/7

8 Inferring Probabilities from Observed

Humidity \ Play   Yes   No
High              3/6   3/7
Normal            3/6   4/7

Windy \ Play   Yes   No
False          2/6   6/7
True           4/6   1/7

9 Also, the attribute to be predicted …
Proportion of days that were yes and no

        Yes    No
Play    6/13   7/13

10 Now, suppose we must predict the test instance: rainy, mild, high, true

Probability of Yes (unnormalized)
  = probability of rainy given Yes
  * probability of mild given Yes
  * probability of high humidity given Yes
  * probability of windy given Yes
  * probability of Yes (in general)
  = 0/6 * 3/6 * 3/6 * 4/6 * 6/13 = 0 / 16848 = 0.0

Probability of No (unnormalized)
  = probability of rainy given No
  * probability of mild given No
  * probability of high humidity given No
  * probability of windy given No
  * probability of No (in general)
  = 4/7 * 2/7 * 3/7 * 1/7 * 7/13 = 168 / 31213 = 0.005
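As a rough illustration, here is a minimal Python sketch of this hand calculation, with the probability tables from the previous slides hard-coded (variable names and structure are just illustrative, not WEKA's):

```python
from fractions import Fraction as F

# Per-class conditional probability tables from the previous slides.
cond = {
    "yes": {"outlook":     {"sunny": F(4, 6), "overcast": F(2, 6), "rainy": F(0, 6)},
            "temperature": {"hot": F(1, 6), "mild": F(3, 6), "cool": F(2, 6)},
            "humidity":    {"high": F(3, 6), "normal": F(3, 6)},
            "windy":       {"false": F(2, 6), "true": F(4, 6)}},
    "no":  {"outlook":     {"sunny": F(1, 7), "overcast": F(2, 7), "rainy": F(4, 7)},
            "temperature": {"hot": F(3, 7), "mild": F(2, 7), "cool": F(2, 7)},
            "humidity":    {"high": F(3, 7), "normal": F(4, 7)},
            "windy":       {"false": F(6, 7), "true": F(1, 7)}},
}
prior = {"yes": F(6, 13), "no": F(7, 13)}

test = {"outlook": "rainy", "temperature": "mild", "humidity": "high", "windy": "true"}

for cls in ("yes", "no"):
    score = prior[cls]
    for attr, value in test.items():
        score *= cond[cls][attr][value]      # one factor per attribute value
    print(cls, score, float(score))          # yes -> 0, no -> 168/31213 ~ 0.005
```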

11 The Foundation: Bayes' Rule of Conditional Probabilities

Pr[H|E] = Pr[E|H] · Pr[H] / Pr[E]

The probability of a hypothesis H (e.g. play=yes) given evidence E (the new test instance) is equal to:
– the probability of the evidence given the hypothesis
– times the probability of the hypothesis,
– all divided by the probability of the evidence
We computed the numerator of this (the denominator doesn't matter, since it is the same for both Yes and No)

12 The Probability of the Evidence given the Hypothesis: Since this is naïve Bayes, we assume that the evidence from the different attributes is independent (given the class), so the probabilities of the 4 attributes having the values that they do are simply multiplied together
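Written out in symbols (a sketch, using the same H and E notation as the previous slide):

```latex
% the "naive" independence assumption: the evidence factors per attribute
\Pr[E \mid H] \;=\; \prod_{i=1}^{n} \Pr[E_i \mid H]
% so the quantity computed for each class on slide 10 is
\Pr[H \mid E] \;\propto\; \Pr[H] \prod_{i=1}^{n} \Pr[E_i \mid H]
```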

13 The Probability of the Hypothesis: This is just the probability of Yes (or No). This is called the "prior probability" of the hypothesis – it would be your guess prior to seeing any evidence. We multiplied this by the previous slide's value, as called for in the formula

14 A complication: Our probability of "yes" came out zero, since no rainy day had had play=yes. This may be a little extreme – this one attribute has ruled everything, no matter what the other evidence says. Common adjustment – start all counts off at 1 instead of at 0 (the "Laplace estimator"); a small sketch of the adjustment follows …
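A minimal Python sketch of the Laplace adjustment (illustrative, not WEKA's code): every (attribute value, class) count starts at 1 instead of 0 before the observed counts are added, and probabilities are then read off the adjusted counts. The training rows are the first 13 records of my weather data; record 14 is the held-out test instance.

```python
from fractions import Fraction as F

training = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "yes"),
    ("overcast", "hot", "high", "false", "no"),
    ("rainy", "mild", "high", "false", "no"),
    ("rainy", "cool", "normal", "false", "no"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "yes"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "no"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "no"),
]
attrs = ["outlook", "temperature", "humidity", "windy"]
values = {"outlook": ["sunny", "overcast", "rainy"],
          "temperature": ["hot", "mild", "cool"],
          "humidity": ["high", "normal"],
          "windy": ["false", "true"]}

# start every count at 1 (the Laplace estimator), then add the observed counts
counts = {c: {a: {v: 1 for v in values[a]} for a in attrs} for c in ("yes", "no")}
for *row, cls in training:
    for a, v in zip(attrs, row):
        counts[cls][a][v] += 1

def prob(cls, attr, value):
    # probability of the attribute value given the class, from adjusted counts
    return F(counts[cls][attr][value], sum(counts[cls][attr].values()))

print(prob("yes", "outlook", "rainy"))   # 1/9, matching the adjusted table
print(prob("no", "outlook", "rainy"))    # 5/10
```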

15 With Laplace Estimator … Determine Distribution of Attributes

Outlook \ Play    Yes  No
Sunny              5    2
Overcast           3    3
Rainy              1    5

Temperature \ Play  Yes  No
Hot                  2    4
Mild                 4    3
Cool                 3    3

16 With Laplace Estimator … Determine Distribution of Attributes

Humidity \ Play   Yes  No
High               4    4
Normal             4    5

Windy \ Play   Yes  No
False           3    7
True            5    2

17 With Laplace Estimator …, the attribute to be predicted …

        Yes  No
Play     7    8

18 With Laplace Estimator … Inferring Probabilities from Observed

Outlook \ Play    Yes   No
Sunny             5/9   2/10
Overcast          3/9   3/10
Rainy             1/9   5/10

Temperature \ Play  Yes   No
Hot                 2/9   4/10
Mild                4/9   3/10
Cool                3/9   3/10

19 With Laplace Estimator … Inferring Probabilities from Observed

Humidity \ Play   Yes   No
High              4/8   4/9
Normal            4/8   5/9

Windy \ Play   Yes   No
False          3/8   7/9
True           5/8   2/9

20 With Laplace Estimator …, the attribute to be predicted …
Proportion of days that were yes and no

        Yes    No
Play    7/15   8/15

21 Now, predict the test instance: rainy, mild, high, true

Probability of Yes = 1/9 * 4/9 * 4/8 * 5/8 * 7/15 = 560 / 77760 = 0.007
Probability of No  = 5/10 * 3/10 * 4/9 * 2/9 * 8/15 = 960 / 121500 = 0.008
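A small sketch (not on the slides): dividing each score by the sum of the two recovers proper posterior probabilities, since the Pr[E] denominator of Bayes' rule cancels and only the relative sizes of the two products matter.

```python
from fractions import Fraction as F

score_yes = F(1, 9) * F(4, 9) * F(4, 8) * F(5, 8) * F(7, 15)     # 560/77760  ~ 0.007
score_no  = F(5, 10) * F(3, 10) * F(4, 9) * F(2, 9) * F(8, 15)   # 960/121500 ~ 0.008
total = score_yes + score_no
print(float(score_yes / total))   # ~ 0.48
print(float(score_no / total))    # ~ 0.52 -> still predicts "no", but barely
```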

22 In a 14-fold cross-validation, this would continue 13 more times. Let's run WEKA on this … NaiveBayesSimple …

23 WEKA results – first look near the bottom

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      9      64.2857 %
Incorrectly Classified Instances    5      35.7143 %

On the cross-validation – it got 9 out of 14 tests correct. Same as OneR

24 More Detailed Results

=== Confusion Matrix ===
 a b   <-- classified as
 3 3 | a = yes
 2 6 | b = no

Here we see:
– the program 5 times predicted play=yes; on 3 of those it was correct – it is predicting yes less often than OneR did
– the program 9 times predicted play=no; on 6 of those it was correct
– there were 6 instances whose actual value was play=yes; the program correctly predicted that on 3 of them
– there were 8 instances whose actual value was play=no; the program correctly predicted that on 6 of them

25 Again, part of our purpose is to have a take-home message for humans – not 14 take-home messages! So instead of reporting each of the things learned on each of the 14 training sets … the program runs again on all of the data and builds a pattern for that – a take-home message … However, for naïve Bayes, the take-home message is less easily interpreted …

26 WEKA - Take-Home

Naive Bayes (simple)

Class yes: P(C) = 0.4375
Attribute outlook
  sunny       overcast    rainy
  0.55555556  0.33333333  0.11111111
Attribute temperature
  hot         mild        cool
  0.22222222  0.44444444  0.33333333
Attribute humidity
  high        normal
  0.5         0.5
Attribute windy
  TRUE        FALSE
  0.625       0.375

27 WEKA - Take-Home continued

Class no: P(C) = 0.5625
Attribute outlook
  sunny       overcast    rainy
  0.18181818  0.27272727  0.54545455
Attribute temperature
  hot         mild        cool
  0.36363636  0.36363636  0.27272727
Attribute humidity
  high        normal
  0.5         0.5
Attribute windy
  TRUE        FALSE
  0.3         0.7

28 Let's Try WEKA Naïve Bayes on njcrimenominal – try 10-fold

=== Confusion Matrix ===
  a   b   <-- classified as
  6   1 |  a = bad
  7  18 |  b = ok

This represents a slight improvement over OneR (probably not significant). Recall that OneR chose unemployment as the attribute to use; note the probabilities Naïve Bayes learned for that attribute. For bad crime:

Attribute unemploy
  hi    med   low
  0.3   0.6   0.1

… while for ok crime:

Attribute unemploy
  hi          med         low
  0.03571429  0.28571429  0.67857143

29 Naïve Bayes – Missing Values
Training data – the instance is simply not included in the frequency counts for that attribute; probability ratios are based on the number of values actually occurring rather than the total number of instances
Test data – the calculation omits the missing attribute, as sketched below
– E.g. Prob(yes | sunny, ?, high, false) = 5/9 * 4/8 * 3/8 * 7/15 (skipping temperature) – since the factor is omitted for each class, this is not a problem
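A minimal sketch of the missing-value rule at prediction time (illustrative names, not WEKA's): when an attribute is missing, its factor is simply left out of the product. The probabilities are the Laplace-adjusted ones from the earlier slides, hard-coded for the "yes" class only.

```python
from fractions import Fraction as F

cond_yes = {"outlook":     {"sunny": F(5, 9)},
            "temperature": {"hot": F(2, 9), "mild": F(4, 9), "cool": F(3, 9)},
            "humidity":    {"high": F(4, 8)},
            "windy":       {"false": F(3, 8)}}
prior_yes = F(7, 15)

test = {"outlook": "sunny", "temperature": None,   # temperature is missing
        "humidity": "high", "windy": "false"}

score = prior_yes
for attr, value in test.items():
    if value is None:
        continue                         # skip the missing attribute's factor
    score *= cond_yes[attr][value]
print(score)   # 5/9 * 4/8 * 3/8 * 7/15, as in the example above
```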

30 Naïve Bayes – Numeric Values
Assume values fit a "normal" curve
Calculate the mean and standard deviation for each class
Known properties of normal curves allow us to use a formula for the "probability density function" to calculate the probability based on a value, the mean, and the standard deviation. The book has the equation, p. 87 – don't memorize it. Look it up if you are writing the program (a sketch appears below)
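A sketch of that probability density function (the standard normal-density formula, not copied from the book); the mean and standard deviation would be computed per class from the numeric attribute's training values, and the numbers in the example call are purely illustrative.

```python
import math

def normal_density(x, mean, std_dev):
    # value of the normal probability density function at x
    exponent = -((x - mean) ** 2) / (2 * std_dev ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * std_dev)

# illustrative numbers: a value of 66 for a class with mean 73, std dev 6.2
print(normal_density(66, 73.0, 6.2))   # ~ 0.034
```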

31 Naïve Bayes – Discussion
Naïve Bayes frequently does as well as or better than sophisticated classification algorithms on real datasets – despite its assumptions being violated
Clearly redundant attributes hurt performance, because they have the effect of counting an attribute more than once (e.g. at a school with a very high percentage of "traditional students", age and year in school are redundant)
Many correlated or redundant attributes make Naïve Bayes a poor choice for a dataset (unless preprocessing removes them)
Numeric data known to not be in a normal distribution can be handled using the appropriate other distribution (e.g. Poisson) or, if unknown, a generic "kernel density estimation"

32 Example: My Weather (Nominal)

Outlook   Temp  Humid   Windy  Play?
sunny     hot   high    FALSE  no
sunny     hot   high    TRUE   yes
overcast  hot   high    FALSE  no
rainy     mild  high    FALSE  no
rainy     cool  normal  FALSE  no
rainy     cool  normal  TRUE   no
overcast  cool  normal  TRUE   yes
sunny     mild  high    FALSE  yes
sunny     cool  normal  FALSE  yes
rainy     mild  normal  FALSE  no
sunny     mild  normal  TRUE   yes
overcast  mild  high    TRUE   yes
overcast  hot   normal  FALSE  no
rainy     mild  high    TRUE   no
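For convenience (not part of the slides), the table above encoded as Python data, handy for re-running the hand calculations in the class exercise; the last record is the one held out earlier as the test instance.

```python
weather = [
    # (outlook, temperature, humidity, windy, play)
    ("sunny",    "hot",  "high",   False, "no"),
    ("sunny",    "hot",  "high",   True,  "yes"),
    ("overcast", "hot",  "high",   False, "no"),
    ("rainy",    "mild", "high",   False, "no"),
    ("rainy",    "cool", "normal", False, "no"),
    ("rainy",    "cool", "normal", True,  "no"),
    ("overcast", "cool", "normal", True,  "yes"),
    ("sunny",    "mild", "high",   False, "yes"),
    ("sunny",    "cool", "normal", False, "yes"),
    ("rainy",    "mild", "normal", False, "no"),
    ("sunny",    "mild", "normal", True,  "yes"),
    ("overcast", "mild", "high",   True,  "yes"),
    ("overcast", "hot",  "normal", False, "no"),
    ("rainy",    "mild", "high",   True,  "no"),
]
yes_count = sum(1 for *_, play in weather if play == "yes")
print(yes_count, len(weather) - yes_count)   # 6 yes, 8 no
```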

33 Class Exercise

34 Let’s run WEKA NaiveBayesSimple on japanbank

35 End Section 4.2

