Data Mining – Algorithms: Naïve Bayes Chapter 4, Section 4.2.



More Simplicity

In direct contrast to OneR:
– Use all attributes
– Assume that all attributes are equally important
– Assume that all (non-predicted) attributes are independent of each other

This is clearly naïve! But it works pretty well.

Again, let's make this a little more realistic than the book does: divide into training and test data. Let's save the last record as a test instance (using my weather, nominal) …

Determine Distribution of Attributes

Outlook \ Play    Yes   No
Sunny              4     1
Overcast           2     2
Rainy              0     4

Temperature \ Play  Yes   No
Hot                  1     3
Mild                 3     2
Cool                 2     2

Determine Distribution of Attributes

Humidity \ Play   Yes   No
High               3     3
Normal             3     4

Windy \ Play      Yes   No
False              2     6
True               4     1

Also, the attribute to be predicted …

         Yes   No
Play      6     7

Inferring Probabilities from Observed

Outlook \ Play    Yes    No
Sunny             4/6   1/7
Overcast          2/6   2/7
Rainy             0/6   4/7

Temperature \ Play  Yes    No
Hot                 1/6   3/7
Mild                3/6   2/7
Cool                2/6   2/7

Inferring Probabilities from Observed

Humidity \ Play   Yes    No
High              3/6   3/7
Normal            3/6   4/7

Windy \ Play      Yes    No
False             2/6   6/7
True              4/6   1/7

Also, the attribute to be predicted … the proportion of days that were yes and no:

         Yes     No
Play     6/13   7/13
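The counting behind these tables is mechanical. As a minimal sketch (not the book's or WEKA's code), it might look like the following, using the 13 training records from the "My Weather (Nominal)" slide with the last record held out:

```python
# Counting class and attribute-value frequencies for Naive Bayes.
from collections import Counter

train = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "yes"),
    ("overcast", "hot", "high", "FALSE", "no"),
    ("rainy", "mild", "high", "FALSE", "no"),
    ("rainy", "cool", "normal", "FALSE", "no"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "yes"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "no"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "no"),
]

attrs = ["outlook", "temperature", "humidity", "windy"]

# Class counts: how many yes days, how many no days
class_counts = Counter(record[-1] for record in train)

# Conditional counts: (attribute, value, class) -> frequency
cond_counts = Counter()
for record in train:
    for name, value in zip(attrs, record[:-1]):
        cond_counts[(name, value, record[-1])] += 1

print(class_counts["yes"], class_counts["no"])      # 6 7
print(cond_counts[("outlook", "sunny", "yes")])     # 4
print(cond_counts[("outlook", "rainy", "yes")])     # 0
```

Dividing each conditional count by the class count gives exactly the ratios in the tables above (e.g. sunny given yes = 4/6).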

Now, suppose we must predict the test instance: rainy, mild, high, true

Probability of Yes
= Probability of Yes given that it is rainy * Probability of Yes given that it is mild * Probability of Yes given that humidity is high * Probability of Yes given that it is windy * Probability of Yes (in general)
= 0/6 * 3/6 * 3/6 * 4/6 * 6/13 = 0 / 16,848 = 0.0

Probability of No
= Probability of No given that it is rainy * Probability of No given that it is mild * Probability of No given that humidity is high * Probability of No given that it is windy * Probability of No (in general)
= 4/7 * 2/7 * 3/7 * 1/7 * 7/13 = 168 / 31,213 ≈ 0.005
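The arithmetic can be checked directly; a minimal sketch using the ratios read off the tables:

```python
# Unsmoothed Naive Bayes scores for the test instance (rainy, mild, high, true).
p_yes = (0/6) * (3/6) * (3/6) * (4/6) * (6/13)   # the rainy/yes zero wipes out the product
p_no  = (4/7) * (2/7) * (3/7) * (1/7) * (7/13)   # = 168 / 31213

print(p_yes)            # 0.0
print(round(p_no, 3))   # 0.005
```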

The Foundation: Bayes' Rule of Conditional Probabilities

    P[H | E] = ( Pr[E | H] * Pr[H] ) / Pr[E]

The probability of a hypothesis H (e.g. play=yes) given evidence E (the new test instance) is equal to:
– the probability of the evidence given the hypothesis,
– times the prior probability of the hypothesis,
– all divided by the probability of the evidence.

We computed the numerator of this (the denominator doesn't matter, since it is the same for both Yes and No).

The Probability of the Evidence given the Hypothesis: since this is naïve Bayes, we assume that the evidence from the different attributes is independent (given the class), so the probabilities of the four attribute values are multiplied together.
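Written out for the four weather attributes (H is the class hypothesis, E the test instance):

```latex
\Pr[E \mid H] \;=\;
\Pr[\text{outlook} \mid H]\,
\Pr[\text{temperature} \mid H]\,
\Pr[\text{humidity} \mid H]\,
\Pr[\text{windy} \mid H]
```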

The Probability of the Hypothesis: this is just the probability of Yes (or No). It is called the "prior probability" of the hypothesis – it would be your guess prior to seeing any evidence. We multiplied this by the previous slide's value, as called for in the formula.

A complication: our probability of "yes" came out zero, since no rainy day had play=yes. This may be a little extreme – this one attribute has ruled all, no matter what the other evidence says. A common adjustment is to start all counts off at 1 instead of at 0 (the "Laplace estimator") …

With Laplace Estimator … Determine Distribution of Attributes

Outlook \ Play    Yes   No
Sunny              5     2
Overcast           3     3
Rainy              1     5

Temperature \ Play  Yes   No
Hot                  2     4
Mild                 4     3
Cool                 3     3

With Laplace Estimator … Determine Distribution of Attributes

Humidity \ Play   Yes   No
High               4     4
Normal             4     5

Windy \ Play      Yes   No
False              3     7
True               5     2

With Laplace Estimator …, the attribute to be predicted …

         Yes   No
Play      7     8

With Laplace Estimator … Inferring Probabilities from Observed

Outlook \ Play    Yes     No
Sunny             5/9    2/10
Overcast          3/9    3/10
Rainy             1/9    5/10

Temperature \ Play  Yes     No
Hot                 2/9    4/10
Mild                4/9    3/10
Cool                3/9    3/10

With Laplace Estimator … Inferring Probabilities from Observed

Humidity \ Play   Yes    No
High              4/8   4/9
Normal            4/8   5/9

Windy \ Play      Yes    No
False             3/8   7/9
True              5/8   2/9

With Laplace Estimator …, the attribute to be predicted … the proportion of days that were yes and no:

         Yes     No
Play     7/15   8/15

Now, predict the test instance: rainy, mild, high, true

Probability of Yes = 1/9 * 4/9 * 4/8 * 5/8 * 7/15 = 560 / 77,760 ≈ 0.007
Probability of No = 5/10 * 3/10 * 4/9 * 2/9 * 8/15 = 960 / 121,500 ≈ 0.008

No is now slightly more likely, so we predict no – which matches the held-out record.
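Again the arithmetic can be checked directly, this time with the Laplace-smoothed ratios:

```python
# Laplace-smoothed Naive Bayes scores for the test instance (rainy, mild, high, true).
p_yes = (1/9) * (4/9) * (4/8) * (5/8) * (7/15)    # = 560 / 77760
p_no  = (5/10) * (3/10) * (4/9) * (2/9) * (8/15)  # = 960 / 121500

print(round(p_yes, 3))   # 0.007
print(round(p_no, 3))    # 0.008
print(p_no > p_yes)      # True -- so the prediction is play = no
```

Note that smoothing rescued "yes" from probability zero, but "no" still wins, by a small margin.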

In a 14-fold cross-validation, this would continue 13 more times. Let's run WEKA on this … NaiveBayesSimple …

WEKA results – first look near the bottom:

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances       9       64.2857 %
Incorrectly Classified Instances     5       35.7143 %

On the cross-validation it got 9 out of 14 tests correct – the same as OneR.

More Detailed Results

=== Confusion Matrix ===
  a  b   <-- classified as
  3  3 |  a = yes
  2  6 |  b = no

Here we see:
– the program predicted play=yes 5 times; on 3 of those it was correct – it is predicting yes less often than OneR did
– the program predicted play=no 9 times; on 6 of those it was correct
– there were 6 instances whose actual value was play=yes; the program correctly predicted 3 of them
– there were 8 instances whose actual value was play=no; the program correctly predicted 6 of them

Again, part of our purpose is to have a take-home message for humans – not 14 take-home messages! So instead of reporting what was learned on each of the 14 training sets, the program runs again on all of the data and builds a model from that – a take-home message. However, for naïve Bayes, the take-home message is less easily interpreted …

WEKA - Take-Home

Naive Bayes (simple)

Class yes: P(C) = …
  Attribute outlook:      P(sunny | yes), P(overcast | yes), P(rainy | yes)
  Attribute temperature:  P(hot | yes), P(mild | yes), P(cool | yes)
  Attribute humidity:     P(high | yes) = P(normal | yes) = 0.5
  Attribute windy:        P(TRUE | yes), P(FALSE | yes)

Class no: P(C) = …
  – the same layout: the conditional probability of each attribute value, given class no

Let's Try WEKA Naïve Bayes on njcrimenominal

Try 10-fold:

=== Confusion Matrix ===
  a   b   <-- classified as
  6   1 |  a = bad
  7  18 |  b = ok

This represents a slight improvement over OneR (probably not significant). We note that OneR chose unemployment as the attribute to use. With the probabilities, note for bad crime:

  Attribute unemploy:  hi, med, low …

… while for ok crime:

  Attribute unemploy:  hi, med, low

Naïve Bayes – Missing Values

Training data – missing values are simply not included in the frequency counts; probability ratios are based on the number of values actually observed rather than the total number of instances.

Test data – the calculations omit the missing attribute. E.g.:
  Prob(yes | sunny, ?, high, false) = 5/9 x 4/8 x 3/8 x 7/15 (skipping temperature)
Since the term is omitted for each class, this is not a problem.
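A sketch of the test-time rule, with the dictionary layout and `score` helper being assumptions of this example, not WEKA's API. It uses the Laplace-smoothed yes-class ratios from the earlier slides:

```python
# Scoring with a missing attribute: terms for missing values are simply
# left out of the product (the same omission happens for every class).
def score(probs, instance):
    """probs: attr -> {value: P(value | class)}; instance: attr -> value or None."""
    p = 1.0
    for attr, value in instance.items():
        if value is not None:          # skip missing attributes
            p *= probs[attr][value]
    return p

# Laplace-smoothed P(value | yes) from the earlier tables
p_given_yes = {
    "outlook": {"sunny": 5/9, "overcast": 3/9, "rainy": 1/9},
    "temperature": {"hot": 2/9, "mild": 4/9, "cool": 3/9},
    "humidity": {"high": 4/8, "normal": 4/8},
    "windy": {"false": 3/8, "true": 5/8},
}

# Test instance (sunny, ?, high, false) with temperature missing
x = {"outlook": "sunny", "temperature": None, "humidity": "high", "windy": "false"}
likelihood = score(p_given_yes, x) * 7/15   # = 5/9 * 4/8 * 3/8 * 7/15
print(round(likelihood, 4))
```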

Naïve Bayes – Numeric Values

Assume the values fit a "normal" curve. Calculate the mean and standard deviation for each class. Known properties of normal curves let us use the "probability density function" to calculate a density from a value, the mean, and the standard deviation. The book has the equation on p. 87 – don't memorize it; look it up if you are writing the program.
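The density function is short enough to sketch here; the mean 73 and standard deviation 6.2 below are illustrative numbers, not computed from this lecture's data:

```python
import math

def normal_pdf(x, mean, std):
    """Probability density of a normal curve with the given mean and std at x."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# e.g. a temperature of 66 for a class whose temperatures have mean 73, std 6.2
print(round(normal_pdf(66, 73, 6.2), 4))   # 0.034
```

The resulting density is used in place of the count ratio as that attribute's factor in the product.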

Naïve Bayes – Discussion

Naïve Bayes frequently does as well as or better than sophisticated classification algorithms on real datasets – despite its assumptions being violated.

Clearly, redundant attributes hurt performance, because they have the effect of counting an attribute more than once (e.g. at a school with a very high percentage of "traditional students", age and year in school are redundant).

Many correlated or redundant attributes make Naïve Bayes a poor choice for a dataset (unless preprocessing removes them).

Numeric data known not to follow a normal distribution can be handled using the appropriate other distribution (e.g. Poisson) or, if the distribution is unknown, a generic "kernel density estimation".

Example: My Weather (Nominal)

Outlook   Temp  Humid   Windy  Play?
sunny     hot   high    FALSE  no
sunny     hot   high    TRUE   yes
overcast  hot   high    FALSE  no
rainy     mild  high    FALSE  no
rainy     cool  normal  FALSE  no
rainy     cool  normal  TRUE   no
overcast  cool  normal  TRUE   yes
sunny     mild  high    FALSE  yes
sunny     cool  normal  FALSE  yes
rainy     mild  normal  FALSE  no
sunny     mild  normal  TRUE   yes
overcast  mild  high    TRUE   yes
overcast  hot   normal  FALSE  no
rainy     mild  high    TRUE   no

Class Exercise

Let’s run WEKA NaiveBayesSimple on japanbank

End Section 4.2