Review

2 Statistical modeling
• “Opposite” of 1R: use all the attributes
• Two assumptions: attributes are
  - equally important
  - statistically independent (given the class value)
• I.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
• Independence assumption is rarely correct!
• But … this scheme works well in practice

3 Probabilities for weather data

Contingency tables (counts):

                        Yes   No
    Outlook = Sunny       2    3
    Outlook = Overcast    4    0
    Outlook = Rainy       3    2
    Temp = Hot            2    2
    Temp = Mild           4    2
    Temp = Cool           3    1
    Humidity = High       3    4
    Humidity = Normal     6    1
    Windy = False         6    2
    Windy = True          3    3
    Play                  9    5

Probabilities (relative frequencies):

                        Yes    No
    Outlook = Sunny      2/9   3/5
    Outlook = Overcast   4/9   0/5
    Outlook = Rainy      3/9   2/5
    Temp = Hot           2/9   2/5
    Temp = Mild          4/9   2/5
    Temp = Cool          3/9   1/5
    Humidity = High      3/9   4/5
    Humidity = Normal    6/9   1/5
    Windy = False        6/9   2/5
    Windy = True         3/9   3/5
    Play                 9/14  5/14

Weather data (the 14 training instances):

    Outlook   Temp  Humidity  Windy  Play
    Sunny     Hot   High      False  No
    Sunny     Hot   High      True   No
    Overcast  Hot   High      False  Yes
    Rainy     Mild  High      False  Yes
    Rainy     Cool  Normal    False  Yes
    Rainy     Cool  Normal    True   No
    Overcast  Cool  Normal    True   Yes
    Sunny     Mild  High      False  No
    Sunny     Cool  Normal    False  Yes
    Rainy     Mild  Normal    False  Yes
    Sunny     Mild  Normal    True   Yes
    Overcast  Mild  High      True   Yes
    Overcast  Hot   Normal    False  Yes
    Rainy     Mild  High      True   No
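
Not part of the original slides: a minimal Python sketch of how the contingency tables and relative frequencies above can be computed from the 14 instances (the variable names and the tuple encoding of the data are illustrative).

    # Build the count tables of slide 3 from the weather data.
    from collections import defaultdict

    # (outlook, temperature, humidity, windy, play)
    weather = [
        ("sunny", "hot", "high", False, "no"),       ("sunny", "hot", "high", True, "no"),
        ("overcast", "hot", "high", False, "yes"),   ("rainy", "mild", "high", False, "yes"),
        ("rainy", "cool", "normal", False, "yes"),   ("rainy", "cool", "normal", True, "no"),
        ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
        ("sunny", "cool", "normal", False, "yes"),   ("rainy", "mild", "normal", False, "yes"),
        ("sunny", "mild", "normal", True, "yes"),    ("overcast", "mild", "high", True, "yes"),
        ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
    ]
    attributes = ["outlook", "temperature", "humidity", "windy"]

    class_counts = defaultdict(int)   # e.g. {"yes": 9, "no": 5}
    counts = defaultdict(int)         # (attribute, value, class) -> count

    for *values, play in weather:
        class_counts[play] += 1
        for attr, value in zip(attributes, values):
            counts[(attr, value, play)] += 1

    def cond_prob(attr, value, cls):
        """Relative frequency P(attribute = value | class), e.g. P(outlook = sunny | yes) = 2/9."""
        return counts[(attr, value, cls)] / class_counts[cls]

    print(cond_prob("outlook", "sunny", "yes"))    # 0.222... (= 2/9)
    print(class_counts["yes"] / len(weather))      # 0.642... (= 9/14, the prior P(yes))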

4 Probabilities for weather data
• A new day:

    Outlook  Temp.  Humidity  Windy  Play
    Sunny    Cool   High      True   ?

• Posterior of the two classes (likelihood × prior), using the tables from the previous slide:

    For “yes” = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
    For “no”  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206

• Conversion into a probability (normalize by the evidence):

    P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
    P(“no”)  = 0.0206 / (0.0053 + 0.0206) = 0.795
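
Not part of the original slides: the slide-4 calculation as a short Python sketch, with the conditional probabilities and priors copied from the tables on slide 3.

    # Naïve Bayes score for the new day (sunny, cool, high, true), for each class.
    likelihood_yes = 2/9 * 3/9 * 3/9 * 3/9 * 9/14   # P(attributes | yes) * P(yes)
    likelihood_no  = 3/5 * 1/5 * 4/5 * 3/5 * 5/14   # P(attributes | no)  * P(no)

    evidence = likelihood_yes + likelihood_no        # normalizing constant P(E)
    print(round(likelihood_yes, 4), round(likelihood_no, 4))   # 0.0053 0.0206
    print(round(likelihood_yes / evidence, 3))       # 0.205 -> P(play = yes | new day)
    print(round(likelihood_no / evidence, 3))        # 0.795 -> P(play = no  | new day)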

5 Bayes’s rule
• Probability of event H given evidence E:

    P(H | E) = P(E | H) P(H) / P(E)

• A priori probability of H: P(H)
  - probability of the event before evidence is seen
• A posteriori probability of H: P(H | E)
  - probability of the event after evidence is seen

Thomas Bayes
Born: 1702 in London, England
Died: 1761 in Tunbridge Wells, Kent, England

6 Naïve Bayes for classification
• Classification learning: what’s the probability of the class given an instance?
  - Evidence E = instance
  - Event H = class value for instance
• Naïve assumption: the evidence splits into parts (i.e. attributes) that are conditionally independent (independent given the class):

    P(H | E) = P(E1 | H) × P(E2 | H) × … × P(En | H) × P(H) / P(E)

7 Weather data example

    Outlook  Temp.  Humidity  Windy  Play
    Sunny    Cool   High      True   ?

• Evidence E: Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True
• Probability of class “yes”:

    P(yes | E) = P(Outlook = Sunny | yes) × P(Temperature = Cool | yes)
                 × P(Humidity = High | yes) × P(Windy = True | yes) × P(yes) / P(E)
               = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E)

8 The “zero-frequency problem”
• What if an attribute value doesn’t occur with every class value? (e.g. “Outlook = Overcast” for class “no”)
  - Probability will be zero!
  - A posteriori probability will also be zero! (No matter how likely the other values are!)
• Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
• Result: probabilities will never be zero! (also: stabilizes probability estimates)
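
Not part of the original slides: a minimal sketch of the Laplace estimator described above (the function name and arguments are illustrative).

    def laplace_prob(value_class_count, class_count, n_values, k=1):
        """P(attribute = value | class) with k added to every value-class count.

        n_values is the number of distinct values the attribute can take."""
        return (value_class_count + k) / (class_count + k * n_values)

    # "Outlook = Overcast" never occurs with class "no" (0 of 5 instances),
    # but the smoothed estimate is no longer zero:
    print(laplace_prob(0, 5, 3))   # (0 + 1) / (5 + 3) = 0.125 instead of 0.0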

9 Modified probability estimates
• In some cases adding a constant µ different from 1 might be more appropriate
• Example: attribute outlook for class yes

    Sunny:    (2 + µ/3) / (9 + µ)
    Overcast: (4 + µ/3) / (9 + µ)
    Rainy:    (3 + µ/3) / (9 + µ)

• Weights don’t need to be equal (but they must sum to 1):

    Sunny:    (2 + µ·p1) / (9 + µ)
    Overcast: (4 + µ·p2) / (9 + µ)
    Rainy:    (3 + µ·p3) / (9 + µ)      with p1 + p2 + p3 = 1
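
Not part of the original slides: the same idea as a small sketch, with a total weight µ and a prior weight p per value (p = 1/3 for each of sunny/overcast/rainy reproduces the equal-weight formulas above).

    def weighted_estimate(value_class_count, class_count, mu, p):
        """(count + mu * p) / (class_count + mu), with the p's over an attribute summing to 1."""
        return (value_class_count + mu * p) / (class_count + mu)

    print(weighted_estimate(2, 9, mu=3.0, p=1/3))   # outlook = sunny | yes: (2 + 1) / (9 + 3) = 0.25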

10 Missing values
• Training: instance is not included in frequency count for attribute value-class combination
• Classification: attribute will be omitted from calculation
• Example:

    Outlook  Temp.  Humidity  Windy  Play
    ?        Cool   High      True   ?

    Posterior of “yes” = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
    Posterior of “no”  = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343

    P(“yes”) = 0.0238 / (0.0238 + 0.0343) = 41%
    P(“no”)  = 0.0343 / (0.0238 + 0.0343) = 59%
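
Not part of the original slides: a sketch of how missing values are skipped at classification time; the attribute names and dictionary layout are illustrative.

    def class_score(instance, cond_probs, prior):
        """Product of the prior and P(value | class) over the attributes whose value is known."""
        score = prior
        for attr, value in instance.items():
            if value is not None:                   # missing value: omit this factor
                score *= cond_probs[(attr, value)]
        return score

    # Conditional probabilities for class "yes" (from slide 3) and the instance above:
    yes_probs = {("temperature", "cool"): 3/9, ("humidity", "high"): 3/9, ("windy", True): 3/9}
    new_day = {"outlook": None, "temperature": "cool", "humidity": "high", "windy": True}
    print(round(class_score(new_day, yes_probs, prior=9/14), 4))   # 0.0238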

11 Numeric attributes
• Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
• The probability density function for the normal distribution is defined by two parameters:
  - sample mean µ (the average of the attribute’s values in that class)
  - standard deviation σ (the square root of the sample variance)
• Then the density function f(x) is

    f(x) = 1 / (√(2π) · σ) · exp( −(x − µ)² / (2σ²) )
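
Not part of the original slides: the density function above in Python.

    import math

    def normal_density(x, mu, sigma):
        """Normal probability density with mean mu and standard deviation sigma, evaluated at x."""
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    # Using the temperature statistics for class "yes" from the next slide (mu = 73, sigma = 6.2):
    print(round(normal_density(66, 73, 6.2), 3))   # 0.034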

12 Statistics for weather data

    Outlook (counts yes/no):      Sunny 2/3,  Overcast 4/0,  Rainy 3/2
    Outlook (probabilities):      Sunny 2/9 | 3/5,  Overcast 4/9 | 0/5,  Rainy 3/9 | 2/5
    Temperature values (yes):     64, 68, 69, 70, 72, …    µ = 73,  σ = 6.2
    Temperature values (no):      65, 71, 72, 80, 85, …    µ = 75,  σ = 7.9
    Humidity values (yes):        65, 70, 70, 75, 80, …    µ = 79,  σ = 10.2
    Humidity values (no):         70, 85, 90, 91, 95, …    µ = 86,  σ = 9.7
    Windy (counts yes/no):        False 6/2,  True 3/3
    Windy (probabilities):        False 6/9 | 2/5,  True 3/9 | 3/5
    Play:                         Yes 9 (9/14),  No 5 (5/14)

• Example density value:

    f(temperature = 66 | yes) = 1 / (√(2π) · 6.2) · exp( −(66 − 73)² / (2 · 6.2²) ) ≈ 0.034

13 Classifying a new day
• A new day:

    Outlook  Temp.  Humidity  Windy  Play
    Sunny    66     90        true   ?

    Likelihood of “yes” = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 ≈ 0.000036
    Likelihood of “no”  = 3/5 × 0.0279 × 0.0380 × 3/5 × 5/14 ≈ 0.000136

    P(“yes”) = 0.000036 / (0.000036 + 0.000136) ≈ 20.9%
    P(“no”)  = 0.000136 / (0.000036 + 0.000136) ≈ 79.1%

  (0.0340 and 0.0221 are the normal densities of temperature = 66 and humidity = 90 for class “yes”; 0.0279 and 0.0380 are the corresponding densities for class “no”)

• Missing values during training are not included in calculation of mean and standard deviation
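
Not part of the original slides: the slide-13 calculation as a sketch, combining the nominal probabilities from slide 3 with normal densities. The means 74.6, 79.1 and 86.2 are assumed unrounded values behind the rounded figures on slide 12, so the printed result matches the 20.9% / 79.1% above only up to rounding.

    import math

    def normal_density(x, mu, sigma):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    # New day: outlook = sunny, temperature = 66, humidity = 90, windy = true.
    like_yes = 2/9 * normal_density(66, 73.0, 6.2) * normal_density(90, 79.1, 10.2) * 3/9 * 9/14
    like_no  = 3/5 * normal_density(66, 74.6, 7.9) * normal_density(90, 86.2, 9.7) * 3/5 * 5/14

    evidence = like_yes + like_no
    print(round(like_yes / evidence, 2))   # about 0.21 -> P(yes)
    print(round(like_no / evidence, 2))    # about 0.79 -> P(no)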

14 Probability densities
• Relationship between probability and density:

    Pr[ c − ε/2 < x < c + ε/2 ] ≈ ε · f(c)

• But: this doesn’t change calculation of a posteriori probabilities because ε cancels out
• Exact relationship:

    Pr[ a ≤ x ≤ b ] = ∫ f(t) dt   (integrated from t = a to t = b)
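
Not part of the original slides: a quick numeric check of the approximation above; for a narrow interval around c, the exact probability (via the normal cumulative distribution) and ε · f(c) are almost the same.

    import math

    def normal_density(x, mu, sigma):
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    def normal_cdf(x, mu, sigma):
        return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

    mu, sigma, c, eps = 73, 6.2, 66, 0.5
    exact = normal_cdf(c + eps / 2, mu, sigma) - normal_cdf(c - eps / 2, mu, sigma)
    approx = eps * normal_density(c, mu, sigma)
    print(round(exact, 5), round(approx, 5))   # nearly identical values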

15 Naïve Bayes: discussion
• Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
• Why? Because classification doesn’t require accurate probability estimates as long as maximum probability is assigned to the correct class
• However: adding too many redundant attributes will cause problems (e.g. identical attributes)
• Note also: many numeric attributes are not normally distributed (→ kernel density estimators could be used, as they don’t assume Gaussianity; see the sketch below)
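
Not part of the original slides: a minimal sketch of a Gaussian kernel density estimator, the kind of alternative to the normality assumption mentioned above. The bandwidth h and the truncated list of sample values are illustrative only.

    import math

    def kde(x, samples, h=5.0):
        """Average of Gaussian kernels of width h centred on the observed values."""
        kernel = lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
        return sum(kernel((x - s) / h) for s in samples) / (len(samples) * h)

    # First few temperature values for class "yes" from slide 12 (the slide truncates the list):
    temps_yes = [64, 68, 69, 70, 72]
    print(round(kde(66, temps_yes), 4))   # density estimate for temperature = 66, class "yes"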