Naïve Bayes Classifier

Presentation transcript:

Naïve Bayes Classifier (Thomas Bayes, 1702-1761). We will start off with a visual intuition, before looking at the math…

Remember this example? Let's get lots more data… [Figure: Grasshoppers and Katydids plotted by Antenna Length vs. Abdomen Length.]

With a lot of data, we can build a histogram. Let us just build one for "Antenna Length" for now… [Figure: histograms of Antenna Length for Katydids and Grasshoppers.]

We can leave the histograms as they are, or we can summarize them with two normal distributions. Let us use two normal distributions for ease of visualization in the following slides…

We want to classify an insect we have found. Its antennae are 3 units long. How can we classify it? We can just ask ourselves: given the distributions of antennae lengths we have seen, is it more probable that our insect is a Grasshopper or a Katydid? There is a formal way to discuss the most probable classification…

p(cj | d) = probability of class cj, given that we have observed d

[Figure: the two distributions, with the observation marked at antennae length 3.]

p(cj | d) = probability of class cj, given that we have observed d

P(Grasshopper | 3) = 10 / (10 + 2) = 0.833
P(Katydid | 3) = 2 / (10 + 2) = 0.167

[Figure: at antennae length 3, the histogram heights are 10 (Grasshopper) and 2 (Katydid).]

p(cj | d) = probability of class cj, given that we have observed d

P(Grasshopper | 7) = 3 / (3 + 9) = 0.250
P(Katydid | 7) = 9 / (3 + 9) = 0.750

[Figure: at antennae length 7, the histogram heights are 3 (Grasshopper) and 9 (Katydid).]

p(cj | d) = probability of class cj, given that we have observed d

P(Grasshopper | 5) = 6 / (6 + 6) = 0.500
P(Katydid | 5) = 6 / (6 + 6) = 0.500

[Figure: at antennae length 5, the histogram heights are equal (6 and 6).]
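The same histogram-count reasoning is easy to put into code. Below is a minimal Python sketch; the counts are the ones read off the slides above (and are only placeholders for the full histograms), but in practice they would come from your training data:

```python
# Histogram counts of antenna length, read off the slides above
# (placeholders for the full histograms).
counts = {
    "Grasshopper": {3: 10, 5: 6, 7: 3},
    "Katydid":     {3: 2,  5: 6, 7: 9},
}

def classify(antenna_length):
    """Return P(class | antenna_length) from raw histogram counts."""
    heights = {c: counts[c].get(antenna_length, 0) for c in counts}
    total = sum(heights.values())
    if total == 0:
        return None  # no training data at this length
    return {c: h / total for c, h in heights.items()}

print(classify(3))  # {'Grasshopper': 0.833..., 'Katydid': 0.167...}
print(classify(7))  # {'Grasshopper': 0.25, 'Katydid': 0.75}
print(classify(5))  # {'Grasshopper': 0.5, 'Katydid': 0.5}
```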

Bayes Classifiers. That was a visual intuition for a simple case of the Bayes classifier, also called: Idiot Bayes, Naïve Bayes, Simple Bayes. We are about to see some of the mathematical formalisms, and more examples, but keep in mind the basic idea: find out the probability of the previously unseen instance belonging to each class, then simply pick the most probable class.

Bayes Classifiers. Bayesian classifiers use Bayes theorem, which says

p(cj | d) = p(d | cj) p(cj) / p(d)

p(cj | d) = probability of instance d being in class cj. This is what we are trying to compute.
p(d | cj) = probability of generating instance d given class cj. We can imagine that being in class cj causes you to have feature d with some probability.
p(cj) = probability of occurrence of class cj. This is just how frequent the class cj is in our database.
p(d) = probability of instance d occurring. This can actually be ignored, since it is the same for all classes.

Assume that we have two classes: c1 = male, and c2 = female. We have a person whose sex we do not know, say "drew" or d. Classifying drew as male or female is equivalent to asking: is it more probable that drew is male or female? I.e., which is greater, p(male | drew) or p(female | drew)? (Note: "Drew" can be a male or a female name, e.g. Drew Barrymore, Drew Carey.)

p(male | drew) = p(drew | male) p(male) / p(drew)

What is the probability of being called "drew" given that you are a male? What is the probability of being a male? What is the probability of being named "drew"? (Actually irrelevant, since it is the same for all classes.)

This is Officer Drew (who arrested me in 1997). Is Officer Drew a Male or Female? Luckily, we have a small database with names and sex. We can use it to apply Bayes rule…

Name     Sex
Drew     Male
Claudia  Female
Drew     Female
Drew     Female
Alberto  Male
Karin    Female
Nina     Female
Sergio   Male

p(cj | d) = p(d | cj) p(cj) / p(d)

p(cj | d) = p(d | cj) p(cj) / p(d)

From the table: 3 of the 8 people are male and 5 are female; 1 of the 3 males and 2 of the 5 females are named Drew, so p(drew) = 3/8.

p(male | drew) = (1/3 * 3/8) / (3/8) = 0.125 / (3/8)
p(female | drew) = (2/5 * 5/8) / (3/8) = 0.250 / (3/8)

Officer Drew is more likely to be a Female.

Officer Drew IS a female!

p(male | drew) = (1/3 * 3/8) / (3/8) = 0.125 / (3/8)
p(female | drew) = (2/5 * 5/8) / (3/8) = 0.250 / (3/8)
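Here is a minimal Python sketch of that calculation. The probabilities are the ones read off the name/sex table above; the division by p(drew) is included for completeness, but as noted it does not change which class wins:

```python
# Probabilities read off the name/sex table above.
p_drew_given = {"male": 1/3, "female": 2/5}   # p(name = drew | sex)
p_class      = {"male": 3/8, "female": 5/8}   # p(sex)
p_drew       = 3/8                            # p(name = drew), same for both classes

# Bayes rule: p(sex | drew) = p(drew | sex) * p(sex) / p(drew)
posterior = {c: p_drew_given[c] * p_class[c] / p_drew for c in p_class}
print(posterior)                          # {'male': 0.333..., 'female': 0.666...}
print(max(posterior, key=posterior.get))  # 'female': Officer Drew is classified as female
```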

So far we have only considered Bayes classification when we have one attribute (the "antennae length", or the "name"). But we may have many features. How do we use all the features?

p(cj | d) = p(d | cj) p(cj) / p(d)

Name     Over 170cm  Eye    Hair length  Sex
Drew     No          Blue   Short        Male
Claudia  Yes         Brown  Long         Female
Alberto  …           …      …            …
Karin    …           …      …            …
Nina     …           …      …            …
Sergio   …           …      …            …

To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate

p(d | cj) = p(d1 | cj) * p(d2 | cj) * … * p(dn | cj)

The probability of class cj generating instance d equals the probability of class cj generating the observed value for feature 1, multiplied by the probability of class cj generating the observed value for feature 2, multiplied by…

To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate

p(d | cj) = p(d1 | cj) * p(d2 | cj) * … * p(dn | cj)

Officer Drew is blue-eyed, over 170cm tall, and has long hair.

p(officer drew | cj) = p(over_170cm = yes | cj) * p(eye = blue | cj) * …
p(officer drew | Female) = 2/5 * 3/5 * …
p(officer drew | Male) = 2/3 * 2/3 * …
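A sketch of that product in Python. The per-feature conditional probabilities are the two pairs that appear on the slide (2/5 and 3/5 for Female, 2/3 and 2/3 for Male); the hair-length factor is omitted because its value is not given, so these are only the partial products shown above:

```python
# Per-feature conditional probabilities read off the slide (hair length omitted).
cond_probs = {
    "Female": {"over_170cm=yes": 2/5, "eye=blue": 3/5},
    "Male":   {"over_170cm=yes": 2/3, "eye=blue": 2/3},
}

def naive_likelihood(features, cls):
    """p(d | class) under the naive independence assumption: product of per-feature terms."""
    p = 1.0
    for f in features:
        p *= cond_probs[cls][f]
    return p

observed = ["over_170cm=yes", "eye=blue"]   # Officer Drew's observed feature values
for cls in cond_probs:
    print(cls, naive_likelihood(observed, cls))
# Female 0.24 (2/5 * 3/5), Male 0.444... (2/3 * 2/3)
```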

The naïve Bayes classifier is often represented as this type of graph… Note the direction of the arrows, which state that each class causes certain features, with a certain probability. [Figure: a class node cj with arrows pointing to the feature nodes, labelled p(d1|cj), p(d2|cj), …, p(dn|cj).]

Naïve Bayes is fast and space efficient. We can look up all the probabilities with a single scan of the database and store them in a (small) table…

Sex     Over 190cm
Male    Yes 0.15   No 0.85
Female  Yes 0.01   No 0.99

Sex     Long hair
Male    Yes 0.05   No 0.95
Female  Yes 0.70   No 0.30

Sex
Male    …
Female  …

[Figure: the class node cj with arrows to p(d1|cj), …, p(dn|cj), each annotated with its probability table.]
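A minimal sketch of the "single scan" idea: one pass over the records counts every (class, feature, value) combination, and classification afterwards is just table lookups and multiplications. The toy records here are hypothetical placeholders, not data from the lecture:

```python
from collections import defaultdict

# Hypothetical toy records (features, class); in practice these come from the database.
records = [
    ({"over_190cm": "no",  "long_hair": "no"},  "Male"),
    ({"over_190cm": "no",  "long_hair": "yes"}, "Female"),
    ({"over_190cm": "yes", "long_hair": "no"},  "Male"),
    ({"over_190cm": "no",  "long_hair": "yes"}, "Female"),
]

class_counts = defaultdict(int)
feature_counts = defaultdict(int)  # keyed by (class, feature, value)

# The "single scan": one pass over the database counts everything we will ever need.
for features, cls in records:
    class_counts[cls] += 1
    for f, v in features.items():
        feature_counts[(cls, f, v)] += 1

def classify(features):
    """Pick the class maximising p(class) * prod_i p(feature_i = value_i | class)."""
    best, best_score = None, -1.0
    for cls, count in class_counts.items():
        score = count / len(records)                       # p(class)
        for f, v in features.items():
            score *= feature_counts[(cls, f, v)] / count   # p(feature = value | class)
        if score > best_score:
            best, best_score = cls, score
    return best

print(classify({"over_190cm": "no", "long_hair": "yes"}))  # Female
```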

Naïve Bayes is NOT sensitive to irrelevant features… Suppose we are trying to classify a person's sex based on several features, including eye color. (Of course, eye color is completely irrelevant to a person's gender.)

p(Jessica | cj) = p(eye = brown | cj) * p(wears_dress = yes | cj) * …
p(Jessica | Female) = 9,000/10,000 * 9,975/10,000 * …
p(Jessica | Male) = 9,001/10,000 * 2/10,000 * …

The eye-color factors (9,000/10,000 vs. 9,001/10,000) are almost the same for both classes, so this irrelevant feature barely affects the result. However, this assumes that we have good enough estimates of the probabilities, so the more data the better.

An obvious point: I have used a simple two-class problem, and two possible values for each feature, in my previous examples. However, we can have an arbitrary number of classes, or feature values…

Animal  Mass > 10kg
Cat     Yes 0.15   No 0.85
Dog     Yes 0.91   No 0.09
Pig     Yes 0.99   No 0.01

Animal  Color
Cat     Black 0.33   White 0.23   Brown 0.44
Dog     Black 0.97   White 0.03   Brown 0.90
Pig     Black 0.04   White 0.01   Brown 0.95

Animal
Cat     …
Dog     …
Pig     …

[Figure: the class node cj with arrows to p(d1|cj), …, p(dn|cj).]
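The same lookup-and-multiply machinery handles any number of classes and feature values. A sketch using the Mass > 10kg table above; the class priors are not given on the slide, so equal priors are assumed here purely for illustration:

```python
# p(mass > 10kg = value | animal), read off the table above.
p_mass = {
    "Cat": {"yes": 0.15, "no": 0.85},
    "Dog": {"yes": 0.91, "no": 0.09},
    "Pig": {"yes": 0.99, "no": 0.01},
}
# Assumption: the class priors are not given on the slide, so use equal priors.
prior = {"Cat": 1/3, "Dog": 1/3, "Pig": 1/3}

def classify(mass_over_10kg):
    """Return the most probable animal and the (unnormalised) class scores."""
    scores = {a: prior[a] * p_mass[a][mass_over_10kg] for a in prior}
    return max(scores, key=scores.get), scores

print(classify("no"))   # ('Cat', ...): a light animal is most probably a cat
print(classify("yes"))  # ('Pig', ...): a heavy animal is most probably a pig
```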

Naïve Bayesian Classifier. Problem! Naïve Bayes assumes independence of features… Height and weight are clearly not independent: someone over 6 foot is much more likely to be over 200 pounds.

Sex     Over 6 foot
Male    Yes 0.15   No 0.85
Female  Yes 0.01   No 0.99

Sex     Over 200 pounds
Male    Yes 0.11   No 0.80
Female  Yes 0.05   No 0.95

[Figure: the naïve Bayesian classifier graph, p(d|cj) decomposed into p(d1|cj), p(d2|cj), …, p(dn|cj).]

Naïve Bayesian Classifier. Solution: consider the relationships between attributes…

Sex     Over 6 foot
Male    Yes 0.15   No 0.85
Female  Yes 0.01   No 0.99

Sex     Over 200 pounds (conditioned on Over 6 foot)
Male    Yes and Over 6 foot 0.11       No and Over 6 foot 0.59
        Yes and NOT Over 6 foot 0.05   No and NOT Over 6 foot 0.35
Female  0.01 …

[Figure: the classifier graph, with an arc added between the height and weight attributes.]
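In code, the only change from the naïve version is that the weight factor is looked up conditioned on both the class and the height attribute, i.e. p(over_200lb | sex, over_6ft) instead of p(over_200lb | sex). A minimal sketch; the height probabilities come from the first table above, but the conditional weight probabilities below are hypothetical, since the slide's table is only partially legible:

```python
# p(over_6ft = yes | sex), values from the first table above.
p_tall = {"Male": 0.15, "Female": 0.01}

# p(over_200lb = yes | sex, over_6ft): hypothetical values illustrating the dependence
# (being over 6 foot makes being over 200 pounds much more likely, for either sex).
p_heavy_given_tall = {
    ("Male", True): 0.60,   ("Male", False): 0.08,
    ("Female", True): 0.30, ("Female", False): 0.02,
}

def likelihood(sex, tall, heavy):
    """p(height, weight | sex), keeping the height -> weight dependence."""
    p_t = p_tall[sex] if tall else 1.0 - p_tall[sex]
    p_h = p_heavy_given_tall[(sex, tall)]
    if not heavy:
        p_h = 1.0 - p_h
    return p_t * p_h

print(likelihood("Male", True, True))      # 0.15 * 0.60 = 0.09
print(likelihood("Female", False, False))  # 0.99 * 0.98 = 0.9702
```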

Naïve Bayesian Classifier. Solution: consider the relationships between attributes… But how do we find the set of connecting arcs?? [Figure: the classifier graph, p(d|cj) decomposed into p(d1|cj), p(d2|cj), …, p(dn|cj), with arcs between dependent attributes.]

The Naïve Bayesian Classifier has a piecewise quadratic decision boundary. [Figure: decision regions for three classes: Grasshoppers, Katydids, Ants. Adapted from a slide by Ricardo Gutierrez-Osuna.]

[Figure: one second of audio from the laser sensor; only Bombus impatiens (Common Eastern Bumble Bee) is in the insectary. The trace shows background noise until the bee begins to cross the laser. Below it, the single-sided amplitude spectrum of Y(t): a peak at 197 Hz with harmonics, plus 60 Hz interference; axes |Y(f)| vs. Frequency (Hz), 100-1000 Hz.]

[Figure: three further single-sided amplitude spectra, |Y(f)| vs. Frequency (Hz), 100-1000 Hz.]

[Figure: wing beat frequency distributions, 100-700 Hz.]

Anopheles stephensi: Female, mean = 475 Hz, Std = 30
Aedes aegypti: Female, mean = 567 Hz, Std = 43

If I see an insect with a wingbeat frequency of 500, what is it? [Figure: the two wingbeat-frequency distributions over 400-700 Hz, with a marker at 517 Hz.]
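With the two distributions summarised by their means and standard deviations, answering "what is an insect with a 500 Hz wingbeat?" is one density evaluation per class. A minimal sketch (equal priors assumed, since none are given on the slide):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return exp(-(x - mean) ** 2 / (2 * std ** 2)) / (std * sqrt(2 * pi))

classes = {
    "Anopheles stephensi": (475, 30),
    "Aedes aegypti":       (567, 43),
}

x = 500  # observed wingbeat frequency in Hz
likelihoods = {name: normal_pdf(x, m, s) for name, (m, s) in classes.items()}
print(likelihoods)
print(max(likelihoods, key=likelihoods.get))  # Anopheles stephensi: 500 is closer to 475
```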

What is the error rate? At the 517 Hz decision point, 8.02% of the area under the red curve and 12.2% of the area under the pink curve lie on the wrong side. Can we get more features?
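Those two percentages are just tail areas of the normal distributions on either side of the 517 Hz cut-off, and can be checked numerically before we go looking for more features. A sketch with scipy, assuming the red curve is the Anopheles distribution and the pink one the Aedes distribution, which matches the numbers:

```python
from scipy.stats import norm

threshold = 517  # the decision point between the two distributions, in Hz

# Anopheles misclassified as Aedes: area of the Anopheles curve above the threshold.
p_anopheles_error = 1 - norm.cdf(threshold, loc=475, scale=30)
# Aedes misclassified as Anopheles: area of the Aedes curve below the threshold.
p_aedes_error = norm.cdf(threshold, loc=567, scale=43)

print(f"{p_anopheles_error:.2%}")  # roughly 8%, close to the slide's 8.02%
print(f"{p_aedes_error:.2%}")      # roughly 12.2%
```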

Circadian Features: Aedes aegypti (yellow fever mosquito). [Figure: activity over a 24-hour period (midnight to noon to midnight), with dawn and dusk marked.]

Suppose I observe an insect with a wingbeat frequency of 420 Hz. What is it? [Figure: the wingbeat-frequency distributions, 400-700 Hz, with the observation marked at 420 Hz.]

Suppose I observe an insect with a wingbeat frequency of 420 Hz at 11:00am. What is it? [Figure: the wingbeat-frequency distributions (400-700 Hz) and the circadian activity plot (midnight to noon to midnight).]

Suppose I observe an insect with a wingbeat frequency of 420 Hz at 11:00am. What is it?

p(Culex | [420Hz, 11:00am]) = (6 / (6 + 6 + 0)) * (2 / (2 + 4 + 3)) = 0.111
p(Anopheles | [420Hz, 11:00am]) = (6 / (6 + 6 + 0)) * (4 / (2 + 4 + 3)) = 0.222
p(Aedes | [420Hz, 11:00am]) = (0 / (6 + 6 + 0)) * (3 / (2 + 4 + 3)) = 0.000

[Figure: the wingbeat-frequency and circadian histograms from which the counts are read.]
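The same two-feature product in Python, using the histogram counts shown in the calculation above (6/6/0 insects of each species near 420 Hz, and 2/4/3 of each species active at 11:00am):

```python
# Counts read off the two histograms: near 420 Hz, and at 11:00am.
wingbeat_counts = {"Culex": 6, "Anopheles": 6, "Aedes": 0}   # insects near 420 Hz
time_counts     = {"Culex": 2, "Anopheles": 4, "Aedes": 3}   # insects active at 11:00am

def score(species):
    """Product of the per-feature proportions, as in the slide's calculation."""
    p_wing = wingbeat_counts[species] / sum(wingbeat_counts.values())
    p_time = time_counts[species] / sum(time_counts.values())
    return p_wing * p_time

for s in wingbeat_counts:
    print(s, round(score(s), 3))
# Culex 0.111, Anopheles 0.222, Aedes 0.0 -> the insect is most probably Anopheles
```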

Which of the “Pigeon Problems” can be solved by a decision tree? [Figure: the “Pigeon Problem” scatter plots, two on 0-10 axes and one on 0-100 axes.]

Dear SIR, I am Mr. John Coleman and my sister is Miss Rose Colemen, we are the children of late Chief Paul Colemen from Sierra Leone. I am writing you in absolute confidence primarily to seek your assistance to transfer our cash of twenty one Million Dollars ($21,000.000.00) now in the custody of a private Security trust firm in Europe the money is in trunk boxes deposited and declared as family valuables by my late father as a matter of fact the company does not know the content as money, although my father made them to under stand that the boxes belongs to his foreign partner. …

This mail is probably spam. The original message has been attached along with this report, so you can recognize or block similar unwanted mail in future. See http://spamassassin.org/tag/ for more details.

Content analysis details: (12.20 points, 5 required)

NIGERIAN_SUBJECT2 (1.4 points) Subject is indicative of a Nigerian spam
FROM_ENDS_IN_NUMS (0.7 points) From: ends in numbers
MIME_BOUND_MANY_HEX (2.9 points) Spam tool pattern in MIME boundary
URGENT_BIZ (2.7 points) BODY: Contains urgent matter
US_DOLLARS_3 (1.5 points) BODY: Nigerian scam key phrase ($NN,NNN,NNN.NN)
DEAR_SOMETHING (1.8 points) BODY: Contains 'Dear (something)'
BAYES_30 (1.6 points) BODY: Bayesian classifier says spam probability is 30 to 40% [score: 0.3728]

Advantages/Disadvantages of Naïve Bayes

Advantages:
Fast to train (single scan). Fast to classify.
Not sensitive to irrelevant features.
Handles real and discrete data.
Handles streaming data well.

Disadvantages:
Assumes independence of features.

Summary of Classification. We have seen four major classification techniques: the simple linear classifier, nearest neighbor, decision trees, and naïve Bayes. There are other techniques: neural networks, support vector machines, genetic algorithms… In general, there is no one best classifier for all problems. You have to consider what you hope to achieve, and the data itself… Let us now move on to the other classic problem of data mining and machine learning, clustering…