Classification: Naïve Bayes Classifier


1 Classification: Naïve Bayes Classifier
© Tan, Steinbach, Kumar, Introduction to Data Mining

2 Joint, Marginal, Conditional Probability…
We want to determine the probabilities of events that result from combining other events in various ways. There are several types of combinations and relationships between events:
Complement of an event [everything other than that event]
Intersection of two events [event A and event B], written A*B
Union of two events [event A or event B], written A+B

3 Example of Joint Probability
Why are some mutual fund managers more successful than others? One possible factor is where the manager earned his or her MBA. The following table compares mutual fund performance against the ranking of the school where the fund manager earned their MBA:

                           Fund outperforms the market    Fund doesn't outperform the market
Top-20 MBA program                   .11                              .29
Not a top-20 MBA program             .06                              .54

E.g. .11 is the probability that a mutual fund outperforms the market AND the manager was in a top-20 MBA program; it's a joint probability [intersection].

4 Example of Joint Probability
Alternatively, we can introduce shorthand notation to represent the events:
A1 = Fund manager graduated from a top-20 MBA program
A2 = Fund manager did not graduate from a top-20 MBA program
B1 = Fund outperforms the market
B2 = Fund does not outperform the market

        B1     B2
A1     .11    .29
A2     .06    .54

E.g. P(A2 and B1) = .06 = the probability a fund outperforms the market and the manager isn't from a top-20 school.

5 Marginal Probabilities…
Marginal probabilities are computed by adding across rows and down columns; that is, they are calculated in the margins of the table:

           B1     B2     P(Ai)
A1        .11    .29      .40
A2        .06    .54      .60
P(Bj)     .17    .83     1.00

P(A2) = .60 answers "what's the probability a fund manager isn't from a top school?"
P(B1) = .17 answers "what's the probability a fund outperforms the market?"
Both margins must add to 1 (a useful error check).
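A minimal Python sketch (not part of the slides) that computes the marginal probabilities by summing the joint table across rows and down columns; the dictionary layout is our own.

```python
joint = {
    ("A1", "B1"): 0.11, ("A1", "B2"): 0.29,   # top-20 MBA program
    ("A2", "B1"): 0.06, ("A2", "B2"): 0.54,   # not a top-20 MBA program
}

# sum across each row (marginal of A) and down each column (marginal of B)
p_a = {a: sum(p for (ai, _), p in joint.items() if ai == a) for a in ("A1", "A2")}
p_b = {b: sum(p for (_, bj), p in joint.items() if bj == b) for b in ("B1", "B2")}

print(p_a)   # A1 ~ 0.40, A2 ~ 0.60
print(p_b)   # B1 ~ 0.17, B2 ~ 0.83
assert abs(sum(p_a.values()) - 1.0) < 1e-9   # both margins must add to 1
```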

6 Conditional Probability…
Conditional probability is used to determine how two events are related; that is, we can determine the probability of one event given the occurrence of another related event.
Experiment: randomly select one student in the class.
P(randomly selected student is male)
P(randomly selected student is male | student is in the 3rd row)
Conditional probabilities are written as P(A | B), read as "the probability of A given B", and are calculated as:
P(A | B) = P(A and B) / P(B)

7 Conditional Probability…
Again, the probability of an event given that another event has occurred is called a conditional probability. The multiplication rule follows directly:
P(A and B) = P(A) * P(B | A) = P(B) * P(A | B)
Both forms are true. Keep this in mind!

8 Conditional Probability…
Example 6.2: What's the probability that a fund will outperform the market given that the manager graduated from a top-20 MBA program? Recall:
A1 = Fund manager graduated from a top-20 MBA program
A2 = Fund manager did not graduate from a top-20 MBA program
B1 = Fund outperforms the market
B2 = Fund does not outperform the market
Thus, we want to know "what is P(B1 | A1)?"

9 Conditional Probability…
We want to calculate P(B1 | A1).

           B1     B2     P(Ai)
A1        .11    .29      .40
A2        .06    .54      .60
P(Bj)     .17    .83     1.00

P(B1 | A1) = P(A1 and B1) / P(A1) = .11 / .40 = .275
Thus, there is a 27.5% chance that a fund will outperform the market given that the manager graduated from a top-20 MBA program.
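A two-line sketch of the same conditional probability, using the joint and marginal values from the table (P(A1 and B1) = .11, P(A1) = .40).

```python
p_a1_and_b1 = 0.11
p_a1 = 0.40

p_b1_given_a1 = p_a1_and_b1 / p_a1
print(round(p_b1_given_a1, 3))   # 0.275
```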

10 Independence… One of the objectives of calculating conditional probability is to determine whether two events are related. In particular, we would like to know whether they are independent, that is, whether the probability of one event is unaffected by the occurrence of the other event.
Two events A and B are said to be independent if P(A | B) = P(A) and P(B | A) = P(B).
E.g. P(you have a flat tire going home | the radio quits working) = P(you have a flat tire going home), since the radio has no bearing on the tire.

11 Are B1 and A1 Independent?

12 Independence… For example, we saw that P(B1 | A1) = .275, while the marginal probability for B1 is P(B1) = .17.
Since P(B1 | A1) ≠ P(B1), B1 and A1 are not independent events. Stated another way, they are dependent: the probability of one event (B1) is affected by the occurrence of the other event (A1).
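A tiny sketch of the independence check above: compare P(B1 | A1) with P(B1).

```python
p_b1_given_a1 = 0.11 / 0.40   # ~0.275
p_b1 = 0.17

print(abs(p_b1_given_a1 - p_b1) < 1e-9)   # False -> B1 and A1 are dependent
```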

13 Determine the probability that a fund outperforms (B1) or the manager graduated from a top-20 MBA program (A1).

14 Union… Determine the probability that a fund outperforms (B1) or the manager graduated from a top-20 MBA program (A1). A1 or B1 occurs whenever:
A1 and B1 occurs,
A1 and B2 occurs, or
A2 and B1 occurs.

           B1     B2     P(Ai)
A1        .11    .29      .40
A2        .06    .54      .60
P(Bj)     .17    .83     1.00

P(A1 or B1) = .11 + .29 + .06 = .46
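A sketch of the same union probability via the addition rule, P(A1 or B1) = P(A1) + P(B1) - P(A1 and B1), which matches summing the three table cells.

```python
p_a1, p_b1, p_a1_and_b1 = 0.40, 0.17, 0.11

p_a1_or_b1 = p_a1 + p_b1 - p_a1_and_b1
print(round(p_a1_or_b1, 2))   # 0.46
```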

15 Data Mining Classification: Naïve Bayes Classifier
© Tan, Steinbach, Kumar, Introduction to Data Mining

16 Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class:
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into a training set and a test set, with the training set used to build the model and the test set used to validate it; a minimal sketch of that split follows.
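A minimal sketch (not from the slides) of the train/test split just described. It assumes scikit-learn is available; the toy X (attributes) and y (class labels) are placeholders.

```python
from sklearn.model_selection import train_test_split

X = [[0, 1], [1, 0], [1, 1], [0, 0], [1, 0], [0, 1]]   # attribute values (toy data)
y = ["yes", "no", "yes", "no", "no", "yes"]             # class labels   (toy data)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42   # hold out 30% of records for testing
)
# Build the model on (X_train, y_train); estimate its accuracy on (X_test, y_test).
```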

17 Illustrating Classification Task

18 Examples of Classification Task
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.

19 Classification Techniques
Decision Tree-based Methods
Rule-based Methods
Memory-based Reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines

20 Example of a Decision Tree
[Figure: training data (attributes Refund, Marital Status, Taxable Income; class Cheat) and the decision tree model built from it. Splitting attributes: Refund, MarSt, TaxInc.]
The tree: if Refund = Yes → NO; if Refund = No and MarSt = Married → NO; if Refund = No, MarSt = Single or Divorced, and TaxInc < 80K → NO; otherwise (TaxInc > 80K) → YES.
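A sketch (ours, not the slides') of the decision tree above written as nested if/else rules. The record is assumed to be a Python dict keyed by the slide's attribute names; the 80K threshold and class labels come from the tree.

```python
def classify(record):
    """Route a record down the tree and return the predicted Cheat label."""
    if record["Refund"] == "Yes":
        return "No"
    # Refund = No: split on marital status
    if record["MarSt"] == "Married":
        return "No"
    # Single or Divorced: split on taxable income
    return "No" if record["TaxInc"] < 80_000 else "Yes"

print(classify({"Refund": "No", "MarSt": "Single", "TaxInc": 95_000}))  # Yes
```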

21 Another Example of Decision Tree
[Figure: an alternative decision tree for the same training data, with MarSt as the root split.]
If MarSt = Married → NO; if MarSt = Single or Divorced, split on Refund: Refund = Yes → NO; Refund = No, split on TaxInc: < 80K → NO, > 80K → YES. There could be more than one tree that fits the same data!

22 Decision Tree Classification Task

23 Apply Model to Test Data
Start from the root of the tree. [Figure: a test record is routed down the tree (Refund, then MarSt, then TaxInc) until it reaches a leaf.]

24-27 Apply Model to Test Data
[Figures: the same tree repeated, with the test record advancing one node per slide (Refund = No, then MarSt = Married).]

28 Apply Model to Test Data
[Figure: the test record reaches the Married branch under MarSt, a leaf labeled NO.] Assign Cheat to "No".

29 Bayes Classifier A probabilistic framework for solving classification problems.
Conditional probability: P(C | A) = P(A, C) / P(A) and P(A | C) = P(A, C) / P(C)
Bayes theorem: P(C | A) = P(A | C) P(C) / P(A)

30 Bayes Theorem Given a hypothesis h and data D which bears on the hypothesis:
P(h): independent probability of h (the prior probability)
P(D): independent probability of D
P(D | h): conditional probability of D given h (the likelihood)
P(h | D): conditional probability of h given D (the posterior probability)

31 Example of Bayes Theorem
Given:
A doctor knows that meningitis causes stiff neck 50% of the time.
The prior probability of any patient having meningitis is 1/50,000.
The prior probability of any patient having stiff neck is 1/20.
If a patient has a stiff neck, what's the probability he/she has meningitis?

32 Example of Bayes Theorem
Given:
A doctor knows that meningitis causes stiff neck 50% of the time.
The prior probability of any patient having meningitis is 1/50,000.
The prior probability of any patient having stiff neck is 1/20.
If a patient has a stiff neck, what's the probability he/she has meningitis?
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
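A small sketch (not on the slide) that plugs the three numbers above into Bayes theorem, P(M | S) = P(S | M) P(M) / P(S).

```python
p_s_given_m = 0.5        # P(stiff neck | meningitis)
p_m = 1 / 50_000         # prior probability of meningitis
p_s = 1 / 20             # prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(round(p_m_given_s, 6))   # 0.0002 -> a stiff neck alone is weak evidence
```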

33 Bayesian Classifiers Consider each attribute and the class label as random variables.
Given a record with attributes (A1, A2, …, An), the goal is to predict class C.
Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An).
Can we estimate P(C | A1, A2, …, An) directly from data?

34 Bayesian Classifiers Approach:
Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem:
P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)
Choose the value of C that maximizes P(C | A1, A2, …, An). This is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C), since the denominator is the same for every class.
How can we estimate P(A1, A2, …, An | C)?

35 The Bayes Classifier P(C | A) = P(A | C) P(C) / P(A), where P(A | C) is the likelihood, P(C) is the prior, and P(A) is the normalization constant.

36 Maximum A Posteriori Based on Bayes theorem, we can compute the Maximum A Posteriori (MAP) hypothesis for the data. We are interested in the best hypothesis from some space H (the set of all hypotheses) given observed training data D:
h_MAP = argmax over h in H of P(h | D) = argmax over h in H of P(D | h) P(h) / P(D) = argmax over h in H of P(D | h) P(h)
Note that we can drop P(D), as the probability of the data is constant (and independent of the hypothesis).
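A toy sketch of the MAP rule above: pick the hypothesis maximizing P(D | h) P(h), ignoring P(D). The hypothesis names and the prior/likelihood numbers are illustrative placeholders, not from the slides.

```python
prior      = {"h1": 0.6, "h2": 0.3, "h3": 0.1}   # P(h)      (made-up values)
likelihood = {"h1": 0.2, "h2": 0.5, "h3": 0.9}   # P(D | h)  (made-up values)

# argmax over h of P(D | h) * P(h); P(D) is constant and can be dropped
h_map = max(prior, key=lambda h: likelihood[h] * prior[h])
print(h_map)   # h2  (0.5 * 0.3 = 0.15 beats 0.12 and 0.09)
```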

37 Naïve Bayes Classifier
Assume independence among the attributes Ai when the class is given:
P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
We can estimate P(Ai | Cj) for all Ai and Cj from the training data.
A new point is classified as Cj if P(Cj) ∏i P(Ai | Cj) is maximal.

38 Example: Play Tennis
[Table: the 14-record Play Tennis training set with attributes Outlook, Temperature, Humidity and Wind, and class Play.]

39 Example: Learning Phase

Outlook       Play=Yes   Play=No
Sunny            2/9        3/5
Overcast         4/9        0/5
Rain             3/9        2/5

Temperature   Play=Yes   Play=No
Hot              2/9        2/5
Mild             4/9        2/5
Cool             3/9        1/5

Humidity      Play=Yes   Play=No
High             3/9        4/5
Normal           6/9        1/5

Wind          Play=Yes   Play=No
Strong           3/9        3/5
Weak             6/9        2/5

P(Play=Yes) = 9/14    P(Play=No) = 5/14

40 Example: Test Phase
Given a new instance x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong), look up the tables from the learning phase:
P(Outlook=Sunny | Play=Yes) = 2/9         P(Outlook=Sunny | Play=No) = 3/5
P(Temperature=Cool | Play=Yes) = 3/9      P(Temperature=Cool | Play=No) = 1/5
P(Humidity=High | Play=Yes) = 3/9         P(Humidity=High | Play=No) = 4/5
P(Wind=Strong | Play=Yes) = 3/9           P(Wind=Strong | Play=No) = 3/5
P(Play=Yes) = 9/14                        P(Play=No) = 5/14
Apply the MAP rule:
P(Yes | x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = (2/9)(3/9)(3/9)(3/9)(9/14) ≈ 0.0053
P(No | x') ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = (3/5)(1/5)(4/5)(3/5)(5/14) ≈ 0.0206
Since P(Yes | x') < P(No | x'), we label x' as "No".
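A sketch of the same test-phase calculation in Python, with the conditional probabilities and priors taken from the learning-phase tables on slide 39.

```python
cond = {
    "Yes": {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9},
    "No":  {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5},
}
prior = {"Yes": 9/14, "No": 5/14}
x_new = ["Sunny", "Cool", "High", "Strong"]   # the new instance x'

scores = {}
for c in ("Yes", "No"):
    score = prior[c]
    for value in x_new:
        score *= cond[c][value]   # naive independence: multiply per-attribute terms
    scores[c] = score

print({c: round(s, 4) for c, s in scores.items()})   # {'Yes': 0.0053, 'No': 0.0206}
print(max(scores, key=scores.get))                   # No  (the MAP label)
```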

41 Naïve Bayes Classifier
If one of the conditional probabilities is zero, then the entire product becomes zero. Probability estimation (Nic = count of records in class Cj with attribute value Ai, Nc = count of records in class Cj):
Original:    P(Ai | Cj) = Nic / Nc
Laplace:     P(Ai | Cj) = (Nic + 1) / (Nc + c),  where c is the number of classes
m-estimate:  P(Ai | Cj) = (Nic + m p) / (Nc + m),  where p is a prior probability and m is a parameter
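A sketch of the Laplace correction above (following the slide's convention that c is the number of classes); the function name is ours.

```python
def laplace(n_ic, n_c, n_classes):
    """(N_ic + 1) / (N_c + c): smoothed estimate of P(Ai | Cj)."""
    return (n_ic + 1) / (n_c + n_classes)

# Outlook = Overcast never occurs with Play = No (0/5 in the earlier table):
print(0 / 5)              # 0.0    -> would zero out the whole product
print(laplace(0, 5, 2))   # ~0.143 -> small but non-zero
```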

42 Problem Solving A: attributes, M: mammals, N: non-mammals
[Table: a training set of creatures with their attributes and class (mammal / non-mammal); given a new creature's attributes A, decide its class.]

43 Solution A: attributes, M: mammals, N: non-mammals
P(A | M) P(M) > P(A | N) P(N), so the new record is classified as a mammal.

44 Classifier Evaluation Metrics: Confusion Matrix
Actual class \ Predicted class    C1                      ¬C1
C1                                True Positives (TP)     False Negatives (FN)
¬C1                               False Positives (FP)    True Negatives (TN)

Example of a confusion matrix:

Actual class \ Predicted class    buy_computer = yes    buy_computer = no    Total
buy_computer = yes                      6954                    46            7000
buy_computer = no                        412                  2588            3000
Total                                   7366                  2634           10000

Given m classes, an entry CM(i, j) in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j. The matrix may have extra rows/columns to provide totals.

45 Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity
A \ P    C     ¬C
C        TP    FN     P
¬C       FP    TN     N
         P'    N'     All

Classifier accuracy, or recognition rate: the percentage of test-set tuples that are correctly classified. Accuracy = (TP + TN) / All
Error rate: 1 - accuracy, or Error rate = (FP + FN) / All
Class imbalance problem: one class may be rare, e.g. fraud or HIV-positive, so there is a significant majority of the negative class and a minority of the positive class.
Sensitivity: true positive recognition rate. Sensitivity = TP / P
Specificity: true negative recognition rate. Specificity = TN / N

46 Measuring Error
Error rate = # of errors / # of instances = (FN + FP) / N
Recall = # of found positives / # of positives = TP / (TP + FN) = sensitivity = hit rate
Precision = # of found positives / # of found = TP / (TP + FP)
Specificity = TN / (TN + FP)
False alarm rate = FP / (FP + TN) = 1 - Specificity
(Lecture notes for E. Alpaydın, Introduction to Machine Learning, © The MIT Press, 2004, v1.1)

