Chapter 6 – Three Simple Classification Methods
© Galit Shmueli and Peter Bruce 2008
Data Mining for Business Intelligence (Shmueli, Patel & Bruce)

Methods & Characteristics
The three methods:
Naïve rule
Naïve Bayes
K-nearest-neighbor
Common characteristics:
Data-driven, not model-driven
Make no assumptions about the data

Naïve Rule
Classify all records as the majority class
Not a “real” method
Serves as a benchmark against which other methods are measured
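As a rough illustration of the benchmark idea (a minimal Python sketch; the label list below is made up for illustration and is not from the slides):

```python
from collections import Counter

# Hypothetical training labels; in practice these come from your training data.
labels = ["owner", "owner", "non-owner", "owner", "non-owner"]

# The naive rule predicts the majority class for every record.
majority_class, count = Counter(labels).most_common(1)[0]
print(majority_class)        # predicted class for all new records
print(count / len(labels))   # benchmark accuracy = majority-class proportion
```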

Naïve Bayes

Naïve Bayes: The Basic Idea
For a given new record to be classified, find other records like it (i.e., same values for the predictors)
What is the prevalent class among those records?
Assign that class to your new record

Usage
Requires categorical variables
Numerical variables must be binned and converted to categorical
Can be used with very large data sets
Example: spell check – the computer attempts to assign your misspelled word to an established “class” (i.e., a correctly spelled word)
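A minimal sketch of the binning step, assuming a pandas DataFrame with a numerical column named income (the column name, values, and bin labels are illustrative, not from the slides):

```python
import pandas as pd

# Hypothetical numerical predictor; real values would come from your data set.
df = pd.DataFrame({"income": [23.0, 61.5, 85.1, 45.0, 110.3]})

# Convert the numerical variable into a categorical one by binning into 3 groups.
df["income_bin"] = pd.cut(df["income"], bins=3, labels=["low", "medium", "high"])
print(df)
```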

Exact Bayes Classifier
Relies on finding other records that share the same predictor values as the record to be classified
Want to find the “probability of belonging to class C, given specified values of predictors”
Even with large data sets, it may be hard to find other records that exactly match your record in terms of predictor values
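In symbols (a restatement of the slide’s description, not copied from it), the exact Bayes estimate for class $C$ given predictor values $x_1, \dots, x_p$ is simply the class proportion among the exactly matching training records:

$$
\hat{P}(C \mid x_1, \dots, x_p) = \frac{\#\{\text{training records in class } C \text{ with values } x_1, \dots, x_p\}}{\#\{\text{training records with values } x_1, \dots, x_p\}}
$$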

Solution – Naïve Bayes
Assume independence of predictor variables (within each class)
Use the multiplication rule
Find the same probability that the record belongs to class C, given predictor values, without limiting the calculation to records that share all those same values
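The multiplication rule mentioned above can be written out explicitly (a standard statement of the naïve Bayes estimate; the notation is added here and is not taken verbatim from the slides):

$$
\hat{P}(C \mid x_1, \dots, x_p) = \frac{\hat{P}(C) \prod_{j=1}^{p} \hat{P}(x_j \mid C)}{\sum_{m} \hat{P}(C_m) \prod_{j=1}^{p} \hat{P}(x_j \mid C_m)}
$$

where each $\hat{P}(x_j \mid C)$ is the proportion of class-$C$ training records having value $x_j$, and $\hat{P}(C)$ is the proportion of class $C$ in the training data.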

Example: Financial Fraud
Target variable: audit finds fraud / no fraud
Predictors:
Prior pending legal charges (yes/no)
Size of firm (small/large)

Exact Bayes Calculations
Goal: classify (as “fraudulent” or as “truthful”) a small firm with charges filed
There are 2 firms like that, one fraudulent and the other truthful
P(fraud | charges = y, size = small) = 1/2 = 0.50
Note: the calculation is limited to the two firms matching those characteristics

Naïve Bayes Calculations
Goal: still classifying a small firm with charges filed
Compute 2 quantities:
Proportion of “charges = y” among frauds, times proportion of “small” among frauds, times proportion of frauds = 3/4 * 1/4 * 4/10 = 0.075
Proportion of “charges = y” among truthfuls, times proportion of “small” among truthfuls, times proportion of truthfuls = 1/6 * 4/6 * 6/10 = 0.067
P(fraud | charges = y, small) = 0.075 / (0.075 + 0.067) = 0.53
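A small sketch of the same arithmetic, using only the class proportions given on the slide (no firm-level data is assumed):

```python
# Class proportions from the slide's 10-firm example (4 fraudulent, 6 truthful).
p_fraud, p_truthful = 4 / 10, 6 / 10
p_charges_given_fraud, p_small_given_fraud = 3 / 4, 1 / 4
p_charges_given_truthful, p_small_given_truthful = 1 / 6, 4 / 6

# Naive Bayes numerators for a small firm with charges filed.
num_fraud = p_charges_given_fraud * p_small_given_fraud * p_fraud              # = 0.075
num_truthful = p_charges_given_truthful * p_small_given_truthful * p_truthful  # ~ 0.067

posterior_fraud = num_fraud / (num_fraud + num_truthful)
print(round(posterior_fraud, 2))  # ~ 0.53
```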

Naïve Bayes, cont.
Note that the probability estimate (0.53) does not differ greatly from the exact Bayes estimate (0.50)
All records are used in the calculations, not just those matching the predictor values
This makes the calculations practical in most circumstances
Relies on the assumption of independence between predictor variables within each class

Independence Assumption
Not strictly justified (variables are often correlated with one another)
Often “good enough”

Advantages
Handles purely categorical data well
Works well with very large data sets
Simple & computationally efficient

Shortcomings
Requires a large number of records
Problematic when a predictor category is not present in the training data for a given class
In that case it assigns 0 probability to that class, ignoring the information in the other variables
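To see the zero-probability problem concretely (an illustrative sketch, not from the slides): if no fraudulent firm in the training data were large, then for a new large firm the fraud numerator collapses to zero no matter what the other predictors say.

```python
# Hypothetical conditional proportions in which "large" never occurs among frauds.
p_fraud = 4 / 10
p_charges_given_fraud = 3 / 4
p_large_given_fraud = 0 / 4        # category "large" absent among training frauds

# The naive Bayes numerator is a product, so one zero factor wipes it out.
num_fraud = p_charges_given_fraud * p_large_given_fraud * p_fraud
print(num_fraud)  # 0.0 -> fraud gets probability 0 regardless of the other predictors
```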

On the other hand…
Probability rankings are more accurate than the actual probability estimates
Good for applications using lift (e.g., response to a mailing), less so for applications requiring probabilities (e.g., credit scoring)

K-Nearest Neighbors

Basic Idea
For a given record to be classified, identify nearby records
“Near” means records with similar predictor values X1, X2, …, Xp
Classify the record as whatever the predominant class is among the nearby records (the “neighbors”)

How to Measure “Nearby”?
The most popular distance measure is Euclidean distance
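The slide names the measure without writing it out; between a new record $x = (x_1, \dots, x_p)$ and a training record $u = (u_1, \dots, u_p)$, Euclidean distance is the standard formula:

$$
d(x, u) = \sqrt{(x_1 - u_1)^2 + (x_2 - u_2)^2 + \cdots + (x_p - u_p)^2}
$$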

Choosing k
k is the number of nearby neighbors used to classify the new record
k = 1 means use the single nearest record
k = 5 means use the 5 nearest records
Typically choose the value of k that has the lowest error rate in the validation data
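A minimal sketch of this selection procedure in scikit-learn (the arrays and candidate range of k are placeholders; the original slides use XLMiner rather than Python):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training/validation split; in the mower example these would be
# Income and Lot Size for 18 training and 6 validation households.
X_train = np.array([[60.0, 18.4], [85.5, 16.8], [64.8, 21.6], [51.0, 14.0]])
y_train = np.array(["owner", "owner", "owner", "non-owner"])
X_valid = np.array([[75.0, 19.6], [52.8, 20.8]])
y_valid = np.array(["owner", "non-owner"])

best_k, best_acc = None, -1.0
for k in range(1, len(X_train) + 1):                      # try k = 1, 2, ..., n_train
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = knn.score(X_valid, y_valid)                     # accuracy on validation data
    if acc > best_acc:                                    # ties resolve to the lowest k
        best_k, best_acc = k, acc
print(best_k, best_acc)
```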

Low k vs. High k
Low values of k (1, 3, …) capture local structure in the data (but also noise)
High values of k provide more smoothing and less noise, but may miss local structure
Note: the extreme case of k = n (i.e., the entire data set) is the same thing as the “naïve rule” (classify all records according to the majority class)

Example: Riding Mowers
Data: 24 households classified as owning or not owning riding mowers
Predictors: Income, Lot Size

XLMiner Output
For each record in the validation data (6 records), XLMiner finds neighbors among the training data (18 records)
Each record is scored for k = 1, k = 2, …, k = 18
The best k seems to be k = 8
k = 9, k = 10, and k = 14 also share the low error rate, but it is best to choose the lowest k

Using K-NN for Prediction (for a Numerical Outcome)
Instead of “majority vote determines class,” use the average of the response values
May be a weighted average, with weight decreasing with distance
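A minimal sketch of the numerical-outcome variant, again in scikit-learn (names and values are illustrative); weights="distance" gives the distance-weighted average mentioned above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data with a numerical response.
X_train = np.array([[60.0, 18.4], [85.5, 16.8], [64.8, 21.6], [51.0, 14.0]])
y_train = np.array([12.5, 20.1, 15.3, 9.8])

# weights="distance" averages the neighbors' responses, giving closer neighbors more weight.
knn_reg = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X_train, y_train)
print(knn_reg.predict(np.array([[70.0, 19.0]])))
```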

Advantages
Simple
No assumptions required about Normal distribution, etc.
Effective at capturing complex interactions among variables without having to define a statistical model

Shortcomings
The required size of the training set increases exponentially with the number of predictors, p
This is because the expected distance to the nearest neighbor increases with p (with a large vector of predictors, all records end up “far away” from each other)
In a large training set, it takes a long time to compute distances to all the records and then identify the nearest one(s)
These problems constitute the “curse of dimensionality”

Dealing with the Curse
Reduce the dimension of the predictors (e.g., with PCA)
Use computational shortcuts that settle for “almost nearest neighbors”
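One way the dimension-reduction idea might look in code (a hedged sketch using scikit-learn; the component count and data are placeholders, and this is not the procedure shown in XLMiner):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical data with many predictors.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))
y_train = rng.integers(0, 2, size=100)

# Reduce 20 predictors to 3 principal components before running k-NN.
model = Pipeline([("pca", PCA(n_components=3)),
                  ("knn", KNeighborsClassifier(n_neighbors=5))])
model.fit(X_train, y_train)
print(model.predict(rng.normal(size=(1, 20))))
```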

Summary
Naïve rule: benchmark
Naïve Bayes and K-NN are two variations on the same theme: “classify a new record according to the class of similar records”
No statistical models involved
These methods pay attention to complex interactions and local structure
Computational challenges remain