Naïve Bayes Classifier


1 Naïve Bayes Classifier

2 Quiz: Probability Basics
Quiz: We roll two six-sided dice. Consider the following events: (A) die 1 lands on side “3”, (B) die 2 lands on side “1”, and (C) the two dice sum to eight. Answer the following questions:
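The quiz setup can be checked by brute-force enumeration. Below is a small Python sketch (Python is used for illustration; the slides themselves are language-agnostic) that enumerates all 36 equally likely outcomes and computes the probabilities of events A, B, and C:

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of rolling two six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate over (die1, die2)."""
    hits = [o for o in outcomes if event(o)]
    return Fraction(len(hits), len(outcomes))

p_a = prob(lambda o: o[0] == 3)           # die 1 shows "3"
p_b = prob(lambda o: o[1] == 1)           # die 2 shows "1"
p_c = prob(lambda o: o[0] + o[1] == 8)    # the dice sum to eight

print(p_a, p_b, p_c)  # 1/6 1/6 5/36

# A and B are independent: P(A and B) = P(A) * P(B)
p_ab = prob(lambda o: o[0] == 3 and o[1] == 1)
print(p_ab == p_a * p_b)  # True

# A and C are NOT independent: P(C|A) = P(A and C) / P(A) = 1/6, not 5/36
p_c_given_a = prob(lambda o: o[0] == 3 and o[0] + o[1] == 8) / p_a
print(p_c_given_a)  # 1/6
```

Note that conditioning on A changes the probability of C, which previews the conditional-probability machinery the later slides rely on.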

3 Probabilistic Classification

4 Probabilistic Classification
Establishing a probabilistic model for classification: the discriminative model. What is a discriminative probabilistic classifier? It models the posterior P(C|X) directly, mapping an input to class probabilities. Example: C1 – benign mole, C2 – cancer.

5 Probabilistic Classification
Establishing a probabilistic model for classification (cont.): the generative model. A generative probabilistic model is built per class – one model for Class 1, one for Class 2, …, one for Class L – each giving, for example, the probability that this fruit is an orange or the probability that this fruit is an apple.

6 Background: Methods to create classifiers
There are three methods to establish a classifier:
a) Model a classification rule directly. Examples: k-NN, decision trees, perceptron, SVM.
b) Model the probability of class memberships given input data. Example: perceptron with the cross-entropy cost.
c) Make a probabilistic model of the data within each class. Examples: naïve Bayes, model-based classifiers.
a) and b) are examples of discriminative classification; c) is an example of generative classification. b) and c) are both examples of probabilistic classification.
GOOD NEWS: You can create your own hardware/software classifiers!

7 Probability Basics We defined prior, conditional and joint probability for random variables:
Prior probability: P(X)
Conditional probability: P(X1|X2), P(X2|X1)
Joint probability: P(X1, X2)
Relationship: P(X1, X2) = P(X1|X2) P(X2) = P(X2|X1) P(X1)
Independence: P(X1|X2) = P(X1), P(X2|X1) = P(X2), P(X1, X2) = P(X1) P(X2)
Bayes rule: P(C|X) = P(X|C) P(C) / P(X)
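As a quick sanity check of the Bayes rule stated above, here is a short Python sketch with made-up numbers (the disease-testing scenario and all probabilities are illustrative, not from the slides):

```python
from fractions import Fraction

# Illustrative (made-up) numbers: a screening test for a rare disease.
p_disease = Fraction(1, 100)           # prior P(C)
p_pos_given_disease = Fraction(9, 10)  # likelihood P(X|C)
p_pos_given_healthy = Fraction(1, 20)  # P(X|not C)

# Total probability: P(X) = P(X|C) P(C) + P(X|not C) P(not C)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes rule: P(C|X) = P(X|C) P(C) / P(X)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # 2/13 ≈ 0.154
```

Even with a fairly accurate test, the posterior stays small because the prior is small, which is exactly the prior/likelihood interplay the MAP rule on the next slide exploits.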

8 Method: Probabilistic Classification with MAP
MAP classification rule (MAP: Maximum A Posteriori): assign x to c* if P(C = c*|X = x) ≥ P(C = c|X = x) for every class c ≠ c*. Method of generative classification with the MAP rule: learn the prior P(C) and the class-conditional P(X|C), apply the Bayes rule to convert them into posterior probabilities, then apply the MAP rule. We use this rule in many applications.

9 Naïve Bayes

10 Naïve Bayes
Bayes classification: P(C|X) ∝ P(X|C) P(C) for X = (X1, …, Xn). Difficulty: learning the joint probability P(X1, …, Xn|C).
Naïve Bayes classification: assume that all input attributes are conditionally independent given the class, so that P(X1, …, Xn|C) = P(X1|C) P(X2|C) ⋯ P(Xn|C). For each class, the previous generative model thus decomposes into n generative models of a single input each.
MAP classification rule: assign x = (a1, …, an) to c* if P(a1|c*) ⋯ P(an|c*) P(c*) ≥ P(a1|c) ⋯ P(an|c) P(c) for all c ≠ c* – a product of individual probabilities per class.
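The payoff of the conditional-independence assumption is a dramatic drop in the number of parameters to estimate. A quick back-of-the-envelope count in Python (the class and attribute sizes below are illustrative):

```python
# Parameter counts for L classes and n discrete attributes with v values each.
L, n, v = 2, 10, 3

# Full joint P(X1,...,Xn | C): one entry per attribute combination per class
# (minus one per class because the probabilities sum to one).
full_joint = L * (v ** n - 1)

# Naive Bayes P(Xi | C): n small one-attribute tables per class.
naive = L * n * (v - 1)

print(full_joint, naive)  # 118096 40
```

Ten three-valued attributes already make the full joint table hopeless to estimate from modest data, while the factored model needs only a handful of counts.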

11 Naïve Bayes Algorithm
The naïve Bayes algorithm (for discrete input attributes) has two phases.
1. Learning Phase: given a training set S, estimate P(C = ci) for every class and P(Xj = ajk|C = ci) for every attribute value. Output: conditional probability tables.
2. Test Phase: given an unknown instance X' = (a1', …, an'), look up the tables and assign the label c* to X' that maximizes P(a1'|c*) ⋯ P(an'|c*) P(c*).
Learning is easy: just build probability tables. Classification is easy: just multiply probabilities.

12 Tennis Example: Play Tennis

13 The learning phase for the tennis example
P(Play=Yes) = 9/14, P(Play=No) = 5/14. We have four attributes; for each we compute a conditional probability table.

Outlook      Play=Yes  Play=No
Sunny          2/9       3/5
Overcast       4/9       0/5
Rain           3/9       2/5

Temperature  Play=Yes  Play=No
Hot            2/9       2/5
Mild           4/9       2/5
Cool           3/9       1/5

Humidity     Play=Yes  Play=No
High           3/9       4/5
Normal         6/9       1/5

Wind         Play=Yes  Play=No
Strong         3/9       3/5
Weak           6/9       2/5
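These tables can be reproduced by simple counting. The Python sketch below uses Mitchell's standard 14-row PlayTennis dataset, which is the source of the fractions above (Python is used here purely for illustration):

```python
from fractions import Fraction

# Mitchell's PlayTennis data: (Outlook, Temperature, Humidity, Wind, Play)
data = [
    ("Sunny","Hot","High","Weak","No"),      ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),  ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),   ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),  ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),("Rain","Mild","High","Strong","No"),
]

labels = [row[-1] for row in data]
prior = {c: Fraction(labels.count(c), len(data)) for c in set(labels)}
print(prior["Yes"], prior["No"])  # 9/14 5/14

def cond_prob(attr_index, value, cls):
    """P(attribute = value | Play = cls), estimated by counting."""
    in_class = [row for row in data if row[-1] == cls]
    hits = [row for row in in_class if row[attr_index] == value]
    return Fraction(len(hits), len(in_class))

print(cond_prob(0, "Sunny", "Yes"))  # 2/9, matching the Outlook table
print(cond_prob(2, "High", "No"))    # 4/5, matching the Humidity table
```

Every entry in the four tables is just such a within-class frequency.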

14 Formulation of a Classification Problem
Given the data from the previous slide, classify a new point in the attribute space (a vector of values), i.e., find the group to which it belongs.

15 The test phase for the tennis example
Given a new instance of variable values, x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong), look up the calculated tables:
P(Outlook=Sunny|Play=Yes) = 2/9      P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9   P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9      P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9        P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                   P(Play=No) = 5/14
Use the MAP rule to decide Yes or No:
P(Yes|x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) ≈ 0.0053
P(No|x') ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) ≈ 0.0206
Given the fact that P(Yes|x') < P(No|x'), we label x' as “No”.
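The two scores can be verified with exact fraction arithmetic; a small Python check of the multiplication (illustrative sketch):

```python
from fractions import Fraction as F

# Table entries from the learning phase for
# x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
score_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9) * F(9, 14)  # ≈ 0.0053
score_no  = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5) * F(5, 14)  # ≈ 0.0206

print(float(score_yes), float(score_no))
print("No" if score_no > score_yes else "Yes")  # No
```

The scores are unnormalized posteriors; since only their order matters for the MAP decision, no division by P(x') is needed.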

16 Issues Relevant to Naïve Bayes

17 Issues Relevant to Naïve Bayes
Two issues: violation of the independence assumption, and the zero conditional probability problem.

18 Issues Relevant to Naïve Bayes
First issue: violation of the independence assumption. For many real-world tasks, P(X1, …, Xn|C) ≠ P(X1|C) ⋯ P(Xn|C) – the events are correlated. Nevertheless, naïve Bayes works surprisingly well anyway!

19 Issues Relevant to Naïve Bayes
Second issue: the zero conditional probability problem. This problem arises when no training example of class ci contains the attribute value ajk: then the estimate P̂(Xj = ajk|ci) = 0, and during testing this single zero forces the whole product – and hence the posterior for ci – to zero. As a remedy, conditional probabilities are estimated with the m-estimate: P̂(Xj = ajk|ci) = (nc + m·p) / (n + m), where n is the number of training examples with C = ci, nc is the number of those that also have Xj = ajk, p is a prior estimate (e.g., uniform over the attribute values), and m is a weight (equivalent sample size).
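A small Python sketch of the m-estimate remedy, applied to the Outlook=Overcast, Play=No cell from the tennis tables (the choices p = 1/3 and m = 3 are illustrative, not prescribed by the slides):

```python
from fractions import Fraction

def m_estimate(n_c, n, p, m):
    """Smoothed estimate (n_c + m*p) / (n + m):
    n_c = examples in the class with the attribute value,
    n   = examples in the class,
    p   = prior estimate, m = equivalent sample size."""
    return (Fraction(n_c) + m * p) / (n + m)

# Outlook=Overcast never occurs with Play=No (0 of 5 examples),
# so the unsmoothed estimate is 0 and would zero out the whole product:
print(Fraction(0, 5))  # 0

# With a uniform prior p = 1/3 over the three Outlook values and m = 3:
print(m_estimate(0, 5, Fraction(1, 3), 3))  # 1/8
```

Every count shifts slightly toward the prior, but only the zero cells change their qualitative behavior: no single unseen value can veto a class anymore.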

20 Another Problem: Continuous-valued Input Attributes
What to do when an attribute is continuous-valued, with infinitely many possible values? The conditional probability is then modeled with the normal distribution: P(Xj|C = ci) = N(Xj; μji, σji).
Learning Phase: output one normal distribution per attribute per class – the mean μji and standard deviation σji of attribute Xj among the training examples of class ci.
Test Phase: calculate the conditional probabilities with all the normal distributions, then apply the MAP rule to make a decision.
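A minimal Python sketch of this Gaussian approach (the temperature values are illustrative, not taken from the slides):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density N(x; mu, sigma), used as the conditional P(Xj | c)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Learning phase: per-class mean and standard deviation of the attribute.
temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]  # illustrative temperatures, class Yes
mu = sum(temps_yes) / len(temps_yes)
sigma = math.sqrt(sum((t - mu) ** 2 for t in temps_yes) / (len(temps_yes) - 1))

# Test phase: plug the new value into the fitted density and use it
# in the naive Bayes product exactly like a table entry.
print(gaussian_pdf(66, mu, sigma))
```

Note that a density value is not a probability (it can exceed 1), but it plays the same role in the product because only the relative magnitudes across classes matter for the MAP rule.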

21 Naïve Bayes Training Now that we’ve decided to use a Naïve Bayes classifier, we need to train it with some data:

22 Naïve Bayes Training Training in naïve Bayes is easy:
– Estimate P(Y=v) as the fraction of records with Y=v.
– Estimate P(Xi=u|Y=v) as the fraction of records with Y=v for which Xi=u.

23 Naïve Bayes Training In practice, some of these counts can be zero.
Fix this by adding “virtual” counts; this is called smoothing.

24 Naïve Bayes Training For binary digit images, training amounts to averaging all of the training fives together and all of the training sixes together, which yields the per-pixel probabilities for each class.
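A toy Python illustration of this averaging, with tiny made-up binary "images" (the data is hypothetical; real digit images would simply have more pixels):

```python
# Tiny 2x2 "images" (flattened) for two classes; averaging the training
# images of a class gives the per-pixel estimate P(pixel=1 | class).
fives = [[1, 0, 1, 1], [1, 0, 0, 1], [1, 0, 1, 1]]
sixes = [[0, 1, 1, 0], [0, 1, 1, 1], [0, 1, 1, 0]]

def pixel_probs(images):
    """Per-pixel fraction of images in which the pixel is on."""
    n = len(images)
    return [sum(img[i] for img in images) / n for i in range(len(images[0]))]

print(pixel_probs(fives))  # [1.0, 0.0, 0.6666666666666666, 1.0]
print(pixel_probs(sixes))  # [0.0, 1.0, 1.0, 0.3333333333333333]
```

Each averaged image is exactly the table of Bernoulli parameters that the naïve Bayes product then uses, one factor per pixel (in practice with smoothing applied so that no pixel probability is exactly 0 or 1).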

25 Naïve Bayes Training

26 Conclusion on classifiers
Naïve Bayes is based on the independence assumption. Training is very easy and fast: it only requires considering each attribute in each class separately. Testing is straightforward: just look up tables or calculate conditional probabilities with normal distributions. Naïve Bayes is a popular generative classifier model. Its performance is competitive with most state-of-the-art classifiers even when the independence assumption is violated. It has many successful applications, e.g., spam mail filtering, and it is a good candidate for a base learner in ensemble learning. Apart from classification, naïve Bayes can do more…

27 https://www.datasciencecentral

28 R Example (http://uc-r.github.io/naive_bayes)
To illustrate the naïve Bayes classifier we will use the attrition data that is included in the rsample package. The goal is to predict employee attrition.
library(rsample)  # data splitting
library(dplyr)    # data transformation
library(ggplot2)  # data visualization
library(caret)    # implementing with caret
library(h2o)      # implementing with h2o

29 # convert some numeric variables to factors
attrition <- attrition %>%
  mutate(
    JobLevel = factor(JobLevel),
    StockOptionLevel = factor(StockOptionLevel),
    TrainingTimesLastYear = factor(TrainingTimesLastYear)
  )

# Create training (70%) and test (30%) sets for the attrition data.
# Use set.seed for reproducibility
set.seed(123)
split <- initial_split(attrition, prop = .7, strata = "Attrition")
train <- training(split)
test  <- testing(split)

# distribution of Attrition rates across train & test set
table(train$Attrition) %>% prop.table()
table(test$Attrition) %>% prop.table()

30 With naïve Bayes, we assume that the predictor variables are conditionally independent of one another given the response value. This is an extremely strong assumption. We can quickly see that our attrition data violates this, as we have several moderately to strongly correlated variables.
library(corrplot)
train %>%
  filter(Attrition == "Yes") %>%
  select_if(is.numeric) %>%
  cor() %>%
  corrplot::corrplot()

31 However, by making this assumption we can simplify our calculation such that the posterior probability is simply the product of the probability distributions for each individual variable conditioned on the response category. For categorical variables this computation is quite simple, as you just use the frequencies from the data. However, when including continuous predictor variables, an assumption of normality is often made so that we can use the probability from the variable's probability density function. If we pick a handful of our numeric features we quickly see that the assumption of normality is not always fair.
library(tidyr)  # for gather()
train %>%
  select(Age, DailyRate, DistanceFromHome, HourlyRate, MonthlyIncome, MonthlyRate) %>%
  gather(metric, value) %>%
  ggplot(aes(value, fill = metric)) +
    geom_density(show.legend = FALSE) +
    facet_wrap(~ metric, scales = "free")

32 Some numeric features may be normalized with a Box-Cox transformation; however, we can also use non-parametric kernel density estimators to try to get a more accurate representation of continuous-variable probabilities. Ultimately, transforming the distributions and selecting an estimator is part of the model development and tuning process.
Laplace Smoother
We run into a serious problem when new data includes a feature value that never occurs for one or more levels of a response class. What results is P(xi|Ck) = 0 for this individual feature, and this zero will ripple through the entire multiplication of all features and will always force the posterior probability to be zero for that class. A solution to this problem involves using the Laplace smoother, which adds a small number to each of the counts in the frequencies for each feature, ensuring that each feature has a nonzero probability of occurring for each class. Typically a value of one to two for the Laplace smoother is sufficient, but this is a tuning parameter to incorporate and optimize with cross-validation.
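The kernel density alternative mentioned above can be sketched by hand in a few lines of Python (the income values and bandwidth below are hypothetical; in the R workflow this corresponds to setting usekernel = TRUE):

```python
import math

def kde(x, samples, bandwidth):
    """Gaussian kernel density estimate at x: the average of Gaussian kernels
    centered on the training samples. A non-parametric alternative to
    assuming normality for a continuous feature's class-conditional density."""
    k = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(k((x - s) / bandwidth) for s in samples) / (len(samples) * bandwidth)

# A skewed, clearly non-normal sample (hypothetical monthly incomes, in $1000s):
incomes = [2.1, 2.3, 2.4, 2.6, 3.0, 3.2, 4.8, 9.5, 11.0]
print(kde(2.5, incomes, bandwidth=1.0))  # high density near the main mode
print(kde(7.0, incomes, bandwidth=1.0))  # low density in the gap
```

A single fitted Gaussian would smear probability mass across the gap between the two clumps; the kernel estimate follows the data's actual shape, at the cost of a bandwidth parameter to tune.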

33 First, we apply a naïve Bayes model with 10-fold cross-validation, which gets 83% accuracy. Considering that about 83% of the observations in our training set do not attrit, our overall accuracy is no better than if we just predicted “No” attrition for every observation. There are several packages that implement naïve Bayes (e.g., e1071, klaR, naivebayes, bnclassify).
library(klaR)

# create response and feature data
features <- setdiff(names(train), "Attrition")
x <- train[, features]
y <- train$Attrition

# set up 10-fold cross validation procedure
train_control <- trainControl(
  method = "cv",
  number = 10
)

# train model
nb.m1 <- train(
  x = x,
  y = y,
  method = "nb",
  trControl = train_control
)

# results
confusionMatrix(nb.m1)
Cross-Validated (10 fold) Confusion Matrix
(entries are percentual average cell counts across resamples)
Accuracy (average) : 0.833

34 We can tune the few hyperparameters that a naïve Bayes model has.
The usekernel parameter allows us to use a kernel density estimate for continuous variables instead of a Gaussian density estimate; adjust lets us adjust the bandwidth of the kernel density (larger numbers mean a more flexible density estimate); and fL allows us to incorporate the Laplace smoother. If we just tune our model with the above parameters we are able to lift our accuracy to 85%; however, by incorporating some preprocessing of our features (normalizing with Box-Cox, standardizing with center-scaling, and reducing with PCA) we actually get about another 2% lift in our accuracy.

35 # set up tuning grid
search_grid <- expand.grid(
  usekernel = c(TRUE, FALSE),
  fL = 0:5,
  adjust = seq(0, 5, by = 1)
)

# train model
nb.m2 <- train(
  x = x,
  y = y,
  method = "nb",
  trControl = train_control,
  tuneGrid = search_grid,
  preProc = c("BoxCox", "center", "scale", "pca")
)

# top 5 models
nb.m2$results %>%
  top_n(5, wt = Accuracy) %>%
  arrange(desc(Accuracy))

36 # plot search grid results plot(nb.m2)

37 We can assess the accuracy on our final holdout test set.
# results for best model
confusionMatrix(nb.m2)
Cross-Validated (10 fold) Confusion Matrix
(entries are percentual average cell counts across resamples)

pred <- predict(nb.m2, newdata = test)
confusionMatrix(pred, test$Attrition)
Confusion Matrix and Statistics
95% CI : (0.8255, )
'Positive' Class : No

38 Some Other Sources des/Naive-Bayes.ppt

