Naive Bayes Classifiers, an Overview. By Roozmehr Safi.

What is the Naive Bayes Classifier (NBC)? NBC is a probabilistic classification method. Classification (a.k.a. discrimination, or supervised learning) means assigning new cases to one of a set of pre-defined classes, given a sample of cases for which the true classes are known. NBC is one of the oldest and simplest classification methods.

Some NBC Applications: credit scoring, marketing, employee selection, image processing, speech recognition, search engines, and more.

How does NBC Work? NBC applies Bayes' theorem together with the (naive) assumption that the features are independent given the class. A more descriptive term for it would be "independent feature model".

How does NBC work, continued. Let X1, …, Xm denote our features (height, weight, foot size, …), let Y be the class number (1 for men, 2 for women), and let C be the number of classes (here 2). The problem consists of assigning the case (x1, …, xm) to the class c that maximizes P(Y=c | X1=x1, …, Xm=xm) over c = 1, …, C. Applying Bayes' rule gives: P(Y=c | X1=x1, …, Xm=xm) = P(X1=x1, …, Xm=xm | Y=c) P(Y=c) / P(X1=x1, …, Xm=xm). Under NB's assumption of conditional independence, P(X1=x1, …, Xm=xm | Y=c) is replaced by the product P(X1=x1 | Y=c) × … × P(Xm=xm | Y=c). Since the denominator is the same for every class, NB reduces the original problem to finding the class c that maximizes P(Y=c) × P(X1=x1 | Y=c) × … × P(Xm=xm | Y=c).

An example: P(Observed Height | Male) = a, P(Observed Weight | Male) = b, P(Observed Foot size | Male) = c, so P(Male | observed case) ∝ P(Male) × a × b × c. Likewise, P(Observed Height | Female) = d, P(Observed Weight | Female) = e, P(Observed Foot size | Female) = f, so P(Female | observed case) ∝ P(Female) × d × e × f. Pick the class whose score is larger.
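To make the arithmetic concrete, here is a minimal Python sketch of this height/weight/foot-size example. It assumes Gaussian class-conditional densities; the priors, means, and variances below are made-up illustrative numbers, not values from the slides.

```python
from math import pi, sqrt, exp

def gaussian_pdf(x, mean, var):
    """Gaussian density, used as the class-conditional likelihood P(X = x | Y)."""
    return exp(-(x - mean) ** 2 / (2 * var)) / sqrt(2 * pi * var)

# Made-up per-class priors, means, and variances standing in for values
# that would be estimated from training data: feature -> (mean, variance).
params = {
    "Male":   {"prior": 0.5, "height": (178.0, 35.0), "weight": (80.0, 90.0), "foot": (27.0, 1.0)},
    "Female": {"prior": 0.5, "height": (165.0, 30.0), "weight": (62.0, 70.0), "foot": (23.0, 1.0)},
}

def classify(height, weight, foot):
    """Score each class with P(class) * a * b * c and return the larger one."""
    scores = {}
    for cls, p in params.items():
        scores[cls] = (p["prior"]
                       * gaussian_pdf(height, *p["height"])
                       * gaussian_pdf(weight, *p["weight"])
                       * gaussian_pdf(foot, *p["foot"]))
    return max(scores, key=scores.get), scores

print(classify(172, 70, 26))  # pick the class whose score is larger
```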

NBC advantages: Despite its unrealistic independence assumption, NBC is remarkably successful even when independence is violated. Due to its simple structure, NBC is appealing when the set of variables is large. NBC also requires only a small amount of training data:
– It only needs to estimate the mean and variance of each variable within each class.
– There is no need to form a covariance matrix.
– It is computationally inexpensive.
A minimal fitting sketch is shown below.
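The sketch below is an illustration rather than part of the original slides (the function name and toy data are assumptions). It shows how little needs to be estimated: per-class priors plus one mean and one variance per feature, with no covariance matrix.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate everything Gaussian NB needs: per-class priors, and the mean and
    variance of each feature within each class (no covariance matrix)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = {
            "prior": len(Xc) / len(X),
            "mean": Xc.mean(axis=0),   # one mean per feature
            "var": Xc.var(axis=0),     # one variance per feature
        }
    return model

# Tiny example: rows are (height, weight, foot size); labels 1 = male, 2 = female.
X = [[180, 81, 27], [176, 78, 26], [164, 60, 23], [166, 63, 24]]
y = [1, 1, 2, 2]
print(fit_gaussian_nb(X, y))
```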

A Demonstration
Data: from an online B2B exchange (1,220 cases).
Purpose: to distinguish cheating sellers from good sellers.
Predictors:
– Member type: enterprise, personal, other
– Years since joining: 1 to 10 years
– Number of months since the last membership renewal
– Membership renewal duration
– Type of service bought: standard, limited edition, …
– Whether the member has a registered company
– Whether the company page is decorated
– Number of days the member logged in during the past 60 days
– Industry: production, distribution, investment, …
Target: to predict whether a seller is likely to cheat buyers, based on data from old sellers.

Issues involved: Probability distributions. With discrete (categorical) features, the probabilities can be estimated using frequency counts. With continuous features, one can assume a particular parametric form for the probability distribution (e.g., Gaussian). There is evidence that discretizing continuous data before applying NB is effective. Equal Frequency Discretization (EFD) divides the sorted values of a continuous variable into k equally populated bins.
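As a minimal sketch of EFD (the function name and sample heights below are illustrative assumptions, not from the slides), the bin edges can be taken as quantiles of the data so that each bin receives roughly the same number of cases:

```python
import numpy as np

def equal_frequency_bins(values, k=5):
    """Equal Frequency Discretization: split the sorted values of a continuous
    variable into k bins with (roughly) equal counts."""
    values = np.asarray(values, dtype=float)
    # Bin edges are the 0/k, 1/k, ..., k/k quantiles of the data.
    edges = np.quantile(values, np.linspace(0, 1, k + 1))
    # Map each value to a bin index 0..k-1 (clip so the maximum lands in the last bin).
    bins = np.clip(np.searchsorted(edges, values, side="right") - 1, 0, k - 1)
    return bins, edges

# Example: discretize eight heights into 4 equally populated bins.
heights = [150, 152, 160, 161, 170, 171, 180, 190]
bins, edges = equal_frequency_bins(heights, k=4)
print(bins)   # [0 0 1 1 2 2 3 3]
print(edges)  # quantile-based bin boundaries
```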

Issues involved: Zero probabilities. When a class and a feature value never occur together in the training set, a problem arises: assigning a probability of zero to one of the terms causes the whole product to evaluate to zero. The zero probability can instead be replaced by a small constant, such as 0.5/n, where n is the number of observations in the training set.
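A small sketch of this fix, assuming frequency-count estimation over a toy dataset (the helper name and the data are illustrative, not from the demonstration above):

```python
def conditional_prob(x_value, c, rows):
    """Estimate P(X = x_value | Y = c) from frequency counts.

    rows: list of (x, y) training pairs. A zero count is replaced by 0.5 / n,
    where n is the total number of training observations, as suggested above.
    """
    n = len(rows)
    in_class = [x for x, y in rows if y == c]
    count = sum(1 for x in in_class if x == x_value)
    if count == 0:
        return 0.5 / n   # small constant instead of an exact zero
    return count / len(in_class)

# Example: the value "other" never occurs with class 1 in this toy data.
data = [("enterprise", 1), ("personal", 1), ("enterprise", 1), ("other", 0), ("personal", 0)]
print(conditional_prob("other", 1, data))       # 0.5 / 5 = 0.1 instead of 0
print(conditional_prob("enterprise", 1, data))  # 2 / 3
```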

Issues involved: Missing values. In some applications, values are missing not at random, and their absence can itself be meaningful; in that case, missing values are treated as a separate category. If one does not want to treat missing values as a separate category, they should be handled before applying the classifier, either by imputing the missing values or by excluding the cases in which they occur.
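A brief illustration with pandas (the column and its values are made up): the first option keeps missing entries as their own category before counting frequencies, the second removes them before training.

```python
import pandas as pd

# Toy predictor column with missing entries (values are illustrative).
member_type = pd.Series(["enterprise", None, "personal", "other", None])

# Option 1: treat missing values as their own category.
as_category = member_type.fillna("missing")
print(as_category.value_counts())

# Option 2: drop (or impute) cases with missing values before applying NB.
complete_only = member_type.dropna()
print(complete_only.value_counts())
```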

Thank you