10-601 Introduction to Machine Learning: Naïve Bayes
Matt Gormley, Lecture 3, September 7, 2016
School of Computer Science
Readings: Mitchell Ch. 6.1 – 6.10; Murphy Ch. 3

Reminders: Homework 1 is due Sept. 7 at 5:30pm (today). Homework 2 is released today and due Sept. 14 at 5:30pm.

Outline
- Background: Maximum likelihood estimation (MLE); Maximum a posteriori (MAP) estimation; Example: Exponential distribution
- Generative Models: Model 0: Not-so-naïve Model
- Naïve Bayes: Naïve Bayes Assumption; Model 1: Bernoulli Naïve Bayes; Model 2: Multinomial Naïve Bayes; Model 3: Gaussian Naïve Bayes; Model 4: Multiclass Naïve Bayes
- Smoothing: Add-1 Smoothing; Add-λ Smoothing; MAP Estimation (Beta Prior)

MLE vs. MAP: Maximum Likelihood Estimate (MLE)

Background: MLE. Example: MLE of the Exponential Distribution (the derivation, worked over several slides, is sketched below).
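A sketch of the derivation these slides work through, for i.i.d. data x_1, …, x_N drawn from an Exponential(λ) distribution:

p(x \mid \lambda) = \lambda e^{-\lambda x}, \quad x \ge 0

\ell(\lambda) = \log \prod_{i=1}^{N} \lambda e^{-\lambda x_i} = N \log \lambda - \lambda \sum_{i=1}^{N} x_i

\frac{d\ell}{d\lambda} = \frac{N}{\lambda} - \sum_{i=1}^{N} x_i = 0 \quad\Rightarrow\quad \hat{\lambda}_{\mathrm{MLE}} = \frac{N}{\sum_{i=1}^{N} x_i} = \frac{1}{\bar{x}}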

MLE vs. MAP: the Maximum Likelihood Estimate (MLE) and the Maximum a posteriori (MAP) estimate, which incorporates a prior.
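In symbols, with D denoting the training data (the standard definitions):

\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta}\ p(\mathcal{D} \mid \theta)

\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\ p(\theta \mid \mathcal{D}) = \arg\max_{\theta}\ p(\mathcal{D} \mid \theta)\, p(\theta)

The second equality holds because p(D) does not depend on θ; the extra factor p(θ) is the prior.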

Generative Models: specify a generative story for how the data was created (e.g. roll a weighted die). Given parameters for the model (e.g. weights for each side), we can generate new data (e.g. roll the die again). The typical learning approach is MLE (e.g. find the most likely weights given the data).
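A tiny MATLAB sketch of this generative story for a weighted six-sided die; the weights below are made up for illustration:

% Parameters of the model: the (made-up) weights of the die, which must sum to 1.
weights = [0.05 0.05 0.10 0.10 0.20 0.50];

% Generate new data: roll the die N times by inverting the categorical CDF.
N = 1000;
cdf = cumsum(weights);
rolls = zeros(N, 1);
for i = 1:N
    rolls(i) = find(rand <= cdf, 1, 'first');
end

% Learn by MLE: the most likely weights given the rolls are just normalized counts.
weights_mle = accumarray(rolls, 1, [numel(weights) 1])' / N;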

Features: suppose we want to represent a document (M words) as a vector (vocabulary size V). How should we do it? Option 1: integer vector (word IDs). Option 2: binary vector (word indicators). Option 3: integer vector (word counts).
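A toy example of the three options, using a made-up 5-word vocabulary {the, cat, saw, dog, ran} (V = 5) and the document "the cat saw the dog" (M = 5):

% Option 1: integer vector of word IDs (length M, the number of words in the document)
ids = [1 2 3 1 4];          % "the cat saw the dog"

% Option 2: binary vector of word indicators (length V, the vocabulary size)
indicators = [1 1 1 1 0];   % every word except "ran" appears at least once

% Option 3: integer vector of word counts (length V)
counts = [2 1 1 1 0];       % "the" appears twice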

Today’s Goal: to define a generative model of emails of two different classes (e.g. spam vs. not spam).

Example document classes: Spam vs. News; The Economist vs. The Onion.

Model 0: Not-so-naïve Model? Generative Story: flip a weighted coin (Y). If heads, roll the red many-sided die to sample a document vector (X) from the Spam distribution; if tails, roll the blue many-sided die to sample a document vector (X) from the Not-Spam distribution. This model is computationally naïve!

Model 0: Not-so-naïve Model? Generative Story: flip a weighted coin (Y). If heads, sample a document ID (X) from the Spam distribution; if tails, sample a document ID (X) from the Not-Spam distribution. This model is computationally naïve!

Model 0: Not-so-naïve Model? Flip a weighted coin: if HEADS, roll the red die; if TAILS, roll the blue die. Each side of the die is labeled with an entire document vector (e.g. [1,0,1,…,1]), so a single roll produces all of x1, …, xK at once. [The slide shows a table of sampled rows (y, x1, x2, x3, …, xK).]


Naïve Bayes Assumption: conditional independence of the features given the class, P(X1, …, XK | Y) = P(X1|Y) · … · P(XK|Y).
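To see what the assumption buys us, a standard counting argument for K binary features (not stated on the slide, but implicit in what follows): without the assumption, specifying P(X1, …, XK | Y = y) requires 2^K - 1 numbers per class, one probability for every possible feature vector. With the assumption, it requires only K numbers per class, one parameter P(Xk = 1 | Y = y) per feature.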

Estimating a joint from conditional probabilities: given small tables for P(C), P(A|C), and P(B|C) (e.g. P(C=0) = 0.33, P(C=1) = 0.67), and assuming the variables are conditionally independent given C, the full joint table P(A,B,C) can be filled in row by row by multiplying the conditionals with the prior. Adding a fourth variable D only requires one more small table P(D|C) to obtain the joint P(A,B,D,C), rather than doubling the size of the joint table. [The slides show the numeric tables.] Slides from William Cohen (10-601, Spring 2016)
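In symbols, assuming A, B, and D are conditionally independent given C (the assumption the next slide makes explicit):

P(A, B, C) = P(A|C) P(B|C) P(C)

P(A, B, D, C) = P(A|C) P(B|C) P(D|C) P(C)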

Assuming conditional independence, the conditional probabilities encode the same information as the joint table. They are very convenient for estimating P(X1, …, Xn | Y) = P(X1|Y) · … · P(Xn|Y), and they are almost as good for computing quantities such as the posterior P(Y | X1, …, Xn). Slide from William Cohen (10-601, Spring 2016)

Generic Naïve Bayes Model
Support: depends on the choice of event model, P(Xk|Y).
Model: product of the prior and the event model.
Training: find the class-conditional MLE parameters. For P(Y), we find the MLE using all the data; for each P(Xk|Y), we condition on the data with the corresponding class.
Classification: find the class that maximizes the posterior.

Generic Naïve Bayes Model. Classification: pick the class with the highest posterior; the rule is written out below.
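Written out in the standard form (the denominator p(x) can be dropped because it does not depend on y):

\hat{y} = \arg\max_{y}\ p(y \mid x_1, \dots, x_K) = \arg\max_{y}\ p(y) \prod_{k=1}^{K} p(x_k \mid y)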

Model 1: Bernoulli Naïve Bayes. Support: binary vectors of length K. Generative Story and Model: sketched below.
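A minimal sketch of the Bernoulli event model in its usual parameterization (the symbols φ and θ are illustrative names, not necessarily the slide's notation):

Y \sim \mathrm{Bernoulli}(\phi), \qquad X_k \mid Y = y \sim \mathrm{Bernoulli}(\theta_{k,y}), \quad k = 1, \dots, K

p(\mathbf{x}, y) = p(y) \prod_{k=1}^{K} p(x_k \mid y) = \phi^{y} (1-\phi)^{1-y} \prod_{k=1}^{K} \theta_{k,y}^{x_k} (1-\theta_{k,y})^{1-x_k}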

Model 1: Bernoulli Naïve Bayes. Flip a weighted coin: if HEADS, flip each red coin; if TAILS, flip each blue coin. Each red coin corresponds to one feature xk. We can generate data in this fashion, though in practice we never would, since our data is given; instead, this provides an explanation of how the data was generated (albeit a terrible one). [The slide shows a table of sampled rows (y, x1, x2, x3, …, xK).]

Model 1: Bernoulli Naïve Bayes. Support: binary vectors of length K. Generative Story: same as the generic Naïve Bayes story above. Model: as above. Classification: find the class that maximizes the posterior.

Recall: for the generic Naïve Bayes model, classification means finding the class that maximizes the posterior (the rule given above).

Model 1: Bernoulli Naïve Bayes. Training: find the class-conditional MLE parameters. For P(Y), we find the MLE using all the data; for each P(Xk|Y), we condition on the data with the corresponding class. [The slide shows the training data as a table of N labeled binary rows (y, x1, x2, x3, …, xK).]
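A minimal MATLAB sketch of this training procedure and the resulting classifier, assuming X is an N-by-K binary matrix and y is an N-by-1 vector of 0/1 labels (variable names are illustrative):

% MLE by counting: class prior P(Y=1) and per-class Bernoulli parameters P(X_k=1 | Y=y)
phi    = mean(y == 1);              % P(Y = 1)
theta1 = mean(X(y == 1, :), 1);     % 1-by-K vector of P(X_k = 1 | Y = 1)
theta0 = mean(X(y == 0, :), 1);     % 1-by-K vector of P(X_k = 1 | Y = 0)

% Classify a new binary row vector xnew (1-by-K) by comparing log posteriors;
% the evidence p(x) cancels out of the comparison.
logp1 = log(phi)     + sum(xnew .* log(theta1) + (1 - xnew) .* log(1 - theta1));
logp0 = log(1 - phi) + sum(xnew .* log(theta0) + (1 - xnew) .* log(1 - theta0));
yhat  = double(logp1 > logp0);

If a word never occurs in one class, the corresponding theta is exactly 0 (or 1) and the log terms blow up; that is the zero-count problem the smoothing section below addresses.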


Smoothing: Add-1 Smoothing, Add-λ Smoothing, MAP Estimation (Beta Prior)

MLE: What does maximizing likelihood accomplish? There is only a finite amount of probability mass (i.e. a sum-to-one constraint). MLE tries to allocate as much probability mass as possible to the things we have observed, at the expense of the things we have not observed.

MLE For Naïve Bayes, suppose we never observe the word “serious” in an Onion article. In this case, what is the MLE of p(xk | y)? Now suppose we observe the word “serious” at test time. What is the posterior probability that the article was an Onion article?
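Working through the intended answer with the count-based MLE: if "serious" never appears in a training Onion article, then

\hat{p}(x_{\text{serious}} = 1 \mid Y = \text{Onion}) = \frac{\#\{\text{Onion docs containing "serious"}\}}{\#\{\text{Onion docs}\}} = 0,

so for any test article containing "serious", the product p(Y = Onion) ∏_k p(x_k | Y = Onion) contains a factor of zero, and the posterior probability that the article was an Onion article is exactly 0, regardless of every other word in it.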

1. Add-1 Smoothing
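For a binary (Bernoulli) feature, add-1 smoothing pretends we saw one extra positive and one extra negative example in each class (the standard form; the slide's notation may differ):

\hat{\theta}_{k,y} = \frac{\#\{X_k = 1,\ Y = y\} + 1}{\#\{Y = y\} + 2}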

2. Add-λ Smoothing (for the Categorical Distribution)
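For a categorical distribution over V outcomes, add-λ smoothing adds a pseudo-count of λ to every outcome:

\hat{\theta}_{v} = \frac{\#\{\text{outcome } v\} + \lambda}{N + \lambda V}, \qquad v = 1, \dots, V.

Add-1 smoothing is the special case λ = 1.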

Recall: MLE vs. MAP. The Maximum Likelihood Estimate (MLE) maximizes the likelihood of the data; the Maximum a posteriori (MAP) estimate additionally incorporates a prior (see the definitions above).

3. MAP Estimation (Beta Prior). Generative Story: the parameters are drawn once for the entire dataset. Training: find the class-conditional MAP parameters.
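A sketch of the closed form for a single Bernoulli parameter with a Beta(α, β) prior (a standard result; the slide's choice of hyperparameters is not shown here):

\theta \sim \mathrm{Beta}(\alpha, \beta), \qquad \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\ p(\theta \mid \mathcal{D}) = \frac{\#\{X = 1\} + \alpha - 1}{N + \alpha + \beta - 2}

Choosing α = β = 2 recovers add-1 smoothing, and α = β = 1 (a uniform prior) recovers the MLE.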


Model 2: Multinomial Naïve Bayes. Support: Option 1, an integer vector of word IDs. Generative Story and Model: sketched below.
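A minimal sketch of the multinomial (word-ID) event model in its usual form, with one categorical distribution over the V vocabulary words per class (symbol names are illustrative):

Y \sim \mathrm{Categorical}(\pi), \qquad X_i \mid Y = y \sim \mathrm{Categorical}(\theta_{\cdot, y}) \ \text{ for each word position } i = 1, \dots, M

p(\mathbf{x}, y) = p(y) \prod_{i=1}^{M} \theta_{x_i, y}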

Model 3: Gaussian Naïve Bayes. Support: real-valued feature vectors, x in R^K. Model: product of the prior and the event model (see below).
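A common form of the Gaussian event model, with one mean and variance per feature and per class; this is also what the MATLAB demo below estimates with mean and std:

p(\mathbf{x} \mid y) = \prod_{k=1}^{K} \mathcal{N}(x_k;\ \mu_{k,y},\ \sigma_{k,y}^2), \qquad p(\mathbf{x}, y) = p(y) \prod_{k=1}^{K} \mathcal{N}(x_k;\ \mu_{k,y},\ \sigma_{k,y}^2)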

Visualizing Naïve Bayes Slides in this section from William Cohen (10-601B, Spring 2016)

Fisher Iris Dataset: Fisher (1936) used 150 measurements of flowers from 3 different species, collected by Anderson (1936): Iris setosa (0), Iris virginica (1), Iris versicolor (2). Each measurement records the species together with sepal length, sepal width, petal length, and petal width (in cm). [The slide shows a few sample rows of the table.] Full dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set

%% Import the IRIS data
load fisheriris;
X = meas;
pos = strcmp(species,'setosa');
Y = 2*pos - 1;

%% Visualize the data
imagesc([X,Y]);
title('Iris data');

%% Visualize by scatter plotting the first two dimensions
figure;
scatter(X(Y<0,1),X(Y<0,2),'r*');
hold on;
scatter(X(Y>0,1),X(Y>0,2),'bo');
title('Iris data');

%% Compute the mean and SD of each class
PosMean = mean(X(Y>0,:));
PosSD = std(X(Y>0,:));
NegMean = mean(X(Y<0,:));
NegSD = std(X(Y<0,:));

%% Compute the NB probabilities for each class for each grid element
% gaussmf(x, [sig c]) evaluates the unnormalized Gaussian bump exp(-(x-c).^2/(2*sig^2));
% it ships with the Fuzzy Logic Toolbox.
[G1,G2] = meshgrid(3:0.1:8, 2:0.1:5);
Z1 = gaussmf(G1,[PosSD(1),PosMean(1)]);
Z2 = gaussmf(G2,[PosSD(2),PosMean(2)]);
Z = Z1 .* Z2;
V1 = gaussmf(G1,[NegSD(1),NegMean(1)]);
V2 = gaussmf(G2,[NegSD(2),NegMean(2)]);
V = V1 .* V2;

% Add them to the scatter plot
figure;
scatter(X(Y<0,1),X(Y<0,2),'r*');
hold on;
scatter(X(Y>0,1),X(Y>0,2),'bo');
contour(G1,G2,Z);
contour(G1,G2,V);

%% Now plot the difference of the probabilities
figure;
scatter(X(Y<0,1),X(Y<0,2),'r*');
hold on;
scatter(X(Y>0,1),X(Y>0,2),'bo');
contour(G1,G2,Z-V);
mesh(G1,G2,Z-V);
alpha(0.4)

Naïve Bayes IS LINEAR Slides in this section from William Cohen (10-601B, Spring 2016)

Question: what does the boundary between positive and negative look like for Naïve Bayes?

With two classes only, write the ratio of the two posteriors and rearrange terms.

If each xi is 1 or 0, the result is linear in x (the derivation is sketched below).
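A sketch of the algebra these slides step through, for the Bernoulli model with two classes (the weights w_k are just names for the collected constants):

\log\frac{p(Y=1 \mid \mathbf{x})}{p(Y=0 \mid \mathbf{x})} = \log\frac{p(Y=1)}{p(Y=0)} + \sum_{k=1}^{K}\left[ x_k \log\frac{\theta_{k,1}}{\theta_{k,0}} + (1 - x_k)\log\frac{1-\theta_{k,1}}{1-\theta_{k,0}} \right] = w_0 + \sum_{k=1}^{K} w_k x_k

The classifier predicts Y = 1 exactly when w_0 + \sum_k w_k x_k > 0, so the decision boundary is the hyperplane w_0 + \sum_k w_k x_k = 0: Naïve Bayes with this event model is a linear classifier.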

Why don’t we drop the generative model and try to learn this hyperplane directly?

Model 4: Multiclass Naïve Bayes
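A minimal sketch of how the multiclass case generalizes the binary one (standard formulation, assuming C classes; not taken verbatim from the slide):

Y \sim \mathrm{Categorical}(\pi_1, \dots, \pi_C), \qquad p(\mathbf{x}, y) = \pi_y \prod_{k=1}^{K} p(x_k \mid y), \qquad \hat{y} = \arg\max_{y \in \{1,\dots,C\}}\ \pi_y \prod_{k=1}^{K} p(x_k \mid y)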

Recall the Naïve Bayes Assumption: conditional independence of features given the class, P(X1, …, XK | Y) = P(X1|Y) · … · P(XK|Y).

What’s wrong with the Naïve Bayes Assumption? The features might not be independent!! Example 1: if a document contains the word “Donald”, it’s extremely likely to contain the word “Trump”; these are not independent! Example 2: if the petal width is very high, the petal length is also likely to be very high.

Summary: Naïve Bayes provides a framework for generative modeling. Choose an event model appropriate to the data. Train by MLE or MAP. Classify by maximizing the posterior.