CSE 446 Gaussian Naïve Bayes & Logistic Regression Winter 2012

CSE 446 Gaussian Naïve Bayes & Logistic Regression Winter 2012 Dan Weld Some slides from Carlos Guestrin, Luke Zettlemoyer

Last Time: Naïve Bayes Today: Learning Gaussians, Gaussian Naïve Bayes, Logistic Regression

Text Classification Bag of Words Representation aardvark 0 about 2 all 2 Africa 1 apple 0 anxious 0 ... gas 1 oil 1 … Zaire 0
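A minimal sketch of this representation (hypothetical toy code, not from the course materials): each document becomes a vector of word counts over a fixed vocabulary.

  from collections import Counter

  doc = "all about oil and gas in Africa ... the gas fields of Zaire"   # made-up toy text
  bag = Counter(doc.lower().split())                  # word -> count, e.g. bag["gas"] == 2
  vocab = ["aardvark", "about", "africa", "apple", "gas", "oil", "zaire"]
  vector = [bag.get(w, 0) for w in vocab]             # the bag-of-words feature vector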

Bayesian Learning Use Bayes rule: P(Y | X) = P(X | Y) P(Y) / P(X), where P(Y | X) is the posterior, P(X | Y) the data likelihood, P(Y) the prior, and P(X) the normalization. Or equivalently: P(Y | X) ∝ P(X | Y) P(Y)

Naïve Bayes Naïve Bayes assumption: features are independent given the class: P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y). More generally: P(X1 … Xn | Y) = Πi P(Xi | Y). How many parameters now? Suppose X is composed of n binary features.
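A worked answer to the slide's question (standard counting, not taken from the slides): for binary Y and n binary features, modeling the full joint P(X1 … Xn | Y) requires 2(2^n - 1) parameters, while under the Naïve Bayes assumption only 2n are needed (one value P(Xi = 1 | Y = y) per feature and per class), plus 1 for P(Y).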

The Distributions We Love
                                   Binary {0, 1}    Discrete, k values
  Single event                     Bernoulli
  Sequence (N trials), N = H + T   Binomial         Multinomial
  Conjugate prior                  Beta             Dirichlet

NB with Bag of Words for Text Classification Learning phase: Prior P(Ym): count how many documents are from topic m / total # docs. Likelihood P(Xi | Ym): let Bm be a bag of words formed from all the docs in topic m, and let #(i, B) be the number of times word i is in bag B; then P(Xi | Ym) = (#(i, Bm) + 1) / (W + Σj #(j, Bm)), where W = # unique words. Test phase: for each document, use the naïve Bayes decision rule.
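A minimal sketch of the learning phase above, assuming documents arrive as (list_of_words, topic) pairs; variable names are hypothetical, the +1 smoothing matches the formula on the slide, and log probabilities are stored in anticipation of the underflow point two slides down.

  from collections import Counter, defaultdict
  import math

  def train_nb(docs):
      # docs: list of (list_of_words, topic) pairs
      vocab = {w for words, _ in docs for w in words}
      W = len(vocab)                                   # number of unique words
      doc_counts = Counter(topic for _, topic in docs)
      word_counts = defaultdict(Counter)               # word_counts[m][i] = #(i, B_m)
      for words, topic in docs:
          word_counts[topic].update(words)
      log_prior = {m: math.log(doc_counts[m] / len(docs)) for m in doc_counts}
      log_lik = {}
      for m in doc_counts:
          total = sum(word_counts[m].values())         # sum_j #(j, B_m)
          log_lik[m] = {w: math.log((word_counts[m][w] + 1) / (W + total)) for w in vocab}
      return log_prior, log_lik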

Easy to Implement But… If you do… it probably won’t work…

Probabilities: Important Detail! P(spam | X1 … Xn) ∝ P(spam) Πi P(Xi | spam) Any more potential problems here? We are multiplying lots of small numbers: danger of underflow! 0.5^57 ≈ 7 × 10^-18 Solution? Use logs and add! p1 · p2 = e^(log(p1) + log(p2)) Always keep in log form
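A sketch of the log-space decision rule, reusing the log_prior / log_lik dictionaries from the training sketch above (hypothetical names); multiplication of probabilities becomes addition of logs, so nothing underflows.

  def classify(words, log_prior, log_lik):
      best, best_score = None, float("-inf")
      for m in log_prior:
          # log P(Ym) + sum_i log P(Xi | Ym); words outside the vocabulary are skipped for brevity
          score = log_prior[m] + sum(log_lik[m][w] for w in words if w in log_lik[m])
          if score > best_score:
              best, best_score = m, score
      return best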

Naïve Bayes Posterior Probabilities Classification results of naïve Bayes (i.e., the class with maximum posterior probability) are usually fairly accurate. However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability estimates are not accurate: output probabilities are generally very close to 0 or 1.

Twenty News Groups results

Learning curve for Twenty News Groups

Bayesian Learning What if Features are Continuous? E.g., character recognition: Xi is the ith pixel. Posterior ∝ Data Likelihood × Prior: P(Y | X) ∝ P(X | Y) P(Y)

Bayesian Learning What if Features are Continuous? E.g., character recognition: Xi is the ith pixel. P(Y | X) ∝ P(X | Y) P(Y), and each continuous feature is modeled as a Gaussian: P(Xi = x | Y = yk) = N(μik, σik), where N(x; μ, σ) = (1 / (σ √(2π))) exp(-(x - μ)² / (2σ²))

Gaussian Naïve Bayes P(Y | X) ∝ P(X | Y) P(Y), with P(Xi = x | Y = yk) = N(μik, σik). Sometimes assume the variance is independent of Y (i.e., σi), or independent of Xi (i.e., σk), or both (i.e., σ).

Learning Gaussian Parameters Maximum Likelihood Estimates (superscript j indexes the jth training example; δ(x) = 1 if x is true, else 0):
Mean:      μik = Σj Xi^j δ(Y^j = yk) / Σj δ(Y^j = yk)
Variance:  σik² = Σj (Xi^j - μik)² δ(Y^j = yk) / Σj δ(Y^j = yk)
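A minimal sketch of these estimates and the resulting classifier (hypothetical names; X is an (N, n) NumPy array of real-valued features, y a length-N array of class labels, and every per-class variance is assumed to be positive).

  import numpy as np

  def fit_gnb(X, y):
      params = {}
      for k in np.unique(y):
          Xk = X[y == k]                    # training rows with Y = yk
          mu = Xk.mean(axis=0)              # mu_ik: per-feature mean within class k
          var = Xk.var(axis=0)              # sigma_ik^2: MLE (divide-by-count) variance
          params[k] = (mu, var, len(Xk) / len(X))   # last entry is P(Y = yk)
      return params

  def log_gaussian(x, mu, var):
      # elementwise log N(x; mu, sigma^2)
      return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

  def predict(x, params):
      scores = {k: np.log(p) + log_gaussian(x, mu, var).sum()
                for k, (mu, var, p) in params.items()}
      return max(scores, key=scores.get)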

Example: GNB for classifying mental states [Mitchell et al.] ~1 mm resolution, ~2 images per sec., 15,000 voxels/image; non-invasive, safe; measures the Blood Oxygen Level Dependent (BOLD) response (typical impulse response ~10 sec)

Brain scans can track activation with precision and sensitivity [Mitchell et al.]

Gaussian Naïve Bayes: Learned μvoxel,word P(BrainActivity | WordCategory = {People, Animal}) [Mitchell et al.]

Gaussian Naïve Bayes: Learned μvoxel,word P(BrainActivity | WordCategory = {People, Animal}) [Mitchell et al.] Pairwise classification accuracy: 85% (People words vs. Animal words)

What You Need to Know about Naïve Bayes: the optimal decision using the Bayes classifier; the Naïve Bayes classifier (what the assumption is, why we use it, how we learn it); text classification with the bag-of-words model; Gaussian NB (features still conditionally independent, each feature Gaussian given the class)

What’s (supervised) learning more formally Given: Dataset: Instances {x1;t(x1),…, xN;t(xN)} e.g., xi;t(xi) = (GPA=3.9,IQ=120,MLscore=99);150K Hypothesis space: H e.g., polynomials of degree 8 Loss function: measures quality of hypothesis hH e.g., squared error for regression Obtain: Learning algorithm: obtain hH that minimizes loss function e.g., using closed form solution if available Or greedy search if not Want to minimize prediction error, but can only minimize error in dataset

Types of (supervised) learning problems, revisited Decision Trees, e.g., dataset: votes; party hypothesis space: Loss function: NB Classification, e.g., dataset: brain image; {verb v. noun} Density estimation, e.g., dataset: grades ©Carlos Guestrin 2005-2009

Learning is (simply) function approximation! The general (supervised) learning problem: given some data (including features), a hypothesis space, and a loss function. Learning is no magic! Simply trying to find a function that fits the data: regression, density estimation, classification. (Not surprisingly) Seemingly different problems, very similar solutions…

What you need to know about supervised learning Learning is function approximation What functions are being optimized?

Generative vs. Discriminative Classifiers Want to Learn: h: X → Y. X – features, Y – target classes. Bayes optimal classifier – P(Y|X). Generative classifier, e.g., Naïve Bayes: assume some functional form for P(X|Y), P(Y); estimate parameters of P(X|Y), P(Y) directly from training data; use Bayes rule to calculate P(Y | X = x). This is a 'generative' model: indirect computation of P(Y|X) through Bayes rule; as a result, can also generate a sample of the data, P(X) = Σy P(y) P(X|y). Discriminative classifiers, e.g., Logistic Regression: assume some functional form for P(Y|X); estimate parameters of P(Y|X) directly from training data. This is the 'discriminative' model: directly learn P(Y|X), but cannot obtain a sample of the data, because P(X) is not available.

Logistic Regression Learn P(Y|X) directly! Assume a particular functional form. A step function (P(Y)=0 on one side of a threshold, P(Y)=1 on the other) is not differentiable…

Logistic Regression Learn P(Y|X) directly! Assume a particular functional form Logistic Function Aka Sigmoid

Logistic Function in n Dimensions Sigmoid applied to a linear function of the data: Features can be discrete or continuous!
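For reference, a standard way to write the functional form this slide applies (sign conventions vary across the referenced slide decks, so treat this as one common convention, used consistently below):
  P(Y = 1 | X) = 1 / (1 + exp(-(w0 + Σi wi Xi)))
  P(Y = 0 | X) = 1 - P(Y = 1 | X) = exp(-(w0 + Σi wi Xi)) / (1 + exp(-(w0 + Σi wi Xi)))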

Understanding Sigmoids Sigmoid curves for w0 = -2, w1 = -1; w0 = 0, w1 = -1; and w0 = 0, w1 = -0.5
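A small sketch that reproduces plots like these (assumes NumPy and matplotlib are available; parameter pairs taken from the slide).

  import numpy as np
  import matplotlib.pyplot as plt

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  x = np.linspace(-10, 10, 200)
  for w0, w1 in [(-2, -1), (0, -1), (0, -0.5)]:
      plt.plot(x, sigmoid(w0 + w1 * x), label=f"w0={w0}, w1={w1}")
  plt.xlabel("x")
  plt.ylabel("P(Y=1|x)")
  plt.legend()
  plt.show()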

Very convenient! P(Y = 1 | X) > P(Y = 0 | X) implies w0 + Σi wi Xi > 0, P(Y = 1 | X) = P(Y = 0 | X) implies w0 + Σi wi Xi = 0, and P(Y = 1 | X) < P(Y = 0 | X) implies w0 + Σi wi Xi < 0: a linear classification rule! ©Carlos Guestrin 2005-2009

Logistic regression more generally Logistic regression in the more general case, where Y ∈ {y1,…,yR}: a separate set of weights for each class k < R, and for k = R the probability comes from normalization, so no weights for this class. Features can be discrete or continuous! ©Carlos Guestrin 2005-2009
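For reference, the standard per-class forms (textbook statement, not copied from the slides):
  P(Y = yk | X) = exp(wk0 + Σi wki Xi) / (1 + Σ_{j<R} exp(wj0 + Σi wji Xi))   for k < R
  P(Y = yR | X) = 1 / (1 + Σ_{j<R} exp(wj0 + Σi wji Xi))   (normalization, so no weights for this class)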

Loss functions: Likelihood vs. Conditional Likelihood Generative (Naïve Bayes) Loss function: Data likelihood Discriminative models cannot compute P(xj|w)! But, discriminative (logistic regression) loss function: Conditional Data Likelihood Doesn’t waste effort learning P(X) – focuses on P(Y|X) all that matters for classification ©Carlos Guestrin 2005-2009
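Written out (standard definitions, in the notation used here), the two objectives being contrasted are:
  Data likelihood (generative):  ln P(D | w) = Σ_{j=1..N} ln P(x^j, y^j | w)
  Conditional data likelihood (discriminative):  l(w) = ln P(D_Y | D_X, w) = Σ_{j=1..N} ln P(y^j | x^j, w)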

Expressing Conditional Log Likelihood ©Carlos Guestrin 2005-2009
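For binary Y ∈ {0, 1} under the sigmoid convention above, the standard expression (stated here for reference, not copied from the slide) is:
  l(w) = Σj [ y^j ln P(Y = 1 | x^j, w) + (1 - y^j) ln P(Y = 0 | x^j, w) ]
       = Σj [ y^j (w0 + Σi wi xi^j) - ln(1 + exp(w0 + Σi wi xi^j)) ]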

Maximizing Conditional Log Likelihood Good news: l(w) is a concave function of w, so there are no locally optimal solutions. Bad news: no closed-form solution to maximize l(w). Good news: concave functions are easy to optimize. ©Carlos Guestrin 2005-2009

Optimizing concave function – Gradient ascent Conditional likelihood for Logistic Regression is concave, so find the optimum with gradient ascent. Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading). Gradient: ∇w l(w) = [∂l(w)/∂w0, …, ∂l(w)/∂wn]. Update rule, with learning rate η > 0: wi ← wi + η ∂l(w)/∂wi. ©Carlos Guestrin 2005-2009

Maximize Conditional Log Likelihood: Gradient ascent ©Carlos Guestrin 2005-2009
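The standard result of this derivation, under the same sigmoid convention, is the partial derivative used in the updates that follow:
  ∂l(w)/∂wi = Σj xi^j ( y^j - P(Y = 1 | x^j, w) )      (and ∂l(w)/∂w0 = Σj ( y^j - P(Y = 1 | x^j, w) ))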

Gradient Ascent for LR Gradient ascent algorithm: iterate until change < ε: update w0, and for i = 1,…,n update wi using the gradient above; repeat. ©Carlos Guestrin 2005-2009
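A minimal sketch of batch gradient ascent for LR under the convention P(Y = 1 | x) = sigmoid(w0 + w·x); names and defaults (eta, tol) are hypothetical, and eta may need to be small for large datasets.

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def fit_lr(X, y, eta=0.01, tol=1e-6, max_iters=100000):
      # X: (N, n) array of features, y: length-N array of 0/1 labels
      N, n = X.shape
      w0, w = 0.0, np.zeros(n)
      for _ in range(max_iters):
          err = y - sigmoid(w0 + X @ w)          # y^j - P(Y=1 | x^j, w) for every example
          dw0, dw = err.sum(), X.T @ err         # gradient of the conditional log likelihood
          w0 += eta * dw0
          w += eta * dw
          if eta * max(abs(dw0), np.abs(dw).max()) < tol:   # iterate until change < epsilon
              break
      return w0, w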

That's all M(C)LE. How about MAP? One common approach is to define priors on w: a Normal distribution with zero mean and identity covariance. This "pushes" parameters towards zero, corresponds to regularization, and helps avoid very large weights and overfitting. More on this later in the semester. The MAP estimate maximizes this posterior over w. ©Carlos Guestrin 2005-2009

M(C)AP as Regularization Penalizes high weights, also applicable in linear regression ©Carlos Guestrin 2005-2009
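For reference, a standard way to write the M(C)AP objective, assuming the zero-mean Gaussian prior on w mentioned above (with precision λ):
  w* = argmax_w [ ln p(w) + Σj ln P(y^j | x^j, w) ] = argmax_w [ Σj ln P(y^j | x^j, w) - (λ/2) Σi wi² ]   (up to a constant)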

Large parameters ⇒ Overfitting If the data is linearly separable, the weights go to infinity, which leads to overfitting. Penalizing high weights can prevent overfitting… again, more on this later in the semester ©Carlos Guestrin 2005-2009

Gradient of M(C)AP ©Carlos Guestrin 2005-2009
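The standard gradient of the regularized objective above (textbook form, consistent with the earlier unregularized gradient):
  ∂/∂wi [ l(w) - (λ/2) Σi wi² ] = Σj xi^j ( y^j - P(Y = 1 | x^j, w) ) - λ wi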

MLE vs MAP Maximum conditional likelihood estimate Maximum conditional a posteriori estimate ©Carlos Guestrin 2005-2009
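Written out in the notation above (standard forms):
  MCLE:  w* = argmax_w Σj ln P(y^j | x^j, w)
  MCAP:  w* = argmax_w [ ln p(w) + Σj ln P(y^j | x^j, w) ]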

Logistic regression v. Naïve Bayes Consider learning f: X → Y, where X is a vector of real-valued features <X1 … Xn> and Y is boolean. Could use a Gaussian Naïve Bayes classifier: assume all Xi are conditionally independent given Y, model P(Xi | Y = yk) as Gaussian N(μik, σi), and model P(Y) as Bernoulli(q, 1-q). What does that imply about the form of P(Y|X)? Cool!!!! ©Carlos Guestrin 2005-2009

Derive form for P(Y|X) for continuous Xi ©Carlos Guestrin 2005-2009

Ratio of class-conditional probabilities ©Carlos Guestrin 2005-2009

Derive form for P(Y|X) for continuous Xi ©Carlos Guestrin 2005-2009
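The three derivation slides above are equation-only; the standard result they build to (stated here in the sigmoid convention used earlier, with class-independent variances σi and θ = P(Y = 1)) is:
  ln [ P(Y = 1 | X) / P(Y = 0 | X) ] = w0 + Σi wi Xi,   with  wi = (μi1 - μi0) / σi²   and   w0 = ln(θ / (1 - θ)) + Σi (μi0² - μi1²) / (2σi²)
so P(Y = 1 | X) = 1 / (1 + exp(-(w0 + Σi wi Xi))): GNB with class-independent variances implies exactly the logistic form for P(Y | X).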

Gaussian Naïve Bayes v. Logistic Regression Set of Gaussian Naïve Bayes parameters (feature variance independent of class label) vs. set of Logistic Regression parameters. Representation equivalence, but only in a special case!!! (GNB with class-independent variances) But what's the difference??? LR makes no assumptions about P(X|Y) in learning!!! Loss function!!! They optimize different functions, so they obtain different solutions. ©Carlos Guestrin 2005-2009

Naïve Bayes vs Logistic Regression Consider Y boolean, Xi continuous, X=<X1 ... Xn> Number of parameters: NB: 4n +1 LR: n+1 Estimation method: NB parameter estimates are uncoupled LR parameter estimates are coupled ©Carlos Guestrin 2005-2009

G. Naïve Bayes vs. Logistic Regression 1 [Ng & Jordan, 2002] Generative and Discriminative classifiers Asymptotic comparison (# training examples → infinity): when the model is correct, GNB and LR produce identical classifiers; when the model is incorrect, LR is less biased (it does not assume conditional independence), therefore LR is expected to outperform GNB. ©Carlos Guestrin 2005-2009

G. Naïve Bayes vs. Logistic Regression 2 [Ng & Jordan, 2002] Generative and Discriminative classifiers Non-asymptotic analysis convergence rate of parameter estimates, n = # of attributes in X Size of training data to get close to infinite data solution GNB needs O(log n) samples LR needs O(n) samples GNB converges more quickly to its (perhaps less helpful) asymptotic estimates ©Carlos Guestrin 2005-2009

Some experiments from UCI data sets: naïve Bayes vs. logistic regression ©Carlos Guestrin 2005-2009

What you should know about Logistic Regression (LR) Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR; the solution differs because of the objective (loss) function. In general, NB and LR make different assumptions (NB: features independent given class, an assumption on P(X|Y); LR: functional form of P(Y|X), no assumption on P(X|Y)). LR is a linear classifier: the decision rule is a hyperplane. LR is optimized by conditional likelihood: no closed-form solution, but concave, so gradient ascent reaches the global optimum. Maximum conditional a posteriori corresponds to regularization. Convergence rates: GNB (usually) needs less data; LR (usually) gets to better solutions in the limit. ©Carlos Guestrin 2005-2009