ECE 5984: Introduction to Machine Learning
Dhruv Batra, Virginia Tech

Topics:
–Classification: Logistic Regression
–NB & LR connections

Readings: Barber 17.4

Administrativia
HW2
–Due: Friday 03/06, 03/15, 11:55pm
–Implement linear regression, Naïve Bayes, Logistic Regression
Need a couple of catch-up lectures
–How about 4-6pm?

Recap of last time

Naïve Bayes (your first probabilistic classifier)
Classification: x → y, where y is discrete

Classification
Learn h: X → Y
–X: features
–Y: target classes
Suppose you know P(Y|X) exactly; how should you classify?
–Bayes classifier: h(x) = argmax_y P(Y=y | X=x). Why?
(Slide credit: Carlos Guestrin)

Error Decomposition
Approximation/modeling error
–You approximated reality with a model
Estimation error
–You tried to learn the model with finite data
Optimization error
–You were lazy and couldn't/didn't optimize to completion
Bayes error
–Reality just sucks

Generative vs. Discriminative
Generative approach (Naïve Bayes)
–Estimate p(x|y) and p(y)
–Use Bayes rule to predict y
Discriminative approach
–Estimate p(y|x) directly (Logistic Regression)
–Learn a "discriminant" function h(x) (Support Vector Machine)

The Naïve Bayes assumption
Naïve Bayes assumption:
–Features are independent given class: P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y)
–More generally: P(X_1, ..., X_d | Y) = Π_i P(X_i | Y)
How many parameters now? Suppose X is composed of d binary features: 2d + 1 for binary Y, versus 2(2^d − 1) + 1 for the full joint.
(Slide credit: Carlos Guestrin)

Today: Logistic Regression
Main idea
–Think about a 2-class problem {0,1}
–Can we regress to P(Y=1 | X=x)?
Meet the logistic (sigmoid) function
–Crunches real numbers down to (0,1): σ(z) = 1 / (1 + exp(−z))
Model
–Linear regression: y ~ N(w'x, λ²)
–Logistic regression: y ~ Bernoulli(σ(w'x))
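As a concrete illustration (not from the slides; the parameter values are made up), here is the pair of models in NumPy:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, 1.0])   # [w0 (bias), w1] -- illustrative values
x = np.array([1.0, 0.5])   # [1 (bias feature), x1]

y_mean = w @ x             # linear regression: E[y] = w'x, y real-valued
p_y1 = sigmoid(w @ x)      # logistic regression: P(Y=1|x) = sigma(w'x)
print(y_mean, p_y1)
```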

Understanding the sigmoid
[Plots of σ(w_0 + w_1 x) for: w_0=0, w_1=1; w_0=2, w_1=1; w_0=0, w_1=0.5]
–w_0 shifts the curve left/right; w_1 controls its steepness
(Slide credit: Carlos Guestrin)

Logistic Regression – a linear classifier
–Predict Y=1 iff P(Y=1|x) ≥ 0.5, i.e. iff w'x ≥ 0: the decision boundary is a hyperplane
Demo –

Visualization
(Slide credit: Kevin Murphy)

Expressing Conditional Log Likelihood
l(w) = Σ_j [ y_j ln P(Y=1 | x_j, w) + (1 − y_j) ln P(Y=0 | x_j, w) ]
where P(Y=1 | x, w) = σ(w'x) and P(Y=0 | x, w) = 1 − σ(w'x)
(Slide credit: Carlos Guestrin)
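A minimal sketch of this quantity in code (function and variable names are assumptions, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_log_likelihood(w, X, y):
    """l(w) = sum_j y_j ln P(Y=1|x_j,w) + (1 - y_j) ln P(Y=0|x_j,w).
    X: (n, d) design matrix with a leading column of ones for the bias.
    y: (n,) array of 0/1 labels."""
    p = sigmoid(X @ w)
    eps = 1e-12  # guard against log(0) when p saturates
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```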

Maximizing Conditional Log Likelihood
Bad news: no closed-form solution to maximize l(w)
Good news: l(w) is a concave function of w!
(Slide credit: Carlos Guestrin)

Careful about step-size
(Slide credit: Nicolas Le Roux)

When does it work?

Optimizing a concave function – gradient ascent
Conditional likelihood for logistic regression is concave → find the optimum with gradient ascent
Gradient: ∇_w l(w) = [ ∂l(w)/∂w_0, ..., ∂l(w)/∂w_d ]
Update rule: w^(t+1) ← w^(t) + η ∇_w l(w^(t)), with learning rate η > 0
(Slide credit: Carlos Guestrin)

Maximize Conditional Log Likelihood: Gradient ascent
∂l(w)/∂w_i = Σ_j x_j,i ( y_j − P(Y=1 | x_j, w) )
(Slide credit: Carlos Guestrin)

Gradient Ascent for LR
Gradient ascent algorithm: iterate until change < ε
–w_0^(t+1) ← w_0^(t) + η Σ_j ( y_j − P(Y=1 | x_j, w^(t)) )
For i = 1, …, d, repeat:
–w_i^(t+1) ← w_i^(t) + η Σ_j x_j,i ( y_j − P(Y=1 | x_j, w^(t)) )
(Slide credit: Carlos Guestrin)
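A compact sketch of this loop in NumPy, vectorized over all weights at once (names and default values are illustrative assumptions, not the lecture's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, eta=0.1, eps=1e-6, max_iters=10000):
    """Batch gradient ascent on the conditional log likelihood.
    X: (n, d) matrix with a leading column of ones for the bias w_0.
    y: (n,) array of 0/1 labels."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        grad = X.T @ (y - sigmoid(X @ w))    # dl(w)/dw, all coordinates at once
        w_new = w + eta * grad               # ascent step
        if np.max(np.abs(w_new - w)) < eps:  # iterate until change < eps
            return w_new
        w = w_new
    return w
```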

That's all M(C)LE. How about M(C)AP?
One common approach is to define a prior on w
–Normal distribution, zero mean, identity covariance
–"Pushes" parameters towards zero
Corresponds to regularization
–Helps avoid very large weights and overfitting
–More on this later in the semester
MAP estimate: w* = argmax_w ln [ p(w) Π_j P(y_j | x_j, w) ]
(Slide credit: Carlos Guestrin)
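Spelling out why the Gaussian prior is a regularizer (standard algebra, added here for completeness): with p(w) ∝ exp(−λ w'w / 2), taking logs turns the MAP objective into a penalized likelihood,

l_MAP(w) = l(w) − (λ/2) Σ_i w_i² + const,

which is exactly L2 regularization of the conditional log likelihood.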

Large parameters → overfitting
If the data is linearly separable, the MLE weights go to infinity
–Leads to overfitting
–Penalizing high weights can prevent overfitting
(Slide credit: Carlos Guestrin)

Gradient of M(C)AP
∂/∂w_i [ l(w) − (λ/2) Σ_k w_k² ] = Σ_j x_j,i ( y_j − P(Y=1 | x_j, w) ) − λ w_i
(Slide credit: Carlos Guestrin)
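Relative to the earlier gradient-ascent sketch, the only change is a shrinkage term (a minimal sketch; lam is a hypothetical name for λ):

```python
import numpy as np

def map_gradient(w, X, y, lam):
    """Gradient of the MAP objective l(w) - (lam/2) * ||w||^2.
    Identical to the MLE gradient except for the -lam * w shrinkage term."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # P(Y=1 | x_j, w) for every row of X
    return X.T @ (y - p) - lam * w
```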

MLE vs MAP
Maximum conditional likelihood estimate
–w*_MCLE = argmax_w Π_j P(y_j | x_j, w)
Maximum conditional a posteriori estimate
–w*_MCAP = argmax_w p(w) Π_j P(y_j | x_j, w)
(Slide credit: Carlos Guestrin)

HW2 Tips
Naïve Bayes
–Train_NB: implement "factor_tables" (|X_i| x |Y| matrices) and a prior (|Y| x 1 vector); fill entries by counting + smoothing
–Test_NB: argmax_y P(Y=y) Π_i P(X_i=x_i | Y=y)
–TIP: work in the log domain
Logistic Regression
–Use a small step-size at first
–Make sure you maximize the log-likelihood, not minimize it
–Sanity check: plot the objective
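A minimal sketch of the log-domain trick for the Test_NB step (the array layout here is an assumption, not HW2's actual interface):

```python
import numpy as np

def predict_nb(log_prior, log_factor_tables, x):
    """log_prior: (|Y|,) array of log P(Y=y).
    log_factor_tables: list of (|X_i|, |Y|) arrays of log P(X_i=v | Y=y).
    x: observed feature values, one index per feature."""
    scores = log_prior.copy()
    for i, v in enumerate(x):
        scores += log_factor_tables[i][v]  # add logs instead of multiplying probabilities
    return np.argmax(scores)               # argmax_y log P(Y=y) + sum_i log P(X_i=x_i|Y=y)
```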

Finishing up: Connections between NB & LR

Logistic regression vs Naïve Bayes
Consider learning f: X → Y, where
–X is a vector of real-valued features <X_1, ..., X_d>
–Y is boolean
Gaussian Naïve Bayes classifier
–Assume all X_i are conditionally independent given Y
–Model P(X_i | Y = k) as Gaussian N(μ_ik, σ_i)
–Model P(Y) as Bernoulli(θ, 1−θ)
What does that imply about the form of P(Y|X)? Cool!!!!
(Slide credit: Carlos Guestrin)

Derive form for P(Y|X) for continuous X_i
(Slide credit: Carlos Guestrin)
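The derivation on this slide was an image; reconstructed in outline (standard Bayes-rule algebra, following Mitchell's treatment):

P(Y=1|X) = P(Y=1) P(X|Y=1) / [ P(Y=1) P(X|Y=1) + P(Y=0) P(X|Y=0) ]
         = 1 / ( 1 + exp( ln [ P(Y=0) P(X|Y=0) / (P(Y=1) P(X|Y=1)) ] ) )
         = 1 / ( 1 + exp( ln((1−θ)/θ) + Σ_i ln [ P(X_i|Y=0) / P(X_i|Y=1) ] ) )

where the last step uses the Naïve Bayes conditional-independence assumption.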

Ratio of class-conditional probabilities
(Slide credit: Carlos Guestrin)
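With class-independent variances, each per-feature log ratio is linear in X_i (reconstructed; plug the Gaussian density into the ratio and the quadratic terms cancel):

ln [ P(X_i|Y=0) / P(X_i|Y=1) ] = ln [ N(X_i; μ_i0, σ_i) / N(X_i; μ_i1, σ_i) ]
                               = (μ_i0 − μ_i1)/σ_i² · X_i + (μ_i1² − μ_i0²)/(2σ_i²)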

Derive form for P(Y|X) for continuous X_i (continued)
(Slide credit: Carlos Guestrin)
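Putting the two previous steps together (reconstructed; this is the punchline the next slide refers to):

P(Y=1|X) = 1 / ( 1 + exp( w_0 + Σ_i w_i X_i ) )
with w_i = (μ_i0 − μ_i1)/σ_i² and w_0 = ln((1−θ)/θ) + Σ_i (μ_i1² − μ_i0²)/(2σ_i²)

i.e., GNB with class-independent variances implies exactly the logistic-regression form for P(Y|X).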

Gaussian Naïve Bayes vs Logistic Regression
Representation equivalence
–But only in a special case!!! (GNB with class-independent variances)
But what's the difference???
–LR makes no assumptions about P(X|Y) in learning!!!
–Loss function!!! Optimize different functions → obtain different solutions (not necessarily the same)
[Diagram: set of Gaussian Naïve Bayes parameters (feature variance independent of class label) within the set of Logistic Regression parameters]
(Slide credit: Carlos Guestrin)

Naïve Bayes vs Logistic Regression
Consider Y boolean, X_i continuous, X = <X_1, ..., X_d>
Number of parameters:
–NB: 4d + 1 (or 3d + 1 with class-independent variances)
–LR: d + 1
Estimation method:
–NB parameter estimates are uncoupled
–LR parameter estimates are coupled
(Slide credit: Carlos Guestrin)

G. Naïve Bayes vs. Logistic Regression 1 [Ng & Jordan, 2002]
Generative and discriminative classifiers
Asymptotic comparison (# training examples → infinity)
–When the model is correct: GNB (with class-independent variances) and LR produce identical classifiers
–When the model is incorrect: LR is less biased, since it does not assume conditional independence, so LR is expected to outperform GNB
(Slide credit: Carlos Guestrin)

G. Naïve Bayes vs. Logistic Regression 2 [Ng & Jordan, 2002]
Generative and discriminative classifiers
Non-asymptotic analysis
–Convergence rate of parameter estimates, with d = # of attributes in X
–Size of training data needed to get close to the infinite-data solution: GNB needs O(log d) samples; LR needs O(d) samples
–GNB converges more quickly to its (perhaps less helpful) asymptotic estimates
(Slide credit: Carlos Guestrin)

Some experiments from UCI data sets
[Plots comparing Naïve Bayes and Logistic Regression error vs. training-set size]

What you should know about LR
Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR
–Solutions differ because of the objective (loss) function
In general, NB and LR make different assumptions
–NB: features independent given class; an assumption on P(X|Y)
–LR: functional form of P(Y|X); no assumption on P(X|Y)
LR is a linear classifier
–Decision rule is a hyperplane
LR is optimized by conditional likelihood
–No closed-form solution
–Concave → global optimum with gradient ascent
–Maximum conditional a posteriori corresponds to regularization
Convergence rates
–GNB (usually) needs less data
–LR (usually) gets to better solutions in the limit
(Slide credit: Carlos Guestrin)