1 Logistic Regression Adapted from: Tom Mitchell’s Machine Learning Book Evan Wei Xiang and Qiang Yang.

Presentation transcript:

1 Logistic Regression Adapted from: Tom Mitchell’s Machine Learning Book Evan Wei Xiang and Qiang Yang

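2 Classification
Learn h: X -> Y
– X: features
– Y: target classes
Suppose you know P(Y|X) exactly. How should you classify?
– Bayes classifier: $h(x) = \arg\max_y P(Y = y \mid X = x)$
Why? Choosing the most probable class for each x minimizes the probability of misclassification.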
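As a small illustration of the Bayes classifier, the sketch below picks the most probable class from a known posterior. The function name `bayes_classify`, the class labels, and the probabilities are made up for the example and are not taken from the slides.

```python
def bayes_classify(posterior):
    """Return the class y with the largest P(Y=y | X=x).

    `posterior` maps each class label to its (assumed known) posterior
    probability for one fixed instance x.
    """
    return max(posterior, key=posterior.get)


# Hypothetical posterior for a single instance x (illustrative numbers).
posterior_given_x = {"spam": 0.7, "ham": 0.3}

prediction = bayes_classify(posterior_given_x)

# Predicting the most probable class leaves the smallest possible
# probability of being wrong for this x.
error_probability = 1.0 - posterior_given_x[prediction]
```

Any other choice of label for this x would have a larger error probability, which is the sense in which the rule is optimal.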

3 Generative vs. Discriminative Classifiers - Intuition
Generative classifier, e.g., Naïve Bayes:
– Assume some functional form for P(X|Y), P(Y)
– Estimate parameters of P(X|Y), P(Y) directly from training data
– Use Bayes rule to calculate P(Y|X=x)
– This is a "generative" model: P(Y|X) is computed indirectly through Bayes rule, but the model can generate a sample of the data, because $P(X) = \sum_y P(y) P(X \mid y)$ is available
Discriminative classifier, e.g., Logistic Regression:
– Assume some functional form for P(Y|X)
– Estimate parameters of P(Y|X) directly from training data
– This is a "discriminative" model: P(Y|X) is learned directly, but the model cannot sample data, because P(X) is not available

4 The Naïve Bayes Classifier
Given:
– Prior P(Y)
– n features X_1, …, X_n that are conditionally independent given the class Y
– For each X_i, the likelihood P(X_i|Y)
Decision rule: $y^* = h_{NB}(x) = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)$
If the assumption holds, NB is the optimal classifier!
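The decision rule can be sketched in a few lines. This is only an illustration: the function name `naive_bayes_predict`, the two binary features, and every probability in the tables are hypothetical. The product is evaluated as a sum of logs, a common choice to avoid underflow when n is large.

```python
import math


def naive_bayes_predict(x, prior, likelihood):
    """Return argmax_y P(y) * prod_i P(x_i | y), computed in log space.

    prior[y]             -> P(Y = y)
    likelihood[y][i][v]  -> P(X_i = v | Y = y)
    """
    best_label, best_score = None, -math.inf
    for label, p_y in prior.items():
        score = math.log(p_y)
        for i, value in enumerate(x):
            score += math.log(likelihood[label][i][value])
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up parameters for two classes and two binary features.
prior = {0: 0.6, 1: 0.4}
likelihood = {
    0: [{0: 0.8, 1: 0.2}, {0: 0.7, 1: 0.3}],
    1: [{0: 0.3, 1: 0.7}, {0: 0.4, 1: 0.6}],
}

nb_label = naive_bayes_predict((1, 1), prior, likelihood)
```

For x = (1, 1) the unnormalized scores are 0.6 · 0.2 · 0.3 = 0.036 for class 0 and 0.4 · 0.7 · 0.6 = 0.168 for class 1, so class 1 is chosen.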

5 Logistic Regression
Let X be the data instance and Y be the class label. Learn P(Y|X) directly:
– Let W = (W1, W2, …, Wn), X = (X1, X2, …, Xn); WX is the dot product
– Sigmoid function: $P(Y = 1 \mid X) = \frac{1}{1 + e^{-WX}}$
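A minimal sketch of the sigmoid applied to the dot product WX follows. The helper names `sigmoid`, `dot`, and `predict_proba` are chosen here for illustration. The two-branch form is one common way to avoid overflow in `math.exp` for large negative inputs.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(z) is small, so this form cannot overflow.
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, x):
    """Dot product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def predict_proba(w, x):
    """Model estimate of P(Y = 1 | X = x) for weights w."""
    return sigmoid(dot(w, x))
```

The output always lies in [0, 1], equals 0.5 when WX = 0, and satisfies sigmoid(z) + sigmoid(-z) = 1.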

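6 Logistic Regression
In logistic regression, we learn the conditional distribution P(y|x). Let $p_y(x; w)$ be our estimate of P(y|x), where w is a vector of adjustable parameters. Assume there are two classes, y = 0 and y = 1, and
$p_1(x; w) = \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)}$, $p_0(x; w) = 1 - p_1(x; w) = \frac{1}{1 + \exp(w \cdot x)}$.
This is equivalent to $\ln \frac{p_1(x; w)}{p_0(x; w)} = w \cdot x$. That is, the log odds of class 1 is a linear function of x.
Q: How to find w?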
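The claim that the log odds is linear in x can be checked numerically. The sketch below assumes the two-class form in which P(y=1|x) is the sigmoid of w · x. The weights and the instance are arbitrary illustrative values.

```python
import math


def class_probabilities(w, x):
    """Return (p0, p1), the model estimates of P(y=0|x) and P(y=1|x)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p1 = 1.0 / (1.0 + math.exp(-z))
    return 1.0 - p1, p1


# Arbitrary example values (not from the slides).
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.25]

p0, p1 = class_probabilities(w, x)
log_odds = math.log(p1 / p0)
linear_score = sum(wi * xi for wi, xi in zip(w, x))  # 0.5 - 2.0 + 0.5 = -1.0
```

Here `log_odds` and `linear_score` agree up to floating-point error, which is the equivalence stated on the slide.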

7 Constructing a Learning Algorithm
The conditional data likelihood is the probability of the observed Y values in the training data, conditioned on their corresponding X values. We choose parameters w that satisfy
$w \leftarrow \arg\max_w \prod_l P(y^l \mid x^l, w)$,
where w is the vector of parameters to be estimated, $y^l$ denotes the observed value of Y in the l-th training example, and $x^l$ denotes the observed value of X in the l-th training example.

8 Constructing a Learning Algorithm
Equivalently, we can work with the log of the conditional likelihood:
$w \leftarrow \arg\max_w \sum_l \ln P(y^l \mid x^l, w)$.
This conditional data log likelihood, which we will denote $l(w)$, can be written as
$l(w) = \sum_l \left[ y^l \ln P(y^l = 1 \mid x^l, w) + (1 - y^l) \ln P(y^l = 0 \mid x^l, w) \right]$.
Note that here we use the fact that Y can take only the values 0 or 1, so only one of the two terms in the expression will be non-zero for any given $y^l$.
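To make the conditional log likelihood concrete, here is a hedged sketch that evaluates it on a tiny data set. The data, the weights, and the convention of a leading constant feature 1.0 acting as an intercept are assumptions made for the example.

```python
import math


def p1(w, x):
    """Model estimate of P(y=1 | x) under weights w."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def conditional_log_likelihood(w, X, y):
    """Sum over examples of y*ln P(y=1|x,w) + (1-y)*ln P(y=0|x,w)."""
    total = 0.0
    for x_l, y_l in zip(X, y):
        p = p1(w, x_l)
        # Only one of the two terms is non-zero, because y_l is 0 or 1.
        total += y_l * math.log(p) + (1 - y_l) * math.log(1.0 - p)
    return total


# Made-up data: first feature is a constant 1.0 (intercept), second is the input.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]

ll_zero = conditional_log_likelihood([0.0, 0.0], X, y)    # every p is 0.5
ll_better = conditional_log_likelihood([-1.5, 1.0], X, y)
```

With all weights at zero every example gets probability 0.5, so the log likelihood is 4 · ln 0.5. Weights that separate the two groups score higher, which is what the maximization seeks.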

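9 Computing the Likelihood
We can re-express the log of the conditional likelihood as:
$l(w) = \sum_l \left[ y^l \ln \frac{P(y^l = 1 \mid x^l, w)}{P(y^l = 0 \mid x^l, w)} + \ln P(y^l = 0 \mid x^l, w) \right] = \sum_l \left[ y^l (w \cdot x^l) - \ln\left(1 + \exp(w \cdot x^l)\right) \right]$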
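The re-expression can be worked through step by step. The derivation below is a sketch that assumes the two-class form $P(y = 1 \mid x, w) = \exp(w \cdot x) / (1 + \exp(w \cdot x))$, so that the log odds equals $w \cdot x$ and $P(y = 0 \mid x, w) = 1 / (1 + \exp(w \cdot x))$.

```latex
\begin{align*}
l(w) &= \sum_l \Big[ y^l \ln P(y^l = 1 \mid x^l, w)
        + (1 - y^l) \ln P(y^l = 0 \mid x^l, w) \Big] \\
     &= \sum_l \Big[ y^l \ln \frac{P(y^l = 1 \mid x^l, w)}{P(y^l = 0 \mid x^l, w)}
        + \ln P(y^l = 0 \mid x^l, w) \Big] \\
     &= \sum_l \Big[ y^l \, (w \cdot x^l)
        - \ln\big(1 + \exp(w \cdot x^l)\big) \Big]
\end{align*}
```

The second line collects the two terms that multiply $y^l$. The third line substitutes the linear log odds and the closed form of $P(y^l = 0 \mid x^l, w)$.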

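10 Fitting LR by Gradient Ascent
Unfortunately, there is no closed-form solution to maximizing $l(w)$ with respect to w. Therefore, one common approach is to use gradient ascent. The i-th component of the gradient vector has the form
$\frac{\partial l(w)}{\partial w_i} = \sum_l x_i^l \left( y^l - \hat{P}(y^l = 1 \mid x^l, w) \right)$,
where $\hat{P}(y^l = 1 \mid x^l, w)$ is the logistic regression prediction for the l-th example. Each term is therefore the feature value multiplied by the prediction error.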
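One way to gain confidence in the gradient formula is to compare it with a finite-difference estimate of the log likelihood. The sketch below does that. The data, the weights, and the function names are illustrative assumptions.

```python
import math


def p1(w, x):
    """Logistic regression prediction, the estimate of P(y=1 | x, w)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(w, X, y):
    return sum(
        y_l * math.log(p1(w, x_l)) + (1 - y_l) * math.log(1.0 - p1(w, x_l))
        for x_l, y_l in zip(X, y)
    )


def gradient(w, X, y):
    """Component i is the sum over examples of x_i * (y - prediction)."""
    grad = [0.0] * len(w)
    for x_l, y_l in zip(X, y):
        error = y_l - p1(w, x_l)
        for i, x_li in enumerate(x_l):
            grad[i] += x_li * error
    return grad


def numerical_gradient(w, X, y, eps=1e-6):
    """Central finite differences of the log likelihood."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((log_likelihood(up, X, y) - log_likelihood(down, X, y)) / (2 * eps))
    return grad


# Made-up data and weights, used only for the check.
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]
y = [1, 0, 1, 0]
w = [0.3, -0.7]

analytic = gradient(w, X, y)
numeric = numerical_gradient(w, X, y)
```

The two lists should agree to several decimal places. A mismatch would usually indicate a sign or indexing error in the analytic gradient.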

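11 Fitting LR by Gradient Ascent
Given this formula for the derivative with respect to each $w_i$, we can use standard gradient ascent to optimize the weights w. Beginning with initial weights of zero, we repeatedly update the weights in the direction of the gradient, changing the i-th weight according to
$w_i \leftarrow w_i + \eta \sum_l x_i^l \left( y^l - \hat{P}(y^l = 1 \mid x^l, w) \right)$,
where $\eta$ is a small positive constant that determines the step size.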
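The update rule can be sketched as a short batch training loop. The step size `eta`, the fixed number of steps, and the toy data are assumptions chosen for the example. The slides do not specify a stopping rule, so a fixed iteration count is used here for simplicity.

```python
import math


def p1(w, x):
    """Logistic regression prediction, the estimate of P(y=1 | x, w)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(w, X, y):
    return sum(
        y_l * math.log(p1(w, x_l)) + (1 - y_l) * math.log(1.0 - p1(w, x_l))
        for x_l, y_l in zip(X, y)
    )


def fit_gradient_ascent(X, y, eta=0.05, n_steps=2000):
    """Batch gradient ascent on the conditional log likelihood."""
    w = [0.0] * len(X[0])  # start from zero weights
    for _ in range(n_steps):
        # Prediction errors are computed with the current weights,
        # then all weights are updated together.
        errors = [y_l - p1(w, x_l) for x_l, y_l in zip(X, y)]
        grad = [sum(x_l[i] * e for x_l, e in zip(X, errors)) for i in range(len(w))]
        w = [wi + eta * gi for wi, gi in zip(w, grad)]
    return w


# Made-up, non-separable data: constant feature 1.0 plus one input.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 0, 1, 1]

w_fit = fit_gradient_ascent(X, y)
```

On this data the fitted slope is positive, and the log likelihood at `w_fit` is higher than at the zero starting point. A step size that is too large can make the iteration diverge, so `eta` usually needs tuning for other data.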

12 Regularization in Logistic Regression
Overfitting the training data is a problem that can arise in logistic regression, especially when the data is very high-dimensional and sparse. One approach to reducing overfitting is regularization, in which we create a modified "penalized log likelihood function" that penalizes large values of w:
$w \leftarrow \arg\max_w \sum_l \ln P(y^l \mid x^l, w) - \frac{\lambda}{2} \|w\|^2$,
where the constant $\lambda$ sets the strength of the penalty.

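13 Regularization in Logistic Regression
The derivative of this penalized log likelihood function is similar to our earlier derivative, with one additional penalty term:
$\frac{\partial l(w)}{\partial w_i} = \sum_l x_i^l \left( y^l - \hat{P}(y^l = 1 \mid x^l, w) \right) - \lambda w_i$,
which gives us the modified gradient ascent rule
$w_i \leftarrow w_i + \eta \sum_l x_i^l \left( y^l - \hat{P}(y^l = 1 \mid x^l, w) \right) - \eta \lambda w_i$,
where $\lambda$ is the penalty strength and $\eta$ is the step size.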
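The effect of the penalty term can be seen by fitting the same data with and without it. The sketch below assumes a quadratic penalty of strength `lam` applied to every weight, including the constant-feature weight. The data, step size, and iteration count are illustrative.

```python
import math


def p1(w, x):
    """Logistic regression prediction, the estimate of P(y=1 | x, w)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def fit_penalized(X, y, lam, eta=0.05, n_steps=2000):
    """Gradient ascent on the penalized log likelihood."""
    w = [0.0] * len(X[0])
    for _ in range(n_steps):
        errors = [y_l - p1(w, x_l) for x_l, y_l in zip(X, y)]
        # Usual gradient step plus the extra shrinkage term -eta*lam*w_i.
        w = [
            wi + eta * sum(x_l[i] * e for x_l, e in zip(X, errors)) - eta * lam * wi
            for i, wi in enumerate(w)
        ]
    return w


def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))


# Made-up, non-separable data: constant feature 1.0 plus one input.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 0, 1, 0, 1, 1]

w_plain = fit_penalized(X, y, lam=0.0)
w_penalized = fit_penalized(X, y, lam=1.0)
```

With `lam=1.0` the fitted weight vector has a smaller norm than with `lam=0.0`, which is the intended shrinkage. In practice the intercept is often left unpenalized, and `lam` is chosen by validation.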

14 Summary of Logistic Regression
Learns the conditional probability distribution P(y|x).
Local search:
– Begins with an initial weight vector.
– Modifies it iteratively to maximize an objective function.
– The objective function is the conditional log likelihood of the data, so the algorithm seeks the probability distribution P(y|x) that is most likely given the data.

15 What you should know about LR
In general, NB and LR make different assumptions:
– NB: features independent given class -> assumption on P(X|Y)
– LR: functional form of P(Y|X), no assumption on P(X|Y)
LR is a linear classifier:
– decision rule is a hyperplane
LR is fit by maximizing the conditional likelihood:
– no closed-form solution
– concave -> global optimum with gradient ascent
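The statement that the decision rule is a hyperplane follows in a few steps. The derivation below is a sketch that assumes the two-class form $P(y = 1 \mid x, w) = \exp(w \cdot x) / (1 + \exp(w \cdot x))$ and a threshold of 1/2 for predicting class 1.

```latex
\begin{align*}
P(y = 1 \mid x, w) \ge \tfrac{1}{2}
  &\iff \frac{\exp(w \cdot x)}{1 + \exp(w \cdot x)} \ge \tfrac{1}{2} \\
  &\iff 2 \exp(w \cdot x) \ge 1 + \exp(w \cdot x) \\
  &\iff \exp(w \cdot x) \ge 1 \\
  &\iff w \cdot x \ge 0
\end{align*}
```

The boundary between the two predicted classes is therefore the set of points with $w \cdot x = 0$, which is a hyperplane in feature space.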