Machine Learning CUNY Graduate Center Lecture 4: Logistic Regression.

Today
Bayesians v. Frequentists
Logistic Regression – Linear Model for Classification

Bayesians v. Frequentists
What is a probability?
Frequentists
– A probability is the long-run frequency of an event: it is approximated by the ratio of the number of observed events to the total number of trials.
– Assessment is vital to selecting a model.
– Point estimates are absolutely fine.
Bayesians
– A probability is a degree of belief in a proposition.
– Bayesians require that probabilities be prior beliefs conditioned on data.
– The Bayesian approach “is optimal”, given a good model, a good prior, and a good loss function. Don’t worry so much about assessment.
– If you are ever making a point estimate, you’ve made a mistake. The only valid probabilities are posteriors based on evidence given some prior.
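The contrast can be made concrete with a small coin-flip sketch (a hypothetical example, not from the slides): the frequentist answer is a point estimate, the observed ratio, while the Bayesian answer is a posterior, here conditioned on an assumed uniform Beta(1, 1) prior.

```python
# Hypothetical coin-flip data: 7 heads in 10 flips.
heads, flips = 7, 10

# Frequentist: a point estimate -- the ratio of observed events to total trials.
p_mle = heads / flips  # 0.7

# Bayesian: a posterior belief. With a Beta(a, b) prior conditioned on the
# data, the posterior is Beta(a + heads, b + flips - heads).
a, b = 1.0, 1.0  # uniform prior (an assumption for this sketch)
post_a, post_b = a + heads, b + flips - heads
posterior_mean = post_a / (post_a + post_b)  # 8/12

print(p_mle, round(posterior_mean, 3))
```

Note how the prior pulls the Bayesian estimate toward 0.5 relative to the point estimate.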

Logistic Regression
Linear model applied to classification.
Supervised: target information is available.
– Each data point x_i has a corresponding target t_i.
Goal: identify a function mapping x to t.

Target Variables
In binary classification, it is convenient to represent t_i as a scalar with a range of [0, 1].
– Interpret t_i as the likelihood that x_i is a member of the positive class.
– Used to represent the confidence of a prediction.
For K > 2 classes, t_i is often represented as a K-element vector.
– t_ij represents the degree of membership in class j.
– |t_i| = 1
– E.g., a 5-way classification vector t_i = (0, 0, 1, 0, 0) places x_i in class 3.
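A minimal sketch of the K-element target encoding described above:

```python
def one_hot(label, K):
    """Encode class index `label` as a K-element target vector t,
    where t[j] is the degree of membership in class j and sum(t) = 1."""
    t = [0.0] * K
    t[label] = 1.0
    return t

t = one_hot(2, 5)  # 5-way classification, full membership in class 2 (0-indexed)
print(t)           # [0.0, 0.0, 1.0, 0.0, 0.0]
```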

Graphical Example of Classification

Decision Boundaries

Graphical Example of Classification

Classification Approaches
Generative
– Models the joint distribution of c and x.
– Highest data requirements.
Discriminative
– Models the conditional directly; fewer parameters to approximate.
Discriminant Function
– May still be trained probabilistically, but not necessarily modeling a likelihood.

Treating Classification as a Linear Model

Relationship between Regression and Classification
Since we’re classifying two classes, why not set one class to ‘0’ and the other to ‘1’, then use linear regression?
– Regression outputs range over (-infinity, infinity), while class labels are 0 and 1.
Can use a threshold, e.g.
– y >= 0.5, then class 1
– y < 0.5, then class 2
f(x) >= 0.5? Happy/Good/Class A : Sad/Not Good/Class B
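The thresholded-regression idea can be sketched on made-up 1-D data (all values here are hypothetical): fit least squares to 0/1 labels, then threshold the prediction at 0.5.

```python
import numpy as np

# Hypothetical 1-D training data: class 0 clustered low, class 1 clustered high.
x = np.array([0.0, 1.0, 2.0, 8.0, 9.0, 10.0])
t = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Least-squares fit of y = w0 + w1 * x to the 0/1 labels.
X = np.stack([np.ones_like(x), x], axis=1)
w, *_ = np.linalg.lstsq(X, t, rcond=None)

def classify(xi):
    y = w[0] + w[1] * xi
    return 1 if y >= 0.5 else 0  # threshold the unbounded regression output

print(classify(1.0), classify(9.0))
```

This works on well-separated data, but the unbounded regression output is the motivation for the odds-ratio view on the next slides.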

Odds Ratio
Rather than thresholding, we’ll relate the regression to the class-conditional probability.
Ratio of the odds of predicting y = 1 versus y = 0:
– If p(y=1|x) = 0.8 and p(y=0|x) = 0.2,
– the odds ratio is 0.8/0.2 = 4.
Use a linear model to predict the odds rather than a class label.

Logit – Log Odds Ratio Function
LHS (odds): 0 to infinity. RHS (linear model): -infinity to infinity.
Use a log function: logit(p) = ln(p / (1 - p)).
– Has the added bonus of dissolving the division, leading to easy manipulation.

Logistic Regression
A linear model used to predict the log-odds ratio of two classes:
ln(p(y=1|x) / p(y=0|x)) = w·x + b

Logit to Probability
Inverting the log-odds relation gives p(y=1|x) = 1 / (1 + exp(-(w·x + b))).

Sigmoid Function
A squashing function that maps the reals to a finite domain: σ(a) = 1 / (1 + exp(-a)) maps (-infinity, infinity) to (0, 1).
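The sigmoid and the logit are inverses of one another; a small sketch:

```python
import math

def sigmoid(a):
    """Squash the reals (-inf, inf) into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def logit(p):
    """Log-odds: map a probability in (0, 1) back to the reals."""
    return math.log(p / (1.0 - p))

print(sigmoid(0.0))                   # 0.5: zero log-odds means even odds
print(round(logit(sigmoid(2.0)), 6))  # 2.0: logit inverts sigmoid
```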

Gaussian Class-conditional
Assume the data for each class is generated from a Gaussian distribution.
Leads to a Bayesian formulation of logistic regression.

Bayesian Logistic Regression

Maximum Likelihood Estimation
Logistic regression with class-conditional Gaussians and a multinomial class distribution.
As ever, take the derivative of this likelihood function w.r.t. the parameters.

Maximum Likelihood Estimation of the Prior

Maximum Likelihood Estimation of the Prior (cont.)

Maximum Likelihood Estimation of the Prior (cont.)

Discriminative Training
Take the derivatives w.r.t. the parameters.
– Be prepared for this for homework.
In the generative formulation, we need to estimate the joint distribution of t and x.
– But we get an intuitive regularization technique.
Discriminative training:
– Model p(t|x) directly.

What’s the Problem with Generative Training?
Formulated discriminatively, in D dimensions this function has D parameters.
In the generative case: 2D means and D(D+1)/2 covariance values.
Quadratic growth in the number of parameters, where we’d rather have linear growth.
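The growth comparison can be sketched as a quick parameter count (bias and prior terms omitted for simplicity):

```python
def discriminative_params(D):
    # One weight per dimension: linear growth.
    return D

def generative_params(D):
    # Two class means (2D values) plus a shared symmetric covariance
    # matrix (D*(D+1)/2 values): quadratic growth.
    return 2 * D + D * (D + 1) // 2

for D in (2, 10, 100):
    print(D, discriminative_params(D), generative_params(D))
```

At D = 100 the generative model already needs over 5,000 parameters against the discriminative model's 100.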

Discriminative Training

Optimization
Take the gradient of the error function in terms of w.

Optimization (cont.)

Optimization (cont.)

Optimization (cont.)

Optimization: Putting It Together
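The derivation on these slides (the equations did not survive the transcript) ends at the standard cross-entropy gradient, ∇E(w) = Σ_n (y_n − t_n) x_n with y_n = σ(w·x_n). A sketch that checks this analytic gradient against a finite-difference approximation on made-up data:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical 2-D data points, each with a constant 1.0 appended as a bias feature.
X = [(1.0, 0.5, 1.0), (2.0, -1.0, 1.0), (-1.0, 0.3, 1.0)]
t = [1.0, 0.0, 1.0]
w = [0.1, -0.2, 0.05]

def error(w):
    """Cross-entropy error: E(w) = -sum_n t_n ln y_n + (1 - t_n) ln(1 - y_n)."""
    E = 0.0
    for x, tn in zip(X, t):
        y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        E -= tn * math.log(y) + (1 - tn) * math.log(1 - y)
    return E

def gradient(w):
    """Analytic gradient: sum over n of (y_n - t_n) * x_n."""
    g = [0.0] * len(w)
    for x, tn in zip(X, t):
        y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for j in range(len(w)):
            g[j] += (y - tn) * x[j]
    return g

# Central-difference check of each gradient component.
eps = 1e-6
for j in range(len(w)):
    wp = list(w); wp[j] += eps
    wm = list(w); wm[j] -= eps
    numeric = (error(wp) - error(wm)) / (2 * eps)
    assert abs(numeric - gradient(w)[j]) < 1e-4
print("analytic gradient matches finite differences")
```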

Optimization
We know the gradient of the error function, but how do we find the optimum?
Setting the gradient to zero is nontrivial.
Use numerical approximation.

Gradient Descent
Take a guess.
Move in the direction of the negative gradient.
Jump again.
On a convex function this will converge.
Other methods include Newton-Raphson.
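A minimal gradient-descent loop for logistic regression on hypothetical 1-D data (the step size and iteration count are arbitrary choices for this sketch):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical 1-D points: low x -> class 0, high x -> class 1.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]
w, b = 0.0, 0.0  # take a guess
eta = 0.5        # step size (an arbitrary choice)

for _ in range(200):
    gw = gb = 0.0
    for x, t in data:
        y = sigmoid(w * x + b)
        gw += (y - t) * x  # gradient of cross-entropy w.r.t. w
        gb += (y - t)      # ... and w.r.t. the bias
    # Move in the direction of the negative gradient, then jump again.
    w -= eta * gw
    b -= eta * gb

predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x, _ in data]
print(predictions)  # the two clusters should be separated
```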

Multi-class Discriminant Functions
Can extend to multiple classes.
Other approaches include constructing K-1 binary classifiers.
– Each classifier compares c_n to not-c_n.
– Computationally simpler, but not without problems.

Exponential Model
Logistic regression is a type of exponential model.
– A linear combination of weights and features produces a probabilistic model.

Problems with Binary Discriminant Functions

K-class Discriminant
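For K classes, the two-class sigmoid generalizes to the softmax (normalized exponential); a sketch, assuming per-class scores a_k = w_k·x have already been computed:

```python
import math

def softmax(scores):
    """Normalized exponential: p_k = exp(a_k) / sum_j exp(a_j)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

p = softmax([2.0, 1.0, 0.1])      # hypothetical scores for 3 classes
print([round(v, 3) for v in p])   # a proper distribution over the classes
print(abs(sum(p) - 1.0) < 1e-12)  # True
```

The class with the highest score receives the highest probability, and with K = 2 this reduces to the sigmoid of the score difference.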

Entropy
A measure of uncertainty, or a measure of “information”.
High uncertainty equals high entropy.
Rare events are more “informative” than common events.

Entropy
How much information is received when observing ‘x’?
If x and y are independent, p(x,y) = p(x)p(y).
– H(x,y) = H(x) + H(y)
– The information contained in two unrelated events is equal to the sum of their individual information.

Entropy
Binary coding of p(x): -log p(x)
– “How many bits does it take to represent a value p(x)?”
– How many “decimal” places? How many binary decimal places?
Entropy is the expected value of observed information: H(x) = -Σ p(x) log p(x).
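The expected-information formula H(x) = −Σ p(x) log2 p(x) in a short sketch:

```python
import math

def entropy(p):
    """Entropy in bits: the expected value of the information -log2 p(x)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))                # fair coin: 1.0 bit
print(entropy([0.25, 0.25, 0.25, 0.25])) # uniform over 4 outcomes: 2.0 bits
print(entropy([0.9, 0.1]) < 1.0)         # a skewed coin is less uncertain: True
```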

Examples of Entropy
Uniform distributions have higher entropy than peaked distributions.

Maximum Entropy
Logistic regression is also known as maximum entropy.
The entropy objective is concave, so the optimization converges.
Constrain this optimization to enforce good classification:
– Maximize the likelihood of the data while making the distribution of weights as even as possible.
– Include as many useful features as possible.

Maximum Entropy with Constraints
From the Klein and Manning tutorial.

Optimization Formulation
If we let the weights represent likelihoods of values for each feature:
– For each feature i, constrain the model’s expected value of the feature to match its empirical value.

Solving the MaxEnt Formulation
Convex optimization with a concave objective function and linear constraints (one per feature i).
Solve with Lagrange multipliers.
The dual representation is the maximum likelihood estimation of logistic regression.

Summary
Bayesian regularization
– Introducing a prior over parameters serves to constrain weights.
Logistic regression
– Log odds used to construct a linear model.
– Formulation with Gaussian class-conditionals.
– Discriminative training.
– Gradient descent.
Entropy
– Logistic regression as maximum entropy.

Next Time
Graphical Models
Read Chapter 8.1,