CSE 446: Logistic Regression and Perceptron Learning. Winter 2012. Dan Weld. Some slides from Carlos Guestrin and Luke Zettlemoyer.

Machine Learning: Supervised Learning (Labeled Examples → Function), Reinforcement Learning (Experience + Rewards → Policy), Unsupervised Learning (Unlabeled Examples → Clusters).

Machine Learning: Supervised Learning (Labeled Examples → Function) comes in two flavors, non-parametric and parametric.

Machine Learning: Parametric supervised learning (Labeled Examples → Function), split by whether Y is discrete or continuous. Y continuous: Gaussians (learned in closed form); linear functions (1. learned in closed form, 2. using gradient descent).

Machine Learning: Parametric supervised learning (Labeled Examples → Function). Y continuous: Gaussians (learned in closed form); linear functions (1. learned in closed form, 2. using gradient descent). Y discrete: decision trees (greedy search; pruning); probability of class | features.

Probabilistic Inference (Making Predictions). After learning (and also during learning) we need to determine probabilities of hypotheses and use these hypotheses to make predictions. Three approaches: MLE, MAP, Bayesian inference.

Which Coin is Dan Using? Three coins: C1 with P(H|C1) = 0.1, C2 with P(H|C2) = 0.5, C3 with P(H|C3) = 0.9.

Prior Probabilities. The same three coins, P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9, now with priors P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70.

Forms of Inference. Same coins and priors as above. Evidence: Heads, then Tails. Posteriors: P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485. Predict the next toss? Bayesian inference: 68% heads. MAP: 90% heads.
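The posteriors and predictions above follow directly from Bayes rule. Here is a minimal Python sketch of that computation (the priors and coin biases are the ones on the slide; everything else is plain arithmetic):

```python
priors = {"C1": 0.05, "C2": 0.25, "C3": 0.70}
p_heads = {"C1": 0.1, "C2": 0.5, "C3": 0.9}

def posterior(evidence):
    """Posterior over coins given a flip sequence like 'HT' (heads, then tails)."""
    unnorm = {}
    for c in priors:
        likelihood = 1.0
        for flip in evidence:
            likelihood *= p_heads[c] if flip == "H" else (1.0 - p_heads[c])
        unnorm[c] = priors[c] * likelihood
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

post = posterior("HT")          # approx {'C1': 0.035, 'C2': 0.481, 'C3': 0.485}

# MAP prediction: commit to the most probable coin, then use its bias.
map_coin = max(post, key=post.get)
map_pred = p_heads[map_coin]    # 0.9 -> 90% heads

# Bayesian prediction: average the coin biases, weighted by the posterior.
bayes_pred = sum(post[c] * p_heads[c] for c in post)   # about 0.68 -> 68% heads
print(post, map_pred, bayes_pred)
```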

Making predictions after (and during) learning from a distribution of hypotheses. Maximum Likelihood Estimate: uniform prior, predict with the most likely hypothesis. Maximum A Posteriori Estimate: any prior, predict with the most likely hypothesis. Bayesian Estimate: any prior, predict with a weighted combination of hypotheses.

Themes: Learning as function approximation (what counts as a good approximation?). Learning as optimization: minimize loss over the training data (though what we really care about is test data), where Loss(h, data) = error(h, data) + complexity(h).

Machine Learning: Parametric supervised learning (Labeled Examples → Function). Y continuous: Gaussians (learned in closed form); linear functions (1. learned in closed form, 2. using gradient descent). Y discrete: decision trees (greedy search; pruning); probability of class | features: 1. learn P(Y) and P(X|Y) in closed form, then apply Bayes rule; 2. learn P(Y|X) with gradient descent.

Generative vs. Discriminative Classifiers. Want to learn h: X → Y, where X are the features and Y the target classes. Generative classifier, e.g., Naïve Bayes: assume some functional form for P(X|Y) and P(Y); estimate the parameters of P(X|Y) and P(Y) directly from training data; use Bayes rule to calculate P(Y|X=x), since P(Y|X) ∝ P(X|Y) P(Y). This is a 'generative' model: the computation of P(Y|X) is indirect, through Bayes rule, and as a result it can also generate a sample of the data, P(X) = Σ_y P(y) P(X|y). Discriminative classifiers, e.g., Logistic Regression: assume some functional form for P(Y|X) and estimate its parameters directly from training data. This is the 'discriminative' model: it directly learns P(Y|X), but cannot produce a sample of the data, because P(X) is not available. (Are decision trees generative or discriminative?)

Discriminative Classification. What functional form should we use for P(Y|X)? [Plot: P(edible|X) = 1 on one side of a decision boundary, P(edible|X) = 0 on the other.] Want something smooth…

Logistic Function in n Dimensions. Sigmoid applied to a linear function of the data: P(Y=1|X,w) = exp(w_0 + Σ_i w_i X_i) / (1 + exp(w_0 + Σ_i w_i X_i)). Features can be discrete or continuous!
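A small sketch of this model (the weight values in the example call are made up for illustration):

```python
import numpy as np

def p_y1_given_x(x, w, w0):
    """P(Y=1 | x, w) = exp(w0 + w.x) / (1 + exp(w0 + w.x)): the sigmoid of a linear
    function of the data; features may be discrete (e.g., 0/1) or continuous."""
    z = w0 + np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))   # algebraically the same as exp(z) / (1 + exp(z))

# Hypothetical weights for two features plus an intercept.
print(p_y1_given_x(np.array([1.0, 2.0]), w=np.array([0.5, -0.25]), w0=0.0))
```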

Understanding Sigmoids. [Plots of the sigmoid for (w_0 = 0, w_1 = -1), (w_0 = -2, w_1 = -1), and (w_0 = 0, w_1 = -0.5).]

Very convenient! The log odds ln[P(Y=1|X,w) / P(Y=0|X,w)] = w_0 + Σ_i w_i X_i, which implies a linear classification rule: predict Y=1 exactly when w_0 + Σ_i w_i X_i > 0. ©Carlos Guestrin

Logistic regression more generally. In the more general case, where Y ∈ {y_1, …, y_R}: P(Y=y_k|X,w) = exp(w_k0 + Σ_i w_ki X_i) / (1 + Σ_{j=1}^{R-1} exp(w_j0 + Σ_i w_ji X_i)) for k < R, and P(Y=y_R|X,w) = 1 / (1 + Σ_{j=1}^{R-1} exp(w_j0 + Σ_i w_ji X_i)) for k = R (the normalization, so no weights for this class). Features can be discrete or continuous! ©Carlos Guestrin
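A sketch of this multiclass formulation, with class y_R acting purely as the normalizer (the weight matrix in the example is made up):

```python
import numpy as np

def multiclass_lr_probs(x, W, w0):
    """P(Y = y_k | x) for k = 1..R.  W is (R-1, n) and w0 is (R-1,): there are no
    weights for the last class y_R, which just normalizes the distribution."""
    scores = np.exp(w0 + W @ x)              # exp(w_k0 + sum_i w_ki x_i) for k < R
    z = 1.0 + scores.sum()                   # the "1 +" plays the role of class R
    return np.append(scores / z, 1.0 / z)    # probabilities for y_1..y_{R-1}, then y_R

probs = multiclass_lr_probs(np.array([1.0, -2.0]),
                            W=np.array([[0.3, 0.1], [-0.2, 0.4]]),
                            w0=np.array([0.0, 0.5]))
print(probs, probs.sum())                    # the probabilities sum to 1
```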

Loss Functions ("Error" Term): Likelihood vs. Conditional Likelihood. Generative (Naïve Bayes) loss function: the data likelihood, ln P(D|w) = Σ_j ln P(x^j, y^j | w). Discriminative (Logistic Regression) loss function: the conditional data likelihood, Σ_j ln P(y^j | x^j, w). Discriminative models can't compute P(x^j|w)! Or, put differently, "they don't waste effort learning P(X)": they focus only on P(Y|X), which is all that matters for classification. ©Carlos Guestrin

Expressing Conditional Log Likelihood. For a single example: ln P(Y=0|X,w) = -ln(1 + exp(w_0 + Σ_i w_i X_i)), the probability of predicting 0, and ln P(Y=1|X,w) = (w_0 + Σ_i w_i X_i) - ln(1 + exp(w_0 + Σ_i w_i X_i)), the probability of predicting 1. In the sum over examples, the Y=1 term is switched on when the correct answer is 1 and the Y=0 term when the correct answer is 0. ©Carlos Guestrin

Expressing Conditional Log Likelihood (continued). Putting the two cases together, l(w) = Σ_j [ y^j (w_0 + Σ_i w_i x_i^j) - ln(1 + exp(w_0 + Σ_i w_i x_i^j)) ].

Maximizing Conditional Log Likelihood. Bad news: there is no closed-form solution that maximizes l(w). Good news: l(w) is a concave function of w, so there are no suboptimal local optima, and concave functions are easy to optimize. ©Carlos Guestrin

Optimizing a concave function: gradient ascent. The conditional likelihood for logistic regression is concave! Gradient ascent is the simplest of the optimization approaches (e.g., conjugate gradient ascent is much better; see the reading). Gradient: ∇_w l(w) = [∂l(w)/∂w_0, …, ∂l(w)/∂w_n]. Update rule: w ← w + η ∇_w l(w), with learning rate η > 0.

Maximize Conditional Log Likelihood: Gradient Ascent. The partial derivatives have a simple form: ∂l(w)/∂w_i = Σ_j x_i^j [ y^j - P(Y=1|x^j, w) ], i.e., feature value times prediction error, summed over the training examples.

Gradient Ascent for LR. Gradient ascent algorithm (learning rate η > 0): do: w_0 ← w_0 + η Σ_j [ y^j - P(Y=1|x^j, w) ]; for i = 1 to n (iterate over weights): w_i ← w_i + η Σ_j x_i^j [ y^j - P(Y=1|x^j, w) ]; until "change" < ε. Note the loop over training examples inside each update!
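A compact NumPy sketch of this loop (batch gradient ascent with a fixed number of iterations instead of a convergence test; the optional lam term implements the MCAP/L2 regularization discussed on the next slides; the toy data is hypothetical):

```python
import numpy as np

def logistic_regression_ga(X, y, eta=0.1, lam=0.0, iters=1000):
    """Gradient ascent on the conditional log likelihood.
    X: (m, n) data matrix, y: (m,) labels in {0, 1}."""
    m, n = X.shape
    w, w0 = np.zeros(n), 0.0
    for _ in range(iters):
        p1 = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))   # P(Y=1 | x_j, w) for every example j
        err = y - p1                               # y_j - P(Y=1 | x_j, w)
        w0 += eta * err.sum()                      # intercept update (its "feature" is 1)
        w  += eta * (X.T @ err - lam * w)          # sum_j x_ij * err_j, minus the L2 term
    return w0, w

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # one feature, four examples
y = np.array([0, 0, 1, 1])
print(logistic_regression_ga(X, y))
```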

Large parameters… The maximum likelihood solution prefers higher weights: higher likelihood of (properly classified) examples close to the decision boundary, and larger influence of the corresponding features on the decision, which can cause overfitting!!! Regularization: penalize high weights. [Plots for a = 1, a = 5, a = 10.]

That's all for MCLE (maximum conditional likelihood estimation). How about MCAP (maximum conditional a posteriori)? One common approach is to define priors on w, e.g., a Normal distribution with zero mean and identity covariance, which "pushes" the parameters towards zero. This is often called regularization, and it helps avoid very large weights and overfitting. MAP estimate: w* = arg max_w ln [ P(w) Π_j P(y^j|x^j, w) ].

MCAP as Regularization. Penalizes high weights, like we did in linear regression. Add ln p(w) to the objective: with a zero-mean Gaussian prior this is a quadratic penalty, -λ/2 Σ_i w_i^2 (plus a constant), which drives weights towards zero and adds a negative linear term, -λ w_i, to the gradients.

MLE vs. MAP. Maximum conditional likelihood estimate: w* = arg max_w Σ_j ln P(y^j|x^j, w). Maximum conditional a posteriori estimate: w* = arg max_w [ ln P(w) + Σ_j ln P(y^j|x^j, w) ].

Remember Gaussian Naïve Bayes? P(X_i = x | Y = y_k) = N(μ_ik, σ_ik), where N(μ_ik, σ_ik) is the Gaussian density (1 / (σ_ik √(2π))) exp(-(x - μ_ik)^2 / (2 σ_ik^2)), and P(Y|X) ∝ P(X|Y) P(Y). Sometimes we assume the variance is independent of Y (i.e., σ_i), or independent of X_i (i.e., σ_k), or both (i.e., σ).

Logistic Regression vs. Naïve Bayes. Consider learning f: X → Y, where X is a vector of real-valued features and Y is boolean. We could use a Gaussian Naïve Bayes classifier: assume all X_i are conditionally independent given Y, model P(X_i | Y = y_k) as a Gaussian N(μ_ik, σ_i), and model P(Y) as Bernoulli(π, 1-π). What does that imply about the form of P(Y|X)? Cool!!!!

Derive the form of P(Y|X) for continuous X_i. (Up to now it is all arithmetic that holds for any Naïve Bayes model.) Can we solve for the w_i? Yes, but only in the Gaussian case. And one term looks like a setting for w_0?

Ratio of class-conditional probabilities: for Gaussian X_i with class-independent variance, the log ratio is a linear function of X_i, with coefficients expressed in terms of the original Gaussian parameters!

Derive the form of P(Y|X) for continuous X_i (continued).
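The derivation itself appeared only as images on the original slides; the following is a standard reconstruction under the stated assumptions (Gaussian X_i with class-independent variances σ_i, prior P(Y=1) = π), written out in LaTeX:

```latex
% Sketch of the GNB -> logistic-form derivation under the stated assumptions.
\begin{align*}
P(Y=1 \mid X)
  &= \frac{1}{1 + \dfrac{P(Y=0)\,P(X \mid Y=0)}{P(Y=1)\,P(X \mid Y=1)}}
   = \frac{1}{1 + \exp\!\Big(\ln\tfrac{1-\pi}{\pi}
       + \sum_i \ln\tfrac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}\Big)} \\[4pt]
\ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}
  &= \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^{2}}\,X_i
     + \frac{\mu_{i1}^{2}-\mu_{i0}^{2}}{2\sigma_i^{2}}
     \qquad \text{(linear in } X_i\text{)} \\[4pt]
\Rightarrow\quad P(Y=1 \mid X)
  &= \frac{1}{1 + \exp\!\big(w_0 + \textstyle\sum_i w_i X_i\big)},
  \qquad
  w_i = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^{2}},\;\;
  w_0 = \ln\frac{1-\pi}{\pi} + \sum_i \frac{\mu_{i1}^{2}-\mu_{i0}^{2}}{2\sigma_i^{2}}.
\end{align*}
```

Up to a sign flip on the weights, this is exactly the sigmoid-of-a-linear-function form that logistic regression assumes directly.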

Gaussian Naïve Bayes vs. Logistic Regression. Representation equivalence, but only in a special case!!! (GNB with class-independent variances.) But what's the difference??? LR makes no assumptions about P(X|Y) in learning!!! The loss function!!! They optimize different functions → they obtain different solutions. [Diagram relating the set of Gaussian Naïve Bayes parameters (feature variance independent of class label) to the set of logistic regression parameters; can go both ways, we only did one way.]

Naïve Bayes vs. Logistic Regression. Consider Y boolean and X = ⟨X_1, …, X_n⟩ continuous. Number of parameters: Naïve Bayes: 4n + 1 (two means and two variances per feature, plus the class prior); Logistic Regression: n + 1. Estimation method: Naïve Bayes parameter estimates are uncoupled; Logistic Regression parameter estimates are coupled.

Naïve Bayes vs. Logistic Regression: Generative vs. Discriminative classifiers. Asymptotic comparison (# training examples → infinity): when the model is correct, GNB (with class-independent variances) and LR produce identical classifiers; when the model is incorrect, LR is less biased (it does not assume conditional independence), and therefore LR is expected to outperform GNB. [Ng & Jordan, 2002]

Naïve Bayes vs. Logistic Regression: Generative vs. Discriminative classifiers. Non-asymptotic analysis: the convergence rate of the parameter estimates (n = # of attributes in X), i.e., the size of training data needed to get close to the infinite-data solution. Naïve Bayes needs O(log n) samples; Logistic Regression needs O(n) samples. GNB converges more quickly to its (perhaps less helpful) asymptotic estimates. [Ng & Jordan, 2002]

Some experiments from UCI data sets. [Plots comparing Naïve Bayes and Logistic Regression across several UCI data sets.] ©Carlos Guestrin

What you should know about Logistic Regression (LR). Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR; the solutions differ because of the objective (loss) function. In general, NB and LR make different assumptions: NB assumes the features are independent given the class → an assumption on P(X|Y); LR assumes a functional form of P(Y|X) and makes no assumption on P(X|Y). LR is a linear classifier: its decision rule is a hyperplane. LR is optimized by conditional likelihood: no closed-form solution, but the objective is concave → a global optimum with gradient ascent; maximum conditional a posteriori corresponds to regularization. Convergence rates: GNB (usually) needs less data; LR (usually) gets to better solutions in the limit.

Perceptrons

Who needs probabilities? Previously we modeled data with distributions: joint, P(X,Y), e.g., Naïve Bayes; or conditional, P(Y|X), e.g., Logistic Regression. But wait, why probabilities? Let's try to be error-driven!

Generative vs. Discriminative Generative classifiers: – E.g. naïve Bayes – A joint probability model with evidence variables – Query model for causes given evidence Discriminative classifiers: – No generative model, no Bayes rule, often no probabilities at all! – Try to predict the label Y directly from X – Robust, accurate with varied features – Loosely: mistake driven rather than model driven

Linear Classifiers. Inputs are feature values; each feature has a weight; the sum is the activation: activation(x) = Σ_i w_i f_i(x) = w · f(x). If the activation is positive, output class 1; if negative, output class 2. [Diagram: features f_1, f_2, f_3 weighted by w_1, w_2, w_3, summed (Σ), then tested against > 0.]

Example: Spam. Imagine 3 features (spam is the "positive" class): free (number of occurrences of "free"), money (occurrences of "money"), and BIAS (intercept, always has value 1). With weights w = (BIAS: -3, free: 4, money: 2) and input "free money", the feature vector is f(x) = (BIAS: 1, free: 1, money: 1), so w · f(x) = -3 + 4 + 2 = 3 > 0 → SPAM!!!
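The same calculation as a tiny Python sketch (weights and features are exactly the ones on the slide):

```python
w = {"BIAS": -3, "free": 4, "money": 2}      # weight vector from the slide

def features(text):
    """Bag-of-words style features: counts of 'free' and 'money', plus a constant BIAS."""
    f = {"BIAS": 1, "free": 0, "money": 0}
    for word in text.split():
        if word in f:
            f[word] += 1
    return f

def activation(w, f):
    return sum(w[k] * f.get(k, 0) for k in w)   # w . f(x)

print(activation(w, features("free money")))    # (-3)(1) + (4)(1) + (2)(1) = 3 > 0 -> SPAM
```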

Binary Decision Rule. In the space of feature vectors, examples are points and any weight vector is a hyperplane; one side corresponds to Y = +1, the other to Y = -1. [Plot: the decision boundary in (free, money) space for the weights BIAS: -3, free: 4, money: 2, with +1 = SPAM on one side and -1 = HAM on the other.]

Binary Perceptron Algorithm. Start with zero weights. For each training instance (x, y*): classify with the current weights, ŷ = +1 if w · f(x) > 0, else -1. If correct (i.e., ŷ = y*), no change! If wrong, update: w ← w + y* f(x), i.e., add the feature vector when the correct class is +1 and subtract it when the correct class is -1.
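A minimal sketch of this training loop, assuming labels y* in {+1, -1} and feature vectors as arrays (the toy data is hypothetical):

```python
import numpy as np

def perceptron_train(data, n_features, passes=10):
    """Binary perceptron: start with zero weights, update only on mistakes."""
    w = np.zeros(n_features)
    for _ in range(passes):
        for x, y_star in data:                       # each training instance (x, y*)
            x = np.asarray(x, dtype=float)
            y_hat = 1 if np.dot(w, x) > 0 else -1    # classify with current weights
            if y_hat != y_star:                      # if wrong: update
                w += y_star * x                      # w <- w + y* f(x)
    return w

data = [([1, 2.0, 1.0], +1), ([1, 0.5, 0.2], -1),
        ([1, 1.5, 1.2], +1), ([1, 0.1, 0.4], -1)]
print(perceptron_train(data, n_features=3))
```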

Examples: Perceptron Separable Case

Examples: Perceptron Inseparable Case

From Logistic Regression to the Perceptron: 2 easy steps! Logistic Regression (in vector notation, y ∈ {0,1}): w ← w + η Σ_j [ y^j - P(Y=1|x^j, w) ] x^j. Perceptron (y ∈ {0,1}, ŷ(x;w) is the prediction given w): w ← w + [ y - ŷ(x;w) ] x. Differences? Drop the Σ_j over training examples: online vs. batch learning. Drop the distribution: probabilistic vs. error-driven learning.

Multiclass Decision Rule. If we have more than two classes: have a weight vector w_y for each class y; calculate an activation for each class, activation_y(x) = w_y · f(x); the highest activation wins: y* = arg max_y w_y · f(x).

Example. [Three as-yet-unfilled weight vectors over the features BIAS, win, game, vote, the, and three inputs: "win the vote", "win the election", "win the game".]

Example. Weight vectors over (BIAS, win, game, vote, the): class 1 = (-2, 4, 4, 0, 0), class 2 = (1, 2, 0, 4, 0), class 3 = (2, 0, 2, 0, 0). For the input "win the vote", the feature vector is (BIAS: 1, win: 1, game: 0, vote: 1, the: 1), giving activations -2 + 4 = 2, 1 + 2 + 4 = 7, and 2 for the three classes, so class 2 wins.

The Multi-class Perceptron Algorithm. Start with zero weights. Iterate over training examples: classify with the current weights, y' = arg max_y w_y · f(x). If correct, no change! If wrong: lower the score of the wrong answer and raise the score of the right answer: w_{y'} ← w_{y'} - f(x), w_{y*} ← w_{y*} + f(x).
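A sketch of the multi-class version (the class names and the toy bag-of-words data are made up for illustration):

```python
import numpy as np

def multiclass_perceptron(data, classes, n_features, passes=10):
    """One weight vector per class; highest activation wins; update both vectors on a mistake."""
    w = {c: np.zeros(n_features) for c in classes}
    for _ in range(passes):
        for x, y_true in data:
            x = np.asarray(x, dtype=float)
            y_pred = max(classes, key=lambda c: np.dot(w[c], x))  # classify with current weights
            if y_pred != y_true:                                  # if wrong:
                w[y_pred] -= x     # lower the score of the wrong answer
                w[y_true] += x     # raise the score of the right answer
    return w

# Features are counts over [BIAS, win, game, vote, the].
data = [([1, 1, 1, 0, 1], "sports"), ([1, 1, 0, 1, 1], "politics"), ([1, 0, 1, 0, 1], "sports")]
print(multiclass_perceptron(data, ["sports", "politics"], n_features=5))
```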

Properties of Perceptrons. Separability: some setting of the parameters gets the training set perfectly correct. Convergence: if the training data is separable, the perceptron will eventually converge (binary case). Mistake bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability. [Figures: a separable and a non-separable point set.]

Problems with the Perceptron Noise: if the data isn’t separable, weights might thrash – Averaging weight vectors over time can help (averaged perceptron) Mediocre generalization: finds a “barely” separating solution Overtraining: test / held-out accuracy usually rises, then falls – Overtraining is a kind of overfitting

Linear Separators. Which of these linear separators is optimal?

Support Vector Machines. Maximizing the margin is good according to intuition, theory, and practice. Support vector machines (SVMs) find the separator with the maximum margin. [Figure: the max-margin separator (SVM).]

Three Views of Classification (more to come later in the course!). Naïve Bayes: parameters from data statistics; parameters have a probabilistic interpretation; training is one pass through the data. Logistic Regression: parameters from gradient ascent; a linear, probabilistic, discriminative model; training is one pass through the data per gradient step, with validation used to decide when to stop. The perceptron: parameters from reactions to mistakes; parameters have a discriminative interpretation; training goes through the data until held-out accuracy maxes out. [Diagram: Training Data / Held-Out Data / Test Data.]