Logistic Regression Geoff Hulten

Overview of Logistic Regression
A linear model for classification and probability estimation. Can be very effective when:
- The problem is linearly separable
- Or there are a lot of relevant features (10s - 100s of thousands can work)
- You need something simple and efficient as a baseline
- You need efficient runtime
Logistic regression will generally not be the most accurate option.

Components of the Learning Algorithm: Logistic Regression
- Model Structure – Linear model with sigmoid activation
- Loss Function – Log Loss
- Optimization Method – Gradient Descent

Structure of Logistic Regression
Linear model (a bias weight $w_0$ plus one weight per feature):
$$w_0 + \sum_{i=1}^{n} w_i \cdot x_i$$
The prediction passes the linear model through the sigmoid:
$$\hat{y} = \mathrm{sigmoid}\left(w_0 + \sum_{i=1}^{n} w_i \cdot x_i\right), \qquad \mathrm{sigmoid}(z) = \frac{1}{1 + e^{-z}}$$
Predict 1 when $\hat{y}$ is above the threshold, predict 0 otherwise (the slide illustrates thresholds of .5 and .9).
Example: $w_0 = .25$, $w_1 = -1$, $w_2 = 1$, applied to features $x_1$, $x_2$ (plus the constant input 1 for the bias).
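A minimal sketch of this structure in Python. The helper names, the example feature vector, and the 0.5 default threshold are illustrative assumptions, not from the slides:

```python
import math

def sigmoid(z):
    # sigmoid(z) = 1 / (1 + e^-z): squashes the linear score into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(weights, x):
    # weights[0] is the bias w0; weights[1:] pair with the features in x
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return sigmoid(z)

def predict(weights, x, threshold=0.5):
    # Threshold the probability to get a 0/1 classification
    return 1 if predict_probability(weights, x) >= threshold else 0

# The slide's example weights w0 = .25, w1 = -1, w2 = 1 with a sample input
print(predict_probability([0.25, -1, 1], [1, 1]))   # sigmoid(0.25) ≈ 0.56
```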

Intuition about Additional Dimensions
- In 3 dimensions the decision surface is a plane; in N dimensions it is an n-dimensional hyper-plane.
- High dimensions are weird: high-dimensional hyper-planes can represent quite a lot.
[Figure: the decision boundary for the example model $w_0 = .25$, $w_1 = -1$, $w_2 = 1$, with "Predict 1" and "Predict 0" regions and boundaries for a higher and a lower threshold.]

Loss Function: Log Loss
$\hat{y}$ -- the predicted $y$ (pre-threshold)
Log Loss:
- If $y$ is 1: $-\log(\hat{y})$
- If $y$ is 0: $-\log(1 - \hat{y})$
Examples:
  $\hat{y} = .9$,  $y = 1$  →  Loss = .105
  $\hat{y} = .5$,  $y = 1$  →  Loss = .693
  $\hat{y} = .1$,  $y = 1$  →  Loss = 2.3
  $\hat{y} = .95$, $y = 0$  →  Loss = 2.99
Use natural log (base e).
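A small Python sketch of the per-example loss (natural log), reproducing a few of the values above; the function name is an illustrative assumption:

```python
import math

def log_loss(y_hat, y):
    # -log(y_hat) when the label is 1, -log(1 - y_hat) when the label is 0
    return -math.log(y_hat) if y == 1 else -math.log(1.0 - y_hat)

print(round(log_loss(0.9, 1), 3))    # 0.105
print(round(log_loss(0.5, 1), 3))    # 0.693
print(round(log_loss(0.95, 0), 3))   # 2.996
```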

Logistic Regression Loss Function Summary
Log Loss:
$$Loss(\hat{y}, y) = -\log(\hat{y}) \quad \text{if } y = 1$$
$$Loss(\hat{y}, y) = -\log(1 - \hat{y}) \quad \text{if } y = 0$$
The same thing expressed in sneaky math:
$$Loss(\hat{y}, y) = -y \cdot \log(\hat{y}) - (1 - y) \cdot \log(1 - \hat{y})$$
Average across the data set:
$$Loss(dataSet) = \frac{1}{n} \sum_{j=1}^{n} Loss(\hat{y}_j, y_j)$$
$\hat{y}$ is pre-thresholding. Use natural log (base e).
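A sketch of the averaged "sneaky math" form in Python; the example predictions and labels are illustrative assumptions, not values from the slides:

```python
import math

def log_loss(y_hat, y):
    # Single expression that covers both y = 0 and y = 1
    return -y * math.log(y_hat) - (1 - y) * math.log(1.0 - y_hat)

def dataset_loss(y_hats, ys):
    # Average the per-example log loss across the data set
    return sum(log_loss(y_hat, y) for y_hat, y in zip(y_hats, ys)) / len(ys)

print(dataset_loss([0.9, 0.5, 0.2], [1, 1, 0]))   # ≈ 0.34
```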

Logistic Regression Optimization: Gradient Descent
[Figure: a training set with the decision boundary of the 'initial' model ($w_0 = .25$, $w_1 = -1$) and of the updated model ($w_0 = .1$, $w_1 = -1.6$), each separating a "Predict 1" region from a "Predict 0" region.]

Finding the Gradient
The gradient is the derivative of the loss function with respect to the model parameters $\theta$ (all the $w$'s), with one partial derivative per weight.
Gradient for $w_i$ for training sample $x$ (calculus you don't need to remember):
$$\frac{dLoss(x)}{d\theta} \rightarrow \frac{\partial Loss(x)}{\partial w_i} = \dots = (\hat{y} - y) \cdot x_i$$
Average across the training data set (compute simultaneously for all $w_i$ with one pass over the data):
$$\frac{\partial Loss(TrainSet)}{\partial w_i} = \frac{1}{n} \sum_{j=0}^{n-1} (\hat{y}_j - y_j) \cdot x_{ji}$$
Update each weight by stepping away from the gradient:
$$w_i = w_i - \alpha \cdot \frac{\partial Loss(TrainSet)}{\partial w_i}$$
Note: $x_0 = 1.0$ for all samples.
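A sketch of the gradient computation under the slide's convention that each feature vector already has $x_0 = 1.0$ prepended; the function names and list-based representation are illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(weights, X, Y):
    # X rows include x_0 = 1.0, so weights[0] acts as the bias.
    # Partial derivative for w_i: average of (y_hat - y) * x_i over the training set.
    n = len(X)
    grads = [0.0] * len(weights)
    for x, y in zip(X, Y):
        y_hat = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        for i, xi in enumerate(x):
            grads[i] += (y_hat - y) * xi
    return [g / n for g in grads]

def gradient_step(weights, X, Y, alpha):
    # Step each weight opposite its gradient by the learning rate alpha
    grads = gradient(weights, X, Y)
    return [w - alpha * g for w, g in zip(weights, grads)]
```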

Logistic Regression Optimization Algorithm
- Initialize model weights to 0
- Do numIterations steps of gradient descent (thousands of steps):
  - Find the gradient for each weight by averaging across the training set
  - Update each weight by taking a step of size $\alpha$ opposite the gradient
Parameters:
- $\alpha$ – size of the step to take in each iteration
- numIterations – number of iterations of gradient descent to perform (or use a convergence criterion…)
- Threshold – value between 0 and 1 to convert $\hat{y}$ into a classification
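Putting the pieces together, a minimal training loop matching the algorithm above. It reuses the sigmoid and gradient_step helpers sketched earlier; the tiny data set and the parameter values are illustrative assumptions:

```python
def fit_logistic_regression(X, Y, alpha=0.1, num_iterations=1000):
    # X rows include the constant feature x_0 = 1.0; weights start at 0
    weights = [0.0] * len(X[0])
    for _ in range(num_iterations):
        weights = gradient_step(weights, X, Y, alpha)   # one gradient-descent step
    return weights

def classify(weights, x, threshold=0.5):
    # Convert the predicted probability y_hat into a 0/1 classification
    y_hat = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    return 1 if y_hat >= threshold else 0

# Tiny linearly separable example (feature vectors already include x_0 = 1.0)
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [0, 0, 1, 1]
w = fit_logistic_regression(X, Y, alpha=0.5, num_iterations=2000)
print([classify(w, x) for x in X])   # expected: [0, 0, 1, 1]
```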