Machine Learning CUNY Graduate Center Lecture 6: Linear Regression II.

Presentation transcript:

Machine Learning CUNY Graduate Center Lecture 6: Linear Regression II

Extension to polynomial regression

Polynomial regression is the same as linear regression in D dimensions

Generate new features Standard polynomial with coefficients w: f(x_i) = \sum_{d=0}^{D} w_d x_i^d. Risk (empirical squared error): R = \frac{1}{2} \sum_{n} (t_n - f(x_n))^2.

Generate new features Feature trick: to fit a degree-D polynomial, create a D-element vector of powers of x_i, then run standard linear regression in D dimensions.
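
As an illustration of this feature trick, here is a minimal NumPy sketch (not from the slides; the data, degree, and function names are hypothetical):

import numpy as np

def polynomial_features(x, degree):
    # Map each scalar x_i to the vector (1, x_i, x_i^2, ..., x_i^degree).
    return np.vander(x, degree + 1, increasing=True)

# Hypothetical 1-D data: a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

Phi = polynomial_features(x, degree=3)          # expanded feature vectors
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)     # standard linear regression on Phi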

How is this still linear regression? The regression is linear in the parameters, despite projecting x_i from one dimension to D dimensions. Now we fit a plane (or hyperplane) to a representation of x_i in a higher-dimensional feature space. This generalizes to any set of functions φ(x_i).

Basis functions as feature extraction These functions are called basis functions. –They define the bases of the feature space. –They allow a linear decomposition of many types of functions of the data points. Common choices: –Polynomial –Gaussian –Sigmoids –Wave functions (sine, etc.)
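
For example, a Gaussian basis expansion can replace the polynomial one above (a sketch; the centers and width are illustrative design choices, not values from the lecture):

import numpy as np

def gaussian_basis(x, centers, width):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 * width^2)) for each center mu_j.
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

x = np.linspace(0, 1, 50)
centers = np.linspace(0, 1, 9)                  # evenly spaced basis centers
Phi = gaussian_basis(x, centers, width=0.1)     # 50 x 9 design matrix
# Phi plugs into the same linear regression as the polynomial features.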

Training data vs. Testing Data Evaluating the performance of a classifier on training data is meaningless. With enough parameters, a model can simply memorize (encode) every training point. To evaluate performance, data is divided into training and testing (or evaluation) data. –Training data is used to learn model parameters. –Testing data is used to evaluate performance.
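
A minimal sketch of such a split (the 80/20 ratio, data, and variable names are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)

idx = rng.permutation(len(x))                   # shuffle before splitting
n_train = int(0.8 * len(x))
x_train, t_train = x[idx[:n_train]], t[idx[:n_train]]   # learn parameters here
x_test,  t_test  = x[idx[n_train:]], t[idx[n_train:]]   # report performance here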

Overfitting

Overfitting performance

Definition of overfitting When the model describes the noise rather than the signal. How can you tell the difference between overfitting and a bad model?

Possible detection of overfitting Stability –An appropriately fit model is stable under different samples of the training data. –An overfit model generates inconsistent performance. Performance –A good model has low test error. –A bad model has high test error.

What is the optimal model size? The best model size is the one that generalizes best to unseen data. Approximate this by testing error. One way to optimize parameters is to minimize testing error. –This uses the testing data as tuning or development data, sacrificing training data in favor of parameter optimization. Can we do this without explicit evaluation data?
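
One way to make this concrete: sweep over model sizes and keep the one with the lowest held-out error. The sketch below reuses the polynomial features from earlier with hypothetical data and split sizes:

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 40)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(40)

idx = rng.permutation(40)
train, dev = idx[:30], idx[30:]                 # held-out points act as tuning data

def dev_error(degree):
    Phi_tr = np.vander(x[train], degree + 1, increasing=True)
    Phi_dev = np.vander(x[dev], degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi_tr, t[train], rcond=None)
    return np.mean((Phi_dev @ w - t[dev]) ** 2)  # held-out squared error

best_degree = min(range(1, 10), key=dev_error)   # smallest dev error wins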

Context for linear regression –Simple approach –Efficient learning –Extensible –Regularization provides robust models

Linear Regression Identify the best parameters, w, for a regression function.
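
The slide's equation is not reproduced in the transcript; in the usual notation (design matrix \Phi with rows \phi(x_n)^\top and target vector \mathbf{t}, both assumed here), minimizing the squared-error risk gives

w^{*} = \arg\min_{w} \; \frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^\top \phi(x_n) \right)^2 = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{t}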

Overfitting Recall: overfitting happens when a model is capturing idiosyncrasies of the data rather than generalities. –Often caused by too many parameters relative to the amount of training data. –E.g., an order-N polynomial can pass exactly through any N+1 data points.

Dealing with Overfitting –Use more data –Use a tuning set –Regularization –Be a Bayesian

Regularization In a linear regression model, overfitting is characterized by large weights.

Penalize large weights Introduce a penalty term in the loss function: regularized regression (L2-regularization, or ridge regression).
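
The regularized loss the slide refers to (reconstructed here in the standard ridge-regression form, with λ the regularization weight) is

E(w) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - w^\top \phi(x_n) \right)^2 + \frac{\lambda}{2} \, \| w \|_2^2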

Regularization Derivation
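
The derivation itself is not in the transcript; one standard version sets the gradient of the penalized loss above to zero:

\nabla_w E(w) = -\Phi^\top (\mathbf{t} - \Phi w) + \lambda w = 0 \quad\Rightarrow\quad w^{*} = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top \mathbf{t}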

Regularization in Practice
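
A sketch of what this looks like in code, using the closed form above (the data and λ value are illustrative assumptions):

import numpy as np

def ridge_fit(Phi, t, lam):
    # Closed-form ridge solution: w = (Phi^T Phi + lam * I)^-1 Phi^T t
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ t)

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 15)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(15)
Phi = np.vander(x, 10, increasing=True)          # 9th-order polynomial features

w_unreg = np.linalg.lstsq(Phi, t, rcond=None)[0]
w_ridge = ridge_fit(Phi, t, lam=1e-3)
# The penalty shrinks the weights: np.abs(w_ridge) is typically far smaller than np.abs(w_unreg).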

Regularization Results

More regularization The penalty term defines the style of regularization: –L2-regularization –L1-regularization –L0-regularization (minimizing the L0-norm selects the optimal subset of features)
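
Written out, the three penalty terms are (with \mathbf{1}[\cdot] the indicator function):

\Omega_{L2}(w) = \lambda \sum_d w_d^2, \qquad \Omega_{L1}(w) = \lambda \sum_d |w_d|, \qquad \Omega_{L0}(w) = \lambda \sum_d \mathbf{1}[w_d \neq 0]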

Curse of dimensionality Increasing the dimensionality of the features increases the data requirements exponentially. For example, if a single feature can be accurately modeled with 100 data points, optimizing the joint over two features requires 100 × 100 = 10,000 data points. Models should be small relative to the amount of available data. Dimensionality reduction techniques – feature selection – can help. –L0-regularization is explicit feature selection. –L1- and L2-regularization approximate feature selection.

Bayesians v. Frequentists What is a probability? Frequentists –A probability is the long-run frequency with which an event happens. –It is approximated by the ratio of the number of times the event occurred to the total number of trials. –Assessment is vital to selecting a model. –Point estimates are absolutely fine. Bayesians –A probability is a degree of belief in a proposition. –Bayesians require that probabilities be prior beliefs conditioned on data. –The Bayesian approach “is optimal”, given a good model, a good prior, and a good loss function. Don’t worry so much about assessment. –If you are ever making a point estimate, you’ve made a mistake. The only valid probabilities are posteriors based on evidence given some prior.

Bayesian Linear Regression The previous MLE derivation of linear regression uses point estimates for the weight vector, w. Bayesians say, “hold it right there”. –Use a prior distribution over w to estimate the parameters. Alpha is a hyperparameter of the prior over w: the precision, or inverse variance, of the distribution. Now optimize the posterior:
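
In the standard (Bishop-style) notation, with β the assumed noise precision, the prior and the posterior being optimized are

p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I), \qquad p(w \mid \mathbf{t}, \alpha, \beta) \propto \Big[ \prod_{n=1}^{N} \mathcal{N}\big( t_n \mid w^\top \phi(x_n), \beta^{-1} \big) \Big] \, \mathcal{N}(w \mid 0, \alpha^{-1} I)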

Optimize the Bayesian posterior As usual, it’s easier to optimize after a log transform.

Optimize the Bayesian posterior Ignoring terms that do not depend on w, this is an IDENTICAL formulation to L2-regularization.
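
Concretely, the negative log posterior (dropping constants) is

-\ln p(w \mid \mathbf{t}) = \frac{\beta}{2} \sum_{n=1}^{N} \big( t_n - w^\top \phi(x_n) \big)^2 + \frac{\alpha}{2} w^\top w + \text{const},

which is the L2-regularized (ridge) objective with λ = α / β.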

Context Overfitting is bad. Bayesians vs. Frequentists –Is one better? –Machine Learning uses techniques from both camps.

Next Time Logistic Regression