Chapter 4: Linear Models for Classification


Chapter 4: Linear Models for Classification
Grit Hein & Susanne Leiberg

Goal
Our goal is to classify input vectors x into one of K classes. This is similar to regression, but the output variable is discrete. The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces. In linear models for classification, the decision boundaries are linear functions of the input vector x.

A classifier seeks an 'optimal' separation of classes (e.g., apples and oranges) by finding a set of weights for combining features (e.g., color and diameter). (Figure: color plotted against diameter; the weight vector defines a discriminant dimension, and the resulting linear decision boundary separates the two classes.)

Classifier: three approaches
Discriminant functions: no computation of posterior probabilities (the probability of a certain class given the data); each input x is mapped directly onto a class label. Tools: least-squares classification, Fisher's linear discriminant.
Probabilistic generative models: model the class priors p(Ck) and the class-conditional densities p(x|Ck), and use them with Bayes' theorem to compute the posterior probabilities p(Ck|x).
Probabilistic discriminative models: model the posterior probabilities directly. Tool: logistic regression.

Pros and cons of the three approaches
Discriminant functions are the simplest and most intuitive approach to classifying data, but they do not allow us to compensate for class priors (e.g., when class 1 is a very rare disease), to minimize risk (e.g., when classifying a sick person as healthy is more costly than classifying a healthy person as sick), or to implement a reject option (e.g., when a person cannot be classified as sick or healthy with sufficiently high probability). Probabilistic generative and discriminative models can do all of these.

Pros and cons of the three approaches
Generative models provide a probabilistic model of all variables, which allows new data to be synthesized; however, generating all of this information is computationally expensive and complex, and it is not needed for a simple classification decision. Discriminative models provide a probabilistic model of the target variable (the classes) conditional on the observed variables; this is usually sufficient for making a well-informed classification decision, without the disadvantages of simple discriminant functions.

Discriminant functions
No computation of posterior probabilities (the probability of a certain class given the data); each input x is mapped directly onto a class label. Tools: least-squares classification, Fisher's linear discriminant.

Discriminant functions are functions that are optimized to assign an input x to one of K classes:
y(x) = w^T x + w0
The weight vector w determines the orientation of the decision boundary; the bias w0 determines its location. (Figure: two decision regions in feature space, separated by the linear decision boundary.)
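
To make the decision rule concrete, here is a minimal NumPy sketch of a two-class linear discriminant; the weight vector and bias are arbitrary illustrative values, not taken from the slides.

```python
import numpy as np

# Hypothetical weight vector and bias for a two-class linear discriminant
# y(x) = w^T x + w0: w sets the orientation of the decision boundary,
# w0 sets its location.
w = np.array([1.5, -0.8])
w0 = 0.2

def classify(x):
    """Assign x to class C1 if y(x) >= 0, otherwise to class C2."""
    y = w @ x + w0
    return "C1" if y >= 0 else "C2"

print(classify(np.array([0.4, 0.9])))   # a point on one side of the boundary
print(classify(np.array([-1.0, 0.5])))  # a point on the other side
```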

Discriminant functions: how to determine the parameters? (1) Least squares for classification
General principle: minimize the squared distance (residual) between each observed data point and its prediction by the model function.

Discriminant functions: how to determine the parameters?
In the context of classification: find the parameters that minimize the squared error (residual) between the model's outputs for the data points and the target values encoding their classes.

Discriminant functions: how to determine the parameters?
Problem: least squares is sensitive to outliers. Because the squared error to the outliers is also minimized, the decision boundary can be shifted in a way that leads to misclassifications. (Figure: the same data classified by least squares and by logistic regression; the least-squares boundary is pulled toward the outliers.)
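
As an illustration of least squares for classification, the following sketch fits a linear discriminant to synthetic two-class data (my own made-up Gaussian blobs) using 1-of-K target coding and the normal-equation solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up two-class data: two Gaussian blobs in a 2-D feature space.
X = np.vstack([rng.normal([0, 0], 0.5, size=(50, 2)),
               rng.normal([2, 2], 0.5, size=(50, 2))])
labels = np.r_[np.zeros(50, dtype=int), np.ones(50, dtype=int)]

# 1-of-K target coding: row n has a 1 in the column of its class.
T = np.eye(2)[labels]

# Augment inputs with a constant 1 so the bias is absorbed into the weights.
Xb = np.hstack([np.ones((100, 1)), X])

# Least-squares solution W = (Xb^T Xb)^{-1} Xb^T T, computed via lstsq.
W, *_ = np.linalg.lstsq(Xb, T, rcond=None)

# Classify each point by the largest of the two linear outputs.
pred = np.argmax(Xb @ W, axis=1)
print("training accuracy:", np.mean(pred == labels))
```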

Discriminant functions: how to determine the parameters? (2) Fisher's linear discriminant
General principle: maximize the distance between the means of the different classes while minimizing the variance within each class. (Figure: a projection that only maximizes the between-class variance versus one that also minimizes the within-class variance.)
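
Below is a minimal sketch of Fisher's criterion on made-up two-class data, assuming the standard solution w proportional to S_W^{-1}(m2 - m1).

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up two-class data.
X1 = rng.normal([0, 0], 0.6, size=(60, 2))
X2 = rng.normal([2, 1], 0.6, size=(60, 2))

# Class means and the within-class scatter matrix S_W.
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction: w proportional to S_W^{-1} (m2 - m1); this maximizes the
# separation of the projected class means relative to the within-class variance.
w = np.linalg.solve(S_W, m2 - m1)

# Project both classes onto the discriminant dimension.
proj1, proj2 = X1 @ w, X2 @ w
print("projected class means:", proj1.mean(), proj2.mean())
print("projected within-class std:", proj1.std(), proj2.std())
```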

Probabilistic generative models
Model the class-conditional densities p(x|Ck) and the class priors p(Ck), and use them to compute the posterior class probabilities p(Ck|x) according to Bayes' theorem. For two classes the posterior can be written as a logistic sigmoid function, p(C1|x) = σ(a) with a = ln[p(x|C1)p(C1) / (p(x|C2)p(C2))]. The inverse of the sigmoid is the logit function, which represents the log of the ratio of the posterior probabilities for the two classes, ln[p(C1|x)/p(C2|x)], the log odds.
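
The sketch below illustrates the generative route for two classes, with hypothetical one-dimensional Gaussian class-conditional densities and priors of my own choosing; it checks that Bayes' theorem and the sigmoid of the log odds give the same posterior.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class priors and 1-D Gaussian class-conditional densities.
prior = {1: 0.3, 2: 0.7}
cond = {1: norm(loc=0.0, scale=1.0), 2: norm(loc=2.0, scale=1.0)}

def posterior_c1(x):
    """p(C1|x) computed two ways: directly by Bayes' theorem, and as the
    logistic sigmoid of the log odds a = ln[p(x|C1)p(C1) / (p(x|C2)p(C2))]."""
    joint1 = cond[1].pdf(x) * prior[1]
    joint2 = cond[2].pdf(x) * prior[2]
    direct = joint1 / (joint1 + joint2)
    a = np.log(joint1) - np.log(joint2)          # log odds (logit)
    via_sigmoid = 1.0 / (1.0 + np.exp(-a))
    return direct, via_sigmoid

print(posterior_c1(0.5))   # both routes agree
```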

Probabilistic discriminative models: logistic regression
The posterior probabilities are modeled directly, assuming they follow a sigmoid-shaped function, without modeling the class priors or the class-conditional densities; the sigmoid function σ is the model function of logistic regression. The inputs are first transformed non-linearly using a vector of basis functions ϕ(x); suitable choices of basis functions can make modeling the posterior probabilities easier.
p(C1|ϕ) = y(ϕ) = σ(w^T ϕ)
p(C2|ϕ) = 1 - p(C1|ϕ)

Probabilistic discriminative models: logistic regression
The parameters of the logistic regression model are determined by maximum likelihood estimation. The maximum likelihood estimates are computed using iterative reweighted least squares (IRLS), an iterative procedure that minimizes the error function using the Newton-Raphson optimization scheme: starting from some initial values, the weights are updated until the likelihood is maximized.
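
A minimal IRLS sketch in NumPy, assuming synthetic two-class data and identity basis functions plus a bias (both my own illustrative choices); each iteration applies the Newton-Raphson step with Hessian H = Phi^T R Phi.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up two-class data; Phi uses identity basis functions plus a bias feature.
X = np.vstack([rng.normal([0, 0], 1.0, size=(60, 2)),
               rng.normal([2, 2], 1.0, size=(60, 2))])
Phi = np.hstack([np.ones((120, 1)), X])
t = np.r_[np.zeros(60), np.ones(60)]

# Iterative reweighted least squares: repeated Newton-Raphson updates
# w <- w - H^{-1} grad, with H = Phi^T R Phi and R = diag(y(1 - y)).
w = np.zeros(Phi.shape[1])
for _ in range(15):
    y = 1.0 / (1.0 + np.exp(-(Phi @ w)))          # current predicted probabilities
    R = y * (1.0 - y)                             # diagonal of the weighting matrix
    grad = Phi.T @ (y - t)                        # gradient of the cross-entropy error
    H = Phi.T @ (Phi * R[:, None]) + 1e-6 * np.eye(Phi.shape[1])  # tiny ridge for stability
    w -= np.linalg.solve(H, grad)

print("fitted weights:", w)
```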

Normalizing posterior probabilities
To compare models and to use the posterior distribution in Bayesian logistic regression, it is useful to have the posterior in Gaussian form. The Laplace approximation is the tool for finding a Gaussian approximation to a probability density defined over a set of continuous variables; here it is used to find a Gaussian approximation of the posterior. The goal is to find a Gaussian approximation q(z) centered on the mode of p(z), where p(z) = f(z)/Z and Z is the unknown normalization constant. (Figure: the density p(z) and its Gaussian approximation q(z).)
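
The following is a sketch of the Laplace approximation for a one-dimensional, made-up unnormalized density f(z): find the mode numerically, take the negative curvature of ln f there as the Gaussian precision, and use it to approximate Z.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# A made-up unnormalized density f(z); p(z) = f(z)/Z with Z unknown.
def f(z):
    return np.exp(-z**2 / 2.0) / (1.0 + np.exp(-4.0 * z))

# 1. Find the mode z0 of f (equivalently of ln f).
z0 = minimize_scalar(lambda z: -np.log(f(z))).x

# 2. Precision A = -d^2/dz^2 ln f(z) at the mode (numerical second derivative).
h = 1e-4
A = -(np.log(f(z0 + h)) - 2.0 * np.log(f(z0)) + np.log(f(z0 - h))) / h**2

# 3. Gaussian approximation q(z) = N(z | z0, 1/A), and the Laplace estimate
#    of the normalization constant: Z ~ f(z0) * sqrt(2*pi / A).
Z_approx = f(z0) * np.sqrt(2.0 * np.pi / A)
print("mode:", z0, "precision:", A, "Z (Laplace approximation):", Z_approx)
```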

How to find the best model? The Bayesian Information Criterion (BIC)
The approximation of the normalization constant Z can be used to obtain an approximation of the model evidence. Consider a data set D and models {Mi} with parameters {θi}. For each model, define the likelihood p(D|θi, Mi) and introduce a prior over the parameters, p(θi|Mi). To compare models we need the model evidence p(D|Mi); Z is the (Laplace) approximation of this model evidence.
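
As a rough illustration, a BIC-style approximation to the log model evidence can be computed from the maximized log likelihood, the number of parameters, and the number of data points; the log-likelihood values below are hypothetical, and the sign convention used here is "larger is better".

```python
import numpy as np

def log_evidence_bic(max_log_likelihood, n_params, n_data):
    """Crude (BIC-style) approximation of the log model evidence:
    ln p(D|M) ~ max log likelihood - (n_params / 2) * ln(n_data)."""
    return max_log_likelihood - 0.5 * n_params * np.log(n_data)

# Hypothetical maximized log likelihoods for two competing models on N = 200 points;
# extra parameters must buy enough extra fit to win under this criterion.
print("model 1:", log_evidence_bic(-120.3, n_params=3, n_data=200))
print("model 2:", log_evidence_bic(-118.9, n_params=7, n_data=200))
```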

Making predictions
Having obtained a Gaussian approximation of the posterior distribution (using the Laplace approximation), you can make predictions for new data using Bayesian logistic regression: to arrive at a predictive distribution for the classes given new data, you marginalize the class probability with respect to the (approximate) normalized posterior distribution over the weights.
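
One way to carry out this marginalization is by Monte Carlo: draw weight samples from the Gaussian (Laplace) posterior and average the sigmoid outputs. The posterior mean and covariance below are hypothetical placeholder values, not fitted to real data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical Laplace approximation of the weight posterior: mean w_map and
# covariance S_N (placeholder values).
w_map = np.array([-0.5, 1.2, 0.8])
S_N = np.diag([0.10, 0.05, 0.05])

def predictive_c1(phi, n_samples=5000):
    """p(C1 | phi, D) ~ average of sigma(w^T phi) over samples w ~ N(w_map, S_N),
    i.e. a Monte Carlo version of marginalizing over the approximate posterior."""
    W = rng.multivariate_normal(w_map, S_N, size=n_samples)
    return np.mean(1.0 / (1.0 + np.exp(-(W @ phi))))

phi_new = np.array([1.0, 0.3, -0.7])   # new input, including the bias feature
print("predictive p(C1 | phi_new):", predictive_c1(phi_new))
```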

Terminology
Two classes: a single target variable with binary representation, t ∈ {0, 1}; t = 1 denotes class C1 and t = 0 denotes class C2.
K > 2 classes: a 1-of-K coding scheme is used; t is a vector of length K, e.g. t = (0, 1, 0, 0, 0)^T.
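
For example, the 1-of-K coding can be generated by indexing an identity matrix with the class labels (the labels below are illustrative):

```python
import numpy as np

# 1-of-K coding for K = 5 classes: indexing the identity matrix turns each
# class label into a unit target vector.
labels = np.array([1, 3, 0])
T = np.eye(5, dtype=int)[labels]
print(T)
# [[0 1 0 0 0]
#  [0 0 0 1 0]
#  [1 0 0 0 0]]
```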