Linear Models for Classification: Probabilistic Methods


Linear Models for Classification: Probabilistic Methods
Adapted from Seung-Joon Yi, Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

Recall: Linear Methods for Classification
Problem definition: given the training data {xn, tn}, find a linear model yk(x) for each class that partitions the feature space into decision regions.
Deterministic models (discriminant functions): Fisher's linear discriminant, the perceptron.

Probabilistic Approaches for Classification
Generative models: an inference step models p(x|Ck) and p(Ck); a decision step then gives p(Ck|x) via Bayes' theorem.
Discriminative models: model p(Ck|x) directly, using the functional form of a generalized linear model explicitly, and determine the parameters directly by maximum likelihood.

Logistic Sigmoid Function
The logistic function, σ(a) = 1/(1 + exp(−a)), originally arose as a model of population growth.
Its sigmoidal ("S"-shaped) form closely resembles the cumulative distribution function of a Gaussian random variable.
If the class-conditional densities are Gaussian, the posterior class probabilities take the form of a logistic sigmoid acting on a linear function of x.
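For reference, the sigmoid, its symmetry property, and its inverse (the logit), in the standard PRML notation:

    \sigma(a) = \frac{1}{1 + e^{-a}}, \qquad \sigma(-a) = 1 - \sigma(a), \qquad a = \ln\!\left(\frac{\sigma}{1-\sigma}\right)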

Posterior Probabilities
Two-class: the posterior is a logistic sigmoid acting on a linear function of x.
K-class: the posterior is a softmax transformation of a linear function of x.
The parameters of the class-conditional densities, as well as the class priors, can then be determined using maximum likelihood.

Probabilistic Generative Models: 2-Class
Given the class-conditional densities and the priors, the posterior can be expressed by a logistic sigmoid, p(C1|x) = σ(a); the quantity a is called the logit (log-odds).
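The equation the slide image showed is the standard two-class result:

    p(C_1 \mid x) = \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_1)\,p(C_1) + p(x \mid C_2)\,p(C_2)} = \sigma(a),
    \qquad
    a = \ln \frac{p(x \mid C_1)\,p(C_1)}{p(x \mid C_2)\,p(C_2)}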

Probabilistic Generative Models: K-Class
The posterior can be expressed by the softmax function (normalized exponential), the multi-class generalization of the logistic sigmoid:
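    p(C_k \mid x) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = \ln\bigl(p(x \mid C_k)\,p(C_k)\bigr)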

Probabilistic Generative Models: Gaussian Class Conditionals, 2 Classes
Assume both classes share the same covariance matrix Σ.
Note: the quadratic terms in x from the Gaussian exponents cancel, so the resulting decision boundary is linear in input space; the priors enter only through the bias term, so changing them shifts the decision boundary parallel to itself.
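With a shared covariance, the posterior reduces to a sigmoid of a linear function of x (the standard PRML result):

    p(C_1 \mid x) = \sigma\bigl(w^{\mathrm T} x + w_0\bigr), \qquad
    w = \Sigma^{-1}(\mu_1 - \mu_2), \qquad
    w_0 = -\tfrac{1}{2}\mu_1^{\mathrm T}\Sigma^{-1}\mu_1 + \tfrac{1}{2}\mu_2^{\mathrm T}\Sigma^{-1}\mu_2 + \ln\frac{p(C_1)}{p(C_2)}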

Probabilistic Generative Models: Gaussian Class Conditionals, K Classes
When all classes share the same covariance matrix, the decision boundaries are linear.
When each class-conditional density has its own covariance matrix, the quadratic terms no longer cancel, ak becomes a quadratic function of x, and we obtain a quadratic discriminant.
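For the shared-covariance case, each ak is linear in x:

    a_k(x) = w_k^{\mathrm T} x + w_{k0}, \qquad
    w_k = \Sigma^{-1}\mu_k, \qquad
    w_{k0} = -\tfrac{1}{2}\mu_k^{\mathrm T}\Sigma^{-1}\mu_k + \ln p(C_k)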

Probabilistic Generative Models: Maximum Likelihood Solution
Two classes. Given a data set {xn, tn}, n = 1, …, N, with tn = 1 denoting class C1 and tn = 0 denoting class C2:

Q: Find the prior P(C1) = π (so P(C2) = 1 − π) and the parameters of the class-conditional densities p(x|Ck): μ1, μ2 and the shared covariance Σ.

Probabilistic Generative Models: Maximum Likelihood Solution
Let P(C1) = π and P(C2) = 1 − π, and write down the likelihood of the data set.
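The likelihood the slide image presumably showed is the standard one:

    p(\mathbf{t} \mid \pi, \mu_1, \mu_2, \Sigma)
    = \prod_{n=1}^{N} \bigl[\pi\,\mathcal{N}(x_n \mid \mu_1, \Sigma)\bigr]^{t_n}
      \bigl[(1-\pi)\,\mathcal{N}(x_n \mid \mu_2, \Sigma)\bigr]^{1-t_n}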

Probabilistic Generative Models: Maximize the log likelihood with respect to π, μ1, μ2 and Σ in turn.
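Setting the derivatives to zero gives the familiar closed-form estimates (a sketch of the standard results; N1 and N2 denote the number of points in C1 and C2):

    \pi = \frac{N_1}{N}, \qquad
    \mu_1 = \frac{1}{N_1}\sum_{n=1}^{N} t_n x_n, \qquad
    \mu_2 = \frac{1}{N_2}\sum_{n=1}^{N} (1 - t_n) x_n,
    \qquad
    \Sigma = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2,
    \quad
    S_k = \frac{1}{N_k}\sum_{n \in C_k} (x_n - \mu_k)(x_n - \mu_k)^{\mathrm T}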

Probabilistic Generative Models: Discrete Features
With D binary inputs, a general class-conditional distribution would require a table of 2^D entries per class, growing exponentially with the number of features.
The naive Bayes assumption treats the features as independent, conditioned on the class Ck.
The resulting ak(x) is again linear with respect to the features, just as in the continuous (Gaussian) case.
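A minimal sketch in Python (NumPy) of the resulting linear discriminant a_k(x) = Σ_i [x_i ln μ_ki + (1 − x_i) ln(1 − μ_ki)] + ln p(Ck) for binary features; the function and array names are illustrative choices, not from the slides:

    import numpy as np

    def naive_bayes_log_posteriors(x, mu, priors):
        """Unnormalized log posteriors a_k(x) for binary features x.

        x      : (D,) array of 0/1 feature values
        mu     : (K, D) array, mu[k, i] = p(x_i = 1 | C_k)
        priors : (K,) array of class priors p(C_k)
        """
        a = (x * np.log(mu) + (1 - x) * np.log(1 - mu)).sum(axis=1) + np.log(priors)
        return a  # argmax over k gives the predicted class

    # Toy example: 3 binary features, 2 classes
    mu = np.array([[0.8, 0.6, 0.1],
                   [0.2, 0.5, 0.7]])
    priors = np.array([0.5, 0.5])
    x = np.array([1, 0, 1])
    print(naive_bayes_log_posteriors(x, mu, priors))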

Figure: Bayes decision boundaries in 2D (Pattern Classification, Duda et al., p. 42).

Figure: Bayes decision boundaries in 3D (Pattern Classification, Duda et al., p. 43).

Summary: for both Gaussian-distributed and discrete inputs, the posterior class probabilities are given by generalized linear models with logistic sigmoid or softmax activation functions.

Probabilistic Generative Models: Exponential Family
Recall that the Bernoulli, binomial, multinomial and Gaussian distributions can all be expressed in a general exponential-family form.
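The general form, with natural parameters λk:

    p(x \mid \lambda_k) = h(x)\, g(\lambda_k)\, \exp\bigl(\lambda_k^{\mathrm T} u(x)\bigr)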

Probabilistic Generative Models: Exponential Family
2 classes: the posterior is a logistic sigmoid acting on a linear function of x, for the subclass of distributions with u(x) = x.
K classes: the posterior is a softmax function, again linear with respect to x.
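Ignoring any scale parameter, the resulting linear activations are (a sketch of the standard result):

    a(x) = (\lambda_1 - \lambda_2)^{\mathrm T} x + \ln g(\lambda_1) - \ln g(\lambda_2) + \ln p(C_1) - \ln p(C_2),
    \qquad
    a_k(x) = \lambda_k^{\mathrm T} x + \ln g(\lambda_k) + \ln p(C_k)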

Probabilistic Discriminative Models
Goal: find p(Ck|x) directly; there is no separate inference step for p(x|Ck).
Discriminative training maximizes a likelihood defined through p(Ck|x) directly, which can improve prediction performance when the class-conditional densities p(x|Ck) are poorly estimated.

Fixed Basis Functions
Assume a fixed nonlinear transformation of the inputs: transform each input x using a vector of basis functions φ(x).
The resulting decision boundaries are linear in the feature space, y(x) = w^T φ(x), although they may be nonlinear in the original input space.

Posterior probability of a class for the two-class problem (see the model below).
Number of adjustable parameters (M-dimensional feature space, 2 classes):
Gaussian class-conditional densities (generative model): 2M parameters for the means and M(M+1)/2 parameters for the shared covariance matrix, growing quadratically with M.
Logistic regression (discriminative model): M parameters for w, growing linearly with M.
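The two-class posterior referred to above is the logistic regression model in its standard PRML form:

    p(C_1 \mid \phi) = y(\phi) = \sigma\bigl(w^{\mathrm T}\phi\bigr), \qquad p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi)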

Determining the Parameters
Write the likelihood function for the data set and take the negative log likelihood, which gives the cross-entropy error function.
Recall that the cross entropy between two probability distributions measures the average number of bits needed to identify an event drawn from a set of possibilities if a coding scheme based on a given distribution q is used rather than the "true" distribution p.
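The missing likelihood and error function are the standard ones, with yn = σ(w^T φn) and tn ∈ {0, 1}:

    p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n},
    \qquad
    E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \bigl\{ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \bigr\}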

The gradient of the error function with respect to w takes the same form as in linear regression: (prediction minus target) times the basis vector.
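Explicitly:

    \nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\,\phi_n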

Iterative Reweighted Least Squares
Recall the linear regression models of Chapter 3: the ML solution under the assumption of Gaussian noise has a closed form, as a consequence of the quadratic dependence of the log likelihood on the parameters w.
For the logistic regression model there is no longer a closed-form solution, but the error function is convex and has a unique minimum, so an efficient iterative technique can be used.
The Newton-Raphson update to minimize a function E(w) uses the Hessian matrix H, the matrix of second derivatives of E(w).
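The update itself:

    w^{\text{(new)}} = w^{\text{(old)}} - H^{-1}\,\nabla E(w)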

Iterative Reweighted Least Squares (cont'd)
Case 1: sum-of-squares error function. The Newton-Raphson update reaches the exact closed-form least-squares solution in a single step, since the error is quadratic in w.
Case 2: cross-entropy error function. The Newton-Raphson update takes the form of a weighted least-squares problem whose weighting matrix depends on w, hence "iterative reweighted least squares" (IRLS).
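A minimal runnable sketch of IRLS for two-class logistic regression, using the Newton-Raphson step w_new = w_old − H⁻¹∇E with H = Φᵀ R Φ and R = diag(yn(1 − yn)); the function names and the small ridge term for numerical stability are illustrative choices, not from the slides:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def irls_logistic(Phi, t, n_iter=20, ridge=1e-6):
        """Fit w for p(C1|phi) = sigmoid(w^T phi) by Newton-Raphson / IRLS.

        Phi : (N, M) design matrix of basis-function values
        t   : (N,)  binary targets in {0, 1}
        """
        N, M = Phi.shape
        w = np.zeros(M)
        for _ in range(n_iter):
            y = sigmoid(Phi @ w)                                # current predictions
            R = y * (1.0 - y)                                   # diagonal of the weighting matrix
            grad = Phi.T @ (y - t)                              # gradient of the cross-entropy error
            H = Phi.T @ (R[:, None] * Phi) + ridge * np.eye(M)  # Hessian (regularized for stability)
            w = w - np.linalg.solve(H, grad)                    # Newton-Raphson step
        return w

    # Toy usage: 1D inputs with a bias basis function phi(x) = (1, x)
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    t = (x + 0.3 * rng.normal(size=100) > 0).astype(float)
    Phi = np.column_stack([np.ones_like(x), x])
    print(irls_logistic(Phi, t))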

Multiclass Logistic Regression
The posterior probability for multiclass classification is a softmax of linear functions of the feature vector.
We can use maximum likelihood to determine the parameters directly: write the likelihood using the 1-of-K coding scheme, and take the negative log to obtain the cross-entropy error function for multiclass classification.
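The corresponding standard equations, with y_nk = y_k(φn) and t_nk the 1-of-K target values:

    p(C_k \mid \phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = w_k^{\mathrm T}\phi

    p(\mathbf{T} \mid w_1, \ldots, w_K) = \prod_{n=1}^{N}\prod_{k=1}^{K} y_{nk}^{t_{nk}},
    \qquad
    E(w_1, \ldots, w_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk} \ln y_{nk}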

Multiclass Logistic Regression (cont'd)
The derivative of the error function with respect to wj has the same form as before: the product of the error and the basis function.
The Hessian matrix can be computed block by block, so the IRLS algorithm can again be used for batch processing.
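Explicitly (the standard PRML expressions):

    \nabla_{w_j} E = \sum_{n=1}^{N} (y_{nj} - t_{nj})\,\phi_n,
    \qquad
    \nabla_{w_k}\nabla_{w_j} E = \sum_{n=1}^{N} y_{nk}\,(I_{kj} - y_{nj})\,\phi_n \phi_n^{\mathrm T}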

Generalized Linear Models
Recall that for a broad range of class-conditional distributions, described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables.
However, this is not the case for all choices of class-conditional density, so it may be worth exploring other types of discriminative probabilistic model.

Generalized Linear Model: 2 Classes
For example, for each input we evaluate an = w^T φn and set the target to tn = 1 if an ≥ θ and tn = 0 otherwise, where the threshold θ is drawn from a probability density p(θ).

Noisy Threshold Model
The corresponding activation function, when θ is drawn from a density p(θ) (illustrated as a mixture of Gaussians), is the cumulative distribution function of p(θ).
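That is:

    f(a) = p(\theta \le a) = \int_{-\infty}^{a} p(\theta)\,\mathrm{d}\theta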

Probit Function
When p(θ) is a zero-mean, unit-variance Gaussian, the activation function is the probit function, which has a sigmoidal shape.
The generalized linear model based on a probit activation function is known as probit regression.
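The probit function itself:

    \Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta \mid 0, 1)\,\mathrm{d}\theta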

Canonical Link Functions
Recall that the derivative of the error function with respect to the parameters w takes the form of the error times the feature vector, both for the logistic regression model with sigmoid activation function and for the multiclass logistic regression model with softmax activation function.
This is a general result of assuming a conditional distribution for the target variable from the exponential family, along with the corresponding choice of activation function, known as the canonical link function.

Canonical Link Functions (cont'd)
Consider an exponential-family conditional distribution for the target variable, write down the log likelihood, and take its derivative with respect to w.
Choosing the canonical link function as the inverse of the activation function then reduces the gradient to the simple "error times feature vector" form.
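A hedged sketch of the standard result, with scale parameter s and natural parameter η = ψ(y):

    p(t \mid \eta, s) = \frac{1}{s}\, h\!\left(\frac{t}{s}\right) g(\eta)\, \exp\!\left(\frac{\eta t}{s}\right),
    \qquad
    \nabla E(w) = \frac{1}{s}\sum_{n=1}^{N} (y_n - t_n)\,\phi_n
    \quad \text{when the canonical link } f^{-1}(y) = \psi(y) \text{ is used.}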

The Laplace Approximation
Goal: find a Gaussian approximation to a non-Gaussian density, centered on the mode z0 of the distribution.
Suppose p(z) = (1/Z) f(z), with normalization constant Z, is non-Gaussian.
Take a Taylor expansion, around the mode z0, of the logarithm of the target function, and exponentiate to obtain the approximating Gaussian distribution.
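In the one-dimensional case the standard expansion and resulting Gaussian are:

    \ln f(z) \simeq \ln f(z_0) - \tfrac{1}{2} A (z - z_0)^2,
    \qquad
    A = -\left.\frac{\mathrm{d}^2}{\mathrm{d}z^2} \ln f(z)\right|_{z = z_0},
    \qquad
    q(z) = \left(\frac{A}{2\pi}\right)^{1/2} \exp\!\left\{-\tfrac{1}{2} A (z - z_0)^2\right\}

In M dimensions this becomes q(z) = N(z | z0, A⁻¹) with A = −∇∇ ln f(z) evaluated at z0.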

Figure: Laplace approximation for p(z) ∝ exp(−z²/2) σ(20z + 4). Left: the normalized distribution p(z) in yellow, together with the Laplace approximation centred on the mode z0 of p(z) in red. Right: the negative logarithms of the corresponding curves.

Model Comparison and BIC
The Laplace approximation also yields an approximation to the normalization constant Z.
This result can be used to approximate the model evidence, which plays a central role in Bayesian model comparison: for a set of models with parameters θ, the log of the model evidence can be approximated by the log likelihood at the MAP estimate plus terms that penalize model complexity.
A further approximation, under some additional assumptions, gives the Bayesian Information Criterion (BIC).
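The corresponding standard approximations (M is the number of parameters, N the number of data points):

    Z \simeq f(z_0)\,\frac{(2\pi)^{M/2}}{|A|^{1/2}},
    \qquad
    \ln p(D) \simeq \ln p(D \mid \theta_{\text{MAP}}) + \ln p(\theta_{\text{MAP}}) + \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln|A|,
    \qquad
    \ln p(D) \simeq \ln p(D \mid \theta_{\text{MAP}}) - \frac{1}{2} M \ln N \quad \text{(BIC)}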

Bayesian Logistic Regression
Exact Bayesian inference for logistic regression is intractable.
Place a Gaussian prior on w; the posterior is proportional to the prior times the likelihood; take the log of the posterior and apply the Laplace approximation to obtain a Gaussian approximation to the posterior distribution.
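The standard equations, assuming a prior p(w) = N(w | m0, S0) and yn = σ(w^T φn):

    \ln p(w \mid \mathbf{t}) = -\tfrac{1}{2}(w - m_0)^{\mathrm T} S_0^{-1} (w - m_0)
      + \sum_{n=1}^{N}\bigl\{ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \bigr\} + \text{const},

    q(w) = \mathcal{N}(w \mid w_{\text{MAP}}, S_N),
    \qquad
    S_N^{-1} = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\,\phi_n \phi_n^{\mathrm T}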

Predictive Distribution
The predictive distribution for class C1 is obtained by marginalizing with respect to the posterior distribution p(w|t), which is approximated by the Gaussian q(w).
The marginalization can be reduced to a one-dimensional integral over a = w^T φ, whose distribution p(a) is a marginal of a Gaussian and is therefore itself Gaussian.
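Concretely (the standard PRML result):

    p(C_1 \mid \phi, \mathbf{t}) \simeq \int \sigma(w^{\mathrm T}\phi)\, q(w)\,\mathrm{d}w
    = \int \sigma(a)\, \mathcal{N}(a \mid \mu_a, \sigma_a^2)\,\mathrm{d}a,
    \qquad
    \mu_a = w_{\text{MAP}}^{\mathrm T}\phi, \qquad \sigma_a^2 = \phi^{\mathrm T} S_N\, \phi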

Predictive Distribution (cont'd)
To obtain the resulting approximation to the predictive distribution we must integrate over a; to do so we make use of the close similarity between the logistic sigmoid function and the probit function, for which the convolution with a Gaussian can be evaluated analytically.
This finally gives the approximate predictive distribution in closed form.
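The standard closed-form result, using σ(a) ≈ Φ(λa) with λ² = π/8:

    \int \sigma(a)\,\mathcal{N}(a \mid \mu, \sigma^2)\,\mathrm{d}a \simeq \sigma\bigl(\kappa(\sigma^2)\,\mu\bigr),
    \qquad
    \kappa(\sigma^2) = \bigl(1 + \pi\sigma^2/8\bigr)^{-1/2},
    \qquad
    p(C_1 \mid \phi, \mathbf{t}) \simeq \sigma\bigl(\kappa(\sigma_a^2)\,\mu_a\bigr)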