Announcements
- Homework 4 is due this Thursday (02/27/2004)
- The project proposal is due on 03/02

Unconstrained Optimization Rong Jin

Logistic Regression
- The optimization problem is to find weights w and threshold b that maximize the log-likelihood of the training data
- How can this be done efficiently?
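
The log-likelihood itself is not reproduced in the transcript; assuming the standard binary formulation with labels y_i in {-1, +1}, it takes the form:

```latex
% Logistic regression log-likelihood (binary labels y_i in {-1, +1})
\ell(\mathbf{w}, b)
  = \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i)
  = -\sum_{i=1}^{n} \log\!\left(1 + \exp\bigl(-y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\bigr)\right)
```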

Gradient Ascent
- Compute the gradient of the log-likelihood
- Increase the weights w and threshold b in the gradient direction
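
A minimal sketch of one gradient-ascent update under the ±1-label formulation above; the arrays X, y and the step size eta are illustrative, not from the original slides.

```python
import numpy as np

def grad_ascent_step(w, b, X, y, eta=0.1):
    """One gradient-ascent step on the logistic log-likelihood.

    X: (n, d) feature matrix, y: (n,) labels in {-1, +1}.
    Gradient of -sum log(1 + exp(-y (w.x + b))) with respect to w and b.
    """
    margins = y * (X @ w + b)
    # d/dz log(1 + exp(-z)) = -sigma(-z), so each example is weighted by the
    # probability the current model assigns to the wrong label.
    coeff = y * (1.0 / (1.0 + np.exp(margins)))   # = y * sigma(-margin)
    grad_w = X.T @ coeff
    grad_b = coeff.sum()
    return w + eta * grad_w, b + eta * grad_b
```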

Problem with Gradient Ascent
- Difficult to find an appropriate step size
  - Too small a step size: slow convergence
  - Too large a step size: oscillation or "bubbling"
- Convergence conditions: the Robbins-Monro conditions on the step sizes (e.g., a decaying schedule with sum_t eta_t = infinity and sum_t eta_t^2 < infinity), together with a "regular" (well-behaved) objective function, ensure convergence

Newton Method
- Utilize the second-order derivative
- Expand the objective function to second order around x_0
- The minimum of this quadratic approximation is at x = x_0 - f'(x_0)/f''(x_0), which gives the Newton update for optimization
- Guaranteed to converge when the objective function is convex
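
A small sketch of the one-dimensional Newton iteration described above; the derivative arguments and the example objective are illustrative.

```python
def newton_1d(df, d2f, x0, tol=1e-8, max_iter=50):
    """Newton's method for a 1-D objective: x <- x - f'(x)/f''(x).

    df, d2f: first and second derivatives of the objective.
    Each step jumps to the minimum of the local quadratic expansion.
    """
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: minimize f(x) = x^2 - 2x, whose minimum is at x = 1.
print(newton_1d(lambda x: 2 * x - 2, lambda x: 2.0, x0=5.0))  # -> 1.0
```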

Multivariate Newton Method
- The objective function comprises multiple variables
  - Example: the logistic regression model for text categorization, where thousands of words mean thousands of variables
- For a multivariate function:
  - The first-order derivative is a vector (the gradient)
  - The second-order derivative is the Hessian matrix
  - The Hessian is an m x m matrix whose (i, j) element is the second partial derivative of the objective with respect to x_i and x_j

Multivariate Newton Method
- Updating equation: x_new = x_old - H^{-1} grad f(x_old)
- The Hessian matrix of the logistic regression model can be expensive to compute
  - Example: text categorization with 10,000 words gives a 10,000 x 10,000 Hessian, i.e., 100 million entries
  - Even worse, we have to compute the inverse of the Hessian matrix H^{-1}
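
For concreteness, a hedged sketch of one multivariate Newton step for logistic regression, using the standard gradient and Hessian of the ±1-label log-likelihood; folding the bias into w via a constant feature is a simplification, not something stated on the slide.

```python
import numpy as np

def newton_step_logreg(w, X, y):
    """One Newton step maximizing the logistic log-likelihood.

    X: (n, d) feature matrix with a constant column so the bias lives in w.
    y: (n,) labels in {-1, +1}.
    """
    z = y * (X @ w)
    p = 1.0 / (1.0 + np.exp(z))          # sigma(-z): weight of each example
    grad = X.T @ (y * p)                 # gradient of the log-likelihood
    R = p * (1.0 - p)                    # sigma(z) * sigma(-z)
    H = -(X.T * R) @ X                   # Hessian = -X^T diag(R) X
    return w - np.linalg.solve(H, grad)  # w_new = w - H^{-1} grad
```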

Quasi-Newton Method
- Approximate the Hessian matrix H with another matrix B
- B is updated iteratively (BFGS), utilizing the gradients of previous iterations

Limited-Memory Quasi-Newton
- Quasi-Newton: avoids computing the inverse of the Hessian matrix, but still requires forming the B matrix, which takes large storage
- Limited-Memory Quasi-Newton (L-BFGS): avoids even explicitly forming B
  - B can be expressed as a product of vectors
  - Only the most recent vectors (3~20 of them) are kept
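
In practice L-BFGS is usually called through a library rather than implemented by hand. A sketch using scipy's L-BFGS-B solver on the same logistic objective (negated, since scipy minimizes); the synthetic data is purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit          # numerically stable sigmoid

def neg_log_likelihood_and_grad(w, X, y):
    """Negative logistic log-likelihood and its gradient (scipy minimizes)."""
    z = y * (X @ w)
    nll = np.sum(np.logaddexp(0.0, -z))  # sum of log(1 + exp(-z))
    grad = -X.T @ (y * expit(-z))
    return nll, grad

# Tiny synthetic problem; the bias is folded in as a constant third feature.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])
y = np.sign(X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=100))

result = minimize(neg_log_likelihood_and_grad, x0=np.zeros(3),
                  args=(X, y), jac=True, method="L-BFGS-B")
print(result.x)   # learned weights; the last entry is the bias b
```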

Linear Conjugate Gradient Method
- Consider optimizing the quadratic function f(x) = (1/2) x^T A x - b^T x
- Conjugate vectors: the set of vectors {p_1, p_2, ..., p_l} is said to be conjugate with respect to a matrix A if p_i^T A p_j = 0 for all i != j
- Important property: the quadratic function can be optimized by simply optimizing it along each individual direction in the conjugate set
  - Optimal solution: x* = sum_k alpha_k p_k, where alpha_k is the minimizer along the k-th conjugate direction

Example
- Minimize a quadratic function of x_1 and x_2 with a given matrix A and a pair of conjugate directions (the formulas are not reproduced in the transcript)
- Optimization: first along the direction x_1 = x_2 = x, then along the direction x_1 = -x_2 = x
- Solution: x_1 = x_2 = 1

How to Efficiently Find a Set of Conjugate Directions
- Iterative procedure: given conjugate directions {p_1, p_2, ..., p_{k-1}}, set p_k to the current residual (negative gradient) plus a suitable multiple of p_{k-1}
- Theorem: the direction generated in this step is conjugate to all previous directions {p_1, p_2, ..., p_{k-1}}, i.e., p_k^T A p_j = 0 for j < k
- Note: computing the k-th direction p_k requires only the previous direction p_{k-1} (see the sketch below)
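
Put together, the iteration above yields the standard linear conjugate-gradient algorithm; a compact sketch for minimizing (1/2) x^T A x - b^T x (equivalently, solving Ax = b) with a symmetric positive-definite A:

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Linear CG: minimize 0.5 x^T A x - b^T x for symmetric positive-definite A.

    Each new direction is built from the current residual and the previous
    direction only, yet stays A-conjugate to all earlier directions.
    """
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x                   # residual = negative gradient
    p = r.copy()                    # first direction: steepest descent
    for _ in range(max_iter or n):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)  # exact minimizer along direction p
        x += alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p        # next conjugate direction
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))     # matches np.linalg.solve(A, b)
```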

Nonlinear Conjugate Gradient
- Even though conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions
- Several variants:
  - Fletcher-Reeves conjugate gradient (FR-CG)
  - Polak-Ribiere conjugate gradient (PR-CG): more robust than FR-CG
- Compared to the Newton method: no need to compute the Hessian matrix, and no need to store it
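
As a usage sketch, scipy's nonlinear conjugate-gradient solver (method="CG") needs only the objective and its gradient, never the Hessian; the test function below is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    """A non-quadratic test objective with minimum at (1, 1)."""
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] - x[0] ** 2) ** 2

def grad_f(x):
    return np.array([2.0 * (x[0] - 1.0) - 40.0 * x[0] * (x[1] - x[0] ** 2),
                     20.0 * (x[1] - x[0] ** 2)])

result = minimize(f, x0=np.zeros(2), jac=grad_f, method="CG")
print(result.x)   # approaches (1, 1)
```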

Generalizing Decision Trees
- Figure: a decision tree with simple data partitions vs. a decision tree using classifiers for data partitions (axes: Attribute 1, Attribute 2)
- Each node is a linear classifier

Generalized Decision Trees
- Each node is a linear classifier
- Pros:
  - Usually result in shallow trees
  - Introduce nonlinearity into linear classifiers (e.g., logistic regression)
  - Overcome overfitting through the regularization mechanism within each classifier
  - A better way to deal with real-valued attributes
- Examples: neural networks, the Hierarchical Mixture Expert Model

Example
- Figure: a one-dimensional example (split point at x = 0) separated by a kernel method vs. by a generalized tree

Hierarchical Mixture Expert Model (HME)
- Architecture (figure): input x, output y; a router r(x) at the group layer feeds two groups g_1(x) and g_2(x), each of which feeds two expert classifiers at the expert layer: m_{1,1}(x), m_{1,2}(x) and m_{2,1}(x), m_{2,2}(x)
- Ask r(x): which group should be used for classifying input x?
- If group 1 is chosen, ask g_1(x): which classifier m(x) should be used?
- Classify input x using the chosen classifier m(x)

Hierarchical Mixture Expert Model (HME): Probabilistic Description
- Same architecture as above, with two hidden variables
  - The hidden variable for groups: g in {1, 2}
  - The hidden variable for classifiers: m in {11, 12, 21, 22}

Hierarchical Mixture Expert Model (HME): Example
- r(+1|x) = 3/4, r(-1|x) = 1/4
- g_1(+1|x) = 1/4, g_1(-1|x) = 3/4
- g_2(+1|x) = 1/2, g_2(-1|x) = 1/2
- Expert outputs:

               +1     -1
  m_{1,1}(x)   1/4    3/4
  m_{1,2}(x)   3/4    1/4
  m_{2,1}(x)   1/4    3/4
  m_{2,2}(x)   3/4    1/4

- p(+1|x) = ?, p(-1|x) = ?
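
A small numeric check of this example, assuming the usual HME mixture rule p(y|x) = sum_g r(g|x) sum_m g_g(m|x) m(y|x) and reading g_1(+1|x) as the probability of picking expert m_{1,1} within group 1 (and likewise for the other branches):

```python
# Mixture rule: p(y|x) = sum_g r(g|x) * sum_m g_g(m|x) * m(y|x)
r = {1: 3/4, 2: 1/4}                        # group layer: p(group|x)
g = {1: {"m11": 1/4, "m12": 3/4},           # expert choice within group 1
     2: {"m21": 1/2, "m22": 1/2}}           # expert choice within group 2
experts = {"m11": {+1: 1/4, -1: 3/4},
           "m12": {+1: 3/4, -1: 1/4},
           "m21": {+1: 1/4, -1: 3/4},
           "m22": {+1: 3/4, -1: 1/4}}

def p(y):
    return sum(r[grp] * sum(w * experts[m][y] for m, w in g[grp].items())
               for grp in r)

print(p(+1), p(-1))   # 0.59375 (= 19/32) and 0.40625 (= 13/32)
```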

Training HME
- The training examples {x_i, y_i} carry no information about r(x) or g(x) for each example
- The random variables g and m are called hidden variables since they are not exposed in the training data
- How do we train a model with hidden variables?

Start with a Random Guess
- Training data: positive examples {1, 2, 3, 4, 5}, negative examples {6, 7, 8, 9}
- Randomly assign points to each group and expert (in the figure: group 1 gets {1, 2} and {6, 7}, group 2 gets {3, 4, 5} and {8, 9}; the experts get {1}{6}, {2}{7}, {3}{9}, {5, 4}{8})
- Learn the classifiers r(x), g(x), m(x) using the randomly assigned points

Adjust Group Memberships
- (Same data and current random assignments as in the previous slide)
- The key is to assign each data point to the group that classifies it correctly with the largest probability
- How?

Adjust Group Memberships
- The key is to assign each data point to the group that classifies it correctly with the largest confidence
- Compute the posterior probabilities for the groups, p(g=1|x, y) and p(g=2|x, y), for each example

Adjust Memberships for Classifiers
- After the group adjustment, group 1 holds {1, 5} and {6, 7}; group 2 holds {2, 3, 4} and {8, 9}
- The key is to assign each data point to the classifier that classifies it correctly with the largest confidence
- Compute p(m=1,1|x, y), p(m=1,2|x, y), p(m=2,1|x, y), and p(m=2,2|x, y) for each example

Adjust Memberships for Classifiers
- The posterior probabilities for the classifiers, p(m=1,1|x, y), p(m=1,2|x, y), p(m=2,1|x, y), p(m=2,2|x, y), give the new membership weights (the table of values is not reproduced in the transcript)

Adjust Memberships for Classifiers
- Based on the computed posteriors, the data points are reassigned to the experts: within group 1, {1}{6} and {5}{7}; within group 2, {2, 3}{9} and {4}{8}

Retrain the Model
- Retrain r(x), g(x), m(x) using the new memberships

Expectation Maximization
- Two things need to be estimated:
  - The logistic regression models for r(x; theta_r), g(x; theta_g) and m(x; theta_m)
  - The unknown group and expert memberships, i.e., p(g=1,2|x), p(m=11,12|x,g=1), p(m=21,22|x,g=2)
- E-step (see the sketch below)
  1. Estimate p(g=1|x, y) and p(g=2|x, y) for all training examples, given the current guesses of r(x; theta_r), g(x; theta_g) and m(x; theta_m)
  2. Estimate p(m=11,12|x, y) and p(m=21,22|x, y) for all training examples, given the current guesses of r(x; theta_r), g(x; theta_g) and m(x; theta_m)
- M-step
  1. Train r(x; theta_r) using weighted examples: each x counts a p(g=1|x) fraction as a positive example and a p(g=2|x) fraction as a negative example
  2. Train g_1(x; theta_g) using weighted examples: each x counts a p(g=1|x) p(m=11|x,g=1) fraction as a positive example and a p(g=1|x) p(m=12|x,g=1) fraction as a negative example; train g_2(x; theta_g) similarly
  3. Train the experts m(x; theta_m) with appropriately weighted examples
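
A sketch of the E-step for a single example, assuming the standard HME posterior p(g, m | x, y) proportional to r(g|x) g_g(m|x) m(y|x); the dictionary layout is an illustrative container, not part of the slides. The resulting responsibilities are exactly the example weights used in the M-step.

```python
def e_step(r_probs, g_probs, m_probs):
    """Posterior responsibilities for one example (x, y) in a 2-group,
    2-experts-per-group HME.

    r_probs: [p(g=1|x), p(g=2|x)]
    g_probs: {1: [p(m=11|x,g=1), p(m=12|x,g=1)], 2: [p(m=21|x,g=2), p(m=22|x,g=2)]}
    m_probs: {(1, 1): p(y|x, m=11), (1, 2): ..., (2, 1): ..., (2, 2): ...}
    """
    # Joint posterior p(g, m | x, y) is proportional to r(g|x) * g_g(m|x) * m(y|x).
    joint = {}
    for grp in (1, 2):
        for j in (1, 2):
            joint[(grp, j)] = (r_probs[grp - 1]
                               * g_probs[grp][j - 1]
                               * m_probs[(grp, j)])
    Z = sum(joint.values())
    post_gm = {k: v / Z for k, v in joint.items()}             # p(g, m | x, y)
    post_g = {grp: post_gm[(grp, 1)] + post_gm[(grp, 2)]
              for grp in (1, 2)}                               # p(g | x, y)
    return post_g, post_gm
```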

Comparison of Different Classification Models
- The goal of all classifiers: predict the class label y for an input x, i.e., estimate p(y|x)
- Gaussian generative model: p(y|x) is proportional to p(x|y) p(y), posterior = likelihood x prior
  - p(x|y) describes the input patterns for each class y; difficult to estimate if x is of high dimensionality
  - Naive Bayes: p(x|y) is approximated by p(x_1|y) p(x_2|y) ... p(x_m|y); essentially a linear model
- Linear discriminative model: directly estimate p(y|x), focusing on finding the decision boundary

Comparison of Different Classification Models
- Logistic regression model
  - A linear decision boundary: w.x + b = 0
  - A probabilistic model for p(y|x)
  - Maximum likelihood approach for estimating the weights w and threshold b

Comparison of Different Classification Models
- Logistic regression model: the overfitting issue
  - Example: text classification, where every word is assigned its own weight
  - Words that appear in only one document will be assigned an infinitely large weight
  - Solution: add a regularization term to the objective
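
With an L2 penalty, one common choice of regularization term (the slide's specific term is not reproduced in the transcript), the training objective becomes:

```latex
% L2-regularized logistic regression objective
\max_{\mathbf{w}, b} \; \sum_{i=1}^{n} \log p(y_i \mid \mathbf{x}_i; \mathbf{w}, b)
  \;-\; \lambda \, \lVert \mathbf{w} \rVert_2^2
```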

Comparison of Different Classification Models
- Conditional exponential model
  - An extension of the logistic regression model to the multi-class case
  - A different set of weights w_y and threshold b_y for each class y (see below)
- Maximum entropy model
  - Find the simplest model that matches the data
  - Maximize entropy: prefer the uniform distribution
  - Constraints: enforce the model to be consistent with the observed data
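
For reference, the conditional exponential (softmax) form with one weight vector and threshold per class:

```latex
% Conditional exponential model
p(y \mid \mathbf{x}) =
  \frac{\exp\bigl(\mathbf{w}_y \cdot \mathbf{x} + b_y\bigr)}
       {\sum_{y'} \exp\bigl(\mathbf{w}_{y'} \cdot \mathbf{x} + b_{y'}\bigr)}
```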

Comparison of Different Classification Models
- Support vector machine
  - Classification margin (figure: the +1 and -1 examples on either side of the boundary, with the support vectors highlighted)
  - Maximum margin principle: separate the data far away from the decision boundary
  - Two objectives: minimize the classification error over the training data, and maximize the classification margin
  - Support vectors: only the support vectors have an impact on the location of the decision boundary

Comparison of Different Classification Models
- Support vector machine: the separable case and the noisy case both reduce to quadratic programming!
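
The two quadratic programs are not reproduced in the transcript; the standard primal formulations are:

```latex
% Separable case
\min_{\mathbf{w}, b} \; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
  \quad \text{s.t.} \quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 \;\; \forall i

% Noisy case: slack variables xi_i allow margin violations at cost C
\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C \sum_i \xi_i
  \quad \text{s.t.} \quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0
```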

Comparison of Classification Models
- Logistic regression model vs. support vector machine
  - The log-likelihood can be viewed as a measure of accuracy
  - Identical terms appear in both objectives (the side-by-side formulas are not reproduced in the transcript)

Comparison of Different Classification Models Logistic regression differs from support vector machine only in the loss function
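
Writing f(x) = w.x + b, the two loss functions being contrasted are, in their standard forms:

```latex
% Logistic loss vs. hinge loss, as functions of the margin y f(x)
\ell_{\text{logistic}} = \log\bigl(1 + e^{-y f(\mathbf{x})}\bigr),
\qquad
\ell_{\text{hinge}} = \max\bigl(0,\; 1 - y f(\mathbf{x})\bigr)
```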

Comparison of Different Classification Models
- Generative models have trouble at the decision boundary
- Figure: the classification boundary that achieves the least training error vs. the classification boundary that achieves a large margin

Nonlinear Models
- Kernel methods
  - Add additional dimensions to help separate the data (figure: a one-dimensional data set split at x = 0 becomes separable after the mapping)
  - Efficiently compute the dot product in the high-dimensional space (see the sketch below)
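
A tiny sketch of the "efficient dot product" point, using the degree-2 polynomial kernel as an illustration (the kernel actually used on the slide is not reproduced in the transcript):

```python
import numpy as np

def poly_features(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

def poly_kernel(x, z):
    """Same dot product computed directly in the original 2-D space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(poly_features(x) @ poly_features(z))   # 16.0, via the explicit mapping
print(poly_kernel(x, z))                     # 16.0, same value but cheaper
```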

Nonlinear Models
- Decision trees: nonlinearly combine different features through a tree structure
- Hierarchical Mixture Expert Model: replace each tree node with a logistic regression model, nonlinearly combining multiple linear models (figure: the two-layer HME architecture with r(x), g_1(x), g_2(x) and experts m_{1,1}(x), m_{1,2}(x), m_{2,1}(x), m_{2,2}(x))