 # Announcements  Homework 4 is due on this Thursday (02/27/2004)  Project proposal is due on 03/02.

## Presentation transcript

### Announcements

- Homework 4 is due this Thursday (02/27/2004).
- Project proposal is due on 03/02.

### Unconstrained Optimization

Rong Jin

### Logistic Regression

- The optimization problem is to find the weights w and threshold b that maximize the log-likelihood of the training data.
- How can this be done efficiently?

### Gradient Ascent

- Compute the gradient of the log-likelihood with respect to w and b.
- Increase the weights w and threshold b in the gradient direction.
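The two steps above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: it assumes labels in {+1, -1}, the model p(y|x) = sigmoid(y (w·x + b)), a fixed step size, and a made-up toy data set. The names `gradient_ascent`, `step`, and `history` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(w, b, x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def log_likelihood(w, b, xs, ys):
    # Assumed model: p(y|x) = sigmoid(y * (w.x + b)) with y in {+1, -1}.
    return sum(math.log(sigmoid(y * score(w, b, x))) for x, y in zip(xs, ys))

def gradient_ascent(xs, ys, step=0.1, iters=200):
    w = [0.0] * len(xs[0])
    b = 0.0
    history = []
    for _ in range(iters):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            # Derivative of log sigmoid(y * s) with respect to the score s.
            coeff = y * (1.0 - sigmoid(y * score(w, b, x)))
            for j, xj in enumerate(x):
                grad_w[j] += coeff * xj
            grad_b += coeff
        # Move the parameters in the gradient direction.
        w = [wj + step * gj for wj, gj in zip(w, grad_w)]
        b += step * grad_b
        history.append(log_likelihood(w, b, xs, ys))
    return w, b, history

# Toy data, invented for illustration only.
xs = [(2.0, 1.0), (1.0, 2.0), (-1.0, -1.5), (-2.0, -0.5), (0.5, -1.0), (-0.5, 1.0)]
ys = [+1, +1, -1, -1, -1, +1]
w, b, history = gradient_ascent(xs, ys)
```

With a step size this small the log-likelihood in `history` rises at every iteration. A larger `step` is the quickest way to see the oscillation discussed on the next slide.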

### Problem with Gradient Ascent

- It is difficult to find an appropriate step size:
  - Small step size: slow convergence.
  - Large step size: oscillation or "bubbling".
- Convergence conditions:
  - The Robbins-Monro conditions on the step sizes $\eta_t$ ($\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$), along with a "regular" objective function, will ensure convergence.

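### Newton Method

- Uses the second-order derivative.
- Expand the objective function to second order around $x_0$: $f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \tfrac{1}{2} f''(x_0)(x - x_0)^2$.
- The minimum point of this expansion is $x = x_0 - f'(x_0) / f''(x_0)$.
- Newton method for optimization: $x_{\text{new}} = x_{\text{old}} - f'(x_{\text{old}}) / f''(x_{\text{old}})$.
- Converges reliably when the objective function is convex, though far from the optimum the step may need damping.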
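As a small illustration of the one-dimensional update, the sketch below minimizes the convex function f(x) = exp(x) - 2x, whose minimizer is ln 2. The test function and the name `newton_minimize` are chosen here for illustration and do not come from the slides.

```python
import math

def newton_minimize(df, d2f, x0, iters=20):
    """Repeat x <- x - f'(x) / f''(x) for a fixed number of iterations."""
    x = x0
    for _ in range(iters):
        x = x - df(x) / d2f(x)
    return x

# f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2 and f''(x) = exp(x) > 0.
x_star = newton_minimize(lambda x: math.exp(x) - 2.0,
                         lambda x: math.exp(x),
                         x0=0.0)
```

No step size has to be tuned here: the second derivative sets the scale of each move, which is the main practical difference from gradient ascent.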

### Multivariate Newton Method

- The objective function comprises multiple variables.
  - Example: logistic regression model.
  - Text categorization: thousands of words, hence thousands of variables.
- Multivariate Newton method:
  - Multivariate function: $f(\mathbf{x}) = f(x_1, x_2, \ldots, x_m)$.
  - First-order derivative: a vector (the gradient).
  - Second-order derivative: the Hessian matrix.
- The Hessian matrix is an m × m matrix.
- Each element of the Hessian matrix is defined as $H_{i,j} = \partial^2 f / \partial x_i \partial x_j$.

### Multivariate Newton Method

- Updating equation: $\mathbf{x}_{\text{new}} = \mathbf{x}_{\text{old}} - H^{-1} \nabla f(\mathbf{x}_{\text{old}})$.
- Hessian matrix for the logistic regression model:
  - It can be expensive to compute.
  - Example: text categorization with 10,000 words. The Hessian matrix is of size 10,000 × 10,000, that is, 100 million entries.
  - Even worse, we have to compute the inverse of the Hessian matrix, $H^{-1}$.
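To make the multivariate update concrete, here is a hedged sketch for a logistic regression with one input feature, so the parameters are just (w, b) and the Hessian is 2 × 2 and can be inverted in closed form. It assumes labels in {+1, -1} and the model p(y|x) = sigmoid(y (w x + b)); the data and the function names are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(w, b, xs, ys):
    """Gradient of the log-likelihood with respect to (w, b)."""
    gw = gb = 0.0
    for x, y in zip(xs, ys):
        coeff = y * (1.0 - sigmoid(y * (w * x + b)))
        gw += coeff * x
        gb += coeff
    return gw, gb

def newton_logistic(xs, ys, iters=25):
    w = b = 0.0
    for _ in range(iters):
        gw, gb = gradient(w, b, xs, ys)
        # Entries of the negative Hessian: sum of p(1-p) * [x*x, x; x, 1].
        hww = hwb = hbb = 0.0
        for x in xs:
            p = sigmoid(w * x + b)
            s = p * (1.0 - p)
            hww += s * x * x
            hwb += s * x
            hbb += s
        det = hww * hbb - hwb * hwb
        # theta <- theta - H^{-1} g, written with the negative Hessian.
        w += (hbb * gw - hwb * gb) / det
        b += (hww * gb - hwb * gw) / det
    return w, b

# Toy data, deliberately not linearly separable so the maximum is finite.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 0.5, -0.5]
ys = [-1, -1, +1, -1, +1, +1, -1]
w_hat, b_hat = newton_logistic(xs, ys)
```

With m features the same step needs an m × m matrix and a linear solve per iteration, which is the cost the slide warns about.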

### Quasi-Newton Method

- Approximate the Hessian matrix H with another matrix B.
- B is updated iteratively (BFGS), using the derivatives from previous iterations.
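For reference, the standard BFGS update is written out below. The slide's own formula did not survive in the transcript, so the notation here ($s_k$, $y_k$, $B_k$) is the common textbook one and may differ from what was shown in the lecture.

```latex
% s_k: change in the parameters, y_k: change in the gradient
s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k)

% B_{k+1} is required to satisfy the secant condition
B_{k+1} s_k = y_k

% BFGS update of the Hessian approximation
B_{k+1} = B_k - \frac{B_k s_k s_k^{T} B_k}{s_k^{T} B_k s_k} + \frac{y_k y_k^{T}}{y_k^{T} s_k}
```

Only gradients from consecutive iterations enter the update, so no second derivatives are ever evaluated.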

### Limited-Memory Quasi-Newton

- Quasi-Newton:
  - Avoids computing the inverse of the Hessian matrix.
  - But it still requires computing the B matrix, which needs large storage.
- Limited-memory quasi-Newton (L-BFGS):
  - Avoids even computing the B matrix explicitly.
  - B can be expressed as a product of vectors.
  - Only the most recent vectors are kept (3 to 20).
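One common way to realize this idea is the two-loop recursion, sketched below under some assumptions: a simple backtracking line search, a memory of 5 vector pairs, and a made-up convex quadratic as the test objective. The function names are hypothetical, and this is an illustration rather than a production optimizer.

```python
from collections import deque

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def two_loop_direction(grad, history):
    """Return -H*grad, with H represented only by the stored (s, y) pairs."""
    q = list(grad)
    alphas = []
    for s, y in reversed(history):              # newest pair first
        alpha = dot(s, q) / dot(y, s)
        alphas.append(alpha)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    if history:
        s, y = history[-1]
        gamma = dot(s, y) / dot(y, y)           # scaling of the initial estimate
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    for (s, y), alpha in zip(history, reversed(alphas)):   # oldest pair first
        beta = dot(y, r) / dot(y, s)
        r = [ri + si * (alpha - beta) for ri, si in zip(r, s)]
    return [-ri for ri in r]

def lbfgs_minimize(f, grad_f, x0, memory=5, iters=200, tol=1e-10):
    x = list(x0)
    g = grad_f(x)
    history = deque(maxlen=memory)              # keeps only the most recent pairs
    for _ in range(iters):
        if dot(g, g) < tol:
            break
        d = two_loop_direction(g, history)
        t, fx, slope = 1.0, f(x), dot(g, d)
        for _ in range(60):                     # backtracking (Armijo) line search
            x_new = [xi + t * di for xi, di in zip(x, d)]
            if f(x_new) <= fx + 1e-4 * t * slope:
                break
            t *= 0.5
        g_new = grad_f(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(s, y) > 1e-12:                   # keep the estimate positive definite
            history.append((s, y))
        x, g = x_new, g_new
    return x

# Made-up convex quadratic: sum_i c_i x_i^2 + (sum_i x_i - 1)^2.
c = [1.0, 2.0, 5.0, 10.0]

def f(x):
    return sum(ci * xi * xi for ci, xi in zip(c, x)) + (sum(x) - 1.0) ** 2

def grad_f(x):
    return [2.0 * ci * xi + 2.0 * (sum(x) - 1.0) for ci, xi in zip(c, x)]

x_min = lbfgs_minimize(f, grad_f, [0.0, 0.0, 0.0, 0.0])
```

The storage is `memory` pairs of vectors instead of an m × m matrix, which is what makes the method usable with thousands of variables.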

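### Linear Conjugate Gradient Method

- Consider optimizing the quadratic function $f(\mathbf{x}) = \tfrac{1}{2} \mathbf{x}^T A \mathbf{x} - \mathbf{b}^T \mathbf{x}$.
- Conjugate vectors: the set of vectors $\{p_1, p_2, \ldots, p_l\}$ is said to be conjugate with respect to a matrix A if $p_i^T A p_j = 0$ for all $i \neq j$.
- Important property: the quadratic function can be optimized by simply optimizing the function along each individual direction in the conjugate set.
- Optimal solution: $\mathbf{x}^* = \alpha_1 p_1 + \alpha_2 p_2 + \cdots + \alpha_l p_l$, where $\alpha_k$ is the minimizer along the kth conjugate direction.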

### Example

- Minimize a quadratic function of $x_1$ and $x_2$, defined by a matrix A.
- Conjugate directions: $x_1 = x_2$ and $x_1 = -x_2$.
- Optimization:
  - First direction, $x_1 = x_2 = x$.
  - Second direction, $x_1 = -x_2 = x$.
- Solution: $x_1 = x_2 = 1$.
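The function and matrix from this slide are missing in the transcript. The sketch below therefore uses an assumed stand-in, f(x) = x1² + x2² - x1·x2 - x1 - x2 (that is, A = [[2, -1], [-1, 2]] and b = (1, 1)), chosen only because it has the same conjugate directions and the same solution x1 = x2 = 1. It should not be read as the lecture's actual example.

```python
from fractions import Fraction

A = [[2, -1], [-1, 2]]             # assumed matrix, not from the slide
b = [1, 1]                         # assumed linear term, not from the slide
directions = [[1, 1], [1, -1]]     # x1 = x2 and x1 = -x2, as on the slide

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

# The two directions are conjugate with respect to A: p1^T A p2 = 0.
conjugacy = dot(directions[0], mat_vec(A, directions[1]))

# Starting from the origin, minimize f(x) = 1/2 x^T A x - b^T x
# along each direction in turn; alpha = p^T b / p^T A p.
x = [Fraction(0), Fraction(0)]
alphas = []
for p in directions:
    alpha = Fraction(dot(p, b), dot(p, mat_vec(A, p)))
    alphas.append(alpha)
    x = [xi + alpha * pi for xi, pi in zip(x, p)]
```

Under these assumptions the step along the first direction is 1 and the step along the second is 0, so two one-dimensional minimizations reach the optimum.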

### How to Efficiently Find a Set of Conjugate Directions

- Iterative procedure:
  - Given the conjugate directions $\{p_1, p_2, \ldots, p_{k-1}\}$,
  - set $p_k$ from the current gradient and the previous direction $p_{k-1}$.
- Theorem: the direction generated in the above step is conjugate to all previous directions $\{p_1, p_2, \ldots, p_{k-1}\}$, i.e., $p_k^T A p_i = 0$ for $i = 1, \ldots, k-1$.
- Note: computing the kth direction $p_k$ requires only the previous direction $p_{k-1}$.
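The iterative procedure can be sketched with the standard linear conjugate gradient recurrences, assuming the quadratic ½ xᵀAx - bᵀx with a symmetric positive definite A. The slide's own formula for p_k is not in the transcript, so the residual-based form of β used here is the usual textbook one, and the 3 × 3 system is invented for illustration.

```python
def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def conjugate_gradient(A, b, tol=1e-24):
    n = len(b)
    x = [0.0] * n
    r = list(b)             # residual b - A x at x = 0 (the negative gradient)
    p = list(r)             # the first direction is steepest descent
    rr = dot(r, r)
    steps = 0
    while rr > tol and steps < n:
        Ap = mat_vec(A, p)
        alpha = rr / dot(p, Ap)                  # exact minimizer along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        # The new direction uses only the new residual and the previous direction.
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
        steps += 1
    return x, steps

A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x_cg, steps = conjugate_gradient(A, b)
```

For an n-variable quadratic the loop stops after at most n directions, and at no point is more than one previous direction kept in memory.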

### Nonlinear Conjugate Gradient

- Even though conjugate gradient is derived for a quadratic objective function, it can be applied directly to other nonlinear functions.
- Several variants:
  - Fletcher-Reeves conjugate gradient (FR-CG).
  - Polak-Ribiere conjugate gradient (PR-CG), which is more robust than FR-CG.
- Compared to the Newton method:
  - No need to compute the Hessian matrix.
  - No need to store the Hessian matrix.
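The two variants differ only in how the mixing coefficient β is computed. The standard definitions are given below for reference; the notation is the usual textbook one and is an assumption about what the lecture used.

```latex
% Search direction, shared by both variants
p_{k+1} = -\nabla f_{k+1} + \beta_{k+1} \, p_k

% Fletcher-Reeves
\beta^{FR}_{k+1} = \frac{\nabla f_{k+1}^{T} \nabla f_{k+1}}{\nabla f_k^{T} \nabla f_k}

% Polak-Ribiere
\beta^{PR}_{k+1} = \frac{\nabla f_{k+1}^{T} \left( \nabla f_{k+1} - \nabla f_k \right)}{\nabla f_k^{T} \nabla f_k}
```

On an exactly quadratic objective with exact line searches the two coefficients coincide, because successive gradients are then orthogonal.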

### Generalizing Decision Trees

- A decision tree with simple data partition.
- A decision tree using classifiers for data partition: each node is a linear classifier.

[Figure: the two kinds of tree, with node labels "Attribute 1", "Attribute 2", and "classifier".]

### Generalized Decision Trees

- Each node is a linear classifier.
- Pros:
  - Usually results in shallow trees.
  - Introduces nonlinearity into linear classifiers (e.g. logistic regression).
  - Overcomes overfitting issues through the regularization mechanism within the classifier.
  - A better way to deal with real-valued attributes.
- Examples:
  - Neural network.
  - Hierarchical Mixture Expert Model.

### Example

[Figure: panels "Kernel method" and "Generalized Tree" for one-dimensional data, with x = 0 marked.]

### Hierarchical Mixture Expert Model (HME)

[Figure: the input x feeds a router r(x); the group layer holds the gates g1(x) for Group 1 and g2(x) for Group 2; the expert layer holds the classifiers m1,1(x), m1,2(x), m2,1(x), m2,2(x), which produce the output y.]

- Ask r(x): which group should be used for classifying input x?
- If group 1 is chosen, which classifier m(x) should be used?
- Classify input x using the chosen classifier m(x).

### Hierarchical Mixture Expert Model (HME): Probabilistic Description

[Figure: the same two-level model, with router r(x), gates g1(x) and g2(x) in the group layer, and classifiers m1,1(x), m1,2(x), m2,1(x), m2,2(x) in the expert layer.]

- Two hidden variables:
  - The hidden variable for groups: g ∈ {1, 2}.
  - The hidden variable for classifiers: m ∈ {11, 12, 21, 22}.

### Hierarchical Mixture Expert Model (HME): Example

- Router: r(+1|x) = ¾, r(-1|x) = ¼.
- Gate of group 1: g1(+1|x) = ¼, g1(-1|x) = ¾.
- Gate of group 2: g2(+1|x) = ½, g2(-1|x) = ½.
- Outputs of the experts:

| Expert | +1 | -1 |
|---|---|---|
| m1,1(x) | ¼ | ¾ |
| m1,2(x) | ¾ | ¼ |
| m2,1(x) | ¼ | ¾ |
| m2,2(x) | ¾ | ¼ |

- Question: p(+1|x) = ?, p(-1|x) = ?
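One way to answer the question is to sum over every path through the tree. The sketch below does this with exact fractions. It relies on an assumption the transcript does not state explicitly: that the +1 output of the router selects group 1 (and -1 selects group 2), and that the +1 output of each gate selects the first expert of its group.

```python
from fractions import Fraction as F

# Assumption: r(+1|x) is the probability of routing to group 1.
router = {1: F(3, 4), 2: F(1, 4)}

# Assumption: g_k(+1|x) is the probability of choosing the first expert of group k.
gates = {1: {1: F(1, 4), 2: F(3, 4)},
         2: {1: F(1, 2), 2: F(1, 2)}}

# Probability that each expert outputs +1, taken from the slide's numbers.
experts_pos = {(1, 1): F(1, 4), (1, 2): F(3, 4),
               (2, 1): F(1, 4), (2, 2): F(3, 4)}

def predict_positive(router, gates, experts_pos):
    """p(+1|x) = sum over groups g and experts m of r(g) * g_g(m) * m_{g,m}(+1)."""
    return sum(router[g] * gates[g][m] * experts_pos[(g, m)]
               for g in router for m in gates[g])

p_pos = predict_positive(router, gates, experts_pos)
p_neg = 1 - p_pos
```

Under these assumptions the mixture gives p(+1|x) = 19/32 and p(-1|x) = 13/32. A different reading of which output selects which branch would change the numbers but not the method.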

### Training HME

- In the training examples $\{x_i, y_i\}$:
  - There is no information about r(x), g(x) for each example.
  - The random variables g, m are called hidden variables since they are not exposed in the training data.
- How do we train a model with hidden variables?

### Start with Random Guess

- Training points: positive {1, 2, 3, 4, 5}, negative {6, 7, 8, 9}.
- Random assignment: randomly assign points to each group and expert.
- Learn the classifiers r(x), g(x), m(x) using the randomly assigned points.
- Initial assignment:
  - Group 1: {1, 2} and {6, 7}. Group 2: {3, 4, 5} and {8, 9}.
  - m1,1: {1} and {6}. m1,2: {2} and {7}. m2,1: {3} and {9}. m2,2: {5, 4} and {8}.

### Adjust Group Memberships

- Training points: positive {1, 2, 3, 4, 5}, negative {6, 7, 8, 9}.
- Current assignment:
  - Group 1: {1, 2} and {6, 7}. Group 2: {3, 4, 5} and {8, 9}.
  - m1,1: {1} and {6}. m1,2: {2} and {7}. m2,1: {3} and {9}. m2,2: {5, 4} and {8}.
- The key is to assign each data point to the group that classifies it correctly with the largest probability.
- How?

### Adjust Group Memberships

- Training points: positive {1, 2, 3, 4, 5}, negative {6, 7, 8, 9}.
- Current assignment:
  - Group 1: {1, 2} and {6, 7}. Group 2: {3, 4, 5} and {8, 9}.
  - m1,1: {1} and {6}. m1,2: {2} and {7}. m2,1: {3} and {9}. m2,2: {5, 4} and {8}.
- The key is to assign each data point to the group that classifies it correctly with the largest confidence.
- Compute p(g=1|x,y) and p(g=2|x,y).

Posterior probabilities for groups:

| Point | Group 1 | Group 2 |
|---|---|---|
| 1 | 0.8 | 0.2 |
| 2 | 0.4 | 0.6 |
| 3 | 0.3 | 0.7 |
| 4 | 0.1 | 0.9 |
| 5 | 0.65 | 0.35 |
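Reading the table as a hard assignment, each positive point moves to the group with the larger posterior. The short sketch below applies that rule to the numbers from the slide; the function name `assign_groups` is hypothetical.

```python
# Posterior (p(g=1|x,y), p(g=2|x,y)) for the five positive points, from the slide.
group_posterior = {
    1: (0.8, 0.2),
    2: (0.4, 0.6),
    3: (0.3, 0.7),
    4: (0.1, 0.9),
    5: (0.65, 0.35),
}

def assign_groups(posterior):
    """Put each point into the group with the larger posterior probability."""
    groups = {1: [], 2: []}
    for point in sorted(posterior):
        p1, p2 = posterior[point]
        groups[1 if p1 >= p2 else 2].append(point)
    return groups

new_groups = assign_groups(group_posterior)
```

The result, {1, 5} for group 1 and {2, 3, 4} for group 2, matches the updated memberships shown on the following slides.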

### Adjust Memberships for Classifiers

- Training points: positive {1, 2, 3, 4, 5}, negative {6, 7, 8, 9}.
- Updated group assignment: Group 1 holds {1, 5} and {6, 7}. Group 2 holds {2, 3, 4} and {8, 9}.
- The key is to assign each data point to the classifier that classifies it correctly with the largest confidence.
- Compute p(m=1,1|x,y), p(m=1,2|x,y), p(m=2,1|x,y), p(m=2,2|x,y).

### Adjust Memberships for Classifiers

- Training points: positive {1, 2, 3, 4, 5}, negative {6, 7, 8, 9}.
- Updated group assignment: Group 1 holds {1, 5} and {6, 7}. Group 2 holds {2, 3, 4} and {8, 9}.
- The key is to assign each data point to the classifier that classifies it correctly with the largest confidence.
- Compute p(m=1,1|x,y), p(m=1,2|x,y), p(m=2,1|x,y), p(m=2,2|x,y).

Posterior probabilities for classifiers:

| Classifier | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| m1,1 | 0.7 | 0.1 | 0.15 | 0.1 | 0.05 |
| m1,2 | 0.2 | 0.2 | 0.20 | 0.1 | 0.55 |
| m2,1 | 0.05 | 0.5 | 0.60 | 0.1 | 0.3 |
| m2,2 | 0.05 | 0.2 | 0.05 | 0.7 | 0.1 |

### Adjust Memberships for Classifiers

- The key is to assign each data point to the classifier that classifies it correctly with the largest confidence.
- Compute p(m=1,1|x,y), p(m=1,2|x,y), p(m=2,1|x,y), p(m=2,2|x,y).

Posterior probabilities for classifiers:

| Classifier | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| m1,1 | 0.7 | 0.1 | 0.15 | 0.1 | 0.05 |
| m1,2 | 0.2 | 0.2 | 0.20 | 0.1 | 0.55 |
| m2,1 | 0.05 | 0.5 | 0.60 | 0.1 | 0.3 |
| m2,2 | 0.05 | 0.2 | 0.05 | 0.7 | 0.1 |

- Training points: positive {1, 2, 3, 4, 5}, negative {6, 7, 8, 9}.
- Group assignment: Group 1 holds {1, 5} and {6, 7}. Group 2 holds {2, 3, 4} and {8, 9}.
- Resulting classifier assignment: m1,1: {1} and {6}. m1,2: {5} and {7}. m2,1: {2, 3} and {9}. m2,2: {4} and {8}.

### Retrain the Model

- Retrain r(x), g(x), m(x) using the new memberships.
- Training points: positive {1, 2, 3, 4, 5}, negative {6, 7, 8, 9}.
- New memberships:
  - Group 1: {1, 5} and {6, 7}. Group 2: {2, 3, 4} and {8, 9}.
  - m1,1: {1} and {6}. m1,2: {5} and {7}. m2,1: {2, 3} and {9}. m2,2: {4} and {8}.

### Expectation Maximization

- Two things need to be estimated:
  - The logistic regression models r(x; θ_r), g(x; θ_g), and m(x; θ_m).
  - The unknown group memberships and expert memberships: p(g=1,2|x), p(m=11,12|x,g=1), p(m=21,22|x,g=2).
- E-step:
  1. Estimate p(g=1|x,y) and p(g=2|x,y) for all training examples, given the guessed r(x; θ_r), g(x; θ_g), and m(x; θ_m).
  2. Estimate p(m=11,12|x,y) and p(m=21,22|x,y) for all training examples, given the guessed r(x; θ_r), g(x; θ_g), and m(x; θ_m).
- M-step:
  1. Train r(x; θ_r) using weighted examples: for each x, a p(g=1|x) fraction counts as a positive example and a p(g=2|x) fraction as a negative example.
  2. Train g1(x; θ_g) using weighted examples: for each x, a p(g=1|x)p(m=11|x,g=1) fraction counts as a positive example and a p(g=1|x)p(m=12|x,g=1) fraction as a negative example. Train g2(x; θ_g) similarly.
  3. Train m(x; θ_m) with appropriately weighted examples.
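The E-step for a single training example can be sketched as a direct application of Bayes' rule over the four paths through the tree. The code below is a hedged illustration, not the lecture's implementation: it assumes the path probability factorizes as r(g|x) · g_g(m|x) · m_{g,m}(y|x), and it reuses the numbers from the earlier HME example with an observed label y = +1. The function name `e_step` is hypothetical.

```python
from fractions import Fraction as F

def e_step(router, gates, experts_y):
    """Posterior over groups and experts for one example.

    experts_y[(g, m)] is the probability that expert (g, m) assigns
    to the observed label y of this example.
    """
    joint = {(g, m): router[g] * gates[g][m] * experts_y[(g, m)]
             for g in router for m in gates[g]}
    total = sum(joint.values())                  # this is p(y|x)
    post_expert = {path: value / total for path, value in joint.items()}
    post_group = {g: sum(v for (gg, _), v in post_expert.items() if gg == g)
                  for g in router}
    return post_group, post_expert

# Numbers reused from the HME example, with observed label y = +1.
router = {1: F(3, 4), 2: F(1, 4)}
gates = {1: {1: F(1, 4), 2: F(3, 4)},
         2: {1: F(1, 2), 2: F(1, 2)}}
experts_y = {(1, 1): F(1, 4), (1, 2): F(3, 4),
             (2, 1): F(1, 4), (2, 2): F(3, 4)}

post_group, post_expert = e_step(router, gates, experts_y)

# M-step weights for the router: this example counts as a positive example
# with weight post_group[1] and as a negative example with weight post_group[2].
router_weights = (post_group[1], post_group[2])
```

The fractional weights replace the hard memberships used in the walkthrough above, so every example contributes to every classifier in proportion to its posterior.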

### Comparison of Different Classification Models

- The goal of all classifiers:
  - Predict the class label y for an input x.
  - Estimate p(y|x).
- Gaussian generative model:
  - p(y|x) ∝ p(x|y) p(y): posterior ∝ likelihood × prior.
  - p(x|y) describes the input patterns for each class y.
  - It is difficult to estimate if x is of high dimensionality.
  - Naïve Bayes: p(x|y) ≈ p(x_1|y) p(x_2|y) … p(x_m|y). Essentially a linear model.
- Linear discriminative model:
  - Directly estimates p(y|x).
  - Focuses on finding the decision boundary.

### Comparison of Different Classification Models

- Logistic regression model:
  - A linear decision boundary: w·x + b.
  - A probabilistic model p(y|x).
  - Maximum likelihood approach for estimating the weights w and threshold b.

### Comparison of Different Classification Models

- Logistic regression model: overfitting issue.
  - Example: text classification.
    - Every word is assigned a different weight.
    - Words that appear in only one document will be assigned an infinitely large weight.
  - Solution: regularization, that is, adding a regularization term to the objective.
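The runaway weight is easy to reproduce. The sketch below models a word that occurs in exactly one document: a single example with x = 1 and label +1. It assumes an L2 penalty (reg/2)·w² as the regularization term, since the slide's own formula is missing from the transcript; the function name and the constants are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_single_word_weight(reg_strength, step=0.5, iters=2000):
    """Gradient ascent on log sigmoid(w) - (reg_strength / 2) * w**2."""
    w = 0.0
    for _ in range(iters):
        grad = (1.0 - sigmoid(w)) - reg_strength * w
        w += step * grad
    return w

w_plain = fit_single_word_weight(0.0)                       # no regularization
w_plain_longer = fit_single_word_weight(0.0, iters=20000)   # keeps growing
w_reg = fit_single_word_weight(0.1)                         # settles at a finite value
```

Without the penalty the gradient never reaches zero, so the weight keeps increasing the longer training runs. With the penalty the weight settles where the two terms balance.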

### Comparison of Different Classification Models

- Conditional exponential model:
  - An extension of the logistic regression model to the multi-class case.
  - A different set of weights w_y and threshold b for each class y.
- Maximum entropy model:
  - Finds the simplest model that matches the data.
  - Maximize entropy: prefer the uniform distribution.
  - Constraints: force the model to be consistent with the observed data.

### Comparison of Different Classification Models

- Support vector machine:
  - Classification margin.
  - Maximum margin principle: separate the data far away from the decision boundary.
  - Two objectives:
    - Minimize the classification error over the training data.
    - Maximize the classification margin.
  - Support vectors: only the support vectors affect the location of the decision boundary.

[Figure: points of class +1 and class -1 on either side of a decision boundary, with the classification margin and the support vectors marked.]

### Comparison of Different Classification Models

- Separable case.
- Noisy case.
- Quadratic programming!
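The formulas for the two cases are missing from the transcript. For reference, the standard hard-margin and soft-margin formulations are given below; the symbols $\xi_i$ (slack variables) and $C$ (trade-off constant) are the usual textbook notation and are an assumption about what the slide showed.

```latex
% Separable case (hard margin)
\min_{w, b} \; \tfrac{1}{2} \|w\|^2
\quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, n

% Noisy case (soft margin)
\min_{w, b, \xi} \; \tfrac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0
```

Both problems have a quadratic objective and linear constraints, which is why they are solved as quadratic programs.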

### Comparison of Classification Models

- Logistic regression model vs. support vector machine.
- The log-likelihood can be viewed as a measurement of accuracy.
- The two objectives share identical terms.

### Comparison of Different Classification Models

- Logistic regression differs from the support vector machine only in the loss function.
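To see the difference, both losses can be written as functions of the margin z = y (w·x + b). The sketch below assumes the usual forms, the logistic loss log(1 + exp(-z)) for regularized logistic regression and the hinge loss max(0, 1 - z) for the soft-margin SVM; the function names are chosen here for illustration.

```python
import math

def logistic_loss(margin):
    """Negative log-likelihood of one example under logistic regression."""
    return math.log(1.0 + math.exp(-margin))

def hinge_loss(margin):
    """Loss of one example under the soft-margin SVM."""
    return max(0.0, 1.0 - margin)

# Compare the two losses on a few margins.
comparison = [(z, logistic_loss(z), hinge_loss(z))
              for z in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```

The hinge loss is exactly zero once the margin exceeds 1, which is why only the support vectors matter for the SVM. The logistic loss stays positive everywhere, so every example keeps a small influence.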

### Comparison of Different Classification Models

- Generative models have trouble at the decision boundary.

[Figure: one classification boundary that achieves the least training error and one that achieves a large margin.]

### Nonlinear Models

- Kernel methods:
  - Add additional dimensions to help separate the data.
  - Efficiently compute the dot product in a high-dimensional space.

[Figure: panel "Kernel method" for one-dimensional data, with x = 0 marked.]
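Both bullet points can be illustrated in a few lines. The sketch below first checks that the quadratic kernel (x·z)² equals a dot product in an explicit three-dimensional feature space, then separates a one-dimensional toy set by adding x² as an extra dimension. The kernel choice, the toy points, and the threshold 2.5 are assumptions made for illustration; the slide's figure did not survive in the transcript.

```python
import math

def feature_map(x):
    """Explicit features for 2-D input whose dot product equals (x.z)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Same quantity computed directly in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(feature_map(x), feature_map(z)))
implicit = poly_kernel(x, z)

# Assumed 1-D toy data: positives far from x = 0, negatives close to it.
# No single threshold on x separates them, but a threshold on x*x does.
points = [-2.0, -1.0, 1.0, 2.0]
labels = [+1, -1, -1, +1]
predictions = [+1 if p * p > 2.5 else -1 for p in points]
```

The kernel gives the same number as the explicit map without ever building the higher-dimensional vectors, which is where the efficiency comes from.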

### Nonlinear Models

- Decision trees: nonlinearly combine different features through a tree structure.
- Hierarchical Mixture Model:
  - Replace each node with a logistic regression model.
  - Nonlinearly combine multiple linear models.

[Figure: a decision tree next to the two-level model with router r(x), gates g1(x) and g2(x) in the group layer, and classifiers m1,1(x), m1,2(x), m2,1(x), m2,2(x) in the expert layer.]
