# Announcements: Project proposal is due on 03/11. Three seminars this Friday (EB 3105): Dealing with Indefinite Representations in Pattern Recognition.


Announcements
- Project proposal is due on 03/11
- Three seminars this Friday (EB 3105):
  - Dealing with Indefinite Representations in Pattern Recognition (10:00 am - 11:00 am)
  - Computational Analysis of Drosophila Gene Expression Pattern Image (11:00 am - 12:00 pm)
  - 3D General Lesion Segmentation in CT (3:00 pm - 4:00 pm)

Hierarchical Mixture Expert Model Rong Jin

Good Things about Decision Trees
- Decision trees introduce nonlinearity through the tree structure
  - Viewing `A^B^C` as `A*B*C`
- Compared to kernel methods:
  - Less ad hoc
  - Easier to understand

Example
[Figure: panels "Kernel method" and "Generalized Tree" on data labeled +, −, −, +, with a boundary at x = 0.]
In general, mixture models are powerful at fitting complex decision boundaries; examples include stacking, boosting, and bagging.

Generalize Decision Trees (from slides of Andrew Moore)
Each node of a decision tree depends on only a single feature. Is this the best idea?

Partition Datasets
- The goal of each node is to partition the data set into disjoint subsets such that each subset is easier to classify.
- Example: the original dataset is partitioned by a single attribute into cylinders = 4, cylinders = 5, cylinders = 6, and cylinders = 8.

Partition Datasets (cont'd)
- More complicated partitions: the original dataset is partitioned by multiple attributes, e.g. "Cylinders […] 4 ton", "Cylinders […] 6 and Weight < 3 ton", and other cases.
- How to accomplish such a complicated partition?
  - Each partition corresponds to a class
  - Partitioning a dataset into disjoint subsets corresponds to classifying a dataset into multiple classes
  - Use a classification model for each node

A More General Decision Tree
[Figure: two trees over Attribute 1 and Attribute 2 on data labeled + and −: a decision tree with simple data partition, and a decision tree using classifiers for data partition.]
Each node is a linear classifier.

General Schemes for Decision Trees
- Each node within the tree is a linear classifier
- Pros:
  - Usually results in shallow trees
  - Introduces nonlinearity into linear classifiers (e.g. logistic regression)
  - Overcomes overfitting through the regularization mechanism within the classifier
  - Partitions datasets with soft memberships
  - A better way to deal with real-valued attributes
- Examples: neural networks, Hierarchical Mixture Expert Model

Hierarchical Mixture Expert Model (HME)
- Router $r(x)$: decides which classifier $x$ should be routed to
- Group layer: Group 1 with gate $g_1(x)$ and Group 2 with gate $g_2(x)$
- Expert layer: $m_{1,1}(x)$ and $m_{1,2}(x)$ under Group 1; $m_{2,1}(x)$ and $m_{2,2}(x)$ under Group 2
- Classifier (expert): determines the class for input $x$

Hierarchical Mixture Expert Model (HME): Simple Case
- Which group should be used for classifying $x$? The router outputs $r(x) = +1$, so $x$ is sent to Group 1.
- Which expert should be used for classifying $x$? The gate outputs $g_1(x) = -1$, so $x$ is sent to expert $m_{1,2}$.
- The expert outputs $m_{1,2}(x) = +1$, so the class label for $x$ is $+1$.

Hierarchical Mixture Expert Model (HME): More Complicated Case
- Which group should be used for classifying $x$? The router is now uncertain: $r(+1 \mid x) = 3/4$, $r(-1 \mid x) = 1/4$.
- Which expert should be used for classifying $x$? The gates are uncertain as well: $g_1(+1 \mid x) = 1/4$, $g_1(-1 \mid x) = 3/4$; $g_2(+1 \mid x) = 1/2$, $g_2(-1 \mid x) = 1/2$.
- Each expert outputs a distribution over the class labels:

| Expert | $+1$ | $-1$ |
|---|---|---|
| $m_{1,1}(x)$ | 1/4 | 3/4 |
| $m_{1,2}(x)$ | 3/4 | 1/4 |
| $m_{2,1}(x)$ | 1/4 | 3/4 |
| $m_{2,2}(x)$ | 3/4 | 1/4 |

- How to compute the probabilities $p(+1 \mid x)$ and $p(-1 \mid x)$?

HME: Probabilistic Description
- Random variable $g \in \{1, 2\}$ (group): $r(+1 \mid x) = p(g = 1 \mid x)$, $r(-1 \mid x) = p(g = 2 \mid x)$
- Random variable $m \in \{11, 12, 21, 22\}$ (expert):
  - $g_1(+1 \mid x) = p(m = 11 \mid x, g = 1)$, $g_1(-1 \mid x) = p(m = 12 \mid x, g = 1)$
  - $g_2(+1 \mid x) = p(m = 21 \mid x, g = 2)$, $g_2(-1 \mid x) = p(m = 22 \mid x, g = 2)$

HME: Probabilistic Description
- Router: $r(+1 \mid x) = 3/4$, $r(-1 \mid x) = 1/4$
- Gates: $g_1(+1 \mid x) = 1/4$, $g_1(-1 \mid x) = 3/4$; $g_2(+1 \mid x) = 1/2$, $g_2(-1 \mid x) = 1/2$
- Experts, as (probability of $+1$, probability of $-1$): $m_{1,1}(x)$: (1/4, 3/4); $m_{1,2}(x)$: (3/4, 1/4); $m_{2,1}(x)$: (1/4, 3/4); $m_{2,2}(x)$: (3/4, 1/4)
- Compute $P(+1 \mid x)$ and $P(-1 \mid x)$
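The worked answer is not preserved in the transcript. A minimal sketch, assuming the prediction marginalizes over the hidden group and expert, $p(y \mid x) = \sum_g p(g \mid x) \sum_m p(m \mid x, g)\, p(y \mid x, m)$, with the numbers from the slide; the dictionary layout and the name `predict` are illustrative choices, not part of the original deck:

```python
from fractions import Fraction as F

# Router output p(g | x), keyed by group id.
r = {1: F(3, 4), 2: F(1, 4)}
# Gate outputs p(m | x, g), keyed by group id and then expert id.
g = {1: {"11": F(1, 4), "12": F(3, 4)},
     2: {"21": F(1, 2), "22": F(1, 2)}}
# Expert outputs p(y | x, m), keyed by expert id and then class label.
m = {"11": {+1: F(1, 4), -1: F(3, 4)},
     "12": {+1: F(3, 4), -1: F(1, 4)},
     "21": {+1: F(1, 4), -1: F(3, 4)},
     "22": {+1: F(3, 4), -1: F(1, 4)}}

def predict(y):
    """p(y | x): sum over groups and experts of router * gate * expert."""
    return sum(r[gi] * sum(g[gi][mi] * m[mi][y] for mi in g[gi]) for gi in r)

print(predict(+1), predict(-1))  # 19/32 13/32
```

Group 1 contributes 3/4 × (1/16 + 9/16) = 15/32 and Group 2 contributes 1/4 × 1/2 = 4/32 to $p(+1 \mid x)$, which gives 19/32; the two class probabilities sum to one.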


Hierarchical Mixture Expert Model (HME)
The model maps an input $x$ through the router $r(x)$, the group layer, and the expert layer to an output $y$. Is HME more powerful than a simple majority-vote approach?

Problem with Training HME
- Use logistic regression to model $r(x)$, $g(x)$, and $m(x)$
- There are no training examples for $r(x)$ and $g(x)$
  - For each training example $(x, y)$, we don't know its group ID or expert ID
  - So we can't apply the logistic regression training procedure to train $r(x)$ and $g(x)$ directly
- The random variables $g$ and $m$ are called hidden variables since they are not exposed in the training data
- How do we train a model with incomplete data?

Start with Random Guess …
- Training data: positive examples {1, 2, 3, 4, 5}; negative examples {6, 7, 8, 9}
- Iteration 1 (random guess): randomly assign points to groups and experts
  - Group 1: {1, 2} and {6, 7}; Group 2: {3, 4, 5} and {8, 9}
  - Experts: $m_{1,1}$: {1} and {6}; $m_{1,2}$: {2} and {7}; $m_{2,1}$: {3} and {9}; $m_{2,2}$: {4, 5} and {8}
- Learn $r(x)$, $g_1(x)$, $g_2(x)$, $m_{11}(x)$, $m_{12}(x)$, $m_{21}(x)$, $m_{22}(x)$ from these assignments
- Now, what should we do?

Refine HME Model
- Training data: positive examples {1, 2, 3, 4, 5}; negative examples {6, 7, 8, 9}
- Iteration 2: regroup data points
  - Reassign the group membership of each data point, e.g. Group 1: {1, 5} and {6, 7}; Group 2: {2, 3, 4} and {8, 9}
  - Reassign the expert membership of each data point
- But how?

Determine Group Memberships
- Consider an example $(x, +1)$ with the same router, gate, and expert outputs as in the probabilistic description
- Compute the posterior $p(g \mid x, y = +1)$ on your own sheet!

Determine Expert Memberships
- Consider the same example $(x, +1)$
- Compute the posterior $p(m \mid x, y = +1, g)$ within each group $g$
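The posterior computations themselves are missing from the transcript. A minimal sketch, assuming Bayes' rule with the router and gates acting as priors, $p(g \mid x, y) \propto p(g \mid x)\, p(y \mid x, g)$ and $p(m \mid x, y, g) \propto p(m \mid x, g)\, p(y \mid x, m)$, and using the numbers of the running example; the function name `posteriors` is a hypothetical helper:

```python
from fractions import Fraction as F

r = {1: F(3, 4), 2: F(1, 4)}                      # p(g | x)
g = {1: {"11": F(1, 4), "12": F(3, 4)},           # p(m | x, g)
     2: {"21": F(1, 2), "22": F(1, 2)}}
m = {"11": {+1: F(1, 4), -1: F(3, 4)},            # p(y | x, m)
     "12": {+1: F(3, 4), -1: F(1, 4)},
     "21": {+1: F(1, 4), -1: F(3, 4)},
     "22": {+1: F(3, 4), -1: F(1, 4)}}

def posteriors(y):
    """Return p(g | x, y) and p(m | x, y, g) for an observed label y."""
    # p(y | x, g): what each group as a whole predicts for label y.
    group_lik = {gi: sum(g[gi][mi] * m[mi][y] for mi in g[gi]) for gi in r}
    p_y = sum(r[gi] * group_lik[gi] for gi in r)  # p(y | x)
    post_g = {gi: r[gi] * group_lik[gi] / p_y for gi in r}
    post_m = {gi: {mi: g[gi][mi] * m[mi][y] / group_lik[gi] for mi in g[gi]}
              for gi in r}
    return post_g, post_m

post_g, post_m = posteriors(+1)
print(post_g)  # group memberships: 15/19 for group 1, 4/19 for group 2
print(post_m)  # expert memberships: 1/10, 9/10 in group 1; 1/4, 3/4 in group 2
```

Observing $y = +1$ shifts weight toward the experts that favour $+1$: inside Group 1 the membership of $m_{1,2}$ rises from the gate's 3/4 to 9/10.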

Refine HME Model
- Training data: positive examples {1, 2, 3, 4, 5}; negative examples {6, 7, 8, 9}
- Iteration 2: regroup data points
  - Reassign the group membership of each data point, e.g. Group 1: {1, 5} and {6, 7}; Group 2: {2, 3, 4} and {8, 9}
  - Reassign the expert membership of each data point
  - Compute the posteriors $p(g \mid x, y)$ and $p(m \mid x, y, g)$ for each training example $(x, y)$
  - Retrain $r(x)$, $g_1(x)$, $g_2(x)$, $m_{11}(x)$, $m_{12}(x)$, $m_{21}(x)$, $m_{22}(x)$ using the estimated posteriors
- But how?

Logistic Regression: Soft Memberships
- Example: train $r(x)$ with soft memberships
- Example: train $m_{11}(x)$ with soft memberships
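The training objectives for these two examples are not preserved in the transcript. A plausible reconstruction, assuming the standard weighted maximum-likelihood form in which each example contributes in proportion to its posterior membership (the index $i$ over training examples and the symbols $\ell_r$, $\ell_{m_{11}}$ are notation introduced here):

```latex
% Router: the soft group memberships act as soft targets.
\ell_r(\theta_r) = \sum_i \Big[ p(g=1 \mid x_i, y_i)\,\log r(+1 \mid x_i; \theta_r)
                              + p(g=2 \mid x_i, y_i)\,\log r(-1 \mid x_i; \theta_r) \Big]

% Expert m_{11}: the true label y_i is the target, and the probability that
% example i belongs to group 1 and expert 11 acts as a per-example weight.
\ell_{m_{11}}(\theta_m) = \sum_i p(g=1 \mid x_i, y_i)\, p(m=11 \mid x_i, y_i, g=1)\,
                          \log m_{11}(y_i \mid x_i; \theta_m)
```

With hard memberships (each posterior equal to 0 or 1) both objectives reduce to ordinary logistic regression on the points assigned to that node, which is the random-guess scheme of iteration 1.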

Start with Random Guess …
- Training data: positive examples {1, 2, 3, 4, 5}; negative examples {6, 7, 8, 9}
- Iteration 2: regroup data points
  - Reassign the group membership of each data point
  - Reassign the expert membership of each data point
  - Compute the posteriors $p(g \mid x, y)$ and $p(m \mid x, y, g)$ for each training example $(x, y)$
  - Retrain $r(x)$, $g_1(x)$, $g_2(x)$, $m_{11}(x)$, $m_{12}(x)$, $m_{21}(x)$, $m_{22}(x)$
- After regrouping: Group 1: {1, 5} and {6, 7}; Group 2: {2, 3, 4} and {8, 9}; experts $m_{1,1}$: {1} and {6}; $m_{1,2}$: {5} and {7}; $m_{2,1}$: {2, 3} and {9}; $m_{2,2}$: {4} and {8}
- Repeat the above procedure until it converges (it is guaranteed to converge to a local optimum)
- This is the famous Expectation-Maximization (EM) algorithm!

Formal EM Algorithm for HME
- Unknown logistic regression models: $r(x; \theta_r)$, $\{g_i(x; \theta_g)\}$, and $\{m_i(x; \theta_m)\}$
- Unknown group and expert memberships: $p(g \mid x, y)$, $p(m \mid x, y, g)$
- E-step: fix the logistic regression models and estimate the memberships
  - Estimate $p(g = 1 \mid x, y)$ and $p(g = 2 \mid x, y)$ for all training examples
  - Estimate $p(m = 11, 12 \mid x, y, g = 1)$ and $p(m = 21, 22 \mid x, y, g = 2)$ for all training examples
- M-step: fix the memberships and learn the logistic regression models
  - Train $r(x; \theta_r)$ using the soft memberships $p(g = 1 \mid x, y)$ and $p(g = 2 \mid x, y)$
  - Train $g_1(x; \theta_g)$ and $g_2(x; \theta_g)$ using the soft memberships $p(m = 11, 12 \mid x, y, g = 1)$ and $p(m = 21, 22 \mid x, y, g = 2)$
  - Train $m_{11}(x; \theta_m)$, $m_{12}(x; \theta_m)$, $m_{21}(x; \theta_m)$, and $m_{22}(x; \theta_m)$ using the soft memberships $p(m = 11, 12 \mid x, y, g = 1)$ and $p(m = 21, 22 \mid x, y, g = 2)$
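The E-step and M-step above can be sketched end to end for the two-group, four-expert tree. This is an illustrative sketch, not the lecture's implementation: it assumes NumPy, uses a few gradient-ascent steps per M-step (a generalized EM) instead of fully retraining each logistic model, and the names `fit`, `e_step`, `m_step`, `train_hme`, the step sizes, and the toy XOR-like dataset are all assumptions made here.

```python
import numpy as np

NAMES = ("r", "g1", "g2", "m11", "m12", "m21", "m22")

def prob(theta, X):
    """Logistic model output p(+1 | x) for parameters theta = (w, b)."""
    w, b = theta
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def fit(theta, X, target, weight, lr=0.1, steps=50):
    """Gradient ascent on sum_i weight_i * [t_i log p_i + (1 - t_i) log(1 - p_i)]."""
    w, b = theta
    for _ in range(steps):
        err = weight * (target - prob((w, b), X))
        w = w + lr * X.T @ err / len(X)
        b = b + lr * err.mean()
    return w, b

def e_step(params, X, y):
    """Fix the models; estimate memberships and the current log-likelihood."""
    pr = prob(params["r"], X)                                # p(g=1 | x)
    pg1, pg2 = prob(params["g1"], X), prob(params["g2"], X)  # p(m=i1 | x, g=i)
    pm = {}
    for k in NAMES[3:]:
        p1 = prob(params[k], X)
        pm[k] = np.where(y == 1, p1, 1.0 - p1)               # p(y | x, m)
    lik1 = pg1 * pm["m11"] + (1 - pg1) * pm["m12"]           # p(y | x, g=1)
    lik2 = pg2 * pm["m21"] + (1 - pg2) * pm["m22"]           # p(y | x, g=2)
    p_y = pr * lik1 + (1 - pr) * lik2                        # p(y | x)
    h_g1 = pr * lik1 / p_y                                   # p(g=1 | x, y)
    h_m11 = pg1 * pm["m11"] / lik1                           # p(m=11 | x, y, g=1)
    h_m21 = pg2 * pm["m21"] / lik2                           # p(m=21 | x, y, g=2)
    return h_g1, h_m11, h_m21, np.log(p_y).sum()

def m_step(params, X, y, h_g1, h_m11, h_m21):
    """Fix the memberships; update every logistic model with soft weights."""
    t, ones = (y == 1).astype(float), np.ones(len(X))
    return {
        "r": fit(params["r"], X, h_g1, ones),
        "g1": fit(params["g1"], X, h_m11, h_g1),
        "g2": fit(params["g2"], X, h_m21, 1 - h_g1),
        "m11": fit(params["m11"], X, t, h_g1 * h_m11),
        "m12": fit(params["m12"], X, t, h_g1 * (1 - h_m11)),
        "m21": fit(params["m21"], X, t, (1 - h_g1) * h_m21),
        "m22": fit(params["m22"], X, t, (1 - h_g1) * (1 - h_m21)),
    }

def train_hme(X, y, iters=30, seed=0):
    rng = np.random.default_rng(seed)
    # Random initial weights play the role of the random initial assignment.
    params = {k: (rng.normal(scale=0.5, size=X.shape[1]), 0.0) for k in NAMES}
    history = []
    for _ in range(iters):
        *memberships, log_lik = e_step(params, X, y)
        history.append(log_lik)
        params = m_step(params, X, y, *memberships)
    return params, history

# Toy XOR-like data that a single linear classifier cannot separate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)
params, history = train_hme(X, y)
print(history[0], history[-1])  # log-likelihood before and after training
```

Randomly initialized weights stand in for the random point assignment of iteration 1; both are just a starting guess for the memberships, and different seeds can end in different local optima.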


What Are We Doing?
- What is the objective of Expectation-Maximization?
  - It is still simple maximum likelihood!
  - The EM algorithm actually tries to maximize the log-likelihood function
  - Most of the time it converges to a local maximum, not the global one
- Improved version: annealing EM
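For the two-level HME above, the log-likelihood being maximized can be written out explicitly. The expression below is a reconstruction from the probabilistic description of the model, with $i$ indexing training examples and $\theta$ collecting all router, gate, and expert parameters (both notational assumptions):

```latex
\ell(\theta) = \sum_i \log p(y_i \mid x_i; \theta)
             = \sum_i \log \sum_{g \in \{1,2\}} p(g \mid x_i; \theta_r)
               \sum_{m} p(m \mid x_i, g; \theta_g)\, p(y_i \mid x_i, m; \theta_m)
```

The sums over the hidden variables sit inside the logarithm, so the parameters do not decouple as they do in plain logistic regression; EM alternates between estimating the memberships and solving the decoupled weighted problems.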

Annealing EM

Improve HME
- It is sensitive to the initial assignments
  - How can we reduce the risk from poor initial assignments?
- Binary tree → K-way trees
  - Logistic regression → conditional exponential model
- Tree structure
  - Can we determine the optimal tree structure for a given dataset?

Comparison of Classification Models
- The goal of a classifier
  - Predict the class label $y$ for an input $x$
  - Estimate $p(y \mid x)$
- Gaussian generative model
  - $p(y \mid x) \propto p(x \mid y)\, p(y)$: posterior ∝ likelihood × prior
  - $p(x \mid y)$ is difficult to estimate if $x$ comprises multiple elements
  - Naïve Bayes: $p(x \mid y) \approx p(x_1 \mid y)\, p(x_2 \mid y) \cdots p(x_d \mid y)$
- Linear discriminative model
  - Estimates $p(y \mid x)$
  - Focuses on finding the decision boundary

Comparison of Classification Models
- Logistic regression model
  - A linear decision boundary: $w \cdot x + b$
  - A probabilistic model $p(y \mid x)$
  - Maximum likelihood approach for estimating the weights $w$ and threshold $b$
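A minimal sketch of this model, assuming labels $y \in \{-1, +1\}$ and the usual logistic form $p(y \mid x) = 1 / (1 + \exp(-y\,(w \cdot x + b)))$; the function names and the sample values of `w`, `b`, and `X` are illustrative:

```python
import numpy as np

def p_y_given_x(w, b, X, y):
    """p(y | x) = 1 / (1 + exp(-y * (w . x + b))) for labels y in {-1, +1}."""
    return 1.0 / (1.0 + np.exp(-y * (X @ w + b)))

def log_likelihood(w, b, X, y):
    """The quantity that maximum likelihood training maximizes over w and b."""
    return np.log(p_y_given_x(w, b, X, y)).sum()

w, b = np.array([1.0, -2.0]), 0.5
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])
y = np.array([1, -1, 1])

print(p_y_given_x(w, b, X, y))     # probability of each observed label
print(log_likelihood(w, b, X, y))  # sum of their logarithms
```

Points with $w \cdot x + b = 0$ lie on the decision boundary, where both labels receive probability 1/2.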

Comparison of Classification Models
- Logistic regression model
  - Overfitting issue: in a text classification problem, a word that appears in only one document will be assigned an infinitely large weight
  - Solution: regularization
- Conditional exponential model
- Maximum entropy model
  - A dual problem of the conditional exponential model
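The overfitting issue can be reproduced on a toy example. The sketch below assumes an L2 penalty $(\lambda/2)\lVert w \rVert^2$ as the regularizer (the slide does not say which one is meant) and a hypothetical four-document dataset in which one word occurs only in the single positive document; `fit_logistic` and `lam` are names introduced here.

```python
import numpy as np

def fit_logistic(X, y, lam, steps, lr=0.5):
    """Gradient ascent on the mean log-likelihood minus (lam / 2) * ||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-y * (X @ w)))        # p(y_i | x_i)
        grad = X.T @ (y * (1.0 - p)) / len(X) - lam * w
        w = w + lr * grad
    return w

# Hypothetical bag-of-words data: column 0 is a word found in every document,
# column 1 is a word that appears only in the single positive document.
X = np.array([[1.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, -1.0, -1.0, -1.0])

w_short = fit_logistic(X, y, lam=0.0, steps=100)
w_long = fit_logistic(X, y, lam=0.0, steps=5000)
w_reg_a = fit_logistic(X, y, lam=0.1, steps=2000)
w_reg_b = fit_logistic(X, y, lam=0.1, steps=5000)

print(w_short[1], w_long[1])  # the rare word's weight keeps growing
print(w_reg_a, w_reg_b)       # with the penalty the weights settle
```

Without the penalty the gradient for the rare word's weight is always positive, so its weight increases for as long as training runs; the penalty gives the objective a finite maximizer.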

Comparison of Classification Models
- Support vector machine
  - Classification margin
  - Maximum margin principle: two objectives
    - Minimize the classification error over the training data
    - Maximize the classification margin
  - Support vectors: only the support vectors affect the location of the decision boundary

[Figure: points labeled +1 and −1, with the support vectors marked.]

Comparison of Classification Models
- Separable case
- Noisy case
- Quadratic programming!
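The optimization problems for the two cases are not preserved in the transcript. Assuming the textbook hard-margin and soft-margin formulations, with slack variables $\xi_i$ and trade-off constant $C$ as notation introduced here, they read:

```latex
% Separable case (hard margin)
\min_{w, b} \; \tfrac{1}{2}\lVert w \rVert^2
\quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 \quad \forall i

% Noisy case (soft margin)
\min_{w, b, \xi} \; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \quad \forall i
```

Both have a quadratic objective and linear constraints, which is what makes them quadratic programs.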

Comparison of Classification Models
- Similarity between the logistic regression model and the support vector machine
  - The log-likelihood can be viewed as a measurement of accuracy
  - The two objectives share identical terms
  - The logistic regression model is almost identical to the support vector machine except for a different expression for the classification errors
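The "different expression for classification errors" can be made concrete by writing both error terms as functions of the margin $y\,(w \cdot x + b)$. The sketch assumes the usual forms, the negative log-likelihood $\log(1 + e^{-\text{margin}})$ for logistic regression and the hinge loss $\max(0, 1 - \text{margin})$ for the SVM; the function names are illustrative.

```python
import numpy as np

def logistic_loss(margin):
    """Negative log-likelihood of one example: log(1 + exp(-margin))."""
    return np.log1p(np.exp(-margin))

def hinge_loss(margin):
    """SVM error term: zero once the margin reaches 1, linear below that."""
    return np.maximum(0.0, 1.0 - margin)

margins = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
for mgn, ll, hl in zip(margins, logistic_loss(margins), hinge_loss(margins)):
    print(f"margin={mgn:+.1f}  logistic={ll:.3f}  hinge={hl:.3f}")
```

Both losses penalize negative margins roughly linearly; they differ for well-classified points, where the hinge loss is exactly zero (so only support vectors matter) while the logistic loss stays small but positive.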

Comparison of Classification Models
- Generative models have trouble at the decision boundary

[Figure: the classification boundary that achieves the least training error compared with the classification boundary that achieves a large margin.]
