Project
Now it is time to think about the project.
- It is team work: each team will consist of 2 people.
- It is better to propose a project of your own; otherwise, I will assign you to some "difficult" project.
Important dates:
- 03/11: project proposal due
- 04/01: project progress report due
- 04/22 and 04/24: final presentations
- 05/03: final report due
Project Proposal
What do I expect?
- Introduction: describe the research problem that you are trying to solve.
- Related work: describe the existing approaches and their deficiencies.
- Proposed approach: describe your approach and why it has the potential to alleviate the deficiencies of the existing approaches.
- Plan: what do you plan to do in this project?
Format: it should look like a research paper. The required format (both Microsoft Word and LaTeX) can be downloaded from www.cse.msu.edu/~cse847/assignments/format.zip
Project Progress Report
- Introduction: overview of the problem that you are trying to solve and the solutions that you presented in the proposal.
- Progress: a more detailed algorithm description, related data collection and cleanup, and preliminary results.
The format should be the same as that of the project report.
Project Final Report
It should read like a research paper that is ready for submission to a research conference.
What do I expect?
- Introduction
- Algorithm description and discussion
- Empirical studies: I expect careful analysis of the results, whether the approach is a success or a complete failure.
Presentation: 25-minute presentation, 5-minute discussion.
Exponential Model and Maximum Entropy Model Rong Jin
Recap: Logistic Regression Model
Assume the inputs and outputs are related through a log-linear (logistic) function: p(y|x) = 1 / (1 + exp(-y(w·x + c))) for y ∈ {+1, -1}.
Estimate the weights by the maximum likelihood (MLE) approach.
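As a sketch of what the MLE approach on the slide looks like in practice, here is gradient ascent on the log-likelihood of the logistic model; the toy data, learning rate, and iteration count are illustrative choices, not from the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy linearly separable data with labels y in {+1, -1} (illustrative only)
data = [([2.0, 1.0], +1), ([1.0, 2.0], +1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]

def log_likelihood(w, data):
    # sum_i log p(y_i | x_i) = sum_i log sigmoid(y_i * w.x_i)
    return sum(math.log(sigmoid(y * sum(wj * xj for wj, xj in zip(w, x))))
               for x, y in data)

w = [0.0, 0.0]
lr = 0.1
for _ in range(500):
    grad = [0.0, 0.0]
    for x, y in data:
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        coef = y * sigmoid(-margin)   # derivative of log sigmoid(y * w.x)
        for j in range(2):
            grad[j] += coef * x[j]
    w = [wj + lr * gj for wj, gj in zip(w, grad)]

preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1 for x, _ in data]
```

Each step moves w in the direction of the log-likelihood gradient, so the fit improves monotonically for a small enough step size.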
How to Extend the Logistic Regression Model to Multiple Classes?
From y ∈ {+1, -1} to y ∈ {1, 2, …, C}?
Conditional Exponential Model
- Introduce a different set of parameters for each class.
- Ensure that the probabilities sum to 1.
Conditional Exponential Model
Prediction probability: p(y|x) = exp(w_y·x + c_y) / Σ_{y'} exp(w_{y'}·x + c_{y'})
Model parameters: for each class y, a weight vector w_y and a threshold c_y.
Maximum likelihood estimation. Any problems?
Conditional Exponential Model
If we add the same constant vector to every weight vector, we obtain the same log-likelihood function: the optimal solution is not unique!
How to resolve this problem? Solution: set w_1 to the zero vector and c_1 to zero.
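The non-identifiability on this slide is easy to check numerically: shifting every weight vector by the same constant vector (and every threshold by the same constant) leaves the predicted probabilities unchanged. The parameter values below are arbitrary illustrations.

```python
import math

def softmax_probs(weights, biases, x):
    # p(y|x) proportional to exp(w_y . x + c_y)
    scores = [sum(wj * xj for wj, xj in zip(w, x)) + c
              for w, c in zip(weights, biases)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # subtract max for stability
    z = sum(exps)
    return [e / z for e in exps]

weights = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.7]]
biases = [0.1, -0.4, 0.3]
x = [2.0, 1.0]

p1 = softmax_probs(weights, biases, x)

# Shift every weight vector by the same vector u and every bias by the same b
u, b = [10.0, -3.0], 5.0
shifted_w = [[wj + uj for wj, uj in zip(w, u)] for w in weights]
shifted_c = [c + b for c in biases]
p2 = softmax_probs(shifted_w, shifted_c, x)
# p1 and p2 are identical: the parameters are not identifiable.
```

Fixing w_1 = 0 and c_1 = 0, as the slide suggests, removes exactly this degree of freedom.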
Modified Conditional Exponential Model
Prediction probability: p(y|x) = exp(w_y·x + c_y) / (1 + Σ_{y'>1} exp(w_{y'}·x + c_{y'})), with w_1 = 0 and c_1 = 0.
Model parameters: for each class y > 1, a weight vector w_y and a threshold c_y.
Maximum likelihood estimation.
Maximum Entropy Model: Motivation
Consider a translation example: English 'in' → French {dans, en, à, au-cours-de, pendant}.
Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant).
Case 1: no prior knowledge about the translation. What is your guess of the probabilities?
p(dans) = p(en) = p(à) = p(au-cours-de) = p(pendant) = 1/5
Case 2: 30% of the time, either dans or en is used. What is your guess now?
p(dans) = p(en) = 3/20, and p(à) = p(au-cours-de) = p(pendant) = 7/30
The uniform distribution is favored.
Maximum Entropy Model: Motivation
Case 3: 30% of the time, dans or en is used, and 50% of the time, dans or à is used. What is your guess of the probabilities?
A good probability distribution should:
- satisfy the constraints;
- be as close to the uniform distribution as possible. But how do we measure closeness?
Measure uniformity using the Kullback-Leibler distance!
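The two closeness measures mentioned here (entropy on the next slide, KL distance on this one) agree: the KL distance to the uniform distribution is log n minus the entropy, so minimizing one maximizes the other. A quick numerical check of that identity, using an arbitrary example distribution:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    # Kullback-Leibler distance KL(p || q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.2, 0.1, 0.3, 0.2, 0.2]   # an arbitrary distribution over 5 outcomes
n = len(p)
uniform = [1.0 / n] * n

# Identity: H(p) = log(n) - KL(p || uniform)
lhs = entropy(p)
rhs = math.log(n) - kl(p, uniform)
```

So "close to uniform in KL distance" and "high entropy" are the same criterion over a fixed outcome set.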
Maximum Entropy Principle (MaxEnt)
The uniformity of a distribution is measured by its entropy, H(p) = -Σ_x p(x) log p(x): among all distributions satisfying the constraints, choose the one with maximum entropy.
Solution: p(dans) = 0.2, p(à) = 0.3, p(en) = 0.1, p(au-cours-de) = 0.2, p(pendant) = 0.2
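A sketch of the principle in action: the distribution quoted on the slide satisfies both Case 3 constraints, and it has higher entropy than another feasible (but less uniform) distribution. The alternative distribution below is an illustrative choice, not from the slides.

```python
import math

def entropy(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def satisfies_constraints(p, tol=1e-9):
    # Case 3: p(dans)+p(en) = 0.3, p(dans)+p(a) = 0.5, probabilities sum to 1
    return (abs(p["dans"] + p["en"] - 0.3) < tol
            and abs(p["dans"] + p["a"] - 0.5) < tol
            and abs(sum(p.values()) - 1.0) < tol)

# The distribution quoted on the slide:
slide = {"dans": 0.2, "a": 0.3, "en": 0.1, "au-cours-de": 0.2, "pendant": 0.2}
# Another feasible distribution, further from uniform (illustrative):
other = {"dans": 0.3, "a": 0.2, "en": 0.0, "au-cours-de": 0.25, "pendant": 0.25}
```

MaxEnt prefers the first: among all distributions meeting the constraints, it picks the most uniform one.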
MaxEnt for Classification Problems
We want p(y|x) to be close to a uniform distribution: maximize the conditional entropy on the training data.
Constraints:
- p(y|x) must be a valid probability distribution;
- the model should be consistent with the training data: for each class, the model mean of x equals the empirical mean of x.
MaxEnt for Classification Problems
We only require the mean to be consistent between the empirical data and the model. No assumption is made about the parametric form of the likelihood; we only assume it is C²-continuous.
MaxEnt Model
Consistency with the data is ensured by the equality constraints: for each feature, the empirical mean equals the model mean.
Beyond the feature vector x: features need not be components of x; any function f(x, y) of the input and class can serve as a feature.
Translation Problem
Parameters: p(dans), p(en), p(au-cours-de), p(à), p(pendant)
Represent each French word with two features:

Word               f1: in {dans, en}   f2: in {dans, à}
dans                      1                   1
en                        1                   0
au-cours-de               0                   0
à                         0                   1
pendant                   0                   0
Empirical average        0.3                 0.5
Constraints
p(dans) + p(en) = 0.3
p(dans) + p(à) = 0.5
p(dans) + p(en) + p(à) + p(au-cours-de) + p(pendant) = 1, with all probabilities nonnegative.
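The constraints are exactly moment-matching conditions: the model's expected feature values must equal the empirical averages from the table. A small check, using the slide's quoted solution as the candidate distribution:

```python
# Feature representation from the table: f1 = membership in {dans, en},
# f2 = membership in {dans, a}
words = ["dans", "en", "au-cours-de", "a", "pendant"]
f1 = {"dans": 1, "en": 1, "au-cours-de": 0, "a": 0, "pendant": 0}
f2 = {"dans": 1, "en": 0, "au-cours-de": 0, "a": 1, "pendant": 0}

def expected(p, f):
    # E_p[f] = sum over words of p(word) * f(word)
    return sum(p[w] * f[w] for w in words)

p = {"dans": 0.2, "a": 0.3, "en": 0.1, "au-cours-de": 0.2, "pendant": 0.2}
mean1 = expected(p, f1)   # should match the empirical average 0.3
mean2 = expected(p, f2)   # should match 0.5
```

Writing the constraints as E_p[f_j] = empirical average of f_j is the form that generalizes to arbitrary feature functions.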
Solution to MaxEnt
Surprisingly, the solution is exactly the conditional exponential model without threshold terms: p(y|x) ∝ exp(w_y·x). Why?
Solution to MaxEnt
Introducing a Lagrange multiplier for each moment constraint and maximizing the entropy shows that the optimal distribution must take the exponential form p(y|x) = exp(w_y·x) / Σ_{y'} exp(w_{y'}·x), where the multipliers play the role of the weights.
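A numerical sketch of this solution for the translation problem: parameterize p ∝ exp(λ1·f1 + λ2·f2) and fit λ by gradient ascent on the concave dual objective, whose gradient is (empirical mean - model mean). The learning rate and iteration count are illustrative; the guarantee is that the fitted distribution matches the moment constraints (its individual probabilities may differ slightly from the rounded values quoted earlier).

```python
import math

words = ["dans", "en", "au-cours-de", "a", "pendant"]
feats = {"dans": (1, 1), "en": (1, 0), "au-cours-de": (0, 0),
         "a": (0, 1), "pendant": (0, 0)}
target = (0.3, 0.5)   # empirical feature averages

def model_dist(lam):
    # Exponential-form distribution p(word) proportional to exp(lam . f(word))
    scores = {w: math.exp(lam[0] * feats[w][0] + lam[1] * feats[w][1])
              for w in words}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}

lam = [0.0, 0.0]
lr = 0.1
for _ in range(20000):
    p = model_dist(lam)
    model_mean = [sum(p[w] * feats[w][k] for w in words) for k in (0, 1)]
    # Dual gradient: empirical mean minus model mean
    lam = [lam[k] + lr * (target[k] - model_mean[k]) for k in (0, 1)]

p = model_dist(lam)
```

At convergence the gradient vanishes, i.e., the exponential-form distribution exactly satisfies the two constraints.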
Maximum Entropy Model versus Conditional Exponential Model
The two are dual problems: maximizing entropy subject to the moment constraints (the primal) is the Lagrangian dual of maximizing the likelihood of the conditional exponential model.
Maximum Entropy Model vs. Conditional Exponential Model
The dual matches the conditional exponential model. However, where is the threshold term c? (A threshold appears only if a constant feature, one that always equals 1, is included; its weight plays the role of c_y.)
Solving the Maximum Entropy Model
Iterative scaling algorithm. Assume every feature is nonnegative and that the features of each example sum to a constant C.
Solving the Maximum Entropy Model
- Compute the empirical mean of each feature for every class, i.e., for every j and every class y.
- Start with w_1 = w_2 = … = w_C = 0.
- Repeat:
  - Compute p(y|x_i) for each training data point (x_i, y_i) using w from the previous iteration.
  - Compute the model mean of each feature for every class using the estimated probabilities, i.e., for every j and every y.
  - Compute the update δ_{j,y} = (1/C) log(empirical mean / model mean) for every j and every y.
  - Update w_{y,j} ← w_{y,j} + δ_{j,y}.
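A runnable sketch of this iterative scaling loop (generalized iterative scaling) on toy data; the dataset is an illustrative choice constructed so that every example's features are nonnegative and sum to C = 1, matching the algorithm's assumption.

```python
import math

# Toy multiclass data: each example's features are nonnegative and sum to 1
data = [([0.9, 0.1], 0), ([0.8, 0.2], 0), ([0.2, 0.8], 1), ([0.1, 0.9], 1)]
n_class, n_feat, N = 2, 2, len(data)

def probs(w, x):
    scores = [sum(w[c][j] * x[j] for j in range(n_feat)) for c in range(n_class)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def log_likelihood(w):
    return sum(math.log(probs(w, x)[y]) for x, y in data)

# Empirical feature means for every (class, feature) pair
emp = [[sum(x[j] for x, y in data if y == c) / N for j in range(n_feat)]
       for c in range(n_class)]

w = [[0.0] * n_feat for _ in range(n_class)]
ll_history = [log_likelihood(w)]
for _ in range(200):
    # Model feature means under the current weights
    model = [[sum(probs(w, x)[c] * x[j] for x, _ in data) / N
              for j in range(n_feat)] for c in range(n_class)]
    # Scaling update with C = 1: w <- w + log(empirical / model)
    for c in range(n_class):
        for j in range(n_feat):
            w[c][j] += math.log(emp[c][j] / model[c][j])
    ll_history.append(log_likelihood(w))
```

Tracking `ll_history` also illustrates the next slide's claim: the likelihood never decreases across iterations.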
Solving Maximum Entropy Model The likelihood function always increases !
Solving the Maximum Entropy Model
- What if a feature can take both positive and negative values?
- What if the sum of the features is not constant across examples?
- How do we apply this approach to the conditional exponential model with a bias (threshold) term?
Improved Iterative Scaling
It only requires the input features to be nonnegative; their sum no longer needs to be constant across examples.
- Compute the empirical mean of each feature for every class, i.e., for every j and every class y.
- Start with w_1 = w_2 = … = w_C = 0.
- Repeat:
  - Compute p(y|x_i) for each training data point (x_i, y_i) using w from the previous iteration.
  - For every j and every y, solve for δ_{j,y} in Σ_i p(y|x_i) x_{i,j} exp(δ_{j,y} f#(x_i)) = Σ_{i: y_i = y} x_{i,j}, where f#(x) = Σ_j x_j.
  - Update w_{y,j} ← w_{y,j} + δ_{j,y}.
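A sketch of improved iterative scaling on toy data whose feature sums vary across examples (the case plain iterative scaling cannot handle). The one-dimensional equation for each δ is monotone in δ for nonnegative features, so bisection suffices; data, bracket width, and iteration counts are illustrative.

```python
import math

# Toy data: nonnegative features whose per-example sums VARY
data = [([2.0, 0.5], 0), ([1.5, 0.2], 0), ([0.3, 1.8], 1), ([0.1, 2.5], 1)]
n_class, n_feat, N = 2, 2, len(data)
fsharp = [sum(x) for x, _ in data]   # f#(x_i): total feature mass of example i

def probs(w, x):
    scores = [sum(w[c][j] * x[j] for j in range(n_feat)) for c in range(n_class)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def log_likelihood(w):
    return sum(math.log(probs(w, x)[y]) for x, y in data)

# Empirical feature totals for every (class, feature) pair
emp = [[sum(x[j] for x, y in data if y == c) for j in range(n_feat)]
       for c in range(n_class)]

def solve_delta(p_all, c, j):
    # Bisection on the monotone equation
    #   sum_i p(c|x_i) x_ij exp(delta * f#(x_i)) = empirical total
    lo, hi = -50.0, 50.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        g = sum(p_all[i][c] * data[i][0][j] * math.exp(mid * fsharp[i])
                for i in range(N))
        if g < emp[c][j]:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

w = [[0.0] * n_feat for _ in range(n_class)]
ll_history = [log_likelihood(w)]
for _ in range(50):
    p_all = [probs(w, x) for x, _ in data]
    deltas = [[solve_delta(p_all, c, j) for j in range(n_feat)]
              for c in range(n_class)]
    for c in range(n_class):
        for j in range(n_feat):
            w[c][j] += deltas[c][j]
    ll_history.append(log_likelihood(w))
```

As with plain iterative scaling, each round of simultaneous updates is guaranteed not to decrease the likelihood.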
Choice of Features
A feature does not have to be one of the raw inputs. For the maximum entropy model, bounded features are preferable; very often, binary features are used.
Feature selection: features with small weights are eliminated.
Feature Selection vs. Regularizers
A regularizer that produces a sparse solution performs automatic feature selection.
But the L2 regularizer rarely drives feature weights exactly to zero, so it is not appropriate for feature selection.
For the purpose of feature selection, the L1 norm is usually used instead.
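The contrast on this slide can be seen in the simplest one-dimensional case, where both regularized problems have closed-form minimizers: L1 gives the soft-threshold rule (exact zeros for small signals), while L2 only shrinks. The numbers below are illustrative.

```python
def l1_argmin(a, lam):
    # argmin_w 0.5*(w - a)**2 + lam*|w|   (soft-thresholding)
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

def l2_argmin(a, lam):
    # argmin_w 0.5*(w - a)**2 + lam*w**2  (pure shrinkage)
    return a / (1.0 + 2.0 * lam)

# A feature with a small "signal" a: L1 zeroes it out, L2 only shrinks it
small = 0.3
w_l1 = l1_argmin(small, 0.5)
w_l2 = l2_argmin(small, 0.5)
```

This is why the L1 penalty eliminates weak features automatically while the L2 penalty leaves them small but nonzero.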
Solving the L1 Regularized Conditional Exponential Model
Solving the L1 regularized conditional exponential model directly is rather difficult, because the absolute value function is not differentiable at zero. Any suggestion to alleviate this problem?
Solving the L1 Regularized Conditional Exponential Model
Solution: slack variables. Write each weight as w_j = u_j - v_j with u_j, v_j ≥ 0 and replace |w_j| by u_j + v_j; the objective becomes smooth, subject only to simple nonnegativity constraints.
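A minimal sketch of the slack-variable trick on a one-dimensional toy objective f(w) = 0.5*(w - 0.3)² plus lam*|w| (the objective, step size, and iteration count are illustrative, not the conditional exponential model itself). With w = u - v and |w| replaced by u + v, the problem is smooth and can be solved by projected gradient descent onto u, v ≥ 0.

```python
lam = 0.1
u, v = 0.0, 0.0
lr = 0.1
for _ in range(2000):
    w = u - v
    grad_f = w - 0.3                        # f'(w) for f(w) = 0.5*(w - 0.3)**2
    # Gradients of f(u - v) + lam*(u + v), projected back onto u, v >= 0
    u = max(0.0, u - lr * (grad_f + lam))
    v = max(0.0, v - lr * (-grad_f + lam))
w = u - v
# Matches the soft-threshold solution max(0, 0.3 - 0.1) = 0.2
```

The same reformulation applies weight-by-weight in the L1 regularized conditional exponential model, turning the nonsmooth penalty into a smooth objective with box constraints.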