
1 Project
- Now it is time to think about the project.
- It is team work: each team will consist of 2 people.
- It is better to propose a project of your own; otherwise, I will assign you to some “difficult” project.
- Important dates:
  - 03/11: project proposal due
  - 04/01: project progress report due
  - 04/22 and 04/24: final presentations
  - 05/03: final report due

2 Project Proposal
- What do I expect?
  - Introduction: describe the research problem that you are trying to solve.
  - Related work: describe the existing approaches and their deficiencies.
  - Proposed approach: describe your approach and why it has the potential to alleviate the deficiencies of the existing approaches.
  - Plan: what do you plan to do in this project?
- Format
  - It should look like a research paper.
  - The required format (both Microsoft Word and LaTeX) can be downloaded from www.cse.msu.edu/~cse847/assignments/format.zip

3 Project Progress Report
- Introduction: overview of the problem that you are trying to solve and the solution that you presented in the proposal.
- Progress
  - Algorithm description in more detail
  - Related data collection and cleanup
  - Preliminary results
- The format should be the same as for the project report.

4 Project Final Report
- It should look like a research paper that is ready for submission to a research conference.
- What do I expect?
  - Introduction
  - Algorithm description and discussion
  - Empirical studies
- I expect a careful analysis of the results, whether the approach is a success or a complete failure.
- Presentation
  - 25-minute presentation
  - 5-minute discussion

5 Exponential Model and Maximum Entropy Model
Rong Jin

6 Recap: Logistic Regression Model
- Assume the inputs and outputs are related through a log-linear function.
- Estimate the weights: MLE approach
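
The equations on this slide were lost in extraction; a standard reconstruction of the binary logistic regression model, using the weight/threshold notation (w, c) that the later slides use, is:

```latex
p(y \mid \mathbf{x}) = \frac{1}{1 + \exp\left(-y\,(\mathbf{w}^\top \mathbf{x} + c)\right)},
\qquad y \in \{+1, -1\},
\qquad
\max_{\mathbf{w},\, c} \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i)
```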

7 How to Extend the Logistic Regression Model to Multiple Classes?
- y ∈ {+1, -1} → y ∈ {1, 2, …, C}?

8 Conditional Exponential Model
- Introduce a different set of parameters for each class.
- Ensure that the probabilities sum to 1.

9 Conditional Exponential Model
- Prediction probability
- Model parameters: for each class y, we have a weight vector w_y and a threshold c_y.
- Maximum likelihood estimation
- Any problems?
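
The prediction probability and likelihood here were also equations in the original deck; a standard reconstruction consistent with the stated parameters (w_y and c_y for each class y) is:

```latex
p(y \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_y^\top \mathbf{x} + c_y)}{\sum_{y'=1}^{C} \exp(\mathbf{w}_{y'}^\top \mathbf{x} + c_{y'})},
\qquad
\max_{\{\mathbf{w}_y,\, c_y\}} \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i)
```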

10 Conditional Exponential Model
- If we add the same constant vector to every weight vector, we get the same log-likelihood function: the optimal solution is not unique!
- How to resolve this problem?
  - Solution: set w_1 to the zero vector and c_1 to zero.
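
A one-line check of the invariance claimed above (my notation): shifting every weight vector by the same v and every threshold by the same b cancels in the normalization, so the likelihood cannot identify the parameters:

```latex
\frac{\exp\left((\mathbf{w}_y + \mathbf{v})^\top \mathbf{x} + c_y + b\right)}{\sum_{y'} \exp\left((\mathbf{w}_{y'} + \mathbf{v})^\top \mathbf{x} + c_{y'} + b\right)}
= \frac{\exp(\mathbf{w}_y^\top \mathbf{x} + c_y)}{\sum_{y'} \exp(\mathbf{w}_{y'}^\top \mathbf{x} + c_{y'})}
= p(y \mid \mathbf{x})
```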

11 Modified Conditional Exponential Model
- Prediction probability
- Model parameters: for each class y > 1, we have a weight vector w_y and a threshold c_y.
- Maximum likelihood estimation

12 Maximum Entropy Model: Motivation
- Consider a translation example.
  - English ‘in’ → French {dans, en, à, au cours de, pendant}
  - Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant)
- Case 1: no prior knowledge about the translation.
  - What is your guess of the probabilities?

13 Maximum Entropy Model: Motivation
- Consider a translation example.
  - English ‘in’ → French {dans, en, à, au cours de, pendant}
  - Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant)
- Case 1: no prior knowledge about the translation.
  - What is your guess of the probabilities?
  - p(dans) = p(en) = p(à) = p(au-cours-de) = p(pendant) = 1/5
- Case 2: 30% of the time, either dans or en is used.

14 Maximum Entropy Model: Motivation
- Consider a translation example.
  - English ‘in’ → French {dans, en, à, au cours de, pendant}
  - Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant)
- Case 1: no prior knowledge about the translation.
  - What is your guess of the probabilities?
  - p(dans) = p(en) = p(à) = p(au-cours-de) = p(pendant) = 1/5
- Case 2: 30% of the time, either dans or en is used.
  - What is your guess of the probabilities?
  - p(dans) = p(en) = 3/20, p(à) = p(au-cours-de) = p(pendant) = 7/30
- The uniform distribution is favored.

15 Maximum Entropy Model: Motivation
- Case 3: 30% of the time either dans or en is used, and 50% of the time either dans or à is used.
  - What is your guess of the probabilities?

16 Maximum Entropy Model: Motivation
- Case 3: 30% of the time either dans or en is used, and 50% of the time either dans or à is used.
  - What is your guess of the probabilities?
- A good probability distribution should
  - satisfy the constraints, and
  - be close to the uniform distribution. But how do we measure that?
- Measure uniformity using the Kullback-Leibler distance!

17 Maximum Entropy Principle (MaxEnt)
- The uniformity of a distribution is measured by the entropy of the distribution.
- Solution: p(dans) = 0.2, p(à) = 0.3, p(en) = 0.1, p(au-cours-de) = 0.2, p(pendant) = 0.2
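
The entropy formula and the optimization problem behind this slide were lost in extraction; a standard reconstruction for the translation example is:

```latex
H(p) = -\sum_{x} p(x) \log p(x),
\qquad
\max_{p}\; H(p)
\;\;\text{s.t.}\;\;
p(\text{dans}) + p(\text{en}) = 0.3,\;\;
p(\text{dans}) + p(\text{à}) = 0.5,\;\;
\sum_{x} p(x) = 1
```

Solving this program exactly gives roughly (0.186, 0.114, 0.314, 0.193, 0.193) for (dans, en, à, au cours de, pendant); the slide's solution is this distribution rounded to one decimal. A numerical check is sketched after slide 23.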

18 MaxEnt for Classification Problems
- We want p(y|x) to be close to a uniform distribution.
  - Maximize the conditional entropy on the training data.
- Constraints
  - p(y|x) must be a valid probability distribution.
  - From the training data: the model should be consistent with the data.
    - For each class, the model mean of x = the empirical mean of x.

19 MaxEnt for Classification Problems
- We want p(y|x) to be close to a uniform distribution.
  - Maximize the conditional entropy on the training data.
- Constraints
  - p(y|x) must be a valid probability distribution.
  - From the training data: the model should be consistent with the data.
    - For each class, the model mean of x = the empirical mean of x.

20 MaxEnt for Classification Problems
- Require the means to be consistent between the empirical data and the model.
- No assumption is made about the parametric form of the likelihood.
  - We only assume it is C^2 continuous.

21 MaxEnt Model
- Consistency with the data is ensured by the equality constraints.
  - For each feature, the empirical mean equals the model mean.
- Beyond the feature vector x: features may depend on both the input and the class (see the constraint form below).
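
The constraint referred to above can be written with feature functions f_j(x, y) that depend on both the input and the class (standard MaxEnt notation, not taken verbatim from the deck):

```latex
\frac{1}{N} \sum_{i=1}^{N} f_j(\mathbf{x}_i, y_i)
\;=\;
\frac{1}{N} \sum_{i=1}^{N} \sum_{y} p(y \mid \mathbf{x}_i)\, f_j(\mathbf{x}_i, y)
\qquad \text{for every feature } j
```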

22 Translation Problem
- Parameters: p(dans), p(en), p(au-cours-de), p(à), p(pendant)
- Represent each French word with two features:

                      {dans, en}   {dans, à}
  dans                    1            1
  en                      1            0
  au-cours-de             0            0
  à                       0            1
  pendant                 0            0
  Empirical average       0.3          0.5

23 Constraints
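
The constraint equations on this slide did not survive extraction; from the feature table on slide 22 they must be:

```latex
p(\text{dans}) + p(\text{en}) = 0.3,
\qquad
p(\text{dans}) + p(\text{à}) = 0.5,
\qquad
\sum_{w} p(w) = 1,
\qquad
p \ge 0
```

A minimal numerical check of the MaxEnt solution under these constraints, using scipy (the function and variable names are mine, not from the deck):

```python
import numpy as np
from scipy.optimize import minimize

words = ["dans", "en", "a", "au-cours-de", "pendant"]

def neg_entropy(p):
    # Minimizing sum(p log p) maximizes the entropy H(p) = -sum(p log p).
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p)))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},      # valid distribution
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 0.3},  # p(dans) + p(en) = 0.3
    {"type": "eq", "fun": lambda p: p[0] + p[2] - 0.5},  # p(dans) + p(a) = 0.5
]

p0 = np.full(5, 0.2)  # start from the uniform distribution
res = minimize(neg_entropy, p0, bounds=[(0.0, 1.0)] * 5, constraints=constraints)
for word, prob in zip(words, res.x):
    print(f"p({word}) = {prob:.3f}")
# Prints roughly 0.186, 0.114, 0.314, 0.193, 0.193; the solution on slide 17
# (0.2, 0.1, 0.3, 0.2, 0.2) is this distribution rounded to one decimal.
```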

24 Solution to MaxEnt
- Surprisingly, the solution is exactly the conditional exponential model, without the threshold terms.
- Why?

25 Solution to MaxEnt
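
The derivation that filled this slide was lost; the standard Lagrangian sketch it refers to goes as follows (my notation). Attach a multiplier w_{y,j} to each moment constraint and mu_i to each normalization constraint:

```latex
\mathcal{L} = H(p)
+ \sum_{y,j} w_{y,j} \Big( \sum_{i} p(y \mid \mathbf{x}_i)\, x_{i,j} - \sum_{i} \delta(y_i = y)\, x_{i,j} \Big)
+ \sum_{i} \mu_i \Big( \sum_{y} p(y \mid \mathbf{x}_i) - 1 \Big)
```

Setting the derivative with respect to each p(y | x_i) to zero gives log p(y | x_i) = w_y^T x_i + const, and normalizing yields

```latex
p(y \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_y^\top \mathbf{x})}{\sum_{y'} \exp(\mathbf{w}_{y'}^\top \mathbf{x})}
```

which is exactly the conditional exponential model with no threshold term.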

26 Maximum Entropy Model versus Conditional Exponential Model
- Dual problem: the maximum entropy model and the conditional exponential model are duals of each other; maximizing entropy under the data constraints corresponds to maximizing the likelihood of the conditional exponential model.

27 Maximum Entropy Model vs. Conditional Exponential Model
- However, where is the threshold term c in the maximum entropy solution, compared with the conditional exponential model?
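
One standard answer, which the original slide may have presented differently: the threshold reappears as soon as a constant feature is added to x (my notation):

```latex
f_0(\mathbf{x}) \equiv 1
\;\;\Rightarrow\;\;
p(y \mid \mathbf{x}) \propto \exp(\mathbf{w}_y^\top \mathbf{x} + w_{y,0})
\quad \text{with } c_y = w_{y,0}
```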

28 Solving the Maximum Entropy Model
- Iterative scaling algorithm
- Assumption: every feature is non-negative, and the features of each example sum to the same constant C (slide 32 revisits what happens when this fails).

29 Solving the Maximum Entropy Model
- Compute the empirical mean of each feature for every class, i.e., for every j and every class y.
- Start with w_1, w_2, …, w_c = 0.
- Repeat:
  - Compute p(y|x) for each training data point (x_i, y_i) using the w from the previous iteration.
  - Compute the model mean of each feature for every class using the estimated probabilities, i.e., for every j and every y.
  - Compute the scaling update for every j and every y.
  - Update w accordingly.

30 Solving the Maximum Entropy Model
- Compute the empirical mean of each feature for every class, i.e., for every j and every class y.
- Start with w_1, w_2, …, w_c = 0.
- Repeat:
  - Compute p(y|x) for each training data point (x_i, y_i) using the w from the previous iteration.
  - Compute the model mean of each feature for every class using the estimated probabilities, i.e., for every j and every y.
  - Compute the scaling update for every j and every y.
  - Update w accordingly (a runnable sketch of this loop follows).
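
A minimal runnable sketch of this iterative scaling loop (generalized iterative scaling). All names are mine; it assumes, per slide 28, a non-negative feature matrix whose rows sum to the same constant C:

```python
import numpy as np

def iterative_scaling(X, y, num_classes, num_iters=100):
    """Generalized iterative scaling for the conditional exponential model.

    X: n x d matrix of non-negative features (rows sum to a constant C).
    y: length-n array of integer class labels in {0, ..., num_classes - 1}.
    """
    n, d = X.shape
    C = X.sum(axis=1).max()  # the constant feature sum assumed on slide 28

    # Empirical mean of each feature for every class
    emp = np.zeros((num_classes, d))
    for k in range(num_classes):
        emp[k] = X[y == k].sum(axis=0) / n

    W = np.zeros((num_classes, d))  # start with w_1, ..., w_c = 0
    for _ in range(num_iters):
        # p(y|x) for every training point under the current weights
        scores = X @ W.T
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)

        # Model mean of each feature for every class
        model = (P.T @ X) / n

        # Scaling update: w_{y,j} += (1/C) * log(empirical mean / model mean)
        W += np.log(np.maximum(emp, 1e-12) / np.maximum(model, 1e-12)) / C
    return W
```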

31 Solving the Maximum Entropy Model
- The likelihood function always increases!

32 Solving the Maximum Entropy Model
- What if a feature can take both positive and negative values?
- What if the sum of the features is not a constant?
- How do we apply this approach to the conditional exponential model with a bias (threshold) term?

33 Improved Iterative Scaling
- It only requires all the input features to be non-negative.
- Compute the empirical mean of each feature for every class, i.e., for every j and every class y.
- Start with w_1, w_2, …, w_c = 0.
- Repeat:
  - Compute p(y|x) for each training data point (x_i, y_i) using the w from the previous iteration.
  - Solve the update equation for every j and every y (reconstructed below).
  - Update w accordingly.
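
The update equation on this slide was lost in extraction; the standard Improved Iterative Scaling step it refers to solves, for each j and each class y, the one-dimensional equation below (my notation, with the feature sum f#(x) = sum_j x_j, which no longer needs to be constant):

```latex
\sum_{i=1}^{N} p(y \mid \mathbf{x}_i)\, x_{i,j}\, \exp\!\big(\delta_{y,j}\, f^{\#}(\mathbf{x}_i)\big)
\;=\;
\sum_{i=1}^{N} \delta(y_i = y)\, x_{i,j},
\qquad
w_{y,j} \leftarrow w_{y,j} + \delta_{y,j}
```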

34 Choice of Features
- A feature does not have to be one of the inputs.
- For the maximum entropy model, bounded features are preferable; very often, binary features are used.
- Feature selection: features with small weights are eliminated.

35 Feature Selection vs. Regularizers
- Regularizer → sparse solution → automatic feature selection.
- But an L2 regularizer rarely produces features with exactly zero weights → not appropriate for feature selection.
- For feature selection, the L1 norm is usually used.

36 Feature Selection vs. Regularizers
- Regularizer → sparse solution → automatic feature selection.
- But an L2 regularizer rarely produces features with exactly zero weights → not appropriate for feature selection.
- For feature selection, the L1 norm is usually used (the objective is written out below).
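
As a concrete reference, the L1-regularized conditional exponential objective these slides are discussing can be written as (my notation):

```latex
\min_{\{\mathbf{w}_y\}} \; -\sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i) \;+\; \lambda \sum_{y} \sum_{j} \lvert w_{y,j} \rvert
```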

37 Solving the L1-Regularized Conditional Exponential Model
- Solving the L1-regularized conditional exponential model directly is rather difficult, because the absolute value function is not differentiable at zero.
- Any suggestions for alleviating this problem?

38 Solving the L1-Regularized Conditional Exponential Model
- Slack variables (the reformulation is reconstructed below).
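
The reformulation on this slide was an equation in the original deck; the standard slack-variable trick for the L1 term replaces each |w_{y,j}| with a new variable t_{y,j} (my notation):

```latex
\min_{\{\mathbf{w}_y\},\, \{t_{y,j}\}} \; -\sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i) + \lambda \sum_{y,j} t_{y,j}
\qquad \text{s.t.} \qquad -t_{y,j} \le w_{y,j} \le t_{y,j}
```

The objective and constraints are now smooth in (w, t), so standard constrained optimizers apply; at the optimum, t_{y,j} = |w_{y,j}|.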

