# Middle Term Exam 03/04, in class. Project: it is team work; no more than 2 people per team; define a project of your own, otherwise I will assign one.


Middle Term Exam 03/04, in class

Project
- It is team work
- No more than 2 people per team
- Define a project of your own; otherwise, I will assign you to a “tough” project

Important dates
- 03/23: project proposal
- 04/27 and 04/29: presentations
- 05/02: final report

Project Proposal
- Introduction: describe the research problem
- Related work: describe the existing approaches and their deficiencies
- Proposed approach: describe your approach and its potential to overcome the shortcomings of existing approaches
- Plan: the plan for this project (code development, data sets, and evaluation)
- Format: it should look like a research paper. The required format (both Microsoft Word and LaTeX) can be downloaded from www.cse.msu.edu/~cse847/assignments/format.zip
- Warning: any submission that does not follow the format will receive a score of zero.

Project Report
- The same format as the proposal
- Expand the proposal with a detailed description of your algorithm and evaluation results

Presentation
- 25-minute presentation
- 5-minute discussion

Introduction to Information Theory (Rong Jin)

Information
- Information ≠ knowledge
- Information: a reduction in uncertainty
- Example: (1) flip a coin; (2) roll a die. The outcome of #2 is more uncertain than that of #1, so more information is provided by observing the outcome of #2 than of #1.

Definition of Information
- Let E be some event that occurs with probability P(E). If we are told that E has occurred, then we say we have received I(E) = log₂(1/P(E)) bits of information.
- Example: result of a fair coin flip (log₂ 2 = 1 bit); result of a fair die roll (log₂ 6 ≈ 2.585 bits)
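The two examples above can be checked numerically with a few lines of plain Python (the function name is illustrative):

```python
import math

def information_bits(p):
    """Self-information I(E) = log2(1/P(E)), in bits, for an event of probability p."""
    return math.log2(1 / p)

print(information_bits(1 / 2))  # fair coin flip: 1.0 bit
print(information_bits(1 / 6))  # fair die roll: ~2.585 bits
```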

Entropy
- A zero-memory information source S is a source that emits symbols from an alphabet {s₁, s₂, …, s_k} with probabilities {p₁, p₂, …, p_k}, respectively, where the symbols emitted are statistically independent.
- Entropy is the average amount of information in observing the output of S: H(S) = Σᵢ pᵢ log₂(1/pᵢ)

Entropy properties:
1. 0 ≤ H(P) ≤ log₂ k
2. Measures the uniformity of a distribution P: the further P is from uniform, the lower the entropy
3. For any other probability distribution {q₁, …, q_k}, Σᵢ pᵢ log₂(1/pᵢ) ≤ Σᵢ pᵢ log₂(1/qᵢ) (Gibbs' inequality)
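The upper bound and the uniformity property can be illustrated with a small sketch (the helper name is illustrative):

```python
import math

def entropy_bits(p):
    """H(P) = sum_i p_i * log2(1 / p_i); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

uniform = [1 / 4] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(entropy_bits(uniform))  # log2(4) = 2.0, the maximum for k = 4
print(entropy_bits(skewed))   # less than 2.0: further from uniform, lower entropy
```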

A Distance Measure Between Distributions
- Kullback-Leibler distance between distributions P and Q: D(P, Q) = Σᵢ pᵢ log₂(pᵢ/qᵢ)
- D(P, Q) ≥ 0, with equality iff P = Q
- The smaller D(P, Q) is, the more similar Q is to P
- Non-symmetric: D(P, Q) ≠ D(Q, P) in general
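Both the non-negativity and the asymmetry are easy to observe on a two-outcome example (a sketch; it assumes qᵢ > 0 wherever pᵢ > 0):

```python
import math

def kl_bits(p, q):
    """D(P, Q) = sum_i p_i * log2(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_bits(p, q))  # positive: Q differs from P
print(kl_bits(q, p))  # a different positive value: D is not symmetric
```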

Mutual Information
- Indicates the amount of information shared between two random variables: I(X;Y) = Σ_{x,y} p(x,y) log₂( p(x,y) / (p(x)p(y)) )
- Symmetric: I(X;Y) = I(Y;X)
- Zero iff X and Y are independent
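The two extremes (independence and full dependence) can be verified from a joint probability table (a sketch; the helper name is illustrative):

```python
import math

def mutual_information_bits(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) p(y)) ) from a joint table."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log2(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

independent = [[0.25, 0.25], [0.25, 0.25]]  # p(x, y) = p(x) p(y) everywhere
coupled = [[0.5, 0.0], [0.0, 0.5]]          # X completely determines Y
print(mutual_information_bits(independent))  # 0.0: nothing shared
print(mutual_information_bits(coupled))      # 1.0 bit: Y is known once X is
```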

Maximum Entropy (Rong Jin)

Motivation
- Consider a translation example: English ‘in’ → French {dans, en, à, au-cours-de, pendant}
- Goal: estimate p(dans), p(en), p(à), p(au-cours-de), p(pendant)
- Case 1: no prior knowledge about the translation
- Case 2: 30% of the time, either dans or en is used

Maximum Entropy Model: Motivation
- Case 3: 30% of the time dans or en is used, and 50% of the time dans or à is used
- We need a measure of the uninformativeness of a distribution

Maximum Entropy Principle (MaxEnt)
- Among all distributions consistent with the constraints, choose the one with maximum entropy
- For Case 3: p(dans) = 0.2, p(à) = 0.3, p(en) = 0.1, p(au-cours-de) = 0.2, p(pendant) = 0.2
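A quick sanity check that the distribution above satisfies all of the Case 3 constraints (a sketch; the dictionary keys simply mirror the five French words):

```python
p = {"dans": 0.2, "à": 0.3, "en": 0.1, "au-cours-de": 0.2, "pendant": 0.2}

assert abs(sum(p.values()) - 1.0) < 1e-9      # probabilities sum to 1
assert abs(p["dans"] + p["en"] - 0.3) < 1e-9  # 30% of the time: dans or en
assert abs(p["dans"] + p["à"] - 0.5) < 1e-9   # 50% of the time: dans or à
print("all Case 3 constraints satisfied")
```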

MaxEnt for Classification
- Objective: learn the conditional distribution p(y|x)
- Constraint: appropriate normalization, Σ_y p(y|x) = 1 for every x

MaxEnt for Classification
- Constraints: be consistent with the data, via feature functions f(x, y)
- Empirical mean of the feature functions: (1/N) Σᵢ f(xᵢ, yᵢ)
- Model mean of the feature functions: (1/N) Σᵢ Σ_y p(y|xᵢ) f(xᵢ, y)
- The model mean is constrained to equal the empirical mean

MaxEnt for Classification
- No parametric assumption about the form of p(y|x) (non-parametric)
- Only the empirical means of the feature functions are needed

MaxEnt for Classification Feature function

Example of Feature Functions

| x | f₁(x) = I(x ∈ {dans, en}) | f₂(x) = I(x ∈ {dans, à}) |
|---|---|---|
| dans | 1 | 1 |
| en | 1 | 0 |
| au-cours-de | 0 | 0 |
| à | 0 | 1 |
| pendant | 0 | 0 |
| Empirical average | 0.3 | 0.5 |
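The indicator features and their averages under the Case 3 distribution can be reproduced in a few lines (a sketch; names are illustrative):

```python
def f1(x):
    return 1 if x in {"dans", "en"} else 0  # I(x ∈ {dans, en})

def f2(x):
    return 1 if x in {"dans", "à"} else 0   # I(x ∈ {dans, à})

p = {"dans": 0.2, "en": 0.1, "au-cours-de": 0.2, "à": 0.3, "pendant": 0.2}

avg_f1 = sum(px * f1(x) for x, px in p.items())
avg_f2 = sum(px * f2(x) for x, px in p.items())
print(round(avg_f1, 6), round(avg_f2, 6))  # 0.3 0.5, matching the table's last row
```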

Solution to MaxEnt
- The solution is identical to a conditional exponential model: p(y|x) ∝ exp( Σⱼ wⱼ fⱼ(x, y) )
- Solve for the weights W by maximum likelihood estimation
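A minimal sketch of the conditional exponential (softmax) form; the weights and feature values below are illustrative placeholders, not values from the lecture:

```python
import math

def p_y_given_x(feature_vec, weights):
    """p(y|x) ∝ exp( w_y · f(x) ); weights maps class y -> weight vector."""
    scores = {y: math.exp(sum(wj * fj for wj, fj in zip(w_y, feature_vec)))
              for y, w_y in weights.items()}
    z = sum(scores.values())  # normalization constant (partition function)
    return {y: s / z for y, s in scores.items()}

x_features = [1.0, 0.0]
w = {"pos": [2.0, 0.0], "neg": [0.5, 1.0]}
probs = p_y_given_x(x_features, w)
print(probs)  # sums to 1; "pos" puts the larger weight on the active feature
```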

Iterative Scaling (IS) Algorithm
- Assumption: the feature functions sum to a constant, Σⱼ fⱼ(x, y) = C for all (x, y)

Iterative Scaling (IS) Algorithm
1. Compute the empirical mean of every feature for every class
2. Initialize W
3. Repeat:
   - Compute p(y|x) for each training example (xᵢ, yᵢ) using W
   - Compute the model mean of every feature for every class
   - Update W
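The loop above can be sketched as a tiny Generalized Iterative Scaling routine. Everything here — the toy data, the indicator features, and the slack feature that keeps Σⱼ fⱼ(x, y) = C — is illustrative, not from the lecture; the update wⱼ += (1/C) log(empiricalⱼ / modelⱼ) is the standard GIS step:

```python
import math

def features(x, y):
    """Indicator features f_j(x, y), padded with a slack feature so they sum to C = 2."""
    f = [1.0 if (x == "a" and y == 0) else 0.0,
         1.0 if (x == "b" and y == 1) else 0.0]
    f.append(2.0 - sum(f))  # slack feature keeps sum_j f_j(x, y) constant
    return f

data = [("a", 0), ("a", 0), ("b", 1)]
labels, C, J = (0, 1), 2.0, 3
w = [0.0] * J  # step 2: initialize W

def p_y_given_x(x):
    scores = [math.exp(sum(wj * fj for wj, fj in zip(w, features(x, y))))
              for y in labels]
    z = sum(scores)
    return [s / z for s in scores]

n = len(data)
# Step 1: empirical mean of every feature over the training data
empirical = [sum(features(x, y)[j] for x, y in data) / n for j in range(J)]
# Step 3: repeat (model means under current W, then update W)
for _ in range(200):
    model = [sum(p_y_given_x(x)[y] * features(x, y)[j]
                 for x, _ in data for y in labels) / n for j in range(J)]
    for j in range(J):
        w[j] += math.log(empirical[j] / model[j]) / C
print(p_y_given_x("a")[0])  # approaches 1.0, the empirical frequency of y=0 given "a"
```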

Iterative Scaling (IS) Algorithm
- Each iteration is guaranteed not to decrease the likelihood function

Iterative Scaling (IS) Algorithm
- What about features that can take both positive and negative values?
- What if the sum of the features is not a constant?


