Maximum Entropy Discrimination. Tommi Jaakkola (MIT), Marina Meila (CMU), Tony Jebara (MIT)


1 Maximum Entropy Discrimination. Tommi Jaakkola (MIT), Marina Meila (CMU), Tony Jebara (MIT)

2 Classification
· inputs x, class y = +1, -1
· data D = { (x_1, y_1), …, (x_T, y_T) }
· learn f_opt(x), a discriminant function from the family of discriminants F = { f }
· classify y = sign f_opt(x)

3 Model averaging
· many f have near-optimal performance
· instead of choosing f_opt, average over all f in F, with Q(f) = weight of f:
  y(x) = sign ∫_F Q(f) f(x) df = sign ⟨ f(x) ⟩_Q
· to specify: F = { f }, a family of discriminant functions
· to learn: Q(f), a distribution over F
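A minimal numerical sketch of the averaging rule above, assuming Q is represented by a finite weighted sample of linear discriminants (the sample, the weights, and the toy input are illustrative, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

# A finite stand-in for F: 100 linear discriminants f_i(x) = w_i . x
W = rng.normal(size=(100, 2))
q = np.full(100, 1.0 / 100)        # Q(f): uniform weights over the sample

x = np.array([1.0, -0.5])          # a single input to classify
avg_f = q @ (W @ x)                # <f(x)>_Q as the weighted sample mean
y_hat = np.sign(avg_f)             # y(x) = sign <f(x)>_Q
print(y_hat)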

4 Goal of this work
· Define a discriminative criterion for averaging over models
Advantages
· can incorporate priors
· can use generative models
· computationally feasible
· generalizes to other discrimination tasks

5 Maximum Entropy Discrimination
· given a data set D = { (x_1, y_1), …, (x_T, y_T) }
· find Q_ME = argmax_Q H(Q)
  s.t. y_t ⟨ f(x_t) ⟩_Q ≥ γ for all t = 1, …, T   (C)
  for some margin γ > 0
· the solution Q_ME correctly classifies D
· among all admissible Q, Q_ME has maximum entropy
· maximum entropy = least specific about f

6 Solution: Q_ME as a projection
· convex problem: Q_ME is unique
· solution: Q_ME(f) ∝ exp{ Σ_{t=1}^{T} λ_t y_t f(x_t) }
· λ_t ≥ 0 are Lagrange multipliers
· finding Q_ME: start with λ = 0 and follow the gradient of the unsatisfied constraints
[figure: projection of the uniform distribution Q_0 (λ = 0) onto the admissible set, yielding Q_ME]
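A minimal sketch of this exponential-form solution for the simple case where F is a finite set of candidate discriminants, so Q_ME is a Gibbs re-weighting of the candidates (the candidates, the toy data, and the λ values are illustrative stand-ins, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

# Finite candidate set: 100 linear discriminants f_i(x) = w_i . x, plus toy labeled data
W = rng.normal(size=(100, 2))
X = rng.normal(size=(20, 2))
y = np.sign(X[:, 0])                   # toy labels
lam = rng.uniform(0.0, 0.5, size=20)   # stand-in Lagrange multipliers, lam_t >= 0

# Q_ME(f_i) ~ exp{ sum_t lam_t y_t f_i(x_t) }
scores = (W @ X.T) @ (lam * y)         # sum_t lam_t y_t f_i(x_t) for each candidate i
Q = np.exp(scores - scores.max())      # subtract the max for numerical stability
Q /= Q.sum()

# Inverse participation ratio: roughly how many candidates carry the weight
print("effective number of candidates:", 1.0 / np.sum(Q ** 2))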

7 Finding the solution
· need λ_t, t = 1, …, T
· obtained by solving the dual problem
  max_λ J(λ) = max_λ [ −log Z_+ − log Z_− − γ Σ_t λ_t ]
  s.t. λ_t ≥ 0 for t = 1, …, T
Algorithm
· start with λ_t = 0 (uniform distribution)
· iterative ascent on J(λ) until convergence
· derivative ∂J/∂λ_t = y_t ⟨ log P_+(x_t) / P_−(x_t) ⟩_Q(P) − γ
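A minimal sketch of the iterative ascent described on this slide: start at λ = 0 and follow the gradient of J, keeping every λ_t ≥ 0. The gradient passed in here is a concave toy stand-in, since the true ∂J/∂λ_t depends on the chosen family of discriminants (see the SVM case on slide 13):

import numpy as np

def ascend_dual(grad_J, T, lr=0.01, n_iter=5000, tol=1e-8):
    """Projected gradient ascent on J(lam) subject to lam_t >= 0."""
    lam = np.zeros(T)                      # start at lambda = 0 (uniform distribution)
    for _ in range(n_iter):
        step = lr * grad_J(lam)
        lam = np.maximum(lam + step, 0.0)  # ascend, then clip to the feasible set
        if np.linalg.norm(step) < tol:     # converged: updates are negligible
            break
    return lam

# Toy stand-in: J(lam) = sum(lam) - 0.5 * ||lam||^2, so grad J = 1 - lam
print(ascend_dual(lambda lam: 1.0 - lam, T=5))   # converges to lam_t = 1 for all t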

8 Q_ME as a sparse solution
· classification rule y(x) = sign ⟨ f(x) ⟩_Q_ME
· γ is the classification margin
· λ_t > 0 only for y_t ⟨ f(x_t) ⟩_Q = γ, i.e. x_t lies on the margin (a support vector!)

9 Q_ME as regularization
· uniform distribution Q_0 (λ = 0)
· "smoothness" of Q = H(Q)
· Q_ME is the smoothest admissible distribution
[figure: Q(f) as a function of f, comparing the uniform Q_0, the point solution f_opt, and Q_ME]

10 Goal of this work
· Define a discriminative criterion for averaging over models ✓
Extensions
· incorporate priors
· relationship to support vectors
· use generative models
· generalize to other discrimination tasks

11 Priors
· prior Q_0(f)
· Minimum Relative Entropy Discrimination:
  Q_MRE = argmin_Q KL(Q || Q_0)
  s.t. y_t ⟨ f(x_t) ⟩_Q ≥ γ for all t = 1, …, T   (C)
· a prior on γ as well: learn Q_MRE(f, γ), i.e. soft margins
[figure: projection of the prior Q_0 onto the admissible set under KL(Q || Q_0), yielding Q_MRE]

12 Soft margins
· average also over the margin γ
· define Q_0(f, γ) = Q_0(f) Q_0(γ)
· constraints ⟨ y_t f(x_t) − γ_t ⟩_Q(f,γ) ≥ 0
· learn Q_MRE(f, γ) = Q_MRE(f) Q_MRE(γ)
· margin prior Q_0(γ) = c exp[ c(γ − 1) ]
[figure: the induced margin potential as a function of γ]
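A short check of where the soft-margin penalty comes from (a sketch assuming, as in the MED paper, that the margin prior above is supported on γ ≤ 1): integrating the margin out of the partition function gives, for λ < c,

Z_\gamma(\lambda) = \int_{-\infty}^{1} c\, e^{c(\gamma - 1)}\, e^{-\lambda\gamma}\, d\gamma
                  = c\, e^{-c} \left[ \frac{e^{(c-\lambda)\gamma}}{c - \lambda} \right]_{-\infty}^{1}
                  = \frac{c\, e^{-\lambda}}{c - \lambda},

so that

-\log Z_\gamma(\lambda) = \lambda + \log\!\left(1 - \frac{\lambda}{c}\right),

which is exactly the per-example potential λ_t + log(1 − λ_t/c) appearing in the theorem on the next slide.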

13 Examples: support vector machines
· Theorem. For f(x) = θ · x + b, Q_0(θ) = Normal(0, I), and Q_0(b) a non-informative prior, the Lagrange multipliers are obtained by maximizing J(λ) subject to λ_t ≥ 0 and Σ_t λ_t y_t = 0, where
  J(λ) = Σ_t [ λ_t + log(1 − λ_t / c) ] − 1/2 Σ_{t,s} λ_t λ_s y_t y_s x_t · x_s
· separable D: the SVM is recovered exactly
· inseparable D: the SVM is recovered with a different misclassification penalty
· adaptive-kernel SVMs, ...
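A minimal numerical sketch of the theorem's optimization (the toy data, the value of c, and the use of scipy's SLSQP solver are illustrative assumptions, not from the slides):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy 2-D data: 20 positive and 20 negative examples
X = rng.normal(size=(40, 2)) + np.outer(np.repeat([1.0, -1.0], 20), [2.0, 2.0])
y = np.repeat([1.0, -1.0], 20)
c = 5.0                                        # scale of the soft-margin prior
K = (y[:, None] * y[None, :]) * (X @ X.T)      # y_t y_s (x_t . x_s)

def neg_J(lam):
    # J(lam) = sum_t [lam_t + log(1 - lam_t/c)] - 1/2 sum_{t,s} lam_t lam_s y_t y_s x_t.x_s
    return -(np.sum(lam + np.log(1.0 - lam / c)) - 0.5 * lam @ K @ lam)

res = minimize(
    neg_J,
    x0=np.full(len(y), 1e-3),                  # feasible interior starting point
    method="SLSQP",
    bounds=[(0.0, c * (1 - 1e-9))] * len(y),   # 0 <= lam_t < c (the log term diverges at c)
    constraints=[{"type": "eq", "fun": lambda lam: lam @ y}],  # sum_t lam_t y_t = 0
)
lam = res.x
print("support points (lam_t > 0):", np.flatnonzero(lam > 1e-4))

In the separable case the recovered multipliers coincide with the standard SVM dual solution, as the slide states.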

14 SVM extensions
· Example: Leptograpsus crabs (5 inputs, T_train = 80, T_test = 120)
· f(x) = log [ P_+(x) / P_−(x) ] + b with P_+(x) = Normal(x; m_+, V_+): a quadratic classifier
· Q(V_+, V_−) = distribution over kernel widths
[figure: test performance of the MRE Gaussian vs. linear SVM vs. maximum-likelihood Gaussian]

15 Using generative models
· generative models P_+(x), P_−(x) for y = +1, −1
· f(x) = log [ P_+(x) / P_−(x) ] + b
· learn Q_MRE(P_+, P_−, b, γ)
· if the prior factors, Q_0(P_+, P_−, b, γ) = Q_0(P_+) Q_0(P_−) Q_0(b) Q_0(γ), then
  Q_MRE(P_+, P_−, b, γ) = Q_ME(P_+) Q_ME(P_−) Q_MRE(b) Q_MRE(γ)
  (a factored prior gives a factored posterior)
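A minimal plug-in sketch of the discriminant above with Gaussian class-conditional models fit by maximum likelihood (MED instead averages the discriminant over Q_MRE(P_+, P_−, b, γ); the toy data, the ML fit, and b = 0 here are illustrative assumptions):

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

# Toy data: two Gaussian classes in R^2
X_pos = rng.normal(loc=[1.0, 1.0], size=(50, 2))
X_neg = rng.normal(loc=[-1.0, -1.0], size=(50, 2))

# Class-conditional models P_+(x), P_-(x), fit by maximum likelihood
P_pos = multivariate_normal(mean=X_pos.mean(axis=0), cov=np.cov(X_pos, rowvar=False))
P_neg = multivariate_normal(mean=X_neg.mean(axis=0), cov=np.cov(X_neg, rowvar=False))

def f(x, b=0.0):
    # f(x) = log P_+(x) / P_-(x) + b
    return P_pos.logpdf(x) - P_neg.logpdf(x) + b

x_new = np.array([0.5, 0.2])
print("predicted class:", int(np.sign(f(x_new))))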

16 Examples: other distributions
· Multinomial (1 discrete variable) ✓
· Graphical model (fixed structure, no hidden variables) ✓
· Tree graphical model (Q over structures and parameters) ✓

17 Tree graphical models
· P(x | E, θ) = P_0(x) Π_{(u,v) ∈ E} P_uv(x_u, x_v | θ_uv)
· prior Q_0(P) = Q_0(E) Q_0(θ | E)
· Q_0(E) ∝ Π_{(u,v) ∈ E} β_uv
· Q_0(θ | E) = conjugate prior
· Q_MRE(P) ∝ W_0 Π_{(u,v) ∈ E} W_uv and can be integrated analytically
· Q_0(P) is a conjugate prior over both E and θ
[figure: example tree edge sets E]

18 Trees: experiments
· splice junction classification task: 25 inputs, 400 training examples
· compared with maximum-likelihood trees
· results: ML error = 14%, MaxEnt error = 12.3%
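For context, a minimal sketch of the maximum-likelihood tree baseline mentioned above (presumably Chow-Liu: the maximum-weight spanning tree over pairwise mutual information); the random binary toy data and the use of scipy's spanning-tree routine are illustrative assumptions:

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mutual_information(a, b):
    """Empirical mutual information between two binary columns."""
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (np.mean(a == va) * np.mean(b == vb)))
    return mi

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(400, 5))   # 400 samples of 5 binary variables

n = X.shape[1]
MI = np.zeros((n, n))
for u in range(n):
    for v in range(u + 1, n):
        MI[u, v] = mutual_information(X[:, u], X[:, v])

# Maximum-weight spanning tree = minimum spanning tree on negated weights
tree = minimum_spanning_tree(-MI).toarray()
print("ML tree edges:", list(zip(*np.nonzero(tree))))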

19 Tree experiments (cont'd)
[figure: weights of the tree edges]

20 Discrimination tasks
· Classification
· Classification with partially labeled data
· Anomaly detection
[figure: scatter plots illustrating labeled (+/−), unlabeled, and anomalous points for each task]

21 Partially labeled data
· Problem: given a family F = { f } of discriminants and a data set D = { (x_1, y_1), …, (x_T, y_T), x_{T+1}, …, x_N }
· find Q(f, γ, y) = argmin_Q KL(Q || Q_0)
  s.t. ⟨ y_t f(x_t) − γ_t ⟩_Q ≥ 0 for all t = 1, …, N   (C)

22 Partially labeled data: experiment
· splice junction classification, 25 inputs, T_total = 1000
[figure: classification performance with complete data, with 10% labeled + 90% unlabeled, and with 10% labeled only]

23 Anomaly detection
· Problem: given a family P = { P } of generative models and a data set D = { x_1, …, x_T }
· find Q(P, γ) = argmin_Q KL(Q || Q_0)
  s.t. ⟨ log P(x_t) − γ_t ⟩_Q ≥ 0 for all t = 1, …, T   (C)

24 Anomaly detection: experiments
[figure: MaxEnt vs. maximum-likelihood models]

25 Anomaly detection: experiments (cont'd)
[figure: MaxEnt vs. maximum-likelihood models]

26 Conclusions
· New framework for classification
· Based on regularization in the space of distributions
· Enables the use of generative models
· Enables the use of priors
· Generalizes to other discrimination tasks

