Learning a Small Mixture of Trees
M. Pawan Kumar, Daphne Koller
http://ai.stanford.edu/~pawan   http://ai.stanford.edu/~koller

Aim: To efficiently learn a small mixture of trees that approximates an observed distribution.

Overview
- An intuitive objective function for learning a mixture of trees
- Formulate the problem using fractional covering
- Identify the drawbacks of fractional covering
- Make suitable modifications to the algorithm

Mixture of Trees (Meila and Jordan, 2000; MJ00)
Variables V = {v_1, v_2, ..., v_n}, with label x_a ∈ X_a for variable v_a and labeling x. A hidden variable z selects one of the trees t_1, t_2, t_3, ... (illustrated on the poster as different trees over v_1, v_2, v_3).

Pr(x | Θ^m) = ∑_{t ∈ T} α_t Pr(x | Θ^t)
Pr(x | Θ^t) = ∏_{(a,b) ∈ t} θ^t_ab(x_a, x_b) ∏_a θ^t_a(x_a)^(d_a − 1)

θ^t_ab(x_a, x_b): pairwise potentials; θ^t_a(x_a): unary potentials; d_a: degree of v_a.

Minimizing the KL Divergence
KL(Θ_1 || Θ_2) = ∑_x Pr(x | Θ_1) log [ Pr(x | Θ_1) / Pr(x | Θ_2) ]
Θ_1: observed distribution; Θ_2: simpler distribution.

EM algorithm (MJ00; relies heavily on initialization):
E-step: estimate Pr(x | Θ^t) for each x and t.
M-step: obtain structure and potentials (Chow-Liu).
Focuses on the dominant mode.

Rosset and Segal, 2002 (RS02):
Θ^m* = argmin_{Θ^m} max_i log [ p(x_i) / Pr(x_i | Θ^m) ]
     = argmax_{Θ^m} min_i [ Pr(x_i | Θ^m) / p(x_i) ]
MJ00 uses twice as many trees.

Minimizing α-Divergence (Renyi, 1961)
D_α(Θ_1 || Θ_2) = [1 / (α − 1)] log ∑_x Pr(x | Θ_1)^α Pr(x | Θ_2)^(1 − α)
D_1(Θ_1 || Θ_2) = KL(Θ_1 || Θ_2)
A generalization of the KL divergence. When fitting q to p, larger α is more inclusive (Minka, 2005; illustrated on the poster for α = 0.5, α = 1 and α = ∞). Use α = ∞.

Given a distribution p(.), find a mixture of trees by minimizing the α-divergence.

Problem Formulation
Choose from all possible trees T = {Θ^t_j} defined over the n random variables.
Matrix A with A(i,j) = Pr(x_i | Θ^t_j); vector b with b(i) = p(x_i); vector α ≥ 0 with ∑_j α_j = 1, i.e. α ∈ P.

max λ   s.t.   a_i α ≥ λ b_i,   α ∈ P

Constraints defined over infinitely many variables.

Fractional Covering (Plotkin et al., 1995)
min_α ∑_i exp(−λ a_i α / b_i)   s.t.   α ∈ P
Parameter λ, of the order of log(m). Width w = max_{α ∈ P} max_i a_i α / b_i. Initial solution α^0; define λ_0 = min_i a_i α^0 / b_i and step size σ = ε / (4 λ w).

Finding an ε-optimal solution: while λ < 2 λ_0, iterate
- define y_i = exp(−λ a_i α / b_i) / b_i
- find α' = argmax_α y^T A α
- update α = (1 − σ) α + σ α'
i.e. minimize the first-order approximation (a minimal sketch of this update follows the list of drawbacks below).

Drawbacks
(1) Slow convergence.
(2) Singleton trees (probability = 0 for unseen test examples).
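To make the exponential-potential update concrete, here is a minimal NumPy sketch of one covering phase, under the simplifying assumption that the matrix A (rows indexed by samples x_i, columns by candidate trees) is small enough to store explicitly; in the actual problem the columns range over all trees, so the argmax over columns stands in for an oracle that fits a new tree distribution. The function name fractional_covering and the iteration cap are illustrative, not the authors' implementation.

import numpy as np

def fractional_covering(A, b, lam, eps, iters=1000):
    """One phase of the exponential-potential update for
    min_alpha sum_i exp(-lam * a_i alpha / b_i) over the simplex P."""
    n = A.shape[1]
    alpha = np.full(n, 1.0 / n)          # initial solution alpha^0
    w = (A / b[:, None]).max()           # width w = max_{alpha in P} max_i a_i alpha / b_i
    sigma = eps / (4.0 * lam * w)        # step size sigma = eps / (4 lam w)
    for _ in range(iters):
        ratios = (A @ alpha) / b         # a_i alpha / b_i for every sample i
        y = np.exp(-lam * ratios) / b    # y_i = exp(-lam a_i alpha / b_i) / b_i
        j = np.argmax(y @ A)             # best column: argmax_{alpha'} y^T A alpha'
        step = np.zeros(n)
        step[j] = 1.0                    # vertex of the simplex
        alpha = (1.0 - sigma) * alpha + sigma * step   # alpha = (1 - sigma) alpha + sigma alpha'
    return alpha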
Modifying Fractional Covering
(1) Start with λ = 1/w and increase it by a factor of 2 only if necessary. This gives a large step size σ and large y_i, for numerical stability.
(2) Minimize over the new tree using a convex relaxation:
min ∑_i exp(−λ Pr(x_i | Θ^t) / p(x_i))   s.t.   Pr(x_i | Θ^t) ≥ 0,  ∑_i Pr(x_i | Θ^t) ≤ 1
where the constraint that Pr(. | Θ^t) factorizes over a tree t ∈ T is dropped.

Log-barrier approach: initialize the tolerance ε, parameter μ and factor f, then solve for the distribution Pr(. | Θ^t) by minimizing
f Φ − ∑_i log(Pr(x_i | Θ^t)) − ∑_i log(1 − Pr(x_i | Θ^t)),
where Φ denotes the relaxed objective above, and update f = μ f until m/f ≤ ε. Each barrier problem is solved with Newton's method: to minimize g(z), update z = z − (∇²g(z))^(−1) ∇g(z). The Hessian has uniform off-diagonal elements, so the matrix inversion can be done in linear time (a sketch of this step is given at the end of this transcript).

Project to a tree distribution using Chow-Liu (also sketched at the end). This may result in an increase in the objective. In that case, discard the best explained sample and recompute Θ^t: enforce Pr(x_i' | Θ^t) = 0 for i' = argmax_i Pr(x_i | Θ^t) / p(x_i), redistributing its mass as
Pr(x_i | Θ^t) ← Pr(x_i | Θ^t) + s_i Pr(x_i' | Θ^t),   s_i = p(x_i | Θ^t) / ∑_k p(x_k | Θ^t).
A computationally expensive operation? Use the previous solution as a warm start; only one log-barrier optimization is required.

Convergence Properties
Maximum number of increases for λ = O(log(log(m))). Maximum number of discarded samples = m − 1. Polynomial time per iteration, and polynomial-time convergence of the overall algorithm.

Results
Standard UCI datasets:

            MJ00          RS02          Our
Agaricus    99.98 (.04)   100 (0)
Nursery     99.2 (.02)    98.35 (0.3)   99.28 (.13)
Splice      95.5 (0.3)    95.6 (.42)    96.1 (.15)

Learning pictorial structures: 11 characters in an episode of "Buffy"; 24,244 faces (first 80% for training, last 20% for testing); 13 facial features (variables) plus their positions (labels). Unary potentials: logistic regression; pairwise potentials: Θ^m. Bag-of-visual-words baseline: 65.68%.

            RS02     Our
            66.05
            66.01    66.65
            66.01    66.86
            66.08    67.25
            66.08    67.48
            66.16    67.50
            66.20    67.68

Future Work
Mixtures in log-probability space? Connections to Discrete AdaBoost?
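The linear-time inversion mentioned above relies only on the Hessian having uniform off-diagonal entries. Below is a minimal sketch of that step, assuming a Hessian with diagonal hess_diag and a single constant value c everywhere off the diagonal, so that H = diag(hess_diag − c) + c·11^T and the Newton direction follows from the Sherman-Morrison formula in O(m); the function and argument names are illustrative rather than the authors' code.

import numpy as np

def newton_direction_uniform_offdiag(grad, hess_diag, c):
    """Solve H dz = grad in linear time when H(i,i) = hess_diag[i] and
    H(i,j) = c for i != j, i.e. H = diag(hess_diag - c) + c * 1 1^T."""
    d = hess_diag - c                  # diagonal part D of the rank-one decomposition
    u = grad / d                       # D^{-1} grad
    v = 1.0 / d                        # D^{-1} 1
    # Sherman-Morrison:
    # (D + c 1 1^T)^{-1} g = D^{-1} g - c (1^T D^{-1} g) / (1 + c 1^T D^{-1} 1) * D^{-1} 1
    return u - (c * u.sum() / (1.0 + c * v.sum())) * v

# Newton update from the poster: z = z - newton_direction_uniform_offdiag(grad_g(z), hess_diag_g(z), c),
# where grad_g and hess_diag_g are placeholders for the gradient and Hessian diagonal of the barrier objective.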

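The Chow-Liu step (used in the M-step and in the projection above) fits the best tree-structured distribution by building a maximum-weight spanning tree over pairwise mutual information, with potentials read off the corresponding empirical marginals. The sketch below assumes binary variables and weighted samples and returns only the tree edges; chow_liu_tree is an illustrative name, and Prim's algorithm is used for brevity rather than efficiency.

import numpy as np

def chow_liu_tree(X, weights=None, eps=1e-12):
    """Chow-Liu: maximum-weight spanning tree over pairwise mutual information.
    X: (num_samples, n) array of 0/1 labels; weights: optional per-sample weights.
    Returns a list of tree edges (a, b)."""
    m, n = X.shape
    w = np.ones(m) / m if weights is None else weights / weights.sum()
    # Empirical singleton marginals p(x_a).
    p1 = np.array([np.bincount(X[:, a], weights=w, minlength=2) for a in range(n)])
    # Pairwise mutual information I(v_a; v_b).
    mi = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            pab = np.zeros((2, 2))
            for xa in (0, 1):
                for xb in (0, 1):
                    pab[xa, xb] = w[(X[:, a] == xa) & (X[:, b] == xb)].sum()
            mi[a, b] = mi[b, a] = np.sum(
                pab * np.log((pab + eps) / (np.outer(p1[a], p1[b]) + eps)))
    # Prim's algorithm for the maximum-weight spanning tree.
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        a, b = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: mi[e])
        edges.append((a, b))
        in_tree.add(b)
    return edges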

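Finally, the α-divergence objective as written on the poster is straightforward to evaluate on a small, fully enumerable state space. A minimal sketch, with an illustrative function name:

import numpy as np

def alpha_divergence(p, q, alpha):
    """D_alpha(p || q) = 1/(alpha - 1) * log sum_x p(x)^alpha q(x)^(1 - alpha)
    for probability vectors p and q over the same finite set (alpha != 1;
    the limit alpha -> 1 recovers the KL divergence)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.log(np.sum(p ** alpha * q ** (1.0 - alpha))) / (alpha - 1.0)

# The alpha -> infinity case used on the poster corresponds to max_x log(p(x) / q(x)),
# the log-ratio of the worst-explained state, matching the RS02 objective.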