Learning a Small Mixture of Trees. M. Pawan Kumar, Daphne Koller (Stanford University).

Presentation transcript:

Aim: To efficiently learn a small mixture of trees that approximates an observed distribution.

Overview
- An intuitive objective function for learning a mixture of trees.
- Formulate the problem using fractional covering.
- Identify the drawbacks of fractional covering.
- Make suitable modifications to the algorithm.

Mixture of Trees (Meila and Jordan, 2000; MJ00)
Variables V = {v_1, v_2, ..., v_n}, with label x_a ∈ X_a for variable v_a and x denoting a complete labeling. A hidden variable z selects one of the trees t_1, t_2, ..., so the mixture distribution is
  Pr(x | θ^m) = Σ_{t ∈ T} ρ_t Pr(x | θ^t),
where each tree distribution factorizes as
  Pr(x | θ^t) = Π_{(a,b) ∈ t} θ^t_ab(x_a, x_b) Π_a θ^t_a(x_a)^(d_a − 1).
Here θ^t_ab(x_a, x_b) are pairwise potentials, θ^t_a(x_a) are unary potentials, and d_a is the degree of v_a.

Minimizing the KL Divergence
  KL(θ_1 || θ_2) = Σ_x Pr(x | θ_1) log [ Pr(x | θ_1) / Pr(x | θ_2) ],
where θ_1 is the observed distribution and θ_2 is the simpler distribution. The standard approach is the EM algorithm, which relies heavily on initialization: the E-step estimates Pr(x | θ^t) for each x and t, and the M-step obtains the tree structures and potentials via Chow-Liu. This approach focuses on the dominant mode.

Minimizing α-Divergence (Renyi, 1961)
  D_α(θ_1 || θ_2) = [1/(α − 1)] log Σ_x Pr(x | θ_1)^α Pr(x | θ_2)^(1−α),
a generalization of the KL divergence, with D_1(θ_1 || θ_2) = KL(θ_1 || θ_2). When fitting q to p, a larger α is more inclusive (Minka, 2005; illustrated for α = 0.5, 1 and ∞). We use α = ∞.

Problem Formulation (Rosset and Segal, 2002; RS02)
Given a distribution p(·), find a mixture of trees by minimizing the α-divergence with α = ∞:
  θ^m* = argmin_{θ^m} max_i log [ p(x_i) / Pr(x_i | θ^m) ] = argmax_{θ^m} min_i Pr(x_i | θ^m) / p(x_i).

Fractional Covering (Plotkin et al., 1995)
Choose from the set T = {θ^t_j} of all possible trees defined over the n random variables. Define the matrix A with A(i,j) = Pr(x_i | θ^t_j), the vector b with b(i) = p(x_i), and the mixture weights ρ ≥ 0 with Σ_j ρ_j = 1, i.e. ρ ∈ P. The problem becomes
  max λ  s.t.  a_i ρ ≥ λ b_i,  ρ ∈ P,
whose constraints are defined over an essentially infinite number of variables (one per tree). Fractional covering instead minimizes the potential
  min_ρ Σ_i exp(−α a_i ρ / b_i)  s.t.  ρ ∈ P,
with parameter α ∝ log(m) and width w = max_{ρ ∈ P} max_i a_i ρ / b_i. Given an initial solution ρ_0, define λ_0 = min_i a_i ρ_0 / b_i and the step size σ = ε / (4αw). To find an ε-optimal solution, iterate while λ < 2λ_0 (see the sketch below):
- define y_i = exp(−α a_i ρ / b_i) / b_i;
- find ρ' = argmax_ρ y^T A ρ, i.e. minimize the first-order approximation of the potential;
- update ρ = (1 − σ) ρ + σ ρ'.
Drawbacks: (1) slow convergence; (2) singleton trees, which assign probability 0 to unseen test examples.
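To make the loop above concrete, here is a minimal NumPy sketch (not the authors' code): it assumes the set of trees has been restricted to a fixed candidate matrix A, whereas the full algorithm searches over all trees with a combinatorial oracle, and the parameter choices simply follow the definitions quoted above.

```python
import numpy as np

def fractional_covering(A, b, alpha, eps, max_iter=1000):
    """One phase of a fractional-covering style update (illustrative sketch).

    A[i, j] = Pr(x_i | tree j) for a fixed candidate set of trees,
    b[i] = p(x_i). Returns mixture weights rho over the candidate trees.
    """
    n_trees = A.shape[1]
    rho = np.full(n_trees, 1.0 / n_trees)   # initial solution rho_0 (uniform)
    w = (A.max(axis=1) / b).max()           # width w = max_rho max_i a_i rho / b_i
    sigma = eps / (4.0 * alpha * w)         # step size sigma = eps / (4 alpha w)
    lam0 = (A @ rho / b).min()              # lambda_0 = min_i a_i rho_0 / b_i

    for _ in range(max_iter):
        ratios = A @ rho / b                # a_i rho / b_i for every sample i
        if ratios.min() >= 2.0 * lam0:      # the poster iterates while lambda < 2 lambda_0
            break
        y = np.exp(-alpha * ratios) / b     # y_i = exp(-alpha a_i rho / b_i) / b_i
        # rho' = argmax_{rho in P} y^T A rho is attained at a vertex of the simplex,
        # i.e. at the single candidate tree with the largest weighted column sum.
        j = int(np.argmax(y @ A))
        rho_prime = np.zeros(n_trees)
        rho_prime[j] = 1.0
        rho = (1.0 - sigma) * rho + sigma * rho_prime
    return rho
```

In the full algorithm the argmax over columns is a search over all spanning trees, and that single-tree step is exactly what the modifications below replace with a convex relaxation.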
Modifying Fractional Covering
(1) Start with α = 1/w and increase it by a factor of 2 only when necessary. This permits a large step size σ and keeps the y_i large enough for numerical stability.
(2) Find the new component by minimizing the first-order approximation through a convex relaxation over the values Pr(x_i | θ^t):
  min Σ_i exp( −α Σ_{t ∈ T} Pr(x_i | θ^t) / p(x_i) )  s.t.  Pr(x_i | θ^t) ≥ 0,  Σ_i Pr(x_i | θ^t) ≤ 1,
where the constraint that Pr(· | θ^t) factorizes as a tree distribution is dropped.

The relaxation is solved with a log-barrier approach: initialize a tolerance ε, a parameter μ and a factor f; solve for the distribution Pr(· | θ^t) by minimizing
  f Φ − Σ_i log Pr(x_i | θ^t) − Σ_i log(1 − Pr(x_i | θ^t)),
update f = μ f, and repeat until m/f ≤ ε. Each barrier problem is solved with Newton's method: to minimize g(z), update z = z − (∇²g(z))^{−1} ∇g(z) using the gradient and the Hessian. Because the Hessian has uniform off-diagonal elements, the matrix inversion can be performed in linear time (see the sketch at the end of this transcript).

The relaxed solution is projected to a tree distribution using Chow-Liu, which may result in an increase of the potential. In that case, discard the best-explained sample and recompute θ^t: enforce Pr(x_i' | θ^t) = 0 for i' = argmax_i Pr(x_i | θ^t) / p(x_i), and redistribute its mass over the remaining samples,
  Pr(x_i | θ^t) ← Pr(x_i | θ^t) + s_i Pr(x_i' | θ^t),  with  s_i = p(x_i | θ^t) / Σ_k p(x_k | θ^t).
Is recomputing θ^t a computationally expensive operation? No: by reusing the previous solution, only one log-barrier optimization is required.

Convergence Properties
- Maximum number of increases for α: O(log(log(m))).
- Maximum number of discarded samples: m − 1.
- Polynomial time per iteration, and polynomial-time convergence of the overall algorithm.

Results
Standard UCI datasets (MJ00 uses twice as many trees):
  Dataset    MJ00          RS02          Our
  Agaricus   99.98 (.04)   100 (0)
  Nursery    99.2 (.02)    98.35 (0.3)   99.28 (.13)
  Splice     95.5 (0.3)    95.6 (.42)    96.1 (.15)

Learning pictorial structures: 11 characters in an episode of "Buffy", 24,244 faces (first 80% for training, last 20% for testing), 13 facial features as variables with their positions as labels. Unary potentials from logistic regression; pairwise potentials from the learned mixture θ^m. Bag-of-visual-words baseline: 65.68%, compared against RS02 and our method.

Future Work
- Mixtures in log-probability space?
- Connections to Discrete AdaBoost?
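As a side note on the linear-time Hessian inversion mentioned above: a matrix with uniform off-diagonal entries is a diagonal matrix plus a rank-one term, so the Newton direction can be computed with the Sherman-Morrison identity instead of an explicit inverse. The sketch below is an illustrative NumPy implementation under that assumption; the function name and the positivity condition on the adjusted diagonal are ours, not from the poster.

```python
import numpy as np

def newton_direction_uniform_offdiag(grad, hess_diag, c):
    """Solve H d = grad in O(n) when H has uniform off-diagonal entries.

    H is assumed to equal hess_diag on the diagonal and the constant c
    everywhere off the diagonal, i.e. H = diag(hess_diag - c) + c * 1 1^T.
    Sherman-Morrison then gives the Newton direction without forming or
    inverting H explicitly (illustrative sketch, not the authors' code).
    """
    a = hess_diag - c                  # diagonal part after removing the rank-one term
    if np.any(a <= 0.0):
        raise ValueError("sketch assumes hess_diag - c > 0 elementwise")
    ainv_g = grad / a                  # diag(a)^{-1} grad
    ainv_1 = 1.0 / a                   # diag(a)^{-1} 1
    denom = 1.0 + c * ainv_1.sum()     # 1 + c * 1^T diag(a)^{-1} 1
    return ainv_g - (c * ainv_g.sum() / denom) * ainv_1

# Newton update for minimizing g(z):
# z = z - newton_direction_uniform_offdiag(grad_g(z), hess_diag(z), offdiag(z))
```

A quick sanity check is to build the dense H for a small n and compare the result against np.linalg.solve(H, grad).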