Information Bottleneck EM
Gal Elidan and Nir Friedman
School of Engineering & Computer Science, The Hebrew University, Jerusalem, Israel

Learning with Hidden Variables
Input: DATA over X_1 ... X_N with a hidden variable T whose values are all missing.
Output: a model P(X,T), its parameters and likelihood; the running example is a network in which T is a parent of X_1, X_2, X_3.
Problem: there is no closed-form solution for ML estimation, so we use Expectation Maximization (EM).
Problem: EM gets stuck in inferior local maxima; common remedies are random restarts and deterministic / simulated annealing.
This talk: EM + information regularization for learning parameters.

Learning Parameters (fully observed case)
Input: DATA over X_1 ... X_N, summarized by the empirical distribution Q(X), and a parametrization θ of P.
Output: a model P_θ(X).
For the example network X_1 → X_2, X_1 → X_3, the maximum-likelihood parameters simply match the empirical conditionals:
P_θ(X_1) = Q(X_1), P_θ(X_2|X_1) = Q(X_2|X_1), P_θ(X_3|X_1) = Q(X_3|X_1).

Learning with Hidden Variables
Input: DATA over X_1 ... X_N with T unobserved (M instances, indexed by an instance-ID variable Y), a parametrization θ for P, and the desired structure with T as a parent of X_1, X_2, X_3.
Empirical distribution Q(X,T) = ? It is undefined until the missing values of T are completed.
EM iterations: start from a guess of θ; for each instance ID, complete the value of T (softly), which defines the empirical distribution Q(X,T,Y) = Q(X,Y) Q(T|Y).

The EM Functional [Neal and Hinton, 1998]
The EM algorithm:
E-Step: generate the (completed) empirical distribution Q.
M-Step: maximize with respect to θ.
EM is equivalent to optimizing a single functional of Q and P_θ, and each step increases the value of this functional.
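A standard way to write this functional, assuming the Neal and Hinton formulation with Y denoting the instance ID, is:

```latex
F[Q, P_\theta] \;=\; \mathbb{E}_{Q}\!\left[\log P_\theta(X, T)\right] \;+\; H_Q(T \mid Y)
```

The E-step maximizes F over Q with θ fixed, the M-step maximizes F over θ with Q fixed, and a local maximum of F corresponds to a local maximum of the likelihood.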

Information Bottleneck EM Target
The objective combines two terms: the EM target (fit of P_θ to the data) and the information between the hidden variable and the instance ID.
In the rest of the talk: understanding this objective, and how to use it to learn better models.

Information Regularization [Tishby et al., 1999]
Motivating idea:
Fit the training data: set T to be the instance ID so that it "predicts" X.
Generalization: "forget" the ID and keep only the essence of X.
Objective: trade off (a lower bound of) the likelihood of P_θ against compression of the instance ID; this is a parameter-free regularization of Q.
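Writing the trade-off out explicitly, with γ denoting the trade-off parameter used throughout the talk (this rendering is mine, chosen to be consistent with the later mean-field slide, where L_IB-EM = γ (Variational EM) - (1-γ) Regularization):

```latex
\mathcal{L}_{\text{IB-EM}} \;=\; \gamma\, F[Q, P_\theta] \;-\; (1-\gamma)\, I_Q(T;Y), \qquad 0 \le \gamma \le 1
```

so γ = 0 gives pure compression of the instance ID and γ = 1 recovers the EM functional.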

Clustering example, γ = 0: total compression. The compression term dominates and T retains no information about the instance ID. [Plot: EM target and compression measure as functions of γ.]

Clustering example, γ = 1: total preservation. T effectively copies the instance ID. [Plot: EM target and compression measure as functions of γ.]

Clustering example, γ = ?: the desired solution with |T| = 2 clusters lies at some intermediate trade-off. [Plot: EM target and compression measure as functions of γ.]

Information Bottleneck EM: formal equivalence with the Information Bottleneck. At γ = 1, EM and the Information Bottleneck coincide: the objective reduces to the EM functional [generalizing the result of Slonim and Weiss for the univariate case].

Information Bottleneck EM: formal equivalence with the Information Bottleneck (cont.). The maximum of Q(T|Y) is obtained at a self-consistent fixed point built from three ingredients: the prediction of T using P_θ, the marginal of T in Q, and a normalization term (the EM functional written in these terms).
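A self-consistent form of this maximum that matches the three ingredients above (my reconstruction; the exact exponents and notation may differ from the paper) is:

```latex
Q(t \mid y) \;=\; \frac{1}{Z(y,\gamma)}\; Q(t)^{\,1-\gamma}\; P_\theta\!\left(t, \mathbf{x}[y]\right)^{\gamma}
```

where x[y] is the observed assignment of instance y and Z(y,γ) normalizes over the values of t.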

The IB-EM Algorithm (for a fixed γ)
Iterate until convergence:
E-Step: maximize L_IB-EM by optimizing Q.
M-Step: maximize L_IB-EM by optimizing P_θ (same as the standard M-step).
Each step improves L_IB-EM, so the procedure is guaranteed to converge.
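As an illustration only, here is a minimal Python sketch of this fixed-γ loop for a naive Bayes model with a single discrete hidden parent T of discrete children; the model choice, the function name ib_em_fixed_gamma, and the E-step form (taken from the reconstruction above) are assumptions, not code from the talk.

```python
import numpy as np

def ib_em_fixed_gamma(X, n_t, gamma, q_init=None, n_iter=100, seed=0):
    """IB-EM coordinate ascent at a fixed trade-off gamma.

    X      : (M, N) array of discrete observations with values in {0, ..., K-1}
    n_t    : cardinality of the hidden variable T
    gamma  : trade-off in [0, 1]; gamma = 1 recovers standard soft EM
    q_init : optional (M, n_t) warm start for Q(T|Y)
    """
    rng = np.random.default_rng(seed)
    M, N = X.shape
    K = int(X.max()) + 1
    # Soft assignment Q(T|Y): one row per instance (the row index is the instance ID Y).
    q_t_given_y = rng.dirichlet(np.ones(n_t), size=M) if q_init is None else q_init.copy()

    for _ in range(n_iter):
        # M-step: maximize L_IB-EM over the parameters of P (identical to the standard M-step).
        p_t = q_t_given_y.mean(axis=0)                      # P(T)
        p_x_given_t = np.ones((N, n_t, K))                  # Laplace-smoothed P(X_i | T)
        for i in range(N):
            for k in range(K):
                p_x_given_t[i, :, k] += q_t_given_y[X[:, i] == k].sum(axis=0)
        p_x_given_t /= p_x_given_t.sum(axis=2, keepdims=True)

        # E-step: maximize L_IB-EM over Q via the self-consistent update
        #   Q(t|y) proportional to Q(t)^(1-gamma) * P_theta(t, x[y])^gamma.
        q_t = q_t_given_y.mean(axis=0)                      # marginal Q(T)
        log_joint = np.log(p_t)[None, :].repeat(M, axis=0)  # log P_theta(t, x[y])
        for i in range(N):
            log_joint += np.log(p_x_given_t[i, :, X[:, i]]).T
        logits = (1.0 - gamma) * np.log(q_t)[None, :] + gamma * log_joint
        logits -= logits.max(axis=1, keepdims=True)         # shift before normalizing
        q_t_given_y = np.exp(logits)
        q_t_given_y /= q_t_given_y.sum(axis=1, keepdims=True)

    return q_t_given_y, (p_t, p_x_given_t)
```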

Information Bottleneck EM Target (recap)
The objective trades off the EM target against the information between the hidden variable and the instance ID.
Next: how to use this objective to learn better models.

Continuation: follow the ridge of L_IB-EM from the optimum at γ = 0 (easy) toward γ = 1 (hard). [Plot: L_IB-EM as a function of Q and of γ running from 0 to 1.]

Continuation
Recall that if Q is a local maximum of L_IB-EM, then the gradient of L_IB-EM with respect to Q(t|y) is zero for all t and y.
We want to follow a path in (Q, γ) space along which Q remains a local maximum for all γ.
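Spelled out in my notation, keeping the gradient at zero along the path means the step direction (dQ, dγ) must satisfy the linear system obtained by differentiating the stationarity condition:

```latex
\frac{\partial \mathcal{L}_{\text{IB-EM}}}{\partial Q(t\mid y)} = 0 \;\;\forall t,y
\quad\Longrightarrow\quad
\frac{\partial^2 \mathcal{L}_{\text{IB-EM}}}{\partial Q\,\partial Q}\, dQ
\;+\;
\frac{\partial^2 \mathcal{L}_{\text{IB-EM}}}{\partial Q\,\partial \gamma}\, d\gamma \;=\; 0
```

This is the gradient computation referred to in the next slide.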

Continuation Step
1. Start at a point (Q, γ) where the gradient condition above holds.
2. Compute the gradient.
3. Take the direction along the ridge.
4. Take a step in the desired direction.
[Plot: path in (Q, γ) space from the starting point at γ = 0.]

Staying on the ridge
Potential problem: the direction is only tangent to the path, so a step can miss the optimum.
Solution: use EM steps to regain the path.
[Plot: path in (Q, γ) space with corrective EM steps back to the ridge.]

The IB-EM Algorithm
Set γ = 0 (start at the easy solution).
Iterate until γ = 1 (the EM solution is reached):
  Iterate (stay on the ridge):
    E-Step: maximize L_IB-EM by optimizing Q.
    M-Step: maximize L_IB-EM by optimizing P_θ.
  Step (follow the ridge):
    Compute the gradient and the step direction.
    Take the step by changing γ and Q.

Calibrating the step size
Potential problem: if the step size is too small, progress is too slow; if it is too large, the step can overshoot the target and land in an inferior solution.
[Plot: path in (Q, γ) space illustrating an overshoot.]

Calibrating the step size: use the change in I(T;Y) rather than a naive uniform schedule in γ.
Recall that I(T;Y) measures the compression of the instance ID; when I(T;Y) rises, more of the data is captured, so a naive schedule is too sparse exactly in the "interesting" region where I(T;Y) changes.
I(T;Y) is non-parametric (it involves only Q) and can be bounded: I(T;Y) ≤ log_2 |T|.
[Plot: I(T;Y) as a function of γ, comparing a naive schedule with the calibrated one.]

The IB-EM Algorithm (with step-size calibration)
Set γ = 0.
Iterate until γ = 1 (the EM solution is reached):
  Iterate (stay on the ridge):
    E-Step: maximize L_IB-EM by optimizing Q.
    M-Step: maximize L_IB-EM by optimizing P_θ.
  Step (follow the ridge):
    Compute the gradient and the step direction.
    Calibrate the step size using I(T;Y).
    Take the step by changing γ and Q.
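For illustration, the sketch below wraps the earlier ib_em_fixed_gamma helper in a simplified outer loop: instead of the gradient-following continuation step it warm-starts each γ from the previous Q and shrinks or grows the γ increment so that I(T;Y) changes by roughly a fixed amount per step. All names and thresholds are assumptions; this only illustrates the calibration idea, not the talk's exact procedure.

```python
import numpy as np
# Relies on ib_em_fixed_gamma from the earlier sketch (an assumed helper, not code from the talk).

def mutual_info_ty(q_t_given_y):
    """I(T;Y) in bits, assuming a uniform distribution over instance IDs Y."""
    q_t = q_t_given_y.mean(axis=0)
    terms = q_t_given_y * np.log2(np.maximum(q_t_given_y, 1e-12) / np.maximum(q_t, 1e-12))
    return terms.sum(axis=1).mean()

def ib_em_anneal(X, n_t, delta_info=0.05, min_step=1e-3, max_step=0.1):
    """Simplified IB-EM outer loop: anneal gamma from 0 to 1 with warm starts,
    calibrating the gamma increment by the observed change in I(T;Y)."""
    gamma, step = 0.0, max_step
    q, params = ib_em_fixed_gamma(X, n_t, gamma)
    info = mutual_info_ty(q)
    while gamma < 1.0:
        gamma_next = min(1.0, gamma + step)
        q_next, params_next = ib_em_fixed_gamma(X, n_t, gamma_next, q_init=q)
        info_next = mutual_info_ty(q_next)
        jump = abs(info_next - info)
        if jump > delta_info and step > min_step:
            step = max(min_step, step / 2.0)      # I(T;Y) jumped too much: retry with a smaller step
            continue
        gamma, q, params, info = gamma_next, q_next, params_next, info_next
        if jump < delta_info / 4.0:
            step = min(max_step, step * 2.0)      # flat region of I(T;Y): move faster
    return q, params
```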

The Stock Dataset [Boyen et al., 1999]
Naive Bayes model over the daily changes of 20 NASDAQ stocks, with a train set and 303 test instances.
IB-EM outperforms the best of the EM solutions; I(T;Y) follows the changes in likelihood, and the continuation approximately follows the region of change (the marks show the evaluated values of γ).
[Plots: I(T;Y) and train likelihood as functions of γ, comparing IB-EM with the best of EM.]

Multiple Hidden Variables
We want to learn a model with many hidden variables.
Naive approach: representing Q(T|Y) over the joint assignment of the hiddens is potentially exponential in their number.
Variational approximation: use a factorized (Mean Field) form for Q(T|Y) [Friedman et al., 2002], giving
L_IB-EM = γ (Variational EM) - (1-γ) Regularization.
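In equations (my notation), the mean-field choice simply factorizes the variational distribution over the hidden variables:

```latex
Q(T_1,\dots,T_K \mid Y) \;\approx\; \prod_{k=1}^{K} Q(T_k \mid Y)
```

This keeps the E-step tractable at the cost of replacing the exact EM term with its variational lower bound.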

The USPS Digits dataset: 400 samples, 21 hidden variables.
A single IB-EM run (27 min) is superior to all Mean Field EM runs (1 min/run) and takes about the time of a single exact EM run (25 min/run).
Only 3 of 50 EM runs are comparable to IB-EM, and EM needs roughly 17x the time for similar results.
IB-EM offers good value for your time!
[Plot: test log-loss per instance vs. percentage of random runs, comparing Mean Field EM, exact EM, and the single IB-EM run.]

Yeast Stress Response: 173 experiments (variables), 6152 genes (samples), 25 hidden variables with 5-24 experiments each.
IB-EM (~6 hours) is superior to all Mean Field EM runs (~0.5 hours each) and an order of magnitude faster than exact EM (>60 hours).
Effective when the exact solution becomes intractable!
[Plot: test log-loss per instance vs. percentage of random runs.]

Summary
- A new framework for learning hidden variables.
- A formal relation between the Information Bottleneck and EM.
- Continuation for bypassing local maxima.
- Flexible: works with different structures and variational approximations.
Future Work
- Learn the optimal γ ≤ 1 for better generalization.
- Explore other approximations of Q(T|Y).
- Model selection: learning cardinality and enriching the structure.

Relation to Weight Annealing (WA) [Elidan et al., 2002]
WA: initialize the temperature to hot; iterate until cold: perturb the instance weights w in proportion to the temperature, optimize using the reweighted Q_W, then cool down.
Similarities: both change the empirical Q and morph toward the EM solution.
Differences: IB-EM uses information regularization and continuation; WA requires a cooling policy; WA is applicable to a wider range of problems.

Relation to Deterministic Annealing (DA)
DA: initialize the temperature to hot; iterate until cold: "insert" entropy in proportion to the temperature into the model, optimize the noisy model, then cool down.
Similarities: both use an information measure and morph toward the EM solution.
Differences: DA is parameterization dependent; IB-EM uses continuation; DA requires a cooling policy; DA is applicable to a wider range of problems.