
Information Bottleneck versus Maximum Likelihood Felix Polyakov

Overview of the talk
→ Brief review of the Information Bottleneck
– Maximum Likelihood
– Information Bottleneck and Maximum Likelihood
– Example from Image Segmentation

A Simple Example...

Simple Example

A new compact representation: the document clusters preserve the relevant information shared between documents and words.

Feature Selection?
– NO ASSUMPTIONS about the source of the data
– Extracting relevant structure from data: functions of the data (statistics) that preserve information
– Information about what?
– Need a principle that is both general and precise.

Documents ↔ Words (co-occurrence diagram)

The Information Bottleneck, or relevance through distortion (N. Tishby, F. Pereira, and W. Bialek)
We would like the relevant partitioning T to compress X as much as possible, and to capture as much information about Y as possible.

Goal: find q(T | X)
– note the Markovian independence relation T ↔ X ↔ Y

Variational problem: minimize the IB functional L_IB = I(T;X) − β I(T;Y) over the assignments q(t|x).
Iterative algorithm (iIB): alternate the self-consistent equations
– q(t|x) = q(t) exp(−β D_KL[p(y|x) ‖ q(y|t)]) / Z(x, β)
– q(t) = Σ_x p(x) q(t|x)
– q(y|t) = (1/q(t)) Σ_x p(x) q(t|x) p(y|x)
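To make the iteration concrete, here is a minimal NumPy sketch of these self-consistent updates; the function and variable names (iterative_ib, p_xy, beta) are illustrative assumptions, not taken from the slides or the original paper.

```python
import numpy as np

def iterative_ib(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Minimal sketch of the iterative IB updates for a joint distribution p(x, y)."""
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                      # p(x)
    p_y_given_x = p_xy / p_x[:, None]           # p(y|x)

    # random soft assignment q(t|x)
    q_t_given_x = rng.random((n_x, n_clusters))
    q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        q_t = p_x @ q_t_given_x                                  # q(t) = sum_x p(x) q(t|x)
        q_y_given_t = (q_t_given_x * p_x[:, None]).T @ p_y_given_x
        q_y_given_t /= q_y_given_t.sum(axis=1, keepdims=True)    # q(y|t)

        # D_KL[p(y|x) || q(y|t)] for every (x, t) pair
        eps = 1e-12
        kl = (p_y_given_x[:, None, :] *
              (np.log(p_y_given_x[:, None, :] + eps) -
               np.log(q_y_given_t[None, :, :] + eps))).sum(axis=2)

        q_t_given_x = q_t[None, :] * np.exp(-beta * kl)          # self-consistent update
        q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)
    return q_t_given_x, q_t, q_y_given_t
```

Starting from a random soft assignment, the three updates are repeated until q(t|x) stops changing; the choice of β controls the compression versus relevance trade-off.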

Overview of the talk
– Short review of the Information Bottleneck
→ Maximum Likelihood
– Information Bottleneck and Maximum Likelihood
– Example from Image Segmentation

A simple example...
A coin is known to be biased. The coin is tossed three times – two heads and one tail.
Model: p(head) = P, p(tail) = 1 − P
Use ML to estimate the probability of throwing a head:
– Try P = 0.2: L(O) = 0.2 · 0.2 · 0.8 = 0.032
– Try P = 0.4: L(O) = 0.4 · 0.4 · 0.6 = 0.096
– Try P = 0.6: L(O) = 0.6 · 0.6 · 0.4 = 0.144
– Try P = 0.8: L(O) = 0.8 · 0.8 · 0.2 = 0.128
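The coin calculation can be checked with a few lines of Python; the grid of candidate values mirrors the list above, and the closed-form ML answer is simply the empirical frequency of heads.

```python
# Likelihood of observing two heads and one tail for several candidate values of P
for P in (0.2, 0.4, 0.6, 0.8):
    print(f"P = {P}: L(O) = {P * P * (1 - P):.3f}")

# The ML estimate is the empirical frequency of heads
print("ML estimate:", 2 / 3)
```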

A bit more complicated example: Mixture Model
Three baskets B1, B2, B3 with white (O = 1), grey (O = 2), and black (O = 3) balls.
15 balls were drawn as follows:
1. Choose a basket i according to p(i) = α_i
2. Draw ball j from basket i with probability p(j | i) = β_ij
Use ML to estimate θ = (α, β) given the observations: the sequence of the balls' colors.

Likelihood of the observations: L(θ) = ∏_{i=1..N} p(y_i | θ) = ∏_{i=1..N} Σ_x p(x | θ) p(y_i | x, θ)
Log-likelihood of the observations: log L(θ) = Σ_{i=1..N} log p(y_i | θ)
Maximal likelihood of the observations: θ_ML = argmax_θ log L(θ)

Likelihood of the observed data
– x: hidden random variables [e.g., basket]
– y: observed random variables [e.g., color]
– θ: model parameters [e.g., they define p(y|x)]
– θ⁰: current estimate of the model parameters

Expectation-Maximization algorithm (I)
1. Expectation
 – Compute the posterior p(x | y, θ⁰)
 – Get the expected complete log-likelihood Q(θ | θ⁰) = E_{p(x|y,θ⁰)}[log p(x, y | θ)]
2. Maximization
 – θ¹ = argmax_θ Q(θ | θ⁰)
The EM algorithm converges to a local maximum of the likelihood.
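As an illustration of these two steps, here is a minimal EM sketch for the basket/ball mixture above. The parameter names (alpha for the basket priors, beta for the color probabilities) and the 15-color observation sequence are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def em_mixture(colors, n_baskets, n_colors, n_iter=100, seed=0):
    """Minimal EM sketch for the basket/ball mixture: hidden basket, observed color.
    alpha[i] = p(basket i), beta[i, j] = p(color j | basket i); names are illustrative."""
    rng = np.random.default_rng(seed)
    alpha = np.full(n_baskets, 1.0 / n_baskets)
    beta = rng.dirichlet(np.ones(n_colors), size=n_baskets)

    for _ in range(n_iter):
        # E-step: posterior over the hidden basket for every observed ball
        post = alpha[None, :] * beta[:, colors].T          # shape (n_balls, n_baskets)
        post /= post.sum(axis=1, keepdims=True)

        # M-step: re-estimate the parameters from the expected counts
        alpha = post.mean(axis=0)
        for j in range(n_colors):
            beta[:, j] = post[colors == j].sum(axis=0)
        beta /= beta.sum(axis=1, keepdims=True)
    return alpha, beta

# Example: 15 observed colors (0 = white, 1 = grey, 2 = black); an arbitrary sequence
colors = np.array([0, 1, 2, 0, 0, 1, 2, 2, 1, 0, 2, 1, 0, 2, 1])
alpha, beta = em_mixture(colors, n_baskets=3, n_colors=3)
```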

The log-likelihood is non-decreasing over EM iterations (example plots).

EM – another approach
Goal: maximize the log-likelihood log L(θ) = Σ_i log Σ_x p(x, y_i | θ)
Jensen's inequality for a concave function (here, log): log E_q[z] ≥ E_q[log z]
This gives a lower bound F(q, θ) ≤ log L(θ) (a "free energy"), which EM maximizes.
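A worked form of this bound for a single observation y, with q any distribution over the hidden variable x, in the style of Neal and Hinton (see References); the notation F matches its later use on these slides.

```latex
\begin{aligned}
\log p(y \mid \theta)
  &= \log \sum_x q(x)\,\frac{p(x, y \mid \theta)}{q(x)} \\
  &\ge \sum_x q(x) \log \frac{p(x, y \mid \theta)}{q(x)}
   = \mathbb{E}_{q}\big[\log p(x, y \mid \theta)\big] + H(q)
   \;\equiv\; F(q, \theta).
\end{aligned}
```

Equality holds iff q(x) = p(x | y, θ), which is exactly the E-step choice; summing F over the observations gives the bound on the full log-likelihood.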

Expectation-Maximization algorithm (II)
1. Expectation: q^(k+1) = argmax_q F(q, θ^(k))   [achieved by q(x) = p(x | y, θ^(k))]
2. Maximization: θ^(k+1) = argmax_θ F(q^(k+1), θ)
Formulations (I) and (II) are equivalent.

Scheme of the approach

Overview of the talk
– Short review of the Information Bottleneck
– Maximum Likelihood
→ Information Bottleneck and Maximum Likelihood for a toy problem
– Example from Image Segmentation

The toy problem
– Documents: X
– Words: Y
– Topics: t
Generative model: t ~ π(t), x ~ π(x), y | t ~ π(y|t)

Model parameters: the topic assignment t(x) and the distributions π(x), π(y|t)
Sampling algorithm: for i = 1:N
– choose x_i by sampling from π(x)
– choose y_i by sampling from π(y | t(x_i))
– increase n(x_i, y_i) by one
Example: x_i = 9
– t(9) = 2
– sample from π(y|2), get y_i = "Drug"
– set n(9, "Drug") = n(9, "Drug") + 1
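A minimal sketch of this sampling loop; pi_x, t_of_x, and pi_y_given_t are illustrative names for π(x), the topic assignment t(·), and π(y|t).

```python
import numpy as np

def sample_counts(pi_x, t_of_x, pi_y_given_t, N, seed=0):
    """Sketch of the toy-problem sampler: draw N (document, word) pairs and
    accumulate the co-occurrence counts n(x, y). Names are illustrative."""
    rng = np.random.default_rng(seed)
    n_x, n_y = len(pi_x), pi_y_given_t.shape[1]
    n = np.zeros((n_x, n_y), dtype=int)
    for _ in range(N):
        x = rng.choice(n_x, p=pi_x)                      # choose a document from pi(x)
        y = rng.choice(n_y, p=pi_y_given_t[t_of_x[x]])   # choose a word from pi(y | t(x))
        n[x, y] += 1
    return n
```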

Figure: the topic assignment t(X) for each document X, and the word distributions π(y|t=1), π(y|t=2), π(y|t=3)

Toy problem: which parameters maximize the likelihood?
– T = topics, X = documents, Y = words
– Parameters: the assignment t(x) and the distributions π(y | t(x))

EM approach
– E-step: q(t | x) = π(t) ∏_y π(y|t)^n(x,y) / Z(x), where Z(x) is a normalization factor
– M-step: π(t) ∝ Σ_x q(t|x),  π(y|t) ∝ Σ_x q(t|x) n(x, y)

IB approach
– q(t | x) = q(t) exp(−β D_KL[p(y|x) ‖ q(y|t)]) / Z(x, β), where Z(x, β) is a normalization factor
– q(t) = Σ_x p(x) q(t|x)
– q(y|t) = (1/q(t)) Σ_x p(x) q(t|x) p(y|x)

The ML ↔ IB mapping (r is a scaling constant)
– π(t) ↔ q(t)
– π(y|t) ↔ q(y|t)
– the E-step posterior q(t|x) ↔ the IB assignment q(t|x)

When X is uniformly distributed and r = |X|, the EM algorithm is equivalent to the IB iterative algorithm: under the ML ↔ IB mapping, iterative IB ↔ EM.
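As a small numerical check of this equivalence (a sketch under the mapping above, with arbitrary illustrative parameter values), the EM E-step posterior over topics and the iIB update give the same q(t|x) for a document when β = n(x), the document's total word count.

```python
import numpy as np

def em_posterior(pi_t, pi_y_given_t, counts_x):
    """EM E-step posterior over topics t for one document with word counts n(x, .)."""
    log_post = np.log(pi_t) + counts_x @ np.log(pi_y_given_t).T
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

def ib_posterior(q_t, q_y_given_t, counts_x, beta):
    """iIB update: q(t|x) proportional to q(t) * exp(-beta * KL[p(y|x) || q(y|t)])."""
    p_y_given_x = counts_x / counts_x.sum()
    eps = 1e-12
    kl = (p_y_given_x[None, :] *
          (np.log(p_y_given_x[None, :] + eps) - np.log(q_y_given_t))).sum(axis=1)
    log_post = np.log(q_t) - beta * kl
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# With q(t) = pi(t), q(y|t) = pi(y|t), and beta = n(x), the two posteriors coincide
rng = np.random.default_rng(0)
pi_t = np.array([0.5, 0.5])
pi_y_given_t = rng.dirichlet(np.ones(4), size=2)          # 2 topics, 4 words
counts_x = np.array([3.0, 1.0, 0.0, 2.0])                 # n(x, y) for one document
print(em_posterior(pi_t, pi_y_given_t, counts_x))
print(ib_posterior(pi_t, pi_y_given_t, counts_x, beta=counts_x.sum()))
```

Both calls print essentially the same posterior; the extra KL term contributes only an x-dependent constant that cancels in the normalization.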

When X is uniformly distributed and β = n(x), under the IB ↔ ML mapping:
– All the fixed points of the likelihood L are mapped to all the fixed points of the IB functional L_IB = I(T;X) − β I(T;Y)
– At the fixed points, −log L ≅ L_IB + const

When X is uniformly distributed and β = n(x):
– −(1/r) F − β H(Y) = L_IB
– −F ≅ L_IB + const
– Every algorithm increases F iff it decreases L_IB

Deterministic case: N → ∞ (or β → ∞)
– EM: the E-step posterior q(t|x) becomes a deterministic (hard) assignment of each x to a single topic
– IB: the iterative IB update likewise becomes a hard clustering, and the ML ↔ IB correspondence is preserved

When N → ∞ (or β → ∞):
– We do not need to assume uniformity of X here
– All the fixed points of the likelihood L are mapped to all the fixed points of the IB functional L_IB
– −F ≅ L_IB + const
– Every algorithm which finds a fixed point of L induces a fixed point of L_IB, and vice versa
– In case of several different fixed points, the solution that maximizes L is mapped to the solution that minimizes L_IB

Example (N = ∞, β = ∞)
π(x): x = "Yellow submarine" with probability 2/3, x = "Red bull" with probability 1/3

t    π(t) (EM)    q(t) (IB)
1    1/2          2/3
2    1/2          1/3

This does not mean that q(t) = π(t).

– When N → ∞, every algorithm increases F iff it decreases L_IB with β → ∞
– How large must N (or β) be? How is it related to the "amount of uniformity" in n(x)?

Simulations for iIB

Simulations for EM

Simulations: 200 runs = 100 (small N) + 100 (large N)
– In 58 runs, iIB converged to a smaller value of (−F) than EM
– In 46 runs, EM converged to a (−F) corresponding to a smaller value of L_IB

Quality estimation for the EM solution
– The quality of an IB solution is measured through the theoretic upper bound: I(T;Y) ≤ I(X;Y)
– Using the IB ↔ ML mapping, one can adopt this measure for the ML estimation problem, for large enough N
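One way to compute this measure in code, assuming the bound referred to is the data-processing bound I(T;Y) ≤ I(X;Y); the function names are illustrative.

```python
import numpy as np

def mutual_information(p_ab):
    """I(A;B) in nats for a joint distribution given as a 2-D array p(a, b)."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float((p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])).sum())

def ib_quality(p_xy, q_t_given_x):
    """Fraction of the relevant information captured by T: I(T;Y) / I(X;Y) <= 1."""
    p_ty = q_t_given_x.T @ p_xy          # p(t, y) = sum_x q(t|x) p(x, y)
    return mutual_information(p_ty) / mutual_information(p_xy)
```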

Summary: IB versus ML
– The ML and IB approaches are equivalent under certain conditions
– Model comparison:
  – The mixture model assumes that Y is independent of X given T(X): X → T → Y
  – In the IB framework, T is defined through the IB Markovian independence relation: T ↔ X ↔ Y
– The quality estimation measure can be adapted from IB to the ML estimation problem, for large N

Overview of the talk
– Brief review of the Information Bottleneck
– Maximum Likelihood
– Information Bottleneck and Maximum Likelihood
→ Example from Image Segmentation (L. Hermes et al.)

The clustering model
– Pixels o_i, i = 1, …, n
– Deterministic clusters c_ν, ν = 1, …, k
– Boolean assignment matrix M ∈ {0, 1}^(n × k), with Σ_ν M_iν = 1
– Observations

Figure: observations, color values q measured within a radius r around pixel o_i

Likelihood
– Discretization of the color space into intervals I_j
– Set the cluster-conditional probabilities of the color intervals, p(I_j | c_ν)
– Data likelihood: the product, over pixels, of the probabilities of their observed color intervals
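A minimal sketch of how such a discretized data log-likelihood can be evaluated, assuming per-pixel interval counts n[i, j], cluster-conditional interval probabilities p_int[ν, j], and a hard pixel-to-cluster assignment assign[i]; the names are illustrative and not taken from Hermes et al.

```python
import numpy as np

def segmentation_log_likelihood(n, p_int, assign):
    """log L = sum_i sum_j n[i, j] * log p(I_j | c_assign[i]) for a hard assignment.
    n: (n_pixels, n_intervals) counts; p_int: (k, n_intervals); assign: (n_pixels,)."""
    return float((n * np.log(p_int[assign] + 1e-12)).sum())
```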

Relation to the IB

Log-likelihood and the IB functional: assume that n_i = const and set β = n_i; then L_IB = −log L.

Images generated from the learned statistics

References
– N. Tishby, F. Pereira, and W. Bialek. The Information Bottleneck Method.
– N. Slonim and Y. Weiss. Maximum Likelihood and the Information Bottleneck.
– R. M. Neal and G. E. Hinton. A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants.
– J. Goldberger. Lecture notes.
– L. Hermes, T. Zoller, and J. M. Buhmann. Parametric Distributional Clustering for Image Segmentation.
The end