1 Information Bottleneck versus Maximum Likelihood Felix Polyakov

2 Overview of the talk
→ Brief review of the Information Bottleneck
– Maximum Likelihood
– Information Bottleneck and Maximum Likelihood
– Example from Image Segmentation

3 A Simple Example...

4 Simple Example

5 A new compact representation: the document clusters preserve the relevant information shared between documents and words.

6 Feature Selection?
– NO ASSUMPTIONS about the source of the data
– Extracting relevant structure from data: functions of the data (statistics) that preserve information
– Information about what?
– Need a principle that is both general and precise.

7 [Figure: Documents and Words]

8 The information bottleneck, or relevance through distortion (N. Tishby, F. Pereira, and W. Bialek): we would like the relevant partitioning T to compress X as much as possible, and to capture as much information about Y as possible.

9 Goal: find q(T | X) – note the Markovian independence relation T – X – Y (T and Y are independent given X).

10 Variational problem and iterative algorithm (a standard formulation is sketched below).
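The equations on this slide did not survive transcription. For reference, a sketch of the standard IB variational problem and its iterative algorithm, following Tishby, Pereira & Bialek; the talk's own notation may differ slightly.

```latex
% Standard IB formulation (Tishby, Pereira & Bialek).
% Variational problem:
\[ \min_{q(t \mid x)} \; \mathcal{L}_{IB} = I(T;X) - \beta\, I(T;Y) \]
% Self-consistent equations of the iterative algorithm:
\begin{align*}
  q(t \mid x) &= \frac{q(t)}{Z(x,\beta)}
      \exp\!\big(-\beta\, D_{\mathrm{KL}}\!\left[\,p(y \mid x)\,\|\,q(y \mid t)\,\right]\big)\\
  q(t) &= \sum_x p(x)\, q(t \mid x)\\
  q(y \mid t) &= \frac{1}{q(t)} \sum_x p(x)\, q(t \mid x)\, p(y \mid x)
\end{align*}
```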

11 Overview of the talk
– Short review of the Information Bottleneck
→ Maximum Likelihood
– Information Bottleneck and Maximum Likelihood
– Example from Image Segmentation

12 A simple example... A coin is known to be biased. The coin is tossed three times – two heads and one tail.
Model: p(head) = P, p(tail) = 1 − P
Use ML to estimate the probability of throwing a head. Likelihood of the data for candidate values of P:
– Try P = 0.2: L(O) = 0.2 * 0.2 * 0.8 = 0.032
– Try P = 0.4: L(O) = 0.4 * 0.4 * 0.6 = 0.096
– Try P = 0.6: L(O) = 0.6 * 0.6 * 0.4 = 0.144
– Try P = 0.8: L(O) = 0.8 * 0.8 * 0.2 = 0.128
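A minimal Python sketch of this grid search (not part of the original slides): it evaluates the same likelihood L(O) = P·P·(1−P) over candidate values of P and shows that the maximum lies at the empirical frequency of heads, P = 2/3.

```python
import numpy as np

# Observations: two heads, one tail (order does not matter for the likelihood).
heads, tails = 2, 1

def likelihood(P):
    """Likelihood of the observed tosses under p(head) = P."""
    return P**heads * (1 - P)**tails

# Evaluate the candidate values from the slide.
for P in [0.2, 0.4, 0.6, 0.8]:
    print(f"P = {P:.1f}  L(O) = {likelihood(P):.3f}")

# Fine grid search for the maximum-likelihood estimate.
grid = np.linspace(0, 1, 1001)
P_ml = grid[np.argmax(likelihood(grid))]
print(f"ML estimate: P = {P_ml:.3f}  (analytically 2/3)")
```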

13 A bit more complicated example…: Mixture Model
Three baskets B1, B2, B3 with white (O = 1), grey (O = 2), and black (O = 3) balls.
15 balls were drawn as follows:
1. Choose a basket i according to p(i)
2. Draw a ball j from basket i with probability p(j | i)
Use ML to estimate the model parameters given the observations: the sequence of ball colors.

14 Likelihood of observations; log-likelihood of observations; maximum likelihood of observations (slide equations sketched below).
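The slide's equations were lost in transcription. A generic reconstruction for a mixture model with hidden component x and i.i.d. observations y_1, …, y_N, which should correspond to what the slide shows:

```latex
% Likelihood, log-likelihood, and ML estimate for a mixture with hidden x.
\begin{align*}
  L(\theta) &= \prod_{n=1}^{N} p(y_n;\theta)
             = \prod_{n=1}^{N} \sum_{x} p(x;\theta)\, p(y_n \mid x;\theta)\\
  \log L(\theta) &= \sum_{n=1}^{N} \log \sum_{x} p(x;\theta)\, p(y_n \mid x;\theta)\\
  \hat{\theta}_{ML} &= \arg\max_{\theta}\, \log L(\theta)
\end{align*}
```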

15 Likelihood of the observed data
– x – hidden random variables [e.g. basket]
– y – observed random variables [e.g. color]
– θ – model parameters [e.g. they define p(y|x)]
– θ0 – current estimate of the model parameters
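With these definitions, the standard EM auxiliary quantity (not transcribed from the slide) is the expected complete-data log-likelihood under the posterior at the current estimate θ0:

```latex
% EM auxiliary function: expectation of the complete-data log-likelihood
% over the hidden variables, taken at the current parameter estimate \theta_0.
\begin{align*}
  Q(\theta \mid \theta_0)
    &= \sum_{x} p(x \mid y; \theta_0)\, \log p(x, y; \theta)\\
  \theta^{\text{new}} &= \arg\max_{\theta}\, Q(\theta \mid \theta_0)
\end{align*}
```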

16 [Equation slide – content not transcribed]

17 Expectation-maximization algorithm (I)
1. Expectation – compute the posterior over the hidden variables given the current parameters
2. Maximization – re-estimate the parameters to maximize the expected complete-data log-likelihood
The EM algorithm converges to a local maximum of the likelihood.
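A minimal, self-contained EM sketch for the earlier baskets-and-colors mixture (an illustration, not code from the talk); the basket priors and per-basket color distributions are assumed placeholder parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 15 observed colors (0 = white, 1 = grey, 2 = black); baskets are hidden.
colors = rng.integers(0, 3, size=15)
K, C = 3, 3  # number of baskets, number of colors

# Random initialization of the parameters.
prior = np.full(K, 1.0 / K)                    # p(basket)
emit = rng.dirichlet(np.ones(C), size=K)       # p(color | basket), shape K x C

for _ in range(100):
    # E-step: posterior p(basket | color) for each observation, shape N x K.
    post = prior * emit[:, colors].T
    post /= post.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters from the expected counts.
    prior = post.mean(axis=0)
    for c in range(C):
        emit[:, c] = post[colors == c].sum(axis=0)
    emit /= emit.sum(axis=1, keepdims=True)

log_lik = np.log((prior * emit[:, colors].T).sum(axis=1)).sum()
print("final log-likelihood:", round(log_lik, 3))
```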

18 Log-likelihood is non-decreasing, examples

19 EM – another approach. Goal: maximize the log-likelihood via a lower bound obtained from Jensen's inequality for a concave function (the log); see the sketch below.
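A reconstruction of the standard bound this slide refers to (the free-energy view of EM, cf. Neal & Hinton); the talk's exact notation for F may differ.

```latex
% Jensen's inequality (log is concave) gives a lower bound on the log-likelihood
% for any distribution q(x) over the hidden variables:
\begin{align*}
  \log L(\theta) = \log \sum_{x} p(x, y; \theta)
    &= \log \sum_{x} q(x)\, \frac{p(x, y; \theta)}{q(x)}\\
    &\ge \sum_{x} q(x) \log \frac{p(x, y; \theta)}{q(x)}
      \;=\; F(q, \theta),
\end{align*}
% with equality when q(x) = p(x | y; \theta).
```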

20–21 [Derivation slides – equations not transcribed]

22 Expectation-maximization algorithm (II)
1. Expectation – maximize F(q, θ) over q with θ fixed
2. Maximization – maximize F(q, θ) over θ with q fixed
(I) and (II) are equivalent.
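In symbols, the standard coordinate-ascent form of EM on the bound F from slide 19 (a reconstruction, not the slide's own typesetting):

```latex
% EM as coordinate ascent on the lower bound F(q, \theta):
\begin{align*}
  \text{E-step:}\quad q^{(k+1)} &= \arg\max_{q}\, F(q, \theta^{(k)})
      \;=\; p(x \mid y; \theta^{(k)})\\
  \text{M-step:}\quad \theta^{(k+1)} &= \arg\max_{\theta}\, F(q^{(k+1)}, \theta)
\end{align*}
```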

23 Scheme of the approach

24 Overview of the talk
– Short review of the Information Bottleneck
– Maximum Likelihood
→ Information Bottleneck and Maximum Likelihood for a toy problem
– Example from Image Segmentation

25 Words – Y, Documents – X, Topics – t
Generative model: t ~ π(t), x ~ π(x), y|t ~ π(y|t)

26 Model parameters and sampling algorithm
For i = 1:N
– choose x_i by sampling from π(x)
– choose y_i by sampling from π(y | t(x_i))
– increase n(x_i, y_i) by one
Example: x_i = 9, t(9) = 2, sample from π(y | 2) and get y_i = “Drug”, so set n(9, “Drug”) = n(9, “Drug”) + 1
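A small Python sketch of this sampling loop (illustrative only; the topic assignment t(x) and the distributions π below are made-up stand-ins for the slide's parameters):

```python
import numpy as np

rng = np.random.default_rng(1)

n_docs, n_words, n_topics, N = 10, 5, 3, 1000

# Assumed model parameters (placeholders, not the talk's actual values).
t_of_x = rng.integers(1, n_topics + 1, size=n_docs)       # t(x): topic of each document
pi_x = np.full(n_docs, 1.0 / n_docs)                       # pi(x): document distribution
pi_y_given_t = rng.dirichlet(np.ones(n_words), n_topics)   # pi(y | t): word dist. per topic

n = np.zeros((n_docs, n_words), dtype=int)                 # co-occurrence counts n(x, y)
for _ in range(N):
    x = rng.choice(n_docs, p=pi_x)                         # choose x_i ~ pi(x)
    y = rng.choice(n_words, p=pi_y_given_t[t_of_x[x] - 1]) # choose y_i ~ pi(y | t(x_i))
    n[x, y] += 1                                           # increase n(x_i, y_i)

print(n)
```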

27 X:    1 2 3 4 5 6 7 8 9 10
   t(X): 1 2 2 1 3 2 3 1 2 3
[Plots of π(y|t=1), π(y|t=2), π(y|t=3)]

28 Toy problem: which parameters maximize the likelihood? T = topics, X = documents, Y = words; parameters: t(x) and π(y | t(x)).

29 EM approach: E-step, M-step, normalization factor (slide equations; a standard form is sketched below).
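The slide's update equations were not transcribed. One standard EM update for a multinomial mixture over documents with counts n(x, y), given as a reconstruction; the slide's exact normalization conventions may differ:

```latex
% Standard EM for a multinomial mixture of documents (reconstruction).
\begin{align*}
  \text{E-step:}\quad q(t \mid x) &= \frac{\pi(t)\prod_y \pi(y \mid t)^{\,n(x,y)}}{Z(x)},
  \qquad Z(x) = \sum_{t'} \pi(t')\prod_y \pi(y \mid t')^{\,n(x,y)}\\
  \text{M-step:}\quad \pi(t) &\propto \sum_x q(t \mid x),
  \qquad \pi(y \mid t) \propto \sum_x q(t \mid x)\, n(x, y)
\end{align*}
```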

30 IB approach: iterative IB equations with their normalization factor (cf. slide 10).

31 The ML ↔ IB mapping [equations]; r is a scaling constant.

32 When X is uniformly distributed and r = |X|, the EM algorithm is equivalent to the iterative IB algorithm. [Diagram: the mapping takes fixed points of iterative IB to fixed points of EM and vice versa.]

33 X is uniformly distributed, β = n(x)
– All the fixed points of the likelihood L are mapped to all the fixed points of the IB functional L_IB = I(T;X) − β I(T;Y)
– At the fixed points, −log L ≅ L_IB + const
[Diagram: ML ↔ IB mapping between fixed points]

34 X is uniformly distributed, β = n(x)
– −(1/r) F − β H(Y) = L_IB, hence −F ≅ L_IB + const
– Every algorithm increases F iff it decreases L_IB

35 Deterministic case: N → ∞ (or β → ∞). EM and IB update equations in this limit [not transcribed].

36 N → ∞ (or β → ∞) – uniformity of X is not assumed here
– All the fixed points of the likelihood L are mapped to all the fixed points of the IB functional L_IB
– −F ≅ L_IB + const
– Every algorithm which finds a fixed point of L induces a fixed point of L_IB, and vice versa
– In case of several different fixed points, the solution that maximizes L is mapped to the solution that minimizes L_IB.

37 Example: π(x) – “Yellow submarine”: 2/3, “Red bull”: 1/3; N = ∞ (β = ∞)
t | EM π(t) | IB q(t)
1 |   1/2   |   2/3
2 |   1/2   |   1/3
This does not mean that q(t) = π(t).

38 When N → ∞, every algorithm increases F iff it decreases L_IB with β → ∞.
– How large must N (or β) be?
– How is it related to the “amount of uniformity” in n(x)?

39 Simulations for iIB

40 Simulations for EM

41 Simulations: 200 runs = 100 (small N) + 100 (large N)
– In 58 runs, iIB converged to a smaller value of (−F) than EM
– In 46 runs, EM converged to a value of (−F) related to a smaller value of L_IB

42 Quality estimation for the EM solution. The quality of an IB solution is measured against the theoretical upper bound I(T;Y) ≤ I(X;Y). Using the mapping, one can adapt this measure to the ML estimation problem, for large enough N. [Diagram: IB ↔ ML]

43 Summary: IB versus ML
– ML and IB approaches are equivalent under certain conditions
– Models comparison:
  – The mixture model assumes that Y is independent of X given T(X): X – T – Y
  – In the IB framework, T is defined through the IB Markovian independence relation: T – X – Y
– The quality estimation measure can be adapted from IB to the ML estimation problem, for large N

44 Overview of the talk
– Brief review of the Information Bottleneck
– Maximum Likelihood
– Information Bottleneck and Maximum Likelihood
→ Example from Image Segmentation (L. Hermes et al.)

45 The clustering model
– Pixels o_i, i = 1, …, n
– Deterministic clusters c_α, α = 1, …, k
– Boolean assignment matrix M ∈ {0, 1}^(n × k), Σ_α M_iα = 1
– Observations [figure]

46 Observations [figure: pixel o_i and its neighborhood]

47 Likelihood. Discretization of the color space into intervals I_j; set the per-cluster color probabilities; data likelihood (a generic sketch follows).
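The slide's equations were not transcribed. As a rough orientation only, a generic hard-assignment clustering likelihood of this type (not necessarily the exact form used by Hermes et al.; n_ij and p_α below are illustrative names) looks like:

```latex
% Generic hard-assignment likelihood: each pixel o_i belongs to exactly one
% cluster (M_{i\alpha} = 1), and its discretized color observations are scored
% against that cluster's distribution over the color intervals I_j.
\begin{align*}
  \log P(\text{data} \mid M, \theta)
    \;=\; \sum_{i=1}^{n} \sum_{\alpha=1}^{k} M_{i\alpha}
          \sum_{j} n_{ij}\, \log p_{\alpha}(I_j)
\end{align*}
% n_{ij}: number of observations of pixel i falling in interval I_j;
% p_{\alpha}(I_j): cluster \alpha's probability of interval I_j.
```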

48 Relation to the IB

49 Log-likelihood and the IB functional: assume that n_i = const and set β = n_i; then L_IB = −log L.

50 Images generated from the learned statistics

51 References
– N. Tishby, F. Pereira, and W. Bialek. The Information Bottleneck Method.
– N. Slonim and Y. Weiss. Maximum Likelihood and the Information Bottleneck.
– R. M. Neal and G. E. Hinton. A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants.
– J. Goldberger. Lecture notes.
– L. Hermes, T. Zoller, and J. M. Buhmann. Parametric Distributional Clustering for Image Segmentation.
The end

