
1 NIPS 2003 Workshop on Information Theory and Learning: The Bottleneck and Distortion Approach
Organizers: Thomas Gedeon, Naftali Tishby
http://www.cs.huji.ac.il/~tishby/NIPS-Workshop

2 Saturday, December 13

Morning session
7:30-8:20 N. Tishby – Introduction: A new look at Shannon’s Information Theory
8:20-9:00 T. Gedeon – Mathematical structure of Information Distortion methods
9:00-9:30 D. Miller – Deterministic annealing
9:30-10:00 N. Slonim – Multivariate Information Bottleneck
10:00-10:30 B. Mumey – Optimal Mutual Information Quantization is NP-complete

Afternoon session
16:00-16:20 G. Elidan – The Information Bottleneck EM algorithm
16:20-16:40 A. Globerson – Sufficient Dimensionality Reduction with Side Information
16:40-17:00 A. Parker – Phase Transitions in the Information Distortion
17:00-17:20 A. Dimitrov – Information Distortion as a model of sensory processing
break
17:40-18:00 J. Sinkkonen – IB-type clustering for continuous finite data
18:00-18:20 G. Chechik – GIB: Gaussian Information Bottleneck
18:20-18:40 Y. Crammer – IB and data-representation: Bregman to the rescue
18:40-19:00 B. Westover – Achievable rates for Pattern Recognition

3 Outline:
- The fundamental dilemma: Simplicity/Complexity versus Accuracy
- Lessons from Statistical Physics for Machine Learning
- Is there a “right level” of description? A Variational Principle
- Shannon’s source and channel coding – dual problems
- [Mutual] Information as the Fundamental Quantity
- Information Bottleneck (IB) and Efficient Representations
- Finite Data Issues
- Algorithms, Applications and Extensions:
  - Words, documents and meaning…
  - Understanding neural codes…
  - Quantization of side-information
  - Relevant linear dimension reduction: Gaussian IB
  - Using “irrelevant” side information
  - Multivariate-IB and Graphical Models
  - Sufficient Dimensionality Reduction (SDR)
- The importance of being principled…

4 Learning Theory: The fundamental dilemma…
[Figure: data points (X, Y) fitted by a curve y = f(x)]
Good models should enable prediction of new data…
Tradeoff between accuracy and simplicity

5 Or in unsupervised learning…
[Figure: a scatter of points in the X–Y plane]
Cluster models?

6 Lessons from Statistical Physics
Physical systems with many degrees of freedom: what is the right level of description?
- “Microscopic variables” – dynamical variables (positions, momenta, …)
- “Variational functions” – thermodynamic potentials (energy, entropy, free energy, …)
- Competition between order (energy) and disorder (entropy)
- Emergent “relevant” description: order parameters (magnetization, …)

7 Statistical Models in Learning and AI
Statistical models of complex systems: what is the right level of description?
- “Microscopic variables” – probability distributions (over all observed and hidden variables)
- “Variational functions” – log-likelihood, entropy, …? (multi-information, “free energy”, …)
- Competition between accuracy and complexity
- Emergent “relevant” description: features (clusters, sufficient statistics, …)

8 The Fundamental Dilemma (of science): Model Complexity vs. Prediction Accuracy
[Figure: possible models/representations in the complexity–accuracy plane, with the labels Efficiency, Survivability, Limited data, and Bounded computation]

9 Can we quantify it…?
When there is a (relevant) prediction or distortion measure:
- Accuracy – small average distortion (good predictions)
- Complexity – short description (high compression)
A general tradeoff between distortion and compression: Shannon’s Information Theory
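In Shannon’s terms (a standard formalization, not spelled out on the slide), with X̂ a compressed representation of X and d a distortion measure, the two sides of the tradeoff read:

\[
\text{Complexity} \;\leftrightarrow\; R = I(X;\hat X) \ \text{(description length in bits)},
\qquad
\text{Accuracy} \;\leftrightarrow\; \langle d \rangle = \sum_{x,\hat x} p(x)\,p(\hat x \mid x)\, d(x,\hat x).
\]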

10 [Figure: Shannon’s communication diagram – sent messages pass from the source through source coding (compression) and channel coding (error correction) into the channel as symbols (0110101001110…), then through channel decoding and source decoding (decompression) to the received messages; source entropy governs compression (rate vs. distortion), channel capacity governs error correction (capacity vs. efficiency)]

11 Compression and Mutual Information
[Figure: a large blue blob containing many small, non-overlapping green blobs]
We need to index the maximal number of non-overlapping green blobs inside the blue blob: (mutual information!)
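A standard way to make the blob-counting precise (assuming the blobs depict typical sets): the number of distinguishable, non-overlapping conditionally typical sets of Y that fit inside the full typical set is

\[
\#\{\text{blobs}\} \;\approx\; \frac{2^{H(Y)}}{2^{H(Y\mid X)}} \;=\; 2^{\,H(Y)-H(Y\mid X)} \;=\; 2^{\,I(X;Y)},
\]

so indexing them requires about I(X;Y) bits.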

12 Axioms for “(multi) information” (with I. Nemenman)
- M1: For two nodes, X and Y, the shared information I[X;Y] is a continuous function of p(x,y).
- M2: When one of the variables defines the other uniquely (a 1-1 mapping from X to Y) and both are uniform, p(x) = p(y) = 1/K, then I[X;Y] is a monotonic function of K.
- M3: If X is partitioned into two subsets, X1 and X2, then the amount of shared information is additive in the following sense: I[X;Y] = I[(X1,X2);Y] + p(X1) I[X;Y|X1] + p(X2) I[X;Y|X2].
- M4: Shared information is symmetric: I[X;Y] = I[Y;X].
- M5: Theorem: up to the choice of units (a multiplicative constant), the only functional satisfying M1–M4 is the mutual information (given below).
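The functional singled out by these axioms is the mutual information, and the multi-information of the title is its n-variable generalization (standard definitions, given here for completeness):

\[
I[X;Y] \;=\; \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)},
\qquad
I[X_1;\dots;X_n] \;=\; D_{KL}\!\left[\,p(x_1,\dots,x_n)\;\middle\|\;\prod_{i=1}^{n} p(x_i)\right].
\]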

13 The Dual Variational Problems of IT: Rate vs. Distortion and Capacity vs. Efficiency
Q: What is the simplest representation – fewer bits(/sec) (Rate) – for a given expected distortion (accuracy)?
A: (RDT, Shannon 1960) solve the rate-distortion problem below.
Q: What is the maximum number of bits(/sec) that can be sent reliably (prediction rate) for a given efficiency of the channel (power, cost)?
A: (Capacity-power tradeoff, Shannon 1956) solve the capacity-cost problem below.
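The two optimizations referred to above, in their standard form (d is the distortion measure, e the per-symbol cost or power):

\[
R(D) \;=\; \min_{\,p(\hat x\mid x)\;:\;\langle d(x,\hat x)\rangle \le D}\; I(X;\hat X),
\qquad
C(E) \;=\; \max_{\,p(x)\;:\;\langle e(x)\rangle \le E}\; I(X;Y).
\]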

14 Rate-Distortion Theory
The tradeoff between expected representation size (Rate) and the expected distortion is expressed by the rate-distortion function. By introducing a Lagrange multiplier, β, we obtain a variational principle, with a self-consistent solution (the three formulas are given below).
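In their standard textbook form (reconstructing the three formulas the slide refers to), with β the Lagrange multiplier and Z(x, β) a normalization:

\[
R(D) \;=\; \min_{\{p(\hat x\mid x)\,:\,\langle d(x,\hat x)\rangle \le D\}} I(X;\hat X),
\]
\[
\mathcal{F}\big[p(\hat x\mid x)\big] \;=\; I(X;\hat X) \;+\; \beta\,\langle d(x,\hat x)\rangle_{p(x)\,p(\hat x\mid x)},
\]
\[
p(\hat x\mid x) \;=\; \frac{p(\hat x)}{Z(x,\beta)}\; e^{-\beta\, d(x,\hat x)}.
\]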

15 The Rate-Distortion function, R(D), is the optimal rate for a given distortion and is a convex function.
[Figure: the convex curve R(D) in the (D, R) plane – bits vs. accuracy; the achievable region lies above the curve, the region below is impossible]

16 The Capacity–Efficiency Tradeoff
[Figure: the capacity-cost curve C(E) – capacity at efficiency (power) E, in bits/erg; the achievable region lies below the curve]

17 Double matching of source to channel eliminates the need for coding (?!)
[Figure: the R(D) and C(E) curves, matched at a common operating point]
In the case of a Gaussian channel (with a power constraint) and square distortion, double matching means (see below): no coding!
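One standard way to spell out the Gaussian double-matching condition (my reconstruction; the slide’s own equations are not in the transcript): for a Gaussian source of variance σ² sent over an additive white Gaussian noise channel with noise power N, input power E, and squared-error distortion,

\[
R(D) \;=\; \tfrac12 \log\frac{\sigma^2}{D},
\qquad
C(E) \;=\; \tfrac12 \log\!\left(1+\frac{E}{N}\right),
\]

and simply scaling the source to power E (uncoded transmission) already achieves D = σ²N/(E+N), i.e. R(D) = C(E), with no further coding.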

18 “There is a curious and provocative duality between the properties of a source with a distortion measure and those of a channel … if we consider channels in which there is a “cost” associated with the different input letters… The solution to this problem amounts, mathematically, to maximizing a mutual information under linear inequality constraint… which leads to a capacity-cost function C(E) for the channel… In a somewhat dual way, evaluating the rate-distortion function R(D) for a source amounts, mathematically, to minimizing a mutual information … again with a linear inequality constraint. This duality can be pursued further and is related to the duality between past and future and the notions of control and knowledge. Thus we may have knowledge of the past but cannot control it; we may control the future but have no knowledge of it.” C. E. Shannon (1959)

19 Bottlenecks and Neural Nets
- Auto-association: forcing compact representations
- The bottleneck layer provides a “relevant code” of the input w.r.t. the output
[Figure: bottleneck networks – an auto-associative net mapping Input to Output, and a predictive net mapping Sample 1 to Sample 2, i.e. Past to Future]

20 Bottlenecks in Nature… (with E. Schneidman, Rob de Ruyter, W. Bialek, N. Brenner)

21 The idea: find a compressed variable X̂ that enables a short encoding of X (small I(X; X̂)) while preserving as much as possible the information about the relevant signal Y (large I(X̂; Y)).

22 A Variational Principle
We want a short representation of X that keeps the information about another variable, Y, if possible.
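The variational principle, in the standard Information Bottleneck form (the slide’s own formula is not in the transcript):

\[
\min_{p(\hat x \mid x)} \; \mathcal{L}\big[p(\hat x \mid x)\big] \;=\; I(X;\hat X) \;-\; \beta\, I(\hat X;Y),
\]

where the Lagrange multiplier β sets the tradeoff between compression of X and preservation of the relevant information about Y.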

23 Self-Consistent Equations
- Marginal: p(x̂) = Σ_x p(x) p(x̂|x)
- Markov condition: p(y|x̂) = Σ_x p(y|x) p(x|x̂)
- Bayes’ rule: p(x|x̂) = p(x̂|x) p(x) / p(x̂)

24 The emergent effective distortion measure (given below):
- Regular if p(y|x) is absolutely continuous w.r.t. p(y|x̂)
- Small if x̂ predicts y as well as x does: p(y|x̂) ≈ p(y|x)
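The emergent distortion measure, in its standard IB form:

\[
d_{\mathrm{eff}}(x,\hat x) \;=\; D_{KL}\!\left[\,p(y\mid x)\,\middle\|\,p(y\mid \hat x)\,\right]
\;=\; \sum_{y} p(y\mid x)\,\log\frac{p(y\mid x)}{p(y\mid \hat x)}.
\]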

25 I-Projections on a Set of Distributions (Csiszár 75, 84)
The I-projection of a distribution q(x) onto a convex set of distributions L (given below):
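In the standard form (the slide’s formula is not in the transcript):

\[
p^{*} \;=\; \arg\min_{p \in L} \; D_{KL}\!\left[\,p \,\middle\|\, q\,\right].
\]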

26 The Blahut-Arimoto Algorithm
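A minimal runnable sketch of the alternating Blahut–Arimoto iteration for the rate-distortion problem (illustrative only; the function name, initialization, and convergence test are my own choices, not the slide’s):

```python
import numpy as np

def blahut_arimoto(p_x, d, beta, n_iter=200, tol=1e-8):
    """Alternating minimization for the rate-distortion Lagrangian.

    p_x  : (n,)   source distribution p(x)
    d    : (n, m) distortion matrix d(x, x_hat)
    beta : Lagrange multiplier trading rate against distortion
    Returns q(x_hat|x), q(x_hat), the rate I(X; X_hat) in bits,
    and the expected distortion.
    """
    n, m = d.shape
    q_xhat = np.full(m, 1.0 / m)                  # initial reproduction marginal
    for _ in range(n_iter):
        # 1) optimal conditional given the current marginal:
        #    q(x_hat|x) proportional to q(x_hat) * exp(-beta * d(x, x_hat))
        q_cond = q_xhat[None, :] * np.exp(-beta * d)
        q_cond /= q_cond.sum(axis=1, keepdims=True)
        # 2) optimal marginal given the current conditional
        q_new = p_x @ q_cond
        if np.max(np.abs(q_new - q_xhat)) < tol:
            q_xhat = q_new
            break
        q_xhat = q_new
    # rate and distortion at the fixed point
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(q_cond > 0, q_cond / q_xhat[None, :], 1.0)
        rate = np.sum(p_x[:, None] * q_cond * np.log2(ratio))
    distortion = np.sum(p_x[:, None] * q_cond * d)
    return q_cond, q_xhat, rate, distortion

# toy example: binary source, Hamming distortion
p_x = np.array([0.5, 0.5])
d = 1.0 - np.eye(2)
q_cond, q_xhat, R, D = blahut_arimoto(p_x, d, beta=3.0)
print(f"rate ~ {R:.3f} bits, distortion ~ {D:.3f}")
```

Sweeping β traces out the R(D) curve: each fixed point gives one (distortion, rate) pair.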

27 Information Bottleneck: the IB algorithm and its “free energy” (given below)
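The IB “free energy” in its usual variational form (treating p(x̂|x), p(x̂), and p(y|x̂) as independent arguments; my reconstruction of the missing formula):

\[
\mathcal{F}\big[p(\hat x\mid x);\,p(\hat x);\,p(y\mid \hat x)\big]
\;=\; I(X;\hat X) \;+\; \beta\,\Big\langle D_{KL}\!\left[\,p(y\mid x)\,\middle\|\,p(y\mid \hat x)\,\right]\Big\rangle_{p(x,\hat x)},
\]

which differs from the functional I(X;X̂) − βI(X̂;Y) only by the constant βI(X;Y), so both have the same minimizers.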

28 The IB emergent effective distortion measure and the IB algorithm (sketched below)
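A minimal sketch of the self-consistent (iterative) IB updates for a finite joint distribution p(x, y); the variable names, the ε-smoothing, and the random initialization are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def iterative_ib(p_xy, n_clusters, beta, n_iter=300, eps=1e-12, seed=0):
    """Self-consistent Information Bottleneck iterations.

    p_xy : (n_x, n_y) joint distribution p(x, y), assumed p(x) > 0 for all x
    Returns q(x_hat|x), q(x_hat), q(y|x_hat).
    """
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                       # p(x)
    p_y_given_x = p_xy / p_x[:, None]            # p(y|x)

    # random soft initialization of q(x_hat | x)
    q_t_given_x = rng.random((n_x, n_clusters))
    q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # marginal: q(x_hat) = sum_x p(x) q(x_hat|x)
        q_t = p_x @ q_t_given_x
        # Markov condition: q(y|x_hat) = sum_x p(x,y) q(x_hat|x) / q(x_hat)
        q_y_given_t = (q_t_given_x.T @ p_xy) / (q_t[:, None] + eps)
        # effective distortion: d(x, x_hat) = KL[ p(y|x) || q(y|x_hat) ]
        log_ratio = np.log(p_y_given_x[:, None, :] + eps) \
                  - np.log(q_y_given_t[None, :, :] + eps)
        d = np.sum(p_y_given_x[:, None, :] * log_ratio, axis=2)
        # update: q(x_hat|x) proportional to q(x_hat) * exp(-beta * d(x, x_hat))
        q_t_given_x = q_t[None, :] * np.exp(-beta * d)
        q_t_given_x /= q_t_given_x.sum(axis=1, keepdims=True)

    return q_t_given_x, q_t, q_y_given_t

# toy usage: a small random joint distribution, compressed into 3 relevant groups
p_xy = np.random.default_rng(1).random((20, 5))
p_xy /= p_xy.sum()
q_t_given_x, q_t, q_y_given_t = iterative_ib(p_xy, n_clusters=3, beta=5.0)
```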

29 Key Theorems
IB Coding Theorem (Wyner 1975; Bachrach, Navot, Tishby 2003): The IB variational problem provides a tight convex lower bound on the expected size of the achievable representations of X that maintain at least a given amount of mutual information about the variable Y. The bound can be achieved by the self-consistent IB solution. It is formally equivalent to “Channel coding with Side Information” (introduced in a very different context by Wyner, 75). The same bound holds for the dual channel-coding problem, and the optimal representation constitutes a “perfectly matched source-channel” and requires “no further coding”.
IB Algorithm Convergence (Tishby 99; Friedman, Slonim, Tishby 01): The generalized Arimoto-Blahut IB algorithm converges to a local minimum of the IB functional (including its generalized multivariate case).

30 Source (p(x)) | Channel (q(y|x))
Entropy | Capacity
Rate-Distortion function | Capacity-Expense function
Constrained Reliability-Rate function
Information Bottleneck (source) | Information Bottleneck (channel)

31 Summary
- Shannon’s Information Theory suggests a compelling framework for quantifying the “Complexity-Accuracy” dilemma by unifying source and channel coding as a Min-Max of mutual information.
- The IB method turns this unified coding principle into algorithms for extracting “informative” relevant structures from empirical joint probability distributions P(X1, X2).
- The Multivariate IB extends this framework to extract “informative” structures from multivariate distributions, P(X1, …, Xn), via Min-Max of I(X1, …, Xn) and graphical models.
- IB can be further extended to Sufficient Dimensionality Reduction, a method for finding informative continuous low-dimensional representations via Min-Max of I(X1, X2).

32 Many thanks to: Fernando Pereira, William Bialek, Nir Friedman, Noam Slonim, Gal Chechik, Amir Globerson, Ran Bachrach, Amir Navot, Ilya Nemenman


