Phase Transitions in the Information Distortion. NIPS 2003 Workshop on Information Theory and Learning: The Bottleneck and Distortion Approach, December 13, 2003.


Phase Transitions in the Information Distortion NIPS 2003 workshop on Information Theory and Learning: The Bottleneck and Distortion Approach December 13, 2003 Albert E. Parker Department of Mathematical Sciences Center for Computational Biology Montana State University Collaborators: Tomas Gedeon, Alex Dimitrov, John Miller, and Zane Aldworth

The Goal: To determine the phase transitions, or the bifurcation structure, of solutions to clustering problems of the form

max_{q ∈ Δ} G(q) constrained by D(q) ≥ I_0

where
- Δ is the set of valid conditional probabilities in R^{NK},
- G and D are sufficiently smooth in Δ,
- G and D have symmetry: they are invariant to relabelling of the classes of Z,
- the Hessians of G and of D with respect to q are block diagonal.

[Figure: the clustering q(Z|X) maps the K objects of X to the N clusters of Z.]

A similar formulation: Using the method of Lagrange multipliers, the goal of determining the bifurcation structure of solutions of the optimization problem can be rephrased as finding the bifurcation structure of stationary points of the problem

max_{q ∈ Δ} (G(q) + βD(q)), where β ∈ [0, ∞)

where
- Δ is the set of valid conditional probabilities in R^{NK},
- G and D are sufficiently smooth in Δ,
- G and D have symmetry: they are invariant to relabelling of the classes of Z,
- the Hessian of G + βD with respect to q is block diagonal, and satisfies a set of regularity conditions at bifurcation (e.g. the kernel of each block is one dimensional).

How: Use the Symmetries. By capitalizing on the symmetries of the cost functions, we have described the bifurcation structure of stationary points of problems of the form

max_{q ∈ Δ} G(q) constrained by D(q) ≥ I_0, or max_{q ∈ Δ} (G(q) + βD(q)) where β ∈ [0, ∞)

where
- Δ is the set of valid conditional probabilities in R^{NK},
- G and D are sufficiently smooth in Δ,
- G and D have symmetry: they are invariant to relabelling of the classes of Z,
- the Hessian of G + βD with respect to q is block diagonal, and satisfies a set of regularity conditions at bifurcation (e.g. the kernel is one dimensional).

Examples optimizing at a distortion level D(X,Z) ≤ D_0:
- Rate Distortion Theory (Shannon, 1950s), minimal informative compression: max −I(X;Z) constrained by D(X,Z) ≤ D_0.
- Deterministic Annealing (Rose, 1998), a clustering algorithm: max H(Z|X) constrained by D(X,Z) ≤ D_0.
The two cost functions are related by I(X;Z) = H(Z) − H(Z|X).

Inputs, Outputs, and Clustered Outputs. [Figure: the inputs Y (L objects {y_i}) and the outputs X (K objects {x_i}) are related by the joint distribution p(X,Y); the clustering q(Z|X) maps the outputs X to the clustered outputs Z (N objects {z_i}).]

Two methods which use an information distortion function to cluster:
- Information Bottleneck Method (Tishby, Pereira, Bialek 1999): min I(X;Z) constrained by D_I(X,Z) ≥ D_0, or equivalently max −I(X;Z) + βI(Y;Z). The Hessian here is always singular (−I(X;Z) is not strictly concave), so the theory which follows does not apply.
- Information Distortion Method (Dimitrov and Miller 2001): max H(Z|X) constrained by D_I(X,Z) ≥ D_0, or equivalently max H(Z|X) + βI(Y;Z). Here H(Z|X) is strictly concave, so the theory which follows does apply.
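Both methods trade compression of X against retained information about Y. As a concrete illustration (a minimal sketch of our own, not code from the talk), the quantities involved can be computed directly from a joint distribution p(X,Y) and a clustering q(Z|X):

```python
# Sketch: computing the information quantities used by both methods.
# Array shapes and function names here are our own assumptions.
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero entries are ignored."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(pab):
    """I(A;B) in bits for a joint distribution pab[a, b]."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return np.sum(pab[mask] * np.log2((pab / (pa * pb))[mask]))

def cond_entropy_z_given_x(q, px):
    """H(Z|X) for a clustering q[x, z] = q(z|x) and marginal px[x]."""
    mask = q > 0
    return -np.sum((px[:, None] * q)[mask] * np.log2(q[mask]))

def info_distortion_terms(q, pxy):
    """Return H(Z|X) and the relevance term I(Y;Z) induced by q(Z|X)."""
    px = pxy.sum(axis=1)
    pyz = pxy.T @ q          # p(y, z) = sum_x p(x, y) q(z|x)
    return cond_entropy_z_given_x(q, px), mutual_information(pyz)
```

The identity I(X;Z) = H(Z) − H(Z|X), used on the earlier slide, can be verified numerically with these helpers.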

A basic annealing algorithm to solve max_{q ∈ Δ} (G(q) + βD(q)): Let q_0 be the maximizer of max_q G(q), and let β_0 = 0. For k ≥ 0, let (q_k, β_k) be a solution to max_q G(q) + β_k D(q). Iterate the following steps until β_K = β_max for some K.
1. Perform a β-step: let β_{k+1} = β_k + d_k, where d_k > 0.
2. Take as the initial guess for q_{k+1} at β_{k+1} the point q_{k+1}^(0) = q_k + η, for some small perturbation η.
3. Optimization: solve max_q (G(q) + β_{k+1} D(q)) to get the maximizer q_{k+1}, using the initial guess q_{k+1}^(0).
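The annealing loop above can be sketched numerically. This is a minimal illustration, not the authors' implementation: we parameterize q(Z|X) by softmax logits so the simplex constraint Δ holds automatically, use plain finite-difference gradient ascent as the inner solver, and take G = H(Z|X), D = I(Y;Z) as in the Information Distortion problem. All step sizes and tolerances are our own choices.

```python
# Sketch of the basic annealing algorithm (our own toy implementation).
import numpy as np

def softmax(w):
    e = np.exp(w - w.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def F(w, pxy, beta):
    """G(q) + beta*D(q) with G = H(Z|X) and D = I(Y;Z)."""
    q = softmax(w)                          # q[x, z] = q(z|x)
    px = pxy.sum(axis=1)
    G = -np.sum(px[:, None] * q * np.log2(q + 1e-300))
    pyz = pxy.T @ q                         # p(y, z)
    py = pyz.sum(axis=1, keepdims=True)
    pz = pyz.sum(axis=0, keepdims=True)
    mask = pyz > 0
    D = np.sum(pyz[mask] * np.log2((pyz / (py * pz))[mask]))
    return G + beta * D

def anneal(pxy, N=2, beta_max=2.0, d=0.1, steps=200, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    K = pxy.shape[0]
    w = np.zeros((K, N))                    # q_0 = uniform 1/N maximizes G
    beta, h = 0.0, 1e-5
    while beta < beta_max:
        beta += d                           # beta-step
        w += 1e-3 * rng.standard_normal(w.shape)   # small perturbation eta
        for _ in range(steps):              # inner maximization of F
            grad = np.zeros_like(w)
            for i in range(K):
                for j in range(N):
                    w[i, j] += h
                    fp = F(w, pxy, beta)
                    w[i, j] -= 2 * h
                    fm = F(w, pxy, beta)
                    w[i, j] += h
                    grad[i, j] = (fp - fm) / (2 * h)
            w += lr * grad
    return softmax(w)
```

Real implementations would use the analytic gradient and a Newton-type inner solver; the skeleton only mirrors the three steps of the slide.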

Application of the annealing method to the Information Distortion problem max_{q ∈ Δ} (H(Z|X) + βI(Y;Z)) when p(X,Y) is defined by four Gaussian blobs. [Figure: the joint distribution p(X,Y) over L=52 inputs Y and K=52 outputs X, and the clustering q(Z|X) of the K=52 outputs into N=4 clustered outputs Z.]
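A joint distribution of this "four Gaussian blobs" type can be sketched as follows. The blob centers and widths below are our own illustrative choices, not the values used in the talk:

```python
# Sketch of a four-Gaussian-blob joint distribution p(X,Y) on a 52x52 grid.
import numpy as np

def four_blob_joint(K=52, L=52, sigma=4.0):
    """Mixture of four isotropic Gaussian bumps, normalized to sum to 1."""
    x, y = np.meshgrid(np.arange(K), np.arange(L), indexing="ij")
    centers = [(13, 13), (13, 39), (39, 13), (39, 39)]  # assumed positions
    p = np.zeros((K, L))
    for cx, cy in centers:
        p += np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
    return p / p.sum()
```

With N=4 clusters, each blob should eventually claim its own class as β grows.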

Evolution of the optimal clustering: observed bifurcations for the four-blob problem. We just saw the optimal clusterings q* at some β* = β_max. What do the clusterings look like for β < β_max? [Figure: bifurcation diagram, I(Y;Z) in bits versus β.]

Conceptual Bifurcation Structure: the observed bifurcations for the four-blob problem raise several questions.
- Why are there only 3 bifurcations observed? In general, are there only N−1 bifurcations?
- What kinds of bifurcations do we expect: pitchfork-like, transcritical, saddle-node, or some other type?
- How many bifurcating branches are there? What do the bifurcating branches look like? Are they 1st-order phase transitions (subcritical) or 2nd-order phase transitions (supercritical)?
- What is the stability of the bifurcating branches?
- Is there always a bifurcating branch which contains solutions of the optimization problem?
- Are there bifurcations after all of the classes have resolved?

Recall the Symmetries: To better understand the bifurcation structure, we capitalize on the symmetries of the function G(q) + βD(q). [Figure: a clustering q(Z|X) of the K objects {x_i} into the N classes {z_i}.]

[Figure: the same clustering with two of the classes relabelled; the cost function is unchanged.]

The symmetry group of all permutations on N symbols is S_N.
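The relabelling symmetry can be checked numerically. The sketch below (our own illustration, using the Information Distortion cost G + βD = H(Z|X) + βI(Y;Z)) verifies that permuting the columns of q(Z|X), i.e. relabelling the classes of Z, leaves the cost unchanged, so S_N acts as a symmetry group of the problem:

```python
# Sketch: numerical check that the cost is invariant under class relabelling.
import itertools
import numpy as np

def cost(q, pxy, beta):
    """H(Z|X) + beta * I(Y;Z) for a clustering q[x, z] = q(z|x)."""
    px = pxy.sum(axis=1)
    G = -np.sum(px[:, None] * q * np.log2(q + 1e-300))   # H(Z|X)
    pyz = pxy.T @ q
    py = pyz.sum(axis=1, keepdims=True)
    pz = pyz.sum(axis=0, keepdims=True)
    mask = pyz > 0
    D = np.sum(pyz[mask] * np.log2((pyz / (py * pz))[mask]))  # I(Y;Z)
    return G + beta * D

def check_relabelling_invariance(q, pxy, beta=1.0):
    """True if every permutation of the N class labels gives the same cost."""
    base = cost(q, pxy, beta)
    return all(
        np.isclose(cost(q[:, list(perm)], pxy, beta), base)
        for perm in itertools.permutations(range(q.shape[1]))
    )
```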

A partial subgroup lattice for S_N when N = 4.

A partial lattice of the maximal subgroups S_2 × S_2 of S_4.

This Group Structure determines the Bifurcation Structure

Define a Gradient Flow. Goal: to determine the bifurcation structure of stationary points of max_{q ∈ Δ} (G(q) + βD(q)). Method: study the equilibria of the gradient flow of the Lagrangian L(q, λ, β). Equilibria of this system (in R^{NK+K}) are possible solutions of the optimization problem. The Jacobian of the flow at (q*, β*) is symmetric, and so only bifurcations of equilibria can occur. The first equilibrium is q*(β_0 = 0) ≡ 1/N. If w^T ∇²_q F(q*, β) w < 0 for every w ∈ ker J, then q*(β) is a maximizer of the optimization problem.

Symmetry Breaking Bifurcations. [Figure: a sequence of bifurcation diagrams, q* versus β, showing the clusters resolving one at a time as β increases.]


Existence Theorems for Bifurcating Branches. Given a bifurcation at a point fixed by S_N:
- Equivariant Branching Lemma (Vanderbauwhede and Cicogna): there are N bifurcating branches, each of which has symmetry S_{N−1}.
- Smoller-Wasserman Theorem (Smoller and Wasserman): there are N!/(2·m!·n!) bifurcating branches which have symmetry S_m × S_n if N = m + n.

Similarly, given a bifurcation at a point fixed by S_{N−1}:
- Equivariant Branching Lemma: there are N−1 bifurcating branches, each of which has symmetry S_{N−2}.
- Smoller-Wasserman Theorem: there are (N−1)!/(2·m!·n!) bifurcating branches which have symmetry S_m × S_n if N−1 = m + n.
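The branch counts given by the two theorems are simple combinatorial quantities. A small helper (our own illustration, following the formulas exactly as stated above):

```python
# Sketch: branch counts from the existence theorems above.
from math import factorial

def ebl_branches(N):
    """Equivariant Branching Lemma: N branches with symmetry S_{N-1}."""
    return N

def sw_branches(N, m, n):
    """Smoller-Wasserman count N!/(2*m!*n!) for symmetry S_m x S_n, N=m+n."""
    assert m + n == N
    return factorial(N) // (2 * factorial(m) * factorial(n))
```

For example, at a bifurcation fixed by S_4 the Equivariant Branching Lemma gives 4 branches with symmetry S_3, and there are 4!/(2·2!·2!) = 3 branches with symmetry S_2 × S_2, matching the lattice of maximal subgroups shown earlier.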

Group Structure and Observed Bifurcation Structure: the Equivariant Branching Lemma shows that the bifurcation structure contains these branches … [Figure: bifurcation diagram, q* versus β.]

The subgroups {S_2 × S_2} give additional structure … [Figure: bifurcation diagram, q* versus β.]

Observed Bifurcation Structure. Theorem: there are exactly K bifurcations on the branch (q_{1/N}, β) whenever the Hessian of G at q_{1/N} is nonsingular. There are K = 52 bifurcations on the first branch.

A partial subgroup lattice for S_4 and the corresponding bifurcating directions given by the Equivariant Branching Lemma.

A partial subgroup lattice for S_4 and the bifurcating directions corresponding to subgroups isomorphic to S_2 × S_2.

This theory enables us to answer the questions previously posed …

Conceptual Bifurcation Structure: answers.
- Why are there only 3 bifurcations observed? In general, are there only N−1 bifurcations? There are N−1 symmetry breaking bifurcations from S_M to S_{M−1} for M ≤ N.
- What kinds of bifurcations do we expect? How many bifurcating solutions are there? There are at least N from the first bifurcation, at least N−1 from the next one, etc.
- What do the bifurcating branches look like? They are subcritical or supercritical depending on the sign of the bifurcation discriminator evaluated at (q*, β*, u_k).
- What is the stability of the bifurcating branches? Is there always a bifurcating branch which contains solutions of the optimization problem? No.
- Are there bifurcations after all of the classes have resolved? Generically, no.

Continuation techniques numerically illustrate the theory using the Information Distortion

[Figure: numerically computed bifurcation diagram, I(Y;Z) in bits versus β.]

Bifurcating branches with symmetry S_2 × S_2. [Figure: bifurcation diagram, I(Y;Z) in bits versus β.]

Additional structure!! [Figure: a magnified region of the bifurcation diagram, I(Y;Z) in bits versus β.]

A closer look … [Figure: a zoomed-in view of the bifurcation diagram, I(Y;Z) in bits versus β.]

Bifurcation from S_4 to S_3 … [Figure: the corresponding region of the bifurcation diagram, I(Y;Z) in bits versus β.]

The bifurcation from S_4 to S_3 is subcritical … (the theory predicted this, since the bifurcation discriminator evaluated at (q_{1/4}, β*, u) is negative). [Figure: I(Y;Z) in bits versus β.]

What does this mean regarding solutions of the original problems?
(4) R_H(I_0) = max_{q ∈ Δ} H(Z|X) constrained by I(Y;Z) ≥ I_0
(7) max_{q ∈ Δ} (H(Z|X) + βI(Y;Z))

Theorem: dR/dI_0 = −β(I_0) and d²R/dI_0² = −dβ(I_0)/dI_0, where
(4) R_H(I_0) = max_{q ∈ Δ} H(Z|X) constrained by I(Y;Z) ≥ I_0
(7) max_{q ∈ Δ} (H(Z|X) + βI(Y;Z))

R_H as a function of I_0: R_H(I_0) = max_{q ∈ Δ} H(Z|X) constrained by I(Y;Z) ≥ I_0 is a monotonically decreasing, continuous function which is neither convex nor concave. Theorem: dR/dI_0 = −β(I_0) and d²R/dI_0² = −dβ(I_0)/dI_0.
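The derivative formulas can be justified by an envelope-theorem argument (our own sketch, assuming the constraint is active at the optimum and β(I_0) is the corresponding Lagrange multiplier):

```latex
% Lagrangian for R_H(I_0) = max_q H(Z|X) subject to I(Y;Z) >= I_0,
% with multiplier beta(I_0) >= 0 at the active constraint:
\begin{align*}
  L(q,\beta; I_0) &= H(Z|X) + \beta\,\bigl(I(Y;Z) - I_0\bigr), \\
  R_H(I_0) &= L\bigl(q^*(I_0),\,\beta(I_0);\, I_0\bigr).
\end{align*}
% By the envelope theorem, only the explicit I_0-dependence survives
% when differentiating along the optimal path:
\begin{align*}
  \frac{dR_H}{dI_0} = \frac{\partial L}{\partial I_0} = -\beta(I_0),
  \qquad
  \frac{d^2 R_H}{dI_0^2} = -\frac{d\beta(I_0)}{dI_0}.
\end{align*}
```

In particular, β(I_0) is minus the slope of R_H, which is why the annealing parameter β traces out the curve R_H(I_0).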

Consequences??
- Analogue for the Information Distortion: R_H(I_0) = max_{q ∈ Δ} H(Z|X) constrained by I(Y;Z) ≥ I_0 is neither concave nor convex, since subcritical bifurcations and saddle nodes exist.
- Rate Distortion Function (from Information Theory): R(D_0) = min_{q ∈ Δ} I(X;Z) constrained by D(X,Z) ≤ D_0 is convex if D(X,Z) is linear in q (Rose, 1994; Cover and Thomas; Gray).
- Relevance Compression Function (for the Information Bottleneck): R_I(I_0) = min_{q ∈ Δ} I(X;Z) constrained by I(Y;Z) ≥ I_0 is convex if N > K+1 (Witsenhausen and Wyner 1975; Bachrach et al. 2003).

So What?? R_I(I_0) and R_H(I_0) are related by I(X;Z) = H(Z) − H(Z|X). Since the Relevance Compression Function is convex for N > K+1 (Bachrach et al. 2003), the Information Bottleneck cannot have a subcritical bifurcation when N > K+1. Are there subcritical bifurcations when N < K+1? Is R_H(I_0) convex when N > K+1? That would mean that the subcritical bifurcations go away when considering the gradient flow in R^{(K+2)K} instead of R^{NK}.

Application to cricket sensory data. [Figure: spike patterns, the optimal clustering, and E(Y|Z), the stimulus means conditioned on each of the classes.]

Conclusions … We have a complete theoretical picture of how the clusterings evolve for a class of annealing problems of the form max_{q ∈ Δ} (G(q) + βD(q)), subject to the assumptions stated earlier.
- When clustering to N classes, there are N−1 bifurcations.
- In general, there are only pitchfork and saddle-node bifurcations.
- We can determine whether pitchfork bifurcations are subcritical or supercritical (1st- or 2nd-order phase transitions).
- We know the explicit bifurcating directions.
SO WHAT?? There are theoretical consequences, and the analysis suggests an algorithm for solving the annealing problem (NIPS 2002).