Information Bottleneck. Presented by Boris Epshtein & Lena Gorelick. Advanced Topics in Computer and Human Vision, Spring 2004.

Agenda: Motivation; Information Theory – Basic Definitions; Rate Distortion Theory (Blahut-Arimoto algorithm); Information Bottleneck Principle; IB algorithms (iIB, dIB, aIB); Application.

Motivation: the clustering problem.

Motivation: “Hard” clustering – a partitioning of the input data into several exhaustive and mutually exclusive clusters. Each cluster is represented by a centroid.

Motivation: A “good” clustering should group similar data points together and dissimilar points apart. The quality of a partition is the average distortion between the data points and their corresponding representatives (cluster centroids).

Motivation: “Soft” clustering – each data point is assigned to all clusters with some normalized probability. Goal: minimize the expected distortion between the data points and the cluster centroids.

Motivation: Complexity-Precision Trade-off. A too simple model gives poor precision; higher precision requires a more complex model.

Motivation: Complexity-Precision Trade-off. A too simple model gives poor precision; higher precision requires a more complex model; a too complex model overfits.

Motivation: Complexity-Precision Trade-off. A too complex model can lead to overfitting, is hard to learn, and gives poor generalization. A too simple model cannot capture the real structure of the data. Examples of approaches to this trade-off: SRM (Structural Risk Minimization), MDL (Minimum Description Length), Rate Distortion Theory.

Agenda Motivation Information Theory - Basic Definitions Rate Distortion Theory –Blahut-Arimoto algorithm Information Bottleneck Principle IB algorithms –iIB –dIB –aIB Application

Definitions: Entropy – the measure of uncertainty about the random variable X: H(X) = -Σ_x p(x) log p(x).

Definitions: Entropy – Example. Fair coin (p = 0.5): H = 1 bit. Unfair coin: the entropy is lower, approaching 0 as the coin becomes deterministic.

Definitions: Entropy – Illustration. Entropy is highest for the uniform distribution and lowest (zero) for a deterministic one.
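Not part of the original slides: a minimal numeric sketch of the coin example above (the 0.9 bias is an arbitrary illustration).

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p (array of probabilities)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))   # fair coin      -> 1.0 bit (highest for 2 outcomes)
print(entropy([0.9, 0.1]))   # biased coin    -> ~0.469 bits
print(entropy([1.0, 0.0]))   # deterministic  -> 0.0 bits (lowest)
```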

Definitions: Conditional Entropy – the measure of uncertainty about the random variable X given the value of the variable Y: H(X|Y) = -Σ_{x,y} p(x,y) log p(x|y).

Definitions: Conditional Entropy – Example.

Definitions: Mutual Information – the reduction in uncertainty of X due to the knowledge of Y: I(X;Y) = H(X) - H(X|Y). Properties: nonnegative; symmetric; convex w.r.t. p(y|x) for a fixed p(x).

Definitions: Mutual Information – Example.

Definitions: Kullback-Leibler Distance – a distance between distributions p and q over the same alphabet: D_KL(p||q) = Σ_x p(x) log (p(x)/q(x)). Properties: nonnegative; asymmetric.
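A small illustrative sketch of the definitions above; the toy joint table and all variable names are mine, not from the slides.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler distance D(p||q) in bits; p and q over the same alphabet."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def mutual_information(pxy):
    """I(X;Y) = D( p(x,y) || p(x)p(y) ), computed from a joint distribution table."""
    pxy = np.asarray(pxy, float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return kl(pxy.ravel(), (px * py).ravel())

pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])          # a toy joint distribution p(x, y)
print(mutual_information(pxy))        # > 0: X carries information about Y
print(kl([0.5, 0.5], [0.9, 0.1]),
      kl([0.9, 0.1], [0.5, 0.5]))     # the two directions differ: KL is asymmetric
```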

Agenda Motivation Information Theory - Basic Definitions Rate Distortion Theory –Blahut-Arimoto algorithm Information Bottleneck Principle IB algorithms –iIB –dIB –aIB Application

Rate Distortion Theory – Introduction. Goal: obtain a compact clustering of the data with minimal expected distortion. The distortion measure is a part of the problem setup; the clustering and its quality depend on the choice of the distortion measure.

Rate Distortion Theory: obtain a compact clustering of the data with minimal expected distortion, given a fixed set of representatives. (Cover & Thomas)

Rate Distortion Theory – Intuition: keeping every point as its own cluster gives zero distortion but is not compact; collapsing all points into one cluster is very compact but has high distortion.

Rate Distortion Theory – Cont. The quality of the clustering is determined by two quantities: complexity is measured by I(T;X) (a.k.a. the rate), and distortion is measured by E[d(X,T)].

Rate Distortion Plane: the plane of rate I(T;X) versus expected distortion Ed(X,T); one end corresponds to maximal compression, the other to minimal distortion; D is the distortion constraint.

Rate Distortion Function. Let D be an upper bound constraint on the expected distortion. Given the distortion constraint D, find the most compact model (with the smallest complexity I(T;X)). Higher values of D mean a more relaxed distortion constraint, so stronger compression levels are attainable.

Rate Distortion Function. Given: a set of points X with prior p(x), a set of representatives T, and a distortion measure d(x,t). Find: the most compact soft clustering p(t|x) of the points of X that satisfies the distortion constraint.
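In standard notation (the slide shows this as an image), the rate distortion function reconstructed from the definitions above is:

```latex
R(D) \;=\; \min_{\,p(t\mid x)\;:\;\langle d(x,t)\rangle \,\le\, D}\; I(T;X)
```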

Minimize the Lagrangian F[p(t|x)] = I(T;X) + β⟨d(x,t)⟩: I(T;X) is the complexity term, ⟨d(x,t)⟩ is the distortion term, and β is the Lagrange multiplier.

Rate Distortion Curve: the minimal rate I(T;X) as a function of the expected distortion Ed(X,T), running between maximal compression and minimal distortion.

Rate Distortion Function. Minimize I(T;X) + β⟨d(x,t)⟩ over p(t|x), subject to the normalization Σ_t p(t|x) = 1 for every x. The minimum is attained when p(t|x) = p(t) exp(-β d(x,t)) / Z(x,β).

Solution – Analysis. Solution: p(t|x) = p(t) exp(-β d(x,t)) / Z(x,β), where Z(x,β) = Σ_t p(t) exp(-β d(x,t)) and d(x,t) is known. The solution is implicit, since p(t) = Σ_x p(x) p(t|x) itself depends on p(t|x).

Solution – Analysis. For a fixed t: when x is similar to t, d(x,t) is small, so closer points are attached to t with higher probability.

Solution – Analysis. As β → 0: the factor exp(-β d) reduces the influence of the distortion, so p(t|x) does not depend on x; this, together with maximal compression, yields a single cluster. As β → ∞: for a fixed x, most of the conditional probability goes to the t with the smallest distortion, i.e. hard clustering.

Solution – Analysis. Varying β between these extremes yields intermediate soft clusterings with intermediate complexity.

Agenda Motivation Information Theory - Basic Definitions Rate Distortion Theory –Blahut-Arimoto algorithm Information Bottleneck Principle IB algorithms –iIB –dIB –aIB Application

Blahut-Arimoto Algorithm. Input: p(x), the set of representatives T, the distortion measure d(x,t), and β. Randomly initialize p(t), then alternate the updates p(t|x) ← p(t) exp(-β d(x,t)) / Z(x,β) and p(t) ← Σ_x p(x) p(t|x) until convergence. Each step optimizes a convex function over a convex set, so the attained minimum is global.
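A compact sketch of these alternating updates, assuming a fixed set of representatives given through a distortion matrix d[x, t]; the toy data and the β value are arbitrary choices of mine.

```python
import numpy as np

def blahut_arimoto(px, d, beta, n_iter=200, tol=1e-8):
    """Alternating minimization of I(T;X) + beta*<d> for fixed representatives.

    px   : prior p(x), shape (n_x,)
    d    : distortion matrix d[x, t], shape (n_x, n_t)
    beta : Lagrange multiplier (rate-distortion trade-off)
    Returns the soft assignment p(t|x) and the cluster marginal p(t).
    """
    n_x, n_t = d.shape
    pt = np.full(n_t, 1.0 / n_t)                        # init p(t) (any positive init works)
    for _ in range(n_iter):
        # p(t|x) <- p(t) exp(-beta d(x,t)) / Z(x, beta)
        logits = np.log(pt)[None, :] - beta * d
        pt_x = np.exp(logits - logits.max(axis=1, keepdims=True))
        pt_x /= pt_x.sum(axis=1, keepdims=True)
        # p(t) <- sum_x p(x) p(t|x)
        pt_new = px @ pt_x
        if np.abs(pt_new - pt).max() < tol:
            pt = pt_new
            break
        pt = pt_new
    return pt_x, pt

# toy example: 5 points on a line, 2 representatives at 0 and 4
x = np.arange(5.0)
reps = np.array([0.0, 4.0])
d = (x[:, None] - reps[None, :]) ** 2                   # squared-error distortion
pt_x, pt = blahut_arimoto(np.full(5, 0.2), d, beta=1.0)
print(np.round(pt_x, 3))                                # soft assignments p(t|x)
```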

Blahut-Arimoto Algorithm – Advantages: obtains a compact clustering of the data with minimal expected distortion; the clustering is optimal given a fixed set of representatives.

Blahut-Arimoto Algorithm – Drawbacks: the distortion measure is a part of the problem setup (hard to obtain for some problems, and equivalent to determining the relevant features); the set of representatives is fixed; slow convergence.

Rate Distortion Theory – Additional Insights –Another problem would be to find optimal representatives given the clustering. –Joint optimization of clustering and representatives doesn’t have a unique solution. (like EM or K-means)

Agenda Motivation Information Theory - Basic Definitions Rate Distortion Theory –Blahut-Arimoto algorithm Information Bottleneck Principle IB algorithms –iIB –dIB –aIB Application

Information Bottleneck. Copes with the drawbacks of the Rate Distortion approach: compress the data while preserving the “important” (relevant) information. It is often easier to define what information is important than to define a distortion measure. Replace the distortion upper-bound constraint by a lower-bound constraint on the relevant information. (Tishby, Pereira & Bialek, 1999)

Information Bottleneck – Example. Given: words (from documents), topics, and their joint prior.

Information Bottleneck – Example. Obtain: a partitioning of the words into clusters, described by the quantities I(Word;Cluster), I(Cluster;Topic), and I(Word;Topic).

Information Bottleneck – Example. Extreme case 1: I(Word;Cluster) = 0 and I(Cluster;Topic) = 0 – very compact, but not informative.

Information Bottleneck – Example. Extreme case 2: I(Word;Cluster) = max and I(Cluster;Topic) = max – very informative, but not compact. The goal is therefore to minimize I(Word;Cluster) while maximizing I(Cluster;Topic).

Information Bottleneck: words are compressed into clusters (compactness, measured on the word side) while preserving information about topics (relevant information).

Relevance-Compression Curve: compression I(T;X) versus relevant information I(T;Y), running between maximal compression and maximal relevant information; D is the relevance constraint.

Relevance-Compression Function. Let D be the minimal allowed value of the relevant information I(T;Y). Given the relevant information constraint D, find the most compact model (with the smallest I(T;X)). A smaller D means a more relaxed relevant-information constraint, so stronger compression levels are attainable.

Relevance-Compression Function. Minimize the Lagrangian L = I(T;X) - β I(T;Y): I(T;X) is the compression term, I(T;Y) is the relevance term, and β is the Lagrange multiplier.

Relevance-Compression Curve (between maximal compression and maximal relevant information).

Relevance-Compression Function. Minimize I(T;X) - β I(T;Y) over p(t|x), subject to the normalization Σ_t p(t|x) = 1 for every x. The minimum is attained when p(t|x) = p(t) exp(-β D_KL[p(y|x) || p(y|t)]) / Z(x,β).

Solution – Analysis. The solution is implicit: p(t|x) depends on p(t) and p(y|t), which themselves depend on p(t|x); only the joint p(x,y) is known.
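Written out in standard notation (the slides show them as images), the self-consistent equations characterizing the IB solution are:

```latex
p(t\mid x) = \frac{p(t)}{Z(x,\beta)}\,
             \exp\!\big(-\beta\, D_{KL}\!\left[\,p(y\mid x)\,\|\,p(y\mid t)\,\right]\big),
\qquad
p(t) = \sum_x p(x)\,p(t\mid x),
\qquad
p(y\mid t) = \frac{1}{p(t)}\sum_x p(y\mid x)\,p(t\mid x)\,p(x)
```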

Solution – Analysis. For a fixed t: when p(y|x) is similar to p(y|t), the KL distance is small, so such points x are attached to t with higher probability. The KL distance thus emerges as the effective distortion measure from the IB principle, and the optimization is also over the cluster representatives p(y|t).

Solution – Analysis. As β → 0: the exponent reduces the influence of the KL term, so p(t|x) does not depend on x; this, together with maximal compression, yields a single cluster. As β → ∞: for a fixed x, most of the conditional probability goes to the t with the smallest KL distance (hard mapping).

Relevance-Compression Curve, with the hard-mapping solutions marked (maximal compression at one end, maximal relevant information at the other).

Agenda Motivation Information Theory - Basic Definitions Rate Distortion Theory –Blahut-Arimoto algorithm Information Bottleneck Principle IB algorithms –iIB –dIB –aIB Application

Iterative Optimization Algorithm (iIB). Input: the joint prior p(x,y), the cardinality of T, and β. Randomly initialize p(t|x). (Pereira, Tishby & Lee, 1993; Tishby, Pereira & Bialek, 2001)

Iterative Optimization Algorithm (iIB). Iterate the three self-consistent updates in turn – p(cluster | word), p(cluster), and p(topic | cluster) – until convergence. (Pereira, Tishby & Lee, 1993)
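An illustrative sketch of these self-consistent iterations on a hypothetical word-topic joint table; this is not the authors' code, and the toy numbers are arbitrary.

```python
import numpy as np

def iib(pxy, n_clusters, beta, n_iter=300, seed=0):
    """Iterative IB: alternate the three self-consistent equations.

    pxy : joint distribution p(x, y), shape (n_x, n_y), e.g. normalized word-topic counts.
    Returns p(t|x), p(t), p(y|t).
    """
    rng = np.random.default_rng(seed)
    px = pxy.sum(axis=1)                                  # p(x)
    py_x = pxy / px[:, None]                              # p(y|x)
    pt_x = rng.random((len(px), n_clusters))
    pt_x /= pt_x.sum(axis=1, keepdims=True)               # random init of p(t|x)
    for _ in range(n_iter):
        pt = px @ pt_x                                    # p(t) = sum_x p(x) p(t|x)
        py_t = (pt_x * px[:, None]).T @ py_x / pt[:, None]  # p(y|t)
        # KL[p(y|x) || p(y|t)] for every (x, t) pair
        kl = np.sum(py_x[:, None, :] * (np.log(py_x[:, None, :] + 1e-12)
                                        - np.log(py_t[None, :, :] + 1e-12)), axis=2)
        logits = np.log(pt)[None, :] - beta * kl
        pt_x = np.exp(logits - logits.max(axis=1, keepdims=True))
        pt_x /= pt_x.sum(axis=1, keepdims=True)           # p(t|x) update
    return pt_x, pt, py_t

# toy joint: 6 "words" x 2 "topics"
pxy = np.array([[4, 1], [5, 1], [4, 2], [1, 5], [2, 4], [1, 4]], float)
pxy /= pxy.sum()
pt_x, pt, py_t = iib(pxy, n_clusters=2, beta=5.0)
print(np.round(pt_x, 2))   # words with similar p(topic|word) end up in the same cluster
```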

iIB simulation. Given: 300 instances of X with prior p(x); a binary relevant variable Y; the joint prior p(x,y); and β. Obtain: the optimal clustering p(t|x) (with minimal L = I(T;X) - β I(T;Y)).

iIB simulation: the X points and their priors (figure).

iIB simulation: for a given x, p(y|x) is indicated by the color of the point on the map (figure).

Single Cluster – Maximal Compression

iIB simulation…

Hard Clustering – Maximal Relevant Information

Iterative Optimization Algorithm (iIB). Analogous to K-means or EM: it optimizes a non-convex functional over 3 convex sets, so the attained minimum is only local.

“Semantic change” in the clustering solution

Iterative Optimization Algorithm (iIB) – Advantages: defining a relevant variable is often easier and more intuitive than defining a distortion measure; finds a local minimum.

Iterative Optimization Algorithm (iIB) – Drawbacks: finds only a local minimum (suboptimal solutions); the parameters need to be specified; slow convergence; a large data sample is required.

Agenda Motivation Information Theory - Basic Definitions Rate Distortion Theory –Blahut-Arimoto algorithm Information Bottleneck Principle IB algorithms –iIB –dIB –aIB Application

Deterministic Annealing-like algorithm (dIB). Iteratively increase the parameter β and adapt the solution from the previous value of β to the new one. Track the changes in the solution as the system shifts its preference from compression to relevance. Tries to reconstruct the relevance-compression curve. (Slonim, Friedman & Tishby, 2002)

Deterministic Annealing-like algorithm (dIB). Start from the solution obtained at the previous value of β: p(t|x), p(t), p(y|t).

Deterministic Annealing-like algorithm (dIB). Duplicate each cluster and apply a small perturbation to the copies.

Deterministic Annealing-like algorithm (dIB). Apply iIB using the duplicated cluster set as the initialization.

Deterministic Annealing-like algorithm (dIB). If the two copies of a cluster converge to different solutions, keep the split; otherwise keep the old, unsplit cluster.
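A self-contained schematic sketch of this split-perturb-adapt-test loop; the perturbation size, split threshold, and β schedule are arbitrary placeholders, and the perturbation is applied to p(t|x) rather than to the cluster representatives, so it only approximates the procedure described above.

```python
import numpy as np

def _ib_iterations(pxy, pt_x, beta, n_iter=200):
    """Run the iIB self-consistent updates starting from a given p(t|x)."""
    px = pxy.sum(axis=1)
    py_x = pxy / px[:, None]
    for _ in range(n_iter):
        pt = px @ pt_x
        py_t = (pt_x * px[:, None]).T @ py_x / pt[:, None]
        kl = np.sum(py_x[:, None, :] * (np.log(py_x[:, None, :] + 1e-12)
                                        - np.log(py_t[None, :, :] + 1e-12)), axis=2)
        logits = np.log(pt + 1e-12)[None, :] - beta * kl
        pt_x = np.exp(logits - logits.max(axis=1, keepdims=True))
        pt_x /= pt_x.sum(axis=1, keepdims=True)
    pt = px @ pt_x
    py_t = (pt_x * px[:, None]).T @ py_x / pt[:, None]
    return pt_x, py_t

def dib(pxy, betas, eps=1e-2, split_thresh=5e-2, seed=0):
    """Anneal beta upward; at each step split every cluster, adapt, and keep real splits."""
    rng = np.random.default_rng(seed)
    pt_x = np.ones((pxy.shape[0], 1))                   # start from a single cluster
    for beta in betas:                                  # gradually raise beta
        cand = np.repeat(pt_x, 2, axis=1) / 2.0         # duplicate every cluster...
        cand = cand * (1.0 + eps * rng.standard_normal(cand.shape))  # ...and perturb the copies
        cand = np.clip(cand, 1e-12, None)
        cand /= cand.sum(axis=1, keepdims=True)
        cand, py_t = _ib_iterations(pxy, cand, beta)    # adapt the solution at the new beta
        keep = []                                       # keep a split only if the copies diverged
        for j in range(0, cand.shape[1], 2):
            if np.abs(py_t[j] - py_t[j + 1]).sum() > split_thresh:
                keep += [j, j + 1]
            else:
                keep.append(j)
        pt_x = cand[:, keep]
        pt_x /= pt_x.sum(axis=1, keepdims=True)
    return pt_x

pxy = np.array([[4, 1], [5, 1], [4, 2], [1, 5], [2, 4], [1, 4]], float)
pxy /= pxy.sum()
print(np.round(dib(pxy, betas=[0.5, 1, 2, 5, 10]), 2))  # clusters appear as beta grows
```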

Illustration: which clusters split at which values of β.

Deterministic Annealing-like algorithm (dIB) – Advantages: finds a local minimum (suboptimal solutions); speeds up convergence by adapting the previous solution.

Deterministic Annealing-like algorithm (dIB) – Drawbacks: several parameters need to be specified and tuned (the perturbation size, the step for β – splits might be “skipped”, the similarity threshold for splitting), and they may need to be varied during the process; finds only a local minimum (suboptimal solutions); a large data sample is required.

Agenda Motivation Information Theory - Basic Definitions Rate Distortion Theory –Blahut-Arimoto algorithm Information Bottleneck Principle IB algorithms –iIB –dIB –aIB Application

Agglomerative Algorithm (aIB). Finds a hierarchical clustering tree in a greedy bottom-up fashion. Results in a different tree for each β; each tree is a range of clustering solutions at different resolutions (same β, different resolutions). (Slonim & Tishby, 1999)

Agglomerative Algorithm (aIB). Fix β. Start with every point as its own cluster (T = X).

Agglomerative Algorithm (aIB). For each pair of clusters, compute the merged cluster and the cost of merging; merge the pair that produces the smallest cost, i.e. the smallest increase of the IB functional (smallest loss of relevant information).

Agglomerative Algorithm (aIB). Continue merging pairs until a single cluster is left.
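A sketch of the greedy bottom-up merging, using the hard (β → ∞) merge cost – the weighted Jensen-Shannon divergence, which measures the loss of relevant information I(T;Y); the toy joint table is arbitrary.

```python
import numpy as np

def _kl(p, q):
    """KL distance in bits, assuming q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def aib(pxy):
    """Agglomerative IB (hard, beta -> infinity): greedily merge the pair of clusters
    whose merge loses the least relevant information I(T;Y).
    Returns the sequence of merges, from |X| singleton clusters down to a single one."""
    px = pxy.sum(axis=1)
    pt = px.copy()                                   # start: every x is its own cluster t
    py_t = pxy / px[:, None]                         # cluster representatives p(y|t)
    members = [[i] for i in range(len(px))]
    merges = []
    while len(members) > 1:
        best, best_cost = None, np.inf
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                w = pt[i] + pt[j]
                pi_i, pi_j = pt[i] / w, pt[j] / w
                p_bar = pi_i * py_t[i] + pi_j * py_t[j]     # merged representative
                js = pi_i * _kl(py_t[i], p_bar) + pi_j * _kl(py_t[j], p_bar)
                cost = w * js                               # loss in I(T;Y) for this merge
                if cost < best_cost:
                    best, best_cost = (i, j), cost
        i, j = best
        w = pt[i] + pt[j]
        py_t[i] = (pt[i] * py_t[i] + pt[j] * py_t[j]) / w   # merge cluster j into cluster i
        pt[i] = w
        members[i] = members[i] + members[j]
        merges.append((best_cost, sorted(members[i])))
        py_t = np.delete(py_t, j, axis=0)
        pt = np.delete(pt, j)
        members.pop(j)
    return merges

pxy = np.array([[4, 1], [5, 1], [4, 2], [1, 5], [2, 4], [1, 4]], float)
pxy /= pxy.sum()
for cost, merged in aib(pxy):
    print(round(cost, 4), merged)                           # cheapest merges happen first
```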

Agglomerative Algorithm (aIB)

Agglomerative Algorithm (aIB) – Advantages: non-parametric; gives a full hierarchy of clusters for each β; simple.

Agglomerative Algorithm (aIB) – Drawbacks: greedy, so it is not guaranteed to extract even locally minimal solutions along the tree; a large data sample is required.

Agenda Motivation Information Theory - Basic Definitions Rate Distortion Theory –Blahut-Arimoto algorithm Information Bottleneck Principle IB algorithms –iIB –dIB –aIB Application

Applications: Unsupervised Clustering of Images. Modeling assumption: for a fixed image, colors and their spatial distribution are generated by a mixture of Gaussians in a 5-dimensional (color + position) feature space. (Shiri Gordon et al., 2003)

Applications: Unsupervised Clustering of Images. Mixture of Gaussians model: p(v) = Σ_j α_j N(v; μ_j, Σ_j). Apply the EM procedure to estimate the mixture parameters. (Shiri Gordon et al., 2003)

Applications: Unsupervised Clustering of Images. Assume a uniform prior p(x) over the images, calculate the conditional p(y|x), and apply the aIB algorithm. (Shiri Gordon et al., 2003)
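A rough sketch of this pipeline with synthetic stand-ins for the 5-dim pixel features; GaussianMixture from scikit-learn plays the role of the EM step, and building p(y|x) by evaluating each image's mixture on a shared set of sample points is one plausible discretization, not necessarily the one used in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# `images` would be per-image arrays of 5-dim pixel features (x, y, L, a, b);
# here they are faked with random blobs.
rng = np.random.default_rng(0)
images = [rng.normal(loc=rng.normal(0.0, 3.0, 5), scale=1.0, size=(500, 5))
          for _ in range(8)]

# 1. Fit a small Gaussian mixture to each image via EM (as in the slides).
gmms = [GaussianMixture(n_components=3, random_state=0).fit(im) for im in images]

# 2. One crude discretization of p(y|x): evaluate each image's density on a
#    shared set of sample points y drawn from all images, then normalize.
grid = np.vstack([im[::50] for im in images])
py_x = np.exp(np.stack([g.score_samples(grid) for g in gmms]))
py_x /= py_x.sum(axis=1, keepdims=True)

# 3. Uniform prior over images and the joint p(x, y) to hand to the aIB sketch above.
px = np.full(len(images), 1.0 / len(images))
pxy = px[:, None] * py_x
print(pxy.shape, round(pxy.sum(), 6))          # sanity check: (8, 80), sums to 1
```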

Applications: Unsupervised Clustering of Images – example clustering results (figures). (Shiri Gordon et al., 2003)

Summary Rate Distortion Theory –Blahut-Arimoto algorithm Information Bottleneck Principle IB algorithms –iIB –dIB –aIB Application

Thank you

Blahut-Arimoto algorithm. When does the alternating minimization converge to the global minimum? When A and B are convex sets of distributions and the distance measure satisfies some additional requirements; the algorithm alternates between finding the minimum-distance point in A and in B. (Csiszár & Tusnády, 1984)

Blahut-Arimoto algorithm. Reformulate the alternating minimization using a distance between the sets A and B.

Rate Distortion Theory – Intuition: zero distortion but not compact vs. high distortion but very compact.

Information Bottleneck – cont'd. Assume the Markov relation T ↔ X ↔ Y: T is a compressed representation of X, and is thus independent of Y given X. Information processing inequality: I(T;Y) ≤ I(X;Y).