Bayesian Hierarchical Clustering Paper by K. Heller and Z. Ghahramani ICML 2005 Presented by HAO-WEI, YEH

Outline  Background - Traditional Methods  Bayesian Hierarchical Clustering (BHC)  Basic ideas  Dirichlet Process Mixture Model (DPM)  Algorithm  Experiment results  Conclusion  Background - Traditional Methods  Bayesian Hierarchical Clustering (BHC)  Basic ideas  Dirichlet Process Mixture Model (DPM)  Algorithm  Experiment results  Conclusion

Background: Traditional Methods

Hierarchical Clustering  Given : data points  Output: a tree (series of clusters)  Leaves : data points  Internal nodes : nested clusters  Examples  Evolutionary tree of living organisms  Internet newsgroups  Newswire documents  Given : data points  Output: a tree (series of clusters)  Leaves : data points  Internal nodes : nested clusters  Examples  Evolutionary tree of living organisms  Internet newsgroups  Newswire documents

Traditional Hierarchical Clustering  Bottom-up agglomerative algorithm  Closeness based on given distance measure (e.g. Euclidean distance between cluster means)  Bottom-up agglomerative algorithm  Closeness based on given distance measure (e.g. Euclidean distance between cluster means)

Traditional Hierarchical Clustering (cont'd)
Limitations:
- No guide to choosing the correct number of clusters, or where to prune the tree.
- Distance metric selection (especially hard for data such as images or sequences).
- Evaluation (no probabilistic model):
  - How do we evaluate how good a result is?
  - How do we compare to other models?
  - How do we make predictions and cluster new data with an existing hierarchy?

Bayesian Hierarchical Clustering (BHC)

 Basic ideas:  Use marginal likelihoods to decide which clusters to merge  P(Data to merge were from the same mixture component) vs. P(Data to merge were from different mixture components)  Generative Model : Dirichlet Process Mixture Model (DPM)  Basic ideas:  Use marginal likelihoods to decide which clusters to merge  P(Data to merge were from the same mixture component) vs. P(Data to merge were from different mixture components)  Generative Model : Dirichlet Process Mixture Model (DPM)

Dirichlet Process Mixture Model (DPM)  Formal Definition  Different Perspectives  Infinite version of Mixture Model (Motivation and Problems)  Stick-breaking Process (How generated distribution look like)  Chinese Restaurant Process, Polya urn scheme  Benefits  Conjugate prior  Unlimited clusters  “Rich-Get-Richer, ” Does it really work? Depends!  Pitman-Yor process, Uniform Process, …  Formal Definition  Different Perspectives  Infinite version of Mixture Model (Motivation and Problems)  Stick-breaking Process (How generated distribution look like)  Chinese Restaurant Process, Polya urn scheme  Benefits  Conjugate prior  Unlimited clusters  “Rich-Get-Richer, ” Does it really work? Depends!  Pitman-Yor process, Uniform Process, …

BHC Algorithm - Overview  Same as traditional  One-pass, bottom-up method  Initializes each data point in own cluster, and iteratively merges pairs of clusters.  Difference  Uses a statistical hypothesis test to choose which clusters to merge.  Same as traditional  One-pass, bottom-up method  Initializes each data point in own cluster, and iteratively merges pairs of clusters.  Difference  Uses a statistical hypothesis test to choose which clusters to merge.

BHC Algorithm - Concepts  Two hypotheses to compare  1. All data was generated i.i.d. from the same probabilistic model with unknown parameters.  2. Data has two or more clusters in it.  Two hypotheses to compare  1. All data was generated i.i.d. from the same probabilistic model with unknown parameters.  2. Data has two or more clusters in it.

Hypothesis H 1  Probability of the data under H 1 :  : prior over the parameters  D k : data in the two trees to be merged  Integral is tractable with conjugate prior  Probability of the data under H 1 :  : prior over the parameters  D k : data in the two trees to be merged  Integral is tractable with conjugate prior

Hypothesis H 2  Probability of the data under H 2 :  Product over sub-trees  Probability of the data under H 2 :  Product over sub-trees

BHC Algorithm - Working Flow
- From Bayes' rule, the posterior probability of the merged hypothesis, r_k (equation below).
- The pair of trees with the highest merge probability is merged.
- This gives a natural place to cut the final tree.
(Slide annotations on the equation: "data number, concentration (DPM)" and "hidden features (underlying distribution)".)
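Reconstructed from the paper, the merge posterior and the tree marginal likelihood that normalizes it are

$$
r_k \;=\; \frac{\pi_k\, p(D_k \mid \mathcal{H}_1)}{p(D_k \mid T_k)},
\qquad
p(D_k \mid T_k) \;=\; \pi_k\, p(D_k \mid \mathcal{H}_1) \;+\; (1 - \pi_k)\, p(D_i \mid T_i)\, p(D_j \mid T_j),
$$

and the tree can be cut at the nodes where r_k < 0.5.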

Tree-Consistent Partitions  Consider the right tree and all 15 possible partitions of {1,2,3,4}: (1)(2)(3)(4), (1 2)(3)(4), (1 3)(2)(4), (1 4)(2)(3), (2 3)(1)(4), (2 4)(1)(3), (3 4)(1)(2), (1 2)(3 4), (1 3)(2 4), (1 4)(2 3), (1 2 3)(4), (1 2 4)(3), (1 3 4)(2), (2 3 4)(1), ( )  (1 2) (3) (4) and (1 2 3) (4) are tree-consistent partitions.  (1)(2 3)(4) and (1 3)(2 4) are not tree-consistent partitions.  Consider the right tree and all 15 possible partitions of {1,2,3,4}: (1)(2)(3)(4), (1 2)(3)(4), (1 3)(2)(4), (1 4)(2)(3), (2 3)(1)(4), (2 4)(1)(3), (3 4)(1)(2), (1 2)(3 4), (1 3)(2 4), (1 4)(2 3), (1 2 3)(4), (1 2 4)(3), (1 3 4)(2), (2 3 4)(1), ( )  (1 2) (3) (4) and (1 2 3) (4) are tree-consistent partitions.  (1)(2 3)(4) and (1 3)(2 4) are not tree-consistent partitions.

Merged Hypothesis Prior (π_k)
- Based on the DPM (CRP perspective).
- π_k = P(all points in D_k belong to one cluster).
- The d terms account for all tree-consistent partitions (recursion below).
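The recursion for π_k, as given in the ICML 2005 paper (n_k is the number of points in D_k and α is the DPM concentration): initialize each leaf i with d_i = α and π_i = 1, then for each internal node k

$$
d_k \;=\; \alpha\,\Gamma(n_k) \;+\; d_{\mathrm{left}_k}\, d_{\mathrm{right}_k},
\qquad
\pi_k \;=\; \frac{\alpha\,\Gamma(n_k)}{d_k}.
$$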

Predictive Distribution  BHC allow to define predictive distributions for new data points.  Note : P(x|D) != P(x|D k ) for root!?  BHC allow to define predictive distributions for new data points.  Note : P(x|D) != P(x|D k ) for root!?

Approximate Inference for the DPM Prior
- BHC forms a lower bound on the marginal likelihood of an infinite mixture model by efficiently summing over an exponentially large subset of all partitions.
- Idea: deterministically sum over the partitions with high probability, thereby accounting for most of the mass.
- Compared to MCMC methods, this is deterministic and efficient.

Learning Hyperparameters  α : Concentration parameter  β : Define G0  Learned by recursive gradients and EM-like method  α : Concentration parameter  β : Define G0  Learned by recursive gradients and EM-like method

To Sum Up for BHC  Statistical model for comparison and decides when to stop.  Allow to define predictive distributions for new data points.  Approximate Inference for DPM marginal.  Parameters  α : Concentration parameter  β : Define G0  Statistical model for comparison and decides when to stop.  Allow to define predictive distributions for new data points.  Approximate Inference for DPM marginal.  Parameters  α : Concentration parameter  β : Define G0

Unique Aspects of the BHC Algorithm
- A hierarchical way of organizing nested clusters, not a hierarchical generative model.
- Derived from the DPM.
- Hypothesis test: one cluster vs. many other clusterings (rather than one vs. two clusters at each stage).
- Not iterative and does not require sampling (except for learning hyperparameters).

Results from the experiments

Conclusion and some take-home notes

Conclusion  Limitations -> No guide to choosing correct number of clusters, or where to prune tree. (Natural Stop Criterion) <- -> Distance metric selection (Model-based Criterion) <- -> Evaluation, Comparison, Inference (Probabilistic model) <- Some useful results for DPM) <-  Limitations -> No guide to choosing correct number of clusters, or where to prune tree. (Natural Stop Criterion) <- -> Distance metric selection (Model-based Criterion) <- -> Evaluation, Comparison, Inference (Probabilistic model) <- Some useful results for DPM) <- Solved!!

Summary  Defines probabilistic model of data, can compute probability of new data point belonging to any cluster in tree.  Model-based criterion to decide on merging clusters.  Bayesian hypothesis testing used to decide which merges are advantageous, and to decide appropriate depth of tree.  Algorithm can be interpreted as approximate inference method for a DPM; gives new lower bound on marginal likelihood by summing over exponentially many clusterings of the data.  Defines probabilistic model of data, can compute probability of new data point belonging to any cluster in tree.  Model-based criterion to decide on merging clusters.  Bayesian hypothesis testing used to decide which merges are advantageous, and to decide appropriate depth of tree.  Algorithm can be interpreted as approximate inference method for a DPM; gives new lower bound on marginal likelihood by summing over exponentially many clusterings of the data.

Limitations  Inherent greediness  Lack of any incorporation of tree uncertainty  O(n 2 ) complexity for building tree  Inherent greediness  Lack of any incorporation of tree uncertainty  O(n 2 ) complexity for building tree

References  Main paper:  Bayesian Hierarchical Clustering, K. Heller and Z. Ghahramani, ICML 2005  Thesis:  Efficient Bayesian Methods for Clustering, Katherine Ann Heller  Other references:  Wikipedia  Paper Slides    General ML   Main paper:  Bayesian Hierarchical Clustering, K. Heller and Z. Ghahramani, ICML 2005  Thesis:  Efficient Bayesian Methods for Clustering, Katherine Ann Heller  Other references:  Wikipedia  Paper Slides    General ML 

References  Other references(cont’d)  DPM & Nonparametric Bayesian :      (Easy to read)   Heavy text:     Hierarchical DPM   Other methods   Other references(cont’d)  DPM & Nonparametric Bayesian :      (Easy to read)   Heavy text:     Hierarchical DPM   Other methods 

Thank You for Your Attention!