
Statistical Models for Partial Membership
Katherine Heller, Gatsby Computational Neuroscience Unit, UCL
Sinead Williamson and Zoubin Ghahramani, University of Cambridge

Partial Membership
Example: a person with a mixed ethnic background.
- Someone who is 50% Asian and 50% European partly belongs to two different groups (ethnicities).
- This partial membership may be relevant for predicting the person's phenotype or food preferences.
Conceptually, this is not the same as uncertain membership.
- Being certain that someone is half Asian and half European is very different from being unsure of their ethnicity.
- More evidence (such as DNA tests) can help resolve uncertainty, but it will not change their ethnicity memberships.
Work on modeling partial membership has largely come from the fuzzy logic community.

Outline
Goal: Describe a fully probabilistic approach to data modeling with partial memberships.
- Introduction
- Bayesian Partial Membership Model (BPM)
- BPM Learning
- Experiments: synthetic data; Senate Roll Call data
- Related Work
- Conclusions (Nonparametric Extension?)

Finite Mixture Models
Consider modeling a data set X = {x_1, ..., x_N} using a finite mixture of K components:
p(x_n | \Theta) = \sum_{k=1}^{K} \rho_k \, p_k(x_n | \theta_k)
Equivalently, introduce indicator vectors \pi_n with \pi_{nk} \in \{0, 1\} and \sum_k \pi_{nk} = 1, which denote memberships of data points to clusters:
p(x_n | \pi_n, \Theta) = \prod_{k=1}^{K} p_k(x_n | \theta_k)^{\pi_{nk}}
Generative process: 1) choose a cluster; 2) generate a data point from that cluster.
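A minimal sketch of this two-step generative process, assuming hypothetical one-dimensional Gaussian components (illustrative only, not from the paper):

```python
# Minimal sketch: sampling from a finite mixture (hypothetical 1-D Gaussian components).
import numpy as np

rng = np.random.default_rng(0)
rho = np.array([0.3, 0.7])                 # mixing proportions
mu, sigma = np.array([-3.0, 3.0]), 1.0     # component means and shared std. dev.

def sample_mixture(n):
    k = rng.choice(len(rho), size=n, p=rho)  # 1) choose a cluster for each point
    return rng.normal(mu[k], sigma)          # 2) generate each point from its cluster

x = sample_mixture(1000)
```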

Continuous Relaxation
Now let \pi_{nk} \in [0, 1] with \sum_k \pi_{nk} = 1, so that \pi_n denotes the partial memberships of data point n to the K clusters:
p(x_n | \pi_n, \Theta) \propto \prod_{k=1}^{K} p_k(x_n | \theta_k)^{\pi_{nk}}

Partial Membership Mixture Model
Why does this make sense? If there is an "Asian" cluster and a "European" cluster, the partial membership model will better capture people with mixed ethnicity, whose features lie in between.
[Figure: points with membership vectors (1,0), (0,1), and (0.5, 0.5), the last lying between the two clusters.]
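To make the difference concrete, here is a small sketch (assumed 1-D Gaussian clusters, not the paper's code) contrasting the ordinary mixture density with the partial-membership density for a point that is a 50/50 member: the mixture stays bimodal, while the partial-membership density places its mass between the two clusters.

```python
# Mixture vs. partial-membership density for pi = (0.5, 0.5) (hypothetical 1-D example).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = np.array([-3.0, 3.0]), 1.0     # two cluster means, shared std. dev.

def mixture_pdf(x, rho=(0.5, 0.5)):
    # Convex combination of densities: stays bimodal.
    return sum(r * norm.pdf(x, m, sigma) for r, m in zip(rho, mu))

def partial_membership_pdf(x, pi=(0.5, 0.5)):
    # Product of densities raised to the memberships, renormalized: mass in between.
    unnorm = lambda t: np.prod([norm.pdf(t, m, sigma) ** p for p, m in zip(pi, mu)])
    Z, _ = quad(unnorm, -20, 20)
    return unnorm(x) / Z

print(mixture_pdf(0.0))             # ~0.004: the mixture puts little mass between the clusters
print(partial_membership_pdf(0.0))  # ~0.4: a 50/50 member sits between the clusters
```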

Exponential Family Distributions
Let us consider the case where each cluster distribution is in the exponential family:
p_k(x_n | \theta_k) = \exp\{ s(x_n)^\top \theta_k + h(x_n) + g(\theta_k) \}
Natural parameters: \theta_k. Sufficient statistics: s(x_n).
The conjugate prior can be written as:
p(\theta_k | \lambda, \nu) = f(\lambda, \nu) \exp\{ \theta_k^\top \lambda + \nu g(\theta_k) \}
It follows that the partial-membership likelihood is again in the exponential family:
p(x_n | \pi_n, \Theta) = \exp\{ s(x_n)^\top \tilde{\theta}_n + h(x_n) + g(\tilde{\theta}_n) \}, where \tilde{\theta}_n = \sum_k \pi_{nk} \theta_k.
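The key step written out (a sketch consistent with the notation above): raising each exponential-family component to the power \pi_{nk} and multiplying combines the natural parameters as a convex combination.

```latex
% Sketch of the step behind "It follows that", using the notation above.
\prod_k p_k(x_n \mid \theta_k)^{\pi_{nk}}
  = \prod_k \exp\bigl\{\pi_{nk}\bigl[\,s(x_n)^\top \theta_k + h(x_n) + g(\theta_k)\bigr]\bigr\}
  \;\propto\; \exp\Bigl\{\, s(x_n)^\top \tilde{\theta}_n + h(x_n) \Bigr\},
\qquad \tilde{\theta}_n = \sum_k \pi_{nk}\,\theta_k .
```

Normalizing over x_n then yields the exponential-family density with natural parameters \tilde{\theta}_n, which restores the g(\tilde{\theta}_n) term.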

Bayesian Partial Membership Model
Generative process, annotated with the ethnicity example:
- For each k: draw \theta_k \sim p(\theta | \lambda, \nu) from the conjugate prior. This defines a distribution over features for each of the K ethnic groups.
- Draw \rho \sim Dirichlet(\alpha): the ethnic composition of the population.
- Draw a \sim Exponential(b): controls how similar to the population an individual is expected to be.
- For each n: draw \pi_n \sim Dirichlet(a\rho), the ethnic composition of individual n, and then draw x_n \sim Expon(\sum_k \pi_{nk} \theta_k), the feature values of individual n.

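A hedged generative sketch of the process above, using Bernoulli (binary-feature) clusters; the Gaussian prior on the natural (logit) parameters is an assumption for illustration rather than the conjugate prior used in the model.

```python
# Generative sketch of the BPM with Bernoulli clusters (illustrative, not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 50, 32, 3                  # data points, binary features, clusters
alpha, b = np.ones(K), 1.0           # hypothetical hyperparameters

theta = rng.normal(0.0, 2.0, (K, D)) # natural (logit) parameters per cluster (assumed prior)
rho = rng.dirichlet(alpha)           # population-level cluster proportions
a = rng.exponential(b)               # how closely individuals follow the population

Pi = np.zeros((N, K))
X = np.zeros((N, D), dtype=int)
for n in range(N):
    Pi[n] = rng.dirichlet(a * rho)                      # partial memberships pi_n ~ Dir(a * rho)
    nat = Pi[n] @ theta                                 # convex combination of natural parameters
    X[n] = rng.binomial(1, 1.0 / (1.0 + np.exp(-nat)))  # Bernoulli draw in natural-parameter form
```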

BPM Sampled Data
Each of the four plots shows 3000 data points drawn from the BPM with the same 3 full-covariance Gaussian clusters.

BPM Theory
Lemma 1: In the limit as a -> 0, the exponential family BPM model is a mixture of K components with mixing proportions \rho.
Lemma 2: In the limit as a -> \infty, the exponential family BPM model has only one component, with natural parameters \sum_k \rho_k \theta_k.
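A brief intuition for the two limits (a sketch, not the paper's proofs), based on how Dirichlet(a\rho) behaves as a varies:

```latex
% As a -> 0, Dir(a\rho) puts almost all its mass on the vertices of the simplex,
% so \pi_n behaves like an indicator vector e_k chosen with probability \rho_k:
\pi_n \approx e_k \ \text{w.p.}\ \rho_k
  \;\Longrightarrow\;
  p(x_n) \approx \sum_k \rho_k\, p_k(x_n \mid \theta_k).
% As a -> \infty, Dir(a\rho) concentrates at its mean \rho, so every data point
% uses the same convex combination of natural parameters:
\pi_n \approx \rho
  \;\Longrightarrow\;
  \tilde{\theta}_n \approx \sum_k \rho_k\,\theta_k .
```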

BPM Learning
We want to infer all unknowns given X: the goal is to infer \Theta, \pi, a, and \rho using MCMC, treating \lambda, \nu, \alpha, and b as fixed hyperparameters.
All parameters in the BPM are continuous, so we can use Hybrid Monte Carlo. Hybrid Monte Carlo is an efficient MCMC method that uses gradient information to find high-probability regions.

Synthetic Data
Generated a synthetic binary data set of 50 data points, 32 dimensions, and 3 clusters. Ran the HMC sampler for 4000 iterations. Computed an error measure comparing \Pi, the true generated membership matrix, with \hat{\Pi}, the sampled one.
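A small sketch of one natural way to score recovered memberships, assuming the reported measure is the mean absolute difference between the true and sampled membership matrices (the slide's exact formula was not preserved):

```python
# Hypothetical error measure: mean absolute deviation between membership matrices.
import numpy as np

def membership_error(Pi_true, Pi_sampled):
    """Pi_true, Pi_sampled: N x K partial-membership matrices with rows summing to 1."""
    return float(np.mean(np.abs(Pi_true - Pi_sampled)))
```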

Senate Roll Call Data
(99 senators + 1 outcome) x 633 votes.
K = 2 multivariate Bernoulli clusters.
Model adapted to handle missing data.

Senate Roll Call Comparisons
Fuzzy k-means (figure legend: blue: Senator Schumer; black: "Outcome"; red: Senator Ensign):
- Partial membership values are very sensitive to the exponent \gamma.
- For no value of \gamma do the membership values make sense.

Senate Roll Call Comparisons
Dirichlet Process Mixtures:
- The DPM confidently infers 4 clusters.
- Uncertainty is not a good substitute for partial membership.
[Table: negative log predictive probability (in bits) across senators (mean, median, min, max, and the "Outcome") for the DPM and the BPM.]

Image Data
329 tower and sunset images with 240 simple binary texture and color features; K = 2 clusters.

Related Work
- Latent Dirichlet Allocation (LDA) / Mixed Membership Models
- Fuzzy Clustering
- Exponential Family PCA

Future Work
It would be nice to have a nonparametric version. The obvious thing to try is Hierarchical Dirichlet Processes, but this would require summing over all infinitely many elements of \pi_n, which isn't computationally feasible; it is also semantically not very nice. Indian Buffet Processes might work: sample an IBP matrix, with the interpretation that a 1 means having some non-zero amount of membership in that cluster, and then draw the continuous exact amount separately.

Conclusions
- Developed a fully probabilistic approach to data modeling with partial membership.
- Uses continuous latent variables and can be seen as a relaxation of clustering with standard mixture models.
- Used Hybrid Monte Carlo for inference, which was extremely fast (finding sensible partial membership structure after very few samples).

Thank You

Partial Membership
Partial membership is the cornerstone of fuzzy set theory.
- Traditional set theory: items belong to a set or they don't: \mu_A(x) \in \{0, 1\}.
- Fuzzy set theory: a membership function \mu_A(x) \in [0, 1] denotes the degree to which x belongs to set A.
Fuzzy logic versus probabilistic models:
- There have been misguided arguments that fuzzy logic is different from, or supersedes, probability theory.
- While it might be easy to dismiss fuzzy logic, its framework for representing partial membership has inspired many researchers.
- Google Scholar: over 45,000 fuzzy clustering papers; the most cited ones are cited as frequently as the most cited "NIPS"-area papers.

Related Work - Latent Dirichlet Allocation (LDA) and Mixed Membership Models
- The BPM generates data points at the document level of LDA (there is no word plate).
- Whereas LDA (or mixed membership models) assume words (or attributes) are drawn using \pi_n as mixing proportions in a mixture model, and are factorized, the BPM uses \pi_n to form a convex combination of natural parameters; attributes are not drawn from a mixture model and need not be factorized.
- The BPM potentially allows faster MCMC sampling, since it has all continuous parameters, while LDA must infer a discrete topic assignment for each word.

Mixed Membership Model Generation

Related Work: Fuzzy Clustering
Fuzzy k-means iteratively minimizes the following objective:
J = \sum_n \sum_k \pi_{nk}^{\gamma} \, d(x_n, c_k)
where d is the distance between a data point x_n and a cluster center c_k, \pi_{nk} is the degree of membership of data point n in cluster k, and \gamma controls the amount of partial membership (\gamma = 1 is normal k-means). None of these variables have probabilistic interpretations.
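A minimal sketch of that objective and the standard membership update (textbook fuzzy c-means formulas, not taken from the paper; the update assumes \gamma > 1, with hard k-means recovered in the limit \gamma -> 1):

```python
# Fuzzy k-means objective and membership update (standard formulation, illustrative).
import numpy as np

def fuzzy_objective(X, centers, Pi, gamma=2.0):
    """X: (N, D) data; centers: (K, D); Pi: (N, K) memberships with rows summing to 1."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # squared distances
    return float(np.sum((Pi ** gamma) * d2))

def update_memberships(X, centers, gamma=2.0, eps=1e-12):
    """Closed-form membership update that minimizes the objective for fixed centers."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) + eps
    inv = d2 ** (-1.0 / (gamma - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)
```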

Related Work: Exponential Family PCA
Originally formulated in terms of Bregman divergences, it can be seen as a non-Bayesian version of the BPM in which the \pi_n's are not constrained (to be positive or to normalize to 1), so it is not a convex combination of natural parameters with the same sort of partial membership interpretation. If we wanted, we could relax these same constraints to get a Bayesian version of Exponential Family PCA, but we would have to tweak the model, e.g. by placing a Gaussian prior on \pi_n.

BPM Learning
Hybrid Monte Carlo is an MCMC method that uses gradient information. It simulates the dynamics of a system with continuous state variables \omega on an energy function E(\omega) = -\log p(X, \omega). The gradients \partial E / \partial \omega provide forces on the state variables which encourage the system to find high-probability regions, while maintaining detailed balance.
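A generic leapfrog HMC step, as a sketch of the kind of sampler used (this is textbook HMC, not the paper's implementation; the energy function and its gradient for the BPM posterior are assumed to be supplied by the caller):

```python
# Generic Hamiltonian/Hybrid Monte Carlo step with leapfrog dynamics (illustrative).
import numpy as np

def hmc_step(w, energy, grad_energy, step_size=0.01, n_leapfrog=20, rng=None):
    rng = rng or np.random.default_rng()
    p = rng.normal(size=w.shape)                        # sample momentum ~ N(0, I)
    w_new, p_new = w.copy(), p.copy()
    for _ in range(n_leapfrog):                         # leapfrog integration
        p_new = p_new - 0.5 * step_size * grad_energy(w_new)
        w_new = w_new + step_size * p_new
        p_new = p_new - 0.5 * step_size * grad_energy(w_new)
    H_old = energy(w) + 0.5 * np.dot(p, p)              # Hamiltonian before the trajectory
    H_new = energy(w_new) + 0.5 * np.dot(p_new, p_new)  # and after it
    if rng.random() < np.exp(H_old - H_new):            # Metropolis step keeps detailed balance
        return w_new
    return w
```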

Bregman Divergence
d_F(p, q) = F(p) - F(q) - \nabla F(q)^\top (p - q)
where F is a strictly convex function and p and q are points. Intuitively, this is the difference between the value of F at p and the value of the first-order Taylor expansion of F around q, evaluated at p.
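A tiny sketch checking the definition on the squared-Euclidean case F(x) = ||x||^2, whose Bregman divergence reduces to ||p - q||^2:

```python
# Bregman divergence d_F(p, q) = F(p) - F(q) - <grad F(q), p - q> (illustrative check).
import numpy as np

def bregman_divergence(F, grad_F, p, q):
    return F(p) - F(q) - np.dot(grad_F(q), p - q)

F = lambda x: float(np.dot(x, x))       # strictly convex: squared Euclidean norm
grad_F = lambda x: 2.0 * x
p, q = np.array([1.0, 2.0]), np.array([0.0, 1.0])
print(bregman_divergence(F, grad_F, p, q))  # 2.0, which equals ||p - q||^2
```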

LDA Review
1. For z = 1...K, draw \phi_z \sim Dirichlet(\beta).
2. For d = 1...D:
   a) Draw \theta_d \sim Dirichlet(\alpha).
   b) For n = 1...N_d:
      i. Draw z_{dn} \sim Multinomial(\theta_d).
      ii. Draw w_{dn} \sim Multinomial(\phi_{z_{dn}}).
Here \alpha and \beta are hyperparameters, \theta_d are the multinomial parameters for topics in document d, \phi_z are the multinomial parameters for words given topics, w are words, and z are topics; K is the number of topics, N_d the number of words in a document, and D the number of documents.
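A hedged sketch of this generative process in code (the vocabulary size V and the fixed document length are assumptions added for concreteness):

```python
# LDA generative process (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
K, D, V = 5, 10, 100                 # topics, documents, vocabulary size (assumed)
alpha, beta = 0.1, 0.01              # hypothetical Dirichlet hyperparameters
N_d = 50                             # words per document (fixed here for simplicity)

phi = rng.dirichlet(np.full(V, beta), size=K)     # word distribution for each topic
docs = []
for d in range(D):
    theta = rng.dirichlet(np.full(K, alpha))      # topic proportions for document d
    z = rng.choice(K, size=N_d, p=theta)          # topic assignment for each word
    w = np.array([rng.choice(V, p=phi[k]) for k in z])  # each word drawn from its topic
    docs.append(w)
```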