Probabilistic Clustering-Projection Model for Discrete Data

Presentation transcript:

Probabilistic Clustering-Projection Model for Discrete Data
Shipeng Yu¹ ², Kai Yu², Volker Tresp², Hans-Peter Kriegel¹
¹ Institute for Computer Science, University of Munich
² Siemens Corporate Technology, Munich, Germany
October 2005

Outline
- Motivation
- Previous Work
- The PCP Model
- Learning in the PCP Model
- Experiments
- Conclusion and Future Work

Motivation
We model discrete data in this work, a fundamental problem for data mining and machine learning:
- In "bag-of-words" document modelling: document-word pairs
- In collaborative filtering: item-rating pairs
Properties:
- The data can be described as a big matrix with integer entries
- The data matrix is normally very sparse (>90% of the entries are zeros)
[Figure: the document-word matrix, with documents d = 1, ..., D as rows, words w = 1, ..., V as columns, and occurrence counts as entries]

Data Clustering
Goal: group similar documents together.
- For continuous data: distance-based similarity (k-means)
  - Iteratively minimize a distance-based cost function
  - Equivalent to a Gaussian mixture model
- For discrete data: occurrence-based similarity
  - Similar documents should have similar occurrences of words
  - No Gaussianity holds for discrete data
A minimal sketch of the continuous case follows below.
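As a small illustration of the continuous case (the toy data and scikit-learn usage are assumptions, not part of the talk), hard k-means assignments and soft Gaussian-mixture responsibilities can be compared on the same points:

```python
# Minimal sketch: hard k-means labels vs. soft Gaussian-mixture responsibilities
# on a toy continuous data set (all of this is illustrative, not from the slides).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0.0, 3.0)])

hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)           # 0/1 assignments
soft = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)  # responsibilities

print(hard[:5], soft[:5].round(2))
```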

Data Projection
Goal: find a low-dimensional feature mapping.
- For continuous data: Principal Component Analysis
  - Find orthogonal dimensions to explain the data covariance
- For discrete data: topic detection
  - Topics explain the co-occurrences of words
  - Topics are not orthogonal, but independent
[Figure: words w = 1, ..., V mapped to topics z = 1, ..., K across documents d = 1, ..., D]
(A rough code analogy follows below.)
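As a rough analogy (an illustration with off-the-shelf tools, not the methods compared in the talk), scikit-learn's PCA and LatentDirichletAllocation can both be read as projections of a document-word count matrix onto K latent dimensions:

```python
# Minimal sketch: two ways to project a document-word count matrix to K dimensions.
# The toy corpus and the scikit-learn models are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA, LatentDirichletAllocation

rng = np.random.default_rng(0)
counts = rng.poisson(lam=1.0, size=(100, 50))   # 100 documents, 50 words

K = 5
pca_proj = PCA(n_components=K).fit_transform(counts.astype(float))   # orthogonal directions
lda = LatentDirichletAllocation(n_components=K, random_state=0).fit(counts)
topic_proj = lda.transform(counts)              # per-document topic proportions

print(pca_proj.shape, topic_proj.shape)         # (100, 5) (100, 5)
```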

Projection versus Clustering
They are normally modelled separately. But why not jointly?
- A more informative projection → better document clusters
- A better clustering structure → a better projection for words
- There should be a stable situation
And how? The PCP model:
- A well-defined generative model for the data
- Standard ways for learning and inference
- Generalizable to new data

Previous Work for Discrete Data
Projection models:
- PLSI [Hofmann 99]: the first topic model; not a well-defined generative model
- LDA [Blei et al 03]: state-of-the-art topic model; generalizes PLSI with a Dirichlet prior; no clustering effect is modelled
Clustering models:
- NMF [Lee & Seung 99]: factorizes the data matrix; can be explained as a clustering model; no projection of words is directly modelled
Joint projection-clustering models:
- Two-sided clustering [Hofmann & Puzicha 98]: same problem as PLSI
- Discrete-PCA [Buntine & Perttu 03]: similar to LDA in spirit
- TTMM [Keller & Bengio 04]: lacks a full Bayesian explanation

PCP Model: Overview
Probabilistic Clustering-Projection Model:
- A probabilistic model for discrete data
- A clustering model using projected features
- A projection model with structural data
Learning in the PCP model: variational EM
- Exactly equivalent to iteratively performing clustering and projection operations
- Guaranteed convergence

PCP Model: Sampling Process
[Figure: the generative sampling process. Cluster weights π (Dirichlet prior λ) and cluster centers θ₁, ..., θ_M (Dirichlet prior α) drive a Multinomial clustering model; topics z and words w are drawn per document w₁, ..., w_D through the Multinomial projection matrix β. D documents, M clusters, K topics, V words.]
- Clustering model using projected features
- Projection model with structural data
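A minimal generative sketch of the process the figure describes, reconstructed from the slide's symbols (the exact prior placement and the per-document cluster variable c_d are assumptions inferred from the update slides below, not a verbatim copy of the paper):

```python
# Assumed PCP-style sampling process, for illustration only:
#   pi      ~ Dirichlet(lambda)   cluster weights over M clusters
#   theta_m ~ Dirichlet(alpha)    cluster centers = distributions over K topics
#   beta    : K x V topic-word projection matrix (a model parameter in the paper;
#             sampled here only to obtain a valid stochastic matrix)
#   per document d: c_d ~ Mult(pi); per word: z ~ Mult(theta_{c_d}), w ~ Mult(beta_z)
import numpy as np

rng = np.random.default_rng(0)
D, M, K, V, N = 100, 4, 10, 500, 80   # documents, clusters, topics, vocabulary, doc length

pi = rng.dirichlet(np.full(M, 1.0))
theta = rng.dirichlet(np.full(K, 0.1), size=M)    # M x K cluster centers
beta = rng.dirichlet(np.full(V, 0.01), size=K)    # K x V projection matrix

docs = []
for d in range(D):
    c = rng.choice(M, p=pi)                       # cluster assignment of document d
    z = rng.choice(K, size=N, p=theta[c])         # topic of each word
    w = np.array([rng.choice(V, p=beta[k]) for k in z])
    docs.append(w)

print(len(docs), docs[0][:10])
```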

PCP Model: Plate Model
[Figure: plate diagram of the likelihood, separating model parameters, latent variables, and observations into the clustering model and the projection model]

Learning in the PCP Model
We are interested in the posterior distribution, but the integral is intractable.
Variational EM learning:
- Approximate the posterior with a variational distribution (with Dirichlet and Multinomial variational parameters)
- Minimize the KL-divergence KL(q ‖ p) between the variational distribution q and the true posterior p
- Variational E-step: minimize w.r.t. the variational parameters
- Variational M-step: minimize w.r.t. the model parameters
- Iterate until convergence

Update Equations
The update equations separate into clustering updates and projection updates.
Variational EM learning therefore corresponds to iteratively performing clustering and projection until convergence.

Clustering Updates
[Equations omitted: each update combines a prior term, a likelihood term, and a sufficient projection term.]
- Update the soft cluster assignments P(c_d = m)
- Update the cluster centers
- Update the cluster weights

Projection Updates
[Equations omitted: the updates use a sufficient clustering term.]
- Update the word projections P(z_{d,n} = k)
- Update the projection matrix β (an empirical estimate)

PCP Learning Algorithm
[Figure: the algorithm alternates between the clustering updates (driven by the sufficient projection term) and the projection updates (driven by the sufficient clustering term) until convergence.]
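The exact PCP update equations are in the paper; as a much simpler stand-in that shows the same alternation between soft assignments and parameter re-estimation, here is EM for a plain multinomial mixture of documents (everything in this snippet, including the model itself, is an illustrative substitute rather than the PCP algorithm):

```python
# EM for a multinomial mixture of documents: a simplified stand-in illustrating
# the "soft cluster assignments <-> parameter updates" loop, not the PCP updates.
import numpy as np

def multinomial_mixture_em(counts, M, n_iter=50, seed=0):
    """counts: (D, V) word-count matrix; M: number of clusters."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    pi = np.full(M, 1.0 / M)                       # cluster weights
    theta = rng.dirichlet(np.ones(V), size=M)      # cluster word distributions
    for _ in range(n_iter):
        # E-step: soft cluster assignments P(c_d = m), computed in log space.
        log_r = np.log(pi) + counts @ np.log(theta).T        # (D, M)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate cluster weights and word distributions.
        pi = r.mean(axis=0)
        theta = r.T @ counts + 1e-6                # smooth to avoid log(0)
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta, r

# Usage on toy counts:
rng = np.random.default_rng(1)
counts = rng.poisson(1.0, size=(200, 30))
pi, theta, r = multinomial_mixture_em(counts, M=3)
print(pi.round(3), r.shape)
```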

Experiments
Methodology:
- Document modelling: compare model generalization
- Word projection: evaluate the topic space
- Document clustering: evaluate the clustering results
Data sets:
- 5 categories in Reuters-21578: 3948 docs, 7665 words
- 4 categories in 20Newsgroup: 3888 docs, 8396 words
Preprocessing:
- Stemming and stop-word removal
- Keep only words that occur in at least 5 documents

Case Study
Run on a 4-group subset of the 20Newsgroup data: Car, Bike, Baseball, Hockey.
[Figure: results on the four groups.]

Exp1: Document Modelling
Goal: evaluate generalization performance.
Methods to compare:
- PLSI: a "pseudo" form for generalization
- LDA: the state-of-the-art method
Metric: perplexity, with 90% of the documents for training and 10% for testing:
$$\mathrm{Perp}(D_{\mathrm{test}}) = \exp\!\left(-\frac{\sum_{d}\ln p(\mathbf{w}_d)}{\sum_{d} N_d}\right)$$
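For concreteness, a minimal sketch of how this perplexity is computed from per-document test log-likelihoods (the numbers below are placeholders, not results from the paper):

```python
# Perplexity over held-out documents: exp(-sum_d log p(w_d) / sum_d N_d).
# log_liks and doc_lengths stand in for whatever the fitted model reports.
import numpy as np

log_liks = np.array([-1523.4, -987.2, -2010.8])   # log p(w_d) per test document
doc_lengths = np.array([310, 195, 402])           # N_d: tokens per test document

perplexity = np.exp(-log_liks.sum() / doc_lengths.sum())
print(round(float(perplexity), 2))
```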

Exp2: Word Projection
Goal: evaluate the projection matrix β.
Methods to compare: PLSI, LDA.
We train SVMs on the 10-dimensional space after projection and test classification accuracy on held-out data.
[Figure: classification accuracy on Reuters and Newsgroup.]
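A rough sketch of that evaluation protocol with off-the-shelf components (scikit-learn's LDA stands in for the learned projection; the toy counts, labels, and settings are assumptions):

```python
# Project documents to a 10-dimensional topic space, then score an SVM on held-out data.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(400, 200))        # toy document-word counts
labels = rng.integers(0, 4, size=400)             # toy class labels

lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(counts)
features = lda.transform(counts)                  # 10-dimensional projected features

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)
print("held-out accuracy:", round(clf.score(X_te, y_te), 3))
```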

Exp3: Document Clustering
Goal: evaluate the clustering of documents.
Methods to compare:
- NMF: factorize the data matrix for clustering
- LDA + k-means: cluster in the projected space
Metric: normalized mutual information (NMI) between the found clusters and the true categories.
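NMI itself is a standard quantity; a one-line sketch with scikit-learn (the label arrays are placeholders):

```python
# Normalized mutual information between predicted clusters and true categories.
from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]       # placeholder ground-truth categories
pred_clusters = [1, 1, 0, 0, 2, 2]     # placeholder cluster assignments

print(normalized_mutual_info_score(true_labels, pred_clusters))  # 1.0: clusters match up to relabelling
```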

Conclusion
- PCP is a well-defined generative model
- PCP models clustering and projection jointly
- Learning in PCP corresponds to an iterative process of clustering and projection
- PCP learning guarantees convergence
Future work:
- Large-scale experiments
- Build a probabilistic model with more factors

Thank you! Questions?