Generalized Model Selection for Unsupervised Learning in High Dimension
Vaithyanathan and Dom, IBM Almaden Research Center, NIPS '99

Abstract
Bayesian approach to model selection in unsupervised learning:
– proposes a unified objective function whose arguments include both the feature space and the number of clusters
– determines the feature set (dividing the features into noise features and useful features) and the number of clusters
– marginal likelihood under the Bayesian scheme vs. cross-validation (cross-validated likelihood)
– DC (distributional clustering of terms) for initial feature selection

Model Selection in Clustering
Existing approaches: Bayesian approaches, cross-validation techniques, MDL approaches.
Need for a unified objective function:
– the optimal number of clusters depends on the feature space in which the clustering is performed
– cf. feature selection in clustering

Model Selection in Clustering (Cont'd)
Generalized model for clustering:
– data D = {d_1, …, d_n}, feature space T with dimension M
– maximize the likelihood P(D^T | Ω, Θ), where Ω is the structure of the model (the number of clusters, the partitioning of the feature set into U (useful set) and N (noise set), and the assignment of patterns to clusters) and Θ its parameter vectors
Bayesian approach to model selection:
– regularization using the marginal likelihood

Bayesian Approach to Model Selection for Clustering
Data:
– D = {d_1, …, d_n}, feature space T with dimension M
Clustering D:
– finding the model structure Ω and parameters Θ that maximize P(D^T | Ω, Θ)
– where Ω is the structure of the model and Θ is the set of all parameter vectors
– the model structure Ω consists of the number of clusters, the partitioning of the feature set, and the assignment of patterns to clusters

Assumptions
1. The feature sets represented by U and N are conditionally independent.
2. The data D = {d_1, …, d_n} are i.i.d.
Maximizing the likelihood alone provides no regularization → use the marginal (integrated) likelihood.

3. All parameter vectors are independent.
Marginal likelihood: P(D^T | Ω) = ∫ P(D^T | Θ, Ω) π(Θ | Ω) dΘ
Approximations to the marginal likelihood / stochastic complexity:
– exact evaluation over all models is computationally very expensive → prune the search space by reducing the number of feature partitions
– the marginal likelihood implicitly penalizes model complexity
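Under assumptions 1–3, the marginal likelihood separates into a noise-feature term, computed once over all documents, and a product of per-cluster terms over the useful features. The following is a hedged reconstruction of that factorization (D^N denotes the noise-feature data, D_k^U the useful-feature data of cluster k, and Θ_N, θ_k their parameters; the exact form in the paper may differ in details):

\[
P(D^T \mid \Omega) \;=\; \int P(D^N \mid \Theta_N)\,\pi(\Theta_N)\,d\Theta_N \;\times\; \prod_{k=1}^{K} \int P(D_k^U \mid \theta_k)\,\pi(\theta_k)\,d\theta_k
\]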

Document Clustering
Marginal likelihood (Eq. 11): adapting multinomial models, using term counts as the features and assuming Dirichlet priors π(·), which are conjugate to the multinomial.
Model-selection score: NLML (negative log marginal likelihood).
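As a concrete illustration of such an NLML score, here is a minimal Python sketch of the Dirichlet-multinomial (collapsed) marginal likelihood, with useful-feature counts scored per cluster and noise-feature counts pooled over all documents, as described above. The symmetric hyperparameter `alpha` and the function names are assumptions, not taken from the paper, and the multinomial coefficient is omitted.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_multinomial(counts, alpha=1.0):
    """Log marginal likelihood of term counts under a multinomial model with a
    symmetric Dirichlet(alpha) prior (token-sequence form, coefficient omitted)."""
    counts = np.asarray(counts, dtype=float)
    V, N = counts.size, counts.sum()          # vocabulary size, total token count
    return (gammaln(V * alpha) - gammaln(V * alpha + N)
            + np.sum(gammaln(counts + alpha) - gammaln(alpha)))

def nlml(cluster_useful_counts, noise_counts, alpha=1.0):
    """Negative log marginal likelihood: useful features scored per cluster,
    noise features scored once over all documents."""
    score = log_marginal_multinomial(noise_counts, alpha)
    for counts in cluster_useful_counts:      # one useful-term count vector per cluster
        score += log_marginal_multinomial(counts, alpha)
    return -score
```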

Document Clustering (Cont'd)
Cross-validated likelihood as an alternative model-selection criterion.
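The cross-validated likelihood itself is not spelled out on this slide; the following is only a generic sketch of a v-fold cross-validated log-likelihood, with `fit` and `log_likelihood` as hypothetical user-supplied callables (train a clustering model on the training folds, score the held-out fold):

```python
import numpy as np

def cross_validated_log_likelihood(X, fit, log_likelihood, n_folds=5, seed=0):
    """v-fold cross-validated likelihood: fit the clustering model on the
    training folds and average the held-out folds' log-likelihoods."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    score = 0.0
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = fit(X[train])                 # user-supplied training routine
        score += log_likelihood(model, X[test])
    return score / n_folds
```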

Distributional Clustering for Feature Subset Selection
A heuristic method to obtain a subset of tokens that are topical and can be used as features in the bag-of-words model to cluster documents.
Reduce the feature size from M to C by clustering words based on their distributions over the documents.
A histogram for each token:
– first bin: number of documents with zero occurrences of the token
– second bin: number of documents containing a single occurrence of the token
– third bin: number of documents containing two or more occurrences of the token
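A small sketch of how such three-bin histograms could be built from tokenized documents (variable and function names are illustrative):

```python
from collections import Counter

def token_histograms(docs):
    """3-bin occurrence histogram per token:
    (#docs with 0 occurrences, #docs with exactly 1, #docs with 2 or more)."""
    doc_counts = [Counter(doc) for doc in docs]   # each doc is a list of tokens
    vocab = set().union(*doc_counts)
    n_docs = len(docs)
    hists = {}
    for tok in vocab:
        one = sum(1 for c in doc_counts if c[tok] == 1)
        two_plus = sum(1 for c in doc_counts if c[tok] >= 2)
        hists[tok] = (n_docs - one - two_plus, one, two_plus)
    return hists
```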

DC for Feature Subset Selection (Cont'd)
Measure of similarity of the histograms:
– relative entropy, i.e. the K-L distance D(·||·); e.g. for two terms with distributions p_1(·) and p_2(·), D(p_1 || p_2) = Σ_x p_1(x) log(p_1(x) / p_2(x))
– k-means-style DC: group the token histograms around centroids under this distance
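A sketch of the K-L distance and a k-means-style distributional clustering loop over token histograms; the smoothing constant `eps`, the random initialization, and the names are assumptions:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Relative entropy D(p || q) between two (smoothed, normalized) histograms."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def kmeans_dc(histograms, k, n_iter=20, seed=0):
    """k-means-style distributional clustering: assign each token histogram
    to the nearest centroid under KL distance, then recompute centroids."""
    rng = np.random.default_rng(seed)
    X = np.asarray(histograms, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.array([np.argmin([kl(x, c) for c in centroids]) for x in X])
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```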

Experimental Setup
AP Reuters newswire articles from TREC-6:
– 8235 documents from the routing track, 25 classes; documents belonging to multiple classes disregarded
– 32450 unique terms (discarding terms that appeared in fewer than 3 documents)
Evaluation measure of clustering:
– mutual information (MI) between cluster assignments and class labels
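A minimal sketch of the MI evaluation, estimated from the contingency table between cluster assignments and class labels (equivalent, up to units, to `sklearn.metrics.mutual_info_score`):

```python
import numpy as np

def mutual_information(cluster_labels, class_labels):
    """Mutual information (in nats) between cluster assignments and true classes."""
    cluster_labels = np.asarray(cluster_labels)
    class_labels = np.asarray(class_labels)
    n = len(cluster_labels)
    mi = 0.0
    for k in np.unique(cluster_labels):
        for c in np.unique(class_labels):
            p_kc = np.sum((cluster_labels == k) & (class_labels == c)) / n
            if p_kc > 0:
                p_k = np.sum(cluster_labels == k) / n
                p_c = np.sum(class_labels == c) / n
                mi += p_kc * np.log(p_kc / (p_k * p_c))
    return mi
```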

Results of Distributional Clustering
– cluster the tokens into 3, 4, and 5 clusters
– eliminate clusters of function words
[Figure 1: centroid of a typical high-frequency function-words cluster]

Finding the Optimum Features and Document Clusters for a Fixed Number of Clusters
Apply the objective function (Eq. 11) to the feature subsets selected by DC:
– EM/CEM (Classification EM: a hard-assignment version of EM)
– initialization: k-means algorithm
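A brief sketch of a CEM loop for a multinomial mixture over term-count vectors, with hard (classification) assignments in place of the soft E-step and an externally supplied initial labeling (e.g. from k-means, as on the slide); the smoothing constants and names are assumptions:

```python
import numpy as np

def cem_multinomial(X, z_init, k, n_iter=30, alpha=1e-2):
    """Classification EM for a multinomial mixture over term-count vectors X.
    C-step: hard-assign each document to its most probable cluster.
    M-step: re-estimate mixture weights and (smoothed) term probabilities."""
    z = np.asarray(z_init).copy()
    for _ in range(n_iter):
        # M-step
        pi = np.array([(z == j).mean() for j in range(k)]) + 1e-12
        theta = np.vstack([X[z == j].sum(axis=0) + alpha for j in range(k)])
        theta /= theta.sum(axis=1, keepdims=True)
        # C-step (hard E-step): log-posterior up to an additive constant
        log_post = np.log(pi) + X @ np.log(theta).T
        z_new = log_post.argmax(axis=1)
        if np.array_equal(z_new, z):
            break
        z = z_new
    return z
```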

Comparison of Feature-Selection Heuristics
– FBTop20: removal of the top 20% most frequent terms
– FBTop40: removal of the top 40% most frequent terms
– FBTop40Bot10: removal of the top 40% most frequent terms and of all tokens that do not appear in at least 10 documents
– NF: no feature selection
– CSW: common stop words removed
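A minimal sketch of these frequency-based heuristics on a documents-by-terms count matrix; `top_frac` and `min_docs` are illustrative parameters (e.g. FBTop40Bot10 roughly corresponds to top_frac=0.4, min_docs=10):

```python
import numpy as np

def frequency_based_selection(X, vocab, top_frac=0.4, min_docs=0):
    """Drop the top_frac most frequent terms and, optionally, terms appearing
    in fewer than min_docs documents. X is a docs-by-terms count matrix."""
    term_freq = X.sum(axis=0)                 # total occurrences per term
    doc_freq = (X > 0).sum(axis=0)            # document frequency per term
    n_drop = int(top_frac * len(vocab))
    too_frequent = set(np.argsort(term_freq)[::-1][:n_drop])
    keep = [j for j in range(len(vocab))
            if j not in too_frequent and doc_freq[j] >= min_docs]
    return X[:, keep], [vocab[j] for j in keep]
```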