Spectral Algorithms for Learning and Clustering Santosh Vempala Georgia Tech

Thanks to: Nina Balcan Avrim Blum Charlie Brubaker David Cheng Amit Deshpande Petros Drineas Alan Frieze Ravi Kannan Luis Rademacher Adrian Vetta V. Vinay Grant Wang

Warning This is the speaker’s first hour-long computer talk --- viewer discretion advised.

“Spectral Algorithm”?? Input is a matrix or a tensor. The algorithm uses singular values/vectors (or principal components) of the input. Does something interesting!

Spectral Methods Indexing, e.g., LSI. Embeddings, e.g., the Colin de Verdière (CdeV) parameter. Combinatorial Optimization, e.g., max-cut in dense graphs, planted clique/partition problems. A course at Georgia Tech this Fall will be online and more comprehensive!

Two problems 1. Learn a mixture of Gaussians (classify a sample). 2. Cluster from pairwise similarities.

Singular Value Decomposition Any real m x n matrix A can be decomposed as A = UΣV^T = Σ_i σ_i u_i v_i^T, where σ_1 ≥ σ_2 ≥ … ≥ 0 are the singular values and u_i, v_i are the corresponding left and right singular vectors.
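A minimal numpy sketch of this decomposition (the matrix here is random, purely for illustration):

import numpy as np

m, n = 100, 40
A = np.random.randn(m, n)                      # any real m x n matrix

# Thin SVD: A = U @ diag(s) @ Vt, singular values s in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(A, U @ np.diag(s) @ Vt)     # exact reconstruction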

SVD in geometric terms Rank-1 approximation is the projection to the line through the origin that minimizes the sum of squared distances. Rank-k approximation is the projection to the k-dimensional subspace that minimizes the sum of squared distances.
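A quick numerical check of this geometric fact, as a self-contained sketch (random data, and a random k-subspace as the comparison):

import numpy as np

def project_rows(A, basis):
    # basis: n x k with orthonormal columns; project each row of A onto its span.
    return A @ basis @ basis.T

m, n, k = 100, 40, 5
A = np.random.randn(m, n)
_, _, Vt = np.linalg.svd(A, full_matrices=False)
svd_subspace = Vt[:k].T                                 # top-k right singular vectors
rand_subspace, _ = np.linalg.qr(np.random.randn(n, k))  # arbitrary k-subspace

err_svd = np.sum((A - project_rows(A, svd_subspace)) ** 2)
err_rand = np.sum((A - project_rows(A, rand_subspace)) ** 2)
print(err_svd <= err_rand)   # True: the SVD subspace minimizes the sum of squared distances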

Fast SVD/PCA with sampling [Frieze-Kannan-V. ‘98] Sample a “constant” number of rows/columns of the input matrix; the SVD of the sample approximates the top components of the SVD of the full matrix. [Drineas-F-K-V-Vinay] [Achlioptas-McSherry] [D-K-Mahoney] [Deshpande-Rademacher-V-Wang] [Har-Peled] [Arora, Hazan, Kale] [De-V] [Sarlos] … Fast (nearly linear time) SVD/PCA appears practical for massive data.
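A rough sketch of the row-sampling idea in the spirit of the papers above (length-squared sampling); the sample size s and the plain SVD of the small matrix are illustrative choices, not the exact algorithm of any one paper:

import numpy as np

def approx_top_directions(A, s, k, rng=np.random.default_rng(0)):
    # Sample s rows with probability proportional to their squared norm and rescale,
    # so that E[S.T @ S] = A.T @ A; the top-k right singular vectors of the small
    # matrix S then approximate the top-k principal directions of A.
    row_norms2 = np.einsum('ij,ij->i', A, A)
    probs = row_norms2 / row_norms2.sum()
    idx = rng.choice(A.shape[0], size=s, p=probs)     # sampling with replacement
    S = A[idx] / np.sqrt(s * probs[idx])[:, None]
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    return Vt[:k]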

Mixture models Easy to unravel if components are far enough apart; impossible if components are too close.

Distance-based classification How far apart? Samples from the same spherical component concentrate near each other, so if the means are far enough apart, pairwise distances separate the components. A separation of roughly √n standard deviations between means suffices [Dasgupta ‘99]; this was improved to roughly n^{1/4} standard deviations [Dasgupta, Schulman ‘00] [Arora, Kannan ‘01] (more general).
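A toy version of distance-based classification, assuming a well-separated spherical mixture: link two samples whenever their distance is below a threshold and read off connected components. The threshold is left to the caller; it is a stand-in for the separation conditions in the papers above, not their actual choice.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def classify_by_distance(X, threshold):
    # X: m x n array of samples. Returns a component label per sample.
    D = squareform(pdist(X))                     # pairwise Euclidean distances
    adjacency = csr_matrix(D < threshold)        # link samples that are close
    _, labels = connected_components(adjacency, directed=False)
    return labels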

Hmm… Random Projection anyone? Project to a random k-dimensional subspace (n → k). Every pairwise distance shrinks by roughly the same factor, ||X’−Y’|| ≈ √(k/n)·||X−Y||, so the separation between components relative to their spread is unchanged. No improvement!
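A quick check of the "no improvement" point, assuming the standard model of projecting onto a uniformly random k-dimensional subspace: every distance shrinks by about the same √(k/n) factor.

import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 50
X, Y = rng.standard_normal(n), rng.standard_normal(n)

# Orthonormal basis of a random k-dimensional subspace of R^n.
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))

shrink = np.linalg.norm(Q.T @ (X - Y)) / np.linalg.norm(X - Y)
print(shrink, np.sqrt(k / n))   # both are about 0.22: distances shrink uniformly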

Spectral Projection Project to the span of the top k principal components of the data: replace A by its rank-k approximation A_k = Σ_{i=1..k} σ_i u_i v_i^T. Apply distance-based classification in this subspace.
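A sketch of the spectral projection step, assuming the rows of A are the samples; distance-based classification (e.g., the connected-components sketch above, or k-means) is then run on the projected points.

import numpy as np

def spectral_project(A, k):
    # Project each row of A onto the span of the top-k right singular
    # vectors of A (the SVD subspace) and return its k coordinates there.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return A @ Vt[:k].T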

Guarantee Theorem [V-Wang ’02]. Let F be a mixture of k spherical Gaussians whose means are pairwise separated by roughly k^{1/4} standard deviations (up to logarithmic factors). Then with probability 1−δ, the Spectral Algorithm correctly classifies m samples.

Main idea Subspace of top k principal components (SVD subspace) spans the means of all k Gaussians

SVD in geometric terms Rank-1 approximation is the projection to the line through the origin that minimizes the sum of squared distances. Rank-k approximation is the projection to the k-dimensional subspace minimizing the sum of squared distances.

Why? Best line for 1 Gaussian? - line through the mean Best k-subspace for 1 Gaussian? - any k-subspace through the mean Best k-subspace for k Gaussians? - the k-subspace through all k means!

How general is this? Theorem [VW ’02]. For any mixture of weakly isotropic distributions, the best k-subspace is the span of the means of the k components. (Weakly isotropic: the covariance matrix is a multiple of the identity.)

Sample SVD Sample SVD subspace is “close” to mixture’s SVD subspace. Doesn’t span means but is close to them.

2 Gaussians in 20 dimensions

4 Gaussians in 49 dimensions

Mixtures of logconcave distributions Theorem [Kannan, Salmasian, V. ‘04]. For any mixture of k distributions with SVD subspace V, the means lie close to V: roughly, the mixing-weight-weighted average of the squared distances d(μ_i, V)² is at most k times the weighted average of the components’ maximum directional variances.

Questions 1. Can Gaussians separable by hyperplanes be learned in polytime? 2. Can Gaussian mixture densities be learned in polytime? [Feldman, O’Donnell, Servedio]

Clustering from pairwise similarities Input: A set of objects and a (possibly implicit) function on pairs of objects. Output: 1. A flat clustering, i.e., a partition of the set 2. A hierarchical clustering 3. (A weighted list of features for each cluster)

Typical approach Optimize a “natural” objective function, e.g., k-means, min-sum, min-diameter, etc., using EM/local search (widely used) OR a provable approximation algorithm. Issues: quality, efficiency, validity. Reasonable objective functions are NP-hard to optimize.

Divide and Merge Recursively partition the graph induced by the pairwise function to obtain a tree, then find an “optimal” tree-respecting clustering (see the sketch below). Rationale: it is easier to optimize over trees; k-means, k-median, and correlation clustering are all solvable quickly with dynamic programming.
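A compact sketch of the merge phase under simple assumptions: given a binary tree from the divide phase (leaves hold point indices), dynamic programming finds the cheapest tree-respecting k-clustering for a decomposable objective; k-means cost is used here purely as an example.

import numpy as np

def kmeans_cost(X, indices):
    # Sum of squared distances of the chosen points to their centroid.
    P = X[list(indices)]
    return float(((P - P.mean(axis=0)) ** 2).sum())

def best_tree_clustering(X, tree, k):
    # tree: a leaf is a list of point indices; an internal node is a pair (left, right).
    # Returns (cost, clusters) for the cheapest k-clustering that respects the tree.
    def leaves(t):
        return t if isinstance(t, list) else leaves(t[0]) + leaves(t[1])

    def solve(t, j):
        if j == 1:
            pts = leaves(t)
            return kmeans_cost(X, pts), [pts]
        if isinstance(t, list):                 # a leaf cannot be split further
            return float('inf'), None
        best_cost, best_parts = float('inf'), None
        left, right = t
        for j_left in range(1, j):              # split the cluster budget between the children
            c1, p1 = solve(left, j_left)
            c2, p2 = solve(right, j - j_left)
            if p1 is not None and p2 is not None and c1 + c2 < best_cost:
                best_cost, best_parts = c1 + c2, p1 + p2
        return best_cost, best_parts

    return solve(tree, k)

# Example: best_tree_clustering(X, ([0, 1], ([2], [3, 4])), 3)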

Divide and Merge

How to cut? Min cut? (in weighted similarity graph) Min conductance cut [Jerrum-Sinclair] Sparsest cut [Alon], Normalized cut [Shi-Malik] Many applications: analysis of Markov chains, pseudorandom generators, error-correcting codes...

How to cut? Min conductance/expansion is NP-hard to compute. Approximation algorithms: Leighton-Rao, Arora-Rao-Vazirani. Fiedler cut: the minimum of the n−1 cuts obtained by ordering vertices according to their components in the 2nd largest eigenvector of the similarity matrix (sketch below).
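A sketch of the Fiedler/sweep cut just described, assuming a symmetric nonnegative similarity matrix W with positive weighted degrees: order the vertices by the second eigenvector of the normalized similarity matrix and keep the best of the n−1 prefix cuts by conductance.

import numpy as np

def fiedler_cut(W):
    d = W.sum(axis=1)                            # weighted degrees (assumed positive)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    M = D_inv_sqrt @ W @ D_inv_sqrt              # normalized similarity matrix
    _, vecs = np.linalg.eigh(M)                  # eigenvalues in increasing order
    v2 = D_inv_sqrt @ vecs[:, -2]                # second largest eigenvector
    order = np.argsort(v2)

    best_phi, best_side = np.inf, None
    for i in range(1, len(order)):               # the n-1 sweep cuts
        S, T = order[:i], order[i:]
        cut = W[np.ix_(S, T)].sum()
        phi = cut / min(d[S].sum(), d[T].sum())  # conductance of the cut (S, T)
        if phi < best_phi:
            best_phi, best_side = phi, set(S.tolist())
    return best_side, best_phi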

Worst-case guarantees Suppose we can find a cut of conductance at most A·C, where C is the minimum. Theorem [Kannan-V.-Vetta ’00]. If there exists an (α, ε)-clustering (each cluster has conductance at least α, and at most an ε fraction of the edge weight crosses between clusters), then the algorithm is guaranteed to find a clustering whose quality is worse only by factors involving A and log n.

Experimental evaluation Evaluation on data sets where true clusters are known: Reuters, 20 newsgroups, KDD UCI data, etc. Test how well the algorithm does in recovering true clusters – look at the entropy of the clusters found with respect to the true labels (a sketch follows). Question 1: Is the tree any good? Question 2: How does the best partition (that matches true clusters) compare to one that optimizes some objective function?
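One way to compute the entropy of the found clusters with respect to the true labels (a small sketch; 0 means every found cluster is pure):

import numpy as np

def clustering_entropy(found, true):
    # Size-weighted average, over found clusters, of the entropy of the true labels inside.
    found, true = np.asarray(found), np.asarray(true)
    n, total = len(found), 0.0
    for c in np.unique(found):
        _, counts = np.unique(true[found == c], return_counts=True)
        p = counts / counts.sum()
        total += (counts.sum() / n) * -(p * np.log2(p)).sum()
    return total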

Clustering medical records Patient records (> 1 million) with symptoms, procedures & drugs. Goals: predict cost/risk, discover relationships between different conditions, flag at-risk patients, etc. [Bertsimas, Bjarnodottir, Kryder, Pandey, V, Wang] Example clusters (top features):
Cluster 44 [938]: 64.82% "Antidiabetic Agents, Misc."; Ace Inhibitors & Comb.; Sulfonylureas; Antihyperlipidemic Drugs; Blood Glucose Test Supplies; Non-Steroid/Anti-Inflam. Agents; Beta Blockers & Comb.; Calcium Channel Blockers & Comb.; Insulins; Antidepressants.
Cluster 97 [111]: Mental Health/Substance Abuse; Depression; X-ray; Neurotic and Personality Disorders; Year 3 cost - year 2 cost; Antidepressants; Durable Medical Equipment; Psychoses; Subsequent Hospital Care; 8.11% Tranquilizers/Antipsychotics.
Cluster 48 [39]: 94.87% Cardiography (includes stress testing); Nuclear Medicine; CAD; Chest Pain; Cardiology Ultrasound/Doppler; X-ray; Other Diag Radiology; Cardiac Cath Procedures; 25.64% Abnormal Lab and Radiology; Dysrhythmias.

Other domains Clustering genes of different species to discover orthologs – genes performing similar tasks across species (current work by R. Singh, MIT). Eigencluster to cluster search results; compare to Google [Cheng, Kannan, Vempala, Wang].

Future of clustering? Move away from explicit objective functions? E.g., feedback models, similarity functions [Balcan, Blum]. Efficient regularity-style quasi-random clustering: partition into a small number of pieces so that edges between pairs appear random. Tensor clustering: using relationships of small subsets?!

Spectral Methods Indexing, e.g., LSI. Embeddings, e.g., the Colin de Verdière (CdeV) parameter. Combinatorial Optimization, e.g., max-cut in dense graphs, planted clique/partition problems. A course at Georgia Tech this Fall will be online and more comprehensive!