Shai Ben-David, University of Waterloo, Canada, Dec 2004. Generalization Bounds for Clustering: Some Thoughts and Many Questions
The Goal: Provide rigorous generalization bounds for clustering. Why? It would be useful to have assurances that the clusterings we produce are meaningful, rather than just an artifact of data randomness.
1st Step: A formal model for sample-based clustering. There is some large, possibly infinite, domain set X. An unknown probability distribution over X generates an i.i.d. sample. Upon viewing such a sample, a learner wishes to deduce a clustering, as a simple, yet meaningful, description of the distribution.
2nd Step: What should a bound look like? Roughly, we wish to be able to say: if sufficiently many sample points have been drawn, then the clustering we come up with is “stable”.
What should a bound look like? More formally: if S_1, S_2 are sufficiently large i.i.d. samples from the same distribution, then, w.h.p., C(S_1) is ‘similar’ to C(S_2), where C(S) is the clustering we get by applying our clustering algorithm to S.
How is it different from classification bounds? Classification generalization bounds guarantee the convergence of the loss of the hypothesis: “For any distribution P and any large enough sample S, L(A(S)) is close to L(A(P))”. Since for clustering there is no natural analogue of the distribution's true cost, L(A(P)), we consider its ‘stability’ implication: “If S_1, S_2 are sufficiently large i.i.d. samples from the same distribution, then, w.h.p., L(A(S_1)) is close to L(A(S_2))”. Here, for clustering, we seek a stronger statement, namely: “If S_1, S_2 are sufficiently large i.i.d. samples from the same distribution, then, w.h.p., C(S_1) is ‘similar’ to C(S_2)”.
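One possible way to write the stronger statement symbolically is sketched below. The dissimilarity measure D between clusterings, the tolerance \epsilon, the confidence parameter \delta, and the sample-size threshold m_0 are notation assumed here for illustration; the slides leave ‘similar’ informal at this point.

```latex
\forall\, \epsilon, \delta > 0 \;\; \exists\, m_0(\epsilon, \delta) \;\; \forall\, m \ge m_0(\epsilon, \delta): \quad
\Pr_{S_1, S_2 \sim P^m} \Big[ D\big(C(S_1), C(S_2)\big) > \epsilon \Big] < \delta
```

Under this reading, a distribution-free bound would be one in which m_0 does not depend on P.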
A Different Perspective: Replication. From a more traditional scientific-methodology point of view, stability can be viewed as the fundamental issue of replication: to what extent are the results of an experiment reproducible? Replication has been investigated in many applications of clustering, but mostly by visual inspection of the results of cluster analysis on two samples.
Some issues need clarification: How should similarity between clusterings be defined? There are two notions to be defined: similarity between clusterings of the same set, and similarity between clusterings of different sets. Similarity between two clusterings of the same set has been extensively discussed in the literature (see, e.g., Meila in COLT’03).
Reducing the Second Notion to the First: A common approach to defining similarity between clusterings of different sets is to reduce it to a definition of similarity between clusterings of the same set. This is done via an extension operator, a method for extending a clustering of a domain subset to a clustering of the full domain (Breckenridge ’89, Roth et al. COMPSTAT’02, and BD in COLT’04). Examples of such extensions are Nearest Neighbor and Center-Based clustering.
Reducing the Similarity over Two Sets to Similarity over the Same Set: For a clustering C_1 of S_1 (or C_2 of S_2), use the extension operator to extend C_1 to a clustering C_{1,2} of S_2 (or C_2 to a clustering C_{2,1} of S_1, respectively). Given a similarity measure d for same-set clusterings, define a similarity measure D(C_1, C_2) = ½ (d(C_1, C_{2,1}) + d(C_2, C_{1,2})).
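The reduction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the construction used in the cited papers: the extension operator is taken to be nearest neighbor, the same-set measure d is taken to be the fraction of point pairs on which two clusterings disagree (one minus the Rand index), and all function names are hypothetical.

```python
import numpy as np

def extend_nearest_neighbor(points_src, labels_src, points_dst):
    """Extend a clustering of points_src to points_dst: each destination
    point receives the label of its nearest source point."""
    diffs = points_dst[:, None, :] - points_src[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)          # shape (n_dst, n_src)
    return labels_src[np.argmin(sq_dists, axis=1)]

def pair_disagreement(labels_a, labels_b):
    """Assumed same-set measure d: fraction of point pairs that one
    clustering puts together and the other separates."""
    same_a = labels_a[:, None] == labels_a[None, :]
    same_b = labels_b[:, None] == labels_b[None, :]
    upper = np.triu_indices(len(labels_a), k=1)    # each unordered pair once
    return float(np.mean(same_a[upper] != same_b[upper]))

def cross_sample_distance(s1, c1, s2, c2):
    """D(C1, C2) = 1/2 * (d(C1, C21) + d(C2, C12))."""
    c12 = extend_nearest_neighbor(s1, c1, s2)      # C_{1,2}: a clustering of S_2
    c21 = extend_nearest_neighbor(s2, c2, s1)      # C_{2,1}: a clustering of S_1
    return 0.5 * (pair_disagreement(c1, c21) + pair_disagreement(c2, c12))

# Two small samples from the same two-blob layout; cluster names are swapped
# in the second sample, which a pair-based measure ignores.
s1 = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
c1 = np.array([0, 0, 1, 1])
s2 = np.array([[0.0, 0.1], [5.0, 5.1], [0.2, 0.0]])
c2 = np.array([1, 0, 1])
print(cross_sample_distance(s1, c1, s2, c2))
```

A pair-based d is used here because it does not depend on how clusters are named, so no label matching between the two clusterings is needed. With this choice, D behaves as a dissimilarity: 0 means the two clusterings agree after extension.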
Types of Potential Bounds: 1. Fixed # of Clusters. If the number of clusters, k, is fixed, there is no hope of getting distribution-free stability results. Example 1: The uniform distribution over a circle. Example 2: A square with 4 equal-mass heaps on its corners: bad for k ≠ 4. Example 3: Concentric rings: bad for center-based clustering algorithms.
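One way to see why Example 1 is problematic is sketched below, under assumptions the slide does not state: the cost is the k-means (center-based) cost with k = 2, "circle" is read as the circumference, and equally spaced points stand in for the uniform distribution. Every cut of the circle by a diameter then has the same cost, so the cost alone cannot single out one clustering. The function name is hypothetical.

```python
import numpy as np

def diameter_split_cost(points, theta):
    """Average 2-means cost when points are split by the diameter at angle theta."""
    # Unit normal to the cutting diameter; its sign tells which side a point is on.
    normal = np.array([-np.sin(theta), np.cos(theta)])
    side = points @ normal > 0
    cost = 0.0
    for mask in (side, ~side):
        cluster = points[mask]
        center = cluster.mean(axis=0)      # best center for this half
        cost += np.sum((cluster - center) ** 2)
    return cost / len(points)

n = 360
step = 2 * np.pi / n
# Equally spaced points on the unit circle, offset so that none lies on a cut.
angles = (np.arange(n) + 0.5) * step
points = np.column_stack([np.cos(angles), np.sin(angles)])

# Cuts at multiples of the spacing fall exactly between sample points.
cut_angles = step * np.array([0, 30, 45, 90, 135])
costs = np.array([diameter_split_cost(points, t) for t in cut_angles])
print(costs)  # all entries agree up to floating-point error
```

Because all these cuts are tied, which one an algorithm returns on a finite random sample is decided by sampling noise, which is the kind of instability the example points to.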
What Can We Currently Prove? (Not too much …) Von Luxburg, Bousquet and Belkin (this NIPS) analyze when Spectral Clustering converges to a global clustering of the domain space. Koltchinskii (2002) proved that if the underlying distribution is generated by a certain tree structure of Gaussians, then a clustering algorithm can recover this structure from random samples. BD (COLT 2004) showed distribution-free convergence rates for the limited issue of the clustering loss function.
Fixed # of Clusters: Natural Questions. What is the “intrinsic instability” of a given sample distribution? (Buhmann et al.) Can one characterize (useful) families of probability distributions for which cluster stability holds (i.e., the intrinsic instability is zero)? What levels of intrinsic instability render a clustering meaningless?
Types of Potential Bounds: 2. Let the Algorithm Choose k. Now there may be hope for distribution-free bounds (the algorithm may choose to have just one cluster for a uniform distribution). Major issue: a tradeoff between the stability and the “information content” of a clustering.
Potential Uses of Bounds: To assure that the outcome of a clustering algorithm is meaningful. To help detect changes in the sample-generating distribution (“the two-sample problem”). Model selection: choose the number of clusters that maximizes a stability-based criterion (Lange, Braun, Roth and Buhmann, NIPS’02).