
1 On a Theory of Similarity Functions for Learning and Clustering. Avrim Blum, Carnegie Mellon University. This talk is based on joint work with Nina Balcan, Nati Srebro, and Santosh Vempala. Theory and Practice of Computational Learning, 2009.

2 2-minute version. Suppose we are given a set of images and want to learn a rule to distinguish men from women. Problem: the pixel representation is not very good. A powerful technique for such settings is to use a kernel: a special kind of pairwise similarity function K(x,y). But the theory is stated in terms of implicit mappings. Q: Can we develop a theory that just views K as a measure of similarity, i.e., a more general and intuitive theory of when K is useful for learning?

3 2-minute version (continued). Q: What if we only have unlabeled data (i.e., clustering)? Can we develop a theory of properties that are sufficient to be able to cluster well?

4 2-minute version (continued). Goal: develop a kind of PAC model for clustering.

5 Part 1: On similarity functions for learning

6 Theme of this part: a theory of natural sufficient conditions for similarity functions to be useful for classification learning problems. The conditions do not require positive semi-definiteness or implicit spaces, yet they include the notion of a large-margin kernel. At a formal level they even allow you to learn more: one can define classes of functions that have no large-margin kernel (even allowing substantial hinge loss) but that do have a good similarity function under this notion.

7 Kernels. We have a lot of great algorithms for learning linear separators (perceptron, SVM, ...). But much of the time the data is not linearly separable. "Old" answer: use a multi-layer neural network. "New" answer: use a kernel function! Many algorithms interact with the data only via dot-products, so we can simply re-define the dot-product. E.g., K(x,y) = (1 + x·y)^d satisfies K(x,y) = φ(x)·φ(y), where φ is an implicit mapping into an n^d-dimensional space. The algorithm acts as if the data were in "φ-space", which allows it to produce a non-linear decision curve in the original space.

8 Kernels. A kernel K is a legal definition of a dot-product: there exists an implicit mapping φ such that K(x,y) = φ(x)·φ(y). E.g., K(x,y) = (x·y + 1)^d maps the n-dimensional space into an n^d-dimensional space. Why kernels are so useful: many algorithms interact with data only via dot-products, so if we replace x·y with K(x,y), they act implicitly as if the data were in the higher-dimensional φ-space.

9 Example. For n=2, d=2, the kernel K(x,y) = (x·y)^d maps the original space (coordinates x1, x2) into the φ-space (coordinates z1, z2, z3); a point set separable only by a circle in the original space becomes linearly separable in φ-space. [figure: scatter plots of the original space and the φ-space]
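
To make the implicit mapping concrete, here is a small numerical check (my own sketch, not from the talk) that for n=2, d=2 the kernel K(x,y) = (x·y)² equals the dot-product under the explicit map φ(x) = (x1², √2·x1·x2, x2²):

    import numpy as np

    def K(x, y):
        # degree-2 polynomial kernel (no constant term), matching slide 9
        return np.dot(x, y) ** 2

    def phi(x):
        # explicit feature map into 3 dimensions for the n=2, d=2 case
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    rng = np.random.default_rng(0)
    for _ in range(5):
        x, y = rng.normal(size=2), rng.normal(size=2)
        assert np.isclose(K(x, y), np.dot(phi(x), phi(y)))
    print("K(x,y) == phi(x).phi(y) on random points")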

10 Moreover, kernels generalize well if there is a good margin. If the data is linearly separable by a large margin in φ-space (with |φ(x)| ≤ 1), then the sample complexity is good: a margin of γ in φ-space requires a sample of size only Õ(1/γ²) to get confidence in generalization, with no dependence on the dimension. Kernels are useful in practice for dealing with many, many different kinds of data. [figure: a large-margin separator in φ-space]

11 Limitations of the current theory. The existing theory is in terms of margins in implicit spaces, but in practice kernels are constructed by viewing them as measures of similarity. The kernel requirement rules out many natural similarity functions and is not the best for intuition. Is there an alternative, perhaps more general, theoretical explanation?

12 Goal: a notion of a good similarity function that 1) is in terms of natural, direct quantities (no implicit high-dimensional spaces, no requirement that K(x,y) = φ(x)·φ(y)) and under which K can be used to learn well; 2) is broad: includes the usual notion of a good kernel (one with a large-margin separator in φ-space); 3) even formally allows you to do more. [Balcan-Blum, ICML 2006] [Balcan-Blum-Srebro, MLJ 2008] [Balcan-Blum-Srebro, COLT 2008]

13 A first attempt. Let P be a distribution over labeled examples (x, l(x)); the goal is to output a classification rule that is good for P. Definition: K is (ε,γ)-good for P if at least a 1−ε probability mass of x satisfy E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ, i.e., the average similarity to points of the same label exceeds the average similarity to points of the opposite label by a gap of at least γ. In words: K is good if most x are on average more similar to points y of their own type than to points y of the other type.

14 A first attempt (continued). K is (ε,γ)-good for P if at least a 1−ε probability mass of x satisfy E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ. E.g., most images of men are on average γ-more similar to random images of men than to random images of women, and vice-versa.

15 A first attempt: algorithm. Draw sets S+ and S− of positive and negative examples. Classify x based on its average similarity to S+ versus to S−. (Recall: K is (ε,γ)-good for P if at least a 1−ε probability mass of x satisfy E_{y~P}[K(x,y) | l(y)=l(x)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ.) [figure: a test point x compared against the sets S+ and S−]

16 A first attempt: theorem. Algorithm: draw sets S+ and S− of positive and negative examples; classify x based on its average similarity to S+ versus to S−. Theorem: if K is (ε,γ)-good for P and |S+| and |S−| are Ω((1/γ²) ln(1/δ')), then with probability ≥ 1−δ, the error of this rule is ≤ ε + δ'.
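
A minimal sketch of this classification rule (my own illustration, not code from the talk), assuming K is any pairwise similarity given as a Python callable and S_plus/S_minus are the drawn positive and negative sets; the cosine similarity and Gaussian data below are toy placeholders:

    import numpy as np

    def classify(x, S_plus, S_minus, K):
        """Label x by comparing its average similarity to the positive
        and negative sets, as in the 'first attempt' algorithm."""
        avg_pos = np.mean([K(x, y) for y in S_plus])
        avg_neg = np.mean([K(x, y) for y in S_minus])
        return +1 if avg_pos >= avg_neg else -1

    def cosine(x, y):
        # a simple (not necessarily ideal) similarity for the demo
        return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

    rng = np.random.default_rng(1)
    S_plus  = [rng.normal(loc=+1.0, size=5) for _ in range(50)]
    S_minus = [rng.normal(loc=-1.0, size=5) for _ in range(50)]
    print(classify(rng.normal(loc=+1.0, size=5), S_plus, S_minus, cosine))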

17 A first attempt: not broad enough. Consider data that has a large-margin separator (e.g., the 30° configuration in the figure), yet the similarity function K(x,y) = x·y does not satisfy our definition: a positive example can be more similar to the negatives than to a typical positive, e.g., an average similarity of ½ to the other class versus ½·1 + ½·(−½) = ¼ to its own class. [figure: a large-margin separator at 30°]

18 A first attempt: not broad enough (continued). Broaden the definition: require only that there exists a non-negligible set R such that most x are on average more similar to points y ∈ R of the same label than to points y ∈ R of the other label, even if we do not know R in advance. [figure: the same 30° example with a highlighted region R]

19 Broader definition. K is (ε,γ,τ)-good if there exists a set R of "reasonable" points y (possibly probabilistic) such that a 1−ε fraction of x satisfy E_{y~P}[K(x,y) | l(y)=l(x), R(y)] ≥ E_{y~P}[K(x,y) | l(y)≠l(x), R(y)] + γ (technically, at hinge loss γ), and at least a τ probability mass of both the positives and the negatives is reasonable. Algorithm: draw a set S = {y_1, ..., y_d} of landmarks and re-represent the data via the map x → F(x) = [K(x,y_1), ..., K(x,y_d)] ∈ R^d. If there are enough landmarks (d = Õ(1/(γ²τ))), then with high probability the induced distribution F(P) has a good large-L1-margin linear separator, e.g., w = [0, 0, 1/n_+, 1/n_+, 0, 0, 0, −1/n_−, 0, 0], which puts weight on the reasonable landmarks.

20 Broader definition: algorithm. (K is (ε,γ,τ)-good as defined on the previous slide.) Draw a set S = {y_1, ..., y_d} of unlabeled landmarks and re-represent the data via x → F(x) = [K(x,y_1), ..., K(x,y_d)] ∈ R^d. Then take a new set of labeled examples, project them to this space, and run a good L1 linear-separator algorithm (e.g., Winnow). The number of unlabeled landmarks needed is d_u = Õ(1/(γ²τ)), and the number of labeled examples is d_l = O((1/(γ²ε_acc²)) ln d_u). [figure: the induced distribution F(P) in R^d]
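
A rough sketch of the landmark construction (my own illustration). The slides call for Winnow or another L1-margin learner; here I substitute scikit-learn's L1-regularized logistic regression as a stand-in, and the similarity K, the data, and the labels are toy placeholders:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def similarity_map(X, landmarks, K):
        """Re-represent each point x as F(x) = [K(x, y_1), ..., K(x, y_d)]."""
        return np.array([[K(x, y) for y in landmarks] for x in X])

    def K(x, y):
        # any pairwise similarity works here; it need not be PSD
        return float(np.tanh(np.dot(x, y)))

    rng = np.random.default_rng(0)
    X_unlab = rng.normal(size=(40, 10))        # unlabeled draws become landmarks
    X_lab   = rng.normal(size=(200, 10))
    y_lab   = (X_lab[:, 0] > 0).astype(int)    # toy labels

    F_lab = similarity_map(X_lab, X_unlab, K)
    clf = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
    clf.fit(F_lab, y_lab)
    print("training accuracy:", clf.score(F_lab, y_lab))

Any learner with an L1-margin guarantee could replace the logistic-regression stand-in without changing the construction.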

21 Kernels and similarity functions. Theorem: if K is a good kernel, then K is also a good similarity function (but the margin γ gets squared). Roughly, if K has margin γ in the implicit space, then for any ε, K is (ε, γ², ε)-good in our sense. [diagram: large-margin kernels sit inside good similarities]

22 Kernels and similarity functions: a separation. One can also show a separation: there exists a class C and a distribution D such that there is a similarity function with large margin γ for all f in C, but no large-margin kernel function exists. (As before: a good kernel is also a good similarity function, with γ squared.)

23 Kernels and similarity functions: the separation. Theorem: for any class C of pairwise uncorrelated functions, there exists a similarity function good for all f in C, but no such good kernel function exists. In principle, one should be able to learn from O((1/ε) log(|C|/δ)) labeled examples. Claim 1: one can define a generic (0, 1, 1/|C|)-good similarity function achieving this bound (assuming D is not too concentrated). Claim 2: there is no (ε,γ)-good kernel in hinge loss, even for ε = 1/2 and γ = 1/|C|^{1/2}; so the margin-based sample complexity is d = Ω(|C|).

24 Generic similarity function. Partition X into regions R_1, ..., R_{|C|} with P(R_i) > 1/poly(|C|); region R_i will serve as the "reasonable" set R for target f_i. For y in R_i, define K(x,y) = f_i(x) f_i(y). Then for any target f_i in C and any x, E_y[l(x) l(y) K(x,y) | y in R_i] = E[l(x)² l(y)²] = 1, so K is (0, 1, 1/poly(|C|))-good (zero hinge loss, margin 1, Pr(R_i) ≥ 1/poly(|C|)). This gives the bound O((1/ε) log(|C|)). [diagram: partition of X into R_1, R_2, R_3, ..., R_{|C|}]
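
A toy instantiation of this construction (my own sketch): I take C to be parity functions over {0,1}^4, which are pairwise uncorrelated under the uniform distribution, partition the domain into |C| regions, and check that with respect to a target f_i the points y in R_i give margin exactly 1:

    import itertools
    import numpy as np

    n = 4
    X = list(itertools.product([0, 1], repeat=n))          # finite domain
    # pairwise-uncorrelated targets: parities over nonempty index subsets
    subsets = [s for r in range(1, n + 1) for s in itertools.combinations(range(n), r)]
    C = [lambda x, s=s: 1 - 2 * (sum(x[i] for i in s) % 2) for s in subsets]

    # partition X into |C| regions of (roughly) equal probability mass
    regions = np.array_split(np.arange(len(X)), len(C))

    def K(xi, yi):
        """Similarity: if y falls in region R_i, return f_i(x) * f_i(y)."""
        i = next(j for j, r in enumerate(regions) if yi in r)
        return C[i](X[xi]) * C[i](X[yi])

    # check: w.r.t. target f_i, every y in R_i gives l(x) l(y) K(x,y) = 1
    i = 3
    f = C[i]
    vals = [f(X[xi]) * f(X[yi]) * K(xi, yi) for xi in range(len(X)) for yi in regions[i]]
    print(min(vals), max(vals))   # both 1: margin 1 at zero hinge loss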

25 Similarity functions for classification: algorithmic implications. This gives justification to the following rule: use non-PSD similarities directly via the empirical similarity map; there is no need to "transform" them into PSD functions and plug them into an SVM (e.g., Liao and Noble, Journal of Computational Biology). The theory shows that anything learnable with an SVM is also learnable this way.

26 Learning with multiple similarity functions. Let K_1, ..., K_r be similarity functions such that some (unknown) convex combination of them is (ε,γ)-good. Algorithm: draw a set S = {y_1, ..., y_d} of landmarks and concatenate features, F(x) = [K_1(x,y_1), ..., K_r(x,y_1), ..., K_1(x,y_d), ..., K_r(x,y_d)]; then run the same L1 optimization algorithm as before in this new feature space.

27 Learning with multiple similarity functions (continued). Let K_1, ..., K_r be similarity functions such that some (unknown) convex combination of them is (ε,γ)-good. Algorithm: draw a set S = {y_1, ..., y_d} of landmarks and concatenate features, F(x) = [K_1(x,y_1), ..., K_r(x,y_1), ..., K_1(x,y_d), ..., K_r(x,y_d)]. Guarantee: with high probability the induced distribution F(P) in R^{2dr} has a separator of error ≤ ε + δ at L1 margin at least γ/4 (see the proof sketch on the next slide). The sample complexity only increases by a log(r) factor!

28 Learning with multiple similarity functions: proof idea. Guarantee: with high probability the induced distribution F(P) in R^{2dr} has a separator of error ≤ ε + δ at L1 margin at least γ/4. Proof: imagine the mapping F_o(x) = [K_o(x,y_1), ..., K_o(x,y_d)] for the good similarity function K_o = α_1 K_1 + ... + α_r K_r. Consider w_o = (w_1, ..., w_d) of L1 norm 1 and margin γ/4. The vector w = (α_1 w_1, α_2 w_1, ..., α_r w_1, ..., α_1 w_d, α_2 w_d, ..., α_r w_d) also has L1 norm 1 and satisfies w·F(x) = w_o·F_o(x).
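
A minimal sketch of the concatenation step (my own illustration; the three similarity functions below are arbitrary placeholders, and none of them needs to be PSD). The grouping matches the slide: all r similarities evaluated at y_1, then at y_2, and so on:

    import numpy as np

    def multi_similarity_map(X, landmarks, Ks):
        """F(x) = [K_1(x,y_1), ..., K_r(x,y_1), ..., K_1(x,y_d), ..., K_r(x,y_d)]."""
        return np.array([[K(x, y) for y in landmarks for K in Ks] for x in X])

    Ks = [lambda x, y: float(np.dot(x, y)),
          lambda x, y: float(np.tanh(np.dot(x, y))),
          lambda x, y: float(-np.linalg.norm(x - y))]

    rng = np.random.default_rng(0)
    landmarks = rng.normal(size=(30, 8))
    X = rng.normal(size=(100, 8))
    F = multi_similarity_map(X, landmarks, Ks)
    print(F.shape)   # (100, 30 * 3): d landmarks times r similarity functions

The resulting representation is fed to the same L1-margin learner as in the single-similarity case.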

29 Because the property is defined in terms of L1, there is no change in margin! There is only a log(r) penalty for concatenating feature spaces; with L2 the margin would drop by a factor of r^{1/2}, giving an O(r) penalty in sample complexity. The algorithm is also very simple (just concatenate). An alternative algorithm is joint optimization: solve for K_o = α_1 K_1 + ... + α_r K_r and a vector w_o such that w_o has a good L1 margin in the space defined by F_o(x) = [K_o(x,y_1), ..., K_o(x,y_d)]. The bound also holds there, since the capacity is only lower, but we don't know how to do this efficiently.

30 Also, since any large-margin kernel is also a good similarity function, the log(r) penalty applies to the "concatenate and optimize L1 margin" algorithm for kernels as well. But γ is potentially squared in the translation, and an extra ε is added to the hinge loss at a 1/ε cost in unlabeled data. Nonetheless, if r is large, this can be a good tradeoff!

31 Open questions (part I). Can we deal (efficiently?) with a general convex class K of similarity functions, not just K = {α_1 K_1 + ... + α_r K_r : α_i ≥ 0, α_1 + ... + α_r = 1}? Can we efficiently implement direct joint optimization for the convex-combination case? Alternatively, can we use the concatenation algorithm to extract a good convex combination K_o? These are two quite different algorithm styles; is there anything in between? Can this approach be used for transfer learning?

32 Part 2: Can we use this angle to help think about clustering?

33 Clustering comes up in many places. Given a set of documents or search results, cluster them by topic. Given a collection of protein sequences, cluster them by function. Given a set of images of people, cluster them by who is in them. And so on.

34 We can model clustering like this: given a data set S of n objects (e.g., news articles), there is some (unknown) "ground truth" clustering C_1*, C_2*, ..., C_k* (e.g., sports, politics, ...). Goal: produce a hypothesis clustering C_1, C_2, ..., C_k that matches the target as much as possible (minimize the number of mistakes, up to renumbering of indices). Problem: no labeled data! But we do have a measure of similarity.

35 (Same model as the previous slide.) The question for this part: what conditions on a similarity measure would be enough to allow one to cluster well?

36 Contrast with the more standard approaches to clustering analysis: view the similarity/distance information as "ground truth" and analyze the ability of algorithms to achieve different optimization criteria (min-sum, k-means, k-median, ...), or assume a generative model, such as a mixture of Gaussians. Here there are no generative assumptions. Instead: given the data, how powerful a K do we need in order to cluster it well?

37 Here is a condition that trivially works. Suppose K has the property that K(x,y) > 0 for all x,y such that C*(x) = C*(y), and K(x,y) < 0 for all x,y such that C*(x) ≠ C*(y). If we have such a K, then clustering is easy. Now, let's try to make this condition a little weaker.
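
Under this trivial condition, clustering reduces to taking connected components of the graph that has an edge wherever K > 0. A small self-contained sketch (my own illustration):

    import numpy as np

    def cluster_by_sign(S):
        """Given a similarity matrix S with S[i,j] > 0 iff i and j share a
        ground-truth cluster, return cluster labels via connected components."""
        n = len(S)
        labels, next_label = [-1] * n, 0
        for i in range(n):
            if labels[i] != -1:
                continue
            stack, labels[i] = [i], next_label
            while stack:                       # DFS over positive edges
                u = stack.pop()
                for v in range(n):
                    if S[u][v] > 0 and labels[v] == -1:
                        labels[v] = next_label
                        stack.append(v)
            next_label += 1
        return labels

    S = np.array([[ 1,  2, -1, -3],
                  [ 2,  1, -2, -1],
                  [-1, -2,  1,  4],
                  [-3, -1,  4,  1]])
    print(cluster_by_sign(S))   # [0, 0, 1, 1]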

38 Suppose K has the property that all x are more similar to all points y in their own cluster than to any y' in other clusters. This is still a very strong condition, and there is a problem: the same K can satisfy it for two very different clusterings of the same data! [figure: documents about baseball, basketball, math, and physics, grouped one way]

39 (The same K, shown satisfying the property for a different clustering of the same baseball/basketball/math/physics data, e.g., sports versus science instead of four separate topics.)

40 Let's weaken our goals a bit. It is OK to produce a hierarchical clustering (a tree) such that the target clustering is approximately some pruning of it. E.g., in the case from the last slide, the tree has "all documents" at the root, splitting into "sports" (baseball, basketball) and "science" (math, physics). One can view this as saying "if any of these clusters is too broad, just click and I will split it for you." Alternatively, it is OK to output a small number of clusterings such that at least one has low error (like list-decoding), but we won't talk about that today.

41 Then you can start getting somewhere. 1. The property "all x are more similar to all y in their own cluster than to any y' from any other cluster" is sufficient to get a hierarchical clustering such that the target is some pruning of the tree (Kruskal's algorithm / single-linkage works; see the sketch below).
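
A minimal Kruskal-style single-linkage sketch (my own illustration): process pairs in order of decreasing similarity and union the clusters they join; the sequence of unions defines the tree whose prunings include the target under property 1:

    import numpy as np

    def kruskal_single_linkage(S):
        """Process pairs in order of decreasing similarity, union-find style.
        Each recorded edge joins two current clusters, building the tree."""
        n = len(S)
        parent = list(range(n))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        pairs = sorted(((S[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
                       reverse=True)
        merges = []
        for s, i, j in pairs:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri
                merges.append((i, j, s))
        return merges

    S = np.array([[1.0, 0.9, 0.2, 0.1],
                  [0.9, 1.0, 0.3, 0.2],
                  [0.2, 0.3, 1.0, 0.8],
                  [0.1, 0.2, 0.8, 1.0]])
    print(kruskal_single_linkage(S))   # merges (0,1), then (2,3), then across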

42 Then you can start getting somewhere (continued). 2. A weaker condition: the ground truth is "stable". For all clusters C, C' and all A ⊆ C, A' ⊆ C': A and A' are not both more similar on average to each other than to the rest of their own clusters (view K(x,y) as an attraction between x and y; plus technical conditions at the boundary). This is sufficient to get a good tree using the average single-linkage algorithm.

43 Analysis for a slightly simpler version. Assume that for all clusters C, C' and all A ⊂ C, A' ⊆ C', we have K(A, C−A) > K(A, A'), where K(A, B) denotes the average similarity Avg_{x∈A, y∈B}[K(x,y)], and say K is symmetric. Algorithm: average single-linkage, i.e., like Kruskal's, but at each step merge the pair of clusters whose average similarity is highest. Analysis: all clusters made are laminar with respect to the target; a failure occurs iff we merge some C_1, C_2 such that C_1 ⊂ C and C_2 ∩ C = ∅.

44 Analysis (continued). A failure occurs iff we merge C_1, C_2 with C_1 ⊂ C and C_2 ∩ C = ∅. But by the assumption K(C_1, C−C_1) > K(C_1, C_2), there must exist a current cluster C_3 ⊂ C that is at least as similar to C_1 as this average, so the algorithm would have preferred merging C_1 with C_3. Contradiction. [figure: C_1 and C_3 inside C, C_2 outside]
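
A compact sketch of average single-linkage (my own illustration): repeatedly merge the pair of current clusters with the highest average pairwise similarity, as in the algorithm analyzed on slides 43-44:

    import numpy as np

    def average_single_linkage(S):
        """Repeatedly merge the two clusters with highest average similarity;
        the recorded merges form the hierarchy."""
        clusters = {i: [i] for i in range(len(S))}
        merges = []
        while len(clusters) > 1:
            def avg(a, b):
                return np.mean([S[i][j] for i in clusters[a] for j in clusters[b]])
            a, b = max(((a, b) for a in clusters for b in clusters if a < b),
                       key=lambda ab: avg(*ab))
            clusters[a] += clusters.pop(b)
            merges.append(sorted(clusters[a]))
        return merges

    S = np.array([[1.0, 0.9, 0.2, 0.1],
                  [0.9, 1.0, 0.3, 0.2],
                  [0.2, 0.3, 1.0, 0.8],
                  [0.1, 0.2, 0.8, 1.0]])
    print(average_single_linkage(S))   # [[0, 1], [2, 3], [0, 1, 2, 3]]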

45 More sufficient properties. 3. The property "all x more similar to all y in their own cluster than to any y' from any other cluster", but with noisy data added. Noisy data can ruin bottom-up algorithms, but one can show that a generate-and-test style algorithm works: create a collection of plausible clusters, then use a series of pairwise tests to remove or shrink clusters until they are consistent with a tree.

46 More sufficient properties (continued). 4. The implicit assumption made by the optimization approach: "any approximately-optimal (e.g., k-median) solution is close, in terms of how points are clustered, to the target." [Nina Balcan's talk on Saturday]

47 Can also analyze the inductive setting. Assume that for all C, C' and all A ⊂ C, A' ⊆ C', we have K(A, C−A) > K(A, A') + γ, but we only see a small sample S. One can use "regularity"-type results of [AFKK] to argue that with high probability a reasonable-size S will give good estimates of all the desired quantities. Once S is hierarchically partitioned, new points can be inserted as they arrive.

48 Like a PAC model for clustering. A property is a relation between the target and the similarity information (the data), like a data-dependent concept class in learning. Given data and a similarity function K, a property induces a "concept class" C of all clusterings c such that (c, K) is consistent with the property. Tree model: we want a tree T such that the set of prunings of T forms an ε-cover of C. In the inductive model, we want this with probability 1−δ.

49 Summary (part II). We explored the question: what does an algorithm need in order to cluster well, i.e., what natural properties allow a similarity measure to be useful for clustering? To get a good theory it helps to relax what we mean by "useful for clustering"; the user can then decide how specific to be in each part of the domain. We analyzed a number of natural properties and proved guarantees on algorithms able to use them.

50 Wrap-up. A tour through learning and clustering by similarity functions: a user with some knowledge of the problem domain comes up with a pairwise similarity measure K(x,y) that makes sense for the given problem, and the algorithm uses it (together with labeled data, in the case of learning) to find a good solution. Goals of a theory: give guidance to the similarity-function designer (what properties to shoot for?) and understand what properties are sufficient for learning or clustering, and by what algorithms. For learning, this gives a theory of kernels without the need for "implicit spaces". For clustering, it "reverses" the usual view and suggests giving the algorithm some slack (a tree versus a single partitioning). A lot of interesting questions remain open in these areas.

