
1 Learning with Similarity Functions. Maria-Florina Balcan & Avrim Blum, CMU, CSD.

2 Kernels and Similarity Functions. Kernels have become a powerful tool in ML: they are useful in practice for dealing with many different kinds of data, and there is an elegant theory about what makes a given kernel good for a given learning problem. Our goal: analyze more general similarity functions. In the process, we describe ways of constructing good data-dependent kernels.

3 Kernels. A kernel K is a pairwise similarity function such that there exists an implicit mapping φ with K(x,y) = φ(x)·φ(y). The point is that many learning algorithms can be written so that they interact with the data only via dot products. If we replace x·y with K(x,y), the algorithm acts implicitly as if the data lived in the higher-dimensional φ-space. If the data is linearly separable by a large margin in φ-space, we do not have to pay for this in terms of data or computation time: with margin γ in φ-space, only about 1/γ² examples are needed to learn well.
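To make the implicit mapping concrete, here is a minimal sketch (not from the slides) checking that the quadratic kernel K(x,y) = (x·y)² on R² equals an explicit dot product φ(x)·φ(y); the function names are illustrative.

```python
import numpy as np

# Minimal sketch (not from the slides): the quadratic kernel on R^2 equals an
# explicit dot product in a 3-dimensional phi-space.
def K(x, y):
    return np.dot(x, y) ** 2

def phi(x):
    # phi(x1, x2) = (x1^2, x2^2, sqrt(2) * x1 * x2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(x, y))                 # 1.0
print(np.dot(phi(x), phi(y)))  # same value (up to floating point), via the explicit mapping
```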

4 General Similarity Functions. Goal: a definition of a good similarity function for a learning problem that (1) talks in terms of natural, direct properties: no implicit high-dimensional spaces and no requirement of positive-semidefiniteness; (2) has implications for learning whenever K satisfies these properties for our given problem; and (3) is broad: it includes the usual notion of a "good kernel" (one that induces a large-margin separator in φ-space).

5 A First Attempt: Definition satisfying properties (1) and (2). Let P be a distribution over labeled examples (x, ℓ(x)). K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε probability mass of x satisfy E_{y~P}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~P}[K(x,y) | ℓ(y)≠ℓ(x)] + γ. Note: such a K might not be a legal kernel. For instance, suppose any two positives have K(x,y) ≥ 0.2 and any two negatives have K(x,y) ≥ 0.2, but for a positive and a negative K(x,y) is uniform random in [-1,1].

6 A First Attempt: Definition satisfying properties (1) and (2). How to use it? K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε probability mass of x satisfy E_{y~P}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~P}[K(x,y) | ℓ(y)≠ℓ(x)] + γ. Algorithm: draw a set S+ of O((1/γ²) ln(1/δ²)) positive examples and a set S− of O((1/γ²) ln(1/δ²)) negative examples, then classify x according to which set gives it the higher average similarity score.
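A minimal Python sketch of this classification rule, assuming K is the similarity function and S_plus, S_minus are the drawn samples (all names are illustrative, not from the slides):

```python
import numpy as np

def classify(x, K, S_plus, S_minus):
    """Label x by whichever sample it is more similar to on average."""
    score_plus = np.mean([K(x, y) for y in S_plus])
    score_minus = np.mean([K(x, z) for z in S_minus])
    return 1 if score_plus >= score_minus else -1
```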

7 A First Attempt: How to use it? K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε probability mass of x satisfy E_{y~P}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~P}[K(x,y) | ℓ(y)≠ℓ(x)] + γ. Algorithm: draw S+ of O((1/γ²) ln(1/δ²)) positive examples and S− of O((1/γ²) ln(1/δ²)) negative examples, and classify x according to which set gives the higher average score. Guarantee: with probability ≥ 1−δ, the error is ≤ ε + δ. Proof: by Hoeffding, for any given "good" x the probability of error on x (over the draw of S+ and S−) is at most δ². By Markov, there is at most a δ chance that the error rate over the good points exceeds δ. So the overall error rate is ≤ ε + δ.
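A hedged reconstruction of the Hoeffding calculation behind the δ² bound (the constant 8 and the union bound over the two samples are my own bookkeeping, not stated on the slide):

```latex
% Sketch: K(x,y) lies in [-1,1], so for a fixed "good" x each empirical average
% over d draws deviates from its conditional mean by gamma/2 or more with
% probability at most 2 exp(-d gamma^2 / 8). A union bound over S^+ and S^- gives
\Pr\big[\,\widehat{E}_{S^+}[K(x,y)] \le \widehat{E}_{S^-}[K(x,y)]\,\big]
    \;\le\; 4\, e^{-d\gamma^{2}/8} \;\le\; \delta^{2},
% which holds once d = O((1/gamma^2) ln(1/delta^2)), the sample size used above.
```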

8 A First Attempt: Not Broad Enough. K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε probability mass of x satisfy E_{y~P}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~P}[K(x,y) | ℓ(y)≠ℓ(x)] + γ. But K(x,y) = x·y can have a good (large-margin) separator and still fail to satisfy this definition. (Figure: some positives are more similar to the negatives than to a typical positive.)

9 A First Attempt: Not Broad Enough. K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if at least a 1−ε probability mass of x satisfy E_{y~P}[K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~P}[K(x,y) | ℓ(y)≠ℓ(x)] + γ. Idea: the definition would work if we did not pick y's from the top-left region. Broaden it to say: K is OK if there exists a large region R such that most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label.

10 Broader/Main Definition. K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1−ε probability mass of x satisfy E_{y~P}[w(y)K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~P}[w(y)K(x,y) | ℓ(y)≠ℓ(x)] + γ.

11 Main Definition, How to Use It. K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1−ε probability mass of x satisfy E_{y~P}[w(y)K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~P}[w(y)K(x,y) | ℓ(y)≠ℓ(x)] + γ. Algorithm: draw S+ = {y_1, ..., y_d} and S− = {z_1, ..., z_d} with d = O((1/γ²) ln(1/δ²)); use them to "triangulate" the data via F(x) = [K(x,y_1), ..., K(x,y_d), K(x,z_1), ..., K(x,z_d)]; then take a new set of labeled examples, project it into this space, and run your favorite algorithm for learning linear separators. The point is: with probability ≥ 1−δ, there exists a linear separator of error ≤ ε + δ at margin γ/4, namely w = [w(y_1), ..., w(y_d), −w(z_1), ..., −w(z_d)].
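A sketch of the triangulation step, assuming a similarity function K and a list landmarks = S_plus + S_minus; the helper names and the choice of LinearSVC as the "favorite algorithm" are illustrative assumptions, not prescribed by the slides:

```python
import numpy as np
from sklearn.svm import LinearSVC  # any linear-separator learner would do

def to_features(X, K, landmarks):
    """Map each example x to F(x) = [K(x, y_1), ..., K(x, y_d), K(x, z_1), ..., K(x, z_d)]."""
    return np.array([[K(x, y) for y in landmarks] for x in X])

# Usage sketch: project a fresh labeled set into the new space, then learn linearly.
# F_train = to_features(X_train, K, landmarks)
# clf = LinearSVC().fit(F_train, y_train)
# predictions = clf.predict(to_features(X_test, K, landmarks))
```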

12 Main Definition, Implications. Algorithm: draw S+ = {y_1, ..., y_d} and S− = {z_1, ..., z_d} with d = O((1/γ²) ln(1/δ²)), and use them to "triangulate" the data via F(x) = [K(x,y_1), ..., K(x,y_d), K(x,z_1), ..., K(x,z_d)]. Guarantee: with probability ≥ 1−δ, there exists a linear separator of error ≤ ε + δ at margin γ/4. Implication: any (ε,γ)-good similarity function, whether a legal kernel K or an arbitrary similarity function, is converted by this mapping into an (ε + δ, γ/4)-good kernel function.

13 Good Kernels are Good Similarity Functions. Main Definition: K: (x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] such that at least a 1−ε probability mass of x satisfy E_{y~P}[w(y)K(x,y) | ℓ(y)=ℓ(x)] ≥ E_{y~P}[w(y)K(x,y) | ℓ(y)≠ℓ(x)] + γ. Theorem: an (ε,γ)-good kernel is an (ε',γ')-good similarity function under the main definition. Our current proofs incur some penalty: ε' = ε + ε_extra and γ' = γ³·ε_extra.

14 Good Kernels are Good Similarity Functions. Theorem: an (ε,γ)-good kernel is an (ε',γ')-good similarity function under the main definition, where ε' = ε + ε_extra and γ' = γ³·ε_extra. Proof sketch: suppose K is a good kernel in the usual sense. Standard margin bounds then imply that if S is a random sample of size Õ(1/(εγ²)), then whp we can give weights w_S(y) to all examples y ∈ S so that the weighted sum of these examples defines a good LTF. But we want weights that are sample-independent and bounded. Boundedness is not too hard (imagine a margin-perceptron run over just the good y); sample-independence follows from an averaging argument.

15 Learning with Multiple Similarity Functions. Let K_1, ..., K_r be similarity functions such that some (unknown) convex combination of them is (ε,γ)-good. Algorithm: draw S+ = {y_1, ..., y_d} and S− = {z_1, ..., z_d} with d = O((1/γ²) ln(1/δ²)), and use them to "triangulate" the data via F(x) = [K_1(x,y_1), ..., K_r(x,y_d), K_1(x,z_1), ..., K_r(x,z_d)]. Guarantee: the induced distribution F(P) in R^{2dr} has a separator of error ≤ ε + δ at margin at least γ/(4√r). Sample complexity is roughly Õ(r/γ²).
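The same triangulation extends to several similarity functions by concatenating their values on every landmark; a sketch, with similarities (a list holding K_1, ..., K_r) and landmarks as assumed inputs, both names illustrative:

```python
import numpy as np

def to_features_multi(X, similarities, landmarks):
    """Map x to the 2dr-dimensional vector of K_i(x, y_j) over all functions and landmarks."""
    return np.array([[K(x, y) for y in landmarks for K in similarities]
                     for x in X])
```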

16 Implications & Conclusions. We develop a theory that provides a formal way of understanding kernels as similarity functions. Our algorithms work for similarity functions that are not necessarily PSD (or even symmetric). Open problems: obtain better results for learning with multiple similarity functions, extend [SB'06], and improve the existing bounds.

