1 On Kernels, Margins, and Low-dimensional Mappings
(or: Kernels versus features)
Nina Balcan (CMU), Avrim Blum (CMU), Santosh Vempala (MIT)

2 Generic problem
- Given a set of images, want to learn a linear separator to distinguish men from women.
- Problem: pixel representation no good.
Old style advice:
- Pick a better set of features!
- But seems ad hoc. Not scientific.
New style advice:
- Use a kernel! K(x,y) = φ(x)·φ(y). φ is an implicit, high-dimensional mapping.
- Sounds more scientific. Many algorithms can be "kernelized". Use the "magic" of the implicit high-dimensional space. Don't pay for it if there exists a large-margin separator.

3 Generic problem
Old style advice:
- Pick a better set of features!
- But seems ad hoc. Not scientific.
New style advice:
- Use a kernel! K(x,y) = φ(x)·φ(y). φ is an implicit, high-dimensional mapping.
- Sounds more scientific. Many algorithms can be "kernelized". Use the "magic" of the implicit high-dimensional space. Don't pay for it if there exists a large-margin separator.
- E.g., K(x,y) = (x·y + 1)^m. φ: (n-dimensional space) → (n^m-dimensional space).
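As a quick sanity check on that example (our own snippet, not from the slides: the helper names and the choice m = 2 are ours), the kernel value (x·y + 1)^2 matches the dot product under an explicit feature map with roughly n^2 coordinates:

```python
import numpy as np

def poly_kernel(x, y, m=2):
    # K(x, y) = (x . y + 1)^m, computed without building the n^m-dimensional space.
    return (np.dot(x, y) + 1) ** m

def explicit_phi_deg2(x):
    # Explicit map for m = 2: all products x_i * x_j, the scaled coordinates
    # sqrt(2) * x_i, and a constant 1 -- roughly n^2 features.
    return np.concatenate([np.outer(x, x).ravel(), np.sqrt(2) * x, [1.0]])

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)
print(poly_kernel(x, y))                                    # kernel trick
print(np.dot(explicit_phi_deg2(x), explicit_phi_deg2(y)))   # same value via phi
```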

4 Main point of this work
Can view the new method as a way of conducting the old method.
- Given a kernel [as a black-box program K(x,y)] and access to typical inputs [samples from D],
- Claim: Can run K and reverse-engineer an explicit (small) set of features, such that if K is good [∃ a large-margin separator in φ-space for D,c], then this is a good feature set [∃ an almost-as-good separator].
"You give me a kernel, I give you a set of features."

5 Main point of this work
Can view the new method as a way of conducting the old method.
- Given a kernel [as a black-box program K(x,y)] and access to typical inputs [samples from D],
- Claim: Can run K and reverse-engineer an explicit (small) set of features, such that if K is good [∃ a large-margin separator in φ-space for D,c], then this is a good feature set [∃ an almost-as-good separator].
- E.g., sample z_1,...,z_d from D. Given x, define x_i = K(x,z_i).
Implications:
- Practical: an alternative to kernelizing the algorithm.
- Conceptual: View the kernel as a (principled) way of doing feature generation. View it as a similarity function, rather than as the "magic power of an implicit high-dimensional space".

6 Basic setup, definitions
- Instance space X.
- Distribution D, target c. Use P = (D,c).
- K(x,y) = φ(x)·φ(y).
- P is separable with margin γ in φ-space if ∃ w s.t. Pr_{(x,ℓ)∈P}[ℓ(w·φ(x)) < γ] = 0 (normalizing |w|=1, |φ(x)|=1).
- Error ε at margin γ: replace "0" with "ε".
Goal is to use K to get a mapping to a low-dimensional space.
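A minimal empirical version of that definition (hypothetical helper name; it assumes the rows of Phi are unit-length φ(x) vectors, labels are ±1, and |w| = 1, matching the slide's normalization):

```python
import numpy as np

def error_at_margin(Phi, labels, w, gamma):
    # Empirical estimate of Pr over the sample of: l * (w . phi(x)) < gamma.
    # "Separable with margin gamma" asks this to be 0; "error eps at margin gamma"
    # asks it to be at most eps.
    margins = labels * (Phi @ w)
    return float(np.mean(margins < gamma))
```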

7 Idea: Johnson-Lindenstrauss lemma
- If P is separable with margin γ in φ-space, then with probability 1-δ, a random linear projection down to a space of dimension d = O((1/γ²) log[1/(εδ)]) will have a linear separator of error < ε. [AV]
- If the projection vectors are r_1, r_2,...,r_d, then can view them as features: x_i = φ(x)·r_i.
- Problem: this uses φ. Can we do it directly, using K as a black-box, without computing φ?
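A sketch of that projection-as-features step when φ is available explicitly (the dimensions, constants, and names here are illustrative placeholders, not the theorem's):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 50        # N: dimension of the explicit phi-space; d: target dimension
R = rng.standard_normal((d, N)) / np.sqrt(d)   # rows are random directions r_1,...,r_d

def jl_features(phi_x):
    # x_i = phi(x) . r_i -- note this needs phi(x) itself,
    # which is exactly the dependence the next slides remove.
    return R @ phi_x
```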

8 3 methods (from simplest to best)
1. Draw d examples z_1,...,z_d from D. Use: F(x) = (K(x,z_1),...,K(x,z_d)). [So, "x_i" = K(x,z_i).] For d = (8/ε)[1/γ² + ln 1/δ], if P was separable with margin γ in φ-space, then whp this will be separable with error ε. (But this method doesn't preserve the margin; see the sketch after this list.)
2. Same d, but a little more complicated. Separable with error ε at margin γ/2.
3. Combine (2) with a further projection as in the JL lemma. Get d with log dependence on 1/ε, rather than linear. So, can set ε ≪ 1/d.
All these methods need access to D, unlike JL. Can this be removed? We show NO for generic K, but it may be possible for natural K.
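A minimal sketch of method 1 (the names draw_landmarks, F1, and the generic kernel argument are our own; the role of d follows the slide):

```python
import numpy as np

def draw_landmarks(pool, d, rng):
    # Draw d examples z_1,...,z_d from D -- here, by sampling rows of an
    # unlabeled pool that was itself drawn from D.
    return pool[rng.choice(len(pool), size=d, replace=False)]

def F1(x, Z, kernel):
    # Mapping #1: F(x) = (K(x, z_1), ..., K(x, z_d)), i.e. "x_i" = K(x, z_i).
    return np.array([kernel(x, z) for z in Z])
```

Any standard linear-separator learner can then be run on these d-dimensional vectors; the guarantee above says that with d = (8/ε)[1/γ² + ln 1/δ] landmarks the data is whp separable with error ε, though with no margin guarantee.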

9 Actually, the argument is pretty easy... (though we did try a lot of things first that didn’t work...)

10 Key fact
Claim: If ∃ a perfect w of margin γ in φ-space, then if we draw z_1,...,z_d ∈ D for d ≥ (8/ε)[1/γ² + ln 1/δ], whp (1-δ) there exists w' in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
Proof: Let S = examples drawn so far. Assume |w|=1, |φ(z)|=1 ∀z.
- w_in = proj(w, span(S)), w_out = w - w_in.
- Say w_out is large if Pr_z(|w_out·φ(z)| ≥ γ/2) ≥ ε; else small.
- If small, then done: w' = w_in.
- Else, the next z has at least an ε probability of improving S: |w_out|² ← |w_out|² - (γ/2)².
- Can happen at most 4/γ² times.
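Spelling out the two bookkeeping steps the slide compresses, in the slide's own notation (a sketch, not the paper's full proof):

```latex
% If w_out is "small": for any x with |w_out . phi(x)| < gamma/2, writing w = w_in + w_out,
\[
  \ell\,(w_{\mathrm{in}}\cdot\phi(x))
    \;=\; \ell\,(w\cdot\phi(x)) \;-\; \ell\,(w_{\mathrm{out}}\cdot\phi(x))
    \;\ge\; \gamma - \tfrac{\gamma}{2} \;=\; \tfrac{\gamma}{2},
\]
% and "small" means the exceptional x's have probability mass at most eps,
% so w' = w_in has error at most eps at margin gamma/2.
% If w_out is "large": each improving z knocks |w_out|^2 down by at least (gamma/2)^2,
% and |w_out|^2 starts at |w|^2 = 1, so the number of improvements is at most
\[
  \frac{1}{(\gamma/2)^2} \;=\; \frac{4}{\gamma^2}.
\]
```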

11 So....
- If we draw z_1,...,z_d ∈ D for d = (8/ε)[1/γ² + ln 1/δ], then whp there exists w' in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
- So, for some w' = α_1 φ(z_1) + ... + α_d φ(z_d), Pr_{(x,ℓ)∈P}[sign(w'·φ(x)) ≠ ℓ] ≤ ε.
- But notice that w'·φ(x) = α_1 K(x,z_1) + ... + α_d K(x,z_d).
  ⇒ the vector (α_1,...,α_d) is an ε-good separator in the feature space x_i = K(x,z_i).
- But the margin is not preserved, because of the lengths of the target and the examples.
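That identity is all a learner needs at prediction time; a minimal sketch (hypothetical helper name):

```python
import numpy as np

def predict_with_alphas(x, Z, alphas, kernel):
    # sign(w' . phi(x)) for w' = sum_i alpha_i * phi(z_i): the prediction uses
    # only kernel evaluations K(x, z_i), never phi itself.
    return np.sign(sum(a * kernel(x, z) for a, z in zip(alphas, Z)))
```

The α_i themselves can come from any linear-separator learner run on the explicit feature vectors F(x) = (K(x,z_1),...,K(x,z_d)) from the earlier sketch.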

12 How to preserve margin? (mapping #2)
- We know ∃ w' in span(φ(z_1),...,φ(z_d)) of error ≤ ε at margin γ/2.
- So, given a new x, just want to do an orthogonal projection of φ(x) into that span. (Preserves the dot-product, decreases |φ(x)|, so only increases the margin.)
- Run K(z_i,z_j) for all i,j = 1,...,d. Get matrix M. Decompose M = U^T U.
- (Mapping #2) = (mapping #1) U^{-1}.
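A sketch of those two steps (helper names and the tiny numerical ridge are our additions; the decomposition itself is the slide's M = U^T U):

```python
import numpy as np

def fit_whitener(Z, kernel):
    # M_ij = K(z_i, z_j) = phi(z_i) . phi(z_j); Cholesky gives M = L L^T,
    # so U = L^T satisfies M = U^T U. The small ridge is only for numerical
    # safety if M is (nearly) singular -- not part of the argument.
    M = np.array([[kernel(zi, zj) for zj in Z] for zi in Z])
    L = np.linalg.cholesky(M + 1e-10 * np.eye(len(Z)))
    return L.T

def F2(x, Z, kernel, U):
    # (Mapping #2)(x) = (mapping #1)(x) U^{-1}: the coordinates of the orthogonal
    # projection of phi(x) onto span(phi(z_1),...,phi(z_d)).
    f1 = np.array([kernel(x, z) for z in Z])   # mapping #1
    return np.linalg.solve(U.T, f1)            # equals f1 @ inv(U)
```

With this choice, F2(x)·F2(y) = F1(x) M^{-1} F1(y)^T, which is exactly the dot product of the projections of φ(x) and φ(y) onto the span, so the γ/2 margin of w' carries over.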

13 How to improve dimension?
- Current mapping gives d = (8/ε)[1/γ² + ln 1/δ].
- Johnson-Lindenstrauss gives d = O((1/γ²) log 1/(εδ)).
- JL is nice because we can have ε ≪ 1/d. Good if the algorithm wants the data to be perfectly separable. (Learning a separator of margin γ can be done in time poly(1/γ), but if no perfect separator exists, minimizing error is NP-hard.)
- Answer: just combine the two...

14 [Diagram: the labeled data (X's and O's) is mapped by F_1 into the high-dimensional space R^N / by F into R^d, and a JL projection then brings it down to R^{d_1}.]

15 Mapping #3
- Do JL(mapping #2(x)).
- JL says: fix y, w. A random projection M down to a space of dimension O((1/γ²) log 1/δ') will, with probability (1-δ'), preserve the margin of y up to ± γ/4.
- Use δ' = εδ.
  ⇒ for all y, Pr_M[failure on y] < εδ,
  ⇒ Pr_{D,M}[failure on y] < εδ,
  ⇒ Pr_M[fail on prob mass ≥ ε] < δ.
- So, we get the desired dimension (# features), though the sample complexity remains as in mapping #2.
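A sketch of mapping #3, reusing F2 from the mapping-#2 sketch above (the projection dimension k is left symbolic up to constants; names are ours):

```python
import numpy as np

def make_jl_matrix(d, k, rng):
    # Random Gaussian projection for the JL step:
    # k = O((1/gamma^2) log 1/(eps*delta)), constants not pinned down here.
    return rng.standard_normal((k, d)) / np.sqrt(k)

def F3(x, Z, kernel, U, A):
    # Mapping #3: JL(mapping #2(x)). The number of landmarks |Z| -- and hence the
    # number of kernel calls / samples from D -- is unchanged; only the output
    # dimension shrinks to k.
    return A @ F2(x, Z, kernel, U)   # F2 as defined in the mapping-#2 sketch
```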

16 Lower bound (on necessity of access to D)
For an arbitrary black-box kernel K, can't hope to convert to a small feature space without access to D.
- Consider X = {0,1}^n, a random X' ⊂ X of size 2^{n/2}, D = uniform over X'.
- c = arbitrary function (so learning is hopeless).
- But we have this magic kernel K(x,y) = φ(x)·φ(y):
  φ(x) = (1, 0) if x ∉ X'.
  φ(x) = (-1/2, √3/2) if x ∈ X', c(x) = pos.
  φ(x) = (-1/2, -√3/2) if x ∈ X', c(x) = neg.
- P is separable with margin √3/2 in φ-space.
- But, without access to D, all attempts at running K(x,y) will give an answer of 1.
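A toy instantiation of that construction (our own names; the membership sets are passed in explicitly here as Python sets of hashable points, which a learner of course never gets to see):

```python
import numpy as np

def make_magic_kernel(X_prime_pos, X_prime_neg):
    # phi maps points outside X' to (1, 0), and points inside X' to one of two
    # unit vectors 120 degrees away, depending on c(x). The margin in phi-space
    # is sqrt(3)/2, yet any query pair a learner constructs on its own almost
    # surely lies outside the tiny random X', so K just returns 1.
    s = np.sqrt(3) / 2
    def phi(x):
        if x in X_prime_pos:
            return np.array([-0.5, s])
        if x in X_prime_neg:
            return np.array([-0.5, -s])
        return np.array([1.0, 0.0])
    return lambda x, y: float(np.dot(phi(x), phi(y)))
```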

17 Open Problems
- For specific, natural kernels, like K(x,y) = (1 + x·y)^m, is there an efficient (probability distribution over) mapping that is good for any P = (c,D) for which the kernel is good?
- I.e., an efficient analog of JL for these kernels.
- Or, at least, can these mappings be constructed using less sample complexity (fewer accesses to D)?

