
Kernels, Margins, and Low-dimensional Mappings [NIPS 2007 Workshop on Topology Learning] Maria-Florina Balcan, Avrim Blum, Santosh Vempala.


1 Kernels, Margins, and Low-dimensional Mappings [NIPS 2007 Workshop on Topology Learning] Maria-Florina Balcan, Avrim Blum, Santosh Vempala

2 Generic problem
- Given a set of images, we want to learn a linear separator to distinguish men from women.
- Problem: the pixel representation is no good.
- Old-style advice: pick a better set of features! But this seems ad hoc, not scientific.
- New-style advice: use a kernel! K(x,y) = φ(x)·φ(y), where φ is an implicit, high-dimensional mapping. Feels more scientific. Many algorithms can be "kernelized". Use the "magic" of the implicit high-dimensional space; don't pay for it if a large-margin separator exists.

3 Generic problem
- Old-style advice: pick a better set of features! But this seems ad hoc, not scientific.
- New-style advice: use a kernel! K(x,y) = φ(x)·φ(y), where φ is an implicit, high-dimensional mapping. Feels more scientific. Many algorithms can be "kernelized". Use the "magic" of the implicit high-dimensional space; don't pay for it if a large-margin separator exists.
- E.g., K(x,y) = (x·y + 1)^m, with φ: (n-dimensional space) → (n^m-dimensional space).
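To make the polynomial-kernel example concrete, here is a minimal sketch (mine, not from the slides) checking numerically that the degree-2 kernel K(x,y) = (x·y + 1)² really is a dot product in an explicit higher-dimensional space; the feature ordering and the use of NumPy are illustrative choices.

```python
import numpy as np

def poly2_kernel(x, y):
    """Black-box kernel: cheap to evaluate, never builds the feature map."""
    return (np.dot(x, y) + 1.0) ** 2

def phi2(x):
    """Explicit feature map for the degree-2 polynomial kernel (O(n^2) features)."""
    n = len(x)
    feats = [1.0]                                    # constant term
    feats += list(np.sqrt(2.0) * x)                  # sqrt(2) * x_i
    feats += list(x * x)                             # x_i^2
    feats += [np.sqrt(2.0) * x[i] * x[j]             # sqrt(2) * x_i * x_j, i < j
              for i in range(n) for j in range(i + 1, n)]
    return np.array(feats)

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)
print(poly2_kernel(x, y), np.dot(phi2(x), phi2(y)))  # the two values agree
```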

4 Claim: Can view the new method as a way of conducting the old method.
- Given a kernel [as a black-box program K(x,y)] and access to typical inputs [samples from D],
- Claim: we can run K and reverse-engineer an explicit (small) set of features, such that if K is good [∃ a large-margin separator in φ-space for D, c], then this is a good feature set [∃ an almost-as-good separator].
- "You give me a kernel, I give you a set of features."
- Do this using the idea of random projection...

5 Claim: Can view the new method as a way of conducting the old method.
- Given a kernel [as a black-box program K(x,y)] and access to typical inputs [samples from D],
- Claim: we can run K and reverse-engineer an explicit (small) set of features, such that if K is good [∃ a large-margin separator in φ-space for D, c], then this is a good feature set [∃ an almost-as-good separator].
- E.g., sample z_1, ..., z_d from D. Given x, define x_i = K(x, z_i).
- Implications:
  - Practical: an alternative to kernelizing the algorithm.
  - Conceptual: view the kernel as a (principled) way of doing feature generation, i.e., as a similarity function, rather than as the "magic power of an implicit high-dimensional space".
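A minimal sketch of this feature construction (landmarks z_1, ..., z_d drawn from unlabeled data, features x_i = K(x, z_i)); the RBF kernel and the toy Gaussian data are illustrative assumptions, not something the slides prescribe.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # An example similarity function; any black-box kernel K works here.
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def mapping1(X, landmarks, kernel=rbf_kernel):
    """F(x) = (K(x, z_1), ..., K(x, z_d)) for each row x of X."""
    return np.array([[kernel(x, z) for z in landmarks] for x in X])

rng = np.random.default_rng(0)
X_unlabeled = rng.standard_normal((500, 10))   # stand-in for samples from D
d = 50
landmarks = X_unlabeled[rng.choice(len(X_unlabeled), size=d, replace=False)]
features = mapping1(X_unlabeled, landmarks)    # explicit d-dimensional features
print(features.shape)                          # (500, 50); feed to any linear learner
```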

6 Basic setup, definitions
- Instance space X. Distribution D, target c. Use P = (D, c).
- K(x,y) = φ(x)·φ(y).
- P is separable with margin γ in φ-space if ∃ w (|w| = 1) such that Pr_{(x,ℓ)~P}[ℓ(w·φ(x)/|φ(x)|) < γ] = 0.
- Error ε at margin γ: replace the "0" above with "ε".
- Goal: use K to get a mapping to a low-dimensional space.

7 One idea: Johnson-Lindenstrauss lemma
- If P is separable with margin γ in φ-space, then with probability 1-δ, a random linear projection down to a space of dimension d = O((1/γ²) log[1/(εδ)]) will have a linear separator of error < ε. [Arriaga-Vempala]
- If the projection vectors are r_1, r_2, ..., r_d, then we can view them as features x_i = φ(x)·r_i.
- Problem: this uses φ. Can we do it directly, using K as a black box, without computing φ?
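For intuition, a sketch of the random-projection idea, assuming φ is available explicitly (exactly the assumption the rest of the talk works to remove); the Gaussian projection and the constant hidden in the O(·) dimension bound are my own choices.

```python
import numpy as np

def random_projection_features(Phi, d, rng):
    """Rows of Phi are phi(x); returns features x_i = phi(x) . r_i for Gaussian r_i."""
    N = Phi.shape[1]
    R = rng.standard_normal((N, d)) / np.sqrt(d)   # random directions r_1, ..., r_d
    return Phi @ R

gamma, eps, delta = 0.1, 0.05, 0.05
d = int(np.ceil((1 / gamma**2) * np.log(1 / (eps * delta))))  # O(.) constant omitted

rng = np.random.default_rng(0)
Phi = rng.standard_normal((200, 1000))   # stand-in for phi(x) of 200 examples in R^1000
low_dim = random_projection_features(Phi, d, rng)
print(d, low_dim.shape)                  # ~600 dimensions, independent of the 1000
```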

8 Three methods (from simplest to best)
1. Draw d examples z_1, ..., z_d from D. Use F(x) = (K(x,z_1), ..., K(x,z_d)). [So "x_i" = K(x, z_i).] For d = (8/ε)[1/γ² + ln(1/δ)], if P was separable with margin γ in φ-space, then whp this will be separable with error ε. (But this method does not preserve the margin.)
2. Same d, but a little more complicated. Separable with error ε at margin γ/2.
3. Combine (2) with a further projection as in the JL lemma. Get d with a log dependence on 1/ε, rather than linear. So we can set ε ≪ 1/d.
All these methods need access to D, unlike JL. Can this be removed? We show NO for generic K, but it may be possible for natural K.
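A small numeric comparison of the bounds in methods 1-2 versus method 3, using example values of γ, ε, δ chosen by me (the constant hidden in the O(·) for method 3 is omitted):

```python
import math

gamma, eps, delta = 0.1, 0.01, 0.05
d_methods_1_2 = math.ceil((8 / eps) * (1 / gamma**2 + math.log(1 / delta)))
d_method_3 = math.ceil((1 / gamma**2) * math.log(1 / (eps * delta)))
print(d_methods_1_2, d_method_3)   # 82397 vs 761: linear vs logarithmic in 1/eps
```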

9 Key fact
Claim: If ∃ a perfect w of margin γ in φ-space, then if we draw z_1, ..., z_d from D with d ≥ (8/ε)[1/γ² + ln(1/δ)], whp (1-δ) there exists w' in span(φ(z_1), ..., φ(z_d)) of error ≤ ε at margin γ/2.
Proof: Let S = the examples drawn so far. Assume |w| = 1 and |φ(z)| = 1 for all z.
- w_in = proj(w, span(S)), w_out = w − w_in.
- Say w_out is large if Pr_z(|w_out·φ(z)| ≥ γ/2) ≥ ε; else small.
- If small, then we are done: w' = w_in.
- Else, the next z has probability at least ε of improving S: |w_out|² ← |w_out|² − (γ/2)².
- This can happen at most 4/γ² times. □
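Two steps of the proof sketch are left implicit; here is a short reconstruction (mine, not text from the talk), using the slide's assumptions |w| = 1 and |φ(z)| = 1.

```latex
% "Small" case: for all but an eps-fraction of z we have |w_out . phi(z)| < gamma/2,
% and since w = w_in + w_out with l (w . phi(z)) >= gamma,
\[
  \ell\big(w_{\text{in}}\cdot\phi(z)\big)
  = \ell\big(w\cdot\phi(z)\big) - \ell\big(w_{\text{out}}\cdot\phi(z)\big)
  \ge \gamma - \tfrac{\gamma}{2} = \tfrac{\gamma}{2}.
\]
% "Large" case: each improving draw removes a component of squared length at least
% (gamma/2)^2 from w_out, and |w_out|^2 <= |w|^2 = 1 initially, so
\[
  \#\{\text{improving draws}\} \;\le\; \frac{|w|^2}{(\gamma/2)^2} \;=\; \frac{4}{\gamma^2}.
\]
```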

10 So....
If we draw z_1, ..., z_d from D with d = (8/ε)[1/γ² + ln(1/δ)], then whp there exists w' in span(φ(z_1), ..., φ(z_d)) of error ≤ ε at margin γ/2.
- So, for some w' = α_1 φ(z_1) + ... + α_d φ(z_d), Pr_{(x,ℓ)~P}[sign(w'·φ(x)) ≠ ℓ] ≤ ε.
- But notice that w'·φ(x) = α_1 K(x,z_1) + ... + α_d K(x,z_d).
- ⇒ The vector (α_1, ..., α_d) is an ε-good separator in the feature space x_i = K(x, z_i).
- But the margin is not preserved, because the lengths of the target and of the examples are not preserved.

11 How to preserve the margin? (mapping #2)
- We know ∃ w' in span(φ(z_1), ..., φ(z_d)) of error ≤ ε at margin γ/2.
- So, given a new x, we just want to orthogonally project φ(x) into that span. (This preserves the dot product and decreases |φ(x)|, so it only increases the margin.)
- Run K(z_i, z_j) for all i, j = 1, ..., d. Get the matrix M. Decompose M = UᵀU. (Mapping #2) = (mapping #1) U⁻¹. □

12 Mapping #2, details
- Draw a set S = {z_1, ..., z_d} of d = (8/ε)[1/γ² + ln(1/δ)] unlabeled examples from D.
- Run K(x,y) for all x, y ∈ S; get M(S) = (K(z_i, z_j))_{z_i, z_j ∈ S}.
- Place S into d-dimensional space based on K (or M(S)).
[Figure: z_1, z_2, z_3 embedded in R^d as F_2(z_1), F_2(z_2), F_2(z_3), with pairwise dot products K(z_i, z_j) and squared lengths K(z_i, z_i) = |F_2(z_i)|².]

13 Mapping #2, details, cont'd
- What to do with new points?
- Extend the embedding F_1 to all of X: consider F_2: X → R^d defined as follows. For x ∈ X, let F_2(x) ∈ R^d be the point of smallest length such that F_2(x)·F_2(z_i) = K(x, z_i) for all i ∈ {1, ..., d}.
- This mapping is equivalent to orthogonally projecting φ(x) down to span(φ(z_1), ..., φ(z_d)).
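A minimal sketch of mapping #2 as described on slides 11-13. The RBF kernel, the Cholesky factorization M = UᵀU, and the small jitter term are my implementation choices; the slides only specify the decomposition and the smallest-length condition.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def fit_mapping2(landmarks, kernel=rbf_kernel):
    d = len(landmarks)
    M = np.array([[kernel(zi, zj) for zj in landmarks] for zi in landmarks])
    M += 1e-10 * np.eye(d)              # tiny jitter so the Cholesky factorization exists numerically
    U = np.linalg.cholesky(M).T         # M = U^T U with U upper triangular
    def F2(x):
        k = np.array([kernel(x, z) for z in landmarks])   # mapping #1 of x
        # F2(x) = the vector with F2(x) . F2(z_i) = K(x, z_i) for all i,
        # i.e. mapping #1 composed with U^{-1}: solve U^T v = k.
        return np.linalg.solve(U.T, k)
    return F2

rng = np.random.default_rng(0)
Z = rng.standard_normal((30, 5))        # landmarks z_1, ..., z_d drawn from D
F2 = fit_mapping2(Z)
x = rng.standard_normal(5)
print(np.allclose(F2(x) @ F2(Z[3]), rbf_kernel(x, Z[3])))   # True: dot products match K
```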

14 How to improve the dimension?
- The current mapping (F_2) gives d = (8/ε)[1/γ² + ln(1/δ)].
- Johnson-Lindenstrauss gives d_1 = O((1/γ²) log[1/(εδ)]). Nice, because we can have d_1 ≪ 1/ε.
- Answer: just combine the two... Run mapping #2, then do a random projection down from that. This gives the desired dimension (# of features), though the sample complexity remains as in mapping #2.

15 [Figure: points labeled x and o mapped from the φ-space R^N down to R^d via F_2, and then to R^{d_1} via a JL random projection.]

16 Mapping #3
- Do JL(mapping #2(x)).
- JL says: fix y, w. A random projection M down to a space of dimension O((1/γ²) log(1/δ')) will, with probability (1-δ'), preserve the margin of y up to ±γ/4.
- Use δ' = εδ. ⇒ For all y, Pr_M[failure on y] < εδ ⇒ Pr_{D,M}[failure on y] < εδ ⇒ Pr_M[failure on probability mass ≥ ε] < δ (by Markov's inequality).
- So we get the desired dimension (# of features), though the sample complexity remains as in mapping #2.
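A minimal sketch of mapping #3: mapping #2 followed by a JL random projection. It reuses the hypothetical fit_mapping2 helper from the mapping #2 sketch above; the Gaussian projection matrix and the omitted constant in d_1 are my choices.

```python
import numpy as np

def fit_mapping3(landmarks, d1, rng):
    F2 = fit_mapping2(landmarks)        # mapping #2 from the sketch above (assumed in scope)
    d = len(landmarks)
    R = rng.standard_normal((d, d1)) / np.sqrt(d1)   # JL projection R^d -> R^d1
    return lambda x: F2(x) @ R          # JL(mapping #2(x))

rng = np.random.default_rng(1)
Z = rng.standard_normal((100, 5))       # sample complexity stays as in mapping #2
gamma, eps, delta = 0.1, 0.05, 0.05
d1 = int(np.ceil((1 / gamma**2) * np.log(1 / (eps * delta))))
F3 = fit_mapping3(Z, d1, rng)
print(F3(rng.standard_normal(5)).shape)  # (d1,): d1 grows only logarithmically in 1/eps
```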

17 Lower bound (on the necessity of access to D)
For an arbitrary black-box kernel K, we cannot hope to convert it to a small feature space without access to D.
- Consider X = {0,1}^n, a random X' ⊂ X of size 2^{n/2}, and D = uniform over X'.
- c = an arbitrary function (so learning is hopeless).
- But we have this magic kernel K(x,y) = φ(x)·φ(y):
  - φ(x) = (1, 0) if x ∉ X'.
  - φ(x) = (-1/2, √3/2) if x ∈ X' and c(x) = pos.
  - φ(x) = (-1/2, -√3/2) if x ∈ X' and c(x) = neg.
- P is separable with margin √3/2 in φ-space.
- But, without access to D, all attempts at running K(x,y) will give an answer of 1.
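A toy-scale sketch of this construction (a much smaller n than the slide's, chosen by me so X' can be stored explicitly): a kernel that hides a margin-√3/2 separator on D, yet returns 1 on essentially every query an algorithm makes without samples from D.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X_prime = {tuple(rng.integers(0, 2, n)) for _ in range(2 ** (n // 2))}  # random X' of size ~2^(n/2)
labels = {x: rng.choice([+1, -1]) for x in X_prime}                      # arbitrary target c on X'

def phi(x):
    x = tuple(x)
    if x not in X_prime:
        return np.array([1.0, 0.0])
    return np.array([-0.5, np.sqrt(3) / 2 * labels[x]])   # +/- sqrt(3)/2 by class

def K(x, y):
    return float(phi(x) @ phi(y))

# Without samples from D, an algorithm picks its own query points; with overwhelming
# probability they all miss X', so every kernel evaluation it sees equals 1:
queries = rng.integers(0, 2, (5, n))
print([K(queries[i], queries[j]) for i in range(5) for j in range(i, 5)])
```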

18 Open problems
- For specific natural kernels, such as K(x,y) = (1 + x·y)^m, is there an efficient analog of JL that does not need access to D? Or can one at least reduce the sample complexity (use fewer accesses to D)?
- Can one extend the results (e.g., mapping #1: x → [K(x,z_1), ..., K(x,z_d)]) to more general similarity functions K? It is not exactly clear what the theorem statement would look like.

