1 Kernels and Margins
Maria Florina Balcan, 03/04/2010

2 Kernel Methods
Amazingly popular method in ML in recent years: lots of books and workshops, and a significant percentage of ICML, NIPS, and COLT papers; highlighted at the ICML 2007 business meeting.

3 Linear Separators
Instance space: X = R^n. Hypothesis class: linear decision surfaces in R^n, h(x) = w · x; if h(x) > 0, then label x as +, otherwise label it as −.

4 Lin. Separators: Perceptron algorithm
Start with the all-zeroes weight vector w. Given example x, predict positive ⇔ w · x ≥ 0. On a mistake, update as follows: mistake on a positive example: w ← w + x; mistake on a negative example: w ← w − x. Note: w is a weighted sum of the incorrectly classified examples. Guarantee: if the data is separable by margin γ (with |x| ≤ 1), the mistake bound is 1/γ².
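A minimal sketch of this update rule in Python (NumPy arrays, labels in {+1, −1}; the function name and epoch count are illustrative, not from the slides):

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Classic Perceptron: w starts at zero and changes only on mistakes."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):                    # y_i in {+1, -1}
            pred = 1 if np.dot(w, x_i) >= 0 else -1   # predict positive iff w.x >= 0
            if pred != y_i:                            # mistake: add or subtract x_i
                w = w + y_i * x_i
    return w
```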

5 Geometric Margin
If S is a set of labeled examples, then a vector w has margin γ w.r.t. S if every example in S is classified correctly and lies at distance at least γ from the separating hyperplane h: w · x = 0.

6 What if Not Linearly Separable
Problem: data is not linearly separable in the most natural feature representation. Example: no good linear separator in the pixel representation. Solutions: Classic: "learn a more complex class of functions". Modern: "use a kernel" (the prominent method today).

7 Overview of Kernel Methods
What is a kernel? A kernel K is a legal definition of a dot product: i.e., there exists an implicit mapping Φ such that K(x, y) = Φ(x) · Φ(y). Why do kernels matter? Many algorithms interact with the data only via dot products, so if we replace x · y with K(x, y), they act implicitly as if the data were in the higher-dimensional Φ-space. If the data is linearly separable by a large margin in the Φ-space, then we get good sample complexity.

8 Kernels
K(·,·) is a kernel if it can be viewed as a legal definition of an inner product: ∃ Φ: X → R^N such that K(x, y) = Φ(x) · Φ(y). The range of Φ is the "Φ-space"; N can be very large. But think of Φ as implicit, not explicit!

9 Example
K(x, y) = (x · y)^d corresponds to the mapping Φ whose coordinates are the degree-d monomials of x. E.g., for n = 2, d = 2, the kernel (x · y)² corresponds to Φ(x) = (x1², x2², √2·x1·x2), since (x · y)² = x1²y1² + x2²y2² + 2·x1x2·y1y2 = Φ(x) · Φ(y).
[Figure: data in the original (x1, x2) space vs. the (z1, z2, z3) Φ-space; the classes O and X become linearly separable in the Φ-space.]
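To make the correspondence concrete, here is a small Python check that the kernel value equals the dot product under the explicit map above (the example vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit feature map for n = 2, d = 2: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print((x @ y) ** 2)       # kernel value K(x, y) = (x.y)^2
print(phi(x) @ phi(y))    # identical value from the explicit Phi-space dot product
```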

10 Example
[Figure only: the same original-space vs. Φ-space illustration, with classes O and X.]

11 Example
Note: the feature space (the mapping Φ) for a given kernel need not be unique.

12 Kernels
More examples: Linear: K(x, y) = x · y. Polynomial: K(x, y) = (x · y)^d or K(x, y) = (1 + x · y)^d. Gaussian: K(x, y) = exp(−|x − y|² / (2σ²)).
Theorem: K is a kernel iff K is symmetric and, for any set of training points x1, x2, …, xm and any a1, a2, …, am ∈ R, we have Σ_{i,j} a_i a_j K(x_i, x_j) ≥ 0 (i.e., the Gram matrix is positive semi-definite).
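A sketch of these kernels in Python, plus a numerical check of the positive semi-definiteness condition (the data and the σ parameterization are illustrative):

```python
import numpy as np

# Illustrative implementations of the kernels above (sigma is a bandwidth parameter).
def linear_kernel(x, y):
    return np.dot(x, y)

def poly_kernel(x, y, d=2):
    return (1.0 + np.dot(x, y)) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

# Sanity check of the theorem: the Gram matrix of a legal kernel is positive semi-definite.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
G = np.array([[gaussian_kernel(a, b) for b in X] for a in X])
print(np.all(np.linalg.eigvalsh(G) >= -1e-9))   # True (up to numerical error)
```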

13 Kernelizing a learning algorithm
If all computations involving instances are in terms of inner products, then: Conceptually, we work in a very high-dimensional Φ-space, and the algorithm's performance depends only on linear separability in that extended space. Computationally, we only need to modify the algorithm by replacing each x · y with K(x, y). Examples of kernelizable algorithms: Perceptron, SVM.

14 Lin. Separators: Perceptron algorithm
Start with the all-zeroes weight vector w. Given example x, predict positive ⇔ w · x ≥ 0. On a mistake, update as follows: mistake on a positive example: w ← w + x; mistake on a negative example: w ← w − x. Easy to kernelize, since w is a weighted sum of the mistakes: replace w · Φ(x) = Σ_i l_i Φ(x_i) · Φ(x) with Σ_i l_i K(x_i, x), where the sum runs over the mistakes (x_i, l_i) made so far. Note: we need to store all the mistakes made so far.
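A hedged Python sketch of the kernelized version (kernel is any function K(x, y); the function names are illustrative):

```python
import numpy as np

def kernel_perceptron(X, y, kernel, epochs=10):
    """Kernel Perceptron: store the mistakes and their labels instead of w."""
    mistakes, signs = [], []
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):                                    # y_i in {+1, -1}
            # w.Phi(x) becomes a sum of kernel evaluations over stored mistakes
            score = sum(s * kernel(m, x_i) for m, s in zip(mistakes, signs))
            if (1 if score >= 0 else -1) != y_i:
                mistakes.append(x_i)
                signs.append(y_i)
    def predict(x):
        return 1 if sum(s * kernel(m, x) for m, s in zip(mistakes, signs)) >= 0 else -1
    return predict
```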

15 Generalize Well if Good Margin
If the data is linearly separable by margin γ in the Φ-space (with |Φ(x)| ≤ 1), then we get good sample complexity: a sample of size only Õ(1/γ²) suffices for confident generalization. We cannot rely on standard VC bounds, since the dimension of the Φ-space might be very large: the VC-dimension of the class of linear separators in R^m is m + 1.

16 Kernels & Large Margins
If S is a set of labeled examples, then a vector w in the Φ-space has margin γ w.r.t. S if every example in S is classified correctly with (normalized) margin at least γ. A vector w in the Φ-space has margin γ with respect to P if this holds with probability 1 over (x, l) drawn from P. A vector w in the Φ-space has error ε at margin γ if the probability mass of examples with margin below γ is at most ε; such a K is an (ε, γ)-good kernel. (See the reconstruction below.)
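The slide's formulas did not survive the transcript; a reconstruction of the three conditions in the usual normalized form (standard definitions, not the slide's exact notation):

```latex
\forall (x,\ell)\in S:\;\; \ell\,\frac{w\cdot\phi(x)}{\|w\|\,\|\phi(x)\|}\;\ge\;\gamma
\qquad\text{(margin $\gamma$ w.r.t.\ $S$)}

\Pr_{(x,\ell)\sim P}\!\left[\ell\,\frac{w\cdot\phi(x)}{\|w\|\,\|\phi(x)\|}\ge\gamma\right]=1
\qquad\text{(margin $\gamma$ w.r.t.\ $P$)}

\Pr_{(x,\ell)\sim P}\!\left[\ell\,\frac{w\cdot\phi(x)}{\|w\|\,\|\phi(x)\|}<\gamma\right]\le\epsilon
\qquad\text{(error $\epsilon$ at margin $\gamma$)}
```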

17 Large Margin Classifiers
If there is a large margin γ and our algorithm produces a large-margin classifier, then the amount of data we need depends only on 1/γ² and is independent of the dimension of the space! [Bartlett & Shawe-Taylor ’99] If there is a large margin, then the Perceptron also behaves well. Another nice justification is based on Random Projection [Arriaga & Vempala ’99].

18 Kernels & Large Margins
Powerful combination in ML in recent years! A kernel implicitly allows mapping data into a high dimensional space and performing certain operations there without paying a high price computationally. If data indeed has a large margin linear separator in that space, then one can avoid paying a high price in terms of sample size as well.

19 Kernel Methods Offer great modularity.
No need to change the underlying learning algorithm to accommodate a particular choice of kernel function. Also, we can substitute a different algorithm while maintaining the same kernel.

20 Kernels, Closure Properties
Easily create new kernels using basic ones! A few properties of legal kernels: if K1 and K2 are kernels and c ≥ 0, then c·K1, K1 + K2, and K1·K2 are kernels; and K(x, y) = f(x)·f(y)·K1(x, y) is a kernel for any real-valued function f. (A numerical check follows below.)
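A small Python illustration of two of these closure properties via Gram matrices (the kernels and data are illustrative):

```python
import numpy as np

# The Gram matrices G1, G2 of two kernels are PSD, and so are G1 + G2 (sum kernel)
# and G1 * G2 (entrywise product, the Gram matrix of the product kernel).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

G1 = X @ X.T                                                   # linear kernel
G2 = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # Gaussian kernel (2*sigma^2 = 1)

for G in (G1 + G2, G1 * G2):
    print(np.all(np.linalg.eigvalsh(G) >= -1e-8))              # True: still PSD
```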

22 What we really care about are good kernels, not only legal kernels!

23 Good Kernels, Margins, and Low Dimensional Mappings
Designing a kernel function is much like designing a feature space. Given a good kernel K, we can reinterpret K as defining a new set of features. [Balcan-Blum-Vempala, MLJ’06]

24 Kernels as Features [BBV06]
If there is indeed a large margin under K, then a random linear projection of the Φ-space down to a low-dimensional space approximately preserves linear separability, by the Johnson-Lindenstrauss lemma!

25 Main Idea - Johnson-Lindenstrauss lemma
[Figure: a random projection A maps points (classes O and X) and the separator w in R^N down to x′ and w′ in R^d, approximately preserving the separation.]

26 Main Idea - Johnson-Lindenstrauss lemma, cont
For any vectors u, v, with probability ≥ 1 − δ the angle ∠(u, v) is preserved up to ±γ/2 if we use d = O((1/γ²) log(1/δ)) random dimensions. Usual use in algorithms: for m points, set δ = O(1/m²). In our case, if we want w.h.p. that ∃ a separator of error ε, use d = O((1/γ²) log(1/(εδ))).
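An illustrative Python check of this angle-preservation property (the dimensions are made up for the demo):

```python
import numpy as np

# A random Gaussian projection approximately preserves the angle between two vectors.
rng = np.random.default_rng(0)
N, d = 10_000, 500
u, v = rng.normal(size=N), rng.normal(size=N)
A = rng.normal(size=(d, N)) / np.sqrt(d)       # random projection R^N -> R^d

def angle(a, b):
    return np.arccos(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(angle(u, v), angle(A @ u, A @ v))        # the two angles are close
```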

27 Main Idea, cont
[Figure: a random projection A maps the Φ-space R^N down to R^d.] If c has margin γ in the Φ-space, then the induced distribution F(D, c) will w.h.p. have a linear separator of error at most ε.

28 Problem Statement
For a given kernel K, the dimensionality and description of Φ(x) might be large, or even unknown, and we do not want to compute Φ(x) explicitly. Given kernel K, produce such a mapping F efficiently: running time that depends polynomially only on 1/γ and the time to compute K, with no dependence on the dimension of the Φ-space.

29 Main Result [BBV06]
Positive answer, if our procedure for computing the mapping F is also given black-box access to the distribution D (unlabeled data). Formally: given black-box access to K(·,·), access to D, and ε, γ, δ, construct in polynomial time a mapping F: X → R^d with d = O((1/γ²) log(1/(εδ))), such that if c has margin γ in the Φ-space, then with probability 1 − δ the induced distribution in R^d is separable at margin γ/2 with error ≤ ε.

30 3 methods (from simplest to best)
1. Draw d examples x1, …, xd from D and use F0(x) = (K(x, x1), ..., K(x, xd)). For d = (8/ε)[1/γ² + ln(1/δ)], if P was separable with margin γ in the Φ-space, then w.h.p. this will be separable with error ε (but this method doesn’t preserve the margin; see the sketch below).
2. Same d, but a little more complicated: separable with error ε at margin γ/2.
3. Combine (2) with a further projection as in the JL lemma. Get d with a log dependence on 1/ε, rather than linear, so we can set ε ≪ 1/d.
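A minimal Python sketch of the simplest mapping F0 (the landmark sample and the kernel are illustrative stand-ins for draws from D and for K):

```python
import numpy as np

# F0 represents each point x by its kernel values against d landmark examples drawn from D.
def make_F0(landmarks, kernel):
    def F0(x):
        return np.array([kernel(x, x_i) for x_i in landmarks])
    return F0

rng = np.random.default_rng(0)
landmarks = rng.normal(size=(50, 10))                  # d = 50 unlabeled draws from D
F0 = make_F0(landmarks, lambda a, b: (1 + a @ b) ** 2)
print(F0(rng.normal(size=10)).shape)                   # (50,) -- the new feature vector
```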

31 A Key Fact
Claim 1: If ∃ a unit-length w of margin γ in the Φ-space, then if we draw x1, …, xd ∈ D for d ≥ (8/ε)[1/γ² + ln(1/δ)], w.h.p. (1 − δ) there exists w’ in span(Φ(x1), ..., Φ(xd)) of error ≤ ε at margin γ/2.
Proof: Let S = {Φ(x)} for the examples x drawn so far. Write w_in = proj(w, span(S)) and w_out = w − w_in. Say w_out is large if Pr_x(|w_out · Φ(x)| ≥ γ/2) ≥ ε; else small. If small, then we are done: w’ = w_in. Else, the next x has probability at least ε of improving S: |w_out|² ← |w_out|² − (γ/2)². This can happen at most 4/γ² times. □

32 A First Attempt
If we draw x1, ..., xd ∈ D for d = (8/ε)[1/γ² + ln(1/δ)], then w.h.p. there exists w’ in span(Φ(x1), ..., Φ(xd)) of error ≤ ε at margin γ/2. So, for some w’ = α1Φ(x1) + ... + αdΦ(xd), Pr_(x,l)~P[sign(w’ · Φ(x)) ≠ l] ≤ ε. But notice that w’ · Φ(x) = α1K(x, x1) + ... + αdK(x, xd) ⇒ the vector (α1, ..., αd) is a separator in the feature space (K(x, x1), …, K(x, xd)) with error ≤ ε. However, the margin is not preserved, because of the lengths of the target and of the examples.

33 A First Mapping
Draw a set S = {x1, ..., xd} of unlabeled examples from D. Run K(x, y) for all x, y ∈ S to get the Gram matrix M(S) = (K(xi, xj))_{xi, xj ∈ S}. Place S into a d-dimensional space based on K (i.e., on M(S)): choose F1(x1), ..., F1(xd) in R^d so that F1(xi) · F1(xj) = K(xi, xj) for all i, j (in particular |F1(xi)|² = K(xi, xi)).
[Figure: the points x1, x2, x3 embedded as F1(x1), F1(x2), F1(x3) in R^d, with dot products matching the kernel values.]
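One way to realize such an embedding of the sample is via a Cholesky factorization of the Gram matrix; a hedged Python sketch (function names and the jitter trick are illustrative, not from the slides):

```python
import numpy as np

# Factor the Gram matrix M = L L^T and use row i of L as F1(x_i),
# so that F1(x_i).F1(x_j) = K(x_i, x_j).
def embed_sample(S, kernel, jitter=1e-10):
    M = np.array([[kernel(a, b) for b in S] for a in S])
    L = np.linalg.cholesky(M + jitter * np.eye(len(S)))   # jitter guards against a singular M
    return L

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 3))
gauss = lambda a, b: np.exp(-np.sum((a - b) ** 2))
L = embed_sample(S, gauss)
M = np.array([[gauss(a, b) for b in S] for a in S])
print(np.allclose(L @ L.T, M, atol=1e-6))                  # True
```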

34 A First Mapping, cont What to do with new points?
Extend the embedding F1 to all of X: define F1: X → R^d as follows: for x ∈ X, let F1(x) ∈ R^d be the point such that F1(x) · F1(xi) = K(x, xi) for all i ∈ {1, ..., d}. This mapping is equivalent to orthogonally projecting Φ(x) down to span(Φ(x1), …, Φ(xd)).
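Continuing the sketch above (assuming the hypothetical embed_sample and its Cholesky factor L from the previous block), the extension to a new point amounts to solving one triangular system:

```python
import numpy as np
from scipy.linalg import solve_triangular

# For a new x, F1(x) is the solution of L z = k(x), where k(x)_i = K(x, x_i);
# then F1(x).F1(x_i) = K(x, x_i) for every landmark x_i, as required.
def F1_new_point(x, S, L, kernel):
    k = np.array([kernel(x, x_i) for x_i in S])
    return solve_triangular(L, k, lower=True)   # L is the lower-triangular Cholesky factor
```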

35 A First Mapping, cont
From Claim 1 we have that, with probability 1 − δ, some w’ = α1Φ(x1) + ... + αdΦ(xd) has error ≤ ε at margin γ/2 in the Φ-space. Consider w’’ = α1F1(x1) + ... + αdF1(xd). We have |w’’| = |w’|, w’ · Φ(x) = w’’ · F1(x), and |F1(x)| ≤ |Φ(x)|. So w’’ has error ≤ ε at margin γ/2 under the mapping F1.

36 An improved mapping
A two-stage process: compose the first mapping, F1, with a random linear projection. This combines two types of random projection: a projection based on points chosen at random from D, and a projection based on directions chosen uniformly at random in the intermediate space.
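A hedged Python sketch of this composition (reusing the hypothetical landmarks S, kernel, and Cholesky factor L from the earlier sketches; this is an illustration of the two-stage idea, not the paper's exact construction):

```python
import numpy as np

# F = A o F1: the kernel-based embedding F1, followed by a random Gaussian projection A.
def make_F(S, L, kernel, d, rng):
    A = rng.normal(size=(d, len(S))) / np.sqrt(d)   # random projection R^{|S|} -> R^d
    def F(x):
        k = np.array([kernel(x, x_i) for x_i in S])
        F1_x = np.linalg.solve(L, k)                # F1(x), as in the previous sketch
        return A @ F1_x
    return F
```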

37 [Figure: the Φ-space R^N is mapped by F1 down to R^n (N ≫ n), and then by a random projection A down to R^d; the overall mapping is F.]

38 Improved Mapping - Properties
Given , ,  <1, , if P has margin  in the -space, then with probability ¸1-, our mapping into Rd, has the property that F(D,c) is linearly separable with error at most , at margin at most , given that we use unlabeled examples.

39 Improved Mapping, Consequences
If P has margin γ in the Φ-space, then we can use unlabeled examples to produce a mapping into R^d such that w.h.p. the data is linearly separable with error ≪ 1/d. The error rate of the induced target function in R^d is so small that a set S of labeled examples will, w.h.p., be perfectly separable in R^d, so we can use any generic, zero-noise linear-separator algorithm in R^d.
