
Classification (SVMs / Kernel method) (Sp’10, Bafna/Ideker)


1 Classification (SVMs / Kernel method)

2 LP versus quadratic programming. LP: linear constraints, linear objective function. LP can be solved in polynomial time. In QP, the objective function contains a quadratic form. For a positive semidefinite Q, the QP can be solved in polynomial time.
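For reference, a generic QP in the form the slide alludes to (standard statement, not spelled out in the transcript; Q is the matrix of the quadratic form):

\min_{x}\; \tfrac{1}{2}\, x^{T} Q x + c^{T} x
\quad \text{s.t.} \quad A x \le b .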

3 Margin of separation. Suppose we find a separating hyperplane (β, β_0) s.t.: for all +ve points, β^T x − β_0 ≥ 1; for all -ve points, β^T x − β_0 ≤ −1. What is the margin of separation? (The three lines in the figure are β^T x − β_0 = 0, β^T x − β_0 = 1, and β^T x − β_0 = −1.)
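The answer to the slide's question, stated here for reference (standard result): the distance between the two bounding hyperplanes is

\text{margin} \;=\; \frac{2}{\|\beta\|} .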

4 Separating by a wider margin. Solutions with a wider margin are better.

5 Separating via misclassification. In general, data is not linearly separable. What if we also wanted to minimize misclassified points? Recall that each sample x_i in our training set has the label y_i ∈ {−1, 1}. For each point i, y_i(β^T x_i − β_0) should be positive. Define ξ_i ≥ max{0, 1 − y_i(β^T x_i − β_0)}. If i is correctly classified (y_i(β^T x_i − β_0) ≥ 1), then ξ_i = 0; if i is incorrectly classified, or close to the boundary, then ξ_i > 0. We must minimize Σ_i ξ_i.

6 Support Vector Machines (wide margin and misclassification). Maximize the margin while minimizing misclassification. Solved using non-linear optimization techniques. The problem can be reformulated so that it uses only dot products of the data points, which allows us to employ the kernel method. This gives a lot of power to the method.
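For reference, the standard soft-margin primal that combines the two pieces above (this display is not in the transcript itself; C trades margin width against misclassification):

\min_{\beta,\,\beta_0,\,\xi}\; \frac{1}{2}\|\beta\|^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad y_i(\beta^T x_i - \beta_0) \ge 1 - \xi_i,\;\; \xi_i \ge 0 \;\; \forall i .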

7 Reformulating the optimization

8 Lagrangian relaxation. The goal, the constraints ("s.t."), and the Lagrangian we minimize were shown as equation images on this slide and are not captured in the transcript.

9 Simplifying. For fixed Lagrange multipliers α ≥ 0 and μ ≥ 0, we minimize the Lagrangian.

10 Substituting. Substituting (1) into the Lagrangian (the equations were shown as images).

11 Substituting (2, 3), we have the minimization problem (the dual).
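The resulting dual in its standard form, reconstructed from the surrounding slides rather than copied from the slide images:

\min_{\alpha}\; \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j \, x_i^T x_j \;-\; \sum_i \alpha_i
\quad \text{s.t.} \quad 0 \le \alpha_i \le C,\;\; \sum_i \alpha_i y_i = 0,
\qquad \text{with } \beta = \sum_i \alpha_i y_i x_i \text{ recovered from the solution.}

Note that the data enter only through the dot products x_i^T x_j, which is what makes the kernel method possible later.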

12 Classification using SVMs. Under these conditions, the problem is a quadratic programming problem and can be solved using known techniques. Quiz: when we have solved this QP, how do we classify a point x?
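One way to answer the quiz, using the dual variables (standard SVM result, stated here for reference):

f(x) \;=\; \mathrm{sign}\!\left( \beta^T x - \beta_0 \right)
\;=\; \mathrm{sign}\!\left( \sum_i \alpha_i y_i \, x_i^T x \;-\; \beta_0 \right),

with β_0 recoverable from any support vector having 0 < α_i < C.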

13 The kernel method. The SVM formulation can be solved using QP on dot products. As these are wide-margin classifiers, they provide a more robust solution. However, the true power of the SVM approach comes from 'the kernel method', which allows us to go to higher-dimensional (and non-linear) spaces.

14 Kernel. Let X be the set of objects. Ex: X = the set of samples in micro-arrays; each object x ∈ X is a vector of gene expression values. k: X × X → R is a positive semidefinite kernel if k is symmetric and k is +ve semidefinite.
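Spelling out the two conditions (standard definition, added for clarity):

k(x, x') = k(x', x) \quad \text{for all } x, x' \in X,
\qquad
\sum_{i}\sum_{j} c_i c_j \, k(x_i, x_j) \ge 0 \quad \text{for every finite set } \{x_i\} \subset X \text{ and every } c \in \mathbb{R}^n .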

15 Kernels as dot products. Quiz: suppose the objects x are all real vectors (as in gene expression). Define the linear kernel k_L(x, x') = x^T x' (the ordinary dot product). Is k_L a kernel? It is symmetric, but is it +ve semidefinite?

16 The linear kernel is +ve semidefinite. Write X as a matrix such that each column is a sample: X = [x_1 x_2 …]. By definition, the linear kernel (Gram) matrix is k_L = X^T X. For any vector c, the quadratic form is non-negative, as shown below.
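A one-line completion of the argument (the slide showed it as an image; this is the standard step):

c^T k_L\, c \;=\; c^T X^T X c \;=\; \|X c\|^2 \;\ge\; 0 .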

17 Generalizing kernels. Any object can be represented by a feature vector in real space.

18 Generalizing. Note that the feature mapping could actually be non-linear. On the flip side, every kernel can be represented as a dot product in a high-dimensional space. Sometimes the kernel is easier to define than the mapping φ.

19 The kernel trick. If an algorithm for vectorial data is expressed exclusively in terms of dot products, it can be changed to an algorithm on an arbitrary kernel: simply replace the dot product by the kernel.

20 Kernel trick example. Consider a kernel k defined via a mapping φ: k(x, x') = φ(x)^T φ(x'). It could be that φ is very difficult to compute explicitly, but k is easy to compute. Suppose we define a distance between two objects as d(x, x') = ||φ(x) − φ(x')||. How do we compute this distance?
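The answer, obtained by expanding the squared norm so that only kernel evaluations remain (standard identity, added for reference):

d(x, x')^2 \;=\; \|\phi(x) - \phi(x')\|^2 \;=\; k(x, x) \;-\; 2\,k(x, x') \;+\; k(x', x') .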

21 Kernels and SVMs. Recall that SVM-based classification is described entirely in terms of dot products (the dual and the decision rule above).

22 Kernels and SVMs. Applying the kernel trick, we replace each dot product with a kernel evaluation. We can try kernels that are biologically relevant.
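The kernelized decision rule that results from this replacement (standard form, reconstructed rather than copied from the slide):

f(x) \;=\; \mathrm{sign}\!\left( \sum_i \alpha_i y_i \, k(x_i, x) \;-\; \beta_0 \right).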

23 Examples of kernels for vectors
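The examples on this slide were shown as images; the usual kernels for real vectors are (standard definitions, not copied from the slide):

k_{\text{lin}}(x, x') = x^T x', \qquad
k_{\text{poly}}(x, x') = (x^T x' + c)^d, \qquad
k_{\text{RBF}}(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right).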

24 String kernel. Consider a string s = s_1 s_2 …. Define an index set I as a subset of positions (in increasing order); s[I] is the (possibly non-contiguous) substring restricted to those positions. l(I) = span of I, and W(I) = c^{l(I)} with c < 1, so the weight decreases as the span increases. For any string u of length k, a feature value is accumulated over all index sets I with s[I] = u (see the sketch after this slide).
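A standard way to complete the definition the slide starts (the exact display was not captured in the transcript): the gap-weighted feature value of s for a fixed string u is

\phi_u(s) \;=\; \sum_{I \,:\, s[I] = u} c^{\,l(I)} .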

25 String kernel. Map every string to a |Σ|^n-dimensional space, indexed by all strings u of length up to n. The mapping is expensive, but given two strings s, t, the dot-product kernel k(s, t) = φ(s)^T φ(t) can be computed in O(n |s| |t|) time.

26 SVM conclusion. SVMs are a generic scheme for classifying data with wide margins and low misclassification. For data that is not easily represented as vectors, the kernel trick provides a standard recipe for classification: define a meaningful kernel, and solve using SVM. Many standard kernels are available (linear, polynomial, RBF, string).
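A minimal sketch of this recipe in scikit-learn; this is an illustration, not part of the lecture, and the toy data, variable names, and parameter values are assumptions.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(38, 200))      # e.g. 38 samples x 200 genes (toy data)
y = rng.choice([-1, 1], size=38)    # class labels

# Soft-margin SVM with standard kernels; C trades margin width vs. misclassification.
for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel, C=1.0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(kernel, scores.mean())

# A custom kernel (e.g. a string kernel or another biologically motivated one) can be
# supplied as a precomputed Gram matrix: SVC(kernel="precomputed").fit(K_train, y).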

27 Classification review. We started out by treating the classification problem as one of separating points in high-dimensional space. This is obvious for gene expression data, but applicable to any kind of data. Questions of separability and linear separation. Algorithms for classification: perceptron, linear discriminant, maximum likelihood, linear programming, SVMs, kernel methods & SVM.

28 Classification review. Recall that we considered 3 problems: group together samples in an unsupervised fashion (clustering); classify based on training data (often by learning a hyperplane that separates); and select marker genes that are diagnostic for the class, so that all other genes can be discarded, leading to lower dimensionality.

29 Dimensionality reduction. Many genes have highly correlated expression profiles. By discarding some of the genes, we can greatly reduce the dimensionality of the problem. There are other, more principled ways to do such dimensionality reduction.

30 Why is high dimensionality bad? With a high enough dimensionality, all points can be linearly separated. Recall that a point x_i is misclassified if: it is +ve, but β^T x_i − β_0 ≤ 0; or it is -ve, but β^T x_i − β_0 > 0. In the first case, choose ε_i s.t. β^T x_i − β_0 + ε_i ≥ 0. By adding a dimension for each misclassified point, we create a higher-dimensional hyperplane that perfectly separates all of the points!
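One way to make the construction concrete (my notation; the slide's own symbols were lost in extraction). Append one new coordinate per misclassified point and extend the weight vector with the ε values:

\tilde{x}_i = (x_i,\; e_i), \qquad \tilde{\beta} = (\beta,\; \varepsilon_1, \dots, \varepsilon_m)
\;\;\Rightarrow\;\;
\tilde{\beta}^T \tilde{x}_i - \beta_0 \;=\; \beta^T x_i - \beta_0 + \varepsilon_i ,

where e_i is the standard basis vector for the new coordinate of point i; choosing ε_i large and positive for a misclassified +ve point (and large and negative for a misclassified -ve point) makes every point fall on the correct side.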

31 Principal Components Analysis. We get the intrinsic dimensionality of a data set.

32 Principal Components Analysis. Consider the expression values of 2 genes over 6 samples. Clearly, the expression of the two genes is highly correlated. Projecting all the points onto a single line could explain most of the data. This is a generalization of "discarding the gene".

33 Projecting. Consider the mean of all points, m, and a unit vector φ emanating from the mean. Algebraically, projecting onto φ means that each sample x can be represented by a single value φ^T(x − m). (The slide's figure shows m, x, x − m, and the projection φ^T(x − m).)

34 Higher dimensions. Consider a set of 2 (more generally, k) orthonormal vectors φ_1, φ_2, …. Once projected, each sample x can be represented by a 2-dimensional (k-dimensional) vector: (φ_1^T(x − m), φ_2^T(x − m), …). (The figure repeats the one-dimensional picture for φ_1 and φ_2.)
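In matrix form (added for clarity; Φ collects the k orthonormal directions as columns):

z \;=\; \Phi^T (x - m) \;\in\; \mathbb{R}^k, \qquad
\Phi = [\phi_1 \; \phi_2 \; \cdots \; \phi_k], \qquad \Phi^T \Phi = I_k .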

35 How to project. The generic scheme allows us to project an m-dimensional surface into a k-dimensional one. How do we select the k 'best' dimensions? The strategy used by PCA is one that maximizes the variance of the projected points around the mean.

36 PCA. Suppose all of the data were to be reduced by projecting onto a single line φ through the mean. How do we select the line φ?

37 PCA cont'd. Let each point x_k map to x'_k = m + a_k φ. We want to minimize the reconstruction error Σ_k ||x'_k − x_k||^2. Observation 1: each point x_k maps to x'_k = m + (φ^T(x_k − m))φ, i.e., a_k = φ^T(x_k − m).

38 Proof of Observation 1. Differentiating the error w.r.t. a_k and setting the derivative to zero yields a_k = φ^T(x_k − m). (The intermediate equations were shown as images.)

39 Minimizing PCA error. To minimize the error, we must maximize φ^T S φ, where S is the scatter (covariance) matrix. At the maximum (with φ^T φ = 1), S φ = λφ, so λ is an eigenvalue, φ the corresponding eigenvector, and λ = φ^T S φ. Therefore, we must choose the eigenvector corresponding to the largest eigenvalue.
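The optimality condition written out, via a Lagrange multiplier for the unit-norm constraint (standard derivation, added for clarity):

\max_{\phi:\, \phi^T\phi = 1} \phi^T S \phi
\;\;\Rightarrow\;\;
\frac{\partial}{\partial \phi}\left( \phi^T S \phi - \lambda(\phi^T\phi - 1) \right) = 2S\phi - 2\lambda\phi = 0
\;\;\Rightarrow\;\;
S\phi = \lambda\phi, \qquad \phi^T S \phi = \lambda .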

40 PCA steps. X = starting matrix with n columns and m rows (each column x_j is a sample). (The remaining steps on this slide were shown graphically.)
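A minimal sketch of the PCA steps in Python/NumPy, under the slide's convention that columns are samples; the function name, shapes, and toy data are illustrative, not from the lecture.

import numpy as np

def pca(X, k):
    """X: m x n matrix, one sample per column. Returns the top-k directions and projections."""
    m, n = X.shape
    mean = X.mean(axis=1, keepdims=True)     # m x 1 mean vector
    Xc = X - mean                            # center the samples
    S = Xc @ Xc.T / n                        # m x m scatter / covariance matrix
    evals, evecs = np.linalg.eigh(S)         # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]      # indices of the k largest eigenvalues
    Phi = evecs[:, order]                    # m x k orthonormal directions
    Z = Phi.T @ Xc                           # k x n projected coordinates
    return Phi, Z

# Example: project 6 samples of 2 correlated genes onto 1 dimension.
X = np.array([[1.0, 2.1, 3.0, 3.9, 5.2, 6.1],
              [0.9, 2.0, 3.2, 4.1, 4.8, 6.0]])
Phi, Z = pca(X, k=1)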

41 End of Lecture.


43 ALL-AML classification. The two leukemias need different therapeutic regimens. They are usually distinguished through hematopathology. Can gene expression be used for a more definitive test? 38 bone marrow samples; total mRNA was hybridized against probes for 6817 genes. Q: are these classes separable?

44 Neighborhood analysis (cont'd). Each gene is represented by an expression vector v(g) = (e_1, e_2, …, e_n). Choose an idealized expression vector as center. Discriminating genes will be 'closer' to the center (any distance measure can be used). (The figure highlights a discriminating gene.)

45 Neighborhood analysis. Q: are there genes whose expression correlates with one of the two classes? A: for each class, create an idealized vector c. Compute the number of genes N_c whose expression 'matches' the idealized expression vector. Is N_c significantly larger than N_c* for a random c*?

46 Neighborhood test. Distance measure used: for any binary vector c, let the 1 entries denote class 1 and the 0 entries denote class 2. Compute the mean and std. dev. [μ_1(g), σ_1(g)] of expression in class 1, and also [μ_2(g), σ_2(g)] in class 2. P(g,c) = [μ_1(g) − μ_2(g)] / [σ_1(g) + σ_2(g)]. N_1(c, r) = {g | P(g,c) = r}. High density for some r is indicative of correlation with the class distinction. The neighborhood is significant if a random center does not produce the same density.
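A small sketch of the signal-to-noise statistic P(g,c) for a single gene, assuming expr holds that gene's expression across samples and c is the binary class vector; the function and variable names are illustrative.

import numpy as np

def signal_to_noise(expr, c):
    """P(g,c) = (mu1 - mu2) / (sigma1 + sigma2) for one gene's expression vector."""
    expr, c = np.asarray(expr, float), np.asarray(c)
    x1, x2 = expr[c == 1], expr[c == 0]
    return (x1.mean() - x2.mean()) / (x1.std() + x2.std())

# Example: a gene that is higher in class 1 than in class 2.
expr = [5.1, 4.8, 5.3, 1.2, 0.9, 1.1]
c = [1, 1, 1, 0, 0, 0]
print(signal_to_noise(expr, c))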

47 Neighborhood analysis. #{g | P(g,c) > 0.3} > 709 (ALL) vs 173 by chance. Class prediction should be possible using micro-array expression values.

48 Class prediction. Choose a fixed set of informative genes (based on their correlation with the class distinction). The predictor is uniquely defined by the sample and the subset of informative genes. For each informative gene g, define (w_g, b_g): w_g = P(g,c) (when is this +ve?), and b_g = [μ_1(g) + μ_2(g)]/2. Given a new sample X, let x_g be the normalized expression value at g. Vote of gene g = w_g (x_g − b_g) (a +ve value is a vote for class 1, and a negative value for class 2).
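A sketch of the weighted-voting prediction for a new sample, assuming the informative genes and their (w_g, b_g) have already been computed from training data; names are illustrative and the vote uses w_g (x_g − b_g) as reconstructed above.

import numpy as np

def predict(x, w, b):
    """Weighted voting over informative genes.
    x: expression of the new sample at the informative genes
    w: per-gene weights w_g = P(g, c)
    b: per-gene decision boundaries b_g = (mu1 + mu2) / 2
    Returns the predicted class (+1 or -1) and the prediction strength PS."""
    votes = w * (x - b)                 # +ve vote -> class 1, -ve vote -> class 2
    v1 = votes[votes > 0].sum()         # total vote for class 1
    v2 = -votes[votes < 0].sum()        # total vote for class 2
    label = 1 if v1 >= v2 else -1
    ps = abs(v1 - v2) / (v1 + v2)       # prediction strength (see next slide)
    return label, ps

x = np.array([2.0, 0.5, 1.8])
w = np.array([0.6, -0.4, 0.5])
b = np.array([1.0, 1.0, 1.0])
print(predict(x, w, b))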

49 Prediction strength. PS = [V_win − V_lose] / [V_win + V_lose]; reflects the margin of victory. A 50-gene predictor is correct 36/38 in cross-validation. Prediction accuracy on other samples: 100% (prediction made for 29/34 samples; median PS = 0.73). Other predictors, between 10 and 200 genes, all worked well.

50 Performance

51 Differentially expressed genes? Do the predictive genes reveal any biology? The initial expectation was that most genes would be of a hematopoietic lineage. However, many genes encode: cell cycle progression genes, chromatin remodelling, transcription, known oncogenes, and targets of leukemia drugs (etoposide).

52 Relationship between ML and the Golub predictor. Maximum-likelihood classification, when the covariance matrix is diagonal with identical variance for the different classes, is similar to Golub's classifier.

53 Automatic class discovery. The classification of different cancers is the result of years of hypothesis-driven research. Suppose you were given unlabeled samples of ALL/AML. Would you be able to distinguish the two classes?

54 Self-Organizing Maps. SOMs were applied to group the 38 samples. Class A1 contained 24/25 ALL and 3/13 AML samples. How can we validate this? Use the labels to do supervised classification via cross-validation. A 20-gene predictor gave 34 accurate predictions, 1 error, and 2 of 3 uncertains.

55 Comparing various error models

56 Conclusion

