L15: Microarray analysis (Classification)

The Biological Problem Two conditions need to be differentiated because they call for different treatments. Example: ALL (Acute Lymphocytic Leukemia) vs. AML (Acute Myelogenous Leukemia). Possibly, the set of over-expressed genes differs between the two conditions.

Geometric formulation Each sample is a vector with dimension equal to the number of genes. We have two classes of vectors (AML, ALL), and would like to separate them, if possible, with a hyperplane.

Basic geometry For vectors x = (x₁, x₂) and y: What is ||x||²? What is x/||x||? What is the dot product of x and y?

Dot Product Let β be a unit vector, i.e., ||β|| = 1. Recall that βᵀx = ||x|| cos θ, where θ is the angle between β and x. What is βᵀx if x is orthogonal (perpendicular) to β? (It is 0, since cos 90° = 0.)
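A minimal numpy sketch of these quantities; the vectors x, y, and beta below are made-up illustrations, not data from the lecture.

import numpy as np

x = np.array([3.0, 4.0])
y = np.array([1.0, 2.0])

norm_sq = x @ x                             # ||x||^2 = 3^2 + 4^2 = 25
unit_x = x / np.linalg.norm(x)              # x / ||x||, a unit vector in the direction of x
dot_xy = x @ y                              # dot product x^T y = 11

beta = np.array([0.0, 1.0])                 # a unit vector, ||beta|| = 1
cos_theta = (beta @ x) / np.linalg.norm(x)  # beta^T x = ||x|| cos(theta), so cos(theta) = 0.8

perp = np.array([1.0, 0.0])                 # orthogonal to beta
print(beta @ perp)                          # 0.0: the dot product vanishes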

Hyperplane How can we define a hyperplane L? Find the unit vector β that is perpendicular (normal) to the hyperplane.

Points on the hyperplane Consider a hyperplane L defined by a unit vector β and a distance β₀ from the origin. Notes: – For all x ∈ L, xᵀβ must be the same: xᵀβ = β₀. – For any two points x₁, x₂ ∈ L, (x₁ − x₂)ᵀβ = 0. Therefore, given a vector β and an offset β₀, the hyperplane is the set of all points – {x : xᵀβ = β₀}.
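A quick numerical check of this characterization (a sketch with an arbitrary made-up beta and beta0): any point of the form β₀β + t·u, with u orthogonal to β, satisfies xᵀβ = β₀.

import numpy as np

beta = np.array([0.6, 0.8])                 # unit normal to the plane, ||beta|| = 1
beta0 = 2.0                                 # distance of the plane from the origin

u = np.array([-0.8, 0.6])                   # a direction lying in the plane (u^T beta = 0)
for t in (-1.0, 0.0, 3.5):
    x = beta0 * beta + t * u                # a point of L
    print(np.isclose(x @ beta, beta0))      # True: x^T beta = beta0 for every x in L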

Hyperplane properties Given an arbitrary point x, what is the distance from x to the plane L? D(x,L) = βᵀx − β₀. When are points x₁ and x₂ on different sides of the hyperplane?

Hyperplane properties Given an arbitrary point x, what is the distance from x to the plane L? – D(x,L) = βᵀx − β₀. When are points x₁ and x₂ on different sides of the hyperplane? Ans: when D(x₁,L) · D(x₂,L) < 0.
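A small sketch of the signed distance and the side test, reusing the made-up beta and beta0 from the previous sketch:

import numpy as np

beta = np.array([0.6, 0.8])
beta0 = 2.0

def D(x, beta, beta0):
    """Signed distance of x from the hyperplane L = (beta, beta0)."""
    return beta @ x - beta0

x1 = np.array([4.0, 3.0])                   # D(x1, L) = 2.4 + 2.4 - 2.0 = 2.8 > 0
x2 = np.array([0.0, 0.0])                   # D(x2, L) = -2.0 < 0
print(D(x1, beta, beta0) * D(x2, beta, beta0) < 0)   # True: x1 and x2 are on opposite sides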

Separating by a hyperplane Input: a training set of +ve and −ve examples. Recall that a hyperplane is represented by – {x : −β₀ + β₁x₁ + β₂x₂ = 0}, or – (in higher dimensions) {x : βᵀx − β₀ = 0}. Goal: find a hyperplane that ‘separates’ the two classes. Classification: a new point x is +ve if it lies on the +ve side of the hyperplane (D(x,L) > 0), and −ve otherwise.

Hyperplane separation What happens if we have many choices of a hyperplane? – We try to maximize the distance of the points from the hyperplane. What happens if the classes are not separable by a hyperplane? – We define a function based on the amount of misclassification, and try to minimize it.

Error in classification Sample function: the sum of the distances of all misclassified points. – Let yᵢ = −1 for +ve example i, and yᵢ = +1 otherwise. Then D(β, β₀) = Σᵢ yᵢ(βᵀxᵢ − β₀), summed over the misclassified points, and the best hyperplane is the one that minimizes D(β, β₀). Other definitions are also possible.
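A sketch of this error function in numpy (X, y, beta, beta0 are illustrative names): with the sign convention above, a point is misclassified exactly when yᵢ·D(xᵢ,L) > 0, so the error is a sum of positive terms.

import numpy as np

def classification_error(X, y, beta, beta0):
    """Sum of distances of misclassified points.

    X: (n, p) matrix with one sample per row; y: labels with y_i = -1
    for +ve examples and y_i = +1 otherwise (the slide's convention).
    """
    d = X @ beta - beta0                    # signed distances D(x_i, L)
    wrong = y * d > 0                       # a point is misclassified exactly when y_i * d_i > 0
    return np.sum(y[wrong] * d[wrong])      # each term is positive, so the error is >= 0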

Restating Classification The (supervised) classification problem can now be reformulated as an optimization problem. Goal: find the hyperplane (β, β₀) that optimizes the objective D(β, β₀). No efficient algorithm is known for this problem, but a simple generic optimization can be applied: start with a randomly chosen (β, β₀), and move to a neighboring (β′, β′₀) if D(β′, β′₀) < D(β, β₀).
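A hedged sketch of that generic optimization: perturb (beta, beta0) at random and accept the move only if the error drops. classification_error is the hypothetical helper from the previous sketch; n_steps and step are made-up parameters.

import numpy as np

def local_search(X, y, n_steps=1000, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta, beta0 = rng.normal(size=p), 0.0                    # random starting hyperplane
    best = classification_error(X, y, beta, beta0)
    for _ in range(n_steps):
        b = beta + step * rng.normal(size=p)                 # a neighboring candidate
        b0 = beta0 + step * rng.normal()
        err = classification_error(X, y, b, b0)
        if err < best:                                       # accept only if the error drops
            beta, beta0, best = b, b0, err
    return beta, beta0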

Gradient Descent The function D(β) defines the error. We follow an iterative refinement: in each step, refine β so that the error is reduced. Gradient descent is one approach to such iterative refinement, moving β in the direction of the negative derivative D′(β).
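A sketch of a single gradient-descent step, assuming D is the misclassification-sum criterion from the earlier slide (so ∂D/∂β is the sum of yᵢxᵢ and ∂D/∂β₀ is −Σ yᵢ, over the misclassified points); rho is a made-up learning rate.

import numpy as np

def gradient_step(X, y, beta, beta0, rho=0.01):
    """One gradient-descent step on D(beta, beta0)."""
    d = X @ beta - beta0
    wrong = y * d > 0                                        # currently misclassified points
    grad_beta = X[wrong].T @ y[wrong]                        # dD/dbeta  = sum of y_i * x_i
    grad_beta0 = -np.sum(y[wrong])                           # dD/dbeta0 = -sum of y_i
    return beta - rho * grad_beta, beta0 - rho * grad_beta0  # move against the gradient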

Rosenblatt’s perceptron learning algorithm

Classification based on perceptron learning Use Rosenblatt’s algorithm to compute the hyperplane L = (β, β₀). Assign x to class 1 if D(x,L) = βᵀx − β₀ ≥ 0, and to class 2 otherwise.
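The algorithm itself is not reproduced in this transcript (it appeared as a figure), so the sketch below uses the textbook Rosenblatt update: cycle through the data and nudge (beta, beta0) on every misclassified point, then classify by the sign of D(x, L) as above. Labels here follow the usual perceptron convention (+1 for class 1), not the yᵢ = −1 convention of the error-function slide.

import numpy as np

def perceptron(X, t, rho=1.0, max_epochs=100):
    """Textbook Rosenblatt update; t_i = +1 for class 1, -1 for class 2."""
    n, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    for _ in range(max_epochs):
        changed = False
        for xi, ti in zip(X, t):
            if ti * (beta @ xi - beta0) <= 0:   # xi is on the wrong side (or on the boundary)
                beta = beta + rho * ti * xi     # rotate beta toward the misclassified point
                beta0 = beta0 - rho * ti
                changed = True
        if not changed:                         # a full pass with no mistakes: converged
            break
    return beta, beta0                          # may never converge if the data are not separable

def classify(x, beta, beta0):
    return 1 if beta @ x - beta0 >= 0 else 2    # class 1 iff D(x, L) >= 0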

Perceptron learning If many solutions are possible, it does not choose between them. If the data are not linearly separable, it does not terminate, and this is hard to detect. The time to convergence is not well understood.

Linear Discriminant analysis Provides an alternative approach to classification with a linear function. Project all points, including the means, onto a vector β. We want to choose β such that – the difference of the projected means is large, and – the variance within each group is small.

Choosing the right β β₁ is a better choice than β₂, as the variance within a group is small and the difference of means is large. How do we compute the best β?

Linear Discriminant analysis Fisher criterion: choose β to maximize J(β) = (βᵀ(m₁ − m₂))² / (βᵀS_wβ), where m₁, m₂ are the class means and S_w is the within-class scatter matrix.

LDA cont’d What is the projection of a point x onto β? – Ans: βᵀx. What is the distance between the projected means? – Ans: |βᵀ(m₁ − m₂)|.

LDA Cont’d The Fisher criterion J(β) is maximized by β ∝ S_w⁻¹(m₁ − m₂).

LDA Therefore, a simple computation (Matrix inverse) is sufficient to compute the ‘best’ separating hyperplane
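A sketch of that computation, assuming the standard Fisher solution β ∝ S_w⁻¹(m₁ − m₂) with S_w the within-class scatter matrix; X1 and X2 are illustrative per-class data matrices (one sample per row).

import numpy as np

def lda_direction(X1, X2):
    """Fisher's discriminant direction for two classes (one sample per row)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
    beta = np.linalg.solve(S_w, m1 - m2)                       # beta ~ S_w^{-1} (m1 - m2)
    return beta / np.linalg.norm(beta)                         # assumes S_w is invertible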

Maximum Likelihood discrimination Consider the simple case of one-dimensional data. Compute a distribution of the values in each class.

Maximum Likelihood discrimination Suppose we knew the distribution of points in each class ωᵢ. – We can compute Pr(x | ωᵢ) for all classes i, and take the maximum. The true distribution is not known, so usually we assume that it is Gaussian.

ML discrimination Suppose all the points were in one dimension, and all classes were normally distributed.
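A sketch of one-dimensional Gaussian ML discrimination: estimate (μᵢ, σᵢ) for each class from training values, then assign a new value x to the class with the highest likelihood. All names below are illustrative.

import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def fit_class(values):
    return np.mean(values), np.std(values)      # (mu_i, sigma_i) estimated for one class

def ml_classify(x, params):
    """params: a list of (mu, sigma) pairs, one per class."""
    likelihoods = [gaussian_pdf(x, mu, sigma) for mu, sigma in params]
    return int(np.argmax(likelihoods))          # the class i maximizing Pr(x | class i)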

ML discrimination (multi-dimensional case) Not part of the syllabus.

Supervised classification summary Most techniques for supervised classification are based on the notion of a separating hyperplane. The ‘optimal’ separation can be computed using various combinatorial (perceptron), algebraic (LDA), or statistical (ML) analyses.

Dimensionality reduction Many genes have highly correlated expression profiles. By discarding some of the genes, we can greatly reduce the dimensionality of the problem. There are other, more principled ways to do such dimensionality reduction.

PCA: motivating example Consider the expression values of 2 genes over 6 samples. Clearly, the expression of g₁ is not informative, and it suffices to look at the g₂ values. Dimensionality can be reduced by discarding the gene g₁.

PCA: Ex2 Consider the expression values of 2 genes over 6 samples. Here, the expression of the two genes is highly correlated. Projecting all the points onto a single line could explain most of the data. This is a generalization of “discarding the gene”.

PCA Suppose all of the data were to be reduced by projecting onto a single line, with direction φ, through the mean m. How do we select the line φ?

PCA cont’d Let each point xₖ map to x′ₖ = m + aₖφ. We want to minimize the error Σₖ ||xₖ − x′ₖ||². Observation 1: each point xₖ maps to x′ₖ = m + (φᵀ(xₖ − m))φ, i.e., aₖ = φᵀ(xₖ − m).
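A numerical illustration of Observation 1 with made-up expression values and a made-up unit direction phi: each point's reconstruction on the line through the mean is m + (φᵀ(xₖ − m))φ, and the summed squared error is the quantity PCA minimizes over φ.

import numpy as np

X = np.array([[1.0, 2.1], [2.0, 3.9], [3.0, 6.2],       # 2 genes (columns)
              [4.0, 8.1], [5.0, 9.8], [6.0, 12.2]])     # over 6 samples (rows)
m = X.mean(axis=0)                                      # the mean point m

phi = np.array([1.0, 2.0])
phi = phi / np.linalg.norm(phi)                         # unit direction of the candidate line

a = (X - m) @ phi                                       # a_k = phi^T (x_k - m)
X_proj = m + np.outer(a, phi)                           # x'_k = m + a_k * phi
error = np.sum((X - X_proj) ** 2)                       # sum_k ||x_k - x'_k||^2, minimized over phi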

Proof of Observation 1 Differentiating the error for point xₖ with respect to aₖ and setting the derivative to zero (using φᵀφ = 1) gives aₖ = φᵀ(xₖ − m).

Minimizing PCA Error To minimize the error, we must maximize φᵀSφ, where S is the covariance matrix of the data. By definition, λ = φᵀSφ (with Sφ = λφ) implies that λ is an eigenvalue, and φ the corresponding eigenvector. Therefore, we must choose the eigenvector corresponding to the largest eigenvalue.

PCA The single best dimension is given by the eigenvector corresponding to the largest eigenvalue of S. The best k dimensions are given by the eigenvectors {φ₁, φ₂, …, φₖ} corresponding to the k largest eigenvalues. To obtain the k-dimensional representation, take BᵀM, where the columns of B are φ₁, …, φₖ and M is the (mean-centered) data matrix.
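A sketch of the full recipe in numpy, assuming M holds one sample per row: form the covariance matrix S, take the eigenvectors of its k largest eigenvalues as the columns of B, and project the mean-centered data with Bᵀ.

import numpy as np

def pca(M, k):
    """Project the rows of M (one sample per row) onto the top-k principal components."""
    m = M.mean(axis=0)
    S = np.cov(M, rowvar=False)                 # covariance matrix of the data
    eigvals, eigvecs = np.linalg.eigh(S)        # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest eigenvalues
    B = eigvecs[:, order]                       # columns are phi_1, ..., phi_k
    return (M - m) @ B                          # each row is B^T (x - m) for one sample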