
1 Principal Component Analysis and its use in MA clustering (Lecture 12)

2 What is PCA? This is a MATHEMATICAL procedure that transforms a set of correlated responses into a smaller set of uncorrelated variables called PRINCIPAL COMPONENTS. Uses: data screening, clustering, discriminant analysis, and regression (combating multicollinearity).
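As a minimal illustration of that claim (a sketch with simulated data, not part of the lecture), the scores returned by PCA are uncorrelated even when the input variables are strongly correlated:

set.seed(1)                               # simulated toy data
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.3)     # x2 correlated with x1
X  <- cbind(x1, x2)
round(cor(X), 3)                          # strong off-diagonal correlation
pc <- princomp(X)
round(cor(pc$scores), 3)                  # PC scores are uncorrelated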

3 Objectives of PCA It is an exploratory technique meant to give researchers a better FEEL for their data. Rather than merely reduce dimensionality, it tries to understand the TRUE dimensionality of the data, and to identify “meaningful” variables. Given a VARIANCE-COVARIANCE MATRIX S, PCA returns new variables called principal components that are: –Uncorrelated –The first component explains MOST of the variability –The remaining PCs explain decreasing amounts of variability

4 Idea of PCA Consider x to be a random vector with mean μ and variance-covariance matrix Σ. The first PC variable is defined by y_1 = a_1'(x - μ), where a_1 is chosen so that Var(a_1'(x - μ)) is maximized over all vectors a_1 satisfying a_1'a_1 = 1. It can be shown that the maximum value of Var(a_1'(x - μ)) among all such vectors is λ_1, the first (largest) eigenvalue of the matrix Σ; this implies a_1 is the eigenvector corresponding to the eigenvalue λ_1. The second PC is defined by the eigenvector corresponding to the second largest eigenvalue λ_2, and so on down to the pth eigenvalue.
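A quick numerical check of this result (an illustrative sketch with simulated data, not from the slides):

X  <- matrix(rnorm(200), ncol = 4)        # toy data: n = 50 units, p = 4 variables
S  <- cov(X)                              # sample covariance matrix
e  <- eigen(S)
a1 <- e$vectors[, 1]                      # first eigenvector, a1'a1 = 1
y1 <- scale(X, center = TRUE, scale = FALSE) %*% a1   # y1 = a1'(x - mean)
var(y1)                                   # equals the largest eigenvalue...
e$values[1]                               # ...lambda_1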

5 Supplementary Info: What are Eigenvalues and Eigenvectors? Also called characteristic roots (latent roots), eigenvalues are the roots of the polynomial equation defined by |Σ - λI| = 0. This leads to an equation of the form c_1 λ^p + c_2 λ^(p-1) + … + c_p λ + c_{p+1} = 0. If Σ is symmetric then the eigenvalues are real numbers and can be ordered.

6 Supplementary Info II: What are Eigenvectors? Similarly, eigenvectors are the vectors satisfying the equation Σa - λa = 0. If Σ is symmetric then there will be p eigenvectors corresponding to the p eigenvalues. They are generally not unique and are normalized so that a_j'a_j = 1. Remarks: if two eigenvalues are NOT equal, their eigenvectors will be orthogonal to each other. When two eigenvalues are equal, their eigenvectors are CHOSEN orthogonal to each other (in this case they are non-unique). Also, tr(Σ) = Σ_i λ_i and |Σ| = Π_i λ_i.
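These identities are easy to verify numerically (an illustrative sketch with simulated data):

S <- cov(matrix(rnorm(300), ncol = 3))    # a symmetric covariance matrix
e <- eigen(S)
sum(diag(S)); sum(e$values)               # trace equals the sum of eigenvalues
det(S); prod(e$values)                    # determinant equals their product
round(crossprod(e$vectors), 3)            # eigenvectors are orthonormal (identity)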

7 Idea of PCA contd… Hence the p principal components are a_1, a_2, …, a_p, the eigenvectors corresponding to the ordered eigenvalues λ_1 ≥ λ_2 ≥ … ≥ λ_p of Σ. Result: two principal components are uncorrelated if and only if their defining eigenvectors are orthogonal to each other. Hence the PCs form an orthogonal axis system on which the data fall.

8 Idea of PCA contd… The variance of the jth component is λ_j, j = 1, …, p. Remember: tr(Σ) = σ_11 + σ_22 + … + σ_pp. Also, tr(Σ) = λ_1 + λ_2 + … + λ_p. Hence a common measure of the “importance” of the jth principal component is λ_j / tr(Σ).
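In R this importance measure is a one-liner (a sketch using R's built-in USArrests data, chosen here only as a stand-in):

e <- eigen(cov(USArrests))                   # eigenvalues of the sample covariance S
round(e$values / sum(e$values), 3)           # lambda_j / tr(S) for each PC
round(cumsum(e$values) / sum(e$values), 3)   # cumulative proportion explained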

9 Comments To actually do PCA we need to compute the principal component scores, the values of the principal component variables for each unit in the data set. These scores give the locations of the observations with respect to the principal component axes. Generally eigenvectors are normalized to length 1, a_j'a_j = 1. Often, to make comparisons across components, each element of the eigenvector is multiplied by the square root of the corresponding eigenvalue; the resulting vectors, c_j = λ_j^(1/2) a_j, are called component vectors.
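Both the scores and the component vectors can be computed directly from the eigendecomposition (an illustrative sketch, again using the built-in USArrests data):

X <- as.matrix(USArrests)
e <- eigen(cov(X))
scores <- scale(X, center = TRUE, scale = FALSE) %*% e$vectors
head(scores)                                       # each unit's location on the PC axes
cvecs <- sweep(e$vectors, 2, sqrt(e$values), `*`)  # c_j = sqrt(lambda_j) * a_j
cvecs                                              # component vectors, column by column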

10 Estimating PCs Life would be easy if μ and Σ were known: all we would have to do is compute the normalized eigenvectors and the corresponding eigenvalues. But most of the time we DO NOT know μ and Σ; we must estimate them, and the PCs are then the sample values corresponding to the estimated mean and covariance matrix. Determining the # of PCs: –Look for eigenvalues that are much smaller than the others. –Use plots like the SCREE plot (a plot of each eigenvalue versus its component number).
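A scree plot takes one or two lines in R (a sketch; USArrests is again only a placeholder data set):

e <- eigen(cov(USArrests))
plot(e$values, type = "b", xlab = "Component number",
     ylab = "Eigenvalue", main = "Scree plot")
screeplot(princomp(USArrests), type = "lines")   # equivalent, via princomp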

11 Caveats The whole idea of PCA is to transform a set of correlated variables into a set of uncorrelated variables; hence, if the data are already uncorrelated, there is not much additional advantage in doing PCA. One can do PCA on the correlation matrix or the covariance matrix. When PCA is done on the correlation matrix, the component correlation vectors c_j = λ_j^(1/2) a_j give the correlations between the (standardized) original variables and the jth principal component variable.
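That interpretation is easy to confirm (a sketch; USArrests once more serves as an arbitrary example data set):

Z  <- scale(USArrests)                   # standardize, so cov(Z) = correlation matrix
e  <- eigen(cor(USArrests))
c1 <- sqrt(e$values[1]) * e$vectors[, 1] # component correlation vector for PC1
scores1 <- Z %*% e$vectors[, 1]          # PC1 scores
round(cbind(c1, cor(Z, scores1)), 3)     # the two columns agree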

12 PCA and Multidimensional Scaling Essentially what PCA does is what is called the SINGULAR VALUE DECOMPOSITION (SVD) of a matrix: X = UDV', where X is n by p, with n << p (in MA data); U is n by n; D is an n by n diagonal matrix with decreasing diagonal entries d_1 ≥ d_2 ≥ … ≥ d_n; and V is a p by n matrix which rotates X into a new set of coordinates, such that XV = UD.
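R's svd() returns exactly these pieces, so the identities can be checked directly (a sketch on a small centered matrix; the dimensions here are modest, unlike the n << p microarray case):

X <- scale(as.matrix(USArrests), center = TRUE, scale = FALSE)
s <- svd(X)                                   # s$u, s$d, s$v
max(abs(X - s$u %*% diag(s$d) %*% t(s$v)))    # ~0: UDV' reproduces X
max(abs(X %*% s$v - s$u %*% diag(s$d)))       # ~0: XV = UD
s$d^2 / (nrow(X) - 1)                         # eigenvalues of the covariance matrix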

13 SVD and MDS SVD is a VERY memory-hungry procedure; for MA data, where there are a large number of genes, it is very slow and often needs HUGE amounts of memory. Multidimensional Scaling (MDS) is a collection of methods that do not use the full data matrix but rather the distance matrix between the samples. This reduces the computation from n by p to n by n (quite a reduction!).
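Classical MDS is available in base R as cmdscale(), which works from the n-by-n distance matrix (a sketch; the data set is again just a placeholder):

D   <- dist(USArrests)            # pairwise distances among the n samples
fit <- cmdscale(D, k = 2)         # two-dimensional configuration
plot(fit, type = "n")
text(fit, labels = rownames(USArrests), cex = 0.6)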

14 Sammon Mapping A common method used in MA is SAMMON mapping, which aims to find the two-dimensional representation whose inter-point distances match the original dissimilarity matrix as closely as possible (it minimizes a weighted stress criterion). PCA has the advantage that it represents the samples in a scatterplot whose axes are linear combinations of the most variable genes. Sammon mapping treats all genes equivalently and hence is a bit “duller” than PCA-based clustering.

15 PCA in Microarrays Useful technique to understand the TRUE dimensionality of the data. Useful for clustering. In R, with the MASS package (which provides sammon()), you can use:

library(MASS)                                  # for sammon()
my.data1 <- read.table("cluster.csv", header = TRUE, sep = ",")
princomp(my.data1)                             # PCA on the covariance matrix
myd.sam <- sammon(dist(my.data1))              # Sammon map of pairwise distances
plot(myd.sam$points, type = "n")               # empty plot frame
text(myd.sam$points, labels = as.character(1:nrow(my.data1)))   # label each sample


