09/05/2005 Seminar in Biological Mathematics: Dimension Reduction - PCA (Principal Component Analysis)


09/05/2005 Seminar in Biological Mathematics: Dimension Reduction - PCA (Principal Component Analysis)

The Goals
- Reduce the number of dimensions of a data set.
- Capture the maximum information present in the initial data set.
- Minimize the error between the original data set and the reduced-dimensional data set.
- Enable simpler visualization of complex data.

The Algorithm
Step 1: Calculate the covariance matrix of the observation matrix.
Step 2: Calculate the eigenvalues and the corresponding eigenvectors.
Step 3: Sort the eigenvectors by the magnitude of their eigenvalues.
Step 4: Project the data points onto those vectors.
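The four steps above can be sketched in a few lines of numpy; this is a minimal illustration on hypothetical random data, not the code used in the article:

```python
import numpy as np

# Hypothetical toy data: 100 observations of 5 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Step 1: center the data and compute the covariance matrix.
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (Xc.shape[0] - 1)

# Step 2: eigenvalues and eigenvectors of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)

# Step 3: sort eigenvectors by decreasing eigenvalue.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project the data onto the leading k eigenvectors (here k = 2).
k = 2
scores = Xc @ eigvecs[:, :k]
```

Because the eigenvectors are orthogonal, the projected scores are uncorrelated, and the variance of each score column equals the corresponding eigenvalue.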

PCA - Step 1: The Covariance Matrix
Given an n × m data matrix X (n observations, m variables) with mean-centered columns, the covariance matrix is C = XᵀX / (n − 1).

Covariance Matrix - Example
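The slide's numeric example was an image that did not survive the transcript; as a stand-in, here is a small hypothetical data set with 3 observations of 2 variables, worked through by hand and checked against numpy's built-in `np.cov`:

```python
import numpy as np

# Hypothetical example data: 3 observations (rows) of 2 variables (columns).
X = np.array([[2.0, 0.0],
              [4.0, 2.0],
              [6.0, 4.0]])

Xc = X - X.mean(axis=0)           # center each column (means are 4 and 2)
C = Xc.T @ Xc / (X.shape[0] - 1)  # sample covariance matrix
# C == [[4., 4.],
#       [4., 4.]]
```

The two variables move together perfectly here, so the off-diagonal covariances equal the variances.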

Linear Algebra Review - Eigenvalues and Eigenvectors
For a square n × n matrix C, a nonzero vector v is an eigenvector with eigenvalue λ if Cv = λv.
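The defining property Cv = λv can be checked directly in numpy; the 2 × 2 matrix below is a hypothetical replacement for the slide's lost example:

```python
import numpy as np

# Hypothetical 2x2 symmetric matrix.
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(C)   # eigenvalues of this matrix are 3 and 1

# Verify the defining property C v = lambda v for the first returned pair.
v = eigvecs[:, 0]
assert np.allclose(C @ v, eigvals[0] * v)
```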

Singular Value Decomposition
Any n × m matrix X can be factored as X = UΣVᵀ, where the columns of U and V are orthonormal and Σ is a diagonal matrix of nonnegative singular values.

SVD Example
To find the SVD of a matrix X:
1) First, compute XᵀX.
2) Second, find the eigenvalues of XᵀX and the corresponding eigenvectors; the eigenvectors form the columns of V, and the singular values are the square roots of the eigenvalues.

SVD Example - Continued
3) Now obtain U and Σ.
4) Assemble the decomposition X = UΣVᵀ.
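The relationship between the SVD and the eigendecomposition of XᵀX described in these steps can be verified numerically; the matrix below is hypothetical:

```python
import numpy as np

# Hypothetical small matrix.
X = np.array([[3.0, 0.0],
              [4.0, 5.0]])

U, s, Vt = np.linalg.svd(X)

# The singular values are the square roots of the eigenvalues of X^T X.
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]  # sorted descending, like s
assert np.allclose(s, np.sqrt(eigvals))

# Reassembling the decomposition recovers X.
assert np.allclose(U @ np.diag(s) @ Vt, X)
```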

PCA - Step 3
Sort the eigenvectors by the magnitude of their eigenvalues, in decreasing order; the leading eigenvectors are the principal components.

PCA - Step 4
Project the input data onto the principal components. A new value (score) is generated for each observation as a linear combination of the original variables:

t_{i,pc} = Σ_k b_{pc,k} · x_{i,k}

where t_{i,pc} is the score of observation i on principal component pc, b_{pc,k} is the loading of variable k on that component (between −1 and 1), and x_{i,k} is the value of variable k for observation i.
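A short sketch of this projection step, writing out one score explicitly as the sum of loading-weighted variables (hypothetical toy data):

```python
import numpy as np

# Hypothetical toy data: 50 observations of 4 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)

# Loadings: eigenvectors of the covariance matrix, sorted by decreasing
# eigenvalue (np.linalg.eigh returns ascending order, so reverse columns).
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
B = eigvecs[:, ::-1]

# All scores at once: t[i, pc] = sum_k B[k, pc] * Xc[i, k]
scores = Xc @ B

# The score of observation 0 on PC 1, written out as the slide's sum.
t_0_pc1 = sum(B[k, 0] * Xc[0, k] for k in range(4))
assert np.isclose(t_0_pc1, scores[0, 0])
```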

PCA - Fundamentals
The first PC is the eigenvector with the greatest eigenvalue of the covariance matrix of the dataset. The eigenvalues are also the variances of the observations along the new coordinate axes: Var(PC1), Var(PC2), and so on.

PCA: Scores
The scores are the positions along the component lines onto which the observations are projected.

PCA: Loadings
The loadings b_{pc,k} (component pc, variable k) indicate the importance of variable k to the given component. b_{pc,k} is the direction cosine (cos θ_k) between the component line and the x_k coordinate axis.

PCA - Summary
- A multivariate projection technique.
- Reduces the dimensionality of data by transforming correlated variables into a smaller number of uncorrelated components.
- Gives a graphical overview: plots the data in k-dimensional space along the directions of maximum variation.
- Best preserves the variance as measured in the high-dimensional input space.
- Projects the data onto lower-dimensional planes.
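The claim that PCA follows the directions of maximum variation can be illustrated numerically: since each eigenvalue is the variance along its component, the fraction of total variance captured by the leading components is a ratio of eigenvalues. A hedged sketch on hypothetical correlated data:

```python
import numpy as np

# Hypothetical correlated data: 3 variables driven mostly by one latent factor.
rng = np.random.default_rng(2)
factor = rng.normal(size=(200, 1))
X = factor @ np.array([[1.0, 0.8, 0.5]]) + 0.1 * rng.normal(size=(200, 3))

# Eigenvalues of the covariance matrix, largest first.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

# Fraction of total variance captured by each component.
explained = eigvals / eigvals.sum()

# Because the variables share one dominant factor, the first component
# captures almost all of the variance.
assert explained[0] > 0.9
```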

Biological Background

Reverse Transcriptase

Areas Being Studied with Microarrays
- Comparing the expression of a gene between two or more tissues.
- Checking whether a gene is expressed in a specific tissue.
- Finding differences in gene expression between normal and cancerous tissue.

cDNA Microarray Experiments
- Different tissues, same organism (brain vs. liver).
- Same tissue, different organisms.
- Same tissue, same organism (tumour vs. non-tumour).
- Time-course experiments.

Microarray Technology
A method for measuring the expression levels of thousands of genes simultaneously. There are two types of arrays:
- cDNA and long-oligonucleotide arrays.
- Short-oligonucleotide arrays, in which each gene is represented by several probes, each ~25 nucleotides long.

The Idea
Target: cDNA (the variables to be detected). Probe: oligos/cDNA (gene templates). Target and probe are combined and allowed to hybridize.

Brief Outline of Steps for Producing a Microarray
- Produce mRNA.
- Hybridise: complementary sequences will bind, and fluorescence shows binding.
- Scan the array (intensities are extracted with image-analysis software).

Hybridization
RNA is reverse-transcribed into cDNA with reverse transcriptase, and the cDNA is labeled. Fluorescent labeling is most common, but radioactive labeling is also used; labeling may be incorporated during hybridization or applied afterwards. The labeled samples are then hybridized to the microarrays.

Gene Expression Database - a Conceptual View
A gene expression matrix of expression levels (genes × samples), linked to gene annotations on one axis and sample annotations on the other.

The Article

The Biological Problem
The very high-dimensional space of gene expression measurements obtained with DNA microarrays impedes the detection of underlying patterns in gene expression data and the identification of discriminatory genes.

Why Use PCA?
To obtain a direct link between patterns in the genes and patterns in the samples.

The Paper Shows
- Distinct patterns are obtained when the genes are projected onto a two-dimensional plane.
- After the removal of irrelevant genes, the scores in the new space show distinct tissue patterns.

The Data Used in Experiment Oligonucleotide microarray measurements of 7070 genes made in 40 normal human tissue samples. The tissues they used were from brain, kidney, liver, lung, esophagus, skeletal muscle, breast, stomach, colon, blood, spleen, prostate, testes, vulva, proliferative endometrium, myometrium, placenta, cervix, and ovary.

Results: PCA Loadings Can Be Used to Filter Irrelevant Genes
The data from 40 human tissues were first projected using PCA. The first and second PCs account for ~70% of the information present in the entire data set.

Gene Selection Based on the Loadings on the Principal Components
Graph A shows the score plot of the tissue samples before any filtering is implemented (axes: scores on principal components 1 and 2).

Graph B shows the loading plot of the genes before any filtering is implemented (axes: loadings on principal components 1 and 2).

The Filter on Loadings
Graph E displays quantitatively the decisions behind the choice of the filtering threshold: the distortion in the observed patterns (measured as the squared difference) and the number of genes retained for analysis, as the threshold is varied.

The Filter on the Loadings - Continued
At the chosen filter threshold, filtering reduced the number of genes from 7070 to 425.
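The slides do not spell out the exact filtering rule, so the sketch below is one plausible reading, not the article's implementation: compute the PC1 and PC2 loadings of each gene and keep only genes whose absolute loading exceeds a threshold on either component. The data, dimensions, and threshold are all hypothetical:

```python
import numpy as np

# Hypothetical expression matrix: 40 samples x 500 genes (the article used 7070).
rng = np.random.default_rng(3)
n_samples, n_genes = 40, 500
X = rng.normal(size=(n_samples, n_genes))
Xc = X - X.mean(axis=0)

# Gene loadings on the first two PCs. With far more genes than samples,
# obtain them via the SVD of the centered data matrix (rows of Vt are the
# principal directions in gene space).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
loadings = Vt[:2].T                  # shape (n_genes, 2)

# Keep genes whose absolute loading on PC1 or PC2 exceeds the threshold.
threshold = 0.05                     # hypothetical cutoff
keep = np.abs(loadings).max(axis=1) > threshold
X_filtered = X[:, keep]
```

Varying `threshold` and re-plotting the scores of `X_filtered` would reproduce the kind of distortion-vs-genes-retained trade-off shown in Graph E.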

Graph C shows the score plot of the tissue samples after the filtering (axes: scores on principal components 1 and 2).

Graph D shows the loading plot of the genes after the filtering (axes: loadings on principal components 1 and 2).

Compare: the dramatic reduction from the initial 7070 genes to the 425 finally retained resulted in minimal loss of the information relevant to describing the samples in the reduced space.

Compare: three linear structures can be identified in the loading plot of the 425 genes selected by the above analysis, each comprising a distinct set of genes.

PCA - Discussion
- PCA has a strong, yet flexible, mathematical structure.
- PCA simplifies the "views" of the data and reduces the dimensionality of gene expression space.
- The correspondence between the score plot and the loading plot enables the elimination of redundant variables.
- PCA allowed the classification of new samples belonging to the tissue types used.

PCA - Discussion (Cont.)
In the article, this method facilitated the identification of strong underlying structures in the data. The identification of such structures depends entirely on the data and is not guaranteed in general. There is no single "correct" classification; biological understanding is the ultimate guide.

My Critique
Positives:
- Can deal with large data sets.
- No assumptions are made about the data; the method is general and may be applied to any data set.
Negatives:
- Nonlinear structure is invisible to PCA.
- The meaning of the original features is lost when linear combinations are formed.

True covariance matrices are usually not known; they are estimated from data. In the illustrated example, the first component is chosen along the direction of largest variance, so the two clusters strongly overlap when projected onto it; projecting onto the axis orthogonal to the first PCA component gives much more discriminating power.
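This criticism is easy to reproduce numerically. In the hypothetical two-variable example below, the classes are separated along a low-variance direction, so PC1 (which follows the largest variance) barely discriminates them, while the orthogonal axis separates them cleanly:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500

# Variable 1: large shared variance, no class information.
shared = rng.normal(scale=5.0, size=(n, 1))
# Variable 2: small offset between two classes, plus noise.
labels = np.repeat([0, 1], n // 2)
sep = np.where(labels == 0, -1.0, 1.0)[:, None]
X = np.hstack([shared, sep + rng.normal(scale=0.3, size=(n, 1))])

# PCA: PC1 is the largest-eigenvalue direction (here, the shared axis).
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores_pc1 = Xc @ eigvecs[:, -1]    # along the largest variance
scores_orth = Xc @ eigvecs[:, 0]    # orthogonal to PC1

def separation(scores):
    # Distance between class means, scaled by within-class spread.
    a, b = scores[labels == 0], scores[labels == 1]
    return abs(a.mean() - b.mean()) / np.sqrt(a.var() + b.var())

# The orthogonal axis discriminates the classes far better than PC1.
assert separation(scores_orth) > separation(scores_pc1)
```

This is exactly why supervised alternatives (e.g. linear discriminant analysis) are preferred when the goal is class discrimination rather than variance preservation.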

Thank you !!!