Dimension reduction: PCA and Clustering. Slides by Agnieszka Juncker and Chris Workman.



The DNA Array Analysis Pipeline
[Figure: pipeline flowchart. Boxes: Question; Experimental Design; Array design; Probe design; Buy Chip/Array; Sample Preparation; Hybridization; Image analysis; Normalization; Expression Index Calculation; Comparable Gene Expression Data; Statistical Analysis; Fit to Model (time series); Advanced Data Analysis (Clustering, PCA, Classification, Promoter Analysis, Meta analysis, Survival analysis, Regulatory Network)]

Motivation: Multidimensional data
[Table: expression values for probe sets (rows, with IDs ending in _at, _s_at or _x_at) across patients (columns Pat1 through Pat8 and at least one further patient column); the values were not preserved in the transcript]

Dimension reduction methods
Principal component analysis (PCA)
  – Singular value decomposition (SVD)
Multidimensional scaling
Correspondence analysis
Cluster analysis
  – Can be thought of as a dimensionality reduction method, as clusters summarize data

Fundamental methods
Multidimensional scaling (MDS)
  – Rearranges objects so as to arrive at a configuration that best approximates the observed distances
Factor analysis (PCA, SVD)
  – New vector space defined by variability in the data
Independent component analysis (ICA)
In factor analysis, the similarities between objects are expressed in the correlation matrix. With MDS one may analyze any kind of similarity or dissimilarity matrix, in addition to correlation matrices.

Principal Component Analysis (PCA)
Used for visualization of high-dimensional data
Projects high-dimensional data into a small number of dimensions
  – Typically 2-3 principal component dimensions
Often captures much of the total data variation in only a few dimensions
Exact solutions require a fully determined system (matrix with full rank)
  – i.e. a “square” matrix with linearly independent rows and columns

PCA

Singular Value Decomposition

Principal components
1st principal component (PC1)
  – Direction along which there is greatest variation
2nd principal component (PC2)
  – Direction with maximum variation left in data, orthogonal to PC1
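The two definitions above can be checked numerically. The following is a minimal sketch, assuming PCA is computed from the SVD of the mean-centred data; the function name `pca_svd` and the toy data are illustrative assumptions, not part of the slides.

```python
import numpy as np

def pca_svd(X, n_components=2):
    """PCA of X (rows = samples, columns = variables) via SVD."""
    Xc = X - X.mean(axis=0)                        # centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                 # rows are the PC directions
    variances = s[:n_components] ** 2 / (X.shape[0] - 1)
    scores = Xc @ components.T                     # data expressed in PC coordinates
    return components, variances, scores

rng = np.random.default_rng(0)
# Toy data (assumption): 200 points stretched along the direction (1, 1)
t = rng.normal(size=200)
X = np.column_stack([t, t]) * 3.0 + rng.normal(scale=0.3, size=(200, 2))

components, variances, scores = pca_svd(X)
pc1, pc2 = components
# pc1 lies (up to sign) close to (1, 1) / sqrt(2), the direction of greatest variation;
# pc2 is orthogonal to pc1 and carries the remaining variation.
```

The sign of each component is arbitrary, so comparisons with an expected direction should use the absolute value of the dot product.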

PCA: Eigenvalues (variance by dimension)

PCA: Eigenvectors
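The slide's plot is not preserved in the transcript. As a hedged illustration of "variance by dimension", the sketch below takes the eigenvalues of the sample covariance matrix and converts them to fractions of total variance. The toy data and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data (assumption): 50 samples, 4 variables with different spreads
X = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

cov = np.cov(X, rowvar=False)                 # 4 x 4 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()           # fraction of total variance per component
cumulative = np.cumsum(explained)             # variance captured by the first k components
```

Each column of `eigvecs` is the eigenvector (principal direction) belonging to the eigenvalue at the same position in `eigvals`.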

PCA projections (as XY-plot)

PCA: Leukemia data, precursor B and T
Plot of 34 patients; dimension of 8973 genes reduced to 2

PCA of genes (Leukemia data)
Plot of 8973 genes; dimension of 34 patients reduced to 2
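The two leukemia plots differ only in which axis of the expression matrix is treated as the samples. A minimal sketch, assuming a matrix with patients as rows and genes as columns; the random placeholder data only mimics the shape quoted on the slides and is not the real data set, and `project_2d` is a hypothetical helper.

```python
import numpy as np

def project_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :2] * s[:2]            # same values as Xc @ Vt[:2].T

rng = np.random.default_rng(2)
# Placeholder data (assumption): 34 patients (rows) x 8973 genes (columns)
expr = rng.normal(size=(34, 8973))

patients_2d = project_2d(expr)         # one point per patient, 8973 dimensions reduced to 2
genes_2d = project_2d(expr.T)          # one point per gene, 34 dimensions reduced to 2
```

Plotting the two columns of either result against each other gives the kind of XY-plot shown on the slides.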

Why do we cluster?
Organize observed data into meaningful structures
Summarize large data sets
Used when we have no a priori hypotheses
Optimization:
  – Minimize within-cluster distances
  – Maximize between-cluster distances
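One common way to quantify the optimization goal above is with sums of squared Euclidean distances. This is an assumption; the slides do not fix a particular criterion. The sketch compares a sensible and a poor grouping of four points; `within_between` is a hypothetical helper.

```python
import numpy as np

def within_between(X, labels):
    """Return (within-cluster, between-cluster) sums of squared distances."""
    overall = X.mean(axis=0)
    within = 0.0
    between = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centre = members.mean(axis=0)
        within += ((members - centre) ** 2).sum()              # spread inside cluster k
        between += len(members) * ((centre - overall) ** 2).sum()  # separation of cluster k
    return within, between

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = np.array([0, 0, 1, 1])    # groups the two left points and the two right points
bad = np.array([0, 1, 0, 1])     # mixes left and right points

good_within, good_between = within_between(X, good)   # 1.0 and 100.0
bad_within, bad_between = within_between(X, bad)      # 100.0 and 1.0
```

Within and between always add up to the same total for a given data set, so minimizing one is equivalent to maximizing the other under this criterion.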

Many types of clustering methods
Method:
  – K-class
  – Hierarchical, e.g. UPGMA
      Agglomerative (bottom-up)
      Divisive (top-down)
  – Graph theoretic
Information used:
  – Supervised vs unsupervised
Final description of the items:
  – Partitioning vs non-partitioning
  – Fuzzy, multi-class

Hierarchical clustering
Representation of all pair-wise distances
Parameters: none (apart from the distance measure)
Results:
  – One large cluster
  – Hierarchical tree (dendrogram)
Deterministic

Hierarchical clustering – UPGMA Algorithm
1. Assign each item to its own cluster
2. Join the nearest clusters
3. Re-estimate the distance between clusters
4. Repeat steps 2-3 until all items are in one cluster (n - 1 joins for n items)
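The steps above can be sketched in a few lines of plain Python. This is a deliberately naive illustration, assuming a precomputed symmetric distance matrix; the function name `upgma` and the example matrix are made up for the sketch. UPGMA defines the distance between two clusters as the mean of all item-to-item distances between them.

```python
from itertools import combinations

def upgma(D):
    """Naive UPGMA on a symmetric distance matrix D (list of lists).
    Returns the joins as (members_a, members_b, distance), in joining order."""
    clusters = [(i,) for i in range(len(D))]       # step 1: one cluster per item
    merges = []

    def avg_dist(a, b):
        # UPGMA cluster distance: mean of all item-to-item distances
        return sum(D[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # step 2: find and join the two nearest clusters
        a, b = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        merges.append((a, b, avg_dist(a, b)))
        clusters.remove(a)
        clusters.remove(b)
        # step 3: distances to the new cluster are re-estimated by avg_dist
        clusters.append(tuple(sorted(a + b)))
    return merges

D = [[0, 2, 6, 10],
     [2, 0, 6, 10],
     [6, 6, 0, 10],
     [10, 10, 10, 0]]
merges = upgma(D)
# Items 0 and 1 join at distance 2.0, item 2 joins them at 6.0, item 3 joins last at 10.0
```

The list of joins and their distances is exactly the information a dendrogram displays. Recomputing averages from scratch keeps the code short but is slow for large inputs.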

Hierarchical clustering

Data with clustering order and distances
Dendrogram representation

Leukemia data - clustering of patients

Leukemia data - clustering of patients on top 100 significant genes

Leukemia data - clustering of genes

K-means clustering
Partition data into K clusters
Parameter: number of clusters (K) must be chosen
Randomized initialization:
  – Different clusters each time
  – Non-deterministic

K-means - Algorithm
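The algorithm figure is not preserved in the transcript. The following is a minimal sketch of the standard K-means iteration (Lloyd's algorithm), which may differ in detail from what the slide showed. The function names, the initialization from randomly chosen data points, and the toy data are assumptions.

```python
import numpy as np

def assign(X, centres):
    """Label each row of X with the index of its nearest centre."""
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    # Randomized initialization: k distinct data points become the starting centres
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(X, centres)                      # assignment step
        new_centres = np.array([
            # update step: move each centre to the mean of its members
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):            # converged
            break
        centres = new_centres
    return assign(X, centres), centres

rng = np.random.default_rng(3)
# Toy data (assumption): three groups of 30 points in two dimensions
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])])
labels, centres = kmeans(X, k=3, seed=0)
```

Calling `kmeans` with `seed=None` draws a fresh initialization each time, which is why repeated runs can give different clusters. Fixing the seed makes a run reproducible.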

K-means clustering, K=3

K-means clustering of Leukemia data

K-means clustering of Cell Cycle data

Self Organizing Maps (SOM)
Partitioning method (similar to the K-means method)
Clusters are organized in a two-dimensional grid
Size of grid is specified
  – (e.g. 2x2 or 3x3)
SOM algorithm finds the optimal organization of data in the grid
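The slides do not spell out the SOM algorithm. The following is a minimal sketch of a common online (Kohonen-style) training loop, given as an assumption rather than the method used for the slides. The function names, the linear decay schedules, and the toy data are all illustrative choices.

```python
import numpy as np

def train_som(X, rows=2, cols=2, n_iter=500, lr0=0.5, sigma0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Grid coordinates of each node, shape (rows * cols, 2)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialize node weights from randomly chosen data points
    weights = X[rng.choice(len(X), size=rows * cols, replace=True)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 1e-3       # neighbourhood shrinks over time
        x = X[rng.integers(len(X))]                # pick one random sample
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))   # best-matching node
        grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)     # grid distance to that node
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))           # Gaussian neighbourhood
        weights += lr * h[:, None] * (x - weights)          # pull nodes towards the sample
    return weights.reshape(rows, cols, -1)

def map_to_grid(X, weights):
    """Return the (row, col) grid cell of the nearest node for each sample."""
    flat = weights.reshape(-1, weights.shape[-1])
    d2 = ((X[:, None, :] - flat[None, :, :]) ** 2).sum(axis=2)
    return np.column_stack(np.unravel_index(d2.argmin(axis=1), weights.shape[:2]))

rng = np.random.default_rng(4)
# Toy data (assumption): 60 two-dimensional samples
X = rng.normal(size=(60, 2))
weights = train_som(X, rows=2, cols=2)
cells = map_to_grid(X, weights)
```

Unlike K-means, neighbouring grid nodes are updated together, so nearby cells tend to end up holding similar samples.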

SOM - example

Comparison of clustering methods
Hierarchical clustering
  – Distances between all variables
  – Time consuming with a large number of genes
  – Advantage to cluster on selected genes
K-means clustering
  – Faster algorithm
  – Only shows relations between all variables
SOM
  – Machine learning algorithm

Distance measures
Euclidean distance
Vector angle distance
Pearson's distance
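The slides name the three measures without giving formulas. The sketch below uses common conventions, stated here as assumptions: vector angle distance as one minus the cosine of the angle, and Pearson's distance as one minus the Pearson correlation. The function names and example vectors are illustrative.

```python
import numpy as np

def euclidean_distance(x, y):
    return float(np.sqrt(np.sum((x - y) ** 2)))

def vector_angle_distance(x, y):
    # 1 - cos(angle between x and y); ignores vector length
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return float(1.0 - cos)

def pearson_distance(x, y):
    # 1 - Pearson correlation: the angle distance after centring each vector
    return vector_angle_distance(x - x.mean(), y - y.mean())

a = np.array([1.0, 2.0, 3.0, 4.0])
b = 10.0 * a         # same direction as a, ten times larger
c = a + 5.0          # same shape as a, shifted upwards

# Euclidean distance sees a and b as far apart; vector angle distance sees them as identical.
# Vector angle distance sees a small difference between a and c; Pearson's distance sees none.
```

In expression data this matters: two genes with the same profile shape but different magnitude or baseline are far apart by Euclidean distance and close by Pearson's distance.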

Comparison of distance measures

Summary
Dimension reduction is important for visualizing data
Methods:
  – Principal Component Analysis
  – Clustering
      Hierarchical
      K-means
      Self organizing maps
      (distance measure important)