Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman modified by Hanne Jarmer.

The DNA Array Analysis Pipeline
Question → Experimental Design → Array design / Probe design → Buy Chip/Array → Sample Preparation → Hybridization → Image analysis → Normalization → Comparable Gene Expression Data → Expression Index Calculation → Statistical Analysis / Fit to Model (time series) → Advanced Data Analysis: Clustering, PCA, Classification, Promoter Analysis, Meta analysis, Survival analysis, Regulatory Network

Motivation: Multidimensional data
[Table: gene expression matrix with patients as columns (Pat1–Pat9) and Affymetrix probe sets as rows (…_at, …_s_at, …_x_at); the numeric expression values did not survive the transcript.]

Dimension reduction methods
Principal component analysis (PCA)
– Singular value decomposition (SVD)
Multidimensional scaling (MDS)
Correspondence analysis (CA)
Cluster analysis
– Can be thought of as a dimension reduction method, since clusters summarize the data

Principal Component Analysis (PCA)
Used for visualization of high-dimensional data
Projects high-dimensional data onto a small number of dimensions
– Typically 2-3 principal component dimensions
Often captures much of the total data variation in only a few dimensions
Exact solutions require a fully determined system (matrix with full rank)
– i.e. a "square" matrix with linearly independent rows/columns
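As an illustration (not taken from the slides), the projection onto the first two principal components can be sketched in plain NumPy via the SVD of the centered data matrix. The toy 6×4 matrix stands in for a samples-by-genes table:

```python
import numpy as np

# Toy data: 6 samples x 4 "genes" (random stand-in for an expression matrix)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))

Xc = X - X.mean(axis=0)                 # center each column (gene)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the first two principal components (PC1, PC2)
scores = Xc @ Vt[:2].T                  # shape (6, 2) -> points for an XY-plot

# Fraction of total variance captured by PC1 and PC2
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
print(scores.shape, round(float(explained), 2))
```

The `scores` array is exactly what the "PCA projections (as XY-plot)" slides show: one 2-D point per sample.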

PCA

Singular Value Decomposition

Principal components
1st principal component (PC1): the direction along which there is greatest variation
2nd principal component (PC2): the direction with maximum remaining variation, orthogonal to PC1

PCA: Variance by dimension

PCA dimensions by experiment

PCA projections (as XY-plot)

PCA: Leukemia data, precursor B and T Plot of 34 patients, dimension of 8973 genes reduced to 2

PCA of genes (Leukemia data) Plot of 8973 genes, dimension of 34 patients reduced to 2

Why do we cluster? Organize observed data into meaningful structures Summarize large data sets Used when we have no a priori hypotheses Optimization: –Minimize within cluster distances –Maximize between cluster distances

Many types of clustering methods
Hierarchical, e.g. UPGMA
– Agglomerative (bottom-up)
– Divisive (top-down)
Partitioning
– K-means
– PAM
– SOM

Hierarchical clustering
Representation of all pair-wise distances
Parameters: none (apart from the choice of distance measure)
Results:
– One large cluster
– Hierarchical tree (dendrogram)
Deterministic

Hierarchical clustering – UPGMA Algorithm
– Assign each item to its own cluster
– Join the two nearest clusters
– Re-estimate the distances between clusters
– Repeat until only one cluster remains
UPGMA: Unweighted Pair Group Method with Arithmetic mean
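The steps above can be sketched with SciPy (assuming it is available), where `method='average'` is exactly UPGMA. Cutting the resulting dendrogram into two groups recovers the two toy clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups of 2-D points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

Z = linkage(X, method='average')   # 'average' linkage == UPGMA
# Cut the dendrogram so that at most 2 clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```

`Z` encodes the full merge order and distances, i.e. the dendrogram shown on the following slides.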

Hierarchical clustering

Data with clustering order and distances Dendrogram representation

Leukemia data - clustering of patients

Leukemia data - clustering of patients on top 100 significant genes

Leukemia data - clustering of genes

K-means clustering
Input: N objects given as data points in R^p
Specify the number k of clusters
Initialize k cluster centers; iterate until convergence:
– Assign each object to the cluster with the closest center (Euclidean distance)
– Take the centroids of the resulting clusters as the new cluster centers
K-means can be seen as an optimization problem: minimize the sum of squared within-cluster distances
The result is dependent on the initialization
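A minimal NumPy sketch of the Lloyd iteration described above (assignment step, then centroid update, until the centers stop moving):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; result depends on the random initialization."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the nearest center (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the centroid of its cluster
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):   # converged
            break
        centers = new
    return labels, centers

X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
labels, centers = kmeans(X, k=2)
print(labels)
```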

K-means - Algorithm

K-means clustering, k=3

K-means clustering of Leukemia data

K-means clustering of Cell Cycle data

Partitioning Around Medoids (PAM) PAM is a partitioning method like K-means For a prespecified number of clusters k, the PAM procedure is based on the search for k representative objects, or medoids M = (m1,...,mk) The medoids minimize the sum of the distances of the observations to their closest medoid After finding a set of k medoids, k clusters are constructed by assigning each observation to the nearest medoid PAM can be applied to general data types and tends to be more robust than k-means
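A naive sketch of the medoid search (an assumption about the details, not the canonical PAM implementation): greedily swap a medoid for a non-medoid whenever that lowers the total distance of all points to their closest medoid.

```python
import numpy as np
from itertools import product

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(len(X), size=k, replace=False))

    def cost(meds):
        # Sum of distances of all observations to their closest medoid
        return D[:, meds].min(axis=1).sum()

    improved = True
    while improved:                      # keep swapping while the cost drops
        improved = False
        for i, o in product(range(k), range(len(X))):
            if o in medoids:
                continue
            trial = medoids.copy()
            trial[i] = o                 # try swapping medoid i for object o
            if cost(trial) < cost(medoids):
                medoids, improved = trial, True
    labels = D[:, medoids].argmin(axis=1)  # assign points to nearest medoid
    return labels, medoids

X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
labels, medoids = pam(X, k=2)
```

Because only the precomputed distance matrix `D` is used, the same sketch works for any data type for which a distance can be defined, which is the robustness point made above.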

Self Organizing Maps (SOM) Partitioning method (similar to the K-means method) Clusters are organized in a two-dimensional grid SOM algorithm finds the optimal organization of data in the grid Iteration steps:
– Pick data point P at random
– Move all nodes in the direction of P; the node closest to P moves the most
– Decrease the amount of movement over time
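The update loop can be sketched as follows (assumed details not fixed by the slides: a 1-D grid of nodes, a Gaussian neighbourhood, and geometric decay of the learning rate):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two tight clusters in 2-D
data = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
                  rng.normal(3.0, 0.1, (20, 2))])

n_nodes = 4
nodes = rng.normal(1.5, 0.5, (n_nodes, 2))   # node weight vectors
qe_before = np.linalg.norm(data[:, None] - nodes[None], axis=2).min(axis=1).mean()

alpha, sigma = 0.5, 1.0
for t in range(200):
    p = data[rng.integers(len(data))]                    # pick data point P at random
    best = np.linalg.norm(nodes - p, axis=1).argmin()    # node closest to P
    # Neighbourhood weights: nodes near the winner on the grid move most
    h = np.exp(-((np.arange(n_nodes) - best) ** 2) / (2 * sigma ** 2))
    nodes += alpha * h[:, None] * (p - nodes)            # move all nodes toward P
    alpha *= 0.99                                        # decrease movement over time
    sigma *= 0.99

qe_after = np.linalg.norm(data[:, None] - nodes[None], axis=2).min(axis=1).mean()
```

After training, the average distance from each data point to its nearest node (the quantization error) should be lower than at initialization.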

SOM - example

Comparison of clustering methods
Hierarchical
– Advantage: fast to compute
– Disadvantage: rigid
Partitioning
– Advantage: provides clusters that roughly satisfy an optimality criterion
– Disadvantage: needs the number of clusters k up front, and is time-consuming

Distance measures
Euclidean distance
Vector angle distance
Pearson's distance
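The three measures can be computed directly for two expression profiles. Note that for y = 2x the angle-based and correlation-based distances are zero while the Euclidean distance is not, which is why the choice of measure matters:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])   # same "shape" as x, doubled scale

euclidean = np.linalg.norm(x - y)

# Vector angle (cosine) distance: 1 - cos(angle between x and y)
cosine = 1 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Pearson's distance: 1 - correlation (insensitive to scale and offset)
pearson = 1 - np.corrcoef(x, y)[0, 1]

print(round(float(euclidean), 3), round(float(cosine), 3), round(float(pearson), 3))
```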

Comparison of distance measures

Summary Dimension reduction important to visualize data Methods: –PCA/SVD –Clustering Hierarchical K-means/PAM SOM (distance measure important)

Coffee break
Next: exercises in dimension reduction and clustering