
1 Clustering and MDS: Exploratory Data Analysis

2 Outline
- What may be hoped for by clustering
- Representing differences as distances
- Choosing a clustering method
- Hierarchical clustering: choosing linkage
- Multi-dimensional scaling

3 Legitimate hopes for clustering
- To uncover unsuspected structure in data: sample types or technical artifacts
- To find related genes
- Not a method of classification
- The first big microarray studies used clustering to identify genes transcribed at similar stages in the cell cycle
- This does not mean that clustering is the 'proper' way to analyze microarray data

4 Clustering Issues
- Which scale? True scale, log scale, variance-stabilizing transforms
- Which metric (distance)? Euclidean, Manhattan, correlation, mutual information
- Algorithms: k-means; hierarchical (neighbor-joining, UPGMA, ...)
- Reliability: bootstrapping (see the sketch below)
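
The slides list bootstrapping as the reliability check but do not spell out a procedure. The following is a minimal sketch of one common approach, assumed here rather than taken from the slides: resample genes with replacement, recluster the samples with SciPy's UPGMA linkage, and count how often each pair of samples ends up in the same cluster.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10, 500))            # toy data: 10 samples x 500 genes

    n_boot, k = 100, 2
    co_cluster = np.zeros((X.shape[0], X.shape[0]))

    for _ in range(n_boot):
        genes = rng.integers(0, X.shape[1], X.shape[1])   # resample genes with replacement
        Z = linkage(X[:, genes], method="average")        # UPGMA on the resampled matrix
        labels = fcluster(Z, t=k, criterion="maxclust")   # cut the tree into k clusters
        co_cluster += labels[:, None] == labels[None, :]

    co_cluster /= n_boot   # fraction of bootstrap trees in which each pair co-clusters
    print(np.round(co_cluster, 2))

Pairs of samples that co-cluster in nearly every bootstrap replicate are the groupings worth trusting.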

5 Scales
- Logarithmic scale emphasizes fold-change
- Noise at the low end: don't want to emphasize differences due to noise
- Select genes according to a measure of quality
- Variance-stabilizing transforms make variation (roughly) equal (see the sketch below)
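
As a concrete illustration of the scale choices above, here is a minimal sketch contrasting a log2 rescaling with an arcsinh ("generalized log") rescaling; the specific transform and the toy intensity values are assumptions, not taken from the slides.

    import numpy as np

    intensities = np.array([50.0, 200.0, 1600.0, 12800.0])   # hypothetical raw intensities

    log_scale = np.log2(intensities)          # emphasizes fold-change at every intensity
    glog = np.arcsinh(intensities / 100.0)    # roughly logarithmic for large values,
                                              # roughly linear near zero, so low-end
                                              # noise is not amplified
    print(log_scale)
    print(glog)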

6 Common Metrics
Distance-like measures of difference:
- Euclidean: 'geometric' distance; emphasizes large differences
- Manhattan: sum of absolute differences; emphasizes consistent differences
Correlation-like measures:
- Correlation coefficient
- Mutual information
  - Entropy: H = -Σ p(x) log2 p(x)
  - MI(g1, g2) = H(g1) + H(g2) - H(g1, g2)
  - Robust: less affected by outliers
  - Tedious to program: requires adaptive binning
A small sketch of these measures follows.
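
A minimal sketch of the four measures above on two toy expression profiles. SciPy's distance functions are used for convenience, and the mutual-information estimate uses simple equal-width binning rather than the adaptive binning the slide recommends; those choices are assumptions.

    import numpy as np
    from scipy.spatial.distance import euclidean, cityblock
    from scipy.stats import pearsonr

    rng = np.random.default_rng(1)
    g1 = rng.normal(size=200)
    g2 = 0.7 * g1 + 0.3 * rng.normal(size=200)    # two correlated toy profiles

    print(euclidean(g1, g2))        # emphasizes large differences
    print(cityblock(g1, g2))        # Manhattan: sum of absolute differences
    print(1 - pearsonr(g1, g2)[0])  # correlation coefficient turned into a distance

    def mutual_information(x, y, bins=10):
        """MI(x, y) = H(x) + H(y) - H(x, y), estimated from a 2-D histogram."""
        pxy, _, _ = np.histogram2d(x, y, bins=bins)
        pxy = pxy / pxy.sum()
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)
        h = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
        return h(px) + h(py) - h(pxy)

    print(mutual_information(g1, g2))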

7 Different Metrics - Same Scale
- 8 tumor and 2 normal tissue samples
- Distances are similar in each tree
- Normal samples stay close together
- Tree topologies appear different
- Take with a grain of salt! (a small comparison sketch follows)
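
The tumor/normal trees themselves are not reproduced here, but the following sketch runs the same comparison on simulated data standing in for the 8 tumor and 2 normal samples: identical samples clustered with Euclidean and with correlation distance, printing the leaf order of each tree.

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, dendrogram

    rng = np.random.default_rng(2)
    tumor_sig, normal_sig = rng.normal(size=100), rng.normal(size=100)
    tumor = tumor_sig + 0.5 * rng.normal(size=(8, 100))      # 8 tumor-like samples
    normal = normal_sig + 0.5 * rng.normal(size=(2, 100))    # 2 normal-like samples
    X = np.vstack([tumor, normal])
    names = [f"T{i}" for i in range(1, 9)] + ["N1", "N2"]

    for metric in ("euclidean", "correlation"):
        Z = linkage(pdist(X, metric=metric), method="average")
        tree = dendrogram(Z, labels=names, no_plot=True)
        print(metric, tree["ivl"])    # leaf order; the topology can look different
                                      # between metrics even when the distances
                                      # tell a broadly similar story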

8 Algorithms
- Hierarchical: simple and familiar in concept
- k-means: assumes you know how many groups there should be; often start with hierarchical clustering, then try several values of k; forces outliers into groups
- SOM (self-organizing map): a machine-learning approach
A sketch of the first two follows.
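
A minimal sketch of the two most commonly used options from this list: agglomerative hierarchical clustering (SciPy) and k-means (scikit-learn) on toy data with two obvious groups. The library choices and toy data are assumptions.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 1, (20, 50)),    # toy data: two groups of 20 samples
                   rng.normal(3, 1, (20, 50))])

    # Hierarchical: build the full tree, then cut it into k groups afterwards.
    Z = linkage(X, method="average", metric="euclidean")
    hier_labels = fcluster(Z, t=2, criterion="maxclust")

    # k-means: k must be chosen up front, and every sample (outliers included)
    # is forced into one of the k groups.
    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    print(hier_labels)
    print(km_labels)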

9 Clustered Image Map (Heat Map)
- Cluster both rows and columns
- Represent levels by colors (sketch below)
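
A minimal sketch of a clustered image map using seaborn's clustermap, which clusters rows and columns and colors the reordered matrix in one call; seaborn and the toy matrix are assumptions, not the tool used in the slides.

    import numpy as np
    import seaborn as sns

    rng = np.random.default_rng(4)
    expression = rng.normal(size=(50, 10))    # toy data: 50 genes x 10 samples

    # Rows (genes) and columns (samples) are each clustered hierarchically,
    # reordered, and the values are drawn as colors.
    g = sns.clustermap(expression, method="average", metric="euclidean", cmap="RdBu_r")
    g.savefig("heatmap.png")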

10 Multivariate methods
- Principal Components Analysis (PCA)
  - Aim: identify combinations of features that usefully characterize samples
  - Not very robust to outliers
- Multi-dimensional scaling (MDS)
  - Represents distances between samples as two- or three-dimensional distances
  - Easy to visualize
A PCA sketch follows; MDS is sketched after the next slide.
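
A minimal PCA sketch on a samples-by-genes matrix (scikit-learn; toy data, both assumptions), showing the two-dimensional coordinates and how much variation each component captures.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(5)
    X = rng.normal(size=(10, 500))      # toy data: 10 samples x 500 genes

    pca = PCA(n_components=2)
    pcs = pca.fit_transform(X)          # each component is a linear combination of genes
    print(pcs)                          # 2-D coordinates of the 10 samples
    print(pca.explained_variance_ratio_)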

11 What is Multi-dimensional scaling?
- Represents 'metric' distances as physical distances on the page or in 3-D
- Higher-dimensional distances cannot be represented exactly
- Start from the first two principal components
- An iterative procedure then adjusts the lengths
- 'Strain' factor: less than 20% is good
- Good for small sample sets
A sketch follows.
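
A minimal MDS sketch with scikit-learn: compute the pairwise distances, let the iterative fit find a 2-D layout, and inspect the residual mismatch. scikit-learn reports this residual as "stress" (stress_); treating it as the counterpart of the slide's 'strain' figure is an assumption, as is the toy data.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.manifold import MDS

    rng = np.random.default_rng(6)
    X = rng.normal(size=(10, 500))                    # toy data: 10 samples x 500 genes
    D = squareform(pdist(X, metric="euclidean"))      # full pairwise-distance matrix

    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit_transform(D)                     # 2-D positions, ready to plot

    print(coords)
    print(mds.stress_)                                # residual mismatch after the iterative fit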

12 Representing Groups [figure: cluster diagram and multi-dimensional scaling of the Day 1 chips]


