More Analysis of Gene Expression Data Brent D. Foy, Ph.D. Wright State University.

Overview
Types of Data Sets
Data Analysis
–Clustering: Hierarchical, Self-Organizing Maps, Principal Components Analysis
–Statistical Hypothesis Testing (ANOVA)

Types of Data – 1D, 2 Conditions
Many genes
2 conditions
A few replicates per condition
Gene  Condition 1          Condition 2
      Rep 1  Rep 2  Rep 3  Rep 1  Rep 2  Rep 3
A     150    160    150    180    190    180
B     50     40     45     50     45     40
C     800    760    680    400    450    425
…

Types of Data – 1D, 2 Conditions (cont)
Conditions can be control vs treated, different cell types, different time points, etc.
Typical Question – Which genes’ expression levels change due to condition?
–T-test, Mann-Whitney, Comparison Analysis
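As a rough illustration of the t-test option, the sketch below computes a t statistic for one gene at a time from the replicate values in the 2-condition table. The slide does not say which form of the t-test is meant; Welch's unequal-variance form is an assumption here, and the function name `welch_t` is only illustrative.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = variance(a) / len(a)  # squared standard error of the mean, condition 1
    vb = variance(b) / len(b)  # squared standard error of the mean, condition 2
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Replicates for genes B and C from the 2-condition table
t_c, df_c = welch_t([800, 760, 680], [400, 450, 425])
t_b, df_b = welch_t([50, 40, 45], [50, 45, 40])
print(f"gene C: t = {t_c:.2f}, df = {df_c:.2f}")
print(f"gene B: t = {t_b:.2f}, df = {df_b:.2f}")
```

Gene C, whose replicates differ strongly between conditions, gives a large t, while gene B, whose two condition means are equal, gives t = 0. Turning t into a p-value requires the t distribution, which is left out of this sketch.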

Types of Data – 1D, Multiple Conditions
Many genes
Multiple conditions
A few replicates per condition

Types of Data – 1D, Multiple Conditions (cont)
Gene  Condition 1          Condition 2          Condition 3          …
      Rep 1  Rep 2  Rep 3  Rep 1  Rep 2  Rep 3  Rep 1  Rep 2  Rep 3
A     150    160    150    180    190    180    150    155    135
B     50     40     45     50     45     40     80     90     105
C     800    760    680    400    450    425    200    220    400
…

Types of Data – 1D, Multiple Conditions (cont)
Again, conditions can be treatments or chemicals, cell types, time points, etc.
Typical question – Which genes’ expression levels change due to one or more conditions?
–1-way ANOVA, Kruskal-Wallis

Types of Data – 1D, Multiple Conditions (cont)
Typical question – Which genes’ expression levels behave similarly for all the conditions?
–Self-Organizing Maps, Hierarchical Clustering, Principal Components Analysis
Typical question – Which conditions show similar expression levels among genes? (Toxicogenomic Fingerprint)
–Hierarchical Clustering, Principal Components Analysis, (Self-Organizing Maps)

Types of Data – 2D, Multiple x Multiple Conditions
Many genes
2 Factors, multiple conditions per factor
–For example, Factor 1 could be dose of a chemical, and Factor 2 could be time point after dosing
Multiple replicates per condition

Types of Data – 2D, Multiple x Multiple Conditions (cont)
Gene  Dose 1                      Dose 2                      …
      Time 1        Time 2        Time 1        Time 2        …
      Rep 1  Rep 2  Rep 1  Rep 2  Rep 1  Rep 2  Rep 1  Rep 2
A     150    160    150    180    190    180    150    155
B     50     40     45     50     45     40     80     90
C     800    760    680    400    450    425    200    220
…

Types of Data – 2D, Multiple x Multiple Conditions (cont)
Typical Question – Which genes’ expression levels change due to time? Due to dose? Due to an interaction between the two?
–2-way ANOVA
Or, eliminate one of the dimensions and ask the same questions as before
–At time 1, which doses show similar expression levels among genes?

Typical Applications of Clustering Algorithms
Many samples/cell lines/chemicals, many genes. The number of axes can be very large here.
[Figure: chem1 to chem6 plotted against Gene A and Gene B axes]
Many samples/cell lines/chemicals, principal components of genes.
[Figure: chem1 to chem6 plotted against Principal component 1 and Principal component 2 axes]

Typical Applications of Clustering Algorithms
Many genes, multiple time points. (Different letters represent different genes.) The number of dimensions (time points) can be greater than 2.
[Figure: genes A to F plotted against T1 and T2 axes]
Many genes, multiple doses.
[Figure: genes A to F plotted against Dose 1 and Dose 2 axes]
Reasons to cluster genes of similar behavior together?

Hierarchical Clustering
Focus on 1D, multiple conditions type of data
Here, group cell types according to similar gene response

Hierarchical Clustering (cont)
Construct pairwise groupings of data elements based on similarity.
Definition of similarity is typically the separation of data elements in n-dimensional space.
[Figure: dendrogram with leaves Chem 2, Chem 3, Chem 1, Chem 6, Chem 4, Chem 5; axes labeled Generation (0 to 3) and # clusters (6, 3, 2, 1)]
[Figure: chem1 to chem6 plotted against Gene A and Gene B axes]

Hierarchical clustering - chooses pairwise groupings based on distances between pairs of points
Once the two closest points are found, the two are grouped together, and a new point is placed at the average location of the old 2 points.
[Figure: genes A to F as points being grouped pairwise]
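The pairwise grouping described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular package: it assumes Euclidean distance, merges one pair per step, and uses made-up (Gene A, Gene B) coordinates for six chemicals.

```python
from itertools import combinations
from math import dist

def agglomerate(points):
    """Repeatedly merge the two closest clusters and return the merge history.

    points: dict mapping a label to a coordinate tuple.
    Each merged cluster is represented by the plain average of the two
    points it replaces, as the slide describes.
    """
    clusters = dict(points)
    merges = []
    while len(clusters) > 1:
        # Find the closest pair among the current cluster locations
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: dist(clusters[pair[0]], clusters[pair[1]]))
        pa, pb = clusters.pop(a), clusters.pop(b)
        # Place a new point at the average location of the old two
        clusters[f"({a},{b})"] = tuple((x + y) / 2 for x, y in zip(pa, pb))
        merges.append((a, b, dist(pa, pb)))
    return merges

# Hypothetical coordinates, invented for illustration only
chems = {"chem1": (1.0, 1.0), "chem2": (1.5, 1.0), "chem3": (5.0, 5.0),
         "chem4": (5.0, 5.6), "chem5": (3.0, 1.0), "chem6": (1.0, 5.0)}
history = agglomerate(chems)
for a, b, d in history:
    print(f"merge {a} + {b} at distance {d:.2f}")
```

The merge history is the information a dendrogram draws: each entry is one join, and the distance gives the height of the branch. Six starting points always produce five merges.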

Hierarchical clustering
Advantages
–Computationally efficient
–Produces tree-like structure
Disadvantage
–Clusters are not optimal. Once branches split, it’s permanent. There is no way to reevaluate whether it was the best division based on the whole data set.

Principal Component Analysis
- Each data point is a single condition
- Each axis is a linear combination of hundreds or thousands of gene expression levels

Principal Component Analysis
Reduces the dimensionality of the data set
–Thousands of genes are combined in a few linear combinations to make 2 or 3 Principal Components (PCs). This goes from thousands of axes, with each axis representing the expression level for a gene, to 2 or 3 axes.
These few PCs may capture most of the variability of the original data set
The hope is that the first few PCs extract or expose the cluster structure of the original data set
–i.e., another clustering algorithm is still needed after PCA
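To make "a linear combination of gene expression levels" concrete, here is a toy sketch that finds the first principal component by power iteration on the covariance matrix. It is only an illustration under stated assumptions: the data values are invented, rows are conditions and columns are genes, and the function name is hypothetical. Real data sets with thousands of genes call for a numerical library rather than this pure-Python loop.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Return (unit direction, scores) for PC1 using power iteration.

    rows: list of samples (conditions), each a list of per-gene values.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d) of the mean-centered data
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary starting direction; assumes it is not orthogonal to PC1
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Each condition's coordinate along PC1
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Invented data: 4 conditions x 3 genes; gene 2 is always twice gene 1
data = [[1, 2, 0], [2, 4, 0], [3, 6, 0], [4, 8, 0]]
direction, scores = first_principal_component(data)
print(direction)
print(scores)
```

The returned direction holds the weight of each gene in the linear combination, and the scores place each condition on the single new axis. In this example all of the variation lies along one line, so PC1 captures all of it.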

Principal Component Analysis – A Simple Example
[Figure: simple example with the PC1 axis marked]

Self Organizing Maps
Partition data into a specified number of groupings.
Iterative procedure, so it seeks to produce optimal clusters.
K-means clustering is a specific form of the self-organizing map

Self Organizing Maps - General Procedure
Consider n data points in d-dimensional space. In the hypothetical data set, there are 6 data points (gene expression levels) in 2-dimensional space (2 time points). Say you want k = 3 clusters.
1. Select k of your data points to each be the original center of a cluster
2. Place the next data point in the nearest cluster
3. Compute the new location of the cluster center
4. Repeat the previous 2 steps for each data point
5. After all data is placed in a cluster, use final cluster centers as starting point for another iteration beginning at step 2.
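The five steps can be sketched as below. This is a minimal k-means-style reading of the procedure, with several assumptions: the first k points seed the clusters, distance is Euclidean, and a cluster center is the mean of its current members. A full self-organizing map also updates neighboring nodes on a grid, which the steps above do not describe and this sketch omits. The coordinates are invented.

```python
from math import dist

def sequential_kmeans(points, k, passes=2):
    """Cluster points following the slide's steps; returns (centers, labels)."""
    centers = [tuple(p) for p in points[:k]]  # step 1: seed with the first k points
    labels = [None] * len(points)

    def recenter(c):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:  # an emptied cluster keeps its previous center
            centers[c] = tuple(sum(x) / len(members) for x in zip(*members))

    for _ in range(passes):                   # step 5: start again from the latest centers
        for i, p in enumerate(points):        # step 4: visit every data point
            old = labels[i]
            new = min(range(k), key=lambda c: dist(p, centers[c]))  # step 2
            labels[i] = new
            recenter(new)                     # step 3: update the receiving cluster
            if old is not None and old != new:
                recenter(old)                 # the cluster that lost a point also moves
    return centers, labels

# Hypothetical (T1, T2) expression values for genes A, B, C, D, E, F
genes = [(1.0, 1.0), (5.0, 5.0), (1.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 5.2)]
centers, labels = sequential_kmeans(genes, k=3)
print(labels)
print(centers)
```

With genes A, B, and C as the seeds, D joins A's cluster, E joins B's, and F joins C's, and a second pass leaves the assignment unchanged. The choice of seeds and of the update rule in step 3 are exactly the details the deck later flags as options to consider.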

Self Organizing Maps – Simple Example
[Figure: genes A to F as points before clustering]

Let Genes A, B, and C be initial cluster centers.
[Figure: genes A to F with initial cluster centers]
[Figure: clusters after 1st pass]
[Figure: clusters after 2nd pass]

Self Organizing Maps – Simple Example

Self Organizing Maps – Larger example
X-axis is time after dose
Y-axis is normalized gene expression level
Group ~1000 genes into 24 categories

Self Organizing maps - details to consider
Several methods exist for choosing initial data points for clusters.
How to choose the initial number of clusters.
The method of recalculating the cluster center after adding a new data point can be varied, including how much ‘weight’ is given to the new data point.
Routines for merging and dividing clusters and detecting outliers can be added at each iteration.

Self Organizing maps
Advantages
–Able to come closer to ‘optimal’ clustering through iterations.
–Doesn’t force a tree-structure on data
Disadvantage
–Larger number of options for clustering means that details of process may be hidden.

Data Preprocessing
Filter data
–Remove genes with expression levels in the noise
–Focus on a group of genes with a particular function
Normalize data
–Subtract a control condition
–Scale so that a gene whose expression level changes from 5000 to 10000 looks the same as a gene whose expression level changes from 500 to 1000. One possibility is to scale all genes to mean of 0 and standard deviation of 1.
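The scaling to mean 0 and standard deviation 1 can be checked on the slide's own numbers. The sketch below is illustrative only; the function name `zscore` is an assumption, and it uses the sample standard deviation, which is one of several reasonable choices.

```python
from statistics import mean, stdev

def zscore(values):
    """Scale one gene's expression profile to mean 0 and standard deviation 1.

    Assumes the gene is not constant across conditions (stdev > 0).
    """
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

high = zscore([5000, 10000])  # gene that doubles from 5000 to 10000
low = zscore([500, 1000])     # gene that doubles from 500 to 1000
print(high)
print(low)
```

After scaling, the two genes have identical profiles, so a distance-based clustering method treats them as behaving the same way even though their raw levels differ tenfold.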

Detecting Statistically Significant Changes
Consider 1D, multiple conditions
1-way ANOVA
Similar tests for 1D, 2 condition data:
–Fold changes
–Tests Steve described in previous talk (Mann-Whitney, Comparison Analysis)

1D, Multiple Condition Data
Gene  Dose 1               Dose 2               Dose 3               …
      Rep 1  Rep 2  Rep 3  Rep 1  Rep 2  Rep 3  Rep 1  Rep 2  Rep 3
A     150    160    150    180    190    180    150    155    135
B     50     40     45     50     45     40     80     90     105
C     800    760    680    400    450    425    200    220    400
…

1-Way ANOVA
The question being asked is whether the expression level for each gene (taken one at a time) changes significantly as a function of dose.
More specifically, it compares the variability within replicates for a given dose to the variability caused by changing the dose.
If the gene chip contains 1000 genes, then do 1000 ANOVAs.
Consider “repeated measures ANOVA” if multiple measurements are done on the same animal
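The comparison of within-replicate variability to between-dose variability can be written out directly. The sketch below computes the 1-way ANOVA F statistic for one gene, using the Gene B row of the 1D, multiple condition table; the function name is illustrative. It stops at F: obtaining the p-value requires the F distribution, for example `scipy.stats.f.sf(F, df_between, df_within)` if SciPy is available.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of replicate groups."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    # Variability caused by changing the dose (between-group sum of squares)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variability within replicates at a given dose (within-group sum of squares)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Gene B: three replicates at each of Dose 1, Dose 2, Dose 3
gene_b = [[50, 40, 45], [50, 45, 40], [80, 90, 105]]
f_stat, df_b, df_w = one_way_anova_f(gene_b)
print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

Running one ANOVA per gene means calling this once for every row of the table, which is what "1000 genes, then do 1000 ANOVAs" amounts to.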

ANOVA for Hepatocytes exposed to Hydrazine, time 0
Source   SS    df  MS    F     P
Columns  5566  2   2783  3.21  0.1798
Error    2602  3   867
Total    8168  5

2-way ANOVA
Apply to 2D, multiple x multiple condition data sets
Consider 3 doses, 5 time points per dose, 2 replicates per condition
Can reveal significant effect of time, significant effect of dose, or a significant interaction between the two
A “2-way repeated measures ANOVA” also exists

2-way ANOVA for Hydrazine Data – Output for 1 gene
Source     SS     df  MS    F     P
Time       28724  4   7181  9.20  7.3e-4
Dose       1143   2   572   0.73  0.498
Time*dose  22940  8   2868  3.67  0.016
Error      10930  14  781
Total      64409  28
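As a reading aid for the output above, each mean square is the sum of squares divided by its degrees of freedom, and each F is an effect's mean square divided by the error mean square. Plugging in the tabulated mean squares reproduces the reported F values to rounding:

```latex
\begin{aligned}
MS_{\text{effect}} &= \frac{SS_{\text{effect}}}{df_{\text{effect}}}, \qquad
F_{\text{effect}} = \frac{MS_{\text{effect}}}{MS_{\text{error}}} \\
F_{\text{time}} &= \frac{7181}{781} \approx 9.2, \qquad
F_{\text{dose}} = \frac{572}{781} \approx 0.73, \qquad
F_{\text{time}\times\text{dose}} = \frac{2868}{781} \approx 3.67
\end{aligned}
```

For this gene the time effect and the time-by-dose interaction have small p-values, while the dose effect on its own does not.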

2-Way ANOVA – p-value Summary for 10 Genes

2-Way ANOVA – Dose effect
[Figure legend: red, 0 mM; green, 50 mM; blue, 75 mM]

2-way ANOVA – Time x Dose effect

Software
Free
–Eisen’s software: Cluster, Treeview. Hierarchical clustering, SOM. http://rana.lbl.gov/
–Genecluster: SOM. http://www-genome.wi.mit.edu/cancer/software/software.html

Software (cont)
Commercial, gene-specific
–Genelinker Gold: PCA, clustering, SOM, statistics. http://microarray.genelinker.com/products.html#GeneLinkerGold
–GeneSpring: PCA, clustering, SOM, statistics. http://www.sigenetics.com/cgi/SiG.cgi/Products/GeneSpring/index.smf
–Rosetta: PCA, clustering, SOM, ANOVA. http://www.rosettabio.com/products/resolver/default.htm
–Several others

Software (cont)
Tools, not gene specific
–Matlab
–SPSS
–SAS
A useful web site, briefly summarizes many software packages, up-to-date
–http://ihome.cuhk.edu.hk/~b400559/arraysoft.html

Collaborators
AFRL: Dr. John Frazier, Dr. Charles Wang, Dr. Victor Chan
AFOSR: Dr. Walt Kozumbo
AFIT: Dr. Dennis Quinn, Rebecca Olson, Tom Hopkins, 2Lt Matt Campbell
WSU: Dr. Nick Reo, Dr. Steve Berberich, Dr. Tatiana Karpinets

Questions?
