Pathway Analysis Michael Sneddon Southern California Bioinformatics Institute August 20, 2004.

Project Overview
BioDiscovery specializes in microarray data analysis software:
– Image Analysis
– Data Analysis
– Data Management
How can microarray data be used to find information about biological pathways?
Project: explore different ways to extract information about biological pathways from microarray data.
CONFIDENTIAL

Sample Microarray Data
Microarrays can provide information on differential expression between conditions. The most differentially expressed genes are singled out for further study.

Genes      Conditions
           Healthy         Infected
Gene 1      95    101      112    106
Gene 2     150    175      163    145
Gene 3     123    118      212    248
Gene 4      64     73       50     58
Gene 5     284    253      258    270
Gene 6     100     89       92     94
Gene 7      78     86      132    125
Gene 8     184    170      145    153
Gene 9     138    146      130    185

Gene 3 would be selected for further study.
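To make the selection step concrete, here is a minimal sketch that picks the most differentially expressed gene from the sample values on this slide. It assumes the four numbers per gene are two healthy samples followed by two infected samples, and it uses the absolute difference of the group means as the measure of differential expression; both are assumptions for illustration, not something the slide states.

```python
from statistics import mean

# Values from the slide; the 2 healthy + 2 infected split is an assumption.
expression = {
    "Gene 1": ([95, 101], [112, 106]),
    "Gene 2": ([150, 175], [163, 145]),
    "Gene 3": ([123, 118], [212, 248]),
    "Gene 4": ([64, 73], [50, 58]),
    "Gene 5": ([284, 253], [258, 270]),
    "Gene 6": ([100, 89], [92, 94]),
    "Gene 7": ([78, 86], [132, 125]),
    "Gene 8": ([184, 170], [145, 153]),
    "Gene 9": ([138, 146], [130, 185]),
}

def mean_difference(healthy, infected):
    """Absolute difference between the infected and healthy group means."""
    return abs(mean(infected) - mean(healthy))

# Score every gene, then take the gene with the largest difference.
differences = {
    gene: mean_difference(healthy, infected)
    for gene, (healthy, infected) in expression.items()
}
top_gene = max(differences, key=differences.get)
print(top_gene, differences[top_gene])  # Gene 3 109.5
```

With real data a statistical test such as a t-test would normally replace the raw difference, since it also accounts for the spread within each group.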

A Different Approach
Difficulties with the old approach:
– No gene is significantly differentially expressed.
– Many genes are significantly differentially expressed.
– Prior knowledge is not used.
A different approach:
– Look for affected biological processes, sometimes called pathways, instead of individual genes.
– This needs a way to convert a list of differential gene expression values into scores for pathways. The way to do that is through a scoring metric.

Method of Ranking Pathways
[Diagram: Microarray Data + Annotations -> Scoring Metric -> a score for each pathway]
The score for each pathway indicates how much it was affected by the condition.
Many different scoring metrics are available.

A Simple Metric
Example pathway: Photosynthesis

 #   Gene Name       P-Value
 1   200078_s_at     0.13458
 2   201172_x_at     0.05124
 3   205473_at       0.98341
 4   208678_at       0.46123
 5   214244_s_at     0.00032
 6   230565_at       0.00341
 7   36994_at        0.28952
 8   39144_s_at      0.17345

Score = (# of genes with a P-value below 0.2) / (total # of genes in pathway)
5 genes have a P-value below 0.2 out of 8 genes in this pathway.
Score = 5/8 = 0.625
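The simple metric can be sketched in a few lines. The probe names and P-values are the ones on the slide; the function name `fraction_below` and the default threshold argument are illustrative choices, not part of any BioDiscovery code.

```python
# P-values for the genes of the example pathway, as listed on the slide.
photosynthesis = {
    "200078_s_at": 0.13458,
    "201172_x_at": 0.05124,
    "205473_at": 0.98341,
    "208678_at": 0.46123,
    "214244_s_at": 0.00032,
    "230565_at": 0.00341,
    "36994_at": 0.28952,
    "39144_s_at": 0.17345,
}

def fraction_below(p_values, threshold=0.2):
    """Fraction of a pathway's genes whose P-value is below the threshold."""
    if not p_values:
        raise ValueError("pathway has no genes")
    hits = sum(1 for p in p_values.values() if p < threshold)
    return hits / len(p_values)

score = fraction_below(photosynthesis)
print(score)  # 0.625
```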

Project Goals
I. Analyze and compare different scoring metrics
– How similar are the different metrics?
– Which metric produces the most biologically significant results?
– When should we use a particular metric over another?
II. Explore known ranking metrics
– How and why do they work?
– Is there a way to improve them or design a better one?

The Metrics Investigated
– Enrichment: the original method used to rank pathways; it is still widely used today.
– GSEA (Gene Set Enrichment Analysis): a recently published* method using a Kolmogorov-Smirnov statistic.
– Shams 1, Shams 2, Shams 3: potential BioDiscovery scoring metrics.
* Mootha, et al., "PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes." (Nat Genet. 2003 Jul;34(3):267-73).

Part I: Compare the Metrics
Compare each metric to all the others to see if they produce similar results.
– If they are very similar, it doesn't matter which one we use.
– If they are different, which one is correct? Or are they both correct?

How to Compare Metrics
Wrote a program that does the following:
(1) Rank the same pathways using different metrics.
(2) Take the top pathways from each ranking.
(3) Count the number of pathways that are in common among the top pathways being considered.
(4) Construct a % similarity score = # of pathways in common divided by the number of top pathways taken from each ranking.
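The four steps above can be sketched as a single function. This is an illustrative reconstruction, not the program written for the project; the name `top_k_similarity` and the toy rankings are hypothetical.

```python
def top_k_similarity(ranking_a, ranking_b, k):
    """Percent of pathways shared by the top k entries of two rankings.

    Each ranking is a list of pathway names, best-scoring first.
    """
    if k <= 0 or k > min(len(ranking_a), len(ranking_b)):
        raise ValueError("k must be between 1 and the ranking length")
    common = set(ranking_a[:k]) & set(ranking_b[:k])
    return 100.0 * len(common) / k

# Toy rankings of the same six pathways under two hypothetical metrics.
metric_one = ["A", "B", "C", "D", "E", "F"]
metric_two = ["B", "E", "A", "F", "C", "D"]

print(top_k_similarity(metric_one, metric_two, 4))  # 50.0
```

Only membership in the top k counts here; the order of pathways inside the top k does not affect the score.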

An Example
Compare the top 12 pathways from each metric.

SHAMS I                                        GSEA
PATHWAY NAME                        SCORE      PATHWAY NAME                        SCORE
tachykinin signaling pathway        2.051      nucleotide-sugar metabolism         298.4
endothelin receptor activity        1.983      endothelin-B receptor activity      270.5
initiation factor 4F complex        1.965      endothelin receptor activity        266.7
nucleotide-sugar metabolism         1.807      odorant binding                     221.3
sarcoglycan complex                 1.829      interleukin-6 receptor binding      218.1
ubiquitin C-terminal activity       1.664      histone deacetylation               181.3
odorant binding                     1.539      clathrin binding                    176.4
interleukin-6 receptor binding      1.507      female germ-cell nucleus            162.9
cysteine-type peptidase activity    1.488      AMP deaminase activity              159.2
clathrin binding                    1.432      cholecystokinin receptor activity   150.3
female germ-cell nucleus            1.422      malate metabolism                   144.0
AMP deaminase activity              1.415      malate activity                     138.9
anticoagulant activity              1.337      malate dehydrogenase                113.5
protease activator activity         1.336      delta-opioid receptor activity      112.8
malate metabolism                   1.326      vasculogenesis                      109.0

An Example

SHAMS I                                        GSEA
PATHWAY NAME                        SCORE      PATHWAY NAME                        SCORE
tachykinin signaling pathway        2.051      nucleotide-sugar metabolism         298.4
endothelin receptor activity        1.983      endothelin-B receptor activity      270.5
initiation factor 4F complex        1.965      endothelin receptor activity        266.7
nucleotide-sugar metabolism         1.807      odorant binding                     221.3
sarcoglycan complex                 1.829      interleukin-6 receptor binding      218.1
ubiquitin C-terminal activity       1.664      histone deacetylation               181.1
odorant binding                     1.539      clathrin binding                    176.4
interleukin-6 receptor binding      1.507      female germ-cell nucleus            162.9
cysteine-type peptidase activity    1.488      clathrin binding                    159.2
cholecystokinin receptor activity   1.385      malate metabolism                   150.3
female germ-cell nucleus            1.422      AMP deaminase activity              144.0
AMP deaminase activity              1.415      malate activity                     138.9
anticoagulant activity              1.337      malate dehydrogenase                113.5
protease activator activity         1.336      delta-opioid receptor activity      112.8
malate metabolism                   1.326      vasculogenesis                      109.0

6 matches out of 12 total pathways = 50% similarity

Repeat the Process
First, take the top 10 pathways.
Then take the top 20 pathways.
Then take the top 30 pathways...
Continue until a pattern is seen.
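Repeating the comparison at growing cut-offs produces a curve of % similarity against cut-off value. The sketch below is one way to compute such a curve; the function name `similarity_curve`, the step size, and the integer pathway identifiers are hypothetical stand-ins.

```python
def similarity_curve(ranking_a, ranking_b, step=10):
    """Return (cut-off, % similarity) pairs for cut-offs step, 2*step, ..."""
    limit = min(len(ranking_a), len(ranking_b))
    curve = []
    for k in range(step, limit + 1, step):
        common = set(ranking_a[:k]) & set(ranking_b[:k])
        curve.append((k, 100.0 * len(common) / k))
    return curve

# Toy case: 40 pathways ranked in exactly opposite order by two metrics.
ranking_a = list(range(40))
ranking_b = list(reversed(ranking_a))

for cutoff, similarity in similarity_curve(ranking_a, ranking_b):
    print(cutoff, round(similarity, 1))
```

Two rankings of the same set of pathways always reach 100% once the cut-off covers every pathway, so the informative part of the curve is its behavior at small cut-offs.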

Example Graph of Results
[Figure: % similarity plotted against cut-off value (out of 2646 pathways).]
Between Shams 1 and Shams 2, the top 20 pathways have about 36% similarity.

Results
– No two metrics were very similar (85% or more) in any dataset tested.
– Percent similarities differed greatly between datasets; no two metrics demonstrated a consistent amount of similarity.
Since the metrics ranked the pathways differently, which metrics are correct? Or are they all correct?
Begin by verifying and understanding what has already been researched: GSEA.

Part II: Exploring a Metric: GSEA
Gene Set Enrichment Analysis
– A result of the collaboration of many individuals from a number of institutions, including MIT and Harvard.
– Devised in order to identify the pathways that are significantly affected in individuals with type 2 diabetes compared to healthy individuals.
How, exactly, does GSEA work? Is our implementation correct?

How GSEA Works
(1) Rank the genes based on differential expression.

 #    Gene           P-Value (T-test)
 1    217245_at      0.011737478
 2    204157_s_at    0.011873747
 3    208670_s_at    0.01204891
 4    203569_s_at    0.012177919
 5    211432_s_at    0.012267433
 6    215551_at      0.012284646
 7    206226_at      0.012354957
 8    201662_s_at    0.01257829
 9    210322_x_at    0.012579987
10    216776_at      0.012583022
11    220405_at      0.012684445

(2) Compute a score for each pathway based on where the genes of that pathway appear.
If pathway one contains these three genes [highlighted on the slide], and pathway two contains these three genes [a second highlighted group], then pathway one is given a higher score than pathway two.
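As a rough illustration of step (2), the sketch below computes a Kolmogorov-Smirnov style running-sum score over the ranked gene list: the sum steps up at each pathway gene and down at every other gene, and the score is the maximum the sum reaches. The step sizes follow my reading of the method in Mootha et al. and should be treated as an assumption, as should the membership of the two example pathways, since the slide's highlighting did not survive. This is not BioDiscovery's implementation.

```python
import math

def enrichment_score(ranked_genes, pathway_genes):
    """Maximum of a running sum that rises at pathway genes, falls otherwise."""
    members = set(pathway_genes) & set(ranked_genes)
    n, g = len(ranked_genes), len(members)
    if g == 0 or g == n:
        raise ValueError("pathway must contain some, but not all, ranked genes")
    hit_step = math.sqrt((n - g) / g)    # added when the gene is in the pathway
    miss_step = -math.sqrt(g / (n - g))  # added when it is not
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += hit_step if gene in members else miss_step
        best = max(best, running)
    return best

# Genes in the ranked order shown on the slide (most significant first).
ranked = [
    "217245_at", "204157_s_at", "208670_s_at", "203569_s_at",
    "211432_s_at", "215551_at", "206226_at", "201662_s_at",
    "210322_x_at", "216776_at", "220405_at",
]
# Hypothetical memberships: pathway one clusters at the top of the list,
# pathway two is spread across it.
pathway_one = ["217245_at", "204157_s_at", "208670_s_at"]
pathway_two = ["217245_at", "215551_at", "220405_at"]

print(enrichment_score(ranked, pathway_one) > enrichment_score(ranked, pathway_two))  # True
```

The two step sizes are chosen so that the running sum returns to zero at the end of the list, which means a pathway only scores highly when its genes are concentrated near the top.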

Importance of a P-Value
A metric will always produce a ranking. Is the ranking we get significant, or could it have been generated randomly?
Answer: we need to compute a P-value to make sure that the score we get is unlikely to have been produced by chance.

Constructing a P-value
(1) Permute the class labels 1000 times.
(2) Rank the pathways with each different permutation.
(3) Create a histogram of top values based on the permutations.
(4) Figure out where in the histogram the actual data lies; this shows how significant the score is.

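Constructing a P-value
[Figure: histogram with GSEA score on the horizontal axis and number of permutations on the vertical axis. One marked position is labeled "If the actual score falls here, the score is significant"; another is labeled "But, if the actual score falls here, the score is not significant".]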
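The permutation procedure can be sketched as follows. To stay short, this example swaps in a stand-in pathway score (the mean absolute difference in class means over a pathway's genes) instead of the GSEA score, and all data, gene names, and function names are hypothetical. The P-value is taken as the fraction of permutations whose top pathway score is at least as large as the top score from the real labels.

```python
import random
from statistics import mean

def top_pathway_score(expression, labels, pathways):
    """Highest pathway score for one assignment of class labels."""
    def gene_stat(values):
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        return abs(mean(group1) - mean(group0))
    return max(
        mean([gene_stat(expression[g]) for g in genes])
        for genes in pathways.values()
    )

def permutation_p_value(expression, labels, pathways, n_perm=1000, seed=0):
    """Return (observed top score, fraction of permutations reaching it)."""
    rng = random.Random(seed)
    observed = top_pathway_score(expression, labels, pathways)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # step 1: permute the class labels
        if top_pathway_score(expression, shuffled, pathways) >= observed:
            at_least += 1      # steps 2-4: compare permuted top score
    return observed, at_least / n_perm

# Hypothetical data: 4 healthy samples (label 0), then 4 infected (label 1).
expression = {
    "g1": [100, 104, 98, 102, 200, 205, 195, 210],
    "g2": [50, 52, 49, 51, 90, 95, 92, 88],
    "g3": [70, 72, 69, 71, 70, 73, 68, 71],
    "g4": [30, 31, 29, 30, 31, 30, 29, 30],
}
labels = [0, 0, 0, 0, 1, 1, 1, 1]
pathways = {"affected": ["g1", "g2"], "unaffected": ["g3", "g4"]}

observed, p_value = permutation_p_value(expression, labels, pathways)
print(observed, p_value)
```

Comparing against the top score of each permutation, not the score of one fixed pathway, accounts for the fact that many pathways are tested at once.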

Implementation
BioDiscovery already had an implementation of the GSEA scoring metric.
What I did:
– Tweaked the code so that it works better and functions more like the original published method.
– Extended the code to compute a P-value to measure the significance of GSEA scores.

Results of GSEA Analysis
– A better understanding of how GSEA operates, especially in comparison to other potential metrics.
– A good implementation of the GSEA metric.
– An implementation of a permutation analysis to judge the significance of calculated scores.

Next Steps
– Extend the permutation analysis implemented for GSEA to all the metrics, to verify the significance of their results.
– Submit these significant results to biologists to see which metrics make the most sense.
– Final step: integrate the best metrics and the permutation analysis into one application for biologists.

Acknowledgments
Special thanks to:
– Dr. Soheil Shams
– Dr. Bruce Hoff
– Keala Chan
– The staff of BioDiscovery, Inc.
– The professors of SoCalBSI
– The students of SoCalBSI
Funding provided by:
– National Science Foundation
– National Institutes of Health

Works Cited
– Mootha, et al., "PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes." (Nat Genet. 2003 Jul;34(3):267-73).
– Damian D, Gorfine M., "Statistical concerns about the GSEA procedure." (Nat Genet. 2004 Jul;36(7):663; author reply 663).
– Confidential documents of BioDiscovery, Inc.
– http://www.biocarta.com
– http://www.geneontology.org
– http://www.affymetrix.com
