Enabling Reproducible Gene Expression Analysis Using Biological Pathways Limsoon Wong 7 April 2011 (Joint work with Donny Soh, Difeng Dong, Yike Guo)


1 Enabling Reproducible Gene Expression Analysis Using Biological Pathways Limsoon Wong 7 April 2011 (Joint work with Donny Soh, Difeng Dong, Yike Guo)

2 Plan
Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011. Copyright 2011 © Limsoon Wong.
An issue in gene expression analysis
Comparing pathway sources: comprehensiveness, consistency, compatibility
Matching pathways in different sources
Finding more consistent disease subnetworks

3 An Issue in Gene Expression Analysis

4 First, the good news...

5 Childhood Acute Lymphoblastic Leukemia
Major subtypes: T-ALL, E2A-PBX, TEL-AML, BCR-ABL, MLL genome rearrangements, Hyperdiploid>50
Different subtypes respond differently to the same treatment (Tx)
Over-intensive Tx
  – Development of secondary cancers
  – Reduction of IQ
Under-intensive Tx
  – Relapse
The subtypes look similar
Conventional diagnosis
  – Immunophenotyping
  – Cytogenetics
  – Molecular diagnostics
  => Unavailable in developing countries

6 Individual Gene Testing
Fold change
  x – Microarray value after drug
  y – Microarray value before drug
  i – Gene
T-test
  x – Log2 value of treatment
  y – Log2 value of control
  s – Standard error
  i – Gene
Golub et al, Science, 1999
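
The formulas on this slide did not survive in the transcript, so the following is only a minimal sketch using the textbook definitions: fold change as the ratio of the after-drug value to the before-drug value, and a t-statistic as the difference of mean log2 values divided by its standard error. The function names and the unequal-variance (Welch-style) standard error are assumptions, not necessarily what the slide showed.

```python
import math
from statistics import mean, variance


def fold_change(after, before):
    """Ratio of mean expression after the drug to mean expression before it, for one gene."""
    return mean(after) / mean(before)


def t_statistic(treatment_log2, control_log2):
    """Difference of mean log2 values divided by its standard error s (assumed Welch form)."""
    n_x, n_y = len(treatment_log2), len(control_log2)
    s = math.sqrt(variance(treatment_log2) / n_x + variance(control_log2) / n_y)
    return (mean(treatment_log2) - mean(control_log2)) / s


# Hypothetical values for a single gene i.
print(fold_change([4.0, 4.0], [2.0, 2.0]))
print(t_statistic([3.0, 5.0], [1.0, 3.0]))
```

Each gene is tested on its own here, which is the "individual gene testing" setting the later slides contrast with pathway-based methods.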

7 Yeoh et al, Cancer Cell 2002

8 Impact
Conventional Tx: intermediate intensity to all
  => 10% suffer relapse
  => 50% suffer side effects
  => costs US$150m/yr
Our optimized Tx: high intensity to 10%, intermediate intensity to 40%, low intensity to 50%
  => costs US$100m/yr
High cure rate of 80%
Fewer relapses
Fewer side effects
Save US$51.6m/yr
Copyright © 2004 by Jinyan Li and Limsoon Wong

9 Now, the bad news...

10 Percentage of Overlapping Genes
Low % of overlapping genes from different experiments in general
  – Prostate cancer: Lapointe et al, 2004; Singh et al, 2002
  – Lung cancer: Garber et al, 2001; Bhattacharjee et al, 2001
  – DMD: Haslett et al, 2002; Pescatori et al, 2007

Datasets          DEG       POG
Prostate Cancer   Top 10    0.30
                  Top 50    0.14
                  Top 100   0.15
Lung Cancer       Top 10    0.00
                  Top 50    0.20
                  Top 100   0.31
DMD               Top 10    0.20
                  Top 50    0.42
                  Top 100   0.54

Zhang et al, Bioinformatics, 2009
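
To make the POG column concrete, here is a minimal sketch of one way to compute a percentage of overlapping genes between the top-n differentially expressed genes of two experiments. It assumes POG is simply the shared fraction of the two top-n lists; the definition used by Zhang et al may differ in details (for example, by also checking the direction of regulation), and the gene names below are hypothetical.

```python
def pog(ranked_genes_a, ranked_genes_b, n):
    """Fraction of the top-n genes of one ranked list that also appear in the top-n of the other."""
    top_a = set(ranked_genes_a[:n])
    top_b = set(ranked_genes_b[:n])
    return len(top_a & top_b) / n


# Hypothetical ranked gene lists from two experiments on the same disease.
experiment_1 = ["g1", "g2", "g3", "g4"]
experiment_2 = ["g3", "g1", "g9", "g8"]
print(pog(experiment_1, experiment_2, 2))
```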

11 Gene Regulatory Circuits
Each disease subtype has an underlying cause
There is a unifying biological theme for genes that are truly associated with a disease subtype
Uncertainty in selected genes can be reduced by considering the biological processes of the genes
The unifying biological theme is the basis for inferring the underlying cause of a disease subtype

12 Towards More Meaningful Genes
ORA
  – Khatri et al, Genomics, 2002
FCS
  – Pavlidis & Noble, PSB 2002
GSEA
  – Subramanian et al, PNAS, 2005
Pathway Express
  – Draghici et al, Genome Res, 2007

13 All of these newer methods rely on gene group or pathway information. But how good are the available sources of pathway information?

14 Comparing Pathway Sources: Comprehensiveness, Consistency, & Compatibility

15 Data Sources
KEGG
  – Curated by a single lab
  – Long famous history
  – Used by many people
Wikipathways
  – Community effort
  – New curation model
Ingenuity
  – Commercial effort
  – Used by many biopharma companies

16 Low Comprehensiveness of Pathway Sources
[Figure: number of pathways, number of gene pairs, and number of genes in each source]

17 Low Consistency of Pathway Sources
[Figure: gene overlap and gene-pair overlap for Wiki vs KEGG, Wiki vs Ingenuity, and KEGG vs Ingenuity]

18 Example: Apoptosis Pathway

19 Would Unifying Pathway Sources Help?
Incompatibility issues!
  – Data extraction method variations
  – Format variations
  – Data differences
  – Gene/GeneID name differences
  – Pathway name differences

20 The preceding analyses hide an intricate issue... The same pathways in the different sources are often given different names. So how do we even know two pathways are the same and should be compared / merged?

21 Intricacy of Pathway Matching

22 Possible Ways to Match Pathways
Match based on name
  – Pathways with similar names should be the same pathway
  – But annotations are very noisy
  => Likely to mismatch pathways?
  => Likely to match too many pathways?
Are the following good alternative approaches?
  – Match based on overlap of genes
  – Match based on overlap of gene pairs
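
The two overlap-based alternatives can be sketched as follows. This is an illustration only: it assumes a pathway is given as a list of gene pairs, that pairs are undirected, and that overlap is normalised by the smaller of the two sets. None of these choices is stated in the slides, and the helper names are hypothetical.

```python
def genes_of(pairs):
    """All genes that appear in at least one gene pair of a pathway."""
    return {gene for pair in pairs for gene in pair}


def undirected(pairs):
    """Gene pairs as unordered sets, so (A, B) and (B, A) count as the same pair."""
    return {frozenset(pair) for pair in pairs}


def overlap(a, b):
    """Shared items as a fraction of the smaller set (assumed normalisation)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def gene_overlap(pairs_a, pairs_b):
    return overlap(genes_of(pairs_a), genes_of(pairs_b))


def gene_pair_overlap(pairs_a, pairs_b):
    return overlap(undirected(pairs_a), undirected(pairs_b))


# Hypothetical versions of "the same" pathway in two databases.
pathway_in_db_a = [("A", "B"), ("B", "C")]
pathway_in_db_b = [("B", "A"), ("C", "D")]
print(gene_overlap(pathway_in_db_a, pathway_in_db_b))
print(gene_pair_overlap(pathway_in_db_a, pathway_in_db_b))
```

In this toy case the genes agree almost completely while only half of the gene pairs do, which is the kind of gap between gene-level and pair-level agreement the talk goes on to examine.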

23 Matching Pathways by Name
LCS procedure
  – Given pathway X in db A
  – Sort pathways in db B by "longest common substring" with X
  – Manually scan the ranked list to choose the closest nomenclatural match
Issue: Accuracy
  – When LCS says two pathways are the same one, are they really the same?
Issue: Completeness
  – When LCS says two pathways are different, are they really different?
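
The ranking step of the LCS procedure could look like the sketch below, which uses Python's standard `difflib` to find the longest common substring of two pathway names. Lower-casing the names before comparison is an assumption, and the candidate names are hypothetical. The final choice is still made by a person scanning the ranked list, as the slide says.

```python
from difflib import SequenceMatcher


def lcs_length(name_a, name_b):
    """Length of the longest common substring of two names, ignoring case."""
    a, b = name_a.lower(), name_b.lower()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def rank_by_lcs(pathway_x, pathways_in_db_b):
    """Pathway names in db B, sorted by decreasing longest common substring with X."""
    return sorted(pathways_in_db_b, key=lambda name: lcs_length(pathway_x, name), reverse=True)


# Hypothetical pathway names in db B; the top of the list is what a curator would inspect first.
candidates = ["Cell cycle", "MAPK signaling pathway", "Apoptosis signaling"]
for name in rank_by_lcs("Apoptosis", candidates):
    print(lcs_length("Apoptosis", name), name)
```

Generic words such as "signaling pathway" can dominate the common substring for longer names, which is one reason a manual scan of the ranked list remains part of the procedure.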

24 LCS vs Gene-Agreement Matching
Accuracy
  – 94% of LCS matches are in the top 3 gene-agreement matches
  – 6% of LCS matches are not in the top 3 gene-agreement matches, but their gene-pair agreement levels are higher
Completeness
  – Let Pi be a pathway in db A for which LCS cannot find a match in db B
  – Let Qi be the pathway in db B with the highest gene agreement to Pi
  – Gene-pair agreement of Pi-Qi is much lower than that of pathway pairs matched by LCS
=> LCS is better than gene-agreement based matching!

25 LCS vs Gene-Agreement Matching
[Figure: gene overlap percentage and gene-pair overlap percentage, for LCS matches vs gene-agreement matches]
LCS consistently has higher gene-pair agreement
=> LCS is better than gene-agreement based matching!

26 LCS vs Gene-Pair Agreement Matching
[Venn diagram: LCS vs gene-pair overlap, with region counts 8, 24 and 16]
The 8 pathway pairs singled out by LCS
The 24 pathway pairs singled out by maximal gene-pair overlap
Note: We consider only pathway pairs that have an overlap of at least 20 reactions.

27 LCS vs Gene-Pair Agreement Matching
Gene-pair agreement matching will miss when
  – Pathway P in db A has little overlap with pathway P in db B due to incompleteness of the db, even if the pathway name matches perfectly!
  – Example: wnt signaling pathway, VEGF signaling pathway, MAPK signaling pathway, etc. in KEGG don't have the largest gene-pair overlap with the corresponding pathways in Wikipathways & Ingenuity
=> Bad for getting a more complete unified pathway P

28 LCS vs Gene-Pair Agreement Matching
Pathways having large gene-pair overlap are not necessarily the same pathways
Examples
  – "Synaptic Long Term Potentiation" in Ingenuity vs "calcium signalling" in KEGG
  – "PPAR-alpha/RXR-alpha Signaling" in Ingenuity vs "TGF-beta signaling pathway" in KEGG
=> Difficult to set the correct gene-pair overlap threshold to balance against false positive matches

29 Gene vs Gene Pair Agreement Matching
[Figure: % overlap with LCS, plotted against the top n% matching pathways based on gene / gene-pair overlap]
Pathways with higher gene / gene-pair overlap have higher overlap with LCS
Gene-pair matching is better than gene matching
But neither is as good as LCS

30 PathwayAPI = KEGG + Wikipathways + Ingenuity
Having found a good way to match up pathways in different data sources, we proceeded to build a big unified pathway db...
Donny Soh, Difeng Dong, Yike Guo, Limsoon Wong. Consistency, Comprehensiveness, and Compatibility of Pathway Databases. BMC Bioinformatics, 11:449, September 2010.

31 More Consistent Disease Subnetworks

32 But these methods still don't return the precise parts of a pathway that are significant...
ORA
  – Khatri et al, Genomics, 2002
FCS
  – Pavlidis & Noble, PSB 2002
GSEA
  – Subramanian et al, PNAS, 2005
Pathway Express
  – Draghici et al, Genome Res, 2007
Test whole gene group at a time
Test a node and its immediate neighbours

33 The SNet Method
Group samples into type D and ~D
Extract & score subnetworks for type D
  – Get list of genes highly expressed in most D samples (these genes need not be differentially expressed!)
  – Put these genes into pathways
  – Locate connected components (i.e., candidate subnetworks) from these pathway graphs
  – Score subnetworks on D samples and on ~D samples
For each subnetwork, compute t-statistics on the two sets of scores
Determine significant subnetworks by permutations

34 SNet: Extract Subnetworks
Genes highly expressed in many type-D samples
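
The extraction step (put the highly expressed genes into a pathway, then locate connected components) can be sketched as below. This is an illustration under assumptions, not the authors' implementation: the pathway is assumed to be a list of undirected gene pairs, a highly expressed gene with no highly expressed neighbour is dropped, and the function name is hypothetical.

```python
from collections import defaultdict


def candidate_subnetworks(pathway_edges, high_genes):
    """Connected components of the pathway graph restricted to highly expressed genes."""
    # Keep only edges whose two endpoints are both highly expressed.
    adjacency = defaultdict(set)
    for u, v in pathway_edges:
        if u in high_genes and v in high_genes:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collects one component.
        component, stack = set(), [start]
        while stack:
            gene = stack.pop()
            if gene in component:
                continue
            component.add(gene)
            stack.extend(adjacency[gene] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical pathway and gene list for type-D samples.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
high_in_most_d_samples = {"A", "B", "D", "E", "F"}
print(candidate_subnetworks(edges, high_in_most_d_samples))
```

Here gene C is not highly expressed, so the pathway splits into the candidate subnetworks {A, B} and {E, F}, and D is left out because it has no highly expressed neighbour.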

35 SNet: Score Subnetworks

36 SNet: Significant Subnetworks
Randomize patient samples many times
Get t-score for subnetworks from the randomizations
Use these t-scores to establish null distribution
Filter for significant subnetworks from real samples
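
A minimal sketch of this permutation step for a single subnetwork follows. It assumes the subnetwork has already been scored on every sample (the scoring formula itself is not in the transcript), shuffles the sample labels to build the null distribution of t-scores, and reports a one-sided empirical p-value. The function names, the number of permutations, and the example scores are all hypothetical.

```python
import math
import random
from statistics import mean, variance


def t_stat(xs, ys):
    """Two-sample t-statistic with unequal-variance standard error."""
    diff = mean(xs) - mean(ys)
    se = math.sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def permutation_p_value(scores_d, scores_not_d, n_perm=1000, seed=0):
    """Fraction of label shufflings whose t-score is at least the observed one."""
    rng = random.Random(seed)
    observed = t_stat(scores_d, scores_not_d)
    pooled = list(scores_d) + list(scores_not_d)
    k = len(scores_d)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomize which samples count as type D
        if t_stat(pooled[:k], pooled[k:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical subnetwork scores on D and ~D samples.
p = permutation_p_value([9.0, 8.5, 9.5, 8.0, 9.2], [1.0, 1.5, 0.5, 2.0, 1.2])
print(p)
```

A subnetwork would then be kept if its p-value falls below a chosen significance threshold; the threshold is not given in the slides.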

37 Let's see whether SNet gives us subnetworks that are
(i) more consistent between datasets of the same types of disease samples
(ii) larger and more meaningful

38 Recall Examples from "Bad News"
Low % of overlapping genes from different experiments in general
  – Prostate cancer: Lapointe et al, 2004; Singh et al, 2002
  – Lung cancer: Garber et al, 2001; Bhattacharjee et al, 2001
  – DMD: Haslett et al, 2002; Pescatori et al, 2007

Datasets          DEG       POG
Prostate Cancer   Top 10    0.30
                  Top 50    0.14
                  Top 100   0.15
Lung Cancer       Top 10    0.00
                  Top 50    0.20
                  Top 100   0.31
DMD               Top 10    0.20
                  Top 50    0.42
                  Top 100   0.54

Zhang et al, Bioinformatics, 2009

39 Better Subnetwork Overlap
For each disease, take the significant subnetworks from one dataset and see if they are also significant in the other dataset

40 Better Gene Overlaps
For each disease, take significant subnetworks extracted independently from both datasets and see how much their genes overlap

41 Larger Subnetworks

42 Key Insight #1
[Figure: subnetwork of genes A, B, C]
Genes A, B, C are high in phenotype D
A is high in phenotype ~D but B and C are not
Conventional techniques: Gene B and Gene C are selected. Possible incorrect postulation of mutations in genes B and C
SNet does not require all the genes in a subnet to be differentially expressed
It only requires the subnet as a whole to be differentially expressed
Able to capture the entire relationship, postulating a mutation in gene A

43 Key Insight #2
[Figure: pathway with a branch of genes A, B, C, D, E and 30 other genes]
A branch within the pathway consisting of genes A, B, C, D and E is high in phenotype D
Genes C, D and E are not high in phenotype ~D
30 other genes are not differentially expressed
Conventional techniques: Entire subnetwork is likely to be missed
SNet: Able to capture the entire subnetwork branch within the pathway

44 Key Insight #3
[Figure: genes A, B, C in Pathway 1 and in Pathway 2]
Genes A, B and C are present in two separate pathways
A, B and C are high in phenotype D, but not high in phenotype ~D
Conventional techniques: Both pathways are scored equally, so both get selected, resulting in pathway 2 being a false positive
SNet: Able to select only pathway 1, which has the relevant relationship

45 Remarks

46 What have we learned?
Significant lack of concordance between databases
  – Level of consistency for genes is 0% to 88%
  – Level of consistency for gene pairs is 0% to 61%
  – Most databases contain fewer than half of the pathways in the other databases
Matching pathways by name is better than matching by gene overlap or gene-pair overlap
SNet method yields more consistent and larger disease subnetworks

47 Acknowledgements
A*STAR AIP scholarship
A*STAR SERC PSF grant
Difeng Dong
Donny Soh
Yike Guo

48 References
Eng-Juh Yeoh, Mary E. Ross, Sheila A. Shurtleff, W. Kent Williams, Divyen Patel, Rami Mahfouz, Fred G. Behm, Susana C. Raimondi, Mary V. Relling, Anami Patel, Cheng Cheng, Dario Campana, Dawn Wilkins, Xiaodong Zhou, Jinyan Li, Huiqing Liu, Ching-Hon Pui, William E. Evans, Clayton Naeve, Limsoon Wong, James R. Downing. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1:133--143, March 2002.
Donny Soh, Difeng Dong, Yike Guo, Limsoon Wong. Enabling More Sophisticated Gene Expression Analysis for Understanding Diseases and Optimizing Treatments. ACM SIGKDD Explorations, 9(1):3--14, June 2007.
Donny Soh, Difeng Dong, Yike Guo, Limsoon Wong. Consistency, Comprehensiveness, and Compatibility of Pathway Databases. BMC Bioinformatics, 11:449, September 2010.
Donny Soh. Understanding Pathways. PhD thesis, Imperial College London, December 2010.

