Enabling Reproducible Gene Expression Analysis Using Biological Pathways Limsoon Wong 7 April 2011 (Joint work with Donny Soh, Difeng Dong, Yike Guo)

Presentation transcript:


Talk at IPM-NUS Workshop on Bioinformatics and Computer Science, 7 April 2011. Copyright 2011 © Limsoon Wong

Plan
An issue in gene expression analysis
Comparing pathway sources: comprehensiveness, consistency, compatibility
Matching pathways in different sources
Finding more consistent disease subnetworks

An Issue in Gene Expression Analysis

First, the good news...

Childhood Acute Lymphoblastic Leukemia
Major subtypes: T-ALL, E2A-PBX, TEL-AML, BCR-ABL, MLL genome rearrangements, Hyperdiploid>50
Different subtypes respond differently to the same treatment (Tx)
Over-intensive Tx
– Development of secondary cancers
– Reduction of IQ
Under-intensive Tx
– Relapse
The subtypes look similar
Conventional diagnosis
– Immunophenotyping
– Cytogenetics
– Molecular diagnostics
⇒ Unavailable in developing countries

Individual Gene Testing
Fold change
– x: microarray value after drug
– y: microarray value before drug
– i: gene
T-test
– x: log2 value of treatment
– y: log2 value of control
– s: standard error
– i: gene
Golub et al, Science, 1999
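The formulas on this slide did not survive in the transcript. As a minimal sketch, assuming the usual definitions (fold change as the ratio of mean expression after versus before the drug, and a Welch-style t-statistic on log2 values whose standard error comes from the two sample variances), individual gene testing might look like the following. The function names and the toy values are hypothetical and are not taken from the talk.

```python
import math
from statistics import mean, variance

def fold_change(after, before):
    """Ratio of mean expression after the drug to mean expression before it."""
    return mean(after) / mean(before)

def t_statistic(treatment_log2, control_log2):
    """Difference of group means divided by its standard error (Welch form)."""
    se = math.sqrt(variance(treatment_log2) / len(treatment_log2)
                   + variance(control_log2) / len(control_log2))
    return (mean(treatment_log2) - mean(control_log2)) / se

# Toy measurements for a single gene i (hypothetical values).
after = [8.0, 16.0, 12.0]
before = [2.0, 4.0, 3.0]
print(fold_change(after, before))
print(t_statistic([math.log2(v) for v in after],
                  [math.log2(v) for v in before]))
```

Each gene is tested on its own here, which is exactly the property the later slides identify as a source of poor reproducibility.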

[Figure] Yeoh et al, Cancer Cell, 2002

Impact
Conventional Tx: intermediate intensity to all
⇒ 10% suffer relapse
⇒ 50% suffer side effects
⇒ costs US$150m/yr
Our optimized Tx:
– high intensity to 10%
– intermediate intensity to 40%
– low intensity to 50%
– costs US$100m/yr
High cure rate of 80%
– Less relapse
– Fewer side effects
– Save US$51.6m/yr
(Copyright © 2004 by Jinyan Li and Limsoon Wong)

Now, the bad news...

Percentage of Overlapping Genes
Low % of overlapping genes from different experiments in general
– Prostate cancer: Lapointe et al, 2004; Singh et al, 2002
– Lung cancer: Garber et al, 2001; Bhattacharjee et al, 2001
– DMD: Haslett et al, 2002; Pescatori et al, 2007
[Table: Datasets / DEG / POG for Prostate Cancer, Lung Cancer and DMD, three top-ranked DEG cutoffs each; values not recoverable]
Zhang et al, Bioinformatics, 2009
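The table values are missing from the transcript, but the quantity itself is easy to illustrate. The sketch below assumes a simple definition of POG: the number of genes shared by the top-N differentially expressed gene (DEG) lists of two experiments, divided by N. The exact definition used by Zhang et al may differ, and the gene identifiers are placeholders.

```python
def percentage_overlapping_genes(ranked_a, ranked_b, top_n):
    """Percentage of genes shared by the top-N entries of two ranked gene lists."""
    top_a = set(ranked_a[:top_n])
    top_b = set(ranked_b[:top_n])
    return 100.0 * len(top_a & top_b) / top_n

# Two hypothetical DEG rankings for the same disease from different experiments.
experiment_1 = ["G1", "G2", "G3", "G4", "G5"]
experiment_2 = ["G3", "G9", "G1", "G7", "G8"]

for n in (3, 5):
    print(n, percentage_overlapping_genes(experiment_1, experiment_2, n))
```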

Gene Regulatory Circuits
Each disease subtype has an underlying cause
There is a unifying biological theme for genes that are truly associated with a disease subtype
Uncertainty in selected genes can be reduced by considering biological processes of the genes
The unifying biological theme is the basis for inferring the underlying cause of a disease subtype

Towards More Meaningful Genes
– ORA: Khatri et al, Genomics, 2002
– FCS: Pavlidis & Noble, PSB 2002
– GSEA: Subramanian et al, PNAS, 2005
– Pathway Express: Draghici et al, Genome Res, 2007

All of these newer methods rely on gene group or pathway information. But how good are the available sources of pathway information?

Comparing Pathway Sources: Comprehensiveness, Consistency, & Compatibility

Data Sources
KEGG
– Curated by a single lab
– Long famous history
– Used by many people
Wikipathways
– Community effort
– New curation model
Ingenuity
– Commercial effort
– Used by many biopharmas

Low Comprehensiveness of Pathway Sources
[Figure panels: # of pathways; # of gene pairs; # of genes]

Low Consistency of Pathway Sources
[Figure panels: gene pair overlap and gene overlap, each shown for Wiki vs KEGG, Wiki vs Ingenuity, and KEGG vs Ingenuity]
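The figure is not recoverable, but the two measures it compares can be sketched. The example below assumes that a pathway is stored as a list of undirected gene pairs and that overlap is reported as the size of the intersection divided by the size of the smaller set. Both the representation and the normalisation are assumptions, and the gene pairs are illustrative rather than actual database content.

```python
def normalise_pairs(pairs):
    """Treat each gene pair as unordered so (A, B) and (B, A) compare equal."""
    return {frozenset(pair) for pair in pairs}

def genes_of(pairs):
    """All genes that appear in at least one pair."""
    return {gene for pair in pairs for gene in pair}

def overlap_percentage(set_a, set_b):
    """Shared items as a percentage of the smaller set (assumed normalisation)."""
    if not set_a or not set_b:
        return 0.0
    return 100.0 * len(set_a & set_b) / min(len(set_a), len(set_b))

# The same named pathway as it might appear in two sources (illustrative only).
source_1 = normalise_pairs([("CASP8", "BID"), ("BID", "BAX"), ("BAX", "CYCS")])
source_2 = normalise_pairs([("BID", "CASP8"), ("CASP8", "CASP3"), ("FADD", "CASP8")])

print("gene overlap %:", overlap_percentage(genes_of(source_1), genes_of(source_2)))
print("gene-pair overlap %:", overlap_percentage(source_1, source_2))
```

In this toy case the gene overlap is higher than the gene-pair overlap: two sources can agree on which genes belong to a pathway while disagreeing on how those genes are connected.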

Example: Apoptosis Pathway

Would Unifying Pathway Sources Help?
Incompatibility issues!
– Data extraction method variations
– Format variations
– Data differences
– Gene/GeneID name differences
– Pathway name differences

The preceding analyses hide an intricate issue... The same pathways in the different sources are often given different names. So how do we even know two pathways are the same and should be compared / merged?

Intricacy of Pathway Matching

Possible Ways to Match Pathways
Match based on name
– Pathways with similar names should be the same pathway
– But annotations are very noisy
⇒ Likely to mismatch pathways?
⇒ Likely to match too many pathways?
Are the following good alternative approaches?
– Match based on overlap of genes
– Match based on overlap of gene pairs

Matching Pathways by Name
LCS procedure
– Given pathway X in db A
– Sort pathways in db B by “longest common substring” with X
– Manually scan the ranked list to choose the closest nomenclatural match
Issue: Accuracy
– When LCS says two pathways are the same one, are they really the same?
Issue: Completeness
– When LCS says two pathways are different, are they really different?
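The ranking step of the LCS procedure can be sketched as follows. This is a minimal illustration, assuming case-insensitive comparison of pathway names and a standard dynamic-programming computation of the longest common substring; the pathway names and function names are hypothetical, and the manual scan of the ranked list is left to the curator.

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b, ignoring case."""
    a, b = a.lower(), b.lower()
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common substring that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def rank_by_lcs(query, candidates):
    """Sort candidate pathway names by decreasing LCS length with the query."""
    return sorted(candidates,
                  key=lambda name: longest_common_substring(query, name),
                  reverse=True)

# Pathway X in db A and some pathway names in db B (hypothetical names).
x = "Wnt signaling pathway"
db_b = ["Apoptosis", "Wnt Signaling Pathway and Pluripotency", "Cell cycle"]
for name in rank_by_lcs(x, db_b):
    print(longest_common_substring(x, name), name)
```

A long shared suffix such as " signaling pathway" can push an unrelated pathway high in the ranking, which is one reason the procedure ends with a manual scan rather than taking the top hit automatically.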

LCS vs Gene-Agreement Matching
Accuracy
– 94% of LCS matches are in the top 3 gene-agreement matches
– 6% of LCS matches are not in the top 3 gene-agreement matches, but their gene-pair agreement levels are higher
Completeness
– Let Pi be a pathway in db A for which LCS cannot find a match in db B
– Let Qi be the pathway in db B with the highest gene agreement to Pi
– Gene-pair agreement of Pi-Qi is much lower than that of pathway pairs matched by LCS
LCS is better than gene-agreement based matching!

LCS vs Gene-Agreement Matching
LCS consistently has higher gene-pair agreement
⇒ LCS is better than gene-agreement based matching!
[Figure: gene overlap percentage and gene-pair overlap percentage, LCS match vs gene-agreement match]

LCS vs Gene-Pair Agreement Matching
[Figure: LCS vs gene-pair overlap]
The 8 pathway pairs singled out by LCS
The 24 pathway pairs singled out by maximal gene-pair overlap
Note: We consider only pathway pairs that have an overlap of at least 20 reactions.

LCS vs Gene-Pair Agreement Matching
Gene-pair agreement matching will miss when
– Pathway P in db A has little overlap with pathway P in db B due to incompleteness of the db, even if the pathway name matches perfectly!
– Example: Wnt signaling pathway, VEGF signaling pathway, MAPK signaling pathway, etc. in KEGG don’t have the largest gene-pair overlap with the corresponding pathways in Wikipathways & Ingenuity
⇒ Bad for getting a more complete unified pathway P

LCS vs Gene-Pair Agreement Matching
Pathways having large gene-pair overlap are not necessarily the same pathway
Examples
– “Synaptic Long Term Potentiation” in Ingenuity vs “calcium signalling” in KEGG
– “PPAR-alpha/RXR-alpha Signaling” in Ingenuity vs “TGF-beta signaling pathway” in KEGG
⇒ Difficult to set a correct gene-pair overlap threshold to balance against false positive matches

Gene vs Gene Pair Agreement Matching
[Figure: % overlap with LCS vs top n% matching pathways based on gene / gene pair overlap]
Pathways with higher gene / gene-pair overlap have higher overlap with LCS
Gene-pair matching is better than gene matching
But neither is as good as LCS

PathwayAPI = KEGG + Wikipathways + Ingenuity
Having found a good way to match up pathways in different data sources, we proceeded to build a big unified pathway db...
Donny Soh, Difeng Dong, Yike Guo, Limsoon Wong. Consistency, Comprehensiveness, and Compatibility of Pathway Databases. BMC Bioinformatics, 11:449, September 2010.

More Consistent Disease Subnetworks

But these methods still don’t return the precise parts of a pathway that are significant...
Test whole gene group at a time
– ORA: Khatri et al, Genomics, 2002
– FCS: Pavlidis & Noble, PSB 2002
– GSEA: Subramanian et al, PNAS, 2005
Test a node and its immediate neighbours
– Pathway Express: Draghici et al, Genome Res, 2007

The SNet Method
Group samples into type D and ~D
Extract & score subnetworks for type D
– Get list of genes highly expressed in most D samples (these genes need not be differentially expressed!)
– Put these genes into pathways
– Locate connected components (i.e., candidate subnetworks) from these pathway graphs
– Score subnetworks on D samples and on ~D samples
For each subnetwork, compute the t-statistic on the two sets of scores
Determine significant subnetworks by permutations
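The extraction step above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: "highly expressed in most D samples" is modelled as expression at or above a hypothetical `level` in at least `min_fraction` of the D samples, a pathway is a list of undirected gene pairs, and the expression values are made up.

```python
from collections import defaultdict

def highly_expressed_genes(expr, samples, level, min_fraction):
    """Genes with expression >= level in at least min_fraction of the given samples."""
    keep = set()
    for gene, values in expr.items():
        hits = sum(1 for s in samples if values[s] >= level)
        if hits >= min_fraction * len(samples):
            keep.add(gene)
    return keep

def candidate_subnetworks(pathway_edges, genes):
    """Connected components of the pathway graph restricted to `genes`.

    Genes with no retained neighbour are not reported as components.
    """
    adj = defaultdict(set)
    for u, v in pathway_edges:
        if u in genes and v in genes:
            adj[u].add(v)
            adj[v].add(u)
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components

# Hypothetical expression values for three type-D samples.
expr = {
    "A": {"d1": 9, "d2": 8, "d3": 9}, "B": {"d1": 8, "d2": 9, "d3": 2},
    "C": {"d1": 9, "d2": 9, "d3": 9}, "X": {"d1": 1, "d2": 2, "d3": 1},
    "E": {"d1": 8, "d2": 8, "d3": 8}, "F": {"d1": 9, "d2": 1, "d3": 9},
}
edges = [("A", "B"), ("B", "C"), ("C", "X"), ("E", "F")]
high = highly_expressed_genes(expr, ["d1", "d2", "d3"], level=7, min_fraction=0.6)
print(candidate_subnetworks(edges, high))
```

Note that the gene list is built from the D samples only, so a gene can enter a candidate subnetwork without being differentially expressed between D and ~D, as the slide points out.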

SNet: Extract Subnetworks
Genes highly expressed in many type-D samples

SNet: Score Subnetworks

SNet: Significant Subnetworks
Randomize patient samples many times
Get t-scores for subnetworks from the randomizations
Use these t-scores to establish the null distribution
Filter for significant subnetworks from real samples
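A permutation test of this kind can be sketched as below. The transcript does not say how a subnetwork is scored on a sample, so the per-sample scores are taken as given inputs. The Welch-style t-statistic, the one-sided comparison, the number of permutations and the toy scores are all assumptions made for illustration.

```python
import math
import random
from statistics import mean, variance

def t_stat(a, b):
    """Difference of group means over its standard error; 0.0 if the error is zero."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def permutation_p_value(scores_d, scores_not_d, n_perm=1000, seed=0):
    """Observed t-statistic and its one-sided p-value under shuffled sample labels."""
    rng = random.Random(seed)
    observed = t_stat(scores_d, scores_not_d)
    pooled = list(scores_d) + list(scores_not_d)
    n_d = len(scores_d)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomise which samples are labelled D
        if t_stat(pooled[:n_d], pooled[n_d:]) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical scores of one subnetwork on D and ~D samples.
scores_d = [5.1, 4.8, 5.5, 5.0, 5.3]
scores_not_d = [1.2, 0.9, 1.5, 1.1, 1.4]
print(permutation_p_value(scores_d, scores_not_d))
```

A subnetwork would be kept when its p-value falls below a chosen significance threshold; the threshold itself is not stated in the transcript.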

Let’s see whether SNet gives us subnetworks that are
(i) more consistent between datasets of the same types of disease samples
(ii) larger and more meaningful

Recall Examples from “Bad News”
Low % of overlapping genes from different experiments in general
– Prostate cancer: Lapointe et al, 2004; Singh et al, 2002
– Lung cancer: Garber et al, 2001; Bhattacharjee et al, 2001
– DMD: Haslett et al, 2002; Pescatori et al, 2007
[Table: Datasets / DEG / POG for Prostate Cancer, Lung Cancer and DMD, three top-ranked DEG cutoffs each; values not recoverable]
Zhang et al, Bioinformatics, 2009

Better Subnetwork Overlap
For each disease, take the significant subnetworks from one dataset and see if they are also significant in the other dataset

Better Gene Overlaps
For each disease, take significant subnetworks extracted independently from both datasets and see how much their genes overlap

Larger Subnetworks

Key Insight #1
Genes A, B, C are high in phenotype D
A is high in phenotype ~D but B and C are not
[Figure: subnetwork of genes A, B, C]
Conventional techniques: Gene B and Gene C are selected. Possible incorrect postulation of mutations in genes B and C
SNet does not require all the genes in the subnet to be differentially expressed
It only requires the subnet as a whole to be differentially expressed
Able to capture the entire relationship, postulating a mutation in gene A

Key Insight #2
A branch within a pathway consisting of genes A, B, C, D and E is high in phenotype D
Genes C, D and E are not high in phenotype ~D
30 other genes are not differentially expressed
[Figure: pathway with genes A, B, C, D, E and 30 other genes]
Conventional techniques: Entire subnetwork is likely to be missed
SNet: Able to capture the entire subnetwork branch within the pathway

Key Insight #3
Genes A, B and C are present in two separate pathways
A, B and C are high in phenotype D, but not high in phenotype ~D
[Figure: genes A, B, C in Pathway 1 and in Pathway 2]
Conventional techniques: Both pathways are scored equally, so both are selected, resulting in pathway 2 being a false positive
SNet: Able to select only pathway 1, which has the relevant relationship

Remarks

What have we learned?
Significant lack of concordance between db’s
– Level of consistency for genes is 0% to 88%
– Level of consistency for gene pairs is 0% to 61%
– Most db’s contain less than half of the pathways in other db’s
Matching pathways by name is better than matching by gene overlap or gene-pair overlap
SNet method yields more consistent and larger disease subnetworks

Acknowledgements
A*STAR AIP scholarship
A*STAR SERC PSF grant
Difeng Dong, Donny Soh, Yike Guo

References
Eng-Juh Yeoh, Mary E. Ross, Sheila A. Shurtleff, W. Kent Williams, Divyen Patel, Rami Mahfouz, Fred G. Behm, Susana C. Raimondi, Mary V. Relling, Anami Patel, Cheng Cheng, Dario Campana, Dawn Wilkins, Xiaodong Zhou, Jinyan Li, Huiqing Liu, Chin-Hon Pui, William E. Evans, Clayton Naeve, Limsoon Wong, James R. Downing. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1, March 2002.
Donny Soh, Difeng Dong, Yike Guo, Limsoon Wong. Enabling More Sophisticated Gene Expression Analysis for Understanding Diseases and Optimizing Treatments. ACM SIGKDD Explorations, 9(1):3-14, June.
Donny Soh, Difeng Dong, Yike Guo, Limsoon Wong. Consistency, Comprehensiveness, and Compatibility of Pathway Databases. BMC Bioinformatics, 11:449, September 2010.
Donny Soh. Understanding Pathways. PhD thesis, Imperial College London, December 2010.