NISS Metabolomics Workshop, 2005: Integrative Analysis of High Dimensional Gene Expression, Metabolite and Blood Chemistry Data. Kwan R. Lee, Ph.D.


NISS Metabolomics Workshop
Integrative Analysis of High Dimensional Gene Expression, Metabolite and Blood Chemistry Data
Kwan R. Lee, Ph.D., Lei A. Zhu, Amit Bhattacharyya, J. Alan Menius
Biomedical Data Sciences, GlaxoSmithKline

Overview
– Systems Biology
– Challenges for Statisticians
– Possible solutions
– Example of integrative data analysis
– Summary and discussion

Of mice and men?

Integrate knowledge and technologies
– Reduce attrition by running coordinated studies in animal and man

Focusing on one platform may miss an obvious signal!

How can efficacy failures be attacked?
– Classic Phenotypic Approach: Animal Phenotype to Human Phenotype. Few data to support analogy.
– Integrative Biology: Animal Phenotype and Animal Biomarker Fingerprint to Human Phenotype and Human Biomarker Fingerprint. Many data to support analogy.

‘Systems Biology’ approach to drug discovery

Experimental Platforms: Non-omics and Omics, what are they?
[Figure: platforms H NMR metabolites, Affy Transcriptome, LC-MS Lipid, LC-MS metabolites and “Non-omic” markers; groups labelled Veh, A, B, C, D; Normal vs. Disease]

Experimental Platforms: Non-omics and Omics, what are they? (cont.)
– Traditional Blood Chemistry (non-omics)
– Gene Expression (transcriptomics)
– Metabolite (metabonomics)
– Lipid (lipomics)
– Protein (proteomics)

Five Challenges
1. Data Pre-processing
2. High Dimensionality
3. Multiple Testing for Marker Selection
4. Data Integration
5. Validation of the Prediction Model

Challenge #1: Data Pre-processing
Peak Alignment (NMR, LC/MS)
Normalization (Gene Chip, NMR, LC/MS data)
– Why? To remove systematic bias in the data
– Normalization within the platform makes data comparable across samples

Challenge #2: High Dimensionality
# of subjects << # of variables
– Blood Chemistry: 9 markers
– Gene Expression: 22,000 probe sets
– Lipid LC/MS: 2,000 peaks
– Metabolite LC/MS: 3,000 peaks
– NMR: 500 buckets
[Data matrix: rows Animal 1, Animal 2, ..., Animal 100; columns probe set 1 ... 22,000, Lipid 1 ... 2,000, Metabolite 1 ... 3,000, NMR 1 ... 500, Choles, Trig, ...]

Challenge #3: Multiple Testing in Variable Selection
[Figure: Noise + Signal = Signal+Noise, shown under No Adjustment for Multiple Testing, FWER Adjustment, and FDR]
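The contrast between no adjustment, FWER adjustment and FDR can be illustrated with a small sketch. It assumes Bonferroni as the FWER procedure and Benjamini-Hochberg as the FDR procedure (the slide names neither), and the p-values are made up for illustration.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """FWER control: reject only p-values at or below alpha / m."""
    p = np.asarray(pvals, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg(pvals, alpha=0.05):
    """FDR control: reject the k smallest p-values, where k is the largest
    rank i such that p_(i) <= alpha * i / m."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])
        reject[order[: k + 1]] = True
    return reject

# Hypothetical p-values for 10 markers (illustrative only).
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_unadjusted = int(np.sum(np.asarray(pvals) <= 0.05))
n_fdr = int(benjamini_hochberg(pvals).sum())
n_fwer = int(bonferroni(pvals).sum())
print(n_unadjusted, n_fdr, n_fwer)
```

On these values the unadjusted rule selects the most markers, Bonferroni the fewest, and Benjamini-Hochberg falls in between, which is the usual ordering when the goal is marker selection rather than strict error control.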

Challenge #4: Data integration
[Figure: platforms H NMR metabolites, Affy Transcriptome, LC-MS Lipid, LC-MS metabolites and “Non-omic” markers; groups labelled Veh, A, B, C, D; Normal vs. Disease]

Challenge #4: Data integration (cont.)
– Integration Approach 1: Platform A (20000s var.) and Platform B (1000s var.) are combined directly into Combined Data.
– Integration Approach 2: Platform A (20000s var.) and Platform B (1000s var.) first go through Dimension Reduction (e.g. variable selection), giving Platform A (1000s var.) and Platform B (100s var.), which are then merged into Combined Data.

Challenge #4: Data integration, Example 1
Integration approach 1: Simple data integration
– When the platform data are simply combined, the platform with the largest amount of data and variability will dominate the other platforms.

PCA on Non-omics, Transcriptomics, and Combined
[Figure panels: Non-omics (20), Transcriptomics (12,488), Combined (12,508)]
The Combined plot is a mirror image of the Transcriptomics plot: the Transcriptomics data dominate the Non-omics data.

PCA on Non-omics, Transcriptomics, and Combined
[Figure panels: Non-omics (20), Transcriptomics (20 PCs), Combined (40)]
Like a mirror image!

Challenge #4: Data integration, Example 2
Integration approach 2: Integrate on selected markers
– 9 blood chemistry markers, probe sets, metabolites
– Some platforms still contribute more selected markers than others
– How to weight different platforms appropriately? E.g. the 9 blood chemistry markers are known to relate to disease or drug
– Identify relationships among the probe sets, metabolites and blood chemistry markers in terms of biological pathways
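One common answer to the weighting question is block scaling: autoscale every variable, then divide each platform block by the square root of its number of variables so that every platform contributes the same total sum of squares to a combined PCA or PLS model. The sketch below is an assumption for illustration, not necessarily the weighting used in the talk; the block sizes and the helper names `autoscale` and `block_scale` are hypothetical.

```python
import numpy as np

def autoscale(X):
    """Center each column and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    sd = X.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # leave constant columns at zero after centering
    return (X - X.mean(axis=0)) / sd

def block_scale(blocks):
    """Give every platform block the same total sum of squares."""
    return np.hstack([autoscale(B) / np.sqrt(B.shape[1]) for B in blocks])

rng = np.random.default_rng(0)
n = 30
small = rng.normal(size=(n, 9))      # e.g. a 9-marker blood chemistry block
large = rng.normal(size=(n, 2000))   # e.g. a block of selected probe sets
combined = block_scale([small, large])

ss_small = float((combined[:, :9] ** 2).sum())
ss_large = float((combined[:, 9:] ** 2).sum())
print(ss_small, ss_large)  # both equal n - 1
```

After this scaling a 9-marker block and a 2000-marker block carry equal weight in the combined matrix; other weights (for example, up-weighting markers with known biological relevance) would slot into the same place.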

Principal Component Analysis (PCA)
Projection of 67 animals: 28 normal (black), 39 disease (red)
All markers used for projection (9 NO, 1991 TA, 115 MT)

Loading Plot

Partial Least Squares Discriminant Analysis (PLS-DA)
Disease group only: Vehicle vs. Drug

PLS-DA: Corresponding projection of all markers (9 NO, 1991 TA, 115 MT)
Which are important drug markers? (Drug vs. Veh)

Drug markers ranked by importance or by coefficients
– Marker importance is given by the variable importance on projection (VIP)
– Up or down regulation is given by the coefficients

Validation of the model: R2, Q2 and permutation tests (100 permutations, P < 0.01)

Variation explained by each platform
PLS-DA for prediction of 2 experimental groups: HFD, vehicle vs. HFD, drug treated
Q2(Y) = amount of variation among the 2 groups explained by the model (cross-validated)
[Table: variation explained by each platform; values not preserved]
The table is based on a 2-component model. If the 4th model uses more components, 91% of the variation in the data can be explained by 4 components.

Challenge #5: Validation of the Prediction Model
Correct way of doing cross-validation
– Especially when the variables are selected
Is your prediction accuracy significant?

Random Noise Data
– Simulate 20,000 marker columns of random noise for 100 patients, plus one additional column containing arbitrary class labels.
– Select the 5 marker columns most correlated with the class label.
– Build a prediction model for the class labels based on these 5 selected markers.
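A minimal sketch of this simulation, assuming standard normal noise, a balanced 50/50 label split, and a simple sign-weighted score as the prediction model (the talk itself goes on to use PLS-DA, so the classifier here is a stand-in):

```python
import numpy as np

rng = np.random.default_rng(1)
n_patients, n_markers, n_keep = 100, 20_000, 5

# Pure noise markers and arbitrary class labels: there is no real signal.
X = rng.normal(size=(n_patients, n_markers))
y = np.repeat([0.0, 1.0], n_patients // 2)

# Pearson correlation of every marker column with the class label.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Keep the 5 markers with the largest absolute correlation.
top = np.argsort(-np.abs(corr))[:n_keep]

# Stand-in model: sum the selected markers, oriented by the sign of their
# correlation, and classify by the sign of the score.
score = Xc[:, top] @ np.sign(corr[top])
pred = (score > 0).astype(float)
apparent_accuracy = float((pred == y).mean())

print(np.round(np.abs(corr[top]), 2), apparent_accuracy)
```

Because the model is evaluated on the same patients that were used to pick the markers, the apparent accuracy sits well above the 50% that pure noise deserves.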

PCA of Full Markers

PLS-DA on Random Noise Data
Running a full model in SIMCA does not yield a model: there is no significant Q2.
– The multivariate approach is conservative.
– Q2 measures prediction performance.
However, the software was then forced to fit a 6-component PLS-DA model (R2 = 1.0, Q2 = 0.225).

Full marker model PLS-DA

Was it real or by chance?

Select 5 Markers
The top 5 markers were selected using VIP from the over-fitted model, and PLS-DA was fitted again on the same data. Now R2 = 0.459 and Q2 = 0.348.

Good prediction from PLS-DA? (Q2 = 0.35)

Validated by permutation test? (Significance of Q2)

Selection Bias
When a prediction model is tested on the same data that were used to select the markers in the first place, selection bias makes the test error overly optimistic.
– Many publications have claimed that a small set of selected “genes” is highly predictive.
– IBI practice is to use a data set to select markers and then use the same data set to fit a prediction model based on the selected markers.

How to correct for selection bias?
Validation should be external to the feature selection process.
1. Independent test data set (hold-out data set) that was never used for feature selection.
2. External cross-validation (ECV): cross-validation of the prediction model is external to the selection process. In other words, make a new selection in each cross-validation round.

Externally Validated PLS: model and variable selection
Divide the data set randomly into d parts.
For ecv = 1 to d (hold out one part and use the other d-1 parts for modeling):
    For a = 1 to 10 (a is the number of components):
        Set k = total number of variables.
        Repeat until k = 2:
            Fit PLS model with the given a and k, PLS(a, k).
            Predict the hold-out set, compute PRESS(ecv, a, k) and save.
            Keep the top half of the variables by an appropriate statistic (coeff, VIP, t-ratio, etc.).
            Set k = k/2.
Compute PRESS(a, k) = sum over ecv of PRESS(ecv, a, k).
Compute Q2(a, k) = 1 - PRESS(a, k)/TSS.
Plot Q2(a, k) vs. log2(k).
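The procedure above can be sketched in NumPy as follows. This is an illustrative reading of the slide, not the authors' implementation: it assumes a single-response PLS (PLS1, NIPALS-style) written by hand, ranks variables by absolute regression coefficient (one of the statistics the slide lists), caps the number of components at 3 instead of 10 to keep the demo small, and also evaluates the final k = 2 model. The function names and the simulated data are hypothetical.

```python
import numpy as np

def pls1_coef(X, y, a):
    """Fit PLS1 with up to `a` components; return (coef, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yc = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(min(a, X.shape[1], X.shape[0] - 1)):
        w = Xk.T @ yc
        norm = np.linalg.norm(w)
        if norm < 1e-12:
            break
        w = w / norm
        t = Xk @ w                       # scores
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        W.append(w); P.append(p); q.append(yc @ t / tt)
        Xk = Xk - np.outer(t, p)         # deflate X
    if not W:
        return np.zeros(X.shape[1]), x_mean, y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q), x_mean, y_mean

def external_cv_q2(X, y, d=5, max_a=3, seed=0):
    """Q2(a, k) where variable selection is redone inside every CV round."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), d)
    press = {}
    for hold in folds:
        train = np.setdiff1d(np.arange(len(y)), hold)
        for a in range(1, max_a + 1):
            keep = np.arange(X.shape[1])
            while True:
                coef, xm, ym = pls1_coef(X[train][:, keep], y[train], a)
                pred = (X[hold][:, keep] - xm) @ coef + ym
                key = (a, len(keep))
                press[key] = press.get(key, 0.0) + np.sum((y[hold] - pred) ** 2)
                if len(keep) <= 2:
                    break
                # Keep the top half by |coefficient|, using training data only.
                top = np.argsort(-np.abs(coef))[: len(keep) // 2]
                keep = keep[np.sort(top)]
    tss = np.sum((y - y.mean()) ** 2)
    return {key: 1.0 - val / tss for key, val in press.items()}

# Hypothetical data: 64 variables, only the first 4 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 64))
y = X[:, :4].sum(axis=1) + 0.5 * rng.normal(size=40)
q2 = external_cv_q2(X, y)
best = max(q2, key=q2.get)
print(best, round(q2[best], 3))
```

The key design point is that the halving of the variable set happens on the training part only, so the held-out part never influences which variables reach the model that predicts it.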

Simulation of 2000 Random Data (R. Simon)
– Random data matrix (… x 6000) with 10/10 class labels
– Repeat 2000 times
– Compute 3 different error rates:
  – Re-substitution (wrong)
  – Cross validation after selection (wrong)
  – Cross validation before selection (correct)
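The three error rates can be compared on random data with a short sketch. The 6000 variables and 10/10 class labels follow the slide; the nearest-centroid classifier, the choice of 10 selected variables, leave-one-out cross-validation and the reduced number of repeats (20 rather than 2000) are assumptions made to keep the example small.

```python
import numpy as np

def select_top(X, y, k):
    """Rank variables by standardized difference in class means."""
    diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    return np.argsort(-np.abs(diff / X.std(axis=0, ddof=1)))[:k]

def centroid_predict(Xtr, ytr, Xte):
    """Assign each test row to the class with the nearer centroid."""
    c0, c1 = Xtr[ytr == 0].mean(axis=0), Xtr[ytr == 1].mean(axis=0)
    d0 = ((Xte - c0) ** 2).sum(axis=1)
    d1 = ((Xte - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def three_error_rates(X, y, k=10):
    n = len(y)
    idx_all = select_top(X, y, k)            # selection on ALL samples
    resub = np.mean(centroid_predict(X[:, idx_all], y, X[:, idx_all]) != y)
    cv_after = cv_before = 0
    for i in range(n):
        tr = np.arange(n) != i
        te = ~tr
        # Wrong: variables were already chosen using the held-out sample.
        cv_after += centroid_predict(X[tr][:, idx_all], y[tr], X[te][:, idx_all])[0] != y[i]
        # Correct: redo the selection using the training samples only.
        idx = select_top(X[tr], y[tr], k)
        cv_before += centroid_predict(X[tr][:, idx], y[tr], X[te][:, idx])[0] != y[i]
    return resub, cv_after / n, cv_before / n

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 10)                    # 10/10 class labels
rates = np.mean(
    [three_error_rates(rng.normal(size=(20, 6000)), y) for _ in range(20)],
    axis=0,
)
print(dict(zip(["resubstitution", "cv_after_selection", "cv_before_selection"],
               np.round(rates, 3))))
```

Since the data are pure noise, only the third estimate should come out near or above 50% error; the first two look far better than chance because the selection step has already seen every sample.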

Results of 2000 Random Data

Permutation testing
Because of the high dimensionality of gene expression data, relatively small error rates may be achieved even for random data. A permutation test can be used to assess the significance of the classification results.
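A permutation test of this kind can be sketched as below: the class labels are shuffled many times, the whole modeling procedure is rerun on each shuffle, and the observed performance is compared with the resulting null distribution. The leave-one-out nearest-centroid accuracy used as the statistic, the data sizes and the number of permutations are assumptions for illustration; any cross-validated performance measure could be passed in instead.

```python
import numpy as np

def loocv_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    n, hits = len(y), 0
    for i in range(n):
        tr = np.arange(n) != i
        c0 = X[tr & (y == 0)].mean(axis=0)
        c1 = X[tr & (y == 1)].mean(axis=0)
        pred = int(((X[i] - c1) ** 2).sum() < ((X[i] - c0) ** 2).sum())
        hits += pred == y[i]
    return hits / n

def permutation_p_value(statistic, X, y, n_perm=200, seed=0):
    """P-value: share of label permutations doing at least as well."""
    rng = np.random.default_rng(seed)
    observed = statistic(X, y)
    null = np.array([statistic(X, rng.permutation(y)) for _ in range(n_perm)])
    return observed, (1 + np.sum(null >= observed)) / (1 + n_perm)

rng = np.random.default_rng(2)
y = np.repeat([0, 1], 10)

X_noise = rng.normal(size=(20, 50))          # no class difference at all
X_signal = X_noise.copy()
X_signal[y == 1, :10] += 1.5                 # real shift in 10 variables

acc_noise, p_noise = permutation_p_value(loocv_accuracy, X_noise, y)
acc_signal, p_signal = permutation_p_value(loocv_accuracy, X_signal, y)
print(acc_noise, p_noise, acc_signal, p_signal)
```

The `+ 1` in numerator and denominator counts the observed labeling as one of the permutations, so the p-value can never be exactly zero.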

Challenge #5: Validation of the Prediction Model (summary)
Correct way of doing cross-validation
– All the steps of the prediction modeling should be cross-validated.
– Each cross-validation step should start from scratch.
Is your prediction accuracy significant?
– Random data can give you low prediction error.
– Permutation tests, bootstrap aggregation

Summary and Discussion
– Recent technological advances present challenging and interesting biological data at the molecular level.
– Statistics and multivariate analysis play an important role in understanding and extracting knowledge from these types of data.
– Integrative analysis is even more challenging, and we presented some solutions to these challenges.
– There is plenty of room for improvement.

Acknowledgement
GlaxoSmithKline
– High Throughput Biology
– Biomedical Data Sciences
– Genomics and Proteomics Science
– Pathology, Cellular & Biochemical Toxicology
– Discovery IT

Data exploration: Present Challenges
“Data is an extremely valuable asset, but like a cash crop, unless harvested, it is wasted.” (Sid Adelman)