Metabolomic Data Processing & Statistical Analysis

Slides:



Advertisements
Similar presentations
Institute on Research and Statistics, Sacramento 04/08/04
Advertisements

PCA for analysis of complex multivariate data. Interpretation of large data tables by PCA In industry, research and finance the amount of data is often.
Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
Pattern Recognition for the Natural Sciences Explorative Data Analysis Principal Component Analysis (PCA) Lutgarde Buydens, IMM, Analytical Chemistry.
Methods: Metabolomics Workflow Introduction Figure 1a: 1 H NMR spectrum of blood serum sample from a breast cancer patient. Results The emerging area of.
Dimension reduction (1)
An Overview of Machine Learning
Correlation Aware Feature Selection Annalisa Barla Cesare Furlanello Giuseppe Jurman Stefano Merler Silvano Paoli Berlin – 8/10/2005.
Proposal for a Standard Representation of the Results of GC-MS Analysis: A Module for ArMet Helen Fuell 1, Manfred Beckmann 2, John Draper 2, Oliver Fiehn.
© 2005 The McGraw-Hill Companies, Inc., All Rights Reserved. Chapter 14 Using Multivariate Design and Analysis.
Principal Component Analysis
Microarray Data Preprocessing and Clustering Analysis
CALIBRATION Prof.Dr.Cevdet Demir
Gene Set Analysis 09/24/07. From individual gene to gene sets Finding a list of differentially expressed genes is only the starting point. Suppose we.
Lecture 24: Thurs. Dec. 4 Extra sum of squares F-tests (10.3) R-squared statistic (10.4.1) Residual plots (11.2) Influential observations (11.3,
Exploring Microarray data Javier Cabrera. Outline 1.Exploratory Analysis Steps. 2.Microarray Data as Multivariate Data. 3.Dimension Reduction 4.Correlation.
Microarray analysis 2 Golan Yona. 2) Analysis of co-expression Search for similarly expressed genes experiment1 experiment2 experiment3 ……….. Gene i:
Significance Tests P-values and Q-values. Outline Statistical significance in multiple testing Statistical significance in multiple testing Empirical.
Linear Regression/Correlation
Analyzing Metabolomic Datasets Jack Liu Statistical Science, RTP, GSK
Microarray Gene Expression Data Analysis A.Venkatesh CBBL Functional Genomics Chapter: 07.
Practical statistics for Neuroscience miniprojects Steven Kiddle Slides & data :
1 Statistical Tools for Multivariate Six Sigma Dr. Neil W. Polhemus CTO & Director of Development StatPoint, Inc.
Proteomics Informatics – Data Analysis and Visualization (Week 13)
A Multivariate Biomarker for Parkinson’s Disease M. Coakley, G. Crocetti, P. Dressner, W. Kellum, T. Lamin The Michael L. Gargano 12 th Annual Research.
Practical Issues in Microarray Data Analysis Mark Reimers National Cancer Institute Bethesda Maryland.
Metabolomics 5/2/2014. ‘Omics Family Tree W. M. Claudino, et al., Journal of Clinical Oncology, 2007, 25(19), pp /2/2014.
2007 GeneSpring MS GeneSpring for Metabolite BioMarker Analysis using Mass Spectrometry data Agilent Q-TOF VIP Visit Jan 16-17, 2007 Santa Clara, CA Thon.
Analysis and Management of Microarray Data Dr G. P. S. Raghava.
Essential Statistics in Biology: Getting the Numbers Right
CSCE555 Bioinformatics Lecture 16 Identifying Differentially Expressed Genes from microarray data Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun.
ASCA: analysis of multivariate data from an experimental design, Biosystems Data Analysis group Universiteit van Amsterdam.
Dr Paul Lewis Lecturer in Bioinformatics Lecturer in Bioinformatics Cardiff University Cardiff University Biostatistics & Bioinformatics Unit Biostatistics.
Analysis of Complex Proteomic Datasets Using Scaffold Free Scaffold Viewer can be downloaded at:
Controlling FDR in Second Stage Analysis Catherine Tuglus Work with Mark van der Laan UC Berkeley Biostatistics.
Microarray data analysis David A. McClellan, Ph.D. Introduction to Bioinformatics Brigham Young University Dept. Integrative Biology.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
Metabolomics Metabolome Reflects the State of the Cell, Organ or Organism Change in the metabolome is a direct consequence of protein activity changes.
© Copyright McGraw-Hill Correlation and Regression CHAPTER 10.
Introduction to Statistical Analysis of Gene Expression Data Feng Hong Beespace meeting April 20, 2005.
This supervised learning technique uses Bayes’ rule but is different in philosophy from the well known work of Aitken, Taroni, et al. Bayes’ rule: Pr is.
CLASSIFICATION. Periodic Table of Elements 1789 Lavosier 1869 Mendelev.
Academic Research Academic Research Dr Kishor Bhanushali M
PATTERN RECOGNITION : CLUSTERING AND CLASSIFICATION Richard Brereton
Nuria Lopez-Bigas Methods and tools in functional genomics (microarrays) BCO17.
© 2006 by The McGraw-Hill Companies, Inc. All rights reserved. 1 Chapter 12 Testing for Relationships Tests of linear relationships –Correlation 2 continuous.
Introduction to Microarrays Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics
Innovative Paths to Better Medicines Design Considerations in Molecular Biomarker Discovery Studies Doris Damian and Robert McBurney June 6, 2007.
Computational Approaches for Biomarker Discovery SubbaLakshmiswetha Patchamatla.
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
Analyzing Expression Data: Clustering and Stats Chapter 16.
Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.
Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.
Applying MetaboAnalyst
Gist 2.3 John H. Phan MIBLab Summer Workshop June 28th, 2006.
Biostatistics Regression and Correlation Methods Class #10 April 4, 2000.
Multivariate statistical methods. Multivariate methods multivariate dataset – group of n objects, m variables (as a rule n>m, if possible). confirmation.
Canadian Bioinformatics Workshops
Chapter 14 EXPLORATORY FACTOR ANALYSIS. Exploratory Factor Analysis  Statistical technique for dealing with multiple variables  Many variables are reduced.
Strategies for Metabolomic Data Analysis Dmitry Grapov, PhD.
Advanced Strategies for Metabolomic Data Analysis Dmitry Grapov, PhD.
Canadian Bioinformatics Workshops
1 C.A.L. Bailer-Jones. Machine Learning. Data exploration and dimensionality reduction Machine learning, pattern recognition and statistical data modelling.
JMP Discovery Summit 2016 Janet Alvarado
A new R package statTarget Hemi Luan Hong Kong Baptist University.
Introduction to Data Mining
Canadian Bioinformatics Workshops
Descriptive Statistics vs. Factor Analysis
Feature extraction and alignment for LC/MS data
Sera metabolite profiles of patients with RA discriminate rituximab responders and non-responders. Sera metabolite profiles of patients with RA discriminate.
Presentation transcript:

Metabolomic Data Processing & Statistical Analysis The Fifth International Conference of Metabolomic Society Metabolomic Data Processing & Statistical Analysis Jianguo (Jeff) Xia Dr. David Wishart Lab University of Alberta, Canada

Outline Overview of procedures for metabolomic studies Introduction to different data processing & statistical methods MetaboAnalyst – a web service for metabolomic data processing, analysis and annotation Conclusions & future directions The Fifth International Conference of Metabolomic Society

A data-centric overview of metabolomic studies 1. Data Collection 2. Data Processing 3. Data Analysis 4. Data Interpretation The Fifth International Conference of Metabolomic Society

Data collection Biological Samples → Spectra Separation Techniques Gas Chromatography (GC)‏ Liquid Chromatography (LC)‏ Capillary Electrophoresis (CE)‏ Detection Techniques Nuclear Magnetic Resonance Spectroscopy (NMR)‏ Mass Spectrometry (MS) Hyphenated Techniques Gas Chromatography - Mass Spectrometry (GC-MS) Liquid Chromatography -Mass Spectrometry (LC-MS) Liquid Chromatography - Nuclear Magnetic Resonance (LC-NMR) The Fifth International Conference of Metabolomic Society

Data processing Quantitative Chemometric Raw Spectra → Data Matrix Compound concentration data; Involving compound identification & quantification; Currently labor intensive with a lot of manual efforts Chemometric Spectral bins (NMR, Direct injection–MS) Peak lists (LC/GC – MS) Largely automated process The Fifth International Conference of Metabolomic Society

Data analysis Extract important features/patterns Biomarker discovery Exploratory Analysis Data overview Outlier detection Grouping patterns Biomarker discovery To identify metabolites that are significantly different between groups Classification To build a model for the prediction of unlabeled new samples The Fifth International Conference of Metabolomic Society

Data interpretation Features/patterns → biological knowledge Mainly a manual process Require domain expert knowledge Tools are coming: Comprehensive metabolite databases Network visualization Pathway analysis The Fifth International Conference of Metabolomic Society

Data processing & normalization 1. Data Collection 2. Data Processing 3. Data Analysis 4. Data Interpretation The Fifth International Conference of Metabolomic Society

Data processing (I) Purposes: To convert different metabolomic data into data matrices suitable for varieties of statistical analysis Quality control To check for inconsistencies To deal with missing values To remove noises The Fifth International Conference of Metabolomic Society

Data processing (II) Compound concentrations Nothing to do Spectral bins Removing baseline noises Peak list data Peak alignment GC/LC-MS spectra Peak picking A data matrix with rows represent samples and columns represents features (concentrations/intensities/ areas) The Fifth International Conference of Metabolomic Society

Data normalization Purposes: To remove systematic variation between experimental conditions unrelated to the biological differences (i.e. dilutions, mass) Sample normalization (row-wise) To bring variances of all features close to equal Feature normalization (column-wise) The Fifth International Conference of Metabolomic Society

Sample normalization By sum or total peak area By a reference compound (i.e. creatinine, internal standard) By a reference sample a.k.a “probabilistic quotient normalization” (Dieterle F, et al. Anal. Chem. 2006) By dry mass, volume, etc The Fifth International Conference of Metabolomic Society

Feature normalization Log transformation Scaling -- van den Berg RA, et al. BMC Genomics (2006) 7:142 The Fifth International Conference of Metabolomic Society

Statistical Analysis 1. Data Collection 2. Data Processing 3. Data Analysis 4. Data Interpretation The Fifth International Conference of Metabolomic Society

Data Analysis Univariate Chemometrics Fold change analysis, T-tests Volcano plots Chemometrics Principal component analysis (PCA) Partial least squares - discriminant analysis (PLS-DA) High-dimensional feature selection Significance analysis of microarrays (and metabolites) (SAM) Empirical Bayesian analysis of microarrays (and metabolites) (EBAM) Clustering Dendrogram & Heatmap K-means, Self Organizing Map (SOM) Classification Random Forests Support Vector Machine (SVM) The Fifth International Conference of Metabolomic Society

Volcano-plot Arrange features along dimensions of statistical (p-values from t-tests) and biological (fold changes) changes; The assumption is that features with both statistical and biological significance are more likely to be true positive. Widely used in microarray and proteomics data analysis The Fifth International Conference of Metabolomic Society

PLS-DA De facto standard for chemometric analysis A supervised method that uses multiple linear regression technique to find the direction of maximum covariance between a data set (X) and the class membership (Y) Extracted features are in the form of latent variables (LV) The Fifth International Conference of Metabolomic Society

PLS-DA for feature selection Variable importance in projection or VIP score A weighted sum of squares of the PLS loadings. The weights are based on the amount of explained Y-variance in each dimension. Based on the weighted sum of PLS-regression coefficients. The weights are a function of the reduction of the sums of squares across the number of PLS components. The Fifth International Conference of Metabolomic Society

Over fitting problem PLS-DA tend to over fit data It will try to separate classes even there is no real difference between them! Westerhuis, C.A., et al. (2007) Assessment of PLSDA cross validation. Metabolomics, 4, 81-89. Require more rigorous validation For example, to use permutations to test the significance of class separations The Fifth International Conference of Metabolomic Society

Permutation tests Use the same data set with its class labels reassigned randomly. Build a new model and measure its performance (B/W) Repeat many times to estimate the distribution of the performance measure (not necessarily follows a normal distribution). Compare the performance using the original label and the performance based on the randomly labeled data The Fifth International Conference of Metabolomic Society

Multi-testing problem P-value appropriate to a single test situation is inappropriate to presenting evidence for a set of changed features. Adjusting p-values Bonferroni correction Holm step-down procedure Using false discovery rate (FDR) A percentage indicating the expected false positives among all features predicted to be significant More powerful, suitable for multiple testing The Fifth International Conference of Metabolomic Society

Significance Analysis of Microarray (and Metabolomics) A well-established method widely used for identification of differentially expressed genes in microarray experiments Use moderated t-tests to computes a statistic dj for each gene j, which measures the strength of the relationship between gene expression (X) and a response variable (Y). Uses non-parametric statistics by repeated permutations of the data to determine if the expression of any gene is significant related to the response. The Fifth International Conference of Metabolomic Society

Clustering Unsupervised learning Good for data overview Use some sort of distance measures to group samples PCA Heatmap & dendrogram SOM & K-means The Fifth International Conference of Metabolomic Society

Classification Supervised learning Many traditional multivariate statistical methods are not suitable for high-dimensional data, particularly small sample size with large feature numbers New or improved methods, developed in the past decades for microarray data analysis Support vector machine (SVM) Random Forests The Fifth International Conference of Metabolomic Society

To develop a pipeline service for metabolomic studies The Fifth International Conference of Metabolomic Society

Microarray data analysis pipeline The Fifth International Conference of Metabolomic Society

A proposed pipeline for metabolomics studies Bijlsma S et al. Large-scale human metabolomics studies: a strategy for data (pre-) processing and validation. Anal. Chem. (2006) 78:567–574 The Fifth International Conference of Metabolomic Society

MetaboAnalyst -- A web service for high-throughput metabolomic data processing, analysis and annotation -- Implementation of all the methods mentioned in the form of user-friendly web interfaces -- www.metaboanalyst.ca The Fifth International Conference of Metabolomic Society

The Fifth International Conference of Metabolomic Society GC/LC-MS raw spectra MS / NMR peak lists MS / NMR spectra bins Metabolite concentrations Peak detection Retention time correction Baseline filtering Peak alignment Data integrity check Missing value imputation Data normalization Row-wise normalization (4) Column-wise normalization (4) Data analysis Univariate analysis (3) Dimension reduction (2) Feature selection (2) Cluster analysis (4) Classification (2) Data annotation Peak searching (3) Pathway mapping Download Processed data PDF report Images The Fifth International Conference of Metabolomic Society

Implementation features MetaboAnalyst Latest Java Server Faces (JSF) technology for web interface design R (esp. Bioconductor packages) for backend statistical analysis & visualization Using resources in HMDB for peak annotation, compound identification, as well as pathway mapping Comprehensive analysis report generation & documentation The Fifth International Conference of Metabolomic Society

The Fifth International Conference of Metabolomic Society

Over 1,200 visits since publication (~15 / day) Some usage statistics Over 1,200 visits since publication (~15 / day) The Fifth International Conference of Metabolomic Society

Current status Differential Analysis Class Prediction Class Discovery (Biomarker Identification) Class Prediction (Supervised learning) Class Discovery (Clustering) Pathway Analysis The Fifth International Conference of Metabolomic Society

Challenges & future directions Unbiased and comprehensive survey of metabolome NMR only able to detect more abundant compound species (> 1 µmol) MS are usually optimized to detect compounds of certain classes Systematic classification of compounds (ontology) More efficient pathway analysis & visualization The Fifth International Conference of Metabolomic Society

Acknowledgement Dr. David Wishart Dr. Nick Psychogios Nelson Young Alberta Ingenuity Fund (AIF) The Human Metabolome Project (HMP) University of Alberta The Fifth International Conference of Metabolomic Society