Structural Equation Modeling analysis for causal inference from multiple -omics datasets So-Youn Shin, Ann-Kristin Petersen, Christian Gieger, Nicole Soranzo.

Presentation transcript:


[J. L. Griffin and J. P. Shockcor (2004) Nature Reviews Cancer]

[K. Suhre, S.-Y. Shin, et al. (2011) Nature]

Integrative Analysis for multiple -omics
1. Motivations
– To dissect biological and genetic determinants of normal phenotypic variation and disease states
– To validate results of individual -omics levels by reducing false positives caused by technical and methodological biases
2. Several analytical challenges
– High dimensional, highly correlated datasets: Normalization and Missing value estimation/imputation; Biologically relevant dimension reduction
– Methodologies: Correlation → Causation; Linearity → Nonlinearity (e.g. interaction); Multiple testing correction, Validation and Replication

Causal inference
Study aim: To dissect mediation at serum lipid loci using metabolomics
DNA variation -> Metabolomics -> Lipids ?
[A.-K. Petersen, S.-Y. Shin, et al. (2011) Under Review]

Study design
Model selection (KORA, N=~1,800)
– Linear models on 95 LipidSNPs, 151 Metabolites (and ~10,000 ratios), 4 Lipids
– Significance thresholds: P ≤ 3.4x10^-6, P ≤ 8.7x10^-5, P ≤ 0.05
– [Diagram: pairwise associations among LipidSNP, Metabolite and Lipid]
Model testing (Structural Equation Modeling): LipidSNP -> Metabolite -> Lipid
Replication (TwinsUK, N=~800)
Metabolite -> PC: 50 Principal Components (97% Variance)
Model testing (Structural Equation Modeling): LipidSNP -> PC -> Lipid
Replication (TwinsUK, N=~800)
Interpretation of Principal Components
[A.-K. Petersen, S.-Y. Shin, et al. (2011) Under Review]

Structural Equation Modeling
[Path diagram with nodes M_A, A, M_B, B, M_P]
R package “sem”

Structural Equation Modeling
[Ten path diagrams, Model 1 to Model 10, each over the nodes MET, LIP and SNP]
All possible models -> Best fit (p-value, BIC)
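The "all possible models -> best fit" step can be sketched in a few lines. What follows is a minimal illustration, not the authors' analysis: it uses plain Python rather than the R package “sem”, it treats each candidate as a recursive linear-Gaussian path model fitted by ordinary least squares with the SNP taken as given, and it runs on simulated data. The function names, the four candidate models, the effect sizes and the sample size are all assumptions made for the example.

```python
import math
import random


def residual_variance(y, predictors):
    """Regress y on an intercept plus the given predictor columns (ordinary
    least squares) and return the maximum-likelihood residual variance."""
    n = len(y)
    cols = [[1.0] * n] + [list(p) for p in predictors]
    k = len(cols)
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    a = [[sum(ci[t] * cj[t] for t in range(n)) for cj in cols]
         + [sum(ci[t] * y[t] for t in range(n))] for ci in cols]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(k):
        p = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[p] = a[p], a[i]
        for r in range(k):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [vr - f * vi for vr, vi in zip(a[r], a[i])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    resid = [y[t] - sum(b * c[t] for b, c in zip(beta, cols)) for t in range(n)]
    return sum(e * e for e in resid) / n


def bic(data, parents):
    """BIC of a recursive path model. `parents` maps each modelled variable
    to the list of variables with an arrow pointing into it."""
    n = len(next(iter(data.values())))
    loglik, n_params = 0.0, 0
    for child, pa in parents.items():
        s2 = residual_variance(data[child], [data[p] for p in pa])
        loglik += -0.5 * n * (math.log(2 * math.pi * s2) + 1)
        n_params += len(pa) + 2  # slopes + intercept + residual variance
    return n_params * math.log(n) - 2 * loglik


# Simulated data (an assumption for illustration, not KORA or TwinsUK data):
# the true structure is SNP -> MET -> LIP.
rng = random.Random(0)
n = 500
snp = [float((rng.random() < 0.3) + (rng.random() < 0.3)) for _ in range(n)]
met = [1.0 * s + rng.gauss(0, 1) for s in snp]
lip = [0.7 * m + rng.gauss(0, 1) for m in met]
data = {"SNP": snp, "MET": met, "LIP": lip}

# Hypothetical candidate set; the slide enumerates ten models.
candidates = {
    "SNP -> MET -> LIP": {"MET": ["SNP"], "LIP": ["MET"]},
    "SNP -> LIP -> MET": {"LIP": ["SNP"], "MET": ["LIP"]},
    "MET <- SNP -> LIP": {"MET": ["SNP"], "LIP": ["SNP"]},
    "SNP -> MET -> LIP, SNP -> LIP": {"MET": ["SNP"], "LIP": ["SNP", "MET"]},
}
scores = {name: bic(data, parents) for name, parents in candidates.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: BIC = {score:.1f}")
```

Lower BIC means better fit after the penalty for extra parameters. On data simulated this way, the reversed model and the model without a MET -> LIP edge should score clearly worse than the mediation model. A full SEM package would add overall fit p-values and handle latent variables, which this sketch does not.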

Structural Equation Modeling
Assumptions
– Statistical assumptions (like any regression model)
– Causal assumptions (based on biological knowledge)
Pros
– Flexible hypotheses: both direct and indirect effects are allowed. (vs. Mendelian Randomization)
– A variable can be both predictor and response simultaneously. (vs. Bayesian network analysis)
Cons
– Nonlinearity cannot be detected.
– Hidden confounders or measurement errors can mislead causal inference. (as with biological experiments)
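To make "direct and indirect effects" concrete, here is a small sketch under stated assumptions: a single mediator, linear effects, least-squares estimates, and simulated data. The function name `mediation_effects` and all effect sizes are invented for the example and do not come from the slides. It estimates the path SNP -> MET (a), the path MET -> LIP adjusted for SNP (b), and the direct path SNP -> LIP adjusted for MET, then shows that the total effect splits into direct + a*b.

```python
import random


def cross_products(u, v):
    """Sum of centered cross-products of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((x - mu) * (y - mv) for x, y in zip(u, v))


def mediation_effects(snp, met, lip):
    """Least-squares path coefficients for SNP -> MET -> LIP with an
    additional direct SNP -> LIP path (closed form for two predictors)."""
    sxx, smm = cross_products(snp, snp), cross_products(met, met)
    sxm = cross_products(snp, met)
    sxy, smy = cross_products(snp, lip), cross_products(met, lip)
    det = sxx * smm - sxm ** 2
    a = sxm / sxx                            # SNP -> MET
    b = (sxx * smy - sxm * sxy) / det        # MET -> LIP, adjusted for SNP
    direct = (smm * sxy - sxm * smy) / det   # SNP -> LIP, adjusted for MET
    total = sxy / sxx                        # SNP -> LIP, unadjusted
    return {"a": a, "b": b, "direct": direct,
            "indirect": a * b, "total": total}


# Simulated data (assumed effect sizes, for illustration only).
rng = random.Random(1)
n = 1000
snp = [float((rng.random() < 0.3) + (rng.random() < 0.3)) for _ in range(n)]
met = [1.0 * s + rng.gauss(0, 1) for s in snp]
lip = [0.2 * s + 0.7 * m + rng.gauss(0, 1) for s, m in zip(snp, met)]

effects = mediation_effects(snp, met, lip)
for name, value in effects.items():
    print(f"{name}: {value:.3f}")
```

For least-squares estimates the identity total = direct + indirect holds exactly, so the printed values should add up apart from rounding. Whether the indirect part can be read causally still rests on the causal assumptions listed above.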

Causal inference
– We tested 95 loci associated with serum lipid levels.
– We applied SEM to test causal inference, on METs or PCs.
– 260 association sets met our criteria for significant edges in SNP -> MET -> Lipid at 3 loci (FADS1, GCKR, APOA1).
– METs and PCs showed similar results.
– We suggest that SEM is an appropriate statistical instrument to dissect the contribution of intermediate phenotypes to complex biological pathways.
DNA variation -> Metabolomics -> Lipids
[A.-K. Petersen, S.-Y. Shin, et al. (2011) Under Review]

Our ongoing project
TwinsUK
– ~600k SNPs
– ~48k Probes
– ~32k Metabolic traits
Overlapping N = ~600

Missing Values
Issues in multivariate analyses
Ignore vs. Impute
How to impute
– Impute with mean (row mean)
– K-nearest-neighbors (kNN)
– Transform based methods (SVD, Bayesian PCA)
BPCA and GMC (Gaussian mixtures) seemed to perform better than SVD, row mean and kNN [R. Jornsten et al. (2005) Bioinformatics]
BPCA and LSA (least squares adaptive) appeared to be the best [S. Oh et al. (2011) Bioinformatics]
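The two simplest options in the list above, row mean and kNN, can be sketched as follows. This is an illustrative plain-Python version, not the implementation evaluated in the cited papers: missing entries are marked with `None`, rows are taken to be features and columns samples, and neighbours are ranked by root-mean-square distance over the columns both rows have observed. The function names and the toy matrix are assumptions.

```python
import math


def row_mean_impute(matrix):
    """Replace each missing entry (None) with the mean of its row."""
    out = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed)
        out.append([mean if v is None else v for v in row])
    return out


def knn_impute(matrix, k=2):
    """Replace each missing entry with the average of that column over the
    k most similar rows that have the column observed."""

    def distance(r1, r2):
        shared = [(a, b) for a, b in zip(r1, r2)
                  if a is not None and b is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))

    out = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Rank the other rows that have column j observed by distance.
            ranked = sorted((distance(row, other), idx)
                            for idx, other in enumerate(matrix)
                            if idx != i and other[j] is not None)
            nearest = [idx for d, idx in ranked[:k] if d < math.inf]
            if nearest:
                out[i][j] = sum(matrix[idx][j] for idx in nearest) / len(nearest)
            else:
                # No usable neighbour: fall back to the row mean.
                observed = [v for v in row if v is not None]
                out[i][j] = sum(observed) / len(observed)
    return out


# Toy matrix: the first row is missing its last value.
toy = [
    [1.0, 2.0, 3.0, None],
    [1.1, 2.1, 2.9, 4.0],
    [0.9, 1.9, 3.1, 4.2],
    [8.0, 7.0, 6.0, 5.0],
]
mean_filled = row_mean_impute(toy)
knn_filled = knn_impute(toy, k=2)
print(mean_filled[0][3], knn_filled[0][3])
```

In the toy matrix the row mean fills the gap with the average of the same row, while kNN borrows from the two rows with a similar profile, which illustrates why the two methods can give quite different values. The SVD and Bayesian PCA approaches listed above are not covered by this sketch.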

Test of Bayesian PCA R-package “pcaMethods”

Dimension Reduction Methods
Principal Component Analysis
– kernel PCA
Factor Analysis, Multidimensional Scaling
Canonical Correlation Analysis
– regularized CCA
– kernel CCA
Partial Least Squares (canonical mode)
– Sparse PLS

Test of rCCA
– No significant cross-correlation between the two datasets
– rCCA extracts features (canonical covariates) while maximizing the correlation between the two datasets
– Significant cross-correlation after rCCA: is this biologically meaningful or not?
R package “CCA” or “mixOmics”

Integrative Analysis for multiple -omics
Open questions remain on how best to integrate multiple -omics datasets to understand underlying biological mechanisms and infer causal pathways.

Acknowledgements
Wellcome Trust Sanger Institute: Nicole Soranzo, Yasin Memari, Aparna Radhakrishnan, Panos Deloukas, Elin Grunberg
KORA: Ann-Kristin Petersen, Christian Gieger, Karsten Suhre
TwinsUK: Tim Spector, Massimo Mangino, Guangju Zhai, Kerrin Small