
Published by Briana Sherfield. Modified about 1 year ago.

1
SAMSI Program Beyond Bioinformatics: Statistical and Mathematical Challenges
Topic: Data Integration
Katerina Kechris, PhD
Associate Professor, Biostatistics and Informatics
Colorado School of Public Health, University of Colorado Denver

2
Omics
Large-scale analyses for studying a population of molecules or molecular mechanisms
High-throughput data
Examples
  – Genomics (entire genome – DNA)
  – Proteomics (study of protein repertoire)
  – Epigenomics (study of DNA and histone modifications)

3
Omics Epigenome Phenome Adapted from

4
Large-scale Projects & Databases
NCI 60 Database

5
Integration of Omics Data
Each type of data gives a different snapshot of the biological or disease system
Why integrate data?
  – Reduce false positives/negatives
  – Identify interactions between different molecules
  – Explore functional mechanisms

6
Challenges
1. When to integrate?
2. Dimensionality
3. Resolution
4. Heterogeneity
5. Interactions and Pathways

7
Challenge 1: When to integrate?
Early
  – Merging data to increase sample size
Intermediate
  – Convert different data sources into common format (e.g., ranks, correlation matrices), kernel-based analysis
Late
  – Meta-analysis (combine effect size or p-value), aggregate voting for classifiers, genomic enrichment and overlap of significant results
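As a minimal sketch of the late-integration option of combining p-values, the block below uses Fisher's method. This is one standard choice; the slide does not name a specific method. It assumes the studies are independent, and the function name and example values are illustrative only.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values for one gene with Fisher's method."""
    if not pvalues or any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    # Fisher's statistic is X = -2 * sum(log p_i); under the null it follows
    # a chi-square distribution with 2k degrees of freedom. We store X / 2.
    half_stat = -sum(math.log(p) for p in pvalues)
    # For even degrees of freedom the chi-square survival function has a
    # closed form: exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical p-values for the same gene from three independent studies.
combined = fisher_combined_pvalue([0.04, 0.10, 0.03])
```

Combining effect sizes instead of p-values would need per-study estimates and standard errors, which this sketch does not cover.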

8
Genomic Meta-analysis: Combining Multiple Transcriptomic Studies
Tseng Lab, U. of Pitt.

9
Assessing Genomic Overlap: Permutation-based Strategies
Bickel Lab, Berkeley & ENCODE
Ann. Appl. Stat. (2010) 4:
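To make the idea of a permutation-based overlap test concrete, here is a deliberately simplified sketch. It is not the approach of the cited paper: it treats genes as exchangeable labels and ignores positional dependence along the genome, which more careful strategies have to account for. The null model, function name, and example sets are assumptions made for illustration.

```python
import random


def overlap_permutation_pvalue(universe, set_a, set_b, n_perm=1000, seed=0):
    """Return (observed overlap, one-sided permutation p-value).

    Null model (an assumption of this sketch): set B is a uniformly random
    subset of the universe with the same size as the observed B.
    """
    universe = list(universe)
    set_a, set_b = set(set_a), set(set_b)
    observed = len(set_a & set_b)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        random_b = rng.sample(universe, len(set_b))
        if len(set_a.intersection(random_b)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical example: two lists of significant genes from two data types.
genes = [f"gene{i}" for i in range(1000)]
observed, pvalue = overlap_permutation_pvalue(genes, genes[:50], genes[25:75])
```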

10
Challenge 2: Dimensionality
Most technologies produce tens to hundreds of thousands of measurements per sample
  – Exponential increase with 2+ data types
Dimension reduction
  – Process each data type separately (filtering)
  – Combine with model fitting
  – Multivariate analysis
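The filtering option can be illustrated with a variance filter applied to one data type before integration. The slide does not say which filter is meant, so the criterion, the function name, and the toy values below are assumptions.

```python
from statistics import pvariance


def top_variance_features(data, k):
    """Keep the k features with the largest variance across samples.

    data: dict mapping a feature name to its list of per-sample values.
    """
    ranked = sorted(data, key=lambda name: pvariance(data[name]), reverse=True)
    return ranked[:k]


# Hypothetical expression values for three genes across four samples.
expression = {
    "g1": [1.0, 1.0, 1.0, 1.0],   # constant, carries no information
    "g2": [0.0, 10.0, 0.0, 10.0],
    "g3": [1.0, 2.0, 1.0, 2.0],
}
kept = top_variance_features(expression, k=2)
```

Filtering each data type on its own is cheap, but it cannot see features that only matter jointly with another data type, which is one motivation for the multivariate route.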

11
Sparse Multivariate Methods
Variable Selection, Discriminant Analysis, Visualization
Penalties (or regularization) to reduce parameter space; only a few entries are non-zero (sparsity)
Sparse Canonical Correlation Analysis (CCA) and Partial Least Squares Regression (PLS)
Le Cao, U. of Queensland; Besse, U. of Toulouse; Witten, U. of Wash; Tibshirani, Stanford
Stat Appl Genet Mol Biol January 1; 8(1): Article 28; Stat Appl Genet Mol Biol. 2008;7(1):Article 35
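How a penalty leads to exact zeros can be seen in the soft-thresholding operator, the building block that L1 (lasso-type) penalties typically reduce to. This is a sketch of that single step only, not an implementation of sparse CCA or sparse PLS from the cited papers; the function name and numbers are illustrative.

```python
def soft_threshold(values, lam):
    """Shrink each entry toward zero by lam; entries within lam of zero become 0."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    shrunk = []
    for x in values:
        magnitude = abs(x) - lam
        if magnitude <= 0:
            shrunk.append(0.0)  # small loadings are set exactly to zero
        else:
            shrunk.append(magnitude if x > 0 else -magnitude)
    return shrunk


# Hypothetical loading vector: a larger lam leaves fewer non-zero entries.
loadings = [3.0, -0.5, 1.0, -2.0]
sparse_loadings = soft_threshold(loadings, lam=1.0)
```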

12
Challenge 3: Genomic Resolution
Base level (conservation, motif scores)
Regular intervals (expression/binding from tiling arrays)
Irregular intervals
  – Gene/ncRNA level data (expression)
  – Individual positions (SNP, methylation sites)

13
Challenge 4: Heterogeneity
Technology-specific sources of error
Different pre-processing, normalization
Different amounts of missing values
Data matching
  – Different identifiers
  – Not always one-to-one (microarrays)
  – Imputation
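The data-matching issue can be sketched for the many-to-one case, where several microarray probes map to the same gene. Collapsing by the median is one common convention, chosen here as an assumption. All identifiers and the function name are hypothetical, and the sketch does not handle a probe that maps to several genes.

```python
from collections import defaultdict
from statistics import median


def collapse_probes_to_genes(probe_values, probe_to_gene):
    """Summarize probe-level values at the gene level.

    Returns (gene -> median value, list of probes with no gene mapping).
    """
    grouped = defaultdict(list)
    unmapped = []
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            unmapped.append(probe)  # report instead of silently dropping
        else:
            grouped[gene].append(value)
    gene_values = {gene: median(vals) for gene, vals in grouped.items()}
    return gene_values, unmapped


# Hypothetical probe measurements and annotation table.
probe_values = {"p1": 2.0, "p2": 4.0, "p3": 7.0, "p4": 1.0}
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
gene_values, unmapped = collapse_probes_to_genes(probe_values, probe_to_gene)
```

Returning the unmapped probes makes the amount of data lost to identifier mismatch visible before the data types are merged.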

14
Challenge 4: Heterogeneity
Continuous – expression and binding data from microarrays, motif scores, protein/metabolite abundance
Counts – expression data from sequencing
0-1 – conservation (UCSC), DNA methylation
Binary/Categorical – Thresholding (e.g., motif scores), genotype

15
Case Study: Development
Ci important for differentiation of appendages during development
transcription factor – binds to DNA near target genes
Kechris Lab, CU Denver

16
Hierarchical Mixture Model
Data
  – Transcriptome: Ci pathway mutants (expr) – irregular interval
  – Genome: DNA binding data of Ci (bind) – regular interval; DNA conservation across 14 insect species (cons) – base level
Goal: Predict gene targets of Ci
Hidden variable is gene target – hierarchical mixture model
Dvorkin et al., 2013 (under review)

17
Challenge 5: Interactions and Pathways
Known Pathways
  – Incorporate information in databases (curated but sparse)
  – e.g., KEGG pathways have metabolite – protein interactions (directed graphs)
De novo Pathways
  – Discover novel interactions

18
Known Pathways
Jornsten, Chalmers & Michailidis, U. Michigan
Biostatistics (2012) 13:
Joint modeling of metabolite and transcript data to identify active pathways
[Figure labels: metabolite, gene]

19
de novo Interactions
Single data
INTEGRATION
Pair-wise
  – Correlations (e.g., eQTL)
  – Bayesian networks
Multiple
  – Kernel-based methods
  – Probabilistic graphical models
  – Network analysis
[Diagram labels: gene, SNP, protein, metabolite, gene, methylation site, PHENOTYPE]
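The pair-wise correlation case (e.g., eQTL) can be sketched as a Pearson correlation between a SNP genotype and a gene's expression across samples. The additive 0/1/2 genotype coding, the function name, and the values are assumptions for illustration; a real eQTL analysis would also adjust for covariates and multiple testing.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: minor-allele counts for one SNP and expression of one
# gene, measured in the same six samples.
genotype = [0, 1, 2, 0, 1, 2]
expression_level = [1.1, 1.9, 3.2, 0.8, 2.1, 2.9]
r = pearson_correlation(genotype, expression_level)
```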

20
de novo Interactions
Shojaie Lab, U. Washington
Biometrika (2010) 97 (3):

21
Summary
Methodology
1. Meta-analysis
2. Permutation-based Methods
3. Sparse Multivariate Methods
4. Graphical Models
5. Network Analysis
