Published by Briana Sherfield. Modified over 2 years ago.

1
SAMSI 2014-2015 Program
Beyond Bioinformatics: Statistical and Mathematical Challenges
Topic: Data Integration
Katerina Kechris, PhD
Associate Professor, Biostatistics and Informatics
Colorado School of Public Health, University of Colorado Denver

2
Omics
Large-scale analyses for studying a population of molecules or molecular mechanisms
High-throughput data
Examples
– Genomics (entire genome – DNA)
– Proteomics (study of protein repertoire)
– Epigenomics (study of DNA and histone modifications)

3
Omics
[Figure: omics layers; surviving labels: Epigenome, Phenome]
Adapted from:
http://www.sciencebasedmedicine.org
http://www.scientificpsychic.com/fitness/transcription.gif
http://themedicalbiochemistrypage.org/images/hemoglobin.jpg
http://upload.wikimedia.org/wikipedia/commons/c/c6/Clopidogrel_active_metabolite.png
http://creatia2013.files.wordpress.com/2013/03/dna.gif

4
Large-scale Projects & Databases
NCI 60 Database

5
Integration of Omics Data
Each type of data gives a different snapshot of the biological or disease system
Why integrate data?
– Reduce false positives/negatives
– Identify interactions between different molecules
– Explore functional mechanisms

6
Challenges
1. When to integrate?
2. Dimensionality
3. Resolution
4. Heterogeneity
5. Interactions and Pathways

7
Challenge 1: When to integrate?
Early – Merging data to increase sample size
Intermediate – Convert different data sources into common format (e.g., ranks, correlation matrices), kernel-based analysis
Late – Meta-analysis (combine effect size or p-value), aggregate voting for classifiers, genomic enrichment and overlap of significant results
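As one illustration of the late-integration option "combine effect size or p-value", the sketch below applies Fisher's method to p-values from independent studies. Fisher's method is one common choice and is not named on the slide; the function name and the example p-values are hypothetical, and the method assumes the studies are independent.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p). Assumes the p-values are independent.
    """
    if not pvalues or any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    # Under the null, -2 * sum(log p) follows a chi-square with 2k degrees of freedom.
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    # For even degrees of freedom the chi-square survival function is closed form:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total

# Hypothetical p-values for one gene measured in three independent studies.
stat, combined_p = fisher_combined_pvalue([0.04, 0.10, 0.03])
print(f"statistic={stat:.3f}, combined p={combined_p:.5f}")
```

With a single study the combined p-value equals the input p-value, which is a quick sanity check on the formula.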

8
Genomic Meta-analysis: Combining Multiple Transcriptomic Studies
Tseng Lab, U. of Pitt.

9
Assessing Genomic Overlap: Permutation-based Strategies
Bickel Lab, Berkeley & ENCODE
Ann. Appl. Stat. (2010) 4:4 1660-1697.
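To make the idea of a permutation-based overlap test concrete, here is a deliberately naive sketch that resamples gene sets uniformly from a universe of gene identifiers. It is not the procedure from the cited paper, and it ignores genomic structure such as feature length and clustering; all names and data are hypothetical.

```python
import random

def overlap_permutation_pvalue(universe, set_a, set_b, n_perm=2000, seed=0):
    """Estimate how surprising the overlap of two gene sets is.

    Null model (an assumption of this sketch): set_b is a uniform random
    draw of the same size from `universe`, with set_a held fixed.
    """
    rng = random.Random(seed)
    universe = list(universe)
    set_a, set_b = set(set_a), set(set_b)
    observed = len(set_a & set_b)
    hits = 0
    for _ in range(n_perm):
        sampled = rng.sample(universe, len(set_b))
        if len(set_a.intersection(sampled)) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical example: 1000 genes, two "significant" lists sharing 40 genes.
genes = [f"g{i}" for i in range(1000)]
list_a = genes[:50]
list_b = genes[:40] + genes[500:510]
observed, p_value = overlap_permutation_pvalue(genes, list_a, list_b)
print(observed, p_value)
```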

10
Challenge 2: Dimensionality
Most technologies produce tens to hundreds of thousands of measurements per sample
– Exponential increase with 2+ data types
Dimension reduction
– Process each data type separately (filtering)
– Combine with model fitting
– Multivariate analysis
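A minimal sketch of the "filtering" option, assuming a variance filter applied to one data type before integration. The slide does not say which filter is meant; variance is only one possibility, and the function name and toy data are hypothetical.

```python
from statistics import pvariance

def top_variance_features(matrix, feature_names, keep):
    """Return the names of the `keep` features with the largest variance.

    matrix: list of samples, each a list of feature values in the same
    order as feature_names.
    """
    columns = list(zip(*matrix))  # one tuple of values per feature
    ranked = sorted(
        zip(feature_names, columns),
        key=lambda name_col: pvariance(name_col[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:keep]]

# Hypothetical expression matrix: 3 samples x 3 genes.
expression = [
    [1.0, 5.0, 0.0],
    [1.1, 9.0, 0.0],
    [0.9, 1.0, 0.0],
]
print(top_variance_features(expression, ["g1", "g2", "g3"], keep=2))
```

A filter like this is unsupervised, so it does not use the phenotype and can be applied to each data type independently.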

11
Sparse Multivariate Methods
Variable Selection, Discriminant Analysis, Visualization
Penalties (or regularization) reduce the parameter space so that only a few entries are non-zero (sparsity)
Sparse Canonical Correlation Analysis (CCA) and Partial Least Squares Regression (PLS)
Le Cao, U. of Queensland; Besse, U. of Toulouse; Witten, U. of Wash; Tibshirani, Stanford
Stat Appl Genet Mol Biol. 2009 January 1; 8(1): Article 28; Stat Appl Genet Mol Biol. 2008;7(1):Article 35
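The sparsity mentioned above is typically produced by an L1-type penalty, whose elementwise effect is soft-thresholding. The sketch below shows only that operator on a hypothetical loading vector; it is not a full sparse CCA or sparse PLS implementation, and the penalty value is arbitrary.

```python
def soft_threshold(values, penalty):
    """Shrink each value toward zero by `penalty`.

    Values whose magnitude is at most `penalty` become exactly 0.0,
    which is what yields sparse loading vectors.
    """
    if penalty < 0:
        raise ValueError("penalty must be non-negative")
    shrunk = []
    for v in values:
        magnitude = abs(v) - penalty
        if magnitude <= 0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if v > 0 else -magnitude)
    return shrunk

# Hypothetical loadings for five features; small entries are zeroed out.
loadings = [0.9, -0.05, 0.2, -0.7, 0.01]
sparse_loadings = soft_threshold(loadings, penalty=0.25)
print(sparse_loadings)
```

Larger penalties zero out more entries, so the penalty acts as a tuning parameter that trades fit for interpretability.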

12
Challenge 3: Genomic Resolution
Base level (conservation, motif scores)
Regular intervals (expression/binding from tiling arrays)
Irregular intervals
– Gene/ncRNA level data (expression)
– Individual positions (SNP, methylation sites)
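One simple way to reconcile resolutions is to summarize base-level scores over gene-level intervals. The sketch below averages a per-base score within each interval; averaging is an assumption here (a maximum or quantile may suit some data better), and the coordinates and names are hypothetical.

```python
def mean_score_per_interval(base_scores, intervals):
    """Average a base-level score over each named interval.

    base_scores: list indexed by 0-based genomic position.
    intervals: dict mapping name -> (start, end), half-open [start, end).
    Returns None for an interval that covers no positions.
    """
    summary = {}
    for name, (start, end) in intervals.items():
        window = base_scores[start:end]
        summary[name] = sum(window) / len(window) if window else None
    return summary

# Hypothetical conservation scores for 8 bases and two irregular gene intervals.
conservation = [0.1, 0.9, 0.8, 0.7, 0.2, 0.0, 0.6, 0.4]
genes = {"geneA": (1, 4), "geneB": (5, 8)}
print(mean_score_per_interval(conservation, genes))
```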

13
Challenge 4: Heterogeneity
Technology-specific sources of error
Different pre-processing, normalization
Different amounts of missing values
Data matching
– Different identifiers
– Not always one-to-one (microarrays)
– Imputation
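The data-matching issue can be illustrated with a many-to-one mapping from microarray probes to genes. The sketch below averages probes that map to the same gene and reports probes with no match; averaging is one of several possible collapsing rules, and the identifiers are made up.

```python
from collections import defaultdict

def collapse_probes_to_genes(probe_values, probe_to_gene):
    """Collapse probe-level values to gene-level values by averaging.

    probe_values: dict probe_id -> measurement.
    probe_to_gene: dict probe_id -> gene_id (several probes may share a gene).
    Returns (gene_values, unmatched_probes).
    """
    per_gene = defaultdict(list)
    unmatched = []
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            unmatched.append(probe)  # no identifier match; handle separately
        else:
            per_gene[gene].append(value)
    gene_values = {g: sum(v) / len(v) for g, v in per_gene.items()}
    return gene_values, unmatched

# Hypothetical probe measurements and annotation.
values = {"p1": 2.0, "p2": 4.0, "p3": 5.0, "p4": 1.0}
annotation = {"p1": "GENE1", "p2": "GENE1", "p3": "GENE2"}
gene_level, missing = collapse_probes_to_genes(values, annotation)
print(gene_level, missing)
```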

14
Challenge 4: Heterogeneity
Continuous – expression and binding data from microarrays, motif scores, protein/metabolite abundance
Counts – expression data from sequencing
0-1 – conservation (UCSC), DNA methylation
Binary/Categorical – Thresholding (e.g., motif scores), genotype

15
Case Study: Development
Ci important for differentiation of appendages during development
transcription factor – binds to DNA near target genes
http://www.biology.ualberta.ca/locke.hp/research.htm
http://howardhughes.trinity.duke.edu
Kechris Lab, CU Denver

16
Hierarchical Mixture Model
Data
– Transcriptome: Ci pathway mutants (expr) – irregular interval
– Genome: DNA binding data of Ci (bind) – regular interval, DNA conservation across 14 insect species (cons) – base level
Goal: Predict gene targets of Ci
Hidden variable is gene target – hierarchical mixture model
Dvorkin et al., 2013 (under review)
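To show how a hidden "gene target" indicator can combine the three data types (expr, bind, cons), here is a toy two-component calculation using Bayes' rule. It is not the model of Dvorkin et al.: it assumes gene-level summaries, conditional independence between data types, Gaussian densities, and known parameters, all of which are hypothetical simplifications.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_target_probability(observation, prior, target_params, background_params):
    """P(gene is a target | data) under a two-component mixture.

    observation: dict data_type -> gene-level value.
    *_params: dict data_type -> (mean, sd) for each mixture component.
    Data types are treated as conditionally independent given target status.
    """
    like_target = prior
    like_background = 1.0 - prior
    for data_type, x in observation.items():
        like_target *= normal_pdf(x, *target_params[data_type])
        like_background *= normal_pdf(x, *background_params[data_type])
    return like_target / (like_target + like_background)

# Hypothetical component parameters (mean, sd) for each data type.
target = {"expr": (1.0, 1.0), "bind": (1.0, 1.0), "cons": (1.0, 1.0)}
background = {"expr": (-1.0, 1.0), "bind": (-1.0, 1.0), "cons": (-1.0, 1.0)}
gene = {"expr": 2.0, "bind": 1.5, "cons": 0.5}
print(posterior_target_probability(gene, prior=0.1, target_params=target, background_params=background))
```

In practice the component parameters and the prior would be estimated from the data (for example with EM) rather than fixed in advance.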

17
Challenge 5: Interactions and Pathways
Known Pathways
– Incorporate information in databases (curated but sparse)
– e.g., KEGG pathways have metabolite – protein interactions (directed graphs)
De novo Pathways
– Discover novel interactions

18
Known Pathways
Joint modeling of metabolite and transcript data to identify active pathways
Jornsten, Chalmers & Michailidis, U. Michigan
Biostatistics (2012) 13:4 748-761
[Figure labels: metabolite, gene]

19
de novo Interactions
Pair-wise
– Correlations (e.g., eQTL)
– Bayesian networks
Multiple
– Kernel-based methods
– Probabilistic graphical models
– Network analysis
[Figure labels: Single data, INTEGRATION, gene, SNP, protein, metabolite, gene, methylation site, PHENOTYPE]
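The pair-wise correlation approach (e.g., eQTL) can be sketched as a Pearson correlation between a SNP genotype coded as 0/1/2 and the expression of one gene. The additive coding and the toy data are assumptions; a real eQTL analysis would also adjust for covariates and multiple testing.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: minor-allele counts and expression for six samples.
genotype = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
print(round(pearson_correlation(genotype, expression), 3))
```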

20
de novo Interactions
Shojaie Lab, U. Washington
Biometrika (2010) 97 (3): 519-538.

21
Summary
Methodology
1. Meta-analysis
2. Permutation-based Methods
3. Sparse Multivariate Methods
4. Graphical Models
5. Network Analysis
