SAMSI 2014-2015 Program Beyond Bioinformatics: Statistical and Mathematical Challenges Topic: Data Integration Katerina Kechris, PhD Associate Professor.

SAMSI 2014-2015 Program Beyond Bioinformatics: Statistical and Mathematical Challenges Topic: Data Integration Katerina Kechris, PhD Associate Professor Biostatistics and Informatics Colorado School of Public Health University of Colorado Denver

Omics
Large-scale analyses for studying a population of molecules or molecular mechanisms
High-throughput data
Examples
– Genomics (entire genome – DNA)
– Proteomics (study of protein repertoire)
– Epigenomics (study of DNA and histone modifications)

Omics
Figure labels: Epigenome, Phenome
Adapted from:
http://www.sciencebasedmedicine.org
http://www.scientificpsychic.com/fitness/transcription.gif
http://themedicalbiochemistrypage.org/images/hemoglobin.jpg
http://upload.wikimedia.org/wikipedia/commons/c/c6/Clopidogrel_active_metabolite.png
http://creatia2013.files.wordpress.com/2013/03/dna.gif

Large-scale Projects & Databases NCI 60 Database

Integration of Omics Data
Each type of data gives a different snapshot of the biological or disease system
Why integrate data?
– Reduce false positives/negatives
– Identify interactions between different molecules
– Explore functional mechanisms

Challenges
1. When to integrate?
2. Dimensionality
3. Resolution
4. Heterogeneity
5. Interactions and Pathways

Challenge 1: When to integrate?
Early – Merging data to increase sample size
Intermediate – Convert different data sources into a common format (e.g., ranks, correlation matrices), kernel-based analysis
Late – Meta-analysis (combine effect sizes or p-values), aggregate voting for classifiers, genomic enrichment and overlap of significant results
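As one illustration of late integration, the sketch below combines per-study p-values for a single gene with Fisher's method. This is a generic textbook technique chosen for illustration, not necessarily the approach used in the studies cited in this talk; it assumes the studies are independent, and the p-values are made up.

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p). Under the null hypothesis the statistic
    -2 * sum(log p) follows a chi-square distribution with 2k degrees of freedom.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    # Closed-form chi-square survival function, valid for even degrees of freedom (2k):
    # sf(x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total

# Hypothetical p-values for one gene across three studies (illustrative only).
stat, p_combined = fisher_combine([0.04, 0.10, 0.03])
print(f"Fisher statistic = {stat:.2f}, combined p = {p_combined:.4f}")
```

Fisher's method rewards consistent moderate evidence across studies; other combination rules (e.g., weighted effect-size models) make different trade-offs.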

Genomic Meta-analysis: Combining Multiple Transcriptomic Studies Tseng Lab, U. of Pitt.

Assessing Genomic Overlap: Permutation-based Strategies Bickel Lab, Berkeley & ENCODE Ann. Appl. Stat. (2010) 4:4 1660-1697.
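To make the idea of a permutation-based overlap test concrete, here is a deliberately naive sketch: it counts how many point features (e.g., SNP positions) fall inside a set of intervals, then compares that count with counts obtained after circularly shifting the points by a random offset. This is a simplified stand-in and not the procedure from the cited paper; the genome length, intervals, and positions are invented for illustration.

```python
import random

def count_hits(intervals, positions):
    """Number of positions that fall inside any half-open interval [start, end)."""
    return sum(any(s <= p < e for s, e in intervals) for p in positions)

def shift_permutation_test(intervals, positions, genome_length, n_perm=1000, seed=0):
    """Empirical p-value for overlap under random circular shifts of the positions."""
    rng = random.Random(seed)
    observed = count_hits(intervals, positions)
    at_least = 0
    for _ in range(n_perm):
        offset = rng.randrange(genome_length)
        # A circular shift keeps the spacing between positions (their local clustering).
        shifted = [(p + offset) % genome_length for p in positions]
        if count_hits(intervals, shifted) >= observed:
            at_least += 1
    # Add-one correction so the empirical p-value is never exactly zero.
    return observed, (at_least + 1) / (n_perm + 1)

# Toy data on a hypothetical 10 kb "genome".
toy_intervals = [(100, 200), (500, 600)]
toy_positions = [110, 150, 190, 550, 800]
observed, p_value = shift_permutation_test(toy_intervals, toy_positions, 10_000)
print(observed, p_value)
```

Shifting whole sets of positions, rather than shuffling each position independently, is one simple way to avoid overstating significance when features cluster along the genome.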

Challenge 2: Dimensionality
Most technologies produce tens to hundreds of thousands of measurements per sample
– Exponential increase with 2+ data types
Dimension reduction
– Process each data type separately (filtering)
– Combine with model fitting
– Multivariate analysis

Sparse Multivariate Methods
Variable Selection, Discriminant Analysis, Visualization
Penalties (or regularization) reduce the parameter space so that only a few entries are non-zero (sparsity)
Sparse Canonical Correlation Analysis (CCA) and Partial Least Squares Regression (PLS)
Le Cao, U. of Queensland; Besse, U. of Toulouse; Witten, U. of Wash; Tibshirani, Stanford
Stat Appl Genet Mol Biol. 2009 January 1; 8(1): Article 28; Stat Appl Genet Mol Biol. 2008;7(1):Article 35
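The role of the penalty can be sketched with a soft-thresholded power iteration on a small cross-covariance matrix between two data types. This is a minimal sketch in the spirit of sparse CCA/PLS, not the exact algorithm of the cited papers; the fixed thresholds `lam_u` and `lam_v` and the toy matrix are assumptions made for illustration.

```python
import math

def soft_threshold(values, lam):
    """Shrink each entry toward zero by lam; entries with |v| <= lam become zero."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in values]

def normalize(values):
    """Scale to unit Euclidean norm (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else values

def sparse_rank1(matrix, lam_u, lam_v, n_iter=50):
    """Alternate soft-thresholded updates of the left (u) and right (v) weight vectors."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = normalize([1.0] * n_cols)
    u = [0.0] * n_rows
    for _ in range(n_iter):
        mv = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        u = normalize(soft_threshold(mv, lam_u))
        mtu = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        v = normalize(soft_threshold(mtu, lam_v))
    return u, v

# Hypothetical cross-covariance between 4 genes (rows) and 3 metabolites (columns).
cross_cov = [
    [2.0, 1.8, 0.10],
    [1.9, 2.1, 0.00],
    [0.1, -0.1, 0.05],
    [0.0, 0.1, -0.10],
]
u, v = sparse_rank1(cross_cov, lam_u=0.3, lam_v=0.3)
print([round(x, 3) for x in u], [round(x, 3) for x in v])
```

In this toy example the weak rows and column are driven exactly to zero, so only the first two genes and the first two metabolites carry weight.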

Challenge 3: Genomic Resolution
Base level (conservation, motif scores)
Regular intervals (expression/binding from tiling arrays)
Irregular intervals
– Gene/ncRNA level data (expression)
– Individual positions (SNP, methylation sites)
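One common way to reconcile resolutions is to summarize finer-grained data at the level of the coarsest unit, for example genes. The sketch below averages base-level scores over each gene and counts individual positions within it; the coordinates, scores, and gene names are hypothetical, and the mean is only one of several reasonable summaries.

```python
def summarize_by_gene(genes, base_scores, snp_positions):
    """Summarize base-level and position-level data per gene.

    genes: {name: (start, end)} with 0-based, half-open coordinates.
    base_scores: list of per-base scores indexed by position.
    snp_positions: list of 0-based positions.
    """
    summary = {}
    for name, (start, end) in genes.items():
        window = base_scores[start:end]
        summary[name] = {
            "mean_conservation": sum(window) / len(window) if window else float("nan"),
            "n_snps": sum(start <= pos < end for pos in snp_positions),
        }
    return summary

# Toy data: a 10-base region with two hypothetical genes.
toy_scores = [0.1, 0.9, 0.8, 1.0, 0.2, 0.0, 0.5, 0.5, 0.7, 0.3]
toy_genes = {"geneA": (1, 4), "geneB": (6, 10)}
toy_snps = [2, 3, 7]
gene_summary = summarize_by_gene(toy_genes, toy_scores, toy_snps)
print(gene_summary)
```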

Challenge 4: Heterogeneity
Technology-specific sources of error
Different pre-processing, normalization
Different amounts of missing values
Data matching
– Different identifiers
– Not always one-to-one (microarrays)
– Imputation
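The data-matching problem can be sketched as follows: several microarray probes may map to the same gene, some probes have no gene annotation, and some values are missing. The example collapses probes to genes by averaging, which is one simple choice among many; the probe identifiers and the mapping are invented for illustration.

```python
from collections import defaultdict

def collapse_probes(probe_values, probe_to_gene):
    """Average probe-level values per gene; report probes without a gene mapping.

    Missing measurements are represented as None and skipped.
    """
    grouped = defaultdict(list)
    unmapped = []
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            unmapped.append(probe)
        elif value is not None:
            grouped[gene].append(value)
    gene_values = {gene: sum(vals) / len(vals) for gene, vals in grouped.items()}
    return gene_values, unmapped

# Hypothetical probe identifiers and annotation (many-to-one mapping).
toy_probe_values = {"p1": 2.0, "p2": 4.0, "p3": 1.5, "p4": None, "p5": 3.0}
toy_probe_to_gene = {"p1": "geneA", "p2": "geneA", "p3": "geneB", "p4": "geneB"}
gene_values, unmapped = collapse_probes(toy_probe_values, toy_probe_to_gene)
print(gene_values, unmapped)
```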

Challenge 4: Heterogeneity
Continuous – expression and binding data from microarrays, motif scores, protein/metabolite abundance
Counts – expression data from sequencing
0-1 – conservation (UCSC), DNA methylation
Binary/Categorical – Thresholding (e.g., motif scores), genotype

Case Study: Development
Ci is important for differentiation of appendages during development
Transcription factor – binds to DNA near target genes
Image sources: http://www.biology.ualberta.ca/locke.hp/research.htm, http://howardhughes.trinity.duke.edu
Kechris Lab, CU Denver

Hierarchical Mixture Model
Data
- Transcriptome: Ci pathway mutants (expr) – irregular interval
- Genome: DNA binding data of Ci (bind) – regular interval; DNA conservation across 14 insect species (cons) – base level
Goal: Predict gene targets of Ci
Hidden variable is gene target – hierarchical mixture model
Dvorkin et al., 2013 (under review)
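The role of the hidden target indicator can be illustrated with a much simpler model than the hierarchical one named above: a flat two-component mixture in which the three gene-level summaries (expr, bind, cons) are assumed conditionally independent and Gaussian given target status. This is not the model of Dvorkin et al.; every parameter value below is hypothetical and would normally be estimated (e.g., by EM) rather than fixed.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_target(observation, params, prior_target):
    """P(gene is a target | data), assuming conditional independence across data types.

    params: {data_type: ((mean_target, sd_target), (mean_background, sd_background))}
    """
    like_target, like_background = prior_target, 1.0 - prior_target
    for key, value in observation.items():
        (m1, s1), (m0, s0) = params[key]
        like_target *= normal_pdf(value, m1, s1)
        like_background *= normal_pdf(value, m0, s0)
    return like_target / (like_target + like_background)

# Hypothetical component parameters for each data type (illustrative only).
toy_params = {
    "expr": ((2.0, 1.0), (0.0, 1.0)),
    "bind": ((1.5, 1.0), (0.0, 1.0)),
    "cons": ((0.8, 0.2), (0.4, 0.2)),
}
likely = posterior_target({"expr": 2.2, "bind": 1.8, "cons": 0.85}, toy_params, 0.1)
unlikely = posterior_target({"expr": 0.1, "bind": -0.2, "cons": 0.35}, toy_params, 0.1)
print(round(likely, 3), round(unlikely, 4))
```

Even with a low prior probability of being a target, agreement across the three data types pushes the posterior close to one, which is the basic appeal of integrating them through a shared hidden variable.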

Challenge 5: Interactions and Pathways
Known Pathways
– Incorporate information in databases (curated but sparse)
– e.g., KEGG pathways have metabolite-protein interactions (directed graphs)
De novo Pathways
– Discover novel interactions

Known Pathways
Joint modeling of metabolite and transcript data to identify active pathways
Jornsten, Chalmers & Michailidis, U. Michigan
Biostatistics (2012) 13:4 748-761
Figure labels: metabolite, gene

de novo Interactions
Single data INTEGRATION
Pair-wise
– Correlations (e.g., eQTL)
– Bayesian networks
Multiple
– Kernel-based methods
– Probabilistic graphical models
– Network analysis
Figure labels: gene, SNP, protein, metabolite, gene, methylation site, PHENOTYPE
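The simplest pair-wise case, a correlation between a SNP and a gene's expression (an eQTL-style association), might be sketched as below. Genotypes are coded as 0/1/2 copies of an allele, and the expression values and gene names are made up; a real eQTL analysis would also adjust for covariates and multiple testing.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical genotypes (allele counts) for six samples at one SNP.
genotypes = [0, 0, 1, 1, 2, 2]
# Hypothetical expression of two genes in the same six samples.
expression = {
    "geneA": [1.0, 1.2, 2.1, 1.9, 3.0, 3.2],
    "geneB": [2.0, 1.0, 2.0, 1.0, 2.0, 1.0],
}
correlations = {gene: pearson(genotypes, values) for gene, values in expression.items()}
print(correlations)
```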

de novo Interactions Shojaie Lab U. Washington Biometrika (2010) 97 (3): 519-538.

Summary
Methodology
1. Meta-analysis
2. Permutation-based Methods
3. Sparse Multivariate Methods
4. Graphical Models
5. Network Analysis
