1 Bayesian inference of genome structure and application to base composition variation Nick Smith and Paul Fearnhead, University of Lancaster.


2 Plan of talk: examples of genome structure; patterns of base composition in vertebrate genomes; recursive segmentation; Bayesian approach to segmentation inference; application to simulated data; application to real data; conclusions and future directions.

3 Examples of genome structure Variation in base composition (GC%) in the human genome

4 Variation in substitution rate, polymorphism, repeat element density, recombination rate across human chromosome 22

5 General research goals How can we analyse structure in the genome? Could partition genome into uniform windows, but better to partition genome according to structure in the data. Base composition: Partition genome according to GC content, testing differences in GC structure among chromosomes and species

6 Base composition in the vertebrate genome 1970s: physical analysis of large regions of DNA (~100 kb) revealed striking heterogeneity in GC content, especially in warm-blooded vertebrates; idea of isochores (>300 kb). Full genome sequences: sliding window analyses show significant variation in GC at a variety of scales, with long-range autocorrelations up to many Mb.

7 Recursive segmentation for “isochore” identification The current standard method is recursive segmentation, e.g. Isofinder, which generates the isochore maps shown in the UCSC genome browser. Isofinder: apply “coarse graining” to filter out short-scale variation (exons, repetitive elements, CpG islands; 1-10 kb), choose the potential cutting point as the maximum t-statistic between right and left means, test the significance of the t-test by Monte Carlo methods, and make the cut if the p value is below a threshold. Repeat the process for the resultant sub-sequences, working down from scales of tens of Mb.
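The recursive procedure can be sketched as below. This is a simplified illustration, not Isofinder itself: `alpha`, `n_mc` and `min_len` are invented parameters, and the significance test here is a plain permutation version of the Monte Carlo step.

```python
import numpy as np

def t_stat(x, i):
    """|t|-statistic comparing the means left and right of cut point i."""
    left, right = x[:i], x[i:]
    s2 = left.var(ddof=1) / len(left) + right.var(ddof=1) / len(right)
    return abs(left.mean() - right.mean()) / np.sqrt(s2)

def segment(x, start=0, alpha=0.05, n_mc=99, min_len=5, rng=None):
    """Recursively cut x at the maximum-|t| point whenever a Monte Carlo
    (permutation) test deems the cut significant; returns cut positions."""
    rng = np.random.default_rng(0) if rng is None else rng
    if len(x) < 2 * min_len:
        return []
    cuts = range(min_len, len(x) - min_len + 1)
    best = max(cuts, key=lambda i: t_stat(x, i))
    t_obs = t_stat(x, best)
    # null distribution of the maximal |t| under random shuffles of the data
    null = [max(t_stat(rng.permutation(x), i) for i in cuts)
            for _ in range(n_mc)]
    p_val = (1 + sum(t >= t_obs for t in null)) / (1 + n_mc)
    if p_val > alpha:
        return []
    return (segment(x[:best], start, alpha, n_mc, min_len, rng)
            + [start + best]
            + segment(x[best:], start + best, alpha, n_mc, min_len, rng))
```

Note how the result depends on `alpha`: this is exactly the threshold sensitivity criticised on the next slide.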

8 Statistical methods for segmentation Problems with recursive segmentation: (1) it depends on a significance threshold, (2) it gives no idea of the uncertainty in the segmentation, (3) possible loss of information. Various methods for DNA segmentation have been considered in the statistics literature. We have adapted a Bayesian model for simultaneously inferring multiple changepoints which divide the data into segments. The model uses a combination of perfect simulation with recursions and MCMC.

9 Bayesian model Apply coarse graining to get data y_1:n, the number of GC bases in non-overlapping windows. Two-level model: (1) model data within segments as N(μ_i, σ²), where μ_i is the segment mean and σ² gives the variance within segments; (2) model segment means as N(μ, λ²σ²), where μ is the grand mean and λ² is the ratio of the variance among segment means to the variance within segments. Can calculate P(t,s) = Pr(y_t:s | t,s in same segment). Model the distribution of segment lengths as negative binomial g(p, k), where p is the inverse of the mean segment length and k is the shape parameter. Define recursions with Q(t) = Pr(y_t:n | changepoint at t-1) using the P, Q and g functions. Fundamental assumption: independence of segments. Proceed from Q(n) back to Q(1) = Pr(y_1:n). Posterior distributions of changepoints are given by formulae with the P, Q and g functions; other information easily follows.
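A minimal sketch of the two key quantities, assuming the geometric (k = 1) special case for segment lengths; the function and variable names are illustrative, not the authors' implementation. `log_seg_lik` is P(t,s) with the segment mean integrated out (y_j ~ N(m, σ²), m ~ N(μ, λ²σ²)), and `log_marginal` runs the backward recursion for Q.

```python
import numpy as np

def log_seg_lik(y, t, s, mu, sigma, lam):
    """log P(t,s): marginal likelihood of y[t..s] as one segment, with the
    segment mean integrated out (y_j ~ N(m, sigma^2), m ~ N(mu, lam^2 sigma^2))."""
    seg = y[t:s + 1]
    m = len(seg)
    ss = np.sum((seg - seg.mean()) ** 2)
    quad = (ss + m * (seg.mean() - mu) ** 2 / (1 + m * lam ** 2)) / sigma ** 2
    return (-0.5 * m * np.log(2 * np.pi * sigma ** 2)
            - 0.5 * np.log(1 + m * lam ** 2)
            - 0.5 * quad)

def log_marginal(y, mu, sigma, lam, p):
    """Backward recursion Q(t) = Pr(y[t..n-1] | changepoint at t-1), with
    geometric segment lengths (k = 1); returns log Q(0) = log Pr(y)."""
    n = len(y)
    logQ = np.empty(n + 1)
    logQ[n] = 0.0
    for t in range(n - 1, -1, -1):
        terms = []
        for s in range(t, n):
            if s < n - 1:          # a full segment followed by a changepoint
                lg = np.log(p) + (s - t) * np.log1p(-p)
            else:                  # final segment: geometric survivor function
                lg = (s - t) * np.log1p(-p)
            terms.append(log_seg_lik(y, t, s, mu, sigma, lam) + lg + logQ[s + 1])
        logQ[t] = np.logaddexp.reduce(terms)
    return logQ[0]
```

The posterior distribution of changepoints follows from the same P, Q and g quantities; the O(n²) double loop here is the naive version of the recursion.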

10 Tuning priors and MCMC algorithm Problem: we don’t know the correct values for the underlying priors. Priors are tuned by maximising the product of Q(1) and hyperpriors. The MCMC algorithm is designed to overcome dependence on the priors and possible mis-specifications of the model. Start with priors at the tuned values (reference priors); store N simulations of changepoints based on the reference priors. Three updates at each step of the MCMC algorithm (N steps): (1) Metropolis-Hastings update of the changepoints dependent on the current values of the priors; (2) update the within-segment parameters and the segment means given the changepoints (Gibbs); (3) update the hyperparameters given the segment means (Gibbs).
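The conjugate draw of the segment means used in the Gibbs steps can be sketched as follows (a standard normal-normal update; the names and signature are illustrative, not the authors' code):

```python
import numpy as np

def gibbs_segment_means(y, changepoints, mu, sigma, lam, rng):
    """One Gibbs sweep over segment means: given the changepoints, each
    segment mean has a conjugate normal posterior combining the prior
    N(mu, lam^2 sigma^2) with the segment's data, modelled N(mean, sigma^2)."""
    bounds = [0] + list(changepoints) + [len(y)]
    draws = []
    for t, s in zip(bounds[:-1], bounds[1:]):
        seg = y[t:s]
        prec = len(seg) / sigma ** 2 + 1.0 / (lam ** 2 * sigma ** 2)  # posterior precision
        mean = (seg.sum() / sigma ** 2 + mu / (lam ** 2 * sigma ** 2)) / prec
        draws.append(rng.normal(mean, prec ** -0.5))
    return np.array(draws)
```

With long segments the data dominate and each draw shrinks only slightly towards the grand mean μ.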

11 Simulated data 1: Idealised simulations Idealised simulations: strong segmentation signal and correct model. Simulated discretised data (5000 data points, 5 kb windows) using p = 0.01, k = 2, σ = 140, λ = 2 and grand mean μ. For 100 simulations, compared the true parameter values from the simulations with the estimated values (mean of the posterior distribution): highly accurate inference of p, σ (mean estimate 140.03, mean error 0.23) and λ (2.02, 0.043). Inference of k = 2 was supported by Q(1) values: average 2Δlog Q(1) was 3 relative to k = 1 and 10 relative to k = 3. Compared Bayesian and Isofinder results for a single example with 58 true changepoints. Bayesian: the posterior mean of p implied 59.9 changepoints; the best set of changepoints had 53. Isofinder: the number of changepoints varied with the threshold probability, from 96 with p = 0.05 to 64 with p = 0.01 to 52 at a stricter threshold.
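Data of this form can be simulated directly from the two-level model. The sketch below is an assumption-laden illustration: the grand mean of 2000 is invented (the slide's value was lost in transcription), and the mapping of (mean 1/p, shape k) onto NumPy's negative binomial is one plausible parameterisation, not necessarily the talk's.

```python
import numpy as np

def simulate_gc(n, p, k, sigma, lam, mu, rng):
    """Simulate window GC counts from the two-level changepoint model:
    segment lengths ~ negative binomial (mean 1/p, shape k), segment
    means ~ N(mu, lam^2 sigma^2), data ~ N(segment mean, sigma^2)."""
    # Map (mean 1/p, shape k) onto NumPy's negative_binomial(n, p) form;
    # this parameterisation is an illustrative assumption.
    p0 = k * p / (1.0 + k * p)
    y = np.empty(n)
    cps = []          # true changepoint positions, for checking inference
    i = 0
    while i < n:
        length = max(1, int(rng.negative_binomial(k, p0)))
        seg_mean = rng.normal(mu, lam * sigma)
        j = min(i + length, n)
        y[i:j] = rng.normal(seg_mean, sigma, j - i)
        if j < n:
            cps.append(j)
        i = j
    return y, cps
```

With p = 0.01 the mean segment is ~100 windows (500 kb at 5 kb windows), so a 5000-point series carries a few dozen changepoints, in line with the 58 of the worked example.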

12 Idealised simulation example

13 Simulated data 2: Correlations between segments In real DNA data there appear to be correlations between isochores, so it is important to test the sensitivity of the method given the fundamental independence assumption of the recursions. Simulated data in which there are correlations between segments: positive 1-lag autocorrelations in segment GC and length (r = 0.30) and a negative correlation between GC and length (r = -0.30); other simulation details as before. Inference of parameters remains good: p, σ (mean estimate 140.16) and λ (mean estimate 1.93, mean true value 1.94). Correlations between segments are underestimated: GC (estimate 0.203, true 0.248), length (0.114, 0.270), GC-length (-0.297, -0.378).
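One way to generate such correlated segments is to drive segment GC and length from two AR(1) latent series with correlated innovations; this is a guess at a mechanism consistent with the quoted correlations, not the authors' simulation code.

```python
import numpy as np

def correlated_latents(n_seg, a, rho, rng):
    """Two stationary AR(1) series with lag-1 autocorrelation a, whose
    innovations are correlated at rho; the stationary cross-correlation
    of the two series is then also rho."""
    e = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n_seg)
    z = np.empty(n_seg)
    w = np.empty(n_seg)
    z[0], w[0] = e[0]
    s = np.sqrt(1.0 - a * a)
    for i in range(1, n_seg):
        z[i] = a * z[i - 1] + s * e[i, 0]
        w[i] = a * w[i - 1] + s * e[i, 1]
    # segment GC means would be mu + lam * sigma * z; segment lengths
    # a monotone (e.g. exponential) map of w
    return z, w
```

Setting a = 0.30 and rho = -0.30 reproduces the lag-1 autocorrelations and the negative GC-length correlation quoted above.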

14 Simulated data 3: Multimodal distribution The model treats segment means as a single normal. Simulated data according to a model of four isochore classes, with σ reduced to 70 and λ increased to 3.5; parameters chosen so that the distribution of segment means is multimodal. Inference under the Bayesian model recreated the multimodal structure of the segment means, but the acceptance probability was very low.

15 Non-parametric test of changepoint inference Want some non-parametric way of evaluating the method; the most fundamental issue is whether changepoints are inferred correctly. Compare two sets of changepoints to get the mean minimum distance between both sets (inferred to nearest true and vice versa). For each set of inferred changepoints, compare to the true changepoints to get the inferred-true distance; for randomised sets of changepoints get the random-true distance. The ratio inferred-true/random-true gives performance: idealised ratio = 0.238, correlations ratio = 0.256, multimodal ratio = …. Performance is robust to model mis-specification. Comparison of methods for the idealised simulation example: Isofinder with p = … gives ratio = 0.279; Bayesian best set of changepoints gives ratio = ….
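The distance measure can be sketched as follows; the symmetrised average and the uniform random placement of changepoints are my reading of the slide, so treat the details as illustrative.

```python
import numpy as np

def mean_min_dist(a, b):
    """Mean over points in a of the distance to the nearest point in b."""
    a, b = np.asarray(a), np.asarray(b)
    return np.abs(a[:, None] - b[None, :]).min(axis=1).mean()

def cp_distance(inferred, true_cps):
    """Symmetrised mean minimum distance between two changepoint sets
    (inferred to nearest true and vice versa)."""
    return 0.5 * (mean_min_dist(inferred, true_cps)
                  + mean_min_dist(true_cps, inferred))

def performance_ratio(inferred, true_cps, n, n_rand=200, rng=None):
    """Ratio of the inferred-true distance to the average random-true
    distance; values well below 1 mean inference beats random placement."""
    rng = np.random.default_rng(0) if rng is None else rng
    d_inf = cp_distance(inferred, true_cps)
    d_rand = np.mean([cp_distance(rng.choice(np.arange(1, n),
                                             size=len(inferred),
                                             replace=False), true_cps)
                      for _ in range(n_rand)])
    return d_inf / d_rand
```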

16 Real data: human chromosome 1 ~36.8 Mb, 5 kb windows, 7358 data points, mean 2068, standard deviation 277. Q(1) analysis supports k = 1, a geometric distribution. Means and 95% CIs from the posterior distributions: p (…), λ 1.91 (…), σ 138 (…). Mean segment length 80 kb; lots of short segments. Strong neighbouring-segment correlations: GC 0.21, length 0.11, GC-length -0.29.

17 Real data: human chromosome 1

18 Human chromosome 1: levels of structure Higher-level structure: can enforce higher k and lower p; k = 5 gives a mean segment size of 185 kb. Significant partial autocorrelations between segment means up to lag 9. Applying the Bayesian model to the inferred segment-mean output gives three super-changepoints, at 10 Mb (window 2008 in the 5 kb data), 15 Mb (window 3018) and 17 Mb (window 3419). Effect of “coarse graining” window size: segment size is positively correlated with window size (3 kb windows give 60 kb segments; 5 kb, 80 kb; 10 kb, … kb). Same process for Isofinder with p = 0.95: 3 kb windows give 85 kb segments and 5 kb windows give 115 kb segments. The three super-changepoints at 10, 15 and 17 Mb are robust to window size.

19 Real data: mouse chromosome ~23.9 Mb, 5 kb windows, 4777 data points, mean 2050, standard deviation 180. Means and 95% CIs from the posterior distributions: p (…), λ 1.71 (…), σ 103 (…). Less variation in base composition in the mouse genome than in the human genome: due to longer segments (120 kb vs. 80 kb), less variation within segments (σ 103 vs. 138) and less variation among segments (λσ ≈ 176 vs. 264).

20 Preliminary conclusions on GC analyses Methods which partition the genome according to the data are preferable to sliding-window methods. The Bayesian method does not require the threshold parameter of recursive segmentation methods, allows improved statistical analyses, and seems to perform better than recursive segmentation. The Bayesian method appears to be robust to model mis-specification.

21 Future work - correlated genome structure One exciting possibility of the Bayesian method is extension to multiple genomic features, e.g. the issue of GC-substitution rate (K) covariation, accounting for uncertainty in the inference of K. Do GC and K changepoints tend to be close together, and do the changes between segments covary? Higher level: what about super-changepoints in GC and K? Lower level: do GC and K covary within segments?