The distribution of the IBD sharing and applications. Tel Aviv University, July 23, 2012. Shai Carmi, Itsik Pe’er’s lab, Department of Computer Science, Columbia University.



Outline
IBD: introduction
Coalescent theory of IBD
– Distribution of pairwise sharing
– The variance
– The variance of the cohort-averaged sharing
Applications
– Imputation by IBD
– Sequencing study design
– Siblings
– Demographic inference
– Jewish genetics
Summary

Identity-by-descent (IBD): In small, isolated populations, all individuals have recent common ancestors, so long haplotypes that are IBD are abundant. [Figure: chromosomes A and B carrying a shared segment. Image credit: L. Macmillan, UNC.]

IBD detection: Until the last decade, IBD was usually defined for single markers. Genome-wide SNP arrays enable detection of long segments. GERMLINE (Gusev et al., Genome Res., 2009) is a fast algorithm for detecting IBD segments in large cohorts: divide the chromosomes into small windows; for each window, hash the genotypes of each individual and search for perfect matches; extend seeds as long as the match is good enough; record matches longer than a cutoff m. Other methods exist.
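The hash-and-extend steps listed above can be sketched roughly as follows. This is a toy illustration, not the GERMLINE implementation: it assumes phased haplotypes given as 0/1 strings, measures length in markers rather than genetic distance, and extends seeds only through exactly matching windows, whereas the slide says extension continues while the match is "good enough". The function name and its parameters are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def find_shared_segments(haplotypes, window, min_len):
    """Toy hash-and-extend search (illustrative only).

    haplotypes: dict name -> string of '0'/'1' alleles, all the same length.
    Returns {(name_a, name_b): [(start, end), ...]} in marker coordinates.
    """
    n_markers = len(next(iter(haplotypes.values())))
    open_runs = {}                   # pair -> start of its current run of matching windows
    segments = defaultdict(list)

    def close_run(pair, end):
        start = open_runs.pop(pair)
        if end - start >= min_len:   # keep only matches at least as long as the cutoff
            segments[pair].append((start, end))

    for start in range(0, n_markers, window):
        end = min(start + window, n_markers)
        buckets = defaultdict(list)  # window content (the hash key) -> carriers
        for name, hap in haplotypes.items():
            buckets[hap[start:end]].append(name)
        matched = set()
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                matched.add(pair)
                open_runs.setdefault(pair, start)   # seed a run, or extend an open one
        for pair in [p for p in open_runs if p not in matched]:
            close_run(pair, start)   # the run ended just before this window
    for pair in list(open_runs):
        close_run(pair, n_markers)
    return dict(segments)

haps = {"A": "0101010101", "B": "0101011111", "C": "1111111111"}
print(find_shared_segments(haps, window=2, min_len=4))
```

Hashing each window means every individual is visited once per window, and only individuals that land in the same bucket are ever compared, which is the point of the approach for large cohorts.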

Questions
How much IBD is expected in model populations?
– Consider the fraction of the genome shared between all possible pairs.
– Mean?
– Variance?
– Distribution?
Applications
– Demographic inference
– Study design
– Positive selection detection
– Phasing and imputation
– Pedigree reconstruction

Sequencing study design: A large cohort is genotyped, and a subset is selected for sequencing. Look for IBD segments between sequenced and non-sequenced individuals, and impute variants along the IBD segments. To maximize utility, select the individuals with the most sharing (Gusev et al., Genetics, 2012 (INFOSTIP)).

Sequencing study design (continued): Is the strategy useful? Is it worth prioritizing? How is the average sharing of each individual with the rest of the cohort distributed?
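As a minimal sketch of the quantity in question, the cohort-averaged sharing of each individual can be computed from a matrix of pairwise total sharing, and the top individuals picked for sequencing. This is the simplest reading of "select individuals with most sharing" and is not the INFOSTIP algorithm; the function names and the example matrix are hypothetical.

```python
def cohort_averaged_sharing(sharing):
    """Mean sharing of each individual with all others.

    sharing: symmetric n x n list of lists; sharing[i][j] is the fraction of
    the genome that individuals i and j share (the diagonal is ignored).
    """
    n = len(sharing)
    return [
        sum(sharing[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def select_for_sequencing(sharing, n_s):
    """Indices of the n_s individuals with the highest cohort-averaged sharing."""
    avg = cohort_averaged_sharing(sharing)
    ranked = sorted(range(len(avg)), key=lambda i: avg[i], reverse=True)
    return ranked[:n_s]

# Made-up pairwise sharing fractions for a cohort of four individuals.
sharing = [
    [0.00, 0.02, 0.04, 0.06],
    [0.02, 0.00, 0.01, 0.03],
    [0.04, 0.01, 0.00, 0.05],
    [0.06, 0.03, 0.05, 0.00],
]
print(select_for_sequencing(sharing, n_s=2))  # [3, 0]
```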

Wright-Fisher model: Non-overlapping, discrete generations. A population of constant size N haploid individuals. Ignore mutations (when studying IBD). Recombination is a Poisson process. Each pair of individuals (lineages) has probability 1/N to coalesce in the previous generation. In the limit of continuous time and large population size, the model is approximated by the coalescent. The (scaled) time to the MRCA is exponential with rate 1. [Figure: example with N=10.]
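The statement that the scaled time to the MRCA is approximately exponential with rate 1 can be checked numerically. In the sketch below (function names are illustrative), the number of generations until two lineages coalesce is drawn from a geometric distribution with success probability 1/N and divided by N; N must be at least 2.

```python
import math
import random

def sample_coalescence_generations(N, rng):
    """Generations back until two lineages coalesce.

    Each generation the pair coalesces with probability 1/N, so the waiting
    time is geometric; it is drawn here by inverse-transform sampling.
    """
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return int(math.log(u) / math.log(1.0 - 1.0 / N)) + 1

def scaled_tmrca_samples(N, n_samples, seed=0):
    """Coalescence times in units of N generations."""
    rng = random.Random(seed)
    return [sample_coalescence_generations(N, rng) / N for _ in range(n_samples)]

samples = scaled_tmrca_samples(N=1000, n_samples=20000)
mean_t = sum(samples) / len(samples)
frac_above_1 = sum(t > 1 for t in samples) / len(samples)
# For Exp(1): mean 1 and P(t > 1) = exp(-1).
print(mean_t, frac_above_1, math.exp(-1))
```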

Mosaic of segments: Consider two (unrelated) chromosomes, A and B. The total sharing f_T is the fraction of the chromosome in shared segments of length ≥ m. Observation: all sites are in shared segments, but a segment can be short because of an ancient common ancestor. [Figure: the chromosome, coordinates 0 to L, divided into segments ℓ_1, …, ℓ_11; the segments of length ≥ m sum to ℓ_T = ℓ_1 + ℓ_5 + ℓ_9.]
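To make the definition concrete, the total sharing can be computed from a list of segment lengths by summing those at or above the cutoff m and dividing by the chromosome length. The lengths below are made up so that, as in the figure, segments 1, 5 and 9 are the long ones; the function name is illustrative.

```python
def total_sharing(segment_lengths, m):
    """Fraction of the chromosome covered by shared segments of length >= m.

    segment_lengths: lengths of the consecutive segments that tile the
    chromosome (same units as m); their sum is the chromosome length L.
    """
    L = sum(segment_lengths)
    l_T = sum(length for length in segment_lengths if length >= m)
    return l_T / L

# Eleven hypothetical segments; only the 1st, 5th and 9th are >= m.
segments = [3.0, 0.5, 0.2, 0.8, 2.0, 0.4, 0.1, 0.6, 1.4, 0.7, 0.3]
print(total_sharing(segments, m=1.0))
```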

Mosaic of segments (continued): [Figure: the same mosaic of segments ℓ_1, …, ℓ_11 along coordinates 0 to L, with ℓ_T = ℓ_1 + ℓ_5 + ℓ_9, shown together with the genealogy of A and B over time t.]

Renewal theory: Distribution of waiting times. [Figure: the time axis from 0 to T divided into waiting times τ_1, …, τ_11; the waiting times of length ≥ m sum to t_S = τ_1 + τ_5 + τ_9.]

Renewal theory: solution. Laplace transform: T → s, t_S → u.

Mean IBD sharing: Can be derived in many ways. (1) (2) The average number of segments of length ≥ m is 2NL·P(ℓ ≥ m). (3) Palamara, …, Pe’er, AJHG, 2012: at the end of the talk (time permitting).

Varying population size

The variance of the IBD sharing: Can also be calculated in a number of ways. (1) (2) Define I(s), the indicator, with probability π (= ⟨f_T⟩), that site s is in a shared segment between two given chromosomes. Define the number of sites as M. The variance requires calculating two-site probabilities. Almost-exact solution at the end of the talk (time permitting).
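The role of the two-site probabilities can be spelled out with the indicator definition. The identity below is a generic property of an average of indicators, not the slide's own (unpreserved) formula; the symbol π_2(s_1, s_2), the probability that both sites lie in shared segments of length at least m, is notation assumed here.

```latex
\begin{align}
f_T &= \frac{1}{M}\sum_{s=1}^{M} I(s), \qquad
\langle f_T \rangle = \frac{1}{M}\sum_{s=1}^{M} \langle I(s) \rangle = \pi, \\
\mathrm{Var}[f_T] &= \frac{1}{M^2}\sum_{s_1=1}^{M}\sum_{s_2=1}^{M}
  \Big( \langle I(s_1) I(s_2) \rangle - \pi^2 \Big)
  = \frac{1}{M^2}\sum_{s_1=1}^{M}\sum_{s_2=1}^{M}
  \Big( \pi_2(s_1, s_2) - \pi^2 \Big).
\end{align}
```

The mean needs only the single-site probability π, while the variance needs π_2 for every pair of sites, which is why the dependence between sites has to be modeled.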

The variance: simplified (3) Idea: Two distant sites will always be on a shared segment if there was no recombination event in their history. If there was, treat sites as independent. Neglect some small terms. The probability of no recombination: The variance: For the human genome, d≥m

The cohort-averaged sharing: The distribution is close to normal, but its variance approaches a constant even for large sample size n. Why? The variance scales as 1/n for small n and approaches a constant for large samples. For the human genome,
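One generic way to see how a variance can scale as 1/n at first and then level off is to write the cohort-averaged sharing of individual i as an average of pairwise values that all involve i. The notation below is assumed for illustration and is not taken from the slide: σ² is the variance of the pairwise sharing, and c is the covariance between the sharing of i with j and of i with k (j ≠ k), taken to be the same for all such pairs.

```latex
\begin{align}
\bar{f}_i &= \frac{1}{n-1}\sum_{j \neq i} f_{ij}, \\
\mathrm{Var}[\bar{f}_i]
  &= \frac{1}{(n-1)^2}\Big[ (n-1)\,\sigma^2 + (n-1)(n-2)\,c \Big]
   = \frac{\sigma^2}{n-1} + \frac{n-2}{n-1}\,c
   \;\xrightarrow[n \to \infty]{}\; c .
\end{align}
```

When c is small compared with σ², the first term dominates for small n and gives the 1/n scaling; for large n only the covariance term remains, so the width of the distribution stops shrinking.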

The tail of the cohort-averaged sharing: ‘hyper sharing’. Even for large cohorts, the distribution of the cohort-averaged sharing retains a constant width. Some individuals will be in the tails of this distribution → ‘hyper sharing’. This can be taken advantage of in sequencing studies.

Imputation by IBD: Our results can be used to calculate the expected imputation power when sequencing a subset of a cohort. Assume a cohort of size n, n_s of which are sequenced. Random selection of individuals: Selection of highest-sharing individuals: where

Increase in association power: The imputed genomes can be thought of as increasing the effective number of sequences. A simple model (Shen et al., Bioinformatics, 2011): The variant appears in cases only. The carrier frequency in cases equals β. Dominant effect. Association is detected if the P-value is below a threshold. For a fixed budget, there is a trade-off in the number of cases and controls to sequence.

Siblings: Siblings share, on average, 50% of their genomes. What is the variance? A classic problem. Visscher et al. (PLoS Genet., 2006) used the variance to estimate heritability from sibling studies. Genome-wide SD 5.5%. But what if the parents are inbred? Assume shared segments are either from the parents or are more remote.

Estimator of population size: Given one genome, estimate the population size N. Calculate the total sharing f_T. We know that Invert to suggest an estimator: Not very useful: estimator is biased and has SD Compared to for Watterson’s estimator (based on the number of het sites).
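The inversion step can be illustrated under an explicit assumption, since the slide's formulas are not preserved in the transcript. Assume the constant-size mean sharing is ⟨f_T⟩ = (1 + 4mN) / (1 + 2mN)², with m in Morgans and N haploid. Solving the resulting quadratic in x = 2mN and keeping the positive root gives the estimator below. Function names are illustrative.

```python
import math

def expected_sharing(N, m):
    """Assumed mean total sharing for constant population size N (haploid)
    and length cutoff m (Morgans): (1 + 4mN) / (1 + 2mN)^2."""
    x = 2.0 * m * N
    return (1.0 + 2.0 * x) / (1.0 + x) ** 2

def estimate_population_size(f_T, m):
    """Invert the assumed mean: positive root of f_T (1 + x)^2 = 1 + 2x,
    with x = 2mN. Requires 0 < f_T <= 1."""
    return ((1.0 - f_T) + math.sqrt(1.0 - f_T)) / (2.0 * m * f_T)

# Round trip: plugging in the expected sharing recovers N.
f = expected_sharing(N=5000, m=0.01)
print(f, estimate_population_size(f, m=0.01))
```

The round trip only shows that the inversion is algebraically consistent. Because the estimator is a nonlinear function of a noisy f_T, it is biased when applied to a single genome, in line with the slide's remark.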

Ashkenazi Jews: In recent years, shown to be a genetically distinct group, close to Middle Eastern and European populations (particularly Italians and Adygei) (Atzmon et al., Am. J. Hum. Genet., 2010). Very large amounts of IBD (Gusev et al., Mol. Biol. Evol., 2011), likely due to a recent, severe bottleneck.

IBD in Ashkenazi Jews: 2,600 Ashkenazi Jews, 1M SNP array (Guha et al., Genome Biol., 2012). Use GERMLINE to detect IBD segments. Compare the total sharing to simulations of the demography inferred from the mean IBD in different length ranges (Palamara et al., AJHG, 2012). Excess of ‘hyper sharing’ in AJ.

Admixture in AJ Most plausible explanation: correct for admixture. The AJ component was calculated in comparison to CEU. When considering only individuals with close to median AJ ancestry, most of the unexplained variance disappears.

Summary We calculated the distribution of the total IBD sharing in the Wright-Fisher model using renewal theory. We obtained explicit expressions for the variance of both the pairwise sharing and the cohort-averaged sharing. We calculated the expected gain in imputation and association power if individuals at the tail of the cohort-averaged sharing distribution are selected for sequencing. The variance/distribution of IBD has many applications, some of which we presented, some are left for future work. In the AJ population, individuals differ in cohort-averaged sharing by up to 30%. Admixture explains some of the variance.

The end. Thanks to: Pier Francesco Palamara, Vladimir Vacic, Itsik Pe’er; Todd Lencz and Ariel Darvasi (for AJ genotypes); Human Frontier Science Program Cross-Disciplinary Fellowship.

Identity-by-descent (IBD): [Figure: founder chromosomes and contemporary chromosomes, with identity-by-descent segments marked.]

Mean IBD (Palamara et al.): See Palamara et al., AJHG, 2012. Assume shared segments must have length at least m. Define I(s): the indicator, with probability π, that site s is in a shared segment between two given chromosomes. Define f_T: the fraction of the chromosome found in shared segments, or the total sharing. Given g, the number of generations to the MRCA: In the coalescent, g → Nt: Then, ⟨f_T⟩ = π.
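The chain of steps above (condition on g, pass to the coalescent, average over t) can be written out under a standard assumption, since the slide's own equations are not preserved. Assume that, given g generations to the MRCA, the shared segment around a site extends to the left and to the right by independent exponential lengths with rate 2g per Morgan, so the segment length is Erlang-2 distributed. With m in Morgans, N haploid, and t ~ Exp(1):

```latex
\begin{align}
P(\ell \ge m \mid g) &= (1 + 2mg)\, e^{-2mg}, \\
\pi &= \int_0^\infty e^{-t}\,(1 + 2mNt)\, e^{-2mNt}\, dt
     = \frac{1}{1 + 2mN} + \frac{2mN}{(1 + 2mN)^2}
     = \frac{1 + 4mN}{(1 + 2mN)^2}, \\
\langle f_T \rangle &= \frac{1}{M}\sum_{s=1}^{M} \langle I(s) \rangle = \pi .
\end{align}
```

For 2mN ≫ 1 this behaves as 1/(mN): the larger the population or the longer the cutoff, the smaller the expected sharing.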

The variance of the total sharing (1): The variance requires calculating two-site probabilities. Idea: For one site, the PDF of the coalescence time is Φ(t) ~ Exp(1). For two sites, calculate the joint PDF Φ(t_1, t_2). Φ(t_1, t_2) takes into account the interaction between the sites. Given t_1 and t_2, calculate π_2 as if the sites are independent.

The variance of the total sharing (2): Express π_2 in terms of the Laplace transform of Φ(t_1, t_2). Use the coalescent with recombination to find the transform, where A-E are defined in terms of q_1, q_2, and the scaled recombination rate ρ.

IBD in AJ: Are ‘hyper-sharing’ individuals sharing more with everyone else, or just with other ‘hyper-sharing’ individuals? Each curve represents the average of 1/7 of the individuals, in order of their cohort-averaged sharing. [Figure: curves ordered from highest sharing to lowest sharing.]