Sharing of long genomic segments: Theory and results in Ashkenazi Jews Bar-Ilan University July 26, 2012 Shai Carmi Itsik Pe’er’s lab Department of Computer.


1 Sharing of long genomic segments: Theory and results in Ashkenazi Jews Bar-Ilan University July 26, 2012 Shai Carmi Itsik Pe’er’s lab Department of Computer Science Columbia University

2 Outline
Introduction: Identity-by-descent (IBD) sharing
Theory of IBD sharing
– The Wright-Fisher model and coalescent theory
– The distribution of the total sharing
– The cohort-averaged sharing
Applications
– Imputation by IBD
– Siblings
Jewish genetics
– Background
– IBD and ancient demography
– The Ashkenazi Sequencing Project
Summary

3 Genetic drift The number of offspring of each individual is random. All pairs of individuals descend from a common ancestor.

4 Identity-by-descent (IBD) When the population is small, the common ancestors are frequently recent. This gives an abundance of long haplotypes that are IBD. [Figure: chromosomes of individuals A and B with a shared segment]

5 IBD detection Until the last decade, IBD was usually defined for single markers. Genome-wide SNP arrays enable detection of long segments. GERMLINE (Gusev et al., Genome Res., 2009): A fast algorithm for detection of IBD segments in large cohorts. Divide the chromosomes into small windows. For each window, hash the genotypes of each individual and search for perfect matches. Extend seeds, as long as the match is good enough. Record matches longer than a cutoff m. Other methods exist.
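The seed-and-extend steps on this slide can be sketched in a few lines. This is a simplified illustration and not the GERMLINE implementation: it assumes phased haplotypes given as equal-length strings, accepts exact window matches only (whereas the slide allows extension while the match is "good enough"), and measures length in windows rather than genetic distance. The names `find_shared_segments`, `window`, and `min_windows` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def find_shared_segments(haplotypes, window=4, min_windows=2):
    """Toy seed-and-extend search for long identical stretches.

    haplotypes: dict mapping a sample name to a string of alleles
    (all strings have the same length). Returns a list of
    (name_a, name_b, start_index, end_index) with end exclusive.
    """
    length = len(next(iter(haplotypes.values())))
    n_windows = length // window

    # Step 1: hash each window and collect the pairs that match exactly.
    matches = defaultdict(set)  # (a, b) -> set of matching window indices
    for w in range(n_windows):
        buckets = defaultdict(list)
        for name, hap in haplotypes.items():
            buckets[hap[w * window:(w + 1) * window]].append(name)
        for names in buckets.values():
            for a, b in combinations(sorted(names), 2):
                matches[(a, b)].add(w)

    # Step 2: merge consecutive matching windows into runs and keep
    # only the runs that reach the length cutoff.
    segments = []
    for (a, b), windows in sorted(matches.items()):
        run_start = None
        for w in range(n_windows + 1):  # one extra step closes the last run
            if w in windows:
                if run_start is None:
                    run_start = w
            elif run_start is not None:
                if w - run_start >= min_windows:
                    segments.append((a, b, run_start * window, w * window))
                run_start = None
    return segments
```

Hashing each window groups identical haplotypes without comparing every pair directly, which is what makes the approach practical for large cohorts.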

6 IBD applications Demographic inference (Palamara et al., AJHG, 2012). Phasing (Palin et al., Genetic Epi., 2011). Imputation (Gusev et al., Genetics, 2012). Positive selection detection (Albrechtsen et al., Genetics, 2010). Disease mapping (Browning and Thompson, Genetics, 2012). Pedigree reconstruction (Huff et al., Genome Res., 2011). [Figure: genotypes in the cell versus the SNP array readout]

7 IBD in Ashkenazi Jews Links connect individuals with shared segments (Gusev et al., Mol. Biol. Evol., 2011). [Figure labels: Ashkenazi Jewish; Other European]

8 Imputation by IBD A large genotyped cohort. A subset is selected for sequencing. Look for IBD segments between sequenced and not-sequenced individuals. Impute variants along IBD segments. To maximize utility, select individuals with the most sharing (INFOSTIP; Gusev et al., Genetics, 2012).
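One simple reading of "select individuals with most sharing" is to rank individuals by their average sharing with the rest of the cohort. The sketch below does only that; it is not the INFOSTIP procedure from the cited paper, and the function name and the input layout (a dict keyed by unordered pairs) are assumptions made for illustration.

```python
def select_for_sequencing(sharing, n_s):
    """Pick the n_s individuals with the highest cohort-averaged sharing.

    sharing: dict mapping frozenset({a, b}) to the fraction of the genome
    shared IBD by individuals a and b (missing pairs count as 0).
    """
    individuals = sorted({name for pair in sharing for name in pair})
    n = len(individuals)

    def cohort_average(name):
        # Average sharing of this individual with everyone else.
        total = sum(sharing.get(frozenset((name, other)), 0.0)
                    for other in individuals if other != name)
        return total / (n - 1)

    ranked = sorted(individuals, key=cohort_average, reverse=True)
    return ranked[:n_s]
```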

9 Wright-Fisher model and the coalescent [Figure: genealogy in a population of N=10; axis: time t]
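As a small illustration of the model, the sketch below simulates how far back two lineages must be traced before they meet in a haploid Wright-Fisher population. Each generation the two lineages pick parents uniformly at random, so they coalesce with probability 1/N and the mean time is N generations. The function names, the replicate count, and the seed are arbitrary choices.

```python
import random


def pairwise_coalescence_time(n_chromosomes, rng):
    """Generations back until two lineages pick the same parent."""
    generations = 0
    while True:
        generations += 1
        # Each lineage picks its parent uniformly at random.
        if rng.randrange(n_chromosomes) == rng.randrange(n_chromosomes):
            return generations


def mean_coalescence_time(n_chromosomes, replicates=20000, seed=1):
    """Monte Carlo estimate of the mean pairwise coalescence time."""
    rng = random.Random(seed)
    total = sum(pairwise_coalescence_time(n_chromosomes, rng)
                for _ in range(replicates))
    return total / replicates
```

With `n_chromosomes=10`, as in the slide's figure, the estimate should land close to 10 generations.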

10 Theory: mosaic of segments [Figure: the chromosomes of A and B along a coordinate from 0 to L, divided into segments ℓ_1, ..., ℓ_11, with the length cutoff m marked.] Total sharing: ℓ_T = ℓ_1 + ℓ_5 + ℓ_9.

11 Renewal theory [Figure: the same mosaic for A and B drawn on a time axis from 0 to T, with intervals τ_1, ..., τ_11 and the cutoff m marked.] t_S = τ_1 + τ_5 + τ_9.

12 Renewal theory: solution Laplace transform T → s, t_S → u.

13 Mean IBD sharing The average number of segments ≥m is 2NL·P(ℓ≥m). For large N, the mean fraction of the genome in shared segments is ≈1/(mN). Alternative derivation at the end of the talk (time permitting).
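To see how quickly the 1/(mN) approximation becomes accurate, the sketch below compares it with a closed form for a constant-size population, (1 + 4mN) / (1 + 2mN)^2. That closed form does not appear in the surviving slide text; it is assumed here from the standard constant-size result, with m in Morgans and N in the scaling g = Nt used in the talk.

```python
def mean_sharing(m, n):
    """Expected fraction of the genome in shared segments of length >= m.

    m is the length cutoff in Morgans; n is the population size in the
    scaling g = N*t. Assumed closed form for a constant-size population.
    """
    x = 2.0 * m * n
    return (1.0 + 2.0 * x) / (1.0 + x) ** 2


def large_n_approximation(m, n):
    """Leading-order behaviour of mean_sharing when m*n is large."""
    return 1.0 / (m * n)
```

For a cutoff of m = 0.01 Morgans and N = 10,000, the two values differ by less than 1%; for N = 100 the approximation is poor, since mN is then only 1.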

14 The variance of the IBD sharing

15 The variance: simplified (3) Idea: Two distant sites will always be on a shared segment if there was no recombination event in their history. If there was, treat sites as independent. Neglect some small terms. The probability of no recombination: The variance: For the human genome, d≥m

16 The cohort-averaged sharing The distribution is close to normal. Its variance scales as 1/n for small n and approaches a constant for large samples. Some individuals will be in the tails of this distribution: ‘hyper-sharing’.

17 Imputation by IBD Calculate the expected imputation power when sequencing a subset of a cohort. Assume a cohort of size n, n_s of which are sequenced. Random selection of individuals: Selection of highest-sharing individuals: where

18 Siblings Siblings share, on average, 50% of their genomes. What is the variance? A classic problem (Visscher et al., PLoS Genet., 2006), where the variance was used to estimate heritability from sibling studies. Genome-wide SD 5.5%. But what if the parents are inbred? Assume shared segments are either from the parents or are more remote.

19 Ashkenazi Jewish brief history End of 1st millennium: Small Jewish communities in the Rhineland. 1096: Crusades. 12th-13th centuries: First Jewish communities in Eastern Europe. A few thousand individuals. 16th-19th centuries: The demographic miracle: exponential growth. Prewar: about 10 million, 90% of all Jewish people.

20 Ashkenazi Jewish genetics In recent years, AJ have been shown to be a genetically distinct group, close to Middle Easterners and Europeans (particularly Italians and Adygei). (Atzmon et al., Am. J. Hum. Genet., 2010) 300 Jews in 900k SNPs.

21 Ashkenazi Jewish genetics Bray et al., PNAS, 2010. 471 AJ in 700k SNPs. Need et al., Genome Biology, 2009. ~100 AJ in 550k SNPs. Kopelman et al., BMC Genetics, 2009. 80 AJ in 700 microsatellites.

22 Ashkenazi Jewish genetics Behar et al., Nature, 2010. ~120 Jews in 600k SNPs. Khazar theory incompatible. European admixture ~20% (but 30-50% according to other studies). No genetic sub-structure. AJ diseases likely due to founder effect (no selection). Guha et al., Genome Biology, 2012. ~1312 AJ in 740k SNPs. [Figure labels: AJ, EU, ME; AJ from different countries]

23 Ashkenazi Jewish genetics

24 IBD in Ashkenazi Jews Inference of AJ history (Palamara et al., AJHG, 2012) 2,600 AJ, 700k SNPs. Detect IBD segments and calculate their distribution. Use IBD theory to obtain an initial guess of the demographic parameters. Grid search around initial guess: Compare sharing in simulations of different demographies and the mean IBD in different length ranges. IBD is particularly informative on recent history.

25 AJ (genetic) history [Figure: effective size N versus time t, in years ago up to the present. Labels: 3,000; 60,000; 300; 5,000,000; 800; expansion rate ≈1.34.]
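The figure labels can be cross-checked with a little arithmetic. The sketch below assumes one particular reading of the labels, which the transcript does not confirm: a bottleneck of effective size 300 about 800 years ago, growth to 5,000,000 at present, and an expansion rate of 1.34 per generation.

```python
import math


def generations_of_growth(start_size, end_size, rate_per_generation):
    """Generations needed to grow from start_size to end_size."""
    return math.log(end_size / start_size) / math.log(rate_per_generation)


# Assumed reading of the slide labels (see the note above).
generations = generations_of_growth(300, 5_000_000, 1.34)
years_per_generation = 800 / generations
```

Under these assumptions the growth takes roughly 33 generations, which over 800 years corresponds to about 24 years per generation.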

26 AJ sequencing Why sequencing?
– Rare variants (no ascertainment bias)
– Copy-number variants
– Functional variants
– Improve power of demographic inference
– Improve understanding of recent population explosion
– Natural selection (positive/negative)
– Jewish disease genes
– Higher power in disease mapping?

27 The Ashkenazi Genome Consortium
Labs: Lencz, Atzmon, Cho, Clark, Ostrer, Ozelius, Peter, Darvasi, Offit, Pe’er.
Institutions: Columbia, Einstein, Mount Sinai, MSKCC, Yale, HUJI.
Phase I: 137 healthy AJ genomes, 40 AJ schizophrenia patients. 25/7/2012: 77 delivered (48+29).
Samples: ~60yo, multi-disease controls.
Technology: Complete Genomics.
Cost: about $2500/genome.
Phase II (2013): Sequence the entire bottleneck (300-400 individuals).

28 Sample selection Remove relatives. Remove non-AJ individuals. Select individuals to maximize utility for imputation.

29 Backup and distribution pipeline Raw size: 300GB/genome (60TB/project). Variant calls and summaries: 1.5GB/genome (300GB/project).
Pipeline:
– Checksum the disks.
– Copy the entire data set to a fault-tolerant, network-distributed file system (MooseFS).
– Checksum the copy.
– Back up the entire data set at Einstein and Columbia Medical School as well.
– Distribute variant calls, summaries, and new processed files on a dedicated server.
– Combine all genomes (VCF, Plink).
– Phasing (statistical + molecular).
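The checksum, copy, checksum steps of the pipeline can be sketched as follows. This is a generic illustration using Python's standard library, not the project's actual scripts; the choice of SHA-256 and the function names are assumptions.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy a file and confirm the copy has the same checksum."""
    source, destination = Path(source), Path(destination)
    before = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"checksum mismatch for {destination}")
    return after
```

Reading in fixed-size chunks matters at this scale, since a single raw genome is quoted at 300GB.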

30 Quality control (first 48 healthy individuals)
Property: Value (exome)
Fraction called: 96.6% (98%)
Coverage: 55x
Fraction with coverage > 20x: 93% (95%)
Concordance with SNP array: 99.87%
Ti/Tv ratio: 2.14 (3.05)
Quality usually uniform across all individuals. One female with triple X chromosome. A few with likely many false CNVs. Two inbred individuals. Use to calibrate error rate: 800 heterozygous variants (400 SNPs) in a 45MB homozygous region.
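The calibration idea in the last sentence amounts to a simple ratio. The sketch below assumes that "45MB" means 45 million base pairs and that every heterozygous call inside a truly homozygous region is a false positive; both are interpretations, not statements from the slide.

```python
def per_base_error_rate(false_calls, region_length_bp):
    """False calls per base, treating every het call in the region as an error."""
    return false_calls / region_length_bp


# Counts from the slide; region length is an assumed 45 million bp.
all_variants = per_base_error_rate(800, 45_000_000)
snps_only = per_base_error_rate(400, 45_000_000)
```

Under these assumptions the rate is about 1.8 false heterozygous calls per 100,000 bases for all variants, and half of that for SNPs alone.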

31 Variants
Property: Value (exome)
Total SNPs: 3.4M (22k)
Novel SNPs: 3.7% (3.9%)
Het/hom ratio: 1.64 (1.67)
Insertions count: 224k (243)
Deletions count: 239k (219)
Substitutions count: 82k (369)
Synonymous SNPs: 10520
Non-synonymous SNPs: 9680
Nonsense SNPs: 71
Other disrupting: 241
CNV count: 348
SV count: 1489
MEI count: 3491

32 AJ and Europeans 13 Complete Genomics public genomes. Some quality differences. Similar number of variants of all kinds. het/hom ratio: 1.64 vs. 1.59. Upcoming data from 33 Flemish genomes. Minor differences. More variants in AJ. More allele sharing. More population specific variants.

33 Summary Identity-by-descent (IBD) theory: IBD is an important tool in population genetics. We developed a theory of IBD sharing and a few applications. Ashkenazi Jewish (AJ) genetics: AJ are a genetically distinct and homogeneous group, close to Europeans and Middle Easterners. Demographic inference using IBD revealed a severe bottleneck. We began The Ashkenazi Genome Project to sequence the majority of genetic variation in AJ and provide a reference panel for disease mapping. Initial results are available for QC, variant statistics, and comparison to Europeans.

34 The end Thanks to: Itsik Pe’er. IBD: Pier Francesco Palamara, Vladimir Vacic. AJ sequencing: Todd Lencz (LIJMC), Gil Atzmon, Harry Ostrer (EIN.), Lorraine Clark (CU). Funding: Human Frontier Science Program Cross-Disciplinary Fellowship.

35 Identity-by-descent (IBD) [Figure: founder chromosomes and contemporary chromosomes, illustrating identity-by-descent]

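36 Mosaic of segments [Figure: the chromosomes of A and B along a coordinate from 0 to L, divided into segments ℓ_1, ..., ℓ_11, with the length cutoff m marked, alongside the genealogies of A and B over time t.] Total sharing: ℓ_T = ℓ_1 + ℓ_5 + ℓ_9.

37 Mosaic of segments [Figure: the chromosomes of A and B along a coordinate from 0 to L, divided into segments ℓ_1, ..., ℓ_11, with the length cutoff m marked.] Total sharing: ℓ_T = ℓ_1 + ℓ_5 + ℓ_9.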

38 Mean IBD (Palamara et al.) See (Palamara et al., AJHG, 2012). Assume shared segments must have length at least m. Define I(s): the indicator, equal to 1 with probability π, that site s is in a shared segment between two given chromosomes. Define f_T: the fraction of the chromosome found in shared segments, or the total sharing. Given g, the number of generations to the MRCA: In the coalescent, g→Nt: Then, ⟨f_T⟩ = π.
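The equations on this slide did not survive in the transcript. A plausible reconstruction, under standard assumptions that are not confirmed by the slide text, runs as follows: given g generations to the MRCA, the distances from site s to the nearest recombination on each side are exponential with rate 2g per Morgan, so the segment containing s is the sum of two such distances. Averaging over the coalescent time t ~ Exp(1) with g = Nt gives π, whose large-N limit agrees with the ≈1/(mN) stated on slide 13.

```latex
\begin{align}
P(\ell \ge m \mid g) &= \int_m^\infty (2g)^2\, \ell\, e^{-2g\ell}\, d\ell
                      = (1 + 2mg)\, e^{-2mg}, \\
\pi &= \int_0^\infty e^{-t}\,(1 + 2mNt)\, e^{-2mNt}\, dt
     = \frac{1}{1+2mN} + \frac{2mN}{(1+2mN)^2}
     = \frac{1+4mN}{(1+2mN)^2}, \\
\langle f_T \rangle &= \pi \approx \frac{1}{mN} \qquad (mN \gg 1).
\end{align}
```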

39 Varying population size

40 The variance of the total sharing (1) The variance requires calculating two-site probabilities. Idea: For one site, the PDF of the coalescence time is Φ(t)~Exp(1). For two sites, calculate the joint PDF Φ(t_1, t_2). Φ(t_1, t_2) takes into account the interaction between the sites. Given t_1, t_2, calculate π_2 as if the sites are independent.

41 The variance of the total sharing (2) Express π_2 in terms of the Laplace transform of Φ(t_1, t_2). Use the coalescent with recombination to find the transform, where A-E are defined in terms of q_1, q_2, and the scaled recombination rate ρ.

42 Increase in association power The imputed genomes can be thought of as increasing the effective number of sequences. A simple model (Shen et al., Bioinformatics, 2011): Variant appears in cases only. Carrier frequency in cases equals β. Dominant effect. Association detected if the P-value is below a threshold. For a fixed budget, there is a trade-off in the number of cases/controls to sequence.

43 Estimator of population size Given one genome, estimate the population size N. Calculate the total sharing f_T. We know that Invert to suggest an estimator: Not very useful: estimator is biased and has SD Compared to for Watterson’s estimator (based on the number of het sites).
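The inversion step can be illustrated as follows. The slide's own expressions are missing from the transcript, so this sketch assumes the constant-size closed form ⟨f_T⟩ = (1 + 4mN) / (1 + 2mN)^2 and solves it for N; the function names are hypothetical, and the slide's caveat that such an estimator is biased still applies.

```python
import math


def mean_sharing(m, n):
    """Assumed closed form for the mean total sharing (constant size)."""
    x = 2.0 * m * n
    return (1.0 + 2.0 * x) / (1.0 + x) ** 2


def estimate_population_size(f_t, m):
    """Solve mean_sharing(m, N) = f_t for N (requires 0 < f_t < 1)."""
    if not 0.0 < f_t < 1.0:
        raise ValueError("f_t must be strictly between 0 and 1")
    # Positive root of f*x^2 + (2f - 2)*x + (f - 1) = 0 with x = 2*m*N.
    return (1.0 - f_t + math.sqrt(1.0 - f_t)) / (2.0 * m * f_t)
```

Plugging the expected sharing back in recovers N exactly, but a single genome's observed f_T fluctuates around that expectation, which is where the estimator's bias and large spread come from.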

44 IBD in AJ Are ‘hyper-sharing’ individuals sharing more with everyone else, or just with other ‘hyper-sharing’ individuals? Each curve represents the average of 1/7 of the individuals in order of their cohort-averaged sharing. [Figure labels: highest sharing to lowest sharing]

45 Complete Genomics WGS

