High-resolution detection of IBD. Sharon R Browning and Brian L Browning. Supported by the Marsden Fund.
Aim: Detect short segments (<1 cM, or <1 Mb) of identity by descent (IBD) in “unrelated” individuals or distant relatives. This needs dense SNP data and must account for linkage disequilibrium (LD). Various applications: IBD mapping in humans, midway between linkage mapping and association mapping. Could it be useful for QTL mapping in cows and sheep?
What is IBD? In a pedigree, IBD is defined in terms of pedigree founders: two haplotypes are IBD if they are copies of the same founder haplotype. IBD regions are typically large (10+ cM) for small pedigrees. [Figure: pedigree showing a founder (grandmother) and half-cousins, who may share IBD through the grandmother.]
IBD without a pedigree is nebulous. Assuming no recurrent mutation, identical alleles are IBD; this definition leads to ordinary association tests. IBD that is useful for improvements in mapping extends beyond background LD and is due to non-ancient ancestry.
What level of resolution is needed? Very long IBD stretches (5+ Mb) are easy to detect but are too rare. For IBD mapping: the expected size of IBD regions depends on when the mutation(s) entered the population, and small IBD regions give better localization.
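The dependence of IBD size on the age of the shared ancestry can be illustrated with a rough calculation. This is a minimal sketch, not part of the talk: it assumes crossovers occur as a Poisson process at one per Morgan per meiosis, that the two haplotypes are each separated from their common ancestor by g generations (2g meioses in total), and it ignores chromosome ends. Under those assumptions the shared segment around a focal locus extends an exponentially distributed distance on each side, giving a mean total length of 100/g cM. The function name is hypothetical.

```python
def expected_ibd_length_around_locus_cm(generations_to_ancestor: int) -> float:
    """Mean length (cM) of the IBD segment covering a focal locus.

    Assumptions (not from the slides): Poisson crossovers at 1 per Morgan
    per meiosis, 2*g meioses separating the two haplotypes, and no
    chromosome-end effects. The distance to the nearest crossover on each
    side is exponential with mean 1/(2g) Morgans, so the total mean is
    1/g Morgans = 100/g cM.
    """
    if generations_to_ancestor < 1:
        raise ValueError("generations_to_ancestor must be at least 1")
    meioses = 2 * generations_to_ancestor
    mean_one_side_cm = 100.0 / meioses
    return 2.0 * mean_one_side_cm


if __name__ == "__main__":
    for g in (10, 50, 100, 200):
        print(g, expected_ibd_length_around_locus_cm(g))
```

Under this approximation, ancestry on the order of 100 generations back corresponds to segments of about 1 cM, which is the scale targeted here.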
IBD Model Part I: Uses the Beagle model, previously applied to haplotype phase inference, imputation, and multilocus association testing. No need to prune SNPs → greater power to detect short segments. The Beagle LD model is computationally efficient.
Beagle model: At each marker location, haplotypes are clustered. The number of clusters can vary, depending on LD structure: approx. 100 clusters in a data set with 2000 individuals. The model is constructed to be Markov (in the haplotype clusters).
IBD Model Part II: Markov model for IBD with two states: 0 or 1 pairs of haplotypes shared IBD between a pair of individuals. Need to check for homozygosity within individuals first. Transition probabilities are specified by the user based on population history.
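A two-state IBD chain of this kind can be sketched as follows. This is an illustration only: the slides say the transition probabilities are user-specified but do not give their form, so the parametrization here (a stationary IBD probability `ibd_prior` and a total switching rate `switch_rate_per_cm`, both hypothetical names) is an assumption borrowed from a standard two-state continuous-time Markov chain, not Beagle's actual settings.

```python
import math


def ibd_transition_matrix(distance_cm: float,
                          ibd_prior: float,
                          switch_rate_per_cm: float) -> list:
    """Transition matrix between adjacent markers for states 0 (non-IBD) and 1 (IBD).

    Assumed parametrization (not from the slides): a two-state
    continuous-time chain whose stationary IBD probability is `ibd_prior`
    and whose total switching rate is `switch_rate_per_cm`.
    Entry [i][j] is P(state j at next marker | state i at this marker).
    """
    decay = math.exp(-switch_rate_per_cm * distance_cm)
    p_enter = ibd_prior * (1.0 - decay)          # 0 -> 1
    p_leave = (1.0 - ibd_prior) * (1.0 - decay)  # 1 -> 0
    return [[1.0 - p_enter, p_enter],
            [p_leave, 1.0 - p_leave]]


if __name__ == "__main__":
    # Example values are arbitrary placeholders.
    for row in ibd_transition_matrix(0.01, ibd_prior=0.0001, switch_rate_per_cm=1.0):
        print(row)
```

With closely spaced markers the matrix stays near the identity, so the chain rarely changes state between adjacent SNPs; at large distances each row approaches the stationary distribution.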
IBD Model Part III: Allow for some genotyping error. It is computationally prohibitive to sum over all possible miscalled genotypes. Instead, allow for IBD when there is no IBS, with a penalty: P(haplotypes | IBD) is multiplied by the error rate if the haplotypes are not IBS at the position. Used error rate = 0.01 or (depending on data quality). This doesn’t correct for the haplotype errors caused by genotype error.
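The penalty described above amounts to a one-line adjustment of the IBD-state likelihood. The sketch below assumes the unpenalized P(haplotypes | IBD) at the marker has already been computed elsewhere (in the talk it comes from the Beagle LD model, which is not reproduced here); the function and argument names are hypothetical.

```python
def ibd_emission(prob_given_ibd: float, is_ibs: bool, error_rate: float = 0.01) -> float:
    """Likelihood of the observed haplotype pair at one marker under the IBD state.

    `prob_given_ibd` is assumed to be supplied by the LD model. If the two
    haplotypes carry different alleles (not IBS), the likelihood is
    multiplied by `error_rate` instead of being set to zero, so a single
    miscalled genotype cannot rule out IBD outright.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be a probability")
    return prob_given_ibd if is_ibs else prob_given_ibd * error_rate


if __name__ == "__main__":
    print(ibd_emission(0.3, is_ibs=True))
    print(ibd_emission(0.3, is_ibs=False))
```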
Estimation: Build the LD model using 10 iterations of stochastic EM. Phasing and IBD detection are simultaneous, so there is no need to worry about getting haplotypes wrong. Calculate IBD probabilities using the forward-backward algorithm for this model. Repeat with 3 restarts of LD model building, then average the IBD probabilities, because the model can get caught in a local maximum, leading to false-positive IBD.
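The forward-backward step and the averaging over restarts can be illustrated on a generic small hidden Markov model. This is a simplified sketch, not the method from the talk: the actual model runs over haplotype clusters jointly with the IBD state, whereas this version takes per-marker state likelihoods and transition matrices as given inputs. All names are hypothetical, and at least one state is assumed to have a positive likelihood at every marker.

```python
def forward_backward(initial, transitions, emissions):
    """Posterior state probabilities at each position of a discrete HMM.

    initial:     list of K starting probabilities.
    transitions: list of T-1 K-by-K matrices; transitions[t][i][j] is
                 P(state j at t+1 | state i at t).
    emissions:   list of T lists of K likelihoods of the data in each state.
    Forward and backward vectors are renormalized at every step to avoid
    underflow; this does not change the posteriors.
    """
    n_steps, k = len(emissions), len(initial)

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass.
    alpha = [normalize([initial[s] * emissions[0][s] for s in range(k)])]
    for t in range(1, n_steps):
        alpha.append(normalize([
            emissions[t][j] * sum(alpha[t - 1][i] * transitions[t - 1][i][j]
                                  for i in range(k))
            for j in range(k)]))

    # Backward pass.
    beta = [[1.0] * k for _ in range(n_steps)]
    for t in range(n_steps - 2, -1, -1):
        beta[t] = normalize([
            sum(transitions[t][i][j] * emissions[t + 1][j] * beta[t + 1][j]
                for j in range(k))
            for i in range(k)])

    # Combine into posteriors.
    return [normalize([alpha[t][s] * beta[t][s] for s in range(k)])
            for t in range(n_steps)]


def average_ibd_probabilities(runs):
    """Average per-marker IBD probabilities across independent restarts."""
    return [sum(values) / len(values) for values in zip(*runs)]
```

Averaging across restarts dampens an IBD call that appears in only one run, which is the situation a local maximum of the LD model would produce.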
Threshold for IBD: We use a threshold of 0.99 on posterior IBD probability. The length of an IBD region is defined as the distance over which the IBD probability is > 0.5, but the IBD probability must be ≥ 0.99 somewhere in the region. [Figure: IBD probability along the chromosome, with the IBD region marked.]
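The two-threshold rule above can be sketched as a single pass over the per-marker probabilities. This is a minimal illustration with hypothetical names; it assumes region endpoints are placed at marker positions, with no interpolation between markers, which the slides do not specify.

```python
def call_ibd_regions(positions_cm, ibd_probs,
                     extend_threshold=0.5, call_threshold=0.99):
    """Return (start, end) positions of called IBD regions.

    A region is a maximal run of markers with IBD probability above
    `extend_threshold`; it is reported only if the probability reaches
    `call_threshold` at some marker inside the run. Endpoints are marker
    positions (an assumption of this sketch).
    """
    regions = []
    start = end = None
    peak = 0.0
    for pos, prob in zip(positions_cm, ibd_probs):
        if prob > extend_threshold:
            if start is None:
                start, peak = pos, prob
            else:
                peak = max(peak, prob)
            end = pos
        else:
            if start is not None and peak >= call_threshold:
                regions.append((start, end))
            start = None
    # Close a run that extends to the last marker.
    if start is not None and peak >= call_threshold:
        regions.append((start, end))
    return regions


if __name__ == "__main__":
    positions = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    probs = [0.1, 0.6, 0.995, 0.7, 0.2, 0.8, 0.1]
    print(call_ibd_regions(positions, probs))
```

In the example, the run around position 5 exceeds 0.5 but never reaches 0.99, so it is not reported.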
Data: 1958 British Birth Cohort (1958BC). Genotyped on the Illumina 550K platform (Sanger) and Affymetrix 500K (WTCCC). Genotypes re-called by Beagle (using LD) to improve accuracy. 1400 individuals.
Detection of IBD – 1958BC: Chromosome 22, non-monomorphic markers. Illumina: 8407 SNPs; Affymetrix: 5098 SNPs. In 40,000 random pairs we found: Illumina: 54 IBD regions (lengths 0.52 – 12.5 cM); Affymetrix: 19 IBD regions (lengths 2.1 – 12.1 cM); 58 regions in total. For the 4 regions found by Affymetrix but not by Illumina, Illumina had IBD probability ≥ 0.92. Various regions are shown on the next 3 slides.
0.5 cM region. [Figure: Illumina = solid black line; Affymetrix = dashed blue line.]
Conclusions: New, very dense genotype data provide a new opportunity to detect small IBD regions. Detection of short IBD regions will play an important role in various genetic analyses. Computation is challenging. Need a pre-filter?