Variant calling: number of individuals vs. depth of read coverage
Gabor T. Marth
Boston College Biology Department
1000 Genomes Meeting, Cold Spring Harbor Laboratory
May
Single-base variant calling in 1000G data
1. SNP discovery (for potential follow-up genotyping)
2. Possibly using genotypes called directly from sequence for haplotype phasing (genotype imputation?)
Sample size x read coverage per individual = constant
What is the best sample size?
Not easy to answer from idealized theoretical considerations alone
Simulation studies must model many effects to be realistic
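The trade-off "sample size x read coverage per individual = constant" can be illustrated with a deliberately idealized calculation, of the kind the slide warns is not sufficient alone. The sketch below is not from the talk. It assumes Hardy-Weinberg genotype frequencies, Poisson-distributed read depth, no sequencing errors, and a hypothetical rule that a SNP is discovered when a single individual shows at least `min_alt_reads` reads carrying the alternate allele. The budget `TOTAL_COVERAGE`, the allele frequency, and all function names are illustrative assumptions.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf


def discovery_probability(n_individuals, coverage, allele_freq, min_alt_reads=2):
    """Idealized chance that at least one individual reveals the alternate allele.

    Assumptions (not from the slides): Hardy-Weinberg genotypes, Poisson read
    depth, error-free reads, and no pooling of evidence across individuals.
    """
    f = allele_freq
    p_het = 2.0 * f * (1.0 - f)
    p_hom_alt = f * f
    # In a heterozygote only about half of the reads carry the alternate allele.
    p_individual = (p_het * poisson_sf(min_alt_reads, coverage / 2.0)
                    + p_hom_alt * poisson_sf(min_alt_reads, coverage))
    return 1.0 - (1.0 - p_individual) ** n_individuals


TOTAL_COVERAGE = 1600  # hypothetical fixed budget: individuals x coverage

for n in (100, 200, 400, 800):
    c = TOTAL_COVERAGE / n
    p = discovery_probability(n, c, allele_freq=0.01)
    print(f"{n:4d} individuals at {c:4.1f}x: P(discovery) = {p:.3f}")
```

Under these assumptions more individuals at lower coverage help up to a point, after which too few reads per individual carry the allele. A realistic answer also depends on base errors, mapping, and joint calling across samples, which is why the simulation study is needed.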
Variant discovery is a complex process
[Figure: sequence reads (e.g. aacgtCaggct / aacgtTaggct) derived from fragments of individual samples drawn from a population. Labeled components: genotype priors, allele sampling likelihoods, base error probabilities. Individual genotypes: G1, G2, G3.]
Bayesian variant detection math
Priors: (1) nucleotide diversity; (2) allele frequency distribution; (3) specific diploid genotype layout
Allele sampling likelihoods: binomial distribution of the number of reads from each of the two chromosomes
Base error probabilities: likelihood that the called base faithfully represents the DNA fragment, calculated from the base quality values
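The allele sampling and base error terms can be combined into a per-individual genotype likelihood. The sketch below is a generic textbook-style model, not necessarily the computation used in the talk. It assumes Phred-scaled base qualities (error probability 10^(-Q/10)), a biallelic site, errors spread evenly over the three other bases, and each read drawn from either chromosome with probability 1/2 (the per-read form of binomial allele sampling). The function names and example reads are hypothetical.

```python
def phred_to_error(q):
    """Convert a Phred-scaled base quality to an error probability."""
    return 10.0 ** (-q / 10.0)


def genotype_likelihoods(reads, ref, alt):
    """Return P(reads | genotype) for genotypes with 0, 1 or 2 alt alleles.

    reads: list of (called_base, phred_quality) tuples covering one site.
    Assumption: a miscalled base is equally likely to be any of the other three.
    """
    likelihoods = {}
    for n_alt in (0, 1, 2):
        # Probability that a read was sampled from an alt-carrying chromosome.
        p_alt_chrom = n_alt / 2.0
        likelihood = 1.0
        for base, q in reads:
            e = phred_to_error(q)
            p_if_alt = (1.0 - e) if base == alt else e / 3.0
            p_if_ref = (1.0 - e) if base == ref else e / 3.0
            likelihood *= p_alt_chrom * p_if_alt + (1.0 - p_alt_chrom) * p_if_ref
        likelihoods[n_alt] = likelihood
    return likelihoods


# Hypothetical pileup at one site: two reads show C, two show T.
example_reads = [("C", 30), ("T", 30), ("C", 20), ("T", 25)]
for n_alt, lk in genotype_likelihoods(example_reads, ref="C", alt="T").items():
    print(f"{n_alt} alt alleles: likelihood = {lk:.3e}")
```

These likelihoods are only one factor. The posterior genotype probability also needs the priors listed on the slide.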
SNP calling and genotyping
P(SNP) = total probability of all non-monomorphic genotype combinations
P(Gi) = marginal probability of the genotype of individual i
Consequence: data from other individuals influence the genotype call of a given individual
(Include illustration using the testProb program in the GigaBayes package.)
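To make the "consequence" concrete, here is a small brute-force sketch that enumerates every genotype combination for a handful of individuals. It is not the testProb program and does not claim to reproduce GigaBayes. The joint prior is an assumption chosen to mirror the three prior components named earlier: a diversity parameter `theta`, a 1/k prior on the alternate allele count k, and a uniform prior over chromosome-level layouts of those k alleles. The genotype likelihoods passed in are made-up numbers.

```python
from itertools import product
from math import comb


def joint_posterior(likelihoods, theta=0.001):
    """Posterior over all genotype combinations (0, 1 or 2 alt alleles each).

    likelihoods: one [L(0), L(1), L(2)] list per individual.
    The prior used here is an illustrative assumption, not the talk's formula.
    """
    n = len(likelihoods)
    n_chrom = 2 * n
    # Assumed allele-count prior: theta / k for k >= 1, remainder on k = 0.
    count_prior = {k: theta / k for k in range(1, n_chrom + 1)}
    count_prior[0] = 1.0 - sum(count_prior.values())
    posterior = {}
    for combo in product((0, 1, 2), repeat=n):
        k = sum(combo)
        n_het = combo.count(1)
        # Each heterozygote can be laid out on its two chromosomes in 2 ways.
        prior = count_prior[k] * 2 ** n_het / comb(n_chrom, k)
        like = 1.0
        for genotype, lk in zip(combo, likelihoods):
            like *= lk[genotype]
        posterior[combo] = prior * like
    total = sum(posterior.values())
    return {combo: p / total for combo, p in posterior.items()}


def p_snp(posterior):
    """Total probability of all non-monomorphic genotype combinations."""
    n = len(next(iter(posterior)))
    return 1.0 - posterior[(0,) * n] - posterior[(2,) * n]


def marginal(posterior, i):
    """Marginal genotype probabilities [P(0), P(1), P(2)] for individual i."""
    out = [0.0, 0.0, 0.0]
    for combo, p in posterior.items():
        out[combo[i]] += p
    return out


ambiguous = [0.4, 0.6, 0.001]      # weak evidence in individual 0
clear_ref = [1.0, 1e-6, 1e-12]     # others look homozygous reference
clear_het = [1e-6, 1.0, 1e-6]      # others look heterozygous

for label, others in (("others hom-ref", clear_ref), ("others het", clear_het)):
    post = joint_posterior([ambiguous, others, others])
    print(label, "P(SNP) =", round(p_snp(post), 4),
          "P(G0 = het) =", round(marginal(post, 0)[1], 4))
```

With identical data for individual 0, its heterozygous call becomes far more probable when the other individuals carry the allele, because the joint prior couples the genotypes. Full enumeration scales as 3^n, so this sketch is only practical for a few individuals.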
Variant calling in simulated data: design
Analysis by Aaron Quinlan (see poster at the Genome Meeting)
Estimated vs. population allele frequency
Allele frequency (cont’d)
SNP discovery sensitivity
Genotype density
[Table: genotype density by read coverage per individual, starting at 16x; values not recovered]
Genotype density
Summary / Conclusions
Thanks