Presentation on theme: "Ruibin Xi Peking University School of Mathematical Sciences"— Presentation transcript:
1 Biostatistics, Lecture 17: Single nucleotide polymorphism detection, an introduction
Ruibin Xi, Peking University, School of Mathematical Sciences
2 SNPs vs. SNVs
Really a matter of frequency of occurrence. Both are concerned with aberrations at a single nucleotide.
- SNP (Single Nucleotide Polymorphism)
  - Aberration expected at the position for any member of the species (well characterized)
  - Occurs in the population at some frequency, so it is expected at a given locus
  - Catalogued in dbSNP
- SNV (Single Nucleotide Variant)
  - Aberration seen in only a few individuals (not well characterized)
  - Occurs at low frequency, so it is not common
  - May be associated with certain diseases
3 SNV types of interest
- Non-synonymous mutations
  - Impact on protein sequence
  - Results in amino acid change
  - Missense and nonsense mutations
- Somatic mutations in cancer
  - Tumor-specific mutations in tumor-normal pairs
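The distinction between synonymous, missense, and nonsense changes can be illustrated with a small sketch. It assumes the standard genetic code and that the reference and alternate codons are already known; the function name `classify_snv` and the extra "stop-loss" label are illustrative choices, not part of the lecture.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order
# at each of the three positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snv(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # no amino acid change
    if alt_aa == "*":
        return "nonsense"     # premature stop codon
    if ref_aa == "*":
        return "stop-loss"    # illustrative extra category (assumption)
    return "missense"         # one amino acid replaced by another


if __name__ == "__main__":
    print(classify_snv("AAT", "GAT"))  # Asn -> Asp
    print(classify_snv("TGG", "TGA"))  # Trp -> stop
    print(classify_snv("GGT", "GGC"))  # Gly -> Gly
```

Missense and nonsense changes are the non-synonymous cases; a real annotation tool also has to handle strand, transcript choice, and splice sites, which this sketch ignores.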
4 Catalogs of human genetic variation
- The 1000 Genomes Project
  - SNPs and structural variants
  - Genomes of about 2500 unidentified people from about 25 populations around the world will be sequenced using NGS technologies
- HapMap
  - Identify and catalog genetic similarities and differences
- dbSNP
  - Database of SNPs and multiple small-scale variations that include indels, microsatellites, and non-polymorphic variants
- COSMIC
  - Catalogue of Somatic Mutations in Cancer
5 A framework for variation discovery
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5) (2011).
6 A framework for variation discovery
Phase 1: Mapping
- Place reads with an initial alignment on the reference genome using mapping algorithms
- Refine initial alignments
  - Local realignment around indels
  - Molecular duplicates are eliminated
- Generate output in the technology-independent SAM/BAM alignment map format
- Accurate mapping is crucial for variation discovery
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5) (2011).
7 Remove duplicates
- Remove potential PCR duplicates, which arise from the PCR amplification step in library prep
- If multiple read pairs have identical external coordinates, retain only the pair with the highest mapping quality
- Duplicates show up as inflated read depth support, which impacts variant calling
- Software: SAMtools (rmdup) or Picard tools (MarkDuplicates)
Figure: Human HapMap individual NA, chr20: false SNP
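The retention rule above can be sketched in a few lines. This is a simplified illustration, not the SAMtools or Picard implementation: the `ReadPair` record and its fields are hypothetical, and real tools also take strand and orientation into account.

```python
from typing import Iterable, List, NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical minimal record for one aligned read pair."""
    name: str
    chrom: str
    start: int   # leftmost external coordinate of the pair
    end: int     # rightmost external coordinate of the pair
    mapq: int    # mapping quality of the pair


def remove_duplicates(pairs: Iterable[ReadPair]) -> List[ReadPair]:
    """Keep, for each set of identical external coordinates,
    only the pair with the highest mapping quality."""
    best = {}
    for pair in pairs:
        key = (pair.chrom, pair.start, pair.end)
        if key not in best or pair.mapq > best[key].mapq:
            best[key] = pair
    # Return survivors in coordinate order.
    return sorted(best.values(), key=lambda p: (p.chrom, p.start, p.end))


if __name__ == "__main__":
    pairs = [
        ReadPair("a", "chr20", 100, 300, 30),
        ReadPair("b", "chr20", 100, 300, 60),  # duplicate of "a", better MAPQ
        ReadPair("c", "chr20", 150, 350, 20),
    ]
    for p in remove_duplicates(pairs):
        print(p.name, p.start, p.end, p.mapq)
```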
10 Local realignment
- Create local haplotypes
- For each haplotype Hi, align the reads to Hi and compute a score for Hi
- Find the best haplotype Hi, then realign all reads against just Hi and H0 (the reference haplotype)
- All reads are realigned if the log LR is > 5
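The scoring formula is not reproduced on the slide. The sketch below assumes the per-base model described in DePristo et al. (2011): the likelihood of a read given a haplotype is the product over bases of the error probability on a mismatch and one minus the error probability on a match, and each read is assigned to whichever of H0 or Hi fits it better. The fixed ungapped offsets, the use of log10, and all function names are assumptions made for illustration.

```python
import math


def read_log10_likelihood(read, quals, haplotype, offset):
    """log10 L(read | haplotype) for an ungapped placement at `offset`.
    Assumes every Phred quality is > 0."""
    total = 0.0
    for k, (base, q) in enumerate(zip(read, quals)):
        err = 10 ** (-q / 10)  # Phred quality -> error probability
        is_match = haplotype[offset + k] == base
        total += math.log10(1 - err if is_match else err)
    return total


def realignment_lod(reads, ref_hap, alt_hap):
    """Log odds of the two-haplotype model (H0 + Hi) against H0 alone.
    `reads` holds (sequence, qualities, offset_on_ref, offset_on_alt)."""
    ref_only = 0.0
    two_hap = 0.0
    for seq, quals, off_ref, off_alt in reads:
        l_ref = read_log10_likelihood(seq, quals, ref_hap, off_ref)
        l_alt = read_log10_likelihood(seq, quals, alt_hap, off_alt)
        ref_only += l_ref
        two_hap += max(l_ref, l_alt)  # read goes to its better haplotype
    return two_hap - ref_only


if __name__ == "__main__":
    ref = "ACGTACGTAC"
    alt = "ACGTCGTAC"  # same region with one base deleted
    reads = [("GTCGTA", [30] * 6, 2, 2)]
    lod = realignment_lod(reads, ref, alt)
    print("realign" if lod > 5 else "keep original alignment")
```

In this toy case the single read spans the deletion, so forcing it onto the reference costs four mismatches and the log odds exceed the threshold of 5.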
11 A framework for variation discovery
Phase 2: Discovery of raw variants
- Analysis-ready SAM/BAM files are analyzed to discover all sites with statistical evidence for an alternate allele present among the samples
- SNPs, SNVs, short indels, and SVs
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5) (2011).
12 A framework for variation discovery
Phase 3: Discovery of analysis-ready variants
- Technical covariates, known sites of variation, genotypes for individuals, linkage disequilibrium, and family and population structure are integrated with the raw variant calls from Phase 2 to separate true polymorphic sites from machine artifacts
- At these sites, high-quality genotypes are determined for all samples
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5) (2011).
13 SNV Filtering
- Sufficient depth of read coverage
- SNV present in a given number of reads
- High mapping and SNV quality
- SNV density in a given bp window
- SNV greater than a given bp from a predicted indel
- Strand balance/bias
Bentley, D.R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53–59 (2008).
Wheeler, D.A. et al. The complete genome of an individual by massively parallel DNA sequencing. Nature 452, 872–876 (2008).
Larson, D.E. et al. SomaticSniper: Identification of Somatic Point Mutations in Whole Genome Sequencing Data. Bioinformatics Advance Access (2011).
15 SomaticSniper: somatic detection filter
Filter using SAMtools (Li et al., 2009) calls from the tumor. Sites are retained if they meet all of the following rules:
- Site is greater than 10 bp from a predicted indel of quality ≥ 50
- Maximum mapping quality at the site is ≥ 40
- < 3 SNV calls in a 10 bp window around the site
- Site is covered by ≥ 3 reads
- Consensus quality ≥ 20
- SNP quality ≥ 20
SomaticSniper predictions passing the filters are then intersected with calls from dbSNP, and sites matching both the position and allele of known dbSNPs are removed.
Li, H. et al. The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25 (2009).
Larson, D.E. et al. SomaticSniper: Identification of Somatic Point Mutations in Whole Genome Sequencing Data. Bioinformatics Advance Access (2011).
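The rules above translate directly into a predicate. The following is a minimal sketch, not SomaticSniper's own code: the `Site` record, its field names, and the representation of dbSNP as a set of (chromosome, position, allele) tuples are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass
class Site:
    """Hypothetical per-site summary taken from the tumor calls."""
    chrom: str
    pos: int
    alt: str
    indel_distance: Optional[int]  # bp to nearest indel of quality >= 50; None if no such indel
    max_mapq: int                  # maximum mapping quality at the site
    snvs_in_window: int            # SNV calls in a 10 bp window around the site
    depth: int                     # number of reads covering the site
    consensus_qual: int
    snp_qual: int


def passes_filter(site: Site) -> bool:
    """True only if the site meets every rule listed on the slide."""
    far_from_indel = site.indel_distance is None or site.indel_distance > 10
    return (
        far_from_indel
        and site.max_mapq >= 40
        and site.snvs_in_window < 3
        and site.depth >= 3
        and site.consensus_qual >= 20
        and site.snp_qual >= 20
    )


def filter_somatic_calls(
    sites: Iterable[Site], dbsnp: Set[Tuple[str, int, str]]
) -> List[Site]:
    """Apply the filters, then drop sites matching a known dbSNP
    entry in both position and allele."""
    return [
        s for s in sites
        if passes_filter(s) and (s.chrom, s.pos, s.alt) not in dbsnp
    ]
```

A call is kept only when all six conditions hold, so a single failing criterion (for example, a depth of 2) is enough to discard the site.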
16 Variant calling methods
- > 15 different algorithms
- Two categories
  - Heuristic approach
    - Based on thresholds for read depth, base quality, variant allele frequency, statistical significance
  - Probabilistic methods, e.g. Bayesian model
    - To quantify statistical uncertainty
    - Assign priors based on observed allele frequency of multiple samples
Example genotypes at one site (slide labels: SNP, variant): Ref A; Ind1 G/G; Ind2 A/G
Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. Jun;12(6).
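A heuristic caller of the kind described above can be sketched as a chain of threshold checks. The specific cutoffs below (base quality 20, depth 8, 2 supporting reads, allele frequency 0.2) are illustrative assumptions only and do not come from the lecture or from any particular tool.

```python
from collections import Counter


def heuristic_call(ref_base, pileup, min_base_qual=20, min_depth=8,
                   min_alt_reads=2, min_vaf=0.2):
    """Threshold-based call at one site.
    `pileup` is a list of (base, base_quality) observations.
    Returns (alt_base, variant_allele_frequency) or None."""
    # Count only bases that pass the base-quality threshold.
    counts = Counter(base for base, qual in pileup if qual >= min_base_qual)
    depth = sum(counts.values())
    if depth < min_depth:
        return None  # not enough usable coverage
    alt_counts = {b: n for b, n in counts.items() if b != ref_base}
    if not alt_counts:
        return None  # only reference bases observed
    alt, n_alt = max(alt_counts.items(), key=lambda item: item[1])
    vaf = n_alt / depth
    if n_alt >= min_alt_reads and vaf >= min_vaf:
        return alt, vaf
    return None


if __name__ == "__main__":
    pileup = [("A", 30)] * 6 + [("G", 30)] * 4
    print(heuristic_call("A", pileup))
```

Unlike the probabilistic methods, this approach gives a yes/no answer with no measure of uncertainty, which is the main motivation for the Bayesian callers.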
17 Variant callers
Name | Category | Tumor/Normal Pairs | Metric | Reference
SOAPsnp | Bayesian | No | Phred QUAL | Li et al. (2009)
JointSNVMix (Fisher) | Probability model | Yes | Somatic probability | Roth, A. et al. (2012)
SomaticSniper | Heuristic | | Somatic Score | Larson, D.E. et al. (2012)
VarScan 2 | | | p-value | Koboldt, D. et al. (2012)
GATK | | | | DePristo, M.A. et al. (2011)
Li, R. et al. SNP detection for massively parallel whole-genome resequencing. Genome Res. 19, 1124–1132 (2009).
Roth, A. et al. JointSNVMix: A Probabilistic Model For Accurate Detection Of Somatic Mutations In Normal/Tumour Paired Next Generation Sequencing Data. Bioinformatics (2012).
Larson, D.E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28(3):311-7 (2012).
Koboldt, D. et al. VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research (2012).
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5) (2011).
18 Algorithm-SOAPsnp
- Given a genotype Ti, by the Bayes rule
- For haploid genome
- For diploid genome
- For diploid genome, given a set of observed alleles at a locus
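The formulas for this slide are not reproduced here, so the following is only a sketch of the general idea under stated assumptions: the posterior of a genotype is proportional to its prior times the likelihood of the observed alleles; for a diploid genotype each observed base is equally likely to come from either allele; and a base with Phred quality Q is wrong with probability 10^(-Q/10), spread evenly over the three other bases. SOAPsnp itself uses a calibrated likelihood and informed priors; the uniform prior and all names below are illustrative.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def allele_likelihood(obs, qual, allele):
    """P(observed base | true allele). Assumes qual > 0."""
    err = 10 ** (-qual / 10)
    return 1 - err if obs == allele else err / 3


def genotype_posteriors(pileup, priors=None):
    """Posterior over the 10 diploid genotypes at one locus.
    `pileup` is a list of (observed_base, base_quality)."""
    genotypes = list(combinations_with_replacement(BASES, 2))
    if priors is None:
        # Uniform prior: an assumption made for this sketch only.
        priors = {g: 1 / len(genotypes) for g in genotypes}
    log_joint = {}
    for g in genotypes:
        total = math.log(priors[g])
        for obs, qual in pileup:
            # Each read samples one of the two alleles with probability 1/2.
            total += math.log(
                0.5 * allele_likelihood(obs, qual, g[0])
                + 0.5 * allele_likelihood(obs, qual, g[1])
            )
        log_joint[g] = total
    # Normalize in log space (Bayes' rule denominator).
    top = max(log_joint.values())
    unnorm = {g: math.exp(v - top) for g, v in log_joint.items()}
    z = sum(unnorm.values())
    return {g: v / z for g, v in unnorm.items()}


if __name__ == "__main__":
    post = genotype_posteriors([("A", 30)] * 5 + [("G", 30)] * 5)
    best = max(post, key=post.get)
    print(best, round(post[best], 4))
```

The haploid case is the same calculation restricted to the four single-allele genotypes.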
19 Algorithm-JointSNVMix
JointSNVMix (Fisher’s Exact Test)
- Allele count data from the normal and tumor are compared using a two-tailed Fisher’s exact test
- If the counts are significantly different, the position is labeled as a variant position (e.g., p-value < 0.001)
2x2 contingency table (G6PC2, hg19 chr2, A>G, Asn286Asp):
 | REF allele | ALT allele | Total
Tumor | 15 | 16 | 31
Normal | 25 | 0 | 25
Totals | 40 | 16 | 56
By the two-tailed Fisher’s exact test, the association between rows (groups) and columns (outcomes) is considered to be extremely statistically significant.
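The test on this slide can be reproduced with the standard library alone. The sketch below implements a two-tailed Fisher's exact test by summing the hypergeometric probabilities of all tables that are no more likely than the observed one. It assumes the tumor counts are 15 REF and 16 ALT and the normal counts are 25 REF and 0 ALT; the normal ALT count of 0 is inferred from the row and column totals rather than read directly from the slide.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # with all margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum every table at least as extreme (no more probable) than observed;
    # the small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


if __name__ == "__main__":
    # Rows: tumor, normal. Columns: REF allele, ALT allele.
    p = fisher_exact_two_tailed(15, 16, 25, 0)
    print(f"p = {p:.3g}")
    print("variant position" if p < 0.001 else "not significant")
```

With these counts the p-value falls well below the example cutoff of 0.001, so the position would be labeled as a variant.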
21 How many variants will I find?
Samples compared to the reference genome:
- HiSeq: whole genome; mean coverage 60; HapMap individual NA12878
- Exome: Agilent capture; mean coverage 20; HapMap individual NA12878
DePristo, M.A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5) (2011).
22 Variant Annotation
- SeattleSeq
  - Annotation of known and novel SNPs
  - Includes dbSNP rs ID, gene names and accession numbers, SNP functions (e.g. missense), protein positions and amino-acid changes, conservation scores, HapMap frequencies, PolyPhen predictions, and clinical association
- ANNOVAR
  - Gene-based annotation
  - Region-based annotation
  - Filter-based annotation