SNP Discovery in Whole-Genome Light-Shotgun 454 Pyrosequences Aaron Quinlan 1, Andrew Clark 2, Elaine Mardis 3, Gabor Marth 1 (1) Department of Biology,

Presentation transcript:

SNP Discovery in Whole-Genome Light-Shotgun 454 Pyrosequences. Aaron Quinlan (1), Andrew Clark (2), Elaine Mardis (3), Gabor Marth (1). (1) Department of Biology, Boston College; (2) Departments of Molecular Biology and Genetics, Cornell University; (3) Departments of Genetics and Molecular Microbiology, Washington University. AGBT, Marco Island, FL, February 9, 2007.

454 machines have been proven for several applications: genome sequencing, microRNA discovery, and mutation detection in cancer tissue.

454 machines trade off throughput with read length. [Figure: bases per run (1 Mb to 1 Gb) against read length (10 bp to 1,000 bp).]

454 shotgun reads for SNP discovery. [Figure: genome size against bases per run; axis ticks at 1 Mb, 10 Mb, 100 Mb, 1 Gb, and 10 Gb.] For 100 Mb genomes, a few 454 runs produce ~1x coverage. At ~1x the genome is fairly densely covered; still, most 454 reads align as singletons.
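The coverage statement above can be illustrated with a rough calculation. The sketch below assumes an idealized Poisson model of read placement (uniform, independent reads), which is an assumption for illustration and not something stated on the slide; the function name is hypothetical.

```python
import math

def poisson_depth_fractions(coverage):
    """Return (fraction of bases covered at least once,
    fraction of covered bases that have depth exactly 1),
    assuming read depth per base is Poisson with mean `coverage`."""
    p_depth0 = math.exp(-coverage)             # base not covered at all
    p_depth1 = coverage * math.exp(-coverage)  # base covered by exactly one read
    covered = 1.0 - p_depth0
    return covered, p_depth1 / covered

covered, single = poisson_depth_fractions(1.0)
print(f"covered at least once: {covered:.3f}")
print(f"depth exactly 1 among covered bases: {single:.3f}")
```

Under this simplified model, a bit under two thirds of bases are covered at 1x, and more than half of the covered bases are seen in only one read.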

Are single-coverage 454 reads resulting from light-shotgun sequencing accurate enough for SNP discovery? Data: the melanogaster reference genome sequence (iso-1 strain) and 454 shotgun reads from an African melanogaster isolate (strain id 46-2). African melanogaster strain courtesy of Dr. Charles Langley, UC Davis; 454 sequencing at the Washington University Genome Sequencing Center.

Steps of SNP discovery: sequence clustering and organization; multiple fragment alignment; SNP detection; paralog identification.

SNP discovery in capillary traces hinges on base quality. In Sanger-principle capillary sequences the number of bases is generally well resolved, and most errors come from substitutions, i.e. calling the wrong base. Substitution errors are well described by the PHRED base quality values, allowing us to distinguish between sequencing error and true polymorphism, and to detect and score candidate SNPs.
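For reference, a PHRED quality value Q encodes a base-call error probability P through Q = -10 log10(P). A minimal sketch of the conversion (the function names are illustrative, not from any particular tool):

```python
import math

def phred_to_error_prob(q):
    """Convert a PHRED quality value to the probability that the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert a base-call error probability to a PHRED quality value."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

A quality of 20 thus corresponds to a 1% chance of a wrong base, and 30 to 0.1%.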

Most 454 errors are over-calls or under-calls. In 454 reads the identity of the nucleotide is usually accurate, but the number of bases is often unclear. Errors don't necessarily occur in "low quality" regions of the read, and PHRED base quality values do not describe over- and under-call errors.

How many bases were incorporated? [Figure: nucleotide incorporation tests and the resulting light signal.] The number of bases in a mono-nucleotide run has to be inferred from the signal intensity, but this inference is often not trivial: a signal is also produced when, in fact, no nucleotide is incorporated, and signal intensity is variable for a given number of incorporated bases.

The base number probabilities. [Figure: histogram of observed signal intensities for different numbers of actually incorporated bases.] Conversely, for a given signal intensity (e.g. 1.5), the true number of incorporated nucleotides is either 1 or 2 (and sometimes even 3 or 0). Our base caller calculates and reports the base number probabilities, i.e. the (posterior) probability that, given the observed incorporation signal, 0, 1, 2, etc. bases were incorporated, e.g. P(0C), P(1C), P(2C), etc. These base number probabilities address under- and over-calls and replace the PHRED base quality values for 454 reads.
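To make the idea of base number probabilities concrete, here is a toy sketch. It assumes a Gaussian signal model centered on the true base count with a fixed spread, and a flat prior over counts; both are illustrative assumptions, and the actual base caller works from the observed signal histograms rather than this simplified model. The function name and the parameters sigma and max_n are hypothetical.

```python
import math

def base_number_posteriors(signal, max_n=4, sigma=0.3, priors=None):
    """Posterior P(n bases | signal) for n = 0..max_n.

    Assumption: the signal for n incorporated bases is Gaussian with
    mean n and standard deviation sigma. The prior defaults to flat.
    """
    counts = range(max_n + 1)
    if priors is None:
        priors = [1.0 / (max_n + 1)] * (max_n + 1)
    likelihoods = [math.exp(-0.5 * ((signal - n) / sigma) ** 2) for n in counts]
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]

# A signal of 1.5 is ambiguous between 1 and 2 incorporated bases.
for n, p in enumerate(base_number_posteriors(1.5)):
    print(f"P({n} bases) = {p:.6f}")
```

With these toy settings, nearly all the posterior mass is split evenly between 1 and 2 bases, with a very small remainder on 0 and 3.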

PyroBayes – our 454 base caller
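The slide itself does not give the equation. One generic way to write the posterior base number probability described on the previous slide is Bayes' rule, where s is the observed incorporation signal and n the number of incorporated bases. The symbols s, n, and k are introduced here for illustration, and this is the textbook form, not necessarily the exact formulation used in PyroBayes:

```latex
P(n \mid s) = \frac{P(s \mid n)\, P(n)}{\sum_{k \ge 0} P(s \mid k)\, P(k)}
```

Here P(s | n) is the data likelihood (how often signal s is observed when n bases were actually incorporated) and P(n) is the prior probability of a run of n bases.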

Mapping / sequence alignment. A simple BLAT approach is used to map 454 reads. Example: 454 read ACGACAGGGATGCGTGGGA aligned to reference TTGATGACTAGTAACGACAGGGACGCGTGGGAAGGTTAGTACCGTAC. Unique pair-wise alignments are kept; 454 reads that align to multiple locations in the genome (paralogous sequences) are removed.
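The uniqueness filter described above can be sketched as follows. The input format (a list of read id, chromosome, start tuples) and the function name are assumptions made for illustration; this does not parse real BLAT output.

```python
from collections import defaultdict

def keep_unique_alignments(hits):
    """Keep only reads that align to exactly one genomic location.

    hits: iterable of (read_id, chrom, start) tuples, one per alignment.
    Returns a dict mapping each uniquely aligned read_id to (chrom, start).
    """
    by_read = defaultdict(list)
    for read_id, chrom, start in hits:
        by_read[read_id].append((chrom, start))
    # Reads with more than one hit are treated as paralogous and dropped.
    return {rid: locs[0] for rid, locs in by_read.items() if len(locs) == 1}

hits = [
    ("read1", "chr2L", 10_500),
    ("read2", "chr3R", 88_000),
    ("read2", "chrX", 12_345),   # second hit: read2 is discarded
    ("read3", "chr2R", 400),
]
print(keep_unique_alignments(hits))
```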

SNP calling for 454 reads. Example: 454 read ACGACAGGGATGCGTGGGA vs. genome reference ACGACAGGGACGCGTGGGA. Given an apparent mismatch between the genome reference sequence (C allele) and the 454 read (T allele), we have to consider the possibility that: the genome reference allele (C) is wrong and, in fact, the reference allele is T (assessed from the PHRAP base quality value); or the 454 allele (T) is the result of an over-call, and one of the C nucleotide tests just before or after was an under-call. To evaluate sequence differences we use the base number probabilities; P(0C) would not be available from PHRED. The result is a SNP probability score that our SNP caller reports.
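As a rough illustration of how the two error sources could be weighed against a true polymorphism, the toy calculation below combines a reference error probability and a read error probability with a prior SNP rate using Bayes' rule. This is a deliberately simplified sketch and not the scoring used by the authors' SNP caller; the function name, the two-hypothesis model, and the prior value are all assumptions.

```python
def snp_posterior(p_ref_error, p_read_error, prior_snp):
    """Toy posterior probability that an observed mismatch is a true SNP.

    p_ref_error:  probability the reference base is wrong (e.g. from its quality value)
    p_read_error: probability the 454 allele is an over-/under-call artifact
    prior_snp:    assumed prior probability of a SNP at this site (hypothetical)
    """
    # A true SNP shows up as a mismatch when both calls are correct.
    like_snp = (1 - p_ref_error) * (1 - p_read_error)
    # Without a SNP, a mismatch requires at least one of the two calls to be wrong.
    like_no_snp = 1 - (1 - p_ref_error) * (1 - p_read_error)
    numerator = prior_snp * like_snp
    return numerator / (numerator + (1 - prior_snp) * like_no_snp)

# A confident read call supports the SNP far more than a doubtful one.
print(snp_posterior(1e-4, 1e-4, prior_snp=0.005))
print(snp_posterior(1e-4, 1e-1, prior_snp=0.005))
```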

The SNP discovery pipeline: 454 base calling (341,600 reads called); read mapping (220,121 reads uniquely mapped); SNP calling + thresholding on Pr(C/T) (41,265 candidate SNPs). [Figure: example read ACGACAAGGCGTGGGA followed through base calling, mapping to the reference, and SNP calling.]
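The yield at each pipeline stage can be read off the three counts on this slide. The short calculation below uses only those numbers; the variable names are illustrative.

```python
reads_called = 341_600
reads_uniquely_mapped = 220_121
candidate_snps = 41_265

# Fraction of called reads that survive the unique-mapping step.
mapped_fraction = reads_uniquely_mapped / reads_called
# Candidate SNPs per uniquely mapped read.
snps_per_mapped_read = candidate_snps / reads_uniquely_mapped

print(f"uniquely mapped: {mapped_fraction:.1%} of called reads")
print(f"candidate SNPs per uniquely mapped read: {snps_per_mapped_read:.3f}")
```

Roughly two thirds of the called reads map uniquely, and on average about one in five uniquely mapped reads carries a candidate SNP.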

SNP candidate validation. We attempted experimental validation for 1,549 randomly chosen candidates; each candidate was PCR-amplified and sequenced on ABI capillary machines. 318 candidates could not be assayed; 1,114 of the remaining 1,231 were confirmed, a 90.5% true positive rate.
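The validation rate follows directly from the counts on the slide, as this short check shows (variable names are illustrative):

```python
attempted = 1_549
not_assayable = 318
confirmed = 1_114

# Candidates that actually produced a usable assay.
assayed = attempted - not_assayable
true_positive_rate = confirmed / assayed

print(f"assayed: {assayed}")
print(f"true positive rate: {true_positive_rate:.1%}")
```

The rate is computed over the assayable candidates only, so candidates that failed PCR or sequencing do not count against it.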

Melanogaster SNPs from a single 454 run. SNPs were evenly distributed on melanogaster autosomes (chr. 4 is almost completely heterochromatic). Average density: 1 SNP per 2.9 kb of melanogaster genome sequence, or 1 SNP per 530 bp of aligned 454 sequence. 81.4% of SNPs were discovered in a single 454 read vs. the genome reference.
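As a back-of-the-envelope consistency check, the two densities can be combined with the candidate count and the number of uniquely mapped reads reported on the pipeline slide. This assumes both densities were computed over all 41,265 candidates, which the slide does not state explicitly; the derived quantities are implied estimates, not reported results.

```python
candidate_snps = 41_265
reads_uniquely_mapped = 220_121
bp_genome_per_snp = 2_900    # 1 SNP per 2.9 kb of genome sequence
bp_aligned_per_snp = 530     # 1 SNP per 530 bp of aligned 454 sequence

# Genome span implied by the per-genome density.
implied_genome_bp = candidate_snps * bp_genome_per_snp
# Total aligned 454 sequence implied by the per-read density.
implied_aligned_bp = candidate_snps * bp_aligned_per_snp
# Average aligned length per uniquely mapped read.
implied_bp_per_read = implied_aligned_bp / reads_uniquely_mapped

print(f"implied genome span: {implied_genome_bp / 1e6:.1f} Mb")
print(f"implied aligned sequence: {implied_aligned_bp / 1e6:.1f} Mb")
print(f"implied aligned length per read: {implied_bp_per_read:.1f} bp")
```

Under that assumption, the implied aligned length per read comes out close to 100 bp, in line with the read-length scale shown earlier in the talk.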

SNPs for a melanogaster genotyping chip. Some SNP alleles we discovered are likely singletons (alleles only present in the reference or the African strain, but not in the entire melanogaster "population"), but we know from population genetic theory that SNP discovery (ascertainment) in a pair of chromosomes enriches for common variants, which are most useful as genetic markers. 40K SNPs with a 90%+ validation rate from a single 454 run are probably sufficient for a genotyping chip; for larger genomes / denser maps, multiple 454 runs will be needed.

Ongoing 454 data mining projects: 10 different melanogaster strains; mammalian projects, where the larger genome size requires a reduced genome representation strategy (RRS). RRS shotgun reads provide deeper sequence coverage in "target" regions.

Refinements of the 454 data analysis pipeline: improved base calling gives higher accuracy; effective anchored aligners and SNP callers for deep alignments address more data and deeper alignments from RRS strategies; extended SNP calls for all substitutions and INDELs give more SNPs.

Thanks. Elaine Mardis (Wash. U.); Andy Clark (Cornell University); Boston College: Eric Tsung, Chip Stewart, Michael Stromberg, Tony Nguyen, Aaron Quinlan, Weichun Huang, Michele Busby, Damien Croteau-Chonka. bioinformatics.bc.edu/marthlab

Base callers for 454 and short-read sequencing machines; reference-guided, "anchored" alignment programs; SNP callers for deep 454 alignments and for short-read alignments.

SNP calling – filters. [Figure: three example alignments of the reference against the African 454 sequence: TCGCGTATGCG vs. TCTCGTATGCG; TCGCGTATGCG vs. TCCCGTATGCG; TCGCCTACGCG vs. TCGCGTTCGCG.] We only considered candidate SNPs that were the least likely to be the result of a 454 over-call or under-call, and only candidate SNPs with a SNP probability score > 0.9.