Informatics tools for next-generation sequence analysis Gabor T. Marth Boston College Biology Department University of Michigan October 20, 2008.



Next-gen sequencers offer vast throughput. Figure: read length (axis: 10 bp, 100 bp, 1,000 bp) against bases per machine run (axis: 1 Mb, 10 Mb, 100 Mb, 1 Gb, 10 Gb) for the ABI capillary sequencer, the 454 pyrosequencer ( Mb in bp reads), and the Illumina and AB/SOLiD short-read sequencers (5-15 Gb in bp reads).

Next-gen sequencing enables new applications: organismal resequencing and de novo sequencing; transcriptome sequencing for transcript discovery and expression profiling; epigenetic analysis (e.g. DNA methylation). Figure sources: Meissner et al., Nature 2008; Ruby et al., Cell 2006; Jones-Rhoades et al., PLoS Genetics 2007.

Large-scale individual human resequencing

Technologies

Roche / 454 system: pyrosequencing technology; variable read length; the only new technology with >100 bp reads

Illumina / Solexa Genome Analyzer: fixed-length short-read sequencer; very high throughput; read properties are very close to traditional capillary sequences

AB / SOLiD system: fixed-length short reads; very high throughput; 2-base encoding system; color-space informatics. Figure: 2-base encoding matrix, 1st base (A, C, G, T) by 2nd base (A, C, G, T).
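The 2-base encoding mentioned on this slide can be illustrated with a minimal sketch. It assumes the commonly described SOLiD scheme in which each color is the XOR of 2-bit codes for adjacent bases (A=0, C=1, G=2, T=3); the function names are hypothetical and are not taken from any vendor software.

```python
# Assumed 2-bit base codes; a color is the XOR of the codes of two adjacent bases.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def encode_colorspace(seq):
    """Return the first base followed by one color (0-3) per adjacent base pair."""
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def decode_colorspace(cs):
    """Rebuild the base sequence from a leading base and its color string."""
    bases = [cs[0]]
    for color in cs[1:]:
        # XOR is its own inverse, so the next base follows from the previous one.
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ int(color)])
    return "".join(bases)


if __name__ == "__main__":
    print(encode_colorspace("ATGGC"))
    print(decode_colorspace(encode_colorspace("ATGGC")))
```

Under this scheme a true single-base difference changes two adjacent colors, whereas an isolated color error changes only one, which is one reason color-space data needs its own informatics.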

Helicos / Heliscope system: short-read sequencer; single molecule sequencing; no amplification; variable read length; error rate reduced with 2-pass template sequencing

Data characteristics

Read length. Figure: read length [bp] for each technology, labelled (variable), (fixed), (fixed), (variable); axis value 400.

Representational biases: “dispersed” coverage distribution; this affects genome resequencing (deeper starting read coverage is needed); the major impact will be on counting applications

Amplification errors: many reads come from clonal copies of a single fragment; an early amplification error gets propagated into every clonal copy; early PCR errors in “clonal” read copies lead to false positive allele calls
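One common way to limit the effect of clonal copies is to collapse reads that share the same mapped position and strand before allele calling. The sketch below shows that heuristic only; it is not described in the slides, and the read fields (`chrom`, `start`, `strand`, `mean_q`) are hypothetical.

```python
from collections import defaultdict


def collapse_clonal_reads(reads):
    """Keep one read per (chromosome, start, strand) group.

    reads: iterable of dicts with keys 'chrom', 'start', 'strand', 'mean_q'.
    Within each group the read with the highest mean base quality is kept.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["start"], read["strand"])].append(read)
    return [max(group, key=lambda r: r["mean_q"]) for group in groups.values()]


if __name__ == "__main__":
    reads = [
        {"chrom": "I", "start": 100, "strand": "+", "mean_q": 25},
        {"chrom": "I", "start": 100, "strand": "+", "mean_q": 31},
        {"chrom": "I", "start": 140, "strand": "-", "mean_q": 28},
    ]
    print(collapse_clonal_reads(reads))
```

Collapsing by position also discards some genuinely independent fragments at high coverage, so the choice is a trade-off rather than a free fix.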

Read quality

Error rate (Illumina)

Error rate (454)

Per-read errors (Solexa)

Per-read errors (454)

Base quality values not well calibrated

Tools for genome resequencing

The resequencing informatics pipeline: (i) base calling; (ii) read mapping; (iii) read assembly; (iv) SNP and short INDEL calling; (v) SV calling; (vi) data validation, hypothesis generation. Figure labels: REF, IND.

The variation discovery “toolbox”: base callers; read mappers; SNP callers; SV callers; assembly viewers

1. Base calling: base sequence; base quality (Q-value) sequence; diverse chemistry & sequencing error profiles
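As a reminder of what a base quality value encodes, here is a minimal sketch assuming the Q-values are on the standard Phred scale, Q = -10 log10(P_error). The function names are illustrative.

```python
import math


def error_prob_to_q(p_error):
    """Convert a base-call error probability to a Phred-scaled quality value."""
    return -10.0 * math.log10(p_error)


def q_to_error_prob(q):
    """Convert a Phred-scaled quality value back to an error probability."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, q_to_error_prob(q))
```

On this scale Q20 corresponds to one expected error in 100 base calls and Q30 to one in 1,000.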

454 pyrosequencer error profile: multiple bases in a homopolymeric run are incorporated in a single incorporation test, so the number of bases must be determined from a single scalar signal, and the majority of errors are INDELs
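To make the "single scalar signal" problem concrete, the sketch below computes a posterior over homopolymer length from one flow signal. It is a toy model, not PyroBayes: it assumes a Gaussian signal centered on the true length, a standard deviation that grows with length, and a uniform prior, and all parameter values are invented for illustration.

```python
import math


def homopolymer_posterior(signal, max_len=8, base_sd=0.15, sd_per_base=0.05):
    """Posterior P(n | signal) for n = 0..max_len under the assumed Gaussian model."""
    weights = {}
    for n in range(max_len + 1):
        sd = base_sd + sd_per_base * n  # assumed: noise widens for longer runs
        z = (signal - n) / sd
        weights[n] = math.exp(-0.5 * z * z) / sd
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


def call_homopolymer(signal, **model):
    """Return the most probable run length and its posterior probability."""
    post = homopolymer_posterior(signal, **model)
    best = max(post, key=post.get)
    return best, post[best]


if __name__ == "__main__":
    print(call_homopolymer(1.02))  # clean single-base signal
    print(call_homopolymer(4.50))  # ambiguous between 4 and 5 bases
```

A signal halfway between two integers yields a low-confidence call, and a wrong call there adds or drops a base, which is why the dominant error type is the INDEL.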

454 base quality values: the native 454 base caller assigns base quality values that are too low

PYROBAYES: determine base number

PYROBAYES: Performance: assigned quality values predict the measured error rate better; a higher fraction of bases are high quality

Base quality value calibration
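A simple form of calibration compares each reported quality value with the error rate actually observed for bases carrying that value. The sketch below assumes Phred-scaled values and a set of bases whose correctness is known (for example from alignment to a trusted reference); the pseudocounts and function name are illustrative choices, not the method used in the talk.

```python
import math
from collections import defaultdict


def empirical_quality_table(observations):
    """Map each reported Q to the Q implied by its observed error rate.

    observations: iterable of (reported_q, is_error) pairs.
    A pseudocount of one error and one correct call avoids log10(0).
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [errors, total]
    for reported_q, is_error in observations:
        counts[reported_q][0] += int(is_error)
        counts[reported_q][1] += 1
    table = {}
    for q, (errors, total) in counts.items():
        rate = (errors + 1) / (total + 2)
        table[q] = round(-10 * math.log10(rate))
    return table


if __name__ == "__main__":
    # Bases reported as Q30 that are wrong about 1% of the time behave like Q20.
    obs = [(30, True)] * 10 + [(30, False)] * 988
    print(empirical_quality_table(obs))
```

The resulting table can then be applied as a lookup to replace reported values with empirically supported ones.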

Recalibrated base quality values (Illumina)

2. Read mapping: read mapping is like doing a jigsaw puzzle… you get the pieces… and they give you the picture on the box. Unique pieces are easier to place than others…

Non-uniqueness of reads confounds mapping: reads from repeats cannot be uniquely mapped back to their true region of origin; RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length

Strategies to deal with non-unique mapping: optionally either report only uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented); read mapping to multiple loci requires the assignment of alignment probabilities
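One way to assign alignment probabilities to a read with several candidate loci is sketched below. It assumes a uniform prior over candidates and a likelihood built only from the Phred qualities of mismatched bases, then converts the best posterior into a Phred-scaled mapping quality. This is an illustration of the idea and not the MOSAIK implementation.

```python
import math


def placement_probabilities(mismatch_quals_per_locus):
    """Posterior probability of each candidate locus for one read.

    mismatch_quals_per_locus: {locus: [Phred Q of each mismatched base]}.
    A locus with no mismatches gets likelihood 1 (product over an empty list).
    """
    likelihoods = {
        locus: math.prod(10 ** (-q / 10) for q in quals)
        for locus, quals in mismatch_quals_per_locus.items()
    }
    total = sum(likelihoods.values())
    return {locus: lk / total for locus, lk in likelihoods.items()}


def mapping_quality(posteriors, cap=60):
    """Phred-scaled probability that the best placement is wrong, capped."""
    best = max(posteriors.values())
    if best >= 1.0:
        return cap
    return min(cap, round(-10 * math.log10(1.0 - best)))


if __name__ == "__main__":
    post = placement_probabilities({"chr1:100": [], "chr2:500": [20]})
    print(post, mapping_quality(post))
```

A read that matches two loci equally well gets a posterior of 0.5 at each and a mapping quality near 3, so downstream callers can down-weight it instead of discarding it.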

Paired-end reads help unique read placement: PE (fragment amplification): fragment length bp, limited by amplification efficiency; MP (circularization): fragment length 500 bp - 10 kb (sweet spot ~3 kb), limited by library complexity; PE reads are now the standard for genome resequencing. Figure source: Korbel et al., Science 2007.

MOSAIK

INDEL alleles/errors – gapped alignments 454

Aligning multiple read types together (ABI/capillary, 454 FLX, 454 GS20, Illumina): alignment and co-assembly of multiple read types permits simultaneous analysis of data from multiple sources and error characteristics

Aligner speed

3. Polymorphism / mutation detection. Figure labels: sequencing error; polymorphism.

New challenges for SNP calling: deep alignments of 100s / 1,000s of individuals; trio sequences

Rare alleles in 100s / 1,000s of samples

Allele discovery is a multi-step sampling process: Population → Samples → Reads

Capturing the allele in the sample
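The first sampling step can be quantified with a short calculation. Assuming diploid individuals drawn independently from a population in which the allele has frequency f, the chance that at least one of the 2n sampled chromosomes carries it is 1 - (1 - f)^(2n). The function names are illustrative.

```python
import math


def capture_probability(allele_freq, n_individuals, ploidy=2):
    """Probability that at least one sampled chromosome carries the allele."""
    return 1.0 - (1.0 - allele_freq) ** (ploidy * n_individuals)


def individuals_needed(allele_freq, target_prob, ploidy=2):
    """Smallest number of individuals reaching the target capture probability."""
    chromosomes = math.log(1.0 - target_prob) / math.log(1.0 - allele_freq)
    return math.ceil(chromosomes / ploidy)


if __name__ == "__main__":
    print(capture_probability(0.01, 100))
    print(individuals_needed(0.01, 0.95))
```

This covers only presence in the sample; whether the allele is then seen and called in the reads depends on coverage and base quality.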

Allele calling in the reads: base call; base quality; individual read coverage; sample size

Allele calling in deep sequence data. Figure: example reads aatgtagtaAgtacctac, aatgtagtaCgtacctac, aatgtagtaAgtacctac; base quality labels Q30, Q40, Q50, Q

More samples or deeper coverage / sample? Shallower read coverage from more individuals… or deeper coverage from fewer samples? (simulation analysis by Aaron Quinlan)

Analysis indicates a balance

SNP calling in trios: the child inherits one chromosome from each parent; there is a small probability for a mutation in the child

SNP calling in trios. Figure: example reads aatgtagtaAgtacctac, aatgtagtaCgtacctac, aatgtagtaAgtacctac, aatgtagtaCgtacctac for mother, father and child; P=0.79, P=0.86
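The inheritance constraint in a trio can be written as a prior on the child's genotype given the parents' genotypes. The sketch below assumes a biallelic site, independent transmission of one allele from each parent, and a small per-transmission mutation rate; the default rate is a placeholder and is not a value from the talk.

```python
from itertools import product


def child_genotype_prior(mother, father, mutation_rate=1e-8, alleles="AC"):
    """P(child genotype | parental genotypes) at a site with the given alleles.

    mother, father: two-character genotypes such as "AA" or "AC".
    """
    def transmitted(parent):
        # Each parental allele is passed on with probability 1/2 and may mutate.
        probs = {a: 0.0 for a in alleles}
        for allele in parent:
            for a in alleles:
                if a == allele:
                    probs[a] += 0.5 * (1 - mutation_rate)
                else:
                    probs[a] += 0.5 * mutation_rate / (len(alleles) - 1)
        return probs

    from_mother, from_father = transmitted(mother), transmitted(father)
    prior = {}
    for m, f in product(alleles, repeat=2):
        genotype = "".join(sorted(m + f))
        prior[genotype] = prior.get(genotype, 0.0) + from_mother[m] * from_father[f]
    return prior


if __name__ == "__main__":
    print(child_genotype_prior("AC", "AC", mutation_rate=0.0))
    print(child_genotype_prior("AA", "AA"))
```

Multiplying this prior by the child's read-based genotype likelihoods favors genotypes consistent with inheritance while still allowing a de novo mutation when the read evidence is strong.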

Determining genotype directly from sequence. Figure: example reads AACGTTAGCATA, AACGTTCGCATA, AACGTTAGCATA for individuals 1, 2 and 3; genotype labels A/C, C/C, A/A
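A minimal sketch of calling a diploid genotype from the bases and base qualities covering one site follows. It assumes Phred-scaled qualities, independent reads, equal sampling of the two chromosomes, errors spread evenly over the three other bases, and a uniform genotype prior unless one is supplied. It illustrates the general Bayesian approach and is not the caller used in the talk.

```python
import math


def genotype_posteriors(bases, quals, alleles=("A", "C"), priors=None):
    """Posterior probability of each diploid genotype at a biallelic site."""
    a1, a2 = alleles
    genotypes = [(a1, a1), (a1, a2), (a2, a2)]
    priors = priors or {g: 1.0 / 3 for g in genotypes}
    log_post = {}
    for g in genotypes:
        ll = math.log(priors[g])
        for base, q in zip(bases, quals):
            err = 10 ** (-q / 10)
            # Each chromosome of the genotype is sampled with probability 1/2.
            p = sum(0.5 * ((1 - err) if base == a else err / 3) for a in g)
            ll += math.log(p)
        log_post[g] = ll
    top = max(log_post.values())
    weights = {g: math.exp(v - top) for g, v in log_post.items()}
    total = sum(weights.values())
    return {"/".join(g): w / total for g, w in weights.items()}


if __name__ == "__main__":
    print(genotype_posteriors("AACC", [30, 30, 30, 30]))
    print(genotype_posteriors("AAAA", [30, 30, 30, 30]))
```

With only four reads, an all-A pileup still leaves some posterior mass on A/C, which shows why individual read coverage matters for genotype calls.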

4. Structural variation discovery

SV events from PE read mapping patterns

Deletion: Aberrant positive mapping distance

Copy number estimation from depth of coverage

Alignability – read coverage normalization
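Depth-based copy number estimation with alignability normalization can be sketched as follows. The example assumes per-window read counts, a per-window alignability value given as the fraction of positions to which reads can be uniquely placed, and a mostly diploid genome so that the median normalized depth corresponds to two copies. The thresholds and names are illustrative.

```python
from statistics import median


def copy_number_estimates(read_counts, alignability, ploidy=2, min_alignability=0.1):
    """Estimate integer copy number per window from normalized read depth.

    Windows below min_alignability are reported as None (not callable).
    """
    normalized = [
        count / frac if frac >= min_alignability else None
        for count, frac in zip(read_counts, alignability)
    ]
    # Assumed baseline: the median callable window carries `ploidy` copies.
    baseline = median(v for v in normalized if v is not None)
    return [
        None if v is None else round(ploidy * v / baseline)
        for v in normalized
    ]


if __name__ == "__main__":
    counts = [100, 50, 52, 98, 150, 3]
    align = [1.0, 0.5, 1.0, 1.0, 1.0, 0.05]
    print(copy_number_estimates(counts, align))
```

In the example the second window has half the raw depth only because half of it is alignable, while the third window has a real drop in depth that survives normalization.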

Het deletion “revealed” by normalization

Tandem duplication: negative mapping distance
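The paired-end signatures named on these slides can be turned into a simple classifier. The sketch assumes a signed mapping distance measured from the forward-strand read to its mate, and an expected fragment length distribution summarized by a mean and standard deviation; the cutoff of three standard deviations is an arbitrary illustrative choice, not a Spanner parameter.

```python
def classify_pair(mapping_distance, expected_mean, expected_sd, n_sd=3.0):
    """Label one read pair by how its mapping distance compares with expectation."""
    if mapping_distance < 0:
        # Mates map in reversed order relative to the library layout.
        return "tandem duplication candidate"
    if mapping_distance > expected_mean + n_sd * expected_sd:
        # Mates map farther apart than the fragment could span.
        return "deletion candidate"
    if mapping_distance < expected_mean - n_sd * expected_sd:
        return "shorter than expected"
    return "concordant"


if __name__ == "__main__":
    for distance in (205, 1500, -300, 40):
        print(distance, classify_pair(distance, expected_mean=200, expected_sd=20))
```

In practice a single aberrant pair is weak evidence, so candidates are usually reported only when several pairs with consistent distances cluster at the same locus.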

Spanner – a hybrid SV/CNV detection tool. Screenshot panels: navigation bar; fragment lengths in selected region; depth of coverage in selected region

5. Data visualization: 1. aid software development: integration of trace data viewing, fast navigation, zooming/panning; 2. facilitate data validation (e.g. SNP validation): simultaneous viewing of multiple read types, quality value displays; 3. promote hypothesis generation: integration of annotation tracks

Data visualization

Our software tools for next-gen data

Data mining projects

SNP calling in short-read coverage: C. elegans reference genome (Bristol, N2 strain); reads from Pasadena, CB4858 (1 ½ machine runs) and Bristol, N2 strain (3 ½ machine runs); 5 runs (~120 million) of Illumina reads from the Wash. U. Genome Center, as part of a collaborative project led by Elaine Mardis at Washington University; the goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes; the primary aim was to detect polymorphisms between the Pasadena and the Bristol strains

Polymorphism discovery in C. elegans: MOSAIK aligned / assembled the reads (< 4 hours on 1 CPU); PBSHORT found 44,642 SNP candidates (2 hours on our 160-CPU cluster); SNP density: 1 in 1,630 bp (of non-repeat genome sequence). SNP calling error rate very low: validation rate = 97.8% (224/229); conversion rate = 92.6% (224/242); missed SNP rate = 3.75% (26/693). INDEL candidates validate and convert at similar rates to SNPs: validation rate = 89.3% (193/216); conversion rate = 87.3% (193/221). Figure panels: SNP, INS.

Mutational profiling: deep 454/Illumina/SOLiD data: Pichia stipitis converts xylose to ethanol (bio-fuel production); one mutagenized strain had especially high conversion efficiency; the task was to determine where the mutations that caused this phenotype were; we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads; 14 true point mutations in the entire genome. Figure: Pichia stipitis reference sequence (image from JGI web site).

Technology comparisons

Thanks

Credits: Elaine Mardis, Andy Clark, Aravinda Chakravarti, Doug Smith, Michael Egholm, Scott Kahn, Francisco de la Vega, Kristen Stoops, Ed Thayer

Lab

Recruitment