Next-generation sequence analysis. Gabor T. Marth, Boston College Biology Department. PSB 2008, January 4-8, 2008.

Presentation transcript:

Next-generation sequence analysis Gabor T. Marth Boston College Biology Department PSB 2008 January

Read length and throughput [Figure: read length (10 bp, 100 bp, 1,000 bp) versus bases per machine run (1 Mb, 10 Mb, 100 Mb, 1 Gb) for the ABI capillary sequencer, the 454 pyrosequencer ( Mb in bp reads), and the Illumina/Solexa and AB/SOLiD short-read sequencers (1-4 Gb in bp reads)]

Sequencing chemistries: DNA ligation; DNA base extension (Church, 2005)

Template clonal amplification Church, 2005

Massively parallel sequencing Church, 2005

Features of NGS data
Short sequence reads
– bp
– 25-35 bp (micro-reads)
Huge amount of sequence per run
– Up to gigabases per run
Huge number of reads per run
– Up to 100’s of millions
Higher error as compared with Sanger sequencing
– Error profile different to Sanger

Current and future application areas. Genome re-sequencing: somatic mutation detection, organismal SNP discovery, mutational profiling, structural variation discovery. De novo genome sequencing. Short-read sequencing will be (at least) an alternative to micro-arrays for: DNA-protein interaction analysis (ChIP-Seq), novel transcript discovery, quantification of gene expression, epigenetic analysis (methylation profiling). [Figure labels: DEL, SNP, reference genome]

Fundamental informatics challenges: 1. Interpreting machine readouts: base calling, base error estimation. 2. Dealing with non-uniqueness in the genome: resequenceability. 3. Alignment of billions of reads.

Informatics challenges (cont’d): 4. SNP, short INDEL, and structural variation discovery. 5. Data visualization. 6. Data storage & management.

Challenge 1. Base accuracy and base calling. Machine read-outs are quite different. Read length, read accuracy, and sequencing error profiles are variable (and change rapidly as machine hardware, chemistry, optics, and noise filtering improve). What is the instrument-specific error profile? Are the base quality values satisfactory? (1) Are base quality values accurate? (2) Are most called bases high-quality?

454 pyrosequencer error profile. Multiple bases in a homo-polymeric run are incorporated in a single incorporation test, so the number of bases must be determined from a single scalar signal; as a result, the majority of errors are INDELs. Error rates are nucleotide-dependent.

454 base quality values: the native 454 base caller assigns base quality values that are too low

New 454 base caller: PYROBAYES: determine base number. Data likelihoods and priors are combined into the posterior base number probability.
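The likelihood, prior, and posterior on this slide can be illustrated with a small Bayes-rule sketch. This is not the PyroBayes implementation: the Gaussian noise model, the fixed `sigma`, the geometric prior, and the function name `posterior_base_number` are all assumptions made for illustration.

```python
import math

def posterior_base_number(signal, prior, sigma=0.15):
    """Posterior P(n | signal) over homopolymer lengths n = 0..len(prior)-1.

    Assumption: the flow signal for n incorporated bases is Gaussian with
    mean n and a fixed standard deviation sigma (real noise is not this simple).
    """
    unnormalized = []
    for n, p_n in enumerate(prior):
        z = (signal - n) / sigma
        likelihood = math.exp(-0.5 * z * z)   # data likelihood P(signal | n)
        unnormalized.append(likelihood * p_n)  # times prior P(n)
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("signal is too far from every candidate base number")
    return [u / total for u in unnormalized]

# Hypothetical prior: shorter homopolymers are more common.
prior = [0.5 ** n for n in range(9)]
post = posterior_base_number(2.1, prior)
best_n = max(range(len(post)), key=post.__getitem__)
print(best_n, post[best_n])
```

The most likely base number is the argmax of the posterior; the posterior mass on the other base numbers is what a quality value would summarize.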

PYROBAYES: base calls and quality values. Call the most likely number of nucleotides; produce three base quality values: QS (substitution), QI (insertion), QD (deletion).

PYROBAYES: Performance better correlation between assigned and measured quality values higher fraction of high-quality bases

Illumina/Solexa base accuracy error rate grows as a function of base position within the read a large fraction of the reads contains 1 or 2 errors

Illumina/Solexa base accuracy (cont’d). Actual base accuracy for a fixed base quality value is a function of base position within the read (i.e. there is need for quality value calibration). Most errors are substitutions, so PHRED quality values work.
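The calibration idea can be sketched as follows: group bases by assigned quality value and read position, measure the empirical error rate in each group, and convert it back to the PHRED scale, Q = -10 log10(P_error). The function names and the add-one smoothing are assumptions for illustration, not the procedure used in the talk.

```python
import math
from collections import defaultdict

def phred_to_error(q):
    """Error probability implied by a PHRED quality value."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """PHRED quality value for an error probability p > 0."""
    return -10 * math.log10(p)

def calibrate(observations):
    """Map (assigned_q, position) to an empirically measured quality value.

    observations: iterable of (assigned_q, position_in_read, is_error),
    e.g. obtained by comparing reads to a trusted reference.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [errors, total]
    for q, pos, is_error in observations:
        entry = counts[(q, pos)]
        entry[0] += int(is_error)
        entry[1] += 1
    table = {}
    for key, (errors, total) in counts.items():
        # Add-one smoothing (an assumption) avoids log10(0) when no errors are seen.
        table[key] = error_to_phred((errors + 1) / (total + 2))
    return table
```

A base whose assigned quality is 30 but whose group measures closer to 20 would be reported with the measured value after calibration.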

AB SOLiD System dibase sequencing
2-base, 4-color: 16 probe combinations
● 4 dyes encode 16 2-base combinations
● Detecting a single color indicates 4 combinations and eliminates 12
● Each color reflects position, not the base call
● Each base is interrogated by two probes
● Dual interrogation eases discrimination of errors (random or systematic) vs. SNPs (true polymorphisms)
[Figure: di-base probes (3’ N N N T G z z z 5’, 3’ N N N G A z z z 5’, 3’ N N N A T z z z 5’) and the 1st base x 2nd base color matrix over A, C, G, T]

Converting dibase (color) into base calls. The decoding matrix (1st base x 2nd base over A, C, G, T) allows a sequence of transitions to be converted to a base sequence, as long as one of the two bases is known. Dibases consistent with each color of the example read, listed by leading base: A: AA AC AC AA AG AT AA AG AG; C: CC CA CA CC CT CG CC CT CT; G: GG GT GT GG GA GC GG GA GA; T: TT TG TG TT TC TA TT TC TC. Possible sequences: AACAAGCCTC, CCACCTAAGA, GGTGGATTCT, TTGTTCGGAG.
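A minimal sketch of dibase encoding and decoding, assuming the commonly used representation in which A, C, G, T are coded 0 to 3 and a color is the XOR of two adjacent base codes. The numeric color labels 0 to 3 are a notational assumption (the slide shows dyes, not numbers). The example sequence is the first of the possible sequences listed on the slide.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}

def encode_colors(seq):
    """Color for each adjacent base pair: XOR of the two 2-bit base codes."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def decode_colors(first_base, colors):
    """Recover the base sequence from colors once the first base is known."""
    decoded = [first_base]
    for color in colors:
        decoded.append(BASES[CODE[decoded[-1]] ^ color])
    return "".join(decoded)

colors = encode_colors("AACAAGCCTC")
# One candidate sequence per choice of the unknown first base.
candidates = [decode_colors(base, colors) for base in BASES]
print(colors)
print(candidates)
```

The four candidates reproduce the four possible sequences on the slide, which is why knowing a single base (in practice, the last base of the adapter) is enough to pick one.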

Alignment to reference in “color-space”. Working in color space: –Reverse-complementation becomes simply reverse –Apply color transition rules to remove measurement errors from partial assemblies –If reference or Sanger reads are combined, translate to color space
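The first point can be checked directly: under the XOR representation (an assumed notation, with A, C, G, T coded 0 to 3), complementing both bases of a pair leaves the color unchanged, so the color string of the reverse complement is just the reversed color string. The example sequence is the one used on the error-checking slides.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def encode_colors(seq):
    """Color for each adjacent base pair: XOR of the two 2-bit base codes."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

seq = "ACGGTCGTCGTGTGCGT"
forward = encode_colors(seq)
backward = encode_colors(reverse_complement(seq))
print(backward == forward[::-1])
```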

SOLiD error checking code (I) [Figure: reads compared with the reference ACGGTCGTCGTGTGCGT in three cases: no change; SNP (ACGGTCGCCGTGTGCGT); measurement error]

SOLiD error checking code (II). The triplet GTC has 3 possible changes in the middle base (GAC, GCC, GGC) and therefore 3 possible changes in dibase encoding. These are the allowed transitions: only some transitions indicate a SNP in the sample. [Figure: dibase encodings of GTC, GAC, GCC, and GGC]

SOLiD error checking code (III) [Figure: reads compared with the reference ACGGTCGTCGTGTGCGT in three further cases: invalid adjacent change; 1-base deletion (ACGGTC-TCGTGTGCGT); 2-base deletion (ACGGTC--CGTGTGCGT)]
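The error-checking rules can be sketched in color space under the XOR representation (assumed notation). A single changed color is treated as a measurement error. Two adjacent changed colors indicate a SNP only if their XOR matches the reference's, because that XOR depends only on the two bases flanking the changed one. The function `classify_color_changes` is a hypothetical helper; it does not handle deletions or runs of three or more changed colors.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_colors(seq):
    """Color for each adjacent base pair: XOR of the two 2-bit base codes."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def classify_color_changes(ref_colors, read_colors):
    """Return (color_index, label) for each difference between two color strings."""
    calls = []
    i, n = 0, len(ref_colors)
    while i < n:
        if ref_colors[i] == read_colors[i]:
            i += 1
        elif i + 1 < n and ref_colors[i + 1] != read_colors[i + 1]:
            # Two adjacent changes: valid only if the flanking bases are preserved.
            same_flanks = (ref_colors[i] ^ ref_colors[i + 1]) == (read_colors[i] ^ read_colors[i + 1])
            calls.append((i, "SNP" if same_flanks else "invalid adjacent"))
            i += 2
        else:
            calls.append((i, "measurement error"))
            i += 1
    return calls

ref = encode_colors("ACGGTCGTCGTGTGCGT")
snp = encode_colors("ACGGTCGCCGTGTGCGT")  # T -> C at base index 7, as on slide (I)
print(classify_color_changes(ref, snp))
```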

SOLiD di-base sequencing accuracy and QV

Challenge 2. Resequenceability. Reads from repeats cannot be uniquely mapped back to their true region of origin. RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length. Near-perfect micro-repeats can also be a problem because we want to align reads even with a few sequencing errors and / or SNPs.

Repeats at the fragment level “base masking” “fragment masking”

Fragment level repeat annotation. Bases in repetitive fragments may be resequenced with reads representing other, unique fragments, so fragment-level repeat annotations spare a higher fraction of the genome than base-level repeat masking.

Find perfect and near-perfect micro-repeats Hash based methods (fast but only work out to a couple of mismatches) Exact methods (very slow but find every repeat copy) Heuristic methods (fast but miss a fraction of the repeats)
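The hash-based approach can be sketched for the simplest case, perfect repeats on the forward strand: index every k-mer of read-length scale and report those that occur more than once. Mismatch tolerance and reverse-complement handling are omitted, and the function name is hypothetical.

```python
from collections import defaultdict

def repeated_kmers(genome, k):
    """Map each k-mer that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for start in range(len(genome) - k + 1):
        positions[genome[start:start + k]].append(start)
    return {kmer: hits for kmer, hits in positions.items() if len(hits) > 1}

print(repeated_kmers("ACGTACGTTT", 4))
```

A read of length k drawn from any reported position cannot be placed uniquely, which is the fragment-level notion of a repeat used on the preceding slides.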

Challenge 3. Read alignment and assembly. Resequencing requires reference sequence-guided read alignment. To align billions of reads the aligner has to be fast and efficient. INDEL errors require gapped alignment. Individually aligned reads must be “assembled” together. The aligner has to work for every read type (short, medium-length, and long reads), must tolerate sequencing errors and SNPs, and must work with both base-level and fragment-level repeat annotations. Transcribed sequences require additional features, e.g. splice-site aware alignment capability. Most frequently used tools: BLAT (only pair-wise), SSAHA (pair-wise), MAQ (pair-wise and assembly), ELAND (pair-wise), MOSAIK (pair-wise and assembly, gapped).

MOSAIK: method Step 1. initial short-hash based scan for possible read locations Step 2. evaluation of candidate read locations with SW method
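The two steps can be illustrated with a generic seed-and-extend sketch: a k-mer hash of the reference proposes candidate locations, and each candidate is scored with Smith-Waterman local alignment. This is not MOSAIK's code; the function names, scoring parameters, and `pad` value are assumptions chosen for illustration.

```python
from collections import defaultdict

def build_index(ref, k):
    """Step 1a: hash every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def candidate_starts(read, index, k):
    """Step 1b: each shared k-mer implies a start position of the read on the reference."""
    starts = set()
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], ()):
            starts.add(i - j)
    return starts

def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Step 2: best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

def align(read, ref, index, k, pad=2):
    """Return (score, start) for the best-scoring candidate, or None if no seed hits."""
    best = None
    for start in candidate_starts(read, index, k):
        lo = max(0, start - pad)
        hi = min(len(ref), start + len(read) + pad)
        score = smith_waterman(read, ref[lo:hi])
        if best is None or score > best[0]:
            best = (score, start)
    return best

ref = "TTGACCGATTACAGGCATCGTTAGC"
read = "TTACATGCATCG"  # ref[8:20] with one substitution
print(align(read, ref, build_index(ref, 4), 4))
```

The hash scan is cheap and restricts the expensive dynamic programming to a few windows, which is what makes gapped alignment feasible at scale.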

MOSAIK – performance. Solexa read alignments to C. elegans genome: 100 million reads aligned in 95 minutes; 18,000 reads / second. 454 reads to Pichia (yeast-size) genome: GS20: 2,000 reads / second; FLX: 300 reads / second. Solexa read alignments to masked human genome: 40 seconds for 1 million reads; 18,000 reads / second; 5.5 GB RAM used (more for longer initial hash sizes).

MOSAIK: co-assembling different read types ABI/cap. 454/FLX Illumina 454/GS20

Challenge 4. Polymorphism discovery. Shallow and deep read coverage. Most candidates will never be “checked”, so only very low error rates are acceptable. We updated PolyBayes to deal with new read types and made the new software (PBSHORT) much more efficient.

Structural variation discovery copy number variations (deletions & amplifications) can be detected from variations in the depth of read coverage structural rearrangements (inversions and translocations) require paired-end read data
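Depth-based copy number detection can be sketched by comparing the mean read depth in fixed windows with the genome-wide mean. The window size, the 0.5 and 1.5 thresholds, and the function names are assumptions; real data would also need correction for mappability and GC content.

```python
def window_depth_ratios(depth, window):
    """Ratio of each window's mean depth to the overall mean depth."""
    total = sum(depth)
    if total == 0:
        raise ValueError("no coverage")
    mean_depth = total / len(depth)
    ratios = []
    for start in range(0, len(depth) - window + 1, window):
        window_mean = sum(depth[start:start + window]) / window
        ratios.append((start, window_mean / mean_depth))
    return ratios

def call_copy_number(ratios, low=0.5, high=1.5):
    """Label windows whose depth ratio falls outside the [low, high] band."""
    calls = []
    for start, ratio in ratios:
        if ratio <= low:
            calls.append((start, "deletion"))
        elif ratio >= high:
            calls.append((start, "amplification"))
    return calls

# Toy per-base depth profile: a gap in coverage and a region of tripled depth.
depth = [10] * 40 + [0] * 10 + [10] * 40 + [30] * 10
print(call_copy_number(window_depth_ratios(depth, 10)))
```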

Challenge 5. Data visualization. 1. Aid software development: integration of trace data viewing, fast navigation, zooming/panning. 2. Facilitate data validation (e.g. SNP validation): co-viewing of multiple read types, quality value displays. 3. Promote hypothesis generation: integration of annotation tracks.

Challenge 6. Massive data volumes. Two connected working groups to define standard data formats: Short-read format working group (Asim Siddiqui, UBC); Assembly format working group (Boston College).

Next-generation sequencing software: machine manufacturers’ sites plus third-party developers’ sites, e.g.:

Applications in various discovery projects 1. SNP discovery in shallow, single-read 454 coverage (Drosophila melanogaster) 2. Mutational profiling in deep 454 data (Pichia stipitis) 3. SNP and INDEL discovery in deep Illumina / Solexa short-read coverage (Caenorhabditis elegans) (image from Nature Biotech.)

SNP calling in single-read 454 coverage. Collaborative project with Andy Clark (Cornell) and Elaine Mardis (Wash. U.); DNA courtesy of Chuck Langley, UC Davis. The goal was to assess polymorphism rates between 10 different African and American melanogaster isolates. 10 runs of 454 reads (~300,000 reads per isolate) were collected. Key informatics question: can we detect SNPs with high accuracy in low-coverage, survey-style 454 reads aligned to finished reference genome sequence? Reads were base-called with PyroBayes and aligned to the 180 Mb reference melanogaster genome sequence with Mosaik: 0.16x nominal read coverage, so most reads are singletons. SNP detection with PolyBayes.
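The nominal coverage figure follows from reads x read length / genome size. The slide does not state the read length; the ~100 bp used below is an assumption, and it gives a value close to the reported 0.16x. Under a simple Poisson model of per-base depth (also an assumption), almost all covered bases at this coverage are covered by exactly one read.

```python
import math

def nominal_coverage(n_reads, read_length, genome_size):
    """Average number of reads covering each base."""
    return n_reads * read_length / genome_size

def single_coverage_fraction(coverage):
    """Among covered bases, the fraction at depth exactly 1 (Poisson depth model)."""
    return coverage * math.exp(-coverage) / (1.0 - math.exp(-coverage))

# 300,000 reads per isolate and a 180 Mb genome are from the slide;
# the 100 bp read length is an assumed value.
print(nominal_coverage(300_000, 100, 180_000_000))
print(single_coverage_fraction(0.16))
```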

SNP calling success rates. 92.9 % validation rate (1,342 / 1,443): single-read coverage: 92.9% (1,275 / 1,372); double-read coverage: 94.3% (67 / 71). 2.0% missed SNP rate (25 / 1247): single-read coverage: 2.12% (25 / 1176); double-read coverage: 0% (0 / 59). [Figure labels: iso-1 reference; read; 46-2; ABI reads (2 fwd + 2 rev)]

Genome variation in melanogaster isolates. 658,280 SNPs discovered among all 10 lines. Nucleotide diversity θ ≈ 5x10^-3 (1 SNP / 200 bp) between each line and reference (in line with expectations). 20.2% (133,264 sites) polymorphic among two or more lines. The 1 SNP / 900 bp nominal density is sufficient for high-resolution marker mapping.

SNP calling in short-read coverage. The goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes. 5 runs (~120 million) of Illumina reads from the Washington University Genome Center, as part of a collaborative project led by Elaine Mardis: Pasadena, CB4858 (1 ½ machine runs) and Bristol, N2 strain (3 ½ machine runs). The primary aim was to detect polymorphisms between the Pasadena and the Bristol strain. Reference: C. elegans reference genome (Bristol, N2 strain).

Polymorphism discovery in C. elegans. MOSAIK aligned / assembled the reads (< 4 hours on 1 CPU). PBSHORT found 44,642 SNP candidates (2 hours on our 160-CPU cluster). SNP density: 1 in 1,630 bp (of non-repeat genome sequence). SNP calling error rate very low: validation rate = 97.8% (224/229); conversion rate = 92.6% (224/242); missed SNP rate = 3.75% (26/693). INDEL candidates validate and convert at similar rates to SNPs: validation rate = 89.3% (193/216); conversion rate = 87.3% (193/221). [Figure labels: SNP, INS]

Mutational profiling: deep 454/Illumina/SOLiD data. Collaboration with Doug Smith at Agencourt. Pichia stipitis converts xylose to ethanol (bio-fuel production). One mutagenized strain had especially high conversion efficiency; the task was to determine where the mutations that caused this phenotype were. We resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads: 14 true point mutations in the entire genome. [Figure: Pichia stipitis reference sequence; image from JGI web site]

Mutational profiling: comparisons

Technology   Coverage   Nominal coverage   FP   FN   Total error
454/FLX      2 runs     12.9x
/FLX         1 run      9.8x               6    1    7
Illumina     7 lanes    53.5x              0    0    0
Illumina     3 lanes    23.4x              0    0    0
Illumina     2 lanes    15.6x              2    0    2
Illumina     1 lane     7.6x               2    2    2
SOLiD        -          30.0X              0    0    0
SOLiD        -          20.0X              0    0    0
SOLiD        -          10.0X              0    0    0
SOLiD        -          8.0X               0    4    4
SOLiD        -          6.0X               0    6    6

Informatics of transcriptome sequencing. Measuring gene expression levels by sequence tag counting requires SAGE informatics-like approaches. Novel transcript discovery: new genes & exons; novel transcripts in known genes. [Figure labels: Inferred Exon 1, Inferred Exon 2]
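Tag counting can be sketched as assigning each mapped read to the annotated gene that contains it and normalizing by the total number of assigned reads. The gene coordinates, the tags-per-million normalization, and the function names are assumptions for illustration; the talk does not specify a counting scheme.

```python
from collections import Counter

def count_tags(read_starts, genes):
    """Count reads whose mapped start falls inside each gene.

    genes: dict of name -> (start, end), half-open coordinates on one sequence.
    """
    counts = Counter({name: 0 for name in genes})
    for pos in read_starts:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts

def tags_per_million(counts):
    """Scale counts so that samples of different sequencing depth are comparable."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: 1e6 * c / total for name, c in counts.items()}

genes = {"geneA": (0, 100), "geneB": (200, 300)}  # hypothetical annotation
counts = count_tags([5, 50, 250, 400], genes)
print(counts, tags_per_million(counts))
```

Reads that fall outside every annotated gene (position 400 here) are the raw material for the novel transcript discovery mentioned on the slide.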

Protein-DNA interactions: ChIP-Seq. Protein-bound DNA fragments are isolated with chromatin immunoprecipitation (ChIP) and then sequenced (Seq) on a high-throughput sequencing platform. Sequences are mapped to the genome sequence with a read alignment program. Regions over-represented in the sequences are identified. (Johnson et al., Science, 2007)
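The last step, identifying over-represented regions, can be sketched by counting mapped reads in fixed windows and flagging windows whose count is improbable under a uniform Poisson background. This is a simplified stand-in and not the method of the cited studies; the window size, the threshold `alpha`, and the function names are assumptions.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - below)

def enriched_windows(read_starts, genome_length, window, alpha=1e-3):
    """Return (window_start, read_count) for windows enriched over background."""
    n_windows = genome_length // window
    counts = [0] * n_windows
    for pos in read_starts:
        w = pos // window
        if 0 <= w < n_windows:
            counts[w] += 1
    background = sum(counts) / n_windows  # mean reads per window
    return [
        (w * window, c)
        for w, c in enumerate(counts)
        if c > 0 and poisson_upper_tail(c, background) < alpha
    ]

# Toy data: one read in every 100 bp window plus a pile-up of 20 reads at 300-319.
reads = [w * 100 + 5 for w in range(10)] + list(range(300, 320))
print(enriched_windows(reads, 1000, 100))
```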

Protein-DNA interactions: ChIP-Seq. ChIP-Seq scales well for simultaneous analysis of binding sites in the entire genome (Mikkelsen et al., Nature).