Next-generation DNA sequencing

1 Next-generation DNA sequencing
Boston College Biology Department BI420 Introduction to Bioinformatics Fall 2012

2 Traditional DNA sequencing

3 Genetics of living organisms
Chromosomes DNA

4 Radioactive label gel sequencing

5 Four-color capillary sequencing
[Figure: ABI 3700 four-color sequence trace; genome size labels ~1 Mb, ~100 Mb, >100 Mb, ~3,000 Mb]

6 Individual human resequencing

7 Next-generation sequencing

8 … offer vast throughput … & many applications
[Figure: bases per machine run (1 Mb to 1 Tb) versus read length (10 bp to 1,000 bp) for ABI / capillary, 454, and Illumina / SOLiD]

9 Sequencing chemistries
DNA base extension; DNA ligation
Church, 2005

10 Template clonal amplification
Church, 2005

11 Massively parallel sequencing
Church, 2005

12 Chemistry of paired-end sequencing
Double-stranded DNA is folded into a bridge shape and then separated into single strands. The end of each strand is then sequenced. (Figure courtesy of Illumina)

13 Paired-end reads
fragment amplification: fragment length 100 - 600 bp; fragment length limited by amplification efficiency
circularization: fragment length 500 bp - 10 kb (sweet spot ~3 kb); fragment length limited by library complexity
Korbel et al. Science 2007

14 Features of NGS data
Short sequence reads: bp 25-35bp (micro-reads)
Huge amount of sequence per run: up to gigabases per run
Huge number of reads per run: up to 100's of millions
Higher error rate than Sanger sequencing
Error profile different from Sanger

15 T1. Roche / 454 FLX system
Pyrosequencing technology. A small quantity of one dNTP is flowed over the template at a time, and light is emitted only when that nucleotide is incorporated. The procedure is repeated nucleotide by nucleotide, and one checks which flow lights up.
Lengths of homopolymer runs (AAA, CCCC, etc.) are quantified by the brightness of the signal. This is the largest source of error.
Variable read length
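
A minimal sketch of how homopolymer lengths might be read from signal brightness, assuming a fixed cyclic flow order and signals normalized so that one incorporated base gives a value near 1.0. The function name `call_bases`, the flow order `TACG`, and the example signals are illustrative assumptions, not the instrument's actual base caller.

```python
def call_bases(flow_signals, flow_order="TACG"):
    """Turn normalized flow signals into bases (illustrative only).

    Each flow offers one nucleotide; the rounded signal is taken as the
    number of times that nucleotide was incorporated in a row.
    """
    bases = []
    for i, signal in enumerate(flow_signals):
        nucleotide = flow_order[i % len(flow_order)]
        run_length = max(0, round(signal))  # brightness -> homopolymer length
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Hypothetical signals for six flows: T, A, C, G, T, A
signals = [1.04, 0.03, 2.10, 0.98, 0.07, 3.45]
called = call_bases(signals)
```

A signal such as 3.45 lies close to the boundary between a run of 3 and a run of 4, which is how homopolymer over-calls and under-calls (INDEL errors) can arise.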

16 T2. Illumina / Solexa Genome Analyzer
fixed-length short-read sequencer
read properties are very close to traditional capillary sequences
very low INDEL error rate

17 T3. AB / SOLiD system
fixed-length short-read sequencer
employs a 2-base encoding system
[Figure: 2-base encoding table indexed by 1st base and 2nd base]
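
The 2-base encoding can be illustrated with a short sketch. It assumes the commonly described color-space scheme in which each color (0-3) encodes a pair of adjacent bases and decoding needs a known first base; the function names are hypothetical.

```python
# Illustrative sketch of 2-base (color-space) encoding; names are hypothetical.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_colors(seq):
    """Return the first base plus one color (0-3) per adjacent base pair."""
    colors = [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and the colors."""
    bases = [first_base]
    for color in colors:
        # Each color, combined with the previous base, determines the next base.
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


first, colors = encode_colors("ATGGCA")
decoded = decode_colors(first, colors)
```

Under this scheme a true single-base difference changes two adjacent colors, whereas a single miscalled color changes every base decoded after it, which is why such reads are usually aligned in color space.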

18 T4. Pacific Biosciences Single Molecule Real Time
DNA polymerase fixed in place
Polymerase altered so that, as bases are added onto the second strand, a base-specific fluorescence signal is emitted
Single-molecule optical readout finely controlled using waveguides
Long read lengths (>1,000 bp)
[Figure: SMRT Technology overview]

19 Applications

20 Application areas
Genome resequencing: variant discovery, somatic mutation detection, mutational profiling
De novo sequencing
Identification of protein-bound DNA: chromatin structure, methylation, transcription binding sites
RNA-Seq: expression, transcript discovery
Mikkelsen et al. Nature 2007; Cloonan et al. Nature Methods, 2008

21 SNP and short-INDEL discovery

22 Mutational profiling in deep 454 data
Pichia stipitis reference sequence (image from JGI web site)
Pichia stipitis is a yeast that efficiently converts xylose to ethanol (bio-fuel production)
one specific mutagenized strain had especially high conversion efficiency
the goal was to determine which mutations caused this phenotype
we analyzed 10 runs (~3 million reads) of 454 data (~20x coverage of the 15 Mb genome)
found 39 mutations
Smith et al. Genome Research 2008
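
The coverage figure quoted on this slide follows from the relation coverage = (number of reads × mean read length) / genome size. The sketch below applies it to the slide's rounded numbers; the function names are hypothetical, and the mean read length is inferred from those rounded numbers rather than stated in the source.

```python
def mean_coverage(n_reads, mean_read_length, genome_size):
    """Average number of times each genome position is covered by a read."""
    return n_reads * mean_read_length / genome_size


def implied_read_length(coverage, n_reads, genome_size):
    """Mean read length needed to reach a given coverage."""
    return coverage * genome_size / n_reads


# Rounded values from the slide: ~3 million reads, ~20x, 15 Mb genome
read_length = implied_read_length(20, 3_000_000, 15_000_000)
coverage = mean_coverage(3_000_000, read_length, 15_000_000)
```

With these inputs the implied mean read length is about 100 bp, which should be read as a consistency check on rounded figures, not as a measured value.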

23 Structural variation detection
structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations
copy number (for amplifications, deletions) from depth of read coverage
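
The two signals on this slide can be sketched as follows. The code assumes a forward/reverse paired-end library with a known expected separation and a diploid genome; the thresholds, function names, and labels are illustrative assumptions, and real callers cluster many supporting pairs before reporting a variant.

```python
def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                  expected=300, tolerance=100):
    """Label one mapped read pair (thresholds are assumed, not from the source)."""
    if chrom1 != chrom2:
        return "translocation candidate"
    if strand1 == strand2:
        return "inversion candidate"   # mates should map to opposite strands
    span = abs(pos2 - pos1)
    if span > expected + tolerance:
        return "deletion candidate"    # mates map farther apart than expected
    if span < expected - tolerance:
        return "insertion candidate"   # mates map closer than expected
    return "concordant"


def copy_number_estimate(depth, mean_depth, ploidy=2):
    """Scale local read depth by the genome-wide mean depth."""
    return round(ploidy * depth / mean_depth)


label = classify_pair("chr1", 1000, "+", "chr1", 5000, "-")
copies = copy_number_estimate(depth=45, mean_depth=30)
```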

24 Identification of protein-bound DNA
[Figure: aligned reads on the genome sequence]
Chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007)
Transcription binding sites (Robertson et al. Nature Methods, 2007)

25 Novel transcript discovery (genes)
novel exons
novel transcripts containing known exons
Mortazavi et al. Nature Methods

26 Novel transcript discovery (miRNAs)
Ruby et al. Cell, 2006

27 Expression profiling
tag counting (e.g. SAGE, CAGE)
shotgun transcript sequencing
[Figure: aligned reads over genes for each approach]
Jones-Rhoades et al. PLoS Genetics, 2007
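
Tag counting can be sketched as counting aligned read positions that fall inside annotated gene intervals. The gene names, coordinates, and the function `count_reads_per_gene` are hypothetical; the sketch assumes a single chromosome and does not normalize for gene length or sequencing depth.

```python
def count_reads_per_gene(read_positions, genes):
    """Count reads whose mapped position lies within each gene interval.

    genes maps a gene name to inclusive (start, end) coordinates.
    """
    counts = {name: 0 for name in genes}
    for position in read_positions:
        for name, (start, end) in genes.items():
            if start <= position <= end:
                counts[name] += 1
    return counts


# Hypothetical annotation and read positions
genes = {"geneA": (100, 200), "geneB": (500, 800)}
counts = count_reads_per_gene([120, 150, 199, 510, 900], genes)
```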

28 De novo genome sequencing
[Figure labels: short reads, longer reads, read pairs, assembled sequence contigs]
Lander et al. Nature 2001

29 Technologies / properties / applications
Technology: Roche/454, Illumina/Solexa, AB/SOLiD
Read properties
Read length: bp, 75-150bp, 25-50bp
Error rate: <0.5%, <1.0%
Dominant error type: INDEL, SUB
Quality values available: yes, not really
Paired-end separation: < 10kb (3kb optimal), bp, 500bp - 10kb (3kb optimal)
Applications: SNP discovery, short-INDEL discovery, SV discovery, ChIP-seq, small RNA/gene discovery, mRNA transcript discovery, expression profiling, de novo sequencing ?

30 The Bioinformatics angle

31 Trace extraction

32 Base calling
machine read-outs are quite different
read length, read accuracy, and sequencing error profiles are variable (and change rapidly as machine hardware, chemistry, optics, and noise filtering improve)

33 Base error rate
error rate typically 0.4 - 1%
the more errors the aligner allows, the lower the fraction of the reads that can be uniquely aligned
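
To see how many mismatches an aligner needs to tolerate at these error rates, the number of errors in a read can be modeled with a binomial distribution. This is a simplified sketch that assumes independent errors at a uniform per-base rate, which real reads do not strictly follow; the function name is hypothetical.

```python
from math import comb


def prob_at_most_k_errors(read_length, error_rate, k):
    """Probability that a read contains k or fewer sequencing errors."""
    return sum(
        comb(read_length, i)
        * error_rate ** i
        * (1 - error_rate) ** (read_length - i)
        for i in range(k + 1)
    )


# Example: 35 bp reads at a 1% per-base error rate
p_error_free = prob_at_most_k_errors(35, 0.01, 0)
p_up_to_two = prob_at_most_k_errors(35, 0.01, 2)
```

Allowing more mismatches recovers more of the reads that contain errors, at the cost of more candidate locations per read and therefore fewer unique alignments.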

34 Error rate grows with each cycle
This phenomenon limits useful read length. A key challenge in sequencing technology is how to get long reads that remain accurate.

35 Read mapping
read mapping is similar to a jigsaw puzzle…
…where they give you the cover on the box

36 Some pieces are easier to place than others…
pieces that look like each other… …pieces with unique features

37 Repeats: multiple mapping problem
Lander et al. 2001

38 Dealing with multiple mapping

39 Mapping quality values
0.8 0.19 0.01
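
Assuming the three values on this slide are the probabilities of three candidate placements for one read, a mapping quality can be derived by expressing the probability that a placement is wrong on a Phred scale, Q = -10 log10(1 - P). The function names and the cap of 60 are illustrative assumptions, not a specific mapper's behavior.

```python
from math import log10


def placement_posteriors(likelihoods):
    """Normalize per-location alignment likelihoods so they sum to 1."""
    total = sum(likelihoods)
    return [value / total for value in likelihoods]


def mapping_quality(posterior, cap=60):
    """Phred-scaled probability that this placement is wrong."""
    error = 1.0 - posterior
    if error <= 0:
        return cap
    return min(cap, round(-10 * log10(error)))


posteriors = placement_posteriors([0.8, 0.19, 0.01])
qualities = [mapping_quality(p) for p in posteriors]
```

Even the best placement here has a 1 in 5 chance of being wrong, so its mapping quality is low; a read with a single plausible location would score near the cap.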

40 Paired-end (PE) reads
fragment length: 100 – 600 bp
fragment length: 1 – 10 kb
Korbel et al. Science 2007
PE reads are now the standard for whole-genome short-read sequencing

41 Gapped alignments (for INDELs)

42 Read mapping programs
Many mappers are available
Handling of read pairs
Handling non-unique mapping
Speed and accuracy
Flexibility vis-à-vis sequencing technologies
Stability and support

43 Data storage requirements

44 Duplicate reads

45 Local misalignments

46 Base quality value recalibration

47 Multiple read types
ABI/cap., 454/FLX, 454/GS20, Illumina

48 Alignment visualization
integrating genomic context (e.g. gene annotations)
too much data – indexed browsing
too much detail – color coding, show/hide
structural variant visualization more difficult

49 Standard data formats
SRF/FASTQ, GVF/VCF, SAM/BAM

50 Standard data formats
Reads: FASTQ
Alignments: SAM/BAM
Variants: VCF
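
As an illustration of the read format, a FASTQ record consists of four lines: a header starting with `@`, the sequence, a `+` separator, and one quality character per base. The sketch below assumes unwrapped four-line records and the Phred+33 quality encoding; the function name `read_fastq` and the example record are hypothetical.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, qualities) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return                      # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()               # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        # Phred+33 (assumed): quality value = ASCII code minus 33
        yield header[1:], sequence, [ord(char) - 33 for char in quality]


example = StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
```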

