Download presentation
Presentation is loading. Please wait.
1
Genome Sequencing
2
What to Sequence and Why?
De novo whole genome sequencing requires de novo whole genome assembly Polymorphism discovery (distinct from genotyping!) Targeted approaches Whole genome SNPs, copy number variations, insertions, deletions, etc. Expressed sequence discovery ESTs cDNAs miRNAs, etc Functional genomics ChIP Expression profiling Nucleosome positioning Structure Function
3
Three Fundamentally Different Types of Chemistries for DNA Sequencing
“Chemistry”: Jargon; meaning the specific molecular biology and chemistry utilized in the sequencing reactions Maxam-Gilbert chemical degradation of DNA infrequently used due to toxic compounds and difficulty Sequencing by synthesis (“SBS”) uses DNA polymerase in a primer extension reaction Very common approach Sanger developed it (“Sanger sequencing”) 454, Illumina, Pacific Biosciences Ligation-based sequencing using short probes that hybridize to the template novel approach SOLiD, Complete Genomics
4
5’ and 3’ 5’ 3’ 5’ PO3 3’ OH Antiparellel OH 3’ PO3 5’
Base plus sugar “nucleoside” Adenine Adenosine Guanine Guanosine Cytosine Cytidine Thymine Thymidine in DNA: “deoxyadenosine” plus triphosphate “deoxynucleotide” “2’-deoxyadenosine 5’-triphosphate” = dATP 5’ 3’ Deoxyribose PO3 OH 3’ 5’ base PO3 5’ OH 3’ If I throw in DNA polymerase and free nucleotide, which end gets extended? Antiparellel base Onto the 3’ end, in a template directed fashion Adapted From Berg et al: Biochemistry 5th ed. Freeman+Co, 2002
5
5’ 3’ Watson 5’ T A G C G T C A G C T G 3’
Crick 3’ A T C G C A G T C G A C 5’ PO3 OH 3’ 5’ base PO3 5’ OH 3’ base Adapted From Berg et al: Biochemistry 5th ed. Freeman+Co, 2002
6
Sanger Sequencing Templates
PCR product Plasmid “Clone” Plasmid backbone seq primer site seq primer site Insert In Sanger sequencing, Crick is the template and Watson’s synthesis starts at the primer’s 3’OH Watson 5’ .. T A G C G T C A G C T .. 3’ Crick 3’ .. A T C G C A G T C G A .. 5’ Primer T A G C G 3’ A T C G C A G T C G A C .. 5’ 5’ 3’
7
The Chain Terminator dideoxy
Dideoxy nucleotides cannot be further extended, and so terminate the sequence chain H 3’ 5’ base PO3 dideoxy Adapted From Berg et al: Biochemistry 5th ed. Freeman+Co, 2002
8
Original Sanger Sequencing with Radioactive Signal
Template (Crick) very low concentration of ddNTPs compared to dNTPs A nested series of DNA fragments ending in the base specified by the terminator-ddNTP Watsons Expose gel to x-ray film (to make an “auto-radiogram”) Recombinant DNA: Genes and Genomes. 3rd Edition (Dec06). WH Freeman Press.
9
Wouldn’t it be great to run everything in one lane?
CAAGTGTCTTAAC This is great but… Wouldn’t it be great to run everything in one lane? Save space and time, more efficient Fluorescently label the ddNTPs so that they each appear a different color
10
Fluorescent Sanger Sequencing: “Dye-terminators”
Each of the 4 ddNTPs is labeled with a different fluorescent dye (instead of radioactivity) Recombinant DNA: Genes and Genomes. 3rd Edition (Dec06). WH Freeman Press.
11
Fluorescent Sanger Sequencing
Direction of electro-phoresis Load on gel (modern machines use capillaries, not slab gels) dGTP dATP dTTP dCTP + One-tube sequencing reaction (note: cycle sequencing with modified Taq Polymerase)
12
Fluorescent Sanger Sequencing Trace
Lane signal (Real fluorescent signals from a lane/capillary are much uglier than this). A bunch of magic to boost signal/noise, correct for dye-effects, mobility differences, etc, generates the ‘final’ trace (for each capillary of the run) Trace Recombinant DNA: Genes and Genomes. 3rd Edition (Dec06). WH Freeman Press.
13
Sanger Base Calling Base Caller (Phred)
ugly over here ugly over there ... N A G C G T T C C G C G ... A N N ... Base Caller (Phred) Quality score = -10 * log(probability of error) For Q20, probability of error = 1/100 For Q99, probability of error ~10-10
14
Phred: The base-calling program
Algorithm based on ideas about what might go wrong in a sequencing reaction and in electrophoresis Tested the algorithm on a huge dataset of “gold standard” sequences (finished human and C. elegans sequences generated by highly-redundant sequencing) Compared the results of phred with the ABI Basecaller Phred was considerably more accurate (40-50% fewer errors), particularly for indels and particularly for the higher quality sequences (Ewing et al., 1998, Gen. Research 8: ; Ewing and Green 1998, Gen. Research 8: )
15
Progress of Sanger Sequencing Technology
Radioactive polyacrylamide slab gel electrophoresis Low throughput, labor-intensive 1 grad student could do 10 runs a day, 100 bp/run = 1000 bp/day
16
Progress of Sanger Sequencing Technology
AB slab gel sequencers Fluorescent sequencing Per machine: 6 runs/day 96 reads/run 500 bp/read 288,000 bp/day
17
Progress of Sanger Sequencing Technology
AB capillary sequencers Per machine: 24 runs/day 96 reads/run 550 – 1,000 bp/read 1-2 million bp/day ~1,000-fold increase in throughput since 1985 accomplished by incremental improvements of the same underlying technology Novel Disruptive Sequencing Technologies have x more throughput: 454 Pyrosequencing, Solexa, SOLiD
18
Whole Genome Sequencing Approaches
Hierarchical Shotgun Approach Genomic DNA BAC library Organized, Mapped Large Clone Contigs (minimal tiling path) Shotgun Clones GCAATGAAATATGTTCTTGTAATTTAAGCTGACACTCCTAATTTAGCTCTTGTCCTCTACTGAGTCTACCTAATTATATGTATGGATTGACTTGG AGCTCTTGTCCTCTACTGAGTCTACCTAATTATATGTATGGATTGACTTGGTGTTTTCTCTTTTTCTTAAATAGTAATGCAGAAAGCCTGGAGAGAGAG Reads GCAATGAAATATGTTCTTGTAATTTAAGCTGACACTCCTAATTTAGCTCTTGTCCTCTACTGAGTCTACCTAATTATATGTATGGATTGACTTGGTGTTTTCTCTTTTTCTTAAATAGTAATGCAGAAAGCCTGGAGAGAGAG Assembly
19
Whole Genome Sequencing Approaches
Shotgun Approach Genomic DNA Shotgun Clones GCAATGAAATATGTTCTTGTAATTTAAGCTGACACTCCTAATTTAGCTCTTGTCCTCTACTGAGTCTACCTAATTATATGTATGGATTGACTTGG AGCTCTTGTCCTCTACTGAGTCTACCTAATTATATGTATGGATTGACTTGGTGTTTTCTCTTTTTCTTAAATAGTAATGCAGAAAGCCTGGAGAGAGAG Reads GCAATGAAATATGTTCTTGTAATTTAAGCTGACACTCCTAATTTAGCTCTTGTCCTCTACTGAGTCTACCTAATTATATGTATGGATTGACTTGGTGTTTTCTCTTTTTCTTAAATAGTAATGCAGAAAGCCTGGAGAGAGAG Assembly
20
Rationale for Hierarchical Strategy
Better for a repeat-rich genome less misassembly of finished genome long-range misassembly largely eliminated and short-range reduced Better for an outbred organism each clone from an individual and no polymorphisms in the final sequence. (Added bonus: get SNPs from regions of overlapping clones) Better if there are cloning biases use minimum tiling path,so the same coverage for each region Easier to identify and fill gaps (from unclonable regions) sooner BUT Time consuming and expensive to make minimum tiling path
21
De Novo Whole Genome Sequencing
Make millions of random clones: “Shotgunning” Plasmid "backbone" Insert sequencing primer "forward read" "reverse read"
22
Some Key Concepts in Assembly
Mate pairs ( = paired end reads) Contigs Supercontigs ( = Scaffolds) Sequence coverage Physical coverage
23
Sequencing Read GCAATGAAATATGTTCTTGTAATTTAAGCTGACACTCCTAATTTAGCTCTTGTCCTCTACTGAGTCTACCTAATTATATGTATGGATTGACTTGGTGTTTTCTCTTTTTCTTAAATAGTAATGCAGAAAGCCTGGAGAGAGAGAAACCCCCAAGCTAGGATTTCTGCAGCTCATGAAGCCTTGGAGATAAATGAGTAAGTGGGGGAAAATCTTGCTGTTAAAAAGGAAATCTCATCCTTTGCTGAATATATTCAGTTGCCATTGATAGGATACTTAAATTAAACTGCATTTGAACTGGAGGATTATTTGGGGAGTTATTACTCTATTTAAAAAAGTTTTTTTTTTAAATGAAGGACAGCCACCATGTGGAGGTGGTTTTAGTCATTTTATGAATTCAATGGCTTTGCTGTGATCCTAAATTAATTTCTTGAAGGGCTATCCCTAGGATATTGTGAGGATATAAAATAAATACAATTCTTTACATATCTAAAACATTCTGACAGGGAAAATTTTCCAGATGTAGAATGCTCATCTGCACTAGAACATTTTCTAGTAGAACTTCTGCTAGTGGGGAAAACATGATAACAACATAAGGTTTAAAAAAAAAATTTTAGAAAATACTTCAAGATTAAGACAAAGATAAGAGGAAATGCTGTCTTGAGTGTTGTTAAACATTCTGTGGGTTACCAAGGAAGGCTGGGAAATCTCTTCTGGAGATCTCAGAAAATGAGAAAGATTCTTAAAGTTGGAGTCATAAAAACTCAGGGTTGGCAGAGACCTTAAAGGTCACTTAGCTGAACCACCCATCTGGTGCTTGAATCACCTCAACACTATCCTTGCCAAGTGGTCATTGTTAAACTATTTTATGATTTTTCTGAAGAAGGTTACAGAATCTTCTTCAGAGATCTTAGGGAAAAAAAAAAAAGATTGTCGTGAGAGTTGAAAATCCTGCCATTGTAACCAGTTGATCTACGGTTTCTGATTCTGTCATGCAACATATTTATTTTCCAGTTTCTTGTCATCTACAAATTCGATATGCCTGCCTTCTGTGTGTCATCCATATTTCTGAGAAAAATATGAAGGCCAGGAATAGAGCCCTGTGACATGACATAGAAACTACCCTCCAGGTTCATGTCTTCATGAATCACCATCTTTTGTATTGTTCACTCAATTACTAAGCCACCCAGTTACACTGTGACTCAGCTCATATTTCTCCATTTGGATCTTAAGAATGCCAATCGTAGCTGCGGATCTTAAATTTATAGTAAATCTATTACAGTAAATTAAGCTAGCACAATCTGATTTATTTATTCTTAGTGAATATAAGCTGGCTTCTAGTCGTCACTACTTTCTTTTTAAAGTGCTTGGAGACCATTCCTTTAATAATCCATTAGAATATCTTTCCAAATCACTGTGTTCTGTAGTTTGGGAAGTCTGCCTTCTTCCCCTTTTTGAAAATTTATGCTACATTTATCATCTCATCTTCTAGCACCTCTCCATTCTTTGTGATTCCTCAACTATCCACAGAGAGCAATTCCATGGCCTGCCTACAAGGTCTTTCGGTTTCCTGGGATTTGCCCATCCAGTCCAGTAATTCATTTAGAATGGATCAATTATTTGCTATCTTACATCTTTTTACCCATTTTAGAGTTTAATTTCTTCTCCCTTTTTCAGTCTGACAGTCATTCTCCTTGATAGAGAAGCCAGGAACAAAATAGGAGGGAGAGAGTTTTGCTTTTTCTTTATTATCTACTGCTTTTAACAATAAACCTTCCTTGTTTTGATGTTATTATGTTGTTTGTCTTTTTTTTTTACTTATTTGCCTTTGTGACATGGGGACGGTGATAGGGCCTTAAATATAATTTTAAAATAGGGAATAAATGGTTGTCTTTAGTATTTTATTTTGTTTTATTATTATTATTATTATTGTTATTTTTGCAAGCTTCAGCTAATTTGGAATTGTAGCTCTCCTGACATTATTCTTATAAGCTCATTCCACTCTCTTATAGACCATCATTACATGCCCTCTTTCCATCTTTTAAAATATGTCCTTTAAAAATCTGACCTGGGAGAAATCTCTGTGAAGCCGTGTTGGTTACTTAAGTGCCACCCCTCTTTTCTTCCTGAGAGGATCATTTGTGATTGCAGTTACAGTTGA GCAATGAAATATGTTCTTGTAATTTAAGCTGACACTCCTAATTTAGCTCTTGTCCTCTACTGAGTCTACCTAATTATATGTATGGATTGACTTGGTGTTTTCTCTTTTTCTTAAATAGTAATGCAGAAAGCCTGGAGAGAGAGAAACCCCCAAGCTAGGATTTCTGCAGCTCATGAAGCCTTGGAGATAAATGAGTAAGTGGGGGAAAATCTTGCTGTTAAAAAGGAAATCTCATCCTTTGCTGAATATATTCAGTTGCCATTGATAGGATACTTAAATTAAACTGCATTTGAACTGGAGGATTATTTGGGGAGTTATTACTCTATTTAAAAAAGTTTTTTTTTTAAATGAAGGACAGCCACCATGTGGAGGTGGTTTTAGTCATTTTATGAATTCAATGGCTTTGCTGTGATCCTAAATTAATTTCTTGAAGGGCTATCCCTAGGATATTGTGAGGATATAAAATAAATACAATTCTTTACATATCTAAAACATTCTGACAGGGAAAATTTTCCAGATGTAGAATGCTCATCTGCACTAGAACATTTTCTAGTAGAACTTCTGCTAGTGGGGAAAACATGATAACAACATAAGGTTTAAAAAAAAAATTTTAGAAAATACTTCAAGATTAAGACAAAGATAAGAGGAAATGCTGTCTTGAGTGTTGTTAAACATTCTGTGGGTTACCAAGGAAGGCTGGGAAATCTCTTCTGGAGATCTCAGAAAATGAGAAAGATTCTTAAAGTTGGAGTCATAAAAACTCAGGGTTGGCAGAGACCTTAAAGGTCACTTAGCTGAACCACCCATCTGGTGCTTGAATCACCTCAACACTATCCTTGCCAAGTGGTCATTGTTAAACTATTTTATGATTTTTCTGAAGAAGGTTACAGAATCTTCTTCAGAGATCTTAGGGAAAAAAAAAAAAGATTGTCGTGAGAGTTGAAAATCCTGCCATTGTAACCAGTTGATCTACGGTTTCTGATTCTGTCATGCAACATATTTATTTTCCAGTTTCTTGTCATCTACAAATTCGATATGCCTGCCTTCTGTGTGTCATCCATATTTCTGAGAAAAATATGAAGGCCAGGAATAGAGCCCTGTGACATGACATAGAAACTACCCTCCAGGTTCATGTCTTCATGAATCACCATCTTTTGTATTGTTCACTCAATTACTAAGCCACCCAGTTACACTGTGACTCAGCTCATATTTCTCCATTTGGATCTTAAGAATGCCAATCGTAGCTGCGGATCTTAAATTTATAGTAAATCTATTACAGTAAATTAAGCTAGCACAATCTGATTTATTTATTCTTAGTGAATATAAGCTGGCTTCTAGTCGTCACTACTTTCTTTTTAAAGTGCTTGGAGACCATTCCTTTAATAATCCATTAGAATATCTTTCCAAATCACTGTGTTCTGTAGTTTGGGAAGTCTGCCTTCTTCCCCTTTTTGAAAATTTATGCTACATTTATCATCTCATCTTCTAGCACCTCTCCATTCTTTGTGATTCCTCAACTATCCACAGAGAGCAATTCCATGGCCTGCCTACAAGGTCTTTCGGTTTCCTGGGATTTGCCCATCCAGTCCAGTAATTCATTTAGAATGGATCAATTATTTGCTATCTTACATCTTTTTACCCATTTTAGAGTTTAATTTCTTCTCCCTTTTTCAGTCTGACAGTCATTCTCCTTGATAGAGAAGCCAGGAACAAAATAGGAGGGAGAGAGTTTTGCTTTTTCTTTATTATCTACTGCTTTTAACAATAAACCTTCCTTGTTTTGATGTTATTATGTTGTTTGTCTTTTTTTTTTACTTATTTGCCTTTGTGACATGGGGACGGTGATAGGGCCTTAAATATAATTTTAAAATAGGGAATAAATGGTTGTCTTTAGTATTTTATTTTGTTTTATTATTATTATTATTATTGTTATTTTTGCAAGCTTCAGCTAATTTGGAATTGTAGCTCTCCTGACATTATTCTTATAAGCTCATTCCACTCTCTTATAGACCATCATTACATGCCCTCTTTCCATCTTTTAAAATATGTCCTTTAAAAATCTGACCTGGGAGAAATCTCTGTGAAGCCGTGTTGGTTACTTAAGTGCCACCCCTCTTTTCTTCCTGAGAGGATCATTTGTGATTGCAGTTACAGTTGA
24
Mate Pair Sequencing Reads
GCAATGAAATATGTTCTTGTAATTTAAGCTGACACTCCTAATTTAGCTCTTGTCCTCTACTGAGTCTACCTAATTATATGTATGGATTGACTTGGTGTTTTCTCTTTTTCTTAAATAGTAATGCAGAAAGCCTGGAGAGAGAGAAACCCCCAAGCTAGGATTTCTGCAGCTCATGAAGCCTTGGAGATAAATGAGTAAGTGGGGGAAAATCTTGCTGTTAAAAAGGAAATCTCATCCTTTGCTGAATATATTCAGTTGCCATTGATAGGATACTTAAATTAAACTGCATTTGAACTGGAGGATTATTTGGGGAGTTATTACTCTATTTAAAAAAGTTTTTTTTTTAAATGAAGGACAGCCACCATGTGGAGGTGGTTTTAGTCATTTTATGAATTCAATGGCTTTGCTGTGATCCTAAATTAATTTCTTGAAGGGCTATCCCTAGGATATTGTGAGGATATAAAATAAATACAATTCTTTACATATCTAAAACATTCTGACAGGGAAAATTTTCCAGATGTAGAATGCTCATCTGCACTAGAACATTTTCTAGTAGAACTTCTGCTAGTGGGGAAAACATGATAACAACATAAGGTTTAAAAAAAAAATTTTAGAAAATACTTCAAGATTAAGACAAAGATAAGAGGAAATGCTGTCTTGAGTGTTGTTAAACATTCTGTGGGTTACCAAGGAAGGCTGGGAAATCTCTTCTGGAGATCTCAGAAAATGAGAAAGATTCTTAAAGTTGGAGTCATAAAAACTCAGGGTTGGCAGAGACCTTAAAGGTCACTTAGCTGAACCACCCATCTGGTGCTTGAATCACCTCAACACTATCCTTGCCAAGTGGTCATTGTTAAACTATTTTATGATTTTTCTGAAGAAGGTTACAGAATCTTCTTCAGAGATCTTAGGG
25
Assembly: Contigs and Supercontigs
"Supercontig" or "Scaffold" "Contig" Seq gap NNNNNN number of N's in sequence = estimated size
26
Why Different Insert Sizes are Useful
Longer (fosmid) mate pairs connect assembly pieces that are not connected by shorter (plasmid) mate pairs
27
Some Key Concepts in Assembly
Mate pairs ( = paired end reads) Contigs Supercontigs ( = Scaffolds) Sequence coverage Physical coverage
28
Sequence Coverage? Each Read = 600 bases 24,000 bp Reverse Reads
Forward Reads 24,000 bp
29
What if mean insert size = 4 kb?
‘Physical’ Coverage? Mean insert size = 40 kb 40,000 bp What if mean insert size = 4 kb?
30
Draft Assemblies are not Perfect
Sequence coverage may vary (Ciona is 12x, dog is 6x, shrew is 2x) missing regions strong fragmentation Some regions don’t clone well results in low sequence coverage which causes gaps in assembly Some regions don’t sequence well extreme GC content homopolymeric or otherwise low-complexity runs Some regions don’t assemble well mobile elements high identity, large copy number segmental duplications Polymorphism
31
454 Sequencing Most viable “UHTS” method for de novo genome sequencing
Still sequencing by synthesis, but fundamentally different Uses pyrosequencing (light produced as a by product of base extension) No cloning required Sequences ~1,000,000 reads simultaneously Each read is ~400bp long Takes ~10 hours for machine run
32
454 Library Generation (1) Genomic DNA Fragment (shear or restrict)
Add adaptors
33
454 Library Generation (2) Adaptor DNA adapter ends plus beads
~ 1 DNA molecule and 1 bead per droplet Make water-in-oil emulsion In brief, an in vitro–constructed adaptor flanked shotgun library (shown as gold and turquoise adaptors flanking unique inserts) is PCR amplified (that is, multi-template PCR, not multiplex PCR, as only a single primer pair is used, corresponding to the gold and turquoise adaptors) in the context of a water-in-oil emulsion. One of the PCR primers is tethered to the surface (5′-attached) of micron-scale beads that are also included in the reaction. A low template concentration results in most bead-containing compartments having either zero or one template molecule present. In productive emulsion compartments (where both a bead and template molecule is present), PCR amplicons are captured to the surface of the bead. After breaking the emulsion, beads bearing amplification products can be selectively enriched. Each clonally amplified bead will bear on its surface PCR products corresponding to amplification of a single molecule from the template library. emPCR Most beads with with multiple copies of individual DNA molecules
34
454 Sequencing Process Beads loaded into wells Enzymes added
1) The emulsion is broken, the DNA strands are denatured, and beads carrying single-stranded DNA templates are enriched (not shown) and deposited into wells of a fiber-optic slide. 2) Smaller beads carrying immobilized enzymes required for a solid phase pyrophosphate sequencing reaction are deposited into each well. 3) Sequencing gives off light
35
Pyrogram
36
Sequencing Strategy with 454
High Coverage (~30x) of single end reads One tenth coverage with paired end reads (provides greater physical coverage) Assemble genome using both sets of data – paired end reads provide long range continuity.
37
454 Advantages, Limitations and Future
Fast, accurate Great for small, simple genomes Disadvantages Reads still relatively short (~400bp) Doesn’t yet work well for de novo sequencing of large genomes Homopolymer stretches (8+) are difficult to read Future Improved chemistry is expected to lead to read lengths > 1000 bp in 2010.
38
The Players Commercially available now:
Solexa (Illumina) SOLiD (Applied Biosystems/Life Technologies) Helicos (Complete Genomics) Next next generation approaches two or three potentially viable ones are in R+D phase (E.g. Pacific Biosciences, Nabys, Visigen) may become commercially available soon
39
How UHTS Differs from Sanger
Data Amount of data generated per day or per run (“throughput”) Type of data Quality of data (error) Process of Data Generation: Molecular Biology: Library construction Molecular Biology: Sample prep Sequencing: Chemistry Sequencing: Detection Analysis: Image Processing Analysis: Read Analysis In other words: it’s a completely different beast
40
Read Length 700 bases of genome, 1x sequence coverage
100 200 300 400 500 600 700 Sanger 454 Sol* [Consider DNA sequence variation vs functional genomics]
41
Further Important Differences
Parameter Sanger (AB 3730) 454 (Roche) Solexa (Illumina) SOLiD (AB) Read L (bp) 700 400 35-100 50 Number of reads per run [per day] 96 [2000] 1,000,000 [2,000,000] 1,000,000,000 [600,000,000] [200,000,000] Mb per day 1.4 80 Up to 25,000 Up to 12,500 SNP error rate low high (~1%) high (>1%) Indel error rate high
42
How UHTS Differs from Sanger
Data Amount of data generated per day or per run (“throughput”) Type of data Quality of data (error) Process of Data Generation: Molecular Biology: Library construction Molecular Biology: Sample prep Sequencing: Chemistry Sequencing: Detection Analysis: Image Processing Analysis: Read Analysis
43
“Fragment” Library Solexa, SOLiD: L <200 bp Seq Primer “Insert”
Left amplification handle Right amplification handle Solexa, SOLiD: L <200 bp
44
Making Paired Tags Conceptually Simple ... Practically Somewhat Difficult, though now fairly standard long gDNA fragment molecular biology magic ... on to Amplification and Sequencing
45
Generating DiTag Libraries
Paired Tag for clonal amplification-based sequencing Sheared gDNA (Long .. > 2kb!) >snip< with EcoP15I Adapter with EcoP15I sites Seq Primer 1 Tag 1 Left single molecule amplification handle Right single molecule amplification handle Seq Primer 2 Tag 2 Add linkers
46
How UHTS Differs from Sanger
Data Amount of data generated per day or per run (“throughput”) Type of data Quality of data (error) Process of Data Generation: Molecular Biology: Library construction Molecular Biology: Sample prep Sequencing: Chemistry Sequencing: Detection Analysis: Image Processing Analysis: Read Analysis
47
Sample Preparation Some approaches use a single DNA molecule as a sequencing template do exist (Helicos, Pacific Biosciences). Challenges include: Keeping single molecules stable during insults of sequencing Signal to noise ratio in base detection BUT Avoid amplification biases Next-best Solution: Clonal Amplification of Single Molecules Single molecule only briefly needed as a template, not during the entire time of the sequencing reaction Thousands of identical molecules boost signal Two different methods Emulsion PCR 454 and Applied Biosystems (SOLiD) Bridge amplification of molecules immobilized on surface Solexa system (Illumina)
48
Illumina Sequencing Technology Robust Reversible Terminator Chemistry Foundation
5’ 3’ G T C A DNA ( ug) T G C A G T Sample preparation Cluster growth Single molecule array Sequencing 1 2 3 7 8 9 4 5 6 Base calling T G C T A C G A T … Image acquisition
49
Sequencing 20 Microns 100 Microns 250+ Million Clusters Per Flow Cell
Purpose: To visually demonstrate sequencing by synthesis Details: This is an actual cluster pattern from a flow cell (4-color overlay) If you zoom in on one small section (the 20 micron box) you can follow each cluster through sequencing by synthesis Conclusion: This is the visual set up to the following slide 20 Microns 100 Microns
50
How UHTS Differs from Sanger
Data Amount of data generated per day or per run (“throughput”) Type of data Quality of data (error) Process of Data Generation: Molecular Biology: Library construction Molecular Biology: Sample prep Sequencing: Chemistry Sequencing: Detection Analysis: Image Processing Analysis: Read Analysis
51
Detection, Chemistry Massively Parallel Detection On immobilized “molecular colonies” Means you have to measure (image) every cycle Instead of the Sanger model (letting reaction go to completion and then separating products by size) Requires specially designed ‘chemistry’ Solexa: reversible dye-terminators (polymerase) AB: ligation-based (ligase; 3’->5’ !)
52
Solexa Sequencing: Reversible Terminators
fluorophore Detection O HN N 3’ DNA Incorporate Ready for Next Cycle O DNA HN N 3’ free 3’ end OH Deblock and Cleave off Dye cleavage site O HN O N PPP O 3’ 3’ OH is blocked
53
Sequencing by Synthesis, One Base at a Time
3’- …-5’ G T C T A G A G A C T T C T G Cycle 1: Add sequencing reagents First base incorporated Remove unincorporated bases Detect signal Cycle 2-n: Add sequencing reagents and repeat
54
How UHTS Differs from Sanger
Data Amount of data generated per day or per run (“throughput”) Type of data Quality of data (error) Process of Data Generation: Molecular Biology: Library construction Molecular Biology: Sample prep Sequencing: Chemistry Sequencing: Detection Analysis: Image Processing Analysis: Read Analysis
55
Sol* (composite of four images, false-colored)
20 microns 100 microns
56
Flow Cells and Imaging Solexa: single 8-channel
Reagents flowed in here Image for each channel is actually tiled (four images per panel, one for each color) Out to waste
57
Amount of Image Data ~1 TB image data per run ~3 TB image data per run
Solexa: single 8-channel First chemistry, then imaging SOLiD: dual flow cell, 8 quadrants alternating chemistry and imaging Number of image panels in each channel: Number of image panels in each quadrant: 27 columns X 17 rows 3672 for both flow cells 3 columns X 100 rows 2400 for each flow cell times 4 images times 35 cycles 336,000 images 514,080 images ~1 TB image data per run ~3 TB image data per run
58
Image Processing, Base Calling
Image processing algorithms find signals in each panel, align signals from different panels, etc Machines ship with server or small cluster that does image analysis while run is happening Sequence data after base calling much reduced in size (tens of gigabytes) => more manageable but still large amounts that add up over time
59
How UHTS Differs from Sanger
Data Amount of data generated per day or per run (“throughput”) Type of data Quality of data (error) Process of Data Generation: Molecular Biology: Library construction Molecular Biology: Sample prep Sequencing: Chemistry Sequencing: Detection Analysis: Image Processing Analysis: Read Analysis
60
Placement (Mapping) of Short Reads
Requires reference genome (hence “placement” or “mapping”; meaning alignment of read to genome) Placement can be ambiguous because reads are short and error-prone (error rates matter greatly!) Read Placement is a significant computational challenge because of the sheer amount of data Need lots of disk space Need a cluster Need lots of RAM Commercial vendors (Illumina, AB) have decided on fast algorithms that *only* tolerate mismatches (and not many) Example: 3 Gigabases of read data in 35 bp chunks (1x mammalian genome coverage) takes 30 CPU hours to align with the fastest mismatch-only software … To make the most of the data you may need fast and robust algorithms typically written by computer scientists. ... whereas it takes 125 days with Mike Brudno’s SHRIMP (a superfast Smith-Waterman implementation explicitly developed for short reads)
61
Useful Papers Technology: Quality: Assembly:
Sanger, F., Nicklen, S. and Coulson, A.R. (1977). DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A. 74, Ronaghi, M., Uhlén, M. and Nyrén, P. (1998). A sequencing method based on real-time pyrophosphate. Science 281, Margulies, M. et al. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, Eid, J. et al. (2009). Real-time DNA sequencing from single polymerase molecules. Science. 323, Drmanac, R. et al. (2010). Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays. Science 327, Quality: Ewing, B., Hillier, L., Wendl, M.C. and Green, P. (1988). Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, Ewing, B. and Green, P. (1998). Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, Assembly: Batzoglou, S., Jaffe, D.B., Stanley, K., Butler, J., Gnerre, S., Mauceli, E., Berger, B., Mesirov, J.P. and Lander, E.S. (2002). ARACHNE: a whole-genome shotgun assembler. Genome Res. 12, Jaffe, D.B., Butler, J., Gnerre, S., Mauceli, E., Lindblad-Toh, K., Mesirov, J.P., Zody, M.C. and Lander, E.S. (2003). Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91-6.
63
SOLiD Emulsion PCR on Beads
Purify ~2x108 Beads Spread Beads on Flow Cell For Seq’ing From Dressman et al. (2003) Proc. Natl. Acad. Sci. USA 100,
64
4-6 M Emulsions with 1 M Beads (SOLiD)
reactors containing a bead are false-colored Courtesy of Gina Costa (AB)
65
Flow Cells with “Molecular Colonies”
SOLiD enrich for beads that productively amplified spread out beads randomly on microscope slide to which the beads stick via covalent bond ( <- important; beads must not move during sequencing) Solexa flow cell with randomly spaced molecular clusters spacing depends on initial seeding of the single molecules onto the flow cell
66
Clonally Amplified Molecules on Flow Cell
67
Scale 50-500 Million beads or clusters per flow cell
that’s where the throughput comes from
68
SOLiD: Multiple Cycles and “Dinucleotide Color Space”
seq primer Bind Primer Ligation 1 + Cleave 1 Ligation 2 Cleave 2 Ligation 3 Cleave 3 (5’p) 5’ 3’ Bead XYNNN-III 5’ Cycle 1 dinuc 1 Cycle 2 dinuc 6 Cycle 3 dinuc 11 Cycle n dinuc 16 5’ XYNNN (5’P) 5’ Phosphorothiolate Linkage Cleaves with neutral pH, aqueous conditions in seconds Removes dye, cleaves off Inosines, and produces 5’ Phosphate (native DNA end) in a single step XYNNNXYNNN-III 5’ 5’ XYNNNXYNNN (5’P) 5’ XYNNNXYNNNXYNNN-III 5’ 5’ XYNNNXYNNNXYNNN (5’P) 5’ XY4 XY3 XY2 XY1 GC TC TG TT CG CT GT GG TA GA CA CC AT AG AC AA Cy5 TXR Cy3 FAM Dinuc (repeat)
69
SOLiD Sequencing Anneal sequencing primer 5’P 5’
insert (watson!) sequencing primer site amplification handle 1 bead loaded up with thousands of identical molecules ~ one shown for illustration
70
Sixteen probes, but only four colors
SOLiD Sequencing TCNNNIII 5’ AG CCNNNIII ACNNNIII AGNNNIII ATNNNIII TTNNNIII CANNNIII GANNNIII CGNNNIII AANNNIII TGNNNIII TCNNNIII GCNNNIII GGNNNIII GTNNNIII CTNNNIII TANNNIII Sixteen probes, but only four colors
71
SOLiD Sequencing TCNNN ATNNN CC GGNNN AG TA CCNNNIII ACNNNIII AGNNNIII
5’ AG TA CCNNNIII ACNNNIII AGNNNIII ATNNNIII TTNNNIII CANNNIII GANNNIII CGNNNIII AANNNIII TGNNNIII TCNNNIII GCNNNIII GGNNNIII GTNNNIII CTNNNIII TANNNIII
72
SOLiD Sequencing TCNNNATNNNGGNNNGTNNNTANNNGCNNN
After five cycles TCNNNATNNNGGNNNGTNNNTANNNGCNNN 5’ AGTGCTAGATCCGAGGACTCATCGCCGATACG CCNNNIII ACNNNIII AGNNNIII ATNNNIII TTNNNIII CANNNIII GANNNIII CGNNNIII AANNNIII TGNNNIII TCNNNIII GCNNNIII GGNNNIII GTNNNIII CTNNNIII TANNNIII
73
SOLiD Sequencing CANNNTCNNNGCNNNTGNNNAGNNNCTNNN
“Reset”: Melt off previous extension products and repeat five cycles with offset of one base CANNNTCNNNGCNNNTGNNNAGNNNCTNNN 5’ AGTGCTAGATCCGAGGACTCATCGCCGATACG
74
SOLiD Sequencing GANNNAGNNNCCNNNGTNNNGGNNNTGNNN
|||||||||||||||||||||||||||||| Align in color space! GANNNAGNNNCCNNNGTNNNGGNNNTGNNN CGNNNTANNNTCNNNAGNNNCGNNNATNNN ACNNNCTNNNCTNNNGANNNGCNNNTANNN CANNNTCNNNGCNNNTGNNNAGNNNCTNNN TCNNNATNNNGGNNNGTNNNTANNNGCNNN 5’ AGTGCTAGATCCGAGGACTCATCGCCGATACG BUTBUTBUT ... how do I work with color space?!? ...GACCCGTACGCCGTAGTGCTAGATCCGAGGACTCATCGCCGATACGAT translate genome into color space ...
75
Translating color space back to base space
2nd Base A C G T A C 1st Base G T
Similar presentations
© 2025 SlidePlayer.com Inc.
All rights reserved.