Sequencing and Assembly GEN875, Genomics and Proteomics, Fall 2010.


History of DNA Sequencing. Milestones in chronological order: Miescher discovers DNA; Avery proposes DNA as ‘genetic material’; Watson & Crick: double helix structure of DNA; Holley sequences yeast tRNA-Ala; Wu sequences cohesive end DNA; Sanger: dideoxy chain termination; Gilbert: chemical degradation; Messing: M13 cloning; Hood et al.: partial automation; cycle sequencing, improved sequencing enzymes, and improved fluorescent detection schemes; next generation sequencing, with improved enzymes and chemistry and improved image processing. [Timeline figure: efficiency (bp/person/year) plotted against year.] Adapted from Francis Ouellette; adapted from Eric Green, NIH; adapted from Messing & Llaca, PNAS (1998).

Explosive Growth in Sequencing 8/22/2005 Press Release: INSD (GenBank, EMBL, DDBJ) reaches 100 Gigabase milestone

What do we sequence? Genomes (de novo, resequencing). Metagenomes or complex samples. Transcripts. Fragments recovered by ChIP or tagged in some other way.

NCBI Genomes. Comparison of data from 9/4/08, 9/5/07, 9/4/06 and 8/31/05:

                         9/4/08   9/5/07   9/4/06   8/31/05
Eukaryotic genomes
  Complete                   23       25       22        20
  Assembly                  230      162      109        72
  In progress               229      235      299       166
Prokaryotic genomes
  Complete                  745      567      371       254
  In progress              1215      841      615       433

NCBI Genomes 9/6/2010

Sequencing Platforms. Sanger sequencing and capillary electrophoresis. Massively parallel pyrosequencing (454). “Proprietary Clonal Single Molecule Array technology and novel reversible terminator-based sequencing” (Illumina). Sequencing by ligation (ABI SOLiD). Single molecule sequencing (PacBio).

Basics of the “old” technology. Clone the DNA. Generate a ladder of labeled (colored) molecules that differ in length by 1 nucleotide. Separate the mixture on some matrix. Detect the fluorochrome by laser. Interpret peaks as a string of DNA. Strings are 500 to 1,000 letters long. 1 machine generates 57,000 nucleotides/run. Assemble all strings into a genome. Adapted from Francis Ouellette.

Library construction and sequencing. Steps, in order: sample; isolate DNA; physical fragmentation; size selection; ligate randomly into vectors; transformation; plate on agar; pick and grow individual colonies; isolate cloned constructs; cycle sequencing. [Figure label: high-throughput steps.]

Dual Ended Sequencing Can Provide Information to Link Contigs. [Figure: 5 kb insert flanked by Primer A and Primer B.] Sequencing with primers that begin in the vector on either side of the insert yields about 800 bp of DNA sequence from each end of the insert. The middle of the insert is never sequenced for most clones used in the project.

Basics of the “new” technology. Get DNA. Attach it to something. Extend and amplify the signal with some labeling scheme. Detect the fluorochrome by microscopy. Interpret the series of spots as short strings of DNA. Strings are letters long. Multiple images are interpreted as 0.4 to 1.2 GB/run (1,200,000,000 letters/day). Map or align the strings to one or many genomes, or assemble them. Adapted from Francis Ouellette.

Differences between the various platforms: Nanotechnology used. Resolution of the image analysis. Chemistry and enzymology. Signal-to-noise detection in the software. Software/images/file size/pipeline. Cost ($$$). Adapted from Francis Ouellette.

Next Generation DNA Sequencing Technologies. Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome”.

Roche 454: pyrosequencing. Margulies M, Egholm M, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005 Sep 15;437(7057):376-80.

GS FLX Titanium Series: The GS FLX Titanium series reagents run on the Genome Sequencer FLX Instrument, a system based on 454's sequencing-by-synthesis technology. The GS FLX Titanium series improves on the current system with upgraded reagents, consumables, and software.

Throughput       million high-quality, filter-passed bases per run*; 1 billion bases per day
Run time         10 hours
Read length      Average length = 400 bases
Accuracy         Q20 read length of 400 bases (99% at 400 bases and higher for prior bases)
Reads per run    >1 million high-quality reads

Solexa-based Whole Genome Sequencing Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome”

Solexa-based Whole Genome Sequencing Solexa flow cell ~50M clusters are sequenced per flow cell. Adapted from Richard Wilson, School of Medicine, Washington University, “Sequencing the Cancer Genome”

Debbie Nickerson, Department of Genome Sciences, University of Washington,

Genome-scale Sequence Analysis. De novo assembly. Templated assembly. Read mapping or alignment to a reference genome.

“The choice of alignment or assembly algorithm is strongly influenced by both the experiment in question and the details of the sequencing technology used. The performance characteristics of the sequencing machines are changing rapidly, and any delineation of performance characteristics such as machine capacity, run time or read length and its relationship to error profile will quickly be outdated.”

Assemblers. Greedy assemblers: compare all reads to each other, then join them in order of overlap size, largest overlap first. [Figure 8. Greedy assembly of four reads.]
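The greedy strategy can be sketched in a few lines. This is a minimal illustration, not the code of any real assembler: the function names, the `min_len` threshold, and the four example reads are all assumptions made for the example. It ignores sequencing errors, reverse complements, and repeats.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len` (hypothetical cutoff).
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(seqs) > 1:
        best_n, best_i, best_j = 0, -1, -1
        # All-against-all comparison: this is the quadratic step.
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: what is left are separate contigs
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)
    return seqs


# Four made-up reads sampled from the toy sequence ATGGCGTGCAATCCG.
reads = ["ATGGCGTG", "GCGTGCAA", "TGCAATCC", "CAATCCG"]
contigs = greedy_assemble(reads)
print(contigs)
```

Because the largest overlap is always taken first, a repeat longer than the true overlaps can pull the wrong reads together. This is one reason other assembler designs exist.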

Assemblers. Overlap graph assemblers: make a graph where each node represents a read and edges between them represent overlaps. [Figure 9. Overlap graph for a bacterial genome. The thick edges in the picture on the left (a Hamiltonian cycle) correspond to the correct layout of the reads along the genome (figure on the right). The remaining edges represent false overlaps induced by repeats (exemplified by the red lines in the figure on the right).]
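A small sketch of an overlap graph may make the node and edge definitions concrete. The helper names, the `min_len` cutoff, and the example reads are assumptions for illustration. The brute-force search for a path that visits every read once is only feasible for a handful of reads, and real assemblers do not work this way.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len):
    """Longest suffix of `a` matching a prefix of `b`, or 0 if below `min_len`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Map each read (node) to {neighbour read: overlap length} (directed edges)."""
    graph = {r: {} for r in reads}
    for a in reads:
        for b in reads:
            if a != b:
                n = suffix_prefix_overlap(a, b, min_len)
                if n:
                    graph[a][b] = n
    return graph


def hamiltonian_layouts(graph):
    """Every ordering of the reads in which consecutive reads overlap.

    Brute force over all permutations: for tiny toy examples only.
    """
    return [
        p for p in permutations(graph)
        if all(b in graph[a] for a, b in zip(p, p[1:]))
    ]


reads = ["ATGGCGTG", "GCGTGCAA", "TGCAATCC", "CAATCCG"]
graph = build_overlap_graph(reads)
for read, edges in graph.items():
    print(read, "->", edges)
print(hamiltonian_layouts(graph))
```

In this toy example the edge from `GCGTGCAA` to `CAATCCG` is a transitive overlap: it is implied by the two longer overlaps through `TGCAATCC`.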

Assembly with dual-ended sequencing. [Figure labels: sequence assembly; contigs joined by overlaps; contigs linked by a spanning clone.] Scaffold: two or more linked contigs.
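One way to picture scaffolding is as grouping contigs that are connected by spanning clones. The sketch below assumes we already know which contig each end read was placed in. The data layout, the read and contig names, and the `min_links` threshold are hypothetical, and the sketch groups contigs only: it does not order or orient them.

```python
from collections import Counter


def scaffold_contigs(read_to_contig, mate_pairs, min_links=2):
    """Group contigs into scaffolds using clone end-read (mate) pairs.

    read_to_contig: dict mapping read name -> contig name
    mate_pairs:     list of (read_a, read_b) from the two ends of one clone
    min_links:      hypothetical minimum number of spanning clones required
    """
    # Count clones whose two ends landed in different contigs.
    links = Counter()
    for a, b in mate_pairs:
        ca, cb = read_to_contig.get(a), read_to_contig.get(b)
        if ca is not None and cb is not None and ca != cb:
            links[frozenset((ca, cb))] += 1

    # Union-find over contigs.
    parent = {c: c for c in set(read_to_contig.values())}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for pair, n in links.items():
        if n >= min_links:
            x, y = tuple(pair)
            parent[find(x)] = find(y)

    groups = {}
    for c in parent:
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(g) for g in groups.values())


read_to_contig = {
    "clone1_f": "contig1", "clone1_r": "contig2",
    "clone2_f": "contig1", "clone2_r": "contig2",
    "clone3_f": "contig2", "clone3_r": "contig3",
    "clone4_f": "contig4", "clone4_r": "contig4",
}
mate_pairs = [("clone1_f", "clone1_r"), ("clone2_f", "clone2_r"),
              ("clone3_f", "clone3_r"), ("clone4_f", "clone4_r")]
print(scaffold_contigs(read_to_contig, mate_pairs))
```

Requiring more than one spanning clone per link is a common precaution against chimeric clones, though the right threshold depends on the library and the coverage.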

Repeat handling. Screen out known repeats and set them aside for later. Infer repetitiveness based on coverage. First assemble unambiguous overlaps, then resolve repeats using mate pairs.

Assemblers and short reads. Full overlap assemblers compare all reads against all other reads, so they scale quadratically with the number of reads. This is computationally intractable for large NGS datasets. It led to the development of k-mer based methods: a de Bruijn graph with a node for every k-mer observed in the sequence set and an edge between nodes if these two k-mers are observed adjacently in a read.
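The k-mer graph described above can be built in a single pass over the reads, without comparing reads to each other. The following is a minimal sketch under simplifying assumptions: forward strand only, no error correction, and made-up reads. The function name is illustrative.

```python
def de_bruijn_graph(reads, k):
    """Node for every k-mer seen in the reads; edge a -> b when k-mer b
    immediately follows k-mer a in some read (they overlap by k - 1 bases)."""
    graph = {}
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph.setdefault(kmer, set())
        for a, b in zip(kmers, kmers[1:]):
            graph[a].add(b)
    return graph


# Two made-up reads that share the k-mer CGT and then diverge,
# which creates a fork in the graph.
reads = ["ACGTC", "CGTA"]
graph = de_bruijn_graph(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

The work here grows with the total number of bases in the reads rather than with the square of the number of reads. The cost moves to memory, since every distinct k-mer becomes a node.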

Figure 2. Differences between an overlap graph and a de Bruijn graph for assembly. Based on the set of 10 8-bp reads (A), we can build an overlap graph (B) in which each read is a node, and overlaps >5 bp are indicated by directed edges. Transitive overlaps, which are implied by other longer overlaps, are shown as dotted edges. In a de Bruijn graph (C), a node is created for every k-mer in all the reads; here the k-mer size is 3. Edges are drawn between every pair of successive k-mers in a read, where the k-mers overlap by k − 1 bases. In both approaches, repeat sequences create a fork in the graph. Note here we have only considered the forward orientation of each sequence to simplify the figure.

Figure 1. The k-mer uniqueness ratio for five well-known organisms and one single-celled human parasite. The ratio is defined here as the percentage of the genome that is covered by unique sequences of length k or longer. The horizontal axis shows the length in base pairs of the sequences. For example, ~92.5% of the grapevine genome is contained in unique sequences of 100 bp or longer.
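The ratio in this caption can be approximated for a small sequence as follows. This is one possible reading of the definition, not necessarily the exact procedure behind the figure. It assumes a position counts as covered if it lies inside at least one k-mer that occurs exactly once in the sequence, and it looks at the forward strand only.

```python
from collections import Counter


def kmer_uniqueness_ratio(genome, k):
    """Percentage of positions covered by at least one k-mer that occurs
    exactly once in `genome` (forward strand only; an assumed reading
    of the caption's definition)."""
    n = len(genome)
    if n == 0 or k > n:
        return 0.0
    counts = Counter(genome[i:i + k] for i in range(n - k + 1))
    covered = [False] * n
    for i in range(n - k + 1):
        if counts[genome[i:i + k]] == 1:
            for j in range(i, i + k):
                covered[j] = True
    return 100.0 * sum(covered) / n


# Toy sequence with a tandem duplication of ACG.
for k in (2, 3, 4):
    print(k, kmer_uniqueness_ratio("ACGACG", k))
```

As in the figure, the ratio generally rises with k, because longer k-mers are less likely to recur by chance or to fit entirely inside a repeat.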

De Bruijn k-mer assemblers: Newbler (Roche 454), SHARCGS, VCAKE, VELVET, EULER-SR, EDENA, ABySS, ALLPATHS, SOAPdenovo, Contrail.

Most assemblers have an error detection and resolution phase. Errors produce characteristic graphic structures.

Problems with de Bruijn graph methods. They require a large amount of memory to store the graph; for example, Velvet would require a terabyte of memory to assemble the human genome. They are not as easy to parallelize as overlap assemblers. From Schatz et al. 2010: “To date, only two de Bruijn graph assemblers have been shown to have the ability to assemble a mammalian-sized genome. ABySS (Simpson et al. 2009) assembled a human genome in 87 h on a cluster of 21 eight-core machines each with 16 GB of RAM (168 cores, 336 GB of RAM total). SOAPdenovo assembled a human genome in 40 h using a single computer with 32 cores and 512 GB of RAM (Li et al. 2010). Although these types of computing resources are not widely available, they are within reach for large-scale scientific centers.”

How many clones/reads do we need? According to the work of Lander and Waterman (Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 1988 Apr;2(3):231-9), the number of “islands” or contigs formed from randomly collected sequences depends on:

G = genome length
L = sequence read length
N = number of sequences collected
T = number of base pairs of overlap needed

\[ \#\,\text{islands} = N \, e^{-\frac{LN}{G}\left(1 - \frac{T}{L}\right)} \]

5 Mbp genome, 500 bp reads, 25 bp overlap. Table columns: # reads, coverage, % sequenced, # contigs.
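A table with these columns can be generated from the Lander and Waterman model. The sketch below uses coverage c = LN/G, expected fraction sequenced 1 − e^(−c), and expected number of contigs N·e^(−c(1 − T/L)). The read counts in the loop are arbitrary illustrative values, not the ones from the original slide.

```python
import math


def lander_waterman(G, L, N, T):
    """Return (coverage, percent of genome sequenced, expected number of contigs).

    G: genome length, L: read length, N: number of reads,
    T: base pairs of overlap needed to join two reads.
    """
    coverage = L * N / G
    percent_sequenced = 100.0 * (1.0 - math.exp(-coverage))
    contigs = N * math.exp(-coverage * (1.0 - T / L))
    return coverage, percent_sequenced, contigs


G, L, T = 5_000_000, 500, 25
print(f"{'# reads':>9} {'coverage':>9} {'% sequenced':>12} {'# contigs':>10}")
for N in (5_000, 10_000, 20_000, 40_000, 60_000, 80_000, 100_000):  # illustrative
    c, pct, contigs = lander_waterman(G, L, N, T)
    print(f"{N:>9} {c:>9.1f} {pct:>12.2f} {contigs:>10.0f}")
```

The expected number of contigs first rises as reads are added, then falls as reads begin to join existing islands. This is why the number of contigs alone does not tell you how far along a project is.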

Graph of previous data

Shotgun Sequencing Model: genome size as predicted from the assembly. [Figure: number of non-singleton contigs versus number of sequences, comparing the observed number of non-singletons with the curves predicted for a 5.5 Mb and a 3.7 Mb genome.]

Figure 3. Expected average contig length for a range of different read lengths and coverage values. Also shown are the average contig lengths and N50 lengths for the dog genome, assembled with 710-bp reads, and the panda genome, assembled with reads averaging 52 bp in length.
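Since this caption contrasts average contig length with N50, a short sketch of how the two differ may help. N50 is taken here in its usual sense: the contig length such that contigs of that length or longer contain at least half of the total assembled bases. The contig lengths in the example are made up.

```python
def n50(lengths):
    """Smallest length L such that contigs of length >= L hold at least
    half of the total assembled bases. Returns 0 for an empty list."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


contig_lengths = [2, 2, 2, 3, 3, 4, 8, 8]  # made-up contig lengths
print("mean:", sum(contig_lengths) / len(contig_lengths))
print("N50: ", n50(contig_lengths))
```

The mean is pulled down by many small contigs, while N50 is weighted by bases, so the two can differ substantially for the same assembly.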

Combining sequence data types. In practice, this appears to be the best strategy for both microbial and eukaryotic genomes. It creates assembly challenges of its own.

One strategy for microbial genomes. Sequence ~¼ run of 454 regular, ¼ run of paired end (2.5 kb library), plus one lane of Solexa. Assemble the Solexa data with Velvet. Assemble the 454 data with Newbler. Shred the Velvet assembly into Newbler-size reads and add it to the 454 assembly. Use the Solexa deep coverage to “polish”.
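The "shred" step can be pictured as cutting each contig into overlapping pseudo-reads. The sketch below is only an illustration of that idea. The function name, the default read length, and the default overlap are assumptions, and it says nothing about how Velvet output or Newbler input files are actually formatted.

```python
def shred(contig, read_len=400, overlap=200):
    """Cut `contig` into overlapping pseudo-reads of `read_len` bases.

    `read_len` and `overlap` are hypothetical defaults. A final read is
    added, if needed, so that the end of the contig is always covered.
    """
    if read_len <= overlap:
        raise ValueError("read_len must be larger than overlap")
    if len(contig) <= read_len:
        return [contig]
    step = read_len - overlap
    last_start = len(contig) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [contig[s:s + read_len] for s in starts]


# Toy example with letters instead of bases so the tiling is easy to see.
print(shred("ABCDEFGHIJK", read_len=4, overlap=2))
```

Overlapping the pseudo-reads gives the second assembler enough shared sequence to rejoin them, which a non-overlapping cut would not.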

Gap Closure Strategies. Primer walk to sequence the rest of linking clones that span a scaffold gap. Primer walk off clones at the ends of contigs for which there is no linking information. PCR based on your best guess at contig order (comparison to other closely related genomes, predicted genes at the end of genomes, anything else you can come up with). Combinatorial PCR with primers designed at the end of each contig.

Phred Scores

Phred score   P(incorrect base)   Base call accuracy
10            1 in 10             90%
20            1 in 100            99%
30            1 in 1,000          99.9%
40            1 in 10,000         99.99%
50            1 in 100,000        99.999%
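The relationship behind this table is Q = −10 · log10(P), where P is the probability that the base call is wrong. A short sketch of the conversion in both directions, with illustrative function names:

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score `q` is incorrect."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability `p` (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_prob(q)
    print(f"Q{q}: P(incorrect) = {p:g}, accuracy = {100 * (1 - p):.3f}%")
```

Each step of 10 in the score corresponds to a tenfold drop in the error probability.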