Sequencing 101 Readout.

Sequencing 101 Readout

Original Sanger Sequencing with Radioactive Signal
Template (Crick) very low concentration of ddNTPs compared to dNTPs A nested series of DNA fragments ending in the base specified by the terminator-ddNTP Watsons Expose gel to x-ray film (to make an “auto-radiogram”) Recombinant DNA: Genes and Genomes. 3rd Edition (Dec06). WH Freeman Press.

Wouldn’t it be great to run everything in one lane?
CAAGTGTCTTAAC This is great but… Wouldn’t it be great to run everything in one lane? Save space and time, more efficient Fluorescently label the ddNTPs so that they each appear a different color

Fluorescent Sanger Sequencing Trace
Lane signal (Real fluorescent signals from a lane/capillary are much uglier than this). A bunch of magic to boost signal/noise, correct for dye-effects, mobility differences, etc, generates the ‘final’ trace (for each capillary of the run) Trace Recombinant DNA: Genes and Genomes. 3rd Edition (Dec06). WH Freeman Press.

Sanger Base Calling Base Caller (Phred) ugly over here ugly over there
... N A G C G T T C C G C G ... A N N ... Base Caller (Phred) Quality score = -10 * log(probability of error) For Q20, probability of error = 1/100 For Q99, probability of error ~10-10

Whole Genome Sequencing
Hierarchical Shotgun Approach Genomic DNA BAC library Organized, Mapped Large Clone Contigs (minimal tiling path) Shotgun Clones GCAATGAAATATGTTCTTGTAATTTAAGCTGACACTCCTAATTTAGCTCTTGTCCTCTACTGAGTCTACCTAATTATATGTATGGATTGACTTGG AGCTCTTGTCCTCTACTGAGTCTACCTAATTATATGTATGGATTGACTTGGTGTTTTCTCTTTTTCTTAAATAGTAATGCAGAAAGCCTGGAGAGAGAG Reads GCAATGAAATATGTTCTTGTAATTTAAGCTGACACTCCTAATTTAGCTCTTGTCCTCTACTGAGTCTACCTAATTATATGTATGGATTGACTTGGTGTTTTCTCTTTTTCTTAAATAGTAATGCAGAAAGCCTGGAGAGAGAG Assembly

De Novo Whole Genome Sequencing
Make millions of random clones: “Shotgunning” Plasmid "backbone" Insert sequencing primer "forward read" "reverse read"

Assembly: Contigs and Supercontigs
"Supercontig" or "Scaffold" "Contig" Seq gap NNNNNN number of N's in sequence = estimated size

Why Different Insert Sizes are Useful
Longer (fosmid) mate pairs connect assembly pieces that are not connected by shorter (plasmid) mate pairs

454 Sequencing Most viable “UHTS” method for de novo genome sequencing
Still sequencing by synthesis, but fundamentally different Uses pyrosequencing (light produced as a by product of base extension) No cloning required Sequences ~1,000,000 reads simultaneously Each read is ~400bp long Takes ~10 hours for machine run

454 Library Generation (2) Adaptor DNA adapter ends plus beads
~ 1 DNA molecule and 1 bead per droplet Make water-in-oil emulsion In brief, an in vitro–constructed adaptor flanked shotgun library (shown as gold and turquoise adaptors flanking unique inserts) is PCR amplified (that is, multi-template PCR, not multiplex PCR, as only a single primer pair is used, corresponding to the gold and turquoise adaptors) in the context of a water-in-oil emulsion. One of the PCR primers is tethered to the surface (5′-attached) of micron-scale beads that are also included in the reaction. A low template concentration results in most bead-containing compartments having either zero or one template molecule present. In productive emulsion compartments (where both a bead and template molecule is present), PCR amplicons are captured to the surface of the bead. After breaking the emulsion, beads bearing amplification products can be selectively enriched. Each clonally amplified bead will bear on its surface PCR products corresponding to amplification of a single molecule from the template library. emPCR Most beads with with multiple copies of individual DNA molecules

454 Sequencing Process Beads loaded into wells Enzymes added
1) The emulsion is broken, the DNA strands are denatured, and beads carrying single-stranded DNA templates are enriched (not shown) and deposited into wells of a fiber-optic slide. 2) Smaller beads carrying immobilized enzymes required for a solid phase pyrophosphate sequencing reaction are deposited into each well. 3) Sequencing gives off light

Pyrogram

<454 movie here>

Illumina Sequencing Technology Robust Reversible Terminator Chemistry Foundation
5’ 3’ G T C A DNA ( ug) T G C A G T Sample preparation Cluster growth Single molecule array Sequencing 1 2 3 7 8 9 4 5 6 Base calling T G C T A C G A T … Image acquisition

Sequencing 20 Microns 100 Microns 250+ Million Clusters Per Flow Cell
Purpose: To visually demonstrate sequencing by synthesis Details: This is an actual cluster pattern from a flow cell (4-color overlay) If you zoom in on one small section (the 20 micron box) you can follow each cluster through sequencing by synthesis Conclusion: This is the visual set up to the following slide 20 Microns 100 Microns

Flow Cells and Imaging Solexa: single 8-channel Reagents flowed in here Image for each channel is actually tiled (four images per panel, one for each color) Out to waste

<Illumina movie here>

The numbers (March 2013) Platform Read Length Gb/run Time/run Cost/run
Error rate 454 1 ~1 day ~$5-10k ~1% Illumina HiSeq ~1 week ~$20-25k ~0.1% Illumina MiSeq 1-2 ~$1k Ion Torrent ~2-3 hours ~$0.5-1k Pac Bio ~$2k ~15%

What are the data? Vendors are typically producing data in fastq, or cfastq format. followed by a sequence Identifier @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 The sequence ‘+’, optionally followed by a sequence Identifier The quality scores Explain the quality scores – also, see wikipedia page on fastq format, and provide reference here

Example of Illumina SeqID
@HWUSI-EAS100R:6:73:941:1973#0/1 HWUSI-EAS100R The unique instrument name 6 Flowcell lane 73 Tile number within the flowcell 941 'x'-coordinate of the cluster within the tile 1973 'y'-coordinate of the cluster within the tile #0 index number for a multiplexed sample (0 for no indexing) /1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)

Error profiles De-phasing Crosstalk Degradation De-phasing
(1) A strong correlation of the A and C intensities as well as of the G and T intensities due to similar emission spectra of the fluorophores and limited separation by the filters used. (2) Dependence of the signal of a specific cycle by the signal of the cycles before and after, called phasing and pre-phasing respectively. Phasing and pre-phasing are caused by incomplete removal of the 3' terminators and fluorophores, sequences in the cluster missing an incorporation cycle, as well as by the incorporation of nucleotides without effective 3' terminators. The Illumina base caller (Bustard) uses a so-called crosstalk matrix estimated from the first and second imaging cycle to orthogonalize the correlated channels. Further this matrix is used to scale the different intensities measured for each of the fluorophores. The estimation of the crosstalk matrix is based on the assumption that the four nucleotides are almost equally frequent in the library being sequenced. If the sample does not fulfill this assumption this estimate can be inaccurate and lead to incorrect base calling. Bustard estimates the phasing and pre-phasing as two channel- independent parameters from the increasing correlation of intensities in the first few cycles of the sequencing run. Using the crosstalk matrix and the two phasing parameters, it creates corrected intensity values and calls the base with the highest corrected intensity for each cluster and cycle. In the case of equal intensity values or small intensity differences an N is called. De-phasing Error profiles

Assessing Quality Aberrant Solexa tiles. This figure displays three distinct types of errors that we have seen occur on Solexa tiles. The image was generated using plotQTile with the color-by category option. The tile data was drawn from tiles 129, 85, and 6 respectively – all produced from the same lane and run. The error depicted in 1b appears to be caused by a bubble in one of the reagents. This bubble increased the error rate, converting U0 reads into U1 reads. From further investigation we know this occurred during cycle 28. The error depicted in 1c is an area in which no reads at all occurred – possibly a problem with the DNA binding to the flow cell, or a smudge on the surface of the flow cell. The error depicted in 1a remains enigmatic. Figure 2d was generated using the cycleplot function. It displays the mean QAG-score for cycles 1–32 on tile 47, and showcases the ability to detect the source of an error by decomposing the data according to cycle. Here we see the drop in average intensity that occurred during cycle 28. The aggregate quality score (QAG-score) for a base call is the maximum Q-score minus the sum of the remaining three Q-scores.

Read Frequency Distribution
VecBase Screen > gnl|uv|NGB :1-219 pCR4-TOPO multiple cloning site Length=219 Score = 100 bits (50), Expect = 9e-19 Identities = 50/50 (100%), Gaps = 0/50 (0%) Strand=Plus/Plus Query 1 ATTAACCCTCACTAAAGGGACTAGTCCTGCAGGTTTAAACGAATTCGCCC 50 |||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 43 ATTAACCCTCACTAAAGGGACTAGTCCTGCAGGTTTAAACGAATTCGCCC 92 1. Trim trailing bad bases 2. Remove duplicate reads QA: filtering

Alignment

Aligners differ in speed and quality

Mapping Short Reads Many options; often a trade off between speed, resources and sensitivity. Several open source projects to solve this problem, continually improving in speed and memory requirements. New features being added all the time. When dealing with short read data, make sure you have the very latest versions of the software you’re using, as some are updated frequently. Unlike earlier-generation sequence alignment programs such as BLAST, which were designed in an environment that required alignments of protein sequences and searching though large databases to find homologous sequences, today’s short-read alignment programs are generally used for the alignment of DNA sequence from the species of interest to the reference genome assembly of that species. This difference, although it initially seems subtle, has several consequences to the final algorithm design and implementation, which include letting assumptions about the number of expected mismatches be driven by the species polymorphism rate and the technology error rate rather than by considerations of evolutionary substitutions.

Alignment Exhaustive searching in this manner is prohibitively slow
@HWI-EAS412_4:1:1:1376:380 AATGATACGGCGACCACCGAGATCTA Exhaustive searching in this manner is prohibitively slow Algorithms are needed to that will trade speed for sensitivity

Approaches to Short Read Alignment
Hash-Based mapping Hashing of reads (E.g. bwa, Eland) Hashing of genome (E.g. novoalign) Indexing using Suffix Array/Burrows-Wheeler Transform (BWT) (E.g. bowtie, bwa) The methods covered are (i) hash table–based implementations, in which the hash may be created using either the reference genome or the set of sequencing reads, and (ii) Burrows Wheeler transform (BWT)-based methods, which first create an efficient index of the reference genome assembly in a way that facilitates rapid searching in a low-memory footprint

How does the “hashing” work?
In general, a hash function simply converts a string to an integer. The integer is then used as an index in an array, for fast look up. The reads are “hashed”, using 6 different permutations of the first 28bp. The genome is then looked through, in 28bp chunks, to see if they match, via the hash, to reads. Read up more on hashing, and add some notes here. Do they need to know this? No – however, it’s interesting (in my opinion), and it’s useful for bioinformaticians to think about algorithms.

The Burrows-Wheeler Transform

Bowtie Algorithm @HWI-EAS412_4:1:1:1376:380 AATGATACGGCGACCACCGAGATCTA

Comparison (old!) PC: 2.4 GHz Intel Core 2, 2 GB RAM
CPU time Wall clock time Reads per hour Peak virtual memory footprint Bowtie speedup Reads aligned (%) Bowtie –v 2 (server) 15m:07s 15m:41s 33.8 M 1,149 MB - 67.4 SOAP (server) 91h:57m:35s 91h:47m:46s 0.08 M 13,619 MB 351x 67.3 Bowtie (PC) 16m:41s 17m:57s 29.5 M 1,353 MB 71.9 Maq (PC) 17h:46m:35s 17h:53m:07s 0.49 M 804 MB 59.8x 74.7 Bowtie (server) 17m:58s 18m:26s 28.8 M Maq (server) 32h:56m:53s 32h:58m:39s 0.27 M 107x PC: 2.4 GHz Intel Core 2, 2 GB RAM Server: 2.4 GHz AMD Opteron, 32 GB RAM Bowtie v0.9.6, Maq v0.6.6, SOAP v1.10 SOAP not run on PC due to memory constraints Reads: FASTQ 8.84 M reads from 1000 Genomes (Acc: SRR001115) Reference: Human (NCBI 36.3, contigs)

Assembly Iverson et al. Science 3 February 2012: Vol. 335 no. 6068 pp

De novo assembly New methods are still actively developed
Short reads require long overlaps e.g., nt reads must overlap by 20+nt end-trimming helps to remove low quality bases. Most de novo short read assemblers use a k-mer hashing based approach. Known as “de Bruijn graph” assembly Describes which short words follow which other short words

Sequence Hashing (k = 4) TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
AGTCGAG CTTTAGA CGATGAG CTTTAGA GTCGGG TTAGATC ATGAGGC GAGACAG GAGGCTC ATCCGAT AGGCTTT GAGACAG AGTCGAG TAGATCC ATGAGGC TAGAGAA TAGTCGA CTTTAGA CCGATGA TTAGAGA CGAGGCT AGATCCG TGAGGCT AGAGACA TAGTCGA GCTTTAG TCCGATG GCTCTAG TCGACGC GATCCGA GAGGCTT AGAGACA TAGTCGA TTAGATC GATGAGG TTTAGAG GTCGAGG TCTAGAT ATGAGGC TAGAGAC AGGCTTT ATCCGAT AGGCTTT GAGACAG AGTCGAG TTAGATT ATGAGGC AGAGACA GGCTTTA TCCGATG TTTAGAG CGAGGCT TAGATCC TGAGGCT GAGACAG AGTCGAG TTTAGATC ATGAGGC TTAGAGA GAGGCTT GATCCGA GAGGCTT GAGACAG An example de Bruijn graph assembly for a short genomic sequence without polymorphism. Sequence at top represents the genome, which is then sampled using shotgun sequencing in base space with 7-bp reads (step 1). Some of the reads have errors (red). Hashing (k = 4)

{ Graph Building TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG
TAGT AGTC GTCG TCGA CGAG GAGG AGGC GGCT GCTT AGAG GAGA AGAC GACA ACAG (3x) (7x) (9x) (10x) (8x) (16x) (16x) (16x) (11x) (9x) (12x) (9x) (8x) (5x)  CTTC TTCA TCAG CAGA (1x) (2x) (2x) (1x) CTTT TTTA TTAG TAGA (8x) (8x) (12x) (16x) CGAC GACG ACGC (1x) (1x) (1x) TGAG ATGA GATG CGAT CCGA TCCG ATCC GATC AGAT (9x) (8x) (5x) (6x) (7x) (7x) (7x) (8x) (8x) GATT (1x) AGAA { In step 2, the k-mers in the reads (4-mers in this example) are collected into nodes and the coverage at each node is recorded. There are continuous linear stretches within the graph, and the sequencing errors create distinctive, low-coverage features through out the graph TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG

{ { Simplification of Linear Stretches
TAGT AGTC GTCG TCGA CGAG GAGG AGGC GGCT GCTT AGAG GAGA AGAC GACA ACAG (3x) (7x) (9x) (10x) (8x) (16x) (16x) (16x) (11x) (9x) (12x) (9x) (8x) (5x)  CTTC TTCA TCAG CAGA (1x) (2x) (2x) (1x) CTTT TTTA TTAG TAGA (8x) (8x) (12x) (16x) CGAC GACG ACGC (1x) (1x) (1x) TGAG ATGA GATG CGAT CCGA TCCG ATCC GATC AGAT (9x) (8x) (5x) (6x) (7x) (7x) (7x) (8x) (8x) GATT (1x) AGAA { Simplification of Linear Stretches TAGTCGA CGAG CGACGC GCTCTAG GCTTTAG GATCCGATGAG AGAT AGAA  { GATT GAGGCT TAGA AGAGA AGACAG In step 3, the graph is simplified to combine nodes that are associated with the continuous linear stretches into single, larger nodes of various k-mer sizes.

{ { Tips Bubble Error (tip and bubble) removal
TAGTCGA CGAG CGACGC GCTCTAG GCTTTAG GATCCGATGAG AGAT AGAA  { GATT GAGGCT TAGA AGAGA AGACAG Tips Bubble { TAGTCGAG GAGGCTTTAGA AGAGACAG  AGATCCGATGAG Error (tip and bubble) removal In step 4, error correction removes the tips and bubbles that result from sequencing errors and creates a final graph structure that accurately and completely describes in the original genome sequence. TAGTCGAGGCTTTAGATCCGATGAGGCTTTAGAGACAG

De novo Assembler performance: the Assemblathon
From Genome10K/UCSD Roughly annual, latest 21 participants, 3 genomes Assembly size for bird genome (budgie) NG50: if all scaffold lengths are summed from longest to the shortest, this is the length at which the sum length accounts for 50% of total sequence. “NG50” calculated in genome size, “N50” in assembly size. Higher is better (most sequence in long scaffolds), but it does not account for error! Scaffold size % scaffolds

De novo Assembler performance: the Assemblathon
Assembly metrics for bird genome (budgie)

Other assembly considerations: computational demands
The time taken to run an assembly can vary by >100x. The memory required for assembly can be 16G, 32G, or more. No one right answer – depends on technology, amount of data, quality of data, GC%, computational resources, desired outcome…

http://saaientist. blogspot
Chimeras, Rearrangements, the problem of reference genomes Second level of QA Mismatched paired end reads

222 appshttp://seqanswers.com/wiki/SEQanswers
The standard tools

Illumina announces HiSeq 2000
The Future Illumina announces HiSeq 2000 + Better dynamic range, higher resolution- Library construction, machine acquisition, downstream processing 2010

Sequencing 101 Readout.

Similar presentations

Presentation on theme: "Sequencing 101 Readout."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Sequencing 101 Readout.

Similar presentations

Presentation on theme: "Sequencing 101 Readout."— Presentation transcript:

Similar presentations

About project

Feedback