Special Topics in Genomics

Slides:

Advertisements

Similar presentations

Original Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"

Advertisements

DNA Technology & Gene Mapping Biotechnology has led to many advances in science and medicine including the creation of DNA clones via recombinant clones,

Next–generation DNA sequencing technologies – theory & practice

SOLiD Sequencing & Data

Microarray Simultaneously determining the abundance of multiple(100s-10,000s) transcripts.

RNA-Seq An alternative to microarray. Steps Grow cells or isolate tissue (brain, liver, muscle) Isolate total RNA Isolate mRNA from total RNA (poly.

Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome Jay Shendure, Gregory J. Porreca, Nikos B. Reppas, Xiaoxia Lin, John P. McCutcheon.

Figure 1: (A) A microarray may contain thousands of ‘spots’. Each spot contains many copies of the same DNA sequence that uniquely represents a gene from.

RNA-Seq An alternative to microarray. Steps Grow cells or isolate tissue (brain, liver, muscle) Isolate total RNA Isolate mRNA from total RNA (poly.

CS 6293 Advanced Topics: Current Bioinformatics

with an emphasis on DNA microarrays

Special Topics in Genomics Lecture 1: Introduction Instructor: Hongkai Ji Department of Biostatistics

Affymetrix vs. glass slide based arrays

Analyzing your clone 1) FISH 2) “Restriction mapping” 3) Southern analysis : DNA 4) Northern analysis: RNA tells size tells which tissues or conditions.

-The methods section of the course covers chapters 21 and 22, not chapters 20 and 21 -Paper discussion on Tuesday - assignment due at the start of class.

High Throughput Sequencing Methods and Concepts

Restriction Nucleases Cut at specific recognition sequence Fragments with same cohesive ends can be joined.

Library screening Heterologous and homologous gene probes Differential screening Expression library screening.

High Throughput Sequencing Methods and Concepts Cedric Notredame adapted from S.M Brown.

Chromatin Immunoprecipitation DNA Sequencing (ChIP-seq)

Microarray - Leukemia vs. normal GeneChip System.

Scenario 6 Distinguishing different types of leukemia to target treatment.

I519 Introduction to Bioinformatics, Fall, 2012

Stratton Nature 45: 719, 2009 Evolution of DNA sequencing technologies to present day DNA SEQUENCING & ASSEMBLY.

Microarrays and Gene Expression Analysis. 2 Gene Expression Data Microarray experiments Applications Data analysis Gene Expression Databases.

1 Global expression analysis Monday 10/1: Intro* 1 page Project Overview Due Intro to R lab Wednesday 10/3: Stats & FDR - * read the paper! Monday 10/8:

Ultra-High Throughput DNA Sequencing on the 454/Roche GS-FLX

Chapter 10: Genetic Engineering- A Revolution in Molecular Biology.

ANALYSIS OF GENE EXPRESSION DATA. Gene expression data is a high-throughput data type (like DNA and protein sequences) that requires bioinformatic pattern.

Figure S1. RACE mapped transcription starts and polyA signals of Ogre CL5 and Ogre CL5del and putative splice site of Ogre CL5 and Ogre CL5del in Silene.

Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.

GSVCaller – R-based computational framework for detection and annotation of short sequence variations in the human genome Vasily V. Grinev Associate Professor.

Green with envy?? Jelly fish “GFP” Transformed vertebrates.

Title: Studying whole genomes Homework: learning package 14 for Thursday 21 June 2016.

Presented by Samuel Chapman. Pyrosequencing-Intro The core idea behind pyrosequencing is that it utilizes the process of complementary DNA extension on.

Canadian Bioinformatics Workshops

Introduction to Illumina Sequencing

DNA Sequencing First generation techniques

Next-generation sequencing technology

DNA Sequencing Second generation techniques

Next generation sequencing

Next Generation Sequencing

Analysis of genomes and transcriptomes using ChIP-seq and RNA-seq

Gene expression from RNA-Seq

Next-generation sequencing technology

AMPLIFYING AND ANALYZING DNA.

Design and Analysis of Single-Cell Sequencing Experiments

SOLEXA aka: Sequencing by Synthesis

Chapter 14 Bioinformatics—the study of a genome

Single-Color Digital PCR Provides High-Performance Detection of Cancer Mutations from Circulating DNA Christina Wood-Bouwens, Billy T. Lau, Christine.

Sensitivity of RNA‐seq.

Hybrid Capture and Next-Generation Sequencing Identify Viral Integration Sites from Formalin-Fixed, Paraffin-Embedded Tissue Eric J. Duncavage, Vincent.

Massively Parallel Sequencing: The Next Big Thing in Genetic Medicine

Taichi Umeyama, Takashi Ito Cell Reports

Discovery and Characterization of piRNAs in the Human Fetal Ovary

RNA Exosome Depletion Reveals Transcription Upstream of Active Human Promoters by Pascal Preker, Jesper Nielsen, Susanne Kammler, Søren Lykke-Andersen,

High-Throughput Sequencing Technologies

High-Throughput Sequencing Technologies

A Massively Parallel Reporter Assay of 3′ UTR Sequences Identifies In Vivo Rules for mRNA Degradation Michal Rabani, Lindsey Pieper, Guo-Liang Chew,

Rapid identification of efficient target cleavage sites using a hammerhead ribozyme library in an iterative manner Wei-Hua Pan, Ping Xin, Vuong Bui,

Joseph Rodriguez, Jerome S. Menet, Michael Rosbash Molecular Cell

Volume 16, Issue 8, Pages (August 2016)

Volume 20, Issue 6, Pages (August 2017)

Volume 63, Issue 6, Pages (September 2016)

Baekgyu Kim, Kyowon Jeong, V. Narry Kim Molecular Cell

(Top) Construction of synthetic long read clouds with 10× Genomics technology. (Top) Construction of synthetic long read clouds with 10× Genomics technology.

Standard (Sanger) sequencing

Volume 41, Issue 2, Pages (January 2011)

Taichi Umeyama, Takashi Ito Cell Reports

Figure 1. DNA-guided RNA cleavage activity

Presentation transcript:

Special Topics in Genomics Next-generation Sequencing

Work flow of conventional versus second-generation sequencing (a) With high-throughput shotgun Sanger sequencing, genomic DNA is fragmented, then cloned to a plasmid vector and used to transform E. coli. For each sequencing reaction, a single bacterial colony is picked and plasmid DNA isolated. Each cycle sequencing reaction takes place within a microliter-scale volume, generating a ladder of ddNTP-terminated, dye-labeled products, which are subjected to high-resolution electrophoretic separation within one of 96 or 384 capillaries in one run of a sequencing instrument. As fluorescently labeled fragments of discrete sizes pass a detector, the four-channel emission spectrum is used to generate a sequencing trace. (b) In shotgun sequencing with cyclic-array methods, common adaptors are ligated to fragmented genomic DNA, which is then subjected to one of several protocols that results in an array of millions of spatially immobilized PCR colonies or 'polonies'15. Each polony consists of many copies of a single shotgun library fragment. As all polonies are tethered to a planar array, a single microliter-scale reagent volume (e.g., for primer hybridization and then for enzymatic extension reactions) can be applied to manipulate all array features in parallel. Similarly, imaging-based detection of fluorescent labels incorporated with each extension can be used to acquire sequencing data on all features in parallel. Successive iterations of enzymatic interrogation and imaging are used to build up a contiguous sequencing read for each array feature. Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)

Available next-generation sequencing platforms Illumina/Solexa ABI SOLiD Roche 454 Polonator HeliScope …

Example: Illumina/Solexa 1. Prepare genomic DNA 2. Attach DNA to surface 3. Bridge amplification 4. Fragement become double stranded 5. Denature the double stranded molecules 6. Complete amplification

Illumina/Solexa 7. Determine first base 8. Image first base 9. Determine second base 10. Image second base 11. Sequence reads over multiple cycles 12. Align data. >50 milliion clusters/flow cell, each 1000 copies of the same template, 1 billion bases per run, 1% of the cost of capillary-based method. (From: http://www.illumina.com/downloads/SS_DNAsequencing.pdf)

Clonal amplification of sequencing features in the second-generation sequencing (a) The 454, the Polonator and SOLiD platforms rely on emulsion PCR20 to amplify clonal sequencing features. In brief, an in vitro–constructed adaptor-flanked shotgun library (shown as gold and turquoise adaptors flanking unique inserts) is PCR amplified (that is, multi-template PCR, not multiplex PCR, as only a single primer pair is used, corresponding to the gold and turquoise adaptors) in the context of a water-in-oil emulsion. One of the PCR primers is tethered to the surface (5'-attached) of micron-scale beads that are also included in the reaction. A low template concentration results in most bead-containing compartments having either zero or one template molecule present. In productive emulsion compartments (where both a bead and template molecule is present), PCR amplicons are captured to the surface of the bead. After breaking the emulsion, beads bearing amplification products can be selectively enriched. Each clonally amplified bead will bear on its surface PCR products corresponding to amplification of a single molecule from the template library. (b) The Solexa technology relies on bridge PCR21, 22 (aka 'cluster PCR') to amplify clonal sequencing features. In brief, an in vitro–constructed adaptor-flanked shotgun library is PCR amplified, but both primers densely coat the surface of a solid substrate, attached at their 5' ends by a flexible linker. As a consequence, amplification products originating from any given member of the template library remain locally tethered near the point of origin. At the conclusion of the PCR, each clonal cluster contains 1,000 copies of a single member of the template library. Accurate measurement of the concentration of the template library is critical to maximize the cluster density while simultaneously avoiding overcrowding. Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)

Strategies for cyclic array sequencing With the 454 platform, clonally amplified 28-m beads generated by emulsion PCR serve as sequencing features and are randomly deposited to a microfabricated array of picoliter-scale wells. With pyrosequencing, each cycle consists of the introduction of a single nucleotide species, followed by addition of substrate (luciferin, adenosine 5'-phosphosulphate) to drive light production at wells where polymerase-driven incorporation of that nucleotide took place. This is followed by an apyrase wash to remove unincorporated nucleotide. (b) With the Solexa technology, a dense array of clonally amplified sequencing features is generated directly on a surface by bridge PCR. Each sequencing cycle includes the simultaneous addition of a mixture of four modified deoxynucleotide species, each bearing one of four fluorescent labels and a reversibly terminating moiety at the 3' hydroxyl position. A modified DNA polymerase drives synchronous extension of primed sequencing features. This is followed by imaging in four channels and then cleavage of both the fluorescent labels and the terminating moiety. (c) With the SOLiD and the Polonator platforms, clonally amplified 1-m beads are used to generate a disordered, dense array of sequencing features. Sequencing is performed with a ligase, rather than a polymerase. With SOLiD, each sequencing cycle introduces a partially degenerate population of fluorescently labeled octamers. The population is structured such that the label correlates with the identity of the central 2 bp in the octamer (the correlation with 2 bp, rather than 1 bp, is the basis of two-base encoding). After ligation and imaging in four channels, the labeled portion of the octamer (that is, 'zzz') is cleaved via a modified linkage between bases 5 and 6, leaving a free end for another cycle of ligation. Several such cycles will iteratively interrogate an evenly spaced, discontiguous set of bases. The system is then reset (by denaturation of the extended primer), and the process is repeated with a different offset (e.g., a primer set back from the original position by one or several bases) such that a different set of discontiguous bases is interrogated on the next round of serial ligations. (d) With the HeliScope platform, single nucleic acid molecules are sequenced directly, that is, there is no clonal amplification step required. Poly-A–tailed template molecules are captured by hybridization to surface-tethered poly-T oligomers to yield a disordered array of primed single-molecule sequencing templates. Templates are labeled with Cy3, such that imaging can identify the subset of array coordinates where a sequencing read is expected. Each cycle consists of the polymerase-driven incorporation of a single species of fluorescently labeled nucleotide at a subset of templates, followed by fluorescence imaging of the full array and chemical cleavage of the label. Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)

Conventional sequencing Can sequence up to 1,000 bp, and per-base 'raw' accuracies as high as 99.999%. In the context of high-throughput shotgun genomic sequencing, Sanger sequencing costs on the order of $0.50 per kilobase. Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)

Second-generation DNA sequencing technologies Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)

Applications of next-generation sequencing Jay Shendure & Hanlee Ji, Nature Biotechnology 26, 1135 - 1145 (2008)

Base calling Schematic representation of main Illumina noise factors. (a–d) A DNA cluster comprises identical DNA templates (colored boxes) that are attached to the flow cell. Nascent strands (black boxes) and DNA polymerase (black ovals) are depicted. (a) In the ideal situation, after several cycles the signal (green arrows) is strong, coherent and corresponds to the interrogated position. (b) Phasing noise introduces lagging (blue arrows) and leading (red arrow) nascent strands, which transmit a mixture of signals. (c) Fading is attributed to loss of material that reduces the signal intensity (c). (d) Changes in the fluorophore cross-talk cause misinterpretation of the received signal (teal arrows; d). For simplicity, the noise factors are presented separately from each other. Erlich et al. Nature Methods 5: 679-682 (2008)

Base calling: Alta-Cyclic The training process (green arrows) starts with creation of the training set, beginning with sequences generated by the standard Illumina pipeline, by linking intensity reads and a corresponding genome sequence (the 'correct' sequence). Then, two grid searches are used to optimize the parameters to call the bases. After optimization, a final SVM array is created, each of which corresponds to a cycle. In the base-calling stage (blue arrows), the intensity files of the desired library undergo deconvolution to correct for phasing noise using the optimized values and are sent for classification with the SVM array. The output is processed, and sequences and quality scores are reported. Erlich et al. Nature Methods 5: 679-682 (2008)

Alta-Cyclic performance (a) Analysis of the HepG2 RNA library using Alta-Cyclic. The absolute number of additional fully correct reads (in addition to those generated by the Illumina base caller) is indicated by the red line; the fold change of the improvement is indicated by the blue bars. (b) A comparison of fully correct reads for the Tetrahymena micronuclear library by the Illumina base caller and Alta-Cyclic. (c) The average error rate in calls of the artificial SNP locations in the phi X library as a function of the cycle in which they were called. The dashed line represents 1% error rate (Q20). The plot on the right shows the last 18 cycles in a different scale. (d) A comparison of fully correct reads for the phi X library with 1% artificial SNPs. (e) Phi X sequences generated by Alta-Cyclic or Illumina were exhaustively aligned to the reference genome (allowing up to 53 mismatches out of 78). The distribution of alignment scores is shown beginning with an identical number of raw reads for input into each base caller. Erlich et al. Nature Methods 5: 679-682 (2008)

ChIP-Seq Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

ChIP-Seq Analysis Alignment Peak Detection Annotation Visualization Sequence Analysis Motif Analysis

Alignment ELAND Bowtie SOAP SeqMap …

Peak detection FindPeaks CHiPSeq BS-Seq SISSRs QuEST MACS CisGenome …

Two common designs One sample experiment contains only a ChIP’d sample Two sample experiment contains a ChIP’d sample and a negative control sample

One sample analysis A simple way is the sliding window method ki Poisson background model is commonly used to estimate error rate ki ~ Poisson(λ0) ki Or people use Monte Carlo simulations Both are based on the assumption that read sampling rate is a constant across the genome. Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

The constant rate assumption does not hold! Negative binomial model fits the data better! ki | λi ~ Poisson(λi) λi ~ Gamma(α, β) Marginally, ki ~ NegBinom(α, β) Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

FDR estimation based on Poisson and negative binomial model Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Read direction provides extra information Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

CisGenome procedure Negative binomial model Alignment Exploration FDR computation Negative binomial model Peak Detection Use read direction to refine peak boundary and filter low quality peaks Post Processing Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Two sample analysis Reason: read sample rates at the same genomic locus are correlated across different samples. Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

CisGenome two sample analysis Alignment Exploration FDR computation Peak Detection Post Processing k1i k2i ni =k1i + k2i k1i | ni ~ Binom(ni , p0) Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

A comparative study of ChIP-chip and ChIP-seq NRSF ChIP-chip 2 ChIP + 2 Mock IP in Jurkat cells, profiled using Affymetrix Human Tiling 2.0R arrays. NRSF ChIP-seq ChIP + Negative Control in Jurkat cells, sequenced with the next generation sequencer made by Illumina/Solexa. Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Intersection Before post-processing After post-processing Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Signal correlation Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Visual comparison Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Comparison of peak detection results Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Are array specific peaks noise or signal? Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

Effects of read number in ChIP-seq Hongkai Ji et al. Nature Biotechnology 26: 1293-1300. 2008

RNA-Seq Mortazavi et al. Nature Methods,5:621-628, 2008 After two rounds of poly(A) selection, RNA is fragmented to an average length of 200 nt by magnesium-catalyzed hydrolysis and then converted into cDNA by random priming. The cDNA is then converted into a molecular library for Illumina/Solexa 1G sequencing, and the resulting 25-bp reads are mapped onto the genome. Normalized transcript prevalence is calculated with an algorithm from the ERANGE package. (b) Primary data from mouse muscle RNAs that map uniquely in the genome to a 1-kb region of the Myf6 locus, including reads that span introns. The RNA-Seq graph above the gene model summarizes the quantity of reads, so that each point represents the number of reads covering each nucleotide, per million mapped reads (normalized scale of 0–5.5 reads). (c) Detection and quantification of differential expression. Mouse poly(A)-selected RNAs from brain, liver and skeletal muscle for a 20-kb region of chromosome 10 containing Myf6 and its paralog Myf5, which are muscle specific. In muscle, Myf6 is highly expressed in mature muscle, whereas Myf5 is expressed at very low levels from a small number of cells. The specificity of RNA-Seq is high: Myf6 expression is known to be highly muscle specific, and only 4 reads out of 71 million total liver and brain mapped reads were assigned to the Myf6 gene model. Mortazavi et al. Nature Methods,5:621-628, 2008

Reproducibility, linearity and sensitivity (a) Comparison of two brain technical replicate RNA-Seq determinations for all mouse gene models (from the UCSC genome database), measured in reads per kilobase of exon per million mapped sequence reads (RPKM), which is a normalized measure of exonic read density; R 2 = 0.96. (b) Distribution of uniquely mappable reads onto gene parts in the liver sample. Although 93% of the reads fall onto exons or the RNAFAR-enriched regions (see Fig. 3 and text), another 4% of the reads falls onto introns and 3% in intergenic regions. (c) Six in vitro–synthesized reference transcripts of lengths 0.3–10 kb were added to the liver RNA sample (1.2 104 to 1.2 109 transcripts per sample; R 2 > 0.99). (d) Robustness of RPKM measurement as a function of RPKM expression level and depth of sequencing. Subsets of the entire liver dataset (with 41 million mapped unique + splice + multireads) were used to calculate the expression level of genes in four different expression classes to their final expression level. Although the measured expression level of the 211 most highly expressed genes (black and cyan) was effectively unchanged after 8 million mappable reads, the measured expression levels of the other two classes (purple and red) converged more slowly. The fraction of genes for which the measured expression level was within 5% of the final value is reported. 3 RPKM corresponds to approximately one transcript per cell in liver. The corresponding number of spliced reads in each subset is shown on the top x axis. Mortazavi et al. Nature Methods,5:621-628, 2008

ERANGE Mortazavi et al. Nature Methods,5:621-628, 2008 (a) The main steps in the computational pipeline are outlined at left, with different aspects of read assignment and weighting diagrammed at right and the corresponding number of gene model reads treated in muscle shown in parentheses. In each step, the sequence read or reads being assigned by the algorithm are shown as a black rectangle, and their assignment to one or more gene models is indicated in color. Sequence reads falling outside known or predicted regions are shown in gray. RNAFAR regions (clusters of reads that do not belong to any gene model in our reference set) are shown as dotted lines. They can either be assigned to neighboring gene models, if they are within a specified threshold radius (purple), or assigned their own predicted transcript model (green). Multireads (shown as parallelograms) are assigned fractionally to their different possible locations based on the expression levels of their respective gene models as described in the text. (b) Comparison of mouse liver expanded RPKM values to publicly available Affymetrix microarray intensities from GEO (GSE6850) for genes called as present by Rosetta Resolver. Expanded RPKMs include unique reads, spliced reads and RNAFAR candidate exon aggregation, but not multireads. Genes with >30% contribution of multireads to their final RPKM (Supplementary Fig. 4) are marked in red. (c) Comparison of Affymetrix intensity values with final RPKMs, which includes multireads. Note that the multiread-affected genes that are below the regression line in b straddle the regression line in c. Mortazavi et al. Nature Methods,5:621-628, 2008

Summary The next-generation sequencing is Rapidly evolving Democratizing the extent to which individual investigators can pursue projects at a scale previously accessible only to major genome centers New statistical tools need to be developed