The Good, Bad, and Ugly of Next-Gen Sequencing

Slides:



Advertisements
Similar presentations
Next-Generation Sequencing: Methodology and Application
Advertisements

RNA-Seq as a Discovery Tool
RNA-seq library prep introduction
High throughput sequencing Barbera van Schaik
The Past, Present, and Future of DNA Sequencing
Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis Yan Guo.
IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis.
 Sequencing technology › Roche/454 GS-FLX (‘454’) › Illumina  Prokaryotic profiling › De novo genome sequencing › Metagenomics › SNP profiling › Species.
Next–generation DNA sequencing technologies – theory & practice
Bioinformatics Lectures at Rice
SOLiD Sequencing & Data
Next-generation sequencing
The 454 and Ion PGM at the Genomics Core Facility Dr. Deborah Grove, Director for Genetic Analysis Genomics Core Facility Huck Institutes of the Life Sciences.
Next-generation sequencing – the informatics angle Gabor T. Marth Boston College Biology Department AGBT 2008 Marco Island, FL. February
Greg Phillips Veterinary Microbiology
Bioinformatics Methods and Computer Programs for Next-Generation Sequencing Data Analysis Gabor Marth Boston College Biology Next Generation Sequencing.
Affymetrix Microarray and Illumina/ Solexa NextGen Sequencing Yuannan Xia, Ph.D Genomics Core Research Facility
The SOLiD System: Next-Generation Sequencing Overview of the SOLiD System –  Scalable  Accurate Ultra High Throughput  Flexible  Mate Pairs.
Data analytical issues with high-density oligonucleotide arrays A model for gene expression analysis and data quality assessment.
BME 130 – Genomes Lecture 4 Sequencing technology II Next generation sequencing.
High Throughput Sequencing
Next generation sequencing Why? What? How? Marcel Dinger Developmental Biology Divisional Seminar 7 October 2010.
Department of Bioinformatics and Computational Biology
CS 6293 Advanced Topics: Current Bioinformatics
Diabetes and Endocrinology Research Center The BCM Microarray Core Facility: Closing the Next Generation Gap Alina Raza 1, Mylinh Hoang 1, Gayan De Silva.
GENOME SEQUENCING. I. Genome sequencing The Sanger Method (1977) Denaturation +priming Polymerization.
NGS Data Generation Dr Laura Emery. Overview The NGS data explosion Sequencing technologies An example of a sequencing workflow Bioinformatics challenges.
The impact of next-generation sequencing technology of genetics Elaine R. Mardis – 11 February Washington School of Medicine, Genome Sequencing Center.
Next Now-Generation Genomics: methods and applications for modern disease research Aaron J. Mackey, Ph.D. Center for Public Health.
High-Throughput Sequencing Technologies
Molecular Biology Dr. Chaim Wachtel April 4, 2013.
Library Preparation Application dependant, using standard molecular biological techniques. Fragment library oligo kit: (per library)$35 GeneAmp dNTP blend:
Todd J. Treangen, Steven L. Salzberg
Technological Solutions. In 1977 Sanger et al. were able to work out the complete nucleotide sequence in a virus – (Phage 0X174) This breakthrough allowed.
Next Generation Sequencing and its data analysis challenges Background Alignment and Assembly Applications Genome Epigenome Transcriptome.
High throughput sequencing: informatics & software aspects Gabor T. Marth Boston College Biology Department BI543 Fall 2013 January 29, 2013.
Detection of Genomic Rearrangements in K562 cells using Paired End Sequencing Rosa Maria Alvarez Massachusetts Institute of Technology Class of 2009.
Chromatin Immunoprecipitation DNA Sequencing (ChIP-seq)
I519 Introduction to Bioinformatics, Fall, 2012
How will new sequencing technologies enable the HMP? Elaine Mardis, Ph.D. Associate Professor of Genetics Co-Director, Genome Sequencing Center Washington.
SEQUENCING – THE BENCHTOPS. Roche 454 Junior Same technology as 454 FLX Read length: 400 bases Paired-end 100,000 reads 12 hours (instrument time) Output.
Transcriptomics Sequencing. over view The transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA, and other non coding RNA produced.
Lecture 6. Functional Genomics: DNA microarrays and re-sequencing individual genomes by hybridization.
Recombinant DNA What is the basis of recombinant DNA technology? How does one “clone” a gene? How are genetically modified organisms (GMOs) created? Illustration.
目录 The Principle and Application of Common Used Techniques in Molecular Biology chapter 18.
Next-generation sequencing: the informatics angle
Library QA & QC Day 1, Video 3
Introduction to Illumina Sequencing
Next-generation sequencing technology
Research Techniques Made Simple: Next-Generation Sequencing:
Short Read Sequencing Analysis Workshop
Next generation sequencing
Amos Tanay Nir Yosef 1st HCA Jamboree, 8/2017
Cancer Genomics Core Lab
Dr. Christoph W. Sensen und Dr. Jung Soh Trieste Course 2017
Utilizing the Illumina deep sequencing technique to define
Next-generation sequencing technology
Very important to know the difference between the trees!
Step 1: amplification and cloning procedures
Small RNA Sample Preparation
2nd (Next) Generation Sequencing
High-throughput sequencing techniques
Discovery tools for human genetic variations
ULTRASEQUENCING. Next Generation Sequencing: methods and applications.
RNA sequencing (RNA-Seq) and its application in ovarian cancer
Next-generation DNA sequencing
IWGS workflow. iWGS workflow. A typical iWGS analysis consists of four steps: (1) data simulation (optional); (2) preprocessing (optional); (3) de novo.
Schematic representation of a transcriptomic evaluation approach.
Sequence Analysis - RNA-Seq 1
Presentation transcript:

The Good, Bad, and Ugly of Next-Gen Sequencing The University of Texas at Austin, Genomic Sequencing and Analysis Facility or for short GSAF The Good, Bad, and Ugly of Next-Gen Sequencing Scott Hunicke-Smith 2012

Outline Next-gen sequencing: Background The details Library construction Sequencing Data analysis

NGS enabling technologies Clonal amplification (Exception: SMS) Two methods: emulsion PCR (454, SOLiD), bridge amplification (Illumina) Sequencing by synthesis Massive parallelism

How they work videos Roche/454 Illumina (Solexa) Genome Analyzer http://454.com/products-solutions/multimedia-presentations.asp Illumina (Solexa) Genome Analyzer http://www.youtube.com/watch?v=77r5p8IBwJk Life Technologies SOLiD http://media.invitrogen.com.edgesuite.net/ab/applications-technologies/solid/SOLiD_video_final.html

NGS enabling technologies Clonal amplification (Exception: SMS) Two methods: emulsion PCR (454, SOLiD), bridge amplification (Illumina) Sequencing by synthesis Massive parallelism

The Details: Categories Library Construction Sequencing Data Analysis

Instruments: Roche Workflow

Mate-Pair Library Construction Shearing – size, size distribution Ligation biases (x4) Digestion – length, distribution Final gel cut

Read Types vs Library Types emPCR Sequencing F3 read-> <-F5 read R3 read-> emPCR 5’-CCACTACGCCTCCGCTTTCCTCTCTATGGGCAGTCGGTGAT<Template1-~150 bp>CGCCTTGGCCGTACAGCAGGGGCTTAGAGAATGAGGAACCCGGGGCAG-3’ |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 3’-GGTGATGCGGAGGCGAAAGGAGAGATACCCGTCAGCCACTA-<Template 1 RC >-GCGGAACCGGCATGTCGTCCCCGAATCTCTTACTCCTTGGGCCCCGTC-5’ Single-end (F3 read only) Cheapest, highest quality Paired-end (F3 and F5 read) Much more information content Differentiates PCR duplicates Mate-pair (F3 and R3 read) Provides info on large-scale structure

Read Types vs Library Types Clear terms: Fragment library Mate-paired library Paired-end read Ambiguous terms: Paired-end library Mate-paired read

SE vs PE From: Bainbridge et al. Genome Biology 2010, 11:R62

Library Construction: Workflows Mate-Pair Libraries Fragment Libraries

How they work videos Roche/454 Illumina (Solexa) Genome Analyzer http://454.com/products-solutions/multimedia-presentations.asp Illumina (Solexa) Genome Analyzer http://www.youtube.com/watch?v=77r5p8IBwJk Life Technologies SOLiD http://media.invitrogen.com.edgesuite.net/ab/applications-technologies/solid/SOLiD_video_final.html

Question Which of these was NOT an enabling invention for NGS: Clonal amplification Intercalating dyes Sequencing by synthesis Massive parallelism

Essential Ideas NGS interrogates populations, not individual clones Number of reads (sequences) ≅ 100x library molecules put into clonal amplification MOLAR RATIOS matter! Highly repeatable (from library through sequencing) Error rates are (very) high

Characteristics of SBS Step-wise efficiency is <100% Like inflation eating away at your savings This can be resolved by correcting “phasing” This single software addition increased read lengths by ~10-fold Dominant error modalities can be predicted based on the technology Fluor-term-nucleotide systems have ____ errors Native (un-terminated) systems have ___ errors

Trajectory of Price

Instruments: How they work Step Roche/454 LifeTech SOLiD Illumina HiSeq Create ss DNA library Shear, ligate adaptors, optional PCR Segregate molecules On polystyrene beads On magnetic beads On glass surface Clonal amplification emulsion PCR Bridge amplification Fix colonies to seq. substrate Deposit beads into picotiter plate Bind via sequencing template Done during clonal amp. Sequence SBS: single-nucleotide addition SBS: ligation of 4-color dinucleotide-encoded oligos SBS: 4-color incorporation of capped nucleotides Detect Luminesence over whole surface Fluorescence scanning   Cost for 1 run $6,843 $3,882 $17,462 Data from 1 run, megabase-pairs 400 34000 320000 Time for 1 run, days 1 14 11 Cost per megabase raw data $17.11 $0.11 $0.05 Throughput, megabase/day 2429 29091

What it costs Examples: Gene expression profiling (INCLUDING array cost): NimbleGen 12 samples on catalog, 72k probe, 4-plex arrays: ~$450 per sample from 1 ug total RNA or cDNA. Illumina Human, Mouse, or Rat, 12 samples: ~$300 per sample from 100 ng total RNA Deep Sequencing: Illumina RNA-seq: 1 sample, 40 million read-pairs: $876 Illumina de novo: Draft sequence ~5 megabase bacterial genome (~25 MB raw sequence): ~$500

What, exactly, are we sequencing?

Good Example: ChIP-Seq

RNA/miRNA library What’s in YOUR library?

RNA-seq Quantitation – what’s in YOUR genome? CAACCCCAACACCCACCGGCACACAGACCCCAACC – 99x CAACCCCAACACCCACCGGCACACAGACCGGGCCC – 1x You found a transcript WHERE? Jesse Gray @ Harvard: ChIP-Seq data showed RNA Pol II binding tens of KB away from any annotated gene, in a promoter/enhancer complex RNA-Seq data confirmed ~1kb transcripts arising from these binding sites Quantitation – what’s in YOUR genome? CAACCCCAACACCCACCGGCACACAGACCCCAACC CAACCCCAACACCCACCGGCACACAGACCGGGCCC You found a transcript WHERE? Jesse Gray @ Harvard: ChIP-Seq data showed RNA Pol II binding tens of KB away from any annotated gene, in a promoter/enhancer complex RNA-Seq data confirmed ~1kb transcripts arising from these binding sites

Question Which type of mathematics are you most likely to need when analyzing NGS data: Calculus Linear algebra Statistics Differential equations Set theory (Hint: it has been removed from Texas requirements for high school math)

Sequencing All instruments susceptible to: Updates are very frequent Poor library quantitation leading to excessive templates (failure) or wasted space (more expensive) Failures in cluster or bead generation (expensive) Failures in sequencing chemistry (very expensive) Updates are very frequent

Instruments: Accuracy/Quality “Error rate” - typ. to individual read Better: Mappable data

Quality Values: Debated Illumina & Roche screen/trim reads, ABI does not Quality value distributions vary widely

Aligners/Mappers Algorithms Differences Spaced-seed indexing Hash seed words from reference or reads Burrows-Wheeler transform (BWT) Differences Speed Scaleability on clusters Memory requirements Sensitivity: esp. indels Ease of use Output format

Aligners/Mappers Differences in alignment tools: Use of base quality values Gapped or un-gapped Multiple-hit treatment Estimate of alignment quality Handle paired-end & mate-pair data Treatment of multiple matches Read length assumptions Colorspace treatment (aware vs. useful) Experimental complexities: Methylation (bisulfite) analysis Splice junction treatment Iterative variant detection Taken from: http://www.bioconductor.org/help/course-materials/2010/CSAMA10/2010-06-14__HTS_introduction__Brixen__Bioc_course.pdf

Data courtesy Dhivya Arasappan, GSAF Bioinformatician Some Comparisons Data courtesy Dhivya Arasappan, GSAF Bioinformatician

Informatics Pipelines: RNA-seq General workflow: Pre-filter (optional) Map Filter Summarize (e.g. by gene or exon) Interpret Rule sets are required to make sense of the “unbiased” sequence data Rule sets can get complicated quickly Algorithm matters (speed, sensitivity, specificity)

Rule Set Example Basis for definition of “hit”… Accept all hits Tag Seq 1 Gene sequence A Tag Seq 1 Tag Seq 1 Gene sequence B Tag Seq 1 Gene sequence C Basis for definition of “hit”… Accept all hits Collapse intergenic non-unique Select random non-unique Select only unique Apply stat model to non-unique Summarize by gene, exon (gene model?)

Comparison of Short-Read Mappers & Filters Mapping along normalized gene length – effects of post-mapping filters. Fig 1a: Bowtie raw output,max.100 hits per tag (No filter) Fig 1b: Bowtie output, max.25 hits per tag, 3mis, nontiling, max. coverage of 1% Fig 1c: Bowtie output,1 hit per tag, 3mis, nontiling, max.coverage of 1%, no polyA tails Fig 2a:SOAP2 raw output (No filter) Fig 2b: SOAP2 output, 1 hit per tag, 3mis, nontiling, max. coverage of 1% Fig 2c: SOAP2 output,1 hit per tag, 3mis, nontiling, max.coverage of 1%, no polyA tails

Data Analysis Workflow: RNA-Seq

Pipeline example From: Landgraf, et. al., “A mammalian microRNA expression atlas based on small RNA library sequencing.”, Nat Biotechnol. 2007 Sep; 25(9):996-7, supplemental materials

TACC: A Joy in Life RANGER: 63,000 processing cores, 1.73 PB shared disk LONESTAR: 5,840 processing cores, 103 TB local disk RANCH & CORRAL: 3.7 PB archive Typical mapping of 20e6 reads: 20 hours on high-end desktop 2 hours at TACC

Medical Examples Warfarin Gleevec targeting BCL/ABL Irinotecan: UGT1A1 First CML, then GIST “Too” specific… and $32,000/year See also: Herceptin, Avastin, Cetuximab… Warfarin CYP 450 enzymes have regulators too… Irinotecan: UGT1A1 Irinotecan is converted by an enzyme into its active metabolite SN-38, which is in turn inactivated by the enzyme UGT1A1 by glucuronidation. # The most common polymorphism is a variation in the number of TA repeats in the TATA box region of the UGT1A1 gene. The presence of seven TA repeats (UGT1A1*28) instead of the normal six TA repeats (UGT1A1*1) reduces gene expression and results in impaired metabolism. This variant allele is common in many populations, and occurs in 38.7% of Caucasians, 16% of Asians and 42.6% of Africans.1,2 # Studies have shown that impaired metabolism in patients who are homozygous for the UGT1A1*28 allele results in severe, dose-limiting toxicity during irinotecan therapy. These findings led to a recent update in the irinotecan label to include dosing recommendations based on the presence of a UGT1A1*28 allele.3. From: http://www.twt.com/clinical/ivd/ugt1a1.html

Tarceva: EGFR EGFR mutation improves survival, but nullifies effect of treatment

The future of cancer treatment Researchers at St. Jude’s and Dana Farber both predict sequencing of all incoming cancer patients in the next 2-3 years Applications will be: Predicting tumor response (pt stratification) Characterizing resistance to anticancer agents (this is the challenge in most metastatic solid tumors) and Profiling the full spectrum of informative genetic/molecular alterations

Real (applied) data From: “Integrative Analysis of the Melanoma Transcriptome”, Berger, Genome Research, Feb. 23 2010

Personalized cancer detection Personalized Analysis of Rearranged Ends (PARE) – Leary @ Johns Hopkins Do one mate-pair sequence analysis of the primary tumor Identify transpositions/gene fusions/etc. that are specific to that patient’s tumor Use as a detection target for recurrence at least, or as a drug target Science Translational Medicine, 24 Feb. 2010

Pharmacogenomics & the FDA 13,000 drugs on-market 1,200 were reviewed for PGx labels 121 have them, and 1 in 4 outpatients use them Measurements and Main Results. Pharmacogenomic biomarkers were defined, FDA-approved drug labels containing this information were identified, and utilization of these drugs was determined. Of 1200 drug labels reviewed for the years 1945–2005, 121 drug labels contained pharmacogenomic information based on a key word search and follow-up screening. Of those, 69 labels referred to human genomic biomarkers, and 52 referred to microbial genomic biomarkers. Of the labels referring to human biomarkers, 43 (62%) pertained to polymorphisms in cytochrome P450 (CYP) enzyme metabolism, with CYP2D6 being most common. Of 36.1 million patients whose prescriptions were processed by a large pharmacy benefits manager in 2006, about 8.8 million (24.3%) received one or more drugs with human genomic biomarker information in the drug label. Conclusion. Nearly one fourth of all outpatients received one or more drugs that have pharmacogenomic information in the label for that drug. The incorporation and appropriate use of pharmacogenomic information in drug labels should be tested for its ability to improve drug use and safety in the United States. From: Lesko et. Al., “Pharmacogenomic Biomarker Information in Drug Labels Approved by the United States Food and Drug Administration: Prevalence of Related Drug Use”, Pharmacotherapy, Volume: 28 | Issue: 8 , August 2008.

Epidemiology Metagenomics to be specific: Key point: survey of microbial communities by culture is biased; survey by sequencing is completely unbiased Can thus survey any biological milieu: Individual: sinus, skin, gut, etc. either singly or in aggregate Survey water supply, environmental samples Corporate: survey raw sewage streams at sentinel locations to monitor outbreaks

Key Take-Home Concepts Access to DNA and RNA-based information will be trivial in the next 10 years Understanding of genome information will be take a lot longer Consider bioinformatics in your curriculum and your career

From the UT GSAF Scott Hunicke-Smith, Ph.D. – Director Dhivya Arasappan, M.S. – Bioinformatician Jessica Wheeler – Lab Manager Melanie Weiler – RA Jillian DeBlanc – RA Heather Deidrick – RA Yvonne Murray – Administrator Gabriella Huerta – RA Terry Heckmann – RA Margaret Lutz – RA

Preliminaries All the world’s a sequence… Combination data: De novo sequencing Re-sequencing: SNP discovery, genotyping, rearrangements, targeted resequencing, etc. Regulatory elements: ChIP-Seq Methylation Small RNA discovery & quantification mRNA quantification: RNA-Seq Combination data: mRNA -> cDNA -> nextgen = Gene expression Splice variants SNPs mRNA: all for about the price of a microarray experiment…