High Throughput Sequencing

Presentation transcript:

High Throughput Sequencing Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520 Guest lecturer: Wei Li

About me: Wei Li, research fellow at DFCI. Studied high-throughput sequencing algorithms shortly after HTS came out (2009). Transcript reconstruction algorithms for high-throughput RNA sequencing data (RNA-seq): IsoInfer/IsoLasso/CEM. CRISPR/Cas9 screening algorithms for design and analysis: MAGeCK/MAGeCK-VISPR.

Why high-throughput sequencing? Also known as high-throughput sequencing (HTS) or next-generation sequencing (NGS). 2-3 orders of magnitude faster, cheaper, and higher in data throughput than "first generation" sequencing. Huge applications in academia and industry.

First generation: Sanger sequencing. Frederick Sanger: the 4th person overall to win two Nobel Prizes (after Marie Curie, Linus Pauling, and John Bardeen).

First Generation Sanger Sequencing: throughput of 384 * 1 kb reads per 3 hours

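Sanger sequencing materials: Sanger sequencing uses DNA elongation to "read" sequences. dNTPs: required for the normal elongation process. ddNTPs (dideoxyNTPs; di = two, deoxy = oxygen removed): lack the 3'-OH group needed to attach the next nucleotide, so incorporating one stops the synthesis. http://www.slideshare.net/thelawofscience/biotechnology-dna-sequencing

Sanger sequencing setup: 4 test tubes, each containing all four dNTPs (dATP, dGTP, dCTP, dTTP). In addition, each tube also has ONE of the 4 ddNTPs.

What happens if you have both dATP and ddATP? Synthesis can stop at any position where the template has a "T": whenever a ddATP is incorporated there instead of a dATP, elongation ends, so the tube yields fragments ending at each T position.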
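The four-tube logic can be sketched in a few lines of Python. This is a toy model, not a description of a real protocol: it assumes an idealized reaction in which every possible termination product is observed, and the names `termination_lengths`, `read_gel`, and the template string are hypothetical.

```python
# Toy model of Sanger chain termination (idealized; names are hypothetical).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def termination_lengths(template, dd_base):
    """Lengths of new strands that can end in the given ddNTP.

    `template` is written in the order the polymerase traverses it, so the
    k-th synthesized base is the complement of template[k].
    """
    return [k + 1 for k, base in enumerate(template)
            if COMPLEMENT[base] == dd_base]

def read_gel(template):
    """Read the synthesized strand by sorting fragments from all four tubes."""
    base_at_length = {}
    for dd_base in "ACGT":
        for length in termination_lengths(template, dd_base):
            base_at_length[length] = dd_base
    return "".join(base_at_length[n] for n in sorted(base_at_length))

template = "TACGT"
dda_tube = termination_lengths(template, "A")  # stops opposite each template T
synthesized = read_gel(template)
```

In the ddATP tube, fragments end wherever the template has a T; sorting the fragments of all four tubes by length spells out the synthesized strand.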

Sequencing in 2001

Sequencing in 2007

Second Generation: massively parallel sequencing by synthesis. Many different technologies: Illumina, 454, SOLiD, Helicos, etc. Illumina (HiSeq, MiSeq, NextSeq): 1-16 samples, 25M-4B reads, 30-300 bp, 1-8 days, 15 GB-1 TB output. These numbers are moving targets.

Illumina Cluster Generation: Fragments to be sequenced are amplified in place on the flow cell. Can sequence from both adapters (pink and purple in the figure): paired-end sequencing. Can multiplex many samples per lane.

Illumina Sequencing process: 1. Incorporate all 4 nucleotides, each labeled with a different dye. 2. Wash, 4-color imaging. 3. Cleave dye and terminating groups, wash. 4. Repeat cycles.

Illumina Sequencing (figure: sequencing cycles 1 to 6)

Third Generation: single-molecule sequencing, no amplification. Fewer but much longer reads. Good for applications that need long reads, but not for read-count applications; the technology is still in development. http://www.youtube.com/watch?v=v8p4ph2MAvI https://www.nanoporetech.com/news/movies#movie-28-minion

High Throughput Sequencing: Big (data), fast (speed), cheap (cost), flexible (applications). Cost is falling faster than Moore's law: bioinformatic analysis becomes the bottleneck!

High Throughput Sequencing Data Analysis

FASTQ File Format: each read takes four lines: sequence ID, sequence, quality ID, quality score. Quality score using ASCII (higher -> better).

@HWI-EAS305:1:1:1:991#0/1
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS305:1:1:1:991#0/1
MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
@HWI-EAS305:1:1:1:201#0/1
AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT
+HWI-EAS305:1:1:1:201#0/1
PXX[[[[XTXYXTTWYYY[XXWWW[TMTVXWBBB
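The four-line record layout and the ASCII quality encoding can be sketched as follows. This is a minimal sketch, not a full FASTQ parser: it assumes one record per four lines with no line wrapping, and it assumes the example reads use an ASCII offset of 64 (older Illumina pipelines); many current files use an offset of 33 instead. The names `parse_fastq` and `phred_scores` are hypothetical.

```python
# Minimal FASTQ reader (assumes exactly four lines per record, no wrapping).
def parse_fastq(lines):
    """Return a list of (read_id, sequence, quality_string) tuples."""
    records = []
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue
        seq = next(it).strip()
        plus = next(it).strip()
        qual = next(it).strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        records.append((header[1:], seq, qual))
    return records

def phred_scores(quality_string, offset=64):
    """Decode ASCII quality characters; offset=64 is an assumption here."""
    return [ord(ch) - offset for ch in quality_string]

example = [
    "@HWI-EAS305:1:1:1:991#0/1",
    "GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT",
    "+HWI-EAS305:1:1:1:991#0/1",
    "MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB",
]
records = parse_fastq(example)
scores = phred_scores(records[0][2])
# A Phred score Q corresponds to an error probability of 10 ** (-Q / 10).
error_prob_first_base = 10 ** (-scores[0] / 10)
```

Under the offset-64 assumption, the trailing run of "B" characters decodes to the lowest scores in the read, which fits the idea that quality drops toward the end of a read.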

FASTQC: Sequencing Quality (figure panels: "Good quality!" and "Poor quality!")

Read Mapping: Mapping hundreds of millions of reads back to the reference genome is CPU- and RAM-intensive and slow. Read quality decreases along the read, so errors (single-nucleotide mismatches or small indels) become more likely toward the end. Most mappers allow ~2 mismatches within the first 30 bp (the 28 remaining matching bases give 4^28 possible sequences, which could still uniquely identify most 30 bp sequences in a 3 Gb genome); mapping is slower when indels are allowed. Mapping output: SAM (BAM) or BED.
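The 4^28 argument is simple arithmetic and can be checked directly. The sketch below only compares the number of possible 28-base sequences with the number of positions in a genome of about 3 billion bases; it ignores repeats, which are the main reason real reads still map ambiguously. The variable names are hypothetical.

```python
# Back-of-the-envelope check of the 4^28 claim (ignores genomic repeats).
matching_bases = 30 - 2                 # 30 bp seed with 2 mismatches allowed
possible_sequences = 4 ** matching_bases
genome_positions = 3_000_000_000        # roughly 3 Gb
sequences_per_position = possible_sequences / genome_positions
```

Because there are tens of millions of possible sequences per genome position, a random 28-base match is unlikely to occur by chance at a second location.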

Read mapping algorithms: spaced seed alignment, Burrows-Wheeler, suffix tree

Spaced seed alignment: Tags (reads) and tag-sized pieces of the reference are cut into small "seeds." Pairs of spaced seeds are stored in an index. Look up spaced seeds for each tag. For each "hit," confirm the remaining positions. Report results to the user.
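The steps above can be sketched as a small in-memory index. This is an illustrative sketch rather than the implementation of any particular mapper: it assumes each tag is cut into four equal seeds, indexes all six seed pairs of every reference window, and verifies hits by Hamming distance. With at most 2 mismatches, at least two of the four seeds are untouched, so at least one seed pair still matches exactly. All function names and the toy sequences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cut_seeds(window, n_seeds=4):
    """Cut a tag-sized string into n_seeds equal pieces."""
    size = len(window) // n_seeds
    return [window[k * size:(k + 1) * size] for k in range(n_seeds)]

def build_index(reference, tag_len, n_seeds=4):
    """Index every pair of seeds for every tag-sized reference window."""
    index = defaultdict(list)
    for pos in range(len(reference) - tag_len + 1):
        parts = cut_seeds(reference[pos:pos + tag_len], n_seeds)
        for i, j in combinations(range(n_seeds), 2):
            index[(i, j, parts[i], parts[j])].append(pos)
    return index

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def map_tag(tag, reference, index, max_mismatches=2, n_seeds=4):
    """Look up seed pairs, then confirm the remaining positions."""
    parts = cut_seeds(tag, n_seeds)
    candidates = set()
    for i, j in combinations(range(n_seeds), 2):
        candidates.update(index.get((i, j, parts[i], parts[j]), []))
    hits = []
    for pos in sorted(candidates):
        mismatches = hamming(tag, reference[pos:pos + len(tag)])
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

reference = "ACGTACGTTTGACCAGTAGGCATCGATCGGATTACA"
index = build_index(reference, tag_len=16)
tag = "TAGACCAGTAGGCAAC"      # reference[8:24] with two substitutions
hits = map_tag(tag, reference, index)
```

The index here stores reference windows; some mappers index the tags instead. The memory cost of storing six keys per window is the price paid for tolerating two mismatches with exact lookups.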

BW alignment

Burrows-Wheeler: Store the entire reference genome. Align the tag base by base from the end. When the tag is fully traversed, all active locations are reported. If no match is found, back up and try a substitution. Trapnell & Salzberg, Nat Biotech 2009

Burrows-Wheeler Transform: Reversible permutation originally used in compression. BWT(T) is the last column of the Burrows-Wheeler matrix built from the text T. Once BWT(T) is built, everything else shown in the matrix is discarded; the first column can be derived by sorting the last column. (Figure labels: T, Burrows-Wheeler matrix, last column, BWT(T), encoding for compression.) Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, CA, 1994. Slides from Ben Langmead

Burrows-Wheeler Transform: The property that makes BWT(T) reversible is "LF mapping": the i-th occurrence of a character in the Last column is the same text occurrence as the i-th occurrence of that character in the First column. Example from the figure: the 2nd 'a' in the Last column (rank 2) is the same text character as the 2nd 'a' in the First column (rank 2). Slides modified from Ben Langmead
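A minimal sketch of building the Burrows-Wheeler matrix and reading off BWT(T). It assumes the running example is T = acaacg$ (this is consistent with the row numbers quoted in the later slides, but the text itself is not spelled out in the transcript), and it uses the naive approach of sorting all rotations, which is fine for a toy string but not for a genome. The function names are hypothetical, and the run-length step is only one simple way to illustrate why the transform helps compression.

```python
from itertools import groupby

def bw_matrix(text):
    """All rotations of text, sorted; text must end with a unique '$'."""
    return sorted(text[i:] + text[:i] for i in range(len(text)))

def bwt(text):
    """BWT(T) is the last column of the sorted rotation matrix."""
    return "".join(row[-1] for row in bw_matrix(text))

def run_length(s):
    """Collapse runs of equal characters into (character, count) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]

text = "acaacg$"                      # assumed running example
last_column = bwt(text)
first_column = "".join(sorted(last_column))
runs = run_length(last_column)
```

The first column never needs to be stored, because sorting the last column recovers it.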

BWT: How to reconstruct T from BWT(T)? To recreate T from BWT(T), repeatedly apply the rule T = BWT[ LF(i) ] + T; i = LF(i), where LF(i) maps row i to the row whose first character corresponds to row i's last character, per the LF mapping. Slides from Ben Langmead

BWT reconstruction, first step: the first and last columns are known. Start at i=1 (rows are numbered from 1); the last character of row 1 is the last character of T. LF(i)=7, the row starting with the first 'g'; BWT[LF(i)]='c', so the second-to-last character of T is 'c'. Set i=LF(i)=7.

BWT reconstruction, second step: i=7 gave the second-to-last character of T. LF(i)=6, the row starting with the second 'c'; BWT[LF(i)]='a', so the third-to-last character of T is 'a'. Set i=LF(i)=6.

BWT reconstruction, last step: keep applying the rule until the whole text has been prepended; the result is the final T.
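The reconstruction rule can be sketched as follows. The sketch takes the last column as input and assumes it is gc$aaac, i.e. the BWT of acaacg$, which matches the LF values quoted in the slides. Rows are 0-based in the code, so slide row 1 is index 0 and LF(1)=7 appears as `lf[0] == 6`. The function names are hypothetical.

```python
def lf_mapping(last_column):
    """lf[i] = row whose first character is the same text occurrence
    as the last character of row i."""
    counts = {}
    for ch in last_column:
        counts[ch] = counts.get(ch, 0) + 1
    # First row of each character's block in the (sorted) first column.
    first_row = {}
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]
    seen = {}
    lf = []
    for ch in last_column:
        rank = seen.get(ch, 0)          # how many ch occurred above this row
        lf.append(first_row[ch] + rank)
        seen[ch] = rank + 1
    return lf

def reconstruct(last_column):
    """Rebuild T by walking the LF mapping from the row that starts with '$'."""
    lf = lf_mapping(last_column)
    i = 0                               # row 0 begins with '$'
    chars = []
    while last_column[i] != "$":
        chars.append(last_column[i])    # prepend, collected in reverse
        i = lf[i]
    return "".join(reversed(chars)) + "$"

lf = lf_mapping("gc$aaac")
text = reconstruct("gc$aaac")
```

The walk visits rows 0, 6, 5, ... in the code, which corresponds to rows 1, 7, 6, ... in the slides.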

BWT: How To Do Exact Matching? To match Q in T using BWT(T), repeatedly apply rule: top = LF(top, qc); bot = LF(bot, qc) Where qc is the next character in Q (right-to-left) and LF(i, qc) maps row i to the row whose first character corresponds to i’s last character as if it were qc Slides from Ben Langmead

BWT exact matching, first steps: qc='c': top=5, bot=6 (the rows beginning with 'c'). qc='a': the last character of rows 5 and 6 is 'a', so top=LF(5,'a')=3 and bot=LF(6,'a')=4. Slides from Ben Langmead

BWT exact matching, next step: qc='a': the last characters of rows 3 and 4 are '$' and 'a', so top=LF(3,'a')=2 and bot=LF(4,'a')=2; the query matches at row 2. Slides from Ben Langmead

Exact Matching with FM Index: In progressive rounds, top and bot delimit the range of rows beginning with progressively longer suffixes of Q (from right to left). If the range becomes empty, the query suffix (and therefore the query) does not occur in the text. If there is no match, instead of giving up, "backtrack" to a previous position and try a different base (a mismatch; much slower). Slides from Ben Langmead
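A compact sketch of exact matching by backward search. It assumes the same toy text acaacg$ and covers exact matches only; the backtracking step for mismatches is not implemented. Two conventions differ from the slides: rows are 0-based, and the range is half-open, [top, bot), so the slide range "rows 5 to 6" appears as (4, 6). The suffix array is built naively and the occurrence table is stored in full, which real FM indexes avoid by sampling. All names are hypothetical.

```python
def build_fm_index(text):
    """Return (suffix_array, first_row, occ) for a text ending in '$'."""
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    last = "".join(text[i - 1] for i in suffix_array)   # BWT(T)
    first_row = {}
    for row, i in enumerate(suffix_array):
        first_row.setdefault(text[i], row)
    # occ[c][r] = number of occurrences of c in last[:r]
    occ = {c: [0] for c in first_row}
    for ch in last:
        for c, counts in occ.items():
            counts.append(counts[-1] + (ch == c))
    return suffix_array, first_row, occ

def backward_search(query, first_row, occ, n_rows):
    """Return the half-open row range [top, bot) of rows prefixed by query."""
    top, bot = 0, n_rows
    for qc in reversed(query):          # right-to-left through Q
        if qc not in first_row:
            return 0, 0
        top = first_row[qc] + occ[qc][top]
        bot = first_row[qc] + occ[qc][bot]
        if top >= bot:                  # empty range: Q does not occur
            return top, top
    return top, bot

def find_occurrences(query, text):
    suffix_array, first_row, occ = build_fm_index(text)
    top, bot = backward_search(query, first_row, occ, len(text))
    return sorted(suffix_array[top:bot])

sa, first_row, occ = build_fm_index("acaacg$")
row_range = backward_search("aac", first_row, occ, 7)
positions = find_occurrences("aac", "acaacg$")
```

For Q = aac the successive ranges are (4, 6), (2, 4), (1, 2), mirroring the slide values 5-6, 3-4, and 2.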

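STAR Alignment: suffix tree approach (STAR's own index is an uncompressed suffix array, a closely related structure). Very fast and accurate for mapping PE-seq and high read counts. O(n) time to build; O(m log n) time to search by binary search over sorted suffixes (a true suffix tree can be searched in O(m)).

Suffix tree (example): Let s=abab. A suffix tree of s is a compressed trie of all suffixes of s$=abab$: { $, b$, ab$, bab$, abab$ }. (Figure: the tree, with edges labeled a, b, and $.)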
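The abab$ example can be sketched with an uncompressed suffix trie. This is deliberately the naive structure: it takes O(n^2) time and space to build, whereas a real suffix tree merges single-child paths into one edge and can be built in O(n). Searching a pattern of length m still takes O(m) steps down from the root. The class and function names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # next character -> TrieNode
        self.start = None       # suffix start position, set on leaves only

def build_suffix_trie(s):
    """Insert every suffix of s + '$' into a trie (naive, O(n^2))."""
    text = s + "$"
    root = TrieNode()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, TrieNode())
        node.start = start
    return root

def occurrences(root, pattern):
    """Walk down the pattern, then collect all suffix starts below it."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.start is not None:
            found.append(current.start)
        stack.extend(current.children.values())
    return sorted(found)

root = build_suffix_trie("abab")
top_level_edges = sorted(root.children)
ab_positions = occurrences(root, "ab")
```

The root has three outgoing edges ($, a, b), matching the three branches in the slide's figure.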

Mapped Seq Files

Mapped SAM:
HWUSI-EAS366_0112:6:1:1298:18828#0/1    16      chr9    98116600        255     38M     *       0       0       TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG  Y\]bc^dab\[_UU`^`LbTUT\ccLbbYaY`cWLYW^  XA:i:1  MD:Z:3C30T3     NM:i:2
HWUSI-EAS366_0112:6:1:1257:18819#0/1    4       *       0       0       *       *       0       0       AGACCACATGAAGCTCAAGAAGAAGGAAGACAAAAGTG  ece^dddT\cT^c`a`ccdK\c^^__]Yb\_cKS^_W\  XM:i:1
HWUSI-EAS366_0112:6:1:1315:19529#0/1    16      chr9    102610263       255     38M     *       0       0       GCACTCAAGGGTACAGGAAAAGGGTCAGAAGTGTGGCC  ^c_Yc\Lcb`bbYdTa\dd\`dda`cdd\Y\ddd^cT`  XA:i:0  MD:Z:38 NM:i:0

SAM fields: Map flag: 0 OK, 4 unmapped, 16 mapped to the reverse strand. Sequence, quality score. XA: mapper-specific. MD: mismatch info, e.g. 3C30T3 = 3 matches, then reference base C, 30 matches, then reference base T, 3 matches. NM: number of mismatches (edit distance to the reference). BAM: binary SAM format.

Mapped BED:
chr1 123450 123500 +
chr5 28374615 28374615 -

BED fields: chr, start, end, strand.

http://samtools.github.io/hts-specs/SAMv1.pdf
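The flag and MD conventions described above can be sketched with a small hand-written parser. This is only an illustration: it handles the eleven mandatory columns plus optional tags, interprets just the two flag bits mentioned in the slide (4 and 16), and treats MD offsets as positions along the aligned read bases, which coincides with read positions for a simple 38M alignment. Real pipelines would normally use an existing SAM/BAM library. The function names are hypothetical.

```python
import re

def parse_sam_line(line):
    """Split one tab-separated SAM alignment line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "qname": fields[0], "flag": int(fields[1]), "rname": fields[2],
        "pos": int(fields[3]), "mapq": int(fields[4]), "cigar": fields[5],
        "seq": fields[9], "qual": fields[10], "tags": {},
    }
    for tag in fields[11:]:
        name, value_type, value = tag.split(":", 2)
        record["tags"][name] = int(value) if value_type == "i" else value
    return record

def strand_status(flag):
    """Interpret only the two flag bits discussed in the slide."""
    if flag & 0x4:
        return "unmapped"
    return "reverse" if flag & 0x10 else "forward"

def md_mismatches(md):
    """Return (offset, reference_base) for each mismatch in an MD string."""
    offset, mismatches = 0, []
    for number, deletion, base in re.findall(r"(\d+)|\^([A-Z]+)|([A-Z])", md):
        if number:
            offset += int(number)       # run of matching bases
        elif base:
            mismatches.append((offset, base))
            offset += 1
        # deleted reference bases (^...) consume no read positions
    return mismatches

line = "\t".join([
    "HWUSI-EAS366_0112:6:1:1298:18828#0/1", "16", "chr9", "98116600", "255",
    "38M", "*", "0", "0", "TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG",
    r"Y\]bc^dab\[_UU`^`LbTUT\ccLbbYaY`cWLYW^",
    "XA:i:1", "MD:Z:3C30T3", "NM:i:2",
])
record = parse_sam_line(line)
status = strand_status(record["flag"])
mismatches = md_mismatches(record["tags"]["MD"])
```

For the first example read, the two MD mismatches agree with NM:i:2, and the matched and mismatched positions add up to the 38 bases in the CIGAR string.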

Mapping Statistics Terms. Mappable reads: reads that match at least one location in the genome. Uniquely mapped reads: reads that match A SINGLE location in the genome; non-unique mapping is caused by repeat sequences in the genome and depends on read length. Uniquely mapped locations: the number of distinct locations hit by uniquely mapped reads. Redundancy: many reads at the same location, a sign of potential PCR amplification bias.

Summary. Sequencing technologies: 1st, 2nd, 3rd generation. Sequence quality assessment: FASTQC. Read mapping: spaced seed; BWA: Burrows-Wheeler transformation, LF mapping; STAR: suffix tree, fast. SAM / BAM format.
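These terms can be made concrete with a small tally. The sketch assumes the mapper's output has already been reduced to a dictionary from read ID to a list of matched locations; the input layout, the function name `mapping_stats`, the toy read IDs, and the "reads per unique location" ratio used as a redundancy indicator are all assumptions for illustration, not definitions from the lecture.

```python
def mapping_stats(hits_per_read):
    """hits_per_read: read id -> list of (chrom, start, strand) matches."""
    mappable = {read: hits for read, hits in hits_per_read.items() if hits}
    unique = {read: hits[0] for read, hits in mappable.items()
              if len(hits) == 1}
    unique_locations = set(unique.values())
    ratio = len(unique) / len(unique_locations) if unique_locations else 0.0
    return {
        "total_reads": len(hits_per_read),
        "mappable_reads": len(mappable),
        "uniquely_mapped_reads": len(unique),
        "uniquely_mapped_locations": len(unique_locations),
        # Values well above 1 mean many reads pile up on the same location.
        "reads_per_unique_location": ratio,
    }

toy_hits = {
    "read1": [("chr9", 98116600, "-")],
    "read2": [("chr9", 98116600, "-")],           # duplicate of read1's location
    "read3": [],                                   # unmapped
    "read4": [("chr1", 123450, "+"), ("chr5", 28374615, "-")],  # repeat
    "read5": [("chr9", 102610263, "-")],
}
stats = mapping_stats(toy_hits)
```

In the toy data, three reads map uniquely but hit only two distinct locations, so the ratio of 1.5 flags some redundancy.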