High Throughput Sequencing

Presentation transcript:

High Throughput Sequencing Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520 Guest lecturer: Wei Li

About me: Wei Li, research fellow at DFCI. Has studied high-throughput sequencing algorithms since shortly after HTS came out (2009). Transcript reconstruction algorithms from high-throughput RNA sequencing data (RNA-seq): IsoInfer/IsoLasso/CEM. CRISPR/Cas9 screening algorithms: design, analysis (MAGeCK/MAGeCK-VISPR).

Why high-throughput sequencing? Also called HTS, next-generation sequencing, or NGS. 2-3 orders of magnitude faster, cheaper, and higher in data throughput compared with “first generation” sequencing. Huge applications in academia and industry.

First generation: Sanger sequencing. Frederick Sanger: the fourth person overall to win two Nobel Prizes (Chemistry, 1958 and 1980).

First Generation Sanger Sequencing: 384 * 1kb / 3 hours

Sanger sequencing materials: Sanger sequencing uses DNA elongation to “read” sequences. dNTPs: required for the normal elongation process. ddNTPs: lack the 3′-OH group, so they stop the synthesis once incorporated (dideoxyNTP: di = two, deoxy = oxygen removed). http://www.slideshare.net/thelawofscience/biotechnology-dna-sequencing

Sanger sequencing setup: 4 test tubes, each containing all four dNTPs (dATP, dGTP, dCTP, dTTP). In addition, each tube also has ONE of the 4 ddNTPs.

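What happens if you have both dATP and ddATP? Synthesis can stop at any “T” in the template: wherever ddATP is incorporated instead of dATP, the chain terminates, so the tube ends up holding fragments that end at each A position.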
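The ddATP tube can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a simulation of the chemistry: the template is read in the order it is copied (strand orientation is ignored), the helper name `ddatp_fragments` is made up for illustration, and every possible stop site is assumed to be hit by at least one molecule in the tube.

```python
# Toy model of the ddATP tube in Sanger sequencing (illustrative assumption,
# not a chemistry simulation). The new strand is complementary to the
# template, so an A (or a chain-terminating ddA) is added opposite every T.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def ddatp_fragments(template):
    """Return every terminated fragment the ddATP tube can contain."""
    fragments = []
    strand = ""
    for base in template:
        strand += COMPLEMENT[base]
        if base == "T":
            # A ddATP may have been incorporated here, ending the chain.
            fragments.append(strand)
    return fragments


fragments = ddatp_fragments("GTCTA")
print(fragments)                     # fragments ending in A
print([len(f) for f in fragments])   # their lengths mark the A positions
```

Sorting the fragment lengths from all four tubes by size is what lets the sequence be read off base by base.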

Sequencing in 2001

Sequencing in 2007

Second Generation: massively parallel sequencing by synthesis. Many different technologies: Illumina, 454, SOLiD, Helicos, etc. Illumina (HiSeq, MiSeq, NextSeq): 1-16 samples, 25M-4B reads, 30-300 bp, 1-8 days, 15 GB-1 TB output. Moving targets.

Illumina Cluster Generation Amplify sequenced fragments in place on the flow cell Can sequence from both the pink and purple adapters (Paired-end seq) Can multiplex many samples / lane

Illumina Sequencing process: 1. Incorporate all 4 nucleotides, each labeled with a different dye. 2. Wash, 4-color imaging. 3. Cleave dye and terminating groups, wash. 4. Repeat cycles.

Illumina Sequencing [figure: imaging cycles 1-6]

Third Generation: single molecule sequencing, no amplification. Fewer but much longer reads. Good for sequencing long reads, but not for read count applications; technology still in development. http://www.youtube.com/watch?v=v8p4ph2MAvI https://www.nanoporetech.com/news/movies#movie-28-minion

High Throughput Sequencing: big (data), fast (speed), cheap (cost), flexible (applications). Cost is falling faster than Moore’s law, so bioinformatic analysis becomes the bottleneck!

High Throughput Sequencing Data Analysis

FASTQ File Format
Each record has four lines: sequence ID, sequence, quality ID, quality score. Quality score using ASCII (higher -> better).

@HWI-EAS305:1:1:1:991#0/1
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS305:1:1:1:991#0/1
MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
@HWI-EAS305:1:1:1:201#0/1
AAGACAAAGATGTGCTTTCTAAATCTGCACTAAT
+HWI-EAS305:1:1:1:201#0/1
PXX[[[[XTXYXTTWYYY[XXWWW[TMTVXWBBB
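As a rough illustration of "quality score using ASCII", the sketch below parses one record from the slide and converts each quality character to a number. The offset of 64 is an assumption: the slide does not state the encoding, but characters such as `[` and the trailing run of `B` look like the older Illumina Phred+64 convention (current files normally use offset 33). The names `FastqRecord`, `parse_fastq`, and `phred_scores` are made up for this example.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four non-empty lines."""
    rows = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(rows) - 3, 4):
        header, sequence, _plus, quality = rows[i:i + 4]
        yield FastqRecord(header[1:], sequence, quality)  # drop leading '@'


def phred_scores(quality: str, offset: int = 64) -> list:
    """Convert ASCII quality characters to integers (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality]


example = """@HWI-EAS305:1:1:1:991#0/1
GCTGGAGGTTCAGGCTGGCCGGATTTAAACGTAT
+HWI-EAS305:1:1:1:991#0/1
MVXUWVRKTWWULRQQMMWWBBBBBBBBBBBBBB
"""

record = next(parse_fastq(example.splitlines()))
scores = phred_scores(record.quality)
print(record.read_id)
print(scores)
```

Under the Phred+64 assumption the read starts at a moderate score and ends in a run of very low scores, which matches the usual drop in quality toward the end of a read.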

FASTQC: Sequencing Quality [figure: two example reports, labeled “Good quality!” and “Poor quality!”]

Read Mapping: Mapping hundreds of millions of reads back to the reference genome is CPU- and RAM-intensive and slow. Read quality decreases with read length (small single-nucleotide mismatches or indels). Most mappers allow ~2 mismatches within the first 30 bp (4^28 could still uniquely identify most 30 bp sequences in a 3 Gb genome) and are slower when allowing indels. Mapping output: SAM (BAM) or BED.
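The 4^28 figure can be checked with a little arithmetic: a 30 bp seed with 2 mismatches still has 28 matching bases. The sketch below assumes a uniformly random genome, which is only an approximation; real genomes contain repeats, so some sequences stay ambiguous regardless of this estimate.

```python
# Back-of-the-envelope check of the slide's 4^28 argument.
# Assumption: bases are independent and uniformly distributed.
GENOME_SIZE = 3_000_000_000   # roughly 3 Gb, as on the slide
MATCHED_BASES = 30 - 2        # 30 bp seed with 2 mismatches allowed

distinct_sequences = 4 ** MATCHED_BASES
# Expected number of chance occurrences of one fixed 28-mer in the genome.
expected_chance_hits = GENOME_SIZE / distinct_sequences

print(f"4^{MATCHED_BASES} = {distinct_sequences:.3e}")
print(f"expected chance hits per 28-mer = {expected_chance_hits:.3e}")
```

Because the number of possible 28-mers exceeds the genome size by many orders of magnitude, a chance match elsewhere in a random genome is very unlikely.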

Read mapping algorithms: spaced seed alignment, Burrows-Wheeler, suffix tree

Spaced seed alignment Tags and tag-sized pieces of reference are cut into small “seeds.” Pairs of spaced seeds are stored in an index. Look up spaced seeds for each tag. For each “hit,” confirm the remaining positions. Report results to the user.
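The steps above can be sketched as a small index-and-verify routine. The slide does not say how the seeds are chosen, so this example makes an assumption: each read-sized window is cut into four chunks and every pair of chunks forms one seed, so that a read with at most two mismatches always keeps at least one seed pair intact. The function names and the toy reference are made up for illustration, and the index is far less memory-efficient than a real mapper's.

```python
from collections import defaultdict
from itertools import combinations


def seed_keys(window, parts=4):
    """Yield one key per pair of chunks: (chunk index a, chunk index b, text a, text b)."""
    size = len(window) // parts
    chunks = [window[i * size:(i + 1) * size] for i in range(parts)]
    for a, b in combinations(range(parts), 2):
        yield (a, b, chunks[a], chunks[b])


def build_index(reference, read_len):
    """Store every seed pair of every read-sized reference window."""
    index = defaultdict(list)
    for pos in range(len(reference) - read_len + 1):
        for key in seed_keys(reference[pos:pos + read_len]):
            index[key].append(pos)
    return index


def map_read(read, reference, index, max_mismatches=2):
    """Look up each seed pair, then confirm the remaining positions."""
    hits = set()
    for key in seed_keys(read):
        for pos in index.get(key, ()):
            window = reference[pos:pos + len(read)]
            mismatches = sum(x != y for x, y in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(pos)
    return sorted(hits)


reference = "ACGTTGCAAGGCTTACGGATCCA"       # toy reference, not real data
index = build_index(reference, read_len=8)
exact_hits = map_read("GCAAGGCT", reference, index)    # reference[5:13]
fuzzy_hits = map_read("TCAGGGCT", reference, index)    # same read, 2 mismatches
print(exact_hits, fuzzy_hits)
```

The verification step is what keeps false seed hits out of the report: a seed match only nominates a position, and the full comparison decides.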

BW alignment

Burrows-Wheeler Store entire reference genome. Align tag base by base from the end. When tag is traversed, all active locations are reported. If no match is found, then back up and try a substitution. Trapnell & Salzberg, Nat Biotech 2009

Burrows-Wheeler Transform: Reversible permutation used originally in compression. Once BWT(T) is built, all else shown here is discarded. The first column can be derived by sorting the last column. [Figure: T, its Burrows-Wheeler matrix, and the last column BWT(T); encoding for compression: gc$ac 1111001.] Burrows M, Wheeler DJ: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, CA, 1994. Slides from Ben Langmead
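A minimal sketch of the transform, assuming the example text is T = acaacg$ (the transcript does not show T, but this choice reproduces the "gc$ac 1111001" encoding on the slide and the row numbers used on the following slides). Sorting all rotations is fine for a toy string; genome-scale tools build the transform from a suffix array instead. `run_encode` is one reading of the slide's compressed form: the distinct run characters plus a bit string marking where each run starts.

```python
def bwt(text):
    """Return the last column of the sorted rotation matrix of text."""
    assert text.endswith("$"), "text must end with the sentinel '$'"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def run_encode(s):
    """Collapse runs: return (run characters, bit string of run starts)."""
    starts = [i == 0 or s[i] != s[i - 1] for i in range(len(s))]
    chars = "".join(c for c, is_start in zip(s, starts) if is_start)
    bits = "".join("1" if is_start else "0" for is_start in starts)
    return chars, bits


last_column = bwt("acaacg$")
first_column = "".join(sorted(last_column))  # first column = sorted last column
print(last_column, first_column)
print(run_encode(last_column))
```

The transform tends to group identical characters into runs, which is why it was useful for compression before it was used for read mapping.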

Burrows-Wheeler Transform: The property that makes BWT(T) reversible is “LF Mapping”: the ith occurrence of a character in the Last column is the same text occurrence as the ith occurrence in the First column. [Figure: Burrows-Wheeler matrix of T; Rank: 2 (2nd ‘a’ in Last column) is the same occurrence as Rank: 2 (2nd ‘a’ in First column).] Slides modified from Ben Langmead

BWT: How to reconstruct T from BWT(T)? To recreate T from BWT(T), repeatedly apply rule: T = BWT[ LF(i) ] + T; i = LF(i) Where LF(i) maps row i to row whose first character corresponds to i’s last per LF Mapping Final T Slides from Ben Langmead

BWT: How to reconstruct T from BWT(T)? To recreate T from BWT(T), repeatedly apply rule: T = BWT[ LF(i) ] + T; i = LF(i) Where LF(i) maps row i to row whose first character corresponds to i’s last per LF Mapping. The first and last columns are known. i=1; this is the last character of T. LF(i)=7; the first ‘g’. BWT[LF(i)]=‘c’; the second last character is ‘c’. i=LF(i)=7. Slides from Ben Langmead

BWT: How to reconstruct T from BWT(T)? To recreate T from BWT(T), repeatedly apply rule: T = BWT[ LF(i) ] + T; i = LF(i) Where LF(i) maps row i to row whose first character corresponds to i’s last per LF Mapping. i=7; this is the second last character of T. LF(i)=6; the second ‘c’. BWT[LF(i)]=‘a’; the 3rd last character is ‘a’. i=LF(i)=6. Slides from Ben Langmead
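The reconstruction rule can be sketched as follows, assuming BWT(T) = gc$aaac (the transform of T = acaacg$, which is an inferred example consistent with the row numbers on these slides). Rows are 0-based here, so the slide's LF(1)=7 and LF(7)=6 appear as `lf[0] == 6` and `lf[6] == 5`. The function names are made up for illustration.

```python
def lf_mapping(bwt_string):
    """lf[i] = row whose first character is the same text occurrence
    as the last character of row i (0-based rows)."""
    # First row of each character in the sorted first column.
    first_row = {}
    for row, ch in enumerate(sorted(bwt_string)):
        first_row.setdefault(ch, row)
    seen = {}   # occurrences of each character seen so far in the last column
    lf = []
    for ch in bwt_string:
        rank = seen.get(ch, 0)
        lf.append(first_row[ch] + rank)
        seen[ch] = rank + 1
    return lf


def inverse_bwt(bwt_string):
    """Rebuild T right to left by walking the LF mapping from row 0."""
    lf = lf_mapping(bwt_string)
    text = "$"
    i = 0                          # row 0 starts with '$'
    while bwt_string[i] != "$":
        text = bwt_string[i] + text    # prepend the preceding character
        i = lf[i]
    return text


lf = lf_mapping("gc$aaac")
recovered = inverse_bwt("gc$aaac")
print(lf, recovered)
```

Each step prepends one character, so the loop runs once per character of T and stops when it reaches the row whose last character is the sentinel.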

BWT: How To Do Exact Matching? To match Q in T using BWT(T), repeatedly apply rule: top = LF(top, qc); bot = LF(bot, qc) Where qc is the next character in Q (right-to-left) and LF(i, qc) maps row i to the row whose first character corresponds to i’s last character as if it were qc Slides from Ben Langmead

BWT: How To Do Exact Matching? To match Q in T using BWT(T), repeatedly apply rule: top = LF(top, qc); bot = LF(bot, qc) Where qc is the next character in Q (right-to-left) and LF(i, qc) maps row i to the row whose first character corresponds to i’s last character as if it were qc. Start: qc=‘c’, top=5, bot=6. The last character of rows 5 and 6 is ‘a’. Next: qc=‘a’, top=LF(5,‘a’)=3, bot=LF(6,‘a’)=4. Slides from Ben Langmead

BWT: How To Do Exact Matching? To match Q in T using BWT(T), repeatedly apply rule: top = LF(top, qc); bot = LF(bot, qc) Where qc is the next character in Q (right-to-left) and LF(i, qc) maps row i to the row whose first character corresponds to i’s last character as if it were qc. Next step: qc=‘a’, top=LF(3,‘a’)=2, bot=LF(4,‘a’)=2. The last characters of rows 3 and 4 are ‘$’ and ‘a’. Slides from Ben Langmead

Exact Matching with FM Index: In progressive rounds, top & bot delimit the range of rows beginning with progressively longer suffixes of Q (from right to left). If the range becomes empty, the query suffix (and therefore the query) does not occur in the text. If there is no match, instead of giving up, try to “backtrack” to a previous position and try a different base (mismatch, much slower). Slides from Ben Langmead
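A minimal sketch of the top/bot narrowing, assuming BWT(T) = gc$aaac (T = acaacg$, an inferred example) and the query Q = aac. It uses 0-based, half-open ranges [top, bot), so the slide's inclusive rows 5-6, 3-4, and 2 appear as (4, 6), (2, 4), and (1, 2). Counting occurrences by scanning the string is a simplification; a real FM index stores sampled counts so each step takes constant time. Backtracking for mismatches is not shown.

```python
def backward_search(bwt_string, query):
    """Return the half-open row range [top, bot) of rows that start with query.
    An empty range (top == bot) means the query does not occur."""
    # first_row[c]: first row of the sorted first column that starts with c.
    counts = {}
    for ch in bwt_string:
        counts[ch] = counts.get(ch, 0) + 1
    first_row = {}
    total = 0
    for ch in sorted(counts):
        first_row[ch] = total
        total += counts[ch]

    def occ(ch, row):
        # Occurrences of ch in the last column above `row` (O(n) scan here).
        return bwt_string[:row].count(ch)

    top, bot = 0, len(bwt_string)
    for qc in reversed(query):           # consume the query right to left
        if qc not in first_row:
            return (0, 0)
        top = first_row[qc] + occ(qc, top)
        bot = first_row[qc] + occ(qc, bot)
        if top >= bot:
            return (top, top)            # range became empty: no match
    return (top, bot)


hit = backward_search("gc$aaac", "aac")
twice = backward_search("gc$aaac", "ac")
miss = backward_search("gc$aaac", "gac")
print(hit, twice, miss)
```

The width of the final range is the number of occurrences; turning rows into text positions needs a (sampled) suffix array, which is left out of this sketch.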

STAR Alignment: Suffix Tree. Very fast and accurate for mapping PE-seq and high read counts. O(n) time to build, O(m log n) time to search.

Suffix tree (Example): Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s$ = abab$: { $, b$, ab$, bab$, abab$ }. [Figure: the tree, with edges labeled a, b, $.]
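Building a compressed trie takes more code than fits here, so the sketch below swaps in a suffix array: the same suffixes of abab$, kept as a sorted list of start positions. It is not a suffix tree, but binary search over it illustrates an O(m log n) lookup for a pattern of length m. The naive construction (sorting whole suffixes) is only suitable for toy strings, and the function names are made up for illustration.

```python
def suffix_array(text):
    """Start positions of all suffixes, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Binary-search the sorted suffixes for those that start with pattern."""
    m = len(pattern)

    def prefix(k):
        return text[sa[k]:sa[k] + m]

    lo, hi = 0, len(sa)
    while lo < hi:                       # first suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                       # first suffix with prefix > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "abab$"
sa = suffix_array(text)
print([text[i:] for i in sa])            # suffixes in sorted order
print(find_occurrences(text, sa, "ab"))
```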

Mapped Seq Files

Mapped SAM:
HWUSI-EAS366_0112:6:1:1298:18828#0/1 16 chr9 98116600 255 38M * 0 0 TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG Y\]bc^dab\[_UU`^`LbTUT\ccLbbYaY`cWLYW^ XA:i:1 MD:Z:3C30T3 NM:i:2
HWUSI-EAS366_0112:6:1:1257:18819#0/1 4 * 0 0 * * 0 0 AGACCACATGAAGCTCAAGAAGAAGGAAGACAAAAGTG ece^dddT\cT^c`a`ccdK\c^^__]Yb\_cKS^_W\ XM:i:1
HWUSI-EAS366_0112:6:1:1315:19529#0/1 16 chr9 102610263 255 38M * 0 0 GCACTCAAGGGTACAGGAAAAGGGTCAGAAGTGTGGCC ^c_Yc\Lcb`bbYdTa\dd\`dda`cdd\Y\ddd^cT` XA:i:0 MD:Z:38 NM:i:0

Map flag: 0 OK, 4 unmapped, 16 mapped reverse strand.
Sequence, quality score.
XA: mapper-specific.
MD: mismatch info (3 match, then C ref, 30 match, then T ref, 3 match).
NM: number of mismatches.
BAM: binary SAM format.

Mapped BED (chr, start, end, strand):
chr1 123450 123500 +
chr5 28374615 28374615 -

http://samtools.github.io/hts-specs/SAMv1.pdf
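The field descriptions above can be made concrete with a small parser for the first SAM record on the slide. This is a sketch, not a replacement for a real SAM library: it handles only the 11 mandatory columns, integer and string tags, and the two flag bits the slide mentions. The record is rebuilt with tabs because SAM is tab-delimited, and the helper names are made up for this example.

```python
import re

SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Split one alignment line into mandatory fields plus optional tags."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for name in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[name] = int(record[name])
    tags = {}
    for item in fields[11:]:
        tag, kind, value = item.split(":", 2)      # TAG:TYPE:VALUE
        tags[tag] = int(value) if kind == "i" else value
    record["tags"] = tags
    return record


def decode_flag(flag):
    """Decode the two flag bits described on the slide."""
    return {"unmapped": bool(flag & 4), "reverse_strand": bool(flag & 16)}


def parse_md(md):
    """Split an MD string such as 3C30T3 into match lengths and ref bases."""
    tokens = re.findall(r"\d+|\^[A-Z]+|[A-Z]", md)
    return [int(t) if t.isdigit() else t for t in tokens]


line = "\t".join([
    "HWUSI-EAS366_0112:6:1:1298:18828#0/1", "16", "chr9", "98116600", "255",
    "38M", "*", "0", "0",
    "TACAATATGTCTTTATTTGAGATATGGATTTTAGGCCG",
    r"Y\]bc^dab\[_UU`^`LbTUT\ccLbbYaY`cWLYW^",
    "XA:i:1", "MD:Z:3C30T3", "NM:i:2",
])

rec = parse_sam_line(line)
md_tokens = parse_md(rec["tags"]["MD"])
mismatched_ref_bases = [t for t in md_tokens if isinstance(t, str)]
print(decode_flag(rec["flag"]), md_tokens, mismatched_ref_bases)
```

For this read the MD string lists two reference bases, which agrees with the NM value of 2 on the same line.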

Mapping Statistics Terms. Mappable locations: reads that can be matched to at least one location in the genome. Uniquely mapped reads: reads that match a SINGLE location in the genome; because of repeat sequences in the genome, this is length-dependent. Uniquely mapped locations: number of unique locations hit by uniquely mapped reads. Redundancy: potential PCR amplification bias.
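These terms can be tallied from a table of read hits, as in the sketch below. The input layout (read ID mapped to a list of locations), the toy reads, and the use of "uniquely mapped reads per unique location" as a redundancy measure are all assumptions made for illustration; the slide does not define a formula.

```python
from collections import Counter


def mapping_stats(read_hits):
    """Summarize mapping results; read_hits maps read ID -> list of locations."""
    mappable = sum(1 for hits in read_hits.values() if hits)
    # Locations hit by reads that map to exactly one place.
    unique_locations = Counter(
        hits[0] for hits in read_hits.values() if len(hits) == 1
    )
    uniquely_mapped = sum(unique_locations.values())
    n_locations = len(unique_locations)
    return {
        "total_reads": len(read_hits),
        "mappable_reads": mappable,
        "uniquely_mapped_reads": uniquely_mapped,
        "uniquely_mapped_locations": n_locations,
        # Assumed redundancy measure: values well above 1 hint at PCR duplicates.
        "reads_per_location": uniquely_mapped / n_locations if n_locations else 0.0,
    }


toy_hits = {                                   # hypothetical data
    "read1": [("chr1", 100, "+")],
    "read2": [("chr1", 100, "+")],             # same place as read1
    "read3": [("chr2", 500, "-")],
    "read4": [("chr1", 900, "+"), ("chr3", 40, "+")],   # repeat: two locations
    "read5": [],                               # unmapped
}

stats = mapping_stats(toy_hits)
print(stats)
```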

Summary. Sequencing technologies: 1st, 2nd, 3rd generation. Sequence quality assessment: FASTQC. Read mapping: spaced seed; BWA: Burrows-Wheeler transform, LF mapping; STAR: suffix tree, fast. SAM / BAM format.