Spliced Transcripts Alignment & Reconstruction

Presentation transcript:

Spliced Transcripts Alignment & Reconstruction (STAR)
Alexander Dobin, Philippe Batut, Sudipto Chakrabortty, Carrie Davis, Delphine Fagegaltier, Sonali Jha, Wei Lin, Felix Schlesinger, Chenghai Xue, Christopher Zaleski, Thomas Gingeras
CSHL

STAR: spliced transcript alignment and reconstruction
- 'Ab initio' detection of splice junctions: un-annotated, non-canonical, distal exons, chimeric ...
- Any read length, any number of SJs per read
- Any (reasonable) number of mismatches and indels
- Unique and all multiple mappers
- Alignment scoring utilizing read quality scores
- "Auto" trimming of poor-quality ends
- Detection of non-templated poly-A tails
- Very fast: 60 million reads per hour for human 75-mer reads
- Memory: RAM ~ 9 * (genome length) bytes, i.e. 25 GB for human
II. Algorithm

Maximum mappable length
- Typical short-read aligner: does the read map entirely, i.e. at full length?
- What is the maximum mappable length?
  - can detect many mismatches
  - can precisely "trim" poor-quality tails
  - can detect splice junctions
- With suffix arrays we find the maximum mappable length in no extra time
[Figure: read segments labeled "Map", "Extend", "Map again"]
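The "no extra time" point can be illustrated with a toy suffix-array search. This is a minimal sketch, not STAR's implementation: the function names `build_suffix_array` and `max_mappable_prefix` are made up for illustration, and the suffix array is built naively, which is only suitable for tiny sequences. The search narrows the suffix-array interval one read base at a time, so the point where the interval becomes empty gives the maximum mappable length as a by-product of the same search that would test full-length mapping.

```python
def build_suffix_array(genome: str) -> list[int]:
    # Naive construction by sorting suffixes; acceptable for a toy genome only.
    return sorted(range(len(genome)), key=lambda i: genome[i:])


def max_mappable_prefix(read: str, genome: str, sa: list[int]):
    """Return (length, positions) of the longest read prefix that occurs exactly in genome."""
    lo, hi = 0, len(sa)  # current suffix-array interval [lo, hi)
    length = 0

    def char_at(k: int, offset: int) -> str:
        p = sa[k] + offset
        # Suffixes that end early sort first; "" compares below any base.
        return genome[p] if p < len(genome) else ""

    while length < len(read) and lo < hi:
        c = read[length]
        # Lower bound: first suffix in [lo, hi) whose next character is >= c.
        a, b = lo, hi
        while a < b:
            m = (a + b) // 2
            if char_at(m, length) < c:
                a = m + 1
            else:
                b = m
        new_lo = a
        # Upper bound: first suffix in [new_lo, hi) whose next character is > c.
        b = hi
        while a < b:
            m = (a + b) // 2
            if char_at(m, length) <= c:
                a = m + 1
            else:
                b = m
        if new_lo == a:
            break  # no suffix extends the match: maximum mappable length reached
        lo, hi = new_lo, a
        length += 1

    positions = sorted(sa[lo:hi]) if length else []
    return length, positions
```

A read that matches at full length simply returns `length == len(read)`; a read that stops early returns the mappable length and all genome positions of that prefix, so multiple mappers are reported as well.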

Scoring with quality scores
- Similar to local alignment scoring, but penalties have probabilistic meaning
- Illumina quality score (QS): +QS for matches; -QS for mismatches
- Penalty for gap opening: [formula not preserved in the transcript]
- Total score: [formula not preserved in the transcript]
- A more elaborate iterative penalty system is being developed:
  - gap penalty is calculated from the mapped gap length distribution
  - mismatch penalties vs. QS are re-calibrated after mapping
- Choose the alignment(s) with the highest score
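The +QS / -QS rule can be sketched as follows. The gap-opening formula is not preserved in this transcript, so the sketch takes the penalty as a caller-supplied number; `gap_open_penalty`, `n_gap_opens`, and both function names are hypothetical and chosen only for illustration.

```python
def alignment_score(read, ref, quals, n_gap_opens=0, gap_open_penalty=0):
    """Sum +QS for each matching base and -QS for each mismatching base.

    gap_open_penalty is an assumed placeholder: the slide's actual
    gap-opening formula is not available here.
    """
    if not (len(read) == len(ref) == len(quals)):
        raise ValueError("read, ref and quals must have the same length")
    score = 0
    for read_base, ref_base, qs in zip(read, ref, quals):
        score += qs if read_base == ref_base else -qs
    return score - n_gap_opens * gap_open_penalty


def best_alignments(candidates):
    """Return every (label, score) candidate that shares the highest score."""
    if not candidates:
        return []
    top = max(score for _, score in candidates)
    return [c for c in candidates if c[1] == top]
```

Returning all top-scoring candidates rather than a single one mirrors the slide's "alignment(s)": ties correspond to multiple mappers.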

STAR alignment algorithm
1. Split each read into "good" pieces by quality scores
2. Map good pieces using suffix arrays
3. Stitch and extend mapped pieces
4. Score and select the best alignment

Splitting the reads
- Split the read at poor-quality bases (QS<15) and at 'N'
- Map each good piece separately
- Recover mismatches caused by poor SNR
- Avoid erroneous mapping caused by sequencing errors: just 1 SNP can cause mis-mapping from paralog to paralog
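A minimal sketch of the splitting step, assuming quality scores are already decoded to integers. The function name `split_read` and the `min_len` parameter are illustrative assumptions; only the QS<15 threshold and the 'N' rule come from the slide.

```python
def split_read(seq, quals, min_qs=15, min_len=1):
    """Split a read into "good" pieces at bases with QS < min_qs or base 'N'.

    Returns a list of (start_offset_in_read, piece_sequence).
    min_len is an assumed filter for discarding very short pieces.
    """
    if len(seq) != len(quals):
        raise ValueError("seq and quals must have the same length")
    pieces = []
    start = None  # start offset of the good piece currently being extended
    for i, (base, qs) in enumerate(zip(seq, quals)):
        bad = qs < min_qs or base in "Nn"
        if bad:
            if start is not None:
                pieces.append((start, seq[start:i]))
                start = None
        elif start is None:
            start = i
    if start is not None:
        pieces.append((start, seq[start:]))
    return [p for p in pieces if len(p[1]) >= min_len]
```

Keeping the start offset with each piece lets a later stitching step place the mapped pieces back in read coordinates.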

Suffix array based search
- For each good piece find the maximum exactly mappable length (could be a multiple mapper)
- If a long portion of the good piece is still unmapped, repeat
- Repeat this procedure backwards (from 3' to 5' of a good piece)
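The repeat-and-reverse loop can be sketched as below. To keep the example short, the exact-match lookup is a brute-force substring search standing in for the suffix-array search; all function names and the `min_len` cutoff for "a long portion" are assumptions made for illustration.

```python
def find_all(text, sub):
    """All start positions of sub in text (overlapping occurrences included)."""
    positions, i = [], text.find(sub)
    while i != -1:
        positions.append(i)
        i = text.find(sub, i + 1)
    return positions


def longest_exact_prefix(piece, genome):
    # Brute-force stand-in for the suffix-array search.
    length = 0
    while length < len(piece) and piece[:length + 1] in genome:
        length += 1
    return length


def map_piece_forward(piece, genome, min_len=3):
    """Greedy 5'->3' pass: returns [(offset_in_piece, length, genome_positions)]."""
    hits, offset = [], 0
    while len(piece) - offset >= min_len:  # "long portion still unmapped"
        n = longest_exact_prefix(piece[offset:], genome)
        if n == 0:
            offset += 1  # base absent from the genome; skip it
            continue
        hits.append((offset, n, find_all(genome, piece[offset:offset + n])))
        offset += n
    return hits


def map_piece_backward(piece, genome, min_len=3):
    """Same procedure from 3' to 5', done by reversing both sequences."""
    hits = []
    for off, n, pos in map_piece_forward(piece[::-1], genome[::-1], min_len):
        start_in_piece = len(piece) - off - n
        genome_positions = sorted(len(genome) - p - n for p in pos)
        hits.append((start_in_piece, n, genome_positions))
    return hits
```

For a read spanning a splice junction, the forward pass yields one hit per exon segment, which is what the stitching step consumes.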

Stitch and extend mapped pieces
- Each uniquely mapped piece originates an alignment window (cluster)
- Collect all mapped pieces within an alignment window (e.g. 200 kb)
- Consider all collinear combinations of mapped pieces
- Choose the combination with the highest score for each cluster
- Choose the alignment cluster with the highest score
[Figure: mapped pieces labeled "Stitch" and "Extend"]
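One way to pick the best collinear combination inside a window is a small dynamic program over the pieces. This is a sketch under simplifying assumptions, not the scoring used by STAR: the score is just the total number of matched bases, the window is measured from the anchor's genome start, and the `Piece` type and function name are invented for the example.

```python
from typing import NamedTuple


class Piece(NamedTuple):
    read_start: int    # offset of the piece in the read
    genome_start: int  # offset of the piece in the genome
    length: int        # number of exactly matched bases (assumed > 0)


def best_collinear_chain(pieces, anchor, window=200_000):
    """Return (score, chain) for the highest-scoring collinear chain near anchor."""
    in_window = sorted(
        p for p in pieces if abs(p.genome_start - anchor.genome_start) <= window
    )
    n = len(in_window)
    if n == 0:
        return 0, []
    best = [p.length for p in in_window]  # best chain score ending at each piece
    prev = [-1] * n                       # back-pointers for chain recovery
    for j in range(n):
        for i in range(j):
            a, b = in_window[i], in_window[j]
            # Collinear: b follows a in both read and genome without overlap.
            collinear = (a.read_start + a.length <= b.read_start
                         and a.genome_start + a.length <= b.genome_start)
            if collinear and best[i] + b.length > best[j]:
                best[j] = best[i] + b.length
                prev[j] = i
    end = max(range(n), key=best.__getitem__)
    chain, k = [], end
    while k != -1:
        chain.append(in_window[k])
        k = prev[k]
    return best[end], chain[::-1]
```

Running this once per anchor and keeping the highest-scoring result corresponds to choosing the best alignment cluster.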

Comparison with exhaustive search
Fly embryo 76-mer RNA-seq, 1 Illumina lane: 8,930,945 total reads, good quality

         Exhaustively mapped   Only in STAR   Missed by STAR
Exact    5,125,614                            2,425
1MM      1,353,709             94             3,217
2MM        417,225             23             4,172

- Only in STAR: multiple mappers by exhaustive search, <0.002% of all reads
- Missed by STAR: poor-quality reads which did not have a single unique "anchor"
- STAR maps 99.8% of all exhaustively mapped reads
III. Application

Reads mapped by STAR, compared with exhaustive search
- 1.5%: multi-mappers
- 8.5%: STAR splice junctions
- 1.8%: not mapped by STAR
- 0.2%: STAR InDels, gap < 20b
- 11%: STAR >2MM or shorter length
- 77%: STAR overlap with exhaustive search

STAR alignments
- ~1,000,000 alignments found by STAR and not by exhaustive search
[Figure: distribution of mapped lengths; mean length = 72]
[Figure: distribution of mismatches]
[Figure annotations: spliced portions, poor-quality tails]

Benchmarks
Millions of reads aligned per hour (single-thread benchmarks, 75-mer reads)

         BLAT   Bowtie   STAR
Fly        13       19     91
Human       1              58

- Bowtie (-v2 -k1) only reports non-spliced alignments with 0-2 MM, 1 or 2 alignments per read
- BLAT and STAR report >2MM and spliced alignments, and all the multiple alignments

Human K562/GM: 2x75

Lane       All reads    % mapped: unique   % mapped: unique+multiple
GM 1/1     16,730,063   75                 83
GM 1/2     16,721,853
GM 1/3     54,477,453   35                 38
GM 2/1     23,817,621   42                 45
GM 2/2     25,536,631   39*
K562 1/1   12,200,529   79                 86
K562 1/2   12,845,645
K562 1/3   47,382,765   47                 50
K562 2/1   25,597,881
K562 2/2   25,996,379   36*

* Only one percentage is preserved for this lane, and its column is unclear. Blank cells are missing from the transcript.

Splice junctions
- Total # of Gencode junctions: 284k
[Figure: number of junctions vs. minimum number of reads per junction; series: Canonical Annotated, Canonical Un-Annotated, Non-Canonical Un-Annotated]

Transcript assembly algorithm
- Use contigs and splice junctions only
- Find all possible collinear, maximally extended transcripts by following all possible paths
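Following all possible paths can be sketched as a depth-first enumeration over a graph whose nodes are contigs and whose edges are splice junctions. The data layout (a dict of contig intervals plus a list of junction pairs) and the function name are assumptions for illustration; the slide does not specify how contigs and junctions are represented.

```python
def assemble_transcripts(contigs, junctions):
    """Enumerate all maximally extended collinear paths through contigs.

    contigs:   dict name -> (start, end) genome interval
    junctions: list of (from_contig, to_contig) pairs
    """
    out_edges = {name: [] for name in contigs}
    has_incoming = set()
    for a, b in junctions:
        # Keep only collinear junctions (left contig ends before right begins),
        # which also guarantees the graph has no cycles.
        if contigs[a][1] < contigs[b][0]:
            out_edges[a].append(b)
            has_incoming.add(b)

    transcripts = []

    def extend(path):
        successors = out_edges[path[-1]]
        if not successors:
            transcripts.append(path)  # cannot be extended further
            return
        for nxt in successors:
            extend(path + [nxt])

    # Start only from contigs with no incoming junction, so every
    # reported path is maximal at both ends.
    for name in sorted(contigs, key=lambda c: contigs[c][0]):
        if name not in has_incoming:
            extend([name])
    return transcripts
```

The number of paths can grow exponentially with the number of alternative junctions, so a practical implementation would likely need some pruning; none is attempted here.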

Examples of transcripts
[Figure: STAR transcripts]

Summary
- STAR: ab initio splice junction detection
- Maximum mappable length search with suffix arrays
- Alignment scoring uses quality scores of the reads
- Very fast: 60M reads/hour for 75-mer reads in human; requires a large amount of RAM (~25 GB for human)
- The code will be beta-released in November '09
dobin@cshl.edu

Chimeric stitching
[Figure: a READ aligned to the Best Mapped Cluster and to Another Mapped Cluster, on chr1 and chr2]
- If the Best Mapped Cluster leaves enough un-mapped read space, try to stitch other clusters that cover the unmapped space
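The chimeric stitching rule can be sketched as follows. The slide does not say how much unmapped read space counts as "enough", so `min_unmapped` is an assumed threshold; the `Cluster` type and the overlap-based choice of the second cluster are likewise illustrative assumptions.

```python
from typing import NamedTuple


class Cluster(NamedTuple):
    chrom: str
    read_start: int  # half-open interval [read_start, read_end) in read coordinates
    read_end: int
    score: int


def chimeric_stitch(read_len, clusters, min_unmapped=20):
    """Return the best cluster plus any clusters stitched into its unmapped read space."""
    best = max(clusters, key=lambda c: c.score)
    parts = [best]
    # Read space left uncovered on either side of the best cluster.
    gaps = [(0, best.read_start), (best.read_end, read_len)]
    for gap_start, gap_end in gaps:
        if gap_end - gap_start < min_unmapped:
            continue  # not enough unmapped space to justify a chimeric partner

        def overlap(c):
            return min(c.read_end, gap_end) - max(c.read_start, gap_start)

        candidates = [c for c in clusters
                      if c is not best and overlap(c) >= min_unmapped]
        if candidates:
            parts.append(max(candidates, key=lambda c: (overlap(c), c.score)))
    return sorted(parts, key=lambda c: c.read_start)
```

When the returned parts lie on different chromosomes, as in the slide's chr1/chr2 figure, the read is a candidate chimeric alignment.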