Spliced Transcripts Alignment & Reconstruction


1 Spliced Transcripts Alignment & Reconstruction
STAR
Alexander Dobin, Philippe Batut, Sudipto Chakrabortty, Carrie Davis, Delphine Fagegaltier, Sonali Jha, Wei Lin, Felix Schlesinger, Chenghai Xue, Christopher Zaleski, Thomas Gingeras
CSHL

2 STAR: spliced transcript alignment and reconstruction
- 'Ab initio' detection of splice junctions: un-annotated, non-canonical, distal exons, chimeric ...
- Any read length, any number of SJs per read
- Any (reasonable) number of mismatches and indels
- Unique and all multiple mappers
- Alignment scoring utilizing read quality scores
- "Auto" trimming of poor-quality ends
- Detection of non-templated poly-A tails
- Very fast: human 75-mer reads, 60 million reads per hour
- Memory: RAM ~ 9*(genome length) bytes; 25GB for human
II. Algorithm
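
The memory rule of thumb above can be checked with quick arithmetic. This is a minimal sketch: the helper names `estimate_ram_gb` and `implied_genome_length` are hypothetical, and it assumes 1 GB = 10^9 bytes, which the slides do not specify.

```python
BYTES_PER_BASE = 9  # from the slide: RAM ~ 9 * (genome length) bytes


def estimate_ram_gb(genome_length):
    """Estimated RAM in GB (assumed 1 GB = 1e9 bytes) for a genome of the given length in bases."""
    return BYTES_PER_BASE * genome_length / 1e9


def implied_genome_length(ram_gb):
    """Genome length in bases that the rule of thumb implies for a given RAM figure."""
    return ram_gb * 1e9 / BYTES_PER_BASE


if __name__ == "__main__":
    # The slide quotes 25GB for human; under the rule this corresponds to
    # an indexed sequence of roughly 2.78 billion bases.
    print(round(implied_genome_length(25) / 1e9, 2))
```

Run the other way, `estimate_ram_gb(2_800_000_000)` gives about 25.2, in line with the figure quoted on the slide.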

3 Maximum mappable length
- Typical short-read aligner: does the read map entirely, i.e. at full length?
- What is the maximum mappable length?
  - can detect many mismatches
  - can precisely "trim" poor-quality tails
  - can detect splice junctions
- With suffix arrays we find the maximum mappable length in no extra time
[Figure: read diagram with steps labeled Map, Extend, Map, Map again]

4 Scoring with quality scores
- Similar to local alignment scoring, but penalties have probabilistic meaning
- Illumina quality score: +QS for matches; -QS for mismatches
- Penalty for gap opening: [formula not preserved in the transcript]
- Total score: [formula not preserved in the transcript]
- A more elaborate iterative penalty system is being developed:
  - gap penalty is calculated from the mapped gap length distribution
  - mismatch penalties vs. QS scores are re-calibrated after mapping
- Choose the alignment(s) with the highest score
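
The +QS/-QS rule can be illustrated with a small sketch. The function name `alignment_score` is hypothetical, and because the slide's gap-opening formula did not survive, the gap penalty is left as a caller-supplied parameter rather than guessed; this is not STAR's actual scoring code.

```python
def alignment_score(read, ref, quals, n_gap_opens=0, gap_open_penalty=0):
    """Score an ungapped read/reference pairing base by base.

    Each match adds the base's quality score (QS); each mismatch subtracts it.
    The gap-opening penalty is an assumed flat value supplied by the caller,
    since the original formula is not available.
    """
    if not (len(read) == len(ref) == len(quals)):
        raise ValueError("read, ref and quals must have the same length")
    score = 0
    for read_base, ref_base, qs in zip(read, ref, quals):
        score += qs if read_base == ref_base else -qs
    return score - n_gap_opens * gap_open_penalty


if __name__ == "__main__":
    # Three matches (30 + 30 + 30) and one low-quality mismatch (-20).
    print(alignment_score("ACGT", "ACCT", [30, 30, 20, 30]))
```

A mismatch at a low-quality base costs less than one at a high-quality base, which is what lets the scoring tolerate likely sequencing errors.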

5 STAR alignment algorithm
1. Split each read into "good" pieces by quality scores
2. Map good pieces using suffix arrays
3. Stitch and extend mapped pieces
4. Score and select the best alignment

6 Splitting the reads
- Split the read at poor-quality bases (QS<15) and at 'N'
- Map each good piece separately
- Recover mismatches caused by poor SNR
- Avoid erroneous mapping caused by sequencing errors: just 1 SNP can cause mis-mapping from paralog to paralog
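
The splitting step can be sketched as follows. The function name `split_good_pieces` and its return format are illustrative assumptions; only the QS<15 threshold and the 'N' rule come from the slide.

```python
def split_good_pieces(seq, quals, min_qs=15):
    """Split a read into maximal runs of good bases.

    A base is "bad" if its quality score is below min_qs or it is an 'N'.
    Returns a list of (offset_in_read, piece_sequence) tuples.
    """
    if len(seq) != len(quals):
        raise ValueError("seq and quals must have the same length")
    pieces = []
    start = None  # start offset of the current good run, if any
    for i, (base, qs) in enumerate(zip(seq, quals)):
        bad = qs < min_qs or base.upper() == "N"
        if bad:
            if start is not None:
                pieces.append((start, seq[start:i]))
                start = None
        elif start is None:
            start = i
    if start is not None:
        pieces.append((start, seq[start:]))
    return pieces


if __name__ == "__main__":
    print(split_good_pieces("ACGTNACGT", [30, 30, 10, 30, 30, 30, 30, 30, 5]))
```

Keeping the offset of each piece within the read makes it possible to stitch the mapped pieces back together later.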

7 Suffix array based search
- For each good piece, find the maximum exactly mappable length (could be a multiple mapper)
- If a long portion of the good piece is still unmapped, repeat
- Repeat this procedure backwards (from 3' to 5' of a good piece)
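
A toy version of the forward search is sketched below. It uses a naively built suffix array over a short string, which is fine for illustration but not for a real genome; the function names are hypothetical, and the backward (3' to 5') pass mentioned on the slide is not implemented.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def _bound(text, sa, prefix, lo, hi, upper):
    """Binary search in sa[lo:hi] on suffixes truncated to len(prefix)."""
    n = len(prefix)
    while lo < hi:
        mid = (lo + hi) // 2
        chunk = text[sa[mid]:sa[mid] + n]
        if chunk < prefix or (upper and chunk == prefix):
            lo = mid + 1
        else:
            hi = mid
    return lo


def max_mappable_prefix(text, sa, query):
    """Longest prefix of query that occurs exactly in text, with all its positions."""
    lo, hi = 0, len(sa)
    best_len = 0
    for n in range(1, len(query) + 1):
        prefix = query[:n]
        new_lo = _bound(text, sa, prefix, lo, hi, upper=False)
        new_hi = _bound(text, sa, prefix, new_lo, hi, upper=True)
        if new_lo == new_hi:  # no suffix starts with this longer prefix
            break
        lo, hi, best_len = new_lo, new_hi, n
    positions = sorted(sa[lo:hi]) if best_len else []
    return best_len, positions


def seed_pieces(text, sa, query):
    """Repeat the search on whatever part of the query is still unmapped."""
    pieces, start = [], 0
    while start < len(query):
        length, positions = max_mappable_prefix(text, sa, query[start:])
        if length == 0:
            start += 1  # base absent from the text; skip it
            continue
        pieces.append((start, length, positions))
        start += length
    return pieces


if __name__ == "__main__":
    genome = "ACGTACGGACGT"
    sa = build_suffix_array(genome)
    print(seed_pieces(genome, sa, "ACGGTACG"))
```

A piece with more than one position is a multiple mapper, as in the slide's first bullet. Because the suffix-array interval only narrows as the prefix grows, the maximum length falls out of the same search that would test for a full-length match.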

8 Stitch and extend mapped pieces
- Each uniquely mapped piece originates an alignment window (cluster)
- Collect all mapped pieces within an alignment window (e.g. 200kb)
- Consider all collinear combinations of mapped pieces
- Choose the combination with the highest score for each cluster
- Choose the alignment cluster with the highest score
[Figure: diagram with steps labeled Stitch, Extend, Extend]
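
Selecting the best collinear combination within one window can be sketched with a simple chaining dynamic program. This is an illustration under stated assumptions, not STAR's implementation: pieces are `(read_start, genome_start, length)` tuples, the score of a combination is simply the total mapped length, and gap or mismatch penalties are ignored.

```python
def best_collinear_chain(pieces):
    """Return (score, chain) for the highest-scoring collinear subset of pieces.

    Two pieces are collinear if the second starts at or after the end of the
    first in both read and genome coordinates.
    """
    if not pieces:
        return 0, []
    pieces = sorted(pieces)  # by read start, then genome start
    n = len(pieces)
    best = [p[2] for p in pieces]  # best chain score ending at each piece
    prev = [-1] * n                # back-pointers for chain recovery
    for j in range(n):
        rj, gj, lj = pieces[j]
        for i in range(j):
            ri, gi, li = pieces[i]
            if ri + li <= rj and gi + li <= gj and best[i] + lj > best[j]:
                best[j] = best[i] + lj
                prev[j] = i
    j = max(range(n), key=best.__getitem__)
    score = best[j]
    chain = []
    while j != -1:
        chain.append(pieces[j])
        j = prev[j]
    return score, chain[::-1]


if __name__ == "__main__":
    window = [(0, 100, 20), (20, 5000, 30), (25, 300, 10), (50, 5030, 25)]
    print(best_collinear_chain(window))
```

In the example, the jump from genome position 120 to 5000 between the first two pieces is the kind of gap that would be reported as a splice junction.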

9 Comparison with exhaustive search
Fly embryo 76-mer RNA-seq, 1 Illumina lane: 8,930,945 total reads, good quality

        Exhaustively mapped   Only in STAR   Missed by STAR
Exact   5,125,614             2,425 [only one of these two cells survives; column unclear]
1MM     1,353,709             94             3,217
2MM     417,225               23             4,172

- Multiple mappers by exhaustive search: <0.002% of all reads
- STAR maps 99.8% of all exhaustively mapped reads
- poor quality reads which did not have a single unique "anchor"
III. Application

10 Comparison with exhaustive search
[Figure: pie chart "Reads mapped by STAR"]
- 1.5% multi-mappers
- 8.5% STAR splice junctions
- 1.8% not mapped by STAR
- 0.2% STAR InDels, gap < 20b
- 11% STAR >2MM or shorter length
- 77% STAR overlap with exhaustive search

11 STAR alignments
- ~1,000,000 alignments found by STAR and not by exhaustive search
[Figures: distribution of mapped lengths (mean length = 72); distribution of mismatches. Annotations: spliced portions, poor quality tails]

12 Benchmarks
- Single-thread benchmarks, 75-mer reads
- Bowtie (-v2 -k1) only reports non-spliced alignments with 0-2 MM, 1 or 2 alignments per read
- BLAT and STAR report >2MM and spliced alignments, and all the multiple alignments

Millions of reads aligned per hour:

        BLAT   Bowtie   STAR
Fly     13     19       91
Human   1, 58 [only two of the three values survive; column assignment unclear]

13 % mapped: unique+multiple
Human K562/GM: 2x75
[blank cells: values not preserved in the transcript]

Lane       All reads    % mapped: unique   % mapped: unique+multiple
GM 1/1     16,730,063   75                 83
GM 1/2     16,721,853
GM 1/3     54,477,453   35                 38
GM 2/1     23,817,621   42                 45
GM 2/2     25,536,631   39 [column unclear]
K562 1/1   12,200,529   79                 86
K562 1/2   12,845,645
K562 1/3   47,382,765   47                 50
K562 2/1   25,597,881
K562 2/2   25,996,379   36 [column unclear]

14 Splice junctions
- Total # of Gencode junctions: 284k
[Figure: number of junctions vs. minimum number of reads per junction; series: Canonical Annotated, Canonical Un-Annotated, Non-Canonical Un-Annotated]

15 Transcript assembly algorithm
- Use contigs and splice junctions only
- Find all possible collinear maximally extended transcripts by following all possible paths
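
Path following over contigs and junctions can be sketched as a depth-first enumeration. The sketch assumes contigs are named nodes, junctions are directed `(donor_contig, acceptor_contig)` pairs, and the resulting graph has no cycles; the function name `enumerate_transcripts` is hypothetical.

```python
def enumerate_transcripts(contigs, junctions):
    """List every maximal path through the contig/junction graph.

    A path is maximally extended when it starts at a contig with no incoming
    junction and ends at a contig with no outgoing junction.
    """
    successors = {c: [] for c in contigs}
    has_incoming = set()
    for donor, acceptor in junctions:
        successors[donor].append(acceptor)
        has_incoming.add(acceptor)

    transcripts = []

    def extend(path):
        next_contigs = successors[path[-1]]
        if not next_contigs:  # nothing further to follow: path is maximal
            transcripts.append(path)
            return
        for contig in next_contigs:
            extend(path + [contig])

    for contig in contigs:
        if contig not in has_incoming:
            extend([contig])
    return transcripts


if __name__ == "__main__":
    contigs = ["e1", "e2", "e3", "e4"]
    junctions = [("e1", "e2"), ("e1", "e3"), ("e2", "e4"), ("e3", "e4")]
    print(enumerate_transcripts(contigs, junctions))
```

Following all paths can produce a number of transcripts that grows exponentially with the number of alternative junctions, so a practical implementation would likely need some pruning.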

16 Examples of transcripts
STAR transcripts

17 Examples of transcripts
STAR transcripts

18 Summary
- STAR: ab initio splice junction detection
- Maximum mappable length search with suffix arrays
- Alignment scoring uses quality scores of the reads
- Very fast: 60M reads/hour for 75-mer reads in human; requires a large amount of RAM (~25GB for human)
- The code will be beta-released in November '09

19 Examples of transcripts
STAR transcripts

20 Chimeric stitching
- If the Best Mapped Cluster leaves enough un-mapped read space, try to stitch other clusters that cover the unmapped space
[Figure labels: READ, Best Mapped Cluster, Another Mapped Cluster, chr1, chr2]
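
The chimeric rule can be sketched as a check on the read space left uncovered by the best cluster. Everything specific here is an assumption made for illustration: the `Cluster` record, the function name `stitch_chimeric`, and the 20-base threshold for "enough" unmapped space, which the slide does not quantify.

```python
from collections import namedtuple

# Hypothetical record: the read interval [read_start, read_end) a cluster covers.
Cluster = namedtuple("Cluster", "read_start read_end chrom score")


def stitch_chimeric(read_len, clusters, min_unmapped=20):
    """Return (best_cluster, partner_or_None).

    The partner is the highest-scoring other cluster that fits entirely inside
    a stretch of the read left unmapped by the best cluster, provided that
    stretch is at least min_unmapped bases long.
    """
    if not clusters:
        return None, None
    best = max(clusters, key=lambda c: c.score)
    gaps = [(0, best.read_start), (best.read_end, read_len)]
    candidates = [
        c for c in clusters
        if c is not best
        and any(hi - lo >= min_unmapped and lo <= c.read_start and c.read_end <= hi
                for lo, hi in gaps)
    ]
    partner = max(candidates, key=lambda c: c.score, default=None)
    return best, partner


if __name__ == "__main__":
    a = Cluster(0, 40, "chr1", 40)
    b = Cluster(40, 76, "chr2", 36)
    print(stitch_chimeric(76, [a, b]))
```

In the example, the best cluster covers read bases 0-40 on chr1, leaving 36 bases unmapped, so the chr2 cluster covering 40-76 is stitched on as the chimeric partner.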

