SNAP: Fast, accurate sequence alignment enabling biological applications Ravi Pandya, Microsoft Research ASHG 10/19/2014.

SNAP
SNAP is fast*
  Align 50x genome in 1.2 hours (BWA-MEM = 11.75 hours)
  Sort + index + markdup BAM in 2 hours (samtools+sambamba = 4.25 hours)
SNAP is as accurate as BWA-MEM, Bowtie2, etc.
  ROC on simulated data
  % aligned on real data
  Variant calls on real data
* NA12878:ERR194147, Azure D14 (16 cores, 112GB RAM, 800GB SSD)

Sequence alignment
The problem: Given a read R and a reference genome G, find the position p in G that minimizes EditDistance(R, G[p .. p + |R|])
SNAP solves this quickly and accurately because of:
  Efficient system architecture
  Reducing the number of comparisons
  Reducing the cost of comparisons
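
To make the problem statement concrete, here is a minimal brute-force sketch that scores the read against every window of the genome and keeps the best. It is a baseline for illustration only, not SNAP's method; the function names `edit_distance` and `best_position` are made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def best_position(read: str, genome: str) -> tuple[int, int]:
    """Return (p, d) minimizing d = edit_distance(read, genome[p:p + len(read)])."""
    best_p, best_d = -1, len(read) + 1
    for p in range(len(genome) - len(read) + 1):
        d = edit_distance(read, genome[p:p + len(read)])
        if d < best_d:  # strict comparison keeps the leftmost best window
            best_p, best_d = p, d
    return best_p, best_d


if __name__ == "__main__":
    print(best_position("TTGTC", "ACGTACGTTTGACCA"))
```

Every window costs a full quadratic comparison here, which is the expense that reducing the number and the cost of comparisons is meant to avoid.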

System architecture
Diagram labels: full, align, sort, async read, async write, empty, temp file, merge sort, mark duplicates, index, compress

The sequence alignment problem
The easy part: 97% of 20-mers in the human genome occur only once, but at only 75% of locations
The hard part: the other 3% of 20-mers and 25% of locations
10% of reads, 95% of time
Figure: CDF of per-read/pair alignment time, NA18705 169M pairs (using deeper search parameters than current defaults)
Bill Bolosky, MSR
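
The two percentages on this slide measure different things: the share of distinct 20-mers that occur once, and the share of genome locations those unique 20-mers cover. A small sketch of how both could be computed on a toy sequence follows; `kmer_uniqueness` is a hypothetical helper, and a real genome would need a far more memory-efficient approach.

```python
from collections import Counter


def kmer_uniqueness(genome: str, k: int) -> tuple[float, float]:
    """Return (fraction of distinct k-mers seen once,
               fraction of k-mer locations holding such a k-mer)."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    unique = sum(1 for c in counts.values() if c == 1)
    locations = sum(counts.values())
    return unique / len(counts), unique / locations


if __name__ == "__main__":
    # "AA" repeats four times, so most distinct 2-mers are unique
    # while fewer than half of the locations are.
    print(kmer_uniqueness("AAAAACGT", 2))
```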

Hash table lookup
Build a multi-valued map (~30GB for hg19) from all seeds S in G -> all locations of S in G
For all seeds in read, all locations of seed in genome: score implied alignment of read, keep the best
Ignore frequent seeds (>300 occurrences)
Only use a few seeds/read
Throughput: 330 reads/s -> 14k reads/s (42x)
Bill Bolosky, MSR
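
The lookup scheme on this slide can be sketched with an ordinary dictionary. This is an illustration under simplifying assumptions, not SNAP's index: the names `build_index` and `candidate_locations`, the choice of non-overlapping seeds, and the `max_seeds` default are all hypothetical. Only the 300-occurrence cutoff comes from the slide.

```python
from collections import defaultdict


def build_index(genome: str, seed_len: int) -> dict[str, list[int]]:
    """Map every seed (substring of length seed_len) to all of its locations."""
    index = defaultdict(list)
    for p in range(len(genome) - seed_len + 1):
        index[genome[p:p + seed_len]].append(p)
    return dict(index)


def candidate_locations(read: str, index: dict[str, list[int]], seed_len: int,
                        max_occurrences: int = 300, max_seeds: int = 4) -> dict[int, int]:
    """Return {implied read start: number of seed hits} for a few seeds of the read."""
    hits: dict[int, int] = {}
    used = 0
    # Assumption: take non-overlapping seeds from left to right.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        if used == max_seeds:
            break  # only use a few seeds per read
        locations = index.get(read[offset:offset + seed_len], [])
        if len(locations) > max_occurrences:
            continue  # ignore frequent seeds
        used += 1
        for loc in locations:
            start = loc - offset  # where the read would begin if this seed matched
            if start >= 0:
                hits[start] = hits.get(start, 0) + 1
    return hits


if __name__ == "__main__":
    idx = build_index("ACGTACGTTTGACCA", 4)
    print(candidate_locations("GTTTGACC", idx, 4))
```

Each candidate start would then be scored against the read, keeping the best; the hit counts are the kind of information a scorer can use to order candidates.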

Fast scoring
O(n^2) -> Ukkonen O(nd), n=len, d=min(limit, actual): 6.6x (92k reads/s)
Use limit = best score so far + 2 (for MAPQ): 1.2x (113k reads/s)
Sort candidates by # of seed hits; skip locations with # seed misses > limit: 1.4x (154k reads/s, 470x overall)
Bill Bolosky, MSR
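
The O(nd) idea is that a comparison can be abandoned once the distance must exceed a limit, so only a diagonal band of the dynamic-programming table needs filling. The sketch below is a simple banded variant in that spirit, written for this transcript; it is not taken from SNAP's source, and `bounded_edit_distance` is a hypothetical name.

```python
from typing import Optional


def bounded_edit_distance(a: str, b: str, limit: int) -> Optional[int]:
    """Return the edit distance of a and b if it is <= limit, else None.

    Only cells (i, j) with |i - j| <= limit are filled, so the work is
    proportional to len(a) * (2 * limit + 1) rather than len(a) * len(b).
    """
    n, m = len(a), len(b)
    if abs(n - m) > limit:
        return None
    inf = limit + 1               # any value above limit means "too far"
    width = 2 * limit + 1         # band slot k holds column j = i + k - limit
    prev = [inf] * width
    for k in range(width):        # row 0: D[0][j] = j
        j = k - limit
        if 0 <= j <= m:
            prev[k] = j
    for i in range(1, n + 1):
        cur = [inf] * width
        for k in range(width):
            j = i + k - limit
            if j < 0 or j > m:
                continue
            if j == 0:
                cur[k] = i        # column 0: D[i][0] = i
                continue
            best = prev[k] + (a[i - 1] != b[j - 1])   # diagonal: match or substitute
            if k + 1 < width:
                best = min(best, prev[k + 1] + 1)     # from the row above
            if k > 0:
                best = min(best, cur[k - 1] + 1)      # from the cell to the left
            cur[k] = min(best, inf)
        if min(cur) > limit:
            return None           # every path already exceeds the limit
        prev = cur
    d = prev[m - n + limit]
    return d if d <= limit else None


if __name__ == "__main__":
    print(bounded_edit_distance("kitten", "sitting", 3))
    print(bounded_edit_distance("kitten", "sitting", 2))
```

Passing a limit tied to the best score found so far, as the slide describes, lets poor candidates exit early after a few rows.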

Paired-end alignment
Find & score candidate location pairs
C(R1:R2) = C(R1) ∩ C(R2) {± insert size}
Enumerate in O(h log n), where h = |C(R1) ∩ C(R2)| and n = |C(R1)| + |C(R2)|
Increases accuracy by allowing much higher limit on seed occurrences (e.g. 4k vs 300)
Bill Bolosky, MSR
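
One way to picture the intersection "within insert size" is a binary search over sorted candidate lists. The sketch below is a simplified stand-in, not the enumeration SNAP uses: it runs in O(|C(R1)| log |C(R2)| + h) rather than the O(h log n) quoted on the slide, and the names `candidate_pairs` and `max_spacing` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def candidate_pairs(c1: list[int], c2: list[int], max_spacing: int) -> list[tuple[int, int]]:
    """Return all (p1, p2) with p1 in c1, p2 in c2 and |p2 - p1| <= max_spacing.

    Both input lists must be sorted in ascending order.
    """
    pairs = []
    for p1 in c1:
        lo = bisect_left(c2, p1 - max_spacing)   # first p2 >= p1 - max_spacing
        hi = bisect_right(c2, p1 + max_spacing)  # one past last p2 <= p1 + max_spacing
        pairs.extend((p1, p2) for p2 in c2[lo:hi])
    return pairs


if __name__ == "__main__":
    print(candidate_pairs([100, 5000, 90000], [350, 5200, 5400, 70000], 500))
```

Because a location only survives if its mate has a nearby candidate, each read can afford a much longer candidate list, which is consistent with the higher seed-occurrence limit mentioned on the slide.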

Results: simulated data Mason-generated paired-end 100bp reads

Results: real data NA18507 (Illumina HiSeq 50x) * AWS cr1.8xlarge (32 cores, 244GB RAM, 2x120GB SSD)

Results: GATK variant calls Broad GATK pipeline, curated NA12878 variant calls

Results: NIST Genome-in-a-Bottle
Appistry GATK pipeline, GIAB highly confident calls
Longer seeds are much faster, similar precision/recall
Chart label: 11.75 (matches the BWA-MEM alignment time in hours quoted on the summary slide)
ERR194147*.fastq.gz, Azure D14 (16 cores, 112GB RAM, 800GB SSD)

Results: NIST Genome-in-a-Bottle
Lower confidence calls (qual>20, 2 platforms)

Highly confident
            indel               snp
Aligner     Recall   Precision  Recall   Precision
bwa-mem     97.24%   97.15%     99.57%   99.65%
snap-20     97.04%   97.48%     99.51%   99.57%
snap-24     97.04%   97.46%     99.52%   99.57%
snap-28     97.04%   97.45%     99.53%   99.57%
snap-32     97.00%   97.41%     99.51%   99.57%

Lower confidence
            indel               snp
Aligner     Recall   Precision  Recall   Precision
bwa-mem     96.38%   96.30%     99.00%   99.32%
snap-20     96.17%   96.68%     98.94%   99.25%
snap-24     96.17%   96.67%     98.95%   99.23%
snap-28     96.16%   96.62%     98.96%   99.21%
snap-32     96.11%   96.55%     98.94%   99.17%

Pathogen ID: SURPI (Charles Chiu, UCSF) “This analysis of DNA sequences required just 96 minutes. A similar analysis conducted with the use of previous generations of computational software on the same hardware platform would have taken 24 hours or more to complete, Chiu said.”

SURPI
SNAP enables SURPI with:
  Fast filtering mode
  64-bit index for >40GB ntDB
  Secondary mapping output
Charles Chiu, UCSF

Acknowledgements
Microsoft Research: Bill Bolosky, Ravi Pandya
UC San Francisco: Taylor Sittler
Broad Institute: Christopher Hartl
UC Berkeley AMPLab: Matei Zaharia, Kristal Curtis, Armando Fox, Scott Shenker, Ion Stoica, David Patterson
Binaries, source, documentation (Apache 2.0 licensed): http://snap.cs.berkeley.edu
