SeqMapReduce: software and web service for accelerating sequence mapping Yanen Li Department of Computer Science, University of Illinois at Urbana-Champaign.

SeqMapReduce: software and web service for accelerating sequence mapping. Yanen Li, Department of Computer Science, University of Illinois at Urbana-Champaign. Email: yanenli2@illinois.edu. 10/05/2009, CAMDA 2009, Chicago.

Challenge of NGS Alignment. Sequences: short (25 to 76 bp). Size of data set: large, still increasing. BLAST? BLAST: transaction, long query. NGS aligner: batch, short query. We need an INDEX!

The NGS Aligner War: Where are you?

NGS Aligner Classification: Standalone Algorithms. Hash Reads: Eland, RMAP, MAQ, SHRiMP … Pros: less RAM, less overhead. Cons: wasteful genome scan. Hash Genome: SOAP, PASS, Mosaik, BFAST … Pros: fast, scales up well. Cons: big RAM, heavy overhead. Index Genome (Burrows-Wheeler): Bowtie, BWA.

NGS Aligner Classification: Parallel Algorithms. Options and things to consider. Multi-thread: hard to scale up to many cores. Cluster computing: load balancing, fault tolerance. Cloud computing: restricted programming interface.

Programming Model of Cloud Computing: MapReduce. The developer supplies two functions, map and reduce. All values v with the same key k are reduced together. A simple framework that can usually scale up well.
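The MapReduce model described on this slide can be illustrated with a toy, single-process sketch. It is not Hadoop and not SeqMapReduce code; the names run_mapreduce, count_kmers_map and count_kmers_reduce are hypothetical, and the k-mer counting task is chosen only as a small example.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy in-memory MapReduce: map every record, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        # The map function emits zero or more (key, value) pairs per record.
        for key, value in map_fn(record):
            groups[key].append(value)
    # All values that share a key are handed to the reduce function together.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def count_kmers_map(read, k=3):
    """Hypothetical map function: emit (k-mer, 1) for every k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def count_kmers_reduce(kmer, counts):
    """Hypothetical reduce function: sum the counts collected for one k-mer."""
    return sum(counts)


result = run_mapreduce(["ACGTAC", "CGTACG"], count_kmers_map, count_kmers_reduce)
```

In a real framework the grouping step is a distributed shuffle, which is where the built-in load balancing and fault tolerance come in; the developer still writes only the two functions.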

Why Is Cloud Computing Attractive? It fits Data Intensive Computing (DIC), and NGS alignment is DIC in nature. Hadoop: an open-source cloud computing system with built-in load balancing and fault tolerance. Easy to program.

Cloud-Based NGS Aligners. Hash Reads: SeqMapReduce. Hash Both: CloudBurst. Hash/index Genome will be the next. SeqMapReduce: hashes all reads in RAM on every node. CloudBurst: hashes reads and the genome, but not in RAM.

The SeqMapReduce Framework

Inside SeqMapReduce. Pre-processing: formatting the genome. Format once, use every time. Bases at the end are duplicated.
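One plausible reading of "bases at the end are duplicated" is that the genome is cut into pieces and each piece repeats the first bases of the next one, so a read that spans a boundary can still be found within a single piece. The slide does not say this explicitly, so the sketch below is an assumption; split_with_overlap, chunk_size and read_len are hypothetical names.

```python
def split_with_overlap(genome, chunk_size, read_len):
    """Split a genome string into chunks that overlap by read_len - 1 bases.

    Assumption: the duplicated bases let every window of read_len bases
    fall entirely inside at least one chunk.
    Returns a list of (start_offset, chunk_sequence) pairs.
    """
    overlap = read_len - 1
    chunks = []
    start = 0
    while start < len(genome):
        end = min(start + chunk_size + overlap, len(genome))
        chunks.append((start, genome[start:end]))
        if end == len(genome):
            break  # the last chunk already reaches the end of the genome
        start += chunk_size
    return chunks


chunks = split_with_overlap("ACGTACGTAC", chunk_size=4, read_len=3)
```

Keeping the start offset with each chunk lets a hit inside a chunk be translated back to a genome coordinate.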

Inside SeqMapReduce. Map phase: Seed & Filtering. Divide a read into K parts. If there are M mismatches, at least (K-M) parts are exactly matched. E.g. K=4, M=2: 4-2=2 parts are exactly matched, and choosing 2 of the 4 parts gives C(4,2) = 6 combinations, so we need only 6 hash tables. Genome sequences are scanned for potential hits, which then go to mismatches counting.
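The seed-and-filter idea can be sketched as follows. This is a minimal illustration of the pigeonhole argument on the slide, not the SeqMapReduce implementation: the function names are hypothetical, all reads are assumed to have the same length, and plain Python dictionaries stand in for the hash tables.

```python
from collections import defaultdict
from itertools import combinations


def part_bounds(read_len, k):
    """Start/end offsets of k nearly equal parts of a read."""
    return [(i * read_len // k, (i + 1) * read_len // k) for i in range(k)]


def seed_keys(seq, k, m):
    """Yield (combo, key) for every choice of k - m parts of seq."""
    bounds = part_bounds(len(seq), k)
    for combo in combinations(range(k), k - m):
        yield combo, "".join(seq[bounds[i][0]:bounds[i][1]] for i in combo)


def build_tables(reads, k, m):
    """One hash table per combination of parts; the reads are what gets hashed."""
    tables = defaultdict(lambda: defaultdict(list))
    for read_id, read in enumerate(reads):
        for combo, key in seed_keys(read, k, m):
            tables[combo][key].append(read_id)
    return tables


def candidate_hits(genome, reads, k, m):
    """Scan the genome and return (read_id, position) pairs that pass the filter.

    Candidates still need a full mismatch count; the filter only guarantees
    that no true hit with at most m mismatches is lost.
    """
    read_len = len(reads[0])  # assumption: all reads share one length
    tables = build_tables(reads, k, m)
    hits = set()
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        for combo, key in seed_keys(window, k, m):
            for read_id in tables[combo].get(key, ()):
                hits.add((read_id, pos))
    return hits


genome = "ACGTACGTTTGACCAGTA"
reads = ["TCGTATGA"]  # genome[4:12] with two substitutions
hits = candidate_hits(genome, reads, k=4, m=2)
```

With K=4 and M=2 the table count is 6, matching the slide; the read above differs from the genome in parts 0 and 2, so the combination of parts 1 and 3 still matches exactly at position 4.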

Inside SeqMapReduce. Reduce phase: aggregating intermediate results. Post-processing: duplication detection, mismatches counting, final output report.

Inside SeqMapReduce. Mismatches counting. Naive way: simple counting (O(N)). Mismatches counting using bit operations: bit-wise XOR (exclusive or) of 2-bit codes.
XOR   00  01  10  11
00    00  01  10  11
01    01  00  11  10
10    10  11  00  01
11    11  10  01  00

Mismatches counting. Original R (read) and G (genome): W = R XOR G. W is a combination of the pairs 00, 01, 10 and 11, where 00 marks a matching base. Define 2 constants: W1 = 10101010… and W2 = 01010101… X = W & W1 (keep 10, clear 01, 11 => 10). Y = W & W2 (keep 01, clear 10, 11 => 01), then Y << 1. N = POPCNT(X | Y).
Example:
W          = 00011011
W1         = 10101010
X = W & W1 = 00001010
W2         = 01010101
Y = W & W2 = 00010001
Y << 1     = 00100010
X | Y      = 00101010, so N = 3
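The bit-operation counting can be sketched in Python. The 2-bit base encoding (A=00, C=01, G=10, T=11) and the function names are assumptions for illustration, since the slides do not state which encoding SeqMapReduce uses; Python integers stand in for machine words, and the population count is done by counting "1" characters rather than a hardware POPCNT instruction.

```python
# Assumed 2-bit encoding of bases; the slides do not specify one.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def count_mismatches(read, ref):
    """Count mismatching bases between two equal-length, non-empty sequences."""
    n = len(read)
    w1 = int("10" * n, 2)  # mask selecting the high bit of every pair
    w2 = int("01" * n, 2)  # mask selecting the low bit of every pair
    w = encode(read) ^ encode(ref)  # non-zero pair means a mismatch
    x = w & w1                      # 10 and 11 become 10
    y = (w & w2) << 1               # 01 and 11 become 10 after the shift
    # Every mismatching pair now has exactly one set bit in x | y.
    return bin(x | y).count("1")


n_mismatches = count_mismatches("AAAA", "ACGT")
```

With the assumed encoding, "AAAA" against "ACGT" gives W = 00011011, the same word used in the slide's worked example.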

Web Service of SeqMapReduce

Input format: .zip of FASTA-format reads. Reads can be uploaded through the web site. Supports 13 model organisms. Supports reads longer than 32 bp. Up to 5 mismatches. No indels in the current version (will update soon). Output in ELAND format. Free of charge for academics. Users: small labs that want quick results but cannot afford expensive hardware and software.

Results on CAMDA 2009 datasets. Pol II ChIP-seq: FC201WVA_20080307_s_5 (4.5 million). IFNg-stimulated STAT1 ChIP-seq: FC302MA_20080507_s_1 (6.2 million). Illinois Cloud Computing Testbed (CCT). Each node: 64-bit 2.6 GHz CPUs, 16 GB RAM, and 2 TB storage. 2 mismatches are allowed. Accuracy: 95% of results are the same as MAQ.

Speed Up. [Figure: run time vs. number of cores, Pol II data set] [Figure: run time vs. number of cores, STAT1 data set] Speed-up is quasi-linear in the number of cores. Average overhead times shown with the two plots: 67.22 s and 86.09 s.

Scale Up.
Data set   Size          Size Ratio   Run Time      Run Time Ratio
STAT1      6.2 million   1.38         364 seconds   1.03
Pol II     4.5 million                354 seconds
RAM requirement: ~50 MB per million reads. Can scale up to tens of millions of reads with several GB of RAM.
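The ratios and the RAM estimate on this slide can be checked with a few lines of arithmetic. The sketch assumes that "50 M" means 50 megabytes and that 1 GB is 1000 MB; estimated_ram_gb is a hypothetical helper, and the 40 million read count is an arbitrary illustrative input, not a measured data set.

```python
MB_PER_MILLION_READS = 50  # approximate figure quoted on the slide (assumed to be MB)


def estimated_ram_gb(n_reads):
    """Rough RAM estimate in GB, assuming linear growth and 1 GB = 1000 MB."""
    return n_reads / 1e6 * MB_PER_MILLION_READS / 1000


size_ratio = 6.2 / 4.5   # STAT1 reads / Pol II reads
time_ratio = 364 / 354   # STAT1 run time / Pol II run time
ram_for_40m = estimated_ram_gb(40e6)  # illustrative input only
```

The data set grows by about 38% while the run time grows by about 3%, which is the scale-up claim the table makes.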

Comparison to CloudBurst. Why is CloudBurst slow? It hashes both reads and genome with the Hadoop system hash function. No filtering in the Map phase: heavy I/O to the Reduce phase.

Results on Amazon EC2. Speed-up is similar to that on the UIUC Hadoop cluster, but run times are slower. Large Standard Instances were chosen. Cost: $99.01.

Future Plans. Apply to bisulfite reads for genome-wide methylation analysis. Web-based visualization of short-read alignments.

Acknowledgements: UIUC Cloud Test Bed, Michael Schatz, CAMDA Organizers. This work is supported by NSF DBI 08-45823 (SZ). Thank you!
