A Proposed Solution to the Short Read Reassembly Problem Carl Ebeling and Corey Olson.

1 A Proposed Solution to the Short Read Reassembly Problem Carl Ebeling and Corey Olson

2 Outline Background Indexing Solution Architecture

3 Motivation Solexa/Illumina and SOLiD: ~billions of base pairs in hours; 100s of millions of short reads (30-70 bp) read in parallel. Computational cost rising. Needed: hardware solution to improve speed and usability.

4 Background Goal: quickly align millions of reads to the reference genome. Read errors and SNPs prevent simple indexing. Solutions: brute-force comparison of all reads to the reference; index-based using seeds; Burrows-Wheeler Transform.

5 Index-Based Solution Reference Index Table (RIT): maps all seeds to positions in the reference. Read Position Table (RPT): maps reads to regions in the reference for comparison. Smith-Waterman comparison: stream the reference genome into SW units for scoring of reads.

6 RIT Creation [Figure: a mask is applied to the reference to generate seeds (example sequence CATGCTAT), which index rows of the RIT.] Note: first column is number of entries


8 Read Scoring [Figure: an SW unit, the RPT with reference ranges such as 0:31 and 64:95, and Read #6; sequence shown: TAGTGTGATCGAA.]

9 Buckets Buckets combine hits for a read along the reference. Reduces the number of SW units required. Optimal bucket length unknown.

10 Entries Per Location in RIT N = number of base pairs in reference genome; k = characters in the seed (number of 1s in the mask). Note: each entry in RIT ~ 4 bytes, 2^(2k) total locations, N entries. N=2^31, k=11: RIT = 2^31 * 2^2 bytes = 8 GB. N=2^32, k=14: RIT = 2^32 * 2^2 bytes = 16 GB.
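The sizing arithmetic on this slide can be checked with a short sketch. The slide's own formula for entries per location did not survive in the transcript, so the average below (N divided by the 2^(2k) locations) is an assumption that follows from the stated counts and presumes seeds are spread evenly. The function names are hypothetical.

```python
GIB = 2 ** 30  # bytes in one binary gigabyte, as the slide's "GB" figures imply


def rit_size_bytes(n_bases: int, bytes_per_entry: int = 4) -> int:
    """Total RIT size: one ~4-byte entry per reference position."""
    return n_bases * bytes_per_entry


def avg_entries_per_location(n_bases: int, k: int) -> float:
    """Average entries per RIT location, assuming evenly distributed seeds.

    There are 4**k == 2**(2*k) possible seeds of k characters.
    """
    return n_bases / 4 ** k


print(rit_size_bytes(2 ** 31) // GIB)         # 8
print(rit_size_bytes(2 ** 32) // GIB)         # 16
print(avg_entries_per_location(2 ** 31, 11))  # 512.0
print(avg_entries_per_location(2 ** 32, 14))  # 16.0
```

Real genomes are repetitive, so actual per-location counts would likely vary widely around these averages.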

11 Entries in RPT R = number of reads Seff = effective number of seeds per read Ex: R=2^27, Seff=2: 2^20 * 2048 * 4 = 8GB

12 Entries per Bucket b = bucket size Note: this determines the number of SW units required

13 Architecture Memory required: 8 GB for RIT, 8 GB for RPT. Creation of RIT and RPT is random access. Access time can be masked with buffering and multiple memory banks. High-bandwidth communication required between FPGAs.

14 RIT Creation Algorithm
1. Move to the next reference character
2. Generate the next seed with the mask
3. Using seed as address, open DRAM row
   a) Read current array length
   b) Increment array length and write back
   c) Write reference position to array[length]
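The RIT creation steps can be sketched in software as follows. This is a minimal illustration, not the hardware design: a Python dictionary of lists stands in for the DRAM rows and their length counters, and the mask is assumed to be a string of 1s and 0s in which a 1 keeps the character at that offset. The names `apply_mask` and `build_rit` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def apply_mask(window: str, mask: str) -> str:
    """Keep only the characters of `window` where the mask has a '1'."""
    return "".join(c for c, m in zip(window, mask) if m == "1")


def build_rit(reference: str, mask: str) -> Dict[str, List[int]]:
    """Map each masked seed to the list of reference positions where it starts."""
    rit: Dict[str, List[int]] = defaultdict(list)
    span = len(mask)
    # Step 1: move along the reference one character at a time.
    for pos in range(len(reference) - span + 1):
        # Step 2: generate the next seed with the mask.
        seed = apply_mask(reference[pos:pos + span], mask)
        # Step 3: append the position; list.append plays the role of
        # "read length, increment, write position to array[length]".
        rit[seed].append(pos)
    return dict(rit)


print(build_rit("CATGCAT", "111"))
# {'CAT': [0, 4], 'ATG': [1], 'TGC': [2], 'GCA': [3]}
```

With a spaced mask such as "101", the middle character is ignored, so windows that differ only there share one seed.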

15 Memory Distribution RIT: distributed by seed (seed prefixes AA.. through TT.. split across memory modules). RPT: buckets partitioned across memory modules by reads (RPT part 0 through RPT part n-1).

16 RPT Creation Algorithm
1. Clear the bucket set P in the FPGA assigned to the read
2. For each seed in the read
   a) Using seed as address, read all reference positions from RIT
   b) Add the current read to the bucket associated with each position
3. After all seeds in read, for each bucket in P
   a) Using the reference position as address, read the current array length
   b) Increment the array length and write back
   c) Write the read ID to array[length]
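A software sketch of the RPT creation steps is shown below. Several details are assumptions made for illustration: the bucket for a hit is taken to be the reference position divided by the bucket size (the slides do not say how a position maps to a bucket), the RIT is passed in as a plain dictionary from seed to positions, and `apply_mask` and `build_rpt` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Set


def apply_mask(window: str, mask: str) -> str:
    """Keep only the characters of `window` where the mask has a '1'."""
    return "".join(c for c, m in zip(window, mask) if m == "1")


def build_rpt(reads: List[str], rit: Dict[str, List[int]],
              mask: str, bucket_size: int) -> Dict[int, List[int]]:
    """Map each bucket index to the IDs of reads with a seed hit in it."""
    rpt: Dict[int, List[int]] = defaultdict(list)
    span = len(mask)
    for read_id, read in enumerate(reads):
        # Step 1: clear the bucket set P for this read.
        buckets: Set[int] = set()
        # Step 2: look up every seed of the read in the RIT.
        for offset in range(len(read) - span + 1):
            seed = apply_mask(read[offset:offset + span], mask)
            for ref_pos in rit.get(seed, []):
                # Assumed mapping from reference position to bucket.
                buckets.add(ref_pos // bucket_size)
        # Step 3: record the read once in each bucket it touched.
        for bucket in sorted(buckets):
            rpt[bucket].append(read_id)
    return dict(rpt)


rit = {"ACG": [0], "CGT": [1], "TTT": [9]}
print(build_rpt(["ACGT", "TTTT"], rit, "111", bucket_size=4))
# {0: [0], 2: [1]}
```

Collecting hits in a set first is what lets several seed hits in the same region collapse into a single RPT entry, which is the reduction in SW work that buckets are meant to provide.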

17 Reassembly Process with Architecture Reference streamed from host source. Reads loaded from RPT into SW units at start comparison point. Max score and location for each read recorded by SW unit at end comparison point.
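What each SW unit computes can be illustrated with a small software version of Smith-Waterman local alignment that returns only the maximum score and where it ends in the reference. The scoring values (match +2, mismatch -1, linear gap -1) are assumptions for the example; the slides do not give the scoring scheme, and the hardware units would process the streamed reference rather than a stored string.

```python
from typing import Tuple


def smith_waterman(read: str, ref: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> Tuple[int, int]:
    """Return (max local alignment score, 0-based reference index where it ends).

    Returns (0, -1) if no positive-scoring alignment exists.
    """
    prev = [0] * (len(ref) + 1)  # previous row of the scoring matrix
    best, best_end = 0, -1
    for i in range(1, len(read) + 1):
        cur = [0] * (len(ref) + 1)
        for j in range(1, len(ref) + 1):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            cur[j] = max(0,                 # start a new local alignment
                         prev[j - 1] + sub,  # match or mismatch
                         prev[j] + gap,      # gap in the reference
                         cur[j - 1] + gap)   # gap in the read
            if cur[j] > best:
                best, best_end = cur[j], j - 1
        prev = cur
    return best, best_end


print(smith_waterman("GATC", "TTGATCAA"))  # (8, 5)
```

Keeping only two rows of the matrix is enough here because only the best score and its location are recorded, not the full alignment.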

18 Active SW Units at One Time Lr = read length; e = error window size

19 Performance Estimates Construction of RIT = 16 seconds, assuming 128 MHz and 1 reference character processed per clock. Construction of RPT = 10 minutes, assuming R=130M, Lr=64, N=2^31, k=14, 4 FPGAs. Reassembly phase = 16 seconds, assuming 128 MHz, N=2^31.
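The two 16-second figures follow from streaming the reference at one character per clock, as this small check shows. The function name is hypothetical, and whether "128 MHz" means 128 * 10^6 Hz or 2^27 Hz is not stated in the slides, so both readings are computed.

```python
def streaming_seconds(n_chars: int, clock_hz: float,
                      chars_per_clock: int = 1) -> float:
    """Time to stream n_chars through a pipeline at a fixed rate per clock."""
    return n_chars / (clock_hz * chars_per_clock)


# Reading 128 MHz as 128 * 10**6 Hz gives roughly 16.8 seconds.
print(streaming_seconds(2 ** 31, 128e6))
# Reading it as 2**27 Hz gives exactly 16 seconds.
print(streaming_seconds(2 ** 31, 2 ** 27))  # 16.0
```

The 10-minute RPT estimate depends on memory access behavior that the slides do not quantify, so it is not reproduced here.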
