The short-read alignment in distributed memory environment

Andrzej Dorobisz, Paweł Russek, Kazimierz Wiatr

Description of the problem

The task of short-read alignment, which comes from genomics, is to find the best match for very short DNA sequences (short reads) in a lengthy reference genome. The short reads and the genome are represented as text over the four-letter alphabet {'A', 'C', 'G', 'T'}; each letter denotes a base pair (bp) in the DNA sequence. Typical parameters are as follows: the reference genome has a length GLEN of a couple of Gbps, and a single run can comprise on the order of millions of short reads, each only a limited number of bps long.

The main difficulty is that inexact matching is accepted: a short read can be edited to fit its location in the genome. Three edit operations are allowed: insertion (CGAT -> CGACT), deletion (CGAT -> CGT), and replacement (CGAT -> CGCT).

Our goal was to propose a method for distributing the workload and data in a computer cluster that minimizes data redundancy and inter-node communication.

Distribution of data and work

The key concept of our solution is that the trie built over the reference genome can be distributed among one root process and many worker processes. In Figure 1 there are one root and four workers (W=4). We assign the trie's root node ('_') and the top Ld trie levels to the root process (Ld=1 in Figure 1). The remaining subtrees are divided between the worker processes in such a way that each leaf node of the root's sub-trie is the root node of a sub-trie stored by a single worker process.

The root process starts the procedure for each short read. It traces all necessary paths in its own sub-trie and delegates processing to the corresponding worker processes when leaf nodes are reached. A worker receives the remaining part of the pattern, p = p_pos p_pos+1 … p_n, together with the current value of k (the number of edit operations still allowed), and continues the backtracking procedure in its sub-trie. After processing, each worker sends its results back to the root, which stores them in the result file. With this method both data and work are distributed, so the problem is effectively parallelized.
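The assignment of sub-tries to workers described above can be sketched in C++ as follows. This is only an illustration under stated assumptions, not the authors' implementation: the poster does not say how the leaves of the root's sub-trie are mapped to workers, so the round-robin rule, the worker numbering 0..W-1, and the helper names baseCode, leafIndex, and ownerOf are all hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: maps a base letter to 0..3, or -1 for any other character.
int baseCode(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Index of the leaf of the root's sub-trie reached by the first Ld letters
// of s, read as a base-4 number. Returns -1 if s is shorter than Ld or
// contains a letter outside {A, C, G, T}.
long leafIndex(const std::string& s, std::size_t Ld) {
    if (s.size() < Ld) return -1;
    long idx = 0;
    for (std::size_t i = 0; i < Ld; ++i) {
        int c = baseCode(s[i]);
        if (c < 0) return -1;
        idx = idx * 4 + c;
    }
    return idx;
}

// ASSUMPTION: leaves are dealt to the W workers round-robin (leaf index
// modulo W). The poster does not specify the actual assignment rule.
int ownerOf(const std::string& s, std::size_t Ld, int W) {
    long idx = leafIndex(s, Ld);
    return idx < 0 ? -1 : static_cast<int>(idx % W);
}
```

With Ld=1 and W=4, as in Figure 1, this rule gives each of the four letters its own worker; for example, ownerOf("CGAT", 1, 4) selects worker 1. With a longer root prefix the 4^Ld leaves are simply spread over the available workers.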
Simplified method

If we assume that no edit operation can occur in the first few letters of a short read, the distribution of work becomes very elegant. The root process performs exact matching on that prefix and then sends the short read to exactly one worker, which continues processing with inexact matching.

We have implemented this solution in sequential and distributed versions. We chose C++11 and used the MPI library to implement the distributed one.

[Figure: example alignment of the CGAT short read against the ACCGACTTCGTCGCGCTTA reference with one error allowed.]

Figure 1: An example of the trie for the AGCATGCTGCAGTCATGCTTAGGCTA reference genome and the partitioning of its nodes among MPI processes.

Algorithm

Since the maximum length of short reads is limited, we build the trie of all substrings of length Lmax of the reference genome, where Lmax is the maximum length of a short read. An example of the trie for Lmax=4 is given in Figure 1. Inexact matching in the trie is performed by the recursive procedure MismatchRec:

     1: procedure MismatchRec(p, pos, cur, k)
     2:   if cur is NULL then return ∅
     3:   end if
     4:   if pos == |p| + 1 then return SS(cur)
     5:   end if
     6:   R ← MismatchRec(p, pos+1, next(cur, p_pos), k)
     7:   if k > 0 then
     8:     for all x ∈ {A, C, G, T}, x ≠ p_pos do
     9:       R ← R ∪ MismatchRec(p, pos+1, next(cur, x), k−1)
    10:     end for
    11:     for all x ∈ {A, C, G, T} do
    12:       R ← R ∪ MismatchRec(p, pos, next(cur, x), k−1)
    13:     end for
    14:     R ← R ∪ MismatchRec(p, pos+1, cur, k−1)
    15:   end if
    16:   return R
    17: end procedure

In the presented algorithm, p = p_1 p_2 ... p_n is the searched pattern, pos is the current position in p, cur is the current trie node, and k is the number of allowed edit operations. The next(node, letter) function returns the descendant of node that corresponds to the given letter, and SS(cur) returns the positions of all substrings passing through the node cur. The procedure is a backtracking search in the trie.
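The MismatchRec procedure above can be sketched as a small sequential C++11 program. This is a hedged illustration rather than the authors' code: the Node layout, the helper names code, buildTrie, mismatchRec, and align, storing start positions in every node to implement SS, and keeping the shortened windows at the end of the genome are all assumptions made for the example. Positions are 0-based, so the test pos == |p| + 1 becomes pos == p.size().

```cpp
#include <array>
#include <cstddef>
#include <memory>
#include <set>
#include <string>
#include <vector>

struct Node {
    std::array<std::unique_ptr<Node>, 4> child;  // children for A, C, G, T
    std::vector<std::size_t> starts;  // SS(node): starts of substrings through this node
};

// Assumes the input contains only A, C, G, T (anything else is treated as T).
int code(char c) { return c == 'A' ? 0 : c == 'C' ? 1 : c == 'G' ? 2 : 3; }

// Builds the trie of all genome windows of length lmax. ASSUMPTION: windows
// that run past the end of the genome are inserted in shortened form.
std::unique_ptr<Node> buildTrie(const std::string& g, std::size_t lmax) {
    std::unique_ptr<Node> root(new Node());
    for (std::size_t s = 0; s < g.size(); ++s) {
        Node* cur = root.get();
        cur->starts.push_back(s);
        for (std::size_t i = s; i < g.size() && i < s + lmax; ++i) {
            std::unique_ptr<Node>& nxt = cur->child[code(g[i])];
            if (!nxt) nxt.reset(new Node());
            cur = nxt.get();
            cur->starts.push_back(s);
        }
    }
    return root;
}

// Backtracking search; comments refer to the pseudocode line numbers.
void mismatchRec(const std::string& p, std::size_t pos, const Node* cur, int k,
                 std::set<std::size_t>& out) {
    if (cur == nullptr) return;                                  // line 2
    if (pos == p.size()) {                                       // line 4
        out.insert(cur->starts.begin(), cur->starts.end());
        return;
    }
    const int c = code(p[pos]);
    mismatchRec(p, pos + 1, cur->child[c].get(), k, out);        // line 6
    if (k > 0) {
        for (int x = 0; x < 4; ++x)                              // line 9: replacement
            if (x != c) mismatchRec(p, pos + 1, cur->child[x].get(), k - 1, out);
        for (int x = 0; x < 4; ++x)                              // line 12: insertion
            mismatchRec(p, pos, cur->child[x].get(), k - 1, out);
        mismatchRec(p, pos + 1, cur, k - 1, out);                // line 14: deletion
    }
}

// Convenience wrapper: start positions where read aligns with at most k edits.
std::set<std::size_t> align(const std::string& genome, const std::string& read,
                            int k, std::size_t lmax) {
    std::unique_ptr<Node> root = buildTrie(genome, lmax);
    std::set<std::size_t> out;
    mismatchRec(read, 0, root.get(), k, out);
    return out;
}
```

For the poster's example, align("ACCGACTTCGTCGCGCTTA", "CGAT", 1, 5) yields the 0-based positions 2, 8, and 13, which correspond to the CGACT, CGT, and CGCT variants of the read, while k=0 yields no match. Storing start positions in every node keeps the sketch short but is memory-hungry; a real implementation would likely use a more compact representation.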
In MismatchRec, at each trie node all possible paths are checked: the unedited pattern letter (line 6), three replacements of the letter (line 9), four insertions (line 12), and the deletion of the letter (line 14).

Results

Our solution was tested on the Zeus cluster at the ACC „CYFRONET”. We measured the time of trie construction, data distribution, and alignment. Lmax was set to 100, and we assumed an exact prefix of length three. Results are given in Tables 1, 2, and 3.

Table 1. Data distribution time [s]

    GLEN          50 000     …          …
    sequential    -          -          -
    W=4           0.26 s     0.24 s     2.17 s
    W=20          0.35 s     0.45 s     2.55 s

Table 2. Tries construction time [s]

    GLEN          50 000     …          …
    sequential    3.07 s     6.09 s     60.88 s
    W=4           1.06 s     2.17 s     26.63 s
    W=20          0.29 s     0.51 s     5.26 s

Table 3. Overall short-read alignment time (… reads and 100 kbps genome) [s]

    mismatches    3          4          5
    sequential    3.35 s     34.46 s    -
    W=4           0.90 s     9.75 s     …
    W=20          1.09 s     2.71 s     14.11 s

The tests positively verified our solution. The communication cost is small in comparison to the gained acceleration: in the last column of Tables 1 and 2, for example, trie construction drops from 60.88 s sequentially to 5.26 s with W=20, while data distribution takes 2.55 s. Both phases, trie construction and short-read alignment, were effectively parallelized. To summarize: we achieved our goal and showed that the short-read alignment problem can be effectively run in a distributed memory environment.

Acknowledgements

This work is supported by PLGrid Core project no. POIG /13
