Published by Alfred Summers. Modified over 9 years ago.
1
Genome-scale disk-based suffix tree indexing
Benjarath Phoophakdee, Mohammed J. Zaki
Compiled by: Amit Mahajan, Chaitra Venus
2
Introduction…
Growth in biological sequence databases
Need for an effective and efficient index structure
Suffix tree
– Exact/approximate matching
– Database querying
– Longest common substrings, etc.
3
Introduction…
In-memory construction algorithms
– Naive construction takes O(n^2) time
– Linear time and space can be achieved using suffix links, edge encoding, and skip and count
– Problem: they do not scale to large input sequences
4
Disk based Suffix trees
“A Database Index to Large Biological Sequences”
– Abandons suffix links (for better locality of reference)
– Partitions the input based on fixed-length prefixes
– Partition sizes are uneven because of data skew
– Uses bin packing for partitions, but counting frequencies for long prefixes is expensive
“Practical Suffix Tree Construction”
– TDD: similar to the above; also drops suffix links
– Reported to scale to the human genome level
– Incurs random I/Os when the input string is larger than memory
5
Disk based Suffix trees
ST-Merge (improvement to TDD)
– Splits the input string into smaller contiguous substrings
– Applies TDD to each substring, then merges all the trees
– Does not have suffix links
TOP-Q and DynaCluster
– Only known algorithms that maintain suffix links and do not have the data skew problem
– Experiments show that they do not scale to the human genome level
6
Issue
Problems with disk based algorithms
– Data skew
– No suffix links
– No scalability
The authors propose a novel disk based suffix tree algorithm called TRELLIS
7
TRELLIS
O(n^2) time, O(n) space
Idea:
– Construct the tree by partitioning and merging
– Use variable-length prefixes
– Recover suffix links in a separate post-construction phase
Effectively scales up to the human genome level
– Can index the entire human genome using 2 GB of memory in 4 hours, and recover suffix links in 2 hours
8
TRELLIS
Has 4 phases
– Prefix Creation
– Partitioning
– Merging
– Suffix Link Recovery
9
Prefix Creation Phase
Problems with fixed-length prefixes
– Cannot handle data skew
– There is no defined way to compute an appropriate length
TRELLIS uses variable-length prefixes: P = {P_0, P_1, P_2, …, P_{m-1}}
A threshold t determines P such that freq(P_i) ≤ t
10
Prefix Creation Phase
Multi-scan approach to compute P
– In the i-th scan, process prefixes up to a certain length L_i (see formula below to calculate L_i)
– EP_i = set of prefixes that need further extension in the next scan (because their frequency > t)
– Add to P only the shortest prefixes that meet the frequency threshold t, and reject their extensions
11
Prefix Creation Phase
Ex: With t = 10^6, only two stages were required for the human genome, with L_1 = 8 and L_2 = 16
The resulting set P contained about 6400 prefixes, with lengths in the range 4 to 16
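The variable-length prefix idea can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it extends prefixes by one character per scan instead of using the stage lengths L_i, and it assumes the input already ends with a unique terminator such as `$`. The function name `variable_length_prefixes` is hypothetical.

```python
from collections import Counter


def variable_length_prefixes(s, t):
    """Return {prefix: frequency} with every frequency <= t.

    Assumption: `s` ends with a unique terminator, so every suffix
    is eventually covered by exactly one accepted prefix.
    """
    if t < 1:
        raise ValueError("t must be at least 1")
    accepted = {}
    pending = {""}  # prefixes that are still too frequent
    length = 0
    while pending:
        length += 1
        counts = Counter()
        # One scan: count only extensions of prefixes that exceeded t.
        for i in range(len(s) - length + 1):
            if s[i:i + length - 1] in pending:
                counts[s[i:i + length]] += 1
        pending = set()
        for prefix, count in counts.items():
            if count <= t:
                accepted[prefix] = count  # shortest prefix meeting t
            else:
                pending.add(prefix)       # extend in the next scan
    return accepted


if __name__ == "__main__":
    print(variable_length_prefixes("banana$", 2))
```

For `"banana$"` with t = 2, the accepted prefixes are b, n, $, an and a$: the prefix a occurs three times, so it is rejected and only its extensions an and a$ are kept.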
12
Partitioning Phase
Divide the input string into r consecutive partitions, where r = (n+1) / t
Suffix Subtree T_{R_i}
– Contains the suffixes that start in partition R_i
– Built with Ukkonen’s algorithm*
Prefixed Suffix Subtree T_{R_i,P_k}
– Split T_{R_i} into subtrees that each contain only the suffixes with prefix P_k
– At most m such subtrees
Store these prefixed suffix subtrees on disk
* Proposed in the paper “Online construction of suffix trees” by E. Ukkonen
13
Partitioning Phase
The T_{R_i} obtained are implicit suffix trees (i.e. some suffixes end inside internal edges)
To guarantee that T_{R_i} explicitly contains all suffixes from the i-th partition
– Continue reading characters from the next partition R_{i+1} until T_{R_i} has t leaves
– Appending a special terminal character is not an option, as it would incur additional overhead during the merging phase
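The bookkeeping of the partitioning phase can be illustrated with a small Python sketch. It is heavily simplified: a sorted list of suffix start positions stands in for each prefixed suffix subtree (the slides build real subtrees with Ukkonen's algorithm), the string is assumed to include its terminator, and r is rounded up when the length is not a multiple of t. The name `partition_suffixes` is hypothetical.

```python
import math


def partition_suffixes(s, prefixes, t):
    """Group suffix start positions by (partition index, prefix).

    Assumption: `prefixes` is a prefix-free set that covers every
    suffix of `s`, and `s` already ends with a terminator.
    """
    r = math.ceil(len(s) / t)  # number of partitions
    lengths = sorted({len(p) for p in prefixes})
    groups = {}
    for i in range(len(s)):
        part = i // t  # partition in which suffix i starts
        for length in lengths:
            candidate = s[i:i + length]
            if candidate in prefixes:
                groups.setdefault((part, candidate), []).append(i)
                break
        else:
            raise ValueError(f"no prefix covers the suffix at {i}")
    # Sorting each group mimics the order a suffix subtree would give.
    for positions in groups.values():
        positions.sort(key=lambda i: s[i:])
    return r, groups


if __name__ == "__main__":
    r, groups = partition_suffixes(
        "banana$", {"b", "n", "$", "an", "a$"}, 4)
    print(r, groups)
```

Each `(partition, prefix)` key corresponds to one T_{R_i,P_k}, so at most r × m groups can exist.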
14
Merging Phase
For each prefix P_k in the set P
– Merge all Prefixed Suffix Subtrees T_{R_i,P_k} to get the Prefixed Suffix Tree T_{P_k}
This yields m Prefixed Suffix Trees
Store the resulting trees back on disk
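The per-prefix merge can be sketched as follows. As a stand-in for tree merging, this Python example merges sorted lists of suffix start positions with `heapq.merge`; the actual algorithm merges tree nodes that share edge labels, which this sketch does not attempt. The name `merge_by_prefix` and the input layout (a dict keyed by `(partition, prefix)`) are assumptions made for illustration.

```python
import heapq


def merge_by_prefix(s, groups):
    """Merge per-partition sorted position lists for each prefix.

    `groups` maps (partition index, prefix) to a list of suffix
    start positions already sorted by suffix (hypothetical layout).
    """
    per_prefix = {}
    for (_part, prefix), positions in sorted(groups.items()):
        per_prefix.setdefault(prefix, []).append(positions)
    # One k-way merge per prefix, ordered by the suffix text.
    return {
        prefix: list(heapq.merge(*lists, key=lambda i: s[i:]))
        for prefix, lists in per_prefix.items()
    }


if __name__ == "__main__":
    s = "banana$"
    groups = {
        (0, "b"): [0], (0, "an"): [3, 1], (0, "n"): [2],
        (1, "n"): [4], (1, "a$"): [5], (1, "$"): [6],
    }
    merged = merge_by_prefix(s, groups)
    # Reading the prefixes in sorted order yields the suffix array.
    print([i for p in sorted(merged) for i in merged[p]])
```

Because the prefix set is prefix-free, concatenating the merged lists in lexicographic prefix order gives all suffixes in sorted order, which is a convenient sanity check for this kind of sketch.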
15
Suffix Link Recovery Phase
Why?
– Suffix links are crucial to the efficiency of many fast string processing algorithms
Why in a separate phase?
– TRELLIS may discard all suffix link information during the merge phase, since new internal nodes are created and some old ones are deleted
– Discarding suffix link information after partitioning is useful, as it reduces the amount of data per node
– Recovering links from scratch takes the same time as keeping the original link information
16
Suffix Link Recovery Phase
TRELLIS recovers the suffix links of one Prefixed Suffix Tree at a time
Start with the children of the root
Proceeding depth-first, do the following for each internal node x
– Locate p(x), the parent of x, and sl(p(x)), the suffix link of p(x)
– Count down from sl(p(x)) to locate sl(x); when found, add the link
– Repeat recursively for all children of x
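The parent-first recovery order can be illustrated on a toy structure. The Python sketch below uses an uncompressed suffix trie (one character per edge, quadratic in size, suitable only for tiny inputs) rather than a real suffix tree, and it links every node rather than only internal ones. In that setting, "counting down" from sl(p(x)) is a single child lookup; in a compressed suffix tree the same step walks multi-character edges. All names here are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}  # one character per edge
        self.link = None    # suffix link, filled in later


def build_suffix_trie(s):
    """Insert every suffix of `s` character by character."""
    root = Node()
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.children.setdefault(ch, Node())
    return root


def recover_suffix_links(root):
    """Set node.link for every node, parents before children."""
    root.link = root
    stack = [(root, ch, child) for ch, child in root.children.items()]
    while stack:  # depth-first traversal
        parent, ch, node = stack.pop()
        if parent is root:
            node.link = root  # dropping the only character
        else:
            # sl(x) is reached from sl(p(x)) via the same edge label.
            node.link = parent.link.children[ch]
        stack.extend((node, c, k) for c, k in node.children.items())
    return root


def node_for(root, label):
    """Follow `label` from the root and return the node reached."""
    node = root
    for ch in label:
        node = node.children[ch]
    return node


if __name__ == "__main__":
    root = recover_suffix_links(build_suffix_trie("banana$"))
    print(node_for(root, "ana").link is node_for(root, "na"))
```

The lookup `parent.link.children[ch]` always succeeds because every substring of the input has a node in a suffix trie, and a parent's link is set before any of its children are visited.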
17
Choosing t
Note: t is also the threshold for partition size
M ≥ n/4 + ((0.7 × 40) + 16) t + (0.7 × 40) t
where
– M = available main memory
– n/4 = memory for the input (in compressed form)
– # internal nodes = 0.7 × (# external nodes)
– 40 and 16 are the sizes of internal and external nodes
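The memory bound simplifies to M ≥ n/4 + 72t, so the largest admissible threshold is t ≤ (M − n/4) / 72. The short Python sketch below does that arithmetic. It assumes the node sizes are in bytes and that M is given in bytes; the function name `max_threshold` and the example genome length of 3 × 10^9 are illustrative assumptions, not values from the slides.

```python
def max_threshold(memory_bytes, n, internal_size=40,
                  external_size=16, internal_ratio=0.7):
    """Largest t with M >= n/4 + (0.7*40 + 16)t + (0.7*40)t.

    Assumption: node sizes and memory are measured in bytes.
    """
    # Bytes needed per leaf: 28 + 16 for one tree, 28 more from the
    # second (0.7 * 40) t term, i.e. 72 with the default sizes.
    per_leaf = (internal_ratio * internal_size + external_size
                + internal_ratio * internal_size)
    budget = memory_bytes - n / 4  # what is left after the input
    if budget <= 0:
        raise ValueError("memory too small to hold the input")
    return int(budget // per_leaf)


if __name__ == "__main__":
    # Illustrative only: 2 GB of memory, roughly 3e9 characters.
    print(max_threshold(2 * 2**30, 3_000_000_000))
```

With these illustrative inputs the bound comes out at roughly 19 million, comfortably above the t = 10^6 used in the example for the human genome.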
18
Computational Complexity
Prefix Creation Phase
– O(nL) time, where L = longest prefix length
– O(n + |Σ|^(L+1)) space
Partitioning Phase
– The input is broken into r partitions, each of size t
– O(t) time/space for each partition => r × O(t) = O(n)
– Disk I/Os: O(r × m), since at most m prefixed suffix subtrees can be created for each partition
19
Computational Complexity
Merging Phase
– Each merge operation can take O(p), where p = |longest common prefix|
– Across all prefixes, merging = O(p × n), since the number of nodes in a suffix tree is O(n)
– In the worst case p can be O(n), therefore merge = O(n^2)
– Disk I/Os: O(r × m)
20
Computational Complexity
Suffix Link Recovery Phase
– The final suffix trees contain O(n) internal nodes
– Constant set of operations for each suffix link recovery
Putting it all together…
– O(n^2) time, since the merge phase is the most expensive
– O(n) space
21
Experimental Setup
Compared to
– TOP-Q and DynaCluster (maintain suffix links)
– TDD (no suffix links)
Performed on Linux with
– 2 GB RAM for the human genome and 512 MB for others
– 288 GB disk space
– TRELLIS written in C++ and compiled with g++
– Other algorithms obtained from their authors
22
Experimental Results
TRELLIS vs. TOP-Q and DynaCluster
For 200 Mbp, DynaCluster did not terminate even after 8 hours, while TRELLIS took 13 min
23
Experimental Results
TRELLIS vs. TDD
TDD uses four different buffers (string, suffix, temp and tree)
200 Mbp requires only the last 2 buffers
This saves the additional I/O incurred in other cases
24
Experimental Results
TRELLIS vs. TDD
TDD is built using a memory-optimized suffix tree method
The difference is not significant for the human genome, as TDD needs to be run in 64-bit mode
25
Experimental Results
TRELLIS vs. TDD – Query time
TDD does not store edge lengths; they are determined by examining children
An internal node has a pointer to only one child, so all children are scanned linearly for every query
26
Conclusions
TRELLIS
– Solves the data skew problem with variable-length prefixes
– Scales gracefully to very large sequences
– No disk I/O overhead, as it works with suffix trees that are guaranteed to fit in memory
– Exhibits faster construction and query times than other disk based algorithms
27
Future Work
Plan to make TRELLIS applicable to a wider range of alphabets (e.g. the English alphabet)
No buffering strategy is required for the human genome, but one is planned for building a generalized suffix tree composed of many large genomes
Parallelize TRELLIS, since its partitioning and merging steps seem ideally suited