Space-time Tradeoffs for Longest-Common-Prefix Array Construction Simon J. Puglisi and Andrew Turpin

Space-time Tradeoffs for Longest-Common-Prefix Array Construction Simon J. Puglisi and Andrew Turpin (simon.puglisi@rmit.edu.au)

Outline
Refresher on suffix array (SA) and longest-common-prefix array (LCP) basics
–Why are they useful?
Existing algorithms for LCP construction
–Why do we need new ones?
Two algorithms for LCP construction
–Empirical comparison to earlier algorithms
Concise representations of LCP array

The Ubiquitous Suffix Array

i   suffix
0   toy_boat_toy_boat_toy_boat$
1   oy_boat_toy_boat_toy_boat$
2   y_boat_toy_boat_toy_boat$
3   _boat_toy_boat_toy_boat$
4   boat_toy_boat_toy_boat$
5   oat_toy_boat_toy_boat$
6   at_toy_boat_toy_boat$
7   t_toy_boat_toy_boat$
8   _toy_boat_toy_boat$
9   toy_boat_toy_boat$
10  oy_boat_toy_boat$
11  y_boat_toy_boat$
12  _boat_toy_boat$
13  boat_toy_boat$
14  oat_toy_boat$
15  at_toy_boat$
16  t_toy_boat$
17  _toy_boat$
18  toy_boat$
19  oy_boat$
20  y_boat$
21  _boat$
22  boat$
23  oat$
24  at$
25  t$
26  $

Suffix Sort

SA  suffix
26  $
24  at$
15  at_toy_boat$
6   at_toy_boat_toy_boat$
22  boat$
13  boat_toy_boat$
4   boat_toy_boat_toy_boat$
23  oat$
14  oat_toy_boat$
5   oat_toy_boat_toy_boat$
19  oy_boat$
10  oy_boat_toy_boat$
1   oy_boat_toy_boat_toy_boat$
25  t$
18  toy_boat$
9   toy_boat_toy_boat$
0   toy_boat_toy_boat_toy_boat$
16  t_toy_boat$
7   t_toy_boat_toy_boat$
20  y_boat$
11  y_boat_toy_boat$
2   y_boat_toy_boat_toy_boat$
21  _boat$
12  _boat_toy_boat$
3   _boat_toy_boat_toy_boat$
17  _toy_boat$
8   _toy_boat_toy_boat$
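The suffix sort on this slide can be reproduced with a few lines of Python. This is only a naive sketch for illustration, not one of the fast SA construction algorithms mentioned later. The alphabet order ('$' first, letters next, '_' last) is an assumption inferred from the order of the sorted suffixes on the slide, and the names slide_rank and naive_suffix_array are made up for this example.

```python
TEXT = "toy_boat_toy_boat_toy_boat$"

def slide_rank(ch):
    # Assumed alphabet order, inferred from the slide: '$' < letters < '_'.
    return ({"$": 0, "_": 2}.get(ch, 1), ch)

def naive_suffix_array(text):
    # Sort suffix start positions by comparing the suffixes themselves.
    # Worst case O(n^2 log n) time; fine for a 27-character example.
    return sorted(range(len(text)),
                  key=lambda i: [slide_rank(c) for c in text[i:]])

SA = naive_suffix_array(TEXT)
print(SA[:7])  # [26, 24, 15, 6, 22, 13, 4]
```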

The Longest-Common-Prefix (LCP) Array

LCP[i] = the length of the longest common prefix of suffixes SA[i-1] and SA[i]
       = |lcp(SA[i-1],SA[i])|, i > 0

LCP  SA  suffix
-    26  $
0    24  at$
2    15  at_toy_boat$
11   6   at_toy_boat_toy_boat$
0    22  boat$
4    13  boat_toy_boat$
13   4   boat_toy_boat_toy_boat$
0    23  oat$
3    14  oat_toy_boat$
12   5   oat_toy_boat_toy_boat$
1    19  oy_boat$
7    10  oy_boat_toy_boat$
16   1   oy_boat_toy_boat_toy_boat$
0    25  t$
1    18  toy_boat$
8    9   toy_boat_toy_boat$
17   0   toy_boat_toy_boat_toy_boat$
1    16  t_toy_boat$
10   7   t_toy_boat_toy_boat$
0    20  y_boat$
6    11  y_boat_toy_boat$
15   2   y_boat_toy_boat_toy_boat$
0    21  _boat$
5    12  _boat_toy_boat$
14   3   _boat_toy_boat_toy_boat$
1    17  _toy_boat$
9    8   _toy_boat_toy_boat$
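The definition translates directly into code. The sketch below applies it literally to the example string, with the SA copied from the slide. The function names are illustrative, and None stands in for the undefined entry LCP[0].

```python
TEXT = "toy_boat_toy_boat_toy_boat$"
# Suffix array copied from the slide.
SA = [26, 24, 15, 6, 22, 13, 4, 23, 14, 5, 19, 10, 1, 25,
      18, 9, 0, 16, 7, 20, 11, 2, 21, 12, 3, 17, 8]

def lcp(text, i, j):
    # Length of the longest common prefix of suffixes i and j.
    h = 0
    while i + h < len(text) and j + h < len(text) and text[i + h] == text[j + h]:
        h += 1
    return h

def lcp_array(text, sa):
    # LCP[i] = |lcp(SA[i-1], SA[i])| for i > 0; LCP[0] is undefined.
    return [None] + [lcp(text, sa[r - 1], sa[r]) for r in range(1, len(sa))]

LCP = lcp_array(TEXT, SA)
print(LCP[14:17])  # [1, 8, 17]  (suffixes 18, 9 and 0)
```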

Why the Longest-Common-Prefix array?
(SA,LCP,x) == suffix tree
–Any bottom-up and top-down traversal (Abouelhoda et al., JDA 2004)
–Same asymptotic time bounds, just smaller and faster in practice
–E.g., LZ77 factorization (Chen et al., CPM 2007)
Important for disk resident suffix trees
–LOFSA (Sinha et al., SIGMOD 2008)

Previous work
Brute force:
–For each i ∈ 1..n-1, work out LCP[i] by comparing t[SA[i-1]..n] to t[SA[i]..n] until we get a mismatch
–Expensive if the string has regularities: O(n^2) time in the worst case
Θ(n) time (Kasai et al., CPM 1999)
–13n bytes of space
–x[1..n], SA[1..n], ISA[1..n], LCP[1..n]
Θ(n) time (Manzini, SWAT 2004)
–Two refinements of Kasai et al.'s algorithm
–9n bytes
–6n + 4H_k n bytes (space usage decreases with text entropy)
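For reference, the linear-time idea of Kasai et al. can be sketched as below. This is the textbook formulation, written from memory rather than taken from the slides: it walks the suffixes in text order and reuses the previous lcp value minus one as a starting point. It keeps the text, SA, ISA (called rank here) and LCP in memory at once, which is where the 13n bytes come from. LCP[0] is set to 0 by convention.

```python
TEXT = "toy_boat_toy_boat_toy_boat$"
# Suffix array copied from the slides.
SA = [26, 24, 15, 6, 22, 13, 4, 23, 14, 5, 19, 10, 1, 25,
      18, 9, 0, 16, 7, 20, 11, 2, 21, 12, 3, 17, 8]

def kasai_lcp(text, sa):
    n = len(text)
    rank = [0] * n                 # inverse suffix array (ISA)
    for r, s in enumerate(sa):
        rank[s] = r
    lcp = [0] * n
    h = 0                          # lcp carried over from the previous suffix
    for i in range(n):             # visit suffixes in text order
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]        # lexicographic predecessor of suffix i
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1                 # suffix i+1 shares at least h-1 characters
    return lcp

print(kasai_lcp(TEXT, SA)[14:17])  # [1, 8, 17]
```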

The need for new LCP construction algorithms
Prior algorithms use lots of memory
–Try to compute LCP[] for the Human Genome
–DNA has high entropy, so the 9n-byte algorithm is best
–27 GB of RAM
Poor locality of memory reference
–Using secondary memory for large inputs is implausible
–Even in RAM the algorithms are (relatively) slow
–E.g., slower than the fastest SA construction algorithms

New Alg: choose a (special) sample of suffixes

SA: 26 24 15 6 22 13 4 23 14 5 19 10 1 25 18 9 0 16 7 20 11 2 21 12 3 17 8

Choose a sample S of the SA and compute lcp's (array L)

L   S   suffix
-   15  at_toy_boat$
0   22  boat$
4   4   boat_toy_boat_toy_boat$
0   23  oat$
1   1   oy_boat_toy_boat_toy_boat$
0   25  t$
1   18  toy_boat$
8   9   toy_boat_toy_boat$
1   16  t_toy_boat$
0   11  y_boat_toy_boat$
15  2   y_boat_toy_boat_toy_boat$
0   8   _toy_boat_toy_boat$

Preprocess L for O(1) time Range Minimum Queries (RMQ)
–Requires 2n + o(n) bits, after O(n) time preprocessing (Fischer 2008)
The lcp of two non-adjacent sample suffixes is the minimum value in L[] between them.
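The range-minimum idea can be tried out on the sample from this slide. The sketch below uses a plain sparse table, which is a stand-in chosen for brevity: it needs O(m log m) words for m sample entries, not the 2n + o(n) bits of the structure cited on the slide. S and L are copied from the slide (with L[0], shown as "-", stored as 0 and never queried); all function names are illustrative.

```python
TEXT = "toy_boat_toy_boat_toy_boat$"
S = [15, 22, 4, 23, 1, 25, 18, 9, 16, 11, 2, 8]   # sample, in SA order
L = [0, 0, 4, 0, 1, 0, 1, 8, 1, 0, 15, 0]         # L[0] is unused

def build_sparse_table(values):
    # table[k][p] = min(values[p : p + 2**k])
    table = [list(values)]
    span = 1
    while 2 * span <= len(values):
        prev = table[-1]
        table.append([min(prev[p], prev[p + span])
                      for p in range(len(values) - 2 * span + 1)])
        span *= 2
    return table

def range_min(table, lo, hi):
    # Minimum of values[lo..hi] (inclusive) from two overlapping windows.
    level = (hi - lo + 1).bit_length() - 1
    return min(table[level][lo], table[level][hi - (1 << level) + 1])

def sample_lcp(table, a, b):
    # lcp of the sample suffixes at positions a < b of S.
    return range_min(table, a + 1, b)

TABLE = build_sparse_table(L)
print(sample_lcp(TABLE, 6, 8))  # 1: suffixes 18 (toy_boat$) and 16 (t_toy_boat$)
```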

The sample is defined by a difference cover
A difference cover D_v, modulo v, is a set of integers in the range [0..v) such that for all i ∈ [0..v), there exist j, k ∈ D_v such that i = k-j (mod v).
–A tool for linear time suffix sorting (Kärkkäinen et al., JACM 2006)
|D_v| = O(√v)
δ function defined on D_v:
–δ(i,j) = k such that (i+k) mod v ∈ D_v and (j+k) mod v ∈ D_v, for any i,j
–δ computed in O(1) time and requires O(v) space
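Both properties can be checked mechanically. The sketch below tests the difference-cover condition and builds δ from a lookup table of v entries, which is one common way to get O(1) queries in O(v) space; the slides do not say how δ is implemented, so treat the table layout (called anchor here) as an assumption. The cover {1,2,4} modulo 7 is a small example chosen for this sketch.

```python
def is_difference_cover(cover, v):
    # Every residue in [0, v) must be a difference k - j (mod v) of two members.
    return {(k - j) % v for k in cover for j in cover} == set(range(v))

def build_delta(cover, v):
    members = set(cover)
    # anchor[d] = some a in the cover with (a + d) mod v also in the cover.
    anchor = [next(a for a in cover if (a + d) % v in members) for d in range(v)]

    def delta(i, j):
        # Shift i onto anchor[d], where d = (j - i) mod v; j then lands on
        # anchor[d] + d, which is in the cover by construction.
        return (anchor[(j - i) % v] - i) % v

    return delta

print(is_difference_cover([1, 2, 4], 7))  # True
delta = build_delta([1, 2, 4], 7)
print(delta(0, 3))  # 1: positions 1 and 4 are both in the cover
```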

New Alg: choose a (special) sample of suffixes

SA: 26 24 15 6 22 13 4 23 14 5 19 10 1 25 18 9 0 16 7 20 11 2 21 12 3 17 8

L   S   suffix
-   15  at_toy_boat$
0   22  boat$
4   4   boat_toy_boat_toy_boat$
0   23  oat$
1   1   oy_boat_toy_boat_toy_boat$
0   25  t$
1   18  toy_boat$
8   9   toy_boat_toy_boat$
1   16  t_toy_boat$
0   11  y_boat_toy_boat$
15  2   y_boat_toy_boat_toy_boat$
0   8   _toy_boat_toy_boat$

In this example, suffixes i such that i mod 7 ∈ D_7 = {1,2,4} have been chosen
–Suffixes 1,2,4, 8,9,11, 15,16,18,…
S has O(n/√v) elements (because |D_v| is O(√v))
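Selecting the sample is a one-line filter over the SA. The sketch below rebuilds S and L for the example, computing L by direct comparison for simplicity (the efficient computation of L is a separate step); variable names are illustrative and None marks the undefined first entry of L.

```python
TEXT = "toy_boat_toy_boat_toy_boat$"
SA = [26, 24, 15, 6, 22, 13, 4, 23, 14, 5, 19, 10, 1, 25,
      18, 9, 0, 16, 7, 20, 11, 2, 21, 12, 3, 17, 8]
V, D_V = 7, {1, 2, 4}

def lcp(text, i, j):
    h = 0
    while i + h < len(text) and j + h < len(text) and text[i + h] == text[j + h]:
        h += 1
    return h

# Keep the SA entries whose text position falls in the difference cover.
S = [s for s in SA if s % V in D_V]
# lcp between neighbouring sample suffixes.
L = [None] + [lcp(TEXT, S[r - 1], S[r]) for r in range(1, len(S))]

print(S)  # [15, 22, 4, 23, 1, 25, 18, 9, 16, 11, 2, 8]
print(L)  # [None, 0, 4, 0, 1, 0, 1, 8, 1, 0, 15, 0]
```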

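Using L to compute values in LCP
For suffixes i and j that agree on their first δ(i,j) characters:
lcp(i,j) = l' + δ(i,j)
l' = lcp(i+δ(i,j), j+δ(i,j)) = RMQ_L(...)
[Diagram: shifting i and j by δ(i,j) gives the sample suffixes i+δ(i,j) and j+δ(i,j); rank(i+δ(i,j)) and rank(j+δ(i,j)) locate them in S, and l' is the minimum of L between those two positions.]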
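A small sketch of this lookup on the running example follows. It makes two simplifications that are assumptions of the sketch, not part of the slides: δ is found by a linear scan instead of an O(1) table, and the range minimum is taken with Python's min over a slice instead of a constant-time RMQ structure. S and L are the sample arrays for D_7 = {1,2,4}, with the undefined L[0] stored as 0.

```python
TEXT = "toy_boat_toy_boat_toy_boat$"
V, D_V = 7, {1, 2, 4}
S = [15, 22, 4, 23, 1, 25, 18, 9, 16, 11, 2, 8]   # sample, in SA order
L = [0, 0, 4, 0, 1, 0, 1, 8, 1, 0, 15, 0]         # L[0] is unused
POS = {s: r for r, s in enumerate(S)}             # rank of each sample suffix in S

def delta(i, j):
    # Smallest shift that moves both i and j onto sample positions.
    return next(k for k in range(V) if (i + k) % V in D_V and (j + k) % V in D_V)

def lcp_via_sample(i, j):
    # Assumes i != j; the unique '$' guarantees a mismatch before the text ends.
    k = delta(i, j)
    for h in range(k):                 # at most v - 1 direct comparisons
        if TEXT[i + h] != TEXT[j + h]:
            return h
    a, b = sorted((POS[i + k], POS[j + k]))
    return k + min(L[a + 1:b + 1])     # l' = range minimum over L

print(lcp_via_sample(0, 9))  # 17
```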

Computing L
To compute L efficiently we exploit the following simple observation:
If lcp for SA[i] is l, then lcp for SA[i]+v ≥ l-v
–The lcp for a given suffix provides a lower bound on the lcp of suffixes which follow it in the string.
[Diagram: suffix j has lcp value l; suffix j+v has lcp value ≥ l-v.]
Overall O(n√v) time and O(n/√v) space to compute L
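One way to read this observation as code is sketched below: for each residue in the difference cover, walk the sample suffixes j, j+v, j+2v, … in text order and start each comparison at the previous lcp minus v. The loop structure is a reconstruction from the stated bound, not the authors' implementation; S is the sample for D_7 = {1,2,4} on the running example, and L[0] is stored as 0.

```python
TEXT = "toy_boat_toy_boat_toy_boat$"
V, D_V = 7, {1, 2, 4}
S = [15, 22, 4, 23, 1, 25, 18, 9, 16, 11, 2, 8]   # sample, in SA order

def compute_L(text, sample, cover, v):
    pos = {s: r for r, s in enumerate(sample)}
    out = [0] * len(sample)
    for residue in sorted(cover):          # O(sqrt(v)) passes over the text
        h = 0
        for x in range(residue, len(text), v):
            p = pos[x]
            if p == 0:                     # first sample suffix: no predecessor
                h = 0
                continue
            y = sample[p - 1]              # predecessor of x within the sample
            h = max(h - v, 0)              # lower bound carried over from x - v
            # The unique '$' terminator guarantees a mismatch before the end.
            while text[x + h] == text[y + h]:
                h += 1
            out[p] = h
    return out

print(compute_L(TEXT, S, D_V, V))  # [0, 0, 4, 0, 1, 0, 1, 8, 1, 0, 15, 0]
```

Within one pass h drops by at most v per step and there are about n/v steps, so each pass does O(n) character comparisons, which is consistent with the O(n√v) total.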

Combining things…
Now computing any LCP[k] requires at most v comparisons and an RMQ on L
To compute LCP over top of SA:
  for i = 1 to n do
    if lcp(SA[i],SA[i-1]) < v then
      LCP[i] = lcp(SA[i],SA[i-1])
    else
      LCP[i] = δ(SA[i],SA[i-1]) + RMQ_L(…)
Total time O(nv); extra space O(n/√v)
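Put together on the running example, the loop might look like the sketch below. As simplifying assumptions, L is built by direct comparison, δ is found by a linear scan, and the range minimum uses min over a slice, so the sketch shows the control flow rather than the stated time and space bounds. LCP[0] is set to 0 by convention.

```python
TEXT = "toy_boat_toy_boat_toy_boat$"
SA = [26, 24, 15, 6, 22, 13, 4, 23, 14, 5, 19, 10, 1, 25,
      18, 9, 0, 16, 7, 20, 11, 2, 21, 12, 3, 17, 8]
V, D_V = 7, {1, 2, 4}

def direct_lcp(i, j):
    h = 0
    while TEXT[i + h] == TEXT[j + h]:   # the unique '$' forces a mismatch (i != j)
        h += 1
    return h

S = [s for s in SA if s % V in D_V]
L = [0] + [direct_lcp(S[r - 1], S[r]) for r in range(1, len(S))]
POS = {s: r for r, s in enumerate(S)}

def delta(i, j):
    return next(k for k in range(V) if (i + k) % V in D_V and (j + k) % V in D_V)

def build_lcp():
    lcp = [0] * len(SA)
    for r in range(1, len(SA)):
        i, j = SA[r], SA[r - 1]
        h = 0
        while h < V and TEXT[i + h] == TEXT[j + h]:   # at most v comparisons
            h += 1
        if h < V:
            lcp[r] = h                                 # short lcp: found directly
        else:
            k = delta(i, j)                            # k < v, so first k chars match
            a, b = sorted((POS[i + k], POS[j + k]))
            lcp[r] = k + min(L[a + 1:b + 1])           # long lcp: shift + RMQ on L
    return lcp

print(build_lcp()[14:17])  # [1, 8, 17]
```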

Running Time & Memory Required for 200 MB DNA
[Figure: Time (sec), 0 to 200, plotted against Memory (bytes per input character), 2 to 14, for Ours (on disk), Ours (in memory), and the 6n, 9n and 13n algorithms.]

Running Time & Memory Required for 200 MB English
[Figure: Time (sec), 0 to 200, plotted against Memory (bytes per input character), 2 to 14, for Ours (on disk), Ours (in memory), and the 6n, 9n and 13n algorithms.]

An even better algorithm…
In fact it's possible to use even less space (and do away with the difference cover as well!)
Requires O(vn) time and O(n/v) space
–(Juha Kärkkäinen, last Friday)

Conclusions
O(nv) time, O(n/√v) space algorithm (using DC)
–O(nv) time, O(n/v) space (by rejigging things a bit)
By varying v we have a controlled tradeoff between memory and time
Algorithms are fast and use low memory
Runtime is not greatly affected if the output (and most of the input) resides on disk

Representing LCP in small space
The 2nd algorithm implies a concise representation of the LCP array
–(n log n)/v bits to store sample suffixes
–n log v bits to store "extra part"
Choosing v = log n → n + n log log n bits
Sadakane 2001: 6n + o(n) bits
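The arithmetic behind the choice v = log n, spelled out as a short worked step (the labels "sample part" and "extra part" simply name the two terms on the slide):

```latex
\begin{align*}
\text{sample part} &= \frac{n \log n}{v} = \frac{n \log n}{\log n} = n \text{ bits},\\
\text{extra part}  &= n \log v = n \log\log n \text{ bits},\\
\text{total}       &= n + n \log\log n \text{ bits}.
\end{align*}
```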

Future Work
Can we eliminate the random access to the text so that the algorithm scales unboundedly?
Is there a way to exploit the self-similarity present in the SA (and hence LCP) to further reduce constant factors in the runtime?
What is the concise representation like in practice? Can it be made smaller?
