Compressed suffix arrays and suffix trees with applications to text indexing and string matching.

1 Compressed suffix arrays and suffix trees with applications to text indexing and string matching

2 Roberto Grossi and Jeffrey Scott Vitter

3 Agenda
• A (very) short review of suffix arrays
• Introduction
• Problem definition
• Information theory reasoning
  – Simple solution, round 2
• Compressed suffix arrays in $\frac{1}{2} n \log\log n + O(n)$ bits and $O(\log\log n)$ access time
• Rank and select problem definitions
  – Rank data structure
• Compressed suffix arrays in $\varepsilon^{-1} n + O(n)$ bits and $O(\log^{\varepsilon} n)$ access time
• Select data structure (if time permits)

4 Short review of suffix arrays
• A suffix array is the sorted array of the suffixes of a string S, represented as an array of pointers to the suffixes of S
• For example, the string "telaviv" and its corresponding suffix array:
  S0 = telaviv, S1 = elaviv, S2 = laviv, S3 = aviv, S4 = viv, S5 = iv, S6 = v
  Suffix array: 3 1 5 2 0 6 4
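The definition above can be checked with a few lines of Python. This is a minimal sketch that sorts the suffixes directly (quadratic work in the worst case, fine for a seven-character example); the function name `suffix_array` is an illustrative choice, not something from the slides.

```python
def suffix_array(s: str) -> list[int]:
    """Return the 0-based start positions of the suffixes of s in sorted order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


word = "telaviv"
for start in suffix_array(word):
    # Each line shows a pointer (start position) and the suffix it points to.
    print(start, word[start:])
```

The printed pointers, read top to bottom, are the suffix array of the example string.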

5 Introduction
• Succinct data structures branch
• DNA genome strings (small alphabet, large strings)
• Mainly a theoretical article

6 Problem Definition
• The algorithm is composed of two phases
  – compress
  – lookup
• compress:
  – Given a suffix array SA, compress it to get its succinct representation
• lookup(i):
  – Given the compressed representation, return SA[i]

7 Some Definitions
• We will deal (at first) with a binary alphabet
  – Σ = {a, b}
• We add a special end-of-string symbol #
• We set the order between the characters to be
  – a < # < b (*)
• Basic RAM model
  – Word size of log n bits
  – Word lookup and arithmetic in constant time

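8 Information theory reasoning
All 16 strings of four characters over {a, b} followed by #, each with its suffix array (1-based positions, order a < # < b):
aaaa# 12345   aaab# 12354   aaba# 14253   aabb# 12543
abaa# 34152   abab# 13524   abba# 41532   abbb# 15432
baaa# 23451   baab# 23514   baba# 42531   babb# 25143
bbaa# 34521   bbab# 35241   bbba# 45321   bbbb# 54321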

9 Information theory reasoning (2)
• Suffix array size: $n \log n$ bits
• One-to-one correspondence between suffix arrays and strings
  – Construction details
• Number of possible suffix arrays: $2^{n-1}$
  – Perfect compression: n bits (the string itself)
  – The cost of a lookup is then Ω(n), see the previous lecture
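The counting argument can be checked by brute force for small n. The sketch below assumes the ordering a < # < b from the definitions slide and 1-based positions; the names `ORDER`, `suffix_array` and `count_suffix_arrays` are illustrative. It enumerates every binary string of a given length, appends #, and counts how many distinct suffix arrays appear.

```python
from itertools import product

ORDER = {"a": 0, "#": 1, "b": 2}  # the slides' ordering a < # < b


def suffix_array(text: str) -> tuple[int, ...]:
    """1-based suffix array of text under the ordering a < # < b."""
    def key(i: int) -> list[int]:
        return [ORDER[c] for c in text[i - 1:]]
    return tuple(sorted(range(1, len(text) + 1), key=key))


def count_suffix_arrays(m: int) -> int:
    """Distinct suffix arrays over all strings of m characters from {a, b} followed by '#'."""
    arrays = {suffix_array("".join(p) + "#") for p in product("ab", repeat=m)}
    return len(arrays)


print(count_suffix_arrays(4))
```

With m = 4 the total length is n = 5, so the count can be compared against $2^{n-1}$.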

10 "Simple" solution round 2: a different approach
• Pack each group of log n bits together to create a new alphabet
• The text length becomes n / log n and the pattern length m / log n
• The suffix array takes O(n) bits
• Searching becomes hard (alignment): the text is aligned but the pattern is not, which gives log n cases

11 "Simple" solution round 2
• The occurrence is not aligned: the pattern occurs k bits to the right of a word boundary
• We need to pad the pattern with k bits and check it
• So we need to check $2^k$ cases
• $k \approx \log n$ gives n different cases to check
• And that is assuming we know how much to pad!

12 General framework
• Abstract data type optimization [Jacobson '89]
• Number of distinct data structures = C(n), so each data structure occupies O(log C(n)) bits
• This does not guarantee the time complexity of the supported operations

13 Compressed suffix arrays in $\frac{1}{2} n \log\log n + O(n)$ bits and $O(\log\log n)$ access time
• Recursive in nature
  – Takes advantage of the suffixes
• Let $SA_0$ be the uncompressed suffix array
• Let $N_0$ be its size (assume a power of 2)
• In phase k of the compression we start with $SA_k$ of size $N_k = N_0 / 2^k$ and create $SA_{k+1}$ of size $N_{k+1} = N_k / 2$
• $SA_{k+1}$ holds a permutation of $\{1, \dots, N_{k+1}\}$

14 $SA_{k+1}$ construction
• Create the bit vector $B_k$: $B_k[i] = 1$ iff $SA_k[i]$ is even
• Create the rank vector: $rank_k(j)$ counts the number of 1 bits in the first j bits of $B_k$
• Create the $\Psi_k$ vector
  – It stores the 0-to-1 companion relation: if $SA_k[i]$ is odd, $\Psi_k(i)$ is the position j with $SA_k[j] = SA_k[i] + 1$; if $SA_k[i]$ is even, $\Psi_k(i) = i$
• Store the even values of $SA_k$, each divided by 2, in $SA_{k+1}$
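One phase of the construction can be sketched as follows. The function name `compress_phase` and the use of plain Python lists are assumptions made for illustration; nothing here is stored succinctly. Positions are 1-based in meaning, so list index 0 holds position 1.

```python
def compress_phase(sa_k: list[int]):
    """Build B_k, rank_k, Psi_k and SA_{k+1} from SA_k (a permutation of 1..N_k, N_k even)."""
    pos = {v: i + 1 for i, v in enumerate(sa_k)}        # value -> 1-based position
    b = [1 if v % 2 == 0 else 0 for v in sa_k]          # B_k[i] = 1 iff SA_k[i] is even
    rank, ones = [], 0
    for bit in b:                                       # rank_k(j) = ones among first j bits
        ones += bit
        rank.append(ones)
    # Even entries map to themselves; odd entries map to the position holding value + 1.
    psi = [i + 1 if v % 2 == 0 else pos[v + 1] for i, v in enumerate(sa_k)]
    sa_next = [v // 2 for v in sa_k if v % 2 == 0]      # even values, halved, in order
    return b, rank, psi, sa_next


# SA_1 of the running example in the slides.
sa1 = [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
b1, rank1, psi1, sa2 = compress_phase(sa1)
print(sa2)
```

Applying the function repeatedly to its own last output gives the successive levels of the recursion.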

15 An Example
• The 32-character string T = abbabbabbabbabaaabababbabbbabba#

16 An Example
i       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
T       a  b  b  a  b  b  a  b  b  a  b  b  a  b  a  a
SA_0   15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B_0     0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
rank_0  0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8
Ψ_0     2  2 14 15 18 23  7  8 28 10 30 31 13 14 15 16

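17 Example (continued)
i      17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
T       a  b  a  b  a  b  b  a  b  b  b  a  b  b  a  #
SA_0   12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B_0     1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
rank_0  9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16
Ψ_0    17 18  7  8 21 10 23 13 16 17 27 28 21 30 31 27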

18 How to compute $SA_k$ from $SA_{k+1}$
• Lemma 1
  – Given the suffix array $SA_k$, let $B_k$, $rank_k$, $\Psi_k$ and $SA_{k+1}$ be the result of the transformation performed by phase k. We can reconstruct $SA_k$ from $SA_{k+1}$ by the formula
    $SA_k[i] = 2 \cdot SA_{k+1}[rank_k(\Psi_k(i))] + (B_k[i] - 1)$
  – Split into two cases
    • $B_k[i] = 1$ ($SA_k[i]$ is even)
    • $B_k[i] = 0$ ($SA_k[i]$ is odd)
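Lemma 1 can be checked on the smallest level of the running example. The constants below are the level-2 vectors and $SA_3$ of the 32-character example string; the helper name `sa2` is hypothetical. The sketch recovers every entry of $SA_2$ from $SA_3$.

```python
SA3 = [2, 3, 4, 1]
B2 = [1, 0, 0, 1, 1, 0, 0, 1]
RANK2 = [1, 1, 1, 2, 3, 3, 3, 4]
PSI2 = [1, 5, 8, 4, 5, 1, 4, 8]


def sa2(i: int) -> int:
    """Recover SA_2[i] (1-based i) from level-3 data via Lemma 1."""
    j = PSI2[i - 1]                       # i itself if even, else the even companion
    # SA_3 holds the even values halved; odd entries are one less than their companion.
    return 2 * SA3[RANK2[j - 1] - 1] + (B2[i - 1] - 1)


print([sa2(i) for i in range(1, 9)])
```

For an even entry the correction term $B_k[i] - 1$ is 0; for an odd entry it is -1, because the companion holds the value one larger.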

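19 Example (continued)
i       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
SA_1    8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1     1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
rank_1  1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  8
Ψ_1     1  2  9  4  5  6  1  6  9 12 14 12  2 14  4  5

i       1  2  3  4  5  6  7  8
SA_2    4  7  1  6  8  3  5  2
B_2     1  0  0  1  1  0  0  1
rank_2  1  1  1  2  3  3  3  4
Ψ_2     1  5  8  4  5  1  4  8

i       1  2  3  4
SA_3    2  3  4  1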

20 Compress
– We keep $l = O(\log\log n)$ levels
– All levels except $SA_l$ are stored implicitly
– For each level $j = 0, \dots, l-1$ we store $B_j$, $rank_j$ and $\Psi_j$
– $rank_j$ and $\Psi_j$ are stored implicitly
– The size of $SA_l$ is $N_l = n / 2^l \le n / \log n$ entries, i.e. $O(n)$ bits

21 Lookup
• Compute $SA_k[i]$ recursively from $SA_{k+1}$ using Lemma 1
• The recursion depth is $\log\log n$
• All the data structures used have O(1) access time
• So the lookup cost is $O(\log\log n)$
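The recursive lookup can be sketched end to end on the 32-character example. This is an uncompressed illustration: every level keeps plain Python lists for $B_k$, $rank_k$ and $\Psi_k$, whereas the slides store rank and Ψ implicitly. The names `build_levels` and `lookup` are assumptions, and the ordering a < # < b is taken from the definitions slide.

```python
ORDER = {"a": 0, "#": 1, "b": 2}           # the slides' ordering a < # < b
TEXT = "abbabbabbabbabaaabababbabbbabba#"   # the 32-character example


def build_levels(sa0: list[int], num_levels: int):
    """Return ([(B_k, rank_k, Psi_k) for k < num_levels], SA_l) as plain lists."""
    levels, sa = [], sa0
    for _ in range(num_levels):
        pos = {v: i + 1 for i, v in enumerate(sa)}
        b = [1 - v % 2 for v in sa]
        rank, ones = [], 0
        for bit in b:
            ones += bit
            rank.append(ones)
        psi = [i + 1 if bit else pos[v + 1] for i, (v, bit) in enumerate(zip(sa, b))]
        levels.append((b, rank, psi))
        sa = [v // 2 for v in sa if v % 2 == 0]
    return levels, sa


def lookup(levels, sa_last: list[int], i: int, k: int = 0) -> int:
    """Return SA_k[i] (1-based i) by recursing down to the explicit last level."""
    if k == len(levels):
        return sa_last[i - 1]
    b, rank, psi = levels[k]
    j = psi[i - 1]
    return 2 * lookup(levels, sa_last, rank[j - 1], k + 1) + (b[i - 1] - 1)


sa0 = sorted(range(1, len(TEXT) + 1), key=lambda i: [ORDER[c] for c in TEXT[i - 1:]])
levels, sa3 = build_levels(sa0, 3)
print(lookup(levels, sa3, 1))
```

Each call does a constant amount of work per level, so the cost is proportional to the number of levels kept.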

22 How the data is stored
• The $B_k$ bit vector is stored explicitly
  – $O(N_k)$ space
  – O(1) lookup
  – $O(N_k)$ preprocessing time
• The $rank_k$ vector is stored implicitly using Jacobson's rank data structure
  – $O(N_k \log\log N_k / \log N_k)$ space
  – O(1) lookup
  – $O(N_k)$ preprocessing time
• The $\Psi_k$ vector is stored implicitly (using rank and select)

23 Ψ k vector representation

24 Let’s Take a look

27 So what can we do with all the lists?
• Concatenate them in lexicographic order to form the list $L_k$
• $L_1 = \{9, 1, 6, 12, 14, 2, 4, 5\}$
• Let's see how we can compute $\Psi_k(i)$
  – If $B_k[i] = 1$ ($SA_k[i]$ is even), it is simply i
  – Otherwise, because all the prefix patterns are kept in sorted order, the list $L_k$ holds, up to position i, one entry for every odd suffix at or before i, so $h = i - rank_k(i)$
  – So we look up the h-th entry of $L_k$
• And it gives us the answer

28 Simple example
• $L_2 = \{5, 8, 1, 4\}$
• $rank_2 = \{1, 1, 1, 2, 3, 3, 3, 4\}$
• $B_2 = \{1, 0, 0, 1, 1, 0, 0, 1\}$
• $\Psi_2 = \{1, 5, 8, 4, 5, 1, 4, 8\}$
• $\Psi_2(3)$ = ?
• $rank_2(3) = 1$, $h = 3 - 1 = 2$, $L_2[2] = 8$
• $\Psi_2(3) = 8$
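The rule for evaluating Ψ from the concatenated list can be written out directly. In this sketch $L_2$ is taken to be the $\Psi_2$ values at the positions where $B_2$ is 0, in position order (5, 8, 1, 4), and `rank` is a naive stand-in for the constant-time rank structure; both function names are illustrative.

```python
B2 = [1, 0, 0, 1, 1, 0, 0, 1]
L2 = [5, 8, 1, 4]            # Psi_2 values of the entries with B_2 = 0, in position order


def rank(bits: list[int], i: int) -> int:
    """Number of 1 bits among the first i bits (1-based, inclusive); naive version."""
    return sum(bits[:i])


def psi(i: int) -> int:
    """Psi_2(i) for 1-based i, using only B_2, rank and the list L_2."""
    if B2[i - 1] == 1:       # SA_2[i] is even: Psi maps i to itself
        return i
    h = i - rank(B2, i)      # number of odd entries up to and including i
    return L2[h - 1]


print(psi(3))
```

Evaluating `psi` at every position reproduces the full $\Psi_2$ vector, which is why the vector itself never needs to be stored.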

29 Rank and select
• Given a bit vector of length n
• rank(i) is the number of 1 bits up to and including position i
• select(i) returns the index of the i-th 1 bit

30 $\Psi_k$ vector representation
• Lemma 2: Given s integers in sorted order, each containing w bits, where $s < 2^w$, we can store them with at most $s(2 + w - \lfloor \log s \rfloor) + O(s / \log\log s)$ bits so that retrieving the h-th integer takes constant time

31 $\Psi_k$ vector representation
• Take the first $z = \lfloor \log s \rfloor$ bits of each integer, creating the integers $q_1, \dots, q_s$
• It is easy to see that $q_1 \le \dots \le q_i \le q_{i+1} \le \dots \le q_s < s$ (we take the most significant bits, after all)
• The remaining $w - z$ bits of each integer form $r_i$
[Figure: an integer $S_i$ split into its high part $q_i$ and its low part $r_i$]

32 $\Psi_k$ vector representation
• Store the $r_i$ in a simple array, $(w - z) \cdot s$ bits
• Store $q_1, \dots, q_s$ in a table Q supporting select and rank in constant time
• The table Q is implemented in the following way
  – Instead of saving the numbers themselves, we store the gaps $q_1, q_2 - q_1, q_3 - q_2, \dots, q_s - q_{s-1}$ in unary representation (a gap g is written as $0^g 1$)
  – And add a select data structure

33 $\Psi_k$ vector representation
• To get $q_i$ we simply do select(i) and count the number of zeros before the i-th 1
• $q_i = select(i) - rank(select(i)) = select(i) - i$

34 $\Psi_k$ vector representation
• The size of the Q table is the size of the unary string, $s + 2^z \le 2s$ bits, plus the select overhead of $O(s / \log\log s)$ bits
• So we can output $S_i$ easily: $S_i = q_i \cdot 2^{w-z} + r_i$
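The high/low split behind Lemma 2 can be sketched as a small class. This is an illustration under assumptions: the class name `SortedIntList` is made up, the bits are held in Python lists rather than packed words, and the constant-time select structure is replaced by a precomputed list of the positions of the 1 bits.

```python
class SortedIntList:
    """Sorted w-bit integers split into high parts (unary gaps) and low parts (plain array)."""

    def __init__(self, values: list[int], w: int):
        s = len(values)
        self.low_bits = w - (s.bit_length() - 1)      # w - z, with z = floor(log2 s)
        mask = (1 << self.low_bits) - 1
        self.r = [v & mask for v in values]           # low w - z bits, stored verbatim
        self.unary = []                               # gaps q_i - q_{i-1} written as 0^gap 1
        prev = 0
        for v in values:
            q = v >> self.low_bits
            self.unary += [0] * (q - prev) + [1]
            prev = q
        # Stand-in for a constant-time select structure: 1-based positions of the 1 bits.
        self.ones = [p for p, bit in enumerate(self.unary, start=1) if bit]

    def get(self, i: int) -> int:
        """Return the i-th integer (1-based)."""
        q = self.ones[i - 1] - i                      # zeros before the i-th 1
        return (q << self.low_bits) | self.r[i - 1]


values = [3, 5, 6, 12, 13, 20, 27, 31]                # s = 8 sorted 5-bit integers
packed = SortedIntList(values, w=5)
print(packed.get(4))
```

In the example s = 8 and w = 5, so z = 3 and the unary string has at most $s + 2^z = 16$ bits.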

37 $\Psi_k$ vector representation
• Lemma 3: We can store the concatenated list $L_k$ used for $\Psi_k$ in $n(1/2 + 3/2^{k+1}) + O(n / (2^k \log\log n))$ bits, so that accessing the h-th element takes constant time, with preprocessing time $O(n/2^k + 2^{2^k})$
• There are $2^{2^k}$ lists; number them (even the empty ones)
• Each integer $x_i$ in the lists, $1 \le x_i \le N_k$, is transformed into a new integer $x'_i$ by attaching the number of its list as the most significant bits
• The bit size of $x'_i$ is $2^k + \log N_k$
• After concatenating all the lists we have $N_k / 2$ sorted numbers of $2^k + \log N_k$ bits each
• Using Lemma 2 we get
  – O(1) access time
  – A space bound of $n(1/2 + 3/2^{k+1}) + O(n / (2^k \log\log n))$ bits
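The transformation in Lemma 3 can be illustrated at level k = 0 on the example string, where there are $2^{2^0} = 2$ lists, one per first character. The sketch assumes the list number is placed in the most significant bits and uses a fixed field width `width` for positions; all variable names are illustrative. Each list is increasing on its own, the plain concatenation is not, and the tagged concatenation is.

```python
ORDER = {"a": 0, "#": 1, "b": 2}           # the slides' ordering a < # < b
TEXT = "abbabbabbabbabaaabababbabbbabba#"
n = len(TEXT)

sa = sorted(range(1, n + 1), key=lambda i: [ORDER[c] for c in TEXT[i - 1:]])
pos = {v: i + 1 for i, v in enumerate(sa)}

# Level k = 0: one list per 1-character pattern, numbered a -> 0 and b -> 1.
lists = {0: [], 1: []}
for v in sa:                               # visit suffixes in suffix-array order
    if v % 2 == 1:                         # odd entry: its Psi_0 value goes into a list
        lists[0 if TEXT[v - 1] == "a" else 1].append(pos[v + 1])

width = n.bit_length()                     # bits reserved for a position (assumption)
l0 = lists[0] + lists[1]                   # the concatenated list L_0
tagged = [(num << width) | x for num in (0, 1) for x in lists[num]]

print(l0)
print(tagged == sorted(tagged))
```

Because the tagged sequence is sorted, it can be handed to the Lemma 2 representation, and the original entry is recovered by masking off the tag bits.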

38 Sum it up (space complexity)

39 Rank data structure
• Due to Jacobson
• Given a bit vector of length n, rank(i) is the number of 1 bits up to position i
• Multilevel approach
• Slice the bit string into chunks of $\log^2 n$ bits
• Keep a rank counter at each chunk boundary
• Divide each chunk into sub-chunks of $\frac{1}{2} \log n$ bits
• Keep a counter at each sub-chunk boundary
• At the bottom level a simple lookup table is used
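The three ingredients of the rank structure (chunk counters, sub-chunk counters, lookup table) can be sketched as below. The class name `RankDirectory` and the fixed block sizes 16 and 4 are assumptions chosen for readability; the slides use $\log^2 n$ and $\frac{1}{2} \log n$, and a real implementation would keep the bits in machine words.

```python
class RankDirectory:
    """Two-level rank directory over a list of bits (sketch with small fixed block sizes)."""

    def __init__(self, bits: list[int], chunk: int = 16, sub: int = 4):
        assert chunk % sub == 0
        self.bits, self.chunk, self.sub = bits, chunk, sub
        self.chunk_count = []      # ones before each chunk (absolute)
        self.sub_count = []        # ones before each sub-chunk, relative to its chunk
        total = rel = 0
        for start in range(0, len(bits), sub):
            if start % chunk == 0:
                self.chunk_count.append(total)
                rel = 0
            self.sub_count.append(rel)
            ones = sum(bits[start:start + sub])
            total += ones
            rel += ones
        # Lookup table: for every sub-chunk pattern and prefix length, the number of ones.
        self.table = {}
        for pattern in range(1 << sub):
            pat_bits = [(pattern >> (sub - 1 - t)) & 1 for t in range(sub)]
            self.table[pattern] = [sum(pat_bits[:t]) for t in range(sub + 1)]

    def rank(self, i: int) -> int:
        """Number of 1 bits among the first i bits (0 <= i <= len(bits))."""
        if i == 0:
            return 0
        block = (i - 1) // self.sub                      # sub-chunk holding bit i
        piece = self.bits[block * self.sub:(block + 1) * self.sub]
        piece += [0] * (self.sub - len(piece))           # pad a short final sub-chunk
        pattern = int("".join(map(str, piece)), 2)
        inside = i - block * self.sub                    # prefix length inside the sub-chunk
        return (self.chunk_count[(i - 1) // self.chunk]
                + self.sub_count[block]
                + self.table[pattern][inside])


bits = [(i * 7 + i // 3) % 2 for i in range(37)]         # arbitrary test pattern
directory = RankDirectory(bits)
print(directory.rank(20))
```

A query adds three numbers, one from each level, which matches the shape of the answer in the rank figure.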

40 Rank
[Figure: a rank query resolved through the $\log^2 n$-bit chunk counters, the $\frac{1}{2} \log n$-bit sub-chunk counters and the lookup table; the output is 14 + 3 + 1]

41 Rank Analysis

42 Compressed suffix arrays in $\varepsilon^{-1} n + O(n)$ bits and $O(\log^{\varepsilon} n)$ access time
• To break the space barrier we need to store fewer levels, which means longer lookups
• Let's store only three compressed levels: $SA_0$, $SA_{l'}$ and $SA_l$, with $l = \lceil \log\log n \rceil$ and $l' = \lceil \frac{1}{2} \log\log n \rceil$
• We use a dictionary data structure, which can tell whether an element is a member of the dictionary and supports a rank query, in O(1) time for both queries
• The space complexity of the dictionary is [formula image not preserved]
• We keep in two dictionaries, $D_0$ and $D_{l'}$, which items are passed to the next level (from $SA_0$ to $SA_{l'}$ and from $SA_{l'}$ to $SA_l$)

43 The $\Psi'_k$ function
• We define the $\Psi'_k$ function, which maps each 1 to its companion 0
• Let's define the $\varphi_k$ function to be [formula image not preserved]
• We just need to merge the indexes in $L_k$ and $L'_k$

44 Example

45 The $\varphi_k$ function implementation
• Lemma 4: We can store the concatenated list used for $\varphi_k$
  – for k = 0 in $n + O(n / \log\log n)$ bits
  – for k > 0 in $n(1 + 1/2^{k-1}) + O(n / (2^k \log\log n))$ bits, with preprocessing time $O(n/2^k + 2^{2^k})$
• If k > 0, simply use Lemma 3
• If k = 0
  – Encode a and # as 0, and b as 1
  – Create an n-bit vector, named L
  – L[f] = 0 iff the list for $\varphi_0$ is a or # at position f
  – Add select (for 1 bits) and select0 (for 0 bits) data structures on top of it, $O(n / \log\log n)$ bits
  – Also keep the number of 0 bits in L as $c_0$
• The query $\varphi_0(j)$ is done in the following way
  – If $j = c_0$, return select0($c_0$)
  – If $j < c_0$, return select0(j)
  – If $j > c_0$, return select($j - c_0$)

46 The lookup algorithm
• To find SA[i], we start walking the $\varphi_k$ function: i, i', i'', i''', ...
• Each step satisfies $SA_0[i] + 1 = SA_0[i']$, and so on
• Until reaching an entry found in the dictionary $D_0$
  – Let s be the walk length
  – And r the rank of the entry in the dictionary (how many items have already been passed to the next level?)
• Using r we start walking the next level
  – Let s' be the walk length
  – And r' the rank of the entry in the dictionary
• We return the following result: [formula image not preserved]
• The walk length is $\max(s, s') < 2^{l'} = O(\sqrt{\log n})$
• So the query time is $O(\sqrt{\log n})$
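The walking idea can be sketched with a single sampled level instead of the two dictionaries of the slides. This is a simplification under stated assumptions: only suffix-array values divisible by `STEP` are kept, they are stored directly in a Python dict instead of being reached through a rank into the next level, and the names `phi`, `sampled` and `lookup` are illustrative.

```python
ORDER = {"a": 0, "#": 1, "b": 2}           # the slides' ordering a < # < b
TEXT = "abbabbabbabbabaaabababbabbbabba#"
n = len(TEXT)
STEP = 4                                   # keep only suffix-array values divisible by STEP

sa = sorted(range(1, n + 1), key=lambda i: [ORDER[c] for c in TEXT[i - 1:]])
pos = {v: i + 1 for i, v in enumerate(sa)}

# phi[i] = position i' with SA[i'] = SA[i] + 1 (undefined only for the last suffix).
phi = {i: pos[v + 1] for i, v in enumerate(sa, start=1) if v < n}
# Sampled positions and the suffix-array values kept for them.
sampled = {i: v for i, v in enumerate(sa, start=1) if v % STEP == 0}


def lookup(i: int) -> int:
    """Return SA[i] by walking phi until a sampled position is reached."""
    steps = 0
    while i not in sampled:
        i = phi[i]                         # each step increases the suffix value by one
        steps += 1
    return sampled[i] - steps


print(lookup(1))
```

Since n is a multiple of `STEP` here, every walk reaches a sampled entry in fewer than `STEP` steps, which is the same trade of space for walk length that the slides make with $2^{l'}$.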

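47 The general multilevel build
• For every $0 < \varepsilon < 1$
• Assume $\varepsilon l$ is an integer, so $2^{\varepsilon l} < 2 \log^{\varepsilon} n$
• Create the levels $0, \varepsilon l, 2\varepsilon l, \dots, l$
• The number of levels is $\varepsilon^{-1} + 1$, which gives a lookup of $O(\log^{\varepsilon} n)$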

48 The General multilevel Build

49 Select data structure
• select(i) returns the position of the i-th 1 bit in the string
• Same idea as rank, but a bit more complicated
• Multilevel approach
• At the first level we record the position of every $(\log n \log\log n)$-th 1 bit
  – Total space $O(n / \log\log n)$
• Between each two recorded bits we keep the following data
• If the distance between them is $r > (\log n \log\log n)^2$
  – We keep the absolute positions of all the 1 bits between them, which takes $\log^2 n \log\log n$ bits
• Otherwise we keep the relative position of every $(\log r \log\log n)$-th 1 bit
• The space per block stays below $r / \log\log n$ bits (in the first case because $\log^2 n \log\log n < r / \log\log n$), with $r < n$
• Then we keep one more level (the same notions)
  – The block size comes down to $(\lg n)^4$

50 Select data structure
• After that, we keep a lookup table
• For every bit pattern of length $(\log n)/d$, with $d \ge 2$, we save
  – The number of 1 bits
  – The location of the i-th 1 bit in the pattern
• Same as before, the space is $O(n^{1/d} \log n \log\log n)$
• The lookup is then very simple: just walk the levels
• Get a block and query it using the lookup table
• Space complexity: $O(n / \log\log n)$
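The first level of the select structure can be sketched as below. This is a deliberately reduced version: it samples the position of every `every`-th 1 bit and then scans, with no second level and no lookup table, so it does not achieve constant time. The class name `SampledSelect` and the sampling rate 4 are assumptions.

```python
class SampledSelect:
    """select(i) via sampled positions plus a short scan (one level only, no lookup table)."""

    def __init__(self, bits: list[int], every: int = 4):
        self.bits, self.every = bits, every
        self.samples = []                  # positions of 1 bits number 1, every+1, 2*every+1, ...
        seen = 0
        for p, bit in enumerate(bits, start=1):
            if bit:
                if seen % every == 0:
                    self.samples.append(p)
                seen += 1
        self.ones = seen

    def select(self, i: int) -> int:
        """1-based position of the i-th 1 bit (1 <= i <= number of ones)."""
        if not 1 <= i <= self.ones:
            raise IndexError("select index out of range")
        block, offset = divmod(i - 1, self.every)
        p = self.samples[block]            # position of 1 bit number block * every + 1
        while offset:                      # scan forward over the remaining 1 bits
            p += 1
            if self.bits[p - 1]:
                offset -= 1
        return p


bits = [0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1]   # B_0[1..16] of the example
sel = SampledSelect(bits)
print(sel.select(5))
```

The full structure replaces the scan with further levels and a table lookup, which is where the distinction between sparse and dense blocks comes in.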
