1
Suffix arrays
2
Suffix array: We lose some of the functionality of the suffix tree, but we save space. Let s = abab. Sort the suffixes lexicographically: ab, abab, b, bab. The suffix array gives the indices of the suffixes in sorted order: 2 0 3 1.
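A minimal Python sketch of this definition, assuming a naive construction that simply sorts the suffixes (this is not the linear-time method discussed later; the function name is illustrative):

```python
def suffix_array(s):
    """Return the start positions of the suffixes of s in lexicographic order.

    Naive sketch: the sort compares whole suffixes, so it is far from O(n).
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


print(suffix_array("abab"))  # [2, 0, 3, 1]
```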
3
How do we build it? Build a suffix tree. Traverse the tree in DFS, picking the edges outgoing from each node in lexicographic order, and fill the suffix array. O(n) time.
4
How do we search for a pattern? If P occurs in T, then all its occurrences are consecutive in the suffix array. Do a binary search on the suffix array. Takes O(m log n) time (m = |P|, n = |T|).
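The binary search can be sketched in Python as follows. This is an illustrative sketch, not code from the slides: the function name is hypothetical, and the suffix array in the usage line is built naively.

```python
def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text, using the suffix array sa."""
    m = len(pattern)

    # First rank whose suffix, cut to m chars, is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First rank whose suffix, cut to m chars, is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # The occurrences are consecutive in the suffix array: ranks first..lo-1.
    return sorted(sa[first:lo])


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
print(find_occurrences(text, sa, "issi"))  # [1, 4]
```

Each of the O(log n) steps compares at most m characters, which gives the O(m log n) bound.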
5
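Example: Let S = mississippi and let P = issa. L, M and R mark the left end, the middle and the right end of the binary search range.
Suffix array of S (start position, suffix):
10 i
7 ippi
4 issippi
1 ississippi
0 mississippi
9 pi
8 ppi
6 sippi
3 sissippi
5 ssippi
2 ssissippi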
6
How do we accelerate the search? Maintain ℓ = LCP(P, L). Maintain r = LCP(P, R). Assume ℓ ≥ r.
7
If ℓ = r, then start comparing M to P at position ℓ + 1.
8
Case ℓ > r.
9
Case ℓ > r: someone whispers LCP(L, M). First case: LCP(L, M) > ℓ.
10
If LCP(L, M) > ℓ: continue in the right half.
11
Second case: LCP(L, M) < ℓ.
12
If LCP(L, M) < ℓ: continue in the left half.
13
If LCP(L, M) = ℓ: start comparing M to P at position ℓ + 1.
14
Analysis: If we do more than a single comparison in an iteration, then max(ℓ, r) grows by 1 for each comparison. O(m + log n) time.
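The following Python sketch shows a simpler relative of this idea, and it is an assumption-laden illustration rather than the method on the slides. It keeps the match lengths at L and R (called `l` and `r`) and starts each comparison at min(l, r). It does not use the whispered LCP(L, M) values, so its worst case is still O(m log n) rather than O(m + log n). The names `match_length` and `find_one` are hypothetical.

```python
def match_length(text, start, pattern, skip):
    """Common prefix length of pattern and text[start:]; the first skip chars are known to match."""
    k = skip
    while k < len(pattern) and start + k < len(text) and text[start + k] == pattern[k]:
        k += 1
    return k


def find_one(text, sa, pattern):
    """Return one start position of pattern in text, or -1 if it does not occur."""
    m = len(pattern)
    lo, hi = -1, len(sa)   # sentinels: suffix at lo < pattern <= suffix at hi
    l = r = 0              # l = LCP(pattern, suffix at lo), r = LCP(pattern, suffix at hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        pos = sa[mid]
        # The suffix at mid lies between those at lo and hi, so it shares
        # at least min(l, r) characters with the pattern.
        k = match_length(text, pos, pattern, min(l, r))
        if k == m or (pos + k < len(text) and text[pos + k] > pattern[k]):
            hi, r = mid, k   # suffix at mid >= pattern
        else:
            lo, l = mid, k   # suffix at mid < pattern
    return sa[hi] if hi < len(sa) and r == m else -1


text = "mississippi"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
print(find_one(text, sa, "issa"))  # -1
```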
15
Construct the suffix array without the suffix tree
16
Linear time construction. Recursively? Say we want to sort only the suffixes that start at even positions?
17
Change the alphabet: Every pair of characters is now a character. You in fact sort the suffixes of a string shorter by a factor of 2!
18
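Change the alphabet. Each pair becomes one character: a$ → 0, aa → 1, ab → 2, b$ → 3, ba → 4, bb → 5. [Figure: an example string over {a, b} ending in $, rewritten pair by pair with these codes.]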
19
But we do not gain anything…
20
Divide into triples. T = yabbadabbado$ (positions 0 to 12). Triples starting at position 1: abb ada bba do$.
21
Divide into triples. T = yabbadabbado$ (positions 0 to 12). Triples starting at position 1: abb ada bba do$. Triples starting at position 2: bba dab bad o$$.
22
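Sort recursively 2/3 of the suffixes. T = yabbadabbado$ (positions 0 to 12).
Triples at positions 1 (mod 3): abb ada bba do$. Triples at positions 2 (mod 3): bba dab bad o$$.
Rename each triple by its rank to get a new string (positions 0 to 7): 1 2 4 6 4 5 3 7.
Suffix array of the new string: 0 1 6 4 2 5 3 7.
The same order as positions in T: 1 4 8 2 7 5 10 11.
Ranks of the sorted suffixes, listed by position in T (1, 2, 4, 5, 7, 8, 10, 11): 1 4 2 6 5 3 7 8.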
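The renaming step can be sketched in Python as follows. This is a hedged illustration: `rename_triples` is a hypothetical helper, the text is passed without its final $, and $ is used as padding. The recursive call is replaced here by a naive sort of the renamed string.

```python
def rename_triples(text, pad="$"):
    """Rename the triples starting at positions 1 and 2 (mod 3) by their rank.

    Returns (positions, names): positions[k] is the start in text of the
    k-th triple, names[k] is its 1-based rank among the distinct triples.
    """
    n = len(text)
    positions = list(range(1, n, 3)) + list(range(2, n, 3))
    padded = text + pad * 3          # so that every triple has three chars
    triples = [padded[i:i + 3] for i in positions]
    rank_of = {t: k for k, t in enumerate(sorted(set(triples)), start=1)}
    return positions, [rank_of[t] for t in triples]


positions, names = rename_triples("yabbadabbado")
print(names)  # [1, 2, 4, 6, 4, 5, 3, 7]

# Stand-in for the recursive call: sort the suffixes of the renamed string naively.
order = sorted(range(len(names)), key=lambda k: names[k:])
print([positions[k] for k in order])  # [1, 4, 8, 2, 7, 5, 10, 11]
```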
23
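Sort the remaining third. Represent each suffix starting at a position i ≡ 0 (mod 3) by the pair (T[i], rank of the suffix at i + 1): position 0: (y, 1), position 3: (b, 2), position 6: (a, 5), position 9: (a, 7). Sorting the pairs gives (a, 5) (a, 7) (b, 2) (y, 1), that is, positions 6 9 3 0.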
24
Merge. T = yabbadabbado$.
Sorted two thirds (positions 1 and 2 mod 3): 1 4 8 2 7 5 10 11.
Sorted remaining third (positions 0 mod 3): 6 9 3 0.
Repeatedly move the smaller of the two front suffixes to the output. Output after each step:
1
1 6
1 6 4
1 6 4 9
1 6 4 9 3
1 6 4 9 3 8
1 6 4 9 3 8 2
1 6 4 9 3 8 2 7
1 6 4 9 3 8 2 7 5
1 6 4 9 3 8 2 7 5 10
1 6 4 9 3 8 2 7 5 10 11
1 6 4 9 3 8 2 7 5 10 11 0
35
Summary. Suffix array of yabbadabbado$: 1 6 4 9 3 8 2 7 5 10 11 0. When comparing to a suffix with index 1 (mod 3), we compare the char and break ties by the ranks of the following suffixes. When comparing to a suffix with index 2 (mod 3), we compare the char, then the next char if there is a tie, and finally the ranks of the following suffixes.
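The two comparison rules can be sketched as a merge routine in Python. This is an illustrative sketch under stated assumptions: `merge_sorted_suffixes` is a hypothetical name, the text is passed without its final $, positions past the end count as an empty character, and suffixes past the end get rank 0.

```python
def merge_sorted_suffixes(text, sample_sorted, rest_sorted):
    """Merge the sorted suffixes at positions 1, 2 (mod 3) with those at 0 (mod 3)."""
    n = len(text)
    rank = [0] * (n + 3)                     # rank 0 means "past the end"
    for r, pos in enumerate(sample_sorted, start=1):
        rank[pos] = r

    def ch(i):
        return text[i] if i < n else ""      # "" sorts before every character

    def sample_is_smaller(i, j):
        if i % 3 == 1:
            # Compare the char, break ties by the ranks of the following suffixes.
            return (ch(i), rank[i + 1]) < (ch(j), rank[j + 1])
        # i % 3 == 2: compare the char, then the next char, then the ranks.
        return (ch(i), ch(i + 1), rank[i + 2]) < (ch(j), ch(j + 1), rank[j + 2])

    out, a, b = [], 0, 0
    while a < len(sample_sorted) and b < len(rest_sorted):
        if sample_is_smaller(sample_sorted[a], rest_sorted[b]):
            out.append(sample_sorted[a])
            a += 1
        else:
            out.append(rest_sorted[b])
            b += 1
    return out + sample_sorted[a:] + rest_sorted[b:]


merged = merge_sorted_suffixes("yabbadabbado", [1, 4, 8, 2, 7, 5, 10, 11], [6, 9, 3, 0])
print(merged)  # [1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0]
```

Every comparison looks at a constant number of characters and ranks, so the merge is linear in the number of suffixes.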
36
Compute LCPs. T = yabbadabbado$ (positions 0 to 12).
Suffixes in sorted order (start position, suffix):
1 abbadabbado$
6 abbado$
4 adabbado$
9 ado$
3 badabbado$
8 bado$
2 bbadabbado$
7 bbado$
5 dabbado$
10 do$
11 o$
0 yabbadabbado$
37
Crucial observation: for two suffixes at positions i < j of the suffix array, LCP(i, j) = min {LCP(i, i+1), LCP(i+1, i+2), …, LCP(j−1, j)}.
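A small Python check of this observation, offered as an illustrative sketch: the helper names are hypothetical and the suffix array and the LCPs of neighbours are computed naively.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def adjacent_lcps(text, sa):
    """adjacent[k] = LCP of the suffixes at ranks k and k + 1 of the suffix array."""
    return [common_prefix_len(text[sa[k]:], text[sa[k + 1]:]) for k in range(len(sa) - 1)]


def lcp_between(adjacent, i, j):
    """LCP of the suffixes at ranks i < j, as the minimum over the neighbours in between."""
    return min(adjacent[i:j])


text = "yabbadabbado$"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
adjacent = adjacent_lcps(text, sa)
```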
38
Find LCPs of consecutive suffixes. T = yabbadabbado$, suffix array: 1 6 4 9 3 8 2 7 5 10 11 0.
Go over the suffixes in text order (0, 1, 2, …) and compute for each one the LCP with the suffix just before it in the suffix array. Suffix 1 is first in the suffix array, so it has no predecessor.
LCP(11, 0) = 0
LCP(8, 2) = 1
LCP(9, 3) = 0
LCP(6, 4) = 1
LCP(7, 5) = 0
LCP(1, 6) = 5
LCP(2, 7) = 4
LCP(3, 8) = 3
LCP(4, 9) = 2
LCP(5, 10) = 1
LCP(10, 11) = 0
LCPs of consecutive suffixes, in suffix array order: 5 1 2 0 3 1 4 0 1 0 0.
49
Analysis: The position at which we start comparing (the LCP carried over from the previous iteration) decreases by at most 1 in every iteration. So it cannot increase more than O(n) times in total.
50
We need more LCPs for the search: linearly many. Calculate them all bottom-up.
51
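Another example. T = abcabbca$ (positions 1 to 9). Suffixes and their start positions: abcabbca$ (1), bcabbca$ (2), cabbca$ (3), abbca$ (4), bbca$ (5), bca$ (6), ca$ (7), a$ (8), $ (9). [Figure: the suffix array and the LCP values for this string.]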
52
Analysis: Think about the LCP which we know at any point in the algorithm. A successful comparison increases it by one. It decreases by one when an iteration starts. So the number of successful comparisons is O(n).
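The LCP computation analysed above can be sketched in Python as follows. This scheme is usually credited to Kasai et al.; treating it as exactly the variant the slides have in mind is an assumption, and the names are illustrative.

```python
def lcp_array(text, sa):
    """lcp[r] = LCP of the suffixes at ranks r - 1 and r of sa (lcp[0] = 0)."""
    n = len(text)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r

    lcp = [0] * n
    h = 0                       # the LCP we already know when an iteration starts
    for i in range(n):          # go over the suffixes in text order
        if rank[i] == 0:        # first in the suffix array: no predecessor
            h = 0
            continue
        j = sa[rank[i] - 1]     # predecessor of suffix i in the suffix array
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1              # a successful comparison increases h by one
        lcp[rank[i]] = h
        if h > 0:
            h -= 1              # h decreases by one when the next iteration starts
    return lcp


text = "yabbadabbado"
sa = [1, 6, 4, 9, 3, 8, 2, 7, 5, 10, 11, 0]
print(lcp_array(text, sa))  # [0, 5, 1, 2, 0, 3, 1, 4, 0, 1, 0, 0]
```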
53
Burrows–Wheeler (bzip2): Currently the best algorithm for text. Basic idea: Sort the characters by their full context (typically done in blocks). This is called the block sorting transform. Use move-to-front encoding to encode the sorted characters. The ingenious observation is that the decoder only needs the sorted characters and a pointer to the first character of the original sequence.
54
S = abraca
Step I: create a matrix M built from all the cyclic shifts of S (with the end marker # appended):
M =
a b r a c a #
b r a c a # a
r a c a # a b
a c a # a b r
c a # a b r a
a # a b r a c
# a b r a c a
55
Step II: sort the rows in lexicographic order. F is the first column of the sorted matrix and L is the last column. L is the Burrows–Wheeler transform. [Figure: the rows of M in sorted order, with columns F and L marked.]
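Steps I and II can be sketched directly in Python. This is an illustrative sketch that builds the whole matrix, so it is quadratic in space. It assumes that the end marker # sorts before the letters, as in the mississippi example later in the slides; the function name is hypothetical.

```python
def bwt(s, end="#"):
    """Return the columns (F, L) of the sorted matrix of cyclic shifts of s + end."""
    t = s + end
    rows = sorted(t[i:] + t[:i] for i in range(len(t)))   # Step I and Step II
    first = "".join(row[0] for row in rows)               # column F
    last = "".join(row[-1] for row in rows)               # column L, the transform
    return first, last


F, L = bwt("mississippi")
print(F)  # #iiiimppssss
print(L)  # ipssm#pissii
```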
56
Claim: Every column contains all the chars of S. You can obtain F from L by sorting. [Figure: M and the sorted matrix side by side, with columns F and L marked.]
57
The "a"s are in the same order in L and in F. Similarly for every other char.
58
From L you can reconstruct the string. What is the first char of S? [Figure: columns L and F of the sorted matrix.]
59
From L you can reconstruct the string. What is the first char of S? Answer: a.
60
From L you can reconstruct the string. Reconstructed so far: ab.
61
Reconstructed so far: abr.
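A Python sketch of the reconstruction, under stated assumptions: the end marker # is unique and sorts before every other char, and the string is recovered from its last char to its first, whereas the slides walk forward from the first char. It relies on the fact that equal chars keep their relative order in L and F. The function name is hypothetical.

```python
def inverse_bwt(last):
    """Recover the original string (without the end marker) from column L."""
    n = len(last)
    # A stable sort of the positions of L by char yields column F:
    # order[k] is the position in L of the char that sits at F[k].
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n                      # lf[i] = row of F holding the char L[i]
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row

    row = 0                           # row 0 starts with the end marker
    out = []
    for _ in range(n - 1):
        out.append(last[row])         # L[row] is the char that precedes F[row]
        row = lf[row]
    return "".join(reversed(out))


print(inverse_bwt("ipssm#pissii"))  # mississippi
```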
62
Compression? Compress the transform L to a string of integers using move-to-front (integers shown on the slide: 0 2 3 2 0 3). Then use Huffman to code the integers.
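Move-to-front encoding can be sketched in Python as below. This is an illustrative sketch: the initial order of the table is an assumption (here the sorted alphabet), since the slides do not state it.

```python
def move_to_front(text, alphabet):
    """Encode text as a list of table positions, moving each used char to the front."""
    table = list(alphabet)
    out = []
    for ch in text:
        idx = table.index(ch)
        out.append(idx)
        table.insert(0, table.pop(idx))   # the char just seen moves to the front
    return out


print(move_to_front("aaabbb", "ab"))  # [0, 0, 0, 1, 0, 0]
```

Runs of equal chars, which the transform tends to produce, turn into runs of zeros, which is what makes the Huffman step effective.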
63
Why is it good? Characters with the same (right) context appear together.
64
Sorting is equivalent to computing the suffix array. Can encode and decode in linear time.
65
A useful tool: L → F mapping. Sorted rows for mississippi# (first column F, unknown middle part, last column L):
# mississipp i
i #mississip p
i ppi#missis s
i ssippi#mis s
i ssissippi# m
m ississippi #
p i#mississi p
p pi#mississ i
s ippi#missi s
s issippi#mi s
s sippi#miss i
s sissippi#m i
How do we map L's chars onto F's chars? We need to distinguish equal chars. To implement the LF-mapping for a char c at position j in L we need the oracle occ(c, j).
66
Substring search using the BWT (count the pattern occurrences). Text: mississippi, L = ipssm#pissii, P = si. The middle part of every row is unknown.
First step: fr and lr delimit the rows prefixed by the last char of P, here "i".
Inductive step: given fr, lr for P[j+1, p], take c = P[j]. Find the first c in L[fr, lr] and the last c in L[fr, lr]. The L-to-F mapping of these chars gives the new fr and lr.
At the end occ = 2 [lr − fr + 1]. The Occ() oracle is enough.
Slide stolen from Paolo Ferragina@
67
Substring search using the BWT (count the pattern occurrences). Text: mississippi, P = si.
Available info: L = ipssm#pissii and the array C, where C[c] is the number of chars of mississippi# that are smaller than c: # 0, i 1, m 5, p 6, s 8.
First step: fr and lr delimit the rows prefixed by the last char of P, here "i".
Inductive step: given fr, lr for P[j+1, p], take c = P[j]. Find the first c in L[fr, lr] and the last c in L[fr, lr]. The L-to-F mapping of these chars gives the new fr and lr.
At the end occ = 2 [lr − fr + 1].
Slide stolen from Paolo Ferragina@
68
Substring search using the BWT (count the pattern occurrences). Text: mississippi, P = si.
Available info: L = ipssm#pissii and the array C, where C[c] is the number of chars of mississippi# that are smaller than c: # 0, i 1, m 5, p 6, s 8.
First step: the rows prefixed by char "i" are rows fr = 2 to lr = 5.
What if someone whispers how many "s" we have up to index 2 and up to index 5: occ(s, 2) and occ(s, 5)? Then fr = C[s] + occ(s, 2) + 1 and lr = C[s] + occ(s, 5). Here occ(s, 2) = 0 and occ(s, 5) = 2, so fr = 8 + 0 + 1 = 9 and lr = 8 + 2 = 10, and occ = 2 [lr − fr + 1].
Slide stolen from Paolo Ferragina@
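The counting procedure can be sketched in Python as follows. This is a hedged sketch rather than the slides' implementation: it uses 0-based half-open row ranges instead of the 1-based fr and lr, takes C[c] to be the number of chars smaller than c, and answers occ by scanning L, which a real index would replace by the rank structures. The names are illustrative.

```python
def count_occurrences(pattern, last):
    """Count the occurrences of pattern in the text whose BWT (column L) is last."""
    # C[c] = number of chars in the text that are smaller than c.
    counts = {}
    for ch in last:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def occ(c, j):
        """Number of occurrences of c in the first j chars of L (naive scan)."""
        return last[:j].count(c)

    fr, lr = 0, len(last)             # all rows; the range is [fr, lr)
    for c in reversed(pattern):       # process the pattern from its last char
        if c not in C:
            return 0
        fr = C[c] + occ(c, fr)
        lr = C[c] + occ(c, lr)
        if fr >= lr:
            return 0
    return lr - fr


print(count_occurrences("si", "ipssm#pissii"))  # 2
```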
69
occ(a, j): the number of occurrences of the char a in the first j positions of L. Example: L = ipssm#pissii, occ(s, 4) = 2.
70
Make a bit vector for each character. L = ipssm#pissii; bit vector for s: 0 0 1 1 0 0 0 0 1 1 0 0. occ(s, 4) = rank(4), where rank(i) = how many ones are there up to (and including) position i?
71
How do you answer rank queries? rank(i) = how many ones are there up to (and including) position i? We can prepare a vector with all answers. Bit vector: 0 0 1 1 0 0 0 0 1 1 0 0. Answers: 0 0 1 2 2 2 2 2 3 4 4 4.
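A short Python sketch of the "vector with all answers", with illustrative helper names; positions are 1-based as in the occ(s, 4) example.

```python
from itertools import accumulate


def char_bits(last, c):
    """Bit vector of L for the char c: 1 where L holds c, 0 elsewhere."""
    return [1 if ch == c else 0 for ch in last]


def rank_answers(bits):
    """answers[i - 1] = number of ones among the first i bits."""
    return list(accumulate(bits))


bits = char_bits("ipssm#pissii", "s")
answers = rank_answers(bits)
print(answers)         # [0, 0, 1, 2, 2, 2, 2, 2, 3, 4, 4, 4]
print(answers[4 - 1])  # occ(s, 4) = rank(4) = 2
```

Storing every answer costs about log n bits per position, which is the overhead the following slides reduce.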
72
Let's do it with O(n) bits per character. Partition the bit vector into 2n/log(n) blocks of size log(n)/2. Keep the answer for each prefix made of whole blocks (in the figure: 2, 5, 7). There are √n "kinds" of blocks; prepare a table with all answers for each kind of block.
73
In our solution the bit vector takes Θ(n) bits and the "additionals" also take Θ(n) bits.
74
Can we do it with smaller overhead, so that the additionals would take o(n) bits? Use superblocks of size log²(n) and keep the answer for each prefix made of whole superblocks (in the figure: 7, 13). Each block keeps the number of ones in the previous blocks that are in the same superblock (in the figure: 2, 4).
75
Analysis. The superblock table is of size n/log(n) bits. The block table is of size loglog(n) · n/log(n) bits. The tables for the kinds of blocks take √n · log(n) · loglog(n) bits. So the additionals take o(n) space.
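The two-level layout can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the class and attribute names are hypothetical, Python lists stand in for packed bit fields, and the count inside a block is obtained by scanning its few bits where the real structure would consult the precomputed table of block kinds. Here `rank(i)` counts the ones among the first i bits.

```python
import math


class RankBitVector:
    def __init__(self, bits):
        self.bits = list(bits)
        log_n = max(2, int(math.log2(max(len(self.bits), 4))))
        self.block = log_n // 2               # block size, about log(n)/2
        self.blocks_per_super = 2 * log_n     # superblock size, about log^2(n)
        self.super_counts = []                # ones before each superblock
        self.block_counts = []                # ones before each block, inside its superblock
        total = in_super = 0
        # One extra (possibly empty) block so that rank(len(bits)) has an entry.
        for b, start in enumerate(range(0, len(self.bits) + 1, self.block)):
            if b % self.blocks_per_super == 0:
                self.super_counts.append(total)
                in_super = 0
            self.block_counts.append(in_super)
            ones = sum(self.bits[start:start + self.block])
            total += ones
            in_super += ones

    def rank(self, i):
        """Number of ones among the first i bits, for 0 <= i <= len(bits)."""
        b = i // self.block
        inside = sum(self.bits[b * self.block:i])   # stand-in for the table lookup
        return self.super_counts[b // self.blocks_per_super] + self.block_counts[b] + inside


rv = RankBitVector([0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0])
print(rv.rank(4))  # 2
```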
76
Next step: Do it without keeping the bit vectors themselves. Instead keep only the compressed version of the text. Saves a lot of space for compressible strings.