Suffix arrays

Suffix array We lose some of the functionality of suffix trees, but we save space. Let s = abab. Sort the suffixes lexicographically: ab, abab, b, bab. The suffix array gives the indices of the suffixes in sorted order: 2, 0, 3, 1.
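A minimal sketch of this definition (not the linear-time construction discussed later): sort the suffix start positions by comparing the suffixes themselves.

```python
def suffix_array(s):
    # Naive construction: sort starting positions by the suffixes they start.
    # Not linear time; it only illustrates the definition.
    return sorted(range(len(s)), key=lambda i: s[i:])

print(suffix_array("abab"))  # [2, 0, 3, 1], matching the example above
```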

How do we build it? Build a suffix tree, traverse it in DFS picking the outgoing edges of each node in lexicographic order, and fill in the suffix array. O(n) time.

How do we search for a pattern? If P occurs in T then all its occurrences are consecutive in the suffix array. Do a binary search on the suffix array. Takes O(m log n) time.
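A minimal sketch of this search, assuming text and a suffix array sa as built above: two binary searches find the interval of suffixes that have the pattern as a prefix; each step compares at most m characters, giving O(m log n).

```python
def pattern_interval(text, sa, pattern):
    """Return [left, right) in sa of the suffixes having `pattern` as a prefix."""
    m = len(pattern)

    def boundary(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]        # O(m) comparison
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return boundary(strict=False), boundary(strict=True)

# The occurrences of P in T start at text positions sa[left:right].
```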

Example Let S = mississippi. Its sorted suffixes: i, ippi, issippi, ississippi, mississippi, pi, ppi, sippi, sissippi, ssippi, ssissippi. Let P = issa. (L, R and M mark the left end, right end and middle of the current binary-search interval.)

How do we accelerate the search? Maintain ℓ = LCP(P, L) and r = LCP(P, R). Assume ℓ ≥ r (the other case is symmetric).

If ℓ = r, start comparing M to P at position ℓ + 1.

Now suppose ℓ > r.

Someone whispers LCP(L, M). First case: LCP(L, M) > ℓ.

LCP(L, M) > ℓ: M agrees with L beyond the point where P stops agreeing, so P > M; continue in the right half.

Second case: LCP(L, M) < ℓ.

LCP(L, M) < ℓ: M disagrees with L before P does, and since L < M the disagreeing character of M is larger, so P < M; continue in the left half.

Third case: LCP(L, M) = ℓ: start comparing M to P at position ℓ + 1.

Analysis If we do more than a single character comparison in an iteration, then max(ℓ, r) grows by 1 for each extra comparison, so the total time is O(m + log n).

Construct the suffix array without the suffix tree

Linear-time construction. Recursively? Say we want to sort only the suffixes that start at even positions?

Change the alphabet: every pair of characters becomes a single character, so you are in fact sorting the suffixes of a string shorter by a factor of 2!

Change the alphabet: over {a, b, $} the pairs are ranked a$ → 0, aa → 1, ab → 2, b$ → 3, ba → 4, bb → 5; rewrite the string over this pair alphabet.
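A small sketch of this recoding step (names are illustrative): view the padded string as consecutive character pairs and replace each pair by its rank. The slide ranks all pairs over the alphabet; for simplicity this sketch ranks only the pairs that actually occur, which preserves their relative order.

```python
def recode_pairs(s, pad="$"):
    # Pad to even length, split into non-overlapping pairs, and map each pair
    # to its rank among the sorted distinct pairs: a string of half the length.
    if len(s) % 2:
        s += pad
    pairs = [s[i:i + 2] for i in range(0, len(s), 2)]
    rank = {p: r for r, p in enumerate(sorted(set(pairs)))}
    return [rank[p] for p in pairs]

print(recode_pairs("abaab"))  # [1, 0, 2]: pairs ab, aa, b$ ranked among {aa, ab, b$}
```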

But we do not gain anything…

Divide into triples. For the string yabbadabbado$, the triples starting at positions ≡ 1 (mod 3) (0-based): abb ada bba do$.

Divide into triples (continued). The triples starting at positions ≡ 2 (mod 3): bba dab bad o$$ (padded with $).

Sort recursively 2/3 of the suffixes: concatenate the two triple strings abb ada bba do$ and bba dab bad o$$ and sort the suffixes of this shorter string recursively; this sorts all suffixes of yabbadabbado$ that start at positions ≢ 0 (mod 3).

Sort the remaining third: represent each suffix starting at a position ≡ 0 (mod 3) by the pair (its first character, the rank of the following suffix), here (b, 2), (a, 5), (a, 7), (y, 1), and sort these pairs.

Merge the two sorted sets of suffixes of yabbadabbado$, one comparison at a time.

Summary: when comparing a remaining suffix to a suffix at a position ≡ 1 (mod 3), compare the first characters and break ties by the ranks of the following suffixes; when comparing to a suffix at a position ≡ 2 (mod 3), compare the first character, the next character if there is a tie, and finally the ranks of the following suffixes.
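A sketch of these two comparison rules from the merge step, under the usual skew/DC3 conventions (0-based positions, T padded with sentinels so the lookups below never run off the end, and rank[k] known from the recursive sort for every k with k % 3 != 0):

```python
def mod0_leq(T, rank, i, j):
    """Is the suffix at i (i % 3 == 0) <= the suffix at j (j % 3 in {1, 2})?"""
    if j % 3 == 1:
        # One character, then ranks: i+1 and j+1 are both != 0 (mod 3).
        return (T[i], rank[i + 1]) <= (T[j], rank[j + 1])
    # j % 3 == 2: two characters, then ranks: i+2 and j+2 are both != 0 (mod 3).
    return (T[i], T[i + 1], rank[i + 2]) <= (T[j], T[j + 1], rank[j + 2])
```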

Compute LCPs. The sorted suffixes of yabbadabbado$: abbadabbado$, abbado$, adabbado$, ado$, badabbado$, bado$, bbadabbado$, bbado$, dabbado$, do$, o$, yabbadabbado$.

Crucial observation: for positions i < j in the suffix array, LCP(i, j) = min{LCP(i, i+1), LCP(i+1, i+2), ..., LCP(j-1, j)}.
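Under the convention that lcp[k] holds the LCP of the suffixes at ranks k and k+1 of the suffix array, this observation turns any pairwise LCP into a range minimum; a sketch:

```python
def lcp_of_ranks(lcp, i, j):
    # LCP of the suffixes at ranks i < j equals the minimum of the consecutive
    # LCPs between them (a range-minimum structure makes this O(1) after O(n) prep).
    return min(lcp[i:j])
```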

Find the LCPs of consecutive suffixes, processing the suffixes of yabbadabbado$ in text order and comparing each with its predecessor in the suffix array: LCP(11,0), LCP(8,2), LCP(9,3), LCP(6,4), LCP(7,5), LCP(1,6), LCP(2,7), LCP(3,8) = 3, LCP(4,9) = 2, LCP(5,10) = 1, LCP(10,11) = 0.

Analysis: the position at which the comparison resumes decreases by at most 1 from one suffix to the next, so over the whole run it can increase by at most O(n); hence all consecutive LCPs are computed in O(n) time.
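A sketch of the computation just described (essentially the algorithm of Kasai et al.), assuming a suffix array sa of s: suffixes are processed in text order, and the carried-over common prefix shrinks by at most one when the first character is dropped, which is exactly the argument above.

```python
def consecutive_lcps(s, sa):
    """lcp[k] = LCP of the suffixes at ranks k and k+1 of the suffix array."""
    n = len(s)
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r
    lcp = [0] * n                        # last entry stays 0 (no successor)
    h = 0                                # carried-over common prefix length
    for p in range(n):                   # suffixes in text order
        r = rank[p]
        if r == 0:                       # first suffix in the array: no predecessor
            h = 0
            continue
        q = sa[r - 1]                    # predecessor of suffix p in the suffix array
        while p + h < n and q + h < n and s[p + h] == s[q + h]:
            h += 1
        lcp[r - 1] = h                   # LCP of ranks r-1 and r
        if h:
            h -= 1                       # dropping one character shrinks the LCP by at most 1
    return lcp
```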

We need more LCPs for the search: linearly many; calculate them all bottom-up.

Another example: the string abcabbca$, whose sorted suffixes are $, a$, abbca$, abcabbca$, bbca$, bca$, bcabbca$, ca$, cabbca$.

Analysis: think about the LCP value we know at any point in the algorithm. A successful comparison increases it by one, and it decreases by at most one when an iteration starts, so the number of successful comparisons is O(n).

Burrows-Wheeler (bzip2). Among the best-performing compression methods for text. Basic idea: sort the characters by their full context (typically done in blocks); this is called the block-sorting transform. Use move-to-front encoding to encode the sorted characters. The ingenious observation is that the decoder only needs the sorted characters and a pointer to the first character of the original sequence.

S = abraca. Step I: build a matrix M consisting of all the cyclic shifts of S (with the end-marker #): abraca#, braca#a, raca#ab, aca#abr, ca#abra, a#abrac, #abraca.

Step II: sort the rows lexicographically: #abraca, a#abrac, abraca#, aca#abr, braca#a, ca#abra, raca#ab. F is the first column and L the last column; L is the Burrows-Wheeler transform.
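A direct sketch of steps I and II (fine for small examples; real implementations sort suffixes instead of materializing the rotation matrix): build all cyclic shifts of S#, sort them, and read off the last column.

```python
def bwt(s):
    # Assumes s already ends with a unique, lexicographically smallest
    # end-marker such as '#'.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)

print(bwt("abraca#"))  # 'ac#raab', the last column L of the sorted matrix
```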

Claim: every column of M contains all the characters of the string. In particular, you can obtain F from L by sorting.

The a's are in the same relative order in L and in F; similarly for every other character.

From L you can reconstruct the string. What is the first character of S? It is a; repeatedly following the correspondence between L and F recovers ab, abr, and so on, until all of S = abraca is rebuilt.
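A sketch of this reconstruction, relying only on the fact stated above (equal characters keep their relative order in L and F) and on the end-marker being the unique smallest character:

```python
def inverse_bwt(L):
    n = len(L)
    F = sorted(L)                                   # the first column
    # Stable sort of positions: F[j] and L[order[j]] are the same text occurrence.
    order = sorted(range(n), key=lambda i: L[i])
    out, row = [], 0                                # row 0 starts with the end-marker
    for _ in range(n):
        row = order[row]                            # move to the rotation shifted by one
        out.append(F[row])                          # and read its first character
    return "".join(out)

print(inverse_bwt("ac#raab"))  # 'abraca#'
```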

Compression? Compress the transform L into a string of integers using move-to-front, then use Huffman coding on the integers.
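A sketch of the move-to-front step (the table is seeded with some fixed ordering of the alphabet, an assumption of this sketch): recently seen characters get small codes, which is what makes the grouped output of the BWT compress well under Huffman coding.

```python
def move_to_front(L, alphabet):
    table = list(alphabet)
    codes = []
    for c in L:
        i = table.index(c)               # current position of c
        codes.append(i)
        table.insert(0, table.pop(i))    # move c to the front
    return codes

print(move_to_front("ac#raab", "#abcr"))  # [1, 3, 2, 4, 3, 0, 4]
```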

Why is it good? Characters with the same (right) context appear together in L.

Sorting the rotations is equivalent to computing the suffix array, so we can encode and decode in linear time.

A useful tool: the L-to-F mapping. (The rows are the sorted rotations of mississippi#; F is the first column, L = ipssm#pissii the last.) How do we map L's characters onto F's characters? We need to distinguish equal characters: to implement the LF-mapping for a character σ at position j in L we need the oracle occ(σ, j).

Substring search using the BWT (counting the pattern occurrences). L = ipssm#pissii is the BWT of mississippi#, and P = si. First step: fr and lr delimit the rows prefixed by the last character of P (here the rows prefixed by i). Inductive step: given fr, lr for P[j+1..p], take c = P[j], find the first and the last occurrence of c in L[fr..lr], and L-to-F map these characters to get the new interval; at the end the number of occurrences is lr - fr + 1 (= 2 here). The occ() oracle is enough. (Slide stolen from Paolo.)

Same search, with the available info: the array C, which stores for each character c the number of text characters smaller than c (so the block of rows starting with c begins at row C[c] + 1). (Slide stolen from Paolo.)

Continuing the search for P = si: the current interval is the set of rows prefixed by i, and it must be narrowed with c = s. What if someone whispers how many s's we have up to index 2 and up to index 5, i.e. occ(s,2) and occ(s,5)? Then fr = C[s] + occ(s,2) + 1 and lr = C[s] + occ(s,5).
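A sketch of the whole backward search, under one explicit convention (rows and L positions numbered 1..n; C[c] = number of text characters smaller than c; occ(c, i) = occurrences of c among the first i characters of L). This is the standard FM-index formulation and may differ by plus or minus one from the indexing used on the slide.

```python
def count_occurrences(P, n, C, occ):
    """Count occurrences of P using only C and the occ oracle."""
    fr, lr = 1, n                        # start with all rows
    for c in reversed(P):                # process the pattern backwards
        fr = C[c] + occ(c, fr - 1) + 1
        lr = C[c] + occ(c, lr)
        if fr > lr:
            return 0                     # pattern does not occur
    return lr - fr + 1                   # size of the row interval prefixed by P
```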

occ(σ, j): for L = ipssm#pissii, occ(s, 4) = 2.

Make a bit vector for each character. For L = ipssm#pissii, the bit vector of s is 001100001100, so occ(s,4) = rank(4), where rank(i) = the number of ones among the first i bits.

How do you answer rank queries? rank(i) = the number of ones among the first i bits. We could precompute a vector with all the answers, but that costs Θ(n log n) bits.

Let's do it with O(n) bits per character. Partition the bit vector into 2n/log(n) blocks of size log(n)/2. Keep the answer (the rank) for each prefix of blocks. There are only 2^(log(n)/2) = √n kinds of blocks, so prepare a table with all the answers for every possible block.

In this solution the bit vector takes Θ(n) bits and the additional structures also take Θ(n) bits.

Can we do it with smaller overhead, so that the additional structures take o(n) bits? Add superblocks of size log²(n); each block now keeps only the number of ones in the previous blocks of the same superblock.

Analysis The superblock table has size n/log(n) bits (n/log²(n) counters of log(n) bits each). The block table has size O(loglog(n) · n/log(n)) bits (2n/log(n) counters of O(loglog(n)) bits each). The lookup tables for the blocks take √n · log(n) · loglog(n) bits. So the additional structures take o(n) space.
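A sketch of this two-level structure (Python lists stand in for the packed bit arrays, so it shows the query logic rather than the exact o(n)-bit space bound; the final in-block scan stands in for the lookup table over all √n possible blocks):

```python
import math

class RankBitVector:
    def __init__(self, bits):
        self.bits = bits
        n = max(len(bits), 2)
        self.blk = max(1, int(math.log2(n)) // 2)                 # block: about (log n)/2 bits
        self.sup = self.blk * max(1, 2 * int(math.log2(n)))       # superblock: about log^2 n bits
        self.sup_rank, self.blk_rank = [], []
        total = within = 0
        for i in range(0, len(bits) + 1, self.blk):
            if i % self.sup == 0:
                self.sup_rank.append(total)                       # ones before this superblock
                within = 0
            self.blk_rank.append(within)                          # ones before this block, inside its superblock
            ones = sum(bits[i:i + self.blk])
            total += ones
            within += ones

    def rank(self, i):
        """Number of ones among the first i bits."""
        b = i // self.blk
        return (self.sup_rank[(b * self.blk) // self.sup]
                + self.blk_rank[b]
                + sum(self.bits[b * self.blk:i]))                 # stand-in for the block table

# occ(s, 4) on L = ipssm#pissii, via the bit vector of 's':
bv = RankBitVector([0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0])
print(bv.rank(4))  # 2
```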

Next step Do it without keeping the bit vectors themselves; instead keep only the compressed version of the text. This saves a lot of space for compressible strings.