Paolo Ferragina, Università di Pisa Compressed Permuterm Index Paolo Ferragina Dipartimento di Informatica, Università di Pisa.

Presentation transcript:

Compressed Permuterm Index
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa

A basic problem
Given a dictionary D of strings, having variable length, design a compressed data structure that supports:
1) string ↔ id
2) Prefix(α): find all s in D that are prefixed by α
3) Suffix(β): find all s in D that are suffixed by β
4) Substring(γ): find all s in D that contain γ
5) PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)
(Compacted) Trie:
- Two versions, one for D and one for D^R (the reversed strings), then intersect the answers
- Need to store D for resolving edge-labels

A basic problem
Given a dictionary D of strings, having variable length, compress them in a way that efficiently supports:
1) string ↔ id
2) Prefix(α): find all s in D that are prefixed by α
3) Suffix(β): find all s in D that are suffixed by β
4) Substring(γ): find all s in D that contain γ
5) PrefixSuffix(α,β) = Prefix(α) ∩ Suffix(β)
Permuterm Index (Garfield, 76):
- Reduce any query to a "prefix query" over a larger dictionary

Permuterm Index [Garfield, 1976]
Take a dictionary D = {yahoo, google}
1. Append a special char $ to the end of each string
2. Generate all rotations of these strings (the Permuterm Dictionary P[D]):
   yahoo$  ahoo$y  hoo$ya  oo$yah  o$yaho  $yahoo
   google$ oogle$g ogle$go gle$goo le$goog e$googl $google
Any query on D reduces to a prefix-query on P[D]:
   Prefix(ya) = Prefix($ya)
   Suffix(oo) = Prefix(oo$)
   Substring(oo) = Prefix(oo)
   PrefixSuffix(y,o) = Prefix(o$y)
Space problems: P[D] holds one rotation per character of each string (plus one for $), so it is much larger than D.
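The reductions above can be checked with a small uncompressed sketch. It materializes every rotation explicitly, which is exactly the space problem the slide points out; the function names `build_permuterm` and `prefix_query` are illustrative and do not come from the slides.

```python
from bisect import bisect_left

def build_permuterm(dictionary):
    """Return the sorted list of (rotation, owner) pairs for every s + '$'."""
    entries = []
    for s in dictionary:
        t = s + "$"
        for k in range(len(t)):
            entries.append((t[k:] + t[:k], s))
    entries.sort()
    return entries

def prefix_query(entries, pattern):
    """Return the strings owning at least one rotation that starts with pattern."""
    keys = [rotation for rotation, _ in entries]
    pos = bisect_left(keys, pattern)  # first rotation >= pattern
    hits = set()
    while pos < len(entries) and entries[pos][0].startswith(pattern):
        hits.add(entries[pos][1])
        pos += 1
    return hits

P = build_permuterm(["yahoo", "google"])
prefix_ya = prefix_query(P, "$ya")         # Prefix(ya)
suffix_oo = prefix_query(P, "oo$")         # Suffix(oo)
substring_oo = prefix_query(P, "oo")       # Substring(oo)
prefix_suffix_y_o = prefix_query(P, "o$y") # PrefixSuffix(y,o)
```

All four query types go through the same `prefix_query`; only the pattern changes.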

The FM-index [Ferragina-Manzini, JACM '05]
The result (p = |P|, occ = number of occurrences of P in T):
- Count(P): O(p) time
- Locate(P): O(occ * polylog(|T|)) time
- Display(T[i,i+L]): O(L + polylog(|T|)) time
- Space occupancy: |T| H_k(T) + o(|T| log |Σ|) bits
New concept: the FM-index is an opportunistic data structure.
The main idea is to reduce substring search to some basic operations over arrays of symbols.
The Compressed Permuterm Index builds upon the two best features of the FM-index.

Third ingredient: FM-index substring search
Sorted rotations of T = mississippi# (last column is L), with P = si:

   #mississipp i
   i#mississip p
   ippi#missis s
   issippi#mis s
   ississippi# m
   mississippi #
   pi#mississi p
   ppi#mississ i
   sippi#missi s   <- fr
   sissippi#mi s   <- lr
   ssippi#miss i
   ssissippi#m i

L = ipssm#pissii, occ = lr - fr + 1 = 2
Count(P[1,p]): finds the range [fr, lr] in O(p) time
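A minimal sketch of the backward search behind Count, run on the mississippi# example. It builds L by sorting all rotations and counts occurrences by scanning L, so each step costs linear time rather than the constant time a real FM-index achieves with compressed rank structures. The names `bwt_from_text`, `occ`, and `count` are illustrative, and rows are 0-based here.

```python
def bwt_from_text(text):
    """Return (L, C): last column of the sorted rotations, and
    C[c] = number of symbols in text that are smaller than c."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    last = "".join(r[-1] for r in rotations)
    c_table, total = {}, 0
    for ch in sorted(set(text)):
        c_table[ch] = total
        total += text.count(ch)
    return last, c_table

def occ(last, ch, i):
    """Occurrences of ch in last[0:i] (plain scan, for clarity only)."""
    return last[:i].count(ch)

def count(last, c_table, pattern):
    """Backward search: return the 0-based inclusive row range (fr, lr)
    of rotations prefixed by pattern, or None if there is no match."""
    fr, lr = 0, len(last) - 1
    for ch in reversed(pattern):
        if ch not in c_table:
            return None
        fr = c_table[ch] + occ(last, ch, fr)
        lr = c_table[ch] + occ(last, ch, lr + 1) - 1
        if fr > lr:
            return None
    return fr, lr

L, C = bwt_from_text("mississippi#")
fr, lr = count(L, C, "si")
occurrences = lr - fr + 1
```

The loop consumes the pattern right to left, one character per step, which is where the O(p) bound on Count comes from once `occ` is answered in constant time.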

Compressed Permuterm Index
Z = $hat$hip$hop$hot$# (dictionary strings lexicographically sorted)
Build an FM-index on Z to support substring searches.
Some queries are trivial:
- Prefix(α) = Substring search($α) within Z
- Suffix(β) = Substring search(β$) within Z
- Substr(γ) = Substring search(γ) within Z
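As a sanity check of these three reductions, the sketch below builds Z and answers the queries by plain scanning of Z, not with an FM-index. The helper names `build_z` and `strings_matching` are illustrative, and the sketch assumes that dictionary strings contain neither `$` nor `#` and that every pattern contains at least one ordinary character.

```python
def build_z(dictionary):
    """Concatenate the sorted strings as $s1$s2...$sm$#."""
    return "$" + "$".join(sorted(dictionary)) + "$#"

def strings_matching(z, pattern):
    """Return the dictionary strings touched by an occurrence of pattern in z."""
    hits = set()
    pos = z.find(pattern)
    while pos != -1:
        # q is a position that lies inside the matched dictionary string.
        q = pos + 1 if pattern.startswith("$") else pos
        start = z.rfind("$", 0, q + 1) + 1  # just after the preceding $
        end = z.find("$", q)                # the $ closing the string
        hits.add(z[start:end])
        pos = z.find(pattern, pos + 1)
    return hits

Z = build_z(["hot", "hip", "hat", "hop"])
prefix_hi = strings_matching(Z, "$hi")  # Prefix(hi)
suffix_t = strings_matching(Z, "t$")    # Suffix(t)
substr_o = strings_matching(Z, "o")     # Substr(o)
```

The `$` separators are what make the reductions safe: a pattern without `$` can never span two dictionary strings.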

PrefixSuffix search
Key property: the last char of s_i is at L[i+1]
Cyclic-LF[i]:
   if (i > #D) return LF[i]
   else return LF[i+1]
[Figure: LF[2] versus CLF[2] for i=2]

PrefixSuffix(ho,p)
PrefixSuffix(P): search the FM-index of Z using Cyclic-LF instead of LF.
No change in the time/space bounds of compressed indexes.
[Figure: backward search of $ho, comparing the LF and CLF steps]
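A hedged sketch of how PrefixSuffix(ho,p) could be counted with the Cyclic-LF idea. It relies on an assumption the slides do not spell out: symbols are ordered `$` < letters < `#`, so that row i (1-based) of the sorted rotations of Z starts with `$s_i` and the row just after them starts with `$#`. Under that assumption, jumping from row i to row i+1 before the next backward step is what moves from the start of s_i to its last character. Function names are illustrative, rows are 0-based in the code, and occurrences are counted by scanning L.

```python
BIG = "\x7f"  # sorting stand-in so that '#' ranks after every letter (assumption)

def build_index(dictionary):
    """Return (L, C, m) for Z = $s1$...$sm$#, with strings sorted."""
    words = sorted(dictionary)
    z = "$" + "$".join(words) + "$#"
    key = lambda text: text.replace("#", BIG)
    rotations = sorted((z[i:] + z[:i] for i in range(len(z))), key=key)
    last = "".join(r[-1] for r in rotations)
    c_table, total = {}, 0
    for ch in sorted(set(z), key=key):
        c_table[ch] = total
        total += z.count(ch)
    return last, c_table, len(words)

def step(last, c_table, ch, fr, lr):
    """One backward-search step on the 0-based inclusive range [fr, lr]."""
    if ch not in c_table:
        return None
    new_fr = c_table[ch] + last[:fr].count(ch)
    new_lr = c_table[ch] + last[:lr + 1].count(ch) - 1
    return (new_fr, new_lr) if new_fr <= new_lr else None

def prefix_suffix(index, alpha, beta):
    """Count the strings having prefix alpha and suffix beta."""
    last, c_table, m = index
    rng = (0, len(last) - 1)
    for ch in reversed("$" + alpha):      # plain backward search for $alpha
        rng = step(last, c_table, ch, *rng)
        if rng is None:
            return 0
    fr, lr = rng[0], min(rng[1], m - 1)   # keep only rows that start with $s_i
    rng = (fr + 1, lr + 1)                # cyclic jump: last char of s_i is at L[i+1]
    for ch in reversed(beta):             # continue backward through beta
        rng = step(last, c_table, ch, *rng)
        if rng is None:
            return 0
    return rng[1] - rng[0] + 1

index = build_index(["hat", "hip", "hop", "hot"])
ho_p = prefix_suffix(index, "ho", "p")
h_t = prefix_suffix(index, "h", "t")
```

The jump shifts a whole row range by one, so the search still performs one backward step per pattern character, in line with the slide's remark that the bounds do not change.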

Rank and Select of strings
Z = $hat$hip$hop$hot$#
Other queries:
- Rank(s) = row of $s$
- Select(i) = backward scan from L[i+1]
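Both operations can be sketched on the same example, again under the assumption (not stated on the slides) that symbols are ordered `$` < letters < `#`, so that row i (1-based) starts with `$s_i`. Rank(s) is then a plain backward search for `$s$`, and Select(i) reads s_i right to left by following LF from row i+1 until a `$` is met. The names `build_index`, `lf`, `rank`, and `select` are illustrative; ranks are 1-based and rows inside the code are 0-based.

```python
BIG = "\x7f"  # sorting stand-in so that '#' ranks after every letter (assumption)

def build_index(dictionary):
    """Return (L, C, m) for Z = $s1$...$sm$#, with strings sorted."""
    words = sorted(dictionary)
    z = "$" + "$".join(words) + "$#"
    key = lambda text: text.replace("#", BIG)
    rotations = sorted((z[i:] + z[:i] for i in range(len(z))), key=key)
    last = "".join(r[-1] for r in rotations)
    c_table, total = {}, 0
    for ch in sorted(set(z), key=key):
        c_table[ch] = total
        total += z.count(ch)
    return last, c_table, len(words)

def lf(last, c_table, row):
    """LF-mapping: the row whose rotation starts one symbol earlier in Z."""
    ch = last[row]
    return c_table[ch] + last[:row].count(ch)

def rank(index, s):
    """Return the 1-based rank of s in the dictionary, or None if absent."""
    last, c_table, _ = index
    fr, lr = 0, len(last) - 1
    for ch in reversed("$" + s + "$"):
        if ch not in c_table:
            return None
        fr = c_table[ch] + last[:fr].count(ch)
        lr = c_table[ch] + last[:lr + 1].count(ch) - 1
        if fr > lr:
            return None
    return fr + 1  # 0-based row of $s$ converted to a 1-based rank

def select(index, i):
    """Return the i-th string (1-based) by scanning backward from L[i+1]."""
    last, c_table, m = index
    if not 1 <= i <= m:
        raise IndexError("rank out of range")
    row, chars = i, []  # 0-based row i is the 1-based row i+1
    while last[row] != "$":
        chars.append(last[row])
        row = lf(last, c_table, row)
    return "".join(reversed(chars))

index = build_index(["hat", "hip", "hop", "hot"])
rank_hop = rank(index, "hop")
third = select(index, 3)
```

Select costs one LF step per character of the returned string, and Rank one backward step per character of `$s$`.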

A test on URLs
- Time of 20 to 60 μsec/char, and space close to bzip
- Time close to Front-Coding (4 μsec/char), but <50% of its space
Choose your trade-off.
[Plot: trade-off versus % dict-size]