# Compressed Permuterm Index

Paolo Ferragina, Dipartimento di Informatica, Università di Pisa


## A basic problem

Given a dictionary D of variable-length strings, design a compressed data structure that supports:

1. string ↔ id
2. Prefix(α): find all s in D that are prefixed by α
3. Suffix(β): find all s in D that are suffixed by β
4. Substring(γ): find all s in D that contain γ
5. PrefixSuffix(α, β) = Prefix(α) ∩ Suffix(β)

A first solution is the (compacted) trie, with two drawbacks:

- Two versions are needed, one for D and one for D^R (the reversed strings), and their answers must be intersected.
- D itself must be stored to resolve the edge labels.

## A basic problem (continued)

Given a dictionary D of variable-length strings, compress them so that the following operations are supported efficiently:

1. string ↔ id
2. Prefix(α): find all s in D that are prefixed by α
3. Suffix(β): find all s in D that are suffixed by β
4. Substring(γ): find all s in D that contain γ
5. PrefixSuffix(α, β) = Prefix(α) ∩ Suffix(β)

A second solution is the Permuterm Index (Garfield, 1976): reduce any query to a "prefix query" over a larger dictionary.

## Permuterm Index [Garfield, 1976]

Take a dictionary D = {yahoo, google}.

1. Append a special char \$ to the end of each string.
2. Generate all rotations of these strings.

The resulting permuterm dictionary P[D] is:

- yahoo\$, ahoo\$y, hoo\$ya, oo\$yah, o\$yaho, \$yahoo
- google\$, oogle\$g, ogle\$go, gle\$goo, le\$goog, e\$googl, \$google

Each query becomes a prefix query:

- Prefix(ya) = Prefix(\$ya)
- Suffix(oo) = Prefix(oo\$)
- Substring(oo) = Prefix(oo)
- PrefixSuffix(y,o) = Prefix(o\$y)

Any query on D reduces to a prefix query on P[D]. The drawback is space: P[D] stores every rotation of every string.
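The reduction above can be sketched in a few lines of Python. This is only an illustration of the uncompressed permuterm dictionary: the function names (`build_permuterm`, `prefix_query`, and the four wrappers) are hypothetical, and a sorted list with binary search stands in for whatever prefix-search structure would be used in practice.

```python
from bisect import bisect_left

def build_permuterm(words):
    """Return the sorted list of (rotation, original word) pairs, i.e. P[D]."""
    entries = []
    for w in words:
        t = w + "$"                      # step 1: append the special char
        for k in range(len(t)):          # step 2: all rotations of w$
            entries.append((t[k:] + t[:k], w))
    entries.sort()
    return entries

def prefix_query(entries, p):
    """Distinct words that own at least one rotation prefixed by p."""
    keys = [rot for rot, _ in entries]
    i = bisect_left(keys, p)             # first rotation >= p
    found = set()
    while i < len(keys) and keys[i].startswith(p):
        found.add(entries[i][1])
        i += 1
    return found

# The four query types, each reduced to one prefix query on P[D].
def prefix(entries, a):
    return prefix_query(entries, "$" + a)

def suffix(entries, b):
    return prefix_query(entries, b + "$")

def substring(entries, g):
    return prefix_query(entries, g)

def prefix_suffix(entries, a, b):
    return prefix_query(entries, b + "$" + a)

P = build_permuterm(["yahoo", "google"])
print(prefix(P, "ya"), suffix(P, "oo"), substring(P, "oo"), prefix_suffix(P, "y", "o"))
```

Note that P[D] here holds 13 rotations for two words, which is the space blow-up the slide points at.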

## The FM-index

The result [Ferragina-Manzini, JACM '05]:

- Count(P): O(p) time
- Locate(P): O(occ · polylog(|T|)) time
- Display(T[i, i+L]): O(L + polylog(|T|)) time
- Space occupancy: |T| H_k(T) + o(|T| log |Σ|) bits

New concept: the FM-index is an opportunistic data structure. The main idea is to reduce substring search to some basic operations over arrays of symbols.

The Compressed Permuterm index builds upon the best two features of the FM-index.

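## Third ingredient: FM-index substring search

(Figure: the sorted rotations of T = mississippi#, one per row, with the last column L highlighted.)

Rows, in sorted order: #mississipp, i#mississip, ippi#missis, issippi#mis, ississippi#, mississippi, pi#mississi, ppi#mississ, sippi#missi, sissippi#mi, ssippi#miss, ssissippi#m. Each row is shown without its final character; those final characters form the column L = ipssm#pissii.

For P = si, the rows prefixed by P run from fr to lr, so occ = lr − fr + 1 = 2.

Count(P[1,p]) finds the row range [fr, lr] in O(p) time.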
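A minimal sketch of the backward search behind Count, run on the slide's text mississippi#. It is not the compressed structure: the rotations are sorted explicitly and ranks are computed by scanning L, so each step costs O(|T|) rather than O(1). The names `bwt_index` and `count` are assumptions made for this example, and the pattern is assumed to be non-empty.

```python
def bwt_index(text):
    """Build the last column L and the array C from explicitly sorted rotations."""
    n = len(text)
    rows = sorted(text[i:] + text[:i] for i in range(n))
    L = "".join(r[-1] for r in rows)
    C = {}
    for i, r in enumerate(rows):
        C.setdefault(r[0], i)            # C[c] = number of symbols smaller than c
    return L, C

def count(L, C, pattern):
    """Return the 1-based row range (fr, lr) of rows prefixed by pattern.

    An empty range is reported as (1, 0). The pattern must be non-empty.
    """
    ch = pattern[-1]
    if ch not in C:
        return (1, 0)
    fr, lr = C[ch] + 1, C[ch] + L.count(ch)
    for ch in reversed(pattern[:-1]):    # scan the pattern right to left
        if ch not in C:
            return (1, 0)
        fr = C[ch] + L[:fr - 1].count(ch) + 1
        lr = C[ch] + L[:lr].count(ch)
        if fr > lr:
            return (1, 0)
    return (fr, lr)

L, C = bwt_index("mississippi#")
fr, lr = count(L, C, "si")
print(L, fr, lr, lr - fr + 1)
```

In ASCII the character `#` sorts before the letters, which matches the row order shown on the slide.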

## Compressed Permuterm Index

Concatenate the lexicographically sorted dictionary strings into a single string, for example Z = \$hat\$hip\$hop\$hot\$#, and build an FM-index on Z to support substring searches.

Some queries are trivial:

- Prefix(α) = substring search of \$α within Z
- Suffix(β) = substring search of β\$ within Z
- Substr(γ) = substring search of γ within Z
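The three trivial reductions can be checked on the slide's example. In this sketch a naive scan of Z stands in for the FM-index substring search, and all function names are hypothetical. The arguments α, β, γ are assumed to be non-empty, and matches are mapped back to the dictionary string that contains them.

```python
def build_z(words):
    """Concatenate the sorted dictionary as $s1$s2$...$sm$#."""
    return "$" + "$".join(sorted(words)) + "$#"

def occurrences(z, pattern):
    """Start positions of pattern in z (naive stand-in for the FM-index)."""
    return [i for i in range(len(z) - len(pattern) + 1) if z.startswith(pattern, i)]

def word_at(z, pos):
    """Dictionary string that contains (or starts right after) position pos."""
    start = z.rfind("$", 0, pos + 1)     # the '$' that opens the string
    end = z.find("$", start + 1)         # the '$' that closes it
    return z[start + 1:end]

def answer(z, pattern):
    return sorted({word_at(z, pos) for pos in occurrences(z, pattern)})

def prefix(z, alpha):
    return answer(z, "$" + alpha)

def suffix(z, beta):
    return answer(z, beta + "$")

def substr(z, gamma):
    return answer(z, gamma)

Z = build_z(["hot", "hip", "hat", "hop"])
print(Z, prefix(Z, "ho"), suffix(Z, "t"), substr(Z, "i"))
```

PrefixSuffix is the one query that does not map to a plain substring of Z, because β\$α would have to wrap around from the end of a string to its own beginning.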

## PrefixSuffix search

Key property: the last char of s_i is at L[i+1].

Cyclic-LF[i]: if (i > #D) return LF[i], else return LF[i+1], where #D is the number of strings in D.

(Figure: LF[2] versus CLF[2] for row i = 2.)

## PrefixSuffix(ho,p)

PrefixSuffix(P): search the FM-index of Z using Cyclic-LF instead of LF. There is no change in the time/space bounds of compressed indexes.

(Figure: the search for PrefixSuffix(ho,p), with LF and CLF steps marked around the rows of \$ho.)
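A sketch of the Cyclic-LF search on Z = \$hat\$hip\$hop\$hot\$#, under stated assumptions. The slides do not say how `$` and `#` compare with the other characters; the code assumes `$` sorts before and `#` after every dictionary character, which is one ordering under which the key property (last char of s_i at L[i+1], rows numbered from 1) holds. All names are hypothetical, ranks are computed by scanning L, and the pattern searched is β\$α as in the permuterm reduction.

```python
def char_key(ch):
    # Assumed ordering: '$' smallest, '#' largest.
    if ch == "$":
        return -1
    if ch == "#":
        return 0x110000
    return ord(ch)

def build_index(words):
    words = sorted(words)
    z = "$" + "$".join(words) + "$#"
    n = len(z)
    starts = sorted(range(n),
                    key=lambda s: [char_key(z[(s + k) % n]) for k in range(n)])
    L = [z[s - 1] for s in starts]       # char cyclically preceding each rotation
    C = {}
    for row, s in enumerate(starts):
        C.setdefault(z[s], row)          # number of symbols smaller than z[s]
    return L, C, len(words)

def rank(L, ch, i):
    """Occurrences of ch in L[1..i] (1-based)."""
    return L[:i].count(ch)

def clf_row(i, m):
    """Row whose L entry Cyclic-LF reads: i+1 for the rows of $s_1..$s_m, else i."""
    return i + 1 if 1 <= i <= m else i

def prefix_suffix_count(index, alpha, beta):
    """Number of dictionary strings with prefix alpha and suffix beta."""
    L, C, m = index
    pattern = beta + "$" + alpha
    if any(ch not in C for ch in pattern):
        return 0
    ch = pattern[-1]
    fr, lr = C[ch] + 1, C[ch] + L.count(ch)
    for ch in reversed(pattern[:-1]):
        if fr > lr:
            return 0
        fr, lr = clf_row(fr, m), clf_row(lr, m)      # the only change vs. plain LF
        fr, lr = C[ch] + rank(L, ch, fr - 1) + 1, C[ch] + rank(L, ch, lr)
    return max(0, lr - fr + 1)

idx = build_index(["hat", "hip", "hop", "hot"])
print(prefix_suffix_count(idx, "ho", "p"))
```

The search differs from the ordinary backward search only in the `clf_row` adjustment, which is why the bounds of the underlying compressed index carry over.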

## Rank and Select of strings

Z = \$hat\$hip\$hop\$hot\$#

Other queries:

- Rank(s) = row of \$s\$
- Select(i) = backward scan from L[i+1]
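Both operations can be sketched on the same example. As an assumption not stated on the slides, `$` is taken to sort before and `#` after every dictionary character, so that the rows of \$s_1, ..., \$s_m are rows 1 to m and L[i+1] holds the last char of s_i. The names `rank_of` and `select` are hypothetical, and ranks over L are computed by plain scanning.

```python
def char_key(ch):
    # Assumed ordering: '$' smallest, '#' largest.
    if ch == "$":
        return -1
    if ch == "#":
        return 0x110000
    return ord(ch)

def build_index(words):
    z = "$" + "$".join(sorted(words)) + "$#"
    n = len(z)
    starts = sorted(range(n),
                    key=lambda s: [char_key(z[(s + k) % n]) for k in range(n)])
    L = [z[s - 1] for s in starts]       # last column of the sorted rotations
    C = {}
    for row, s in enumerate(starts):
        C.setdefault(z[s], row)
    return L, C

def occ(L, ch, i):
    """Occurrences of ch in L[1..i] (1-based)."""
    return L[:i].count(ch)

def rank_of(index, s):
    """Row of $s$ (the lexicographic rank of s in D), or None if s is not in D."""
    L, C = index
    pattern = "$" + s + "$"
    if any(ch not in C for ch in pattern):
        return None
    fr, lr = C["$"] + 1, C["$"] + L.count("$")
    for ch in reversed(pattern[:-1]):    # ordinary backward search
        fr, lr = C[ch] + occ(L, ch, fr - 1) + 1, C[ch] + occ(L, ch, lr)
        if fr > lr:
            return None
    return fr

def select(index, i):
    """The i-th string of D, read backward from L[i+1] until a '$' is met."""
    L, C = index
    row, chars = i + 1, []
    while L[row - 1] != "$":
        ch = L[row - 1]
        chars.append(ch)
        row = C[ch] + occ(L, ch, row)    # LF step
    return "".join(reversed(chars))

idx = build_index(["hat", "hip", "hop", "hot"])
print(rank_of(idx, "hop"), select(idx, 3))
```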

## A test on URLs

- Time of 20 to 60 μsec/char, and space close to bzip
- Time close to Front-Coding (4 μsec/char), but <50% of its space

Choose your trade-off.

(Figure: trade-off plot against % dict-size.)
