Search Algorithms, Winter Semester 2004/2005, 8 Nov 2004. Heinz Nixdorf Institute, University of Paderborn, Algorithms and Complexity. Christian Schindelhauer.


Slide 1: Search Algorithms, Winter Semester 2004/2005, 8 Nov 2004. 4th Lecture. Christian Schindelhauer, schindel@upb.de

Slide 2: Chapter II: Searching in Compressed Text (8 Nov 2004)

Slide 3: Searching in Compressed Text (Overview)
- What is text compression: definition; the Shannon bound; Huffman codes; the Kolmogorov measure
- Searching in non-adaptive codes: KMP in Huffman codes
- Searching in adaptive codes: the Lempel-Ziv codes; pattern matching in Z-compressed files; adapting compression for searching

Slide 4: What is Text Compression?
- First approach: given a text s ∈ Σ^n, find a compressed version c ∈ Σ^m with m < n, such that s can be recovered from c.
- Formally: a compression function f: Σ* → Σ* is one-to-one (injective) and efficiently invertible.
- Fact: most texts are incompressible.
- Proof: there are (|Σ|^(m+1) − 1)/(|Σ| − 1) strings of length at most m, but |Σ|^n strings of length n. So at most (|Σ|^(m+1) − 1)/(|Σ| − 1) of the length-n strings can be compressed to length at most m, a fraction of at most |Σ|^(m−n+1)/(|Σ| − 1). For |Σ| = 256 and m = n − 10 this is about 8.3 × 10^-25, i.e. only a fraction 8.3 × 10^-25 of all n-byte files can be compressed to a string of length n − 10.
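A quick numerical check of this counting bound, as a minimal Python sketch (the values 256 and 10 are the slide's example):

    # Fraction of n-byte files compressible to n - 10 bytes or fewer:
    # at most |Sigma|^(m - n + 1) / (|Sigma| - 1) with |Sigma| = 256, m = n - 10.
    sigma = 256
    gap = 10
    fraction = sigma ** (1 - gap) / (sigma - 1)
    print(f"{fraction:.1e}")  # 8.3e-25, matching the slide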

Slide 5: Why does Text Compression work?
- Texts usually use letters with different frequencies. Relative frequencies of letters in general English plain text (from Cryptological Mathematics by Robert Edward Lewand): e: 12%, t: 10%, a: 8%, i: 7%, n: 7%, o: 7%, ..., k: 0.4%, x: 0.2%, j: 0.2%, q: 0.09%, z: 0.06%. Special characters like $, %, # occur even less frequently, and some character codes are (nearly) unused, e.g. byte value 0 of ASCII.
- Text obeys many rules: words are (usually) drawn from a fixed stock (collected in dictionaries); not all words can be used in combination; sentences are structured (grammar); program code uses code words; digitally encoded pictures have smooth areas where colors change gradually; patterns repeat.

Slide 6: Information Theory: The Shannon Bound
- In his 1948 paper "A Mathematical Theory of Communication", C. E. Shannon derives his definition of entropy.
- The entropy rate of a data source is the average number of bits per symbol needed to encode it.
- Example text: ababababab. Entropy rate: 1 bit per symbol. Encoding: use 0 for a and 1 for b. Code: 0101010101.
- Huffman codes are one way to approach this Shannon bound (for sufficiently long texts).
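The order-0 entropy this bound refers to can be computed directly from the letter frequencies; a minimal Python sketch (the function name is ours):

    from collections import Counter
    from math import log2

    def entropy_rate(text: str) -> float:
        """Order-0 entropy in bits per symbol, from relative letter frequencies."""
        counts = Counter(text)
        n = len(text)
        return -sum(c / n * log2(c / n) for c in counts.values())

    print(entropy_rate("ababababab"))  # 1.0 bit per symbol, as on the slide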

Slide 7: Huffman Code
- A Huffman code is adapted to each text (but not within the text). It consists of a dictionary, which maps each letter of the text to a binary string, and the code itself, given as a prefix-free binary encoding.
- Prefix-free code: uses strings s1, s2, ..., sm of variable length such that no string si is a prefix of another string sj.
- Example dictionary:

    Letter  Frequency  Code
    a       5          10
    i       4          01
    p       3          111
    m       2          000
    t       2          001
    n       2          110

- Example of Huffman encoding: each letter of the text is replaced by its code word; with this dictionary, e.g., "mana" encodes as 000 10 110 10.

Slide 8: Computing Huffman Codes
- Compute the letter frequencies.
- Build root nodes labeled with the frequencies.
- repeat: build a new node connecting the two least frequent unlinked nodes; mark its two edges with 0 and 1; the parent node carries the sum of the two frequencies; until one tree is left.
- The path from the root to each letter spells its code.
- For the frequencies a: 5, i: 4, p: 3, m: 2, t: 2, n: 2, merging produces internal nodes of weight 4 (m, t), 5 (n, p), 8, 10 and the root 18, giving the codes: a: 10, i: 01, p: 111, m: 000, t: 001, n: 110.
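A runnable sketch of this construction in Python (tie-breaking among equal frequencies differs from the slide's figure, so individual code words may differ while the code lengths agree):

    import heapq

    def huffman_codes(freq: dict) -> dict:
        """Repeatedly merge the two least frequent subtrees; the path of
        0/1 edge marks from the root to each letter is its code."""
        # Heap entries: (frequency, tie-breaker, {letter: code-so-far}).
        heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            f0, _, left = heapq.heappop(heap)   # least frequent subtree
            f1, _, right = heapq.heappop(heap)  # second least frequent
            merged = {ch: "0" + c for ch, c in left.items()}          # edge mark 0
            merged.update({ch: "1" + c for ch, c in right.items()})   # edge mark 1
            heapq.heappush(heap, (f0 + f1, counter, merged))          # parent = sum
            counter += 1
        return heap[0][2]

    print(huffman_codes({"a": 5, "i": 4, "p": 3, "m": 2, "t": 2, "n": 2}))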

Slide 9: Searching in Huffman Codes
- Let u be the size of the compressed text and v the size of the pattern Huffman-encoded according to the text's dictionary.
- KMP can search in Huffman codes in time O(u + v + m): encoding the pattern takes O(v + m) steps, building the prefix function takes O(v), and scanning the text at the bit level takes O(u + v).
- Problem: this algorithm works bit-wise, not byte-wise. Exercise: develop a byte-wise strategy.
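A sketch of the bit-level scan in Python, assuming a code table like the one from slide 8 (function names are ours). Note that a match of the encoded pattern at an arbitrary bit offset need not be aligned with code-word boundaries, so reported offsets still have to be checked for alignment:

    def kmp_search_bits(code: dict, text: str, pattern: str) -> list:
        """KMP over the 0/1 strings: returns bit offsets where the encoded
        pattern occurs in the encoded text."""
        encode = lambda s: "".join(code[ch] for ch in s)
        t, p = encode(text), encode(pattern)   # O(u + v + m)
        pi = [0] * len(p)                      # prefix function, O(v)
        k = 0
        for i in range(1, len(p)):
            while k > 0 and p[i] != p[k]:
                k = pi[k - 1]
            if p[i] == p[k]:
                k += 1
            pi[i] = k
        matches, k = [], 0                     # scan, O(u)
        for i, bit in enumerate(t):
            while k > 0 and bit != p[k]:
                k = pi[k - 1]
            if bit == p[k]:
                k += 1
            if k == len(p):
                matches.append(i - len(p) + 1)
                k = pi[k - 1]
        return matches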

Slide 10: The Downside of Huffman Codes
- Example: the 128-byte text abbaabbaabba...abba (abba repeated 32 times) is Huffman-encoded in 16 bytes (plus an extra byte for the dictionary) as 0110 repeated 32 times. This does not exploit the full compression potential of this text: writing it as (abba)^32 would need only 9 bytes.
- The perfect code: a self-extracting program for a string x is a program that, started without input, produces the output x and then halts. So the smallest self-extracting program is the ultimate encoding.
- The Kolmogorov complexity K(x) of a string x denotes the length of such a shortest self-extracting program for x.

Slide 11: Kolmogorov Complexity
- Does the Kolmogorov complexity depend on the programming language? No, as long as the language is universal, i.e. can simulate any Turing machine.
- Lemma: Let K1(x) and K2(x) denote the Kolmogorov complexity with respect to two arbitrary universal programming languages. Then there is a constant c such that for all strings x: K1(x) ≤ K2(x) + c.
- Is the Kolmogorov complexity useful? No: Theorem: K(x) is not recursive (computable).

Slide 12: Ziv-Lempel-Welch (LZW) Codes
- From the Lempel-Ziv family: LZ77, LZSS, LZ78, LZW, LZMW, LZAP.
- Literature:
  - LZW: Terry A. Welch, "A Technique for High-Performance Data Compression", IEEE Computer, vol. 17, no. 6, June 1984, pp. 8-19.
  - LZ77: J. Ziv, A. Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, pp. 337-343.
  - LZ78: J. Ziv, A. Lempel, "Compression of Individual Sequences via Variable-Rate Coding", IEEE Transactions on Information Theory, pp. 530-536.
- Known from the Unix command "compress".
- Uses tries.

Slide 13: Trie = "reTRIEval tree"
- The name is taken from "reTRIEval".
- A trie is a tree for storing/encoding text, with efficient search for equal prefixes.
- Structure: edges are labelled with letters; nodes are numbered.
- Mapping: every node encodes a word of the text; the word of a node is read off the path from the root to the node (in the slide's trie, node 1 = "m", node 6 = "at"). In the inverse direction, every word points to a unique node, or at least some prefix of it points to a leaf: "it" = node 11; "manaman" points with "m" to node 1.
- (The slide shows a trie with nodes 0 to 12; a minimal implementation follows below.)
- Encoding of "manamanatapitipitipi": 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 or 1, 5, 4, 5, 6, 7, 11, 10, 11, 10, 8.
- Decoding of 5, 11, 2: "an", "it", "a" = "anita".
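A minimal trie sketch in Python (class and method names are ours): numbered nodes, edges labelled with letters, and each node encoding the word on its root path:

    class Trie:
        def __init__(self):
            self.children = [{}]  # node 0 is the root; one edge dict per node

        def insert(self, word: str) -> int:
            """Walk down the word, creating nodes as needed; return the
            number of the node that encodes the word."""
            u = 0
            for ch in word:
                if ch not in self.children[u]:
                    self.children[u][ch] = len(self.children)
                    self.children.append({})
                u = self.children[u][ch]
            return u

        def lookup(self, word: str):
            """Return the node encoding the word, or None if absent."""
            u = 0
            for ch in word:
                if ch not in self.children[u]:
                    return None
                u = self.children[u][ch]
            return u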

Slide 14: How LZW builds a Trie
- LZW works byte-wise and starts with the 256-leaf trie whose leaves "a", "b", ... are numbered "a", "b", ...

  LZW-Trie-Builder(T)
  1.  n ← length(T)
  2.  i ← 1
  3.  TRIE ← start-TRIE
  4.  m ← number of nodes in TRIE
  5.  u ← root(TRIE)
  6.  while i ≤ n do
  7.    if no edge with label T[i] below u then
  8.      m ← m + 1
  9.      append leaf m to u with edge label T[i]
  10.     u ← root(TRIE)
  11.   else
  12.     u ← node below u with edge label T[i]
  13.   fi
  14.   i ← i + 1
  15. od

- Example: nanananananana. After scanning "na", the new node "na" has been appended to the start trie.

Slide 15: How LZW builds a Trie (continued)
- The same algorithm, continuing the example nanananananana: each mismatch appends one new node and restarts at the root, so the scan creates the nodes "na", "nan", "an", "ana", "nana" one after the other, with a residual part of the text left when the scan ends.

Slide 16: How does it continue?
- Exercise: consider the text "nananana...na" of length 2n.
  - Describe the LZW trie. How many nodes are in the final trie?
  - Compute the asymptotic compression ratio, i.e. size of the LZW encoding / length of the text.
  - Compare this result with Huffman encoding and the Shannon bound.
  - Is the LZW trie algorithm optimal for words of this kind? Prove it!

Slide 17: How LZW produces the encoding

  LZW-Encoder(T)
  1.  n ← length(T)
  2.  i ← 1
  3.  TRIE ← start-TRIE
  4.  m ← number of nodes in TRIE
  5.  u ← root(TRIE)
  6.  while i ≤ n do
  7.    if no edge with label T[i] below u then
  8.      output (m, u, T[i])
  9.      m ← m + 1
  10.     append leaf m to u with edge label T[i]
  11.     u ← root(TRIE)
  12.   else
  13.     u ← node below u with edge label T[i]
  14.   fi
  15.   i ← i + 1
  16. od
  17. if u ≠ root(TRIE) then
  18.   output (u)
  19. fi

- The output component m is predictable (256, 257, 258, ...), therefore output only (u, T[i]).
- start-TRIE = the 256-leaf trie with the bytes encoded as 0, 1, 2, ..., 255.
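A runnable sketch of this encoder in Python, following the slide's (u, T[i]) output format; a dictionary of edge maps stands in for the trie, and all names are ours:

    def lzw_encode(text: bytes) -> list:
        """Each output pair is (node of the longest known prefix, next byte);
        the new nodes are numbered 256, 257, ... implicitly."""
        children = {u: {} for u in range(256)}  # start trie: one leaf per byte value
        m = 255
        u = None                                # None = standing at the root
        out = []
        for b in text:
            # At the root the edge b always exists and leads to leaf number b.
            nxt = b if u is None else children[u].get(b)
            if u is not None and nxt is None:
                out.append((u, b))              # output (u, T[i])
                m += 1
                children[u][b] = m              # append leaf m below u
                children[m] = {}
                u = None                        # restart at the root
            else:
                u = nxt                         # follow the edge, extend the phrase
        if u is not None:
            out.append((u, None))               # residual phrase: node number only
        return out

    # Slide 18's example: lzw_encode(b"manamanatapitipitipi") yields pairs
    # corresponding to (m,a)(n,a)(256,n)(a,t)(a,p)(i,t)(i,p)(261,i)(p,i).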

Slide 18: An Example Encoding
- Encoding of "manamanatapitipitipi":
  Output: (m,a) (n,a) (256,n) (a,t) (a,p) (i,t) (i,p) (261,i) (p,i)
  New nodes: 256 = "ma", 257 = "na", 258 = "man", 259 = "at", 260 = "ap", 261 = "it", 262 = "ip", 263 = "iti", 264 = "pi"
  Phrases: ma, na, (ma)n, at, ap, it, ip, (it)i, pi
- (The slide shows the encoder from slide 17 together with the resulting trie.)

Slide 19: The Decoder

  LZW-Decoder(Code)
  1.  TRIE ← start-TRIE
  2.  m ← 255
  3.  for i ← 0 to 255 do C(i) ← "i" od
  4.  while not end of file do
  5.    (u, c) ← read-next-two-symbols(Code)
  6.    if c exists then
  7.      output C(u), c
  8.      m ← m + 1
  9.      append leaf m to u with edge label c
  10.     C(m) ← C(u) c
  11.   else
  12.     output C(u)
  13. od

- If the last part of the code did not produce a new node in the trie, only a node number was sent; output the corresponding string.
- (The slide repeats the example trie and encoding of slide 18.)
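A decoder sketch matching lzw_encode above (again, the names are ours); it rebuilds the node texts C(u) while emitting the output:

    def lzw_decode(code: list) -> bytes:
        """Inverse of lzw_encode: words[u] is the text C(u) of node u."""
        words = {u: bytes([u]) for u in range(256)}  # start trie texts
        m = 255
        out = bytearray()
        for u, c in code:
            if c is not None:
                out += words[u] + bytes([c])         # output C(u), c
                m += 1
                words[m] = words[u] + bytes([c])     # new leaf below u with label c
            else:
                out += words[u]                      # residual phrase
        return bytes(out)

    # Round trip:
    assert lzw_decode(lzw_encode(b"manamanatapitipitipi")) == b"manamanatapitipitipi"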

Slide 20: Performance of LZW
- Encoding can be performed in time O(n), where n is the length of the given text.
- Decoding can be performed in time O(n), where n is the length of the uncompressed output.
- The memory consumption is linear in the size of the compressed code.
- LZW can be nicely implemented in hardware.
- The LZW software patent (held by Unisys) expired in 2003/2004, so the scheme can now be used freely; it is very popular, see "compress" for UNIX.
- LZW output can be compressed further using Huffman codes: every second output symbol is a plain copy from the text!
- Searching in LZW is difficult: the encoding is embedded in the text (adaptive encoding), and for one search pattern there is a linear number of possibilities to encode it (exercise).

Slide 21: The Algorithm of Amir, Benson & Farach: "Let Sleeping Files Lie"
- Ideas: build the trie, but do not decode; use the KMP matcher on the nodes of the LZW trie; prepare a data structure based on the pattern; then scan the code and update this data structure.
- Goal: running time O(n + f(m)), where n is the code length and f(m) is some small polynomial in the pattern length m. For well compressed codes and f(m) < n this is faster than decoding followed by an ordinary text search.

Slide 22: Searching in LZW Codes: inside a node
- Example: searching for "tapioca". The pattern may lie entirely inside the text of a single node; then we have found it.
- For all nodes u of the trie set is_inside[u] = 1 if the text of u contains the pattern.

Slide 23: Searching in LZW Codes: torn apart
- Example: searching for "tapioca". An occurrence may start somewhere inside the text of one node, its middle parts may be hidden in some following nodes, and its end may be the start of yet another node's text.
- All parts are nodes of the LZW trie.

Slide 24: Finding the start: longest_prefix (suffix of node = prefix of pattern)
- Classify all nodes of the trie: is some suffix of the node's text a prefix of the pattern, and if yes, how long is the longest such suffix?
- For a very long text encoded by a node, only the last m letters matter.
- This can be computed using the KMP matcher algorithm while building the trie, as sketched below.
- Example, pattern "manamana": for the node text "pamana" the last four letters are the first four of the pattern, so longest_prefix = 4; for "ma" it is 2; "papa" gives 0; "mana" gives 4; "amanaplanacanalpamana" gives 4.
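This classification is exactly the KMP automaton state after reading the node's text, so each trie edge advances it by one automaton step; a sketch of that step in Python (names are ours; pi is the standard KMP failure function of the pattern):

    def kmp_step(pattern: str, pi: list, state: int, c: str) -> int:
        """Advance the KMP automaton by one letter: the new state is the
        length of the longest suffix of (text so far + c) that is a
        prefix of the pattern."""
        while state > 0 and (state == len(pattern) or pattern[state] != c):
            state = pi[state - 1]
        return state + 1 if pattern[state] == c else 0

    # longest_prefix[child] = kmp_step(P, pi, longest_prefix[parent], edge_letter)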

Slide 25: Is the node inside the pattern?
- Find the positions where the text of the node occurs inside the pattern; several occurrences are possible, e.g. for a single letter.
- There are at most m(m−1)/2 encodings of such substrings, and for every substring there is at most one node that fits.
- Define a table inside_node of size O(m²): inside_node[start, end] := the node that encodes the pattern substring P[start..end].
- From inside_node[start, end] one can derive inside_node[start, end+1] as soon as the corresponding node is created.
- To find all occurrences quickly, a pointer next_inside_occurrence(start, end) indicates the next position where the substring occurs; for start = end it is initialized with the next occurrence of the letter.
- Example, pattern "manamana": the node text "ana" can lie at positions 2-4 or 6-8 of the pattern; "anam" gives (2,5); "orthogonal" gives (0,0), it is not in the pattern.

Slide 26: Finding the end: longest_suffix (prefix of node = suffix of pattern)
- Classify all nodes of the trie: is some prefix of the node's text a suffix of the pattern, and if yes, does it complete the pattern when i letters have already been found?
- For a very long text encoded by a node, only the first m letters matter.
- Since text is appended on the right, this property can be derived from the node's ancestor.
- Example, pattern "manamana": for the node text "ananimal" both 3 and 1 are possible lengths; we store 3, because 1 can be derived from 3 using the technique from the KMP matcher (the failure function π on the reversed string). "manamanamana" gives 8; "panamacanal" gives 0.

Slide 27: How does it fit?
- On the left side we have the maximal prefix of the pattern found so far; on the right side the maximal suffix of the pattern found in the current node.
- Example, pattern "mamapamana": an 8-letter prefix and a 6-letter suffix have been found. They do not fit together directly; yet the pattern is inside, since the 8-letter prefix followed by the 6-letter suffix contains the pattern.
- Solution: define a prefix-suffix table with PS-T(p, s) = 1 if the p-letter prefix of the pattern followed by its s-letter suffix contains the pattern.
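A brute-force construction of this table in Python (O(m³), which is fine since m is only the pattern length; the names are ours):

    def prefix_suffix_table(pattern: str) -> list:
        """ps_t[p][s] = True if the p-letter prefix of the pattern followed
        by its s-letter suffix contains the pattern."""
        m = len(pattern)
        ps_t = [[False] * (m + 1) for _ in range(m + 1)]
        for p in range(m + 1):
            for s in range(m + 1):
                ps_t[p][s] = pattern in (pattern[:p] + pattern[m - s:])
        return ps_t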

Slide 28: The Matcher

  ABF-LZW-Matcher(LZW code C, uncompressed pattern P)
  1.  n ← length(C), m ← length(P)
  2.  Init_DS(P)
  3.  TRIE ← start-TRIE
  4.  v ← 255
  5.  prefix ← 0
  6.  for i ← 0 to 255 do C(i) ← "i" od
  7.  for l ← 1 to n do
  8.    (u, c) ← read-next-two-symbols(Code)
  9.    v ← v + 1
  10.   Update_DS()
  11.   Check_for_Occurrence()
  12. od

  Update_DS()
  1.  length[v] ← length[u] + 1
  2.  C[v] ← C[u] c
  3.  is_inside[v] ← is_inside[u]
  4.  if longest_prefix[u] < m and P[longest_prefix[u] + 1] = c then
  5.    longest_prefix[v] ← longest_prefix[u] + 1
  6.  fi
  7.  if length[u] < m then
  8.    for all entries (start, end) of u in inside_node do
  9.      if P[end + 1] = c then
  10.       inside_node[start, end + 1] ← v
  11.       link entry of v
  12.     fi
  13.   od
  14. fi
  15. if longest_suffix[u] < length[u] or P[length[v]] ≠ c then
  16.   longest_suffix[v] ← longest_suffix[u]
  17. else
  18.   longest_suffix[v] ← 1 + longest_suffix[u]
  19.   if longest_suffix[v] = m then is_inside[v] ← 1 fi
  20. fi

Slide 29: Check for Occurrences

  Check_for_Occurrence()
  1.  if is_inside[v] = 1 then
  2.    return "pattern found at l"
  3.    prefix ← longest_prefix[v]
  4.  else if prefix = 0 then
  5.    prefix ← longest_prefix[v]
  6.  else if prefix + length[v] < m then
  7.    while prefix ≠ 0 and inside_node[prefix + 1, prefix + length[v]] ≠ v do
  8.      prefix ← π(prefix)
  9.    od
  10.   if prefix = 0 then prefix ← longest_prefix[v]
  11.   else prefix ← prefix + length[v]
  12.   fi
  13. else
  14.   suffix ← longest_suffix[v]
  15.   if PS-T[prefix, suffix] = 1 then
  16.     return "pattern found at l"
  17.     prefix ← longest_prefix[v]
  18.   else
  19.     prefix ← longest_prefix[v]
  20.   fi

- The while loop possibly needs time m for each symbol; amortized analysis will not heal this.

Slide 30: Thanks for your attention. End of the 4th lecture.
- Next lecture: Mon 15 Nov 2004, 11:15 am, FU 116.
- Next exercise class: Mon 15 Nov 2004, 1:15 pm, F0.530, or Wed 17 Nov 2004, 1:00 pm, E2.316.

