# Search Algorithms, Winter Semester 2004/2005, 25 Oct. 2004 (Christian Schindelhauer, Heinz Nixdorf Institute, University of Paderborn, Algorithms and Complexity)


Heinz Nixdorf Institute, University of Paderborn, Algorithms and Complexity

Search Algorithms, Winter Semester 2004/2005, 3rd Lecture, 25 Oct 2004

Christian Schindelhauer, schindel@upb.de

## Chapter I: Searching Text (18 Oct 2004)

### Searching Text (Overview)

- The task of string matching
  - Easy as pie
- The naive algorithm
  - How would you do it?
- The Rabin-Karp algorithm
  - Ingenious use of primes and number theory
- The Knuth-Morris-Pratt algorithm
  - Let a (finite) automaton do the job
  - This is optimal
- The Boyer-Moore algorithm
  - Bad letters allow us to jump through the text
  - This is even better than optimal (in practice)
- Literature
  - Cormen, Leiserson, Rivest, “Introduction to Algorithms”, chapter 36, string matching, The MIT Press, 1989, 853-885.

### The Naive Algorithm

Naive-String-Matcher(T, P)

1. n ← length(T)
2. m ← length(P)
3. for s ← 0 to n-m do
4. if P[1..m] = T[s+1..s+m] then
5. return “Pattern occurs with shift s”
6. fi
7. od

Fact:

- The naive string matcher has a worst-case running time of $O((n-m+1)\,m)$
- For $n = 2m$ this is $O(n^2)$
- The naive string matcher is not optimal, since string matching can be done in time $O(m + n)$
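The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: it uses 0-based indexing, the function name `naive_string_matcher` is an assumption, and it collects every shift instead of returning at the first hit.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each of the n - m + 1 candidate shifts in turn.
    for s in range(n - m + 1):
        # The slice comparison costs up to m character comparisons,
        # which gives the O((n - m + 1) * m) worst case.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts
```

A call such as `naive_string_matcher("manamanapatipitipi", "piti")` scans 15 candidate shifts, each with its own comparison of up to 4 characters.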

### The Rabin-Karp Algorithm

- Idea: Compute
  - a checksum for the pattern P and
  - a checksum for each substring of T of length m

[Figure: the text manamanapatipitipi with one checksum for each of its 15 substrings of length 4, and the pattern pati with checksum 3. A substring with checksum 3 that equals the pattern is a valid hit; a substring with checksum 3 that differs from the pattern is a spurious hit.]
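One way to realise the checksum idea is a rolling hash modulo a prime. The sketch below is an assumption-laden illustration: the slide does not say which checksum it uses, so the polynomial hash, the parameters `base` and `prime`, and the function name `rabin_karp` are all chosen here for demonstration only.

```python
def rabin_karp(text: str, pattern: str, base: int = 256, prime: int = 101) -> list[int]:
    """Return all 0-based shifts of pattern in text using rolling checksums."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, prime)  # weight of the leading character of a window
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % prime
        t_hash = (t_hash * base + ord(text[i])) % prime
    shifts = []
    for s in range(n - m + 1):
        # Equal checksums only make s a candidate; the direct comparison
        # separates valid hits from spurious hits.
        if p_hash == t_hash and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Slide the window: drop text[s], append text[s + m].
            t_hash = ((t_hash - ord(text[s]) * high) * base + ord(text[s + m])) % prime
    return shifts
```

A small `prime` produces many spurious hits and a larger one fewer; the verification step keeps the result correct either way.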

### Finite-Automaton-Matcher

- The example automaton accepts at the end of each occurrence of the pattern abba
- For every pattern of length m there exists an automaton with m+1 states that solves the pattern matching problem with the following algorithm:

Finite-Automaton-Matcher(T, δ, P)

1. n ← length(T)
2. q ← 0
3. for i ← 1 to n do
4. q ← δ(q, T[i])
5. if q = m then
6. s ← i - m
7. return “Pattern occurs with shift” s
8. fi
9. od

### The Finite-Automaton-Matcher

- $Q$ is a finite set of states
- $q_0 \in Q$ is the start state
- a subset of $Q$ is the set of accepting states
- $\Sigma$ is the input alphabet
- $\delta : Q \times \Sigma \to Q$ is the transition function

Transition function of the example automaton:

| state | input a | input b |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 2 |
| 2 | 1 | 3 |
| 3 | 4 | 0 |
| 4 | 1 | 2 |

[Figure: state diagram of the automaton with states 0 to 4, and the sequence of states it passes through on an example text over {a, b}.]
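The matcher can be run directly on the transition table of the abba automaton. The following sketch copies the table values from the slide; the names `DELTA` and `finite_automaton_matcher` are assumptions, and the function reports all shifts rather than stopping at the first one.

```python
# Transition table of the example automaton for the pattern "abba":
# DELTA[state][character] is the next state.
DELTA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 4, "b": 0},
    4: {"a": 1, "b": 2},
}

def finite_automaton_matcher(text: str, delta: dict, m: int) -> list[int]:
    """Return all 0-based shifts at which the automaton reaches state m."""
    q = 0  # start state
    shifts = []
    for i, ch in enumerate(text, start=1):  # i is the 1-based text position
        q = delta[q][ch]
        if q == m:  # accepting state: an occurrence ends at position i
            shifts.append(i - m)
    return shifts
```

Each text character costs exactly one table lookup, so the search itself runs in time O(n) once the table is available.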

### Knuth-Morris-Pratt Pattern Matching

KMP-Matcher(T, P)

1. n ← length(T)
2. m ← length(P)
3. π ← Compute-Prefix-Function(P)
4. q ← 0
5. for i ← 1 to n do
6. while q > 0 and P[q+1] ≠ T[i] do
7. q ← π[q] od
8. if P[q+1] = T[i] then
9. q ← q+1 fi
10. if q = m then
11. print “Pattern occurs with shift” i-m
12. q ← π[q] fi od

[Figure: the pattern mama is slid along the text manamamapa; after each mismatch the pattern is shifted as prescribed by π.]
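A possible Python rendering of KMP-Matcher is sketched below. It keeps the 1-based meaning of `q` and `pi` from the pseudocode while indexing the Python strings from 0; the function names are assumptions, and a compact version of Compute-Prefix-Function is included only to make the block self-contained.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[q] for q = 1..m; index 0 is unused."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi

def kmp_matcher(text: str, pattern: str) -> list[int]:
    """Return all 0-based shifts of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    q = 0  # number of pattern characters matched so far
    shifts = []
    for i in range(1, n + 1):
        # pattern[q] is P[q+1] in the 1-based notation of the pseudocode.
        while q > 0 and pattern[q] != text[i - 1]:
            q = pi[q]  # fall back to the next shorter matching prefix
        if pattern[q] == text[i - 1]:
            q += 1
        if q == m:
            shifts.append(i - m)
            q = pi[q]  # continue searching for further occurrences
    return shifts
```

The text pointer `i` never moves backwards, which is what yields the O(m + n) bound.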

### Boyer-Moore: The ideas!

[Figure: the pattern piti is slid along the text manamanapatipitipi.]

- Start comparing at the end
- What’s this? There is no “a” in the search pattern. We can shift by m letters, so the pattern then starts at position m+1
- An “a” again...
- First wrong letter! Do a large shift!
- Bingo! Do another large shift!
- That’s it! 10 letters compared and ready!

### Boyer-Moore-Matcher

Boyer-Moore-Matcher(T, P, Σ)

1. n ← length(T)
2. m ← length(P)
3. λ ← Compute-Last-Occurrence-Function(P, m, Σ)
4. γ ← Compute-Good-Suffix-Function(P, m)
5. s ← 0
6. while s ≤ n-m do
7. j ← m
8. while j > 0 and P[j] = T[s+j] do
9. j ← j-1 od
10. if j = 0 then
11. print “Pattern occurs with shift” s
12. s ← s + γ[0] else
13. s ← s + max(γ[j], j - λ[T[s+j]]) fi od

- Line 3: Bad character shift
- Line 4: Valid shifts
- Line 8: We start comparing at the right end
- Line 12: Success! Now do a valid shift
- Line 13: Shift as far as possible, as indicated by the bad character heuristic or the good suffix heuristic

### Boyer-Moore: Last-occurrence

[Figure: the pattern piti is slid along the text manamantpatipitipi; mismatches occur at j = 4 and, later, at j = 2.]

- What’s this? There is no “a” in the search pattern. We can shift by j - λ[a] = 4 - 0 = 4 letters
- “t” occurs in “piti” at the 3rd position: shift by j - λ[t] = 4 - 3 = 1 step
- “p” occurs in “piti” at the first position: shift by j - λ[p] = 4 - 1 = 3 letters
- There is no “a” in the search pattern. We can shift by at least j - λ[a] = 2 - 0 = 2 letters

### Compute-Last-Occurrence-Function

Compute-Last-Occurrence-Function(P, m, Σ)

1. for each character a ∈ Σ do
2. λ[a] ← 0 od
3. for j ← 1 to m do
4. λ[P[j]] ← j od
5. return λ

Running time: $O(|\Sigma| + m)$

[Figure: table of λ for the pattern piti over the characters a, i, p, t.]
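As a quick check of the values used in the last-occurrence example, the function can be sketched in Python with a dictionary standing in for the array λ. The name `compute_last_occurrence` and the dictionary representation are assumptions of this sketch.

```python
def compute_last_occurrence(pattern: str, alphabet: str) -> dict[str, int]:
    """Map each character to the 1-based position of its last occurrence in
    pattern, or to 0 if it does not occur."""
    lam = {a: 0 for a in alphabet}           # O(|alphabet|) initialisation
    for j, ch in enumerate(pattern, start=1):
        lam[ch] = j                          # later positions overwrite earlier ones
    return lam

lam = compute_last_occurrence("piti", "aipt")
```

For the pattern piti this gives 0 for a, 4 for i, 1 for p and 3 for t, which are the values that appear in the shifts j - λ[·] of the example.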

### The Prefix Function

$\pi[q] := \max\{k : k < q \text{ and } P_k \text{ is a suffix of } P_q\}$

[Figure: the pattern abaabaaaa and its prefixes $P_5$, $P_6$, $P_7$ and $P_8$ aligned below a text; the example shows $\pi[7] = 4$.]

$\pi[q] := \max\{k : k < q \text{ and } P_k \text{ is a suffix of } P_q\}$

Pattern: abaabaaaa

| q | $P_q$ | $\pi[q]$ |
|---|---|---|
| 1 | a | 0 |
| 2 | ab | 0 |
| 3 | aba | 1 |
| 4 | abaa | 1 |
| 5 | abaab | 2 |
| 6 | abaaba | 3 |
| 7 | abaabaa | 4 |
| 8 | abaabaaa | 1 |
| 9 | abaabaaaa | 1 |

### Computing π

Compute-Prefix-Function(P)

1. m ← length(P)
2. π[1] ← 0
3. k ← 0
4. for q ← 2 to m do
5. while k > 0 and P[k+1] ≠ P[q] do
6. k ← π[k] od
7. if P[k+1] = P[q] then
8. k ← k+1 fi
9. π[q] ← k od

- Lines 5-6: If $P_{k+1}$ is not a suffix of $P_q$, shift the pattern to the next reasonable position (given by smaller values of π)
- Lines 7-8: If the letter fits, then increment the position (otherwise k = 0)
- Line 9: We have found the position such that $\pi[q] := \max\{k : k < q \text{ and } P_k \text{ is a suffix of } P_q\}$
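To see that the linear-time procedure really computes the quantity in the definition, one can compare it with a direct, much slower evaluation of the maximum. Both functions below are illustrative sketches with assumed names; index 0 of the returned lists is unused so that `pi[q]` matches the 1-based notation.

```python
def prefix_function_by_definition(pattern: str) -> list[int]:
    """Evaluate pi[q] = max{k : k < q and P_k is a suffix of P_q} literally."""
    m = len(pattern)
    pi = [0] * (m + 1)
    for q in range(1, m + 1):
        # k = 0 always qualifies, because the empty prefix is a suffix of everything.
        pi[q] = max(k for k in range(q) if pattern[:q].endswith(pattern[:k]))
    return pi

def compute_prefix_function(pattern: str) -> list[int]:
    """The linear-time procedure from the pseudocode above."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]          # try the next shorter border
        if pattern[k] == pattern[q - 1]:
            k += 1             # the letter fits: extend the border
        pi[q] = k
    return pi

same = prefix_function_by_definition("abaabaaaa") == compute_prefix_function("abaabaaaa")
```

The direct version needs cubic time in the worst case, while the amortised argument (k grows by at most one per iteration) gives O(m) for the procedure from the slide.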

### Boyer-Moore: Good Suffix - the far jump

[Figure: a pattern of length 8 aligned with a text; the first mismatch occurs at j = 6, and shifted copies of the pattern are tested one after another.]

The shifted copies correspond to the following questions:

- Is $Rev(P)_5$ a suffix of $Rev(P)_6$?
- Is $Rev(P)_5$ a suffix of $Rev(P)_7$?
- Is $Rev(P)_5$ a suffix of $Rev(P)_8$ (or $P_5$ a suffix of $P_8$)?
- Is $P_4$ a suffix of $P_8$?
- Is $P_3$ a suffix of $P_8$?
- Is $P_2$ a suffix of $P_8$?
- Is $P_1$ a suffix of $P_8$?
- Is $P_0$ a suffix of $P_8$?

$\pi[q] := \max\{k : k < q \text{ and } P_k \text{ is a suffix of } P_q\}$

Here $\pi[8] = 4$, so the shift is $m - \pi[m] = 8 - 4 = 4$.

### Boyer-Moore: Good Suffix - the small jump

[Figure: a pattern of length 8 aligned with a text; the first mismatch occurs at j = 6, and shifted copies of the pattern are tested one after another.]

The shifted copies correspond to the following questions:

- Is $Rev(P)_5$ a suffix of $Rev(P)_6$?
- Is $Rev(P)_5$ a suffix of $Rev(P)_7$?
- Is $Rev(P)_5$ a suffix of $Rev(P)_8$ (or $P_5$ a suffix of $P_8$)?
- Is $P_4$ a suffix of $P_8$?
- Is $P_3$ a suffix of $P_8$?
- Is $P_2$ a suffix of $P_8$?
- Is $P_1$ a suffix of $P_8$?
- Is $P_0$ a suffix of $P_8$?

$f[j] := \min\{k : k > j \text{ and } Rev(P)_j \text{ is a suffix of } Rev(P)_k\}$

$\pi'[q] := \max\{k : k < q \text{ and } Rev(P)_k \text{ is a suffix of } Rev(P)_q\}$

Here $j = 6$ and $f[6] = 8$, so the shift is $f[j] - j = 8 - 6 = 2$.

### Boyer-Moore: Good Suffix - the small jump

[Figure: the pattern with the mismatch position j = 6.]

$f[j] := \min\{k : k > j \text{ and } Rev(P)_j \text{ is a suffix of } Rev(P)_k\}$

$\pi'[q] := \max\{k : k < q \text{ and } Rev(P)_k \text{ is a suffix of } Rev(P)_q\}$

With $j = 6$ and $f[6] = 8$ the shift is $f[j] - j = 8 - 6 = 2$.

### Why is it the same?

$\pi'[k] := \max\{j : j < k \text{ and } Rev(P)_j \text{ is a suffix of } Rev(P)_k\}$

$f[j] := \min\{k : k > j \text{ and } Rev(P)_j \text{ is a suffix of } Rev(P)_k\}$

[Figure: matrix indexed by k and j that marks the pairs for which $Rev(P)_j$ is a suffix of $Rev(P)_k$.]

### Compute-Good-Suffix-Function

Compute-Good-Suffix-Function(P, m)

1. π ← Compute-Prefix-Function(P)
2. P’ ← reverse(P)
3. π’ ← Compute-Prefix-Function(P’)
4. for j ← 0 to m do
5. γ[j] ← m - π[m] od
6. for l ← 1 to m do
7. j ← m - π’[l]
8. if γ[j] > l - π’[l] then
9. γ[j] ← l - π’[l] fi od
10. return γ

- Lines 4-5: The far jump
- Lines 6-9: or is it a small jump

Running time: $O(m)$

### Boyer-Moore-Matcher

Boyer-Moore-Matcher(T, P, Σ)

1. n ← length(T)
2. m ← length(P)
3. λ ← Compute-Last-Occurrence-Function(P, m, Σ)
4. γ ← Compute-Good-Suffix-Function(P, m)
5. s ← 0
6. while s ≤ n-m do
7. j ← m
8. while j > 0 and P[j] = T[s+j] do
9. j ← j-1 od
10. if j = 0 then
11. print “Pattern occurs with shift” s
12. s ← s + γ[0] else
13. s ← s + max(γ[j], j - λ[T[s+j]]) fi od

- Running time: $O((n-m+1)\,m)$ in the worst case
- In practice: $O(n/m + v\,m + m + |\Sigma|)$ for v hits in the text
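The complete matcher, including both heuristics, can be sketched in Python as follows. This is a transcription of the pseudocode under a few assumptions: function names are invented here, λ is stored in a dictionary with default 0 instead of an array over Σ, and the arrays `pi` and `gamma` keep the 1-based meaning of their indices.

```python
def prefix_function(p: str) -> list[int]:
    """pi[q] for q = 1..m (index 0 unused), as in Compute-Prefix-Function."""
    m = len(p)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and p[k] != p[q - 1]:
            k = pi[k]
        if p[k] == p[q - 1]:
            k += 1
        pi[q] = k
    return pi

def good_suffix(p: str) -> list[int]:
    """gamma[j] for j = 0..m, as in Compute-Good-Suffix-Function."""
    m = len(p)
    pi = prefix_function(p)
    pi_rev = prefix_function(p[::-1])
    gamma = [m - pi[m]] * (m + 1)        # the far jump
    for l in range(1, m + 1):            # possibly a smaller jump
        j = m - pi_rev[l]
        gamma[j] = min(gamma[j], l - pi_rev[l])
    return gamma

def boyer_moore_matcher(text: str, pattern: str) -> list[int]:
    """Return all 0-based shifts of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    # Last occurrence (1-based) of every pattern character; others count as 0.
    lam = {ch: j for j, ch in enumerate(pattern, start=1)}
    gamma = good_suffix(pattern)
    shifts, s = [], 0
    while s <= n - m:
        j = m
        # Compare from the right end of the pattern.
        while j > 0 and pattern[j - 1] == text[s + j - 1]:
            j -= 1
        if j == 0:
            shifts.append(s)
            s += gamma[0]
        else:
            bad_character = j - lam.get(text[s + j - 1], 0)
            s += max(gamma[j], bad_character)
    return shifts
```

Because every entry of `gamma` is at least 1, the shift always advances even when the bad character heuristic suggests a non-positive value.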

## Chapter II: Searching in Compressed Text (25 Oct 2004)

### Searching in Compressed Text (Overview)

- What is Text Compression
  - Definition
  - The Shannon Bound
  - Huffman Codes
  - The Kolmogorov Measure
- Searching in Non-adaptive Codes
  - KMP in Huffman Codes
- Searching in Adaptive Codes
  - The Lempel-Ziv Codes
  - Pattern Matching in Z-Compressed Files
  - Adapting Compression for Searching

### What is Text Compression?

- First approach:
  - Given a text $s \in \Sigma^n$
  - Find a compressed version $c \in \Sigma^m$ such that $m < n$
  - Such that s can be derived from c
- Formal:
  - A compression function $f : \Sigma^* \to \Sigma^*$ is one-to-one (injective) and efficiently invertible
- Fact:
  - Most texts are incompressible
- Proof:
  - There are $(|\Sigma|^{m+1} - 1)/(|\Sigma| - 1)$ strings of length at most m
  - There are $|\Sigma|^n$ strings of length n
  - Of these strings, at most $(|\Sigma|^{m+1} - 1)/(|\Sigma| - 1)$ can be compressed to length at most m
  - This is a fraction of at most $|\Sigma|^{m-n+1}/(|\Sigma| - 1)$
  - E.g. for $|\Sigma| = 256$ and $m = n - 10$ this is $8.3 \times 10^{-25}$, which implies that only a fraction of $8.3 \times 10^{-25}$ of all files of n bytes can be compressed to a string of length at most $n - 10$
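The numerical example in the proof can be reproduced with exact rational arithmetic. The sketch below is only an illustration; the function name `compressible_fraction_bound` and the choice n = 20 are assumptions, and the bound depends on n and m essentially only through their difference.

```python
from fractions import Fraction

def compressible_fraction_bound(sigma: int, n: int, m: int) -> Fraction:
    """Upper bound on the fraction of length-n strings that an injective
    compression function can map to strings of length at most m."""
    # Number of strings of length at most m: a geometric sum, computed exactly.
    short_strings = (sigma ** (m + 1) - 1) // (sigma - 1)
    return Fraction(short_strings, sigma ** n)

# Byte alphabet, files of n = 20 bytes, compressed length at most n - 10.
bound = compressible_fraction_bound(256, 20, 10)
approx = float(bound)
```

Here `approx` is about 8.3e-25, matching the figure quoted for $|\Sigma| = 256$ and $m = n - 10$.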

### Why does Text Compression work?

- Usually texts use letters with different frequencies
  - Relative frequencies of letters in general English plain text, from Cryptographical Mathematics, by Robert Edward Lewand: e: 12%, t: 10%, a: 8%, i: 7%, n: 7%, o: 7%, ..., k: 0.4%, x: 0.2%, j: 0.2%, q: 0.09%, z: 0.06%
  - Special characters like \$, %, # occur even less frequently
  - Some character encodings are (nearly) unused, e.g. byte code 0 of ASCII
- Text is subject to a lot of rules
  - Words are (usually) the same (collected in dictionaries)
  - Not all words can be used in combination
  - Sentences are structured (grammar)
  - Program codes use code words
  - Digitally encoded pictures have smooth areas, where colors change gradually
  - Patterns repeat

### Information Theory: The Shannon bound

- C. E. Shannon derives his definition of entropy in his 1948 paper "A Mathematical Theory of Communication"
- The entropy rate of a data source is the average number of bits per symbol needed to encode it
- Example text: ababababab
  - Entropy: 1 bit per symbol
  - Encoding: use 0 for a, use 1 for b
  - Code: 0101010101
- Huffman codes are a way to reach such a Shannon bound (for sufficiently large text)
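The entropy value of the example can be checked with a few lines of Python. The sketch assumes the simplest model, in which symbols are drawn independently with their relative frequencies in the text (order-0 entropy); the function name `entropy_bits_per_symbol` is an assumption.

```python
from collections import Counter
from math import log2

def entropy_bits_per_symbol(text: str) -> float:
    """Order-0 entropy: -sum(p * log2(p)) over the relative letter frequencies p."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * log2(c / n) for c in counts.values())

h_ab = entropy_bits_per_symbol("ababababab")
h_text = entropy_bits_per_symbol("manamanapatipitipi")
```

For ababababab both letters have frequency 1/2, so the entropy is 1 bit per symbol. Under this model a text with unequal letter frequencies, such as manamanapatipitipi, needs a non-integer number of bits per symbol on average (roughly 2.48).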

### Huffman Code

- Huffman code
  - is adapted for each text (but not within the text)
  - consists of a dictionary, which maps each letter of a text to a binary string, and the code given as a prefix-free binary encoding
- Prefix-free code
  - uses strings $s_1, s_2, \ldots, s_m$ of variable length such that no string $s_i$ is a prefix of $s_j$ for $i \neq j$

| Letter | Frequency | Code |
|---|---|---|
| a | 5 | 10 |
| i | 4 | 01 |
| p | 3 | 111 |
| m | 2 | 000 |
| t | 2 | 001 |
| n | 2 | 110 |

- Example of Huffman encoding:

[Figure: an example text and its encoding, in which every letter is replaced by its code word from the table.]

### Computing Huffman Codes

- Compute the letter frequencies
- Build root nodes labeled with frequencies
- repeat
  - Build a node connecting the two least frequent unlinked nodes
  - Mark the sons with 0 and 1
  - The father node carries the sum of the frequencies
- until one tree is left
- The path to each letter carries the code

| Letter | Frequency | Code |
|---|---|---|
| a | 5 | 10 |
| i | 4 | 01 |
| p | 3 | 111 |
| m | 2 | 000 |
| t | 2 | 001 |
| n | 2 | 110 |

[Figure: Huffman tree for these frequencies. The leaves m (2) and t (2) are joined to a node of weight 4, n (2) and p (3) to a node of weight 5, the node of weight 4 and i (4) to a node of weight 8, a (5) and the node of weight 5 to a node of weight 10, and the root has weight 18. The edges are marked with 0 and 1.]
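The construction can be sketched with a priority queue. This is an illustrative implementation with assumed names (`huffman_code`, `SLIDE_CODE`); ties between equal frequencies are broken arbitrarily here, so the individual code words may differ from the table above, while the total encoded length is the same for every Huffman code of a given text.

```python
import heapq
from collections import Counter

def huffman_code(text: str) -> dict[str, str]:
    """Return a prefix-free code {letter: bit string} built by Huffman's method."""
    freq = Counter(text)
    if len(freq) == 1:  # degenerate case: a single distinct letter
        return {ch: "0" for ch in freq}
    # Heap entries: (frequency, tie breaker, {letter: code word so far}).
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Join the two least frequent unlinked nodes under a new father node.
        f0, _, left = heapq.heappop(heap)
        f1, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (f0 + f1, next_id, merged))
        next_id += 1
    return heap[0][2]

# Code table from the slide, for comparison of the encoded lengths.
SLIDE_CODE = {"a": "10", "i": "01", "p": "111", "m": "000", "t": "001", "n": "110"}

text = "manamanapatipitipi"
code = huffman_code(text)
encoded = "".join(code[ch] for ch in text)
slide_encoded = "".join(SLIDE_CODE[ch] for ch in text)
```

Both encodings of this 18-letter text take 45 bits, i.e. 2.5 bits per letter.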

### Searching in Huffman Codes

- Let u be the size of the compressed text
- Let v be the size of the pattern, Huffman-encoded according to the text dictionary
- KMP can search in Huffman codes in time $O(u + v + m)$, where m is the length of the uncompressed pattern
  - Encoding the pattern takes $O(v + m)$ steps
  - Building the prefix function takes time $O(v)$
  - Searching the text on a bit level takes time $O(u + v)$
- Problems:
  - This algorithm is bit-wise, not byte-wise
  - Exercise: Develop a byte-wise strategy

### The Downside of Huffman Codes

- Example: Consider the 128-byte text that consists of 32 repetitions of abba:
  - abbaabbaabba...abba
  - It will be encoded using 16 bytes (and an extra byte for the dictionary) as 0110 repeated 32 times:
  - 011001100110...0110
  - This does not use the full compression possibilities for this text
  - E.g. using (abba)^32 would need only 9 bytes
- The perfect code:
  - A self-extracting program for a string x is a program that, started without input, produces the output x and then halts
  - So, the smallest self-extracting program is the ultimate encoding
- The Kolmogorov complexity K(x) of a string x denotes the length of the shortest such self-extracting program for x

### Kolmogorov Complexity

- Does the Kolmogorov complexity depend on the programming language?
  - No, as long as the programming language is universal, i.e. can simulate any Turing machine

Lemma: Let $K_1(x)$ and $K_2(x)$ denote the Kolmogorov complexity with respect to two arbitrary universal programming languages. Then there is a constant c such that for all strings x: $K_1(x) \le K_2(x) + c$

- Is the Kolmogorov complexity useful?
  - No:

Theorem: K(x) is not recursive.

### Proof of Lemma

Lemma: Let $K_1(x)$ and $K_2(x)$ denote the Kolmogorov complexity with respect to two arbitrary universal programming languages. Then there is a constant c such that for all strings x: $K_1(x) \le K_2(x) + c$

Proof:

- Let $M_1$ be the shortest self-extracting program for x with respect to the first language
- Let U be a universal program in the second language that simulates a given machine $M_1$ of the first language
- The output of $U(M_1, \varepsilon)$, i.e. of U simulating $M_1$ on the empty input, is x
- Then one can find a machine $M_2$ of length $|U| + |M_1| + O(1)$ that has the same functionality as $U(M_1, \varepsilon)$
  - by using the S-m-n theorem
- Since U is a fixed (constant-sized) machine, this bounds the complexity in the second language by the complexity in the first plus a constant. As the two languages are arbitrary, exchanging their roles proves the statement.

### Proof of the Theorem

Theorem: K(x) is not recursive.

Proof:

- Assume K(x) is recursive
- For string length n let $x_n$ denote the smallest string of length n such that $K(x_n) \ge |x_n| = n$
- We can enumerate $x_n$
  - Compute the Kolmogorov complexity K(x) for all strings x of size n and output the first string x with $K(x) \ge n$
- Let M be the program computing $x_n$ on input n
- We can efficiently encode $x_n$:
  - Combine M with the binary encoded n: $K(x_n) \le \log n + |M| = \log n + O(1)$
- For large enough n this is a contradiction to $K(x_n) \ge n$

### Thanks for your attention

End of 3rd lecture

- Next lecture: Mo 8 Nov 2004, 11.15 am, FU 116
- Next exercise class: Mo 25 Oct 2004, 1.15 pm, F0.530 or We 27 Oct 2004, 1.00 pm, E2.316
