Presentation is loading. Please wait.

Presentation is loading. Please wait.

String Searching CSCI 2720 Spring 2007 Eileen Kraemer.

Similar presentations


Presentation on theme: "String Searching CSCI 2720 Spring 2007 Eileen Kraemer."— Presentation transcript:

1 String Searching CSCI 2720 Spring 2007 Eileen Kraemer

2 A common word processor facility is to search for a given word in a document. Generally, the problem is to search for occurrences of a short string in a long string. String Search Do the first then do the other onethe

3 History of String Search The brute force algorithm:  invented in the dawn of computer history  re-invented many times, still common Knuth & Pratt invented a better one in 1970  invented independently by Morris  published 1976 as “Knuth-Morris-Pratt” Boyer & Moore found a better one before 1976  found independently by Gosper Karp & Rabin found a “better” one in 1980

4 The obvious algorithm is to try the word at each possible place, and compare all the characters: for i := 0 to n-m do (doc length n) for j := 0 to m-1 do (word length m) compare word[j] with doc[i+j] if not equal, exit the inner loop The complexity is at worst O(m*n) and best O(n).

5 Improving String Search Surprisingly, there is a faster algorithm where you compare the last characters first: Do the first then do the other one the compare ‘e’ with ‘ ‘, fail so move along 3 places Do the first then do the other one the can only move along 2 places

6 Improved string search, continued In every case where the document character is not one of the characters in the word, we can move along m places. Sometimes, it is less.

7 Problem Definition, terminology Let p be the pattern string Let t be the target string Let k be the index of the character in the target string that “lies over” the first character of the pattern Given two strings, p and t, over the alphabet , determine whether p occurs as the substring of t That is, determine whether there exists k such that p=Substring(t,k,|p|).

8 Straightforward string searching function SimpleStringSearch(string p,t): integer {Find p in t; return its location or -1 if p is not a substring of t} for k from 0 to Length(t) – Length(p) do i <- 0 while i < Length(p) and p[i] = t[k+i] do i <- i+1 if i == Length(p) then return k return -1

9 SimpleStringSearch t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t10] ABCEFGABCDE ABCD p[0] p[1] p[2] p[3] Y Y Y N

10 SimpleStringSearch t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t10] ABCEFGABCDE ABCD p[0] p[1] p[2] p[3] N

11 SimpleStringSearch t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t10] ABCEFGABCDE ABCD p[0] p[1] p[2] p[3] N

12 SimpleStringSearch t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t10] ABCEFGABCDE ABCD p[0] p[1] p[2] p[3] N

13 SimpleStringSearch t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t10] ABCEFGABCDE ABCD p[0] p[1] p[2] p[3] N

14 SimpleStringSearch t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t10] ABCEFGABCDE ABCD p[0] p[1] p[2] p[3] N

15 SimpleStringSearch t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t10] ABCEFGABCDE ABCD p[0] p[1] p[2] p[3] N

16 SimpleStringSearch t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t10] ABCEFGABCDE ABCD p[0] p[1] p[2] p[3] Y Y Y Y

17 Straightforward string searching Worst case:  Pattern string always matches completely except for last character  Example: search for XXXXXXY in target string of XXXXXXXXXXXXXXXXXXXX  Outer loop executed once for every character in target string  Inner loop executed once for every character in pattern   (|p| * |t|) Okay if patterns are short, but better algorithms exist

18 Knuth-Morris-Pratt  (|p| * |t|) Key idea:  if pattern fails to match, slide pattern to right by as many boxes as possible without permitting a match to go unnoticed

19 Knuth-Morris-Pratt t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t10] XYXYXYc p[0] p[1] p[2] p[3] p[4] Y Y Y Y XYXYZ N XYXYZ Y Y Y Y?

20 Knuth-Morris Pratt Correct motion of pattern depends on both location of mismatch and the mismatching character If c == X : move 2 boxes to right If c == E : move 5 boxes to right If c == Z : target found; alg terminates

21 Knuth-Morris-Pratt Goal: determine d, number of boxes to right pattern should move; smallest d such that: p[0] = t[k+d] p[1] = t[k+d+1] p[2] = t[k+d+2] … p[i-d] = t[k+i]

22 Knuth-Morris-Pratt Note: can be stated largely in terms of pattern alone. Value of d depends only on:  The pattern  The value of i  The mismatching character c (at t[k+i])

23 Knuth-Morris-Pratt Can define a function KMPskip(p,i,c) to give correct d  Return smallest integer d such that 0 <= d <=I, such that p[i-d] == c and p[j] == p[j+d] for each 0 <=j <= i-di1  Return i+1 if no such d exists Calculate all values of KMPskip for pattern p and store it in KMPskiparray do lookup at each mismatch

24 Knuth-Morris-Pratt For pattern ABCD: 0123 1034 1204 1230 1234 A B C D A B C D other

25 Knuth-Morris-Pratt For pattern XYXYZ: 01032 10305 12340 12345 X Y X Y Z X Y Z other

26 Knuth-Morris-Pratt Function KMPSearch(string p, t): integer {Find p in t; return its location or -1 if p is not a substring of t} KMPskiparray <- ComputeKMPskiparray(p) k <- 0 i <- 0 While k < Length(t) – Length(p) do if i == Length(p) then return k d <- KMPskiparray[I,t[k+i]] k <- k + d i <- I + 1 –d Return -1

27 The Boyer-Moore Algorithm Similar to KMP in that:  Pattern compared against target  On mismatch, move as far to right as possible Different from KMP in that:  Compare the patterns from right to left instead of left to right Does that make a difference?  Yes!! – much faster on long targets; many characters in target string are never examined at all

28 Boyer-Moore example t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t10] ABCEFGABCDE ABCD p[0] p[1] p[2] p[3] N There is no E in the pattern : thus the pattern can’t match if any characters lie under t[3]. So, move four boxes to the right.

29 Boyer-Moore example t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t10] ABCEFGABCDE ABCD p[0] p[1] p[2] p[3] N Again, no match. But there is a B in the pattern. So move two boxes to the right.

30 Boyer-Moore example t[0] t[1] t[2] t[3] t[4] t[5] t[6] t[7] t[8] t[9] t10] ABCEFGABCDE ABCD p[0] p[1] p[2] p[3] Y YYY

31 Boyer-Moore : another example t[k] t[k+1] … t[k+i] t[k+m-1] … cE…RG LE…SDE…RG p[0] p[1] … p[i-1] p[i] p[i+1] … p[m-1] Y YYYN Problem: determine d, the number of boxes that the pattern can be moved to the right. d should be smallest integer such that t[k+m-1]= p[m-1-d], t[k+m-2] = p[m-2-d], … t[k+i] = p[i-d]

32 The Boyer-Moore Algorithm We said:  d should be smallest integer such that: T[k+m-1] = p[m-1-d] T[k+m-2] = p[m-2-d] T[k+i] = p[i-d]  Reminder: k = starting index in target string m = length of pattern i = index of mismatch in pattern string  Problem: statement is valid only for d<= i Need to ensure that we don’t “fall off” the left edge of the pattern

33 Boyer-Moore : another example t[k] t[k+5] t[k+8] cXYZ YZWXYZXYZ p[0] p[1] p[2] p[3] p[4] p[5] p[6] p[7] p[8] Y YYN If c == W, then d should be 3 If c == R, then d should be 7

34 BMPSkip Let m = |p| For any character c and any i such that 0<= i < m, define BMPSkip(p,i,c) to be:  The amount the pattern can move to the right when characters i+1 through m-1 of the pattern match corresponding characters in the target but p[i] doesn’t match character c. Then BMPSkip(p,I,c) should return the smallest d such that:  p[j]= p[j-d] for all j such that max(i+1,d) <= j<= m-1, and  p[i-d] = c if d<= i

35 Boyer-Moore For pattern ABCD: -443 4-42 44-1 444- 4444 A B C D A B C D other <- if the position in the pattern is this character And the mis- matching character in the target is this -  Then skip this many spaces …

36 Boyer-Moore For pattern XYXYZ: -5-52 5-5-1 5555- 55555 X Y X Y Z X Y Z other  - If the position in the pattern is this And the mis- matching character in the target is this --  Then skip this many spaces

37 Note: entries in the Boyer-Moore arrays are generally larger than with KMP; thus, the pattern will move faster Table not consulted on a match (thus, the blank entries)

38 BMSearch Function BMSearch(string p,t): int {Find p in t; return its location or -1 if p is not a substring of t} BMSkiparray <- ComputeBMSkipArray(p) k <- 0 while k <= Length(t) – Length(p) do i <- Length(p) – 1 while i >= 0 and p[i] = t[k+i] do i <- i– 1 if i = -1 then return k k <- k + BMSkiparray[i,t[k+i]] return -1

39 The Karp-Rabin Algorithm Idea Karp & Rabin found an algorithm which is:  almost as fast as Boyer-Moore  simple enough to understand easily  can be adapted for 2-dimensional searches for patterns in pictures Go back to the brute force idea, but now use a single number to represent the word you are searching for, and a single number for the current portion of the document you are comparing against.

40 The Karp-Rabin Algorithm Suppose we are searching for 4-letter words. Then the whole (English) word fits in one (computer) word w of 4 bytes. If the current 4 bytes of the document are also in one word d, a single comparison can match the two in one step. To move along the document, shift d and add in the next character. For longer words, use hashing. The characters of the word and the document are combined into single hash numbers wh and dh. The hash number dh can be updated by doing a suitable sum and adding in the code for the next character.


Download ppt "String Searching CSCI 2720 Spring 2007 Eileen Kraemer."

Similar presentations


Ads by Google