Presentation is loading. Please wait.

Presentation is loading. Please wait.

240-301 Comp. Eng. Lab III (Software), Pattern Matching1 Pattern Matching Dr. Andrew Davison WiG Lab (teachers room), CoE 240-301,

Similar presentations


Presentation on theme: "240-301 Comp. Eng. Lab III (Software), Pattern Matching1 Pattern Matching Dr. Andrew Davison WiG Lab (teachers room), CoE 240-301,"— Presentation transcript:

1 240-301 Comp. Eng. Lab III (Software), Pattern Matching1 Pattern Matching Dr. Andrew Davison WiG Lab (teachers room), CoE ad@fivedots.coe.psu.ac.th 240-301, Computer Engineering Lab III (Software) T: P: Semester 1, 2006-2007

2 240-301 Comp. Eng. Lab III (Software), Pattern Matching2 Overview 1. What is Pattern Matching? 2. The Brute Force Algorithm 3. The Boyer-Moore Algorithm 4. The Knuth-Morris-Pratt Algorithm 5. More Information

3 240-301 Comp. Eng. Lab III (Software), Pattern Matching3 1. What is Pattern Matching? v Definition: –given a text string T and a pattern string P, find the pattern inside the text u T: “the rain in spain stays mainly on the plain” u P: “n th” v Applications: –text editors, Web search engines (e.g. Google), image analysis

4 240-301 Comp. Eng. Lab III (Software), Pattern Matching4 String Concepts v Assume S is a string of size m. v A substring S[i.. j] of S is the string fragment between indexes i and j. v A prefix of S is a substring S[0.. i] v A suffix of S is a substring S[i.. m-1] –i is any index between 0 and m-1

5 240-301 Comp. Eng. Lab III (Software), Pattern Matching5 Examples v Substring S[1..3] == "ndr" v All possible prefixes of S: –"andrew", "andre", "andr", "and", "an”, "a" v All possible suffixes of S: –"andrew", "ndrew", "drew", "rew", "ew", "w" andrew S 05

6 240-301 Comp. Eng. Lab III (Software), Pattern Matching6 2. The Brute Force Algorithm v Check each position in the text T to see if the pattern P starts in that position andrew T: rewP: andrew T: rewP:.. P moves 1 char at a time through T

7 240-301 Comp. Eng. Lab III (Software), Pattern Matching7 Brute Force in Java public static int brute(String text,String pattern) { int n = text.length(); // n is length of text int m = pattern.length(); // m is length of pattern int j; for(int i=0; i <= (n-m); i++) { j = 0; while ((j < m) && (text.charAt(i+j) == pattern.charAt(j)) ) j++; if (j == m) return i; // match at i } return -1; // no match } // end of brute() Return index where pattern starts, or -1

8 240-301 Comp. Eng. Lab III (Software), Pattern Matching8 Usage public static void main(String args[]) { if (args.length != 2) { System.out.println("Usage: java BruteSearch "); System.exit(0); } System.out.println("Text: " + args[0]); System.out.println("Pattern: " + args[1]); int posn = brute(args[0], args[1]); if (posn == -1) System.out.println("Pattern not found"); else System.out.println("Pattern starts at posn " + posn); }

9 240-301 Comp. Eng. Lab III (Software), Pattern Matching9 Analysis v Brute force pattern matching runs in time O(mn) in the worst case. v But most searches of ordinary text take O(m+n), which is very quick. continued

10 240-301 Comp. Eng. Lab III (Software), Pattern Matching10 v The brute force algorithm is fast when the alphabet of the text is large –e.g. A..Z, a..z, 1..9, etc. v It is slower when the alphabet is small –e.g. 0, 1 (as in binary files, image files, etc.) continued

11 240-301 Comp. Eng. Lab III (Software), Pattern Matching11 v Example of a worst case: –T: "aaaaaaaaaaaaaaaaaaaaaaaaaah" –P: "aaah" v Example of a more average case: –T: "a string searching example is standard" –P: "store"

12 240-301 Comp. Eng. Lab III (Software), Pattern Matching12 3. The Boyer-Moore Algorithm v The Boyer-Moore pattern matching algorithm is based on two techniques. v 1. The looking-glass technique –find P in T by moving backwards through P, starting at its end

13 240-301 Comp. Eng. Lab III (Software), Pattern Matching13 v 2. The character-jump technique –when a mismatch occurs at T[i] == x –the character in pattern P[j] is not the same as T[i] v There are 3 possible cases, tried in order. x a T i b a P j

14 240-301 Comp. Eng. Lab III (Software), Pattern Matching14 Case 1 v If P contains x somewhere, then try to shift P right to align the last occurrence of x in P with T[i]. x a T i b a P j x c x a T i new b a P j new x c ? ? and move i and j right, so j at end

15 240-301 Comp. Eng. Lab III (Software), Pattern Matching15 Case 2 v If P contains x somewhere, but a shift right to the last occurrence is not possible, then shift P right by 1 character to T[i+1]. a x T i a x P j c w a x T i new a x P j new c w ? and move i and j right, so j at end x x is after j position x

16 240-301 Comp. Eng. Lab III (Software), Pattern Matching16 Case 3 v If cases 1 and 2 do not apply, then shift P to align P[0] with T[i+1]. x a T i b a P j d c x a T i new b a P j new d c ? ? and move i and j right, so j at end No x in P ? 0

17 240-301 Comp. Eng. Lab III (Software), Pattern Matching17 Boyer-Moore Example (1) T: P:

18 240-301 Comp. Eng. Lab III (Software), Pattern Matching18 Last Occurrence Function v Boyer-Moore’s algorithm preprocesses the pattern P and the alphabet A to build a last occurrence function L() –L() maps all the letters in A to integers v L(x) is defined as:// x is a letter in A –the largest index i such that P[i] == x, or –-1 if no such index exists

19 240-301 Comp. Eng. Lab III (Software), Pattern Matching19 L() Example v A = {a, b, c, d} v P: "abacab" 354L(x)L(x) dcbax abacab 012345 P L() stores indexes into P[]

20 240-301 Comp. Eng. Lab III (Software), Pattern Matching20 Note v In Boyer-Moore code, L() is calculated when the pattern P is read in. v Usually L() is stored as an array –something like the table in the previous slide

21 240-301 Comp. Eng. Lab III (Software), Pattern Matching21 Boyer-Moore Example (2) 1111354 L(x)L(x)L(x)L(x)dcbax T: P:

22 240-301 Comp. Eng. Lab III (Software), Pattern Matching22 Boyer-Moore in Java public static int bmMatch(String text, String pattern) { int last[] = buildLast(pattern); int n = text.length(); int m = pattern.length(); int i = m-1; if (i > n-1) return -1; // no match if pattern is // longer than text : Return index where pattern starts, or -1

23 240-301 Comp. Eng. Lab III (Software), Pattern Matching23 int j = m-1; do { if (pattern.charAt(j) == text.charAt(i)) if (j == 0) return i; // match else { // looking-glass technique i--; j--; } else { // character jump technique int lo = last[text.charAt(i)]; //last occ i = i + m - Math.min(j, 1+lo); j = m - 1; } } while (i <= n-1); return -1; // no match } // end of bmMatch()

24 240-301 Comp. Eng. Lab III (Software), Pattern Matching24 public static int[] buildLast(String pattern) /* Return array storing index of last occurrence of each ASCII char in pattern. */ { int last[] = new int[128]; // ASCII char set for(int i=0; i < 128; i++) last[i] = -1; // initialize array for (int i = 0; i < pattern.length(); i++) last[pattern.charAt(i)] = i; return last; } // end of buildLast()

25 240-301 Comp. Eng. Lab III (Software), Pattern Matching25 Usage public static void main(String args[]) { if (args.length != 2) { System.out.println("Usage: java BmSearch "); System.exit(0); } System.out.println("Text: " + args[0]); System.out.println("Pattern: " + args[1]); int posn = bmMatch(args[0], args[1]); if (posn == -1) System.out.println("Pattern not found"); else System.out.println("Pattern starts at posn " + posn); }

26 240-301 Comp. Eng. Lab III (Software), Pattern Matching26 Analysis v Boyer-Moore worst case running time is O(nm + A) v But, Boyer-Moore is fast when the alphabet (A) is large, slow when the alphabet is small. –e.g. good for English text, poor for binary v Boyer-Moore is significantly faster than brute force for searching English text.

27 240-301 Comp. Eng. Lab III (Software), Pattern Matching27 Worst Case Example v T: "aaaaa…a" v P: "baaaaa" T: P:

28 240-301 Comp. Eng. Lab III (Software), Pattern Matching28 4. The KMP Algorithm v The Knuth-Morris-Pratt (KMP) algorithm looks for the pattern in the text in a left-to- right order (like the brute force algorithm). v But it shifts the pattern more intelligently than the brute force algorithm. continued

29 240-301 Comp. Eng. Lab III (Software), Pattern Matching29 v If a mismatch occurs between the text and pattern P at P[j], what is the most we can shift the pattern to avoid wasteful comparisons? v Answer: the largest prefix of P[0.. j-1] that is a suffix of P[1.. j-1]

30 240-301 Comp. Eng. Lab III (Software), Pattern Matching30 Example T: P: j new = 2 j = 5 i

31 240-301 Comp. Eng. Lab III (Software), Pattern Matching31 Why v Find largest prefix (start) of: "a b a a b"( P[0..j-1] ) which is suffix (end) of: "b a a b"( p[1.. j-1] ) v Answer: "a b" v Set j = 2 // the new j value j == 5

32 240-301 Comp. Eng. Lab III (Software), Pattern Matching32 KMP Failure Function v KMP preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself. v j = mismatch position in P[] v k = position before the mismatch (k = j-1). v The failure function F(k) is defined as the size of the largest prefix of P[0..k] that is also a suffix of P[1..k].

33 240-301 Comp. Eng. Lab III (Software), Pattern Matching33 v P: "abaaba" j: 012345 j: 012345 v In code, F() is represented by an array, like the table. Failure Function Example F(k) is the size of the largest prefix. 1 3 2 4210j 100F(j)F(j) k F(k) (k == j-1)

34 240-301 Comp. Eng. Lab III (Software), Pattern Matching34 Why is F(4) == 2? v F(4) means –find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4] = find the size largest prefix of "abaab" that is also a suffix of "baab" = find the size of "ab" = 2 P: "abaaba"

35 240-301 Comp. Eng. Lab III (Software), Pattern Matching35 v Knuth-Morris-Pratt’s algorithm modifies the brute-force algorithm. –if a mismatch occurs at P[j] (i.e. P[j] != T[i]), then k = j-1; j = F(k); // obtain the new j Using the Failure Function

36 240-301 Comp. Eng. Lab III (Software), Pattern Matching36 KMP in Java public static int kmpMatch(String text, String pattern) { int n = text.length(); int m = pattern.length(); int fail[] = computeFail(pattern); int i=0; int j=0; : Return index where pattern starts, or -1

37 240-301 Comp. Eng. Lab III (Software), Pattern Matching37 while (i 0) j = fail[j-1]; else i++; } return -1; // no match } // end of kmpMatch()

38 240-301 Comp. Eng. Lab III (Software), Pattern Matching38 public static int[] computeFail( String pattern) { int fail[] = new int[pattern.length()]; fail[0] = 0; int m = pattern.length(); int j = 0; int i = 1; :

39 240-301 Comp. Eng. Lab III (Software), Pattern Matching39 while (i 0) // j follows matching prefix j = fail[j-1]; else { // no match fail[i] = 0; i++; } } return fail; } // end of computeFail() Similar code to kmpMatch()

40 240-301 Comp. Eng. Lab III (Software), Pattern Matching40 Usage public static void main(String args[]) { if (args.length != 2) { System.out.println("Usage: java KmpSearch "); System.exit(0); } System.out.println("Text: " + args[0]); System.out.println("Pattern: " + args[1]); int posn = kmpMatch(args[0], args[1]); if (posn == -1) System.out.println("Pattern not found"); else System.out.println("Pattern starts at posn " + posn); }

41 240-301 Comp. Eng. Lab III (Software), Pattern Matching41 Example 0 3 1 4210k 100F(k)F(k) T: P:

42 240-301 Comp. Eng. Lab III (Software), Pattern Matching42 Why is F(4) == 1? v F(4) means –find the size of the largest prefix of P[0..4] that is also a suffix of P[1..4] = find the size largest prefix of "abaca" that is also a suffix of "baca" = find the size of "a" = 1 P: "abacab"

43 240-301 Comp. Eng. Lab III (Software), Pattern Matching43 KMP Advantages v KMP runs in optimal time: O(m+n) –very fast v The algorithm never needs to move backwards in the input text, T –this makes the algorithm good for processing very large files that are read in from external devices or through a network stream

44 240-301 Comp. Eng. Lab III (Software), Pattern Matching44 KMP Disadvantages v KMP doesn’t work so well as the size of the alphabet increases –more chance of a mismatch (more possible mismatches) –mismatches tend to occur early in the pattern, but KMP is faster when the mismatches occur later

45 240-301 Comp. Eng. Lab III (Software), Pattern Matching45 KMP Extensions v The basic algorithm doesn't take into account the letter in the text that caused the mismatch. aaab b aaa b b a x aaa b b a T: P: Basic KMP does not do this.

46 240-301 Comp. Eng. Lab III (Software), Pattern Matching46 5. More Information v Algorithms in C++ Robert Sedgewick Addison-Wesley, 1992 –chapter 19, String Searching v Online Animated Algorithms: – –http://www.ics.uci.edu/~goodrich/dsa/ 11strings/demos/pattern/ – –http://www-sr.informatik.uni-tuebingen.de/ ~buehler/BM/BM1.html – –http://www-igm.univ-mlv.fr/~lecroq/string/ This book is in the CoE library.


Download ppt "240-301 Comp. Eng. Lab III (Software), Pattern Matching1 Pattern Matching Dr. Andrew Davison WiG Lab (teachers room), CoE 240-301,"

Similar presentations


Ads by Google