
Slide 1: Copyright Notice
www.monash.edu.au (prepared from lecture material © 2004 Goodrich & Tamassia)
COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of Monash University pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. Do not remove this notice.

Slide 2: FIT2004 Algorithms & Data Structures
L5: Strings, Matching, and Dynamic Programming
Prepared by: Bernd Meyer, from lecture materials © 2004 Goodrich & Tamassia, February 2007

Slide 3: Sequence Matching Problems
- Finding a word in a document: web search, etc.
- Spell-checking (approximate matching)
- Bioinformatics, e.g. (DNA) sequence alignment

Slide 4: Strings (§ 11.1 Goodrich & Tamassia)
A string is a sequence of characters. Examples of strings: a Java program, an HTML document, a DNA sequence, a digitized image.
An alphabet Σ is the set of possible characters for a family of strings. Examples of alphabets: ASCII, Unicode, {0, 1}, {A, C, G, T}.
Let P be a string of size m:
- A substring P[i..j] of P is the subsequence of P consisting of the characters with indices between i and j.
- A prefix of P is a substring of the form P[0..i].
- A suffix of P is a substring of the form P[i..m-1].
Given strings T (the text) and P (the pattern), the pattern matching problem consists of finding a substring of T equal to P.
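As a small illustration (not part of the slides), the definitions above map directly onto Python string slicing, and Python's built-in str.find solves the pattern matching problem:

```python
P = "abacab"             # pattern of size m = 6

substring = P[1:4]       # P[1..3] = "bac"
prefix = P[:3]           # P[0..2] = "aba"
suffix = P[3:]           # P[3..m-1] = "cab"

T = "abacaabacab"        # text
print(T.find(P))         # 5: leftmost index where P occurs in T (-1 if absent)
```

The variable names and the example text are chosen for illustration only.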

Slide 5: Brute-Force Pattern Matching
The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
- a match is found, or
- all placements of the pattern have been tried.
Brute-force pattern matching runs in time O(nm). Example of worst case: T = aaa...ah, P = aaah.

Algorithm BruteForceMatch(T, P)
  Input: text T of size n and pattern P of size m
  Output: starting index of a substring of T equal to P, or -1 if no such substring exists
  for i ← 0 to n - m            { test shift i of the pattern }
    j ← 0
    while j < m and T[i + j] = P[j]
      j ← j + 1
    if j = m
      return i                  { match at i }
  return -1                     { no match anywhere }
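The pseudocode above translates almost line for line into Python; this is a sketch, with the function name chosen here:

```python
def brute_force_match(T, P):
    """Return the starting index of the first occurrence of P in T, or -1.

    Tries every shift i of the pattern and compares left to right.
    Worst-case running time O(n*m).
    """
    n, m = len(T), len(P)
    for i in range(n - m + 1):           # test shift i of the pattern
        j = 0
        while j < m and T[i + j] == P[j]:
            j += 1
        if j == m:
            return i                     # match at i
    return -1                            # no match anywhere
```

For example, `brute_force_match("abacaab", "aca")` returns 2, and a pattern that occurs nowhere yields -1.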

Slide 6: Boyer-Moore Heuristics
The Boyer-Moore pattern matching algorithm is based on two heuristics.
- Looking-glass heuristic: compare P with a subsequence of T moving backwards (right to left).
- Character-jump heuristic: when a mismatch occurs at T[i] = c:
  - if P contains c, shift P to align the last occurrence of c in P with T[i];
  - else, shift P to align P[0] with T[i + 1].

Slide 7: Boyer-Moore Heuristics (continued)
Why is Boyer-Moore matching correct?

Slide 8: Last-Occurrence Function
Boyer-Moore's algorithm preprocesses the pattern P and the alphabet Σ to build the last-occurrence function L, mapping Σ to integers, where L(c) is defined as
- the largest index i such that P[i] = c, or
- -1 if no such index exists.
Example: Σ = {a, b, c, d}, P = abacab:

  c      a   b   c   d
  L(c)   4   5   3  -1

The last-occurrence function can be represented by an array indexed by the numeric codes of the characters, and can be computed in time O(m + s), where m is the size of P and s is the size of Σ.
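A minimal sketch of the preprocessing step in Python, using a dict in place of the array indexed by character codes (the function name is chosen here):

```python
def last_occurrence(P, alphabet):
    """Map each character c of the alphabet to the largest index i with
    P[i] == c, or -1 if c does not occur in P.  O(m + s) time."""
    L = {c: -1 for c in alphabet}
    for i, c in enumerate(P):
        L[c] = i          # later occurrences overwrite earlier ones
    return L
```

For P = abacab over {a, b, c, d} this reproduces the table above: L(a) = 4, L(b) = 5, L(c) = 3, L(d) = -1.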

Slide 9: The Boyer-Moore Algorithm

Algorithm BoyerMooreMatch(T, P, Σ)
  L ← lastOccurrenceFunction(P, Σ)
  i ← m - 1
  j ← m - 1
  repeat
    if T[i] = P[j]
      if j = 0
        return i                    { match at i }
      else
        i ← i - 1
        j ← j - 1
    else                            { character-jump }
      l ← L[T[i]]
      i ← i + m - min(j, 1 + l)
      j ← m - 1
  until i > n - 1
  return -1                         { no match }

The min(j, 1 + l) in the character-jump guards against shifting the pattern backwards. Case 1: j < 1 + l, i.e. the last occurrence of T[i] in P lies at or beyond position j, so the pattern simply advances past the mismatch. Case 2: 1 + l ≤ j, so the pattern shifts to align its last occurrence of T[i] with T[i].
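A sketch of the pseudocode in Python (function name and dict-based last-occurrence table are choices made here; this version uses only the character-jump heuristic, as on the slide, not the full good-suffix rule):

```python
def boyer_moore_match(T, P, alphabet):
    """Boyer-Moore search with the looking-glass and character-jump
    heuristics.  Returns the first match index of P in T, or -1."""
    n, m = len(T), len(P)
    L = {c: -1 for c in alphabet}     # last-occurrence function
    for k, c in enumerate(P):
        L[c] = k
    i = j = m - 1
    while i <= n - 1:
        if T[i] == P[j]:
            if j == 0:
                return i              # match at i
            i -= 1                    # keep comparing backwards
            j -= 1
        else:                         # character-jump
            l = L.get(T[i], -1)
            i += m - min(j, 1 + l)
            j = m - 1
    return -1                         # no match
```

For example, `boyer_moore_match("abacaabacab", "abacab", "abcd")` returns 5, agreeing with the brute-force result.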

Slide 10: Example
(Worked example of a Boyer-Moore run; the figure is not preserved in this transcript.)

Slide 11: Analysis
- Boyer-Moore's algorithm runs in time O(nm + s). Example of worst case: T = aaa...a, P = baaa.
- The worst case may occur in images and DNA sequences, but is unlikely in English text.
- Boyer-Moore's algorithm is significantly faster than the brute-force algorithm on English text.

Slide 12: The KMP Algorithm
The Knuth-Morris-Pratt algorithm compares the pattern to the text left to right, but shifts the pattern more intelligently than the brute-force algorithm. When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
Answer: shift so that the largest prefix of P[0..j] that is also a suffix of P[1..j] lines up with the already-matched text. There is no need to repeat the comparisons inside that prefix; comparison resumes just after it. (The slide illustrates this with the pattern abaaba; the figure is not preserved in this transcript.)

Slide 13: KMP Failure Function
The Knuth-Morris-Pratt algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself. The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j].
KMP modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i], we set j ← F(j - 1); that is, F(j - 1) is the index of the pattern at which we restart matching (left to right).
Example, P = abaaba:

  j      0  1  2  3  4  5
  P[j]   a  b  a  a  b  a
  F(j)   0  0  1  1  2  3

Slide 14: The KMP Algorithm
The failure function can be represented by an array and can be computed in O(m) time. At each iteration of the while-loop, either
- i increases by one, or
- the shift amount i - j increases by at least one (observe that F(j - 1) < j).
Hence there are no more than 2n iterations of the while-loop. Thus KMP runs in optimal time O(m + n).

Algorithm KMPMatch(T, P)
  F ← failureFunction(P)
  i ← 0
  j ← 0
  while i < n
    if T[i] = P[j]
      if j = m - 1
        return i - j              { match }
      else
        i ← i + 1
        j ← j + 1
    else
      if j > 0
        j ← F[j - 1]
      else
        i ← i + 1
  return -1                       { no match }
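A self-contained Python sketch of the search (function names chosen here; the failure-function construction follows the pseudocode on slide 16):

```python
def failure_function(P):
    """F[j] = size of the largest prefix of P[0..j] that is also a
    suffix of P[1..j].  Computed in O(m) amortized time."""
    m = len(P)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:
            F[i] = j + 1
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]
        else:
            F[i] = 0
            i += 1
    return F

def kmp_match(T, P):
    """KMP search in O(n + m): first index of P in T, or -1."""
    n, m = len(T), len(P)
    F = failure_function(P)
    i = j = 0
    while i < n:
        if T[i] == P[j]:
            if j == m - 1:
                return i - j          # match
            i += 1
            j += 1
        elif j > 0:
            j = F[j - 1]              # reuse the partial match
        else:
            i += 1
    return -1                         # no match
```

Note how a mismatch with j > 0 moves only j, never i backwards: the text pointer advances monotonically, which is what bounds the loop at 2n iterations.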

Slide 15: Example
P = abacab:

  j      0  1  2  3  4  5
  P[j]   a  b  a  c  a  b
  F(j)   0  0  1  0  1  2

Slide 16: Computing the Failure Function
The failure function can be represented by an array and can be computed in O(m) time; the construction is similar to the KMP algorithm itself. At each iteration of the while-loop, either
- i increases by one, or
- the shift amount i - j increases by at least one (observe that F(j - 1) < j).
Hence there are no more than 2m iterations of the while-loop.

Algorithm failureFunction(P)
  F[0] ← 0
  i ← 1
  j ← 0
  while i < m
    if P[i] = P[j]
      { we have matched j + 1 characters }
      F[i] ← j + 1
      i ← i + 1
      j ← j + 1
    else if j > 0
      { use the failure function to shift P }
      j ← F[j - 1]
    else
      F[i] ← 0                    { no match }
      i ← i + 1
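In Python, the construction above can be checked directly against the tables on slides 13 and 15 (a sketch; the function name is chosen here):

```python
def failure_function(P):
    """Compute the KMP failure function as in the pseudocode above.
    Each iteration either advances i or increases i - j, so the total
    work is O(m)."""
    m = len(P)
    F = [0] * m
    i, j = 1, 0
    while i < m:
        if P[i] == P[j]:              # matched j + 1 characters
            F[i] = j + 1
            i += 1
            j += 1
        elif j > 0:                   # use F itself to fall back
            j = F[j - 1]
        else:
            F[i] = 0                  # no match
            i += 1
    return F

print(failure_function("abaaba"))     # [0, 0, 1, 1, 2, 3]
print(failure_function("abacab"))     # [0, 0, 1, 0, 1, 2]
```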

Slide 17: Subsequences
A subsequence of a character string x_0 x_1 x_2 ... x_{n-1} is a string of the form x_{i_1} x_{i_2} ... x_{i_k}, where i_j < i_{j+1}. This is not the same as a substring!
Example, for the string ABCDEFGHIJK:
- ACEGIJK is a subsequence
- DFGHK is a subsequence
- DAGH is not a subsequence (A precedes D in the string)
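The definition suggests a simple two-pointer test; here is a sketch (the helper name `is_subsequence` is chosen here, not part of the slides):

```python
def is_subsequence(S, X):
    """True if S is a subsequence of X: the characters of S appear in X
    in order, not necessarily contiguously.  Single O(|X|) scan."""
    it = iter(X)
    # "c in it" consumes the iterator up to and including c,
    # so each character of S must be found strictly after the previous one.
    return all(c in it for c in S)
```

With the slide's example, `is_subsequence("ACEGIJK", "ABCDEFGHIJK")` and `is_subsequence("DFGHK", "ABCDEFGHIJK")` are True, while `is_subsequence("DAGH", "ABCDEFGHIJK")` is False.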

Slide 18: LCS: Longest Common Subsequence
Given two strings X and Y, the longest common subsequence (LCS) problem is to find a longest subsequence common to both X and Y. It has applications to DNA similarity testing (alphabet {A, C, G, T}).
Example: ABCDEFG and XZACKDFWGH have ACDFG as a longest common subsequence.

Slide 19: A Poor Approach to the LCS Problem
A brute-force solution:
- enumerate all subsequences of X,
- test which ones are also subsequences of Y,
- pick the longest one.
Analysis: if X is of length n, then it has 2^n subsequences, so this is an exponential-time algorithm!

Slide 20: Dynamic Programming Technique
Applies to a problem that at first seems to require a lot of time (possibly exponential), provided we have:
- Simple subproblem definition: the subproblems can be defined in terms of a few variables, such as j, k, l, m, and so on.
- Subproblem optimality: the global optimum value can be defined in terms of optimal subproblems.
- Subproblem overlap: the subproblems are not independent, but instead overlap (hence they should be constructed bottom-up).
The recursion is unwound into iteration using tables.

Slide 21: Dynamic-Programming Approach to LCS
Define L[i, j] to be the length of the longest common subsequence of X[0..i] and Y[0..j]. Allow -1 as an index, with L[-1, k] = 0 and L[k, -1] = 0, to indicate that the null part of X or Y has no match with the other. Then we can define L[i, j] in the general case as follows:
- Case 1: if x_i = y_j, then L[i, j] = L[i-1, j-1] + 1 (we can add this match).
- Case 2: if x_i ≠ y_j, then L[i, j] = max{L[i-1, j], L[i, j-1]} (we have no match here).

Slide 22: An LCS Algorithm

Algorithm LCS(X, Y):
  Input: strings X and Y with n and m elements, respectively
  Output: for i = 0, ..., n-1 and j = 0, ..., m-1, the length L[i, j] of a longest string that is a subsequence of both X[0..i] = x_0 x_1 ... x_i and Y[0..j] = y_0 y_1 ... y_j
  for i = -1 to n-1 do
    L[i, -1] ← 0
  for j = 0 to m-1 do
    L[-1, j] ← 0
  for i = 0 to n-1 do
    for j = 0 to m-1 do
      if x_i = y_j then
        L[i, j] ← L[i-1, j-1] + 1
      else
        L[i, j] ← max{L[i-1, j], L[i, j-1]}
  return array L
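A Python sketch of the table-filling algorithm (function name chosen here); since Python lists do not allow a -1 boundary row, row and column 0 of the table play the role of the pseudocode's index -1, so L[i+1][j+1] holds the slide's L[i, j]:

```python
def lcs_length(X, Y):
    """Length of a longest common subsequence of X and Y, in O(n*m)
    time and space, via the dynamic-programming recurrence."""
    n, m = len(X), len(Y)
    L = [[0] * (m + 1) for _ in range(n + 1)]   # boundary rows are 0
    for i in range(n):
        for j in range(m):
            if X[i] == Y[j]:
                L[i + 1][j + 1] = L[i][j] + 1            # case 1: match
            else:
                L[i + 1][j + 1] = max(L[i][j + 1],       # case 2: skip x_i
                                      L[i + 1][j])       #          or y_j
    return L[n][m]
```

With the slide's example, `lcs_length("ABCDEFG", "XZACKDFWGH")` returns 5, the length of ACDFG. The subsequence itself could be recovered by walking the table backwards, as the analysis slide notes.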

Slide 23: Visualizing the LCS Algorithm
(Figure showing the filled L table; not preserved in this transcript.)

Slide 24: Analysis of LCS Algorithm
We have two nested loops:
- the outer one iterates n times;
- the inner one iterates m times;
- a constant amount of work is done inside each iteration of the inner loop.
Thus the total running time is O(nm). The answer is contained in L[n-1, m-1] (and the subsequence itself can be recovered from the L table).

Slide 25: Algorithm Design Techniques
Greedy:
- the solution is built step by step, taking the best locally possible step at each point.
Divide-and-conquer:
- the problem is broken into several smaller and/or simpler independent subproblems;
- the subproblems are solved independently, usually using recursion;
- the solutions of the subproblems are combined to yield the solution of the original problem.
Dynamic programming:
- the problem is broken up into several smaller overlapping subproblems;
- the subproblems are solved bottom-up (smallest first);
- subproblem solutions are combined into solutions of increasingly complex subproblems.

Slide 26: Origin of Dynamic Programming
Bellman pioneered the systematic study of dynamic programming in the 1950s. The name "dynamic programming" was given to the technique to market it: according to Bellman's autobiography, the Secretary of Defense (the funding source!) at that time was hostile to mathematical research, and the name was chosen because
- "it's impossible to use dynamic in a pejorative sense"
- it was "something not even a Congressman could object to"

Slide 27: Dynamic Programming Applications
Some famous applications of dynamic programming algorithms:
- Unix diff, for comparing two files;
- Smith-Waterman, for sequence alignment;
- Bellman-Ford, for shortest-path routing in networks;
- Cocke-Kasami-Younger, for parsing context-free grammars.

