Chapter 3 String Matching.

Slides:



Advertisements
Similar presentations
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Advertisements

A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 3 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
The Design and Analysis of Algorithms
Theory of Algorithms: Brute Force James Gain and Edwin Blake {jgain | Department of Computer Science University of Cape Town August.
Lecture 8 Jianjun Hu Department of Computer Science and Engineering University of South Carolina CSCE350 Algorithms and Data Structure.
Design and Analysis of Algorithms - Chapter 31 Brute Force A straightforward approach usually based on problem statement and definitions Examples: 1. Computing.
Analysis & Design of Algorithms (CSCE 321)
Chapter 3 Brute Force Copyright © 2007 Pearson Addison-Wesley. All rights reserved.
Design and Analysis of Algorithms - Chapter 31 Brute Force A straightforward approach usually based on problem statement and definitions Examples: 1. Computing.
Homework page 102 questions 1, 4, and 10 page 106 questions 4 and 5 page 111 question 1 page 119 question 9.
Knuth-Morris-Pratt Algorithm Prepared by: Mayank Agarwal Prepared by: Mayank Agarwal Nitesh Maan Nitesh Maan.
Lecture 6 Jianjun Hu Department of Computer Science and Engineering University of South Carolina CSCE350 Algorithms and Data Structure.
String Matching. Problem is to find if a pattern P[1..m] occurs within text T[1..n] Simple solution: Naïve String Matching –Match each position in the.
Chapter 12 Coping with the Limitations of Algorithm Power Copyright © 2007 Pearson Addison-Wesley. All rights reserved.
IT 60101: Lecture #201 Foundation of Computing Systems Lecture 20 Classic Optimization Problems.
KMP String Matching Prepared By: Carlens Faustin.
A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 3 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
Theory of Algorithms: Brute Force. Outline Examples Brute-Force String Matching Closest-Pair Convex-Hull Exhaustive Search brute-force strengths and weaknesses.
Brute Force A straightforward approach, usually based directly on the problem’s statement and definitions of the concepts involved Examples: Computing.
A. Levitin “Introduction to the Design & Analysis of Algorithms,” 3rd ed., Ch. 3 ©2012 Pearson Education, Inc. Upper Saddle River, NJ. All Rights Reserved.
5-1-1 CSC401 – Analysis of Algorithms Chapter 5--1 The Greedy Method Objectives Introduce the Brute Force method and the Greedy Method Compare the solutions.
Chapter 3 Brute Force. A straightforward approach, usually based directly on the problem’s statement and definitions of the concepts involved Examples:
1 String Matching Algorithms Topics  Basics of Strings  Brute-force String Matcher  Rabin-Karp String Matching Algorithm  KMP Algorithm.
COMP108 Algorithmic Foundations Polynomial & Exponential Algorithms
Lecture 7 Jianjun Hu Department of Computer Science and Engineering University of South Carolina CSCE350 Algorithms and Data Structure.
Chapter 3 Brute Force Copyright © 2007 Pearson Addison-Wesley. All rights reserved.
1/39 COMP170 Tutorial 13: Pattern Matching T: P:.
Ch3 /Lecture #4 Brute Force and Exhaustive Search 1.
Exhaustive search Exhaustive search is simply a brute- force approach to combinatorial problems. It suggests generating each and every element of the problem.
Introduction to Algorithms: Brute-Force Algorithms.
Rabin & Karp Algorithm. Rabin-Karp – the idea Compare a string's hash values, rather than the strings themselves. For efficiency, the hash value of the.
Brute Force A straightforward approach, usually based directly on the problem’s statement and definitions of the concepts involved Examples: Computing.
Brute Force II.
Limitation of Computation Power – P, NP, and NP-complete
Brute Force A straightforward approach, usually based directly on the problem’s statement and definitions of the concepts involved Examples: Computing.
Brute Force Algorithms
Brute Force A straightforward approach, usually based directly on the problem’s statement and definitions of the concepts involved Examples: Computing.
Advanced Algorithms Analysis and Design
COMP108 Algorithmic Foundations Polynomial & Exponential Algorithms
Greedy Technique.
Advanced Algorithms Analysis and Design
String Matching (Chap. 32)
Chapter 3 Brute Force Copyright © 2007 Pearson Addison-Wesley. All rights reserved.
Advanced Algorithm Design and Analysis (Lecture 12)
13 Text Processing Hongfei Yan June 1, 2016.
Chapter 3 String Matching.
Rabin & Karp Algorithm.
Brute Force A straightforward approach, usually based directly on the problem’s statement and definitions of the concepts involved Examples – based directly.
Chapter 3 Brute Force Copyright © 2007 Pearson Addison-Wesley. All rights reserved.
Chapter 3 Brute Force Copyright © 2007 Pearson Addison-Wesley. All rights reserved.
Heuristics Definition – a heuristic is an inexact algorithm that is based on intuitive and plausible arguments which are “likely” to lead to reasonable.
Tuesday, 12/3/02 String Matching Algorithms Chapter 32
String-Matching Algorithms (UNIT-5)
Chapter 3 Brute Force Copyright © 2007 Pearson Addison-Wesley. All rights reserved.
Brute Force A straightforward approach, usually based directly on the problem’s statement and definitions of the concepts involved Examples: Computing.
Brute Force A straightforward approach, usually based directly on the problem’s statement and definitions of the concepts involved Examples: Computing.
Pattern Matching 1/14/2019 8:30 AM Pattern Matching Pattern Matching.
KMP String Matching Donald Knuth Jim H. Morris Vaughan Pratt 1997.
Chapter 11 Limitations of Algorithm Power
Recall Some Algorithms And Algorithm Analysis Lots of Data Structures
3. Brute Force Selection sort Brute-Force string matching
Data Structures and Algorithms (AT70. 02) Comp. Sc. and Inf. Mgmt
3. Brute Force Selection sort Brute-Force string matching
Backtracking and Branch-and-Bound
CSC 380: Design and Analysis of Algorithms
Pattern Matching Pattern Matching 5/1/2019 3:53 PM Spring 2007
CSC 380: Design and Analysis of Algorithms
3. Brute Force Selection sort Brute-Force string matching
Presentation transcript:

Chapter 3 String Matching

String Matching Given: Two strings T[1..n] and P[1..m] over alphabet . Want to find all occurrences of P[1..m] “the pattern” in T[1..n] “the text”. Example:  = {a, b, c} Text T pattern P a b c - P occurs with shift s. - P occurs beginning at position s+1. s is a valid shift. The idea of the string matching problem is that we want to find all occurrences of the pattern P in the given text T s=3 a b

Sequential Search

Naïve String Matching Using Brute Force Technique

Naïve String Matching method n ≡ size of input string m ≡ size of pattern to be matched O( (n-m+1)m ) Θ( n2 ) if m = floor( n/2 ) We can do better

Rabin Karp String Matching Consider a hashing scheme Let characters in both arrays T and P be digits in radix-S notation. (S = (0,1,...,9) Assume each character is digit in radix-d notation (e.g. d=10) Let p be the value of the characters in P Choose a prime number q such that fits within a computer word to speed computations. Compute (p mod q) The value of p mod q is what we will be using to find all matches of the pattern P in T.

Compute (T[s+1, .., s+m] mod q) for s = 0 .. n-m Test against P only those sequences in T having the same (mod q) value

Assume each character is digit in radix-d notation (e.g. d=10) p = decimal value of pattern ts = decimal value of substring T[s+1..s+m] for s = 0,1...,n-m s = a valid shift We never explicitly compute a new value. We simply adjust the existing value as we move over one character.

Performance of Robin Karp:- Preprocessing (determining each pattern hash) Θ( m ) Worst case running time Θ( (n-m+1)m ) No better than naïve method Expected case If we assume the number of hits is constant compared to n, we expect O( n ) Only pattern-match “hits” – not all shifts

The Knuth-Morris-Pratt Algorithm Knuth, Morris and Pratt proposed a linear time algorithm for the string matching problem. A matching time of O(n) is achieved by avoiding comparisons with elements of ‘S’ that have previously been involved in comparison with some element of the pattern ‘p’ to be matched. i.e., backtracking on the string ‘S’ never occurs

Components of KMP algorithm The prefix function, Π The prefix function,Π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern ‘p’. In other words, this enables avoiding backtracking on the string ‘S’. The KMP Matcher With string ‘S’, pattern ‘p’ and prefix function ‘Π’ as inputs, finds the occurrence of ‘p’ in ‘S’ and returns the number of shifts of ‘p’ after which occurrence is found.

The prefix function, Π Following pseudocode computes the prefix fucnction, Π: Compute-Prefix-Function (p) 1 m  length[p] //’p’ pattern to be matched 2 Π[1]  0 3 k  0 for q  2 to m do while k > 0 and p[k+1] != p[q] 6 do k  Π[k] If p[k+1] = p[q] then k  k +1 Π[q]  k 10 return Π

Example: compute Π for the pattern ‘p’ below: p a b c Initially: m = length[p] = 7 Π[1] = 0 k = 0 Step 1: q = 2, k=0 Π[2] = 0 Step 2: q = 3, k = 0, Π[3] = 1 Step 3: q = 4, k = 1 Π[4] = 2 q 1 2 3 4 5 6 7 p a b c Π q 1 2 3 4 5 6 7 p a b c Π q 1 2 3 4 5 6 7 p a b c A Π

Step 4: q = 5, k =2 Π[5] = 3 Step 5: q = 6, k = 3 Π[6] = 1 Π[7] = 1 After iterating 6 times, the prefix function computation is complete:  q 1 2 3 4 5 6 7 p a b c Π q 1 2 3 4 5 6 7 p a b c Π q 1 2 3 4 5 6 7 p a b c Π q 1 2 3 4 5 6 7 p a b A c Π

The KMP Matcher The KMP Matcher, with pattern ‘p’, string ‘S’ and prefix function ‘Π’ as input, finds a match of p in S. Following pseudocode computes the matching component of KMP algorithm: KMP-Matcher(S,p) 1 n  length[S] 2 m  length[p] 3 Π  Compute-Prefix-Function(p) 4 q  0 //number of characters matched 5 for i  1 to n //scan S from left to right 6 do while q > 0 and p[q+1] != S[i] do q  Π[q] //next character does not match if p[q+1] = S[i] then q  q + 1 //next character matches if q = m //is all of p matched? then print “Pattern occurs with shift” i – m q  Π[ q] // look for the next match Note: KMP finds every occurrence of a ‘p’ in ‘S’. That is why KMP does not terminate in step 12, rather it searches remainder of ‘S’ for any more occurrences of ‘p’.

Illustration: given a String ‘S’ and pattern ‘p’ as follows: S b a c Let us execute the KMP algorithm to find whether ‘p’ occurs in ‘S’. For ‘p’ the prefix function, Π was computed previously and is as follows: q 1 2 3 4 5 6 7 p a b A c Π

S b a c p a b c S b a c p a b c Initially: n = size of S = 15; m = size of p = 7 Step 1: i = 1, q = 0 comparing p[1] with S[1] S b a c p a b c P[1] does not match with S[1]. ‘p’ will be shifted one position to the right. Step 2: i = 2, q = 0 comparing p[1] with S[2] S b a c p a b c P[1] matches S[2]. Since there is a match, p is not shifted.

S b a c p a b c S b a c p a b c b a c S p a b c Step 3: i = 3, q = 1 Comparing p[2] with S[3] p[2] does not match with S[3] S b a c p a b c Backtracking on p, comparing p[1] and S[3] Step 4: i = 4, q = 0 comparing p[1] with S[4] p[1] does not match with S[4] S b a c p a b c Step 5: i = 5, q = 0 comparing p[1] with S[5] p[1] matches with S[5] b a c S p a b c

S b a c p a b c b a c S p a b c S b a c p a b c Step 6: i = 6, q = 1 Comparing p[2] with S[6] p[2] matches with S[6] S b a c p a b c Step 7: i = 7, q = 2 Comparing p[3] with S[7] p[3] matches with S[7] b a c S p a b c Step 8: i = 8, q = 3 Comparing p[4] with S[8] p[4] matches with S[8] S b a c p a b c

S b a c p a b c b a c S p a b c b a c S p a b c Step 9: i = 9, q = 4 Comparing p[5] with S[9] p[5] matches with S[9] S b a c p a b c Step 10: i = 10, q = 5 Comparing p[6] with S[10] p[6] doesn’t match with S[10] b a c S p a b c Backtracking on p, comparing p[4] with S[10] because after mismatch q = Π[5] = 3 Step 11: i = 11, q = 4 Comparing p[5] with S[11] p[5] matches with S[11] b a c S p a b c

S b a c p a b c S b a c p a b c Step 12: i = 12, q = 5 Comparing p[6] with S[12] p[6] matches with S[12] S b a c p a b c Step 13: i = 13, q = 6 Comparing p[7] with S[13] p[7] matches with S[13] S b a c p a b c Pattern ‘p’ has been found to completely occur in string ‘S’. The total number of shifts that took place for the match to be found are: i – m = 13 – 7 = 6 shifts.

Running - time analysis Compute-Prefix-Function (Π) 1 m  length[p] //’p’ pattern to be matched 2 Π[1]  0 3 k  0 for q  2 to m do while k > 0 and p[k+1] != p[q] 6 do k  Π[k] If p[k+1] = p[q] then k  k +1 Π[q]  k return Π In the above pseudocode for computing the prefix function, the for loop from step 4 to step 10 runs ‘m’ times. Step 1 to step 3 take constant time. Hence the running time of compute prefix function is Θ(m). KMP Matcher 1 n  length[S] 2 m  length[p] 3 Π  Compute-Prefix-Function(p) 4 q  0 5 for i  1 to n 6 do while q > 0 and p[q+1] != S[i] do q  Π[q] if p[q+1] = S[i] then q  q + 1 if q = m then print “Pattern occurs with shift” i – m q  Π[ q] The for loop beginning in step 5 runs ‘n’ times, i.e., as long as the length of the string ‘S’. Since step 1 to step 4 take constant time, the running time is dominated by this for loop. Thus running time of matching function is Θ(n).

Closest-Pair Problem Find the two closest points in a set of n points (in the two-dimensional Cartesian plane). Brute-force algorithm Compute the distance between every pair of distinct points and return the indexes of the points for which the distance is the smallest. Draw example.

Closest-Pair Brute-Force Algorithm (cont.) The basic operation of the algorithm is computing the Euclidean distance between two points. The square root is a complex operation who’s result is often irrational, therefore the results can be found only approximately. Computing such operations are not trivial. One can avoid computing square roots by comparing distance squares instead. Efficiency: How to make it faster? Θ(n^2) multiplications (or sqrt) Using divide-and-conquer!

Brute-Force Strengths and Weaknesses wide applicability simplicity yields reasonable algorithms for some important problems (e.g., matrix multiplication, sorting, searching, string matching) Weaknesses rarely yields efficient algorithms some brute-force algorithms are unacceptably slow not as constructive as some other design techniques

Convex Hull

Exhaustive Search A brute force solution to a problem involving search for an element with a special property, usually among combinatorial objects such as permutations, combinations, or subsets of a set. Method: generate a list of all potential solutions to the problem in a systematic manner (see algorithms in Sec. 5.4) evaluate potential solutions one by one, disqualifying infeasible ones and, for an optimization problem, keeping track of the best one found so far when search ends, announce the solution(s) found

Example 1: Traveling Salesman Problem Given n cities with known distances between each pair, find the shortest tour that passes through all the cities exactly once before returning to the starting city Alternatively: Find shortest Hamiltonian circuit in a weighted connected graph Example: a b c d 8 2 7 5 3 4 How do we represent a solution (Hamiltonian circuit)?

TSP by Exhaustive Search Tour Cost a→b→c→d→a 2+3+7+5 = 17 a→b→d→c→a 2+4+7+8 = 21 a→c→b→d→a 8+3+4+5 = 20 a→c→d→b→a 8+7+4+2 = 21 a→d→b→c→a 5+4+3+8 = 20 a→d→c→b→a 5+7+3+2 = 17 Efficiency: Θ((n-1)!)

Example 2: Knapsack Problem Given n items: weights: w1 w2 … wn values: v1 v2 … vn a knapsack of capacity W Find most valuable subset of the items that fit into the knapsack Example: Knapsack capacity W=16 item weight value 2 $20 5 $30 10 $50 5 $10

Knapsack Problem by Exhaustive Search Subset Total weight Total value {1} 2 $20 {2} 5 $30 {3} 10 $50 {4} 5 $10 {1,2} 7 $50 {1,3} 12 $70 {1,4} 7 $30 {2,3} 15 $80 {2,4} 10 $40 {3,4} 15 $60 {1,2,3} 17 not feasible {1,2,4} 12 $60 {1,3,4} 17 not feasible {2,3,4} 20 not feasible {1,2,3,4} 22 not feasible Efficiency: Θ(2^n) Each subset can be represented by a binary string (bit vector, Ch 5).

Example 3: The Assignment Problem There are n people who need to be assigned to n jobs, one person per job. The cost of assigning person i to job j is C[i,j]. Find an assignment that minimizes the total cost. Job 0 Job 1 Job 2 Job 3 Person 0 9 2 7 8 Person 1 6 4 3 7 Person 2 5 8 1 8 Person 3 7 6 9 4 Algorithmic Plan: Generate all legitimate assignments, compute their costs, and select the cheapest one. How many assignments are there? Pose the problem as one about a cost matrix: n! cycle cover in a graph

Assignment Problem by Exhaustive Search 9 2 7 8 6 4 3 7 5 8 1 8 7 6 9 4 Assignment (col.#s) Total Cost 1, 2, 3, 4 9+4+1+4=18 1, 2, 4, 3 9+4+8+9=30 1, 3, 2, 4 9+3+8+4=24 1, 3, 4, 2 9+3+8+6=26 1, 4, 2, 3 9+7+8+9=33 1, 4, 3, 2 9+7+1+6=23 etc. (For this particular instance, the optimal assignment can be found by exploiting the specific features of the number given. It is: ) C = 2,1,3,4

Final Comments on Exhaustive Search Exhaustive-search algorithms run in a realistic amount of time only on very small instances In some cases, there are much better alternatives! Euler circuits shortest paths minimum spanning tree assignment problem In many cases, exhaustive search or its variation is the only known way to get exact solution The Hungarian method runs in O(n^3) time.