Advanced Algorithm Design and Analysis (Lecture 3) SW5 fall 2004 Simonas Šaltenis E1-215b



AALG, lecture 3, © Simonas Šaltenis

Text-Search Algorithms

Goals of the lecture:
- naive text-search algorithm and its analysis;
- Rabin-Karp algorithm and its analysis;
- Knuth-Morris-Pratt algorithm ideas;
- Boyer-Moore-Horspool algorithm and its analysis;
- comparison of the advantages and disadvantages of the different text-search algorithms.

Text-Search Problem

Input:
- Text T = "at the thought of", n = length(T) = 17
- Pattern P = "the", m = length(P) = 3

Output: shift s – the smallest integer (0 ≤ s ≤ n – m) such that T[s..s+m–1] = P[0..m–1]. Returns –1 if no such s exists.

Example: for T = "at the thought of" and P = "the", the answer is s = 3.

Naïve Text Search

Idea: brute force – check all values of s from 0 to n – m.

Naive-Search(T,P)
01 for s ← 0 to n – m
02   j ← 0
03   // check if T[s..s+m–1] = P[0..m–1]
04   while T[s+j] = P[j] do
05     j ← j + 1
06     if j = m then return s
07 return –1

Exercise: let T = "at the thought of" and P = "though". What is the number of character comparisons?
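A minimal runnable version of the pseudocode above, in Python (the function name and 0-based slicing are mine; the explicit j < m guard replaces the pseudocode's early return inside the loop):

```python
def naive_search(T, P):
    """Return the smallest shift s with T[s..s+m-1] == P, or -1."""
    n, m = len(T), len(P)
    for s in range(n - m + 1):
        j = 0
        # check if T[s..s+m-1] = P[0..m-1]
        while j < m and T[s + j] == P[j]:
            j += 1
        if j == m:
            return s
    return -1
```

For the exercise text, naive_search("at the thought of", "though") returns 7.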

Analysis of Naïve Text Search

Worst case:
- outer loop: n – m iterations;
- inner loop: up to m comparisons;
- total: (n–m)m = O(nm).
- What input gives this worst-case behaviour?

Best case: n – m comparisons. When?

For completely random text and pattern: O(n–m).

Fingerprint Idea

Assume:
- We can compute a fingerprint f(P) of P in O(m) time.
- If f(P) ≠ f(T[s..s+m–1]), then P ≠ T[s..s+m–1].
- We can compare fingerprints in O(1).
- We can compute f′ = f(T[s+1..s+m]) from f(T[s..s+m–1]) in O(1).

Algorithm with Fingerprints

Let the alphabet Σ = {0,1,2,3,4,5,6,7,8,9}.
Let the fingerprint be just the decimal number itself, i.e., f("1045") = 1*10^3 + 0*10^2 + 4*10 + 5 = 1045.

Fingerprint-Search(T,P)
01 fp ← compute f(P)
02 f ← compute f(T[0..m–1])
03 for s ← 0 to n – m do
04   if fp = f then return s
05   f ← (f – T[s]*10^(m–1))*10 + T[s+m]
06 return –1

Running time: 2·O(m) + O(n–m) = O(n)! Where is the catch?

Using a Hash Function

Problem: we cannot assume we can do arithmetic with m-digit-long numbers in O(1) time.

Solution: use a hash function h = f mod q.
- For example, if q = 7, h("52") = 52 mod 7 = 3.
- h(S1) ≠ h(S2) implies S1 ≠ S2.
- But h(S1) = h(S2) does not imply S1 = S2! For example, if q = 7, h("73") = 3, but "73" ≠ "52".

Basic "mod q" arithmetic:
- (a+b) mod q = (a mod q + b mod q) mod q
- (a*b) mod q = ((a mod q)*(b mod q)) mod q

Preprocessing and Stepping

Preprocessing (Horner's rule):
  fp = (P[m–1] + 10*(P[m–2] + 10*(P[m–3] + … + 10*(P[1] + 10*P[0])…))) mod q
In the same way, compute ft from T[0..m–1].
Example: P = "2531", q = 7. What is fp?

Stepping:
  ft ← ((ft – T[s]*(10^(m–1) mod q))*10 + T[s+m]) mod q
10^(m–1) mod q can be computed once in the preprocessing.
Example: let T[…] = "5319", q = 7. What is the corresponding ft?
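The two formulas can be checked in a few lines of Python. The text "25319" is my own choice, picked so that the first window is "2531" (the preprocessing example) and one step yields the window "5319" (the stepping example):

```python
q, m = 7, 4
T = "25319"  # first window "2531", next window "5319"

# preprocessing (Horner's rule): fingerprint of the first window mod q
ft = 0
for ch in T[:m]:
    ft = (10 * ft + int(ch)) % q
assert ft == 2531 % 7  # fp for "2531" is 4

# stepping: drop T[0], bring in T[m]
c = pow(10, m - 1, q)  # 10^(m-1) mod q, computed once
ft = ((ft - int(T[0]) * c) * 10 + int(T[m])) % q
assert ft == 5319 % 7  # ft for "5319" is 6
```

Python's % always returns a non-negative result, so the subtraction in the stepping formula needs no extra correction here.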

Rabin-Karp Algorithm

Rabin-Karp-Search(T,P)
01 q ← a prime larger than m
02 c ← 10^(m–1) mod q // run a loop multiplying by 10 mod q
03 fp ← 0; ft ← 0
04 for i ← 0 to m–1 // preprocessing
05   fp ← (10*fp + P[i]) mod q
06   ft ← (10*ft + T[i]) mod q
07 for s ← 0 to n – m // matching
08   if fp = ft then // run a loop to compare strings
09     if P[0..m–1] = T[s..s+m–1] then return s
10   ft ← ((ft – T[s]*c)*10 + T[s+m]) mod q
11 return –1

Exercise: how many character comparisons are done if T = “ ” and P = "1978"?
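The same algorithm in Python, generalized from decimal digits to arbitrary characters by treating them as radix-256 digits (the radix d = 256 and the prime q = 101 are my own choices for illustration, not from the slide):

```python
def rabin_karp_search(T, P, q=101, d=256):
    """Return the smallest shift s with T[s..s+m-1] == P, or -1."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return -1
    c = pow(d, m - 1, q)          # d^(m-1) mod q, computed once
    fp = ft = 0
    for i in range(m):            # preprocessing (Horner's rule)
        fp = (d * fp + ord(P[i])) % q
        ft = (d * ft + ord(T[i])) % q
    for s in range(n - m + 1):    # matching
        if fp == ft and T[s:s + m] == P:   # verify on fingerprint hit
            return s
        if s < n - m:             # step the window fingerprint
            ft = ((ft - ord(T[s]) * c) * d + ord(T[s + m])) % q
    return -1
```

The explicit string comparison on a fingerprint hit is what guards against hash collisions.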

Analysis of Rabin-Karp

If q is a prime, the hash function distributes m-digit strings evenly among the q values.
Thus, only every q-th value of shift s will result in matching fingerprints (each of which requires comparing strings with O(m) comparisons).

Expected running time (if q > m):
- preprocessing: O(m);
- outer loop: O(n–m);
- all inner loops together: ((n–m)/q)·m < n–m;
- total expected time: O(n–m).

Worst-case running time: O(nm).

Rabin-Karp in Practice

If the alphabet has d characters, interpret characters as radix-d digits (replace 10 with d in the algorithm).
Choosing a prime q > m can be done with randomized algorithms in O(m) time, or q can be fixed to be the largest prime such that 10*q fits in a computer word.
Rabin-Karp is simple and can be easily extended to two-dimensional pattern matching.

Searching in n Comparisons

The goal: each character of the text is compared only once!

Problem with the naïve algorithm: it forgets what was learned from a partial match!
Examples:
- T = "Tweedledee and Tweedledum" and P = "Tweedledum"
- T = "pappappappar" and P = "pappar"

General Situation

State of the algorithm: checking shift s, the first q characters of P are matched, and we see a non-matching character in T.

Need to find: the largest prefix of P that is a suffix of the matched characters followed by the mismatched character T[s+q].

New q′ = max{k ≤ q | P[0..k–1] = P[q–k+1..q–1]·T[s+q]}

Finite Automaton Search

Algorithm:
- Preprocess: for each q (0 ≤ q ≤ m–1) and each character c ∈ Σ, pre-compute the new value of q; call it δ(q,c). This fills a table of size m·|Σ|.
- Run through the text. Whenever a mismatch is found (P[q] ≠ T[s+q]): set s ← s + q – δ(q,T[s+q]) + 1 and q ← δ(q,T[s+q]).

Analysis:
- Matching phase is O(n).
- But: too much memory, O(m|Σ|), and too much preprocessing, at best O(m|Σ|).
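A Python sketch of the preprocessing, using the standard string-matching-automaton formulation (the whole-text scan below is equivalent to the mismatch-driven shifting described above; the names and the naive O(m³|Σ|) table construction are mine, chosen for clarity over speed):

```python
def build_transition_table(P, alphabet):
    """delta[q][c] = length of the longest prefix of P that is
    a suffix of P[0..q-1] + c; an (m+1) x |alphabet| table."""
    m = len(P)
    delta = [dict() for _ in range(m + 1)]
    for q in range(m + 1):
        for c in alphabet:
            k = min(m, q + 1)
            # shrink k until P[0..k-1] is a suffix of P[0..q-1] + c
            while k > 0 and P[:k] != (P[:q] + c)[-k:]:
                k -= 1
            delta[q][c] = k
    return delta

def automaton_search(T, P, alphabet):
    """Return the shift of the first occurrence of P in T, or -1."""
    delta = build_transition_table(P, alphabet)
    q = 0
    for i, c in enumerate(T):
        q = delta[q].get(c, 0)     # characters outside the alphabet reset q
        if q == len(P):
            return i - len(P) + 1
    return -1
```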

Prefix Function

Idea: forget the unmatched character (T[s+q])!

State of the algorithm: checking shift s, q characters of P are matched, we see a non-matching character in T.

Need to find: the largest prefix of P that is a suffix of P[0..q–1].

New q′ = π[q] = max{k < q | P[0..k–1] = P[q–k..q–1]}

The mismatched character T[s+q] is then compared again, this time against P[q′].

Prefix Table

We can pre-compute a prefix table of size m to store the values of π[q] (1 ≤ q ≤ m).

Example, P = "pappar":

  q     1  2  3  4  5  6
  π[q]  0  0  1  1  2  0

Exercise: compute the prefix table for P = "dadadu".
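The table above can be checked with a direct Python transcription of the definition of π (brute force, O(m³); the function name is mine, and the efficient O(m) construction is what the KMP pseudocode calls Compute-Prefix-Table):

```python
def prefix_table(P):
    """pi[q] = max{k < q : P[0..k-1] == P[q-k..q-1]} for q = 1..m,
    computed by trying every candidate length k from the largest down."""
    m = len(P)
    pi = [0] * (m + 1)  # pi[0] is unused
    for q in range(1, m + 1):
        for k in range(q - 1, 0, -1):
            if P[:k] == P[q - k:q]:
                pi[q] = k
                break
    return pi
```

For P = "pappar" this yields π = 0, 0, 1, 1, 2, 0, matching the table above.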

Knuth-Morris-Pratt Algorithm

KMP-Search(T,P)
01 π ← Compute-Prefix-Table(P)
02 q ← 0 // number of characters matched
03 for i ← 0 to n–1 // scan the text from left to right
04   while q > 0 and P[q] ≠ T[i] do
05     q ← π[q]
06   if P[q] = T[i] then q ← q + 1
07   if q = m then return i – m + 1
08 return –1

Compute-Prefix-Table is essentially the same KMP search algorithm performed on P itself.
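A runnable Python version of both routines (the function names are mine). Note how Compute-Prefix-Table really is KMP matching P against itself:

```python
def compute_prefix_table(P):
    """pi[q] = max{k < q : P[0..k-1] == P[q-k..q-1]} for q = 1..m."""
    m = len(P)
    pi = [0] * (m + 1)          # pi[0] unused; pi[1] is always 0
    k = 0
    for q in range(2, m + 1):
        while k > 0 and P[k] != P[q - 1]:
            k = pi[k]           # fall back to a shorter prefix
        if P[k] == P[q - 1]:
            k += 1
        pi[q] = k
    return pi

def kmp_search(T, P):
    """Return the shift of the first occurrence of P in T, or -1."""
    n, m = len(T), len(P)
    pi = compute_prefix_table(P)
    q = 0                       # number of characters matched
    for i in range(n):          # scan the text left to right
        while q > 0 and P[q] != T[i]:
            q = pi[q]
        if P[q] == T[i]:
            q += 1
        if q == m:
            return i - m + 1
    return -1
```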

Analysis of KMP

Worst-case running time: O(n+m)
- main algorithm: O(n);
- Compute-Prefix-Table: O(m).

Space usage: O(m).

Reverse Naïve Algorithm

Why not search from the end of P? (Boyer and Moore)

Reverse-Naive-Search(T,P)
01 for s ← 0 to n – m
02   j ← m – 1 // start from the end
03   // check if T[s..s+m–1] = P[0..m–1]
04   while T[s+j] = P[j] do
05     j ← j – 1
06     if j < 0 then return s
07 return –1

The running time is exactly the same as that of the naïve algorithm…

Occurrence Heuristic

Boyer and Moore added two heuristics to the reverse naïve algorithm to get an O(n+m) algorithm, but it is complex.
Horspool suggested using just a modified occurrence heuristic: after a mismatch, align T[s+m–1] with the rightmost occurrence of that letter in the pattern P[0..m–2].

Examples:
- T = "detective date" and P = "date"
- T = "tea kettle" and P = "kettle"

Shift Table

In preprocessing, compute the shift table of size |Σ|.

Example: P = "kettle"
  shift[e] = 4, shift[l] = 1, shift[t] = 2, shift[k] = 5,
  shift[any other letter] = 6

Exercise: what is the shift table for P = "pappar"?

Boyer-Moore-Horspool Algorithm

BMH-Search(T,P)
01 // compute the shift table for P
02 for c ← 0 to |Σ|–1
03   shift[c] ← m // default values
04 for k ← 0 to m–2
05   shift[P[k]] ← m – 1 – k
06 // search
07 s ← 0
08 while s ≤ n – m do
09   j ← m – 1 // start from the end
10   // check if T[s..s+m–1] = P[0..m–1]
11   while T[s+j] = P[j] do
12     j ← j – 1
13     if j < 0 then return s
14   s ← s + shift[T[s+m–1]] // shift by the last letter
15 return –1
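A Python sketch of the same algorithm, using a dict in place of the |Σ|-sized array (that substitution is mine; semantics are unchanged, since characters absent from the dict get the default shift m):

```python
def bmh_search(T, P):
    """Return the smallest shift s with T[s..s+m-1] == P, or -1."""
    n, m = len(T), len(P)
    if m == 0 or m > n:
        return -1
    # shift table: rightmost occurrence in P[0..m-2]; default shift is m
    shift = {P[k]: m - 1 - k for k in range(m - 1)}
    s = 0
    while s <= n - m:
        j = m - 1                        # compare from the end
        while j >= 0 and T[s + j] == P[j]:
            j -= 1
        if j < 0:
            return s
        s += shift.get(T[s + m - 1], m)  # shift by the last window letter
    return -1
```

On the slide's examples, bmh_search("detective date", "date") returns 10 and bmh_search("tea kettle", "kettle") returns 4.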

BMH Analysis

Worst-case running time:
- preprocessing: O(|Σ|+m);
- searching: O(nm) – what input gives this bound?
- total: O(nm).

Space: O(|Σ|), independent of m.

On real-world data sets BMH is very fast.

Comparison

Let's compare the algorithms. Criteria:
- worst-case running time: preprocessing and searching;
- expected running time;
- space used;
- implementation complexity.