Advanced Algorithm Design and Analysis (Lecture 12)


SW5, fall 2006
Simonas Šaltenis, E1-215b, simas@cs.aau.dk

Nearest-Neighbor Queries

Nearest-neighbor query: given a set of multi-dimensional points S and a query point q, return a point p ∈ S closest to q, that is:
∀a ∈ S: d(q,p) ≤ d(q,a)

AALG, lecture 11, © Simonas Šaltenis, 2006

Nearest-Neighbor Queries

Applications:
- Find the closest gas station.
- Similarity search: find the image or song most similar to a given one. Represent each image (song) as a feature vector, i.e., a multi-dimensional point.

Can we program Consistent(MBR, Q) to get an NN query algorithm for R-trees? An NN query is similar to a circular range query, but we do not know the radius of the circle a priori!

Depth-First NN Search

[Figure: an R-tree over points p1..p13 grouped into leaf MBRs R3..R7 under inner MBRs R1 and R2, with pointers to data items in the leaves. The depth-first search first takes p8 as the candidate NN and later improves it to p12; 6 nodes are accessed in total.]

Depth-First Traversal

A branch-and-bound algorithm. Ideas:
- Use an upper estimate on the search range, prunedist, which is initially infinite.
- Update prunedist whenever points are retrieved.
- In a node, order the entries to be visited by their MINDIST to the query point, i.e., the minimum possible distance between q and any data point inside the entry's MBR.

[Figure: query point q and MBRs R1, R2, R3; when q lies inside an MBR, MINDIST(R3, q) = 0.]

Depth-First NN Search

prunedist ← ∞   // pruning distance
nncand ← NIL    // best NN candidate found so far

DepthFirstNN(q, node)  // search in a tree rooted at node
01 if node is a leaf then
02   foreach entry p in node do
03     if d(p,q) < prunedist then
04       prunedist ← d(p,q)
05       nncand ← p
06 else  // node is not a leaf
07   foreach entry <MBR, ptr> in node, sorted on MINDIST, do
08     if MINDIST(MBR, q) < prunedist then
09       DepthFirstNN(q, ReadNode(ptr))

Worst-case performance: O(n). Note that prunedist could also be updated after line 9, but this does not help (this can be proved).
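The pseudocode above can be sketched in runnable Python. The in-memory node encoding (("leaf", points) / ("inner", entries)), the mindist helper, and the use of Euclidean distance are assumptions of this sketch; a real R-tree would fetch nodes from disk via ReadNode.

```python
import math

def mindist(q, mbr):
    """Lower bound on the distance from point q to any point inside mbr.
    mbr is a pair of corner points ((lo_0, lo_1, ...), (hi_0, hi_1, ...))."""
    lo, hi = mbr
    s = 0.0
    for x, l, h in zip(q, lo, hi):
        d = max(l - x, 0.0, x - h)   # 0 if q is inside the MBR along this axis
        s += d * d
    return math.sqrt(s)

def dist(q, p):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, p)))

def depth_first_nn(q, node, state=None):
    """Branch-and-bound DFS. node is ('leaf', [points]) or
    ('inner', [(mbr, child), ...]); state carries prunedist and the candidate."""
    if state is None:
        state = {"prunedist": math.inf, "nncand": None}
    kind, entries = node
    if kind == "leaf":
        for p in entries:
            if dist(p, q) < state["prunedist"]:
                state["prunedist"] = dist(p, q)
                state["nncand"] = p
    else:
        # visit children in order of increasing MINDIST; prune the rest
        for mbr, child in sorted(entries, key=lambda e: mindist(q, e[0])):
            if mindist(q, mbr) < state["prunedist"]:
                depth_first_nn(q, child, state)
    return state["nncand"]
```

On a two-leaf toy tree, a query near one leaf prunes the other subtree entirely, exactly as lines 07-09 of the pseudocode prescribe.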

Optimal Algorithm

The depth-first algorithm is not optimal! An optimal algorithm should visit only the entries intersecting the NN-circle: here R1, R2, R7, and R5.

[Figure: the same R-tree (R1..R7, p1..p13), with the NN-circle around q intersecting R1, R2, R7, and R5.]

Best-First NN Search

Priority queue ordered on MINDIST; its contents evolve as entries are dequeued and expanded:
  R1, R2
  R2, R5, R4, R3
  R7, R5, R4, R6, R3
  R5, p12, R4, R6, R3
  p12
Nodes accessed: 5.

[Figure: the same R-tree (R1..R7, p1..p13) traversed best-first.]

Best-First NN Search

BestFirstNN(q, T)
01 Q ← ∅  // priority queue ordered on MINDIST to q
02 Q.enqueue(T.root)
03 while true do
04   p ← Q.dequeue()
05   if p is a data point then return p
06   else
07     foreach entry e in ReadNode(p) do
08       Q.enqueue(e)
09     prune Q by removing all entries after the first data point
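Here is a hedged Python sketch of the best-first search using a binary heap. The node encoding, the mindist/dist helpers (repeated so the sketch is self-contained), and the tie-breaking counter are my own assumptions; the pruning of line 09 is an optional memory optimization and is omitted here.

```python
import heapq
import itertools
import math

def mindist(q, mbr):
    """Lower bound on the distance from q to any point inside mbr."""
    lo, hi = mbr
    s = 0.0
    for x, l, h in zip(q, lo, hi):
        d = max(l - x, 0.0, x - h)
        s += d * d
    return math.sqrt(s)

def dist(q, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, p)))

def best_first_nn(q, root):
    """root is ('leaf', [points]) or ('inner', [(mbr, child), ...]).
    The heap holds (key, tiebreak, kind, payload); the first data point
    dequeued is the nearest neighbor."""
    counter = itertools.count()   # tie-breaker so heap tuples always compare
    heap = [(0.0, next(counter), "node", root)]
    while heap:
        key, _, kind, payload = heapq.heappop(heap)
        if kind == "point":
            return payload
        node_kind, entries = payload
        if node_kind == "leaf":
            for p in entries:
                heapq.heappush(heap, (dist(q, p), next(counter), "point", p))
        else:
            for mbr, child in entries:
                heapq.heappush(heap, (mindist(q, mbr), next(counter),
                                      "node", child))
    return None
```

Because MINDIST never overestimates, a data point can only be dequeued once every node that could contain something closer has already been expanded, which is why returning the first dequeued point is correct.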

NN Querying: Conclusions

Best-first:
- Uses O(n) memory in the worst case.
- Is optimal (in nodes accessed) for a given R-tree, but still runs in O(n) in the worst case.
Depth-first:
- Uses only O(log n) memory.
- Runs in O(n) in the worst case.
Both algorithms can be generalized to find not one but k nearest neighbors, and both also work on kd-trees and other hierarchies of enclosing descriptions.

Text-Search Algorithms

Goals of the lecture:
- the naive text-search algorithm and its analysis;
- the Rabin-Karp algorithm and its analysis;
- the ideas of the Knuth-Morris-Pratt algorithm.

AALG, lecture 12, © Simonas Šaltenis, 2006

Text-Search Problem

Input:
- Text T = “at the thought of”, n = length(T) = 17
- Pattern P = “the”, m = length(P) = 3
Output:
- Shift s: the smallest integer (0 ≤ s ≤ n – m) such that T[s .. s+m–1] = P[0 .. m–1]; return –1 if no such s exists.
Example: for the text and pattern above, the answer is s = 3.

Naïve Text Search

Idea: brute force. Check all values of s from 0 to n – m.

Naive-Search(T,P)
01 for s ← 0 to n – m do
02   j ← 0
03   // check if T[s..s+m–1] = P[0..m–1]
04   while j < m and T[s+j] = P[j] do
05     j ← j + 1
06   if j = m then return s
07 return –1

Let T = “at the thought of” and P = “though”. What is the number of character comparisons?
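The pseudocode transcribes directly into runnable Python (the function name naive_search is my own):

```python
def naive_search(T, P):
    """Return the smallest shift s with T[s:s+m] == P, or -1 if none exists."""
    n, m = len(T), len(P)
    for s in range(n - m + 1):
        j = 0
        # check character by character whether P matches T at shift s
        while j < m and T[s + j] == P[j]:
            j += 1
        if j == m:
            return s
    return -1
```

For the slide's example, naive_search("at the thought of", "though") returns the shift 7, after testing and rejecting the earlier shifts one character at a time.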

Analysis of Naïve Text Search

Worst case:
- outer loop: n – m + 1 iterations;
- inner loop: up to m comparisons;
- total: (n–m+1)m = O(nm). What input gives this worst-case behavior?
Best case: n – m + 1 comparisons. When?
Completely random text and pattern: O(n–m) expected.

Fingerprint Idea

Assume:
- We can compute a fingerprint f(P) of P in O(m) time.
- If f(P) ≠ f(T[s .. s+m–1]), then P ≠ T[s .. s+m–1].
- We can compare fingerprints in O(1).
- We can compute f’ = f(T[s+1 .. s+m]) from f = f(T[s .. s+m–1]) in O(1).

Algorithm with Fingerprints

Let the alphabet Σ = {0,1,2,3,4,5,6,7,8,9} and let the fingerprint be just the corresponding decimal number, e.g., f(“1045”) = 1·10^3 + 0·10^2 + 4·10^1 + 5 = 1045.

Fingerprint-Search(T,P)
01 fp ← compute f(P)
02 f ← compute f(T[0..m–1])
03 for s ← 0 to n – m do
04   if fp = f then return s
05   f ← (f – T[s]·10^(m–1))·10 + T[s+m]
06 return –1

Running time: 2·O(m) + O(n–m) = O(n)! Where is the catch?

Using a Hash Function

Problem: we cannot assume that arithmetic on m-digit-long numbers takes O(1) time!
Solution: use a hash function h = f mod q. For example, if q = 7, h(“52”) = 52 mod 7 = 3.
- h(S1) ≠ h(S2) implies S1 ≠ S2.
- But h(S1) = h(S2) does not imply S1 = S2! For example, if q = 7, h(“73”) = 3, but “73” ≠ “52”.
Basic “mod q” arithmetic:
- (a+b) mod q = (a mod q + b mod q) mod q
- (a·b) mod q = ((a mod q)·(b mod q)) mod q

Preprocessing and Stepping

Preprocessing (Horner's rule):
fp = (P[m–1] + 10·(P[m–2] + 10·(P[m–3] + … + 10·(P[1] + 10·P[0])…))) mod q
Compute ft from T[0..m–1] in the same way.
Example: P = “2531”, q = 7; what is fp?

Stepping:
ft ← ((ft – T[s]·(10^(m–1) mod q))·10 + T[s+m]) mod q
10^(m–1) mod q can be computed once, in the preprocessing.
Example: let T[…] = “5319”, q = 7; what is the corresponding ft?
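As a sketch, the Horner preprocessing and one rolling step can be checked with the slide's numbers. The text "25319" is my own choice so that the two example windows, "2531" and "5319", are adjacent:

```python
def fingerprint(s, q):
    """Horner evaluation of the decimal digit string s modulo q."""
    f = 0
    for ch in s:
        f = (10 * f + int(ch)) % q
    return f

q, m = 7, 4
T = "25319"                  # assumed text containing both example windows
c = pow(10, m - 1, q)        # 10^(m-1) mod q, computed once in preprocessing
ft = fingerprint(T[0:m], q)  # fingerprint of "2531"
print(ft)                    # -> 4, i.e. 2531 mod 7
# one rolling step: drop the leading digit T[0], append T[m]
ft = ((ft - int(T[0]) * c) * 10 + int(T[m])) % q
print(ft)                    # -> 6, i.e. 5319 mod 7
```

Python's % operator already returns a non-negative remainder, so the subtraction inside the stepping formula needs no extra correction.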

Rabin-Karp Algorithm

Rabin-Karp-Search(T,P)
01 q ← a prime larger than m
02 c ← 10^(m–1) mod q  // run a loop multiplying by 10 mod q
03 fp ← 0; ft ← 0
04 for i ← 0 to m–1 do  // preprocessing
05   fp ← (10·fp + P[i]) mod q
06   ft ← (10·ft + T[i]) mod q
07 for s ← 0 to n – m do  // matching
08   if fp = ft then  // run a loop to compare the strings
09     if P[0..m–1] = T[s..s+m–1] then return s
10   ft ← ((ft – T[s]·c)·10 + T[s+m]) mod q
11 return –1

How many character comparisons are done if T = “2531978” and P = “1978” (and q = 7)?
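A runnable Python version of the pseudocode, restricted (as on the slides) to decimal digit strings; passing q as a parameter rather than choosing a prime internally is my own simplification. The window update is guarded so line 10 never reads past the end of T:

```python
def rabin_karp(T, P, q):
    """Rabin-Karp over decimal digit strings (radix 10) with modulus q.
    Returns the smallest matching shift, or -1."""
    n, m = len(T), len(P)
    if m > n:
        return -1
    c = pow(10, m - 1, q)                 # 10^(m-1) mod q
    fp = ft = 0
    for i in range(m):                    # preprocessing
        fp = (10 * fp + int(P[i])) % q
        ft = (10 * ft + int(T[i])) % q
    for s in range(n - m + 1):            # matching
        if fp == ft and T[s:s + m] == P:  # verify only on a fingerprint hit
            return s
        if s < n - m:                     # roll the window one digit right
            ft = ((ft - int(T[s]) * c) * 10 + int(T[s + m])) % q
    return -1
```

On the slide's example, rabin_karp("2531978", "1978", 7) returns 3; with q = 7 the window "2531" happens to collide with the pattern's fingerprint, forcing one spurious string comparison before the true match.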

Analysis

If q is a prime, the hash function distributes m-digit strings evenly among the q values. Thus, only every q-th value of the shift s is expected to produce matching fingerprints (each of which requires comparing the strings, with O(m) character comparisons).
Expected running time (if q > m):
- preprocessing: O(m);
- outer loop: O(n–m);
- all inner loops together: ((n–m)/q)·m = O(n–m);
- total: O(n–m).
Worst-case running time: O(nm).

Rabin-Karp in Practice

- If the alphabet has d characters, interpret the characters as radix-d digits (replace 10 with d in the algorithm).
- Choosing a prime q > m can be done with randomized algorithms in O(m) time, or q can be fixed to be the largest prime such that 10·q still fits in a computer word.
- Rabin-Karp is simple and can easily be extended to two-dimensional pattern matching.

Searching in n Comparisons

The goal: each character of the text is compared only once!
Problem with the naïve algorithm: it forgets what was learned from a partial match! Examples:
- T = “Tweedledee and Tweedledum” and P = “Tweedledum”
- T = “pappappappar” and P = “pappar”

General Situation

State of the algorithm: checking shift s, q characters of P have matched, and we see a non-matching character σ = T[s+q] in the text.
Need to find: the largest prefix of P that is a suffix of what has been read from the text so far, including the mismatched character, i.e., of P[0..q–1]σ:
new q’ = max{k ≤ q | P[0..k–1] is a suffix of P[0..q–1]σ}

[Figure: P aligned against T at shift s, showing the matched prefix of length q, the mismatched character σ, and the re-aligned prefix of length q’.]

Finite Automaton Search

Algorithm:
- Preprocess: for each q (0 ≤ q ≤ m–1) and each character σ ∈ Σ, pre-compute the new value of q; call it δ(q,σ). This fills a table of size m·|Σ|.
- Run through the text. Whenever a mismatch is found (P[q] ≠ T[s+q]): set s ← s + q – δ(q,σ) + 1 and q ← δ(q,σ).
Analysis:
- Matching phase in O(n).
- But: too much memory, O(m·|Σ|), and too much preprocessing, at best O(m·|Σ|).
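A hedged sketch of the automaton approach: the table entry for state q and character a is the length of the longest prefix of P that is a suffix of P[:q]+a. The naive construction below is deliberately simple and much slower than the O(m·|Σ|) preprocessing the slide alludes to; the alphabet is passed in explicitly as an assumption of this sketch.

```python
def build_transition_table(P, alphabet):
    """delta[q][a] = length of the longest prefix of P that is a suffix
    of P[:q] + a; one row per state q = 0..m."""
    m = len(P)
    delta = []
    for q in range(m + 1):
        row = {}
        for a in alphabet:
            k = min(m, q + 1)
            # shrink k until P[:k] is a suffix of P[:q] + a
            while k > 0 and P[:k] != (P[:q] + a)[q + 1 - k:]:
                k -= 1
            row[a] = k
        delta.append(row)
    return delta

def automaton_search(T, P, alphabet):
    """Matching phase: one table lookup per text character, hence O(n)."""
    delta = build_transition_table(P, alphabet)
    q = 0
    for i, c in enumerate(T):
        q = delta[q].get(c, 0)   # characters outside the alphabet reset q
        if q == len(P):
            return i - len(P) + 1
    return -1
```

Once the table is built, the matching loop never backs up in the text, which is exactly the property the slide's analysis claims.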

Prefix Function

Idea: forget the unmatched character σ!
State of the algorithm: checking shift s, q characters of P have matched, and we see a non-matching character in T.
Need to find: the largest prefix of P that is a suffix of P[0..q–1], the matched part only; the mismatched text character will then be compared again:
new q’ = π[q] = max{k < q | P[0..k–1] = P[q–k..q–1]}

[Figure: P re-aligned against T[s..s+q], showing the matched prefix of length q, the new prefix of length q’, and the text character that is compared again.]

Prefix Table

We can pre-compute a prefix table of size m storing the values of π[q] (0 < q ≤ m). For example, for P = “pappar”:

q       1  2  3  4  5  6
P[q–1]  p  a  p  p  a  r
π[q]    0  0  1  1  2  0

Compute a prefix table for: P = “dadadu”.
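The table can be computed in O(m) with the standard prefix-function algorithm; this sketch keeps the slide's 1-based indexing by leaving pi[0] unused. It also answers the exercise for P = “dadadu”:

```python
def compute_prefix_table(P):
    """pi[q] = max{k < q | P[0:k] == P[q-k:q]} for q = 1..m (pi[0] unused)."""
    m = len(P)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        # fall back through shorter prefixes until one can be extended
        while k > 0 and P[k] != P[q - 1]:
            k = pi[k]
        if P[k] == P[q - 1]:
            k += 1
        pi[q] = k
    return pi

print(compute_prefix_table("dadadu")[1:])  # -> [0, 0, 1, 2, 3, 0]
```

Each step either extends the current prefix by one character or falls back via previously computed values, so the total work is O(m).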

Knuth-Morris-Pratt Algorithm

KMP-Search(T,P)
01 π ← Compute-Prefix-Table(P)
02 q ← 0  // number of characters matched
03 for i ← 0 to n–1 do  // scan the text from left to right
04   while q > 0 and P[q] ≠ T[i] do
05     q ← π[q]
06   if P[q] = T[i] then q ← q + 1
07   if q = m then return i – m + 1
08 return –1

Compute-Prefix-Table is essentially the same KMP search algorithm, run with P matched against itself. What is the running time?
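A self-contained Python sketch of KMP; the prefix-table computation is inlined so the function runs on its own, mirroring the observation that the table is built by the same matching logic applied to P itself:

```python
def kmp_search(T, P):
    """Knuth-Morris-Pratt search: O(n + m) time, O(m) extra space."""
    n, m = len(T), len(P)
    # build the prefix table pi[1..m] (pi[0] unused, 1-based as on the slides)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and P[k] != P[q - 1]:
            k = pi[k]
        if P[k] == P[q - 1]:
            k += 1
        pi[q] = k
    # scan the text from left to right
    q = 0                          # number of characters matched
    for i in range(n):
        while q > 0 and P[q] != T[i]:
            q = pi[q]              # fall back without re-reading the text
        if P[q] == T[i]:
            q += 1
        if q == m:
            return i - m + 1
    return -1
```

On the earlier examples, kmp_search("pappappappar", "pappar") returns 6 and kmp_search("Tweedledee and Tweedledum", "Tweedledum") returns 15, with each text character advanced over exactly once.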

Analysis of KMP

Worst-case running time: O(n+m)
- main algorithm: O(n)
- Compute-Prefix-Table: O(m)
Space usage: O(m)

AALG, lecture 12, © Simonas Šaltenis, 2006