String Sorts Tries Substring Search: KMP, BM, RK


Lecture 16: Strings. String Sorts, Tries, Substring Search: KMP, BM, RK

String Sorts: key-indexed counting, LSD radix sort, MSD radix sort. ACKNOWLEDGEMENTS: http://algs4.cs.princeton.edu

String Sorts: key-indexed counting, LSD radix sort, MSD radix sort

Review: sorting algorithms. Lower bound: ~N lg N compares are required by any compare-based sorting algorithm.

Key-indexed counting assumptions. Assumption: keys are integers between 0 and R - 1. Implication: can use a key as an array index. Applications: sort strings by first letter; sort a class roster by section; sort phone numbers by area code; subroutine in a sorting algorithm.

Key-indexed counting demo. Goal: sort an array a[] of N integers between 0 and R - 1. Count frequencies of each key, using the key as an index. Compute frequency cumulates, which specify the destinations. Access the cumulates, using the key as an index, to move items. Copy back into the original array.

Key-indexed counting analysis Proposition. Key-indexed counting takes time proportional to N+R. Proposition. Key-indexed counting uses extra space proportional to N+R. Stable? Yes.
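The four demo steps translate directly into code. Below is a minimal Java sketch in the spirit of the algs4 implementation; the class and method names are ours:

```java
public class KeyIndexedCounting {
    // Stably sorts a[] of N keys in 0..R-1 in time and space proportional to N + R.
    public static int[] sort(int[] a, int R) {
        int N = a.length;
        int[] count = new int[R + 1];   // count[r+1] = frequency of key r
        int[] aux = new int[N];
        for (int i = 0; i < N; i++)     // 1. count frequencies, key as index
            count[a[i] + 1]++;
        for (int r = 0; r < R; r++)     // 2. compute cumulates (destinations)
            count[r + 1] += count[r];
        for (int i = 0; i < N; i++)     // 3. move items, key as index into cumulates
            aux[count[a[i]]++] = a[i];
        return aux;                     // 4. caller copies back for an in-place sort
    }
}
```

Stability comes from step 3 scanning a[] left to right: equal keys land in their original relative order.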

key-indexed counting LSD radix sort MSD radix sort String Sorts key-indexed counting LSD radix sort MSD radix sort

Least-significant-digit-first string sort. LSD string (radix) sort: consider characters from right to left; stably sort using the d-th character as the key (via key-indexed counting).

LSD string sort: correctness proof Proposition. LSD sorts fixed-length strings in ascending order. Pf. [ by induction on i ] After pass i, strings are sorted by last i characters. If two strings differ on sort key, key-indexed sort puts them in proper relative order. If two strings agree on sort key, stability keeps them in proper relative order. Proposition. LSD sort is stable. Pf. Key-indexed counting is stable.
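A minimal Java sketch of LSD string sort for fixed-length strings, in the spirit of the algs4 code (W is the fixed string length; an extended-ASCII alphabet is our assumption):

```java
public class LSD {
    // Stably sorts an array of W-character strings: one key-indexed
    // counting pass per character position, right to left.
    public static void sort(String[] a, int W) {
        int R = 256;                        // extended ASCII alphabet
        int N = a.length;
        String[] aux = new String[N];
        for (int d = W - 1; d >= 0; d--) {
            int[] count = new int[R + 1];
            for (int i = 0; i < N; i++)     // count frequencies of d-th char
                count[a[i].charAt(d) + 1]++;
            for (int r = 0; r < R; r++)     // compute cumulates
                count[r + 1] += count[r];
            for (int i = 0; i < N; i++)     // move items (stable)
                aux[count[a[i].charAt(d)]++] = a[i];
            for (int i = 0; i < N; i++)     // copy back
                a[i] = aux[i];
        }
    }
}
```

Each pass is stable, so by the inductive argument above the strings end up sorted by their last W - d characters after each pass.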

Summary: sorting algorithms

key-indexed counting LSD radix sort MSD radix sort String Sorts key-indexed counting LSD radix sort MSD radix sort

Reverse LSD. Consider characters from left to right; stably sort using the d-th character as the key (via key-indexed counting). This does not work as-is: later passes destroy the order established by earlier, more significant characters, which motivates sorting the subarrays recursively.

Most-significant-digit-first string sort MSD string (radix) sort. Partition array into R pieces according to first character (use key-indexed counting). Recursively sort all strings that start with each character (key-indexed counts delineate subarrays to sort).
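A Java sketch of MSD string sort for variable-length strings, in the style of the algs4 implementation (here charAt returns -1 past the end of a string, implementing the "extra char smaller than any char" convention; no small-subarray cutoff is included):

```java
public class MSD {
    private static final int R = 256;       // extended ASCII alphabet
    private static String[] aux;

    // -1 past the end of the string, so shorter strings sort first.
    private static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;
    }

    public static void sort(String[] a) {
        aux = new String[a.length];
        sort(a, 0, a.length - 1, 0);
    }

    private static void sort(String[] a, int lo, int hi, int d) {
        if (hi <= lo) return;
        int[] count = new int[R + 2];       // extra slot for the -1 sentinel
        for (int i = lo; i <= hi; i++)      // count frequencies of d-th char
            count[charAt(a[i], d) + 2]++;
        for (int r = 0; r < R + 1; r++)     // compute cumulates
            count[r + 1] += count[r];
        for (int i = lo; i <= hi; i++)      // distribute
            aux[count[charAt(a[i], d) + 1]++] = a[i];
        for (int i = lo; i <= hi; i++)      // copy back
            a[i] = aux[i - lo];
        for (int r = 0; r < R; r++)         // recursively sort each subarray
            sort(a, lo + count[r], lo + count[r + 1] - 1, d + 1);
    }
}
```

The key-indexed counts delineate the R subarrays to sort recursively, exactly as the slide describes.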

MSD string sort example

Variable-length strings. Treat strings as if they had an extra char at the end (smaller than any char). C strings already have the extra char '\0' at the end => no extra work needed.

MSD string sort problem. Observation 1: much too slow for small subarrays, since each function call needs its own count[] array. ASCII (256 counts): 100x slower than a copy pass for N = 2. Unicode (65,536 counts): 32,000x slower for N = 2. Observation 2: the recursion produces a huge number of small subarrays.

Summary: sorting algorithms

Tries: retrieval, DFA simulation, trie, radix tree, suffix trie (suffix tree)

Trie over an alphabet. Word set: {i, a, in, at, an, inn, int, ate, age, adv, ant}. The letters on a path <=> a prefix of a word. Common prefix <=> common ancestor. Leaf node <=> the longest word along that path.

Trie construction. Node: edges to children (one per letter) and an is-end mark ("is leaf? / is end?"). Find word: for each letter, if an edge for the next letter exists, jump to the next node; if not, return false; at the end of the word, check the is-end mark. Add word: for each letter, if the edge does not exist, add a new node; jump to it; finally mark the last node as the end of a word.

Example. Find "ant". Find "and". Add "and".
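The find/add procedures above can be sketched as a small Java trie (a minimal sketch; a lowercase 26-letter alphabet is our assumption):

```java
public class Trie {
    private static final int R = 26;          // lowercase alphabet (an assumption)

    private static class Node {
        Node[] next = new Node[R];            // one edge per letter
        boolean isEnd;                        // marks the end of a complete word
    }

    private final Node root = new Node();

    // Add word: create missing nodes along the path, then mark the last one.
    public void add(String word) {
        Node x = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (x.next[c] == null) x.next[c] = new Node();
            x = x.next[c];
        }
        x.isEnd = true;
    }

    // Find word: follow edges; fail if an edge is missing, then check the mark.
    public boolean find(String word) {
        Node x = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (x.next[c] == null) return false;
            x = x.next[c];
        }
        return x.isEnd;
    }
}
```

With this sketch, finding "ant" after adding it succeeds, finding "and" fails on the missing 'd' edge, and adding "and" creates that node and marks it.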

Other versions: child-brother (left-child right-sibling) tree, double-array trie, ternary search trie. How to store the edges of a node? An array, a list, or a BST.

Analysis. Time complexity: add is proportional to the length of the string; find is proportional to the length of the string. Space complexity: proportional to the total length of all strings.

Radix tree. Every internal node has at least two children. Number of leaf nodes <= number of words. Space complexity: proportional to the number of words.

Application scenarios. Dictionary retrieval. Longest common prefix. Sorting support. Combined with KMP: the Aho-Corasick automaton. A trie of all suffixes of a word: the suffix tree.

Suffix tree. Suffix trie: the trie of all suffixes of a string. Example: the suffix trie of "BANANAS" has dictionary BANANAS, ANANAS, NANAS, ANAS, NAS, AS, S. Path-compressing the suffix trie gives the suffix tree.

Suffix Tree: Application Scenarios. Find a substring. Find the k-th smallest substring (in MSD-sorted order). Count the number of distinct substrings. Count how many times a substring repeats. Find the longest substring repeated at least twice. Find the longest common substring of two strings. Why these work: if o occurs in S, then o must be a prefix of some suffix of S. If T occurs twice in S, then S has two suffixes with T as a prefix, so the repeat count falls out naturally. The string spelled out on the path to the deepest internal (non-leaf) node is the longest repeated substring. For the longest common part of two strings S1 and S2: insert S1#S2$ into a suffix tree and find the deepest internal node whose subtree contains leaves from both strings (some containing # and some not).

Suffix tree: construction. Time complexity: constructing the suffix trie naively is O(n^2). Esko Ukkonen's algorithm builds the suffix tree in linear time: the tree has at most 2n nodes, and it is built online, step by step, letter by letter (prefix by prefix), with deferred updates. E.g. start from the prefix BAN (for BANANAS).

Substring Search Knuth-Morris-Pratt Algorithm Boyer-Moore Algorithm Rabin-Karp Algorithm

Knuth-Morris-Pratt Algorithm: solution by DFA simulation; how to construct the DFA; analysis of KMP.

Brute-force substring search
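The brute-force baseline (whose figure is omitted in this transcript) can be sketched in Java; the worst case takes ~M*N character compares:

```java
public class Brute {
    // Returns the index of the first occurrence of pat in txt, or txt.length().
    public static int search(String pat, String txt) {
        int M = pat.length(), N = txt.length();
        for (int i = 0; i <= N - M; i++) {
            int j;
            for (j = 0; j < M; j++)              // compare pattern left to right
                if (txt.charAt(i + j) != pat.charAt(j)) break;
            if (j == M) return i;                // full match at offset i
        }
        return N;                                // not found
    }
}
```

On the overview example below (text B C B A A B A C A A B A B A C A A, pattern A B A B A C), this repeatedly backs up in the text after partial matches; KMP's DFA eliminates that backup.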

Knuth, Morris and Pratt (1977)

Overview. Consider the text B C B A A B A C A A B A B A C A A and the pattern A B A B A C.

Steps Construct the DFA DFA simulation

How to construct the DFA Match situations:

How to construct the DFA. Mismatch situations: maintain X, the state the DFA would be in after reading the pattern shifted right by one (i.e. pat[1..j-1]). On a mismatch in state j the DFA should behave as in state X, so copy dfa[][X] to dfa[][j]; then advance the shadow state: X = dfa[pat.charAt(j)][X].

Code

public KMP(String pat) {                  // Build DFA from pattern.
    this.pat = pat;
    int M = pat.length();
    int R = 256;
    dfa = new int[R][M];
    dfa[pat.charAt(0)][0] = 1;
    for (int X = 0, j = 1; j < M; j++) {  // Compute dfa[][j].
        for (int c = 0; c < R; c++)
            dfa[c][j] = dfa[c][X];        // Copy mismatch cases.
        dfa[pat.charAt(j)][j] = j + 1;    // Set match case.
        X = dfa[pat.charAt(j)][X];        // Update restart state.
    }
}

Code

public int search(String txt) {           // Simulate operation of DFA on text.
    int M = pat.length();
    int N = txt.length();
    int i, j;
    for (i = 0, j = 0; i < N && j < M; i++)
        j = dfa[txt.charAt(i)][j];
    if (j == M) return i - M;             // found
    return N;                             // not found
}

Meaning of the DFA. State j records the length of the longest prefix of the pattern that is also a suffix of the text read so far. Example: for the pattern A B A B A B, the prefixes A, AB, ABA, ABAB, ABABA are matched against the corresponding suffixes of the text read so far.

Analysis. Time consumption: O(n + m). Advantages: low time cost, and the text is traversed only once, so KMP can be used on a stream. Disadvantage: the space for the DFA depends on the size of the character set.

Substring Search Knuth-Morris-Pratt Algorithm Boyer-Moore Algorithm Rabin-Karp Algorithm

Brute Suffix Match (SLOW!)

Algorithm: Brute-Suffix-Match(T, P)
j = 0;
while (j <= strlen(T) - strlen(P)) {
    for (i = strlen(P) - 1; i >= 0 && P[i] == T[i + j]; --i)
        ;                          /* scan the pattern right to left */
    if (i < 0)
        match;                     /* P occurs at shift j */
    ++j;                           /* always slide by one */
}

The Bad Symbol Heuristic: Easy Case. Observation 1. If c is known not to occur in P, then we need not consider an occurrence of P starting at T positions 1, 2, ..., or length(P): the whole pattern can slide past c.

The Bad Symbol Heuristic: More Interesting Case. Observation 2. If the rightmost occurrence of c in P is d positions from the right end of P, then we can slide P down d positions without checking for matches. (Example: the pattern A T T E N D E N C E sliding along a text.)

The Good Suffix Heuristic: Matched Suffix Occurs Elsewhere. Observation 3. We know that the next m characters of T match the final m characters of P (call them subp). We can slide P down so that the discovered occurrence of subp in T is aligned with the rightmost other occurrence of subp in P. (Example: the pattern A B L A B C A B sliding along a text.)

The Good Suffix Heuristic: Matched Suffix Does Not Occur Elsewhere. Observation 4. We know that the next m characters of T match the final m characters of P (subp). We can slide P down so that the discovered occurrence of subp in T is aligned with the longest prefix of P that is also a suffix of subp.

Observation 5. If the rightmost occurrence of c in P is to the right of the mismatched position, or no matched suffix reoccurs in P, the rules above would move P backwards to align the two; in that case, simply slide P right by one position.

Boyer-Moore Algorithm Demo. Text: WHICH-FINALLY-HALTS.-AT-THAT-POINT. Pattern: AT-THAT. (Each animation frame shows the pattern at its next shift.)

Boyer-Moore Algorithm

/* Bad-character table: shift for each alphabet symbol. */
void preBmBc(char *x, int m, int bmBc[]) {
    int i;
    for (i = 0; i < ASIZE; ++i)
        bmBc[i] = m;
    for (i = 0; i < m - 1; ++i)
        bmBc[x[i]] = m - i - 1;
}

/* suff[i] = length of the longest suffix of x ending at position i. */
void suffixes(char *x, int m, int suff[]) {
    int i, q;
    suff[m - 1] = m;
    for (i = m - 2; i >= 0; --i) {
        q = i;
        while (q >= 0 && x[q] == x[m - 1 - i + q])
            --q;
        suff[i] = i - q;
    }
}

/* Good-suffix table. */
void preBmGs(char *x, int m, int bmGs[]) {
    int i, j, suff[XSIZE];
    suffixes(x, m, suff);
    for (i = 0; i < m; ++i)
        bmGs[i] = m;
    j = 0;
    for (i = m - 1; i >= 0; --i)
        if (suff[i] == i + 1)
            for (; j < m - 1 - i; ++j)
                if (bmGs[j] == m)
                    bmGs[j] = m - 1 - i;
    for (i = 0; i <= m - 2; ++i)
        bmGs[m - 1 - suff[i]] = m - 1 - i;
}

Boyer-Moore Algorithm

j = 0;
while (j <= strlen(T) - strlen(P)) {
    for (i = strlen(P) - 1; i >= 0 && P[i] == T[i + j]; --i)
        ;                                  /* scan right to left */
    if (i < 0) {
        match;                             /* P occurs at shift j */
        j += bmGs[0];
    } else
        j += max(bmGs[i], bmBc[T[i + j]] - (m - 1 - i));
}

Best case: O(N/M). Worst case: O(NM).
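For contrast with the full C version above, the bad-character heuristic alone already yields a working (though worst-case slower) matcher. A Java sketch in the style of the algs4 BoyerMoore class:

```java
public class BoyerMooreBC {
    // Searches txt for pat using only the bad-character heuristic.
    // Returns the index of the first match, or txt.length() if none.
    public static int search(String pat, String txt) {
        int R = 256;                                   // extended ASCII alphabet
        int M = pat.length(), N = txt.length();
        int[] right = new int[R];
        for (int c = 0; c < R; c++) right[c] = -1;     // -1: char not in pattern
        for (int j = 0; j < M; j++)
            right[pat.charAt(j)] = j;                  // rightmost occurrence

        int skip;
        for (int i = 0; i <= N - M; i += skip) {
            skip = 0;
            for (int j = M - 1; j >= 0; j--) {         // scan pattern right to left
                if (pat.charAt(j) != txt.charAt(i + j)) {
                    // Never shift backwards (Observation 5): at least 1.
                    skip = Math.max(1, j - right[txt.charAt(i + j)]);
                    break;
                }
            }
            if (skip == 0) return i;                   // match found
        }
        return N;                                      // not found
    }
}
```

The Math.max(1, ...) guard is exactly the fix for the negative-shift case noted in Observation 5.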

Substring Search Knuth-Morris-Pratt Algorithm Boyer-Moore Algorithm Rabin-Karp Algorithm

Rabin-Karp Algorithm for string search. 谢立伟 (Xie Liwei), 5130379098, 2015.12

Basic Idea. Input: pattern P[1..m] and text T[1..n] (n >= m). Output: a shift s (0 <= s <= n - m) such that P[1..m] matches T[s+1..s+m].

Basic Idea. When m = 1: the problem reduces to a simple linear search. When m > 1: hashing will help.

Detailed Procedure (For convenience, numbers only.)

Detailed Procedure. Optimization (rolling hash): the next window's value follows from the previous one in O(1), e.g. 14152 ≡ 10*(31415 - 3*10000) + 2 (mod 13), giving 31415 mod 13 = 7 and 14152 mod 13 = 8. (A dynamic-programming-style update.)

Pseudocode

RABIN-KARP-MATCHER(T, P, d, q)
    n = T.length
    m = P.length
    h = d^(m-1) mod q
    p = 0
    t[0] = 0
    for i = 1 to m
        p = (d*p + P[i]) mod q
        t[0] = (d*t[0] + T[i]) mod q
    for s = 0 to n - m
        if p == t[s]
            if P[1..m] == T[s+1..s+m]
                print ("Find P with shift: " + s)
        if s < n - m
            t[s+1] = (d*(t[s] - T[s+1]*h) + T[s+m+1]) mod q

Space complexity: O(n - m) for the t[] array (O(1) if only the latest value is kept). Worst-case time complexity: O(m*(n - m)).
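The pseudocode translates to Java as follows (a sketch with the same names d and q; it keeps only the latest hash instead of the t[] array, and verifies the strings explicitly on a hash hit):

```java
public class RabinKarp {
    // Returns the first 0-based shift s at which pat occurs in txt, or -1.
    // d is the radix and q the modulus; q should be prime and d*q fit in an int.
    public static int search(String pat, String txt, int d, int q) {
        int m = pat.length(), n = txt.length();
        if (m > n) return -1;
        int h = 1;                                    // h = d^(m-1) mod q
        for (int i = 0; i < m - 1; i++) h = (h * d) % q;
        int p = 0, t = 0;
        for (int i = 0; i < m; i++) {                 // hash pattern and first window
            p = (d * p + pat.charAt(i)) % q;
            t = (d * t + txt.charAt(i)) % q;
        }
        for (int s = 0; s <= n - m; s++) {
            if (p == t && pat.equals(txt.substring(s, s + m)))
                return s;                             // verified hit
            if (s < n - m) {                          // roll the hash to next window
                t = (t + q - (int) ((long) txt.charAt(s) * h % q)) % q;
                t = (d * t + txt.charAt(s + m)) % q;
            }
        }
        return -1;
    }
}
```

The explicit pat.equals check on each hash hit is what makes the matcher correct even when two different windows collide under mod q.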

Analysis: effect of increasing q. Pros: reduces the chance of hash collisions; reduces the time spent confirming candidate matches. Cons: may increase the space requirement; may increase the cost of the mod operations.