Lecture 16: Strings. String Sorts · Tries · Substring Search (KMP, BM, RK)
String Sorts: key-indexed counting · LSD radix sort · MSD radix sort. ACKNOWLEDGEMENTS: http://algs4.cs.princeton.edu
Review: sorting algorithms. Lower bound. ~N lg N compares required by any compare-based sorting algorithm.
Key-indexed counting: assumptions. Assumption. Keys are integers between 0 and R - 1. Implication. Can use key as an array index. Applications. Sort strings by first letter. Sort class roster by section. Sort phone numbers by area code. Subroutine in a sorting algorithm.
Key-indexed counting demo Goal. Sort an array a[] of N integers between 0 and R - 1. Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array.
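The four steps of the demo translate directly into code. A minimal Java sketch (the method name and plain int keys are illustrative; sorting records by an integer key works the same way):

// Key-indexed counting: sort a[] of N keys in 0..R-1.
static void keyIndexedCounting(int[] a, int R) {
    int N = a.length;
    int[] aux = new int[N];
    int[] count = new int[R + 1];
    for (int i = 0; i < N; i++)        // 1. Count frequencies,
        count[a[i] + 1]++;             //    offset by one for cumulates.
    for (int r = 0; r < R; r++)        // 2. Cumulates: count[r] becomes
        count[r + 1] += count[r];      //    the first slot for key r.
    for (int i = 0; i < N; i++)        // 3. Move items to destinations.
        aux[count[a[i]]++] = a[i];
    for (int i = 0; i < N; i++)        // 4. Copy back.
        a[i] = aux[i];
}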
Key-indexed counting analysis Proposition. Key-indexed counting takes time proportional to N+R. Proposition. Key-indexed counting uses extra space proportional to N+R. Stable? Yes.
String Sorts: key-indexed counting · LSD radix sort · MSD radix sort
Least-significant-digit-first string sort LSD string (radix) sort. Consider characters from right to left. Stably sort using dth character as the key (using key-indexed counting).
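A compact Java sketch for fixed-length strings of width W, applying key-indexed counting once per character position from right to left (extended-ASCII alphabet assumed):

// LSD string sort: W-character fixed-length strings.
static void lsdSort(String[] a, int W) {
    int N = a.length;
    int R = 256;                       // extended ASCII
    String[] aux = new String[N];
    for (int d = W - 1; d >= 0; d--) { // one stable pass per character
        int[] count = new int[R + 1];
        for (int i = 0; i < N; i++)
            count[a[i].charAt(d) + 1]++;
        for (int r = 0; r < R; r++)
            count[r + 1] += count[r];
        for (int i = 0; i < N; i++)    // stable move by dth character
            aux[count[a[i].charAt(d)]++] = a[i];
        for (int i = 0; i < N; i++)
            a[i] = aux[i];
    }
}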
LSD string sort: correctness proof Proposition. LSD sorts fixed-length strings in ascending order. Pf. [ by induction on i ] After pass i, strings are sorted by last i characters. If two strings differ on sort key, key-indexed sort puts them in proper relative order. If two strings agree on sort key, stability keeps them in proper relative order. Proposition. LSD sort is stable. Pf. Key-indexed counting is stable.
Summary: sorting algorithms
String Sorts: key-indexed counting · LSD radix sort · MSD radix sort
"Reverse LSD". Consider characters from left to right. Stably sort using the dth character as the key (using key-indexed counting).
Most-significant-digit-first string sort MSD string (radix) sort. Partition array into R pieces according to first character (use key-indexed counting). Recursively sort all strings that start with each character (key-indexed counts delineate subarrays to sort).
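A sketch in the same Java style, following the algs4 implementation pattern. Here charAt() returns -1 one position past the end of a string, so shorter strings sort before longer ones with the same prefix (counts are therefore offset by 2):

// MSD string sort: partition on the dth character, then recur.
static final int R = 256;
static String[] aux;

static int charAt(String s, int d) {
    return d < s.length() ? s.charAt(d) : -1;   // -1 = end of string
}

static void msdSort(String[] a) {
    aux = new String[a.length];
    sort(a, 0, a.length - 1, 0);
}

static void sort(String[] a, int lo, int hi, int d) {
    if (hi <= lo) return;
    int[] count = new int[R + 2];
    for (int i = lo; i <= hi; i++)              // count frequencies
        count[charAt(a[i], d) + 2]++;
    for (int r = 0; r < R + 1; r++)             // cumulates
        count[r + 1] += count[r];
    for (int i = lo; i <= hi; i++)              // move items
        aux[count[charAt(a[i], d) + 1]++] = a[i];
    for (int i = lo; i <= hi; i++)              // copy back
        a[i] = aux[i - lo];
    for (int r = 0; r < R; r++)                 // sort R subarrays recursively
        sort(a, lo + count[r], lo + count[r + 1] - 1, d + 1);
}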
MSD string sort example
Variable-length strings. Treat strings as if they had an extra char at the end (smaller than any char). C strings. Have extra char '\0' at end => no extra work needed.
MSD string sort problem Observation 1. Much too slow for small subarrays. Each function call needs its own count[] array. ASCII (256 counts): 100x slower than copy pass for N=2. Unicode (65536 counts): 32000x slower for N=2. Observation 2. Huge number of small subarrays because of recursion.
Summary: sorting algorithms
Tries: Retrieve · DFA simulation · Trie · Radix Tree · Suffix Trie (Tree)
Alphabet. Words: {i, a, in, at, an, inn, int, ate, age, adv, ant}. Letters on the path <=> prefix of the word. Common prefix <=> common ancestor. Leaf node <=> longest prefix.
Construct. Node: an "is end?" (is leaf?) mark plus outgoing edges. Find word: is there an edge for the next letter? Exists: jump to the next node. Does not exist: return false. At the last letter: is the node marked as the end of a word? Add word: if the edge does not exist, add a new node and jump to it; finally, mark the node as end of word.
Example. Find "ant". Find "and". Add "and".
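A minimal R-way trie sketch matching the find/add steps above (the class name, lowercase-only alphabet, and the hasPrefix helper are illustrative choices):

// Minimal R-way trie over lowercase letters.
class Trie {
    private static final int R = 26;
    private final Trie[] next = new Trie[R];  // one edge per letter
    private boolean isEnd;                    // marks the end of a word

    void add(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.next[i] == null)         // edge does not exist:
                node.next[i] = new Trie();    // add a new node
            node = node.next[i];              // jump to it
        }
        node.isEnd = true;                    // mark end of word
    }

    boolean find(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.next[i] == null)         // edge does not exist
                return false;
            node = node.next[i];
        }
        return node.isEnd;                    // is a word marked as ending here?
    }

    boolean hasPrefix(String p) {             // does any word start with p?
        Trie node = this;
        for (char c : p.toCharArray()) {
            if (node.next[c - 'a'] == null) return false;
            node = node.next[c - 'a'];
        }
        return true;
    }
}

With the example dictionary, find("ant") returns true, find("and") returns false until add("and") creates the missing 'd' edge and marks its node.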
Other versions. Child-brother tree (a binary tree), double-array trie, ternary search tries. How to store the edges? Array, list, or BST.
Analysis. Time complexity: Add: O(length of the string). Find: O(length of the string). Space complexity: O(total length of all strings).
Radix Tree. Every internal node has at least two children. Number of leaf nodes <= number of words. Space complexity: O(number of words).
Application scenarios. Dictionary retrieval. Longest common prefix. Sorting. With KMP: Aho-Corasick automaton. Dictionary of all suffixes of a word: suffix tree.
Suffix Tree. Suffix trie: the trie of all suffixes of a word. Example: the suffix trie of "BANANAS" has dictionary BANANAS, ANANAS, NANAS, ANAS, NAS, AS, S. Path-compressed suffix trie: suffix tree.
Suffix Tree: application scenarios. Find a substring. Find the k-th substring in MSD (sorted) order. Count the number of distinct substrings. Count how many times a substring repeats. Find the longest repeated substring. Longest common substring of two strings. If o occurs in S, then o must be a prefix of some suffix of S. If T occurs twice in S, then S has two suffixes with T as a prefix, so the repeat count falls out of counting such suffixes. The string spelled out along the path to the deepest non-leaf node is the longest repeated substring. Longest common substring of S1 and S2: build the suffix tree of S1#S2$, then find the deepest non-leaf node whose subtree contains leaves from both strings (some containing #, some containing $ but no #).
Suffix Tree: construction. Time complexity: constructing a suffix trie directly is O(n^2). Esko Ukkonen's algorithm: nodes <= 2 * number of letters; builds the tree step by step, letter by letter, prefix by prefix, with deferred updates. E.g. BAN (for BANANAS).
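As a concrete (if naive) illustration of the quadratic construction, every suffix can simply be inserted into the Trie sketched earlier; Ukkonen's algorithm produces the compressed tree in linear time instead. Lowercase input is assumed to fit that sketch:

// Naive O(n^2) suffix-trie construction: insert every suffix.
Trie suffixTrie = new Trie();
String s = "bananas";
for (int i = 0; i < s.length(); i++)
    suffixTrie.add(s.substring(i));
// t occurs in s iff t is a prefix of some suffix of s:
boolean found = suffixTrie.hasPrefix("nan");  // true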
Substring Search: Knuth-Morris-Pratt Algorithm · Boyer-Moore Algorithm · Rabin-Karp Algorithm
Knuth-Morris-Pratt Algorithm: solution with DFA simulation · how to construct the DFA · analysis of KMP
Brute-force substring search
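For reference, a minimal Java sketch of brute-force search (~N M character compares in the worst case; returning N for "not found" matches the KMP search code later):

// Brute-force substring search: try every offset.
static int bruteSearch(String pat, String txt) {
    int M = pat.length(), N = txt.length();
    for (int i = 0; i <= N - M; i++) {
        int j;
        for (j = 0; j < M; j++)
            if (txt.charAt(i + j) != pat.charAt(j))
                break;                 // mismatch: slide pattern by one
        if (j == M) return i;          // found at offset i
    }
    return N;                          // not found
}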
Knuth, Morris and Pratt (1974)
Overview. Consider the text B C B A A B A C A A B A B A C A A and the pattern A B A B A C.
Steps. 1. Construct the DFA. 2. Simulate the DFA on the text.
How to construct the DFA. Match situations: in state j, reading pat.charAt(j) advances the DFA to state j + 1.
How to construct the DFA. Mismatch situations: maintain X, the state the DFA reaches on the pattern with its first character dropped (the substring starting from s[1]). For every mismatch character, copy dfa[][X] to dfa[][j]; then update X = dfa[pat.charAt(j)][X].
Code public KMP(String pat) { // Build DFA from pattern. this.pat = pat; int M = pat.length(); int R = 256; dfa = new int[R][M]; dfa[pat.charAt(0)][0] = 1; for (int X = 0, j = 1; j < M; j++) { // Compute dfa[][j]. for (int c = 0; c < R; c++) dfa[c][j] = dfa[c][X]; // Copy mismatch cases. dfa[pat.charAt(j)][j] = j + 1; // Set match case. X = dfa[pat.charAt(j)][X]; // Update restart state. }
Code public int search(String txt) { // simulate operation of DFA on text int M = pat.length(); int N = txt.length(); int i, j; for (i = 0, j = 0; i < N && j < M; i++) { j = dfa[txt.charAt(i)][j]; } if (j == M) return i - M; // found return N; // not found
Meaning of the DFA. The state is the length of the longest prefix of the pattern that is also a suffix of the text read so far. Examples of proper prefixes vs. proper suffixes: for A B B, the prefixes are A, AB and the suffixes are B, BB (no overlap); for A B A B A B, the prefixes are A, AB, ABA, ABAB, ABABA and the suffixes are B, AB, BAB, ABAB, BABAB (overlaps at AB and ABAB).
Analysis. Time consumption: O(n + m). Advantages: low time consumption; the text is traversed only once, so KMP can be used on a stream. Disadvantage: space consumption depends on the size of the character set (the DFA has R x M entries).
Substring Search: Knuth-Morris-Pratt Algorithm · Boyer-Moore Algorithm · Rabin-Karp Algorithm
Brute Suffix Match: compare the pattern against the text from right to left, sliding one position after every attempt. SLOW!

Algorithm: Brute-Suffix-Match(T, P)
j = 0;
while (j <= strlen(T) - strlen(P)) {
    for (i = strlen(P) - 1; i >= 0 && P[i] == T[i + j]; --i)
        ;                                /* scan pattern right to left */
    if (i < 0)
        match;                           /* P occurs at shift j */
    else
        ++j;                             /* slide pattern by one */
}
The Bad Symbol Heuristic: easy case. Observation 1. If the text character c that caused the mismatch is known not to occur in P, then we need not consider an occurrence of P starting at the next 1, 2, ..., length(P) text positions: P can slide entirely past c. (Example alignments against the text B A A B C A A B A A C B A A B.)
The Bad Symbol Heuristic: more interesting case. Observation 2. If the right-most occurrence of c in P is d positions from the right end of P, then P can slide down d positions without checking for matches. (Example alignments of the pattern A T T E N D E N C E.)
The Good Suffix Heuristic: matched suffix occurs elsewhere. Observation 3. We know that the last m characters of T examined match the final m characters of P (call them subp). P can slide down so that the discovered occurrence of subp in T is aligned with the rightmost other occurrence of subp in P. (Example alignments of the pattern A B L A B C A B.)
The Good Suffix Heuristic: matched suffix does not occur elsewhere. Observation 4. We know that the last m characters of T examined match the final m characters of P (subp). P can slide down so that the discovered occurrence of subp in T is aligned with the longest prefix of P that matches a suffix of subp. (Example alignments of the pattern A B L A B C A B D.)
Observation 5. If the rightmost occurrence of c in P is to the right of the mismatched position, or no matched suffix reoccurs in P, then aligning the two characters would move P backwards; hence the search takes the maximum of the two heuristics' shifts (see the code below). (Example alignments of the pattern A F L A E C A B D E N C E T.)
Boyer-Moore Algorithm demo. Search for the pattern AT-THAT in the text WHICH-FINALLY-HALTS.-AT-THAT-POINT, skipping ahead on each mismatch until the pattern aligns with -AT-THAT.
Boyer-Moore Algorithm: precomputing the two shift tables (ASIZE = alphabet size, XSIZE = maximum pattern length).

/* Bad-character table: distance of each symbol from the right end of x. */
void preBmBc(char *x, int m, int bmBc[]) {
    int i;
    for (i = 0; i < ASIZE; ++i)
        bmBc[i] = m;
    for (i = 0; i < m - 1; ++i)
        bmBc[x[i]] = m - i - 1;
}

/* suff[i] = length of the longest substring ending at i
   that is also a suffix of x. */
void suffixes(char *x, int m, int suff[]) {
    int i, q;
    suff[m - 1] = m;
    for (i = m - 2; i >= 0; --i) {
        q = i;
        while (q >= 0 && x[q] == x[m - 1 - i + q])
            --q;
        suff[i] = i - q;
    }
}

/* Good-suffix table. */
void preBmGs(char *x, int m, int bmGs[]) {
    int i, j, suff[XSIZE];
    suffixes(x, m, suff);
    for (i = 0; i < m; ++i)
        bmGs[i] = m;
    j = 0;
    for (i = m - 1; i >= 0; --i)
        if (suff[i] == i + 1)
            for (; j < m - 1 - i; ++j)
                if (bmGs[j] == m)
                    bmGs[j] = m - 1 - i;
    for (i = 0; i <= m - 2; ++i)
        bmGs[m - 1 - suff[i]] = m - 1 - i;
}
Boyer-Moore Algorithm: the search loop shifts by the larger of the two heuristics (note that the bad character is the text character T[i + j]).

j = 0;
while (j <= strlen(T) - strlen(P)) {
    for (i = strlen(P) - 1; i >= 0 && P[i] == T[i + j]; --i)
        ;                                /* scan right to left */
    if (i < 0)
        match;                           /* P occurs at shift j */
    else
        j += max(bmGs[i], bmBc[T[i + j]] - (m - 1 - i));
}

Best case: O(N/M). Worst case: O(NM).
Substring Search: Knuth-Morris-Pratt Algorithm · Boyer-Moore Algorithm · Rabin-Karp Algorithm
Rabin-Karp Algorithm for string search. 谢立伟, 5130379098, 2015.12.
Basic Idea. Input: P[1..m], the target string; T[1..n], the source text (n >= m). Output: every shift s (0 <= s <= n - m) such that T[s+1..s+m] = P[1..m].
Basic Idea. When m = 1: reduces to a simple linear search problem. When m > 1: hashing will help.
Detailed Procedure (For convenience, numbers only.)
Detailed Procedure. Optimization: compute each window's hash from the previous one (dynamic programming / rolling hash): t[s+1] = (d*(t[s] - T[s+1]*h) + T[s+m+1]) mod q. Example with d = 10, q = 13: 14152 ≡ 10*(31415 - 3*10000) + 2 (mod 13).
Pseudocode

RABIN-KARP-MATCHER(T, P, d, q)
    n = T.length
    m = P.length
    h = d^(m-1) mod q
    p = 0
    t[0] = 0
    for i = 1 to m                        // preprocessing: hash P and first window
        p = (d*p + P[i]) mod q
        t[0] = (d*t[0] + T[i]) mod q
    for s = 0 to n - m                    // matching
        if p == t[s]
            if P[1..m] == T[s+1..s+m]     // verify: hashes can collide
                print ("Find P with shift: " + s)
        if s < n - m                      // roll the hash to the next window
            t[s+1] = (d*(t[s] - T[s+1]*h) + T[s+m+1]) mod q

Space complexity: O(n - m) for the t[] array (O(1) if only the current hash is kept). Time complexity: O(m*(n-m)) in the worst case; expected O(n + m) with a good random prime q.
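The same matcher fits in a few lines of Java with a single rolling hash value (O(1) extra space). This sketch takes d and q as parameters to mirror the pseudocode; in practice q should be a large random prime:

// Rabin-Karp with one rolling hash value; returns first match shift, else n.
static int rabinKarp(String pat, String txt, int d, int q) {
    int m = pat.length(), n = txt.length();
    long h = 1;                                 // h = d^(m-1) mod q
    for (int i = 1; i < m; i++) h = (h * d) % q;
    long p = 0, t = 0;
    for (int i = 0; i < m; i++) {               // hash P and the first window
        p = (d * p + pat.charAt(i)) % q;
        t = (d * t + txt.charAt(i)) % q;
    }
    for (int s = 0; s <= n - m; s++) {
        if (p == t && pat.equals(txt.substring(s, s + m)))
            return s;                           // verified match at shift s
        if (s < n - m)                          // roll: drop T[s], add T[s+m]
            t = ((d * (t - txt.charAt(s) * h) % q) + q) % q;
    }
    return n;                                   // not found
}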
Analysis. When q increases: Pros: reduces the chance of hash collisions; decreases the time spent confirming candidate matches. Cons: (may) increase the space requirement; (may) increase the cost of each mod operation.