Presentation on theme: "String Sorts Tries Substring Search: KMP, BM, RK"— Presentation transcript:

1 String Sorts Tries Substring Search: KMP, BM, RK
Lecture 16: Strings. String Sorts, Tries, Substring Search: KMP, BM, RK

2 key-indexed counting LSD radix sort MSD radix sort
String Sorts key-indexed counting LSD radix sort MSD radix sort ACKNOWLEDGEMENTS:

3 key-indexed counting LSD radix sort MSD radix sort
String Sorts key-indexed counting LSD radix sort MSD radix sort

4 Review: sorting algorithms
Lower bound. ~ N lg N compares are required by any compare-based sorting algorithm.

5 Key-indexed counting assumptions
Assumption. Keys are integers between 0 and R - 1. Implication. Can use key as an array index. Applications. Sort strings by first letter. Sort class roster by section. Sort phone numbers by area code. Subroutine in a sorting algorithm.

6 Key-indexed counting demo
Goal. Sort an array a[] of N integers between 0 and R - 1. Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array.

7-10 Key-indexed counting demo (continued: the same five steps, illustrated frame by frame)

11 Key-indexed counting analysis
Proposition. Key-indexed counting takes time proportional to N+R. Proposition. Key-indexed counting uses extra space proportional to N+R. Stable? Yes.
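The propositions above can be read directly off a short implementation. A hedged Java sketch for integer keys in 0..R-1 (method and variable names are illustrative, not from the slides):

static void keyIndexedCountingSort(int[] a, int R) {
    int N = a.length;
    int[] aux = new int[N];
    int[] count = new int[R + 1];
    for (int i = 0; i < N; i++)            // 1. count frequencies, offset by one
        count[a[i] + 1]++;
    for (int r = 0; r < R; r++)            // 2. cumulates give each key's destination range
        count[r + 1] += count[r];
    for (int i = 0; i < N; i++)            // 3. distribute items stably into aux
        aux[count[a[i]]++] = a[i];
    for (int i = 0; i < N; i++)            // 4. copy back into the original array
        a[i] = aux[i];
}

The four loops touch N or R entries each, giving the ~ N + R time and extra space stated above; equal keys keep their input order, so the sort is stable.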

12 key-indexed counting LSD radix sort MSD radix sort
String Sorts key-indexed counting LSD radix sort MSD radix sort

13 Least-significant-digit-first string sort
LSD string (radix) sort. Consider characters from right to left. Stably sort using dth character as the key (using key-indexed counting).
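A hedged Java sketch of this pass structure for fixed-length strings of width W (the method name and the R = 256 extended-ASCII alphabet are assumptions):

static void lsdSort(String[] a, int W) {
    int N = a.length, R = 256;               // extended ASCII alphabet (assumption)
    String[] aux = new String[N];
    for (int d = W - 1; d >= 0; d--) {       // right to left: one key-indexed pass per position
        int[] count = new int[R + 1];
        for (int i = 0; i < N; i++)
            count[a[i].charAt(d) + 1]++;
        for (int r = 0; r < R; r++)
            count[r + 1] += count[r];
        for (int i = 0; i < N; i++)          // stable distribution on the dth character
            aux[count[a[i].charAt(d)]++] = a[i];
        for (int i = 0; i < N; i++)
            a[i] = aux[i];
    }
}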

14 LSD string sort: correctness proof
Proposition. LSD sorts fixed-length strings in ascending order. Pf. [ by induction on i ] After pass i, strings are sorted by last i characters. If two strings differ on sort key, key-indexed sort puts them in proper relative order. If two strings agree on sort key, stability keeps them in proper relative order. Proposition. LSD sort is stable. Pf. Key-indexed counting is stable.

15 Summary: sorting algorithms

16 key-indexed counting LSD radix sort MSD radix sort
String Sorts key-indexed counting LSD radix sort MSD radix sort

17 Reverse LSD Consider characters from left to right.
Stably sort using dth character as the key (using key-indexed counting).

18 Most-significant-digit-first string sort
MSD string (radix) sort. Partition array into R pieces according to first character (use key-indexed counting). Recursively sort all strings that start with each character (key-indexed counts delineate subarrays to sort).
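A hedged Java sketch of MSD string sort along these lines, using charAt() = -1 as an end-of-string sentinel so shorter strings sort first (class shape and names are assumptions; the small-subarray cutoff discussed on a later slide is omitted):

public class MSD {
    private static final int R = 256;                      // extended ASCII alphabet (assumption)

    private static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;          // -1 signals "past the end"
    }

    public static void sort(String[] a) {
        sort(a, new String[a.length], 0, a.length - 1, 0);
    }

    private static void sort(String[] a, String[] aux, int lo, int hi, int d) {
        if (hi <= lo) return;
        int[] count = new int[R + 2];                      // +2 leaves room for the -1 sentinel
        for (int i = lo; i <= hi; i++)                     // count frequencies of dth characters
            count[charAt(a[i], d) + 2]++;
        for (int r = 0; r < R + 1; r++)                    // cumulates delineate the subarrays
            count[r + 1] += count[r];
        for (int i = lo; i <= hi; i++)                     // distribute stably
            aux[count[charAt(a[i], d) + 1]++] = a[i];
        for (int i = lo; i <= hi; i++)                     // copy back
            a[i] = aux[i - lo];
        for (int r = 0; r < R; r++)                        // recursively sort each character class
            sort(a, aux, lo + count[r], lo + count[r + 1] - 1, d + 1);
    }
}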

19 MSD string sort example

20 Variable-length strings
Treat strings as if they had an extra char at the end (smaller than any char). C strings. Have extra char ‘\0’ at end => no extra work needed.

21 MSD string sort problem
Observation 1. Much too slow for small subarrays. Each function call needs its own count[] array. ASCII (256 counts): 100x slower than copy pass for N=2. Unicode (65536 counts): 32000x slower for N=2. Observation 2. Huge number of small subarrays because of recursion.

22 Summary: sorting algorithms

23 Retrieve DFA simulation Trie Radix Tree Suffix Trie(Tree)
Tries: Retrieval, DFA simulation, Trie, Radix Tree, Suffix Trie (Tree)

24 Alphabet Word:{i,a,in,at,an,inn,int,ate,age,adv,ant}
Letters on the path <=> a prefix of the word. Common prefix <=> common ancestor. Leaf node <=> longest prefix.

25 Construct Node Find word Add word Is Leaf?(Is End?) Edge.
Construct. A node stores an "is end of word?" (is leaf?) flag and edges to its children. Find word: follow the edge for the next letter; if it exists, jump to the next node, if not, return false; at the end of the word, check the end-of-word mark. Add word: follow edges the same way; where an edge does not exist, add a new node and jump to it; finally mark the last node as the end of the word.
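A minimal R-way trie sketch in Java illustrating the find/add operations just described (a set of words over lowercase letters; the class name and layout are assumptions, not from the slides):

public class Trie {
    private static final int R = 26;              // lowercase alphabet (assumption)

    private static class Node {
        Node[] next = new Node[R];                // one edge per letter
        boolean isEnd;                            // marks the end of a word
    }

    private final Node root = new Node();

    public void add(String word) {                // add: create missing nodes, mark the last one
        Node x = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (x.next[c] == null) x.next[c] = new Node();
            x = x.next[c];
        }
        x.isEnd = true;
    }

    public boolean find(String word) {            // find: missing edge => false; else check end mark
        Node x = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (x.next[c] == null) return false;
            x = x.next[c];
        }
        return x.isEnd;
    }
}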

26 Example Find ”ant” Find ”and” Add ”and”
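A brief client run mirroring this example, using the hypothetical Trie class sketched after the previous slide:

Trie t = new Trie();
t.add("ant");
System.out.println(t.find("ant"));   // true
System.out.println(t.find("and"));   // false: the edge for 'd' under "an" does not exist
t.add("and");                        // creates the missing node and marks it as a word end
System.out.println(t.find("and"));   // true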

27 Other version Child-Brother Tree Double Array Trie
Child-brother (left-child, right-sibling) binary tree representation, double-array trie, ternary search tries. How to store the edges of a node? Array, list, or BST.

28 Analysis
Time complexity: add and find each take time proportional to the length of the string. Space complexity: proportional to the total length of all strings.

29 Radix Tree
Every internal node has at least two children. Number of leaf nodes <= number of words. Space complexity: proportional to the number of words.

30 Application Scenarios
Dictionary retrieval. Longest common prefix. Sorting support. Combined with KMP: the Aho-Corasick automaton. A dictionary of all suffixes of a word: the suffix tree.

31 Suffix Tree
Suffix trie: the trie built from all suffixes of a word. Example: the suffix trie of “BANANAS” is the trie over the dictionary BANANAS, ANANAS, NANAS, ANAS, NAS, AS, S. Path compression turns the suffix trie into a suffix tree.

32 Suffix Tree: Application Scenarios
Find a substring. Find the k-th substring in MSD (sorted) order. Count the number of distinct substrings. Count how many times a substring occurs. Find the longest repeated substring (occurring >= 2 times). Find the longest common substring of two strings. If o occurs in S, then o must be a prefix of some suffix of S. If T occurs twice in S, then two suffixes of S have T as a prefix, so the number of occurrences can be read off directly. The string spelled out along the path to the deepest internal (non-leaf) node is the longest repeated substring. (4) Longest common substring of S1 and S2: insert S1#S2$ into a suffix tree and find the deepest internal node whose leaf descendants include both a suffix containing # and a suffix containing only $ (no #).

33 Suffix Tree: Construction
Time complexity: building the suffix trie naively takes O(n^2). Esko Ukkonen's algorithm builds the suffix tree in linear time: the tree has at most 2n nodes, and it is constructed online, step by step (letter by letter, prefix by prefix), with deferred updates. E.g. the prefix BAN (of BANANAS).

34 Substring Search Knuth-Morris-Pratt Algorithm Boyer-Moore Algorithm
Rabin-Karp Algorithm


36 Knuth-Morris-Pratt Algorithm
Solution with DFA simulation How to construct DFA Analysis of KMP

37 Brute-force substring search
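For comparison with KMP below, a hedged Java sketch of brute-force substring search (the method name is illustrative):

public static int bruteSearch(String pat, String txt) {
    int M = pat.length(), N = txt.length();
    for (int i = 0; i <= N - M; i++) {            // try every alignment of pat in txt
        int j;
        for (j = 0; j < M; j++)
            if (txt.charAt(i + j) != pat.charAt(j)) break;
        if (j == M) return i;                     // match found at offset i
    }
    return N;                                     // not found
}

Worst case: about N*M character compares, which is what KMP avoids.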

38 Knuth, Morris and Pratt (1974)

39 Overview Consider the string B C B A A B A C A A B A B A C A A
and the pattern A B A B A C.

40 Steps Construct the DFA DFA simulation

41 How to construct the DFA
Match situations: in state j, reading the pattern character pat.charAt(j) advances the DFA to state j + 1 (in the code: dfa[pat.charAt(j)][j] = j + 1).

42 How to construct the DFA
Mismatch situations: maintain X, the state the DFA would be in after reading the pattern shifted by one character (i.e., pat[1..j-1]). For every mismatch character c at state j, copy dfa[c][X] to dfa[c][j]; then update X = dfa[pat.charAt(j)][X].

43 Code
public KMP(String pat) {
    // Build DFA from pattern.
    this.pat = pat;
    int M = pat.length();
    int R = 256;
    dfa = new int[R][M];
    dfa[pat.charAt(0)][0] = 1;
    for (int X = 0, j = 1; j < M; j++) {
        // Compute dfa[][j].
        for (int c = 0; c < R; c++)
            dfa[c][j] = dfa[c][X];          // Copy mismatch cases.
        dfa[pat.charAt(j)][j] = j + 1;      // Set match case.
        X = dfa[pat.charAt(j)][X];          // Update restart state.
    }
}

44 Code
public int search(String txt) {
    // Simulate operation of DFA on text.
    int M = pat.length();
    int N = txt.length();
    int i, j;
    for (i = 0, j = 0; i < N && j < M; i++)
        j = dfa[txt.charAt(i)][j];
    if (j == M) return i - M;   // found
    return N;                   // not found
}
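For the two methods above to compile as a class, pat and dfa must be fields. A hedged sketch of the wrapper and a client call (the class shape is an assumption; the text and pattern are the ones from the Overview slide):

public class KMP {
    private String pat;        // the pattern
    private int[][] dfa;       // dfa[c][j]: next state after reading character c in state j

    // constructor (slide 43) and search (slide 44) go here unchanged
}

// Client usage:
// KMP kmp = new KMP("ABABAC");
// int offset = kmp.search("BCBAABACAABABACAA");   // 9: the pattern starts at index 9 of the text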

45 Meaning of DFA
The DFA state records how much of the pattern is currently matched: it is the length of the longest prefix of the pattern that is also a suffix of the text read so far. Slide examples of proper prefixes and suffixes: for the pattern A B B, prefixes A, AB and suffixes B, BB; for the pattern A B A B A B, prefixes A, AB, ABA, ABAB, ABABA and suffixes B, AB, BAB, ABAB, BABAB.

46 Analysis
Time consumption: O(n + m). Advantages: low time cost, and the text is traversed only once, so KMP can be used on a stream. Disadvantage: space consumption depends on the size of the character set (the DFA has R * M entries).

47 Substring Search Knuth-Morris-Pratt Algorithm Boyer-Moore Algorithm
Rabin-Karp Algorithm

48 Brute Suffix Match
Algorithm: Brute-Suffix-Match(T, P)
j = 0;
while (j <= strlen(T) - strlen(P)) {
    for (i = strlen(P) - 1; i >= 0 && P[i] == T[i + j]; --i)
        ;                      /* scan the pattern right to left */
    if (i < 0) match;          /* all characters matched at shift j */
    else ++j;                  /* otherwise slide the pattern right by one */
}

49 Brute Suffix Match
SLOW! Scanning the pattern at every alignment this way costs O(NM) character comparisons in the worst case.

50 The Bad Symbol Heuristic: Easy Case
Observation-1. If c is known not to occur in P, then we need not consider the possibility of an occurrence of P starting at the current text positions 1, 2, …, or length(P): P can be slid completely past c.

51 The Bad Symbol Heuristic: More Interesting Case
Observation-2. If the rightmost occurrence of c in P is d positions from the right end of P, then we know we can slide P down d positions without checking for matches. (Slide example: the pattern A T T E N D E N C E at successive alignments.)
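The rightmost-occurrence rule can be precomputed as a table. A hedged Java sketch of the simplified bad-character table (this variant stores rightmost positions, whereas the slide's bmBc[] on slide 56 stores distances from the right end; names are illustrative):

static int[] badCharTable(String pat) {
    int R = 256;                            // extended ASCII alphabet (assumption)
    int[] right = new int[R];
    for (int c = 0; c < R; c++)
        right[c] = -1;                      // -1 means the character does not occur in the pattern
    for (int j = 0; j < pat.length(); j++)
        right[pat.charAt(j)] = j;           // rightmost index of each pattern character
    return right;
}
// On a mismatch of text character c against pattern index j, this table suggests
// shifting the pattern by j - right[c] (use at least 1 when that value is not positive).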

52 The Good Suffix Heuristic: Matched Suffix Occurs Elsewhere
Observation-3. We know that the last m characters scanned in T match the final m characters of P (call this suffix subp). We can slide P down by some amount so that the discovered occurrence of subp in T is aligned with the rightmost other occurrence of subp in P. (Slide example: the pattern A B L A B C A B at successive alignments.)

53 The Good Suffix Heuristic: Matched Suffix Does Not Occur Elsewhere
Observation-4. We know that the last m characters scanned in T match the final m characters of P (subp), but subp occurs nowhere else in P. We can slide P down so that the discovered occurrence of subp in T is aligned with the longest prefix of P that is also a suffix of subp.

54 The Good Suffix Heuristic: Matched Suffix Not Occurs
Observation-5. If the rightmost occurrence of c in P is to the right of the mismatched character, or no matched suffix occurs elsewhere in P, the bad-character rule alone would move P backwards to align the two characters. The algorithm therefore takes the maximum of the shifts proposed by the two heuristics (see the search code on slide 57), so P never moves backwards.

55 Boyer-Moore Algorithm Demo
Text: WHICH-FINALLY-HALTS.-AT-THAT-POINT. Pattern: AT-THAT. (The demo shows the pattern at each successive alignment as it skips through the text until it matches.)

56 Boyer-Moore Algorithm
void preBmBc(char *x, int m, int bmBc[]) {
    int i;
    for (i = 0; i < ASIZE; ++i)
        bmBc[i] = m;
    for (i = 0; i < m - 1; ++i)
        bmBc[x[i]] = m - i - 1;              /* distance of x[i] from the right end */
}

void suffixes(char *x, int m, int *suff) {
    int i, q;
    suff[m - 1] = m;
    for (i = m - 2; i >= 0; --i) {
        q = i;
        while (q >= 0 && x[q] == x[m - 1 - i + q])
            --q;
        suff[i] = i - q;                     /* longest common suffix of x[0..i] and x */
    }
}

void preBmGs(char *x, int m, int bmGs[]) {
    int i, j, suff[XSIZE];
    suffixes(x, m, suff);
    for (i = 0; i < m; ++i)
        bmGs[i] = m;
    j = 0;
    for (i = m - 1; i >= 0; --i)
        if (suff[i] == i + 1)
            for (; j < m - 1 - i; ++j)
                if (bmGs[j] == m)
                    bmGs[j] = m - 1 - i;
    for (i = 0; i <= m - 2; ++i)
        bmGs[m - 1 - suff[i]] = m - 1 - i;
}

57 Boyer-Moore Algorithm
j = 0;
while (j <= strlen(T) - strlen(P)) {
    for (i = strlen(P) - 1; i >= 0 && P[i] == T[i + j]; --i)
        ;
    if (i < 0)
        match;                                               /* occurrence at shift j */
    else
        j += max(bmGs[i], bmBc[T[i + j]] - (m - 1 - i));     /* best of the two heuristics */
}
Best case: O(N/M). Worst case: O(NM).

58 Substring Search Knuth-Morris-Pratt Algorithm Boyer-Moore Algorithm
Rabin-Karp Algorithm

59 Rabin-Karp Algorithm for string search
谢立伟 (Xie Liwei)

60 Basic Idea
Input: P[1..m], the target (pattern) string, and T[1..n], the source text (n >= m). Output: the shifts s with P[1..m] = T[s+1..s+m] (0 <= s <= n-m).

61 Basic Idea
When m = 1: this reduces to a simple linear search problem. When m > 1: hashing will help.

62 Detailed Procedure (For convenience, the examples use decimal digits only.)

63 Detailed Procedure
Optimization (rolling hash): the hash of the next window is computed from the previous one rather than from scratch. Sliding the window from 31415 to 14152: 14152 ≡ 10*(31415 - 3*10000) + 2 (mod 13). (A dynamic-programming-style incremental update.)

64 Pseudocode
RABIN-KARP-MATCHER(T, P, d, q)
    n = T.length
    m = P.length
    h = d^(m-1) mod q
    p = 0
    t[0] = 0
    for i = 1 to m                            // preprocessing: hash of P and of the first window of T
        p = (d*p + P[i]) mod q
        t[0] = (d*t[0] + T[i]) mod q
    for s = 0 to n-m                          // matching
        if p == t[s]
            if P[1..m] == T[s+1..s+m]         // verify to rule out a spurious hit
                print "Find P with shift: " + s
        if s < n-m                            // rolling hash: drop T[s+1], bring in T[s+m+1]
            t[s+1] = (d*(t[s] - T[s+1]*h) + T[s+m+1]) mod q
Space complexity: O(n-m). Time complexity: O(m*(n-m)) in the worst case.
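A hedged Java version of the same procedure that keeps only the current window hash, so it needs O(1) extra space (the modulus q = 997 and all names are illustrative; each hash hit is verified before reporting):

public static int rabinKarp(String pat, String txt) {
    int M = pat.length(), N = txt.length();
    if (M > N) return -1;
    int d = 256;                                 // radix (alphabet size)
    long q = 997;                                // small prime modulus (illustrative)
    long h = 1;
    for (int i = 0; i < M - 1; i++)
        h = (h * d) % q;                         // h = d^(M-1) mod q

    long p = 0, t = 0;
    for (int i = 0; i < M; i++) {                // hash of the pattern and of the first window
        p = (d * p + pat.charAt(i)) % q;
        t = (d * t + txt.charAt(i)) % q;
    }
    for (int s = 0; s <= N - M; s++) {
        if (p == t && pat.equals(txt.substring(s, s + M)))
            return s;                            // verified match at shift s
        if (s < N - M)                           // roll the hash: drop txt[s], bring in txt[s+M]
            t = ((t - txt.charAt(s) * h % q + q) % q * d + txt.charAt(s + M)) % q;
    }
    return -1;                                   // no occurrence
}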

65 Analysis
When q increases: Pros: reduces the chance of collisions (spurious hits); decreases the time spent confirming matches. Cons: may increase the space requirement; may increase the cost of each mod operation.

