Presentation on theme: "String Sorts Tries Substring Search: KMP, BM, RK"— Presentation transcript:

1 String Sorts Tries Substring Search: KMP, BM, RK
Lecture 16: Strings. String Sorts, Tries, Substring Search: KMP, BM, RK

2 key-indexed counting LSD radix sort MSD radix sort
String Sorts key-indexed counting LSD radix sort MSD radix sort ACKNOWLEDGEMENTS:

3 key-indexed counting LSD radix sort MSD radix sort
String Sorts key-indexed counting LSD radix sort MSD radix sort

4 Review: sorting algorithms
Lower bound. ~ N lg N compares are required by any compare-based sorting algorithm.

5 Key-indexed counting assumptions
Assumption. Keys are integers between 0 and R - 1. Implication. Can use key as an array index. Applications. Sort strings by first letter. Sort class roster by section. Sort phone numbers by area code. Subroutine in a sorting algorithm.

6 Key-indexed counting demo
Goal. Sort an array a[] of N integers between 0 and R - 1. Count frequencies of each letter using key as index. Compute frequency cumulates which specify destinations. Access cumulates using key as index to move items. Copy back into original array.

7-10 Key-indexed counting demo (continued: the same five steps, illustrated frame by frame)

11 Key-indexed counting analysis
Proposition. Key-indexed counting takes time proportional to N+R. Proposition. Key-indexed counting uses extra space proportional to N+R. Stable? Yes.
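The propositions above can be read directly off a short implementation. A hedged Java sketch for integer keys in 0..R-1 (method and variable names are illustrative, not from the slides):

static void keyIndexedCountingSort(int[] a, int R) {
    int N = a.length;
    int[] aux = new int[N];
    int[] count = new int[R + 1];
    for (int i = 0; i < N; i++)            // 1. count frequencies, offset by one
        count[a[i] + 1]++;
    for (int r = 0; r < R; r++)            // 2. cumulates give each key's destination range
        count[r + 1] += count[r];
    for (int i = 0; i < N; i++)            // 3. distribute items stably into aux
        aux[count[a[i]]++] = a[i];
    for (int i = 0; i < N; i++)            // 4. copy back into the original array
        a[i] = aux[i];
}

The four loops touch N or R entries each, giving the ~ N + R time and extra space stated above; equal keys keep their input order, so the sort is stable.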

12 key-indexed counting LSD radix sort MSD radix sort
String Sorts key-indexed counting LSD radix sort MSD radix sort

13 Least-significant-digit-first string sort
LSD string (radix) sort. Consider characters from right to left. Stably sort using dth character as the key (using key-indexed counting).
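A hedged Java sketch of this pass structure for fixed-length strings of width W (the method name and the R = 256 extended-ASCII alphabet are assumptions):

static void lsdSort(String[] a, int W) {
    int N = a.length, R = 256;               // extended ASCII alphabet (assumption)
    String[] aux = new String[N];
    for (int d = W - 1; d >= 0; d--) {       // right to left: one key-indexed pass per position
        int[] count = new int[R + 1];
        for (int i = 0; i < N; i++)
            count[a[i].charAt(d) + 1]++;
        for (int r = 0; r < R; r++)
            count[r + 1] += count[r];
        for (int i = 0; i < N; i++)          // stable distribution on the dth character
            aux[count[a[i].charAt(d)]++] = a[i];
        for (int i = 0; i < N; i++)
            a[i] = aux[i];
    }
}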

14 LSD string sort: correctness proof
Proposition. LSD sorts fixed-length strings in ascending order. Pf. [ by induction on i ] After pass i, strings are sorted by last i characters. If two strings differ on sort key, key-indexed sort puts them in proper relative order. If two strings agree on sort key, stability keeps them in proper relative order. Proposition. LSD sort is stable. Pf. Key-indexed counting is stable.

15 Summary: sorting algorithms

16 key-indexed counting LSD radix sort MSD radix sort
String Sorts key-indexed counting LSD radix sort MSD radix sort

17 Reverse LSD Consider characters from left to right.
Stably sort using dth character as the key (using key-indexed counting).

18 Most-significant-digit-first string sort
MSD string (radix) sort. Partition array into R pieces according to first character (use key-indexed counting). Recursively sort all strings that start with each character (key-indexed counts delineate subarrays to sort).
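A hedged Java sketch of MSD string sort along these lines, using charAt() = -1 as an end-of-string sentinel so shorter strings sort first (class shape and names are assumptions; the small-subarray cutoff discussed on a later slide is omitted):

public class MSD {
    private static final int R = 256;                      // extended ASCII alphabet (assumption)

    private static int charAt(String s, int d) {
        return d < s.length() ? s.charAt(d) : -1;          // -1 signals "past the end"
    }

    public static void sort(String[] a) {
        sort(a, new String[a.length], 0, a.length - 1, 0);
    }

    private static void sort(String[] a, String[] aux, int lo, int hi, int d) {
        if (hi <= lo) return;
        int[] count = new int[R + 2];                      // +2 leaves room for the -1 sentinel
        for (int i = lo; i <= hi; i++)                     // count frequencies of dth characters
            count[charAt(a[i], d) + 2]++;
        for (int r = 0; r < R + 1; r++)                    // cumulates delineate the subarrays
            count[r + 1] += count[r];
        for (int i = lo; i <= hi; i++)                     // distribute stably
            aux[count[charAt(a[i], d) + 1]++] = a[i];
        for (int i = lo; i <= hi; i++)                     // copy back
            a[i] = aux[i - lo];
        for (int r = 0; r < R; r++)                        // recursively sort each character class
            sort(a, aux, lo + count[r], lo + count[r + 1] - 1, d + 1);
    }
}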

19 MSD string sort example

20 Variable-length strings
Treat strings as if they had an extra char at the end (smaller than any char). C strings. Have extra char ‘\0’ at end => no extra work needed.

21 MSD string sort problem
Observation 1. Much too slow for small subarrays. Each function call needs its own count[] array. ASCII (256 counts): 100x slower than copy pass for N=2. Unicode (65536 counts): 32000x slower for N=2. Observation 2. Huge number of small subarrays because of recursion.

22 Summary: sorting algorithms

23 Retrieve DFA simulation Trie Radix Tree Suffix Trie(Tree)
Tries: Retrieval, DFA simulation, Trie, Radix Tree, Suffix Trie (Tree)

24 Alphabet Word:{i,a,in,at,an,inn,int,ate,age,adv,ant}
Letters on the path <=> a prefix of the word. Common prefix <=> common ancestor. Leaf node <=> longest prefix.

25 Construct Node Find word Add word Is Leaf?(Is End?) Edge.
Construct. A node stores an "is end of word?" (is leaf?) flag and edges to its children. Find word: follow the edge for the next letter; if it exists, jump to the next node, if not, return false; at the end of the word, check the end-of-word mark. Add word: follow edges the same way; where an edge does not exist, add a new node and jump to it; finally mark the last node as the end of the word.
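A minimal R-way trie sketch in Java illustrating the find/add operations just described (a set of words over lowercase letters; the class name and layout are assumptions, not from the slides):

public class Trie {
    private static final int R = 26;              // lowercase alphabet (assumption)

    private static class Node {
        Node[] next = new Node[R];                // one edge per letter
        boolean isEnd;                            // marks the end of a word
    }

    private final Node root = new Node();

    public void add(String word) {                // add: create missing nodes, mark the last one
        Node x = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (x.next[c] == null) x.next[c] = new Node();
            x = x.next[c];
        }
        x.isEnd = true;
    }

    public boolean find(String word) {            // find: missing edge => false; else check end mark
        Node x = root;
        for (int i = 0; i < word.length(); i++) {
            int c = word.charAt(i) - 'a';
            if (x.next[c] == null) return false;
            x = x.next[c];
        }
        return x.isEnd;
    }
}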

26 Example Find ”ant” Find ”and” Add ”and”
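A brief client run mirroring this example, using the hypothetical Trie class sketched after the previous slide:

Trie t = new Trie();
t.add("ant");
System.out.println(t.find("ant"));   // true
System.out.println(t.find("and"));   // false: the edge for 'd' under "an" does not exist
t.add("and");                        // creates the missing node and marks it as a word end
System.out.println(t.find("and"));   // true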

27 Other version Child-Brother Tree Double Array Trie
Child-brother (left-child, right-sibling) binary tree representation, double-array trie, ternary search tries. How to store the edges of a node? Array, list, or BST.

28 Analysis
Time complexity: add and find each take time proportional to the length of the string. Space complexity: proportional to the total length of all strings.

29 Radix Tree
Every internal node has at least two children. Number of leaf nodes <= number of words. Space complexity: proportional to the number of words.

30 Application Scenarios
Dictionary retrieval. Longest common prefix. Sorting support. Combined with KMP: the Aho-Corasick automaton. A dictionary of all suffixes of a word: the suffix tree.

31 Suffix Tree
Suffix trie: the trie built from all suffixes of a word. Example: the suffix trie of “BANANAS” is the trie over the dictionary BANANAS, ANANAS, NANAS, ANAS, NAS, AS, S. Path compression turns the suffix trie into a suffix tree.

32 Suffix Tree: Application Scenarios
Find a substring. Find the k-th substring in MSD (sorted) order. Count the number of distinct substrings. Count how many times a substring occurs. Find the longest repeated substring (occurring >= 2 times). Find the longest common substring of two strings. If o occurs in S, then o must be a prefix of some suffix of S. If T occurs twice in S, then two suffixes of S have T as a prefix, so the number of occurrences can be read off directly. The string spelled out along the path to the deepest internal (non-leaf) node is the longest repeated substring. (4) Longest common substring of S1 and S2: insert S1#S2$ into a suffix tree and find the deepest internal node whose leaf descendants include both a suffix containing # and a suffix containing only $ (no #).

33 Suffix Tree: Construction
Time complexity: building the suffix trie naively takes O(n^2). Esko Ukkonen's algorithm builds the suffix tree in linear time: the tree has at most 2n nodes, and it is constructed online, step by step (letter by letter, prefix by prefix), with deferred updates. E.g. the prefix BAN (of BANANAS).

34 Substring Search Knuth-Morris-Pratt Algorithm Boyer-Moore Algorithm
Rabin-Karp Algorithm


36 Knuth-Morris-Pratt Algorithm
Solution with DFA simulation How to construct DFA Analysis of KMP

37 Brute-force substring search
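For comparison with KMP below, a hedged Java sketch of brute-force substring search (the method name is illustrative):

public static int bruteSearch(String pat, String txt) {
    int M = pat.length(), N = txt.length();
    for (int i = 0; i <= N - M; i++) {            // try every alignment of pat in txt
        int j;
        for (j = 0; j < M; j++)
            if (txt.charAt(i + j) != pat.charAt(j)) break;
        if (j == M) return i;                     // match found at offset i
    }
    return N;                                     // not found
}

Worst case: about N*M character compares, which is what KMP avoids.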

38 Knuth, Morris and Pratt (1974)

39 Overview Consider the string B C B A A B A C A A B A B A C A A
and the pattern A B A B A C.

40 Steps Construct the DFA DFA simulation

41 How to construct the DFA
Match situations: in state j, reading the pattern character pat.charAt(j) advances the DFA to state j + 1 (in the code: dfa[pat.charAt(j)][j] = j + 1).

42 How to construct the DFA
Mismatch situations: maintain X, the state the DFA would be in after reading the pattern shifted by one character (i.e., pat[1..j-1]). For every mismatch character c at state j, copy dfa[c][X] to dfa[c][j]; then update X = dfa[pat.charAt(j)][X].

43 Code
public KMP(String pat) {
    // Build DFA from pattern.
    this.pat = pat;
    int M = pat.length();
    int R = 256;
    dfa = new int[R][M];
    dfa[pat.charAt(0)][0] = 1;
    for (int X = 0, j = 1; j < M; j++) {
        // Compute dfa[][j].
        for (int c = 0; c < R; c++)
            dfa[c][j] = dfa[c][X];          // Copy mismatch cases.
        dfa[pat.charAt(j)][j] = j + 1;      // Set match case.
        X = dfa[pat.charAt(j)][X];          // Update restart state.
    }
}

44 Code
public int search(String txt) {
    // Simulate operation of DFA on text.
    int M = pat.length();
    int N = txt.length();
    int i, j;
    for (i = 0, j = 0; i < N && j < M; i++)
        j = dfa[txt.charAt(i)][j];
    if (j == M) return i - M;   // found
    return N;                   // not found
}
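For the two methods above to compile as a class, pat and dfa must be fields. A hedged sketch of the wrapper and a client call (the class shape is an assumption; the text and pattern are the ones from the Overview slide):

public class KMP {
    private String pat;        // the pattern
    private int[][] dfa;       // dfa[c][j]: next state after reading character c in state j

    // constructor (slide 43) and search (slide 44) go here unchanged
}

// Client usage:
// KMP kmp = new KMP("ABABAC");
// int offset = kmp.search("BCBAABACAABABACAA");   // 9: the pattern starts at index 9 of the text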

45 Meaning of DFA
The DFA state records how much of the pattern is currently matched: it is the length of the longest prefix of the pattern that is also a suffix of the text read so far. Slide examples of proper prefixes and suffixes: for the pattern A B B, prefixes A, AB and suffixes B, BB; for the pattern A B A B A B, prefixes A, AB, ABA, ABAB, ABABA and suffixes B, AB, BAB, ABAB, BABAB.

46 Analysis
Time consumption: O(n + m). Advantages: low time cost, and the text is traversed only once, so KMP can be used on a stream. Disadvantage: space consumption depends on the size of the character set (the DFA has R * M entries).

47 Substring Search Knuth-Morris-Pratt Algorithm Boyer-Moore Algorithm
Rabin-Karp Algorithm

48 Brute Suffix Match
Algorithm: Brute-Suffix-Match(T, P)
j = 0;
while (j <= strlen(T) - strlen(P)) {
    for (i = strlen(P) - 1; i >= 0 && P[i] == T[i + j]; --i)
        ;                      /* scan the pattern right to left */
    if (i < 0) match;          /* all characters matched at shift j */
    else ++j;                  /* otherwise slide the pattern right by one */
}

49 Brute Suffix Match
SLOW! Scanning the pattern at every alignment this way costs O(NM) character comparisons in the worst case.

50 The Bad Symbol Heuristic: Easy Case
Observation-1. If c is known not to occur in P, then we need not consider the possibility of an occurrence of P starting at the current text positions 1, 2, …, or length(P): P can be slid completely past c.

51 The Bad Symbol Heuristic: More Interesting Case
Observation-2. If the rightmost occurrence of c in P is d positions from the right end of P, then we know we can slide P down d positions without checking for matches. (Slide example: the pattern A T T E N D E N C E at successive alignments.)
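The rightmost-occurrence rule can be precomputed as a table. A hedged Java sketch of the simplified bad-character table (this variant stores rightmost positions, whereas the slide's bmBc[] on slide 56 stores distances from the right end; names are illustrative):

static int[] badCharTable(String pat) {
    int R = 256;                            // extended ASCII alphabet (assumption)
    int[] right = new int[R];
    for (int c = 0; c < R; c++)
        right[c] = -1;                      // -1 means the character does not occur in the pattern
    for (int j = 0; j < pat.length(); j++)
        right[pat.charAt(j)] = j;           // rightmost index of each pattern character
    return right;
}
// On a mismatch of text character c against pattern index j, this table suggests
// shifting the pattern by j - right[c] (use at least 1 when that value is not positive).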

52 The Good Suffix Heuristic: Matched Suffix Occurs Elsewhere
Observation-3. We know that the last m characters scanned in T match the final m characters of P (call this suffix subp). We can slide P down by some amount so that the discovered occurrence of subp in T is aligned with the rightmost other occurrence of subp in P. (Slide example: the pattern A B L A B C A B at successive alignments.)

53 The Good Suffix Heuristic: Matched Suffix Does Not Occur Elsewhere
Observation-4. We know that the last m characters scanned in T match the final m characters of P (subp), but subp occurs nowhere else in P. We can slide P down so that the discovered occurrence of subp in T is aligned with the longest prefix of P that is also a suffix of subp.

54 The Good Suffix Heuristic: Matched Suffix Not Occurs
Observation-5. If the rightmost occurrence of c in P is to the right of the mismatched character, or no matched suffix occurs elsewhere in P, the bad-character rule alone would move P backwards to align the two characters. The algorithm therefore takes the maximum of the shifts proposed by the two heuristics (see the search code on slide 57), so P never moves backwards.

55 Boyer-Moore Algorithm Demo
Text: WHICH-FINALLY-HALTS.-AT-THAT-POINT. Pattern: AT-THAT. (The demo shows the pattern at each successive alignment as it skips through the text until it matches.)

56 Boyer-Moore Algorithm
void preBmBc(char *x, int m, int bmBc[]) {
    int i;
    for (i = 0; i < ASIZE; ++i)
        bmBc[i] = m;
    for (i = 0; i < m - 1; ++i)
        bmBc[x[i]] = m - i - 1;              /* distance of x[i] from the right end */
}

void suffixes(char *x, int m, int *suff) {
    int i, q;
    suff[m - 1] = m;
    for (i = m - 2; i >= 0; --i) {
        q = i;
        while (q >= 0 && x[q] == x[m - 1 - i + q])
            --q;
        suff[i] = i - q;                     /* longest common suffix of x[0..i] and x */
    }
}

void preBmGs(char *x, int m, int bmGs[]) {
    int i, j, suff[XSIZE];
    suffixes(x, m, suff);
    for (i = 0; i < m; ++i)
        bmGs[i] = m;
    j = 0;
    for (i = m - 1; i >= 0; --i)
        if (suff[i] == i + 1)
            for (; j < m - 1 - i; ++j)
                if (bmGs[j] == m)
                    bmGs[j] = m - 1 - i;
    for (i = 0; i <= m - 2; ++i)
        bmGs[m - 1 - suff[i]] = m - 1 - i;
}

57 Boyer-Moore Algorithm
j = 0;
while (j <= strlen(T) - strlen(P)) {
    for (i = strlen(P) - 1; i >= 0 && P[i] == T[i + j]; --i)
        ;
    if (i < 0)
        match;                                               /* occurrence at shift j */
    else
        j += max(bmGs[i], bmBc[T[i + j]] - (m - 1 - i));     /* best of the two heuristics */
}
Best case: O(N/M). Worst case: O(NM).

58 Substring Search Knuth-Morris-Pratt Algorithm Boyer-Moore Algorithm
Rabin-Karp Algorithm

59 Rabin-Karp Algorithm for string search
谢立伟 (Xie Liwei)

60 Basic Idea
Input: P[1..m], the target (pattern) string, and T[1..n], the source text (n >= m). Output: the shifts s with P[1..m] = T[s+1..s+m] (0 <= s <= n-m).

61 Basic Idea
When m = 1: this reduces to a simple linear search problem. When m > 1: hashing will help.

62 Detailed Procedure (For convenience, the examples use decimal digits only.)

63 Detailed Procedure
Optimization (rolling hash): the hash of the next window is computed from the previous one rather than from scratch. Sliding the window from 31415 to 14152: 14152 ≡ 10*(31415 - 3*10000) + 2 (mod 13). (A dynamic-programming-style incremental update.)

64 Pseudocode
RABIN-KARP-MATCHER(T, P, d, q)
    n = T.length
    m = P.length
    h = d^(m-1) mod q
    p = 0
    t[0] = 0
    for i = 1 to m                            // preprocessing: hash of P and of the first window of T
        p = (d*p + P[i]) mod q
        t[0] = (d*t[0] + T[i]) mod q
    for s = 0 to n-m                          // matching
        if p == t[s]
            if P[1..m] == T[s+1..s+m]         // verify to rule out a spurious hit
                print "Find P with shift: " + s
        if s < n-m                            // rolling hash: drop T[s+1], bring in T[s+m+1]
            t[s+1] = (d*(t[s] - T[s+1]*h) + T[s+m+1]) mod q
Space complexity: O(n-m). Time complexity: O(m*(n-m)) in the worst case.
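A hedged Java version of the same procedure that keeps only the current window hash, so it needs O(1) extra space (the modulus q = 997 and all names are illustrative; each hash hit is verified before reporting):

public static int rabinKarp(String pat, String txt) {
    int M = pat.length(), N = txt.length();
    if (M > N) return -1;
    int d = 256;                                 // radix (alphabet size)
    long q = 997;                                // small prime modulus (illustrative)
    long h = 1;
    for (int i = 0; i < M - 1; i++)
        h = (h * d) % q;                         // h = d^(M-1) mod q

    long p = 0, t = 0;
    for (int i = 0; i < M; i++) {                // hash of the pattern and of the first window
        p = (d * p + pat.charAt(i)) % q;
        t = (d * t + txt.charAt(i)) % q;
    }
    for (int s = 0; s <= N - M; s++) {
        if (p == t && pat.equals(txt.substring(s, s + M)))
            return s;                            // verified match at shift s
        if (s < N - M)                           // roll the hash: drop txt[s], bring in txt[s+M]
            t = ((t - txt.charAt(s) * h % q + q) % q * d + txt.charAt(s + M)) % q;
    }
    return -1;                                   // no occurrence
}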

65 Analysis
When q increases: Pros: reduces the chance of collisions (spurious hits); decreases the time spent confirming matches. Cons: may increase the space requirement; may increase the cost of each mod operation.

