Presentation is loading. Please wait.

Presentation is loading. Please wait.

Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.

Similar presentations


Presentation on theme: "Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications."— Presentation transcript:

1 Advanced Data Structures Lecture 8 Mingmin Xie

2 Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications

3 Overview of String Matching Given an alphabet Σ, text T, and a pattern P: Is there a substring of T matching P ? Algorithmic approach: O(|T|), etc: KMP Data structural approach: o If T is large, immutable and searched for often o Why not preprocess it? o answer queries in O(|P|) time using O(|T||Σ|) space and O(|T| + sort(Σ)) preprocessing time.

4 Trie Name come from retrieval A set of words: { ana, ann, anna, anne } Is "ann" there? - O(|P|) Who starts with "ann"? - O(|P| + output)

5 Compressed Trie Coalesce non-branching paths Store only indices

6 Space complexity of Compressed Trie There are only branching nodes and leaf nodes Remove a leaf and its parent until there is no branching node:

7 Space complexity of Compressed Trie Finally at least one leaf node remaining: number of branching nodes < number of leaf nodes = number of words T number of nodes = branching + leaf < 2T number of edges = number of nodes - 1 <= 2T Every node has an array of pointers for the alphabet Σ. Space: O(|T||Σ|) So far we have a dictionary which takes O(|T||Σ|) space and serves queries in O(|P|) time.

8 Back to String Matching Now for the original string matching problem: find P in T We can see all suffixes of T as a dictionary and query a pattern P in it. If P is a prefix of any suffix, then P is in T. For example: T = "banana" The set of suffixes: { banana$, anana$, nana$, ana$, na$, na$, a$, $ }

9 Suffix Tree Making use of the trie, we can put all suffixes of T into a trie.

10 Suffix Tree Is P in T? -- O(|P|) Given a pattern P, all occurrences of P in T can be found and reported in time O(|P| + output). Space: O(|T||Σ|) o length of T = |T| = number of suffixes

11 How to Construct? The simple way: insert suffixes(words) into the Trie one by one Time: O(|T|2) Worst case: T = "aaaa", { $, a$, aa$, aaa$, aaaa$ } We went through path of common prefix "aa.." again and again, which can be optimized by preprocessing common prefixes!

12 Suffix Array T = "banana"

13 Suffix Array Suffix array is also a powerful structure itself. It is easier to implement so we can use it instead of suffix tree sometimes. If SA[i] = j then suffix Sj has rank i among the set of strings {S0,..., Sn−1}. We can search for occurrences of P directly on the suffix array using binary search in O(|P|lg|T|) time.

14 Longest Common Prefix Array

15 Equivalence

16 Construction of Suffix Tree From Suffix Array and LCP:

17 Construction of Suffix Array and LCP Skew algorithm[2]. Overview:

18 Skew Algorithm Suppose T = "mississippi" 1. sort Σ = {i, m, p, s} => {1, 2, 3, 4}, cost O(sort(Σ)) In the first iteration use any sorting algorithm, leading to the O(sort(Σ)) term. In the following iterations we can use radix sort to sort in linear time (see below). 2. Alphabet reduction: replace each letter in the text with its rank among the letters in the text. T = 21441441331

19 Skew Algorithm 3. Divide the text T into 3 parts and consider triples of letters to be one megaletter, i.e. change the alphabet. More formally, form T0,T1, and T2 as follows: T0 = T1 = T2 = Note that Ti’s are just texts with n/3 letters of a new Σ3 alphabet. Our text size has become a third of the original, while the alphabet size has cubed. Here we have: T0 = mis sis sip pi$ = 214 414 413 310 T1 = iss iss ipp i$$ = 144 144 133 100 T2 = ssi ssi ppi = 441 441 331

20 Skew Algorithm Suffixes(T ) ∼ = Suffixes(T0) ∪ Suffixes(T1) ∪ Suffixes(T2) T0T1T2 mis sis sip pi$ sis sip pi$ sip pi$ pi$ iss iss ipp i$$ iss ipp i$$ ipp i$$ i$$ ssi ssi ppi ssi ppi ppi

21 Skew Algorithm 4. Recurse on sort the new alphabet Σ3 => {1, 2, 3, 4, 5, 6, 7} Here rank[i] = j if SA[j] = i. We use it to compare suffixes later.

22 Skew Algorithm 5. Sort suffixes of T2 using radix sort: suffix T2[i:] - pull off the first letter, it becomes a suffix of T0 SA2: {2, 1, 0} LCP: simply check the first letter distinct: 0 same: lookup LCP01 for T0[i+1:] and add 1 LCP2 = {0, 1 + LCP01[1]} = {0, 3}

23 Skew Algorithm 6. Merge SA01 and SA2 linearly. (like merge sort) Need to compare T0[i:] or T1[i:] to T2[j:] in constant time: T0[i:] vs T2[j:] ~= (T[3i], T1[i:]) vs (T[3j+2], T0[j+1:]), e.g.: o T0[1:] ~= sis sip pi$ ~= s iss ipp i$$ ~= (s, T1[0:]) ~= (s, 4) o T2[1:] ~= ssi ppi ~= s sip pi$ ~= (s, T0[2:]) ~= (s, 6) o so T0[1:] < T2[1:] T1[i:] vs T2[j:] ~= (T[3i+1], T[3i+2], T0[i+1:]) vs (T[3j+2], T[3j+3], T1[j+1:]), e.g.: o T1[1:] ~= iss ipp i$$ ~= i s sip pi$ ~= (i, s, T0[2:]) ~= (i, s, 6) o T2[1:] ~= ssi ppi ~= s s ipp i$$ ~= (s, s, T1[2:]) ~= (s, s, 1) o so T1[1:] < T2[1:]

24 Skew Algorithm Finally the merged suffix array SA: {10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2} LCP: done exactly as above T0[i:] vs T2[j:] 1st character different: 0 1st character same: lookup LCP01 and add 1 T1[i:] vs T2[j:] 1st character different: 0 1st same, 2nd different: 1 both equal: lookup LCP01 and add 2

25 Time Complexity The recursive execution time obeys the recurrence: T(n) = O(n) + T(2n/3) T(n) = O(1) for n < 3 Solution T(n) = O(n) The first sort of Σ costs O(sort(Σ)). Overall: O(|T| + sort(Σ)).

26 Applications count the number of occurrences of P o augment subtree: record the number of leaves longest repeated substring: O(|T|) o find the branching node of maximum letter depth multiple documents via multiple $'s: T = T1$1T2$2... longest common substring between two documents: O(|T|) o combine them as above o find the deepest node with both $1 and $2

27 Document Retrieval

28 Q & A Thank you


Download ppt "Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications."

Similar presentations


Ads by Google