Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.

1 Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein

2 Indexing problem Input: Text T = t_1,…,t_n (preprocess into a data structure, DS). Queries: Pattern P = p_1,…,p_m (answered using the DS).

3 Suffix Property P appears at location i of T iff P is a prefix of the suffix T_i. [Figure: example text T with the suffix T_14 marked.]
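
The suffix property can be checked directly, if inefficiently. The sketch below is only an illustration: the function name `occurrences` is hypothetical, and the text "mississippi" is borrowed from the later slides because the example text of this slide did not survive in the transcript.

```python
def occurrences(text, pattern):
    """Return the 1-based positions i where pattern is a prefix of the suffix T_i."""
    # str.startswith(pattern, i) tests the suffix starting at 0-based index i
    # without copying it.
    return [i + 1 for i in range(len(text)) if text.startswith(pattern, i)]


print(occurrences("mississippi", "ssi"))  # [3, 6]
```

This scan costs O(|T|·|P|) per query; the structures on the following slides exist to avoid exactly that.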

4 Suffix Tree A suffix tree for string S is a compressed trie of all suffixes of S. Example: S = abab$, with suffixes { $, b$, ab$, bab$, abab$ }. [Figure: trie of the suffixes of abab$.]

5 Suffix Tree The size of the suffix tree of S is O(|S|). Example: S = abab$, with suffixes { $, b$, ab$, bab$, abab$ }. [Figure: suffix tree of abab$ with leaves numbered 0 to 4.]

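6 Suffix Tree The size of the suffix tree of S is O(|S|). Example: S = abab$, with suffixes { $, b$, ab$, bab$, abab$ }. [Figure: the same tree with leaves numbered 0 to 4 and each edge label stored as an index interval into S: [2,3], [2,4], [4,4], [1,1], [2,4].]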

7 Indexing and Suffix Trees Navigate from root. (Use suffix property). P = ssi Time: O(|P| + occ)

8 Indexing and Suffix Trees Navigate from root. (Use suffix property). P = ssi Time: O(|P| log|Σ| + occ)

9 Suffix Trees Weiner 1973 (linear-time construction!); McCreight 1976 (space efficient); Ukkonen 1995 (online); Farach 1997 (polynomial-range alphabets).

10 Suffix Array
All suffixes: S1 mississippi, S2 ississippi, S3 ssissippi, S4 sissippi, S5 issippi, S6 ssippi, S7 sippi, S8 ippi, S9 ppi, S10 pi, S11 i
Sorted suffixes: S11 i, S8 ippi, S5 issippi, S2 ississippi, S1 mississippi, S10 pi, S9 ppi, S7 sippi, S4 sissippi, S6 ssippi, S3 ssissippi
POS: 11 8 5 2 1 10 9 7 4 6 3
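
The POS array can be reproduced by sorting the suffix start positions. This is a deliberately naive sketch (the name `suffix_array` is an assumption, and the sort compares whole suffixes, so it is far from the linear-time constructions cited later in the talk).

```python
def suffix_array(s):
    """Naive construction: sort 1-based start positions by the suffix they begin."""
    # s[i - 1:] is the suffix S_i in the slide's 1-based notation.
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


print(suffix_array("mississippi"))  # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
```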

11 Suffix Array S = mississippi, SA(S) = 11 8 5 2 1 10 9 7 4 6 3, P = pi

15 Suffix Array S = mississippi, SA(S) = 11 8 5 2 1 10 9 7 4 6 3, P = pi. Time: O(|P| log |S|), since binary search over SA(S) makes O(log |S|) comparisons and each comparison costs O(|P|).
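
A minimal sketch of the O(|P| log |S|) search, assuming the 1-based suffix array from the slide is already available. The names `sa_search` and `prefix_at` are illustrative only. Two binary searches locate the block of suffixes whose first |P| characters equal P.

```python
def sa_search(s, sa, p):
    """Return the sorted 1-based positions of p in s, using binary search on sa."""
    def prefix_at(k):
        # First len(p) characters of the k-th smallest suffix.
        start = sa[k] - 1
        return s[start:start + len(p)]

    # Lower bound: first k whose truncated suffix is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_at(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first k whose truncated suffix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_at(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


S = "mississippi"
SA = [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(sa_search(S, SA, "pi"))   # [10]
print(sa_search(S, SA, "ssi"))  # [3, 6]
```

Each comparison inspects at most |P| characters, which is where the |P| factor in the bound comes from.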

16 Suffix Array Introduced by Manber and Myers (1993) and, as PAT arrays, by Gonnet, Baeza-Yates and Snider (1992). Manber and Myers (1993): query time O(|P| + log |S|).

17 Suffix Array Construction Manber and Myers (1993): O(n log n). Kärkkäinen and Sanders (2003): O(n) (polynomial-range alphabets). Two other papers as well.

18 End of Story? No. Lots of questions. 1. Construction time of suffix trees. 2. Query time. 3. Compressed indexing structures. 4. Indexing with errors. 5. Real-time suffix tree construction.

19 Query Time for Large Alphabets Suffix Trees: O(|P|*log|Σ|) (deterministic) Suffix Arrays: O(|P| + log |T|) Suffix Trays: O(|P|+log|Σ|) for alphabets {1,…,|Σ|}

20 Query Time for Large Alphabets Actually it is easy to answer queries in O(|P|) time. Create at every node of the suffix tree an array of length |Σ|. Then navigation at every node is O(1). However, the time and space of suffix tree construction become O(n|Σ|).
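
The per-node array idea can be sketched as follows. For brevity, and as an assumption of this sketch rather than the talk's structure, it builds an uncompressed suffix trie instead of a compressed suffix tree; the function names are hypothetical. Every node is a list of |Σ| child indices, so each navigation step is one array lookup, and every node pays for |Σ| slots whether it uses them or not.

```python
def build_trie_with_child_arrays(text, alphabet):
    """Uncompressed suffix trie; each node is a list of len(alphabet) child indices."""
    rank = {ch: r for r, ch in enumerate(alphabet)}
    nodes = [[-1] * len(alphabet)]  # node 0 is the root; -1 means "no child"
    for i in range(len(text)):
        cur = 0
        for ch in text[i:]:
            r = rank[ch]
            if nodes[cur][r] == -1:
                nodes.append([-1] * len(alphabet))
                nodes[cur][r] = len(nodes) - 1
            cur = nodes[cur][r]
    return nodes, rank


def contains(nodes, rank, pattern):
    """Walk the pattern from the root: one O(1) array lookup per character."""
    cur = 0
    for ch in pattern:
        r = rank.get(ch)
        if r is None or nodes[cur][r] == -1:
            return False
        cur = nodes[cur][r]
    return True


nodes, rank = build_trie_with_child_arrays("mississippi", "imps")
print(contains(nodes, rank, "ssi"))  # True
print(contains(nodes, rank, "pim"))  # False
```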

21 Query Time for Large Alphabets Suffix Trees: O(|P|*log|Σ|) (deterministic) Suffix Arrays: O(|P| + log |S|) Suffix Trays: O(|P|+log|Σ|) for alphabets {1,…,|Σ|}

22 Suffix Tree – Suffix Array connection The left-to-right ordering of the suffixes (leaves) in the suffix tree is exactly the suffix array.

23 Suffix Array ($ sorts after every letter)
All suffixes: S1 mississippi$, S2 ississippi$, S3 ssissippi$, S4 sissippi$, S5 issippi$, S6 ssippi$, S7 sippi$, S8 ippi$, S9 ppi$, S10 pi$, S11 i$, S12 $
Sorted suffixes: S8 ippi$, S5 issippi$, S2 ississippi$, S11 i$, S1 mississippi$, S10 pi$, S9 ppi$, S7 sippi$, S4 sissippi$, S6 ssippi$, S3 ssissippi$, S12 $
POS: 8 5 2 11 1 10 9 7 4 6 3 12

24 Example: mississippi$ SA(mississippi$) = 8 5 2 11 1 10 9 7 4 6 3 12

25 Suffix Tree – Suffix Array connection We utilize this connection as follows: Every node in the suffix tree corresponds to an interval in suffix array.
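
The node-to-interval correspondence can be observed directly: the suffixes that begin with a node's path label occupy a contiguous block of the suffix array. The sketch below uses a linear scan for clarity and the sentinel-free array for mississippi shown earlier in the talk; `node_interval` is an illustrative name, not part of the original material.

```python
def node_interval(s, sa, label):
    """Return the half-open range (lo, hi) of sa whose suffixes start with label."""
    hits = [k for k, pos in enumerate(sa) if s.startswith(label, pos - 1)]
    if not hits:
        return None
    lo, hi = hits[0], hits[-1] + 1
    # The matching entries form one contiguous block of the sorted array.
    assert hits == list(range(lo, hi))
    return lo, hi


S = "mississippi"
SA = [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(node_interval(S, SA, "i"))    # (0, 4)
print(node_interval(S, SA, "ssi"))  # (9, 11)
```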

26 Example: mississippi$ SA(mississippi$) = 8 5 2 11 1 10 9 7 4 6 3 12

27 Suffix Tree – Suffix Array connection Moreover, the time to search the suffix array within an interval I is O(|P| + log |I|).

28 Suffix Tree – Suffix Array connection Definition: a |Σ|-leaf is a node that (1) has at least |Σ| leaves in its subtree and (2) has no child with at least |Σ| leaves in its subtree. The number of leaves in the subtree of a |Σ|-leaf is O(|Σ|²). Why? It has at most |Σ| children, each with fewer than |Σ| leaves in its subtree.
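
The definition can be tried out on a small example. The sketch below makes two assumptions that are not in the talk: it uses an uncompressed suffix trie (nested dicts) instead of a compressed suffix tree, and it takes the threshold as a parameter `k` so that small values can be explored; the talk uses k = |Σ|. A unary trie node has the same leaf count as its only child, so it can never satisfy both conditions, and the nodes reported are genuine suffix tree nodes.

```python
def sigma_leaves(text, k):
    """Path labels of nodes with >= k leaves below them whose children all have < k."""
    # Build the uncompressed suffix trie; text should end with a unique sentinel.
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})

    found = []

    def count_leaves(node, label):
        if not node:  # an empty dict is a trie leaf
            return 1
        child_counts = [count_leaves(child, label + ch) for ch, child in node.items()]
        total = sum(child_counts)
        if total >= k and all(c < k for c in child_counts):
            found.append(label)
        return total

    count_leaves(root, "")
    return sorted(found)


print(sigma_leaves("mississippi$", 3))  # ['i', 's']
print(sigma_leaves("mississippi$", 5))  # [''], i.e. the root
```

For mississippi$ with k = 5 (the alphabet size including $), the root is the only such node, because every child of the root has at most 4 leaves below it.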

29 Suffix Tree – Suffix Array connection The number of leaves in the subtree of a |Σ|-leaf is O(|Σ|²). Hence the time to search the suffix array interval of a |Σ|-leaf is O(|P| + log |Σ|²) = O(|P| + log |Σ|).

30 Example: mississippi$ SA(mississippi$) = 8 5 2 11 1 10 9 7 4 6 3 12

31 Suffix Tray Idea Outline: Navigate in the suffix tree until a |Σ|-leaf is hit, then move to the suffix array (time in SA: O(|P| + log |Σ|)). Problem: navigation in the suffix tree takes O(|P| log |Σ|) time, but we promised O(|P| + log |Σ|).

32 Suffix Tray Recall the idea: create at every node of the suffix tree an array of length |Σ|. Then navigation at every node is O(1). Too expensive overall: O(n|Σ|). But OK for O(n/|Σ|) nodes, where the arrays take O(n) space in total.

33 Suffix Tray Idea: truncate the suffix tree at the |Σ|-leaves to obtain the Σ-tree. It would be nice if the size of the Σ-tree were O(n/|Σ|). However, this is not the case. [Figure: counterexample tree. Legend: subtrees with < |Σ| leaves; |Σ|-leaves; the rest.]

34 [Figure: suffix tree of S = ababababa$. Legend: subtrees with < |Σ| leaves; |Σ|-leaves; the rest.]

35 Suffix Tray Alternative Idea: define the Σ-tree by removing from the suffix tree all nodes with fewer than |Σ| leaves in their subtrees. Nodes in the Σ-tree: 1. Σ-leaf. 2. Branching-Σ-node: node with at least 2 children in the Σ-tree. 3. Others: nodes with only one child in the Σ-tree.

36 Suffix Tray - Example [Figure: the example tree with its nodes marked by type. Legend: subtrees with < |Σ| leaves; |Σ|-leaves; branching-Σ-nodes; others.]

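37 Suffix Tray Observation: # of Σ-leaves = O(n/|Σ|), since each Σ-leaf has at least |Σ| leaves below it and these subtrees are disjoint. Hence, # of branching-Σ-nodes = O(n/|Σ|), because a tree has fewer branching nodes than leaves. So, we can afford a Σ-table for navigation at each of them, for O(n) space in total.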
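
The counting argument can be checked on a small input. As an assumption of this sketch, the suffix tree is replaced by an uncompressed suffix trie and the threshold is a parameter `k` (the talk uses k = |Σ|); `classify_sigma_tree` is a hypothetical name. Because the trie is uncompressed, the "other" count also includes unary trie nodes that a compressed suffix tree would not contain; the Σ-leaf and branching counts are unaffected.

```python
def classify_sigma_tree(text, k):
    """Count Σ-tree nodes (those with >= k leaves below) by their number of heavy children."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})

    counts = {"sigma_leaf": 0, "branching": 0, "other": 0}

    def leaves(node):
        if not node:  # trie leaf
            return 1
        child_counts = [leaves(child) for child in node.values()]
        total = sum(child_counts)
        if total >= k:
            heavy = sum(1 for c in child_counts if c >= k)  # children kept in the Σ-tree
            if heavy == 0:
                counts["sigma_leaf"] += 1
            elif heavy >= 2:
                counts["branching"] += 1
            else:
                counts["other"] += 1
        return total

    leaves(root)
    return counts


text = "mississippi$"
result = classify_sigma_tree(text, 3)
print(result)  # {'sigma_leaf': 2, 'branching': 1, 'other': 0}
```

Here 2 Σ-leaves times k = 3 is at most n = 12, and the single branching node (the root) is fewer than the number of Σ-leaves, as the observation predicts.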

38 Suffix Tray – What is Left? [Figure: the example tree with its nodes marked by type. Legend: subtrees with < |Σ| leaves; |Σ|-leaves; branching-Σ-nodes; others.]

39 Suffix Tray Nodes in the Σ-tree with only one child. [Figure: a one-child node and its outgoing edges, drawn above the suffix array.] Interval of size less than |Σ|².

40 Suffix Tray Size of the suffix tray: O(n). Navigation: 1. Σ-leaf: jump to the suffix array. 2. Branching-Σ-node: look up the Σ-array. 3. Others: check one character against the edge to the single Σ-tree child. Time: O(|P| + log|Σ|).

41 End of Story? No. Lots of questions. 1. Construction time of suffix trees. 2. Query time. 3. Compressed indexing structures. 4. Indexing with errors. 5. Real-time suffix tree construction.

