1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)



2 Text search
Different scenarios:
Dynamic texts: text editors, symbol manipulators.
Static texts: literature databases, library systems, gene databases, World Wide Web.

3 Properties of suffix trees
Search index for a text σ in order to search for patterns α.
Properties:
1. Substring search in time O(|α|).
2. Queries to σ itself, e.g.: longest substring in σ occurring at least twice.
3. Prefix search: all positions in σ with prefix α.

4 Properties of suffix trees
4. Range search: all positions in σ in the interval [α, β] with α ≤_lex β, e.g. abracadabra, acacia ∈ [abc, acc], abacus ∉ [abc, acc].
5. Linear complexity: required space and construction time in O(|σ|).
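
The range search of property 4 can be illustrated with a small sketch. It uses a sorted list of suffixes as a stand-in for the suffix tree, which is an assumption made for brevity and does not reach the linear bounds of property 5. The function name `range_search` is hypothetical.

```python
from bisect import bisect_left, bisect_right

def range_search(text, lo, hi):
    """Return the 1-based start positions of all suffixes s with lo <= s <= hi."""
    # Sorted (suffix, position) pairs: a simple stand-in for the suffix tree.
    suffixes = sorted((text[i:], i + 1) for i in range(len(text)))
    keys = [s for s, _ in suffixes]
    left = bisect_left(keys, lo)     # index of the first suffix >= lo
    right = bisect_right(keys, hi)   # index of the first suffix > hi
    return sorted(pos for _, pos in suffixes[left:right])

# The membership examples from the slide, checked with plain string comparison.
print("abc" <= "abracadabra" <= "acc")
print("abc" <= "acacia" <= "acc")
print("abc" <= "abacus" <= "acc")
print(range_search("abracadabra", "abc", "acc"))
```

Python compares strings lexicographically, so the interval test itself is a one-liner; the sorted list only makes the lookup a pair of binary searches.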

5 Tries
Trie: tree for representing keys.
Alphabet Σ, set S of keys, S ⊂ Σ*.
Key ≙ string ∈ Σ*.
Edge of a trie T: labeled with a single character from Σ.
Neighboring edges (edges leaving the same node): labeled with different characters.

6 Tries
Example:
[Figure: a trie whose edges are labeled with the single characters a, b and c]

7 Tries
Each leaf represents a key: the key corresponds to the labeling of the edges on the path from the root to the leaf.
Keys are not stored in nodes!
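
A minimal sketch of such a trie, assuming nested Python dicts as nodes (one dict per node, one entry per outgoing edge) and assuming that no key is a prefix of another, so that every key ends in a leaf. The names `trie_insert` and `trie_keys` are hypothetical.

```python
def trie_insert(root, key):
    """Insert key into a trie stored as nested dicts."""
    node = root
    for ch in key:
        # Each edge carries one character; edges leaving the same node
        # differ because dict keys are unique.
        node = node.setdefault(ch, {})

def trie_keys(node, prefix=""):
    """Yield the key spelled by the edge labels on every root-to-leaf path."""
    if not node:                 # a leaf has no outgoing edges
        yield prefix
    for ch in sorted(node):
        yield from trie_keys(node[ch], prefix + ch)

root = {}
for key in ["ababc", "babc", "abc", "bc", "c"]:
    trie_insert(root, key)
print(list(trie_keys(root)))
```

The keys are recovered purely from edge labels; no node stores a key.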

8 Suffix tries
Trie for all suffixes of a text.
Example: σ = ababc
Suffixes:
ababc = suf_1
babc = suf_2
abc = suf_3
bc = suf_4
c = suf_5
[Figure: suffix trie of ababc with single-character edge labels]

9 Suffix tries
Internal nodes of a suffix trie ≙ substrings of σ: each proper substring of σ is represented as an internal node.
Let σ = a^n b^n: there are n^2 + 2n + 1 different substrings (the strings a^i b^j with 0 ≤ i, j ≤ n, including the empty string) ≙ internal nodes ⇒ the space requirement grows quadratically, Θ(n^2).
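
The count n^2 + 2n + 1 can be checked with a short sketch, assuming a suffix trie stored as nested dicts. It counts all nodes of the trie, including the root (the empty string), and does not distinguish leaves from internal nodes. The names `suffix_trie` and `count_nodes` are hypothetical.

```python
def suffix_trie(text):
    """Build the suffix trie of text as nested dicts (one dict per node)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def count_nodes(node):
    """Count all nodes, including the root, i.e. all distinct substrings."""
    return 1 + sum(count_nodes(child) for child in node.values())

for n in range(1, 6):
    text = "a" * n + "b" * n
    # Both numbers in each printed row should agree.
    print(n, count_nodes(suffix_trie(text)), n * n + 2 * n + 1)
```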

10 Suffix tries
A suffix trie T fulfills some of the required properties:
1. String matching for α: follow the path with edge labels α in T in time O(|α|). Number of leaves of the subtree ≙ number of occurrences of α.
2. Longest repeated substring: internal node with the greatest depth which has at least two children.
3. Prefix search: all occurrences of strings with prefix α can be found in the subtree below the internal node corresponding to α in T.
[Figure: suffix trie of ababc]
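
The first two queries can be sketched on a suffix trie stored as nested dicts. This is an illustration only: counting leaves gives the number of occurrences under the assumption that no suffix is a prefix of another (true for ababc, whose last character is unique). All function names are hypothetical.

```python
def suffix_trie(text):
    """Build the suffix trie of text as nested dicts (one dict per node)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def find(root, pattern):
    """Follow the path labeled pattern; return the node reached, or None."""
    node = root
    for ch in pattern:
        if ch not in node:
            return None
        node = node[ch]
    return node

def count_leaves(node):
    return 1 if not node else sum(count_leaves(c) for c in node.values())

def occurrences(root, pattern):
    """Number of occurrences = number of leaves below the pattern's node."""
    node = find(root, pattern)
    return 0 if node is None else count_leaves(node)

def longest_repeated(node, prefix=""):
    """Path label of the deepest node that has at least two children."""
    best = prefix if len(node) >= 2 else ""
    for ch, child in node.items():
        candidate = longest_repeated(child, prefix + ch)
        if len(candidate) > len(best):
            best = candidate
    return best

trie = suffix_trie("ababc")
print(occurrences(trie, "ab"), occurrences(trie, "c"), occurrences(trie, "ca"))
print(longest_repeated(trie))
```

The prefix search of item 3 uses the same `find` step; the positions are then read off the leaves below the node that was found.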

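11 Suffix trees
A suffix tree is created from a suffix trie by contraction of unary nodes:
suffix tree = contracted suffix trie
[Figure: suffix trie of ababc next to its suffix tree; the suffix tree has the root edges ab, b and c, and below both ab and b the leaf edges abc and c]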
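
The contraction step can be sketched as follows, again assuming nested dicts as nodes. Edge labels become strings here rather than index pairs, which is a simplification for readability. The names `suffix_trie` and `contract` are hypothetical.

```python
def suffix_trie(text):
    """Build the suffix trie of text as nested dicts (one dict per node)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def contract(node):
    """Merge every chain of unary nodes into one edge labeled with a string."""
    result = {}
    for ch, child in node.items():
        label = ch
        # Walk down while the node reached has exactly one outgoing edge.
        while len(child) == 1:
            (next_ch, next_child), = child.items()
            label += next_ch
            child = next_child
        result[label] = contract(child)
    return result

print(contract(suffix_trie("ababc")))
```

Storing the labels as strings costs quadratic space in the worst case, which is why the slides switch to pairs of indices next.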

12 Internal representation of suffix trees
Child-sibling representation.
Substring: pair of numbers (i, j).
Example: σ = ababc
[Figure: suffix tree T of ababc with the edge labels ab, b, c at the root and abc, c below ab and below b]

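13 Internal representation of suffix trees
Example: σ = ababc
[Figure: the suffix tree with edge labels replaced by index pairs: root edges ab = (1,2), b = (2,2), c = (5,$); below ab and below b the leaf edges abc = (3,$) and c = (5,$)]
Node v = (v.u, v.o, v.sn, v.br), where v.u and v.o are the lower and upper index of the substring labeling the edge into v, v.sn points to the first child and v.br to the next sibling.
Further pointers (suffix pointers) are added later.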
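
A minimal sketch of this child-sibling layout, assuming that u and o are 1-based inclusive indices into σ, that sn is the first child and br the next sibling, and that `None` in the field o plays the role of "$" (end of the text). These readings of the field names are assumptions, as are the helper names `label` and `children`.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    u: int = 0                    # lower index of the edge label (unused for the root)
    o: Optional[int] = None       # upper index, inclusive; None stands for "$"
    sn: Optional["Node"] = None   # first child
    br: Optional["Node"] = None   # next sibling

def label(text, v):
    """Return the substring of text that labels the edge into v."""
    end = len(text) if v.o is None else v.o
    return text[v.u - 1:end]

def children(v):
    """Iterate over the children of v by following the sibling pointers."""
    w = v.sn
    while w is not None:
        yield w
        w = w.br

def two_leaves():
    # Leaf edges (3,$) and (5,$), i.e. abc and c.
    return Node(3, None, br=Node(5, None))

text = "ababc"
c = Node(5, None)
b = Node(2, 2, sn=two_leaves(), br=c)
ab = Node(1, 2, sn=two_leaves(), br=b)
root = Node(sn=ab)

print([label(text, w) for w in children(root)])
print([label(text, w) for w in children(ab)])
```

Every node has constant size regardless of the label length, so the whole tree fits in space proportional to the number of nodes.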

14 Properties of suffix trees
(S1) No suffix is a prefix of another suffix; this holds if (last character of σ) = $ ∉ Σ.
Search:
(T1) Edge ≙ non-empty substring of σ.
(T2) Neighboring edges: the corresponding substrings start with different characters.

15 Properties of suffix trees
Size:
(T3) Each internal node (≠ root) has at least two children.
(T4) Leaf ≙ (non-empty) suffix of σ.
Let n = |σ| ≠ 1. Then (T4) gives n leaves and (T3) gives at most n − 1 internal nodes.

16 Construction of suffix trees
Definition:
Partial path: path from the root to a node in T.
Path: a partial path ending in a leaf.
Location of a string α: node at the end of the partial path labeled with α (if it exists).
[Figure: suffix tree T of ababc]

17 Construction of suffix trees
Extension of a string α: string with prefix α.
Extended location of a string α: location of the shortest extension of α whose location is defined.
Contracted location of a string α: location of the longest prefix of α whose location is defined.
[Figure: suffix tree T of ababc]

18 Construction of suffix trees
Definitions:
suf_i: suffix of σ starting at position i, e.g. suf_1 = σ, suf_n = $.
head_i: longest prefix of suf_i which is also a prefix of suf_j for some j < i.
Example: σ = bbabaabc, α = baa (has no location), suf_4 = baabc, head_4 = ba.
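
The definition of head_i can be checked by brute force. The sketch below compares suf_i with every earlier suffix, which takes quadratic time per call and is meant only to make the definition concrete. Positions are 1-based as on the slide; the function names `suf` and `head` are hypothetical.

```python
def suf(text, i):
    """Suffix of text starting at 1-based position i."""
    return text[i - 1:]

def head(text, i):
    """Longest prefix of suf_i that is also a prefix of some suf_j with j < i."""
    s = suf(text, i)
    best = 0
    for j in range(1, i):
        t = suf(text, j)
        k = 0
        # Length of the common prefix of suf_i and suf_j.
        while k < len(s) and k < len(t) and s[k] == t[k]:
            k += 1
        best = max(best, k)
    return s[:best]

text = "bbabaabc"
print(suf(text, 4), head(text, 4))
print([head(text, i) for i in range(1, len(text) + 1)])
```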

19 Construction of suffix trees
[Figure: suffix tree for σ = bbabaabc]

20 Naive suffix-tree construction
Begin with the empty tree T_0.
Tree T_{i+1} is created from T_i by inserting suffix suf_{i+1}.
Algorithm suffix tree
Input: a text σ
Output: the suffix tree T for σ
  n := |σ|; T_0 := ∅;
  for i := 0 to n − 1 do
    insert suf_{i+1} in T_i, resulting in T_{i+1};
  end for

21 Naive suffix-tree construction
In T_{i−1} all suffixes suf_j (j < i) already have a location.
⇒ head_i = longest prefix of suf_i whose extended location in T_{i−1} exists.
Definition: tail_i := suf_i − head_i, i.e. suf_i = head_i tail_i.
tail_i ≠ ε.

22 Naive suffix-tree construction
Example: σ = ababc, suf_3 = abc, head_3 = ab, tail_3 = c.
T_0 = empty tree
T_1 = root with the leaf edge ababc
T_2 = root with the leaf edges ababc and babc

23 Naive suffix-tree construction
T_{i+1} can be constructed from T_i as follows:
1. Determine the extended location of head_{i+1} in T_i and split the last edge leading to this location into two new edges by inserting a new node.
2. Create a new leaf as location for suf_{i+1}.
[Figure: node x = extended location of head_{i+1}; node v at the end of the path labeled head_{i+1}; new leaf edge labeled tail_{i+1}]

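24 Naive suffix-tree construction
Example: σ = ababc, head_3 = ab, tail_3 = c.
[Figure: T_2 with the leaf edges ababc and babc; in T_3 the edge ababc is split into the edge ab followed by the leaf edges abc and c, while the leaf edge babc is unchanged]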

25 Naive suffix-tree construction
Algorithm suffix insertion
Input: tree T_i and suffix suf_{i+1}
Output: tree T_{i+1}
  v := root of T_i
  j := i
  repeat
    find child w of v with σ_{w.u} = σ_{j+1} (w = nil if there is no such child)
    if w ≠ nil then
      k := w.u − 1
      while k < w.o and σ_{k+1} = σ_{j+1} do
        k := k + 1; j := j + 1
      end while

26 Naive suffix-tree construction
      if k = w.o then v := w
    end if
  until w = nil or k < w.o
  /* v is the contracted location of head_{i+1} */
  insert the location of head_{i+1} and tail_{i+1} in T_i below v
Running time for suffix insertion: O(n)
Total time for naive suffix-tree construction: O(n^2)
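
The naive construction can be sketched in runnable form. This is an illustration under stated assumptions, not the lecture's implementation: it uses 0-based half-open index pairs (start, end) instead of (u, o), a dict of children instead of child and sibling pointers, and it requires the last character of the text to be unique so that (S1) holds. The names `Node`, `insert_suffix`, `build_suffix_tree` and `as_dict` are hypothetical.

```python
class Node:
    def __init__(self, start=0, end=0):
        self.start = start     # the edge label is text[start:end]
        self.end = end
        self.children = {}     # first character of a child's edge label -> child

def insert_suffix(root, text, i):
    """Insert the suffix text[i:] into the tree below root."""
    v, j = root, i
    while True:
        w = v.children.get(text[j])
        if w is None:
            break                      # v is the location of head
        k = w.start
        while k < w.end and text[k] == text[j]:
            k += 1
            j += 1
        if k == w.end:
            v = w                      # whole edge matched: continue below w
            continue
        # Mismatch inside the edge: split it by inserting a new node.
        mid = Node(w.start, k)
        v.children[text[w.start]] = mid
        w.start = k
        mid.children[text[k]] = w
        v = mid
        break
    v.children[text[j]] = Node(j, len(text))   # new leaf, edge labeled with tail

def build_suffix_tree(text):
    # Assumption (S1): the last character is unique, e.g. a terminator "$".
    assert text and text.count(text[-1]) == 1
    root = Node()
    for i in range(len(text)):
        insert_suffix(root, text, i)
    return root

def as_dict(text, v):
    """Render the tree as nested dicts keyed by edge label, for inspection."""
    return {text[c.start:c.end]: as_dict(text, c) for c in v.children.values()}

print(as_dict("ababc", build_suffix_tree("ababc")))
```

Each insertion walks at most the length of the suffix, which matches the quadratic total running time of the naive method.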

27 The algorithm M (McCreight, 1976)
When the extended location of head_{i+1} in T_i has been found: creation of a new node and edge splitting in O(1) time.
Idea: the extended location of head_{i+1} is determined in constant amortized time in T_i. (Additional information is required!)

28 Analysis of algorithm M
Theorem 1: Algorithm M constructs a suffix tree for σ with |σ| leaves and at most |σ| − 1 internal nodes in time O(|σ|).
Remark: Ukkonen (1992) found an O(n) on-line algorithm for the construction of suffix trees, i.e. after each step i, the resulting structure is a correct suffix tree for t_1 … t_i (where σ = t_1 … t_n).

29 Suffix tree: application
Usage of suffix tree T:
1. Search for string α: follow the path with edge labeling α in T in time O(|α|). Leaves of the subtree ≙ occurrences of α.
2. Search for the longest repeated substring: find the location of a substring with the greatest weighted depth (length of the path label) that is an internal node.
3. Prefix search: all occurrences of strings with prefix α can be found in the subtree below the "location" of α in T.

30 Suffix tree: application
4. Range query for [α, β]:
[Figure: suffix tree with the range boundaries marked]

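31 Suffix tree: example
σ = bbabaabc. Trees are written as nested lists of edge labels: the children of a node follow its incoming edge label in parentheses.
T_0 = empty tree
T_1 = ( bbabaabc )
suf_1 = bbabaabc, suf_2 = babaabc, head_2 = b

32 Suffix tree: example
T_2 = ( b( abaabc, babaabc ) )
suf_3 = abaabc, head_3 = ε
T_3 = ( abaabc, b( abaabc, babaabc ) )
suf_4 = baabc, head_4 = ba

33 Suffix tree: example
T_4 = ( abaabc, b( a( abc, baabc ), babaabc ) )
The new node at the end of the path ba is the location of head_4.
suf_5 = aabc, head_5 = a

34 Suffix tree: example
T_5 = ( a( abc, baabc ), b( a( abc, baabc ), babaabc ) )
The new node at the end of the path a is the location of head_5.
suf_6 = abc, head_6 = ab

35 Suffix tree: example
T_6 = ( a( abc, b( aabc, c ) ), b( a( abc, baabc ), babaabc ) )
The new node at the end of the path ab is the location of head_6.
suf_7 = bc, head_7 = b

36 Suffix tree: example
T_7 = ( a( abc, b( aabc, c ) ), b( a( abc, baabc ), babaabc, c ) )
suf_8 = c

37 Suffix tree: example
T_8 = ( a( abc, b( aabc, c ) ), b( a( abc, baabc ), babaabc, c ), c )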

