Suffix Arrays and Suffix Trees


1 Suffix Arrays and Suffix Trees
Stefan Burkhardt

2 Motivation
What are suffix arrays and trees?
Examples
Some construction algorithms

3 Motivation
Many biological problems require approximate matching.
No indices for approximate matching are known that are efficient in both space and time.
Filter algorithms for approximate matching rely on exact matching, so to be efficient they need fast exact matching algorithms.
=> Indices for exact string matching

4 What are suffix arrays and trees?
Text indexing data structures
  not word based
  allow search for patterns or computation of statistics
Important properties
  Size
  Speed of exact matching
  Space required for construction
  Time required for construction

5 The Suffix Array
Definition: Given a string D, the suffix array SA for this string is the sorted list of pointers to all suffixes of D. (Manber, Myers 1990)
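
The definition can be turned into code almost literally. The following is a minimal sketch, not one of the construction algorithms from the later slides: it simply sorts the start positions by the suffix they point to, which is easy to read but can take O(N^2 log N) character comparisons. The function name `suffix_array` is chosen here for illustration.

```python
def suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of text in sorted order."""
    # Each position i stands for the suffix text[i:]; sorting the positions
    # by that suffix yields the suffix array as defined on the slide.
    return sorted(range(len(text)), key=lambda i: text[i:])


D = "ABAABBABBAC"
SA = suffix_array(D)
for rank, start in enumerate(SA):
    # rank = index into SA, start = pointer into D
    print(rank, start, D[start:])
```

For the example string used on the following slides this produces the pointer order 2, 0, 3, 6, 9, 1, 5, 8, 4, 7, 10.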

6 Example: D = A B A A B B A B B A C
All suffixes of D with their start positions:
 0 A B A A B B A B B A C
 1 B A A B B A B B A C
 2 A A B B A B B A C
 3 A B B A B B A C
 4 B B A B B A C
 5 B A B B A C
 6 A B B A C
 7 B B A C
 8 B A C
 9 A C
10 C
SORT!

7 Example: D = A B A A B B A B B A C
Suffixes after sorting:
 2 A A B B A B B A C
 0 A B A A B B A B B A C
 3 A B B A B B A C
 6 A B B A C
 9 A C
 1 B A A B B A B B A C
 5 B A B B A C
 8 B A C
 4 B B A B B A C
 7 B B A C
10 C

8 Exact matching using a Suffix Array
D = A B A A B B A B B A C
SA = 2 0 3 6 9 1 5 8 4 7 10
Basic idea: 2 binary searches in SA
  Search for leftmost position
  Search for rightmost position

9 Search for leftmost occurrence of:
Pattern: B B   Text: A B A A B B A B B A C

10 Search for leftmost occurrence of:
Pattern: B B   Text: A B A A B B A B B A C
BB > BA: continue binary search in the right (larger) half of SA

11 Search for leftmost occurrence of:
Pattern: B B   Text: A B A A B B A B B A C
BB = BB: more occurrences of BB left of this one are possible!

12 Search for leftmost occurrence of:
Pattern: B B   Text: A B A A B B A B B A C
BB > BA: leftmost position of BB is pointed to by SA[8]

13 Search for rightmost occurrence of:
Pattern: B B   Text: A B A A B B A B B A C
BB > BA: search further to the right

14 Search for rightmost occurrence of:
Pattern: B B   Text: A B A A B B A B B A C
BB = BB: more occurrences of BB right of this one are possible!

15 Search for rightmost occurrence of:
Pattern: B B   Text: A B A A B B A B B A C
BB = BB: more occurrences of BB right of this one are possible!

16 Search for rightmost occurrence of:
Pattern: B B   Text: A B A A B B A B B A C
BB < C: rightmost position of BB is pointed to by SA[9]

17 Results of search for: B B
Text: A B A A B B A B B A C
leftmost position of BB is pointed to by SA[8]
rightmost position of BB is pointed to by SA[9]
=> All occurrences of the pattern BB are pointed to by SA[8..9]
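
The two binary searches from the preceding slides can be sketched as follows. This is an illustrative version under simple assumptions: the suffix array is built by plain sorting, each comparison looks at the first p characters of a suffix, and the helper name `find_range` is made up for this example.

```python
def find_range(text: str, sa: list[int], pattern: str) -> tuple[int, int]:
    """Return (left, right) so that sa[left..right] points to all occurrences.

    The range is empty (left > right) if the pattern does not occur.
    """
    p = len(pattern)

    # First binary search: leftmost rank whose suffix prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] < pattern:
            lo = mid + 1          # continue in the right (larger) half
        else:
            hi = mid              # a match here may have matches to its left
    left = lo

    # Second binary search: first rank whose suffix prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] <= pattern:
            lo = mid + 1          # a match here may have matches to its right
        else:
            hi = mid
    return left, lo - 1


D = "ABAABBABBAC"
SA = sorted(range(len(D)), key=lambda i: D[i:])
left, right = find_range(D, SA, "BB")
print(left, right, SA[left:right + 1])
```

For the pattern BB this returns the range 8..9, that is, the text positions 4 and 7. Each of the O(log N) steps compares at most p characters, which gives the O(p log N) bound.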

18 Important Properties for |SA| = N and p = length of pattern:
Size: 1 pointer per letter (4 bytes if N < 4 GB)
Speed of exact matching:
  O(log N) binary search steps
  number of compared characters is O(p log N)
  can be reduced to O(p + log N)

19 Some known Construction methods: Manber-Myers
Variant of the labeling technique of Karp, Miller and Rosenberg
Sorting of suffixes is performed as follows:
  In round i, sort the substrings of length 2^i (0 <= i <= log(n))
  Each round is possible in O(n)
Construction in O(n log(n)) time
Space: 2 n pointers
Space for external construction: dependent on sort implementation
  multiway mergesort: 6 n pointers
  in-place merge: n pointers (slower)

20 Some known Construction methods: Manber-Myers
Round 1: 2-pass bucket sort using the first character
Create 2 arrays, Pos and Prm
  Pos[k]: pointer to the kth smallest suffix
  Prm[k]: pointer into Pos (the inverse of Pos), Prm[Pos[k]] = k
Round i: use the following facts when comparing suffixes x and y:
  1. For positions 0 .. 2^(i-1) - 1, suffixes x and y are equal (otherwise an earlier round has already decided the comparison)
  2. For positions 2^(i-1) .. 2^i - 1, the suffixes x + 2^(i-1) and y + 2^(i-1) have already been compared!
  Result is given by comparing suffix x + 2^(i-1) with suffix y + 2^(i-1)
  Use Prm to access suffix x + 2^(i-1) and y + 2^(i-1)
D = A B A A C   Pos =   Prm =
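
The doubling idea can be sketched in a few lines. Note that this is a simplified variant and not the Manber-Myers implementation itself: each round uses a general comparison sort instead of the O(n) bucket pass, so it runs in O(n log^2 n) rather than O(n log n). The names `pos` and `prm` follow the slide, and `suffix_array_doubling` is a name chosen for this example.

```python
def suffix_array_doubling(text: str) -> tuple[list[int], list[int]]:
    """Prefix doubling: return (pos, prm) with prm[pos[k]] == k."""
    n = len(text)
    if n == 0:
        return [], []
    pos = list(range(n))
    prm = [ord(c) for c in text]   # initial ranks: the first character only
    k = 1                          # suffixes are currently ranked by k characters
    while True:
        def key(i: int) -> tuple[int, int]:
            # Rank of the first k characters, then rank of the next k characters.
            # prm gives the second half directly; -1 means "past the end".
            return (prm[i], prm[i + k] if i + k < n else -1)

        pos.sort(key=key)
        new_prm = [0] * n
        for j in range(1, n):
            # Suffixes with equal keys stay tied and share a rank.
            tied = key(pos[j]) == key(pos[j - 1])
            new_prm[pos[j]] = new_prm[pos[j - 1]] + (0 if tied else 1)
        prm = new_prm
        if prm[pos[-1]] == n - 1:  # all ranks distinct: fully sorted
            return pos, prm
        k *= 2


pos, prm = suffix_array_doubling("ABAAC")
print(pos, prm)
```

For D = ABAAC this gives pos = [2, 0, 3, 1, 4] and prm = [1, 3, 0, 2, 4], the fully sorted state in which prm is the inverse of pos.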

21 Baeza-Yates-Gonnet-Snider (External)
Idea: the text is cut into pieces of size M
Runs in N/M rounds; in each round:
  - compute SA for the current text piece
  - merge SA with the suffix array for the previous pieces
Run time: O(N^3 log(M) / M) time, O(N^3 log(M) / (M B)) block I/Os
Space: 2 n pointers
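
The round structure can be illustrated with an in-memory sketch. It is only meant to show the "sort one piece, then merge" loop: there is no disk access or block I/O, suffixes are compared in full by slicing, and the function name `suffix_array_by_pieces` is an assumption of this example rather than part of the original method.

```python
def suffix_array_by_pieces(text: str, m: int) -> list[int]:
    """Build the suffix array in rounds over pieces of m start positions."""
    sa: list[int] = []                       # suffix array of all previous pieces
    for begin in range(0, len(text), m):
        end = min(begin + m, len(text))
        # Round step 1: sort the suffixes that start in the current piece.
        piece = sorted(range(begin, end), key=lambda i: text[i:])
        # Round step 2: merge with the suffix array built so far.
        merged: list[int] = []
        a = b = 0
        while a < len(sa) and b < len(piece):
            # One merge step is one suffix comparison (O(N) in the worst case).
            if text[sa[a]:] < text[piece[b]:]:
                merged.append(sa[a])
                a += 1
            else:
                merged.append(piece[b])
                b += 1
        merged.extend(sa[a:])
        merged.extend(piece[b:])
        sa = merged
    return sa


print(suffix_array_by_pieces("ABAABBAB", 4))
```

The result is the same as sorting all suffixes at once; only the order in which the work is done changes.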

22 Example BGS Construction: M = 4
Text: A B A A B B A B
Piece 1: A B A A   SA1: 3 2 0 1
Piece 2: B B A B   SA2: 2 3 1 0
Merge SA1, SA2

23 Example BGS Construction:
Text: A B A A B B A B
B B A B   7 6 4 5
A B A A   3 2 0 1
BAA 2   ABAA 1   AA 0   A 3
New SA:

24 Baeza-Yates-Gonnet-Snider (External)
Runtime analysis of one round:
  - compute SA for the current piece of size M:
      O(M log M) sort comparisons of suffixes
      Problem: worst case for one comparison is complete suffixes (= N)
      But: expected case is much smaller (lcp)
      Worst case runtime: O(N M log M)
  - merge SA with the already existing SAx:
      length of SAx: O(N) => number of merge steps: O(N)
      one merge step = 1 comparison = O(N) worst case => O(N^2) runtime
  => total runtime for one round: O(N^2 + N M log M) = O(N^2 log M)
N/M rounds => total runtime = O(N^3 log M / M)

25 The Suffix Tree
Definition: Given a string D, the suffix tree ST for this string is the compacted trie built on all suffixes of D. (Weiner, 1973)

26 The Suffix Tree: Structural Properties
Each arc of the tree denotes a substring
Each internal node has out-degree > 1
The arcs leaving a node start with different characters
Each leaf l denotes the suffix composed of all arc labels on the path from the root to l
N leaves and < N internal nodes
A special character is used as end marker

27-29 An Example
D = A B A A B A B A A B A A B A B A $
[Figure: suffix tree for D, with arcs labeled by substrings of D and leaves numbered by suffix]

30-35 Simple Construction
D = A B A A B A B A A B A A B A B A $
for all suffixes s: insert(s)
[Figure sequence: the suffixes of D are inserted one at a time.
 After suffix 1: one arc ABAABABAABAABABA$ (leaf 1).
 After suffix 2: a second arc BAABABAABAABABA$ (leaf 2).
 After suffix 3: the first arc is split into a shared arc A with two arcs below it, BAABABAABAABABA$ (leaf 1) and ABABAABAABABA$ (leaf 3).
 After suffix 4: further arcs are split in the same way.]

36 Problem: O(n^2) Space
D = A B C D E $
Explicit arc labels need N + (N-1) + (N-2) + ... + 1 characters
[Figure: suffix tree for D with one arc per suffix, leaves numbered by suffix]
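
As a short worked check of this bound, assume (as in the example D = ABCDE$, where no two suffixes share a first character) that every suffix is stored in full on its own arc. The total label length is then:

```latex
\sum_{k=1}^{N} k \;=\; \frac{N(N+1)}{2} \;\in\; \Theta(N^2),
\qquad\text{e.g. } N = 6:\quad 6 + 5 + 4 + 3 + 2 + 1 = 21 .
```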

37-43 Solution: Arc Pointers
D = A B A A B A B A A B A A B A B A $
[Figure sequence: the arc labels of the suffix tree are replaced step by step by pairs of pointers into D, for example (0,0), (1,2), (3,5), (6,7), (8,16)]
O(n) arcs => O(n) pointer pairs

44-49 Searching
P = A B A A B A B
D = A B A A B A B A A B A A B A B A $
[Figure sequence: P is matched character by character along the arcs of the suffix tree, starting at the root; the leaves below the point where P ends give the occurrences of P in D]
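
The search shown on these slides can be tried out with a small sketch that combines three ideas from the deck: the simple construction (insert every suffix), arc labels stored as pointer pairs into D, and matching a pattern from the root. It is a naive O(N^2) illustration, not one of the linear-time algorithms. The class and function names are made up for this example, positions are 0-based, and the text is assumed to end with a unique end marker such as $.

```python
class Node:
    def __init__(self):
        # first character -> (start, end, child); the arc label is text[start..end]
        self.children = {}
        self.suffix = None        # start position of the suffix, set for leaves only


def build_suffix_tree(text: str) -> Node:
    """Insert all suffixes one by one, splitting arcs where a suffix diverges."""
    if not text or text.count(text[-1]) != 1:
        raise ValueError("text must end with a unique end marker")
    n = len(text)
    root = Node()
    for s in range(n):
        node, i = root, s
        while True:
            c = text[i]
            if c not in node.children:
                leaf = Node()
                leaf.suffix = s
                node.children[c] = (i, n - 1, leaf)   # new leaf arc: pointer pair
                break
            start, end, child = node.children[c]
            k = 0
            while start + k <= end and text[start + k] == text[i + k]:
                k += 1
            if start + k > end:                       # whole arc matched, descend
                node, i = child, i + k
                continue
            mid = Node()                              # mismatch inside the arc: split
            mid.children[text[start + k]] = (start + k, end, child)
            node.children[c] = (start, start + k - 1, mid)
            node, i = mid, i + k                      # next loop adds the new leaf
    return root


def search(root: Node, text: str, pattern: str) -> list[int]:
    """Return the sorted start positions of all occurrences of pattern."""
    node, i = root, 0
    while i < len(pattern):
        entry = node.children.get(pattern[i])
        if entry is None:
            return []
        start, end, child = entry
        for j in range(start, end + 1):
            if i == len(pattern):
                break                                 # pattern ends inside this arc
            if text[j] != pattern[i]:
                return []
            i += 1
        node = child
    hits, stack = [], [node]                          # collect all leaves below
    while stack:
        cur = stack.pop()
        if cur.suffix is not None:
            hits.append(cur.suffix)
        stack.extend(child for _, _, child in cur.children.values())
    return sorted(hits)


D = "ABAABABAABAABABA$"
tree = build_suffix_tree(D)
print(search(tree, D, "ABAABAB"))
```

For P = ABAABAB this reports the start positions 0 and 8. Matching P costs time proportional to its length, and collecting the leaves adds time proportional to the number of occurrences.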

50 Some Structural Properties
Longest common prefix of two suffixes in D: string depth of the lowest common ancestor of the two suffix leaves
[Figure: example in the suffix tree with lcp = 2]

51 Some Structural Properties
Longest repeat in D: maximum depth of any inner node
Most common string of length m:
  For each node, save the number of leaves below it
  Examine all nodes with depth >= m
Many more...
Several applications in biology (see, for example, the book by Gusfield)
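
The longest repeat can also be computed without building the tree. The sketch below deliberately swaps in a different technique: it sorts the suffixes and takes the longest common prefix of neighbouring suffixes. The deepest inner node of the suffix tree is the lowest common ancestor of two leaves that are adjacent in sorted order, so both views give the same answer. The function name `longest_repeat` is an assumption of this example, and the sort is the naive one.

```python
def longest_repeat(text: str) -> str:
    """Return a longest substring that occurs at least twice in text."""
    order = sorted(range(len(text)), key=lambda i: text[i:])
    best_len, best_start = 0, 0
    for a, b in zip(order, order[1:]):
        # lcp of two neighbouring suffixes = depth of their lowest common node
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        if k > best_len:
            best_len, best_start = k, a
    return text[best_start:best_start + best_len]


print(longest_repeat("banana"))
```

For the string banana this yields ana, which occurs at positions 1 and 3.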

52 Summary Suffix Trees:
Search time: O(p log |Σ| + occ), where Σ is the alphabet
Space: O(N) (between 1.25 and 5 n pointers)
Implementations, for example by Kurtz (Bielefeld)
Construction: O(N log |Σ|); O(N) for integer alphabets (Farach, 1997)
Note: implementation details are extremely important for practical use (constants/space).

53 Suffix Tree Applications:
Work on the following organisms:
  Arabidopsis thaliana (100 Mbp): Michigan State / Minnesota University
  Yeast (13 Mbp): MPI for Biochemistry, Munich
  Borrelia burgdorferi (1 Mbp): Brookhaven Nat. Lab. / Stony Brook Univ.

