Suffix Arrays and Suffix Trees

Presentation transcript:

Suffix Arrays and Suffix Trees Stefan Burkhardt

Motivation; What are suffix arrays and trees?; Examples; Some construction algorithms

Motivation
Many biological problems require approximate matching.
No efficient (in space and time!) indices for approximate matching are known.
Filter algorithms for approximate matching use exact matching; to be efficient, fast exact matching algorithms have to be employed.
=> Indices for exact string matching

What are suffix arrays and trees?
Text indexing data structures that are not word based and allow searching for patterns or computing statistics.
Important properties:
- Size
- Speed of exact matching
- Space required for construction
- Time required for construction

The Suffix Array Definition: Given a string D the suffix array SA for this string is the sorted list of pointers to all suffixes of D. (Manber, Myers 1990)
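The definition can be tried out directly with a minimal Python sketch. It assumes the text fits in memory and sorts the suffixes by plain string comparison, which is simple but far from the construction algorithms discussed later; the function name `suffix_array` is chosen here for illustration only.

```python
def suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of text in lexicographic order."""
    # Sorting by the suffix itself is simple, but it copies O(n^2) characters in total.
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = suffix_array("ABAABBABBAC")
print(sa)  # [2, 0, 3, 6, 9, 1, 5, 8, 4, 7, 10]
```

The printed array is the one used in the example slides for D = A B A A B B A B B A C.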

Example: D = A B A A B B A B B A C
All suffixes of D with their start positions:
 0  A B A A B B A B B A C
 1  B A A B B A B B A C
 2  A A B B A B B A C
 3  A B B A B B A C
 4  B B A B B A C
 5  B A B B A C
 6  A B B A C
 7  B B A C
 8  B A C
 9  A C
10  C
SORT!

Example: D = A B A A B B A B B A C
Suffixes after sorting:
 2  A A B B A B B A C
 0  A B A A B B A B B A C
 3  A B B A B B A C
 6  A B B A C
 9  A C
 1  B A A B B A B B A C
 5  B A B B A C
 8  B A C
 4  B B A B B A C
 7  B B A C
10  C

Exact matching using a Suffix Array
D  = A B A A B B A B B A C
SA = 2 0 3 6 9 1 5 8 4 7 10
Basic idea: 2 binary searches in SA
- Search for leftmost position
- Search for rightmost position

Search for leftmost occurrence of BB in D = A B A A B B A B B A C, SA = 2 0 3 6 9 1 5 8 4 7 10 (SA indices 0..10)

Search for leftmost occurrence of BB: BB > BA. Continue binary search in the right (larger) half of SA.

Search for leftmost occurrence of BB: BB = BB. More occurrences of BB left of this one are possible!

Search for leftmost occurrence of BB: BB > BA. The leftmost position of BB is pointed to by SA[8].

Search for rightmost occurrence of BB in D = A B A A B B A B B A C, SA = 2 0 3 6 9 1 5 8 4 7 10 (SA indices 0..10): BB > BA. Search further to the right.

Search for rightmost occurrence of BB: BB = BB. More occurrences of BB right of this one are possible!

Search for rightmost occurrence of BB: BB = BB. More occurrences of BB right of this one are possible!

Search for rightmost occurrence of BB: BB < C. The rightmost position of BB is pointed to by SA[9].

Results of search for BB in D = A B A A B B A B B A C, SA = 2 0 3 6 9 1 5 8 4 7 10 (SA indices 0..10)
- The leftmost position of BB is pointed to by SA[8]
- The rightmost position of BB is pointed to by SA[9]
=> All occurrences of the pattern BB are pointed to by SA[8..9]
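The two binary searches can be sketched in Python as follows. This is an illustrative version, not the presenter's code: the name `find_range` is hypothetical, and each step compares the pattern against the first p characters of a suffix, which gives the O(p log N) character comparisons of the simple method.

```python
def find_range(text, sa, pattern):
    """Return (left, right) so that sa[left..right] lists all occurrences, or None."""
    p = len(pattern)

    # Leftmost: first SA index whose suffix, cut to p characters, is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] < pattern:
            lo = mid + 1          # continue in the right (larger) half
        else:
            hi = mid              # more occurrences to the left are possible
    left = lo

    # Rightmost: first SA index whose cut suffix is > pattern, minus one.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + p] <= pattern:
            lo = mid + 1          # more occurrences to the right are possible
        else:
            hi = mid
    right = lo - 1

    return (left, right) if left <= right else None


D = "ABAABBABBAC"
SA = [2, 0, 3, 6, 9, 1, 5, 8, 4, 7, 10]
print(find_range(D, SA, "BB"))  # (8, 9), i.e. text positions SA[8] = 4 and SA[9] = 7
```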

Important properties for |SA| = N and p = length of pattern:
- Size: 1 pointer per letter (4 bytes if N < 4 GB)
- Speed of exact matching: O(log N) binary search steps
- The number of compared characters is O(p log N); this can be reduced to O(p + log N)

Some known construction methods: Manber-Myers
Variant of the labeling technique of Karp, Miller and Rosenberg.
Sorting of suffixes is performed in rounds: round i sorts the substrings of length 2^i (0 ≤ i ≤ log(n)); each round is possible in O(n).
Construction in O(n log(n)) time, 2n pointers space.
Space for external construction depends on the sort implementation:
- multiway mergesort: 6n pointers
- in-place merge: 3n pointers (slower)

Some known construction methods: Manber-Myers
Round 1: 2-pass bucket sort using the first character. Create 2 arrays, Pos and Prm:
- Pos[k]: pointer to the kth smallest suffix
- Prm[k]: position of suffix k in Pos (the inverse of Pos), so Prm[Pos[k]] = k
Round i: use the fact that when comparing suffixes x and y
1. on positions 0 .. 2^(i-1) - 1, suffixes x and y are equal
2. on positions 2^(i-1) .. 2^i - 1, suffixes x and y have already been compared!
The result is given by comparing suffix x + 2^(i-1) with suffix y + 2^(i-1); use Prm to access suffixes x + 2^(i-1) and y + 2^(i-1).
Example: D = A B A A C, Pos = 2 0 3 1 4, Prm = 1 3 0 2 4
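The doubling idea can be illustrated with a short Python sketch. It is a simplification made for this example: each round uses Python's comparison sort instead of the O(n) bucket sort of Manber-Myers, so it runs in O(n log^2 n) rather than O(n log n). The names `pos` and `prm` mirror Pos and Prm above; `suffix_array_doubling` is a hypothetical name.

```python
def suffix_array_doubling(text: str):
    """Return (pos, prm): the suffix array and its inverse, built by prefix doubling."""
    n = len(text)
    if n == 0:
        return [], []
    # prm[i] = rank of suffix i when only its first h characters are considered.
    prm = [ord(c) for c in text]
    pos = list(range(n))
    h = 1
    while True:
        # Suffixes are compared by (rank of first h chars, rank of next h chars).
        # -1 stands for "past the end of the text" and sorts before every rank.
        def key(i):
            return (prm[i], prm[i + h] if i + h < n else -1)

        pos.sort(key=key)
        new_prm = [0] * n
        for k in range(1, n):
            differs = key(pos[k]) != key(pos[k - 1])
            new_prm[pos[k]] = new_prm[pos[k - 1]] + differs
        prm = new_prm
        if prm[pos[-1]] == n - 1:  # all ranks distinct: the order is final
            return pos, prm
        h *= 2


pos, prm = suffix_array_doubling("ABAAC")
print(pos)  # [2, 0, 3, 1, 4]
print(prm)  # [1, 3, 0, 2, 4]
```

The second component of the key is exactly the lookup "use Prm to access suffix x + 2^(i-1)": the order of the second halves is read from the ranks computed in the previous round instead of comparing characters again.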

Baeza-Yates-Gonnet-Snider (External)
Idea: the text is cut into pieces of size M. The algorithm runs in N/M rounds; in each round:
- compute the SA for the current text piece
- merge this SA with the suffix array for the previous pieces
Run time: O(N^3 log(M) / M) time, O(N^3 log(M) / (MB)) block I/Os
Space: 2n pointers

Example BGS Construction: M = 4
D = A B A A B B A B
Piece 1: A B A A, SA1 = 3 2 0 1
Piece 2: B B A B, SA2 = 2 3 1 0
Merge SA1, SA2

Example BGS Construction:
[Figure: merge step. Piece B B A B with pointers 7 6 4 5, piece A B A A with pointers 3 2 0 1; suffixes BAA, ABAA, AA, A labeled 2, 1, 0, 3]
New SA: 3 0 7 1 2 6 4 5

Baeza-Yates-Gonnet-Snider (External)
Runtime analysis of one round:
- compute SA for the current piece of size M: O(M log M) sort comparisons of suffixes
  Problem: the worst case for one comparison is complete suffixes (= N)
  But: the expected case is much smaller (lcp)
  Worst case runtime: O(N M log M)
- merge SA with the already existing SAx: length of SAx: O(N) => number of merge steps: O(N)
  one merge step = 1 comparison = O(N) worst case => O(N^2) runtime
=> total runtime for one round: O(N^2 + N M log M) = O(N^2 log M)
N/M rounds => total runtime = O(N^3 log M / M)
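The round structure (sort one piece, then merge it into the array built so far) can be sketched in memory with Python. This only illustrates the control flow under simplifying assumptions: nothing is written to disk, suffixes are compared as suffixes of the whole text, and the name `build_in_pieces` is hypothetical. It is not the external-memory algorithm itself.

```python
from heapq import merge


def build_in_pieces(text: str, m: int) -> list[int]:
    """Build the suffix array of text by sorting pieces of m suffixes and merging."""
    sa: list[int] = []
    for start in range(0, len(text), m):
        piece = range(start, min(start + m, len(text)))
        # Sort the suffixes starting in this piece. A single comparison may
        # read up to a whole suffix: the O(N) worst case from the analysis.
        piece_sa = sorted(piece, key=lambda i: text[i:])
        # Merge with the suffix array built from the earlier pieces;
        # this takes O(N) merge steps of one suffix comparison each.
        sa = list(merge(sa, piece_sa, key=lambda i: text[i:]))
    return sa


print(build_in_pieces("ABAABBABBAC", 4))  # [2, 0, 3, 6, 9, 1, 5, 8, 4, 7, 10]
```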

The Suffix Tree Definition: Given a string D the suffix tree ST for this string is the compacted trie built on all suffixes of D. (Weiner, 1973)

The Suffix Tree
Structural properties:
- Each arc of the tree denotes a substring
- Each inner node has outdegree > 1
- The arcs leaving a node start with different characters
- Each leaf l denotes the suffix composed of all arc labels on the path from the root to l
- N leaves and < N internal nodes
- A special character is used as end marker

An Example
D = A B A A B A B A A B A A B A B A $ (positions 0..16)
[Figure: suffix tree of D; the leaves are labeled with suffix start positions]

Simple Construction
for all suffixes s: insert(s)
D = A B A A B A B A A B A A B A B A $ (positions 0..16)

After inserting suffix 0:
  root
    ABAABABAABAABABA$ (leaf 0)

After inserting suffix 1:
  root
    ABAABABAABAABABA$ (leaf 0)
    BAABABAABAABABA$ (leaf 1)

After inserting suffix 2, the arc A is split off:
  root
    A
      BAABABAABAABABA$ (leaf 0)
      ABABAABAABABA$ (leaf 2)
    BAABABAABAABABA$ (leaf 1)

After inserting suffix 3, the arc BA is split off:
  root
    A
      BA
        ABABAABAABABA$ (leaf 0)
        BAABAABABA$ (leaf 3)
      ABABAABAABABA$ (leaf 2)
    BAABABAABAABABA$ (leaf 1)

Problem: O(n^2) space (N + (N-1) + (N-2) + ... + 1 characters on the arcs)
Example: D = A B C D E $ (positions 0..5)
[Figure: tree of D in which every suffix hangs directly off the root with its full label]
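The quadratic blow-up is easy to observe when every character of every inserted suffix is stored explicitly. The Python sketch below uses nested dictionaries with one node per character (an uncompacted trie, chosen here only to make the counting obvious); the function names are hypothetical.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert all suffixes of text into a trie with one node per character."""
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})  # reuse the child if it exists
    return root


def count_nodes(node: dict) -> int:
    """Count all nodes below the given node."""
    return sum(1 + count_nodes(child) for child in node.values())


print(count_nodes(build_suffix_trie("ABCDE$")))  # 21 = 6 + 5 + 4 + 3 + 2 + 1
```

For a text with all characters distinct no suffix shares a prefix with another one, so the count is exactly N + (N-1) + ... + 1.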

Solution: Arc Pointers
D = A B A A B A B A A B A A B A B A $ (positions 0..16)
Each arc label is stored as a pair of pointers (start, end) into D instead of as a string.
[Figure: suffix tree of D in which the arc labels are replaced step by step by the pairs (0,0), (1,2), (3,5), (6,7), (8,16)]
O(n) arcs => O(n) pointer pairs

Searching
P = A B A A B A B
[Figure: suffix tree of D; the pattern P is matched along the arcs starting at the root, one animation step per slide]
D = A B A A B A B A, A B A A B A B A A B
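Arc pointers and searching can be combined in one small Python sketch. It is an illustration under stated assumptions: the tree is built by the simple "insert every suffix" method in O(n^2) time (not by a linear-time algorithm), the text must end with a unique end marker, and the names `Node`, `build_suffix_tree` and `find` are hypothetical.

```python
class Node:
    def __init__(self, start=0, end=-1, leaf=None):
        self.start, self.end = start, end  # arc label is text[start..end], inclusive
        self.leaf = leaf                   # suffix start position for leaves, else None
        self.children = {}                 # first character of an arc -> child node


def build_suffix_tree(text: str) -> Node:
    """Naive O(n^2) construction; text must end with a unique end marker."""
    root, n = Node(), len(text)
    for s in range(n):
        node, i = root, s
        while True:
            child = node.children.get(text[i])
            if child is None:              # no arc starts with text[i]: new leaf
                node.children[text[i]] = Node(i, n - 1, leaf=s)
                break
            k = child.start
            while k <= child.end and text[k] == text[i]:
                k += 1
                i += 1
            if k > child.end:              # whole arc matched: descend
                node = child
                continue
            mid = Node(child.start, k - 1)  # mismatch inside the arc: split it
            node.children[text[child.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[i]] = Node(i, n - 1, leaf=s)
            break
    return root


def find(root: Node, text: str, pattern: str) -> list[int]:
    """Return the sorted start positions of all occurrences of pattern."""
    node, i = root, 0
    while i < len(pattern):
        child = node.children.get(pattern[i])
        if child is None:
            return []
        k = child.start
        while k <= child.end and i < len(pattern):
            if text[k] != pattern[i]:
                return []
            k += 1
            i += 1
        node = child
    hits, stack = [], [node]               # every leaf below is an occurrence
    while stack:
        cur = stack.pop()
        if cur.leaf is not None:
            hits.append(cur.leaf)
        stack.extend(cur.children.values())
    return sorted(hits)


D = "ABAABABAABAABABA$"
tree = build_suffix_tree(D)
print(find(tree, D, "ABAABAB"))  # [0, 8]
```

Each node stores only two integers for its arc, so the space is proportional to the number of nodes, independent of the label lengths.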

Some Structural Properties
Longest common prefix of two suffixes in D: depth of the lowest common node of the suffixes.
[Figure: two suffixes of D that branch at a common node; lcp = 2]

Some Structural Properties
Longest repeat in D: maximum depth of any inner node.
Most common string of length m:
- For each node, save the number of leaves below it
- Examine all nodes with depth >= m
Many more...
Several applications in biology (see for example the book by Gusfield).
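The longest repeat can be checked without building a tree. The Python sketch below swaps in a different technique: it sorts the suffixes and takes the longest common prefix of neighbouring suffixes, which equals the maximum string depth of an inner node of the suffix tree. It is a quadratic illustration, and `longest_repeat` is a hypothetical name.

```python
def longest_repeat(text: str) -> str:
    """Return a longest substring that occurs at least twice in text."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for a, b in zip(sa, sa[1:]):
        # Length of the common prefix of two neighbouring suffixes.
        k = 0
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        if k > len(best):
            best = text[a:a + k]
    return best


print(longest_repeat("ABAABBABBAC"))  # ABBA (at positions 3 and 6)
```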

Summary
Suffix Trees:
- Search time: O(p log |Σ| + occ), where |Σ| is the alphabet size and occ the number of occurrences
- Space: O(N) (between 1.25 and 5 n pointers); implementations for example by Kurtz (Bielefeld)
- Construction: O(N log |Σ|); O(N) for integer alphabets (Farach, 97)
Note: implementation details are extremely important for practical use (constants/space).

Suffix Tree Applications
Work on the following organisms:
- Arabidopsis thaliana (100 Mbps): Michigan State / Minnesota University
- Yeast (13 Mbps): MPI for Biochemistry, Munich
- Borrelia burgdorferi (1 Mbps): Brookhaven Nat. Lab. / Stony Brook Univ.