Data structures for Pattern Matching. Suffix trees and suffix arrays are basic data structures in pattern matching. Reported by: Olga Sergeeva, Saint Petersburg.

Presentation transcript:

1 Data structures for Pattern Matching. Suffix trees and suffix arrays are basic data structures in pattern matching. Reported by: Olga Sergeeva, Saint Petersburg. Based on: E. Ukkonen, On-Line Construction of Suffix Trees, 1993; Udi Manber and Gene Myers, Suffix Arrays: a New Method For On-Line String Searches, 1993; Juha Kärkkäinen and Peter Sanders, Simple Linear Work Suffix Array Construction, 2003.

2 A suffix is a ‘concluding’ substring of a string, that is, a substring that runs to the end of the string. If the suffixes are organized ‘well’, the resulting structure can be very informative and can provide a good base for developing fast algorithms for working with strings.

3 Suffix tree: The suffix tree of a string is a rooted tree with marked edges, in which the concatenation of the marks along each path from the root to a leaf forms a suffix, and every suffix appears exactly once. In addition, the marks on the edges leaving the same vertex begin with different symbols of the alphabet.

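4 Searching in a suffix tree: Having a suffix tree for a string S, we can easily find out whether a string M is a substring of S: M is a substring exactly when it can be read along a path starting at the root. This takes O(|M|) operations (for an alphabet of constant size).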
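
The search described above can be sketched in Python as follows. This is only an illustration under simplifying assumptions: it uses a suffix trie (one symbol per edge) built naively in quadratic time rather than a compressed suffix tree, and the names build_suffix_trie and is_substring are hypothetical helpers, not part of the original slides. The point it shows is that the lookup reads each symbol of M once.

```python
def build_suffix_trie(s):
    """Naive O(n^2) suffix trie: nested dicts, one symbol per edge."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            # Follow the edge for ch, creating it if it does not exist yet.
            node = node.setdefault(ch, {})
    return root


def is_substring(trie, m):
    """Walk from the root along m; one step per symbol, so O(|m|) steps."""
    node = trie
    for ch in m:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("xabxac$")
print(is_substring(trie, "bxa"))  # True
print(is_substring(trie, "xc"))   # False
```

A real suffix tree stores a substring on each edge, which brings the space down to linear while keeping the same search idea.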

5 Algorithms: 1973, Weiner, “Linear Pattern Matching Algorithms”; 1976, McCreight, “A Space-Economical Suffix Tree Construction Algorithm”; 1993, Ukkonen, “On-Line Construction of Suffix Trees”.

6 Ukkonen’s algorithm: implicit trees. In general, for the suffix tree to exist, we need to append a ‘terminal’ symbol to the string (let us denote it ‘$’). Otherwise we have no means of identifying the ends of the suffixes, which is desirable in many algorithms. In the course of the construction, however, we will build trees for strings without the terminal symbol. Such trees are called implicit trees.

7 Ukkonen’s algorithm: idea, general description. Let’s divide the process of constructing the tree into phases. Starting with a tree for the first symbol, we build the tree inductively, so that at the end of each phase we have an implicit tree for a prefix of the string.

8 Ukkonen’s algorithm: possible cases of extension. When extending one suffix, we can encounter three situations: we only have to extend the mark on an edge; we have to add a new edge (and possibly split an existing edge into two); or nothing needs to be done.

9 Ukkonen’s algorithm: towards improvement. The essential question is how we search for the ends of the suffixes in the current tree. We can find the end of a suffix by walking from the root each time, in time proportional to the length of the suffix. In this case, building the tree for S[1..i+1] from the tree for S[1..i] takes O(i^2) time, so the final tree will appear after O(n^3) operations, compared to O(n^2) in the naïve algorithm! We’ll reduce this to O(n), using some observations and techniques.

10 Ukkonen’s algorithm. Heuristics: suffix links. Suffix link: a pointer from an inner vertex with path mark xα to the vertex with path mark α, if it exists in the tree. Every inner vertex has a suffix link. Moreover, if a vertex v with path mark xα is added to the tree in extension j of phase i+1, then s(v) either already exists in the tree or will be created in the next extension of this phase. So any newly created vertex has a suffix link to the ending point of the next extension. Consequently, at the end of each phase the tree has all its suffix links.

11 Ukkonen’s algorithm. Heuristics: suffix links. Using suffix links can substantially reduce the amount of search. If we maintain a pointer to the end of the longest suffix S[1..i], we have the ending point of the previously extended suffix at the beginning of each extension. We do not have to start from the root every time we search for the end of the next suffix. Instead, we go from the available ending point up to the first inner vertex v (if the ending point is not an inner vertex itself), follow its suffix link s(v), and search only in the subtree of s(v). Notice that at the moment we follow the suffix link, the depth of v in the tree exceeds the depth of s(v) by at most 1.

14 Ukkonen’s algorithm: jumping over edges. Still, there are unnecessary steps. We walk along edges in the subtree of s(v), comparing symbols one by one with the mark we are searching for, as if we were checking whether a path corresponding to this mark exists in the tree. But it does exist, and all we need to do is find its ending point. That is why we can restrict ourselves to choosing the right edge at a vertex (by comparing first symbols) and then either jumping to the next vertex along this edge, or finding the sought point on the edge itself if its mark is long enough.

17 Ukkonen’s algorithm: current results. Theorem. In the improved algorithm, every phase takes O(n) time. Consequence. The current version of Ukkonen’s algorithm terminates in O(n^2) time.

21 Ukkonen’s algorithm: the last observations. Storing the marks on the edges explicitly is inefficient, because their overall length does not have to be linear. We can replace each mark with a pair of indices indicating the beginning and the end of the corresponding substring in S. The first time in a phase that we find nothing is to be done in the current extension, we can finish the phase. So a phase is a sequence of extensions that use the first (prolonging a mark) and the second (branching off) rules. A leaf cannot become an inner vertex. During a phase, we add the same symbol to the marks of all leaf edges. We will ‘split’ or ‘do nothing’ only with the suffixes not processed in the previous phase (those to which we applied ‘do nothing’); in all other cases we prolong a leaf.
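
To make the index-pair representation and the ‘split an edge’ case concrete, here is a minimal Python sketch. It is not Ukkonen’s algorithm: it inserts the suffixes one by one from the root and therefore runs in quadratic time, with no suffix links or edge jumping. The names Node, build_suffix_tree_naive and suffixes_from_tree are hypothetical, and the sketch assumes the string ends with a terminal symbol that occurs nowhere else.

```python
class Node:
    def __init__(self, start=0, end=0):
        # The mark of the edge entering this node is s[start:end];
        # only the two indices are stored, never the substring itself.
        self.start = start
        self.end = end
        self.children = {}  # first symbol of an edge mark -> child Node


def build_suffix_tree_naive(s):
    """Naive O(n^2) construction (illustration only, not Ukkonen's method)."""
    assert s and s[-1] not in s[:-1], "string must end with a unique terminal"
    n = len(s)
    root = Node()
    for i in range(n):
        node, pos = root, i
        while True:
            child = node.children.get(s[pos])
            if child is None:
                # No edge starts with s[pos]: add a new leaf edge.
                node.children[s[pos]] = Node(pos, n)
                break
            k = child.start
            while k < child.end and s[k] == s[pos]:
                k += 1
                pos += 1
            if k == child.end:
                node = child  # whole edge matched, continue below it
                continue
            # Mismatch inside the edge: split it and branch off a new leaf.
            mid = Node(child.start, k)
            node.children[s[child.start]] = mid
            child.start = k
            mid.children[s[k]] = child
            mid.children[s[pos]] = Node(pos, n)
            break
    return root


def suffixes_from_tree(s, node, prefix=""):
    """Concatenate edge marks along every root-to-leaf path."""
    if not node.children:
        return [prefix]
    result = []
    for child in node.children.values():
        result += suffixes_from_tree(s, child, prefix + s[child.start:child.end])
    return result


text = "xabxac$"
tree = build_suffix_tree_naive(text)
print(sorted(suffixes_from_tree(text, tree)))
```

Each node stores two integers regardless of how long its edge mark is, which is what keeps the total size of the tree linear in the length of the string.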

22 Ukkonen’s algorithm: time estimation. Theorem. Ukkonen’s algorithm terminates in O(n) time.

23 Ukkonen’s algorithm: difficulties of implementation. Problems of working with suffix trees: (1) dependency on the size of the alphabet; (2) no ‘locality’, which is bad for paging; (3) the number of ‘children’ varies between vertices, so there is no single good representation: arrays for vertices near the root, linked lists for vertices near the leaves, balanced trees and hashing for the middle vertices. As a result, the structure becomes even more complicated.

24 Suffix array: An array containing the suffixes of a string in lexicographic order. The idea belongs to Udi Manber and Gene Myers (1993, “Suffix Arrays: a New Method For On-Line String Searches”). They proposed an algorithm that constructs the array directly in O(n*log n) time, where n is the length of the string. This algorithm not only builds the array, but gathers some additional information on the way. Manber and Myers also presented a search algorithm that uses this information to find a pattern P in O(|P| + log n) time.
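
The following Python sketch illustrates searching with a suffix array. It is a simplified variant, not the Manber and Myers search: it does not use the additional information mentioned above, so a lookup costs O(|P| * log n) symbol comparisons rather than O(|P| + log n). The array is built by plain sorting for brevity, and the function names suffix_array_naive and find_occurrences are hypothetical.

```python
def suffix_array_naive(s):
    """Starting positions of the suffixes of s in lexicographic order.

    Built by direct sorting, which is simple but not efficient.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(s, sa, p):
    """All starting positions of p in s, found by two binary searches."""
    # First suffix whose prefix of length |p| is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose prefix of length |p| is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(p)] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


s = "banana"
sa = suffix_array_naive(s)
print(sa)                              # [5, 3, 1, 0, 4, 2]
print(find_occurrences(s, sa, "ana"))  # [1, 3]
```

All suffixes that start with P occupy a contiguous block of the array, which is why two binary searches are enough to report every occurrence.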

25 Suffix array construction: idea. As with the trees, we build the array inductively, making heavy use of its structure. Initially, we have an array of unordered suffixes. We begin by sorting the suffixes by their first symbol (which is linear in the length of the string), and in every phase we double the number of symbols the suffixes are sorted on. After phase H, the suffixes are organized into buckets, each holding the suffixes with the same first H symbols. If A(i) is a suffix in the first H-bucket, then A(i-H) should come first in its 2H-bucket. We can move it to the beginning of its 2H-bucket and mark this fact. For every bucket, we need to know the number of suffixes in it that have already been moved and placed in 2H-order. The algorithm basically scans the suffixes as they appear in H-order, and for each A(i) it moves A(i-H) (if it exists) to the next available place in its bucket.
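
The doubling idea can be sketched in Python as below. This is a simplified variant and an assumption on my part, not the bucket-scanning procedure described above: each phase re-sorts the suffixes with a comparison sort on pairs of ranks, which gives O(n * log^2 n) time instead of O(n * log n). The function name suffix_array_doubling is hypothetical.

```python
def suffix_array_doubling(s):
    """Suffix array by prefix doubling with a comparison sort per phase."""
    n = len(s)
    if n == 0:
        return []
    rank = [ord(c) for c in s]   # order by the first symbol
    sa = list(range(n))
    h = 1
    while True:
        # Sorting by (rank of first h symbols, rank of next h symbols)
        # orders the suffixes by their first 2h symbols.
        def key(i):
            return (rank[i], rank[i + h] if i + h < n else -1)

        sa.sort(key=key)
        # Suffixes with equal keys share a bucket, i.e. the same new rank.
        new_rank = [0] * n
        for j in range(1, n):
            same = key(sa[j]) == key(sa[j - 1])
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (0 if same else 1)
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # every bucket has size one: done
            return sa
        h *= 2


print(suffix_array_doubling("banana"))  # [5, 3, 1, 0, 4, 2]
```

The pair of ranks plays the role of the observation about A(i) and A(i-H): the order of the suffixes starting H positions later is already known from the previous phase, so no symbols need to be compared again.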

26 ‘Skew’ algorithm: structure. (1) Construct the suffix array of the suffixes starting at positions i = 1, 2 (mod 3). This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively. (2) Construct the suffix array of the remaining suffixes using the result of the first step. (3) Merge the two suffix arrays into one.
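
The reduction in the first step can be sketched in Python as follows. This covers only step (1), and only the reduction itself: where the real algorithm recurses on the shorter string, the sketch sorts it naively as a stand-in. Positions are 0-based, the padding symbol "\0" is assumed not to occur in the input, and the names skew_sample and sample_suffix_order are hypothetical.

```python
def skew_sample(s):
    """Reduce the suffixes at positions i = 1, 2 (mod 3) to a shorter string."""
    n = len(s)
    padded = s + "\0\0\0"  # assumption: "\0" does not occur in s
    # When n % 3 == 1 a dummy position n is added, so that the last triple
    # of the "mod 1" group always contains padding and comparisons cannot
    # run on into the "mod 2" group.
    limit = n + 1 if n % 3 == 1 else n
    positions = [i for i in range(limit) if i % 3 == 1]
    positions += [i for i in range(limit) if i % 3 == 2]
    # Each sampled suffix is represented by its first three symbols;
    # triples are replaced by their ranks to form the reduced string.
    triples = [padded[i:i + 3] for i in positions]
    names = {t: r for r, t in enumerate(sorted(set(triples)))}
    reduced = [names[t] for t in triples]
    return positions, reduced


def sample_suffix_order(s):
    """Sorted order of the sampled suffixes, read off the reduced string."""
    positions, reduced = skew_sample(s)
    # The real algorithm recurses here; naive sorting is only a stand-in.
    order = sorted(range(len(reduced)), key=lambda k: reduced[k:])
    return [positions[k] for k in order if positions[k] < len(s)]


print(sample_suffix_order("banana"))  # [5, 1, 4, 2]
```

The reduced string has about two thirds as many symbols as the original, and the order of its suffixes is the order of the sampled suffixes of the original string, which is what allows the recursion.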

27-31 The skew algorithm: example (these slides contained only figures, which are not reproduced in the transcript).

32 Literature: E. Ukkonen, On-Line Construction of Suffix Trees, 1993; Udi Manber and Gene Myers, Suffix Arrays: a New Method For On-Line String Searches, 1993; Juha Kärkkäinen and Peter Sanders, Simple Linear Work Suffix Array Construction, 2003; Martin Farach, Optimal Suffix Tree Construction with Large Alphabets, 1997; Dan Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology.