Obtaining Provably Good Performance from Suffix Trees in Secondary Storage Pang Ko & Srinivas Aluru Department of Electrical and Computer Engineering Iowa State University.

Motivation
- Large amounts of biological sequence data are available.
- An index for a text is usually bigger than the text itself.
- Efficient ways to store and query these data are required.

Related Work
String B-tree
- Has the best worst-case performance in secondary storage while allowing updates.
- However, most existing programs still use suffix trees instead of string B-trees.
Many other works focus only on the construction of suffix trees and give no worst-case bounds:
- S.J. Bedathur and J.R. Haritsa. Search-optimized suffix-tree storage for biological applications.
- E. Hunt, M.P. Atkinson, and R.W. Irving. Database indexing for large DNA and protein sequence collections.
Clark and Munro. "Efficient suffix trees on secondary storage"
- Focuses on reducing the space usage of suffix trees.
- Performance depends on the height of the tree.
Farach, odd-even tree construction
- Optimal construction time in secondary storage.
- The performance of search and update operations is not studied.
We show that suffix trees can achieve the same level of efficiency as string B-trees for a constant-size alphabet.

An Analogue

Definitions
Let v be an internal node of a suffix tree, and let C be a chosen constant.
- size(v) is the number of leaves in the subtree rooted at v.
- rank(v) = i if and only if $C^i \le size(v) < C^{i+1}$.
- Internal nodes u and v belong to the same partition if and only if u is the parent of v and rank(u) = rank(v).
- The rank of a partition P, rank(P), is the rank of the internal nodes in the partition.
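The rank definition can be made concrete with a few lines of Python. This is only a sketch: the function name `rank` and the choice to take the subtree size directly as an argument are assumptions made for illustration, not something the slides prescribe.

```python
def rank(size: int, C: int) -> int:
    """Return the i with C**i <= size < C**(i+1), using integer arithmetic only."""
    if size < 1 or C < 2:
        raise ValueError("need size >= 1 and C >= 2")
    i, power = 0, C          # invariant: power == C**(i+1)
    while power <= size:
        power *= C
        i += 1
    return i

# With C = 3: sizes 1..2 have rank 0, sizes 3..8 have rank 1, sizes 9..26 have rank 2.
print([rank(s, 3) for s in (2, 3, 8, 9, 26, 27)])  # [0, 1, 1, 2, 2, 3]
```

Integer multiplication is used instead of a floating-point logarithm so that sizes that are exact powers of C are classified reliably.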

A Suffix Tree Partitioned
[Figure: a suffix tree divided into partitions labeled rank = 0, rank = 1, rank = 0 and rank = 2]
Each root-to-leaf path goes through at most $\lceil \log_C n \rceil$ partitions.
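The path bound follows because ranks never increase from parent to child, so each rank value contributes at most one partition to a root-to-leaf path. The sketch below counts partitions along paths in a small hand-built tree. The `Node` class, the helper names, and the example tree are all hypothetical. The assertion checks the slightly looser bound ⌊log_C n⌋ + 1 (the number of rank values an internal node can take), which coincides with ⌈log_C n⌉ whenever n is not an exact power of C.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    children: list = field(default_factory=list)
    size: int = 0
    rank: int = -1   # suffix-tree leaves keep -1; only internal nodes are ranked

def rank_of(size, C):
    """The i with C**i <= size < C**(i+1)."""
    i, power = 0, C
    while power <= size:
        power *= C
        i += 1
    return i

def annotate(v, C):
    """Fill in size and rank bottom-up and return size(v)."""
    if not v.children:
        v.size = 1
        return 1
    v.size = sum(annotate(c, C) for c in v.children)
    v.rank = rank_of(v.size, C)
    return v.size

def max_partitions_on_path(v, parent_rank=None):
    """Largest number of partitions crossed by a path from v down to a leaf."""
    if not v.children:
        return 0
    entered = 0 if v.rank == parent_rank else 1   # a new rank means a new partition
    return entered + max(max_partitions_on_path(c, v.rank) for c in v.children)

def inner(*kids):
    return Node(children=list(kids))

C = 3
root = inner(inner(inner(Node(), Node()), Node(), Node()),
             inner(Node(), Node()),
             Node(), Node(), Node(), Node())
annotate(root, C)
crossed = max_partitions_on_path(root)
print(root.size, root.rank, crossed)  # 10 2 3
assert crossed <= rank_of(root.size, C) + 1
```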

Suffix Tree & Partition Example C = 3 Partitions of rank 0

Suffix Tree & Partition Example C = 3 Partitions of rank 1

Properties of a Partition
Nodes in a partition without any child in the same partition are referred to as leaves. The node whose parent is in another partition is referred to as the root.
There are at most C-1 leaves in each partition. For a partition of rank i:
- $size(root) \ge \sum_u size(u)$, where the sum is over all leaves u of the partition.
- $C^{i+1} - 1 \ge size(root) \ge \sum_u size(u) \ge \sum_u C^i$, so the partition cannot have C leaves: that would require $size(root) \ge C \cdot C^i = C^{i+1}$.

Properties of a Partition
If a node v has more than one child in the same partition as v, it is referred to as a branching node. There can be at most C-2 branching nodes, because there are at most C-1 leaves.
A skeleton partition tree for a partition P contains the root, all the leaves, and all the branching nodes of P.
- There are at most 2C-2 nodes in a skeleton partition tree.
- With a suitable choice of C, it can be stored in one disk page.
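As an illustration of these definitions, the following sketch extracts the skeleton nodes of one partition from a small in-memory tree. The `Node` class, the function names, and the example tree are assumptions made for this example. Sizes are recomputed on every call to keep the code short, which a real implementation would avoid.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)

def size(v):
    """Number of leaves below v (recomputed each time, for brevity only)."""
    return 1 if not v.children else sum(size(c) for c in v.children)

def rank(v, C):
    """The i with C**i <= size(v) < C**(i+1)."""
    i, power, s = 0, C, size(v)
    while power <= s:
        power *= C
        i += 1
    return i

def skeleton(root, C):
    """Names of the root, leaves and branching nodes of root's partition."""
    r = rank(root, C)
    keep, stack = [], [root]
    while stack:
        v = stack.pop()
        # Internal children with the same rank stay in the same partition.
        same = [c for c in v.children if c.children and rank(c, C) == r]
        if v is root or len(same) != 1:   # 0 children -> leaf, 2 or more -> branching
            keep.append(v.name)
        stack.extend(same)
    return sorted(keep)

def leaves(k):
    return [Node("leaf") for _ in range(k)]

C = 3
a1 = Node("a1", leaves(3))
a2 = Node("a2", leaves(2))        # size 2, rank 0: belongs to a different partition
a = Node("a", [a1, a2])           # one child in the partition: not in the skeleton
b = Node("b", leaves(3))
root = Node("root", [a, b])       # size 8, rank 1

print(skeleton(root, C))          # ['a1', 'b', 'root']
assert len(skeleton(root, C)) <= 2 * C - 2
```

Node a is dropped from the skeleton because it has exactly one child in the partition, which is what keeps the skeleton within the 2C-2 node limit.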

Partition and Skeleton Partition Tree
Store a representative suffix in each node of the skeleton partition tree.

Searching for an Exact Match (1)
p = TTAATGAT
Load the representative suffix and compare it to p. Suppose the representative suffix is TTATTAGGA…… The lcp (longest common prefix) between p and the representative suffix is 3.

Searching for an Exact Match (2)
p = TTAATGAT
The lcp between p and the representative suffix is 3. Move to the appropriate next partition.
Total number of disk accesses: $O(|p|/B + \log_B n)$.
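The comparison step in this example can be reproduced directly. The sketch below computes the lcp length of the pattern and the visible prefix of the representative suffix. The function name `lcp_length` is an assumption, and the rest of the suffix, which the slide elides, is not needed because the first mismatch occurs within the shown prefix.

```python
def lcp_length(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

p = "TTAATGAT"
representative_prefix = "TTATTAGGA"   # only the part of the suffix shown on the slide
print(lcp_length(p, representative_prefix))  # 3
```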

Supporting Update Operations
With insertions and deletions, the sizes of nodes change, and with them the partitions.
During the insertion of a suffix:
- size(v) changes if and only if node v is an ancestor of the newly inserted leaf.
- rank(v) may change only if size(v) changes and node v is the root of a partition.
- If rank(v) changes, node v becomes either a new partition by itself or a leaf in its parent's partition.
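A minimal sketch of the size update during insertion is shown below. It assumes, unlike the scheme in the slides, that every node stores its size and a parent pointer, and it ignores the edge splitting that a real suffix tree insertion may perform. The `Node` class, the function names, and the example sizes are hypothetical.

```python
class Node:
    def __init__(self, name, size, parent=None):
        self.name, self.size, self.parent = name, size, parent

def rank_of(size, C):
    """The i with C**i <= size < C**(i+1)."""
    i, power = 0, C
    while power <= size:
        power *= C
        i += 1
    return i

def insert_leaf_below(v, C):
    """Add one leaf below v: bump size on v and its ancestors only.

    Returns the names of the nodes whose rank changed.
    """
    changed = []
    while v is not None:
        before = rank_of(v.size, C)
        v.size += 1
        if rank_of(v.size, C) != before:
            changed.append(v.name)
        v = v.parent
    return changed

C = 3
root = Node("root", 10)      # rank 2
a = Node("a", 8, root)       # rank 1, root of a rank-1 partition
g = Node("g", 4, a)          # rank 1, same partition as a

print(insert_leaf_below(g, C))  # ['a']
```

In this example only a, the root of its partition, changes rank: its size grows from 8 to 9, its rank now equals that of its parent, and it becomes a leaf of the parent's partition.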

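Only the Rank of the Root of a Partition Changes
Suppose an insertion increases rank(v) by one for a node v that is not the root of its partition. Then:
- size(v) was $C^{rank(v)+1} - 1$ before the insertion.
- size(root) was at least $C^{rank(v)+1}$, since the root is a proper ancestor of v and so has strictly more leaves below it.
- The root therefore had a higher rank than v and was not in the same partition, a contradiction.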

Insertion and Deletion
By the same argument, only the rank of a leaf of a partition can change during the deletion of a suffix.
Store size(v) and keep it up to date for node v if:
1. Node v is the root of the partition.
2. Node v is connected to the root by a chain of branching nodes.
3. Node v is a non-branching node and is the child of a node u that satisfies one of the conditions above.

The Root of a Partition is Removed
Let v be a child of the old root in the partition.
- If v is a branching node, nothing needs to be done; the new partition with v as its root has all sizes set correctly.
- If v is a non-branching node, we can calculate the size of its only child in the partition by subtracting the sizes of all its other children from size(v).
After the updates, all size values are set correctly, as stated previously.
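The subtraction in the non-branching case is simple arithmetic, sketched here with a hypothetical helper. It assumes the sizes of the other children are already available, either because they are roots of other partitions (whose sizes are stored) or because they are suffix-tree leaves of size 1.

```python
def size_of_partition_child(size_v, other_children_sizes):
    """size(v) minus the sizes of v's children outside the partition."""
    return size_v - sum(other_children_sizes)

# v has size 8; its other children are a partition root of size 2 and a single leaf.
print(size_of_partition_child(8, [2, 1]))  # 5
```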

The Root of a Partition is Removed
[Figure: the old root and the new roots]

The Leaf of a Partition is Removed
If a leaf is removed from a partition:
- The removed leaf becomes a root; its size can be calculated as the sum of the sizes of all its children, which were all roots of different partitions.
- In the old partition, either a previously branching node becomes a non-branching node, in which case no size update is necessary, or
- a previously non-branching node becomes a new leaf, in which case the size of the new leaf can be calculated by adding the sizes of all its children.

The Leaf of a Partition is Removed
[Figure: a leaf from another partition]

Results
Let B be the size of a disk block, n the total length of the strings, and m the length of the string being inserted or deleted.
- Construction takes $O(n \log_B n)$ disk accesses.
- Insertion and deletion take $O(m \log_B (n+m))$ and $O(m \log_B n)$ disk accesses, respectively.
Let p be the length of a pattern.
- Searching takes $O(p/B + \log_B n)$ disk accesses.
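To get a feel for these bounds, the sketch below evaluates the expressions inside the O-notation for sample parameters. Setting every hidden constant to 1 is an assumption, so the numbers indicate growth rates only and are not predictions of actual disk accesses. The function names and the sample values are hypothetical.

```python
import math

def search_estimate(p, n, B):
    """p/B + log_B n, with hidden constants taken as 1 (an assumption)."""
    return p / B + math.log(n) / math.log(B)

def insert_estimate(m, n, B):
    """m * log_B (n + m), with hidden constants taken as 1 (an assumption)."""
    return m * math.log(n + m) / math.log(B)

n, B = 10**9, 1000          # sample text length and block size
print(round(search_estimate(8, n, B), 3))      # 3.008
print(round(insert_estimate(100, n, B), 1))    # 300.0
```

The search term grows with the logarithm of n to base B rather than with the height of the tree, which is the point of the partitioning scheme.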