Genome-scale disk-based suffix tree indexing Benjarath Phoophakdee Mohammed J. Zaki Compiled by: Amit Mahajan Chaitra Venus.

Presentation transcript:

Introduction
Growth in biological sequence databases
Need for an effective and efficient structure
Suffix tree
–Exact/approximate matching
–Database querying
–Longest common substrings, etc.

Introduction
In-memory construction algorithms
–O(n^2) for naive construction
–Can achieve linear time and space using suffix links, edge encoding, and skip/count
–Problem: do not scale to large input sequences

Disk-based suffix trees
“A Database Index to Large Biological Sequences”
–Abandons suffix links (for better locality of reference)
–Partitions the input based on fixed-length prefixes
–Runs into partition size problems because of data skew
–Bin packing for partitions is expensive: it requires counting frequencies of long prefixes
“Practical Suffix Tree Construction”
–TDD: similar to the above; drops suffix links
–Reported to scale to human genome level
–Random I/Os when input string size > memory

Disk-based suffix trees
ST-Merge (improvement to TDD)
–Splits the input string into smaller contiguous substrings
–Applies TDD to each substring, then merges all trees
–Does not have suffix links
TOP-Q and DynaCluster
–Only known algorithms that maintain suffix links and do not have the data skew problem
–Experiments show that they do not scale to human genome level

Issue
Problems with disk-based algorithms
–Data skew
–No suffix links
–No scalability
Authors propose a novel disk-based suffix tree algorithm called TRELLIS

TRELLIS
O(n^2) time, O(n) space
Idea:
–Construct by partitioning and merging
–Use variable-length prefixes
–Recover suffix links in a separate post-construction phase
Effectively scales up to human genome level
–Can index the entire human genome using 2 GB in 4 hours, and recover suffix links in 2 hours

TRELLIS
Has 4 phases
–Prefix Creation
–Partitioning
–Merging
–Suffix Link Recovery

Prefix Creation Phase
Problems with fixed-length prefixes
–Cannot handle data skew
–No defined way to compute an appropriate length
TRELLIS uses variable-length prefixes: P = {P_0, P_1, P_2, …, P_{m-1}}
Use a threshold t to determine P such that freq(P_i) ≤ t

Prefix Creation Phase
Multi-scan approach to compute P
–i-th scan: process prefixes up to a certain length L_i (L_i is computed by a formula shown on the slide)
–EP_i = set of prefixes that need further extension in the next scan (as their frequency > t)
–Add to P only the shortest prefixes that meet the frequency threshold t, and reject their extensions

Prefix Creation Phase
Example: with t = 10^6, only two scans were required for the human genome, with L_1 = 8 and L_2 = 16
The resulting set P contained about 6400 prefixes, with lengths ranging from 4 to 16
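
The multi-scan idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `variable_length_prefixes` and the `scan_lengths` argument (standing in for L_1, L_2, …) are hypothetical, the text is assumed to end with a unique terminator such as `$`, and counts are kept in an in-memory `Counter` rather than the structures a genome-scale run would need.

```python
from collections import Counter


def variable_length_prefixes(text, t, scan_lengths):
    """Return {prefix: frequency} such that every frequency is <= t.

    scan_lengths is an increasing list of maximum prefix lengths, one per scan
    (a hypothetical stand-in for L_1, L_2, ... on the slides).
    """
    accepted = {}
    pending = {""}  # prefixes that still exceed t (EP_i on the slides)
    prev_len = 0
    for max_len in scan_lengths:
        # One scan over the text: count substrings of length prev_len+1..max_len
        # that extend a prefix still waiting for extension.
        counts = Counter()
        for i in range(len(text)):
            if text[i:i + prev_len] not in pending:
                continue
            for length in range(prev_len + 1, max_len + 1):
                if i + length > len(text):
                    break
                counts[text[i:i + length]] += 1
        next_pending = set()
        # Shortest candidates first, so an accepted prefix rejects its extensions.
        for prefix in sorted(counts, key=len):
            if any(prefix[:k] in accepted for k in range(1, len(prefix))):
                continue
            if counts[prefix] <= t:
                accepted[prefix] = counts[prefix]
            elif len(prefix) == max_len:
                next_pending.add(prefix)
        pending, prev_len = next_pending, max_len
        if not pending:
            break
    if pending:
        raise ValueError("some prefixes still exceed t; add a longer scan length")
    return accepted


prefixes = variable_length_prefixes("banana$", t=2, scan_lengths=[1, 2])
print(prefixes)
```

On the toy input, `a` occurs three times and is therefore extended in the second scan, while `b`, `n` and `$` are accepted after the first. The accepted frequencies sum to the number of suffixes, so every suffix falls under exactly one prefix.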

Partitioning Phase
Divide the input string into r consecutive partitions, where r = (n+1) / t
Suffix subtree T_Ri
–Contains the suffixes that start in partition R_i
–Built with Ukkonen’s algorithm*
Prefixed suffix subtree T_Ri,Pk
–Split T_Ri into subtrees that contain only suffixes with prefix P_k
–At most m such subtrees
Store these prefixed suffix subtrees on disk
* Proposed in the paper “On-line construction of suffix trees” by E. Ukkonen

Partitioning Phase
The T_Ri obtained are implicit suffix trees (i.e., some suffixes end inside internal edges)
To guarantee that T_Ri explicitly contains all suffixes from the i-th partition
–Continue reading characters from the next partition R_{i+1} until T_Ri has t leaves
–Appending a special terminal character is not an option, as it would incur additional overhead during the merging phase
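
A rough sketch of the partitioning step follows. It deliberately swaps in a simpler technique: instead of building each suffix subtree with Ukkonen's algorithm, it keeps a sorted list of suffix start positions per prefix as a stand-in for a prefixed suffix subtree. The function name `partition_suffixes` is hypothetical, the text is assumed to include its terminator (so its length is n+1), and the prefix set is assumed to be prefix-free.

```python
def partition_suffixes(text, t, prefixes):
    """Group suffix start positions by partition and by prefix.

    Returns a list with one dict per partition R_i; each dict maps a prefix P_k
    to the sorted start positions of the suffixes in R_i that begin with P_k.
    A sorted list is used here in place of a real prefixed suffix subtree.
    """
    r = (len(text) + t - 1) // t  # number of partitions, rounded up
    partitions = []
    for i in range(r):
        start, end = i * t, min((i + 1) * t, len(text))
        by_prefix = {}
        for pos in range(start, end):
            for prefix in prefixes:
                if text.startswith(prefix, pos):
                    by_prefix.setdefault(prefix, []).append(pos)
                    break  # prefix-free set: at most one prefix matches
        for positions in by_prefix.values():
            # Suffixes run to the end of the text, past the partition boundary.
            positions.sort(key=lambda p: text[p:])
        partitions.append(by_prefix)
    return partitions


parts = partition_suffixes("banana$", t=3, prefixes=["b", "n", "$", "an", "a$"])
for i, part in enumerate(parts):
    print(i, part)
```

Each partition yields at most m groups, one per prefix, which mirrors the bound on the number of prefixed suffix subtrees per partition.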

Merging Phase
For each prefix P_k in the set P
–Merge all prefixed suffix subtrees T_Ri,Pk to get the prefixed suffix tree T_Pk
We get m prefixed suffix trees
Store the resulting trees back to disk
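
The per-prefix merge can be illustrated with sorted lists standing in for trees. This is a simplified sketch, not the tree merge TRELLIS performs: `merge_prefixed` is a hypothetical name, each partition is assumed to be a dict from prefix to suffix start positions already sorted by suffix, and the merge uses the standard-library `heapq.merge`.

```python
import heapq


def merge_prefixed(text, partitions):
    """Merge, for every prefix, the per-partition sorted runs of suffix positions."""
    all_prefixes = {prefix for part in partitions for prefix in part}
    merged = {}
    for prefix in sorted(all_prefixes):
        runs = [part[prefix] for part in partitions if prefix in part]
        merged[prefix] = list(heapq.merge(*runs, key=lambda p: text[p:]))
    return merged


text = "banana$"
partitions = [
    {"b": [0], "an": [1], "n": [2]},
    {"an": [3], "n": [4], "a$": [5]},
    {"$": [6]},
]
merged = merge_prefixed(text, partitions)
print(merged)
```

Only the runs for one prefix are needed at a time, which is the property that lets each merged structure stay small. With a prefix-free set, concatenating the merged lists in prefix order gives the suffixes of the whole text in sorted order.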

Suffix Link Recovery Phase
Why?
–Suffix links are crucial for efficiency in many fast string processing algorithms
Why in a separate phase?
–TRELLIS may discard all suffix link information during the merge phase, as new internal nodes are created and some old ones are deleted
–It is useful to discard suffix link information after partitioning, as it reduces the amount of data per node
–Recovering links from scratch takes the same time as keeping the original link information

Suffix Link Recovery Phase
TRELLIS recovers the suffix links of one prefixed suffix tree at a time
Start with the children of the root
Proceeding depth-first, do the following for each internal node x
–Locate p(x), the parent of x, and sl(p(x)), the parent's suffix link
–Count down from sl(p(x)) to locate sl(x); when found, add the link
–Do this recursively for all children of x
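
The recovery rule can be demonstrated on a much simpler structure than the one TRELLIS uses. The sketch below assumes an uncompressed suffix trie held entirely in memory (one character per edge), so the step from sl(p(x)) to sl(x) is a single child lookup; in a compressed suffix tree that lookup becomes the multi-character walk described above. The class and function names are hypothetical, and links are set for every node rather than only internal ones.

```python
class Node:
    def __init__(self, parent=None, char=""):
        self.parent = parent
        self.char = char      # character on the edge from the parent
        self.children = {}
        self.link = None      # suffix link, filled in by recover_suffix_links


def build_suffix_trie(text):
    """Naive O(n^2) suffix trie, built without any suffix links."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node.children:
                node.children[ch] = Node(node, ch)
            node = node.children[ch]
    return root


def all_nodes(root):
    """Yield every non-root node, parents before their children."""
    stack = list(root.children.values())
    while stack:
        x = stack.pop()
        yield x
        stack.extend(x.children.values())


def recover_suffix_links(root):
    """Set sl(x) for every node from sl(p(x)), visiting parents first."""
    for x in all_nodes(root):
        if x.parent is root:
            x.link = root
        else:
            # sl(p(x)) spells the parent's label minus its first character;
            # following x's edge character from there reaches sl(x).
            x.link = x.parent.link.children[x.char]


def label(node):
    """String spelled on the path from the root to node."""
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return "".join(reversed(chars))


root = build_suffix_trie("banana$")
recover_suffix_links(root)
```

The traversal order matters: a node is processed only after its parent, so sl(p(x)) is always available when x is visited.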

Choosing t
Note: t is also the threshold for partition size
M ≥ n/4 + ((0.7 × 40) + 16)t + (0.7 × 40)t
–M = available main memory
–n/4 = memory for the input (in compressed form)
–# internal nodes = 0.7 × (# external nodes)
–40 and 16 are the sizes of internal and external nodes
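
Solving the inequality for t gives t ≤ (M − n/4) / 72, since (0.7 × 40 + 16) + (0.7 × 40) = 72. The helper below is a small sketch of that arithmetic; the name `max_threshold` is hypothetical, and the example values (2 GB taken as 2 × 2^30 bytes, a genome of 3 × 10^9 characters) are illustrative assumptions rather than figures from the slides.

```python
from fractions import Fraction


def max_threshold(memory, n, internal_size=40, leaf_size=16,
                  internal_ratio=Fraction(7, 10)):
    """Largest integer t satisfying
    memory >= n/4 + (ratio*internal + leaf)*t + (ratio*internal)*t.
    """
    per_t = (internal_ratio * internal_size + leaf_size) + internal_ratio * internal_size
    budget = Fraction(memory) - Fraction(n, 4)
    if budget <= 0:
        raise ValueError("memory is too small to hold the compressed input")
    return budget // per_t  # floor division on Fractions returns an int


# Assumed example values: 2 GB of memory, roughly 3e9 characters of input.
print(max_threshold(2 * 2**30, 3_000_000_000))
```

Exact rational arithmetic is used so that the 0.7 ratio does not introduce floating-point rounding into the bound.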

Computational Complexity
Prefix Creation Phase
–O(nL) time, where L = longest prefix length
–O(n + |Σ|^(L+1)) space
Partitioning Phase
–Input is broken into r partitions, each of size t
–O(t) time/space for each partition => r × O(t) = O(n)
–Disk I/Os: O(r × m), since at most m prefixed suffix subtrees can be created for each partition

Computational Complexity
Merging Phase
–Each merge operation can be O(p), where p = |longest common prefix|
–Across all prefixes, merging = O(p × n), since the number of nodes in a suffix tree is O(n)
–In the worst case p can be O(n), therefore merge = O(n^2)
–Disk I/Os: O(r × m)

Computational Complexity
Suffix Link Recovery Phase
–Internal nodes in the final suffix trees are O(n)
–Constant number of operations for each suffix link recovery
Putting it all together
–O(n^2) time, since the merge phase is the most expensive
–O(n) space

Experimental Setup
Compared to
–TOP-Q and DynaCluster (maintain suffix links)
–TDD (no suffix links)
Performed on Linux with
–2 GB RAM for the human genome and 512 MB for others
–288 GB disk space
–TRELLIS written in C++ and compiled with g++
–Other algorithms obtained from their authors

Experimental Results
TRELLIS vs. TOP-Q and DynaCluster
For 200 Mbp, DynaCluster did not terminate even after 8 hours; TRELLIS took 13 min

Experimental Results
TRELLIS vs. TDD
TDD uses four different buffers (string, suffix, temp and tree)
200 Mbp requires only the last 2 buffers
Saves the additional I/O incurred in other cases

Experimental Results
TRELLIS vs. TDD
TDD is built using a memory-optimized suffix tree method
Difference is not significant for the human genome, as TDD needs to be run in 64-bit mode

Experimental Results
TRELLIS vs. TDD: query time
TDD does not store edge lengths; they are determined by examining children
An internal node has a pointer to only one child, so all children are scanned linearly for every query

Conclusions
TRELLIS
–Solves the data skew problem with variable-length prefixes
–Scales gracefully to very large sequences
–No disk I/O overhead, as it works with suffix trees that are guaranteed to fit in memory
–Exhibits faster construction and query times than other disk-based algorithms

Future Work
Plan to make TRELLIS applicable to a wider range of alphabets (e.g., the English alphabet)
No buffering strategy is required for the human genome, but one is planned for building a generalized suffix tree composed of many large genomes
Parallelize TRELLIS, since its partitioning and merging steps seem ideally suited