1 Pattern Processing and Searching In RAM Michael Robinson Ph.D. candidate Advisor: Dr. Giri Narasimhan School of Computing and Information Sciences BioRG.

Presentation transcript:

1 Pattern Processing and Searching In RAM Michael Robinson Ph.D. candidate Advisor: Dr. Giri Narasimhan School of Computing and Information Sciences BioRG Bioinformatics Research Group Florida International University SW 8th Street Miami, FL {mrobi002, Presented by Michael Robinson January 15, 2008 At: Florida International University Global CyberBridges, National Science Foundation Program Award Id: OCI October 1, 2006-December 31, 2009

2 Agenda
- What are Suffix Trees, Suffix Arrays
- Suffix Trees: Importance in Bioinformatics
- Main Memory Bottleneck
- Sadakane’s Compressed Suffix Tree Implementation
- Compressed Suffix Tree Problem
- Engineering a Compressed Suffix Tree Implementation: Authors' Improvements and Algorithm Results
- Example Required File: LCP Array Solution
- Implementation Design: Software
- My Current Work
- Experimental Results
- My Future Work
- References

3 Suffix Tree
Suffixes of BANANAS: BANANAS, ANANAS, NANAS, ANAS, NAS, AS, S
Suffix Trees inventor: Peter Weiner
[Figure: suffix tree of BANANAS]

4 Suffix Array Implementation
Simplified version of a Suffix Array: the suffixes of the text in lexicographic order.
Sequence = ABRACADABRA

Suffix        Index    Sorted index   Sorted suffix
ABRACADABRA   0        10             A
BRACADABRA    1        7              ABRA
RACADABRA     2        0              ABRACADABRA
ACADABRA      3        3              ACADABRA
CADABRA       4        5              ADABRA
ADABRA        5        8              BRA
DABRA         6        1              BRACADABRA
ABRA          7        4              CADABRA
BRA           8        6              DABRA
RA            9        9              RA
A             10       2              RACADABRA

[1] Suffix Arrays inventors: Udi Manber, Gene Myers 1989
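The table above can be reproduced with a few lines of Python. This is a minimal sketch, not the presenter's implementation: the function name `build_suffix_array` is an assumption, and the naive sort copies every suffix while sorting, so it only suits short sequences.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`,
    ordered by the lexicographic order of the suffixes."""
    # Naive construction: sort the start positions, comparing the suffixes.
    # Only the integer positions are kept in the result, not the suffix strings.
    return sorted(range(len(text)), key=lambda start: text[start:])


if __name__ == "__main__":
    sequence = "ABRACADABRA"
    for rank, start in enumerate(build_suffix_array(sequence)):
        # Columns: rank in sorted order, start position, suffix text.
        print(rank, start, sequence[start:])
```

Storing only the start positions is what keeps the index at n integers instead of n copies of the text.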

5 Suffix Trees Importance in Bioinformatics Biological Data Type (A C G T) vs. Search Engines Data (inverted) Applying Suffix Trees to Real Genomic Sequences is Impractical

6 Main Memory Bottleneck
Storage needed when every suffix is stored explicitly:

Suffix        Storage (characters)
ABRACADABRA   11
BRACADABRA    10
RACADABRA     9
ACADABRA      8
CADABRA       7
ADABRA        6
DABRA         5
ABRA          4
BRA           3
RA            2
A             1
Total         66 = n(n+1)/2 = 11(12)/2

PA01 (~6 million bases) ~ 18 terabytes
Human Genome: 3,164,700,000 nucleotides
(3,164,700,000 * 3,164,700,001)/2 = 5,007,663,046,582,350,000 characters = 5,007,663 terabytes
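The arithmetic on this slide is easy to check. The sketch below assumes one byte per character and decimal terabytes (10^12 bytes); the function name `explicit_suffix_storage` is hypothetical.

```python
def explicit_suffix_storage(n: int) -> int:
    """Characters needed to store all n suffixes of a length-n text explicitly:
    n + (n - 1) + ... + 1 = n(n + 1) / 2."""
    return n * (n + 1) // 2


TERABYTE = 10**12  # decimal terabyte, assuming one byte per character

if __name__ == "__main__":
    print(explicit_suffix_storage(11))                         # ABRACADABRA
    print(explicit_suffix_storage(3_164_700_000) // TERABYTE)  # human genome, in TB
```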

7 Sadakane’s Compressed Suffix Implementation
2-bit encoding: A = 00, C = 01, G = 10, T = 11
Storage = n log2 |Σ| bits = 2n bits, ~20% of original space

Suffix (uncompressed)   Index   Storage, compressed (bits)
ABRACADABRA             0       22 = 2n = n log2 |Σ|
BRACADABRA              1       20
RACADABRA               2       18
ACADABRA                3       16
CADABRA                 4       14
ADABRA                  5       12
DABRA                   6       10
ABRA                    7       8
BRA                     8       6
RA                      9       4
A                       10      2
Total                           (n(n+1)/2)*2 bits

Unfortunately it is not linear: 100 MB ~ 5 GB
[2] Kunihiko Sadakane
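The 2-bit encoding on this slide can be sketched as follows. This illustrates only the A/C/G/T code table shown above, not Sadakane's compressed suffix array itself; the names `pack` and `unpack` are assumptions.

```python
# Code table taken from the slide: A = 00, C = 01, G = 10, T = 11.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # BASES[code] is the base for each 2-bit code


def pack(sequence: str) -> int:
    """Pack a DNA string into an integer, 2 bits per base, first base most significant."""
    value = 0
    for base in sequence:
        value = (value << 2) | CODES[base]
    return value


def unpack(value: int, n: int) -> str:
    """Recover the n-base DNA string from its packed integer form."""
    return "".join(BASES[(value >> (2 * (n - 1 - i))) & 0b11] for i in range(n))
```

The length n has to be stored alongside the packed value, because leading A bases (code 00) would otherwise be lost.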

8 Compressed Suffix Tree Problem
Unfortunately the Suffix Tree is not linear: 100 MB ~ 5 GB
The sequence is linear (n log2 |Σ| bits, Σ = {A, C, G, T}):
  ACGT = 4 bases = 2n bits = 8 bits
  GTCAAGTC = 8 bases = 2n bits = 16 bits
But the Suffix Array is not ((n(n+1)/2)*2 bits):
  ACGT = 4 bases = (4(5)/2)*2 = 20 bits
  GTCAAGTC = 8 bases = (8(9)/2)*2 = 72 bits
In the first data structure, the second sequence is twice as long as the first one, but in the second data structure, the second sequence is more than three times the first one.
It is 30% slower than non-compressed trees.

9 Engineering a Compressed Suffix Tree Implementation: Authors' Improvements and Algorithm Results
Algorithm results (space):

Sequence                 Authors: |Σ| log2 n    Sadakane: n log2 |Σ| = 2n bits
GTCA                     4*2 = 8 bits           2*4 = 8 bits
GTCAAGTC                 4*3 = 12 bits          2*8 = 16 bits
TACAAGTAGTCAAGTC         4*4 = 16 bits          2*16 = 32 bits
A 2048 base sequence     4*11 = 44 bits         2*2048 = 4096 bits

Space needed during construction: 1.4 times final space.
The authors created an Abstract Suffix Array using a Succinct Suffix Array, based on a Wavelet Tree (for sound), built on the Burrows-Wheeler transform.
[3] Niko Välimäki
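The two columns of the table follow directly from the formulas in its header. The sketch below only re-evaluates those slide formulas for |Σ| = 4; it makes no claim about the space bounds proven in the cited papers, and the function names are hypothetical.

```python
SIGMA = 4  # alphabet size for A, C, G, T


def ceil_log2(n: int) -> int:
    """Smallest k with 2**k >= n, computed with integers only (n >= 1)."""
    return (n - 1).bit_length()


def authors_bits(n: int) -> int:
    """Slide formula for the authors' column: |Sigma| * log2(n)."""
    return SIGMA * ceil_log2(n)


def sadakane_bits(n: int) -> int:
    """Slide formula for Sadakane's column: n * log2(|Sigma|) = 2n for DNA."""
    return n * ceil_log2(SIGMA)


if __name__ == "__main__":
    for n in (4, 8, 16, 2048):
        print(n, authors_bits(n), sadakane_bits(n))
```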

10 Example Required File: LCP Array Solution
A useful additional data structure: an array of the lengths of the Longest Common Prefixes between each suffix and its predecessor in the Suffix Array.
Lexicographic ordered text. Sequence = ABRACADABRA

Suffix        Index    Sorted index   LCP   Sorted suffix
ABRACADABRA   0        10             -     A
BRACADABRA    1        7              1     ABRA
RACADABRA     2        0              4     ABRACADABRA
ACADABRA      3        3              1     ACADABRA
CADABRA       4        5              1     ADABRA
ADABRA        5        8              0     BRA
DABRA         6        1              3     BRACADABRA
ABRA          7        4              0     CADABRA
BRA           8        6              0     DABRA
RA            9        9              0     RA
A             10       2              2     RACADABRA
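The LCP definition on this slide translates directly into code. This is a minimal quadratic sketch for short sequences, not the construction used in the talk; both function names are assumptions, and the first entry, which has no predecessor, is set to 0 by convention.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of the suffixes of `text` in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda start: text[start:])


def build_lcp_array(text: str, suffix_array: list[int]) -> list[int]:
    """lcp[r] = length of the longest common prefix of the suffixes ranked r-1 and r.
    lcp[0] is 0 by convention because the first suffix has no predecessor."""
    lcp = [0] * len(suffix_array)
    for rank in range(1, len(suffix_array)):
        previous = text[suffix_array[rank - 1]:]
        current = text[suffix_array[rank]:]
        length = 0
        # Count matching characters from the front of both suffixes.
        while (length < len(previous) and length < len(current)
               and previous[length] == current[length]):
            length += 1
        lcp[rank] = length
    return lcp
```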

11 Implementation Design: Software
- C++, object oriented.
- Each data structure is its own class.
- Generic code, e.g. from Sadakane, to retrieve short sequences.
- New code for construction and for retrieving long sequences.
- Tailored code is as time/space efficient as generic code.

12 My Current Work - Approach
- Dissertation, not published yet.
- Suffix Arrays approach.
- Google’s construction approach.
- Construction time and space problems.
  Sequences from 11 bases to PA01 with 6.2 million bases.
  PA01: run on 7 different computers. Fastest time: 5 days.
- All files contain uncompressed information.
- Space required: sequence file = n; one index file = from n to ~8 times the PA01 sequence size.
- Loading time and RAM space problems.
- Solution: break the index file into 64 or 1024 sub-indexes, improving loading and processing times and allowing processing of larger sequences.
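The slide does not say how the index file is split. One plausible reading, stated here purely as an assumption, is that 64 = 4^3 and 1024 = 4^5 correspond to grouping suffixes by their first 3 or 5 bases, so a search only has to load the sub-index matching the start of the probe. The sketch below illustrates that idea with a hypothetical `split_into_sub_indexes` function.

```python
from collections import defaultdict


def split_into_sub_indexes(text: str, k: int) -> dict[str, list[int]]:
    """Group suffix start positions by the first k characters of each suffix.
    Each group is sorted like a small suffix array; suffixes shorter than k
    are keyed by their full (shorter) text."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for start in range(len(text)):
        buckets[text[start:start + k]].append(start)
    for starts in buckets.values():
        starts.sort(key=lambda start: text[start:])
    # Return the sub-indexes in lexicographic order of their prefix keys.
    return dict(sorted(buckets.items()))


if __name__ == "__main__":
    for prefix, starts in split_into_sub_indexes("ACGTACGT", 2).items():
        print(prefix, starts)
```

Concatenating the sub-indexes in key order gives back the full suffix array, so nothing is lost by splitting.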

13 My Current Work - Applications
- Finding Patterns: how many times, and where, a probe appears in a given sequence
  acgttg ….. acgttg ….. acgttg ….. acgttg ….. acgttg ….. acgttg
- Finding Inverted Patterns: same as Finding Patterns, plus the inverted (reversed) probe
  acgttg ….. gttgca ….. acgttg ….. gttgca ….. acgttg ….. gttgca
- Finding Inverted Reciprocal Patterns: the probe plus its reversed and complemented form
  acgttg ….. caacgt ….. acgttg ….. caacgt ….. acgttg ….. caacgt
- The above programs generate a text file report for further processing

14 Sadakane’s Experimental Results
Using: one 2.4 GHz Pentium 4 computer with 1 GB RAM, Red Hat OS; programs compiled using g++ (GCC)

15 Engineering a Compressed Suffix Tree Implementation Experimental Results

16 My Current Work - Results
Finding Patterns in PA01: 6,264,404 bases
Table columns: Sequence, Bases, Seconds
Probes searched: a, aa, aaa, aaaa, aaaaa, c, cc, ccc, cccc, ccccc, g, gg, ggg, gggg, ggggg, t, tt, ttt, tttt, ttttt

17 My Future Work
Do construction for: the Human Genome and all Pseudomonas aeruginosa bacteria.
Consensus Pattern Search: to solve the Bioinformatics Consensus Problem to n-1 of a given probe. At present there are applications that solve this problem up to value 3.
For a probe with 50 bases, with alphabet A C G T, if we check for 3 mutations we need to do 4 * 4^(mutations-1) = 64 pattern searches for each group of 3 bases.
For a sequence of 3.6 billion bases, a probe of 1,000 bases, and a mutation rank of n-1, we need to do 4 * pattern searches on the 3.5 billion sequence. Excel calculates up to 4* = E+307.
The only way to do this work is with a Distributed System.
Solving this problem for proteins will require more time because proteins have an alphabet of length 20.
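The search count quoted for 3 mutations can be checked with exact integer arithmetic. The sketch below assumes the slide's expression generalizes as alphabet_size * alphabet_size^(mutations-1); the function name is hypothetical, and the value 999 in the example is an assumed reading of "n-1" for a 1,000-base probe, not a figure from the slide.

```python
def pattern_searches(mutations: int, alphabet_size: int = 4) -> int:
    """Number of pattern searches per group of `mutations` bases,
    following the slide's formula: alphabet_size * alphabet_size ** (mutations - 1)."""
    return alphabet_size * alphabet_size ** (mutations - 1)


if __name__ == "__main__":
    print(pattern_searches(3))                     # DNA, 3 mutations
    print(pattern_searches(3, alphabet_size=20))   # proteins, 3 mutations
    # Python integers have arbitrary precision, so large counts do not
    # overflow the way a spreadsheet's floating-point cells do.
    print(len(str(pattern_searches(999))), "decimal digits")
```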

18 Conclusions
Due to advances in computer hardware and reductions in prices, today a 1.5 terabyte hard disk costs around 400 US dollars.
Recent implementations of Suffix Trees and Suffix Arrays concentrate on compressing the data, causing large delays in user processing.
We believe the previous hard disk space bottleneck has been resolved. Compressing data on hard disk is therefore no longer necessary, especially when user applications slow down by factors of 30 for Suffix Trees, and additionally by log n for Suffix Arrays, when compared to uncompressed data.
With advances in operating systems that can access 128 gigabytes of RAM in workstations, and with advances in Distributed (Grid) Computing, we believe that by using uncompressed data with new methods like our implementation, we can produce applications that were not possible before.

21 References
[1] Udi Manber and Gene Myers (1991). "Suffix arrays: a new method for on-line string searches". SIAM Journal on Computing, Volume 22, Issue 5 (October 1993), pp
[2] Kunihiko Sadakane, Department of Computer Science and Communication Engineering, Kyushu University, Hakozaki , Higashi-ku, Fukuoka , Japan
[3] Niko Välimäki, Wolfgang Gerlach, Kashyap Dixit, and Veli Mäkinen. "Engineering a Compressed Suffix Tree Implementation". Department of Computer Science, University of Helsinki, Finland; Technische Fakultät, Universität Bielefeld, Germany; Department of Computer Science and Engineering, Indian Institute of Technology, Kanpur, India

22 Questions
Thank you!!
Presented by: Michael Robinson, Florida International University, January 15, 2008