29.3.2008 ETH Zurich A Database Index to Large Biological Sequences (Ela Hunt, Malcolm P. Atkinson, Robert W. Irving) A report on the paper from Nicola Ranieri.

Presentation transcript:

ETH Zurich A Database Index to Large Biological Sequences (Ela Hunt, Malcolm P. Atkinson, Robert W. Irving) A report on the paper from Nicola Ranieri

2 Biological Genome Basics Genome Chromosome Deoxyribonucleic acid (DNA) Gene Nucleotide Base Pair Nucleotides...ACCTGAATTCG...

3 Overview ➢ Introduction ➢ Definitions ➢ Naive Algorithm ➢ Analysis ➢ Improvements ➢ New Algorithm

4 Motivation

5 Definition: Suffix Tree A tree is a suffix tree for a string S if: (a) the paths from the root to the leaves have a one-to-one relationship with the suffixes of S, (b) edges are labelled with non-empty strings, (c) every internal node (except perhaps the root node) has more than one child. Example string: BANANA$

6 Prefix Problem Problem: If a suffix t1 is a prefix of another suffix t2, there would be no leaf representing t1. t2 would be appended to the leaf representing t1, making it an internal node. Solution: Append a terminating character that does not occur in S to the string S. This makes the suffixes of S prefix-free.
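
The prefix problem can be checked directly on the slide's example string. The following is a minimal sketch, not part of the original slides; the helper names `suffixes` and `prefix_pairs` are hypothetical. It lists every pair of suffixes where one is a proper prefix of the other, with and without the terminator `$`.

```python
def suffixes(s):
    """Return all non-empty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def prefix_pairs(s):
    """Return (t1, t2) pairs where suffix t1 is a proper prefix of suffix t2."""
    sufs = suffixes(s)
    return [(t1, t2) for t1 in sufs for t2 in sufs
            if t1 != t2 and t2.startswith(t1)]


# Without a terminator, some suffixes of BANANA are prefixes of others
# (for example ANA is a prefix of ANANA), so they would not end at a leaf.
without_terminator = prefix_pairs("BANANA")

# With the terminator appended, no suffix is a prefix of another one.
with_terminator = prefix_pairs("BANANA$")
```

For BANANA the offending suffixes are ANA, NA and A; once `$` is appended the list of pairs is empty.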

7 Suffix Tree: Use (1) [figure: suffix tree of BANANA$; labels A, N]

8 Suffix Tree: Use (2) [figure: label NANN]
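
A typical use of a suffix tree is substring search: a pattern occurs in S exactly when it can be spelled along a path starting at the root, and the leaves below that point correspond to its occurrences. The sketch below illustrates this under a simplifying assumption: it uses an uncompressed suffix trie built from nested dictionaries rather than a real suffix tree, and the function names and the query patterns are hypothetical choices for illustration.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a trie of nested dicts (one char per edge)."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node):
    """A node without children is a leaf; otherwise sum over the children."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def count_occurrences(root, pattern):
    """Walk the pattern from the root; the leaves below it are its occurrences."""
    node = root
    for ch in pattern:
        if ch not in node:
            return 0  # the path breaks off, so the pattern is not a substring
        node = node[ch]
    return count_leaves(node)


trie = build_suffix_trie("BANANA$")
hits_an = count_occurrences(trie, "AN")      # AN starts at two positions
hits_nann = count_occurrences(trie, "NANN")  # NANN is not a substring
```

Counting leaves is only correct because the terminator makes the suffixes prefix-free, so every suffix ends at its own leaf.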

9 Definition: Trie A trie or prefix tree is a tree in which every edge is labelled with a single character and every node represents the string formed by concatenating the characters on the path from the root. Example nodes: J, R; JA, RA, RO; JAV, RAD, RAN, RAU, ROS; JAVA, RAND, RAUM, ROSE
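
The trie definition can be illustrated with a small sketch. It assumes a nested-dictionary representation, which the slides do not prescribe; the function names and the end-of-word marker key `"$end"` are hypothetical. The inserted words are taken from the node labels on the slide.

```python
END = "$end"  # hypothetical marker key; longer than one character, so it cannot clash with an edge


def trie_insert(root, word):
    """Follow or create one single-character edge per character of word."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True  # mark that a complete word ends at this node


def trie_has_prefix(root, prefix):
    """True if some inserted word starts with prefix."""
    node = root
    for ch in prefix:
        if ch not in node:
            return False
        node = node[ch]
    return True


def trie_has_word(root, word):
    """True if word itself was inserted."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


root = {}
for w in ["JAVA", "RAD", "RAND", "RAUM", "ROSE"]:
    trie_insert(root, w)
```

Here the node reached by R, A, N represents the string RAN: it is a prefix of an inserted word but not a word itself.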

10 Naive Construction Root BANANA$

11 Naive Construction Root BANANA$ 1

12 Naive Construction Root BANANA$ 1 2

13 Naive Construction Root BANANA$ 1 2 3

14 Naive Construction Root BANANA$

15 Naive Construction Root BANANA$

16 Naive Construction Root

17 Naive Construction BANANA$

18 Analysis Insert n + (n-1) + (n-2) + ... + 1 nodes. Merge n + (n-1) + (n-2) + ... + 1 edges. Worst-case time complexity: O(n²). Worst-case space complexity: O(n²). Worst-case example: S = abc..xyz
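
The quadratic bound can be reproduced with a short experiment. This is a sketch under the assumption that naive construction inserts each suffix character by character into a trie; the function name `count_trie_nodes` is hypothetical. For a string of n distinct characters no suffix shares a path with another, so n + (n-1) + ... + 1 = n(n+1)/2 nodes are created.

```python
import string


def count_trie_nodes(s):
    """Insert all suffixes of s one character at a time; count the nodes created."""
    root = {}
    created = 0
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            if ch not in node:
                node[ch] = {}
                created += 1  # a new node is needed for this character
            node = node[ch]
    return created


# Worst case from the slide: all characters distinct, nothing is shared.
worst = count_trie_nodes(string.ascii_lowercase)  # n = 26

# Repetitive input shares paths, so fewer nodes are created.
banana = count_trie_nodes("BANANA$")
```

With n = 26 the count is 26 · 27 / 2 = 351, while BANANA$ needs only 22 nodes instead of 28 because the later suffixes reuse existing paths.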

19 Improvement: Direct Merging (1) Root NAAN A $ BANANA$

20 Improvement: Direct Merging (2) Root NAAN A $ BANANA$

21 Improvement: Space Requirement (1) BANANA$

22 Improvement: Space Requirement (2) BANANA$ ➢ The right index of an edge to an inner node is equal to the smallest index in the child edges minus one ➢ The right index of an edge to a leaf is equal to the length of S

23 Definition: Suffix Link A suffix link is a pointer from an inner non-root node, representing the substring xα, to another inner node (the root node, if α is empty) representing α, with x as a single character and α as a string.

24 A New Algorithm: Observations ➢ Suffixes with different prefixes fall in different subtrees ➢ Suffix links are the only connection between two subtrees. Therefore each subtree can be built independently of the other subtrees when no suffix links are used.

25 A New Algorithm: Partitioning (1) Number of partitions: p = floor( ) [formula not preserved] Assumption: DNA has a pseudo-random nature and hence the tree is balanced. Character encoding: A = 00, C = 01, G = 10, T = 11. Each suffix now has a prefix code P_i, which defines a lexicographic order. Divide the range of all P_i into p sub-ranges, each representing one partition.

26 A New Algorithm: Partitioning (2) Prefix code: Partition Range: Check for partition j:
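
The formulas on the partitioning slides did not survive extraction, so the following is only a sketch of one way the idea could work, not the paper's exact definition. It assumes a fixed prefix length `k`, packs the 2-bit character codes into an integer, pads short suffixes with A (00), and splits the code range [0, 4^k) into `p` equal-width sub-ranges. All function and parameter names are hypothetical.

```python
# 2-bit encoding from the slide: A = 00, C = 01, G = 10, T = 11.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def prefix_code(s, i, k):
    """Integer code of the first k characters of the suffix starting at i.

    Assumption: suffixes shorter than k are padded with A (00).
    """
    code = 0
    for offset in range(k):
        pos = i + offset
        symbol = CODE[s[pos]] if pos < len(s) else 0
        code = (code << 2) | symbol
    return code


def partition_of(code, k, p):
    """Map a code in [0, 4**k) to one of p equal-width sub-ranges (assumed scheme)."""
    return code * p // 4 ** k


def in_partition(s, i, k, p, j):
    """Check whether suffix i of s belongs to partition j."""
    return partition_of(prefix_code(s, i, k), k, p) == j


code_acc = prefix_code("ACCT", 0, 3)  # A C C -> 00 01 01
code_cct = prefix_code("ACCT", 1, 3)  # C C T -> 01 01 11
```

Because the codes are compared as integers, numeric order on the codes matches lexicographic order on the k-character prefixes, which is what lets contiguous code ranges define the partitions.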

27 A New Algorithm: Construction
for j in partitions do
    initialize_tree t
    for i in 0..length_of_S do
        if suffix i is in partition j
            insert i into t
        endif
    endfor
    save_to_disk t
endfor
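
The construction loop can be translated into a runnable sketch. Several things here are assumptions rather than the paper's method: each partition is built as an uncompressed trie of nested dictionaries instead of a suffix tree, partitions are chosen by an equal-width split of a `k`-character prefix code, and `pickle` stands in for the on-disk format. All names are hypothetical, and the nested-dict representation is only practical for short inputs.

```python
import os
import pickle
import tempfile

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2-bit encoding from the slides


def partition_of_suffix(s, i, k, p):
    """Assumed scheme: equal-width split of the k-character prefix code."""
    code = 0
    for pos in range(i, i + k):
        code = (code << 2) | (CODE[s[pos]] if pos < len(s) else 0)
    return code * p // 4 ** k


def build_partitioned_index(s, k, p, out_dir):
    """One pass over s per partition; each tree is built alone and then saved."""
    paths = []
    for j in range(p):
        tree = {}  # initialize_tree t
        for i in range(len(s)):
            if partition_of_suffix(s, i, k, p) == j:
                node = tree
                for ch in s[i:] + "$":  # insert suffix i with a terminator
                    node = node.setdefault(ch, {})
        path = os.path.join(out_dir, f"partition_{j}.pkl")
        with open(path, "wb") as fh:  # save_to_disk t
            pickle.dump(tree, fh)
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in build_partitioned_index("ACGTACGT", 1, 4, tmp):
            print(os.path.basename(path))
```

Only one partition's tree is held in memory at a time, at the cost of scanning the sequence once per partition.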

28 A New Algorithm: Experimental Results

29 Questions?