Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Suffix Trees and Their Uses.

Slides:



Advertisements
Similar presentations
CS 336 March 19, 2012 Tandy Warnow.
Advertisements

Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Text Indexing The Suffix Array. Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes.
Text Indexing The Suffix Array. Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes.
Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.
Constant-Time LCA Retrieval
Suffix Trees Specialized form of keyword trees New ideas –preprocess text T, not pattern P O(m) preprocess time O(n+k) search time –k is number of occurrences.
What about the trees of the Mississippi? Suffix Trees explained in an algorithm for indexing large biological sequences Jacob Kleerekoper & Marjolijn Elsinga.
Suffix Trees and Suffix Arrays
Suffix Tree and Suffix Array R Brain Chen R Pluto Chang.
1 Suffix Trees Charles Yan Suffix Trees: Motivations Substring problem: One is given a text T of length m. After O (m) preprocessing time, one.
Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.
1 Suffix Trees © Jeff Parker, Outline An introduction to the Suffix Tree Some sample applications How to build a Suffix Tree efficiently.
15-853Page : Algorithms in the Real World Suffix Trees.
296.3: Algorithms in the Real World
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)
Suffix Trees Suffix trees Linearized suffix trees Virtual suffix trees Suffix arrays Enhanced suffix arrays Suffix cactus, suffix vectors, …
3 -1 Chapter 3 String Matching String Matching Problem Given a text string T of length n and a pattern string P of length m, the exact string matching.
21/05/2015Applied Algorithmics - week51 Off-line text search (indexing)  Off-line text search refers to the situation in which a preprocessed digital.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Motivation  DNA sequencing processes large chains into subsequences of ~500 characters long  Assembling all pieces, produces a single sequence but… –At.
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Linear Time Construction of Suffix Tree.
Suffix Trees String … any sequence of characters. Substring of string S … string composed of characters i through j, i ate is.
Goodrich, Tamassia String Processing1 Pattern Matching.
1 Applications of Suffix Trees Charles Yan Exact String Matching |P|=n, |T|=m P and T are both known at the same time Boyer-Moore, or Suffix.
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
Suffix arrays. Suffix array We loose some of the functionality but we save space. Let s = abab Sort the suffixes lexicographically: ab, abab, b, bab The.
Suffix arrays. Suffix array We loose some of the functionality but we save space. Let s = abab Sort the suffixes lexicographically: ab, abab, b, bab The.
Suffix trees.
Suffix trees and suffix arrays presentation by Haim Kaplan.
Suffix trees. Trie A tree representing a set of strings. a c b c e e f d b f e g { aeef ad bbfe bbfg c }
Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.
Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University.
1 Exact Set Matching Charles Yan Exact Set Matching Goal: To find all occurrences in text T of any pattern in a set of patterns P={p 1,p 2,…,p.
Binary Trees Chapter 6.
Database Index to Large Biological Sequences Ela Hunt, Malcolm P. Atkinson, and Robert W. Irving Proceedings of the 27th VLDB Conference,2001 Presented.
B-trees (Balanced Trees) A B-tree is a special kind of tree, similar to a binary tree. However, It is not a binary search tree. It is not a binary tree.
Chapter Tow Search Trees BY HUSSEIN SALIM QASIM WESAM HRBI FADHEEL CS 6310 ADVANCE DATA STRUCTURE AND ALGORITHM DR. ELISE DE DONCKER 1.
Physical Mapping of DNA Shanna Terry March 2, 2004.
Lecture 10 Trees –Definiton of trees –Uses of trees –Operations on a tree.
CS5263 Bioinformatics Lecture 17 Exact String Matching Algorithms.
String Matching with k Mismatches Moshe Lewenstein Bar Ilan University Modified by Ariel Rosenfeld.
Improved string matching with k mismatches (The Kangaroo Method) Galil, R. Giancarlo SIGACT News, Vol. 17, No. 4, 1986, pp. 52–54 Original: Moshe Lewenstein.
Constant-Time LCA Retrieval Presentation by Danny Hermelin, String Matching Algorithms Seminar, Haifa University.
Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }
Exact String Matching Algorithms Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU.
Comp. Genomics Recitation 3 (week 4) 26/3/2009 Multiple Hypothesis Testing+Suffix Trees Based in part on slides by William Stafford Noble.
Applications of Suffix Trees Dr. Amar Mukherjee CAP 5937 – ST: Bioinformatics University of central Florida.
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
Exact String Matching Algorithms. Copyright notice Many of the images in this power point presentation of other people. The Copyright belong to the original.
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
Linear Time Suffix Array Construction Using D-Critical Substrings
15-853:Algorithms in the Real World
COMP9319 Web Data Compression and Search
Two equivalent problems
String Data Structures and Algorithms: Suffix Trees and Suffix Arrays
Suffix trees.
String Data Structures and Algorithms
String Data Structures and Algorithms
Suffix trees and suffix arrays
Suffix Trees String … any sequence of characters.
CS 6293 Advanced Topics: Translational Bioinformatics
Suffix Arrays and Suffix Trees
String Matching with k Mismatches
Dynamic Programming II DP over Intervals
INF 141: Information Retrieval
Presentation transcript:

Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Suffix Trees and Their Uses

Suffix tree S=xabxac = abxac = bxac = xac = ac = c

Suffix tree S=xabxa = abxa = bxa = xa = a x a b x a a b x a b x a

Suffix tree (Example) Let s=abab, a suffix tree of s contains all the suffixes of s=abab$ { $ b$ ab$ bab$ abab$ } a b a b $ a b $ b $ $ $

Trivial algorithm to build a Suffix tree Put the largest suffix in Put the suffix bab$ in a b a b $ a b a b $ a b $ b s=abab$

Put the suffix ab$ in a b a b $ a b $ b a b a b $ a b $ b $ { abab$ bab$ }

Put the suffix b$ in a b a b $ a b $ b $ a b a b $ a b $ b $ $ { abab$ bab$ ab$ }

Put the suffix $ in a b a b $ a b $ b $ $ a b a b $ a b $ b $ $ $ { abab$ bab$ ab$ b$ }

We will also label each leaf with the starting point of the corres. suffix. a b a b $ a b $ b $ $ $ 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ { abab$ bab$ ab$ b$ $ }

Analysis Takes O(n 2 ) time to build. We will see how to do it in O(n) time

What can we do with it ? 7.1. APL1: Exact string matching Given a Text T, |T| = n, preprocess it such that when a pattern P, |P|=m, arrives you can quickly decide when it occurs in T. W e may also want to find all occurrences of P in T

Exact string matching In preprocessing we just build a suffix tree in O(n) time 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ Given a pattern P = ab we traverse the tree according to the pattern. T=abab$ 12345

1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ If we did not get stuck traversing the pattern then the pattern occurs in the text. Each leaf in the subtree below the node we reach corresponds to an occurrence. By traversing this subtree we get all k occurrences in O(n+k) time

Generalized suffix tree Given a set of strings S a generalized suffix tree of S contains all suffixes of s  S To make these suffixes prefix-free we add a special char, say $, at the end of s To associate each suffix with a unique string in S add a different special char to each s

Generalized suffix tree (Example) Let s 1 =abab and s 2 =aab here is a generalized suffix tree for s 1 and s 2 { $ # b$ b# ab$ ab# bab$ aab# abab$ } 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ 1 b # a 2 # 3 # 4 # S = s 1 $s 1 # = abab$aab#

What can we do with it ? 7.5. APL5: Recognizing DNA Contamination We isolate, purify, clone, copy, maintain, probe or sequence DNA string Often unwanted DNA is inserted into the desired DNA sequence DNA sequence extracted form dinosaur, more similar to mammal (human) than to bird or crockodilian DNA

7.5 DNA Contamination Problem Let string S 1 is newly isolated/sequenced DNA We know S 2 the possible contaminated DNA Problem is to find all substrings of S 2 that occur in S 1 and that are longer than some given length l. Solution: Generate generalized suffix tree of S 1 and S 2.

7.4: Longest common substring (of two strings) Every node with a leaf descendant from string s 1 and a leaf descendant from string s 2 represents a maximal common substring and vice versa. 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ 1 b # a 2 # 3 # 4 # Find such node with largest “string depth”

Repetitive Structures in Biological Strings One of the most striking features of DNA is that there are many repeated substrings This is specially true for eukaryotes – Most of the Y chromosome consists of repeated substrings – One third of human genome is from repeated family Prokaryotes have less repeats

Repetitive Structures in Biological Strings Three types of repeats – Local : small scale repeated strings whose function or origin is at least partially understood – Simple repeats: both local and interspersed, whose function is less clear – Complex interspersed repeats: whose function is even more in doubt

Repetitive Structures in Biological Strings Palindrome is a string that reads the same backwards as forwards – xyaayx is a palindrome – “Was it a cat I saw” is a palindrome ignoring space

Repetitive Structures in Biological Strings Complemented palindrome of DNA or RNA – A – T/U and C – G are complements in DNA/RNA – AGCTCGCGAGCT is a complemented palindrome Complemented palindromes in both DNA/RNA regulates DNA transcription – Folds to form a “hairpin loop” Many more functionalities

Repetitive Structures in Biological Strings Restriction enzyme recognizes a specific substring in DNA of both prokaryotes and eukaryotes and cuts the DNA every place where the pattern occurs Restriction Enzyme Cutting Sites are interesting examples of repeats because they tend to be complemented palindromic repeats – For example EcoRI (restriction enzyme) recognizes GAATTC and cuts between the G and the adjoining A – BglI recognizes GCCNNNNNGGC, where N stands for any nucleotide. The enzyme cuts between the last two N s.

Lowest common ancestors A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ 1 b # a 2 # 3 # 4 #

Finding maximal palindromes A palindrome: caabaac, cbaabc Want to find all maximal palindromes in a string s Let s = cbaaba The maximal palindrome with center between i-1 and i is the LCP of the suffix at position i of s and the suffix at position m-i+1 of s r

Maximal palindromes algorithm Prepare a generalized suffix tree for s = cbaaba$ and s r = abaabc# For every i find the LCA of suffix i of s and suffix m-i+1 of s r

3 a abab a baaba$baaba$ b 3 $ 7 $ b 7 # c 1 6 abab c#c# a$a$ c#c# a 5 6 $ c#c# a$a$ $ abc#abc# c#c# Let s = cbaaba$ then s r = abaabc#

Analysis O(n) time to identify all palindromes

Drawbacks Suffix trees consume a lot of space It is O(n) but the constant is quite big Notice that if we indeed want to traverse an edge in O(1) time then we need an array of ptrs. of size |Σ| in each node

Suffix array We loose some of the functionality but we save space. Let s = abab Sort the suffixes lexicographically: ab, abab, b, bab The suffix array gives the indices of the suffixes in lexically sorted order 3142

Example i ippi issippi ississippi mississippi pi ppi sippi sisippi ssippi ssissippi Let T = mississippi L R Let P = issa M 1.Suffix starting at position Pos(1) of T is the lexically smallest suffix 2.In general suffix Pos(i) of T is lexically smaller than suffix Pos(i+1)

How do we build it ? Build a suffix tree Traverse the tree in DFS, lexicographically picking edges outgoing from each node and fill the suffix array. O(n) time 1 2 a b a b $ a b $ b 3 $ 4 $ 5 $ 1 b # a 2 # 3 # 4 #

How do we search for a pattern ? If P occurs in T then all its occurrences are consecutive in the suffix array. Do a binary search on the suffix array Takes O(mlogn) time

Example Let S = mississippi i ippi issippi ississippi mississippi pi ppi sippi sisippi ssippi ssissippi L R Let P = issi M

How do we accelerate the search ? L R Maintain = LCP(P,L) Maintain r = LCP(P,R) M If  = r then start comparing M to P at + 1 r

How do we accelerate the search ? L R Suppose we know LCP(L,M) If LCP(L,M) <  we go left If LCP(L,M) > we go right If LCP(L,M) = we start comparing at  + 1 M If  > r then r

Analysis of the acceleration If we do more than a single comparison in an iteration then max(, r ) grows by 1 for each comparison  O(logn + m) time

Reference Chapter 5, 7: Algorithms on Strings, Trees and Sequences