1 Suffix Arrays: A new method for on-line string searches Udi Manber Gene Myers May 1989 Presented by: Oren Weimann.



2 Introduction - Problem definition: “Is W a substring of A?”, where $|A| = N$ and $|W| = P$. $A = a_0 a_1 \dots a_{N-1}$. $A_i$ = suffix beginning at index i = $a_i a_{i+1} \dots a_{N-1}$. Example: A = abccbbadgfbbcahgjf, W = badgfbb (W occurs in A starting at index 5).

3 Introduction – what is a suffix array? Example: A = assassin. Pos = [0, 3, 6, 7, 2, 5, 1, 4], so Pos[2] = 6 ($A_6$ = in).

4 Introduction – what is a suffix array? A lexicographically sorted array, Pos[N], of all the suffixes of A: Pos[k] = i $\iff$ $A_i$ is the k-th smallest suffix in the set $\{A_0, A_1, A_2, \dots, A_{N-1}\}$.
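The definition of Pos can be checked directly on the deck's example. This is a minimal sketch that simply sorts the suffixes with Python's built-in sort; the function name `naive_suffix_array` is made up for illustration, and this is not the construction algorithm of the paper (it can cost far more than O(N log N) character comparisons).

```python
def naive_suffix_array(A):
    """Return Pos, where Pos[k] is the start index of the k-th smallest suffix of A."""
    # Sorting start indices by the suffix they begin is the definition of Pos,
    # but each comparison may look at up to N characters.
    return sorted(range(len(A)), key=lambda i: A[i:])


A = "assassin"
pos = naive_suffix_array(A)
print(pos)  # [0, 3, 6, 7, 2, 5, 1, 4]
for k, i in enumerate(pos):
    print(k, i, A[i:])  # rank, start index, suffix
```

The third printed row is `2 6 in`, matching Pos[2] = 6 from the example slide.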

5 Introduction – what is a suffix tree? A trie that contains all suffixes of A. Example: A = assassin. [Figure: suffix tree of assassin; each leaf is labeled with the starting position (0 to 7) of its suffix.]

6 The Article Overview
1. A search algorithm in O(P + log N) (assuming we have already computed Pos[] and the longest common prefix (lcp) information).
2. How to construct Pos[] in O(N log N) time and O(N) space (assuming lcp info is known).
3. An algorithm for computing the lcp information in O(N log N).
4. Algorithms for expected-time improvement.

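7 The Search algorithm - Definitions: For any string u, $u^p = u_1 u_2 u_3 \dots u_p$ (or u if $|u| \le p$). Let “$\le$” denote lexicographical order. We say $u \le_p v \iff u^p \le v^p$ (and similarly for $<_p$, $=_p$, $>_p$, $\ge_p$). Note that for any choice of p: $A_{Pos[0]} \le_p A_{Pos[1]} \le_p \dots \le_p A_{Pos[N-1]}$. Note that W is a substring of A $\iff$ there is an i such that $W =_P A_i$.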

8 The Search algorithm – how does the array help us know if W is a substring of A? We define a search interval: $L_W = \min \{k \mid W \le_P A_{Pos[k]} \text{ or } k = N\}$, $R_W = \max \{k \mid W \ge_P A_{Pos[k]} \text{ or } k = -1\}$. W matches $a_i a_{i+1} \dots a_{i+P-1}$ $\iff$ i = Pos[k] for some $k \in [L_W, R_W]$.

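9 Example: A = assassin. [Figure: the Pos array of assassin with three possible placements of W relative to the sorted suffixes, labeled Option 1, Option 2, Option 3.]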

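10 Why finding $L_W$, $R_W$ == finding the matches: If $L_W > R_W$, then W is not a substring of A. Else there are $(R_W - L_W + 1)$ matches: $A_{Pos[L_W]}, \dots, A_{Pos[R_W]}$. [Figure: the Pos array; entries to the left of $L_W$ satisfy $W > A_{Pos[k]}$, entries to the right of $R_W$ satisfy $W < A_{Pos[k]}$.]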

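11 The Search algorithm – The easy way - O(P log N): log(N) iterations; each iteration sets new L, R bounds (initially L=0, R=N-1) according to a comparison of W with $A_{Pos[M]}$, where M=(L+R)/2. Each comparison costs up to P symbol comparisons. In the end the bounds converge on $L_W$ (and, by the symmetric search, on $R_W$). [Figure: Pos with L, M, R pointing at suffixes abcde..., abcdf..., abd...; W=“abcx”.]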
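The easy O(P log N) search can be sketched as two plain binary searches, one for $L_W$ and one for $R_W$. This is a minimal sketch under stated assumptions: the function name `search_interval` is hypothetical, half-open bounds are used instead of the slide's L=0, R=N-1 convention, and Pos is built by naive sorting only to keep the example self-contained.

```python
def search_interval(A, pos, W):
    """Return (L_W, R_W) for pattern W, using plain binary search over Pos."""
    N, P = len(A), len(W)

    def prefix(k):
        # First P symbols of the k-th smallest suffix (shorter if the suffix ends).
        return A[pos[k]:pos[k] + P]

    # L_W: smallest k with W <=_P A_Pos[k], or N if there is none.
    lo, hi = 0, N
    while lo < hi:
        mid = (lo + hi) // 2
        if W <= prefix(mid):
            hi = mid
        else:
            lo = mid + 1
    l_w = lo

    # R_W: largest k with A_Pos[k] <=_P W, or -1 if there is none.
    lo, hi = 0, N
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= W:
            lo = mid + 1
        else:
            hi = mid
    r_w = lo - 1
    return l_w, r_w


A = "assassin"
pos = sorted(range(len(A)), key=lambda i: A[i:])  # naive Pos, for illustration only
l_w, r_w = search_interval(A, pos, "ss")
print(l_w, r_w, sorted(pos[l_w:r_w + 1]))  # 6 7 [1, 4]
```

When W does not occur, the interval comes back empty with $L_W > R_W$, for example `search_interval(A, pos, "b")` gives `(2, 1)`.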

12 The Search algorithm using lcp values in O(P + log N) – Definitions: Speedup using precomputed lcp values; for now we assume lcp is known. In each iteration we define: – $l = lcp(A_{Pos[L]}, W)$ – $r = lcp(W, A_{Pos[R]})$ – $Llcp[M] = lcp(A_{Pos[L]}, A_{Pos[M]})$ – $Rlcp[M] = lcp(A_{Pos[M]}, A_{Pos[R]})$

13 The Search algorithm using lcp values in O(P + log N). Example: W=“abcx”, with L, M, R pointing at suffixes abcde..., abcdf..., abd... in Pos. Then l = 3, r = 2, Llcp[M] = 4, Rlcp[M] = 2. Note that Llcp[M] is well defined because every midpoint M is reached from exactly one interval, so it has one $L_M$ and one $R_M$.

14 So how do we use l, r, Llcp[M]? Example: W=abcx, L → abcde..., M → abcdf..., R → abd...; l=3, r=2, Llcp[M]=4. Case 1: Llcp[M] > l (Llcp[M]=4 and l=3): $W > A_{Pos[L]}$ $\Rightarrow$ $W > A_{Pos[M]}$ $\Rightarrow$ go right $\Rightarrow$ l is unchanged = 3.

15 Example: W=abcx (cont.), L → abcde..., M → abdf…, R → abd…; l=3, r=2, Llcp[M]=2. Case 2: Llcp[M] < l (Llcp[M]=2 and l=3): $A_{Pos[L]} < A_{Pos[M]}$ $\Rightarrow$ $W < A_{Pos[M]}$ $\Rightarrow$ go left $\Rightarrow$ r = Llcp[M] = 2.

16 Example: W=abcx (cont.), L → abcde..., M → abcp…, R → abd…; l=3, r=2, Llcp[M]=3. Case 3: Llcp[M] = l (Llcp[M]=3 and l=3): compare W and $A_{Pos[M]}$ symbol by symbol, starting at offset l, until $w_{l+j} \ne a_{Pos[M]+l+j}$ $\Rightarrow$ go right or left according to the order of $w_{l+j}$ and $a_{Pos[M]+l+j}$ $\Rightarrow$ the new l or r = (l+j) $\Rightarrow$ number of comparisons = j+1.

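17 The Search algorithm using lcp values – complexity: In each iteration there are at most j+1 comparisons, where in total $\sum j \le P$ (each successful comparison increases max(l, r), which never exceeds P) $\Rightarrow$ total comparisons $\le P + \#\text{iterations}$ $\Rightarrow$ O(P + log N) running time. Requires only 3 N-sized arrays (Pos, Llcp, Rlcp).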
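The three cases can be put together in one search loop. The sketch below finds only $L_W$ and handles the mirrored situation (r > l) with Rlcp, which the slides do not spell out. All function names are hypothetical, and `build_llcp_rlcp` fills the two arrays with naive character comparisons as a stand-in for the O(N log N) lcp computation described later; A is assumed to be non-empty.

```python
def lcp(u, v):
    """Length of the longest common prefix of strings u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


def build_llcp_rlcp(A, pos):
    """Llcp[M], Rlcp[M] for every midpoint M the binary search can visit (naive stand-in)."""
    N = len(A)
    llcp, rlcp = [0] * N, [0] * N
    stack = [(0, N - 1)]
    while stack:
        L, R = stack.pop()
        if R - L <= 1:
            continue
        M = (L + R) // 2
        llcp[M] = lcp(A[pos[L]:], A[pos[M]:])
        rlcp[M] = lcp(A[pos[M]:], A[pos[R]:])
        stack += [(L, M), (M, R)]
    return llcp, rlcp


def extend(A, start, W, m):
    """Grow a known match length m between W and the suffix A[start:]."""
    while m < len(W) and start + m < len(A) and W[m] == A[start + m]:
        m += 1
    return m


def w_le_suffix(A, start, W, m):
    """Given m = lcp(W, A[start:]), decide whether W <=_P A[start:]."""
    if m == len(W):
        return True   # W is a prefix of the suffix
    if start + m == len(A):
        return False  # the suffix is a proper prefix of W, so it is smaller
    return W[m] < A[start + m]


def find_lw(A, pos, llcp, rlcp, W):
    """Return L_W, reusing l, r, Llcp and Rlcp to skip symbols already matched."""
    N = len(A)
    l = extend(A, pos[0], W, 0)
    if w_le_suffix(A, pos[0], W, l):
        return 0
    r = extend(A, pos[N - 1], W, 0)
    if not w_le_suffix(A, pos[N - 1], W, r):
        return N
    L, R = 0, N - 1  # invariant: A_Pos[L] <_P W <=_P A_Pos[R]
    while R - L > 1:
        M = (L + R) // 2
        if l >= r:
            # Llcp[M] >= l covers cases 1 and 3; Llcp[M] < l is case 2.
            m = extend(A, pos[M], W, l) if llcp[M] >= l else llcp[M]
        else:
            m = extend(A, pos[M], W, r) if rlcp[M] >= r else rlcp[M]
        if w_le_suffix(A, pos[M], W, m):
            R, r = M, m
        else:
            L, l = M, m
    return R


A = "assassin"
pos = sorted(range(len(A)), key=lambda i: A[i:])  # naive Pos, for illustration only
llcp, rlcp = build_llcp_rlcp(A, pos)
print(find_lw(A, pos, llcp, rlcp, "sin"))  # 5
```

Case 1 costs a single comparison here because `extend` stops at offset l immediately, so folding it into the same branch as case 3 does not change the O(P + log N) bound.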

18 The Article Overview (repeated). Next, item 2: how to construct Pos[] in O(N log N) time and O(N) space (assuming lcp info is known).

19 Construction of suffix array in O(N log N): Sorting the suffixes with a specialized radix sort. We will have O(log N) stages (numbered 1, 2, 4, 8, 16, …). In stage H the suffixes are sorted into buckets, called H-buckets, according to their first H characters (the next stage is 2H; thus, in stage H the suffixes are sorted by $\le_H$).

20 Construction of suffix array – The general idea: If $A_i$, $A_j$ are in the same H-bucket, we sort them by their next H symbols. But their next H symbols are the first H symbols of $A_{i+H}$ and $A_{j+H}$, which are already sorted in stage H. [Figure: H=2; buckets ab (abef…, abcd…, ab…), bb (bb..., bb…), cd (cd…), ef (ef…); $A_i$ and $A_j$ lie in the first bucket, while $A_{i+H}$ and $A_{j+H}$ lie in later buckets.]

21 Construction of suffix array – The general idea (cont.): Let $A_i$ be in the first H-bucket after stage H $\Rightarrow$ $A_i$ starts with the smallest H-symbol string $\Rightarrow$ $A_{i-H}$ should be first in its H-bucket. [Figure: H=2; suffixes abef…, abcd…, ab…, bb..., bb…, cdef…, cdab…, ef… with $A_i$ and $A_{i-H}$ marked.]

22 Construction of suffix array – The algorithm: Go over the suffix array in order. For each $A_i$: move $A_{i-H}$ to the next available place in its H-bucket. The suffixes are now sorted according to $\le_{2H}$ order. Go over the array again and decide which suffixes open a new 2H-bucket, using lcp knowledge (described later).
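The doubling idea (order by the first 2H symbols using only the order by the first H symbols) can be sketched compactly. Note the swapped-in technique: instead of the bucket scan described on the slide, this sketch re-sorts with a comparison sort on pairs of ranks, which costs O(N log² N) rather than O(N log N). The function name `suffix_array_doubling` is made up for illustration.

```python
def suffix_array_doubling(A):
    """Prefix-doubling suffix sort: stage H orders suffixes by their first 2H symbols."""
    N = len(A)
    rank = [ord(c) for c in A]  # stage 1: bucket number = first character
    pos = list(range(N))
    H = 1
    while True:
        # The first 2H symbols of A_i are (first H of A_i, first H of A_{i+H}).
        # A suffix that ends early gets -1, which sorts before every real rank.
        def key(i):
            return (rank[i], rank[i + H] if i + H < N else -1)

        pos.sort(key=key)
        # Renumber the buckets: equal keys stay in the same 2H-bucket.
        new_rank = [0] * N
        for k in range(1, N):
            new_rank[pos[k]] = new_rank[pos[k - 1]] + (key(pos[k]) != key(pos[k - 1]))
        rank = new_rank
        if N == 0 or rank[pos[-1]] == N - 1:
            return pos  # every suffix sits alone in its bucket
        H *= 2


print(suffix_array_doubling("assassin"))  # [0, 3, 6, 7, 2, 5, 1, 4]
```

On assassin the loop finishes after the H=2 stage, since by then all eight suffixes differ within their first four symbols.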

23 Construction of suffix array – The algorithm. Example: A = assassin, H=1 ($A_i$ sets $A_{i-1}$). Array (buckets separated by |): assin assassin | in | n | sin ssin sassin ssassin. Current step: $A_3$ sets $A_2$.

24 Construction of suffix array – The algorithm. Example: A = assassin, H=1 ($A_i$ sets $A_{i-1}$). Array: assin assassin | in | n | sassin ssin sin ssassin. Current step: $A_0$ (there is no $A_{-1}$ to set).

25 Construction of suffix array – The algorithm. Example: A = assassin, H=1 ($A_i$ sets $A_{i-1}$). Array: assin assassin | in | n | sassin ssin sin ssassin. Current step: $A_6$ sets $A_5$.

26 Construction of suffix array – The algorithm. Example: A = assassin, H=1 ($A_i$ sets $A_{i-1}$). Array: assin assassin | in | n | sassin sin ssin ssassin. Current step: $A_7$ sets $A_6$.

27 Construction of suffix array – The algorithm. Example: A = assassin, H=1 ($A_i$ sets $A_{i-1}$). Array: assin assassin | in | n | sassin sin ssin ssassin. Current step: $A_2$ sets $A_1$.

28 Construction of suffix array – The algorithm. Example: A = assassin, H=1 ($A_i$ sets $A_{i-1}$). Array: assin assassin | in | n | sassin sin ssassin ssin. Current step: $A_5$ sets $A_4$.

29 Construction of suffix array – The algorithm. Example: A = assassin, H=1 ($A_i$ sets $A_{i-1}$). Array: assin assassin | in | n | sassin sin ssassin ssin. Current step: $A_1$ sets $A_0$.

30 Construction of suffix array – The algorithm. Example: A = assassin, H=1 ($A_i$ sets $A_{i-1}$). Array: assassin assin | in | n | sassin sin ssassin ssin. Current step: $A_4$ sets $A_3$.

31 Construction of suffix array – The algorithm. Example: A = assassin, H=1. Array: assassin assin | in | n | sassin sin ssassin ssin. Go over the array to get the new 2-buckets: lcp(sassin, sin) = 1 + lcp(assin, in) = 1 + 0 = 1, so “sin” opens a new 2-bucket.

32 Construction of suffix array – The algorithm. Example: A = assassin, H=2 ($A_i$ sets $A_{i-2}$). Array: assassin assin | in | n | sassin | sin | ssassin ssin. Current step: $A_0$ (there is no $A_{-2}$ to set).

33 Construction of suffix array – The algorithm. Example: A = assassin, H=2 ($A_i$ sets $A_{i-2}$). Array: assassin assin | in | n | sassin | sin | ssassin ssin. Current step: $A_3$ sets $A_1$.

34 Construction of suffix array – The algorithm. Example: A = assassin, H=2 ($A_i$ sets $A_{i-2}$). Array: assassin assin | in | n | sassin | sin | ssassin ssin. Current step: $A_6$ sets $A_4$.

35 Construction of suffix array – The algorithm. Example: A = assassin, H=2 ($A_i$ sets $A_{i-2}$). Array: assassin assin | in | n | sassin | sin | ssassin ssin. Current step: $A_7$ sets $A_5$.

36 Construction of suffix array – The algorithm. Example: A = assassin, H=2 ($A_i$ sets $A_{i-2}$). Array: assassin assin | in | n | sassin | sin | ssassin ssin. Current step: $A_2$ sets $A_0$.

37 Construction of suffix array – The algorithm. Example: A = assassin, H=2 ($A_i$ sets $A_{i-2}$). Array: assassin assin | in | n | sassin | sin | ssassin ssin. Current step: $A_5$ sets $A_3$.

38 Construction of suffix array – The algorithm. Example: A = assassin, H=2 ($A_i$ sets $A_{i-2}$). Array: assassin assin | in | n | sassin | sin | ssassin ssin. Current step: $A_1$ (there is no $A_{-1}$ to set).

39 Construction of suffix array – The algorithm. Example: A = assassin, H=2 ($A_i$ sets $A_{i-2}$). Array: assassin assin | in | n | sassin | sin | ssassin ssin. Current step: $A_4$ sets $A_2$.

40 Construction of suffix array – The algorithm. Example: A = assassin, H=2. Array: assassin assin | in | n | sassin | sin | ssassin ssin. Go over the array to get the new 4-buckets.

41 Construction of suffix array – The algorithm. Example: A = assassin, H=4. Array: assassin | assin | in | n | sassin | sin | ssassin | ssin. That’s it, we are sorted!

42 Construction of suffix array – Complexity Summary: Sorting by first char: O(N). O(log N) stages of O(N) operations = O(N log N). Total time: O(N log N); space: 2 integer arrays of size N.

43 The Article Overview (repeated). Next, item 3: an algorithm for computing the lcp information in O(N log N).

44 How to find Longest Common Prefixes – the general idea: We don’t care what the lcp is between suffixes in the same H-bucket. For $A_p$, $A_q$ in the same H-bucket but different 2H-buckets: – $H \le lcp(A_p, A_q) < 2H$ – $lcp(A_p, A_q) = H + lcp(A_{p+H}, A_{q+H})$ – $lcp(A_{p+H}, A_{q+H}) < H$ $\Rightarrow$ that is why $A_{p+H}$, $A_{q+H}$ are in different H-buckets, but which ones?

45 How to find Longest Common Prefixes – the general idea: If $A_{p+H}$ and $A_{q+H}$ are in adjacent H-buckets, then their lcp is already known. If not, then for i < j: $lcp(A_{Pos[i]}, A_{Pos[j]}) = \min_{k \in [i, j-1]} \{ lcp(A_{Pos[k]}, A_{Pos[k+1]}) \}$.
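The min-over-neighbors identity can be checked numerically on the deck's running example. This is a minimal sketch: the names `adjacent` and `lcp_by_min` are made up for illustration, and both Pos and the adjacent lcp values are computed naively here.

```python
def lcp(u, v):
    """Length of the longest common prefix of strings u and v."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return n


A = "assassin"
pos = sorted(range(len(A)), key=lambda i: A[i:])  # naive Pos, for illustration only

# adjacent[k] = lcp of the k-th and (k+1)-th smallest suffixes.
adjacent = [lcp(A[pos[k]:], A[pos[k + 1]:]) for k in range(len(A) - 1)]
print(adjacent)  # [3, 0, 0, 0, 1, 1, 2]


def lcp_by_min(i, j):
    """lcp of the i-th and j-th smallest suffixes (i < j), from neighbor values only."""
    return min(adjacent[i:j])


# sassin is the 4th and ssin the 7th smallest suffix: min{1, 1, 2} = 1.
print(lcp_by_min(4, 7), lcp("sassin", "ssin"))  # 1 1
```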

46 How to find Longest Common Prefixes – the general idea: Example, H=2, array: assassin assin in n sassin sin ssassin ssin, with $A_{p+H}$ and $A_{q+H}$ marked: $lcp(A_{p+H}, A_{q+H}) = \min\{1, 1, 2\} = 1$. Notice that if two neighbors are in the same H-bucket we can consider their lcp to be H, since $lcp(A_{p+H}, A_{q+H}) < H$.

47 How to find lcp – algorithm and data structures – Hgt[]: During the construction stage we build an array called Hgt[N], where $Hgt[i] = lcp(A_{Pos[i-1]}, A_{Pos[i]})$, initialized so that Hgt[i] = N+1 for every i. In stage H=1: Hgt[i] = 0 for each $A_{Pos[i]}$ that is first in its bucket. In stage 2H: we update every Hgt[i] for which $A_{Pos[i]}$ is the first in a newly created 2H-bucket.

48 How to find lcp – Hgt[] example: H=1, array: assin assassin in n sin ssin sassin ssassin. [Figure: the Hgt values after stage H=1.] H=2: lcp(ssin, sin) = 1 + lcp(sin, in) = 1 + min{lcp(in, n), lcp(sin, n)} = 1 + 0 = 1.

49 How to find lcp – Hgt[] example (cont.): [Figure: the array with the two newly computed Hgt values, 3 and 2.] lcp(assassin, assin) = 2 + lcp(sin, sassin) = 2 + 1 = 3. lcp(ssin, ssassin) = 2 + lcp(in, assin) = 2 + 0 = 2.

50 How to find lcp – data structures: We need a data structure that can give $lcp(A_{Pos[j]}, A_{Pos[i]})$ for any i and j (not just for i and i+1, which Hgt[] supplies). Hgt[] will become the leaves of a balanced binary tree called the interval tree.

51 How to find lcp – example of interval tree. [Figure: a balanced binary tree whose leaves are the adjacent pairs (0,1), (1,2), (2,3), (3,4), (4,5), (5,6), (6,7).]

52 How to find lcp – Complexity: Each time a leaf opens a new bucket we change Hgt[i] for that leaf. That change requires O(log N) changes in the interval tree. There are O(N) leaves that open a new bucket. In total we get O(N log N) to get all lcp values.
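The O(log N) cost per Hgt change can be seen in a small array-backed tree. This is a sketch under assumptions: the class name `IntervalTree` and its methods are made up for illustration, each internal node is assumed to store the minimum of its two children, and the layout is an ordinary bottom-up segment tree rather than the exact structure used in the paper.

```python
class IntervalTree:
    """Balanced binary tree over leaf values; each internal node holds the min of its children."""

    def __init__(self, leaves):
        self.n = len(leaves)
        # Nodes 1..n-1 are internal, nodes n..2n-1 are the leaves.
        self.t = [0] * self.n + list(leaves)
        for v in range(self.n - 1, 0, -1):
            self.t[v] = min(self.t[2 * v], self.t[2 * v + 1])

    def update(self, k, value):
        """Set leaf k and repair its O(log n) ancestors."""
        v = k + self.n
        self.t[v] = value
        v //= 2
        while v >= 1:
            self.t[v] = min(self.t[2 * v], self.t[2 * v + 1])
            v //= 2

    def range_min(self, lo, hi):
        """Minimum over leaves lo, lo+1, ..., hi-1."""
        best = float("inf")
        lo += self.n
        hi += self.n
        while lo < hi:
            if lo & 1:
                best = min(best, self.t[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                best = min(best, self.t[hi])
            lo //= 2
            hi //= 2
        return best


N = 8  # A = assassin; leaf k stands for the adjacent pair (k, k+1)
tree = IntervalTree([N + 1] * (N - 1))  # "not yet known", as in the Hgt initialization
for k, h in enumerate([3, 0, 0, 0, 1, 1, 2]):  # final adjacent lcp values for assassin
    tree.update(k, h)
print(tree.range_min(4, 7))  # 1
```

In the real construction the updates arrive stage by stage as new 2H-buckets open; the loop above applies the final values in one go only to keep the example short.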

53 The Article Overview (repeated). Next, item 4: algorithms for expected-time improvement.

54 Expected-case time improvement of the construction of Pos[]: Assumption: all N-symbol strings are equally likely. Under this assumption, the expected length of the longest repeated substring is $O(\log_{|\Sigma|} N)$. This immediately implies that the construction of Pos[] is reduced to O(N log log N) expected time: the doubling stops once H exceeds the longest repeated substring, so only O(log log N) stages are needed. Next is a way to reduce it to O(N).

55 Expected-case time improvement of the construction of Pos[]: Let $T = \lfloor \log_{|\Sigma|} N \rfloor$. We encode each possible T-length string as an integer with the isomorphism $Int_T(u)$. Map each $A_p$ to $Int_T(A_p) \in [0, |\Sigma|^T - 1]$: $Int_T(A_p) = a_p |\Sigma|^{T-1} + a_{p+1} |\Sigma|^{T-2} + \dots + a_{p+T-1} |\Sigma|^0$.

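56 Example of the mapping: $Int_T(A_p) = a_p |\Sigma|^{T-1} + a_{p+1} |\Sigma|^{T-2} + \dots + a_{p+T-1} |\Sigma|^0$, with $|\Sigma| = 4$, a=0, i=1, n=2, s=3, N=8, $T = \lfloor \log_4 8 \rfloor = 1$. [Figure: worked computation of the $Int_T$ values as sums of terms of the form $c \cdot 4^k$.]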

57 Expected-case time improvement of the construction of Pos[]: By the definition of $Int_T(A_p)$, it takes O(N) to compute the $Int_T(A_p)$ values of all suffixes. So now, instead of starting with H=1, we start with H=T. But since the expected length of the longest repeated substring is $O(\log_{|\Sigma|} N)$, we will have O(1) stages of the radix sort. Thus, the total expected time for constructing Pos[] = O(N).
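The O(N) claim for the encoding follows from a right-to-left recurrence, $Int_T(A_p) = a_p |\Sigma|^{T-1} + \lfloor Int_T(A_{p+1}) / |\Sigma| \rfloor$. A minimal sketch, with two assumptions that are not in the slides: positions past the end of A are padded with the code 0, and the function name `int_t` and the `code` dictionary are made up for illustration.

```python
def int_t(A, code, T):
    """Int_T of every suffix of A, computed right to left in O(N) arithmetic steps."""
    sigma = len(code)            # alphabet size |Sigma|
    top = sigma ** (T - 1)       # weight of the leading symbol
    vals = [0] * (len(A) + 1)    # vals[N] = 0 acts as padding (assumption)
    for p in range(len(A) - 1, -1, -1):
        # Drop the last symbol of A_{p+1}'s window, then prepend a_p.
        vals[p] = code[A[p]] * top + vals[p + 1] // sigma
    return vals[:-1]


code = {"a": 0, "i": 1, "n": 2, "s": 3}  # the symbol codes from the example slide
print(int_t("assassin", code, 1))  # [0, 3, 3, 0, 3, 3, 1, 2]
print(int_t("assassin", code, 2))  # [3, 15, 12, 3, 15, 13, 6, 8]
```

With T=1, as in the example slide, each value is just the code of the suffix's first symbol; T=2 is shown only to make the weighted sum visible (for instance "si" gives 3·4 + 1 = 13).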

58 So is a suffix array better than a suffix tree?
Construction time: suffix array O(N log N), or for small |Σ| O(N) but needs additional space; suffix tree O(N).
Time complexity (search): suffix array O(P + log N), good for large alphabets; suffix tree O(P log |Σ|).
Space complexity: suffix array requires 2N integers (this is the main advantage); suffix tree O(N).
Dependent on |Σ|? Suffix array: no; suffix tree: yes.
