Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis.

Presentation on theme: "Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis."— Presentation transcript:

Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis

AABAAB A A B A A B \$ 1 \$ 4 B A A \$ 2 B A A B \$ 3 B \$ 5 \$ 6 1 2 3 4 5 6 Suffix Tree Every suffix of S is encoded by a root to leaf walk. Every substring encoded by a walk from the root. S =

AABAAB A A B A A B \$ 1 \$ 4 B A A \$ 2 B A A B \$ 3 B \$ 5 \$ 6 4 1 5 2 3 6 1 2 3 4 5 6 SUFFIX ARRAY The suffixes listed in lexicographic order. Found by lexicographic DFS

Basic Facts about Suffix Trees Suffix tree for a string of length n can be built in O(n) time. Basic implementations need about 25 bytes per character. Numerous applications in string algorithms achieving significant (sometimes amazing) speedups over naïve algorithms – O(m) time substring search for a string of length m in a string of length n, regardless of n.

Suffix Tree/Array Construction Weiner 1973 McCreight ~ 1974 Ukkonen ~ 1996 Farach-Colton ~ 1999 Manber-Myers ~ 1991 (Suffix-Arrays) O(n log n time) with small space. O(n) time possible as on prior slide.

aabbaa A \$ 6 A B B A A \$ 1 \$ 5 B A A \$ 2 B B A A \$ 3 B A A \$ 4 B

Bioinformatics Applications pre 1997 A small number of applications Limited by space requirements of the tree, small computer memories, complication of the algorithms, poor locality of reference

Post 1997 Much has changed, and Suffix trees and their relatives (particularly suffix arrays) are more widely applied in bioinformatics 20- 40 new publications with substantial connection to bioinformatics, post 1997

New Results since 1997 Fundamental Algorithms: Farach-Colton construction; simpler algorithm for least- common-ancestor. Implementation improvements (space) for suffix trees and arrays New variants of suffix trees – affix trees, virtual suffix trees New algorithms for tandem repeats

New Results since 1997 Approximate repeats and large-scale repeat structure Fast lookups in databases Substring frequencies Motif and pattern discovery Hybrid dynamic programming Oligo and probe construction

New Results since 1997 Genome fragment assembly and resequencing (multiple) whole genome comparison Large-scale sequence comparison in resequencing projects Misc. clever applications Related less-used data structures

Suffix trees collect together repeated substring prefixes Example: Speeding up the use of Position Specific Scoring Matrices. Accelerating protein classification ISMB 2000 position 1 2 3 4 5 A 3 1 4 2 1 T 2 4 4 3 2 C 1 3 5 6 2 G 4 2 4 6 2 …AACTG…AACTG….AACTG… AACTG Walk around the tree to depth 5 PSSM Starting locations of AACTG

Whole Genome Alignment with MUM’s: Delcher …Salzberg NAR ~ 1999 A MUM is a Maximal Unique Matching substring in two strings S and S’ MUMMER finds all MUM’s; selects and aligns non-overlapping pairs of MUM’s; and then recurses in the regions between adjacent selected MUMs. Key issue: finding MUMs efficiently

Spotting MUMs with a suffix tree Example: S = …AATCCGTG…. S’ = ………GATCCGTA… 20 So ATCCGT is a MUM if it only occurs in these two places root Path spelling ATCCGT 7 S7 S’20 Exactly two sibling leaves

Finding MUM’s in a suffix tree Build a suffix tree for the concatenation of S and S’, noting at each leaf whether the suffix is from the S part or the S’ part. Also note the prior character in S or S’ for that position. Do a DFS traversal of the S.T. to note at each node how many leaves below it are from S and how many from S’. A node corresponds to a MUM if and only if the count there is 1 and 1, and the two prior characters are different. For total string length n, the time used is O(n).

Reducing the space needed for MUM finding MUMMER II - NAR spring 2002 root Path to a node v, spelling xB, where B is a string v v’ Suffix link Path to a node v’, spelling B

Space reduction when finding MUMs Given strings S and S’, build a suffix tree T, with suffix links, but for only the smaller of S and S’, say S. Using T, find for each position i in S’, the longest substring starting at i which occurs in S exactly once. Mark the end location in T of that match. Each such match is a candidate MUM.

Finding MUMs in less space To find the matches, walk S’ through T, using suffix links to reduce search time. After S’ has been completely processed, every marked position in T that is visited only once and has no marked position below it, corresponds to a MUM. So all MUMs can be found in O(n) time, but in much reduced space.

Finding Tandem Repeats …TATAACTAACTAAGATT.. For a string of length n, a naïve algorithm might use (on the order of) n^3 operations to find all T.R.s Stoye and Gusfield, Theoretical Computer Science, 2001

Example: Finding Tandem Repeats in Strings TATAACTAACTAAGATT…. Every Tandem Repeat is part of a chain of tandem repeats ending with a Branching Tandem Repeat. To extend a chain, examine the first character of the T.R. and the character after the end of the T.R. If they are the same, a new T.R. exists on place to the right.

Example: Finding Tandem Repeats in Strings TATAACTAACTAAGATT…. Branching Tandem Repeats Every Tandem Repeat is part of a chain of tandem repeats ending with a Branching Tandem Repeat.

Example: Finding Tandem Repeats in Strings TATAACTAACTAAGATT…. Branching Tandem Repeats Every Tandem Repeat is part of a chain of tandem repeats ending with a Branching Tandem Repeat.

Example: Finding Tandem Repeats in Strings TATAACTAACTAAGATT…. Branching Tandem Repeat Every Tandem Repeat is part of a chain of tandem repeats ending with a Branching Tandem Repeat. So to find all T.R.s – find all Branching T.R’s Use a suffix tree to efficiently find Branching T.R.’s

Finding Branching T.R.’s Every Branching T.R. ends at a node of the suffix tree – so examine each node. String to here is length 8 5 21 13 = 5 + 8 root leaves 5, 13 and 21 are below v v

Basic Algorithm Build the S.T. for string S; process it to find depth of each node, and do a DFS numbering of the nodes so that ancestry of two nodes can be determined in constant time. (Classic property of DFS) Check each internal node to see if it is at the first or second part of a tandem repeat.

How to check From a node v of depth d(v), walk the subtree to collect the leaf numbers below v. If leaf k is collected, check if v is an ancestor of leaf k+d(v), and the next characters k+1 and k+d(v) + 1 differ. If so, then the string to v is the first part of a Branching T.R. Alternatively, focus on k-d(v) to see if the string to v is the second part of a Branching T.R.

AABAAB A A B A A B \$ 1 \$ 4 B A A \$ 2 B A A B \$ 3 B \$ 5 \$ 6 1 2 3 4 5 6 Suffix Tree S =

Finding Tandem Repeats This approach gives O(n^2) time to find all branching tandem repeats.The time can be reduced to O(n log n) with a classic trick. Related more complex result: In O(n) time, one can mark in the suffix tree, the endpoints of all tandem repeat “species”. Gusfield and Stoye 2000 UCD techreport Uses the fact (Frankel et al 1999) that there are only 2n tandem repeat species – complex proof. Example: aaaaaaaaaa has only five tandem repeat species, but 25 tandem repeat occurrences

Space issues again Use “virtual suffix tree” or “extended suffix array” to reduce space Def: for positions i and j in string S, LCP(i,j) is the length of the longest matching substring starting at positions i and j in S.

AABAAB A A B A A B \$ 1 \$ 4 B A A \$ 2 B A A B \$ 3 B \$ 5 \$ 6 4 1 5 2 3 6 3 1 2 0 1 1 2 3 4 5 6 SUFFIX ARRAY With LCP or string depth information Node depths at time of descent after a backup LCP of neighbors

Extended Suffix Array String S, the suffix array A for S, and the LCP array for A, completely determine the suffix tree T for S. This is the “extended suffix array” for S. The suffix array and the LCP array can be found in O(n log n) time and small space, without an explicit suffix tree. For many efficient algorithms using an explicit suffix tree, it is possible to simulate the algorithm using only the extended suffix array, greatly reducing space usage without increasing time. S. Kurtz WABI 2002

String Barcoding – Oligo Construction Uncovering Optimal Virus Signatures Sam Rash and Dan Gusfield RECOMB April 2002

Motivation Need for rapid virus detection –Given unknown virus database known viruses –Problem identify unknown virus quickly –Ideal solution have sequence of –viruses in database –unknown virus Solution –use BLAST (or any sequence similarity program/algorithm)

Motivation –Real World only have sequence for pathogens in database –not possible to quickly sequence an unknown virus can test for presence small (<= 50 bp) strings in unknown virus –substring tests –Another Idea String Barcoding –use substring tests to uniquely identify each virus in the database –acquire unique barcode for each virus in database

Implementation Basic Idea: Formulate problem as an Integer Linear Program (ILP) –Enumerate some “useful” set of substrings from S variable in ILP for each substring –Constraint for each pair of strings in S means that at least one substring will be chosen to distinguish each pair –Objective Function Minimize sum of variables in ILP

Problem Definition Formal Definition –given set of strings S –goal find set of strings S’, the testing set –wlog, for each s 1,s 2 in S, there exists at least one u in S’ where u is a substring of only s 1 –u is a signature substring minimize |S’| –result barcode for each element on S

Idea strings: 1. cagtgc 2. cagttc 3. catgga Each node in the suffix tree has a corresponding set of string IDs below it v1 - {1,2,3}v2 - {1,2,3}v3 - {3}v4 - {1}v5 - {3} v6 - {1,2}v7 - {2}v8 - {1}v9 - {1,2,3}v10 - {1,2,3} v11 - {1,2}v12 - {1}v13 - {2}v14 - {3}v15 - {1,2,3} v16 - {2}v17 - {2}v18 - {1,3}v19 - {1}v20 - {3} v21 - {1,2,3}v22 - {3}v23 - {2}v24 - {1,2}v25 - {1} Figure 1.1 - suffix tree for set of strings cagtgc, cagttc, and catgga

Idea strings: 1. cagtgc 2. cagttc 3. catgga Each node in the suffix tree has a corresponding set of string IDs below it v1 - {1,2,3}v2 - {1,2,3}v3 - {3}v4 - {1}v5 - {3} v6 - {1,2}v7 - {2}v8 - {1}v9 - {1,2,3}v10 - {1,2,3} v11 - {1,2}v12 - {1}v13 - {2}v14 - {3}v15 - {1,2,3} v16 - {2}v17 - {2}v18 - {1,3}v19 - {1}v20 - {3} v21 - {1,2,3}v22 - {3}v23 - {2}v24 - {1,2}v25 - {1} Figure 1.1 - suffix tree for set of strings cagtgc, cagttc, and catgga

Implementation: Suffix Trees –root-edge walk Creates string –appears in exactly the strings that label the node at which it ends 2 root-edge walks ending on the same edge –Both strings created by the walk occur in exactly the same set of original strings Can use ether string example - a root edge walk c a g t t a c g a g t t g t t c c g a

Practical Implementation: If two substrings occur in exactly the same set of original strings, only one need be considered –Use strings from suffix tree for each uniquely labeled node Build ILP as discussed Solve ILP using CPLEX Acquire barcode and signatures for each original string –signature is the set of substring tests occurring in a string

Implementation: Example minimize V18 + V22 + V11 + V17 + V8#objective function st V18 + V22 + V11 + V17 + V8 >= 2#this is the theoretical minimum V18 + V17 + V8 >= 1#constraint to cover pair 1,2 V22 + V11 + V8 >= 1#constraint to cover pair 1,3 V18 + V22 + V11 + V17 >= 1#constraint to cover pair 2,3 binaries#all variables are 0/1 V18 V22 V11 V17 V8 end Figure 1.4 - barcodes Figure 1.5 - signatures tg (V18)atgga (V22) cagtgc10 cagttc00 catgga11 cagtgc{“tg”} cagttc  catgga{“tg”, “atgga”}

Implementation: Extensions minimum and maximum lengths on signature substrings acquire barcodes/signatures for only a subset of input strings (wrt to whole set) minimum string edit distance between chosen signature substrings redundancy –require r signature substrings to differentiate each pair –adds a higher level of confidence that signatures remain valid even with mutations

Results Works quickly on most moderately sized datasets (especially when redundancy >= 2) –dataset properties ~50k virus genomes taken from NCBI (Genbank) 50-150 virus genomes average length of each sequence ~1000 characters total input size ranged from approximately 50,000 – 150,000 characters increasing dataset size scaled approximately linearly –reach 25% gap (at most 1/3 more than optimum) in just a few minutes –reach small gap (often < 1%) in 4 hours

Summary Numerous applications of Suffix Trees and relatives in Bioinformatics More yet to be found Suffix tree and array software at my UCD website.

Download ppt "Suffix Trees Come of Age in Bioinformatics Algorithms, Applications and Implementations Dan Gusfield, U.C. Davis."

Similar presentations