UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12.2.4: k-difference.

1 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12.2.4: k-difference Inexact Matching Lecturer: Dr. Rose Slides by: Dr. Rose February 21, 2002, year of the palindrome Last night at 2 minutes past 8pm it was: 20:02,20/02/2002

2 Overview
k-difference inexact matching
– Concepts: d-path; farthest-reaching d-path in a diagonal
– O(km) time and space solution
Primer selection problem
– Formulations: exact matching primer; inexact matching primer; k-difference primer
– O(km) time solution to k-difference primer problem

3 Overview
Exclusion methods: fast expected time O(m)
– Partition approaches: BYP algorithm
– Aho-Corasick exact matching algorithm
  » Keyword trees
– Back to Aho-Corasick exact matching algorithm
  » Algorithm for computing failure links
Back to BYP algorithm

4 K-difference Inexact Matching
Like the k-mismatch problem: allows mismatches.
Harder than k-mismatch:
– allows spaces
– end spaces in T are not counted
– |P| and |T| can be vastly different ⇒ can't focus on a band of width 2k+1 centered on the main diagonal.

5 K-difference Inexact Matching
Defn:
– Diagonals above the main diagonal are numbered 1 through m. Diagonal i starts in cell (0,i).
– Diagonals below the main diagonal are numbered -1 through -n. Diagonal -i starts in cell (i,0).
– Row 0 is initialized to all zeros. Recall that T can have free end spaces: setting row 0 to zeros lets an occurrence of P begin at any position of T at no cost.
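As a point of reference for the zero-initialized row 0, here is a minimal sketch of the plain dynamic-programming table. It is the O(nm) baseline, not the O(km) method developed on the following slides. The function name `k_differences_dp` is an illustrative choice, and end positions are reported 1-based to match the slides.

```python
def k_differences_dp(P, T, k):
    """Return 1-based end positions j in T where P matches with <= k differences."""
    n, m = len(P), len(T)
    prev = [0] * (m + 1)                 # row 0: all zeros, so a match may start anywhere in T
    for i in range(1, n + 1):
        cur = [i] + [0] * m              # column 0: i characters of P against the empty string
        for j in range(1, m + 1):
            cur[j] = min(
                prev[j] + 1,                              # vertical edge: space in T
                cur[j - 1] + 1,                           # horizontal edge: space in P
                prev[j - 1] + (P[i - 1] != T[j - 1]),     # diagonal edge: match or mismatch
            )
        prev = cur
    # Row n holds, for each column j, the fewest differences of an occurrence ending at j.
    return [j for j in range(1, m + 1) if prev[j] <= k]
```

For example, with P = "abc" and T = "xxabcxx", one difference is enough to end an occurrence at positions 4, 5, or 6, while zero differences leaves only position 5.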

6 K-difference Inexact Matching
Defn: a d-path is a path that starts in row 0 and specifies exactly d mismatches and spaces.
Defn: a d-path is farthest-reaching in diagonal i if it ends in diagonal i and the index of its ending column c is ≥ the ending column of any other d-path ending in diagonal i.
You can visualize this as the d-path that ends farthest along diagonal i.

7 K-difference Inexact Matching
Approach: iterate over d (1 ≤ d ≤ k):
– find the farthest-reaching d-path for each diagonal i (-n ≤ i ≤ m).
The farthest-reaching d-path for diagonal i is found from the farthest-reaching (d-1)-paths on diagonals i-1, i and i+1.
Observation: any d-path reaching row n corresponds to a d-difference occurrence of P in T.

8 K-difference Inexact Matching
Observation: a farthest-reaching 0-path in diagonal i is the longest match of T[i..m] and P[1..n].
Q: Why is this true?
A: A 0-path means an exact match ⇒ no deviation from the diagonal that you start on.
Using suffix trees:
– Build the suffix tree in linear time (linear in m).
– Retrieve farthest-reaching 0-paths in constant time per path.

9 K-difference Inexact Matching
Q: How do we find the farthest-reaching d-path on diagonal i for d > 0?
A: The d-path for diagonal i depends on the previously found (d-1)-paths on diagonals i-1, i and i+1. The 3 cases are:
1. Path R_1, the farthest-reaching (d-1)-path on diagonal i+1, followed by a vertical edge to diagonal i.

10 K-difference Inexact Matching
Since R_1 is a (d-1)-path on diagonal i+1, extending it by a vertical edge (adding a space in T) to diagonal i makes it a d-path on diagonal i.

11 K-difference Inexact Matching
The 2nd case is:
2. Path R_2, the farthest-reaching (d-1)-path on diagonal i-1, followed by a horizontal edge to diagonal i.
Again, this extends a (d-1)-path into a d-path on diagonal i.

12 K-difference Inexact Matching
3. Path R_3, the farthest-reaching (d-1)-path on diagonal i, followed by a diagonal edge corresponding to a mismatch.
Again, this extends a (d-1)-path into a d-path on diagonal i.

13 K-difference Inexact Matching
Each of R_1, R_2, and R_3 is initially a farthest-reaching (d-1)-path on diagonal i+1, i-1, and i, respectively.
Each is extended by a space or a mismatch, resulting in a d-path on diagonal i.
Each is subsequently extended along diagonal i as far as the characters match.
The farthest-reaching d-path on diagonal i must be one of these.

14 k-differences Algorithm
d = 0
/* Calculate farthest-reaching 0-paths on diagonals 0 through m */
For i = 0 to m {
    Find the longest common extension between P[1..n] and T[i..m]
}
/* Calculate d-paths by extending (d-1)-paths R_1, R_2, and R_3 */
For d = 1 to k {
    For i = -n to m {
        Extend the (d-1)-paths R_1, R_2, R_3 (on diagonals i+1, i-1, i) to diagonal i.
        One of these is the farthest-reaching d-path on diagonal i.
    }
    A path reaching row n defines an inexact match of P in T containing at most k differences.
    The column in row n indicates the end character in T.
}
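The algorithm above can be sketched in Python as follows. This is an illustration under several assumptions rather than the textbook's implementation: the longest common extension is computed by direct character comparison instead of a suffix tree (so the O(km) bound is not achieved), indices are 0-based internally, paths are tracked as having at most d differences, and each candidate row is clamped to the table boundary so that a path which already ran off the end of P or T can still feed its neighbours. The name `k_differences_dpaths` is hypothetical.

```python
def k_differences_dpaths(P, T, k):
    """Return 1-based end positions in T of occurrences of P with <= k differences."""
    n, m = len(P), len(T)

    def extend(row, diag):
        # Slide along the diagonal while characters match (naive longest common extension).
        while row < n and row + diag < m and P[row] == T[row + diag]:
            row += 1
        return row

    def ends_of(reach):
        # A path that reaches row n on diagonal i ends at column n + i.
        return {n + i for i, row in reach.items() if row == n and n + i >= 1}

    # d = 0: farthest-reaching 0-paths on diagonals 0..m (reach maps diagonal -> row).
    reach = {i: extend(0, i) for i in range(m + 1)}
    ends = ends_of(reach)

    for d in range(1, k + 1):
        new_reach = {}
        for i in range(-n, m + 1):
            candidates = []
            if i + 1 in reach:
                candidates.append(reach[i + 1] + 1)   # R_1: vertical edge from diagonal i+1
            if i - 1 in reach:
                candidates.append(reach[i - 1])       # R_2: horizontal edge from diagonal i-1
            if i in reach:
                candidates.append(reach[i] + 1)       # R_3: mismatch edge on diagonal i
            if candidates:
                start = min(max(candidates), n, m - i)  # clamp to the last cell of diagonal i
                new_reach[i] = extend(start, i)
        reach = new_reach
        ends |= ends_of(reach)
    return sorted(ends)
```

Storing only the previous level of `reach` is enough to report end positions; keeping all k+1 levels, as in the space analysis that follows, would be needed to recover the alignments themselves.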

15 K-difference Inexact Matching
Space analysis:
– For each d and i, we need to store the ending location of the farthest-reaching d-path.
– d ranges from 0 to k.
– There are O(n+m) diagonals.
⇒ O(km) space is required (taking n ≤ m).

16 K-difference Inexact Matching
Time analysis:
– Constant time to retrieve the 3 (d-1)-paths for a particular d and i ⇒ O(km) for this aspect (like k-differences alignment).
– Correspondingly, O(km) extensions of paths along a diagonal. Each path extension is a maximal identical substring in P and T, i.e., a longest common extension computation. Using a suffix tree, each one takes only constant time.
– Creating the suffix tree entails linear processing of the strings: O(n+m).
⇒ Altogether O(n+m+km) = O(km).

17 Primer (Probe) Selection Problem
Problem: start with two strings α and β (detailed description on pages 178-179).
Exact matching version: ∀ j > j_0, find the shortest substring γ of α starting at position j s.t. γ is not a substring of β.
– Can be solved in O(|α|+|β|) time.
– Not too bad.
Inexact matching version: given parameter p, ∀ j > j_0, find the shortest substring γ of α starting at position j that has edit distance at least |γ|/p from any substring in β.

18 Primer (Probe) Selection Problem
Inexact matching version: given parameter p, ∀ j > j_0, find the shortest substring γ of α starting at position j that has edit distance at least |γ|/p from any substring in β.
Q: How much work is this? … find the shortest prefix γ of α[j..n] with edit distance at least |γ|/p from any substring in β.
The naïve approach appears daunting. Let’s look at a less intimidating formulation!

19 Primer (Probe) Selection Problem
Change |γ|/p to k ⇒ convert the inexact matching problem to a k-differences problem.
This works out since, in practice, |γ|/p must fall in a small range for fixed p.
k-difference primer problem: given parameter k, ∀ j > j_0, find the shortest substring γ of α starting at position j that has edit distance at least k from any substring in β.

20 Primer (Probe) Selection Problem
Approach: for each position j in α, find the shortest prefix of α[j..n] with edit distance ≥ k from every substring in β.
Q: How does this compare with the k-differences inexact matching problem?
A: It is the opposite problem: find matches with at most k differences, versus reject matches of prefixes of α[j..n] with substrings of β that have fewer than k differences.

21 Primer (Probe) Selection Problem
Solution:
– Use the k-differences algorithm.
– Use α[j..n] in place of P.
– Use β in place of T.
– Compute the farthest-reaching d-path, d = k, in each diagonal.
– A d-path, d < k, reaching row n means there is no solution at j.
– Q: Why?
– A: A d-path, d < k, reaching row n indicates that α[j..n] matches a substring of β with fewer than k differences.

22 Primer (Probe) Selection Problem
Solution:
– Only if no farthest-reaching (k-1)-path reaches row n can there be a primer at position j.
– In particular, if no farthest-reaching (k-1)-path reaches row r < n, then α[j..r] is a primer, where r is the smallest row with this property.
– Repeat this approach for every potential starting position j in α.
Analysis: if |α| = n and |β| = m, then the algorithm takes time O(knm).
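To make the problem statement concrete, here is a small sketch that answers the k-difference primer question with the plain dynamic-programming table instead of farthest-reaching d-paths. It is therefore slower than the O(knm) bound quoted above and is meant only as a reference for checking small cases. The function name and the 0-based start positions are assumptions made for illustration.

```python
def k_difference_primer_lengths(alpha, beta, k):
    """For each 0-based start j in alpha, return the length of the shortest prefix of
    alpha[j:] whose edit distance to every substring of beta is >= k, or None."""
    m = len(beta)
    result = []
    for j in range(len(alpha)):
        prev = [0] * (m + 1)               # row 0: a substring of beta may start anywhere
        found = None
        for r, ch in enumerate(alpha[j:], start=1):
            cur = [r] + [0] * m
            for c in range(1, m + 1):
                cur[c] = min(prev[c] + 1,
                             cur[c - 1] + 1,
                             prev[c - 1] + (ch != beta[c - 1]))
            # min(cur) is the distance from the length-r prefix to its closest substring of beta.
            if min(cur) >= k:
                found = r
                break
            prev = cur
        result.append(found)
    return result
```

With alpha = "abxy", beta = "ab" and k = 1 this reduces to the exact matching version: the shortest substrings not occurring in beta have lengths 3, 2, 1 and 1.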

23 Exclusion Methods
Q: Can we improve on the Θ(km) time we have seen for k-mismatch and k-difference?
A: On average, yes. (Are we quibbling?)
We adopt an algorithm with fast expected running time, < Θ(km) ⇒ the worst case may not be better than Θ(km).

24 Exclusion Methods
Partition idea: exclude much of T from the search.
Preliminaries:
– Let σ = |Σ|, where Σ is the alphabet used in P and T.
– Let n = |P| and m = |T|.
Defn. An approximate occurrence of P is an occurrence with at most k mismatches or differences.
General partition algorithm: three phases
1. Partition phase
2. Search phase
3. Check phase

25 Exclusion Methods
1. Partition phase: partition either T or P into r-length regions (depends on the particular algorithm).
2. Search phase: use exact matching to search T for r-length intervals. These are potential targets for approximate occurrences of P. Eliminate as many intervals as possible.
3. Check phase: use approximate matching to check for an approximate occurrence of P around each interval that survives the search phase.

26 BYP Method
BYP method has O(m) expected running time.
Partition P into r-length regions, r = ⌊n/(k+1)⌋.
Q: How many r-length regions of P are there?
A: k+1; there may be an additional short region.
Suppose there is a match of P and T with at most k differences.
Q: What can we deduce about the corresponding r-length regions?
A: At least one r-length region must match exactly, since k differences can touch at most k of the k+1 regions.
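The partition and the pigeonhole argument can be illustrated with a few lines of Python. The helper name `byp_regions` and the sample strings are made up for this sketch.

```python
def byp_regions(P, k):
    """Split P into its first k+1 regions of length r = floor(n / (k+1)), plus any leftover."""
    r = len(P) // (k + 1)
    if r == 0:
        raise ValueError("pattern too short: need len(P) >= k + 1")
    regions = [P[i * r:(i + 1) * r] for i in range(k + 1)]
    leftover = P[(k + 1) * r:]
    return regions, leftover


regions, leftover = byp_regions("abcdefg", 2)   # r = 7 // 3 = 2

# "abXdefg" differs from the pattern in one position (k = 2 would allow two),
# so at least one of the three regions has to appear in it unchanged.
surviving = [reg for reg in regions if reg in "abXdefg"]
```

Here the regions are "ab", "cd" and "ef" with leftover "g", and the regions "ab" and "ef" survive the single substitution.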

27 BYP Method
BYP Algorithm:
1. Let 𝒫 be the set of the first k+1 substrings of P’s partitioning.
2. Build a keyword tree for the set of patterns 𝒫.
3. Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.
4. …
Oops! We haven’t talked about keyword trees or Aho-Corasick. Sooooo let’s do that now.

28 Keyword Trees (section 3.4)
Defn. The keyword tree for a set of patterns 𝒫 is a rooted directed tree K satisfying:
1. Each edge is labeled with one character.
2. Any two edges out of the same node have distinct labels.
3. Every pattern P_i in 𝒫 maps to some node v of K s.t. the path from the root to v spells out P_i.
4. Every leaf in K is mapped to by some pattern in 𝒫.

29 Keyword Trees
Example from the textbook: 𝒫 = {potato, poetry, pottery, science, school}

30 Keyword Trees (section 3.4)
Observation: there is a one-to-one correspondence between distinct prefixes of patterns in 𝒫 and nodes in K.
1. Every node corresponds to a prefix of a pattern in 𝒫.
2. Conversely, every prefix of a pattern maps to a node in K.

31 Keyword Trees (section 3.4)
If n is the total length of all patterns in 𝒫, then we can construct K in O(n) time, assuming a fixed-size alphabet Σ.
Let K_i denote the partial keyword tree that encodes patterns P_1, ..., P_i of 𝒫.

32 Keyword Trees (section 3.4)
Consider the partial keyword tree K_1:
– It consists of a single path of |P_1| edges out of root r.
– Each edge is labeled with one character of P_1.
– Reading from the root to the leaf spells out P_1.
– The leaf is labeled 1.

33 Keyword Trees (section 3.4)
Creating K_2 from K_1:
1. Find the longest path from the root of K_1 that matches a prefix of P_2.
2. This path ends by
   a) either exhausting the characters of P_2, or
   b) ending at some existing node v in K_1 where no extending match is possible.
In case 2a), label the node where the path ends with 2.
In case 2b), create a new path out of v, labeled by the remaining characters of P_2, and label its leaf with 2.

34 Keyword Trees (section 3.4)
Example: P_1 is potato.
a) P_2 is pot.
b) P_2 is potty.
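The incremental construction can be sketched with nested dictionaries. The node layout (`children`, `number`) and the helper names are illustrative choices, not taken from the textbook. Both cases from the example appear: inserting "pot" ends at an existing node (case 2a), while inserting "potty" branches off after "pot" (case 2b).

```python
def build_keyword_tree(patterns):
    """Build a keyword tree; each node is {'children': {char: node}, 'number': int or None}."""
    root = {"children": {}, "number": None}
    for number, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            # Follow the existing edge if there is one, otherwise start a new path.
            if ch not in node["children"]:
                node["children"][ch] = {"children": {}, "number": None}
            node = node["children"][ch]
        node["number"] = number          # label the node where pattern number `number` ends
    return root


def find_node(root, label):
    """Return the node whose path from the root spells `label`, or None."""
    node = root
    for ch in label:
        node = node["children"].get(ch)
        if node is None:
            return None
    return node


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node["children"].values())


tree = build_keyword_tree(["potato", "pot", "potty"])
```

Each character of each pattern is handled once, which is where the O(n) construction time for a fixed alphabet comes from.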

35 Keyword Trees (section 3.4)
Use of keyword trees for matching
Finding occurrences of patterns in 𝒫 that start at position l in T:
– Starting at the root r in K, follow the unique path that matches a substring of T starting at l.
– Numbered nodes along this path indicate matched patterns in 𝒫 that start at position l.
– This takes time proportional to min(n, m).
– Traversing K for each position l in T gives O(nm).
– This can be improved!

36 Keyword Tree Speedup
Observation: our naïve keyword tree search is like the naïve approach to string comparison ⇒ every time we increment l, we start all over at the root of K ⇒ O(nm).
Recall: KMP avoided O(nm) by shifting to get a speedup.
Q: Is there an analogous operation we can perform in K?
A: Of course, why else would I ask a rhetorical question?

37 Keyword Tree Speedup
First, we assume P_i ⊄ P_j (no pattern is a proper substring of another) for all pairs P_i, P_j in 𝒫.
Next, each node v in K is labeled with the string formed by concatenating the letters from the root to v.
Defn. Let L(v) denote the label of node v.
Defn. Let lp(v) denote the length of the longest proper suffix of string L(v) that is a prefix of some pattern in 𝒫.

38 Keyword Tree Speedup
Example: L(v) = potat, lp(v) = 2; the suffix "at" is a prefix of P_4.

39 Keyword Tree Speedup
Note: if α is the lp(v)-length suffix of L(v), then there is a unique node labeled α.
Example: "at" is the lp(v)-length suffix of L(v); w is the unique node labeled "at".

40 Keyword Tree Speedup
Defn: For a node v of K, let n_v be the unique node in K labeled with the suffix of L(v) of length lp(v). When lp(v) = 0, n_v is the root of K.
Defn: The ordered pair (v, n_v) is called a failure link.
Example: (figure not included in the transcript)

41 Aho-Corasick (section 3.4.6)
Algorithm AC search
l = 1; c = 1; w = root of K;
Repeat {
    While there is an edge (w,w´) labeled with character T(c) {
        if w´ is numbered by pattern i then report that P_i occurs in T starting at position l;
        w = w´; c = c + 1;
    }
    l = c - lp(w); w = n_w;    /* lp is taken at the node where the match failed */
} Until c > m;
Note: if the mismatch occurs at the root, increment c, set l = c, and run the repeat loop again.

42 Aho-Corasick
Example: T = hotpotattach
When l = 4 there is a match of potat (positions 4 through 8), but the next position fails. At this point c = 9.
The failure link points to the node labeled "at" and lp(v) = 2 ⇒ l = c - lp(v) = 9 - 2 = 7.
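The search loop and this example can be traced with a compact Python sketch. The transcript does not list the lecture's full pattern set, so the two patterns used here ("potato" and "attach") are hypothetical stand-ins chosen so that "at" is a prefix of some pattern. The array-based node layout and function names are also assumptions, and the sketch relies on the stated assumption that no pattern is a proper substring of another.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree as parallel lists: children[v], depth[v] = |L(v)|,
    number[v] = pattern ending at v (or None), fail[v] = n_v."""
    children, depth, number, fail = [{}], [0], [None], [0]
    for idx, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                children.append({})
                depth.append(depth[v] + 1)
                number.append(None)
                fail.append(0)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        number[v] = idx
    # Breadth-first pass: nodes one edge from the root keep fail = root.
    queue = deque(children[0].values())
    while queue:
        v = queue.popleft()
        for ch, u in children[v].items():
            w = fail[v]
            while w != 0 and ch not in children[w]:
                w = fail[w]
            fail[u] = children[w].get(ch, 0)
            queue.append(u)
    return children, depth, number, fail


def ac_search(patterns, T):
    """Return (pattern number, 1-based start position) pairs, following the slide's l, c, w."""
    children, depth, number, fail = build_automaton(patterns)
    hits = []
    l, c, w, m = 1, 1, 0, len(T)
    while c <= m:
        while c <= m and T[c - 1] in children[w]:
            w = children[w][T[c - 1]]
            c += 1
            if number[w] is not None:
                hits.append((number[w], l))
        if w == 0:
            c += 1                 # mismatch at the root: skip this character
            l = c
        else:
            w = fail[w]            # follow the failure link n_w
            l = c - depth[w]       # depth of n_w equals lp of the node that failed
    return hits
```

Running `ac_search(["potato", "attach"], "hotpotattach")` follows the trace above: the match of potat fails at c = 9, the failure link leads to the node labeled "at", l becomes 7, and the second pattern is then reported at position 7.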

43 Computing n_v in Linear Time
Note: if v is the root r or 1 character away from r, then n_v = r.
Imagine n_v has been computed for every node that is k or fewer edges from r.
How can we compute n_v for v, a node k+1 edges from r?

44 Computing n_v in Linear Time
We are looking for n_v and L(n_v).
Let v´ be the parent of v in K and x the character on the edge connecting them.
n_{v´} is known since v´ is k edges from r.
Clearly, L(n_v) must be a suffix of L(n_{v´}) followed by x.
– First check if there is an edge (n_{v´}, w´) with label x.
– If so, then n_v = w´.
– Otherwise, L(n_v) is a proper suffix of L(n_{v´}) followed by x. Examine n_{n_{v´}} for an outgoing edge labeled x. If no joy, keep following failure links, finally setting n_v = r if we run out of edges.
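The procedure can be sketched by identifying each node with its label L(v), so that an edge (w, x) exists exactly when the string w + x is also a node. This representation is chosen for brevity and is an assumption of the sketch: string slicing makes it slower than the linear-time bound, which a pointer-based tree would be needed to reach. The pattern set is again hypothetical.

```python
def failure_links(patterns):
    """Map each node label L(v) to the label of n_v (the empty string is the root)."""
    # Every prefix of every pattern is a node of the keyword tree.
    nodes = {p[:i] for p in patterns for i in range(len(p) + 1)}
    link = {}
    for v in sorted(nodes, key=len):          # process nodes by increasing distance from the root
        if len(v) <= 1:
            link[v] = ""                      # the root and its children point to the root
            continue
        parent, x = v[:-1], v[-1]
        w = link[parent]                      # start from n_{v'}
        while w + x not in nodes and w != "":
            w = link[w]                       # no edge labeled x: follow the next failure link
        link[v] = w + x if w + x in nodes else ""
    return link


links = failure_links(["potato", "attach"])
```

In this example `links["potat"]` is "at", so lp(v) = 2 for the node labeled potat, in agreement with the earlier example.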

45 BYP Method
BYP method has O(m) expected running time.
Partition P into r-length regions, r = ⌊n/(k+1)⌋.
Q: How many r-length regions of P are there?
A: k+1; there may be an additional short region.
Suppose there is a match of P and T with at most k differences.
Q: What can we deduce about the corresponding r-length regions?
A: At least one r-length region must match exactly, since k differences can touch at most k of the k+1 regions.

46 BYP Method
BYP Algorithm:
1. Let 𝒫 be the set of the first k+1 substrings of P’s partitioning.
2. Build a keyword tree for the set of patterns 𝒫.
3. Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.
4. For each i ∈ I, use approximate matching to locate end points of approximate occurrences of P in T[i-n-k..i+n+k].
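The four steps can be put together in a short sketch. Two substitutions are made to keep it self-contained, and both are assumptions of this example rather than part of the BYP method: Python's `str.find` stands in for the keyword tree and Aho-Corasick search of steps 2 and 3, and the check phase uses the plain O(nm) dynamic-programming table on each window. The function names are hypothetical, and the sketch does not attempt to achieve the O(m) expected running time.

```python
def approx_ends(P, T, k):
    """1-based end positions in T of occurrences of P with <= k differences (plain DP)."""
    n, m = len(P), len(T)
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (P[i - 1] != T[j - 1]))
        prev = cur
    return [j for j in range(1, m + 1) if prev[j] <= k]


def byp_search(P, T, k):
    """Partition, exact search, then approximate check around each exact hit."""
    n, m = len(P), len(T)
    r = n // (k + 1)
    if r == 0:
        raise ValueError("pattern too short: need len(P) >= k + 1")
    regions = {P[i * r:(i + 1) * r] for i in range(k + 1)}       # step 1

    starts = set()                                               # steps 2-3 (str.find stand-in)
    for region in regions:
        pos = T.find(region)
        while pos != -1:
            starts.add(pos)
            pos = T.find(region, pos + 1)

    ends = set()                                                 # step 4: check phase
    for i in starts:
        lo, hi = max(0, i - n - k), min(m, i + n + k)
        for e in approx_ends(P, T[lo:hi], k):
            ends.add(lo + e)                                     # map back to positions in T
    return sorted(ends)
```

On inputs where a region occurs rarely in T, only a few short windows are checked; when regions occur everywhere, the sketch degrades to running the table many times, which mirrors the remark that the worst case is no better than before.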

