CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms.


1 CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

2 Definitions
Text: a longer string T, of length m
Pattern: a shorter string P, of length n
Exact matching: find all occurrences of P in T

3 The naïve algorithm
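The slide body for the naïve algorithm did not survive in the transcript. A minimal Python sketch of the usual formulation follows: try every start position in T and compare left to right until a mismatch. The function name `naive_search` and the 0-based return indices are illustrative choices, not taken from the lecture.

```python
def naive_search(text: str, pattern: str) -> list:
    """Return every 0-based start index at which pattern occurs in text."""
    m, n = len(text), len(pattern)  # slide notation: |T| = m, |P| = n
    hits = []
    for start in range(m - n + 1):
        k = 0
        # Compare left to right and stop at the first mismatch.
        while k < n and text[start + k] == pattern[k]:
            k += 1
        if k == n:
            hits.append(start)
    return hits
```

The inner loop is what makes the worst case O(mn): for T = aaaa...a and P = aaa...ab it runs almost to the end of P at every start position.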

4 Time complexity
Worst case: O(mn)
Best case: O(m) –e.g., aaaaaaaaaaaaaa vs baaaaaaa
Average case?
–Alphabet A, C, G, T
–Assume both P and T are random
–Equal probability for each character
–How many chars do you need to compare before moving to the next position?

5 Average case time complexity
P(mismatch at 1st position): ¾
P(mismatch at 2nd position): ¼ · ¾
P(mismatch at 3rd position): (¼)^2 · ¾
P(mismatch at kth position): (¼)^(k-1) · ¾
Expected number of comparisons per position, with p = 1/4:
$$\sum_{k \ge 1} k\,(1-p)\,p^{k-1} = \frac{1-p}{p} \sum_{k \ge 1} k\,p^{k} = \frac{1-p}{p} \cdot \frac{p}{(1-p)^2} = \frac{1}{1-p} = \frac{4}{3}$$
Average complexity: 4m/3
Not as bad as you thought it might be
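As a quick numerical sanity check of the 4/3 figure, the series can be summed directly. This is only a sketch: the function name and the truncation at 60 terms are arbitrary choices, and, like the slide, it ignores the fact that a real comparison run stops after at most n characters.

```python
def expected_comparisons(p_match: float, terms: int = 60) -> float:
    """Partial sum of k * P(first mismatch at position k) for k = 1..terms."""
    return sum(k * (1 - p_match) * p_match ** (k - 1)
               for k in range(1, terms + 1))


# For a uniform 4-letter alphabet, two random characters match with p = 1/4.
dna_average = expected_comparisons(0.25)
```

With p = 1/4 the partial sum agrees with 1/(1-p) = 4/3 to within floating-point precision.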

6 Biological sequences are not random
T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaab
Plus: the 4m/3 average case is still bad for long genomic sequences, especially if P is not in T.
Smarter algorithms: O(m + n) in the worst case, sub-linear in practice

7 String matching scenarios One T and one P –Search a word in a document One T and many P all at once –Search a set of words in a document –Spell checking One fixed T, many P –Search a completed genome for a short sequence Two (or many) T’s for common patterns

8 How to speed up?
Pre-process T or P
Why can pre-processing save us time?
–It uncovers the structure of T or P
–It determines when we can skip ahead without missing anything
–It determines when we can infer the result of character comparisons without doing them
Example: T = ACGTAXACXTAXACGXAX, P = ACGTACA

9 Cost for exact string matching
Total cost = cost(preprocessing) + cost(comparison) + cost(output)
–cost(preprocessing): overhead
–cost(comparison): minimize
–cost(output): constant
Hope: gain > overhead

10 Which string to preprocess? One T and one P –Preprocessing P? One T and many P all at once –Preprocessing P or T? One fixed T, many P (unknown) –Preprocessing T? Two (or many) T’s for common patterns –???

11 Pattern pre-processing algs
–Karp – Rabin algorithm: small alphabet and small pattern
–Boyer – Moore algorithm: the choice in most cases; typically sub-linear time
–Knuth-Morris-Pratt algorithm (KMP): grep
–Aho-Corasick algorithm: fgrep

12 Karp – Rabin Algorithm Let’s say we are dealing with binary numbers Text: 01010001011001010101001 Pattern: 101100 Convert pattern to integer 101100 = 2^5 + 2^3 + 2^2 = 44

13 Karp – Rabin algorithm
Pattern: 101100 = 44 decimal
Slide a 6-bit window over the text 10111011001010101001 (brackets mark the current window). Each new value is the old value * 2, minus 64 if the bit leaving the window is 1, plus the incoming bit:
[101110]11001010101001 = 2^5 + 2^3 + 2^2 + 2^1 = 46
1[011101]1001010101001 = 46 * 2 – 64 + 1 = 29
10[111011]001010101001 = 29 * 2 – 0 + 1 = 59
101[110110]01010101001 = 59 * 2 – 64 + 0 = 54
1011[101100]1010101001 = 54 * 2 – 64 + 0 = 44 (equals the pattern value: match)

14 Karp – Rabin algorithm
What if the pattern is too long to fit into a single integer?
Pattern: 101100. But our machine only has 5 bits
Basic idea: hashing. 44 % 13 = 5
[101110]11001010101001 = 46 (% 13 = 7)
1[011101]1001010101001 = 46 * 2 – 64 + 1 = 29 (% 13 = 3)
10[111011]001010101001 = 29 * 2 – 0 + 1 = 59 (% 13 = 7)
101[110110]01010101001 = 59 * 2 – 64 + 0 = 54 (% 13 = 2)
1011[101100]1010101001 = 54 * 2 – 64 + 0 = 44 (% 13 = 5, same hash as the pattern)
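The rolling update on these two slides can be sketched in Python as below. The sketch assumes a binary text, keeps every intermediate value reduced modulo q (13, as on the slide), and verifies each hash hit with a direct string comparison, since different windows can share a hash (46 and 59 both give 7 above). The function name and 0-based result indices are illustrative.

```python
def karp_rabin_binary(text: str, pattern: str, q: int = 13) -> list:
    """Return 0-based start indices of pattern in a binary text."""
    m, n = len(text), len(pattern)
    if n == 0 or n > m:
        return []
    high = pow(2, n - 1, q)  # weight of the window's leading bit, mod q
    hp = ht = 0
    for i in range(n):  # hash of the pattern and of the first window
        hp = (2 * hp + int(pattern[i])) % q
        ht = (2 * ht + int(text[i])) % q
    hits = []
    for s in range(m - n + 1):
        # Equal hashes only suggest a match; confirm it character by character.
        if hp == ht and text[s:s + n] == pattern:
            hits.append(s)
        if s + n < m:
            # Drop the leading bit, shift left by one, append the next bit.
            ht = (2 * (ht - int(text[s]) * high) + int(text[s + n])) % q
    return hits
```

On the text used in the worked example, the only hit is at 0-based index 4, the fifth window.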

15 Boyer – Moore algorithm Three ideas: –Right-to-left comparison –Bad character rule –Good suffix rule

16 Boyer – Moore algorithm Right to left comparison x y y Skip some chars without missing any occurrence. But how?

17 Bad character rule
   0        1
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^
What would you do now?

18 Bad character rule
   0        1
   12345678901234567
T: xpbctbxabpqqaabpq
P:   tpabxab
       *^^^^
P:     tpabxab

19 Bad character rule
   0        1
   123456789012345678
T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
P:     tpabxab
             *
P:            tpabxab

20 Basic bad character rule
P: tpabxab
char   Right-most position in P
a      6
b      7
p      2
t      1
x      5
Pre-processing: O(n)

21 Basic bad character rule
       k
T: xpbctbxabpqqaabpqz
P:   tpabxab
       *^^^^
P:     tpabxab
When the rightmost T(k) in P is to the left of i, shift P to align T(k) with the rightmost T(k) in P.
i = 3, T(k) = t. Shift 3 – 1 = 2

22 Basic bad character rule
             k
T: xpbctbxabpqqaabpqz
P:     tpabxab
             *
P:            tpabxab
When T(k) is not in P, shift the left end of P to align with T(k+1).
i = 7, T(k) = q. Shift 7 – 0 = 7

23 Basic bad character rule
               k
T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:          tpabxab
When the rightmost T(k) in P is to the right of i, shift P by one position.
i = 5, T(k) = a. 5 – 6 < 0, so shift 1
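The three cases of the basic rule collapse into one expression, max(1, i - R(x)), where R(x) is the rightmost position of the text character x in P, or 0 if x does not occur in P. A minimal Python sketch, using 1-based positions as on the slides (function names are illustrative):

```python
def rightmost_positions(pattern: str) -> dict:
    """Map each character of pattern to its rightmost 1-based position."""
    # Later occurrences overwrite earlier ones, so the rightmost one wins.
    return {ch: i for i, ch in enumerate(pattern, start=1)}


def bad_char_shift(table: dict, i: int, text_char: str) -> int:
    """Shift when P[i] (1-based) mismatches text_char."""
    # Characters absent from P count as position 0; never shift by less than 1.
    return max(1, i - table.get(text_char, 0))


table = rightmost_positions("tpabxab")
```

With this table, the three slide cases give `bad_char_shift(table, 3, "t") == 2`, `bad_char_shift(table, 7, "q") == 7`, and `bad_char_shift(table, 5, "a") == 1`.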

24 Extended bad character rule
P: tpabxab
char   Positions in P (right to left)
a      6, 3
b      7, 4
p      2
t      1
x      5
               k
T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:           tpabxab
Find the occurrence of T(k) in P that is immediately to the left of i, and shift P to align T(k) with that position.
i = 5, T(k) = a. 5 – 3 = 2, so shift 2
Preprocessing is still O(n)
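A sketch of the extended rule in Python: store, for each character, all of its positions in P from right to left, then scan that list for the first position left of the mismatch. Names are illustrative and positions are 1-based to match the slide.

```python
def position_lists(pattern: str) -> dict:
    """Map each character to its 1-based positions in pattern, rightmost first."""
    positions = {}
    for i, ch in enumerate(pattern, start=1):
        positions.setdefault(ch, []).append(i)
    for lst in positions.values():
        lst.reverse()
    return positions


def extended_bad_char_shift(positions: dict, i: int, text_char: str) -> int:
    """Shift when P[i] (1-based) mismatches text_char."""
    for pos in positions.get(text_char, ()):
        if pos < i:  # closest occurrence to the left of the mismatch
            return i - pos
    return i  # no occurrence left of i: move P entirely past T(k)


positions = position_lists("tpabxab")
```

Here `extended_bad_char_shift(positions, 5, "a")` returns 2, where the basic rule could only shift by 1.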

25 Extended bad character rule Best possible: m / n comparisons Works better for large alphabet size In some cases the extended bad character rule is sufficiently good Worst-case: O(mn)

26 According to the extended bad character rule
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:    qcabdabdab

27 (Weak) good suffix rule
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:      qcabdabdab

28 (Weak) good suffix rule
In preprocessing: for any suffix t of P, find the rightmost copy t’ of t in P that is a different occurrence from the suffix itself (t ≠ t’).
[Figure: T with mismatching character x followed by the matched substring t; P with character y, suffix t and the earlier copy t’, shown before and after the shift]

29 (Strong) good suffix rule
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^

30 (Strong) good suffix rule
   0        1
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
            *^^
P:         qcabdabdab

31 (Strong) good suffix rule
In preprocessing: for any suffix t of P, find the rightmost copy t’ of t in P such that t ≠ t’ and the char to the left of t ≠ the char to the left of t’.
[Figure: as for the weak rule, with character y to the left of t and character z to the left of t’]
Pre-processing can be done in linear time
If P is in T, searching may take O(mn)
If P is not in T, the worst case is O(m + n)
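To make the difference between the weak and strong rules concrete, here is a brute-force Python sketch that computes the shift for one mismatch. It is deliberately not the linear-time preprocessing the slide mentions. It also includes the standard fallback that the slides do not show: if no suitable copy of t exists, align the longest prefix of P that matches a suffix of t, or shift by the whole pattern length. The function name and the `matched` parameter are illustrative.

```python
def good_suffix_shift(pattern: str, matched: int, strong: bool = True) -> int:
    """Shift after the last `matched` chars of pattern matched and the
    character just left of them mismatched (requires 0 < matched < len)."""
    n = len(pattern)
    if not 0 < matched < n:
        raise ValueError("matched must satisfy 0 < matched < len(pattern)")
    t_start = n - matched          # 0-based start of the matched suffix t
    t = pattern[t_start:]
    bad = pattern[t_start - 1]     # pattern char that just mismatched
    # Case 1: rightmost earlier copy t' of t (it may overlap t).
    for start in range(t_start - 1, -1, -1):
        if pattern[start:start + matched] != t:
            continue
        # Strong rule: the char left of t' must differ from the char left of t.
        if strong and start > 0 and pattern[start - 1] == bad:
            continue
        return t_start - start
    # Case 2 (fallback): longest prefix of P that is also a suffix of t.
    for length in range(matched - 1, 0, -1):
        if pattern[:length] == pattern[n - length:]:
            return n - length
    return n
```

For P = qcabdabdab with the suffix ab matched, the weak rule gives a shift of 3 and the strong rule a shift of 6, in line with slides 27 and 30.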

32 Lessons from B-M
Sub-linear time is possible
–But we still need to read T from disk!
Bad cases require periodicity in P or T
–Matching random P with T is easy!
Large alphabets mean large shifts
Small alphabets make complicated shift data structures possible
B-M is better for English text and amino acids than for DNA.

33 Algorithm KMP Not the fastest Best known Good for multiple pattern matching and real-time matching Idea –Left-to-right comparison –Shift P more chars when possible

34 Basic idea
In pre-processing: for any position i in P, find the longest proper suffix of P[1..i], t = P[j+1..i], such that t matches a prefix of P, t’, and the char after t is different from the char after t’, i.e., P[i+1] != P[i-j+1].
Sp’(i) = length(t)
[Figure: P aligned under T with t matched and a mismatch of x in T against y in P; after the shift, the prefix t’ lies under t, followed by z ≠ y]

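35 Example
P:       a a t a a c
i:       1 2 3 4 5 6
Sp’(i):  0 1 0 0 2 0
Sp’(5) = 2: the suffix aa of aataa matches the prefix aa, and the chars that follow differ (c after the suffix, t after the prefix).
Sp’(4) = 0: the suffix a of aata matches the prefix a, but both are followed by a, so it does not qualify.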
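One way to compute Sp’ is to compute the ordinary prefix-function values first and then apply the "next chars differ" test, falling back to the value already computed for the shorter border when the test fails. The Python sketch below uses 0-based lists, so `result[i - 1]` holds Sp’(i); the function names are illustrative, and the lecture's own preprocessing method (slide 45) is not included in this transcript.

```python
def prefix_function(pattern: str) -> list:
    """sp[i] = length of the longest proper suffix of pattern[:i + 1]
    that is also a prefix of pattern."""
    n = len(pattern)
    sp = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and pattern[i] != pattern[k]:
            k = sp[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        sp[i] = k
    return sp


def strong_prefix_function(pattern: str) -> list:
    """Like prefix_function, but the char after the suffix must differ
    from the char after the matching prefix (the Sp' values)."""
    n = len(pattern)
    sp = prefix_function(pattern)
    strong = [0] * n
    for i, k in enumerate(sp):
        if k == 0 or i == n - 1 or pattern[k] != pattern[i + 1]:
            strong[i] = k
        else:
            # Same next char: reuse the answer for the shorter prefix.
            strong[i] = strong[k - 1]
    return strong
```

For P = aataac this yields [0, 1, 0, 0, 2, 0], the Sp’ row above.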

36 Failure link
P:       a a t a a c
i:       1 2 3 4 5 6
Sp’(i):  0 1 0 0 2 0
If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= Sp’(5) + 1)

37 FSA
P:       a a t a a c
Sp’(i):  0 1 0 0 2 0
[Figure: states 0 to 6 with forward transitions a, a, t, a, a, c, plus extra transitions labeled a and t]
All other input goes to state 0
For example, in state 5, if the next char in T is t, we go to state 3

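38 Another example
P:       a b a b a b c
i:       1 2 3 4 5 6 7
Sp’(i):  0 0 0 0 0 4 0
Sp’(6) = 4: the suffix abab of ababab matches the prefix abab, and the chars that follow differ (c after the suffix, a after the prefix).

39 Failure link
P:       a b a b a b c
Sp’(i):  0 0 0 0 0 4 0
If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= Sp’(6) + 1)

40 FSA
P:       a b a b a b c
Sp’(i):  0 0 0 0 0 4 0
[Figure: states 0 to 7 with forward transitions a, b, a, b, a, b, c, plus extra transitions labeled a and b]
All other input goes to state 0
For example, in state 6, if the next char in T is a, go to state 5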

41 Difference between Failure Link and FSA?
Failure link
–Preprocessing time and space are O(n), regardless of alphabet size
–Comparison time is at most 2m
FSA
–Preprocessing time and space are O(n|Σ|); may be a problem for a very large alphabet
–Comparison time is always m

42 Failure link
P:       a a t a a c
i:       1 2 3 4 5 6
Sp’(i):  0 1 0 0 2 0
If a char in T fails to match at pos 6, re-compare it with the char at pos 3

43 Example
T: aacaataaaaataaccttacta
P: aataac
[Figure: successive alignments of P under T, annotated with ^, * and . markers]
Each char in T may be compared multiple times, up to n times.
Time complexity: O(2m), covering the comparison phase and the shift phase. The comparison phase is bounded by m, and the shift phase is also bounded by m.
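The failure-link search can be sketched in Python as follows. The block recomputes the Sp’ table so that it stands alone, and it also counts character comparisons so the 2m bound can be observed. Function names, the returned tuple, and 0-based hit indices are illustrative choices.

```python
def failure_links(pattern: str) -> list:
    """Sp' values as a 0-based list (entry i - 1 holds Sp'(i))."""
    n = len(pattern)
    sp = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and pattern[i] != pattern[k]:
            k = sp[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        sp[i] = k
    strong = [0] * n
    for i, k in enumerate(sp):
        if k == 0 or i == n - 1 or pattern[k] != pattern[i + 1]:
            strong[i] = k
        else:
            strong[i] = strong[k - 1]
    return strong


def kmp_search(text: str, pattern: str):
    """Return (0-based match starts, number of character comparisons)."""
    n = len(pattern)
    if n == 0:
        return [], 0
    fail = failure_links(pattern)
    hits, comparisons = [], 0
    j = 0  # number of pattern chars currently matched
    for pos, ch in enumerate(text):
        while True:
            comparisons += 1
            if ch == pattern[j]:
                j += 1
                break
            if j == 0:
                break
            j = fail[j - 1]  # follow the failure link, re-compare same ch
        if j == n:
            hits.append(pos - n + 1)
            j = fail[n - 1]
    return hits, comparisons
```

On the slide's T and P = aataac, the search reports one occurrence, starting at 0-based index 9 (position 10 in the slide's 1-based numbering), and the comparison count stays below 2m.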

44 Example
T:      aacaataaaaataaccttacta
State:  1201234522234560001001
Each char in T is examined exactly once, so exactly m comparisons are needed.
Pre-processing takes longer.
[Figure: FSA for P = aataac, states 0 to 6]
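A sketch of the FSA approach in Python: build a full transition table over a given alphabet, then read T once, one table lookup per character. The construction below is the standard one that copies each state's transitions from a "restart" state; it is an assumption that this corresponds to the lecture's preprocessing, which is not included in this transcript. Names are illustrative, and the pattern is assumed non-empty.

```python
def build_dfa(pattern: str, alphabet: str) -> list:
    """dfa[state][char] -> next state; reaching state len(pattern) is a match."""
    n = len(pattern)
    dfa = [dict.fromkeys(alphabet, 0) for _ in range(n + 1)]
    dfa[0][pattern[0]] = 1
    x = 0  # state reached after reading pattern[1:s]
    for s in range(1, n + 1):
        for c in alphabet:
            dfa[s][c] = dfa[x][c]          # on mismatch, behave like state x
        if s < n:
            dfa[s][pattern[s]] = s + 1     # on match, advance
            x = dfa[x][pattern[s]]
    return dfa


def run_dfa(dfa: list, text: str) -> list:
    """Return the state reached after each character of text."""
    state, states = 0, []
    for ch in text:
        state = dfa[state].get(ch, 0)  # chars outside the alphabet reset to 0
        states.append(state)
    return states


dfa = build_dfa("aataac", "acgt")
states = run_dfa(dfa, "aacaataaaaataaccttacta")
match_ends = [i for i, s in enumerate(states) if s == len("aataac")]
```

The table has (n + 1) rows of |Σ| entries each, which is the O(n|Σ|) preprocessing cost from the comparison slide; the scan itself does exactly one lookup per text character.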

45 How to do pre-processing?

