CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms.

CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms

Definitions Text: a longer string T Pattern: a shorter string P Exact matching: find all occurrence of P in T T P length m length n

The naïve algorithm

Time complexity Worst case: O(mn) Best case: O(m) –aaaaaaaaaaaaaa vs baaaaaaa Average case? –Alphabet A, C, G, T –Assume both P and T are random –Equal probability –How many chars do you need to compare before moving to the next position?

Average case time complexity P(mismatch at 1 st position): ¾ P(mismatch at 2 nd position): ¼ * ¾ P(mismatch at 3 nd position): (¼) 2 * ¾ P(mismatch at k th position): (¼) k-1 * ¾ Expected number of comparison per position: p = 1/4  k (1-p) p (k-1) k = (1-p) / p *  k p k k = 1/(1-p) = 4/3 Average complexity: 4m/3 Not as bad as you thought it might be

Biological sequences are not random T: aaaaaaaaaaaaaaaaaaaaaaaaa P: aaaab Plus: 4m/3 average case is still bad for long genomic sequences! Especially if P is not in T… Smarter algorithms: O(m + n) in worst case sub-linear in practice

String matching scenarios One T and one P –Search a word in a document One T and many P all at once –Search a set of words in a document –Spell checking One fixed T, many P –Search a completed genome for a short sequence Two (or many) T’s for common patterns

How to speedup? Pre-processing T or P Why pre-processing can save us time? –Uncovers the structure of T or P –Determines when we can skip ahead without missing anything –Determines when we can infer the result of character comparisons without doing them. ACGTAXACXTAXACGXAX ACGTACA

Cost for exact string matching Total cost = cost (preprocessing) + cost(comparison) + cost(output) Constant Minimize Overhead Hope: gain > overhead

Which string to preprocess? One T and one P –Preprocessing P? One T and many P all at once –Preprocessing P or T? One fixed T, many P (unknown) –Preprocessing T? Two (or many) T’s for common patterns –???

Pattern pre-processing algs –Karp – Rabin algorithm Small alphabet and small pattern –Boyer – Moore algorithm the choice of most cases Typically sub-linear time –Knuth-Morris-Pratt algorithm (KMP) grep –Aho-Corasick algorithm fgrep

Karp – Rabin Algorithm Let’s say we are dealing with binary numbers Text: 01010001011001010101001 Pattern: 101100 Convert pattern to integer 101100 = 2^5 + 2^3 + 2^2 = 44

Karp – Rabin algorithm Text: 01010001011001010101001 Pattern: 101100 = 44 decimal 10111011001010101001 = 2^5 + 2^3 + 2^2 + 2^1 = 46 10111011001010101001 = 46 * 2 – 64 + 1 = 29 10111011001010101001 = 29 * 2 - 0 + 1 = 59 10111011001010101001 = 59 * 2 - 64 + 0 = 54 10111011001010101001 = 54 * 2 - 64 + 0 = 44

Karp – Rabin algorithm What if the pattern is too long to fit into a single integer? Pattern: 101100. But our machine only has 5 bits Basic idea: hashing. 44 % 13 = 5 10111011001010101001 = 46 (% 13 = 7) 10111011001010101001 = 46 * 2 – 64 + 1 = 29 (% 13 = 3) 10111011001010101001 = 29 * 2 - 0 + 1 = 59 (% 13 = 7) 10111011001010101001 = 59 * 2 - 64 + 0 = 54 (% 13 = 2) 10111011001010101001 = 54 * 2 - 64 + 0 = 44 (% 13 = 5)

Boyer – Moore algorithm Three ideas: –Right-to-left comparison –Bad character rule –Good suffix rule

Boyer – Moore algorithm Right to left comparison x y y Skip some chars without missing any occurrence. But how?

Bad character rule 0 1 12345678901234567 T:xpbctbxabpqqaabpq P: tpabxab *^^^^ What would you do now?

Bad character rule 0 1 12345678901234567 T:xpbctbxabpqqaabpq P: tpabxab *^^^^ P: tpabxab

Bad character rule 0 1 123456789012345678 T:xpbctbxabpqqaabpqz P: tpabxab *^^^^ P: tpabxab * P: tpabxab

Basic bad character rule charRight-most-position in P a6 b7 p2 t1 x5 tpabxab Pre-processing: O(n)

Basic bad character rule charRight-most-position in P a6 b7 p2 t1 x5 T: xpbctbxabpqqaabpqz P: tpabxab *^^^^ P: tpabxab When rightmost T(k) in P is left to i, shift pattern P to align T(k) with the rightmost T(k) in P k i = 3 Shift 3 – 1 = 2

Basic bad character rule charRight-most-position in P a6 b7 p2 t1 x5 T: xpbctbxabpqqaabpqz P: tpabxab * P: tpabxab When T(k) is not in P, shift left end of P to align with T(k+1) k i = 7Shift 7 – 0 = 7

Basic bad character rule charRight-most-position in P a6 b7 p2 t1 x5 T: xpbctbxabpqqaabpqz P: tpabxab *^^ P: tpabxab When rightmost T(k) in P is right to i, shift pattern P one pos k i = 55 – 6 < 0. so shift 1

Extended bad character rule charPosition in P a6, 3 b7, 4 p2 t1 x5 T: xpbctbxabpqqaabpqz P: tpabxab *^^ P: tpabxab Find T(k) in P that is immediately left to i, shift P to align T(k) with that position k i = 55 – 3 = 2. so shift 2 Preprocessing still O(n)

Extended bad character rule Best possible: m / n comparisons Works better for large alphabet size In some cases the extended bad character rule is sufficiently good Worst-case: O(mn)

0 1 123456789012345678 T:prstabstubabvqxrst P: qcabdabdab *^^ P: qcabdabdab According to extended bad character rule

(weak) good suffix rule 0 1 123456789012345678 T:prstabstubabvqxrst P: qcabdabdab *^^ P: qcabdabdab

(Weak) good suffix rule t x t y t’ t y In preprocessing: For any suffix t of P, find the rightmost copy of t, t’, t ≠ t’ T P P

(Strong) good suffix rule 0 1 123456789012345678 T:prstabstubabvqxrst P: qcabdabdab *^^

(Strong) good suffix rule 0 1 123456789012345678 T:prstabstubabvqxrst P: qcabdabdab *^^ P: qcabdabdab

(Strong) good suffix rule Pre-processing can be done in linear time If P in T, may take O(mn) If P not in T, worst-case O(m+n) t x t y t’ t y In preprocessing: For any suffix t of P, find the rightmost copy of t, t’, t ≠ t’, and the char left to t ≠ the char left to t’ T P P z z

Lessons From B-M Sub-linear time is possible –But we still need to read T from disk! Bad cases require periodicity in P or T –matching random P with T is easy! Large alphabets mean large shifts Small alphabets make complicated shift data-structures possible B-M better for “english” and amino-acids than for DNA.

Algorithm KMP Not the fastest Best known Good for multiple pattern matching and real-time matching Idea –Left-to-right comparison –Shift P more chars when possible

Basic idea t t’ P t x T y t P y z z In pre-processing: for any position i in P, find the longest proper suffix of P, t = P[j+1..i], such that t matches to a prefix of P, t’, and the next char of t is different from the next char of t’, i.e., P[i+1] != P[i-j+1]. Sp’(i) = length(t)

Example P: aataac aataac Sp’(i) 010020 aaataaat aat aac

Failure link P: aataac aataac Sp’(i) 010020 aaataaat aat aac If a char in T fails to match at pos 6, re-compare it with the char at pos 3

FSA P: aataac 123450 a ataac 6 a t All other input goes to state 0 Sp’(i) 010020 aaataaat aat aac If the next char in T is t, we go to state 3

Another example P: abababc abababc Sp’(i) 0000040 abababab abab ababa ababc

Failure link P: abababc abababc Sp’(i) 0000040 ababa ababc If a char in T fails to match at pos 7, re-compare it with the char at pos 5

FSA P: abababc 123456 Sp’(i) 0000040 ababa ababc If the next char in T is a, go to state 5 0 a b a bac 7 b a All other input goes to state 0

Difference between Failure Link and FSA? Failure link –Preprocessing time and space are O(n), regardless of alphabet size –Comparison time is at most 2m FSA –Preprocessing time and space are O(n |  |) May be a problem for very large alphabet size –Comparison time is always m.

Failure link P: aataac aataac Sp’(i) 010020 aaataaat aat aac If a char in T fails to match at pos 6, re-compare it with the char at pos 3

Example aataac aataac ^^* T: aacaataaaaataaccttacta aataac.* aataac ^^^^^* aataac..* aataac.^^^^^ Each char in T may be compared multiple times. Up to n. Time complexity: O(2m). Comparison phase and shift phase. Comparison is bounded by m, shift is also bounded by m.

Example T: aacaataaaaataaccttacta Each char in T will be examined exactly once. Therefore, exact m comparisons are needed. Takes longer to do pre-processing. 123450 a ataac 6 a t 1201234501234560001001

How to do pre-processing?

CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms.

Similar presentations

Presentation on theme: "CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms.

Similar presentations

Presentation on theme: "CS5263 Bioinformatics Lecture 15 & 16 Exact String Matching Algorithms."— Presentation transcript:

Similar presentations

About project

Feedback