# MSc Bioinformatics for H15: Algorithms on strings and sequences

Master Course, 14/04/2017. MSc Bioinformatics for Health Sciences. H15: Algorithms on strings and sequences. Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen), Dep. de Llenguatges i Sistemes Informàtics, CEPBA-IBM Research Institute, Universitat Politècnica de Catalunya.

Contents
1. (Exact) String matching of one pattern
2. (Exact) String matching of many patterns
3. Approximate string matching (Dynamic programming)
4. Pairwise and multiple alignment
5. Suffix trees

Bibliography
Flexible pattern matching in strings, G. Navarro and M. Raffinot, Cambridge University Press, 2002.
Algorithms on strings, trees and sequences, D. Gusfield, Cambridge University Press, 1997.

String matching
Definition: given a long text T and a set of k patterns p1, p2, …, pk, the string matching problem is to find all the occurrences of all the patterns in the text T.
On-line algorithms: the patterns are known in advance. Off-line algorithms: the text is known in advance.
Cases covered: only one pattern (exact and approximate); five, ten, a hundred, a thousand, ... patterns (exact); extended patterns; suffix trees.

(Exact) string matching of one pattern
First lecture, first part: (Exact) string matching of one pattern

String matching: one pattern
How do string algorithms perform the search? For instance, given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC, search for the pattern ACTGA and for the pattern TACTACGGTATGACTAA. As you have seen this morning ....

String Matching: Brute force algorithm
Example: given the pattern ATGTA and the text G T A C T A G A G G A C G T A T G T A C T G ..., the pattern is compared at six successive window positions, each one cell to the right of the previous one.

String Matching: Brute force algorithm
Open the C code of the Brute Force algorithm. What is the meaning of the variables? y: array with the text T; n: length of the text; x: array with the pattern P; m: length of the pattern.
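
The brute force search can be sketched in a few lines. This is a minimal Python sketch, not the C code referred to on the slide; it only reuses the variable names y, n, x and m from the slide, and the function name `brute_force` is an assumption made for illustration.

```python
def brute_force(x, y):
    """Return every 0-based position where the pattern x occurs in the text y."""
    m, n = len(x), len(y)
    occurrences = []
    for j in range(n - m + 1):                 # the window is shifted one cell at a time
        i = 0
        while i < m and x[i] == y[j + i]:      # comparison from left to right
            i += 1
        if i == m:                             # the whole pattern matched
            occurrences.append(j)
    return occurrences


text = "GTACTAGAGGACGTATGTACTG"
print(brute_force("ATGTA", text))
```

Each of the n-m+1 windows may need up to m comparisons, which gives the O(nm) cost.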

String Matching of one pattern
The cost of the Brute Force algorithm is O(nm). Can the search be made with lower cost? Given the text CTACTACTACGTCTATACTGATCGTAGCTACTACATGC and the pattern TACTACGGTATGACTAA, there are three approaches: prefix search, suffix search and factor search.

String matching of one pattern
How do string algorithms perform the search? There is a sliding window along the text against which the pattern is compared. At each step the comparison is made and the window is shifted to the right. What differentiates the algorithms? How the comparison is made, and the length of the shift.

String Matching: Brute force algorithm
How is the comparison made? From left to right: prefix search. Which is the next position of the window? The window is shifted only one cell. The cost is O(mn).

String Matching: one pattern
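Most efficient algorithms (Navarro & Raffinot), among Horspool, BNDM (Backward Nondeterministic Dawg Matching) and BOM (Backward Oracle Matching). (Figure: map of the most efficient algorithm as a function of the alphabet size |Σ|, with ticks 2, 4, 8, 16, 32, 64, and the length of the pattern, with w marked on the axis.)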

String Matching: Horspool algorithm
How is the comparison made? From right to left: suffix search. Which is the next position of the window? It depends on where the last letter of the text window, say 'a', appears in the pattern. A preprocessing step is therefore needed to determine the length of the shift.

String Matching: Horspool algorithm
Example: given the pattern ATGTA, the shift table is A 4, C 5, G 2, T 1. The search then moves the window over the text G T A C T A G A G G A C G T A T G T A C T G ..., shifting it each time by the table entry of the last letter of the window.
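
The shift table and the search of this example can be reproduced with a short Python sketch. The function names `horspool_shift_table` and `horspool` are assumptions made for illustration, and the list of visited window positions is returned only so that the example can be followed step by step.

```python
def horspool_shift_table(p):
    """Shift of each letter: distance from its rightmost occurrence
    (the last cell excluded) to the end of the pattern."""
    m = len(p)
    # Later (more to the right) occurrences overwrite earlier ones.
    return {c: m - 1 - i for i, c in enumerate(p[:-1])}


def horspool(p, t):
    """Return (occurrences, visited window positions), both 0-based."""
    m, n = len(p), len(t)
    shift = horspool_shift_table(p)
    occurrences, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        j = m - 1
        while j >= 0 and t[pos + j] == p[j]:   # comparison from right to left
            j -= 1
        if j < 0:
            occurrences.append(pos)
        # Letters that do not appear in the table shift the whole length m.
        pos += shift.get(t[pos + m - 1], m)
    return occurrences, windows


print(horspool_shift_table("ATGTA"))
print(horspool("ATGTA", "GTACTAGAGGACGTATGTACTG"))
```

The letter C does not occur in ATGTA, so it is not stored in the dictionary and falls back to the full shift m = 5, as in the table of the slide.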

String Matching: Horspool algorithm
Open the C code of the Horspool algorithm.

BNDM algorithm
How is the shift determined? How is the comparison made? The window is read from right to left, searching for suffixes of the window of T that are factors of P. This state is expressed with an array D of bits. How can the next state be obtained? Given the mask B(x) of x, which marks the cells where character x appears in the pattern, reading x updates the state as D = (D << 1) & B(x).

BNDM algorithm: example
Given the pattern ATGTA, the masks of the characters are: B(A) = (1 0 0 0 1), B(C) = (0 0 0 0 0), B(G) = (0 0 1 0 0), B(T) = (0 1 0 1 0).

BNDM algorithm: example
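Given the pattern ATGTA and the text G T A C T A G A G G A C G T A T G T A C T G ..., each window is read from right to left, with D = (D << 1) & B(x) at each step:
First window (G T A C T): D1 = B(T) = (0 1 0 1 0); D2 = (1 0 1 0 0) & (0 0 0 0 0) = (0 0 0 0 0). The window is shifted 5 cells.
Second window (A G A G G): D1 = B(G) = (0 0 1 0 0); D2 = (0 1 0 0 0) & (0 0 1 0 0) = (0 0 0 0 0). The window is shifted 5 cells.
Third window (A C G T A): D1 = B(A) = (1 0 0 0 1); D2 = (0 0 0 1 0) & (0 1 0 1 0) = (0 0 0 1 0); D3 = (0 0 1 0 0) & (0 0 1 0 0) = (0 0 1 0 0); D4 = (0 1 0 0 0) & (0 0 0 0 0) = (0 0 0 0 0). The left bit of D1 is set, so the window is shifted 5 - 1 = 4 cells.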

BNDM algorithm: example
The pattern is ATGTA, and the window of the text G T A C T A G A G G A C G T A T G T A C T G ... is now A T G T A:
D1 = B(A) = (1 0 0 0 1); D2 = (0 0 0 1 0) & (0 1 0 1 0) = (0 0 0 1 0); D3 = (0 0 1 0 0) & (0 0 1 0 0) = (0 0 1 0 0); D4 = (0 1 0 0 0) & (0 1 0 1 0) = (0 1 0 0 0); D5 = (1 0 0 0 0) & (1 0 0 0 1) = (1 0 0 0 0); D6 = (0 0 0 0 0) & (* * * * *) = (0 0 0 0 0). The left bit is set after reading m = 5 characters: pattern found!

BNDM algorithm
How is the shift determined? The search looks for suffixes of the window of T that are factors of P, and this state is expressed with an array D of bits. If the left bit is set to one in step i, a prefix of P of length i is equal to a suffix of the window; the window is then shifted m-i cells, taking the largest such i smaller than m. Otherwise it is shifted m cells.
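
The masks, the update D = (D << 1) & B(x) and the shift rule can be put together as follows. This is a minimal Python sketch that uses unbounded integers as bit arrays, with the leftmost bit standing for the first cell of the pattern; the names `bndm_masks` and `bndm` are assumptions made for illustration, and a real implementation would normally require the pattern to fit in a machine word of w bits.

```python
def bndm_masks(p):
    """Mask B(x) of each letter: bit set in the cells where x appears in p.
    The most significant of the m bits corresponds to the first cell."""
    m = len(p)
    B = {}
    for i, c in enumerate(p):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    return B


def bndm(p, t):
    """Return the 0-based positions where p occurs in t."""
    m, n = len(p), len(t)
    B = bndm_masks(p)
    full = (1 << m) - 1          # m ones
    top = 1 << (m - 1)           # the left bit
    occurrences = []
    pos = 0
    while pos <= n - m:
        j, last = m, m
        D = full
        while D != 0:
            D &= B.get(t[pos + j - 1], 0)   # read the window from right to left
            j -= 1
            if D & top:                     # a prefix of p equals the suffix read
                if j > 0:
                    last = j                # remember the shift m - i
                else:
                    occurrences.append(pos) # m characters read: pattern found
            D = (D << 1) & full
        pos += last
    return occurrences


masks = bndm_masks("ATGTA")
print({c: format(b, "05b") for c, b in masks.items()})
print(bndm("ATGTA", "GTACTAGAGGACGTATGTACTG"))
```

The variable `last` ends up holding m-i for the longest prefix of length i < m found in the window, or m when no prefix was found.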

BOM (Backward Oracle Matching)
How is the shift determined? How is the comparison made? With an automaton, the Factor Oracle (1999), which checks if the suffix of the window read so far is a factor of the pattern.

Automaton Factor Oracle: properties
Factor Oracle of the word GTATGTA. (Figure: the automaton with its labelled transitions.) The automaton recognizes the factors of the word, but it also recognizes other strings, such as GTG, so it is useful only for discarding words that are not factors!
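
The property can be checked on the word GTATGTA with a small Python sketch of the usual on-line construction of the Factor Oracle. The names `factor_oracle`, `accepts` and `supply` are assumptions made for illustration; states are numbered 0 to m, and every state is accepting.

```python
def factor_oracle(word):
    """Return the transitions of the oracle: one dict {letter: state} per state."""
    delta = [{} for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)       # supply links; -1 means undefined
    for i, c in enumerate(word, start=1):
        delta[i - 1][c] = i               # internal transition i-1 -> i
        k = supply[i - 1]
        while k != -1 and c not in delta[k]:
            delta[k][c] = i               # external transition k -> i
            k = supply[k]
        supply[i] = delta[k][c] if k != -1 else 0
    return delta


def accepts(delta, s):
    """True if s can be read from the initial state."""
    state = 0
    for c in s:
        if c not in delta[state]:
            return False
        state = delta[state][c]
    return True


oracle = factor_oracle("GTATGTA")
print(accepts(oracle, "TATG"), accepts(oracle, "GTG"), accepts(oracle, "GG"))
```

GTG is read through the states 0, 1, 2 and 5 although it is not a factor of GTATGTA, while GG is rejected at the second letter.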

BOM: example
How is the comparison made? The Factor Oracle of the inverted pattern is built. Given the pattern ATGTATG, the oracle is that of GTATGTA. Search in the text G T A C T A G A A T G T G T A G A C A T G T A T G G T G ...: the window of the pattern ATGTATG takes six successive positions along the text.

BOM (Backward Oracle Matching)
How is the shift determined? The Factor Oracle checks if the suffix of the window read so far is a factor of the pattern. If a is the first mismatch, that is, the first character for which the oracle has no transition, the window is shifted to start just after a.
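
A minimal Python sketch of the whole BOM search, under the same assumptions as before: the helper `build_oracle` is the on-line Factor Oracle construction, the function name `bom` is hypothetical, and the visited window positions are returned only to follow the example with the pattern ATGTATG.

```python
def build_oracle(word):
    """Factor Oracle of word: one dict {letter: state} per state 0..len(word)."""
    delta = [{} for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)
    for i, c in enumerate(word, start=1):
        delta[i - 1][c] = i
        k = supply[i - 1]
        while k != -1 and c not in delta[k]:
            delta[k][c] = i
            k = supply[k]
        supply[i] = delta[k][c] if k != -1 else 0
    return delta


def bom(p, t):
    """Return (occurrences, visited window positions), both 0-based."""
    m, n = len(p), len(t)
    delta = build_oracle(p[::-1])          # oracle of the inverted pattern
    occurrences, windows = [], []
    pos = 0
    while pos <= n - m:
        windows.append(pos)
        state, j = 0, m
        while j > 0 and state is not None:
            state = delta[state].get(t[pos + j - 1])   # read from right to left
            j -= 1
        if state is not None:
            occurrences.append(pos)        # m characters read: the pattern itself
        pos += j + 1                       # restart just after the mismatch
    return occurrences, windows


text = "GTACTAGAATGTGTAGACATGTATGGTG"
print(bom("ATGTATG", text))
```

When all m characters are read without failing, the window is an occurrence, because the only word of length m that the oracle accepts is the inverted pattern itself.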

String Matching: BNDM and BOM
Open the C code of the BNDM and BOM algorithms.

(Exact) string matching of many patterns
First lecture, second part: (Exact) string matching of many patterns

String matching: many patterns
Given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC, search for the patterns ACTGACT, GTCT, AATT, ACTGATCTTT, GTAGC, AATACT, ACATGC and ACTGA.

Trie of words GTATGTA, GTAT, TAATA, GTGTA
(Figure: the trie of the words GTATGTA, GTAT, TAATA and GTGTA.) Which is the cost?
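
A trie of these four words can be sketched in Python with nested dictionaries. The names `build_trie`, `contains` and `count_nodes` are assumptions made for illustration, and the key "$" is a hypothetical end-of-word marker that is assumed not to belong to the alphabet.

```python
def build_trie(words):
    """Trie as nested dicts; the key "$" marks the end of a word."""
    root = {}
    for w in words:
        node = root
        for c in w:
            node = node.setdefault(c, {})   # reuse the shared prefix if it exists
        node["$"] = w
    return root


def contains(trie, w):
    """True if w is one of the words stored in the trie."""
    node = trie
    for c in w:
        if c not in node:
            return False
        node = node[c]
    return "$" in node


def count_nodes(node):
    """Number of nodes of the trie, the root included."""
    return 1 + sum(count_nodes(child) for key, child in node.items() if key != "$")


trie = build_trie(["GTATGTA", "GTAT", "TAATA", "GTGTA"])
print(contains(trie, "GTAT"), contains(trie, "GTA"), count_nodes(trie))
```

Building the trie takes time proportional to the total length of the words, and looking up a word takes time proportional to its length, whatever the number of words stored.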

Horspool for many patterns
Search for ATGTATG, TATG, ATAAT, ATGTG.
1. Build the trie of the inverted patterns.
2. Compute lmin = 4.
3. Build the table of shifts: A 1, C 4 (lmin), G 2, T 1.
4. Start the search.

Horspool for many patterns
Search for ATGTATG, TATG, ATAAT, ATGTG with the table of shifts A 1, C 4 (lmin), G 2, T 1 in the text ACATGCTATGTGACA…

Horspool for many patterns
Short shifts! With the table A 1, C 4 (lmin), G 2, T 1 for the patterns ATGTATG, TATG, ATAAT, ATGTG, the window over the text ACATGCTATGTGACA… usually advances only one or two cells.

Horspool to Wu-Manber
How can we increase the length of the shifts? With a shift table of l-mers. With the patterns ATGTATG, TATG, ATAAT, ATGTG:
1 symbol: A 1, C 4 (lmin), G 2, T 1.
2 symbols: AA 1, AC 3 (lmin-l+1), AG 3, AT 1, CA 3, CC 3, CG 3, ...
Listing only the 2-mers that shift less than lmin-l+1 = 3: AA 1, AT 1, GT 1, TA 2, TG 2.

Wu-Manber algorithm
Search for ATGTATG, TATG, ATAAT, ATGTG in the text ACATGCTATGTGACATAATA with the table of shifts AA 1, AT 1, GT 1, TA 2, TG 2. Experimental length of the l-mers: l = log_|Σ|(2*lmin*r), where r is the number of patterns.
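
The table of 2-mers of this example can be recomputed with a short Python sketch. It follows the Horspool-like table shown on the slides and is not the full Wu-Manber algorithm, which relies on hash tables to select the candidate patterns; here the candidates are simply checked one by one. The names `lmer_shift_table` and `lmer_search` are assumptions made for illustration.

```python
def lmer_shift_table(patterns, l):
    """Shift of each l-mer: smallest distance (at least 1) between the end of
    one of its occurrences and the end of a pattern. Any l-mer that is not in
    the table shifts the default value lmin - l + 1."""
    lmin = min(len(p) for p in patterns)
    default = lmin - l + 1
    shift = {}
    for p in patterns:
        for d in range(1, default):
            block = p[len(p) - d - l:len(p) - d]
            shift[block] = min(shift.get(block, default), d)
    return shift, default


def lmer_search(patterns, text, l=2):
    """Return the sorted list of (0-based position, pattern) occurrences."""
    lmin = min(len(p) for p in patterns)
    shift, default = lmer_shift_table(patterns, l)
    occurrences = []
    end = lmin                               # exclusive end of the window
    while end <= len(text):
        for p in patterns:                   # naive check of the patterns ending here
            if text.endswith(p, 0, end):
                occurrences.append((end - len(p), p))
        end += shift.get(text[end - l:end], default)
    return sorted(occurrences)


patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
print(lmer_shift_table(patterns, 2))
print(lmer_search(patterns, "ACATGCTATGTGACATAATA"))
```

With l = 2 the default shift is lmin-l+1 = 3, larger than most of the entries of the one-symbol table.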

String matching of many patterns
(Figure: maps of the most efficient algorithm, Wu-Manber or SBOM, as a function of the alphabet size |Σ|, with ticks 2, 4, 8, and the minimum pattern length lmin, for 5, 10 and 100 patterns.)

Horspool for a set of patterns
How is the comparison made? With an automaton containing all the patterns. How is the shift determined? According to the occurrences of the last character of the text window, 'a', in the patterns; specifically, its first occurrence from the right that is not in the last position and is closer to the end than lmin, or else lmin.
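
The rule above can be sketched in Python for the patterns ATGTATG, TATG, ATAAT and ATGTG. The function name `set_horspool` is an assumption made for illustration; the automaton is taken to be the trie of the inverted patterns, stored as nested dictionaries, and the key `None` is a hypothetical marker for the end of a pattern.

```python
def set_horspool(patterns, text):
    """Return (sorted occurrences as (0-based position, pattern), shift table)."""
    lmin = min(len(p) for p in patterns)

    # Trie of the inverted patterns.
    trie = {}
    for p in patterns:
        node = trie
        for c in reversed(p):
            node = node.setdefault(c, {})
        node[None] = p

    # First occurrence from the right, not in the last position and closer to
    # the end than lmin; letters that are not in the table shift lmin.
    shift = {}
    for p in patterns:
        for d in range(1, lmin):
            c = p[len(p) - 1 - d]
            shift[c] = min(shift.get(c, lmin), d)

    occurrences = []
    end = lmin - 1                          # last cell of the window
    while end < len(text):
        node, i = trie, end
        while i >= 0 and text[i] in node:   # read backwards through the trie
            node = node[text[i]]
            i -= 1
            if None in node:
                occurrences.append((i + 1, node[None]))
        end += shift.get(text[end], lmin)
    return sorted(occurrences), shift


patterns = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
print(set_horspool(patterns, "ACATGCTATGTGACA"))
```

The backward reading may go beyond the left end of the window, since the patterns can be longer than lmin.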

String matching of many patterns
(Figure: maps of the most efficient algorithm, among Wu-Manber, SBOM and Ad AC, as a function of the alphabet size |Σ|, with ticks 2, 4, 8, and the minimum pattern length, for 5, 10, 100 and 1000 patterns.)

SBOM
How is the shift determined? How is the comparison made? With an automaton, the Factor Oracle, which checks if the suffix of the window read so far is a factor of any pattern.

Factor Oracle of many patterns
(Figure: the automaton; some states carry the labels 1,4, 3 and 2.) The AFO (Factor Oracle automaton) of GTATGTA, GTAA, TAATA and GTGTA.

SBOM algorithm
How is the comparison made? With an automaton … of length lmin. How is the shift determined? Two cases are distinguished: the character a does not appear in the AFO, or lmin characters have been read.

SBOM algorithm : example
Search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG in the text ACATGCTAGCTATAATAATGTATG. (Figure: the AFO built for the patterns.)
