Master Course MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)


1 Master Course MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Dep. de Llenguatges i Sistemes Informàtics CEPBA-IBM Research Institute Universitat Politècnica de Catalunya

2 Contents 1. (Exact) String matching of one pattern 2. (Exact) String matching of many patterns 3. Approximate string matching (Dynamic programming) 4. Pairwise and multiple alignment 5. Suffix trees

3 Contents and bibliography 1. (Exact) String matching of one pattern 2. (Exact) String matching of many patterns 3. Approximate string matching (Dynamic programming) 4. Pairwise and multiple alignment 5. Suffix trees. Bibliography: Flexible Pattern Matching in Strings, G. Navarro and M. Raffinot, Cambridge University Press, 2002; Algorithms on Strings, Trees and Sequences, D. Gusfield, Cambridge University Press, 1997.

4 String matching Definition: given a long text T and a set of k patterns p_1, p_2, …, p_k, the string matching problem is to find all the occurrences of all the patterns in the text T. On-line algorithms: the patterns are known in advance. Off-line algorithms: the text is known in advance. Topics: only one pattern (exact and approximate); five, ten, a hundred, a thousand, ... patterns (exact); extended patterns; suffix trees.

5 Master Course First lecture: First part: (Exact) string matching of one pattern

6 String matching: one pattern For instance, given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC, search for the pattern ACTGA, and for the pattern TACTACGGTATGACTAA. How do string matching algorithms perform the search?

7 String Matching: Brute force algorithm Example: given the pattern ATGTA, the search runs over the text G T A C T A G A G G A C G T A T G T A C T G..., placing the window A T G T A at each successive position.

8 String Matching: Brute force algorithm Open the Brute Force algorithm and its C code (C code of the running file). What is the meaning of the variables? y: array with the text T; n: length of the text; x: array with the pattern P; m: length of the pattern.
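The brute force search can be sketched in a few lines. This is a minimal Python illustration, not the C file the slide refers to; it reuses the slide's variable names (y, n, x, m), while the function name `brute_force` is an assumption made for the example.

```python
def brute_force(x, y):
    """Return every start position j such that y[j:j+m] equals x."""
    m, n = len(x), len(y)
    matches = []
    for j in range(n - m + 1):            # the window is shifted one cell at a time
        i = 0
        while i < m and x[i] == y[j + i]: # compare from left to right
            i += 1
        if i == m:                        # the whole pattern matched
            matches.append(j)
    return matches


if __name__ == "__main__":
    text = "GTACTAGAGGACGTATGTACTG"
    print(brute_force("ATGTA", text))     # [14]
```

Each of the n-m+1 windows can cost up to m comparisons, which gives the O(nm) bound.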

9 String Matching of one pattern The cost of the Brute Force algorithm is O(nm). Can the search be made at a lower cost? Text: CTACTACTACGTCTATACTGATCGTAGCTACTACATGC Pattern: TACTACGGTATGACTAA Three approaches: prefix search, suffix search and factor search.

10 String matching of one pattern How do string matching algorithms perform the search? There is a sliding window along the text against which the pattern is compared. At each step the comparison is made and the window is shifted to the right. Which facts differentiate the algorithms? 1. How the comparison is made. 2. The length of the shift.

11 String Matching: Brute force algorithm How is the comparison made? From left to right: prefix search. Which is the next position of the window? The window is shifted only one cell. The cost is O(mn).

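12 String Matching: one pattern Most efficient algorithms (Navarro & Raffinot) [Figure: map of the most efficient algorithm according to the alphabet size |Σ| and the length of the pattern, with regions for Horspool, BNDM and BOM; the value w is marked on the figure.] BNDM: Backward Nondeterministic Dawg Matching. BOM: Backward Oracle Matching.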

13 String Matching: Horspool algorithm How is the comparison made? From right to left: suffix search. Which is the next position of the window? It depends on where the last letter of the text window, say 'a', appears in the pattern. A preprocessing step is therefore needed to determine the length of the shift for each letter.

14 String Matching: Horspool algorithm Given the pattern ATGTA, the shift table is A: 4, C: 5, G: 2, T: 1. Example: search in the text G T A C T A G A G G A C G T A T G T A C T G... with the window A T G T A.

15 String Matching: Horspool algorithm Given the pattern ATGTA, the shift table is A: 4, C: 5, G: 2, T: 1. Example (continued): search in the text G T A C T A G A G G A C G T A T G T A C T G... with the window A T G T A …
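The shift table and the right-to-left comparison can be sketched as follows. This is a hedged Python illustration rather than the course's C code; the function names `horspool_shift_table` and `horspool` are assumptions. Characters absent from the table (C in the slide's example) get the default shift m.

```python
def horspool_shift_table(x):
    """Distance from the rightmost occurrence of each letter (last cell excluded)
    to the end of the pattern; later occurrences overwrite earlier ones."""
    m = len(x)
    return {c: m - 1 - i for i, c in enumerate(x[:-1])}


def horspool(x, y):
    """Return the start positions of x in y using Horspool's shifts."""
    m, n = len(x), len(y)
    shift = horspool_shift_table(x)
    matches = []
    j = 0
    while j <= n - m:
        i = m - 1
        while i >= 0 and x[i] == y[j + i]:   # compare from right to left
            i -= 1
        if i < 0:
            matches.append(j)
        # the shift depends only on the last letter of the current window
        j += shift.get(y[j + m - 1], m)
    return matches


if __name__ == "__main__":
    print(horspool_shift_table("ATGTA"))                 # {'A': 4, 'T': 1, 'G': 2}
    print(horspool("ATGTA", "GTACTAGAGGACGTATGTACTG"))   # [14]
```

On the example text the window visits positions 0, 1, 5, 7, 12 and 14, far fewer than the 18 windows inspected by brute force.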

16 String Matching: Horspool algorithm Open the Horspool algorithm and its C code.

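17 String Matching: one pattern The most efficient algorithms (Navarro & Raffinot) [Figure: map of the most efficient algorithm according to the alphabet size |Σ| and the length of the pattern, with regions for Horspool, BNDM and BOM; the value w is marked on the figure.] BNDM: Backward Nondeterministic Dawg Matching. BOM: Backward Oracle Matching.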

18 BNDM algorithm How is the comparison made? The algorithm searches for suffixes of the text window that are factors of the pattern P. This state is expressed with an array D of bits. How can the next state be obtained? D = (D << 1) & B(x), where the mask B(x) of the character x marks the cells where x appears in the pattern. For instance, given D_2 and the mask B(x) of the next character read, D_3 = (D_2 << 1) & B(x). How is the shift determined?

19 BNDM algorithm: example Given the pattern ATGTA, the masks of the characters are: B(A) = (1 0 0 0 1), B(C) = ?, B(G) = ?, B(T) = ?

20 BNDM algorithm: example Given the pattern ATGTA, the masks of the characters are: B(A) = (1 0 0 0 1), B(C) = (0 0 0 0 0), B(G) = (0 0 1 0 0), B(T) = (0 1 0 1 0).

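21 BNDM algorithm: example Given the pattern ATGTA and the text G T A C T A G A G G A C G T A T G T A C T G..., the masks of the characters are: B(A) = (1 0 0 0 1), B(C) = (0 0 0 0 0), B(G) = (0 0 1 0 0), B(T) = (0 1 0 1 0). First window (G T A C T): D_1 = B(T) = (0 1 0 1 0); D_2 = (1 0 1 0 0) & (0 0 0 0 0) = (0 0 0 0 0). Second window (A G A G G): D_1 = B(G) = (0 0 1 0 0); D_2 = (0 1 0 0 0) & (0 0 1 0 0) = (0 0 0 0 0). Third window (A C G T A): D_1 = B(A) = (1 0 0 0 1); D_2 = (0 0 0 1 0) & (0 1 0 1 0) = (0 0 0 1 0); D_3 = (0 0 1 0 0) & (0 0 1 0 0) = (0 0 1 0 0); D_4 = (0 1 0 0 0) & (0 0 0 0 0) = (0 0 0 0 0).

22 BNDM algorithm: example The pattern is ATGTA, the masks are B(A) = (1 0 0 0 1), B(C) = (0 0 0 0 0), B(G) = (0 0 1 0 0), B(T) = (0 1 0 1 0), and the text is G T A C T A G A G G A C G T A T G T A C T G... Fourth window (A T G T A): D_1 = B(A) = (1 0 0 0 1); D_2 = (0 0 0 1 0) & (0 1 0 1 0) = (0 0 0 1 0); D_3 = (0 0 1 0 0) & (0 0 1 0 0) = (0 0 1 0 0); D_4 = (0 1 0 0 0) & (0 1 0 1 0) = (0 1 0 0 0); D_5 = (1 0 0 0 0) & (1 0 0 0 1) = (1 0 0 0 0); D_6 = (0 0 0 0 0) & (* * * * *) = (0 0 0 0 0). Pattern found! …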

23 BNDM algorithm How is the comparison made? The algorithm searches for suffixes of the text window that are factors of the pattern P. This state is expressed with an array D of bits. How is the shift determined from D?

24 BNDM algorithm How is the comparison made? The algorithm searches for suffixes of the text window that are factors of the pattern P. This state is expressed with an array D of bits. How is the shift determined? If the left bit of D is set to one in step i, a prefix of P of length i is equal to a suffix of the text window, and the window is shifted m-i cells (using the largest such i smaller than m); otherwise it is shifted m cells.
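The update D = (D << 1) & B(x) and the shift rule can be sketched with Python integers used as bit arrays. This is an illustrative sketch, not the course's C code: the name `bndm`, the variable `last` and the convention that the leftmost bit stands for the first cell of the pattern are assumptions chosen to agree with the slides. Python integers are unbounded, so the sketch masks D to m bits explicitly.

```python
def bndm(x, y):
    """Return the start positions of x in y (bit-parallel backward search)."""
    m, n = len(x), len(y)
    full = (1 << m) - 1                  # m ones
    left = 1 << (m - 1)                  # the left bit of D
    B = {}
    for i, c in enumerate(x):            # B[c]: cells of x holding c, leftmost = cell 1
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))
    matches = []
    pos = 0
    while pos <= n - m:
        D = full
        j = m                            # cells of the window not yet read
        last = m                         # shift to apply if no prefix is seen
        while D != 0:
            D &= B.get(y[pos + j - 1], 0)   # read the window from right to left
            j -= 1
            if D & left:                 # a prefix of x equals the suffix read so far
                if j > 0:
                    last = j             # shift m - i, with i = m - j characters read
                else:
                    matches.append(pos)  # the whole window was read: pattern found
                    break
            D = (D << 1) & full
        pos += last
    return matches


if __name__ == "__main__":
    print(bndm("ATGTA", "GTACTAGAGGACGTATGTACTG"))   # [14]
```

On the example text the windows start at positions 0, 5, 10 and 14, with shifts 5, 5 and 4.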

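25 String matching: one pattern The most efficient algorithms (Navarro & Raffinot) [Figure: map of the most efficient algorithm according to the alphabet size |Σ| and the length of the pattern, with regions for Horspool, BNDM and BOM; the value w is marked on the figure.] BNDM: Backward Nondeterministic Dawg Matching. BOM: Backward Oracle Matching.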

26 BOM (Backward Oracle Matching) How is the comparison made? With an automaton, the Factor Oracle (1999), which checks whether the suffix of the text window read so far is a factor of the pattern. How is the shift determined?

27 Automaton Factor Oracle: properties [Figure: Factor Oracle of the word GTATGTA.] The automaton recognizes all the factors of the word, such as GTATG and ATG, but it also recognizes other strings, such as GTG; hence it is useful only for discarding words as factors.
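The property can be checked with a small construction of the oracle. The sketch below uses the usual on-line construction with supply links; the names `factor_oracle`, `accepts` and `supply` are assumptions made for the illustration, and the transition table is stored as a list of dictionaries.

```python
def factor_oracle(word):
    """Return the transitions of the factor oracle of `word`
    (state i is reached after reading the first i letters)."""
    trans = [{} for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)
    for i, c in enumerate(word, start=1):
        trans[i - 1][c] = i                   # internal transition
        k = supply[i - 1]
        while k > -1 and c not in trans[k]:   # external transitions
            trans[k][c] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else trans[k][c]
    return trans


def accepts(trans, s):
    """True if the oracle can read s starting from the initial state."""
    state = 0
    for c in s:
        if c not in trans[state]:
            return False
        state = trans[state][c]
    return True


if __name__ == "__main__":
    oracle = factor_oracle("GTATGTA")
    print(accepts(oracle, "ATG"))   # True: a factor
    print(accepts(oracle, "GTG"))   # True, although GTG is not a factor
    print(accepts(oracle, "GG"))    # False: GG is certainly not a factor
```

A rejection is therefore reliable, while an acceptance only means the string may be a factor.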

28 BOM: example Given the pattern ATGTATG, the Factor Oracle of the reversed pattern (GTATGTA) is built. Search in the text G T A C T A G A A T G T G T A G A C A T G T A T G G T G A... with the window A T G T A T G. How is the comparison made? [Figure: Factor Oracle of the reversed pattern.]

33 BOM: example The automaton of the reversed pattern is built; suppose the pattern is ATGTATG. Search in the text G T A C T A G A A T G T G T A G A C A T G T A T G G T G... with the window A T G T A T G. How is the comparison made? [Figure: Factor Oracle of the reversed pattern.] …

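34 BOM (Backward Oracle Matching) How is the comparison made? With the Factor Oracle automaton, which checks whether the suffix read so far is a factor of the pattern. How is the shift determined? If a is the first mismatch, that is, the first character for which the automaton has no transition, the window is shifted just past a.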
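A compact sketch of the whole search, under the same assumptions as before (Python instead of the course's C code, hypothetical names `factor_oracle` and `bom`). It relies on a known property of the oracle: the only word of length m it accepts is the word it was built from, so reading the full window without failing means the pattern was found.

```python
def factor_oracle(word):
    """Transitions of the factor oracle of `word` (list of dicts)."""
    trans = [{} for _ in range(len(word) + 1)]
    supply = [-1] * (len(word) + 1)
    for i, c in enumerate(word, start=1):
        trans[i - 1][c] = i
        k = supply[i - 1]
        while k > -1 and c not in trans[k]:
            trans[k][c] = i
            k = supply[k]
        supply[i] = 0 if k == -1 else trans[k][c]
    return trans


def bom(x, y):
    """Return the start positions of x in y with Backward Oracle Matching."""
    m, n = len(x), len(y)
    trans = factor_oracle(x[::-1])        # oracle of the reversed pattern
    matches = []
    pos = 0
    while pos <= n - m:
        state, j = 0, m
        while j > 0 and state is not None:
            state = trans[state].get(y[pos + j - 1])   # read right to left
            j -= 1
        if state is not None:             # m letters read without failing
            matches.append(pos)
        pos += j + 1                      # restart just past the mismatching letter
    return matches


if __name__ == "__main__":
    print(bom("ATGTATG", "GTACTAGAATGTGTAGACATGTATGGTG"))   # [18]
```

On this text the windows start at positions 0, 6, 8, 11, 18 and 19.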

35 String Matching: BNDM and BOM Open the BNDM and BOM algorithms and their C code: C code of BNDM, C code of BOM.

36 Master Course First lecture: Second part: (Exact) string matching of many patterns

37 String matching: many patterns Given the sequence CTACTACTACGTCTATACTGATCGTAGCTACTACATGC, search for the patterns ACTGACT, GTCT, AATT, ACTGATCTTT, GTAGC, AATACT, ACATGC and ACTGA.

38 Trie Trie of the words GTATGTA, GTAT, TAATA and GTGTA [Figure: trie of the four words.] What is the cost?
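As a hint for the cost question, a trie can be sketched with nested dictionaries. The function name `build_trie` and the use of the key `None` as an end-of-word mark are assumptions made for this illustration. Every letter of every word is visited exactly once, so the construction time is proportional to the total length of the words, and the number of nodes is at most that total plus one.

```python
def build_trie(words):
    """Build a trie as nested dicts; return (root, number_of_nodes)."""
    root = {}
    nodes = 1                      # the root
    for w in words:
        node = root
        for c in w:                # one step per letter of the word
            if c not in node:
                node[c] = {}
                nodes += 1
            node = node[c]
        node[None] = w             # end-of-word mark (hypothetical convention)
    return root, nodes


if __name__ == "__main__":
    words = ["GTATGTA", "GTAT", "TAATA", "GTGTA"]
    root, nodes = build_trie(words)
    print(nodes, sum(len(w) for w in words))   # 16 21
```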

39 Horspool for many patterns Search for ATGTATG, TATG, ATAAT, ATGTG. 1. Build the trie of the reversed patterns [Figure: trie of GTATGTA, GTAT, TAATA, GTGTA]. 2. Compute lmin = 4. 3. Build the table of shifts: A: 1, C: 4 (lmin), G: 2, T: 1. 4. Start the search.
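The four steps can be sketched as follows. This is an illustrative Python version, not the author's implementation; the names `build_reversed_trie`, `shift_table` and `set_horspool` are assumptions, as is the use of the key `None` to mark the end of a pattern in the trie.

```python
def build_reversed_trie(patterns):
    """Step 1: trie of the reversed patterns (nested dicts)."""
    root = {}
    for p in patterns:
        node = root
        for c in reversed(p):
            node = node.setdefault(c, {})
        node[None] = p                        # a pattern ends at this node
    return root


def shift_table(patterns):
    """Steps 2 and 3: lmin and the shift of each letter, capped at lmin."""
    lmin = min(len(p) for p in patterns)
    table = {}
    for p in patterns:
        for i, c in enumerate(p[:-1]):        # the last cell is excluded
            d = min(len(p) - 1 - i, lmin)
            table[c] = min(table.get(c, lmin), d)
    return table, lmin


def set_horspool(patterns, text):
    """Step 4: return (start, pattern) for every occurrence found."""
    trie = build_reversed_trie(patterns)
    table, lmin = shift_table(patterns)
    matches = []
    pos = lmin - 1                            # index of the last cell of the window
    while pos < len(text):
        node, j = trie, pos
        while j >= 0 and text[j] in node:     # walk the trie from right to left
            node = node[text[j]]
            j -= 1
            if None in node:
                matches.append((j + 1, node[None]))
        pos += table.get(text[pos], lmin)     # letters absent from the table shift lmin
    return matches


if __name__ == "__main__":
    pats = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
    print(shift_table(pats))                       # ({'A': 1, 'T': 1, 'G': 2}, 4)
    print(set_horspool(pats, "ACATGCTATGTGACA"))   # [(6, 'TATG'), (7, 'ATGTG')]
```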

40 Horspool for many patterns Search for ATGTATG, TATG, ATAAT, ATGTG in the text ACATGCTATGTGACA… using the trie of the reversed patterns [Figure: trie of GTATGTA, GTAT, TAATA, GTGTA] and the table of shifts A: 1, C: 4 (lmin), G: 2, T: 1.

46 Horspool for many patterns The same search over the text ACATGCTATGTGACA… with the table of shifts A: 1, C: 4 (lmin), G: 2, T: 1 … Short shifts!

47 Horspool to Wu-Manber How can we increase the length of the shifts? With a shift table of l-mers. For the patterns ATGTATG, TATG, ATAAT, ATGTG: with 1 symbol, A: 1, C: 4 (lmin), G: 2, T: 1. With 2 symbols, AA: 1, AT: 1, GT: 1, TA: 2, TG: 2, and lmin-L+1 = 3 for every other pair (AC: 3, AG: 3, CA: 3, CC: 3, CG: 3, …).
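Both tables on this slide can be obtained from one routine parameterised by the block length L. The sketch below is a hedged illustration: the name `block_shift_table` is an assumption, and the rule it implements (distance from each block occurrence to the end of its pattern, final position excluded, capped at lmin-L+1) is inferred from the values shown on the slide.

```python
def block_shift_table(patterns, L):
    """Return (table, default): the shift of every block of L letters that
    occurs in some pattern, and the default shift lmin - L + 1 for the rest."""
    lmin = min(len(p) for p in patterns)
    default = lmin - L + 1
    table = {}
    for p in patterns:
        # `end` is the index of the last letter of the block; a block ending
        # at the last cell of the pattern is excluded, as in Horspool
        for end in range(L - 1, len(p) - 1):
            block = p[end - L + 1:end + 1]
            d = min(len(p) - 1 - end, default)
            table[block] = min(table.get(block, default), d)
    return table, default


if __name__ == "__main__":
    pats = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
    print(block_shift_table(pats, 1))   # ({'A': 1, 'T': 1, 'G': 2}, 4)
    print(block_shift_table(pats, 2))
    # ({'AT': 1, 'TG': 2, 'GT': 1, 'TA': 2, 'AA': 1}, 3)
```

With 2-symbol blocks most pairs of letters do not occur in any pattern, so the default shift of 3 applies far more often than the shift of 4 does with single letters.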

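48 Wu-Manber algorithm Search for ATGTATG, TATG, ATAAT, ATGTG in the text ACATGCTATGTGACATAATA… using the trie of the reversed patterns [Figure: trie of GTATGTA, GTAT, TAATA, GTGTA] and the table AA: 1, AT: 1, GT: 1, TA: 2, TG: 2. Experimental length of the blocks: log_{|Σ|}(2·lmin·r), where r is the number of patterns.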
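The search on this slide can be sketched as a Horspool-style scan that shifts by blocks of L letters. This is a simplified illustration and not the full Wu-Manber algorithm: it omits the hash tables of the original method, and it verifies candidates with a plain `str.endswith` test instead of the trie shown on the slide. The name `block_search` is an assumption, and L must not exceed lmin.

```python
def block_search(patterns, text, L=2):
    """Return (start, pattern) for every occurrence, shifting by L-letter blocks."""
    lmin = min(len(p) for p in patterns)
    default = lmin - L + 1
    shift = {}
    for p in patterns:                         # block shift table, as on slide 47
        for end in range(L - 1, len(p) - 1):
            block = p[end - L + 1:end + 1]
            d = min(len(p) - 1 - end, default)
            shift[block] = min(shift.get(block, default), d)
    matches = []
    pos = lmin - 1                             # index of the last cell of the window
    while pos < len(text):
        for p in patterns:                     # naive check of patterns ending at pos
            if text.endswith(p, 0, pos + 1):
                matches.append((pos + 1 - len(p), p))
        pos += shift.get(text[pos - L + 1:pos + 1], default)
    return matches


if __name__ == "__main__":
    pats = ["ATGTATG", "TATG", "ATAAT", "ATGTG"]
    print(block_search(pats, "ACATGCTATGTGACATAATA"))
    # [(6, 'TATG'), (7, 'ATGTG'), (14, 'ATAAT')]
```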

49 String matching of many patterns [Figure: maps of the most efficient algorithm (Wu-Manber or SBOM) according to the alphabet size |Σ| and the minimum length lmin, for 5, 10 and 100 patterns.]

50 String matching of many patterns [Figure: maps of the most efficient algorithm (Wu-Manber or SBOM) according to the alphabet size |Σ| and the minimum length lmin, for 5, 10, 100 and 1000 patterns.]

51 Horspool for a set of patterns How is the comparison made? With an automaton built from all the patterns. How is the shift determined? According to the occurrences of the last character of the text window, 'a', in the patterns; specifically, the first occurrence from the right that is not in the last position and lies closer than lmin to the end, or lmin otherwise.

52 String matching of many patterns [Figure: maps of the most efficient algorithm (Wu-Manber, SBOM or Ad AC) according to the alphabet size |Σ| and the minimum length, for 5, 10, 100 and 1000 patterns.]

53 SBOM How is the comparison made? With the Factor Oracle automaton, which checks whether the suffix read so far is a factor of any pattern. How is the shift determined?

54 Factor Oracle of many patterns The AFO of GTATGTA, GTAA, TAATA and GTGTA [Figure: the automaton, with states labelled 1,4, 3 and 2.]

55 SBOM algorithm How is the comparison made? With the automaton … of length lmin. How is the shift determined? There are two cases: the character a does not appear in the AFO, or lmin characters have been read.

56 SBOM algorithm: example Search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG in the text ACATGCTAGCTATAATAATGTATG [Figure: Factor Oracle built for the set of patterns.]

63 SBOM algorithm: example Search for the patterns ATGTATG, TAATG, TAATAAT and AATGTG in the text ACATGCTAGCTATAATAATGT… [Figure: Factor Oracle built for the set of patterns.]

64 Algorithms for the exact search of many patterns [Figure: maps of the most efficient algorithm (Wu-Manber, SBOM or Ad AC) according to the alphabet size |Σ| and the minimum length, for 5, 10, 100 and 1000 words.]

