Master Course MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)


1 Master Course MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Dep. de Llenguatges i Sistemes Informàtics CEPBA-IBM Research Institute Universitat Politècnica de Catalunya

2 Contents
1. (Exact) String matching of one pattern
2. (Exact) String matching of many patterns
3. Approximate string matching (Dynamic programming)
4. Pairwise and multiple alignment
5. Suffix trees and MUMs
References:
Flexible pattern matching in strings, G. Navarro and M. Raffinot, Cambridge University Press, 2002
Algorithms on strings, trees and sequences, D. Gusfield, Cambridge University Press, 1997

3 Master Course Third lecture: First part: Suffix trees

4 Given string ababaas: Suffixes: 1: ababaas 2: babaas 3: abaas 4: baas 5: aas 6: as 7: s
Suffix tree (edge labels; each leaf carries the number of its suffix):
root
  a
    ba
      baas,1
      as,3
    as,5
    s,6
  ba
    baas,2
    as,4
  s,7
What kind of queries?

5 Queries on Suffix trees (suffix tree of ababaas: a ba baas,1 as,3 ba baas,2 as,4 s,6 as,5 s,7) Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab? Find repeats within the sequence ababaas. …

6 Quadratic Insertion algorithm Given the string ababaabbs ababaabbs,1

7 Quadratic Insertion algorithm Given the string ababaabbs babaabbs,2 ababaabbs,1

8 Quadratic Insertion algorithm Given the string ababaabbs babaabbs,2 ababaabbs,1

9 Quadratic Insertion algorithm Given the string ababaabbs babaabbs,2 ababaabbs,1

10 Quadratic Insertion algorithm Given the string ababaabbs babaabbs,2 ababaabbs,1 aba baabbs,1

11 Quadratic Insertion algorithm Given the string ababaabbs babaabbs,2 aba baabbs,1 abbs,3

12 Quadratic Insertion algorithm Given the string ababaabbs babaabbs,2 aba baabbs,1 abbs,3

13 Quadratic Insertion algorithm Given the string ababaabbs babaabbs,2 aba baabbs,1 abbs,3 ba baabbs,2

14 Quadratic Insertion algorithm Given the string ababaabbs aba baabbs,1 abbs,3 ba baabbs,2 abbs,4

15 Quadratic Insertion algorithm Given the string ababaabbs aba baabbs,1 abbs,3 ba baabbs,2 abbs,4

16 Quadratic Insertion algorithm Given the string ababaabbs aba baabbs,1 abbs,3 abbs,4 ba baabbs,2 abbs,4 abbs,3 ba a baabbs,1

17 Quadratic Insertion algorithm Given the string ababaabbs abbs,4 ba baabbs,2 abbs,4 abbs,3 ba a baabbs,1 abbs,5

18 Quadratic Insertion algorithm Given the string ababaabbs abbs,4 ba baabbs,2 abbs,4 abbs,3 ba a baabbs,1 abbs,5

19 Quadratic Insertion algorithm Given the string ababaabbs abbs,4 ba baabbs,2 abbs,4 a abbs,5 b a abbs,3 baabbs,1

20 Quadratic Insertion algorithm Given the string ababaabbs abbs,4 ba baabbs,2 abbs,4 a abbs,5 b a abbs,3 baabbs,1 bs,6

21 Quadratic Insertion algorithm Given the string ababaabbs abbs,4 ba baabbs,2 abbs,4 a abbs,5 b a abbs,3 baabbs,1 bs,6

22 Quadratic Insertion algorithm Given the string ababaabbs a abbs,5 b a abbs,3 baabbs,1 bs,6 a baabbs,2 b abbs,4 bs,7

23 Quadratic Insertion algorithm Given the string ababaabbs a abbs,5 b a abbs,3 baabbs,1 bs,6 a baabbs,2 b abbs,4 bs,7 s,7

24 Quadratic Insertion algorithm Given the string ababaabbs a abbs,5 b a abbs,3 baabbs,1 bs,6 a baabbs,2 b abbs,4 bs,7 s,7
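
The insertion steps above can be sketched in Python as follows. This is a minimal sketch, not the lecturer's code: the names `Node`, `build_suffix_tree`, `insert_suffix` and `leaves` are hypothetical, and it assumes the last character of the string (the `s` of the slides) occurs nowhere else in it. Each suffix is inserted from the root, splitting an edge where the suffix diverges from an existing label, so the total cost is quadratic in the length of the string.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character -> (edge label, child node)
        self.leaf = None     # 1-based suffix number, set only on leaves

def insert_suffix(root, suffix, index):
    node = root
    while True:
        first = suffix[0]
        if first not in node.children:
            # No edge starts with this character: hang a new leaf here.
            leaf = Node()
            leaf.leaf = index
            node.children[first] = (suffix, leaf)
            return
        label, child = node.children[first]
        k = 0  # length of the common prefix of the edge label and the suffix
        while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
            k += 1
        if k < len(label):
            # Mismatch inside the edge: split it at position k.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], mid)
            child = mid
        node, suffix = child, suffix[k:]

def build_suffix_tree(text):
    if text[-1] in text[:-1]:
        raise ValueError("assumption: the last character must be a unique terminator")
    root = Node()
    for start in range(len(text)):
        insert_suffix(root, text[start:], start + 1)
    return root

def leaves(node):
    """Suffix numbers stored below a node."""
    if node.leaf is not None:
        return [node.leaf]
    return [i for _, child in node.children.values() for i in leaves(child)]

tree = build_suffix_tree("ababaabbs")
print(sorted(leaves(tree)))   # one leaf per suffix
```

Splitting edges keeps the tree compact (one leaf per suffix, labels on edges), which matches the drawings; the quadratic cost comes from rescanning each suffix from the root.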

25 Generalized suffix tree The suffix tree of many strings is called the generalized suffix tree, and it is the suffix tree of the concatenation of the strings. For instance, the generalized suffix tree of ababaabb and aabaat is the suffix tree of ababaabbαaabaatβ.

26 Generalized suffix tree a abbα,5 b a abbα,3 baabbα,1 bα,6 a baabbα,2 b abbα,4 bα,7 α,7 Given the suffix tree of ababaabbα : Construction of the suffix tree of ababaabbαaabaaβ :

27 Generalized suffix tree a abbα,5 b a abbα,3 baabbα,1 bα,6 a baabbα,2 b abbα,4 bα,7 α,7 Construction of the suffix tree of ababaabbαaabaaβ :

28 Generalized suffix tree Construction of the suffix tree of ababaabbαaabaaβ : a bα,5 b a abbα,3 baabbα,1 bα,6 a baabbα,2 b abbα,4 bα,7 α,7 ab aaβ,1

29 Generalized suffix tree Construction of the suffix tree of ababaabbαaabaaβ : a bα,5 b a abbα,3 baabbα,1 bα,6 a baabbα,2 b abbα,4 bα,7 α,7 ab aaβ,1

30 Generalized suffix tree a bα,5 b a bbα,3 baabbα,1 bα,6 a baabbα,2 b abbα,4 bα,7 α,7 ab aaβ,1 a β,2 Construction of the suffix tree of ababaabbαaabaaβ :

31 Generalized suffix tree a bα,5 b a bbα,3 baabbα,1 bα,6 a baabbα,2 b abbα,4 bα,7 α,7 ab aaβ,1 a β,2 Construction of the suffix tree of ababaabbαaabaaβ :

32 Generalized suffix tree a bα,5 b a bbα,3 baabbα,1 bα,6 a baabbα,2 b bbα,4 bα,7 α,7 ab aaβ,1 a β,2 a β,3

33 Generalized suffix tree Construction of the suffix tree of ababaabbαaabaaβ : a bα,5 b a bbα,3 baabbα,1 bα,6 a baabbα,2 b bbα,4 bα,7 α,7 ab aaβ,1 a β,2 a β,3

34 Generalized suffix tree a bα,5 b a bbα,3 baabbα,1 bα,6 a baabbα,2 b bbα,4 bα,7 α,7 b aaβ,1 a β,2 a β,3 a β,4 Construction of the suffix tree of ababaabbαaabaaβ :

35 Generalized suffix tree a bα,5 b a bbα,3 baabbα,1 bα,6 a baabbα,2 b bbα,4 bα,7 α,7 b aaβ,1 a β,2 a β,3 a β,4 Construction of the suffix tree of ababaabbαaabaaβ :

36 Generalized suffix tree a bα,5 b a bbα,3 baabbα,1 bα,6 a baabbα,2 b bbα,4 bα,7 α,7 b aaβ,1 a β,2 a β,3 a β,4 Construction of the suffix tree of ababaabbαaabaaβ :

37 Generalized suffix tree a bα,5 b a bbα,3 baabbα,1 bα,6 a baabbα,2 b bbα,4 bα,7 α,7 b aaβ,1 a β,2 a β,3 a β,4 Construction of the suffix tree of ababaabbαaabaaβ :

38 Generalized suffix tree a bα,5 b a bbα,3 baabbα,1 bα,6 a baabbα,2 b bbα,4 bα,7 α,7 b aaβ,1 a β,2 a β,3 a β,4 Construction of the suffix tree of ababaabbαaabaaβ :

39 Generalized suffix tree a bα,5 b a bbα,3 baabbα,1 bα,6 a baabbα,2 b bbα,4 bα,7 α,7 b aaβ,1 a β,2 a β,3 a β,4 Generalized suffix tree of ababaabbαaabaaβ :

40 Applications of Suffix trees 1. Exact string matching Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab? a ba baas,1 as,3 ba baas,2 as,4 s,6 as,5 s,7 …
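
The query on this slide can be tried with a small Python sketch. For brevity it assumes an uncompressed suffix trie (one character per edge, nested dictionaries) instead of the compact suffix tree drawn on the slides; `build_suffix_trie` and `contains` are hypothetical names. A pattern occurs in the text exactly when it can be spelled out from the root.

```python
def build_suffix_trie(text):
    """Insert every suffix of text, one character per edge (quadratic size)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """True if pattern can be read from the root, i.e. it is a substring."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("ababaas")
for pattern in ("abab", "aab", "ab"):
    print(pattern, contains(trie, pattern))
# abab True
# aab False
# ab True
```

Once the structure is built, each query costs time proportional to the pattern length, independent of the text length.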

41 Applications of Suffix trees 2. The substring problem for a database of patterns DB Does the DB contain any occurrence of the patterns abab, aab, and ab? a bα,5 b a bbα,3 baabbα,1 bα,6 a baabbα,2 b bbα,4 bα,7 α,7 b aaβ,1 a β,2 a β,3 a β,4
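
A hedged sketch of the database query follows. It assumes the database holds the two strings of the preceding slides, ababaabb and aabaa, and it uses an uncompressed generalized suffix trie in which every node records the strings passing through it; these per-node sets play the role of the terminators α and β. The names `build_generalized_trie` and `strings_containing` are hypothetical.

```python
def build_generalized_trie(strings):
    """Suffix trie of several strings; node['ids'] lists the strings that reach it."""
    root = {"ids": set(), "next": {}}
    for sid, text in enumerate(strings):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["next"].setdefault(ch, {"ids": set(), "next": {}})
                node["ids"].add(sid)
    return root

def strings_containing(trie, pattern):
    """Indices of the database strings that contain the (non-empty) pattern."""
    node = trie
    for ch in pattern:
        node = node["next"].get(ch)
        if node is None:
            return set()
    return set(node["ids"])

db = ["ababaabb", "aabaa"]
trie = build_generalized_trie(db)
for pattern in ("abab", "aab", "ab"):
    print(pattern, sorted(strings_containing(trie, pattern)))
# abab [0]
# aab [0, 1]
# ab [0, 1]
```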

42 Applications of Suffix trees 3. The longest common substring of two strings a bα,5 b a bbα,3 baabbα,1 bα,6 a baabbα,2 b bbα,4 bα,7 α,7 b aaβ,1 a β,2 a β,3 a β,4
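
One way to read this application: the longest common substring is the deepest point of the generalized suffix tree that is reached by suffixes of both strings. The sketch below assumes an uncompressed trie with per-node string sets instead of the compact tree with terminators, and `longest_common_substring` is a hypothetical name. If several common substrings share the maximum length, it returns one of them.

```python
def longest_common_substring(a, b):
    # Build the generalized suffix trie of a and b (quadratic size).
    root = {"ids": set(), "next": {}}
    for sid, text in enumerate((a, b)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["next"].setdefault(ch, {"ids": set(), "next": {}})
                node["ids"].add(sid)
    # Depth-first search restricted to nodes reached by both strings.
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for ch, child in node["next"].items():
            if child["ids"] == {0, 1}:
                stack.append((child, path + ch))
    return best

print(longest_common_substring("ababaabb", "aabaa"))   # abaa
```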

43 Applications of Suffix trees 4. Finding the maximal repeats. a bα,5 b a bbα,3 baabbα,1 bα,6 a baabbα,2 b bbα,4 bα,7 α,7 b aaβ,1 a β,2 a β,3 a β,4

44 Applications of Suffix trees 5. Finding MUMs. Third lecture: Second part: Alignment of genomes: MUMs

45 Dynamic programming Short sequences (up to 10,000 bps) can be aligned using dynamic programming, at a quadratic cost in space and time. accaccacaccacaacgagcata … acctgagcgatat acc..tacc..t
acc.................................agt
|||.................................|xx
acc.................................a--
What about genomes?
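
To make the quadratic cost concrete, here is a minimal dynamic-programming sketch. It computes the unit-cost edit distance, which is an assumption made for illustration: the slide does not specify a scoring scheme, and `edit_distance` is a hypothetical name. The table has (m+1) x (n+1) cells and each cell is filled in constant time, hence quadratic space and time.

```python
def edit_distance(a, b):
    m, n = len(a), len(b)
    # table[i][j] = cost of aligning the first i characters of a with the first j of b
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        table[i][0] = i          # delete i characters
    for j in range(n + 1):
        table[0][j] = j          # insert j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            mismatch = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,             # gap in b
                table[i][j - 1] + 1,             # gap in a
                table[i - 1][j - 1] + mismatch,  # match or substitution
            )
    return table[m][n]

print(edit_distance("acct", "acc"))   # 1
```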

46 Genomic sequences Genomic sequences have millions of base pairs. The length of the sequences is 1000 times longer. The running time is 1,000,000 times higher! (1 second becomes about 11 days) (1 minute becomes about 2 years) In which cases can dynamic programming be applied?
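
The figures on this slide follow from the quadratic cost: sequences 1000 times longer need 1000 x 1000 times more work. A quick check of the arithmetic, assuming a 365-day year:

```python
length_factor = 1000
time_factor = length_factor ** 2            # quadratic cost: 1,000,000

seconds_per_day = 24 * 60 * 60
days = 1 * time_factor / seconds_per_day                   # a 1-second job
years = 60 * time_factor / (seconds_per_day * 365)         # a 1-minute job

print(f"1 second becomes {days:.1f} days")    # 11.6 days
print(f"1 minute becomes {years:.1f} years")  # 1.9 years
```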

47 First assumption [Figure: diagrams of Genome A and Genome B]

48 Realistic assumption? Unrealistic assumption! More realistic assumption [Figure: three diagrams of Genome A and Genome B]

49 Realistic assumptions? Unrealistic assumption! More realistic assumption But now, is it a real case? [Figure: three diagrams of Genome A and Genome B]

50 Preview in a real case Chlamydia muridarum: 1,084,689 bps Chlamydia trachomatis: 1,057,413 bps

51 Preview in a real case Pyrococcus abyssi: 1,790,334 bps Pyrococcus horikoshii: 1,763,341 bps

52 MUM: Maximal Unique Matching … a a t g….c t g... … c g t g….c c c...

53 Search for MUMs Given strings ababaabs and aabaat: [Figure: generalized suffix tree of ababaabs and aabaat] 1st: Bottom-up traversal (through the tree). List of UM: aab, abaa, baa. 2nd: Search for maximals (through the list of UM). MUMs: aab, abaa.
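
The two steps on this slide can be checked with a brute-force Python sketch. It is not the suffix-tree traversal of the lecture: it enumerates substrings directly, which is only practical for short strings. The names `occurrences`, `unique_matches` and `mums` are hypothetical. A unique match (UM) is a substring occurring exactly once in each string; a MUM is a UM that cannot be extended to the left or to the right at its matched positions.

```python
def occurrences(text, sub):
    """Start positions of all (possibly overlapping) occurrences of sub."""
    hits, pos = [], text.find(sub)
    while pos != -1:
        hits.append(pos)
        pos = text.find(sub, pos + 1)
    return hits

def unique_matches(a, b):
    """Substrings occurring exactly once in a and once in b, with their positions."""
    subs = {a[i:j] for i in range(len(a)) for j in range(i + 1, len(a) + 1)}
    result = {}
    for sub in subs:
        in_a, in_b = occurrences(a, sub), occurrences(b, sub)
        if len(in_a) == 1 and len(in_b) == 1:
            result[sub] = (in_a[0], in_b[0])
    return result

def mums(a, b):
    """Unique matches that cannot be extended on either side."""
    found = []
    for sub, (i, j) in unique_matches(a, b).items():
        end_a, end_b = i + len(sub), j + len(sub)
        extends_left = i > 0 and j > 0 and a[i - 1] == b[j - 1]
        extends_right = end_a < len(a) and end_b < len(b) and a[end_a] == b[end_b]
        if not extends_left and not extends_right:
            found.append(sub)
    return sorted(found)

print(sorted(unique_matches("ababaabs", "aabaat")))   # ['aab', 'abaa', 'baa']
print(mums("ababaabs", "aabaat"))                     # ['aab', 'abaa']
```

Here baa is dropped because it can be extended to the left by a in both strings, giving abaa.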

54 Preview of many genomes

55 List of works

56 Image and interface accgc…….cttgc...tccgg……ccaac...

57 Computational and biological background (3) Chlamydophila pneumoniae AR39: 1,247,420 bps Chlamydia pneumoniae: 1,247,805 bps Chlamydia muridarum: 1,084,689 bps Chlamydia trachomatis: 1,057,413 bps

58 Alignment revisited Pyrococcus abyssi: 1,790,334 bps Pyrococcus horikoshii: 1,763,341 bps

