Master Course MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen)

1 Master Course MSc Bioinformatics for Health Sciences H15: Algorithms on strings and sequences Xavier Messeguer Peypoch (http://www.lsi.upc.es/~alggen) Dep. de Llenguatges i Sistemes Informàtics CEPBA-IBM Research Institute Universitat Politècnica de Catalunya

2 Contents
1. (Exact) String matching of one pattern
2. (Exact) String matching of many patterns
3. Approximate string matching (Dynamic programming)
4. Pairwise and multiple alignment
5. Suffix trees and MUMs
References:
- Flexible pattern matching in strings, G. Navarro and M. Raffinot, Cambridge University Press, 2002
- Algorithms on strings, trees and sequences, D. Gusfield, Cambridge University Press, 1997

3 Master Course Fourth lecture: Examples

4 Example 1: Assume that you have a transcription factor atc(g|a)(t|g|a)gt whose occurrences are to be searched for in a text of length 1500 bp: - What is the best strategy? - How many random occurrences will appear?
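A minimal sketch of one way to approach Example 1, not the lecturer's own solution. It assumes the text is a single strand of uniformly distributed, independent bases (each base has probability 1/4), which is a simplification of real promoter sequence. Under that assumption a random 7-mer matches with probability (1/4)^5 · (2/4) · (3/4) = 3/8192, and a 1500 bp text has 1494 starting positions, giving roughly 0.55 expected random occurrences. The function names are illustrative.

```python
import re
from fractions import Fraction

# Motif taken from the slide; it spans 7 bases.
MOTIF = "atc(g|a)(t|g|a)gt"
MOTIF_LENGTH = 7


def match_probability():
    """Probability that a random 7-mer matches (assumes uniform i.i.d. bases)."""
    # Five fixed positions (1/4 each), one position with 2 allowed bases,
    # one position with 3 allowed bases.
    return Fraction(1, 4) ** 5 * Fraction(2, 4) * Fraction(3, 4)


def expected_random_occurrences(text_length=1500):
    """Expected number of chance matches on one strand."""
    positions = text_length - MOTIF_LENGTH + 1
    return positions * match_probability()


def find_occurrences(text):
    """0-based start offsets of all matches, overlapping ones included."""
    pattern = re.compile("(?=" + MOTIF + ")")
    return [m.start() for m in pattern.finditer(text)]
```

For a single short pattern like this one, a regular-expression scan (or any single-pattern matcher) over 1500 bp is already fast; the interesting part is that about half a chance hit is expected even in random sequence.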

5 Example 2: Assume that you have 100 transcription factors such as atc(g|a)(t|g|a)gt whose occurrences are to be searched for in a text of length 1500 bp: - What is the best strategy? - How many random occurrences will appear?
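One possible strategy for many patterns, sketched under assumptions that are not on the slide: all motifs have the same length and are written in the slide's `(x|y)` notation. Each motif is expanded into its concrete strings, these are stored in a dictionary, and the text is scanned once with a sliding window. The motif names `TF1` and `TF2` in the usage note are hypothetical. If all 100 motifs had the same degeneracy as the one shown, and bases were uniform and independent, the expected number of chance hits would be about 100 × 0.55 ≈ 55.

```python
from itertools import product


def expand_motif(motif):
    """Expand a motif such as 'atc(g|a)(t|g|a)gt' into all concrete strings."""
    choices = []
    i = 0
    while i < len(motif):
        if motif[i] == "(":
            j = motif.index(")", i)
            choices.append(motif[i + 1:j].split("|"))
            i = j + 1
        else:
            choices.append([motif[i]])
            i += 1
    return ["".join(parts) for parts in product(*choices)]


def build_index(motifs):
    """Map every concrete k-mer to the names of the motifs that produce it."""
    index = {}
    for name, motif in motifs.items():
        for kmer in expand_motif(motif):
            index.setdefault(kmer, set()).add(name)
    return index


def scan(text, index, k):
    """Slide a window of length k over the text once; return (offset, name) hits."""
    hits = []
    for i in range(len(text) - k + 1):
        for name in sorted(index.get(text[i:i + k], ())):
            hits.append((i, name))
    return hits
```

For example, `scan("ccatcgtgtgggcttt", build_index({"TF1": "atc(g|a)(t|g|a)gt", "TF2": "ggg(a|c)ttt"}), 7)` reports one hit per motif. The cost of the scan is independent of the number of motifs, which is the point of multiple-pattern matching.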

6 Example 3: Assume that you have 100 transcription factors such as atc(g|a)(t|g|a)gt whose occurrences are to be searched for in 50 promoter regions of 1500 bp each: - What is the best strategy?

7 Example 4: Assume that you have a transcription factor described by the count matrix
a 0 0 8 0 8 0 0 0
c 0 8 0 0 0 0 0 0
t 0 0 0 8 0 8 0 0
g 8 0 0 0 0 0 8 8
whose occurrences are to be searched for in a text of length 1500 bp: - What is the best strategy? - How many random occurrences will appear?
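A short sketch for Example 4, offered as one reading of the slide rather than its official answer. Every column of the matrix puts all 8 counts on a single base, so the matrix is equivalent to one exact pattern of length 8, and exact single-pattern matching applies. The expected number of chance matches again assumes uniform, independent bases on one strand.

```python
from fractions import Fraction

# Count matrix copied from the slide (rows a, c, t, g; 8 columns).
COUNTS = {
    "a": [0, 0, 8, 0, 8, 0, 0, 0],
    "c": [0, 8, 0, 0, 0, 0, 0, 0],
    "t": [0, 0, 0, 8, 0, 8, 0, 0],
    "g": [8, 0, 0, 0, 0, 0, 8, 8],
}


def consensus(counts):
    """Most frequent base in each column."""
    length = len(next(iter(counts.values())))
    return "".join(
        max(counts, key=lambda base: counts[base][i]) for i in range(length)
    )


def expected_random_occurrences(pattern_length, text_length=1500):
    """Expected chance matches of one exact pattern (uniform i.i.d. bases)."""
    positions = text_length - pattern_length + 1
    return positions * Fraction(1, 4) ** pattern_length
```

Here `consensus(COUNTS)` gives `gcatatgg`, and the expected number of random occurrences in 1500 bp is 1493/65536, about 0.023.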

8 Example 5: Assume that you have a transcription factor described by the count matrix
a 0 0 7 0 8 0 5 3
c 4 3 0 7 0 1 2 2
t 2 3 1 1 0 4 1 0
g 3 2 0 0 0 3 0 4
whose occurrences are to be searched for in a text of length 1500 bp: - What is the best strategy?
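When the matrix is not degenerate, exact matching no longer applies and each window has to be scored. The sketch below shows one common way to do that, a log-odds position weight matrix. The pseudocount of 1, the uniform background of 0.25, and the score threshold are all assumptions chosen for illustration, not values from the lecture. Counts are normalised per column because the columns on the slide do not all sum to the same total (the first and last sum to 9, the others to 8).

```python
import math

# Count matrix copied from the slide (rows a, c, t, g; 8 columns).
COUNTS = {
    "a": [0, 0, 7, 0, 8, 0, 5, 3],
    "c": [4, 3, 0, 7, 0, 1, 2, 2],
    "t": [2, 3, 1, 1, 0, 4, 1, 0],
    "g": [3, 2, 0, 0, 0, 3, 0, 4],
}


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """One dict per column mapping base -> log2(frequency / background)."""
    length = len(next(iter(counts.values())))
    matrix = []
    for i in range(length):
        total = sum(counts[b][i] for b in counts) + pseudocount * len(counts)
        matrix.append({
            b: math.log2((counts[b][i] + pseudocount) / total / background)
            for b in counts
        })
    return matrix


def score(window, matrix):
    """Sum of the column scores for a window as long as the matrix."""
    return sum(column[base] for column, base in zip(matrix, window))


def scan(text, matrix, threshold):
    """(offset, score) for every window scoring at least the threshold."""
    k = len(matrix)
    hits = []
    for i in range(len(text) - k + 1):
        s = score(text[i:i + k], matrix)
        if s >= threshold:
            hits.append((i, s))
    return hits
```

The threshold controls the trade-off between missed sites and chance hits; lowering it increases the expected number of random occurrences.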

9 Example 6: Assume that you have two short DNA sequences and you need to compare them. In each case what are you doing? - Using global pairwise alignment. - Using local pairwise alignment. - Using suffix trees. - Using frequency table of l-mers.

10 Example 7: Assume that you have two genomic DNA sequences and you need to compare them. In each case what are you doing? - Using global pairwise alignment. - Using local pairwise alignment. - Using suffix trees. - Using frequency table of l-mers.
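Of the four options, the frequency table of l-mers is the simplest to illustrate. The sketch below builds the table for each sequence and compares the two tables with cosine similarity. The similarity measure is an assumption for illustration, since the slide does not say how the tables should be compared. This approach is alignment-free: it measures compositional similarity and says nothing about where the sequences agree.

```python
import math
from collections import Counter


def lmer_counts(sequence, l):
    """Frequency table of all overlapping substrings of length l."""
    return Counter(sequence[i:i + l] for i in range(len(sequence) - l + 1))


def cosine_similarity(first, second, l):
    """Cosine of the angle between the two l-mer count vectors (0.0 to 1.0)."""
    a, b = lmer_counts(first, l), lmer_counts(second, l)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0
```

Building the tables takes time linear in the sequence lengths, so the method scales to genomic sequences, unlike quadratic pairwise alignment.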

11 Suffix trees Given string ababaas: Suffixes: 1: ababaas 2: babaas 3: abaas 4: baas 5: aas 6: as 7: s
Suffix tree (edge label, leaf = suffix number):
root
  a
    ba
      baas,1
      as,3
    as,5
    s,6
  ba
    baas,2
    as,4
  s,7
What kind of queries?

12 Applications of Suffix trees 1. Exact string matching: Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab? [Figure: suffix tree of ababaas, as on the previous slide]
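The query on this slide can be tried with a naive suffix trie, sketched below. This is a stand-in for a real suffix tree: it is uncompressed and takes quadratic time and space to build, whereas a suffix tree can be built in linear time. The query itself works the same way: follow the pattern from the root, and every suffix below the node reached is an occurrence. Positions are 1-based to match the suffix numbers on the slides.

```python
def build_suffix_trie(text):
    """Naive suffix trie; each node records the suffixes passing through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(
                ch, {"children": {}, "starts": []}
            )
            node["starts"].append(start + 1)  # 1-based suffix number
    return root


def occurrences(trie, pattern):
    """1-based start positions of a non-empty pattern, or [] if absent."""
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return sorted(node["starts"])
```

With `trie = build_suffix_trie("ababaas")`, the pattern abab occurs at position 1, ab occurs at positions 1 and 3, and aab does not occur. Each query costs time proportional to the pattern length, independent of the text length.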

13 Applications of Suffix trees 2. The substring problem for a database of patterns DB: Does the DB contain any occurrence of the patterns abab, aab, and ab? [Figure: suffix tree built over two strings, with leaf labels ending in α or β]

14 Applications of Suffix trees 3. The longest common substring of two strings [Figure: suffix tree built over two strings, with leaf labels ending in α or β]
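The idea behind this application can be sketched with a naive trie holding the suffixes of both strings: a node whose subtree contains suffixes from both strings spells a common substring, and the deepest such node spells a longest one. This is a quadratic stand-in for the generalized suffix tree, which solves the problem in linear time. The strings in the usage note are chosen for illustration and are not taken from the figure.

```python
def _insert_suffixes(root, text, owner):
    """Add every suffix of text, marking each visited node with its owner."""
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(
                ch, {"children": {}, "owners": set()}
            )
            node["owners"].add(owner)


def longest_common_substring(first, second):
    """One longest substring shared by both strings ('' if there is none)."""
    root = {"children": {}, "owners": set()}
    _insert_suffixes(root, first, 0)
    _insert_suffixes(root, second, 1)
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["children"].items():
            # Only descend while the path occurs in both strings.
            if child["owners"] == {0, 1}:
                path = label + ch
                if len(path) > len(best):
                    best = path
                stack.append((child, path))
    return best
```

For instance, `longest_common_substring("ababaab", "aabaa")` returns `abaa`. When several common substrings share the maximum length, the function returns just one of them.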

15 Applications of Suffix trees 4. Finding the maximal repeats. [Figure: suffix tree built over two strings, with leaf labels ending in α or β]

16 Applications of Suffix trees 5. Finding MUMs. Third lecture: Second part: Alignment of genomes: MUMs

17 Dynamic programming Short sequences (up to 10,000 bp) can be aligned using dynamic programming, at a quadratic cost in space and time. What about genomes?
acc.................................agt
|||.................................|xx
acc.................................a--
[Figure: example sequences accaccacaccacaacgagcata … acctgagcgatat and acc..t]
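The quadratic cost is visible in a minimal global alignment sketch: the table has one cell per pair of positions, and each cell is filled in constant time. The scoring values (match +1, mismatch -1, gap -1) are assumptions for illustration; the slide does not specify a scoring scheme.

```python
def global_alignment_score(first, second, match=1, mismatch=-1, gap=-1):
    """Best global alignment score using a full (n+1) x (m+1) table."""
    rows, cols = len(first) + 1, len(second) + 1
    table = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per base.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if first[i - 1] == second[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align the two bases
                table[i - 1][j] + gap,       # gap in the second sequence
                table[i][j - 1] + gap,       # gap in the first sequence
            )
    return table[-1][-1]
```

Two sequences of 10,000 bp need a table of about 10^8 cells, which is feasible; two genomes of a few million bp would need on the order of 10^12 to 10^13 cells, which is why other techniques are needed.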

18 Genomic sequences In which cases can dynamic programming be applied? Genomic sequences have millions of base pairs. The sequences are 1000 times longer, so the running time is 1,000,000 times higher! (1 second becomes 11 days; 1 minute becomes 2 years.)
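The figures on this slide follow directly from the quadratic cost and can be checked with a few lines of arithmetic. The sketch assumes the running time scales exactly with the square of the sequence length and uses a 365-day year.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def scaled_runtime_seconds(base_seconds, length_factor):
    """Running time after both sequences grow by length_factor (quadratic cost)."""
    return base_seconds * length_factor ** 2


def in_days(seconds):
    return seconds / SECONDS_PER_DAY


def in_years(seconds):
    return seconds / SECONDS_PER_YEAR
```

With a length factor of 1000, one second becomes 10^6 seconds (about 11.6 days) and one minute becomes 6 × 10^7 seconds (about 1.9 years), in line with the rounded values on the slide.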

19 First assumption [Figure: Genome A and Genome B shown as dotted lines]

20 Realistic assumption? Unrealistic assumption! More realistic assumption [Figure: Genome A and Genome B shown as dotted lines, in three arrangements]

21 Realistic assumptions? But now, is it a real case? Unrealistic assumption! More realistic assumption [Figure: Genome A and Genome B shown as dotted lines, in three arrangements]

22 Preview in a real case Chlamydia muridarum: 1,084,689 bp Chlamydia trachomatis: 1,057,413 bp

23 Preview in a real case Pyrococcus abyssi: 1,790,334 bp Pyrococcus horikoshii: 1,763,341 bp

24 MUM: Maximal Unique Matching
… a a t g….c t g...
… c g t g….c c c...

25 Search for MUMs Given strings ababaabs and aabaat: 1st: Bottom-up traversal (through the tree). List of UM: aab, abaa, baa. 2nd: Search for maximals (through the list of UM). MUMs: aab, abaa. [Figure: suffix tree of the two strings]
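The result on this slide can be reproduced with a brute-force sketch. It does not use the suffix tree traversal described above; it simply enumerates substrings, which is far too slow for genomes but makes the two conditions explicit: a unique match occurs exactly once in each string, and a MUM is a unique match that cannot be extended to the left or to the right.

```python
def count_occurrences(text, pattern):
    """Number of occurrences of pattern in text, overlapping ones included."""
    return sum(
        text.startswith(pattern, i)
        for i in range(len(text) - len(pattern) + 1)
    )


def find_mums(first, second):
    """Return (unique matches, MUMs), both as sorted lists of strings."""
    candidates = {
        first[i:j]
        for i in range(len(first))
        for j in range(i + 1, len(first) + 1)
    }
    unique_matches = sorted(
        c for c in candidates
        if count_occurrences(first, c) == 1
        and count_occurrences(second, c) == 1
    )
    mums = []
    for match in unique_matches:
        i, j = first.find(match), second.find(match)
        end_i, end_j = i + len(match), j + len(match)
        extends_left = i > 0 and j > 0 and first[i - 1] == second[j - 1]
        extends_right = (
            end_i < len(first)
            and end_j < len(second)
            and first[end_i] == second[end_j]
        )
        if not (extends_left or extends_right):
            mums.append(match)
    return unique_matches, mums
```

Calling `find_mums("ababaabs", "aabaat")` gives the unique matches aab, abaa, baa and the MUMs aab, abaa, as listed on the slide. The match baa is dropped because both of its occurrences are preceded by a, so it extends to abaa.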

26 Preview of many genomes

27 List of works

28 Image and interface accgc…….cttgc...tccgg……ccaac...

29 Computational and biological background (3) Chlamydophila pneumoniae AR39: 1,247,420 bp Chlamydia pneumoniae: 1,247,805 bp Chlamydia muridarum: 1,084,689 bp Chlamydia trachomatis: 1,057,413 bp

30 Alignment revisited Pyrococcus abyssi: 1,790,334 bp Pyrococcus horikoshii: 1,763,341 bp

