Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders Weiwei Zhong.


1 Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders Weiwei Zhong

2 Topics: Background, Algorithm Design, Test Results

3 Background Definitions

4 What is a Sequence Alignment?
Given:
    2 or more sequences
    a scoring scheme (match score, mismatch score, gap penalty)
Insert gaps in each sequence, so that:
    all sequences have the same length
    the pairing score is maximal

5 Scoring Matrix
Simplified scoring: match = 2, mismatch = -1, gap penalty = -2
In practice: a scoring matrix
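
A minimal sketch of the simplified scoring scheme, applied to a pair of sequences that already contain gaps. The function name `score_pair` is illustrative and does not come from the slides; the sketch assumes a column never pairs a gap with a gap.

```python
def score_pair(a, b, match=2, mismatch=-1, gap=-2):
    """Sum the column scores of two gapped sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for ca, cb in zip(a, b):
        if ca == "-" or cb == "-":
            total += gap        # a residue paired with a gap
        elif ca == cb:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# Six matching columns and one gap column: 6 * 2 - 2
print(score_pair("FGK-GKG", "FGKFGKG"))  # prints 10
```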

6 Global vs. Local Alignments
Global: entire lengths of sequences
    F G K - G K G
    F G K F G K G
Local: regions of sequences
    - - - F G K G K G
    F G K F G K G - -

7 Pairwise Alignment vs. Multiple Sequence Alignment (MSA)
Pairwise: 2 sequences
    F G K - G K G
    F G K F G K G
MSA: more than 2 sequences
    F G K - G K G
    F G K F G K G
    - G K Q G K G
    - - K F G K G

8 Background Basic Dynamic Programming

9 Dynamic Programming Algorithm for Pairwise Alignments
Two sequences: GAATTC, GGATC
Scoring scheme: match = 2, mismatch = -1, gap penalty = -2
1. Initialization
          G  A  A  T  T  C
       0  0  0  0  0  0  0
    G  0
    G  0
    A  0
    T  0
    C  0

10 2. Table fill
Scoring scheme: match = 2, mismatch = -1, gap g = -2
Each cell M_{i,j} is computed from its diagonal, upper, and left neighbours:
\[ M_{i,j} = \max \{\, M_{i-1,j-1} + S(c_i, c_j),\; M_{i,j-1} + g,\; M_{i-1,j} + g \,\} \]
          G  A  A  T  T  C
       0  0  0  0  0  0  0
    G  0  2  0 -1 -1 -1 -1
    G  0  2  1 -1 -2 -2 -2
    A  0  0  4  3  1 -1 -3
    T  0 -1  2  3  5  3  1
    C  0 -1  0  1  3  4  5

11 3. Trace back
    G A A T T C
    |   |   | |
    G G A - T C
[Table: the filled score table for GAATTC (columns) and GGATC (rows), with the trace-back path marked]
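
The three steps (initialization, table fill, trace back) can be sketched as follows. This is an illustration under stated assumptions rather than the presenter's code: the first row and column are set to 0 as on the slide (a textbook global alignment would instead initialize them with multiples of the gap penalty), ties during trace back are broken in favour of the diagonal move, and the names `fill_table` and `trace_back` are made up for this example.

```python
def fill_table(x, y, match=2, mismatch=-1, gap=-2):
    """Fill the score table; x runs along the columns, y along the rows."""
    # Initialization: first row and first column stay 0, as on the slide.
    M = [[0] * (len(x) + 1) for _ in range(len(y) + 1)]
    for i in range(1, len(y) + 1):
        for j in range(1, len(x) + 1):
            s = match if y[i - 1] == x[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,  # pair y[i-1] with x[j-1]
                          M[i - 1][j] + gap,    # y[i-1] against a gap in x
                          M[i][j - 1] + gap)    # x[j-1] against a gap in y
    return M


def trace_back(M, x, y, match=2, mismatch=-1, gap=-2):
    """Walk from the bottom-right cell to the top-left, preferring diagonals."""
    i, j = len(y), len(x)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if y[i - 1] == x[j - 1] else mismatch
            if M[i][j] == M[i - 1][j - 1] + s:
                top.append(x[j - 1])
                bottom.append(y[i - 1])
                i, j = i - 1, j - 1
                continue
        if j > 0 and (i == 0 or M[i][j] == M[i][j - 1] + gap):
            top.append(x[j - 1])      # move left: gap in y
            bottom.append("-")
            j -= 1
        else:
            top.append("-")           # move up: gap in x
            bottom.append(y[i - 1])
            i -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


table = fill_table("GAATTC", "GGATC")
print(table[-1][-1])                          # prints 5
print(trace_back(table, "GAATTC", "GGATC"))   # prints ('GAATTC', 'GGA-TC')
```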

12 Multidimensional Dynamic Programming for MSA
For n strings of length L each, the running time is O(L^n).
Impractical: 5-7 proteins of 200-300 residues each.
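
To give a feel for the O(L^n) growth, the sketch below counts the cells of an n-dimensional table with L + 1 entries per axis. The helper name `table_cells` and the chosen values of n are illustrative only; real running time also depends on the work done per cell.

```python
def table_cells(length, n):
    """Cells in an n-dimensional DP table with (length + 1) entries per axis."""
    return (length + 1) ** n


# Table size for sequences of 300 residues as the number of sequences grows.
for n in (2, 3, 5, 7):
    print(n, f"{table_cells(300, n):.2e}")
```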

13 Topics: Background, Algorithm Design, Test Results

14 An MSA Heuristic Algorithm Design

15 Feng-Doolittle Progressive Alignment
1. Align 2 of the sequences, S_i and S_j
2. Align a 3rd sequence S_k to the alignment of S_i, S_j
3. Repeat 2 until all sequences are aligned
Running time: O(nL^2)
Scoring two alignment columns c_i and c_j, for example a column holding T and A against a column holding S:
    S(c_i, c_j) = (S(T, S) + S(A, S)) / 2

16 Features of Feng-Doolittle Algorithm
Once a gap, always a gap
Early mistakes cannot be corrected
Alignment order is important
Example:
    x: G A A G T T        x: G A A G T T
    y: G A C - T T        y: G A - C T T
    z: G A A C T G        z: G A A C T G

17 TspMsa: First Version Algorithm Design

18 Traveling Salesman Problem (TSP)
Given:
    n nodes
    distances for each pair of nodes
Find a round trip that:
    visits each node exactly once
    has minimal total length
NP-complete (in its decision form)
Well studied

19 TspMsa: Algorithm Design
1. Calculate pairwise distances
          0   1   2   3   4
     0    0   1  15  51  61
     1    1   0  14  24  58
     2   15  14   0  46  67
     3   51  24  46   0  38
     4   61  58  67  38   0
2. Determine a TSP tour
3. Feng-Doolittle alignment
Alignment order (labels as shown on the slide): 1 0 4 2 3
[Figure: nodes 0 to 4 drawn as a graph and as a TSP tour]
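
A small sketch of the "determine a TSP tour" step. It reads the digits on the slide as a symmetric 5 x 5 distance matrix (an interpretation of the extracted text) and finds the shortest round trip by brute force. Exhaustive search is used here only because five nodes make it trivial; the slides do not say which TSP solver TspMsa uses, and the names `DIST`, `tour_length`, and `shortest_tour` are illustrative.

```python
from itertools import permutations

# Pairwise distances for nodes 0..4, as read from the slide.
DIST = [
    [0, 1, 15, 51, 61],
    [1, 0, 14, 24, 58],
    [15, 14, 0, 46, 67],
    [51, 24, 46, 0, 38],
    [61, 58, 67, 38, 0],
]


def tour_length(tour, dist):
    """Length of the round trip, including the edge back to the start."""
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))


def shortest_tour(dist):
    """Try every ordering of nodes 1..n-1 after fixing node 0 as the start."""
    rest = range(1, len(dist))
    candidates = ((0,) + p for p in permutations(rest))
    return min(candidates, key=lambda t: tour_length(t, dist))


best = shortest_tour(DIST)
print(best, tour_length(best, DIST))  # prints (0, 1, 3, 4, 2) 145
```

Brute force examines (n - 1)! orderings, so it only serves to illustrate the idea on tiny inputs.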

20 Starting Point and Direction of TSP Tour
Data set: kinase_ref3
[Figure: TSP tour over the 23 sequences (nodes 0 to 22) with edge distances, and an alignment score for each starting point and direction; the scores shown range from 0.603 to 0.772]

21 TspMsa: Modified Design Algorithm Design

22 TspMsa: Modified Algorithm Design
1. Calculate pairwise distances
2. Determine a TSP tour
3. Align closest nodes
4. One node left? Yes: end. No: repeat 3.
Example:
    0 | 1 | 2 | 3 | 4          tour edges: 67, 15, 38, 1, 24
    (1, 0) | 2 | 3 | 4         tour edges: 67, 15, 38, 24
    (1, 0, 2) | 3 | 4          tour edges: 67, 38, 24
    (3, 1, 0, 2) | 4           tour edges: 67, 38
    (3, 1, 0, 2, 4)
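
The loop in the flowchart can be sketched as repeatedly merging the two groups joined by the shortest remaining tour edge. This is one reading of the slide, not the author's implementation: the distance matrix and the tour 0-1-3-4-2 are taken from the five-node example, distances between groups are assumed to stay equal to the original tour edges, and `merge_steps` is a hypothetical name. The actual alignment of each merged pair is left out.

```python
# Pairwise distances for nodes 0..4, as read from the five-node example.
DIST = [
    [0, 1, 15, 51, 61],
    [1, 0, 14, 24, 58],
    [15, 14, 0, 46, 67],
    [51, 24, 46, 0, 38],
    [61, 58, 67, 38, 0],
]


def merge_steps(tour, dist):
    """Return the sequence of (group, group) merges along a cyclic tour."""
    groups = [[node] for node in tour]  # groups kept in cyclic tour order
    steps = []
    while len(groups) > 1:
        n = len(groups)

        def edge(k):
            # Tour edge k joins the end of groups[k] to the start of the next group.
            return dist[groups[k][-1]][groups[(k + 1) % n][0]]

        k = min(range(n), key=edge)     # shortest remaining tour edge
        left, right = groups[k], groups[(k + 1) % n]
        steps.append((left, right))
        merged = left + right
        if k == n - 1:                  # the edge wraps around the cycle
            groups = [merged] + groups[1:n - 1]
        else:
            groups[k:k + 2] = [merged]
    return steps


for left, right in merge_steps([0, 1, 3, 4, 2], DIST):
    print(left, "+", right)
```

On this example the merges join 0 with 1, then add 2, then 3, then 4, which matches the order of the groups in the example above.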

23 Modified Algorithm is Better
Alignment order for Kinase_ref3
Original TspMsa: 0.603 (worst) to 0.772 (best)
Modified TspMsa: 0.836
[Figure: alignment order over nodes 0 to 22]

24 Topics: Background, Algorithm Design, Test Results

25 What to Compare With?

26 Existing MSA Programs
Progressive: multal, multalign, pileup, clustalw, poa
Iterative: prrp, saga, hmmt
[Figure labels: less computation time, better quality, best quality, fast]

27 CLUSTALW
1. Calculate pairwise distances
2. Derive a guide tree by the Neighbor Joining method
   Repeat until one node is left at the center:
   choose 2 closest nodes i and j, derive an internal node x
   \[ r_i = \frac{\sum_k d_{ik}}{n-2} \]
   \[ d_{ix} = \frac{d_{ij} + r_i - r_j}{2}, \qquad d_{jx} = d_{ij} - d_{ix} \]
   \[ d_{xm} = \frac{d_{im} + d_{jm} - d_{ij}}{2} \]
[Figure: star tree over nodes 1 to 9 resolved step by step into the guide tree]
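
A sketch of a single joining step using the formulas on this slide. The pair (i, j) is passed in rather than selected, because the slide says "closest nodes" while the standard Neighbor Joining criterion picks the pair minimizing d_ij - r_i - r_j; the function name `join_pair` and the four-node test matrix are assumptions made for the example.

```python
def join_pair(d, i, j):
    """Join nodes i and j into a new internal node x.

    d is a symmetric distance matrix (list of lists) with a zero diagonal.
    Returns the branch lengths d_ix and d_jx, and the distances d_xm from
    x to every remaining node m.
    """
    n = len(d)
    r_i = sum(d[i]) / (n - 2)
    r_j = sum(d[j]) / (n - 2)
    d_ix = (d[i][j] + r_i - r_j) / 2
    d_jx = d[i][j] - d_ix
    d_xm = {m: (d[i][m] + d[j][m] - d[i][j]) / 2
            for m in range(n) if m not in (i, j)}
    return d_ix, d_jx, d_xm


# Distances read off a made-up tree with branch lengths
# a=1, b=2 (joined at x), inner edge x-y=1, c=3, d=4 (joined at y).
D = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 7],
    [6, 7, 7, 0],
]
print(join_pair(D, 0, 1))  # prints (1.0, 2.0, {2: 4.0, 3: 5.0})
```

The recovered branch lengths 1 and 2 agree with the tree the distances were built from, which is the property the formulas are designed to have for tree-like distances.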

28 CLUSTALW
3. Progressively align all sequences following the guide tree
   Weighted sequences
   2 gap penalty values: opening, extension
   Dynamically changes the gap penalty and the scoring matrix
Example:
    1  p e e k s a v t a l
    2  g e e k a a v l a l
    3  e g e w q l v l h v
Without weights: Score = [S(t,v) + S(l,v)] / 2
With weights:    Score = [S(t,v)*w_1*w_3 + S(l,v)*w_2*w_3] / 2
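
The two score formulas can be sketched as one function that averages the weighted residue-pair scores between two alignment columns. This is only an illustration: `simple_score` is a stand-in for the real substitution matrix, the weight values are invented, gaps are not handled, and `column_score` is not a CLUSTALW function.

```python
def simple_score(a, b, match=2, mismatch=-1):
    """Stand-in for a substitution matrix lookup S(a, b)."""
    return match if a == b else mismatch


def column_score(col_a, col_b, weights_a=None, weights_b=None, score=simple_score):
    """Average of S(a, b) * w_a * w_b over all residue pairs of two columns."""
    weights_a = weights_a or [1.0] * len(col_a)   # no weights: every w = 1
    weights_b = weights_b or [1.0] * len(col_b)
    total = 0.0
    for a, w_a in zip(col_a, weights_a):
        for b, w_b in zip(col_b, weights_b):
            total += score(a, b) * w_a * w_b
    return total / (len(col_a) * len(col_b))


# Column (t, l) from sequences 1 and 2 against column (v) from sequence 3.
print(column_score(["t", "l"], ["v"]))                       # prints -1.0
print(column_score(["t", "l"], ["v"], [0.5, 0.25], [1.0]))   # prints -0.375
```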

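29 POA
1. Convert sequences to partial order graphs
    A single sequence, E T N K, becomes a linear graph: E, T, N, K
    An alignment becomes a graph that branches where the sequences differ:
        E T - - P K M I V R
        E T T H - K M L V R
[Figure: graph E, T, branch (P | T, H), K, M, branch (I | L), V, R]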

30 POA
2. Align 2 sequences
3. Align one sequence to the current group
4. Repeat 3 until all sequences are aligned
[Figure: sequence E T N K aligned to the graph E, T, (P | T, H), K]

31 Test Results Quality Evaluation

32 BAliBASE Benchmark
Reference 1: equidistant sequences with various levels of similarity
    < 25% sequence identity
    20-40% sequence identity
    > 35% sequence identity
Reference 2: closely related sequences with a highly divergent "orphan" sequence
Reference 3: subgroups with < 25% identity between groups
Reference 4: sequences with N/C-terminal extensions
Reference 5: sequences with internal insertions

33 Reference 1: Sequences with < 25% Identity [Charts: average score and all test scores for short, medium, and long sequences]

34 Reference 1: Sequences with 20-40% Identity [Charts: average score and all test scores for short, medium, and long sequences]

35 Reference 1: Sequences with > 35% Identity [Charts: average score and all test scores for short, medium, and long sequences]

36 Reference 2 [Charts: average score and all test scores for short, medium, and long sequences]

37 Reference 3 [Charts: average score and all test scores for short, medium, and long sequences]

38 Reference 4 and Reference 5 [Charts: average score and all test scores for Reference 4 and Reference 5]

39 Alignment Quality Comparison
Reference 1:
    < 25% identity: Similar *
    20-40% identity: Similar *
    > 35% identity: Similar
Reference 2: Similar *
Reference 3: TspMsa better
Reference 4: CLUSTALW better
Reference 5: Similar
* CLUSTALW slightly better for short sequences.
TspMsa and POA: TspMsa better
TspMsa and CLUSTALW: comparable

40 Test Results Execution Time Evaluation

41 Fast Mode TspMsa
Most time consuming step: pairwise distance calculations
Slow mode: full dynamic programming (accurate)
Fast mode: a fast approximate method (heuristic)

42 Quality Impact of the Fast Mode

43 Execution Time Evaluation CLUSTALW and TspMsa in fast mode

44 Conclusions
QUALITY
    Slow mode: close to CLUSTALW (slow mode), better than POA
    Fast mode (not as good as slow mode): comparable to CLUSTALW (fast mode), better than POA
SPEED
    Fast mode: faster than CLUSTALW (fast mode), comparable to POA

45 Acknowledgement
Dr. Robert Robinson, Dr. Russell Malmberg, Dr. Eileen Kraemer
Computer Science Department

