Using Traveling Salesman Problem Algorithms to Determine Multiple Sequence Alignment Orders
Weiwei Zhong
Topics: Background, Algorithm Design, Test Results
Background: Definitions
What is a Sequence Alignment?
Given: 2 or more sequences and a scoring scheme (match score, mismatch score, gap penalty).
Task: insert gaps into each sequence so that all sequences have the same length and the pairing score is maximal.
Scoring Matrix
Simplified scoring: match = 2, mismatch = -1, gap penalty = -2.
In practice: a full scoring matrix gives a separate score for each pair of residues.
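The simplified scoring scheme can be tried out on an alignment that is already given. The following is a minimal Python sketch, not code from the talk: the function names `pair_score` and `alignment_score` are made up for illustration, and `-` is assumed to mark a gap.

```python
# Simplified scoring scheme from the slide.
MATCH, MISMATCH, GAP = 2, -1, -2


def pair_score(a: str, b: str) -> int:
    """Score one aligned column of two sequences ('-' marks a gap)."""
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def alignment_score(row1: str, row2: str) -> int:
    """Sum the column scores of two aligned rows of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    return sum(pair_score(a, b) for a, b in zip(row1, row2))


print(alignment_score("FGK-GKG", "FGKFGKG"))  # six matches and one gap
```

A real substitution matrix would replace the two constants `MATCH` and `MISMATCH` with a lookup keyed on the residue pair.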
Global vs. Local Alignments
Global: aligns the entire lengths of the sequences, for example
F G K - G K G
F G K F G K G
Local: aligns regions of the sequences (example rows: F G K G K G / F G K F G K G - -)
Pairwise Alignment vs. Multiple Sequence Alignment (MSA)
Pairwise: 2 sequences
F G K G K G
F G K F G K G
MSA: more than 2 sequences
F G K G K G
F G K F G K G
- G K Q G K G
- - K F G K G
Background: Basic Dynamic Programming
Dynamic Programming Algorithm for Pairwise Alignments
Two sequences: GAATTC and GGATC
Scoring scheme: match = 2, mismatch = -1, gap penalty g
1. Initialization
(Table: columns G A A T T C, rows G G A T C)
2. Table fill
\[
M_{i,j} = \max\bigl\{\, M_{i-1,j-1} + S(c_i, c_j),\; M_{i,j-1} + g,\; M_{i-1,j} + g \,\bigr\}
\]
where c_i and c_j are the residues compared at cell (i, j).
3. Trace back
G A A T T C
G G A - T C
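The three steps (initialization, table fill, trace back) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the code behind the slides: the gap penalty of -2 is borrowed from the simplified scoring slide, the first row and column are initialized with multiples of the gap penalty (the usual choice for global alignment), and the function name `needleman_wunsch` is chosen here.

```python
def needleman_wunsch(s, t, match=2, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns (score, aligned_s, aligned_t)."""
    rows, cols = len(s) + 1, len(t) + 1

    # 1. Initialization: leading gaps cost one gap penalty each (assumption).
    M = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        M[i][0] = i * gap
    for j in range(1, cols):
        M[0][j] = j * gap

    # 2. Table fill: best of diagonal (pair), up (gap in t), left (gap in s).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + sub,
                          M[i - 1][j] + gap,
                          M[i][j - 1] + gap)

    # 3. Trace back from the bottom-right cell to the top-left cell.
    a, b = [], []
    i, j = len(s), len(t)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + sub:
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return M[-1][-1], "".join(reversed(a)), "".join(reversed(b))


score, top, bottom = needleman_wunsch("GAATTC", "GGATC")
print(score)
print(top)
print(bottom)
```

Several alignments can share the optimal score; which one the trace back returns depends on the order in which the three moves are tested.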
Multidimensional Dynamic Programming for MSA
For n strings of length L each, the running time is $O(L^n)$.
Impractical: 5-7 proteins of residues each.
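To get a feel for the $O(L^n)$ growth, the number of table cells can simply be counted. The sketch below uses a hypothetical sequence length of 100, chosen only for illustration and not taken from the slides; `dp_cells` is a made-up helper name.

```python
def dp_cells(length: int, n_sequences: int) -> int:
    """Approximate number of cells in an n-dimensional DP table."""
    return length ** n_sequences


# Hypothetical length of 100 residues per sequence.
for n in range(2, 8):
    print(n, dp_cells(100, n))
```

Each additional sequence multiplies the table size by the sequence length, which is why heuristics are used for more than a handful of sequences.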
Algorithm Design: An MSA Heuristic
Feng-Doolittle Progressive Alignment
1. Align 2 of the sequences, S_i and S_j.
2. Align a 3rd sequence S_k to the alignment of S_i and S_j.
3. Repeat step 2 until all sequences are aligned.
Running time: $O(nL^2)$
Scoring two columns c_i and c_j, one containing T and A and the other containing S: $S(c_i, c_j) = (S(T, S) + S(A, S)) / 2$
Features of Feng-Doolittle Algorithm
Alignment 1:
x: G A A G T T
y: G A C - T T
z: G A A C T G
Alignment 2:
x: G A A G T T
y: G A - C T T
z: G A A C T G
Once a gap, always a gap.
Early mistakes cannot be corrected.
Alignment order is important.
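The column score on this slide is an average over all residue pairs drawn from the two columns. A minimal Python sketch of that averaging follows; it assumes the columns contain residues only (no gap handling) and uses a toy match/mismatch score in place of a real substitution matrix. The names `substitution` and `column_score` are illustrative.

```python
from itertools import product


def substitution(a: str, b: str, match: int = 2, mismatch: int = -1) -> int:
    """Toy residue score; a real scoring matrix would be looked up here."""
    return match if a == b else mismatch


def column_score(col_i: str, col_j: str, score=substitution) -> float:
    """Average the residue scores over every pair (one residue per column)."""
    pairs = list(product(col_i, col_j))
    return sum(score(a, b) for a, b in pairs) / len(pairs)


# Slide example: one column holds T and A, the other holds S.
print(column_score("TA", "S"))  # (S(T,S) + S(A,S)) / 2
```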
Algorithm Design: TspMsa, First Version
Traveling Salesman Problem (TSP)
Given: n nodes and the distance between each pair of nodes.
Find: a round trip that visits each node exactly once and has minimal total length.
NP-hard (the decision version is NP-complete) and well studied.
TspMsa: Algorithm Design
1. Calculate pairwise distances.
2. Determine a TSP tour.
3. Take the alignment order from the tour.
4. Run the Feng-Doolittle alignment in that order.
Starting Point and Direction of TSP Tour (data set: kinase_ref3)
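For a handful of nodes the problem statement can be turned directly into an exhaustive search. The sketch below is only an illustration of the definition, not the TSP solver used in the talk; the distance matrix is made up, and `tour_length` and `brute_force_tsp` are names chosen here.

```python
from itertools import permutations


def tour_length(tour, dist):
    """Length of the round trip, including the edge back to the start."""
    n = len(tour)
    return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))


def brute_force_tsp(dist):
    """Try every tour that starts at node 0; feasible only for small n."""
    n = len(dist)
    candidates = ((0,) + p for p in permutations(range(1, n)))
    best = min(candidates, key=lambda t: tour_length(t, dist))
    return list(best), tour_length(best, dist)


# Made-up symmetric distances between 4 nodes.
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
print(brute_force_tsp(dist))
```

The search visits (n-1)! tours, so practical solvers rely on heuristics or branch-and-bound instead.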
Algorithm Design: TspMsa, Modified Design
TspMsa: Modified Algorithm Design
1. Calculate pairwise distances.
2. Determine a TSP tour.
3. Align the closest nodes.
4. One node left? If yes, end; if no, repeat.
(Figure: worked example with numbered nodes.)
Modified Algorithm is Better
Alignment order for Kinase_ref3
Original TspMsa: (worst) (best)
Modified TspMsa:
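One possible reading of this loop can be sketched in Python. The slide does not say how distances to a merged node are computed or whether only tour neighbors are candidates, so both choices below are assumptions made for illustration: candidates are neighbors on the (cyclic) tour, and the distance between two groups is the average of their member distances. The function `merge_order` and the example data are hypothetical.

```python
def merge_order(tour, dist):
    """Return the sequence of merges as pairs of node groups (tuples)."""
    groups = [(node,) for node in tour]  # cyclic order along the tour

    def group_dist(a, b):
        # Assumption: average distance between members of the two groups.
        return sum(dist[x][y] for x in a for y in b) / (len(a) * len(b))

    merges = []
    while len(groups) > 1:  # "one node left?" check
        n = len(groups)
        # Assumption: only neighbours on the tour are candidates.
        k = min(range(n),
                key=lambda k: group_dist(groups[k], groups[(k + 1) % n]))
        a, b = groups[k], groups[(k + 1) % n]
        merges.append((a, b))
        merged = a + b
        if k == n - 1:  # wrap-around pair: last group joins the first
            groups = [merged] + groups[1:-1]
        else:
            groups = groups[:k] + [merged] + groups[k + 2:]
    return merges


# Made-up distances and tour for 4 sequences.
example_dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
for step in merge_order([0, 1, 3, 2], example_dist):
    print(step)
```

Each merge would correspond to one alignment step (sequence to sequence, sequence to group, or group to group), so the list of merges plays the role of the alignment order.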
What to Compare With?
Existing MSA Programs
Progressive: multal, multalign, pileup, clustalw, poa
Iterative: prrp, saga, hmmt
(Figure labels: fast, less computation time, better quality, best quality)
CLUSTALW
1. Calculate pairwise distances.
2. Derive a guide tree by the Neighbor Joining method. Repeat until one node is left at the center: choose the 2 closest nodes i and j and derive an internal node x, with
\[
r_i = \frac{\sum_k d_{ik}}{n-2}, \qquad
d_{ix} = \frac{d_{ij} + r_i - r_j}{2}, \qquad
d_{jx} = d_{ij} - d_{ix}, \qquad
d_{xm} = \frac{d_{im} + d_{jm} - d_{ij}}{2}
\]
CLUSTALW
3. Progressively align all sequences following the guide tree.
2 gap penalty values: opening and extension.
Dynamically changes the gap penalty and the scoring matrix.
Weighted sequences:
1 p e e k s a v t a l
2 g e e k a a v l a l
3 e g e w q l v l h v
Without weights: $\text{Score} = [S(t,v) + S(l,v)] / 2$
With weights: $\text{Score} = [S(t,v)\, w_1 w_3 + S(l,v)\, w_2 w_3] / 2$
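The four Neighbor Joining formulas can be checked on a small example. The sketch below covers a single join step only and leaves the choice of the pair (i, j) to the caller; the function name `nj_join` and the distance matrix are illustrative. The distances were built from a made-up tree in which leaves 0 and 1 hang off one internal node with branch lengths 2 and 3, so the formulas should recover exactly those values.

```python
def nj_join(dist, i, j):
    """One Neighbor Joining step for nodes i and j (requires n > 2).

    Returns (d_ix, d_jx, d_xm), where d_xm maps every other node m
    to its distance from the new internal node x.
    """
    n = len(dist)
    r = [sum(row) / (n - 2) for row in dist]          # r_i
    d_ix = (dist[i][j] + r[i] - r[j]) / 2             # branch i -> x
    d_jx = dist[i][j] - d_ix                          # branch j -> x
    d_xm = {m: (dist[i][m] + dist[j][m] - dist[i][j]) / 2
            for m in range(n) if m not in (i, j)}     # x -> remaining nodes
    return d_ix, d_jx, d_xm


# Additive distances from a made-up tree:
# leaf 0 -(2)- x, leaf 1 -(3)- x, x -(1)- y, leaf 2 -(4)- y, leaf 3 -(5)- y
tree_dist = [
    [0, 5, 7, 8],
    [5, 0, 8, 9],
    [7, 8, 0, 9],
    [8, 9, 9, 0],
]
print(nj_join(tree_dist, 0, 1))
```

After a join, nodes i and j are replaced by x in the distance matrix and the step is repeated on the smaller matrix.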
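The effect of sequence weights on a column score can be illustrated with a few lines of Python. This is a sketch of the formula on the slide only: the weights and the toy match/mismatch score are hypothetical values, and `weighted_column_score` is a name chosen here.

```python
def toy_score(a: str, b: str) -> int:
    """Stand-in for a scoring matrix lookup (hypothetical values)."""
    return 2 if a == b else -1


def weighted_column_score(group_a, group_b, score=toy_score) -> float:
    """Average of S(a, b) * w_a * w_b over all residue pairs.

    Each group is a list of (residue, sequence_weight) tuples.
    """
    terms = [score(ra, rb) * wa * wb
             for ra, wa in group_a
             for rb, wb in group_b]
    return sum(terms) / len(terms)


# Slide example: residues t (sequence 1) and l (sequence 2) against v (sequence 3).
w1, w2, w3 = 0.5, 1.0, 2.0  # hypothetical weights
print(weighted_column_score([("t", w1), ("l", w2)], [("v", w3)]))
print(weighted_column_score([("t", 1.0), ("l", 1.0)], [("v", 1.0)]))  # unweighted
```

Setting every weight to 1 reduces the weighted score to the unweighted average.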
POA
E T - - P K M I V R
E T T H - K M L V R
1. Convert sequences to partial order graphs.
(Figure: partial order graphs with nodes E, T, N, K, P, H, M, I, V, L, R.)
POA
2. Align 2 sequences.
3. Align one sequence to the current group.
4. Repeat step 3 until all sequences are aligned.
(Figure: graph for E T P K / E T T H and sequence E T N K.)
Test Results: Quality Evaluation
BAliBASE Benchmark
Reference 1: equidistant sequences with various levels of similarity (< 25%, 20-40%, and > 35% sequence identity).
Reference 2: closely related sequences with a highly divergent "orphan" sequence.
Reference 3: subgroups with < 25% identity between groups.
Reference 4: sequences with N/C-terminal extensions.
Reference 5: sequences with internal insertions.
Reference 1, Sequences with < 25% Identity (Figure: Average Score and All Test Scores for short, medium, and long sequences)
Reference 1, Sequences with 20-40% Identity (Figure: Average Score and All Test Scores for short, medium, and long sequences)
Reference 1, Sequences with > 35% Identity (Figure: Average Score and All Test Scores for short, medium, and long sequences)
Reference 2 (Figure: Average Score and All Test Scores for short, medium, and long sequences)
Reference 3 (Figure: Average Score and All Test Scores for short, medium, and long sequences)
Reference 4 and Reference 5 (Figure: Average Score and All Test Scores for each reference set)
Alignment Quality Comparison
Reference 1, < 25% identity: Similar *
Reference 1, 20-40% identity: Similar *
Reference 1, > 35% identity: Similar
Reference 2: Similar *
Reference 3: TspMsa better
Reference 4: CLUSTALW better
Reference 5: Similar
* CLUSTALW slightly better for short sequences.
TspMsa and POA: TspMsa better
TspMsa and CLUSTALW: comparable
Test Results: Execution Time Evaluation
Fast Mode TspMsa
Most time-consuming step: pairwise distance calculations.
Slow mode: full dynamic programming (accurate).
Fast mode: a fast approximate method (heuristic).
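The slide does not name the approximation used in fast mode. Purely as an illustration of what a fast approximate distance can look like, the sketch below counts shared k-mers (k-tuples), a common family of alignment-free estimates; it is an assumption, not TspMsa's method, and `kmer_counts` and `kmer_distance` are hypothetical names.

```python
from collections import Counter


def kmer_counts(seq: str, k: int = 2) -> Counter:
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a: str, b: str, k: int = 2) -> float:
    """1 - (shared k-mers / k-mers in the shorter sequence), in [0, 1]."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ca & cb).values())       # multiset intersection
    smaller = min(sum(ca.values()), sum(cb.values()))
    if smaller == 0:                       # sequence shorter than k
        return 1.0
    return 1.0 - shared / smaller


print(kmer_distance("GAATTC", "GGATC"))
```

Counting k-mers takes time linear in the sequence lengths, compared with the quadratic cost of filling a full dynamic programming table for each pair.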
Quality Impact of the Fast Mode
Execution Time Evaluation CLUSTALW and TspMsa in fast mode
Conclusions
QUALITY
Slow mode: close to CLUSTALW (slow mode); better than POA.
Fast mode (not as good as slow mode): comparable to CLUSTALW (fast mode); better than POA.
SPEED
Fast mode: faster than CLUSTALW (fast mode); comparable to POA.
Acknowledgement
Dr. Robert Robinson, Dr. Russell Malmberg, Dr. Eileen Kraemer
Computer Science Department