Seminar in Structural Bioinformatics - Multiple sequence alignment algorithms. Elya Flax & Inbar Matarasso

1 1 Seminar in Structural Bioinformatics - Multiple sequence alignment algorithms. Elya Flax & Inbar Matarasso Multiple sequence alignment algorithms

2 2 Outline The importance of multiple string alignments in molecular biology. CLUSTAL W. Family representation. How to score multiple alignments. The center star method for SP alignment. Consensus strings. Approximating the optimal consensus multiple alignment. Iterative pairwise alignment. Progressive alignment and contemporary improvements. Repeated-motif methods.

3 3 Motivation Why multiple string comparison? Because many important commonalities are faint or widely dispersed, they might not be apparent when comparing two strings alone but may become clear, or even obvious, when comparing a set of related strings.

4 4 Definition A global multiple alignment of k > 2 strings S = {S_1, S_2, …, S_k} is a natural generalization of alignment for two strings. Chosen spaces are inserted into each of the k strings so that the resulting strings have the same length, defined to be l. Then the strings are arrayed in k rows of l columns each, so that each character and space of each string is in a unique column.

5 5 Biological basis for multiple string comparison The second fact of biological sequence comparison: Evolutionarily and functionally related molecular strings can differ significantly throughout much of the string and yet preserve the same three-dimensional structure(s), or the same two-dimensional substructure(s) (motifs, domains), or the same active sites, or the same or related dispersed residues (DNA or amino acid).

6 6 Three “big-picture” biological uses for multiple string comparison The representation of protein families and superfamilies. The identification and representation of conserved sequence features of DNA or protein that correlate with structure and/or function. The deduction of evolutionary history from DNA or protein sequences.

7 7 CLUSTAL W CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. http://www.ebi.ac.uk/clustalw/ [Figure: input sequences and alignment results]

8 8 Family and superfamily representation Often a set of strings (a family) is defined by biological similarity, and one wants to find subsequence commonalities that characterize or represent the family. There are three common kinds of family representations that come from multiple string comparison: I. Profile representation II. Consensus sequence representation III. Signature representation

9 9 Family representation and alignment with profiles Definition: Given a multiple alignment of a set of strings, a profile for that multiple alignment specifies for each column the frequency that each character appears in the column. A profile is sometimes also called a weight matrix in the biological literature.

10 10 Family representation and alignment with profiles
Multiple alignment:
abc_a
ababa
accb_
cb_bc
Profile (frequency of each character in columns C1 to C5):
     C1   C2   C3   C4   C5
a   .75       .25       .50
b        .75       .75
c   .25  .25  .50       .25
_             .25  .25  .25
Often the values in the profile are converted to log-odds ratios: if p(y,j) is the frequency that character y appears in column j, and p(y) is the frequency that character y appears anywhere in the multiply aligned sequences, then log( p(y,j)/p(y) ) is commonly used as the y,j profile entry.
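The profile above can be recomputed from the four aligned rows. The following is a minimal Python sketch, not the presenters' code; the function names `build_profile` and `log_odds` are hypothetical, and the space character is assumed to be written as `_`.

```python
import math
from collections import Counter

def build_profile(rows):
    """Return one dict per column mapping character -> frequency in that column."""
    length = len(rows[0])
    assert all(len(r) == length for r in rows), "aligned rows must have equal length"
    profile = []
    for j in range(length):
        counts = Counter(r[j] for r in rows)
        profile.append({ch: c / len(rows) for ch, c in counts.items()})
    return profile

def log_odds(rows):
    """Convert frequencies p(y,j) to log(p(y,j) / p(y)), as described on the slide."""
    total = Counter("".join(rows))
    n_cells = len(rows) * len(rows[0])
    background = {ch: c / n_cells for ch, c in total.items()}  # p(y)
    return [{ch: math.log(p / background[ch]) for ch, p in col.items()}
            for col in build_profile(rows)]

rows = ["abc_a", "ababa", "accb_", "cb_bc"]
profile = build_profile(rows)
```

Characters that never occur in a column are simply absent from that column's dict, which corresponds to the blank cells in the table.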

11 11 Aligning a string to a profile Given a profile P and a new string S, we want to answer the question: “How well does S, or some substring of S, fit the profile P?” Since space is a legal character of a profile, a fit of S to P should also allow the insertion of spaces into S, and hence the question is naturally formalized as an easy generalization of pure string alignment. [Figure: an alignment of the string aabbc to the column positions 1 to 5 of the previous alignment.]

12 12 How to optimally align a string to a profile Recall that for two characters x and y, s(x,y) denotes the alphabet-weight value assigned to aligning x with y in the pure string alignment problem. Definition: For character y and column j, let p(y,j) be the frequency that character y appears in column j of the profile, and let S(x,j) denote Σ_y [s(x,y) × p(y,j)], the score for aligning x with column j.

13 13 How to optimally align a string to a profile Definition: Let V(i,j) denote the value of the optimal alignment of substring S[1..i] with the first j columns of the profile. The base cases are V(0,0) = 0, V(i,0) = Σ_{k≤i} s(S(k),_), and V(0,j) = Σ_{k≤j} S(_,k). For i and j both strictly positive, the general recurrence is: V(i,j) = max [V(i-1,j-1) + S(S(i),j), V(i-1,j) + s(S(i),_), V(i,j-1) + S(_,j)]. Time analysis: O(σnm), where n is the length of S, m is the number of columns in the profile, and σ is the size of the alphabet.
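The recurrence for V(i,j) translates almost line for line into a dynamic program. The sketch below is illustrative only: the scoring function `simple_score` (match +2, mismatch or space -1, space against space 0) is an assumed example, not a scheme from the slides, and the profile is assumed to be a list of per-column frequency dicts with `_` as the space character.

```python
def simple_score(x, y):
    """Hypothetical alphabet-weight function s(x, y) for a similarity (max) objective."""
    if x == "_" and y == "_":
        return 0
    return 2 if x == y else -1

def align_to_profile(S, profile, s=simple_score, gap="_"):
    """Return V(n, m): the best score of aligning all of S to all profile columns."""
    def col_score(x, j):
        # S(x, j) = sum over y of s(x, y) * p(y, j); j is 1-based as on the slide.
        return sum(s(x, y) * p for y, p in profile[j - 1].items())

    n, m = len(S), len(profile)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                      # V(i,0): S(1..i) against spaces
        V[i][0] = V[i - 1][0] + s(S[i - 1], gap)
    for j in range(1, m + 1):                      # V(0,j): spaces against columns 1..j
        V[0][j] = V[0][j - 1] + col_score(gap, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + col_score(S[i - 1], j),  # S(i) aligned with column j
                V[i - 1][j] + s(S[i - 1], gap),            # S(i) aligned with a space
                V[i][j - 1] + col_score(gap, j),           # space inserted into S
            )
    return V[n][m]
```

Each call to `col_score` touches at most σ characters, which is where the σ factor in the O(σnm) bound comes from.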

14 14 Profile to profile alignment Another way that profiles are used is to compare one protein set to another. In that case, the profile for one set is compared to the profile of the other.

15 15 Introduction to computing multiple string alignments Definition: Given a set of k > 2 strings S = {S_1, S_2, ..., S_k}, a local multiple alignment of S is obtained by selecting one substring S_i' from each string S_i ∈ S and then globally aligning those substrings.

16 16 How to score multiple alignments To date, there is no objective function that has been as well accepted for multiple alignment as edit distance or similarity has been for two-string alignment. We will discuss three types of objective functions: I. sum-of-pairs functions II. consensus functions III. tree functions

17 17 How to score multiple alignments Definition: Given a multiple alignment M, the induced pairwise alignment of two strings S_i and S_j is obtained from M by removing all rows except the two rows for S_i and S_j. That is, the induced alignment is the multiple alignment M restricted to S_i and S_j. Any two opposing spaces in that induced alignment can be removed if desired.

18 18 How to score multiple alignments Definition: The score of an induced pairwise alignment is determined using any chosen scoring scheme for two-string alignment in the standard manner.
Example: the multiple alignment
AAGAA_A
AT_AATG
CTG_G_G
has induced pairwise scores 4, 5, and 5, for an SP score of 14. The string ATGAA_G shown beneath the alignment matches the most frequent character in every column.
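The example scores can be checked with a few lines of Python. The slide does not state its scoring scheme; the sketch below assumes match = 0, mismatch = 1, character against space = 1, and space against space = 0, which is one scheme that reproduces the values 4, 5, 5. The function names are hypothetical.

```python
from itertools import combinations

def induced_pair_score(row_a, row_b, mismatch=1, space=1):
    """Score of the pairwise alignment induced by two rows of a multiple alignment."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == y:
            continue                      # a match, or two opposing spaces: cost 0
        total += space if "_" in (x, y) else mismatch
    return total

def sp_score(rows):
    """Sum-of-pairs score: the sum of all induced pairwise scores."""
    return sum(induced_pair_score(a, b) for a, b in combinations(rows, 2))

rows = ["AAGAA_A", "AT_AATG", "CTG_G_G"]
pair_scores = [induced_pair_score(a, b) for a, b in combinations(rows, 2)]
```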

19 19 Multiple alignment with the sum-of-pairs (SP) objective function Definition: The sum of pairs (SP) score of a multiple alignment M is the sum of the scores of the pairwise global alignments induced by M. The SP alignment problem: compute a global multiple alignment M with minimum sum-of-pairs score.

20 20 An exact solution to the SP alignment problem Via dynamic programming: for k strings of length n, it takes Θ(n^k) time. We will develop the dynamic programming recurrence only for the case of three strings. We will then develop an accelerant to the basic dynamic programming solution that somewhat increases the number of strings that can be optimally aligned.

21 21 An exact solution to the SP alignment problem Definition: Let S_1, S_2 and S_3 denote three strings of length n_1, n_2 and n_3, respectively, and let D(i,j,k) be the optimal SP score for aligning S_1[1..i], S_2[1..j] and S_3[1..k]. The score for a match, mismatch, or space is specified by the variables smatch, smis and sspace, respectively.

22 22 Recurrences for a nonboundary cell (i,j,k)
for i := 1 to n_1 do
  for j := 1 to n_2 do
    for k := 1 to n_3 do
    begin
      if S_1(i) = S_2(j) then cij := smatch else cij := smis;
      if S_1(i) = S_3(k) then cik := smatch else cik := smis;
      if S_2(j) = S_3(k) then cjk := smatch else cjk := smis;
      d1 := D(i-1,j-1,k-1) + cij + cik + cjk;
      d2 := D(i-1,j-1,k) + cij + 2*sspace;
      d3 := D(i-1,j,k-1) + cik + 2*sspace;
      d4 := D(i,j-1,k-1) + cjk + 2*sspace;
      d5 := D(i-1,j,k) + 2*sspace;
      d6 := D(i,j-1,k) + 2*sspace;
      d7 := D(i,j,k-1) + 2*sspace;
      D(i,j,k) := min[d1,d2,d3,d4,d5,d6,d7];
    end;

23 23 D values for boundary cells Let D_{1,2}(i,j) denote the familiar pairwise distance between substrings S_1[1..i] and S_2[1..j], and let D_{1,3}(i,k) and D_{2,3}(j,k) denote the analogous pairwise distances. Then, I. D(i,j,0) = D_{1,2}(i,j) + (i+j)*sspace II. D(i,0,k) = D_{1,3}(i,k) + (i+k)*sspace III. D(0,j,k) = D_{2,3}(j,k) + (j+k)*sspace IV. D(0,0,0) = 0
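The nonboundary recurrence and the boundary cells can be folded into one loop by treating each of the seven predecessor moves as a column that holds either a character or a space for each string. The following Python sketch takes that route; it is an illustration, with assumed default costs (smatch = 0, smis = 1, sspace = 1) and a hypothetical function name `sp_align3`. It returns only the optimal SP score, not the alignment.

```python
from itertools import product

def sp_align3(s1, s2, s3, smatch=0, smis=1, sspace=1):
    """Optimal SP score D(n1, n2, n3) for three strings, in Theta(n1*n2*n3) time."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    INF = float("inf")
    D = [[[INF] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    D[0][0][0] = 0

    def pair(a, b):
        # None stands for a space; two opposing spaces cost nothing.
        if a is None and b is None:
            return 0
        if a is None or b is None:
            return sspace
        return smatch if a == b else smis

    # The seven moves: which of the three strings contribute a character.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                for di, dj, dk in moves:
                    if i < di or j < dj or k < dk:
                        continue          # move would leave the table (boundary cell)
                    a = s1[i - 1] if di else None
                    b = s2[j - 1] if dj else None
                    c = s3[k - 1] if dk else None
                    cost = pair(a, b) + pair(a, c) + pair(b, c)
                    D[i][j][k] = min(D[i][j][k], D[i - di][j - dj][k - dk] + cost)
    return D[n1][n2][n3]
```

On a face such as k = 0, only the three moves that leave k unchanged survive, so the loop reproduces D(i,j,0) = D_{1,2}(i,j) + (i+j)*sspace without a separate case.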

24 24 A speed up for the exact solution The program for multiple alignment that was shown uses the recurrences in the backward direction. In forward dynamic programming, when D(i,j,k) is set, D(i,j,k) is sent forward to the seven cells that can be influenced by it.

25 25 A speed up for the exact solution Definition: Let d_{1,2}(i,j) be the edit distance between suffixes S_1[i..n] and S_2[j..n] of strings S_1 and S_2. Define d_{1,3}(i,k) and d_{2,3}(j,k) analogously. All these d values can be computed in O(n^2) time by reversing the strings and computing three pairwise distances.

26 26 A speed up for the exact solution Suppose that some multiple alignment of S_1, S_2, and S_3 is known and that the alignment has SP score z. Recall that D(i,j,k) is the optimal SP score for aligning S_1[1..i], S_2[1..j], and S_3[1..k]. Key idea of the heuristic speed up: if D(i,j,k) + d_{1,2}(i,j) + d_{1,3}(i,k) + d_{2,3}(j,k) is greater than z, then node (i,j,k) cannot be on any optimal path, and so D(i,j,k) need not be sent forward to any cell.
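The pruning test needs the three suffix-distance tables and one comparison per cell. A minimal Python sketch follows, assuming unit-cost edit distance and 0-based indices, so that `d[i][j]` is the distance between the suffixes `a[i:]` and `b[j:]` that remain after the prefixes `a[:i]` and `b[:j]` have been aligned. The names `suffix_edit_distances` and `can_prune` are hypothetical, and the table is filled from the end of the strings rather than by reversing them, which gives the same values.

```python
def suffix_edit_distances(a, b):
    """d[i][j] = unit-cost edit distance between a[i:] and b[j:], for all i, j."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n:
                d[i][j] = m - j            # only insertions remain
            elif j == m:
                d[i][j] = n - i            # only deletions remain
            else:
                d[i][j] = min(
                    d[i + 1][j + 1] + (a[i] != b[j]),
                    d[i + 1][j] + 1,
                    d[i][j + 1] + 1,
                )
    return d

def can_prune(D_ijk, i, j, k, d12, d13, d23, z):
    """True if cell (i, j, k) cannot lie on an alignment scoring z or better."""
    lower_bound = D_ijk + d12[i][j] + d13[i][k] + d23[j][k]
    return lower_bound > z
```

The bound holds because any completion of the alignment induces pairwise alignments of the three suffix pairs, and each of those costs at least the corresponding pairwise optimum.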

27 27 A bounded-error approximation method for SP alignment The method is provably fast (it runs in polynomial worst-case time) and yet produces alignments whose SP score is guaranteed to be less than twice the score of the optimal SP alignment. Recall that for two strings, D(S_i,S_j) is the (optimal) weighted edit distance between S_i and S_j.

28 28 An initial key idea: alignments consistent with a tree Definition: Let S be a set of strings, and let T be a tree where each node is labeled with a distinct string from S. Then, a multiple alignment M of S is called consistent with T if the induced pairwise alignment of S_i and S_j has score D(S_i,S_j) for each pair of strings (S_i,S_j) that label adjacent nodes in T.

29 29 A bounded-error approximation method for SP alignment [Figure: (a) a tree with nodes numbered 1 to 5, labeled with strings including AXZ, AXXZ, AYZ, and AYXYZ; (b) a multiple alignment consistent with the tree:]
3 AXX_Z
1 AX__Z
2 A_X_Z
4 AY__Z
5 AYXYZ

30 30 An initial key idea: alignments consistent with a tree Theorem: For any set of strings S and for any tree T whose nodes are labeled by distinct strings of S, we can efficiently find a multiple alignment M(T) of S that is consistent with T.

31 31 The center star method for SP alignment We will describe the method in terms of an alphabet-weighted scoring scheme for two-string alignment, and let s(x,y) be the score contributed when a character x is aligned opposite a character y. Definition: A scoring scheme satisfies the triangle inequality if for any three characters x, y and z, s(x,z) ≤ s(x,y) + s(y,z).

32 32 The center star method for SP alignment Definition: Given a set of k strings S, define a center string S_c ∈ S as a string in S that minimizes Σ_{S_j ∈ S} D(S_c,S_j), and let M denote the minimum sum. Define the center star to be a star tree of k nodes, with the center node labeled S_c and with each of the k-1 remaining nodes labeled by a distinct string in S − {S_c}.
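Choosing the center string only requires the k(k-1)/2 pairwise distances. The Python sketch below assumes unit-cost edit distance as D (which satisfies the triangle inequality); the function names are hypothetical and the four example strings are chosen for illustration.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def center_string(strings):
    """Return (S_c, M): the string minimizing the summed distance to all others."""
    best, best_sum = None, None
    for candidate in strings:
        total = sum(edit_distance(candidate, s) for s in strings)
        if best_sum is None or total < best_sum:
            best, best_sum = candidate, total
    return best, best_sum

center, M = center_string(["AXZ", "AXXZ", "AYZ", "AYXYZ"])
```

For these strings the summed distances are 4, 5, 5, and 6, so AXZ is the center with M = 4.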

33 33 The center star method for SP alignment [Figure: a generic center star for six strings, where the center string S_c is S_3.]

34 34 The center star method for SP alignment Definition: Define the multiple alignment M_c of the set of strings S to be the multiple alignment consistent with the center star. Definition: Define d(S_i,S_j) as the score of the pairwise alignment of strings S_i and S_j induced by M_c. Denote the score of an alignment M as d(M). Then d(S_i,S_j) ≥ D(S_i,S_j), d(M_c) = Σ_{i<j} d(S_i,S_j), and d(S_i,S_c) = D(S_i,S_c).

35 35 The center star method for SP alignment Lemma: Assume that the two-string scoring scheme satisfies the triangle inequality. Then for any strings S_i and S_j in S, d(S_i,S_j) ≤ d(S_i,S_c) + d(S_c,S_j) = D(S_i,S_c) + D(S_c,S_j).

36 36 The center star method for SP alignment Definition: Let M* be the optimal multiple alignment of the k strings of S. Let d*(S_i,S_j) be the score of the pairwise alignment of strings S_i and S_j induced by M*. Then d(M*) = Σ_{i<j} d*(S_i,S_j).

37 37 The center star method for SP alignment Theorem: d(M_c)/d(M*) ≤ 2(k-1)/k < 2. Corollary: kM/2 ≤ Σ_{i<j} D(S_i,S_j) ≤ d(M*) ≤ d(M_c) ≤ [2(k-1)/k] Σ_{i<j} D(S_i,S_j).
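The theorem follows in three lines from the lemma and the definition of M. The derivation below is a sketch that takes all sums over unordered pairs i < j, as in the definitions of d(M_c) and d(M*), and assumes that D is symmetric with D(S,S) = 0.

```latex
\begin{align*}
d(M_c) &= \sum_{i<j} d(S_i,S_j)
        \le \sum_{i<j}\bigl[D(S_i,S_c)+D(S_c,S_j)\bigr]
        = (k-1)\sum_{j} D(S_c,S_j) = (k-1)\,M,\\
d(M^*) &= \sum_{i<j} d^*(S_i,S_j)
        \ge \sum_{i<j} D(S_i,S_j)
        = \tfrac{1}{2}\sum_{i}\sum_{j} D(S_i,S_j)
        \ge \tfrac{k}{2}\,M,\\
\frac{d(M_c)}{d(M^*)} &\le \frac{(k-1)\,M}{(k/2)\,M} = \frac{2(k-1)}{k} < 2.
\end{align*}
```

The first line uses the fact that each string appears in exactly k-1 pairs; the second uses the fact that every string's summed distance to the others is at least M, the minimum attained by S_c.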

38 38 Steiner consensus strings Definition: Given a set of strings S and another string S’, the consensus error of S’ relative to S is E(S’) = Σ_{S_i ∈ S} D(S’,S_i). Note that S’ need not be from S.

39 39 Steiner consensus strings Definition: Given a set of strings S, an optimal Steiner string S* for S is a string that minimizes the consensus error E(S*) over all possible strings.

40 40 Steiner consensus strings Lemma: Let S have k strings, and assume that the two-string scoring scheme satisfies the triangle inequality. Then there exists a string S̄ ∈ S such that E(S̄) / E(S*) ≤ 2 − 2/k < 2.

41 41 Steiner consensus strings Recall that S_c is a string that minimizes Σ_{S_i ∈ S} D(S_c,S_i) over all strings in S. Theorem: Assuming that the scoring scheme satisfies the triangle inequality, E(S_c) / E(S*) ≤ 2 − 2/k < 2.

42 42 Consensus strings from multiple alignment Definition: Given a multiple alignment M of a set of strings S, the consensus character of column i of M is the character that minimizes the summed distance to it from all the characters in column i. Let d(i) denote the minimum sum in column i.

43 43 Consensus strings from multiple alignment Definition: The consensus string S_M derived from alignment M is the concatenation of the consensus characters for each column of M.

44 44 Consensus strings from multiple alignment Definition: Let M be a multiple alignment of a set of strings S, and let S_M be its consensus string containing q characters. Then the alignment error of S_M equals Σ_{i=1}^{q} d(i), and the alignment error of M is defined as the alignment error of S_M.
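The consensus string and its alignment error can be computed column by column. The Python sketch below assumes a unit-cost character distance (0 for identical characters, 1 otherwise, with the space `_` treated as an ordinary character) and restricts candidate consensus characters to those that occur somewhere in the alignment; both choices are assumptions for illustration, and the function name is hypothetical.

```python
def consensus_and_error(rows, dist=lambda x, y: 0 if x == y else 1):
    """Return (S_M, alignment error) for an alignment given as equal-length rows."""
    alphabet = sorted(set("".join(rows)))
    chars, error = [], 0
    for column in zip(*rows):
        # Consensus character: minimizes the summed distance to the column.
        best = min(alphabet, key=lambda c: sum(dist(c, x) for x in column))
        chars.append(best)
        error += sum(dist(best, x) for x in column)   # this is d(i)
    return "".join(chars), error

consensus, error = consensus_and_error(["AAGAA_A", "AT_AATG", "CTG_G_G"])
```

Under unit costs the consensus character is simply the most frequent character of the column, so for this three-row alignment every column contributes an error of 1.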

45 45 Consensus strings from multiple alignment Definition: The optimal consensus multiple alignment is a multiple alignment M for input set S whose consensus string S_M has the smallest alignment error over all possible multiple alignments of S.

46 46 Consensus strings from multiple alignment Definition: Given a set S of k strings, let T be the star tree with Steiner string S* at the root and each of the k strings at distinct leaves of T. Then the multiple alignment of S ∪ {S*} consistent with T is said to be consistent with S*.

47 47 Consensus strings from multiple alignment Theorem: Let S’ denote the consensus string of the optimal consensus multiple alignment. Then removal of the spaces from S’ creates the optimal Steiner string S*. Conversely, removal of the row for S* from the multiple alignment consistent with S* creates the optimal consensus multiple alignment of S.

48 48 Approximating the optimal consensus multiple alignment Theorem: Assuming the triangle inequality, the multiple alignment M_c created by the center star method has an SP score that is never more than 2 − 2/k times the SP score of the optimal SP alignment, and it has a (consensus) alignment error that is never more than 2 − 2/k times the alignment error of the optimal consensus multiple alignment.

49 49 Multiple alignment to a (phylogenetic) tree Definition: Given an input tree T with a distinct string (from a set of strings S) written at each leaf, a phylogenetic alignment for T is an assignment of one string to each internal node of T. Note that the strings assigned to internal nodes need not be distinct and need not be from the input strings S.

50 50 Multiple alignment to a (phylogenetic) tree Definition: If strings S and S’ are assigned to the endpoints of an edge (i,j), then (i,j) has edge distance D(S,S’). The distance along a path is the sum of the distances on the edges in the path. The distance of a phylogenetic alignment is the total of all the edge distances in the tree.

51 51 Multiple alignment to a (phylogenetic) tree The phylogenetic alignment problem for T: find an assignment of strings to internal nodes of T (one string to each node) that minimizes the distance of the alignment. The consensus alignment problem is a special case of the phylogenetic alignment problem (i.e., when tree T is a star).

52 52 A heuristic for phylogenetic alignment Definition: A phylogenetic alignment is called a lifted alignment if for every internal node V, the string assigned to V is also assigned to one of V’s children. We will show that the best lifted alignment in T has a total distance less than twice that of the optimal phylogenetic alignment.

53 53 A heuristic for phylogenetic alignment [Figure: an example lifted alignment; each internal node carries the string of one of its children, with labels drawn from the leaf strings S_1 through S_8.]

54 54 The transformation creating T^L We will construct the lifted alignment T^L out of T*, which is the optimal phylogenetic alignment. Definition: We say a node has been lifted after it has been labeled by a string in the leaf set S. Let S_v* be the string labeling internal node V in T*, and let S_1, S_2, …, S_k be the strings labeling V’s children. We lift S_j to V if D(S_v*,S_j) ≤ D(S_v*,S_i) for every i from 1 to k.

55 55 The transformation creating T^L [Figure: the lifting operation at node V. The numbers on the edges are the distances from S_v* to the lifted strings labeling its children. Note that after the lift, one edge will have zero distance.]

56 56 The error analysis Theorem: The lifted alignment T^L has total distance less than or equal to twice that of the optimal phylogenetic alignment T* of T.

57 57 Computing the minimum distance lifted alignment The best lifted alignment is computed by dynamic programming. Definition: Let T_v be the subtree of T rooted at node V. Let d(V,S) denote the distance of the best lifted alignment of T_v under the requirement that string S is assigned to node V (assuming, of course, that S is a string at a leaf of T_v).

58 58 Computing the minimum distance lifted alignment We start with the assumption that all the leaves have already been processed. Let S’ denote a string written at a leaf and V’ a child of V. If V is a node all of whose children are leaves, then d(V,S) = Σ_{S’} D(S,S’). For a general internal node V, the dynamic programming recurrence is d(V,S) = Σ_{V’} min_{S’} [ D(S,S’) + d(V’,S’) ], where the sum is over the children V’ of V and the minimum is over the strings S’ at the leaves of T_{V’}.
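The recurrence for d(V,S) can be evaluated bottom-up over the tree. The Python sketch below is an illustration under stated assumptions: the tree is encoded as nested tuples with leaf strings at the leaves (a hypothetical encoding), leaf strings are distinct, D is unit-cost edit distance, and a leaf has d(leaf, S) = 0 for its own string. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def lifted_table(tree):
    """Return {S: d(V, S)} for every leaf string S in the subtree rooted at V."""
    if isinstance(tree, str):
        return {tree: 0}                       # a leaf is labeled by its own string
    child_tables = [lifted_table(child) for child in tree]
    table = {}
    for child_table in child_tables:
        for s in child_table:                  # S must be a leaf string of T_v
            # d(V,S) = sum over children V' of min over S' of D(S,S') + d(V',S')
            table[s] = sum(
                min(edit_distance(s, s2) + cost for s2, cost in t.items())
                for t in child_tables
            )
    return table

def best_lifted_distance(tree):
    """Distance of the best lifted alignment: the minimum of d(root, S) over S."""
    return min(lifted_table(tree).values())
```

A usage example: `best_lifted_distance((("a", "b"), "c"))` evaluates the inner node first, giving d = 1 for either of its strings, and then the root, where every candidate string yields a total distance of 2.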

59 59 Computing the minimum distance lifted alignment Theorem: The optimal lifted alignment can be computed in polynomial time as a function of the size of the tree and the lengths of the input strings.

60 60 Iterative pairwise alignment The goal is to iteratively merge two multiple alignments of two subsets of strings into a single multiple alignment of the union of those subsets. As an example we will explain the average linkage method, which is also known as UPGMA, for “Unweighted Pair-Group Method using arithmetic Averages”. At each merge step, the new multiple alignment could be created by aligning some representation of the two smaller alignments (for example, by aligning profiles or consensus sequences).
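The merge order chosen by average linkage can be sketched independently of how the alignments themselves are merged. The Python sketch below builds only the merge tree; it assumes a symmetric distance matrix given as a list of lists, recomputes cluster distances as the plain average over all member pairs, and uses the hypothetical name `upgma_tree`. The three-label matrix in the usage example is made up for illustration.

```python
from itertools import combinations

def upgma_tree(names, dist):
    """Return the merge history as nested tuples, merging the closest clusters first."""
    # Each cluster is (member indices, tree built so far).
    clusters = [([i], names[i]) for i in range(len(names))]

    def average_linkage(a, b):
        # Arithmetic average of the distances over all cross-cluster pairs.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]][0], clusters[p[1]][0]),
        )
        merged = (clusters[x][0] + clusters[y][0], (clusters[x][1], clusters[y][1]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return clusters[0][1]

tree = upgma_tree(["A", "B", "C"], [[0, 1, 4], [1, 0, 5], [4, 5, 0]])
```

Each internal tuple corresponds to one merge step, at which the two child alignments would be combined, for example by aligning their profiles.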

61 61 Iterative pairwise alignment Multiple alignments serve the purpose of characterizing protein families and identifying important molecular structures, but… Doolittle: “…what we’re really interested in is a historical alignment. The historical alignment ought to reflect, as accurately as possible, the series of divergences that led to the contemporary sequences…”

62 62 Iterative pairwise alignment Iterative alignment methods determine a sequence of merges of disjoint subsets of strings. Hence the history of those merges can be described by a binary tree T. Each leaf of T represents a single string from the input set, and each node of T specifies a merge of the strings found at the leaves of its subtree. Each node also represents a multiple alignment created by the merge at that node.

63 63 Progressive alignment A pair of strings with minimum edit distance (or greatest similarity) is likely obtained from the pair of taxa that has most recently diverged. Any spaces (gaps) that appear in the optimal pairwise alignment of those two strings are preserved throughout the entire sequence of successive merges.

64 64 Progressive alignment The progressive alignment method is explicitly aimed at building an evolutionary tree from molecular data while simultaneously constructing an evolutionarily informative multiple alignment.

65 65 Improvements to progressive alignment Sequence weighting: the weights are normalized so that the largest is set to 1. Closely related sequences receive lowered weights; highly divergent sequences receive high weights. Initial gap penalties: a gap opening penalty (GOP) is charged for every gap, and a gap extension penalty (GEP) gives the cost of every space in the gap.

66 66 Improvements to progressive alignment Weight matrices: two main series of weight matrices are offered to the user, Dayhoff PAM and BLOSUM. Divergent sequences: the most divergent sequences are usually the most difficult to align correctly. It is sometimes better to delay the incorporation of these sequences until all of the more easily aligned sequences have been merged.

67 67 Progressive alignment Pairwise alignment: calculate distance matrix. [Figure: lower-triangular distance matrix for the seven sequences Hbb_Human (1), Hbb_Horse (2), Hba_Human (3), Hba_Horse (4), Myg_Phyca (5), Glb5_Petma (6), and Lgb2_Luplu (7). For example, Hbb_Human and Hbb_Horse are at distance .17, Hba_Human is at .59 and .60 from sequences 1 and 2, and Glb5_Petma is at .81, .82, .73, .74, and .80 from sequences 1 to 5.]

68 68 Progressive alignment [Figure: unrooted neighbor-joining tree of Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Phyca, Glb5_Petma, and Lgb2_Luplu.]

69 69 Progressive alignment [Figure: rooted NJ tree (guide tree) and sequence weights for Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Phyca, Glb5_Petma, and Lgb2_Luplu.] Progressive alignment: align following the guide tree.

70 70 Repeated-motif methods The second major approach used in multiple alignment methods. Definition: A motif is a substring or a small subsequence that is common to many of the strings in the set. “Width” refers to the length of the motif, and “multiplicity” refers to the number of strings that it appears in.

71 71 Repeated-motif methods Repeated-motif method general algorithm: 1. Find a “good” motif (wide and with high multiplicity). 2. The strings containing it are shifted so that the occurrences of the motif are aligned with each other. 3. The problem divides into two subproblems, one for the substrings on each side of the motif.

72 72 Repeated-motif methods 4. Continue this recursion until no sufficiently wide or high-multiplicity motif is found. 5. The remaining subproblems can be solved by iterative alignment methods. 6. Strings that did not contain the first good motif are aligned separately. 7. Finally, the two alignments are merged.
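Step 1 of the algorithm above can be sketched for the simplest kind of motif, an exact substring of a fixed width. The Python sketch below is only an illustration of "width" and "multiplicity"; real repeated-motif methods also handle inexact motifs, and the function name `best_exact_motif` is hypothetical.

```python
from collections import defaultdict

def best_exact_motif(strings, width):
    """Return (motif, multiplicity) for the width-`width` substring in the most strings."""
    containing = defaultdict(set)          # substring -> indices of strings containing it
    for idx, s in enumerate(strings):
        for start in range(len(s) - width + 1):
            containing[s[start:start + width]].add(idx)
    if not containing:
        return None, 0
    # Highest multiplicity wins; ties are broken alphabetically for determinism.
    motif = min(containing, key=lambda m: (-len(containing[m]), m))
    return motif, len(containing[motif])

motif, multiplicity = best_exact_motif(["xxABCy", "ABCzz", "qABC", "nope"], 3)
```

Here the fourth string does not contain the motif, so under step 6 it would be aligned separately and merged at the end.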

73 73 Summary The importance of multiple string alignments in molecular biology. CLUSTAL W. Family representation. How to score multiple alignments. The center star method for SP alignment. Consensus strings. Approximating the optimal consensus multiple alignment. Iterative pairwise alignment. Progressive alignment and contemporary improvements. Repeated-motif methods.

74 74 Bibliography Gusfield, Dan. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press, 1997. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 1994, Vol. 22, No. 22, Oxford University Press.

