# Bioinformatics Multiple sequence alignments Scoring multiple sequence alignments Progressive methods ClustalW Other methods Hidden Markov Models Lecture.


Bioinformatics Multiple sequence alignments Scoring multiple sequence alignments Progressive methods ClustalW Other methods Hidden Markov Models Lecture 9

Multiple sequence alignments (MSA)

Comparing a pair of sequences is not sufficient for many research purposes, mainly evolutionary reconstruction and the study of functional similarities. A multiple sequence alignment is much more demanding computationally. For two protein sequences, each 300 aa long and excluding gaps, the number of comparisons made by the dynamic programming approach is $300^2 = 9 \times 10^4$. For three sequences of the same length it is $300^3 = 2.7 \times 10^7$. For 10 sequences it becomes staggering. Fortunately, methods that dramatically reduce the number of comparisons were invented between the late 1980s and the mid-1990s.

An MSA is usually built in three consecutive steps:

1. Find the alignments between each pair of sequences.
2. Predict a phylogenetic tree for the sequences (for instance by the neighbor-joining method) to guide a trial MSA.
3. Multiply align the sequences in the order of their relationship on the tree.
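The growth in the number of comparisons can be checked with a few lines of Python. This is only an illustrative sketch: the function name `dp_comparisons` is made up for this example, and it assumes, as the slide does, that the cost is simply the sequence length raised to the number of sequences.

```python
def dp_comparisons(seq_length: int, n_sequences: int) -> int:
    """Size of the full dynamic-programming lattice, gaps excluded."""
    return seq_length ** n_sequences


for n in (2, 3, 10):
    # Scientific notation keeps the 10-sequence case readable.
    print(f"{n:>2} sequences: {dp_comparisons(300, n):.1e} comparisons")
```

For 10 sequences of 300 aa the count is about $5.9 \times 10^{24}$, which is why exhaustive dynamic programming is not used for large alignments.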

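Scoring multiple sequence alignments

| Sequence | Column A | Column B | Column C |
|---|---|---|---|
| 1 | N | N | N |
| 2 | N | N | N |
| 3 | N | N | N |
| 4 | N | N | C |
| 5 | N | C | C |
| No. of N-N pairs (each scores 6) | 10 | 6 | 3 |
| No. of C-C pairs (each scores 9) | 0 | 0 | 1 |
| No. of N-C pairs (each scores -3) | 0 | 4 | 6 |
| BLOSUM62 score | 60 | 24 | 9 |

Each column is scored as the sum over all 10 pairs of its five residues. Column C contains three N-N pairs, one C-C pair and six N-C pairs, so its score is $3 \times 6 + 9 - 6 \times 3 = 9$.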
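The sum-of-pairs scoring of the three columns can be sketched in Python as follows. The helper names are hypothetical, and only the three BLOSUM62 entries needed here are included (N-N = 6, C-C = 9, N-C = -3). Note that column C contains one C-C pair, which this sketch scores at its BLOSUM62 value of 9.

```python
from itertools import combinations

# Only the BLOSUM62 entries needed for columns made of N and C.
BLOSUM62_SUBSET = {("N", "N"): 6, ("C", "C"): 9, ("C", "N"): -3}


def pair_score(a: str, b: str) -> int:
    """Look up a substitution score regardless of residue order."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def sum_of_pairs(column: str) -> int:
    """Add the scores of every pair of residues in one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


for name, column in (("A", "NNNNN"), ("B", "NNNNC"), ("C", "NNNCC")):
    print(f"Column {name}: {sum_of_pairs(column)}")
```

Five residues give 10 pairs per column, so the cost of scoring one column grows quadratically with the number of sequences.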

Progressive methods of MSA

- The most closely related sequences are aligned first by dynamic programming, and the MSA is built up from them.
- The tree is based on pairwise comparisons of the sequences using one of the phylogenetic methods.
- Unfortunately, uncertainty grows at the lower levels of the tree, as deletions and insertions are not easy to recognise.
- The challenge is to use an appropriate combination of sequence weighting, scoring matrix and gap penalties; a poor choice prevents an optimal MSA.

Figure: the sequences NYLS and NKYLS are aligned as N K/- Y L S, and NFS and NFLS as N F L/- S; the two alignments are then merged into N K/- Y/F L/- S.

ClustalW

ClustalW is an advanced version of the popular and powerful Clustal program; the W stands for weighting. It provides more realistic alignments, which should reflect evolutionary changes and a more appropriate distribution of gaps between conserved domains. ClustalW performs a global multiple sequence alignment by a different method than the MSA program, although the initial global multiple sequence alignment is calculated similarly.

The steps involved are:

1. Pairwise alignment of all sequences.
2. Use of the alignment scores to build a phylogenetic tree.
3. Progressive alignment guided by the phylogenetic relationships indicated by the tree.

The most closely related (similar) sequences are aligned first, and then additional sequences are added. The initial alignments used to produce the guide tree may be obtained by a fast k-tuple approach (similar to FASTA) or by a slower dynamic programming method. To build the tree, the genetic distance between two sequences is calculated as the number of mismatched positions in their alignment divided by the total number of matched positions.
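The distance used for the guide tree can be sketched as below. This is an interpretation, not ClustalW's actual code: it assumes that "matched positions" means the columns in which a residue is aligned with a residue, so columns containing a gap are skipped. The function name is hypothetical.

```python
def pairwise_distance(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Mismatched columns divided by residue-to-residue columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatched = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # assumption: columns with a gap are not counted
        compared += 1
        if x != y:
            mismatched += 1
    if compared == 0:
        raise ValueError("no residue-to-residue columns to compare")
    return mismatched / compared


print(pairwise_distance("NKYLS", "N-YLS"))  # no mismatches outside the gap
print(pairwise_distance("NYLS", "NFLS"))    # one mismatch in four columns
```

A full distance matrix is obtained by calling the function for every pair of sequences; a tree-building method such as neighbor-joining then works from that matrix.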

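ClustalW: sequence weights

A. Calculation of sequence weights

Branch lengths on the guide tree: A 0.2, B 0.1, branch shared by A and B 0.3, C 0.5.

| Sequence | Weighting factor | Normalized |
|---|---|---|
| A | 0.2 + 0.3/2 = 0.35 | 0.35/0.5 = 0.7 |
| B | 0.1 + 0.3/2 = 0.25 | 0.25/0.5 = 0.5 |
| C | 0.5 | 0.5/0.5 = 1 |

B. Use of sequence weights

| Pairwise comparison | Residues in the two columns | Weighted term |
|---|---|---|
| Sequence A (weight a) with sequence C (weight c) | K, L | a × c × score(K, L) |
| Sequence B (weight b) with sequence C (weight c) | I, L | b × c × score(I, L) |

The score for matching these two columns (one from alignment 1, one from alignment 2) in an MSA is

$$\text{score} = \frac{a\,c\,\mathrm{score}(K,L) + b\,c\,\mathrm{score}(I,L) + \dots}{n},$$

where $n$ is the number of pairwise comparisons. The same procedure applies to the other columns in all pairwise alignments.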
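The weighting example can be reproduced with a short Python sketch. Several things here are assumptions made for illustration: the way the tree is encoded (`PATHS`), the function names, normalization by the largest weight (which matches dividing by 0.5 in the example), and dividing the column score by the number of pairwise comparisons. The two substitution scores are the BLOSUM62 values K-L = -2 and I-L = 2.

```python
# Path from each sequence to the root as (branch length, number of
# sequences sharing that branch); lengths are those of the example tree.
PATHS = {
    "A": [(0.2, 1), (0.3, 2)],
    "B": [(0.1, 1), (0.3, 2)],
    "C": [(0.5, 1)],
}

# Two BLOSUM62 entries, enough for the K/L and I/L comparisons.
BLOSUM62_SUBSET = {frozenset("KL"): -2, frozenset("IL"): 2}


def sequence_weights(paths):
    """Sum each branch length divided by the sequences sharing it, then normalize."""
    raw = {name: sum(length / shared for length, shared in branches)
           for name, branches in paths.items()}
    largest = max(raw.values())
    return {name: weight / largest for name, weight in raw.items()}


def weighted_column_score(column_1, column_2, weights):
    """Average the weighted substitution scores over all cross-column pairs."""
    terms = [weights[s1] * weights[s2] * BLOSUM62_SUBSET[frozenset((r1, r2))]
             for s1, r1 in column_1.items()
             for s2, r2 in column_2.items()]
    return sum(terms) / len(terms)


weights = sequence_weights(PATHS)
score = weighted_column_score({"A": "K", "B": "I"}, {"C": "L"}, weights)
print(weights, round(score, 3))
```

Down-weighting closely related sequences such as A and B keeps a cluster of near-identical sequences from dominating the column score.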

An output from ClustalW: the sequences have significant similarity

CLUSTAL W (1.82) multiple sequence alignment

    gi|42542791|gb|AAH66228.1|  MSTAGKVIKCKAAVLWELKKPFSIEEVEVAPPKAHEVRIKMVAAGICRS- 49
    gi|825623|emb|CAA39813.1|   MGTKGKVIKCKAAIAWEAGKPLCIEEVEVAPPKAHEVRIQIIATSLCHT- 49
    gi|42738724|gb|AAS42652.1|  --MQNFVFRNPTKLIFGKGQ---LEQLKTEIPQFGKKVLLVYGGGSIKRN 45

    gi|42542791|gb|AAH66228.1|  ---DEHVVSGNLV-TPLPVILGHEAAGIVESVGEGVTTVKPG--DKVIPL 93
    gi|825623|emb|CAA39813.1|   ---DASVIDSKFEGLAFPVIVGHEAAGIVESIGPGVTNVKPG--DKVIPL 94
    gi|42738724|gb|AAS42652.1|  GIYDNVISILKDINAEVFELTGVEPNPRVSTVKKGIQICKDNGVEFILAV 95

    gi|42542791|gb|AAH66228.1|  FTPQCGKCRICKNPESNYCLKN-DLGNPRG-------------------T 123
    gi|825623|emb|CAA39813.1|   YAPLCRKCKFCLSPLTNLCGKISNLKSPASDQ----------------QL 128
    gi|42738724|gb|AAS42652.1|  GGGSVIDCTKAIAAGSKYDGDVWDIVTKKAFASEALPFGTVLTLAATGSE 145

    gi|42542791|gb|AAH66228.1|  LQDGTRRFTCSGKPIHHFVGVSTFSQYTVVDENAVAKIDAASPLEKVCLI 173
    gi|825623|emb|CAA39813.1|   MEDKTSRFTCKGKPVYHFFGTSTFSQYTVVSDINLAKIDDDANLERVCLL 178
    gi|42738724|gb|AAS42652.1|  MNAGSVITNWETNEKYGWGSPVTFPQFSILDPVHTASVPRDQTIYGMVDI 195

Under each block ClustalW prints a conservation line, in which `*` marks identical columns, `:` conserved substitutions and `.` semi-conserved substitutions.

Sequence annotations shown on the slide:

- alcohol dehydrogenase, iron-containing [Bacillus cereus ...]
- Class I alcohol dehydrogenase, gamma subunit [Homo sapiens]
- Different form of alcohol dehydrogenase [Homo sapiens]

Localised alignments in sequences

The MSA programs discussed so far are based on global alignments, which include all available parts of the sequences. However, many sequences have blocks of similarity separated by regions of low similarity. Three approaches have been used to develop methods oriented toward this structural feature:

1. Profile analysis
2. Block analysis
3. Pattern searching

Profiles are found by performing a global MSA of a group of sequences and then choosing the more highly conserved regions. A score matrix for such an MSA, called a profile, is then made. Once produced, the profile is used to search a target sequence for possible matches, using the scores in the table to evaluate the likelihood of a match at each position.

Profile analysis: pattern identification

| CONS | A | B | C | D | … | V | W | Y | Z | Gap | Len |
|---|---|---|---|---|---|---|---|---|---|---|---|
| I | 8 | 3 | -2 | 5 | … | 21 | -18 | -6 | 4 | 100 | 100 |
| T | 13 | 19 | -5 | 24 | … | 3 | -28 | -14 | 15 | 100 | 100 |
| L | 5 | 5 | -5 | 3 | … | 10 | -1 | 5 | 2 | 22 | 22 |
| S | 17 | 14 | 17 | 13 | … | 1 | -8 | -15 | 4 | 100 | 100 |
| T | 15 | 3 | 22 | 0 | … | 9 | -22 | 6 | -4 | 100 | 100 |
| T | 8 | -1 | 12 | -2 | … | 19 | -15 | 4 | -3 | 100 | 100 |
| C | 17 | 0 | 24 | -1 | … | 9 | -5 | 14 | -7 | 100 | 100 |
| V | 11 | 0 | 18 | -1 | … | 31 | -19 | -5 | -5 | 100 | 100 |
| C | 10 | -8 | 15 | -11 | … | 15 | 22 | 14 | -11 | 100 | 100 |
| V | 7 | 7 | -3 | 8 | … | 26 | -24 | -6 | 8 | 100 | 100 |

The profile represents the specific motif pattern found at the chosen location in a set of hsp70 proteins. It is used to search a target sequence for matches to the profile. The values are log odds scores: the probability of finding the amino acid in the target sequence at that position in the profile, divided by the probability of aligning the two amino acids by random chance. There are 23 columns, representing 20 aa + 1 unknown aa (Z) + gap opening and extension penalties. Gaps are costly unless the profile itself includes gaps, as in row 3.
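How a profile is used to search a target sequence can be sketched as a sliding-window scan. The sketch below is deliberately simplified and is not how any particular profile program works: it uses only the first four rows of the table and only the columns that are visible in it (A, C, D, V, W, Y), it ignores the gap penalties, and the target sequence `WVDVAY` is invented for the example.

```python
# First four profile rows (consensus I, T, L, S), restricted to the
# amino acid columns that are visible in the table.
PROFILE = [
    {"A": 8,  "C": -2, "D": 5,  "V": 21, "W": -18, "Y": -6},   # I
    {"A": 13, "C": -5, "D": 24, "V": 3,  "W": -28, "Y": -14},  # T
    {"A": 5,  "C": -5, "D": 3,  "V": 10, "W": -1,  "Y": 5},    # L
    {"A": 17, "C": 17, "D": 13, "V": 1,  "W": -8,  "Y": -15},  # S
]


def scan(profile, target):
    """Score every ungapped placement of the profile along the target."""
    width = len(profile)
    hits = []
    for start in range(len(target) - width + 1):
        window = target[start:start + width]
        score = sum(row[residue] for row, residue in zip(profile, window))
        hits.append((start, score))
    return hits


hits = scan(PROFILE, "WVDVAY")  # hypothetical target sequence
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(hits, best_start, best_score)
```

A real profile search would also allow gaps, charging the Gap and Len penalties of the row at which the gap is opened or extended.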

Profile analysis: pattern identification

The log odds scores for the profile ($\text{Profile}_{ij}$) are given by:

$$\text{Profile}_{ij} = \log\left[\sum_{\text{all } a} \frac{W_{ai} \times p_{aij}}{p_{\text{random}\,j}}\right]$$

where $W_{ai}$ is the weight of an ancestral amino acid $a$ at row $i$ in the profile, $p_{aij}$ is the frequency of amino acid $j$ in the PAM amino acid distribution that best matches at row $i$, and $p_{\text{random}\,j}$ is the background frequency of amino acid $j$.

Steps:

1. A profile for a protein family is prepared using a few sequences.
2. The profile is used to search a protein DB for the family members.
3. The search is assessed with a receiver operating characteristic (ROC) test plot; the score can be as high as 95.6 ± 0.6% of the known family members for as few as 6 initial sequences.

The success rate may be increased slightly by using more than 100 sequences for the profile search.

Block analysis

This method is very similar to the profile search. The major difference is that insertions and deletions are not considered. As a result, the patterns found are regions of high similarity separated by loosely similar or dissimilar sequence. These ungapped patterns may be extracted from the aligned regions and used to produce blocks. Profile matrices are then built in the same way as in the previous method.

| Sequence | Block 1 | Block 2 | Block 3 |
|---|---|---|---|
| Seq1 | GVDVLVATPG | RLLDLEHQNA..VKLDQV | EILVLDEADR |
| Seq2 | GPDALVSTPG | RYLTLEHRNV..LKPDIV | TIRVLDEADR |
| Seq3 | AVEVIVSTPG | RLWDLHHQNA..VQLSQD | ELLDLDEADK |
| … | … | … | … |
| Seqn | GCDKLNATPG | RLMDLKHQGA..VKLLFV | SILVMDEADR |
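Because a block is ungapped, turning it into per-column residue counts is straightforward. The sketch below uses the last block of the example (the one ending in DEADR/DEADK); the function names are hypothetical, and a real block or profile method would go on to convert the counts into weighted log odds scores, which is not shown here.

```python
from collections import Counter

# The third block of the example alignment; all rows have equal length.
BLOCK = ["EILVLDEADR", "TIRVLDEADR", "ELLDLDEADK", "SILVMDEADR"]


def column_counts(block):
    """Count the residues found in each column of an ungapped block."""
    return [Counter(column) for column in zip(*block)]


def consensus(block):
    """Take the most frequent residue of each column."""
    return "".join(counts.most_common(1)[0][0] for counts in column_counts(block))


counts = column_counts(BLOCK)
print(consensus(BLOCK))
print(dict(counts[9]))  # the last column mixes R and K
```

Columns 6 to 9 (D, E, A, D) are identical in all four sequences, which is the kind of highly conserved region a block is meant to capture.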

Hidden Markov Models: sequence alignment

    N . F L S
    N K Y L T
    Q . W - T

The K in the second column is an insertion (INS), and the gap in the third sequence is a deletion (DEL).

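Hidden Markov Model for sequence alignment

Figure: the model runs from BEG to END through match/mismatch states M1 to M4, insert states I0 to I4 and delete states D1 to D4. Each arrow between states carries a transition probability, and each insert state has a loop transition to accommodate multi-residue insertions.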

Hidden Markov Models: calculation of transition probabilities

    N . F L S
    N K Y L T
    Q . W - T

A pathway for the sequence N K Y L T is: BEG → M1 → I1 → M2 → M3 → M4 → END.

Each transition has an associated probability, and the probabilities of the transitions leaving each state sum to 1. Here every transition has probability 0.33, except those leaving M4 and D4; the transition M4 → END has probability 0.5. Assuming for simplicity that each state emits the 20 amino acids with a uniform distribution, the probability of each emitted residue is p = 0.05. Multiplying the six transition probabilities and the five emission probabilities along the pathway gives

$$P_{NKYLT} = 0.33^5 \times 0.5 \times 0.05^5 = 6.1 \times 10^{-10}$$

The secret of using an HMM successfully is to adjust the transition values and the residue distribution of each state by training the model. HMMER is a good example of a program that uses such training. The training is retained in the model and improves its ability to produce better MSAs. A generated pathway is called a Markov chain because the next state depends on the previous one. As the actual sequence of states (the pathway) is hidden, the model is described as a Hidden Markov Model.
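The probability of the pathway for N K Y L T can be checked with a minimal Python sketch. It hard-codes only the six transitions on this one pathway, with the probabilities quoted above, and assumes the uniform emission probability of 0.05; the names `TRANSITIONS`, `EMISSION` and `path_probability` are invented for the example.

```python
import math

# Transition probabilities along the one pathway discussed in the text.
TRANSITIONS = {
    ("BEG", "M1"): 0.33,
    ("M1", "I1"): 0.33,
    ("I1", "M2"): 0.33,
    ("M2", "M3"): 0.33,
    ("M3", "M4"): 0.33,
    ("M4", "END"): 0.5,
}
EMISSION = 0.05  # assumed uniform over the 20 amino acids


def path_probability(path, sequence):
    """Multiply the transition and emission probabilities along a state path."""
    emitting = [state for state in path if state[0] in "MI"]
    if len(emitting) != len(sequence):
        raise ValueError("each match or insert state must emit one residue")
    probability = 1.0
    for source, target in zip(path, path[1:]):
        probability *= TRANSITIONS[(source, target)]
    return probability * EMISSION ** len(sequence)


path = ["BEG", "M1", "I1", "M2", "M3", "M4", "END"]
p = path_probability(path, "NKYLT")
print(f"P = {p:.2e}, log10 P = {math.log10(p):.2f}")
```

Products like this shrink quickly with sequence length, so practical implementations usually add log probabilities instead of multiplying raw probabilities.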

GeneDoc: a multiple sequence alignment editor
