# Measuring the degree of similarity: PAM and BLOSUM Matrix

Lecture 13

Introduction
Measurement of matching
Nucleic acid and amino acid substitutions
The BLOSUM matrix
The PAM matrix
Appropriate use of the BLOSUM and PAM matrices
Measurement of alignment gaps

Measurement of matching
The dot plot gives a visual representation of sequence alignment. So how do we measure the alignment? One way is to count the matches and mismatches. The Hamming distance is the number of mismatched positions between two strings of equal length.
agtc
cgta
The distance is 2. (Give another example.)
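The mismatch count for equal-length strings can be sketched in a few lines of Python. This is only an illustration; the function name `hamming_distance` is chosen here and does not come from the lecture.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance("agtc", "cgta"))  # 2: positions 1 and 4 differ
print(hamming_distance("acgt", "tgca"))  # 4: every position differs
```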

Measurement of matching
If the sequences (strings) are not of equal length, then use the Levenshtein distance: the minimum number of edit operations (alter/insert/delete) required to turn one string into another.
ag-tcc
cgctca
What is the Levenshtein distance? But what about the biological plausibility of this approach? Strings are not the same as sequences! (Hint: amino acid alignment.)
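A minimal sketch of the edit-distance calculation follows, using the usual row-by-row table of partial distances. The function name `levenshtein_distance` is an assumption made for this example, and every edit operation is given a cost of 1.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # previous[j] is the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # turning i characters of a into an empty string
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute (free on a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))  # 3
```

Calling `levenshtein_distance("agtcc", "cgctca")` is one way to check your own answer to the question above.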

Nucleic Acid mutations
It is known that transitions (purine <-> purine, a<->g, or pyrimidine <-> pyrimidine, c<->t) are more common than transversions (purine <-> pyrimidine, e.g. a<->c). In sequence alignment we are trying to determine the degree of similarity and not dissimilarity, but the Hamming and Levenshtein distances measure dissimilarity. One approach would be to count the number of matches, but we then need to include the bias associated with the possible substitutions.

nucleic acid scoring table
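Based on known rates we could propose a simple table like the following, where:
Each match scores 1000.
A transition (A<->G or T<->C) scores 100.
A transversion (any purine <-> pyrimidine change, e.g. A<->T) scores 10.
The values correspond to the chances of a substitution (or of no substitution).

|   | A | G | T | C |
| --- | --- | --- | --- | --- |
| A | 1000 | 100 | 10 | 10 |
| G | 100 | 1000 | 10 | 10 |
| T | 10 | 10 | 1000 | 100 |
| C | 10 | 10 | 100 | 1000 |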

nucleic acid scoring table
Using this table we could calculate the similarity by comparing the two sequences position by position and determining the score of seq 1 against seq 2.
Seq 1: agtc
Seq 2: cgta
Since the positions are, we assume, independent elements (events), we have to multiply their values to get the score: 10 × 1000 × 1000 × 10 = 100,000,000. However, because log A + log B = log(A × B), by taking the log of each value we only have to add the values: log10 of 100,000,000 is 8. What would the table be if log values were used?
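The multiply-versus-add idea can be sketched as below. This is an illustration only: the scores 1000/100/10 are the lecture's example values, while the function names and the rule that a<->g and c<->t both count as transitions are assumptions of this sketch.

```python
import math

PURINES = {"a", "g"}  # the remaining bases, c and t, are pyrimidines


def substitution_score(x: str, y: str) -> int:
    """Illustrative score: 1000 for a match, 100 for a transition, 10 for a transversion."""
    if x == y:
        return 1000
    same_class = (x in PURINES) == (y in PURINES)
    return 100 if same_class else 10


def product_score(s1: str, s2: str) -> int:
    """Multiply the per-position scores (positions treated as independent events)."""
    return math.prod(substitution_score(x, y) for x, y in zip(s1, s2))


def log_score(s1: str, s2: str) -> int:
    """Add the log10 of each per-position score instead of multiplying."""
    return sum(round(math.log10(substitution_score(x, y))) for x, y in zip(s1, s2))


print(product_score("agtc", "cgta"))  # 100000000
print(log_score("agtc", "cgta"))      # 8
```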

Nucleic Acid Matrix

|   | A | G | T | C |
| --- | --- | --- | --- | --- |
| A | 3 | 2 | 1 | 1 |
| G | 2 | 3 | 1 | 1 |
| T | 1 | 1 | 3 | 2 |
| C | 1 | 1 | 2 | 3 |

So in this case all we have to do is add the values. Note that this is an example to illustrate the concept; it is not an actual substitution matrix for nucleic acids (bases), which can be found on the internet. Lesk (p. 255) gives an example of one. Measurement of sequence similarity plays a much greater role in assessing proteins. Why do you think the similarity of proteins is more critical than that of nucleic acids? (Hint: the code and amino acid properties.)

Measuring Protein similarity
Deriving a matrix for proteins is more complex because:
There are 20 amino acids, so there is a much larger set of substitutions.
The amino acids have properties that affect the structure, and so the functionality, of the protein.
Therefore substitutions can be conserved or semi-conserved.
Observation shows that conserved substitutions (e.g. hydrophobic <-> hydrophobic) are more common than semi-conserved ones (e.g. hydrophilic <-> hydrophobic).

PAM 1 matrix
PAM (Percent Accepted Mutation, also expanded as Point Accepted Mutation) 1 corresponds to one accepted point mutation per 100 residues; in other words, a first round of divergence. Each score in the PAM 1 table depends on the expected value of occurrence.
Clearly A <-> A (no change) has a high score.
Hydrophobic <-> hydrophobic: V <-> A (13), while V <-> I is (57).
Hydrophilic <-> hydrophilic: K <-> T (11); K <-> R (37).
Hydrophilic <-> hydrophobic: K <-> V (1).

Dayhoff PAM (250) Matrix
The most common PAM matrix is PAM 250.
It represents a greater degree of evolutionary divergence and corresponds to multiplying the PAM 1 matrix by itself 250 times (repeated matrix multiplication). To derive the values you use the ratio: observed rate of mutation / random mutation rate (based on the amino acid frequencies). In other words, an expected value that shows no bias, positive bias or negative bias. The log (base 10) of this ratio is multiplied by 10 to give the results in the table. Therefore, a C <-> S value of 2 means the substitution occurred about 1.6 times more often than if it were random: log10(1.6) ≈ 0.2, and multiplying this by 10 gives a value of 2. The values in the PAM 250 are obviously lower, but the distribution is about the same: why?
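Both steps, the repeated matrix multiplication and the log-odds scoring, can be sketched as follows. The two-residue matrix `toy_pam1` is invented purely for illustration and does not contain real PAM 1 values; the function names are also assumptions of this sketch.

```python
import math


def log_odds_score(observed: float, expected: float) -> int:
    """Ten times log10 of the observed/expected ratio, rounded to an integer."""
    return round(10 * math.log10(observed / expected))


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power: int):
    """Multiply m by itself until the given power (>= 1) is reached."""
    result = m
    for _ in range(power - 1):
        result = mat_mul(result, m)
    return result


# Hypothetical mutation probabilities for a two-letter alphabet (rows sum to 1).
# These are NOT real PAM 1 values.
toy_pam1 = [[0.99, 0.01],
            [0.02, 0.98]]
toy_pam250 = mat_pow(toy_pam1, 250)

print(log_odds_score(1.6, 1.0))  # 2
```

After 250 multiplications the rows of `toy_pam250` still sum to 1, but the probability of seeing no change (the diagonal) is much lower than in `toy_pam1`.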

BLOSUM 62 matrix
Another matrix, the BLOSUM matrix, used a larger data set (as there was more information available in 1992 than in 1978). Moreover, BLOSUM looked at mutations within blocks of conserved sequences, as opposed to point mutations on individual sequences in both conserved and variable regions. [What was the logic behind excluding the variable regions?] Unlike the PAM 250 matrix, which is the PAM 1 matrix multiplied by itself 250 times, the BLOSUM 62 matrix takes its probabilities directly from blocks sharing 62% conservation. Like the PAM matrix, it scores conserved substitutions higher than semi-conserved ones:
Hydrophobic to hydrophobic: V <-> A (0); V <-> I (3)
Hydrophilic to hydrophilic: K <-> T (-1); K <-> R (2)
Hydrophobic to hydrophilic: K <-> V (-2)
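As a rough sketch of how such a matrix is used, the five BLOSUM 62 values quoted on this slide can be stored in a lookup table and summed over an ungapped alignment. Only those five pairs are included, so any other pair (including identical residues) raises a `KeyError`; the names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score` are assumptions of this example.

```python
# Only the off-diagonal BLOSUM 62 scores quoted on the slide;
# a full matrix covers all 20 x 20 amino acid pairs.
BLOSUM62_SUBSET = {
    frozenset("VA"): 0,
    frozenset("VI"): 3,
    frozenset("KT"): -1,
    frozenset("KR"): 2,
    frozenset("KV"): -2,
}


def pair_score(x: str, y: str) -> int:
    """Look up a substitution score; the order of the two residues does not matter."""
    return BLOSUM62_SUBSET[frozenset((x, y))]


def ungapped_score(s1: str, s2: str) -> int:
    """Sum the pair scores of two aligned sequences of equal length."""
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(x, y) for x, y in zip(s1, s2))


print(ungapped_score("VKK", "IRV"))  # 3 + 2 - 2 = 3
```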

PAM and BLOSUM Matrices
In the PAM matrices, evolutionary distance increases as the number increases, while it is the reverse in the BLOSUM matrices. According to Baxevanis (2003), the following represents the equivalence and most appropriate use of both matrices:
PAM 250 and BLOSUM 45
PAM 160 and BLOSUM 62

PAM and BLOSUM Matrix

| Matrix | Best in determining |
| --- | --- |
| PAM 40 / BLOSUM 90 | Short, similar (conserved) alignments |
| PAM 250 | Longer, more divergent alignments |
| PAM 160 / BLOSUM 80 | Detecting members of protein families |
| BLOSUM 62 | Finding all potential similarities |

Adapted from Baxevanis 2005. An excellent review of scoring matrices can be found in Henikoff and Henikoff 2000.

Measurement of alignment gaps
Gaps represent insertions and deletions. They need to be limited so that the alignment remains biologically plausible. Baxevanis (2005) suggests that no more than “one in 20 is a good rule of thumb”. Baxevanis (2005) proposed that the use of gaps in alignments is penalised; in other words, the measured similarity is reduced. The penalty associated with using gaps depends on:
Opening the gap
Extending the gap
The length of the gap
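One common way to combine these three factors is an affine gap penalty: a fixed cost for opening a gap plus a cost for each position it extends over. The sketch below assumes that form; the default costs (11 to open, 1 per position) are illustrative values chosen for this example, and other tools and conventions use different numbers.

```python
def affine_gap_penalty(length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Penalty for one gap: an opening cost plus an extension cost per gapped position."""
    if length <= 0:
        return 0  # no gap, no penalty
    return gap_open + gap_extend * length


def gapped_score(match_score: int, gap_lengths) -> int:
    """Subtract the penalty of every gap from the score of the aligned residues."""
    return match_score - sum(affine_gap_penalty(n) for n in gap_lengths)


print(affine_gap_penalty(3))     # 11 + 3 * 1 = 14
print(gapped_score(50, [1, 3]))  # 50 - (12 + 14) = 24
```

With these costs one long gap is cheaper than several short ones of the same total length, which reflects the idea that a single insertion or deletion event is more plausible than many separate ones.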

The BLAST Algorithm
The most widely used approach to determine similarity is the BLAST algorithm. Basically, the algorithm is a combination of the dot plot and one of the scoring matrices, such as BLOSUM or PAM, and is used to determine the best region of local alignment between the query sequence and the target sequences (refer to dot plot example 1 in lecture 12).

Potential Exam Questions
Discuss how to derive both the PAM and BLOSUM matrices and why it is necessary to use different variants of each in different types of similarity analysis.
The dot plot and the PAM and BLOSUM matrices are important tools in the measurement of amino acid sequence similarity. Discuss the best variant of each that should be used in the determination of sequence alignment similarity.
Distinguish between the two main types of scoring matrices (PAM and BLOSUM) and explain how they are used to measure the amount of similarity between two sequences.

References
Baxevanis, A.D. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, chapter 11. Wiley.
Lesk, A. 2008. Introduction to Bioinformatics, 3rd edition. Oxford University Press.
