1 Measuring the degree of similarity: PAM and BLOSUM matrices. Lecture 13
2 Introduction. Measurement of matching. Nucleic acid and amino acid substitutions. The BLOSUM matrix. The PAM matrix. Appropriate use of the BLOSUM and PAM matrices. Measurement of alignment gaps.
3 Measurement of matching. The dot plot gives a visual representation of a sequence alignment. So how do we measure the alignment? One way is to count the matches and mismatches and take the difference between them. Hamming distance: the number of mismatched positions between two strings of equal length. For agtc and cgta the distance is 2 (give another example).
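The Hamming distance can be sketched in a few lines of Python. This is a minimal illustration and not part of the original lecture: the function name `hamming_distance` is hypothetical, and the pair agtc / cgta is assumed to be the slide's example.

```python
def hamming_distance(s: str, t: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))


# The slide's pair: mismatches at the first and last positions.
print(hamming_distance("agtc", "cgta"))  # 2
# Another example: only the last base differs.
print(hamming_distance("agtc", "agtt"))  # 1
```

The length check matters: the measure is undefined for strings of different length, which is what motivates the Levenshtein distance.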
4 Measurement of matching. If the sequences (strings) are not of equal length, then use the Levenshtein distance: the minimum number of edit operations (alter / insert / delete) required to turn one string into another. For ag-tcc and cgctca, what is the Levenshtein distance? But what about the biological plausibility of this approach? Strings are not the same as sequences! (Hint: amino acid alignment.)
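One common way to compute the Levenshtein distance is a row-by-row table of sub-problem distances. The sketch below is an illustration only: the function name `levenshtein` is hypothetical, and the reading of the slide's strings as agtcc and cgctca (gap removed) is an assumption.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of alter/insert/delete operations turning s into t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from s
                curr[j - 1] + 1,         # insert b into s
                prev[j - 1] + (a != b),  # alter a into b (free if they match)
            ))
        prev = curr
    return prev[-1]


# A standard textbook pair (not from the lecture).
print(levenshtein("kitten", "sitting"))  # 3
# The slide's exercise, under the assumed reading of the two strings.
print(levenshtein("agtcc", "cgctca"))
```

Every operation costs the same here, which is exactly the biological weakness the slide hints at: a real scoring scheme weights substitutions differently.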
5 Nucleic acid mutations. It is known that transitions (purine <-> purine or pyrimidine <-> pyrimidine, e.g. a<->g and c<->t) are more common than transversions (purine <-> pyrimidine, e.g. a<->c or g<->t). In sequence alignment we are trying to determine the degree of similarity, not dissimilarity, but the Hamming and Levenshtein distances measure dissimilarity. One approach would be to count the number of matches, but we then need to include the bias associated with the possible substitutions.
6 Nucleic acid scoring table. Based on known rates we could propose a simple table like the following, where each match scores 1000, a transition (A<->G or T<->C) scores 100, and a transversion (any purine <-> pyrimidine change) scores 10. The values correspond to the chances of a substitution (or of no substitution).
      A     G     T     C
A  1000   100    10    10
G   100  1000    10    10
T    10    10  1000   100
C    10    10   100  1000
7 Nucleic acid scoring table. Using this table we can attempt to calculate the similarity: look at each position and determine the score of seq 1 against seq 2. Seq 1: agtc. Seq 2: cgta. Since the positions are, we assume, independent elements (events), we have to multiply their scores to get the overall score: 10 x 1000 x 1000 x 10 = 10^8. However, by taking the log of each value we only have to add the values, because log A + log B = log(A*B): log10 of 10^8 is 8. What would be the table if log values were used?
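The multiply-versus-add argument can be checked with a short sketch. It uses the illustrative scores 1000 / 100 / 10 from the lecture and assumes the standard definition of a transition (a change within the purines a, g or within the pyrimidines c, t). All function names are hypothetical.

```python
import math

MATCH, TRANSITION, TRANSVERSION = 1000, 100, 10  # illustrative scores from the lecture
PURINES = {"a", "g"}                             # c and t are the pyrimidines


def pair_score(x: str, y: str) -> int:
    """Score one aligned pair of bases."""
    if x == y:
        return MATCH
    # Assumption: a transition stays within one chemical class.
    same_class = (x in PURINES) == (y in PURINES)
    return TRANSITION if same_class else TRANSVERSION


def product_score(s: str, t: str) -> int:
    """Multiply the per-position scores (positions treated as independent)."""
    score = 1
    for x, y in zip(s, t):
        score *= pair_score(x, y)
    return score


def log_score(s: str, t: str) -> float:
    """Add the log10 of each per-position score instead of multiplying."""
    return sum(math.log10(pair_score(x, y)) for x, y in zip(s, t))


print(product_score("agtc", "cgta"))  # 100000000, i.e. 10^8
print(log_score("agtc", "cgta"))      # 8.0
```

Adding logs gives the same ranking as multiplying raw scores, while keeping the numbers small for long sequences.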
8 Nucleic acid matrix (log10 values: match 3, transition 2, transversion 1).
   A  G  T  C
A  3  2  1  1
G  2  3  1  1
T  1  1  3  2
C  1  1  2  3
So in this case all we have to do is add the values. Note that this is an example to illustrate the concept; it is not an actual substitution matrix for nucleic acids (bases) [one can be found on the internet], but Lesk (p. 255) gives an example of one. Measurement of sequence similarity plays a much greater role in assessing proteins. Why do you think the similarity of proteins is more critical than that of nucleic acids? (Hint: the genetic code and amino acid properties.)
9 Measuring protein similarity. Deriving a matrix for proteins is more complex because: there are 20 amino acids, so there is a much larger set of substitutions; the amino acids have properties that affect the structure and so the functionality of the protein. Therefore substitutions can be conserved or semi-conserved. Observation shows that conserved substitutions, e.g. hydrophobic <-> hydrophobic, are more common than semi-conserved ones, e.g. hydrophilic <-> hydrophobic.
10 PAM 1 matrix. PAM (Point Accepted Mutation, often expanded as Percent Accepted Mutation) 1 corresponds to one accepted point mutation per 100 residues; in other words, a first round of divergence. The scores shown depend on the expected value of occurrence. Clearly A <-> A, no change, has a high score. Hydrophobic <-> hydrophobic: V <-> A (13), while V <-> I is (57). Hydrophilic <-> hydrophilic: K <-> T (11); K <-> R (37). Hydrophilic <-> hydrophobic: K <-> V (1).
11 Dayhoff PAM (250) matrix. The most common PAM matrix is the PAM 250. It represents a greater degree of evolutionary divergence and corresponds to multiplying the PAM 1 matrix by itself 250 times (repeated matrix multiplication, i.e. raising PAM 1 to the 250th power). To derive the values you use: observed rate of mutation / the random mutation rate (based on the amino acid frequencies). In other words, the ratio is compared with the expected value (no bias, positive bias or negative bias). The log10 of this ratio is multiplied by 10 to give the results in the table opposite. For example, an entry of 2 (the value quoted here for C <-> S) means the substitution occurred about 1.6 times more often than if it were random: log10(1.6) is about 0.2, and multiplying this by 10 gives a value of 2. The values in the PAM 250 are obviously lower, but the distribution is about the same: why?
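Both steps on this slide, the log-odds conversion and the extrapolation from PAM 1 by repeated multiplication, can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and `toy_pam1` is an invented two-state matrix that only stands in for the real 20 x 20 PAM 1 probabilities.

```python
import math


def log_odds_score(observed: float, expected: float) -> int:
    """10 * log10(observed / expected), rounded to the nearest integer."""
    return round(10 * math.log10(observed / expected))


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply m by itself until it has been raised to the n-th power (n >= 1)."""
    result = m
    for _ in range(n - 1):
        result = mat_mul(result, m)
    return result


# A substitution seen 1.6 times more often than chance scores 2.
print(log_odds_score(1.6, 1.0))  # 2

# Hypothetical two-state "PAM 1": 1% chance of change per step.
toy_pam1 = [[0.99, 0.01], [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)
print(toy_pam250)
```

After 250 steps the off-diagonal probabilities of the toy matrix are far larger than 0.01, which illustrates why a high-numbered PAM matrix suits more divergent sequences.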
12 BLOSUM 62 matrix. Another matrix, the BLOSUM matrix, used a larger data set (as there was more information available in 1992 than in 1978). Moreover, BLOSUM looked at mutations within blocks of conserved sequences, as opposed to point mutations on individual sequences in both conserved and variable regions. [What was the logic behind excluding the variable regions?] Unlike the PAM 250 matrix, which is the PAM 1 matrix multiplied 250 times, the BLOSUM 62 matrix is derived directly: its probabilities come from blocks sharing 62% conservation. Like the PAM matrix, it favours conservative substitutions. Hydrophobic to hydrophobic: V <-> A (0); V <-> I (3). Hydrophilic to hydrophilic: K <-> T (-1); K <-> R (2). Hydrophobic to hydrophilic: K <-> V (-2).
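To show how such values are used, the sketch below scores an ungapped alignment by adding the matrix entry for each aligned pair. It contains only the five BLOSUM 62 entries quoted on this slide, so it is a deliberately partial lookup, and the names `BLOSUM62_SUBSET`, `blosum_pair` and `ungapped_score` are hypothetical.

```python
# Only the BLOSUM 62 entries quoted on the slide; the real matrix is 20 x 20.
BLOSUM62_SUBSET = {
    ("V", "A"): 0,
    ("V", "I"): 3,
    ("K", "T"): -1,
    ("K", "R"): 2,
    ("K", "V"): -2,
}


def blosum_pair(x: str, y: str) -> int:
    """Look up a pair in either order (the matrix is symmetric)."""
    if (x, y) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(x, y)]
    if (y, x) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(y, x)]
    raise KeyError(f"pair {x}/{y} is not in this partial table")


def ungapped_score(s: str, t: str) -> int:
    """Sum the substitution scores of two aligned, equal-length sequences."""
    if len(s) != len(t):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(blosum_pair(x, y) for x, y in zip(s, t))


# Conservative changes only: V/I (3) + K/R (2) + K/T (-1).
print(ungapped_score("VKK", "IRT"))  # 4
# Includes a hydrophobic/hydrophilic change: V/A (0) + K/V (-2).
print(ungapped_score("VK", "AV"))    # -2
```

Because the entries are already log-odds values, adding them plays the same role as adding the log scores in the nucleic acid example.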
13 PAM and BLOSUM matrices. In the PAM matrices, as the number increases so does the evolutionary distance, while it is the reverse in the BLOSUM matrices. According to Baxevanis (2003), the following pairs are roughly equivalent and indicate the most appropriate use of both families: PAM250 and BLOSUM 45; PAM160 and BLOSUM 62.
14 PAM and BLOSUM matrices
Matrix                 Best in determining
PAM 40 / BLOSUM 90     Short, similar (conserved) alignments
PAM 250                Longer, more divergent alignments
PAM 160 / BLOSUM 80    Detecting members of protein families
BLOSUM 62              Finding all potential similarities
Adapted from Baxevanis 2005. An excellent review of scoring matrices can be found in Henikoff and Henikoff 2000.
15 Measurement of alignment gaps. Gaps represent insertions and deletions. They need to be limited so that the alignment remains biologically plausible. Baxevanis (2005) suggests that no more than “one in 20 is a good rule of thumb”. Baxevanis (2005) proposed that the use of gaps in alignments is penalised; in other words, each gap reduces the measured similarity. The penalty associated with using gaps depends on: opening the gap; extending the gap; the length of the gap.
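A penalty that depends on opening, extending and length is often written as an affine function: a fixed cost for opening the gap plus a smaller cost for each position it covers. The sketch below assumes that form; the function name and the default values 10 and 1 are hypothetical, chosen only for illustration.

```python
def affine_gap_penalty(length: int, gap_open: int = 10, gap_extend: int = 1) -> int:
    """Penalty for one gap: a fixed opening cost plus a cost per gapped position.

    gap_open and gap_extend are illustrative values, not taken from the lecture.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0  # no gap, no penalty
    return gap_open + gap_extend * length


# One long gap is cheaper than two short gaps covering the same positions.
print(affine_gap_penalty(4))                          # 14
print(affine_gap_penalty(2) + affine_gap_penalty(2))  # 24
```

Making the opening cost larger than the extension cost reflects the idea that one insertion or deletion event spanning several residues is more plausible than several separate events.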
16 The BLAST algorithm. The most widely used approach to determining similarity is the BLAST algorithm. Basically, the algorithm is a combination of the dot plot and one of the scoring matrices, such as BLOSUM or PAM, and is used to determine the best region of local alignment between the query sequence and the target sequences (refer to dot plot example 1 in lecture 12).
17 Potential exam questions. Discuss how to derive both the PAM and BLOSUM matrices and why it is necessary to use different variants of each in different types of similarity analysis. The dot plot and the PAM and BLOSUM matrices are important tools in the measurement of amino acid sequence similarity; discuss the best variant of each that should be used in determining sequence alignment similarity. Distinguish between the two main types of scoring matrices (PAM and BLOSUM) and explain how they are used to measure the amount of similarity between two sequences.
18 References
Baxevanis, A.D. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, chapter 11. Wiley.
Lesk, A. 2008. Introduction to Bioinformatics, 3rd edition. Oxford University Press.