2 Scoring Matrix Types. Identity matrix: exact matches receive one score and non-exact matches a different score (say 1 and 0, or 6 and -1 for local alignment). Mutation data matrix: a scoring matrix compiled from observed protein point mutations (PAM, BLOSUM). Physical properties matrix: amino acids with similar properties (e.g. hydrophobicity) receive a high score. Genetic code matrix: amino acids are scored based on similarities in their coding triplets (codons).
3 Substitution Matrix. Some amino acids substitute easily for one another because of similar physicochemical properties: isoleucine for valine (both small, hydrophobic), serine for threonine (both polar). Such changes are called “conservative”. The alignment algorithm needs a way to recognize them in order to increase its sensitivity; the solution is a substitution matrix. We therefore need a range of values that depends on the nature of the sequences being compared: identical amino acids > conservative substitutions > nonconservative substitutions.
4 Choice of scoring matrix is dictated by the alignment goals. Two proteins are homologous if (and only if) they are evolutionarily related (have a common ancestor). Homologous proteins are likely to have related functions (and the same fold). Scoring matrices must in some way model our understanding of protein evolution. Based on the result of a search, we have to be able to decide whether the discovered sequence similarity could have arisen by chance or is a signature of likely homology.
5 BLOSUM. Block: a short contiguous interval of multiply aligned sequences. BLOCKS: a database of blocks of highly conserved sequences representing hundreds of protein groups. Pipeline: BLOCKS → substitution frequencies → log-odds scores. Within each block, sequences within a certain similarity threshold are clustered (an 80% threshold yields BLOSUM80), and each cluster is either represented by one sequence or has its members' contributions averaged. BLOSUM62: often compared with PAM250 (and believed to be better).
6 BLOSUM Method. Starting from a database of blocks: 1. Derive a frequency table from the database of blocks. 2. Compute a log-odds matrix.
7 Methods: deriving a frequency table from a database of blocks. The frequency table counts all possible amino acid pairs in each column. For example, a column with 9 A residues and 1 S residue contains 8+7+…+1 = 36 AA pairs, 9 AS (or SA) pairs, and no SS pairs. In general, a block of width $w$ and depth $s$ contributes $w s (s-1)/2$ pairs; here $1 \times 10 \times (10-1)/2 = 45$.
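The pair counting on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `count_pairs` and the representation of a block as a list of equal-length strings are assumptions, not part of the original material.

```python
from collections import Counter
from itertools import combinations


def count_pairs(block):
    """Count unordered residue pairs column by column.

    `block` is a list of equal-length aligned segments (hypothetical
    representation chosen for this sketch).
    """
    pairs = Counter()
    for column in zip(*block):
        # Every pair of rows in the column contributes one residue pair.
        for x, y in combinations(column, 2):
            pairs[tuple(sorted((x, y)))] += 1
    return pairs


# The slide's example: one column holding 9 A residues and 1 S residue.
block = ["A"] * 9 + ["S"]
pairs = count_pairs(block)
total = sum(pairs.values())

# Observed probabilities: each count divided by the total number of pairs.
q = {pair: n / total for pair, n in pairs.items()}
```

Sorting each pair makes AS and SA fall into the same table cell, which matches the "9 AS or SA pairs" on the slide.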
8 Methods. The result of this counting is a frequency table listing the number of times each of the 20+19+…+1 = 210 different amino acid pairs occurs among the blocks. The table is used to calculate a matrix of odds ratios between these observed frequencies and those expected by chance.
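9 Methods: observed probability $q_{ij}$. Each pair count $f_{ij}$ is divided by the total number of pairs. In the example column, $f_{AA} = 36$ and $f_{AS} = 9$, so $q_{AA} = 36/45 = 0.8$ and $q_{AS} = 9/45 = 0.2$.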
11 Methods: the odds ratio. An odds ratio matrix is calculated in which each entry is $q_{ij}/e_{ij}$, where $e_{ij}$ is the probability of the pair expected by chance. The logarithm of the odds ratio (lod), in bit units, is $S_{ij} = \log_2(q_{ij}/e_{ij})$, and this lod is used as the score. If the observed frequency equals the expected one, $S_{ij} = 0$; if it is less than expected, $S_{ij} < 0$; if it is more than expected, $S_{ij} > 0$.
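As a worked illustration, the log-odds scores for the 9 A + 1 S example column can be computed as below. The transcript does not show how $e_{ij}$ is obtained, so the sketch assumes the usual BLOSUM construction: the residue frequency is $p_i = q_{ii} + \sum_{j \neq i} q_{ij}/2$, and the expected pair probability is $p_i^2$ for identical residues and $2 p_i p_j$ otherwise. The function name `log_odds` is hypothetical.

```python
import math


def log_odds(q):
    """Return residue frequencies and log2 odds scores.

    `q` maps unordered pairs (i, j) to observed probabilities.
    Assumption (not on the slide): e_ii = p_i^2 and e_ij = 2 * p_i * p_j.
    """
    p = {}
    for (i, j), qij in q.items():
        if i == j:
            p[i] = p.get(i, 0.0) + qij
        else:
            # A mixed pair contributes half of its probability to each residue.
            p[i] = p.get(i, 0.0) + qij / 2
            p[j] = p.get(j, 0.0) + qij / 2

    scores = {}
    for (i, j), qij in q.items():
        expected = p[i] * p[j] if i == j else 2 * p[i] * p[j]
        scores[i, j] = math.log2(qij / expected)
    return p, scores


# Observed probabilities from the example column.
q = {("A", "A"): 0.8, ("A", "S"): 0.2}
p, scores = log_odds(q)
```

Under these assumptions the AA pair is seen slightly less often than expected (a small negative score) and the AS pair slightly more often (a small positive score).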
12 Methods: clustering segments within blocks. Sequences are clustered within blocks, and each cluster is weighted. This is done by specifying a clustering percentage: sequence segments that are identical at that percentage of amino acids or more are grouped together. The lod matrix derived from a database of blocks in which sequences identical at 80% or more of aligned residues are clustered is referred to as BLOSUM 80, and so forth.
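A minimal sketch of the clustering step follows. It assumes single-linkage grouping (a segment joins a cluster if it reaches the threshold with any member) and gives each segment a weight of one over its cluster size, so that every cluster contributes like a single sequence. The names `identity` and `cluster_weights` are hypothetical.

```python
def identity(s, t):
    """Fraction of aligned positions at which two segments agree."""
    return sum(x == y for x, y in zip(s, t)) / len(s)


def cluster_weights(segments, threshold):
    """Weight each segment by 1 / (size of its cluster).

    Assumption for this sketch: single-linkage clustering at `threshold`.
    """
    parent = list(range(len(segments)))

    def find(i):
        # Follow parent links to the cluster representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if identity(segments[i], segments[j]) >= threshold:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(len(segments)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return [1.0 / sizes[find(i)] for i in range(len(segments))]


# The first two segments are 80% identical, so they share one cluster.
segments = ["AAAAA", "AAAAS", "SSTTA"]
weights = cluster_weights(segments, threshold=0.8)
```

These weights would then multiply the pair counts when the frequency table is built, which is one way to realize "each cluster is weighted".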
13 The Dayhoff Matrix (PAM). Developed by Margaret Dayhoff, 1978. Counted the likelihood of all possible substitutions in closely related proteins. Derived a mutation probability matrix $M_{i,j}$: the probability that $A_i$ mutates to $A_j$ in one evolutionary unit (one PAM). Multiplying $M$ by itself extrapolates to greater evolutionary distances ($M^k$).
14 PAM units. Log-odds approach: scores are proportional to the log of the ratio of target frequencies to background frequencies. PAM stands for Point Accepted Mutation (or Percent Accepted Mutation). Two sequences S and T are defined to be one PAM unit diverged if a series of accepted point mutations (and no insertions or deletions) can convert S to T with an average of one mutation per 100 residues. A point accepted mutation is a mutation of one residue that has been accepted by evolution.
15 PAM units. Problem 1: given two sequences, you cannot tell their PAM distance in the strict sense of the above definition, since one residue could have mutated more than once. However, if the sequences are closely related, this is unlikely to occur. Problem 2: a change could have happened by deletion or insertion.
16 PAM Matrices: Summary. There is a series of PAM matrices; PAMn attempts to provide proper scoring for sequences that have diverged by n PAM units. The PAMn matrix is obtained from PAM1 by assuming a Markov model of protein evolution in which the transition probabilities for one PAM step are given by PAM1: $\mathrm{PAM}n = (\mathrm{PAM}1)^n$. PAM1 is constructed from highly similar sequences (believed to be at most a few PAM units apart), so that Problems 1 and 2 are unlikely to occur.
17 Computation representation. Define: $f_p(a)$ = probability of occurrence of amino acid $a$; $f(a,b)$ = the number of times the mutation $a \leftrightarrow b$ was observed, with $f(a,b) = f(b,a)$; $f(a) = \sum_{b \neq a} f(a,b)$, the total number of mutations involving $a$; $m(a)$ = mutability of amino acid $a$ = $f(a)/f_p(a)$.
18 Computation representation (continued). $M(a,b)$ = the probability of amino acid $a$ changing to amino acid $b$: $M(a,b) = \Pr(a \to b) = \Pr(a \to b \mid a \text{ changed})\,\Pr(a \text{ changed}) = f(a,b)\,m(a)/f(a)$. The conditional probability is estimated as the ratio between the number of $a \leftrightarrow b$ mutations and the total number of mutations involving $a$. The diagonal elements are the probabilities of no change: $M(a,a) = 1 - m(a)$.
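The definitions above can be turned into a small sketch that builds a mutation probability matrix from symmetric mutation counts. One assumption goes beyond the slides: since $f(a)$ is a raw count, the mutability is multiplied by a constant `lam` chosen so that, on average, 1% of residues change (the usual PAM1 normalization). The function name `pam1_matrix`, the three-letter alphabet, and all counts are made up for illustration.

```python
def pam1_matrix(pair_counts, freqs, target_change=0.01):
    """Build M(a, b) from symmetric counts f(a, b) and frequencies f_p(a).

    Assumption: m(a) = lam * f(a) / f_p(a), with lam scaled so that
    sum_a f_p(a) * m(a) == target_change.
    """
    letters = sorted(freqs)

    def f_pair(a, b):
        return pair_counts.get(frozenset((a, b)), 0)

    # f(a): total number of observed mutations involving a.
    f = {a: sum(f_pair(a, b) for b in letters if b != a) for a in letters}
    lam = target_change / sum(f.values())
    m = {a: lam * f[a] / freqs[a] for a in letters}

    M = {}
    for a in letters:
        for b in letters:
            if a == b:
                M[a, b] = 1.0 - m[a]  # probability of no change
            else:
                M[a, b] = f_pair(a, b) * m[a] / f[a] if f[a] else 0.0
    return M


# Toy data (hypothetical): three residue types and their mutation counts.
freqs = {"A": 0.5, "B": 0.3, "C": 0.2}
pair_counts = {frozenset("AB"): 30, frozenset("AC"): 10, frozenset("BC"): 20}
M = pam1_matrix(pair_counts, freqs)
```

Each row of `M` sums to 1, and with this scaling $f_p(a)\,M(a,b) = f_p(b)\,M(b,a)$, which is what later makes the log-odds score matrix symmetric.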
19 Relatedness odds. The matrix entry $M(a,b)$ gives the probability that amino acid $a$ will change to $b$ in a related sequence over a given evolutionary interval. $f_p(b)$ is the chance of a random occurrence of amino acid $b$. $\mathrm{Score}(a,b) = 10\log_{10}[M(a,b)/f_p(b)]$ (a symmetric matrix).
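A short sketch of the score computation, using a made-up two-letter alphabet. The toy values are chosen so that $f_p(a)\,M(a,b) = f_p(b)\,M(b,a)$, which is the condition under which the score matrix comes out symmetric; the function name `pam_score` is hypothetical.

```python
import math


def pam_score(M, freqs, a, b):
    """Log-odds score 10 * log10(M(a, b) / f_p(b))."""
    return 10 * math.log10(M[a][b] / freqs[b])


# Toy values (hypothetical): 0.6 * 0.02 == 0.4 * 0.03, so scores are symmetric.
freqs = {"X": 0.6, "Y": 0.4}
M = {
    "X": {"X": 0.98, "Y": 0.02},
    "Y": {"X": 0.03, "Y": 0.97},
}

scores = {(a, b): pam_score(M, freqs, a, b) for a in freqs for b in freqs}
```

Identities score positive because a residue is far more likely to be conserved than to appear by chance; the substitution scores negative because it is rarer than chance over this short interval.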
20 PAM. Consider two amino acids (or nucleotides) $i$ and $j$, with frequencies $f_i$ and $f_j$. Then $P(\text{random alignment of } i \text{ and } j) = f_i f_j$.
22 Long Distance Evolution. There is a different mutation probability matrix for each evolutionary interval. These can be derived from the one for 1 PAM by matrix multiplication. For example, in 2 PAM units of evolution, $a \to c \to b$, where the intermediate $c$ can be anything, including $a$ or $b$, so $M^2(a,b) = \sum_c M(a,c)\,M(c,b)$. In general, $M^n$ is the transition probability matrix for a period of $n$ units of evolution.
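The extrapolation by repeated multiplication can be sketched in plain Python. The 2 × 2 matrix is a made-up stand-in for a real 20 × 20 PAM1 matrix, and the helper names `mat_mul` and `mat_pow` are hypothetical.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [
        [sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(M, n):
    """Return M multiplied by itself n times (identity for n == 0)."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Toy one-step matrix (hypothetical values); each row sums to 1.
M1 = [[0.98, 0.02],
      [0.03, 0.97]]

M2 = mat_pow(M1, 2)      # two units of evolution
M250 = mat_pow(M1, 250)  # a long interval
```

In `M2`, the off-diagonal entry sums over both intermediate states, and over long intervals every row drifts toward the background frequencies (here 0.6 and 0.4), which is why distant matrices carry less information.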
23 Estimation of Evolutionary Distance. There is a different mutation probability matrix for each evolutionary interval, measured in PAMs. The percentage of amino acids that will be observed to change on average in the interval is $P = 100\,(1 - \sum_i f_p(i)\,M(i,i))$, where $f_p(i)$ is the probability of occurrence of amino acid $i$. A PAM250 matrix usually corresponds to two sequences with about 20% identity.
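The percentage formula is a one-liner once a matrix and the residue frequencies are available. The sketch below uses a made-up two-letter example; `percent_changed` is a hypothetical name.

```python
def percent_changed(M, freqs):
    """Expected percentage of positions that differ: 100 * (1 - sum_i f_i * M[i][i])."""
    unchanged = sum(f * M[i][i] for i, f in enumerate(freqs))
    return 100.0 * (1.0 - unchanged)


# Toy values (hypothetical): a one-step matrix and matching frequencies.
freqs = [0.6, 0.4]
M1 = [[0.98, 0.02],
      [0.03, 0.97]]

P = percent_changed(M1, freqs)
```

Applying the same function to the matrix for a longer interval gives the observed percentage difference for that interval, which grows more slowly than the PAM distance because repeated and reverted mutations are not seen.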
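24 Nucleotide PAM scoring matrices. Assuming equal probability for each mutation, PAM1 would be a 4 × 4 matrix over A, T, G, C with identical off-diagonal entries (matrix values not recovered). Some models score transitions (purine to purine, or pyrimidine to pyrimidine) higher than transversions (second A, T, G, C matrix, values not recovered).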
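Since the matrix values did not survive in the transcript, the following sketch only illustrates the two models under stated assumptions: a total change probability of 1% per base (one PAM), split either equally among the three alternatives or with a hypothetical 60% share going to the single transition. The function names are made up for this example.

```python
BASES = "ATGC"
PURINES = {"A", "G"}


def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return x != y and ((x in PURINES) == (y in PURINES))


def nucleotide_pam1(change=0.01, transition_share=1 / 3):
    """Mutation probability matrix for one PAM unit.

    Assumptions: each base changes with probability `change`; the one
    possible transition takes `transition_share` of it, and the two
    transversions split the remainder equally.
    """
    M = {}
    for x in BASES:
        for y in BASES:
            if x == y:
                M[x, y] = 1.0 - change
            elif is_transition(x, y):
                M[x, y] = change * transition_share
            else:
                M[x, y] = change * (1.0 - transition_share) / 2
    return M


uniform = nucleotide_pam1()                     # all mutations equally likely
biased = nucleotide_pam1(transition_share=0.6)  # transitions favored
```

With equal base frequencies of 0.25, either matrix can be converted to scores with the same log-odds formula used for amino acids; in the biased model a transition mismatch is penalized less than a transversion.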
26 Discrimination of real local alignment from “by chance” alignment. Method: compute the mutual information $\sum_x \sum_y p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$. Recall that the score is $s(x,y) = \log\frac{p(x,y)}{p(x)\,p(y)}$, so we simply compute $\sum_{x=1}^{20}\sum_{y=1}^{20} p(x,y)\,s(x,y)$. Examples (in bits): PAM160 = 0.7; PAM250 = 0.36. Higher mutual information means better discrimination between true and by-chance alignments.
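The mutual information sum can be sketched directly from a joint distribution of aligned residue pairs. The two-letter distributions below are made up for illustration, and `mutual_information` is a hypothetical name.

```python
import math


def mutual_information(joint):
    """Mutual information in bits of a joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    # Each term is p(x, y) times the log-odds score s(x, y) in bits.
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


# Toy joint distributions (hypothetical values).
related = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
unrelated = {("A", "A"): 0.25, ("A", "B"): 0.25, ("B", "A"): 0.25, ("B", "B"): 0.25}

mi_related = mutual_information(related)
mi_unrelated = mutual_information(unrelated)
```

When aligned residues are independent the value is zero, so a matrix built from such pairs could not separate true alignments from chance ones; the more the joint distribution departs from independence, the larger the value.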
27 Problems with PAM. PAM1 is defined in terms of amino acid mutations rather than the number of nucleotide changes. Some mutations may be rare and underrepresented in PAM1 (which is based on closely related proteins only). The mutation rate depends on the position of an amino acid in the structure. The method requires constructing a phylogenetic tree, which in turn needs scoring matrices for proper construction (this remains a problem for many other methods).
28 Some more problems with PAM Matrices. They are derived from global alignments of closely related sequences. Matrices for greater evolutionary distances are extrapolated from those for lesser ones. The number attached to the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers mean greater distances. They do not take into account different evolutionary rates between conserved and non-conserved regions.
29 BLOSUM matrices. BLOcks SUbstitution Matrix. Source: “Amino acid substitution matrices from protein blocks”, S. Henikoff and J. Henikoff, Proc. Natl. Acad. Sci. USA, Vol. 89, November 1992, Biochemistry.
30 Comparison to PAM. The BLOSUM series, derived from alignments in blocks, is fundamentally different from the Dayhoff PAM series, which is derived from the estimation of mutation rates. Nevertheless, the BLOSUM series, based on percent clustering of aligned segments in blocks, can be compared to the Dayhoff matrices, based on percent accepted mutation (PAM), using a measure of average information per residue pair, in bit units, called relative entropy.
31 Comparison between BLOSUM 62 and PAM 160. BLOSUM 62 is less tolerant of substitutions involving hydrophilic amino acids, while it is more tolerant of substitutions involving hydrophobic amino acids. For rare amino acids, especially cysteine and tryptophan, BLOSUM 62 is typically more tolerant of mismatches than PAM 160.
32 PAM vs BLOSUM. Dayhoff estimated mutation rates from substitutions observed in closely related proteins and extrapolated those rates to model distant relationships. In the BLOSUM approach, frequencies were obtained directly from relationships represented in the blocks, regardless of evolutionary distance. The Dayhoff frequency table included 36 pairs for which no accepted point mutations were observed.
33 Differences Between the PAM and BLOSUM Approach. In contrast, the pairs counted for BLOSUM included no fewer than 2369 occurrences of any particular substitution. The BLOSUM matrices depend only on the identity and composition of the protein groups in Prosite. Therefore, these substitution matrices are not expected to change significantly in the future.
34 PAM Versus BLOSUM. PAM is based on an evolutionary model; BLOSUM is based on protein families. PAM is based on global alignment; BLOSUM is based on local alignment.