Presentation on theme: "Substitution matrices. Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance."— Presentation transcript:
Outline Why do we need scoring matrices? Calculation of alignment costs Pairwise Multiple Database searches Phylogenetic distance calculations
What is a Sequence Alignment? Given two nucleotide or amino acid sequences, determine if they are related (descended from a common ancestor) Technically, we can align any two sequences, but not always in a meaningful way
What is a Sequence Alignment? (cont.) HIGHLY RELATED: HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL HBB_HUMAN GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL RELATED: HBA_HUMAN GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL ++ ++++H+ KV + +A ++ +L L+++H+ K LGB2_LUPLU NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG SPURIOUS ALIGNMENT: HBA_HUMAN GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL GS+ + G + +D L ++ H+ D+ A +AL D ++AH+ F11G11.2 GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE How to filter out the last one & pick up the second?
Why Should We Care? Fragment assembly in DNA sequencing – Experimental determination of nucleotide sequences is only reliable up to about 500-800 base pairs (bp) at a time – But a genome can be millions of bp long! – If fragments overlap, they can be assembled:...AAGTACAATCA CAATTACTCGGA... – Need to align to detect overlap
Why Should We Care? (cont.) Finding homologous proteins and genes – i.e. evolutionarily related (common ancestor) – Thus want to avoid the spurious alignment
What do we do? Model evolutionary process -mutational changes to DNA Can you name some?
How do we do it? Choose a scoring scheme Choose an algorithm to find optimal alignment with respect to a scoring scheme Statistically validate alignment
Scoring Schemes Since goal is to find related sequences, want evolution- based scoring scheme – Mutations occur often at the genomic level, but their rates of acceptance by natural selection vary depending on the mutation – I.e. changing an AA to one with similar properties is more likely to be accepted Assume that all changes occur independently of each other and are Markovian (makes working with probabilities easier)
If AA a i is aligned with a j, then a j was substituted for a i...KALM......KVLM... Was this due to an accepted mutation or simply by chance? – If A or V is likely in general, then there is less evidence that this is a mutation Want the score s ij to be higher if mutations more likely – Take ratio of mutation prob. to prob. of AA appearing at random Generally, if a j is similar to a i in property, then accepted mutation more likely and s ij higher
Mutations Only consider immediate mutations a i -> a j, not a i -> a k -> a j Mutations are undirected => scoring matrix is symmetric Markov assumptions
Nucleic acid substitution matrices Uniform mutation rates Higher transitions ( ) than transversions ( ) C T A G Pyrimidines Purines
Mutation matrices AGTC A0.99 G0.00333 0.99 T0.00333 0.00333 0.99 C0.00333 0.00333 0.00333 0.99 AGTC A0.99 G0.0060.99 T0.002 0.002 0.99 C0.0020.002 0.006 0.99 3 fold higher transitions ( ) than transversions ( ) Uniform mutation rates 1% probability of change at each position ( M ij values)
log-odds ratio Alignment:ATC AGT Probability under match hypothesis: P(alignment | match) = P(A,A | match)P(T,G | match)P(C,T | match) Probability under random hypothesis: P(alignment | random ) = P(A,A | random )P(T,G | random)P(C,T | random) Relative probability (odds ratio) R(alignment) = P(alignment | match) / P(alignment | random ) = R(A,A)R(T,C)R(C,T) For computational reasons we take logarithms log R(alignment) = log R(A,A) + log R(T,G) + log R(C,T)
log-odds ratio (cont.) Use mutation matrix to derive log odds matrix The probability (s ij ) of obtaining a match between nucleotides i and j, divided by the random probability of aligning i and j is given by s ij = log (p i M ij / p i p j ) = log (M ij / p j ) where M ij is the mutation matrix value and p i and p j are the fractional compositions of each nucleotide.
Substitution matrices A G T C A 2 G-6 2 T-6 -6 2 C-6 -6 -6 2 A G T C A 2 G-5 2 T-7 -7 2 C-7 -7 -5 2 3 fold higher transitions ( ) than transversions ( ) Uniform mutation rates 1% probability of change at each position, log base 2
Gap penalties Why?
Gap penalties Modeling biology Insertions / deletions: – Single nucleotide – Unequal crossover – Introns, transposons etc. Gap score w x for gap of length x consists of Gap opening (g) and gap extension (r) penalties. w x = g + rx
Protein matrices PAM BLOSUM
The PAM Transition Matrices Dayhoff et al. started with several hundred alignments between very closely related proteins Compute the frequency with which each AA is changed into each other AA changes over a short evolutionary distance (short enough where only 1% AAs change) 1 PAM = 1% point accepted mutation (accepted by natural selection, i. e. not altering the function of the protein)
PAMs (cont.) Estimate p i with the frequency of AA a i over both sequences, i.e. # of a i ’s/number of AAs Let f ij = f ji = # of a i a j changes in data set, f i = j i f ij and f = i f i Define the scale to be the amount of evolution to change 1 in 100 AAs (on average) [1 PAM dist] Relative mutability of a i is the ratio of number of mutations to total exposure to mutation: m i = f i / (100fp i )
PAMs (cont.) If m i is probability of a mutation for a i, then M ii = 1 - m i is prob. of no change a i -> a j if and only if a i changes and a i -> a j given that a i changes, so M ij = Pr(a i -> a j ) = Pr(a i -> a j | a i changed )Pr(a i changed ) = (f ij /f i )m i = f ij /(100fp i ) The 1 PAM transition matrix consists of the M ii ’s and gives the probabilities of mutations from a i to a j
What About 2 PAM? How about the probability that a i -> a j in two evolutionary steps? It’s the prob. that a i -> a k (for any k ) in step 1, and a k -> a j in step 2. This is k M ik M kj = M ij 2
k PAM Transition Matrix In general, the probability that a i -> a j in k evolutionary steps is M ij k As k -> , the rows of M k tend to be identical with the i th entry of each row equal to p i – A result of our Markovian assumption of mutation
Building a Scoring Matrix When aligning different AAs in two sequences, we want to differentiate mutations and random events We are thus interested in the ratio of transition probability to prob. of randomly seeing new AA Ratio > 1 if and only if mutation more likely than random event
Building a Scoring Matrix (cont’d) When aligning multiple AAs, ratio of probs for multiple alignment = product of ratios: Taking logs will let us use sums rather than products
Building a Scoring Matrix (cont’d) Final step: computation faster with integers than with reals, so scale up (to increase precision) and round: s ij = C log 2 (M ij /p j ) C is a scaling constant For k PAM, use M ij k
PAM Scoring Matrix Miscellany Pairs of AAs with similar properties (e.g. hydrophobicity) have high pairwise scores, since similar AAs are more likely to be accepted mutations. In general, low PAM numbers find short, strong local similarities and high PAM numbers find long, weak ones. Often multiple searches will be run, using e.g. 40 PAM, 120 PAM, 250 PAM.
BLOSUM Scoring Matrices Based on multiple alignments, not pairwise. Direct derivation of scores for more distantly related proteins. Only possible because of new data: multiple alignments of known related proteins.
BLOCKS database entry
BLOSUM Scoring Matrices (cont’d) Started with ungapped alignments from BLOCKS database Sequences clustered at L % sequence identity This time f ij = # of a i a j changes between pairs of sequences from different clusters, normalizing by dividing by (n 1 n 2 ) = product of sizes of clusters 1 and 2 f i = j f ij and f = i f i (different from PAM i=j) Then the scoring matrix entry is
BLOSUM 50 Scoring Matrix
Substitution matrices: summary Nucleic acid matrices are model derived Amino acid matrices are derived from observed data – PAM: closely related sequences – BLOSUM: BLOCKS domain database ratio of odds: pr. of real / pr. of random match log-odds: additive scoring system