PAM250. M. Dayhoff Scoring Matrices Point Accepted Mutations or PAM matrices Proteins with 85% identity were used -> the function is not significantly.

PAM250

M. Dayhoff Scoring Matrices
Point Accepted Mutation (PAM) matrices.
Proteins with 85% identity were used: at this level of similarity the function is not significantly changed, so the observed mutations are “accepted”.
PAM unit: the measure of the amount of evolutionary distance between two amino acid sequences.
One PAM unit: $S_1$ has converted (mutated) into $S_2$ with an average of one accepted point-mutation event per 100 amino acids.

$p_a = N_a / N$: the probability of occurrence of amino acid $a$ over a large, sufficiently varied data set, so that $\sum_a p_a = 1$.
$f_{ab}$: the number of times the mutation $a \leftrightarrow b$ was observed to occur (counted symmetrically, $f_{ab} = f_{ba}$).
$f_a = \sum_{b \neq a} f_{ab}$: the total number of mutations in which $a$ was involved.
$f = \sum_a f_a$: the total number of amino acid occurrences involved in mutations.
Two matrices are derived from these counts: 1) the probability matrix, 2) the scoring matrix.

$M$: the 20x20 probability matrix; $M_{ab}$ is the probability of amino acid $a$ changing into $b$.
$m_a = \frac{f_a}{f} \cdot \frac{1}{100\, p_a}$: the relative mutability of amino acid $a$, i.e. the probability that the given amino acid will change in the evolutionary period of interest.
Assumptions: (a) on average 1 in 100 amino acids is changed; (b) mutations are position independent; (c) a mutation is independent of the past history of its position.

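$M_{aa} = 1 - m_a$: the probability that $a$ remains unchanged.
$M_{ab} = \Pr(a \to b) = \Pr(a \to b \mid a \text{ changed}) \Pr(a \text{ changed}) = \frac{f_{ab}}{f_a} m_a$ for $b \neq a$.
Easy to see that every row sums to 1: $\sum_b M_{ab} = M_{aa} + \sum_{b \neq a} \frac{f_{ab}}{f_a} m_a = 1 - m_a + \frac{m_a}{f_a} \sum_{b \neq a} f_{ab} = 1$.
$\sum_a p_a M_{aa} = 1 - \sum_a p_a m_a = 1 - \sum_a \frac{f_a}{100 f} = 0.99$, i.e. on average 1 mutation every 100 positions.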
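The construction above can be sketched in a few lines of Python. This is a minimal illustration, not Dayhoff's actual data: the three-letter alphabet, the counts in `counts` and the frequencies in `freqs` are invented toy values, and the function name `pam1_matrix` is hypothetical.

```python
def pam1_matrix(counts, freqs):
    """Build the 1-PAM probability matrix M from mutation counts.

    counts[a][b] is f_ab (assumed symmetric, a != b); freqs[a] is p_a.
    """
    letters = sorted(freqs)
    # f_a: mutations involving a; f: total over all amino acids.
    f_a = {a: sum(counts[a][b] for b in letters if b != a) for a in letters}
    f = sum(f_a.values())
    # Relative mutability m_a = (f_a / f) * 1 / (100 * p_a).
    m = {a: (f_a[a] / f) / (100.0 * freqs[a]) for a in letters}
    M = {a: {} for a in letters}
    for a in letters:
        for b in letters:
            if a == b:
                M[a][b] = 1.0 - m[a]                      # a stays unchanged
            else:
                M[a][b] = (counts[a][b] / f_a[a]) * m[a]  # a changes into b
    return M


# Toy data (assumption, for illustration only).
freqs = {"A": 0.5, "B": 0.3, "C": 0.2}
counts = {
    "A": {"B": 30, "C": 10},
    "B": {"A": 30, "C": 20},
    "C": {"A": 10, "B": 20},
}

M = pam1_matrix(counts, freqs)
row_sums = {a: sum(M[a].values()) for a in M}
conserved = sum(freqs[a] * M[a][a] for a in M)
print(row_sums)   # every row should sum to 1 (up to rounding)
print(conserved)  # should be 0.99 (up to rounding)
```

The two printed checks correspond to the two identities on the slide: each row of $M$ is a probability distribution, and the expected fraction of unchanged positions is 0.99.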

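What is the probability that $a$ mutates into $b$ in two PAM units of evolution? The change can pass through any intermediate amino acid ($a \to c \to b$, or $a \to d \to b$, ...), so the probability is $\sum_c M_{ac} M_{cb} = (M^2)_{ab}$.
In the same way $M^2, M^3, M^4, \dots, M^{250}, \dots$ describe 2, 3, 4, ..., 250, ... PAM units.
As $k \to \infty$, $M^k$ converges to a matrix with identical rows, $(M^k)_{ac} \to p_c$: no matter what amino acid you start with, after a long period of evolution the resulting amino acid will be $c$ with probability $p_c$.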
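A small numerical sketch of this convergence, in plain Python. The two-letter alphabet and the numbers in `p` and `M` are toy assumptions chosen so that `M` follows the 1-PAM construction (relative mutability 0.5 / (100 p_a) for each letter); `mat_mul` and `mat_pow` are hypothetical helper names.

```python
def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][c] * Y[c][j] for c in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(M, k):
    """M raised to the k-th power by repeated squaring."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in M]
    while k > 0:
        if k & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        k >>= 1
    return result


# Toy 1-PAM matrix for two letters with background frequencies p.
p = [0.75, 0.25]
M = [[1 - 1 / 150, 1 / 150],
     [0.02, 0.98]]

M2 = mat_pow(M, 2)        # two PAM units: sum over intermediate letters c
M250 = mat_pow(M, 250)    # the matrix behind PAM250
M_far = mat_pow(M, 5000)  # rows are numerically identical and equal to p
print(M250)
print(M_far)
```

Already at $k = 250$ the rows are close to each other, which is why PAM250 is used for distantly related sequences.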

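PAM-k matrix
$\text{PAM-}k_{ab} = (M^k)_{ab} / p_b$: how much more likely the pair $ab$ is to be a mutation than a random occurrence (likelihood or odds ratio).
For $k = 1$ the ratio is symmetric (using $f_{ab} = f_{ba}$): $\frac{M_{ab}}{p_b} = \frac{(f_{ab}/f_a)\, m_a}{p_b} = \frac{f_{ab}}{f_a} \cdot \frac{f_a}{100\, f\, p_a\, p_b} = \frac{f_{ab}}{100\, f\, p_a\, p_b} = \frac{M_{ba}}{p_a}$.
With this definition the total alignment score is the product of the PAM-k entries of the aligned pairs. To avoid accuracy problems, logarithms are used instead: $\text{PAM-}k_{ab} = 10 \log_{10} \frac{(M^k)_{ab}}{p_b}$, so the total alignment score becomes the sum of the PAM-k entries.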
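The step from odds ratios to additive scores can be illustrated as follows. This is a sketch under assumptions: the logarithm is taken to base 10, the two-letter matrix `M1` and frequencies `p` are toy values, and `log_odds_scores` is a hypothetical name.

```python
import math


def log_odds_scores(Mk, p):
    """Score matrix S[a][b] = 10 * log10(Mk[a][b] / p[b])."""
    n = len(p)
    return [[10.0 * math.log10(Mk[a][b] / p[b]) for b in range(n)]
            for a in range(n)]


# Toy 1-PAM matrix and background frequencies (assumed values).
p = [0.75, 0.25]
M1 = [[1 - 1 / 150, 1 / 150],
      [0.02, 0.98]]
S = log_odds_scores(M1, p)

# A toy gapless alignment given as a list of aligned pairs (a, b).
pairs = [(0, 0), (0, 1), (1, 1)]
odds_product = 1.0
for a, b in pairs:
    odds_product *= M1[a][b] / p[b]
score_sum = sum(S[a][b] for a, b in pairs)

print(S)
print(score_sum, 10.0 * math.log10(odds_product))  # the two values agree
```

Summing log-odds avoids multiplying many small ratios, which is the accuracy problem the slide refers to.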

Multiple Sequence Alignment
Multiple sequence alignment can detect similarities that cannot be detected with pairwise sequence alignment methods, and it reveals the characteristics of a family.
Three questions: 1. Scoring. 2. Computation of the multiple sequence alignment. 3. Family representation.

Multiple Sequence Alignment

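Scoring: SP (sum of pairs)
SP: the score of a column is the sum of the pairwise scores of all pairs of symbols in the column, e.g. $\rho_3(-, A, A) = \text{score}(-, A) + \text{score}(-, A) + \text{score}(A, A)$.
SP total score $= \sum_i \rho_i$, with $\text{score}(-, -) = 0$.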

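Induced pairwise alignment
The induced pairwise alignment $a(S_i, S_j)$ is the projection of a multiple alignment onto the two sequences $S_i$ and $S_j$; for three sequences these are $a(S_1, S_2)$, $a(S_2, S_3)$ and $a(S_1, S_3)$.
With $\text{score}(-, -) = 0$: SP total score $= \sum_{i<j} \text{score}[a(S_i, S_j)]$.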
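The two ways of computing the SP score, column by column and as a sum over induced pairwise alignments, can be checked against each other. In this sketch the pair score is an assumed unit-cost distance (0 for a match or for two gaps, 1 otherwise), and the function names are hypothetical.

```python
from itertools import combinations


def pair_score(x, y):
    """Assumed unit-cost distance: identical symbols (incl. two gaps) cost 0."""
    return 0 if x == y else 1


def sp_by_columns(rows):
    """Sum over columns of the sum over all symbol pairs in the column."""
    return sum(pair_score(x, y)
               for column in zip(*rows)
               for x, y in combinations(column, 2))


def sp_by_pairs(rows):
    """Sum over all pairs of rows of the score of their induced alignment."""
    return sum(sum(pair_score(x, y) for x, y in zip(r, s))
               for r, s in combinations(rows, 2))


alignment = ["AC--BC",
             "DCA-BC",
             "DCAABC"]
print(sp_by_columns(alignment))  # 6
print(sp_by_pairs(alignment))    # 6
```

Because two aligned gaps score 0, the all-gap columns that appear in a projection contribute nothing, so both sums are always equal.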

Dyn.Prog. Solution

Dynamic Programming Solution
The best multiple alignment of $r$ sequences is calculated using an $r$-dimensional hyper-cube.
The size of the hyper-cube is $O(\prod_i n_i)$.
Time complexity: $O(2^r n^r) \cdot O(\text{computation of the } \rho \text{ function})$.
The exact problem is NP-complete (metrics: sum-of-pairs or evolutionary tree), so a more efficient solution is needed.

Multiple Alignment from Pairwise Alignments?
Problem: the best pairwise alignments do not necessarily lead to the best multiple alignment.

[Figure: three sequences S1, S2, S3 composed of Pattern-A, Pattern-B, Pattern-D and Pattern-X. Panel labels: the pairwise alignments S1S2, S1S3 and S2S3 (on Pattern-A, Pattern-B and Pattern-D), "Empty", and "Correct Solution" S1S2S3 (on Pattern-X).]

Center Star Alignment
[Figure: star with the center sequence $S_c$ connected to $S_1, S_2, S_3, \dots, S_{k-2}, S_{k-1}, S_k$.]
(a) The scoring scheme is a distance.
(b) The scoring scheme satisfies the triangle inequality: for any characters $a, b, c$, $\text{dist}(a, c) \le \text{dist}(a, b) + \text{dist}(b, c)$ (in practice not all scoring matrices satisfy the triangle inequality).
(c) $D(S_i, S_j)$: the score of the optimal pairwise alignment.
(d) $D(M) = \sum_{i<j} a_M(S_i, S_j)$: the score of the multiple alignment $M$.
(e) $a_M(S_i, S_j)$: the pairwise alignment (and its score) induced by $M$.

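The Center Star Algorithm:
(a) Find $S_c$ minimizing $\sum_{i \neq c} D(S_c, S_i)$.
(b) Iteratively construct the multiple alignment $M_c$:
1. $M_c = \{S_c\}$.
2. Add the sequences in $S \setminus \{S_c\}$ to $M_c$ one by one so that the induced alignment $a_{M_c}(S_c, S_i)$ of every newly added sequence $S_i$ with $S_c$ is optimal. Add spaces, when needed, to all pre-aligned sequences.
Running time: one $O(n^2)$ pairwise alignment for every pair of sequences, i.e. $O(k^2 n^2)$ for $k$ sequences of length about $n$.
Example with center ACBC:
    AC-BC        AC--BC        AC--BC
    DCABC        DCAABC        DCA-BC
                               DCAABC
The first two blocks are the optimal pairwise alignments with the center; the third is the resulting multiple alignment, where a space was added to the pre-aligned DCABC.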
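The steps of the algorithm can be sketched in Python as follows. This is an illustrative implementation under assumptions, not the one from the slides: the distance is unit-cost edit distance (match 0, mismatch 1, space 1), ties in the traceback are broken arbitrarily, and the names `align`, `merge` and `center_star` are hypothetical.

```python
def align(s, t):
    """Optimal global alignment under unit costs: (distance, s_row, t_row)."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j - 1] + (s[i - 1] != t[j - 1]),
                          d[i - 1][j] + 1, d[i][j - 1] + 1)
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:
            a.append("-"); b.append(t[j - 1]); j -= 1
    return d[n][m], "".join(reversed(a)), "".join(reversed(b))


def merge(rows, ac, ai):
    """Add a sequence to rows (rows[0] is the gapped center), given its
    pairwise alignment (ac, ai) with the ungapped center."""
    cur, out, new, i, j = rows[0], [[] for _ in rows], [], 0, 0
    while i < len(cur) or j < len(ac):
        old_gap = i < len(cur) and cur[i] == "-"
        new_gap = j < len(ac) and ac[j] == "-"
        if new_gap and not old_gap:
            for col in out:          # fresh space in every pre-aligned row
                col.append("-")
            new.append(ai[j]); j += 1
        else:
            for r, row in enumerate(rows):
                out[r].append(row[i])
            if old_gap and not new_gap:
                new.append("-")      # existing center space, nothing to place
            else:
                new.append(ai[j]); j += 1
            i += 1
    return ["".join(r) for r in out] + ["".join(new)]


def center_star(seqs):
    """Return (index of the center, aligned rows in input order)."""
    k = len(seqs)
    dist = [[align(seqs[i], seqs[j])[0] for j in range(k)] for i in range(k)]
    c = min(range(k), key=lambda i: sum(dist[i]))
    rows, order = [seqs[c]], [c]
    for i in range(k):
        if i != c:
            _, ac, ai = align(seqs[c], seqs[i])
            rows = merge(rows, ac, ai)
            order.append(i)
    result = [None] * k
    for idx, row in zip(order, rows):
        result[idx] = row
    return c, result


seqs = ["ACBC", "DCABC", "DCAABC"]
center, msa = center_star(seqs)
print(center)
print("\n".join(msa))
```

With these unit costs step (a) selects DCABC as the center for the three example sequences (its distances to the other two sum to 3), so the intermediate pairwise alignments differ from a run that fixes ACBC as the center.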

$D(M_c)$ is at most twice the score of the optimal multiple alignment: $D(M_c) / D(M_{opt}) \le 2(k-1)/k \; (< 2)$.
Proof:
(a) $a(S_i, S_j) \ge D(S_i, S_j)$ (any induced alignment is not better than the optimal alignment), and by construction $a_{M_c}(S_c, S_j) = D(S_c, S_j)$.
(b) $a_{M_c}(S_i, S_j) \le a_{M_c}(S_i, S_c) + a_{M_c}(S_c, S_j) = D(S_i, S_c) + D(S_c, S_j)$ (follows from the triangle inequality).
(c) $2 D(M_c) = \sum_{i=1}^{k} \sum_{j \neq i} a_{M_c}(S_i, S_j) \le \sum_{i=1}^{k} \sum_{j \neq i} \left( a_{M_c}(S_i, S_c) + a_{M_c}(S_c, S_j) \right) = 2(k-1) \sum_{j \neq c} a_{M_c}(S_c, S_j) = 2(k-1) \sum_{j \neq c} D(S_c, S_j)$.
(d) $k \sum_{j \neq c} D(S_c, S_j) = \sum_{i=1}^{k} \sum_{j \neq c} D(S_c, S_j) \le \sum_{i=1}^{k} \sum_{j \neq i} D(S_i, S_j) \le \sum_{i=1}^{k} \sum_{j \neq i} a_{M_{opt}}(S_i, S_j) = 2 D(M_{opt})$. The first inequality holds because $S_c$ minimizes the sum of distances to all other sequences.
(e) From (c), $D(M_c)/(k-1) \le \sum_{j \neq c} D(S_c, S_j)$; from (d), $\sum_{j \neq c} D(S_c, S_j) \le 2 D(M_{opt})/k$. Together: $D(M_c) / D(M_{opt}) \le 2(k-1)/k$.
