Information theoretic interpretation of PAM matrices Sorin Istrail and Derek Aguiar.

1 Information theoretic interpretation of PAM matrices Sorin Istrail and Derek Aguiar

2 Local alignments The preferred method to compute regions of local similarity for two sequences of amino acids is to search across the entire length of both sequences for segments that optimize a similarity score defined by a substitution matrix. PAM and BLOSUM both provide a number of different matrices constructed to model similarity between amino acid sequences at different evolutionary distances. Here, we follow Altschul (1991) to investigate PAM matrices from an information theoretic perspective.

3 Caveats and assumptions The following theory only applies to locally aligned segments that lack gaps. Why is this assumption easier to tolerate in local alignment vs. global alignment? Why is this assumption still restrictive for local alignments?

4 Notation and definitions Amino acids: a_i. Substitution score of aligned amino acids a_i and a_j: s_ij. A Maximal Segment Pair (MSP) is a pair of equal-length segments, one from each of two amino acid sequences, that, when aligned, has the maximum score.
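The MSP definition can be made concrete with a small sketch. The function name `msp_score` and the +1/-1 scoring function `match_score` are illustrative assumptions, not part of the slides; a real search would use PAM scores s_ij. Because no gaps are allowed, every candidate segment pair lies on one diagonal of the comparison, so a running-sum scan along each diagonal is enough.

```python
def msp_score(seq_a, seq_b, score):
    """Best score over all ungapped, equal-length segment pairs.

    Returns 0 if no segment pair scores above zero (the empty pair).
    """
    best = 0
    # Each diagonal is fixed by the offset i - j between the two sequences.
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        i = max(offset, 0)
        j = i - offset
        running = 0
        while i < len(seq_a) and j < len(seq_b):
            # Restart the segment whenever its running score drops below zero.
            running = max(0, running + score(seq_a[i], seq_b[j]))
            best = max(best, running)
            i += 1
            j += 1
    return best


def match_score(x, y):
    # Hypothetical scoring: +1 for a match, -1 for a mismatch.
    return 1 if x == y else -1


def scaled_score(x, y, c=3):
    # The same scores multiplied by a positive constant c.
    return c * match_score(x, y)


print(msp_score("AGCGCTAC", "AGCGCTAC", match_score))
print(msp_score("AGCGCTAC", "AGCGCTAC", scaled_score))
```

Multiplying every score by a positive constant c multiplies the MSP score by c without changing which segment pair is maximal.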

5 Random model For any two amino acid sequences, there exists at least one MSP. It is convenient to compute what MSP scores look like for random sequences, to serve as a basis for comparison. We will consider a very simple model: each amino acid a_i appears independently at random with probability p_i, reflecting the actual frequencies of amino acids in protein sequences. What could be a more biologically accurate (yet mathematically less feasible) method for generating amino acid sequences?
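A minimal sketch of this random model follows. The four-letter alphabet and the frequencies in `BACKGROUND` are made-up placeholders, and `random_sequence` is a hypothetical helper name; real background frequencies p_i would be estimated from a protein database.

```python
import random

# Hypothetical background frequencies p_i over a toy four-letter alphabet.
BACKGROUND = {"A": 0.4, "L": 0.3, "G": 0.2, "W": 0.1}


def random_sequence(length, freqs, rng):
    """Draw each position independently, letter a_i with probability p_i."""
    letters = list(freqs)
    weights = [freqs[a] for a in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))


rng = random.Random(0)  # fixed seed so runs are repeatable
sample = random_sequence(50, BACKGROUND, rng)
print(sample)
```

Scoring many pairs of such sequences gives an empirical picture of the MSP scores that arise by chance alone.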

6 More assumptions...

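8 [Figure: two scoring matrices over the alphabet A, C, G, T, -, one with diagonal entries 1 and one with diagonal entries c, applied to the segment ...AGCGCTAC...; MSP score = 8 with the first matrix and MSP score = c*8 with the second.]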

10 Random Model

13 Substitution matrices

14 Local alignment and information theory

15 Relative entropy and substitution matrices

16 Relative entropy Relative entropy (KL divergence) measures how far apart two probability distributions are: it is zero when they are identical and grows as they diverge. Given two probability distributions Q and P, relative entropy can be informally stated in several ways: (1) the number of additional bits required to code samples from P when using a code optimized for Q; (2) the amount of information lost when Q is used to approximate P, the true distribution of the data.
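As a small illustration of the definition, the sketch below computes D(P || Q) = sum over x of P(x) * log2(P(x) / Q(x)) for two toy distributions. The function name `relative_entropy` and the example values are assumptions chosen for illustration only.

```python
import math


def relative_entropy(p, q):
    """D(P || Q) in bits; assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)


# Toy distributions over two outcomes (illustrative values only).
P = {"a": 0.5, "b": 0.5}
Q = {"a": 0.25, "b": 0.75}

print(relative_entropy(P, P))
print(relative_entropy(P, Q))
print(relative_entropy(Q, P))
```

Relative entropy is not symmetric: D(P || Q) and D(Q || P) generally differ, so it is a divergence rather than a true distance.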

17 Relative entropy and substitution matrices But how does this relate to substitution matrices? If the target and background frequency distributions are closely related, then the relative entropy is low and it is very difficult to distinguish between the target and background frequencies: each aligned position contributes little information, so a much longer alignment is required to stand out from chance. On the other hand, if the target and background frequency distributions are very different, the relative entropy is high and much shorter alignments are enough to be distinguished from chance.
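The trade-off between relative entropy and alignment length can be sketched numerically. Following the framing in Altschul (1991), this assumes log-odds scores in base 2, so that H = sum over i, j of q_ij * log2(q_ij / (p_i * p_j)) is the average information per aligned position, and that roughly log2(N) bits are needed to rise above chance in a search space of size N. The two-letter alphabet, the target frequencies, the search space size, and the function names are all made-up assumptions.

```python
import math


def matrix_relative_entropy(target, background):
    """H in bits per aligned pair, from target q_ij and background p_i."""
    return sum(
        q * math.log2(q / (background[a] * background[b]))
        for (a, b), q in target.items()
        if q > 0
    )


def min_significant_length(search_space, h_bits):
    """Rough alignment length needed to accumulate log2(N) bits."""
    return math.log2(search_space) / h_bits


# Toy two-letter alphabet with uniform background (illustrative only).
background = {"X": 0.5, "Y": 0.5}
# Target frequencies far from the background product p_i * p_j = 0.25.
strong = {("X", "X"): 0.4, ("Y", "Y"): 0.4, ("X", "Y"): 0.1, ("Y", "X"): 0.1}
# Target frequencies close to the background product.
weak = {("X", "X"): 0.3, ("Y", "Y"): 0.3, ("X", "Y"): 0.2, ("Y", "X"): 0.2}

h_strong = matrix_relative_entropy(strong, background)
h_weak = matrix_relative_entropy(weak, background)
N = 1_000_000  # hypothetical search space size (product of sequence lengths)

print(h_strong, min_significant_length(N, h_strong))
print(h_weak, min_significant_length(N, h_weak))
```

The target distribution closer to the background has the lower H and therefore needs the longer alignment, which is the qualitative point of the slide.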

18 Example 1 – cystic fibrosis Variants in a transport protein have been associated with cystic fibrosis. A search of this gene in the PIR protein sequence database yields the table on the following slide.

19 Example 1 – cystic fibrosis Altschul, S.F. (1991) “Amino Acid Substitution Matrices from an Information Theoretic Perspective”, Journal of Molecular Biology, 219:555-565

20 Example 1 – cystic fibrosis Of note, the best PAM-250 score is not higher than the highest score of a random alignment given the background frequencies. On the other hand, PAM-120 gives alignments in the same region with scores higher than the highest chance alignment. Why do you think PAM-120 is a better fit here?

21 References
Explains the connection between information theory and substitution matrices:
Altschul, S.F. (1991) “Amino Acid Substitution Matrices from an Information Theoretic Perspective”, Journal of Molecular Biology, 219:555-565
Provides much of the theory for the above article:
Karlin, S., Dembo, A. and Kawabata, T. (1990) “Statistical Composition of High-Scoring Segments from Molecular Sequences”, The Annals of Statistics, 18(2):571-581
Karlin, S. and Altschul, S.F. (1990) “Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes”, PNAS, 87(6):2264-2268

