Class 3: Estimating Scoring Rules for Sequence Alignment.


1 Class 3: Estimating Scoring Rules for Sequence Alignment

2 Reminder
- Last class we discussed dynamic programming algorithms for:
  - global alignment
  - local alignment
- All of these assumed a pre-specified scoring rule (substitution matrix) that determines the quality of perfect matches, substitutions, and indels, independently of neighboring positions.

3 A Probabilistic Model
- But how do we derive a "good" substitution matrix?
- It should "encourage" pairs that are likely to be exchanged between closely related sequences, and "punish" others.
- Let's examine a general probabilistic approach, guided by evolutionary intuitions.
- Assume that we consider only two options:
  - M: the sequences are evolutionarily related
  - R: the sequences are unrelated

4 Unrelated Sequences
- Our model of two unrelated sequences s, t is simple: for each position i, both s[i] and t[i] are sampled independently from some "background" distribution q(·) over the alphabet Σ. Let q(a) be the probability of seeing letter a in any position.
- Then the likelihood of s, t (the probability of seeing s, t), given that they are unrelated, is:

    P(s, t | R) = ∏_i q(s[i]) · q(t[i]) = ∏_i q(s[i]) · ∏_i q(t[i])

5 Related Sequences
- Now let's assume that each pair of aligned positions (s[i], t[i]) evolved from a common ancestor, so s[i] and t[i] are dependent!
- We assume each pair (s[i], t[i]) is sampled from some distribution p(·,·) over pairs of letters. Let p(a,b) be the probability that some ancestral letter evolved into this particular pair of letters.
- Then the likelihood of s, t, given that they are related, is:

    P(s, t | M) = ∏_i p(s[i], t[i])
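
To make the two models concrete, here is a minimal Python sketch (not from the slides) that scores an aligned pair of sequences under both hypotheses. The alphabet, q(·), and p(·,·) are made-up stand-ins for distributions that would be estimated from data, as described in the later slides.

```python
import math

# Toy DNA alphabet with a uniform background q(.) and a match-favoring
# joint distribution p(.,.). Both are illustrative assumptions, not
# estimates from real data.
ALPHABET = "ACGT"
q = {a: 0.25 for a in ALPHABET}
p = {(a, b): 0.15 if a == b else 0.4 / 12   # diagonal mass 0.6, rest spread evenly
     for a in ALPHABET for b in ALPHABET}

def log_lik_unrelated(s, t):
    """log P(s,t | R) = sum_i [ log q(s[i]) + log q(t[i]) ]"""
    return sum(math.log(q[a]) + math.log(q[b]) for a, b in zip(s, t))

def log_lik_related(s, t):
    """log P(s,t | M) = sum_i log p(s[i], t[i])"""
    return sum(math.log(p[a, b]) for a, b in zip(s, t))

s, t = "ACGTAC", "ACGTTC"
print(log_lik_related(s, t) - log_lik_unrelated(s, t))  # log-likelihood ratio
```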

6 Decision Problem
- Given two sequences s[1..n] and t[1..n], decide whether they were sampled from M or from R.
- This is an instance of a decision problem that is quite frequent in statistics: hypothesis testing.
- We want to construct a procedure Decide(s,t) = D(s,t) that returns either M or R.
- Intuitively, we want to compare the likelihoods of the data under both models…

7 Types of Error
- Our procedure can make two types of errors:
  - Type I: s and t are sampled from R, but D(s,t) = M
  - Type II: s and t are sampled from M, but D(s,t) = R
- Define the corresponding error probabilities:

    α(D) = P(D(s,t) = M | R),    β(D) = P(D(s,t) = R | M)

- We want to find a procedure D(s,t) that minimizes both types of errors.

8 Neyman-Pearson Lemma
- Suppose that D* thresholds the likelihood ratio at some constant k:

    D*(s,t) = M  ⇔  P(s,t | M) / P(s,t | R) > k

- If any other D is such that α(D) ≤ α(D*), then β(D) ≥ β(D*), so D* is optimal.
- k reflects the weights we wish to give to the two types of errors, and the relative abundance of M compared to R.
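
In code, the lemma's optimal test is just a threshold on the (log-)likelihood ratio. A minimal sketch, assuming the log-ratio has already been computed as in the snippet above; the default k = 1 is an assumption that weighs both error types equally.

```python
import math

def decide(log_likelihood_ratio, k=1.0):
    """Neyman-Pearson threshold test D*: return 'M' exactly when
    P(s,t|M) / P(s,t|R) > k, i.e. when the log-ratio exceeds log k."""
    return 'M' if log_likelihood_ratio > math.log(k) else 'R'
```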

9 Likelihood Ratio for Alignment
- The likelihood ratio is a quantitative measure of how much more plausible a common origin of the two sequences is, compared to random.
- Let's see that it is a natural score for their alignment!
- Plugging in the models, we have:

    P(s,t | M) / P(s,t | R) = ∏_i p(s[i], t[i]) / ( q(s[i]) · q(t[i]) )

10 Likelihood Ratio for Alignment
- Taking the logarithm of both sides, we get:

    log [ P(s,t | M) / P(s,t | R) ] = Σ_i log [ p(s[i], t[i]) / ( q(s[i]) · q(t[i]) ) ]

- We can see that the (log-)likelihood score decomposes into a sum of single-position scores, each depending only on the two aligned letters!

11 Probabilistic Interpretation of Scoring Rule
- Therefore, if we take our substitution matrix to be:

    σ(a, b) = log [ p(a, b) / ( q(a) · q(b) ) ]

then the score of an alignment is the log-ratio between the likelihoods of the two models, which is nice:
  - Score > 0 ⇒ M is more "probable" (k = 1)
  - Score < 0 ⇒ R is more "probable"
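
A minimal sketch of turning p(·,·) and q(·) into a log-odds substitution matrix. The base-2 logarithm and the toy distributions are assumptions; real matrices such as PAM and BLOSUM additionally scale and round the entries to small integers.

```python
import math

# Toy distributions (assumptions, as in the earlier sketch).
ALPHABET = "ACGT"
q = {a: 0.25 for a in ALPHABET}
p = {(a, b): 0.15 if a == b else 0.4 / 12
     for a in ALPHABET for b in ALPHABET}

# sigma(a,b) = log2 [ p(a,b) / (q(a) q(b)) ]
sigma = {(a, b): math.log2(p[a, b] / (q[a] * q[b]))
         for a in ALPHABET for b in ALPHABET}

print(sigma['A', 'A'] > 0, sigma['A', 'C'] < 0)  # matches score > 0, mismatches < 0
```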

12 Modeling Assumptions
- It is important to note that this interpretation depends on our modeling assumptions about the two hypotheses!
- For example, if we assume that the letter in each position depends on the letter in the preceding position, then the likelihood ratio will have a different form.

13 Constructing Scoring Rules
The formula suggests how to construct a scoring rule:
- Estimate p(·,·) and q(·) from the data
- Compute σ(a,b) based on p(·,·) and q(·)

14 Estimating Probabilities
- Suppose we are given a long string s[1..n] of letters from Σ.
- We want to estimate the distribution q(·) that "generated" the sequence.
- How should we go about this?
- We build on the theory of parameter estimation in statistics.

15 Statistical Parameter Fitting
- Consider instances x[1], x[2], …, x[M] such that:
  - The set of values that x can take is known
  - Each is sampled from the same (unknown) distribution of a known family (multinomial, Gaussian, Poisson, etc.)
  - Each is sampled independently of the rest
- The task is to find a parameter set θ defining the most likely distribution P(x|θ) from which the instances could have been sampled.
- The parameters θ depend on the given family of probability distributions.

16 Example: Binomial Experiment
- Consider a thumbtack. When tossed, it can land in one of two positions: Head or Tail.
- We denote by θ the (unknown) probability P(H).
- Estimation task: given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 - θ.

17 Why Is Learning Possible?
- Suppose we perform M independent flips of the thumbtack.
- The number of heads N_H we see has a binomial distribution:

    P(N_H = h) = C(M, h) · θ^h · (1 - θ)^(M-h)

and thus E[N_H] = M·θ and Var[N_H] = M·θ·(1 - θ).
- This suggests that we can estimate θ by

    θ̂ = N_H / M
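
A quick simulation (not from the slides) illustrating the point: as M grows, the fraction of heads N_H / M concentrates around the true θ. The value θ = 0.3 and the seed are arbitrary choices.

```python
import random

random.seed(0)
theta = 0.3                      # true (normally unknown) P(H)
for M in (10, 100, 1000, 10000):
    n_heads = sum(random.random() < theta for _ in range(M))
    print(M, n_heads / M)        # estimate approaches theta as M grows
```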

18 Expected Behavior (θ = 0.5)

[Figure: sampling distribution of the estimate (rescaled) over datasets of i.i.d. samples, for M = 10, M = 100, M = 1000; the distribution concentrates around θ as M grows.]

- From most large datasets, we get a good approximation to θ.
- How do we derive such estimators in a principled way?

19 The Likelihood Function
- How good is a particular θ? It depends on how likely it is to generate the observed data.
- The likelihood for the sequence H, T, T, H, H is:

    L(θ : D) = θ · (1 - θ) · (1 - θ) · θ · θ = θ³ (1 - θ)²

[Figure: L(θ:D) as a function of θ over [0, 1].]
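
Evaluating this likelihood on a grid makes the next slide's maximization concrete; a small sketch:

```python
# L(theta : D) = theta^3 * (1 - theta)^2 for D = H,T,T,H,H,
# evaluated on a grid of candidate theta values.
thetas = [i / 100 for i in range(101)]
likelihood = [th ** 3 * (1 - th) ** 2 for th in thetas]
best = max(range(len(thetas)), key=lambda i: likelihood[i])
print(thetas[best])  # 0.6, matching the closed-form MLE N_H / (N_H + N_T)
```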

20 Maximum Likelihood Estimation
- MLE principle: learn the parameters that maximize the likelihood function.
- This is one of the most commonly used estimators in statistics.
- It is intuitively appealing.

21 Computing the Likelihood Function
- To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):

    L(θ : D) = θ^(N_H) · (1 - θ)^(N_T)

- N_H and N_T are sufficient statistics for the binomial distribution.

22 Sufficient Statistics
- A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
- Formally, s(D) is a sufficient statistic if for any two datasets D and D′:

    s(D) = s(D′)  ⇒  L(θ | D) = L(θ | D′)

[Figure: many datasets map to the same sufficient statistic.]

23 Example: MLE in Binomial Data
- Applying the MLE principle (differentiating and setting the derivative to zero), we get:

    θ̂ = N_H / (N_H + N_T)

(which coincides with what we would expect).
- Example: (N_H, N_T) = (3, 2); the MLE estimate is 3/5 = 0.6.

[Figure: L(θ:D) over [0, 1], peaking at θ = 0.6.]

24 From Binomial to Multinomial
- Suppose X can take the values 1, 2, …, K.
- We want to learn the parameters θ_1, θ_2, …, θ_K.
- Sufficient statistics: N_1, N_2, …, N_K, the number of times each outcome is observed.
- Likelihood function:

    L(θ : D) = ∏_k θ_k^(N_k)

- MLE (differentiation with Lagrange multipliers):

    θ̂_k = N_k / Σ_ℓ N_ℓ
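
For reference, a sketch of the Lagrange-multiplier step behind this MLE (standard, but not spelled out on the slide):

```latex
\[
\log L(\theta : D) = \sum_{k} N_k \log \theta_k
\quad \text{maximized subject to} \quad \sum_{k} \theta_k = 1 .
\]
\[
\frac{\partial}{\partial \theta_k}\Bigl( \sum_{j} N_j \log \theta_j
      - \lambda \bigl( \textstyle\sum_{j} \theta_j - 1 \bigr) \Bigr)
  = \frac{N_k}{\theta_k} - \lambda = 0
  \;\Longrightarrow\;
  \theta_k = \frac{N_k}{\lambda}
  \;\Longrightarrow\;
  \hat{\theta}_k = \frac{N_k}{\sum_{j} N_j},
\]
since the constraint forces $\lambda = \sum_{j} N_j$.
```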

25 At Last: Estimating q(·)
- Suppose we are given a long string s[1..n] of letters from Σ.
- s can be the concatenation of all sequences in our database.
- We want to estimate the distribution q(·).
- Likelihood function:

    L(s : q) = ∏_a q(a)^(N_a)

- MLE parameters:

    q̂(a) = N_a / n,   where N_a is the number of times a appears in s

26 Estimating p(·,·)
Intuition:
- Find a pair of presumably related aligned sequences s[1..n], t[1..n].
- Estimate the probability of each pair of letters in the alignment:

    p̂(a, b) = N_(a,b) / n,   where N_(a,b) is the number of times a is aligned with b in (s, t)

- Again, s and t can be the concatenation of many aligned pairs from the database.
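
Both estimates are simple frequency counts. A minimal Python sketch on made-up toy strings; a real pipeline would concatenate all (aligned) database sequences, and would typically also symmetrize the pair counts so that p(a,b) = p(b,a).

```python
from collections import Counter

# Toy data: one "long" string for q, and a gap-free aligned pair for p.
s = "ACGTACGTAAGG"
t = "ACGTTCGTAAGC"

# q(a) = N_a / n
counts_q = Counter(s)
q = {a: c / len(s) for a, c in counts_q.items()}

# p(a,b) = N_ab / n, counted over aligned positions
counts_p = Counter(zip(s, t))
p = {ab: c / len(s) for ab, c in counts_p.items()}

print(q)
print(p)
```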

27 Estimating p(·,·)
Problems:
- How do we find pairs of presumably related aligned sequences?
- Can we ensure that the two sequences indeed descend from a common ancestor?
- How far back should this ancestor be?
  - earlier divergence → low sequence similarity
  - later divergence → high sequence similarity
- The substitution score of each pair of letters should depend on the evolutionary distance of the compared sequences!

28 Let Evolution In
- Again, we need to make some assumptions:
  - Each position changes independently of the rest.
  - The probability of a mutation is the same in each position.
  - Evolution does not "remember" its past.

[Figure: a single position evolving over times t, t+ε, t+2ε, t+3ε, t+4ε; e.g., A → A → C → C → G and T → T → T → C → G.]

29 Model of Evolution
- How do we model such a process?
- The process (for each position independently) is called a Markov chain.
- A chain is defined by the transition probability P(X_(t+ε) = b | X_t = a), the probability that the next state is b given that the current state is a.
- We often describe these probabilities by a matrix:

    T[ε]_(ab) = P(X_(t+ε) = b | X_t = a)

30 Two-Step Changes
- Based on T[ε], we can compute the probabilities of changes over two time periods:

    P(X_(t+2ε) = b | X_t = a) = Σ_c P(X_(t+ε) = c | X_t = a) · P(X_(t+2ε) = b | X_(t+ε) = c)

- Thus T[2ε] = T[ε] T[ε].
- By induction: T[kε] = T[ε]^k.

31 Longer-Term Changes
Idea:
- Estimate T[ε] from a set S of closely related sequences.
- Use T[ε] to compute T[kε] = T[ε]^k.
- Derive the substitution probability after time kε:

    p_k(a, b) = q(a) · T[kε]_(ab),   so   σ_k(a, b) = log [ T[kε]_(ab) / q(b) ]

  - Note that the score depends on the evolutionary distance, as requested.
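
A sketch of slides 29-31 in code: raise a one-step transition matrix to the k-th power and turn it into a distance-dependent score via σ_k(a,b) = log(T[kε]_ab / q(b)). The matrix T and background q below are made-up assumptions, not estimates from real sequences.

```python
import numpy as np

ALPHABET = "ACGT"
q = np.array([0.25, 0.25, 0.25, 0.25])   # assumed background frequencies
T = np.full((4, 4), 0.01 / 3)            # small off-diagonal mutation rates
np.fill_diagonal(T, 0.99)                # ~1% total change per time step

for k in (1, 50, 250):
    Tk = np.linalg.matrix_power(T, k)    # T[k*eps] = T[eps]^k
    sigma_k = np.log(Tk / q)             # divide column b by q(b)
    print(k, round(sigma_k[0, 0], 3), round(sigma_k[0, 1], 3))
```

As k grows, the match score shrinks and the mismatch penalty softens, which is exactly the distance dependence the slide asks for.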

32 Estimating PAM1
- Collect counts N_(ab) of aligned pairs (a, b) in similar sequences in S.
  - Sources include phylogenetic trees and closely related sequences (at least 85% of positions match exactly).
- Normalize the counts to get a transition matrix T[ε] such that the average number of changes is 1%, that is:

    Σ_a q(a) · (1 - T[ε]_(aa)) = 0.01

  - This is called 1 point accepted mutation (PAM1), an evolutionary time unit!
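
A minimal sketch of that normalization with made-up counts: scale the off-diagonal (mutation) probabilities so the expected fraction of changed positions comes out at exactly 1%.

```python
import numpy as np

# Hypothetical aligned-pair counts N_ab for a 4-letter alphabet.
N = np.array([[980., 6., 8., 6.],
              [7., 975., 6., 9.],
              [9., 5., 982., 7.],
              [6., 8., 7., 979.]])

T = N / N.sum(axis=1, keepdims=True)      # raw transition estimates
q = N.sum(axis=1) / N.sum()               # background frequencies

change = (q * (1.0 - np.diag(T))).sum()   # current expected change rate
scale = 0.01 / change
off = T * scale                           # rescale mutation probabilities
np.fill_diagonal(off, 0.0)
T1 = off + np.diag(1.0 - off.sum(axis=1)) # restore rows summing to 1
print((q * (1.0 - np.diag(T1))).sum())    # -> 0.01, i.e. PAM1
```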

33 Using PAM
- The matrix PAM-k is defined to be the score based on T[ε]^k.
- Historically, researchers use PAM250.
  - An evolutionary distance longer than 100 PAM units!
- The original PAM matrices were based on a small number of proteins.
- Later versions of PAM use more examples.
- It used to be the most popular scoring rule.

34 Problems with PAM
- PAM extrapolates statistics collected from closely related sequences onto distant ones.
- But "short-time" substitutions behave differently than "long-time" substitutions:
  - Short-time substitutions are dominated by single-nucleotide changes that lead to a different translation (like L → I).
  - Long-time substitutions do not exhibit such behavior and are much more random.
- Therefore, the statistics differ across different stages of evolution.

35 BLOSUM (blocks substitution) matrix
- Source: aligned ungapped regions of protein families.
  - These are assumed to have a common ancestor.
- Procedure (a code sketch follows below):
  - Group together all sequences in a family with more than, e.g., 62% identical residues.
  - Count the number of substitutions within the same family but across different groups.
  - Estimate the frequencies of each pair of letters.
  - The resulting matrix is called BLOSUM62.
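
A rough sketch of the grouping-and-counting steps on a single ungapped block. The single-linkage clustering and the tiny block are simplifying assumptions; the real BLOSUM procedure also reweights clusters and aggregates counts over many blocks.

```python
from collections import Counter
from itertools import combinations

block = ["ACGTAC", "ACGTTC", "TCGATT"]   # hypothetical ungapped block

def identity(u, v):
    """Fraction of identical residues between two equal-length sequences."""
    return sum(a == b for a, b in zip(u, v)) / len(u)

# Single-linkage clustering: sequences >62% identical end up in one group.
groups = [{i} for i in range(len(block))]
for i, j in combinations(range(len(block)), 2):
    if identity(block[i], block[j]) > 0.62:
        gi = next(g for g in groups if i in g)
        gj = next(g for g in groups if j in g)
        if gi is not gj:
            gi |= gj
            groups.remove(gj)

# Count substitutions only across different groups.
pair_counts = Counter()
for i, j in combinations(range(len(block)), 2):
    if any(i in g and j in g for g in groups):
        continue                          # same group: skip
    for a, b in zip(block[i], block[j]):
        pair_counts[tuple(sorted((a, b)))] += 1
print(pair_counts)
```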

