Class 3: Estimating Scoring Rules for Sequence Alignment
Slide 2: Reminder
- Last class we discussed dynamic programming algorithms for:
  - global alignment
  - local alignment
- All of these assumed a pre-specified scoring rule (substitution matrix) that determines the quality of perfect matches, substitutions, and indels, independently of neighboring positions.
Slide 3: A Probabilistic Model
- But how do we derive a "good" substitution matrix?
- It should "encourage" pairs that are likely to change into one another in closely related sequences, and "punish" others.
- Let's examine a general probabilistic approach, guided by evolutionary intuitions.
- Assume that we consider only two hypotheses:
  - M: the sequences are evolutionarily related
  - R: the sequences are unrelated
Slide 4: Unrelated Sequences
- Our model of two unrelated sequences s, t is simple: for each position i, both s[i] and t[i] are sampled independently from some "background" distribution q(·) over the alphabet Σ.
- Let q(a) be the probability of seeing letter a in any position. Then the likelihood of s, t (the probability of observing s, t), given that they are unrelated, is:

  P(s, t | R) = ∏_{i=1..n} q(s[i]) · q(t[i])
Slide 5: Related Sequences
- Now let's assume that each pair of aligned positions (s[i], t[i]) evolved from a common ancestor, so s[i] and t[i] are dependent!
- We assume (s[i], t[i]) is sampled from some distribution p(·,·) over pairs of letters. Let p(a,b) be the probability that some ancestral letter evolved into this particular pair of letters.
- Then the likelihood of s, t, given that they are related, is:

  P(s, t | M) = ∏_{i=1..n} p(s[i], t[i])
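To make the two models concrete, here is a minimal sketch in Python; the two-letter alphabet and the q(·) and p(·,·) values are illustrative toy numbers, not estimates from real data.

```python
from math import prod   # Python 3.8+

# Toy two-letter alphabet; q(.) and p(.,.) are illustrative, not real estimates.
q = {"A": 0.6, "B": 0.4}
p = {("A", "A"): 0.45, ("A", "B"): 0.10,
     ("B", "A"): 0.10, ("B", "B"): 0.35}

def likelihood_unrelated(s, t):
    """P(s,t | R) = product over i of q(s[i]) * q(t[i])."""
    return prod(q[a] * q[b] for a, b in zip(s, t))

def likelihood_related(s, t):
    """P(s,t | M) = product over i of p(s[i], t[i])."""
    return prod(p[(a, b)] for a, b in zip(s, t))

s, t = "AABA", "AABB"
print(likelihood_related(s, t) / likelihood_unrelated(s, t))  # ratio > 1 favors M
```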
Slide 6: Decision Problem
- Given two sequences s[1..n] and t[1..n], decide whether they were sampled from M or from R.
- This is an instance of a decision problem that is quite frequent in statistics: hypothesis testing.
- We want to construct a procedure Decide(s,t) = D(s,t) that returns either M or R.
- Intuitively, we want to compare the likelihoods of the data under both models...
Slide 7: Types of Error
- Our procedure can make two types of errors:
  I. s and t are sampled from R, but D(s,t) = M
  II. s and t are sampled from M, but D(s,t) = R
- Define the corresponding error probabilities:

  α(D) = P(D(s,t) = M | R),   β(D) = P(D(s,t) = R | M)

- We want to find a procedure D(s,t) that minimizes both types of errors.
Slide 8: Neyman-Pearson Lemma
- Suppose that D* is a likelihood-ratio test: for some threshold k,

  D*(s,t) = M  if and only if  P(s,t | M) / P(s,t | R) > k

- If any other procedure D satisfies α(D) ≤ α(D*), then β(D) ≥ β(D*), so D* is optimal.
- The choice of k may reflect the weights we wish to give to the two types of errors, and the relative abundance of M compared to R.
Slide 9: Likelihood Ratio for Alignment
- The likelihood ratio is a quantitative measure of two sequences being derived from a common origin, compared to a random origin.
- Let's see that it is a natural score for their alignment!
- Plugging in the two models, we have:

  P(s,t | M) / P(s,t | R) = ∏_{i=1..n} p(s[i], t[i]) / ( q(s[i]) · q(t[i]) )
Slide 10: Likelihood Ratio for Alignment
- Taking the logarithm of both sides, we get:

  log [ P(s,t | M) / P(s,t | R) ] = Σ_{i=1..n} log [ p(s[i], t[i]) / ( q(s[i]) · q(t[i]) ) ]

- The (log-)likelihood score decomposes into a sum of single-position scores, each depending only on the two aligned letters!
Slide 11: Probabilistic Interpretation of Scoring Rule
- Therefore, if we take our substitution matrix to be

  σ(a,b) = log [ p(a,b) / ( q(a) · q(b) ) ]

  then the score of an alignment is the log-ratio between the likelihoods of the two models, which is nice:
  - Score > 0 ⇒ M is more "probable" (for k = 1)
  - Score < 0 ⇒ R is more "probable"
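As a sketch, the log-odds matrix and the resulting additive alignment score can be computed directly; the q(·) and p(·,·) values below are the same illustrative toy numbers as before, not real estimates.

```python
import math

# Same toy distributions as in the earlier sketch (illustrative only).
alphabet = "AB"
q = {"A": 0.6, "B": 0.4}
p = {("A", "A"): 0.45, ("A", "B"): 0.10,
     ("B", "A"): 0.10, ("B", "B"): 0.35}

# Log-odds substitution scores: sigma(a,b) = log( p(a,b) / (q(a) * q(b)) )
sigma = {(a, b): math.log(p[(a, b)] / (q[a] * q[b]))
         for a in alphabet for b in alphabet}

def alignment_score(s, t):
    """Score of an ungapped alignment: the sum of per-position log-odds."""
    return sum(sigma[(a, b)] for a, b in zip(s, t))

print(alignment_score("AABA", "AABB"))  # > 0 favors M, < 0 favors R (k = 1)
```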
Slide 12: Modeling Assumptions
- It is important to note that this interpretation depends on our modeling assumptions about the two hypotheses!
- For example, if we assume that the letter in each position depends on the letter in the preceding position, then the likelihood ratio will have a different form.
Slide 13: Constructing Scoring Rules
- The formula suggests how to construct a scoring rule:
  - Estimate p(·,·) and q(·) from the data
  - Compute σ(a,b) based on p(·,·) and q(·)
Slide 14: Estimating Probabilities
- Suppose we are given a long string s[1..n] of letters from Σ.
- We want to estimate the distribution q(·) that "generated" the sequence.
- How should we go about this?
- We build on the theory of parameter estimation in statistics.
Slide 15: Statistical Parameter Fitting
- Consider instances x[1], x[2], ..., x[M] such that:
  - the set of values that x can take is known;
  - each is sampled from the same (unknown) distribution of a known family (multinomial, Gaussian, Poisson, etc.);
  - each is sampled independently of the rest.
- The task is to find a parameter set Θ defining the most likely distribution P(x | Θ) from which the instances could have been sampled.
- The parameters depend on the given family of probability distributions.
Slide 16: Example: Binomial Experiment
- Consider a thumbtack: when tossed, it can land in one of two positions, Head or Tail.
- We denote by θ the (unknown) probability P(H).
- Estimation task: given a sequence of toss samples x[1], x[2], ..., x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ.
Slide 17: Why Is Learning Possible?
- Suppose we perform M independent flips of the thumbtack.
- The number of heads we see, N_H, follows a binomial distribution:

  P(N_H = h) = C(M, h) · θ^h · (1 − θ)^(M − h),  and thus  E[N_H / M] = θ

- This suggests that we can estimate θ by N_H / M.
Slide 18: Expected Behavior (θ = 0.5)
[Figure: distribution of the estimate N_H / M (probability, rescaled) over datasets of i.i.d. samples, for M = 10, 100, and 1000; the distribution concentrates around θ = 0.5 as M grows.]
- From most large datasets, we get a good approximation to θ.
- How do we derive such estimators in a principled way?
Slide 19: The Likelihood Function
- How good is a particular θ? It depends on how likely it is to generate the observed data:

  L(θ : D) = P(D | θ) = ∏_m P(x[m] | θ)

- The likelihood for the sequence H, T, T, H, H is:

  L(θ : D) = θ · (1 − θ) · (1 − θ) · θ · θ = θ³ (1 − θ)²
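A minimal sketch of this computation; it simply evaluates the binomial likelihood at a few values of θ for the H, T, T, H, H data from the slide.

```python
def likelihood(theta, n_heads, n_tails):
    """L(theta : D) = theta^N_H * (1 - theta)^N_T for thumbtack data."""
    return theta ** n_heads * (1 - theta) ** n_tails

# The H,T,T,H,H sequence from the slide: N_H = 3, N_T = 2.
for theta in (0.2, 0.4, 0.6, 0.8):
    print(theta, round(likelihood(theta, 3, 2), 4))  # peaks near theta = 0.6
```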
Slide 20: Maximum Likelihood Estimation
- MLE principle: learn parameters that maximize the likelihood function.
- This is one of the most commonly used estimators in statistics.
- It is intuitively appealing.
Slide 21: Computing the Likelihood Function
- To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):

  L(θ : D) = θ^{N_H} · (1 − θ)^{N_T}

- N_H and N_T are sufficient statistics for the binomial distribution.
Slide 22: Sufficient Statistics
- A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
- Formally, s(D) is a sufficient statistic if for any two datasets D and D':

  s(D) = s(D')  ⇒  L(θ | D) = L(θ | D')
Slide 23: Example: MLE in Binomial Data
- Applying the MLE principle, we get (after differentiating):

  θ̂ = N_H / (N_H + N_T)

  which coincides with what we would expect.
- Example: (N_H, N_T) = (3, 2); the MLE estimate is 3/5 = 0.6.
Slide 24: From Binomial to Multinomial
- Suppose X can take the values 1, 2, ..., K.
- We want to learn the parameters θ_1, θ_2, ..., θ_K.
- Sufficient statistics: N_1, N_2, ..., N_K, the number of times each outcome is observed.
- Likelihood function:

  L(θ : D) = ∏_{k=1..K} θ_k^{N_k}

- MLE (by differentiation with Lagrange multipliers):

  θ̂_k = N_k / Σ_l N_l
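The multinomial MLE is just normalized counts; a minimal sketch:

```python
from collections import Counter

def multinomial_mle(samples):
    """MLE for multinomial parameters: theta_k = N_k / sum_l N_l."""
    counts = Counter(samples)            # sufficient statistics N_1..N_K
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}

# The thumbtack data H,T,T,H,H as a special case (K = 2):
print(multinomial_mle("HTTHH"))          # {'H': 0.6, 'T': 0.4}
```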
Slide 25: At Last: Estimating q(·)
- Suppose we are given a long string s[1..n] of letters from Σ; s can be the concatenation of all sequences in our database.
- We want to estimate the distribution q(·).
- Likelihood function:

  L(q : s) = ∏_{i=1..n} q(s[i]) = ∏_{a∈Σ} q(a)^{N_a}

- MLE parameters:

  q(a) = N_a / n,  where N_a is the number of times a appears in s
Slide 26: Estimating p(·,·)
- Intuition: find a pair of presumably related, aligned sequences s[1..n], t[1..n].
- Estimate the probability of pairs in the alignment:

  p(a,b) = N_{a,b} / n,  where N_{a,b} is the number of times a is aligned with b in (s,t)

- Again, s and t can be the concatenation of many aligned pairs from the database.
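Putting the two estimates together, here is a sketch that derives q(·), p(·,·), and the log-odds scores σ(a,b) from a single aligned, ungapped pair; the pseudocount smoothing is our addition, to avoid log(0) on pairs that never occur in such a tiny sample.

```python
import math
from collections import Counter

def estimate_scores(s, t, pseudocount=1.0):
    """Estimate q(.), p(.,.) and sigma(a,b) = log(p/(q*q)) from one
    aligned, ungapped pair. Pseudocounts (our addition) avoid log(0)."""
    alphabet = sorted(set(s) | set(t))

    # Pair counts, symmetrized: a column (a,b) is also evidence for (b,a).
    pairs = Counter()
    for a, b in zip(s, t):
        pairs[(a, b)] += 1
        pairs[(b, a)] += 1

    letters = Counter(s) + Counter(t)
    n_letters = sum(letters.values()) + pseudocount * len(alphabet)
    q = {a: (letters[a] + pseudocount) / n_letters for a in alphabet}

    n_pairs = sum(pairs.values()) + pseudocount * len(alphabet) ** 2
    return {(a, b): math.log(((pairs[(a, b)] + pseudocount) / n_pairs)
                             / (q[a] * q[b]))
            for a in alphabet for b in alphabet}

sigma = estimate_scores("ACGTACGT", "ACGAACGT")
print(round(sigma[("A", "A")], 2), round(sigma[("T", "A")], 2))
```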
Slide 27: Estimating p(·,·)
Problems:
- How do we find pairs of presumably related aligned sequences?
- Can we ensure that the two sequences indeed descend from a common ancestor?
- How far back should this ancestor be?
  - earlier divergence ⇒ low sequence similarity
  - later divergence ⇒ high sequence similarity
- The substitution score of two letters should therefore depend on the evolutionary distance between the compared sequences!
Slide 28: Let Evolution In
- Again, we need to make some assumptions:
  - Each position changes independently of the rest.
  - The probability of mutation is the same in each position.
  - Evolution does not "remember" (the process is memoryless).
- Example over times t, t+Δ, t+2Δ, t+3Δ, t+4Δ: one position may evolve A → A → C → C → G while another evolves T → T → T → C → G.
Slide 29: Model of Evolution
- How do we model such a process?
- The process (for each position independently) is called a Markov chain.
- A chain is defined by the transition probability P(X_{t+Δ} = b | X_t = a), the probability that the next state is b given that the current state is a.
- We often describe these probabilities by a matrix:

  T[Δ]_{ab} = P(X_{t+Δ} = b | X_t = a)
Slide 30: Two-Step Changes
- Based on T[Δ], we can compute the probabilities of change over two time periods:

  P(X_{t+2Δ} = b | X_t = a) = Σ_c T[Δ]_{ac} · T[Δ]_{cb}

- Thus T[2Δ] = T[Δ] T[Δ], and by induction T[kΔ] = T[Δ]^k.
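A sketch of the matrix-power computation with NumPy; the 4×4 nucleotide transition matrix below is illustrative, not estimated from data.

```python
import numpy as np

# Toy transition matrix T[delta] for the alphabet A, C, G, T
# (illustrative numbers; each row sums to 1).
T = np.array([
    [0.97, 0.01, 0.01, 0.01],   # from A
    [0.01, 0.97, 0.01, 0.01],   # from C
    [0.01, 0.01, 0.97, 0.01],   # from G
    [0.01, 0.01, 0.01, 0.97],   # from T
])

# T[k * delta] = T[delta]^k
T2 = T @ T                               # two time steps
T100 = np.linalg.matrix_power(T, 100)    # one hundred time steps
print(T100[0].round(3))                  # transition probabilities from A
```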
Slide 31: Longer-Term Changes
- Idea:
  - Estimate T[Δ] from a set S of closely related sequences.
  - Use T[Δ] to compute T[kΔ] = T[Δ]^k.
  - Derive the substitution probabilities after time kΔ:

    p_{kΔ}(a,b) = q(a) · T[kΔ]_{ab}

- Note that the score now depends on the evolutionary distance, as requested.
Slide 32: Estimating PAM1
- Collect counts N_{ab} of aligned pairs (a,b) in similar sequences in S.
  - Sources include phylogenetic trees and closely related sequences (at least 85% of positions match exactly).
- Normalize the counts to get a transition matrix T[Δ] such that the average number of changes is 1%, that is:

  Σ_a q(a) · (1 − T[Δ]_{aa}) = 0.01

- This is called 1 point accepted mutation (PAM1): an evolutionary time unit!
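A sketch of the normalization step, assuming an aligned-pair count matrix `counts` and background `q`; scaling only the off-diagonal mass so the expected fraction of changed positions comes out to exactly 1% is our reading of the construction.

```python
import numpy as np

def normalize_to_pam1(counts, q):
    """Turn aligned-pair counts into a PAM1-style transition matrix:
    scale off-diagonal probabilities so sum_a q(a)*(1 - T[aa]) = 0.01."""
    T = counts / counts.sum(axis=1, keepdims=True)  # row-normalized counts
    np.fill_diagonal(T, 0.0)                        # keep off-diagonal mass
    change = T.sum(axis=1)                          # change prob. per letter
    T *= 0.01 / float(q @ change)                   # average change -> 1%
    np.fill_diagonal(T, 1.0 - T.sum(axis=1))        # rows sum to 1 again
    return T

q = np.array([0.25, 0.25, 0.25, 0.25])
counts = np.array([[90., 4., 3., 3.],               # toy aligned-pair counts
                   [4., 90., 3., 3.],
                   [3., 3., 90., 4.],
                   [3., 3., 4., 90.]])
T1 = normalize_to_pam1(counts, q)
print((q @ (1.0 - np.diag(T1))).round(4))           # 0.01 by construction
```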
Slide 33: Using PAM
- The matrix PAM-k is the score matrix based on T[Δ]^k; with p_{kΔ}(a,b) = q(a) · T[kΔ]_{ab}, the log-odds score reduces to:

  PAM-k(a,b) = log [ p_{kΔ}(a,b) / ( q(a) · q(b) ) ] = log [ T[kΔ]_{ab} / q(b) ]

- Historically, researchers use PAM250.
  - That is an evolutionary distance longer than 100 PAM units!
- The original PAM matrices were based on a small number of proteins; later versions of PAM use more examples.
- PAM used to be the most popular scoring rule.
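Combining the pieces, a sketch of a PAM-k score matrix; the toy `T1` and uniform `q` below are illustrative, and the reduction σ(a,b) = log(T^k_{ab} / q(b)) follows from the slides' p(a,b) = q(a) · T^k_{ab}.

```python
import numpy as np

def pam_scores(T1, q, k):
    """PAM-k log-odds scores: sigma(a,b) = log( (T1^k)[a,b] / q(b) ),
    using p(a,b) = q(a) * (T1^k)[a,b] as on the previous slides."""
    Tk = np.linalg.matrix_power(T1, k)
    return np.log(Tk / q[np.newaxis, :])

# Toy PAM1-like matrix (rows sum to 1) and a uniform background q.
T1 = np.array([[0.990, 0.004, 0.003, 0.003],
               [0.004, 0.990, 0.003, 0.003],
               [0.003, 0.003, 0.990, 0.004],
               [0.003, 0.003, 0.004, 0.990]])
q = np.array([0.25, 0.25, 0.25, 0.25])

print(pam_scores(T1, q, 250).round(2))   # a PAM250-style score matrix
```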
Slide 34: Problems with PAM
- PAM extrapolates statistics collected from closely related sequences onto distant ones.
- But "short-time" substitutions behave differently from "long-time" substitutions:
  - Short-time substitutions are dominated by single-nucleotide changes that lead to a different translation (like L → I).
  - Long-time substitutions do not exhibit such behavior and are much more random.
- Therefore, the statistics differ at different stages of evolution.
Slide 35: BLOSUM (BLOcks SUbstitution Matrix)
- Source: aligned, ungapped regions of protein families (blocks), which are assumed to have a common ancestor. A sketch of the procedure appears below.
- Procedure:
  - Group together all sequences in a family with more than, e.g., 62% identical residues.
  - Count the number of substitutions within the same family but across different groups.
  - Estimate the frequencies of each pair of letters.
  - The resulting matrix is called BLOSUM62.
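A much-simplified sketch of that procedure on a single toy block; real BLOSUM pools many blocks and weights clusters rather than counting every cross-cluster sequence pair equally, so treat this only as an illustration of the counting idea.

```python
import math
from collections import Counter
from itertools import combinations

def blosum_like_scores(block, threshold=0.62):
    """Cluster sequences of one ungapped block at `threshold` identity,
    count substitutions only across clusters, and return log-odds scores."""
    def identity(u, v):
        return sum(a == b for a, b in zip(u, v)) / len(u)

    # Greedy single-link clustering at the identity threshold.
    clusters = []
    for seq in block:
        for cl in clusters:
            if any(identity(seq, member) >= threshold for member in cl):
                cl.append(seq)
                break
        else:
            clusters.append([seq])

    # Count aligned pairs across different clusters only.
    pairs, letters = Counter(), Counter()
    for ci, cj in combinations(range(len(clusters)), 2):
        for u in clusters[ci]:
            for v in clusters[cj]:
                for a, b in zip(u, v):
                    pairs[tuple(sorted((a, b)))] += 1
                    letters[a] += 1
                    letters[b] += 1

    n_pairs, n_letters = sum(pairs.values()), sum(letters.values())
    scores = {}
    for (a, b), n in pairs.items():
        q_a, q_b = letters[a] / n_letters, letters[b] / n_letters
        expected = q_a * q_b if a == b else 2 * q_a * q_b
        scores[(a, b)] = math.log((n / n_pairs) / expected)
    return scores

block = ["ACDEF", "ACDEG", "SQDPF", "SQDPW"]   # two clusters of two
print(blosum_like_scores(block))
```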