Class 3: Estimating Scoring Rules for Sequence Alignment


Slide 2: Reminder
- Last class we discussed dynamic programming algorithms for:
  - global alignment
  - local alignment
- All of these assumed a pre-specified scoring rule (substitution matrix) that determines the quality of perfect matches, substitutions, and indels, independently of neighboring positions.

Slide 3: A Probabilistic Model
- But how do we derive a "good" substitution matrix?
- It should "encourage" pairs that are likely to change between closely related sequences, and "punish" others.
- Let's examine a general probabilistic approach, guided by evolutionary intuitions.
- Assume that we consider only two options:
  - M: the sequences are evolutionarily related
  - R: the sequences are unrelated

Slide 4: Unrelated Sequences
- Our model of two unrelated sequences s, t is simple: for each position i, both s[i] and t[i] are sampled independently from some "background" distribution q(·) over the alphabet Σ. Let q(a) be the probability of seeing letter a in any position.
- Then the likelihood of s, t (the probability of seeing s, t), given that they are unrelated, is:

  P(s, t | R) = ∏_{i=1}^{n} q(s[i]) q(t[i])

Slide 5: Related Sequences
- Now let's assume that each pair of aligned positions (s[i], t[i]) evolved from a common ancestor, so s[i] and t[i] are dependent!
- We assume each pair (s[i], t[i]) is sampled from some distribution p(·,·) over letter pairs. Let p(a, b) be the probability that some ancestral letter evolved into this particular pair of letters.
- Then the likelihood of s, t, given that they are related, is:

  P(s, t | M) = ∏_{i=1}^{n} p(s[i], t[i])
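To make the two models concrete, here is a minimal Python sketch (the two-letter alphabet and the distributions q and p are hypothetical toy values; computing in log-space avoids underflow on long sequences):

```python
import math

# Toy two-letter alphabet with illustrative (not estimated) distributions:
# q(a) is the background frequency, p(a, b) the joint pair distribution.
q = {"A": 0.6, "B": 0.4}
p = {("A", "A"): 0.45, ("A", "B"): 0.10,
     ("B", "A"): 0.10, ("B", "B"): 0.35}

def log_likelihood_unrelated(s, t):
    """log P(s, t | R): every position is drawn independently from q."""
    return sum(math.log(q[a] * q[b]) for a, b in zip(s, t))

def log_likelihood_related(s, t):
    """log P(s, t | M): every aligned pair is drawn from p."""
    return sum(math.log(p[a, b]) for a, b in zip(s, t))

s, t = "AABA", "AABB"
print(log_likelihood_unrelated(s, t))  # log P(s, t | R)
print(log_likelihood_related(s, t))    # log P(s, t | M)
```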

Slide 6: Decision Problem
- Given two sequences s[1..n] and t[1..n], decide whether they were sampled from M or from R.
- This is an instance of a decision problem that is quite frequent in statistics: hypothesis testing.
- We want to construct a procedure Decide(s, t) = D(s, t) that returns either M or R.
- Intuitively, we want to compare the likelihoods of the data under both models.

Slide 7: Types of Error
- Our procedure can make two types of errors:
  - Type I: s and t are sampled from R, but D(s, t) = M
  - Type II: s and t are sampled from M, but D(s, t) = R
- Define the corresponding error probabilities:

  α(D) = P(D(s, t) = M | R),   β(D) = P(D(s, t) = R | M)

- We want to find a procedure D(s, t) that minimizes both types of errors.

Slide 8: Neyman-Pearson Lemma
- Suppose that D* is defined, for some threshold k, by:

  D*(s, t) = M  iff  P(s, t | M) / P(s, t | R) > k

- If any other D is such that α(D) ≤ α(D*), then β(D) ≥ β(D*), so D* is optimal.
- k may reflect the weights we wish to give to the two types of errors, and the relative abundance of M compared to R.

Slide 9: Likelihood Ratio for Alignment
- The likelihood ratio is a quantitative measure of two sequences being derived from a common origin, compared to random.
- Let's see that it is a natural score for their alignment!
- Plugging in the models, we get:

  P(s, t | M) / P(s, t | R) = ∏_{i=1}^{n} p(s[i], t[i]) / ( q(s[i]) q(t[i]) )

Slide 10: Likelihood Ratio for Alignment
- Taking the logarithm of both sides, we get:

  log [ P(s, t | M) / P(s, t | R) ] = Σ_{i=1}^{n} log [ p(s[i], t[i]) / ( q(s[i]) q(t[i]) ) ]

- We can see that the (log-)likelihood score decomposes into a sum of single-position scores, each dependent only on the two aligned letters!

Slide 11: Probabilistic Interpretation of Scoring Rule
- Therefore, if we take our substitution matrix to be

  σ(a, b) = log [ p(a, b) / ( q(a) q(b) ) ]

  then the score of an alignment is the log-ratio between the likelihoods of the two models, which is nice:
  - Score > 0 ⇒ M is more "probable" (k = 1)
  - Score < 0 ⇒ R is more "probable"
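A sketch of this construction (reusing the same hypothetical toy q and p; log base 2 is an arbitrary choice, and real matrices additionally scale and round the entries):

```python
import math

q = {"A": 0.6, "B": 0.4}
p = {("A", "A"): 0.45, ("A", "B"): 0.10,
     ("B", "A"): 0.10, ("B", "B"): 0.35}

# sigma(a, b) = log2 [ p(a, b) / (q(a) q(b)) ]
sigma = {(a, b): math.log2(p[a, b] / (q[a] * q[b])) for (a, b) in p}

# The alignment score is the sum of per-position scores,
# i.e. the log-likelihood ratio of M versus R.
score = sum(sigma[a, b] for a, b in zip("AABA", "AABB"))
print(score)  # > 0 favors M, < 0 favors R (threshold k = 1)
```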

Slide 12: Modeling Assumptions
- It is important to note that this interpretation depends on our modeling assumptions about the two hypotheses!
- For example, if we assume that the letter in each position depends on the letter in the preceding position, then the likelihood ratio will have a different form.

Slide 13: Constructing Scoring Rules
The formula suggests how to construct a scoring rule:
- Estimate p(·,·) and q(·) from the data
- Compute σ(a, b) based on p(·,·) and q(·)

Slide 14: Estimating Probabilities
- Suppose we are given a long string s[1..n] of letters from Σ.
- We want to estimate the distribution q(·) that "generated" the sequence.
- How should we go about this?
- We build on the theory of parameter estimation in statistics.

Slide 15: Statistical Parameter Fitting
- Consider instances x[1], x[2], …, x[M] such that:
  - The set of values that x can take is known.
  - Each is sampled from the same (unknown) distribution of a known family (multinomial, Gaussian, Poisson, etc.).
  - Each is sampled independently of the rest.
- The task is to find a parameter set Θ defining the most likely distribution P(x | Θ) from which the instances could have been sampled.
- The parameters Θ depend on the given family of probability distributions.

Slide 16: Example: Binomial Experiment
- Consider a thumbtack: when tossed, it can land in one of two positions, Head or Tail.
- We denote by θ the (unknown) probability P(H).
- Estimation task: given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ.

Slide 17: Why Is Learning Possible?
- Suppose we perform M independent flips of the thumbtack.
- The number of heads we see, N_H, follows a binomial distribution, and thus

  E[N_H / M] = θ,   Var(N_H / M) = θ(1 − θ) / M → 0 as M grows

- This suggests that we can estimate θ by θ̂ = N_H / M.
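A quick simulation of this convergence (Python sketch; the true θ is fixed here only so the estimate can be compared against it):

```python
import random

random.seed(0)
theta = 0.3  # true probability of heads, normally unknown

for M in (10, 100, 1000, 10000):
    n_heads = sum(random.random() < theta for _ in range(M))
    print(f"M = {M:5d}   estimate N_H / M = {n_heads / M:.3f}")
```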

Slide 18: Expected Behavior (θ = 0.5)
[Figure: the (rescaled) probability of the estimate over datasets of i.i.d. samples, for M = 10, M = 100, M = 1000.]
- From most large datasets, we get a good approximation to θ.
- How do we derive such estimators in a principled way?

Slide 19: The Likelihood Function
- How good is a particular θ? That depends on how likely it is to generate the observed data.
- The likelihood for the sequence H, T, T, H, H is:

  L(θ : D) = θ (1 − θ) (1 − θ) θ θ = θ³ (1 − θ)²

Slide 20: Maximum Likelihood Estimation
- MLE principle: learn parameters that maximize the likelihood function.
- This is one of the most commonly used estimators in statistics.
- It is intuitively appealing.

Slide 21: Computing the Likelihood Function
- To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):

  L(θ : D) = θ^{N_H} (1 − θ)^{N_T}

- N_H and N_T are sufficient statistics for the binomial distribution.

Slide 22: Sufficient Statistics
- A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
- Formally, s(D) is a sufficient statistic if, for any two datasets D and D′:

  s(D) = s(D′) ⇒ L(θ | D) = L(θ | D′)

Slide 23: Example: MLE for Binomial Data
- Applying the MLE principle (after differentiating) we get

  θ̂ = N_H / (N_H + N_T)

  which coincides with what we would expect.
[Figure: the likelihood L(θ : D) as a function of θ.]
- Example: (N_H, N_T) = (3, 2). The MLE estimate is 3/5 = 0.6.
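A small numerical cross-check of this example, maximizing the likelihood over a grid instead of differentiating:

```python
# For (N_H, N_T) = (3, 2), the likelihood is theta^3 * (1 - theta)^2;
# a grid search should peak at the closed-form MLE 3/5.
n_h, n_t = 3, 2
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda th: th ** n_h * (1 - th) ** n_t)
print(theta_hat)  # 0.6
```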

Slide 24: From Binomial to Multinomial
- Suppose X can take the values 1, 2, …, K.
- We want to learn the parameters θ₁, θ₂, …, θ_K.
- Sufficient statistics: N₁, N₂, …, N_K, the number of times each outcome is observed.
- Likelihood function:

  L(Θ : D) = ∏_{k=1}^{K} θ_k^{N_k}

- MLE (differentiation with Lagrange multipliers):

  θ̂_k = N_k / Σ_j N_j

Slide 25: At Last: Estimating q(·)
- Suppose we are given a long string s[1..n] of letters from Σ.
- s can be the concatenation of all sequences in our database.
- We want to estimate the distribution q(·).
- Likelihood function:

  L(q : s) = ∏_{i=1}^{n} q(s[i]) = ∏_{a ∈ Σ} q(a)^{N_a}

- MLE parameters:

  q(a) = N_a / n,  where N_a is the number of times a appears in s
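Since this MLE is just frequency counting, the estimation is a few lines (sketch; the input string is hypothetical):

```python
from collections import Counter

def estimate_q(s):
    """MLE of the background distribution: q(a) = N_a / n."""
    n = len(s)
    return {a: count / n for a, count in Counter(s).items()}

print(estimate_q("ACGTACGTTTTA"))  # e.g. the concatenated database
```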

Slide 26: Estimating p(·,·)
Intuition:
- Find a pair of presumably related aligned sequences s[1..n], t[1..n].
- Estimate the probability of pairs in the sequences:

  p(a, b) = N_{a,b} / n,  where N_{a,b} is the number of times a is aligned with b in (s, t)

- Again, s and t can be the concatenation of many aligned pairs from the database.
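A corresponding sketch for p(·,·) (the aligned strings are hypothetical; pooling (a, b) with (b, a) is an assumption, made because the direction of a substitution is not observable):

```python
from collections import Counter

def estimate_p(s, t):
    """MLE of the pair distribution from one aligned pair of strings:
    p(a, b) = N_ab / total, counting (a, b) and (b, a) together."""
    counts = Counter()
    for a, b in zip(s, t):
        counts[a, b] += 1
        counts[b, a] += 1
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()}

print(estimate_p("AACGT", "AAGGT"))
```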

Slide 27: Estimating p(·,·)
Problems:
- How do we find pairs of presumably related aligned sequences?
- Can we ensure that the two sequences indeed share a common ancestor?
- How far back should this ancestor be?
  - earlier divergence → low sequence similarity
  - later divergence → high sequence similarity
- The substitution score of each pair of letters should depend on the evolutionary distance of the compared sequences!

Slide 28: Let Evolution In
- Again, we need to make some assumptions:
  - Each position changes independently of the rest.
  - The probability of mutation is the same in each position.
  - Evolution does not "remember" its history.
[Figure: letters (A, C, G, T) at a single position mutating over times t, t+Δ, t+2Δ, t+3Δ, t+4Δ.]

Slide 29: Model of Evolution
- How do we model such a process?
- The process (for each position independently) is called a Markov chain.
- A chain is defined by the transition probability P(X_{t+Δ} = b | X_t = a), the probability that the next state is b given that the current state is a.
- We often describe these probabilities by a matrix:

  T[Δ]_{ab} = P(X_{t+Δ} = b | X_t = a)

Slide 30: Two-Step Changes
- Based on T[Δ], we can compute the probabilities of change over two time periods:

  T[2Δ]_{ab} = Σ_c T[Δ]_{ac} T[Δ]_{cb}

- Thus T[2Δ] = T[Δ] T[Δ].
- By induction: T[kΔ] = T[Δ]^k.
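This composition is plain matrix multiplication, as in the following sketch (a toy two-state chain with illustrative entries):

```python
import numpy as np

# Toy transition matrix T[delta] over a two-letter alphabet; rows sum to 1.
T = np.array([[0.99, 0.01],
              [0.02, 0.98]])

print(T @ T)                          # T[2*delta]
print(np.linalg.matrix_power(T, 50))  # T[50*delta], by induction
```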

Slide 31: Longer-Term Changes
Idea:
- Estimate T[Δ] from a set S of closely related sequences.
- Use T[Δ] to compute T[kΔ] = T[Δ]^k.
- Derive the substitution probability after time kΔ:

  p_k(a, b) = q(a) T[kΔ]_{ab}

- Note that the score now depends on the evolutionary distance, as requested.

Slide 32: Estimating PAM1
- Collect counts N_ab of aligned pairs (a, b) in similar sequences in S.
  - Sources include phylogenetic trees and closely related sequences (at least 85% of positions are exact matches).
- Normalize counts to get a transition matrix T[Δ] such that the average number of changes is 1%, that is:

  Σ_a q(a) (1 − T[Δ]_{aa}) = 0.01

  This defines 1 point accepted mutation (PAM1), an evolutionary time unit!
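One way this normalization might be implemented (a sketch; the exact weighting and symmetrization in the original PAM construction are more involved, and the counts and background here are hypothetical):

```python
import numpy as np

def pam1_from_counts(N, q):
    """Turn a matrix of aligned-pair counts N and background q into T[delta],
    rescaled so the expected fraction of changed positions is 1% (PAM1)."""
    T = N / N.sum(axis=1, keepdims=True)      # row-normalize the counts
    np.fill_diagonal(T, 0.0)                  # keep only off-diagonal mass
    expected_change = q @ T.sum(axis=1)       # sum_a q(a) (1 - T_aa)
    T *= 0.01 / expected_change               # scale changes to 1%
    np.fill_diagonal(T, 1.0 - T.sum(axis=1))  # restore rows summing to 1
    return T

# Hypothetical counts over a three-letter alphabet.
q = np.array([0.5, 0.3, 0.2])
N = np.array([[90.0, 5.0, 5.0],
              [5.0, 80.0, 15.0],
              [5.0, 15.0, 70.0]])
print(pam1_from_counts(N, q))
```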

Slide 33: Using PAM
- The matrix PAM-k is defined to be the score based on T[Δ]^k.
- Historically, researchers use PAM250:
  - an evolutionary distance longer than 100 PAM units!
- The original PAM matrices were based on a small number of proteins.
- Later versions of PAM use more examples.
- PAM used to be the most popular scoring rule.
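A sketch of turning T[Δ] into PAM-k scores, under the assumption from slide 31 that p_k(a, b) = q(a) T[kΔ]_{ab}, which simplifies the log-odds score to log(T[kΔ]_{ab} / q(b)); real PAM matrices are additionally scaled and rounded:

```python
import numpy as np

def pam_k_scores(T, q, k):
    """sigma_k(a, b) = log2( T[k*delta]_ab / q(b) ), since
    p_k(a, b) / (q(a) q(b)) = T[k*delta]_ab / q(b)."""
    Tk = np.linalg.matrix_power(T, k)
    return np.log2(Tk / q[np.newaxis, :])

# Reusing the toy two-state chain from the earlier sketch.
T = np.array([[0.99, 0.01],
              [0.02, 0.98]])
q = np.array([0.6, 0.4])
print(pam_k_scores(T, q, 250))
```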

Slide 34: Problems with PAM
- PAM extrapolates statistics collected from closely related sequences onto distant ones.
- But "short-time" substitutions behave differently from "long-time" substitutions:
  - short-time substitutions are dominated by single-nucleotide changes that lead to a different translation (like L → I)
  - long-time substitutions do not exhibit such behavior and are much more random
- Therefore, the statistics differ for different stages of evolution.

Slide 35: BLOSUM (BLOcks SUbstitution Matrix)
- Source: aligned ungapped regions of protein families.
  - These are assumed to have a common ancestor.
- Procedure (a counting sketch follows below):
  - Group together all sequences in a family with more than, e.g., 62% identical residues.
  - Count the number of substitutions within the same family but across different groups.
  - Estimate the frequencies of each pair of letters.
  - The resulting matrix is called BLOSUM62.
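A highly simplified sketch of the counting step (the block is hypothetical; the real BLOSUM procedure additionally clusters sequences at the identity threshold, weights the clusters, and converts the frequencies into scaled log-odds scores):

```python
from collections import Counter
from itertools import combinations

def block_pair_counts(block):
    """Count aligned letter pairs over all sequence pairs in one ungapped
    block; (a, b) and (b, a) are pooled, as in the p(.,.) estimate above."""
    counts = Counter()
    for s, t in combinations(block, 2):
        for a, b in zip(s, t):
            counts[a, b] += 1
            counts[b, a] += 1
    return counts

# Hypothetical toy block of aligned, ungapped segments from one family.
block = ["ACDE", "ACDE", "ACNE"]
print(block_pair_counts(block))
```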