# Cbio course, spring 2005, Hebrew University (Alignment) Score Statistics.

## Presentation on theme: "Cbio course, spring 2005, Hebrew University (Alignment) Score Statistics."— Presentation transcript:

cbio course, spring 2005, Hebrew University (Alignment) Score Statistics

cbio course, spring 2005, Hebrew University Motivation Reminder: Basic motivation: we want to check if 2 sequences are “related” or not We align 2 sequences and get a score (s) which measures how similar they are Given s, do we accept the hypothesis the 2 are related or reject it ? How high should s be so that we “believe” they are related ??

cbio course, spring 2005, Hebrew University Motivation (2) We need a “rigorous way” to decide on a threshold s* so that for s > s*, we call the sequences “related” Note: s* should obviously be s*(n,m) where n and m are the length of the 2 sequences aligned When we try matching sequence x against a D.B of N (N>>1) sequences, we need to account for the fact we might see high scores “just by chance” We can make 2 kinds of mistakes in our calls: FP FN → We want our “rigorous way” to control FP FN mistakes

cbio course, spring 2005, Hebrew University Motivation (3) The problem of assigning statistical significance to scores and controlling our FP and FN mistakes is of general interest. Examples: Similarities between protein sequnece to profile HMM Log ratio scores when searching DNA sequence motifs …. The methods we develop now will be of general use

cbio course, spring 2005, Hebrew University Reminder In the last lesson we talked about 2 ways to analyze alignment scores and their significance: Bayesian Classical EVD approach We reviewed how the amount of FP mistakes can be controlled using each of these approaces We reviewed Karlin & Altshul (1990) results

cbio course, spring 2005, Hebrew University Review First Approach – Bayesian Where: We saw: Assume we have two states in our world : M (Model = related sequences) R (Random = un realated sequences) Given a fixed alignment of two sequences (x,y) we ask “from which state it came from M or R ?”

cbio course, spring 2005, Hebrew University

Review Bayesian Approach cont. We saw that in order to control the expected number of false identifications, when testing scores which came from R, we need the threshold over the scores S * to have S * ~ log(number of trials * K ) Where: Number of trials for scoring a sequence of length m in local aligment against N sequences of length n is nmN K in [0,1] is correlation factor compensating for the fact the trials are correlated.

cbio course, spring 2005, Hebrew University Review EVD Approach In the EVD approach we are interested in the question: “given a score s for aligning x and y, If this s came from a distribution of scores for unrelated sequences (like R in the Bayesian approach), What’s the probability of seeing a score as good as s by chance, simply because I tried so many matches of sequences against x”? R here is the null hypothesis we are testing against. If P(score >= s | we tried N scores) < Threshold (say 0.01) then we “reject” the null hypothesis (R) NOTE: There is no “second” hypothesis here. We are guarding against type 1 errors (FP) No control or assumptions are made about FN here !! This setting is appropriate for the problem we have at hand (D.B search)

cbio course, spring 2005, Hebrew University

Toy Problem Let s,t be two randomly chosen DNA sequences of length n sampled from the uniform distribution over the DNA alphabet. Align s versus t with no gaps (i.e. s[1] is aligned to t[1] until s[n] is aligned to t[n].) What is the probability that there are k matches (not necessarily continuous ones) between s and t? Suppose you are a researcher and you have two main hypothesis: Either these two sequences are totally unrelated or there was a common ancestor to both of them (there is no indel option here). How would you use the number of matches to decide between the two options and attach a statistical confidence measure to this decision?

cbio course, spring 2005, Hebrew University 0102030405060708090100 0 0.05 0.1 0.15 0.2 0.25 Empirical Distribution over  scores, for p=0.25, using M = 100K samples n = 100 n = 20 Pvalue for score = 30 and n= 100 NOTE: As in our “real” problem - pvalue of score depends on n

cbio course, spring 2005, Hebrew University EVD for our problem In the EVD approach, we are interested in the question: “what is the probability of me seeing such a good a score as S*, only from matches to non related sequences, if I tried N such matches?” Compute: If we want to guarantee the P{ Max(S 1 … S N ) >= S*} < 0.05 where S i are scores of matches against non related sequences sampled i.i.d, then: [1 – pvalue(S*)] N > 0.95 i.e

cbio course, spring 2005, Hebrew University

Guarding against mistakes & evaluating performace In the EVD we kept guarding against FP mistakes. This is very important when doing tasks where many tests are preformed, as in our case of D.B search Sometime we are not able to compute EVD and we still want to control FPR. A very strict and simple Solution is the “Bonferroni corrected pvalue” = pvalue*N Where N is the number of tests perfromed. Note: The relation to the “union bound” is clear Problem: Bonf. Controls the FWER (family wise error rate) i.e the probability of seeing even 1 mistake in the results we report as significant (a FP mistake). It does so with very basically no assumption on the distribution, the relations between the hypothesis tested etc. and still guaranties control over FWER The price to pay is in FN …..

cbio course, spring 2005, Hebrew University Bonf. Example on our case We saw: If we want to guarantee the P{ Max(S 1 … S N ) >= S*} < 0.05 where S i are scores of matches against non related sequences sampled i.i.d, then: [1 – pvalue(S*)] N > 0.95 i.e Compare for N = 10 the result for this equation: 0.005116 to the Bonf. Corrected pvalue: 0.05/N = 0.005 For N = 20: 0.00256 vs. 0.002 etc… If we used the strict Bonf. for the same guarantee level we wanted, we might have rejected some “good” results.

cbio course, spring 2005, Hebrew University How to estimate performance? Say you have a method with a score (in our case: “method” = scoring local alignment with affine gaps, and scoring matrix “Sigma” (e.g. Sigma = PAM1) You set a threshold over the scores based on some criteria (e.g EVD estimation of the scores in random matches) You want to evaluate your methods performace on some “test” data set. The data set would typically contain some true and false examples. Assumption: you KNOW the answers of this test set ! Aim: You want to see the tradeoff you get for using various thresholds on the scores, in terms of FP and FN on the data set.

cbio course, spring 2005, Hebrew University ROC curves ROC = Receiver Operator Curve FPR = False Positive Rate = Empirical pvalue = FP/ (FP + TN) = FP / ( “real negatives”) = “What ratio of the bad ones we pass” Sensitivity = TP/ (TP+FN) = “What ratio of the true ones we capture” 0% 100% Best Performance

cbio course, spring 2005, Hebrew University ROC curves NOTE: Each point in the ROC matches a certain threshold over the method’s scores Each method gives a different Curve We can now compare methods performance: At a certain point on the graph Via the total size of area under the graph FPR = False Positive Rate = Empirical pvalue = FP/ (FP + TN) = FP / ( “real negatives”) = “What ratio of the bad ones we pass” Sensitivity = TP/ (TP+FN) = “What ratio of the true ones we capture” 0% 100% Best Performance 2%

cbio course, spring 2005, Hebrew University FDR A less stringent statistical crieteria is FDR (False Detection Rate), suggested by Benjamini & Huchberg (95). Main idea: control the rate of false reports in the total amount of reports you give. i.e. : FDR 5% means that the expected ratio of false detections in your total number of detections is going to be 5% E [ FP/(FP+TP)] = 0.05 The expectation (E) is done over the total distribution which may contain both True and False hypothesis. When there are no True hypothesis then FDR is the same as Bonf. But if not, it will give you more power to the test…

cbio course, spring 2005, Hebrew University EVD for our problem In the EVD approach, we are interested in the question: “what is the probability of me seeing such a good a score as S*, only from matches to non related sequences, if I tried N such matches?” Compute: If we want to guarantee the P{ Max(S 1 … S N ) >= S*} < 0.05 where S i are scores of matches against non related sequences sampled i.i.d, then: [1 – pvalue(S*)] N > 0.95 i.e Compare for N = 10 the result 0.005116 to the Bonf. Corrected pvalue: 0.05/N = 0.005 For N = 20: 0.00256 vs. 0.002 etc…

cbio course, spring 2005, Hebrew University Back to our Toy Problem Assume the data we need to handle came from two sources, as in the Bayesian approach: R – no related sequences, p(a,a) = 0.25 M – related sequences p(a,a) = 0.4 p(a,b) = 0.2 Delta scoring matrix i.e. S(a,a) = 1 S(a,b) = 0

cbio course, spring 2005, Hebrew University Finish with a Thought… In our toy problem – what’s the relation between the graph of the last slide and the ROC curve we talked about? How does the relative amount of samples from M and R in our data set effects the ROC? How should the total distribution over the scores look like?