Cbio course, spring 2005, Hebrew University (Alignment) Score Statistics.

Presentation transcript:

cbio course, spring 2005, Hebrew University (Alignment) Score Statistics

Motivation
Reminder: the basic motivation is to check whether two sequences are “related” or not. We align the two sequences and get a score s that measures how similar they are. Given s, do we accept the hypothesis that the two are related, or reject it? How high should s be before we “believe” they are related?

Motivation (2)
We need a “rigorous way” to decide on a threshold s* such that for s > s* we call the sequences “related”. Note that s* should really be s*(n,m), where n and m are the lengths of the two sequences aligned. When we match a sequence x against a database of N (N >> 1) sequences, we also need to account for the fact that we might see high scores “just by chance”. We can make two kinds of mistakes in our calls: false positives (FP) and false negatives (FN), so we want our “rigorous way” to control both FP and FN mistakes.

Motivation (3)
The problem of assigning statistical significance to scores, while controlling our FP and FN mistakes, is of general interest. Examples: similarity between a protein sequence and a profile HMM, log-ratio scores when searching for DNA sequence motifs, and more. The methods we develop now will be of general use.

Reminder
In the last lesson we talked about two ways to analyze alignment scores and their significance: the Bayesian approach and the classical EVD (extreme value distribution) approach. We reviewed how the number of FP mistakes can be controlled using each of these approaches, and we reviewed the Karlin & Altschul (1990) results.

Review: First Approach – Bayesian
Assume we have two states in our world: M (Model = related sequences) and R (Random = unrelated sequences). Given a fixed alignment of two sequences (x, y), we ask: from which state did it come, M or R?
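The equations referred to on this slide (the “Where:” and “We saw:” parts) did not come through in the transcript; a standard form of the posterior and the log-odds score, consistent with the description above, would be:

P(M | x, y) = P(x, y | M) P(M) / [ P(x, y | M) P(M) + P(x, y | R) P(R) ]
S(x, y) = log [ P(x, y | M) / P(x, y | R) ]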

Review: Bayesian Approach (cont.)
We saw that in order to control the expected number of false identifications when testing scores that came from R, the threshold S* over the scores must satisfy S* ~ log(number of trials · K), where the number of trials for scoring a sequence of length m in local alignment against N sequences of length n is n·m·N, and K in [0, 1] is a correlation factor compensating for the fact that the trials are correlated.
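As a concrete sketch of where the logarithmic threshold comes from: in the Karlin–Altschul framework the expected number of chance hits with score at least S is E = K·m·n·N·exp(−λS), so requiring E <= E0 gives S* = log(K·m·n·N / E0) / λ. The snippet below assumes λ (lambda) and K have already been estimated for the scoring system; the numbers in the example are purely illustrative.

import math

def expected_false_hits(score, m, n, N, lam, K):
    # Karlin-Altschul expected number of chance hits (E-value) for a
    # local-alignment score `score`, query length m, database sequence
    # length n, and N database sequences; lam and K are scoring-system parameters.
    return K * m * n * N * math.exp(-lam * score)

def score_threshold(E0, m, n, N, lam, K):
    # Smallest score S* whose expected number of chance hits is <= E0;
    # note that S* grows like log(K * m * n * N), as stated on the slide.
    return math.log(K * m * n * N / E0) / lam

# Example with made-up parameters (lambda and K depend on the scoring matrix):
print(score_threshold(E0=0.05, m=300, n=350, N=10_000, lam=0.27, K=0.13))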

Review: EVD Approach
In the EVD approach we are interested in the question: “given a score s for aligning x and y, if s came from the distribution of scores for unrelated sequences (like R in the Bayesian approach), what is the probability of seeing a score as good as s by chance, simply because I tried so many matches of sequences against x?” Here R is the null hypothesis we are testing against. If P(score >= s | we tried N scores) < threshold (say 0.01), then we “reject” the null hypothesis R. NOTE: there is no “second” hypothesis here. We are guarding against type I errors (FP); no control or assumptions are made about FN. This setting is appropriate for the problem we have at hand (database search).

Toy Problem
Let s, t be two randomly chosen DNA sequences of length n, sampled from the uniform distribution over the DNA alphabet. Align s against t with no gaps (i.e., s[1] is aligned to t[1], and so on until s[n] is aligned to t[n]). What is the probability that there are k matches (not necessarily contiguous) between s and t? Suppose you are a researcher and you have two main hypotheses: either these two sequences are totally unrelated, or there was a common ancestor to both of them (there is no indel option here). How would you use the number of matches to decide between the two options and attach a statistical confidence measure to this decision?
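Under the unrelated-sequences hypothesis each position matches independently with probability 1/4, so the number of matches is Binomial(n, 0.25). A minimal sketch of the exact probability and the corresponding p-value, assuming SciPy is available:

from scipy.stats import binom

n, p = 100, 0.25          # sequence length, per-position match probability under the null

def prob_k_matches(k, n=n, p=p):
    # Exact probability of observing exactly k matches: Binomial(n, p).
    return binom.pmf(k, n, p)

def pvalue(k, n=n, p=p):
    # P(#matches >= k) under the null of unrelated sequences.
    return binom.sf(k - 1, n, p)

print(prob_k_matches(30))   # P(exactly 30 matches out of 100)
print(pvalue(30))           # P(at least 30 matches by chance)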

Empirical Distribution over Scores
[Figure: empirical distribution of the match-count score for p = 0.25, estimated from M = 100K samples, shown for n = 100 and n = 20, with the p-value of score = 30 at n = 100 marked.] NOTE: as in our “real” problem, the p-value of a score depends on n.
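A sketch of how such an empirical distribution and an empirical p-value could be generated by simulation; the sample size M = 100,000 follows the slide, everything else is illustrative:

import numpy as np

rng = np.random.default_rng(0)

def empirical_match_scores(n, M=100_000, p=0.25):
    # Sample M pairs of length-n uniform DNA sequences and return the number
    # of matching positions for each pair; equivalent to M draws from Binomial(n, p).
    return rng.binomial(n, p, size=M)

scores_100 = empirical_match_scores(n=100)
scores_20 = empirical_match_scores(n=20)

# Empirical p-value of observing a score of 30 with n = 100
print((scores_100 >= 30).mean())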

EVD for Our Problem
In the EVD approach we are interested in the question: “what is the probability of seeing a score as good as S*, only from matches to unrelated sequences, if I tried N such matches?” Compute P{ Max(S_1, …, S_N) >= S* }, where the S_i are scores of matches against unrelated sequences sampled i.i.d. If we want to guarantee P{ Max(S_1, …, S_N) >= S* } < 0.05, then we need [1 − pvalue(S*)]^N > 0.95, i.e. pvalue(S*) < 1 − 0.95^(1/N).
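A short sketch of that per-test threshold, with a simulation check that the family-wise bound really comes out at 0.05 (all numbers are illustrative):

import numpy as np

def per_test_threshold(N, family_alpha=0.05):
    # Largest per-test p-value threshold such that
    # P(at least one of N i.i.d. null tests falls below it) < family_alpha.
    return 1.0 - (1.0 - family_alpha) ** (1.0 / N)

# Under the null, p-values are Uniform(0, 1), so the smallest of N p-values
# falls below the threshold exactly when the maximum score exceeds S*.
rng = np.random.default_rng(1)
N = 10
t = per_test_threshold(N)
hits = (rng.uniform(size=(200_000, N)).min(axis=1) < t).mean()
print(t, hits)   # hits should be close to 0.05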

Guarding Against Mistakes & Evaluating Performance
In the EVD approach we kept guarding against FP mistakes. This is very important for tasks where many tests are performed, as in our case of database search. Sometimes we are not able to compute the EVD and we still want to control the FPR. A very strict and simple solution is the Bonferroni-corrected p-value = p-value · N, where N is the number of tests performed (the relation to the union bound is clear). Problem: Bonferroni controls the FWER (family-wise error rate), i.e. the probability of seeing even one mistake (a FP) among the results we report as significant. It does so with essentially no assumptions on the distribution, the relations between the hypotheses tested, etc., and still guarantees control over the FWER. The price to pay is in FN.

Bonferroni Example on Our Case
We saw: if we want to guarantee P{ Max(S_1, …, S_N) >= S* } < 0.05, where the S_i are scores of matches against unrelated sequences sampled i.i.d., then [1 − pvalue(S*)]^N > 0.95, i.e. pvalue(S*) < 1 − 0.95^(1/N). Compare, for N = 10, the result of this equation (≈ 0.00512) to the Bonferroni-corrected p-value threshold 0.05/N = 0.005; for N = 20 it is ≈ 0.00256 vs. 0.0025, and so on. If we used the strict Bonferroni correction for the same guarantee level we wanted, we might have rejected some “good” results.
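A quick sketch reproducing that comparison for a few values of N, using the exact bound from the previous slide against the Bonferroni threshold 0.05/N:

def exact_threshold(N, alpha=0.05):
    # Per-test p-value bound implied by [1 - p]^N > 1 - alpha.
    return 1.0 - (1.0 - alpha) ** (1.0 / N)

def bonferroni_threshold(N, alpha=0.05):
    return alpha / N

for N in (10, 20, 100, 1000):
    print(N, exact_threshold(N), bonferroni_threshold(N))

# The Bonferroni threshold is always slightly smaller (stricter), so it can
# reject borderline “good” results that the exact bound would keep.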

How to Estimate Performance?
Say you have a method with a score (in our case, “method” = scoring local alignments with affine gaps and a scoring matrix Sigma, e.g. Sigma = PAM1). You set a threshold over the scores based on some criterion (e.g. an EVD estimate of the scores of random matches). You want to evaluate your method's performance on some “test” data set, which would typically contain both true and false examples. Assumption: you KNOW the answers for this test set! Aim: you want to see the tradeoff you get from using various thresholds on the scores, in terms of FP and FN on the data set.

ROC Curves
ROC = Receiver Operating Characteristic curve. FPR = False Positive Rate = empirical p-value = FP / (FP + TN) = FP / (“real negatives”) = “what fraction of the bad ones we pass”. Sensitivity = TP / (TP + FN) = “what fraction of the true ones we capture”. [Figure: ROC curve with axes running from 0% to 100%; best performance is toward the top-left corner.]
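A sketch of how the curve is traced: sweep a threshold over the method's scores and record (FPR, sensitivity) at each threshold, assuming the true labels of the test set are known. The toy scores at the end are invented for illustration.

import numpy as np

def roc_points(scores, labels):
    # Return (FPR, sensitivity) pairs obtained by sweeping a threshold over
    # the scores; labels are 1 for true (related) examples, 0 for false ones.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores)          # highest score first
    tp = np.cumsum(labels[order])        # true positives above each threshold
    fp = np.cumsum(1 - labels[order])    # false positives above each threshold
    sens = tp / labels.sum()
    fpr = fp / (len(labels) - labels.sum())
    return fpr, sens

# Toy usage: related pairs tend to score higher than unrelated ones
rng = np.random.default_rng(2)
scores = np.concatenate([rng.normal(5, 2, 200), rng.normal(0, 2, 800)])
labels = np.concatenate([np.ones(200, int), np.zeros(800, int)])
fpr, sens = roc_points(scores, labels)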

ROC Curves (cont.)
NOTE: each point on the ROC curve corresponds to a certain threshold over the method's scores, and each method gives a different curve. We can now compare methods' performance either at a certain point on the graph (e.g. at a fixed FPR such as 2%) or via the total area under the curve. [Figure: same ROC axes (FPR vs. sensitivity, 0% to 100%) as before, with a 2% FPR operating point marked.]
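Comparing methods by the area under the curve can reuse the ROC points from the sketch above, e.g. with a simple trapezoidal integral:

import numpy as np

def auc(fpr, sens):
    # Area under the ROC curve by trapezoidal integration,
    # with the (0, 0) and (1, 1) endpoints added explicitly.
    fpr = np.concatenate([[0.0], fpr, [1.0]])
    sens = np.concatenate([[0.0], sens, [1.0]])
    return np.trapz(sens, fpr)

print(auc(fpr, sens))   # fpr, sens from the previous sketch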

FDR
A less stringent statistical criterion is the FDR (False Discovery Rate), suggested by Benjamini & Hochberg (1995). Main idea: control the rate of false reports among the total reports you give, i.e. an FDR of 5% means that the expected fraction of false detections among your total detections is 5%: E[ FP / (FP + TP) ] = 0.05. The expectation E is taken over the total distribution, which may contain both true and false hypotheses. When there are no true (alternative) hypotheses, controlling the FDR is the same as controlling the FWER, which is what Bonferroni does; but if there are, FDR control gives the test more power.
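A minimal sketch of the Benjamini–Hochberg step-up procedure behind this criterion (not spelled out on the slide): sort the p-values, find the largest rank k with p_(k) <= (k/N)·q, and report everything up to that rank.

import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    # Return a boolean mask of the p-values reported as significant while
    # controlling the FDR at level q (Benjamini-Hochberg step-up procedure).
    p = np.asarray(pvalues, dtype=float)
    N = len(p)
    order = np.argsort(p)
    ranked = p[order]
    below = ranked <= (np.arange(1, N + 1) / N) * q
    keep = np.zeros(N, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank satisfying the bound
        keep[order[:k + 1]] = True         # report all p-values up to that rank
    return keep

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74]))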

Back to Our Toy Problem
Assume the data we need to handle came from two sources, as in the Bayesian approach: R – unrelated sequences, with p(a,a) = 0.25; and M – related sequences, with p(a,a) = 0.4 and p(a,b) = 0.2. We use a delta scoring matrix, i.e. S(a,a) = 1 and S(a,b) = 0.
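With the delta matrix the alignment score is simply the number of matching positions, so its distribution is Binomial(n, 0.25) under R and Binomial(n, 0.4) under M. A sketch comparing the two and the error rates of a match-count threshold (all the numbers below are illustrative):

from scipy.stats import binom

n = 100
p_R, p_M = 0.25, 0.40     # per-position match probability under R and under M

def log_likelihood_ratio(k, n=n):
    # log P(k matches | M) - log P(k matches | R); this is increasing in k,
    # so thresholding the raw score k is equivalent to thresholding the LLR.
    return binom.logpmf(k, n, p_M) - binom.logpmf(k, n, p_R)

def error_rates(k_star, n=n):
    # FP rate (under R) and FN rate (under M) for a given match-count threshold k*.
    fp = binom.sf(k_star - 1, n, p_R)   # P(score >= k* | R)
    fn = binom.cdf(k_star - 1, n, p_M)  # P(score <  k* | M)
    return fp, fn

for k_star in (30, 33, 36):
    print(k_star, error_rates(k_star))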

Finish with a Thought…
In our toy problem, what is the relation between the graph on the last slide and the ROC curve we talked about? How does the relative proportion of samples from M and R in our data set affect the ROC? What should the total distribution over the scores look like?