Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.

Review How to compute and use a score matrix. log-odds of sum-of-pair counts vs. expected counts. Why gap scores should be affine.

Are these proteins related?

SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY
           L P    W L   Y N    Y C     L
SEQ 2: QFFPLMPPAPYWILATDYENLPLVYSCTTFFWLF     NO (score = 9)

SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY
           L P    W LDATYKNYA  Y C     L
SEQ 2: QFFPLMPPAPYWILDATYKNYALVYSCTTFFWLF     PROBABLY (score = 15)

SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY
       RVV L PS   W LDATYKNYA  Y CDVTYKL
SEQ 2: RVVPLMPSAPYWILDATYKNYALVYSCDVTYKLF     YES (score = 24)

Significance of scores Two sequences (HPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKT and LENENQGKCTIAEYKYDGKKASVYNSFVSNGVKE) go into an alignment algorithm, which returns a score (here, 45). Low score = unrelated; high score = related. How high is high enough?

The null hypothesis We are interested in characterizing the distribution of scores from pairwise sequence alignments. We measure how surprising a given score is, assuming that the two sequences are not related. This assumption is called the null hypothesis. The purpose of most statistical tests is to determine whether the observed result(s) provide a reason to reject the null hypothesis.

Sequence similarity score distribution Search a randomly generated database of sequences using a given query sequence. What will be the form of the resulting distribution of pairwise sequence comparison scores? (Plot axes: sequence comparison score vs. frequency.)

Empirical score distribution This shows the distribution of scores from a real database search using BLAST. This distribution contains scores from both related and unrelated pairwise alignments; the high scores on the right come from related sequences.

Empirical null score distribution This distribution is similar to the previous one, but generated using a randomized sequence database (each sequence shuffled). It contains 1,685 scores; notice that the score scale is shorter here.

Computing a p-value The probability of observing a score ≥ X is the area under the curve to the right of X. This probability is called a p-value. p-value = Pr(data|null) (read as the probability of the data given the null hypothesis). E.g. out of 1,685 scores, 28 received a score of 20 or better. Thus, the p-value associated with a score of 20 is approximately 28/1685 ≈ 0.017.
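The empirical calculation above is just a counting loop. A minimal Python sketch (the short score list here is made up for illustration; it stands in for the 1,685 scores from the shuffled database):

```python
# Empirical p-value: the fraction of null scores at least as large as
# the observed score.  The score list below is made up for illustration.
def empirical_p_value(null_scores, observed):
    hits = sum(1 for s in null_scores if s >= observed)
    return hits / len(null_scores)

null_scores = [5, 8, 9, 11, 12, 12, 13, 15, 17, 20]
print(empirical_p_value(null_scores, 15))  # 3 of the 10 scores are >= 15, so 0.3
```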

Problems with empirical distributions We are interested in very small probabilities. These are computed from the tail of the null distribution. Estimating a distribution with an accurate tail is feasible but computationally very expensive because we have to compute a very large number of scores.

A solution Characterize the form of the distribution mathematically. Fit the parameters of the distribution empirically, or compute them analytically. Use the resulting distribution to compute accurate p-values. First solved by Karlin and Altschul.

Extreme value distribution This distribution is roughly normal near the peak, but has a longer tail on the right.

Computing a p-value The probability of observing a score >=4 is the area under the curve to the right of 4. p-value = Pr(data|null)

Unscaled EVD equation P(S ≥ x) = 1 − exp(−e^(−x)). Compute this value for x = 4.

Computing a p-value For x = 4: P(S ≥ 4) = 1 − exp(−e^(−4)) ≈ 0.018.
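This computation can be checked in a couple of lines, using the standard Gumbel (extreme value) tail formula P(S ≥ x) = 1 − exp(−e^(−x)):

```python
import math

# Unscaled Gumbel (extreme value) tail probability:
# P(S >= x) = 1 - exp(-e^(-x))
def evd_p_value(x):
    return 1.0 - math.exp(-math.exp(-x))

print(evd_p_value(4))  # ~0.018
```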

Scaling the EVD An EVD derived from, e.g., the Smith-Waterman algorithm with the BLOSUM62 matrix and a given gap penalty has a characteristic mode μ and scale parameter λ. μ and λ depend on the size of the query, the size of the target database, the substitution matrix and the gap penalties. Scaled: P(S ≥ x) = 1 − exp(−e^(−λ(x − μ))).

An example You run BLAST and get a score of 45. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = … What is the p-value associated with 45? BLAST has precomputed values of μ and λ for all common matrices and gap penalties (and the program scales them for the size of the query and database).
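The worked example is a direct plug-in to the scaled formula. Since the λ value from the slide did not survive transcription, the sketch below uses λ = 0.5 as a made-up placeholder, not the real fitted value:

```python
import math

# Scaled EVD tail probability: P(S >= x) = 1 - exp(-e^(-lam * (x - mu))).
# mu = 25 comes from the slide; lam = 0.5 is a made-up placeholder.
def scaled_evd_p_value(x, mu, lam):
    # -expm1(-y) equals 1 - exp(-y), but avoids catastrophic cancellation
    # when the p-value is tiny (exactly the regime we care about).
    return -math.expm1(-math.exp(-lam * (x - mu)))

print(scaled_evd_p_value(45, mu=25, lam=0.5))  # ~4.5e-05
```

With μ = 0 and λ = 1 the function reduces to the unscaled EVD from the previous slide.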

What p-value is significant? The most common thresholds are 0.01 and 0.05. A threshold of 0.05 means accepting a 5% chance of falsely calling a null result significant. Is 95% confidence enough? It depends upon the cost associated with making a mistake. Examples of costs: –Doing extensive wet lab validation (expensive) –Making clinical treatment decisions (very expensive) –Misleading the scientific community (very expensive) –Doing further simple computational tests (cheap) –Telling your grandmother (very cheap)

Multiple testing Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations (e.g. 20 different BLAST runs). Assume that all of the observations are explainable by the null hypothesis. What is the chance that at least one of the observations will receive a p-value less than 0.05? (Answer: 1 − 0.95^20 ≈ 0.64, i.e. about a 64% chance.)
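The question has a simple closed form: the complement of "every test comes out non-significant". A quick check in Python:

```python
# Chance that at least one of n independent null tests gives p < alpha:
# 1 minus the probability that all n tests stay above the threshold.
def prob_at_least_one(alpha, n_tests):
    return 1.0 - (1.0 - alpha) ** n_tests

print(prob_at_least_one(0.05, 20))  # ~0.64
```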

Bonferroni correction Assume that individual tests are independent. Divide the desired p-value threshold by the number of tests performed (e.g. 0.05 / 20 = 0.0025).

Database searching Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences (i.e. you are doing 10^6 pairwise tests). What p-value threshold should you use? Say that you want to use a conservative p-value of 0.001. Recall that you would observe such a p-value by chance approximately once every 1000 times in a random database. A Bonferroni correction would suggest using a p-value threshold of 0.001 / 10^6 = 10^-9.
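The Bonferroni arithmetic, using the slide's numbers:

```python
# Bonferroni correction: divide the desired overall p-value threshold
# by the number of tests performed.
def bonferroni_threshold(alpha, n_tests):
    return alpha / n_tests

print(bonferroni_threshold(0.001, 10**6))  # ~1e-09
```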

E-values A p-value is the probability of making a mistake. An E-value is the expected number of times that the given score would appear in a random database of the given size. One simple way to compute the E-value is to multiply the p-value by the size of the database. Thus, for a p-value of 0.001 and a database of 1,000,000 sequences, the corresponding E-value is 0.001 × 1,000,000 = 1,000. (BLAST actually calculates E-values in a more complex way, but they mean the same thing.)
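The simple conversion described above, as code (BLAST's actual formula is more involved, as the slide notes):

```python
# Simple E-value approximation: expected number of equally good hits
# in a random database of this size = p-value * database size.
def e_value(p_value, db_size):
    return p_value * db_size

print(e_value(0.001, 1_000_000))  # ~1000
```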

Summary A distribution plots the frequencies of types of observation. The area under the distribution curve is 1. Most statistical tests compare observed data to the expected result according to a null hypothesis. Sequence similarity scores follow an extreme value distribution, which is characterized by a long tail. The p-value associated with a score is the area under the curve to the right of that score. Selecting a significance threshold requires evaluating the cost of making a mistake. Bonferroni correction: Divide the desired p-value threshold by the number of statistical tests performed. The E-value is the expected number of times that a given score would appear in a random database of the given size.