Comp. Genomics Recitation 3 The statistics of database searching.

Comp. Genomics Recitation 3 The statistics of database searching

Substitution matrices Random model(R): x and y appear at random Match model(M): x and y are derived from a common ancestor The odds ratio: Using the log of the odds ratio gives an additive scoring system

Substitution matrices PAM (Point Accepted Mutation, by Margaret Dayhoff ‘78) matrices extrapolates amino acids substitution rates of evolutionary distant proteins from observed substitutions of evolutionary close proteins PAM1 gives the frequencies of changes for proteins that differ at 1% of the sites PAM250=(PAM1) 250 give the frequencies of changes for proteins that differ in 250% of sites (multiple substitutions per site are possible) This calculation assumes that changes are independent of each other and past changes

Substitution matrices BLOSUM (Blocks of Amino Acid Substitution Matrix) matrices derive substitution rates from proteins of various evolutionary distances Change rates were derived from alignment of common patterns E.g., Blosum60 was derived from patterns that were 60% similar No assumption is made regarding the evolutionary relationship between the proteins from which the patterns are taken P ij is the probability of two amino acids i and j replacing each other in a homologous sequence, and q i and q j are the background probabilities

Exercise The following substitution matrix is given: What is the average score per nucleotide pair? Assume each nucleotide appears with equal probability. GACT 01T 10C 01 A 10 G

Solution What happens if the average score is not negative? What is the chance that in a pair of evolutionary related sequences T is replaced by C?

Solution S(i,j)= P ij =1/16 Does the optimal alignment change if we multiply the matrix by a constant C?

How significant is my score? Create a mathematical model of the alignment of random sequences, and derive the score distribution analytically Use simulation to estimate the score distribution of the alignment of: Generated sequences Real sequences that are known to be non- homologous, or that are shuffled

Empirical score distribution The picture shows a distribution of scores from a real database search using BLAST. This distribution contains scores from non-homologous and homologous pairs. High scores from homology.

Empirical null score distribution This distribution is similar to the previous one, but generated using a randomized sequence database.

Statistical analysis What is a null hypothesis? An assumption that may be contradicted (but not validated) by the data. The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis.

Overview What is a p-value? The probability of observing an effect as strong or stronger than you observed, given the null hypothesis. I.e., “How likely is this effect to occur by chance?” Pr(x > S|null)

Extreme value distribution What is the name of the distribution created by local alignment scores, and what does it look like? Extreme value distribution, or Gumbel distribution. It looks similar to a normal distribution, but it has a larger tail on the right. Ungapped local alignment max scores follow this distribution, and gapped alignment scores seem to follow it

Extreme value distribution The expected number of matches with a score≥ S is given by the formula: For ungapped local alignments, the parameters can be calculated directly from the substitution matrix scores and the lengths of the aligned sequences where m,n are sequence lengths, λ is a scaling parameter for the scoring system and K is a scaling parameter for the search space (e.g. accounts for overlaps) (E-value)

Exercise Assuming that the probability for seeing x matches with score ≥S is given by the Poisson distribution: where μ is the mean, what is the p-value of the score S?

Solution The p-value is the probability of seeing the score ≥S by chance The probability of not seeing the score by chance is The probability of seeing the score by chance is

What p-value is significant? The most common thresholds are 0.01 and 0.05. A threshold of 0.05 means you are 95% sure that the result is significant. Is 95% enough? It depends upon the cost associated with making a mistake. Examples of costs: Doing expensive wet lab validation. Making clinical treatment decisions. Misleading the scientific community.

Database searching A database contains many sequences Problem: multiple comparisons Increase chance for random high score

Multiple testing Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assume that all of the observations are explainable by the null hypothesis. What is the chance that at least one of the observations will receive a p-value less than 0.05?

Multiple testing Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assuming that all of the observations are explainable by the null hypothesis, what is the chance that at least one of the observations will receive a p-value less than 0.05? Pr(making a mistake) = 0.05 Pr(not making a mistake) = 0.95 Pr(not making any mistake) = 0.95 20 = 0.358 Pr(making at least one mistake) = 1 - 0.358 = 0.642 There is a 64.2% chance of making at least one mistake.

Bonferroni correction Divide the desired p-value threshold by the number of tests performed. For the previous example, 0.05 / 20 = 0.0025. Pr(making a mistake) = 0.0025 Pr(not making a mistake) = 0.9975 Pr(not making any mistake) = 0.9975 20 = 0.9512 Pr(making at least one mistake) = 1 - 0.9512 = 0.0488

Corrections for Database searching Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p- value threshold should you use? What is the hidden assumption here?

Example Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use? Say that you want to use a conservative p-value of 0.001. Recall that you would observe such a p-value by chance approximately every 1000 times in a random database. A Bonferroni correction would suggest using a p-value threshold of 0.001 / 1,000,000 = 0.000000001 = 10 -9.

Exercise A sequence of size m is queried against a database. The database contains k sequences of lengths n 1,n 2,…,n k. The E-value for alignment i is. What is the query E-value for score S? If we know that the p-value for alignment i is P i, what is the query p-value?

Solution Let A i denote the number of matches in alignment i that scored ≥ S E(A 1 +A 2 +…+A k )=E(A 1 )+E(A 2 )+…+E(A k )=

Solution The probability of seeing a match with score ≥S by chance in the entire database is

Comp. Genomics Recitation 3 The statistics of database searching.

Similar presentations

Presentation on theme: "Comp. Genomics Recitation 3 The statistics of database searching."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Comp. Genomics Recitation 3 The statistics of database searching.

Similar presentations

Presentation on theme: "Comp. Genomics Recitation 3 The statistics of database searching."— Presentation transcript:

Similar presentations

About project

Feedback