
Comp. Genomics Recitation 2 (week 3) 19/3/09. Outline Finding repeats Branch & Bound for MSA Multiple hypotheses testing.


1 Comp. Genomics Recitation 2 (week 3) 19/3/09

2 Outline Finding repeats Branch & Bound for MSA Multiple hypotheses testing

3 Exercise: Finding repeats
Basic objective: find a pair of subsequences within S with maximum similarity.
Simple (albeit wrong) idea: find an optimal alignment of S with itself! (Why is this wrong? S aligned against itself trivially scores perfectly along the main diagonal.)
But using local alignment is still a good idea.

4 Variant #1
First specification: the two subsequences may overlap.
Solution: change the local alignment algorithm:
- Compute only the upper-triangular submatrix (V(i,j), where j > i).
- Set diagonal values to 0.
Complexity: O(n^2) time and O(n) space.
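The upper-triangular trick can be sketched in Python as follows. This is an illustrative sketch, not the O(n)-space version the slide mentions (it keeps the full matrix for clarity), and the unit match/mismatch/gap scores are arbitrary assumptions:

```python
def repeat_score(s, match=1, mismatch=-1, gap=-1):
    """Best local alignment between two (possibly overlapping)
    subsequences of s: Smith-Waterman of s against itself,
    restricted to the upper triangle with a zeroed diagonal."""
    n = len(s)
    V = [[0] * (n + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):   # upper triangle only: j > i
            sub = match if s[i - 1] == s[j - 1] else mismatch
            V[i][j] = max(0,
                          V[i - 1][j - 1] + sub,
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)
            best = max(best, V[i][j])
    return best
```

Because cells with j <= i are never filled, the trivial self-alignment along the diagonal can never contribute to the score.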

5 Variant #2
Second specification: the two subsequences may not overlap.
Solution: absence of overlap means that some k exists such that one subsequence is in S[1..k] and the other in S[k+1..n]:
- Compute local alignments between S[1..k] and S[k+1..n] for all 1 <= k < n.
- Pick the highest-scoring alignment.
Complexity: O(n^3) time and O(n) space.
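The split-point loop might look like this in Python (again a sketch with assumed unit scores; `local_score` is a plain Smith-Waterman included here so the example is self-contained):

```python
def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Plain Smith-Waterman score between strings a and b."""
    V = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            V[i][j] = max(0, V[i - 1][j - 1] + sub,
                          V[i - 1][j] + gap, V[i][j - 1] + gap)
            best = max(best, V[i][j])
    return best

def nonoverlapping_repeat(s):
    """Try every split point k; the two repeats must lie on opposite
    sides of the split, so they cannot overlap."""
    return max(local_score(s[:k], s[k:]) for k in range(1, len(s)))
```

Note how overlap changes the answer: for "aaaa" the overlapping variant scores 3, but no split point allows more than two non-overlapping matches.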

6 Second Variant, Pictorially

7 Third Variant
Specification: the two subsequences must be consecutive (a tandem repeat).
Solution: similar to variant #2, but somewhat "ends-free": seek a global alignment between S[1..k] and S[k+1..n] with
- no penalty for gaps at the beginning of S[1..k],
- no penalty for gaps at the end of S[k+1..n].
Complexity: O(n^3) time and O(n) space.
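A sketch of this variant, under the tandem interpretation that the free ends let the alignment skip the flanking start of the prefix and the flanking end of the suffix (so the two aligned copies meet back to back at position k; scores are again assumed unit values):

```python
def tandem_repeat_score(s, match=1, mismatch=-1, gap=-1):
    """Ends-free global alignment of S[1..k] vs S[k+1..n] for each k:
    the start of the prefix and the end of the suffix may be skipped
    for free, so the two aligned copies sit adjacent at the split."""
    best = float("-inf")
    for k in range(1, len(s)):
        a, b = s[:k], s[k:]
        D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        # D[i][0] stays 0: skipping the start of the prefix is free
        for j in range(1, len(b) + 1):
            D[0][j] = D[0][j - 1] + gap
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                sub = match if a[i - 1] == b[j - 1] else mismatch
                D[i][j] = max(D[i - 1][j - 1] + sub,
                              D[i - 1][j] + gap, D[i][j - 1] + gap)
        # skipping the end of the suffix is free: best over the last row
        best = max(best, max(D[len(a)]))
    return best
```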

8 Variant #3

9 Branch and Bound algorithms
A general technique for hard problems. Idea:
- Compute some rough bound on the score.
- Start exploring the complete solution space.
- Do not explore "regions" that definitely score worse than the bound.
- Maybe: improve the bound as you go along.
Guarantees an optimal solution, but does not guarantee the runtime. A good bound is essential for good performance.

10 Slightly more formal
Branch: enumerate all possible next steps from the current partial solution.
Bound: if a partial solution violates some constraint, e.g., an upper bound on cost, prune the branch (don't follow it further).
Backtracking: once a branch is pruned, move back to the previous partial solution and try another branch (depth-first branch-and-bound).

11 Example: TSP
Traveling salesperson problem:
- Input: a complete directed graph with weights on the edges.
- Tour: a directed cycle visiting every node exactly once.
- Problem: find a tour of minimum cost.
NP-hard (reduction from directed Hamiltonian cycle), and difficult to approximate. A good example of enumeration.
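A depth-first branch-and-bound for TSP can be sketched as follows. The lower bound used here (cost so far, plus each node still to be left paying at least its cheapest outgoing edge) is one simple choice among many; a tighter bound prunes more:

```python
import math

def tsp_branch_and_bound(w):
    """Exact TSP on weight matrix w[i][j] by depth-first branch and bound."""
    n = len(w)
    cheapest_out = [min(w[i][j] for j in range(n) if j != i)
                    for i in range(n)]
    best = math.inf

    def dfs(node, visited, cost):
        nonlocal best
        if len(visited) == n:                    # close the tour
            best = min(best, cost + w[node][0])
            return
        # lower bound: every node not yet left must still pay at least
        # its cheapest outgoing edge
        bound = cost + cheapest_out[node] + sum(
            cheapest_out[v] for v in range(n) if v not in visited)
        if bound >= best:                        # prune: cannot beat incumbent
            return
        for nxt in range(n):                     # branch on the next city
            if nxt not in visited:
                dfs(nxt, visited | {nxt}, cost + w[node][nxt])

    dfs(0, {0}, 0)
    return best
```

The first complete tour found becomes the incumbent bound, and the bound tightens as better tours are discovered, exactly as the slide describes.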

12 Example: TSP “Find me a tour of 10 cities in 2500 miles”

13 Reminder: The hypercube
The dynamic programming extension to multiple sequences fills a k-dimensional hypercube, exponential in the number of sequences. Source: Ralf Zimmer slides.

14 B&B for MSA
Carrillo and Lipman (1988); see the scribe notes for full details.
The bound: every multiple alignment of k sequences induces k(k-1)/2 pairwise alignments, and none of these can score better than the corresponding optimal pairwise alignment (each induced alignment is just one feasible pairwise alignment).
The branching: compute the complete hypercube, simply ignoring regions that violate the bound.

15 B&B for MSA

16 (Some) gory details
Requires a "forward" recurrence: once cell D(i,j,k) is computed, its value is "transmitted forward" to the next cells.
The cube's cells are stored in a queue, and the next cell to compute is picked from the top of the queue.
It is (easy) to show that once a cell reaches the top of the queue, it has received "transmissions" from all its relevant neighbors.

17 (Some) gory details
d_{a,b}(i,j) is the optimal distance between the suffixes S_a[i..n] and S_b[j..n]. (Easy to compute? Yes: run the pairwise DP backwards from the string ends.)
Assume we found some MSA with score z (e.g., by an iterative alignment).
Key lemma: if D(i,j,k) + d_{1,2}(i,j) + d_{1,3}(i,k) + d_{2,3}(j,k) > z, then D(i,j,k) is not on any optimal path, and thus we need not send it forward.
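The two ingredients of the lemma can be sketched in a distance-minimization setting (function names and cost parameters here are illustrative, not Carrillo-Lipman's original notation):

```python
def suffix_distances(s, t, sub, gap):
    """d[i][j] = optimal alignment cost of suffixes s[i:], t[j:],
    computed by running the usual pairwise DP backwards from the ends."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        d[i][m] = d[i + 1][m] + gap          # s-suffix vs empty string
    for j in range(m - 1, -1, -1):
        d[n][j] = d[n][j + 1] + gap          # empty string vs t-suffix
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            d[i][j] = min(d[i + 1][j + 1] + sub(s[i], t[j]),
                          d[i + 1][j] + gap,
                          d[i][j + 1] + gap)
    return d

def prunable(D_ijk, d12, d13, d23, z):
    """The key lemma's test: if the partial cost plus the optimal
    pairwise suffix costs already exceeds a known MSA cost z, the cell
    cannot lie on an optimal path and need not be sent forward."""
    return D_ijk + d12 + d13 + d23 > z
```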

18 (Some) gory details
In practice, Carrillo-Lipman can align 6 sequences of about 200 characters in a "practical" amount of time, but it is probably impractical for larger numbers of sequences. Is optimality that important?

19 Review
What is a null hypothesis?
- A statistician's way of characterizing "chance."
- Generally, a mathematical model of randomness with respect to a particular set of observations.
- The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis.

20 Empirical score distribution
The picture shows a distribution of scores from a real database search using BLAST. This distribution contains scores from both non-homologous and homologous pairs; the high scores come from homologous pairs.

21 Empirical null score distribution This distribution is similar to the previous one, but generated using a randomized sequence database.

22 Review
What is a p-value?
- The probability of observing an effect as strong or stronger than the one you observed, given the null hypothesis.
- I.e., "How likely is this effect to occur by chance?"
- Pr(x > S | null)

23 Review
If BLAST returns a score of 75, how would you compute the corresponding p-value?
- First, compute many BLAST scores using random queries and a random database.
- Summarize those scores into a distribution.
- Compute the area of the distribution to the right of the observed score (more details to come).
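The steps above can be sketched directly. The null scores here are a toy stand-in (Gaussian random numbers); in practice they would come from actual randomized BLAST searches:

```python
import random

def empirical_pvalue(observed, null_scores):
    """Right-tail area of the empirical null distribution: the fraction
    of null scores at least as extreme as the observation. The +1 terms
    are the usual correction so the estimate is never exactly zero."""
    hits = sum(1 for s in null_scores if s >= observed)
    return (hits + 1) / (len(null_scores) + 1)

# toy stand-in for "many scores from random queries against a random
# database" (assumption: real null scores come from randomized searches)
random.seed(0)
null = [random.gauss(50, 10) for _ in range(10000)]
p = empirical_pvalue(75, null)
```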

24 Review
What is the name of the distribution created by sequence similarity scores, and what does it look like?
- The extreme value distribution, or Gumbel distribution.
- It looks similar to a normal distribution, but it has a heavier tail on the right.

25 Extreme value distribution
The distribution of the maximum of a series of independent and identically distributed (i.i.d.) variables.
Actually a family of distributions (Fréchet, Weibull, and Gumbel), parameterized by shape, location, and scale.
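For the Gumbel member of the family, the right-tail probability has a simple closed form, P(X > x) = 1 - exp(-exp(-(x - mu)/beta)); a minimal sketch with assumed location mu and scale beta:

```python
import math

def gumbel_sf(x, mu=0.0, beta=1.0):
    """Survival function P(X > x) of a Gumbel (extreme value)
    distribution with location mu and scale beta."""
    return 1.0 - math.exp(-math.exp(-(x - mu) / beta))
```

This is the tail area that an analytic p-value computation would use in place of the empirical right-tail area.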

26 What p-value is significant?
The most common thresholds are 0.01 and 0.05. A threshold of 0.05 means that, if the null hypothesis were true, a result this extreme would occur only 5% of the time.
Is that enough? It depends upon the cost associated with making a mistake. Examples of costs:
- Doing expensive wet lab validation.
- Making clinical treatment decisions.
- Misleading the scientific community.
Most sequence analysis uses more stringent thresholds because the p-values are not very accurate.

27 Multiple testing Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assume that all of the observations are explainable by the null hypothesis. What is the chance that at least one of the observations will receive a p-value less than 0.05?

28 Multiple testing
Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assuming that all of the observations are explainable by the null hypothesis, what is the chance that at least one of the observations will receive a p-value less than 0.05?
- Pr(making a mistake) = 0.05
- Pr(not making a mistake) = 0.95
- Pr(not making any mistake) = 0.95^20 = 0.358
- Pr(making at least one mistake) = 1 - 0.358 = 0.642
There is a 64.2% chance of making at least one mistake.

29 Bonferroni correction
Assume that individual tests are independent. (Is this a reasonable assumption?)
Divide the desired p-value threshold by the number of tests performed. For the previous example, 0.05 / 20 = 0.0025.
- Pr(making a mistake) = 0.0025
- Pr(not making a mistake) = 0.9975
- Pr(not making any mistake) = 0.9975^20 = 0.9512
- Pr(making at least one mistake) = 1 - 0.9512 = 0.0488
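Both family-wise error calculations above can be checked in two lines:

```python
alpha, n = 0.05, 20

# uncorrected: chance of at least one false positive across 20 null tests
fwer = 1 - (1 - alpha) ** n            # about 0.642

# with the Bonferroni-corrected threshold alpha / n
fwer_bonf = 1 - (1 - alpha / n) ** n   # about 0.0488, just under alpha
```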

30 Database searching
Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use?

31 Database searching
Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use?
- Say that you want to use a conservative p-value of 0.001.
- Recall that you would observe such a p-value by chance approximately once in every 1000 searches of a random database.
- A Bonferroni correction would suggest using a p-value threshold of 0.001 / 1,000,000 = 0.000000001 = 10^-9.

32 E-values
A p-value is the probability of seeing such a score by chance in a single test. The E-value is a version of the p-value that is corrected for multiple tests; it is essentially the converse of the Bonferroni correction.
The E-value is computed by multiplying the p-value by the size of the database. It is the expected number of times that the given score would appear in a random database of the given size.
Thus, for a p-value of 0.001 and a database of 1,000,000 sequences, the corresponding E-value is 0.001 × 1,000,000 = 1,000.

33 E-value vs. Bonferroni
Among n repetitions of a test, you observe a particular p-value p, and you want a significance threshold α.
- Bonferroni: divide the significance threshold by n; declare significance if p < α/n.
- E-value: multiply the p-value by n; declare significance if pn < α.
* BLAST actually calculates E-values in a slightly more complex way.
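The two rules are the same decision stated two ways (pn < α if and only if p < α/n); a tiny sketch with illustrative function names:

```python
def significant_bonferroni(p, n, alpha=0.05):
    """Compare p against the Bonferroni-corrected threshold."""
    return p < alpha / n

def significant_evalue(p, n, alpha=0.05):
    """Compare the E-value (p times the number of tests) against alpha."""
    return p * n < alpha
```

For any p, n, and alpha the two functions return the same answer, which is why the slide calls the E-value the converse of the Bonferroni correction.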


37 False discovery rate
The false discovery rate (FDR) is the proportion of examples above a given position in the ranked list that are expected to be false positives.
Example: above the threshold there are 5 FP and 13 TP (below it, 33 TN and 5 FN).
FDR = FP / (FP + TP) = 5/18 = 27.8%

38 Bonferroni vs. FDR
Bonferroni controls the family-wise error rate, i.e., the probability of at least one false positive.
The FDR is the expected proportion of false positives among the examples that are flagged as positive.

39 Controlling the FDR
Order the unadjusted p-values p_1 <= p_2 <= ... <= p_m.
To control the FDR at level α, find the largest j* such that p_{j*} <= (j*/m)·α, and reject the null hypothesis for j = 1, ..., j*.
This approach is conservative if many examples are true positives. (Benjamini & Hochberg, 1995)
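The Benjamini-Hochberg step-up procedure, as a short Python sketch:

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the set of indices whose null hypotheses are rejected
    when controlling the FDR at level alpha: sort p-values ascending,
    find the largest rank j with p_(j) <= (j/m)*alpha, reject 1..j."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    j_star = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            j_star = rank                 # largest qualifying rank so far
    return set(order[:j_star])
```

Note the step-up behavior: a p-value above its own threshold can still be rejected if some larger p-value qualifies, since j* is the largest qualifying rank.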

40 Q-value software http://faculty.washington.edu/~jstorey/qvalue/

41 Significance Summary
- Selecting a significance threshold requires evaluating the cost of making a mistake.
- Bonferroni correction: divide the desired p-value threshold by the number of statistical tests performed.
- The E-value is the expected number of times that the given score would appear in a random database of the given size.

