Presentation is loading. Please wait.

Presentation is loading. Please wait.

A faster reliable algorithm to estimate the p-value of the multinomial llr statistic Uri Keich and Niranjan Nagarajan (Department of Computer Science,

Similar presentations


Presentation on theme: "A faster reliable algorithm to estimate the p-value of the multinomial llr statistic Uri Keich and Niranjan Nagarajan (Department of Computer Science,"— Presentation transcript:

1 A faster reliable algorithm to estimate the p-value of the multinomial llr statistic Uri Keich and Niranjan Nagarajan (Department of Computer Science, Cornell University, Ithaca, NY, USA)

2 Motivation: Hunting for Motifs

3 Is there a motif here? Is the alignment interesting/significant? Which of the columns form a motif? Or for that matter here?

4 Work with column profiles Null hypothesis: multinomial distribution Uncovering significant columns Calculate p-value of the observed score s Test Statistic: Log-likelihood ratio where for a random sample of size

5 and more … Bioinformatics applications: Sequence-Profile and Profile-Profile alignment Locating binding site correlations in DNA Detecting compensatory mutations in proteins and RNA Other applications: Signal Processing Natural Language Processing

6 Direct enumeration Computational Extremes! Constant time approximation Fails with N fixed and s approaching the tail Asymptotic approximation As, possible outcomes Exact results Exponential time and space requirements (O(N K )) even with pruning of the search space!

7 Compute p Q the p.m.f. of the integer valued (Baglivo et al, 1992) where is the mesh size. Runtime is polynomial in N, K and Q Accurate upto the granularity of the lattice The middle path

8 Baglivo et al.’s approach Compute p Q by first computing the DFT! Let and then, Then recover p Q by the inverse-DFT But how do we compute the DFT in the first place without knowing p Q ??

9 We compute where Hertz and Stormo’s Algorithm Sample from which are independent Poissons with mean (instead of X) Fact: Recursion:

10 Baglivo et al.’s Algorithm Recurse over k to compute DFT of Q k,n Let DQ k,n be. Then, And is upto a constant

11 Comparison of the two algorithms Both have O(QKN 2 ) time complexity Space complexity: O(QN) for Hertz and Stormo’s method O(Q+N) for Baglivo et al.’s method Numerical errors: Bounded for Hertz and Stormo’s method. Baglivo et al.’s method can yield negative p-values in some cases!

12 Whats going wrong? Hoeffding (1965) proved that P(I  s)  f(N,K)  e -s If we compare with we would get …

13 And why? Fixed precision arithmetic: In double precision arithmetic Due to roundoff errors the smaller numbers in the summation in the DFT are overwhelmed by the largest ones! Solution: boost the smaller values How? And by how much?

14 Our Solution We shift p Q (s) with e  s : where is the MGF of I Q To compute the DFT of p  we replace the Poisson p k with and compute

15 Which  to use? With  = 1, N = 400, K = 5,

16 Optimizing  Theoretical error bound Want to calculate P(I > s 0 ) and so a good choice for  seems to be We can solve for  numerically. This only adds O(KN 2 ) to the runtime.

17 So far … We have shown how to compensate for the errors in Baglivo et al.’s algorithm provide error guarantees As a bonus we also avoid the need for log- computation! The updated algorithm has O(Q+N) space complexity. However the runtime is still O(QKN 2 ) Can we speed it up?

18 Unfortunately the FFT based convolution introduces more numerical errors: When  =1, A faster variant! Note that the recursive step is a convolution Naïve convolution takes time O(N 2 ) An FFT based convolution, takes O(NlogN) time!

19 The final piece Need a shift that works well on both and The variation over l is expected to be small. So we focus on computing the correct shift for different values of k. We make an intuitive choice where M k (  2 ) is the MGF of

20 Experimental Results (Accuracy) Both the shifts work well in practice! We tested our method with K values upto a 100, N upto 10,000 and various choices of  and s In all cases relative error in comparison to Hertz and Stormo was less than 1e-9

21 Experimental Results (Runtime) As expected, our algorithm scales as NlogN as opposed to N 2 for Hertz and Stormo’s method!

22 Recovering the entire range Imaginary part of computed p  is a measure of numerical error Adhoc test for reliability: Real(p  (j)) > 10 3 × max j Imag(p  (j)) In practise, we can recover the entire p.m.f. using as few as 2 to 3 different s (or equivalently  ) values. For large Ns this is still significantly faster than Hertz and Stormo’s algorithm.

23 Work in progress … Rigorous error bounds for choice of  2 ’s Applying the methodology to compute other statistics Extending the method to automatically recover the entire range Exploring applications


Download ppt "A faster reliable algorithm to estimate the p-value of the multinomial llr statistic Uri Keich and Niranjan Nagarajan (Department of Computer Science,"

Similar presentations


Ads by Google