
A faster reliable algorithm to estimate the p-value of the multinomial llr statistic Uri Keich and Niranjan Nagarajan (Department of Computer Science, Cornell University, Ithaca, NY, USA)

Motivation: Hunting for Motifs

Is there a motif here? Or, for that matter, here? Is the alignment interesting/significant? Which of the columns form a motif?

Uncovering significant columns: work with column profiles. Null hypothesis: the column is a random sample from a multinomial distribution with probabilities π = (π_1, …, π_K). Test statistic: the log-likelihood ratio I = Σ_k X_k log(X_k / (N·π_k)), where X = (X_1, …, X_K) are the letter counts for a random sample of size N. We then calculate the p-value of the observed score s, P(I ≥ s).

Bioinformatics applications: sequence-profile and profile-profile alignment; locating binding-site correlations in DNA; detecting compensatory mutations in proteins and RNA. Other applications: signal processing, natural language processing, and more …

Computational extremes! Direct enumeration: exact results, but exponential time and space requirements (O(N^K) possible outcomes), even with pruning of the search space! Asymptotic approximation: constant-time approximation, but fails with N fixed and s approaching the tail.

The middle path (Baglivo et al., 1992): compute p_Q, the p.m.f. of the integer-valued lattice version of I, where δ is the mesh size and Q the number of lattice points. Runtime is polynomial in N, K and Q, and the result is accurate up to the granularity of the lattice.

Baglivo et al.'s approach: compute p_Q by first computing its DFT! Let ω = e^{-2πi/Q}; then Φ(l) = Σ_j p_Q(j)·ω^{jl}, and p_Q is recovered from Φ by the inverse DFT. But how do we compute the DFT in the first place, without knowing p_Q??
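For concreteness, the recover-by-inverse-DFT step is mechanical once the transform is in hand; a small numpy sketch (illustrative, not the talk's code):

```python
import numpy as np

# A toy lattice p.m.f. p_Q on {0, 1, 2, 3}
p = np.array([0.1, 0.4, 0.3, 0.2])

# Forward DFT: Phi(l) = sum_j p_Q(j) * exp(-2*pi*i*j*l / Q)
Phi = np.fft.fft(p)

# Inverse DFT recovers p_Q (exactly, up to roundoff)
p_rec = np.fft.ifft(Phi).real
```

The hard part, as the slide notes, is obtaining Phi without already having p.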

Hertz and Stormo's algorithm: instead of working with X, sample Y_1, …, Y_K, which are independent Poissons with mean N·π_k. Fact: conditioned on Σ_k Y_k = N, Y has the same distribution as X. Recursion: compute Q_{k,n}(s) = P(score of the first k letters is s and Y_1 + … + Y_k = n) by convolving Q_{k-1,·}(·) with the score and count contributions of the k-th letter.
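As a sketch of this recursion (illustrative names and a naive rounding of scores to a mesh of size delta; not the authors' implementation), the dynamic program and the final conditioning step look roughly like:

```python
from math import exp, log, factorial

def hertz_stormo_pmf(N, probs, delta):
    """Lattice p.m.f. of the llr statistic I via the Poisson recursion.
    Keys are (n, j): n = letters placed so far, j = lattice score."""
    table = {(0, 0): 1.0}
    for pk in probs:
        lam = N * pk
        pois = [exp(-lam) * lam**m / factorial(m) for m in range(N + 1)]
        new = {}
        for (n, j), p in table.items():
            for m in range(N + 1 - n):  # occurrences of the current letter
                s = 0.0 if m == 0 else m * log(m / lam)  # llr contribution
                key = (n + m, j + round(s / delta))
                new[key] = new.get(key, 0.0) + p * pois[m]
        table = new
    # Condition on the Poisson counts summing to N (the "Fact" above),
    # which recovers the multinomial distribution of X.
    cond = {j: p for (n, j), p in table.items() if n == N}
    z = sum(cond.values())
    return {j: p / z for j, p in cond.items()}

pmf = hertz_stormo_pmf(5, [0.5, 0.5], 0.01)  # toy case: N = 5, K = 2
```

The p-value P(I ≥ s) is then a tail sum over the lattice, e.g. `sum(p for j, p in pmf.items() if j * delta >= s)`.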

Baglivo et al.'s algorithm: recurse over k to compute the DFT of Q_{k,n} directly. Let DQ_{k,n}(l) be the DFT of Q_{k,n} in its score argument. Then DQ_{k,n}(l) = Σ_m DQ_{k-1,n-m}(l)·ω^{l·s_k(m)}·P(Y_k = m), where s_k(m) is the lattice score of m occurrences of letter k, and DQ_{K,N} is, up to a constant, the desired DFT of p_Q.

Comparison of the two algorithms: both have O(QKN^2) time complexity. Space complexity: O(QN) for Hertz and Stormo's method, O(Q+N) for Baglivo et al.'s method. Numerical errors: bounded for Hertz and Stormo's method, but Baglivo et al.'s method can yield negative p-values in some cases!

What's going wrong? Hoeffding (1965) proved that P(I ≥ s) ≤ f(N,K)·e^{-s}, so the tail p-values are exponentially small in s. If we compare values this small with the precision of double arithmetic, we can already guess what we would get …

And why? Fixed-precision arithmetic: in double-precision arithmetic the relative precision is only about 10^{-16}, so due to roundoff errors the smaller numbers in the DFT summation are overwhelmed by the largest ones! Solution: boost the smaller values. How? And by how much?
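The cancellation the slide refers to is easy to demonstrate: in IEEE double precision, a summand more than about 16 orders of magnitude below the largest one contributes nothing to the sum. A tiny illustration (generic Python, unrelated to the talk's code):

```python
big = 1.0e20
small = 1.0                  # ~20 orders of magnitude below `big`
lost = (big + small) - big   # roundoff swallows `small` entirely
print(lost)                  # 0.0
# In the DFT summation the exponentially small tail terms play the
# role of `small`, hence the corrupted (even negative) p-values.
```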

Our solution: we shift p_Q(s) with e^{θs}, defining p_θ(s) = p_Q(s)·e^{θs}/M(θ), where M(θ) is the MGF of I_Q. To compute the DFT of p_θ we replace the Poisson probabilities p_k(m) = P(Y_k = m) with their exponentially tilted counterparts p_k(m)·e^{θ·s_k(m)} and run the same recursion.
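The effect of the shift can be seen on a toy p.m.f. (a sketch with made-up numbers, applying the tilt to an explicit p.m.f. rather than through the DFT recursion): tilting boosts the tiny tail values into the same numerical range as the rest, and dividing the tilt back out recovers them exactly.

```python
from math import exp

def tilt(pmf, theta):
    """p_theta(j) = p(j) * e^{theta*j} / M(theta), M = MGF at theta."""
    M = sum(p * exp(theta * j) for j, p in pmf.items())
    return {j: p * exp(theta * j) / M for j, p in pmf.items()}, M

def untilt(tilted, M, theta):
    """Invert the shift: p(j) = p_theta(j) * M(theta) * e^{-theta*j}."""
    return {j: p * M * exp(-theta * j) for j, p in tilted.items()}

pmf = {0: 0.9, 10: 0.0999, 50: 1e-4}   # tiny tail value at j = 50
tilted, M = tilt(pmf, 0.2)             # tail value boosted to O(1)
back = untilt(tilted, M, 0.2)          # original values recovered
```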

Which  to use? With  = 1, N = 400, K = 5,

Optimizing θ: a theoretical error bound guides the choice. We want to calculate P(I > s_0), and so a good choice for θ seems to be the value θ* that centers the shifted distribution at s_0, i.e. (log M)'(θ*) = s_0. We can solve for θ numerically; this only adds O(KN^2) to the runtime.
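That numerical step can be as simple as a bisection: the tilted mean is monotone in θ, so the equation (log M)'(θ) = s_0 has a unique, easily bracketed root. A toy sketch (illustrative names, an explicit p.m.f. instead of the lattice recursion):

```python
from math import exp

def tilted_mean(pmf, theta):
    """(log M)'(theta): the mean of p_theta(j) ∝ p(j) * e^{theta*j}."""
    w = {j: p * exp(theta * j) for j, p in pmf.items()}
    z = sum(w.values())
    return sum(j * wj for j, wj in w.items()) / z

def solve_theta(pmf, s0, lo=0.0, hi=10.0, iters=80):
    """Bisection for theta* with tilted mean s0 (assumes s0 lies between
    the untilted mean and the maximum score)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if tilted_mean(pmf, mid) < s0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

pmf = {0: 0.5, 1: 0.3, 2: 0.15, 3: 0.05}
theta_star = solve_theta(pmf, 2.5)  # center the shifted p.m.f. at s0 = 2.5
```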

So far … we have shown how to compensate for the errors in Baglivo et al.'s algorithm and how to provide error guarantees. As a bonus, we also avoid the need for log-computation! The updated algorithm has O(Q+N) space complexity; however, the runtime is still O(QKN^2). Can we speed it up?

A faster variant! Note that the recursive step is a convolution. A naïve convolution takes O(N^2) time; an FFT-based convolution takes O(N log N) time! Unfortunately, the FFT-based convolution introduces more numerical errors: when θ = 1, [figure].
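The speedup itself is the standard FFT convolution trick; a minimal numpy comparison (illustrative, not the talk's code) of the O(N^2) and O(N log N) versions of one convolution step:

```python
import numpy as np

def conv_naive(a, b):
    """Direct O(N^2) convolution of two p.m.f. vectors."""
    out = np.zeros(len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        out[i:i + len(b)] += ai * b
    return out

def conv_fft(a, b):
    """O(N log N) convolution via the FFT; the round trip through the
    frequency domain is what introduces the extra roundoff error."""
    n = len(a) + len(b) - 1
    return np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)

a = np.array([0.2, 0.5, 0.3])
b = np.array([0.6, 0.4])
```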

The final piece: we need a shift that works well on both terms of the convolution. The variation over l is expected to be small, so we focus on computing the correct shift for the different values of k. We make an intuitive choice: shift the k-th term with e^{θ_2·s_k}/M_k(θ_2), where M_k(θ_2) is the MGF of the k-th letter's score contribution.

Experimental results (accuracy): both shifts work well in practice! We tested our method with K values up to 100, N up to 10,000 and various choices of θ and s. In all cases the relative error in comparison to Hertz and Stormo's method was less than 1e-9.

Experimental results (runtime): as expected, our algorithm scales as N log N, as opposed to N^2 for Hertz and Stormo's method!

Recovering the entire range: the imaginary part of the computed p_θ is a measure of the numerical error. Ad-hoc test for reliability: Real(p_θ(j)) > 10^3 × max_j |Imag(p_θ(j))|. In practice, we can recover the entire p.m.f. using as few as 2 to 3 different s (or, equivalently, θ) values. For large N this is still significantly faster than Hertz and Stormo's algorithm.

Work in progress … Rigorous error bounds for the choice of the θ_2's. Applying the methodology to compute other statistics. Extending the method to automatically recover the entire range. Exploring applications.