CIS: Compound Importance Sampling for Binding Site p-value Estimation The Hebrew University, Jerusalem, Israel Yoseph Barash Gal Elidan Tommy Kaplan Nir.


CIS: Compound Importance Sampling for Binding Site p-value Estimation
Yoseph Barash, Gal Elidan, Tommy Kaplan, Nir Friedman
The Hebrew University, Jerusalem, Israel

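2 Detecting Target Genes
[Figure: a promoter upstream of a gene, with candidate locations each labeled "binding site?"]
Probabilistic framework: a motif over positions 1, 2, …, k (example sequence: ACGTACGT), where p[i,c] is the probability of letter c at position i.
Candidate sites are ranked by a log-odds score.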
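The score formula itself did not survive in the transcript. A common form of the log-odds score sums log(p[i,c] / p0[c]) over positions; the sketch below assumes that form, a base-2 logarithm, a uniform background p0, and a made-up 3-position motif. None of these specifics come from the slides.

```python
import math

# Hypothetical 3-position motif: MOTIF[i][c] is p[i, c], the probability
# of letter c at position i. The values are illustrative only.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
# Assumed uniform background distribution P0.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds_score(seq, motif=MOTIF, background=BACKGROUND):
    """Return the sum over positions of log2(p[i, c] / p0[c])."""
    if len(seq) != len(motif):
        raise ValueError("sequence length must equal motif length")
    return sum(math.log2(col[c] / background[c]) for col, c in zip(motif, seq))
```

Under these assumptions the consensus sequence "ACG" gets a positive score, while a sequence such as "TTT" that the motif disfavors gets a negative one.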

3 Detecting target genes (2)
[Figure: two candidate sites, each marked "?"]

4 p-value of Scores
[Figure: distribution of scores; axes: Score (horizontal), Prob (vertical); a score S is marked]

5 Detecting target genes (3)
p-value score: universal; interpretable; controls the false positive error rate.
[Figure: a score axis (7 to 15) aligned with a p-value axis (10^-2 to 10^-7); a Bonferroni-corrected p-value of 0.01 is marked]
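As a rough illustration of why a Bonferroni-corrected p-value of 0.01 pushes the per-site threshold so low, the sketch below divides the target error rate by the number of tests. The figure of 100 000 tested positions is a hypothetical value chosen for illustration, not a number from the slides.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise error rate <= alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def bonferroni_corrected(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical scan of 100 000 candidate positions at a corrected level of 0.01.
threshold = bonferroni_threshold(0.01, 100_000)
```

With these assumed numbers, each individual site needs a raw p-value of about 10^-7 to stay significant after correction.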

6 p-value Estimation
Problem 1: naïve enumeration is infeasible (#seq = 4^k).
Estimate the p-value by sampling from P_0: sample scores s_1 … s_n.
[Figure: score distribution with threshold S*; axes: Score, Prob]
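The sampling estimate described on this slide can be sketched as follows: draw sequences from the background P_0, score each one, and report the fraction that reach S*. The motif, the uniform background, and the function names are assumptions made for illustration.

```python
import math
import random

LETTERS = "ACGT"
# Hypothetical 3-position motif; the background P0 is assumed uniform.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]


def score(seq):
    """Log-odds score (base 2) against a uniform background."""
    return sum(math.log2(col[c] / 0.25) for col, c in zip(MOTIF, seq))


def naive_p_value(s_star, n_samples, rng):
    """Fraction of background samples whose score is at least s_star."""
    hits = 0
    for _ in range(n_samples):
        # Sample one sequence from the uniform background P0.
        seq = [rng.choice(LETTERS) for _ in range(len(MOTIF))]
        if score(seq) >= s_star:
            hits += 1
    return hits / n_samples
```

The estimate is only reliable when a reasonable number of samples land at or above S*, which is why very small p-values need very many samples.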

7 p-value Estimation
Problem 2: multiple hypothesis testing requires low p-values (10^-7).
Need ~10^7 attempts to get a sample with p-value < 10^-7.
[Figure: score distribution with threshold S* far in the tail; axes: Score, Prob]

8 Importance Sampling Approach
1. Cheat: sample from Q(s_1 … s_k) to get high-scoring samples.
2. Get absolution: weigh each sample.
[Figure: score distribution with threshold S*; axes: Score, Prob]
Empirical p-value ~ 10^-8, N ~ 10^4
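The two steps above can be sketched as follows, taking the motif itself as the sampling distribution Q and weighting each sample by P_0(x)/Q(x). The 6-position motif, the uniform background, and the exact-enumeration helper (feasible only because k is tiny here) are illustrative assumptions, not the authors' setup.

```python
import itertools
import math
import random

LETTERS = "ACGT"
CONSENSUS = "ACGTAC"  # hypothetical motif consensus
P0 = {c: 0.25 for c in LETTERS}  # assumed uniform background
MOTIF = [{c: (0.85 if c == b else 0.05) for c in LETTERS} for b in CONSENSUS]


def score(seq):
    """Log-odds score (base 2) of seq under MOTIF versus P0."""
    return sum(math.log2(col[c] / P0[c]) for col, c in zip(MOTIF, seq))


def importance_p_value(s_star, n_samples, rng):
    """Estimate P0(score >= s_star) from samples drawn from Q = MOTIF."""
    total = 0.0
    for _ in range(n_samples):
        # Step 1 ("cheat"): sample from Q, which favors high scores.
        x = [rng.choices(LETTERS, weights=[col[c] for c in LETTERS])[0]
             for col in MOTIF]
        if score(x) >= s_star:
            # Step 2 ("absolution"): weight the sample by P0(x) / Q(x).
            total += math.prod(P0[c] / col[c] for col, c in zip(MOTIF, x))
    return total / n_samples


def exact_p_value(s_star):
    """Exact tail probability by enumerating all 4^k sequences (small k only)."""
    k = len(MOTIF)
    return sum(0.25 ** k for x in itertools.product(LETTERS, repeat=k)
               if score(x) >= s_star)
```

In this toy setting a few thousand weighted samples are enough to estimate the p-value of the top score (4^-6, about 2.4e-4), whereas unweighted background sampling would need far more draws to see that score even a handful of times.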

9 Importance Sampling: Why is this allowed?
x = subsequence
Desired estimate: expectation of log-odds
- Sample from P_0(x) and count
- Multiply and divide by Q(x)
- Sample from Q(x) and reweight
How to choose Q?
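The formulas for these steps are missing from the transcript. The standard importance-sampling identity they appear to describe is sketched below, reading the desired estimate as the probability under P_0 that the log-odds score S(x) reaches the threshold S*. The notation (S, S*, N, x_n) is assumed for illustration.

```latex
\begin{align*}
\text{p-value}(S^*)
  &= \mathbb{E}_{P_0}\!\left[\mathbf{1}\{S(x) \ge S^*\}\right]
   = \sum_x P_0(x)\,\mathbf{1}\{S(x) \ge S^*\} \\
  &= \sum_x Q(x)\,\frac{P_0(x)}{Q(x)}\,\mathbf{1}\{S(x) \ge S^*\}
   = \mathbb{E}_{Q}\!\left[\frac{P_0(x)}{Q(x)}\,\mathbf{1}\{S(x) \ge S^*\}\right] \\
  &\approx \frac{1}{N}\sum_{n=1}^{N} \frac{P_0(x_n)}{Q(x_n)}\,\mathbf{1}\{S(x_n) \ge S^*\},
  \qquad x_n \sim Q .
\end{align*}
```

The identity holds whenever Q(x) > 0 for every x with P_0(x) > 0 and S(x) ≥ S*; the choice of Q affects only the variance of the estimate, not its expectation.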

10 Choosing Sampling Distribution
[Figure: score densities under the sampling distributions Q_1 = Background, Q_5, and Q_10 = Motif, with an under-sampled region marked; axes: Score, Density]

11 Choosing Sampling Distribution
- Rescale
- Combine
Comprehensive coverage
[Figure: the combined sampling distribution and the mixing ratio; axes: Score, Density]
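One way to read "rescale" and "combine" is to build several sampling distributions between the background and the motif and to sample from their mixture. The sketch below assumes a geometric interpolation for the rescaling step and equal mixing ratios; both are illustrative choices, and the slides do not say how the authors define either step.

```python
import itertools
import math
import random

LETTERS = "ACGT"
CONSENSUS = "ACGTAC"  # hypothetical motif consensus
P0 = {c: 0.25 for c in LETTERS}  # assumed uniform background
MOTIF = [{c: (0.85 if c == b else 0.05) for c in LETTERS} for b in CONSENSUS]


def score(seq):
    """Log-odds score (base 2) of seq under MOTIF versus P0."""
    return sum(math.log2(col[c] / P0[c]) for col, c in zip(MOTIF, seq))


def rescaled(t):
    """Assumed rescaling: t = 0 gives the background, t = 1 gives the motif."""
    cols = []
    for col in MOTIF:
        raw = {c: P0[c] ** (1 - t) * col[c] ** t for c in LETTERS}
        z = sum(raw.values())
        cols.append({c: v / z for c, v in raw.items()})
    return cols


def prob(cols, seq):
    """Probability of seq under a position-independent model."""
    return math.prod(col[c] for col, c in zip(cols, seq))


def compound_p_value(s_star, n_samples, rng, temps=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Importance sampling from an equal-weight mixture of rescaled models."""
    comps = [rescaled(t) for t in temps]
    alpha = 1.0 / len(comps)  # assumed equal mixing ratios
    total = 0.0
    for _ in range(n_samples):
        cols = rng.choice(comps)  # pick a mixture component
        x = [rng.choices(LETTERS, weights=[col[c] for c in LETTERS])[0]
             for col in cols]
        if score(x) >= s_star:
            q_mix = sum(alpha * prob(comp, x) for comp in comps)
            total += math.prod(P0[c] for c in x) / q_mix
    return total / n_samples


def exact_p_value(s_star):
    """Exact tail probability by enumeration (feasible only for small k)."""
    k = len(MOTIF)
    return sum(0.25 ** k for x in itertools.product(LETTERS, repeat=k)
               if score(x) >= s_star)
```

Because the mixture puts mass across the whole score range, a single batch of samples can be reused to estimate p-values at low, middle, and high thresholds, which is one reading of "comprehensive coverage".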

12 PSSM Example
[Figure: p-value (0 to 6e-5) versus Score (10 to 22); curves: Naive, MAST (Bailey et al. 98), Normal, CIS; sample counts shown in the legend: 10 000 000 and 40 000]
What if we want something else?

13 Beyond PSSM Models
- Main point: capture dependencies between binding site positions to improve site predictions.
- Dependency models: many possible variants, such as trees, mixtures of PSSMs, and mixtures of trees.
- Suggested by several recent papers: Barash et al. (2003), King & Roth (2003), Zhou & Liu (2004), …
Tree example: [Figure: a tree over positions X1, X2, X3, X4, X5]
Challenge: compute p-values for general models.
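To make the idea of a dependency model concrete, the sketch below scores a sequence under a small tree model in which each non-root position depends on its parent position. The tree shape, the probability tables, and the uniform background are all hypothetical and are not taken from the cited papers.

```python
import math

LETTERS = "ACGT"
# Hypothetical tree over 3 positions: position 0 is the root,
# and positions 1 and 2 each depend on position 0.
PARENT = [None, 0, 0]
ROOT = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}


def cond_table(favored):
    """P(child | parent): the child takes favored[parent] with probability 0.7."""
    return {p: {c: (0.7 if c == favored[p] else 0.1) for c in LETTERS}
            for p in LETTERS}


# COND[i][parent_letter][child_letter]; the root has no conditional table.
COND = [
    None,
    cond_table({"A": "C", "C": "A", "G": "T", "T": "G"}),
    cond_table({"A": "G", "C": "T", "G": "A", "T": "C"}),
]


def tree_log_odds(seq, p0=0.25):
    """Log-odds score (base 2) of seq under the tree model versus a uniform background."""
    total = 0.0
    for i, c in enumerate(seq):
        if PARENT[i] is None:
            p = ROOT[c]
        else:
            p = COND[i][seq[PARENT[i]]][c]
        total += math.log2(p / p0)
    return total
```

Unlike a PSSM, the contribution of positions 1 and 2 here depends on the letter at position 0, so the score no longer decomposes into independent per-position terms.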

14 Tree Model Example
- Naïve sampling: not efficient
- MAST (Bailey et al. 98): not applicable
- Normal approx.: not accurate
[Figure: p-value (0 to 1e-4) versus Score (10 to 20); curves: Naive, Normal, CIS; sample counts shown in the legend: 10 000 000 and 40 000]

15 Decreased Estimator Variability
10 repeats of sampling
[Figure: p-value (0 to 1e-4) versus Score (10 to 20); curves: Naive, Normal, CIS; sample counts shown in the legend: 10 x 10 000 000 and 10 x 40 000]

16 CIS - Summary
- General form: applies to a wide range of probabilistic models
- Computationally efficient
- Handles low p-values accurately
Available online at: http://compbio.cs.huji.ac.il/CIS

17 Thank you
http://compbio.cs.huji.ac.il/CIS
Joint work with: Nir Friedman, Gal Elidan, Tommy Kaplan

