
Slide 1: An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
AlgoDEEP, 16/04/10
Fabio Vandin, DEI - Università di Padova / CS Dept. - Brown University
Joint work with: A. Kirsch, M. Mitzenmacher, A. Pietracaprina, G. Pucci, E. Upfal

Slide 2: Data Mining
Discovery of hidden patterns (e.g., correlations, association rules, clusters, anomalies) from large data sets.
When is a pattern significant?
Open problem: development of rigorous (mathematical/statistical) approaches to assess significance and to discover significant patterns efficiently.

Slide 3: Frequent Itemsets (1)
Dataset D of transactions over a set of items I (D ⊆ 2^I).
Support of an itemset X ∈ 2^I in D = number of transactions that contain X.
Example (from the slide's table): support({Beer, Diaper}) = 3. Significant?
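As a toy illustration of the support definition, the snippet below counts support over a small made-up market-basket dataset (the transactions and item names are hypothetical, chosen only to reproduce support({Beer, Diaper}) = 3 as on the slide):

```python
# Hypothetical toy dataset: 5 transactions over a handful of items.
dataset = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Cola"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Cola"},
]

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

print(support({"Beer", "Diaper"}, dataset))  # -> 3
```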

Slide 4: Frequent Itemsets (2)
Original formulation of the problem [Agrawal et al. 93]:
input: dataset D over I, support threshold s
output: all itemsets of support ≥ s in D (frequent itemsets)
Rationale: significance = high support (≥ s).
Drawbacks:
- the threshold s is hard to fix: too low ⇒ possible output explosion and spurious discoveries (false positives); too high ⇒ loss of interesting itemsets (false negatives)
- no guarantee of significance of the output itemsets
Alternative formulations (closed itemsets, maximal itemsets, top-K itemsets) have been proposed to mitigate the above drawbacks.

Slide 5: Significance
Focus on statistical significance: significance w.r.t. a random model.
We address the following questions:
- What support level makes an itemset significantly frequent?
- How to narrow the search down to significant itemsets?
Goal: minimize false discoveries and improve the quality of subsequent analysis.

Slide 6: Related Work
Many works consider the significance of itemsets in isolation. E.g., [Silverstein, Brin, Motwani, 98]: a rigorous statistical framework (with flaws!), using a χ² test to assess the degree of dependence among the items of an itemset.
Global characteristics of the dataset are taken into account in [Gionis, Mannila, et al., 06]: deviation from a random dataset w.r.t. the number of frequent itemsets, but no rigorous statistical grounding.

Slide 7: Statistical Tests
Standard statistical test:
- null hypothesis H0 (≈ not significant)
- alternative hypothesis H1
H0 is tested against H1 by observing a certain statistic s.
p-value = Prob(obs ≥ s | H0 is true)
Significance level α = probability of rejecting H0 when it is true (false positive), also called the probability of a Type I error.

Slide 8: Random Model
I = set of n items.
D = input dataset of t transactions over I. For each i ∈ I: n(i) = support of {i} in D, and f_i = n(i)/t = frequency of i in D.
D̂ = random dataset of t transactions over I: item i is included in transaction j with probability f_i, independently of all other events.
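A minimal sketch of this random model in Python (function and variable names are illustrative, not from the paper): given the empirical item frequencies f_i, each transaction of D̂ includes item i independently with probability f_i.

```python
import numpy as np

def random_dataset(t, item_freqs, seed=None):
    """Draw a random dataset D-hat: a binary t x n matrix whose entry (j, i) is 1
    iff transaction j contains item i, where item i is included independently
    with probability item_freqs[i]."""
    rng = np.random.default_rng(seed)
    return (rng.random((t, len(item_freqs))) < np.asarray(item_freqs)).astype(np.uint8)
```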

Slide 9: Naïve Approach (1)
For each itemset X = {i_1, i_2, ..., i_k} ⊆ I:
f_X = f_{i_1} · f_{i_2} · ... · f_{i_k} = expected frequency of X in D̂
null hypothesis H0(X): the support of X in D conforms with D̂ (i.e., it is as if drawn from Binomial(t, f_X))
alternative hypothesis H1(X): the support of X in D does not conform with D̂

Slide 10: Naïve Approach (2)
Statistic of interest: s_X = support of X in D.
Reject H0(X) if: p-value = Prob(B(t, f_X) ≥ s_X) ≤ α.
Significant itemsets = { X ⊆ I : H0(X) is rejected }
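A minimal sketch of the naïve test using scipy (`naive_p_value` and `naive_reject` are illustrative names, not from the paper):

```python
from scipy.stats import binom

def naive_p_value(s_X, t, f_X):
    """p-value of the naive test: Prob(Binomial(t, f_X) >= s_X)."""
    return binom.sf(s_X - 1, t, f_X)   # sf(x) = P(X > x), so sf(s_X - 1) = P(X >= s_X)

def naive_reject(s_X, t, f_X, alpha=0.05):
    """Flag itemset X as significant when its p-value is at most alpha."""
    return naive_p_value(s_X, t, f_X) <= alpha
```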

Slide 11: Naïve Approach (3)
What's wrong? Consider D with t = 1,000,000 transactions over n = 1000 items, each item with frequency 1/1000. A pair {i, j} occurs 7 times: is it statistically significant?
In D̂ (the random dataset): E[support({i, j})] = 1, and p-value = Prob(support({i, j}) ≥ 7) ≈ 0.0001 ⇒ {i, j} must be significant!

Slide 12: Naïve Approach (4)
The expected number of pairs with support ≥ 7 in the random dataset is ≈ 50 ⇒ the existence of some pair {i, j} with support ≥ 7 is not such a rare event! Returning {i, j} as a significant itemset could be a false discovery.
However, 300 (disjoint) pairs with support ≥ 7 in D is an extremely rare event (prob ≤ 2^-300).
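The numbers on the last two slides can be checked directly (a sketch; the exact expected count depends on how the ≈ n²/2 pairs are accounted for):

```python
from math import comb
from scipy.stats import binom

t, n = 1_000_000, 1000
f_pair = (1 / 1000) ** 2            # expected frequency of a fixed pair {i, j}

# p-value of a single pair observed 7 times: Prob(Binomial(t, 1e-6) >= 7)
p = binom.sf(6, t, f_pair)
print(p)                            # ~1e-4: looks highly significant in isolation

# ...but roughly n^2/2 pairs are implicitly tested at once, so by linearity
# of expectation tens of pairs reach support >= 7 purely by chance.
print(comb(n, 2) * p)               # expected number of "significant" pairs
```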

Slide 13: Multi-Hypothesis Test (1)
Looking for significant itemsets of size k (k-itemsets) involves testing m = (n choose k) null hypotheses simultaneously: {H0(X)}, |X| = k.
How to combine the m tests while minimizing false positives?

Slide 14: Multi-Hypothesis Test (2)
V = number of false positives; R = total number of rejected null hypotheses = number of itemsets flagged as significant.
False Discovery Rate (FDR) = E[V/R] (FDR = 0 when R = 0).
GOAL: maximize R while ensuring FDR ≤ β.
[Benjamini-Yekutieli '01]: reject the hypothesis with the i-th smallest p-value if it is ≤ i·β/m. With m = (n choose k) hypotheses, this does not yield a support threshold for mining.
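For reference, a sketch of the standard Benjamini-Yekutieli step-up procedure (the textbook version divides by an extra harmonic factor c(m) = Σ_{j≤m} 1/j, which the simplified rule quoted on the slide omits; the function name is illustrative):

```python
import numpy as np

def benjamini_yekutieli(p_values, beta=0.05):
    """Return a boolean mask of rejected hypotheses under the BY step-up
    procedure, which controls the FDR at level beta under arbitrary dependence."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    c_m = np.sum(1.0 / np.arange(1, m + 1))
    order = np.argsort(p)
    thresholds = np.arange(1, m + 1) * beta / (m * c_m)
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()      # largest rank i with p_(i) below its threshold
        reject[order[: k + 1]] = True       # step-up: reject all hypotheses up to rank k
    return reject
```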

Slide 15: Our Approach
Q(k, s) = observed number of k-itemsets of support ≥ s.
null hypothesis H0(s): the number of k-itemsets of support ≥ s in D conforms with D̂
alternative hypothesis H1(s): the number of k-itemsets of support ≥ s in D does not conform with D̂
Problem: how to compute the p-value of Q(k, s)?

Slide 16: Main Results (PODS 2009)
Result 1 (Poisson approximation). Let Q̂(k, s) = number of k-itemsets of support ≥ s in D̂. Theorem: there exists s_min such that for s ≥ s_min, Q̂(k, s) is well approximated by a Poisson distribution.
Result 2. A methodology to establish a support threshold for discovering significant itemsets with small FDR.

Slide 17: Approximation Result (1)
Based on the Chen-Stein method (1975).
Q̂(k, s) = number of k-itemsets of support ≥ s in the random dataset D̂.
U ~ Poisson(λ), λ = E[Q̂(k, s)].
Theorem: for k = O(1), t = poly(n), and for a large range of item distributions and supports s: distance(Q̂(k, s), U) = O(1/n).

Slide 18: Approximation Result (2)
Corollary: there exists s_min such that Q̂(k, s) is well approximated by a Poisson distribution for all s ≥ s_min.
In practice: a Monte-Carlo method determines s_min such that, with probability at least 1 - δ, distance(Q̂(k, s), U) ≤ ε for all s ≥ s_min.
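A crude Monte-Carlo sketch of this idea, restricted to pairs (k = 2) and using the variance-to-mean ratio of Q̂(2, s) as a cheap proxy for "Poisson-like" behavior. This is only an illustration under stated assumptions, not the distance estimate used in the paper; it reuses the `random_dataset` generator sketched under the Random Model slide.

```python
import numpy as np

def pair_supports(dataset):
    """Support of every item pair, given a binary t x n transaction matrix."""
    co = dataset.astype(np.int64).T @ dataset.astype(np.int64)   # co-occurrence matrix
    return co[np.triu_indices(co.shape[0], k=1)]                 # strict upper triangle = pairs

def estimate_s_min(item_freqs, t, s_grid, n_rep=20, eps=0.05, seed=0):
    """For each candidate support s (in increasing order), simulate Q-hat(2, s)
    over n_rep random datasets and return the smallest s whose variance-to-mean
    ratio is within eps of 1 (a Poisson variable has variance equal to its mean)."""
    counts = []
    for r in range(n_rep):
        d = random_dataset(t, item_freqs, seed=seed + r)
        sup = pair_supports(d)
        counts.append([(sup >= s).sum() for s in s_grid])
    counts = np.asarray(counts, dtype=float)                     # n_rep x len(s_grid)
    for j, s in enumerate(s_grid):
        lam = counts[:, j].mean()
        if lam > 0 and abs(counts[:, j].var() / lam - 1.0) <= eps:
            return s
    return None
```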

Slide 19: Support threshold for mining significant itemsets (1)
Determine s_min and let h be such that s_min + 2^h is the maximum support of an itemset.
Fix α_1, α_2, ..., α_h such that Σ α_i ≤ α.
Fix β_1, β_2, ..., β_h such that Σ β_i ≤ β.
For i = 1 to h:
  s_i = s_min + 2^i
  Q(k, s_i) = observed number of k-itemsets of support ≥ s_i
  H0(k, s_i): Q(k, s_i) conforms with Poisson(λ_i = E[Q̂(k, s_i)])
  reject H0(k, s_i) if: p-value of Q(k, s_i) < α_i and Q(k, s_i) ≥ λ_i / β_i
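A minimal sketch of this multi-level test. The even split of the α and β budgets across levels is one arbitrary choice satisfying Σα_i ≤ α and Σβ_i ≤ β, the support levels follow the s_i = s_min + 2^i reconstruction above, and `observed_Q` and `expected_lambda` are assumed to be supplied by the caller.

```python
from scipy.stats import poisson

def find_s_star(observed_Q, expected_lambda, s_min, h, alpha=0.05, beta=0.05):
    """observed_Q(s): number of k-itemsets of support >= s in the real dataset.
    expected_lambda(s): its expectation under the random model.
    Returns the smallest tested level s_i at which H0(k, s_i) is rejected, or None."""
    alpha_i, beta_i = alpha / h, beta / h        # even split of the error budgets
    for i in range(1, h + 1):
        s_i = s_min + 2 ** i
        q, lam = observed_Q(s_i), expected_lambda(s_i)
        p_value = poisson.sf(q - 1, lam)         # Prob(Poisson(lam) >= q)
        if p_value < alpha_i and q >= lam / beta_i:
            return s_i                           # first (hence smallest) rejected level
    return None
```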

Slide 20: Support threshold for mining significant itemsets (2)
Theorem. Let s* be the minimum s such that H0(k, s) was rejected. Then:
1. With significance level α, the number of k-itemsets of support ≥ s* is significant.
2. The k-itemsets with support ≥ s* are significant with FDR ≤ β.

Slide 21: Experiments: benchmark datasets
Benchmark datasets from the FIMI repository (http://fimi.cs.helsinki.fi/data/).
[Table on the slide: for each dataset, the number of items, the average transaction length, and the range of item frequencies.]

Slide 22: Experiments: results (1)
Test with α = 0.05, β = 0.05.
Q_{k,s*} = number of k-itemsets of support ≥ s* in D; λ(s*) = expected number of k-itemsets with support ≥ s*.
[Results table on the slide; it highlights an itemset of size 154 with support ≥ 7.]

Slide 23: Experiments: results (2)
Comparison with the standard application of Benjamini-Yekutieli with FDR ≤ 0.05.
R = output of the standard approach; Q_{k,s*} = output of our approach; r = |Q_{k,s*}| / |R|.
[Comparison table on the slide.]

Slide 24: Conclusions
Poisson approximation for the number of k-itemsets of support s ≥ s_min in a random dataset.
A statistically sound method to determine a support threshold for mining significant frequent itemsets with controlled FDR.

Slide 25: Future Work
Deal with false negatives.
Software package.
Application of the method to other frequent pattern problems.

Slide 26: Questions? Thank you!

