An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

1 An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
Adam Kirsch, Michael Mitzenmacher, Harvard University
Andrea Pietracaprina, Geppino Pucci, Fabio Vandin, University of Padova
Eli Upfal, Brown University
PODS 2009
Presented by Dongjoo Lee, IDS Lab., CSE, SNU

2 Frequent Pattern Mining
Copyright © 2009 by CEBT. IDS Lab. 2009 Spring Seminar.
Pipeline: transaction dataset D → frequent itemset mining algorithm (Apriori, FP-Growth, …) with a support threshold → frequent itemsets, which are used for:
• association rules
• correlations
• sequences
• episodes
• classifiers
• clusters
• …
Given a dataset D of transactions over a set of items I, and a support threshold s, return all itemsets X ⊆ I with support at least s in D (i.e., contained in at least s transactions).
Among all possible $\binom{n}{k}$ itemsets of size k (k-itemsets), we are interested in the statistically significant ones, that is, itemsets whose supports are significantly higher, in a statistical sense, than their expected supports in a dataset where individual items are placed independently in transactions.
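
The mining task defined on this slide can be sketched naively as follows. This is only an illustration of the definition (enumerate k-subsets, count support, filter by s), not Apriori or FP-Growth; the function name `frequent_itemsets` and the toy dataset are hypothetical.

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(dataset, k, s):
    """Return {itemset: support} for every k-itemset contained in at least s transactions."""
    support = Counter()
    for transaction in dataset:
        # Every k-subset of a transaction gains one unit of support.
        for itemset in combinations(sorted(transaction), k):
            support[itemset] += 1
    return {x: c for x, c in support.items() if c >= s}

# Hypothetical toy dataset of four transactions.
toy = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "d"}]
result = frequent_itemsets(toy, k=2, s=2)
print(result)
```

The brute-force enumeration is exponential in k, which is why practical miners prune the search space instead.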

3 Measuring the Statistical Significance of a Discovery: Model
• $D$: dataset of t transactions on a set I of n items, where each transaction $d_i \subseteq I$
• $n(i)$: number of transactions that contain item i
• $f_i = n(i)/t$: frequency of item i
• $\hat{D}$: random dataset where item i is included in any given transaction with probability $f_i$, independent of all other items and all other transactions
• Null hypothesis $H_0$: the support $s(X, D)$ in $D$ is drawn from the same distribution as its support $s(X, \hat{D})$ in $\hat{D}$.
• Alternative hypothesis $H_1$: the support $s(X, D)$ in $D$ is not drawn from that distribution, and in particular there is a positive correlation between the occurrences of the individual items in X.

4 Measuring the Statistical Significance of a Discovery: Example
Observed in $D$:
• t = 1,000,000
• |I| = 1,000
• $f_i = f_j = 1/1{,}000$
• support({i, j}) = 7
• $|Q_7| = 300$
Definitions:
• $T_{i,j} = \{t \mid t \in D,\ i \in t,\ j \in t\}$
• $Q_n = \{\{i,j\} \mid |T_{i,j}| = n\}$
Expected in $\hat{D}$:
• $\Pr(i,j) = 0.000001$
• $E[|T_{i,j}|] = 0.000001 \times 1{,}000{,}000 = 1$
• $\Pr(|T_{i,j}| = 7) \approx 0.0001$
• $\binom{1000}{2} = 499{,}500$ pairs of items
• $E[|Q_7|] \approx 0.0001 \times 499{,}500 \approx 50$
• $\Pr(|Q_7| = 300) \le 2^{-300}$
Binomial distribution: $\Pr(k) = \binom{n}{k} p^k (1-p)^{n-k}$
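
The arithmetic of this example can be checked with a few lines of Python. This is a sketch under the slide's own numbers; the variable names are illustrative, and the "at least 7" tail is computed alongside the "exactly 7" probability because both round to the slide's 0.0001.

```python
from math import comb

t = 1_000_000      # number of transactions
n_items = 1_000    # |I|
f = 1 / 1_000      # frequency of every item
p_pair = f * f     # Pr(i and j both in one transaction) under independence

def binomial_pmf(n, k, p):
    """Pr(k) = C(n, k) p^k (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

expected_support = t * p_pair                      # E[|T_ij|]
pr_exactly_7 = binomial_pmf(t, 7, p_pair)          # Pr(|T_ij| = 7)
pr_at_least_7 = 1.0 - sum(binomial_pmf(t, j, p_pair) for j in range(7))
n_pairs = comb(n_items, 2)                         # number of candidate pairs
expected_pairs = n_pairs * pr_at_least_7           # expected pairs with support >= 7

print(expected_support, pr_exactly_7, pr_at_least_7, n_pairs, expected_pairs)
```

Both probabilities come out slightly below the rounded 0.0001, so the expected number of such pairs is a few dozen, the same order of magnitude as the slide's 50 and far below the 300 observed.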

5 Statistical Hypothesis Testing
• Significance level of the test: α = Pr(Type I error), the probability of rejecting $H_0$ when it is true (false positive)
• Power of the test: β = 1 − Pr(Type II error), the probability of correctly rejecting the null hypothesis
• Observation in $D$: $|Q_7| = 300$
• p-value: the probability of the observation if the null hypothesis is true
• Critical region C: p-value ≤ 0.05. If the observation is in the critical region, reject the null hypothesis.
• $\Pr(|Q_7| = 300 \mid H_0) \le 2^{-300}$, so the observation falls in the critical region.

6 Multi-hypothesis Testing
• The outcome of an experiment is used to test a number of hypotheses simultaneously.
• Significance level of multi-hypothesis testing:
Family Wise Error Rate (FWER)
– probability of incurring at least one Type I error in any of the individual tests
– conservative
– for large numbers of hypotheses, all FWER-controlling techniques lead to tests with low power
False Discovery Rate (FDR)
– less conservative
[Diagram: each of the $\binom{n}{k}$ k-itemsets $X_i$ gives one observation $|Q_{X_i,k}| = s$ in $D$, with its own p-value $p_{X_i} = \Pr(|Q_{X_i,k}| = s \mid H_0^i)$; the question is which of them fall in the critical region C (≤ 0.05).]

7 False Discovery Rate Control
• False Discovery Rate (FDR): the expected ratio of erroneous rejections among all rejections. FDR = E[V/R] (V/R = 0 when R = 0)
– V: number of Type I errors in the individual tests
– R: total number of null hypotheses rejected by the multiple test
• FDR control bounds the expected proportion of incorrectly rejected null hypotheses.
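
To make the ratio V/R concrete, the sketch below computes it for a single experiment; the FDR is the expectation of this quantity over repeated experiments. The function name and the set-based representation of hypotheses are assumptions made for illustration.

```python
def false_discovery_proportion(rejected, true_nulls):
    """V/R for one experiment.

    rejected:   set of hypotheses whose null was rejected (R = its size)
    true_nulls: set of hypotheses whose null actually holds
    """
    r = len(rejected)
    if r == 0:
        return 0.0                      # convention from the slide: V/R = 0 when R = 0
    v = len(rejected & true_nulls)      # Type I errors among the rejections
    return v / r

# Hypothetical outcome: four rejections, one of which is a true null.
print(false_discovery_proportion({1, 2, 3, 4}, {4, 5}))
```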

8 Standard FDR Control
1. Consider all possible k-itemsets.
2. Get the p-value of each itemset X with support s, following the binomial distribution.
3. Find the itemsets that satisfy the FDR constraint.
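
The transcript does not preserve which FDR procedure this slide shows. As one possible instance of the last step, the sketch below applies the Benjamini-Yekutieli step-up rule, which is assumed here because it remains valid when the tests are dependent. It takes p-values that are assumed to be computed already; the function name and the sample p-values are hypothetical.

```python
def benjamini_yekutieli(p_values, beta):
    """Return the indices of hypotheses rejected at FDR level beta.

    Rejects the `cutoff` smallest p-values, where `cutoff` is the largest
    rank r such that p_(r) <= r * beta / (m * H_m), with H_m = sum_{j<=m} 1/j.
    """
    m = len(p_values)
    harmonic = sum(1.0 / j for j in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * beta / (m * harmonic):
            cutoff = rank
    return sorted(order[:cutoff])

# Hypothetical p-values for four itemsets.
rejected = benjamini_yekutieli([0.001, 0.9, 0.004, 0.5], beta=0.05)
print(rejected)
```

With one hypothesis per k-itemset, m grows as $\binom{n}{k}$, so the per-hypothesis thresholds become very small; this is the loss of power the next slides try to avoid.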

9 What Do the Authors Do?
1. Approximate the distribution of $\hat{Q}_{k,s}$ for supports s at least a minimum support $s_{\min}$: a Poisson approximation, using the Chen-Stein method.
2. Find the $s_{\min}$ at which the Poisson approximation of $\hat{Q}_{k,s}$ holds within a bounded error: a Monte Carlo method.
3. Establish a support threshold s* with a controlled FDR: fewer hypotheses are tested than in the standard multi-comparison test.

10 Poisson Distribution
If the expected number of occurrences in a certain interval is λ, then the probability that there are exactly k occurrences (k being a non-negative integer, k = 0, 1, 2, …) is equal to
$f(k; \lambda) = \dfrac{\lambda^k e^{-\lambda}}{k!}$
[Figures: probability mass function; cumulative distribution function]

11 Poisson Approximation for $\hat{Q}_{k,s}$
• Let $Q_{k,s}$ be the number of itemsets of size k with support at least s with respect to $D$; $\hat{Q}_{k,s}$ is the corresponding random variable for $\hat{D}$.
• Fix k and s.
• Define a collection of Bernoulli random variables $\{Z_X \mid X \subset I, |X| = k\}$, such that $Z_X = 1$ if the itemset X appears in at least s transactions in the random dataset $\hat{D}$, and $Z_X = 0$ otherwise. $p_X = \Pr(Z_X = 1)$
• Let $I(X) = \{X' \mid X \cap X' \neq \emptyset, |X'| = |X|\}$
• If $Y \notin I(X)$, then $Z_Y$ and $Z_X$ are independent.
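
The objects on this slide can be sketched in Python as follows: drawing a random dataset from the item frequencies, evaluating the indicator $Z_X$ on it, and testing membership in the neighborhood $I(X)$. All function names and the sample frequencies are hypothetical; itemsets are represented as `frozenset`s of item indices.

```python
import random

def sample_random_dataset(freqs, t, rng):
    """Draw t transactions; item i is included independently with probability freqs[i]."""
    return [frozenset(i for i, f in enumerate(freqs) if rng.random() < f)
            for _ in range(t)]

def z_indicator(itemset, dataset, s):
    """Z_X: 1 if itemset X is contained in at least s transactions, else 0."""
    support = sum(1 for transaction in dataset if itemset <= transaction)
    return 1 if support >= s else 0

def in_neighborhood(x, y):
    """True if Y is in I(X): same size as X and sharing at least one item with it."""
    return len(x) == len(y) and bool(x & y)

# Hypothetical frequencies for four items; estimate p_X by repeated sampling.
rng = random.Random(0)
freqs = [0.5, 0.5, 0.1, 0.1]
x = frozenset({0, 1})
trials = 200
p_x_estimate = sum(
    z_indicator(x, sample_random_dataset(freqs, t=100, rng=rng), s=30)
    for _ in range(trials)
) / trials
print(p_x_estimate)
```

Itemsets outside $I(X)$ share no item with X, so their supports depend on disjoint sets of independent coin flips, which is why the corresponding indicators are independent.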

12 Poisson Approximation for $\hat{Q}_{k,s}$
• THEOREM 1. Let U be a Poisson random variable such that $E[U] = E[\hat{Q}_{k,s}] = \lambda < \infty$. The variation distance between the distributions $\mathcal{L}(\hat{Q}_{k,s})$ of $\hat{Q}_{k,s}$ and $\mathcal{L}(U)$ of U is such that …
• $b_1$ and $b_2$ are both decreasing in s. Therefore, if $b_1 + b_2$ is below a given bound for some s, it stays below that bound for every larger s.

13 A Monte Carlo Method for Determining $s_{\min}$

14 A Novel Multi-Hypothesis Testing
[Algorithm figure not preserved; its annotations:]
• Set the initial support value to $s_{\min}$
• Maximum number of calculations needed to obtain s*
• s* found
• Set the next support value
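
Since the procedure itself is not preserved in the transcript, the following is only a rough sketch of the idea suggested by the annotations: start from the smallest candidate support, compare the observed count $Q_{k,s}$ with its Poisson approximation, and stop at the first support where the count is significantly large. The candidate schedule, the equal split of α across candidates, and every name in the code are assumptions, not the authors' procedure.

```python
from math import exp

def poisson_tail(lam, q):
    """Pr(Poisson(lam) >= q), computed as 1 - CDF(q - 1).

    Very small tails lose precision in the subtraction; fine for a sketch.
    """
    term, cdf = exp(-lam), 0.0
    for j in range(q):
        cdf += term                # add Pr(Poisson = j)
        term *= lam / (j + 1)      # advance to Pr(Poisson = j + 1)
    return max(0.0, 1.0 - cdf)

def first_significant_support(observed, expected, alpha):
    """Return the smallest candidate support s that passes the test, else None.

    observed[s]: Q_{k,s} counted in the real dataset
    expected[s]: lambda, the mean of the count in the random model
    Each candidate is tested at level alpha / (number of candidates), an assumed split.
    """
    candidates = sorted(observed)
    level = alpha / len(candidates)
    for s in candidates:
        if poisson_tail(expected[s], observed[s]) <= level:
            return s
    return None

# Hypothetical counts at two candidate supports.
s_star = first_significant_support({10: 5, 12: 40}, {10: 4.0, 12: 0.5}, alpha=0.05)
print(s_star)
```

Only one hypothesis per candidate support is tested here, rather than one per itemset, which is the contrast drawn on the next slide.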

15 Novel Testing vs. Standard Testing
Standard FDR Testing | Novel Testing
Constrain itemsets | Constrain support s*
Control more hypotheses | Control fewer hypotheses
Evaluate the significance of individual itemsets | Evaluate the significance of the itemsets as a whole

16 Experimental Results – Experiments on Benchmark Datasets

17 Experimental Results – Experiments on Random Datasets

18 Experimental Results – Comparison with Standard FDR Test

19 Conclusion
• In a random dataset where items are placed independently in transactions, there is a minimum support $s_{\min}$ such that the number of k-itemsets with support at least $s_{\min}$ is well approximated by a Poisson distribution.
• The novel multi-hypothesis test incurs a small FDR.
• First attempt at establishing a support threshold for the classical frequent itemset mining problem with a quantitative guarantee on the significance of the output.

20 Discussion
• Hard to understand, because following the content requires a lot of background knowledge and related notions
• Pros
– Good approximation
– Lower FDR
– Finds an appropriate support by exploring the structure of the whole dataset
• Cons
– Fails to find significant itemsets with small support

