An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets Adam Kirsch, Michael Mitzenmacher, Harvard University Andrea.

Presentation transcript:

An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets
Adam Kirsch, Michael Mitzenmacher (Harvard University)
Andrea Pietracaprina, Geppino Pucci, Fabio Vandin (University of Padova)
Eli Upfal (Brown University)
PODS 2009
Presented by Dongjoo Lee, IDS Lab., CSE, SNU

Copyright © 2009 by CEBT

Frequent Pattern Mining

[Diagram: transaction dataset $D$ → frequent itemset mining algorithm]
• Uses: association rules, correlations, sequences, episodes, classifiers, clusters, …
• Parameter: support
• Algorithms: Apriori, FP-Growth, …

Given a dataset $D$ of transactions over a set of items $I$ and a support threshold $s$, return all itemsets $X \subseteq I$ with support at least $s$ in $D$ (i.e., contained in at least $s$ transactions).

Among all possible $\binom{n}{k}$ itemsets of size $k$ ($k$-itemsets), we are interested in the statistically significant ones, that is, itemsets whose supports are significantly higher, in a statistical sense, than their expected supports in a dataset where individual items are placed independently in transactions.

IDS Lab Spring Seminar
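The problem statement on this slide can be made concrete with a small sketch. The brute-force enumeration below is only an illustration of the definition of support; it is not Apriori or FP-Growth, and the names `support`, `frequent_itemsets`, and `toy` are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def frequent_itemsets(transactions, k, s):
    """All k-itemsets with support at least s, found by brute-force enumeration."""
    universe = sorted(set().union(*transactions))
    result = {}
    for candidate in combinations(universe, k):
        count = support(candidate, transactions)
        if count >= s:
            result[candidate] = count
    return result


# Toy dataset (made up for illustration): five transactions over items a, b, c.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(frequent_itemsets(toy, k=2, s=3))
```

Enumerating all $\binom{n}{k}$ candidates is only feasible for tiny inputs, which is why dedicated mining algorithms exist.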

Measuring the Statistical Significance of a Discovery: Model

• $D$: dataset of $t$ transactions on a set $I$ of $n$ items, where each transaction $d_i \subseteq I$
• $n(i)$: number of transactions that contain item $i$
• $f_i = n(i)/t$: frequency of item $i$
• $\hat{D}$: random dataset where item $i$ is included in any given transaction with probability $f_i$, independent of all other items and all other transactions
• Null hypothesis $H_0$: the support $s(X, D)$ of $X$ in $D$ is drawn from the same distribution as its support $s(X, \hat{D})$ in $\hat{D}$.
• Alternative hypothesis $H_1$: the support $s(X, D)$ in $D$ is not drawn from that distribution; in particular, there is a positive correlation between the occurrences of the individual items in $X$.
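A minimal sketch of the null model, assuming transactions are represented as Python sets: item frequencies are measured on the real dataset, and a random dataset is drawn by including each item independently with its frequency. The function names and the toy data are hypothetical.

```python
import random


def item_frequencies(transactions):
    """f_i = n(i) / t for every item i that occurs in the dataset."""
    t = len(transactions)
    counts = {}
    for d in transactions:
        for item in d:
            counts[item] = counts.get(item, 0) + 1
    return {item: n / t for item, n in counts.items()}


def random_dataset(frequencies, t, rng):
    """Draw t transactions; item i enters each one independently with probability f_i."""
    return [
        {item for item, f in frequencies.items() if rng.random() < f}
        for _ in range(t)
    ]


# Toy data, made up for illustration.
toy = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b"}]
freqs = item_frequencies(toy)
d_hat = random_dataset(freqs, t=len(toy), rng=random.Random(0))
print(freqs)
print(d_hat)
```

Passing an explicit `random.Random` instance keeps the draw reproducible, which helps when comparing supports across repeated simulations.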

Measuring the Statistical Significance of a Discovery: Example

Observed dataset $D$:
• $t = 1{,}000{,}000$
• $|I| = 1{,}000$
• $f_i = f_j = 1/1{,}000$
• support$(\{i, j\}) = 7$
• $|Q_7| = 300$

Definitions:
• $T_{i,j} = \{t \mid t \in D,\ i \in t,\ j \in t\}$
• $Q_n = \{\{i,j\} \mid |T_{i,j}| = n\}$

Random dataset $\hat{D}$:
• $\Pr(i,j) = f_i f_j = 10^{-6}$
• $E[|T_{i,j}|] = 10^{-6} \times 1{,}000{,}000 = 1$
• $\Pr(|T_{i,j}| = 7) \approx 0.0001$
• $\binom{1000}{2} = 499{,}500$ pairs
• $E[|Q_7|] \approx 0.0001 \times 499{,}500 \approx 50$
• $\Pr(|Q_7| = 300) \le$ …

Binomial distribution: $\Pr(k) = \binom{n}{k} p^k (1-p)^{n-k}$
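The arithmetic of this example can be checked with a short script. This is a sketch using the exact binomial formula from the slide; the variable names are hypothetical, and the exact values come out somewhat below the slide's rounded figures (the slide's expected count of 50 corresponds to rounding the per-pair probability up to 0.0001).

```python
from math import comb

t = 1_000_000        # number of transactions
n_items = 1_000      # |I|
f = 1 / 1_000        # frequency of every item

p_pair = f * f                 # probability that a given transaction contains both i and j
expected_support = p_pair * t  # E[|T_ij|]


def binomial_pmf(k, n, p):
    """Pr(k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability that a fixed pair has support exactly 7, and at least 7.
pmf_7 = binomial_pmf(7, t, p_pair)
tail_7 = 1.0 - sum(binomial_pmf(k, t, p_pair) for k in range(7))

n_pairs = comb(n_items, 2)         # number of candidate pairs
expected_pairs = n_pairs * tail_7  # expected number of pairs with support >= 7

print(n_pairs, expected_support, pmf_7, tail_7, expected_pairs)
```

Observing 300 such pairs when only a few dozen are expected under independence is what makes the discovery look significant.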

Statistical Hypothesis Testing

• Significance level of the test: $\alpha = \Pr(\text{Type I error})$, the probability of rejecting $H_0$ when it is true (false positive)
• Power of the test: $1 - \beta$, where $\beta = \Pr(\text{Type II error})$; the probability of correctly rejecting the null hypothesis

Example with the observation $|Q_7| = 300$ in $D$:
• p-value: the probability of the observation if the null hypothesis is true, $\Pr(|Q_7| = 300 \mid H_0) \le$ …
• Critical region $C$: p-value $\le 0.05$
• If the observation is in the critical region, reject the null hypothesis.

Multi-hypothesis Testing

• The outcome of an experiment is used to test a number of hypotheses simultaneously.
• Significance level of multi-hypothesis testing:
  Family Wise Error Rate (FWER)
  – probability of incurring at least one Type I error in any of the individual tests
  – conservative
  – for large numbers of hypotheses, techniques that control FWER lead to tests with low power
  False Discovery Rate (FDR)
  – less conservative

[Diagram: one hypothesis per $k$-itemset $X_1, X_2, \dots, X_i, \dots$, with $\binom{n}{k}$ itemsets in total. Observation: $|Q_{X_i,k}| = s$. p-value: $\Pr(|Q_{X_i,k}| = s \mid H_0^i) = p_{X_i}$. Critical region $C$: $\le 0.05$?]

False Discovery Rate Control

• False Discovery Rate (FDR): expected ratio of erroneous rejections among all rejections
  $\text{FDR} = E[V/R]$ (with $V/R = 0$ when $R = 0$)
  – $V$: number of Type I errors in the individual tests
  – $R$: total number of null hypotheses rejected by the multiple test
• FDR control bounds the expected proportion of incorrectly rejected null hypotheses.
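As a small illustration of the quantity inside the expectation, the sketch below computes $V/R$ for one outcome of a multiple test, assuming we know which null hypotheses are actually true (as one does in a simulation). The function name and its arguments are hypothetical.

```python
def false_discovery_proportion(rejected, true_nulls):
    """V / R for one outcome of a multiple test, with the convention V / R = 0 when R = 0.

    rejected:   identifiers of the rejected null hypotheses (R = their number)
    true_nulls: identifiers of the null hypotheses that are actually true
    """
    r = len(set(rejected))
    if r == 0:
        return 0.0
    v = len(set(rejected) & set(true_nulls))  # Type I errors among the rejections
    return v / r


# Four rejections, two of which hit true nulls (hypotheses 2 and 4).
print(false_discovery_proportion([1, 2, 3, 4], [2, 4, 9]))
```

The FDR is the expectation of this proportion over the randomness of the data, so in practice it would be estimated by averaging over many simulated datasets.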

Standard FDR Control

• Consider all possible $k$-itemsets.
• Get the p-value of each itemset $X$ with support $s$, following the binomial distribution.
• Find the itemsets that satisfy the FDR constraint.
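The slide does not name the procedure used to enforce the FDR constraint. As one possibility, the sketch below implements the Benjamini-Yekutieli step-up procedure, a standard method that controls FDR under arbitrary dependence between tests; treating it as the "standard FDR control" here is an assumption, and the function name and the symbol `beta` for the FDR bound are hypothetical.

```python
def benjamini_yekutieli(p_values, beta):
    """Indices of the hypotheses rejected by the Benjamini-Yekutieli step-up procedure.

    With m hypotheses and sorted p-values p_(1) <= ... <= p_(m), find the largest
    rank i with p_(i) <= i * beta / (m * H_m), where H_m = 1 + 1/2 + ... + 1/m,
    and reject the i hypotheses with the smallest p-values.
    """
    m = len(p_values)
    if m == 0:
        return []
    harmonic = sum(1.0 / j for j in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * beta / (m * harmonic):
            cutoff = rank  # keep the largest rank that passes
    return sorted(order[:cutoff])


# Toy p-values, made up for illustration.
print(benjamini_yekutieli([0.001, 0.8, 0.002, 0.5], beta=0.05))
```

With $\binom{n}{k}$ hypotheses the per-test thresholds become very small, which is the loss of power that motivates testing a support threshold instead of individual itemsets.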

What Do the Authors Do?

1. Approximate the distribution of $\hat{Q}_{k,s}$ for supports of at least a minimum support $s_{\min}$.
   Poisson approximation using the Chen-Stein method.
2. Find the $s_{\min}$ for which the approximation of the distribution of $\hat{Q}_{k,s}$ holds within a given error.
   A Monte Carlo method.
3. Establish a support threshold $s^*$ with a controlled FDR.
   Lowers the FDR compared to the standard multi-comparison test.

Poisson Distribution

If the expected number of occurrences in a certain interval is $\lambda$, then the probability that there are exactly $k$ occurrences ($k$ being a non-negative integer, $k = 0, 1, 2, \dots$) is equal to

$\Pr(k) = \dfrac{\lambda^k e^{-\lambda}}{k!}$

[Figures: probability mass function; cumulative distribution function]
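A minimal sketch of the two functions named on this slide, written with the standard library only; the function names are hypothetical.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k occurrences when the expected number is lam."""
    return lam**k * exp(-lam) / factorial(k)


def poisson_cdf(k, lam):
    """Probability of at most k occurrences."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


for k in range(5):
    print(k, poisson_pmf(k, 1.0), poisson_cdf(k, 1.0))
```

The upper tail $1 - \text{cdf}(k-1)$ is the quantity that serves as a p-value when an observed count is compared against a Poisson null distribution.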

Poisson Approximation for $\hat{Q}_{k,s}$

• Let $Q_{k,s}$ be the number of itemsets of size $k$ with support at least $s$ with respect to $D$; $\hat{Q}_{k,s}$ is the corresponding random variable for $\hat{D}$.
• Fix $k$ and $s$.
• Define a collection of Bernoulli random variables $\{Z_X \mid X \subset I,\ |X| = k\}$ such that $Z_X = 1$ if the itemset $X$ appears in at least $s$ transactions in the random dataset $\hat{D}$, and $Z_X = 0$ otherwise. $p_X = \Pr(Z_X = 1)$.
• Let $I(X) = \{X' \mid X \cap X' \neq \emptyset,\ |X'| = |X|\}$.
• If $Y \notin I(X)$, then $Z_Y$ and $Z_X$ are independent.
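The definitions above can be simulated directly. The sketch below computes $\hat{Q}_{k,s}$ as the sum of the indicators $Z_X$ over one random dataset and averages it over repeated draws to estimate $\lambda = E[\hat{Q}_{k,s}]$. It is a plain simulation for illustration, not the authors' Monte Carlo algorithm, and all function names are hypothetical.

```python
import random
from itertools import combinations


def random_dataset(frequencies, t, rng):
    """t transactions; item i enters each one independently with probability f_i."""
    return [
        {item for item, f in frequencies.items() if rng.random() < f}
        for _ in range(t)
    ]


def q_ks(transactions, items, k, s):
    """Sum of the indicators Z_X over all k-itemsets X: the number with support >= s."""
    total = 0
    for x in combinations(sorted(items), k):
        x_set = set(x)
        support = sum(1 for d in transactions if x_set <= d)
        total += 1 if support >= s else 0  # this term is Z_X
    return total


def estimate_lambda(frequencies, t, k, s, trials, rng):
    """Average of Q_{k,s} over `trials` independent random datasets."""
    items = list(frequencies)
    samples = [
        q_ks(random_dataset(frequencies, t, rng), items, k, s)
        for _ in range(trials)
    ]
    return sum(samples) / trials


# Toy frequencies, made up for illustration.
freqs = {"a": 0.5, "b": 0.5, "c": 0.2, "d": 0.1}
print(estimate_lambda(freqs, t=50, k=2, s=10, trials=200, rng=random.Random(0)))
```

Itemsets that share no items have independent indicators, which is why only the overlapping neighbourhood $I(X)$ matters in the Chen-Stein bound.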

Poisson Approximation for $\hat{Q}_{k,s}$

• THEOREM 1. Let $U$ be a Poisson random variable such that $E[U] = E[\hat{Q}_{k,s}] = \lambda < \infty$. The variation distance between the distributions $\mathcal{L}(\hat{Q}_{k,s})$ of $\hat{Q}_{k,s}$ and $\mathcal{L}(U)$ of $U$ is such that …
• $b_1$ and $b_2$ are both decreasing in $s$. Therefore, if $b_1 + b_2$ … $s$.

A Monte Carlo Method for Determining $s_{\min}$

A Novel Multi-Hypothesis Testing

[Algorithm figure; annotations:]
• Set the initial support value to $s_{\min}$
• Maximum number of calculations to obtain $s^*$
• Found $s^*$
• Set the next support value

Novel Testing vs. Standard Testing

Standard FDR Testing                              | Novel Testing
Constrain itemsets                                | Constrain support $s^*$
Control more hypotheses                           | Control fewer hypotheses
Evaluate the significance of individual itemsets  | Evaluate the significance of the entire set of itemsets

Experimental Results – Experiments on Benchmark Datasets

Experimental Results – Experiments on Random Datasets

Experimental Results – Comparison with Standard FDR Test

Conclusion

• In a random dataset where items are placed independently in transactions, there is a minimum support $s_{\min}$ such that the number of $k$-itemsets with support at least $s_{\min}$ is well approximated by a Poisson distribution.
• The novel multi-hypothesis test incurs a small FDR.
• First attempt at establishing a support threshold for the classical frequent itemset mining problem with a quantitative guarantee on the significance of the output.

Discussion

• Hard to follow, because the paper assumes a lot of background knowledge and related notions
• Pros
  – Good approximation
  – Lower FDR
  – Finds an appropriate support by exploring the structure of the whole dataset
• Cons
  – Fails to find significant itemsets with small support