AlgoDEEP 16/04/101 An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets Fabio Vandin DEI - Università di Padova CS.

Presentation transcript:

AlgoDEEP 16/04/10 An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets. Fabio Vandin, DEI - Università di Padova and CS Dept. - Brown University. Joint work with: A. Kirsch, M. Mitzenmacher, A. Pietracaprina, G. Pucci, E. Upfal

Data Mining: discovery of hidden patterns (e.g., correlations, association rules, clusters, anomalies) from large data sets. When is a pattern significant? Open problem: developing rigorous (mathematical/statistical) approaches to assess significance and to discover significant patterns efficiently.

Frequent Itemsets (1): Dataset D of transactions over a set of items I (D ⊆ 2^I). Support of an itemset X ∈ 2^I in D = number of transactions that contain X. Example: support({Beer, Diaper}) = 3. Significant?
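The support definition can be illustrated with a minimal sketch. The transaction table from the slide is not preserved here, so the toy dataset below is hypothetical; it is chosen only so that {Beer, Diaper} has support 3, as on the slide. The function name `support` is illustrative.

```python
# Hypothetical toy dataset: each transaction is a set of items.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, dataset):
    """Number of transactions in `dataset` that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for transaction in dataset if items <= transaction)


print(support({"Beer", "Diaper"}, transactions))
```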

Frequent Itemsets (2)
Original formulation of the problem [Agrawal et al. 93]: input: dataset D over I, support threshold s; output: all itemsets of support ≥ s in D (frequent itemsets).
Rationale: significance = high support (≥ s).
Drawbacks:
Threshold s is hard to fix: too low → possible output explosion and spurious discoveries (false positives); too high → loss of interesting itemsets (false negatives).
No guarantee of significance of the output itemsets.
Alternative formulations proposed to mitigate the above drawbacks: closed itemsets, maximal itemsets, top-K itemsets.

Significance: focus on statistical significance, i.e., significance w.r.t. a random model. We address the following questions: What support level makes an itemset significantly frequent? How to narrow the search down to significant itemsets? Goal: minimize false discoveries and improve the quality of subsequent analysis.

Related Work: Many works consider the significance of itemsets in isolation. E.g., [Silverstein, Brin, Motwani, 98]: rigorous statistical framework (with flaws!), using a χ² test to assess the degree of dependence of the items in an itemset. Global characteristics of the dataset are taken into account in [Gionis, Mannila, et al., 06]: deviation from a random dataset w.r.t. the number of frequent itemsets, but with no rigorous statistical grounding.

Statistical Tests: Standard statistical test: null hypothesis H_0 (≈ not significant) vs. alternative hypothesis H_1. H_0 is tested against H_1 by observing a certain statistic s. p-value = Prob(obs ≥ s | H_0 is true). Significance level α = probability of rejecting H_0 when it is true (false positive), also called the probability of a Type I error.

Random Model: I = set of n items. D = input dataset of t transactions over I. ∀ i ∈ I: n(i) = support of {i} in D, and f_i = n(i)/t = frequency of i in D. D̂ = random dataset of t transactions over I: item i is included in transaction j with probability f_i, independently of all other events.
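The random model can be sketched in a few lines: measure each item's frequency in the input dataset, then build a random dataset with the same number of transactions by including each item independently with that frequency. The function names and the toy input are assumptions made for illustration, not part of the talk.

```python
import random


def item_frequencies(dataset):
    """f_i = n(i) / t for every item i that occurs in the input dataset."""
    t = len(dataset)
    counts = {}
    for transaction in dataset:
        for item in transaction:
            counts[item] = counts.get(item, 0) + 1
    return {item: count / t for item, count in counts.items()}


def random_dataset(freqs, t, rng):
    """Draw t transactions; item i enters each one independently with probability f_i."""
    return [
        {item for item, f in freqs.items() if rng.random() < f}
        for _ in range(t)
    ]


# Hypothetical toy input, only to exercise the two functions.
toy = [{"a", "b"}, {"a"}, {"a", "c"}, {"a", "b"}]
freqs = item_frequencies(toy)  # a: 1.0, b: 0.5, c: 0.25
sample = random_dataset(freqs, t=len(toy), rng=random.Random(0))
print(freqs)
print(sample)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; any other source of independent uniform draws would serve equally well.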

Naïve Approach (1): For each itemset X = {i_1, i_2, …, i_k} ⊆ I: f_X = f_{i_1} · f_{i_2} · … · f_{i_k} is the expected frequency of X in the random dataset D̂. Null hypothesis H_0(X): the support of X in D conforms with D̂ (i.e., it is as if drawn from Binomial(t, f_X)). Alternative hypothesis H_1(X): the support of X in D does not conform with D̂.

Naïve Approach (2): Statistic of interest: s_X = support of X in D. Reject H_0(X) if: p-value = Prob(B(t, f_X) ≥ s_X) ≤ α. Significant itemsets = {X ⊆ I : H_0(X) is rejected}.
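A minimal sketch of the naïve per-itemset test, using only the standard library. It computes the expected frequency as the product of the item frequencies and the p-value as the upper tail of a Binomial(t, f_X). The helper names (`expected_frequency`, `binomial_tail`, `naive_is_significant`) and the stopping tolerance are assumptions for illustration.

```python
import math


def expected_frequency(itemset, item_freq):
    """f_X: product of the individual item frequencies (independence model)."""
    f = 1.0
    for item in itemset:
        f *= item_freq[item]
    return f


def binomial_tail(t, p, s):
    """Prob(Binomial(t, p) >= s), summing the upper tail with log-space terms."""
    if s <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for k in range(s, t + 1):
        log_pmf = (
            math.lgamma(t + 1) - math.lgamma(k + 1) - math.lgamma(t - k + 1)
            + k * math.log(p) + (t - k) * math.log1p(-p)
        )
        term = math.exp(log_pmf)
        total += term
        # Past the mean the terms only shrink; stop once they are negligible.
        if k > t * p and term <= 1e-17 * total:
            break
    return min(total, 1.0)


def naive_is_significant(support_x, t, f_x, alpha):
    """Reject H_0(X) when the binomial p-value is at most alpha."""
    return binomial_tail(t, f_x, support_x) <= alpha
```

For instance, `naive_is_significant(7, 1_000_000, 1e-6, 0.05)` evaluates to `True`: taken in isolation, a pair with support 7 looks significant under this test.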

Naïve Approach (3): What's wrong? D with t = 1,000,000 transactions over n = 1000 items, each item with frequency 1/1000. A pair {i,j} occurs 7 times: is it statistically significant? In the random dataset D̂, E[support({i,j})] = 1 and p-value = Prob({i,j} has support ≥ 7) ≃ 0.0001 → {i,j} must be significant!

Naïve Approach (4): The expected number of pairs with support ≥ 7 in a random dataset is ≃ 50 → the existence of a pair {i,j} with support ≥ 7 is not such a rare event! Returning {i,j} as a significant itemset could be a false discovery. However, 300 (disjoint) pairs with support ≥ 7 in a random dataset is an extremely rare event (prob ≤ ).
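The arithmetic behind this example can be checked with a short script. As a simplifying assumption, the Binomial(t, f_X) support of a pair is replaced by a Poisson distribution with the same mean (here 1). The per-pair tail probability comes out on the order of 10^-4, and multiplying by the number of pairs gives a few dozen expected pairs, the same order of magnitude as the rounded figure on the slide.

```python
import math

t = 1_000_000          # transactions
n = 1000               # items
item_freq = 1 / n      # every item has the same frequency

pair_freq = item_freq * item_freq    # f_X for a pair under independence
expected_support = t * pair_freq     # mean support of one pair, about 1


def poisson_upper_tail(lam, s):
    """Prob(Poisson(lam) >= s), as 1 minus the lower tail (fine for small s)."""
    lower = sum(math.exp(-lam) * lam ** k / math.factorial(k) for k in range(s))
    return 1.0 - lower


p_tail = poisson_upper_tail(expected_support, 7)   # per-pair p-value
num_pairs = math.comb(n, 2)                        # number of distinct pairs
expected_pairs = num_pairs * p_tail                # expected pairs with support >= 7

print(f"per-pair p-value ~ {p_tail:.2e}")
print(f"number of pairs   = {num_pairs}")
print(f"expected pairs with support >= 7 ~ {expected_pairs:.1f}")
```

The point of the example survives the approximation: a p-value that looks tiny for one pair stops being remarkable once roughly half a million pairs are tested at the same time.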

Multi-Hypothesis Test (1): Looking for significant itemsets of size k (k-itemsets) involves testing simultaneously m = C(n, k) null hypotheses: {H_0(X)}_{|X|=k}. How to combine m tests while minimizing false positives?

Multi-Hypothesis Test (2)
V = number of false positives.
R = total number of rejected null hypotheses = number of itemsets flagged as significant.
False Discovery Rate (FDR) = E[V/R] (FDR = 0 when R = 0).
GOAL: maximize R while ensuring FDR ≤ β.
[Benjamini-Yekutieli '01]: reject the hypothesis with the i-th smallest p-value if it is ≤ i·β/m.
Here m = C(n, k), and the procedure does not yield a support threshold for mining.
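A minimal sketch of a step-up rejection rule with the threshold i·β/m stated on the slide. Two points are assumptions rather than content of the talk: the rule is applied in the usual step-up fashion (find the largest rank whose p-value meets its threshold and reject everything up to that rank), and the optional `dependence_correction` flag divides the threshold by the harmonic sum 1 + 1/2 + … + 1/m, which is the form commonly used when the tests may be arbitrarily dependent. The function name is illustrative.

```python
def step_up_rejections(p_values, beta, dependence_correction=False):
    """Return the indices of rejected hypotheses under a step-up FDR rule.

    The i-th smallest p-value is compared against i * beta / (m * c), where
    c = 1 by default and c = sum_{j=1..m} 1/j when dependence_correction is set.
    """
    m = len(p_values)
    if m == 0:
        return []
    c = sum(1.0 / j for j in range(1, m + 1)) if dependence_correction else 1.0
    order = sorted(range(m), key=lambda idx: p_values[idx])
    last_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * beta / (m * c):
            last_rank = rank  # keep the largest rank that passes
    return sorted(order[:last_rank])


p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(step_up_rejections(p_values, beta=0.05))
print(step_up_rejections(p_values, beta=0.05, dependence_correction=True))
```

Either way the output is a set of individual rejected itemsets, so applying such a rule requires a p-value for every candidate k-itemset and gives no single support threshold to hand to a mining algorithm.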

Our Approach: Q(k, s) = observed number of k-itemsets of support ≥ s in D. Null hypothesis H_0(s): the number of k-itemsets of support ≥ s in D conforms with the random dataset D̂. Alternative hypothesis H_1(s): the number of k-itemsets of support ≥ s in D does not conform with D̂. Problem: how to compute the p-value of Q(k, s)?

Main Results (PODS 2009): Result 1 (Poisson approximation): let Q̂(k,s) = number of k-itemsets of support ≥ s in the random dataset D̂. Theorem: there exists s_min such that, for s ≥ s_min, Q̂(k,s) is well approximated by a Poisson distribution. Result 2: a methodology to establish a support threshold for discovering significant itemsets with small FDR.

Approximation Result (1): Based on the Chen-Stein method (1975). Q̂(k,s) = number of k-itemsets of support ≥ s in the random dataset D̂. U ~ Poisson(λ), λ = E[Q̂(k,s)]. Theorem: for k = O(1), t = poly(n), and for a large range of item distributions and supports s: distance(Q̂(k,s), U) = O(1/n).

Approximation Result (2): Corollary: there exists s_min such that Q̂(k,s) is well approximated by a Poisson distribution for s ≥ s_min. In practice: a Monte Carlo method determines s_min such that, with probability at least 1 − δ, distance(Q̂(k,s), U) ≤ ε for all s ≥ s_min.

Support threshold for mining significant itemsets (1)
Determine s_min and let h be such that s_min + 2^h is the maximum support for an itemset.
Fix α_1, α_2, …, α_h such that ∑ α_i ≤ α.
Fix β_1, β_2, …, β_h such that ∑ β_i ≤ β.
For i = 1 to h:
  s_i = s_min + 2^i
  Q(k, s_i) = observed number of k-itemsets of support ≥ s_i
  H_0(k, s_i): Q(k, s_i) conforms with Poisson(λ_i), where λ_i = E[Q̂(k, s_i)] is the expected count in the random dataset
  reject H_0(k, s_i) if: p-value of Q(k, s_i) < α_i and Q(k, s_i) ≥ λ_i / β_i
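The loop above can be sketched as follows, under several stated assumptions: the budgets are split evenly (α_i = α/h, β_i = β/h, which satisfies the two sum constraints), h is taken as the smallest integer with s_min + 2^h ≥ s_max, and the observed counts Q(k, s) and expected counts λ are supplied by caller-provided functions (in practice λ would have to be estimated, for example by simulation). All function and parameter names are hypothetical.

```python
import math


def poisson_tail(lam, q):
    """Prob(Poisson(lam) >= q), summing the upper tail with log-space terms."""
    if q <= 0:
        return 1.0
    if lam <= 0.0:
        return 0.0
    total = 0.0
    k = q
    while True:
        term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        total += term
        # Past the mean the terms only shrink; stop once they are negligible.
        if k > lam and term <= 1e-17 * total:
            break
        k += 1
    return min(total, 1.0)


def significant_support_threshold(observed_count, expected_count,
                                  s_min, s_max, alpha=0.05, beta=0.05):
    """Return the smallest tested s_i whose null hypothesis is rejected, or None.

    observed_count(s): Q(k, s) in the real dataset (caller-supplied).
    expected_count(s): lambda = E[Q(k, s)] in the random model (caller-supplied).
    """
    if s_max <= s_min:
        return None
    h = max(1, math.ceil(math.log2(s_max - s_min)))  # assumption: cover up to s_max
    alpha_i = alpha / h                              # assumption: even split
    beta_i = beta / h
    for i in range(1, h + 1):                        # s_i increases with i
        s_i = s_min + 2 ** i
        q = observed_count(s_i)
        lam = expected_count(s_i)
        if poisson_tail(lam, q) < alpha_i and q >= lam / beta_i:
            return s_i
    return None


# Hypothetical counts, only to exercise the procedure.
observed = {6: 60, 8: 30, 12: 12, 20: 3}
expected = {6: 50.0, 8: 0.2, 12: 0.01, 20: 0.0001}
s_star = significant_support_threshold(observed.get, expected.get, s_min=4, s_max=20)
print(s_star)
```

Because the candidate thresholds are visited in increasing order, the first rejection is also the minimum rejected threshold.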

Support threshold for mining significant itemsets (2)
Theorem. Let s* be the minimum s such that H_0(k,s) was rejected. We have:
1. With significance level α, the number of k-itemsets of support ≥ s* is significant.
2. The k-itemsets with support ≥ s* are significant with FDR ≤ β.

Experiments: benchmark datasets (FIMI repository). [Table; columns include: avg. trans. length, items frequencies range]

Experiments: results (1): Test: α = 0.05, β = 0.05. Q_{k,s*} = number of k-itemsets of support ≥ s* in D. λ(s*) = expected number of k-itemsets with support ≥ s*. Itemset of size 154 with support ≥ 7.

Experiments: results (2): Comparison with the standard application of Benjamini-Yekutieli: FDR ≤ 0.05. R = output (standard approach). Q_{k,s*} = output (our approach). r = |Q_{k,s*}| / |R|.

Conclusions: Poisson approximation for the number of k-itemsets of support s ≥ s_min in a random dataset. A statistically sound method to determine a support threshold for mining significant frequent itemsets with controlled FDR.

Future Work: Deal with false negatives. Software package. Application of the method to other frequent pattern problems.

Questions? Thank you!