1 Efficient Data Reduction Methods for Online Association Rule Discovery (NGDM'02)
Herve Bronnimann, Bin Chen, Manoranjan Dash, Peter Haas, Yi Qiao, Peter Scheuermann
Presented by: Ivy Tong, 18 June 2003

2 Outline
- Motivation
- FAST
- Epsilon Approximation
- Experimental Results
- Data Stream Reduction
- Conclusion

3 Motivation
- The volume of data in warehouses and on the Internet is growing faster than Moore's Law
  - Scalability is a major concern
- Classical algorithms require one or more scans of the database
- Need to adapt to streaming data
  - Data elements arrive online
  - Limited amount of memory
- One solution: execute the algorithm on a subset of the data

4 Motivation
- Sampling methods
  - Advantage: can explicitly trade off accuracy and speed
  - Work best when tailored to the application
- Contributions of this paper
  - Sampling methods for count datasets
  - Application: association rule mining

5 Notations
- D: database of interest
- S: a simple random sample drawn without replacement from D
- I: the set of all items that appear in D
- I(D): the collection of itemsets that appear in D; I(S): itemsets that appear in S
- For k ≥ 1, I_k(D) and I_k(S) denote the collections of k-itemsets in D and S
- L(D) and L(S): frequent itemsets in D and S
- L_k(D) and L_k(S): collections of frequent k-itemsets in D and S
- For an itemset A ⊆ I and a set of transactions T:
  - n(A;T): the number of transactions in T that contain A
  - |T|: total number of transactions in T
  - Support of A in D: f(A;D) = n(A;D)/|D|; support of A in S: f(A;S) = n(A;S)/|S|

6 Problem Definition
- Generate a smaller subset S_0 of a larger sample S such that the supports of 1-itemsets in S_0 are close to those in S
- I_1(T): set of all 1-itemsets in transaction set T
- L_1(T): set of frequent 1-itemsets in transaction set T
- f(A;T): support of itemset A in transaction set T

7 FAST Algorithm
- Finding Association rules from Sampled Transactions (SIGKDD'02)
- Given a specified minimum support p and confidence c, FAST proceeds as follows:
  1. Obtain a large simple random sample S from D.
  2. Compute f(A;S) for each 1-itemset A.
  3. Using the supports computed in Step 2, obtain a reduced sample S_0 from S by trimming away outlier transactions.
  4. Run a standard association-rule algorithm against S_0, with minimum support p and confidence c, to obtain the final set of association rules.

8 FAST-trim
- Removes the "outlier" transactions from the sample S to obtain S_0
- Outlier: a transaction whose removal from S maximally decreases (or minimally increases) the difference between the supports of the 1-itemsets in S and the corresponding supports in D
- Since the supports of items in D are unknown, estimate them from S as computed in Step 2
- Distance functions used (the L1 and L2 distances between the 1-itemset frequency vectors, cf. the L∞ distance used by EA below):
  Dist_1(S_0, S) = Σ_{A ∈ I_1(S)} |f(A;S_0) − f(A;S)|
  Dist_2(S_0, S) = Σ_{A ∈ I_1(S)} (f(A;S_0) − f(A;S))^2

9 FAST-trim
- Uses an input parameter k to explicitly trade off speed and accuracy (1 ≤ k ≤ |S|)
- Trimming phase:

  while (|S_0| > n) {
      divide S_0 into disjoint groups of min(k, |S_0|) transactions each;
      for each group G {
          compute f(A;S_0) for each item A;
          set S_0 = S_0 − {t*}, where t* ∈ G and
              Dist(S_0 − {t*}, S) = min_{t ∈ G} Dist(S_0 − {t}, S);
      }
  }

- Note: removal of the outlier t* causes the maximum decrease (or minimum increase) in Dist(S_0, S)
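
A minimal Python sketch of the trimming phase under Dist_1, assuming transactions are represented as Python sets of items. The name fast_trim is illustrative; for brevity, a random group of k candidates stands in for the paper's disjoint-group scan, and supports are recomputed from scratch instead of being updated incrementally as an efficient implementation would:

    import random
    from collections import Counter

    def supports(transactions):
        # 1-itemset supports f(A;T): fraction of transactions containing each item.
        counts = Counter(item for t in transactions for item in t)
        n = max(len(transactions), 1)
        return {a: c / n for a, c in counts.items()}

    def dist1(f0, f):
        # Dist_1(S0, S): sum of absolute support differences over the items of S.
        return sum(abs(f0.get(a, 0.0) - fa) for a, fa in f.items())

    def fast_trim(s, n, k):
        # Trim the random sample s down to n transactions; k trades speed
        # for accuracy (only k candidates are examined per removal).
        s0 = list(s)
        f = supports(s)  # supports in D are unknown, so estimate them from S
        while len(s0) > n:
            group = random.sample(range(len(s0)), min(k, len(s0)))
            # Outlier t*: the candidate whose removal minimizes Dist_1(S0 - {t}, S).
            def dist_without(i):
                return dist1(supports(s0[:i] + s0[i + 1:]), f)
            del s0[min(group, key=dist_without)]
        return s0

For example, fast_trim(s, n=len(s)//10, k=10) would shrink a 30% first-phase sample to a 3% final sample.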

10 FAST-grow
- Select representative transactions from S and add them to the sample S_0, which is initially empty
- Growing phase:

  while (|S_0| < n) {
      divide S into disjoint groups of min(k, |S|) transactions each;
      for each group G {
          compute f(A;S_0) for each item A;
          set S_0 = S_0 ∪ {t*}, where t* ∈ G and
              Dist(S_0 ∪ {t*}, S) = min_{t ∈ G} Dist(S_0 ∪ {t}, S);
      }
  }
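
The growing phase admits the same kind of sketch, reusing supports and dist1 from the FAST-trim sketch above; again a random group of k candidates stands in for the paper's disjoint-group scan:

    def fast_grow(s, n, k):
        # Grow S0 from empty by repeatedly adding the representative t*:
        # the candidate whose addition minimizes Dist_1(S0 + {t}, S).
        f = supports(s)
        s0 = []
        while len(s0) < n:
            group = random.sample(s, min(k, len(s)))
            def dist_with(t):
                return dist1(supports(s0 + [t]), f)
            s0.append(min(group, key=dist_with))
        return s0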

11 Epsilon Approximation (EA)
- Similar in spirit to FAST: find a small subset whose 1-itemset supports are close to those in the entire database
- The discrepancy of any subset S_0 of a superset S_1 (the distance between S_0 and S_1 with respect to the 1-itemset frequencies) is computed as the L∞ distance between the frequency vectors:
  Dist_∞(S_0, S_1) = max_{A ∈ I_1(S_1)} |f(A;S_0) − f(A;S_1)|
- Definition: a sample S_0 of S_1 is an ε-approximation iff the discrepancy satisfies Dist_∞(S_0, S_1) ≤ ε
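
A small sketch of the discrepancy check, repeating the supports helper so the snippet stands alone (transactions are again sets of items):

    from collections import Counter

    def supports(transactions):
        counts = Counter(item for t in transactions for item in t)
        n = max(len(transactions), 1)
        return {a: c / n for a, c in counts.items()}

    def dist_inf(s0, s1):
        # L-infinity distance between the 1-itemset frequency vectors.
        f0, f1 = supports(s0), supports(s1)
        return max((abs(f0.get(a, 0.0) - f1.get(a, 0.0))
                    for a in set(f0) | set(f1)), default=0.0)

    def is_eps_approximation(s0, s1, eps):
        # S0 is an epsilon-approximation of S1 iff Dist_inf(S0, S1) <= eps.
        return dist_inf(s0, s1) <= eps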

12 Epsilon Approximation (EA): Halving Method
- Deterministically halves the data to get the sample S_0
- Apply halving repeatedly: S_1 => S_2 => ... => S_t (= S_0)
- Each halving step introduces a discrepancy ε_i(n_i, m), where
  m = total number of items in the database
  n_i = size of sub-sample S_i
- Halving stops with the maximum t such that Σ_{i ≤ t} ε_i(n_i, m) ≤ ε

13 Epsilon Approximation (EA)
1. Color each transaction red (in the sample) or blue (not in the sample)
2. Maintain a penalty for each item that reflects the red/blue imbalance:
   - Penalty is small if red/blue counts are approximately balanced
   - Penalty shoots up exponentially when red dominates (the item is over-sampled) or blue dominates (the item is under-sampled)
3. Color transactions sequentially, keeping the penalty low:
   - Choose the color which gives the smaller penalty

14 Epsilon Approximation (EA)  Penalty Computation  Let Q i = Penalty for item A i  Init Q i = 2  Suppose that we have colored the first j transactions where r i = r i (j) = no. of red transactions containing A i b i = b i (j) = no. of blue transactions containing A i = parameter that influences how fast penalty changes as function of |r i - b i |,  (0,1) Error bound

15 Epsilon Approximation (EA)  How to color transaction j+1  Compute global penalty  Choose color for which global penalty is smaller = Global penalty assuming transaction j+1 is red = Global penalty assuming transaction j+1 is blue

16 Epsilon Approximation (EA)
(Flowchart: initialize the penalty of each item; for each arriving transaction, compute the global penalty under each color and decide to color it red or blue accordingly; red transactions are added to the sample, blue transactions are forgotten.)

17 Epsilon Approximation (EA)  Repeated halving method starts with S  Apply one round of halving to get S 1  Then another round of halving to S 1 to S 2 etc  If S 1 is an  1 approximation of S and S 2 is an  2 approximation of S 1,  S 2 is an (  1 +  2 ) approximation of S  S t is an  t -approximation,  t =  k  t  (n k,m)  Stop repeated halving for the max t s.t.  t  

18 Epsilon Approximation (EA)  Require t passes over the database  Observation: Halving is sequential in deciding the color of a transaction  In single pass,  store all penalties of each halving method  Based on penalties from 1 st halving, decide to color it red or not  If red, compute the penalty for 2 nd halving, etc. until the transaction is colored blue, or belongs to S t, t=log n

19 Experiments
- Synthetic data set (IBM QUEST project)
  - 100,000 transactions
  - 1,000 items
  - Number of maximal potentially large itemsets: 2,000
  - Average transaction length: 10
  - Average length of maximal large itemsets: 4
  - Minimum support: 0.77%
  - Length of the maximal large itemsets: 6

20 Experiments
- Apriori is used in all cases to compute the large itemsets
- Accuracy and execution time are measured
- FAST
  - Two implementations: Dist_1 and Dist_2
  - Phase 1 sample size: 30%
  - Parameter k: 10
- EA
  - Run EA with a given ε value, then use the obtained sample size to run FAST and SRS (simple random sampling)
  - Final sampling ratios: 0.76%, 1.51%, 3.02%, 6.04%, 12.4%, and 24.9%, as dictated by the EA halvings

21 Experimental Results
(Chart: accuracy vs. sampling ratio)

22 Experimental Results
(Chart: execution time vs. sampling ratio)

23 Streaming Data Analysis
- Streaming databases grow continuously, rapidly, and without bound
- Example applications:
  - Stock tickers
  - Network traffic monitors
  - Point-of-sale (POS) systems
  - Phone conversation wiretaps
- Challenges of analysis:
  - Timely response
  - Use of limited memory

24 Previous Work
- Algorithms that identify frequent singleton items over a data stream (VLDB'02):
  - Sticky Sampling
  - Lossy Counting
- Problems
  - They can accurately maintain statistics of items over a stable data stream in which patterns change slowly
  - They fail for applications that require information about the entire stream but with emphasis on the most recent data

25 DSR: Data Stream Reduction
- An EA-based algorithm for sampling data streams
- Goal: generate a sample that carries information about the entire stream while favoring recent data
- Model:
  - Each element of the data stream is a transaction consisting of a set of items (a 0-1 problem)
  - Suppose we want to generate an N_s-element sample S_s
  - S_s puts more weight on recent data

26 Data Stream Reduction
- A representative sample of the data stream:
(Diagram: the stream is divided into conceptual buckets 1, ..., m_S, each covering N_S/2 transactions, so the stream covers m_S · N_S/2 transactions in total. To generate an N_S-element sample, bucket k is halved (m_S − k) times: the most recent bucket m_S contributes N_S/2 transactions, bucket m_S − 1 contributes N_S/4, bucket m_S − 2 contributes N_S/8, and so on, so the sample S holds ~N_S transactions.)

27 Problem: Frequent Halving
- It is expensive to map transactions into conceptual buckets and recompute the representative subset of each bucket whenever a new transaction arrives
- Goal: simulate the ideal scenario while avoiding frequent halving
- Solution: use a working buffer that holds N_s transactions, and compute a new representative sample by applying EA whenever the buffer is full
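
A sketch of the working-buffer scheme, with halve as in the EA sketch above: whenever the buffer fills, its contents are halved in place, which frees roughly half of the buffer for new arrivals. Older transactions are therefore halved more times, weighting the sample toward recent data:

    def dsr(stream, buffer_size, delta=0.5):
        # Working-buffer DSR sketch: the buffer always holds the current
        # representative sample plus the newest raw transactions.
        buffer = []
        for t in stream:
            if len(buffer) >= buffer_size:
                buffer = halve(buffer, delta)  # survivors stay, rest forgotten
            buffer.append(t)
        return buffer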

28 Frequent Halving
(Diagram: the working buffer of size N_s alternates between empty and full; each time it fills, its contents are halved, so successive regions of the buffer hold data that has been halved 0, 1, 2, 3, ... times, with older data halved more often.)

29 Problem of Halving
- Problem: two users querying immediately before and after a halving operation see data that varies substantially
- Continuous DSR: the buffer is divided into N_s/2n_s chunks of 2n_s transactions each, with n_s << N_s
(Diagram: chunk boundaries at 2n_s, 4n_s, ..., N_s; whenever the next n_s transactions arrive, the oldest chunk is halved first, so the boundaries shift to n_s, 3n_s, 5n_s, ..., N_s and the sample evolves gradually instead of changing abruptly at each halving.)
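
One plausible reading of continuous DSR as code (halve as above); the chunk bookkeeping below is a hypothetical simplification of the scheme sketched on the slide:

    from collections import deque

    def continuous_dsr(stream, n_chunks, chunk_size, delta=0.5):
        # The buffer is a queue of chunks. Each time a new chunk of
        # chunk_size transactions is complete, only the oldest chunk is
        # halved (rather than the whole buffer), so users before and
        # after a halving see almost the same data.
        chunks, current = deque(), []
        for t in stream:
            current.append(t)
            if len(current) == chunk_size:
                chunks.append(current)
                current = []
                if len(chunks) > n_chunks:
                    halved = halve(chunks.popleft(), delta)
                    if halved:                    # drop chunks halved to nothing
                        chunks.appendleft(halved)
        return [t for c in chunks for t in c] + current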

30 Discussions
- Advantages of DSR:
  - DSR is more sensitive to recent changes in the stream
  - DSR generates a representative sample instead of collecting statistics such as counts => more flexibility
  - Each halving operation is relatively cheap compared to the expensive frequent-itemset identifications in the Lossy Counting based approach
- Future work:
  - Choice of the discrepancy function (currently based on single-item frequencies)
  - How to evaluate the goodness of a representative subset

31 Conclusions
- FAST: two-phase sampling approaches based on trimming outliers or selecting representative transactions
- Epsilon Approximation: a deterministic method that repeatedly halves the data to obtain the final sample
- Both can be used in conjunction with other non-sampling, count-based mining algorithms
- Both trade off processing speed against accuracy of results
- DSR: EA-based data stream reduction

32 References
- H. Bronnimann, B. Chen, M. Dash, P. Haas, Y. Qiao, and P. Scheuermann. Efficient Data-Reduction Methods for On-Line Association Rule Discovery. NGDM'02.
- B. Chen, P. Haas, and P. Scheuermann. A New Two-Phase Sampling Based Algorithm for Discovering Association Rules. SIGKDD'02.
- G. S. Manku and R. Motwani. Approximate Frequency Counts Over Data Streams. VLDB'02.

33 Sticky Sampling
- Uses a fixed-size buffer and varying sampling rates to estimate counts
- Sample the first 2t incoming items at rate r = 1 (select one item for every item seen), the next 2t items at rate r = 2 (one for every two items seen), the next 4t items at rate r = 4, and so on
- t is predefined based on the frequency threshold, the user-specified error, and the probability of failure
- Equivalent to randomly selecting the same number of elements from a moving window that keeps doubling in size
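
For reference, a sketch of Sticky Sampling for singleton items. The doubling windows and the coin-toss down-sampling follow the VLDB'02 description, but the bookkeeping here is a simplification, and t is simply passed in rather than derived from the threshold, error, and failure probability:

    import random

    def sticky_sampling(stream, t):
        # counts: item -> estimated frequency.
        counts, rate, seen, boundary = {}, 1, 0, 2 * t
        for x in stream:
            seen += 1
            if x in counts:
                counts[x] += 1
            elif random.random() < 1.0 / rate:
                counts[x] = 1       # start tracking x with probability 1/rate
            if seen == boundary:
                rate, boundary = rate * 2, boundary * 2
                # The sampling rate doubled: down-sample existing counters
                # by unbiased coin tosses, dropping counters that hit zero.
                for item in list(counts):
                    while counts[item] > 0 and random.random() < 0.5:
                        counts[item] -= 1
                    if counts[item] == 0:
                        del counts[item]
        return counts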

34 Lossy Counting
- Stores the observed frequency and the estimated maximal frequency error for each frequent, or potentially frequent, item in a series of conceptual buckets
- New items are added to the buckets, and existing, less frequent items are removed from them as the stream progresses
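
And a sketch of Lossy Counting for singleton items, with conceptual buckets of width w = ceil(1/eps); each entry keeps its observed count plus the maximal possible undercount:

    import math

    def lossy_counting(stream, eps):
        w = math.ceil(1 / eps)          # bucket width
        counts = {}                     # item -> [observed count, max undercount]
        for n, x in enumerate(stream, start=1):
            bucket = math.ceil(n / w)   # id of the current bucket
            if x in counts:
                counts[x][0] += 1
            else:
                counts[x] = [1, bucket - 1]
            if n % w == 0:              # bucket boundary: prune infrequent entries
                for item in list(counts):
                    c, d = counts[item]
                    if c + d <= bucket:
                        del counts[item]
        return counts

For a support threshold s, the items whose count plus undercount reaches (s − eps) · n are then reported as frequent.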