
1 Efficient Data Reduction Methods for Online Association Rule Discovery (NGDM'02)
Herve Bronnimann, Bin Chen, Manoranjan Dash, Peter Haas, Yi Qiao, Peter Scheuermann
Presented by: Ivy Tong, 18 June 2003

2 Outline
- Motivation
- FAST
- Epsilon Approximation
- Experimental Results
- Data Stream Reduction
- Conclusion

3 Motivation
- The volume of data in warehouses and on the Internet is growing faster than Moore's Law
  - Scalability is a major concern
  - Classical algorithms require one or more scans of the database
- Need to adapt to streaming data
  - Data elements arrive online
  - Limited amount of memory
- One solution: execute the algorithm on a subset of the data

4 Motivation
- Sampling methods
  - Advantage: can explicitly trade off accuracy and speed
  - Work best when tailored to the application
- Contributions of this paper
  - Sampling methods for count datasets
  - Application: association rule mining

5 Notations
- D: database of interest
- S: a simple random sample drawn without replacement from D
- I: the set of all items that appear in D
- I(D): the collection of itemsets that appear in D; I(S): itemsets that appear in S
- For k ≥ 1, I_k(D) and I_k(S) denote the collections of k-itemsets in D and S
- L(D) and L(S): frequent itemsets in D and S
- L_k(D) and L_k(S): collections of frequent k-itemsets in D and S
- For an itemset A ⊆ I and a set of transactions T:
  - n(A;T): the number of transactions in T that contain A
  - |T|: the total number of transactions in T
- Support of A in D: f(A;D) = n(A;D)/|D|; support of A in S: f(A;S) = n(A;S)/|S|
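
For concreteness, here is a minimal Python sketch of the support computation defined above; the transaction representation (sets of item labels) and the toy database are illustrative assumptions, not from the paper.

# Support of an itemset A in a transaction set T: f(A;T) = n(A;T) / |T|.
def support(A, T):
    """A: a set of items; T: a list of transactions, each a set of items."""
    n_A_T = sum(1 for t in T if A <= t)   # n(A;T): transactions containing A
    return n_A_T / len(T)                 # f(A;T)

# Example: the 1-itemset {"milk"} appears in 2 of 3 transactions.
D = [{"milk", "bread"}, {"milk"}, {"beer", "bread"}]
print(support({"milk"}, D))  # 0.666...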

6 Problem Definition
- Generate a smaller subset S_0 of a larger sample S such that the supports of 1-itemsets in S_0 are close to those in S
- I_1(T): the set of all 1-itemsets in transaction set T
- L_1(T): the set of frequent 1-itemsets in transaction set T
- f(A;T): the support of itemset A in transaction set T

7 FAST Algorithm
- Finding Association rules from Sampled Transactions (SIGKDD'02)
- Given a specified minimum support p and confidence c, FAST proceeds as follows:
  1. Obtain a large simple random sample S from D.
  2. Compute f(A;S) for each 1-itemset A.
  3. Using the supports computed in Step 2, obtain a reduced sample S_0 from S by trimming away outlier transactions.
  4. Run a standard association-rule algorithm against S_0, with minimum support p and confidence c, to obtain the final set of association rules.

8 FAST-trim
- Removes the "outlier" transactions from the sample S to obtain S_0
- Outlier: a transaction whose removal from S maximally decreases (or minimally increases) the difference between the supports of the 1-itemsets in S and the corresponding supports in D
- Since the supports of items in D are unknown, they are estimated from S as computed in Step 2
- Distance function used (the L1 form; an L2 variant also appears in the experiments):
  Dist_1(S_0, S) = Σ_{A ∈ I_1(S)} |f(A;S_0) − f(A;S)|
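
As a sketch, the two distance functions evaluated in the experiments (Dist_1 and Dist_2) can be written in Python using the support helper defined earlier; treating the item universe as an explicit argument is a simplifying assumption.

# L1 and L2 distances between the 1-itemset frequency vectors of S0 and S.
def dist1(S0, S, items):
    return sum(abs(support({a}, S0) - support({a}, S)) for a in items)

def dist2(S0, S, items):
    return sum((support({a}, S0) - support({a}, S)) ** 2 for a in items) ** 0.5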

9 FAST-trim
- Uses an input parameter k to explicitly trade off speed and accuracy (1 ≤ k ≤ |S|)

Trimming phase:

while (|S_0| > n) {
    divide S_0 into disjoint groups of min(k, |S_0|) transactions each;
    for each group G {
        compute f(A;S_0) for each item A;
        set S_0 = S_0 − {t*}, where t* ∈ G satisfies
            Dist(S_0 − {t*}, S) = min_{t ∈ G} Dist(S_0 − {t}, S);
    }
}

Note: removal of the outlier t* causes the maximum decrease (or minimum increase) in Dist(S_0, S).
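
The trimming loop can be sketched in Python as follows. This is a naive rendering for clarity: the real algorithm maintains item frequencies incrementally, whereas this version recomputes the distance from scratch for each candidate removal; dist is one of the functions sketched above.

def fast_trim(S, n, k, dist):
    """Trim sample S down to n transactions.
    S: list of transactions (sets); n: target size; k: group-size knob;
    dist: function(candidate_sample, reference_sample) -> float."""
    S0 = list(S)
    while len(S0) > n:
        # Partition the current sample into disjoint groups of up to k transactions.
        groups = [S0[i:i + k] for i in range(0, len(S0), k)]
        for G in groups:
            if len(S0) <= n:
                break
            # Remove the outlier t*: the transaction in G whose removal
            # minimizes Dist(S0 - {t}, S).
            t_star = min(G, key=lambda t: dist([u for u in S0 if u is not t], S))
            S0.remove(t_star)
    return S0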

10 FAST-grow
- Selects representative transactions from S and adds them to the sample S_0, which is initially empty

Growing phase:

while (|S_0| < n) {
    divide S into disjoint groups of min(k, |S|) transactions each;
    for each group G {
        compute f(A;S_0) for each item A;
        set S_0 = S_0 ∪ {t*}, where t* ∈ G satisfies
            Dist(S_0 ∪ {t*}, S) = min_{t ∈ G} Dist(S_0 ∪ {t}, S);
    }
}
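
A matching sketch of the growing phase, under the same assumptions as the trimming sketch; since S_0 starts empty, the groups here are drawn from S rather than S_0.

def fast_grow(S, n, k, dist):
    """Grow an initially empty sample S0 up to n representative transactions."""
    S0 = []
    while len(S0) < n:
        for i in range(0, len(S), k):
            if len(S0) >= n:
                break
            G = S[i:i + k]
            # Add the representative t*: the transaction in G whose addition
            # minimizes Dist(S0 + {t}, S).
            t_star = min(G, key=lambda t: dist(S0 + [t], S))
            S0.append(t_star)
    return S0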

11 Epsilon Approximation (EA)
- Similar in spirit to FAST: find a small subset whose 1-itemset supports are close to those in the entire database
- The discrepancy of a subset S_0 of a superset S_1 (the distance between S_0 and S_1 with respect to the 1-itemset frequencies) is computed as the L∞ distance between the frequency vectors
- Definition: a sample S_0 of S_1 is an ε-approximation iff the discrepancy satisfies Dist_∞(S_0, S_1) ≤ ε
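
The L∞ discrepancy has the same shape as the distance sketches above, with a max in place of the sum; this sketch reuses the support helper from the notation slide.

# L-infinity discrepancy between the 1-itemset frequency vectors of S0 and S1.
def dist_inf(S0, S1, items):
    return max(abs(support({a}, S0) - support({a}, S1)) for a in items)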

12 Epsilon Approximation (EA): Halving Method
- Deterministically halves the data to get the sample S_0
- Apply halving repeatedly: S_1 => S_2 => ... => S_t (= S_0)
- Each halving step introduces a discrepancy ε_i(n_i, m), where
  m = total number of items in the database
  n_i = size of the sub-sample S_i
- Halving stops with the maximum t such that Σ_{i=1}^{t} ε_i(n_i, m) ≤ ε

13 Epsilon Approximation (EA)
1. Color each transaction red (in the sample) or blue (not in the sample)
2. Maintain a penalty for each item that reflects the color balance:
   - The penalty is small if red/blue are approximately balanced
   - The penalty shoots up exponentially when red dominates (the item is over-sampled) or blue dominates (the item is under-sampled)
3. Color transactions sequentially, keeping the penalty low:
   - Choose the color that gives the smaller penalty

14 Epsilon Approximation (EA)
- Penalty computation
  - Let Q_i = penalty for item A_i; initially Q_i = 2
  - Suppose that we have colored the first j transactions, where
    r_i = r_i(j) = number of red transactions containing A_i
    b_i = b_i(j) = number of blue transactions containing A_i
    δ ∈ (0,1) = parameter that influences how fast the penalty changes as a function of |r_i − b_i|
  - One standard penalty of this form (consistent with Q_i = 2 when r_i = b_i = 0):
    Q_i = (1 + δ)^(r_i − b_i) + (1 + δ)^(b_i − r_i)
  - These penalties yield the per-halving error bound ε_i(n_i, m) used by the halving method
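
A sketch of the penalty in the hyperbolic-cosine form given above; the paper's exact constants may differ, but this form matches the slide's stated properties (Q_i = 2 when balanced, exponential growth when one color dominates).

def penalty(r_i, b_i, delta):
    # Q_i = (1+delta)^(r_i - b_i) + (1+delta)^(b_i - r_i), with delta in (0,1).
    return (1 + delta) ** (r_i - b_i) + (1 + delta) ** (b_i - r_i)

print(penalty(0, 0, 0.1))    # 2.0: red and blue balanced, minimum penalty
print(penalty(10, 0, 0.1))   # ~2.98: item over-sampled, penalty climbing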

15 Epsilon Approximation (EA)
- How to color transaction j+1:
  - Compute the global penalty (the sum of the per-item penalties) under each choice:
    Q^red(j+1) = global penalty assuming transaction j+1 is red
    Q^blue(j+1) = global penalty assuming transaction j+1 is blue
  - Choose the color for which the global penalty is smaller
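
Putting the pieces together, one halving pass might look like the following sketch, reusing the penalty function above. Recomputing the global penalty over all items for each transaction is a simplification; a practical version would only update the items occurring in the transaction.

def ea_halve(transactions, items, delta=0.1):
    """One EA halving pass: color transactions red (keep) or blue (drop)."""
    r = {a: 0 for a in items}   # red counts per item
    b = {a: 0 for a in items}   # blue counts per item
    sample = []
    for t in transactions:
        # Global penalty under each hypothetical coloring of this transaction.
        q_red = sum(penalty(r[a] + (a in t), b[a], delta) for a in items)
        q_blue = sum(penalty(r[a], b[a] + (a in t), delta) for a in items)
        if q_red <= q_blue:          # red is (weakly) better: keep it
            sample.append(t)
            for a in t:
                r[a] += 1
        else:                        # blue is better: forget it
            for a in t:
                b[a] += 1
    return sample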

16 Epsilon Approximation (EA)
[Flowchart: initialization; compute the penalty of each item; compute the global penalty; decide whether to color the incoming transaction red or blue; red transactions are added to the sample, blue ones are forgotten.]

17 Epsilon Approximation (EA)
- The repeated halving method starts with S
  - Apply one round of halving to get S_1, then another round of halving to S_1 to get S_2, etc.
- If S_1 is an ε_1-approximation of S and S_2 is an ε_2-approximation of S_1, then S_2 is an (ε_1 + ε_2)-approximation of S
- S_t is an ε_t-approximation, where ε_t = Σ_{k ≤ t} ε(n_k, m)
- Stop repeated halving at the maximum t such that ε_t ≤ ε
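
The repeated-halving loop can then be sketched as below. The per-halving bound ε_i(n_i, m) used here is an assumed O(sqrt(log(m)/n)) form typical for this kind of deterministic halving; the paper's exact expression and constants may differ.

import math

def eps_bound(n_i, m):
    # Assumed per-halving discrepancy bound; constants are placeholders.
    return math.sqrt(2 * math.log(2 * m) / n_i)

def repeated_halving(S, items, eps, delta=0.1):
    total = 0.0
    while len(S) >= 2:
        step = eps_bound(len(S), len(items))
        if total + step > eps:      # stop at the max t with eps_t <= eps
            break
        S = ea_halve(S, items, delta)
        total += step
    return S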

18 Epsilon Approximation (EA)
- Repeated halving requires t passes over the database
- Observation: halving is sequential in deciding the color of a transaction
- In a single pass:
  - Store all the penalties of each halving round
  - Based on the penalties from the 1st halving, decide whether to color the transaction red
  - If red, compute the penalty for the 2nd halving, etc., until the transaction is colored blue or belongs to S_t, t = log n

19 Experiments
- Synthetic data set (IBM QUEST project)
  - 100,000 transactions
  - 1,000 items
  - Number of maximal potentially large itemsets: 2,000
  - Average transaction length: 10
  - Average length of maximal large itemsets: 4
  - Length of the maximal large itemsets: 6
  - Minimum support: 0.77%

20 Experiments
- Apriori is used in all cases to compute the large itemsets
- Accuracy and execution time are measured
- FAST
  - Two implementations: Dist_1 and Dist_2
  - Phase 1 sample size: 30%
  - Parameter k: 10
- EA
  - Run EA with a given ε value, then use the obtained sample size to run FAST and SRS (simple random sampling)
  - Final sampling ratios, dictated by the EA halvings: 0.76%, 1.51%, 3.02%, 6.04%, 12.4%, and 24.9%

21 Experimental Results
[Chart: accuracy vs. sampling ratio]

22 Experimental Results
[Chart: execution time vs. sampling ratio]

23 Streaming Data Analysis
- Streaming databases grow continuously, rapidly, and without bound
- Example applications:
  - Stock tickers
  - Network traffic monitors
  - POS systems
  - Phone conversation wiretaps
- Challenges of analysis:
  - Timely response
  - Use of limited memory

24 Previous Work
- Algorithms that identify frequent singleton items over a data stream (VLDB'02; see http://www.csis.hku.hk/~dbgroup/seminar/seminar021004.htm):
  - Sticky Sampling
  - Lossy Counting
- Problems
  - They can accurately maintain statistics of items over a stable data stream in which patterns change slowly
  - They fail for applications that require information about the entire stream but with emphasis on the most recent data

25 DSR: Data Stream Reduction
- An EA-based algorithm for sampling data streams
- Goal: generate a sample that carries information about the entire stream while favoring recent data
- Model:
  - Each element of the data stream is a transaction consisting of a set of items (a 0-1 problem)
  - Suppose we want to generate an N_s-element sample S_s
  - S_s puts more weight on recent data

26 Data Stream Reduction
- A representative sample of the data stream
[Figure: the stream is divided into m_s buckets of N_s/2 transactions each, numbered 1 (oldest) through m_s (newest).]
- To generate an N_s-element sample, bucket k is halved (m_s − k) times
- Total number of transactions covered = m_s · N_s / 2
- Transactions in the sample: the newest bucket contributes N_s/2, the next N_s/4, then N_s/8, ..., so the total is approximately N_s
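
The bucket arithmetic can be checked with a few lines of Python; the numbers below are illustrative, not from the paper.

# Bucket k (1 = oldest, m_s = newest) holds N_s/2 transactions and is halved
# (m_s - k) times, so it contributes N_s / 2^(m_s - k + 1) to the sample.
def bucket_contributions(N_s, m_s):
    return [N_s // 2 ** (m_s - k + 1) for k in range(1, m_s + 1)]

print(bucket_contributions(64, 4))   # [4, 8, 16, 32]; the sum (60) is ~N_s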

27 Problem: Frequent Halving
- It is expensive to map transactions into conceptual buckets and compute the representative subset of each bucket whenever a new transaction arrives
- Goal: simulate the ideal scenario while avoiding frequent halving
- Solution: use a working buffer that holds N_s transactions, and compute a new representative sample by applying EA whenever the buffer is full
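
A sketch of the buffer-based scheme, reusing the ea_halve sketch from the EA slides; how the real implementation interleaves arrivals with halving is more refined than this.

def dsr_stream(stream, items, N_s, delta=0.1):
    """Keep a working buffer of up to N_s transactions; whenever it fills,
    apply one EA halving pass. Data that has been in the buffer longer has
    survived more halvings, so recent data is represented more densely."""
    buffer = []
    for t in stream:
        buffer.append(t)
        if len(buffer) >= N_s:
            buffer = ea_halve(buffer, items, delta)
    return buffer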

28 Frequent Halving
[Figure: the N_s-transaction working buffer over seven steps. It starts empty, fills, and is halved; as it refills and is halved again, the surviving portions of the buffer have been halved 0, 1, 2, or 3 times, with older data halved more often.]

29 Problem of Halving
- Problem: two users querying immediately before and after a halving operation see data that varies substantially
- Continuous DSR: the buffer is divided into N_s/(2n_s) chunks, with n_s << N_s
[Figure: as each batch of n_s new transactions arrives, the oldest chunk is halved first; chunk boundaries at n_s, 3n_s, 5n_s, ..., N_s − n_s shift to 2n_s, 4n_s, ..., N_s − 2n_s, so the sample evolves gradually rather than in one jump.]

30 Discussions
- Advantages of DSR
  - DSR is more sensitive to recent changes in the stream
  - DSR generates a representative sample instead of collecting statistics such as counts, which gives more flexibility
  - Each halving operation is relatively cheap compared to the expensive frequent-itemset identifications in the Lossy Counting based approach
- Future work
  - Choice of discrepancy function (currently based on single-item frequencies)
  - How to evaluate the goodness of a representative subset

31 Conclusions
- FAST: a two-phase sampling approach based on trimming outliers or selecting representative transactions
- Epsilon Approximation: a deterministic method that repeatedly halves the data to obtain the final sample
- Both can be used in conjunction with other non-sampling count-based mining algorithms
- Both trade off processing speed against accuracy of results
- DSR: EA-based data stream reduction

32 References
- H. Bronnimann, B. Chen, M. Dash, P. Haas, Y. Qiao, and P. Scheuermann. Efficient Data-Reduction Methods for On-Line Association Rule Discovery. NGDM'02.
- B. Chen, P. Haas, and P. Scheuermann. A New Two-Phase Sampling Based Algorithm for Discovering Association Rules. SIGKDD'02.
- G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB'02.

33 Sticky Sampling
- Uses a fixed-size buffer and varying sampling rates to estimate counts
- Sample the first 2t incoming items at rate r = 1 (select one item for every item seen); sample the next 2t items at rate r = 2 (one for every two seen); for the next 4t items, r = 4; and so on
- t is predefined based on the frequency threshold, the user-specified error, and the probability of failure
- Effectively, the same number of elements is randomly selected from an enlarging moving window that keeps doubling itself
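
A compact Python sketch of Sticky Sampling's schedule and count maintenance, following the description above; deriving t from the support threshold, error, and failure probability is omitted here, and t is taken as a given parameter.

import random

def sticky_sampling(stream, t):
    counts, r, seen, limit = {}, 1, 0, 2 * t
    for x in stream:
        seen += 1
        if x in counts:
            counts[x] += 1                 # tracked items are always counted
        elif random.random() < 1.0 / r:
            counts[x] = 1                  # new items enter at rate 1/r
        if seen == limit:                  # rate doubles at 2t, 4t, 8t, ...
            r *= 2
            limit += r * t
            # Diminish each count by coin tosses so the summary looks as if
            # it had been sampled at the new, lower rate all along.
            for key in list(counts):
                while counts[key] > 0 and random.random() < 0.5:
                    counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts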

34 Lossy Counting
- Stores the observed frequency and the estimated maximal frequency error for each frequent, or potentially frequent, item in a series of conceptual buckets
- Keeps adding new items to, and removing existing less-frequent items from, the buckets
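
A sketch of Lossy Counting as summarized above: process the stream in buckets of width 1/ε, track a (count, maximal-undercount) pair per item, and prune weak entries at bucket boundaries.

import math

def lossy_counting(stream, epsilon):
    width = math.ceil(1 / epsilon)
    entries = {}   # item -> [count, max_error]
    for n, x in enumerate(stream, start=1):
        bucket = math.ceil(n / width)
        if x in entries:
            entries[x][0] += 1
        else:
            entries[x] = [1, bucket - 1]   # may have been missed in earlier buckets
        if n % width == 0:                 # bucket boundary: prune weak entries
            entries = {i: cd for i, cd in entries.items()
                       if cd[0] + cd[1] > bucket}
    return entries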

