
1 Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules
S.D. Lee, David W. Cheung, Ben Kao (The University of Hong Kong)
Data Mining and Knowledge Discovery, 1998
Presenter: Tri Tran, CS331, Spring 2006

2 Outline
- Introduction
- Problem Descriptions and Solutions
  - Mining of Association Rules
  - Update of Association Rules
  - Scheduling Update of Association Rules
- DELI Algorithm
- Example of DELI Algorithm
- Experimental Results
- Conclusions

3 Introduction
- Data mining is applicable in many areas, such as decision support, market strategy, and financial forecasting
- It enables us to find useful information in huge databases
- It enables marketers to develop and implement customized marketing programs and strategies
- Mining of association rules is one of the most common data mining problems

4 Introduction (cont.)
- Databases keep changing over time, so the set of discovered association rules needs to be updated to reflect the changes; maintenance of discovered association rules is therefore also an important problem
- Existing solutions scan the database multiple times to discover the association rules exactly:
  - Apriori algorithm: discovers the set of association rules
  - FUP2 algorithm: efficiently updates the discovered association rules when transactions are added to, deleted from, or modified in the database
- The authors propose an algorithm, DELI, to determine when a rule update should be applied; it estimates the maximum amount of change in the set of rules due to the database updates, using sampling techniques

5 Problem 1: Mining of Association Rules
- Given a database D of transactions and a set of possible items, find the large itemsets
- Transaction: a non-empty set of items
- Large itemset: an itemset whose transaction support is above a pre-specified support threshold, s%
- Association rule: X => Y, where X and Y are itemsets
- By examining the large itemsets, find the association rules whose confidence is above a confidence threshold, c% (a toy computation follows below)
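A toy computation may make the definitions concrete. The sketch below (Python; the transactions and item names are made up for illustration) counts support and confidence for one rule:

```python
# A minimal sketch of support/confidence counting over a toy database.
# The transactions and the rule below are hypothetical illustrations.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

X, Y = {"bread"}, {"milk"}
supp_rule = support(X | Y, transactions)          # support of X => Y: 0.50
conf_rule = supp_rule / support(X, transactions)  # confidence of X => Y: 0.67
print(f"support={supp_rule:.2f}, confidence={conf_rule:.2f}")
# With thresholds s% = 50 and c% = 60, the rule bread => milk holds,
# since 0.50 >= 0.50 and 0.67 >= 0.60.
```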

6 Solution: Apriori Algorithm
- Finds the large itemsets iteratively
- At iteration k:
  - Use the large (k-1)-itemsets, Lk-1, to generate candidate itemsets of size k, Ck
  - Check which candidates have support above the pre-specified threshold and add them to the large k-itemsets, Lk
- At every iteration, it scans the database to count the transactions that contain each candidate itemset
- A large amount of time is therefore spent scanning the whole database (see the sketch below)
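A compact sketch of this loop in Python (the function and variable names are mine; this is a didactic rendering, not the paper's exact pseudocode):

```python
from itertools import combinations

def apriori(db, s):
    """Sketch of Apriori. `db` is a list of sets of items,
    `s` is the minimum support as a fraction (the slides' s%)."""
    minimum = s * len(db)
    items = {i for t in db for i in t}
    # L1: large 1-itemsets, found with one scan
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in db) >= minimum}
    large, k = set(Lk), 2
    while Lk:
        # Join step: unions of two large (k-1)-itemsets that have size k
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be large
        Ck = {c for c in Ck
              if all(frozenset(sub) in Lk for sub in combinations(c, k - 1))}
        # The expensive part: one full scan of db per iteration
        Lk = {c for c in Ck if sum(c <= t for t in db) >= minimum}
        large |= Lk
        k += 1
    return large
```

The last line of the loop body is the scan the slide warns about: every iteration touches every transaction in the database.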

7 Problem 2: Update of Association Rules
- After some updates have been applied to a database, find the new large itemsets and their support counts in an efficient manner
- All database updates are either insertions or deletions
- Association rule maintenance problem: efficiently update the discovered association rules by reusing the old database mining results

8 Update of Association Rules
Notation:
- Δ-: set of deleted transactions
- Δ+: set of added transactions
- D: old database
- D': updated database
- D*: set of unchanged transactions
- σX: support count of itemset X in D
- σ'X: new support count of itemset X in D'
- δX-: support count of itemset X in Δ-
- δX+: support count of itemset X in Δ+
Then:
- D' = (D - Δ-) ∪ Δ+
- σ'X = (σX - δX-) + δX+
(The slide's Venn diagram relating D, D*, Δ- and Δ+ is omitted here.)
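The update formula translates directly into code; the check below plugs in itemset A's counts from the example table on slide 20:

```python
def new_support_count(sigma, delta_minus, delta_plus):
    """sigma'_X = (sigma_X - delta-_X) + delta+_X  (slide 8's formula)."""
    return sigma - delta_minus + delta_plus

# Itemset A from the example on slide 20: 24803 - 497 + 512 = 24818
assert new_support_count(24803, 497, 512) == 24818
```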

9 FUP2 Algorithm
- Addresses the maintenance problem
- Apriori fails to use the old data mining results; FUP2 reduces the amount of work that needs to be done
- FUP2 works similarly to Apriori, generating large itemsets iteratively
- For old large itemsets, it scans only the updated part of the database
- For the rest, it scans the whole database

10 FUP2 Algorithm
- Finds the large itemsets iteratively by reusing the results of the previous mining
- At iteration k (sketched in code below):
  - Use the new large (k-1)-itemsets L'k-1 (w.r.t. D') to generate candidate itemsets of size k, Ck
  - Find the support counts of the candidate itemsets in Ck
  - Divide Ck into two partitions: Pk = Ck ∩ Lk and Qk = Ck - Pk
  - For X in Pk, calculate σ'X = (σX - δX-) + δX+
  - For X in Qk, eliminate candidates with δX+ - δX- <= (|Δ+| - |Δ-|) × s%: they cannot be large in D'
  - For each remaining candidate X in Qk, scan D* to find its count there and add δX+ to get σ'X
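A sketch of one such iteration, under assumed data structures (dict-based counts and set-valued itemsets are my choices; FUP2's actual bookkeeping differs):

```python
def fup2_iteration(Ck, L_old, sigma, d_minus, d_plus, D_star,
                   size_new, size_add, size_del, s):
    """One FUP2 iteration. sigma/d_minus/d_plus map itemsets to support
    counts (d_minus/d_plus come from scanning only Delta- and Delta+),
    D_star is the unchanged part of the database, size_new = |D'|,
    size_add = |Delta+|, size_del = |Delta-|, s is the support fraction."""
    L_new = set()
    Pk = Ck & L_old            # was large in D: old count sigma[X] is known
    Qk = Ck - Pk               # was not large in D: old count is unknown
    for X in Pk:
        # sigma'_X = (sigma_X - delta-_X) + delta+_X, no scan of D* needed
        if sigma[X] - d_minus.get(X, 0) + d_plus.get(X, 0) >= size_new * s:
            L_new.add(X)
    for X in Qk:
        # Prune: X was small in D, so it can only be large in D' if its
        # count gain exceeds (|Delta+| - |Delta-|) * s%
        if d_plus.get(X, 0) - d_minus.get(X, 0) <= (size_add - size_del) * s:
            continue
        # Otherwise a scan of the unchanged transactions D* is required
        count_star = sum(X <= t for t in D_star)
        if count_star + d_plus.get(X, 0) >= size_new * s:
            L_new.add(X)
    return L_new
```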

11 Problem 3: Find the Difference between the Old and New Association Rules
- Before doing the update to find L', we want to know the difference between L and L'
- Symmetric difference: measures how many large itemsets have been added and deleted after the database update (one plausible formulation below)
- If it is too large => it is time to update the association rules
- If it is too small => the old association rules are a good approximation for the updated database
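One plausible way to quantify this in Python (normalizing by the union of the two sets is my assumption; the paper may define the measure slightly differently):

```python
def rule_difference(L_old, L_new):
    """Size of the symmetric difference between old and new large
    itemsets, normalized by the union. A value near 0 means the old
    rules still approximate the updated database well."""
    changed = (L_old - L_new) | (L_new - L_old)
    return len(changed) / max(len(L_old | L_new), 1)
```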

12 DELI Algorithm
- Difference Estimation for Large Itemsets
- Purpose: estimate the difference between the association rules in a database before and after it is updated
- Decides whether to update the association rules
- Key idea: approximate an upper bound on the size of the association rule change by examining a sample of the database
- Advantage: DELI saves machine resources and time

13 DELI Algorithm
- Input: the old support counts, D, Δ+ and Δ-
- Output: a Boolean value indicating whether a rule update is needed
- Iterative algorithm: construct Ck from ~Lk-1, which is an approximation of L'k-1
- In each iteration, estimate the support counts of the itemsets in Ck using a sample S of m random transactions drawn from database D

14 DELI Algorithm – Step 1
- Obtain a random sample S of size m from database D
- In each iteration (see the sketch below):
  - Generate a candidate set Ck:
    - Ck = I (all 1-itemsets) for k = 1
    - Ck = apriori_gen(~Lk-1) for k > 1
  - Divide Ck into two partitions: Pk = Ck ∩ Lk and Qk = Ck - Pk
(The slide's diagram of Ck split into Pk and Qk against Lk is omitted here.)
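A sketch of Step 1 under the same assumed data structures as before (apriori_gen is the standard join-and-prune generation from slide 6; the function names and calling convention are mine):

```python
import random
from itertools import combinations

def apriori_gen(L_prev, k):
    """Standard join + prune candidate generation (as on slide 6)."""
    Ck = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    return {c for c in Ck
            if all(frozenset(sub) in L_prev for sub in combinations(c, k - 1))}

def deli_step1(D, m, k, L_approx_prev, L_old_k, S=None):
    """Step-1 sketch: the sample S is drawn once and reused thereafter."""
    if S is None:
        S = random.sample(D, m)              # m random transactions from D
    if k == 1:
        Ck = {frozenset([i]) for t in D for i in t}   # all 1-itemsets
    else:
        Ck = apriori_gen(L_approx_prev, k)
    Pk = Ck & L_old_k     # candidates that were large in the old database
    Qk = Ck - Pk          # candidates that were not
    return S, Ck, Pk, Qk
```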

15 DELI Algorithm – Step 2
- Pk: the itemsets of size k that were large (> |D| × s%) in the old database and are potentially large in the new one
- For each itemset X ∈ Pk:
  - σ'X = (σX - δX-) + δX+ (scan only Δ- and Δ+)
  - If σ'X >= |D'| × s%, then add X to Lk» (Lk»: itemsets large in both the old and new databases)
(The slide repeats the Venn diagram of D, D*, Δ- and Δ+.)

16 DELI Algorithm – Step 3
- Qk: the itemsets of size k that were not large (< |D| × s%) in the old database
- For each itemset X ∈ Qk:
  - If δX+ - δX- <= (|Δ+| - |Δ-|) × s%, then delete X from Qk
  - This prunes away candidate itemsets whose support counts cannot be large (< |D'| × s%) in the new database
- For each remaining itemset X ∈ Qk:
  - Find the support count TX of X in the sample S (a binomially distributed random variable)
  - From TX, estimate the support count σX of X in D and obtain an interval [aX, bX] with 100(1-α)% confidence (sketched below)
  - Then σ'X ∈ [aX + ΔX, bX + ΔX], where ΔX = δX+ - δX-
  - Reason: σ'X = σX + (δX+ - δX-)
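The interval can come from a normal approximation to the binomial distribution of TX. The sketch below reproduces the BE figures used in the example on slide 22; those figures are consistent with a sample of m = 10000 and z = 1.96 (both inferred from the numbers, not stated on this slide):

```python
import math

def support_interval(Tx, m, D_size, z=1.96):
    """Normal approximation to the binomial: Tx ~ B(m, sigma_X / |D|).
    Returns a point estimate and a 100(1-alpha)% confidence interval
    for sigma_X; z = 1.96 corresponds to alpha = 0.05 (95% confidence)."""
    p = Tx / m
    est = p * D_size
    half = z * D_size * math.sqrt(p * (1 - p) / m)
    return est, (est - half, est + half)

# The BE numbers from the worked example (Tx = 202, assuming m = 10000):
est, (a, b) = support_interval(202, 10_000, 1_000_000)
print(round(est), round(a), round(b))   # 20200 17443 22957
```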

17 DELI Algorithm – Step 3 (cont.)
- For each itemset X ∈ Qk, compare the estimated interval σ'X ∈ [aX + ΔX, bX + ΔX] with |D'| × s%:
  - If aX + ΔX >= |D'| × s%: X goes into Lk> (itemsets that were not large in D but are large in D' with 100(1-α)% confidence)
  - If aX + ΔX < |D'| × s% <= bX + ΔX: X goes into Lk≈ (itemsets that were not large in D and may be large in D')
(The slide's number-line figure placing |D'| × s% between aX + ΔX and bX + ΔX is omitted here.)

18 DELI Algorithm – Step 4
- Obtain the estimated set of large itemsets of size k: ~Lk = Lk» ∪ Lk> ∪ Lk≈
- Itemsets:
  - Lk»: large in D, large in D' (Step 2)
  - Lk>: not large in D, large in D' with a certain confidence (Step 3)
  - Lk≈: not large in D, maybe large in D' (Step 3)
- ~Lk is an approximation of the new Lk
- ~Lk is an overestimate, so the difference between ~Lk and Lk gives an upper bound
(The slide's diagram relating Ck, Pk, Qk, Lk and ~Lk is omitted here.)

19 DELI Algorithm – Step 5
- Decide whether an association rule update is needed (decision sketched below):
  - If the uncertainty (|Lk≈| / |~Lk|) is too large => DELI halts; an update is needed
  - If the symmetric difference of the large itemsets is too large => DELI halts; an update is needed
  - If ~Lk is empty => DELI halts; no update is necessary
  - If ~Lk is non-empty => k = k + 1, go to Step 1
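A sketch of this decision logic (the threshold parameters and the exact stopping rule are my reading of the slide, not the paper's pseudocode):

```python
def deli_step5(L_sure, L_conf, L_maybe, sym_diff_so_far, total_so_far,
               max_uncertainty, max_difference):
    """Step-5 decision sketch. L_sure/L_conf/L_maybe are Lk>>, Lk> and
    Lk~; the running symmetric-difference tally and its normalizer,
    and the two thresholds, are assumed inputs."""
    L_approx_k = L_sure | L_conf | L_maybe
    if L_approx_k and len(L_maybe) / len(L_approx_k) > max_uncertainty:
        return "halt: update needed (estimate too uncertain)"
    if sym_diff_so_far / max(total_so_far, 1) > max_difference:
        return "halt: update needed (rules have drifted)"
    if not L_approx_k:
        return "halt: no update necessary"
    return "continue with k + 1"
```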

20 DELI Algorithm – Example
|D| = 10^6, |Δ-| = 9000, |Δ+| = 10000, s% = 2%

X     σX     in L?  δX-  δX+  σ'X    in L'?
A     24803  Y      497  512  24818  Y
B     31305  Y      690  823  31438  Y
C     24323  Y      760  847  24410  Y
D     27887  Y      192  185  27880  Y
E     21208  Y      503  496  21201  Y
F     19887  N      120  137  19904  N
AB    23766  Y      80   99   23785  Y
AC    22302  Y      92   98   22308  Y
AD    21188  Y      113  135  21210  Y
AE    20100  Y      194  103  20009  N
BC    22321  Y      108  137  22350  Y
BD    25086  Y      172  153  25067  Y
BE    19803  N      202  436  20037  Y
CD    23847  Y      125  152  23874  Y
CE    14467  N      480  438  14425  N
DE    16782  N      168  152  16766  N
ABC   21033  Y      79   96   21050  Y
ABD   13744  N      58   90   13776  N
ACD   20387  Y      85   96   20398  Y
BCD   20211  Y      87   120  20244  Y

21 DELI Algorithm – Example
k = 1:
1) C1 = {A, B, C, D, E, F}, P1 = {A, B, C, D, E}, Q1 = {F}
2) P1: |D'| × s% = 20020 => L1» = {A, B, C, D, E}
3) Q1: δF+ - δF- = 137 - 120 = 17; (|Δ+| - |Δ-|) × s% = 20; since 17 <= 20, drop F
4) ~L1 = L1 = {A, B, C, D, E}
5) Update? No. k = 2, proceed to Step 1.

22 DELI Algorithm – Example
k = 2:
1) ~L1 = {A, B, C, D, E}; P2 = {AB, AC, AD, AE, BC, BD, CD}, Q2 = {BE, CE, DE}
2) P2: |D'| × s% = 20020 => L2» = {AB, AC, AD, BC, BD, CD} (AE falls below the threshold)
3) Q2: drop CE and DE, because δX+ - δX- <= (|Δ+| - |Δ-|) × s%
   For BE: assume the support count of BE in S is TX = 202 => estimated σX = 20200
   95% confidence interval for σX: [20200 - 2757, 20200 + 2757] = [17443, 22957]
   Confidence interval for σ'X: [17677, 23191]
   Since 17677 < 20020 <= 23191, the threshold falls inside the interval => L2≈ = {BE}, L2> = Ø
   (The arithmetic is reproduced in the snippet below.)
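Reproducing the interval arithmetic for BE (all numbers taken from the table on slide 20 and the confidence interval above):

```python
# BE's counts from the table on slide 20
d_plus, d_minus = 436, 202
delta_BE = d_plus - d_minus                       # 234
a, b = 20200 - 2757, 20200 + 2757                 # CI for sigma_BE: [17443, 22957]
lo, hi = a + delta_BE, b + delta_BE               # CI for sigma'_BE
threshold = (1_000_000 - 9_000 + 10_000) * 0.02   # |D'| * s% = 20020.0
print(lo, hi, threshold)                          # 17677 23191 20020.0
# 17677 < 20020 <= 23191: the threshold lies inside the interval, so BE
# is only *maybe* large in D'  =>  BE goes into L2~, and L2> stays empty.
```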

23 DELI Algorithm – Example
k = 2:
4) ~L2 = {AB, AC, AD, BC, BD, CD, BE}
5) Update? No (uncertainty = 1/7 and difference = 2/15). k = 3, proceed to Step 1.
k = 3: …
4) ~L3 = {ABC, ACD, BCD}
5) Update? No (uncertainty = 0 and difference = 2/15).
k = 4: C4 = Ø => STOP.
Returns: False (no update of association rules is needed).

24 Experimental Results
- Synthetic databases: generate D, Δ+, Δ-
- Use Apriori to find the large itemsets
- FUP2 is invoked to find the large itemsets in the updated database; record the time
- Run DELI; record the time
- |D| = 100000, |Δ+| = |Δ-| = 5000, confidence = 95%, s% = 2%, m = 20000

25 Experimental Results
(Figure omitted in the transcript.)

26 Experimental Results
(Figure omitted; the surviving axis label spans levels of confidence from 90% to 99%.)

27 Experimental Results
(Figure omitted in the transcript.)

28 Conclusions
- Real-world databases are updated constantly, so the knowledge extracted from them changes too
- The authors proposed the DELI algorithm to determine whether the change is significant, and hence when to update the extracted association rules
- The algorithm applies sampling techniques and statistical methods to efficiently estimate an approximation of the new large itemsets

29 Final Exam Questions
Q1: Compare and contrast FUP2 and DELI
- Both algorithms are used in association analysis
- Goal: DELI decides when to update the association rules, while FUP2 provides an efficient way of updating them
- Technique: DELI scans a small portion of the database (a sample) and approximates the large itemsets, whereas FUP2 scans the whole database and returns the large itemsets exactly
- DELI saves machine resources and time

30 Final Exam Questions
Q2: What does DELI stand for? Difference Estimation for Large Itemsets
Q3: Difference between Apriori and FUP2:
- Apriori scans the whole database to find association rules and does not use old data mining results
- For most itemsets, FUP2 scans only the updated part of the database and takes advantage of the old association analysis results

