Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules S.D. Lee, D. W. Cheung, B. Kao Department of Computer Science.


1 Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules S.D. Lee, D. W. Cheung, B. Kao Department of Computer Science The University of Hong Kong Presenter: Elena Zheleva, April 8, 2004 Data Mining and Knowledge Discovery 1998

2 Introduction Data mining enables us to extract useful information from huge databases. It enables marketers to develop and implement customized marketing programs and strategies. Databases are not static, so the maintenance of discovered association rules is an important problem – Example: an inventory database

3 Introduction To update the association rules, multiple scans of the database are necessary. The authors propose a method to determine when to update the association rules by scanning only a sample of the database and its changes. Lecture Week 3: Association Analysis – related to the first step of association analysis: constructing the large itemsets

4 Outline Problem Descriptions and Solutions – Mining of Association Rules – Update of Association Rules – Scheduling Update of Association Rules DELI Algorithm Example Experimental Results Conclusion

5 Problem Descriptions and Solutions

6 Problem 1: Mining of Large Itemsets Given a database D of transactions and a set of possible items, find the large itemsets. Large itemset: an itemset whose transaction support is above a pre-specified support%. Transaction: a non-empty set of items. Association rule: X => Y, where X and Y are itemsets. Association rules are found by examining the large itemsets.

7 Solution: Apriori Algorithm Finds the large itemsets iteratively. At iteration k: – Use the large (k-1)-itemsets to generate candidate itemsets of size k – Check which candidates have support above the pre-specified threshold and add them to the large k-itemsets At every iteration, it scans the database to count the transactions that contain each candidate itemset. A large amount of time is spent scanning the whole database.
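The iteration described above can be sketched in Python. This is a minimal illustrative version (itemsets as frozensets, `min_support` as a fraction), not the paper's implementation:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: returns {itemset: support_count} for all large itemsets."""
    n = len(transactions)
    min_count = min_support * n
    # k = 1: candidate 1-itemsets are all items seen in the database
    items = {item for t in transactions for item in t}
    candidates = [frozenset([i]) for i in items]
    large, k = {}, 1
    while candidates:
        # One scan of the database per iteration: count each candidate
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: n_c for c, n_c in counts.items() if n_c >= min_count}
        large.update(level)
        # apriori_gen: join large k-itemsets into (k+1)-candidates,
        # keeping only those whose k-subsets are all large
        prev = list(level)
        candidates = []
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                    frozenset(s) in level for s in combinations(union, k)
                ):
                    candidates.append(union)
        candidates = list(set(candidates))
        k += 1
    return large
```

With four transactions {a,b}, {a,b,c}, {a,c}, {b,c} and min_support 0.5, all singletons and pairs are large (support 2 of 4 or more) while {a,b,c} is not.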

8 Problem 2: Update of Association Rules After some updates have been applied to a database, find the new large itemsets and their support counts in an efficient manner. Efficient: by reusing the old database mining results. All database updates are either insertions or deletions. This is the association rule maintenance problem.

9 Update of Association Rules Notation: D- : set of deleted transactions; D+ : set of added transactions; D : old database; D* : set of unchanged transactions; D' : updated database, D' = (D - D-) ∪ D+ = D* ∪ D+

10 FUP2 Algorithm Addresses the maintenance problem. Apriori fails to use the old data mining results; FUP2 reduces the amount of work that needs to be done. FUP2 works similarly to Apriori, but for the old large itemsets it scans only the updated part of the database (D+ and D-). For the remaining candidates, it scans the whole database.

11 Problem 3: When to Update Association Rules First idea: update after n transactions have changed – BAD! Instead, use the symmetric difference: measure how many large itemsets have been added and deleted by the database update. If too many => it is time to update the association rules. If too few => the old association rules are still a good approximation for the updated database.
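As a sketch, the symmetric difference can be turned into a single drift ratio. Normalising by the union of the two sets of large itemsets is one plausible choice here, not necessarily the paper's exact measure:

```python
def rule_drift(old_large, new_large):
    """Fraction of large itemsets added or deleted by the update.

    old_large, new_large: sets of frozensets (the large itemsets
    before and after the database update).
    """
    changed = old_large ^ new_large   # symmetric difference: added + deleted
    universe = old_large | new_large
    return len(changed) / len(universe) if universe else 0.0

old = {frozenset("a"), frozenset("b"), frozenset("ab")}
new = {frozenset("a"), frozenset("b"), frozenset("c")}
# {a,b} was deleted and {c} was added: 2 changed out of 4 distinct itemsets
drift = rule_drift(old, new)
```

A large `drift` signals that the old rules no longer describe the updated database; a small one means they remain a good approximation.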

12 DELI Algorithm

13 DELI: Difference Estimation for Large Itemsets Purpose: to estimate the difference between the association rules in a database before and after it is updated, and thereby decide whether to update the association rules. Overview: it estimates the size of the association-rule change by examining a sample. Advantage: DELI saves machine resources and time.

14 DELI Algorithm Basic notation: as in the update problem (slide 9). Input: the old support counts, D, D+ and D-. Output: a Boolean value indicating whether a rule update is needed. Iterative algorithm – start with k = 1. Each iteration: 13 steps, reduced here to 5 logical steps.

15 DELI Algorithm – Step 1 Generate the candidate set C_k: C_1 = all 1-itemsets (k = 1); C_k = apriori_gen(~L_{k-1}) (k > 1). Partition C_k into P_k and Q_k.

16 DELI Algorithm – Step 2 P_k – the itemsets of size k that were large in the old database and are potentially large in the new one For each itemset X ∈ P_k: - SupCount(D') = SupCount(D) + SupCount(D+) - SupCount(D-) (scan only D+ and D-) - If SupCount(D') >= |D'| * support%, then add X to L_k (L_k – itemsets large in both the old and new databases)
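Step 2 needs only the delta counts, since SupCount(D) is stored from the previous mining run. A sketch, with illustrative names and numbers:

```python
def update_P_itemset(old_count, count_plus, count_minus,
                     new_db_size, min_support):
    """Exact new support for an itemset in P_k, scanning only D+ and D-.

    SupCount(D') = SupCount(D) + SupCount(D+) - SupCount(D-);
    the itemset stays large iff SupCount(D') >= |D'| * support%.
    Returns (new_count, is_large).
    """
    new_count = old_count + count_plus - count_minus
    return new_count, new_count >= new_db_size * min_support

# Hypothetical itemset: old count 20000, gained 500, lost 300,
# checked against |D'| = 1,001,000 at 2% support (threshold ~20020)
count, large = update_P_itemset(20000, 500, 300, 1_001_000, 0.02)
```

The key point is that no scan of the unchanged part D* is ever needed for itemsets in P_k.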

17 DELI Algorithm – Step 3 Q_k – the itemsets of size k that were not large in the old database but are potentially large in the new one For each itemset X ∈ Q_k: - If SupCount(D+) - SupCount(D-) <= (|D+| - |D-|) * support%, then delete X from Q_k (the deltas alone cannot have made X large) Take a random sample S of size m from the old database For each remaining itemset X ∈ Q_k: - Find SupCount(S) and derive an interval [a, b] for SupCount(D) with 100(1-α)% confidence - Then SupCount(D') ∈ [a + δ, b + δ], where δ = SupCount(D+) - SupCount(D-) Reason: SupCount(D') = SupCount(D) + SupCount(D+) - SupCount(D-)
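The interval [a, b] can be obtained by treating SupCount(S)/m as a binomial proportion and scaling to |D|. The sketch below uses the normal approximation (the paper's exact bound may differ); the numbers reproduce the BE example from the slides ahead, assuming m = 10000 and δ = 234, both inferred from the slide's figures:

```python
import math

def supcount_interval(sample_count, m, db_size, delta, z=1.96):
    """100(1-alpha)% interval for SupCount(D') from a sample of size m.

    sample_count: SupCount(S); delta = SupCount(D+) - SupCount(D-);
    z = 1.96 corresponds to 95% confidence.
    """
    p_hat = sample_count / m
    half = z * math.sqrt(p_hat * (1 - p_hat) / m) * db_size
    a = p_hat * db_size - half        # interval [a, b] for SupCount(D)
    b = p_hat * db_size + half
    return a + delta, b + delta       # shift by delta to cover SupCount(D')

# Itemset BE: 202 hits in a sample of 10000 drawn from |D| = 10^6
lo, hi = supcount_interval(202, 10_000, 1_000_000, 234)
# The interval straddles the threshold |D'| * 2% = 20020,
# so BE can only be classified as "maybe large"
```

The half-width shrinks as 1/sqrt(m), so a larger sample narrows the interval and shrinks the uncertain set at the cost of more scanning.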

18 DELI Algorithm – Step 3 (continued) For each itemset X ∈ Q_k: - Compare the estimated SupCount(D') interval [a + δ, b + δ] with the threshold |D'| * support% - If a + δ >= |D'| * support%: add X to L_k^> (itemsets that were not large in D but are large in D' with a certain confidence) - If a + δ < |D'| * support% <= b + δ: add X to L_k^≈ (itemsets that were not large in D but may be large in D')

19 DELI Algorithm – Step 4 Obtain the estimated set of large itemsets of size k: ~L_k = L_k ∪ L_k^> ∪ L_k^≈ Itemsets: L_k – large in D, large in D' (Step 2) L_k^> – not large in D, large in D' with a certain confidence (Step 3) L_k^≈ – not large in D, maybe large in D' (Step 3) ~L_k is an approximation of the new L_k; however, both misses and false hits are very rare.

20 DELI Algorithm – Step 5 Decide whether an association-rule update is needed – IF the uncertainty (|L_k^≈| / |~L_k|) is too large => DELI halts, an update is needed – IF the symmetric difference of the large itemsets is too large => DELI halts, an update is needed – IF ~L_k is empty => DELI halts, no update is necessary – IF ~L_k is non-empty => k = k + 1, go to Step 1
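The halting test above can be sketched as a single decision function; the threshold names and values are illustrative (the paper parameterises them), and the drift value is assumed to be maintained across iterations:

```python
def deli_decide(L_approx_k, L_uncertain_k, drift_so_far,
                max_uncertainty=0.2, max_drift=0.2):
    """One Step-5 decision: returns 'update', 'no_update', or 'continue'.

    L_approx_k: ~L_k, the estimated large k-itemsets;
    L_uncertain_k: L_k^~, the 'maybe large' subset;
    drift_so_far: running estimate of the symmetric-difference ratio.
    """
    if L_approx_k and len(L_uncertain_k) / len(L_approx_k) > max_uncertainty:
        return "update"        # too many itemsets we cannot classify
    if drift_so_far > max_drift:
        return "update"        # estimated rule change already too large
    if not L_approx_k:
        return "no_update"     # no more large itemsets: DELI halts, rules still valid
    return "continue"          # k += 1, go back to Step 1

# Values from the k=2 example slide: uncertainty 1/7, difference 2/15
decision = deli_decide({"AB", "AC", "AD", "BC", "BD", "CD", "BE"},
                       {"BE"}, 2 / 15)
```

With both ratios under the thresholds, the algorithm keeps iterating rather than forcing a full FUP2 run.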

21 Example

22 Example setup: |D| = 10^6, |D-| = 9000, |D+| = 10000, support% = 2%
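The derived quantities used throughout the example follow directly from this setup; a quick arithmetic sketch:

```python
d, d_minus, d_plus, s = 1_000_000, 9_000, 10_000, 0.02

d_prime = d - d_minus + d_plus   # |D'|: old size minus deletions plus additions
threshold = d_prime * s          # |D'| * support%: minimum count to be large in D'
# d_prime is 1,001,000 and threshold is 20020, the value the example
# compares every SupCount(D') against
```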

23 DELI Algorithm – Example k=1: 1) C_1 = {A, B, C, D, E, F}, P_1 = {A, B, C, D, E}, Q_1 = {F} 2) P_1: threshold |D'| * support% = 20020 => L_1 = {A, B, C, D, E} Itemset SupCount(D'): A 24818, B 31438, C 24410, D 27880, E 21201 3) Q_1: for F, SupCount(D+) - SupCount(D-) = 17 <= (|D+| - |D-|) * 2% = 20 => drop F 4) ~L_1 = {A, B, C, D, E} 5) Update? No. k = 2, proceed to Step 1

24 DELI Algorithm – Example k=2: 1) P_2 = {AB, AC, AD, AE, BC, BD, CD}, Q_2 = {BE, CE, DE} 2) P_2: threshold |D'| * support% = 20020 => L_2 = {AB, AC, AD, BC, BD, CD} 3) Q_2: drop CE, DE. For BE: SupCount(S) = 202 => estimated SupCount(D) = 20200, 95% confidence interval [20200 - 2757, 20200 + 2757] For SupCount(D'), the shifted confidence interval is [17677, 23191] Since 17677 < 20020 <= 23191, the interval straddles the threshold => L_2^≈ = {BE}

25 DELI Algorithm – Example k=2: 4) ~L_2 = {AB, AC, AD, BC, BD, CD, BE} 5) Update? No (uncertainty = 1/7 and difference = 2/15). k = 3, proceed to Step 1. k=3: … 4) ~L_3 = {ABC, ACD, BCD} 5) Update? No (uncertainty = 0 and difference = 2/15). Returns: False (no update of the association rules is needed).

26 Experimental Results

27 Experimental setup: Synthetic databases – generate D, D+, D- Use Apriori to find the large itemsets FUP2 is invoked to find the large itemsets in the updated database – record the time Run DELI – record the time |D| = 100000, |D+| = |D-| = 5000 confidence = 95%, support% = 2% Sample size = 20000

28 Experimental Results Figure 3

29 Experimental Results (figure: level of confidence varied from 90% to 99%)

30 Conclusion

31 Conclusion Real-world databases are updated constantly; therefore, the knowledge extracted from them changes too. We have to know when the change is significant. By applying sampling techniques and statistical methods, we can efficiently determine when to update the extracted association rules. Sampling is indeed useful in data mining.

32 Final Exam Questions Q1: Compare and contrast FUP2 and DELI – Both algorithms are used in Association Analysis – Goal: DELI decides when to update the association rules while FUP2 provides an efficient way of updating them – Technique: DELI scans a small portion of the database whereas FUP2 scans the whole database – DELI saves machine resources and time

33 Final Exam Questions Q2: Difference Estimation for Large Itemsets (what DELI stands for) Q3: Difference between Apriori and FUP2: – Apriori scans the whole database to find association rules and does not use old data mining results – For most itemsets, FUP2 scans only the updated part of the database and takes advantage of the old association analysis results

