Maintenance of Discovered Association Rules
S. D. Lee, David W. Cheung
Presentation: Pablo Gazmuri

Outline
- Mining of Association Rules
- Maintaining the Discovered Association Rules (FUP2 algorithm)
- Difference Estimation for Large Itemsets (DELI) algorithm
- Experimental Results
- Conclusions

Motivation
- Databases are not static, so maintaining discovered association rules is an important problem.
- Problems in maintaining the discovered rules:
  - Some existing algorithms process not only the changed part of the database but also the unchanged part, which can be time-consuming.
  - The database is updated frequently, yet the underlying rule set often does not change much.

Motivation (continued)
- Is 100% accuracy much better than 95%?
  - Data is collected from the real world, so it may contain random errors.
  - Sampling technique:
    - Examine a small sample instead of the entire huge database.
    - Since the sample is small, it can often fit entirely in main memory, which reduces the I/O overhead of repeated database scans.

Mining of Association Rules
- Apriori algorithm: find the large itemsets iteratively, one itemset size at a time.
- Some definitions:
  - $L_k$: the set of all large itemsets of size $k$
  - $\sigma_X$: the support count of an itemset $X$, i.e. the number of transactions in $D$ that contain $X$
  - $s\%$: the support threshold; $X$ is large when $\sigma_X \ge |D| \times s\%$
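The level-wise search behind Apriori can be sketched as follows. This is a minimal illustration, not the presenters' code: transactions are assumed to be Python sets of items, `s` is the support threshold given as a fraction, and the helper names `support_count` and `apriori` are hypothetical. `apriori_gen` is named after the candidate-generation step used in the slides, but its body here is only one simple way to write the join-and-prune step.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def apriori_gen(prev_large, k):
    """Join large itemsets of size k-1 into candidates of size k, then prune
    every candidate that has a (k-1)-subset which is not large."""
    joined = {a | b for a in prev_large for b in prev_large if len(a | b) == k}
    return {c for c in joined
            if all(frozenset(sub) in prev_large for sub in combinations(c, k - 1))}

def apriori(transactions, s):
    """Return {k: {itemset: support count}} for all large itemsets.
    `s` is the support threshold as a fraction (0.02 for 2%)."""
    min_count = s * len(transactions)
    candidates = {frozenset([item]) for t in transactions for item in t}
    large, k = {}, 1
    while candidates:
        counts = {c: support_count(c, transactions) for c in candidates}
        large_k = {c: n for c, n in counts.items() if n >= min_count}
        if not large_k:
            break
        large[k] = large_k
        k += 1
        candidates = apriori_gen(set(large_k), k)
    return large
```

Each pass over `transactions` corresponds to one full database scan, which is the cost that the maintenance algorithms in the following slides try to avoid.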

Update of Association Rules
- $\Delta^-$: set of deleted transactions
- $\Delta^+$: set of added transactions
- $D$: old database
- $D'$: updated database
- $D^*$: set of unchanged transactions
- $D = \Delta^- \cup D^*$ and $D' = D^* \cup \Delta^+$

$|D| = 10^6$, $|\Delta^-| = 9000$, $|\Delta^+| = 10000$, $s\% = 2\%$
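To make these figures concrete, the short calculation below derives the size of the updated database and the support counts needed to be large before and after the update. The variable names are illustrative, and `prune_bound` anticipates the quantity $(|\Delta^+| - |\Delta^-|) \times s\%$ that appears in the FUP2 and DELI steps.

```python
old_size = 10**6     # |D|
deleted = 9_000      # |Δ-|
added = 10_000       # |Δ+|
s_percent = 2        # s%

new_size = old_size - deleted + added                # |D'|
old_min_count = old_size * s_percent / 100           # |D|  * s%
new_min_count = new_size * s_percent / 100           # |D'| * s%
prune_bound = (added - deleted) * s_percent / 100    # (|Δ+| - |Δ-|) * s%
changed_fraction = (added + deleted) / new_size      # changed part relative to |D'|

print(new_size, old_min_count, new_min_count, prune_bound)
```

With these numbers the updated database holds 1,001,000 transactions and the minimum support count moves from 20,000 to 20,020, while the changed transactions amount to under 2% of the updated database.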

FUP2 Algorithm
- Designed to address the maintenance problem
- Makes use of the results of the previous mining
- More efficient than running Apriori on the whole updated database

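Steps of FUP2
1. Generate a candidate set $C_k$. For $k = 1$, $C_1 = I$ (the set of all items). For $k \ge 2$, $C_k = \text{apriori\_gen}(L_{k-1}^*)$, where $L_{k-1}^*$ is the set of large itemsets of size $k-1$ found for $D'$.
2. Divide $C_k$ into two parts: $P_k = C_k \cap L_k$ and $Q_k = C_k - P_k$.
3. Scan $\Delta^+$ and $\Delta^-$ to obtain $\delta_X$, the net change in the support count of $X$, for all $X \in C_k$.
4. For each $X \in P_k$, retrieve $\sigma_X$ from the results of the previous mining operation, then calculate $\sigma'_X = \sigma_X + \delta_X$. If $\sigma'_X \ge |D'| \times s\%$, add $X$ to $L_k^*$.
5. For each $X \in Q_k$, if $\delta_X \le (|\Delta^+| - |\Delta^-|) \times s\%$, delete it from $Q_k$. Such an itemset was not large in $D$, so it cannot be large in $D'$.
6. For the remaining $X \in Q_k$, rescan the database $D'$ and calculate $\sigma'_X$ to determine whether $X$ should be put into $L_k^*$.
8. If $L_k^*$ is non-empty, increment $k$ and go to step 2. Otherwise, halt.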
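One level of the procedure above can be sketched as below. This is a hedged illustration rather than the authors' implementation: the function name `fup2_level`, its parameters, and the representation of transactions as sets are all assumptions, and the rescan of $D'$ is written as a count over $D^*$ plus $\Delta^+$.

```python
def count_in(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def fup2_level(candidates, old_counts, d_minus, d_plus, unchanged, s):
    """Process the candidates C_k of one level.

    old_counts maps every itemset that was large in D to its support count.
    unchanged is D*, so D = unchanged + d_minus and D' = unchanged + d_plus.
    Returns {itemset: support count in D'} for the itemsets large in D'.
    """
    min_count = s * (len(unchanged) + len(d_plus))      # |D'| * s%
    prune_bound = s * (len(d_plus) - len(d_minus))      # (|Δ+| - |Δ-|) * s%
    new_large = {}
    for x in candidates:
        delta = count_in(x, d_plus) - count_in(x, d_minus)
        if x in old_counts:
            # X in P_k: the old count is known, no rescan needed.
            new_count = old_counts[x] + delta
        elif delta <= prune_bound:
            # X in Q_k and cannot have become large: skip without a rescan.
            continue
        else:
            # X in Q_k and still possible: count it in D' = D* + Δ+.
            new_count = count_in(x, unchanged) + count_in(x, d_plus)
        if new_count >= min_count:
            new_large[x] = new_count
    return new_large
```

Only the last branch touches the unchanged part of the database, which is why the pruning test on $\delta_X$ matters for the running time.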

DELI algorithm
- Problems with FUP2:
  - It still has to rescan the old database several times.
  - When should we update the association rules?
- DELI applies a sampling technique to estimate the difference between the association rules in a database before and after the database is updated. This difference can be used to determine whether the mined association rules should be updated.

DELI algorithm
1. Obtain a random sample $S$ of size $m$ from the original database $D$. Set $k = 1$.
2. Generate a candidate set $C_k$. For $k = 1$, $C_1 = I$. For $k \ge 2$, $C_k = \text{apriori\_gen}(L_{k-1}^*)$.
3. Divide $C_k$ into two parts: $P_k = C_k \cap L_k$ and $Q_k = C_k - P_k$.
4. Scan $\Delta^+$ and $\Delta^-$ to obtain $\delta_X$ for all $X \in C_k$.
5. For each $X \in P_k$, retrieve $\sigma_X$ from the results of the previous mining operation, then calculate $\sigma'_X = \sigma_X + \delta_X$. If $\sigma'_X \ge |D'| \times s\%$, add $X$ to $L_k^{\gg}$.
6. Calculate $\xi_k = |L_k - C_k| + |P_k - L_k^{\gg}|$.
7. For each $X \in Q_k$, if $\delta_X \le (|\Delta^+| - |\Delta^-|) \times s\%$, delete it from $Q_k$.
8. For the remaining $X \in Q_k$, obtain a $100(1-\alpha)\%$ confidence interval $[a_X, b_X]$ for $\sigma_X$ by examining the sample $S$. Then check the following conditions for $X$:
   (a) If $|D'| \times s\% < a_X + \delta_X$, add $X$ to $L_k^{>}$.
   (b) If $a_X + \delta_X \le |D'| \times s\% \le b_X + \delta_X$, add $X$ to $L_k^{\approx}$.
   (c) If $b_X + \delta_X < |D'| \times s\%$, drop $X$.
9. Calculate $\eta_k = |L_k^{>}| + |L_k^{\approx}|$.
10. Let $L_k^* = L_k^{\gg} \cup L_k^{>} \cup L_k^{\approx}$.
11. If $\nu_k = |L_k^{\approx}| / |L_k^*| > \bar{\nu}$, signal the need for a rule-update operation and halt.
12. If $d_k = \sum_{j=1}^{k} (\xi_j + \eta_j) / |L| > \bar{d}$, signal the need for a rule-update operation and halt.
13. If $L_k^*$ is non-empty, increment $k$ and go to step 2. Otherwise, conclude that $L \approx L'$, use $L$ as an approximation for $L'$, and halt.
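The interval test in step 8 and the two stopping checks in steps 11 and 12 can be sketched as follows. The function names, the string labels, and the threshold parameters `u_max` and `d_max` are hypothetical, and the strict `>` comparisons against the thresholds are an assumption, since the comparison symbols did not survive in the slide text.

```python
def classify(a_x, b_x, delta_x, min_count):
    """Step 8: place a candidate from Q_k using the confidence interval
    [a_x, b_x] for its old support count and its net change delta_x.
    min_count is |D'| * s%."""
    if min_count < a_x + delta_x:
        return "large"        # confidently large in D'
    if b_x + delta_x < min_count:
        return "drop"         # confidently not large in D'
    return "uncertain"        # threshold lies inside the shifted interval

def should_update(xi_eta_per_level, n_uncertain, n_estimated_large,
                  n_old_large, u_max, d_max):
    """Steps 11 and 12 for the current level k.

    xi_eta_per_level holds one (xi_j, eta_j) pair for each level j = 1..k.
    n_uncertain and n_estimated_large are the sizes of the uncertain set
    and of the estimated large set at level k; n_old_large is |L|.
    """
    uncertainty = n_uncertain / n_estimated_large if n_estimated_large else 0.0
    difference = sum(xi + eta for xi, eta in xi_eta_per_level) / n_old_large
    return uncertainty > u_max or difference > d_max
```

For example, `classify(100, 120, 5, 110)` falls into the uncertain case because the threshold 110 lies between 105 and 125.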

Comments on the Algorithm
- How to find the confidence interval:
  $$a_X = \hat{\sigma}_X - z_{\alpha/2}\sqrt{\frac{\hat{\sigma}_X\,(|D| - \hat{\sigma}_X)}{m}}, \qquad b_X = \hat{\sigma}_X + z_{\alpha/2}\sqrt{\frac{\hat{\sigma}_X\,(|D| - \hat{\sigma}_X)}{m}}$$
  - $m$: size of the sample
  - $\hat{\sigma}_X$: point estimator of $\sigma_X$, with $\hat{\sigma}_X = T_X \times |D| / m$
  - $T_X$: total number of transactions in the sample that contain $X$
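The interval can be computed directly from the sample count. The sketch below assumes a normal approximation, with the critical value taken from Python's `statistics.NormalDist`; the function name `support_interval` and its parameters are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def support_interval(t_x, m, db_size, alpha=0.05):
    """Return (a_X, b_X), a 100(1 - alpha)% confidence interval for the
    support count of X in D.

    t_x     -- number of sampled transactions that contain X
    m       -- sample size
    db_size -- |D|
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)     # critical value z_{alpha/2}
    estimate = t_x * db_size / m                # point estimator of sigma_X
    half_width = z * sqrt(estimate * (db_size - estimate) / m)
    return estimate - half_width, estimate + half_width
```

With `t_x = 200`, `m = 10_000` and `db_size = 10**6`, the point estimate is 20,000 and the 95% interval is roughly 17,256 to 22,744, so a larger sample narrows the interval in proportion to $1/\sqrt{m}$.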

Comments on the Algorithm
- Upper bound on the amount of change in the large itemsets:
  - $\xi_k$ and $\eta_k$ are used in computing the size of the symmetric difference.
  - $\nu_k$ is the uncertainty factor, since itemsets in $L_k^{\approx}$ may introduce errors into the final result.
- $L_k^*$ is an approximation of $L_k'$; however, misses are rare and false hits are also very rare.

Conclusions
- By applying sampling techniques and statistical methods, DELI determines whether it is necessary to update the association rules.
- Sampling is very useful in data mining.
