Published by Haleigh Truelove. Modified over 2 years ago.

1
Zeev Dvir – dvirzeev@post.tau.ac.il
GenMax
From: “Efficiently Mining Maximal Frequent Itemsets” by Karam Gouda & Mohammed J. Zaki

2
The Problem
Given a large database of item transactions, find all frequent itemsets.
A frequent itemset is a set of items that occurs in at least a user-specified percentage of the transactions in the database.
We call this percentage min_sup (for minimum support).

3
A maximal frequent itemset is a frequent itemset that has no frequent proper superset.
FI := frequent itemsets
MFI := maximal frequent itemsets
Fact: |MFI| << |FI| (typically)
GenMax is an algorithm that finds the exact MFI.

4
Example
Min_sup = 3 (given here as an absolute number of transactions)
Tid   A  B  C  D
1     x  x  x
2           x  x
3     x  x  x
4     x  x  x  x
5     x
6        x     x
7           x
Itemset lattice:
ABCD
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D
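The example can be checked by brute force. The sketch below is not GenMax; it simply tests every subset of {A, B, C, D} against the example database and then keeps the itemsets that have no frequent proper superset. The names `DB`, `support`, `frequent_itemsets` and `maximal` are illustrative and do not come from the slides.

```python
from itertools import combinations

# Example database from the slide: transaction id -> set of items.
DB = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}
MIN_SUP = 3  # absolute transaction count, as in the example


def support(itemset, db=DB):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)


def frequent_itemsets(db=DB, min_sup=MIN_SUP):
    """Brute force: test every non-empty subset of the item universe."""
    universe = sorted(set().union(*db.values()))
    return [
        frozenset(c)
        for k in range(1, len(universe) + 1)
        for c in combinations(universe, k)
        if support(frozenset(c), db) >= min_sup
    ]


def maximal(itemsets):
    """Keep only the itemsets that have no proper superset in the list."""
    return [s for s in itemsets if not any(s < t for t in itemsets)]


fi = frequent_itemsets()
mfi = maximal(fi)
print(sorted("".join(sorted(s)) for s in fi))
print(sorted("".join(sorted(s)) for s in mfi))
```

On this database the frequent itemsets are A, B, C, D, AB, AC, BC and ABC, and the maximal ones are ABC and D.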

5
Some Useful Definitions
The combine-set of an itemset I is the set of items that can each be added to I to create a frequent itemset.
For example, in the previous example the combine-set of the itemset {A} is {B, C}.
The combine-set of the empty itemset is called F1; it is the set of frequent items, i.e. the frequent itemsets of size 1.
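A minimal sketch of the combine-set definition, assuming the example database from the earlier slide and min_sup = 3 as an absolute count. The function name `combine_set` is illustrative only.

```python
# Example database from the earlier slide: transaction id -> set of items.
DB = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}
MIN_SUP = 3


def support(itemset, db=DB):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)


def combine_set(itemset, db=DB, min_sup=MIN_SUP):
    """Items y outside `itemset` such that itemset + {y} is still frequent."""
    all_items = set().union(*db.values())
    return {y for y in all_items - itemset if support(itemset | {y}, db) >= min_sup}


print(sorted(combine_set(frozenset("A"))))  # combine-set of {A}
print(sorted(combine_set(frozenset())))     # F1, the frequent single items
```

This reproduces the slide: the combine-set of {A} is {B, C}, and F1 is {A, B, C, D}.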

8
Improvement
At each level, sort the combine-set (C) in increasing order of support.
An itemset with low support has a smaller chance of producing a large combine-set at the next level.
The sooner we prune the tree, the more work we save.
This heuristic was first used in MaxMiner.
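The ordering heuristic can be sketched inside a small backtracking search. This is a simplified illustration in the spirit of GenMax, not the paper's exact pseudocode: the search procedure, the function names (`mfi_search`, `by_support`, `backtrack`) and the alphabetical tie-breaking are assumptions, and support is counted by scanning the example database rather than with the optimizations described on later slides.

```python
# Example database from the earlier slide: transaction id -> set of items.
DB = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)


def mfi_search(db, min_sup):
    """Backtracking search for maximal frequent itemsets (simplified sketch)."""
    mfi = []

    def by_support(prefix, items):
        # Increasing support of prefix + {y}; ties broken by item name.
        return sorted(items, key=lambda y: (support(prefix | {y}, db), y))

    def backtrack(prefix, combine):
        for i, x in enumerate(combine):
            new = prefix | {x}
            possible = combine[i + 1:]
            # If new plus everything it could still absorb is already covered,
            # this branch and all later siblings are covered as well.
            if any(new | set(possible) <= m for m in mfi):
                return
            nxt = [y for y in possible if support(new | {y}, db) >= min_sup]
            nxt = by_support(new, nxt)
            if not nxt:
                if not any(new <= m for m in mfi):
                    mfi.append(frozenset(new))
            else:
                backtrack(new, nxt)

    items = set().union(*db.values())
    f1 = [y for y in items if support({y}, db) >= min_sup]
    backtrack(frozenset(), by_support(frozenset(), f1))
    return mfi


result = mfi_search(DB, 3)
print(["".join(sorted(m)) for m in result])
```

With min_sup = 3 the least frequent item D is tried first and immediately yields the maximal itemset D; ABC is found next, after which the remaining branches are pruned by the superset check.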

9
Bottlenecks
1. Superset checking: the best algorithms for superset checking give an amortized bound of [...] per operation. That is bad if we have many itemsets in the MFI.
2. Frequency testing: how can we make frequency testing faster?

10
Optimizing Superset Checking
A technique called “progressive focusing” is used to narrow down the group of potential supersets as the recursive calls are made.
LMFI := local MFI
Before each recursive call, we construct the LMFI for the next call, based on the current LMFI and the new item added.

11
LMFI Example
[Figure: search-tree branch FG …; FGH, FGI, …; FGHI, FGHJ, …]
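One way to read progressive focusing is as a filter: the LMFI passed to the next call keeps only those known maximal itemsets that contain the newly added item. The sketch below assumes that reading. The function names and the maximal itemsets in `mfi` are hypothetical and are not taken from the slides.

```python
def narrow_lmfi(lmfi, new_item):
    """LMFI for the next call: keep only the maximal sets containing new_item."""
    return [m for m in lmfi if new_item in m]


def has_superset(itemset, lmfi):
    """Superset check against the (much smaller) local list only."""
    return any(itemset <= m for m in lmfi)


# Hypothetical maximal itemsets found so far.
mfi = [frozenset("FGHI"), frozenset("FGJK"), frozenset("ABC"), frozenset("FXY")]

# Descend along the prefix F, then FG, narrowing the list at each step.
lmfi = mfi
for item in "FG":
    lmfi = narrow_lmfi(lmfi, item)

# One more step: add H to the prefix FG.
lmfi_fgh = narrow_lmfi(lmfi, "H")

print(len(mfi), len(lmfi), len(lmfi_fgh))
print(has_superset(frozenset("FGHI"), lmfi_fgh))
print(has_superset(frozenset("FGHJ"), lmfi_fgh))
```

Each recursive call then checks candidates against a list that shrinks with depth instead of against the whole MFI.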

13
Frequency Testing Optimization
GenMax uses a “vertical database format”: for each item, we keep the set of all transactions containing that item.
This set is called a tidset (transaction ID set).
This format makes support computations easier, because we do not have to scan the entire database.

14
Vertical Database
Tid   A  B  C  D
1     x  x  x
2           x  x
3     x  x  x
4     x  x  x  x
5     x
6        x     x
7           x
A: {1, 3, 4, 5}
B: {1, 3, 4, 6}
C: {1, 2, 3, 4, 7}
D: {2, 4, 6}
t(A) = {1, 3, 4, 5}
t(AC) = {1, 3, 4}
supp(I) = |t(I)|
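A short sketch of the vertical format: the example database is turned into one tidset per item, and the tidset of an itemset is the intersection of its items' tidsets. The names `to_vertical` and `tidset` are illustrative, and `tidset` assumes a non-empty itemset.

```python
# Horizontal form of the example database: transaction id -> set of items.
DB = {
    1: {"A", "B", "C"},
    2: {"C", "D"},
    3: {"A", "B", "C"},
    4: {"A", "B", "C", "D"},
    5: {"A"},
    6: {"B", "D"},
    7: {"C"},
}


def to_vertical(db):
    """Build item -> tidset from tid -> items."""
    vertical = {}
    for tid, items in db.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def tidset(itemset, vertical):
    """t(I): intersection of the tidsets of the items in a non-empty itemset."""
    items = iter(itemset)
    result = set(vertical[next(items)])  # copy, so the stored tidset is not modified
    for item in items:
        result &= vertical[item]
    return result


V = to_vertical(DB)
t_ac = tidset("AC", V)
print(sorted(V["A"]), sorted(t_ac), len(t_ac))
```

Here supp(AC) = |t(AC)| = 3, obtained from two tidsets without rescanning the transactions.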

15
[Figure: node AB with extensions ABC, ABD, ABE; combine-set … = {C, E}, with tidsets t(ABC) and t(ABE)]
Each item y in the combine-set actually represents the itemset obtained by adding y to the current itemset, and stores the tidset associated with it.
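Storing a tidset with each combine-set entry means the next level needs only one intersection per candidate. The sketch below assumes the tidsets of the earlier A, B, C, D example with min_sup = 3 (the ABE figure itself has no data to work with); the function name `extend` is illustrative.

```python
MIN_SUP = 3

# Tidsets of the single items in the earlier example.
T = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}


def extend(prefix_tids, candidates, min_sup=MIN_SUP):
    """Intersect the prefix tidset with each candidate's tidset.

    `candidates` maps item y -> tidset stored with y at the current level.
    Returns the combine-set of the extended prefix as y -> tidset of prefix + {y},
    keeping only the frequent extensions.
    """
    combine = {}
    for y, tids in candidates.items():
        new_tids = prefix_tids & tids
        if len(new_tids) >= min_sup:
            combine[y] = new_tids
    return combine


# Level 1: prefix A, candidates B, C, D with their item tidsets.
combine_a = extend(T["A"], {y: T[y] for y in "BCD"})
# Level 2: prefix AB; the entry for C already carries t(AC).
combine_ab = extend(combine_a["B"], {"C": combine_a["C"]})

print(sorted(combine_a), sorted(combine_ab["C"]))
```

The entry for C under prefix AB holds t(ABC) = {1, 3, 4}, so its support is read off as the size of that set.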

16
Additional Optimization
Diffsets: do not store entire tidsets, only the differences between tidsets (described in “Fast Vertical Mining Using Diffsets”).
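A small worked sketch of the diffset idea on the earlier example, assuming the usual formulation in which the diffset of PX is d(PX) = t(P) - t(X), the next level is d(PXY) = d(PY) - d(PX), and supp(PXY) = supp(PX) - |d(PXY)|. The slides do not spell these formulas out, so treat them as an assumption; the variable names are illustrative.

```python
# Tidsets of the single items in the earlier example.
T = {
    "A": {1, 3, 4, 5},
    "B": {1, 3, 4, 6},
    "C": {1, 2, 3, 4, 7},
    "D": {2, 4, 6},
}

# Prefix P = C, extended by X = A and Y = D.
supp_c = len(T["C"])

# Diffsets relative to the prefix: transactions of C that lack the new item.
d_ca = T["C"] - T["A"]
d_cd = T["C"] - T["D"]
supp_ca = supp_c - len(d_ca)

# Next level: only the two diffsets are needed, not the full tidsets.
d_cad = d_cd - d_ca
supp_cad = supp_ca - len(d_cad)

# Cross-check against plain tidset intersection.
direct = len(T["C"] & T["A"] & T["D"])
print(sorted(d_ca), sorted(d_cd), sorted(d_cad), supp_ca, supp_cad, direct)
```

Both routes give supp(ACD) = 1; the diffset route stores {2, 7}, {1, 3, 7} and {1, 3} instead of the tidsets of CA, CD and CAD.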

17
Experimental Results
GenMax is compared with MaxMiner, MAFIA, and MAFIA-PP.
MaxMiner and MAFIA-PP give the exact MFI, while MAFIA gives a superset of the MFI.
The databases used in the experiments are grouped according to their MFI length distribution.

18
Type I Datasets

19
Type II Datasets

20
Type III Datasets

21
Type IV Datasets
