1 Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining -SIGKDD’03 Mohammad El-Hajj, Osmar R. Zaïane.

1 Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining (SIGKDD’03)
Mohammad El-Hajj, Osmar R. Zaïane
Presented by: Ivy Tong, 19 December 2003

2 Introduction
- Association rule mining
- Existing algorithms depend heavily on
  - Memory size
  - Repeated I/O scans of the dataset
- => Insufficient for extremely large datasets

3 Related Work
- Two major approaches in the literature:
  - Apriori-like algorithms
  - FP-tree-like algorithms
- Apriori
  - Extensive I/O scans of the DB
  - High computational cost
- FP-tree
  - High memory requirement (data structure + conditional trees)
  - Storing the structure on disk significantly increases the number of I/Os

4 Contributions of this paper
- Current algorithms
  - Handle small DBs with low dimensionality
    - Scale up to a few million transactions
    - A few thousand dimensions
  - No algorithm can handle more than 15M transactions and hundreds of thousands of dimensions
- This paper
  - Disk-based algorithm based on the conditional pattern concept
  - Divided into 2 phases

5 Contributions of this paper
- Phase 1 (Pre-processing):
  - 2 full I/O scans of the dataset
  - Generates a special disk-based data structure called the Inverted Matrix
- Phase 2 (Mining):
  - Mines the Inverted Matrix (at different support levels) to generate association rules
  - Less than one full I/O scan of the data structure
    - Only the part holding the frequent items is scanned and used to generate frequent patterns

6 Outline of the talk
- Problem Statement
- Transactional Layout
- Motivations for using the Inverted Matrix
- Design and Construction of COFI-trees (Co-Occurrence Frequent Item Trees)
- Experimental Results
- Conclusions and Future Work

7 Problem Statement
- Let I = {i_1, i_2, …, i_m} be a set of literals, called items
- Let D be a set of transactions, where each transaction T is a set of items (an itemset) such that T ⊆ I
- A unique identifier, TID, is given to each transaction
- A transaction T is said to contain X, a set of items in I, if X ⊆ T
- An association rule is an implication of the form "X => Y", where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅

8 Problem Statement
- An itemset X is large, or frequent, if its support s is greater than or equal to a given minimum support threshold min_sup
- The rule X => Y has support s if s% of the transactions in D contain X ∪ Y
- The rule X => Y has confidence c if c% of the transactions in D that contain X also contain Y
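The support and confidence definitions above can be checked on a toy dataset. The following is a minimal sketch, not code from the paper; the function names and the five sample transactions are made up for illustration.

```python
def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions in D that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    n_x = count(x, transactions)
    if n_x == 0:
        return 0.0
    return count(set(x) | set(y), transactions) / n_x


# Hypothetical toy database D (illustration only).
D = [frozenset(t) for t in (
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
)]

sup_ab = support({"A", "B"}, D)          # {A, B} occurs in 3 of 5 transactions
conf_a_b = confidence({"A"}, {"B"}, D)   # 3 of the 4 transactions with A also have B
```

With a threshold of min_sup = 50%, {A, B} would be frequent in this toy database, while {D} would not.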

9 Transactional Layout
- If the support threshold changes, the mining process is repeated
- In practice, the minimum support is not known in advance and requires tuning
  - => the mining process is repeated
- If min_sup is reduced
  - Apriori: new scans of the DB are needed
  - FP-growth: a new memory structure is built
- Previously accumulated knowledge is not reused

11 Horizontal vs. Vertical
- Format of transactions in the DB
  - Affects the efficiency of the algorithms
- Existing algorithms use one of the two:
  - Horizontal: relates all items in the same transaction together
    - Key: transaction ID
  - Vertical: relates all transactions that share the same item together
    - Key: the item

12 Horizontal Layout vs. Vertical Layout (figure)

13 Horizontal vs. Vertical
- Horizontal
  - Combines all items of one transaction together
  - Can possibly eliminate the candidate generation step by using clever techniques, e.g. FP-growth
  - Problem: useless work scanning the whole DB
- Vertical
  - An index on the items
  - Reduces the effect of large data sizes
  - Problem: expensive candidate generation
    - Requires intersecting the records of the different items in each candidate pattern
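To make the two layouts concrete, here is a small sketch that converts a horizontal table (keyed by transaction ID) into a vertical one (keyed by item) and counts a candidate itemset by intersecting TID lists, which is the step the slide calls expensive. The data and function names are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

# Horizontal layout: transaction ID -> items (hypothetical data).
horizontal = {
    1: ["A", "B", "C"],
    2: ["A", "C"],
    3: ["B", "C", "D"],
}


def to_vertical(horizontal):
    """Vertical layout: item -> list of transaction IDs containing it."""
    vertical = defaultdict(list)
    for tid in sorted(horizontal):
        for item in horizontal[tid]:
            vertical[item].append(tid)
    return dict(vertical)


def support_count(vertical, itemset):
    """Count a candidate itemset by intersecting the TID lists of its items."""
    tid_sets = [set(vertical.get(item, ())) for item in itemset]
    if not tid_sets:
        return 0
    return len(set.intersection(*tid_sets))


vertical = to_vertical(horizontal)
ac_count = support_count(vertical, ["A", "C"])  # TIDs {1, 2} intersect {1, 2, 3}
```

Each candidate needs one intersection per extra item, which is why the vertical layout pays for its compact index with costly candidate evaluation.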

14 Inverted Matrix Layout
- Combines the horizontal and vertical layouts
- Idea:
  - Associate each item with all transactions in which it occurs (inverted index), and
  - Associate each transaction with all its items using pointers

15 Inverted Matrix Layout
- Similar to the vertical layout
  - The key is the item
- Differences:
  - Fields are not transaction IDs
  - Each field is a pointer to the location of the next item in the same transaction
  - A pointer has 2 parts
    - First element: address of a line (row) in the matrix, i.e. the next item
    - Second element: address of a column

16 Inverted Matrix Layout
- Each row in the matrix
  - Has an address
  - Is prefixed by the item it represents, with its frequency in the database
- Rows are ordered in ascending order of the frequency of the item they represent

17 Inverted Matrix

18 Building the Inverted Matrix
- Two phases
- Phase 1:
  - Scan the database once
  - Find the frequency of each item
  - Order the items in ascending order of their frequencies

19 Building the Inverted Matrix
- Phase 2:
  - Scan the database a second time
  - For each transaction
    - Sort the items in ascending order of their frequencies
    - Fill in the matrix accordingly

20 Building the Inverted Matrix (Example)
- First transaction: (A, B, C, D, E) => (D, B, C, E, A)
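The two building phases can be sketched in memory as follows. This is only an illustration under stated assumptions: the real structure is disk-based, the dictionary-based row representation is made up here, `None` stands in for the null pointer of the last item of a transaction, and ties in frequency are broken by item name (the slides do not say how ties are handled).

```python
from collections import Counter


def build_inverted_matrix(transactions):
    """Return rows ordered by ascending item frequency.

    Each row is {"item", "freq", "cells"}; each cell is a pointer
    (row address, column address) to the next item of the same
    transaction, or None for the last item of a transaction.
    """
    # Phase 1: one scan to find the frequency of each item.
    freq = Counter(item for t in transactions for item in set(t))
    # Ascending frequency; tie-break by item name is an assumption.
    order = sorted(freq, key=lambda item: (freq[item], item))
    row_of = {item: r for r, item in enumerate(order)}
    rows = [{"item": item, "freq": freq[item], "cells": []} for item in order]

    # Phase 2: second scan, sort each transaction and fill in the pointers.
    for t in transactions:
        items = sorted(set(t), key=row_of.__getitem__)
        cols = []
        for item in items:  # reserve one cell per item of the transaction
            cells = rows[row_of[item]]["cells"]
            cols.append(len(cells))
            cells.append(None)
        for k in range(len(items) - 1):  # link each item to the next one
            nxt = (row_of[items[k + 1]], cols[k + 1])
            rows[row_of[items[k]]]["cells"][cols[k]] = nxt
    return rows


# Hypothetical toy database (not the one used in the slides).
transactions = [["A", "B", "C"], ["A", "C"], ["A", "B"], ["A"]]
matrix = build_inverted_matrix(transactions)
for address, row in enumerate(matrix):
    print(address, row["item"], row["freq"], row["cells"])
```

Because both the rows and the items inside each transaction are sorted by ascending frequency, every pointer leads to a row further down the matrix, which is what lets the mining phase skip the infrequent rows at the top.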

22 Mining the Inverted Matrix
- Objectives:
  - Minimize candidate generation
  - Eliminate scans of infrequent items
- Support border
  - The first item in the index of the Inverted Matrix whose support is greater than or equal to min_sup

23 Mining the Inverted Matrix
- Follow the chain of items starting from C and rebuild the parts of the transactions that contain only the frequent items
- Avoid processing non-frequent items
- (A) is never built at once
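A rough sketch of how the support border and the pointer chains might be used: locate the first frequent row, then follow each cell's pointers to rebuild sub-transactions that contain only frequent items. The tuple-based matrix literal below is a hypothetical in-memory stand-in for the disk structure, with `None` as the null pointer; the function names are made up for illustration.

```python
# Rows as (item, frequency, cells), in ascending order of frequency.
# Built by hand from four toy transactions: DBCA, CA, BA, A.
matrix = [
    ("D", 1, [(1, 0)]),
    ("B", 2, [(2, 0), (3, 2)]),
    ("C", 2, [(3, 0), (3, 1)]),
    ("A", 4, [None, None, None, None]),
]


def support_border(matrix, min_count):
    """Address of the first row whose frequency reaches min_count."""
    for address, (_item, freq, _cells) in enumerate(matrix):
        if freq >= min_count:
            return address
    return len(matrix)  # no frequent item at all


def sub_transactions(matrix, address):
    """Follow every pointer chain that starts in the given row."""
    item, _freq, cells = matrix[address]
    result = []
    for cell in cells:
        chain = [item]
        while cell is not None:
            row, col = cell
            chain.append(matrix[row][0])
            cell = matrix[row][2][col]
        result.append(chain)
    return result


border = support_border(matrix, min_count=2)  # row of B; row D is skipped
for address in range(border, len(matrix) - 1):  # most frequent item needs no tree
    print(matrix[address][0], sub_transactions(matrix, address))
```

Rows above the border are never read, and since pointers only lead downward, the rebuilt chains cannot contain an infrequent item.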

24 COFI-Tree
- Co-Occurrence Frequent-Item Tree
- To compute frequencies
  - Read sub-transactions for frequent items directly from the Inverted Matrix
  - Build independent, relatively small COFI-trees, one for each frequent item in the transactional database
  - Mine each tree separately once it is built (it is discarded as soon as it is mined)

25 COFI-Tree
- Similar to a conditional FP-tree
  - A header of ordered frequent items
  - Horizontal pointers pointing to a succession of nodes containing the same frequent item
  - A prefix tree, with paths representing sub-transactions
- Differences
  - Bidirectional links in the tree
  - Nodes contain an item label, a frequency counter and a participation counter (explained later)
  - A COFI-tree for a frequent item x contains only nodes labeled with items that are at least as frequent as x

26 COFI-Tree
- Assume a transactional DB with frequent items (A, B, C, D, E, F)
- Order of increasing frequencies: F < E < C < D < B < A
- Sub-transactions are generated from the Inverted Matrix (not realized)

27 COFI-Tree: Construction
- Itemsets of different sizes are found simultaneously
- For each given frequent 1-itemset
  - Find all frequent k-itemsets that subsume it
- A COFI-tree is built for each frequent item except the most frequent one, starting from the least frequent

29 Construction-Example 1
- F is the least frequent item
  - Build a COFI-tree for F first
- All frequent items that are more frequent than F participate in building this tree
- Root node: F (the item in question)
- For each sub-transaction containing F together with other frequent items (more frequent than F), a branch is formed starting from the root
- If multiple sub-transactions share the same prefix, they are merged into one branch and the counts are adjusted accordingly

30 Construction-Example 2
Node label format: item label : support for the node : participation count
- Participation count: initialized to 0, used in mining (later)
- Round node: tree node
- Square node: cell from the header table
- Horizontal link: points to the next node that has the same item name
- Vertical (bidirectional) link: links a parent with its child and the child with its parent

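31 Construction-Example 2
- Step 1: insert CBA:4. Tree: C:4:0 -> B:4:0 -> A:4:0 (header: B, A)
- Step 2: insert CDA:1. Tree: C:5:0 with branches B:4:0 -> A:4:0 and D:1:0 -> A:1:0 (header: D, B, A)
- Step 3: insert CB:1. Tree: C:6:0 with branches B:5:0 -> A:4:0 and D:1:0 -> A:1:0
- Step 4: insert CD:2. Tree: C:8:0 with branches B:5:0 -> A:4:0 and D:3:0 -> A:1:0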
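The four insertion steps of the C-COFI-tree example (CBA:4, CDA:1, CB:1, CD:2) can be replayed with a small prefix-tree sketch. The `Node` class and `build_cofi_tree` function are illustrative names, not the authors' code; a plain list per header entry stands in for the horizontal links, and the `parent` and `children` fields together play the role of the bidirectional vertical links.

```python
class Node:
    """COFI-tree node: item label, support, participation count."""

    def __init__(self, item, parent=None):
        self.item = item
        self.support = 0
        self.participation = 0  # initialized to 0, used only while mining
        self.parent = parent    # upward link
        self.children = {}      # downward links, keyed by item label

    def label(self):
        return f"{self.item}:{self.support}:{self.participation}"


def build_cofi_tree(root_item, sub_transactions):
    """Insert (items, count) pairs; items[0] must be the root item."""
    root = Node(root_item)
    header = {}  # item -> nodes carrying that item, in creation order
    for items, count in sub_transactions:
        root.support += count
        node = root
        for item in items[1:]:
            child = node.children.get(item)
            if child is None:  # no shared prefix yet: start a new branch
                child = Node(item, parent=node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.support += count  # shared prefix: adjust the count
            node = child
    return root, header


steps = [(("C", "B", "A"), 4), (("C", "D", "A"), 1),
         (("C", "B"), 1), (("C", "D"), 2)]
root, header = build_cofi_tree("C", steps)

b, d = root.children["B"], root.children["D"]
print(root.label(), "|", b.label(), "->", b.children["A"].label(),
      "|", d.label(), "->", d.children["A"].label())
```

The final labels match step 4 of the example: the root carries the total count of all inserted sub-transactions, and the two A nodes stay separate because they sit under different prefixes.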

32 Construction-Example

33 Generate Frequent Patterns
- The COFI-trees of all frequent items are not constructed together
  - Each tree is built, mined and then discarded before the next COFI-tree is built
- Mining is done for each tree independently
  - To find all frequent k-itemset patterns in which the item at the root participates
- A top-down approach is used to generate patterns

34 Generate Frequent Patterns
- The COFI-tree for a frequent item I is built by following the chain of pointers in the Inverted Matrix
- The I-COFI-tree is mined
  - branch by branch, starting with the node of the most frequent item
  - Go upward in the tree to identify candidate frequent patterns containing I
- A list of these candidates is kept and updated with the frequencies of the branches where they occur
- Since a node could belong to more than one branch, a participation count is used to avoid re-counting
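One possible reading of this mining step, sketched on the C-COFI-tree of the construction example. The slides do not spell out the details, so the following are assumptions: a branch's frequency is taken as the node's support minus its participation count, every node on the branch then has its participation raised by that amount, and the candidates of a branch are all item subsets of it combined with the root item. All names are illustrative.

```python
from itertools import combinations


class Node:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.support = 0
        self.participation = 0
        self.children = {}


def build_cofi_tree(root_item, sub_transactions):
    root, header = Node(root_item), {}
    for items, count in sub_transactions:
        root.support += count
        node = root
        for item in items[1:]:
            if item not in node.children:
                node.children[item] = Node(item, parent=node)
                header.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.support += count
    return root, header


def mine_cofi_tree(root, header, most_frequent_first, min_count):
    """Return {pattern: count} for patterns that contain the root item."""
    counts = {}
    for item in most_frequent_first:
        for node in header.get(item, []):
            # Part of the support not yet counted through deeper branches.
            freq = node.support - node.participation
            if freq <= 0:
                continue
            branch, cur = [], node
            while cur is not None:  # go upward to the root
                cur.participation += freq
                if cur is not root:
                    branch.append(cur.item)
                cur = cur.parent
            # Assumed candidate step: every subset of the branch plus the root.
            for k in range(1, len(branch) + 1):
                for combo in combinations(branch, k):
                    pattern = frozenset((root.item,) + combo)
                    counts[pattern] = counts.get(pattern, 0) + freq
    return {p: c for p, c in counts.items() if c >= min_count}


steps = [("CBA", 4), ("CDA", 1), ("CB", 1), ("CD", 2)]
root, header = build_cofi_tree("C", steps)
patterns = mine_cofi_tree(root, header, most_frequent_first="ABD", min_count=3)
for pattern, count in sorted(patterns.items(), key=lambda pc: sorted(pc[0])):
    print("".join(sorted(pattern)), count)
```

Enumerating all subsets is exponential in the branch length, so this is only workable for the short branches of a toy example. After mining, the root's participation count equals its support, meaning every inserted sub-transaction has been counted exactly once.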

36 Experimental Results
- Compared with Apriori and FP-growth
- Run on a 733 MHz machine with 256 MB of RAM
- IBM synthetic dataset
- 2 tests:
  - Time needed to mine different transaction sizes
  - Time needed to mine with different support levels

37 Scalability
Settings
- min_sup = 0.01%
- DB size: 1M-25M transactions
- Average transaction length = 24 items
- Dimensionality
  - 1M DB: 10,000 items
  - 5M-25M DBs: 100,000 items
Results
- Apriori failed to mine the 5M DB, while FP-growth could not mine beyond 5M
- Inverted Matrix scales well
  - Can mine 25M transactions in 2731 s

38 Performance (vs. min_sup)
Settings
- 1M transactions
- 10,000 items
- Average transaction length = 24
- min_sup: 0.0025%-0.01%
Results
- Matrix built in 763 s; size 109 MB on disk (original DB: 102 MB)
- Figure: Time needed to mine 1M transactions with different support levels
- Figure: Accumulated time needed to mine 1M transactions using 4 different support levels

39 Future Work
- Reduction of the Inverted Matrix size
- Reduction of the number of I/Os when building COFI-trees
  - The Inverted Matrix clusters frequent items at the bottom
  - Traversing one transaction may require reading more than one page
  - => Cluster the same transactions on the same page
- Updating the matrix when transactions are added or deleted

40 Example
- Two transactions: (A, B, C, D, E) and (A, E, C, H, G)
- Ordering both transactions according to the frequencies: (D, B, C, E, A) and (G, H, C, E, A)
- Both transactions share the same suffix C, E, A, we can view them as

41 Conclusions
- A new scalable algorithm is proposed
  - Uses the disk to store the transactions in a special layout, the Inverted Matrix
  - Uses memory to interactively mine small structures called COFI-trees
- Experiments show that the algorithm can mine very large transactional databases with a very large number of unique items
- Useful in a repetitive and interactive setting

42 References
- Mohammad El-Hajj, Osmar R. Zaïane. Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining. SIGKDD’03.
- Mohammad El-Hajj, Osmar R. Zaïane. Non Recursive Generation of Frequent K-itemsets from Frequent Pattern Tree Representations. DaWaK 2003.

43 Thank You!
