1 Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining -SIGKDD’03 Mohammad El-Hajj, Osmar R. Zaïane.

Similar presentations
Association rules and frequent itemsets mining

Query Optimization of Frequent Itemset Mining on Multiple Databases Mining on Multiple Databases David Fuhry Department of Computer Science Kent State.
Frequent Closed Pattern Search By Row and Feature Enumeration
FP-Growth algorithm Vasiljevic Vladica,
FP (FREQUENT PATTERN)-GROWTH ALGORITHM ERTAN LJAJIĆ, 3392/2013 Elektrotehnički fakultet Univerziteta u Beogradu.
Data Mining Association Analysis: Basic Concepts and Algorithms
FPtree/FPGrowth (Complete Example). First scan – determine frequent 1- itemsets, then build header B8 A7 C7 D5 E3.
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
1 Mining Frequent Patterns Without Candidate Generation Apriori-like algorithm suffers from long patterns or quite low minimum support thresholds. Two.
Association Analysis: Basic Concepts and Algorithms.
Association Rule Mining. Generating assoc. rules from frequent itemsets  Assume that we have discovered the frequent itemsets and their support  How.
Data Mining Association Analysis: Basic Concepts and Algorithms
Pattern Lattice Traversal by Selective Jumps Osmar R. Zaïane and Mohammad El-Hajj Department of Computing Science, University of Alberta Edmonton, AB,
FPtree/FPGrowth. FP-Tree/FP-Growth Algorithm Use a compressed representation of the database using an FP-tree Then use a recursive divide-and-conquer.
Fast Algorithms for Association Rule Mining
Association Analysis (3). FP-Tree/FP-Growth Algorithm Use a compressed representation of the database using an FP-tree Once an FP-tree has been constructed,
ACM SIGKDD Aug – Washington, DC  M. El-Hajj and O. R. Zaïane, 2003 Database Lab. University of Alberta Canada Inverted Matrix: Efficient Discovery.
© Vipin Kumar CSci 8980 Fall CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer.
SEG Tutorial 2 – Frequent Pattern Mining.
林俊宏 Parallel Association Rule Mining based on FI-Growth Algorithm Bundit Manaskasemsak, Nunnapus Benjamas, Arnon Rungsawang.
Ch5 Mining Frequent Patterns, Associations, and Correlations
Mining Frequent Patterns without Candidate Generation Presented by Song Wang. March 18 th, 2009 Data Mining Class Slides Modified From Mohammed and Zhenyu’s.
Jiawei Han, Jian Pei, and Yiwen Yin School of Computing Science Simon Fraser University Mining Frequent Patterns without Candidate Generation SIGMOD 2000.
AR mining Implementation and comparison of three AR mining algorithms Xuehai Wang, Xiaobo Chen, Shen chen CSCI6405 class project.
Data Mining Frequent-Pattern Tree Approach Towards ARM Lecture
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
EFFICIENT ITEMSET EXTRACTION USING IMINE INDEX By By U.P.Pushpavalli U.P.Pushpavalli II Year ME(CSE) II Year ME(CSE)
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Efficient Data Mining for Calling Path Patterns in GSM Networks Information Systems, accepted 5 December 2002 SPEAKER: YAO-TE WANG ( 王耀德 )
Mining Frequent Patterns without Candidate Generation.
Mining Frequent Patterns without Candidate Generation : A Frequent-Pattern Tree Approach 指導教授:廖述賢博士 報 告 人:朱 佩 慧 班 級:管科所博一.
Frequent Item Mining. What is data mining? =Pattern Mining? What patterns? Why are they useful?
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
CanTree: a tree structure for efficient incremental mining of frequent patterns Carson Kai-Sang Leung, Quamrul I. Khan, Tariqul Hoque ICDM ’ 05 報告者:林靜怡.
 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.
CMU SCS : Multimedia Databases and Data Mining Lecture #30: Data Mining - assoc. rules C. Faloutsos.
M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining ARM: Improvements March 10, 2009 Slide.
Association Analysis (3)
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
A Scalable Association Rules Mining Algorithm Based on Sorting, Indexing and Trimming Chuang-Kai Chiou, Judy C. R Tseng Proceedings of the Sixth International.
Reducing Number of Candidates Apriori principle: – If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due.
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
1 Top Down FP-Growth for Association Rule Mining By Ke Wang.
Chapter 3 Data Mining: Classification & Association Chapter 4 in the text box Section: 4.3 (4.3.1),
Fast Mining Frequent Patterns with Secondary Memory Kawuu W. Lin, Sheng-Hao Chung, Sheng-Shiung Huang and Chun-Cheng Lin Department of Computer Science.
Reducing Number of Candidates
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining: Concepts and Techniques
Frequent Pattern Mining
Byung Joon Park, Sung Hee Kim
CARPENTER Find Closed Patterns in Long Biological Datasets
Market Basket Analysis and Association Rules
Data Mining Association Analysis: Basic Concepts and Algorithms
Vasiljevic Vladica, FP-Growth algorithm Vasiljevic Vladica,
Mining Association Rules from Stars
COMP5331 FP-Tree Prepared by Raymond Wong Presented by Raymond Wong
732A02 Data Mining - Clustering and Association Analysis
Mining Frequent Patterns without Candidate Generation
The BIRCH Algorithm Davitkov Miroslav, 2011/3116
Frequent-Pattern Tree
FP-Growth Wenlong Zhang.
Finding Frequent Itemsets by Transaction Mapping
Presentation transcript:

1 Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining -SIGKDD’03 Mohammad El-Hajj, Osmar R. Zaïane Presented by: Ivy Tong 19 December, 2003

2 Introduction  Association rule mining  Existing algorithms depend heavily on  Memory size  Repeated I/O scans of the dataset  => Insufficient for extremely large datasets

3 Related Work  Two major approaches in the literature:  Apriori-like algorithms  FP-tree-based algorithms  Apriori  Extensive I/O scans of the DB  High computation cost  FP-tree  High memory requirement (data structure + conditional trees)  Storing the structure on disk significantly increases the number of I/Os

4 Contributions of this paper  Current algorithms  Handle small DBs with low dimensionality Scale up to a few million transactions A few thousand dimensions No algorithm can handle > 15M transactions and hundreds of thousands of dimensions.  This paper  A disk-based algorithm based on the conditional pattern concept  Divided into 2 phases

5 Contributions of this paper  Phase 1 (Pre-processing):  2 full I/O scans of the dataset  Generates a special disk-based data structure called the Inverted Matrix  Phase 2 (Mining):  Mines the Inverted Matrix (possibly with different support levels) to generate association rules  Less than one full I/O scan of the data structure Only the parts holding frequent items are scanned and used to generate frequent patterns

6 Outline of the talk  Problem Statement  Transactional Layout  Motivations for using the Inverted Matrix  Design and Construction of COFI-trees (Co-Occurrence Frequent-Item Trees)  Experimental Results  Conclusions and Future Work

7 Problem Statement  Let I = {i1, i2, …, im} be a set of literals, called items  Let D be a set of transactions, where each transaction T is a set of items (an itemset) such that T ⊆ I  A unique identifier TID is given to each transaction  A transaction T is said to contain X, a set of items in I, if X ⊆ T.  An association rule is an implication of the form "X => Y", where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅

8 Problem Statement  An itemset X is large or frequent if its support s is greater than or equal to a given minimum support threshold min_sup  The rule X => Y has support s if s% of the transactions in D contain X ∪ Y.  The rule X => Y has confidence c if c% of the transactions in D that contain X also contain Y.
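As a concrete illustration of these definitions, here is a minimal sketch; the helper names `support` and `confidence` and the toy transactions are illustrative, not from the slides:

```python
# Toy illustration of support and confidence over a small transaction list.

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

transactions = [{"A", "B", "C"}, {"A", "C"}, {"B", "D"}, {"A", "B", "C", "D"}]

print(support({"A", "C"}, transactions))       # 0.75 (3 of 4 transactions)
print(confidence({"A"}, {"C"}, transactions))  # 1.0  (every A-transaction has C)
```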

9 Transactional Layout  If the support threshold changes, the mining process is repeated  In practice, the minimum support is not known in advance and requires tuning  => the mining process is repeated  If min_sup is reduced Apriori: new scans of the DB are needed FP-growth: a new in-memory structure is built  Previously accumulated knowledge is not reused

10

11 Horizontal VS Vertical  Format of the transactions in the DB  Affects the efficiency of the algorithms  Existing algorithms use one of the two:  Horizontal - relates all items of the same transaction together Key: transaction ID  Vertical - relates all transactions that share the same item together Key: the item

12 Horizontal Layout Vertical Layout

13 Horizontal VS Vertical  Horizontal  Combines all items of one transaction together  Can eliminate the candidate generation step with some clever techniques, e.g. FP-growth  Problem: useless work scanning the whole DB  Vertical  An index on the items  Reduces the effect of large data sizes  Problem: expensive candidate generation Intersecting the records of the different items of a candidate pattern
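The two layouts can be sketched with a small conversion (data and names are illustrative):

```python
# Sketch: the same database in horizontal and vertical layouts.
from collections import defaultdict

horizontal = {            # key: transaction ID, value: its items
    1: ["A", "B", "C"],
    2: ["A", "C"],
    3: ["B", "D"],
}

def to_vertical(horizontal):
    """Invert the layout: the key becomes the item, the values the TIDs."""
    vertical = defaultdict(list)
    for tid, items in horizontal.items():
        for item in items:
            vertical[item].append(tid)
    return dict(vertical)

vertical = to_vertical(horizontal)
print(vertical["A"])   # [1, 2]
```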

14 Inverted Matrix Layout  Combines horizontal and vertical layouts  Idea:  Associate each item with all transactions in which it occurs (inverted index), and  Associate each transaction with all its items using pointers

15 Inverted Matrix Layout  Similar to the vertical layout  Key is the item  Differences:  Fields are not transaction IDs  Each field is a pointer to the location of the next item of the same transaction  A pointer has 2 parts First element: address of a line in the matrix (the row of the next item) Second element: address of a column

16 Inverted Matrix Layout  Each row in the matrix  Has an address  Is prefixed by the item it represents, with its frequency in the database  Rows are ordered in ascending order of the frequency of the item they represent

17 Inverted Matrix

18 Building the Inverted Matrix  Two phases  Phase 1:  Scan the database once  Find the frequency of each item  Order the items in ascending order of frequency

19 Building the Inverted Matrix  Phase 2:  Scan the database a second time  For each transaction  Sort the items in ascending order of frequency  Fill in the matrix accordingly
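The two phases can be sketched in memory as follows. This is a simplified, illustrative version: pointers are stored as (next-item-row, column) pairs rather than disk addresses, and all names are my own:

```python
# In-memory sketch of the two-phase Inverted Matrix construction.
from collections import Counter

def build_inverted_matrix(transactions):
    # Phase 1: one scan to count frequencies, then order items ascending.
    freq = Counter(item for t in transactions for item in t)
    order = sorted(freq, key=lambda i: (freq[i], i))   # ties broken by name
    rank = {item: r for r, item in enumerate(order)}

    # One row per item; each cell is a pointer or None.
    matrix = {item: [] for item in order}

    # Phase 2: second scan; sort each transaction ascending and link cells.
    for t in transactions:
        sorted_t = sorted(t, key=lambda i: rank[i])
        for pos, item in enumerate(sorted_t):
            if pos + 1 < len(sorted_t):
                nxt = sorted_t[pos + 1]
                # Pointer: row of the next item, column it will occupy there.
                ptr = (nxt, len(matrix[nxt]))
            else:
                ptr = None                    # last item of the transaction
            matrix[item].append(ptr)
    return freq, order, matrix

transactions = [["A", "B", "C"], ["A", "C"], ["B", "D"], ["A", "B", "C", "D"]]
freq, order, matrix = build_inverted_matrix(transactions)
print(order)            # ['D', 'A', 'B', 'C']
print(matrix["A"][0])   # ('B', 0): A's first occurrence links to row B, column 0
```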

20 Building the Inverted Matrix (Example)  First transaction: (A, B, C, D, E) => sorted by ascending frequency: (D, B, C, E, A)

21

22 Mining the Inverted Matrix  Objectives:  Minimize candidate generation  Eliminate scans of infrequent items  Support border  The first item in the index of the Inverted Matrix whose support is greater than or equal to min_sup
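Locating the support border in the ascending-frequency index is a simple scan; the sketch below uses illustrative names and made-up frequencies chosen to mirror the running example, where C is the border item:

```python
# Sketch: the support border is the first row (in ascending-frequency
# order) whose item meets min_sup; mining only walks rows from there on.

def support_border(order, freq, n_transactions, min_sup):
    """Index of the first item whose relative support >= min_sup."""
    for idx, item in enumerate(order):
        if freq[item] / n_transactions >= min_sup:
            return idx
    return None   # no item is frequent

order = ["D", "B", "C", "E", "A"]           # ascending frequency
freq = {"D": 1, "B": 2, "C": 4, "E": 5, "A": 6}
print(support_border(order, freq, 6, 0.5))  # 2 (C is the border item)
```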

23 Mining the Inverted Matrix  Follow the chains of pointers starting from C, rebuilding only the parts of the transactions that contain frequent items  Avoid processing infrequent items  A full transaction is never rebuilt at once

24 COFI-Tree  Co-Occurrence Frequent-Item Tree  To compute frequencies  Read the sub-transactions for frequent items directly from the Inverted Matrix  Build an independent, relatively small COFI-tree for each frequent item in the transactional database  Mine each tree separately once it is built (discarded as soon as mined)

25 COFI-Tree  Similar to a conditional FP-tree  A header of ordered frequent items  Horizontal pointers pointing to a succession of nodes containing the same frequent item  A prefix tree, with paths representing sub-transactions  Differences  Bidirectional links in the tree  Nodes contain an item label, a frequency counter and a participation counter (explained later)  The COFI-tree for a frequent item x contains only nodes labeled with items at least as frequent as x

26 COFI-Tree  Assume a transactional DB with frequent items (A, B, C, D, E, F)  Order of increasing frequency: F < E < C < D < B < A  Sub-transactions are generated from the Inverted Matrix (not materialized)

27 COFI-Tree: Construction  Itemsets of different sizes are found simultaneously  For each frequent 1-itemset  Find all frequent k-itemsets that contain it  A COFI-tree is built for each frequent item except the most frequent one, starting from the least frequent

28

29 Construction-Example 1  F - least frequent item  Build the COFI-tree for F first  All frequent items that are more frequent than F participate in building this tree  Root node: F (the item in question)  For each sub-transaction containing F together with other, more frequent items, a branch is formed starting from the root.  If multiple sub-transactions share the same prefix, they are merged into one branch and the counts are adjusted accordingly

30 Construction-Example 2  Node format - item label : support for the node : participation count (initialized to 0, used in mining later)  Round node: tree node  Square node: cell from the header table  Horizontal link: points to the next node that has the same item label  Vertical (bidirectional) link: links a parent with its child and the child with its parent

31 Construction-Example 2 (building the C-COFI-tree)  Step 1: insert CBA:4 => C:4:0, B:4:0, A:4:0  Step 2: insert CDA:1 => C:5:0; new branch D:1:0, A:1:0  Step 3: insert CB:1 => C:6:0, B:5:0  Step 4: insert CD:2 => C:8:0, D:3:0

32 Construction-Example

33 Generate Frequent Patterns  The COFI-trees of the frequent items are not all constructed together  Each tree is built, mined and then discarded before the next COFI-tree is built  Mining is done for each tree independently  Finds all frequent k-itemset patterns containing the item at the root  A top-down approach is used to generate the patterns

34 Generate Frequent Patterns  The COFI-tree for a frequent item I is built by following the chain of pointers in the Inverted Matrix  The I-COFI-tree is mined  branch by branch, starting with the node of the most frequent item  Go upward in the tree to identify candidate frequent patterns containing I  A list of these candidates is kept and updated with the frequencies of the branches where they occur.  Since a node can belong to more than one branch, a participation count is used to avoid re-counting.
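The role of the participation counter can be shown with a simplified, self-contained sketch. All names are illustrative, candidate generation is done by brute-force subset enumeration of each branch, and the data is the C-COFI-tree example from the construction slides (CBA:4, CDA:1, CB:1, CD:2):

```python
# Sketch: branch-by-branch mining with participation counters.
from itertools import combinations

class Node:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.support, self.participation = 0, 0
        self.children = {}

def insert(root, items, count):
    """Add one sub-transaction (root item removed) with its count."""
    root.support += count
    node = root
    for item in items:
        node = node.children.setdefault(item, Node(item, node))
        node.support += count

def mine(root, min_count):
    candidates, order = {}, []
    def visit(n):                      # children first, so deeper nodes
        for c in n.children.values():  # are processed before their parents
            visit(c)
        if n is not root:
            order.append(n)
    visit(root)
    for node in order:
        count = node.support - node.participation   # not yet counted
        if count <= 0:
            continue
        path, n = [], node             # items on the branch up to the root
        while n is not root:
            path.append(n.item)
            n = n.parent
        for r in range(1, len(path) + 1):        # every subset of the
            for combo in combinations(path, r):  # branch, plus the root item
                key = frozenset(combo) | {root.item}
                candidates[key] = candidates.get(key, 0) + count
        n = node                       # mark the whole branch as counted
        while n is not root:
            n.participation += count
            n = n.parent
    return {k: v for k, v in candidates.items() if v >= min_count}

tree = Node("C")
for items, count in [(["B", "A"], 4), (["D", "A"], 1), (["B"], 1), (["D"], 2)]:
    insert(tree, items, count)
result = mine(tree, 3)
print(result)   # CB:5, CA:5, CBA:4, CD:3 (as frozensets); CAD:1 is pruned
```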

35

36 Experimental Results  Compared with Apriori and FP-growth  Run on a 733 MHz machine with 256MB RAM  IBM synthetic dataset  2 tests:  Time needed to mine different transaction-set sizes  Time needed to mine with different support levels

37 Scalability Settings  min_sup = 0.01%  DB size: 1M-25M transactions  Average transaction length = 24 items  Dimensionality  1M DB: 10,000 items  5M-25M: 100,000 items Results  Apriori failed to mine the 5M DB, and FP-growth could not mine beyond 5M  Inverted Matrix scales well,  Can mine 25M transactions in 2731 s

38 Performance (VS. min_sup) Settings  1M transactions  10,000 items  Average transaction length = 24  min_sup: varied down to 0.01% Results  Matrix built in 763 s, size 109 MB on disk (original DB: 102 MB)  Time needed to mine 1M transactions at different support levels  Accumulated time needed to mine 1M transactions using 4 different support levels

39 Future Work  Reduction of the Inverted Matrix size  Reduction of the number of I/Os when building COFI-trees  The Inverted Matrix clusters frequent items at the bottom  Traversing one transaction may touch more than one page  => Cluster the items of the same transactions on the same page  Updates of the matrix by addition or deletion of transactions

40 Example: Two transactions: (A, B, C, D, E) and (A, E, C, H, G) Ordering both transactions according to the frequencies: (D, B, C, E, A) and (G, H, C, E, A) Both transactions share the same suffix (C, E, A), so this common suffix could be stored only once
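The shared-suffix observation behind this size-reduction idea can be sketched as follows (the helper name is illustrative):

```python
# Sketch: after frequency-ordering, the two transactions end with the
# same items, so their storage for that suffix could be shared.

def shared_suffix(t1, t2):
    """Longest common suffix of two ordered transactions."""
    out = []
    for a, b in zip(reversed(t1), reversed(t2)):
        if a != b:
            break
        out.append(a)
    return out[::-1]

print(shared_suffix(["D", "B", "C", "E", "A"],
                    ["G", "H", "C", "E", "A"]))   # ['C', 'E', 'A']
```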

41 Conclusions  A new scalable algorithm is proposed  Uses the disk to store the transactions in a special layout - the Inverted Matrix  Uses memory to interactively mine small structures called COFI-trees  Experiments show the algorithm can mine very large transactional databases with a very large number of unique items  Useful in a repetitive and interactive setting

42 References  Mohammad El-Hajj, Osmar R. Zaïane. Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining. SIGKDD'03  Mohammad El-Hajj, Osmar R. Zaïane. Non-Recursive Generation of Frequent K-itemsets from Frequent Pattern Tree Representations. DaWaK'03

43 Thank You!