Performance and Scalability: Apriori Implementation.


Apriori
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB, 1994

Reducing Number of Comparisons
- Candidate counting: scan the database of transactions to determine the support of each candidate itemset
- To reduce the number of comparisons, store the candidates in a hash structure
- Instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets

Generate Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
- Hash function (here with buckets 1,4,7 / 2,5,8 / 3,6,9)
- Max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets in a leaf exceeds the max leaf size, split the node)

Association Rule Discovery: Hash Tree
[Figure: candidate hash tree with hash function buckets 1,4,7 / 2,5,8 / 3,6,9; one branch per bucket]
- Hash on 1, 4 or 7
- Hash on 2, 5 or 8
- Hash on 3, 6 or 9

Subset Operation
Given a transaction t, what are the possible subsets of size 3?
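The question above can be answered by direct enumeration. The sketch below assumes a hypothetical transaction `t` (the transcript does not preserve the one shown on the slide) and a helper name `k_subsets` chosen for illustration.

```python
from itertools import combinations


def k_subsets(transaction, k):
    """Return every k-item subset of a transaction as a sorted tuple."""
    return list(combinations(sorted(set(transaction)), k))


# Hypothetical transaction, not taken from the slides.
t = [1, 2, 3, 5, 6]
subsets = k_subsets(t, 3)

for s in subsets:
    print(s)
print(len(subsets), "subsets of size 3")
```

A transaction with n items has C(n, 3) such subsets, so enumerating them and looking each one up is only practical for short transactions; the hash tree avoids generating subsets that cannot match any candidate.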

Subset Operation Using Hash Tree
[Figure: the transaction is hashed down the candidate hash tree level by level, using hash function buckets 1,4,7 / 2,5,8 / 3,6,9]
- Match transaction against 11 out of 15 candidates
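The hash tree construction and the subset operation can be sketched together as follows. This is a minimal illustration, not the implementation behind the slides: it assumes the hash function is `item % 3` (which groups 1,4,7 / 2,5,8 / 3,6,9), a max leaf size of 3, and a hypothetical transaction. The names `Node`, `bucket`, `insert`, and `count_supports` are chosen here for illustration. Because the tree shape depends on those assumptions, the number of candidates actually compared may differ from the figure quoted on the slide.

```python
class Node:
    """Hash tree node: a leaf stores candidates, an interior node stores children."""

    def __init__(self):
        self.children = {}   # bucket number -> Node
        self.itemsets = []   # candidates held by a leaf
        self.is_leaf = True


def bucket(item):
    # Assumed hash function: 1,4,7 / 2,5,8 / 3,6,9 fall into the same bucket.
    return item % 3


def insert(node, itemset, depth=0, max_leaf_size=3):
    """Insert a sorted candidate tuple, splitting a leaf that overflows."""
    if not node.is_leaf:
        child = node.children.setdefault(bucket(itemset[depth]), Node())
        insert(child, itemset, depth + 1, max_leaf_size)
        return
    node.itemsets.append(itemset)
    # Split only while there is still an item left to hash on.
    if len(node.itemsets) > max_leaf_size and depth < len(itemset):
        pending, node.itemsets, node.is_leaf = node.itemsets, [], False
        for s in pending:
            insert(node, s, depth, max_leaf_size)


def count_supports(root, transaction, counts):
    """Add 1 to counts[c] for every candidate c contained in the transaction."""
    items = sorted(set(transaction))
    item_set = set(items)
    seen_leaves = set()

    def visit(node, start):
        if node.is_leaf:
            # A leaf can be reached along several paths; check it only once.
            if id(node) not in seen_leaves:
                seen_leaves.add(id(node))
                for s in node.itemsets:
                    if item_set.issuperset(s):
                        counts[s] += 1
            return
        for i in range(start, len(items)):
            child = node.children.get(bucket(items[i]))
            if child is not None:
                visit(child, i + 1)

    visit(root, 0)


# The 15 candidate 3-itemsets listed on the slide.
candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
root = Node()
for c in candidates:
    insert(root, c)

transaction = [1, 2, 3, 5, 6]  # hypothetical transaction
counts = {c: 0 for c in candidates}
count_supports(root, transaction, counts)
print([c for c in candidates if counts[c] > 0])
```

Only the leaves reachable by hashing the transaction's items are inspected, which is where the saving over matching against all candidates comes from.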

Prefix Tree Representation
Christian Borgelt. Efficient Implementations of Apriori and Eclat. FIMI’03

Prefix Tree

Prefix Tree Structure for Counting
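The figures for these slides did not survive in the transcript, so the following is only a generic sketch of counting with a prefix tree, not a reproduction of the structure described by Borgelt. It assumes all candidates have the same length `k`, stores each candidate as a path of sorted items, and keeps a counter at the end of each path. The dictionary layout and the function names are illustrative assumptions.

```python
def new_node():
    return {"count": 0, "children": {}}


def build_prefix_tree(candidates):
    """Store each candidate as a root-to-node path over its sorted items."""
    root = new_node()
    for itemset in candidates:
        node = root
        for item in sorted(itemset):
            node = node["children"].setdefault(item, new_node())
    return root


def count_transaction(node, items, k, start=0):
    """Increment the counter of every length-k path contained in items.

    items must be a sorted list without duplicates.
    """
    for i in range(start, len(items)):
        child = node["children"].get(items[i])
        if child is None:
            continue  # no candidate continues with this item
        if k == 1:
            child["count"] += 1
        else:
            count_transaction(child, items, k - 1, i + 1)


def support(root, itemset):
    """Read the counter stored at the end of a candidate's path."""
    node = root
    for item in sorted(itemset):
        node = node["children"][item]
    return node["count"]


# Hypothetical candidates and transactions, for illustration only.
candidates = [(1, 2, 3), (1, 2, 5), (1, 3, 5), (2, 3, 5)]
db = [[1, 2, 3, 5], [1, 2, 5], [2, 3, 5], [1, 3]]

root = build_prefix_tree(candidates)
for t in db:
    count_transaction(root, sorted(set(t)), k=3)

for c in candidates:
    print(c, support(root, c))
```

Candidates that share a prefix share the corresponding nodes, so a transaction that lacks the prefix skips all of them in one step.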

Other Key Optimizations
- Recoding the items
  - Why is this relevant?
- Transaction tree
  - Organize the transactions into a tree
  - Count through two trees

Important websites:
- FIMI workshop
  - Not only Apriori and FIM
  - FP-tree, ECLAT, Closed, Maximal
- Christian Borgelt’s website
- Ferenc Bodon’s website

References:
- Christian Borgelt. Efficient Implementations of Apriori and Eclat. FIMI’03
- Ferenc Bodon. A fast APRIORI implementation. FIMI’03
- Ferenc Bodon. A Survey on Frequent Itemset Mining. Technical Report, Budapest University of Technology and Economics, 2006

Scalability
- How to handle very large datasets?
- The dataset cannot be stored in main memory
- Performance on out-of-core datasets vs. performance on in-core datasets

Partition: Scan Database Only Twice
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
  - Scan 1: partition the database and find local frequent patterns
  - Scan 2: consolidate global frequent patterns
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB’95
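The two scans can be sketched as below. This is a simplified illustration under stated assumptions: the database is a hypothetical in-memory list, the minimum support is given as a fraction `min_frac`, and local mining uses brute-force enumeration in `local_frequent` as a stand-in for running Apriori on each partition. All names are illustrative.

```python
from itertools import combinations


def local_frequent(transactions, min_frac):
    """Brute-force the itemsets frequent within one partition.

    Stand-in for a real miner such as Apriori; fine only for tiny inputs.
    """
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, len(items) + 1):
            for s in combinations(items, k):
                counts[s] = counts.get(s, 0) + 1
    threshold = min_frac * len(transactions)
    return {s for s, c in counts.items() if c >= threshold}


def partition_mine(db, n_parts, min_frac):
    size = -(-len(db) // n_parts)  # ceiling division
    parts = [db[i:i + size] for i in range(0, len(db), size)]

    # Scan 1: the union of locally frequent itemsets is the candidate set.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_frac)

    # Scan 2: count every candidate over the whole database.
    counts = dict.fromkeys(candidates, 0)
    for t in db:
        t_set = set(t)
        for s in candidates:
            if t_set.issuperset(s):
                counts[s] += 1
    return {s: c for s, c in counts.items() if c >= min_frac * len(db)}


# Hypothetical database, for illustration only.
db = [["a", "b", "c"], ["a", "b"], ["a", "c"],
      ["b", "c"], ["a", "b", "c"], ["a"]]
result = partition_mine(db, n_parts=2, min_frac=0.5)
for itemset in sorted(result, key=lambda s: (len(s), s)):
    print(itemset, result[itemset])
```

The first scan can produce false positives (itemsets frequent in one partition only) but no false negatives, which is why a single second scan is enough to settle the global result.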

DHP: Reduce the Number of Candidates
- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
  - Candidates: a, b, c, d, e
  - Hash entries: {ab, ad, ae} {bd, be, de} …
  - Frequent 1-itemsets: a, b, d, e
  - ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD’95
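The bucket-count pruning can be sketched for the 2-itemset case as follows. The sketch assumes integer item IDs, an arbitrary illustrative hash function `(x * 10 + y) % n_buckets`, and a hypothetical database; none of these come from the slide. The function names are chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def bucket_of(pair, n_buckets):
    # Illustrative hash function over a sorted pair of integer item IDs.
    x, y = pair
    return (x * 10 + y) % n_buckets


def dhp_first_pass(db, min_count, n_buckets=7):
    """Count single items and hash every pair into a bucket in one scan."""
    item_counts = Counter()
    buckets = [0] * n_buckets
    for t in db:
        items = sorted(set(t))
        item_counts.update(items)
        for pair in combinations(items, 2):
            buckets[bucket_of(pair, n_buckets)] += 1

    frequent_items = sorted(i for i, c in item_counts.items() if c >= min_count)
    # A pair stays a candidate only if both items are frequent AND its
    # bucket count reaches the threshold.
    candidates = [p for p in combinations(frequent_items, 2)
                  if buckets[bucket_of(p, n_buckets)] >= min_count]
    return frequent_items, candidates


# Hypothetical database of integer item IDs, for illustration only.
db = [[1, 2, 5], [2, 4], [2, 3], [1, 2, 4], [1, 3],
      [2, 3], [1, 3], [1, 2, 3, 5], [1, 2, 3]]
frequent_items, candidates = dhp_first_pass(db, min_count=2)
print("frequent items:", frequent_items)
print("candidate pairs:", candidates)
```

A bucket count is an upper bound on the support of every pair hashed into it, so the filter never discards a frequent pair; collisions only mean that some infrequent pairs survive as candidates.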

Sampling for Frequent Patterns
- Select a sample of the original database and mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked
  - Example: check abcd instead of ab, ac, …, etc.
- Scan the database again to find missed frequent patterns
- H. Toivonen. Sampling large databases for association rules. In VLDB’96

DIC: Reduce Number of Scans
[Figure: itemset lattice over the items A, B, C, D, from {} up to ABCD]
- Once both A and D are determined frequent, the counting of AD begins
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
[Figure: timeline over the transactions comparing Apriori (1-itemsets, 2-itemsets, …) with DIC (1-itemsets, 2-itemsets, 3-itemsets)]
S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97