Presentation on theme: "Fast Algorithms for Association Rule Mining"— Presentation transcript:

1 Fast Algorithms for Association Rule Mining
Paper by R. Agrawal and R. Srikant
Presented by Muhammad Aurangzeb Ahmad and Nupur Bhatnagar

2 Outline
Background and Motivation
Problem Definition
Major Contribution
Key Concepts
Validation
Assumptions
Possible Revision

3 Background & Motivation
Basket data: a collection of records, each consisting of a transaction identifier and the items bought in that transaction.
Goal: mine associations among items in a large database of sales transactions, so that the occurrence of an item can be predicted from the occurrences of other items in the same transaction.
For example: a rule might state that 90% of customers who buy bread and butter also buy milk.

4 Terms and Notations
Items: I = {i1, i2, …, im}, the set of all items.
Transaction T: a set of items such that T ⊆ I; items within a transaction are kept in lexicographic order.
TID: a unique identifier associated with each transaction.
Association rule: an implication X → Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.

5 Terms and Notations
Confidence: a rule X → Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y.
Support: a rule X → Y has support s if s% of the transactions in D contain X ∪ Y.
Large itemset: an itemset whose support is at least the minimum support threshold (minsup); all other itemsets are called small. (Confidence applies to rules, not to itemsets.)
Candidate itemsets: itemsets generated from a seed set of itemsets found to be large in the previous pass; their support is then counted to check against the minsup threshold.
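In symbols (the notation σ and |D| is introduced here for clarity, not from the slides): writing σ(Z) for the number of transactions in D that contain itemset Z,

```latex
\mathrm{support}(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{|D|}, \qquad
\mathrm{confidence}(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}.
```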

6 Problem Definition
Input: a set of transactions D.
Objective: generate all association rules whose support and confidence exceed the user-specified minimum support and minimum confidence, while minimizing computation time through pruning.
Constraint: items within each transaction are kept in lexicographic order.
Example association rules: {Diaper} → {Beer}; {Milk, Bread} → {Eggs, Coke}; {Beer, Bread} → {Milk}.
Real-world applications: NCR (Teradata) performs association rule mining for more than 20 large retail organizations, including Walmart; the technique is also used for pattern discovery in biological databases.

7 Major Contribution
Proposed two new algorithms for fast association rule mining, Apriori and AprioriTid, along with a hybrid of the two (AprioriHybrid).
Empirical evaluation of the proposed algorithms against contemporary algorithms (AIS and SETM).
Completeness: the algorithms find all rules meeting the thresholds.

8 Related Work: SETM and AIS
The major difference lies in candidate itemset generation. In pass k, these algorithms read a database transaction t, determine which of the large itemsets in Lk-1 are present in t, and extend each such large itemset l with those large items that appear in t and occur later in the lexicographic ordering than any item in l.
Result: many candidate itemsets are generated that are later discarded, which wastes work.

9 Key Concepts: Support and Confidence
Why do we need support and confidence? Given a rule X → Y:
Support determines how often the rule is applicable to a given data set.
Confidence determines how frequently items in Y appear in transactions that contain X.
A rule with low support may occur by chance and tends to be uninteresting from a business perspective.
Confidence measures the reliability of the inference made by the rule. For example, if X ∪ Y appears in 20 of 100 transactions and X appears in 25, then X → Y has support 20% and confidence 80%.

10 Key Concepts: Association Rule Mining Problem
Given a set of transactions D, find all rules having support ≥ minsup and confidence ≥ minconf.
Decomposition of the problem:
Frequent itemset generation: find all itemsets with transaction support above the minimum support. These are called frequent (large) itemsets.
Rule generation: use the frequent itemsets found in the previous step to extract high-confidence rules (see the sketch below).
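As an illustration of the rule-generation step, here is a minimal Python sketch (function and parameter names are assumed, and the paper's actual procedure prunes consequents more aggressively). It assumes `support` is a precomputed map from frozenset itemsets to support counts:

```python
from itertools import combinations

def generate_rules(frequent_itemsets, support, minconf):
    """Emit rules a -> (f - a) from each frequent itemset f whose
    confidence support(f)/support(a) meets the minconf threshold."""
    rules = []
    for f in frequent_itemsets:
        if len(f) < 2:
            continue  # need a non-empty antecedent and consequent
        for r in range(1, len(f)):
            for a in map(frozenset, combinations(f, r)):
                conf = support[f] / support[a]
                if conf >= minconf:
                    rules.append((a, f - a, conf))
    return rules
```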

11 Frequent Itemset Generation: Apriori
Apriori principle: given an itemset, e.g. I = {a, b, c, d, e}: if an itemset is frequent, then all of its subsets must also be frequent. (The converse does not hold: a superset of a frequent itemset need not be frequent.)
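This is the anti-monotone (downward-closure) property of support: every transaction containing Y also contains any X ⊆ Y, so

```latex
X \subseteq Y \;\Longrightarrow\; s(X) \ge s(Y),
\qquad\text{hence}\qquad
s(Y) \ge \mathrm{minsup} \;\Rightarrow\; s(X) \ge \mathrm{minsup}.
```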

12 Frequent Itemset Generation: Apriori
Apriori principle: if {c, d, e} is frequent, then all of its subsets ({c, d}, {c, e}, {d, e}, {c}, {d}, {e}) must also be frequent.

13 Frequent Itemset Generation: Apriori
Apriori principle, applied to candidate pruning: if {a, b} is infrequent, then all of its supersets are infrequent and can be pruned without counting.

14 Key Concepts: Frequent Itemset Generation with the Apriori Algorithm
Input: the market basket transaction dataset.
Process:
Determine the large 1-itemsets.
Repeat until no new large itemsets are identified:
Generate (k+1)-length candidate itemsets from the length-k large itemsets.
Prune candidate itemsets containing a subset that is not large.
Count the support of each remaining candidate itemset.
Eliminate candidates that are small.
Output: all itemsets that are large, i.e., that meet the minimum support threshold. (A Python sketch of this loop follows.)
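A minimal Python sketch of the loop above, assuming transactions are iterables of hashable items and minsup is an absolute count (the paper's implementation additionally uses a hash tree to speed up support counting, which is omitted here):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all large (frequent) itemsets as frozensets."""
    data = [frozenset(t) for t in transactions]
    # Pass 1: count individual items to obtain the large 1-itemsets L1.
    counts = {}
    for t in data:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= minsup}
    frequent, k = set(Lk), 2
    while Lk:
        # Join: unions of two (k-1)-itemsets that form a k-itemset.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be large.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Count support of the surviving candidates in one pass over the data.
        counts = {c: sum(1 for t in data if c <= t) for c in Ck}
        Lk = {c for c, n in counts.items() if n >= minsup}
        frequent |= Lk
        k += 1
    return frequent
```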

15 Apriori Example (minimum support = 2 transactions)
[Figure: candidate 1-itemsets → pruning → candidate 2-itemsets → pruning → candidate 3-itemsets]

16 Apriori Candidate Generation
Given the frequent k-itemsets Lk, generate candidate (k+1)-itemsets in two steps:
Join step: join Lk with itself, with the join condition that the first k-1 items are the same.
Prune step: delete all candidates having a non-frequent k-subset.
Example: given candidates {{1 3 5}, {2 3 5}}, pruning deletes {1 3 5} if one of its 2-subsets is not frequent, leaving {{2 3 5}}. (A Python sketch follows.)
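A Python sketch of this candidate-generation function (the name apriori_gen follows the paper; the driver code is illustrative). Applied to L2 from the worked example on slide 18, it reproduces C3 = {{2 3 5}}:

```python
from itertools import combinations

def apriori_gen(Lk, k):
    """Generate candidate (k+1)-itemsets from the frequent k-itemsets Lk."""
    # Join step: pair itemsets that agree on their first k-1 items.
    Ck = set()
    for a in Lk:
        for b in Lk:
            sa, sb = sorted(a), sorted(b)
            if sa[:-1] == sb[:-1] and sa[-1] < sb[-1]:
                Ck.add(a | b)
    # Prune step: drop candidates with a non-frequent k-subset.
    return {c for c in Ck
            if all(frozenset(s) in Lk for s in combinations(c, k))}

L2 = {frozenset(s) for s in ({1, 3}, {2, 3}, {2, 5}, {3, 5})}
print(apriori_gen(L2, 2))  # {frozenset({2, 3, 5})}
```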

17 AprioriTid
Uses the same candidate generation function as Apriori.
Does not use the database for counting support after the first pass; instead, it uses an encoding C̄k of the candidate itemsets from the previous pass.
This saves reading effort, since the encoding shrinks as k grows. (A sketch of one pass appears below.)
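A sketch of one AprioriTid pass in Python (structure and names assumed from the paper's description): C̄(k-1) is a list of (TID, set of (k-1)-itemsets present in that transaction), and a candidate is contained in a transaction exactly when both (k-1)-itemsets it was joined from are present:

```python
def apriori_tid_pass(Ck, Cbar_prev, minsup):
    """Count candidates in Ck against the encoding C̄(k-1) instead of
    rescanning the database; return Lk and the new encoding C̄k."""
    counts = {c: 0 for c in Ck}
    Cbar_k = []
    for tid, prev_sets in Cbar_prev:
        present = set()
        for c in Ck:
            items = sorted(c)
            # c's two generators: drop the last item, and drop the
            # second-to-last item. Both must occur in this transaction.
            if (frozenset(items[:-1]) in prev_sets and
                    frozenset(items[:-2] + items[-1:]) in prev_sets):
                counts[c] += 1
                present.add(c)
        if present:  # transactions with no candidates drop out of C̄k
            Cbar_k.append((tid, present))
    Lk = {c for c, n in counts.items() if n >= minsup}
    return Lk, Cbar_k
```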

18 AprioriTid Example (support count = 2)
Database: TID 100: {1 3 4}; TID 200: {2 3 5}; TID 300: {1 2 3 5}; TID 400: {2 5}
C̄1 (set-of-itemsets per TID): 100: {{1},{3},{4}}; 200: {{2},{3},{5}}; 300: {{1},{2},{3},{5}}; 400: {{2},{5}}
L1 (itemset : support): {1}:2, {2}:3, {3}:3, {5}:3
C2 (itemset : support): {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
C̄2: 100: {{1 3}}; 200: {{2 3},{2 5},{3 5}}; 300: {{1 2},{1 3},{1 5},{2 3},{2 5},{3 5}}; 400: {{2 5}}
L2 (itemset : support): {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2
C3: {2 3 5}
C̄3: 200: {{2 3 5}}; 300: {{2 3 5}}
L3 (itemset : support): {2 3 5}:2

19 AprioriTid: Analysis
Advantages: if a transaction contains no candidate k-itemsets, C̄k has no entry for that transaction; for large k, each entry may be smaller than the transaction itself, because few candidates are present in the transaction.
Disadvantages: for small k, each entry may be larger than the corresponding transaction, since an entry includes all candidate k-itemsets contained in the transaction.

20 AprioriHybrid
Uses Apriori in the initial passes and switches to AprioriTid when it expects that the candidate-itemset encoding C̄k at the end of the pass will fit in memory.
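A sketch of the switching heuristic, under the assumption (based on the paper's description) that the size of C̄k is estimated as the sum of the candidates' support counts plus one entry per transaction; the function name and the memory-budget parameter are illustrative:

```python
def should_switch_to_tid(candidate_support_counts, num_transactions,
                         memory_budget_entries):
    """Return True when the estimated size of the C̄k encoding would
    fit in memory, i.e. when switching to AprioriTid becomes attractive."""
    estimated = sum(candidate_support_counts.values()) + num_transactions
    return estimated <= memory_budget_entries
```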

21 Validation: Computer Experiments
Parameters for synthetic data generation:
|D| - number of transactions
|T| - average size of a transaction
|I| - average size of the maximal potentially large itemsets
|L| - number of maximal potentially large itemsets
N - number of items
Parameter settings: six synthetic data sets.

22 Results: Execution Time
Apriori consistently outperforms AIS and SETM; SETM's execution times were too large to show on the same scale.
Apriori is faster than AprioriTid for large numbers of transactions.

23 Results: Analysis
AprioriTid uses C̄k instead of the database. If C̄k fits in memory, AprioriTid is faster than Apriori; when C̄k is too big to fit in memory, the computation time is much longer, and Apriori is faster than AprioriTid.

24 Results: Execution Time of AprioriHybrid
Graphs show that AprioriHybrid performs better than Apriori in almost all cases.

25 Scale-up Experiments
AprioriHybrid scales up as the number of transactions is increased from 100,000 to 10 million (minimum support 0.75%).
AprioriHybrid also scales up as the average transaction size increases; this experiment examines the effect on the data structures independent of the physical database size and the number of large itemsets.

26 Results
The Apriori algorithms are better than SETM and AIS.
The algorithms perform best when combined (AprioriHybrid).
The algorithms show good results in the scale-up experiments.

27 Validation Methodology: Weaknesses and Strengths
Strength: the authors use substantial basket data to guide the design of fast algorithms for association rule mining.
Weakness: only synthetic data sets are used for validation; the data may be too artificial to yield reliable conclusions about real-world datasets.

28 Assumptions
A synthetic dataset is used; it is assumed that the algorithms' performance on synthetic data is indicative of their performance on real-world data.
All items in the data are in lexicographic order.
All data are assumed to be categorical.
All data are assumed to reside in a single site or table, so there are no cases in which joins would be required.

29 Possible Revision
Real-world datasets should be used to perform the experiments.
The number of large itemsets can grow exponentially with large databases; a modified representation structure is needed that captures just a subset of the candidate large itemsets.
Limitations of the support/confidence framework:
Support: potentially interesting patterns involving low-support items may be eliminated.
Confidence: confidence ignores the support of the itemset in the rule consequent.
Improvement: an interestingness (lift) measure that computes the ratio between the rule's confidence and the support of the itemset in the rule consequent: lift(a → b) = s(a, b) / (s(a) · s(b)).
Effect of skewed support distributions.

30 Questions?

