# Data Mining Association Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 6, Introduction to Data Mining by Tan, Steinbach, Kumar


Data Mining Association Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 6, Introduction to Data Mining by Tan, Steinbach, Kumar. © Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

## Association Rule Mining

- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Figure: table of market-basket transactions.

Example of association rules:

- {Diaper} → {Beer}
- {Milk, Bread} → {Eggs, Coke}
- {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

## Definition: Frequent Itemset

- Itemset
  - A collection of one or more items
    - Example: {Milk, Bread, Diaper}
  - k-itemset
    - An itemset that contains k items
- Support count (σ)
  - Frequency of occurrence of an itemset
  - E.g. σ({Milk, Bread, Diaper}) = 2
- Support
  - Fraction of transactions that contain an itemset
  - E.g. s({Milk, Bread, Diaper}) = 2/5
- Frequent Itemset
  - An itemset whose support is greater than or equal to a minsup threshold
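A minimal Python sketch of these definitions. The five transactions are an assumption: the market-basket table is not part of this transcript, so the sketch uses a small dataset chosen to agree with the figures quoted above (σ = 2 and s = 2/5 for {Milk, Bread, Diaper}). The function names are illustrative only.

```python
# Assumed market-basket data (not in the transcript); chosen so that
# sigma({Milk, Bread, Diaper}) = 2 and s = 2/5, as quoted on the slide.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """sigma(itemset): number of transactions that contain every item."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset, transactions):
    """s(itemset): fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent when its support is >= minsup."""
    return support(itemset, transactions) >= minsup


example = {"Milk", "Bread", "Diaper"}
print(support_count(example, TRANSACTIONS))  # 2
print(support(example, TRANSACTIONS))        # 0.4
```

With `minsup = 0.4` the example itemset counts as frequent; with `minsup = 0.6` it does not.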

## Definition: Association Rule

- Association Rule
  - An implication expression of the form X → Y, where X and Y are itemsets
  - Example: {Milk, Diaper} → {Beer}
- Rule Evaluation Metrics
  - Support (s)
    - Fraction of transactions that contain both X and Y
  - Confidence (c)
    - Measures how often items in Y appear in transactions that contain X
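The two metrics can be sketched as s = σ(X ∪ Y) / |T| and c = σ(X ∪ Y) / σ(X). The Python below evaluates them for the example rule {Milk, Diaper} → {Beer}. The transaction list is an assumed dataset (the slide's table is not in this transcript), and `rule_metrics` is a hypothetical helper name.

```python
# Assumed market-basket data; the slide's own table is not in the transcript.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def rule_metrics(lhs, rhs, transactions):
    """Return (support, confidence) of the rule lhs -> rhs."""
    lhs, rhs = set(lhs), set(rhs)
    both = lhs | rhs
    n_both = sum(1 for t in transactions if both <= t)  # sigma(X u Y)
    n_lhs = sum(1 for t in transactions if lhs <= t)    # sigma(X)
    s = n_both / len(transactions)
    c = n_both / n_lhs if n_lhs else 0.0  # undefined when X never occurs
    return s, c


s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, TRANSACTIONS)
print(f"support={s:.2f} confidence={c:.2f}")  # support=0.40 confidence=0.67
```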

## Association Rule Mining Task

- Given a set of transactions T, the goal of association rule mining is to find all rules having
  - support ≥ minsup threshold
  - confidence ≥ minconf threshold
- Brute-force approach:
  - List all possible association rules
  - Compute the support and confidence for each rule
  - Prune rules that fail the minsup and minconf thresholds
  - ⇒ Computationally prohibitive!
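The brute-force steps above can be sketched as follows. This is an illustration under assumptions, not the slide's code: the dataset is assumed, and `all_rules` and `brute_force` are hypothetical names. Even with only six items, the enumeration already produces 602 rules (3^d − 2^(d+1) + 1 for d = 6), which hints at why the approach does not scale.

```python
from itertools import combinations

# Assumed market-basket data; the slide's own table is not in the transcript.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
ITEMS = set().union(*TRANSACTIONS)


def all_rules(items):
    """Yield every rule X -> Y with X, Y non-empty and disjoint."""
    items = sorted(items)
    for k in range(2, len(items) + 1):
        for itemset in combinations(items, k):
            for r in range(1, k):  # size of the antecedent
                for lhs in combinations(itemset, r):
                    rhs = frozenset(itemset) - frozenset(lhs)
                    yield frozenset(lhs), rhs


def brute_force(transactions, minsup, minconf):
    """List all rules, score each one, and keep those passing both thresholds."""
    items = set().union(*transactions)
    kept = []
    for lhs, rhs in all_rules(items):
        n_both = sum(1 for t in transactions if (lhs | rhs) <= t)
        n_lhs = sum(1 for t in transactions if lhs <= t)
        if n_lhs == 0:
            continue  # confidence is undefined for an antecedent that never occurs
        s, c = n_both / len(transactions), n_both / n_lhs
        if s >= minsup and c >= minconf:
            kept.append((lhs, rhs, s, c))
    return kept


print(sum(1 for _ in all_rules(ITEMS)))  # 602 candidate rules for 6 items
print(len(brute_force(TRANSACTIONS, minsup=0.4, minconf=0.6)))
```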

## Frequent Itemset Generation

Given d items, there are 2^d possible candidate itemsets.
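As a quick check of the 2^d figure, the sketch below enumerates every subset of a six-item universe (the item names are the ones used in the example rules; the function name is hypothetical). The count includes the empty itemset.

```python
from itertools import combinations


def candidate_itemsets(items):
    """Yield every subset of items, from the empty set up to the full set."""
    items = sorted(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


items = ["Bread", "Milk", "Diaper", "Beer", "Coke", "Eggs"]
lattice = list(candidate_itemsets(items))
print(len(lattice), 2 ** len(items))  # 64 64
```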

## Illustrating Apriori Principle

Figure: itemset lattice in which one itemset is found to be infrequent and its supersets are pruned.

## Illustrating Apriori Principle

Minimum Support = 3

- Items (1-itemsets)
- Pairs (2-itemsets): no need to generate candidates involving Coke or Eggs
- Triplets (3-itemsets)

Number of candidates:

- If every subset is considered: 41
- With support-based pruning: 13
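The two totals can be checked by hand. The worked arithmetic below assumes the six items that appear in the example rules (Bread, Milk, Diaper, Beer, Coke, Eggs), four of which stay frequent once Coke and Eggs are dropped. The single surviving triplet candidate is also an assumption, since the slide's tables are not part of this transcript.

$$
\binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 6 + 15 + 20 = 41
$$

$$
\underbrace{6}_{\text{items}} + \underbrace{\binom{4}{2}}_{\text{pairs}} + \underbrace{1}_{\text{triplet}} = 6 + 6 + 1 = 13
$$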

## Apriori Algorithm

- Method:
  - Let k = 1
  - Generate frequent itemsets of length 1
  - Repeat until no new frequent itemsets are identified
    - Generate length (k+1) candidate itemsets from length k frequent itemsets
    - Prune candidate itemsets containing subsets of length k that are infrequent
    - Count the support of each candidate by scanning the DB
    - Eliminate candidates that are infrequent, leaving only those that are frequent
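The method above can be sketched in Python as follows. This is a simplified illustration, not a reference implementation: candidates are generated by joining any two frequent k-itemsets whose union has k+1 items (a simpler join than the prefix-based one usually described), the threshold is given as an absolute count, and the dataset is assumed.

```python
from itertools import combinations

# Assumed market-basket data; the slide's own table is not in the transcript.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # k = 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        # Generate (k+1)-candidates from the frequent k-itemsets.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Prune: every k-subset of a candidate must itself be frequent.
            if all(frozenset(sub) in frequent for sub in combinations(union, k)):
                candidates.add(union)

        # Count the support of each candidate with one scan of the database.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        # Eliminate infrequent candidates.
        frequent = {s: n for s, n in counts.items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result


for itemset, n in sorted(apriori(TRANSACTIONS, 3).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On the assumed data with a minimum support count of 3, the sketch keeps four single items and four pairs. The one triplet that survives pruning, {Bread, Milk, Diaper}, has a count of 2 and is eliminated.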

## Apriori in DB2

S. Sarawagi, S. Thomas and R. Agrawal. "Integrating association rule mining with relational database systems: alternatives and implications". Proc. of the ACM SIGMOD Int'l Conference on Management of Data, Seattle, Washington, June 1998. BEST PAPER AWARD. An extended version also appeared in Data Mining and Knowledge Discovery Journal, 4(2/3), July 2000.

Because of the obvious challenges, reinforced by this paper, vendors and researchers gave up on the idea of turning DBMSs into data mining systems: OLAP is tightly integrated into the DBMS, but the KDD methods are not.

## But much research went into better methods for frequent itemsets

- Many methods explored different data representations and algorithms.

## FP-growth Algorithm

The best algorithm combines both:

- Use a compressed representation of the database using an FP-tree
- Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets

## FP-tree construction

Figure: FP-tree after reading TID=1 (node labels: null, A:1, B:1) and after reading TID=2 (node labels: null, A:1, B:1, C:1, D:1).

## FP-Tree

Figure: transaction database, header table, and the resulting FP-tree (node labels: null, A:7, B:5, B:3, C:3, D:1, C:1, D:1, C:3, D:1, E:1, D:1, E:1).

Pointers are used to assist frequent itemset generation.
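A minimal sketch of FP-tree construction with a header table of node pointers. The ten-transaction database is an assumption (the slide's table is not in this transcript); it was chosen so that the root's children come out as A:7 and B:3 with B:5 under A, matching the node labels above. Items are inserted in a fixed alphabetical order here, whereas FP-growth is usually described with items sorted by decreasing support. `FPNode` and `build_fp_tree` are hypothetical names.

```python
# Assumed transaction database (TID 1..10); not taken from the transcript.
DB = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}, {"A", "D", "E"},
    {"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C"}, {"A", "B", "C"},
    {"A", "B", "D"}, {"B", "C", "E"},
]


class FPNode:
    """One tree node: an item, its count, a parent link, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, order):
    """Insert each transaction as a path; shared prefixes share nodes."""
    root = FPNode(None)
    header = {item: [] for item in order}  # item -> pointers to its nodes
    rank = {item: i for i, item in enumerate(order)}
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1  # one more transaction passes through this node
            node = child
    return root, header


root, header = build_fp_tree(DB, ["A", "B", "C", "D", "E"])
for item, child in sorted(root.children.items()):
    print(f"{item}:{child.count}")
print("D nodes:", len(header["D"]))
```

Building the tree from only the first two transactions gives the two single-count branches described on the construction slide: A:1 followed by B:1, and B:1 followed by C:1 and D:1.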

## FP-growth

Figure: FP-tree paths that end in D (node labels: null, A:7, B:5, B:1, C:1, D:1, C:1, D:1, C:3, D:1, D:1).

- Conditional pattern base for D: P
- Recursively apply FP-growth on P
- Frequent itemsets found (with sup > 1): AD, BD, CD, ACD, BCD
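The conditional pattern base for an item can be collected by following that item's node pointers and walking each node's parent links back to the root. The sketch below does this for D. It is an illustration under assumptions: the database is assumed (the slide's table and the set P are not in this transcript), items are inserted alphabetically, and the names are hypothetical. It stops at the pattern base and does not perform the recursive mining step.

```python
# Assumed transaction database (TID 1..10); not taken from the transcript.
DB = [
    {"A", "B"}, {"B", "C", "D"}, {"A", "C", "D", "E"}, {"A", "D", "E"},
    {"A", "B", "C"}, {"A", "B", "C", "D"}, {"B", "C"}, {"A", "B", "C"},
    {"A", "B", "D"}, {"B", "C", "E"},
]


class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}


def build(transactions):
    """Build an FP-tree (alphabetical item order) plus per-item node links."""
    root, links = Node(None, None), {}
    for t in transactions:
        node = root
        for item in sorted(t):
            if item not in node.children:
                node.children[item] = Node(item, node)
                links.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, links


def conditional_pattern_base(links, item):
    """Return [(prefix path, count)] for every tree node labelled item."""
    base = []
    for node in links.get(item, []):
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        base.append((tuple(reversed(path)), node.count))
    return base


root, links = build(DB)
for path, count in sorted(conditional_pattern_base(links, "D")):
    print(path, count)
```

Each prefix path is a smaller "transaction" weighted by its count, so running the same tree-building and mining procedure on this base is what the recursive step amounts to.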
