1 Association Rules
- Mining Association Rules between Sets of Items in Large Databases (R. Agrawal, T. Imielinski & A. Swami), 1993.
- Fast Algorithms for Mining Association Rules (R. Agrawal & R. Srikant), 1994.

2 Basket Data
Retail organizations, e.g., supermarkets, collect and store massive amounts of sales data, called basket data.
A record consists of:
  - transaction date
  - items bought
Or, basket data may consist of items bought by a customer over a period.

3 Example Association Rule
90% of transactions that purchase bread and butter also purchase milk.
Antecedent: bread and butter
Consequent: milk
Confidence factor: 90%

4 Example Queries
- Find all the rules that have “Uludağ Gazozu” as consequent.
- Find all rules that have “Diet Coke” in the antecedent.
- Find all rules that have “sausage” in the antecedent and “mustard” in the consequent.
- Find all the rules relating items located on shelves A and B in the store.
- Find the “best” (most confident) k rules that have “Uludağ Gazozu” in the consequent.

5 Formal Model
$I = \{i_1, i_2, \ldots, i_m\}$: set of literals (items)
$D$: database of transactions
$T \in D$: a transaction, with $T \subseteq I$
  - TID: unique identifier associated with each $T$
$X$: a subset of $I$
  - $T$ contains $X$ if $X \subseteq T$

6 Formal Model (Cont.)
Association rule: $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$.
Rule $X \Rightarrow Y$ has confidence $c$ in $D$ if $c\%$ of the transactions in $D$ that contain $X$ also contain $Y$.
Rule $X \Rightarrow Y$ has support $s$ in $D$ if $s\%$ of the transactions in $D$ contain $X \cup Y$.
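The two definitions can be checked with a few lines of Python. This is only a sketch: the function names `support` and `confidence` and the four toy transactions are assumptions made for illustration, and the values are returned as fractions rather than percentages.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Among transactions containing X, the fraction that also contain Y."""
    x, y = frozenset(antecedent), frozenset(consequent)
    if x & y:
        raise ValueError("antecedent and consequent must be disjoint")
    containing_x = [t for t in transactions if x <= t]
    if not containing_x:
        return 0.0
    return sum(1 for t in containing_x if y <= t) / len(containing_x)


# Hypothetical toy database (not taken from the slides).
transactions = [
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk", "eggs"}),
]

# Support of the rule {bread, butter} => {milk} is the support of X union Y.
s = support({"bread", "butter", "milk"}, transactions)
c = confidence({"bread", "butter"}, {"milk"}, transactions)
print(f"support={s:.2f} confidence={c:.2f}")
```

Here two of the four transactions contain all three items, and two of the three transactions containing bread and butter also contain milk.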

7 Example
I: itemset {cucumber, parsley, onion, tomato, salt, bread, olives, cheese, butter}
D: set of transactions
  1 {cucumber, parsley, onion, tomato, salt, bread}
  2 {tomato, cucumber, parsley}
  3 {tomato, cucumber, olives, onion, parsley}
  4 {tomato, cucumber, onion, bread}
  5 {tomato, salt, onion}
  6 {bread, cheese}
  7 {tomato, cheese, cucumber}
  8 {bread, butter}

8 Problem
Given a set of transactions, generate all association rules that have support and confidence greater than the user-specified minimum support (minsup) and minimum confidence (minconf).

9 Problem decomposition
1. Find all itemsets that have transaction support above minimum support (the large itemsets).
2. Use the large itemsets to generate the association rules:
   2.1. For every large itemset I, find all its non-empty proper subsets.
   2.2. For every such subset a, output a rule $a \Rightarrow (I - a)$ if $\mathrm{support}(I) / \mathrm{support}(a) \geq \mathit{minconf}$.
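Step 2 can be sketched in Python as below. The sketch assumes step 1 has already produced a dictionary of support counts for every large itemset; the name `generate_rules` and the sample counts are illustrative assumptions, and the confidence test is taken to be the ratio of the support of the whole itemset to the support of the antecedent.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[int], FrozenSet[int], float]


def generate_rules(support_count: Dict[FrozenSet[int], int],
                   minconf: float) -> List[Rule]:
    """Emit a => (l - a) for every large itemset l and non-empty proper subset a."""
    rules: List[Rule] = []
    for itemset, count in support_count.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), size):
                a = frozenset(subset)
                # Every subset of a large itemset is itself large,
                # so its count is expected to be in the dictionary.
                conf = count / support_count[a]
                if conf >= minconf:
                    rules.append((a, itemset - a, conf))
    return rules


# Hypothetical support counts for the large itemsets of a small database.
support_count = {
    frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
    frozenset({2, 3, 5}): 2,
}

rules = generate_rules(support_count, minconf=0.9)
for a, rest, conf in rules:
    print(sorted(a), "=>", sorted(rest), f"confidence={conf:.2f}")
```

The 1994 paper describes a faster recursive variant of this step; the exhaustive enumeration above is only the simplest form of it.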

10 Discovering Large Itemsets
Apriori and AprioriTid algorithms:
Basic intuition: any subset of a large itemset must be large.
Candidate itemsets having k items can be generated by joining large itemsets having k-1 items, and deleting those that contain any subset that is not large.
Def. k-itemset: an itemset with k items. $L_k$ is the set of large k-itemsets and $C_k$ the set of candidate k-itemsets.

11 Apriori Algorithm
    L_1 = { large 1-itemsets }
    for (k = 2; L_{k-1} ≠ ∅; k++) do begin
        C_k = apriori-gen(L_{k-1});      // New candidates
        forall transactions t ∈ D do begin
            C_t = subset(C_k, t);        // Candidates contained in t
            forall candidates c ∈ C_t do
                c.count++
        end
        L_k = { c ∈ C_k | c.count ≥ minsup }
    end
    Return ∪_k L_k

12 Apriori Candidate Generation
apriori-gen($L_{k-1}$): returns a superset of the set of all large k-itemsets.
- Join: select two itemsets p, q from $L_{k-1}$ such that the first k-2 items of p and q are the same, and form a new candidate k-itemset c from the common k-2 items plus the 2 differing items.
- Prune: delete those c for which some (k-1)-subset of c is not in $L_{k-1}$.
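The join and prune steps might look like this in Python. It is a minimal sketch that assumes each itemset is stored as a sorted tuple; the function name `apriori_gen` mirrors the slide, while the sample input is an illustrative set of large 3-itemsets that does not come from the slides.

```python
from itertools import combinations
from typing import Iterable, Set, Tuple

Itemset = Tuple[int, ...]


def apriori_gen(large_prev: Iterable[Itemset]) -> Set[Itemset]:
    """Build candidate k-itemsets from the large (k-1)-itemsets."""
    prev = set(large_prev)
    if not prev:
        return set()
    k = len(next(iter(prev))) + 1

    # Join: p and q share their first k-2 items and differ in the last one.
    joined = {
        p + (q[-1],)
        for p in prev
        for q in prev
        if p[:-1] == q[:-1] and p[-1] < q[-1]
    }

    # Prune: drop any candidate with a (k-1)-subset that is not large.
    return {
        c for c in joined
        if all(sub in prev for sub in combinations(c, k - 1))
    }


# Illustrative input: five large 3-itemsets.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
C4 = apriori_gen(L3)
print(C4)
```

The join produces (1, 2, 3, 4) and (1, 3, 4, 5); the second is pruned because its subset (1, 4, 5) is not among the large 3-itemsets.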

13 Apriori Algorithm (cont.)
- Go through all transactions in D, incrementing the counts of all itemsets in $C_k$ that they contain.
- $L_k$ is the set of all large itemsets in $C_k$.
- For minsup s = 30% (at least 3 of the 8 example transactions), L = {{bread}, {cucumber}, {onion}, {parsley}, {tomato}, {cucumber, onion}, {cucumber, parsley}, {cucumber, tomato}, {onion, tomato}, {parsley, tomato}, {cucumber, onion, tomato}, {cucumber, parsley, tomato}}
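The level-wise loop can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it counts candidates with a plain scan instead of a hash-tree, and the names `apriori` and `min_count` are made up for the example. The data are the eight transactions of the example slide, and a 30% minimum support over 8 transactions is taken as a count of at least 3.

```python
from itertools import combinations
from math import ceil
from typing import Dict, List, Set, Tuple

Itemset = Tuple[str, ...]


def apriori(transactions: List[Set[str]], min_count: int) -> Dict[Itemset, int]:
    """Return every large itemset (as a sorted tuple) with its support count."""
    # Pass 1: count individual items.
    counts: Dict[Itemset, int] = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    large = {c: n for c, n in counts.items() if n >= min_count}
    result = dict(large)

    k = 2
    while large:
        prev = set(large)
        # Candidate generation: join on the first k-2 items, then prune.
        joined = {p + (q[-1],) for p in prev for q in prev
                  if p[:-1] == q[:-1] and p[-1] < q[-1]}
        candidates = {c for c in joined
                      if all(s in prev for s in combinations(c, k - 1))}
        # Counting pass over the database (linear scan, no hash-tree).
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        large = {c: n for c, n in counts.items() if n >= min_count}
        result.update(large)
        k += 1
    return result


transactions = [
    {"cucumber", "parsley", "onion", "tomato", "salt", "bread"},
    {"tomato", "cucumber", "parsley"},
    {"tomato", "cucumber", "olives", "onion", "parsley"},
    {"tomato", "cucumber", "onion", "bread"},
    {"tomato", "salt", "onion"},
    {"bread", "cheese"},
    {"tomato", "cheese", "cucumber"},
    {"bread", "butter"},
]

min_count = ceil(0.30 * len(transactions))  # 30% of 8 transactions
large_itemsets = apriori(transactions, min_count)
for itemset in sorted(large_itemsets, key=lambda c: (len(c), c)):
    print(itemset, large_itemsets[itemset])
```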

14 Subset Function
subset($C_k$, t): candidate itemsets contained in t
- Candidate itemsets in $C_k$ are stored in a hash-tree
- Leaf node: contains a list of itemsets
- Interior node: contains a hash table
  - Each bucket points to another node
  - Depth of root = 1
  - Buckets of a node at depth d point to nodes at depth d+1

15 Subset Function (cont.)
Construction of the hash-tree for $C_k$
- To add itemset c:
  - start from the root
  - go down until reaching a leaf node
  - at an interior node at depth d, choose the branch to follow by applying a hash function to the d-th item of c
- All nodes are initially created as leaves
- A leaf is converted into an interior node when the number of itemsets it holds exceeds a threshold.

16 Subset Function (cont.)
- After constructing the hash-tree for $C_k$, the subset function finds the candidates contained in t as follows:
  - At a leaf, find the itemsets contained in t
  - At an interior node reached by hashing on item i, hash on each item that comes after i in t, and recursively apply the procedure to the node in the corresponding bucket
  - At the root, hash on every item in t.
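The construction and lookup rules above can be put together in a small Python sketch. The class names, the modulo hash, and the `num_buckets` and `leaf_capacity` parameters are all assumptions chosen for illustration; itemsets and transactions are handled as sorted tuples of integers.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

Itemset = Tuple[int, ...]


class Node:
    def __init__(self, depth: int) -> None:
        self.depth = depth                       # the root has depth 1
        self.itemsets: List[Itemset] = []        # used while the node is a leaf
        self.children: Optional[Dict[int, "Node"]] = None  # set once interior


class HashTree:
    def __init__(self, k: int, num_buckets: int = 3, leaf_capacity: int = 2) -> None:
        self.k = k
        self.num_buckets = num_buckets
        self.leaf_capacity = leaf_capacity
        self.root = Node(1)

    def _bucket(self, item: int) -> int:
        return hash(item) % self.num_buckets

    def insert(self, itemset: Itemset) -> None:
        node = self.root
        while node.children is not None:
            # Interior node at depth d: hash on the d-th item of the itemset.
            b = self._bucket(itemset[node.depth - 1])
            node = node.children.setdefault(b, Node(node.depth + 1))
        node.itemsets.append(itemset)
        self._split_if_needed(node)

    def _split_if_needed(self, node: Node) -> None:
        # A node at depth k+1 has no item left to hash on, so it stays a leaf.
        if len(node.itemsets) <= self.leaf_capacity or node.depth > self.k:
            return
        stored, node.itemsets, node.children = node.itemsets, [], {}
        for c in stored:
            b = self._bucket(c[node.depth - 1])
            node.children.setdefault(b, Node(node.depth + 1)).itemsets.append(c)
        for child in node.children.values():
            self._split_if_needed(child)

    def subset(self, transaction: Iterable[int]) -> Set[Itemset]:
        items = tuple(sorted(set(transaction)))
        found: Set[Itemset] = set()
        self._visit(self.root, items, 0, set(items), found)
        return found

    def _visit(self, node: Node, items: Itemset, start: int,
               item_set: Set[int], found: Set[Itemset]) -> None:
        if node.children is None:
            found.update(c for c in node.itemsets if item_set.issuperset(c))
            return
        # Hash on every item after the one that led to this node.
        for i in range(start, len(items)):
            child = node.children.get(self._bucket(items[i]))
            if child is not None:
                self._visit(child, items, i + 1, item_set, found)


tree = HashTree(k=2)
for candidate in [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]:
    tree.insert(candidate)

print(tree.subset({1, 3, 4}))
print(tree.subset({2, 3, 5}))
```

Because the same leaf can be reached along more than one path, the sketch collects matches in a set so that no candidate is reported twice.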

17 AprioriTid Algorithm
- Uses apriori-gen to generate candidates
- Database D is not used for counting support after the first pass
- The set $\overline{C}_k$ is used for this purpose
- Elements of $\overline{C}_k$ are of the form $\langle TID, \{X_k\} \rangle$, where each $X_k$ is a potentially large k-itemset present in the transaction with identifier TID.
- The member of $\overline{C}_k$ corresponding to transaction t is $\langle t.TID, \{c \in C_k \mid c \text{ contained in } t\} \rangle$

18 AprioriTid Algorithm (cont.)
    L_1 = { large 1-itemsets }
    C̄_1 = database D
    for (k = 2; L_{k-1} ≠ ∅; k++) do begin
        C_k = apriori-gen(L_{k-1});    // New candidates
        C̄_k = ∅
        forall entries t ∈ C̄_{k-1} do begin
            // Determine candidates in C_k contained in the transaction t.TID
            C_t = { c ∈ C_k | both (k-1)-subsets of c obtained by dropping one of its last two items are in t.set-of-itemsets }
            forall candidates c ∈ C_t do
                c.count++
            if (C_t ≠ ∅) then C̄_k += <t.TID, C_t>
        end
        L_k = { c ∈ C_k | c.count ≥ minsup }
    end
    Return ∪_k L_k

19 Example
minsup = 2 transactions (s = 50%)
D:
  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5
L_1:
  Itemset   Sup
  {1}       2
  {2}       3
  {3}       3
  {5}       3
C̄_1:
  TID   Set-of-Itemsets
  100   {{1},{3},{4}}
  200   {{2},{3},{5}}
  300   {{1},{2},{3},{5}}
  400   {{2},{5}}
C̄_2:
  TID   Set-of-Itemsets
  100   {{1,3}}
  200   {{2,3},{2,5},{3,5}}
  300   {{1,2},{1,3},{1,5},{2,3},{2,5},{3,5}}
  400   {{2,5}}
L_2 = {{1,3}, {2,3}, {2,5}, {3,5}}
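The example can be replayed with a short Python sketch of AprioriTid. It is an illustration under stated assumptions rather than the authors' code: itemsets are sorted tuples, the per-transaction sets are kept in a dictionary named `c_bar`, and the function names are chosen for the example. After the first pass the raw item lists are never read again; only `c_bar` is scanned.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Itemset = Tuple[int, ...]


def apriori_gen(prev: Set[Itemset], k: int) -> Set[Itemset]:
    """Join large (k-1)-itemsets on their first k-2 items, then prune."""
    joined = {p + (q[-1],) for p in prev for q in prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined
            if all(s in prev for s in combinations(c, k - 1))}


def apriori_tid(database: Dict[int, Set[int]], min_count: int) -> Dict[Itemset, int]:
    # Pass 1: count single items; the first c_bar is the database itself,
    # with every item stored as a 1-itemset.
    counts: Dict[Itemset, int] = {}
    for items in database.values():
        for item in items:
            counts[(item,)] = counts.get((item,), 0) + 1
    large = {c for c, n in counts.items() if n >= min_count}
    answer = {c: counts[c] for c in large}
    c_bar = {tid: {(i,) for i in items} for tid, items in database.items()}

    k = 2
    while large:
        candidates = apriori_gen(large, k)
        counts = dict.fromkeys(candidates, 0)
        next_c_bar: Dict[int, Set[Itemset]] = {}
        for tid, present in c_bar.items():
            # c is in the transaction iff the two (k-1)-subsets obtained by
            # dropping its last item and its second-to-last item are present.
            c_t = {c for c in candidates
                   if c[:-1] in present and c[:-2] + c[-1:] in present}
            for c in c_t:
                counts[c] += 1
            if c_t:
                next_c_bar[tid] = c_t
        large = {c for c, n in counts.items() if n >= min_count}
        answer.update({c: counts[c] for c in large})
        c_bar = next_c_bar
        k += 1
    return answer


database = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
answer = apriori_tid(database, min_count=2)
for itemset in sorted(answer, key=lambda c: (len(c), c)):
    print(itemset, answer[itemset])
```

Entries whose candidate set becomes empty are dropped from `c_bar`, which is why the structure tends to shrink in later passes.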

20 Performance
Example:
HW: IBM RS/6000, 33 MHz
Dataset:
  Number of items: 1000
  Avg. size of transactions: 10
  Avg. size of maximal potentially large itemsets: 4
  Number of transactions: 100K
  Data size: 4.4 MBytes

21 Apriori vs. AprioriTid
Figure: per-pass execution times of Apriori and AprioriTid.
Average size of transactions: 10
Average size of maximal potentially large itemsets: 4
Number of transactions: 100K
minsup = 0.75%

22 AprioriHybrid Algorithm
- Uses Apriori in the initial passes and switches to AprioriTid when it expects that the set $\overline{C}_k$ at the end of the pass will fit in memory.

23 Conclusions and Future Work
- Apriori, AprioriTid and AprioriHybrid algorithms presented
- Future work:
  - use is-a hierarchies (e.g., beef is-a red-meat is-a meat)
  - use quantities of items bought
- This work is in the context of the Quest project at IBM

