
1 Data Mining: Association Rule Mining CSE880: Database Systems

2 Data, data everywhere
- Walmart records ~20 million transactions per day
- Google has indexed ~3 billion Web pages
- Yahoo collects ~10 GB of Web traffic data per hour
  - The U of M Computer Science department collected ~80 MB of Web traffic on April 28, 2003
- NASA's Earth Observing System (EOS) satellites produce over 1 PB of Earth Science data per year
- NYSE trading volume: 1,443,627,000 shares (Oct 31, 2003)
- Scientific simulations can produce terabytes of data per hour

3 Data Mining - Motivation
- There is often information "hidden" in the data that is not readily evident
- Human analysts may take weeks to discover useful information
- Much of the data is never analyzed at all
[Figure: "The Data Gap" - total new disk storage (TB) shipped since 1995 versus the number of analysts. From: R. Grossman, C. Kamath, V. Kumar, "Data Mining for Scientific and Engineering Applications"]

4 What is Data Mining?
- Many definitions:
  - Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  - Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
[Figure: data mining as a step in the Knowledge Discovery Process, turning data into information]

5 Data Mining Tasks
- Classification
- Clustering
- Association Rule Mining
- Anomaly Detection

6 Association Rule Mining
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
- Example of association rules over market-basket transactions:
  {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}
- Implication here means co-occurrence, not causality!

7 Terminology: Frequent Itemset
- Market basket: the items bought in a single customer transaction
- Itemset: X = {x1, ..., xk}
- Support: the support of itemset X is the fraction of transactions in the database that contain all the items in X

  Transaction ID | Items bought
  1 | A, B, C
  2 | A, C
  3 | A, D
  4 | B, E, F

- Let minsup = 50%. Itemsets that meet the minsup threshold are called frequent itemsets:

  Frequent Itemset | Support
  {A} | 75%
  {B} | 50%
  {C} | 50%
  {A, C} | 50%

- Itemsets such as {A, D} and {D} do not meet the minsup threshold
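To make the support computation concrete, here is a minimal Python sketch; the transactions mirror the table above, and the function and variable names are illustrative rather than anything defined on the slides.

transactions = [
    {"A", "B", "C"},   # TID 1
    {"A", "C"},        # TID 2
    {"A", "D"},        # TID 3
    {"B", "E", "F"},   # TID 4
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

print(support({"A"}, transactions))       # 0.75
print(support({"A", "C"}, transactions))  # 0.5
print(support({"A", "D"}, transactions))  # 0.25, below minsup = 0.5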

8 Terminology: Association Rule
- Given a pair of itemsets X and Y, the association rule X → Y states: if X is purchased in a transaction, it is very likely that Y is also purchased in that transaction
  - Support of X → Y: the fraction of transactions that contain both X and Y, i.e. the support of the itemset X ∪ Y
  - Confidence of X → Y: the conditional probability that a transaction containing X also contains Y, i.e. support(X ∪ Y) / support(X)
[Figure: Venn diagram of sup(X) (customers who buy X), sup(Y) (customers who buy Y), and sup(X ∪ Y) (customers who buy both)]
- Goal of association rule mining: discover all rules having (1) support ≥ minsup and (2) confidence ≥ minconf
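A matching sketch for confidence, self-contained and using the same 4-transaction database as above (names again illustrative):

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X -> Y) = support(X ∪ Y) / support(X)."""
    X, Y = set(X), set(Y)
    return support(X | Y, transactions) / support(X, transactions)

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(confidence({"A"}, {"C"}, transactions))   # 0.5 / 0.75 ≈ 0.67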

9 An Illustrative Example [figure-only slide]

10 How to Generate Association Rules
- How many possible association rules are there?
- Naïve way:
  - Enumerate all the possible rules
  - Compute support and confidence for each rule
  - Exponentially expensive!
- Need to decouple the minimum support and minimum confidence requirements

11 How to Generate Association Rules?
- Example of rules:
  {Milk, Diaper} → {Beer} (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper} (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk} (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer} (s=0.4, c=0.5)
- Observation: all of the rules above have identical support. Why? (They are all binary partitions of the same itemset {Milk, Diaper, Beer}, and a rule's support depends only on that itemset.)
- Approach: divide the rule generation process into two steps:
  1. Generate the frequent itemsets first
  2. Generate association rules from the frequent itemsets

12 Generating Frequent Itemsets
- There are 2^d possible itemsets over d items
[Figure: the itemset lattice]

13 Generating Frequent Itemsets
- Naive approach:
  - Each itemset in the lattice is a candidate frequent itemset
  - Count the support of each candidate by scanning the database
  - Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the transaction width
  - Expensive, since M = 2^d!

14 Approaches for Mining Frequent Itemsets
- Reduce the number of candidates (M)
  - Complete search considers M = 2^d candidates; use pruning to reduce M
- Reduce the number of transactions (N)
  - Reduce the size of N as the size of the itemset increases
- Reduce the number of comparisons (NM)
  - Use efficient data structures to store the candidates and transactions
  - No need to match every candidate against every transaction

15 Apriori Principle
- Any subset of a frequent itemset must also be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
- Support has the anti-monotone property
  - If A is a subset of B, then support(A) ≥ support(B)
- Apriori pruning principle: if an itemset is infrequent, its supersets need not be generated!

16 Illustrating the Apriori Principle
[Figure: the itemset lattice; one itemset is found to be infrequent, and all of its supersets are pruned]

17 Apriori Algorithm
- Method:
  - Let k = 1
  - Find frequent itemsets of length 1
  - Repeat until no new frequent itemsets are identified:
    - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
    - Update the support count of the candidates by scanning the DB
    - Eliminate candidates that are infrequent, leaving only those that are frequent
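A minimal Python sketch of this level-wise loop, assuming a small in-memory list of transactions. Candidate generation already uses the prefix join and subset pruning described on the later slides; all names are illustrative, and the example database is the 5-transaction market-basket set consistent with the support/confidence values quoted on slide 11.

from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    # k = 1: count individual items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        # Generate length-(k+1) candidates by joining itemsets sharing a (k-1)-prefix
        sorted_sets = sorted(tuple(sorted(s)) for s in frequent)
        candidates = set()
        for i in range(len(sorted_sets)):
            for j in range(i + 1, len(sorted_sets)):
                a, b = sorted_sets[i], sorted_sets[j]
                if a[:-1] == b[:-1]:                       # identical (k-1)-prefix
                    cand = frozenset(a) | frozenset(b)
                    # Prune: every k-subset of the candidate must itself be frequent
                    if all(frozenset(sub) in frequent
                           for sub in combinations(sorted(cand), k)):
                        candidates.add(cand)
        # Count candidate supports with one scan of the DB
        counts = {c: 0 for c in candidates}
        for t in transactions:
            t = set(t)
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

db = [{"Bread", "Milk"},
      {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"},
      {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"}]
print(apriori(db, minsup_count=3))
# 4 frequent items and 4 frequent pairs; the lone triplet candidate
# {Bread, Milk, Diaper} has support 2 and is eliminated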

18 Illustrating the Apriori Principle
- Minimum support count = 3
- Items (1-itemsets), then pairs (2-itemsets) with no need to generate candidates involving Coke or Eggs, then triplets (3-itemsets)
- If every subset were considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
- With support-based pruning: 6 + 6 + 1 = 13 candidates

19 How to Generate Candidates?
- Given the set of frequent itemsets of length 3: {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {B,C,E}, {C,D,E}
- Generate candidate itemsets of length 4 by adding a frequent item to each frequent itemset of length 3
  - Produces lots of unnecessary candidates. Suppose A, B, C, D, E, F, G, H are all frequent items:
    {A,B,C} + E produces the candidate {A,B,C,E}
    {A,B,C} + F produces the candidate {A,B,C,F}
    {A,B,C} + G produces the candidate {A,B,C,G}
    {A,B,C} + H produces the candidate {A,B,C,H}
  - These candidates are guaranteed to be infrequent. Why? (Each has a 3-subset, e.g. {A,B,E} or {A,B,F}, that is not among the frequent 3-itemsets)

20 How to Generate Candidates?
- Given the set of frequent itemsets of length 3: {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {B,C,E}, {C,D,E}
- Generate candidate itemsets of length 4 by joining pairs of frequent itemsets of length 3
  - Joining {A,B,C} with {A,B,D} produces the candidate {A,B,C,D}
  - Problem 1: duplicate/redundant candidates. Joining {A,B,C} with {B,C,D} produces the same candidate {A,B,C,D}
  - Problem 2: unnecessary candidates. Joining {A,B,C} with {B,C,E} produces the candidate {A,B,C,E}, which is guaranteed to be infrequent

21 How to Generate Candidates?
- Given the set of frequent itemsets of length 3, each listed in lexicographic order: {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {B,C,E}, {C,D,E}
- Because the itemsets are ordered, each candidate can be generated exactly once
- Join a pair of frequent k-itemsets only if their prefixes of length (k-1) are identical:
  - Join {A,B,C} and {A,B,D} to produce {A,B,C,D}
  - Do not join {A,B,C} and {B,C,D}, since they do not share the same prefix; this avoids generating duplicate candidates
  - Do not join {A,C,D} and {C,D,E}: the result would contain a subset (e.g. {A,C,E}) that is not frequent
- Pruning:
  - Joining {B,C,D} and {B,C,E} produces {B,C,D,E}
  - Prune {B,C,D,E} because one of its subsets, {B,D,E}, is not frequent

22 How to Generate Candidates?
- Let L_{k-1} be the set of all frequent itemsets of length k-1
- Assume the items in each itemset of L_{k-1} are sorted in lexicographic order
- Step 1: self-join L_{k-1}
  insert into C_k
  select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
- Step 2: pruning
  forall itemsets c in C_k do
    forall (k-1)-subsets s of c do
      if (s is not in L_{k-1}) then delete c from C_k
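The same two steps expressed in Python (a sketch, not taken from the slides; L_prev is assumed to be a set of frozensets holding the frequent (k-1)-itemsets):

from itertools import combinations

def generate_candidates(L_prev, k):
    """Self-join L_{k-1} with itself, then prune; returns the candidate set C_k."""
    # Step 1: self-join -- combine itemsets whose first k-2 items agree
    sorted_items = sorted(tuple(sorted(s)) for s in L_prev)
    C_k = set()
    for p, q in combinations(sorted_items, 2):
        if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
            C_k.add(frozenset(p) | frozenset(q))
    # Step 2: prune -- drop candidates that have an infrequent (k-1)-subset
    return {c for c in C_k
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

# The frequent 3-itemsets from slide 21:
L3 = {frozenset(s) for s in [("A","B","C"), ("A","B","D"), ("A","C","D"),
                             ("B","C","D"), ("B","C","E"), ("C","D","E")]}
print(generate_candidates(L3, 4))   # only {A,B,C,D} survives the prune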

23 How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be huge
  - Each transaction may contain many candidates
- Method:
  - Store the candidate itemsets in a hash tree
  - A leaf node of the hash tree contains a list of itemsets and their respective support counts
  - An interior node contains a hash table
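A full hash tree is somewhat involved, so as a simplified stand-in the sketch below counts candidate supports with a flat hash map, enumerating the k-subsets of each transaction and looking each one up. This captures the counting logic but not the hash-tree data structure itself; all names are illustrative.

from itertools import combinations

def count_supports(transactions, candidates, k):
    """Count how many transactions contain each candidate k-itemset."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        # Enumerate the k-subsets of the transaction instead of testing every candidate
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts

db = [{"Bread", "Milk"},
      {"Bread", "Diaper", "Beer", "Eggs"},
      {"Milk", "Diaper", "Beer", "Coke"},
      {"Bread", "Milk", "Diaper", "Beer"},
      {"Bread", "Milk", "Diaper", "Coke"}]
cands = [frozenset({"Milk", "Diaper", "Beer"}), frozenset({"Bread", "Milk", "Diaper"})]
print(count_supports(db, cands, k=3))   # both candidates have support count 2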

24 Generating the Hash Tree
- Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
- You need:
  - A hash function (here, items 1, 4, 7 hash to the first branch, 2, 5, 8 to the second, and 3, 6, 9 to the third)
  - A maximum leaf size: the maximum number of itemsets stored in a leaf node (if the number of itemsets exceeds the max leaf size, split the node)
[Figure: the resulting hash tree over the 15 candidate itemsets]

25 Example: Counting Supports of Candidates
[Figure: the hash tree from the previous slide used to count supports for the transaction {1 2 3 5 6}. The subset function splits the transaction as 1 + {2 3 5 6}, 2 + {3 5 6}, 3 + {5 6}, then recursively as 1 2 + {3 5 6}, 1 3 + {5 6}, and so on, hashing each partial itemset down the tree until the matching leaves are reached and their candidates are checked against the transaction]

26 Rule Generation
- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that the rule f → (L - f) satisfies the minimum confidence requirement
- If {A,B,C,D} is a frequent itemset, the candidate rules are:
  ABC → D, ABD → C, ACD → B, BCD → A,
  A → BCD, B → ACD, C → ABD, D → ABC,
  AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
- If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L → ∅ and ∅ → L)
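A short sketch that enumerates these candidate rules and confirms the 2^k - 2 count (names are illustrative):

from itertools import combinations

def candidate_rules(L):
    """Enumerate every rule f -> (L - f) for non-empty proper subsets f of L."""
    L = frozenset(L)
    rules = []
    for r in range(1, len(L)):                 # antecedent sizes 1 .. |L|-1
        for f in combinations(sorted(L), r):
            f = frozenset(f)
            rules.append((f, L - f))
    return rules

print(len(candidate_rules({"A", "B", "C", "D"})))   # 2**4 - 2 = 14 candidate rules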

27 Rule Generation
- How can rules be generated efficiently from frequent itemsets?
  - In general, confidence does not have an anti-monotone property
  - But the confidence of rules generated from the same itemset does have an anti-monotone property
  - For the frequent itemset f = {A,B,C,D}: conf(ABC → D) ≥ conf(AB → CD) ≥ conf(A → BCD)
  - Confidence is non-increasing as the number of items in the rule consequent increases
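A sketch of rule generation that exploits this property: consequents are grown level by level, and a consequent is only extended if the rule with the smaller consequent already met minconf. The support_count dictionary is assumed to come from a frequent-itemset pass such as the Apriori sketch above; the hand-filled counts below are consistent with the 5-transaction example database, and all names are illustrative.

from itertools import combinations

def merge_consequents(Hs):
    """Join consequents of size m into candidate consequents of size m+1."""
    merged = set()
    for a, b in combinations(Hs, 2):
        u = a | b
        if len(u) == len(a) + 1:
            merged.add(u)
    return list(merged)

def generate_rules(itemset, support_count, minconf):
    """Return (antecedent, consequent, confidence) triples for one frequent itemset."""
    itemset = frozenset(itemset)
    rules = []
    consequents = [frozenset([x]) for x in itemset]   # start with 1-item consequents
    while consequents:
        passed = []
        for H in consequents:
            antecedent = itemset - H
            if not antecedent:
                continue
            conf = support_count[itemset] / support_count[antecedent]
            if conf >= minconf:
                rules.append((antecedent, H, conf))
                passed.append(H)
        # Grow consequents only from those that met minconf (anti-monotone property)
        if passed and len(passed[0]) < len(itemset) - 1:
            consequents = merge_consequents(passed)
        else:
            consequents = []
    return rules

sc = {frozenset(s): c for s, c in [
    (("Milk", "Diaper", "Beer"), 2), (("Milk", "Diaper"), 3), (("Milk", "Beer"), 2),
    (("Diaper", "Beer"), 3), (("Milk",), 4), (("Diaper",), 4), (("Beer",), 3)]}
for a, h, c in generate_rules({"Milk", "Diaper", "Beer"}, sc, minconf=0.6):
    print(set(a), "->", set(h), round(c, 2))   # e.g. {Milk, Diaper} -> {Beer} 0.67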

28 Rule Generation for the Apriori Algorithm
[Figure: lattice of rules generated from a frequent itemset]

29 Bottleneck of the Apriori Algorithm
- The core of the Apriori algorithm:
  - Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  - Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori is candidate generation
  - There can still be far too many candidates
  - For example, with 10,000 frequent items we need to generate (10,000 × 9,999)/2 ≈ 5 × 10^7 candidates of length 2
  - Many of them end up being infrequent

30 FP-Tree and FP-Growth
- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  - Highly condensed, but complete for frequent pattern mining
  - Avoids costly repeated database scans
- Develop an efficient FP-tree-based frequent pattern mining algorithm, called FP-growth
  - A divide-and-conquer methodology: decompose the mining task into smaller ones
  - Generate "candidates" only when it is necessary

31 Constructing the FP-tree
- minsup = 3

  TID | Items bought | (Ordered) frequent items
  100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
  200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
  300 | {b, f, h, j, o, w} | {f, b}
  400 | {b, c, k, s, p} | {c, b, p}
  500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items in descending order of frequency to obtain the f-list: f-c-a-b-m-p
3. Scan the DB again and construct the FP-tree

- Header table (item: frequency): f:4, c:4, a:3, b:3, m:3, p:3, with node-links into the tree
[Figure: the resulting FP-tree, rooted at {}, with paths f:4-c:3-a:3-m:2-p:2, f:4-c:3-a:3-b:1-m:1, f:4-b:1, and c:1-b:1-p:1]
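A compact sketch of FP-tree construction under these assumptions (minsup given as an absolute count, items reordered by the f-list; the class and field names are illustrative, not taken from the FP-growth paper):

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item label, or None for the root
        self.count = 0            # number of transactions sharing this path
        self.parent = parent
        self.children = {}        # item -> FPNode

def build_fptree(transactions, minsup_count):
    """Return (root, header), where header maps each frequent item to its node-links."""
    # Pass 1: count item frequencies and build the f-list
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    flist = [i for i, c in sorted(freq.items(), key=lambda x: -x[1])
             if c >= minsup_count]                  # ties broken arbitrarily
    rank = {item: r for r, item in enumerate(flist)}

    # Pass 2: insert each transaction's frequent items in f-list order
    root = FPNode(None, None)
    header = defaultdict(list)                      # item -> list of FPNodes
    for t in transactions:
        ordered = sorted([i for i in t if i in rank], key=lambda i: rank[i])
        node = root
        for item in ordered:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header

# The transaction database from this slide:
db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
root, header = build_fptree(db, minsup_count=3)
print(sorted((item, sum(n.count for n in nodes)) for item, nodes in header.items()))
# [('a', 3), ('b', 3), ('c', 4), ('f', 4), ('m', 3), ('p', 3)]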

32 Benefits of the FP-tree Structure
- Completeness
  - Preserves complete information for frequent pattern mining
- Compactness
  - Reduces irrelevant information: infrequent items are gone
  - Items are stored in descending order of frequency; the more frequently an item occurs, the more likely its nodes are to be shared
  - Never larger than the original database (if we exclude node-links and the count fields)
  - For some dense data sets, the compression ratio can be more than a factor of 100

33 Partitioning Patterns and Databases
- The frequent patterns can be partitioned into subsets according to the f-list (f-c-a-b-m-p); the ordering makes the partitioning efficient:
  - Patterns containing p
  - Patterns containing m but not p
  - ...
  - Patterns containing c but none of a, b, m, p
  - The pattern f
- The partitioning is complete and has no redundancy

34 Deriving Frequent Itemsets from the FP-tree
- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item p
- Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base
- Each conditional pattern base below is then mined recursively, as shown on the next slides

  Item | Conditional pattern base
  c | f:3
  a | fc:3
  b | fca:1, f:1, c:1
  m | fca:2, fcab:1
  p | fcam:2, cb:1

[Figure: the FP-tree and header table from slide 31]

35 Deriving Frequent Itemsets from the FP-tree
- Start at the bottom of the header table with p. p itself is a frequent pattern. Its conditional pattern base is {fcam:2, cb:1}; the only frequent item in the base is c:3, so p's conditional FP-tree contains just c:3, and the recursive call generates the frequent pattern pc
- m: m is a frequent pattern. Its conditional pattern base is {fca:2, fcab:1}, whose frequent items are f:3, c:3, a:3. Build m's conditional FP-tree; mine(<f:3, c:3, a:3> | m) involves mining a, c, f in sequence:
  - Create the frequent pattern am:3 and call mine(<f:3, c:3> | am)
  - Second, create the frequent pattern cm:3 and call mine(<f:3> | cm)
  - mine(<f:3> | cam) creates the longest pattern, fcam:3
- Note: each conditional FP-tree receives calls from several headers; all items kept in a conditional FP-tree are frequent

36 Recursion: Mine Each Conditional FP-tree
- For each conditional pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base
- Example: the m-conditional pattern base is fca:2, fcab:1, and the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3
- All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

37 Recursion: Mine Each Conditional FP-tree
- m-conditional FP-tree: {} → f:3 → c:3 → a:3
- Conditional pattern base of "am": (fc:3); the am-conditional FP-tree is {} → f:3 → c:3
- Conditional pattern base of "cm": (f:3); the cm-conditional FP-tree is {} → f:3
- Conditional pattern base of "cam": (f:3); the cam-conditional FP-tree is {} → f:3

38 Special Case: Single Prefix Path in the FP-tree
- Suppose a (conditional) FP-tree T has a single shared prefix path P
- Mining can be decomposed into two parts:
  - Reduce the single prefix path to one node
  - Concatenate the mining results of the two parts
[Figure: a tree whose single prefix path {} → a1:n1 → a2:n2 → a3:n3 leads into a branching part (b1:m1, C1:k1, C2:k2, C3:k3) is split into the prefix-path tree {} → a1:n1 → a2:n2 → a3:n3 plus the branching subtree rooted at r1]

39 Mining Frequent Patterns with FP-trees
- Idea: frequent-pattern growth
  - Recursively grow frequent patterns by pattern and database partitioning
- Method:
  - For each frequent item, construct its conditional pattern base and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Stop when the resulting FP-tree is empty, or when it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
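A sketch of the FP-growth recursion, building on the FPNode/build_fptree() sketch given after slide 31. The single-path shortcut of slide 38 is omitted for brevity, since the plain recursion already enumerates every frequent pattern; names remain illustrative.

def fp_growth(transactions, minsup_count, suffix=frozenset()):
    """Yield (frequent_itemset, support_count) pairs; uses build_fptree() from above."""
    root, header = build_fptree(transactions, minsup_count)
    # Process items starting from the bottom of the header table (least frequent first)
    items = sorted(header, key=lambda i: sum(n.count for n in header[i]))
    for item in items:
        support = sum(n.count for n in header[item])
        pattern = suffix | {item}
        yield pattern, support
        # Conditional pattern base: the prefix path of every node for this item
        cond_base = []
        for node in header[item]:
            path = []
            parent = node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            cond_base.extend([path] * node.count)   # one copy per supporting transaction
        # Recurse, treating the conditional pattern base as a small database
        yield from fp_growth(cond_base, minsup_count, pattern)

# Mining the slide-31 database with minsup = 3:
db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
for pattern, count in fp_growth(db, 3):
    print(sorted(pattern), count)   # 18 frequent patterns, e.g. ['a', 'c', 'f', 'm'] 3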

40 Scaling FP-growth by DB Projection
- What if the FP-tree cannot fit in memory? Use DB projection:
  - First partition the database into a set of projected DBs
  - Then construct and mine an FP-tree for each projected DB

41 Partition-based Projection
- Parallel projection needs a lot of disk space
- Partition-based projection saves it
[Figure: the transaction DB (fcamp, fcabm, fb, cbp, fcamp) is split into p-, m-, b-, a-, c-, and f-projected DBs (e.g. the p-proj DB is {fcam, cb, fcam} and the m-proj DB is {fcab, fca, ...}), which are projected further in turn, e.g. into am- and cm-projected DBs]

42 References
- R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," VLDB 1994.
- J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," SIGMOD 2000.

