
1 Association Rule Mining
CENG 514 Data Mining

2 What Is Frequent Pattern Analysis?
Motivation: finding inherent regularities in data
– What products were often purchased together? — Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining

3 Basic Concepts: Frequent Patterns and Association Rules
Itemset X = {x1, …, xk}
Find all the rules X → Y with minimum support and confidence
– support, s: probability that a transaction contains X ∪ Y
– confidence, c: conditional probability that a transaction containing X also contains Y
Transaction DB:
Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F
Let sup_min = 50%, conf_min = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules: A → D (60%, 100%), D → A (60%, 75%)
(Illustration: customers who buy beer, customers who buy diapers, and customers who buy both.)

4 Use of Association Rules
Association rules do not by themselves express causality or correlation between the two itemsets.
– X → Y does not mean that X causes Y: no causality
– X → Y can differ from Y → X, unlike correlation
Association rules assist in marketing, targeted advertising, floor planning, inventory control, churn management, and homeland security.
Applications include basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

5 Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
Solution: mine closed patterns and max-patterns instead
An itemset X is closed if it is frequent and there exists no super-pattern Y ⊃ X with the same support as X
An itemset X is a max-pattern if it is frequent and there exists no frequent super-pattern Y ⊃ X
Closed patterns are a lossless compression of the frequent patterns
– Reducing the number of patterns and rules
An itemset X is a generator if it is frequent and there exists no sub-pattern Y ⊂ X with the same support as X
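As a rough illustration of these three definitions, the sketch below takes a table of frequent itemsets with their support counts and labels each one as closed, maximal, or a generator. All function and variable names here are my own, not the course's code.

# Minimal sketch (assumed helper names): classify frequent itemsets as closed / max / generator.
def classify(frequent):
    """frequent: dict mapping frozenset(itemset) -> support count."""
    labels = {}
    for x, sup_x in frequent.items():
        supers = [y for y in frequent if x < y]            # proper frequent supersets
        subs = [y for y in frequent if y < x]              # proper (frequent) subsets
        closed = all(frequent[y] < sup_x for y in supers)  # no superset with equal support
        maximal = len(supers) == 0                         # no frequent superset at all
        generator = all(frequent[y] > sup_x for y in subs) # no subset with equal support
        labels[x] = {"closed": closed, "max": maximal, "generator": generator}
    return labels

freq = {frozenset("A"): 3, frozenset("D"): 4, frozenset("AD"): 3}
print(classify(freq))   # e.g. {A} is a generator but not closed, since {A, D} has the same support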

6 Closed Patterns and Max-Patterns: Example
DB = { ⟨a1, …, a100⟩, ⟨a1, …, a50⟩ }, minimum support count 1
What is the set of closed itemsets?
– ⟨a1, …, a100⟩ : 1
– ⟨a1, …, a50⟩ : 2
What is the set of max-patterns?
– ⟨a1, …, a100⟩ : 1
What is the set of all frequent patterns?
– All 2^100 − 1 non-empty subsets of {a1, …, a100}

7 Methods for Mining Frequent Patterns
Apriori rule (the downward closure property)
– Any subset of a frequent itemset must be frequent
– If {beer, diaper, nuts} is frequent, so is {beer, diaper}
– i.e., every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
Basic FIM methods:
– Apriori
– Frequent pattern growth (Closet, FPgrowth — Han, Pei & Yin @SIGMOD'00)
– Vertical data format approach (SPADE, CHARM)

8 Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested! (Agrawal & Srikant @VLDB'94; Mannila et al. @KDD'94)
Method:
– Initially, scan the DB once to get the frequent 1-itemsets
– Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
– Test the candidates against the DB
– Terminate when no frequent or candidate set can be generated

9 The Apriori Algorithm — An Example (Sup_min = 2)
Database TDB:
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E
1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3
C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 with counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
C3: {B,C,E}
3rd scan → L3: {B,C,E}:2

10 The Apriori Algorithm
Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
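The pseudo-code above translates almost line for line into the following Python sketch. It is an illustration only, not the course's reference implementation, and it generates candidates by the simple "all k-subsets frequent" test, which produces the same set as the join-and-prune procedure spelled out on slide 12.

from itertools import combinations
from collections import defaultdict

# Minimal sketch of the Apriori loop above (names are my own).
def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    counts = defaultdict(int)
    for t in transactions:                       # L1: frequent 1-itemsets
        for item in t:
            counts[frozenset([item])] += 1
    L = {x: c for x, c in counts.items() if c >= min_support}
    all_frequent = dict(L)
    k = 1
    while L:
        # generate C_{k+1}: every (k+1)-combination whose k-subsets are all frequent
        items = sorted({i for x in L for i in x})
        candidates = {frozenset(c) for c in combinations(items, k + 1)
                      if all(frozenset(s) in L for s in combinations(c, k))}
        counts = defaultdict(int)
        for t in transactions:                   # count candidates contained in each transaction
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {x: c for x, c in counts.items() if c >= min_support}
        all_frequent.update(L)
        k += 1
    return all_frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, min_support=2))   # reproduces the example on slide 9, including {B, C, E}: 2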

11 Important Details of Apriori
How to generate candidates?
– Step 1: self-join Lk
– Step 2: pruning
Example of candidate generation:
– L3 = {abc, abd, acd, ace, bcd}
– Self-join L3 * L3:
  abcd from abc and abd
  acde from acd and ace
– Pruning: acde is removed because ade is not in L3
– C4 = {abcd}

12 How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order.
Step 1: self-join Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck
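The SQL-style join and prune above can be written out explicitly. The sketch below assumes each itemset is kept as a sorted tuple, so the shared-prefix comparison of the join step is literal; the function name is my own.

from itertools import combinations

# Sketch of apriori-gen: self-join L_{k-1} on the first k-2 items, then prune.
def apriori_gen(L_prev):
    prev = sorted(L_prev)
    prev_set = set(prev)
    k_minus_1 = len(prev[0])
    candidates = set()
    for p in prev:
        for q in prev:
            # join: same first k-2 items, and p's last item precedes q's last item
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: every (k-1)-subset of c must already be frequent
                if all(s in prev_set for s in combinations(c, k_minus_1)):
                    candidates.add(c)
    return candidates

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))
# {('a', 'b', 'c', 'd')}: ('a', 'c', 'd', 'e') is produced by the join but pruned,
# because its subset ('a', 'd', 'e') is not in L3 — the example on the previous slide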

13 How to Count Supports of Candidates?
Why is counting supports of candidates a problem?
– The total number of candidates can be huge
– One transaction may contain many candidates
Method:
– Candidate itemsets are stored in a hash-tree
– Leaf nodes contain a list of itemsets and counts
– Interior nodes contain a hash table
– Subset function: finds all the candidates contained in a transaction

14 Example: Counting Supports of Candidates
(Figure: a hash-tree of candidate 3-itemsets; leaves hold itemsets such as 1 4 5, 1 3 6, 3 4 5, 3 6 7. Transaction 1 2 3 5 6 is matched by recursively splitting it as 1 + 2 3 5 6, 1 2 + 3 5 6 and 1 3 + 5 6, hashing on the next item at each interior node.)

15 Challenges of Frequent Pattern Mining
Challenges:
– Multiple scans of the transaction database
– Huge number of candidates
– Tedious workload of support counting for candidates
Improving Apriori: general ideas
– Reduce passes of transaction database scans
– Shrink the number of candidates
– Facilitate support counting of candidates

16 Partition: Scan the Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
– Scan 1: partition the database and find local frequent patterns
– Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95

17 Transaction Reduction
A transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset. Such a transaction can therefore be marked or removed from further consideration, because subsequent scans of the database for j-itemsets, where j > k, will not require it.
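A direct way to apply this observation is sketched below with assumed names: after level k, drop every transaction that contains none of the frequent k-itemsets, since it cannot contribute to any longer frequent itemset.

from itertools import combinations

# Sketch: keep only transactions that still contain at least one frequent k-itemset.
def reduce_transactions(transactions, frequent_k, k):
    frequent_k = {frozenset(x) for x in frequent_k}
    return [t for t in transactions
            if any(frozenset(c) in frequent_k for c in combinations(t, k))]

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}, {"D", "F"}]
L2 = [{"A", "C"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]
print(reduce_transactions(tdb, L2, 2))   # {'D', 'F'} is dropped from further scans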

18 Sampling for Frequent Patterns
Select a sample of the original database and mine frequent patterns within the sample using Apriori (PL: potentially large itemsets)
Find additional candidates using the negative border function: C = PL ∪ BD⁻(PL)
(BD⁻(PL): the minimal itemsets that are not in PL but whose proper subsets are all in PL)
Scan the database once to verify the frequent itemsets found in the sample and to check only the borders of the closure of the frequent patterns (first pass)
If the scan uncovers new frequent itemsets in C, iteratively apply BD⁻(C) to expand the border of the closure and scan the database again to find missed frequent patterns (second pass — may or may not happen)
H. Toivonen. Sampling large databases for association rules. In VLDB'96
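The negative border used above can be sketched as follows: BD⁻(PL) collects every itemset that is not in PL but whose proper subsets all are. The item universe and the function name are assumptions for illustration.

from itertools import combinations

# Sketch: negative border = itemsets not in PL whose proper subsets are all in PL.
def negative_border(PL, items):
    PL = {frozenset(x) for x in PL}
    border = set()
    max_size = max(len(x) for x in PL) + 1
    for k in range(1, max_size + 1):
        for c in combinations(sorted(items), k):
            cand = frozenset(c)
            if cand in PL:
                continue
            subsets_ok = all(frozenset(s) in PL for s in combinations(c, k - 1)) if k > 1 else True
            if subsets_ok:
                border.add(cand)
    return border

items = {"Bread", "Jelly", "Peanutbutter", "Beer", "Milk"}
PL = [{"Bread"}, {"Jelly"}, {"Peanutbutter"}, {"Bread", "Jelly"},
      {"Bread", "Peanutbutter"}, {"Jelly", "Peanutbutter"},
      {"Bread", "Jelly", "Peanutbutter"}]
print(negative_border(PL, items))   # {{'Beer'}, {'Milk'}}, as in the example on slide 20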

19 Sampling for Frequent Patterns
Example DB (support threshold 40%):
t1: {Bread, Jelly, Peanutbutter}
t2: {Bread, Peanutbutter}
t3: {Bread, Milk, Peanutbutter}
t4: {Beer, Bread}
t5: {Beer, Milk}
Sample (support threshold 10%):
t1: {Bread, Jelly, Peanutbutter}
t2: {Bread, Peanutbutter}

20 Sampling for Frequent Patterns
Sample (support threshold 10%, i.e., support count 1 in the sample):
t1: {Bread, Jelly, Peanutbutter}
t2: {Bread, Peanutbutter}
PL = { {Bread}, {Jelly}, {Peanutbutter}, {Bread, Jelly}, {Bread, Peanutbutter}, {Jelly, Peanutbutter}, {Bread, Jelly, Peanutbutter} }
BD⁻(PL) = { {Beer}, {Milk} }
PL ∪ BD⁻(PL) = { {Bread}, {Jelly}, {Peanutbutter}, {Bread, Jelly}, {Bread, Peanutbutter}, {Jelly, Peanutbutter}, {Bread, Jelly, Peanutbutter}, {Beer}, {Milk} }
C = { {Bread}, {Peanutbutter}, {Bread, Peanutbutter}, {Beer}, {Milk} } (frequent on DB with 40% support threshold)
Need one more pass

21 Sampling for Frequent Patterns
C = { {Bread}, {Peanutbutter}, {Bread, Peanutbutter}, {Beer}, {Milk} }
BD⁻(C) = { {Bread, Beer}, {Bread, Milk}, {Beer, Milk}, {Beer, Peanutbutter}, {Milk, Peanutbutter} }
C = C ∪ BD⁻(C)
BD⁻(C) = { {Bread, Beer, Milk}, {Bread, Beer, Peanutbutter}, {Beer, Milk, Peanutbutter}, {Bread, Milk, Peanutbutter} }
C = C ∪ BD⁻(C)
BD⁻(C) = { {Bread, Beer, Milk, Peanutbutter} }
F = { {Bread}, {Peanutbutter}, {Bread, Peanutbutter}, {Beer}, {Milk} } (frequent on DB with 40% support threshold)

22 Mining Frequent Patterns Without Candidate Generation
Build a compact representation so that the database does not need to be scanned repeatedly
Find frequent itemsets in a recursive way

23 Construct an FP-tree from a Transaction Database (min_support = 3)
TID | Items bought | (Ordered) frequent items
100 | f, a, c, d, g, i, m, p | f, c, a, m, p
200 | a, b, c, f, l, m, o | f, c, a, b, m
300 | b, f, h, j, o, w | f, b
400 | b, c, k, s, p | c, b, p
500 | a, f, c, e, l, p, m, n | f, c, a, m, p
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items in descending order of frequency — the header table: f:4, c:4, a:3, b:3, m:3, p:3
3. Scan the DB again and construct the FP-tree
(Figure: the resulting FP-tree with root {}, a main path f:4 – c:3 – a:3 – m:2 – p:2, side branches b:1 under f and b:1 – m:1 under a, a separate branch c:1 – b:1 – p:1 under the root, and header-table links from each item to its nodes.)
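The three construction steps can be sketched compactly as below. The node class and names are illustrative, not the instructor's code: count items, keep the frequent ones in descending frequency order, then insert each reordered transaction into the tree while maintaining the header table's node links.

from collections import Counter, defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

# Minimal sketch of FP-tree construction: two scans of the transaction database.
def build_fp_tree(transactions, min_support):
    freq = Counter(i for t in transactions for i in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_support}   # scan 1: frequent 1-itemsets
    order = sorted(freq, key=lambda i: -freq[i])                 # f-list, descending frequency
    root = FPNode(None, None)
    header = defaultdict(list)                                   # header table: item -> node links
    for t in transactions:                                       # scan 2: insert ordered transactions
        node = root
        for item in [i for i in order if i in t]:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, order

tdb = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"), list("afcelpmn")]
root, header, order = build_fp_tree(tdb, min_support=3)
print(order)                                  # f, c, a, b, m, p in descending support (ties may swap)
print(sum(n.count for n in header["f"]))      # 4: counts along f's node links add up to its support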

24 Find Patterns Having p From p's Conditional Database
Start at the frequent-item header table of the FP-tree
Traverse the FP-tree by following the node links of each frequent item p
Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base
Conditional pattern bases:
item | conditional pattern base
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1

25 From Conditional Pattern Bases to Conditional FP-trees
For each pattern base:
– Accumulate the count for each item in the base
– Construct the FP-tree for the frequent items of the pattern base
m's conditional pattern base: fca:2, fcab:1
m's conditional FP-tree: a single path {} – f:3 – c:3 – a:3
All frequent patterns related to m: m, fm, cm, am, fcm, fam, cam, fcam

26 Recursion: Mining Each Conditional FP-tree
m's conditional FP-tree: {} – f:3 – c:3 – a:3
Conditional pattern base of "am": (fc:3) → am-conditional FP-tree: {} – f:3 – c:3
Conditional pattern base of "cm": (f:3) → cm-conditional FP-tree: {} – f:3
Conditional pattern base of "cam": (f:3) → cam-conditional FP-tree: {} – f:3

27 A Special Case: A Single Prefix Path in the FP-tree
Suppose a (conditional) FP-tree T has a shared single prefix path P
Mining can be decomposed into two parts:
– Reduction of the single prefix path into one node
– Concatenation of the mining results of the two parts
(Figure: a tree whose prefix path a1:n1 – a2:n2 – a3:n3 branches into subtrees b1:m1 and c1:k1, c2:k2, c3:k3 is split into the single-path part a1 – a2 – a3 and the branching part rooted at r1, and the two mining results are concatenated.)

28 Mining Frequent Patterns With FP-trees
Idea: frequent pattern growth
– Recursively grow frequent patterns by pattern and database partitioning
Method:
– For each frequent item, construct its conditional pattern base and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Until the resulting FP-tree is empty, or it contains only one path — a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
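The recursion can also be sketched without a physical tree by working directly on weighted conditional pattern bases (each entry is an itemset plus a count). This is a simplified illustration of the FP-growth idea, not an optimized implementation, and all names are mine; the item ranking plays the role of the f-list and guarantees each pattern is produced exactly once.

from collections import Counter

# Simplified sketch of the FP-growth recursion over (itemset, count) pattern bases.
def fp_growth(pattern_base, min_support, suffix=frozenset()):
    counts = Counter()
    for items, cnt in pattern_base:              # count each item once per weighted entry
        for i in set(items):
            counts[i] += cnt
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))   # most frequent first
    rank = {i: r for r, i in enumerate(order)}
    results = {}
    for item in reversed(order):                 # mine from the least frequent item upward
        new_suffix = suffix | {item}
        results[new_suffix] = frequent[item]
        # conditional pattern base of `item`: entries containing it, restricted to
        # items that come before `item` in the f-list (prevents duplicate patterns)
        cond = [([i for i in items if i in rank and rank[i] < rank[item]], cnt)
                for items, cnt in pattern_base if item in items]
        results.update(fp_growth(cond, min_support, new_suffix))
    return results

tdb = [(list("facdgimp"), 1), (list("abcflmo"), 1), (list("bfhjow"), 1),
       (list("bcksp"), 1), (list("afcelpmn"), 1)]
patterns = fp_growth(tdb, min_support=3)
print(patterns[frozenset("fcam")])   # 3: {f, c, a, m} is frequent, as on slide 25
print(len(patterns))                 # number of frequent itemsets found (18 for this data)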

29 FP-Growth vs. Apriori: Scalability With the Support Threshold
(Figure: run time versus support threshold on data set T25I20D10K; FP-growth scales much better than Apriori as the threshold decreases.)

30 Parallel and Distributed Algorithms
Data parallelism: partition the data
– Reduced communication cost compared to task parallelism
– Exchange initial candidates and local counts
– Memory at each processor should be large enough to hold all candidates at each scan
Task parallelism: partition the candidates
– Candidates are partitioned and counted separately, so they can fit in memory
– Pass not only candidates but also the local set of transactions

31 Parallel and Distributed Algorithms
Data parallelism: Count Distribution Algorithm (CDA)
Pass k = 1:
1. Generate local C1 from local data Di
2. Merge local candidate sets to find C1
Pass k > 1:
1. Generate complete Ck from Lk-1
2. Count local data Di to find support for Ck
3. Exchange local counts for Ck to find global Ck
4. Compute Lk from Ck
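A toy illustration of the count-distribution idea using Python's multiprocessing (all names assumed): every worker holds the full candidate set, counts it over its own data partition, and the coordinator sums the local counts to obtain global supports.

from multiprocessing import Pool
from collections import Counter

# Toy sketch of count distribution: shared candidate set, partitioned data.
CANDIDATES = [frozenset(c) for c in [{"A", "C"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]]

def local_counts(partition):
    counts = Counter()
    for t in partition:
        for c in CANDIDATES:
            if c <= t:
                counts[c] += 1
    return counts

if __name__ == "__main__":
    partitions = [[{"A", "C", "D"}, {"B", "C", "E"}], [{"A", "B", "C", "E"}, {"B", "E"}]]
    with Pool(processes=2) as pool:
        total = sum(pool.map(local_counts, partitions), Counter())
    print({tuple(sorted(c)): n for c, n in total.items()})   # global supports, e.g. ('B', 'E'): 3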

32 Parallel and Distributed Algorithms
Task parallelism: Data Distribution Algorithm (DDA)
Partition both the data and the candidate sets:
1. Generate Ck from Lk-1; retain |Ck| / P candidates locally
2. Count local Ck using both local and remote data
3. Calculate local Lk using local Ck and synchronize

33 Sequential Associations
Sequence: an ordered list of itemsets
– I = {i1, i2, …, im}: the set of items
– S = ⟨s1, s2, …, sn⟩, where each sj ⊆ I
– e.g., ⟨{A,B}, {B,C}, {D}⟩
A subsequence of a given sequence is one that can be obtained by removing some items, and any resulting empty itemsets, from the original sequence.
– e.g., ⟨{A}, {B,C}⟩ is a subsequence of ⟨{A,B}, {B,C}, {D}⟩
– ⟨{A,C}, {B}⟩ is NOT a subsequence of the above sequence
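The containment test behind these definitions can be sketched as below (sequences as lists of item sets; the names are mine, and the data is the first customer sequence from the example on the next slide): a sequence s is contained in a data sequence t if the itemsets of s can be matched, in order, to itemsets of t that are supersets of them.

# Sketch: subsequence test and support counting for sequences of itemsets.
def is_subsequence(s, t):
    """s, t: lists of sets. True if s can be embedded in t preserving order."""
    pos = 0
    for itemset in s:
        while pos < len(t) and not itemset <= t[pos]:
            pos += 1
        if pos == len(t):
            return False
        pos += 1
    return True

def support(s, data_sequences):
    return sum(is_subsequence(s, t) for t in data_sequences) / len(data_sequences)

seqs = [[{"A", "B"}, {"B", "C"}, {"D"}], [{"A", "B", "C"}, {"D"}], [{"A", "C", "D"}]]
print(is_subsequence([{"A"}, {"B", "C"}], seqs[0]))   # True: {A} fits in {A,B}, {B,C} fits in {B,C}
print(support([{"A"}, {"D"}], seqs))                   # 2/3: the third sequence has A and D together, not in order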

34 Sequential Associations
Support(s): percentage of sessions (customers) whose session sequence contains the sequence s
Confidence(s → t): ratio of the number of sessions that contain both sequences s and t to the number of sessions that contain s
Example sessions:
Customer | Time | Itemset
C1 | 10 | A, B
C1 | 20 | B, C
C1 | 30 | D
C2 | 15 | A, B, C
C2 | 20 | D
C3 | 15 | A, C, D
A sequence contained only in C1's session, such as ⟨{A}, {B}, {D}⟩, has support 1/3; a sequence contained in two of the three sessions, such as ⟨{A}, {D}⟩ (C1 and C2), has support 2/3.

35 Sequential Associations
Basic algorithms:
AprioriAll
– Basic algorithm for finding sequential associations
GSP
– More efficient than AprioriAll
– Includes extensions: window, concept hierarchy
SPADE
– Improves performance by using a vertical table structure and equivalence classes
Episode mining
– Events are ordered by occurrence time
– Uses a time window

36 Sequential Associations
Support for a sequence: fraction of total customers that support the sequence
Large sequence: a sequence that meets min_sup
Maximal sequence: a sequence that is not contained in any other (large) sequence

37 Sequential Associations — Example (min_sup = 25%)
Customer ID | Transaction Time | Items Bought
1 | June 25 '93 | 30
1 | June 30 '93 | 90
2 | June 10 '93 | 10, 20
2 | June 15 '93 | 30
2 | June 20 '93 | 40, 60, 70
3 | June 25 '93 | 30, 50, 70
4 | June 25 '93 | 30
4 | June 30 '93 | 40, 70
4 | July 25 '93 | 90
5 | June 12 '93 | 90
Customer ID | Customer Sequence
1 | { (30) (90) }
2 | { (10 20) (30) (40 60 70) }
3 | { (30 50 70) }
4 | { (30) (40 70) (90) }
5 | { (90) }
Maximal sequences with support > 25%: { (30) (90) } and { (30) (40 70) }
{ (10 20) (30) } does not satisfy min_sup (supported only by customer 2)
{ (30) }, { (70) }, { (30) (40) }, … are not maximal.

38 Frequent Sequence Generation (min_sup = 25%)
(Table: from the customer sequences above, the candidate and large k-sequence sets L1 through L4 are generated level by level with their supports; candidates that fall below the 25% threshold are dropped at each level, and the maximal large sequences are then read off, as on the previous slide.)

39 Sequential Associations: Candidate Generation
– Initially find the frequent 1-element sequences
– Each subsequent pass starts with the frequent sequences found in the previous pass
– Each candidate sequence has one more item than a previously found frequent sequence, so all the candidate sequences in the same pass have the same number of items
Candidate generation:
– Join: s1 joins with s2 if the subsequence obtained by dropping the first item of s1 is the same as the subsequence obtained by dropping the last item of s2; s1 is extended with the last item of s2
– Prune: delete a candidate if it has an infrequent contiguous subsequence
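A sketch of the join step just described, with sequences represented as lists of sorted item tuples (helper names are mine): s1 joins with s2 when dropping s1's first item and s2's last item leaves the same sequence; the new candidate extends s1 with s2's last item, either inside the final itemset or as a new itemset at the end.

# Sketch of the GSP-style candidate join for sequences (lists of sorted item tuples).
def drop_first(seq):
    head = seq[0][1:]
    return ([head] if head else []) + list(seq[1:])

def drop_last(seq):
    tail = seq[-1][:-1]
    return list(seq[:-1]) + ([tail] if tail else [])

def join(s1, s2):
    if drop_first(s1) != drop_last(s2):
        return None
    last = s2[-1][-1]
    if len(s2[-1]) > 1:
        # the new item belongs to the last itemset of the candidate
        return list(s1[:-1]) + [tuple(sorted(s1[-1] + (last,)))]
    # the new item forms its own itemset at the end
    return list(s1) + [(last,)]

s1 = [(1, 2), (3,)]          # <(1 2) (3)>
s2 = [(2,), (3, 4)]          # <(2) (3 4)>
print(join(s1, s2))          # [(1, 2), (3, 4)]       i.e. <(1 2) (3 4)>
s3 = [(2,), (3,), (5,)]      # <(2) (3) (5)>
print(join(s1, s3))          # [(1, 2), (3,), (5,)]   i.e. <(1 2) (3) (5)>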

40 Sequential Associations
Sequence database (min_sup = 2):
Seq. ID | Sequence
10 | ⟨(bd) c b (ac)⟩
20 | ⟨(bf) (ce) b (fg)⟩
30 | ⟨(ah) (bf) a b f⟩
40 | ⟨(be) (ce) d⟩
50 | ⟨a (bd) b c b (ade)⟩
Initial candidates: ⟨a⟩, ⟨b⟩, ⟨c⟩, ⟨d⟩, ⟨e⟩, ⟨f⟩, ⟨g⟩, ⟨h⟩
Scan the database once and count support for the candidates:
⟨a⟩:3, ⟨b⟩:5, ⟨c⟩:4, ⟨d⟩:3, ⟨e⟩:3, ⟨f⟩:2, ⟨g⟩:1, ⟨h⟩:1

41 Sequential Associations
51 length-2 candidates are generated from the 6 frequent length-1 sequences.
Without the Apriori property there would be 8 × 8 + 8 × 7 / 2 = 92 candidates; Apriori pruning removes 44.57% of the candidates.

42 Sequential Associations
GSP on the example database (min_sup = 2):
1st scan: 8 candidates → 6 length-1 sequential patterns
2nd scan: 51 candidates → 19 length-2 sequential patterns (10 candidates not in the DB at all)
3rd scan: 46 candidates → 19 length-3 sequential patterns (20 candidates not in the DB at all)
4th scan: 8 candidates → 6 length-4 sequential patterns
5th scan: 1 candidate → 1 length-5 sequential pattern
Candidates are eliminated either because they cannot pass the support threshold or because they do not occur in the DB at all.

43 Sequential Associations
Example:
L3 (frequent 3-sequences): ⟨(1 2)(3)⟩, ⟨(1 2)(4)⟩, ⟨(1)(3 4)⟩, ⟨(1 3)(5)⟩, ⟨(2)(3 4)⟩, ⟨(2)(3)(5)⟩
C4 (candidate 4-sequences) after the join: ⟨(1 2)(3 4)⟩, ⟨(1 2)(3)(5)⟩
C4 after pruning: ⟨(1 2)(3 4)⟩ — ⟨(1 2)(3)(5)⟩ is dropped because its contiguous subsequence ⟨(1)(3)(5)⟩ is not in L3

44 Interestingness Measure: Correlation (Lift)
play basketball → eat cereal [40%, 66.7%] is misleading
– The overall percentage of students eating cereal is 75%, which is higher than 66.7%.
play basketball → not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence
Measure of dependent/correlated events: lift (also called correlation or interest)
lift(A, B) = P(A ∪ B) / (P(A) × P(B)); lift < 1 indicates negative correlation, lift > 1 positive correlation, lift = 1 independence
 | Basketball | Not basketball | Sum (row)
Cereal | 2000 | 1750 | 3750
Not cereal | 1000 | 250 | 1250
Sum (col.) | 3000 | 2000 | 5000
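A quick numeric check of the two lift values for this contingency table (a sketch; the numbers come straight from the table above):

# Lift from the 2x2 table: lift(A, B) = P(A and B) / (P(A) * P(B)).
n = 5000
p_basketball = 3000 / n
p_cereal = 3750 / n
p_both = 2000 / n                    # basketball and cereal
p_basketball_not_cereal = 1000 / n

print(round(p_both / (p_basketball * p_cereal), 2))                         # 0.89 < 1: negatively correlated
print(round(p_basketball_not_cereal / (p_basketball * (1 - p_cereal)), 2))  # 1.33 > 1: positively correlated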

45 Interestingness Measure: Conviction
Measures the independence of the negation of the implication
Conviction(A → B) = P(A) × P(¬B) / P(A, ¬B)
– Conviction = 1: A and B are not related; conviction = ∞: the implication always holds
 | Basketball | Not basketball | Sum (row)
Cereal | 2000 | 1750 | 3750
Not cereal | 1000 | 250 | 1250
Sum (col.) | 3000 | 2000 | 5000
Example:
Conviction(basketball → cereal) = (3000/5000) × (1250/5000) / (1000/5000) = 0.75
Conviction(basketball → not cereal) = (3000/5000) × (3750/5000) / (2000/5000) = 1.12
Conviction(not basketball → cereal) = (2000/5000) × (1250/5000) / (250/5000) = 2

46 Interestingness Measure: Conviction
Conviction(A → B) = P(A) × P(¬B) / P(A, ¬B)
Example:
t1: {Bread, Jelly, Peanutbutter}
t2: {Bread, Peanutbutter}
t3: {Bread, Milk, Peanutbutter}
t4: {Beer, Bread}
t5: {Beer, Milk}
Conviction(Peanutbutter → Bread) = (3/5 × 1/5) / 0 = ∞
Conviction(Bread → Peanutbutter) = (4/5 × 2/5) / (1/5) = 8/5 > 1

47 Interestingness Measure: f-measure
f-measure = ((1 + α²) × support × confidence) / ((α² × confidence) + support)
α takes a value in [0, 1]; when α = 0, support has minimum effect (the measure reduces to confidence)
Another representation:
F-metric = ((β² + 1) × confidence × support) / ((β² × confidence) + support)
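For instance, a sketch using the formula as written above and the A → D rule from slide 3 (60% support, 100% confidence):

# f-measure of a rule, per the formula above (alpha in [0, 1]).
def f_measure(support, confidence, alpha):
    return ((1 + alpha**2) * support * confidence) / (alpha**2 * confidence + support)

# A -> D from slide 3: support 0.6, confidence 1.0
print(f_measure(0.6, 1.0, alpha=0))    # 1.0  -> reduces to confidence when alpha = 0
print(f_measure(0.6, 1.0, alpha=1))    # 0.75 -> harmonic mean of support and confidence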

48 Which Measures Should Be Used?
all-conf or coherence could be good measures (Omiecinski @TKDE'03)
Both all-conf and coherence have the downward closure property

