Download presentation

Presentation is loading. Please wait.

Published byAli Bruington Modified about 1 year ago

1
חוקי Association ד " ר אבי רוזנפלד

2
המוטיבציה מה הם הדברים שהולכים ביחד ? –איזה מוצרים בסופר שווה לשים ביחד –מערכות המלצה – Recommendation Systems שבוע הבא –חיפוש באינטרנט בעוד שבועיים

3
אבל איך מגדירים ביחד ? האם זה Supervised או Unsupervised – Unsupervised מה כמה פעמים דברים חייבים להיות ביחד לפני שאני מחשב אותם ? – Threshold של Confidence ו Support –תכף נגדיר אותם...

4
December 2008©GKGupta4 אבל קודם כל, דוגמא

5
December 2008©GKGupta5 Frequency of Items

6
6 מה אני מחפש... If-then rules about the contents of baskets. {i 1, i 2,…,i k } → j means: “if a basket contains all of i 1,…,i k then it is likely to contain j.” אני רוצה ללמוד אם יש תלות למשהו – j, בהינתן {i 1, i 2,…,i k }

7
7 Support Simplest question: find sets of items that appear “frequently” in the baskets. Support for itemset I = the number of baskets containing all items in I. –בדרך כלל, ההסתברות למצוא את הדברים ביחד Given a support threshold s, sets of items that appear in > s baskets are called frequent itemsets.

8
Confidence כמה פעמים היה גם I וגם j Confidence of this association rule is the probability of j given i 1,…,i k. – I = i 1,…,i k באופן פורמאלי : Confidence = באופן פורמאלי : Lift =

9
December 2008©GKGupta9 Frequent Items Assume 25% support. In 25 transactions, a frequent item must occur in at least 7 transactions (25*1/4=6.25). The frequent 1-itemset or L 1 is now given below.

10
December 2008©GKGupta10 L2L2 The following pairs are frequent.

11
11 Example: Confidence B 1 = {m, c, b}B 2 = {m, p, j} B 3 = {m, b}B 4 = {c, j} B 5 = {m, p, b}B 6 = {m, c, b, j} B 7 = {c, b, j}B 8 = {b, c} An association rule: {m, b} → c. – Confidence = 2/4 = 50%. + _ +

12
December 2008©GKGupta12 Rules The full set of rules are given below. Could some rules be removed? Comment: Study the above rules carefully.

13
13 Main-Memory Bottleneck For many frequent-itemset algorithms, main memory is the critical resource. – As we read baskets, we need to count something, e.g., occurrences of pairs. – The number of different things we can count is limited by main memory. – Swapping counts in/out is a disaster (why?).

14
14 Finding Frequent Pairs The hardest problem often turns out to be finding the frequent pairs. – Why? Often frequent pairs are common, frequent triples are rare. We’ll concentrate on how to do that, then discuss extensions to finding frequent triples, etc.

15
David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources: hing/dmml.html ID a, b, c, d, e, f, g, h, i E.g. 3-itemset {a,b,h} has support 15% 2-itemset {a, i} has support 0% 4-itemset {b, c, d, h} has support 5% If minimum support is 10%, then {b} is a large itemset, but {b, c, d, h} Is a small itemset!

16
16 Finding Frequent Pairs The hardest problem often turns out to be finding the frequent pairs. – Why? Often frequent pairs are common, frequent triples are rare. We’ll concentrate on how to do that, then discuss extensions to finding frequent triples, etc.

17
17 Naïve Algorithm Read file once, counting in main memory the occurrences of each pair. – From each basket of n items, generate its n (n -1)/2 pairs by two nested loops. Fails if (#items) 2 exceeds main memory. – Remember: #items can be 100K (Wal-Mart) or 10B (Web pages).

18
18 Example: Counting Pairs Suppose 10 5 items. Suppose counts are 4-byte integers. Number of pairs of items: 10 5 ( )/2 = 5*10 9 (approximately). Therefore, 2*10 10 (20 gigabytes) of main memory needed.

19
David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources: hing/dmml.html The Apriori algorithm for finding large itemsets efficiently in big DBs 1: Find all large 1-itemsets 2: For (k = 2 ; while L k-1 is non-empty; k++) 3{C k = apriori-gen (L k-1 ) 4 For each c in C k, initialise c.count to zero 5 For all records r in the DB 6 {C r = subset (C k, r); For each c in C r, c.count++ } 7 Set L k := all c in C k whose count >= minsup 8 } /* end -- return all of the L k sets.

20
David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources: hing/dmml.html Explaining the Apriori Algorithm … 1: Find all large 1-itemsets To start off, we simply find all of the large 1- itemsets. This is done by a basic scan of the DB. We take each item in turn, and count the number of times that item appears in a basket. In our running example, suppose minimum support was 60%, then the only large 1-itemsets would be: {a}, {b}, {c}, {d} and {f}. So we get L 1 = { {a}, {b}, {c}, {d}, {f}}

21
David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources: hing/dmml.html Explaining the Apriori Algorithm … 1 : Find all large 1-itemsets 2: For (k = 2 ; while L k-1 is non-empty; k++) We already have L 1. This next bit just means that the remainder of the algorithm generates L 2, L 3, and so on until we get to an L k that’s empty. How these are generated is like this:

22
David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources: hing/dmml.html Explaining the Apriori Algorithm … 1 : Find all large 1-itemsets 2: For (k = 2 ; while L k-1 is non-empty; k++) 3 {C k = apriori-gen (L k-1 ) Given the large k-1-itemsets, this step generates some candidate k-itemsets that might be large. Because of how apriori-gen works, the set C k is guaranteed to contain all the large k-itemsets, but also contains some that will turn out not to be `large’.

23
David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources: hing/dmml.html Explaining the Apriori Algorithm … 1 : Find all large 1-itemsets 2: For (k = 2 ; while L k-1 is non-empty; k++) 3 {C k = apriori-gen (L k-1 ) 4 For each c in C k, initialise c.count to zero We are going to work out the support for each of the candidate k-itemsets in C k, by working out how many times each of these itemsets appears in a record in the DB.– this step starts us off by initialising these counts to zero.

24
David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources: hing/dmml.html Explaining the Apriori Algorithm … 1 : Find all large 1-itemsets 2: For (k = 2 ; while L k-1 is non-empty; k++) 3 {C k = apriori-gen (L k-1 ) 4 For each c in C k, initialise c.count to zero 5 For all records r in the DB 6 {C r = subset (C k, r); For each c in C r, c.count++ } We now take each record r in the DB and do this: get all the candidate k-itemsets from C k that are contained in r. For each of these, update its count.

25
David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources: hing/dmml.html Explaining the Apriori Algorithm … 1 : Find all large 1-itemsets 2: For (k = 2 ; while L k-1 is non-empty; k++) 3 {C k = apriori-gen (L k-1 ) 4 For each c in C k, initialise c.count to zero 5 For all records r in the DB 6 {C r = subset (C k, r); For each c in C r, c.count++ } 7 Set L k := all c in C k whose count >= minsup Now we have the count for every candidate. Those whose count is big enough are valid large itemsets of the right size. We therefore now have L k, We now go back into the for loop of line 2 and start working towards finding L k+1

26
David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources: hing/dmml.html Explaining the Apriori Algorithm … 1 : Find all large 1-itemsets 2: For (k = 2 ; while L k-1 is non-empty; k++) 3 {C k = apriori-gen (L k-1 ) 4 For each c in C k, initialise c.count to zero 5 For all records r in the DB 6 {C r = subset (C k, r); For each c in C r, c.count++ } 7 Set L k := all c in C k whose count >= minsup 8 } /* end -- return all of the L k sets. We finish at the point where we get an empty L k. The algorithm returns all of the (non-empty) L k sets, which gives us an excellent start in finding interesting rules (although the large itemsets themselves will usually be very interesting and useful.

27
פתרון נוסף : FP-Growth בונה עץ לפי התדירות של הביטויים –מהגדול לקטן –יש לי אלגוריתם כזה... ( שיש עליו פטנט - אבל הוא מיועד לחיפוש ולא Association)

Similar presentations

© 2016 SlidePlayer.com Inc.

All rights reserved.

Ads by Google