Association Rules
Dr. Avi Rosenfeld


Motivation
Which things go together?
– Which supermarket products are worth placing together
– Recommendation Systems (next week)
– Web search (in two weeks)

But how do we define "together"?
Is this Supervised or Unsupervised?
– Unsupervised
How many times must things appear together before I take them into account?
– Thresholds on Confidence and Support
– We will define them shortly...

December 2008 © GKGupta
But first, an example

Frequency of Items

What am I looking for...
If-then rules about the contents of baskets.
$\{i_1, i_2, \ldots, i_k\} \rightarrow j$ means: "if a basket contains all of $i_1, \ldots, i_k$ then it is likely to contain j."
I want to learn whether something, j, depends on $\{i_1, i_2, \ldots, i_k\}$.

Support
Simplest question: find sets of items that appear "frequently" in the baskets.
Support for itemset I = the number of baskets containing all items in I.
– Usually expressed as a fraction: the probability of finding the items together in a basket
Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets.
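The definition of support can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the basket contents and the names `support`, `support_fraction` and `is_frequent` are made up for the example, and "frequent" is taken to mean a count of at least s, matching the `count >= minsup` test in the Apriori pseudocode later in these slides.

```python
def support(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= set(basket))


def support_fraction(itemset, baskets):
    """Support expressed as a fraction of all baskets."""
    return support(itemset, baskets) / len(baskets)


def is_frequent(itemset, baskets, s):
    """Assumption: an itemset is frequent when its count is at least s."""
    return support(itemset, baskets) >= s


# Hypothetical baskets, for illustration only.
baskets = [
    {"milk", "bread"},
    {"milk", "beer"},
    {"milk", "bread", "beer"},
    {"bread"},
]

pair_support = support({"milk", "bread"}, baskets)
pair_fraction = support_fraction({"milk", "bread"}, baskets)
```

Here `{"milk", "bread"}` is contained in two of the four baskets, so it is frequent for s = 2 but not for s = 3.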

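Confidence
How many times did both I and j occur?
Confidence of this association rule is the probability of j given $i_1, \ldots, i_k$.
– $I = \{i_1, \ldots, i_k\}$
Formally: $\text{Confidence}(I \rightarrow j) = \dfrac{\text{support}(I \cup \{j\})}{\text{support}(I)}$
Formally: $\text{Lift}(I \rightarrow j) = \dfrac{\text{Confidence}(I \rightarrow j)}{\text{support}(\{j\}) / N}$, where N is the total number of baskets.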

Frequent Items
Assume 25% support. In 25 transactions, a frequent item must occur in at least 7 transactions (25 × 1/4 = 6.25, rounded up to 7). The frequent 1-itemset or L_1 is now given below.
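The rounding step in this arithmetic is easy to get wrong, so here is a small sketch of it. The function name `min_count` is hypothetical, and the support is passed as a whole-number percentage so that the calculation stays in exact integer arithmetic.

```python
def min_count(n_transactions, support_percent):
    """Smallest whole number of transactions that reaches the given support.

    Uses integer ceiling division: 25 * 25 / 100 = 6.25 rounds up to 7.
    """
    return -(-n_transactions * support_percent // 100)


threshold = min_count(25, 25)
```

With 25 transactions and 25% support, an item that occurs in only 6 transactions falls short (6/25 = 24%), which is why the threshold is 7.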

L_2
The following pairs are frequent.

Example: Confidence
B_1 = {m, c, b}    B_2 = {m, p, j}
B_3 = {m, b}       B_4 = {c, j}
B_5 = {m, p, b}    B_6 = {m, c, b, j}
B_7 = {c, b, j}    B_8 = {b, c}
An association rule: {m, b} → c.
– {m, b} appears in B_1, B_3, B_5 and B_6; of these, B_1 and B_6 also contain c.
– Confidence = 2/4 = 50%.
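The calculation for this example can be checked with a short Python sketch. The function names are illustrative. Lift is computed here as confidence divided by the fraction of all baskets that contain the consequent, which is the usual definition and is stated as an assumption rather than taken from the slide.

```python
def count(itemset, baskets):
    """Number of baskets containing every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= basket)


def confidence(antecedent, consequent, baskets):
    """Fraction of baskets with the antecedent that also hold the consequent."""
    both = set(antecedent) | {consequent}
    return count(both, baskets) / count(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Assumed definition: confidence / P(consequent)."""
    p_consequent = count({consequent}, baskets) / len(baskets)
    return confidence(antecedent, consequent, baskets) / p_consequent


# The eight baskets B_1 .. B_8 from the example.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]

conf_mb_c = confidence({"m", "b"}, "c", baskets)
lift_mb_c = lift({"m", "b"}, "c", baskets)
```

Under this definition the rule {m, b} → c has confidence 2/4 and, since c appears in 5 of the 8 baskets, a lift of 0.5 / 0.625 = 0.8. A lift below 1 suggests that seeing {m, b} makes c slightly less likely than it is overall.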

Rules
The full set of rules is given below. Could some rules be removed?
Comment: Study the above rules carefully.

Main-Memory Bottleneck
For many frequent-itemset algorithms, main memory is the critical resource.
– As we read baskets, we need to count something, e.g., occurrences of pairs.
– The number of different things we can count is limited by main memory.
– Swapping counts in/out is a disaster (why?).

Finding Frequent Pairs
The hardest problem often turns out to be finding the frequent pairs.
– Why? Often frequent pairs are common, frequent triples are rare.
We'll concentrate on how to do that, then discuss extensions to finding frequent triples, etc.

David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources: hing/dmml.html
[Table: records with columns ID, a, b, c, d, e, f, g, h, i]
E.g.
3-itemset {a, b, h} has support 15%
2-itemset {a, i} has support 0%
4-itemset {b, c, d, h} has support 5%
If minimum support is 10%, then {b} is a large itemset, but {b, c, d, h} is a small itemset!


Naïve Algorithm
Read file once, counting in main memory the occurrences of each pair.
– From each basket of n items, generate its $n(n-1)/2$ pairs by two nested loops.
Fails if $(\#\text{items})^2$ exceeds main memory.
– Remember: #items can be 100K (Wal-Mart) or 10B (Web pages).
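A minimal sketch of the naïve approach in Python follows. It uses `itertools.combinations` in place of the two explicit nested loops and a `Counter` as the in-memory table; the function name and the sample baskets are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def count_pairs(baskets):
    """Single pass over the baskets, counting every pair in main memory."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives each pair one canonical order, so (b, m) and (m, b)
        # are counted under the same key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical baskets, for illustration only.
pair_counts = count_pairs([["m", "c", "b"], ["m", "b"], ["c", "b"]])
```

The counter holds one entry per distinct pair seen, which is exactly the structure that outgrows main memory when the number of items is large.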

Example: Counting Pairs
Suppose $10^5$ items.
Suppose counts are 4-byte integers.
Number of pairs of items: $10^5(10^5 - 1)/2 \approx 5 \times 10^9$.
Therefore, $4 \times 5 \times 10^9 = 2 \times 10^{10}$ bytes (20 gigabytes) of main memory needed.
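The same estimate, written out as a short calculation. The function name `pair_count_bytes` is hypothetical; the figures are the ones from the example.

```python
def pair_count_bytes(n_items, bytes_per_count=4):
    """Return (number of distinct pairs, bytes needed for one count per pair)."""
    n_pairs = n_items * (n_items - 1) // 2
    return n_pairs, n_pairs * bytes_per_count


n_pairs, total_bytes = pair_count_bytes(10**5)
gigabytes = total_bytes / 10**9
```

The exact values are 4,999,950,000 pairs and 19,999,800,000 bytes, which round to the figures quoted above.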

The Apriori algorithm for finding large itemsets efficiently in big DBs
1: Find all large 1-itemsets
2: For (k = 2 ; while L_{k-1} is non-empty; k++)
3:   { C_k = apriori-gen(L_{k-1})
4:     For each c in C_k, initialise c.count to zero
5:     For all records r in the DB
6:       { C_r = subset(C_k, r); For each c in C_r, c.count++ }
7:     Set L_k := all c in C_k whose count >= minsup
8:   } /* end -- return all of the L_k sets. */

Explaining the Apriori Algorithm …
1: Find all large 1-itemsets
To start off, we simply find all of the large 1-itemsets. This is done by a basic scan of the DB. We take each item in turn, and count the number of times that item appears in a basket. In our running example, suppose minimum support was 60%, then the only large 1-itemsets would be: {a}, {b}, {c}, {d} and {f}. So we get L_1 = { {a}, {b}, {c}, {d}, {f} }

Explaining the Apriori Algorithm …
2: For (k = 2 ; while L_{k-1} is non-empty; k++)
We already have L_1. This next bit just means that the remainder of the algorithm generates L_2, L_3, and so on until we get to an L_k that's empty. How these are generated is like this:

Explaining the Apriori Algorithm …
3: { C_k = apriori-gen(L_{k-1})
Given the large (k-1)-itemsets, this step generates some candidate k-itemsets that might be large. Because of how apriori-gen works, the set C_k is guaranteed to contain all the large k-itemsets, but also contains some that will turn out not to be 'large'.
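The slides do not spell out how apriori-gen works. One common way to realise it is a join step followed by a prune step, sketched below in Python as an assumption rather than as the lecture's own definition. The function name `apriori_gen` and its signature are illustrative.

```python
from itertools import combinations


def apriori_gen(prev_large, k):
    """Generate candidate k-itemsets from the large (k-1)-itemsets.

    prev_large: a set of frozensets, each of size k-1.
    """
    candidates = set()
    for a in prev_large:
        for b in prev_large:
            union = a | b
            # Join: two (k-1)-itemsets that differ in exactly one item.
            if len(union) != k:
                continue
            # Prune: every (k-1)-subset of a large k-itemset must itself be
            # large, so drop candidates with a subset missing from prev_large.
            if all(frozenset(s) in prev_large
                   for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


# Hypothetical L_2, for illustration only.
L2 = {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")}
C3 = apriori_gen(L2, 3)
```

In this example {a, b, d} and {b, c, d} are joined but then pruned, because {a, d} and {c, d} are not large, so only {a, b, c} survives as a candidate. The prune step is what keeps C_k small while still containing every large k-itemset.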

Explaining the Apriori Algorithm …
4: For each c in C_k, initialise c.count to zero
We are going to work out the support for each of the candidate k-itemsets in C_k, by working out how many times each of these itemsets appears in a record in the DB. This step starts us off by initialising these counts to zero.

Explaining the Apriori Algorithm …
5: For all records r in the DB
6: { C_r = subset(C_k, r); For each c in C_r, c.count++ }
We now take each record r in the DB and do this: get all the candidate k-itemsets from C_k that are contained in r. For each of these, update its count.

Explaining the Apriori Algorithm …
7: Set L_k := all c in C_k whose count >= minsup
Now we have the count for every candidate. Those whose count is big enough are valid large itemsets of the right size. We therefore now have L_k. We now go back into the for loop of line 2 and start working towards finding L_{k+1}.

Explaining the Apriori Algorithm …
8: } /* end -- return all of the L_k sets. */
We finish at the point where we get an empty L_k. The algorithm returns all of the (non-empty) L_k sets, which gives us an excellent start in finding interesting rules (although the large itemsets themselves will usually be very interesting and useful).
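Putting the eight lines together, the whole procedure might look roughly like this in Python. It is a sketch under stated assumptions, not a tuned implementation: `minsup` is an absolute count, the candidate generation is an inlined join-and-prune (one common choice for apriori-gen, which the slides do not define), and the subset test simply checks every candidate against every record.

```python
from itertools import combinations


def apriori(records, minsup):
    """Return {k: set of large k-itemsets} for all non-empty L_k."""
    records = [frozenset(r) for r in records]

    # Line 1: find all large 1-itemsets with one scan of the DB.
    counts = {}
    for r in records:
        for item in r:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {c for c, n in counts.items() if n >= minsup}

    all_large = {}
    k = 2
    while large:                                   # line 2
        all_large[k - 1] = large
        # Line 3: candidates via join, then prune (assumed apriori-gen).
        joined = {a | b for a in large for b in large if len(a | b) == k}
        candidates = {c for c in joined
                      if all(frozenset(s) in large
                             for s in combinations(c, k - 1))}
        counts = {c: 0 for c in candidates}        # line 4
        for r in records:                          # line 5
            for c in candidates:                   # line 6
                if c <= r:
                    counts[c] += 1
        large = {c for c, n in counts.items() if n >= minsup}  # line 7
        k += 1
    return all_large                               # line 8


# The eight baskets from the confidence example, with minsup = 3 baskets.
baskets = [
    ["m", "c", "b"], ["m", "p", "j"], ["m", "b"], ["c", "j"],
    ["m", "p", "b"], ["m", "c", "b", "j"], ["c", "b", "j"], ["b", "c"],
]
result = apriori(baskets, 3)
```

On these baskets the large 1-itemsets are {m}, {c}, {b} and {j}, the large pairs are {m, b}, {c, b} and {c, j}, and no triple survives pruning, so the loop stops at k = 3.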

Another solution: FP-Growth
Builds a tree according to the frequency of the items
– From the most frequent to the least frequent
– I have an algorithm like this... (it is patented, but it is intended for search and not for Association)