Sampling Large Databases for Association Rules. Jingting Zeng, CIS 664 Presentation, March 13, 2007.

Similar presentations
Association Rules Evgueni Smirnov.
Association Rule Mining
Recap: Mining association rules from large datasets
Hash-Based Improvements to A-Priori
1 CPS : Information Management and Mining Association Rules and Frequent Itemsets.
Association Analysis (2). Example TIDList of item ID’s T1I1, I2, I5 T2I2, I4 T3I2, I3 T4I1, I2, I4 T5I1, I3 T6I2, I3 T7I1, I3 T8I1, I2, I3, I5 T9I1, I2,
1 Frequent Itemset Mining: Computation Model uTypically, data is kept in a flat file rather than a database system. wStored on disk. wStored basket-by-basket.
Data Mining Techniques Association Rule
DATA MINING Association Rule Discovery. AR Definition aka Affinity Grouping Common example: Discovery of which items are frequently sold together at a.
Data Mining of Very Large Data
IT 433 Data Warehousing and Data Mining Association Rules Assist.Prof.Songül Albayrak Yıldız Technical University Computer Engineering Department
1 Mining Associations Apriori Algorithm. 2 Computation Model uTypically, data is kept in a flat file rather than a database system. wStored on disk. wStored.
Association rules The goal of mining association rules is to generate all possible rules that exceed some minimum user-specified support and confidence.
Sampling Large Databases for Association Rules ( Toivenon’s Approach, 1996) Farzaneh Mirzazadeh Fall 2007.
1 Association Rules Market Baskets Frequent Itemsets A-Priori Algorithm.
1 of 25 1 of 45 Association Rule Mining CIT366: Data Mining & Data Warehousing Instructor: Bajuna Salehe The Institute of Finance Management: Computing.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
1 Association Rules Apriori Algorithm. 2 Computation Model uTypically, data is kept in a flat file rather than a database system. wStored on disk. wStored.
1 Improvements to A-Priori Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results.
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining: Associations
1 Association Rules Market Baskets Frequent Itemsets A-priori Algorithm.
Improvements to A-Priori
Association Analysis: Basic Concepts and Algorithms.
Data Mining Association Analysis: Basic Concepts and Algorithms
Improvements to A-Priori
Association Rules Presented by: Anilkumar Panicker Presented by: Anilkumar Panicker.
Asssociation Rules Prof. Sin-Min Lee Department of Computer Science.
1 Improvements to A-Priori Bloom Filters Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results.
Lecture14: Association Rules
Mining Association Rules
Associations and Frequent Item Analysis. 2 Outline  Transactions  Frequent itemsets  Subset Property  Association rules  Applications.
1 Improvements to A-Priori Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results.
Performance and Scalability: Apriori Implementation.
Mining Association Rules in Large Databases. What Is Association Rule Mining?  Association rule mining: Finding frequent patterns, associations, correlations,
1 “Association Rules” Market Baskets Frequent Itemsets A-priori Algorithm.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Association Rules. CS583, Bing Liu, UIC 2 Association rule mining Proposed by Agrawal et al in Initially used for Market Basket Analysis to find.
M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining Association Rule Mining March 5, 2009.
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
Part II - Association Rules © Prentice Hall1 DATA MINING Introductory and Advanced Topics Part II – Association Rules Margaret H. Dunham Department of.
Association Rule Mining
1 Improvements to A-Priori Bloom Filters Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Association Analysis This lecture node is modified based on Lecture Notes for.
1 CPS216: Advanced Database Systems Data Mining Slides created by Jeffrey Ullman, Stanford.
Mining Frequent Patterns, Associations, and Correlations Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Association Rules presented by Zbigniew W. Ras *,#) *) University of North Carolina – Charlotte #) ICS, Polish Academy of Sciences.
CURE Clustering Using Representatives Handles outliers well. Hierarchical, partition First a constant number of points c, are chosen from each cluster.
Jeffrey D. Ullman Stanford University.  2% of your grade will be for answering other students’ questions on Piazza.  18% for Gradiance.  Piazza code.
M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining ARM: Improvements March 10, 2009 Slide.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Rules Mining Frequent Itemset Mining Support and Confidence Apriori Approach.
1 Data Mining Lecture 6: Association Analysis. 2 Association Rule Mining l Given a set of transactions, find rules that will predict the occurrence of.
Reducing Number of Candidates
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Rules Repoussis Panagiotis.
Frequent Pattern Mining
Market Basket Many-to-many relationship between different objects
Dynamic Itemset Counting
Data Mining Association Analysis: Basic Concepts and Algorithms
Data Mining Association Analysis: Basic Concepts and Algorithms
Hash-Based Improvements to A-Priori
Association Rule Mining
Data Mining Association Analysis: Basic Concepts and Algorithms
Farzaneh Mirzazadeh Fall 2007
Association Analysis: Basic Concepts
Presentation transcript:

Sampling Large Databases for Association Rules
Jingting Zeng
CIS 664 Presentation
March 13, 2007

Association Rules Outline
- Association Rules Problem Overview
- Association Rules Definitions
- Previous Work on Association Rules
- Toivonen's Algorithm
- Experimental Results
- Conclusion

Overview
- Purpose: if people tend to buy A and B together, then a buyer of A is a good target for an advertisement for B.

The Market-Basket Example
- Items frequently purchased together: Bread ⇒ Peanut Butter
- Uses:
  - Placement
  - Advertising
  - Sales
  - Coupons
- Objective: increase sales and reduce costs

Other Example
- The same technology has other uses: university course enrollment data has been analyzed to find combinations of courses taken by the same students.

Scale of Problem
- WalMart sells 100,000 items and can store hundreds of millions of baskets.
- The Web has 100,000,000 words and several billion pages.

Association Rule Definitions
- Set of items: I = {I₁, I₂, …, Iₘ}
- Transactions: D = {t₁, t₂, …, tₙ}, where each tⱼ ⊆ I
- Support of an itemset: percentage of transactions that contain that itemset.
- Frequent itemset: itemset whose number of occurrences is above a threshold.

Association Rule Definitions  Association Rule (AR): implication X  Y where X,Y  I and X  Y = ;  Support of AR (s) X  Y: Percentage of transactions that contain X  Y  Confidence of AR (  ) X  Y: Ratio of number of transactions that contain X  Y to the number that contain X

Example
B1 = {m, c, b}
B2 = {m, p, j}
B3 = {m, b}
B4 = {c, j}
B5 = {m, p, b}
B6 = {m, c, b, j}
B7 = {c, b, j}
B8 = {b, c}
- Association Rule: {m, b} ⇒ c
  - Support = 2/8 = 25%
  - Confidence = 2/4 = 50%
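
The numbers above are easy to verify mechanically. A minimal Python sketch, with the baskets copied from the slide:

```python
# Verify support and confidence of the rule {m, b} => c on the eight baskets.
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]
X, Y = {"m", "b"}, {"c"}

n_xy = sum(1 for b in baskets if X | Y <= b)  # baskets containing X ∪ Y
n_x = sum(1 for b in baskets if X <= b)       # baskets containing X

print(f"support    = {n_xy}/{len(baskets)} = {n_xy / len(baskets):.0%}")  # 2/8 = 25%
print(f"confidence = {n_xy}/{n_x} = {n_xy / n_x:.0%}")                    # 2/4 = 50%
```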

Association Rule Problem
- Given a set of items I = {I₁, I₂, …, Iₘ} and a database of transactions D = {t₁, t₂, …, tₙ}, where tᵢ = {Iᵢ₁, Iᵢ₂, …, Iᵢₖ} and Iᵢⱼ ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y that meet a minimum support and confidence threshold.

Association Rule Techniques
- Find all frequent itemsets.
- Generate strong association rules from the frequent itemsets.
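
As a sketch of the second step, here is rule generation in Python. It assumes `freq` maps every frequent itemset (as a `frozenset`) to its support count; by monotonicity, every subset of a frequent itemset is itself frequent and therefore present in `freq`:

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """Emit rules X => Y with conf = supp(X ∪ Y) / supp(X) >= min_conf."""
    rules = []
    for itemset, count in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):           # every nonempty proper subset X
            for X in map(frozenset, combinations(itemset, r)):
                conf = count / freq[X]             # supp(itemset) / supp(X)
                if conf >= min_conf:
                    rules.append((X, itemset - X, conf))
    return rules
```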

Apriori Algorithm
- A two-pass approach called Apriori limits the need for main memory.
- Key idea, monotonicity: if a set of items appears at least s times, so does every subset of it.
  - Contrapositive for pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets.

Apriori Algorithm (contd.)
- Pass 1: read baskets and count the occurrences of each item in main memory.
  - Requires memory proportional only to the number of items.
- Pass 2: read baskets again and count in main memory only those pairs both of whose items were found in Pass 1 to have occurred at least s times.
  - Requires memory proportional to the square of the number of frequent items only.
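
A minimal sketch of the two passes, restricted to pairs, assuming `baskets` is an in-memory list of item sets (in practice each pass would re-read the baskets from disk):

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, s):
    # Pass 1: count occurrences of each individual item in main memory.
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent_items = {i for i, c in item_counts.items() if c >= s}

    # Pass 2: count only pairs whose members both survived Pass 1;
    # by monotonicity, no other pair can reach the threshold s.
    pair_counts = Counter()
    for basket in baskets:
        pair_counts.update(combinations(sorted(frequent_items & set(basket)), 2))

    return {pair: c for pair, c in pair_counts.items() if c >= s}
```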

Partitioning
- Divide the database into partitions D₁, D₂, …, Dₚ.
- Apply Apriori to each partition.
- Any large itemset must be large in at least one partition.

Partitioning Algorithm
1. Divide D into partitions D₁, D₂, …, Dₚ;
2. For i = 1 to p do
3.   Lᵢ = Apriori(Dᵢ);
4. C = L₁ ∪ … ∪ Lₚ;
5. Count C on D to generate L;
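
In runnable form — a sketch, not the exact published implementation — assuming `apriori(part, threshold)` returns the set of itemsets (as frozensets) frequent within one partition, and that `s` is an absolute support count over all of D; scaling the local threshold to each partition's size is one reasonable choice:

```python
def partition_frequent(D, p, apriori, s):
    parts = [D[i::p] for i in range(p)]              # 1. divide D into p partitions
    C = set()
    for part in parts:                               # 2-3. locally large itemsets
        C |= apriori(part, s * len(part) // len(D))  # threshold scaled to partition size
    # 4-5. one extra pass over D counts every candidate in C exactly
    return {c for c in C if sum(1 for t in D if c <= t) >= s}
```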

Sampling
- For large databases, sample the database and apply Apriori to the sample.
- Potentially Frequent Itemsets (PL): large itemsets from the sample.
- Negative Border (BD⁻):
  - Generalization of Apriori-Gen applied to itemsets of varying sizes.
  - Minimal set of itemsets which are not in PL, but whose subsets are all in PL.

Negative Border Example
Let Items = {A, …, F} and let PL contain the itemsets:
{A}, {B}, {C}, {F}, {A,B}, {A,C}, {A,F}, {C,F}, {A,C,F}
The whole negative border is:
{{B,C}, {B,F}, {D}, {E}}
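
A brute-force Python sketch that reproduces this example; it enumerates every candidate itemset, which is only feasible for a small item universe:

```python
from itertools import combinations

def negative_border(PL, items):
    """Minimal itemsets not in PL all of whose immediate subsets are in PL."""
    PL = {frozenset(x) for x in PL} | {frozenset()}  # empty set is trivially frequent
    border = set()
    max_len = max(len(x) for x in PL)
    for k in range(1, max_len + 2):                  # border sets have size <= max + 1
        for cand in map(frozenset, combinations(items, k)):
            if cand not in PL and all(cand - {i} in PL for i in cand):
                border.add(cand)
    return border

PL = [{'A'}, {'B'}, {'C'}, {'F'}, {'A', 'B'}, {'A', 'C'},
      {'A', 'F'}, {'C', 'F'}, {'A', 'C', 'F'}]
print(negative_border(PL, 'ABCDEF'))                 # {B,C}, {B,F}, {D}, {E}
```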

Toivonen’s Algorithm uStart as in the simple algorithm, but lower the threshold slightly for the sample. wExample: if the sample is 1% of the baskets, use as the support threshold rather than wGoal is to avoid missing any itemset that is frequent in the full set of baskets.

Toivonen’s Algorithm (contd.) uAdd to the itemsets that are frequent in the sample the negative border of these itemsets. uAn itemset is in the negative border if it is not deemed frequent in the sample, but all its immediate subsets are. wExample: ABCD is in the negative border if and only if it is not frequent, but all of ABC, BCD, ACD, and ABD are.

Toivonen’s Algorithm (contd.) uIn a second pass, count all candidate frequent itemsets from the first pass, and also count the negative border. uIf no itemset from the negative border turns out to be frequent, then the candidates found to be frequent in the whole data are exactly the frequent itemsets.

Toivonen’s Algorithm (contd.) uWhat if we find something in the negative border is actually frequent? uWe must start over again! uBut by choosing the support threshold for the sample wisely, we can make the probability of failure low, while still keeping the number of itemsets checked on the second pass low enough for main-memory.

Experiment
Synthetic data set characteristics (T = average row size, I = average size of the maximal frequent sets)

Experiment (contd.)
Lowered frequency thresholds (%) such that the probability of missing any given frequent set is less than δ = 0.001

Number of trials with misses

Conclusions
- Advantages: reduced failure probability while keeping the candidate count low enough for main memory.
- Disadvantages: a potentially large number of candidates in the second pass.

Thank you!