Charles Tappert Seidenberg School of CSIS, Pace University


Data Science and Big Data Analytics
Chap 5: Advanced Analytical Theory and Methods: Association Rules
Charles Tappert, Seidenberg School of CSIS, Pace University

Chapter Sections
5.1 Overview
5.2 Apriori Algorithm
5.3 Evaluation of Candidate Rules
5.4 Applications of Association Rules
5.5 Example: Transactions in a Grocery Store
5.6 Validation and Testing
5.7 Diagnostics

5.1 Overview
Association rules: an unsupervised, descriptive (not predictive) learning method used to find hidden relationships in data; the relationships are represented as rules.
Questions association rules might answer: Which products tend to be purchased together? What products do similar customers tend to buy?

5.1 Overview Example – general logic of association rules

5.1 Overview
Rules have the form X -> Y: when X is observed, Y is also observed.
Itemset: a collection of items or entities; a k-itemset = {item 1, item 2, …, item k}.
Examples: the items purchased in one transaction; the set of hyperlinks clicked by a user in one session.

5.1 Overview – Apriori Algorithm
Apriori is the most fundamental algorithm.
Given an itemset L, the support of L is the percentage of transactions that contain L.
A frequent itemset is one whose items appear together "often enough"; minimum support defines "often enough" (as a percentage of transactions).
If an itemset is frequent, then any subset of it is frequent.

5.1 Overview – Apriori Algorithm If {B,C,D} frequent, then all subsets frequent
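The support computation and the downward-closure property above can be sketched in Python (a minimal illustration with a hypothetical five-transaction database of items A–E; not the chapter's R code):

```python
from itertools import combinations

# Hypothetical toy transaction database (items A-E are made up for illustration).
transactions = [
    {"A", "B", "C", "D"},
    {"B", "C", "D"},
    {"A", "B", "D"},
    {"B", "C"},
    {"A", "C", "D"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# {B, C, D} is frequent at minimum support 0.4, so every subset of it
# must also be frequent (the downward-closure property).
assert support({"B", "C", "D"}, transactions) == 0.4
for k in (1, 2):
    for subset in combinations({"B", "C", "D"}, k):
        assert support(subset, transactions) >= 0.4
```

The converse pruning step is what Apriori exploits: if any subset of a candidate is infrequent, the candidate can be discarded without counting it.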

5.2 Apriori Algorithm
Frequent = meets minimum support.
Bottom-up iterative algorithm: identify the frequent (minimum support) 1-itemsets; frequent 1-itemsets are paired into 2-itemsets, the frequent 2-itemsets are identified, and so on.
Definitions for the next slide:
D = transaction database
d = minimum support threshold
N = maximum length of itemset (optional parameter)
Ck = set of candidate k-itemsets
Lk = set of k-itemsets with minimum support
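The bottom-up iteration just described (generate candidates Ck from Lk-1, prune by the subset property, filter by support) can be sketched in Python; this is a minimal illustration with hypothetical transactions, not the chapter's R/arules implementation:

```python
from itertools import combinations

def apriori(transactions, min_support, max_len=None):
    """Bottom-up Apriori: L1 -> C2 -> L2 -> ... Returns {itemset: support}."""
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    # L1: frequent 1-itemsets
    items = sorted({i for t in transactions for i in t})
    Lk = {frozenset([i]) for i in items if sup(frozenset([i])) >= min_support}
    frequent = {s: sup(s) for s in Lk}
    k = 2
    while Lk and (max_len is None or k <= max_len):
        # Ck: join frequent (k-1)-itemsets into k-itemsets, then prune any
        # candidate with an infrequent (k-1)-subset (downward closure).
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        Ck = {c for c in Ck
              if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        Lk = {c for c in Ck if sup(c) >= min_support}
        frequent.update({s: sup(s) for s in Lk})
        k += 1
    return frequent

# Hypothetical four-transaction database.
transactions = [{"milk", "bread"}, {"milk", "eggs"},
                {"milk", "bread", "eggs"}, {"bread", "eggs"}]
freq = apriori(transactions, min_support=0.5)
```

At min support 0.5 this yields three frequent 1-itemsets and three frequent 2-itemsets; the 3-itemset {milk, bread, eggs} appears in only one of four transactions and is filtered out.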

5.2 Apriori Algorithm

5.3 Evaluation of Candidate Rules – Confidence
Frequent itemsets can form candidate rules. Confidence measures the certainty of a rule; minimum confidence is a predefined threshold.
Problem with confidence: given a rule X -> Y, confidence considers only the antecedent (X) and the co-occurrence of X and Y; it cannot tell whether a rule expresses a true implication.
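The blind spot noted above can be shown with a small Python sketch (hypothetical transactions, not the chapter's R code): here Y appears in every basket, so X -> Y earns confidence 1.0 even though X tells us nothing about Y.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """Confidence(X -> Y) = Support(X u Y) / Support(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

# Hypothetical baskets: Y is in every transaction, X in only two of them.
transactions = [{"X", "Y"}, {"Y"}, {"Y"}, {"X", "Y"}, {"Y"}]
```

Here confidence(X -> Y) is 1.0 while Support(Y) is also 1.0, so the co-occurrence is exactly what independence would predict; a measure like lift (next slide) exposes this.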

5.3 Evaluation of Candidate Rules – Lift
Lift measures how much more often X and Y occur together than expected if they were statistically independent. Lift = 1 if X and Y are statistically independent; Lift > 1 indicates the degree of usefulness of the rule.
Example – in 1000 transactions:
If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400, then Lift(milk -> eggs) = 0.3/(0.5*0.4) = 1.5
If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Lift(milk -> bread) = 0.4/(0.5*0.4) = 2.0
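The lift arithmetic on this slide can be checked directly; a Python one-liner using the slide's supports expressed as fractions of the 1000 transactions:

```python
def lift(sup_xy, sup_x, sup_y):
    """Lift(X -> Y) = Support(X u Y) / (Support(X) * Support(Y))."""
    return sup_xy / (sup_x * sup_y)

# Slide's numbers: {milk, eggs} in 300 of 1000, {milk} in 500, {eggs} in 400.
print(lift(0.3, 0.5, 0.4))   # ~1.5
# {milk, bread} in 400 of 1000, {milk} in 500, {bread} in 400.
print(lift(0.4, 0.5, 0.4))   # ~2.0
```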

5.3 Evaluation of Candidate Rules – Leverage
Leverage measures the difference between the probability of X and Y appearing together and what would be expected under statistical independence. Leverage = 0 if X and Y are statistically independent; Leverage > 0 indicates the degree of usefulness of the rule.
Example – in 1000 transactions:
If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400, then Leverage(milk -> eggs) = 0.3 - 0.5*0.4 = 0.1
If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Leverage(milk -> bread) = 0.4 - 0.5*0.4 = 0.2
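The leverage values follow the same pattern and can be verified with the slide's numbers (Python sketch):

```python
def leverage(sup_xy, sup_x, sup_y):
    """Leverage(X -> Y) = Support(X u Y) - Support(X) * Support(Y)."""
    return sup_xy - sup_x * sup_y

# Slide's numbers as fractions of the 1000 transactions.
print(leverage(0.3, 0.5, 0.4))   # ~0.1 for milk -> eggs
print(leverage(0.4, 0.5, 0.4))   # ~0.2 for milk -> bread
```

Unlike lift (a ratio), leverage is a difference of probabilities, so its magnitude also reflects how much of the dataset the pattern covers.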

5.4 Applications of Association Rules
The term market basket analysis refers to a specific implementation of association rules: for better merchandising – which products to include in or exclude from inventory each month, and placement of products near related products.
Association rules are also used for: recommender systems (Amazon, Netflix); clickstream analysis from web usage log files (e.g., visitors to page X click on links A, B, C more than on links D, E, F).

5.5 Example: Grocery Store Transactions 5.5.1 The Groceries Dataset
Install the arules and arulesViz packages (via the RStudio Packages -> Install menu, or from the console):
> install.packages(c("arules", "arulesViz"))
> library('arules')
> library('arulesViz')
> data(Groceries)
> summary(Groceries) # indicates 9835 transactions
The class of dataset Groceries is transactions, containing 3 slots:
transactionInfo # data frame with vectors whose length equals the number of transactions
itemInfo # data frame storing item labels
data # binary incidence matrix of item labels in transactions
> Groceries@itemInfo[1:10,]
> apply(Groceries@data[,10:20],2,function(r) paste(Groceries@itemInfo[r,"labels"],collapse=", "))

5.5 Example: Grocery Store Transactions 5.5.2 Frequent Itemset Generation
To illustrate the Apriori algorithm, the code below does each iteration separately. Assume a minimum support threshold of 0.02 (0.02 * 9835 ≈ 197 transactions); this yields 122 itemsets in total.
First, get itemsets of length 1:
> itemsets<-apriori(Groceries,parameter=list(minlen=1,maxlen=1,support=0.02,target="frequent itemsets"))
> summary(itemsets) # found 59 itemsets
> inspect(head(sort(itemsets,by="support"),10)) # lists top 10 itemsets by support
Second, get itemsets of length 2:
> itemsets<-apriori(Groceries,parameter=list(minlen=2,maxlen=2,support=0.02,target="frequent itemsets"))
> summary(itemsets) # found 61 itemsets
Third, get itemsets of length 3:
> itemsets<-apriori(Groceries,parameter=list(minlen=3,maxlen=3,support=0.02,target="frequent itemsets"))
> summary(itemsets) # found 2 itemsets

5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization
The Apriori algorithm will now generate rules. Set the minimum support threshold to 0.001 (allowing more rules, presumably for the scatterplot) and the minimum confidence threshold to 0.6 to generate 2,918 rules.
> rules <- apriori(Groceries,parameter=list(support=0.001,confidence=0.6,target="rules"))
> summary(rules) # finds 2918 rules
> plot(rules) # displays scatterplot
The scatterplot shows that the highest lift occurs at low support and low confidence.

5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization
[Figure: scatterplot of the 2,918 rules displayed by plot(rules)]

5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization
Get a scatterplot matrix to compare the support, confidence, and lift of the 2,918 rules:
> plot(rules@quality) # displays scatterplot matrix
Lift is proportional to confidence, with several linear groupings. Note that Lift = Confidence/Support(Y), so when the support of Y remains the same, lift is proportional to confidence and the slope of the linear trend is the reciprocal of Support(Y).
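The identity used here, Lift = Confidence/Support(Y), follows directly from the definitions and can be checked numerically (Python sketch, reusing the milk/eggs supports from the earlier lift example):

```python
# Since Lift = Support(XY) / (Support(X) * Support(Y)) and
# Confidence = Support(XY) / Support(X), dividing gives
# Lift = Confidence / Support(Y).
sup_x, sup_y, sup_xy = 0.5, 0.4, 0.3   # supports from the lift example

confidence = sup_xy / sup_x            # 0.6
lift = sup_xy / (sup_x * sup_y)        # 1.5
print(lift, confidence / sup_y)        # the two quantities agree
```

This is why rules sharing the same consequent Y fall on a common line in the confidence-vs-lift plot, with slope 1/Support(Y).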

5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization
[Figure: scatterplot matrix of support, confidence, and lift displayed by plot(rules@quality)]

5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization
Compute 1/Support(Y), which is the slope:
> slope <- sort(round(rules@quality$lift/rules@quality$confidence,2))
Display the number of times each slope appears in the dataset:
> unlist(lapply(split(slope,f=slope),length))
Display the top 10 rules sorted by lift:
> inspect(head(sort(rules,by="lift"),10))
The rule {Instant food products, soda} -> {hamburger meat} has the highest lift, about 19 (page 154).

5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization
Find the rules with confidence above 0.9:
> confidentRules <- rules[quality(rules)$confidence > 0.9]
> confidentRules # set of 127 rules
Plot a matrix-based visualization of the LHS vs. RHS of the rules:
> plot(confidentRules,method="matrix",measure=c("lift","confidence"),control=list(reorder=TRUE))
The legend on the right is a color matrix indicating the lift and the confidence to which each square in the main matrix corresponds.

5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization
[Figure: matrix-based visualization of the 127 confident rules]

5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization
Visualize the top 5 rules with the highest lift:
> highLiftRules <- head(sort(rules,by="lift"),5)
> plot(highLiftRules,method="graph",control=list(type="items"))
In the graph, an arrow always points from an item on the LHS to an item on the RHS. For example, the arrows that connect ham, processed cheese, and white bread suggest the rule {ham, processed cheese} -> {white bread}. The size of a circle indicates support, and its shade represents lift.

5.5 Example: Grocery Store Transactions 5.5.3 Rule Generation and Visualization
[Figure: graph visualization of the top 5 rules by lift]

5.6 Validation and Testing
Frequent and high-confidence itemsets are found using pre-specified minimum support and minimum confidence levels. Measures like lift and/or leverage then help ensure that interesting rules are identified rather than coincidental ones.
However, some of the remaining rules may be considered subjectively uninteresting because they don't yield unexpected, profitable actions; e.g., rules like {paper} -> {pencil} are not interesting or meaningful. Incorporating subjective knowledge requires domain experts.
Good rules provide valuable insights that help institutions improve their business operations.

5.7 Diagnostics
Although minimum support is pre-specified in phases 3 and 4, this level can be adjusted to target the desired number of rules; variants and improvements of Apriori are available.
For large datasets the Apriori algorithm can be computationally expensive. Efficiency improvements include:
Partitioning
Sampling
Transaction reduction
Hash-based itemset counting
Dynamic itemset counting