Association Rule Mining


Association Rule Mining Dr. P. Viswanath, RGMCET November 21, 2018 Data Mining: Concepts and Techniques

Overview
- Data mining
- Association rule mining
- Apriori method
- Some other methods
- Conclusion

Data Mining
Data mining is the process of discovering previously unknown and potentially useful relationships among data elements in a large database. Techniques from statistics, pattern recognition, machine intelligence, and databases can be used for this purpose, but scalability to large data sets is the main concern.

Data Mining: Some Tasks
The relationships that can be discovered could be:
- a kind of rule between various elements: quantitative descriptive rule, quantitative discriminant rule, association rule
- natural groups among data items: data clustering
- a prediction about the future: time series analysis

Association Rule
An example: people buy items from a supermarket, and the goods bought by each person are stored in a database. Let the items be {A, B, C, …}.

Association Rule
Consider a rule such as: if a person buys the set of items {A,C,E}, then he or she will usually also buy the set of items {D,F}. This is written as the association rule {A,C,E} → {D,F}.
Example: people who buy potato chips also tend to buy cool drinks, which gives the rule potato chips → cool drinks.

Association Rule
But how good are these rules? That is, how much can we trust them? Are they useful? How frequently is a rule applicable?

Association Rule
{D} → {A} is an association rule. According to the given database, this rule is always true. [confidence is high]
But only one person bought both D and A. [support is low]

Association Rule
{A} → {C} is an association rule. According to the given database, this rule is only partly true. [confidence is not high]
But 2 out of 4 people bought both A and C. [support is moderate]

Notation and Definitions
Let I be the set of all items, and let X, Y, … be subsets of I. We call X, Y, … itemsets. If X has k items, then X is called a k-itemset.
If I is of size n (that is, there are n items in total), then the total number of non-empty itemsets is 2^n - 1.
An association rule is of the form X → Y.

Notation and Definitions
The support of the rule X → Y is the fraction of transactions that contain both X and Y:
Support = (# transactions containing both X and Y) / (total # of transactions)
The confidence of the rule is:
Confidence = (# transactions containing both X and Y) / (# transactions containing X)
Both are very often given as percentages rather than fractions.

The Example
For the rule A → C: support = 0.5 (or 50%), confidence = 2/3 ≈ 0.667 (or 66.7%).

Notation
Normally, support is defined for an itemset: Support(X) = fraction (or percentage) of transactions containing X.
Confidence is defined for a rule: Confidence(X → Y) = Support(X ∪ Y) / Support(X), where X ∪ Y is the itemset made up of all items of X and Y together.
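
As a minimal sketch of these definitions, the Python below computes support and confidence over a small hypothetical database. The four transactions are an assumption, chosen only so that the numbers agree with the figures quoted for A → C and D → A; the function names `support` and `confidence` are likewise illustrative.

```python
from fractions import Fraction

# Hypothetical transactions (an assumption, not the database shown on the slides).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in db if itemset <= t)
    return Fraction(hits, len(db))

def confidence(x, y, db):
    """Confidence of the rule x -> y: support of x and y together / support of x."""
    return support(set(x) | set(y), db) / support(x, db)

print(support({"A", "C"}, transactions))       # 1/2
print(confidence({"A"}, {"C"}, transactions))  # 2/3
print(confidence({"D"}, {"A"}, transactions))  # 1
```

Exact fractions are used so that 2/3 is not rounded. The sketch does not guard against a left-hand side that never occurs, in which case the division would fail.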

An Exercise Problem

Transaction Id    Items bought
100               A,B,C
101               B,C
102               A,C
103               A,B,D
104
105               A,C,E
106               B,D
107

Find the support and confidence of A → B.
Find the support and confidence of B → A.

The Problem
Given a transactional database, find all association rules that satisfy a given minimum support and minimum confidence.

The Problem
This problem boils down to two subproblems:
1. Find all itemsets whose support is at least the minimum value. This is called frequent itemset mining.
2. Find the association rules using the frequent itemsets.
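
The second subproblem can be sketched as follows, assuming the frequent itemsets and their counts are already known. The slides do not spell out this step, so the function name `generate_rules`, the input format, and the example counts are all hypothetical. The sketch also assumes that every non-empty subset of a frequent itemset is present in the input.

```python
from itertools import combinations

def generate_rules(freq_counts, min_conf):
    """freq_counts maps frozenset -> count for every frequent itemset.

    Returns (lhs, rhs, confidence) for each rule meeting min_conf.
    Assumes every non-empty subset of a frequent itemset is also a key.
    """
    rules = []
    for itemset, count in freq_counts.items():
        # Split the itemset into a left-hand side and the remaining items.
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                conf = count / freq_counts[lhs]
                if conf >= min_conf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

# Hypothetical counts over 4 transactions.
freq_counts = {
    frozenset("A"): 3,
    frozenset("B"): 2,
    frozenset("C"): 2,
    frozenset("AC"): 2,
}
for lhs, rhs, conf in generate_rules(freq_counts, min_conf=0.6):
    print(sorted(lhs), "->", sorted(rhs), round(conf, 3))
```

Only counts are needed here, because the total number of transactions cancels in the confidence ratio.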

The Problem
Frequent itemset mining is the more difficult of the two subproblems: find all itemsets whose support is at least a given value. How difficult is this problem?

A Simple Algorithm
Suppose the minimum support is s% and there are m transactions. An itemset is then frequent if it is present in at least sm/100 transactions; sm/100 is the threshold count. A simple, naive algorithm for this follows.

A Naive Algorithm
1. For each itemset, create a counter, and initialize all counters to zero.
2. For each transaction in the database, find all subsets of the transaction and increment their respective counters.
3. Select those itemsets whose counter value is at least the threshold.
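
The naive method can be sketched in Python as below. Two things are assumptions of this sketch rather than part of the slides: counters are created lazily, only for itemsets that actually occur in some transaction, instead of all being allocated up front; and an itemset counts as frequent when its count is at least `min_count`. The database is hypothetical.

```python
from collections import Counter
from itertools import combinations

def naive_frequent_itemsets(db, min_count):
    """Single scan of db, counting every non-empty subset of every transaction."""
    counts = Counter()
    for t in db:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[subset] += 1
    return {s: c for s, c in counts.items() if c >= min_count}

# Hypothetical database of four transactions.
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(naive_frequent_itemsets(db, min_count=2))
```

Each transaction with k items contributes 2^k - 1 subsets, so the work per transaction grows quickly with transaction length even though the database is read only once.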

Analysis of the Algorithm
If there are n items, the total number of counters is 2^n - 1. If n is small (perhaps < 20), this is a feasible solution. But when n is large (like 1000), it is not feasible to create 2^1000 - 1 counters. As an exercise, try to find out how big this number is.
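
One quick way to get a feel for the size of this number is to let Python's arbitrary-precision integers do the arithmetic. The variable names are illustrative.

```python
small = 2**20 - 1      # counters needed for n = 20 items
large = 2**1000 - 1    # counters needed for n = 1000 items

print(small)             # 1048575
print(len(str(large)))   # number of decimal digits: 302
```

About a million counters is manageable; a count with 302 decimal digits is far beyond any memory that could be built.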

Analysis
- Time complexity is O(m) in the number of transactions m, for transactions of bounded length. [Good]
- Number of database scans: only one. [Good]
- Space complexity is O(2^n). [Very bad]
In data mining, the number of database scans is an important measure of scalability.

Other Naive Method
The other way is to use only one counter and find the support of each itemset separately. For this, one has to scan the database 2^n - 1 times. Space complexity is reduced, but time complexity is increased.
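
A sketch of this one-counter variant is given below, under the same assumptions as before: a hypothetical database, and "frequent" meaning a count of at least `min_count`. The `scans` value is returned only to make the number of passes over the database visible.

```python
from itertools import combinations

def one_counter_frequent_itemsets(items, db, min_count):
    """Check each itemset separately, reusing a single counter."""
    items = sorted(items)
    frequent = {}
    scans = 0
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = 0                  # the single counter, reset per itemset
            for t in db:               # one full database scan per itemset
                if set(candidate) <= t:
                    count += 1
            scans += 1
            if count >= min_count:
                frequent[candidate] = count
    return frequent, scans

# Hypothetical database over six items.
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
frequent, scans = one_counter_frequent_itemsets("ABCDEF", db, min_count=2)
print(frequent)
print(scans)   # 63, i.e. 2**6 - 1
```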

Apriori Algorithm
Apriori was one of the first algorithms to solve this problem in a better way. It uses an important property of itemsets: every subset of a frequent itemset must also be a frequent itemset. For example, if {A,B} is a frequent itemset, both {A} and {B} must also be frequent. Conversely, if either {A} or {B} is not frequent, then {A,B} is not frequent either.

Apriori Algorithm
Some itemsets can be discarded at an early stage. For example, if X is a non-frequent itemset, then there is no need to consider any superset of X. But if X is frequent, then a superset of X may also be frequent.

Apriori Algorithm
This is a bottom-up method: first find the frequent 1-itemsets, then the frequent 2-itemsets, and so on. Suppose we have already found the frequent k-itemsets; we call this set L_k.

Apriori Algorithm Continued …
We generate candidates that could be frequent (k+1)-itemsets; we call this candidate set C_{k+1}. We then count the occurrences of these candidates in the database to obtain L_{k+1}.

How Candidates Are Generated
If {A,B,C} and {A,B,D} are two itemsets in L_3, then {A,B,C,D} is a candidate itemset in C_4, provided all its subsets of size 3 are in L_3. If, for example, {B,C,D} is not in L_3, then {A,B,C,D} cannot be frequent and is removed from C_4. This is called the pruning step.
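
The join and pruning steps can be sketched as follows. The prefix-based join used here (two k-itemsets are joined when they agree on their first k-1 items in sorted order) is one common way to realize the example above and is an assumption of this sketch; `generate_candidates` is a hypothetical name.

```python
from itertools import combinations

def generate_candidates(prev_frequent, k):
    """Build C_{k+1} from L_k, given as a set of frozensets of size k."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    candidates = set()
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:k - 1] != b[:k - 1]:
                break  # sorted order: no later itemset shares this prefix
            cand = frozenset(a) | frozenset(b)
            # Pruning step: every k-subset of the candidate must be in L_k.
            if all(frozenset(sub) in prev_frequent
                   for sub in combinations(sorted(cand), k)):
                candidates.add(cand)
    return candidates

L3 = {frozenset("ABC"), frozenset("ABD"), frozenset("ACD"), frozenset("BCD")}
print(generate_candidates(L3, 3))                       # {A,B,C,D} survives
print(generate_candidates(L3 - {frozenset("BCD")}, 3))  # set(): pruned
```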

The Apriori Algorithm
C_k: candidate itemsets of size k
L_k: frequent itemsets of size k

Find L_1;
for (k = 1; L_k != ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in the database do
        increment the count of all candidates in C_{k+1} that are contained in t;
    L_{k+1} = candidates in C_{k+1} with min_support;
end
return ∪_k L_k;
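
A compact Python rendering of this loop is sketched below. It is not an optimized implementation. The join here simply pairs any two itemsets in L_k whose union has k+1 items, which after pruning gives the same candidates as a prefix-based join. The database is hypothetical, and an itemset is treated as frequent when its count is at least `min_count`.

```python
from itertools import combinations

def apriori(db, min_count):
    """Return {frozenset: count} for every itemset found in >= min_count transactions."""
    db = [frozenset(t) for t in db]

    # L_1: count single items in one scan.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)

    k = 1
    while level:
        # C_{k+1}: join pairs from L_k, then prune using the Apriori property.
        candidates = set()
        for a, b in combinations(level, 2):
            cand = a | b
            if len(cand) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(cand, k)
            ):
                candidates.add(cand)

        # One database scan to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical database; min_count=2 corresponds to 50% of 4 transactions.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, count in sorted(apriori(db, 2).items(), key=lambda p: (len(p[0]), sorted(p[0]))):
    print(sorted(itemset), count)
```

On this data the loop stops after the level containing {B,C,E}, since a single 3-itemset cannot be joined with anything to form a 4-itemset candidate.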

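The Apriori Algorithm: Example
[Figure: database D is scanned to count C_1 and obtain L_1; C_2 is generated from L_1 and counted in a second scan of D to obtain L_2; C_3 is generated from L_2 and counted in a third scan of D to obtain L_3.]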

Analysis of Apriori Algorithm
If the largest frequent itemset has size k, then the database must be scanned at least k times. The space required depends on the number of candidates generated. Even so, this is certainly better than the naive methods.

Exercise Problem

Transaction Id    Items bought
100               A,B,C,D,E
101               A,B,C,D,F
102               B,C,F
103               A,C,F,G

Let the minimum support required be 50%. Find all frequent itemsets using the Apriori algorithm. At each stage, show the candidates generated and describe how the Apriori property is used to prune the candidate set.

Methods to Improve Apriori’s Efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.

Methods to Improve Apriori’s Efficiency
- Sampling: mine a subset of the given data with a lower support threshold, plus a method to determine the completeness.
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
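
Of the methods listed, partitioning is the easiest to illustrate in a few lines. The sketch below is a hypothetical two-phase scheme built on that property: itemsets that are locally frequent in some partition become global candidates, and one further pass over the whole database counts them exactly. Local mining is done by brute-force subset enumeration purely for brevity. The function names, the integer `min_percent` parameter, and the database are assumptions.

```python
from collections import Counter
from itertools import combinations
from math import ceil

def local_frequent(part, min_percent):
    """Itemsets whose support within this partition is at least min_percent."""
    counts = Counter()
    for t in part:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            counts.update(combinations(items, size))
    return {s for s, c in counts.items() if c * 100 >= min_percent * len(part)}

def partitioned_frequent(db, min_percent, n_parts):
    size = ceil(len(db) / n_parts)
    parts = [db[i:i + size] for i in range(0, len(db), size)]

    # Phase 1: the union of locally frequent itemsets forms the global candidates.
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_percent)

    # Phase 2: one pass over the whole database to count the candidates exactly.
    counts = Counter()
    for t in db:
        for cand in candidates:
            if set(cand) <= t:
                counts[cand] += 1
    return {s: c for s, c in counts.items() if c * 100 >= min_percent * len(db)}

# Hypothetical database, split into two partitions of two transactions each.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(partitioned_frequent(db, min_percent=50, n_parts=2))
```

Integer arithmetic (`c * 100 >= min_percent * n`) is used for the threshold test to avoid floating-point rounding at the boundary.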

Mining Frequent Patterns Without Candidate Generation
Compress a large database into a compact frequent-pattern tree (FP-tree) structure:
- highly condensed, but complete for frequent pattern mining
- avoids costly database scans
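
To make the compression idea concrete, here is a minimal sketch of building an FP-tree in two scans. It covers only tree construction, not the mining step, and it omits the header table with node links that a full FP-tree carries. The class and function names and the database are hypothetical, and ties in item frequency are broken alphabetically, which is an arbitrary choice.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}   # item -> FPNode

def build_fp_tree(db, min_count):
    # Scan 1: count how many transactions contain each item.
    freq = Counter(item for t in db for item in set(t))
    keep = {i for i, c in freq.items() if c >= min_count}

    root = FPNode(None)
    # Scan 2: insert each transaction's frequent items in one fixed global
    # order (most frequent first), so that common prefixes share nodes.
    for t in db:
        ordered = sorted((i for i in set(t) if i in keep),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root

def count_nodes(node):
    """Number of item nodes below this node (the root itself is not counted)."""
    return sum(1 + count_nodes(c) for c in node.children.values())

# Hypothetical database.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
root = build_fp_tree(db, min_count=2)
print(count_nodes(root))   # 7 nodes hold the 11 frequent-item occurrences
```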

FP-tree Based Mining
- Develops an efficient, FP-tree-based frequent pattern mining method.
- A divide-and-conquer methodology: decompose mining tasks into smaller ones.
- Avoids candidate generation: sub-database test only.

Partition-Based Methods
Partition the database and then apply divide-and-conquer strategies.

Summary
- Association rule mining is probably the most significant contribution from the database community to KDD. A large number of papers have been published, and many interesting issues have been explored.
- An interesting research direction is association analysis in other types of data: spatial data, multimedia data, time series data, etc.

Thank you!