Association Rule Mining



Presentation on theme: "Association Rule Mining"— Presentation transcript:

1 Association Rule Mining
Dr. P. Viswanath, RGMCET

2 Overview
Data mining
Association rule mining
The Apriori method
Some other methods
Conclusion

3 Data Mining
Data mining: a process of discovering previously unknown and potentially useful relationships among data elements in a large database. Various techniques from Statistics, Pattern Recognition, Machine Intelligence, and Databases can be used for this purpose, but scalability to large data sets is the main concern.

4 Data Mining: Some tasks
The relationships that can be discovered could be:
a kind of rule between various elements (quantitative descriptive rules, quantitative discriminant rules, association rules);
natural groups among data items (data clustering);
a prediction about the future (time series analysis).

5 Association Rule
An example: there is a supermarket, and people buy items from it. The goods bought by each person are stored in a database. Let the items be {A, B, C, …}.

6 Association Rule
A rule like: if a person buys the set of items {A,C,E}, then most likely he/she will also buy the set of items {D,F}. Here {A,C,E} → {D,F} is the association rule. E.g., people who buy potato chips also tend to buy cool drinks: potato chips → cool drinks.

7 Association Rule
But how good are these rules? That is, how much can we trust these rules? And are they useful, i.e., how frequently is a rule applicable?

8 Association Rule
{D} → {A} is an association rule. According to the given database, this rule is true [confidence is high], but only one person bought both D and A [support is low].

9 Association Rule
{A} → {C} is an association rule. According to the given database, this rule is only partly true [confidence is not high], but 2 out of 4 bought both A and C [support is moderate].

10 Notation and Definitions
Let I be the set of all items, and let X, Y, … be subsets of I. We call X, Y, … itemsets. If X has k items, then X is called a k-itemset. If I is of size n (that is, there are n items in total), then the total number of itemsets is 2^n − 1. An association rule is of the form X → Y.

11 Notation and Definitions
Support for the rule X → Y is the fraction of transactions that contain both X and Y. That is, support = (#transactions containing both X and Y) / (total #transactions). Confidence of the rule = (#transactions containing both X and Y) / (#transactions containing X). Very often these are given as percentages rather than fractions.
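As a quick illustration of these definitions, here is a minimal Python sketch. The function names and the list-of-sets representation of transactions are illustrative choices, not from the slides.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs -> rhs: support(lhs and rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)
```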

12 The Example
For the rule A → C: support = 0.5 (or 50%), confidence ≈ 0.67 (or 66.6%).
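For instance, with a hypothetical four-transaction database {A,C,D}, {B,C,E}, {A,B,C,E}, {A,B,E} (an assumed reconstruction chosen to match these numbers; the actual example table from the slide is not reproduced in this transcript), A appears in 3 transactions and {A,C} in 2, so support(A → C) = 2/4 = 0.5 and confidence(A → C) = 2/3 ≈ 66.6%.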

13 Notation
Normally, support is defined for an itemset: support(X) = the percentage of transactions containing X. Confidence is defined for a rule: confidence(X → Y) = support(X ∪ Y) / support(X).

14 An Exercise Problem
Transaction Id    Items bought
100               A,B,C
101               B,C
102               A,C
103               A,B,D
104
105               A,C,E
106               B,D
107
Find the support and confidence of A → B.
Find the support and confidence of B → A.

15 The Problem
Given a transactional database, find all association rules that satisfy a given minimum support and minimum confidence.

16 The Problem
This problem breaks down into two subproblems: (1) find all itemsets whose support is more than the minimum value; this is called frequent itemset mining; (2) find the association rules using the frequent itemsets (see the sketch after this slide).
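A minimal sketch of the second subproblem, assuming the frequent itemsets and their supports have already been computed. The function name and the dictionary representation are illustrative, not from the slides.

```python
from itertools import combinations

def rules_from_frequent(frequent, min_conf):
    """Derive rules X -> Y from frequent itemsets.

    `frequent` maps frozenset itemsets to their support (as a fraction); by the
    Apriori property it already contains every subset of every frequent itemset.
    """
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = supp / frequent[lhs]   # conf(X -> Y) = supp(X u Y) / supp(X)
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), supp, conf))
    return rules
```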

17 The Problem
Frequent itemset mining is the more difficult subproblem: find all itemsets whose support is more than a given value. How difficult is this problem?

18 A Simple Algorithm
If the minimum support is s% and there are m transactions, then an itemset is frequent if it is present in more than sm/100 transactions; here sm/100 is the threshold count. A simple naïve algorithm for this follows.

19 A Naive Algorithm
For each itemset, create a counter and initialize all counters to zero. For each transaction in the database, find all subsets of the transaction and increment their respective counters. Finally, select those itemsets whose counter value is more than the given threshold value (a sketch follows below).
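A sketch of this naive algorithm in Python. Counters are created lazily here instead of pre-initialising all 2^n − 1 of them, but the behaviour is the same; the function name is illustrative.

```python
from itertools import combinations

def naive_frequent_itemsets(transactions, threshold):
    """Count every subset of every transaction; keep those meeting the threshold."""
    counters = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for subset in combinations(items, k):
                counters[subset] = counters.get(subset, 0) + 1
    return {itemset: c for itemset, c in counters.items() if c >= threshold}
```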

20 Analysis of the Algorithm
If there are n items, then the total number of counters is 2^n − 1. If n is small (perhaps < 20), this is a feasible solution, but when n is large (like 1000) it is not feasible to create 2^1000 − 1 counters. As an exercise, try to find out how big this number is.

21 Analysis
Time complexity is O(m) [good]. Number of database scans: only one [good]. Space complexity is O(2^n) [very bad]. In data mining, the number of database scans is an important measure of scalability.

22 Other Naïve Method
The other way is to use only one counter and find the support of each itemset separately. For this, one has to scan the database 2^n − 1 times. Space complexity is reduced, but time complexity is increased.

23 Apriori Algorithm
One of the first algorithms to solve this problem in a better way. It uses an important property of itemsets: a subset of a frequent itemset must also be a frequent itemset. I.e., if {A,B} is a frequent itemset, then both {A} and {B} must also be frequent; if either {A} or {B} is not frequent, then {A,B} is also non-frequent.

24 Apriori Algorithm
Some of the itemsets we can therefore discard at early stages. For example, if X is a non-frequent itemset, then there is no need to consider any superset of X. But if X is frequent, then a superset of X may also be frequent.

25 Apriori Algorithm
This is a bottom-up method: first find the frequent 1-itemsets, then the frequent 2-itemsets, and so on. Suppose we have already found the frequent k-itemsets; we call this set Lk.

26 Apriori Algorithm Continued …
We generate candidates that can be frequent (k+1)-itemsets; we call this candidate set Ck+1. We then count these candidates over the database and obtain Lk+1.

27 How candidates are generated
If {A,B,C} and {A,B,D} are two itemsets in L3, then a candidate itemset in C4 is {A,B,C,D}, provided all its subsets of size 3 are in L3. If, for example, {B,C,D} is not in L3, then {A,B,C,D} cannot be frequent and is removed from C4 [this is called the pruning step]. A sketch of this step is given below.
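A sketch of the generation and pruning steps. It uses a simple pairwise-union join rather than the classic "share the first k−1 items" join, but after the prune step the candidate set is the same; the function name is illustrative.

```python
from itertools import combinations

def apriori_gen(L_k, k):
    """Generate candidate (k+1)-itemsets C_{k+1} from the frequent k-itemsets L_k.

    `L_k` is a set of frozensets, each of size k.
    """
    # Join step: union pairs of frequent k-itemsets that differ in one item,
    # e.g. {A,B,C} and {A,B,D} give the candidate {A,B,C,D}.
    candidates = {a | b for a in L_k for b in L_k if len(a | b) == k + 1}
    # Prune step: every k-subset of a candidate must itself be in L_k;
    # e.g. if {B,C,D} is not in L_3, the candidate {A,B,C,D} is dropped.
    return {c for c in candidates
            if all(frozenset(s) in L_k for s in combinations(c, k))}
```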

28 The Apriori Algorithm
Ck: candidate itemsets of size k; Lk: frequent itemsets of size k.
Find L1;
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
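For concreteness, here is a compact Python sketch of the same loop. Function and variable names are illustrative, not the course's reference implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Apriori frequent-itemset mining; min_support is a fraction of |D|."""
    D = [frozenset(t) for t in transactions]
    min_count = min_support * len(D)

    def frequent_from(candidates):
        # One scan of the database per level: count each candidate's occurrences.
        counts = {c: 0 for c in candidates}
        for t in D:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c for c, n in counts.items() if n >= min_count}

    L = frequent_from({frozenset([i]) for t in D for i in t})   # L1
    result, k = set(), 1
    while L:
        result |= L
        # Candidate generation (join) followed by the Apriori prune step.
        C = {a | b for a in L for b in L if len(a | b) == k + 1}
        C = {c for c in C if all(frozenset(s) in L for s in combinations(c, k))}
        L = frequent_from(C)
        k += 1
    return result
```

Calling apriori(D, 0.5) on a list of item sets returns all itemsets whose support is at least 50%.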

29 The Apriori Algorithm — Example
[Figure: example run on a database D. Scan D to count C1 and obtain L1; generate C2 from L1 and scan D to obtain L2; generate C3 from L2 and scan D to obtain L3.]

30 Analysis of Apriori Algorithm
If the largest frequent itemset has size k, then we need to scan the database at least k times. The space required depends on the number of candidates generated. But certainly this is better than the naïve methods.

31 Exercise Problem
Transaction Id    Items bought
100               A,B,C,D,E
101               A,B,C,D,F
102               B,C,F
103               A,C,F,G
Let the minimum support required be 50%. Find all frequent itemsets using the Apriori algorithm. At each stage, show the candidates generated and describe how the Apriori property is used to prune the candidate set.

32 Methods to Improve Apriori’s Efficiency
Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans (see the sketch below).
Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
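As an illustration of the transaction-reduction idea above, here is a hedged sketch; the function name and representation are illustrative.

```python
def reduce_transactions(transactions, L_k):
    """Drop transactions that contain no frequent k-itemset.

    Such a transaction cannot contain any frequent (k+1)-itemset either,
    so it can be skipped in all subsequent database scans.
    `L_k` is a collection of frozensets of size k.
    """
    return [t for t in transactions
            if any(f <= frozenset(t) for f in L_k)]
```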

33 Methods to Improve Apriori’s Efficiency
Sampling: mine on a subset of the given data with a lowered support threshold, plus a method to determine completeness.
Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.

34 Mining Frequent Patterns Without Candidate Generation
Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed, but complete for frequent pattern mining, and it avoids costly repeated database scans. A construction sketch follows below.
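A minimal sketch of how such an FP-tree can be built. The FPNode class, its field names, and build_fp_tree are illustrative assumptions, and the FP-growth mining step that works on the tree is not shown.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_count):
    """Two scans: count item frequencies, then insert each transaction with its
    frequent items ordered by descending global frequency."""
    counts = defaultdict(int)
    for t in transactions:                       # scan 1: item frequencies
        for item in t:
            counts[item] += 1
    frequent = {i for i, c in counts.items() if c >= min_count}

    root = FPNode(None, None)
    header = defaultdict(list)                   # item -> list of tree nodes for that item
    for t in transactions:                       # scan 2: build the tree
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header
```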

35 FP-tree based mining
Develops an efficient, FP-tree-based frequent pattern mining method. A divide-and-conquer methodology: decompose the mining task into smaller ones. Avoids candidate generation: sub-database tests only!

36 Partition based methods
Partition the database and then apply divide-and-conquer strategies.

37 Summary
Association rule mining is probably the most significant contribution from the database community to KDD. A large number of papers have been published and many interesting issues have been explored. An interesting research direction: association analysis in other types of data, such as spatial data, multimedia data, and time series data.

38 Thank You
Thank you!!!


Download ppt "Association Rule Mining"

Similar presentations


Ads by Google