Mining Association Rules


Data Mining Overview
- Data mining
- Data warehouses and OLAP (On-Line Analytical Processing)
- Association rules mining
- Clustering: hierarchical and partitional approaches
- Classification: decision trees and Bayesian classifiers
- Sequential patterns mining
- Advanced topics: outlier detection, web mining

Association Rules: Background
Given: (1) a database of transactions, where (2) each transaction is a list of items (purchased by a customer in one visit)
Find: all association rules that satisfy user-specified minimum support and minimum confidence
Example: 30% of transactions that contain beer also contain diapers; 5% of transactions contain both items
- 30%: confidence of the rule
- 5%: support of the rule
We are interested in finding all such rules rather than verifying whether a single rule holds

Rule Measures: Support and Confidence
Find all rules X ∧ Y ⇒ Z with minimum confidence and support
- support, s: probability that a transaction contains {X ∪ Y ∪ Z}
- confidence, c: conditional probability that a transaction containing {X ∪ Y} also contains Z
(Figure: Venn diagram of customers buying beer, buying diapers, or buying both.)
With minimum support 50% and minimum confidence 50%, we have
- A ⇒ C (50%, 66.6%)
- C ⇒ A (50%, 100%)
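To make the two measures concrete, here is a minimal Python sketch. The four-transaction database is an assumed textbook example chosen to reproduce the numbers quoted above (the slide's own transaction table did not survive the transcript), and the function names are illustrative.

# Minimal sketch: support and confidence over a toy transaction database.
# The transactions below are an assumption, not taken from the slide.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = sup(antecedent ∪ consequent) / sup(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"A", "C"}, transactions))       # 0.5   -> A => C has 50% support
print(confidence({"A"}, {"C"}, transactions))  # 0.666 -> A => C has 66.6% confidence
print(confidence({"C"}, {"A"}, transactions))  # 1.0   -> C => A has 100% confidence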

Application Examples
Market basket analysis
- * ⇒ Maintenance Agreement (what should the store do to boost maintenance agreement sales?)
- Home Electronics ⇒ * (what other products should the store stock up on if it has a sale on home electronics?)
Attached mailing in direct marketing
Detecting "ping-ponging" of patients
- transaction: patient
- item: doctor/clinic visited by the patient
- support of the rule: number of common patients
- HIC Australia "success story"

Problem Statement
- I = {i1, i2, …, im}: a set of literals, called items
- Transaction T: a set of items such that T ⊆ I
- Database D: a set of transactions
- A transaction T contains X, a set of items in I, if X ⊆ T
- An association rule is an implication of the form X ⇒ Y, where X, Y ⊆ I
- The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y
- The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y
Find all rules that have support and confidence greater than user-specified minimum support and minimum confidence

Association Rule Mining: A Road Map
Boolean vs. quantitative associations (based on the types of values handled)
- buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
- age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
Single-dimensional vs. multi-dimensional associations (see the examples above)
Single-level vs. multiple-level analysis
- What brands of beer are associated with what brands of diapers?
Various extensions
- Correlation, causality analysis: association does not necessarily imply correlation or causality
- Enforced constraints, e.g., do small sales (sum < 100) trigger big buys (sum > 1,000)?

Problem Decomposition
1. Find all sets of items that have minimum support (the frequent itemsets)
2. Use the frequent itemsets to generate the desired rules

Problem Decomposition – Example
For min support = 50% (2 transactions) and min confidence = 50%:
For the rule Shoes ⇒ Jacket:
- Support = sup({Shoes, Jacket}) = 50%
- Confidence = sup({Shoes, Jacket}) / sup({Shoes}) = 66.6%
Jacket ⇒ Shoes has 50% support and 100% confidence

Discovering Rules
Naïve algorithm:
for each frequent itemset l do
    for each proper nonempty subset c of l do
        if (support(l) / support(l - c) >= minconf) then
            output the rule (l - c) ⇒ c,
                with confidence = support(l) / support(l - c)
                and support = support(l)
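The loop above translates directly into Python. This is a hedged sketch: `freq` is assumed to be a precomputed map from frequent itemsets to their supports, which is safe to index because every subset of a frequent itemset is itself frequent.

from itertools import combinations

def naive_rules(freq, minconf):
    """freq: dict mapping frozenset -> support.
    Returns (antecedent, consequent, confidence, support) tuples."""
    rules = []
    for l, sup_l in freq.items():
        # every proper nonempty subset c of l is a candidate consequent
        for r in range(1, len(l)):
            for c in map(frozenset, combinations(l, r)):
                conf = sup_l / freq[l - c]   # support(l) / support(l - c)
                if conf >= minconf:
                    rules.append((l - c, c, conf, sup_l))
    return rules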

Discovering Rules (2)
Lemma. If consequent c generates a valid rule, so do all nonempty subsets of c (e.g., if X ⇒ YZ holds, then XY ⇒ Z and XZ ⇒ Y hold).
Example: consider a frequent itemset ABCDE. If ACDE ⇒ B and ABCE ⇒ D are the only one-consequent rules with minimum confidence, then ACE ⇒ BD is the only other rule that needs to be tested.

Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items that have minimum support
- Every subset of a frequent itemset must also be a frequent itemset, i.e., if {A, B} is frequent, both {A} and {B} must be frequent
- Iteratively find the frequent itemsets of cardinality 1 up to k (k-itemsets)
Then use the frequent itemsets to generate association rules.

The Apriori Algorithm
Lk: set of frequent itemsets of size k (those with min support)
Ck: set of candidate itemsets of size k (potentially frequent itemsets)

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
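A compact Python rendering of this loop, under the assumption that support is an absolute count, transactions are item-sets, and `apriori_gen` (the candidate-generation step detailed after the next two slides) is available:

from collections import defaultdict

def apriori(transactions, min_support):
    """transactions: list of item-sets; min_support: absolute count.
    Returns a dict mapping every frequent itemset (frozenset) to its count."""
    counts = defaultdict(int)
    for t in transactions:                   # count 1-itemsets
        for item in t:
            counts[frozenset([item])] += 1
    L = {i: c for i, c in counts.items() if c >= min_support}   # L1
    frequent, k = dict(L), 1
    while L:
        C = apriori_gen(set(L), k + 1)       # candidate (k+1)-itemsets
        counts = defaultdict(int)
        for t in transactions:               # one full database scan per level
            for cand in C:                   # linear scan; the hash-tree below avoids this
                if cand <= t:
                    counts[cand] += 1
        L = {i: c for i, c in counts.items() if c >= min_support}
        frequent.update(L)
        k += 1
    return frequent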

The Apriori Algorithm — Example
Min support = 50% (2 transactions).
(Figure: the database D is scanned three times; each scan counts the candidates in Ck and filters them into the frequent itemsets Lk, for k = 1, 2, 3.)

How to Generate Candidates?
Suppose the items in Lk-1 are listed in order.
Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
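The same two steps as a Python sketch, with itemsets as frozensets and items assumed mutually sortable; running it on the L3 of the next slide reproduces C4 = {abcd}.

from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate the size-k candidates Ck from the frequent (k-1)-itemsets."""
    # Step 1: self-join -- merge two sorted (k-1)-itemsets that agree on
    # their first k-2 items
    items = sorted(map(sorted, L_prev))
    joined = {frozenset(p) | frozenset(q)
              for p, q in combinations(items, 2) if p[:-1] == q[:-1]}
    # Step 2: prune -- keep a candidate only if all its (k-1)-subsets are frequent
    return {c for c in joined
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
print(apriori_gen(L3, 4))   # {frozenset({'a', 'b', 'c', 'd'})}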

Example of Generating Candidates
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
- abcd from abc and abd
- acde from acd and ace
Pruning:
- acde is removed because ade is not in L3
C4 = {abcd}

How to Count Supports of Candidates?
Why is counting the supports of candidates a problem?
- The total number of candidates can be huge
- One transaction may contain many candidates
Method:
- Candidate itemsets are stored in a hash-tree
- A leaf node of the hash-tree contains a list of itemsets and counts
- An interior node contains a hash table
- Subset function: finds all the candidates contained in a transaction

Hash-tree: Search
Given a transaction T and a candidate set Ck, find all members of Ck contained in T.
- Assume an ordering on the items
- Start from the root; use every item in T to go to the next node
- If you are at an interior node and you just used item i, then use each item that comes after i in T
- If you are at a leaf node, check the itemsets
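A simplified hash-tree sketch covering both the structure from the previous slide and this search procedure. The bucket count, leaf capacity, and all names are assumptions made for illustration; candidates and transactions are sorted tuples of items.

BUCKETS, MAX_LEAF = 7, 3   # arbitrary illustrative choices

class Node:
    def __init__(self):
        self.children = None   # dict: bucket -> Node at interior nodes
        self.itemsets = []     # at leaves: [candidate, count] pairs

def insert(node, cand, depth=0):
    """Route a candidate to a leaf, hashing on one item per level."""
    if node.children is not None:
        child = node.children.setdefault(hash(cand[depth]) % BUCKETS, Node())
        insert(child, cand, depth + 1)
        return
    node.itemsets.append([cand, 0])
    if len(node.itemsets) > MAX_LEAF and depth < len(cand):
        old, node.itemsets, node.children = node.itemsets, [], {}
        for c, _ in old:               # split the leaf: re-route its itemsets
            insert(node, c, depth)

def count(node, t, start=0):
    """Increment the counts of all candidates contained in transaction t:
    at an interior node, branch on every item of t that comes after the item
    just used; at a leaf, check the stored itemsets directly."""
    if node.children is None:
        for entry in node.itemsets:
            if set(entry[0]) <= set(t):
                entry[1] += 1
        return
    for i in range(start, len(t)):
        child = node.children.get(hash(t[i]) % BUCKETS)
        if child is not None:
            count(child, t, i + 1)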

Methods to Improve Apriori's Efficiency
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans (see the one-line sketch below)
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling: mine a subset of the given data with a lower support threshold, plus a method to determine completeness
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
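Transaction reduction, for instance, is a one-liner once Lk is at hand. A sketch (the function name is assumed):

def reduce_transactions(transactions, Lk):
    """Drop transactions that contain no frequent k-itemset: any (k+1)-itemset
    such a transaction contains would have an infrequent k-subset, so the
    transaction cannot contribute to any later level."""
    return [t for t in transactions if any(c <= t for c in Lk)]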

Is Apriori Fast Enough? — Performance Bottlenecks
The core of the Apriori algorithm:
- Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
- Use database scans and pattern matching to collect counts for the candidate itemsets
The bottleneck of Apriori: candidate generation
- Huge candidate sets: 10^4 frequent 1-itemsets will generate on the order of 10^7 candidate 2-itemsets
- To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
- Multiple scans of the database: needs (n + 1) scans, where n is the length of the longest pattern

Max-Miner
Max-Miner finds long patterns efficiently: the maximal frequent patterns
- Instead of checking all subsets of a long pattern, it tries to detect long patterns early
- Scales linearly with the size of the patterns

Max-Miner: the Idea
(Figure: set-enumeration tree of the ordered set {1, 2, 3, 4} — root ∅; level 1: 1, 2, 3, 4; level 2: 1,2 1,3 1,4 2,3 2,4 3,4; level 3: 1,2,3 1,2,4 1,3,4 2,3,4; level 4: 1,2,3,4.)
Pruning: (1) set infrequency, (2) superset frequency
Each node is a candidate group g:
- h(g) is the head: the itemset of the node
- t(g) is the tail: an ordered set that contains all items that can appear in the subnodes
Example: h({1}) = {1} and t({1}) = {2, 3, 4}

Max-Miner Pruning
When we count the support of a candidate group g, we also compute the support of h(g), h(g) ∪ t(g), and h(g) ∪ {i} for each i in t(g)
- If h(g) ∪ t(g) is frequent, stop expanding the node g and report the union as a frequent itemset
- If h(g) ∪ {i} is infrequent, remove i from all subnodes (just remove i from the tail of any group after g)
- Expand the node g by one level and repeat

The Max-Miner Algorithm
set of candidate groups C ← {}
set of itemsets F ← {Gen-Initial-Groups(T, C)}
while C is not empty do
    scan T to count the support of all candidate groups in C
    for each g in C s.t. h(g) ∪ t(g) is frequent do
        F ← F ∪ {h(g) ∪ t(g)}
    set of candidate groups Cnew ← {}
    for each g in C s.t. h(g) ∪ t(g) is infrequent do
        F ← F ∪ {Gen-Sub-Nodes(g, Cnew)}
    C ← Cnew
    remove from F any itemset with a proper superset in F
    remove from C any group g s.t. h(g) ∪ t(g) has a superset in F
return F

The Max-Miner Algorithm (2)
Gen-Initial-Groups(T, C):
    scan T to obtain F1, the set of frequent 1-itemsets
    impose an ordering on the items in F1
    for each item i in F1 other than the greatest do
        let g be a new candidate with h(g) = {i} and t(g) = {j | j follows i in the ordering}
        C ← C ∪ {g}
    return the greatest 1-itemset in F1 (and C, of course)

Gen-Sub-Nodes(g, C):   /* generate the candidate groups of the next level */
    remove any item i from t(g) if h(g) ∪ {i} is infrequent
    reorder the items in t(g)
    for each i in t(g) other than the greatest do
        let g' be a new candidate with h(g') = h(g) ∪ {i} and t(g') = {j | j in t(g) and j follows i in t(g)}
        C ← C ∪ {g'}
    return h(g) ∪ {m}, where m is the greatest item in t(g), or h(g) if t(g) is empty
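Putting the last three slides together, here is a heavily simplified Python sketch. Support counting is naive (one pass per call) rather than the grouped counting Max-Miner actually uses, and a plain (head, tail) tuple stands in for a candidate group; treat it as an illustration of the control flow under those assumptions, not as Bayardo's implementation.

def sup(itemset, T):
    """Naive absolute support count of `itemset` in transaction list T."""
    return sum(itemset <= trans for trans in T)

def max_miner(T, minsup):
    """T: list of frozenset transactions; minsup: absolute count.
    Returns the maximal frequent itemsets."""
    items = {i for t in T for i in t}
    # Gen-Initial-Groups: order frequent items by increasing support, so the
    # most frequent items land in many tails (see the item-ordering slide)
    order = sorted((i for i in items if sup(frozenset([i]), T) >= minsup),
                   key=lambda i: sup(frozenset([i]), T))
    if not order:
        return set()
    C = [(frozenset([i]), order[k + 1:]) for k, i in enumerate(order[:-1])]
    F = [frozenset([order[-1]])]
    while C:
        C_new = []
        for h, tail in C:
            if sup(h | frozenset(tail), T) >= minsup:
                F.append(h | frozenset(tail))     # superset-frequency pruning
                continue
            # item-infrequency pruning: drop i if h ∪ {i} is infrequent
            tail = [i for i in tail if sup(h | {i}, T) >= minsup]
            for k, i in enumerate(tail[:-1]):     # Gen-Sub-Nodes
                C_new.append((h | {i}, tail[k + 1:]))
            F.append(h | frozenset(tail[-1:]))    # h ∪ {greatest tail item}, or h
        # keep only itemsets without a proper superset in F, and drop groups
        # whose h ∪ t(g) already lies inside a known frequent itemset
        F = [f for f in F if not any(f < g for g in F)]
        C = [(h, tail) for h, tail in C_new
             if not any(h | frozenset(tail) <= g for g in F)]
    return {f for f in F if not any(f < g for g in F)}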

Item Ordering
By re-ordering the items we try to increase the effectiveness of superset-frequency pruning
- Very frequent items have a higher probability of being contained in long patterns
- Put these items at the end of the ordering, so they appear in many tails