Data Mining: Association Rule Mining
CSE880: Database Systems

Data, data everywhere
- Walmart records ~20 million transactions per day
- Google has indexed ~3 billion Web pages
- Yahoo collects ~10 GB of Web traffic data per hour
  - The U of M Computer Science department collected ~80 MB of Web traffic on April 28, 2003
- NASA Earth Observing System (EOS) satellites produce over 1 PB of Earth Science data per year
- NYSE trading volume: 1,443,627,000 (Oct 31, 2003)
- Scientific simulations can produce terabytes of data per hour

Data Mining: Motivation
- There is often information "hidden" in the data that is not readily evident
- Human analysts may take weeks to discover useful information
- Much of the data is never analyzed at all

[Figure: "The Data Gap" -- total new disk capacity (TB) shipped since 1995 versus the number of analysts. From R. Grossman, C. Kamath, V. Kumar, "Data Mining for Scientific and Engineering Applications"]

What is Data Mining?
- Many definitions
  - Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  - Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

[Diagram: the knowledge discovery process, turning data into information]

Data Mining Tasks
- Classification
- Clustering
- Association Rule Mining
- Anomaly Detection

Association Rule Mining
- Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Example of association rules over market-basket transactions:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication here means co-occurrence, not causality!

Terminology: Frequent Itemset
- Market basket: the set of items bought in a single customer transaction
- Itemset: X = {x1, ..., xk}
- Support: the support of itemset X is the fraction of transactions in the database that contain all the items in X

Example (let minsup = 50%):

  Transaction-id   Items bought
  1                A, B, C
  2                A, C
  3                A, D
  4                B, E, F

  Frequent itemset   Support
  {A}                75%
  {B}                50%
  {C}                50%
  {A, C}             50%

Itemsets that meet the minsup threshold are called frequent itemsets; itemsets such as {A, D} and {D} do not meet the threshold.

Terminology: Association Rule
- Given a pair of itemsets X and Y
- Association rule X → Y: if X is purchased in a transaction, it is very likely that Y will also be purchased in that transaction
  - Support of X → Y: the fraction of transactions that contain both X and Y, i.e. the support of the itemset X ∪ Y
  - Confidence of X → Y: the conditional probability that a transaction containing X also contains Y, i.e. support(X ∪ Y) / support(X)

Goal of association rule mining: discover all rules having
  1. support ≥ minsup, and
  2. confidence ≥ minconf
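As a concrete illustration of these two measures, here is a minimal Python sketch (not from the slides; the `transactions` list mirrors the standard market-basket example, and the helper names are made up):

    # Hypothetical market-basket data for illustration only.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support(itemset, transactions):
        """Fraction of transactions that contain every item in `itemset`."""
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs, transactions):
        """Estimate of P(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
        return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

    print(support({"Milk", "Diaper", "Beer"}, transactions))        # 0.4
    print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))   # ~0.67

The two printed values match the support and confidence of the rule {Milk, Diaper} → {Beer} shown a few slides later.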

An Illustrative Example
[Worked example figure omitted from the transcript]

How to Generate Association Rules?
- How many possible association rules are there?
- Naive way:
  - Enumerate all possible rules
  - Compute support and confidence for each rule
  - Exponentially expensive!
- Need to decouple the minimum support and minimum confidence requirements

How to Generate Association Rules?
Example of rules (from the market-basket data):
  {Milk, Diaper} → {Beer}    (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}    (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}    (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}    (s=0.4, c=0.5)

Observations:
- All of the rules above have identical support. Why? They are all partitions of the same itemset {Milk, Diaper, Beer}, so each rule's support equals the support of that itemset.
- Approach: divide the rule generation process into 2 steps
  1. Generate the frequent itemsets first
  2. Generate association rules from the frequent itemsets

Generating Frequent Itemsets
Given d items, there are 2^d possible itemsets (the itemset lattice).

Generating Frequent Itemsets
- Naive approach:
  - Each itemset in the lattice is a candidate frequent itemset
  - Count the support of each candidate by scanning the database
  - Complexity ~ O(NMw), where N is the number of transactions, M is the number of candidates, and w is the maximum transaction width
  - Expensive, since M = 2^d!
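A brute-force sketch of this naive approach (illustrative only; it assumes the hypothetical `transactions` list of item sets from the earlier example) makes the cost visible: it enumerates all 2^d itemsets over the d distinct items and scans the database once per candidate.

    from itertools import combinations

    def naive_frequent_itemsets(transactions, minsup):
        """Enumerate every possible itemset and count it against every transaction."""
        items = sorted(set().union(*transactions))       # d distinct items
        n = len(transactions)
        frequent = {}
        for k in range(1, len(items) + 1):               # M = 2^d candidates in total
            for cand in combinations(items, k):
                count = sum(set(cand) <= t for t in transactions)   # one pass per candidate
                if count / n >= minsup:
                    frequent[cand] = count / n
        return frequent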

Approaches for Mining Frequent Itemsets
- Reduce the number of candidates (M)
  - Complete search: M = 2^d
- Reduce the number of transactions (N)
  - Reduce the size of N as the size of the itemset increases
- Reduce the number of comparisons (NM)
  - Use efficient data structures to store the candidates and transactions
  - No need to match every candidate against every transaction

Apriori Principle
- Any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - Every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
- Support has the anti-monotone property
  - If A is a subset of B, then support(A) ≥ support(B)
- Apriori pruning principle: if an itemset is infrequent, its supersets need not be generated!

Illustrating the Apriori Principle
[Lattice figure: once an itemset is found to be infrequent, all of its supersets are pruned]

Apriori Algorithm
Method:
- Let k = 1
- Find frequent itemsets of length 1
- Repeat until no new frequent itemsets are identified:
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  - Update the support counts of the candidates by scanning the DB
  - Eliminate candidates that are infrequent, leaving only those that are frequent
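The loop above could be sketched in Python as follows (an illustrative reimplementation, not the course's reference code; the function and variable names are made up, and `transactions` is again a list of item sets):

    from itertools import combinations

    def apriori(transactions, minsup):
        n = len(transactions)
        def is_frequent(itemset):
            return sum(itemset <= t for t in transactions) / n >= minsup

        # k = 1: frequent single items
        items = set().union(*transactions)
        Lk = {frozenset([i]) for i in items if is_frequent(frozenset([i]))}
        all_frequent = set(Lk)
        k = 1
        while Lk:
            # generate length-(k+1) candidates from length-k frequent itemsets
            candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            # prune candidates that have an infrequent k-subset (Apriori principle)
            candidates = {c for c in candidates
                          if all(frozenset(s) in Lk for s in combinations(c, k))}
            # scan the DB to count supports and keep only the frequent candidates
            Lk = {c for c in candidates if is_frequent(c)}
            all_frequent |= Lk
            k += 1
        return all_frequent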

Illustrating the Apriori Principle
- Items (1-itemsets), then pairs (2-itemsets), then triplets (3-itemsets), with minimum support = 3
- There is no need to generate candidate pairs involving Coke or Eggs, since they are infrequent as single items
- If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates
- With support-based pruning: 6 + 6 + 1 = 13 candidates

How to Generate Candidates?
- Given the set of frequent itemsets of length 3: {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {B,C,E}, {C,D,E}
- Attempt 1: generate candidate itemsets of length 4 by adding a frequent item to each frequent 3-itemset
  - Produces lots of unnecessary candidates. Suppose A, B, C, D, E, F, G, H are all frequent items:
    - {A,B,C} + E produces the candidate {A,B,C,E}
    - {A,B,C} + F produces the candidate {A,B,C,F}
    - {A,B,C} + G produces the candidate {A,B,C,G}
    - {A,B,C} + H produces the candidate {A,B,C,H}
  - These candidates are guaranteed to be infrequent. Why? Each contains a 3-subset (e.g. {A,B,E} or {A,B,F}) that is not in the list of frequent 3-itemsets, so by the Apriori principle the candidate cannot be frequent.

How to Generate Candidates?
- Given the set of frequent itemsets of length 3: {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {B,C,E}, {C,D,E}
- Attempt 2: generate candidate itemsets of length 4 by joining pairs of frequent 3-itemsets
  - Joining {A,B,C} with {A,B,D} produces the candidate {A,B,C,D}
  - Problem 1: duplicate/redundant candidates
    - Joining {A,B,C} with {B,C,D} produces the same candidate {A,B,C,D}
  - Problem 2: unnecessary candidates
    - Joining {A,B,C} with {B,C,E} produces the candidate {A,B,C,E}, which is guaranteed to be infrequent

How to Generate Candidates?
- Given the set of frequent itemsets of length 3, kept in lexicographic order: {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}, {B,C,E}, {C,D,E}
- Because items within each itemset are ordered, a permutation such as (A,C,B) never appears between (A,C,D) and (B,C,D); combining out-of-order items would only create candidates that are already covered or infrequent
- Join a pair of frequent k-itemsets only if their prefixes of length (k-1) are identical (this guarantees the joined subsets are frequent):
  - Join {A,B,C} and {A,B,D} to produce {A,B,C,D}
  - Do not join {A,B,C} and {B,C,D}, since they do not share the same prefix; this avoids generating duplicate candidates
  - Do not join {A,C,D} and {C,D,E}, since the result would contain a subset that is not frequent
- Pruning:
  - Joining {B,C,D} and {B,C,E} produces {B,C,D,E}
  - Prune {B,C,D,E} because one of its subsets, {B,D,E}, is not frequent

How to Generate Candidates?
- Let L_{k-1} be the set of all frequent itemsets of length k-1
- Assume the items within each itemset in L_{k-1} are sorted in lexicographic order
- Step 1: self-join L_{k-1}

    insert into C_k
    select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

- Step 2: pruning

    forall itemsets c in C_k do
      forall (k-1)-subsets s of c do
        if (s is not in L_{k-1}) then delete c from C_k
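The same join-and-prune step as a Python sketch (itemsets represented as lexicographically sorted tuples; the helper name is hypothetical):

    from itertools import combinations

    def generate_candidates(L_prev):
        """L_prev: set of frequent (k-1)-itemsets, each a sorted tuple of items."""
        L = sorted(L_prev)
        k_minus_1 = len(L[0]) if L else 0
        candidates = set()
        # Step 1: self-join itemsets that share the same (k-2)-item prefix
        for p, q in combinations(L, 2):
            if p[:-1] == q[:-1]:                      # identical prefixes
                candidates.add(tuple(sorted(set(p) | set(q))))
        # Step 2: prune candidates that have an infrequent (k-1)-subset
        return {c for c in candidates
                if all(s in L_prev for s in combinations(c, k_minus_1))}

On the six frequent 3-itemsets from the previous slides, this produces only {A,B,C,D}; {B,C,D,E} is formed by the join but removed in pruning because its subset {B,D,E} is not frequent.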

How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be huge
  - Each transaction may contain many candidates
- Method:
  - Store candidate itemsets in a hash tree
  - A leaf node of the hash tree contains a list of itemsets and their respective support counts
  - An interior node contains a hash table

Generate Hash Tree
Suppose you have 15 candidate itemsets of length 3:
  {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
- A hash function (here, items hash to the buckets 1,4,7 / 2,5,8 / 3,6,9)
- A maximum leaf size: the maximum number of itemsets stored in a leaf node (if the number of itemsets exceeds the max leaf size, split the node)

Example: Counting Supports of Candidates
[Figure: the hash tree from the previous slide (buckets 1,4,7 / 2,5,8 / 3,6,9), with a subset function used to match a transaction against the candidate itemsets stored in the leaves]
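The hash tree itself is more involved; the sketch below shows only the counting step it accelerates (a simplified version that enumerates the length-k subsets of each transaction and looks them up in a candidate table rather than traversing a hash tree):

    from itertools import combinations

    def count_supports(candidates, transactions, k):
        """candidates: iterable of k-itemsets (sorted tuples). Returns support counts."""
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for subset in combinations(sorted(t), k):   # all k-subsets of the transaction
                if subset in counts:
                    counts[subset] += 1
        return counts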

Rule Generation
- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L - f satisfies the minimum confidence requirement
- If {A,B,C,D} is a frequent itemset, the candidate rules are:
    ABC → D, ABD → C, ACD → B, BCD → A,
    A → BCD, B → ACD, C → ABD, D → ABC,
    AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
- If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L → ∅ and ∅ → L)
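As a concrete sketch of this enumeration, the following brute-force generator checks every non-empty proper subset of a frequent itemset as an antecedent; it reuses the hypothetical `support` helper defined earlier and is not the pruned algorithm described on the next slide.

    from itertools import combinations

    def rules_from_itemset(L, transactions, minconf):
        """Yield (antecedent, consequent, confidence) for every candidate rule from L."""
        L = frozenset(L)
        sup_L = support(L, transactions)
        for r in range(1, len(L)):                       # all non-empty proper subsets
            for lhs in combinations(L, r):
                lhs = frozenset(lhs)
                conf = sup_L / support(lhs, transactions)
                if conf >= minconf:
                    yield lhs, L - lhs, conf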

Rule Generation
- How can rules be generated efficiently from frequent itemsets?
  - In general, confidence does not have an anti-monotone property
  - But the confidence of rules generated from the same itemset does have an anti-monotone property
  - For the frequent itemset f = {A,B,C,D}:
      conf(ABC → D) ≥ conf(AB → CD) ≥ conf(A → BCD)
  - Confidence is non-increasing as the number of items in the rule consequent increases
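A sketch of how this property can be exploited (a simplified level-wise variant, again assuming the hypothetical `support` helper; larger consequents are grown only from consequents whose rules met the confidence threshold):

    from itertools import combinations

    def gen_rules(L, transactions, minconf):
        """Level-wise rule generation from frequent itemset L, pruning by confidence."""
        L = frozenset(L)
        sup_L = support(L, transactions)
        rules = []
        consequents = [frozenset([i]) for i in L]        # start with 1-item consequents
        k = 1
        while consequents:
            survivors = []
            for rhs in consequents:
                lhs = L - rhs
                if not lhs:
                    continue
                conf = sup_L / support(lhs, transactions)
                if conf >= minconf:
                    rules.append((lhs, rhs, conf))
                    survivors.append(rhs)
            # build larger consequents only from surviving smaller ones,
            # since moving items into the consequent can only lower confidence
            k += 1
            consequents = list({a | b for a, b in combinations(survivors, 2)
                                if len(a | b) == k})
        return rules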

Rule Generation for the Apriori Algorithm
[Figure: lattice of rules generated from a single frequent itemset]

Bottleneck of the Apriori Algorithm
- The core of the Apriori algorithm:
  - Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  - Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori is candidate generation
  - There can still be far too many candidates
  - For example, with 10,000 frequent items, the algorithm must generate (10,000 × 9,999)/2 ≈ 5 × 10^7 candidates of length 2
  - Many of them end up being infrequent

FP-Tree and FP-growth
- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  - Highly condensed, but complete for frequent pattern mining
  - Avoids costly repeated database scans
- Develop an efficient FP-tree-based frequent pattern mining algorithm (called FP-growth)
  - A divide-and-conquer methodology: decompose mining tasks into smaller ones
  - Generate "candidates" only when necessary

Construct FP-tree
(minsup = 3)

  TID   Items bought                  (Ordered) frequent items
  100   {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
  200   {a, b, c, f, l, m, o}         {f, c, a, b, m}
  300   {b, f, h, j, o, w}            {f, b}
  400   {b, c, k, s, p}               {c, b, p}
  500   {a, f, c, e, l, p, m, n}      {f, c, a, m, p}

Steps:
1. Scan the DB once to find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items in frequency-descending order to obtain the f-list: f-c-a-b-m-p
3. Scan the DB again and construct the FP-tree by inserting each transaction's ordered frequent items

Header table (item : frequency, each with a head pointer into the node-links):
  f:4, c:4, a:3, b:3, m:3, p:3

Resulting FP-tree (each node shown as item:count, children indented under their parent):
  {}
    f:4
      c:3
        a:3
          m:2
            p:2
          b:1
            m:1
      b:1
    c:1
      b:1
        p:1
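A compact sketch of this two-scan construction (an illustrative reimplementation, not Han, Pei, and Yin's code; the `Node` class and field names are invented for the example, and ties in item frequency may be ordered differently from the slide's f-list):

    from collections import Counter, defaultdict

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}               # item -> Node

    def build_fptree(transactions, minsup_count):
        # Scan 1: count single items, build the f-list in frequency-descending order
        freq = Counter(i for t in transactions for i in t)
        flist = [i for i, c in freq.most_common() if c >= minsup_count]
        rank = {item: r for r, item in enumerate(flist)}

        root = Node(None, None)
        header = defaultdict(list)           # item -> list of nodes (node-links)
        # Scan 2: insert each transaction's frequent items, sorted by f-list order
        for t in transactions:
            path = sorted((i for i in t if i in rank), key=rank.get)
            node = root
            for item in path:
                child = node.children.get(item)
                if child is None:
                    child = node.children[item] = Node(item, node)
                    header[item].append(child)
                child.count += 1
                node = child
        return root, header, flist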

Benefits of the FP-tree Structure
- Completeness
  - Preserves complete information for frequent pattern mining
- Compactness
  - Reduces irrelevant information: infrequent items are gone
  - Items are in frequency-descending order: the more frequently an item occurs, the more likely its nodes are to be shared
  - Never larger than the original database (if we exclude node-links and the count field)
  - For some dense data sets, the compression ratio can be more than a factor of 100

Partition Patterns and Databases
- Frequent patterns can be partitioned into subsets according to the f-list (the ordering makes the partitioning efficient)
  - f-list = f-c-a-b-m-p
  - Patterns containing p
  - Patterns containing m but not p
  - ...
  - Patterns containing c but none of a, b, m, p
  - The pattern f
- The partitioning is complete and has no redundancy

Deriving Frequent Itemsets from the FP-tree
- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item p
- Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base
- The conditional pattern bases below are then mined recursively, as shown next

Conditional pattern bases:
  item   conditional pattern base
  c      f:3
  a      fc:3
  b      fca:1, f:1, c:1
  m      fca:2, fcab:1
  p      fcam:2, cb:1
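Building on the hypothetical `build_fptree` sketch above, a conditional pattern base can be collected by following an item's node-links and climbing each node's parent pointers; on the tree from the construction slide this should return the prefix paths fcam:2 and cb:1 for p, matching the table.

    def conditional_pattern_base(item, header):
        """Prefix paths of `item`, each weighted by that node's count."""
        base = []
        for node in header[item]:
            path = []
            p = node.parent
            while p.item is not None:        # stop at the root
                path.append(p.item)
                p = p.parent
            if path:
                base.append((list(reversed(path)), node.count))
        return base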

Deriving Frequent Itemsets from the FP-tree
- Start at the bottom of the f-list, with p. p itself is a frequent pattern. Its pattern base is {fcam:2, cb:1}; the only frequent item in it is c:3, so the conditional FP-tree for p contains the single node c:3, and the recursive call generates the frequent pattern pc.
- m: m is a frequent pattern. Its pattern base is {fca:2, fcab:1}, with frequent items f:3, c:3, a:3. Build the m-conditional FP-tree; mine(⟨f:3, c:3, a:3⟩ | m) mines a, c, f in sequence.
  - Create the frequent pattern am:3 and call mine(⟨f:3, c:3⟩ | am)
  - Then create the frequent pattern cm:3 and call mine(⟨f:3⟩ | cm)
  - The recursive call mine(⟨f:3⟩ | cam) creates the longest pattern fcam:3
- Note: each conditional FP-tree triggers recursive calls from the items in its header table, and all of the patterns generated are frequent

Recursion: Mine Each Conditional FP-tree
- For each conditional pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base

Example for m:
  m-conditional pattern base: fca:2, fcab:1
  m-conditional FP-tree: the single path {} → f:3 → c:3 → a:3
  All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

Recursion: Mine Each Conditional FP-tree
  m-conditional FP-tree: {} → f:3 → c:3 → a:3
  Conditional pattern base of "am": (fc:3);   am-conditional FP-tree: {} → f:3 → c:3
  Conditional pattern base of "cm": (f:3);    cm-conditional FP-tree: {} → f:3
  Conditional pattern base of "cam": (f:3);   cam-conditional FP-tree: {} → f:3

Special Case: Single Prefix Path in FP-tree
- Suppose a (conditional) FP-tree T has a single shared prefix path P
- Mining can be decomposed into two parts:
  - Reduce the single prefix path into one node
  - Concatenate the mining results of the two parts

[Figure: a tree whose prefix path a1:n1 → a2:n2 → a3:n3 leads into branches b1:m1, c1:k1, c2:k2, c3:k3 is split into the single prefix path and the remaining multipath subtree]

Mining Frequent Patterns with FP-trees
- Idea: frequent pattern growth
  - Recursively grow frequent patterns by pattern-base and database partitioning
- Method:
  - For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Stop when the resulting FP-tree is empty, or when it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
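A sketch of this recursion, reusing the hypothetical `build_fptree` and `conditional_pattern_base` helpers above (simplified: it rebuilds a conditional tree for every suffix and does not implement the single-prefix-path shortcut). On the five example transactions with a minimum support count of 3, it should recover the same patterns derived by hand earlier, such as {f, c, a, m} with support 3.

    def fpgrowth(transactions, minsup_count, suffix=frozenset()):
        root, header, flist = build_fptree(transactions, minsup_count)
        patterns = {}
        for item in reversed(flist):                 # least frequent item first
            count = sum(n.count for n in header[item])
            new_suffix = suffix | {item}
            patterns[new_suffix] = count             # suffix ∪ {item} is frequent here
            # expand the conditional pattern base into a small "database" and recurse
            cond_db = []
            for path, cnt in conditional_pattern_base(item, header):
                cond_db.extend([set(path)] * cnt)
            if cond_db:
                patterns.update(fpgrowth(cond_db, minsup_count, new_suffix))
        return patterns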

Scaling FP-growth by Database Projection
- What if the FP-tree cannot fit in memory? Use database projection
- First partition the database into a set of projected databases
- Then construct and mine an FP-tree for each projected database

Partition-based Projection
- Parallel projection needs a lot of disk space
- Partition projection saves disk space

Example (from the earlier transaction DB):
  Transaction DB: fcamp, fcabm, fb, cbp, fcamp
  p-proj DB: fcam, cb, fcam
  m-proj DB: fcab, fca
  b-proj DB: f, cb, ...
  a-proj DB: fc, ...
  c-proj DB: f, ...
  f-proj DB: ...
  am-proj DB: fc
  cm-proj DB: f
  ...

References
- R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", VLDB 1994
- J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation", SIGMOD 2000