Association Rule Mining


Association Rule Mining COMP 790-90 Seminar BCB 713 Module Spring 2011

How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be very large
  - One transaction may contain many candidates
- Method: candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction
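The counting step can be sketched without the hash-tree machinery: a dictionary keyed by candidate itemset stands in for the leaf counts, and enumerating each transaction's k-subsets plays the role of the subset function. A minimal sketch (the transaction data is invented for illustration):

```python
from itertools import combinations

def count_supports(transactions, candidates, k):
    """Count the support of each k-candidate by enumerating the
    k-subsets of every transaction (the job the hash-tree's subset
    function does, here without the tree)."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for sub in combinations(sorted(t), k):
            s = frozenset(sub)
            if s in counts:
                counts[s] += 1
    return counts

transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
candidates = [frozenset(p) for p in [("a", "b"), ("a", "c"), ("b", "c")]]
print(count_supports(transactions, candidates, 2))
```

The hash-tree exists precisely because this per-transaction subset enumeration is expensive when candidates are many; the tree lets a transaction be matched against only the candidates it could possibly contain.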

Challenges of Frequent Pattern Mining
- Multiple scans of the transaction database
- Huge number of candidates
- Tedious workload of support counting for candidates
Improving Apriori: general ideas
- Reduce the number of transaction-database scans
- Shrink the number of candidates
- Facilitate support counting of candidates

DIC: Reduce Number of Scans
- Once both A and D are determined frequent, the counting of AD can begin
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD can begin
- [Figure: the itemset lattice over {A, B, C, D}; DIC begins counting 2- and 3-itemsets part-way through a scan, rather than one level per scan as in Apriori]
S. Brin, R. Motwani, J. Ullman, and S. Tsur, 1997

DHP: Reduce the Number of Candidates
- A hashing bucket count < min_sup ⇒ every candidate in the bucket is infrequent
  - Candidates: a, b, c, d, e
  - Hash entries: {ab, ad, ae}, {bd, be, de}, …
  - Large (frequent) 1-itemsets: a, b, d, e
  - The sum of the counts of {ab, ad, ae} < min_sup ⇒ ab should not be a candidate 2-itemset
J. Park, M. Chen, and P. Yu, 1995
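The bucket test can be sketched as follows. While scanning for frequent 1-itemsets, every 2-subset of each transaction is hashed into a bucket; a pair survives as a candidate only if its bucket total reaches min_sup. The hash function and the toy data here are hypothetical stand-ins:

```python
from itertools import combinations

def dhp_prefilter(transactions, min_sup, n_buckets=7):
    """Build DHP's bucket counts during the first scan and return a
    predicate that tells whether a pair may still be a candidate.
    The hash (sum of character codes mod n_buckets) is illustrative."""
    buckets = [0] * n_buckets
    def h(pair):
        return sum(ord(ch) for item in pair for ch in item) % n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[h(pair)] += 1
    return lambda pair: buckets[h(pair)] >= min_sup

transactions = [{"a", "b"}, {"a", "b"}, {"a", "c"}]
may_be_candidate = dhp_prefilter(transactions, min_sup=2)
print(may_be_candidate(("a", "b")), may_be_candidate(("a", "c")))
```

Note the test is one-sided: hash collisions can let an infrequent pair through (it is pruned later by exact counting), but a bucket total below min_sup is a proof of infrequency.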

Partition: Scan Database Only Twice
- Partition the database into n partitions
- Itemset X is frequent ⇒ X is frequent in at least one partition
- Scan 1: partition the database and find local frequent patterns
- Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe, 1995
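The two scans above can be sketched directly. For brevity this sketch only enumerates itemsets up to size 2 in each partition; the data and thresholds are invented for illustration:

```python
from itertools import combinations

def local_frequent(part, min_sup_ratio):
    """Itemsets (up to size 2, for brevity) meeting the local
    support threshold within one partition."""
    counts = {}
    for t in part:
        for k in (1, 2):
            for sub in combinations(sorted(t), k):
                counts[frozenset(sub)] = counts.get(frozenset(sub), 0) + 1
    thresh = min_sup_ratio * len(part)
    return {s for s, c in counts.items() if c >= thresh}

def partition_mine(db, n_parts, min_sup_ratio):
    size = (len(db) + n_parts - 1) // n_parts
    parts = [db[i:i + size] for i in range(0, len(db), size)]
    # Scan 1: candidates = union of the locally frequent itemsets
    candidates = set()
    for p in parts:
        candidates |= local_frequent(p, min_sup_ratio)
    # Scan 2: count every candidate against the whole database
    result = {}
    for c in candidates:
        sup = sum(1 for t in db if c <= set(t))
        if sup >= min_sup_ratio * len(db):
            result[c] = sup
    return result

db = [{"a", "b"}, {"a", "b"}, {"b", "c"}, {"a", "b"}]
print(partition_mine(db, 2, 0.5))
```

Correctness rests on the contrapositive of the bullet above: an itemset infrequent in every partition cannot be frequent globally, so the union of local results is a complete candidate set.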

Sampling for Frequent Patterns
- Select a sample of the original database; mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked
  - Example: check abcd instead of ab, ac, …
- Scan the database again to find missed frequent patterns
H. Toivonen, 1996

Bottleneck of Frequent-pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1i2…i100:
    - # of scans: 100
    - # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?
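The candidate count follows because every non-empty subset of a 100-item pattern is a candidate at some level, and the binomial coefficients sum to 2^100 − 1:

```python
# Candidates Apriori would generate on the way to the single
# 100-item pattern i1...i100: every non-empty subset appears as a
# candidate at some level, so the total is sum of C(100, k) for
# k = 1..100, i.e. 2^100 - 1.
n = 100
candidates = 2 ** n - 1
print(f"{candidates:.3e}")  # ≈ 1.268e+30
```

No amount of scan reduction helps with a number of that size, which is why the following slides drop candidate generation altogether.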

Set Enumeration Tree
- Subsets of I can be enumerated systematically
- I = {a, b, c, d}:
  {}
  a  b  c  d
  ab  ac  ad  bc  bd  cd
  abc  abd  acd  bcd
  abcd
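The systematic enumeration can be sketched as a small recursion: each node is extended only with items that come after its last item, so every subset is generated exactly once. This sketch walks the tree depth-first (the slide lists the nodes level by level, but the tree is the same):

```python
def set_enumeration_tree(items):
    """Enumerate all non-empty subsets of `items` in
    set-enumeration-tree order: a node for prefix P is extended only
    with items ordered after P's last item, so no subset repeats."""
    out = []
    def expand(prefix, start):
        for i in range(start, len(items)):
            node = prefix + [items[i]]
            out.append("".join(node))
            expand(node, i + 1)   # only later items extend this node
    expand([], 0)
    return out

print(set_enumeration_tree(list("abcd")))
```

For four items this yields the 15 non-empty subsets shown on the slide, each exactly once.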

Borders of Frequent Itemsets
- The border is connected: if X and Y are frequent and X is an ancestor of Y, then all patterns between X and Y are frequent
- [Figure: the subset lattice of {a, b, c, d}]

Projected Databases
- To find a child Xy of X, only the X-projected database is needed
  - The sub-database of transactions containing X
  - Test whether item y is frequent in the X-projected database
- [Figure: the subset lattice of {a, b, c, d}]
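A projected database is simple to build, and frequent extensions of X can be read off it directly. A minimal sketch with an invented toy database:

```python
def projected_db(db, X):
    """X-projected database: the transactions containing X, with X
    itself removed -- enough to test whether an extension item y is
    frequent under X."""
    X = set(X)
    return [set(t) - X for t in db if X <= set(t)]

def frequent_extensions(db, X, min_sup):
    """Items y such that Xy is frequent, found by counting items
    inside the X-projected database only."""
    counts = {}
    for t in projected_db(db, X):
        for y in t:
            counts[y] = counts.get(y, 0) + 1
    return {y for y, c in counts.items() if c >= min_sup}

db = [{"a", "b", "c"}, {"a", "c"}, {"a", "b"}, {"b", "c"}]
print(frequent_extensions(db, {"a"}, 2))
```

The point is locality: extending X never requires touching transactions that do not contain X.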

Tree-Projection Method
- Find frequent 2-itemsets
- For each frequent 2-itemset xy, form a projected database
  - The sub-database of transactions containing xy
- Recursive mining
  - If x′y′ is frequent in the xy-projected database, then xy ∪ x′y′ (i.e., xyx′y′) is a frequent pattern

Compress Database by FP-tree
- 1st scan: find frequent items
  - Only frequent items are recorded in the FP-tree
  - F-list: f-c-a-b-m-p
- 2nd scan: construct the tree
  - Order the frequent items in each transaction w.r.t. the f-list
  - Explore sharing among transactions

TID | Items bought              | (ordered) frequent items
100 | f, a, c, d, g, i, m, p    | f, c, a, m, p
200 | a, b, c, f, l, m, o       | f, c, a, b, m
300 | b, f, h, j, o             | f, b
400 | b, c, k, s, p             | c, b, p
500 | a, f, c, e, l, p, m, n    | f, c, a, m, p

[Figure: the resulting FP-tree with header table (f, c, a, b, m, p); main branch f:4-c:3-a:3-m:2-p:2, side branches b:1-m:1 under a:3 and b:1 under f:4, and a second branch c:1-b:1-p:1 from the root]
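The two-scan construction can be sketched compactly. One caveat: ties among equally frequent items are broken here by first occurrence, so the computed f-list may order the count-3 items (a, b, m, p) differently from the slide's f-c-a-b-m-p; the tree's compression behaves the same either way.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def build_fp_tree(db, min_sup):
    """Pass 1: find frequent items (the f-list).  Pass 2: insert each
    transaction's frequent items, ordered by descending frequency,
    sharing common prefixes in the tree."""
    freq = Counter(i for t in db for i in t)
    flist = [i for i, c in freq.most_common() if c >= min_sup]
    rank = {i: r for r, i in enumerate(flist)}
    root = Node(None, None)
    header = {i: [] for i in flist}   # side-links: all nodes per item
    for t in db:
        path = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for i in path:
            if i in node.children:
                node.children[i].count += 1
            else:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
    return root, header, flist

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header, flist = build_fp_tree(db, min_sup=3)
print(flist)
```

On the slide's five transactions, the prefix f-c-a is shared three times, which is exactly the compression the tree is built for.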

Benefits of FP-tree
- Completeness
  - Never breaks a long pattern of any transaction
  - Preserves complete information for frequent-pattern mining
  - No need to scan the database anymore
- Compactness
  - Reduces irrelevant info: infrequent items are gone
  - Items in frequency-descending order (f-list): the more frequently occurring, the more likely to be shared
  - Never larger than the original database (not counting node-links and the count fields)

Partition Frequent Patterns
- Frequent patterns can be partitioned into subsets according to the f-list f-c-a-b-m-p:
  - Patterns containing p
  - Patterns having m but no p
  - …
  - Patterns having c but none of a, b, m, or p
  - Pattern f
- The partitioning is complete and without any overlap

Find Patterns Having Item “p”
- Only transactions containing p are needed
- Form the p-projected database TDB|p
  - Start at entry p of the header table
  - Follow the side-links of frequent item p
  - Accumulate all transformed prefix paths of p
- TDB|p: fcam:2, cb:1
- Local frequent item: c:3
- Frequent patterns containing p: p:3, pc:3
[Figure: the FP-tree with p’s side-links and prefix paths highlighted]

Find Patterns Having Item m But No p
- Form the m-projected database TDB|m
  - Item p is excluded
  - TDB|m contains fca:2, fcab:1
- Local frequent items: f, c, a
- Build an FP-tree for TDB|m
[Figure: the m-projected FP-tree, a single branch f:3 - c:3 - a:3 with header table f, c, a]

Recursive Mining
- Patterns having m but no p can be mined recursively
- Optimization: enumerate patterns directly from a single-branch FP-tree
  - Enumerate all combinations of the branch items
  - Support = that of the last item
  - m, fm, cm, am
  - fcm, fam, cam
  - fcam
[Figure: the m-projected FP-tree, a single branch f:3 - c:3 - a:3]
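The single-branch optimization can be sketched directly: every combination of the branch items, appended to the current suffix, is a pattern, and its support is the count of the deepest item included.

```python
from itertools import combinations

def single_branch_patterns(branch, suffix):
    """Given a single-branch conditional FP-tree as a root-to-leaf
    list [(item, count), ...], emit every combination of its items
    joined with `suffix`; the support of a combination is the count
    of its deepest (last) item on the branch."""
    patterns = {}
    for k in range(1, len(branch) + 1):
        for combo in combinations(branch, k):
            itemset = frozenset(i for i, _ in combo) | suffix
            patterns[itemset] = combo[-1][1]   # count of deepest item
    return patterns

# m-projected FP-tree from the running example: f:3 - c:3 - a:3
pats = single_branch_patterns([("f", 3), ("c", 3), ("a", 3)], frozenset("m"))
print(sorted("".join(sorted(p)) for p in pats))
```

This yields exactly the seven patterns on the slide (fm, cm, am, fcm, fam, cam, fcam), each with support 3, without any recursion.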

Borders and Max-patterns
- Max-patterns: the border of the frequent patterns
  - Every subset of a max-pattern is frequent
  - Every proper superset of a max-pattern is infrequent
- [Figure: the subset lattice of {a, b, c, d}]

MaxMiner: Mining Max-patterns
- Min_sup = 2

Tid | Items
10  | A, B, C, D, E
20  | B, C, D, E
30  | A, C, D, F

- 1st scan: find frequent items: A, B, C, D, E
- 2nd scan: find the support of the potential max-patterns
  - AB, AC, AD, AE, ABCDE
  - BC, BD, BE, BCDE
  - CD, CE, CDE, DE
- Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan
R. Bayardo, 1998

Frequent Closed Patterns
- For a frequent itemset X, if there exists no item y ∉ X such that every transaction containing X also contains y, then X is a frequent closed pattern
- “acdf” is a frequent closed pattern
- Concise representation of frequent patterns
  - Reduces the number of patterns and rules
- Min_sup = 2

TID | Items
10  | a, c, d, e, f
20  | a, b, e
30  | c, e, f
40  | a, c, d, f
50  | …

N. Pasquier et al., ICDT’99
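The definition translates into a direct check: X is closed iff the items common to all transactions containing X are exactly X. A sketch using the four transactions visible on the slide (TID 50's items are cut off in the transcript, so it is omitted here):

```python
def is_closed(X, db, min_sup):
    """X is a frequent closed pattern iff it meets min_sup and no
    item outside X occurs in every transaction containing X."""
    X = set(X)
    covering = [set(t) for t in db if X <= set(t)]
    if len(covering) < min_sup:
        return False
    common = set.intersection(*covering)   # items shared by all
    return common == X                     # nothing can be added

db = [set("acdef"), set("abe"), set("cef"), set("acdf")]
print(is_closed("acdf", db, 2))
```

Here “acd” is not closed because every transaction containing it also contains f, so “acdf” subsumes it at the same support; this is the redundancy closed patterns eliminate.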

CLOSET: Mining Frequent Closed Patterns
- F-list: list of all frequent items in support-ascending order
  - F-list: d-a-f-e-c
- Divide the search space
  - Patterns having d
  - Patterns having a but no d, etc.
- Find frequent closed patterns recursively
  - Every transaction having d also has c, f, a ⇒ cfad is a frequent closed pattern
- Min_sup = 2

TID | Items
10  | a, c, d, e, f
20  | a, b, e
30  | c, e, f
40  | a, c, d, f
50  | …

J. Pei, J. Han, and R. Mao, 2000

Closed and Max-patterns
- Closed-pattern mining algorithms can be adapted to mine max-patterns
  - A max-pattern must be closed
- Depth-first search methods have advantages over breadth-first search ones

Mining Various Kinds of Rules or Regularities
- Multi-level and quantitative association rules; correlation and causality; ratio rules; sequential patterns; emerging patterns; temporal associations; partial periodicity
- Classification, clustering, iceberg cubes, etc.

Multiple-level Association Rules
- Items often form a hierarchy; items at a lower level are expected to have lower support
- The transaction database can be encoded based on dimensions and levels
- Explore shared multi-level mining
- Flexible support settings: uniform support vs. reduced support
  - Reduced support example: Level 1 min_sup = 5%, Level 2 min_sup = 3%
  - Milk [support = 10%]; 2% Milk [support = 6%]; Skim Milk [support = 4%]

Multi-dimensional Association Rules
- Single-dimensional rules: buys(X, “milk”) ⇒ buys(X, “bread”)
- MD rules: ≥ 2 dimensions or predicates
  - Inter-dimension assoc. rules (no repeated predicates): age(X, “19-25”) ∧ occupation(X, “student”) ⇒ buys(X, “coke”)
  - Hybrid-dimension assoc. rules (repeated predicates): age(X, “19-25”) ∧ buys(X, “popcorn”) ⇒ buys(X, “coke”)
- Categorical attributes: finite number of possible values, no order among values
- Quantitative attributes: numeric, implicit order among values

Quantitative/Weighted Association Rules
- Numeric attributes are dynamically discretized to maximize the confidence or compactness of the rules
- 2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
- Cluster “adjacent” association rules to form general rules using a 2-D grid
- Example: age(X, “33-34”) ∧ income(X, “30K-50K”) ⇒ buys(X, “high-resolution TV”)
- [Figure: a 2-D grid of age (32-38) vs. income (<20K through 70-80K) with the clustered rule region marked]

Mining Distance-based Association Rules
- Binning methods do not capture the semantics of interval data
- Distance-based partitioning considers:
  - Density/number of points in an interval
  - “Closeness” of points in an interval

Price | Equi-width (width 10) | Equi-depth (depth 2) | Distance-based
7     | [0,10]                | [7,20]               | [7,7]
20    | [11,20]               | [22,50]              | [20,22]
22    | [21,30]               | [51,53]              | [50,53]
50    | [31,40]               |                      |
51    | [41,50]               |                      |
53    | [51,60]               |                      |
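The contrast in the table can be reproduced with two small partitioners. Equi-depth fixes the number of points per bin; a simple gap-based scheme (a stand-in for the distance-based partitioning, with a hypothetical max_gap threshold) groups points that lie close together, so the intervals hug the data:

```python
def equi_depth(values, depth):
    """Partition sorted values into bins of `depth` points each."""
    return [(values[i], values[min(i + depth - 1, len(values) - 1)])
            for i in range(0, len(values), depth)]

def distance_based(values, max_gap):
    """Group sorted values so consecutive points within a group are
    at most max_gap apart; each group becomes a tight interval."""
    groups, cur = [], [values[0]]
    for v in values[1:]:
        if v - cur[-1] <= max_gap:
            cur.append(v)
        else:
            groups.append(cur)
            cur = [v]
    groups.append(cur)
    return [(g[0], g[-1]) for g in groups]

prices = [7, 20, 22, 50, 51, 53]
print(equi_depth(prices, 2))      # [(7, 20), (22, 50), (51, 53)]
print(distance_based(prices, 5))  # [(7, 7), (20, 22), (50, 53)]
```

Equi-depth forces 7 and 20 into one interval despite the 13-unit gap, while the distance-based grouping keeps only genuinely close prices together, matching the table's last column.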

Constraint-based Data Mining
- Find all the patterns in a database autonomously? The patterns could be too many, and not focused!
- Data mining should be interactive: the user directs what is to be mined
- Constraint-based mining
  - User flexibility: the user provides constraints on what to mine
  - System optimization: push constraints deep into the mining process for efficiency

Constraints in Data Mining
- Knowledge type constraint: classification, association, etc.
- Data constraint (using SQL-like queries): e.g., find product pairs sold together in stores in New York
- Dimension/level constraint: in relevance to region, price, brand, customer category
- Rule (or pattern) constraint: e.g., small sales (price < $10) triggers big sales (sum > $200)
- Interestingness constraint: strong rules (support and confidence)