What is Frequent Pattern Analysis?

Presentation transcript:

CIS4930 Introduction to Data Mining
Frequent Pattern Mining
Peixiang Zhao

What is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- Motivation: finding inherent regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, web log (click stream) analysis, and DNA sequence analysis

Why Frequent Patterns?
- Frequent patterns are an intrinsic and important property of datasets
- Foundation for many essential data mining tasks:
  - Association, correlation, and causality analysis
  - Sequential and structural (e.g., sub-graph) patterns
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Classification: discriminative frequent pattern analysis
  - Cluster analysis: frequent pattern-based clustering
- Broad applications

Association Rule Mining
- Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction
- Examples: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}
- Implication means co-occurrence, not causality!

Basic Concepts
- Itemset: a set of one or more items
- k-itemset: an itemset X = {x1, …, xk} containing k items
- (Absolute) support of X: the frequency, i.e., number of occurrences, of itemset X
- (Relative) support of X: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
- An itemset X is frequent if X's support is no less than a minsup threshold
[Figure: Venn diagram of customers who buy diapers, buy beer, or buy both]

Basic Concepts: Association Rules
- Association rule: an implication expression of the form X → Y, where X and Y are itemsets
- Goal: find all association rules X → Y with minimum support and confidence
  - Support, s: the fraction of transactions that contain both X and Y (the probability that a transaction contains X and Y)
  - Confidence, c: how often items in Y appear in transactions that contain X (the conditional probability that a transaction having X also contains Y)
- Example: let minsup = 50%, minconf = 50%
  - Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
  - Association rules (among many more): Beer → Diaper (60%, 100%), Diaper → Beer (60%, 75%)
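
To make the support and confidence definitions concrete, here is a minimal Python sketch. The five-transaction toy database is an assumption reconstructed to match the counts quoted above (Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3), and the function names are illustrative:

    # Toy transaction database (illustrative; chosen to reproduce the slide's counts)
    transactions = [
        {"Beer", "Nuts", "Diaper"},
        {"Beer", "Coffee", "Diaper"},
        {"Beer", "Diaper", "Eggs"},
        {"Nuts", "Eggs", "Milk"},
        {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
    ]

    def support(itemset, db):
        """Relative support: fraction of transactions containing the itemset."""
        itemset = set(itemset)
        return sum(itemset <= t for t in db) / len(db)

    def confidence(lhs, rhs, db):
        """Confidence of lhs -> rhs: support(lhs and rhs) / support(lhs)."""
        return support(set(lhs) | set(rhs), db) / support(lhs, db)

    print(support({"Beer", "Diaper"}, transactions))       # 0.6
    print(confidence({"Beer"}, {"Diaper"}, transactions))  # 1.0
    print(confidence({"Diaper"}, {"Beer"}, transactions))  # 0.75

Running it reproduces the slide's two rules: Beer → Diaper at (60%, 100%) and Diaper → Beer at (60%, 75%).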

Closed Patterns and Max-Patterns
- A long pattern contains a combinatorial number of sub-patterns: {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
- An itemset X is a maximal pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
- Closed patterns are a lossless compression of the frequent patterns, reducing the number of patterns and rules

Closed and Max-Patterns: An Example
- Exercise: DB = {⟨a1, …, a100⟩, ⟨a1, …, a50⟩}, min_sup = 1
- What is the set of closed itemsets? ⟨a1, …, a100⟩: 1 and ⟨a1, …, a50⟩: 2
- What is the set of maximal patterns? ⟨a1, …, a100⟩: 1
- What is the set of all patterns? Every nonempty subset of {a1, …, a100}: 2^100 − 1 of them, far too many to enumerate!

Computational Complexity
- How many itemsets could potentially be generated in the worst case?
  - The number of frequent itemsets is sensitive to the minsup threshold
  - When minsup is low, there exist an exponential number of frequent itemsets
  - The worst case: M^N, where M = # distinct items and N = max transaction length
- The worst-case complexity vs. the expected probability
  - Example: suppose Walmart carries 10^4 kinds of products
  - The chance of picking up one particular product: 10^-4
  - The chance of picking up a particular set of 10 products: ~10^-40
  - What is the chance that this particular set of 10 products is frequent, occurring 10^3 times in 10^9 transactions? (Expected occurrences ≈ 10^9 × 10^-40 = 10^-31: vanishingly small, so the worst case is far from typical.)

Association Rule Mining
- Given a set of transactions T, the goal of association rule mining is to find all rules having
  - support ≥ minsup threshold
  - confidence ≥ minconf threshold
- Brute-force approach:
  - List all possible association rules
  - Compute the support and confidence for each rule
  - Prune rules that fail the minsup and minconf thresholds
  - ⇒ Computationally prohibitive!

Mining Association Rules
- Observations:
  - All of the rules below are binary partitions of the same itemset: {Milk, Diaper, Beer}
  - Rules originating from the same itemset have identical support but can have different confidence
  - Thus, we may decouple the support and confidence requirements
- Example rules:
  - {Milk, Diaper} → {Beer} (s=0.4, c=0.67)
  - {Milk, Beer} → {Diaper} (s=0.4, c=1.0)
  - {Diaper, Beer} → {Milk} (s=0.4, c=0.67)
  - {Beer} → {Milk, Diaper} (s=0.4, c=0.67)
  - {Diaper} → {Milk, Beer} (s=0.4, c=0.5)
  - {Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Mining Association Rules
- Two-step approach:
  1. Frequent itemset mining: generate all itemsets whose support ≥ minsup
  2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
- Frequent itemset generation is still computationally expensive
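
The rule-generation step can be sketched in a few lines of Python. This is a naive illustration, not an optimized implementation; freq_itemsets is assumed to map every frequent itemset (as a frozenset) to its support, which by downward closure includes every possible left-hand side:

    from itertools import combinations

    def generate_rules(freq_itemsets, minconf):
        """Yield (lhs, rhs, confidence) for every binary partition of each
        frequent itemset whose confidence meets minconf."""
        for itemset, sup in freq_itemsets.items():
            if len(itemset) < 2:
                continue
            for r in range(1, len(itemset)):
                for lhs in map(frozenset, combinations(itemset, r)):
                    conf = sup / freq_itemsets[lhs]  # conditional probability
                    if conf >= minconf:
                        yield lhs, itemset - lhs, conf

Note that all rules from one itemset share its support, so only confidence needs to be checked here, exactly as the decoupling observation above suggests.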

Frequent Itemset Mining
- Given d items, there are 2^d possible candidate itemsets

Frequent Itemset Mining: Brute-Force Approach
- Each itemset in the lattice is a candidate frequent itemset
- Count the support of each candidate by scanning the database, matching each transaction against every candidate
- Time complexity: O(N · M · w), where N = # transactions, M = # candidates, w = max transaction width
- ⇒ Expensive, since M = 2^d

Computational Complexity of Association Rule Mining
- Given d unique items:
  - Total number of itemsets = 2^d
  - Total number of possible association rules: R = 3^d − 2^(d+1) + 1
  - If d = 6, R = 602 rules
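
A quick sanity check of the rule count in Python. The formula follows because each of the d items goes into the antecedent, the consequent, or neither (3^d assignments), minus the assignments with an empty antecedent or empty consequent (2^d each), plus one for the doubly-empty case counted twice:

    def num_rules(d):
        # R = 3^d - 2^(d+1) + 1
        return 3**d - 2**(d + 1) + 1

    print(num_rules(6))  # 602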

Frequent Itemset Mining Strategies
- Reduce the number of candidates (M)
  - Complete search: M = 2^d
  - Use pruning techniques to reduce M
- Reduce the number of transactions (N)
  - Reduce the size of N as the size of the itemset increases
  - Used by vertical-based mining algorithms
- Reduce the number of comparisons (N·M)
  - Use efficient data structures to store the candidates or transactions
  - No need to match every candidate against every transaction

Methods for Frequent Itemset Mining
- Traversal of the itemset lattice: general-to-specific vs. specific-to-general

Methods for Frequent Itemset Mining
- Traversal of the itemset lattice: breadth-first vs. depth-first

Methods for Frequent Itemset Mining
- Traversal of the itemset lattice: equivalence classes

Methods for Frequent Itemset Mining
- Representation of the database: horizontal vs. vertical data layout

Frequent Itemset Mining Methods
- Apriori: a candidate generation-and-test approach
- Improving the efficiency of Apriori
- FP-Growth: a frequent pattern-growth approach
- ECLAT: frequent pattern mining with the vertical data format

The Downward Closure Property
- The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}: every transaction containing {beer, diaper, nuts} also contains {beer, diaper}
- The Apriori principle holds due to the following property of the support measure: the support of an itemset never exceeds the support of its subsets
- This is known as the anti-monotone property of support

Illustrating the Apriori Principle
[Figure: itemset lattice in which one itemset is found to be infrequent and all of its supersets are pruned]

Apriori: A Candidate Generation & Test Approach
- Apriori pruning principle: if any itemset is infrequent, its supersets need not be generated or tested!
- Algorithm:
  1. Initially, scan the DB once to get the frequent 1-itemsets
  2. Repeat:
     2.1 Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
     2.2 Test the candidates against the DB
  3. Terminate when no frequent or candidate set can be generated

The Apriori Algorithm—An Example
[Figure: example run with min_sup = 2 on database TDB, showing the candidate sets C1, C2, C3 and frequent sets L1, L2, L3 produced across three database scans]

The Apriori Algorithm
- Ck: candidate itemsets of size k; Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
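
A compact, runnable Python version of the same loop, offered as an illustrative sketch rather than an optimized implementation. Here minsup is an absolute count, and generate_candidates implements the self-join-and-prune step described on the next slide:

    from collections import defaultdict
    from itertools import combinations

    def apriori(db, minsup):
        """db: list of transactions (sets). Returns {frozenset: support count}."""
        counts = defaultdict(int)
        for t in db:                                  # scan 1: frequent 1-itemsets
            for item in t:
                counts[frozenset([item])] += 1
        current = {x: c for x, c in counts.items() if c >= minsup}
        frequent = dict(current)
        k = 1
        while current:
            candidates = generate_candidates(set(current), k + 1)
            counts = defaultdict(int)
            for t in db:                              # one DB scan per level
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            current = {c: n for c, n in counts.items() if n >= minsup}
            frequent.update(current)
            k += 1
        return frequent

    def generate_candidates(prev_frequent, k):
        """Join pairs of frequent (k-1)-itemsets, then prune any candidate
        with an infrequent (k-1)-subset (the Apriori pruning step)."""
        candidates = set()
        for a in prev_frequent:
            for b in prev_frequent:
                union = a | b
                if len(union) == k and all(
                    frozenset(s) in prev_frequent
                    for s in combinations(union, k - 1)
                ):
                    candidates.add(union)
        return candidates

For example, apriori(transactions, 2) on the toy database shown earlier returns every itemset occurring in at least two transactions.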

Implementation: How to Generate Candidates
- Step 1: self-joining Lk
- Step 2: pruning
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3 * L3: abcd from abc and abd; acde from acd and ace
  - Pruning: acde is removed because ade is not in L3
  - C4 = {abcd}
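
Using the hypothetical generate_candidates helper from the Apriori sketch above, the slide's example can be checked directly:

    L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
    C4 = generate_candidates(L3, 4)
    print(C4)  # {frozenset({'a','b','c','d'})} -- acde is pruned since ade is not in L3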

Candidate Generation: SQL
- SQL implementation of candidate generation, assuming the items in Lk-1 are listed in sorted order
- Step 1: self-joining Lk-1

  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2
    and p.itemk-1 < q.itemk-1;

- Step 2: pruning

  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck

Frequency Counting with a Hash Tree
- Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
- You need:
  - A hash function, here sending items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 to three branches
  - A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets in a leaf exceeds the max leaf size, split the node)
[Figure: the resulting hash tree with the 15 candidates distributed across its leaves]
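
A minimal hash-tree construction sketch in Python, assuming the slide's hash function and a max leaf size of 3 (the slide does not fix a number); the class and function names are illustrative:

    MAX_LEAF = 3                  # assumed max leaf size

    def h(item):
        """The slide's hash function: 1,4,7 -> 0; 2,5,8 -> 1; 3,6,9 -> 2."""
        return (item - 1) % 3

    class Node:
        def __init__(self):
            self.children = {}    # bucket -> Node; non-empty only for interior nodes
            self.itemsets = []    # candidates stored here while the node is a leaf

    def insert(node, itemset, depth=0):
        if node.children:                              # interior node: hash and descend
            child = node.children.setdefault(h(itemset[depth]), Node())
            insert(child, itemset, depth + 1)
            return
        node.itemsets.append(itemset)
        if len(node.itemsets) > MAX_LEAF and depth < len(itemset):
            stored, node.itemsets = node.itemsets, []  # split: push itemsets one level down
            for s in stored:
                child = node.children.setdefault(h(s[depth]), Node())
                insert(child, s, depth + 1)

    candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
                  (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
    root = Node()
    for c in candidates:
        insert(root, c)

Support counting then walks each transaction's hashed 3-subsets down this tree, so every transaction is compared against only a few leaves rather than all 15 candidates, as the next slides illustrate.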

Subset Operation
- Given a transaction t, what are the possible subsets of size 3?
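
For the transaction used on the next slide, t = {1, 2, 3, 5, 6}, the 3-subsets can be enumerated directly; Python's itertools produces them in the same lexicographic order the hash-tree descent uses:

    from itertools import combinations

    t = (1, 2, 3, 5, 6)
    print(list(combinations(t, 3)))
    # [(1,2,3), (1,2,5), (1,2,6), (1,3,5), (1,3,6), (1,5,6),
    #  (2,3,5), (2,3,6), (2,5,6), (3,5,6)]  -> C(5,3) = 10 subsets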

Subset Operation Using the Hash Tree
[Figure: matching transaction {1 2 3 5 6} against the hash tree with hash function 1,4,7 / 2,5,8 / 3,6,9; at each level the remaining items are hashed to select branches, and the candidate itemsets in the reached leaves are compared against the transaction's 3-subsets]

Factors Affecting Complexity
- Choice of minimum support threshold
  - Lowering the support threshold results in more frequent itemsets
  - This may increase the number of candidates and the max length of frequent itemsets
- Dimensionality (number of items) of the data set
  - More space is needed to store the support count of each item
  - If the number of frequent items also increases, both computation and I/O costs may increase
- Size of the database
  - Since Apriori makes multiple passes, run time may increase with the number of transactions
- Average transaction width
  - Transaction width increases with denser data sets
  - This may increase the max length of frequent itemsets, and the number of subsets in a transaction increases with its width

Maximal Frequent Itemset
- An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: itemset lattice with the maximal itemsets marked along the border between frequent and infrequent itemsets]

Closed Itemset
- An itemset is closed if none of its immediate supersets has the same support as the itemset
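
Given the complete set of frequent itemsets and their supports (e.g., as returned by the apriori sketch earlier), closed and maximal frequent itemsets can be identified with a naive quadratic scan; a minimal illustrative sketch:

    def closed_and_maximal(frequent):
        """frequent: {frozenset: support} over ALL frequent itemsets.
        Returns (closed, maximal) as sets of frozensets."""
        closed, maximal = set(), set()
        for x, sup in frequent.items():
            supersets = [y for y in frequent if x < y]   # frequent proper supersets
            if all(frequent[y] != sup for y in supersets):
                closed.add(x)        # no frequent superset has the same support
            if not supersets:
                maximal.add(x)       # no frequent superset at all
        return closed, maximal

Checking proper supersets rather than only immediate ones gives the same answer here, since support is anti-monotone along any chain of supersets.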

Maximal vs. Closed Itemsets
[Figure: itemset lattice annotated with the transaction IDs supporting each itemset; itemsets not supported by any transaction are marked]

Maximal vs Closed Frequent Itemsets
[Figure: with minimum support = 2, the lattice contains 9 closed itemsets, of which 4 are also maximal; closed-but-not-maximal and closed-and-maximal itemsets are marked]

Maximal vs Closed Itemsets

Further Improvement of Apriori
- Major computational challenges:
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori: general ideas
  - Reduce the number of transaction-database scans
  - Shrink the number of candidates
  - Facilitate the support counting of candidates

Partition: Scan the Database Only Twice
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
  - Partition the database: DB = DB1 ∪ DB2 ∪ … ∪ DBk
  - Contrapositive: if sup_j(i) < σ·|DBj| in every partition j, then sup(i) < σ·|DB|
- Scan 1: partition the database and find the local frequent patterns in each partition
- Scan 2: consolidate the global frequent patterns
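
A two-scan partition sketch in Python, reusing the hypothetical apriori function from the earlier sketch as the local miner; the strided split into k partitions is an assumption made for brevity (the real algorithm typically uses contiguous chunks sized to fit in memory):

    from math import ceil

    def partition_mine(db, minsup_rel, k=4):
        """db: list of transaction sets; minsup_rel: relative threshold sigma."""
        parts = [db[i::k] for i in range(k)]
        # Scan 1: locally frequent itemsets form the global candidate pool
        candidates = set()
        for part in parts:
            candidates |= set(apriori(part, ceil(minsup_rel * len(part))))
        # Scan 2: count every surviving candidate once over the full database
        counts = {c: sum(c <= t for t in db) for c in candidates}
        return {c: n for c, n in counts.items() if n >= minsup_rel * len(db)}

By the property above, no globally frequent itemset can be missed: it must be locally frequent in at least one partition and therefore lands in the candidate pool.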

FP-Growth: A Frequent Pattern-Growth Approach
- Bottlenecks of the Apriori approach:
  - Breadth-first (i.e., level-wise) search
  - Candidate generation and test: often generates a huge number of candidates
- The FP-Growth approach:
  - Depth-first search
  - Avoids explicit candidate generation
- Major philosophy: grow long patterns from short ones using only local frequent items
  - "abc" is a frequent pattern; get all transactions containing "abc", i.e., project the DB on abc: DB|abc
  - If "d" is a local frequent item in DB|abc, then abcd is a frequent pattern

General Idea
- The FP-growth method indexes the database for fast support computation via an augmented prefix tree: the frequent pattern tree (FP-tree)
- Each node in the tree is labeled with a single item and stores the support information for the itemset comprising the items on the path from the root to that node
- FP-tree construction:
  - Initially the tree contains only the root, labeled with the null item ∅
  - For each transaction X ∈ D, insert X into the FP-tree, incrementing the count of all nodes along the path that represents X
  - If X shares a prefix with a previously inserted transaction, X follows the same path up to the end of the common prefix; for the remaining items in X, new nodes are created under the common prefix, with counts initialized to 1
  - The FP-tree is complete when all transactions have been inserted
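
A minimal FP-tree construction sketch in Python following the two passes described here and on the next slide; FPNode, build_fptree, and the header-table layout are illustrative names, not a reference implementation:

    from collections import defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent, self.count = item, parent, 0
            self.children = {}

    def build_fptree(db, minsup):
        """db: list of transaction sets; minsup: absolute count.
        Returns (root, header) where header maps item -> list of its nodes."""
        # Pass 1: count items and build the f-list (descending support)
        freq = defaultdict(int)
        for t in db:
            for i in t:
                freq[i] += 1
        flist = sorted((i for i in freq if freq[i] >= minsup),
                       key=lambda i: -freq[i])
        rank = {i: r for r, i in enumerate(flist)}
        # Pass 2: insert each transaction, filtered and ordered by the f-list
        root = FPNode(None, None)
        header = defaultdict(list)
        for t in db:
            node = root
            for i in sorted((i for i in t if i in rank), key=rank.get):
                if i not in node.children:
                    node.children[i] = FPNode(i, node)
                    header[i].append(node.children[i])
                node = node.children[i]
                node.count += 1          # shared prefixes share counts
        return root, header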

FP-Tree: Preprocessing Procedure
- Scan the DB once to find the frequent 1-itemsets (single-item patterns)
- Sort the frequent items in descending order of frequency: the f-list (e.g., b-e-a-c-d)
- Scan the DB again to construct the FP-tree
- This way, the frequent patterns can be partitioned into subsets according to the f-list:
  - patterns containing d
  - patterns containing c but not d
  - …
  - patterns containing b but none of e, a, c, d
- This partitioning is complete and non-redundant

Frequent Pattern Tree
- The FP-tree is a prefix-compressed representation of D in which the frequent items are sorted in descending order of support

FP-Growth Algorithm
- Given an FP-tree R, projected FP-trees are built for each frequent item i in R, in increasing order of support, in a recursive manner
- To project R on item i, find all occurrences of i in the tree, and for each occurrence determine the corresponding path from the root to i
- The count of item i on a given path is recorded as cnt(i), and the path is inserted into the new projected tree
- While inserting the path, the count of each node along it is incremented by the path count cnt(i)
- The base case of the recursion is reached when the input FP-tree R is a single path, whose patterns can be enumerated directly (basic enumeration)
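
The projection-and-recursion idea can be sketched by rebuilding a conditional FP-tree from each item's prefix paths, reusing the hypothetical build_fptree helper above. Replicating each path by its count is memory-naive but keeps the sketch short and (for small data) correct:

    def fpgrowth(db, minsup, suffix=frozenset()):
        """Recursive pattern-growth sketch. Returns {frozenset: support}."""
        root, header = build_fptree(db, minsup)
        patterns = {}
        # process items in increasing order of support (reverse of the f-list)
        for item in sorted(header, key=lambda i: sum(n.count for n in header[i])):
            support = sum(n.count for n in header[item])
            pattern = suffix | {item}
            patterns[pattern] = support
            # conditional pattern base: the prefix path of every occurrence of
            # item, replicated by that occurrence's count
            cond_db = []
            for node in header[item]:
                path, p = [], node.parent
                while p.item is not None:
                    path.append(p.item)
                    p = p.parent
                cond_db.extend([set(path)] * node.count)
            if cond_db:
                patterns.update(fpgrowth(cond_db, minsup, pattern))
        return patterns

Each prefix path contains only items ranked above the current item in the f-list, so every pattern is generated exactly once, mirroring the complete and non-redundant partitioning described earlier.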

Projected Frequent Pattern Tree for D

Frequent Pattern Tree Projection

FP-Growth Algorithm

Benefits of the FP-Tree
- Completeness
  - Preserves complete information for frequent pattern mining
  - Never breaks a long pattern of any transaction
- Compactness
  - Reduces irrelevant information: infrequent items are gone
  - Items appear in frequency-descending order: the more frequently an item occurs, the more likely its nodes are shared
  - The tree is never larger than the original database (not counting node-links and the count fields)

Advantages of the Pattern-Growth Approach
- Divide-and-conquer:
  - Decompose both the mining task and the DB according to the frequent patterns obtained so far
  - Leads to focused searches of smaller databases
- Other factors:
  - No candidate generation, no candidate test
  - Compressed database: the FP-tree structure
  - No repeated scans of the entire database
  - Basic operations: counting local frequent items and building sub-FP-trees; no pattern search and matching

ECLAT: Mining by Exploring the Vertical Data Format
- For each item, store a list of the transaction ids (tids) in which it appears

ECLAT
- Determine the support of any k-itemset by intersecting the tidlists of two of its (k−1)-subsets
- Eclat intersects tidsets only if the frequent itemsets share a common prefix
- It traverses the prefix search tree in a DFS-like manner, processing a group of itemsets that share the same prefix, also called a prefix equivalence class
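
A compact Eclat sketch in Python: build the vertical tidset representation in one pass, then DFS over prefix equivalence classes, computing each support as a tidset intersection. The function names and the absolute-minsup convention are assumptions of this sketch:

    from collections import defaultdict

    def eclat(db, minsup):
        """db: list of transaction sets; minsup: absolute support count."""
        vertical = defaultdict(set)            # item -> tidset
        for tid, t in enumerate(db):
            for item in t:
                vertical[item].add(tid)
        frequent = {}

        def dfs(prefix, klass):
            # klass: list of (item, tidset) pairs sharing the common prefix
            for i, (item, tids) in enumerate(klass):
                itemset = prefix | {item}
                frequent[frozenset(itemset)] = len(tids)
                # extend the equivalence class: intersect with later siblings
                ext = [(other, tids & otids) for other, otids in klass[i + 1:]]
                ext = [(o, ot) for o, ot in ext if len(ot) >= minsup]
                if ext:
                    dfs(itemset, ext)

        start = [(i, s) for i, s in sorted(vertical.items()) if len(s) >= minsup]
        dfs(set(), start)
        return frequent

Because a tidset's size is exactly the itemset's support, no separate counting scan is needed once the vertical layout is built.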

Eclat Algorithm

ECLAT: Tidlist Intersections