
0 CIS4930 Introduction to Data Mining
Frequent Pattern Mining Peixiang Zhao

1 What is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
Motivation: finding inherent regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

2 Why Frequent Patterns?
Frequent patterns are:
An intrinsic and important property of datasets
The foundation for many essential data mining tasks:
Association, correlation, and causality analysis
Sequential and structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: discriminative frequent pattern analysis
Cluster analysis: frequent pattern-based clustering
Broad applications

3 Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
Examples: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!

4 Basic Concepts
Itemset: a set of one or more items
k-itemset: X = {x1, …, xk}
(Absolute) support of X: the frequency or occurrence count of the itemset X
(Relative) support of X: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X's support is no less than a minsup threshold
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

5 Basic Concepts
Association rule: an implication expression of the form X → Y, where X and Y are itemsets
Goal: find all association rules X → Y with minimum support and confidence
Support, s: the fraction of transactions that contain both X and Y (the probability that a transaction contains X and Y)
Confidence, c: how often items in Y appear in transactions that contain X (the conditional probability that a transaction having X also contains Y)
Example: let minsup = 50%, minconf = 50%
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (among many more):
Beer → Diaper (support 60%, confidence 100%)
Diaper → Beer (support 60%, confidence 75%)
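
A minimal sketch in Python of how these two measures are computed. The slide's transaction table is not in the transcript, so the five transactions below are hypothetical, chosen only to reproduce the counts quoted above (Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3).

# Hypothetical transaction database (reconstructed to match the counts above)
DB = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    # relative support: fraction of transactions containing every item in itemset
    return sum(itemset <= t for t in DB) / len(DB)

def confidence(lhs, rhs):
    # conditional probability that a transaction containing lhs also contains rhs
    return support(lhs | rhs) / support(lhs)

print(support({"Beer", "Diaper"}))       # 0.6
print(confidence({"Beer"}, {"Diaper"}))  # 1.0
print(confidence({"Diaper"}, {"Beer"}))  # 0.75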

6 Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
Solution: mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
An itemset X is a maximal pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
A closed pattern is a lossless compression of the frequent patterns
Reduces the number of patterns and rules

7 Closed and Max-Patterns: An Example
Exercise: DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1
What is the set of closed itemsets? <a1, …, a100>: 1 and <a1, …, a50>: 2
What is the set of maximal patterns? <a1, …, a100>: 1
What is the set of all frequent patterns? All 2^100 − 1 nonempty subsets of {a1, …, a100}!
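
As a sanity check of the definitions, here is a brute-force sketch in Python on a toy database (the exercise's 100-item DB is far too large to enumerate); the database and minsup below are illustrative only.

from itertools import combinations

DB = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]   # toy database
minsup = 1
items = sorted(set().union(*DB))

def sup(X):
    # absolute support: number of transactions containing X
    return sum(X <= t for t in DB)

frequent = [set(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if sup(set(c)) >= minsup]

# closed: no proper superset with the same support; maximal: no frequent proper superset
closed = [X for X in frequent if not any(X < Y and sup(Y) == sup(X) for Y in frequent)]
maximal = [X for X in frequent if not any(X < Y for Y in frequent)]

print(closed)    # {a}, {a,b}, {a,c}, {a,b,c}
print(maximal)   # {a,b,c}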

8 Computational Complexity
How many itemsets can potentially be generated in the worst case?
The number of frequent itemsets is sensitive to the minsup threshold
When minsup is low, there can exist an exponential number of frequent itemsets
The worst case: M^N candidates, where M = # distinct items and N = max length of transactions
The worst-case complexity vs. the expected probability:
Ex. Suppose Walmart sells 10^4 kinds of products
The chance of picking up one particular product: 10^-4
The chance of picking up a particular set of 10 products: ~10^-40
What is the chance that this particular set of 10 products appears 10^3 times in 10^9 transactions?

9 Association Rule Mining
Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ minsup and confidence ≥ minconf
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup or minconf thresholds
⇒ Computationally prohibitive!

10 Mining Association Rules
Observations:
All of the association rules below are binary partitions of the same itemset: {Milk, Diaper, Beer}
Rules originating from the same itemset have identical support but can have different confidence
Thus, we may decouple the support and confidence requirements
Example rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

11 Mining Association Rules
Two-step approach:
1. Frequent itemset mining: generate all itemsets whose support ≥ minsup
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still computationally expensive
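
A minimal rule-generation sketch in Python. The slide's transaction table is not included in the transcript, so the five transactions below are hypothetical, chosen so that the itemset {Milk, Diaper, Beer} yields the support and confidence values shown on the previous slide.

from itertools import combinations

DB = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(itemset <= t for t in DB) / len(DB)

def rules_from(itemset, minconf):
    # emit every binary partition X -> Y of the itemset whose confidence >= minconf
    for k in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), k):
            lhs = set(lhs)
            rhs = itemset - lhs
            conf = support(itemset) / support(lhs)
            if conf >= minconf:
                yield lhs, rhs, support(itemset), conf

for lhs, rhs, s, c in rules_from({"Milk", "Diaper", "Beer"}, minconf=0.6):
    print(lhs, "->", rhs, f"(s={s:.1f}, c={c:.2f})")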

12 Frequent Itemset Mining
Given d items, there are 2^d possible candidate itemsets

13 Frequent Itemset Mining
Brute-force approach:
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the database, matching each transaction against every candidate
Time complexity: O(N*M*w) ⇒ expensive, since M = 2^d

14 Computational Complexity of Association Rule Mining
Given d unique items:
Total number of itemsets = 2^d
Total number of possible association rules: R = 3^d − 2^(d+1) + 1
If d = 6, R = 602 rules
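
The formula follows from noting that each item can go to the left-hand side, the right-hand side, or neither (3^d choices), minus the cases where either side is empty. A brute-force check of the count for small d; the snippet below is illustrative only.

from itertools import combinations

def count_rules(d):
    # enumerate every pair of disjoint non-empty itemsets (X, Y) over d items
    items = range(d)
    total = 0
    for kx in range(1, d + 1):
        for X in combinations(items, kx):
            rest = [i for i in items if i not in X]
            for ky in range(1, len(rest) + 1):
                total += len(list(combinations(rest, ky)))
    return total

for d in range(1, 7):
    assert count_rules(d) == 3**d - 2**(d + 1) + 1
print(count_rules(6))   # 602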

15 Frequent Itemset Mining Strategies
Reduce the number of candidates (M): complete search has M = 2^d; use pruning techniques to reduce M
Reduce the number of transactions (N): reduce the size of N as the size of the itemset increases; used by vertical-based mining algorithms
Reduce the number of comparisons (N*M): use efficient data structures to store the candidates or transactions, so there is no need to match every candidate against every transaction

16 Methods for Frequent Itemset Mining
Traversal of Itemset Lattice General-to-specific vs Specific-to-general

17 Methods for Frequent Itemset Mining
Traversal of Itemset Lattice Breadth-first vs Depth-first

18 Methods for Frequent Itemset Mining
Traversal of Itemset Lattice Equivalent Classes

19 Methods for Frequent Itemset Mining
Representation of Database horizontal vs vertical data layout

20 Frequent Itemset Mining Methods
Apriori: A Candidate Generation-and-Test Approach Improving the Efficiency of Apriori FPGrowth: A Frequent Pattern-Growth Approach ECLAT: Frequent Pattern Mining with Vertical Data Format

21 The Downward Closure Property
The downward closure property of frequent patterns:
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
The Apriori principle holds due to the following property of the support measure: the support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support

22 Illustrating Apriori Principle
[Figure: itemset lattice in which one itemset is found to be infrequent and all of its supersets are pruned]

23 Apriori: A Candidate Generation & Test Approach
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
Algorithm:
1. Initially, scan the DB once to get the frequent 1-itemsets
2. Repeat for k = 1, 2, …:
2.1 Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
2.2 Test the candidates against the DB
3. Terminate when no frequent or candidate set can be generated

24 The Apriori Algorithm—An Example
[Worked example with sup_min = 2: the 1st scan of database TDB produces C1 and L1, the 2nd scan produces C2 and L2, and the 3rd scan produces C3 and L3]

25 The Apriori Algorithm
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
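
A compact Python sketch of the loop above (not the course's reference code). For brevity, candidates are generated by extending each frequent k-itemset with a frequent single item rather than by the Lk x Lk self-join plus subset pruning described on the next slide; the small database at the bottom is the hypothetical one used earlier.

from collections import defaultdict

def apriori(DB, minsup):
    # return {itemset: absolute support} for all itemsets with support >= minsup
    DB = [frozenset(t) for t in DB]
    counts = defaultdict(int)
    for t in DB:
        for item in t:
            counts[frozenset([item])] += 1
    Lk = {x: c for x, c in counts.items() if c >= minsup}   # L1
    frequent = dict(Lk)
    freq_items = {next(iter(x)) for x in Lk}                 # frequent single items

    while Lk:
        # generate candidate (k+1)-itemsets and count them in one pass over DB
        candidates = {x | {i} for x in Lk for i in freq_items if i not in x}
        counts = defaultdict(int)
        for t in DB:
            for cand in candidates:
                if cand <= t:
                    counts[cand] += 1
        Lk = {x: c for x, c in counts.items() if c >= minsup}
        frequent.update(Lk)
    return frequent

DB = [{"Beer", "Nuts", "Diaper"}, {"Beer", "Coffee", "Diaper"},
      {"Beer", "Diaper", "Eggs"}, {"Nuts", "Eggs", "Milk"},
      {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"}]
print(apriori(DB, minsup=3))   # includes {Beer, Diaper}: 3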

26 Implementation: How to Generate Candidates?
Step 1: self-join Lk
Step 2: pruning
Example of candidate generation:
L3 = {abc, abd, acd, ace, bcd}
Self-join: L3 * L3
abcd from abc and abd
acde from acd and ace
Pruning: acde is removed because ade is not in L3
C4 = {abcd}
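
A small Python sketch of the same self-join and prune steps, with itemsets stored as sorted tuples so the shared (k-1)-prefix test is straightforward; the function name and representation are illustrative, not part of the slides.

from itertools import combinations

def candidate_gen(Lk):
    # Lk: set of frequent k-itemsets as sorted tuples; returns candidate (k+1)-itemsets
    Lk = set(Lk)
    k = len(next(iter(Lk)))
    Ck1 = set()
    for p in Lk:
        for q in Lk:
            # self-join: same first k-1 items, and p's last item precedes q's
            if p[:k-1] == q[:k-1] and p[-1] < q[-1]:
                cand = p + (q[-1],)
                # prune: every k-subset of the candidate must itself be frequent
                if all(sub in Lk for sub in combinations(cand, k)):
                    Ck1.add(cand)
    return Ck1

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(candidate_gen(L3))   # {('a','b','c','d')}; acde is pruned because ade is not in L3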

27 Candidate Generation: SQL
SQL implementation of candidate generation (suppose the items in Lk-1 are listed in an order)

Step 1: self-join Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1

Step 2: pruning
forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck

28 Frequency Counting: Hash Tree
Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
A hash function (here: items 1, 4, 7 hash to one branch; 2, 5, 8 to another; 3, 6, 9 to the third)
A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets in a leaf exceeds the max leaf size, split the node)
[Figure: the resulting hash tree with the 15 candidate itemsets distributed over its leaves]
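
A simplified Python sketch of the idea. Instead of the adaptive hash tree (with node splitting at a max leaf size), it hashes every length-3 candidate into a flat bucket keyed by the hash of each of its three items, using the slide's hash function (1,4,7 / 2,5,8 / 3,6,9); a transaction's 3-subsets are then compared only against the candidates in the matching bucket.

from itertools import combinations
from collections import defaultdict

def h(item):
    # the slide's hash function: 1,4,7 -> 0; 2,5,8 -> 1; 3,6,9 -> 2
    return (item - 1) % 3

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6), (2,3,4),
              (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]

buckets = defaultdict(set)
for cand in candidates:
    buckets[tuple(h(i) for i in cand)].add(cand)

counts = defaultdict(int)
def count_transaction(t):
    # look up each 3-subset of t only in the bucket it hashes to
    for sub in combinations(sorted(t), 3):
        if sub in buckets[tuple(h(i) for i in sub)]:
            counts[sub] += 1

count_transaction({1, 2, 3, 5, 6})
print(dict(counts))   # {(1, 2, 5): 1, (1, 3, 6): 1, (3, 5, 6): 1}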

29 Subset Operation Given a transaction t, what are the possible subsets of size 3?

30 Subset Operation Using Hash Tree
[Figure: matching transaction {1 2 3 5 6} against the hash tree; at each level the next transaction item is hashed with the 1,4,7 / 2,5,8 / 3,6,9 function to select a branch, and candidate 3-itemsets are checked only in the leaves that are reached]

31 Factors Affecting Complexity
Choice of minimum support threshold: lowering the support threshold results in more frequent itemsets; this may increase the number of candidates and the max length of frequent itemsets
Dimensionality (number of items) of the data set: more space is needed to store the support count of each item; if the number of frequent items also increases, both computation and I/O costs may increase
Size of the database: since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions
Average transaction width: transaction width increases with denser data sets; this may increase the max length of frequent itemsets, and the number of subsets in a transaction increases with its width

32 Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: itemset lattice showing the border between frequent and infrequent itemsets, with the maximal itemsets lying just inside the frequent side of the border]

33 Closed Itemset An itemset is closed if none of its immediate supersets has the same support as the itemset

34 Maximal vs. Closed Itemsets
[Figure: itemset lattice with each itemset annotated by the IDs of the transactions that contain it; itemsets not supported by any transaction are marked]

35 Maximal vs Closed Frequent Itemsets
With minimum support = 2 in the example, there are 9 closed frequent itemsets and 4 maximal frequent itemsets
[Figure: lattice highlighting the itemsets that are closed but not maximal and those that are both closed and maximal]

36 Maximal vs Closed Itemsets

37 Further Improvement of Apriori
Major computational challenges:
Multiple scans of the transaction database
A huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce the number of passes over the transaction database
Shrink the number of candidates
Facilitate the support counting of candidates

38 Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
Scan 1: partition the database and find the local frequent patterns in each partition
Scan 2: consolidate the global frequent patterns
Why it works: if DB = DB1 ∪ DB2 ∪ … ∪ DBk and an itemset i has sup1(i) < σ|DB1|, sup2(i) < σ|DB2|, …, supk(i) < σ|DBk|, then sup(i) = sup1(i) + … + supk(i) < σ(|DB1| + … + |DBk|) = σ|DB|, so i cannot be globally frequent
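
A two-scan sketch of the Partition idea in Python. For brevity, each partition is mined by brute-force enumeration here; a real implementation would run Apriori (or any frequent-itemset miner) on each in-memory partition and would stream the partitions from disk. The small database is hypothetical.

from itertools import combinations

def frequent_in(part, minsup_count):
    # brute-force local mining of one partition (stand-in for a real miner)
    items = sorted(set().union(*part))
    found = set()
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            if sum(set(c) <= t for t in part) >= minsup_count:
                found.add(frozenset(c))
    return found

def partition_mining(DB, rel_minsup, num_parts=2):
    parts = [DB[i::num_parts] for i in range(num_parts)]
    # Scan 1: anything globally frequent must be locally frequent in some partition
    candidates = set()
    for p in parts:
        candidates |= frequent_in(p, rel_minsup * len(p))
    # Scan 2: count the surviving candidates once against the whole database
    return {c for c in candidates if sum(c <= t for t in DB) >= rel_minsup * len(DB)}

DB = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b", "c"}]
print(partition_mining(DB, rel_minsup=0.5))   # all 1- and 2-itemsets over {a, b, c}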

39 FP-Growth: A Frequent Pattern-Growth Approach
Bottlenecks of the Apriori approach:
Breadth-first (i.e., level-wise) search
Candidate generation and test, which often generates a huge number of candidates
The FP-Growth approach:
Depth-first search
Avoids explicit candidate generation
Major philosophy: grow long patterns from short ones using local frequent items only
"abc" is a frequent pattern
Get all transactions having "abc", i.e., project the DB on abc: DB|abc
"d" is a local frequent item in DB|abc ⇒ abcd is a frequent pattern

40 General Idea
The FP-growth method indexes the database for fast support computation via an augmented prefix tree: the frequent pattern tree (FP-tree)
Each node in the tree is labeled with a single item and stores the support of the itemset comprising the items on the path from the root to that node
FP-tree construction:
Initially the tree contains only the root, labeled with the null item ∅
For each transaction X ∈ D, we insert X into the FP-tree, incrementing the count of all nodes along the path that represents X
If X shares a prefix with some previously inserted transaction, X follows the same existing path up to the end of the common prefix; for the remaining items in X, new nodes are created under the common prefix, with counts initialized to 1
The FP-tree is complete when all transactions have been inserted
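
A minimal FP-tree construction sketch in Python (the header table and node-links are omitted, and transactions are assumed to be already filtered to frequent items and sorted in f-list order); the three transactions at the bottom are hypothetical.

class FPNode:
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}            # item -> FPNode

def build_fp_tree(transactions):
    root = FPNode(None)               # the root represents the null item
    for t in transactions:
        node = root
        for item in t:                # follow the shared prefix, then extend it
            if item not in node.children:
                node.children[item] = FPNode(item)
            node = node.children[item]
            node.count += 1           # every node on the path counts this transaction
    return root

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

# hypothetical transactions, already in f-list order (b, e, a, c, d)
transactions = [["b", "e", "a", "d"], ["b", "e", "c"], ["b", "a", "c", "d"]]
show(build_fp_tree(transactions))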

41 FP-Tree: Preprocessing
Procedure:
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items in frequency-descending order: the f-list
3. Scan the DB again and construct the FP-tree
This way, frequent patterns can be partitioned into subsets according to the f-list (e.g., f-list = b-e-a-c-d):
patterns containing d
patterns having c but no d
…
patterns having b but no e, a, c, or d
This partition is complete and non-redundant

42 Frequent Pattern Tree
The FP-tree is a prefix-compressed representation of D, with the frequent items sorted in descending order of support

43 FP-Growth Algorithm
Given an FP-tree R, projected FP-trees are built for each frequent item i in R, in increasing order of support, in a recursive manner
To project R on item i, we find all occurrences of i in the tree and, for each occurrence, determine the corresponding path from the root to i
The count of item i on a given path is recorded in cnt(i), and the path is inserted into the new projected tree
While inserting the path, the count of each node along it is incremented by the path count cnt(i)
The base case of the recursion is reached when the input FP-tree R is a single path, which is handled by basic enumeration of its subsets
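
A simplified FP-growth sketch in Python. Instead of building physical projected FP-trees, each projection is kept as its conditional pattern base, a list of (prefix path, count) pairs; the recursion is the same, but the tree's prefix compression is traded away for brevity. The example database is the hypothetical one used above.

from collections import defaultdict

def fp_growth(patterns, suffix, minsup, out):
    # patterns: list of (path_tuple, count) pairs conditioned on `suffix`
    counts = defaultdict(int)
    for path, cnt in patterns:
        for item in path:
            counts[item] += cnt
    for item, cnt in counts.items():
        if cnt < minsup:
            continue
        new_suffix = suffix + (item,)
        out[frozenset(new_suffix)] = cnt
        # conditional pattern base of `item`: the prefix that precedes it on each path
        projected = [(path[:path.index(item)], c) for path, c in patterns if item in path]
        fp_growth(projected, new_suffix, minsup, out)

DB = [("b", "e", "a", "d"), ("b", "e", "c"), ("b", "a", "c", "d")]   # f-list order assumed
out = {}
fp_growth([(t, 1) for t in DB], (), minsup=2, out=out)
print(out)   # {b}:3, {e}:2, {b,e}:2, {a}:2, {a,b}:2, {d}:2, {b,d}:2, {a,d}:2, {a,b,d}:2, {c}:2, {b,c}:2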

44 Projected Frequent Pattern Tree for D

45 Frequent Pattern Tree Projection

46 FP-Growth Algorithm

47 Benefits of the FP-Tree
Completeness:
Preserves complete information for frequent pattern mining
Never breaks a long pattern of any transaction
Compactness:
Reduces irrelevant information: infrequent items are gone
Items are stored in frequency-descending order: the more frequently an item occurs, the more likely its nodes are shared
The FP-tree is never larger than the original database (not counting node-links and count fields)

48 Advantages of the Pattern Growth Approach
Divide-and-conquer: decompose both the mining task and the DB according to the frequent patterns obtained so far, leading to a focused search over smaller databases
Other factors:
No candidate generation and no candidate tests
Compressed database: the FP-tree structure
No repeated scans of the entire database
Basic operations are counting local frequent items and building sub-FP-trees: no pattern search and matching

49 ECLAT: Mining by Exploring Vertical Data Format
For each item, store a list of transaction ids (tids)

50 ECLAT
Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets
Eclat intersects the tid-sets only if the frequent itemsets share a common prefix
It traverses the prefix search tree in a DFS-like manner, processing a group of itemsets that have the same prefix, also called a prefix equivalence class
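
A minimal ECLAT sketch in Python over the vertical layout. Itemsets within a prefix equivalence class are extended one item at a time, and the support of each extension is obtained by intersecting tid-lists; the small database is hypothetical.

from collections import defaultdict

def eclat(prefix, items, minsup, out):
    # items: list of (item, tidset) pairs that all extend the same `prefix`
    for i, (item, tids) in enumerate(items):
        new_prefix = prefix + (item,)
        out[new_prefix] = len(tids)
        # build the equivalence class of itemsets extending new_prefix
        extensions = []
        for other, other_tids in items[i + 1:]:
            common = tids & other_tids              # tid-list intersection
            if len(common) >= minsup:
                extensions.append((other, common))
        if extensions:
            eclat(new_prefix, extensions, minsup, out)

# hypothetical horizontal database, converted to the vertical (item -> tid-set) layout
DB = {1: {"a", "b", "d"}, 2: {"b", "c"}, 3: {"a", "b", "c"}, 4: {"a", "c"}}
vertical = defaultdict(set)
for tid, t in DB.items():
    for item in t:
        vertical[item].add(tid)

out = {}
start = sorted((item, tids) for item, tids in vertical.items() if len(tids) >= 2)
eclat((), start, minsup=2, out=out)
print(out)   # ('a',):3, ('a','b'):2, ('a','c'):2, ('b',):3, ('b','c'):2, ('c',):3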

51 Eclat Algorithm

52 ECLAT: Tidlist Intersections

