Presentation is loading. Please wait.

Presentation is loading. Please wait.

Frequent Item Mining.

Similar presentations


Presentation on theme: "Frequent Item Mining."— Presentation transcript:

1 Frequent Item Mining

2 What is data mining? =Pattern Mining? What patterns?
Why are they useful?

3 Definition: Frequent Itemset
A collection of one or more items Example: {Milk, Bread, Diaper} k-itemset An itemset that contains k items Support count () Frequency of occurrence of an itemset E.g. ({Milk, Bread,Diaper}) = 2 Support Fraction of transactions that contain an itemset E.g. s({Milk, Bread, Diaper}) = 2/5 Frequent Itemset An itemset whose support is greater than or equal to a minsup threshold

4 Frequent Itemsets Mining
TID Transactions 100 { A, B, E } 200 { B, D } 300 400 { A, C } 500 { B, C } 600 700 { A, B } 800 { A, B, C, E } 900 { A, B, C } 1000 { A, C, E } Minimum support level 50% {A},{B},{C},{A,B}, {A,C}

5 Three Different Views of FIM
Transactional Database How we do store a transactional database? Horizontal, Vertical, Transaction-Item Pair Binary Matrix Bipartite Graph How does the FIM formulated in these different settings? 5

6 Frequent Itemset Generation
Given d items, there are 2d possible candidate itemsets

7 Frequent Itemset Generation
Brute-force approach: Each itemset in the lattice is a candidate frequent itemset Count the support of each candidate by scanning the database Match each transaction against every candidate Complexity ~ O(NMw) => Expensive since M = 2d !!!

8 Reducing Number of Candidates
Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due to the following property of the support measure: Support of an itemset never exceeds the support of its subsets This is known as the anti-monotone property of support

9 Illustrating Apriori Principle
Found to be Infrequent Pruned supersets

10 Illustrating Apriori Principle
Items (1-itemsets) Pairs (2-itemsets) (No need to generate candidates involving Coke or Eggs) Minimum Support = 3 Triplets (3-itemsets) If every subset is considered, 6C1 + 6C2 + 6C3 = 41 With support-based pruning, = 13

11 Apriori R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB, , 1994

12

13 How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order Step 1: self-joining Lk-1 insert into Ck select p.item1, p.item2, …, p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1 Step 2: pruning forall itemsets c in Ck do forall (k-1)-subsets s of c do if (s is not in Lk-1) then delete c from Ck

14 Challenges of Frequent Itemset Mining
Multiple scans of transaction database Huge number of candidates Tedious workload of support counting for candidates Improving Apriori: general ideas Reduce passes of transaction database scans Shrink number of candidates Facilitate support counting of candidates

15 Alternative Methods for Frequent Itemset Generation
Representation of Database horizontal vs vertical data layout

16 ECLAT For each item, store a list of transaction ids (tids) TID-list

17 ECLAT Determine support of any k-itemset by intersecting tid-lists of two of its (k-1) subsets. 3 traversal approaches: top-down, bottom-up and hybrid Advantage: very fast support counting Disadvantage: intermediate tid-lists may become too large for memory

18

19

20 FP-growth Algorithm Use a compressed representation of the database using an FP-tree Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets

21 FP-tree construction null After reading TID=1: A:1 B:1

22 FP-Tree Construction null B:3 A:7 B:5 C:3 C:1 D:1 D:1 C:3 E:1 D:1 E:1
Transaction Database null B:3 A:7 B:5 C:3 C:1 D:1 Header table D:1 C:3 E:1 D:1 E:1 D:1 E:1 D:1 Pointers are used to assist frequent itemset generation

23 FP-growth Conditional Pattern base for D: P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)} Recursively apply FP-growth on P Frequent Itemsets found (with sup > 1): AD, BD, CD, ACD, BCD null A:7 B:1 B:5 C:1 C:1 D:1 D:1 C:3 D:1 D:1 D:1

24

25 Compact Representation of Frequent Itemsets
Some itemsets are redundant because they have identical support as their supersets Number of frequent itemsets Need a compact representation

26 Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent Maximal Itemsets Border Infrequent Itemsets

27 Closed Itemset An itemset is closed if none of its immediate supersets has the same support as the itemset

28 Maximal vs Closed Itemsets
Transaction Ids Not supported by any transactions

29 Maximal vs Closed Frequent Itemsets
Closed but not maximal Minimum support = 2 Closed and maximal # Closed = 9 # Maximal = 4

30 Maximal vs Closed Itemsets

31 Association Rule Mining and FIM

32 Research Questions How to efficiently enumerate Maximal Frequent Itemsets? How about Closed Frequent Itemsets?

33 Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction Example of Association Rules Market-Basket transactions {Diaper}  {Beer}, {Beer, Bread}  {Milk}, Implication means co-occurrence, not causality!

34 Definition: Association Rule
An implication expression of the form X  Y, where X and Y are itemsets Example: {Milk, Diaper}  {Beer} Rule Evaluation Metrics Support (s) Fraction of transactions that contain both X and Y Confidence (c) Measures how often items in Y appear in transactions that contain X Example:

35 Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ minsup threshold confidence ≥ minconf threshold Brute-force approach: List all possible association rules Compute the support and confidence for each rule Prune rules that fail the minsup and minconf thresholds  Computationally prohibitive!

36 Mining Association Rules
Example of Rules: {Milk,Diaper}  {Beer} (s=0.4, c=0.67) {Milk,Beer}  {Diaper} (s=0.4, c=1.0) {Diaper,Beer}  {Milk} (s=0.4, c=0.67) {Beer}  {Milk,Diaper} (s=0.4, c=0.67) {Diaper}  {Milk,Beer} (s=0.4, c=0.5) {Milk}  {Diaper,Beer} (s=0.4, c=0.5) Observations: All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} Rules originating from the same itemset have identical support but can have different confidence Thus, we may decouple the support and confidence requirements

37 Mining Association Rules
Two-step approach: Frequent Itemset Generation Generate all itemsets whose support  minsup Rule Generation Generate high confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset Frequent itemset generation is still computationally expensive

38 Computational Complexity
Given d unique items: Total number of itemsets = 2d Total number of possible association rules: If d=6, R = 602 rules

39 Rule Generation Given a frequent itemset L, find all non-empty subsets f  L such that f  L – f satisfies the minimum confidence requirement If {A,B,C,D} is a frequent itemset, candidate rules: ABC D, ABD C, ACD B, BCD A, A BCD, B ACD, C ABD, D ABC AB CD, AC  BD, AD  BC, BC AD, BD AC, CD AB, If |L| = k, then there are 2k – 2 candidate association rules (ignoring L   and   L)

40 Rule Generation How to efficiently generate rules from frequent itemsets? In general, confidence does not have an anti-monotone property c(ABC D) can be larger or smaller than c(AB D) But confidence of rules generated from the same itemset has an anti-monotone property e.g., L = {A,B,C,D}: c(ABC  D)  c(AB  CD)  c(A  BCD) Confidence is anti-monotone w.r.t. number of items on the RHS of the rule

41 Rule Generation for Apriori Algorithm
Lattice of rules Pruned Rules Low Confidence Rule

42 Rule Generation for Apriori Algorithm
Candidate rule is generated by merging two rules that share the same prefix in the rule consequent join(CD=>AB,BD=>AC) would produce the candidate rule D => ABC Prune rule D=>ABC if its subset AD=>BC does not have high confidence

43 Beyond Itemsets Sequence Mining Graph Mining Tree Mining
Finding frequent subsequences from a collection of sequences Graph Mining Finding frequent (connected) subgraphs from a collection of graphs Tree Mining Finding frequent (embedded) subtrees from a set of trees/graphs Geometric Structure Mining Finding frequent substructures from 3-D or 2-D geometric graphs Among others…

44 Frequent Pattern Mining
B B A A A B B A B B E F E A A A B C B D D C C F D D F C C C C D D A D F D C A B D C

45 Why Frequent Pattern Mining is So Important?
Application Domains Business, biology, chemistry, WWW, computer/networing security, … Summarizing the underlying datasets, providing key insights Basic tools for other data mining tasks Assocation rule mining Classification Clustering Change Detection etc…

46 Network motifs: recurring patterns that occur significantly more than in randomized nets
Do motifs have specific roles in the network? Many possible distinct subgraphs

47 The 13 three-node connected subgraphs

48 199 4-node directed connected subgraphs
And it grows fast for larger subgraphs : node subgraphs, 1,530,843 6-node…

49 Finding network motifs – an overview
Generation of a suitable random ensemble (reference networks) Network motifs detection process: Count how many times each subgraph appears Compute statistical significance for each subgraph – probability of appearing in random as much as in real network (P-val or Z-score)

50 Ensemble of networks Real = Rand=0.5±0.6 Zscore (#Standard Deviations)=7.5

51 Performance and Scalability: Apriori Implementation

52 Apriori R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB, , 1994

53 Challenges of Frequent Itemset Mining
Multiple scans of transaction database Huge number of candidates Tedious workload of support counting for candidates Improving Apriori: general ideas Reduce passes of transaction database scans Shrink number of candidates Facilitate support counting of candidates 53

54 Reducing Number of Comparisons
Candidate counting: Scan the database of transactions to determine the support of each candidate itemset To reduce the number of comparisons, store the candidates in a hash structure Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

55 Generate Hash Tree Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} You need: Hash function Max leaf size: max number of itemsets stored in a leaf node (if number of candidate itemsets exceeds max leaf size, split the node) 2 3 4 5 6 7 1 4 5 1 3 6 1 2 4 4 5 7 1 2 5 4 5 8 1 5 9 3 4 5 3 5 6 3 5 7 6 8 9 3 6 7 3 6 8 1,4,7 2,5,8 3,6,9 Hash function

56 Association Rule Discovery: Hash tree
Hash Function Candidate Hash Tree 1 5 9 1 4 5 1 3 6 3 4 5 3 6 7 3 6 8 3 5 6 3 5 7 6 8 9 2 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 1,4,7 3,6,9 2,5,8 Hash on 1, 4 or 7

57 Association Rule Discovery: Hash tree
Hash Function Candidate Hash Tree 1 5 9 1 4 5 1 3 6 3 4 5 3 6 7 3 6 8 3 5 6 3 5 7 6 8 9 2 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 1,4,7 3,6,9 2,5,8 Hash on 2, 5 or 8

58 Association Rule Discovery: Hash tree
Hash Function Candidate Hash Tree 1 5 9 1 4 5 1 3 6 3 4 5 3 6 7 3 6 8 3 5 6 3 5 7 6 8 9 2 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 1,4,7 3,6,9 2,5,8 Hash on 3, 6 or 9

59 Subset Operation Given a transaction t, what are the possible subsets of size 3?

60 Subset Operation Using Hash Tree
1,4,7 2,5,8 3,6,9 Hash Function transaction 1 + 3 5 6 2 + 1 5 9 1 4 5 1 3 6 3 4 5 3 6 7 3 6 8 3 5 6 3 5 7 6 8 9 2 3 4 5 6 7 1 2 4 4 5 7 1 2 5 4 5 8 5 6 3 +

61 Subset Operation Using Hash Tree
1,4,7 2,5,8 3,6,9 Hash Function transaction 1 + 3 5 6 2 + 3 5 6 1 2 + 5 6 3 + 5 6 1 3 + 2 3 4 6 1 5 + 5 6 7 1 4 5 1 3 6 3 4 5 3 5 6 3 5 7 3 6 7 3 6 8 6 8 9 1 2 4 1 2 5 1 5 9 4 5 7 4 5 8

62 Subset Operation Using Hash Tree
1,4,7 2,5,8 3,6,9 Hash Function transaction 1 + 3 5 6 2 + 3 5 6 1 2 + 5 6 3 + 5 6 1 3 + 2 3 4 6 1 5 + 5 6 7 1 4 5 1 3 6 3 4 5 3 5 6 3 5 7 3 6 7 3 6 8 6 8 9 1 2 4 1 2 5 1 5 9 4 5 7 4 5 8 Match transaction against 11 out of 15 candidates

63 Prefix Tree Representation
Efficient Implementations of Apriori and Eclat Christian Borgelt., FIMI’03

64 Prefix Tree

65 Prefix Tree Structure for Counting

66 Other key optimization
Recording the items Why is this relevant? Transaction Tree Organize transaction into trees Count through two trees

67 Scalability How to handle very large dataset?
The dataset can not be stored in the main memory Performance of out-of-core datasets/Performance of in-core datasets

68 Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB Scan 1: partition database and find local frequent patterns Scan 2: consolidate global frequent patterns A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB’95

69 DHP: Reduce the Number of Candidates
A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent Candidates: a, b, c, d, e Hash entries: {ab, ad, ae} {bd, be, de} … Frequent 1-itemset: a, b, d, e ab is not a candidate 2-itemset if the sum of count of {ab, ad, ae} is below support threshold J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD’95

70 Sampling for Frequent Patterns
Select a sample of original database, mine frequent patterns within sample using Apriori Scan database once to verify frequent itemsets found in sample, only borders of closure of frequent patterns are checked Example: check abcd instead of ab, ac, …, etc. Scan database again to find missed frequent patterns H. Toivonen. Sampling large databases for association rules. In VLDB’96

71 DIC: Reduce Number of Scans
ABCD Once both A and D are determined frequent, the counting of AD begins Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins ABC ABD ACD BCD AB AC BC AD BD CD Transactions 1-itemsets A B C D 2-itemsets Apriori {} Itemset lattice 1-itemsets S. Brin R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD’97 2-items DIC 3-items

72 References R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD, , 1993.  R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB, , 1994. R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD, 85-93, 1998.

73 References: Christian Borgelt, Efficient Implementations of Apriori and Eclat, FIMI’03 Ferenc Bodon, A fast APRIORI implementation, FIMI’03 Ferenc Bodon, A Survey on Frequent Itemset Mining, Technical Report, Budapest University of Technology and Economic, 2006

74 Important websites: FIMI workshop Christian Borgelt’s website
Not only Apriori and FIM FP-tree, ECLAT, Closed, Maximal Christian Borgelt’s website Ferenc Bodon’s website


Download ppt "Frequent Item Mining."

Similar presentations


Ads by Google