LCM ver.2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets Takeaki Uno Masashi Kiyomi Hiroki Arimura National Institute of Informatics, JAPAN Hokkaido University, JAPAN 1/Nov/2004 Frequent Itemset Mining Implementations ’04

Summary

                        Our approach                                                          Typical approach
FI mining               Backtracking with hypercube decomposition (few freq. countings)      Backtracking
CI mining               Backtracking with PPC-extension (complete enumeration, small memory) Apriori with pruning
MFI mining              Backtracking with pruning (small memory)                             Apriori with pruning
freq. counting          Occurrence deliver (linear-time computation)                         Down project
database maintenance    Array with anytime database reduction (simple, fast initialization)  Trie (FP-tree)
maximality check        More database reductions (small memory)                              Store all itemsets

Frequent Itemset Mining

Almost all computation time is spent on frequency counting
⇒ how to reduce the number of FIs to be checked?
⇒ how to reduce the cost of frequency counting?

Hypercube Decomposition [from ver.1]

Reduce the number of FIs to be checked:
◆ Decompose the set of all FIs into hypercubes, each of which is included in an equivalence class.
◆ Enumerate the maximal and minimal itemsets of each hypercube (with frequency counting).
◆ Generate the other FIs between the maximal and minimal itemsets (without frequency counting), as in the sketch below.

Efficient when the support is small.
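To make the saving concrete, here is a minimal sketch of the generation step (hypothetical helper names, not LCM's code): once the minimal itemset P of a hypercube and the set Q of items spanning it are known, every itemset between P and P ∪ Q belongs to the same equivalence class and can be emitted with no further frequency counting.

```python
from itertools import combinations

def hypercube_itemsets(P, Q):
    """Yield every itemset between the minimal itemset P and the
    maximal itemset P | Q; all share one denotation, so none needs
    its own frequency count."""
    Q = list(Q)
    for r in range(len(Q) + 1):
        for extra in combinations(Q, r):
            yield frozenset(P) | frozenset(extra)

# P = {1, 2} and Q = {5, 7} expand to the 4 itemsets of the hypercube:
# {1,2}, {1,2,5}, {1,2,7}, {1,2,5,7}
print(sorted(map(sorted, hypercube_itemsets({1, 2}, {5, 7}))))
```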

Occurrence Deliver [ver.1]

Compute the denotations of P ∪ {i} for all items i at once, by transposing the trimmed database. The trimmed database is composed of:
◆ the items to be added, and
◆ the transactions including P.

This takes time linear in the size of the trimmed database, and is efficient for sparse datasets. (The slide illustrates the itemset P = {1,2} with denotation {A, B, C}, from which the denotations of {1,2,3}, {1,2,4}, and {1,2,5} are delivered in one scan.) A sketch follows.
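A minimal sketch of occurrence deliver, assuming the trimmed database is a dict from transaction id to a sorted item list (the names occ, tail, and buckets are ours, not LCM's):

```python
from collections import defaultdict

def occurrence_deliver(occ, database, tail):
    """One scan over the trimmed database computes the denotations of
    P + {i} for every item i > tail simultaneously.

    occ      -- ids of the transactions containing the current itemset P
    database -- database[t] is the item list of transaction t
    tail     -- the largest item of P; only larger items are delivered
    Time is linear in the size of the trimmed database."""
    buckets = defaultdict(list)        # item i -> denotation of P + {i}
    for t in occ:
        for i in database[t]:
            if i > tail:
                buckets[i].append(t)
    return buckets

# P = {1, 2} occurs in transactions A, B, C:
db = {'A': [1, 2, 5, 6, 7, 9], 'B': [1, 2, 7, 8, 9], 'C': [1, 2, 7, 9]}
print(dict(occurrence_deliver(['A', 'B', 'C'], db, 2)))
# buckets[7] == ['A', 'B', 'C'], buckets[5] == ['A'], ...
```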

Loss of Occurrence Deliver [new]

Avoiding the frequency counting of infrequent itemsets P ∪ {e} has been considered important. However, in our experiments the computation time spent on such itemsets is only about 1/3 of the total computation cost on average, provided the items are sorted by frequency (size of their occurrence lists). (The slide shows an example database with the threshold θ drawn across the sorted items.) Occurrence deliver thus keeps the advantage of its simple structure.

Anytime Database Reduction [new]

Database reduction: reduce the database, in the style of FP-growth etc.:
◆ remove an item e if it is included in fewer than θ transactions, or in all transactions;
◆ merge identical transactions into one.

Anytime database reduction: recursively apply trimming and this reduction inside the recursion
⇒ the database becomes small in the lower levels of the recursion.
In the recursion tree, lower-level iterations are exponentially more numerous than upper-level ones ⇒ very efficient. A sketch of one reduction round follows.
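A minimal sketch of one reduction round (hypothetical helper, using weighted transactions rather than LCM's internal representation):

```python
from collections import Counter

def reduce_database(transactions, theta):
    """One round of database reduction: drop useless items, then merge
    identical transactions, summing their weights.

    transactions -- list of (itemset, weight) pairs; a weight records
                    how many original transactions were merged into one
    theta        -- the minimum support"""
    total = sum(w for _, w in transactions)
    freq = Counter()
    for items, w in transactions:
        for i in items:
            freq[i] += w
    # Keep an item only if it is neither infrequent nor ubiquitous.
    keep = {i for i, c in freq.items() if theta <= c < total}
    merged = Counter()
    for items, w in transactions:
        merged[frozenset(i for i in items if i in keep)] += w
    return [(items, w) for items, w in merged.items() if items]

db = [(frozenset({1, 2, 3}), 1), (frozenset({1, 2, 3}), 1),
      (frozenset({1, 4, 5}), 1), (frozenset({2, 3, 4}), 1)]
print(reduce_database(db, theta=2))
# item 5 is dropped (support 1 < 2); the two {1,2,3} rows merge: weight 2
```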

Example of Anytime D. R. [new]

(Figure: the database is alternately trimmed and reduced at each level of the recursion: trim ⇒ anytime database reduction ⇒ trim ⇒ anytime database reduction ⇒ ….)

Array (reduced) vs. Trie (FP-tree) [new]

A Trie can compress the trimmed database [FP-growth, etc.]. In experiments on the FIMI instances, we computed the average compression ratio achieved by a Trie on the trimmed database over all iterations:
◆ number of items (cells) in the Trie ⇒ 1/2 on average, 1/6 at minimum (dense case).
If the Trie is built as a binary tree, it needs at least 3 pointers per item
⇒ memory use (and computation time) about twice that of an array, 2/3 at minimum.
Array initialization is also fast: LCM takes O(||T||), while a Trie takes O(|T| log |T| + ||T||).

Results

Closed Itemset Mining

How to:
◆ avoid (prune) non-closed itemsets? (existing pruning is not complete)
◆ compute the closure operation quickly?
◆ save memory? (existing approaches use much memory)

Prefix-Preserving Closure Extension [ver.1]

Prefix-preserving closure extension (PPC-extension) is a variation of closure extension.

Def. The closure tail of a closed itemset P is the minimum j such that closure(P ∩ {1,…,j}) = P.
Def. H = closure(P ∪ {i}) (a closure extension of P) is a PPC-extension of P ⇔ i > the closure tail of P, and H ∩ {1,…,i−1} = P ∩ {1,…,i−1}.

⇒ No duplication occurs in the depth-first search: any closed itemset H is generated from a unique closed itemset by PPC-extension (namely from closure(H ∩ {1,…,i−1})). A compact sketch of the scheme follows.
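A compact, unoptimized sketch of the whole scheme (naive frequency counting and closure; LCM itself uses occurrence deliver and database reduction for these steps):

```python
def closure(P, T):
    """Intersection of all transactions containing P (assumes one exists)."""
    occ = [t for t in T if P <= t]
    out = set(occ[0])
    for t in occ[1:]:
        out &= t
    return frozenset(out)

def lcm(P, tail, T, theta, items):
    """Enumerate all frequent closed itemsets by PPC-extension.
    P must be closed; tail plays the role of its closure tail, so only
    items i > tail may be added."""
    print(sorted(P))
    for i in sorted(items):
        if i <= tail or i in P:
            continue
        Q = frozenset(P | {i})
        if sum(1 for t in T if Q <= t) < theta:
            continue                        # infrequent extension
        H = closure(Q, T)
        # PPC condition: the closure must agree with P below item i.
        if {e for e in H if e < i} == {e for e in P if e < i}:
            lcm(H, i, T, theta, items)

T = [frozenset(t) for t in ({1, 2, 5, 6, 7, 9}, {2, 3, 4, 5},
                            {1, 2, 7, 8, 9}, {1, 7, 9}, {2, 7, 9}, {2})]
items = set().union(*T)
lcm(closure(frozenset(), T), 0, T, theta=1, items=items)
# prints exactly the 10 closed itemsets of the example slide below
```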

Example of PPC-Extension [ver.1]

(Figure: for the database T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }, the closed itemsets φ, {2}, {7,9}, {2,5}, {1,7,9}, {2,7,9}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9} are shown with their closure-extension and PPC-extension edges.)

◆ Closure extension ⇒ an acyclic graph.
◆ PPC-extension ⇒ a tree.

Results

Maximal Frequent Itemset Mining

How to:
◆ avoid (prune) non-maximal itemsets?
◆ check maximality quickly?
◆ save memory? (existing maximality checks and prunings use much memory)

Backtracking-Based Pruning [new]

During the backtracking algorithm for FI mining, let K be the current itemset and H a MFI including K. Re-sort the items so that the items of H come last. Then no new MFI can ever be found in the recursive calls w.r.t. the items in H
⇒ omit those recursive calls.
This avoids visiting very many non-MFIs. (The slide illustrates the re-sorting, marking which branches receive a recursive call and which do not.) A schematic fragment follows.
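A schematic fragment of the idea (ours, not LCM's actual code): with a known MFI H containing the current itemset K, the candidate items are reordered so that the members of H come last, and the recursive calls on them are skipped, since extending K only inside H can never leave H and therefore never reaches a new MFI.

```python
def split_candidates(cands, H):
    """Reorder the candidate items: those outside the known MFI H are
    explored first; those inside H get no recursive call at all."""
    explore = [i for i in cands if i not in H]   # may lead to new MFIs
    skipped = [i for i in cands if i in H]       # recursion omitted
    return explore, skipped

# current itemset K = {2}, known MFI H = {2, 7, 9}, candidates {3, 7, 9}:
print(split_candidates([3, 7, 9], {2, 7, 9}))    # -> ([3], [7, 9])
```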

Fast Maximality Check (CI, MFI) [new]

To reduce the computation cost of the maximality check and the closedness check, we use further database reduction. At anytime database reduction, we keep:
◆ the intersection of the merged transactions, for the closure operation;
◆ the sum of the merged transactions, as a weighted transaction database, for the maximality check.

The closure is the intersection of the transactions; the frequencies of the itemsets one item larger are sums over transactions in the trimmed database.
⇒ Using these reduced databases, the computation time becomes short (no more than that of frequency counting). A sketch of the bookkeeping follows.
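A minimal sketch of the bookkeeping when identical trimmed transactions are merged (hypothetical helper; LCM stores this inside its reduced database): the weight supports the frequency and maximality checks, while the intersection of the merged originals supports the closure operation.

```python
def merge_with_bookkeeping(transactions, keep_items):
    """Merge transactions that become identical after trimming to
    keep_items. Returns {trimmed itemset: (weight, intersection of
    the merged originals)}."""
    merged = {}
    for t in transactions:
        key = frozenset(i for i in t if i in keep_items)
        if key in merged:
            w, inter = merged[key]
            merged[key] = (w + 1, inter & frozenset(t))
        else:
            merged[key] = (1, frozenset(t))
    return merged

T = [{1, 2, 7, 9}, {1, 2, 7, 8, 9}, {2, 5}]
# Trimming to {1, 2, 7, 9} merges the first two transactions:
# weight 2 for the weighted frequency counts, intersection
# {1, 2, 7, 9} for the closure operation.
print(merge_with_bookkeeping(T, {1, 2, 7, 9}))
```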

Results

Experiments

CPU, memory, OS: AMD Athlon XP 1600+, 224 MB, Linux.
Compared with: FP-growth, afopt, Mafia, PatriciaMine, kDCI (all of which scored high at the FIMI'03 competition), on 13 datasets from the FIMI repository.

Results:
◆ fast at large supports on all instances, for FI, CI, and MFI;
◆ fast on all instances for CI (except Accidents);
◆ fast on all sparse datasets for FI, CI, and MFI;
◆ slow only on Accidents and T40I10D100K for FI and MFI, and on pumsb* for MFI.

Summary of Results

Large supports:        FI       CI       MFI
  sparse (7)           LCM      LCM      LCM
  middle (5)           LCM      LCM      LCM
  dense (1)            LCM      LCM      LCM

Small supports:        FI       CI       MFI
  sparse (7)           LCM      LCM      LCM
  middle (5)           Both     LCM      Both
  dense (1)            Others   LCM      Others

Results

Conclusion

◆ When equivalence classes are large, PPC-extension and hypercube decomposition work well.
◆ Anytime database reduction and occurrence deliver have advantages over Trie and Down project in initialization, sparse cases, and simplicity.
◆ Backtracking-based pruning saves memory.
◆ More database reduction works about as well as approaches that store all itemsets in memory.

Future Work

LCM is weak at MFI mining and on dense datasets:
◆ more efficient pruning for MFI;
◆ new data structures for dense cases;
◆ fast radix sort for anytime database reduction;
◆ I/O optimization?

List of Datasets

Real datasets:
・ BMS-WebView-1
・ BMS-WebView-2
・ BMS-POS
・ Retail
・ Kosarak
・ Accidents

Machine learning benchmarks:
・ Chess
・ Mushroom
・ Pumsb
・ Pumsb*
・ Connect

Artificial datasets:
・ T10I4D100K
・ T40I10D100K