Mining Frequent Itemsets with Constraints. Takeaki Uno, National Institute of Informatics, JAPAN. Nov/2005, FJWCP.

Mining Frequent Itemsets with Constraints
Takeaki Uno (National Institute of Informatics, JAPAN)
joint work with Hiroki Arimura (Hokkaido University, JAPAN)
Nov/2005 FJWCP

Knowledge Discovery from Databases
Finding interesting patterns from large-scale databases (figure: a chemical-structure database and an XML-like person-record database, from which patterns are extracted).
Applications in data engineering, bioinformatics, chemistry, management science, linguistics, etc.

Frequent Pattern Approach
・ It is difficult to define "what is interesting" in mathematical terms
・ A popular approach: patterns that appear frequently in the database are good candidates for the task
 - enumerate candidate patterns,
 - filter them by some constraints to remove unnecessary patterns,
 - have a "look" at them (evaluate)
(figure: candidates, then filtering; the classic "beer and nappy" example)

Frequent Pattern Approach
・ Patterns with high frequencies are usually something "obvious", so we have to search into low-frequency patterns
・ But there is a huge number of patterns with low frequencies
・ Directly finding the patterns that satisfy the given constraints is therefore important
In this talk, we focus on transaction databases and show algorithms for efficiently finding frequent patterns that satisfy given constraints

Transaction Database
・ Transaction database T: a database composed of transactions defined on an itemset E, i.e., every T ∈ T satisfies T ⊆ E
 - basket data
 - links of web pages
 - words in documents
・ A subset of E is called a pattern (or itemset)
Example: T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }
In practice, the size of T can be over a million transactions, so operations on it take a long time

Occurrences of a Pattern
For a pattern P:
・ occurrence of P: a transaction in T including P
・ denotation of P: the set of occurrences of P
・ frequency of P: the size of the denotation of P
・ P is frequent ⇔ the frequency of P is no less than θ
Example: for T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} },
 - the denotation of {1,2} is { {1,2,5,6,7,9}, {1,2,7,8,9} }
 - the patterns included in at least 3 transactions are {1}, {2}, {7}, {9}, {1,7}, {1,9}, {2,7}, {2,9}, {7,9}, {2,7,9}
Frequent pattern mining problem: given θ and T, find all frequent patterns
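As a concrete reading of these definitions, here is a minimal Python sketch (not from the original slides; the database and threshold follow the running example above) that computes denotations and frequencies:

```python
# Minimal illustration of "denotation" and "frequency" for the running example.
T = [frozenset(t) for t in ([1, 2, 5, 6, 7, 9], [2, 3, 4, 5], [1, 2, 7, 8, 9],
                            [1, 7, 9], [2, 7, 9], [2])]
theta = 3  # minimum frequency

def denotation(P, transactions):
    """All transactions that include every item of pattern P."""
    P = frozenset(P)
    return [t for t in transactions if P <= t]

def frequency(P, transactions):
    """Size of the denotation of P."""
    return len(denotation(P, transactions))

print(denotation({1, 2}, T))              # the two transactions containing both 1 and 2
print(frequency({2, 7, 9}, T) >= theta)   # True: {2,7,9} is frequent for theta = 3
```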

Backtracking Algorithm
・ "Frequent" is a monotone property (every subset of a frequent pattern is frequent), so a backtracking algorithm works, starting from the empty set:

BackTrack (P)
  Output P
  For each item i > max. item of P
    If P ∪ {i} is frequent then call BackTrack (P ∪ {i})

(figure: the itemset lattice over items {1,2,3,4}, searched from φ)
・ In practice, very fast
・ Frequency computation is the heaviest part
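A direct, runnable rendering of this backtracking scheme might look as follows (a simple sketch on the running example, not the tuned implementation discussed later; frequencies are recomputed from scratch, which is exactly the heavy part mentioned above):

```python
# Backtracking enumeration of all frequent patterns, starting from the empty set.
T = [frozenset(t) for t in ([1, 2, 5, 6, 7, 9], [2, 3, 4, 5], [1, 2, 7, 8, 9],
                            [1, 7, 9], [2, 7, 9], [2])]
theta = 3
ITEMS = sorted(set().union(*T))

def frequency(P):
    return sum(1 for t in T if P <= t)

def backtrack(P):
    print(sorted(P))                       # Output P
    for i in ITEMS:
        if P and i <= max(P):              # only items larger than the maximum item of P
            continue
        if frequency(P | {i}) >= theta:    # monotonicity: prune below an infrequent set
            backtrack(P | {i})

backtrack(frozenset())
```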

Evaluating Computation Time
・ Enumeration takes a long time when there are many outputs, so we evaluate efficiency by the "computation time per output" (throughput)
・ Recent good implementations of frequent pattern mining take constant time per output when the number of outputs is large
・ But dealing with constraint checking is not so trivial
We show algorithms for:
 - maximality within an equivalence class,
 - constraints on items,
 - constraints on additional items (rules)

Decrease #Solutions: Closed Patterns
・ Closed pattern: the maximal pattern among all patterns with the same occurrences
(figure: itemset lattice partitioned into equivalence classes, each with one closed pattern)
・ Usually, #closed patterns << #frequent patterns
・ Closed patterns do not lose the occurrence information
 Enumerate all frequent closed patterns instead of all frequent patterns. Our algorithm: LCM
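The closed pattern of a given pattern can be obtained as the intersection of its occurrences; a small sketch of that standard closure operation on the running example (illustrative, not the LCM code):

```python
# Closure of a pattern = intersection of all transactions containing it,
# i.e., the maximal pattern with the same denotation.
T = [frozenset(t) for t in ([1, 2, 5, 6, 7, 9], [2, 3, 4, 5], [1, 2, 7, 8, 9],
                            [1, 7, 9], [2, 7, 9], [2])]
ALL_ITEMS = frozenset(set().union(*T))

def closure(P):
    occ = [t for t in T if frozenset(P) <= t]
    return frozenset.intersection(*occ) if occ else ALL_ITEMS

print(sorted(closure({1, 2})))   # [1, 2, 7, 9]: {1,2} and {1,2,7,9} have the same occurrences
print(sorted(closure({7, 9})))   # [7, 9]: already closed
```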

Enumeration by ppc Extension
・ Any closed pattern can be generated from another closed pattern by 1. adding an item i, and 2. taking the maximal pattern with the same occurrences (the closure); but this generation is not unique
・ The generation is a ppc extension (prefix-preserving closure extension) if the generated pattern and the original have the same prefix, i.e., the same items smaller than i
・ Any closed pattern is generated uniquely by a ppc extension of exactly one other closed pattern
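A compact sketch of depth-first enumeration by ppc extension on the running example (this follows the published description of the scheme rather than the actual LCM implementation; items are assumed to be positive integers):

```python
# Enumerate all frequent closed patterns; each one is reached by exactly one
# chain of ppc extensions, so no duplicate detection or pattern storage is needed.
T = [frozenset(t) for t in ([1, 2, 5, 6, 7, 9], [2, 3, 4, 5], [1, 2, 7, 8, 9],
                            [1, 7, 9], [2, 7, 9], [2])]
theta = 3
ITEMS = sorted(set().union(*T))

def closure(P):
    occ = [t for t in T if P <= t]
    return frozenset.intersection(*occ) if occ else frozenset(ITEMS)

def frequency(P):
    return sum(1 for t in T if P <= t)

def enum_closed(P, prev_item):
    print(sorted(P))
    for i in ITEMS:
        if i in P or i <= prev_item:
            continue
        Q = P | {i}
        if frequency(Q) < theta:
            continue
        C = closure(Q)
        # ppc condition: taking the closure must not add any item smaller than i,
        # so the prefix (items < i) is preserved.
        if all(j in P for j in C if j < i):
            enum_closed(C, i)

enum_closed(closure(frozenset()), 0)   # start from the closure of the empty set
```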

Example
(figure: the running database T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} } and its closed patterns φ, {2}, {7,9}, {2,5}, {1,7,9}, {2,7,9}, {1,2,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,2,5,6,7,9}, connected by generation edges and ppc-extension edges)
・ usual generation gives an acyclic graph
・ ppc extension gives a tree

Time Complexity
・ One ppc extension needs O(||T||) time, where ||T|| is the sum of the sizes of the transactions, i.e., Σ_{T∈T} |T|
・ There are at most |I| candidates for ppc extensions (I: the current pattern)
 Closed patterns can be enumerated in O(|I|·||T||) time per pattern (without extra memory for previously found patterns)
・ In practice, the computation time can be smaller, by
 - recursively reducing the database,
 - generating all candidates at once by sweeping the reduced database
 This gives O(1) time per pattern if the number of outputs is large enough relative to the input size

Experimental Results
・ Usually very fast compared with other algorithms (except on dense databases)
・ Best Implementation Award at the FIMI '04 implementation competition

Constraints on Weight, Size, etc.
・ It is not difficult to add constraints on weights:
 - lower and/or upper bounds on
 - the size, sum, max., min., average, or variance of
 - item or transaction weights
(figure: itemset lattice)
・ If the constraints are anti-monotone, the running time is still linear in the number of solutions
・ Even if a constraint is monotone, the time is usually linear when the number of solutions is large
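As a hedged sketch of how such constraints plug into the search (the item weights and bounds below are made up for illustration): an anti-monotone constraint, such as an upper bound on the total item weight, can prune a branch exactly like the frequency threshold, while a monotone constraint, such as a lower bound, is only checked before output.

```python
# Backtracking with weight constraints on the running example.
T = [frozenset(t) for t in ([1, 2, 5, 6, 7, 9], [2, 3, 4, 5], [1, 2, 7, 8, 9],
                            [1, 7, 9], [2, 7, 9], [2])]
theta = 3
ITEMS = sorted(set().union(*T))
weight = {i: i for i in ITEMS}     # hypothetical item weights
W_MAX, W_MIN = 18, 8               # hypothetical upper / lower bounds on the weight sum

def frequency(P):
    return sum(1 for t in T if P <= t)

def backtrack(P):
    if sum(weight[i] for i in P) >= W_MIN:   # monotone constraint: output filter only
        print(sorted(P))
    for i in ITEMS:
        if P and i <= max(P):
            continue
        Q = P | {i}
        # anti-monotone constraints (frequency, weight upper bound): safe to prune
        if frequency(Q) >= theta and sum(weight[j] for j in Q) <= W_MAX:
            backtrack(Q)

backtrack(frozenset())
```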

Non-monotone Constraints
・ Hardness depends on the properties of the constraints; two strategies:
 - find the patterns satisfying the constraints, then check frequency and closedness
 - find the closed patterns, then check the constraints
(figure: itemset lattice, with the two strategies labeled FAST and SLOW)
・ In particular, if the constraints are given on the items (e.g., "must include A or B whenever it includes C"), the checking time is very short compared with frequency computation, so the total computation time increases only slightly; this covers
 - logical constraints
 - highly dependent patterns (frequency >> Π of the frequencies of their items)
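A tiny illustration of why item constraints are cheap: the check is a simple predicate over the pattern itself, independent of the database size (the concrete rule and pattern list below are made-up examples):

```python
# Example logical constraint on items: "if the pattern includes item 2,
# it must also include item 7 or item 9".  Checking it touches only the
# pattern, so it costs far less than one frequency computation.
def satisfies(P):
    return (2 not in P) or (7 in P) or (9 in P)

closed_patterns = [{2}, {2, 5}, {7, 9}, {2, 7, 9}, {1, 2, 7, 9}]  # e.g. output of LCM
print([sorted(P) for P in closed_patterns if satisfies(P)])
```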

Association Rule Mining
・ An association rule is a rule of the form (a,b,c) ⇒ d
・ If the transactions including d (resp. not including d) form a high ratio among the transactions including (a,b,c), the rule (a,b,c) ⇒ d (resp. ⇒ ¬d) is reliable and characterizes the database; finding good rules is an important problem
・ (a,b,c) has to be frequent, so that the rule is common in the database
・ However, evaluating the ratio for every pair of a closed pattern and an item takes a long time if done naively
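In standard terminology, the ratio above is the confidence of the rule: conf(P ⇒ d) = frequency(P ∪ {d}) / frequency(P). A minimal sketch of the naive evaluation on the running example (the next slide shows how to speed this up):

```python
# Naive rule evaluation: one extra frequency computation per candidate item d.
T = [frozenset(t) for t in ([1, 2, 5, 6, 7, 9], [2, 3, 4, 5], [1, 2, 7, 8, 9],
                            [1, 7, 9], [2, 7, 9], [2])]

def frequency(P):
    return sum(1 for t in T if frozenset(P) <= t)

def confidence(P, d):
    """Reliability of the rule P => d."""
    return frequency(set(P) | {d}) / frequency(P)

print(confidence({1, 7}, 9))   # 1.0: every transaction containing {1,7} also contains 9
print(confidence({2, 7}, 1))   # 0.666...: 2 of the 3 occurrences of {2,7} contain 1
```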

Occurrence Deliver
・ Compute the denotations of P ∪ {i} for all items i at once, by one sweep over the occurrences of P
(figure: the running database T with transactions labeled A to F; for P = {1,7}, each of the occurrences A, C, D is delivered to the buckets of its remaining items)
・ The frequencies of all items to be added are checked in time linear in the database size
・ The frequency of each added item gives the reliability of the corresponding rule, so the reliabilities are computed in a short time
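A sketch of occurrence deliver as described on this slide (illustrative code, not the LCM implementation): one sweep over the occurrences of P distributes each occurrence to the bucket of every item that could be added, giving all the frequencies, and hence all the rule reliabilities, at once.

```python
# Occurrence deliver: denotations of P ∪ {i} for all items i in one linear sweep.
from collections import defaultdict

T = [frozenset(t) for t in ([1, 2, 5, 6, 7, 9], [2, 3, 4, 5], [1, 2, 7, 8, 9],
                            [1, 7, 9], [2, 7, 9], [2])]

def occurrence_deliver(P):
    occurrences = [t for t in T if P <= t]
    buckets = defaultdict(list)              # item i -> denotation of P ∪ {i}
    for t in occurrences:                    # each occurrence is scanned exactly once
        for i in t - P:
            buckets[i].append(t)
    return occurrences, buckets

P = frozenset({1, 7})
occ, buckets = occurrence_deliver(P)
for i in sorted(buckets):
    # len(buckets[i]) is the frequency of P ∪ {i}; divided by len(occ) it is
    # the reliability (confidence) of the rule P => i.
    print(i, len(buckets[i]), len(buckets[i]) / len(occ))
```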

Results (figure only)

Conclusion
・ We have seen algorithms for enumerating frequent patterns with constraints
・ Closed patterns: decreasing the number of solutions without losing information (our algorithm LCM)
・ Closed patterns with monotone / anti-monotone / general constraints
・ Rule mining with closed patterns
Open questions: in the algorithmic sense these can all be done; how can they be implemented in a simple and easy way? Can closed patterns be defined for other kinds of patterns?