Leiden University. Efficient Frequent Query Discovery in FARMER. Siegfried Nijssen and Joost N. Kok. ECML/PKDD-2003, Cavtat.

Introduction

Frequent structure mining: given a set of complex structures (molecules, access logs, graphs, (free) trees, ...), find substructures that occur frequently.

Frequent structure mining approaches:
– Specialized: efficient algorithms for sequences, trees (FREQT, uFreqT) and graphs (gSpan, FSG)
– General: ILP algorithms (WARMR), biased graph mining algorithms (B-AGM)

Introduction

[Yan, SIGKDD 2003] Comparison between gSpan and WARMR on confirmed active AIDS molecules:
– gSpan: 2s
– WARMR: 6400s

Our goal: to build an efficient WARMR-like algorithm.

Overview

Problem description
Optimizations:
– Use a bias for tight problem specifications
– Perform a depth-first search
– Use efficient data structures in a new complete enumeration strategy which combines pruning with candidate generation
– Speed up evaluation by storing intermediate evaluation results; construct low-cost queries
Experiments & conclusions

Problem Description

The task of the algorithm:
– Given: a database of Datalog facts
– Find: the set of queries that occur frequently

Database of Facts

{e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
 e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,c),
 e(g2,n6,n7,b)}

[Figure: the two graphs encoded by these facts: g1 over nodes n1..n5 with edge labels a, b, c, and g2 with a single b-labeled edge from n6 to n7.]
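For concreteness, the slide's fact database can be written out as plain Python tuples. This flat-tuple encoding is a representation chosen for these notes, not FARMER's internal data structure:

```python
# The example database: Datalog facts e(Graph, From, To, Label) as tuples.
# An illustrative encoding, not the one FARMER itself uses.
facts = {
    ("g1", "n1", "n2", "a"), ("g1", "n2", "n1", "a"), ("g1", "n2", "n3", "a"),
    ("g1", "n3", "n1", "b"), ("g1", "n3", "n4", "b"), ("g1", "n3", "n5", "c"),
    ("g2", "n6", "n7", "b"),
}
```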

Queries

k(G) ← e(G,N1,N2,a), e(G,N2,N3,a), e(G,N1,N4,a), e(G,N4,N5,b)

[Figure: the query drawn as a graph over variables N1..N5 with edge labels a, a, a, b.]

Queries - Bias

For a fixed set of predicates, many kinds of queries are possible:
– k(G) ← e(G,N1,N2,a), e(G,N2,N3,a), e(G,N1,N4,a), e(G,N4,N5,b)
– k(G) ← e(G,N1,N2,L), e(G,N2,N3,L), e(G,N1,N4,L), e(G,N4,N5,L)

Our algorithm requires the user to specify a mode bias with types, primary keys, atom variable constraints, ...

Occurrence of Queries

Database D:
{e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
 e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a),
 e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b)}

Query Q:
k(G) ← e(G,N1,N2,a), e(G,N2,N3,a), e(G,N1,N4,a), e(G,N4,N5,b)

(WARMR) θ-subsumption: D ⊨ Q iff there is a substitution θ such that Qθ ⊆ D

θ = {G/g1, N1/n2, N2/n1, N3/n2, N4/n3, N5/n1}
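A minimal sketch of the θ-subsumption test on the tuple encoding above, using the Prolog convention that identifiers starting with an uppercase letter are variables; the function name and recursive formulation are illustrative, not the paper's:

```python
from typing import Optional

def is_var(term: str) -> bool:
    """Prolog convention: an identifier starting uppercase is a variable."""
    return term[0].isupper()

def theta_subsumes(facts, body, subst=None) -> Optional[dict]:
    """Find a substitution mapping all atoms of `body` into `facts`.
    Variables may share images: the mapping need not be injective,
    which is exactly what the next slides call counterintuitive."""
    subst = dict(subst or {})
    if not body:
        return subst                          # every atom matched
    atom, rest = body[0], body[1:]
    for fact in facts:
        if len(fact) != len(atom):
            continue
        trial, ok = dict(subst), True
        for q, f in zip(atom, fact):
            if is_var(q):
                if trial.setdefault(q, f) != f:   # conflicting binding
                    ok = False
                    break
            elif q != f:                          # constant must match literally
                ok = False
                break
        if ok:
            result = theta_subsumes(facts, rest, trial)
            if result is not None:
                return result
    return None

# Q from the slide: k(G) <- e(G,N1,N2,a), e(G,N2,N3,a), e(G,N1,N4,a), e(G,N4,N5,b)
body = [("G", "N1", "N2", "a"), ("G", "N2", "N3", "a"),
        ("G", "N1", "N4", "a"), ("G", "N4", "N5", "b")]
# theta_subsumes(facts, body) returns some substitution with G bound to "g1".
```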

Occurrence of Queries

[Figure: the substitution θ maps the query onto g1 with several variables sharing the same node, e.g. N1 and N3 both map to n2. Counterintuitive!]

Occurrence of Queries

Counterintuitive! Under θ-subsumption these two queries are equivalent:
– k(G) ← e(G,N1,N2,b), e(G,N2,N3,a), e(G,N3,N2,a), e(G,N3,N4,a)
– k(G) ← e(G,N1,N2,b), e(G,N2,N3,a), e(G,N3,N2,a)

[Figure: both queries drawn as graphs.]

Occurrence of Queries

(FARMER here) OI-subsumption: D ⊨ Q iff there is a substitution θ such that Qθ ⊆ D and:
– θ is injective
– θ does not map variables to constants occurring in Q

Advantages over θ-subsumption:
– in many situations (e.g. graphs) more intuitive
– if queries are equivalent, they are alphabetic variants; mode refinement is easier (proper)

Disadvantages?
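The OI variant, sketched under the same assumptions as theta_subsumes above: the only changes are the injectivity check and refusing to bind a variable to a constant that occurs in the query. The query's constants are collected up front:

```python
def oi_subsumes(facts, body, subst=None, forbidden=None) -> Optional[dict]:
    """Like theta_subsumes, but the substitution must be injective and
    must not map variables to constants occurring in the query (OI)."""
    if forbidden is None:                 # constants occurring in the body
        forbidden = {t for atom in body for t in atom if not is_var(t)}
    subst = dict(subst or {})
    if not body:
        return subst
    atom, rest = body[0], body[1:]
    for fact in facts:
        if len(fact) != len(atom):
            continue
        trial, ok = dict(subst), True
        for q, f in zip(atom, fact):
            if is_var(q):
                if q in trial:
                    ok = trial[q] == f
                elif f in trial.values() or f in forbidden:
                    ok = False            # not injective / image is a query constant
                else:
                    trial[q] = f
            else:
                ok = q == f
            if not ok:
                break
        if ok:
            result = oi_subsumes(facts, rest, trial, forbidden)
            if result is not None:
                return result
    return None
```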

Frequency

Database D:
{e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
 e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a),
 e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b)}

Query Q: k(G) ← e(G,N1,N2,a)

Frequency freq(Q): the number of different values for G for which the body is subsumed by the database. (Here only G = g1 has an a-labeled edge, so freq(Q) = 1.)
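With the subsumption test above, the slide's frequency measure is a count over graph identifiers; a sketch on the same illustrative encoding:

```python
def frequency(facts, body) -> int:
    """freq(Q): the number of values of G whose graph OI-subsumes the body."""
    graphs = {fact[0] for fact in facts}
    return sum(1 for g in sorted(graphs)
               if oi_subsumes(facts, body, {"G": g}) is not None)

# Q: k(G) <- e(G,N1,N2,a); on the slide's database only g1 matches:
# frequency(facts, [("G", "N1", "N2", "a")]) == 1
```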

Monotonicity

Frequent: frequency ≥ minsup, for a predefined threshold value minsup.

Monotonicity: if Q2 OI-subsumes Q1, then freq(Q1) ≤ freq(Q2)
⇒ if a query is infrequent, it should not be refined
⇒ if a query is subsumed by an infrequent query, it should not be considered

FARMER

FARMER(Query Q):
  determine refinements of Q
  compute frequency of refinements
  sort refinements
  for each frequent refinement Q' do
    FARMER(Q')
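The recursion on this slide, sketched in Python. Here `refine` stands in for the bias-driven refinement operator developed on the next slides and is passed in as a function; the sorting key is one possible choice, not FARMER's actual rule:

```python
def farmer(query, facts, minsup, refine, results):
    """Depth-first FARMER loop: refine Q, count the refinements, sort the
    frequent ones, and recurse into each. A sketch of the control flow,
    not of FARMER's actual implementation."""
    refinements = refine(query)
    counted = [(q, frequency(facts, q)) for q in refinements]
    frequent = [(q, n) for q, n in counted if n >= minsup]
    frequent.sort(key=lambda qn: qn[1])   # some sorting order (see later slides)
    for q, _ in frequent:
        results.append(q)
        farmer(q, facts, minsup, refine, results)
```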

Determine Refinements

Only one variant of each query should be counted and output.

Main problem: query equivalence under OI has graph isomorphism complexity.

Our approach:
– use ordered tree-based heuristics
– use efficient data structures to determine equivalence
– also perform other pruning during the exponential search

Determine Refinements [IJCAI'01]

[Figure: refinement tree rooted at e(G,N1,N2,a) and e(G,N1,N2,b); each node is extended with candidate atoms such as e(G,N2,N3,a), e(G,N1,N3,a), e(G,N2,N3,b), e(G,N1,N3,b), e(G,N3,N4,a), e(G,N3,N4,b).]

Determine Refinements

[Figure: the refinement tree from the previous slide, with the candidate atoms e(G,N2,N3,a), e(G,N1,N3,a), e(G,N2,N3,b), e(G,N1,N3,b), e(G,N3,N4,b) repeated under a sibling node.]

Determine Refinements

(In the paper) we prove that:
– refinement with this strategy is complete: of every frequent query defined by the bias, at least one variant is found
– the order of siblings does not matter for completeness (but they must have some order)

Determine Refinements

Incrementally generate variants.
Search for the variant (under construction) in the existing part of the query tree.
To optimize this search, siblings are stored in a tree-like hash structure.
If a query is found that is infrequent ⇒ query Q is pruned (monotonicity constraint!).
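One way to picture the "tree-like hash structure" for siblings, as a sketch: a trie keyed on canonically ordered atoms, so checking whether a variant of a candidate already exists costs one hash lookup per atom instead of a scan over all siblings. The canonical ordering itself is the hard part and is only assumed here:

```python
class SiblingTrie:
    """Stores queries (as sequences of canonically ordered atoms) so that
    duplicate variants are detected by hash lookups rather than scans.
    An illustration of the idea, not FARMER's exact data structure."""

    def __init__(self):
        self.children = {}       # atom -> SiblingTrie
        self.frequent = None     # None = unknown; False enables pruning

    def insert(self, atoms) -> "SiblingTrie":
        node = self
        for atom in atoms:       # assumed already in canonical order
            node = node.children.setdefault(atom, SiblingTrie())
        return node

    def lookup(self, atoms):
        node = self
        for atom in atoms:
            node = node.children.get(atom)
            if node is None:
                return None
        return node

# If lookup(...) finds a node with frequent == False, the candidate is
# pruned immediately (the monotonicity constraint on the slide).
```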

Frequency Computation

Main problem: finding an OI substitution has the same complexity as subgraph isomorphism, and is therefore NP-complete.

Our approach: avoid as much as possible that the same (exponential) computation is performed twice.

Frequency Computation

D = {e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
     e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a),
     e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b)}

Q = k(G) ← e(G,N1,N2,b)

For each value of G for which the database subsumes the query, the 'first' substitution is stored: e(g1,n3,n1,b) for g1 and e(g2,n6,n7,b) for g2.
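The memoisation on this slide, sketched with the helpers above: for every graph on which Q succeeds, keep one subsuming substitution so the work is not redone when Q is refined. (Which substitution comes 'first' depends on the scan order; the slide's examples correspond to e(g1,n3,n1,b) and e(g2,n6,n7,b).)

```python
def first_substitutions(facts, body) -> dict:
    """Map each graph id on which the body succeeds to one stored
    substitution; refinements later resume from these (next slide)."""
    stored = {}
    for g in sorted({fact[0] for fact in facts}):
        s = oi_subsumes(facts, body, {"G": g})
        if s is not None:
            stored[g] = s
    return stored

# For Q = k(G) <- e(G,N1,N2,b) this keeps one substitution for g1 and one
# for g2, e.g. via the facts e(g1,n3,n1,b) and e(g2,n6,n7,b).
```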

Frequency Computation

Once a query is refined, for each refinement the first subsuming substitution has to be determined.
This computation is performed in one backtracking procedure for all refinements together (like query packs).
This search starts from the substitution of the original query.
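A simplified sketch of evaluating all refinements together: starting from the parent query's stored substitution, try to extend it with every candidate atom in a single pass, rather than running an independent subsumption proof per refinement. The real procedure also backtracks into the parent's own atoms when an extension fails; that part is omitted here:

```python
def evaluate_refinements(facts, stored, candidate_atoms):
    """For each graph, extend the parent's stored substitution with every
    candidate atom at once and report which refinements succeed where."""
    hits = {atom: set() for atom in candidate_atoms}
    for g, subst in stored.items():
        for atom in candidate_atoms:
            if oi_subsumes(facts, [atom], subst) is not None:
                hits[atom].add(g)        # refinement (Q plus atom) occurs in g
    return hits

# Frequency of each refinement is then len(hits[atom]); candidates below
# minsup are discarded before recursing (monotonicity).
```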

Frequency Computation

D = {e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
     e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a),
     e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b)}

Q = k(G) ← e(G,N1,N2,b)

[Figure: one backtracking pass over D evaluates all refinement atoms e(G,N2,N3,a), e(G,N2,N3,b), e(G,N1,N3,b), e(G,N3,N4,b) together, resuming from the stored substitution e(g1,n3,n1,b) and backtracking through e(g1,n3,n4,b) and e(g1,n4,n5,b) on failure.]

Sorting Order

D = {e(g1,n1,n2,a), e(g1,n2,n1,a), e(g1,n2,n3,a),
     e(g1,n3,n1,b), e(g1,n3,n4,b), e(g1,n3,n5,a),
     e(g1,n4,n2,a), e(g1,n4,n5,b), e(g2,n6,n7,b)}

Q1 = k(G) ← e(G,N1,N2,b), e(G,N2,N3,a), e(G,N2,N3,b)
Q2 = k(G) ← e(G,N1,N2,b), e(G,N2,N3,b), e(G,N2,N3,a)

[Figure: the same atoms in different orders produce different backtracking traces over D, and hence different evaluation costs.]
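The effect the slide illustrates can be approximated with a selectivity heuristic: atoms matched by fewer facts cut the backtracking fan-out earlier when tried first. This heuristic only illustrates why atom order matters; it is not FARMER's actual sorting rule:

```python
def matching_facts(facts, atom) -> int:
    """Count facts whose constant positions agree with the atom."""
    return sum(1 for f in facts
               if len(f) == len(atom)
               and all(q == t for q, t in zip(atom, f) if not is_var(q)))

def order_body(facts, body):
    """Try the most selective atoms first to shrink the search tree."""
    return sorted(body, key=lambda atom: matching_facts(facts, atom))
```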

Experimental Results

Bongard dataset: 392 examples, minsup = 5%; WARMR emulates OI; machine: 350MHz, 192MB.

[Figure: runtime comparison chart; the reported running time is 1s.]

Experimental Results

Predictive Toxicology dataset:

Machine                      Algorithm   6%      7%
Pentium III 500MHz, 448MB    gSpan       5s
Dual Athlon MP, ?GB          FSG IP      11s     7s
Athlon XP, ?MB               Farmer      72s     48s
Pentium II 350MHz, 192MB     Farmer      224s    148s
Pentium III 500MHz, 448MB    FSG         248s
Dual Athlon MP, ?GB          FSG II      675s    23s
Pentium III 350MHz, 192MB    Warmr       >1h     >1h

Conclusions

We significantly decreased the performance gap between specialized algorithms and ILP algorithms.

We did so by:
– using (weak) object identity
– using a new complete enumeration strategy
– choosing query evaluation strategies with low costs (at the price of high memory use, however!)

Future work: provide better comparisons.