gSpan: Graph-based substructure pattern mining


gSpan: Graph-based substructure pattern mining Authors: Xifeng Yan and Jiawei Han Presented by: Colin Luther

Copyright note: This presentation was originally provided by Prof. Xifeng Yan upon request from a student. Citation: Xifeng Yan and Jiawei Han. gSpan: Graph-based substructure pattern mining. In IEEE International Conference on Data Mining (ICDM), 2002.

Outline: Background; Problem Definition; Authors' Contribution; Concepts behind gSpan; Experimental Results; Conclusion

Background Frequent subgraph mining extends existing frequent pattern mining algorithms to graph data. A major challenge is counting how many instances of a pattern occur in the dataset. Counting instances is easy for sets but subtle for graphs: recall the graph isomorphism problem.

Background Two isomorphic graphs (a) and (b) with their mapping function (c). G1=(V1,E1,L1), G2=(V2,E2,L2). Mapping: f(V1.1) = V2.2, f(V1.2) = V2.5, f(V1.3) = V2.3, f(V1.4) = V2.4, f(V1.5) = V2.1. [Figure: the two labeled graphs (a) and (b).] Two graphs are isomorphic if one can find a one-to-one mapping from the nodes of the first graph to the nodes of the second graph such that edges and the labels on nodes and edges are preserved.
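The definition of isomorphism above can be made concrete with a brute-force check. The sketch below is only an illustration: the graph representation (a dictionary of vertex labels plus a dictionary of edge labels keyed by `frozenset`), the function name `find_isomorphism`, and the toy graphs are all assumptions made for this example, not part of the presentation. It tries every vertex ordering, so it is only usable on very small graphs.

```python
from itertools import permutations

def find_isomorphism(g1, g2):
    """Return a label-preserving vertex mapping from g1 to g2, or None.

    A graph is a pair (vertex_labels, edge_labels):
      vertex_labels: dict  vertex -> label
      edge_labels:   dict  frozenset({u, v}) -> label  (undirected edge)
    Brute force over all vertex orderings, so only usable on tiny graphs.
    """
    v1, e1 = g1
    v2, e2 = g2
    if len(v1) != len(v2) or len(e1) != len(e2):
        return None
    nodes1 = list(v1)
    for image in permutations(list(v2)):
        f = dict(zip(nodes1, image))
        # Vertex labels must be preserved.
        if any(v1[u] != v2[f[u]] for u in nodes1):
            continue
        # Every edge of g1 must map to an edge of g2 with the same label.
        if all(e2.get(frozenset(f[u] for u in edge)) == label
               for edge, label in e1.items()):
            return f
    return None

# Hypothetical toy graphs: a path X -a- Y -b- Z under two vertex namings.
g1 = ({1: "X", 2: "Y", 3: "Z"},
      {frozenset({1, 2}): "a", frozenset({2, 3}): "b"})
g2 = ({"p": "Z", "q": "Y", "r": "X"},
      {frozenset({"r", "q"}): "a", frozenset({"q", "p"}): "b"})
# Same shape as g2 but with the two edge labels swapped.
g3 = ({"p": "Z", "q": "Y", "r": "X"},
      {frozenset({"r", "q"}): "b", frozenset({"q", "p"}): "a"})

print(find_isomorphism(g1, g2))  # a mapping exists
print(find_isomorphism(g1, g3))  # no mapping preserves the edge labels
```

The factorial cost of this check is the reason frequent subgraph miners try to avoid repeated pairwise isomorphism tests.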

Problem: Finding Frequent Subgraphs Problem setting: similar to finding frequent itemsets for association rule discovery. Input: a database of graph transactions and a minimum support threshold. Each transaction is an undirected simple graph (no multiple edges) with labeled edges and vertices; transactions need not be connected. Output: the frequent subgraphs that satisfy the support threshold, where each frequent subgraph is connected.

Finding Frequent Subgraphs Xifeng Yan

Authors' Contribution Graphs are represented as strings (as in TreeMiner). No candidate generation: “It combines the growing and checking of frequent subgraphs into one procedure, thus accelerates the mining process.” Very fast, and still a standard baseline against which most rival systems are compared.

Concepts behind gSpan The idea is to produce a Depth-First Search (DFS) code for each graph, built from one tuple per edge. Edges are sorted according to the lexicographic order of their codes. Yan and Han proved that graph isomorphism can be tested by comparing the minimum DFS codes of two graphs. Starting from small patterns containing one edge, patterns are expanded systematically by depth-first search. The anti-monotonic property of graph frequency is used to prune the search.

Anti-Monotonicity of graph frequency The frequency of a super-pattern is less than or equal to the frequency of a sub-pattern. Copyright SIGMOD’08
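Support counting and the anti-monotone property can be illustrated on a tiny graph database. This is a minimal sketch under stated assumptions: the graph representation, the function names `contains` and `support`, and the three toy transactions are invented for this example, and the containment test is a brute-force subgraph search rather than anything gSpan itself does.

```python
from itertools import permutations

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a label-preserving subgraph.

    Both arguments are pairs (vertex_labels, edge_labels), where edge_labels
    maps frozenset({u, v}) to a label. Brute force, tiny graphs only.
    """
    gv, ge = graph
    pv, pe = pattern
    p_nodes = list(pv)
    # Try every injective placement of the pattern's vertices in the graph.
    for image in permutations(list(gv), len(p_nodes)):
        f = dict(zip(p_nodes, image))
        if all(pv[u] == gv[f[u]] for u in p_nodes) and all(
            ge.get(frozenset(f[u] for u in edge)) == label
            for edge, label in pe.items()
        ):
            return True
    return False

def support(database, pattern):
    """Number of graph transactions that contain the pattern."""
    return sum(contains(g, pattern) for g in database)

# Hypothetical database of three graph transactions.
ga = ({0: "X", 1: "Y", 2: "Z"},
      {frozenset({0, 1}): "a", frozenset({1, 2}): "b"})
gb = ({0: "X", 1: "Y"}, {frozenset({0, 1}): "a"})
gc = ({0: "X", 1: "Z"}, {frozenset({0, 1}): "c"})
database = [ga, gb, gc]

sub_pattern = ({0: "X", 1: "Y"}, {frozenset({0, 1}): "a"})          # X -a- Y
super_pattern = ({0: "X", 1: "Y", 2: "Z"},
                 {frozenset({0, 1}): "a", frozenset({1, 2}): "b"})  # X -a- Y -b- Z

print(support(database, sub_pattern), support(database, super_pattern))
```

Because every transaction that contains the super-pattern also contains the sub-pattern, a miner can stop extending a pattern as soon as its support drops below the threshold.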

Lexicographic Ordering in Graph It tells us the order of two graphs. The design helps us build a similar hierarchy. It should guarantee easy growing from one level to the next lower level and easy rolling up from a lower level to a higher one. For graphs, it may be difficult to find a design in which no two nodes of this tree are the same. It tells us whether a graph has already been discovered. Most importantly, if a graph has been discovered, all its child nodes in the hierarchy must have been discovered as well.

Lexicographic Ordering in Graph [Figure: search hierarchy with levels for 1-edge, 2-edge, and 3-edge graphs.]

DFS code and Minimum DFS code Depth First Tree and Forward/Backward Edge Set

DFS code and Minimum DFS code We use a 5-tuple (vi, vj, l(vi), l(vj), l(vi,vj)) to represent an edge. (It may be redundant, but it is much easier to understand.) Turn a graph into a sequence whose basic element is this 5-tuple. Form the sequence in the following order: to extend the graph by one new node, add the forward edge that connects a node in the old graph to the new node; then add all backward edges that connect the new node to other nodes in the old graph; repeat this procedure.
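The construction described above can be sketched as a recursive depth-first traversal that writes one 5-tuple per edge. The function name `dfs_code`, its parameters, the graph representation, and the five-vertex toy graph are assumptions made for this illustration; the sketch also assumes a connected graph and emits each vertex's backward edges in ascending order of the target's discovery index.

```python
def dfs_code(vertex_labels, edge_labels, start, neighbor_order=sorted):
    """Build one DFS code (a list of 5-tuples) for a connected undirected graph.

    vertex_labels: dict vertex -> label
    edge_labels:   dict frozenset({u, v}) -> label
    neighbor_order decides which undiscovered neighbor is explored first, so
    different choices (and different start vertices) give different codes.
    """
    adj = {}
    for edge in edge_labels:
        u, v = tuple(edge)
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    index = {start: 0}   # discovery index of each visited vertex
    used = set()         # edges already written to the code
    code = []

    def emit(u, v):
        used.add(frozenset((u, v)))
        code.append((index[u], index[v], vertex_labels[u], vertex_labels[v],
                     edge_labels[frozenset((u, v))]))

    def visit(u):
        # Backward edges: from the new vertex u to already discovered vertices.
        back = [w for w in adj.get(u, [])
                if w in index and frozenset((u, w)) not in used]
        for w in sorted(back, key=index.get):
            emit(u, w)
        # Forward edges: each one discovers a new vertex, then we recurse.
        for w in neighbor_order(adj.get(u, [])):
            if w not in index:
                index[w] = len(index)
                emit(u, w)
                visit(w)

    visit(start)
    return code

# Hypothetical labeled graph with five vertices and six edges.
labels = {0: "x", 1: "y", 2: "x", 3: "z", 4: "z"}
edges = {frozenset({0, 1}): "a", frozenset({1, 2}): "b", frozenset({2, 0}): "a",
         frozenset({2, 3}): "c", frozenset({3, 1}): "b", frozenset({1, 4}): "d"}

for tup in dfs_code(labels, edges, start=0):
    print(tup)
```

In each tuple, a forward edge has a smaller first index than second index, and a backward edge has the opposite. Changing `start` or `neighbor_order` yields a different code for the same graph.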

DFS code [Figure: labeled graph with vertices v0 to v4 and its DFS tree.] e0: (0,1,x,y,a), e1: (1,2,y,x,b), e2: (2,0,x,x,a), e3: (2,3,x,z,c), e4: (3,1,z,y,b), e5: (1,4,y,z,d)

Minimum DFS code Each graph may have many DFS codes (why?); the lexicographically smallest one is its minimum DFS code.

Graph Parent and its Children Given a DFS code c0=(e0,e1,…,en) and a code c1=(e0,e1,…,en,ex) that extends it by one edge, if c0<c1 then c0 is c1’s parent and c1 is c0’s child.
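With DFS codes stored as lists of 5-tuples, the parent and child relation reduces to a prefix test. The helper names `is_parent` and `is_ancestor` and the example codes below are hypothetical, chosen only to illustrate the idea.

```python
def is_parent(c0, c1):
    """True if DFS code c1 is c0 extended by exactly one trailing edge."""
    return len(c1) == len(c0) + 1 and list(c1[:len(c0)]) == list(c0)

def is_ancestor(c0, c1):
    """True if DFS code c0 is a proper prefix of DFS code c1."""
    return len(c0) < len(c1) and list(c1[:len(c0)]) == list(c0)

# Hypothetical codes: each one grows the previous code by one edge.
c0 = [(0, 1, "x", "y", "a")]
c1 = c0 + [(1, 2, "y", "z", "b")]
c2 = c1 + [(2, 0, "z", "x", "c")]

print(is_parent(c0, c1), is_parent(c0, c2), is_ancestor(c0, c2))
```

Growing a pattern by one edge therefore corresponds to moving one level down in the DFS code tree.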

DFS Code Tree [Figure: DFS code tree with levels for 1-edge, 2-edge, and 3-edge graphs.]

Theorem 1. Given two graphs G0 and G1, G0 is isomorphic to G1 iff min_dfs_code(G0)=min_dfs_code(G1). 2. The DFS Code Tree covers all graphs, although some tree nodes may represent the same graph. 3. Given a node in the DFS Code Tree, if its DFS code is not its minimum DFS code, pruning this node and all its descendants does not change the “covering”.
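Point 1 of the theorem can be illustrated by computing a minimum DFS code by brute force and comparing codes for equality. This is a sketch under assumptions, not the procedure gSpan uses: it enumerates every DFS traversal by trying all vertex priorities, which is factorial in the number of vertices, and `edge_key` is an assumed encoding of the edge order (backward edges before forward edges, forward edges from deeper vertices first, labels as tie-breakers). All names and the toy graphs are hypothetical, and the graphs are assumed to be connected with at least one edge.

```python
from itertools import permutations

def dfs_code(vertex_labels, edge_labels, priority):
    """DFS code obtained by exploring neighbors in the given vertex priority."""
    rank = {v: r for r, v in enumerate(priority)}
    adj = {v: [] for v in vertex_labels}
    for edge in edge_labels:
        u, v = tuple(edge)
        adj[u].append(v)
        adj[v].append(u)
    index, used, code = {priority[0]: 0}, set(), []

    def emit(u, v):
        used.add(frozenset((u, v)))
        code.append((index[u], index[v], vertex_labels[u], vertex_labels[v],
                     edge_labels[frozenset((u, v))]))

    def visit(u):
        back = [w for w in adj[u]
                if w in index and frozenset((u, w)) not in used]
        for w in sorted(back, key=index.get):   # backward edges first
            emit(u, w)
        for w in sorted(adj[u], key=rank.get):  # then forward edges
            if w not in index:
                index[w] = len(index)
                emit(u, w)
                visit(w)

    visit(priority[0])
    return code

def edge_key(t):
    """Assumed sort key for one 5-tuple (i, j, l_i, l_j, l_ij)."""
    i, j, li, lj, le = t
    if i > j:                       # backward edge
        return (0, j, le, li, lj)
    return (1, -i, li, le, lj)      # forward edge: deeper source first

def min_dfs_code(vertex_labels, edge_labels):
    """Smallest DFS code over all traversals (brute force, tiny graphs only)."""
    codes = (dfs_code(vertex_labels, edge_labels, p)
             for p in permutations(list(vertex_labels)))
    return min(codes, key=lambda c: [edge_key(t) for t in c])

def isomorphic(g0, g1):
    return min_dfs_code(*g0) == min_dfs_code(*g1)

# Hypothetical graph G, a renamed copy H, and K with one edge label changed.
G = ({0: "x", 1: "y", 2: "x", 3: "z", 4: "z"},
     {frozenset({0, 1}): "a", frozenset({1, 2}): "b", frozenset({2, 0}): "a",
      frozenset({2, 3}): "c", frozenset({3, 1}): "b", frozenset({1, 4}): "d"})
rename = {0: "e", 1: "d", 2: "c", 3: "b", 4: "a"}
H = ({rename[v]: l for v, l in G[0].items()},
     {frozenset(rename[v] for v in e): l for e, l in G[1].items()})
K = (dict(G[0]), {**G[1], frozenset({1, 4}): "e"})

print(isomorphic(G, H), isomorphic(G, K))
```

The same comparison supports point 3: a tree node whose code differs from the minimum DFS code of its own graph is a duplicate and can be pruned.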

Algorithm

Algorithm

Experimental Result

Experimental Result

Conclusion No candidate generation and no false-positive test. Space saving from depth-first search. Good performance: with a “memory pool” and one major counting improvement, performance seems likely to improve by a further factor of 5 (but more testing is needed).

Exam Questions Q1) What two major costs of Apriori-like frequent substructure mining algorithms did gSpan aim to reduce or avoid? Answer: 1) Creating size k+1 candidate subgraphs from size k frequent subgraphs is more complicated and costly than standard Apriori large itemset generation. 2) Pruning false positives is an expensive process, because the subgraph isomorphism problem is NP-complete.

Exam Questions (cont.) Q2) Which DFS tree does the DFS code below belong to? Answer: tree (c)

Exam Questions Q3) What does gSpan compare when testing for isomorphism between two graphs, and why? Answer: gSpan compares the minimum DFS codes of the two graphs. Given two graphs G and G’, G is isomorphic to G’ if and only if min(G)=min(G’). This theorem reduces the comparison of complicated graphs to a simple string comparison. If two nodes of the DFS code tree represent the same graph but have different DFS codes, we can prune the sub-branch of the rightmost of the two nodes. This greatly decreases the problem size.

Questions?