Graph Substructure Search Xuemin Lin School of Computer Science and Engineering University of New South Wales Sydney, Australia.


Applications of Graphs
• Chem-informatics: chemical compounds (small size)
• Bio-informatics: protein interaction networks (medium size)
• Other applications: social networks (large size), ...

Fundamental Problems in Graph Databases
Given a graph database D = {g1, ..., gn} of n data graphs and a query graph q:
• Substructure search: retrieve all data graphs that contain q.
  Application: substructure identification in chemical compounds, etc.
• Superstructure search: retrieve all data graphs that are contained in q.
  Application: molecule function prediction, etc.

Substructure Search

Similarity Search?
• Input mistakes
• Exploratory queries
• ...

Outline
• Substructure Search (VLDB08)
• Substructure Similarity Search (SIGMOD10)
• Superstructure Search (SSDBM10)
• Superstructure Similarity Search (ICDE2010)
• Conclusions and Remarks

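Substructure Search: gIndex (SIGMOD04)
• Index a set F of features from D. For each feature f in F, sup(f) is the set of ids of the graphs in D that contain f.
• Filtering: use the sup lists of the indexed features contained in q to compute a candidate set Cq.
• Verification: verify each data graph in Cq.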
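The filter-then-verify idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the gIndex implementation: the names `candidate_set` and `answer`, the toy `sup` table, and the stand-in verifier are all hypothetical, and the candidate set is taken to be the intersection of the ID lists of the indexed features found in the query.

```python
def candidate_set(all_graph_ids, sup, query_features):
    """Filtering: intersect the ID lists of the indexed features found in q.

    Assumption: a data graph can contain q only if it contains every
    feature that q contains, so any graph missing one feature is pruned.
    """
    candidates = set(all_graph_ids)
    for f in query_features:
        candidates &= sup[f]
    return candidates


def answer(candidates, contains_query):
    """Verification: keep only the candidates that really contain q."""
    return {gid for gid in candidates if contains_query(gid)}


# Hypothetical toy index: feature name -> ids of the data graphs containing it.
sup = {"A": {"g1", "g2"}, "B": {"g1", "g2", "g3"}}
cands = candidate_set({"g1", "g2", "g3"}, sup, ["A", "B"])

# Stand-in for the expensive subgraph isomorphism test: only g1 passes,
# so g2 plays the role of a false positive.
result = answer(cands, lambda gid: gid == "g1")
```

Filtering only touches set intersections; the costly isomorphism test runs once per surviving candidate, which is why a smaller candidate set matters.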

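gIndex (SIGMOD04)
[Figure: query q and data graphs g1, g2, g3; the ID-list of Feature A is {g1, g2}. Filtering: each data graph is pruned or passes on Feature A. Verification: each passing graph is either a false positive or an answer.]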

Tree+Delta (VLDB07)
Objective: reduce the cost of building the index.
Observations:
• Tree-based features are cheaper to generate.
• Most (95%) frequent subgraphs in sparse graphs are trees.
Tree+Delta: select frequent tree features, then add a small number of effective subgraphs.

FG Index (SIGMOD07)
Objective: index-only query processing.
• q is a frequent subgraph:
  - if q is indexed, return sup(q);
  - if a supergraph q' of q is indexed, no verification is needed for sup(q').
• q is not a frequent subgraph: the # of verifications is bounded by

QuickSI (VLDB08): our work
Objective: develop an efficient verification algorithm that speeds up both verification and filtering.
QuickSI: an efficient verification algorithm.
  - Encode query graphs so that the search terminates earlier.
  - Enforce connectivity.
  - Three novel pruning techniques.
Speed-up of up to orders of magnitude.

QuickSI
[Figure sequence: query q is matched against data graph g by depth-first traversal, with forwarding and backtracking steps.]

QuickSI
• Access infrequent labels as early as possible.
[Figure: depth-first traversal of q over g.]

QuickSI
Synchronized depth-first traversal on a sparse graph:
• Access infrequent labels as early as possible.
• Retain connectivity.
[Figure: 2x5=10 possible matching pairs on one slide; ONLY 2 possible matching pairs on the next.]

QuickSI
• Access infrequent labels as early as possible.
• Retain connectivity.
• Effectively use degree information.
Determine the access order for q.
[Figure: vertex pairs labeled Deg=3 / Deg=2 (stop here) and Deg=3 / Deg=3 (continue).]
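The three heuristics on these slides (rare labels first, a connected access order, degree checks) can be illustrated with a small backtracking matcher. This is a simplified sketch in the spirit of the slides, not the QuickSI algorithm itself. It assumes undirected, vertex-labeled graphs stored as a label dict plus an adjacency dict, a non-empty connected query, and non-induced subgraph isomorphism; the names `access_order` and `contains` are hypothetical.

```python
from collections import Counter


def access_order(q_labels, q_adj, label_freq):
    """Order query vertices: rarest label first, then always a neighbour
    of the already ordered part, so every prefix stays connected."""
    def rarity(v):
        return (label_freq.get(q_labels[v], 0), -len(q_adj[v]), v)

    order = [min(q_labels, key=rarity)]
    seen = set(order)
    while len(order) < len(q_labels):
        frontier = {u for v in order for u in q_adj[v]} - seen
        nxt = min(frontier, key=rarity)  # assumes q is connected
        order.append(nxt)
        seen.add(nxt)
    return order


def contains(q_labels, q_adj, g_labels, g_adj):
    """True if labelled graph q maps injectively into g, preserving edges."""
    order = access_order(q_labels, q_adj, Counter(g_labels.values()))
    mapping, used = {}, set()

    def feasible(u, v):
        # Label mismatch or too small a degree: stop here.
        if q_labels[u] != g_labels[v] or len(g_adj[v]) < len(q_adj[u]):
            return False
        # Every already matched neighbour of u must map to a neighbour of v.
        return all(mapping[w] in g_adj[v] for w in q_adj[u] if w in mapping)

    def search(i):
        if i == len(order):
            return True
        u = order[i]
        # Connectivity: only look next to an already matched neighbour.
        anchors = [mapping[w] for w in q_adj[u] if w in mapping]
        pool = g_adj[anchors[0]] if anchors else list(g_labels)
        for v in pool:
            if v not in used and feasible(u, v):
                mapping[u] = v          # forwarding
                used.add(v)
                if search(i + 1):
                    return True
                del mapping[u]          # backtracking
                used.discard(v)
        return False

    return search(0)


# Toy data graph: C - C - N (a path); queries: C - N and N - N.
g_labels = {0: "C", 1: "C", 2: "N"}
g_adj = {0: {1}, 1: {0, 2}, 2: {1}}
cn = contains({0: "C", 1: "N"}, {0: {1}, 1: {0}}, g_labels, g_adj)
nn = contains({0: "N", 1: "N"}, {0: {1}, 1: {0}}, g_labels, g_adj)
```

Starting from the rare label "N" leaves a single start vertex in g, and restricting later candidates to neighbours of matched vertices keeps the number of tried pairs small on sparse graphs.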

Experimental Results: Settings

Notation   Filtering                   Verification
GSI        gIndex (SIGMOD '04)         QuickSI
SSI        Swift Index (this paper)    QuickSI
FG         FG Index (SIGMOD '07)

Dataset: AIDS Antiviral dataset, a popular benchmark with 43k chemical compounds; "C", "N", and "O" are the most frequent labels. The data sets and query sets are the same as in gIndex and FG Index.

Experiments – Response Time
[Tables for the real dataset and the large real dataset: construction time, # of features, and index size (M) for FG, GSI, and SSI.]

Substructure Similarity Search – Grafil (SIGMOD05)
• Maximum common subgraph (MCS): given g1 and g2, mcs(g1, g2) is the common subgraph of g1 and g2 with the maximal number of edges.
• Grafil: find all g in D s.t. |q| - |mcs(q, g)| ≤ σ, where |q| - |mcs(q, g)| is the number of missing edges.
• Some variants...

Substructure Similarity Search Subgraph Similarity ?

Connected Substructure Similarity Search: our work (SIGMOD10)
• Maximum connected common subgraph (MCCS): given g1 and g2, mccs(g1, g2) is the connected common subgraph of g1 and g2 with the maximal number of edges.
• dis(q, g) := |q| - |mccs(q, g)|
• Goal: find all g in D s.t. dis(q, g) ≤ σ.
• The problem is NP-complete.
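To make the distance concrete, the following brute-force sketch computes dis(q, g) = |q| - |mccs(q, g)| for tiny graphs. It is exponential and only meant as a reference for the definition, not as the verification algorithm of the talk. Assumptions: graphs are undirected, only vertices carry labels, |q| counts edges, and the helper names (`is_connected`, `embeds`, `dis`) are hypothetical.

```python
from itertools import combinations, permutations


def is_connected(edges):
    """True if the (non-empty) edge set forms a single connected component."""
    edges = list(edges)
    reached, stack = set(), [edges[0][0]]
    while stack:
        v = stack.pop()
        if v in reached:
            continue
        reached.add(v)
        stack.extend(b if a == v else a for a, b in edges if v in (a, b))
    return reached == {v for e in edges for v in e}


def embeds(sub_edges, q_labels, g_labels, g_edges):
    """Brute force: can the labelled edge set be mapped injectively into g?"""
    verts = sorted({v for e in sub_edges for v in e})
    g_set = {frozenset(e) for e in g_edges}
    for image in permutations(g_labels, len(verts)):
        m = dict(zip(verts, image))
        if all(q_labels[v] == g_labels[m[v]] for v in verts) and all(
            frozenset((m[a], m[b])) in g_set for a, b in sub_edges
        ):
            return True
    return False


def dis(q_labels, q_edges, g_labels, g_edges):
    """Number of edges of q missing from the largest connected common subgraph."""
    for k in range(len(q_edges), 0, -1):  # try the largest subgraphs first
        for sub in combinations(q_edges, k):
            if is_connected(sub) and embeds(sub, q_labels, g_labels, g_edges):
                return len(q_edges) - k
    return len(q_edges)


labels = {0: "C", 1: "C", 2: "C"}
triangle = [(0, 1), (1, 2), (0, 2)]
path = [(0, 1), (1, 2)]
d_triangle_in_path = dis(labels, triangle, labels, path)
d_path_in_triangle = dis(labels, path, labels, triangle)
```

The two calls show that the measure is not symmetric: the triangle misses one edge in the path, while the path is fully contained in the triangle.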

Filtering: triangle inequality?
If dis(Q,D) + dis(D,F) ≥ dis(Q,F) holds, then dis(Q,D) ≥ dis(Q,F) - dis(D,F).
[Figure: query Q, feature F, and data graph D, with the three distances dis(Q,D), dis(Q,F), and dis(D,F) marked between them.]

Filtering: triangle inequality?
dis(Q,D) + dis(D,F) ≥ dis(Q,F)?
[Figure, built up over three slides: an example on Q, D, and F with distance labels 1, 2, and 2, for which the inequality holds.]

Filtering: triangle inequality?
dis(Q,D) + dis(D,F) ≥ dis(Q,F) does not always hold.
[Figure: a counterexample on Q, F, and D with distance labels 0, 1, and 3.]

Connectivity Dominance
Connectivity dominance: the connectivity of mccs(g1, g2) dominates the connectivity of g2 if there is a subgraph-isomorphic mapping F from mccs(g1, g2) to g2 such that, whenever removing a set S of edges from mccs(g1, g2) disconnects mccs(g1, g2), removing F(S) from g2 also disconnects g2.
Theorem. Given three graphs g1, g2, and g3, if the connectivity of mccs(g1, g2) dominates g2 or the connectivity of mccs(g2, g3) dominates g2, then dis(g1, g3) ≤ dis(g1, g2) + dis(g2, g3).
Remark: a linear algorithm exists, given the embedding.

• Validation Rule 1 (index only), from dis(Q,F) + dis(F,D) ≥ dis(Q,D):
  dis(Q,F) + dis(F,D) ≤ σ => dis(Q,D) ≤ σ (if mccs(Q,F) dominates F or mccs(F,D) dominates F)
• Pruning Rule 1, from dis(Q,D) + dis(D,F) ≥ dis(Q,F):
  dis(Q,F) - dis(D,F) > σ => dis(Q,D) > σ (if mccs(D,F) dominates D)
• Pruning Rule 2, from dis(F,Q) + dis(Q,D) ≥ dis(F,D):
  dis(F,D) - dis(F,Q) > σ => dis(Q,D) > σ (if mccs(F,Q) dominates Q)
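The three rules can be read as a small decision procedure for one data graph D and one feature F. The sketch below assumes the four distances and the three dominance conditions have already been computed elsewhere; the function name `filter_decision`, its parameter names, and the return strings are hypothetical. The distance is not symmetric, so all four directed distances are passed separately.

```python
def filter_decision(d_QF, d_FD, d_DF, d_FQ, sigma,
                    dom_F=False, dom_D=False, dom_Q=False):
    """Classify one data graph D against query Q using one feature F.

    d_XY is the distance from X to Y; dom_F, dom_D and dom_Q say whether
    the dominance condition attached to each rule holds (assumed given).
    """
    # Validation Rule 1: dis(Q,D) <= dis(Q,F) + dis(F,D) <= sigma
    if dom_F and d_QF + d_FD <= sigma:
        return "answer"      # no verification needed
    # Pruning Rule 1: dis(Q,D) >= dis(Q,F) - dis(D,F) > sigma
    if dom_D and d_QF - d_DF > sigma:
        return "pruned"
    # Pruning Rule 2: dis(Q,D) >= dis(F,D) - dis(F,Q) > sigma
    if dom_Q and d_FD - d_FQ > sigma:
        return "pruned"
    return "candidate"       # must be verified


validated = filter_decision(1, 0, 2, 3, sigma=1, dom_F=True)
pruned = filter_decision(4, 5, 1, 3, sigma=2, dom_D=True)
undecided = filter_decision(4, 5, 1, 3, sigma=2)
```

Without the matching dominance flag a rule is skipped, because the triangle inequality it relies on is not guaranteed to hold.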

Verification Algorithm
Basic idea:
1. Enumerate sub-spanning trees of the query graph such that the # of missing edges ≤ σ; try to terminate the algorithm as early as possible.
2. Share the enumeration costs in two ways:
   a. do not enumerate everything from scratch;
   b. once enumerated, keep the enumerated spanning trees, organized in a binary tree to reduce storage space.
3. Extend QuickSI [VLDB08].

Experiments

cIndex [VLDB’07]: superstructure search
Filtering-Verification Framework
• Filter false results by a feature-based index: exclusion based.
• Verify each candidate against the query graph.
[Figure: database graphs ga, gb, gc, index features fa, fb, fc, and query q. Filtering marks two graphs "Filtered!" and one "Candidate!"; verification marks the candidate "Answer!".]

GPTree [EDBT’09]
Enhanced Filtering-Verification Framework
• Share test cost in filtering and verification, respectively.
[Figure: features fa, fb and data graphs ga, gb, gc, annotated with the questions: sharing across groups? sharing between two phases? sharing across suffixes?]

PrefIndex [SSDBM10]
Sharing-based Filtering-Verification Framework
• Share test cost in filtering and verification, respectively.
[Figure: features fa, fb and data graphs ga, gb, gc, annotated with: sharing across groups; sharing between two phases; no sharing across suffixes.]

Computation Sharing Cost Model
Cost gain (computation sharing benefits)
• Given k master groups of data graphs, assume that 1) all data graphs in each group Gi contain a master feature fi (1 ≤ i ≤ k); 2) the cost of the subgraph isomorphism test from fi to a query graph q is cost_fi.
• The total cost gain (computation sharing benefits) from each master group Gi can be represented as follows:
Maximized gain
• Cluster the database into a disjoint set of master groups such that the total gain is maximized (NP-hard).

Efficiency Test
Database and query sets
• Database: AIDS10K
• Query sets: Q20, Q40, Q60, Q80, Q80+

Superstructure Similarity Search: our work (ICDE10)
Given a q and a g, dis(q, g) = |g| − |mccs(q, g)|.
Superstructure similarity search: find all g from D such that dis(q, g) ≤ σ.
Note: dis(q, g) = |q| − |mccs(q, g)| in substructure similarity search.
Observations:
• The filtering framework in SIGMOD10 is immediately applicable.
• The techniques in SIGMOD10 may not be effective for a nearly "super-containment" relationship.
• Sharing is possible, just like in PrefIndex.

SG-Enum Index (ICDE2010)
Key ideas:
1. For a g, enumerate all subgraphs with at most σ edges removed (σ-missing subgraphs).
2. dis(q, g) ≤ σ if and only if q contains a σ-missing subgraph.
Key issues:
• Automorphic subgraphs?
• Prefix-sharing? In one g; among different data graphs.
• Query processing?
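Enumerating the σ-missing subgraphs of one data graph can be sketched with edge-subset enumeration. This is a naive illustration that treats a subgraph simply as an edge set: it applies no connectivity filter and does not merge automorphic duplicates (one of the key issues listed on the slide). The function name `sigma_missing_subgraphs` and the toy triangle are hypothetical.

```python
from itertools import combinations


def sigma_missing_subgraphs(edges, sigma):
    """Yield every edge subset of g obtained by removing at most sigma edges."""
    edges = sorted(edges)
    full = frozenset(edges)
    for removed in range(min(sigma, len(edges)) + 1):
        for drop in combinations(edges, removed):
            yield full - frozenset(drop)


triangle = [("a", "b"), ("a", "c"), ("b", "c")]
subs = list(sigma_missing_subgraphs(triangle, 1))
```

For a graph with m edges this produces the sum of C(m, i) for i = 0..σ edge sets, which is why sharing prefixes across these subgraphs matters for the index size.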

SG-Enum Index (ICDE2010)
Top-down construction:
1. Enumerate all σ-missing subgraphs.
2. Iteratively choose an edge as follows:
   a) Always select an edge contained in the most σ-missing subgraphs.
   b) Split the group into two: one that contains the edge and one that does not.
Bottom-up construction:
1. Generate a sequence for each σ-missing subgraph.
2. Merge the prefixes by chance.
Bottom-up among data graphs.
Query algorithm: extends QuickSI.
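The top-down construction can be pictured as a recursive binary split. The sketch below is one possible reading of the two steps, not the SG-Enum implementation: subgraphs are plain edge sets, the tree is a nested dict, ties between equally frequent edges are broken by sorting, and an edge is never reused along a path. All names (`build`, `leaves`) and the stopping conditions are assumptions.

```python
from collections import Counter


def build(group, used=frozenset()):
    """Recursively split a group of edge sets on its most frequent unused edge."""
    counts = Counter(e for s in group for e in s if e not in used)
    if len(group) <= 1 or not counts:
        return {"leaf": list(group)}
    # sorted() only makes the tie-breaking deterministic.
    edge = max(sorted(counts), key=counts.get)
    with_edge = [s for s in group if edge in s]
    without_edge = [s for s in group if edge not in s]
    return {
        "edge": edge,
        "with": build(with_edge, used | {edge}),
        "without": build(without_edge, used | {edge}),
    }


def leaves(node):
    """Collect the subgraphs stored at the leaves of the tree."""
    if "leaf" in node:
        return list(node["leaf"])
    return leaves(node["with"]) + leaves(node["without"])


# The four 1-missing subgraphs of a triangle, written out explicitly.
tri = [("a", "b"), ("a", "c"), ("b", "c")]
subgraphs = [frozenset(tri)] + [frozenset(tri) - {e} for e in tri]
tree = build(subgraphs)
```

Splitting on the most frequent edge first keeps that edge test near the root, so it is shared by as many subgraphs as possible.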

Experiments – Query Response Time

Conclusion and Remarks
• Substructure search and its similarity search.
• Superstructure search and its similarity search.
(VLDB08, ICDE10, SIGMOD10, SSDBM10)
Issues:
• Similarity measures?
• Large data graphs?