Diversified Top-k Subgraph Querying in a Large Graph

Zhengwei Yang, Ada Wai-Chee Fu, Ruifeng Liu. The Chinese University of Hong Kong. SIGMOD, 2016

Motivation
Top-k subgraph querying asks for a set of up to k subgraphs isomorphic to a given query graph. The k subgraphs returned are often highly overlapping and not very representative. This work aims to return k subgraphs with as little overlap as possible. Diversified top-k results are useful because the number of matching subgraphs can be very large.

Problem Definition
Given an undirected labeled graph G = (V, E, Σ, L), where Σ is a set of vertex labels and L(v) ∈ Σ is the label of vertex v, a query graph Q = (VQ, EQ, ΣQ, LQ), and an integer k, the diversified subgraph querying problem (DSQ) returns k subgraphs of G that are isomorphic to Q and that together cover the largest possible number of vertices of G.

An Example of A Collaboration Network
(a) is the query graph and k = 2. There are many matching subgraphs in G. One possible result is {v3, v8, v7, v12}, {v3, v8, v9, v12}; another is {v1, v5, v4, v10}, {v2, v6, v7, v12}. Which one is better? The first covers only 5 distinct vertices, while the second covers 8, so the second is more diverse. Diversified top-k subgraph querying aims to reduce the overlapping information among the matches. Diversity is measured by the number of vertices covered by all the subgraphs in the result.
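The diversity measure can be checked directly on the two candidate results from this slide. The sketch below is only an illustration; the function name `coverage` and the use of Python sets for embeddings are assumptions, not part of the original slides.

```python
def coverage(embeddings):
    """Number of distinct data vertices covered by a collection of embeddings."""
    covered = set()
    for emb in embeddings:
        covered.update(emb)
    return len(covered)

# The two possible results for k = 2 listed on the slide.
result_a = [{"v3", "v8", "v7", "v12"}, {"v3", "v8", "v9", "v12"}]
result_b = [{"v1", "v5", "v4", "v10"}, {"v2", "v6", "v7", "v12"}]

print(coverage(result_a))  # 5
print(coverage(result_b))  # 8
```

Under this measure the second result is preferred, since its two subgraphs share no vertex.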

Naïve Solution and Discussion
The naïve way to solve the DSQ problem has two steps. First, find all subgraphs that match Q and store them in a set S (the subgraph querying, or SQ, problem). Second, find a subset of S of size k with maximum coverage (the maximum k-coverage problem). Both are NP-hard problems.
The proposed solution adapts existing algorithms for these two problems: the Ullmann framework for the SQ problem, and the swapping approach of diversified top-k maximal clique search for finding diversified sets.
The authors show that greedy algorithms for maximum coverage may not be feasible, because the number of isomorphic subgraphs is very large and generating and storing all of them may be prohibitive. They also discuss streaming algorithms, which are feasible because they keep only a collection of k candidates at any time: matching subgraphs are scanned once, and each newly scanned subgraph may be swapped with a subgraph in the current collection.
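To make the second step of the naïve approach concrete, here is a minimal sketch of the standard greedy heuristic for maximum k-coverage. It assumes the full set S of matches has already been materialized, which is exactly what the slide argues is infeasible on large graphs. The function name `greedy_k_coverage` is hypothetical, and the sample S reuses the four matches from the collaboration-network example.

```python
def greedy_k_coverage(S, k):
    """Greedily pick up to k vertex sets from S, each time taking the set
    that adds the most vertices not yet covered."""
    remaining = [frozenset(s) for s in S]
    chosen, covered = [], set()
    while remaining and len(chosen) < k:
        # max() returns the first set with the largest marginal gain.
        best = max(remaining, key=lambda s: len(s - covered))
        remaining.remove(best)
        chosen.append(best)
        covered |= best
    return chosen, covered

# Hypothetical S: the four matches named in the earlier example.
S = [
    {"v3", "v8", "v7", "v12"},
    {"v3", "v8", "v9", "v12"},
    {"v1", "v5", "v4", "v10"},
    {"v2", "v6", "v7", "v12"},
]
chosen, covered = greedy_k_coverage(S, 2)
print(len(covered))  # 8
```

The memory cost is dominated by S itself, not by the selection loop, which is why a method that never stores all matches is attractive.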

Solution Overview
The authors propose a two-phase algorithm, DSQL, to solve the DSQ problem.
Phase 1 is the non-swapping phase. Since generating and storing all matches is not feasible, DSQL selectively generates embeddings (subgraphs isomorphic to Q). It starts at level 0 by collecting a maximal set of disjoint embeddings, then moves to level 1. At level i, a new embedding shares i vertices with the embeddings already collected, so that each new embedding adds |Q| - i new vertices to the cover. As soon as k embeddings have been collected in T at any level, the phase terminates.
Phase 2 is the swapping phase. It continues the work of Phase 1 with multi-scan swapping in order to refine the solution and achieve a good approximation bound. The authors propose SWAPα for this phase. It continues at the level at which Phase 1 ended, and each generated embedding may be swapped with an embedding in T.

Phase 1 – DSQL-P1
Consider the query graph Q in (a) and the data graph G in (b).
First, a candidate set is generated for each vertex u in Q: candS(u) is the set of candidate vertices of G that have the same label as u and pass degree and neighborhood-signature filtering. The query nodes are then ranked and stored in qList (c) based on |candS(u)|/degree(u).
Next, DSQL-P1 calls the subroutine Q1Search to obtain a set of disjoint embeddings T0. If k embeddings are found, the algorithm terminates. Otherwise, DSQL-P1 continues to search for embeddings with overlap sizes from 1 to |Q| - 1, level by level.
When the overlap size is i (level i), each i-subset of VQ is a possible set of overlap nodes, Qovp. Each Qovp is sorted according to the ranking in qList and kept in QoverlapList. For every query node u, a candidate set TcandS(u) = candS(u) ∩ V(T) is derived. For each Qovp, the overlap nodes are matched to data vertices in TcandS(u), giving a partial embedding ovpEmb in which only the overlap nodes are matched. Each ovpEmb is passed to Q1Search, one at a time, to be extended into a complete embedding.
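The candidate-generation and ranking step can be sketched as follows. This is an illustration under several assumptions that the slides do not spell out: the neighborhood-signature filter is taken to mean that a data vertex must have at least as many neighbors of each label as the query node does, qList is sorted in ascending order of |candS(u)|/degree(u), and every query node has at least one neighbor. The graphs and all function names are hypothetical.

```python
from collections import Counter

def candidate_sets(q_adj, q_label, g_adj, g_label):
    """candS(u): data vertices with u's label that pass the degree filter and
    an (assumed) neighbor-label-count filter."""
    cand = {}
    for u, u_nbrs in q_adj.items():
        need = Counter(q_label[w] for w in u_nbrs)
        cand[u] = set()
        for v, v_nbrs in g_adj.items():
            if g_label[v] != q_label[u] or len(v_nbrs) < len(u_nbrs):
                continue
            have = Counter(g_label[w] for w in v_nbrs)
            if all(have[lab] >= cnt for lab, cnt in need.items()):
                cand[u].add(v)
    return cand

def rank_query_nodes(q_adj, cand):
    """qList: query nodes ordered by |candS(u)| / degree(u), ascending (assumed)."""
    return sorted(q_adj, key=lambda u: len(cand[u]) / len(q_adj[u]))

# Hypothetical query: path u1(A) - u2(B) - u3(C).
q_adj = {"u1": {"u2"}, "u2": {"u1", "u3"}, "u3": {"u2"}}
q_label = {"u1": "A", "u2": "B", "u3": "C"}
# Hypothetical data graph: path v1 - v2 - v3 plus an extra edge v4 - v5.
g_adj = {"v1": {"v2"}, "v2": {"v1", "v3"}, "v3": {"v2"},
         "v4": {"v5"}, "v5": {"v4"}}
g_label = {"v1": "A", "v2": "B", "v3": "C", "v4": "A", "v5": "B"}

cand = candidate_sets(q_adj, q_label, g_adj, g_label)
q_list = rank_query_nodes(q_adj, cand)
print(cand["u2"])  # {'v2'}: v5 has label B but fails the degree filter
print(q_list)      # ['u2', 'u3', 'u1']
```

Placing the most selective query node first keeps the branching factor of the later backtracking search small, which is the usual motivation for this kind of ordering.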

Phase 1 – DSQL-P1: example with (a) query graph Q, (b) data graph G, k = 6

qList = (u1, u2, u3). Search for candidates of u1: candS(u1) = {v1, v7, v14, v16}.
Level 0 (i = 0): search for embeddings with no overlap. T = {(v1,v2,v3), (v7,v8,v9)}. Since |T| = 2 < k, continue to level 1.
Level 1 (d): the overlap size is 1, QoverlapList = ({u1}, {u2}, {u3}), and TcandS = (u1: {v1, v7}, u2: {v2, v8}, u3: {v3, v9}). For each Qovp in QoverlapList, partial embeddings ovpEmb are generated one by one and passed to Q1Search. When Qovp = {u1}, the ovpEmbs are (v1,-1,-1) and (v7,-1,-1); from (v1,-1,-1) we get (v1,v5,v6). The same process is repeated for every Qovp in QoverlapList. After level 1, T = {(v1,v2,v3), (v7,v8,v9), (v1,v5,v6), (v14,v2,v15), (v16,v17,v3)}. Since |T| = 5 < k, continue to level 2 to search for overlaps of size 2.
Level 2: (v1,v8,v13) is added, so |T| = 6 = k, and T and i = 2 are returned.
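The bookkeeping in this walkthrough can be replayed in a few lines. The sketch below is an illustration with hypothetical helper names; it takes the six embeddings in the order they are collected, with (v1,v2,v3) as the first level-0 embedding, and checks how many vertices each one shares with the embeddings collected before it. It also rebuilds QoverlapList for level 1 and TcandS(u1) from the values given on the slide.

```python
from itertools import combinations

q_list = ["u1", "u2", "u3"]
# Embeddings in the order they enter T during Phase 1.
T = [("v1", "v2", "v3"), ("v7", "v8", "v9"),       # level 0
     ("v1", "v5", "v6"), ("v14", "v2", "v15"),     # level 1
     ("v16", "v17", "v3"),                         # level 1
     ("v1", "v8", "v13")]                          # level 2

def overlap_sizes(embeddings):
    """For each embedding, count vertices already covered by earlier ones."""
    covered, sizes = set(), []
    for emb in embeddings:
        sizes.append(len(covered.intersection(emb)))
        covered.update(emb)
    return sizes, len(covered)

def overlap_list(q_list, i):
    """All i-subsets of query nodes, each kept in qList order."""
    return list(combinations(q_list, i))

sizes, total = overlap_sizes(T)
print(sizes)                    # [0, 0, 1, 1, 1, 2]
print(total)                    # 13
print(overlap_list(q_list, 1))  # [('u1',), ('u2',), ('u3',)]

# TcandS(u1) = candS(u1) ∩ V(T) after level 0.
cand_u1 = {"v1", "v7", "v14", "v16"}
v_t0 = set(T[0]) | set(T[1])
print(sorted(cand_u1 & v_t0))   # ['v1', 'v7']
```

Each embedding found at level i contributes |Q| - i = 3 - i new vertices, which matches the total of 3 + 3 + 2 + 2 + 2 + 1 = 13 covered vertices.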

Optimizing DSQL-P1: Localized Subgraph Search
When a partial embedding such as (v1,-1,-1) is given, some query nodes are already matched. Performance can be greatly improved by limiting the search scope to the neighborhood of the matched vertices. Suppose u1 is currently matched with v1, so that u5, u4, u2, and u3 are searched after u1. The candidates for u5, u4, u2, and u3 are then limited to the neighbors of v1, namely {v5}, {v4}, {v2, v12}, and {v3, v15}, respectively.
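A minimal sketch of the localization idea follows. The slide's figure is not available here, so the query shape (u1 adjacent to u2, u3, u4, u5), the global candidate sets, and the helper name `localized_candidates` are all assumptions chosen so that the output matches the candidate sets quoted above.

```python
def localized_candidates(cand, g_adj, q_adj, matched):
    """Restrict each unmatched query node's candidates to data vertices that
    are adjacent to the images of its already-matched query neighbors."""
    local = {}
    for u in q_adj:
        if u in matched:
            continue
        allowed = set(cand[u])
        for w in q_adj[u]:
            if w in matched:
                allowed &= g_adj[matched[w]]
        local[u] = allowed
    return local

# Hypothetical query: u1 is adjacent to u2, u3, u4, u5.
q_adj = {"u1": {"u2", "u3", "u4", "u5"},
         "u2": {"u1"}, "u3": {"u1"}, "u4": {"u1"}, "u5": {"u1"}}
# Hypothetical global candidate sets before localization.
cand = {"u2": {"v2", "v12", "v20"}, "u3": {"v3", "v15", "v21"},
        "u4": {"v4", "v22"}, "u5": {"v5", "v23"}}
# Only the neighborhood of the matched vertex v1 is needed here.
g_adj = {"v1": {"v2", "v3", "v4", "v5", "v12", "v15"}}

local = localized_candidates(cand, g_adj, q_adj, matched={"u1": "v1"})
print(sorted(local["u2"]))  # ['v12', 'v2']
print(sorted(local["u4"]))  # ['v4']
```

The intersection is cheap compared with scanning the full candidate sets, and it grows more selective as more query nodes are matched.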

Optimizing DSQL-P1: Skipping Data Vertices in Backtracking
Consider the partial embedding {(u1, v1), (u2, v4)}. Next, the algorithm tries to match u3 with the candidate v8. Since there is no match for u4, the algorithm marks v8 as "bad". Similarly, v9, ..., v1006 are marked as "bad" vertices. After backtracking and assigning v5 to u2, the algorithm directly skips the "bad" vertices v8, ..., v1006. Finally, when v3 is mapped to u1, the vertices v4, v5, and v6 have all been marked as "bad", so the algorithm directly checks v7 and finds one match, (v3, v7, v1007, v2007, v2008).
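The skipping idea can be illustrated on a deliberately restricted case. The sketch below assumes a path-shaped query whose nodes all carry distinct labels, so whether a data vertex can be extended at a given depth never depends on the earlier choices; under that assumption a failed (depth, vertex) pair can safely be marked "bad" and skipped later. The graph, the labels, and the function name are hypothetical, and the general DSQL rule may be more involved than this.

```python
def find_path_match(labels, g_adj, g_label):
    """Find one embedding of a path query whose i-th node has label labels[i].
    Assumes all labels in `labels` are distinct."""
    bad = set()     # (depth, vertex) pairs known to lead to no match
    visits = []     # exploration order, kept only to show the skipping

    def extend(depth, v):
        visits.append((depth, v))
        if depth == len(labels) - 1:
            return [v]
        for w in sorted(g_adj[v]):
            if g_label[w] != labels[depth + 1] or (depth + 1, w) in bad:
                continue  # wrong label, or already known to fail here
            rest = extend(depth + 1, w)
            if rest is not None:
                return [v] + rest
        bad.add((depth, v))
        return None

    for v in sorted(g_adj):
        if g_label[v] == labels[0]:
            match = extend(0, v)
            if match is not None:
                return match, visits
    return None, visits

# Hypothetical data graph: c1 is a dead end reachable from b1 and b2.
g_adj = {"a1": {"b1", "b2"}, "a2": {"b1", "b2", "b3"},
         "b1": {"a1", "a2", "c1"}, "b2": {"a1", "a2", "c1"},
         "b3": {"a2", "c2"}, "c1": {"b1", "b2"},
         "c2": {"b3", "d1"}, "d1": {"c2"}}
g_label = {v: v[0].upper() for v in g_adj}

match, visits = find_path_match(["A", "B", "C", "D"], g_adj, g_label)
print(match)  # ['a2', 'b3', 'c2', 'd1']
```

In this run c1 is explored once, from b1, and then skipped when the search reaches it again through b2; b1 and b2 are likewise skipped when the search restarts from a2.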

Phase 2 – DSQL-P2
The goal of DSQL-P2 is to enhance the Phase 1 result and to provide a better worst-case approximation guarantee, namely 0.5. SWAPα is proposed for Phase 2. It consists of multiple passes, and in each pass all embeddings are scanned once. When the swapping criterion is satisfied, an old embedding is swapped with the newly scanned one. Swapping aims to increase coverage.
Swapping criterion: for a parameter α ≥ 0, the next candidate h is swapped with a candidate f in T if B(h, T) ≥ (1 + α) L(f, T).
Termination condition: Phase 2 stops either when level |Q| - 1 has been reached, or when (1) V(T1) ⊆ V(T) and (2) for each f ∈ T, L(f, T) ≥ (q - i)/(1 + α), where q is the query size.
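The swapping criterion can be written out as a small predicate. The slide does not define B and L, so the sketch assumes the usual reading: B(h, T) is the benefit of h, the number of vertices of h not covered by T, and L(f, T) is the loss of f, the number of vertices covered by f and by no other embedding in T. These definitions, the function names, and the sample data are assumptions.

```python
def covered(T):
    """Set of data vertices covered by the embeddings in T."""
    return set().union(*T)

def benefit(h, T):
    """Assumed B(h, T): vertices of h that T does not cover yet."""
    return len(set(h) - covered(T))

def loss(f, T):
    """Assumed L(f, T): vertices covered by f and by no other embedding in T."""
    others = list(T)
    others.remove(f)  # drop one occurrence of f
    return len(set(f) - covered(others))

def should_swap(h, f, T, alpha):
    """Swapping criterion: B(h, T) >= (1 + alpha) * L(f, T)."""
    return benefit(h, T) >= (1 + alpha) * loss(f, T)

# Hypothetical current solution and two scanned candidates.
T = [("v1", "v2", "v3"), ("v1", "v2", "v4")]
f = T[0]
print(loss(f, T))                                    # 1 (only v3 is exclusive to f)
print(should_swap(("v5", "v6", "v7"), f, T, 0.5))    # True:  3 >= 1.5
print(should_swap(("v1", "v2", "v5"), f, T, 0.5))    # False: 1 <  1.5
```

A larger α demands a larger gain before a swap is accepted, which reduces the number of swaps at the cost of possibly missing small improvements.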

Phase 2 – DSQL-P2
The input is T and i (the level number). If i > 0, |T| = k, and the approximation ratio is below 0.5, Phase 2 is triggered. Otherwise T is returned, either as an optimal solution (when i = 0, or when i = |Q| - 1 and |T| < k) or as a good solution with approximation ratio ≥ 0.5.
The main idea of P2 is to keep a copy of T as T1 and to generate embeddings as in P1, except that T1 is always used instead of T when generating TcandS. When an embedding h is found, the algorithm checks whether the swapping criterion is satisfied for any embedding f in T. If it is, f is swapped with h, and the termination condition is checked.
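One scan of the swapping phase might look like the sketch below. It is a simplification under stated assumptions: the candidate embeddings arrive from a plain iterable named `stream` rather than from the T1-guided generation described above, T is non-empty, the benefit and loss follow the same assumed definitions as in the criterion (uncovered vertices of h, and vertices exclusive to f), and the embedding chosen for replacement is the one with the smallest loss. None of these choices is confirmed by the slides.

```python
def covered(T):
    """Set of data vertices covered by the embeddings in T."""
    return set().union(*T)

def benefit(h, T):
    """Assumed B(h, T): vertices of h not covered by T."""
    return len(set(h) - covered(T))

def loss(j, T):
    """Assumed L(f, T) for f = T[j]: vertices covered only by T[j]."""
    return len(set(T[j]) - covered(T[:j] + T[j + 1:]))

def swap_pass(T, stream, alpha):
    """Scan candidates once; replace the least valuable embedding in T
    whenever B(h, T) >= (1 + alpha) * L(f, T) holds for it."""
    T = list(T)
    swaps = 0
    for h in stream:
        if h in T:
            continue
        j = min(range(len(T)), key=lambda idx: loss(idx, T))
        if benefit(h, T) >= (1 + alpha) * loss(j, T):
            T[j] = h
            swaps += 1
    return T, swaps

# Hypothetical Phase 1 output and candidate stream.
T0 = [("v1", "v2", "v3"), ("v1", "v2", "v4")]
stream = [("v1", "v2", "v5"), ("v6", "v7", "v8")]
T1, swaps = swap_pass(T0, stream, alpha=0.5)
print(T1)                # [('v6', 'v7', 'v8'), ('v1', 'v2', 'v4')]
print(len(covered(T1)))  # 6, up from 4
```

Because a swap is only accepted when the benefit is at least the loss of the replaced embedding, coverage never decreases during a pass.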

Results
In the experiments on real datasets, DSQL covers more vertices of the data graph than an existing SQ method (COM). The tables on this slide report the running time and coverage of DSQL and COM for varying k.
