1 TreeSpan: Efficiently Computing Similarity All-Matching
Computer Science and Engineering
Gaoping Zhu #, Xuemin Lin #, Ke Zhu #, Wenjie Zhang #, Jeffrey Xu Yu †
# The University of New South Wales
† The Chinese University of Hong Kong

2 Outline
- Introduction
- State-of-the-Art
- Our Approach
- Experiments
- Conclusions

3 Introduction: Graph Data
- Chem-informatics
  - Chemical Compounds (small size)
- Bio-informatics
  - PPI Networks (medium size)
- Internet
  - World Wide Web (large size)

4 Introduction: Exact All-Matching (I)
- Exact All-Matching
  - Enumerate all exact (i.e., isomorphic) matches of a query graph q in a data graph G.
- Applications
  - Query biological patterns in PPI networks.
  - Detect suspicious bugs in software programs.
[Figure: query graph q, data graph G, and the exact matches of q in G]
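The definition on this slide can be made concrete with a small backtracking enumerator. This is a minimal sketch, not the matcher used by the authors: it assumes vertex-labeled undirected graphs, treats a match as a label-preserving injective mapping that sends every query edge to a data edge, and uses a fixed vertex order with no pruning beyond adjacency checks. The function name `exact_all_matching` and the toy graphs are illustrative assumptions.

```python
def exact_all_matching(q_labels, q_edges, g_labels, g_edges):
    """Enumerate label-preserving injective mappings of q into G
    that send every query edge to a data edge (hypothetical helper)."""
    q_adj = {u: set() for u in q_labels}
    for a, b in q_edges:
        q_adj[a].add(b)
        q_adj[b].add(a)
    g_adj = {v: set() for v in g_labels}
    for a, b in g_edges:
        g_adj[a].add(b)
        g_adj[b].add(a)

    order = list(q_labels)  # fixed matching order, no optimization
    matches = []

    def extend(mapping):
        if len(mapping) == len(order):
            matches.append(dict(mapping))
            return
        u = order[len(mapping)]
        used = set(mapping.values())
        for v in g_labels:
            if v in used or g_labels[v] != q_labels[u]:
                continue
            # Every already-mapped neighbor of u must map to a neighbor of v.
            if all(mapping[w] in g_adj[v] for w in q_adj[u] if w in mapping):
                mapping[u] = v
                extend(mapping)
                del mapping[u]

    extend({})
    return matches


# Toy input (assumed for illustration): a path A-B-C as query.
q_labels = {"u1": "A", "u2": "B", "u3": "C"}
q_edges = [("u1", "u2"), ("u2", "u3")]
g_labels = {1: "A", 2: "B", 3: "C", 4: "A"}
g_edges = [(1, 2), (2, 3), (4, 2)]
exact = exact_all_matching(q_labels, q_edges, g_labels, g_edges)
print(exact)
```

On the toy input the query path can start at either A-labeled data vertex, so two matches are reported.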

5 Introduction: Exact All-Matching (II)
- Dilemma of Exact All-Matching
  - If q is issued by a user for exploratory purposes …
  - If G is noisy due to imprecise data collection …
  - No exact matches can be found!
- Potential Solutions
  - Modify q or G and run exact all-matching again and again.
  - Ask the system to return approximate results (i.e., similarity all-matching).
[Figure: data graph G and a modified query q']

6 SAPPER [VLDB'10 Zhang et al.] (I)
- Similarity All-Matching
  - Given a query graph q, a data graph G and a similarity threshold θ, enumerate all similarity matches of q in G (i.e., all connected subgraphs of G that miss at most θ edges of q).
- Framework
  - Enumerate a set of seeds Q_SAPPER (i.e., all connected subgraphs q' of q missing θ edges of q).
  - Run exact all-matching on each seed q' to obtain its exact matches.
  - Induce similarity matches from the exact matches of the seeds.
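The seed set in the framework above can be sketched by brute force. This is an illustration under stated assumptions, not SAPPER's actual implementation: a seed is read here as a subgraph that keeps every vertex of q, drops exactly θ edges, and stays connected. The helper names `is_connected` and `sapper_seeds` and the 4-vertex complete query are hypothetical.

```python
from itertools import combinations


def is_connected(vertices, edges):
    """Check connectivity of an undirected graph with a simple DFS."""
    vertices = list(vertices)
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen = {vertices[0]}
    stack = [vertices[0]]
    while stack:
        x = stack.pop()
        for y in adj[x] - seen:
            seen.add(y)
            stack.append(y)
    return len(seen) == len(vertices)


def sapper_seeds(vertices, edges, theta):
    """All connected subgraphs of q obtained by dropping exactly theta edges
    (assumed reading of the seed definition)."""
    seeds = []
    for removed in combinations(edges, theta):
        kept = [e for e in edges if e not in removed]
        if is_connected(vertices, kept):
            seeds.append(kept)
    return seeds


# Assumed example: complete query graph on four vertices (6 edges).
k4_vertices = ["A", "B", "C", "D"]
k4_edges = list(combinations(k4_vertices, 2))
print(len(sapper_seeds(k4_vertices, k4_edges, 2)))  # 15
```

Each seed costs one exact all-matching test, so the length of this list is the quantity the cost model counts.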

7 SAPPER [VLDB'10 Zhang et al.] (II)
- Cost Model
  - |Q_SAPPER| = # of exact all-matching tests
[Figure: data graph G (vertices v1 to v5), query q (vertices u1 to u4, θ = 1), and SAPPER seeds q'1, q'2, q'3, q'4, with two matches F1 = {u1→v1, u2→v2, u3→v3, u4→v4} and F2 = {u1→v2, u2→v1, u3→v3, u4→v4}]

8 Our Approach: Overview (I)
- Tree-based Spanning Search Paradigm: TSpan
  - Enumerate a set of seeds Q_T (i.e., spanning trees of q that cover all connected subgraphs q' of q missing θ edges of q).
- Primary Contribution
  - Reduce the # of exact all-matching tests (i.e., the # of seeds).
  - Reduce the complexity of each exact all-matching test from graph-to-graph to tree-to-graph.
[Figure: query q (θ = 2) followed by three seed graphs; annotations: "7 more SAPPER seeds", "3 all-matching tests on connected subgraphs of q", "1 all-matching test on a spanning tree of q"]

9 Our Approach: Overview (II)
- Generating Similarity Maximal Matches
  - Generating only similarity maximal matches can reduce the # of exact all-matching tests.
[Figure: data graph G (vertices v1 to v5) and query q (vertices u1 to u4, θ = 1), with the similarity maximal matches F1 = {u1→v1, u2→v2, u3→v3, u4→v4} and F2 = {u1→v2, u2→v1, u3→v3, u4→v4}]

10 Our Approach: Problem Statement
- Similarity Maximal All-Matching
  - Given a query graph q, a data graph G and a similarity threshold θ, enumerate all distinct similarity maximal matches of q in G conforming to θ.
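A brute-force reference for this problem statement is sketched below. It is not the TSpan algorithm and scales factorially; it only pins down one reading of the definition. The assumptions are: a match is a label-preserving injective vertex mapping, "maximal" means the mapping is reported together with every query edge that is present in G under it, at most θ query edges may be missing, and the matched edges must still connect all query vertices. All names are hypothetical.

```python
from itertools import permutations


def spans_connected(vertices, edges):
    """True if the given edges connect all of the given vertices."""
    vertices = list(vertices)
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen = {vertices[0]}
    stack = [vertices[0]]
    while stack:
        x = stack.pop()
        for y in adj[x] - seen:
            seen.add(y)
            stack.append(y)
    return len(seen) == len(vertices)


def similarity_maximal_matches(q_labels, q_edges, g_labels, g_edges, theta):
    """Try every injective mapping; keep those missing at most theta edges."""
    g_edge_set = {frozenset(e) for e in g_edges}
    q_vertices = list(q_labels)
    results = []
    for image in permutations(g_labels, len(q_vertices)):
        f = dict(zip(q_vertices, image))
        if any(q_labels[u] != g_labels[f[u]] for u in q_vertices):
            continue
        # Maximal edge set for this mapping: all query edges present in G.
        matched = [(a, b) for a, b in q_edges
                   if frozenset((f[a], f[b])) in g_edge_set]
        missing = len(q_edges) - len(matched)
        if missing <= theta and spans_connected(q_vertices, matched):
            results.append((f, matched))
    return results


# Assumed toy input: triangle query, path-shaped data graph.
tri_labels = {"u1": "A", "u2": "B", "u3": "C"}
tri_edges = [("u1", "u2"), ("u2", "u3"), ("u1", "u3")]
path_labels = {1: "A", 2: "B", 3: "C"}
path_edges = [(1, 2), (2, 3)]
print(similarity_maximal_matches(tri_labels, tri_edges, path_labels, path_edges, 1))
```

With θ = 1 the triangle matches the path with edge (u1, u3) missing; with θ = 0 there is no match.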

11 Our Approach: Seeding (I)
- PRIM Order on Spanning Trees
  - Similar to the basic idea of a minimum spanning tree.
  - Given a total order on E(q), a spanning tree T = {T[0], T[1], …, T[|V(q)| − 1]} of q conforms to PRIM order (T[0] is the head vertex) if and only if each spanning edge T[i] has the smallest order in E(q) − {T[1], ..., T[i − 1]} and connects to {T[0], T[1], ..., T[i − 1]}.
[Figure: query q with edges e1 to e6 and a spanning tree T with edges e1, e2, e3]

12 Our Approach: Seeding (II)
- Avoid Duplicate Results
  - Two spanning trees of q may induce duplicate similarity maximal matches.
  - Associate an edge exclusion set T.R with each T in Q_T.
  - T.R is a set of edges in E(q) − E(T) that are forced to be mismatched in the similarity maximal matches induced by T.
[Figure: query q (θ = 2), data graph G, and two spanning trees T1 and T2 with T1.R = ∅ and T2.R = {(A,D)}]

13 Our Approach: Seeding (III)
- Q_T Enumeration Algorithm
[Figure: query q (θ = 2) with edges e1 to e6 and the enumeration of its spanning trees, with steps labeled "go down" and "alternate-reorder"; X marks at T1[1] e1, T1[2] e2, T1[3] e3, T2[3] e4, T4[3] e3, T4[2] e4, T7[3] e2, T7[2] e3, T7[1] e4]

No.   T           T.R
1     e1 e2 e3    { }
2     e1 e2 e4    {e3}
3     e1 e2 e5    {e3, e4}
4     e1 e4 e3    {e2}
5     e1 e4 e6    {e2, e3}
6     e1 e5 e3    {e2, e4}
7     e4 e3 e2    {e1}
8     e4 e3 e5    {e1, e2}
9     e4 e5 e2    {e1, e3}
10    e6 e2 e3    {e1, e4}
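The go-down / alternate enumeration can be sketched as a recursion over the PRIM order. This is a reconstruction under assumptions, not the authors' code: at each step the smallest-order edge leaving the covered vertex set is either added to the tree (go down) or, while fewer than θ edges are excluded, placed in the exclusion set (alternate). The edge-to-endpoint assignment below is not given in the transcript; it is an assumed labeling of a complete 4-vertex query (head vertex `h`), chosen so the output can be compared with the table on the slide.

```python
def enumerate_seeds(vertices, edges, head, theta):
    """Return (tree, exclusion_set) pairs in enumeration order.

    edges: list of (name, u, v) given in their total order.
    """
    seeds = []

    def rec(covered, tree, excluded):
        if len(covered) == len(vertices):
            seeds.append((tuple(tree), frozenset(excluded)))
            return
        for name, u, v in edges:
            if name in excluded:
                continue
            # Smallest-order edge with exactly one covered endpoint.
            if (u in covered) != (v in covered):
                # go down: take the edge as the next spanning edge
                rec(covered | {u, v}, tree + [name], excluded)
                # alternate: exclude the edge if the budget allows
                if len(excluded) < theta:
                    rec(covered, tree, excluded | {name})
                return
        # No usable edge leaves the covered set: dead end, emit nothing.

    rec({head}, [], set())
    return seeds


# Assumed labeling of the six edges of a complete graph on {h, a, b, c}.
q_vertices = ["h", "a", "b", "c"]
q_edges = [
    ("e1", "h", "a"), ("e2", "a", "b"), ("e3", "b", "c"),
    ("e4", "h", "c"), ("e5", "a", "c"), ("e6", "h", "b"),
]

for tree, excluded in enumerate_seeds(q_vertices, q_edges, "h", 2):
    print(" ".join(tree), sorted(excluded))
```

Under this labeling the sketch emits ten (T, T.R) pairs for θ = 2, in the same order as the table. Every spanning tree is reached by exactly one sequence of take/exclude decisions, which is why no tree appears twice.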

14 Our Approach: Seeding (IV)
- Q_T Enumeration Algorithm
  - Correctness: Using Q_T to induce similarity maximal matches neither generates duplicate results nor misses valid results.
  - Minimality of Q_T: If any spanning tree is omitted from Q_T, completeness of the results is no longer guaranteed under the edge exclusion semantics.
  - When |E(q)| = m and |V(q)| = n:
    (1) |Q_SAPPER| ≥ |Q_T|;
    (2) |Q_T| = |Q_SAPPER| only when θ = 0 or θ = m − n + 1.

15 Our Approach: Searching (I)
- Effectively Storing Q_T
  - Use a DFS Traversal Tree to share computation cost.
[Figure: DFS traversal tree rooted at R whose root-to-leaf paths are the spanning trees e1e2e3, e1e2e4, e1e2e5, e1e4e3, e1e4e6, e1e5e3, e4e3e2, e4e3e5, e4e5e2, e6e2e3]

16 Our Approach: Searching (II)
- Similarity Maximal All-Matching Algorithm Sketch
  - Traverse the DFS Traversal Tree in a depth-first backtracking fashion.
  - go-down: Beginning from the initial spanning tree, recursively drill down to extend the current partial match to the next spanning edge T[i] of the current spanning tree T.
  - alternate: If T[i] cannot be extended from the current partial match and mismatching T[i] still conforms to θ, switch from T to the alternative spanning tree T' obtained by replacing T[i] with T'[i].

17 Our Approach: Optimizations
- Optimization (I): EnumerateOnDemand Strategy
  - Motivation: further reduce the number of seeds.
  - Enumerate an alternative tree T' from the current tree T only when it is feasible to extend the current partial similarity maximal match, conforming to θ, (1) on the next spanning edge T[i] or (2) on the next spanning edge T'[i].
- Optimization (II): Effective Search Order
  - Motivation: terminate each all-matching test as early as possible.
  - Decide the search order of the spanning edges in T based on the post-filtering candidate sets of each vertex in q.

18 Our Approach: Filtering & Ordering (I)
- Neighborhood Aggregate N(v, g)
  - Given a set of labels Σ_V = {L_1, ..., L_m}, N(v, g) = (x_1, ..., x_m) where x_i is the number of neighbors of v in g with label L_i ∈ Σ_V.
- Neighborhood-based Filtering
  - Compute the candidate set C(u) for each u in q.
[Figure: a vertex u ∈ q with its labeled neighbors and a vertex v ∈ G with its labeled neighbors; N(u, q) = (2, 1, 0, 2), N(v, G) = (1, 2, 2, 0)]
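The aggregate and a filter built on it can be sketched as follows. The aggregate follows the definition on the slide; the pruning rule is an assumption, since the transcript does not state it: a data vertex v is kept for u only if it carries the same label and the total shortfall of N(v, G) against N(u, q) is at most θ (each missing neighbor costs at least one mismatched edge). The toy neighborhoods are constructed to reproduce the two vectors on the slide; the labels of u and v themselves are assumed.

```python
from collections import Counter


def neighborhood_aggregate(v, adj, labels, alphabet):
    """N(v, g): number of neighbors of v per label, in alphabet order."""
    counts = Counter(labels[w] for w in adj[v])
    return tuple(counts[label] for label in alphabet)


def shortfall(nu, nv):
    """Total number of neighbors that u needs but v cannot supply."""
    return sum(max(0, x - y) for x, y in zip(nu, nv))


def candidate_set(u, q_adj, q_labels, g_adj, g_labels, alphabet, theta):
    """C(u) under the assumed rule: same label and shortfall <= theta."""
    nu = neighborhood_aggregate(u, q_adj, q_labels, alphabet)
    result = []
    for v in g_adj:
        if g_labels[v] != q_labels[u]:
            continue
        nv = neighborhood_aggregate(v, g_adj, g_labels, alphabet)
        if shortfall(nu, nv) <= theta:
            result.append(v)
    return result


alphabet = ["A", "B", "C", "D"]
# Query star around u with neighbor labels A, A, B, D, D.
q_labels = {"u": "A", "q1": "A", "q2": "A", "q3": "B", "q4": "D", "q5": "D"}
q_adj = {"u": ["q1", "q2", "q3", "q4", "q5"],
         "q1": ["u"], "q2": ["u"], "q3": ["u"], "q4": ["u"], "q5": ["u"]}
# Data star around v with neighbor labels A, B, B, C, C.
g_labels = {"v": "A", "g1": "A", "g2": "B", "g3": "B", "g4": "C", "g5": "C"}
g_adj = {"v": ["g1", "g2", "g3", "g4", "g5"],
         "g1": ["v"], "g2": ["v"], "g3": ["v"], "g4": ["v"], "g5": ["v"]}

nu = neighborhood_aggregate("u", q_adj, q_labels, alphabet)
nv = neighborhood_aggregate("v", g_adj, g_labels, alphabet)
print(nu, nv, shortfall(nu, nv))
```

For the slide's vectors the shortfall is 3 (one A-neighbor and two D-neighbors), so under this assumed rule v is pruned from C(u) for any θ below 3.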

19 Our Approach: Filtering & Ordering (II)
- QI Search Ordering [VLDB'08 Shang et al.]
  - Pick Head Vertex: the vertex u in q with minimum φ(u) (i.e., the number of vertices in G with label l(u)).
  - Pick Next Spanning Edge: the edge (u1, u2) with minimum φ(u1, u2) (i.e., the number of edges in G with labels (l(u1), l(u2))), where u1 is a vertex incident to a previously picked spanning edge.
- Filtering-based Search Ordering
  - Pick Head Vertex: the vertex u in q with the minimum number of candidates (i.e., |C(u)|).
  - Pick Next Spanning Edge: the edge (u1, u2) minimizing |C(u2)| × φ(u1, u2) / φ(u2), where u1 is a vertex incident to a previously picked spanning edge.
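The filtering-based ordering rule can be sketched directly from the two formulas. The statistics below are made-up numbers for illustration, and the function names and dictionary layout are assumptions: `phi_vertex` maps a label to its vertex count in G, and `phi_edge` maps a sorted label pair to its edge count in G.

```python
def pick_head(candidates):
    """Head vertex: the query vertex with the fewest candidates."""
    return min(candidates, key=lambda u: len(candidates[u]))


def pick_next_edge(q_edges, covered, candidates, q_labels, phi_vertex, phi_edge):
    """Next spanning edge (u1, u2): u1 covered, u2 not yet covered,
    minimizing |C(u2)| * phi(u1, u2) / phi(u2)."""
    best_edge, best_score = None, None
    for a, b in q_edges:
        for u1, u2 in ((a, b), (b, a)):
            if u1 in covered and u2 not in covered:
                pair = tuple(sorted((q_labels[u1], q_labels[u2])))
                score = (len(candidates[u2]) * phi_edge[pair]
                         / phi_vertex[q_labels[u2]])
                if best_score is None or score < best_score:
                    best_edge, best_score = (u1, u2), score
    return best_edge, best_score


# Illustrative (invented) query and statistics.
q_labels = {"u1": "A", "u2": "B", "u3": "C"}
q_edges = [("u1", "u2"), ("u1", "u3")]
candidates = {"u1": [1, 2], "u2": [3, 4, 5, 6, 7], "u3": [8, 9, 10, 11]}
phi_vertex = {"A": 10, "B": 20, "C": 8}
phi_edge = {("A", "B"): 40, ("A", "C"): 4}

head = pick_head(candidates)
edge, score = pick_next_edge(q_edges, {head}, candidates, q_labels,
                             phi_vertex, phi_edge)
print(head, edge, score)
```

Here edge (u1, u2) scores 5 × 40 / 20 = 10 and edge (u1, u3) scores 4 × 4 / 8 = 2, so the rarer A-C edge is explored first.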

20 Experiments: Experimental Settings
- Data Graphs
  - G_H: HPRD network (|V(G_H)| = 9,460, |E(G_H)| = 37,081).
  - G_S: default synthetic data graph.
  - Other synthetic data graphs generated by varying the data graph settings.
- Query Graphs
  - Randomly selected subgraphs of the corresponding data graphs.
- Parameter Settings (defaults were set in bold on the slide)

Parameter      Values
|V(G)|         5k, 10k, 20k, 40k, 80k
avg. deg(G)    4, 8, 12, 16, 20
|Σ_V|          20, 50, 100, 200
|V(q)|         20, 40, 60, 80, 100
avg. deg(q)    3, 4, 5, 6
θ              1, 2, 3, 4

21 Experiments: # of exact all-matching tests
- |Q_SAPPER|: # of exact all-matching tests by SAPPER [VLDB'10].
- |Q_T|: # of exact all-matching tests by the EnumerateAll paradigm.
- TSpan: # of exact all-matching tests by the EnumerateOnDemand paradigm.

22 Experiments: Total Processing Time
- Similarity All-Matching
  - SAPPER: Generate all similarity matches.
  - TSpan+: Run TSpan first and then generate all similarity matches from the similarity maximal matches.
- Similarity Maximal All-Matching
  - NaïveTSpan: Similarity maximal all-matching with no computation sharing.
  - TSpan: Similarity maximal all-matching with computation sharing.

23 Experiments: Total Processing Time
- Enumeration Paradigms
  - PrecTSpan: Similarity maximal all-matching by EnumerateAll.
  - TSpan: Similarity maximal all-matching by EnumerateOnDemand.
- Filtering & Ordering
  - TSpanQI: TSpan algorithm with QI search ordering.
  - TSpanNF: TSpan algorithm with no filtering technique.

24 Experiments: Large-scale Data Graphs
- TSpan on Large-scale Datasets

25 Conclusions
- Tree-based Spanning Search Paradigm
- EnumerateOnDemand Strategy
- Filtering-based Search Ordering

                           SAPPER            TSpan
# of all-matching tests                      significantly less
each all-matching test     graph to graph    tree to graph
computation-sharing        no                yes
similarity results         non-maximal       maximal

26 Thank You! Any Questions?

