Presentation is loading. Please wait.

Presentation is loading. Please wait.

The IEEE International Conference on Big Data 2013 Arash Fard M. Usman Nisar Lakshmish Ramaswamy John A. Miller Matthew Saltz Computer Science Department.

Similar presentations


Presentation on theme: "The IEEE International Conference on Big Data 2013 Arash Fard M. Usman Nisar Lakshmish Ramaswamy John A. Miller Matthew Saltz Computer Science Department."— Presentation transcript:

1 The IEEE International Conference on Big Data 2013 Arash Fard M. Usman Nisar Lakshmish Ramaswamy John A. Miller Matthew Saltz Computer Science Department University of Georgia, Athens, GA October 7, 2013 A Distributed Vertex-Centric Approach for Pattern Matching in Massive Graphs

2 Outline 2  Introduction  Subgraph Pattern Matching  Types of Subgraph Pattern Matching  Strict Simulation  Distributed Algorithms  Experiments

3 Introduction 3  Labeled Directed Graph G(V, E, l)  Versatile and expressive  Social networks, web search engines, genome sequencing, etc.  Data sizes are growing rapidly  Facebook: 1 billion+ users, average degree of 140  Twitter: 200 million+ users, 400 million tweets each day  Bigger datasets mean bigger graphs  Pattern (Query Graph): the interesting small graph  Data Graph: the massive graph contains the information  Pattern Matching: finding the instances of the pattern in the data graph

4 Subgraph Pattern Matching 4  There are different paradigms for pattern matching:  Exact Topological Matching Sub-graph Isomorphism  Simulation Matching (meaning of the relations) Graph Simulation Dual Simulation Strong Simulation Strict Simulation SD: Software Developer Bio: Biologist PM: Product Manager John Ann Sara Mary Bob Alice Pattern GraphData Graph Endorsed by

5 Example: Graph Simulation 5 PM SA DB SA DB PM SD AI DB AI DBAI SA Data Graph DB AI PM SA PMSA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1516 17 181920 21 22 2324 Pattern Graph PM SA DB AI PM: Product Manager SA: Software Analyst DB: Database Designer AI: AI specialist SD: Software Developer  Label and Children conditions  Quadratic time complexity

6 Example: Dual Simulation 6 PM SA DB SA DB PM SD AI DB AI DBAI SA Pattern Graph Data Graph PM SA DB AI DB AI PM SA PMSA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1516 17 181920 21 22 2324 PM: Product Manager SA: Software Analyst DB: Database Designer AI: AI specialist SD: Software Developer  Adds Parents condition  Cubic time complexity

7 Strong Simulation 7  Extends dual simulation with a locality condition  A ball G b [ v, r ] is a subgraph of a graph G containing:  all vertices V b within an undirected distance r of the vertex v  all the edges between the vertices in V b S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo, “Capturing topology in graph pattern matching,” Proc. VLDB Endow., vol. 5, no. 4, pp. 310–321, Dec. 2011.

8 Example: Strong Simulation 8 PM SA DB SA DB PM SD AI DB AI DBAI SA Data Graph DB AI PM SA PMSA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1516 17 181920 21 22 2324 Pattern Graph PM SA DB AI PM: Product Manager SA: Software Analyst DB: Database Designer AI: AI specialist SD: Software Developer  Adds Locality condition  Cubic time complexity

9 Example: Subgraph Isomorphism 9 PM SA DB SA DB PM SD AI DB AI DBAI SA Data Graph DB AI PM SA PMSA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 1516 17 181920 21 22 2324 2 Subgraph results: {4,6,7,8} {5,6,7,8} Pattern Graph PM SA DB AI PM: Product Manager SA: Software Analyst DB: Database Designer AI: AI specialist SD: Software Developer  Exact Topological Match  NP-complete in general case

10 Strict Simulation 10  Strict Simulation is a smart modification of Strong Simulation  Smaller balls  Faster computation  The same quality of results or better  Cubic time complexity

11 Strict vs Strong Simulation 11

12 Example: Strict vs Strong Simulation 12  d q = 2  Strong Simulation: G b [2,2] contains all vertices  Strict Simulation: G b [2,2] contains {2,1,10,3,4}  Blue and white vertics in the result of Strong Simulation  White vertices in the result of Strict Simulation

13 Vertex Centric BSP 13  Bulk Synchronous Parallel  Vertex Centric  Each vertex of the graph is a computing unit  Each vertex initially knows only its id, label, and out going edges  Algorithm terminates when every vertex votes to halt.  GPS, a free implementation of Pregel, developed in infoLab, Stanford University.

14 Distributed Algorithm for Graph Simulation 14 Initialization: Pattern is received from the Master Set the match flag to true  match flag: indicating if the vertex is a candidate member of the result match set  matchSet: set of vertices in Q that are candidate match Vertexmatch flag matchSet G1 TrueEmpty G2 TrueEmpty G3 TrueEmpty G4 TrueEmpty G5 TrueEmpty G6 TrueEmpty G7 TrueEmpty G8 TrueEmpty Pattern Graph a bc b c c b e a f d G1 G2 G3 G4 G5 G6 G7 G8 Q3Q2 Q1

15 Distributed Algorithm for Graph Simulation 15 Vertexmatch flag matchSet G1 FalseEmpty G2 TrueQ2 G3 TrueQ1 G4 TrueQ3 G5 TrueQ2 G6 TrueQ3 G7 FalseEmpty G8 FalseEmpty Pattern Graph a bc b c c b e a f d G1 G2 G3 G4 G5 G6 G7 G8 Data Graph 1 st Superstep Q3Q2 Q1

16 Distributed Algorithm for Graph Simulation 16 Vertexmatch flag matchSet G1 FalseEmpty G2 TrueQ2 G3 TrueQ1 G4 TrueQ3 G5 TrueQ2 G6 TrueQ3 G7 FalseEmpty G8 FalseEmpty Pattern Graph a bc b c c b e a f d G1 G2 G3 G4 G5 G6 G7 G8 Data Graph 2 nd Superstep Q3Q2 Q1

17 Distributed Algorithm for Graph Simulation 17 Vertexmatch flag matchSet G1 FalseEmpty G2 TrueQ2 G3 TrueQ1 G4 TrueQ3 G5 FalseEmpty G6 TrueQ3 G7 FalseEmpty G8 FalseEmpty Pattern Graph a bc b c c b e a f d G1 G2 G3 G4 G5 G6 G7 G8 Data Graph 3 rd and 4 th Supersteps Q3Q2 Q1

18 Other Distributed Graph Algorithms 18  Dual simulation  In addition to storing the match sets of its children, a vertex also stores the match sets of its parents  During match set evaluation, a vertex takes both of these into account  Strong simulation  Run dual simulation to obtain match relation R d  For each matching vertex v in R d : Create a ball centered at v with radius d Q containing any vertex in G Perform dual simulation on the ball  Strict simulation  Run dual simulation to obtain match relation R d  For each matching vertex v in R d : Create a ball centered at v with radius d Q containing only vertices in R d Perform dual simulation on the ball

19 Experiments 19  System  Cluster of 12 machines  128 GB RAM  Two 2GHz Intel Xeon E-2620  GPS (developed in infoLab, Stanford University)  Datasets  Synthesized graphs: | V |= 100e6, |E| =| V | α, α = 1.2  Real world graphs uk-2002: | V |= 18e6, |E| =298e6 ljournal: | V |= 5e6, |E| =79e6 Amazon-2008: | V |= 7e5, |E| =5e6 http://law.di.unimi.it/datasets.php

20 Experiment 1: Comparison of Strong and Strict Simulation 20 Synthesized ( |V |= 1e6, α = 1.2) Amazon-2008 ( |V |= 7e5, |V| = 5e6) Amazon-2008 ( |V |= 7e5, |V q | = 20)

21 Experiment 2: Running time 21 uk-2002-hc ( |V |= 18e6, |E|=298e6) ( l = 200 & |Vq| = 20) Synthesized ( |V |= 1e8, α = 1.2, |E| =| V | α ) Quality of result: Graph Simulation < Dual Simulation < Strict Simulation

22 Experiment 3: Impact of dataset 22 Synthesized ( |V |= 100e6) ( l = 200 & |Vq| = 20) Synthesized ( α = 1.2, |E| =| V | α )

23 Experiment 4: Impact of pattern 23 uk-2002-hc ( |V |= 18e6, |E|=298e6) Synthesized ( |V |= 100e6, α = 1.2, |E| =| V | α )

24 Related work 24  Pattern matching models  P-homomorphism  Graph Simulation, Dual and Strong  Bounded simulation  Distributed algorithms for pattern matching  S. Ma, Y. Cao, J. Huai, and T. Wo, “Distributed graph pattern matching,” WWW 2012.  Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li, “Efficient subgraph matching on billion node graphs,” VLDB 2012.

25 25  Introducing a new Pattern Matching model  Design, implementation, and test of 4 distributed algorithms for Pattern Matching  Graph Simulation  Dual Simulation  Strong Simulation  Strict Simulation Conclusion

26 Questions? 26 Thanks


Download ppt "The IEEE International Conference on Big Data 2013 Arash Fard M. Usman Nisar Lakshmish Ramaswamy John A. Miller Matthew Saltz Computer Science Department."

Similar presentations


Ads by Google