The IEEE International Conference on Big Data 2013 Arash Fard M. Usman Nisar Lakshmish Ramaswamy John A. Miller Matthew Saltz Computer Science Department.

Slides:



Advertisements
Similar presentations
1 gStore: Answering SPARQL Queries Via Subgraph Matching Presented by Guan Wang Kent State University October 24, 2011.
Advertisements

1 TDD: Topics in Distributed Databases Distributed Query Processing MapReduce Vertex-centric models for querying graphs Distributed query evaluation by.
Yinghui Wu, LFCS DB talk Database Group Meeting Talk Yinghui Wu 10/11/ Simulation Revised for Graph Pattern Matching.
Efficient Information Retrieval for Ranked Queries in Cost-Effective Cloud Environments Presenter: Qin Liu a,b Joint work with Chiu C. Tan b, Jie Wu b,
Shuai Ma, Yang Cao, Wenfei Fan, Jinpeng Huai, Tianyu Wo Capturing Topology in Graph Pattern Matching University of Edinburgh.
New Models for Graph Pattern Matching Shuai Ma ( 马 帅 )
1 Advancing Supercomputer Performance Through Interconnection Topology Synthesis Yi Zhu, Michael Taylor, Scott B. Baden and Chung-Kuan Cheng Department.
STUN: SPATIO-TEMPORAL UNCERTAIN (SOCIAL) NETWORKS Chanhyun Kang Computer Science Dept. University of Maryland, USA Andrea Pugliese.
Towards Efficient Query Processing on Massive Evolving Graphs (C-Big2012) Arash Fard, Amir Abdolrashidi, Lakshmish Ramaswamy and John A. Miller UGA Presentation.
APACHE GIRAPH ON YARN Chuan Lei and Mohammad Islam.
Mining Top-K Large Structural Patterns in a Massive Network Feida Zhu 1, Qiang Qu 2, David Lo 1, Xifeng Yan 3, Jiawei Han 4, and Philip S. Yu 5 1 Singapore.
1 Efficient Subgraph Search over Large Uncertain Graphs Ye Yuan 1, Guoren Wang 1, Haixun Wang 2, Lei Chen 3 1. Northeastern University, China 2. Microsoft.
Clustering Geometric Data Streams Jiří Skála Ivana Kolingerová ZČU/FAV/KIV2007.
MSSG: A Framework for Massive-Scale Semantic Graphs Timothy D. R. Hartley, Umit Catalyurek, Fusun Ozguner, Andy Yoo, Scott Kohn, Keith Henderson Dept.
The community-search problem and how to plan a successful cocktail party Mauro SozioAris Gionis Max Planck Institute, Germany Yahoo! Research, Barcelona.
1 QSX: Querying Social Graphs Graph Pattern Matching Graph pattern matching via subgraph isomorphism Graph pattern matching via graph simulation Revisions.
Memory-Efficient Regular Expression Search Using State Merging Department of Computer Science and Information Engineering National Cheng Kung University,
Making Pattern Queries Bounded in Big Graphs 11 Yang Cao 1,2 Wenfei Fan 1,2 Jinpeng Huai 2 Ruizhe Huang 1 1 University of Edinburgh 2 Beihang University.
Querying Big Graphs within Bounded Resources 1 Yinghui Wu UC Santa Barbara Wenfei Fan University of Edinburgh Southwest Jiaotong University Xin Wang.
Pregel: A System for Large-Scale Graph Processing
Performance Guarantees for Distributed Reachability Queries Wenfei Fan 1,2 Xin Wang 1 Yinghui Wu 1,3 1 University of Edinburgh 2 Harbin Institute of Technology.
Tennessee Technological University1 The Scientific Importance of Big Data Xia Li Tennessee Technological University.
1 On Querying Historical Evolving Graph Sequences Chenghui Ren $, Eric Lo *, Ben Kao $, Xinjie Zhu $, Reynold Cheng $ $ The University of Hong Kong $ {chren,
Research Directions for Big Data Graph Analytics John A. Miller, Lakshmish Ramaswamy, Krys J. Kochut and Arash Fard Department of Computer Science University.
Advanced Software Engineering PROJECT. 1. MapReduce Join (2 students)  Focused on performance analysis on different implementation of join processors.
Pregel: A System for Large-Scale Graph Processing Presented by Dylan Davis Authors: Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert,
Genetic Approximate Matching of Attributed Relational Graphs Thomas Bärecke¹, Marcin Detyniecki¹, Stefano Berretti² and Alberto Del Bimbo² ¹ Université.
Diversified Top-k Graph Pattern Matching 1 Yinghui Wu UC Santa Barbara Wenfei Fan University of Edinburgh Southwest Jiaotong University Xin Wang.
A Clustering Algorithm based on Graph Connectivity Balakrishna Thiagarajan Computer Science and Engineering State University of New York at Buffalo.
On Graph Query Optimization in Large Networks Alice Leung ICS 624 4/14/2011.
Pregel: A System for Large-Scale Graph Processing Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and.
System Support for Managing Graphs in the Cloud Sameh Elnikety & Yuxiong He Microsoft Research.
Graph pattern matching Graph GGraph Qisomorphism f(a) = 1 f(b) = 6 f(c) = 8 f(d) = 3 f(g) = 5 f(h) = 2 f(i) = 4 f(j) = 7.
1/52 Overlapping Community Search Graph Data Management Lab, School of Computer Science
PREDIcT: Towards Predicting the Runtime of Iterative Analytics Adrian Popescu 1, Andrey Balmin 2, Vuk Ercegovac 3, Anastasia Ailamaki
Yinghui Wu, ICDE Adding Regular Expressions to Graph Reachability and Pattern Queries Wenfei Fan Shuai Ma Nan Tang Yinghui Wu University of Edinburgh.
Andreas Papadopoulos - [DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking 26th International.
University at BuffaloThe State University of New York Lei Shi Department of Computer Science and Engineering State University of New York at Buffalo Frequent.
Answering pattern queries using views Yinghui Wu UC Santa Barbara Wenfei Fan University of EdinburghSouthwest Jiaotong University Xin Wang.
Distributed Graph Simulation: Impossibility and Possibility 1 Yinghui Wu Washington State University Wenfei Fan University of Edinburgh Southwest Jiaotong.
Pregel Algorithms for Graph Connectivity Problems with Performance Guarantees Da Yan (CUHK), James Cheng (CUHK), Kai Xing (HKUST), Yi Lu (CUHK), Wilfred.
Computer Science and Engineering TreeSpan Efficiently Computing Similarity All-Matching Gaoping Zhu #, Xuemin Lin #, Ke Zhu #, Wenjie Zhang #, Jeffrey.
Mining Top-K Large Structural Patterns in a Massive Network Feida Zhu 1, Qiang Qu 2, David Lo 1, Xifeng Yan 3, Jiawei Han 4, and Philip S. Yu 5 1 Singapore.
Data Structures and Algorithms in Parallel Computing Lecture 4.
RDFPath: Path Query Processing on Large RDF Graph with MapReduce Martin Przyjaciel-Zablocki et al. University of Freiburg ESWC May 2013 SNU IDB.
Stefanos Antaris A Socio-Aware Decentralized Topology Construction Protocol Stefanos Antaris *, Despina Stasi *, Mikael Högqvist † George Pallis *, Marios.
Minas Gjoka, Emily Smith, Carter T. Butts
Research Directions for Big Data Graph Analytics John A. Miller, Lakshmish Ramaswamy, Krys J. Kochut and Arash Fard.
Data Structures and Algorithms in Parallel Computing
Discovering functional interaction patterns in Protein-Protein Interactions Networks   Authors: Mehmet E Turnalp Tolga Can Presented By: Sandeep Kumar.
Outline  Introduction  Subgraph Pattern Matching  Types of Subgraph Pattern Matching  Models of Computation  Distributed Algorithms  Performance.
Indexing and Mining Free Trees Yun Chi, Yirong Yang, Richard R. Muntz Department of Computer Science University of California, Los Angeles, CA {
Yinghui Wu, SIGMOD Incremental Graph Pattern Matching Wenfei Fan Xin Wang Yinghui Wu University of Edinburgh Jianzhong Li Jizhou Luo Harbin Institute.
Department of Computer Science, Johns Hopkins University Pregel: BSP and Message Passing for Graph Computations EN Randal Burns 14 November 2013.
Subgraph Search Over Uncertain Graphs Erşan Demircioğlu.
EpiC: an Extensible and Scalable System for Processing Big Data Dawei Jiang, Gang Chen, Beng Chin Ooi, Kian Lee Tan, Sai Wu School of Computing, National.
Cohesive Subgraph Computation over Large Graphs
Answering pattern queries using views
Pagerank and Betweenness centrality on Big Taxi Trajectory Graph
Efficient Image Classification on Vertically Decomposed Data
CPT-S 415 Big Data Yinghui Wu EME B45 1.
June 2017 High Density Clusters.
Data Structures and Algorithms in Parallel Computing
Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform
Simulation based approach Shang Zechao
Pregelix: Think Like a Vertex, Scale Like Spandex
Department of Computer Science University of York
Efficient Subgraph Similarity All-Matching
Tahsin Reza Matei Ripeanu Nicolas Tripoul
Approximate Graph Mining with Label Costs
Presentation transcript:

The IEEE International Conference on Big Data 2013 Arash Fard M. Usman Nisar Lakshmish Ramaswamy John A. Miller Matthew Saltz Computer Science Department University of Georgia, Athens, GA October 7, 2013 A Distributed Vertex-Centric Approach for Pattern Matching in Massive Graphs

Outline 2  Introduction  Subgraph Pattern Matching  Types of Subgraph Pattern Matching  Strict Simulation  Distributed Algorithms  Experiments

Introduction 3  Labeled Directed Graph G(V, E, l)  Versatile and expressive  Social networks, web search engines, genome sequencing, etc.  Data sizes are growing rapidly  Facebook: 1 billion+ users, average degree of 140  Twitter: 200 million+ users, 400 million tweets each day  Bigger datasets mean bigger graphs  Pattern (Query Graph): the interesting small graph  Data Graph: the massive graph contains the information  Pattern Matching: finding the instances of the pattern in the data graph

Subgraph Pattern Matching 4  There are different paradigms for pattern matching:  Exact Topological Matching Sub-graph Isomorphism  Simulation Matching (meaning of the relations) Graph Simulation Dual Simulation Strong Simulation Strict Simulation SD: Software Developer Bio: Biologist PM: Product Manager John Ann Sara Mary Bob Alice Pattern GraphData Graph Endorsed by

Example: Graph Simulation 5 PM SA DB SA DB PM SD AI DB AI DBAI SA Data Graph DB AI PM SA PMSA Pattern Graph PM SA DB AI PM: Product Manager SA: Software Analyst DB: Database Designer AI: AI specialist SD: Software Developer  Label and Children conditions  Quadratic time complexity

Example: Dual Simulation 6 PM SA DB SA DB PM SD AI DB AI DBAI SA Pattern Graph Data Graph PM SA DB AI DB AI PM SA PMSA PM: Product Manager SA: Software Analyst DB: Database Designer AI: AI specialist SD: Software Developer  Adds Parents condition  Cubic time complexity

Strong Simulation 7  Extends dual simulation with a locality condition  A ball G b [ v, r ] is a subgraph of a graph G containing:  all vertices V b within an undirected distance r of the vertex v  all the edges between the vertices in V b S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo, “Capturing topology in graph pattern matching,” Proc. VLDB Endow., vol. 5, no. 4, pp. 310–321, Dec

Example: Strong Simulation 8 PM SA DB SA DB PM SD AI DB AI DBAI SA Data Graph DB AI PM SA PMSA Pattern Graph PM SA DB AI PM: Product Manager SA: Software Analyst DB: Database Designer AI: AI specialist SD: Software Developer  Adds Locality condition  Cubic time complexity

Example: Subgraph Isomorphism 9 PM SA DB SA DB PM SD AI DB AI DBAI SA Data Graph DB AI PM SA PMSA Subgraph results: {4,6,7,8} {5,6,7,8} Pattern Graph PM SA DB AI PM: Product Manager SA: Software Analyst DB: Database Designer AI: AI specialist SD: Software Developer  Exact Topological Match  NP-complete in general case

Strict Simulation 10  Strict Simulation is a smart modification of Strong Simulation  Smaller balls  Faster computation  The same quality of results or better  Cubic time complexity

Strict vs Strong Simulation 11

Example: Strict vs Strong Simulation 12  d q = 2  Strong Simulation: G b [2,2] contains all vertices  Strict Simulation: G b [2,2] contains {2,1,10,3,4}  Blue and white vertics in the result of Strong Simulation  White vertices in the result of Strict Simulation

Vertex Centric BSP 13  Bulk Synchronous Parallel  Vertex Centric  Each vertex of the graph is a computing unit  Each vertex initially knows only its id, label, and out going edges  Algorithm terminates when every vertex votes to halt.  GPS, a free implementation of Pregel, developed in infoLab, Stanford University.

Distributed Algorithm for Graph Simulation 14 Initialization: Pattern is received from the Master Set the match flag to true  match flag: indicating if the vertex is a candidate member of the result match set  matchSet: set of vertices in Q that are candidate match Vertexmatch flag matchSet G1 TrueEmpty G2 TrueEmpty G3 TrueEmpty G4 TrueEmpty G5 TrueEmpty G6 TrueEmpty G7 TrueEmpty G8 TrueEmpty Pattern Graph a bc b c c b e a f d G1 G2 G3 G4 G5 G6 G7 G8 Q3Q2 Q1

Distributed Algorithm for Graph Simulation 15 Vertexmatch flag matchSet G1 FalseEmpty G2 TrueQ2 G3 TrueQ1 G4 TrueQ3 G5 TrueQ2 G6 TrueQ3 G7 FalseEmpty G8 FalseEmpty Pattern Graph a bc b c c b e a f d G1 G2 G3 G4 G5 G6 G7 G8 Data Graph 1 st Superstep Q3Q2 Q1

Distributed Algorithm for Graph Simulation 16 Vertexmatch flag matchSet G1 FalseEmpty G2 TrueQ2 G3 TrueQ1 G4 TrueQ3 G5 TrueQ2 G6 TrueQ3 G7 FalseEmpty G8 FalseEmpty Pattern Graph a bc b c c b e a f d G1 G2 G3 G4 G5 G6 G7 G8 Data Graph 2 nd Superstep Q3Q2 Q1

Distributed Algorithm for Graph Simulation 17 Vertexmatch flag matchSet G1 FalseEmpty G2 TrueQ2 G3 TrueQ1 G4 TrueQ3 G5 FalseEmpty G6 TrueQ3 G7 FalseEmpty G8 FalseEmpty Pattern Graph a bc b c c b e a f d G1 G2 G3 G4 G5 G6 G7 G8 Data Graph 3 rd and 4 th Supersteps Q3Q2 Q1

Other Distributed Graph Algorithms 18  Dual simulation  In addition to storing the match sets of its children, a vertex also stores the match sets of its parents  During match set evaluation, a vertex takes both of these into account  Strong simulation  Run dual simulation to obtain match relation R d  For each matching vertex v in R d : Create a ball centered at v with radius d Q containing any vertex in G Perform dual simulation on the ball  Strict simulation  Run dual simulation to obtain match relation R d  For each matching vertex v in R d : Create a ball centered at v with radius d Q containing only vertices in R d Perform dual simulation on the ball

Experiments 19  System  Cluster of 12 machines  128 GB RAM  Two 2GHz Intel Xeon E-2620  GPS (developed in infoLab, Stanford University)  Datasets  Synthesized graphs: | V |= 100e6, |E| =| V | α, α = 1.2  Real world graphs uk-2002: | V |= 18e6, |E| =298e6 ljournal: | V |= 5e6, |E| =79e6 Amazon-2008: | V |= 7e5, |E| =5e6

Experiment 1: Comparison of Strong and Strict Simulation 20 Synthesized ( |V |= 1e6, α = 1.2) Amazon-2008 ( |V |= 7e5, |V| = 5e6) Amazon-2008 ( |V |= 7e5, |V q | = 20)

Experiment 2: Running time 21 uk-2002-hc ( |V |= 18e6, |E|=298e6) ( l = 200 & |Vq| = 20) Synthesized ( |V |= 1e8, α = 1.2, |E| =| V | α ) Quality of result: Graph Simulation < Dual Simulation < Strict Simulation

Experiment 3: Impact of dataset 22 Synthesized ( |V |= 100e6) ( l = 200 & |Vq| = 20) Synthesized ( α = 1.2, |E| =| V | α )

Experiment 4: Impact of pattern 23 uk-2002-hc ( |V |= 18e6, |E|=298e6) Synthesized ( |V |= 100e6, α = 1.2, |E| =| V | α )

Related work 24  Pattern matching models  P-homomorphism  Graph Simulation, Dual and Strong  Bounded simulation  Distributed algorithms for pattern matching  S. Ma, Y. Cao, J. Huai, and T. Wo, “Distributed graph pattern matching,” WWW  Z. Sun, H. Wang, H. Wang, B. Shao, and J. Li, “Efficient subgraph matching on billion node graphs,” VLDB 2012.

25  Introducing a new Pattern Matching model  Design, implementation, and test of 4 distributed algorithms for Pattern Matching  Graph Simulation  Dual Simulation  Strong Simulation  Strict Simulation Conclusion

Questions? 26 Thanks