CSE 486/586 Distributed Systems Graph Processing Steve Ko Computer Sciences and Engineering University at Buffalo

CSE 486/586 Recap Byzantine generals problem –The generals must decide on a common plan of action. –But some of the generals can be traitors. Requirements –All loyal generals decide upon the same plan of action (e.g., attack or retreat). –A small number of traitors cannot cause the loyal generals to adopt a bad plan. Impossibility result –In general, with fewer than 3f + 1 nodes, we cannot tolerate f faulty nodes. 2

CSE 486/586 Today Distributed Graph Processing Google’s Pregel system –Inspiration for many newer graph processing systems: Piccolo, Giraph, GraphLab, PowerGraph, LFGraph, X-Stream, etc. 3

CSE 486/586 What Graphs? Large graphs are all around us Internet Graph: vertices are routers/switches and edges are links World Wide Web: vertices are webpages, and edges are URL links on a webpage pointing to another webpage –Called “Directed” graph as edges are uni-directional Social graphs: Facebook, Twitter, LinkedIn Biological graphs: DNA interaction graphs, ecosystem graphs, etc. 4

CSE 486/586 What Graph Analysis? Need to derive properties from these graphs Need to summarize these graphs into statistics E.g., find shortest paths between pairs of vertices –Internet (for routing) –LinkedIn (degrees of separation) E.g., do matching –Dating graphs in match.com (for better dates) PageRank –Web Graphs –Google search, Bing search, Yahoo search: all rely on this And many (many) other examples! 5

CSE 486/586 What Are the Difficulties? Large! –Human social networks have hundreds of millions of vertices and billions of edges –The WWW has billions of vertices and edges Hard to store the entire graph on one server and process it –Slow on one server (even if beefy!) Need a distributed solution 6

CSE 486/586 Example Are C and D connected? Can we get all connected pairs? 7 [Figure: example graph with vertices A, B, C, D, E]

CSE 486/586 Typical Graph Processing Works in iterations Each vertex is assigned a value In each iteration, each vertex: –Gathers values from its immediate neighbors (vertices joined to it directly by an edge), e.g., B → A, C → A, D → A, … –Does some computation using its own value and its neighbors’ values. –Updates its new value and sends it out to its neighboring vertices, e.g., A → B, C, D, E Graph processing terminates after: i) a fixed number of iterations, or ii) when vertices stop changing values 8 [Figure: example graph with vertices A, B, C, D, E]

CSE 486/586 Example Are C and D connected? 9 [Figure: example graph with vertices A, B, C, D, E]

CSE 486/586 One Possible Way Each vertex has two Boolean variables. –Cflag: true/false –Dflag: true/false Goal –Cflag: Am I connected to C? –Dflag: Am I connected to D? –If both are true at any vertex, C and D are connected. –Initially all false, except at C and D. Every iteration, each vertex: –Propagates its values to neighboring vertices. –Collects the values from other vertices. –Updates its variables with OR, i.e., if any one of the values is true, it’s true. Iterate N times, where N is the length of the longest path in the entire graph. 10
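For concreteness, here is a minimal single-machine sketch of this flag-propagation scheme in Python. The adjacency lists are one plausible reading of the example figure (A adjacent to B, C, D, and E); the variable names and layout are illustrative, not part of any particular system.

```python
# Minimal sketch of the Cflag/Dflag propagation described above.
graph = {            # adjacency lists (undirected: edges listed both ways)
    "A": ["B", "C", "D", "E"],
    "B": ["A", "C"],
    "C": ["A", "B"],
    "D": ["A", "E"],
    "E": ["A", "D"],
}

# Each vertex holds two Boolean flags: (connected to C?, connected to D?).
# Initially all false, except at C and D themselves.
flags = {v: [v == "C", v == "D"] for v in graph}

N = len(graph)  # a safe upper bound on the length of the longest path
for _ in range(N):
    new_flags = {}
    for v, nbrs in graph.items():
        # Collect neighbors' flags and OR them into our own.
        cflag = flags[v][0] or any(flags[u][0] for u in nbrs)
        dflag = flags[v][1] or any(flags[u][1] for u in nbrs)
        new_flags[v] = [cflag, dflag]
    flags = new_flags  # all vertices update in lockstep each iteration

# C and D are connected iff some vertex ends up with both flags true.
print(any(c and d for c, d in flags.values()))  # True for this graph
```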

CSE 486/586 Recall MapReduce? 11 [Figure: Map phase → Shuffle phase → Reduce phase]

CSE 486/586 Can We Use MapReduce? Run MapReduce N times, where N is the length of the longest path. Goal for each iteration: –Propagate each vertex’s Cflag & Dflag values to their neighboring vertices. –Collect Cflag & Dflag values from other vertices. –Update each of the Cflag & Dflag variables with OR, i.e., if any one of the values is true, it’s true. Map –Input (key, value)? Computation? Output list of (intermediate key, intermediate value)? Shuffle: Grouping based on intermediate keys. –Each intermediate key will get a list of intermediate values. Reduce –Input (intermediate key, list of intermediate values)? Computation? Output (final key, final value)? 12

CSE 486/586 MapReduce Run MapReduce N times Map –Input (key, value): (vertex id, (current Cflag & Dflag values)) –Computation: nothing. –Output list of (intermediate key, intermediate value): list of (neighboring vertex id, current Cflag & Dflag values) »For each neighboring vertex, propagate the Cflag & Dflag values Shuffle –Grouping based on vertex ids. Reduce –Input (intermediate key, list of intermediate values): (vertex id, list of neighbors’ current Cflag & Dflag values) –Computation: OR –Output (key, value): (vertex id, updated Cflag & Dflag values) 13
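Below is a toy, single-process sketch of what these N MapReduce rounds could look like, with the shuffle simulated by grouping in a dictionary. The function names are illustrative, not Hadoop's API, and one small addition beyond the slide is that each vertex also emits its own flags to itself, so the reducer does not lose the vertex's current value when taking the OR.

```python
from collections import defaultdict

graph = {"A": ["B", "C", "D", "E"], "B": ["A", "C"], "C": ["A", "B"],
         "D": ["A", "E"], "E": ["A", "D"]}

def map_phase(state):
    # Input (key, value) = (vertex id, (Cflag, Dflag)).
    # Output: the vertex's flags keyed by each neighbor's id (plus itself).
    for v, flags in state.items():
        yield v, flags
        for u in graph[v]:
            yield u, flags

def shuffle(pairs):
    # Group intermediate values by intermediate key (vertex id).
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # OR together all flag pairs received for each vertex.
    return {v: (any(c for c, _ in vals), any(d for _, d in vals))
            for v, vals in grouped.items()}

state = {v: (v == "C", v == "D") for v in graph}
for _ in range(len(graph)):                      # run MapReduce N times
    state = reduce_phase(shuffle(map_phase(state)))
print(any(c and d for c, d in state.values()))   # True: C and D are connected
```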

CSE 486/586 Pros and Cons Pros –Well-known –The system is there. –It works. Cons –Not quite a good fit –Need to re-think in terms of keys, values, maps, and reduces. Question –Can we provide a system that is a better fit for graph processing? 14

CSE 486/586 Bulk Synchronous Parallel Model Originally by Valiant (1990) –Computation proceeds in a series of supersteps: in each superstep, processes compute independently, exchange messages, and then wait at a global barrier before the next superstep begins. 15

CSE 486/586 Better Programming Model Vertex-centric programming In each iteration, each worker performs Gather-Apply-Scatter for all its assigned vertices –Gather: get all neighboring vertices’ values –Apply: compute own new value from own old value and gathered neighbors’ values –Scatter: send own new value to neighboring vertices 16
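As a rough illustration, one synchronous Gather-Apply-Scatter iteration can be sketched as below; graph, values, and apply_fn are placeholders for whatever a particular algorithm needs (for the connectivity example, apply_fn would OR the gathered flags into the vertex's own flags).

```python
def gas_iteration(graph, values, apply_fn):
    # Gather: collect each vertex's neighbors' current values.
    gathered = {v: [values[u] for u in graph[v]] for v in graph}
    # Apply: compute each vertex's new value from its old value
    # and the gathered neighbor values.
    new_values = {v: apply_fn(values[v], gathered[v]) for v in graph}
    # Scatter: the new values become visible to neighbors next iteration.
    return new_values
```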

CSE 486/586 Google Pregel Gives simple APIs for easier graph processing Vertex-centric programming The developer’s code subclasses the Vertex class and implements the Compute() method. The Vertex class allows a developer to: –Get/set vertex values –Get/set outgoing edge values (e.g., for weighted graphs) –Send messages to / receive messages from any vertex Gather-apply-scatter –Gather is done automatically (with scatter from the previous iteration) –Apply is done by Compute() –Scatter is done by explicit message passing 17
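Pregel itself is a C++ library; the sketch below is a toy, single-process Python imitation of its model, included only so the two examples on the next slides have something concrete to run against. All names here (Vertex, run_supersteps, send_message_to, vote_to_halt) are illustrative, not Google's actual API.

```python
class Vertex:
    """One vertex: a value, outgoing edges, and an active flag."""
    def __init__(self, vertex_id, value, out_edges):
        self.id = vertex_id
        self.value = value              # get/set vertex value
        self.out_edges = out_edges      # list of (target id, edge value)
        self.active = True
        self.outbox = []                # messages to deliver next superstep

    def compute(self, messages, superstep, num_vertices):
        raise NotImplementedError       # subclasses implement this

    def send_message_to(self, target, msg):
        self.outbox.append((target, msg))

    def vote_to_halt(self):
        self.active = False


def run_supersteps(vertices, max_supersteps=30):
    # vertices: dict mapping vertex id -> Vertex object
    inboxes = {vid: [] for vid in vertices}
    for superstep in range(max_supersteps):
        # Halt when no vertex is active and no messages are in transit.
        if not any(inboxes.values()) and not any(v.active for v in vertices.values()):
            break
        for v in vertices.values():
            msgs = inboxes[v.id]
            if msgs:
                v.active = True         # an incoming message reactivates a vertex
            if v.active:
                v.compute(msgs, superstep, len(vertices))
        # Barrier: deliver all messages sent during this superstep.
        inboxes = {vid: [] for vid in vertices}
        for v in vertices.values():
            for target, msg in v.outbox:
                inboxes[target].append(msg)
            v.outbox = []
    return vertices
```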

CSE 486/586 Page Rank Example 18
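A rough Python rendering of vertex-centric PageRank, in the spirit of the example in the Pregel paper, running on the toy engine sketched above (not Google's actual C++ code):

```python
class PageRankVertex(Vertex):
    def compute(self, messages, superstep, num_vertices):
        if superstep >= 1:
            # Apply: damped sum of the PageRank shares received.
            self.value = 0.15 / num_vertices + 0.85 * sum(messages)
        if superstep < 30:
            # Scatter: send an equal share of my rank along each out-edge
            # (every vertex in this tiny example has at least one out-edge).
            share = self.value / len(self.out_edges)
            for target, _ in self.out_edges:
                self.send_message_to(target, share)
        else:
            self.vote_to_halt()

# Example: a tiny directed graph, initial rank 1/N at every vertex.
edges = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
vertices = {v: PageRankVertex(v, 1 / len(edges), [(t, None) for t in edges[v]])
            for v in edges}
run_supersteps(vertices, max_supersteps=31)
print({v.id: round(v.value, 3) for v in vertices.values()})
```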

CSE 486/586 Shortest Path Example 19
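A rough sketch of single-source shortest paths in the same vertex-centric style, again using the toy engine above; edge values are distances, and vertex "A" is the assumed source for this example.

```python
INF = float("inf")

class ShortestPathVertex(Vertex):
    SOURCE = "A"   # assumed source vertex for this example

    def compute(self, messages, superstep, num_vertices):
        # The smallest distance I have heard of in this superstep.
        mindist = 0 if self.id == self.SOURCE else INF
        if messages:
            mindist = min(mindist, min(messages))
        if mindist < self.value:
            self.value = mindist
            # Relax: tell each neighbor the distance going through me.
            for target, edge_len in self.out_edges:
                self.send_message_to(target, mindist + edge_len)
        self.vote_to_halt()   # stay idle until a shorter distance arrives

# Example: a small weighted directed graph with source "A".
edges = {"A": [("B", 1), ("C", 4)], "B": [("C", 2), ("D", 6)],
         "C": [("D", 3)], "D": []}
vertices = {v: ShortestPathVertex(v, INF, edges[v]) for v in edges}
run_supersteps(vertices)
print({v.id: v.value for v in vertices.values()})  # {'A': 0, 'B': 1, 'C': 3, 'D': 6}
```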

CSE 486/586 Google Pregel Pregel uses the master/worker model –Master (one server) »Maintains the list of worker servers »Monitors workers; restarts them on failure »Provides a Web-UI monitoring tool of job progress –Worker (rest of the servers) »Processes its vertices »Communicates with the other workers Naturally captures the graph structure (each worker handles a partition of the vertices) Persistent data is stored as files on a distributed storage system (such as GFS or BigTable) Temporary data is stored on local disk 20

CSE 486/586 Pregel Execution 1. Many copies of the program begin executing on a cluster 2. The master assigns a partition of the input (vertices) to each worker »Each worker loads its vertices and marks them as active 3. The master instructs each worker to perform an iteration »Each worker loops through its active vertices & computes for each vertex »Messages can be sent at any time, but must be delivered before the end of the iteration (i.e., the barrier) »When all workers reach the iteration barrier, the master starts the next iteration 4. Computation halts when, in some iteration, no vertices are active and no messages are in transit 5. The master instructs each worker to save its portion of the graph 21

CSE 486/586 Summary Lots of (large) graphs around us Need to process them MapReduce is not a good match Distributed graph processing systems: Pregel by Google 22

CSE 486/586 Acknowledgements These slides contain material developed and copyrighted by Indranil Gupta (UIUC).