Replication-based Fault-tolerance for Large-scale Graph Processing

Presentation transcript:

Replication-based Fault-tolerance for Large-scale Graph Processing. Peng Wang, Kaiyuan Zhang, Rong Chen, Haibo Chen, Haibing Guan. Shanghai Jiao Tong University.

Graph: Graphs encode useful information and power many applications, e.g., single-source shortest paths (SSSP), community detection, and more.

Graph computing: Graphs are large, so computations require a lot of machines; at that scale, fault tolerance is important.

How graph computing works: Each worker runs the same bulk-synchronous flow: LoadGraph, then in every iteration Compute, SendMsg, EnterBarrier, Commit, LeaveBarrier. The per-vertex program for PageRank:

    PageRank(i):
        // compute its own rank
        total = 0
        foreach (j in in_neighbors(i)):
            total = total + R[j] * Wji
        R[i] = 0.15 + total
        // trigger neighbors to run again
        if R[i] not converged then activate all neighbors

[Figure: two workers W1 and W2, each holding masters and replicas of the example vertices.]
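To make the vertex-centric model above concrete, here is a minimal single-machine Python sketch of the same pull-style PageRank over supersteps. It is illustrative only, not Imitator/Hama code, and it assumes the edge weight Wji equals 0.85 / out_degree(j):

    # Minimal vertex-centric PageRank sketch (illustrative, single machine).
    def pagerank_bsp(out_edges, max_iters=30, tol=1e-4):
        vertices = list(out_edges)
        # Build in-neighbor lists; assume Wji = 0.85 / out_degree(j).
        in_nbrs = {v: [] for v in vertices}
        for j, outs in out_edges.items():
            for i in outs:
                in_nbrs[i].append(j)
        rank = {v: 1.0 for v in vertices}
        active = set(vertices)
        for _ in range(max_iters):            # one superstep per iteration
            if not active:                    # every vertex has converged
                break
            new_rank = dict(rank)
            next_active = set()
            for i in active:                  # Compute phase
                total = sum(rank[j] * (0.85 / len(out_edges[j]))
                            for j in in_nbrs[i] if out_edges[j])
                new_rank[i] = 0.15 + total
                if abs(new_rank[i] - rank[i]) > tol:
                    next_active.update(out_edges[i])   # activate neighbors (SendMsg)
            rank, active = new_rank, next_active       # Commit at the barrier
        return rank

    # Example: a tiny 3-vertex graph.
    print(pagerank_bsp({1: [2, 3], 2: [3], 3: [1]}))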

Related work on fault tolerance: Simple re-execution (MapReduce) does not fit graphs' complex data dependencies; coarse-grained FT (Spark) does not fit fine-grained updates on each vertex. The state-of-the-art fault tolerance for graph computing is checkpointing (Trinity, PowerGraph, Pregel, etc.).

How checkpoint works: At a global barrier, every worker writes its vertex states (together with the graph partition and topology) to a distributed file system (DFS). When a worker crashes, recovery reloads the graph and the latest checkpoint (iteration X) and re-executes from there. [Figure: timeline of workers W1 and W2 showing graph loading, iterations X and X+1, a crash, and recovery from the checkpoint of iteration X.]
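A rough Python sketch of this synchronous checkpointing loop (illustrative only; dfs_read and dfs_write stand in for the DFS I/O helpers and are assumptions, not real APIs):

    import pickle

    CKPT_PERIOD = 4   # write a checkpoint every 4 iterations

    def run_with_checkpoints(state, compute_superstep, dfs_read, dfs_write, num_iters):
        # On (re)start, roll back to the latest checkpoint if one exists.
        blob = dfs_read("ckpt/latest")          # returns bytes or None
        start = 0
        if blob is not None:
            start, state = pickle.loads(blob)   # resume from iteration X
        for it in range(start, num_iters):
            state = compute_superstep(state)    # Compute / SendMsg / Commit
            if (it + 1) % CKPT_PERIOD == 0:
                # Synchronous checkpoint: all workers wait at the global barrier
                # while each flushes its vertex states to the DFS.
                dfs_write("ckpt/latest", pickle.dumps((it + 1, state)))
        return state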

Problems of checkpoint: Large execution overhead, caused by the large amount of state to write and by the synchronization at the barrier. [Figure: execution time (sec) of PageRank on LiveJournal for different checkpoint periods.]

Problems of checkpoint: Besides the large overhead, recovery is slow, because it involves a lot of I/O operations and requires a standby node. [Figure: recovery time (seconds) for different checkpoint periods, compared with running without checkpoints.]

Observation and motivation: Reuse the existing replicas to provide fault tolerance. Reusing existing replicas means small overhead, and because replicas are distributed across different machines, recovery can be fast. Almost all vertices already have replicas: only 0.84%, 0.96%, 0.26%, and 0.13% of vertices have no replica on the evaluated graphs.

Contribution: Imitator, a replication-based fault-tolerance system for graph processing. Small overhead: less than 5% in all cases. Fast recovery: up to 17.5 times faster than the checkpoint-based approach.

Outline Execution flow Replication management Recovery Evaluation Conclusion

Normal execution flow: Adding FT support only requires extending the normal synchronization messages; the execution flow itself (LoadGraph, then Compute, SendMsg, EnterBarrier, Commit, LeaveBarrier in each iteration) is unchanged.
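A sketch of what such an extended synchronization message might look like (field names and layout are assumptions for illustration, not Imitator's actual wire format):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SyncMsg:
        vertex_id: int
        value: float                 # already carried by normal replica synchronization
        # --- extra fields piggybacked for fault tolerance ---
        activated: bool = False      # whether the vertex is active in the next iteration
        replica_nodes: List[int] = field(default_factory=list)  # where replicas live

    def apply_sync(replica_table, msg):
        # Replica side: apply a synchronization message from the master.
        r = replica_table[msg.vertex_id]
        r["value"] = msg.value
        r["activated"] = msg.activated          # FT: keep the full state, not just the value
        r["replica_nodes"] = msg.replica_nodes  # FT: the mirror learns replica locations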

Failure before the barrier: If a worker crashes before entering the barrier, a newbie machine joins, the surviving workers enter the barrier, roll back the current iteration, and run recovery before Commit; execution then continues normally. [Figure: execution flow of the two surviving workers and the newbie.]

Failure during the barrier: If a worker crashes after it has entered the barrier, its updates for the iteration are already committed, so the newbie machine boots and recovery is performed after leaving the barrier, without rolling back the iteration. [Figure: execution flow of the two surviving workers and the newbie.]

Management of replication: For fault tolerance, every vertex must have at least f replicas to tolerate f failures. One replica of each vertex is kept as a full-state replica (mirror), because the existing replicas lack meta information such as the replication locations. [Figure: vertices partitioned across Node1-Node3; e.g., vertex 5 has its master on n2 and replicas on n1 and n3.]
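A small sketch of the per-vertex replication metadata this implies (an assumed layout for illustration, not Imitator's actual data structures):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class VertexMeta:
        vid: int
        master_node: int          # node holding the master copy
        mirror_node: int          # full-state replica that also stores this metadata
        replica_nodes: List[int] = field(default_factory=list)  # all replica locations

        def tolerates(self, f):
            # At least f replicas (the mirror counts as one) to tolerate f failures.
            return len(self.replica_nodes) >= f

    # Example: vertex 5 with master on node 2, mirror on node 1, another replica on node 3.
    v5 = VertexMeta(vid=5, master_node=2, mirror_node=1, replica_nodes=[1, 3])
    assert v5.tolerates(1)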

Optimization: selfish vertices. The state of a selfish vertex has no consumer and is decided only by its neighbors, so instead of replicating it, its state can be recovered by re-computation.

How to recover: The challenges are to recover in parallel and to reach a consistent state after recovery.

Problems of recovery: Which vertices have crashed, and how can they be recovered without a central coordinator? Rules: 1. A master recovers its replicas. 2. If the master crashed, its mirror recovers the master and the replicas. The mirror stores the replication locations (e.g., vertex 3: master on n3, mirror on n2). [Figure: masters, mirrors, and replicas of the example graph spread over Node1-Node3.]

Rebirth: A newbie machine takes the place of the crashed node, and the surviving nodes rebuild its vertices on the newbie in parallel, following the same rules: 1. A master recovers its replicas. 2. If the master crashed, its mirror recovers the master and the replicas. [Figure: Node3 crashes; Node1 and Node2 re-create its masters, mirrors, and replicas on Newbie3.]
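A standalone Python sketch of the decentralized decision each surviving node makes during Rebirth (the dict layout and function name are assumptions for illustration):

    def rebirth_tasks(local_vertices, my_node, crashed_node):
        """local_vertices: list of dicts like
           {"vid": 5, "master": 2, "mirror": 1, "replicas": [1, 3], "state": ...}"""
        tasks = []
        for v in local_vertices:
            if v["master"] == my_node and crashed_node in v["replicas"]:
                # Rule 1: the master re-creates the replica lost on the crashed node.
                tasks.append((v["vid"], "replica"))
            elif v["mirror"] == my_node and v["master"] == crashed_node:
                # Rule 2: the mirror re-creates the crashed master (and lost replicas).
                tasks.append((v["vid"], "master"))
        return tasks

    # Example: node 2 decides what to rebuild after node 3 crashes.
    verts = [{"vid": 5, "master": 2, "mirror": 1, "replicas": [1, 3], "state": 0.3},
             {"vid": 3, "master": 3, "mirror": 2, "replicas": [2], "state": 0.7}]
    print(rebirth_tasks(verts, my_node=2, crashed_node=3))
    # -> [(5, 'replica'), (3, 'master')]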

Problems of Rebirth: It requires a standby machine, and the single newbie machine becomes a bottleneck. Alternative: migrate the crashed node's tasks to the surviving machines.

Migration: The crashed node's vertices are taken over by the surviving machines. Procedure: 1. Mirrors of the crashed masters upgrade themselves to masters and broadcast the change. 2. The surviving nodes reload the missing graph structure and reconstruct the lost replicas. [Figure: after Node3 crashes, its vertices are redistributed between Node1 and Node2.]
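A sketch of the Migration path on one surviving node (illustrative; broadcast and reload_subgraph are assumed helpers, not real APIs):

    def migrate(local_vertices, my_node, crashed_node, broadcast, reload_subgraph):
        promoted = []
        for v in local_vertices:   # v: {"vid": ..., "master": ..., "mirror": ..., ...}
            # Step 1: a mirror whose master crashed upgrades itself to master.
            if v["mirror"] == my_node and v["master"] == crashed_node:
                v["master"] = my_node
                promoted.append(v["vid"])
        broadcast({"new_master": my_node, "vertices": promoted})
        # Step 2: reload the graph structure that lived only on the crashed node
        # and reconstruct the lost replicas on the surviving machines.
        reload_subgraph(crashed_node)
        return promoted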

Inconsistency after recovery: Recovering from replicas alone can leave state inconsistent. For example, the master of vertex 2 on W2 has updated its rank from 0.1 to 0.2 and activated its neighbors, while the replica of vertex 2 on W1 still holds the old rank and an inactive flag. [Figure: states of master 2 (W2) and replica 2 (W1) across the Compute/SendMsg/Commit phases.]

Replay Activation: To restore consistency, activations are replayed during recovery: each vertex records whether it has activated its neighbors (ActNgbs) and re-sends those activation messages to the recovered vertices. [Figure: masters and replicas of vertices 1 and 2 on W1 and W2, with their Rank, Activated, and ActNgbs fields before and after replay.]

Evaluation: 50 VMs (10 GB memory, 4 cores each); HDFS with 3-way replication. Applications and datasets:

    Application  Graph     Vertices  Edges
    PageRank     GWeb      0.87M     5.11M
    PageRank     LJournal  4.85M     70.0M
    PageRank     Wiki      5.72M     130.1M
    ALS          SYN-GL    0.11M     2.7M
    CD           DBLP      0.32M     1.05M
    SSSP         RoadCA    1.97M     5.53M

Speedup over Hama: Imitator is based on Hama, an open-source clone of Pregel, extended with replication for dynamic computation [Distributed GraphLab, VLDB'12]. Evaluated systems: Baseline (Imitator without fault tolerance), REP (Baseline + replication-based FT), CKPT (Baseline + checkpoint-based FT).

Normal execution overhead: Replication has negligible execution overhead.

Communication Overhead

Performance of recovery. [Figure: recovery execution time in seconds; the two tallest bars are labeled 41 and 56.]

Recovery scalability: The more machines, the faster the recovery. [Figure: recovery time (seconds) versus the number of machines.]

Simultaneous failures: Adding more replicas allows Imitator to tolerate the simultaneous failure of more than one machine. [Figure: execution time (seconds) with additional replicas.]

Case study: PageRank on the LiveJournal dataset, with a checkpoint every 4 iterations; a failure is injected between the 6th and the 7th iteration. [Figure: finished iterations versus execution time (seconds), annotated with "Detect Failure" and "Replay"; the values shown include 2.6, 8.8, and 45 seconds.]

Conclusion: Imitator is a graph engine that supports fault tolerance. Its execution overhead is negligible because it leverages existing replicas, and its recovery is fast thanks to its parallel recovery approach. You are welcome to raise any questions!

Backup

Memory Consumption

Partition Impact