SYNC or ASYNC? Time to Fuse for Distributed Graph-Parallel Computation
Chenning Xie+, Rong Chen+, Haibing Guan*, Binyu Zang+ and Haibo Chen+
Institute of Parallel and Distributed Systems+, Department of Computer Science*
Shanghai Jiao Tong University

Big Data → Graph Computation
Graph-structured computation has been adopted in a wide range of areas, including NLP:
- 100 hours of video uploaded every minute
- 1.11 billion users
- 6 billion photos
- 400 million tweets/day

Graph-parallel Computation
"Think like a Vertex", e.g. PageRank: R_i = ∑_j W_i,j · R_j
Characteristics:
- Linked set → data dependence
- A vertex's rank depends on those that link to it → predictable accesses
- Convergence → iterative computation
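To make the vertex-centric view concrete, here is a minimal PageRank compute() sketch in Python. The damping factor d = 0.85 and the weights W_i,j = 1/out_degree(j) are standard PageRank conventions rather than something stated on this slide, and the graph/rank interface (in_neighbors, out_neighbors, out_degree) is assumed for illustration:

    # One vertex-centric PageRank update ("think like a vertex").
    def compute(v, graph, rank):
        d = 0.85                          # assumed damping factor
        total = sum(rank[j] / graph.out_degree(j)
                    for j in graph.in_neighbors(v))
        new_rank = (1 - d) + d * total
        changed = abs(new_rank - rank[v]) > 1e-6   # convergence test
        rank[v] = new_rank
        # Activate out-neighbors only if v's rank changed noticeably.
        return graph.out_neighbors(v) if changed else []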

Distributed Graph Computation
Why distribute?
- Larger graphs
- More complicated computation
- Storage support

Distributed Graph Computation
Users define the logic (e.g. PageRank):
- Input: R_j (the data of each neighbor j)
- compute(): vertex_data = ∑_j W_j · R_j
The framework:
- Loads & partitions the graph over the cluster
- Schedules compute() repeatedly until convergence
[Figure: a five-vertex graph partitioned across Machine A and Machine B]

Existing Scheduling Modes - Synchronous (Sync Mode)
Scheduling (pseudocode):
    while (iteration ≤ max) do
        if Va = ∅ then break
        V'a ← ∅
        foreach v ∈ Va do
            A ← compute(v)
            V'a ← V'a ∪ A
        barrier, then Va ← V'a
        iteration++
Internal state (e.g. Machine A): the set of active vertices, kept in memory alongside the previous state; the new set V'a is flipped into Va at each global barrier.
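The loop above as a runnable single-machine sketch in Python, reusing the hypothetical compute(), graph and rank from the PageRank sketch earlier. Note that a faithful BSP system would double-buffer vertex data so a superstep only reads values from the previous one; this sketch omits that for brevity:

    # Sync (BSP-style) scheduling: every active vertex runs once per
    # iteration; vertices it activates only run in the next iteration.
    def run_sync(graph, rank, active, max_iter=100):
        for iteration in range(max_iter):
            if not active:
                break
            next_active = set()
            for v in active:                  # one superstep
                next_active.update(compute(v, graph, rank))
            # "Global barrier": flip the new active set into place.
            active = next_active
        return rank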

Existing Scheduling Modes - Asynchronous (Async Mode)
Scheduling (pseudocode):
    while (Va ≠ ∅) do
        v ← dequeue(Va)
        A ← compute(v)
        enqueue each vertex of A into Va
        signal across machines
Internal state (e.g. Machine A): an active queue plus a pipelined in-progress queue.
Updates propagate as soon as possible, so the computation converges faster.
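The async counterpart, again as a single-machine sketch; deduplicating queue entries via an in_queue set is an assumed detail, and a real system would use priority or FIFO queues and signal across machines, per the slide:

    from collections import deque

    # Async scheduling: activated vertices are enqueued and processed as
    # soon as possible, so fresh values propagate without any barrier.
    def run_async(graph, rank, active):
        queue = deque(active)
        in_queue = set(active)            # avoid duplicate queue entries
        while queue:
            v = queue.popleft()
            in_queue.discard(v)
            for u in compute(v, graph, rank):
                if u not in in_queue:
                    queue.append(u)
                    in_queue.add(u)
        return rank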

Existing Scheduling Modes
[Figure: synchronous iterations vs. asynchronous queues, each spanning Machine A and Machine B]
Which mode achieves better performance?

Algorithms: Sync vs. Async
Same configuration + different algorithms?
Different algorithms prefer different modes:
- Belief Propagation: a large active vertex set, with each vertex collecting all data from its neighbors (favors Sync)
- Shortest paths: require fast broadcast of the shortest-path value (favors Async)

Configuration: Sync vs. Async
Same configuration + different algorithms: uncertain.
Different configurations + the same algorithm (LBP)?
[Figure: scaling on #machines and across different partitioning methods]
- The better choice changes with the configuration
- Partitioning methods affect load balance & communication
- Sync mode batches heavy load; Async mode scales better

Stages: Sync vs. Async
Same configuration + different algorithms: uncertain.
Different configurations + the same algorithm: uncertain.
Same configuration + the same algorithm?
[Figure: execution time (s) over stages for SSSP and Graph Coloring]
Neither mode stays ahead for the whole execution:
- SSSP: Async starts faster, while Sync's throughput grows to a peak
- Graph Coloring: Sync is faster but does not converge; Async is slower but converges

Summary: Sync vs. Async

Properties      SYNC          ASYNC
Communication   Regular       Irregular
Convergence     Slow          Fast

Favorites       SYNC          ASYNC
Algorithm       I/O bound     CPU bound
Workload        Heavyweight   Lightweight
Scalability     |Graph|       |Machines|

The better choice is unintuitive, and a single mode alone may still be suboptimal.

Contributions
- First comprehensive study of the Sync & Async modes
- PowerSwitch: adaptive, fast & seamless mode switches
  - Hybrid Execution Mode (Hsync Mode): dynamically and transparently supports correct mode switches
  - Switch Timing Model: determines the more efficient mode by combining online sampling, offline profiling and heuristics

Agenda
- How to Switch: the Hsync mode
  - Internal state conversion
  - Consistency & correctness
- When to Switch: the timing model
  - Performance metrics
  - Prediction for the current mode
  - Estimation for the other mode
- Implementation
- Evaluation

Challenges of Switches
State must be converted at consistent switch points:
- Sync mode: vertex updates are unordered; the active set flips at the global barrier
- Async mode: a priority/FIFO queue, mutated by dequeue and enqueue
[Figure: internal state of one machine in each mode - active vertices flipped at the global barrier vs. an active queue plus a pipelined in-progress queue]

Challenges of Switches
The Hsync mode uses consistent switch points:
- Sync -> Async: switch at a global barrier
- Async -> Sync: suspend & wait for in-flight work to drain
- State transfer: the active vertex set
[Figure: the switch points within the internal state of one machine]
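A minimal sketch of the state transfer, in terms of the run_sync/run_async representations used in the earlier sketches; the set-to-queue conversion is what the slide specifies, the rest is assumed glue:

    from collections import deque

    # Sync -> Async, taken at a global barrier: the flipped active set
    # is exactly the pending work, so it seeds the async queue.
    def sync_to_async_state(active_set):
        return deque(active_set)

    # Async -> Sync: after suspending workers and waiting for in-flight
    # compute() calls to drain, the queued vertices form the active set.
    def async_to_sync_state(active_queue):
        return set(active_queue)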

Agenda
- How to Switch: the Hsync mode
  - Internal state conversion
  - Consistency & correctness
- When to Switch: the timing model
  - Performance metrics
  - Prediction for the current mode
  - Estimation for the other mode
- Implementation
- Evaluation

Switch Timing
Switch timing is affected by many factors. Challenges:
- How to quantify real-time performance?
- How to obtain the metrics?
Performance metric:
    Throughput = |V_compute| / (T_interval · µ)
where the convergence ratio µ = |NumTask_async| / |NumTask_sync| is obtained by sampling on specific input patterns, e.g. power-law degree distribution, large diameter, high density…
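As code, the metric might be computed from simple counters, as in this sketch; the counter names are hypothetical rather than PowerSwitch's API, and the value of µ comes from sampling as described above:

    # Throughput = |V_compute| / (T_interval * mu), where mu is the
    # convergence ratio measured for this input pattern.
    def throughput(vertices_computed, interval_seconds, mu):
        return vertices_computed / (interval_seconds * mu)

    # Example: 1.2M vertex updates in a 0.5s interval with mu = 0.8.
    print(throughput(1_200_000, 0.5, 0.8))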

Predicting Throughput for the Current Mode
- Sync: each iteration serves as the measurement interval
- Async: a constant measurement interval
[Figure: measured throughput per interval]
The throughput of the next interval is calculated from the current measurement plus accumulated history.
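The slides do not give the exact accumulation scheme; one plausible reading of "current + history accumulation" is an exponentially weighted moving average, sketched here with an assumed smoothing weight:

    # Hypothetical smoothing: blend the latest measured throughput with
    # the accumulated history (alpha = 0.5 is an assumed weight, not a
    # value taken from the paper or the slides).
    def predict_next(history_estimate, current_measurement, alpha=0.5):
        return alpha * current_measurement + (1 - alpha) * history_estimate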

Predicting for the Other (Offline) Mode
There is no execution information for the mode that is not running.
Predicting Async while in Sync mode:
- Online sampling: run Async on a subset of the input before starting
- Offline profiling: build a neural-network model (see the paper)

Predicting for the Other (Offline) Mode
Predicting Sync while in Async mode is hard to do exactly.
Heuristic: Sync makes higher utilization of resources, so Thro_sync > Thro_async whenever there is enough workload.
Condition for switching Async -> Sync:
- the number of active vertices is increasing, and
- the workload rate |V_new| / T > Thro_async
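The heuristic transcribed as code; the counter names are hypothetical, but the condition itself is the one stated on the slide:

    # Trigger Async -> Sync when the active set is growing and new work
    # arrives faster than the measured Async throughput retires it.
    def should_switch_to_sync(new_vertices, interval_seconds,
                              active_now, active_before, thro_async):
        workload_rate = new_vertices / interval_seconds   # |V_new| / T
        return active_now > active_before and workload_rate > thro_async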

Prediction Accuracy
PageRank: predicted throughput vs. real sampled throughput.
[Figure: predicted and sampled throughput curves for Sync mode and Async mode]

Prediction Accuracy
PageRank: predicted switch timing vs. optimal.
Predicted: 15.2s; optimal: 14.5s.

Implementation
PowerSwitch:
- Based on the latest GraphLab (PowerGraph) v2.2, with both Sync & Async modes
- Provides the same graph abstraction: transparent & compatible with all GraphLab apps
- Open source: http://ipads.se.sjtu.edu.cn/projects/powerswitch.html

Implementation: Architecture
[Figure: architecture - new components (sampler, predictor, mode switcher) plus extensions such as fault tolerance]

Evaluation
Baseline: the original Sync & Async modes.
Configuration: a 48-node EC2-like cluster (VM-based); each node has 4 AMD Opteron cores and 12GB of RAM, connected by a 1 GigE network.
Algorithms and data sets:

Algorithm        Graph           |V|      |E|
PageRank         LJournal        5.4M     79M
                 Wiki            5.7M     130M
                 Twitter         42M      1.47B
LBP              SYN-ImageData   1-12M    2-24M
SSSP, Coloring   RoadCA          1.9M     5.5M

Performance Overview
[Figure: PowerSwitch vs. baseline for PageRank, SSSP, LBP and Coloring]
PowerSwitch outperforms the baseline's best mode by 9% to 73% across all algorithms and data sets.

Switch Overhead
- Sync -> Async: 0.1s
- Async -> Sync: 0.6s
Overhead grows slightly as the number of active vertices increases.

Case Study: Single-Source Shortest Path (SSSP)
[Figure: speedup over execution time (s), with the Sync-to-Async and Async-to-Sync switch points marked]
Execution mode: Async -> Sync -> Async

Conclusion
PowerSwitch contributes:
- A comprehensive analysis of the performance of the Sync and Async modes across different algorithms, configurations and execution stages
- An Hsync mode that dynamically switches between Sync & Async to pursue optimal performance
- An effective switch timing model that predicts the suitable mode via sampling & profiling
- Performance that outperforms GraphLab's best mode by 9% to 73% for various algorithms and data sets

Thanks! Questions?
http://ipads.se.sjtu.edu.cn/projects/powerswitch.html
Institute of Parallel and Distributed Systems