Presentation transcript:

Fast Failure Recovery in Distributed Graph Processing Systems (VLDB 2014). Authors: Yanyan Shen, Beng Chin Ooi, Bogdan Marius Tudor (National University of Singapore), Wei Lu (Renmin University), Gang Chen (Zhejiang University), H.V. Jagadish (University of Michigan). Presented by HaeJoon Lee, Big Data Final Seminar.

Outline: Background, Motivation, Partition Based Recovery, Implementation, Evaluation, Conclusion

1 Background: Distributed Graph Processing Systems
- The set of vertices and edges is divided into partitions.
- The partitions are distributed among the compute nodes.
Bulk Synchronous Parallel (BSP) is the computation model of these systems: each worker first executes an input phase, then the workers process their partitions iteratively in supersteps separated by global barriers.
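To make the BSP model above concrete, here is a minimal, self-contained Python sketch (toy classes and a placeholder vertex program; this is not the Giraph API): workers compute on their local partitions, exchange messages, and only advance to the next superstep after a global barrier.

```python
# Minimal BSP sketch (hypothetical classes, not the Giraph API). Workers hold
# graph partitions, compute a superstep on local vertices, exchange messages,
# and synchronise at a global barrier before the next superstep.

class Worker:
    def __init__(self, worker_id, vertices):
        self.worker_id = worker_id
        self.vertices = vertices                    # {vertex_id: value}
        self.inbox = {v: [] for v in vertices}

    def compute_superstep(self, route):
        """Update each local vertex from its inbox and emit messages."""
        outgoing = []
        for v, value in self.vertices.items():
            msgs = self.inbox[v]
            self.vertices[v] = value + len(msgs)    # placeholder vertex program
            for neighbour in route.get(v, []):
                outgoing.append((neighbour, 1))     # message to a neighbour
        self.inbox = {v: [] for v in self.vertices}
        return outgoing

def run_bsp(workers, route, owner, num_supersteps):
    for _ in range(num_supersteps):
        outgoing = []
        for w in workers:
            outgoing.extend(w.compute_superstep(route))
        # Global barrier: deliver all messages before the next superstep starts.
        for dst_vertex, msg in outgoing:
            workers[owner[dst_vertex]].inbox[dst_vertex].append(msg)

# Example: two workers, each owning one partition of a 4-vertex graph.
route = {0: [2], 1: [3], 2: [0], 3: [1]}
owner = {0: 0, 1: 0, 2: 1, 3: 1}
workers = [Worker(0, {0: 0, 1: 0}), Worker(1, {2: 0, 3: 0})]
run_bsp(workers, route, owner, num_supersteps=3)
```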

1 Background: Scaling up the number of compute nodes has two effects.
- It increases the number of failed nodes during job execution.
- System progress stops during recovery, so many nodes become idle.
For these reasons, we need an efficient failure recovery mechanism.

2 Motivation: Checkpoint-Based Recovery (CBR) flow
- Requires every node to write its status to storage as a checkpoint.
- On failure, uses healthy nodes to load the status from the last checkpoint.
- Re-executes all the missing workloads.
However, CBR causes high recovery latency: the missing workloads are re-executed over the whole graph, on failed and even on healthy nodes.
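As a rough illustration of why this is expensive, the toy Python sketch below (a hypothetical state representation, not the paper's implementation) counts the node-supersteps that must be redone: after a failure, every node, healthy or not, rolls back to the latest checkpoint and re-executes all missing supersteps.

```python
# Toy cost illustration of checkpoint-based recovery. State is a dict
# {node_id: superstep_reached}; on failure, every node rolls back to the
# latest checkpoint and re-executes all supersteps up to the failure point.

def cbr_recover(state, checkpoint_step, failure_step):
    recomputed_supersteps = 0
    for node in state:                       # healthy AND failed nodes roll back
        state[node] = checkpoint_step
    for step in range(checkpoint_step + 1, failure_step + 1):
        for node in state:                   # whole graph re-executed each step
            state[node] = step
            recomputed_supersteps += 1
    return recomputed_supersteps

# Example: 4 nodes, checkpoint at superstep 11, failure at superstep 14.
state = {n: 14 for n in range(4)}
print(cbr_recover(state, 11, 14))            # 12 node-supersteps of redundant work
```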

2 Motivation: The problem of cascading failures
- A failure can occur at any time, not only during normal execution but also while recovery from an earlier failure is still in progress (a cascading failure).
- Checkpointing more frequently reduces the lost work, but frequent checkpointing incurs a long execution time.
The paper therefore proposes fast failure recovery via Partition Based Recovery (PBR).

Outline: Background, Motivation, Partition Based Recovery, Implementation, Evaluation, Conclusion

3 Partition Based Recovery: Execution flow
- Restricts recovery to the subgraphs on the failed nodes only, using locally logged messages.
- Divides the subgraphs of the failed nodes into partitions.
- Distributes these partitions among the compute nodes.
- Reloads these partitions from the last checkpoint and rebalances the configuration afterwards.
What is local message logging in PBR?
- PBR requires every node to log its outgoing messages locally at the end of each superstep.
- During recovery, every healthy node forwards the logged messages destined for vertices in failed partitions.
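The local message logging can be sketched as follows (hypothetical classes, not the Giraph implementation): every node appends its outgoing messages to a per-superstep local log, and during recovery it replays only the messages addressed to vertices in failed partitions.

```python
# Sketch of local message logging for PBR. Each node logs its outgoing
# messages keyed by superstep so they can be replayed to failed partitions.

from collections import defaultdict

class LoggingWorker:
    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.message_log = defaultdict(list)   # superstep -> [(dst_vertex, msg)]

    def send(self, superstep, dst_vertex, msg):
        # Log locally at (or before) the end of the superstep.
        self.message_log[superstep].append((dst_vertex, msg))

    def replay_for(self, superstep, failed_vertices):
        """Return logged messages addressed to vertices being recovered."""
        return [(v, m) for v, m in self.message_log[superstep]
                if v in failed_vertices]

# Example: worker 0 logs messages in superstep 12; vertex 7 later fails.
w = LoggingWorker(0)
w.send(12, 7, "rank=0.3")
w.send(12, 9, "rank=0.1")
print(w.replay_for(12, failed_vertices={7}))   # only messages vertex 7 needs
```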

3 CBR vs PBR: Checkpoint Based Recovery. [Figure: partitions A-F on nodes N1 and N2, with recovered copies A'-F'; each node's storage holds a checkpoint.] CBR incurs HIGH computation cost and communication cost.

3 CBR vs PBR: Partition Based Recovery. [Figure: partitions A-F on nodes N1 and N2; only the partitions of the failed node are recovered, and they are redistributed among the compute nodes.]

3 Details of PBR: Step 1, partition reassignment. [Figure: partitions A-F on nodes N1 and N2.]
- Randomly assign the failed partitions to compute nodes.
- In each iteration, calculate the recovery cost of the current assignment.
- Keep track of the minimal cost found so far.
- The optimal partition assignment is the generated assignment with the minimal cost.
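A toy version of this reassignment search, under an assumed cost model where recovery time is dominated by the most loaded node (the paper's actual cost function and search procedure are more detailed), might look like this:

```python
# Toy partition reassignment by cost minimisation. Several random assignments
# of failed partitions to nodes are generated; the cheapest one is kept.

import random

def recovery_cost(assignment, partition_size):
    """Assumed cost: recovery time is bounded by the busiest node (max load)."""
    load = {}
    for partition, node in assignment.items():
        load[node] = load.get(node, 0) + partition_size[partition]
    return max(load.values())

def reassign(failed_partitions, nodes, partition_size, iterations=100, seed=0):
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(iterations):
        candidate = {p: rng.choice(nodes) for p in failed_partitions}
        cost = recovery_cost(candidate, partition_size)
        if cost < best_cost:                 # keep the minimal-cost assignment
            best, best_cost = candidate, cost
    return best, best_cost

# Example: two failed partitions A, B reassigned over surviving nodes N1, N2.
sizes = {"A": 120, "B": 80}
plan, cost = reassign(["A", "B"], ["N1", "N2"], sizes)
print(plan, cost)
```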

3 Details of PBR: Step 2, recomputation of the missing workload. Assumption: the latest checkpoint was taken at superstep 11. [Figure: partitions A-F on nodes N1 and N2 at supersteps 11 and 12; the legend distinguishes failed and healthy partitions, locally logged messages, and vertices recomputed from the checkpoint.]
- The failed partitions, e.g., (A,B) and (C,D) in the figure, load the checkpoint taken at superstep 11 and recompute from there.
- The healthy partitions forward their locally logged messages to the vertices in the failed partitions instead of recomputing.
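A minimal sketch of this recomputation step (toy state and hypothetical names, not the paper's code): failed partitions restart from the checkpointed superstep, while healthy partitions only supply the messages they logged for those supersteps.

```python
# Sketch of PBR recomputation. Failed partitions restart from the checkpointed
# superstep and re-execute; healthy partitions do NOT recompute, they only
# replay the messages they logged for the failed partitions.

def pbr_recompute(failed, healthy_logs, checkpoint_step, failure_step):
    """failed: {partition: checkpointed_state}
       healthy_logs: {superstep: {partition: [messages]}}"""
    state = dict(failed)                      # start from the checkpoint
    for step in range(checkpoint_step + 1, failure_step + 1):
        replayed = healthy_logs.get(step, {})
        for p in state:
            msgs = replayed.get(p, [])
            # Placeholder vertex program: fold replayed messages into state.
            state[p] = state[p] + sum(msgs)
    return state

# Example: checkpoint at superstep 11, failure detected at superstep 13.
failed = {"A": 10, "B": 5}                                    # checkpointed values
logs = {12: {"A": [1, 2], "B": [3]}, 13: {"A": [4], "B": []}} # from healthy nodes
print(pbr_recompute(failed, logs, checkpoint_step=11, failure_step=13))
```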

3 Details of PBR: Step 3, rebalance the configuration if the partition assignment now differs across nodes.
How does PBR handle a cascading failure?
- Unlike CBR's handling, PBR treats a cascading failure as a normal failure and simply re-executes these three steps.
- In practice, failures do not occur very frequently.

4 PBR Architecture on Giraph
Master: computes the partition assignment ('Assign Partitions') as the recovery plan and saves it to ZooKeeper.
ZooKeeper: a centralized service for maintaining configuration information and naming, and for providing distributed synchronization.
Slaves: fetch the partition assignment from ZooKeeper. If a slave is at a checkpointing superstep, it writes a checkpoint and then performs the computation; if it is restarting after a failure, it loads its partitions and then performs the computation.
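The slave-side decision between the checkpoint path and the restart path could be sketched as below; the coordinator and slave classes are stand-ins, not the real Giraph or ZooKeeper APIs.

```python
# Sketch of the slave-side control flow: fetch the assignment, then either
# load partitions (restart path) or checkpoint (checkpoint step), then compute.

class FakeCoordinator:
    """Stands in for ZooKeeper: stores the master's recovery plan."""
    def __init__(self, assignment):
        self.assignment = assignment
    def fetch_partition_assignment(self):
        return self.assignment

class FakeSlave:
    def __init__(self, name):
        self.name = name
    def load_partitions(self, assignment, from_checkpoint):
        print(f"{self.name}: loading {assignment} (checkpoint={from_checkpoint})")
    def write_checkpoint(self, superstep):
        print(f"{self.name}: checkpoint at superstep {superstep}")
    def compute_superstep(self, superstep):
        print(f"{self.name}: computing superstep {superstep}")

def slave_superstep(coord, slave, superstep, is_checkpoint_step, restarting):
    assignment = coord.fetch_partition_assignment()
    if restarting:
        slave.load_partitions(assignment, from_checkpoint=True)
    elif is_checkpoint_step:
        slave.write_checkpoint(superstep)
    slave.compute_superstep(superstep)

# Example: a slave restarting after a failure vs. one on a checkpoint step.
coord = FakeCoordinator({"A": "N1", "B": "N2"})
slave_superstep(coord, FakeSlave("slave-1"), 12, is_checkpoint_step=False, restarting=True)
slave_superstep(coord, FakeSlave("slave-2"), 11, is_checkpoint_step=True, restarting=False)
```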

Outline: Background, Motivation, Partition Based Recovery, Implementation, Evaluation, Conclusion

5 Experimental Setup: CBR vs PBR
Benchmarks: K-means, semi-clustering, and PageRank.
- All tasks run for 20 supersteps.
- A checkpoint is taken at the beginning of superstep 11.
Cluster: 72 compute nodes, each with Intel CPUs, 8GB memory, and 2 x 500GB HDDs; Giraph with PBR runs as a MapReduce job on Hadoop.
Datasets: [table not captured in the transcript; the graphs include the Friendster social network used for PageRank.]

5 Evaluation, K-means: CBR vs PBR
- For recovery, PBR outperforms CBR by 12.4x to 25.7x, and the recovery time of both methods increases linearly.
- For failure-free execution, PBR takes almost the same time as CBR: K-means produces no outgoing messages between different vertices, so there is little to log, and the checkpointing time is negligible compared to computing the new cluster assignments.

5 Evaluation, K-means: CBR vs PBR (continued)
- PBR outperforms CBR by 6.8x to 23.9x; CBR's recovery time is the same no matter how many nodes fail, because it always reloads and recomputes everything.
- PBR reduces recovery time by 23.8x to 26.8x compared with CBR.
These experiments verify the effectiveness of PBR, which parallelizes the recovery computation and eliminates unnecessary recovery cost.

5 Evaluation, PageRank: CBR vs PBR
- For failure-free execution, PBR takes slightly more time than CBR: the Friendster graph has power-law links, so each superstep involves a large number of logged messages written via disk I/O.
[Figure: checkpointing time.]

5 Evaluation, PageRank: CBR vs PBR (continued). These experiments verify the effectiveness of PBR, which parallelizes the recovery computation and eliminates unnecessary recovery cost.

6 Conclusion
- Partition based recovery is proposed as a novel recovery mechanism that parallelizes failure recovery processing.
- The system distributes the recovery task to multiple compute nodes so that recovery processing can be executed concurrently.
- It is implemented on the widely used Giraph system and observed to outperform the existing checkpoint-based recovery scheme by up to 30 times.

Thanks

6 Backup: Semi-Clustering


6 Backup: Communication Cost of PageRank