Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid Gosia Wrzesińska, Rob V. van Nieuwpoort, Jason Maassen, Henri E. Bal.


Fault tolerance, malleability and migration for divide-and-conquer applications on the Grid
Gosia Wrzesińska, Rob V. van Nieuwpoort, Jason Maassen, Henri E. Bal
Vrije Universiteit Amsterdam

Distributed supercomputing
Parallel processing on geographically distributed computing systems (grids)
Needed:
–Fault tolerance: survive node crashes
–Malleability: add or remove machines at runtime
–Migration: move a running application to another set of machines
We focus on divide-and-conquer applications
[Figure: sites in Leiden, Delft, Berlin, and Brno connected via the Internet]

Outline
The Ibis grid programming environment
Satin: a divide-and-conquer framework
Fault tolerance, malleability and migration in Satin
Performance evaluation

The Ibis system
Java-centric => portability
–"write once, run anywhere"
Efficient communication
–Efficient pure Java implementation
–Optimized solutions for special cases
High-level programming models:
–Divide & Conquer (Satin)
–Remote Method Invocation (RMI)
–Replicated Method Invocation (RepMI)
–Group Method Invocation (GMI)

Satin: divide-and-conquer on the Grid
Performs excellently on the Grid:
–Hierarchical: fits hierarchical platforms
–Java-based: can run on heterogeneous resources
–Grid-friendly load balancing: Cluster-aware Random Stealing [van Nieuwpoort et al., PPoPP 2001]
Missing support for:
–Fault tolerance
–Malleability
–Migration
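Cluster-aware Random Stealing overlaps slow wide-area steal attempts with fast local ones: an idle node keeps at most one asynchronous wide-area steal request outstanding while it keeps stealing synchronously inside its own cluster. A minimal sketch of the victim-selection logic, with hypothetical names (the real Satin implementation differs):

```java
import java.util.List;
import java.util.Random;

// Hypothetical sketch of Cluster-aware Random Stealing (CRS).
// An idle node keeps at most one wide-area steal request in flight
// and meanwhile continues stealing from random local-cluster peers.
public class CrsScheduler {
    private final Random rnd = new Random();
    private boolean wideAreaStealPending = false;

    // Pick the next synchronous (local) victim; trigger a single
    // overlapping wide-area attempt when none is in flight.
    public String nextVictim(List<String> localPeers, List<String> remotePeers) {
        if (!wideAreaStealPending && !remotePeers.isEmpty()) {
            wideAreaStealPending = true; // async request sent in the real system
        }
        return localPeers.get(rnd.nextInt(localPeers.size()));
    }

    // Called when the wide-area steal reply (a job, or a failure) arrives.
    public void wideAreaReply() {
        wideAreaStealPending = false;
    }
}
```

The point of the overlap is that wide-area latency (tens of milliseconds here) is hidden behind useful local steals instead of stalling the idle node.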

Example application: Fibonacci
Also: Barnes-Hut, Raytracer, SAT solver, TSP, Knapsack...
[Figure: Fib(5) spawn tree of Fib(0)–Fib(4) jobs distributed over processor 1 (master), processor 2, and processor 3]

Fault tolerance, malleability, migration
Can be implemented by handling processors joining or leaving the ongoing computation
Processors may leave either unexpectedly (crash) or gracefully
Handling joining processors is trivial:
–Let them start stealing jobs
Handling leaving processors is harder:
–Recompute missing jobs
–Problems: orphan jobs, partial results from gracefully leaving processors
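The "recompute missing jobs" step can be sketched as follows: each processor remembers which of its jobs were stolen and by whom, and re-enqueues the jobs held by a processor that has left. All names here are hypothetical; this is not the actual Satin code:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

// Hypothetical sketch of crash recovery: jobs stolen by a processor that
// has left are put back into the local work queue for recomputation.
public class OutstandingJobs {
    // (jobId, thief) pairs for jobs stolen from this processor.
    private final Deque<String[]> stolen = new ArrayDeque<>();
    private final Deque<String> workQueue = new ArrayDeque<>();

    public void recordSteal(String jobId, String thief) {
        stolen.add(new String[]{jobId, thief});
    }

    // Called when 'crashed' is reported dead: re-enqueue its jobs.
    // Returns the number of jobs recovered.
    public int handleCrash(String crashed) {
        int recovered = 0;
        for (Iterator<String[]> it = stolen.iterator(); it.hasNext();) {
            String[] entry = it.next();
            if (entry[1].equals(crashed)) {
                workQueue.add(entry[0]);
                it.remove();
                recovered++;
            }
        }
        return recovered;
    }

    public int queued() { return workQueue.size(); }
}
```

The harder part, handled on the following slides, is that blindly recomputing can waste work already finished elsewhere (orphan jobs).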

Crashing processors
[Figure sequence: processor 2 crashes; its part of the job tree disappears from the computation on processors 1 and 3]
Problem: orphan jobs – jobs stolen from crashed processors

Handling orphan jobs
For each finished orphan, broadcast a (jobID, processorID) tuple; abort the rest
All processors store these tuples in orphan tables
Processors perform a lookup in the orphan table for each recomputed job
If the lookup succeeds: send a result request to the owner (asynchronously) and put the job on the stolen jobs list
[Figure: processor 3 broadcasts (9, cpu3) and (15, cpu3)]
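The bookkeeping described above can be sketched as a small table keyed by job ID. Names are hypothetical; the actual Satin data structures differ:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the orphan table: each processor stores
// (jobID -> owning processor) tuples received via broadcast and
// consults the table before recomputing a job.
public class OrphanTable {
    private final Map<String, String> owners = new HashMap<>();

    // Called when a (jobID, processorID) broadcast arrives.
    public void recordOrphan(String jobId, String processorId) {
        owners.put(jobId, processorId);
    }

    // Called before recomputing a job; returns the processor that already
    // holds the result, or null if the job must be recomputed from scratch.
    public String lookup(String jobId) {
        return owners.get(jobId);
    }
}
```

If the lookup succeeds, the recomputing processor requests the finished result asynchronously and treats the job as stolen rather than executing it again.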

Handling orphan jobs - example
[Figure sequence: processor 2 crashes; processor 3 broadcasts (9, cpu3) and (15, cpu3); processor 1 stores these tuples in its orphan table and, while recomputing the missing jobs (12, 13), reuses the orphaned results instead of re-executing them]

Processors leaving gracefully
Send results to another processor; treat those results as orphans
[Figure sequence: processor 2 leaves gracefully and transfers its finished jobs to processor 3; processor 3 broadcasts (11, cpu3), (9, cpu3) and (15, cpu3); processor 1 stores the tuples in its orphan table and reuses the results while recomputing the missing jobs]

Some remarks about scalability
Little data is broadcast (< 1% of jobs)
We broadcast pointers
Message combining
Lightweight broadcast: no need for reliability, synchronization, etc.

Performance evaluation
Leiden, Delft (DAS-2) + Berlin, Brno (GridLab)
Bandwidth: 62 – 654 Mbit/s
Latency: 2 – 21 ms

Impact of saving partial results
[Chart: runs on 16 cpus Leiden; 16 cpus Delft; 8 cpus Leiden + 8 cpus Delft; 4 cpus Berlin + 4 cpus Brno]

Migration overhead
[Chart: 8 cpus Leiden + 4 cpus Berlin + 4 cpus Brno; the Leiden cpus are replaced by Delft cpus at runtime]

Crash-free execution overhead Used: 32 cpus in Delft

Summary
Satin implements fault tolerance, malleability and migration for divide-and-conquer applications
Partial results are saved by repairing the execution tree
Applications can adapt to changing numbers of cpus and migrate without loss of work (overhead < 10%)
Outperforms the traditional approach by 25%
No overhead during crash-free execution

Further information Publications and a software distribution available at:

Additional slides

Ibis design

Partial results on leaving cpus
If processors leave gracefully:
–Send all finished jobs to another processor
–Treat those jobs as orphans, i.e. broadcast (jobID, processorID) tuples
–Execute the normal crash recovery procedure

A crash of the master
Master: the processor that started the computation by spawning the root job
The remaining processors elect a new master
At the end of the crash recovery procedure, the new master restarts the application
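For the survivors to agree on a new master without coordination, the election only needs to be deterministic over the set of live processors. A minimal sketch under that assumption (the slide does not specify the actual election rule; picking the smallest identifier is one common choice):

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of master re-election: every surviving processor
// applies the same deterministic rule to the same membership list, so
// all of them independently agree on the same new master.
public class MasterElection {
    public static String electMaster(List<String> aliveProcessors) {
        // Smallest identifier wins; any deterministic total order works.
        return Collections.min(aliveProcessors);
    }
}
```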

Job identifiers
rootId = 1
childId = parentId * branching_factor + child_no
Problem: this requires knowing the maximal branching factor of the tree
Solution: use strings of bytes, one byte per tree level
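The byte-string scheme sidesteps the branching-factor problem by simply appending one byte per tree level. A minimal sketch with hypothetical names (not the actual Satin implementation):

```java
import java.util.Arrays;

// Hypothetical sketch of byte-string job identifiers: the root has an
// empty id, and child k of a job appends the byte k. No global maximal
// branching factor needs to be known in advance.
public final class JobId {
    public static byte[] root() {
        return new byte[0];
    }

    public static byte[] child(byte[] parent, int childNo) {
        byte[] id = Arrays.copyOf(parent, parent.length + 1);
        id[parent.length] = (byte) childNo;
        return id;
    }

    // A job is a strict ancestor of another iff its id is a proper
    // prefix of the other's id.
    public static boolean isAncestor(byte[] a, byte[] b) {
        if (a.length >= b.length) return false;
        return Arrays.equals(a, Arrays.copyOfRange(b, 0, a.length));
    }
}
```

A convenient side effect is that ancestry tests, useful for deciding which orphans fall inside a recomputed subtree, become simple prefix checks.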

Distributed ASCI Supercomputer (DAS) – 2
Node configuration:
–Dual 1 GHz Pentium-III
–>= 1 GB memory
–100 Mbit Ethernet + Myrinet
–Linux
Sites: VU (72 nodes), UvA (32), Leiden (32), Delft (32), Utrecht (32), connected by GigaPort (1-10 Gb)

Compiling/optimizing programs
[Figure: source -> Java compiler -> bytecode -> bytecode rewriter -> JVM]
Optimizations are done by bytecode rewriting
–E.g. compiler-generated serialization (as in Manta)

GridLab testbed
[Figure: map of GridLab testbed sites]

Example: Java + divide & conquer

interface FibInter extends ibis.satin.Spawnable {
    public long fib(long n);
}

class Fib extends ibis.satin.SatinObject implements FibInter {
    public long fib(long n) {
        if (n < 2) return n;
        long x = fib(n - 1); // spawned
        long y = fib(n - 2); // spawned
        sync();              // wait for the spawned results
        return x + y;
    }
}

Grid results

Program      Sites  CPUs  Efficiency
Raytracer    5      40    81 %
SAT-solver   5      28    88 %
Compression  3      22    67 %

Efficiency based on normalization to a single CPU type (1 GHz P3)