Berlin, March 11th, 2004. GGF10 - GridCPR-WG. PARIS project-team Activities in Checkpoint Recovery. Christine Morin, PARIS, INRIA.

1 GGF10 - GridCPR-WG
PARIS project-team Activities in Checkpoint Recovery
Christine Morin, PARIS INRIA project-team, IRISA – Rennes (France)
Berlin, March 11th, 2004

2 Cluster Federations
- A particular case of grid
  - Interconnection of several clusters of moderate size
- Homogeneity and heterogeneity
  - More and more homogeneous platforms: PC, Linux
  - Heterogeneous networks (SAN, LAN, WAN)
  - Clusters with different amounts and kinds of resources
- Considered applications
  - Scientific applications (numerical simulation)
    - Sequential and parallel applications based on either the shared-memory or the message-passing communication paradigm
  - Code coupling applications
  - Applications requiring a huge amount of resources (memory, computing power)
- Dynamicity
  - A cluster may join or leave the federation at any time
  - Individual nodes may fail in a cluster
[Figure labels: SAN, LAN, WAN]

3 Grid-aware OS for Cluster Federations
- A single system image (SSI) OS on each cluster
  - A cluster appears as a single machine which offers a kind of standard interface
  - Mosix, Amoeba, Kerrighed
- A cluster federation is seen as a set of peers
  - Structured peer-to-peer (P2P) network (instead of a hierarchy)
    - Fully decentralized control
    - Native support for dynamicity
    - Designed for scalability
      - Size of the routing tables bounded by log(N), N being the number of peers
      - Probabilistic log(N) bounds on the number of routing hops
    - Standardization of the APIs (IRIS project)
    - Promising work to take into account the network's topology and security issues (Pastry)
  - Structured P2P systems usually provide distributed hash tables (DHT)
    - Building block for higher level services
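The log(N) bounds on routing state and routing hops can be illustrated with a small simulation. The sketch below is an assumption for illustration only: it uses a Chord-style identifier ring with finger tables, which is not the Pastry prefix routing cited on the slide, and the names `Ring`, `lookup`, `M` and `RING` are hypothetical. Each peer keeps M fingers (successors of n + 2^i), and a lookup greedily jumps to the farthest finger that still precedes the key, so the remaining distance at least halves at every hop.

```python
import bisect
import random

M = 16            # identifier bits (assumption for this sketch)
RING = 1 << M     # size of the identifier space


def in_open(x, a, b):
    """True if x lies in the ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b          # interval wraps around (a == b: all but a)


def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b         # interval wraps around (a == b: whole ring)


class Ring:
    def __init__(self, node_ids):
        self.nodes = sorted(set(node_ids))
        # Finger i of peer n is the first peer at or after n + 2^i.
        self.fingers = {
            n: [self.successor((n + (1 << i)) % RING) for i in range(M)]
            for n in self.nodes
        }

    def successor(self, ident):
        """First peer whose id is >= ident, wrapping around the ring."""
        i = bisect.bisect_left(self.nodes, ident)
        return self.nodes[i % len(self.nodes)]

    def lookup(self, start, key):
        """Return (peer responsible for key, number of routing hops)."""
        node, hops = start, 0
        while True:
            succ = self.fingers[node][0]
            if in_half_open(key, node, succ):
                return succ, hops
            # Jump to the farthest finger that still precedes the key.
            for finger in reversed(self.fingers[node]):
                if in_open(finger, node, key):
                    node = finger
                    break
            hops += 1


if __name__ == "__main__":
    random.seed(1)
    ring = Ring(random.sample(range(RING), 256))
    owner, hops = ring.lookup(ring.nodes[0], 12345)
    print("key 12345 is stored on peer", owner, "after", hops, "hops")
```

A DHT built on such an overlay stores each (key, value) pair on the peer returned by `lookup`, which is what makes it usable as a building block for higher level services.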

4 Current Work on Checkpoint Recovery
- Cluster federation
  - Execution of multithreaded applications in cluster federations
    - A coherence protocol for cached copies of volatile objects in peer-to-peer systems (multiple failures tolerated)
  - Hierarchical checkpointing protocol for code coupling applications
- Cluster: single system image (SSI) operating system, Kerrighed
  - Full POSIX thread interface
  - Global process and memory management
  - Configurable global scheduler
  - High availability
    - Dynamic resource management for tolerating cluster reconfigurations (node addition, eviction or failure)
    - Checkpoint recovery mechanisms

5 Goals for Checkpoint Recovery in Kerrighed
- Experimental platform for checkpointing strategies for parallel applications
  - Basic mechanisms common to different checkpointing protocols in message-passing (MP) and shared-memory (SM) systems
  - Being able to checkpoint any kind of parallel application
  - Transparent checkpointing
- Implementation in a single system of various checkpointing strategies
  - To allow the programmer to choose a suitable strategy for a particular application
  - To be able to compare several strategies with realistic (industrial) applications
  - Avoid code duplication in the system
  - Robustness
  - Fair comparison
- Common framework
  - Checkpoint and rollback servers
  - Checkpoint numbering
  - Dependency management
    - Unified model for message-passing and shared memory models
    - Direct Dependency Vector (DDV) management
  - Message logging
  - Incremental checkpointing
  - Checkpointing in background
  - Communication system
    - Atomic multicast
  - Stable storage
    - Different implementations: disk, memory
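Incremental checkpointing, one of the common-framework mechanisms listed above, can be sketched as follows. This is a user-level toy model under stated assumptions, not Kerrighed code: `PagedMemory`, `restore` and the tiny `PAGE_SIZE` are hypothetical names chosen for the example. Only the pages written since the previous checkpoint are saved, and a state is rebuilt by replaying the saved deltas in checkpoint order.

```python
PAGE_SIZE = 4  # deliberately tiny, for illustration only


class PagedMemory:
    """Toy address space that remembers which pages were modified."""

    def __init__(self, num_pages):
        self.pages = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        # Before the first checkpoint every page counts as modified.
        self.dirty = set(range(num_pages))

    def write(self, page_no, data):
        self.pages[page_no] = data
        self.dirty.add(page_no)

    def checkpoint(self):
        """Return only the pages modified since the previous checkpoint."""
        delta = {p: self.pages[p] for p in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def restore(num_pages, deltas):
    """Rebuild the pages by replaying deltas from oldest to newest."""
    pages = [bytes(PAGE_SIZE)] * num_pages
    for delta in deltas:
        for page_no, data in delta.items():
            pages[page_no] = data
    return pages


if __name__ == "__main__":
    mem = PagedMemory(4)
    mem.write(1, b"aaaa")
    full = mem.checkpoint()          # checkpoint 0: all pages
    mem.write(2, b"bbbb")
    incr = mem.checkpoint()          # checkpoint 1: page 2 only
    print(len(full), "pages, then", len(incr), "page")
    print(restore(4, [full, incr]) == mem.pages)
```

The position of a delta in the list plays the role of a checkpoint number; a real system would track dirty pages through the memory management unit rather than through an explicit `write` call.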

6 Checkpoint Recovery in Kerrighed: Current Status and Work Directions
- Current status
  - Linux-based Kerrighed prototype (2.4)
    - Small kernel patch and a set of modules
  - Transparent checkpoint recovery for individual (computing) processes
  - Virtualization of a process in the cluster
    - Unique ghost mechanism for process migration, checkpointing and restoration
    - Easy specialization of the stable storage implementation
    - Ghost can be sent to or retrieved from network, memory or disk
- Work directions
  - Complete the debugging of coordinated checkpointing (and recovery) for multithreaded and message-passing based applications
  - Checkpointable locks and barriers in a cluster
  - Disk I/O management
  - POSIX extension for a proper integration of transparent checkpointing/recovery in the operating system
[Figure: ghost process; targets: Memory, Disk, Network; uses: Duplication, Migration, Checkpoint/restart]
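The idea of a single ghost mechanism shared by migration, checkpointing and restoration can be pictured as one export/import routine that writes to an abstract byte stream. The sketch below is only an analogy in user-level Python: Kerrighed implements this inside the kernel, and `export_process`, `import_process` and the contents of `state` are hypothetical. Changing the stream object changes the target (memory or disk here) without touching the serialization code.

```python
import io
import os
import pickle
import tempfile


def export_process(state, ghost):
    """Write a process image to any writable binary stream (the 'ghost')."""
    pickle.dump(state, ghost)


def import_process(ghost):
    """Read a process image back from any readable binary stream."""
    return pickle.load(ghost)


# Hypothetical process image used only for this example.
state = {"pid": 42, "registers": {"pc": 0x400000}, "pages": {0: b"\x00" * 8}}

# Memory ghost: the image is duplicated in memory.
mem_ghost = io.BytesIO()
export_process(state, mem_ghost)
mem_ghost.seek(0)
duplicate = import_process(mem_ghost)

# Disk ghost: the image is checkpointed to a file and restored from it.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "ckpt.bin")
    with open(path, "wb") as ghost_file:
        export_process(state, ghost_file)
    with open(path, "rb") as ghost_file:
        restored = import_process(ghost_file)

print(duplicate == state, restored == state)
```

A network target would follow the same pattern with a stream backed by a connection, which is how one routine could serve migration as well as checkpointing.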

7 Hierarchical Checkpoint Recovery for Cluster Federations
- Relaxed inter-cluster synchronism to reflect the architecture
  - Coordinated checkpointing in a cluster
  - Communication-induced checkpointing between clusters
    - Independent checkpoints in each cluster
    - Forced checkpoints when a communication generates a new dependency
      - Force a checkpoint only if the sender has saved a checkpoint since its last send
    - Several cluster checkpoints are kept
- Management of Direct Dependency Vectors (DDV) to detect dependencies
  - DDV included in inter-cluster messages
  - DDV associated with cluster checkpoints
- Garbage collection of useless cluster checkpoints
- Evaluation by discrete-event simulation
  - Works well if
    - Few inter-cluster communications
    - Inter-cluster communications are "quasi-unidirectional"
[Figure labels: Simulation, Processing, Display, Simulation]
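The forced-checkpoint rule can be sketched with one Direct Dependency Vector per cluster. The code below is a simplified reading of the slide, not the authors' implementation: the `Cluster` class, the message layout, and the choice to record the new dependency before taking the forced checkpoint are all assumptions. Entry `ddv[cid]` is the cluster's own checkpoint sequence number; every other entry is the highest sequence number seen in a message from that cluster.

```python
class Cluster:
    def __init__(self, cid, n_clusters):
        self.cid = cid
        self.ddv = [0] * n_clusters   # ddv[cid] = own checkpoint sequence number
        self.checkpoints = []         # (sequence number, DDV snapshot, forced?)

    def checkpoint(self, forced=False):
        """Take a cluster checkpoint and stamp it with the current DDV."""
        self.ddv[self.cid] += 1
        self.checkpoints.append((self.ddv[self.cid], tuple(self.ddv), forced))

    def send(self, payload):
        """Inter-cluster messages carry the sender's DDV."""
        return {"src": self.cid, "ddv": tuple(self.ddv), "payload": payload}

    def receive(self, msg):
        src = msg["src"]
        sender_sn = msg["ddv"][src]
        if sender_sn > self.ddv[src]:
            # The sender has checkpointed since its last message to us:
            # this is a new dependency, so checkpoint before delivery.
            self.ddv[src] = sender_sn
            self.checkpoint(forced=True)
        return msg["payload"]


if __name__ == "__main__":
    a, b = Cluster(0, 2), Cluster(1, 2)
    b.receive(a.send("m0"))   # a has no checkpoint yet: nothing forced
    a.checkpoint()
    b.receive(a.send("m1"))   # new dependency: b takes a forced checkpoint
    b.receive(a.send("m2"))   # same sender checkpoint: nothing forced
    print(b.checkpoints)
```

With few inter-cluster messages, or messages flowing mostly in one direction, the condition in `receive` rarely holds, which is consistent with the slide's remark on when the protocol works well.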

8 Future Work
- Checkpoint recovery in the large (we plan to hire a PhD student)
  - Dealing with applications with huge data sets executed in cluster federations
  - Follow-up of our preliminary work on a hierarchical checkpointing protocol for code coupling applications in cluster federations
- Based on the Kerrighed experimental platform
  - Not only basic coordinated checkpointing but also several variants of independent and communication-induced strategies
  - Standard interface and basic building blocks
- Implementation in Kerrighed of ideas studied in previous projects
  - ICARE fault-tolerant software DSM
    - Combining the replication inherent to the DSM with the replication needed for ensuring recovery data stability
    - Extension of the coherence protocol to manage recovery data in memory
  - HA-PSLS
    - Integration of a DSM and a parallel file system
    - Upgrading ICARE
    - Cohabitation of persistent and memory checkpoints
    - Swap management (to avoid memory size limitation and to evict recovery data from memory)
    - Mapped file management (in-place checkpoints)

9 Kerrighed is registered as a community trademark.

10 Software Distribution
- Kerrighed web site (open since mid-November 2002)
- Open source under GPL licence
- Current version: Kerrighed V0.81 based on Linux 2.4.24
- Kerrighed users mailing list (created in April 2003)
- Kerrighed forum (created in February 2004)
- Notes
  - Kerrighed is a registered trademark
  - Kerrighed deposit at APP for each public release
  - Kerrighed tutorial (in conjunction with ICS04, Saint-Malo (France), June 27th, 2004)

11 Roadmap for Kerrighed Prototype
- March 2004
  - MPI (with migration)
- April 2004
  - Kerrighed V1.00 (SSI-OSCAR)
  - SGFD
- January 2005
  - Kerrighed V1.10
  - 64 bits (Opteron)
  - Checkpointing for parallel applications
- July 2005
  - Kerrighed V2.0
  - High availability

12 Current Support: EDF
- Kerrighed research prototype (2000-2003)
  - CRECO EDF/INRIA
  - CIFRE Ph.D. grant (Geoffroy Vallée)
  - Industrial post-doc (Renaud Lottiaux)
  - Experiments with first industrial applications provided by EDF
    - HRM1D, CATHARE, Cyrano 3, Aster
- Kerrighed integration in OSCAR (2004-2005)
  - INRIA industrial post-doc (G. Vallée) with EDF & ORNL
  - SSI-OSCAR

13 Current Support: DGA
- Kerrighed robustness and full set of functionalities (2003-2005)
  - COCA PEA funded by DGA
  - Partnership with CGEY and ONERA-CERT
  - 2 full-time engineers (Renaud Lottiaux, David Margery)
  - Experiments with industrial applications
    - Ligase, Gorf3D, Mixsar, RTI HLA

14 Current Kerrighed Team (part of the PARIS project-team)
- Faculty
  - Christine Morin (DR, INRIA)
- PhD students
  - Pascal Gallard (INRIA)
  - Gaël Utard (INRIA)
  - Louis Rilling (ENS-Cachan)
- Post-doc
  - Geoffroy Vallée (PDI-EDF)
- Engineers
  - Renaud Lottiaux (INRIA)
  - David Margery (INRIA)
- Invited researcher
  - Isaac Scherson (UCI)
- Master students
  - Jamal Ghaffour
  - Etienne Rivière
- Former members
  - Ramamurthy Badrinath (assistant professor, IIT Kharagpur, India), May 2002 – April 2003
  - Viet Hoa Dinh (engineer), September 2001 – September 2002
  - Jean-Yves Burlett (Master student, Univ. Rennes 1), February – June 2001
  - Sébastien Monnet (Master student, Univ. Rennes 1), February – June 2003
  - H. Maka (Bachelor student, IIT Kharagpur), May – July 2003

15 Academic Collaborations
- University of Ulm, Germany
  - Checkpointing for shared memory parallel applications
- Rutgers University, USA
  - Myrinet, Infiniband
  - Self-healing clusters
- ORNL
  - SSI-OSCAR
- University of California, Irvine, USA
  - Global scheduling
- Deakin University, Australia
  - SSI (informal contacts)
