
1 On the Feasibility of Incremental Checkpointing for Scientific Computing
Jose Carlos Sancho (jcsancho@lanl.gov), with Fabrizio Petrini, Greg Johnson, Juan Fernandez, and Eitan Frachtenberg
Performance and Architectures Lab (PAL), Computer and Computational Sciences Division, Los Alamos National Laboratory

2 Talk Overview
- Goal
- Fault-tolerance for scientific computing
- Methodology
- Characterization of scientific applications
- Performance evaluation of incremental checkpointing
- Concluding remarks

3 Goal
Prove the feasibility of incremental checkpointing that is:
- Frequent
- Automatic
- User-transparent
- Requires no changes to the application
- Requires no special hardware support

4 Large Scale Computers
- Large component count (e.g., 133,120 processors and 608,256 DRAM chips)
- Strongly coupled hardware
The failure rate grows with the component count.

5 Scientific Computing
- Runs for months
- Demands high capability
Failures are expected during the application's execution.

6 Providing Fault-tolerance
- Hardware replication (spare nodes): a high-cost solution
- Checkpointing and rollback recovery

7 Checkpointing and Recovery
- Simplicity: easy to implement
- Cost-effective: no additional hardware support
Critical aspect: the bandwidth required to save the process state.

8 Reducing Bandwidth
- Incremental checkpointing: only the memory modified since the previous checkpoint is saved to stable storage, rather than the full process state (see the comparison sketched below).
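In notation we introduce here for clarity (not on the slides): let M be the full process state and M_dirty(τ) the memory modified within a checkpoint interval τ. The write rates that full and incremental checkpointing must sustain then compare as

```latex
B_{\text{full}}(\tau) = \frac{M}{\tau}
\qquad\text{vs.}\qquad
B_{\text{incr}}(\tau) = \frac{M_{\text{dirty}}(\tau)}{\tau} \le \frac{M}{\tau},
```

with equality only when the application overwrites its whole footprint within the interval.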

9 New Challenges
- Frequent checkpoints: minimizing the rollback interval increases system availability, but puts more pressure on bandwidth (see the bound sketched below)
- Automatic and user-transparent: autonomic computing, a new vision for managing the high complexity of large systems; self-healing and self-repairing
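As a one-line justification for frequent checkpoints (standard reasoning, not from the slides): with checkpoint interval τ and recovery time t_rec, the work lost to a single failure is bounded by

```latex
W_{\text{lost}} \;\le\; \tau + t_{\text{rec}},
```

so shrinking τ shrinks the rollback window, at the price of the extra bandwidth pressure noted above.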

10 Survey of Implementation Levels
- Application: CLIP, Dome, CCITF
- Run-time library: Ickp, CoCheck, Diskless
- Operating system: just a few!
- Hardware: ReVive, SafetyNet

11 Enabling Automatic Checkpointing
Moving down the stack from the application level through the run-time library and operating system to the hardware, user intervention decreases (checkpointing becomes more automatic), while the amount of checkpoint data to be saved increases.

12 The Bandwidth Challenge
Does current technology provide enough bandwidth for frequent, automatic checkpointing?

13 Methodology
- Analyzing the memory footprint of scientific codes at the run-time library level, covering the application's full memory layout: text, static data, heap, mmap regions, and stack. Modified memory is detected by write-protecting pages with mprotect(), as sketched below.
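The slide's mprotect() reference points at the classic write-protection idiom for dirty-page tracking. Below is a minimal, self-contained C sketch of that idiom; it is our illustration under stated assumptions (single-threaded POSIX process, Linux semantics), not the authors' library, and names such as region, dirty, and checkpoint() are hypothetical.

```c
/*
 * Minimal sketch of mprotect()-based dirty-page tracking (our illustration,
 * not the authors' library). Assumes a single-threaded POSIX process and
 * Linux semantics; error handling is omitted for brevity.
 */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static uint8_t *region;     /* tracked memory (hypothetical example region) */
static size_t npages, page_size;
static uint8_t *dirty;      /* one dirty flag per page */

/* First write to a protected page lands here; record it and unprotect it. */
static void on_write_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)si->si_addr, base = (uintptr_t)region;
    if (addr < base || addr >= base + npages * page_size)
        _exit(1);                         /* a genuine fault, not our trap */
    size_t page = (addr - base) / page_size;
    dirty[page] = 1;
    /* Note: mprotect() is not formally async-signal-safe, but this is the
     * standard idiom for page-level checkpointing and works on Linux. */
    mprotect(region + page * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Save only the pages written since the last checkpoint, then re-arm. */
static void checkpoint(FILE *out) {
    for (size_t p = 0; p < npages; p++)
        if (dirty[p])
            fwrite(region + p * page_size, page_size, 1, out);
    memset(dirty, 0, npages);
    mprotect(region, npages * page_size, PROT_READ);   /* re-arm tracking */
}

int main(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    npages = 64;
    region = mmap(NULL, npages * page_size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    dirty = calloc(npages, 1);

    struct sigaction sa = {0};
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    mprotect(region, npages * page_size, PROT_READ);   /* start tracking */

    region[0] = 1;                 /* faults once, marks page 0 dirty */
    region[5 * page_size] = 2;     /* faults once, marks page 5 dirty */

    FILE *out = fopen("ckpt.bin", "wb");
    checkpoint(out);               /* writes exactly the two dirty pages */
    fclose(out);
    printf("saved 2 of %zu pages\n", npages);
    return 0;
}
```

The design point is that only the first write to a page after a checkpoint pays the fault cost; subsequent writes to the same page run at full speed, which keeps the tracking overhead low between checkpoints.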

14 Methodology
- Quantifying the bandwidth requirements (the feasibility condition is written out below):
  - Checkpoint intervals from 1 s to 20 s
  - Compared with the currently available bandwidth: 900 MB/s sustained network bandwidth (Quadrics QsNet II) and 75 MB/s sustained bandwidth of a single disk (Ultra SCSI controller)
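Stated with the notation introduced after slide 8, the question the following slides answer is whether, for every timeslice τ between 1 s and 20 s, the required rate stays below the available bandwidth. Treating the single disk as the binding constraint is our simplification for illustration; checkpoints could also be staged over the faster network:

```latex
B_{\text{incr}}(\tau) = \frac{M_{\text{dirty}}(\tau)}{\tau}
\;\le\; B_{\text{disk}} = 75\ \text{MB/s}
\;<\; B_{\text{net}} = 900\ \text{MB/s}.
```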

15 Experimental Environment
- 32-node Linux cluster
  - 64 Itanium II processors
  - PCI-X I/O bus
  - Quadrics QsNet interconnection network
- Parallel scientific codes
  - Sage and Sweep3D, representative of the ASCI production codes at LANL
  - NAS parallel benchmarks: SP, LU, BT, and FT

16 Memory Footprint

Application     Memory footprint
Sage-1000MB     954.6 MB
Sage-500MB      497.3 MB
Sage-100MB      103.7 MB
Sage-50MB       55 MB
Sweep3D         105.5 MB
SP Class C      40.1 MB
LU Class C      16.6 MB
BT Class C      76.5 MB
FT Class C      118 MB

The evaluation covers an increasing range of memory footprints.

17 Talk Overview
- Goal
- Fault-tolerance for scientific computing
- Methodology
- Characterization of scientific applications
- Performance evaluation of incremental checkpointing
  - Bandwidth
  - Scalability

18 Characterization
[Figure: memory written over time for Sage-1000MB, showing a data-initialization phase followed by regular processing bursts]

19 Characterization
[Figure: communication interleaved with computation for Sage-1000MB, showing regular communication bursts]

20 Fraction of the Memory Footprint Overwritten during the Main Iteration
[Figure: per-code fraction of memory overwritten per iteration; some codes overwrite their full memory footprint, others stay below it]

21 Bandwidth Requirements
[Figure: average bandwidth (MB/s) versus checkpoint timeslice (s) for Sage-1000MB; the required bandwidth decreases with longer timeslices, from 78.8 MB/s at a 1 s timeslice down to 12.1 MB/s at the 20 s end]
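Reading the two plotted points through B_incr(τ) = M_dirty(τ)/τ shows why the curve falls: the dirty set implied at 20 s is far smaller than twenty times the 1 s dirty set, because the same pages are re-dirtied within a longer interval (arithmetic derived from the slide's two numbers):

```latex
M_{\text{dirty}}(1\,\text{s}) = 78.8\ \text{MB/s} \times 1\,\text{s} = 78.8\ \text{MB},
\qquad
M_{\text{dirty}}(20\,\text{s}) = 12.1\ \text{MB/s} \times 20\,\text{s} = 242\ \text{MB}
\;\ll\; 20 \times 78.8\ \text{MB} = 1576\ \text{MB}.
```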

22 Bandwidth Requirements for a 1-second Timeslice
[Figure: required bandwidth per code at a 1 s timeslice; it increases with the memory footprint, and the most demanding code is compared against single SCSI disk performance]

23 Increasing Memory Footprint Size
[Figure: average bandwidth (MB/s) versus timeslice (s) for growing footprints; the bandwidth requirement increases sublinearly with memory footprint size]

24 Increasing Processor Count
[Figure: average bandwidth (MB/s) versus timeslice (s) under weak scaling; the per-process bandwidth requirement decreases slightly with processor count]

25 Technological Trends
[Figure: performance improvement per year by component; application performance, and hence checkpoint data volume, is bounded by memory improvements, while network and storage bandwidth increase at a faster pace]

26 Conclusions
- No technological limitation prevents commodity cluster components from supporting automatic, frequent, and user-transparent incremental checkpointing
- Current hardware technology can sustain the bandwidth requirements
- These results can be generalized to future large-scale computers

27 Conclusions
- The per-process bandwidth requirement decreases slightly with processor count
- It increases sublinearly with the memory footprint size
- Improvements in networking and storage will make incremental checkpointing even more effective in the future

