
1 On the Feasibility of Incremental Checkpointing for Scientific Computing
Jose Carlos Sancho (jcsancho@lanl.gov), with Fabrizio Petrini, Greg Johnson, Juan Fernandez, and Eitan Frachtenberg
Performance and Architectures Lab (PAL), Computer and Computational Sciences Division, Los Alamos National Laboratory

2 Talk Overview
- Goal
- Fault-tolerance for scientific computing
- Methodology
- Characterization of scientific applications
- Performance evaluation of incremental checkpointing
- Concluding remarks

3 Goal
Prove the feasibility of incremental checkpointing that is:
- Frequent
- Automatic
- User-transparent
- Requires no changes to the application
- Requires no special hardware support

4 Large Scale Computers
- Large component count (e.g., 133,120 processors and 608,256 DRAM chips)
- Strongly coupled hardware
The failure rate grows with the component count.

5 Scientific Computing
- Runs for months
- Demands high capability
Failures are expected during the application's execution.

6 Providing Fault-tolerance
- Hardware replication (spare nodes): a high-cost solution
- Checkpointing and rollback recovery

7 Checkpointing and Recovery
- Simplicity: easy to implement
- Cost-effective: no additional hardware support
Critical aspect: the bandwidth required to save the process state.

8 Reducing Bandwidth
- Incremental checkpointing: only the memory modified since the previous checkpoint is saved to stable storage, rather than the full process state (see the comparison sketched below).
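In notation we introduce here for clarity (not on the slides): let M be the full process state and M_dirty(τ) the memory modified within a checkpoint interval τ. The write rates that full and incremental checkpointing must sustain then compare as

```latex
B_{\text{full}}(\tau) = \frac{M}{\tau}
\qquad\text{vs.}\qquad
B_{\text{incr}}(\tau) = \frac{M_{\text{dirty}}(\tau)}{\tau} \le \frac{M}{\tau},
```

with equality only when the application overwrites its whole footprint within the interval.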

9 New Challenges
- Frequent checkpoints: minimizing the rollback interval increases system availability, but puts more pressure on bandwidth (see the bound sketched below)
- Automatic and user-transparent: autonomic computing, a new vision for managing the high complexity of large systems; self-healing and self-repairing
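As a one-line justification for frequent checkpoints (standard reasoning, not from the slides): with checkpoint interval τ and recovery time t_rec, the work lost to a single failure is bounded by

```latex
W_{\text{lost}} \;\le\; \tau + t_{\text{rec}},
```

so shrinking τ shrinks the rollback window, at the price of the extra bandwidth pressure noted above.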

10 Survey of Implementation Levels
- Application: CLIP, Dome, CCITF
- Run-time library: Ickp, CoCheck, Diskless
- Operating system: just a few!
- Hardware: ReVive, SafetyNet

11 Enabling Automatic Checkpointing
Moving down the stack from the application level through the run-time library and operating system to the hardware, user intervention decreases (checkpointing becomes more automatic), while the amount of checkpoint data to be saved increases.

12 The Bandwidth Challenge
Does current technology provide enough bandwidth for frequent, automatic checkpointing?

13 Methodology
- Analyzing the memory footprint of scientific codes at the run-time library level, covering the application's full memory layout: text, static data, heap, mmap regions, and stack. Modified memory is detected by write-protecting pages with mprotect(), as sketched below.
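The slide's mprotect() reference points at the classic write-protection idiom for dirty-page tracking. Below is a minimal, self-contained C sketch of that idiom; it is our illustration under stated assumptions (single-threaded POSIX process, Linux semantics), not the authors' library, and names such as region, dirty, and checkpoint() are hypothetical.

```c
/*
 * Minimal sketch of mprotect()-based dirty-page tracking (our illustration,
 * not the authors' library). Assumes a single-threaded POSIX process and
 * Linux semantics; error handling is omitted for brevity.
 */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static uint8_t *region;     /* tracked memory (hypothetical example region) */
static size_t npages, page_size;
static uint8_t *dirty;      /* one dirty flag per page */

/* First write to a protected page lands here; record it and unprotect it. */
static void on_write_fault(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    uintptr_t addr = (uintptr_t)si->si_addr, base = (uintptr_t)region;
    if (addr < base || addr >= base + npages * page_size)
        _exit(1);                         /* a genuine fault, not our trap */
    size_t page = (addr - base) / page_size;
    dirty[page] = 1;
    /* Note: mprotect() is not formally async-signal-safe, but this is the
     * standard idiom for page-level checkpointing and works on Linux. */
    mprotect(region + page * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Save only the pages written since the last checkpoint, then re-arm. */
static void checkpoint(FILE *out) {
    for (size_t p = 0; p < npages; p++)
        if (dirty[p])
            fwrite(region + p * page_size, page_size, 1, out);
    memset(dirty, 0, npages);
    mprotect(region, npages * page_size, PROT_READ);   /* re-arm tracking */
}

int main(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    npages = 64;
    region = mmap(NULL, npages * page_size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    dirty = calloc(npages, 1);

    struct sigaction sa = {0};
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    mprotect(region, npages * page_size, PROT_READ);   /* start tracking */

    region[0] = 1;                 /* faults once, marks page 0 dirty */
    region[5 * page_size] = 2;     /* faults once, marks page 5 dirty */

    FILE *out = fopen("ckpt.bin", "wb");
    checkpoint(out);               /* writes exactly the two dirty pages */
    fclose(out);
    printf("saved 2 of %zu pages\n", npages);
    return 0;
}
```

The design point is that only the first write to a page after a checkpoint pays the fault cost; subsequent writes to the same page run at full speed, which keeps the tracking overhead low between checkpoints.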

14 Methodology
- Quantifying the bandwidth requirements (the feasibility condition is written out below):
  - Checkpoint intervals from 1 s to 20 s
  - Compared with the currently available bandwidth: 900 MB/s sustained network bandwidth (Quadrics QsNet II) and 75 MB/s sustained bandwidth of a single disk (Ultra SCSI controller)
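Stated with the notation introduced after slide 8, the question the following slides answer is whether, for every timeslice τ between 1 s and 20 s, the required rate stays below the available bandwidth. Treating the single disk as the binding constraint is our simplification for illustration; checkpoints could also be staged over the faster network:

```latex
B_{\text{incr}}(\tau) = \frac{M_{\text{dirty}}(\tau)}{\tau}
\;\le\; B_{\text{disk}} = 75\ \text{MB/s}
\;<\; B_{\text{net}} = 900\ \text{MB/s}.
```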

15 Experimental Environment
- 32-node Linux cluster
  - 64 Itanium II processors
  - PCI-X I/O bus
  - Quadrics QsNet interconnection network
- Parallel scientific codes
  - Sage and Sweep3D, representative of the ASCI production codes at LANL
  - NAS parallel benchmarks: SP, LU, BT, and FT

16 Memory Footprint

Application     Memory footprint
Sage-1000MB     954.6 MB
Sage-500MB      497.3 MB
Sage-100MB      103.7 MB
Sage-50MB       55 MB
Sweep3D         105.5 MB
SP Class C      40.1 MB
LU Class C      16.6 MB
BT Class C      76.5 MB
FT Class C      118 MB

The evaluation covers an increasing range of memory footprints.

17 Talk Overview
- Goal
- Fault-tolerance for scientific computing
- Methodology
- Characterization of scientific applications
- Performance evaluation of incremental checkpointing
  - Bandwidth
  - Scalability

18 Characterization
[Figure: memory written over time for Sage-1000MB, showing a data-initialization phase followed by regular processing bursts]

19 Characterization
[Figure: communication interleaved with computation for Sage-1000MB, showing regular communication bursts]

20 Fraction of the Memory Footprint Overwritten during the Main Iteration
[Figure: per-code fraction of memory overwritten per iteration; some codes overwrite their full memory footprint, others stay below it]

21 Bandwidth Requirements
[Figure: average bandwidth (MB/s) versus checkpoint timeslice (s) for Sage-1000MB; the required bandwidth decreases with longer timeslices, from 78.8 MB/s at a 1 s timeslice down to 12.1 MB/s at the 20 s end]
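Reading the two plotted points through B_incr(τ) = M_dirty(τ)/τ shows why the curve falls: the dirty set implied at 20 s is far smaller than twenty times the 1 s dirty set, because the same pages are re-dirtied within a longer interval (arithmetic derived from the slide's two numbers):

```latex
M_{\text{dirty}}(1\,\text{s}) = 78.8\ \text{MB/s} \times 1\,\text{s} = 78.8\ \text{MB},
\qquad
M_{\text{dirty}}(20\,\text{s}) = 12.1\ \text{MB/s} \times 20\,\text{s} = 242\ \text{MB}
\;\ll\; 20 \times 78.8\ \text{MB} = 1576\ \text{MB}.
```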

22 Bandwidth Requirements for a 1-second Timeslice
[Figure: required bandwidth per code at a 1 s timeslice; it increases with the memory footprint, and the most demanding code is compared against single SCSI disk performance]

23 Increasing Memory Footprint Size
[Figure: average bandwidth (MB/s) versus timeslice (s) for growing footprints; the bandwidth requirement increases sublinearly with memory footprint size]

24 Increasing Processor Count
[Figure: average bandwidth (MB/s) versus timeslice (s) under weak scaling; the per-process bandwidth requirement decreases slightly with processor count]

25 Technological Trends
[Figure: performance improvement per year by component; application performance, and hence checkpoint data volume, is bounded by memory improvements, while network and storage bandwidth increase at a faster pace]

26 Conclusions
- No technological limitation prevents commodity cluster components from supporting automatic, frequent, and user-transparent incremental checkpointing
- Current hardware technology can sustain the bandwidth requirements
- These results can be generalized to future large-scale computers

27 Conclusions
- The per-process bandwidth requirement decreases slightly with processor count
- It increases sublinearly with the memory footprint size
- Improvements in networking and storage will make incremental checkpointing even more effective in the future

