Published by Shon Jenkins. Modified over 9 years ago.
1
Computer and Computational Sciences Division
Los Alamos National Laboratory
On the Feasibility of Incremental Checkpointing for Scientific Computing
Jose Carlos Sancho, jcsancho@lanl.gov
with Fabrizio Petrini, Greg Johnson, Juan Fernandez and Eitan Frachtenberg
Performance and Architectures Lab (PAL)
2
Jose C. Sancho jcsancho@lanl.gov Inter. Parallel & Distributed Processing Symposium CCS-3 PAL Santa Fe, NM
Talk Overview
- Goal
- Fault-tolerance for Scientific Computing
- Methodology
- Characterization of Scientific Applications
- Performance Evaluation of Incremental Checkpointing
- Concluding Remarks
3
Goal
Prove the feasibility of incremental checkpointing:
- Frequent
- Automatic
- User-transparent
- No changes to application
- No special hardware support
4
Large Scale Computers
- Large component count
- Strongly coupled hardware
Figure labels: 133,120 processors; 608,256 DRAM; Failure Rate
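The link between component count and failure rate can be illustrated with a small calculation. This is only a sketch: it assumes independent components with constant (exponential) failure rates, so that the system failure rate is the sum of the component rates. The component counts come from the slide; the per-component MTBF value is hypothetical and is not taken from the talk.

```python
# Component counts shown on the slide.
COMPONENT_COUNTS = {"processors": 133_120, "dram": 608_256}

# Hypothetical per-component MTBF in hours (an assumption, not a measured value).
ASSUMED_COMPONENT_MTBF_HOURS = 1_000_000.0


def system_mtbf_hours(counts, component_mtbf_hours):
    """System MTBF assuming independent components with constant failure rates."""
    total_rate = sum(count / component_mtbf_hours for count in counts.values())
    return 1.0 / total_rate


mtbf = system_mtbf_hours(COMPONENT_COUNTS, ASSUMED_COMPONENT_MTBF_HOURS)
print(f"System MTBF under these assumptions: {mtbf:.2f} hours")
```

Under this model the system MTBF shrinks in proportion to the number of components, which is why failures become expected events for applications that run for months.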
5
Scientific Computing
- Running for months
- Demands high capability
Failures expected during the application’s execution
6
Providing Fault-tolerance
- Hardware replication (high cost solution!)
- Checkpointing and rollback recovery
Figure labels: Spare node, Checkpointing, Recovery
7
Checkpointing and Recovery
- Simplicity
  - Easy implementation
- Cost-effective
  - No additional hardware support
Critical aspect: Bandwidth requirements
Figure label: Saving process state
8
Reducing Bandwidth
- Incremental checkpointing
  - Only the memory modified since the previous checkpoint is saved to stable storage
Figure labels: Full, Process state, Incremental
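The idea of saving only modified memory can be sketched in a few lines. The example below is a minimal illustration, not the mechanism used in the talk: it detects changed pages by comparing per-page hashes between checkpoints, and the 4096-byte page size and all function names are assumptions made for the example.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size in bytes


def page_digests(memory):
    """Return one SHA-256 digest per PAGE_SIZE chunk of the memory image."""
    return [
        hashlib.sha256(memory[i:i + PAGE_SIZE]).digest()
        for i in range(0, len(memory), PAGE_SIZE)
    ]


def incremental_checkpoint(memory, previous_digests):
    """Return ({page_index: page_bytes} for changed pages, current digests)."""
    current = page_digests(memory)
    changed = {}
    for index, digest in enumerate(current):
        if index >= len(previous_digests) or digest != previous_digests[index]:
            start = index * PAGE_SIZE
            changed[index] = memory[start:start + PAGE_SIZE]
    return changed, current


# The first checkpoint has no history, so every page is saved (a full checkpoint).
image = bytearray(8 * PAGE_SIZE)
full, digests = incremental_checkpoint(bytes(image), [])

# Modify one byte; the next checkpoint contains only the page that holds it.
image[3 * PAGE_SIZE + 10] = 0xFF
delta, digests = incremental_checkpoint(bytes(image), digests)
print(len(full), "pages in full checkpoint;", sorted(delta), "in incremental")
```

Hashing requires scanning all memory at every checkpoint, so it only illustrates the saving in data written, not a low-overhead way to find the modified pages.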
9
New Challenges
- Frequent checkpoints:
  - Minimizing rollback interval to increase system availability
- Automatic and user-transparent
  - Autonomic computing
  - New vision to manage the high complexity of large systems
  - Self-healing and self-repairing
More bandwidth pressure
10
Survey of Implementation Levels
Levels: Hardware, Operating system, Run-time library, Application
Systems named: CLIP, Dome, CCITF; Ickp, CoCheck, Diskless; Revive, Safetynet
Callout: Just a few!!
11
Enabling Automatic Checkpointing
Figure: implementation levels (Hardware, Operating system, Run-time library, Application); scale labels: User intervention, Checkpoint data, Low, High; label: automatic
12
The Bandwidth Challenge
Does the current technology provide enough bandwidth?
Figure labels: Frequent, Automatic
13
Methodology
- Analyzing the Memory Footprint of Scientific Codes
  - Run-time library
Figure: Application’s Memory Footprint; labels: stack, heap, static data, text, mmap, mprotect()
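The slide names mprotect() next to the memory regions. A common way to use it for this kind of analysis is to write-protect pages and record the first write to each one. The sketch below simulates that bookkeeping in plain Python. It does not call mprotect() or handle real page faults, and the class and method names are hypothetical.

```python
class TrackedMemory:
    """Simulated address space with per-page write protection."""

    def __init__(self, num_pages, page_size=4096):
        self.page_size = page_size
        self.data = bytearray(num_pages * page_size)
        self.protected = [True] * num_pages  # all pages start write-protected
        self.dirty = set()

    def write(self, address, value):
        page = address // self.page_size
        if self.protected[page]:
            # Stands in for the fault handler: record the page, then unprotect
            # it so later writes in the same timeslice cost nothing extra.
            self.dirty.add(page)
            self.protected[page] = False
        self.data[address] = value

    def end_timeslice(self):
        """Return the bytes dirtied in this timeslice and re-protect all pages."""
        dirty_bytes = len(self.dirty) * self.page_size
        self.dirty.clear()
        self.protected = [True] * len(self.protected)
        return dirty_bytes


mem = TrackedMemory(num_pages=16)
mem.write(0, 1)          # first write to page 0
mem.write(1, 2)          # same page, already unprotected
mem.write(5 * 4096, 3)   # first write to page 5
first = mem.end_timeslice()
second = mem.end_timeslice()  # no writes happened in this timeslice
print(first, second)
```

The number of bytes returned per timeslice is the quantity an incremental checkpoint would have to save for that interval.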
14
Methodology
- Quantifying the Bandwidth Requirements
  - Checkpoint intervals: 1 s to 20 s
  - Comparing with the current bandwidth available
    - 900 MB/s: sustained network bandwidth, Quadrics QsNet II
    - 75 MB/s: single sustained disk bandwidth, Ultra SCSI controller
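The comparison described above amounts to dividing the data modified in a checkpoint interval by the interval length and checking the result against the available bandwidth. A minimal sketch follows. The two limits are the figures quoted on the slide; the function names and the 100 MB input are hypothetical.

```python
NETWORK_MBPS = 900.0  # sustained network bandwidth quoted for Quadrics QsNet II
DISK_MBPS = 75.0      # sustained bandwidth quoted for a single Ultra SCSI disk


def required_bandwidth(dirty_mb, timeslice_s):
    """Bandwidth (MB/s) needed to save dirty_mb once every timeslice_s seconds."""
    if timeslice_s <= 0:
        raise ValueError("timeslice must be positive")
    return dirty_mb / timeslice_s


def fits(required_mbps):
    """Report whether the demand fits within each quoted limit."""
    return {
        "network": required_mbps <= NETWORK_MBPS,
        "disk": required_mbps <= DISK_MBPS,
    }


# Hypothetical input: 100 MB modified during a 5 s checkpoint interval.
demand = required_bandwidth(dirty_mb=100.0, timeslice_s=5.0)
print(demand, fits(demand))
```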
15
Experimental Environment
- 32-node Linux cluster
  - 64 Itanium II processors
  - PCI-X I/O bus
  - Quadrics QsNet interconnection network
- Parallel scientific codes
  - Sage
  - Sweep3D
  - NAS parallel benchmarks: SP, LU, BT and FT
Representative of the ASCI production codes at LANL
16
Memory Footprint
Application    Memory footprint
Sage-1000MB    954.6 MB
Sage-500MB     497.3 MB
Sage-100MB     103.7 MB
Sage-50MB      55 MB
Sweep3D        105.5 MB
SP Class C     40.1 MB
LU Class C     16.6 MB
BT Class C     76.5 MB
FT Class C     118 MB
Callout: Increasing memory footprint
17
Talk Overview
- Goal
- Fault-tolerance for scientific computing
- Methodology
- Characterization of scientific applications
- Performance evaluation of Incremental Checkpointing
  - Bandwidth
  - Scalability
18
Characterization
Figure (Sage-1000MB); labels: Data initialization, Regular processing bursts
19
Communication
Figure (Sage-1000MB); labels: Interleaved, Regular communication bursts
20
Fraction of the Memory Footprint Overwritten during the Main Iteration
Figure labels: Full memory footprint, Below the full memory footprint
21
Bandwidth Requirements
Figure (Sage-1000MB): Bandwidth (MB/s) versus Timeslices (s); labeled values: 78.8 MB/s and 12.1 MB/s
Decreases with the timeslices
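To put the two labeled values in context, the short calculation below expresses them as a percentage of the 900 MB/s network and 75 MB/s single-disk figures quoted on the earlier methodology slide. It is only an arithmetic sketch: the slide text does not state which timeslice each value belongs to, so none is assumed here, and the helper name is hypothetical.

```python
NETWORK_MBPS = 900.0  # sustained network bandwidth from the methodology slide
DISK_MBPS = 75.0      # single sustained disk bandwidth from the methodology slide

# The two values labeled on the Sage-1000MB bandwidth plot.
labeled_mbps = [78.8, 12.1]


def percent_of(limit_mbps, demand_mbps):
    """Demand as a percentage of the given limit, rounded to one decimal."""
    return round(100.0 * demand_mbps / limit_mbps, 1)


shares = [
    (demand, percent_of(NETWORK_MBPS, demand), percent_of(DISK_MBPS, demand))
    for demand in labeled_mbps
]
for demand, network_pct, disk_pct in shares:
    print(f"{demand} MB/s: {network_pct}% of network, {disk_pct}% of one disk")
```

The larger value is a small fraction of the network figure but slightly exceeds what a single disk sustains, while the smaller value fits comfortably within both.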
22
Bandwidth Requirements for 1 second
Figure labels: Increases with memory footprint, Single SCSI disk performance, Most demanding
23
Increasing Memory Footprint Size
Figure: Average Bandwidth (MB/s) versus Timeslices (s)
Increases sublinearly
24
Increasing Processor Count
Figure: Average Bandwidth (MB/s) versus Timeslices (s); label: Weak-scaling
Decreases slightly with processor count
25
Technological Trends
Figure: Performance Improvement per year; labels: Performance of applications bounded by memory improvements, Increases at a faster pace
26
Conclusions
- No technological limitations of commodity components for clusters to implement automatic, frequent, and user-transparent incremental checkpointing
- Current hardware technology can sustain the bandwidth requirements
- These results can be generalized to future large scale computers
27
Conclusions
- The process bandwidth decreases slightly with processor count
- Increases sublinearly with the memory footprint size
- Improvements in networking and storage will make incremental checkpointing even more effective in the future