1 Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
Presented by Stephen L. Scott and Christian Engelmann
Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA

2 Research and development goals
- Develop techniques to enable HPC systems to run computational jobs 24x7
- Develop proof-of-concept prototypes and production-type RAS solutions
- Provide high-level RAS capabilities for current terascale and next-generation petascale high-performance computing (HPC) systems
- Eliminate many of the numerous single points of failure and control in today's HPC systems

3 MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
- Addresses the challenges for operating and runtime systems to run large applications efficiently on future ultra-scale high-end computers
- Part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)
- MOLAR is a collaborative research effort (www.fastos.org/molar)

4 Active/standby with shared storage
- Single active head node
- Backup to shared storage
- Simple checkpoint/restart
- Fail-over to standby node
Drawbacks:
- Possible corruption of backup state when failing during backup
- Introduction of a new single point of failure
- No guarantee of correctness and availability
Examples: Simple Linux Utility for Resource Management (SLURM); metadata servers of the Parallel Virtual File System (PVFS) and Lustre
[Figure: Active/Standby Head Nodes with Shared Storage]
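To make the scheme concrete, here is a minimal Python sketch of checkpointing to shared storage with fail-over. The path and function names are illustrative assumptions, not taken from SLURM, PVFS, or Lustre; the temp-file-plus-rename step is one standard way to narrow the backup-corruption window noted above.

# Minimal sketch of active/standby fail-over through shared storage.
# SHARED_DIR stands in for a shared file system; all names are invented.
import json
import os
import tempfile

SHARED_DIR = "/tmp/shared_storage"

def checkpoint(state: dict) -> None:
    """Active head node writes its service state to shared storage.

    A naive overwrite can corrupt the backup if the node fails
    mid-write (the drawback listed on the slide); writing to a temp
    file and renaming makes the update atomic on POSIX file systems.
    """
    os.makedirs(SHARED_DIR, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=SHARED_DIR)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.rename(tmp, os.path.join(SHARED_DIR, "checkpoint.json"))

def recover() -> dict:
    """Standby head node restores the last checkpoint on fail-over."""
    with open(os.path.join(SHARED_DIR, "checkpoint.json")) as f:
        return json.load(f)

if __name__ == "__main__":
    checkpoint({"queue": ["job-1", "job-2"], "next_id": 3})
    print(recover())   # {'queue': ['job-1', 'job-2'], 'next_id': 3}

The shared-storage path itself remains a single point of failure, which is exactly the limitation the slide calls out.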

5 Active/standby redundancy
- Single active head node
- Backup to standby node
- Simple checkpoint/restart
- Fail-over to standby node
- Idle standby head node
- Rollback to backup
- Service interruption for fail-over and restore-over
Examples: HA-OSCAR; Torque on Cray XT
[Figure: Active/Standby Head Nodes]
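A toy sketch of the heartbeat-driven fail-over this slide describes, assuming the active node piggybacks its checkpoint on each heartbeat; the class, timeout, and state layout are invented for illustration, not HA-OSCAR's actual design.

# Toy simulation of active/standby fail-over with an idle standby.
import time

FAILOVER_TIMEOUT = 0.35   # silence after which the standby takes over

class Standby:
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.backup_state = None   # last checkpoint from the active node
        self.active = False

    def on_heartbeat(self, state):
        """Active node piggybacks its checkpoint on each heartbeat."""
        self.last_heartbeat = time.monotonic()
        self.backup_state = state

    def monitor(self):
        """Take over, rolling back to the last backup, once the active
        node has been silent past the timeout; the gap between failure
        and takeover is the service interruption the slide mentions."""
        if time.monotonic() - self.last_heartbeat > FAILOVER_TIMEOUT:
            self.active = True
            return self.backup_state
        return None

standby = Standby()
standby.on_heartbeat({"queue": ["job-1"]})
time.sleep(0.4)              # simulate the active node failing silently
print(standby.monitor())     # {'queue': ['job-1']} -> standby is now active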

6 Asymmetric active/active redundancy
- Many active head nodes
- Work-load distribution
- Optional fail-over to standby head node(s) (n+1 or n+m)
- No coordination between active head nodes
- Service interruption for fail-over and restore-over
- Loss of state without standby
- Limited use cases, such as high-throughput computing
Example: prototype based on HA-OSCAR
[Figure: Asymmetric Active/Active Head Nodes]
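A rough sketch of the asymmetric model under stated assumptions: jobs are spread across uncoordinated head nodes, and a single n+1 standby replaces whichever head fails, losing that head's queued state. Everything here is hypothetical scaffolding, not the HA-OSCAR-based prototype.

# Toy sketch of asymmetric active/active head nodes with one n+1 spare.
from itertools import count

heads = {"head0": [], "head1": [], "head2": []}   # active head nodes
standby = "standby0"                              # optional n+1 spare
job_ids = count(1)

def submit(job):
    """Least-loaded distribution; each head serves its jobs alone,
    with no coordination or replication between active heads."""
    name = sorted(heads, key=lambda h: len(heads[h]))[0]
    heads[name].append(job)
    return name

def fail(name):
    """n+1 fail-over: the standby inherits the failed head's role;
    state queued on the failed head is lost (nothing was replicated)."""
    global standby
    heads[standby] = []     # standby starts empty: queued jobs are gone
    del heads[name]
    standby = None

for _ in range(6):
    submit(f"job-{next(job_ids)}")
print(heads)      # two jobs per head
fail("head1")
print(heads)      # head1's jobs lost; standby0 now active and empty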

7 Symmetric active/active redundancy
- Many active head nodes
- Work-load distribution
- Symmetric replication between head nodes
- Continuous service
- Always up to date
- No fail-over necessary
- No restore-over necessary
- Virtual synchrony model
- Complex algorithms
- JOSHUA prototype for Torque
[Figure: Active/Active Head Nodes]

8 Symmetric active/active replication
[Figure: replication pipeline of input replication, virtually synchronous processing, and output unification]
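A minimal sketch of this pipeline: a toy sequencer stands in for the group-communication layer that would provide atomic multicast in a real system, every replica applies the ordered inputs deterministically, and duplicate outputs are unified so clients see one answer. None of the names come from the JOSHUA prototype.

# Minimal sketch of symmetric active/active replication.

class Sequencer:
    """Stands in for atomic multicast: assigns a global total order."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.seq = 0

    def broadcast(self, request):
        self.seq += 1
        return [r.apply(self.seq, request) for r in self.replicas]

class Replica:
    """Deterministic state machine: same ordered inputs -> same state."""
    def __init__(self):
        self.queue = []

    def apply(self, seq, request):
        self.queue.append(request)
        return (seq, f"accepted {request} as #{len(self.queue)}")

def unify(outputs):
    """Output unification: all replicas produce identical answers, so
    any single copy (here the first) is delivered to the client."""
    assert len(set(outputs)) == 1, "replicas diverged"
    return outputs[0]

replicas = [Replica(), Replica(), Replica()]
seq = Sequencer(replicas)
print(unify(seq.broadcast("qsub job-1")))   # (1, 'accepted qsub job-1 as #1')
print(unify(seq.broadcast("qsub job-2")))   # (2, 'accepted qsub job-2 as #2')

Because every head node stays up to date, a failed node is simply dropped from the group: no fail-over and no restore-over, as the previous slide states.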

9 Symmetric active/active high availability for head and service nodes
- A_component = MTTF / (MTTF + MTTR)
- A_system = 1 - (1 - A_component)^n
- T_down = 8760 hours × (1 - A_system)
- Single-node MTTF: 5,000 hours; single-node MTTR: 72 hours

Nodes | Availability | Est. annual downtime
1     | 98.58%       | 5d 4h 21m
2     | 99.97%       | 1h 45m
3     | 99.9997%     | 1m 30s
4     | 99.999995%   | 1s

Single-site redundancy for seven nines does not mask catastrophic events.
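The slide's model computed directly, using the figures given above (MTTF = 5,000 h, MTTR = 72 h, 8,760 h per year):

# Availability and annual downtime for n redundant head nodes.
MTTF, MTTR, HOURS_PER_YEAR = 5000.0, 72.0, 8760.0

a_component = MTTF / (MTTF + MTTR)           # single-node availability

for n in range(1, 5):
    a_system = 1 - (1 - a_component) ** n    # n-node redundant system
    downtime_s = HOURS_PER_YEAR * (1 - a_system) * 3600
    print(f"{n} node(s): A = {a_system:.8%}, downtime ~ {downtime_s:,.0f} s/year")

Running this reproduces the table above to within rounding.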

10 High-availability framework for HPC
- Pluggable component framework: communication drivers, group communication, virtual synchrony, applications
- Interchangeable components
- Adaptation to application needs, such as level of consistency
- Adaptation to system properties, such as network and system scale
[Figure: layered framework stack]
- Applications: scheduler, MPI runtime, file system, SSI
- Virtual synchrony: replicated memory, replicated file, replicated state-machine, replicated database, replicated RPC/RMI, distributed control
- Group communication: membership management, failure detection, reliable multicast, atomic multicast
- Communication driver: singlecast, failure detection, multicast
- Network: Ethernet, Myrinet, Elan+, InfiniBand, …
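One way to picture the pluggable layering, sketched with invented interfaces (the framework's real APIs are not shown on the slide): a group-communication module is written against a small driver interface, so either side can be swapped to match the network and system scale.

# Hedged sketch of a pluggable layer stack; all names are illustrative.
from abc import ABC, abstractmethod

class CommunicationDriver(ABC):
    """Bottom layer: interchangeable transport (Ethernet, Myrinet, ...)."""
    @abstractmethod
    def singlecast(self, node: str, msg: bytes) -> None: ...

class EthernetDriver(CommunicationDriver):
    def singlecast(self, node, msg):
        print(f"ethernet -> {node}: {msg!r}")   # stand-in for a real send

class GroupCommunication:
    """Middle layer, built on whichever driver is plugged in. This naive
    multicast just loops over members; a real module would add the
    reliability and ordering guarantees the slide lists."""
    def __init__(self, driver: CommunicationDriver, members: list[str]):
        self.driver, self.members = driver, members

    def multicast(self, msg: bytes):
        for node in self.members:
            self.driver.singlecast(node, msg)

group = GroupCommunication(EthernetDriver(), ["head0", "head1", "head2"])
group.multicast(b"checkpoint #42")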

11 Scalable, fault-tolerant membership for MPI tasks on HPC systems
- Scalable approach to reconfiguring the communication infrastructure
- Decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults
- Resilience against multiple node failures, even during reconfiguration
- Response time:
  - Hundreds of microseconds over MPI on a 1024-node Blue Gene/L
  - Single-digit milliseconds over TCP on a 64-node Gigabit Ethernet Linux cluster (XTORC)
- Integration with Berkeley Lab Checkpoint/Restart (BLCR) to handle node failures without restarting an entire MPI job
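A toy illustration of decentralized view reconciliation, under the assumption of a simple gossip-and-intersect rule; the actual protocol, its consistency guarantees, and the BLCR integration are more involved than this.

# Toy sketch: after a failure, live nodes gossip pairwise and intersect
# their membership views until all agree on the survivors (the
# "stabilization" timed on the next two slides). Purely illustrative.
import random

live = [f"n{i}" for i in range(7)]             # n7 has just failed
views = {n: set(live) | {"n7"} for n in live}  # everyone still lists n7
views["n0"].discard("n7")                      # n0's failure detector fired

rounds = 0
while len({frozenset(v) for v in views.values()}) > 1:
    rounds += 1
    for node in live:                           # one gossip exchange each
        peer = random.choice([p for p in live if p != node])
        merged = views[node] & views[peer]      # agree on the intersection
        views[node] = views[peer] = merged
print(f"stabilized on {sorted(views['n0'])} after {rounds} round(s)")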

12 Stabilization time over MPI on BG/L
[Figure: stabilization time (microseconds, 0 to 350) vs. number of nodes (log scale, 4 to 1024); curves for experimental results, distance model, and base model]

13 Stabilization time over TCP on XTORC
[Figure: stabilization time (microseconds, 500 to 2000) vs. number of nodes (3 to 47); curves for experimental results, distance model, and base model]

14 ORNL contacts
Stephen L. Scott
Network and Cluster Computing, Computer Science and Mathematics
(865) 574-3144
Scottsl@ornl.gov

Christian Engelmann
Network and Cluster Computing, Computer Science and Mathematics
(865) 574-3132
Engelmannc@ornl.gov

