
1 Characterizing Failure Data in High-Performance-Computing Systems
Bianca Schroeder, with Garth Gibson and Gary Grider
Department of Computer Science, Carnegie Mellon University

2 What do failures look like?
Realistic benchmarking requires an understanding of real failure workloads, e.g.:
Cause of failures?
Failure rate and its distribution?
Mean time to repair and its distribution?
What parameters affect the above?
Preparing the release of a large set of failure data from HPC systems at LANL.

3 The data
Contains a record for each failure event requiring the intervention of a system administrator:
[StartTime | EndTime | System | Node | Root cause]

4-6 The data
[Diagram: the systems covered by the data, grouped by hardware type: SMPs, NUMA machines, and clusters of types A, B, and C; one cluster system has 544 processors.]
Total of 23,000 records for 20 systems over 9 years!
Will be made publicly available.
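The five-field layout above suggests a simple tabular file. A minimal loading sketch, assuming the released data comes as a pipe-separated text file; the filename and the timestamp format are assumptions here, not part of the announced format:

import csv
from datetime import datetime

FIELDS = ["StartTime", "EndTime", "System", "Node", "RootCause"]

def load_records(path):
    # Load one failure record per line from a pipe-separated file
    # with the five fields named on the slide.
    records = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f, fieldnames=FIELDS, delimiter="|"):
            rec = {k: v.strip() for k, v in row.items()}
            # Parse timestamps (format is an assumption) so that the
            # repair time, EndTime - StartTime, is directly computable.
            for key in ("StartTime", "EndTime"):
                rec[key] = datetime.strptime(rec[key], "%Y-%m-%d %H:%M")
            records.append(rec)
    return records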

7-8 Cause of failures
[Bar chart: percentage of failures by root cause, with hardware components (CPU, memory, other) and software components (OS, parallel file system, other) labeled.]
Hardware and software dominate. The breakdown is quite consistent across systems and over time.
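Given records in the form sketched above, such a breakdown is a simple frequency count; load_records and the filename are the hypothetical pieces from the earlier sketch:

from collections import Counter

def cause_breakdown(records):
    # Percentage of failures attributed to each root-cause category.
    counts = Counter(rec["RootCause"] for rec in records)
    total = sum(counts.values())
    return {cause: 100.0 * n / total for cause, n in counts.items()}

# Example: print causes by share, mirroring the slide's chart.
breakdown = cause_breakdown(load_records("lanl_failures.txt"))
for cause, pct in sorted(breakdown.items(), key=lambda kv: -kv[1]):
    print(f"{cause:20s} {pct:5.1f}%")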

9 The failure rate versus system age
[Figure: the classic "bathtub" curve of failure rate vs. time in use, with infant-mortality and wear-out phases marked. "Reverse bathtub" --- why?]

10 The failure rate versus system age
[Figure: failures by time of day. "Reverse bathtub" --- why?]
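One way to reproduce a failure-rate-vs-age curve from the records is to bin each system's failures by the system's age at failure time; a sketch, again assuming the hypothetical loader above plus a per-system installation date, which is not part of the record format shown:

from collections import Counter
from datetime import datetime

def failures_by_month_of_age(records, system, install_time):
    # Count failures per 30-day window after installation for one
    # system; plotting these counts approximates the failure rate
    # vs. system age curve discussed on the slide.
    counts = Counter()
    for rec in records:
        if rec["System"] == system:
            age = (rec["StartTime"] - install_time).days // 30
            counts[age] += 1
    if not counts:
        return []
    return [counts.get(m, 0) for m in range(max(counts) + 1)]

# Hypothetical usage; the system name and install date are made up.
# curve = failures_by_month_of_age(load_records("lanl_failures.txt"),
#                                  "cluster-A-1", datetime(1996, 12, 1))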

11-12 Time to Repair
[Plot: empirical distribution, Pr(TTR < x) vs. TTR in hours.]
10% repaired in < 10 min; 50% repaired in < 50 min; 90% repaired in < 5 h; average TTR ~ 6 h.
TTR is similar across systems and over time, and somewhat dependent on the cause of failure.
Very variable distribution => far from exponential. The same holds for the time between failures (MTBF).
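One way to quantify "far from exponential" is to fit candidate distributions to the observed repair times and compare the fits. A sketch using numpy/scipy; the lognormal is used here only as an illustrative heavy-tailed alternative, not necessarily the distribution the authors fit:

import numpy as np
from scipy import stats

def compare_ttr_fits(ttr_hours):
    # Fit exponential and lognormal distributions to observed repair
    # times (in hours) and return the mean log-likelihood of each fit;
    # a clearly better lognormal fit reflects the high variability
    # noted on the slide.
    x = np.asarray(ttr_hours, dtype=float)
    c2 = x.var() / x.mean() ** 2  # squared coeff. of variation; == 1 for exponential
    exp_params = stats.expon.fit(x, floc=0)     # (loc, scale)
    logn_params = stats.lognorm.fit(x, floc=0)  # (shape, loc, scale)
    return {
        "C^2": c2,
        "exponential": stats.expon.logpdf(x, *exp_params).mean(),
        "lognormal": stats.lognorm.logpdf(x, *logn_params).mean(),
    }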

13 Future work
Create a public data repository: other sites, error logs, utilization and workload data.
What are the most important statistics? What other data should be included?
Come and talk to us! Or e-mail: bianca@cs.cmu.edu

