1
Making HPC System Acquisition Decisions Is an HPC Application
Larry P. Davis and Cray J. Henry, Department of Defense High Performance Computing Modernization Program Office
Roy L. Campbell, Jr., U.S. Army Research Laboratory
William Ward, U.S. Army Engineer Research and Development Center
Allan Snavely and Laura Carrington, University of California at San Diego
November 2004
Department of Defense High Performance Computing Modernization Program
2
Overview
- Program Background
- Acquisition Methodology: process; benchmarks; performance and price/performance scoring; system selection and optimization of program workload; uncertainty analysis
- Performance Prediction and Analysis
3
An Exciting Time for Benchmarking and Performance Modeling!
- DOE PERC program
- DoD benchmarking and performance modeling activities
- DARPA HPCS Productivity Team benchmarking activities
- HEC Revitalization Task Force Report
- Joint Federal Agency benchmarking and acquisitions
- Federal large benchmarking study (Federal Agencies, HPC User Forum, IDC)
5
Current User Base and Requirements
- 561 projects and 4,572 users at approximately 179 sites
- Requirements categorized in 10 Computational Technology Areas (CTAs)
- FY 2005 non-real-time requirements of 260 teraFLOPS-years
Users by CTA (as of November 2004): CSM – 507; CFD – 1,135; CCM – 235; CEA – 304; CWO – 231; SIP – 435; FMS – 889; EQM – 170; IMT – 568; CEN – 38; 60 users are self-characterized as “other”
7
Technology Insertion (TI) HPC System Acquisition Process
- Annual process to purchase high performance computing capability for major shared resource centers (MSRCs) and allocated distributed centers
- Total funding of $35M–$60M (~$50M in FY 2005)
- Two of the four major shared resource centers provisioned each year on a rotating basis
- TI-04 process upgraded HPC capabilities at the Army Research Laboratory and Naval Oceanographic Office MSRCs
- TI-05 process will upgrade HPC capabilities at the Aeronautical Systems Center and Engineer Research and Development Center MSRCs
8
Technology Insertion 2005 (TI-05) Acquisition Process
- Assess computational requirements
- Determine application benchmarks and their weights
- Develop acquisition process and evaluation criteria, using GSA as acquisition agent
- Execute Phase I RFQ and evaluation: identification of promising HPC systems
- Execute Phase II RFQ and evaluation: construct best solution sets of systems
- Purchase best overall solution set through GSA
9
TI-05 Evaluation Criteria
- Performance (quantitative): price/performance and raw performance, scored from a subset of the benchmarks
- Usability (qualitative): user criteria and center criteria
- Confidence/Past Performance
10
Types of Benchmark Codes
Synthetic codes:
- Basic hardware and system performance tests
- Meant to determine expected future performance and to serve as a surrogate for workload not represented by the application codes
- Scalable, quantitative synthetic tests are used for evaluation by the Performance Team; others are used as system performance checks and for qualitative evaluation by the Usability Team
- A subset of the synthetic tests, needed for performance modeling, is required
Application codes:
- Actual application codes as determined by requirements and usage
- Meant to indicate current performance
- Each application code (except two) has two test cases: standard and large
11
TI-05 Synthetic Benchmark Codes
- I/O tests: include a simplified streaming test and a scalable I/O test
- Operating system tests: measure the performance of system calls, interprocessor communication, and TCP scalability (now including IPv4 and IPv6)
- Memory tests: measure memory hierarchy performance, such as memory bandwidth (now including multiple memory performance curves based on the fraction of random strides in memory access)
- Network tests: a set of five MPI tests (point-to-point, broadcast, allreduce)
- CPU tests: exercise multiple fundamental computation kernels, BLAS routines, and ScaLAPACK routines
- PMaC machine probes: exercise basic system functions for use in performance predictions (included in the memory tests, network tests, and streaming I/O test)
12
TI-05 Application Benchmark Codes
- Aero – aeroelasticity CFD code (Fortran, serial vector, 15,000 lines of code)
- AVUS (Cobalt-60) – turbulent flow CFD code (Fortran, MPI, 19,000 lines of code)
- GAMESS – quantum chemistry code (Fortran, MPI, 330,000 lines of code)
- HYCOM – ocean circulation modeling code (Fortran, MPI, 31,000 lines of code)
- OOCore – out-of-core solver (Fortran, MPI, 39,000 lines of code)
- RFCTH2 – shock physics code (~43% Fortran/~57% C, MPI, 436,000 lines of code)
- WRF – multi-agency mesoscale atmospheric modeling code (Fortran and C, MPI, 100,000 lines of code)
- Overflow-2 – CFD code originally developed by NASA (Fortran 90, MPI, 83,000 lines of code)
13
Basic Rules for Application Benchmarks: Emphasis on Performance
- Establish a DoD standard benchmark time for each application benchmark case; the NAVO IBM Regatta P4 was chosen as the standard DoD system
- Benchmark timings (at least three on each test case) are requested for systems that meet or beat the DoD standard benchmark times by at least a factor of two (preferably four)
- Benchmark timings may be extrapolated provided they are guaranteed, but at least one actual timing on the offered or a closely related system must be provided
14
Benchmark Scoring
- Two major components of benchmark scoring: application codes and synthetic codes
- Not all application codes need be run, but the more a vendor runs, the greater its opportunity to be part of the final mix
- Quantitatively scored synthetic tests are evaluated in a manner consistent with the application tests
- Vendors are required to run a load-mix test in response to the Phase II RFQ
- The weight for application code scores is greater than the weight for synthetic code scores in determining the price/performance score
- It is essential that results be provided on all required synthetic tests, and very important on the other tests
(A hedged sketch of one way such weighted scoring could work follows this list.)
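The slides do not spell out the roll-up formula, so the following is only an illustrative sketch: it assumes each benchmark is scored as the ratio of the DoD standard time to the measured time, and that application and synthetic averages are combined with an unequal weight split (the 0.7/0.3 numbers are invented).

```python
# Hypothetical roll-up of benchmark results into a single performance score.
# Assumption: each benchmark is scored as (DoD standard time / measured time),
# and application benchmarks carry more weight than synthetic ones, as the
# slide states; the 0.7/0.3 split is illustrative only.

def benchmark_score(standard_time_s, measured_time_s):
    """Speedup relative to the DoD standard system (>= 2 expected per the rules)."""
    return standard_time_s / measured_time_s

def weighted_performance(app_scores, syn_scores, app_weight=0.7, syn_weight=0.3):
    """Combine average application and synthetic scores with unequal weights."""
    app_avg = sum(app_scores) / len(app_scores)
    syn_avg = sum(syn_scores) / len(syn_scores)
    return app_weight * app_avg + syn_weight * syn_avg

# Example: a system running the application cases 2-3x faster than the standard.
print(weighted_performance([2.4, 3.1, 2.2], [1.8, 2.0]))
```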
15
Use of Benchmark Data to Score Systems and Construct Alternatives
- Determine workload percentages by CTA
- Partition CTA percentages among benchmark test cases
- Consider all alternatives that meet total cost constraints
- Using benchmark scores, maximize workload for each alternative, subject to the constraint of matching the required CTA percentages
- Determine the price/performance score for each alternative and rank order
16
HPCMP System Performance (Unclassified)
[Chart of measured performance across roughly 40 HPCMP systems; labels on individual systems (values such as 1, 2, 3, 4, 5, 9) give the number of application test cases not included, out of 13 total.]
17
How the Optimizer Works: Problem Description
[Diagram: KNOWN quantities are the application score matrix (machines by application test cases), machine prices, budget limits, the overall desired workload distribution across application test cases, and the allowed deviation (high/low) from that distribution. UNKNOWN quantities are the workload distribution matrix (machines by application test cases) and the optimal quantity set of machines.]
18
Problem Description
- Offered systems: quantity is variable; workload allocation is variable
- Existing systems: quantity is fixed; workload allocation is variable
19
Motivation
- Primary goal: find the solution set with optimal (minimum) price/performance, as well as solution sets with price/performance within X% of the optimum
- Secondary goal: determine the optimal allocation of work for each application test case on each machine
20
Optimization Scheme
- Fix the quantity of each machine
- Mark quantity combinations that fall within the acquisition price range (viable options)
- Score each viable option (via a SIMPLEX optimization kernel)
- Divide life-cycle cost (acquisition price, maintenance, power, and any costs over and above normal operations) by total performance
- Rank results in ascending order
(A hedged sketch of the per-option scoring step appears below.)
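As a rough illustration of the per-option SIMPLEX step described above, the sketch below allocates each machine's time across application test cases to maximize total delivered performance while keeping each test case's share within an allowed deviation band, then forms a price/performance figure. The scores, quantities, bands, and life-cycle cost are invented, and the linear-program formulation is an assumption about how the real kernel is posed.

```python
# Minimal sketch of one option's scoring: fixed machine quantities, variable
# workload allocation, solved as a linear program with scipy's HiGHS backend.
import numpy as np
from scipy.optimize import linprog

scores = np.array([[3.0, 2.0, 1.5],    # machine 0 score on test cases 0..2
                   [1.0, 2.5, 2.0]])   # machine 1
quantity = np.array([2, 3])            # fixed quantities for this option
target = np.array([0.5, 0.3, 0.2])     # desired workload distribution by test case
dev = 0.05                             # allowed deviation on each share

n_mach, n_case = scores.shape
# Decision variables x[i, j]: fraction of machine i devoted to test case j.
perf = (quantity[:, None] * scores).ravel()   # delivered performance per unit fraction

c = -perf                              # maximize total performance -> minimize -perf @ x

# Each machine's fractions sum to at most 1.
A_time = np.zeros((n_mach, n_mach * n_case))
for i in range(n_mach):
    A_time[i, i * n_case:(i + 1) * n_case] = 1.0
b_time = np.ones(n_mach)

# Share constraints for each case j:  (target_j - dev)*T <= P_j <= (target_j + dev)*T,
# where P_j is performance delivered on case j and T is total performance.
perf_mat = perf.reshape(n_mach, n_case)
rows_hi, rows_lo = [], []
for j in range(n_case):
    pj = np.zeros((n_mach, n_case))
    pj[:, j] = perf_mat[:, j]
    pj = pj.ravel()
    rows_hi.append(pj - (target[j] + dev) * perf)    # P_j - hi*T <= 0
    rows_lo.append((target[j] - dev) * perf - pj)    # lo*T - P_j <= 0
A_ub = np.vstack([A_time, np.array(rows_hi), np.array(rows_lo)])
b_ub = np.concatenate([b_time, np.zeros(2 * n_case)])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, 1), method="highs")
total_perf = -res.fun
life_cycle_cost = 10.0e6               # assumed life-cycle cost ($) for this option
print("price/performance:", life_cycle_cost / total_perf)
```

In the full scheme this calculation would be repeated for every viable quantity combination, and the resulting price/performance values ranked in ascending order.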
21
Architecture % Selection by Processor Quantity (Example)
22
Should We Do Uncertainty Analysis?
23
Performance Modeling Uncertainty Analysis
- Assumption: uncertainties in measured performance values can be treated as uncertainties in measurements of physical quantities
- For small, random uncertainties in measured values x, y, z, …, the uncertainty in a calculated function q(x, y, z, …) can be expressed as shown below
- Systematic errors need careful consideration, since they cannot be calculated analytically
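The equation on this slide did not survive extraction; the standard propagation-of-uncertainty formula for small, independent random errors, which matches the slide's description, is:

\[
\delta q \;=\; \sqrt{\left(\frac{\partial q}{\partial x}\,\delta x\right)^{2}
 + \left(\frac{\partial q}{\partial y}\,\delta y\right)^{2}
 + \left(\frac{\partial q}{\partial z}\,\delta z\right)^{2} + \cdots}
\]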
24
Benchmarking and Performance Prediction Uncertainty Analysis
- Overall goal: understand and accurately estimate uncertainties in benchmarking and performance prediction calculations
- Develop uncertainty equations from the analytical expressions used to calculate performance and price/performance
- Estimate uncertainties in the quantities used for these calculations
- Eventual goal: propagate uncertainties in performance predictions and benchmarking results to determine uncertainties in acquisition scoring
25
Propagation of Uncertainties in Benchmarking and Performance Modeling
[Flow diagram: benchmark times and benchmark performance feed a power-law least-squares fit; the resulting benchmark scores, combined with benchmark weights, give the average performance for each system; the optimizer then produces the total performance and price/performance for each solution set, leading to rank ordering and histograms of solution-set price/performance, with averaging over spans of solution sets.]
26
Uncertainties in Benchmark Times and Performance
Uncertainties in both benchmark times and benchmark performance are estimated either from replicated measurements or from the analytical performance prediction equation.
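A small sketch of the replicated-measurements route (the three timings are hypothetical): the standard error of the mean gives the absolute uncertainty in a benchmark time, and dividing by the mean gives the relative uncertainty of the kind quoted on the later slides.

```python
# Estimate the uncertainty in a benchmark time from replicated runs.
import numpy as np

times = np.array([4375.0, 4412.0, 4338.0])          # three runs of one test case (s)
mean = times.mean()
std_err = times.std(ddof=1) / np.sqrt(len(times))    # standard error of the mean
print(f"time = {mean:.0f} s +/- {std_err:.0f} s ({100 * std_err / mean:.1f}%)")
```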
27
Uncertainties in Performance via Power Law Fit
28
[Figure: log-log plot of performance versus number of processors (ln performance vs. ln n), showing the measured data points, the power-law fit, the standard performance level, and the number of processors required to reach the standard performance (n_STD).]
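A sketch of the fit depicted above, assuming the power-law form performance = a·n^b fit by least squares in log-log space; the data points are invented, and the derived n_STD is the processor count at which the fitted curve reaches the DoD standard performance.

```python
# Power-law fit of performance vs. processor count, and the derived n_STD.
import numpy as np

n = np.array([16, 32, 64, 128])                   # processor counts
perf = np.array([1.0, 1.9, 3.5, 6.2])             # measured performance (arbitrary units)

b, ln_a = np.polyfit(np.log(n), np.log(perf), 1)  # ln(perf) = b*ln(n) + ln(a)
a = np.exp(ln_a)

perf_std = 2.5                                    # DoD standard performance level (assumed)
n_std = (perf_std / a) ** (1.0 / b)               # processors needed to reach the standard
print(f"fit: perf ~ {a:.3f} * n^{b:.2f}; n_STD ~ {n_std:.0f}")
```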
29
Propagation of Uncertainties in Benchmarking and Performance Modeling
[The same flow diagram as before, now annotated with approximate uncertainties (4–5%, 2–5%, ~4%, ~3%) at various stages of the propagation.]
30
Architecture % Selection by Processor Quantity (Example)
31
Architecture % Selection by Processor Quantity for Varying Percentages Off the Best Price/Performance (Example)
32
Uncertainties in Performance Scores for Various Uncertainties in Benchmark Times (Example)

Quantity | Inherent Performance Uncertainties | 10% Performance Uncertainties | 20% Performance Uncertainties
Benchmark time or performance | 5% | 10% | 20%
Score of individual system on individual benchmark | 4% | 11% | 21%
Average performance of an individual system | 4% | 6% | 10%
Total score of all selected systems in a solution set | 1% | 3% | 7%
Price/performance score of all selected systems in a solution set (1) | 8% | 9% | 11%

(1) Assigns an 8% uncertainty in life-cycle cost
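The table above comes from analytic propagation; a Monte Carlo check is another way to see how benchmark-time uncertainty shrinks when scores are averaged over benchmarks. The sketch below is purely illustrative: the reference and system times are taken loosely from the later results tables, the 10% relative uncertainty and the score definition are assumptions, and the real HPCMP analysis used the analytic formulas.

```python
# Monte Carlo propagation of a 10% benchmark-time uncertainty into an average score.
import numpy as np

rng = np.random.default_rng(0)
ref_times = np.array([8601.0, 7129.0])        # standard-system times, two benchmarks
sys_times = np.array([3334.0, 3594.0])        # offered-system times
rel_unc = 0.10                                # assumed 10% relative uncertainty

samples = rng.normal(sys_times, rel_unc * sys_times, size=(100_000, 2))
scores = (ref_times / samples).mean(axis=1)   # average score over the two benchmarks
print(f"average score: {scores.mean():.2f} +/- {scores.std():.2f} "
      f"({100 * scores.std() / scores.mean():.1f}%)")
```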
33
Performance Modeling and Prediction Goals
- Enable informed purchasing decisions in support of TI-XX
- Develop an understanding of our key application codes for the purpose of guiding code developers and users toward more efficient applications (Where are the code/system bottlenecks?)
- Replace the current application benchmark suite with a judicious choice of synthetic benchmarks that could be used to predict the performance of any HPC architecture on the program’s key applications
34
Benchmarks
Today:
- Dedicated applications: larger weight; real codes; representative data sets
- Synthetic benchmarks: smaller weight; future look; focus on key machine features
Tomorrow:
- Synthetic benchmarks: 100% weight; coordinated to application “signatures”
- Performance on real codes accurately predicted from synthetic benchmark results
- Supported by genuine “signature” databases
The next 1–2 years are key: we must prove that synthetic benchmarks and application “signatures” can be coordinated.
35
Potential Future Impact of Performance Modeling and Prediction
Benchmarking has real impact:
- Over $160M in decisions over the last 4 years
- Hundreds of millions of dollars in decisions over the next decade
Synthetic performance coordinated to application signatures is the next huge step. Make it happen!
36
The Performance Prediction Framework
Parallel performance has two major factors:
- Single-processor performance
- Interprocessor communications performance
Two major components of the framework:
- Single-processor model: models the application’s performance between communication events (floating-point performance and memory access)
- Communications model (network simulator): models the application’s communication events (measures full MPI latency and bandwidth)
37
The Performance Prediction Framework
Both models are based on simplicity and isolation:
- Simplicity: start simple and add complexity only when needed to explain behavior
- Isolation: collect each piece of the performance framework in isolation, then combine the pieces for the performance prediction
38
Components of the Performance Prediction Framework
- Machine profile: characterizations of the rates at which a machine can (or is projected to) carry out fundamental operations, abstracted from any particular application
- Application signature: detailed summaries of the fundamental operations to be carried out by the application, independent of any particular machine
- Convolution method: an algebraic mapping of the application signature onto the machine profile to calculate a performance prediction; this is how the machine profile and application signature are combined
39
Components of the Performance Prediction Framework
[Diagram: performance prediction of Application B on Machine A, combining a single-processor model and a communication model.
- Single-processor model: a machine profile characterizing the memory performance capabilities of Machine A, an application signature characterizing the memory operations to be performed by Application B, and a convolution method mapping the memory usage needs of Application B onto the capabilities of Machine A.
- Communication model: a machine profile characterizing the network performance capabilities of Machine A, an application signature characterizing the network operations to be performed by Application B, and a convolution method mapping the network usage needs of Application B onto the capabilities of Machine A.
Together the two models yield the parallel processor prediction.]
40
MAPS Data
MAPS is a memory bandwidth benchmark that measures memory rates (MB/s) for different levels of cache (L1, L2, L3, main memory) and different access patterns (stride-one and random).
[Plot: MAPS bandwidth curves for stride-one access in L1 cache, stride-one access across L1/L2 cache, random access across L1/L2 cache, and stride-one access in L2 cache/main memory.]
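For intuition only, here is a rough probe in the same spirit: time stride-one versus random gathers over working sets of different sizes and report apparent bandwidth. Pure Python/NumPy timing is only indicative; the real MAPS benchmark is compiled code with carefully controlled strides and working-set sizes spanning each cache level.

```python
# Toy MAPS-style probe: stride-one vs. random gather bandwidth at several sizes.
import time
import numpy as np

def bandwidth_mb_s(n_elems, random_access, repeats=10):
    data = np.ones(n_elems)                          # 8 bytes per element
    idx = np.random.permutation(n_elems) if random_access else np.arange(n_elems)
    t0 = time.perf_counter()
    for _ in range(repeats):
        s = data[idx].sum()                          # gather then reduce
    elapsed = time.perf_counter() - t0
    return repeats * n_elems * 8 / elapsed / 1e6     # MB/s moved through the gather

for n in (1 << 14, 1 << 20, 1 << 24):                # roughly cache-sized to main memory
    print(n, "elements  stride-one:", round(bandwidth_mb_s(n, False)),
          "MB/s  random:", round(bandwidth_mb_s(n, True)), "MB/s")
```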
41
Application Signature
- A trace of the operations performed on the processor by an application (memory and floating-point operations)
- Sample, in the format "basic-block #: # memory refs, type, hit rates, access stride":
  BB#202: 2.0E9, load, 99%, 100%, stride-one
  BB#202: 1.9E3, FP
  BB#303: 2.2E10, load, 52%, 63%, random
  BB#303: 1.1E2, FP
- The trace of the application is collected and processed by the MetaSim Tracer
- Cache hit rates for the PREDICTED MACHINE are produced for each basic block of the application; this additional information requires processing by the MetaSim Tracer, not just straight memory tracing, hence the combination of the application signature and convolution components
42
Convolutions
Single-processor or per-processor performance combines:
- The machine profile for the processor (Machine A)
- The application signature for the application (App. #1)
The relative per-processor performance of App. #1 on Machine A is obtained by mapping the application signature onto the machine profile (a hedged sketch follows the table below).
MetaSim trace collected on Cobalt 60 simulating the SC45 memory structure:

Basic-Block Number | Procedure Name | # Memory References | L1 hit rate | L2 hit rate | Random ratio | Memory Bandwidth (MB/s)
5247 | Walldst | 2.22E11 | 97.28 | 99.99 | 0.00 | 8851
10729 | Poorgrd | 4.90E08 | 88.97 | 92.29 | 0.20 | 1327
8649 | Ucm6 | 1.81E10 | 92.01 | 97.07 | 0.23 | 572
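A hedged sketch of the convolution step using the three basic blocks from the table: the estimated memory time per block is (memory references x word size) / achievable bandwidth, where the bandwidth column already reflects the predicted machine's memory hierarchy for that block's hit rates and access pattern. The 8-byte word size and the simple sum-of-block-times model are assumptions, not the published MetaSim formulation.

```python
# Per-processor memory-time estimate from the traced basic blocks above.
blocks = [
    # (basic block, procedure, memory refs, bandwidth MB/s from MAPS-style curves)
    (5247,  "Walldst", 2.22e11, 8851.0),
    (10729, "Poorgrd", 4.90e8,  1327.0),
    (8649,  "Ucm6",    1.81e10,  572.0),
]

total_s = 0.0
for bb, proc, refs, bw_mb_s in blocks:
    t = refs * 8 / (bw_mb_s * 1e6)          # seconds spent moving data in this block
    total_s += t
    print(f"BB#{bb} ({proc}): {t:,.0f} s")
print(f"estimated per-processor memory time: {total_s:,.0f} s")
```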
43
Results-Predictions for AVUS (Cobalt60)

System | Actual time (s) | Predicted time (s) | % Error
NAVO IBM PWR3 (Habu) | 8,601 | 11,180 | +30%
ARL IBM PWR3 (Brainerd) | 10,675 | 10,385 | -3%
MHPCC IBM PWR3 (Tempest) | 8,354 | 9,488 | +14%
MHPCC IBM PWR4 (Hurricane) | 4,932 | 4,258 | -14%
NAVO IBM PWR4 (Marcellus) | 4,375 | 4,445 | +2%
NAVO IBM PWR4+ (Romulus) | 3,272 | 3,239 | -1%
ASC HP SC45 | 3,334 | 2,688 | -19%

Only a single time is given for ARL IBM PWR4 (Shelton), 4,456 s, and for the ARL Linux Networx Xeon Cluster, 3,459 s.
AVUS TI-05 standard data set on 64 CPUs
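A quick check of the error convention used in these tables, with the Habu row above: the % error is (predicted - actual) / actual.

```python
# Verify the sign convention for % error in the prediction tables.
actual, predicted = 8601.0, 11180.0
print(f"{100 * (predicted - actual) / actual:+.0f}%")   # -> +30%
```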
44
Results-Predictions for AVUS (Cobalt60)

System | Actual time (s) | Predicted time (s) | % Error
NAVO IBM PWR3 (Habu) | 18,195 | 23,513 | +29%
MHPCC IBM PWR3 (Tempest) | 15,051 | 19,907 | +32%
MHPCC IBM PWR4 (Hurricane) | 10,286 | 8,837 | -14%
NAVO IBM PWR4 (Marcellus) | 9,299 | 9,358 | +1%
NAVO IBM PWR4+ (Romulus) | 7,060 | 6,552 | -7%
ASC HP SC45 | 6,993 | 5,907 | -16%

Only a single time is given for ARL IBM PWR3 (Brainerd), 21,819 s; ARL IBM PWR4 (Shelton), 8,625 s; and the ARL Linux Networx Xeon Cluster, 7,063 s.
AVUS TI-05 standard data set on 32 CPUs
45
Results-Predictions for AVUS (Cobalt60)

System | Actual time (s) | Predicted time (s) | % Error
NAVO IBM PWR3 (Habu) | 3,870 | 5,357 | +38%
MHPCC IBM PWR3 (Tempest) | 3,779 | 4,548 | +20%
MHPCC IBM PWR4 (Hurricane) | 2,368 | 2,075 | -12%
NAVO IBM PWR4 (Marcellus) | 2,038 | 2,155 | +6%
NAVO IBM PWR4+ (Romulus) | 1,518 | 1,590 | -5%
ASC HP SC45 | 1,617 | 1,302 | -19%

Only a single time is given for ARL IBM PWR3 (Brainerd), 4,970 s; ARL IBM PWR4 (Shelton), 1,935 s; and the ARL Linux Networx Xeon Cluster, 1,728 s.
AVUS TI-05 standard data set on 128 CPUs
46
Results-Predictions for HYCOM

System | Actual time (s) | Predicted time (s) | % Error
NAVO IBM PWR3 (Habu) | 7,129 | 7,319 | +3%
MHPCC IBM PWR3 (Tempest) | 10,453 | 6,512 | -38%
MHPCC IBM PWR4 (Hurricane) | 3,532 | 2,804 | -21%
NAVO IBM PWR4 (Marcellus) | 3,364 | 3,404 | +1%
NAVO IBM PWR4+ (Romulus) | 2,231 | 2,061 | -8%
ASC HP SC45 | 3,594 | 2,358 | -34%

Only a single time is given for ARL IBM PWR3 (Brainerd), 6,976 s; ARL IBM PWR4 (Shelton), 2,585 s; and the ARL Linux Networx Xeon Cluster, 3,305 s.
HYCOM TI-05 standard data set on 59 CPUs
47
Results-Predictions for HYCOM

System | Actual time (s) | Predicted time (s) | % Error
NAVO IBM PWR3 (Habu) | 4,420 | 4,977 | +13%
MHPCC IBM PWR3 (Tempest) | 3,912 | 4,081 | +4%
MHPCC IBM PWR4 (Hurricane) | 2,939 | 2,249 | -23%
NAVO IBM PWR4 (Marcellus) | 2,472 | 2,663 | +8%
NAVO IBM PWR4+ (Romulus) | 1,409 | 1,457 | +3%
ASC HP SC45 | 2,469 | 1,558 | -37%

Only a single time is given for ARL IBM PWR3 (Brainerd), 4,425 s; ARL IBM PWR4 (Shelton), 1,675 s; and the ARL Linux Networx Xeon Cluster, 2,133 s.
HYCOM TI-05 standard data set on 96 CPUs
48
Results-Predictions for HYCOM

System | Actual time (s) | Predicted time (s) | % Error
NAVO IBM PWR3 (Habu) | 3,348 | 4,059 | +21%
MHPCC IBM PWR3 (Tempest) | 2,992 | 3,360 | +12%
MHPCC IBM PWR4 (Hurricane) | 2,661 | 2,118 | -20%
NAVO IBM PWR4 (Marcellus) | 2,031 | 2,360 | +16%
NAVO IBM PWR4+ (Romulus) | 1,103 | 1,185 | +7%
ASC HP SC45 | 1,949 | 1,558 | -20%

Only a single time is given for ARL IBM PWR3 (Brainerd), 3,691 s, and for the ARL Linux Networx Xeon Cluster, 1,746 s; no time is given for ARL IBM PWR4 (Shelton).
HYCOM TI-05 standard data set on 124 CPUs
49
Uncertainties in Performance Scores for Various Uncertainties in Benchmark Times (Example)

Quantity | Inherent Performance Uncertainties | 10% Performance Uncertainties | 20% Performance Uncertainties
Benchmark time or performance | 5% | 10% | 20%
Score of individual system on individual benchmark | 4% | 11% | 21%
Average performance of an individual system | 4% | 6% | 10%
Total score of all selected systems in a solution set | 1% | 3% | 7%
Price/performance score of all selected systems in a solution set (1) | 8% | 9% | 11%

(1) Assigns an 8% uncertainty in life-cycle cost
50
Results — Sensitivity Study of HYCOM: Investigation of “Processor” Performance Effects
- Base case is the performance of Habu (IBM PWR3)
- Four-fold improvements in floating-point performance: no impact on run time!
- Two-fold improvements in memory bandwidth/latency: the increase in main-memory performance drives the improved performance!
HYCOM run on 59 CPUs with the TI-04 standard data set
51
Conclusions: The Takeaway Message
- Careful, systematic evaluation of performance on real application benchmarks is a major factor in system acquisition decisions for DoD
- Evaluation of the overall price/performance of a complete set of systems (new systems plus the existing set) is necessary to optimize workload performance
- Consideration of uncertainties in performance scores is important in constructing reliable information for acquisition decisions
- The performance prediction methodology shows significant promise for simplifying the applications benchmarking process for HPC system vendors in the near future