
1 Making HPC System Acquisition Decisions Is an HPC Application
Larry P. Davis and Cray J. Henry, Department of Defense High Performance Computing Modernization Program Office
Roy L. Campbell, Jr., U.S. Army Research Laboratory
William Ward, U.S. Army Engineer Research and Development Center
Allan Snavely and Laura Carrington, University of California at San Diego
November 2004
Department of Defense High Performance Computing Modernization Program

2 Overview
- Program Background
- Acquisition Methodology
  - Process
  - Benchmarks
  - Performance and price/performance scoring
  - System selection and optimization of program workload
  - Uncertainty analysis
- Performance Prediction and Analysis

3 An Exciting Time for Benchmarking and Performance Modeling!
- DOE PERC program
- DoD benchmarking and performance modeling activities
- DARPA HPCS Productivity Team benchmarking activities
- HEC Revitalization Task Force Report
- Joint Federal Agency benchmarking and acquisitions
- Federal large benchmarking study (Federal Agencies, HPC User Forum, IDC)

4 (figure-only slide)

5 Current User Base and Requirements
- 561 projects and 4,572 users at approximately 179 sites
- Requirements categorized in 10 Computational Technology Areas (CTAs)
- FY 2005 non-real-time requirements of 260 teraFLOPS-years
Users by CTA (as of November 2004): CSM – 507, CFD – 1,135, CCM – 235, CEA – 304, CWO – 231, SIP – 435, FMS – 889, EQM – 170, IMT – 568, CEN – 38; 60 users are self-characterized as "other"

6 (figure-only slide)

7 Technology Insertion (TI) HPC System Acquisition Process
- Annual process to purchase high performance computing capability for major shared resource centers (MSRCs) and allocated distributed centers
- Total funding of $35M–$60M (~$50M in FY 2005)
- Two of the four major shared resource centers are provisioned each year on a rotating basis
- The TI-04 process upgraded HPC capabilities at the Army Research Laboratory and Naval Oceanographic Office MSRCs
- The TI-05 process will upgrade HPC capabilities at the Aeronautical Systems Center and Engineer Research and Development Center MSRCs

8 Technology Insertion 2005 (TI-05) Acquisition Process
- Assess computational requirements
- Determine application benchmarks and their weights
- Develop the acquisition process and evaluation criteria using GSA as acquisition agent
- Execute Phase I RFQ and evaluation: identification of promising HPC systems
- Execute Phase II RFQ and evaluation: construct best solution sets of systems
- Purchase the best overall solution set through GSA

9 TI-05 Evaluation Criteria
- Quantitative
  - Performance
    - Price/Performance
    - Raw Performance
- Qualitative
  - Usability
    - User Criteria
    - Center Criteria
  - Confidence/Past Performance
  - Benchmarks (subset)

10 Types of Benchmark Codes
- Synthetic codes
  - Basic hardware and system performance tests
  - Meant to determine expected future performance and to serve as a surrogate for workload not represented by the application codes
  - Scalable, quantitative synthetic tests are used for evaluation by the Performance Team; others are used as system performance checks and for qualitative evaluation by the Usability Team
  - A subset of synthetic tests needed for performance modeling is required
- Application codes
  - Actual application codes as determined by requirements and usage
  - Meant to indicate current performance
  - Each application code (except two) has two test cases: standard and large

11 TI-05 Synthetic Benchmark Codes
- I/O tests: include a simplified streaming test and a scalable I/O test
- Operating system tests: measure the performance of system calls, interprocessor communication, and TCP scalability (now includes IPv4 and IPv6)
- Memory tests: measure memory hierarchy performance, such as memory bandwidth (now includes multiple memory performance curves based on the fraction of random strides in memory access)
- Network tests: a set of five MPI tests (point-to-point, broadcast, allreduce)
- CPU tests: exercise multiple fundamental computation kernels, BLAS routines, and ScaLAPACK routines
- PMaC machine probes: exercise basic system functions for use in performance predictions (included in the memory tests, network tests, and streaming I/O test)

12 TI-05 Application Benchmark Codes
- Aero: aeroelasticity CFD code (Fortran, serial vector, 15,000 lines of code)
- AVUS (Cobalt-60): turbulent flow CFD code (Fortran, MPI, 19,000 lines of code)
- GAMESS: quantum chemistry code (Fortran, MPI, 330,000 lines of code)
- HYCOM: ocean circulation modeling code (Fortran, MPI, 31,000 lines of code)
- OOCore: out-of-core solver (Fortran, MPI, 39,000 lines of code)
- RFCTH2: shock physics code (~43% Fortran/~57% C, MPI, 436,000 lines of code)
- WRF: multi-agency mesoscale atmospheric modeling code (Fortran and C, MPI, 100,000 lines of code)
- Overflow-2: CFD code originally developed by NASA (Fortran 90, MPI, 83,000 lines of code)

13 Basic Rules for Application Benchmarks: Emphasis on Performance
- Establish a DoD standard benchmark time for each application benchmark case; the NAVO IBM Regatta P4 was chosen as the standard DoD system
- Benchmark timings (at least three on each test case) are requested for systems that meet or beat the DoD standard benchmark times by at least a factor of two (preferably four)
- Benchmark timings may be extrapolated provided they are guaranteed, but at least one actual timing on the offered or a closely related system must be provided

14 Benchmark Scoring
- Benchmark scoring has two major components: application codes and synthetic codes
- Not all application codes need to be run, but the more a vendor runs, the greater its opportunity to be part of the final mix
- Quantitatively scored synthetic tests are evaluated in a fashion consistent with the application tests
- Vendors are required to run a load-mix test in response to the Phase II RFQ
- Application code scores carry greater weight than synthetic code scores in determining the price/performance score
- It is essential that results be provided on all required synthetic tests, and very important on the other tests
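
The slides state only that application codes carry a larger weight than synthetic codes in the price/performance score; they do not give the formula. The toy sketch below illustrates one way such a weighted combination could look, with placeholder weights and scores that are not program data.

```python
def combined_benchmark_score(app_scores, syn_scores, w_app=0.8, w_syn=0.2):
    """Weighted combination of application and synthetic benchmark scores.
    The 0.8/0.2 split is a placeholder; the slides say only that application
    codes carry the larger weight, not what the actual weights are."""
    app = sum(app_scores) / len(app_scores)
    syn = sum(syn_scores) / len(syn_scores)
    return w_app * app + w_syn * syn

# Illustrative per-benchmark scores relative to the DoD standard system.
print(combined_benchmark_score([1.8, 2.3, 2.0], [1.5, 1.9]))
```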

15 Use of Benchmark Data to Score Systems and Construct Alternatives
- Determine workload percentages by CTA
- Partition CTA percentages among benchmark test cases
- Consider all alternatives that meet total cost constraints
- Using benchmark scores, maximize the workload for each alternative, subject to the constraint of matching the required CTA percentages
- Determine the price/performance score for each alternative and rank order

16 HPCMP System Performance (Unclassified)
(Chart; the label on each system indicates the number of application test cases not included, out of 13 total.)

17 How the Optimizer Works: Problem Description
- Known: the application score matrix (machines × application test cases), machine prices, budget limits (low and high), the overall desired workload distribution across application test cases, and the allowed distribution deviation
- Unknown: the workload distribution matrix (machines × application test cases) and the optimal quantity set (how many of each machine to buy)

18 Problem Description
- Offered systems: quantity is variable; workload allocation is variable
- Existing systems: quantity is fixed; workload allocation is variable

19 Motivation
- Primary goal: find the solution set with optimal (minimum) price/performance, as well as solution sets with price/performance within X% of the optimum
- Secondary goal: determine the optimal allocation of work for each application test case per machine

20 Optimization Scheme
- Fix the quantity of each machine
- Mark quantity combinations that fall within the acquisition price range (viable options)
- Score each viable option (via a SIMPLEX optimization kernel)
- Divide life-cycle cost (acquisition price, maintenance, power, and any costs over and above normal operations) by total performance
- Rank results in ascending order
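
The deck does not give the optimizer's exact formulation, but the workload-allocation step it describes (maximize delivered work subject to machine capacity and to staying within the allowed deviation of the desired workload distribution) can be posed as a linear program. The sketch below is a minimal illustration using SciPy's linprog (HiGHS) in place of the SIMPLEX kernel the slides mention; the score matrix, target distribution, and deviation are made-up example values, not program data.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical example data (not program data):
# score[i, j] = work delivered per unit time when machine i runs test case j
score = np.array([[4.0, 2.0, 1.0],     # machine 0
                  [1.5, 3.0, 2.5],     # machine 1
                  [2.0, 2.0, 2.0]])    # machine 2
n_mach, n_case = score.shape
target = np.array([0.5, 0.3, 0.2])     # desired workload distribution by test case
dev = 0.05                             # allowed deviation from the target distribution

# Decision variables x[i, j] = fraction of machine i's time spent on test case j.
# Objective: maximize total delivered work  sum_ij score[i, j] * x[i, j].
c = -score.flatten()                   # linprog minimizes, so negate

A_ub, b_ub = [], []
# Each machine's time fractions sum to at most 1.
for i in range(n_mach):
    row = np.zeros(n_mach * n_case)
    row[i * n_case:(i + 1) * n_case] = 1.0
    A_ub.append(row); b_ub.append(1.0)

# Work on test case j must stay within (target_j +/- dev) of total delivered work:
#   sum_i s[i,j] x[i,j] - (target_j + dev) * total <= 0
#  -sum_i s[i,j] x[i,j] + (target_j - dev) * total <= 0
total_coeff = score.flatten()
for j in range(n_case):
    case_coeff = np.zeros(n_mach * n_case)
    for i in range(n_mach):
        case_coeff[i * n_case + j] = score[i, j]
    A_ub.append(case_coeff - (target[j] + dev) * total_coeff); b_ub.append(0.0)
    A_ub.append(-case_coeff + (target[j] - dev) * total_coeff); b_ub.append(0.0)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(0, 1)] * (n_mach * n_case), method="highs")
print("total delivered work for this candidate set:", round(-res.fun, 3))
```

In the full scheme this inner allocation problem would be solved once per viable machine-quantity combination, and the resulting total performance divided into life-cycle cost to rank the alternatives.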

21 Architecture % Selection by Processor Quantity (Example)

22 Should We Do Uncertainty Analysis?

23 Performance Modeling Uncertainty Analysis
- Assumption: uncertainties in measured performance values can be treated as uncertainties in measurements of physical quantities
- For small, random uncertainties in measured values x, y, z, …, the uncertainty in a calculated function q(x, y, z, …) can be expressed with the standard propagation-of-uncertainty formula:
  $\delta q = \sqrt{\left(\frac{\partial q}{\partial x}\,\delta x\right)^{2} + \left(\frac{\partial q}{\partial y}\,\delta y\right)^{2} + \left(\frac{\partial q}{\partial z}\,\delta z\right)^{2} + \cdots}$
- Systematic errors need careful consideration since they cannot be calculated analytically
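
As a concrete, purely illustrative companion to the formula above, the sketch below propagates small, independent uncertainties through an arbitrary function by estimating the partial derivatives numerically; the example function and input uncertainties are made up.

```python
import math

def propagate_uncertainty(q, values, sigmas, h=1e-6):
    """Propagate small, independent uncertainties through q(*values)
    using numerically estimated partial derivatives (central differences)."""
    var = 0.0
    for k, (v, s) in enumerate(zip(values, sigmas)):
        hi, lo = list(values), list(values)
        hi[k] = v + h * max(abs(v), 1.0)
        lo[k] = v - h * max(abs(v), 1.0)
        dq_dk = (q(*hi) - q(*lo)) / (hi[k] - lo[k])
        var += (dq_dk * s) ** 2
    return math.sqrt(var)

# Illustrative example: performance = work / time, with 2% work and 5% time uncertainty.
perf = lambda work, time: work / time
print(propagate_uncertainty(perf, [1000.0, 50.0], [20.0, 2.5]))  # ~1.08, i.e. ~5.4% of 20
```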

24 Benchmarking and Performance Prediction Uncertainty Analysis
- Overall goal: understand and accurately estimate uncertainties in benchmarking and performance prediction calculations
  - Develop uncertainty equations from the analytical expressions used to calculate performance and price/performance
  - Estimate uncertainties in the quantities used for these calculations
- Eventual goal: propagate uncertainties in performance predictions and benchmarking results to determine uncertainties in acquisition scoring

25 Power Law Propagation of Uncertainties in Benchmarking and Performance Modeling
(Flow diagram: benchmark times → benchmark performance → least-squares power-law fit → average performance for each system → benchmark scores and benchmark weights → optimizer → total performance for the solution set → price/performance for the solution set → averaging over spans of solution sets → rank ordering and histograms of solution sets.)

26 Uncertainties in Benchmark Times and Performance
Uncertainties in benchmark times and benchmark performance are estimated from replicated measurements or from an analytical performance prediction equation.

27 Uncertainties in Performance via Power Law Fit

28 (Figure: ln performance versus ln [number of processors (n)], showing the data points, the power-law fit, the standard performance level, and the number of processors required to reach standard performance (n_STD).)
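
The fit in the figure amounts to fitting performance = a·n^b by least squares in log space and then solving for the processor count n_STD at which the fitted curve reaches the standard performance. A minimal sketch follows, using made-up timings rather than program benchmark data.

```python
import numpy as np

# Illustrative (made-up) measurements: processor counts and measured performance.
n = np.array([16, 32, 64, 128])
perf = np.array([10.0, 18.5, 34.0, 60.0])

# Least-squares power-law fit: ln(perf) = ln(a) + b * ln(n).
b, ln_a = np.polyfit(np.log(n), np.log(perf), 1)
a = np.exp(ln_a)

# Number of processors needed to reach a given standard performance level.
perf_std = 25.0
n_std = (perf_std / a) ** (1.0 / b)
print(f"power-law exponent b = {b:.3f}, n_STD = {n_std:.1f}")
```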

29 Power Law Propagation of Uncertainties in Benchmarking and Performance Modeling
(Same flow diagram as slide 25, annotated with representative uncertainty magnitudes at successive stages: 4–5%, 2–5%, ~4%, and ~3%.)

30 Architecture % Selection by Processor Quantity (Example)

31 Architecture % Selection by Processor Quantity for Varying Percentages Off the Best Price/Performance (Example)

32 Uncertainties in Performance Scores for Various Uncertainties in Benchmark Times (Example)

Quantity | Inherent performance uncertainties | 10% performance uncertainties | 20% performance uncertainties
Benchmark time or performance | 5% | 10% | 20%
Score of an individual system on an individual benchmark | 4% | 11% | 21%
Average performance of an individual system | 4% | 6% | 10%
Total score of all selected systems in a solution set | 1% | 3% | 7%
Price/performance score of all selected systems in a solution set (1) | 8% | 9% | 11%

(1) Assigns an 8% uncertainty in life-cycle cost.
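
The price/performance row above appears consistent with combining the life-cycle-cost and total-performance uncertainties in quadrature, per the propagation formula on slide 23. For a ratio q = C/P (life-cycle cost over total performance) with independent errors:

```latex
\frac{\delta q}{q} = \sqrt{\left(\frac{\delta C}{C}\right)^{2} + \left(\frac{\delta P}{P}\right)^{2}}
% With the footnoted 8% cost uncertainty and 1%, 3%, or 7% total-performance uncertainty:
%   sqrt(0.08^2 + 0.01^2) ~ 8.1%,   sqrt(0.08^2 + 0.03^2) ~ 8.5%,   sqrt(0.08^2 + 0.07^2) ~ 10.6%
```

These values are close to the tabulated 8%, 9%, and 11%, which is a rough check rather than the deck's stated derivation.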

33 Performance Modeling and Prediction Goals
- Enable informed purchasing decisions in support of TI-XX
- Develop an understanding of our key application codes for the purpose of guiding code developers and users toward more efficient applications (Where are the code/system bottlenecks?)
- Replace the current application benchmark suite with a judicious choice of synthetic benchmarks that could be used to predict the performance of any HPC architecture on the program's key applications

34 Benchmarks: Today and Tomorrow
Today:
- Dedicated applications: larger weight; real codes; representative data sets
- Synthetic benchmarks: smaller weight; future look; focus on key machine features
Tomorrow:
- Synthetic benchmarks: 100% weight; coordinated to application "signatures"; performance on real codes accurately predicted from synthetic benchmark results; supported by genuine "signature" databases
The next 1–2 years are key: we must prove that synthetic benchmarks and application "signatures" can be coordinated.

35 Potential Future Impact of Performance Modeling and Prediction
Benchmarking has real impact:
- Over $160M in decisions over the last 4 years
- Hundreds of millions of dollars in decisions over the next decade
Coordinating synthetic benchmark performance with application signatures is the next huge step. Make it happen!

36 The Performance Prediction Framework
- Parallel performance has two major factors:
  - Single-processor performance
  - Interprocessor communications performance
- The framework has two major components:
  - Single-processor model: models the application's performance between communication events (floating-point performance and memory access)
  - Communications model (network simulator): models the application's communication events (measures full MPI latency and bandwidth)

37 The Performance Prediction Framework
- Both models are based on simplicity and isolation:
  - Simplicity: start simple and add complexity only when needed to explain behavior
  - Isolation: collect each piece of the performance framework in isolation, then combine the pieces for a performance prediction

38 Components of the Performance Prediction Framework
- Machine profile: characterizations of the rates at which a machine can (or is projected to) carry out fundamental operations, abstracted from any particular application
- Application signature: detailed summaries of the fundamental operations to be carried out by the application, independent of any particular machine
- The two are combined using a convolution method: an algebraic mapping of the application signature onto the machine profile to calculate a performance prediction

39 Components of the Performance Prediction Framework
(Diagram: the single-processor model convolves a machine profile characterizing Machine A's memory performance capabilities with an application signature characterizing the memory operations Application B needs to perform; the communication model convolves a machine profile of Machine A's network performance capabilities with an application signature of Application B's network operations. Together the two convolutions yield a parallel performance prediction of Application B on Machine A.)

40 MAPS Data
MAPS is a memory bandwidth benchmark that measures memory rates (MB/s) for different levels of cache (L1, L2, L3, main memory) and different access patterns (stride-one and random).
(Chart regions shown: stride-one access in L1 cache, stride-one access in L1/L2 cache, random access in L1/L2 cache, and stride-one access in L2 cache/main memory.)
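
This is not the MAPS code itself, only a minimal sketch of the kind of measurement the slide describes: timing stride-one versus random access over working sets sized to land in different levels of the memory hierarchy. The buffer sizes and the bandwidth accounting are illustrative assumptions.

```python
import time
import numpy as np

def bandwidth_mb_s(n_elems, random_access=False, repeats=5):
    """Rough probe: sum a float64 array sequentially or via a random gather
    and report the bytes touched per second (MB/s)."""
    data = np.ones(n_elems)
    idx = np.random.permutation(n_elems) if random_access else None
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        _ = data[idx].sum() if random_access else data.sum()
        best = min(best, time.perf_counter() - t0)
    return n_elems * 8 / best / 1e6   # 8 bytes per element

# Working sets chosen to land roughly in cache vs. main memory (machine dependent).
for n in (4_096, 262_144, 33_554_432):
    print(f"{n:>10} elems  stride-one: {bandwidth_mb_s(n):10.0f} MB/s   "
          f"random: {bandwidth_mb_s(n, random_access=True):10.0f} MB/s")
```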

41 Application Signature
- A trace of the operations performed on the processor by an application (memory and floating-point operations on the processor)
- Sample, where the format is "basic-block #: # memory refs, type, hit rates, access stride":
  BB#202: 2.0E9, load, 99%, 100%, stride-one
  BB#202: 1.9E3, FP
  BB#303: 2.2E10, load, 52%, 63%, random
  BB#303: 1.1E2, FP
- The trace of the application is collected and processed by the MetaSim Tracer
- Cache hit rates for the predicted machine are produced for each basic block of the application; this additional information requires processing by the MetaSim Tracer, not just straight memory tracing, hence the combination of the application signature and convolution components
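
To make the sample format above concrete, here is a small, hypothetical parser for lines of that shape; the record field names and parsing choices are my own and not part of the MetaSim Tracer.

```python
import re

# Parses sample application-signature lines like:
#   BB#202: 2.0E9, load, 99%, 100%, stride-one
#   BB#202: 1.9E3, FP
LINE = re.compile(r"BB#(\d+):\s*([\d.E+]+),\s*(.+)")

def parse_signature(lines):
    records = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue
        block, count, rest = int(m.group(1)), float(m.group(2)), m.group(3)
        fields = [f.strip() for f in rest.split(",")]
        if fields[0] == "FP":
            records.append({"block": block, "fp_ops": count})
        else:  # memory reference record: type, hit rates, access stride
            records.append({"block": block, "mem_refs": count, "type": fields[0],
                            "hit_rates": fields[1:-1], "stride": fields[-1]})
    return records

sample = ["BB#202: 2.0E9, load, 99%, 100%, stride-one", "BB#202: 1.9E3, FP"]
print(parse_signature(sample))
```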

42 Convolutions
Single-processor or per-processor performance:
- Machine profile for the processor (Machine A)
- Application signature for the application (App. #1)
- The relative per-processor performance of App. #1 on Machine A is represented as a convolution of the two

MetaSim trace collected on Cobalt 60 simulating the SC45 memory structure:

Basic-Block Number | Procedure Name | # Memory References | L1 hit rate | L2 hit rate | Random ratio | Memory Bandwidth
5247 | Walldst | 2.22E11 | 97.28 | 99.99 | 0.00 | 8851
10729 | Poorgrd | 4.90E08 | 88.97 | 92.29 | 0.20 | 1327
8649 | Ucm6 | 1.81E10 | 92.01 | 97.07 | 0.23 | 572
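
A minimal sketch of the per-block arithmetic a memory convolution implies: weight each basic block's memory references by the bandwidth the predicted machine sustains for that block's access pattern, using the three rows of the trace table above. The bytes-per-reference value and the MB/s unit for the bandwidth column are assumptions on my part; the actual convolver handles more detail (floating-point work, overlap, and communication).

```python
# Rows from the MetaSim trace table above: (basic block, procedure, memory refs,
# L1 hit rate, L2 hit rate, random ratio, sustained memory bandwidth).
blocks = [
    (5247,  "Walldst", 2.22e11, 97.28, 99.99, 0.00, 8851.0),
    (10729, "Poorgrd", 4.90e08, 88.97, 92.29, 0.20, 1327.0),
    (8649,  "Ucm6",    1.81e10, 92.01, 97.07, 0.23,  572.0),
]

BYTES_PER_REF = 8.0   # assumed size of one memory reference (bytes); illustrative

def memory_time_seconds(refs, bandwidth_mb_s):
    """Time to move refs * BYTES_PER_REF bytes at the block's sustained bandwidth."""
    return refs * BYTES_PER_REF / (bandwidth_mb_s * 1e6)

total = 0.0
for bb, name, refs, l1, l2, rand_ratio, bw in blocks:
    t = memory_time_seconds(refs, bw)
    total += t
    print(f"BB#{bb} ({name}): {t:10.1f} s of memory traffic")
print(f"predicted memory-bound time for these blocks: {total:.1f} s")
```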

43 Results: Predictions for AVUS (Cobalt-60), TI-05 standard data set on 64 CPUs

System | Actual time (s) | Predicted time (s) | % Error
NAVO IBM PWR3 (Habu) | 8,601 | 11,180 | +30%
ARL IBM PWR3 (Brainerd) | 10,675 | 10,385 | -3%
MHPCC IBM PWR3 (Tempest) | 8,354 | 9,488 | +14%
MHPCC IBM PWR4 (Hurricane) | 4,932 | 4,258 | -14%
NAVO IBM PWR4 (Marcellus) | 4,375 | 4,445 | +2%
ARL IBM PWR4 (Shelton) | 4,456 | |
NAVO IBM PWR4+ (Romulus) | 3,272 | 3,239 | -1%
ASC HP SC45 | 3,334 | 2,688 | -19%
ARL Linux Networx Xeon Cluster | 3,459 | |

44 Results: Predictions for AVUS (Cobalt-60), TI-05 standard data set on 32 CPUs

System | Actual time (s) | Predicted time (s) | % Error
NAVO IBM PWR3 (Habu) | 18,195 | 23,513 | +29%
ARL IBM PWR3 (Brainerd) | 21,819 | |
MHPCC IBM PWR3 (Tempest) | 15,051 | 19,907 | +32%
MHPCC IBM PWR4 (Hurricane) | 10,286 | 8,837 | -14%
NAVO IBM PWR4 (Marcellus) | 9,299 | 9,358 | +1%
ARL IBM PWR4 (Shelton) | 8,625 | |
NAVO IBM PWR4+ (Romulus) | 7,060 | 6,552 | -7%
ASC HP SC45 | 6,993 | 5,907 | -16%
ARL Linux Networx Xeon Cluster | 7,063 | |

45 Results: Predictions for AVUS (Cobalt-60), TI-05 standard data set on 128 CPUs

System | Actual time (s) | Predicted time (s) | % Error
NAVO IBM PWR3 (Habu) | 3,870 | 5,357 | +38%
ARL IBM PWR3 (Brainerd) | 4,970 | |
MHPCC IBM PWR3 (Tempest) | 3,779 | 4,548 | +20%
MHPCC IBM PWR4 (Hurricane) | 2,368 | 2,075 | -12%
NAVO IBM PWR4 (Marcellus) | 2,038 | 2,155 | +6%
ARL IBM PWR4 (Shelton) | 1,935 | |
NAVO IBM PWR4+ (Romulus) | 1,518 | 1,590 | -5%
ASC HP SC45 | 1,617 | 1,302 | -19%
ARL Linux Networx Xeon Cluster | 1,728 | |

46 Results: Predictions for HYCOM, TI-05 standard data set on 59 CPUs

System | Actual time (s) | Predicted time (s) | % Error
NAVO IBM PWR3 (Habu) | 7,129 | 7,319 | +3%
ARL IBM PWR3 (Brainerd) | 6,976 | |
MHPCC IBM PWR3 (Tempest) | 10,453 | 6,512 | -38%
MHPCC IBM PWR4 (Hurricane) | 3,532 | 2,804 | -21%
NAVO IBM PWR4 (Marcellus) | 3,364 | 3,404 | +1%
ARL IBM PWR4 (Shelton) | 2,585 | |
NAVO IBM PWR4+ (Romulus) | 2,231 | 2,061 | -8%
ASC HP SC45 | 3,594 | 2,358 | -34%
ARL Linux Networx Xeon Cluster | 3,305 | |

47 Results: Predictions for HYCOM, TI-05 standard data set on 96 CPUs

System | Actual time (s) | Predicted time (s) | % Error
NAVO IBM PWR3 (Habu) | 4,420 | 4,977 | +13%
ARL IBM PWR3 (Brainerd) | 4,425 | |
MHPCC IBM PWR3 (Tempest) | 3,912 | 4,081 | +4%
MHPCC IBM PWR4 (Hurricane) | 2,939 | 2,249 | -23%
NAVO IBM PWR4 (Marcellus) | 2,472 | 2,663 | +8%
ARL IBM PWR4 (Shelton) | 1,675 | |
NAVO IBM PWR4+ (Romulus) | 1,409 | 1,457 | +3%
ASC HP SC45 | 2,469 | 1,558 | -37%
ARL Linux Networx Xeon Cluster | 2,133 | |

48 Results: Predictions for HYCOM, TI-05 standard data set on 124 CPUs

System | Actual time (s) | Predicted time (s) | % Error
NAVO IBM PWR3 (Habu) | 3,348 | 4,059 | +21%
ARL IBM PWR3 (Brainerd) | 3,691 | |
MHPCC IBM PWR3 (Tempest) | 2,992 | 3,360 | +12%
MHPCC IBM PWR4 (Hurricane) | 2,661 | 2,118 | -20%
NAVO IBM PWR4 (Marcellus) | 2,031 | 2,360 | +16%
ARL IBM PWR4 (Shelton) | | |
NAVO IBM PWR4+ (Romulus) | 1,103 | 1,185 | +7%
ASC HP SC45 | 1,949 | 1,558 | -20%
ARL Linux Networx Xeon Cluster | 1,746 | |

49 Uncertainties in Performance Scores for Various Uncertainties in Benchmark Times (Example)

Quantity | Inherent performance uncertainties | 10% performance uncertainties | 20% performance uncertainties
Benchmark time or performance | 5% | 10% | 20%
Score of an individual system on an individual benchmark | 4% | 11% | 21%
Average performance of an individual system | 4% | 6% | 10%
Total score of all selected systems in a solution set | 1% | 3% | 7%
Price/performance score of all selected systems in a solution set (1) | 8% | 9% | 11%

(1) Assigns an 8% uncertainty in life-cycle cost.

50 Results: Sensitivity Study of HYCOM
Investigation of "processor" performance effects (HYCOM run on 59 CPUs, TI-04 standard data set):
- Base case is the performance of Habu (IBM PWR3)
- Four-fold improvements in floating-point performance: no impact on run time!
- Two-fold improvements in memory bandwidth/latency: the increase in main-memory performance drives the improved performance!

51 Conclusions: The Takeaway Message
- Careful, systematic evaluation of performance on real application benchmarks is a major factor in DoD system acquisition decisions
- Evaluating the overall price/performance of the complete set of systems (new systems plus the existing set) is necessary to optimize workload performance
- Considering uncertainties in performance scores is important for constructing reliable information for acquisition decisions
- The performance prediction methodology shows significant promise for simplifying the application benchmarking process for HPC system vendors in the near future

