1 The Role of Advanced Technology Systems in the ASC Platform Strategy
Douglas Doerfler, Distinguished Member of Technical Staff
Sandia National Laboratories, Scalable Computer Architectures Department
SAND 2014-3174C, Unlimited Release
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP

2 Topics
- ASC ATS Computing Strategy
- Partnerships & the Joint Procurement Process
- Trinity Project Status
- Advanced Architecture Test Bed Project

3 ASC computing strategy
- Approach: Two classes of systems
  - Advanced Technology: First-of-a-kind systems that identify and foster technical capabilities and features that are beneficial to ASC applications
  - Commodity Technology: Robust, cost-effective systems to meet the day-to-day simulation workload needs of the program
- Investment Principles
  - Maintain continuity of production
  - Ensure that the needs of the current and future stockpile are met
  - Balance investments in system cost-performance types with computational requirements
  - Partner with industry to introduce new high-end technology constrained by life-cycle costs
  - Acquire right-sized platforms to meet the mission needs

4 Advanced Technology Systems
- Leadership-class platforms
- Pursue promising new technology paths with industry partners
- These systems are to meet unique mission needs and to help prepare the program for future system designs
- Includes Non-Recurring Engineering (NRE) funding to enable delivery of leading-edge platforms
- Trinity (ATS-1) will be deployed by ACES (New Mexico Alliance for Computing at Extreme Scale, i.e. Los Alamos & Sandia)
- ATS-2 will be deployed by LLNL

5 ASC Platform Timeline
[Figure: timeline by fiscal year, ‘12 through ‘21, marking System Delivery and the Dev. & Deploy, Use and Retire phases of each platform.]
- Advanced Technology Systems (ATS): Cielo (LANL/SNL); Sequoia (LLNL); ATS 1, Trinity (LANL/SNL); ATS 2 (LLNL); ATS 3 (LANL/SNL)
- Commodity Technology Systems (CTS): Tri-lab Linux Capacity Cluster II (TLCC II); CTS 1; CTS 2

6 The ACES partnership since 2008
- SNL/LANL MOU signed March 2008 to integrate and leverage capabilities
- Commitment to the shared development and use of HPC to meet NW mission needs
- Major efforts are executed by project teams chartered by and accountable to the ACES co-directors
  - Cielo delivered ca. 2011
  - Trinity delivery in 2015 is now our dominant focus
- Both Laboratories are fully committed to delivering a successful platform as it’s essential to the Laboratories

7 NNSA/ASC and SC/ASCR are partnering on RFPs
- Trinity/NERSC-8: ACES & LBL
- CORAL: LLNL, ORNL and ANL
- Strengthen the alliance between NNSA and SC on road to exascale
- Show vendors a more united path on road to exascale
- Shared technical expertise between labs
- Should gain cost benefit
- Saves vendors money/time responding to a single RFP, single set of technical requirements
- Outside perspective reduces risk: avoids tunnel vision by one lab
- More leverage with vendors by sharing information between labs
- Benefits in production, shared bug reports, quarterly meetings
- Less likely to be a one-off system with multiple sites participating

8 Why are NNSA/ASC and SC/ASCR collaborating?
- The April 2011 MOU between SC & NNSA for coordinating Exascale activities was the impetus for ASC and ASCR to work together on the proposed Exascale Computing Initiative (ECI).
- While ECI is yet to be realized, ASC & ASCR program directors made strategic decisions to co-fund and collaborate on:
  - Technology R&D Investments: FastForward and DesignForward
  - System Acquisitions: Trinity/NERSC-8 and CORAL
- Great leveraging opportunities to share precious resources (budget & technical expertise) to achieve each program’s mission goals, while working out some cultural/bureaucratic differences.
- The Trinity/NERSC-8 collaboration will proceed with a joint RFP and selection, separate system awards, attendance at each other’s project reviews, and collective problem solving.

9 Trinity & NERSC-8 are two separate projects resulting in two distinct contracts and systems
[Figure: process flow diagram. Labels: Trinity Mission Drivers and NERSC-8 Mission Drivers; Market surveys (one per project); Creating requirements; Release RFP; Joint Vendor Selection; Negotiations; Trinity Contract and NERSC-8 Contract; Trinity System and NERSC-8 System; Joint Quarterly Reviews, bug reports, collaboration on application transition, advanced options.]

10 Target System Configurations

Metric                                          Trinity                 NERSC-8
Memory capacity                                 2 PB to 4 PB            1 PB to 2 PB
Capability improvement                          8 to 10x over Cielo     8 to 10x over Hopper
Sustained System Performance (SSP)              20 to 60x over Hopper   10x to 30x over Hopper
JMTTI                                           > 24 hours              > 35 hours
File system BW metric: time to dump 80% RAM     20 mins                 30 mins
File system disk capacity                       > 30x main memory       > 20x main memory
Power                                           < 12+3 MW               < 6 MW
Off-platform I/O                                > 140 GB/s              > 180 GB/s
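
The file system bandwidth metric (time to dump 80% of RAM) implies a minimum sustained write rate for each memory capacity target. A minimal sketch of that arithmetic follows; it assumes decimal units (1 PB = 1000 TB) and that the dump must finish within the stated time. The function name is illustrative and does not come from the slides.

```python
def required_dump_bandwidth_tb_s(memory_pb, minutes, fraction=0.80):
    """Sustained write rate (TB/s) needed to dump `fraction` of RAM in `minutes`.

    Assumes decimal units: 1 PB = 1000 TB.
    """
    terabytes_to_write = memory_pb * 1000.0 * fraction
    return terabytes_to_write / (minutes * 60.0)


# Memory capacity ranges and dump times taken from the target configurations.
targets = {
    "Trinity": {"memory_pb": (2, 4), "minutes": 20},
    "NERSC-8": {"memory_pb": (1, 2), "minutes": 30},
}

for system, spec in targets.items():
    low, high = (required_dump_bandwidth_tb_s(m, spec["minutes"])
                 for m in spec["memory_pb"])
    print(f"{system}: {low:.2f} to {high:.2f} TB/s")
```

Under these assumptions the Trinity targets work out to roughly 1.3 to 2.7 TB/s and the NERSC-8 targets to roughly 0.4 to 0.9 TB/s.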

11 ATS Components of Trinity
- Application transition
  - MPI + (threads, vectors, data locality & allocation)
  - Center of Excellence key technology providers
- Tightly integrating non-volatile memory, i.e. Burst Buffer
  - Checkpoint/restart
  - In-situ data analysis
- Advanced power management
  - Better understand …
  - And then control …
  - the power usage characteristics at the platform and application level

12 Focus Area: Active Power Management
- Need to understand power at the platform & application level
  - Policy driven
  - Weighted combination of performance & energy
  - Energy caps based on time of day, physical capacity, etc.
- Need to understand & control power
  - Cabinet & component level I & V measurements
  - Scalable collection infrastructure
  - Tunable collection fidelity: cabinets to components
  - Administrative & user accessible interface for feedback and tuning
- P-states (Frequency/Voltage States)
  - P1: 2.1 GHz, 1.25 V
  - P2: 1.7 GHz, 1.1625 V
  - P3: 1.4 GHz, 1.125 V
  - P4: 1.1 GHz, 1.1 V
- AMG demonstration on 6,144 nodes of ORNL’s Jaguar shows that managing P-States allows for a 32% decrease in energy used while only increasing time to solution by 7.5%
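
The energy and time figures quoted for the AMG demonstration can be turned into an implied average power level, and the listed P-states can be compared with a simple scaling rule. The sketch below is illustrative only: it assumes the textbook CMOS approximation that dynamic power scales as f * V^2, which ignores leakage and every non-CPU component, and the function names are hypothetical rather than part of any tool described in the slides.

```python
# P-states from the slide: name -> (frequency in GHz, voltage in V).
P_STATES = {
    "P1": (2.1, 1.25),
    "P2": (1.7, 1.1625),
    "P3": (1.4, 1.125),
    "P4": (1.1, 1.1),
}


def relative_dynamic_power(state, baseline="P1"):
    """Dynamic CPU power relative to `baseline`, assuming P ~ f * V**2."""
    f, v = P_STATES[state]
    f0, v0 = P_STATES[baseline]
    return (f * v ** 2) / (f0 * v0 ** 2)


def average_power_ratio(energy_change, time_change):
    """Average power ratio implied by fractional energy and time changes.

    Energy = average power * time, so the power ratio is the energy ratio
    divided by the time ratio.
    """
    return (1.0 + energy_change) / (1.0 + time_change)


for name in P_STATES:
    print(f"{name}: {relative_dynamic_power(name):.2f}x P1 dynamic power")

# AMG on Jaguar: 32% less energy, 7.5% longer time to solution.
print(f"Implied average power: {average_power_ratio(-0.32, 0.075):.2f}x baseline")
```

The quoted result corresponds to an average power draw of about 0.63 times the baseline over the longer run. Under the stated scaling assumption, that falls between the estimated P2 and P3 levels.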

13 The Need for a Hybrid Storage Model (Grider et al.)

14 Checkpoint/Restart and Application Efficiency
Nomenclature:
- Tau = time for work before checkpoints
- Delta = checkpoint time
- R = restart time
- Ts = total compute time to solution
- Tw = wallclock time to solution
Assumptions: N is large, Ts >> Delta
Reference: John T. Daly, “A higher order estimate of the optimum checkpoint interval for restart dumps,” Future Generation Computer Systems 22 (2006), pp. 303-312

15 Daly Model Example
- As JMTTI decreases
  - The optimal checkpoint interval decreases -> efficiency decreases
  - Choosing an optimal checkpoint becomes more important
- I.e. the need for resilience techniques to improve efficiency is crucial to ensure an efficient use of a job’s time allocation

JMTTI (hours)   Optimal Tau (minutes)   Tw (hours)   Efficiency
24              230                     601          83%
12              160                     657          76%
6               110                     752          66%
3               70                      932          54%
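
The trend in these numbers can be explored with a short script. The sketch below uses the higher-order optimum-interval estimate and the expected-wallclock expression as they are commonly quoted from the Daly paper cited on the previous slide. The inputs are assumptions, not values stated on the slides: Ts = 500 hours and Delta = R = 20 minutes were chosen because they approximately reproduce the tabulated figures.

```python
import math


def daly_optimal_interval(delta, mtti):
    """Higher-order estimate of the optimum compute time between checkpoints.

    delta: checkpoint time; mtti: mean time to interrupt (same units).
    """
    if delta >= 2.0 * mtti:
        return mtti
    x = delta / (2.0 * mtti)
    return math.sqrt(2.0 * delta * mtti) * (1.0 + math.sqrt(x) / 3.0 + x / 9.0) - delta


def daly_wallclock(ts, tau, delta, restart, mtti):
    """Expected wallclock time to solution (all arguments in the same units)."""
    return (mtti * math.exp(restart / mtti)
            * (math.exp((tau + delta) / mtti) - 1.0) * ts / tau)


# Assumed inputs (NOT stated on the slide): chosen to roughly match the table.
TS_HOURS = 500.0
DELTA_MIN = 20.0
RESTART_MIN = 20.0

for jmtti_hours in (24, 12, 6, 3):
    mtti_min = jmtti_hours * 60.0
    tau_min = daly_optimal_interval(DELTA_MIN, mtti_min)
    tw_hours = daly_wallclock(TS_HOURS * 60.0, tau_min, DELTA_MIN,
                              RESTART_MIN, mtti_min) / 60.0
    print(f"JMTTI {jmtti_hours:>2} h: tau ~ {tau_min:.0f} min, "
          f"Tw ~ {tw_hours:.0f} h, efficiency ~ {TS_HOURS / tw_hours:.0%}")
```

With these assumed inputs, the computed intervals round to the tabulated Tau values at 10-minute granularity, and the wallclock times land within about an hour of the tabulated Tw. Efficiency here is taken as Ts / Tw.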

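16 Optimal efficiency graph
[Figure: optimal efficiency plotted against Delta/JMTTI, with regions labeled “Goodness” and “Badness” and the point (1/200, 90%) marked.]
Efficiency is a function of Delta/JMTTI (for derivation see Josip Loncaric @ LANL)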
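
One way to see why efficiency depends only on the ratio Delta/JMTTI is the common first-order approximation: with an optimal interval of about sqrt(2 * Delta * JMTTI), the fraction of time lost to checkpoints and rework is about sqrt(2 * Delta / JMTTI). The sketch below assumes that approximation, which holds for small ratios and is not necessarily the derivation the slide refers to. It does reproduce the (1/200, 90%) point called out on the graph.

```python
import math


def first_order_efficiency(delta_over_jmtti):
    """First-order estimate of optimal efficiency: 1 - sqrt(2 * Delta / JMTTI).

    Only meaningful for small ratios; this is an approximation, not an exact model.
    """
    return 1.0 - math.sqrt(2.0 * delta_over_jmtti)


for ratio in (1 / 1000, 1 / 200, 1 / 50):
    print(f"Delta/JMTTI = {ratio:.4f}: efficiency ~ {first_order_efficiency(ratio):.0%}")
```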

17 Trinity Status
- Formal Design Review completed April 2013
- Independent Project Review (Lehman) completed May 2013
- Trinity/NERSC8 RFP released August 2013
- Technical Evaluation of the proposals completed September 2013
- Initial negotiations for both systems completed November 2013
- NNSA Independent Cost Review completed Jan 2014
- NERSC8 is in the final contract approval stages
  - Should be announced “soon”
- Trinity went back to the proposing vendors for a Best and Final Offer (BAFO)
  - Target delivery date of September 2015 is unchanged

18 Test Bed Strategy complements ATS
- Motivation
  - Significant PRODUCTION code rewrite/modification will be required for future platforms
  - Ensure that when codes make “the change” it is the right move for longevity, porting efforts, performance, etc.
  - Eliminate or reduce missteps
- Philosophy
  - Have an APPLICATION focus, reduce the impact on codes in a rapidly changing technology environment
  - Both hardware and software are intended (and have proven) to be highly dynamic
  - INTENTIONALLY closer to prototypes than production
  - It is more important to explore a diverse set of architectural alternatives at rack scale than to push large scale
- Primary target is ASC researchers, in addition to ASCR co-design centers

19 [Figure: test bed timeline with markers at Sept. 2011, Sept. 2012, Sept. 2013, Apr. 2014 and Sept. 2014, plus a PLANNED group. Platform labels, in extracted order: Westmere + Knights Ferry; SandyBridge Knights Corner (B); SandyBridge Knights Corner (C); Llano + PowerInsight; Trinity + PowerInsight; Interlagos + Fermi 2090x; Interlagos + Kepler K20X; SandyBridge + Kepler K20; SandyBridge + Kepler K20 + K40; IvyBridge; Power 7 + FPGA; Power 8; Kaveri + PowerInsight V2; SandyBridge Kepler K20 + K40 + TBD; Power 8 64bit; Haswell + PowerInsight V2; HMC.]

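20 Exploration Matrix
[Figure: matrix of programming models against vendor platforms and mini-applications; individual cell entries are not recoverable.]
- Programming models: MPI, Intrins., CUDA, OpenAcc, OpenCL, Kokkos Array, Cilk+, OpenMP, TBB, ArBB/CEAN, qthreads, pthreads, MKL/Math Lib., Adv. Lang., DSLs, LT
- Platforms: NVIDIA (GPU), AMD (CPU), AMD (APU), Intel (CPU), Intel (MIC), IBM, ARM
- Applications: miniFE, miniMD, miniGhost, miniAMR, miniSMAC, LULESH, SNAP, CoMD, S3D/SMC, NEK5000, PathFinder
- Categories: Vectorization, Threading, Node Level, Future Languages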

21 Example Analysis using Test Beds: PowerInsight

22 Thank You. Questions?
Thank you to the following for content:
- Trinity & NERSC-8 Project Teams
- Thuc Hoang, ASC HQ
- Manuel Vigil, LANL
- Jim Laros, SNL
- Josip Loncaric, LANL
- Simon Hammond, SNL
- and a cast of many others …

23 Trinity is Sized for High Fidelity Workloads
- Point Design: A current LANL 3D problem runs on ¼ of Cielo today using about 80 TBytes
- Projected capacity for a higher fidelity problem is about 750 TBytes in the Trinity timeframe
- Trinity is sized to support 2 to 4 jobs of this class -> 2 to 4 PBytes -> ¼ to ½ of the system

24 Trinity Facility, Power & Cooling
- Trinity will be located in the Nicholas C. Metropolis center (SCC) at Los Alamos National Lab
- Facility power is one of the primary constraints in the design of Trinity
  - 12 MW water cooling + 2-3 MW (maybe 4 MW) air cooling available
  - Inclusive of storage and any other externally attached equipment
- 300 lbs per square foot floor loading
- 10,000 to 12,000 square feet of floor space
- At least 80% of the platform will be water cooled
  - Direct (direct to chip or cold plate) is preferred
  - Indirect (e.g. radiator) method is acceptable
- Tower water (directly from cooling tower) at up to 32 °C is preferred
- Chilled water at 8.5 °C is available but less desirable due to additional cost
- Under floor air at 12.5 °C is available to supplement the water cooling method
- Concerns
  - Idle power efficiency
  - Rapid ramp up / ramp down load on power grid over 2 MW
