
1 The Simulation-Code-Hardware Feedback Loop in Practice
Bill Archer, LANL Advanced Simulation and Computing Program Director
Salishan Conference on High-Speed Computing, April 21, 2014
LA-UR-14-22615
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
UNCLASSIFIED

2 Outline
• An A-Typical DSW Driver
• Trinity Mission Need
• Code Adaptation

3 An A-Typical DSW Driver

4 Nuclear performance assessment requires multi-physics codes
[Figure labels: Thermonuclear Burn, Fission, Radiation (Photons), Hydrodynamics]
NNSA procures systems to solve national security problems
• In 2011 Directed Stockpile Work (DSW) had a particular classified multi-physics 3D problem they wanted to simulate
• We’ll call this “The Problem”
• It was simulated with a classified multi-physics integrated design code (IDC)
• We’ll call this “The Code”
• Run on Cielo, a Cray XE6 at LANL with 0.28 PiB of memory and 136K cores
Mission driven problem

5 The Simulation ran into many problems.
• Ran on half the machine; it took 3 to 4 days to get an 8 hour allocation
• Dealt with resiliency by allocating extra nodes and restarting within Moab allocation
• Suffered memory exhaustion when using ½ of Cielo
• 25 TiB dump file caused archiving and data movement problems
Hero class problem

6 The Simulation proved to be too big for Cielo
• Code team developed tailored physics to improve accuracy while reducing memory usage by 40%
  – Reduced the dump file to 9 TB
• Still had memory exhaustion
DSW deferred the problem after a year of trying

7 In 2012 the code team then tried to get The Code to run The Simulation.
• Found an I/O gather that caused memory exhaustion
  – Very painful to debug at 65,000 cores
• Further runs at 78,000 cores ran into I/O hangs
  – Went away with an operating system update
• Took 80% of Cielo to run The Problem
  – 97,000 cores and 0.21 PiB memory
• Throttled I/O sends and was able to run at 60%
Still not practical to run; need a bigger machine
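
The slide does not say how the I/O gather exhausted memory or how the sends were throttled. As a rough illustration only, the toy model below assumes the common failure mode in which one collecting rank buffers every other rank's data at once, and assumes throttling means capping how many buffers are in flight. The function names, the `window` parameter, and the 64 MiB per-rank size are hypothetical; only the 65,000-core figure comes from the slide.

```python
def gather_peak_bytes(n_ranks, bytes_per_rank, window=None):
    """Peak bytes buffered on the collecting rank in this toy model.

    window=None models an unthrottled gather: every rank's buffer is
    resident at once. An integer window models throttling: at most
    `window` buffers are in flight before they are written out and freed.
    """
    in_flight = n_ranks if window is None else min(window, n_ranks)
    return in_flight * bytes_per_rank


def throttled_gather(chunks, window, write):
    """Drain `chunks` in batches of `window`; return the largest batch in bytes."""
    peak = 0
    for start in range(0, len(chunks), window):
        batch = chunks[start:start + window]
        peak = max(peak, sum(len(c) for c in batch))
        for chunk in batch:
            write(chunk)  # stand-in for writing to the dump file
    return peak


# Hypothetical sizes: 65,000 ranks (slide) each sending 64 MiB (assumed).
MIB = 2**20
unthrottled = gather_peak_bytes(65_000, 64 * MIB)
throttled = gather_peak_bytes(65_000, 64 * MIB, window=100)

# Small runnable demonstration of the batching loop itself.
written = []
demo_peak = throttled_gather([bytes(1024)] * 8, window=2, write=written.append)
```

In this model the peak buffer size scales with the window rather than with the core count, which is the kind of trade (more I/O time for less memory) that throttling usually buys.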

8 Trinity Mission Need

9 The Trinity Mission Need was driven by the need to solve this type of problem.
• The desire was to increase the resolution by 2X
  – Allows increased geometric and physics fidelity
• Adaptive mesh refinement (AMR) allows us to limit the memory increase to 3X, about 0.75 PiB
• Be able to run 2 to 4 of these problems at once.
• Became basis of critical decision documents and Request For Proposals
2 to 4 PBytes memory desired; No FLOPS requirement
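
The sizing argument on this slide can be checked with a few lines of arithmetic. The sketch below takes the 2X resolution, the 3X AMR memory factor, the 0.75 PiB per problem, and the 2 to 4 concurrent problems from the slide. The 8X figure for uniform refinement is not on the slide; it is the standard result that doubling resolution in 3D multiplies the cell count by 2^3. The conversion from PiB to PB is also an assumption about how the "2 to 4 PBytes" target was rounded.

```python
# Back-of-the-envelope memory sizing for the Trinity mission need.
PIB = 2**50   # bytes in a pebibyte
PB = 10**15   # bytes in a petabyte

resolution_factor = 2   # slide: increase the resolution by 2X
dimensions = 3          # The Problem is 3D

# Without AMR, refining every cell multiplies memory by factor**dimensions.
uniform_memory_factor = resolution_factor ** dimensions

amr_memory_factor = 3   # slide: AMR limits the increase to 3X
per_problem_pib = 0.75  # slide: about 0.75 PiB per problem

# Baseline footprint implied by the slide's own numbers.
implied_baseline_pib = per_problem_pib / amr_memory_factor


def total_memory_pb(n_problems, pib_each=per_problem_pib):
    """Aggregate memory in decimal petabytes for n concurrent problems."""
    return n_problems * pib_each * PIB / PB


low_pb = total_memory_pb(2)    # two problems at once
high_pb = total_memory_pb(4)   # four problems at once

print(f"uniform refinement: {uniform_memory_factor}X, AMR: {amr_memory_factor}X")
print(f"implied baseline: {implied_baseline_pib:.2f} PiB")
print(f"aggregate need: {low_pb:.2f} to {high_pb:.2f} PB")
```

Under these assumptions the aggregate comes out a little under the stated 2 to 4 PBytes at both ends, and the implied 0.25 PiB baseline falls between the 0.21 PiB measured on Cielo and Cielo's 0.28 PiB total.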

10 Trinity will meet the memory requirement for our simulations.
• Trinity negotiations are underway
  – An announcement is expected in late May
  – Delivery expected in Q4 FY15
• Trinity will be deployed by Los Alamos and Sandia (ACES)
  – Sited at Los Alamos, used by Los Alamos, Sandia, and Livermore
• Partnering on the procurement with LBNL for acquisition of NERSC8

Description of Scope                                    | Threshold KPPs in RFP
Capability Performance Improvement over Cielo baseline  | 8x
Aggregate Memory                                        | 2 PB
Aggregate File System Capacity                          | 30x Main memory
Trinity Power requirement                               | < 15 MW

Trinity must demonstrate a significant capability improvement over current platforms (>> Cielo)
Advanced Computing at Extreme Scale (ACES)

11 Code Adaptation

12 How do codes keep up with rapid changes in hardware?
• ASC has multiple codes that total several million lines of code and represent several billion dollars of taxpayer investment
• ASC is bringing in major systems every 2.5 years
[Plot labels: Peak, Effective, Light cores, Heavy cores. Plots courtesy of LLNL]
What do we do … today?

13 One answer is to isolate the physics from the hardware
• Assumes we can’t afford to rewrite codes for every system
• Moving to a new machine should only impact the hardware-aware infrastructure
• Need an “abstraction layer” to isolate the physics modules from the hardware-aware infrastructure
[Diagram labels: Code 1, Code 2, Code 3; Hydro 1 Interface, Hydro 2 Interface, EOS Interface, Explosives Interface; Physics Packages; Abstraction Layer(s); Hardware Aware Infrastructure; MPI, Threads, I/O, Viz]
Despite years of research the community has failed to deliver a production usable abstraction layer
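
To make the layering concrete, here is a minimal sketch of the idea, not a description of any ASC code. Every name in it (`Backend`, `SerialBackend`, `ThreadedBackend`, `total_pressure`) is hypothetical, and the ideal-gas formula stands in for a real EOS package. The point is only that the physics routine calls the abstract interface and never touches threads or MPI directly, so swapping the backend leaves the physics untouched.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class Backend(ABC):
    """Hypothetical abstraction layer over the hardware-aware infrastructure."""

    @abstractmethod
    def parallel_map(self, fn, values):
        """Apply fn to every value and return the results in order."""

    @abstractmethod
    def reduce_sum(self, values):
        """Combine per-cell results into a single number."""


class SerialBackend(Backend):
    def parallel_map(self, fn, values):
        return [fn(v) for v in values]

    def reduce_sum(self, values):
        return sum(values)


class ThreadedBackend(Backend):
    def __init__(self, workers=4):
        self.workers = workers

    def parallel_map(self, fn, values):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(fn, values))

    def reduce_sum(self, values):
        return sum(values)


def ideal_gas_pressure(density, energy, gamma=1.4):
    """Toy EOS: p = (gamma - 1) * rho * e."""
    return (gamma - 1.0) * density * energy


def total_pressure(backend, cells):
    """Toy physics package: it only talks to the Backend interface."""
    pressures = backend.parallel_map(lambda c: ideal_gas_pressure(*c), cells)
    return backend.reduce_sum(pressures)


cells = [(1.0, 2.5), (2.0, 2.5), (0.5, 1.0)]  # (density, specific energy)
serial_result = total_pressure(SerialBackend(), cells)
threaded_result = total_pressure(ThreadedBackend(workers=2), cells)
```

A toy like this hides the hard part the slide points to: a production abstraction layer also has to cover data layout, communication, I/O, and visualization without costing performance.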

14 The other extreme: become agile and rewrite the codes for each machine
• Successfully programmed Cray vector machines with loop-level pragmas
• Successfully programmed parallel clusters with low-level MPI calls
• Success occurred during decades of hardware stability!
[Images: Cray 1, ASCI Blue Mountain]
Any agile examples with large code bases?

15 The ability to affordably and quickly adapt the codes to new hardware is THE problem.
• ASC systems are bought to solve mission problems
• If our codes can’t use the systems, there isn’t any reason to buy them.
The community needs to come together on a solution for graceful code migration

16 Questions?
[Images: Cielo, Cray XE6, 1.4 PF/s, 2011; IBM Punch Card Accounting Machines, 20 Op/s, 1944]

17 Backups

18 Abstract
There is a tendency to view simulations, codes, and hardware independently and in isolation from each other. I will discuss a mission driven simulation that pushed the limits of a code and the Cielo hardware and in the end was just too large to run. This in turn was a major driver of the hardware requirements for Trinity. The selected Trinity hardware is now driving changes to the code, with the intent that this will allow us to run the original simulation, and to start preparing the codes for the next generation hardware. This feedback loop is typical of how Los Alamos leverages hardware and codes to increase the simulation space for problems of mission interest.

19 Biography
Dr. Bill Archer carried out his doctoral research on quantum chemistry at Los Alamos. Before returning to LANL in 1999 he was a post-doc at Rice University, a main-ring superconducting magnet designer at the Superconducting Super Collider, and an operations research analyst at the Center for Naval Analyses. While at CNA he spent seven years embedded with the Fleet. Upon returning to LANL he modified an atomic physics code for parallel processing on Blue Mountain. He then moved to the ASCI Crestone Project, where he was project leader and team leader for several years as the codes were brought into general production on the Q and White machines. He was one of the first members of the Thermonuclear Burn Initiative, where he started studying burn and the history of the Weapons Program. Since 2008 he has held a variety of Los Alamos management positions: Advanced Simulation and Computing (ASC) Integrated Codes Program Manager, line manager of the Simulation Analysis and Code Development Group, line manager of the Material and Physical Data Group, ASC Deputy Program Director (acting), and currently ASC Program Director.

