Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA UNCLASSIFIED UNCLASSIFIED | 1 The Simulation-Code-Hardware Feedback.

Presentation transcript:

Slide 1: The Simulation-Code-Hardware Feedback Loop in Practice
Bill Archer, LANL Advanced Simulation and Computing Program Director
Salishan Conference on High-Speed Computing, April 21, 2014
LA-UR
Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA. UNCLASSIFIED

Slide 2: Outline
• An A-Typical DSW Driver
• Trinity Mission Need
• Code Adaptation

Slide 3: An A-Typical DSW Driver

Slide 4: Nuclear performance assessment requires multi-physics codes
[Figure labels: Thermonuclear Burn, Fission, Radiation (Photons), Hydrodynamics]
NNSA procures systems to solve national security problems
• In 2011 Directed Stockpile Work (DSW) had a particular classified multi-physics 3D problem they wanted to simulate
• We’ll call this “The Problem”
• It was simulated with a classified multi-physics integrated design code (IDC)
• We’ll call this “The Code”
• Run on Cielo, a Cray XE6 at LANL with 0.28 PiB of memory and 136K cores
Mission driven problem

Slide 5: The Simulation ran into many problems.
• Ran on half the machine; it took 3 to 4 days to get an 8-hour allocation
• Dealt with resiliency by allocating extra nodes and restarting within the Moab allocation
• Suffered memory exhaustion when using ½ of Cielo
• 25 TiB dump file caused archiving and data movement problems
Hero class problem

Slide 6: The Simulation proved to be too big for Cielo
• Code team developed tailored physics to improve accuracy while reducing memory usage by 40%
  – Reduced the dump file to 9 TB
• Still had memory exhaustion
DSW deferred the problem after a year of trying

Slide 7: In 2012 the code team then tried to get The Code to run The Simulation.
• Found an I/O gather that caused memory exhaustion
  – Very painful to debug at 65,000 cores
• Further runs at 78,000 cores ran into I/O hangs
  – Went away with an operating system update
• Took 80% of Cielo to run The Problem
  – 97,000 cores and 0.21 PiB memory
• Throttled I/O sends and was able to run at 60% of Cielo
Still not practical to run; need a bigger machine
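Editor's illustration: the gather problem and the throttling fix on this slide can be pictured with a toy model. The sketch below is not The Code's I/O layer. It assumes a hypothetical root rank that buffers every chunk it receives until the current wave of senders has been written out, and the names `peak_root_buffer`, `chunk_bytes`, and `max_in_flight`, along with the 64 MiB chunk size, are invented for illustration. It shows only why capping the number of simultaneous senders bounds the root's peak memory.

```python
def peak_root_buffer(n_ranks, chunk_bytes, max_in_flight=None):
    """Peak bytes buffered on the root during a gather-to-one-writer dump.

    Toy model (assumption): every rank sends one chunk of `chunk_bytes`,
    and the root holds each received chunk until the current wave is written.
    `max_in_flight=None` models an unthrottled gather (all ranks at once).
    """
    window = n_ranks if max_in_flight is None else min(max_in_flight, n_ranks)
    peak = 0
    for start in range(0, n_ranks, window):
        wave = min(window, n_ranks - start)   # ranks sending in this wave
        peak = max(peak, wave * chunk_bytes)  # root buffers the whole wave
    return peak


MiB = 2**20
# 65,000 ranks is the core count quoted on the slide; 64 MiB is made up.
unthrottled = peak_root_buffer(65_000, 64 * MiB)
throttled = peak_root_buffer(65_000, 64 * MiB, max_in_flight=32)
print(f"unthrottled peak on root: {unthrottled / 2**40:.2f} TiB")
print(f"throttled peak on root:   {throttled / 2**30:.2f} GiB")
```

In this model the throttled peak depends only on the window size, not on the number of ranks, which is the property that matters when scaling a dump to tens of thousands of cores. The cost is more sequential waves, so dump time grows as the window shrinks.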

Slide 8: Trinity Mission Need

Slide 9: The Trinity Mission Need was driven by the need to solve this type of problem.
• The desire was to increase the resolution by 2X
  – Allows increased geometric and physics fidelity
• Adaptive mesh refinement (AMR) allows us to limit the memory increase to 3X, about 0.75 PiB
• Be able to run 2 to 4 of these problems at once
• Became the basis of critical decision documents and the Request for Proposals
2 to 4 PBytes memory desired; no FLOPS requirement
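Editor's illustration: the slide does not show how the "2 to 4 PBytes" figure was reached. One plausible reading, sketched below, multiplies the 0.75 PiB per-problem footprint by the 2 to 4 concurrent runs and converts binary to decimal units. The factor of 8 for uniform refinement is an assumption added for comparison (doubling resolution in each of three dimensions, with memory proportional to cell count); it does not appear on the slide.

```python
PIB = 2**50   # bytes in a pebibyte
PB = 10**15   # bytes in a petabyte

per_problem_pib = 0.75     # from the slide: AMR-limited footprint at 2X resolution
concurrent_runs = (2, 4)   # from the slide: run 2 to 4 problems at once


def aggregate_pb(footprint_pib, runs):
    """Total memory in decimal PB for `runs` problems of `footprint_pib` each."""
    return footprint_pib * runs * PIB / PB


low, high = (aggregate_pb(per_problem_pib, n) for n in concurrent_runs)
print(f"aggregate memory: {low:.2f} to {high:.2f} PB")

# Assumption, not from the slide: without AMR, 2X resolution in 3D means 2**3 cells.
amr_growth = 3
uniform_growth = 2**3
baseline_pib = per_problem_pib / amr_growth
print(f"implied baseline: {baseline_pib:.2f} PiB")
print(f"uniform refinement would need about {baseline_pib * uniform_growth:.2f} PiB per problem")
```

Under these assumptions the range comes out a little under the slide's figures, which is consistent with rounding up to whole petabytes.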

Slide 10: Trinity will meet the memory requirement for our simulations.
• Trinity negotiations are underway
  – An announcement is expected in late May
  – Delivery expected in Q4 FY15
• Trinity will be deployed by Los Alamos and Sandia (ACES)
  – Sited at Los Alamos; used by Los Alamos, Sandia, and Livermore
• Partnering on the procurement with LBNL for acquisition of NERSC8

Description of Scope                                      Threshold KPPs in RFP
Capability Performance Improvement over Cielo baseline    8x
Aggregate Memory                                          2 PB
Aggregate File System Capacity                            30x Main memory
Trinity Power requirement                                 < 15 MW

Trinity must demonstrate a significant capability improvement over current platforms (>> Cielo)
Advanced Computing at Extreme Scale (ACES)
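Editor's illustration: the threshold KPPs on this slide can be read as simple pass/fail checks. The sketch below encodes them that way. It assumes each threshold is a minimum (or, for power, a maximum) that a proposal must meet; the `Proposal` class, the `failed_kpps` function, and the example numbers are hypothetical and do not describe any actual bid.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    capability_vs_cielo: float  # capability improvement as a multiple of Cielo
    memory_pb: float            # aggregate main memory, PB
    filesystem_pb: float        # aggregate file system capacity, PB
    power_mw: float             # power requirement, MW


def failed_kpps(p):
    """Return the names of the slide's threshold KPPs that `p` does not meet."""
    checks = {
        "capability >= 8x Cielo": p.capability_vs_cielo >= 8,
        "aggregate memory >= 2 PB": p.memory_pb >= 2,
        "file system >= 30x main memory": p.filesystem_pb >= 30 * p.memory_pb,
        "power < 15 MW": p.power_mw < 15,
    }
    return [name for name, ok in checks.items() if not ok]


# Made-up numbers: 2 PB of memory implies a 60 PB file system threshold.
example = Proposal(capability_vs_cielo=8.5, memory_pb=2.0,
                   filesystem_pb=55.0, power_mw=12.0)
print(failed_kpps(example))
```

Note that the file system threshold is relative: it scales with whatever memory is offered, so a proposal with more memory must also offer more storage.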

Slide 11: Code Adaptation

Slide 12: How do codes keep up with rapid changes in hardware?
• ASC has multiple codes that total several million lines of code and represent several billion dollars of taxpayer investment
• ASC is bringing in major systems every 2.5 years
What do we do … today?
[Figure: plots with series labeled Peak, Effective, Light cores, Heavy cores. Plots courtesy of LLNL]

Slide 13: One answer is to isolate the physics from the hardware
• Assumes we can’t afford to rewrite codes for every system
• Moving to a new machine should only impact the hardware-aware infrastructure
• Need an “abstraction layer” to isolate the physics modules from the hardware-aware infrastructure
Despite years of research, the community has failed to deliver a production-usable abstraction layer
[Figure: layered diagram; labels in order: Code 1, Code 2, Code 3; Hydro 1 Interface, Hydro 2 Interface, EOS Interface, Explosives Interface; Physics Packages; Abstraction Layer(s); Hardware Aware Infrastructure; MPI, Threads, I/O, Viz]
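Editor's illustration: the layering on this slide can be shown at toy scale. The sketch below is not an ASC design, and the slide itself notes that no production-usable abstraction layer has been delivered. It assumes a hypothetical `Backend` interface standing in for the hardware-aware infrastructure and a small equation-of-state routine standing in for a physics package. All class and function names are invented, and the ideal-gas relation p = (γ − 1)ρe is used only because it is simple.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class Backend(ABC):
    """Abstraction layer (hypothetical): the only thing physics code may call."""

    @abstractmethod
    def parallel_map(self, fn, cells):
        """Apply `fn` to every cell and return the results in order."""


class SerialBackend(Backend):
    """Hardware-aware infrastructure, variant 1: a plain loop."""

    def parallel_map(self, fn, cells):
        return [fn(c) for c in cells]


class ThreadBackend(Backend):
    """Hardware-aware infrastructure, variant 2: a thread pool."""

    def __init__(self, workers=4):
        self.workers = workers

    def parallel_map(self, fn, cells):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(fn, cells))  # pool.map preserves input order


def ideal_gas_pressure(cell, gamma=1.4):
    """Physics: ideal-gas EOS for one cell given (density, specific energy)."""
    density, energy = cell
    return (gamma - 1.0) * density * energy


def eos_update(backend, cells):
    """Physics package: knows the EOS, knows nothing about the hardware."""
    return backend.parallel_map(ideal_gas_pressure, cells)


cells = [(1.0, 2.5), (2.0, 1.0), (0.5, 4.0)]
print(eos_update(SerialBackend(), cells))
print(eos_update(ThreadBackend(workers=2), cells))
```

Swapping `SerialBackend` for `ThreadBackend` changes nothing in `eos_update`, which is the property the slide asks for. The hard part at production scale, which this sketch does not address, is keeping such an interface both general and fast across very different machines.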

Slide 14: The other extreme: become agile and rewrite the codes for each machine
• Successfully programmed Cray vector machines with loop-level pragmas
• Successfully programmed parallel clusters with low-level MPI calls
• Success occurred during decades of hardware stability!
[Figure labels: Cray 1, ASCI Blue Mountain]
Any agile examples with large code bases?

Slide 15: The ability to affordably and quickly adapt the codes to new hardware is THE problem.
• ASC systems are bought to solve mission problems
• If our codes can’t use the systems, there isn’t any reason to buy them.
The community needs to come together on a solution for graceful code migration

Slide 16: Questions?
[Figure labels: Cielo, Cray XE6, 1.4 PF/s, 2011; IBM Punch Card Accounting Machines, 20 Op/s, 1944]

Slide 17: Backups

Slide 18: Abstract
There is a tendency to view simulations, codes, and hardware independently and in isolation from each other. I will discuss a mission-driven simulation that pushed the limits of a code and the Cielo hardware and in the end was just too large to run. This in turn was a major driver of the hardware requirements for Trinity. The selected Trinity hardware is now driving changes to the code, with the intent that this will allow us to run the original simulation, and to start preparing the codes for the next generation hardware. This feedback loop is typical of how Los Alamos leverages hardware and codes to increase the simulation space for problems of mission interest.

Slide 19: Biography
Dr. Bill Archer carried out his doctoral research on quantum chemistry at Los Alamos. Before returning to LANL in 1999 he was a post-doc at Rice University, a main-ring superconducting magnet designer at the Superconducting Super Collider, and an operations research analyst at the Center for Naval Analyses. While at CNA he spent seven years embedded with the Fleet. Upon returning to LANL he modified an atomic physics code for parallel processing on Blue Mountain. He then moved to the ASCI Crestone Project, where he was project leader and team leader for several years as the codes were brought into general production on the Q and White machines. He was one of the first members of the Thermonuclear Burn Initiative, where he started studying burn and the history of the Weapons Program. Since 2008 he has held a variety of Los Alamos management positions: Advanced Simulation and Computing (ASC) Integrated Codes Program Manager, line manager of the Simulation Analysis and Code Development Group, line manager of the Material and Physical Data Group, ASC Deputy Program Director (acting), and currently ASC Program Director.