From Athena to Minerva: A Brief Overview. Ben Cash, Minerva Project Team. Minerva Workshop, GMU/COLA, September 16, 2013
Athena Background
World Modeling Summit (WMS; May 2008): The summit calls for a revolution in climate modeling to more rapidly advance improvement in climate model resolution, accuracy and reliability. It recommends petascale supercomputers dedicated to climate modeling.
Athena supercomputer: The U.S. National Science Foundation responds, offering to dedicate the retiring Athena supercomputer over a six-month period in 2009-2010. An international collaboration was formed among groups in the U.S., Japan and the U.K. to use Athena to take up the challenge.
Project Athena
Dedicated supercomputer: Athena was a Cray XT-4 with 18,048 computational cores; replaced by the new Cray XT-5, Kraken, with 99,072 cores (since increased); #21 on the June 2009 Top 500 list; 6 months, 24/7, 99.3% utilization; over 1 PB of data generated.
Large international collaboration: over 30 people; 6 groups; 3 continents.
State-of-the-art global AGCMs: NICAM (JAMSTEC/U. Tokyo), the Nonhydrostatic Icosahedral Atmospheric Model; IFS (ECMWF), the Integrated Forecasting System; highest possible spatial resolution.
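For a rough sense of scale, the core count and utilization quoted for Athena can be converted into delivered core-hours. This is only a back-of-the-envelope sketch: the slide gives no exact start and end dates, so the 183-day length of the "6 months" is an assumption, and the function name is illustrative.

```python
def delivered_core_hours(cores: int, days: float, utilization: float) -> float:
    """Core-hours delivered by a machine running 24/7 at a given utilization."""
    return cores * days * 24 * utilization


ATHENA_CORES = 18_048   # from the slide
UTILIZATION = 0.993     # 99.3%, from the slide
ASSUMED_DAYS = 183      # assumption: "6 months" taken as roughly 183 days

athena_core_hours = delivered_core_hours(ATHENA_CORES, ASSUMED_DAYS, UTILIZATION)
print(f"{athena_core_hours / 1e6:.1f} million core-hours (approximate)")
```

Under these assumptions the dedicated period comes to somewhat under 80 million core-hours; a different choice of period length shifts the figure proportionally.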
Athena Science Goals
Hypothesis: Increasing climate model resolution to accurately resolve mesoscale phenomena in the atmosphere (and ocean and land surface) can dramatically improve the fidelity of the models in simulating climate: mean, variances, covariances, and extreme events.
Hypothesis: Simulating the effect of increasing greenhouse gases on regional aspects of climate, especially extremes, may, for some regions, depend critically on the spatial resolution of the climate model.
Hypothesis: Explicitly resolving important processes, such as clouds in the atmosphere (and eddies in the ocean and landscape features on the continental surface), without parameterization, can improve the fidelity of the models, especially in describing the regional structure of weather and climate.
Qualitative Analysis: 2009 NICAM Precipitation and Cloudiness May 21-August 31
Athena Lessons Learned
Dedicated usage of a relatively big supercomputer greatly enhances productivity. Dealing with only a few users and their requirements allows for more efficient utilization of resources.
Challenge: Dedicated simulation projects like Project Athena can generate enormous amounts of data to be archived, analyzed and managed. NICS (and TeraGrid) do not currently have enough storage capacity. Data management is a big challenge.
Preparation time: At least 2 to 3 weeks were needed before the beginning of dedicated runs to test and optimize the codes and to plan strategies for optimal use of the system.
Communication throughout the project was essential (weekly telecons, email lists, personal calls, …).
Athena Limitations
Athena was a tremendous success, generating a tremendous amount of data and a large number of papers for a six-month project. BUT…
Limited number of realizations: Athena runs generally consisted of a single realization, leaving no way to assess the robustness of results.
Uncoupled models.
Multiple, dissimilar models: resources were split between IFS and NICAM; differences in performance meant very different experiments were performed, making results difficult to compare directly.
Storage limitations and post-processing demands limited what could be saved for each model.
Minerva Background
NCAR Yellowstone: In 2012, the NCAR-Wyoming Supercomputing Center (NWSC) debuted Yellowstone, the successor to Bluefire, their previous production platform; IBM iDataPlex, 72,280 cores, 1.5 petaflops peak performance; #17 on the June 2013 Top 500 list; 10.7 PB disk capacity, a vast increase over the capacity available during Athena; high-capacity HPSS data archive; dedicated high-memory analysis clusters (Geyser and Caldera).
Accelerated Scientific Discovery (ASD) program: Recognizing that many groups would not be ready to take advantage of the new architecture, NCAR accepted a small number of proposals for early access to Yellowstone; 3 months of near-dedicated access before the system was opened to the general user community; an opportunity to continue the successful Athena collaboration between COLA and ECMWF, and to address limitations in the Athena experiments.
Minerva Timeline
March 2012: Proposal finalized and submitted; 31 million core hours requested.
April 2012: Proposal accepted; 21 million core hours approved; anticipated date of production start: July 21; code testing and benchmarking on Janus begins.
October 5, 2012: First login to Yellowstone; bcash reportedly user 1.
October to November 23, 2012: Jobs are plagued by massive system instabilities and a conflict between the code and the Intel compiler.
Minerva Timeline continued
November 24 to December 1, 2012: Code conflict resolved; low core count jobs avoid the worst of the system instability. Minerva jobs occupy 61,000 cores (!). Peter Towers estimates Minerva easily sets the record for “Most IFS FLOPs in a 24 hour period”.
Jobs rapidly overrun the initial 250 TB disk allocation, triggering a request for additional resources. This becomes a Minerva project theme.
Due to system instability, user accounts are not charged for jobs at this time. Roughly 7 million free core hours as a result: 28 million total.
800+ TB generated.
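The allocation and storage figures in the timeline can be tallied in a few lines. This is only a bookkeeping sketch using the numbers quoted on the slides; the variable names are illustrative, the 7 million figure is "roughly" stated in the source, and 800 TB is a lower bound because the slide says "800+".

```python
REQUESTED_MCH = 31   # million core hours requested (March 2012)
APPROVED_MCH = 21    # million core hours approved (April 2012)
FREE_MCH = 7         # roughly 7 million uncharged core hours during the instability

total_mch = APPROVED_MCH + FREE_MCH              # core hours actually available
fraction_of_request = total_mch / REQUESTED_MCH  # share of the original request

INITIAL_DISK_TB = 250   # initial disk allocation
GENERATED_TB = 800      # "800+ TB" generated, so a lower bound

overrun_factor = GENERATED_TB / INITIAL_DISK_TB

print(f"total: {total_mch} million core hours "
      f"({fraction_of_request:.0%} of the request)")
print(f"data generated: at least {overrun_factor:.1f}x the initial disk allocation")
```

The free core hours bring the project to about nine tenths of what was originally requested, while the data volume exceeds the initial disk allocation by more than a factor of three.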
Minerva Catalog: Base Experiments
Resolution | Start Dates | Ensembles | Length | Period of Integration
T319 | May 1 | 15 | 24 months (total) | 1980-2011 **
T639 | May 1 | 15 | 24 months (total) | 1980-2011
T639 | May 1, Nov 1 | 51 (total) | 5 and 4 months, respectively | 2000-2011
Minerva Catalog: Extended Experiments
Resolution | Start Dates | Ensembles | Length | Period of Integration
T319 | May 1, Nov 1 | 51 | 7 months | 1980-2011
T639 | May 1, Nov 1 | 15 | 7 months | 1980-2011
T1279 | May 1 | 15 | 7 months | 2000-2011
** to be completed
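To get a feel for the size of a catalog entry, the total simulated time can be estimated as years times start dates times ensemble members times run length. The sketch below does this for two 7-month entries, reading them as T639 (May 1 and Nov 1 starts, 15 members, 1980-2011) and T1279 (May 1 start, 15 members, 2000-2011). It assumes every year in the period has a run for every start date and member, which the slide does not state; the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    resolution: str
    starts_per_year: int   # number of start dates per year
    members: int           # ensemble members per start date
    months: int            # length of each run in months
    first_year: int
    last_year: int

    def simulated_months(self) -> int:
        """Total simulated months, assuming a full set of runs every year."""
        years = self.last_year - self.first_year + 1
        return years * self.starts_per_year * self.members * self.months


experiments = [
    Experiment("T639", starts_per_year=2, members=15, months=7,
               first_year=1980, last_year=2011),
    Experiment("T1279", starts_per_year=1, members=15, months=7,
               first_year=2000, last_year=2011),
]

totals = {e.resolution: e.simulated_months() for e in experiments}
for resolution, months in totals.items():
    print(f"{resolution}: {months} simulated months ({months / 12:.0f} years)")
```

Simulated time alone understates the cost gap between entries, since each increase in resolution makes every simulated month considerably more expensive to run and to store.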
Qualitative Analysis: 2010 T1279 Precipitation May – November
Minerva Lessons Learned
Dedicated usage of a relatively big supercomputer greatly enhances productivity: experience with the early usage period demonstrates that tremendous progress can be made with dedicated access.
Dealing with only a few users allows for more efficient utilization: there was a noticeable decrease in efficiency once the scheduling of multiple jobs of multiple sizes was turned over to a scheduler; NCAR resources were initially overwhelmed by the challenges of the new machine and the individual problems that arose.
Focus on a single model allows for in-depth exploration: data saved at much higher frequency; multiple ensemble members, increased vertical levels, etc.
Dedicated simulation projects like Athena and Minerva generate enormous amounts of data to be archived, analyzed and managed. Data management is a big challenge.
Other than machine instability, data management and post-processing were solely responsible for halts in production. Even on a system designed with lessons from Athena in mind, production capabilities overwhelm storage and processing.
Post-processing and storage must be incorporated into the production stream.
‘Rapid burn’ projects such as Athena and Minerva are particularly prone to overwhelming storage resources.
Beyond Minerva: A New Pantheon
Despite advances beyond Athena, more work to be done. Focus of Tuesday discussion.
Fill in matrix of experiments.
Further increases in ocean and atmospheric resolution.
Sensitivity tests (aerosols, greenhouse gases).
??