Progress Toward Accelerating CAM-SE. Jeff Larkin Along with: Rick Archibald, Ilene Carpenter, Kate Evans, Paulius Micikevicius, Jim Rosinski, Jim Schwarzmeier,

Presentation transcript:

Progress Toward Accelerating CAM-SE. Jeff Larkin Along with: Rick Archibald, Ilene Carpenter, Kate Evans, Paulius Micikevicius, Jim Rosinski, Jim Schwarzmeier, Mark Taylor

Background
In 2009 ORNL asked many of its top users: What sort of science would you do on a 20 Petaflops machine in 2012?
– Answer to come on next slide
Center for Accelerated Application Readiness (CAAR) established to determine:
– Can a set of codes from various disciplines be made to effectively use GPU accelerators with the combined efforts of domain scientists and vendors?
– Each team has a science lead, a code lead, and members from ORNL, Cray, Nvidia, and elsewhere

CAM-SE Target Problem
1/8 degree CAM, using CAM-SE dynamical core and Mozart tropospheric chemistry.
Why is acceleration needed to “do” the problem?
– When including all the tracers associated with Mozart atmospheric chemistry, the simulation is too expensive to run at high resolution on today’s systems.
What unrealized parallelism needs to be exposed?
– In many parts of the dynamics, parallelism needs to include levels (k) and chemical constituents (q).

Profile of Runtime
[Chart not preserved in the transcript; axis label: % of Runtime]

Next Steps
Once the dominant routines were identified, standalone kernels were created for each.
Early efforts tested PGI & HMPP directives, plus CUDA C, CUDA Fortran, and OpenCL
Directives-based compilers were too immature at the time
– Poor support for Fortran modules and derived types
– Did not allow implementation at a high enough level
CUDA Fortran provided good performance while allowing us to remain in Fortran

Identifying Parallelism
HOMME parallelizes both MPI and OpenMP over elements
Most of the tracer advection can also parallelize over tracers (q) and levels (k)
– Vertical remap is the exception, due to vertical dependence in levels.
Parallelizing over tracers and sometimes levels while threading over quadrature points (nv) provides ample parallelism within each element to utilize the GPU effectively.
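To make the "ample parallelism" point concrete, here is a minimal counting sketch. It is not taken from HOMME: the function name, the `npts` parameter (quadrature points per element and level), and every size below are hypothetical values chosen only for illustration. It compares a loop nest that can parallelize over levels (tracer advection) with one that cannot (vertical remap).

```python
def independent_work_items(nelem, qsize, nlev, npts, over_levels=True):
    """Count independent work items available on one node.

    nelem: elements on the node
    qsize: number of tracers (q)
    nlev:  number of vertical levels (k)
    npts:  quadrature points per element and level (hypothetical parameter)
    over_levels: False when a vertical dependence forbids parallelism in k
    """
    per_element = qsize * npts
    if over_levels:
        per_element *= nlev
    return nelem * per_element


# Hypothetical sizes, for illustration only (not from the slides).
sizes = dict(nelem=64, qsize=100, nlev=26, npts=16)

advection_items = independent_work_items(**sizes)
remap_items = independent_work_items(**sizes, over_levels=False)

print("tracer advection work items:", advection_items)
print("vertical remap work items:  ", remap_items)
```

Under these assumed sizes, losing the level dimension reduces the available parallelism by a factor of `nlev`, yet the remap still exposes many independent (element, tracer, point) columns.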

Status
euler_step & laplace_sphere_wk were straightforward to rewrite in CUDA Fortran
Vertical remap was rewritten to be more amenable to GPU (made it vectorize)
– Resulting code is 2X faster on CPU than original code and has been given back to the community
Edge packing/unpacking for boundary exchange needs to be rewritten (Ilene talked about this already)
– Designed for 1 element per MPI rank, but we plan to run with more
– Once this is node-aware, it can also be device-aware and greatly reduce PCIe transfers
Someone said yesterday: “As with many kernels, the ratio of FLOPS per byte transferred determines successful acceleration.”
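The quoted rule of thumb about FLOPS per byte transferred can be illustrated with a deliberately simple cost model. This is a sketch under stated assumptions, not a measurement: the function names and all rates are hypothetical, transfers are assumed not to overlap with compute, and each processor is assumed to sustain a fixed flop rate.

```python
def offload_pays_off(flops, bytes_moved, cpu_rate, gpu_rate, pcie_rate):
    """Return True if GPU kernel time plus PCIe transfer time beats the CPU.

    Rates are in flops/s (cpu_rate, gpu_rate) and bytes/s (pcie_rate).
    Assumes no overlap of transfer and compute.
    """
    cpu_time = flops / cpu_rate
    gpu_time = flops / gpu_rate + bytes_moved / pcie_rate
    return gpu_time < cpu_time


def break_even_flops_per_byte(cpu_rate, gpu_rate, pcie_rate):
    """Flops per transferred byte above which offloading wins in this model."""
    return 1.0 / (pcie_rate * (1.0 / cpu_rate - 1.0 / gpu_rate))


# Hypothetical rates, for illustration only.
CPU, GPU, PCIE = 10e9, 100e9, 5e9

low_intensity = offload_pays_off(1e9, 1e9, CPU, GPU, PCIE)    # 1 flop/byte
high_intensity = offload_pays_off(1e10, 1e9, CPU, GPU, PCIE)  # 10 flops/byte

print("1 flop/byte pays off:  ", low_intensity)
print("10 flops/byte pays off:", high_intensity)
print("break-even flops/byte: ", break_even_flops_per_byte(CPU, GPU, PCIE))
```

In this model the break-even intensity depends only on the three rates, which is why keeping data resident on the device (fewer bytes moved for the same flops) matters as much as kernel speed.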

Status (cont.)
Kernels were put back into HOMME and validation tests were run and passed
– This version did nothing to reduce data movement, only tested kernel accuracy
– In process of porting forward to current trunk and doing more intelligent data movement
Currently reevaluating directives now that compilers have matured
– Directives-based vertical remap now slightly outperforms hand-tuned CUDA
– Still working around derived_type issues

Challenges
Data Structures (Object-Oriented Fortran)
– Every node has an array of element derived types, which contains more arrays
– We only care about some of these arrays, so data movement isn’t very natural
– We must essentially change many non-contiguous CPU arrays into a contiguous GPU array
Parallelism occurs at various levels of the call tree, not just leaf routines, so the compiler must be able to inline leaves in order to use directives
– Cray compiler handles this via whole program analysis; PGI compiler may support this via inline library
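The data-structure challenge above amounts to gathering one member from each element into a single contiguous buffer before a transfer, then scattering results back. The sketch below shows that pattern in plain Python. The `Element` class and its field names are hypothetical stand-ins for a derived type, not HOMME's actual layout, and `array("d")` merely plays the role of a contiguous device-bound buffer.

```python
from array import array
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical stand-in for an element derived type."""
    tracers: list                                 # the member we need on the device
    other: dict = field(default_factory=dict)     # members we do not need to move


def pack_tracers(elements):
    """Gather each element's tracer values into one contiguous buffer."""
    buf = array("d")
    for elem in elements:
        buf.extend(elem.tracers)
    return buf


def unpack_tracers(buf, elements):
    """Scatter a contiguous buffer back into the per-element lists."""
    n = len(buf) // len(elements)  # assumes every element holds n values
    for ie, elem in enumerate(elements):
        elem.tracers = list(buf[ie * n:(ie + 1) * n])


elements = [Element(tracers=[1.0, 2.0]), Element(tracers=[3.0, 4.0])]

buf = pack_tracers(elements)          # one contiguous transfer instead of many
for i in range(len(buf)):
    buf[i] *= 2.0                     # placeholder for work done on the device
unpack_tracers(buf, elements)

print(list(buf))
print(elements[1].tracers)
```

Only the needed member is copied, so the transfer size tracks the data the kernels actually touch rather than the full derived type.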

Challenges (cont.)
CUDA Fortran requires everything live in the same module
– Must duplicate some routines and data structures from several modules in our “cuda_mod”
– Insert ifdefs that hijack CPU routine calls and forward the request to matching cuda_mod routines
– Simple for user, but developer must maintain duplicate routines
– Hey Dave, when will this get changed? ;)

Until the Boundary Exchange is rewritten, euler_step performance is hampered by data movement. Streaming over elements helps, but may not be realistic for the full code.
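Why streaming over elements helps can be seen from a simple two-stage pipeline model, sketched below. It rests on assumptions not stated in the slides: each chunk of elements needs one host-to-device transfer followed by one kernel, every chunk costs the same, and a transfer can overlap with the kernel of the previous chunk. The function names and timings are hypothetical.

```python
def serial_time(nchunks, transfer, compute):
    """Total time when each chunk is transferred, then computed, with no overlap."""
    return nchunks * (transfer + compute)


def streamed_time(nchunks, transfer, compute):
    """Total time when the transfer of chunk i+1 overlaps the kernel of chunk i.

    Two-stage pipeline: the first transfer and the last kernel cannot be
    hidden; in between, the slower stage sets the pace.
    """
    return transfer + (nchunks - 1) * max(transfer, compute) + compute


# Hypothetical per-chunk costs in arbitrary time units.
t_serial = serial_time(4, transfer=2, compute=3)
t_streamed = streamed_time(4, transfer=2, compute=3)

print("no overlap:", t_serial)
print("streamed:  ", t_streamed)
```

When transfer time exceeds kernel time, the streamed total is still dominated by data movement, which is consistent with the slide's caution that streaming alone may not be enough for the full code.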

With data transfer, laplace_sphere_wk is a wash, but since all necessary data is already resident from euler_step, the kernel-only time is realistic.

Vertical remap rewrite is 2X faster on the CPU and still faster on GPU. All data already resident on device from euler_step, so kernel-only time is realistic.

Future Work
Use CUDA 4.0 dynamic pinning of memory to allow overlapping & better PCIe performance
Move forward to CAM5/CESM1
– No chance of our work being used otherwise
Some additional, small kernels are needed to allow data to remain resident
– Cheaper to run these on the GPU than to copy the data
Reprofile with accelerated application to identify next most important routines
– Chemistry implicit solver is expected to be next
– Physics is expected to require mature, directives-based compiler
– Rinse, repeat

Conclusions
Much has been done, much remains
For a fairly new, cleanly written code, CUDA Fortran was tractable.
– HOMME has very similar loop nests throughout; that was key to making this possible
– Still results in multiple code paths to maintain, so we’d prefer to move to directives for the long run
We believe GPU accelerators will be beneficial for the selected problem
– We hope that it will also benefit a wider audience (CAM5 should help this)