Multi-core Acceleration of NWP
John Michalakes, NCAR; John Linford, Virginia Tech; Manish Vachharajani, University of Colorado; Adrian Sandu, Virginia Tech
HPC Users Forum, September 10, 2009

Outline
– WRF and multi-core overview
– Cost breakdown and kernel repository
– Two cases
– Path forward

WRF Overview
Large collaborative effort to develop a community weather model
– registered users
– Applications: numerical weather prediction, high-resolution climate, air quality research/prediction, wildfire, atmospheric research
Software designed for HPC
– Ported to and in use on virtually all types of systems in the Top500
– 2007 Gordon Bell finalist
Why acceleration?
– Exploit fine-grained parallelism
– Cost performance ($ and e⁻)
– Need for strong scaling
[Figure: 5-day global WRF forecast at 20 km horizontal resolution; courtesy Peter Johnsen, Cray]

Multi-/Many-core
Graphics Processing Units
– NVIDIA GTX280, AMD
– High-end versions of commodity graphics cards
– O(100) physical SIMD cores supporting O(1000)-way concurrent threads
– Separate co-processor to host CPU, PCIe connection
– Large register files; fast (but very small) shared memory
– Programmed using special-purpose threading languages: CUDA, OpenCL
– Higher-level language support in development (e.g. PGI 9)
"Traditional" multi-core
– Xeon 5500, Opteron Istanbul, Power 6/7
– Much improved memory bandwidth (5x on STREAM)
– Hyperthreading/SMT
– Includes heterogeneity in the form of SIMD units and instructions
– x86 instruction set; native C, Fortran, OpenMP, ...
Cell Broadband Engine
– PowerXCell 8i
– PowerPC with 8 co-processors on a chip
– No shared memory, but relatively large local stores per core
– Cores separately programmed; all computation and data movement programmer-controlled

WRF Cost Breakdown and Kernel Repository
[Figure: percentages of total run time from a single-processor profile]

WSM5 Microphysics
– WRF Single Moment 5-Tracer (WSM5)* scheme
– Represents condensation, precipitation, and thermodynamic effects of latent heat release
– Operates independently up each column of the 3D WRF domain (see the sketch below)
– Expensive and relatively computationally intense (~2 ops per word)
* Hong, S., J. Dudhia, and S. Chen (2004). Monthly Weather Review, 132(1).
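
Column independence is what makes the scheme attractive for accelerators: each (i, j) column can be assigned to one GPU thread. Below is a minimal CUDA sketch of that decomposition; the kernel name, field names, and memory layout are illustrative assumptions, not the actual WRF WSM5 port.

  // Hypothetical sketch: one thread per (i,j) column. Storing arrays with
  // the i index fastest means adjacent threads make coalesced accesses.
  // Not the real WSM5 CUDA code; names and layout are assumptions.
  __global__ void wsm5_columns(float *t, float *qv, int nx, int ny, int nz)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   // column position in x
      int j = blockIdx.y * blockDim.y + threadIdx.y;   // column position in y
      if (i >= nx || j >= ny) return;                  // guard partial blocks

      for (int k = 0; k < nz; k++) {                   // sweep up the column
          int idx = (k * ny + j) * nx + i;             // i varies fastest
          // ... condensation/precipitation/latent-heat updates on t[idx],
          //     qv[idx], and the other tracers would go here ...
          if (qv[idx] < 0.0f) qv[idx] = 0.0f;          // placeholder: clamp moisture
      }
  }

  // Launch over the horizontal domain, e.g.:
  //   dim3 block(64, 2), grid((nx + 63) / 64, (ny + 1) / 2);
  //   wsm5_columns<<<grid, block>>>(d_t, d_qv, nx, ny, nz);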

WSM5 Microphysics
[Figure: WSM5 performance results; contributed by Roman Dubtsov, Intel]

WSM5 Microphysics
– CUDA version distributed with WRFV3
– Users have seen speedups (case/system dependent)
– Makes other parts of the code run faster (!)
– PGI has implemented it with 9.0 acceleration directives and seen comparable speedups, with overheads from transfer cost
[Figure: WRF CONUS 12 km benchmark, total seconds vs. microphysics seconds; courtesy Brent Leback and Craig Toepfer, PGI]

Kernel: WRF-Chem
WRF model coupled to atmospheric chemistry* for air quality research and air pollution forecasting
– Time evolution and advection of tens to hundreds of chemical species, produced and consumed at varying rates in networks of reactions
– Many times the cost of the core meteorology; seemingly ideal for acceleration
* Grell et al., WRF Chem Version 3.0 User's Guide.
** Hairer, E. and G. Wanner. Solving ODEs II: Stiff and Differential-Algebraic Problems, Springer.
*** Damian et al. (2002). Computers & Chemical Engineering, 26.

Kernel: WRF-Chem
– Rosenbrock** solver for stiff system of ODEs at each cell (see the stage formula below)
– Computation at each cell independent: perfectly parallel
– Solver itself is not parallelizable
– 600K fp ops per cell (compare to 5K ops/cell for meteorology)
– 1 million load-stores per cell
– Very large footprint: 15 KB of state at each cell
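
For reference, one common formulation of an s-stage Rosenbrock method for y' = f(y) (following Hairer and Wanner, cited above) computes stage vectors k_i, i = 1..s, from

  (I - h\,\gamma_{ii} J)\, k_i = h\, f\Big(y_n + \sum_{j<i} \alpha_{ij} k_j\Big) + h\, J \sum_{j<i} \gamma_{ij} k_j, \qquad y_{n+1} = y_n + \sum_{i=1}^{s} b_i k_i

where J = ∂f/∂y is the Jacobian of the chemical rate function. Each stage requires a linear solve against (I - hγ_ii J), which accounts for the large per-cell flop and load-store counts; and because stage i depends on the earlier k_j, the stages run sequentially, which is why the solver itself is serial and the exploitable parallelism is across the independent cells.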

Kernel: WRF-Chem
– KPP generates a Fortran solver called at each grid cell in the 3D domain
– Multi-core implementation: insert OpenMP directives and multithread the loop over grid cells

  !$OMP PARALLEL DO
  DO J = 1, ...
    DO K = 1, ...
      DO I = 1, ...
        ! call the KPP-generated solver at cell (I,K,J)

Kernel: WRF-Chem
Cell BE implementation
– PPU acts as master, invoking SPEs cell-by-cell
– SPEs round-robin through the domain, triple buffering
– Enhancement: SPEs process cells in blocks of four, with the cell index innermost in each block to utilize SIMD

Kernel: WRF-Chem
GPU implementation (see the sketch below)
– Host CPU controls the outer loop over steps in the Rosenbrock algorithm
– Each step implemented as a kernel over all cells in the domain
– Thread-shared memory utilized where possible (but not much)
– Cells are masked out of the computation as they reach convergence
– Cell index (thread index) innermost for coalesced access to device memory
[Diagram: CPU driving repeated kernel invocations on the GPU]
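
A minimal CUDA sketch of that structure follows; the kernel and variable names are hypothetical, not the actual WRF-Chem code. The host drives the loop over Rosenbrock steps, each step is a kernel over every cell, and a mask lets converged cells drop out.

  // Hypothetical sketch; names and data layout are assumptions.
  __global__ void ros_stage(float *state, const int *converged, int ncells)
  {
      int cell = blockIdx.x * blockDim.x + threadIdx.x;  // cell index innermost
      if (cell >= ncells || converged[cell]) return;     // mask converged cells
      // With per-species arrays laid out cell-index fastest, consecutive
      // threads touch consecutive addresses (coalesced device-memory access).
      // ... one Rosenbrock stage on this cell's chemical state ...
      state[cell] += 0.0f;                               // placeholder update
  }

  void rosenbrock_on_gpu(float *d_state, int *d_converged, int ncells, int nstages)
  {
      int block = 128;
      int grid  = (ncells + block - 1) / block;
      for (int stage = 0; stage < nstages; stage++)      // outer loop stays on CPU
          ros_stage<<<grid, block>>>(d_state, d_converged, ncells);
      cudaDeviceSynchronize();                           // wait for the last step
  }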

Chemistry Performance on GPU
[Figure: chemistry kernel GPU performance results]

Some preliminary conclusions
Chemistry kinetics
– Very expensive but not computationally intense, so data movement costs are high and the memory footprint is very large (15 KB per cell)
– Each Cell BE core has a local store large enough to allow effective overlap of computation and communication
– Today's GPUs have ample device memory bandwidth, but not enough fast local memory for the working set
– Xeon 5500 and Power6 have sufficient cache, concurrency, and bandwidth
WRF microphysics
– More computationally intense, so the GPU has an edge, but the Xeon is closing
– PCIe transfer costs tip the balance to the Xeon, but this can be addressed
– Haven't tried it on Cell
In all cases, the conventional multi-core CPU is easier to program, debug, and optimize.
Garcia, J., R. Kelly, and T. Voran. Computing Spectropolarimetric Signals on Accelerator Hardware: Comparing the Cell BE and NVIDIA GPUs. Proceedings of the 2009 LCI Conference, March 2009, Boulder, CO.

Accelerators for Weather and Climate?
– Considerable conversion effort and maintenance issues, especially for large legacy codes. What speedup justifies the effort?
– What limits speedups?
  – Fast, close, and large enough memory resources
  – Distance from the host processor
  – A moving baseline: CPUs keep getting faster
– Can newer generations of accelerators address these limits?
  – Technically: probable
  – Business case: ???