Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System. Alan Humphrey, Qingyu Meng, Martin Berzins, Todd Harman. Scientific Computing and Imaging Institute, University of Utah.

Presentation transcript:

Radiation Modeling Using the Uintah Heterogeneous CPU/GPU Runtime System
Alan Humphrey, Qingyu Meng, Martin Berzins, Todd Harman
Scientific Computing and Imaging Institute, University of Utah
I. Uintah Overview
II. Emergence of Heterogeneous Systems
III. Modifying the Uintah Runtime: CPU-GPU Scheduler
IV. ARCHES Combustion Simulation Component
V. Developing a Radiation Transport Model
VI. Results, Future Work and Conclusion
Thanks to: John Schmidt, Jeremy Thornock, Isaac Hunsaker, J. Davison de St. Germain; Justin Luitjens and Steve Parker, NVIDIA
Funding and resources: DOE for funding the CSAFE project, DOE NETL, DOE NNSA, INCITE; NSF for funding via SDCI and PetaApps; Keeneland Computing Facility, supported by NSF under Contract OCI; Oak Ridge Leadership Computing Facility for access to TitanDev

Uintah Overview
- Parallel, adaptive multi-physics framework
- Fluid-structure interaction problems
- Patch-based AMR using particles and a mesh-based fluid solver
[Example applications pictured: Virtual Soldier, Shaped Charges, Industrial Flares, Plume Fires, Explosions, Foam Compaction, Angiogenesis, Sandstone Compaction]

Uintah - Scalability
- 256K cores on Jaguar XK6: 95% weak scaling efficiency and 60% strong scaling efficiency
- Multi-threaded MPI: shared-memory model on node [1]
- Scalable, efficient, lock-free data structures [2]
- Patch-based domain decomposition
- Asynchronous task-based paradigm
[Figure: scaling plot vs. number of cores]
1. Q. Meng, M. Berzins, and J. Schmidt. "Using Hybrid Parallelism to Improve Memory Use in the Uintah Framework." In Proc. of the 2011 TeraGrid Conference (TG11), Salt Lake City, Utah, 2011.
2. Q. Meng and M. Berzins. "Scalable Large-scale Fluid-structure Interaction Solvers in the Uintah Framework via Hybrid Task-based Parallelism Algorithms." Concurrency and Computation: Practice and Experience, 2012, submitted.

Emergence of Heterogeneous Systems
Motivation: accelerate Uintah components
- Utilize all on-node computational resources
- Uintah's asynchronous task-based approach is well suited to take advantage of GPUs
- Natural progression: GPU tasks
Target systems: Keeneland Initial Delivery System (360 GPUs); DOE Titan (1000s of GPUs)
[Diagram: node with multi-core CPU + NVIDIA M2070/90 Tesla GPU]

When extending a general computational framework with over 700K lines of code to GPUs, where does one start? Uintah's asynchronous task-based approach makes this surprisingly manageable.

Other Graph-Based Applications
- Charm++: object-based virtualization
- Intel CnC: language for graph-based parallelism
- PLASMA & MAGMA (Dongarra): DAG-based parallel linear algebra software
[Figure: example task graph with numbered nodes]

Uintah Task-Based Approach
- Task graph: directed acyclic graph (DAG)
- Asynchronous, out-of-order execution of tasks is the key idea
- Task: the basic unit of work, a C++ method with computation
- Allows Uintah to be generalized to support accelerators
- Overlap communication and computation
The GPU extension is realized without massive, sweeping code changes:
- Extend the Task class with CPU and GPU callbacks (see the sketch below)
- Design GPU task queue data structures
- Infrastructure handles device API details
- Mechanism to detect completion of asynchronous operations
- Write GPU kernels for appropriate CPU tasks
[Figure: 4-patch, single-level ICE task graph]
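The slides name the mechanism but not the code; the following is a minimal illustrative sketch of a task type carrying both CPU and GPU callbacks. All type names and signatures here are assumptions, not Uintah's actual API:

```cpp
// Hypothetical sketch, not Uintah's actual API: a task that carries both a
// CPU and a GPU callback so the runtime can dispatch it to either resource.
#include <functional>
#include <string>
#include <utility>

struct PatchSubset;    // stand-ins for Uintah's patch and data-warehouse
struct DataWarehouse;  // abstractions (assumed names)

class Task {
public:
  using CpuCallback = std::function<void(const PatchSubset*, DataWarehouse*)>;
  using GpuCallback =
      std::function<void(const PatchSubset*, DataWarehouse*, int /*device*/)>;

  Task(std::string name, CpuCallback cpu)
      : name_(std::move(name)), cpu_(std::move(cpu)) {}

  // Registering a GPU callback marks the task as GPU-capable.
  void setGpuCallback(GpuCallback gpu) { gpu_ = std::move(gpu); }
  bool usesDevice() const { return static_cast<bool>(gpu_); }

  // The scheduler invokes exactly one of these per task execution.
  void runCPU(const PatchSubset* p, DataWarehouse* dw) { cpu_(p, dw); }
  void runGPU(const PatchSubset* p, DataWarehouse* dw, int device) {
    gpu_(p, dw, device);
  }

private:
  std::string name_;
  CpuCallback cpu_;
  GpuCallback gpu_;  // empty: CPU-only task
};
```

Keeping the GPU path as an optional second callback means existing CPU tasks run unchanged, which matches the slide's point that no sweeping code changes are needed.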

NVIDIA Fermi Overview
- Host memory to device memory: max 8 GB/s (PCIe)
- Device memory to cores: 144 GB/s
- Memory-bound applications must hide PCIe latency
[Diagram: host-to-device and device-memory bandwidths, 8 GB/s vs. 144 GB/s]
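As a back-of-the-envelope illustration (the patch size is assumed, not from the slides): a 64^3 patch of doubles is about 2.1 MB, so

\[
t_{\mathrm{PCIe}} \approx \frac{2.1\ \mathrm{MB}}{8\ \mathrm{GB/s}} \approx 0.26\ \mathrm{ms},
\qquad
t_{\mathrm{device}} \approx \frac{2.1\ \mathrm{MB}}{144\ \mathrm{GB/s}} \approx 15\ \mu\mathrm{s},
\]

meaning one host-device copy costs roughly 18x a full on-device pass over the same data unless the copy is overlapped with computation.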

Fluid Solver Code (ICE)
- FirstOrderAdvector operators account for a significant portion of runtime (~20%)
- Highly structured calculations: stencil operations and other SIMD constructs
- Map well onto the GPU: high FLOPs-to-byte ratio
[Profile generated by Google's profiling tool, visualized with KCachegrind]

Results – Without Optimizations
- GPU performance for stencil-based operations: ~2x over the multi-core CPU equivalent for realistic patch sizes
- Significant speedups for large patch sizes only
- Worth pursuing, but optimizations are needed: hide PCIe latency with asynchronous memory copies

Hiding PCIe Latency
- NVIDIA CUDA asynchronous API
- Asynchronous functions allow memcopies to proceed asynchronously with the CPU, and a kernel to execute concurrently with a memcopy
- Stream: a sequence of operations that execute in order on the GPU; operations from different streams can be interleaved
[Timeline diagram: data transfer / kernel execution / data transfer, normal vs. page-locked memory]
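A minimal sketch of this pattern (illustrative, not Uintah code): page-locked host buffers plus a stream make the host-to-device copy, kernel launch, and device-to-host copy asynchronous with respect to the CPU thread that enqueues them.

```cpp
#include <cuda_runtime.h>
#include <cstring>

__global__ void scaleKernel(double* d, int n) {  // stand-in for a stencil kernel
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] *= 2.0;
}

void processPatchAsync(const double* hostSrc, double* hostDst, int n) {
  size_t bytes = n * sizeof(double);
  double *pinnedIn, *pinnedOut, *dev;
  cudaMallocHost((void**)&pinnedIn, bytes);   // page-locked memory is required
  cudaMallocHost((void**)&pinnedOut, bytes);  // for truly asynchronous copies
  cudaMalloc((void**)&dev, bytes);
  std::memcpy(pinnedIn, hostSrc, bytes);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // The three calls below return immediately; they execute in order on the
  // GPU, and work issued to other streams may be interleaved with them.
  cudaMemcpyAsync(dev, pinnedIn, bytes, cudaMemcpyHostToDevice, stream);
  scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
  cudaMemcpyAsync(pinnedOut, dev, bytes, cudaMemcpyDeviceToHost, stream);

  cudaStreamSynchronize(stream);  // a real scheduler would poll, not block
  std::memcpy(hostDst, pinnedOut, bytes);

  cudaStreamDestroy(stream);
  cudaFree(dev);
  cudaFreeHost(pinnedIn);
  cudaFreeHost(pinnedOut);
}
```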

Multi-Threaded CPU Scheduler

Multi-Threaded CPU-GPU Scheduler

Multistage Task Queue Architecture
- GPU task queues overlap computation with PCIe transfers and MPI communication
- Automatically handles device memory operations and stream management
- Enables Uintah to "pre-fetch" GPU data
- Queries the task graph for each task's data requirements
A sketch of the queue mechanics follows.
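The slides do not show the queue implementation, so the following is a minimal sketch of the stated idea with all structure assumed: a task whose asynchronous host-to-device copies were issued on its own stream advances to a ready queue once cudaStreamQuery reports the stream has drained.

```cpp
#include <cuda_runtime.h>
#include <queue>

struct GpuTask {
  cudaStream_t stream;  // all async ops for this task go to this stream
  // ... device pointers, kernel arguments, etc.
};

std::queue<GpuTask*> pendingCopies;   // H2D transfers in flight ("pre-fetch")
std::queue<GpuTask*> readyToExecute;  // inputs resident on the device

void advanceQueues() {
  std::size_t n = pendingCopies.size();
  for (std::size_t i = 0; i < n; ++i) {
    GpuTask* t = pendingCopies.front();
    pendingCopies.pop();
    if (cudaStreamQuery(t->stream) == cudaSuccess)
      readyToExecute.push(t);  // copies finished: kernels may launch
    else
      pendingCopies.push(t);   // still in flight: re-queue without blocking
  }
}
```

Polling with cudaStreamQuery rather than synchronizing keeps the scheduler thread free to issue MPI receives and CPU tasks while transfers complete.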

Uintah CPU-GPU Scheduler Abilities
Now able to run capability jobs on:
- Keeneland Initial Delivery System (NICS): 1440 CPU cores and 360 GPUs simultaneously, 3 GPUs per node
- TitanDev, the Jaguar XK6 GPU partition (OLCF): 15,360 CPU cores and 960 GPUs simultaneously, 1 GPU per node
- Shown speedups on fluid-solver code
- High degree of node-level parallelism

ARCHES Combustion Component
- Designed for simulating turbulent reacting flows with participating-media radiation
- Heat, mass, and momentum transport
- 3D Large Eddy Simulation (LES) code
- Used to evaluate large clean coal boilers that alleviate CO2 concerns
- ARCHES is massively parallel and highly scalable through its integration with Uintah

Exascale Problem: Design of Alstom Clean Coal Boilers
- LES resolution needed for a 350 MW boiler problem: 1 mm per side for each computational volume, roughly 9 x 10^12 cells
- Based on initial runs, simulating this problem in 48 hours of wall-clock time requires millions of fast cores
(Professor Phil Smith, ICSE, Utah)
[Figure: O2 concentrations in a clean coal boiler]

Developing a Uintah Radiation Model
The ARCHES combustion component needs to approximate the radiative transfer equation. Methods considered:
- Discrete Ordinates Method (DOM)
- Reverse Monte Carlo Ray Tracing (RMCRT)
Both solve the same equation (see below).
- DOM: slow and expensive (requires solving linear systems), and it is difficult to add more complex radiation physics (specifically scattering)
- RMCRT: faster due to ray decomposition, and naturally incorporates physics such as scattering with ease; no linear solve
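The slide's equation image did not survive extraction. For reference, a standard steady-state form of the radiative transfer equation that both methods discretize is (notation assumed, not taken from the slide):

\[
\frac{dI(\mathbf{x},\hat{s})}{ds}
= k\,I_b(\mathbf{x})
- \left(k + \sigma_s\right) I(\mathbf{x},\hat{s})
+ \frac{\sigma_s}{4\pi}\int_{4\pi} I(\mathbf{x},\hat{s}')\,\Phi(\hat{s}',\hat{s})\,d\Omega',
\]

where \(I\) is the radiative intensity along direction \(\hat{s}\), \(I_b\) the blackbody intensity, \(k\) the absorption coefficient, \(\sigma_s\) the scattering coefficient, and \(\Phi\) the scattering phase function. The scattering integral is the term DOM struggles with and RMCRT absorbs naturally into its ray sampling.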

Reverse Monte Carlo Ray Tracing (RMCRT)
- Lends itself to scalable parallelism: the intensities of each ray are mutually exclusive, so multiple rays can be traced simultaneously at any given cell and time step
- Rays are traced backwards from the computational cell, eliminating the need to track ray bundles that never reach that cell
[Figure: back path of a ray from S to the emitter E on a nine-cell structured mesh patch]

ARCHES GPU-Based RMCRT
- The RayTrace task is computationally intensive and ideal for SIMD parallelization: rays are mutually exclusive and can be traced simultaneously
- Offload ray tracing and RNG to the GPU(s); available CPU cores can perform other computation
- RNG states live on the device, one per thread

Random Number Generation
- Uses the NVIDIA cuRAND library: high-performance GPU-accelerated random number generation (RNG)
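The slides confirm cuRAND with one RNG state per thread; a minimal sketch of that setup using cuRAND's device API (the kernel shape is assumed) is:

```cpp
#include <curand_kernel.h>

__global__ void initRNG(curandState* states, unsigned long long seed) {
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  // Same seed, distinct subsequence per thread: statistically independent
  // random-number streams with no inter-thread correlation.
  curand_init(seed, id, 0, &states[id]);
}
```

Each thread later draws from its own state with, e.g., curand_uniform_double(&states[id]), as in the kernel sketch below.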

RMCRT Kernel - General Approach
- Tile a 2D slice with 2D threadblocks; slice in the two fastest dimensions, x and y
- Each thread iterates along the slowest dimension: it is responsible for one set of rays per cell in every z-slice
- Single kernel launch (minimizes overhead); good data reuse
- GPU-side RNG (cuRAND library)
[Figure: Uintah patch traversal]
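A hypothetical skeleton of this launch pattern, reusing the per-thread cuRAND states initialized earlier; the array names (divQ, sigmaT4) and the stand-in ray loop are illustrative, not Uintah's actual kernel:

```cpp
#include <curand_kernel.h>

__global__ void rayTraceKernel(double* divQ, const double* sigmaT4,
                               int nx, int ny, int nz, int raysPerCell,
                               curandState* states) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;  // fastest dimension
  int y = blockIdx.y * blockDim.y + threadIdx.y;  // second dimension
  if (x >= nx || y >= ny) return;

  int id = y * nx + x;
  curandState rng = states[id];  // one RNG state per thread

  for (int z = 0; z < nz; ++z) {  // each thread walks the slowest dimension
    double sumI = 0.0;
    for (int r = 0; r < raysPerCell; ++r) {
      // Placeholder for sampling a random direction and marching the ray
      // backwards through cells, accumulating intensity along its path.
      double u = curand_uniform_double(&rng);
      sumI += u * sigmaT4[(z * ny + y) * nx + x];
    }
    divQ[(z * ny + y) * nx + x] = sumI / raysPerCell;
  }
  states[id] = rng;  // persist RNG state across kernel launches
}
```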

GPU RMCRT Speedup Results (Single Node): Single CPU Core vs. Single GPU
[Table: Machine | Rays | CPU (sec) | GPU (sec) | Speedup (x); rows: Keeneland, 1 Intel core; TitanDev, 1 AMD core. Numeric values not preserved.]
GPU: NVIDIA M2090. Keeneland CPU core: Intel Xeon X5660. TitanDev CPU core: AMD Opteron 6200.

GPU RMCRT Speedup Results (Single Node): All CPU Cores vs. Single GPU
[Table: Machine | Rays | CPU (sec) | GPU (sec) | Speedup (x); rows: Keeneland, 12 Intel cores; TitanDev, 16 AMD cores. Numeric values not preserved.]
GPU: NVIDIA M2090. Keeneland CPU cores: Intel Xeon X5660. TitanDev CPU cores: AMD Opteron 6200.

GPU-Based RMCRT Scalability
- Mean time per timestep is lower for the GPU than the CPU (up to 64 GPUs)
- The GPU implementation quickly runs out of work
- The all-to-all nature of the problem limits the size that can be computed, due to memory constraints with large, highly resolved physical domains
[Figure: strong scaling results for the CPU and GPU implementations of RMCRT on TitanDev]

Addressing RMCRT Scalability
- Use a coarser representation of the computational domain, with multiple levels
- Define a region of interest (ROI) and surround it with successively coarser grids
- As rays travel away from the ROI, the stride taken between cells becomes larger
- This reduces both computational cost and memory usage (one possible stride rule is sketched below)
[Figure: multi-level scheme]
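Illustrative only: one way to realize "stride grows away from the ROI" is to double the cell stride each time a ray crosses into a coarser shell. The shell width and the doubling rule here are assumptions, not values from the slides.

```cpp
__host__ __device__ inline double strideForDistance(double distFromROI,
                                                    double shellWidth,
                                                    double fineStride) {
  int level = 0;
  while (distFromROI > shellWidth && level < 8) {  // cap the nesting depth
    distFromROI -= shellWidth;
    shellWidth *= 2.0;   // each coarser shell is wider than the last
    ++level;
  }
  return fineStride * double(1 << level);  // stride doubles per level
}
```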

Future Work
- Scheduler/infrastructure: decentralized task scheduler; GPU affinity for multi-socket/multi-GPU nodes; support for Intel MIC (Xeon Phi)
- Radiation transport model (RMCRT): scalability; continue the multi-level, data onion approach

Conclusion
- The task-based paradigm makes extending Uintah to heterogeneous systems manageable
- Radiation modeling is a difficult problem with three fundamental scalability barriers: extremely expensive computation, an all-to-all communication pattern, and an exorbitant memory footprint
- Offloading the computationally intensive RMCRT to the GPU addresses the computational expense
- The multi-level, data onion approach begins to address the communication issues
- Uintah's multi-threaded runtime system directly addresses the reduction in memory footprint

Questions? Software Download