NAS Parallel Benchmarks on GPGPUs using a Directive-based Programming Model
Presented by Rengan Xu (rxu6@uh.edu), LCPC 2014, 09/16/2014
Rengan Xu, Xiaonan Tian, Sunita Chandrasekaran, Yonghong Yan, Barbara Chapman
HPC Tools group (http://web.cs.uh.edu/~hpctools/)
Department of Computer Science, University of Houston

Outline
- Motivation
- Overview of OpenACC and NPB benchmarks
- Parallelization and Optimization Techniques
- Performance Evaluation
- Conclusion and Future Work

Motivation
- Provide an open-source OpenACC compiler
- Evaluate our open-source OpenACC compiler with real applications; NPB is a benchmark suite close to real applications
- Identify parallelization techniques that improve performance without losing portability

Overview of OpenACC
- A standard, high-level, directive-based programming model for accelerators
- OpenACC 2.0 released in late 2013
- Data directives: copy/copyin/copyout/...
- Data synchronization directive: update
- Compute directives:
  - parallel: more control to the user
  - kernels: more control to the compiler
- Three levels of parallelism: gang, worker, vector
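
To make these constructs concrete, here is a minimal sketch (not from the talk) combining a data directive with a kernels compute directive and all three levels of parallelism:

    /* Sketch: y = a*x + y on the accelerator. */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
        /* Data directive: copy x to the device; copy y in and back out. */
        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            /* Compute directive; gang, worker and vector name the three
               levels of parallelism OpenACC exposes. */
            #pragma acc kernels loop gang worker vector
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }
    }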

OpenUH: An Open Source OpenACC Compiler
Link: http://web.cs.uh.edu/~openuh/
Compilation flow (shown as a diagram on the slide):
- Source code with OpenACC directives → FRONTENDS (C, OpenACC) → IPA (Inter-Procedural Analyzer) → PRELOWER (preprocess OpenACC) → LOWER (transformation of OpenACC) → LNO (Loop Nest Optimizer) → WOPT (Global Scalar Optimizer) → WHIRL2CUDA
- GPU path: GPU code → NVCC compiler → PTX assembler, loaded dynamically through the runtime library
- CPU path: CG (code generation for IA-32, IA-64, x86_64) → Linker → CPU binary → executable
- The pre-lower phase verifies region correctness
- The lower phase transforms the OpenACC directives into IR

NAS Parallel Benchmarks (NPB)
- Well recognized for evaluating current and emerging multi-core/many-core hardware architectures
- 5 parallel kernels: IS, EP, CG, MG and FT
- 3 simulated computational fluid dynamics (CFD) applications: LU, SP and BT
- Different problem sizes:
  - Class S: small, for quick test purposes
  - Class W: workstation size
  - Class A: standard test problem
  - Class E: largest test problem, ~16x larger than the previous problem size

Steps to parallelize an application
1. Profile to find the hotspots
2. Analyze compute-intensive loops to make them parallelizable
3. Add compute directives to these loops
4. Add data directives to manage data motion and synchronization
5. Optimize data structures and array access patterns
6. Apply loop scheduling tuning
7. Apply other optimizations, e.g. async and cache

Parallelization and Optimization Techniques
- Array privatization
- Loop scheduling tuning
- Memory coalescing optimization
- Data motion optimization
- Cache optimization
- Array reduction optimization
- Scan operation optimization

Array Privatization
Before array privatization (has a data race):

    #pragma acc kernels
    for(k=0; k<=grid_points[2]-1; k++){
      for(j=0; j<grid_points[1]-1; j++){
        for(i=0; i<grid_points[0]-1; i++){
          for(m=0; m<5; m++){
            rhs[j][i][m] = forcing[k][j][i][m];
          }
        }
      }
    }

After array privatization (no data race, but increased memory use):

    #pragma acc kernels
    for(k=0; k<=grid_points[2]-1; k++){
      for(j=0; j<grid_points[1]-1; j++){
        for(i=0; i<grid_points[0]-1; i++){
          for(m=0; m<5; m++){
            rhs[k][j][i][m] = forcing[k][j][i][m];
          }
        }
      }
    }

Notes:
- Array privatization takes data that is common or shared among parallel tasks and duplicates it, so that each parallel task operates on its own private copy
- Compiler support for automatic privatization is not yet mature enough to guarantee correct results
- Thousands of threads can easily cause memory overflow
- It limits us to applying optimizations only to this kernel

Loop Scheduling Tuning
Before tuning:

    #pragma acc kernels
    for(k=0; k<=grid_points[2]-1; k++){
      for(j=0; j<grid_points[1]-1; j++){
        for(i=0; i<grid_points[0]-1; i++){
          for(m=0; m<5; m++){
            rhs[k][j][i][m] = forcing[k][j][i][m];
          }
        }
      }
    }

After tuning:

    #pragma acc kernels loop gang
    for(k=0; k<=grid_points[2]-1; k++){
      #pragma acc loop worker
      for(j=0; j<grid_points[1]-1; j++){
        #pragma acc loop vector
        for(i=0; i<grid_points[0]-1; i++){
          for(m=0; m<5; m++){
            rhs[k][j][i][m] = forcing[k][j][i][m];
          }
        }
      }
    }

Memory Coalescing Optimization (loop interchange)
Non-coalesced memory access:

    #pragma acc kernels loop gang
    for(j=1; j<=gp12; j++){
      #pragma acc loop worker
      for(i=1; i<=gp02; i++){
        #pragma acc loop vector
        for(k=0; k<=ksize; k++){
          fjacZ[0][0][k][i][j] = 0.0;
        }
      }
    }

Coalesced memory access (after loop interchange):

    #pragma acc kernels loop gang
    for(k=0; k<=ksize; k++){
      #pragma acc loop worker
      for(i=1; i<=gp02; i++){
        #pragma acc loop vector
        for(j=1; j<=gp12; j++){
          fjacZ[0][0][k][i][j] = 0.0;
        }
      }
    }

Memory Coalescing Optimization (data layout change)
Non-coalesced memory access:

    #pragma acc kernels loop gang
    for(k=0; k<=grid_points[2]-1; k++){
      #pragma acc loop worker
      for(j=0; j<grid_points[1]-1; j++){
        #pragma acc loop vector
        for(i=0; i<grid_points[0]-1; i++){
          for(m=0; m<5; m++){
            rhs[k][j][i][m] = forcing[k][j][i][m];
          }
        }
      }
    }

Coalesced memory access (after changing the data layout):

    #pragma acc kernels loop gang
    for(k=0; k<=grid_points[2]-1; k++){
      #pragma acc loop worker
      for(j=0; j<grid_points[1]-1; j++){
        #pragma acc loop vector
        for(i=0; i<grid_points[0]-1; i++){
          for(m=0; m<5; m++){
            rhs[m][k][j][i] = forcing[m][k][j][i];
          }
        }
      }
    }

Data Movement Optimization
- In NPB, most benchmarks contain many global arrays that live throughout the entire program
- Allocate device memory for them once, at the beginning
- Use the update directive to synchronize data between host and device
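
A hedged sketch of this pattern with hypothetical array names: device copies are created once, and the update directive synchronizes only when the host actually needs the data:

    /* Sketch (hypothetical names): u and rhs are N-element arrays that
       live on the device for the whole time-step loop. */
    void time_march(double *u, double *rhs, int N, int nsteps)
    {
        #pragma acc data copyin(u[0:N]) create(rhs[0:N])
        {
            for (int step = 0; step < nsteps; step++) {
                /* device kernels operating on u and rhs */
                #pragma acc kernels loop
                for (int i = 0; i < N; i++)
                    rhs[i] = 0.5 * u[i];          /* placeholder computation */

                /* synchronize only when the host needs the data */
                #pragma acc update host(u[0:N])   /* device -> host */
                /* ... host-side work on u (not shown) ... */
                #pragma acc update device(u[0:N]) /* host -> device */
            }
        }
    }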

Cache Optimization
- Utilize the read-only data cache of the Kepler GPU
  - High bandwidth and low latency
  - Full-speed unaligned memory access
- The compiler annotates read-only data automatically: it scans the offloaded computation region and extracts a read-only data list
- Alias issue: users need to give more information to the compiler
  - kernels region: use the independent clause on the loop directive
  - parallel region: users take full responsibility for controlling the transformation
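
For the kernels-region case above, the aliasing information can be supplied like this (a sketch; the array names are illustrative):

    /* Declaring the iterations independent lets the compiler treat
       coeff and in as read-only within the region, so their loads can
       be served by the Kepler read-only data cache. */
    void scale(int n, const double *restrict coeff,
               const double *restrict in, double *restrict out)
    {
        #pragma acc kernels copyin(coeff[0:n], in[0:n]) copyout(out[0:n])
        #pragma acc loop independent
        for (int i = 0; i < n; i++)
            out[i] = coeff[i] * in[i];
    }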

Array Reduction Optimization
- Array reduction issue: every element of an array needs a reduction
- The slide compares three code figures (not preserved in this transcript): (a) an OpenMP solution, (b) OpenACC solution 1, and (c) OpenACC solution 2
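
Since the slide's code figures did not survive the transcript, here is a hedged sketch of the general idea behind such an OpenACC workaround: give every thread its own slice of the array, then reduce the slices in a second kernel. NQ and NUM_THREADS are illustrative constants, not values from the talk:

    #define NQ 10            /* illustrative: number of array elements */
    #define NUM_THREADS 4096 /* illustrative: number of parallel tasks */

    /* qq[t][m] holds thread t's partial sums, filled by the main kernel
       (not shown); this second kernel folds them into q[m]. */
    void reduce_q(double qq[NUM_THREADS][NQ], double q[NQ])
    {
        #pragma acc kernels loop gang copyin(qq[0:NUM_THREADS][0:NQ]) \
                                      copyout(q[0:NQ])
        for (int m = 0; m < NQ; m++) {
            double sum = 0.0;
            #pragma acc loop vector reduction(+:sum)
            for (int t = 0; t < NUM_THREADS; t++)
                sum += qq[t][m];
            q[m] = sum;
        }
    }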

Scan Operation Optimization
- Input, inclusive scan output, and exclusive scan output (example arrays shown on the original slide)
- In-place scan: the input array and the output array are the same
- Proposed scan clause extension:
  #pragma acc loop scan(operator:in-var,out-var,identity-var,count-var)
- By default the scan is not in-place, i.e. the input and output arrays are different
- For an inclusive scan, the identity value is ignored
- For an exclusive scan, the user has to specify the identity value
- For an in-place inclusive scan, the user must pass IN_PLACE as in-var
- For an in-place exclusive scan, the user must pass IN_PLACE as in-var and specify the identity value; the identity value must equal the first value of the provided array
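
As an illustration of the proposed syntax only (the clause shape comes from the slide; the loop-body form is our assumption), an exclusive, out-of-place prefix sum over n elements with identity 0 might be written:

    /* Proposed (non-standard) clause; in and out are distinct arrays. */
    #pragma acc kernels
    #pragma acc loop scan(+:in,out,0,n)
    for (int i = 0; i < n; i++)
        out[i] = in[i];   /* body shape is an assumption */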

Performance Evaluation
- 16-core Intel Xeon E5-2640 x86_64 CPU with 32 GB memory
- NVIDIA Kepler K20 GPU with 5 GB memory
- NPB 3.3 C version [1]
- GCC 4.4.7 and OpenUH compilers
- Compared against serial, OpenCL and CUDA versions
[1] http://aces.snu.ac.kr/Center_for_Manycore_Programming/SNU_NPB_Suite.html

Performance Evaluation of OpenUH OpenACC NPB - compared to serial
- FT Class C is not executed because GPU memory is insufficient
- IS Class C speedup is lower because of contention for the buckets

Performance Evaluation of OpenUH OpenACC NPB - effectiveness of optimizations
- CG benefits from the cache optimization
- FT: the AoS-to-SoA transformation enables memory coalescing
- LU and BT improved 50% and 13%, respectively, from the cache optimization because they reuse read-only data
- LU, BT and SP benefit from coalesced memory access, because the data layout of the CPU code is not coalesced for the GPU
- Loop scheduling tuning is important for MG, BT and SP

Performance Evaluation of OpenUH OpenACC NPB - OpenACC vs. CUDA [1]
- OpenACC reaches 72%-87%, 86%-96%, and 72%-75% of the CUDA performance (ranges annotated on the slide's chart); the performance gap is small
- The gap arises because each thread needs a small array: OpenACC uses array privatization, which places these arrays in global memory, whereas in CUDA they are allocated in registers or spilled to the L1 cache
[1] http://www.tu-chemnitz.de/informatik/PI/forschung/download/npb-gpu
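
To picture the mechanism (illustrative code, not from the talk): after privatization the OpenACC version indexes per-thread rows of a globally allocated scratch buffer, whereas hand-written CUDA keeps the same scratch array thread-local:

    /* OpenACC side: per-thread rows of a global-memory buffer. */
    void init_scratch(double (*scratch)[5], int nthreads)
    {
        #pragma acc kernels loop gang vector
        for (int t = 0; t < nthreads; t++)
            for (int m = 0; m < 5; m++)
                scratch[t][m] = 0.0;   /* global-memory traffic */
    }
    /* CUDA side (for contrast): a "double scratch[5];" declared inside
       the kernel lives in registers or spills to L1, staying on chip. */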

Performance Evaluation of OpenUH OpenACC NPB - OpenACC vs. OpenCL [1]
- EP: OpenACC is 50% slower than OpenCL. Array privatization raises global-memory usage beyond what the Kepler GPU provides, so we use a blocking algorithm that divides the data into chunks and processes them one by one, which requires launching the kernel many times. The OpenCL version uses shared memory, needs no privatization, and launches its kernel only once, giving it faster memory access and less kernel-launch overhead.
- CG: the difference comes from the cache optimization
- FT: we applied the AoS-to-SoA transformation; the OpenCL version did not
- MG: many routines use temporary data; OpenACC allocates and frees that memory dynamically in each routine
- BT Class C has no OpenCL result because the OpenCL code changed the program significantly: it allocates all data at the beginning and frees it at the end, whereas in the OpenACC code the lifetime of some data is confined to a subroutine, so that data is live only inside the routine and requires less global memory
- BT and SP: the OpenCL versions did not change the data layout, did not use the coalesced-memory-access optimization, and applied no loop fission; in their large kernels the loops execute sequentially
[1] http://aces.snu.ac.kr/SNU_NPB_Suite.html

Conclusion and Future Work
- Discussed different parallelization techniques for OpenACC
- Demonstrated the speedup of OpenUH OpenACC over serial code
- Compared the performance of OpenUH OpenACC with CUDA and OpenCL
- Contributed 4 NPB benchmarks to SPEC ACCEL V1.0 (released on March 18, 2014): http://www.hpcwire.com/off-the-wire/spechpg-releases-new-hpc-benchmark-suite/
- Looking forward to making more contributions to future SPEC ACCEL suites
Future work:
- Explore other optimizations
- Automate some of the optimizations in the OpenUH compiler
- Support Intel Xeon Phi and AMD GPU/APU