Employing Compression Solutions under OpenACC


Employing Compression Solutions under OpenACC
Ebad Salehi, Ahmad Lashgar and Amirali Baniasadi
Electrical and Computer Engineering Department, University of Victoria

This Work
- Motivation: effective memory bandwidth usage is needed to achieve high performance on GPUs.
- Our solution: compress floating-point numbers before transferring them.
- How: we propose a compression method and a set of OpenACC clauses.
- We achieve up to 1.36X speedup.

Outline
- Introduction & Background
- GPU Architecture
- OpenACC
- Clauses and the Compression Method
- Experimental Results
- Conclusion

Introduction
- GPUs offer promising peak performance, but it is hard to achieve in practice.
- The main problem is memory bandwidth: thousands of threads compete for data.
- Memory optimization calls for skill and experience; good GPU programs need extensive development effort.
- Frameworks such as OpenACC make this easier.

GPU Architecture
- A GPU consists of Streaming Multiprocessors (SMs); each SM contains Streaming Processors (SPs), load/store units, special function units, and an L1 cache.
- A group of 32 threads is called a warp, and each SM keeps a pool of warps.
- Threads in a warp execute instructions in lockstep, so they issue their memory requests at the same time.
(Slide figure: an SM with a warp pool, e.g. Warp 1 = T1..T4 and Warp 2 = T5..T8, feeding the load/store, int/float, and special function units and the L1 cache.)

GPU Memory Subsystem
- GPU DRAM (global memory) is accessed through a two-level cache hierarchy: per-SM L1 caches, an interconnect, and L2 slices next to the memory controllers.
- Memory bandwidth is a bottleneck for many applications, and bandwidth utilization depends on the access pattern.
- The smallest memory transaction is 32 bytes, so the goal is to reduce the number of memory transactions: a fully coalesced warp load of 32 four-byte words takes only four 32-byte transactions, while a scattered pattern can take up to 32.

OpenACC
- A directive-based extension to Fortran, C, and C++; the annotated code still compiles for a host-only target.
- Maps compute-intensive loops to accelerators, typically assigning each loop iteration to a thread.
- Execution is host-directed: the host invokes the kernels and the data-transfer functions, and the model manages data movement between the CPU (host) and the accelerator.
- Because the programmer adds directives instead of rewriting the source code, programming effort and time drop. The price is coarse-grained control: OpenACC is less flexible than CUDA/OpenCL, and complicated algorithms can be hard or impossible to implement.
- It is similar to OpenMP, except that the OpenACC model better suits accelerators capable of running a great number of lightweight threads.

OpenACC
Matrix-matrix multiplication example. The slide labels the parts of the annotated source: the pragma (transparent to host-only compilers), its directives, and its clauses.
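The code image itself is not in the transcript. A minimal sketch of a plain OpenACC matrix-matrix multiplication of the kind the slide shows; the array names and row-major layout are illustrative assumptions:

    /* a, b, c: n x n row-major float arrays (illustrative names) */
    #pragma acc data copyin(a[0:n*n], b[0:n*n]) copyout(c[0:n*n])
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;   /* each (i, j) pair can become one GPU thread */
        }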

Proposed Clauses
- Our goal is a method that alleviates the memory-contention problem and eases the work of programmers.
- We extend the OpenACC model with new clauses.

Proposed Clauses
- Data transfer compression clauses: instruct the compiler to apply the compression technique to the marked data sets at transfer time.
- Kernels compression clause: instructs the compiler to treat the marked data sets as compressed when generating kernels.
The two are kept separate so transfers and kernel generation can be marked independently; the examples below show each in turn.

Proposed Clauses
Data transfer clauses and their compressed counterparts:

    Regular Data   Compressed Data
    copyin         compression_copyin (ccopyin)
    copy           compression_copy (ccopy)
    copyout        compression_copyout (ccopyout)

The min and max bounds are optional; the reason for them is discussed later.

Example:
    #pragma acc data compression_copy(data[0:size:min:max])

Proposed Clauses
Compression clause on the kernels directive. Example:

    #pragma acc kernels compression(d1, d2, d3)
    {
        ... // kernels region
    }

Proposed Clauses
Matrix-matrix multiplication example with the proposed clauses (the slide's code image is sketched below).
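As a sketch only: the earlier OpenACC multiplication with the proposed clauses. The clause names follow the slides, while the value bounds (-1.0 and 1.0) and array names are illustrative assumptions:

    /* Assumed: a and b hold values in [-1, 1]; c stays uncompressed */
    #pragma acc data compression_copyin(a[0:n*n:-1.0:1.0], b[0:n*n:-1.0:1.0]) copyout(c[0:n*n])
    #pragma acc kernels compression(a, b)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];   /* operands decompressed on the fly */
            c[i*n + j] = sum;
        }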

Compression Method
- Applicable to floating-point data, 32- or 64-bit
- Flexible compression ratio
- Low overhead
- Lossy

Compression
IEEE 754 single-precision layout (reconstructed from the slide's figure): 1 sign bit, 8 exponent bits (bias 127), and 23 mantissa bits; a normalized value is (-1)^sign × 1.mantissa × 2^(exponent-127).

Compression Steps
This process is performed on the data set:
1. Divide by twice the maximum absolute value of the data set, mapping every value into (-0.5, +0.5).
2. Add 1.5, mapping every value into (1, 2).
3. Omit the exponent: every value in (1, 2) has the same biased exponent (127), so the exponent bits carry no information.
4. Omit the least significant mantissa bits. This last step is where most of the accuracy is lost.
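A minimal sketch of the per-element packing these steps describe, assuming a 16-bit compressed format. The helper name, the number of kept mantissa bits, and the boundary handling are illustrative, not the paper's implementation:

    #include <stdint.h>
    #include <string.h>

    /* Pack one float into 16 bits, given the data set's maximum absolute
       value. Mapped values land in [1, 2), whose IEEE 754 biased exponent
       is always 127, so only the top mantissa bits need to be stored. */
    static uint16_t compress_float(float v, float max_abs)
    {
        float mapped = v / (2.0f * max_abs) + 1.5f;  /* map into (1, 2) */
        uint32_t bits;
        memcpy(&bits, &mapped, sizeof bits);         /* reinterpret IEEE bits */
        uint32_t mantissa = bits & 0x7FFFFFu;        /* drop sign + exponent */
        return (uint16_t)(mantissa >> 7);            /* keep top 16 of 23 bits */
    }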

Compression Steps
Worked example (the slide walks pi through the steps, dividing by 4): the most accurate representation of pi in single precision is ≈ 3.1415927.

Decompression Steps
Most accurate representation of pi in single precision ≈ 3.1415927; pi after compression and decompression ≈ 3.1415405.

Decompression Steps
The exact reverse of the compression arithmetic is not efficient, so we reformulate it to use the fused multiply-add (FMA) unit: the value rebuilt in (1, 2) is rescaled with a single multiply-add.
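A matching decompression sketch for the illustrative 16-bit format above. Since mapped m = v / (2 * max_abs) + 1.5, the inverse is v = m * (2 * max_abs) - 3 * max_abs, which is one FMA:

    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    /* Unpack 16 bits back to a float; max_abs must match compression. */
    static float decompress_float(uint16_t c, float max_abs)
    {
        uint32_t bits = 0x3F800000u            /* sign 0, biased exponent 127 */
                      | ((uint32_t)c << 7);    /* restore top 16 mantissa bits */
        float mapped;
        memcpy(&mapped, &bits, sizeof mapped); /* value in [1.0, 2.0) */
        /* One fused multiply-add instead of a subtract-then-multiply chain. */
        return fmaf(mapped, 2.0f * max_abs, -3.0f * max_abs);
    }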

Methodology and Results
- Benchmarks: Matrix Multiplication, HotSpot, Nearest Neighbors, and Dyadic Convolution
- Compilers: IPMACC, NVCC, GNU GCC 4.4.7
- Platform: CUDA 6.0 on an NVIDIA Tesla K20c; Intel Xeon E5-2620 CPU, 16 GB RAM; Scientific Linux 6.5 (Carbon) x86_64
- Reported results are kernel times.

Matrix Multiplication

HotSpot

Nearest Neighbors
- Finds the k nearest neighbor points in a two-dimensional space.
- Each point is a C struct with two attributes: latitude and longitude.
- We transform the array of structs into a struct of arrays to enhance spatial locality.

Nearest Neighbors
Transforming the array of structs into a struct of arrays at the compression stage (a sketch follows):

    AoS: lat0 lng0 lat1 lng1 lat2 lng2 ...
    SoA: lat0 lat1 lat2 ... lng0 lng1 lng2 ...
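A plain-C sketch of that repacking (struct and function names are illustrative). After it, consecutive threads read consecutive floats per field, which coalesces the warp's memory requests:

    /* Array-of-structs layout used by the original benchmark. */
    struct point { float lat; float lng; };

    /* Repack into struct-of-arrays form: all latitudes contiguous,
       then all longitudes, so per-field GPU accesses coalesce. */
    static void aos_to_soa(const struct point *pts, float *lat, float *lng, int n)
    {
        for (int i = 0; i < n; i++) {
            lat[i] = pts[i].lat;
            lng[i] = pts[i].lng;
        }
    }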

Nearest Neighbors

Dyadic Convolution

Conclusion
- We proposed easy-to-use OpenACC clauses and a transparent compression scheme.
- Data compression reduces bandwidth usage and increases the cache hit rate.
- It is a good solution for data sizes larger than the L2 cache.
- Speedups: up to 2X on synthetic benchmarks, up to 1.36X on real-world benchmarks.

Related Works
- Samadi et al. developed a transparent approximation system: it accepts CUDA code and generates multiple kernels with different accuracy levels.
- Sathish et al. show the efficiency of hardware-based lossless and lossy compression of DRAM bandwidth; their proposed compression hardware improves the performance of memory-bound workloads by 26% for GPUs with L1 caches but without L2 caches.

Future Work
- Adaptive compression
- Compression methods for other data types
- Using hardware compression solutions

Thank you. Questions?

HotSpot Application

Matrix Multiplication
Matrix-matrix multiplication