Modeling GPU non-Coalesced Memory Access Michael Fruchtman

Importance
- GPU energy efficiency is directly dependent on performance.
- The memory model is complex: global memory accesses must be coalesced, and coalescing is evaluated per half-warp of 16 threads.
- Memory-bound applications stand to benefit most from a model that predicts their performance.

Goals
- Profile the effect of non-coalesced memory access on memory-bound GPU applications.
- Find a model that matches the resulting performance delay.
- Extend the model to calculate the extra cost in power.

Coalesced Access (figure: CUDA Programming Guide 3.0)

Coalesced Access (figure: CUDA Programming Guide)
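
The figures themselves are lost, so as a rough stand-in, the following sketch (hypothetical code, not from the slides) contrasts an access pattern that coalesces with one that does not:

    // Hypothetical illustration, not from the slides. On compute capability
    // 1.x hardware the 16 threads of a half-warp must read from one small,
    // aligned segment of memory for their loads to be merged.
    __global__ void coalescedCopy(float *out, const float *in)
    {
        // Assumes the launch covers exactly the buffer length.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];   // lane k touches word k: one transaction per half-warp
    }

    __global__ void stridedCopy(float *out, const float *in, int stride)
    {
        // Buffers must hold at least gridDim.x * blockDim.x * stride floats.
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        out[i] = in[i];   // with a large stride every lane lands in its own
                          // segment: up to 16 transactions per half-warp
    }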

Method and Procedure
- Find a memory-bound problem: matrix/vector addition, 8000x8000.
- Implement a solution for each level of coalescence:
  - 16 levels of coalescence
  - threads separated from each other
- The number of memory transactions increases while the instruction count stays the same, so memory access time grows in isolation.

Perfect Coalescence Block Striding

Example Code

    __global__ void matrixAdd(int *A, int *B, int *C, int matrixSize)
    {
        // Block striding: each thread starts at its global index and advances
        // by the block width, so the threads of a half-warp always touch 16
        // consecutive words and every access coalesces.
        int startingaddress = blockDim.x * blockIdx.x + threadIdx.x;
        int stride = blockDim.x;
        for (int currentaddress = startingaddress; currentaddress < matrixSize;
             currentaddress += stride) {
            C[currentaddress] = A[currentaddress] + B[currentaddress];
        }
    }
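
A minimal host-side launch for this kernel might look as follows (hypothetical; the slides do not show the host code, and with this stride a single block already covers the whole array):

    // Hypothetical launch. d_A, d_B, d_C are device buffers of matrixSize
    // ints allocated with cudaMalloc; the size follows the 8000x8000 problem.
    int matrixSize = 8000 * 8000;
    matrixAdd<<<1, 256>>>(d_A, d_B, d_C, matrixSize);
    cudaDeviceSynchronize();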

Perfect Non-Coalescence Stream Splitting

Example Code

    __global__ void matrixAdd(int *A, int *B, int *C, int matrixSize)
    {
        // Stream splitting: each thread walks its own contiguous chunk, so at
        // every step the threads of a half-warp are countperthread words
        // apart and none of their accesses can be merged.
        int countperthread = matrixSize / blockDim.x;
        int startingaddress = ((float)threadIdx.x / blockDim.x) * matrixSize;
        int endingaddress = startingaddress + countperthread;
        for (int currentaddress = startingaddress; currentaddress < endingaddress;
             currentaddress++) {
            C[currentaddress] = A[currentaddress] + B[currentaddress];
        }
    }
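
Note the design choice: instead of interleaving threads word by word, each thread scans a private contiguous slice, which is exactly what defeats coalescing. The float arithmetic in startingaddress is just a roundabout way of computing threadIdx.x * countperthread when blockDim.x divides matrixSize evenly.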

Non-Coalesced Level
- Modify the perfect-coalescence code:
  - read the stride from the matrix
  - insert 0s at the right places to stop threads
- The instruction count increases only slightly while the accesses become increasingly non-coalesced.
- The kernel no longer performs a perfect matrix addition (see the sketch below).
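
The slides describe this kernel but do not show it; the following is a speculative reconstruction under simplified assumptions (the original read its strides out of the matrix itself, while this version hard-codes the displacement, and the parameter level is hypothetical):

    // Speculative sketch, not the author's code. level ranges over 0..16:
    // 0 leaves the half-warp fully coalesced, 16 sends every lane to its own
    // distant memory segment.
    __global__ void matrixAddLevel(int *A, int *B, int *C,
                                   int matrixSize, int level)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % 16;   // position within the half-warp
        // Each displaced lane jumps to its own far-away segment, costing one
        // extra transaction. The arithmetic is unchanged, so the instruction
        // count rises only slightly, but displaced lanes add the wrong
        // elements: the result is no longer a perfect matrix addition.
        int j = (lane < level) ? (i + (lane + 1) * 1024) % matrixSize : i;
        if (i < matrixSize)
            C[i] = A[j] + B[j];
    }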

Experimental Setup
Nehalem host processor
- Intel Core i-series CPU
- The performance metric includes the host-device memory transfer.
- QPI improves memory-transfer performance compared to previous architectures such as the Core 2 Duo.
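
Since the metric includes the transfers, the measurement presumably brackets the copies as well as the kernel. A minimal sketch of such a measurement with CUDA events (assumed methodology, not shown on the slides; h_A, h_B, h_C and the device buffers are hypothetical):

    // Assumed methodology: time the copies in, the kernel, and the copy back
    // as one interval using CUDA events.
    size_t bytes = (size_t)matrixSize * sizeof(int);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
    matrixAdd<<<1, 256>>>(d_A, d_B, d_C, matrixSize);
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed milliseconds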

Experimental Setup
NVIDIA CUDA GPU
- EVGA GTX 260 Core 216 (896 MB)
- GT200, compute capability 1.3, which supports partially coalesced access
- Stock core clock 576 MHz
- Maximum memory bandwidth 111.9 GB/s
- 216 cores across 27 multiprocessors
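
These figures can be confirmed at run time; a small sketch using the standard cudaGetDeviceProperties query (note that clockRate reports the shader clock, not the 576 MHz core clock):

    // Sketch: confirm the device configuration at run time.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        std::printf("%s: compute capability %d.%d, %d multiprocessors, "
                    "shader clock %.0f MHz\n",
                    prop.name, prop.major, prop.minor,
                    prop.multiProcessorCount, prop.clockRate / 1000.0);
        return 0;
    }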

Performance (chart)

Memory Requested in bytes (chart)

Instructions Executed (chart)

Performance Mystery
Why is perfect non-coalescence so much slower than 1/16 coalescence? (chart: NVIDIA GTX 260)

Non-Coalescence Model
- Performance is almost perfectly linear in the number of non-coalesced accesses (R² close to 1).
- D(d) = d · M_a
  - d: number of non-coalesced memory accesses
  - M_a: memory access time, dependent on the memory architecture
- GT200: M_a = 2.43 microseconds measured, about 1400 clock cycles at the 576 MHz core clock.
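
As a sanity check on the units: 1400 cycles / 576 MHz ≈ 2.43 µs, which matches the measured M_a. A trivial sketch of the resulting predictor (hypothetical helper, not the author's code):

    // Hypothetical helper implementing D(d) = d * M_a with the GT200 numbers
    // reported above.
    const double kMaSeconds = 1400.0 / 576.0e6;   // ~2.43e-6 s per access

    double predictedDelaySeconds(long long d)     // d: non-coalesced accesses
    {
        return (double)d * kMaSeconds;
    }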

Model of Extra Power Cost
- Power consumption falls in a range that depends on the GPU; see "An integrated GPU power and performance model".
- Extra energy cost: E(d) = D(d) · P_avg
  - D(d): the delay due to non-coalesced access, from the model above
  - P_avg: the average power consumed by the GPU while active
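
A worked example under an assumed, not measured, average power: for d = 10^6 non-coalesced accesses the model gives D = 10^6 × 2.43 µs ≈ 2.43 s of extra delay; at, say, P_avg = 150 W that is E ≈ 365 J of extra energy.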

Conclusion
- Performance degrades linearly with non-coalesced access.
  - Energy efficiency therefore degrades linearly as well.
  - This applies to memory-bound applications.
- GPU memory contention: the switching time between chips is significant.
- Tools exist to reduce non-coalescence: CUDA-Lite finds and fixes some non-coalesced accesses.

References and Related Work
NVIDIA. NVIDIA CUDA Programming Guide 3.0. February 20, 2010.
S. Baghsorkhi, M. Delahaye, S. Patel, W. Gropp, W. Hwu. An adaptive performance modeling tool for GPU architectures. Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). Volume 45, Issue 5, May 2010.
S. Hong and H. Kim. An integrated GPU power and performance model. Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA). Volume 38, Issue 3, June 2010.
S. Lee, S. Min, R. Eigenmann. OpenMP to GPGPU: a compiler framework for automatic translation and optimization. Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). Volume 44, Issue 4, April 2009.
S. Ueng, M. Lathara, S. Baghsorkhi, W. Hwu. CUDA-Lite: Reducing GPU Programming Complexity. Languages and Compilers for Parallel Computing. Volume 5335.