Analyzing CUDA Workloads Using a Detailed GPU Simulator

Analyzing CUDA Workloads Using a Detailed GPU Simulator Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong and Tor M. Aamodt University of British Columbia

GPUs and CPUs on a collision course
- First GPUs with programmable shaders appeared in 2001. Today: a TeraFLOP on a single card, and Turing complete.
- Highly accessible: senior undergraduate students can learn to program CUDA in a few weeks (though not performance-tuned code).
- Rapidly growing set of CUDA applications (209 listed on NVIDIA's CUDA website in February). With OpenCL, the number of non-graphics applications written for GPUs can safely be expected to explode.
- GPUs are massively parallel systems: multicore + SIMT + fine-grain multithreading.

No academic detailed simulator for studying this?!?

GPGPU-Sim
An academic, detailed ("cycle-level") timing simulator developed from the ground up at the University of British Columbia (UBC) for modeling a modern GPU running non-graphics workloads. Reasonably accurate, even though no effort was expended trying to make it match real hardware more closely.

GPGPU-Sim
- Currently supports CUDA version 1.1 applications "out of the box".
- Microarchitecture model is based on the notion of "shader cores", which approximate the "Streaming Multiprocessors" of NVIDIA GeForce 8 series and later GPUs.
- Shader cores connect to memory controllers through a detailed network-on-chip simulator (Dally & Towles' booksim).
- Detailed DRAM timing model (everything except refresh).
- GPGPU-Sim v2.0b available: www.gpgpu-sim.org

Rest of this talk
- Obligatory brief introduction to CUDA
- GPGPU-Sim internals (100,000-foot view): simulator software overview, modeled microarchitecture
- Some results from the paper

CUDA Example

Host code (runs on the CPU):

    main() {
        …
        cudaMalloc((void**) &d_idata, bytes);
        cudaMalloc((void**) &d_odata, maxNumBlocks*sizeof(int));
        cudaMemcpy(d_idata, h_idata, bytesin, cudaMemcpyHostToDevice);
        reduce<<< nblocks, nthreads, smemSize >>>(d_idata, d_odata);
        cudaThreadSynchronize();
        cudaMemcpy(h_odata, d_odata, bytesout, cudaMemcpyDeviceToHost);
    }

Kernel (nthreads x nblocks copies run in parallel on the GPU):

    __global__ void reduce(int *g_idata, int *g_odata) {
        extern __shared__ int sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
        sdata[tid] = g_idata[i];
        __syncthreads();
        for (unsigned int s = 1; s < blockDim.x; s *= 2) {
            if ((tid % (2*s)) == 0)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            g_odata[blockIdx.x] = sdata[0];
    }

Normal CUDA Flow
- Applications are written in a mixture of C/C++ and CUDA.
- "nvcc" takes CUDA (.cu) files and generates host C code plus "Parallel Thread eXecution" (PTX) assembly language.
- The PTX is passed to the assembler/optimizer "ptxas" to generate machine code, which is packed into a C array (not human readable).
- The whole thing is combined and linked against the CUDA runtime API using a regular C/C++ compiler and linker.
- Run your app on the GPU.

GPGPU-Sim Flow
- Uses CUDA's nvcc to generate CPU C code and PTX.
- A flex/bison parser reads in the PTX.
- The host (CPU) code and the simulator are linked together into one binary.
- CUDA API calls are intercepted by a custom libcuda that implements the functions declared in the header files that ship with CUDA (a sketch of the idea follows).
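
As a rough illustration of the interception step, the sketch below shows how a replacement library can define the same entry points that the CUDA headers declare, so an unmodified application links against the simulator instead of NVIDIA's runtime. This is not GPGPU-Sim's actual source; the sim_* helpers and the minimal type stand-ins are invented for the example.

    #include <cstddef>
    #include <cstdlib>
    #include <cstring>

    // Stand-ins for types normally provided by cuda_runtime_api.h.
    typedef int cudaError_t;
    static const cudaError_t cudaSuccess = 0;
    enum cudaMemcpyKind { cudaMemcpyHostToDevice = 1, cudaMemcpyDeviceToHost = 2 };

    // Hypothetical simulator hooks: here they just use host memory, but a real
    // simulator would allocate in, and copy to/from, the simulated GPU DRAM.
    static void *sim_malloc_on_gpu(size_t size) { return std::malloc(size); }
    static void  sim_copy(void *dst, const void *src, size_t n) { std::memcpy(dst, src, n); }

    // The application calls these exactly as it would the real CUDA runtime;
    // the linker resolves them to the simulator's library instead.
    extern "C" cudaError_t cudaMalloc(void **devPtr, size_t size) {
        *devPtr = sim_malloc_on_gpu(size);   // pointer into the (simulated) device space
        return cudaSuccess;
    }

    extern "C" cudaError_t cudaMemcpy(void *dst, const void *src, size_t count,
                                      cudaMemcpyKind kind) {
        (void)kind;                          // transfer direction would matter in a real model
        sim_copy(dst, src, count);           // route the transfer through the simulator
        return cudaSuccess;
    }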

GPGPU-Sim Microarchitecture
- A set of "shader cores" connected to a set of memory controllers via a detailed interconnection network model (booksim).
- Memory controllers reorder requests to reduce activate/precharge overheads (see the scheduling sketch below).
- Topology and bandwidth of the interconnect can be varied.
- Cache for global memory operations.
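
The reordering mentioned above is the FR-FCFS ("first-ready, first-come first-served") policy listed in the simulation setup later. The sketch below captures the idea under simplifying assumptions (one queue per bank, no timing); it is not GPGPU-Sim's actual scheduler code.

    #include <cstddef>
    #include <cstdint>
    #include <deque>

    struct Request { uint32_t row; /* address, size, etc. elided */ };

    struct BankState {
        bool     row_open = false;   // is some row currently open in this bank?
        uint32_t open_row = 0;

        // Pick the index of the next request to service from this bank's queue.
        // Prefer a request that hits the open row ("first ready"), since it avoids
        // a precharge + activate; otherwise fall back to the oldest ("first come").
        int schedule(const std::deque<Request> &queue) const {
            for (std::size_t i = 0; i < queue.size(); ++i)
                if (row_open && queue[i].row == open_row)
                    return static_cast<int>(i);
            return queue.empty() ? -1 : 0;
        }
    };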

Shader Core Details
- A shader core corresponds roughly to a "Streaming Multiprocessor" in NVIDIA terminology.
- A set of scalar threads is grouped together into a SIMD unit called a "warp" (32 threads on current NVIDIA hardware). Warps are grouped into CTAs, and CTAs into "grids".
- The set of warps on a core is fine-grain interleaved on the pipeline to hide off-chip memory access latency.
- Threads in one CTA can communicate via an on-chip 16KB "shared memory" (the sketch below shows how these pieces appear in CUDA source).
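
For readers new to CUDA, this hedged sketch (an assumed example, not taken from the paper) shows how the hierarchy surfaces in source code: the host launch picks the grid and CTA sizes, __shared__ declares the per-CTA shared memory, __syncthreads() is the CTA-wide barrier, and the hardware transparently groups each CTA's threads into 32-wide warps.

    // Kernel: one copy runs per thread; all threads of a CTA share 'tile'.
    // Assumes a 256-thread CTA.
    __global__ void scale_kernel(const float *in, float *out) {
        __shared__ float tile[256];                        // per-CTA shared memory
        unsigned int tid = threadIdx.x;                    // position within the CTA
        unsigned int gid = blockIdx.x * blockDim.x + tid;  // position within the grid
        tile[tid] = in[gid];
        __syncthreads();                                   // barrier across the CTA
        out[gid] = 2.0f * tile[tid];
    }

    // Host-side launch: a grid of 64 CTAs, each CTA with 256 threads
    // (which the hardware executes as 8 warps of 32 threads):
    //     scale_kernel<<< 64, 256 >>>(d_in, d_out);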

Interconnection Network
- Baseline: mesh
- Variations: crossbar, ring, torus
- Baseline mesh memory controller placement: (figure in slides)

Are more threads better?
- More CTAs on a core: helps hide latency when some warps wait at barriers, and can increase memory latency tolerance, but needs more resources (see the example below).
- Fewer CTAs on a core: less contention in the interconnection network and memory system.
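
For example (illustrative arithmetic using the baseline parameters from the setup table below, not a result from the paper): a kernel launched with 256-thread CTAs that each use 8KB of shared memory can keep at most two CTAs (512 threads) resident per core; the 16KB shared memory, not the 1024 thread slots or the 8-CTA limit, is the binding constraint. Halving the per-CTA footprint to 4KB allows four resident CTAs and twice as many warps to interleave while waiting on memory.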

Memory Access Coalescing
- Grouping accesses from multiple, concurrently issued, scalar threads into a single access to a contiguous memory region.
- Always done within a single warp.
- Coalescing among multiple warps: we explore its performance benefits; it is more expensive to implement (see the sketch below for the access patterns involved).
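
As a hedged illustration (assumed example, not from the paper) of what intra-warp coalescing exploits: in the first kernel, the 32 threads of a warp touch 32 adjacent words, so their accesses can be merged into one wide transaction to a contiguous region; in the second, a large stride scatters the warp's accesses across memory and defeats coalescing.

    // Coalesced: consecutive threads access consecutive 4-byte words.
    __global__ void coalesced(float *data) {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] += 1.0f;                 // one contiguous region per warp
    }

    // Uncoalesced: a large stride spreads one warp's accesses far apart.
    __global__ void strided(float *data, int stride) {
        unsigned int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        data[i] += 1.0f;                 // many separate memory transactions per warp
    }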

Simulation Setup

    Number of shader cores                   28
    Warp size                                32
    SIMD pipeline width                      8
    Threads / CTAs / Registers per core      1024 / 8 / 16384
    Shared memory per core                   16KB (16 banks)
    Constant cache per core                  8KB (2-way set assoc., 64B lines, LRU)
    Texture cache per core                   64KB (2-way set assoc., 64B lines, LRU)
    Memory channels
    BW per memory module                     8 bytes/cycle
    DRAM request queue size
    Memory controller                        out-of-order (FR-FCFS)
    Branch divergence handling method        immediate post-dominator
    Warp scheduling policy                   round robin among ready warps

Benchmark Selection
Applications developed by third-party researchers, with less than 50x reported speedups, plus some applications from the CUDA SDK.

Benchmarks (more info in the paper)

    Benchmark                 Abbr.   Claimed speedup
    AES Cryptography          AES     12x
    Breadth First Search      BFS     2x-3x
    Coulombic Potential       CP      647x
    gpuDG                     DG      50x
    3D Laplace Solver         LPS
    LIBOR Monte Carlo         LIB
    MUMmerGPU                 MUM     3.5x-10x
    Neural Network            NN      10x
    N-Queens Solver           NQU     2.5x
    Ray Tracing               RAY     16x
    StoreGPU                  STO     9x
    Weather Prediction        WP      20x

Interconnection Network Latency Sensitivity
A slight increase in interconnection latency has no severe effect on overall performance, so there is no need to overdesign the interconnect to reduce latency.

Interconnection Network Bandwidth Sensitivity
Low bandwidth (8B channel width) decreases performance a lot; very high bandwidth simply moves the bottleneck elsewhere.

Effects of Varying the Number of CTAs
Most benchmarks do not benefit substantially from more concurrent CTAs. Some benchmarks (e.g. AES) even perform better with fewer concurrent threads, because of reduced contention in DRAM.

More insights and data in the paper…

Summary
- GPGPU-Sim: a novel GPU simulator capable of simulating CUDA applications (www.gpgpu-sim.org).
- Performance of the simulated applications is more sensitive to bisection bandwidth and less sensitive to (zero-load) latency.
- Sometimes running fewer CTAs can improve performance (less DRAM contention).

Interconnect Topology (Fig 9)

ICNT Latency and BW sensitivity (Fig 10-11)

Mem Controller Optimization Effects (Fig 12)

DRAM Utilization and Efficiency (Fig 13-14)

L1 / L2 Cache (Fig 15)

Varying CTAs (Fig 16)

Inter-Warp Coalescing (Fig 17)