APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs
Presented by: Isaac Martin

GPU Overview
- Streaming Multiprocessors (SMs): dozens of cores each (128*); a GPU has multiple SMs
- Single Instruction Multiple Thread (SIMT): many threads (1024-2048 per SM*) run the same code, launched as kernels
- Threads are grouped into warps
- Limited cache space per SM (16-48 KB*) results in lots of cache misses and memory latency to GPU device memory
- How to improve?
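The SIMT organization above can be made concrete with a minimal CUDA sketch (the kernel name scale_kernel and the problem size are illustrative, not from the talk): every thread executes the same kernel code on a different element, 128-thread blocks map onto an SM's cores, and the hardware groups each block into 32-thread warps.

```
#include <cuda_runtime.h>

// Minimal SIMT example: all threads run the same kernel body on different data.
__global__ void scale_kernel(float *data, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n)
        data[idx] *= factor;                          // same instruction, different element
}

int main() {
    const int n = 1 << 20;                  // illustrative problem size
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // 128 threads per block (= 4 warps of 32); enough blocks to cover n elements.
    int threads = 128;
    int blocks  = (n + threads - 1) / threads;
    scale_kernel<<<blocks, threads>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```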

Two Common Types of Loads
- Small memory range: strong locality, same or very close addresses
  - Ex: a single variable shared across all warps
- Large memory range with striding: each address accessed only once, addresses evenly spaced
  - Common in image processing, where the thread index is used to access data
  - Ex: reading pixel values from an image in parallel

SIMT Design
- In good SIMT code, all threads in a warp execute the same instruction (performance suffers if they diverge)
- All threads in a warp should therefore share the same PC through the kernel
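A hedged CUDA sketch of both points above (the kernel and variable names are made up for illustration): coeff is a small-memory-range load, since every thread in every warp reads the same address, while pixels[idx] is a large-memory-range strided load, since each address is touched exactly once and consecutive warps read evenly spaced chunks.

```
// Illustrative kernel combining the two common load types.
__global__ void brighten(const unsigned char *pixels, unsigned char *out,
                         const float *coeff, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Small memory range: all threads load the same address, which stays in
    // cache, so only the very first warp pays the memory latency.
    float scale = *coeff;

    // Large memory range with striding: each pixel address is read exactly
    // once, so these loads tend to miss in the small per-SM cache.
    if (idx < n) {
        float v = pixels[idx] * scale;
        out[idx] = v > 255.0f ? 255 : (unsigned char)v;
    }
}
```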

Cache Misses
- Cold misses: the cache block is empty; unavoidable
- Conflict misses: under the associativity scheme, the cache slot is already occupied by other data
- Capacity misses: out of space; how do we avoid evicting important data?

Compute vs. Memory Intensive
- Compute-intensive kernels see mostly cold misses
- Memory-intensive kernels see lots of capacity and conflict misses
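To make the compute- versus memory-intensive distinction concrete, here is a hypothetical pair of kernels (not from the talk): the first loads each value once and then reuses it from a register, so the cache mostly sees cold misses; the second streams a large matrix with little arithmetic per byte, so the combined working set of all resident warps overwhelms the 16-48 KB L1 and useful lines are evicted before they can be reused.

```
// Compute-intensive sketch: one cold miss per cache line, then heavy reuse
// from registers, so cache pressure stays low.
__global__ void compute_heavy(const float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float x = in[idx];                 // loaded once
        for (int i = 0; i < 256; ++i)      // many arithmetic ops per byte
            x = x * 1.0001f + 0.5f;
        out[idx] = x;
    }
}

// Memory-intensive sketch: each thread sums a column of a large matrix,
// so little compute happens per byte loaded and the many warps in flight
// keep evicting each other's lines from the small L1.
__global__ void memory_heavy(const float *mat, float *out, int rows, int cols) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < cols) {
        float sum = 0.0f;
        for (int r = 0; r < rows; ++r)
            sum += mat[r * cols + col];
        out[col] = sum;
    }
}
```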

Adaptive PREfetching and Scheduling (APRES)
- An architectural solution to improve hit rate and reduce the latency caused by the two common load types
- Groups sets of warps based on load type
- Short memory range:
  - If warps load the same address at the same PC and the data is in cache, no memory latency is expected
  - Prioritize these warps; they will complete sooner
- Long memory range with striding:
  - Loads for this data usually miss the first time
  - If the PC is the same, the address the next warp will use can be guessed from the stride
  - Compare and calculate the predicted addresses of warps at that PC
  - Prefetch those addresses into the cache
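The stride-based address guess described above can be sketched in a few lines of host-side C++ (a conceptual illustration only; the names and table layout are invented and this is not the actual APRES hardware): for each load PC, remember the last warp's address, and once consecutive warps show a constant stride, predict and prefetch the next warp's address.

```
#include <cstdint>
#include <cstdio>

// One predictor entry per load PC (conceptual sketch, not real hardware).
struct StrideEntry {
    uint64_t last_addr = 0;   // address used by the previous warp at this PC
    int64_t  stride    = 0;   // last observed per-warp stride
    bool     valid     = false;
};

// Observe a warp's load address; return the predicted address for the next
// warp at the same PC, or 0 if the stride is not yet confirmed.
uint64_t observe_and_predict(StrideEntry &e, uint64_t addr) {
    uint64_t prediction = 0;
    if (e.valid) {
        int64_t new_stride = (int64_t)addr - (int64_t)e.last_addr;
        if (new_stride != 0 && new_stride == e.stride)
            prediction = addr + e.stride;   // stride confirmed: prefetch this
        e.stride = new_stride;
    }
    e.last_addr = addr;
    e.valid = true;
    return prediction;
}

int main() {
    StrideEntry pc_entry;                   // entry for one load PC
    uint64_t base = 0x10000;
    for (int warp = 0; warp < 4; ++warp) {  // warps access addresses 512 B apart
        uint64_t pred = observe_and_predict(pc_entry, base + warp * 512);
        if (pred)
            printf("after warp %d: prefetch 0x%llx for the next warp\n",
                   warp, (unsigned long long)pred);
    }
    return 0;
}
```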

Hardware Solution - LAWS & SAP

Locality Aware Warp Scheduler (LAWS)

Scheduling Aware Prefetching (SAP)

APRES Impact on Baseline GPU
- Performance:
  - 31.7% improvement over the baseline GPU
  - 7.2% improvement over state-of-the-art predicting & scheduling schemes
- Hardware overhead:
  - Additional hardware is only 2.06% of a standard L1 cache
  - Additional functional units (4 integer adders, 1 integer multiplier, 1 integer divider) are negligible compared to the Fused Multiply-Add (FMA) units in CUDA cores (NVIDIA GPUs)

Questions?