1 APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs
Presented by: Isaac Martin

2 GPU Overview
Streaming Multiprocessors (SMs)
- Dozens of cores each (128*); a GPU contains multiple SMs
Single Instruction Multiple Thread (SIMT)
- Many threads run the same code, organized as kernels (thousands per SM*)
- Threads are grouped into warps
Limited cache space per SM (16-48KB*)
- Results in many cache misses and long latencies to GPU device memory
How can we improve this? A minimal SIMT kernel is sketched below.
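The following is a minimal sketch of the SIMT model in CUDA. The kernel name, array size, and launch geometry are illustrative choices, not values from the paper.

```cuda
#include <cuda_runtime.h>

__global__ void scale_kernel(float *data, float factor, int n) {
    // Every thread runs this same code; the hardware groups threads
    // into warps of 32 that issue each instruction together.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main() {
    const int N = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));
    // 256 threads per block = 8 warps per block; the GPU schedules
    // blocks across its SMs, each with its own small L1 cache.
    scale_kernel<<<(N + 255) / 256, 256>>>(d_data, 2.0f, N);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

All resident warps on an SM share that SM's small L1 cache, which is why the cache pressure described on the following slides matters.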

3 Two Common Types of Loads
Small memory range
- Strong locality: threads load the same or very close addresses
- Ex: a single variable shared across all warps
Large memory range with striding
- Each address is accessed only once, at evenly spaced intervals
- Common in image processing, where the thread index selects the data
- Ex: reading pixel values from an image in parallel
SIMT design
- In well-written SIMT code, all threads in a warp execute the same instruction (performance suffers if they diverge)
- All threads in a warp should therefore reach each load at the same PC
Both patterns can appear in a single kernel, as in the sketch below.
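The hypothetical brighten kernel below marks one load of each type. All names here (brighten, the offset pointer) are assumptions for illustration, not code from the paper.

```cuda
__global__ void brighten(const unsigned char *img, unsigned char *out,
                         const int *offset, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Small memory range: every thread in every warp loads the same
    // address (*offset) -- strong locality, one cache line serves all.
    int delta = *offset;
    // Large memory range with striding: each thread reads its own pixel
    // via the thread index; consecutive warps touch consecutive, evenly
    // spaced chunks of the image, and each address is read only once.
    out[i] = min(img[i] + delta, 255);
}
```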

4 Cache Misses
Cold misses
- The cache block has never been filled; unavoidable
Conflict misses
- The cache slot is already occupied by other data (determined by the associativity scheme)
Capacity misses
- The cache is out of space; how do we avoid evicting important data?
Compute- vs. memory-intensive kernels
- Compute-intensive kernels see mostly cold misses
- Memory-intensive kernels see many capacity and conflict misses
A toy conflict-miss example follows.
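As a toy illustration of conflict misses, the host-side sketch below maps strided addresses onto cache sets. The 32KB, 4-way, 128-byte-line geometry is an assumption for illustration, not the paper's exact L1 configuration.

```cuda
#include <cstdio>

const int LINE_BYTES = 128;
const int NUM_SETS   = 64;   // 32KB / 4 ways / 128B lines

int cache_set(unsigned long addr) {
    return (int)((addr / LINE_BYTES) % NUM_SETS);
}

int main() {
    // A stride of LINE_BYTES * NUM_SETS maps every access to the same
    // set: with only 4 ways, the 5th access evicts live data (a
    // conflict miss) even though most of the cache is still empty.
    unsigned long stride = (unsigned long)LINE_BYTES * NUM_SETS;
    for (int k = 0; k < 6; k++)
        printf("access %d -> set %d\n", k, cache_set(k * stride));
    return 0;
}
```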

5 Adaptive PREfetching and Scheduling (APRES)
An architectural solution that improves hit rate and reduces the latency caused by the two common load types. APRES groups sets of warps based on load type:
Short memory range
- If warps load the same address at the same PC and the data is in cache, no memory latency is expected
- Prioritize these warps; they will complete sooner
Long memory range with striding
- Loads of this data usually miss the first time
- If the PC is the same, the address the next warp will use can be guessed from the stride
- Compare and calculate the predicted addresses of warps at that PC, then prefetch those addresses into cache
A sketch of the stride-prediction idea follows.
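Below is a minimal host-side sketch of the stride-prediction idea, assuming a per-PC table holding the last address, last warp id, and observed stride. The table layout, the confirm-then-prefetch policy, and the issue_prefetch stub are illustrative assumptions, not the paper's exact SAP hardware.

```cuda
#include <cstdio>
#include <unordered_map>

struct PredEntry {
    unsigned long last_addr;  // address loaded by the last warp at this PC
    long          stride;     // per-warp address delta observed so far
    int           last_warp;  // id of that warp
    bool          valid;
};

std::unordered_map<unsigned long, PredEntry> table;  // keyed by load PC

void issue_prefetch(unsigned long addr) {  // stub standing in for the cache
    printf("prefetch 0x%lx\n", addr);
}

// Called when a warp executes a load: learn the stride between warps at
// this PC and prefetch the address predicted for the next warp.
void observe_load(unsigned long pc, int warp_id, unsigned long addr) {
    PredEntry &e = table[pc];
    if (e.valid && warp_id != e.last_warp) {
        long stride = (long)(addr - e.last_addr) / (warp_id - e.last_warp);
        // Prefetch only once the stride has been seen twice in a row.
        if (stride != 0 && stride == e.stride)
            issue_prefetch(addr + stride);  // guess the next warp's address
        e.stride = stride;
    }
    e.last_addr = addr;
    e.last_warp = warp_id;
    e.valid = true;
}

int main() {
    // Warps 0..3 at the same PC load addresses 512 bytes apart; after
    // the stride is confirmed, each load triggers a prefetch.
    for (int w = 0; w < 4; w++)
        observe_load(0x100, w, 0x10000 + 512ul * w);
    return 0;
}
```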

6 Hardware Solution - LAWS & SAP

7 Locality Aware Warp Scheduler (LAWS)

8 Scheduling Aware Prefetching (SAP)

9 APRES Impact on Baseline GPU
Performance
- 31.7% improvement over the baseline GPU
- 7.2% improvement over state-of-the-art prefetching & scheduling techniques
Hardware overhead
- Additional storage is only 2.06% of a standard L1 cache
- Additional functional units (4 integer adders, 1 integer multiplier, 1 integer divider) are negligible compared to the Fused Multiply-Add (FMA) functional units in the CUDA cores of NVIDIA GPUs
Questions?

