
1 SAGE: Self-Tuning Approximation for Graphics Engines
Mehrzad Samadi¹, Janghaeng Lee¹, D. Anoushe Jamshidi¹, Amir Hormati², and Scott Mahlke¹
¹University of Michigan, ²Google Inc.
December 2013
University of Michigan, Electrical Engineering and Computer Science
Compilers Creating Custom Processors

2 Approximate Computing
Approximation applies across domains: machine learning, image processing, video processing, and physical simulation. Trading output quality (100% → 95% → 90% → 85%) buys higher performance, lower power consumption, and less work.

3 Ubiquitous Graphics Processing Units
GPUs power a wide range of devices, from supercomputers and servers to desktops and cell phones. They run mostly regular applications over large data sets, which makes them a good opportunity for automatic approximation.

4 SAGE Framework
SAGE (Self-Tuning Approximation on Graphics Engines) simplifies or skips the processing that is computationally expensive but has the lowest impact on output quality, reaching 2–3x speedups for roughly 10% output error. The goals: write the program once, approximate automatically, and self-tune dynamically.

5 Overview

6 SAGE Framework
[Diagram: the input program enters the static compiler, which applies approximation methods to emit approximate kernels and tuning parameters; both feed the runtime system.]

7 Static Compilation
[Diagram: the static compiler takes CUDA code, an evaluation metric, and the target output quality (TOQ); it applies the atomic operation, data packing, and thread fusion optimizations to produce approximate kernels and their tuning parameters.]

8–9 Runtime System
[Timeline: preprocessing runs on the CPU; the GPU then tunes, executes, and periodically calibrates at times T, T + C, T + 2C, keeping quality above the TOQ while speedup accumulates.]

10 Approximation Methods

11 Approximation Methods
[Diagram: the static compilation flow from slide 7, highlighting the atomic operation optimization.]

12 Atomic Operations
Atomic operations update a memory location such that the update appears to happen atomically.

// Compute histogram of colors in an image
__global__ void histogram(int n, int* colors, int* bucket)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int nThreads = gridDim.x * blockDim.x;
    for (int i = tid; i < n; i += nThreads) {
        int c = colors[i];
        atomicAdd(&bucket[c], 1);
    }
}

13–16 Atomic Operations
[Animation: threads that atomically update the same location serialize; each conflicting update executes one thread at a time.]

17 Atomic Add Slowdown

18 Atomic Operation Tuning
[Figure: the loop iterations of the histogram kernel from slide 12, laid out across threads T0, T32, T64, T96, T128, T160.]

19 Atomic Operation Tuning
SAGE skips one iteration per thread. To improve performance, it drops the iteration with the maximum number of conflicts.

20 Atomic Operation Tuning
With two iterations per thread, skipping one drops 50% of the iterations.

21 Atomic Operation Tuning
Assigning more iterations to each thread lowers the drop rate to 25%.

22 Dropping One Iteration

Iteration No.   Conflicts
0               2
1               8   (max conflict)
2               1
3               3

Conflict detection (CD k) counts the conflicts of iteration k. After optimization, the CD steps are interleaved with the iterations, and the max-conflict iteration (here iteration 1) is the one dropped.
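To make the mechanism concrete, here is a minimal CUDA sketch of the idea, not SAGE's generated code: a conflict-detection pass estimates each iteration's conflicts with __match_any_sync (which assumes compute capability 7.0 or later), and the execution pass skips each thread's max-conflict iteration.

__global__ void histogram_approx(int n, const int* colors, int* bucket)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int nThreads = gridDim.x * blockDim.x;

    // CD pass: count warp lanes touching the same bucket in the same
    // iteration, and remember this thread's worst iteration.
    int worstIter = -1, worstConflicts = 0;
    int iter = 0;
    for (int i = tid; i < n; i += nThreads, ++iter) {
        unsigned peers = __match_any_sync(__activemask(), colors[i]);
        int conflicts = __popc(peers);
        if (conflicts > worstConflicts) {
            worstConflicts = conflicts;
            worstIter = iter;
        }
    }

    // Execution pass: run every iteration except the dropped one.
    iter = 0;
    for (int i = tid; i < n; i += nThreads, ++iter) {
        if (iter == worstIter) continue;   // the dropped iteration
        atomicAdd(&bucket[colors[i]], 1);
    }
}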

23 Approximation Methods
[Diagram: the static compilation flow from slide 7, highlighting data packing.]

24–27 Data Packing
[Animation: without packing, each thread fetches its own full-width value from memory; with packing, several quantized values share one memory word, so the same threads are served by fewer memory requests.]

28–29 Quantization
Preprocessing finds the min and max of the input sets and packs the data. During execution, each thread unpacks the data and maps each quantization level back to a value by applying a linear transformation.
[Figure: the range from Min to Max is divided into eight quantization levels; a stored 3-bit code such as 101 selects one level.]

30 Approximation Methods
[Diagram: the static compilation flow from slide 7, highlighting thread fusion.]

31–32 Thread Fusion
[Animation: each thread reads its inputs from memory, computes, and writes its output back; fusion merges adjacent threads so this work is shared.]

33–34 Thread Fusion
[Animation: normally every thread (T0 through T255 in blocks 0 and 1) performs its own computation followed by output writing; after fusion, one thread computes once and writes the outputs for itself and its fused neighbor, skipping the duplicated computation.]

35 Thread Fusion
Fusion halves the threads per block (T0–T127 in blocks 0 and 1), and reducing the number of threads per block results in poor resource utilization.

36 Block Fusion
Fusing blocks 0 and 1 into a single block (threads T0–T255) restores the lost utilization.
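A minimal sketch of the fused-kernel idea on a generic map-style kernel; f() is a hypothetical stand-in for the real per-element computation, and SAGE would generate such variants automatically rather than requiring them by hand.

// Stand-in for the expensive per-element computation.
__device__ float f(float x) { return sqrtf(x) * x; }

// Fused kernel: each thread covers two adjacent elements, computes f once,
// and reuses the result for its fused neighbor (the approximation).
__global__ void map_fused(int n, const float* in, float* out)
{
    int i = 2 * (threadIdx.x + blockDim.x * blockIdx.x);
    if (i >= n) return;
    float v = f(in[i]);             // computation done only once
    out[i] = v;
    if (i + 1 < n) out[i + 1] = v;  // neighbor's output approximated by reuse
}

Launched with the original block size, this already reflects block fusion: half as many full-sized blocks rather than the same number of half-empty ones.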

37 Runtime

38 How to Compute Output Quality?
Comparing the approximate version's output against the accurate version's via the evaluation metric has high overhead, so tuning should find a good enough approximate version as fast as possible. SAGE uses a greedy algorithm.

39 Tuning
K(x, y) denotes the approximate kernel whose first optimization uses tuning parameter x and whose second uses y. Tuning starts at the accurate kernel K(0,0): quality = 100%, speedup = 1x, with TOQ = 90%.

40–42 Tuning
From K(0,0), the greedy tuner evaluates the neighbors K(1,0) (quality 94%, speedup 1.15x) and K(0,1) (96%, 1.5x) and follows the faster one that meets the TOQ, K(0,1). From there it evaluates K(1,1) (94%, 2.5x) and K(0,2) (95%, 1.6x) and moves to K(1,1). The next candidates, K(2,1) (88%, 3x) and K(1,2) (87%, 2.8x), fall below the 90% TOQ, so the tuning path stops and K(1,1) is the final choice.
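A host-side sketch of this greedy walk; Config and evaluate() are hypothetical stand-ins, where evaluate() would run K(x, y) on sample inputs and score it against the accurate output with the evaluation metric.

struct Config { int x, y; };   // knobs of the first and second optimization

// Hypothetical: runs kernel K(c.x, c.y) on sample inputs and reports its
// output quality and speedup relative to the accurate kernel.
void evaluate(Config c, float* quality, float* speedup);

Config greedyTune(float toq)
{
    Config best = {0, 0};      // K(0,0): the accurate version
    float bestSpeedup = 1.0f;
    for (;;) {
        // More aggressive neighbors: bump one tuning parameter at a time.
        Config cand[2] = {{best.x + 1, best.y}, {best.x, best.y + 1}};
        bool improved = false;
        for (Config c : cand) {
            float quality, speedup;
            evaluate(c, &quality, &speedup);
            if (quality >= toq && speedup > bestSpeedup) {
                best = c;
                bestSpeedup = speedup;
                improved = true;
            }
        }
        if (!improved) return best;  // both neighbors violate the TOQ
    }
}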

43 Evaluation

44 Experimental Setup
Implementation: backend of the Cetus compiler
GPU: NVIDIA GTX 560, 2 GB GDDR5
CPU: Intel Core i7
Benchmarks: image processing and machine learning

45 K-Means
[Chart: K-Means output quality against the TOQ, and the cumulative speedup over time.]

46 Confidence
posterior ∝ prior × likelihood (Beta ∝ Uniform × Binomial)
After checking 50 samples, we will be 93% confident that 95% of the outputs satisfy the TOQ threshold.
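A worked check of those numbers, assuming a uniform prior on the pass rate p: 50 passing samples out of 50 give a Beta(51, 1) posterior, whose CDF is p^51, so

$$\Pr(p \ge 0.95 \mid 50 \text{ of } 50 \text{ pass}) = 1 - \int_0^{0.95} 51\,p^{50}\,dp = 1 - 0.95^{51} \approx 0.93$$

which is the slide's 93% confidence that at least 95% of outputs satisfy the TOQ.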

47 Calibration Overhead
[Chart: calibration overhead (%) per benchmark.]

48 Performance
[Chart: speedup of SAGE versus loop perforation at TOQ = 95% and TOQ = 90% across the benchmarks, with the geometric mean shown; one bar is labeled 6.4.]

49 Conclusion
Automatic approximation is possible: SAGE automatically generates approximate kernels with different parameters, and its runtime system uses the tuning parameters to control output quality during execution, achieving 2.5x speedup with less than 10% quality loss compared to the accurate execution.

50 SAGE: Self-Tuning Approximation for Graphics Engines
Mehrzad Samadi¹, Janghaeng Lee¹, D. Anoushe Jamshidi¹, Amir Hormati², and Scott Mahlke¹
¹University of Michigan, ²Google Inc.
December 2013
University of Michigan, Electrical Engineering and Computer Science
Compilers Creating Custom Processors

51 Backup Slides

52 What Does Calibration Miss?

53 SAGE Can Control The Output Quality

54 Distribution of Errors

