SAGE: Self-Tuning Approximation for Graphics Engines

Presentation transcript:

SAGE: Self-Tuning Approximation for Graphics Engines
Mehrzad Samadi¹, Janghaeng Lee¹, D. Anoushe Jamshidi¹, Amir Hormati², and Scott Mahlke¹
¹University of Michigan, ²Google Inc.
December 2013
University of Michigan Electrical Engineering and Computer Science — Compilers creating custom processors

Approximate Computing
Approximation applies across many domains: machine learning, image processing, video processing, physical simulation, and more. Trading output quality (100% → 95% → 90% → 85%) buys higher performance, lower power consumption, and less work.

Ubiquitous Graphics Processing Units
GPUs power a wide range of devices: supercomputers, servers, desktops, and cell phones. They mostly run regular applications that work on large data sets — a good opportunity for automatic approximation.

SAGE Framework
SAGE (Self-Tuning Approximation on Graphics Engines) lets the programmer write the program once; approximation is automatic, and the system self-tunes dynamically. It simplifies or skips the processing that is computationally expensive yet has the lowest impact on output quality, achieving a 2–3x speedup with about 10% output error.

Overview

SAGE Framework
The static compiler takes the input program and applies approximation methods to generate approximate kernels; the runtime system then uses the tuning parameters to control their execution.

Static Compilation
Given CUDA code, an evaluation metric, and a target output quality (TOQ), the static compiler applies three approximation methods — atomic operation optimization, data packing, and thread fusion — to produce approximate kernels and their tuning parameters.

Runtime System
Preprocessing runs on the CPU. The GPU then alternates between tuning, execution, and periodic calibration (at times T, T + C, T + 2C, ...): calibration checks that output quality stays above the TOQ while the speedup is maintained.
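The execute/calibrate timeline above can be sketched as a simple control loop. This is an illustrative Python model, not SAGE's GPU implementation; the kernel list, quality function, and calibration interval are assumptions for the sketch.

```python
# Hypothetical sketch of SAGE-style runtime calibration (names illustrative).
# Every `calibration_interval` invocations, the accurate kernel is also run so
# output quality can be measured; if quality drops below the TOQ, tuning backs
# off to a less aggressive approximate kernel.
def run_with_calibration(inputs, kernels, quality_of, toq, calibration_interval):
    level = len(kernels) - 1          # start with the most aggressive kernel
    outputs = []
    for i, x in enumerate(inputs):
        approx = kernels[level](x)
        if i % calibration_interval == 0:
            accurate = kernels[0](x)  # kernels[0] is the exact version
            if quality_of(approx, accurate) < toq and level > 0:
                level -= 1            # back off toward the exact kernel
                approx = kernels[level](x)
        outputs.append(approx)
    return outputs, level
```

In this sketch the back-off is one-way for simplicity; SAGE's real runtime also re-tunes upward when quality allows.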

Approximation Methods

Approximation Methods
SAGE provides three approximation methods: atomic operation optimization, data packing, and thread fusion.

Atomic Operations
Atomic operations update a memory location such that the update appears to happen atomically:

// Compute histogram of colors in an image
__global__ void histogram(int n, int* color, int* bucket) {
  int tid = threadIdx.x + blockDim.x * blockIdx.x;
  int nThreads = gridDim.x * blockDim.x;
  for (int i = tid; i < n; i += nThreads) {
    int c = color[i];
    atomicAdd(&bucket[c], 1);
  }
}

Atomic Operations
[Figure: threads contending on the same memory location serialize their atomic updates, one after another.]

Atomic Add Slowdown
[Chart: slowdown of atomicAdd as the number of conflicting threads grows.]

Atomic Operation Tuning
In the histogram kernel above, each thread (T0, T32, T64, T96, T128, T160, ...) executes several iterations of the loop, and every iteration performs an atomicAdd that may conflict with other threads.

Atomic Operation Tuning
SAGE skips one iteration per thread. To improve performance, it drops the iteration with the maximum number of conflicts.

Atomic Operation Tuning
With two iterations per thread, skipping one drops 50% of the iterations.

Atomic Operation Tuning
Increasing the number of iterations per thread reduces the drop rate: with four iterations per thread, it goes down to 25%.

Dropping One Iteration
Each iteration first runs conflict detection (CD) to count how many of its atomic updates collide. After this profiling, the iteration with the maximum conflict count is the one that gets dropped.
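The conflict-detect-then-drop idea can be modeled on the CPU. This Python sketch is illustrative, not SAGE's CUDA implementation: a "conflict" here is counted whenever two threads would atomicAdd the same bucket in the same iteration.

```python
from collections import Counter

def conflicts_per_iteration(colors, n_threads):
    """Count, per loop iteration, how many atomicAdds would collide:
    threads hitting the same bucket in the same iteration conflict."""
    n_iters = (len(colors) + n_threads - 1) // n_threads
    conflicts = []
    for j in range(n_iters):
        chunk = colors[j * n_threads:(j + 1) * n_threads]  # one element per thread
        counts = Counter(chunk)
        conflicts.append(sum(c - 1 for c in counts.values() if c > 1))
    return conflicts

def approx_histogram(colors, n_threads, n_buckets):
    """Build a histogram, skipping the iteration with the most conflicts."""
    conflicts = conflicts_per_iteration(colors, n_threads)
    drop = conflicts.index(max(conflicts))   # max-conflict iteration
    hist = [0] * n_buckets
    for j in range(len(conflicts)):
        if j == drop:
            continue                         # dropped iteration
        for c in colors[j * n_threads:(j + 1) * n_threads]:
            hist[c] += 1
    return hist
```

Dropping the max-conflict iteration removes the most serialized atomic updates, which is exactly why it buys the most speedup per unit of lost accuracy.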

Approximation Methods
The second approximation method: data packing.

Data Packing
[Figure: without packing, each thread issues a separate memory request per element; with packing, one memory transaction delivers several quantized elements to the threads.]

Quantization
Preprocessing finds the min and max of the input set and packs the data into a small number of quantization levels (e.g., levels 0–7, so each value fits in 3 bits such as 101). During execution, each thread unpacks the data and converts each quantization level back to a data value by applying a linear transformation.
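The pack/unpack pair can be sketched in a few lines. This is an illustrative CPU-side Python model of the quantization step, not SAGE's CUDA code; the function names and 3-bit default are assumptions.

```python
# Quantization-based data packing sketch: preprocessing maps each value to one
# of 2^bits levels between min and max; execution maps levels back with a
# linear transformation, introducing a bounded error of at most scale/2.
def pack(values, bits=3):
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1                       # e.g. 3 bits -> levels 0..7
    scale = (hi - lo) / levels if hi > lo else 1.0
    packed = [round((v - lo) / scale) for v in values]
    return packed, lo, scale

def unpack(packed, lo, scale):
    # linear transformation from quantization level back to approximate value
    return [lo + q * scale for q in packed]
```

Because the packed levels are small integers, several of them fit in one memory word, so each memory transaction carries more elements.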

Approximation Methods
The third approximation method: thread fusion.

Thread Fusion
Adjacent threads often produce approximately equal (≈) outputs. Thread fusion exploits this: one thread computes a value and writes it to the output locations of its neighbors as well.

Thread Fusion
After fusing T0 and T1, the computation runs once, and the surviving thread performs the output writing for both.

Thread Fusion
[Figure: two thread blocks (block 0 and block 1, threads T0–T255) before and after fusion.]

Thread Fusion
Simply halving the number of threads per block (e.g., from 256 to 128) results in poor resource utilization.

Block Fusion
Instead, SAGE fuses blocks: blocks 0 and 1 are fused into one full block of threads T0–T255, keeping utilization high.
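The fusion idea can be modeled in sequential code. This Python sketch is illustrative (not SAGE's CUDA transformation): the expensive function runs once per fused group, and its result is duplicated into every slot the group covers.

```python
# Thread-fusion sketch: adjacent outputs are often approximately equal, so a
# fused "thread" computes once and performs the output writes for the whole
# group. `fuse=2` models fusing pairs of threads.
def fused_map(f, xs, fuse=2):
    out = [None] * len(xs)
    for i in range(0, len(xs), fuse):
        value = f(xs[i])                  # compute once per fused group
        for j in range(i, min(i + fuse, len(xs))):
            out[j] = value                # duplicate the output writes
    return out
```

The approximation error depends on how quickly `f` varies between neighbors; block fusion is then just the scheduling detail that keeps the fused work grouped into full-size blocks.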

Runtime

How to Compute Output Quality?
Output quality is computed by running both the accurate and the approximate version and comparing their outputs with the evaluation metric, which is high overhead. Tuning should therefore find a good enough approximate version as fast as possible, so SAGE uses a greedy algorithm.

Tuning
K(x,y) denotes the approximate kernel whose first optimization uses tuning parameter x and whose second uses y. Tuning starts from the exact kernel K(0,0), with quality = 100% and speedup = 1x, and a TOQ of 90%.

Tuning
From K(0,0), the greedy algorithm evaluates the neighboring configurations: K(1,0) yields 94% quality at 1.15x speedup, and K(0,1) yields 96% quality at 1.5x speedup.

Tuning
Both neighbors meet the TOQ, so the tuner moves to K(0,1), which has the better speedup (1.5x at 96% quality), and evaluates its neighbors in turn: K(1,1) yields 94% quality at 2.5x speedup, and K(0,2) yields 95% quality at 1.6x speedup.

Tuning
The tuning path continues to K(1,1) (94% quality, 2.5x speedup). Its neighbors K(2,1) (88% at 3x) and K(1,2) (87% at 2.8x) both fall below the 90% TOQ, so the search stops and K(1,1) is the final choice.
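The greedy walk can be written down directly. This is a sketch, not SAGE's implementation; the dictionary API is invented, and the pairing of qualities to speedups is read off the slides (it could plausibly be transposed).

```python
# Greedy tuner sketch: from K(0,0), repeatedly move to the neighboring
# configuration with the best speedup whose quality still meets the TOQ;
# stop when no neighbor qualifies.
def greedy_tune(profile, toq):
    x, y = 0, 0
    while True:
        neighbors = [k for k in ((x + 1, y), (x, y + 1)) if k in profile]
        ok = [k for k in neighbors if profile[k][0] >= toq]
        if not ok:
            return (x, y)                            # final choice
        x, y = max(ok, key=lambda k: profile[k][1])  # best speedup wins

# (quality %, speedup) for each K(x, y), taken from the slides
profile = {
    (0, 0): (100, 1.0), (1, 0): (94, 1.15), (0, 1): (96, 1.5),
    (1, 1): (94, 2.5),  (0, 2): (95, 1.6),
    (2, 1): (88, 3.0),  (1, 2): (87, 2.8),
}
```

With TOQ = 90%, this walk visits K(0,0) → K(0,1) → K(1,1) and stops there, matching the tuning path on the slides.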

Evaluation

Experimental Setup
SAGE is implemented as a backend of the Cetus compiler.
GPU: NVIDIA GTX 560 with 2GB of GDDR5 memory. CPU: Intel Core i7.
Benchmarks: image processing and machine learning applications.

K-Means
[Chart: output quality stays above the TOQ while the accumulative speedup increases as K-Means runs.]

Confidence
posterior ∝ prior × likelihood, with a uniform prior (a Beta distribution) and a binomial likelihood. After checking 50 samples, we will be 93% confident that 95% of the outputs satisfy the TOQ threshold.
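The 93% figure can be checked directly. Assuming a uniform prior Beta(1, 1) and a binomial likelihood, observing 50 successes (all sampled outputs meeting the TOQ) and no failures gives the posterior Beta(51, 1), whose CDF is simply p**51; the confidence that the true success rate exceeds 95% follows in one line.

```python
# Posterior confidence that the true pass rate exceeds `threshold`, given
# `successes` TOQ-passing samples and no failures, under a Beta(1, 1) prior.
# Posterior = Beta(successes + 1, 1), whose CDF at p is p**(successes + 1).
def confidence(successes, threshold=0.95):
    return 1.0 - threshold ** (successes + 1)

conf_50 = confidence(50)   # ≈ 0.93, i.e. 93% confident
```

This matches the slide's claim: 50 clean samples give roughly 93% confidence that 95% of outputs satisfy the TOQ.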

Calibration Overhead
[Chart: calibration overhead (%) across the benchmarks.]

Performance
[Chart: speedup of SAGE versus loop perforation at TOQ = 95% and TOQ = 90%, with a geometric mean across benchmarks; the tallest bar reaches 6.4x.]

Conclusion
Automatic approximation is possible. SAGE automatically generates approximate kernels with different parameters, and its runtime system uses the tuning parameters to control output quality during execution. The result is a 2.5x speedup with less than 10% quality loss compared to the accurate execution.


Backup Slides

What Does Calibration Miss?

SAGE Can Control The Output Quality

Distribution of Errors