 GPU Power Model Nandhini Sudarsanan Nathan Vanderby Neeraj Mishra Usha Vinodh

Presentation transcript:

 GPU Power Model Nandhini Sudarsanan Nathan Vanderby Neeraj Mishra Usha Vinodh Chi Xu

Outline  Introduction and Motivation  Analytical Model Description  Experiment Setup  Results  Conclusion and Further Work CSCI 8205: GPU Power Model 2 5/4/11

Introduction  Develop a methodology for building an accurate power model for a GPU.  Validate with a NVIDA’s GTX 480 GPU.  Measure power efficiency of various NVIDIA SDK benchmarks.  Accurate power model can help  Explore various architectural and algorithmic trade offs.  Figure out balance of workload between GPU and CPU. CSCI 8205: GPU Power Model 3 5/4/11

Motivation  Power Consumption: Key criterion for future Hardware Devices and Embedded Software.  Effect of increased power density has been not been felt till now  Supply voltage was scaled back too.  Current and Power density remained constant.  Further reduction in supply voltage difficult in future  Supply voltage approaching close to threshold voltage.  Gate oxide thickness almost equal to 1nm. CSCI 8205: GPU Power Model 4 5/4/11

Motivation

GPU Processing Power

Price of Power  Maximum load = a lot of power  Nvidia 8800 GTX: 137 W  Intel Xeon L5400: 50 W

Power Wall  Power density in GPUs is larger than in even high-end CPUs  Power gating and clock gating have been successfully employed in CPUs [Brooks, HPCA 2001]  Power gating, clock gating, and other H/W-based schemes are not used in most GPUs [Kim, ISCA 2010]  An accurate power model can help  Explore various architectural and algorithmic trade-offs.  Figure out the balance of workload between the GPU and CPU.

Background  Power consumption can be divided into: Power = Dynamic_power + Static_power + Short_Ckt_Power  Dynamic power is determined by run-time events  Fixed-function units: texture filtering and rasterization  Programmable units: memory and floating point  Static power is determined by  circuit technology  chip layout  operating temperature. P = V_CC * N * K_design * I_leak
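The static-power expression above can be sketched numerically. The parameter values below are illustrative placeholders only, not measured GTX 480 figures:

```python
def static_power(v_cc, n_transistors, k_design, i_leak):
    """Static (leakage) power: P = V_CC * N * K_design * I_leak."""
    return v_cc * n_transistors * k_design * i_leak

# Illustrative numbers only: 1.0 V supply, 3e9 transistors,
# a unitless design factor, and a per-device leakage current.
p_static = static_power(v_cc=1.0, n_transistors=3.0e9, k_design=0.1, i_leak=1e-10)
print(f"{p_static:.3f} W")  # 0.030 W
```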

Previous Power Models  Statistical power-modeling approach for GPUs [Matsuoka 2010]  Uses 13 CUDA performance counters (ld, st, branch, TLB miss) to obtain a profile  Finds the correlation between profiles and power by statistical model learning.  Information not captured by the counters is lost  Cycle-level simulation-based power models [Skadron HWWS'04]  Assume a hypothetical architecture to explore new GPU microarchitectures and model power and leakage properties  Cycle-level processor simulations are time-consuming [Martonosi & Isci 2003]  Do not allow a complete view of operating-system effects, I/O [Isci 2003]
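The statistical approach above amounts to fitting a linear map from counter profiles to measured power. A minimal sketch on synthetic data (the counter count, weights, and 60 W baseline are made up for illustration, not taken from the cited work):

```python
import numpy as np

# Synthetic illustration: fit a linear model from per-kernel
# performance-counter profiles to measured power.
rng = np.random.default_rng(0)
n_kernels, n_counters = 40, 4               # toy sizes
X = rng.random((n_kernels, n_counters))     # counter profiles
true_w = np.array([30.0, 12.0, 5.0, 2.0])   # hypothetical per-counter weights (W)
power = X @ true_w + 60.0                   # hypothetical 60 W idle baseline

# Least-squares fit with an intercept column.
A = np.hstack([X, np.ones((n_kernels, 1))])
w, *_ = np.linalg.lstsq(A, power, rcond=None)
print(np.round(w, 2))  # recovers the per-counter weights and the baseline
```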

Outline  Introduction and Motivation  Analytical Model Description  Parser  Power Model  Experiment Setup  Results  Conclusion and Further Work

Need for a Parser  GPGPU-Sim is time-consuming  GPGPU-Sim output is not tailored to our needs  The parser is very fast  GPGPU-Sim works only with CUDA 2.3 or earlier

Limitations of the Parser  Dynamic loop bounds are not automatically determined.  Branches are assumed to be taken  Highly tailored to our specific needs.  A change in the PTX layout might require changes to the parser.
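The parser's core task, tallying instruction types from PTX text, can be sketched as below. This is a simplified assumption about its operation: real PTX carries predicates, labels, and directives, and the project's parser must additionally weight counts by loop trip counts:

```python
import re
from collections import Counter

def count_ptx_opcodes(ptx: str) -> Counter:
    """Tally base opcodes (add, mul, ld, st, ...) in a PTX kernel body.

    Simplified: skips directives/labels/braces and strips the type
    suffix (e.g. 'add.f32' -> 'add')."""
    counts = Counter()
    for line in ptx.splitlines():
        line = line.strip()
        if not line or line.startswith((".", "//", "{", "}")) or line.endswith(":"):
            continue
        m = re.match(r"([a-z]+)", line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = """
    ld.global.f32 %f1, [%rd1];
    add.f32 %f2, %f1, %f1;
    mul.f32 %f3, %f2, %f1;
    st.global.f32 [%rd2], %f3;
"""
print(count_ptx_opcodes(sample))  # Counter({'ld': 1, 'add': 1, 'mul': 1, 'st': 1})
```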

Outline  Introduction and Motivation  Analytical Model Description  Parser  Power Model  Experiment Setup  Results  Conclusion and Further Work

Fermi Architecture: sm_20  Memory Hierarchy  PCIe & RAM  L2 Cache  L1 Cache  Shared Memory  Registers  Streaming Multiprocessor  32 ALUs, 32 FPUs, 4 SFUs  2 pipelines  2 warp schedulers, 2 instructions/cycle

Factors in the Power Model  Temperature  # of SMs

Power Model  Assembly Level CSCI 8205: GPU Power Model 18 5/4/11

Outline  Introduction and Motivation  Analytical Model Description  Parser  Power Model  Experiment Setup  Results  Conclusion and Further Work

Experiment Setup - Hardware  Measure power consumption and temperature  Sampled at 10 Hz from the GPU sensor  Current clamp on the PCIe & GPU power cables  Data acquisition at 100 Hz  GPU performance counters  Profile 57 counters per kernel  9 executions
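With a current clamp on the GPU power cables, average power falls out of the sampled current trace as mean current times rail voltage. A minimal sketch, assuming the nominal 12 V of PCIe auxiliary power cables:

```python
RAIL_VOLTAGE_V = 12.0  # PCIe auxiliary power cables are nominally 12 V (assumed)

def average_power(current_samples_a, rail_voltage=RAIL_VOLTAGE_V):
    """Average power from current-clamp samples: mean(I) * V."""
    return rail_voltage * sum(current_samples_a) / len(current_samples_a)

# e.g. a short stretch of a hypothetical 100 Hz current trace, in amps
samples = [10.0, 10.5, 11.0, 10.5]
print(average_power(samples))  # 12 * 10.5 = 126.0 W
```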

Experiment Setup - Software  Driver API  PTX-level micro-benchmarks  Minimize control loops  Stress one type of PTX instruction per kernel (over 95% of instructions)  76 kernels  Block and grid sizes chosen carefully  CUDA 4.0  Built-in binary-to-assembly converter (cuobjdump)  Timer interrupt to collect temperature  Remote login

Limitations of PTX  Higher level than assembly  30 out of 76 PTX instructions map to multiple assembly instructions  Divide, sqrt, etc.: 1 PTX line becomes a library routine in assembly  Compiler optimizations occur from PTX -> assembly  Doesn't reflect RAW dependencies  Performance-counter results are based on assembly

CUDA – Fermi Architecture  Third-generation Streaming Multiprocessor (SM)  32 CUDA cores per SM, 4x over GT200  1024-thread block size, 2x over GT200  Unified address space enables full C++ support  Improved memory subsystem

CUDA – Fermi Architecture  Fermi memory hierarchy: each SM (SM-0 … SM-N) has its own registers, L1 cache, and shared memory; all SMs share the L2 cache, which sits in front of global memory.

Validation Benchmarks  Small number of overhead operations (loop counters, initialization, etc.).  Computationally intensive work, to allow an experiment of significant length for accurate current measurement.  High utilization of the CUDA cores, with as few data hazards as possible.  Grid and block sizes chosen appropriately so that all SMs are used, since idle SMs still leak.  Accordingly, 7 benchmarks were selected from the CUDA SDK.

Validation Benchmarks  Our benchmarks  2D convolution  Matrix Multiplication  Vector Addition  Vector Reduction  Scalar Product  DCT 8x8  3DFD 5/4/11CSCI 8205: GPU Power Model 26

Outline  Introduction and Motivation  Analytical Model Description  Parser  Power Model  Experiment Setup  Results  Conclusion and Further Work

Results

Outline  Introduction and Motivation  Analytical Model Description  Parser  Power Model  Experiment Setup  Results  Conclusion and Further Work

Conclusion and Further Work  Conclusion  Further Work  Take into account context switches  Consider multiple kernels running simultaneously

The End Thanks Q&A