
GPU Power Model Nandhini Sudarsanan, Nathan Vanderby, Neeraj Mishra, Usha Vinodh, Chi Xu

Outline  Introduction and Motivation  Analytical Model Description  Experiment Setup  Results  Conclusion and Further Work CSCI 8205: GPU Power Model 2 5/4/11

Introduction  Develop a methodology for building an accurate power model for a GPU.  Validate it on an NVIDIA GTX 480 GPU.  Measure the power efficiency of various NVIDIA SDK benchmarks.  An accurate power model can help:  Explore architectural and algorithmic trade-offs.  Balance the workload between GPU and CPU.

Motivation  Power consumption: a key criterion for future hardware devices and embedded software.  The effect of increased power density has not been felt until now:  Supply voltage was scaled back too.  Current and power density remained constant.  Further reduction in supply voltage will be difficult:  Supply voltage is approaching the threshold voltage.  Gate oxide thickness is already close to 1 nm.

Motivation (figure)

GPU Processing Power (figure)

Price of Power  Maximum load = a lot of power:  NVIDIA 8800 GTX: 137 W  Intel Xeon L5400: 50 W

Power Wall  Power density in GPUs is larger than in even high-end CPUs.  Power gating and clock gating have been successfully employed in CPUs [Brooks, HPCA 2001].  Power gating, clock gating, and other hardware-based schemes are not used in most GPUs [Kim, ISCA 2010].  An accurate power model can help:  Explore architectural and algorithmic trade-offs.  Balance the workload between GPU and CPU.

Background  Power consumption can be divided into: Power = Dynamic power + Static power + Short-circuit power  Dynamic power is determined by run-time events:  Fixed-function units: texture filtering and rasterization  Programmable units: memory and floating point  Static power is determined by:  circuit technology  chip layout  operating temperature P_static = V_CC * N * K_design * I_leak
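The decomposition above can be evaluated numerically. A toy sketch in Python; all numeric values below are illustrative placeholders, not measured data from the slides:

```python
def static_power(v_cc, n_transistors, k_design, i_leak):
    """P_static = V_CC * N * K_design * I_leak (per the slide)."""
    return v_cc * n_transistors * k_design * i_leak

def total_power(dynamic, static, short_ckt):
    """Power = Dynamic + Static + Short-circuit."""
    return dynamic + static + short_ckt

# Illustrative values: 1.0 V supply, 3 billion transistors,
# an invented design constant, and a normalized leakage term.
p_static = static_power(v_cc=1.0, n_transistors=3.0e9,
                        k_design=5.0e-10, i_leak=20.0)  # ~30 W
print(total_power(dynamic=100.0, static=p_static, short_ckt=5.0))
```

The static term scales linearly with transistor count N, which is why idle hardware (e.g. unused SMs) still contributes measurably to total power.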

Previous Power Models  Statistical power modeling for GPUs [Matsuoka 2010]:  Uses 13 CUDA performance counters (ld, st, branch, TLB miss) to obtain a profile.  Finds correlations between profiles and power by statistical model learning.  Information not captured by the counters is lost.  Cycle-level simulation-based power models [Skadron, HWWS '04]:  Assume a hypothetical architecture to explore new GPU microarchitectures and model power and leakage properties.  Cycle-level processor simulations are time consuming [Martonosi & Isci 2003].  Do not allow a complete view of operating system effects and I/O [Isci 2003].

Outline  Introduction and Motivation  Analytical Model Description  Parser  Power Model  Experiment Setup  Results  Conclusion and Further Work

Need for a Parser  GPGPU-Sim is time consuming.  GPGPU-Sim output is not tailored to our needs.  The parser is very fast.  GPGPU-Sim works only with CUDA 2.3 or earlier.

Limitations of the Parser  Dynamic loop counts are not automatically determined.  Branches are assumed taken.  Highly tailored to our specific needs:  A change in the PTX layout might require changes to the parser.
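A parser of this kind can be sketched as an opcode-frequency counter over a PTX listing. The skeleton below is a hypothetical illustration (the project's actual parser is not shown in the slides); it skips directives and labels and keeps only the leading opcode token:

```python
import re
from collections import Counter

def count_ptx_opcodes(ptx_text):
    """Count instruction opcodes in a PTX listing (hypothetical sketch).

    Skips directives (lines starting with '.'), comments, labels, and
    braces; truncates the opcode at its first modifier, so
    'ld.global.f32' is counted as 'ld'.
    """
    counts = Counter()
    for line in ptx_text.splitlines():
        line = line.strip().rstrip(";")
        if not line or line.startswith((".", "//", "{", "}")) or line.endswith(":"):
            continue
        op = re.split(r"[\s.]", line, maxsplit=1)[0]
        if op:
            counts[op] += 1
    return counts

# Minimal hand-written PTX fragment for illustration.
sample = """
.visible .entry add_kernel(
    ld.global.f32 %f1, [%rd1];
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%rd2], %f3;
"""
print(count_ptx_opcodes(sample))  # ld, add, st counted once each
```

Per the limitations above, a real version would also need loop trip counts supplied by hand and a fixed branch-taken assumption to turn these static counts into dynamic ones.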


Power Model  PTX Level

Power Model  Assembly Level
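The model itself appears only as figures in the slides. One plausible form, consistent with the instruction-counting approach used here, is a linear model where each instruction class contributes a fixed per-instruction energy on top of an idle power. A hedged sketch; the coefficients are invented for illustration, not fitted values:

```python
# Hypothetical linear power model: per-class energies (nJ) and idle
# power are invented placeholders, not measured or fitted values.
ENERGY_PER_INSTR_NJ = {"ld": 2.0, "st": 2.0, "add": 0.5, "mul": 0.8}
P_IDLE_W = 40.0  # assumed idle (static) power

def predict_avg_power(instr_counts, runtime_s):
    """Average power = idle power + total dynamic energy / runtime."""
    dyn_nj = sum(ENERGY_PER_INSTR_NJ.get(op, 0.0) * n
                 for op, n in instr_counts.items())
    return P_IDLE_W + dyn_nj * 1e-9 / runtime_s

# e.g. a kernel executing 1e9 loads and 1e9 adds over 0.1 s
print(predict_avg_power({"ld": 1e9, "add": 1e9}, runtime_s=0.1))
```

In the actual workflow the coefficients would be fitted (e.g. in MATLAB, per the experiment setup) against the measured power of the benchmark kernels.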


Experiment Setup - Hardware  Measure power consumption and temperature:  GPU on-board sensor sampled at 10 Hz.  Current clamp on the PCIe and GPU power cables.  Data acquisition at 100 Hz.  GPU performance counters:  Profile 57 counters per kernel (9 executions).
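With a current clamp on fixed-voltage supply rails, instantaneous power is just V × I, and averaging the 100 Hz samples gives per-kernel power. A minimal sketch; the 12 V rail voltage and the sample values are illustrative assumptions:

```python
def avg_power_watts(current_samples_a, rail_voltage_v=12.0):
    """Mean power from current-clamp samples on a fixed-voltage rail."""
    if not current_samples_a:
        raise ValueError("no samples")
    return rail_voltage_v * sum(current_samples_a) / len(current_samples_a)

# e.g. four clamp samples in amps (illustrative values)
print(avg_power_watts([10.0, 11.0, 12.0, 11.0]))  # 12 V * 11 A = 132.0 W
```

In practice the PCIe slot and the auxiliary GPU power cable are separate rails, so their measured powers would be summed.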

Experiment Setup - Software  Driver API:  Generate and modify PTX code.  Minimize control loops.  CUDA 4.0:  Built-in binary-to-assembly converter (cuobjdump).  MATLAB to build the model.  Remote login.

CUDA – Fermi Architecture  Third-generation Streaming Multiprocessor (SM):  32 CUDA cores per SM, 4x over GT200.  Up to 1024 threads per block, 2x over GT200.  Unified address space enables full C++ support.  Improved memory subsystem.

CUDA – Fermi Architecture  Fermi memory hierarchy: each SM (SM-0 … SM-N) has its own registers, L1 cache, and shared memory; all SMs share an L2 cache backed by global memory.

Benchmarks  Selection criteria:  Small number of overhead operations (loop counters, initialization, etc.).  Computationally intensive enough to give an experiment of significant length for accurate current measurement.  High utilization of the CUDA cores, with as few data hazards as possible.  Grid and block sizes chosen so that all SMs are used, since idle SMs still leak power.  Accordingly, 7 benchmarks were selected from the CUDA SDK.
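The "all SMs used" criterion is simple arithmetic: the grid must contain at least as many blocks as there are SMs. A sketch for the GTX 480, which has 15 SMs; the element count and block size below are illustrative:

```python
import math

def grid_size(n_elements, block_size, n_sms=15):
    """Blocks needed to cover n_elements, one thread per element.

    The GTX 480 used in the experiments has 15 SMs; since idle SMs
    still leak power, benchmarks should launch at least n_sms blocks.
    """
    blocks = math.ceil(n_elements / block_size)
    if blocks < n_sms:
        print(f"warning: only {blocks} blocks for {n_sms} SMs")
    return blocks

print(grid_size(1_000_000, block_size=256))  # 3907 blocks
```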

Benchmarks  Our benchmarks:  2D convolution  Matrix Multiplication  Vector Addition  Vector Reduction  Scalar Product  DCT 8x8  3DFD

Limitations of PTX  PTX is higher level than assembly:  Divide and sqrt are one PTX line each but expand to library routines in assembly.  Compiler optimizations occur going from PTX to assembly.  PTX doesn't reflect RAW dependencies.  Performance counters are based on assembly.


Results (figure)


Conclusion and Further Work  Conclusion  Further Work:  Take context switches into account.  Consider multiple kernels running simultaneously.

The End  Thanks  Q&A