Accurate Power and Energy Measurement on Kepler-based Tesla GPUs
Martin Burtscher
Department of Computer Science

Introduction
- GPU-based accelerators
  - Quickly spreading in PCs and even handheld devices
  - Widely used in high-performance computing
- Power and energy efficiency
  - Heat dissipation is a problem
  - Electric bill and battery life are of growing concern
  - Exascale requires 50x boost in performance per watt
- Important research area
  - Need to develop techniques to reduce power and energy
  - Have to be able to measure power/energy of programs

GPU Power Sensors
- Hardware
  - High-end compute GPUs include power sensors
  - For example, K20/K40 Tesla cards have a built-in sensor
  - These cards are the target of this talk
- Software
  - Can query the sensor with the NVIDIA Management Library (NVML)

Problems
- Power sensor data behaves strangely
  - Running the same kernel twice yields different energy
    - First launch: 114 J, second launch: 147 J (29% more energy)
  - Running a kernel 2x as long more than doubles energy
    - 1x input: 732 J, 2x input: 1579 J (8% above doubling)
- Power sensor sampling rate varies greatly
  - Sampling intervals range up to 130 ms (rates from 7.7 Hz to 3760 Hz)

Methodology
- Hardware
  - Two K20c, two K20m, two K20X, and two K40m GPUs
- Measurement
  - Query power and time in a loop on an “idle” CPU core
- Test code
  - Compute-intensive regular n-body kernel
  - Constant computation rate of over 2 TFlops on a K20c
  - No data dependences; vary n to adjust kernel runtime
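The measurement loop described above can be sketched as follows. This is a minimal illustration, not the code used in the talk: `poll_sensor` and its parameters are hypothetical names, and the sensor read is passed in as a callable so the loop can run without a GPU. The comment shows one way to wire it to the `pynvml` bindings, assuming that package is installed.

```python
import time

def poll_sensor(read_power_w, duration_s, clock=time.perf_counter):
    """Poll the sensor as fast as possible and record (elapsed_s, watts) pairs.

    read_power_w: zero-argument callable returning the current power in watts.
    With the pynvml package (an assumption, not part of the talk) this could be:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        read_power_w = lambda: pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0
    (NVML reports power in milliwatts.)
    """
    samples = []
    start = clock()
    while True:
        elapsed = clock() - start
        if elapsed >= duration_s:
            break
        # Store the time stamp with every reading; the intervals are not uniform.
        samples.append((elapsed, read_power_w()))
    return samples
```

Running this on an otherwise idle CPU core while the kernel executes yields the raw profile that the later slides analyze.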

Expected Power Profile
[Figure: expected power profile. Annotations: kernel starts executing; kernel stops executing; GPU idle power; measurement loop runtime.]

Measured Power Profile
[Figure: measured power profile. Annotations: power ramps up slowly; power ramps down slowly; switch to step shape; idle power reached; macroscopic phenomena. Marked time spans: 5 s, 3 s, 4 s.]

Energy = Area Under Power Curve
[Figure: area under the measured power curve. Annotations: Integrate to where? Unclear how big energy is. Missing energy? Delayed energy?]

Ramp-up Behavior of 2 Short Runs
[Figure: power profiles of two short runs. Annotations: short run same as longer run; 2nd run starts higher but also follows curve; ramp down doesn’t follow.]

Ramp-down Behavior of Several Runs
[Figure: ramp-down profiles of several runs. Annotations: shape depends on power at t2; power increases after kernel done; shape always the same; steps down every second; driver lowers power level.]

Sampling Interval Lengths
[Figure: sampling interval lengths. Annotations: short intervals; wide range of intervals; very long interval; driver activity can prevent sampling.]

Sampling Interval Lengths (zoomed-in)
[Figure: zoomed-in sampling interval lengths. Annotations: identical values; many short intervals; very long interval; sampled power only ever changes after a long interval.]

Correcting the Measurements

Sampling Frequency
- Eliminate redundant samples
  - Only sample once every 15 ms (66.7 Hz)
  - Cannot accurately measure kernels under ~150 ms
- Account for the variation in interval length
  - Use high-resolution time stamps
  - Example: energy from t1 to t4
    - Dotted (fixed intervals): 1205 J
    - Solid (variable intervals): 1066 J
    - 13% discrepancy
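The two steps on this slide can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function names are hypothetical, samples are `(time_s, watts)` pairs, and the trapezoid rule is an assumed choice of integration scheme (the slide does not say which one was used).

```python
def thin_samples(samples, min_gap_s=0.015):
    """Drop redundant readings: keep a sample only if at least
    min_gap_s has passed since the last kept sample."""
    kept = []
    for t, p in samples:
        if not kept or t - kept[-1][0] >= min_gap_s:
            kept.append((t, p))
    return kept

def energy_variable(samples):
    """Trapezoidal energy in joules using the recorded time stamps."""
    return sum(0.5 * (p0 + p1) * (t1 - t0)
               for (t0, p0), (t1, p1) in zip(samples, samples[1:]))

def energy_fixed(samples, period_s):
    """Trapezoidal energy in joules assuming every interval lasts period_s.
    Shown only for comparison; it ignores the actual time stamps."""
    return sum(0.5 * (p0 + p1) * period_s
               for (_, p0), (_, p1) in zip(samples, samples[1:]))
```

Whenever a long gap occurs between samples, `energy_fixed` and `energy_variable` disagree, which is the kind of discrepancy the example on the slide quantifies.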

True Power
- Sensor hardware
  - Seems to asymptotically approach true power
  - Reminiscent of capacitor charging
- True instant power
  - $P_{\text{true}}$ is a function of the slope of the power profile, $dP_{\text{sensor}}/dt$, and the power measured by the sensor, $P_{\text{sensor}}$:
    $P_{\text{true}} = P_{\text{sensor}} + C \cdot \frac{dP_{\text{sensor}}}{dt}$
- “Capacitance” of sensor
  - $C \approx 0.84$ s on all tested K20 GPUs
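A minimal sketch of applying this formula to a list of `(time_s, watts)` sensor samples follows. The function name is hypothetical, and the slope estimate (a difference between neighboring samples, one-sided at the two ends) is one reasonable assumption about how to approximate the derivative.

```python
C_SECONDS = 0.84  # sensor "capacitance" reported for the tested K20 GPUs

def true_power(samples, c=C_SECONDS):
    """Return (time_s, corrected_watts) using
    P_true = P_sensor + C * dP_sensor/dt."""
    n = len(samples)
    corrected = []
    for i, (t, p) in enumerate(samples):
        lo = max(i - 1, 0)        # previous neighbor (or self at the start)
        hi = min(i + 1, n - 1)    # next neighbor (or self at the end)
        dt = samples[hi][0] - samples[lo][0]
        slope = (samples[hi][1] - samples[lo][1]) / dt if dt > 0 else 0.0
        corrected.append((t, p + c * slope))
    return corrected
```

Because the slope is computed from differences of noisy readings, small sampling errors are amplified by the factor C, so some ripple in the corrected values is to be expected.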

Back-calculated from Expected Profile
[Figure: sensor profile back-calculated from the expected profile. Annotations: ‘capacitor’ function matches measured values perfectly; minimized absolute errors to determine C.]
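One way to carry out such a fit is sketched below. It is an assumption-laden illustration, not the procedure from the talk: the function names are hypothetical, the expected profile is treated as piecewise constant between samples, and a simple search over candidate values stands in for whatever minimizer was actually used. The sensor model is the first-order lag implied by the capacitor analogy, dP_sensor/dt = (P_true - P_sensor) / C.

```python
import math

def simulate_sensor(times, p_true, c, p_start):
    """Sensor reading predicted for a known true-power profile.
    p_true[i] is assumed to hold from times[i] to times[i + 1]."""
    readings = [p_start]
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        target = p_true[i - 1]
        # Exact solution of the first-order lag over one interval.
        readings.append(target + (readings[-1] - target) * math.exp(-dt / c))
    return readings

def fit_c(times, p_true, measured, candidates):
    """Pick the candidate C that minimizes the sum of absolute errors
    between the simulated and the measured sensor readings."""
    def cost(c):
        simulated = simulate_sensor(times, p_true, c, measured[0])
        return sum(abs(s - m) for s, m in zip(simulated, measured))
    return min(candidates, key=cost)
```

A call such as `fit_c(times, expected, measured, [0.5 + 0.01 * k for k in range(71)])` scans C from 0.5 s to 1.2 s in 10 ms steps; the range and step size are arbitrary choices for illustration.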

Corrected Power Profile
[Figure: corrected power profile. Annotations: wobbles due to sampling errors; corrected profile matches expected rectangular profile; ‘active idle’ power level.]

Correction of 2 Short Runs
[Figure: corrected profiles of two short runs. Annotation: corrected power profile matches expected profile.]

Second K20c GPU
[Figure: power profile of the second K20c. Annotation: identical to original K20c.]

K20m GPU
[Figure: power profile of the K20m. Annotation: similar profile but higher power level.]

K20X GPU
[Figure: power profile of the K20X. Annotations: profile is good, no correction needed; huge 600 ms gap.]

K40m GPU
[Figure: power profile of the K40m. Annotation: K40m again requires correction.]

Application to Full CUDA Program
- Implementation of Barnes Hut n-body algorithm
  - Taken from LonestarGPU benchmark suite
  - Contains multiple regular and irregular kernels
  - Highly optimized, but still suffers from load imbalance, divergence, and uncoalesced accesses
  - Main kernel is ‘regularized’ (warp-based)
[Image credit: NASA/JPL-Caltech/SSC]

Barnes Hut Power Profile (1 Step)
[Figure: uncorrected Barnes Hut power profile for one time step. Annotations: slow then fast drop-off; “wave” in profile; original profile is hard to interpret.]

Barnes Hut Power Profile (Kernels)
[Figure: uncorrected Barnes Hut power profile with kernels marked. Annotations: slow then fast drop-off; “wave” in profile; original profile is hard to interpret.]

Corrected Barnes Hut Power Profile
[Figure: corrected Barnes Hut power profile. Annotations: decrease due to load imbalance; two similar irregular kernels; one more irregular kernel; very short regular kernel; regularized main kernel; corrected profile reveals important info.]

K20Power Tool
- Output
  - Corrected profile and corresponding ‘active’ energy
- Features
  - Computes instant power using ‘capacitor’ formula
  - Employs high-resolution time stamps
  - Samples at true frequency of 66.7 Hz
- Dissemination
  - Open source, research license

Marcher System
- Tool will be part of Marcher system at Texas State
  - NSF-funded green computing infrastructure
- Marcher is a power-measurable cluster system
  - 832 general-purpose cores
  - 12,000 GPU and MIC cores
  - 1.2 TB of DDR3 with power throttling and scaling
  - 50 TB of hybrid storage with hard drives and SSDs
  - Component-level power measurement tools (e.g., CPU, DRAM, Disk, GPU, Xeon Phi)

Summary
- Correctly measuring K20/K40 power and energy
  - Sample at 66.7 Hz and include time stamps
  - Compute true power with the presented formula
    - Use neighboring power samples to approximate slope
  - Compute true energy by integrating true power
    - Over intervals where power is above ‘active idle’
- K20Power tool
  - Software tool that implements this methodology
  - Described in an accompanying paper
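The final integration step can be sketched as below. This is a hypothetical helper, not the K20Power implementation: it assumes the input is a list of already-corrected `(time_s, watts)` samples, that the ‘active idle’ level is supplied by the caller, and that an interval counts as active only when both of its end points lie above that level.

```python
def active_energy(corrected_samples, active_idle_w):
    """Integrate corrected power (trapezoid rule, joules) over the
    intervals whose end points are both above the active-idle level."""
    energy_j = 0.0
    for (t0, p0), (t1, p1) in zip(corrected_samples, corrected_samples[1:]):
        if p0 > active_idle_w and p1 > active_idle_w:
            energy_j += 0.5 * (p0 + p1) * (t1 - t0)
    return energy_j
```

Requiring both end points to be above the threshold slightly undercounts the first and last active interval; a finer treatment would interpolate the crossing time.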

Acknowledgments
- Collaborators
  - Ivan Zecena and Ziliang Zong
- U.S. National Science Foundation
  - DUE , CNS , and CNS
- NVIDIA Corporation
  - Grants and equipment donations
- Texas State University
  - Research Enhancement Program