An Integrated GPU Power and Performance Model (ISCA’10: International Symposium on Computer Architecture, June 19–23, 2010, Saint-Malo, France). Presented 2012.09.14.

An Integrated GPU Power and Performance Model (ISCA’10, June 19–23, 2010, Saint-Malo, France). Authors: Sunpyo Hong (Electrical and Computer Engineering, Georgia Institute of Technology) and Hyesoon Kim (School of Computer Science, Georgia Institute of Technology). Presented by Shih-Meng Teng.

Outline  Introduction  Background on power  Power and temperature model  IPP : Integrated power and performance model  Experiment and result  Conclusion

Introduction  The increasing computing power of GPUs gives them a considerably higher throughput than that of CPUs.  Optimizing GPU kernels to achieve high performance is still a challenge.  Optimizing an application to achieve a better power efficiency is even more difficult.

Introduction (cont.)  The number of cores inside a chip, especially in GPUs, is increasing dramatically.  For example, NVIDIA’s GTX280 has 30 streaming multiprocessors (SMs) with 240 CUDA cores, and the next-generation GPU will have 512 CUDA cores.  Although GPU applications are highly throughput-oriented, not all applications require all available cores to achieve the best performance.  Do we need all cores to achieve the highest performance?  Can we save power and energy by using fewer cores?

Introduction (cont.)  In Type 1, performance increases linearly with the number of cores, because the application can utilize the computing power of all the cores.  In Type 2, performance saturates after a certain number of cores due to memory bandwidth limitations.

Introduction (cont.)  In this paper, the number of cores that shows the highest performance per watt is called the optimal number of cores.  In Type 2, since performance does not increase linearly, using all the cores consumes more energy than using the optimal number of cores.  Hence, if we can predict the optimal number of cores statically, the compiler (or the programmer) can configure the number of threads/blocks to utilize fewer cores, or the hardware or a dynamic thread manager can enable fewer cores.
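To make the energy argument concrete: energy is power integrated over time, so once performance saturates, extra cores add power without reducing time. A worked sketch with purely hypothetical numbers (not measurements from the paper):

```latex
E = P \times t
\qquad
E_{30\,\text{cores}} = 170\,\mathrm{W} \times 10\,\mathrm{s} = 1700\,\mathrm{J},
\quad
E_{20\,\text{cores}} = 150\,\mathrm{W} \times 10\,\mathrm{s} = 1500\,\mathrm{J}
% Hypothetical Type 2 kernel whose runtime stops improving beyond 20 cores:
% same execution time, lower power, so roughly 12% energy saved by idling 10 cores.
```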

Introduction (cont.)  We propose an integrated power and performance prediction system, which we call IPP.  IPP predicts the optimal number of cores that results in the highest performance per watt.

Introduction (cont.)  The results show that by using fewer cores based on the IPP prediction, we can save up to 22.09% and on average 10.99% of runtime energy consumption for the five memory-bandwidth-limited benchmarks.  We also estimate the amount of energy savings for GPUs that employ a power-gating mechanism. Our evaluation shows that with power gating, IPP can save on average 25.85% of the total GPU power for the five bandwidth-limited benchmarks.

Background on power  Power consumption can be divided into two parts: dynamic power and static power.  Dynamic power:  Caused by switching activity in transistors.  Determined by runtime behavior.  Static power:  Determined by circuit technology, chip layout, and operating temperature.

Background on power (cont.)  Building a Power Model Using an Empirical Method  Isci and Martonosi proposed an empirical method for building a power model; they measured and modeled the Intel Pentium 4 processor.  Basic power model:  Access rates indicate how often an architectural unit is accessed per unit of time, where one is the maximum value.
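The slide's equation image is not reproduced; below is a hedged sketch of the access-rate-based formulation in the style of Isci and Martonosi, which the paper's Equation (2) follows (the exact component set and coefficients are in the paper):

```latex
P_{\text{dynamic}} \;=\; \sum_{i=1}^{n} \text{AccessRate}(C_i) \times \text{MaxPower}(C_i) \;+\; P_{\text{idle}},
\qquad 0 \le \text{AccessRate}(C_i) \le 1
```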

Background on power (cont.)  Static model  To understand static power consumption and temperature effects, we briefly describe static power models.  Leakage power grows with operating temperature; however, in the normal operating temperature range, it can be simplified to a linear model.
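For reference, a common shape for the static model and its linear simplification; the exponential leakage form is the standard approximation rather than a quotation of the paper's equation, and σ and T₀ are empirically fitted constants:

```latex
% Leakage current grows roughly exponentially with temperature T,
P_{\text{static}} \;\propto\; V \cdot N_{\text{tr}} \cdot k_{\text{design}} \cdot \hat{I}_{\text{leak}}(T)
% but within the normal operating range a linear fit suffices:
P_{\text{static}}(T) \;\approx\; P_{\text{static}}(T_0) \;+\; \sigma\,(T - T_0)
```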

Power and performance model  GPU power consumption (GPU_power) is modeled similarly to Equation (2).  Overall model
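A hedged sketch of the overall decomposition (the precise component list is in Table 1 and the paper's equations): total GPU power splits into runtime power, accumulated from the per-component access-rate terms of the SMs plus memory power, and idle power:

```latex
\text{GPU\_power} \;=\; \text{Runtime\_power} \;+\; \text{Idle\_power},
\qquad
\text{Runtime\_power} \;=\; \sum_{i \in \text{SM comps}} \text{AR}_i \times \text{MaxPower}_i \;+\; \text{Memory\_power}
```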

Power and performance model (cont.)  Modeling power consumption from streaming processors  Table 1 summarizes the modeled architectural components used by each instruction type and the corresponding variable names in Equation (7).

Power and performance model (cont.)

 Access Rate:  As Equation (2) shows, dynamic power consumption depends on the access rate of each hardware component (see Table 2). We divide execution cycles by four because one instruction is fetched, scheduled, and executed every four cycles.
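A sketch of the access-rate computation this implies; the variable names here are illustrative, not the paper's exact notation:

```latex
\text{AccessRate}(C_i) \;=\; \frac{\text{accesses to } C_i}{\text{Exec\_cycles}/4}
% dividing cycles by 4 converts them into instruction-issue slots,
% since one instruction issues every four cycles on this architecture
```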

Power and performance model (cont.)  Modeling memory power

Power and performance model (cont.)  Power model parameters  To obtain the power model parameters, we design a set of synthetic microbenchmarks that stress different architectural components in the GPU.
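A minimal sketch of one such microbenchmark, here stressing the floating-point units: a long chain of dependent multiply-adds keeps the FPU access rate near 1.0 while generating no memory traffic. The kernel name, launch shape, and iteration count are illustrative, not taken from the paper:

```cuda
// fpu_stress.cu: isolate FPU power by keeping every thread busy with
// dependent multiply-adds and touching memory only once at the end.
__global__ void fpu_stress(float *out, int iters)
{
    float v = threadIdx.x * 0.001f;
    for (int i = 0; i < iters; ++i) {
        v = v * 1.000001f + 0.5f;      // dependent multiply-adds: no memory
        v = v * 0.999999f + 0.25f;     // traffic, FPU access rate near maximum
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;  // single store at the end
}

int main(void)
{
    float *d_out;
    cudaMalloc(&d_out, 30 * 256 * sizeof(float));
    fpu_stress<<<30, 256>>>(d_out, 1 << 20);  // one block per SM on a GTX280
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```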

Power and performance model (cont.)

 Power model parameters
 Figure 3 shows how the overall power is distributed among the individual architectural components for all the evaluated benchmarks.

Power and performance model (cont.)  Active SMs vs. Power Consumption  To measure the power consumption of each SM, we design another set of microbenchmarks that control the number of active SMs.  Even though the evaluated GPU does not employ power gating, idle SMs do not consume as much power as active SMs because of their low activity factors.  There are still significant differences in the total power consumption depending on the number of active SMs in a GPU (see Figure 4).

Power and performance model (cont.)  Active SMs vs. Power Consumption

Power and performance model (cont.)  Active SMs vs. Power Consumption  The maximum power delta between using only one SM and using all SMs is 37 W.  Since there is no power gating, the power consumption does not increase linearly with the number of SMs.  We therefore use a log-based model instead of a linear fit, as shown in Equation (14).
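The exact fit is the paper's Equation (14); a hedged sketch of a logarithmic form consistent with the 37 W delta above:

```latex
\text{SM\_power}(n) \;\approx\; P_{1\,\text{SM}} \;+\; \beta \ln(n),
\qquad
\beta \approx \frac{37\,\mathrm{W}}{\ln 30} \approx 10.9\,\mathrm{W}
% n = number of active SMs; beta fitted so the 1-SM-to-30-SM delta matches 37 W
```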

Power and performance model (cont.)  Active SMs vs. Power Consumption  Finally, runtime power can be modeled by taking the number of active SMs into account, as shown in Equation (16).

Power and performance model (cont.)  Temperature model  CPU temperature models are typically represented by an RC model. We determine the model parameters empirically using a step-function experiment.  Figure 5 shows estimated and measured temperature variations.  γ and δ are determined by the thermal characteristics of the GPU (Table 3).
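A standard first-order RC step response matching this description; γ (heating) and δ (cooling) are the empirically fitted time constants from Table 3, and the functional form below is the usual RC model rather than a quotation of the paper's equations:

```latex
T_{\text{rise}}(t) \;=\; T_{\max} - \left(T_{\max} - T_{\text{init}}\right) e^{-t/\gamma},
\qquad
T_{\text{fall}}(t) \;=\; T_{\text{idle}} + \left(T_{\text{init}} - T_{\text{idle}}\right) e^{-t/\delta}
```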

Power and performance model (cont.)  Temperature model

Power and performance model (cont.)  Temperature model  Max_temp is a function of runtime power, which depends on application characteristics.  We discovered that the chip temperature is strongly affected not only by runtime power consumption but also by the rate of GDDR memory accesses.

Power and performance model (cont.)  Modeling increases in power consumption  Section 2.2 discussed the impact of temperature on static power consumption.  Since we cannot control the operating voltage of the evaluated GPUs at runtime, we only consider operating temperature effects.

Power and performance model (cont.)

 Modeling increases in power consumption
 The measured Δpower of the GPU is 14 W. The fan consumes 4 W of that, so the actual Δpower of the GPU is 10 W, which is attributed to static power consumption.
 The initial power consumption is given by Equation (16); the temperature rise is given by Equations (17) and (18).
 The static-power coefficient is determined by Δpower and Δtemperature: σ = Δpower / Δtemperature = 10 / 22 ≈ 0.45 W/℃.

IPP : Integrated power and performance model  Integrated power and performance model to predict performance per watt.  Getting the optimal number of active cores.  We used a recently developed GPU analytical timing model to predict the execution time. Its parameters are:
 MWP (Memory Warp Parallelism): the number of memory requests that can be serviced concurrently.
 CWP (Computation Warp Parallelism): the number of warps that can finish one computational period during one memory access period.
 N: the number of warps.
 Mem_L (memory latency): the average memory latency.
 Mem_cycles: the processor waiting cycles for memory operations.
 Comp_cycles: the execution time of all instructions.
 Repw: the number of times that each SM needs to repeat the same set of computation.

IPP : Integrated power and performance model (cont.)  Performance modeling: with memory bandwidth limitation
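A hedged sketch of how the bandwidth limit enters, following the Hong–Kim analytical model: the effective memory warp parallelism is capped by the ratio of peak memory bandwidth to the aggregate per-warp bandwidth demand:

```latex
\text{MWP}_{\text{peak\_BW}} \;=\; \frac{\text{Mem\_Bandwidth}}{\text{BW\_per\_warp} \times \#\text{ActiveSMs}},
\qquad
\text{MWP} \;=\; \min\!\left(\text{MWP}_{\text{no\_BW\_limit}},\; \text{MWP}_{\text{peak\_BW}},\; N\right)
```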

IPP : Integrated power and performance model (cont.)  Performance modeling: without memory bandwidth limitation

IPP : Integrated power and performance model (cont.)  There are not enough running threads in the system.

IPP : Integrated power and performance model (cont.)  Performance modeling: the application is computationally intensive.

IPP : Integrated power and performance model (cont.)  Optimal number of cores for highest performance per watt  Calculate MWP
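The intuition behind the optimal core count: once the active cores' aggregate bandwidth demand reaches the peak memory bandwidth, additional cores add power but no performance, so performance per watt starts to fall. A hedged sketch of the resulting estimate (the paper derives the precise expression via MWP):

```latex
N_{\text{opt}} \;\approx\; \left\lceil \frac{\text{Mem\_Bandwidth}}{\text{BW\_per\_warp} \times N_{\text{warps\_per\_SM}}} \right\rceil
% beyond N_opt active SMs, GIPS stays flat while power keeps rising
```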

IPP : Integrated power and performance model (cont.)  Final IPP model  Limitations of IPP:  Control-flow-intensive applications  Asymmetric applications  Texture-cache-intensive applications

IPP : Integrated power and performance model (cont.)  Using Results of IPP  We constrain the number of active cores based on the output of IPP by limiting only the number of blocks inside an application, since we cannot change the hardware or the thread scheduler.  IPP passes only the optimal number of cores to the runtime system, and either the hardware or the runtime thread scheduler enables only the required number of cores to save energy. (A sketch of the block-limiting idea appears below.)
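A minimal sketch of the block-limiting idea, assuming a grid-stride kernel so the total work is independent of the grid size; launching only N_opt blocks (one resident block per SM) leaves the remaining SMs idle. This illustrates the mechanism, not the paper's actual runtime system:

```cuda
// Grid-stride kernel: the same total work is done regardless of how many
// blocks (and therefore active SMs) the launch uses.
__global__ void scale_kernel(float *data, int total_work)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < total_work;
         i += gridDim.x * blockDim.x)
        data[i] = data[i] * 2.0f + 1.0f;
}

// Launch with the IPP-predicted core count: one block per active SM
// (assumes a single resident block is enough to occupy an SM).
void launch_with_ipp(float *d_data, int total_work, int n_opt_cores)
{
    scale_kernel<<<n_opt_cores, 256>>>(d_data, total_work);
    cudaDeviceSynchronize();
}
```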

Experiment and result  Environment  NVIDIA GTX280  30 SMs, 65 nm technology  We use the Extech AC/DC Power Analyzer to measure the overall system power consumption.  The raw power data is sent to a data-logging machine every 0.5 seconds.  Each micro-benchmark executes for an average of 10 seconds.  Idle power of the system = 159 W  Idle power of the GPU = 83 W  Benchmarks  The Merge benchmarks, five additional memory benchmarks, and one computation-intensive benchmark.

Experiment and result  Environment - Benchmark

Experiment and result (cont.)  Evaluation of Runtime Power Model (Micro-benchmarks)  Figure 7 compares the predicted power consumption with the measured power values for the micro-benchmarks.  The geometric mean of the power prediction error for the micro-benchmarks is 2.5%.

Experiment and result (cont.)  Evaluation of Runtime Power Model (GPGPU Kernels)  Figure 9 compares the predicted power and the measured power consumption for the evaluated GPGPU kernels.  The geometric mean of the power prediction error is 9.18% for the GPGPU kernels.

Experiment and result (cont.)  Temperature model  The initial temperature is 57 ℃.  The temperature saturates after around 600 seconds.  The peak temperature depends on the peak runtime power consumption.  Based on Equation (22), we can predict the increase in runtime power. Example: the runtime power of SVM would increase by 10 W after 600 seconds.

Experiment and result (cont.)  Power Prediction Using IPP  Using predicted execution times could have increased the error in the predicted power values, but since the error of the timing model is low, the overall error of the IPP system does not increase significantly.  The geometric mean of the power prediction error of IPP is 8.94% for the GPGPU kernels and 2.7% for the micro-benchmarks, which is similar to using real execution time measurements.

Experiment and result (cont.)  Performance and Power Efficiency Prediction Using IPP  In this experiment, we vary the number of active cores by varying the number of blocks in the CUDA applications.  Cmem is a non-bandwidth-limited benchmark, so it shows a linear performance improvement in both the measured data and the predicted values.

Experiment and result (cont.)  Performance and Power Efficiency Prediction Using IPP  Cmem shows a linear correlation between its bandwidth consumption and the number of active cores, but it still cannot reach the peak memory bandwidth.  The memory bandwidths of the remaining benchmarks saturate when the number of active cores is around 19.

Experiment and result (cont.)  Performance and Power Efficiency Prediction Using IPP

Experiment and result (cont.)  Performance and Power Efficiency Prediction Using IPP  Figure 16 shows GIPS/W for all the GPGPU kernels running on 30 active cores.  The GIPS/W values of the non-bandwidth-limited benchmarks are much higher than those of the bandwidth-limited benchmarks.  Except for Bino and Bs, IPP predicts GIPS/W values fairly accurately.  The errors in the predicted GIPS/W values of Bino and Bs are attributed to the differences between their predicted and measured runtime performance.

Experiment and result (cont.)  Performance and Power Efficiency Prediction Using IPP

Experiment and result (cont.)  Energy Savings by Using the Optimal Number of Cores Based on IPP
A. Runtime+Idle: the energy savings when the total GPU power is used in the calculation.
B. Runtime: the energy savings when only the runtime power from Equation (5) is used.
C. Powergating: the predicted energy savings if power gating is applied.
 The average energy savings for the Runtime case is 10.99%.
 When power gating is applied, the average energy savings is 25.85%.

Experiment and result (cont.)  Energy Savings by Using the Optimal Number of Cores Based on IPP

Conclusion  We proposed an integrated power and performance modeling system (IPP) for the GPU architecture and GPGPU kernels.  Using the proposed power model and the newly developed timing model, IPP predicts performance per watt and the optimal number of cores to achieve energy savings.  IPP predicts the power consumption and the execution time with an average error of 8.94% for the evaluated GPGPU kernels.  Based on IPP, the system can save on average 10.99% of runtime energy consumption for the bandwidth-limited applications by using fewer cores.

Q & A

MWP and CWP  MWP  The maximum number of warps per SM that can access memory simultaneously, during the period from just after the SM executes a memory instruction from one warp until all the memory requests from that same warp are serviced.  CWP  The number of warps that the SM can execute during one memory warp waiting period, plus one.

1. CWP > MWP (N: number of active running warps per SM)
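The equation image is not reproduced here; following the Hong–Kim analytical timing model, the memory-bound case (CWP ≥ MWP) is roughly:

```latex
\text{Exec\_cycles} \;=\; \text{Mem\_cycles} \times \frac{N}{\text{MWP}}
\;+\; \frac{\text{Comp\_cycles}}{\#\text{Mem\_insts}} \times (\text{MWP} - 1)
```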

2. CWP < MWP
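For the compute-bound case (MWP > CWP), computation hides all but one exposure of the memory latency; roughly:

```latex
\text{Exec\_cycles} \;=\; \text{Mem\_L} \;+\; \text{Comp\_cycles} \times N
```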

3. Not Enough Warps Running
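When too few warps are running to fill either pipeline, both the memory and computation periods are exposed; roughly:

```latex
\text{Exec\_cycles} \;=\; \text{Mem\_cycles} \;+\; \text{Comp\_cycles}
\;+\; \frac{\text{Comp\_cycles}}{\#\text{Mem\_insts}} \times (\text{MWP} - 1)
```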