Power and Performance Characterization of Computational Kernels on the GPU
Yang Jiao, Heshan Lin, Pavan Balaji (ANL), Wu-chun Feng

Presentation transcript:

Graphics Processing Units (GPUs) are Powerful

GPUs are Increasingly Popular in HPC
- Three of the top five supercomputers are GPU-based

GPUs are Power Hungry
- It is imperative to investigate green GPU computing

Green Computing with DVFS on CPUs
- Mechanism: Power ∝ Voltage² × Frequency
- Minimizing performance impact
  - Lower the voltage and frequency when the CPU is not on the critical path
- What about GPUs?
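
As a rough, hedged illustration of why DVFS saves energy (the textbook CMOS dynamic-power reasoning, not a result from these slides): if the supply voltage can be scaled down roughly in proportion to frequency, lowering both reduces dynamic power cubically.

```latex
% Illustrative only: standard dynamic-power model, not taken from the slides.
% Dynamic power of a CMOS circuit with switched capacitance C, voltage V,
% and frequency f:
\[
  P_{\mathrm{dyn}} \propto C \, V^{2} f .
\]
% Halving f, and scaling V down proportionally, gives
\[
  \frac{P'_{\mathrm{dyn}}}{P_{\mathrm{dyn}}}
    = \left(\frac{1}{2}\right)^{2} \cdot \frac{1}{2}
    = \frac{1}{8},
\]
% so even if runtime doubles for work off the critical path,
% dynamic energy (power x time) drops by roughly a factor of 4.
```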

What is this Paper About?
- Characterize performance and power for various kernels on GPUs
  - Kernels with different compute and memory intensiveness
  - Various core and memory frequencies
- Contributions
  - Reveal unique frequency-scaling behaviors on GPUs
  - Provide useful hints for green GPU computing

Outline
- Introduction
- GPU Overview
- Characterization Methodology
- Experimental Results
- Conclusion & Future Work

NVIDIA GTX280 Architecture
- On-chip memory: small sizes, fast access
- Off-chip memory: large size, high access latency (device/global memory)
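
For readers unfamiliar with the two memory spaces, here is a minimal OpenCL sketch (illustrative only; the kernel name and operation are assumptions, not from the paper) of data being staged from off-chip global memory into fast on-chip local memory:

```c
// Illustrative OpenCL kernel (not from the paper): each work-group stages data
// from off-chip __global memory into on-chip __local memory, computes out of
// the fast on-chip copy, and writes the result back to global memory.
__kernel void scale_tile(__global const float *in,   // off-chip: large, slow
                         __global float *out,        // off-chip: large, slow
                         __local  float *tile)       // on-chip: small, fast
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    tile[lid] = in[gid];              // one trip across the off-chip memory bus
    barrier(CLK_LOCAL_MEM_FENCE);     // make the tile visible to the work-group

    out[gid] = 2.0f * tile[lid];      // compute out of on-chip memory
}
```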

OpenCL
- Write once, run on any GPU
- Allows the programmer to fully exploit the power of GPUs
- Compute kernel: a function executed on the GPU
(Figure: OpenCL device abstraction)
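
A hedged sketch of how such a compute kernel is launched from the host with the standard OpenCL 1.x API (error handling and build-log checks omitted for brevity; the buffer size and work-group size are arbitrary choices, not values from the paper):

```c
/* Minimal OpenCL 1.x host program (illustrative; error checks omitted). */
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void scale_tile(__global const float *in,        \n"
    "                         __global float *out,             \n"
    "                         __local  float *tile) {          \n"
    "    size_t gid = get_global_id(0), lid = get_local_id(0); \n"
    "    tile[lid] = in[gid];                                  \n"
    "    barrier(CLK_LOCAL_MEM_FENCE);                         \n"
    "    out[gid] = 2.0f * tile[lid];                          \n"
    "}                                                         \n";

int main(void)
{
    enum { N = 1024, LOCAL = 128 };
    float in[N], out[N];
    for (int i = 0; i < N; i++) in[i] = (float)i;

    /* Pick the first GPU device; set up a context and command queue. */
    cl_platform_id platform;  cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Compile the kernel source at runtime ("write once, run on any GPU"). */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "scale_tile", NULL);

    /* Allocate device (global) memory and copy the input across. */
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof(in),  NULL, NULL);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(out), NULL, NULL);
    clEnqueueWriteBuffer(q, d_in, CL_TRUE, 0, sizeof(in), in, 0, NULL, NULL);

    /* Bind arguments; a NULL value with a byte size allocates __local memory. */
    clSetKernelArg(k, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(k, 1, sizeof(cl_mem), &d_out);
    clSetKernelArg(k, 2, LOCAL * sizeof(float), NULL);

    /* Launch N work-items in groups of LOCAL, then read the result back. */
    size_t global = N, local = LOCAL;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueReadBuffer(q, d_out, CL_TRUE, 0, sizeof(out), out, 0, NULL, NULL);

    printf("out[1] = %.1f\n", out[1]);   /* expect 2.0 */
    return 0;
}
```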

GPU Frequency Scaling
- Two-dimensional: compute core frequency and memory frequency
- Semi-automatic
  - Dynamic configuration is not supported; the user can only control peak frequencies
  - The GPU automatically switches to an idle mode when there is no computation (details not publicly available)

Outline
- Introduction
- GPU Overview
- Characterization Methodology
- Experimental Results
- Conclusion & Future Work

Kernel Selection
- High performance of GPUs
  - Massive parallelism (e.g., 240 cores)
  - High memory bandwidth (e.g., 140 GB/s)
- Three kernels spanning the compute-intensive to memory-intensive spectrum (see the sketch below)
  - Matrix multiplication (compute intensive)
  - Matrix transpose (memory intensive)
  - Fast Fourier Transform (FFT)
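
To make the compute/memory distinction concrete, here is a hedged sketch of a naive OpenCL matrix transpose (not the paper's implementation): it does one off-chip load and one off-chip store per element with essentially no arithmetic, so its throughput tracks memory bandwidth, whereas an N×N matrix multiplication performs on the order of N floating-point operations per element it loads and is therefore bound by the compute cores.

```c
// Naive matrix transpose in OpenCL (illustrative, not the paper's kernel).
// One global load + one global store per element and no arithmetic:
// a memory-intensive kernel whose performance follows memory frequency.
__kernel void transpose_naive(__global const float *in,
                              __global float *out,
                              const unsigned int width,
                              const unsigned int height)
{
    size_t x = get_global_id(0);   // column index in the input
    size_t y = get_global_id(1);   // row index in the input

    if (x < width && y < height)
        out[x * height + y] = in[y * width + x];
}
```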

Kernel Characteristics
- Memory-to-compute ratio
- Instruction throughput
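
The transcript does not preserve the exact definitions of these metrics, so the following is only an assumption about what a memory-to-compute ratio (R_mem) and a normalized instruction throughput (R_ins) typically mean, not the paper's formulas:

```latex
% Assumed formulations only; not reproduced from the paper.
\[
  R_{\mathrm{mem}} \approx
    \frac{\text{off-chip (global) memory transactions}}
         {\text{total dynamic instructions}},
  \qquad
  R_{\mathrm{ins}} \approx
    \frac{\text{achieved instruction throughput}}
         {\text{peak instruction throughput}}.
\]
```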

Kernel Profile

         Matrix Multiplication   Matrix Transpose   FFT
R_mem    5.6%                    53.7%              8.3%
R_ins

Measurement
- Performance
  - Matrix multiplication, FFT: GFLOPS
  - Matrix transpose: MB/s
- Energy
  - Measured for the whole system while the kernel executes on the GPU
- Power
  - Reported as the average power
- Energy efficiency
  - Performance / power
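
A hedged illustration of how these metrics combine; the matrix size, execution time, and power below are invented sample values, not measurements from the paper, and the 2n³ FLOP count is the standard one for dense matrix multiplication:

```c
/* Illustrative metric arithmetic; sample values are invented, not measured. */
#include <stdio.h>

int main(void)
{
    const double n = 4096.0;          /* square matrix dimension (assumed)  */
    const double time_s = 0.50;       /* kernel execution time, seconds     */
    const double avg_power_w = 220.0; /* average whole-system power, watts  */

    /* Dense matrix multiplication performs 2*n^3 floating-point operations. */
    double gflops = (2.0 * n * n * n) / time_s / 1e9;

    /* Energy = average power * time; efficiency = performance / power. */
    double energy_j = avg_power_w * time_s;
    double gflops_per_watt = gflops / avg_power_w;

    printf("%.1f GFLOPS, %.1f J, %.3f GFLOPS/W\n",
           gflops, energy_j, gflops_per_watt);
    return 0;
}
```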

Outline
- Introduction
- GPU Overview
- Characterization Methodology
- Experimental Results
- Conclusion & Future Work

Experimental Setup
- System
  - Intel Core 2 Quad Q6600
  - NVIDIA GTX280
  - 1 GB memory
- Power meter
  - Watts Up? Pro ES

Matrix Multiplication - Performance
- Mostly affected by core frequency; almost unaffected by memory frequency

Matrix Multiplication - Power
- Mostly affected by core frequency; slightly affected by memory frequency

Matrix Multiplication - Efficiency
- Best efficiency achieved at the highest core frequency and a relatively high memory frequency

Matrix Transpose - Performance
- Performance dominated by memory frequency

Matrix Transpose - Power
- A higher core frequency increases power consumption (but not performance)

Matrix Transpose - Efficiency
- Best efficiency achieved at the highest memory frequency and the lowest core frequency

FFT - Performance
- Affected by both core and memory frequencies

FFT - Power
- Affected by both core and memory frequencies

FFT - Efficiency
- Best efficiency at the highest core and memory frequencies

FFT – Two-Dimensional Effect
(Figure: combined core and memory frequency-scaling chart; the slide annotates a 7% difference)

Power and Efficiency Range

Conclusion & Future Work
- Takeaways
  - Green computing on GPUs is important
  - GPU frequency scaling differs considerably from that of CPUs
- Next steps
  - Finer-grained characterization (e.g., different types of operations)
  - Experiments on Fermi and AMD GPUs

Acknowledgment
- NSF Center for High-Performance Reconfigurable Computing (CHREC) for their support through NSF I/UCRC Grant IIP
- National Science Foundation for their partial support through CNS and CNS

Questions?