Performance and Energy Efficiency of GPUs and FPGAs


Performance and Energy Efficiency of GPUs and FPGAs

Betkaoui, B.; Thomas, D. B.; Luk, W., "Comparing performance and energy efficiency of FPGAs and GPUs for high productivity computing," International Conference on Field-Programmable Technology (FPT), pp. 94-101, Dec. 2010.

Jones, D. H.; Powell, A.; Bouganis, C.; Cheung, P. Y. K., "GPU Versus FPGA for High Productivity Computing," International Conference on Field Programmable Logic and Applications (FPL), pp. 119-124, Aug.-Sept. 2010.

Presented by Aishwarya Dhandapani and Taru Doodi

CPU vs. Accelerators

CPUs use task parallelism:
- Multiple tasks map to multiple threads; tasks run different instructions
- Tens of relatively heavyweight threads run on tens of cores
- Each thread is managed and scheduled explicitly, and has to be individually programmed
- Focus on improving latency

Accelerators use data parallelism:
- SIMD model (Single Instruction, Multiple Data): the same instruction operates on different data
- Tens of thousands of lightweight threads run on hundreds of cores
- Threads are managed and scheduled by hardware
- Programming is done for batches of threads (e.g., one pixel shader per group of pixels, or per draw call)
- Focus on improving throughput
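As a minimal illustration of the data-parallel model (not taken from either paper), a CUDA kernel launches one lightweight thread per element, and the hardware schedules those threads in SIMD batches:

```cuda
// One lightweight thread per data element: every thread executes the
// same instruction stream on its own element (the SIMD/SIMT model).
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Launch a batch of ~n threads; the hardware, not the programmer,
// schedules them onto the cores:
//   vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```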

NVIDIA GTX 285 Device Overview
- Stream processors: 240
- Core clock: 1400 MHz
- Process technology: 55nm
- TDP: ~200W
- Memory controller: GDDR3

NVIDIA Tesla C1060 Device Overview
- Stream processors: 240
- Core clock: 1330 MHz
- Process technology: 55nm
- TDP: ~160W
- Memory controller: GDDR3

Convey HC-1 Device Overview
- 4 Virtex-5 LX330 application-engine FPGAs (plus a fifth Virtex-5 for management)
- FPGA clock: 300 MHz
- Memory controller: DDR2
- Host: Intel Xeon 5138 clocked at 2.13 GHz

Kernel Optimizations (1/2): Convey HC-1

Convey personalities are sets of instructions designed for a particular application or class of applications, stored as pre-compiled FPGA bit files. The personalities used were the single-precision vector personality, the double-precision vector personality, and the financial analytics personality. In addition to the personalities, the Convey Math Library and Basic Linear Algebra Subprograms (BLAS) were used.

Kernel Optimizations (2/2): NVIDIA GPUs

The CUDA development model was used to benchmark the GPU. cuBLAS, NVIDIA's GPU implementation of BLAS, was used for the optimized implementations.

Why do we need optimizations? The architectures under comparison are diverse in nature. To analyze the efficiency of an application on an architecture, the application has to be programmed to take advantage of that architecture's strengths. Writing a benchmark hand-optimized for every architecture in a short period of time would be a mammoth task, so it is essential to use libraries that are already optimized for each particular device/architecture.

Memory Controllers

Memory controllers are digital circuits that manage the flow of data to and from the compute units of a processor. They contain the logic required to read from and write to DRAM, and they refresh the DRAM periodically, without which it would lose the data written to it. Double data rate (DDR) memory controllers drive DDR SDRAM, where data is transferred on both the rising and falling edges of the system's memory clock; this allows twice the data to be transferred without increasing the memory cell clock rate or bus width. GDDR is a memory type (with its own controller design) intended for graphics processors, and it differs from DDR.

Experimental Setup for Paper 1

The HC-1 is a 2U server that uses four Virtex-5s as application engines (AEs) to execute the distributed processes, another Virtex-5 for process management, and eight Stratix-IIs for memory interfaces. The resulting system has 128GB of DDR2 RAM with a maximum bandwidth of 80GB/s; the memory address space is virtually shared with the host processor, making memory allocation and management simple. The GTX285 has 240 cores running at 1.4GHz and supports up to 4GB of external GDDR3 RAM (a 1GB version is used here) with a maximum bandwidth of 159GB/s. The CPU baseline is a single core of a 2GHz quad-core Intel Core2 Xeon with 4GB of DDR3 RAM.

Experimental Setup for Paper 2

The Convey HC-1 used in this work has a single multicore Intel Xeon 5138 processor running at 2.13GHz with 8GB of RAM; the HC-1 coprocessor is configured with 16GB of accelerator-local memory. The AEs consist of four Xilinx V5LX330 FPGAs running at 300MHz; the memory controllers are implemented on eight Xilinx V5LX155 FPGAs, while the AEH is implemented on two Xilinx V5LX155 FPGAs. NVIDIA's Tesla C1060 GPU has 240 streaming processors running at 1.3GHz and 4GB of GDDR3 memory at 800MHz, offering up to 102GB/s of memory bandwidth. The CPU baseline is a quad-core Intel Xeon E5420 running multi-threaded implementations.

Kernels
- Scalar sum of a vector
- N-body simulation
- Dense matrix multiplication
- Pseudo-random number generation
- Monte Carlo methods for Asian options
- STREAM
- Fast Fourier transform

N-Body Simulation

Two-dimensional, O(N²) complexity. For each timestep:
- Calculate the force between each pair of bodies
- Sum all the forces acting on each body
- Calculate the new velocity of each body
- Calculate the new position of each body
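To make the O(N²) structure concrete, here is a sketch of the pairwise force phase as a CUDA kernel, assuming unit masses and G = 1 (this is not the papers' implementation):

```cuda
// O(N^2) force accumulation: each thread owns one body and sums the
// contributions from all N bodies (2D, unit masses, G = 1 assumed).
__global__ void compute_accel(const float2 *pos, float2 *acc, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float ax = 0.0f, ay = 0.0f;
    for (int j = 0; j < n; ++j) {
        float dx = pos[j].x - pos[i].x;
        float dy = pos[j].y - pos[i].y;
        float r2 = dx * dx + dy * dy + 1e-6f;  // softening avoids r = 0
        float inv_r = rsqrtf(r2);
        float inv_r3 = inv_r * inv_r * inv_r;  // 1 / r^3
        ax += dx * inv_r3;
        ay += dy * inv_r3;
    }
    acc[i] = make_float2(ax, ay);
}
// Velocity and position updates are separate O(N) kernels per timestep.
```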

Pseudo-Random Number Generator

Mersenne Twister algorithm producing 32-bit random numbers. The NVIDIA PRNG is implemented as custom software on a fixed architecture; the Convey PRNG uses a pipelined shift-register architecture in custom firmware as part of the financial analytics personality.
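For illustration only (the papers use NVIDIA's own PRNG implementation): today a Mersenne Twister variant is exposed through the cuRAND host API, roughly as follows:

```c
#include <curand.h>

// Generate n 32-bit pseudo-random numbers directly into device memory
// using a Mersenne Twister variant (MTGP32) via the cuRAND host API.
void generate_mt(unsigned int *d_out, size_t n) {
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MTGP32);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerate(gen, d_out, n);   // fills d_out with 32-bit uints
    curandDestroyGenerator(gen);
}
```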

STREAM

A synthetic, memory-bandwidth-limited benchmark; no data reuse is possible. Array sizes are chosen so that each array is at least 4 times larger than the device's cache, and each vector kernel is timed separately:

- COPY: c ← a
- SCALE: b ← αc
- ADD: c ← a + b
- TRIAD: a ← b + αc

where a, b, c ∈ Rm and α ∈ R.

Memory bandwidth = (total number of bytes read and written) / (time taken to complete the corresponding operation)
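For concreteness, TRIAD is a one-line CUDA kernel; a minimal sketch (the benchmark's instrumented code is more involved):

```cuda
// STREAM TRIAD: a[i] = b[i] + alpha * c[i]. Two reads and one write per
// element, so sustained bandwidth = 3 * sizeof(float) * n / elapsed time.
__global__ void triad(float *a, const float *b, const float *c,
                      float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + alpha * c[i];
}
```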

Monte Carlo Methods for Asian Options

Monte Carlo methods are a class of algorithms that use pseudo-random numbers to perform simulations, allowing the approximate solution of problems which have no tractable closed-form solution. Asian options are a form of derivative which provides a payoff related to the arithmetic average of the price of an underlying asset during the option lifetime:

Pcall = max( (1/N) Σi=1..N S(ti) − K, 0 )

where Pcall is the payoff of the Asian call option, S(ti) is the asset price at time ti, and K is the strike price. The kernel offers highly parallel execution with low memory-bandwidth requirements.
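The papers use the vendors' optimized libraries; purely to illustrate the computation, here is a sketch of a CUDA path-simulation kernel assuming geometric Brownian motion for the asset price (all parameter names are illustrative):

```cuda
#include <curand_kernel.h>

// One thread per simulated path: evolve the asset price S under
// geometric Brownian motion, average it over the path, and record the
// arithmetic-average Asian call payoff max(avg - K, 0).
__global__ void asian_call(float *payoff, int n_paths, int n_steps,
                           float S0, float K, float r, float sigma,
                           float dt, unsigned long long seed) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_paths) return;
    curandState st;
    curand_init(seed, p, 0, &st);      // independent stream per path
    float S = S0, sum = 0.0f;
    for (int i = 0; i < n_steps; ++i) {
        float z = curand_normal(&st);  // standard normal draw
        S *= expf((r - 0.5f * sigma * sigma) * dt
                  + sigma * sqrtf(dt) * z);
        sum += S;
    }
    payoff[p] = fmaxf(sum / n_steps - K, 0.0f);
}
```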

Dense Matrix Multiplication

A vital kernel in many scientific applications and one of the most important kernels for LINPACK, so HPC vendors provide both optimized hardware and optimized software libraries. The SGEMM routine in the BLAS library performs single-precision matrix-matrix multiplication, defined as follows:

C ← αAB + βC, where A, B, C ∈ Rn×n and α, β ∈ R
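On the GPU this maps directly onto a single cuBLAS call; a minimal sketch using the modern cuBLAS v2 API (the papers predate this API, so their code differs; dA, dB, dC are assumed device pointers):

```c
#include <cublas_v2.h>

// C <- alpha * A * B + beta * C for n x n column-major matrices already
// resident on the device.
void sgemm_example(cublasHandle_t handle, const float *dA, const float *dB,
                   float *dC, int n, float alpha, float beta) {
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,            // dimensions m, n, k
                &alpha, dA, n,      // A and its leading dimension
                dB, n,              // B and its leading dimension
                &beta, dC, n);      // C and its leading dimension
}
```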

Fast Fourier Transform

The fast Fourier transform is an efficient way of calculating the DFT and its inverse: it requires O(N log N) floating-point operations and O(N) memory accesses. The FFT requires both high computation throughput and high memory bandwidth, and its non-unit-stride memory accesses exhibit low spatial locality. FFTW, which is more efficient than the Intel MKL implementation, is used on the CPU; CUFFT is used for the GPU.
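A minimal sketch of the CUFFT calls for the case measured later, a 1D in-place single-precision complex-to-complex transform (error handling omitted):

```c
#include <cufft.h>

// In-place 1D single-precision complex-to-complex FFT of length n,
// on data already resident in device memory.
void fft_inplace(cufftComplex *d_data, int n) {
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);               // one batch of size n
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); // in-place transform
    cufftDestroy(plan);
}
```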

Scalar Sum of a Vector

A combination of reduce operations and synchronizations, processed as a partially synchronous reduction tree. The implementations use BLAS library routines, with 32-bit and 64-bit vector variants.
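Both papers rely on vendor library routines rather than hand-written kernels; purely to illustrate the partially synchronous reduction-tree pattern, here is a common CUDA sketch (assumes a power-of-two block size):

```cuda
// Tree reduction: each block sums a chunk in shared memory, halving the
// number of active threads (and synchronizing) at every step. Per-block
// partial sums in `out` are then reduced again or summed on the host.
__global__ void block_sum(const float *in, float *out, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x * 2 + tid;
    float v = (i < n) ? in[i] : 0.0f;
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    s[tid] = v;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}
// Launch with dynamic shared memory for one float per thread:
//   block_sum<<<grid, 256, 256 * sizeof(float)>>>(d_in, d_partial, n);
```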

Results: N-Body Simulation

Sample sizes of 4800-9600 bodies. The GPU performed 43.2× faster than the CPU; the HC-1 performed 1.9× faster than the CPU. Tsoi and Luk* implemented customized hardware and firmware and concluded that an FPGA-based N-body simulation can run ~2× faster than a GPU. Compared with their results, the GPU performance here is slightly better (7.8s versus 9.2s), while the FPGA performance is much slower (37.9s versus 5.62s).

* K. Tsoi and W. Luk, "Axel: A heterogeneous cluster with FPGAs and GPUs," Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 115-124, 2010.

Results: Pseudo-Random Number Generator

The GPU does batch processing and is sensitive to the size of the batch. The HC-1 has 128× more memory than the GTX285 (128GB versus 1GB), so larger batches can be generated. The HC-1 performs 88.9× better than the CPU; the GPU performs 89.3× better than the CPU.

Results: STREAM

Arrays of 32 million floating-point elements (4 bytes per element), requiring over 300MB of memory in total. The GPU sustains a bandwidth that is almost twice that of the HC-1.

Results: Monte Carlo Methods for Asian Options

One million simulations over a time period of 356 steps. The HC-1 performs 18× better than the multi-threaded CPU implementation; vectorization of the main FOR loop yields a major speedup. This is comparable to the single-precision GPU performance; the Convey financial analytics personality does not support single-precision floating point. The random number generator is implemented as a custom hardware library on the HC-1, while the GPU implementation is instruction-based. The GPU and the HC-1 coprocessor are only about 2 to 4 times more energy efficient than the CPU: this kernel drives the devices to near-full utilization, and hence to higher power draw than the other kernels.

Results: Monte Carlo Methods for Asian Options [charts: performance and energy efficiency]

Results: Dense Matrix Multiplication (1)

For 32-bit square matrices, the GPU performs 109.4× better than the CPU and the HC-1 performs 48.8× better. For 64-bit square matrices, the GPU performs 98.0× better than the CPU and the HC-1 performs 52.5× better. GPU performance peaks occur when the width of the matrix is a multiple of the size of the available shared memory (16KB for every group of eight cores).

Results: Dense Matrix Multiplication (2) [charts: performance and energy efficiency]

The GPU performs better in terms of both performance (up to 370 GFLOPS) and power efficiency (over 5 GFLOPS/Watt).

Results: Dense Matrix Multiplication (3)

The GPU is about 5 times faster than both the CPU and the Convey coprocessor. This speedup decreases to about 2.5 to 4.2 times if data transfer from the main memory to the GPU memory is included. The HC-1 coprocessor can be slower than the CPU when data transfers from the host processor memory to the coprocessor memory are taken into account.

Results: Fast Fourier Transform [charts: performance and energy efficiency]

Results: Fast Fourier Transform

Performance of a one-dimensional in-place single-precision complex-to-complex FFT. The FFT on the HC-1 is 16× faster than a single-threaded FFTW and 4× faster than the multi-threaded implementation. The Tesla C1060 uses GDDR memories, which are optimized for the sequential memory accesses and stream programming of graphics applications. The BLAS routine blas:sscopy, which copies a real vector into another real vector, is available on each platform; the increment (stride) between two consecutive elements in each vector can be specified.

Results: Scalar Sum of a Vector

For 32-bit vectors, the HC-1 is 125× faster than the CPU and the GPU is 306× faster. For 64-bit vectors, the HC-1 is 81× faster than the CPU and the GPU is 109× faster.

Conclusions

Paper 1:
- Convey HC-1 and GTX 285 performance compared against CPU performance
- Both devices outperformed the CPU implementation on all benchmarks
- For most benchmarks, the GPU outperformed the CPU by more than the FPGA did

Paper 2:
- GPUs often outperform FPGAs for streaming applications
- The performance of the HC-1 was limited by its floating-point performance
- The HC-1 handles non-sequential memory accesses better, which lets it outperform the GPU on applications such as FFT
- The HC-1 demonstrates superior performance and energy efficiency for applications with low memory-bandwidth requirements, such as the Monte Carlo benchmark

Pros and Cons

Paper 1:
- Comparing FPGA and GPU performance against a single-core CPU implementation is not a fair comparison
- Trade-offs in using GPUs and FPGAs are not discussed
- Power consumption is not considered
- Could have presented a better analysis of the devices considered

Paper 2:
- Detailed analysis of the collected data
- Trade-offs of both architectures discussed in depth

Questions?