Debunking the 100X GPU vs. CPU Myth

1 Debunking the 100X GPU vs. CPU Myth
An Evaluation of Throughput Computing on CPU and GPU Authors: Victor W Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal and Pradeep Dubey Presentation: Sam Delaney

2 Breakdown The problem! CPU and GPU structure! Throughput!
The hardware! The results! The explanation! Wrap up!

3 What are we talking about?
The Problem!

4 Some history The past years have seen a great increase in the use of, and demand for, digital storage. Demand has not subsided. Much of this data can be classified as having high data-level parallelism.

5 The problem as it stands today
How can we process all this data, and do it fast? With throughput computing applications! – applications which provide users with the correct content in a reasonable amount of time. Problem solved!

6 OK, not quite Our solution, as it turns out, can run on two platforms:
CPU and GPU. So now the question becomes: “Which is better?”

7 Let’s look at the fundamental differences
CPU: Designed to do a lot of things; provides fast response times for single tasks. Awesome features: branch prediction, out-of-order execution, etc. But… high power consumption limits the number of cores per die. GPU: Designed for rendering. Great for processes with ample data parallelism (see rendering). Can trade single-thread performance for increased parallel processing.

8 How do they handle parallelism?
CPU: Would provide the best single-thread performance, but the lack of cores means less work can be done at once. GPU: Many cores! But the graphics pipeline lacks certain processing capabilities for general workloads, resulting in poorer throughput performance.

9 So which is better? If we look at the research, many data-level parallel applications see 10X+ speedup when using GPUs. STOP DRINKING THE KOOL-AID. Let’s reevaluate these claims and see why some kernels work better on the GPU architecture and others on the CPU.

10 How are we going to test throughput?
The Kernels

11 In order to determine a victor we need a contest of might
Fourteen kernels were chosen to serve as the benchmark of parallel performance. They all contain large amounts of data-level parallelism. BEHOLD

12 Kernels explained SGEMM MC
SGEMM: Single-Precision General Matrix Multiply. Maps to SIMD architecture in a straightforward manner; simple threading. MC: Monte Carlo method, using random numbers on complex functions to predict x (insert something into x). Maps well to SIMD architecture.
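As a rough sketch of why SGEMM maps so cleanly to SIMD and many-core hardware (plain Python standing in for the tuned vector code; not the paper's implementation):

```python
# Naive SGEMM sketch: C = alpha*A@B + beta*C.
# Every output element C[i][j] is computed independently,
# which is what makes the kernel embarrassingly parallel.
def sgemm(alpha, A, B, beta, C):
    n, k, m = len(A), len(B), len(B[0])
    out = [[beta * C[i][j] for j in range(m)] for i in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            out[i][j] += alpha * acc
    return out
```

A SIMD or GPU version simply assigns blocks of (i, j) pairs to lanes or threads, since no output depends on another.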

13 Kernels explained Conv FFT
Conv: Convolution – common image filtering operation used for blur, sharpen, etc. Each pixel is calculated independently, for maximum parallelism. Can cause memory alignment issues with SIMD computations. FFT: Fast Fourier Transform. Converts signals from the time domain to the frequency domain, or vice versa. The arithmetic is simple; the data access patterns are not trivial.
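The per-pixel independence of convolution can be seen in a one-dimensional sketch (plain Python, zero-padded borders assumed; an image filter does the same thing in 2D):

```python
# 1D convolution sketch: each output sample depends only on its
# own small window of the input, so all outputs can be computed
# in parallel. The window straddling arbitrary offsets is what
# causes the SIMD alignment issues mentioned above.
def convolve(signal, kernel):
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal))]
```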

14 Kernels explained SAXPY LBM
SAXPY: Single-precision a·X Plus Y. A combination of scalar multiplication and vector addition. Maps well to SIMD. LBM: Lattice Boltzmann Method. Used in fluid dynamics; uses the discrete Boltzmann equation to simulate the flow of a Newtonian fluid. Suitable for both task- and data-level parallelism.
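SAXPY is small enough to show in full (plain Python sketch; the real kernel runs over single-precision arrays):

```python
# SAXPY sketch: y <- a*x + y, elementwise.
# Every element is independent -- ideal SIMD fodder -- but the
# kernel does so little arithmetic per byte moved that it ends
# up memory-bandwidth bound on both platforms.
def saxpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]
```

That low arithmetic intensity is why SAXPY's GPU speedup is capped by the bandwidth ratio rather than the FLOP ratio.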

15 Kernels explained Solv SpMV
Solv: Constraint solver. Used in game collision simulations; computes separating forces to keep objects from penetrating each other. SpMV: Sparse Matrix-Vector Multiplication. Used at the heart of many iterative solvers.
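A minimal SpMV sketch, assuming the common CSR (compressed sparse row) storage format (the paper evaluates its own tuned layouts; this just shows the access pattern):

```python
# SpMV sketch over CSR storage: values/col_idx hold the nonzeros,
# and row_ptr[i]:row_ptr[i+1] delimits row i. The indirect reads
# x[col_idx[j]] are exactly the irregular "gather" accesses that
# make SpMV awkward for SIMD hardware.
def spmv(row_ptr, col_idx, values, x):
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[j] * x[col_idx[j]]
        y.append(acc)
    return y
```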

16 Kernels explained GJK Sort
GJK: Algorithm for collision detection and resolution of convex objects in physically based simulations in virtual environments. Sort: Radix sort. Run with SIMD on the GPU; better performance without SIMD on the CPU.
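A sketch of the radix sort idea, assuming non-negative integer keys (the paper's implementations are heavily tuned; this only shows the per-pass structure):

```python
# LSD radix sort sketch: each pass bins keys on one digit
# (8 bits here), then concatenates the bins in order. Each pass
# is a histogram followed by a scatter -- the scatter step is
# where SIMD helped on the GPU but not on the CPU.
def radix_sort(a, bits=8):
    mask = (1 << bits) - 1
    shift = 0
    while any(v >> shift for v in a):
        buckets = [[] for _ in range(mask + 1)]
        for v in a:  # stable: keys keep their order within a bucket
            buckets[(v >> shift) & mask].append(v)
        a = [v for b in buckets for v in b]
        shift += bits
    return a
```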

17 Kernels explained RC Search
RC: Ray Casting. Used to visualize 3D datasets; gives memory access a run for its money. Search: In-memory tree-structured index search. I think we’ve all done this at least once.

18 Kernels explained Hist Bilat
Hist: Histogram computation. Image processing algorithm; bins pixels from the pipeline. Bilat: Bilateral filter. Non-linear filter used in image processing for edge-preserving smoothing operations.
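A serial histogram sketch (plain Python; bin count and range are illustrative) makes the later synchronization point concrete:

```python
# Histogram sketch: bin pixel values into equal-width bins.
# Many inputs update a few shared counters, so a parallel
# version needs atomics or per-thread private histograms that
# are merged afterwards -- the synchronization cost that hurts
# this kernel on the GPU.
def histogram(pixels, num_bins, max_val=256):
    counts = [0] * num_bins
    for p in pixels:  # assumes 0 <= p < max_val
        counts[p * num_bins // max_val] += 1
    return counts
```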

19 That was painful; at least we’re done. Now on to:
The Hardware

20 The Contenders
Intel Core i7 CPU: four cores; 3.2 GHz clock speed; out-of-order superscalar architecture; 2-way hyper-threading; 32KB L1 and 256KB L2 cache per core; 8MB shared L3 cache.
Nvidia GTX280 GPU: 240 CUDA cores; 1.3 GHz clock speed; hardware multithreading support; various on-chip memories; special function units (math units, texture units).

21 That was really technical and hurt my brain, let’s break it down
Processing element difference: CPU cores are complex to allow wide use across different applications, which limits the number of cores per die. GPU SMs are simple, with high throughput in mind; that simplicity means many cores can be placed on a card.
Cache size / multi-threading: CPUs have been increasing cache sizes to boost performance and adding hardware prefetchers to reduce memory latency. GPUs use lightweight threads to hide memory latency and have much smaller caches than CPUs.
Bandwidth difference: CPU – 32 GB/s; GPU – 141 GB/s.
Other: The CPU is not so good at gather/scatter but really good at fast synchronization operations; the GPU is really good at gather/scatter but not so good at fast synchronization operations.

22 But what does it all mean?
The Results

23 How are we measuring? Data transfer time is not included
(why does this sound familiar?). Kernels are optimized for each platform, using the best methods currently available. The CPU uses 8 threads on 4 cores; the GPU uses 4 to 8 warps per core depending on the kernel, with a maximum of 30 SMs on the card. Disclaimer – all times are on par with, or more often better than, published times.

24 THROW DOWN!!!!!!!

25 So now that we know this class is a lie and we should all burn our graphics cards in protest, let’s see why we got these results. The Explanation

26 Possible reasons for 10X+ speedup
GPU vs. CPU devices – Comparing top-end video cards to mobile CPUs isn’t cool. Optimization – Running GPU-optimized code against unoptimized CPU code produces large speedup numbers, but when the CPU code is optimized the speedup drops drastically.

27 Analysis of results Bandwidth Compute Flops Cache
Bandwidth: Kernels that are bandwidth bound see their speedup limited to the ratio of the two bandwidths, which happens to be ~5X. Compute FLOPS: Not all kernels can utilize all available FLOPS; kernels that are compute bound are limited to the peak FLOP ratio. Cache: This is where the CPU outperforms the GPU. Where kernels require lots of cache to perform well (e.g. Sort), the GPU’s limited on-chip memory becomes a major hindrance compared to the CPU, whose caches are still growing.
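A quick back-of-envelope check, using the peak bandwidth figures from the hardware slide:

```python
# A bandwidth-bound kernel moves data at the device's peak
# memory bandwidth, so its GPU-over-CPU speedup can't exceed
# the bandwidth ratio -- regardless of how many cores the GPU has.
gpu_bw = 141.0  # GTX280 peak memory bandwidth, GB/s
cpu_bw = 32.0   # Core i7 peak memory bandwidth, GB/s
max_speedup = gpu_bw / cpu_bw
print(round(max_speedup, 1))  # prints 4.4, i.e. the ~5X ceiling
```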

28 Analysis of results Gather/Scatter Reduction and Synchronization
Gather/scatter: Kernels like GJK and RC, which thrive on gather/scatter operations, make the best use of the GPU, which unlike the CPU has hardware support for these operations. These kernels showed the best speedup on the GPU. Reduction and synchronization: Kernels that need synchronization, such as Hist, hit bottlenecks the GPU does not handle well. Fixed function: Transcendental functions on the GPU, while less accurate than on the CPU, provide a substantial speedup, due to the presence of fast transcendental hardware on GPUs.
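For reference, the two access patterns in plain Python (the hardware versions are single vector instructions on the GPU; the CPU of the paper's era had to emulate them with scalar loads and stores):

```python
# Gather pulls values in from arbitrary indices; scatter pushes
# values out to arbitrary indices. Kernels like GJK and RC lean
# on these irregular accesses, which is why they favored the GPU.
def gather(src, idx):
    return [src[i] for i in idx]

def scatter(dst, idx, vals):
    for i, v in zip(idx, vals):
        dst[i] = v
    return dst
```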

29 Optimization For too long, programmers have relied on additional cores and increased clock speeds for improved performance. Code optimization, with multi-threading and SIMD, can provide a significant increase in performance. Hardware features also play a key part in performance: high compute FLOPS and memory bandwidth, large caches, gather/scatter support, efficient synchronization, and fixed-function units.

30 Wrap Up Why parallel is important. Two platforms for parallel work.
The architecture of CPUs and GPUs. The kernels used for comparison. The more down-to-earth results of GPU speedup. Why some other people got GPU speedups of 10X to 100X.

31 Questions?
