Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU. Authors: Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey


1 Debunking the 100X GPU vs. CPU Myth An Evaluation of Throughput Computing on CPU and GPU Authors: Victor W Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal and Pradeep Dubey Presentation: Sam Delaney

2 Breakdown  The problem!  CPU and GPU structure!  Throughput!  The hardware!  The results!  The explanation!  Wrap up!

3 The Problem! What are we talking about?

4 Some history  The past years have seen a great increase in the use of and demand for digital storage  Demand has not subsided  Much of this data can be classified as having high data-level parallelism

5 The problem as it stands today  How can we process all this data, and do it fast?  With throughput computing applications! – Applications which provide users with the correct content in a reasonable amount of time  Problem solved!

6 OK, not quite  Our solution, as it turns out, can run on two platforms  CPU  GPU  So now the question becomes:  “Which is better?”

7 CPU vs. GPU – Let’s look at the fundamental differences  CPU: Designed to do a lot of things  Provides fast response times for single tasks  Awesome features: branch prediction, out-of-order execution, etc.  But… high power consumption limits the number of cores per die  GPU: Designed for rendering  Great for workloads with ample data parallelism (see rendering)  Can trade single-thread performance for increased parallel processing

8 CPU vs. GPU – How do they handle parallelism?  CPU: Would provide the best single-thread performance  But the small number of cores means less work can be done at once   GPU: Many cores!  But the graphics pipeline lacks certain processing capabilities for general workloads, resulting in poorer throughput performance

9 So which is better?  If we look at the research, many data-level parallel applications see 10x+ speedups when using GPUs  STOP DRINKING THE KOOL-AID  Let’s reevaluate these claims and see why some kernels work better on the GPU architecture while others work better on the CPU

10 The Kernels How are we going to test throughput?

11 In order to determine a victor we need a contest of might  14 kernels were chosen which serve as benchmarks of parallel performance  They all contain large amounts of data-level parallelism  BEHOLD

12 Kernels explained  SGEMM  Single-Precision General Matrix Multiply  Maps to SIMD architecture in a straightforward manner  Simple threading  MC  Monte Carlo method, using random numbers on complex functions to predict x (insert something into x)  Maps well to SIMD architecture
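The Monte Carlo idea can be sketched with a classic toy example (estimating pi, not the paper's actual MC kernel; the function name and setup here are illustrative): random samples are pushed through a simple function and averaged, and because every sample is independent of the others, the method maps naturally onto SIMD lanes.

```python
import random


def monte_carlo_pi(n_samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling random points in the unit square.

    Illustrates the Monte Carlo pattern: evaluate a function on many
    random inputs and average. Each sample is fully independent, which
    is why MC-style kernels parallelize so well.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    # Area ratio of quarter circle to unit square is pi/4.
    return 4.0 * inside / n_samples
```

More samples tighten the estimate at a rate of roughly 1/sqrt(n), regardless of how the samples are distributed across cores or lanes.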

13 Kernels explained  Conv  Convolution – Common image filtering operation used for blur, sharpen, etc.  Each pixel is calculated independently, for maximum parallelism  Can cause memory alignment issues with SIMD computations  FFT  Fast Fourier Transform  Converts signals from the time domain to the frequency domain, or vice versa  Arithmetic is simple  Data access patterns are not trivial
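A minimal 1-D convolution sketch (illustrative names; real image kernels are 2-D, but the structure is the same): every output element reads only a small window of the input and writes one result, so all outputs can be computed independently, which is the parallelism the slide points to.

```python
def convolve1d(signal, kernel):
    """Naive 1-D filtering with zero padding at the borders.

    Note: this slides the kernel without flipping it (cross-correlation),
    which is identical to convolution for the symmetric kernels typical
    of blur/sharpen filters. Each output element is independent of the
    others, so the outer loop parallelizes trivially.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):  # zero-pad outside the signal
                acc += signal[idx] * k
        out.append(acc)
    return out
```

The alignment issue the slide mentions comes from the `i + j - half` offsets: neighboring SIMD lanes read overlapping, shifted windows, which rarely line up with vector-width boundaries.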

14 Kernels explained  SAXPY  Single-Precision A·X Plus Y  Combines scalar multiplication with vector addition  Maps well to SIMD  LBM  Lattice Boltzmann Method  Used in fluid dynamics  Uses the discrete Boltzmann equation to simulate the flow of a Newtonian fluid  Suitable for both task- and data-level parallelism
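SAXPY is small enough to show in full (a plain-Python sketch; BLAS implementations update `y` in place): one multiply-add per element with no dependencies between elements, which is why it vectorizes trivially.

```python
def saxpy(a, x, y):
    """Return a*x + y element-wise: the classic BLAS Level-1 operation.

    Every output element depends only on the same-index inputs, so the
    whole loop maps directly onto SIMD lanes or GPU threads.
    """
    return [a * xi + yi for xi, yi in zip(x, y)]
```

With one multiply-add per two loads and one store, SAXPY is bandwidth-bound on essentially any modern machine, which matters for the results later in the talk.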

15 Kernels explained  Solv  Constraint Solver  Used in game collision simulations  Computes separating forces to keep objects from penetrating each other  SpMV  Sparse Matrix–Vector Multiplication  Lies at the heart of many iterative solvers
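An SpMV sketch in the common CSR (compressed sparse row) layout (the slides don't name a storage format, so CSR is an assumption here): only the nonzero entries are stored, and the indirect reads of `x[col_idx[k]]` are exactly the irregular gather accesses discussed later.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A stored in CSR format.

    values  -- nonzero entries, row by row
    col_idx -- column index of each nonzero
    row_ptr -- row i's nonzeros live in values[row_ptr[i]:row_ptr[i+1]]

    The access x[col_idx[k]] is an indirect (gather) load whose pattern
    depends on the sparsity structure, not on the loop index.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y
```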

16 Kernels explained  GJK  Gilbert–Johnson–Keerthi algorithm for collision detection and resolution of convex objects in physically based simulations in virtual environments  Sort  Radix Sort  Runs with SIMD on the GPU  Performs better without SIMD on the CPU
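A scalar LSD radix sort sketch (illustrative, bucket-based rather than the counting-sort passes a tuned implementation would use): each pass distributes keys by one byte, and it is this per-pass scatter into buckets that behaves so differently on CPU and GPU.

```python
def radix_sort(nums):
    """LSD radix sort for non-negative integers, one byte per pass.

    Each pass is a stable distribution by the current byte; after the
    pass covering the most significant byte, the list is sorted. The
    scatter into buckets is the step that stresses memory hardware.
    """
    if not nums:
        return nums
    max_val = max(nums)
    shift = 0
    while (max_val >> shift) > 0:
        buckets = [[] for _ in range(256)]
        for n in nums:
            buckets[(n >> shift) & 0xFF].append(n)  # stable scatter
        nums = [n for bucket in buckets for n in bucket]
        shift += 8  # move to the next byte
    return nums
```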

17 Kernels explained  RC  Ray Casting  Used to visualize 3D datasets  Gives the memory system a run for its money  Search  In-memory tree-structured index search  I think we’ve all done this at least once
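The index-search idea in its simplest form (a flat sorted-array binary search rather than the blocked tree layout a real index kernel would use, so treat this as a stand-in): each comparison halves the candidate range, and the unpredictable jump to `mid` is what makes the access pattern hard on caches.

```python
def index_search(sorted_keys, target):
    """Binary search over a sorted in-memory index.

    Returns the position of target, or -1 if absent. The data-dependent
    jump to sorted_keys[mid] each iteration is the cache-unfriendly
    access pattern that tree-structured index searches share.
    """
    lo, hi = 0, len(sorted_keys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_keys[mid] == target:
            return mid
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```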

18 Kernels explained  Hist  Histogram computation  Image processing algorithm  Bins pixel values from the pipeline  Bilat  Bilateral filter  Non-linear filter used in image processing for edge-preserving smoothing operations
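A serial histogram sketch (illustrative names): the loop itself is trivial, but many pixels can land in the same bin, so a parallel version needs atomic increments or per-thread partial histograms that are merged afterward, which is the synchronization cost the talk returns to later.

```python
def histogram(pixels, n_bins=256):
    """Count how many 8-bit pixel values fall into each bin.

    Serially this is one increment per pixel. In parallel, two threads
    incrementing the same bin race, so hist[p] += 1 must become an
    atomic update or be done on private copies merged at the end.
    """
    hist = [0] * n_bins
    for p in pixels:
        hist[p] += 1
    return hist
```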

19 The Hardware That was painful, at least we’re done. Now onto:

20 The Contenders: Intel Core i7 CPU vs. Nvidia GTX 280 GPU  Core i7: Four cores  3.2 GHz clock speed  Out-of-order superscalar architecture  2-way hyper-threading  32KB L1 and 256KB L2 cache per core  8MB shared L3 cache  GTX 280: 240 CUDA cores  1.3 GHz clock speed  Hardware multithreading support  Various on-chip memories  Special function units (math units, texture units)

21 That was really technical and hurt my brain, let’s break it down  Processing Element Difference  CPU cores are complex to allow wide use across different applications, which limits the number of cores  GPU SMs are simple with high throughput in mind; that simplicity means many cores can be placed on a card  Cache Size/Multi-threading  CPUs have been increasing cache sizes to boost performance, and adding hardware prefetchers to reduce memory latency  GPUs use lightweight threads to hide memory latency, with much smaller caches than CPUs  Bandwidth Difference  CPU – 32GB/s GPU – 141GB/s  Other  CPU – not so good at gather/scatter, really good at fast synchronization operations  GPU – really good at gather/scatter, not so good at fast synchronization operations

22 The Results But what does it all mean?

23 How are we measuring?  Data transfer time not included  Why does this sound familiar?  Kernels optimized for each platform  Using the best methods currently available  CPU is using 8 threads on 4 cores  GPU is using 4 to 8 warps per core depending on the kernel, with a maximum of 30 SMs on the card  Disclaimer – all times are on par with, or more often better than, published times

24 THROW DOWN!!!!!!!

25 The Explanation So now that we know this class is a lie and we should all burn our graphics cards in protest, let’s see why we got these results.

26 Possible reasons for 10X+ speedup claims  GPU vs CPU devices – Comparing top-end video cards to mobile CPUs isn’t cool.  Optimization – Running GPU-optimized code against unoptimized CPU code produces large speedup numbers, but when the CPU code is optimized the speedup drops drastically.

27 Analysis of results  Bandwidth  Kernels that are bandwidth-bound have their speedup limited to the ratio of the bandwidths, which happens to be ~5X  Compute Flops  Not all kernels can utilize all available flops  Kernels that are compute-bound are limited to the peak FLOP ratio  Cache  This is where the CPU outperforms the GPU  Where kernels require lots of cache to perform well (e.g., sort), the GPU’s limited on-chip memory becomes a major hindrance compared to the CPU, whose caches are still increasing.
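The bandwidth cap is just arithmetic on the numbers quoted earlier in the hardware slides: a kernel that streams data faster than it computes can only go as fast as memory feeds it, so its speedup is bounded by the bandwidth ratio.

```python
# Peak memory bandwidths quoted in the hardware comparison (GB/s).
cpu_bw = 32.0   # Intel Core i7
gpu_bw = 141.0  # Nvidia GTX 280

# A bandwidth-bound kernel's GPU-over-CPU speedup cannot exceed the
# bandwidth ratio, no matter how much compute the GPU has to spare.
max_speedup = gpu_bw / cpu_bw
print(f"bandwidth-bound speedup cap: ~{max_speedup:.1f}x")
```

That ratio works out to about 4.4x, which is why the slide rounds it to "~5X" and why bandwidth-bound kernels like SAXPY and LBM cannot show 10X+ gains.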

28 Analysis of results  Gather/Scatter  Kernels like GJK and RC, which thrive on gather/scatter operations, make the best use of the GPU, which unlike the CPU has hardware support for these operations.  These kernels provided the best speedup for the GPU  Reduction and Synchronization  Kernels that need synchronization, such as Hist, create bottlenecks which the GPU does not handle well  Fixed Function  Transcendental functions on the GPU, while not as accurate as on the CPU, provide a substantial speedup  Due to the presence of fast transcendental hardware on GPUs

29 Optimization  For too long, programmers have relied on additional cores and increased clock speeds for improved performance  Code optimization with multi-threading and SIMD can provide a significant increase in performance  Hardware features also play a key part in performance:  High compute flops and memory bandwidth  Large cache  Gather/scatter  Efficient synchronization  Fixed-function units

30 Wrap Up  Why parallelism is important  Two platforms for parallel work  The architecture of CPUs and GPUs  The kernels used for comparison  The more down-to-earth results of GPU speedup  Why some other people got GPU speedups of 10X to 100X

31 Questions?
