Debunking the 100X GPU vs. CPU Myth An Evaluation of Throughput Computing on CPU and GPU Authors: Victor W Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal and Pradeep Dubey Presentation: Sam Delaney
Breakdown The problem! CPU and GPU structure! Throughput! The hardware! The results! The explanation! Wrap up!
The Problem! What are we talking about?
Some history Recent years have seen a great increase in the use of, and demand for, digital storage. Demand has not subsided. Much of this data can be classified as having high data-level parallelism.
The problem as it stands today How can we process all this data, and do it fast? With Throughput Computing Applications! – Applications which provide users with the correct content in a reasonable amount of time. Problem solved!
OK, not quite Our solution, as it turns out, can run on two platforms: the CPU and the GPU. So now the question becomes: “Which is better?”
CPU vs. GPU Let’s look at the fundamental differences. CPU: Designed to do a lot of things. Provides fast response times for single tasks. Awesome features: branch prediction, out-of-order execution, etc. But… high power consumption limits the number of cores per die. GPU: Designed for rendering. Great for processes with ample data parallelism (see rendering). Can trade single-thread performance for increased parallel processing.
CPU vs. GPU How do they handle parallelism? CPU: Would provide the best single-thread performance, but the lack of cores means less work can be done at once. GPU: Many cores! But the graphics pipeline lacks certain processing capabilities for general workloads, resulting in poorer throughput performance.
So which is better? If we look at the research, many data-level parallel applications see 10x+ speedup when using GPUs. STOP DRINKING THE KOOL-AID. Let’s reevaluate these claims and see why some kernels work better on the GPU architecture and others on the CPU.
The Kernels How are we going to test throughput?
In order to determine a victor we need a contest of might. 14 kernels were chosen which represent a benchmark of parallel performance. They all contain large amounts of data-level parallelism. BEHOLD
Kernels explained SGEMM Single-Precision General Matrix Multiply Maps to SIMD architecture in a straightforward manner Simple threading MC Monte Carlo method, using random numbers on complex functions to predict an outcome (in the paper, option pricing) Maps well to SIMD architecture
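SGEMM’s “straightforward” SIMD mapping comes from its regular triple loop nest. A minimal scalar sketch in C (the function name and row-major layout are my own illustrative choices, not from the slides):

```c
#include <stddef.h>

/* Naive single-precision general matrix multiply: C = A * B.
   A is m x k, B is k x n, C is m x n, all row-major.
   The inner loop is a plain dot product over contiguous data,
   which is why SGEMM vectorizes and threads so easily. */
void sgemm(size_t m, size_t n, size_t k,
           const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
}
```

Production SGEMM (in BLAS libraries or CUBLAS) adds blocking for cache and registers, but the data-parallel structure is exactly this loop nest.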
Kernels explained Conv Convolution – Common image-filtering operation used for blur, sharpen, etc. Each pixel is calculated independently, for maximum parallelism Can cause memory-alignment issues with SIMD computations FFT Fast Fourier Transform Converts signals between the time domain and the frequency domain Arithmetic is simple Data access patterns are not trivial
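The “each pixel calculated independently” property is easiest to see in a 1D sketch. This minimal C version (3-tap filter and clamp-to-edge boundaries are my own assumptions for illustration) shows why every output can be computed in parallel:

```c
/* 3-tap 1D convolution with clamp-to-edge boundary handling.
   Each out[i] depends only on a small, read-only neighborhood of
   the input, so all outputs are independent -- ideal for SIMD and
   threading. The unaligned neighbor loads (in[i-1], in[i+1]) are
   the source of the SIMD alignment issues the slide mentions. */
void conv3(const float *in, float *out, int n, const float k[3])
{
    for (int i = 0; i < n; i++) {
        int l = i > 0 ? i - 1 : 0;         /* clamp left edge  */
        int r = i < n - 1 ? i + 1 : n - 1; /* clamp right edge */
        out[i] = k[0] * in[l] + k[1] * in[i] + k[2] * in[r];
    }
}
```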
Kernels explained SAXPY Single-Precision A·X Plus Y Combines scalar multiplication with vector addition (y = a·x + y) Maps well to SIMD LBM Lattice Boltzmann Method Used in fluid dynamics Uses the discrete Boltzmann equation to simulate the flow of a Newtonian fluid Suitable for both task- and data-level parallelism
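SAXPY is simple enough to write out in full. A sketch in C of the standard BLAS-style loop:

```c
/* SAXPY: y = a*x + y, single precision. One multiply-add per
   element with unit-stride, streaming memory access -- the
   textbook SIMD loop, and bandwidth-bound on both CPU and GPU
   (it does almost no arithmetic per byte moved). */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Because SAXPY is bandwidth-bound, its GPU-vs-CPU speedup is capped by the memory bandwidth ratio discussed later, not by peak FLOPs.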
Kernels explained Solv Constraint Solver Used in game collision simulations Computes separating forces to keep objects from penetrating each other SpMV Sparse Matrix–Vector Multiplication Used at the heart of many iterative solvers
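SpMV is worth sketching because its memory behavior explains results later in the talk. A minimal C version, assuming the common compressed sparse row (CSR) layout (the slides do not specify a format):

```c
/* Sparse matrix-vector multiply y = A*x, with A in CSR format:
   row_ptr has nrows+1 entries delimiting each row's nonzeros;
   col_idx and vals hold the column index and value of each
   nonzero. The indexed load x[col_idx[j]] is a gather -- the
   irregular access pattern that makes SpMV hard for SIMD. */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const float *vals, const float *x, float *y)
{
    for (int i = 0; i < nrows; i++) {
        float acc = 0.0f;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            acc += vals[j] * x[col_idx[j]];
        y[i] = acc;
    }
}
```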
Kernels explained GJK Gilbert–Johnson–Keerthi algorithm for collision detection and resolution of convex objects in physically based simulations in virtual environments Sort Radix Sort Run with SIMD on the GPU Better performance without SIMD on the CPU
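To see why radix sort stresses scatter hardware, here is a minimal scalar LSD radix sort in C (the 8-bit digit size and unsigned 32-bit keys are my own choices for illustration):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort on unsigned 32-bit keys, one byte per pass:
   histogram the digit, prefix-sum the counts into offsets, then
   scatter each key to its slot. The scatter (tmp[offset[...]++])
   writes to data-dependent addresses -- this is the operation
   that GPU gather/scatter hardware accelerates, and that resists
   clean SIMD on the CPU. */
void radix_sort_u32(uint32_t *a, int n)
{
    uint32_t *tmp = malloc((size_t)n * sizeof *tmp);
    for (int shift = 0; shift < 32; shift += 8) {
        int count[256] = {0};
        for (int i = 0; i < n; i++)
            count[(a[i] >> shift) & 0xFF]++;
        int offset[256], sum = 0;
        for (int d = 0; d < 256; d++) { offset[d] = sum; sum += count[d]; }
        for (int i = 0; i < n; i++)
            tmp[offset[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, (size_t)n * sizeof *a);
    }
    free(tmp);
}
```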
Kernels explained RC Ray Casting Used to visualize 3D datasets Gives memory access a run for its money Search In-memory tree-structured index search I think we’ve all done this at least once
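The Search kernel traverses an in-memory tree index; as a simple stand-in (not the paper’s actual kernel), binary search over a sorted array in C shows the same data-dependent access pattern:

```c
/* Binary search over a sorted array -- an illustrative stand-in
   for tree-structured index search. Each probe's address depends
   on the previous comparison, so the accesses are unpredictable
   and hard to prefetch; throughput comes from running many
   independent queries in parallel, not from speeding up one. */
int search_sorted(const int *a, int n, int key)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (a[mid] == key) return mid;
        if (a[mid] < key)  lo = mid + 1;
        else               hi = mid - 1;
    }
    return -1; /* not found */
}
```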
Kernels explained Hist Histogram computation Image-processing algorithm that bins pixels from the pipeline Bilat Bilateral filter Non-linear filter used in image processing for edge-preserving smoothing operations
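The serial version of histogram binning is trivial; a C sketch makes it clear where the trouble starts when you parallelize:

```c
/* Serial histogram: bin each 8-bit pixel value into one of 256
   counters. The increment bins[pixels[i]]++ is a read-modify-write
   to a data-dependent address; parallel versions need atomic
   updates or per-thread private histograms plus a reduction --
   the synchronization bottleneck noted later in the talk. */
void histogram(const unsigned char *pixels, int n, int bins[256])
{
    for (int b = 0; b < 256; b++) bins[b] = 0;
    for (int i = 0; i < n; i++)
        bins[pixels[i]]++;
}
```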
The Hardware That was painful, but at least we’re done. Now on to:
The Contenders: Intel Core i7 CPU vs. Nvidia GTX280 GPU. Core i7: Four cores 3.2 GHz clock speed Out-of-order superscalar architecture 2-way hyper-threading 32KB L1 and 256KB L2 cache per core 8MB shared L3 data cache GTX280: 240 CUDA cores 1.3 GHz clock speed Hardware multithreading support Various on-chip memories Special function units (math units, texture units)
That was really technical and hurt my brain, let’s break it down Processing Element Difference CPU cores are complex to allow wide use across different applications, which limits the number of cores GPU SMs are simple, with high throughput in mind; the simplicity means many cores can be placed on a card Cache Size/Multi-threading CPUs have been increasing cache sizes to boost performance, and use hardware prefetchers to reduce memory latency GPUs use lightweight threads to hide memory latency, with much smaller caches than CPUs Bandwidth Difference CPU – 32GB/s GPU – 141GB/s Other CPU – not so good at gather/scatter, really good at fast synchronization GPU – really good at gather/scatter, not so good at fast synchronization
The Results But what does it all mean?
How are we measuring? Data transfer time is not included (why does this sound familiar?) Kernels are optimized for each platform, using the best methods currently available The CPU is using 8 threads on 4 cores The GPU is using 4 to 8 warps per core depending on the kernel, with a max of 30 SMs on the card Disclaimer – all times are on par with, or more often better than, published times
The Explanation So now that we know this class is a lie and we should all burn our Graphics cards in protest, let’s see why we got these results.
Possible reasons for 10X+ speedup GPU vs CPU devices – Comparing top-end video cards to mobile CPUs isn’t cool. Optimization – Running GPU-optimized code against unoptimized CPU code produces large speedup numbers, but when the CPU code is optimized the speedup drops drastically.
Analysis of results Bandwidth Kernels that are bandwidth-bound will have speedup limited to the ratio of the bandwidths, which happens to be ~5X Compute Flops Not all kernels can utilize all available flops Kernels that are compute-bound are limited to the peak FLOP ratio Cache This is where the CPU outperforms the GPU Where kernels require lots of cache to perform well (e.g. Sort), the GPU’s limited memory becomes a major hindrance compared to the CPU, whose caches are still increasing.
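The bandwidth cap is just arithmetic on the peak numbers quoted on the hardware slide; a tiny C helper makes the bound explicit (141/32 ≈ 4.4x, which the talk rounds to ~5X):

```c
/* Back-of-envelope speedup bound: a bandwidth-bound kernel cannot
   run faster than the ratio of the two machines' memory bandwidths,
   no matter how many FLOPs the GPU has. With the slide's numbers
   (CPU 32 GB/s, GTX280 141 GB/s) the cap is about 4.4x. The same
   reasoning with peak FLOP rates bounds compute-bound kernels. */
double bandwidth_speedup_cap(double cpu_gb_per_s, double gpu_gb_per_s)
{
    return gpu_gb_per_s / cpu_gb_per_s;
}
```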
Analysis of results Gather/Scatter Kernels like GJK and RC, which thrive on gather/scatter operations, make the best use of the GPU, which unlike the CPU has hardware support for these operations. These kernels provided the best speedup for the GPU Reduction and Synchronization Kernels that need synchronization, such as Hist, create bottlenecks which the GPU does not handle well Fixed Function Transcendental functions on the GPU, while not as accurate as on the CPU, provide a substantial speedup, due to the presence of fast transcendental hardware on GPUs
Optimization For too long programmers have relied on additional cores and increased clock speeds for improved performance Code optimization with multi-threading and SIMD can provide a significant increase in performance Hardware features also play a key part in performance: High compute flops and memory bandwidth Large caches Gather/scatter Efficient synchronization Fixed-function units
Wrap Up Why parallel is important Two platforms for parallel work The architecture of CPUs and GPUs The kernels used for comparison The more down-to-earth results of GPU speedup Why some other people got GPU speedups of 10X to 100X