Presentation on theme: "Debunking the 100X GPU vs. CPU Myth"— Presentation transcript:
Slide 1: Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU
Authors: Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey
Presentation: Sam Delaney
Slide 2: Breakdown
- The problem!
- CPU and GPU structure!
- Throughput!
- The hardware!
- The results!
- The explanation!
- Wrap up!
Slide 4: Some history
- The past years have seen a great increase in the use of and demand for digital storage
- Demand has not subsided
- Much of this data can be classified as having high data-level parallelism
Slide 5: The problem as it stands today
- How can we process all this data, and do it fast?
- With throughput computing applications! Applications which provide users with the correct content in a reasonable amount of time
- Problem solved!
Slide 6: OK, not quite
- Our solution, as it turns out, can run on two platforms: CPU and GPU
- So now the question becomes: "Which is better?"
Slide 7: Let's look at the fundamental differences
CPU:
- Designed to do a lot of things
- Provides fast response times for single tasks
- Awesome features: branch prediction, out-of-order execution, etc.
- But... high power consumption limits the number of cores per die
GPU:
- Designed for rendering
- Great for processes with ample data parallelism (see rendering)
- Can trade single-thread performance for increased parallel processing
Slide 8: How do they handle parallelism?
CPU:
- Provides the best single-thread performance
- But the smaller core count means less work can be done at once
GPU:
- Many cores!
- But the graphics pipeline lacks certain processing capabilities for general workloads, resulting in poorer throughput performance
Slide 9: So which is better?
- If we look at the research, many data-level parallel applications see 10x+ speedups when using GPUs
- STOP DRINKING THE KOOL-AID
- Let's reevaluate these claims and see why some kernels work better on the GPU architecture while others work better on the CPU
Slide 10: How are we going to test throughput? The Kernels
Slide 11: In order to determine a victor, we need a contest of might
- 14 kernels were chosen which represent the benchmark of parallel performance
- They all contain large amounts of data-level parallelism
- BEHOLD
Slide 12: Kernels explained: SGEMM and MC
SGEMM:
- Single-Precision General Matrix Multiply
- Maps to SIMD architectures in a straightforward manner
- Simple threading
MC:
- Monte Carlo method: uses random sampling of complex functions to estimate a quantity of interest
- Maps well to SIMD architectures
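To make SGEMM's structure concrete, here is a minimal pure-Python sketch of the C = alpha*A*B + beta*C computation. This is a naive triple loop for illustration only; real BLAS implementations block for cache and vectorize, and the matrix sizes and values here are made up.

```python
def sgemm(alpha, A, B, beta, C):
    # Naive SGEMM sketch: C = alpha * (A @ B) + beta * C.
    # Every (i, j) output element is an independent dot product,
    # which is why this kernel maps so cleanly to SIMD and threading.
    n, k = len(A), len(A[0])
    m = len(B[0])
    out = [[beta * C[i][j] for j in range(m)] for i in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += A[i][p] * B[p][j]
            out[i][j] += alpha * s
    return out
```

A parallel version would simply split the (i, j) output tiles across threads, since no output element depends on any other.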
Slide 13: Kernels explained: Conv and FFT
Conv:
- Convolution: common image filtering operation used for blur, sharpen, etc.
- Each pixel is calculated independently, for maximum parallelism
- Can cause memory alignment issues with SIMD computations
FFT:
- Fast Fourier Transform
- Converts signals from the time domain to the frequency domain, or vice versa
- Arithmetic is simple
- Data access patterns are not trivial
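The "each pixel calculated independently" property of convolution is easy to see in a sketch. A 1D version (with an assumed zero-padded border) keeps it short; the 2D image case is the same idea with a second loop dimension:

```python
def convolve1d(signal, kernel):
    # Each output element depends only on a small window of the input,
    # so all outputs can be computed independently (maximum parallelism).
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        s = 0.0
        for k, w in enumerate(kernel):
            j = i + k - r
            if 0 <= j < len(signal):   # zero-pad at the borders
                s += w * signal[j]
        out.append(s)
    return out
```

The SIMD alignment issue the slide mentions comes from those sliding, overlapping windows: adjacent outputs read input at offsets that are rarely aligned to vector-register boundaries.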
Slide 14: Kernels explained: SAXPY and LBM
SAXPY:
- Single-Precision A times X Plus Y: scalar multiplication combined with vector addition
- Maps well to SIMD
LBM:
- Lattice Boltzmann Method
- Used in fluid dynamics
- Uses the discrete Boltzmann equation to simulate the flow of Newtonian fluids
- Suitable for both task- and data-level parallelism
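SAXPY is the simplest kernel in the suite, which is exactly why it maps so well to SIMD; the whole operation is one independent multiply-add per element:

```python
def saxpy(a, x, y):
    # y <- a*x + y, element-wise. Every element is independent,
    # so the loop vectorizes and parallelizes trivially.
    return [a * xi + yi for xi, yi in zip(x, y)]
```

Because it does so little arithmetic per byte moved, SAXPY is a classic bandwidth-bound kernel, which matters for the analysis later in the talk.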
Slide 15: Kernels explained: Solv and SpMV
Solv:
- Constraint solver, used in game collision simulations
- Computes separating forces to keep objects from penetrating each other
SpMV:
- Sparse Matrix-Vector Multiplication
- Used at the heart of many iterative solvers
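A sketch of SpMV helps show why it stresses irregular memory access. Assuming the common CSR (compressed sparse row) storage format, which the slide does not specify but is typical for this kernel:

```python
def spmv_csr(values, col_idx, row_ptr, x):
    # y = A @ x for a sparse matrix A stored in CSR format.
    # The indexed read x[col_idx[i]] is a gather: its access pattern
    # depends on the sparsity structure, not on the loop counter.
    y = []
    for r in range(len(row_ptr) - 1):
        s = 0.0
        for i in range(row_ptr[r], row_ptr[r + 1]):
            s += values[i] * x[col_idx[i]]
        y.append(s)
    return y
```

Those data-dependent gathers are why SpMV performance hinges on the gather/scatter and cache behavior discussed in the analysis slides.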
Slide 16: Kernels explained: GJK and Sort
GJK:
- Gilbert-Johnson-Keerthi algorithm for collision detection and resolution of convex objects in physically based simulations in virtual environments
Sort:
- Radix sort
- Run with SIMD on the GPU
- Better performance without SIMD on the CPU
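The Sort kernel is a radix sort, which is worth sketching because its bucket-scatter step is the part that behaves so differently on the two architectures. A minimal LSD (least-significant-digit-first) version for non-negative integers, one byte per pass:

```python
def radix_sort(nums, bits=8):
    # LSD radix sort: repeatedly scatter values into buckets keyed by
    # the current digit, then concatenate. The scatter step is the
    # memory-access pattern that stresses gather/scatter hardware.
    mask = (1 << bits) - 1
    shift = 0
    while any(n >> shift for n in nums):
        buckets = [[] for _ in range(mask + 1)]
        for n in nums:
            buckets[(n >> shift) & mask].append(n)
        nums = [n for b in buckets for n in b]
        shift += bits
    return nums
```

Each pass is stable, so sorting digit by digit from the least significant end yields a fully sorted result.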
Slide 17: Kernels explained: RC and Search
RC:
- Ray casting, used to visualize 3D datasets
- Gives memory access a run for its money
Search:
- In-memory tree-structured index search
- I think we've all done this at least once
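The Search kernel probes an in-memory tree-structured index. As a stand-in sketch (the slide does not give the tree layout, so this uses the simplest equivalent: binary search over a sorted array, which traverses the same implicit tree of comparisons):

```python
def binary_search(sorted_keys, query):
    # One index probe: O(log n) comparisons down an implicit tree.
    # Each query is independent, so large batches of queries
    # parallelize well across threads or SIMD lanes.
    lo, hi = 0, len(sorted_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_keys[mid] < query:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(sorted_keys) and sorted_keys[lo] == query:
        return lo
    return -1
```

The throughput question is how many independent probes per second the machine can sustain, since each probe's branchy, data-dependent walk is hard to vectorize on its own.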
Slide 18: Kernels explained: Hist and Bilat
Hist:
- Histogram computation, an image processing algorithm
- Bins pixels from the pipeline
Bilat:
- Bilateral filter
- Non-linear filter used in image processing for edge-preserving smoothing operations
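Histogram binning is simple sequentially, which makes it a good sketch for the synchronization problem raised later in the talk. Assuming 8-bit pixel values (a made-up detail for illustration):

```python
def histogram(pixels, num_bins, max_val=256):
    # Count how many pixels fall into each of num_bins equal-width bins.
    # A parallel version needs atomic increments or per-thread private
    # histograms merged at the end, because many threads may hit the
    # same bin: this contention is the GPU bottleneck noted later.
    bins = [0] * num_bins
    for p in pixels:
        bins[p * num_bins // max_val] += 1
    return bins
```

The serial loop has no conflicts at all; the difficulty only appears once you try to update shared bins from many threads at once.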
Slide 19: That was painful; at least we're done. Now onto: The Hardware
Slide 20: The Contenders
Intel Core i7 CPU:
- Four cores, 3.2 GHz clock speed
- Out-of-order superscalar architecture
- 2-way hyper-threading
- 32KB L1 and 256KB L2 cache per core
- 8MB shared L3 cache
Nvidia GTX280 GPU:
- 240 CUDA cores, 1.3 GHz clock speed
- Hardware multithreading support
- Various on-chip memories
- Special function units (math units, texture units)
Slide 21: That was really technical and hurt my brain; let's break it down
Processing element difference:
- CPU cores are complex to allow wide use across different applications, which limits the number of cores
- GPU SMs are simple, with high throughput in mind; that simplicity means many cores can be placed on a card
Cache size / multithreading:
- CPUs have been increasing cache sizes to boost performance, and use hardware prefetchers to reduce memory latency
- GPUs use lightweight threads to hide memory latency, with much smaller caches than CPUs
Bandwidth difference:
- CPU: 32 GB/s; GPU: 141 GB/s
Other:
- CPU: not so good at gather/scatter, really good at fast synchronization operations
- GPU: really good at gather/scatter, not so good at fast synchronization operations
Slide 23: How are we measuring?
- Data transfer time not included (why does this sound familiar?)
- Kernels optimized for each platform, using the best methods currently available
- CPU is using 8 threads on 4 cores
- GPU is using 4 to 8 warps per core depending on the kernel, across the 30 SMs on the card
- Disclaimer: all times are on par with, or more often better than, published times
Slide 25: So now that we know this class is a lie and we should all burn our graphics cards in protest, let's see why we got these results. The Explanation
Slide 26: Possible reasons for 10X+ speedups
- GPU vs. CPU devices: comparing top-end video cards to mobile CPUs isn't cool
- Optimization: running GPU-optimized code against unoptimized CPU code produces large speedup numbers, but when the CPU code is optimized the speedup drops drastically
Slide 27: Analysis of results
Bandwidth:
- Kernels that are bandwidth bound will see speedups limited to the ratio of the bandwidths, which happens to be ~5X
Compute FLOPS:
- Not all kernels can utilize all available FLOPS
- Kernels that are compute bound are limited to the peak FLOPS ratio
Cache:
- This is where the CPU outperforms the GPU
- Where kernels require lots of cache to perform well (e.g. Sort), the GPU's limited on-chip memory becomes a major hindrance compared to the CPU, whose caches are still increasing
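The bandwidth ceiling follows directly from the numbers on the hardware slide: a kernel that spends all its time streaming memory can be accelerated by at most the ratio of the two memory bandwidths. As back-of-envelope arithmetic:

```python
# Speedup ceiling for a purely bandwidth-bound kernel, using the
# bandwidth figures from the hardware comparison slide.
cpu_bw = 32.0    # GB/s, Intel Core i7
gpu_bw = 141.0   # GB/s, Nvidia GTX280
max_speedup = gpu_bw / cpu_bw
print(round(max_speedup, 1))   # about 4.4x, i.e. the ~5X ceiling cited
```

The same reasoning applies to compute-bound kernels with peak FLOPS in place of bandwidth: no amount of tuning pushes a bound kernel past the hardware ratio of its limiting resource.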
Slide 28: Analysis of results
Gather/scatter:
- Kernels like GJK and RC, which thrive on gather/scatter operations, make the best use of the GPU, which unlike the CPU has hardware support for these operations
- These kernels provided the best speedups for the GPU
Reduction and synchronization:
- Kernels that need synchronization, such as Hist, create bottlenecks which the GPU does not handle well
Fixed function:
- Transcendental functions on the GPU, while not as accurate as on the CPU, provide a substantial speedup
- This is due to the presence of fast transcendental hardware on GPUs
Slide 29: Optimization
- For too long, programmers have relied on additional cores and increased clock speeds for improved performance
- Code optimization with multithreading and SIMD can provide significant increases in performance
- Hardware features also play a key part in performance:
  - High compute FLOPS and memory bandwidth
  - Large caches
  - Gather/scatter
  - Efficient synchronization
  - Fixed-function units
Slide 30: Wrap Up
- Why parallel is important
- Two platforms for parallel work
- The architecture of CPUs and GPUs
- The kernels used for comparison
- The more down-to-earth results of GPU speedup
- Why some other people got GPU speedups of 10X to 100X