Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU
Presented by: Ahmad Lashgar, ECE Department, University of Tehran

Presentation transcript:

1 Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU. Presented by: Ahmad Lashgar, ECE Department, University of Tehran. Seminar on Parallel Processing, Instructor: Dr. Fakhraie, 29 Dec 2011. Original paper: Victor W. Lee et al., Intel Corporation, ISCA 2010 [LEE'2010]. Some slides are included from the original paper for educational purposes only.

2 Abstract Is the GPU the silver bullet of parallel computing? How large is the gap between peak and achievable performance?

3 Overview Abstract Architecture – CPU: Intel Core i7 – GPU: Nvidia GTX280 Implications for throughput computing applications Methodology Results Analyzing the results Platform optimization guides Conclusion

4 Architecture (1) Intel Core i7-960 – 4 cores, 3.2 GHz – 2-way multi-threading (SMT) – 4-wide superscalar – L1 32KB and L2 256KB per core, L3 8MB shared – 32 GB/s memory bandwidth [DIXON'2010]

5 Architecture (2) Nvidia GTX280 – 30 cores (SMs), 1.3 GHz – 1024-way multi-threading per core – 8-wide SIMD – 16KB software-managed cache per core (shared memory) – 141 GB/s memory bandwidth [LINDHOLM'2008]

6 Architecture (3)

                            Core i7-960      GTX280
  Cores                     4                30
  Frequency (GHz)           3.2              1.3
  Transistors               0.7B (263 mm2)   1.4B (576 mm2)
  Memory bandwidth (GB/s)   32               141
  SP SIMD width             4                8
  DP SIMD width             2                1
  Peak SP scalar GFLOPS     25.6             116.6
  Peak SP SIMD GFLOPS       102.4            311.1 (933.1)
  Peak DP SIMD GFLOPS       51.2             77.8

  (Red text in the original slide marks numbers that are not the paper authors'.)
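
As a sanity check, the peak figures follow from cores × SIMD width × FLOPs per cycle × clock. This reading is ours, not the slides' (it assumes the GTX280's ~1.296 GHz shader clock, and that the 933.1 figure counts the dual-issued MAD+MUL as 3 FLOPs per SP per cycle):

```latex
% Core i7-960, peak SP SIMD: 4 cores x 4 lanes x 2 FLOPs/cycle (mul+add) x 3.2 GHz
4 \times 4 \times 2 \times 3.2 = 102.4~\text{GFLOPS}
% GTX280: 30 SMs x 8 SPs = 240 SPs at ~1.296 GHz
240 \times 1.296 \approx 311.1~\text{GFLOPS}, \qquad 240 \times 1.296 \times 3 \approx 933.1~\text{GFLOPS}
```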

7 Implications for throughput computing applications 1. Number-of-cores difference 2. Cache size / multi-threading 3. Bandwidth difference

8 1. Number-of-cores difference It is all about core complexity: – The common goal: improving pipeline efficiency – CPU goal: single-thread performance Exploiting ILP Sophisticated branch predictor Multiple-issue logic – GPU goal: throughput Interleaving hundreds of threads

9 2. Cache size/multi-threading CPU goal: reducing memory latency – Programmer-transparent data caching Increasing cache size to capture the working set – Prefetching (HW/SW) GPU goal: hiding memory latency – Interleaving the execution of hundreds of threads so they hide each other's latency Notice the crossover: the CPU also uses multi-threading (SMT) for latency hiding, and the GPU also uses software-controlled caching (shared memory) to reduce memory latency (see the sketch below)
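
A minimal CUDA sketch of the GPU side of this slide (a hypothetical kernel, not from the paper): each block stages a tile into the software-managed shared memory once and reuses it on-chip, while the hardware interleaves warps so that one warp's memory stall is hidden behind another's execution.

```cuda
// 1D 3-point average; assumes blockDim.x == 256 (illustrative choice).
__global__ void blur3(const float *in, float *out, int n)
{
    __shared__ float tile[256 + 2];                 // block tile + 1-element halo
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    if (gid < n) tile[lid] = in[gid];               // one off-chip read per element
    if (threadIdx.x == 0 && gid > 0)                // left halo
        tile[0] = in[gid - 1];
    if (threadIdx.x == blockDim.x - 1 && gid + 1 < n)  // right halo
        tile[lid + 1] = in[gid + 1];
    __syncthreads();

    if (gid > 0 && gid + 1 < n)                     // all reuse is on-chip
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) * (1.0f / 3.0f);
}
```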

10 3. Bandwidth difference Bandwidth versus latency CPU goal: single-thread performance – Workloads do not demand many memory accesses – Bring the data in as soon as possible GPU goal: throughput – There are lots of memory accesses, so provide high bandwidth – Latency matters less: the cores hide it

11 Methodology (1) Hardware – Intel Core i7-960 with 6GB DRAM; Nvidia GTX280 with 1GB Software – SUSE Linux Enterprise Server 11 – CUDA Toolkit 2.3

12 Methodology (2) Optimizations – On CPU: SGEMM, SpMV and FFT from Intel MKL 10 Always 2 threads per core – On GPU: best available algorithms for SpMV, FFT and MC Often 128 to 256 threads per core (constrained by shared-memory and register-file usage) – Interleaving GPU execution with host-device (HD/DH) memory transfers where possible (see the stream sketch below)
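
A sketch of the transfer/compute interleaving from the last bullet (hypothetical helper and kernel names; `h_in`/`h_out` must be pinned via cudaMallocHost for the copies to be truly asynchronous): chunks are ping-ponged through two CUDA streams so the HD copy of one chunk overlaps the kernel on another.

```cuda
#include <cuda_runtime.h>

__global__ void process(float *data, int n);        // hypothetical compute kernel

// Assumes chunkElems is a multiple of 256 and d_buf[] holds two device buffers.
void run_pipelined(const float *h_in, float *h_out, float *d_buf[2],
                   int chunks, int chunkElems)
{
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    for (int c = 0; c < chunks; ++c) {
        int b = c & 1;                              // ping-pong buffer and stream
        size_t bytes = chunkElems * sizeof(float);
        cudaMemcpyAsync(d_buf[b], h_in + (size_t)c * chunkElems, bytes,
                        cudaMemcpyHostToDevice, s[b]);
        process<<<chunkElems / 256, 256, 0, s[b]>>>(d_buf[b], chunkElems);
        cudaMemcpyAsync(h_out + (size_t)c * chunkElems, d_buf[b], bytes,
                        cudaMemcpyDeviceToHost, s[b]);
    }
    for (int i = 0; i < 2; ++i) {                   // drain both pipelines
        cudaStreamSynchronize(s[i]);
        cudaStreamDestroy(s[i]);
    }
}
```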

13 Results The HD/DH data transfer time is not counted Only 2.5X faster on average – Far from the 100X reported by previous research

14 Where is the speedup reported by previous research?! Which CPU and GPU were compared? How much optimization was performed on the CPU versus the GPU? – Where both platforms were optimized, much lower speedups were reported (as in this paper)

15 Analyzing the results (1) 1. Bandwidth 2. Compute FLOPS (single precision) 3. Compute FLOPS (double precision) 4. Reduction and synchronization 5. Fixed function

16 Analyzing the results (2) 1. Bandwidth – Peak ratio: GTX280/Core i7-960 ~ 4.7X – Feature: large working sets; performance is bounded by bandwidth – Examples SAXPY (5.3X, sketch below) LBM (5X) SpMV (1.9X) – the CPU benefits from caching
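
SAXPY is the canonical bandwidth-bound kernel: two loads and one store per multiply-add, so its speedup tracks the 4.7X peak-bandwidth ratio rather than the compute ratio. A minimal sketch (the standard formulation, not the paper's code):

```cuda
// y = a*x + y: 12 bytes moved per 2 FLOPs, so DRAM bandwidth is the limit.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```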

17 Analyzing the results (3) 2. Compute FLOPS (single precision) – Peak ratio: GTX280/Core i7-960 ~ 3X – Feature: bounded by computation; benefits from more cores – Examples SGEMM, Conv and FFT (2.8-4X)

18 Analyzing the results (4) 3. Compute FLOPS (double precision) – Peak ratio: GTX280/Core i7-960 ~ 1.5X – Feature: bounded by computation; benefits from more cores – Examples MC (1.8X) Blitz (5X) – uses transcendental operations Sort (1.25X slower) – due to reduced SIMD-width usage; depends on scalar performance

19 Analyzing the results (5) 4. Reduction and synchronization – Feature: the more threads, the higher the synchronization overhead – Examples Hist (1.8X) – on the CPU, 28% of the time goes to atomic operations; on the GPU, atomic operations are much slower (see the histogram sketch below) Solv (1.9X slower) – requires multiple kernel launches to preserve coherency on the GPU
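
A sketch of the Hist pattern (a hypothetical 256-bin formulation over 8-bit input, not the paper's code): each block accumulates into a private shared-memory histogram with on-chip atomics, then merges once into the global histogram, so the slow global atomics run once per bin per block instead of once per input element. (Shared-memory atomics need compute capability 1.2+, which the GTX280 has.)

```cuda
// Assumes blockDim.x == 256: one thread per bin for init and merge.
__global__ void hist256(const unsigned char *data, int n, unsigned int *bins)
{
    __shared__ unsigned int local[256];
    local[threadIdx.x] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);             // fast on-chip atomic
    __syncthreads();

    atomicAdd(&bins[threadIdx.x], local[threadIdx.x]);  // one global atomic per bin
}
```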

20 Analyzing the results (6) 5. Fixed function – Feature: interpolation, texturing and transcendental operations come as a bonus on the GPU – Examples Bilat (5.7X) – on the CPU, 66% of the time goes to transcendental operations GJK (14.9X) – uses texture lookups
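
These fixed-function units are exposed in CUDA through intrinsics and the texture API. A hypothetical Bilat-style helper using the fast-math exponential, which maps to the special function units (reduced precision, usually acceptable for image filtering):

```cuda
// Range weight of a bilateral filter (illustrative, not the paper's kernel).
__device__ float range_weight(float diff, float inv2sigma2)
{
    return __expf(-diff * diff * inv2sigma2);   // SFU-evaluated, much faster than expf
}
```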

21 Platform optimization guides CPU programmers have relied heavily on rising clock frequencies; their applications often do not benefit from TLP and DLP Today's CPUs have wider SIMD units, which sit idle unless exploited by the programmer (or compiler) This paper showed that careful multi-threading can shrink the gap dramatically – For LBM, from 114X down to 5X Let's learn some optimization tips from the authors

22 CPU optimization Scalability (4X): – Scale the kernel with the number of threads Blocking (5X): – Be aware of the cache hierarchy and use it efficiently (sketch below) Regularizing (1.5X): – Lay data out regularly to take advantage of SIMD
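
A host-side sketch of the blocking tip (hypothetical example and tile size): a naive transpose walks one array with stride n and thrashes the cache, whereas processing in BxB tiles keeps each tile resident in L1/L2.

```cuda
// Blocked matrix transpose; B chosen so a pair of tiles fits comfortably in L1.
void transpose_blocked(const float *in, float *out, int n)
{
    const int B = 64;
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            for (int i = ii; i < ii + B && i < n; ++i)
                for (int j = jj; j < jj + B && j < n; ++j)
                    out[j * n + i] = in[i * n + j];   // both tiles stay cache-hot
}
```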

23 GPU optimization Global synchronization – Reduce atomic operations Shared memory – Use shared memory to reduce off-chip demand – Shared memory is multi-banked and efficient for gather/scatter operations (reduction sketch below)
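
A sketch combining both tips (a hypothetical kernel, not the paper's code): a tree reduction in multi-banked shared memory produces one partial sum per block, leaving the partials to a second, tiny kernel launch, since kernel boundaries are the GPU's only cheap global synchronization point (compare Solv on slide 19).

```cuda
// Assumes blockDim.x == 256 (a power of two); partial[] has one slot per block.
__global__ void sum_reduce(const float *in, int n, float *partial)
{
    __shared__ float buf[256];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // on-chip tree reduction
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = buf[0];               // merged by a second launch
}
```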

24 Conclusion This work analyzed the performance of important throughput computing kernels on CPU and GPU – The gap is much smaller than previously reported (~2.5X) Recommendations for a throughput computing architecture: – High compute – High bandwidth – Large caches – Gather/scatter support – Efficient synchronization – Fixed-function units

25 Thank you for your attention. Any questions?

26 References
[LEE'2010] V. W. Lee et al., "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," ISCA 2010.
[DIXON'2010] M. Dixon et al., "The Next-Generation Intel Core Microarchitecture," Intel Technology Journal, vol. 14, no. 3, 2010.
[LINDHOLM'2008] E. Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, 2008.

