Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU


1 Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU
Presented by Chunyi
Victor W Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal and Pradeep Dubey
Throughput Computing Lab and Intel Architecture Group, Intel Corporation
ISCA’10, June 19–23, 2010, Saint-Malo, France.

2 Outline
– The Rise of GPGPU
– Compute / Bandwidth Bound Applications
– 14 Computing Kernels
– Optimization
– Debunking the 100X GPU vs. CPU Myth
– Summary
2011/09/30, Embedded System Lab., CCU

4 The Rise of GPGPU
– CPU: designed for a wide variety of applications and to provide fast response times for a single task
– GPU: built specifically for rendering graphics applications that have a large degree of data parallelism

5 The Rise of GPGPU
– GPUs are designed for graphics processing and contain many small processing elements
– Their massive processing capability has drawn programmers to explore general-purpose computing on GPUs

6 The Rise of GPGPU
                 Intel Core i7 960    NVIDIA GTX 280
Cores            4                    30*8
Peak SP Flop     102 GF/s             933 GF/s
Peak BW          30 GB/s              141 GB/s
SP: single-precision floating point. BW: local DRAM bandwidth.

7 The Rise of GPGPU
– High-level programming languages: HLSL, Cg, GLSL, CTM, BrookGPU
– Compute Unified Device Architecture (CUDA)
– Open Computing Language (OpenCL)
http://www.gpgpu.org

8 The Rise of GPGPU
– GPUs deliver a significant performance gain
– They are not orders of magnitude faster than CPUs

10 Compute / Bandwidth Bound
Performance depends on two resources:
– Compute does the work
– Bandwidth feeds the compute
For compute-bound applications: Performance = Efficiency * Peak Compute Capability
For bandwidth-bound applications: Performance = Efficiency * Peak Bandwidth Capability
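The two formulas on this slide share one shape, so a single helper can express both. The sketch below is illustrative only: the function name and the efficiency values are assumptions, and only the peak figures (933 GF/s and 141 GB/s for the GTX 280) come from the slides.

```python
def attainable_performance(efficiency: float, peak: float) -> float:
    """Performance = Efficiency * Peak capability.

    `peak` is the peak compute rate for a compute-bound kernel, or the
    peak memory bandwidth for a bandwidth-bound kernel.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie in [0, 1]")
    return efficiency * peak


# Hypothetical efficiencies, chosen only to show how the formula is used.
compute_bound = attainable_performance(0.5, 933.0)    # GF/s on the GTX 280
bandwidth_bound = attainable_performance(0.8, 141.0)  # GB/s on the GTX 280
```

Which of the two peaks applies depends on the kernel, which is why the later slides classify each kernel as compute bound or bandwidth bound.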

12 14 Computing kernels
SGEMM, Monte Carlo, Conv, FFT, SAXPY, LBM, Solv, SpMV, GJK, Sort, RC, Search, Hist, Bilat

13 14 Computing kernels
– Conv: common image filtering operation
– SAXPY: Basic Linear Algebra Subprogram
– LBM: Lattice Boltzmann method
– Solv: game physics simulators
– GJK: physically based animation simulations
– RC: ray casting, to visualize 3D datasets
– Hist: histogram computation
– Bilat: bilateral filter
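Two of the kernels listed above are small enough to show in full. The following is a minimal sketch in plain Python, not the tuned implementations evaluated in the paper; the function names and the equal-width binning scheme are assumptions made for illustration.

```python
def saxpy(a, x, y):
    """Return a*x + y element by element (the BLAS level-1 SAXPY operation)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


def histogram(values, num_bins, lo, hi):
    """Count values into num_bins equal-width bins covering [lo, hi]."""
    bins = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        if not lo <= v <= hi:
            raise ValueError("value outside [lo, hi]")
        # The upper edge hi is folded into the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        bins[idx] += 1
    return bins
```

SAXPY performs two floating-point operations per pair of elements loaded, so its speed is governed by how fast memory can deliver the data. The histogram updates a bin chosen by the data itself, which is what makes it awkward to vectorize or parallelize without conflicts.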

14 SGEMM (Single-precision General Matrix Multiply)
– Kernel in numerical linear algebra algorithms
– Regular access patterns
– Compute bound
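For reference, the BLAS SGEMM operation computes C = alpha*A*B + beta*C. The sketch below is a naive, unoptimized version on nested lists; the function name is an assumption, and Python floats are double precision, so this shows only the loop structure, not true single-precision arithmetic.

```python
def sgemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for dense row-major nested lists."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[beta * c[i][j] for j in range(m)] for i in range(n)]
    for i in range(n):
        for p in range(k):
            aip = alpha * a[i][p]
            # The inner loop walks row p of b and row i of out contiguously:
            # the regular access pattern the slide refers to.
            for j in range(m):
                out[i][j] += aip * b[p][j]
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
product = sgemm(1.0, a, b, 0.0, c)  # [[19.0, 22.0], [43.0, 50.0]]
```

The kernel performs about 2*n*k*m floating-point operations on only n*k + k*m + n*m matrix elements, so each loaded value can be reused many times. That reuse is why it is classified as compute bound.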

15 Monte Carlo
– Randomly samples a complex function
– Regular access patterns
– Compute bound
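As a small illustration of random sampling, the sketch below estimates the mean of a function over [0, 1). The integrand x*x is a stand-in chosen because its exact integral (1/3) is known; it is not the workload used in the paper, and the function name and seed are assumptions.

```python
import random


def monte_carlo_mean(f, num_samples, seed=0):
    """Estimate the mean of f over [0, 1) from uniform random samples."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    total = 0.0
    for _ in range(num_samples):
        total += f(rng.random())
    return total / num_samples


# The exact integral of x*x over [0, 1] is 1/3.
estimate = monte_carlo_mean(lambda x: x * x, 100_000)
```

Each sample is independent and needs almost no memory traffic, so the work is dominated by arithmetic and by random number generation.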

16 FFT (Fast Fourier Transform)
– Converts signals from the time domain to the frequency domain
– Regular access patterns
– Compute bound
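One textbook way to compute this transform is the recursive radix-2 Cooley-Tukey algorithm, sketched below. The slides do not say which FFT variant was benchmarked, so treat this as an illustration only; it assumes the input length is a power of two.

```python
import cmath


def fft(signal):
    """Recursive radix-2 FFT of a sequence whose length is a power of two."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(signal[0])]
    even = fft(signal[0::2])  # transform of the even-indexed samples
    odd = fft(signal[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


# A constant signal has all of its energy in the zero-frequency bin.
spectrum = fft([1, 1, 1, 1])
```

The butterfly loop touches its inputs at fixed, predictable strides, which matches the "regular access patterns" noted on the slide.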

17 SpMV (Sparse Matrix-Vector Multiplication)
– Gather access patterns
– Bandwidth bound
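The gather pattern is easiest to see with the matrix in compressed sparse row (CSR) form. The storage format is an assumption here, since the slide does not name one, and the function and argument names are illustrative.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Gather: x is read at whatever column the nonzero sits in.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])  # [7.0, 6.0, 19.0]
```

Every nonzero is used exactly once, for one multiply and one add, so there is little arithmetic to hide the cost of fetching the matrix and the scattered elements of x.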

18 Sort (Radix sort)
– Multi-pass sorting algorithm
– Gather / scatter access patterns
– Compute bound
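A least-significant-digit radix sort makes the "multi-pass" and "scatter" points concrete. This is a sequential sketch that assumes non-negative integer keys below 2**key_bits; the parameter names and the 8-bit digit width are illustrative choices, not values taken from the paper.

```python
def radix_sort(keys, bits_per_pass=8, key_bits=32):
    """LSD radix sort for non-negative integers smaller than 2**key_bits."""
    radix = 1 << bits_per_pass
    mask = radix - 1
    src = list(keys)
    for shift in range(0, key_bits, bits_per_pass):  # one pass per digit
        # Step 1: histogram of the current digit.
        counts = [0] * radix
        for key in src:
            counts[(key >> shift) & mask] += 1
        # Step 2: exclusive prefix sum gives each digit's first output slot.
        offsets = [0] * radix
        running = 0
        for digit in range(radix):
            offsets[digit] = running
            running += counts[digit]
        # Step 3: scatter each key to its slot; scanning in order keeps it stable.
        dst = [0] * len(src)
        for key in src:
            digit = (key >> shift) & mask
            dst[offsets[digit]] = key
            offsets[digit] += 1
        src = dst
    return src
```

The scatter in step 3 writes to data-dependent locations, which is the irregular access the slide labels gather/scatter.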

19 Search
– In-memory tree-structured index search
– Gather / scatter access patterns
– Compute bound for small trees, otherwise bandwidth bound
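As a simplified stand-in for an index tree, the sketch below runs a binary search over a sorted key array. Each probe halves the remaining range, much like descending one level of a balanced binary tree. The actual index layout used in the paper is not described on the slide, so the data structure and function name here are assumptions.

```python
def index_search(sorted_keys, query):
    """Return the position of query in sorted_keys, or -1 if it is absent."""
    lo, hi = 0, len(sorted_keys)
    while lo < hi:
        mid = (lo + hi) // 2  # the "node" visited at this level
        if sorted_keys[mid] < query:
            lo = mid + 1      # continue in the right half
        else:
            hi = mid          # continue in the left half
    if lo < len(sorted_keys) and sorted_keys[lo] == query:
        return lo
    return -1
```

When the whole index fits in cache the probes are cheap and the comparisons dominate. Once it no longer fits, each probe can miss in cache and memory traffic takes over, which is the distinction the slide draws between small and large trees.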

20 Experiment
1) A 3.2GHz Core i7-960 processor
– SUSE Enterprise Server 11 operating system
– 6GB of PC1333 DDR3 memory
2) A 1.3GHz GTX280 processor (with 1GB GDDR3 memory) in the same Core i7 system
– Nvidia driver version 19.180, CUDA 2.3 toolkit

21 Experiment
[Figure: measured results; annotated value: 2.5X]

23 Optimization
– CPU optimization
– GPU optimization
– Hardware recommendations

24 Optimization
CPU optimization
– Multi-threading
– SIMDification
– Cache blocking
– Memory management
– Data structure re-arrangement
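Of the CPU techniques listed, cache blocking is the easiest to show in a few lines. The sketch below tiles a matrix multiply so that each small block of the operands is reused before moving on. The block size of 2 is a placeholder, and in pure Python the restructuring shows only the loop shape, not a real speedup.

```python
def blocked_matmul(a, b, block=2):
    """Multiply dense nested-list matrices one block x block tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # The outer three loops pick a tile; the inner three work inside it.
    for ii in range(0, n, block):
        for pp in range(0, k, block):
            for jj in range(0, m, block):
                for i in range(ii, min(ii + block, n)):
                    for p in range(pp, min(pp + block, k)):
                        aip = a[i][p]
                        for j in range(jj, min(jj + block, m)):
                            c[i][j] += aip * b[p][j]
    return c
```

In a compiled language the block size would be chosen so that the working tiles fit in a given cache level, which turns repeated trips to DRAM into cache hits.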

25 Optimization
GPU optimization
– Multi-threading
– Branch divergence reduction
– Coalescing memory accesses
– Synchronization avoidance
– Local shared buffer optimization
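Coalescing can be approximated with a toy model: count how many aligned memory segments a group of threads touches in one access. The 128-byte segment and the 32-thread group below are assumed values for illustration, and real hardware rules are more detailed than this count.

```python
def segments_touched(addresses, segment_bytes=128):
    """Count the distinct aligned segments covered by a set of byte addresses."""
    return len({addr // segment_bytes for addr in addresses})


threads = range(32)
# Unit stride: thread t reads the 4-byte word at byte offset 4*t.
unit_stride = [4 * t for t in threads]
# Large stride: thread t reads one 4-byte word from its own 128-byte row.
strided = [128 * t for t in threads]

coalesced_count = segments_touched(unit_stride)  # 1 segment
scattered_count = segments_touched(strided)      # 32 segments
```

Under this model the unit-stride access needs one memory transaction where the strided access needs 32, even though both deliver the same number of useful bytes.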

26 Hardware Recommendations
– Large cache
– High memory bandwidth
– Efficient synchronization
– Cache coherence

27 Truth or Myth?
GPUs are 10–100x faster than CPUs

28 Max Speedup
                 Intel Core i7 960    NVIDIA GTX 280
Cores            4                    30*8
Peak SP Flop     102 GF/s             933 GF/s
Peak BW          30 GB/s              141 GB/s
Max speedup, GTX 280 over Core i7 960:
– Compute-bound apps (SP): 933/102 = 9.1x
– Bandwidth-bound apps: 141/30 = 4.7x
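The bound on this slide is just a ratio of peak figures. The sketch below reproduces it using the numbers that appear in the slide's own divisions (933/102 and 141/30); the dictionary layout and function name are illustrative.

```python
PEAK = {
    "Core i7 960": {"sp_gflops": 102.0, "bw_gbs": 30.0},
    "GTX 280": {"sp_gflops": 933.0, "bw_gbs": 141.0},
}


def max_speedup(metric):
    """Upper bound on GTX 280 speedup over Core i7 960 for one peak resource."""
    return PEAK["GTX 280"][metric] / PEAK["Core i7 960"][metric]


compute_bound_limit = round(max_speedup("sp_gflops"), 1)  # 9.1
bandwidth_bound_limit = round(max_speedup("bw_gbs"), 1)   # 4.7
```

Because performance is efficiency times peak capability, a measured speedup far above these ratios suggests that the CPU baseline was running at much lower efficiency than the GPU code, not that the hardware gap is larger.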

31 Summary
– Without parallelization, neither the GPU nor the CPU will perform well
– Architecture-specific optimization
– Compare performance fairly
– Memory is the key in multicore

32 Thanks for your attention. Any questions?

