Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU

Presented by: Ahmad Lashgar, ECE Department, University of Tehran
Seminar of Parallel Processing, Instructor: Dr. Fakhraie, 29 Dec 2011
Original paper: Victor W. Lee et al., Intel Corporation, ISCA 2010
Some slides are included from the original paper for educational purposes only.

Abstract

Is the GPU the silver bullet of parallel computing?
How large is the gap between peak and achievable performance?

Overview

Abstract
Architecture
– CPU: Intel Core i7
– GPU: NVIDIA GTX280
Implications for throughput computing applications
Methodology
Results
Analyzing the results
Platform optimization guides
Conclusion

Architecture (1)

Intel Core i7-960 [DIXON’2010]
– 4 cores, 3.2 GHz
– 2-way multi-threading (SMT)
– 4-wide SIMD
– Caches: L1 32KB, L2 256KB, L3 8MB
– Memory bandwidth: 32 GB/s

Architecture (2)

NVIDIA GTX280 [LINDHOLM’2008]
– 30 cores (SMs), 1.3 GHz
– 1024-way multi-threading (up to 1024 threads per core)
– 8-wide SIMD
– 16KB software-managed cache per core (shared memory)
– Memory bandwidth: 141 GB/s

Architecture (3)

                            Core i7-960     GTX280
Cores                       4               30
Frequency (GHz)             3.2             1.3
Transistors                 0.7B (263mm2)   1.4B (576mm2)
Memory bandwidth (GB/s)     32              141
SP SIMD width               4               8
DP SIMD width               2               1
Peak SP scalar GFLOPS       25.6            116.6
Peak SP SIMD GFLOPS         102.4           311.1 (933.1)
Peak DP SIMD GFLOPS         51.2            77.8

Red text in the original slide marks numbers that are not the authors’ own.
(A quick derivation of these peak numbers follows below.)
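As a sanity check (my own arithmetic, not part of the original slides), the peak numbers follow directly from cores × frequency × flops per cycle:

```
Core i7-960: 25.6  = 4 cores × 3.2 GHz × 2 scalar flops/cycle (separate mul + add ports)
             102.4 = 4 cores × 3.2 GHz × 8 SP flops/cycle (4-wide SSE mul + 4-wide add)
             51.2  = 4 cores × 3.2 GHz × 4 DP flops/cycle (2-wide mul + 2-wide add)
GTX280:      311.1 = 30 SMs × 8 lanes × 1.296 GHz × 1 flop/cycle
             933.1 = 30 SMs × 8 lanes × 1.296 GHz × 3 flops/cycle (dual-issued MAD + MUL)
             77.8  = 30 SMs × 1 DP unit × 1.296 GHz × 2 flops/cycle (MAD)
```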

Implications for throughput computing applications

1. Difference in number of cores
2. Cache size / multi-threading
3. Bandwidth difference

1. Difference in number of cores

It is all about core complexity:
– The common goal: improving pipeline efficiency
– CPU goal: single-thread performance
   • Exploiting ILP
   • Sophisticated branch prediction
   • Multiple-issue logic
– GPU goal: throughput
   • Interleaving hundreds of threads

2. Cache size / multi-threading

CPU goal: reducing memory latency
– Programmer-transparent data caching; growing cache sizes capture the working set
– Prefetching (HW/SW)
GPU goal: hiding memory latency
– Interleave the execution of hundreds of threads so they hide each other’s latency
Notice the crossover (see the sketch below):
– The CPU also uses multi-threading (2-way SMT) for latency hiding
– The GPU also uses software-controlled caching (shared memory) to reduce memory latency
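A minimal CUDA sketch (my illustration, not from the paper or the slides) of the GPU side of this crossover: a block stages data into the 16KB software-managed shared memory, while the many in-flight warps hide the latency of the global-memory loads.

```cuda
// Minimal sketch: shared memory as a software-managed cache.
// Assumes blockDim.x == 256. Each block stages a 256-element tile on
// chip; while one warp waits on its global load, the scheduler runs
// other warps, hiding the latency.
__global__ void stage_and_reuse(const float *in, float *out, int n)
{
    __shared__ float tile[256];                  // lives in the 16KB shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // one off-chip load per element
    __syncthreads();                             // tile is now "cached" on chip

    // Reuse the tile from on-chip memory instead of re-reading DRAM;
    // a neighbor average stands in for a real stencil here.
    if (i < n) {
        float left  = (threadIdx.x > 0)   ? tile[threadIdx.x - 1] : tile[threadIdx.x];
        float right = (threadIdx.x < 255) ? tile[threadIdx.x + 1] : tile[threadIdx.x];
        out[i] = 0.5f * (left + right);
    }
}
```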

3. Bandwidth difference

Bandwidth versus latency:
CPU goal: single-thread performance
– Workloads do not demand many in-flight memory accesses
– Bring the data in as soon as possible (low latency)
GPU goal: throughput
– There are lots of memory accesses, so provide high bandwidth
– Latency does not matter; the cores will hide it!

Methodology (1)

Hardware
– Intel Core i7-960, 6GB DRAM
– GTX280, 1GB
Software
– SUSE Enterprise 11
– CUDA Toolkit

Methodology (2)

Optimizations
– On CPU: SGEMM, SpMV, and FFT from Intel MKL 10; always 2 threads per core
– On GPU: best available algorithms for SpMV, FFT, and MC; often 128 to 256 threads per core (to balance shared-memory and register-file usage)
– Interleaving GPU execution with HD/DH (host-to-device / device-to-host) memory transfers where possible (sketched below)
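A minimal CUDA sketch of that last point, with assumed names (process, chunk, the buffers), not the paper’s code: splitting the work across two streams lets the copies of one stream overlap the kernel of the other. Host buffers must be page-locked (cudaMallocHost) for the async copies to be truly asynchronous.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; stands in for whatever work is being overlapped.
__global__ void process(const float *in, float *out, int n);

void run_overlapped(const float *h_in, float *h_out,
                    float *d_in, float *d_out, int chunk)
{
    cudaStream_t s[2];
    for (int k = 0; k < 2; ++k) cudaStreamCreate(&s[k]);

    for (int k = 0; k < 2; ++k) {
        int off = k * chunk;
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);           // HD copy
        process<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d_in + off,
                                                       d_out + off, chunk);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);           // DH copy
    }
    cudaDeviceSynchronize();    // wait for both streams to drain

    for (int k = 0; k < 2; ++k) cudaStreamDestroy(s[k]);
}
```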

Results

The HD/DH data transfer time is not considered.
Only 2.5X on average
– Far from the 100X reported by previous studies

Where did the speedups of previous studies come from?!

Which CPU and which GPU were compared?
How much optimization was applied to the CPU and to the GPU?
– Studies that optimized both platforms reported much lower speedups (as this paper does)

Analyzing the results (1)

1. Bandwidth
2. Compute flops (single precision)
3. Compute flops (double precision)
4. Reduction and synchronization
5. Fixed function

Analyzing the results (2)

1. Bandwidth
– Peak ratio: GTX280 / Core i7-960 ~ 4.7X
– Feature: large working set; performance is bounded by bandwidth
– Examples (see the SAXPY sketch below)
   • SAXPY (5.3X)
   • LBM (5X)
   • SpMV (1.9X) – the CPU benefits from caching
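A minimal CUDA sketch of SAXPY (my illustration of why it is bandwidth-bound, not the paper’s code): each element costs 2 flops but 12 bytes of memory traffic, so the memory system, not the ALUs, sets the speed, and the kernel speedup tracks the ~4.7X bandwidth ratio.

```cuda
// Minimal sketch: SAXPY, y = a*x + y.
// Per element: 2 flops vs. 12 bytes moved (load x, load y, store y),
// so throughput is limited by memory bandwidth, not compute.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```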

Analyzing the results (3)

2. Compute flops (single precision)
– Peak ratio: GTX280 / Core i7-960 ~ 3X
– Feature: bounded by computation; benefits from more cores
– Examples
   • SGEMM, Conv, and FFT (2.8-4X)

Analyzing the results (4)

3. Compute flops (double precision)
– Peak ratio: GTX280 / Core i7-960 ~ 1.5X
– Feature: bounded by computation; benefits from more cores
– Examples
   • MC (1.8X)
   • Blitz (5X) – uses transcendental operations
   • Sort (1.25X slower) – due to reduced SIMD-width utilization; depends on scalar performance

Analyzing the results (5)

4. Reduction and synchronization
– Feature: the more threads, the higher the synchronization overhead
– Examples (see the histogram sketch below)
   • Hist (1.8X)
      – On CPU, 28% of the time is spent on atomic operations
      – On GPU, atomic operations are much slower
   • Solv (1.9X slower)
      – Requires multiple kernel launches to preserve cache coherency on the GPU
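A minimal CUDA sketch of the kind of code that makes Hist synchronization-bound (my illustration, not the paper’s kernel): every input element issues a global atomicAdd, which serializes whenever threads collide on the same bin.

```cuda
// Minimal sketch: naive GPU histogram.
// Each element issues a global-memory atomicAdd; threads that hit the
// same bin serialize, and on GTX280-era hardware these atomics are far
// slower than the CPU's cached atomic operations.
__global__ void hist_naive(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}
```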

Analyzing the results (6)

5. Fixed function
– Feature: interpolation, texturing, and transcendental operations come almost for free on the GPU’s fixed-function hardware
– Examples
   • Bilat (5.7X) – on CPU, 66% of the time is spent on transcendental operations
   • GJK (14.9X) – uses texture lookups

Platform optimization guides

CPU programmers have long relied on increasing clock frequency; their applications do not benefit from TLP and DLP.
Today’s CPUs have wide SIMD units that stay idle unless exploited by the programmer (or compiler).
This paper showed that careful multi-threading can shrink the gap dramatically
– For LBM, from 114X down to 5X
Let’s learn some optimization tips from the authors.

CPU optimization

Scalability (4X):
– Scale the kernel with the number of threads
Blocking (5X):
– Be aware of the cache hierarchy and use it efficiently (see the sketch below)
Regularizing (1.5X):
– Lay out data regularly and aligned to take advantage of SIMD
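A minimal host-side sketch of cache blocking (my example with an assumed tile size, not the paper’s code): tiling a matrix transpose so each BxB block of both arrays fits in cache.

```cuda
// Minimal sketch: cache-blocked matrix transpose (host code).
// B is tuned so a BxB tile of 'in' and 'out' fits in L1/L2; the inner
// loops then hit cache instead of missing on every strided access.
void transpose_blocked(const float *in, float *out, int n)
{
    const int B = 64;  // assumed tile edge; tune to the cache size
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            for (int i = ii; i < ii + B && i < n; ++i)
                for (int j = jj; j < jj + B && j < n; ++j)
                    out[j * n + i] = in[i * n + j];
}
```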

GPU optimization

Global synchronization
– Reduce atomic operations (see the privatized-histogram sketch below)
Shared memory
– Use shared memory to reduce off-chip bandwidth demand
– Shared memory is multi-banked and efficient for gather/scatter operations
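A minimal CUDA sketch combining both tips (my illustration, not the paper’s code): privatizing the earlier histogram in shared memory, so each block issues only one global atomic per bin instead of one per input element. Shared-memory atomics require compute capability 1.2+, which the GTX280 (1.3) provides.

```cuda
// Minimal sketch: privatized histogram, assuming 256 bins and
// blockDim.x == 256. Per-block counts accumulate in fast, multi-banked
// shared memory; only one global atomicAdd per bin remains.
__global__ void hist_privatized(const unsigned char *data, int n,
                                unsigned int *bins)
{
    __shared__ unsigned int local[256];
    local[threadIdx.x] = 0;
    __syncthreads();

    // Grid-stride loop: on-chip atomics are much cheaper than global ones.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    atomicAdd(&bins[threadIdx.x], local[threadIdx.x]);  // merge into global bins
}
```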

Conclusion

This work analyzed the performance of important throughput computing kernels on CPU and GPU
– The gap is much smaller than previous reports claimed (~2.5X)
Recommendations for a throughput computing architecture:
– High compute
– High bandwidth
– Large caches
– Gather/scatter support
– Efficient synchronization
– Fixed-function units

Thank you for your attention. Any questions?

References

[LEE’2010] V. W. Lee et al., “Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU,” ISCA 2010.
[DIXON’2010] M. Dixon et al., “The Next-Generation Intel® Core™ Microarchitecture,” Intel Technology Journal, Volume 14, Issue 3, 2010.
[LINDHOLM’2008] E. Lindholm et al., “NVIDIA Tesla: A Unified Graphics and Computing Architecture,” IEEE Micro, 2008.