Debunking the 100X GPU vs CPU Myth: An Evaluation of Throughput Computing on CPU and GPU. Victor W. Lee, et al., Intel Corporation. ISCA ’10, June 19-23, 2010, Saint-Malo, France

Presentation transcript:


Mythbusters’ view on the topic: CPU vs GPU (“GPU, or Paintball Cannons are Cool”). Full movie:

The Initial Claim Over the past 4 years NVIDIA has made a great many claims that porting various types of applications from CPUs to GPUs can tremendously improve performance, by anywhere from 10x to 500x. But the claims actually began much earlier (SIGGRAPH 2004): GP2-CPU-vs-GPU-BillMark.pdf

Intel’s Response? Intel, unsurprisingly, sees the situation differently, but has remained relatively quiet on the issue, possibly because Larrabee was going to be positioned as a discrete GPU.

Intel’s Response? The recent announcement that Larrabee has been repurposed as an HPC/scientific computing solution may therefore be partially responsible for Intel ramping up an offensive against NVIDIA's claims regarding GPU computing. At the International Symposium on Computer Architecture (ISCA) this June, a team from Intel presented a white paper purporting to investigate the real-world performance delta between CPUs and GPUs.

But before that…. December 16, 2009, one month after ISCA’s final papers were due: the Federal Trade Commission filed an antitrust lawsuit against Intel, accusing the chip maker of deliberately attempting to hurt its competition and ultimately consumers. The FTC's complaint against Intel for alleged anticompetitive practices has a new twist: graphics chips.

2009 was expensive for Intel The European Commission fined Intel nearly 1.5 billion USD, the US Federal Trade Commission sued Intel on antitrust grounds, and Intel settled with AMD for another 1.25 billion USD. If nothing else it was an expensive year, and while settling with AMD was a significant milestone for the company, it was not the end of Intel's troubles.

Finally, the settlement(s) The EU fine is still under appeal ($1.45B). 8/4/2010: Intel settles with the FTC. Then there is the whole Dell issue….

So back to the paper: what did Intel say? Throughput computing. Kernels: what is a kernel? (Here, a compute-intensive core routine extracted from a larger application and used as a benchmark.) Kernels selected: SGEMM, MC, Conv, FFT, SAXPY, LBM, Solv, SpMV, GJK, Sort, RC, Search, Hist, Bilat.
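For a concrete sense of what these kernels look like, SAXPY (one of the fourteen listed) is the simplest: single-precision a*x + y over long vectors. A minimal NumPy sketch, with an array size of my own choosing rather than the paper's:

```python
import numpy as np

def saxpy(a, x, y):
    """SAXPY: single-precision a*x + y, elementwise over vectors."""
    return a * x + y

n = 1 << 20                           # 1M elements (illustrative size)
x = np.ones(n, dtype=np.float32)
y = np.full(n, 2.0, dtype=np.float32)
z = saxpy(3.0, x, y)                  # every element becomes 3*1 + 2 = 5
```

Kernels like this do almost no arithmetic per byte moved, which is exactly why (as the paper argues later) their performance is set by memory bandwidth rather than raw FLOPS.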

The Hardware Selected CPU: 3.2GHz Core i7-960, 6GB RAM. GPU: 1.3GHz eVGA GeForce GTX280 w/ 1GB.

Optimizations CPU: multithreading, cache blocking, and reorganization of memory accesses for SIMDification. GPU: minimizing global synchronization, and using local shared buffers.
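Cache blocking, one of the CPU optimizations listed, restructures loops so each tile of data is reused while it is still resident in cache instead of being streamed from memory repeatedly. A sketch of a blocked matrix multiply in Python (the block size and matrix size here are illustrative choices, not the paper's):

```python
import numpy as np

def blocked_matmul(A, B, bs=32):
    """Blocked (tiled) matrix multiply: work on bs x bs tiles so each
    tile of A and B is reused while it is still cache-resident."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                # accumulate the (i, j) output tile from one tile of A and B
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

A = np.random.rand(64, 64).astype(np.float32)
B = np.random.rand(64, 64).astype(np.float32)
C = blocked_matmul(A, B)
```

The same tiling idea shows up on the GPU side as "using local shared buffers": staging tiles in on-chip shared memory instead of re-reading device memory.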

This even made Slashdot Hardware: Intel, NVIDIA Take Shots At CPU vs. GPU Performance

And PCWorld Intel: 2-year-old Nvidia GPU Outperforms 3.2GHz Core i7. Intel researchers have published the results of a performance comparison between their latest quad-core Core i7 processor and a two-year-old Nvidia graphics card, and found that the Intel processor can't match the graphics chip's parallel processing performance. arold_nvidia_gpu_outperforms_32ghz_core_i7.html

From the paper's abstract: "In the past few years there have been many studies claiming GPUs deliver substantial speedups...over multi-core CPUs...[W]e perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7 960 processor narrows to only 2.5x on average." Do you have a problem with this statement?

Intel's own paper indirectly raises a question when it notes: "The previously reported LBM number on GPUs claims 114X speedup over CPUs. However, we found that with careful multithreading, reorganization of memory access patterns, and SIMD optimizations, the performance on both CPUs and GPUs is limited by memory bandwidth and the gap is reduced to only 5X."
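The 5X figure is roughly what the two memory systems alone predict. Using approximate peak-bandwidth figures (my numbers from public spec sheets, not quoted from the paper: roughly 141.7 GB/s for the GTX280 and roughly 32 GB/s for the Core i7-960), a purely bandwidth-bound kernel like LBM cannot beat the ratio of the two bandwidths once both sides are well optimized:

```python
gtx280_bw = 141.7   # GB/s, approximate peak memory bandwidth of the GTX280
i7_960_bw = 32.0    # GB/s, approximate peak memory bandwidth of the Core i7-960

# Once a kernel is memory-bandwidth-bound on both machines, the attainable
# speedup is capped by the ratio of peak bandwidths.
max_speedup = gtx280_bw / i7_960_bw
print(f"bandwidth-bound speedup cap: about {max_speedup:.1f}x")  # about 4.4x
```

That cap of roughly 4-5x is consistent with the "reduced to only 5X" the paper reports, and inconsistent with the earlier 114X claim unless the CPU baseline was badly under-optimized.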

What is important about the context? The International Symposium on Computer Architecture (ISCA) in Saint-Malo, France, interestingly enough, is the same event where NVIDIA’s Chief Scientist Bill Dally received the prestigious 2010 Eckert-Mauchly Award for his pioneering work in architecture for parallel computing.

NVIDIA Blog Response: It’s a rare day in the world of technology when a company you compete with stands up at an important conference and declares that your technology is *only* up to 14 times faster than theirs. only-up-to-14-times-faster-than-cpus-says-intel/

NVIDIA Blog Response (cont.): The real myth here is that multi-core CPUs are easy for any developer to use and see performance improvements.

Undergraduate students learning parallel programming at M.I.T. disputed this when they looked at the performance increase they could get from different processor types and compared it with the amount of time they needed to spend re-writing their code. According to them, for the same investment of time as coding for a CPU, they could get more than 35x the performance from a GPU.

Despite substantial investments in parallel computing tools and libraries, efficient multi-core optimization remains in the realm of experts like those Intel recruited for its analysis. In contrast, the CUDA parallel computing architecture from NVIDIA is a little over 3 years old, and already hundreds of consumer, professional and scientific applications are seeing speedups ranging from 10x to 100x using NVIDIA GPUs.

Questions Where did the 2.5x, 5x, and 14x come from? How big were the problems that Intel used for comparisons? [compare w/ cache size] How were they selected? What optimizations were done?
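The "compare with cache size" aside matters because once a kernel's working set fits in the CPU's last-level cache, the CPU side stops being bandwidth-bound and the GPU's bandwidth advantage evaporates. A back-of-the-envelope check (the 8 MB L3 of the Core i7-960 is from Intel's spec sheet; the matrix dimension is an arbitrary example of mine, not a size from the paper):

```python
l3_bytes = 8 * 1024**2               # Core i7-960 L3 cache: 8 MB
n = 1024                             # example SGEMM dimension
bytes_per_matrix = n * n * 4         # single-precision floats
working_set = 3 * bytes_per_matrix   # matrices A, B and C
fits = working_set <= l3_bytes
# 3 * 4 MB = 12 MB > 8 MB: a 1024x1024 SGEMM spills out of L3,
# while a 512x512 one (3 MB total) would fit entirely in cache.
```

So the chosen problem sizes can quietly decide which machine a benchmark favors, which is exactly why the slide asks how they were selected.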

Fermi cards were almost certainly unavailable when Intel commenced its project, but it's still worth noting that some of the GF100's architectural advances partially address (or at least alleviate) certain performance-limiting handicaps Intel points to when comparing Nehalem to a GT200 processor.

Bottom Line Parallelization is hard, whether you're working with a quad-core x86 CPU or a 240-core GPU; each architecture has strengths and weaknesses that make it better or worse at handling certain kinds of workloads.

Other Reading On the Limits of GPU Acceleration: ull_papers/Vuduc.pdf