Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU. Victor W. Lee, et al., Intel Corporation. ISCA ’10, June 19-23, 2010, Saint-Malo, France

Presentation transcript:

Debunking the 100X GPU vs CPU Myth: An Evaluation of Throughput Computing on CPU and GPU Victor W. Lee, et al. Intel Corporation ISCA ’10 June 19-23, 2010, Saint-Malo, France

Intel vs. NVIDIA: Throughput Computing Smackdown! Victor W. Lee, et al. Intel Corporation ISCA ’10 June 19-23, 2010, Saint-Malo, France

Results
[Comparison chart: Intel’s paper vs. previous papers]

Architecture: Core i7 960 vs. GTX 280

Hardware
Core i7 960:
– 4 processing elements (cores)
– 2-way hyper-threading
– 4-wide SIMD
– Caches: 32KB / 256KB / 8MB
– out-of-order, super-scalar
GTX 280:
– 30 processing elements (SMs)
– 100s of hardware threads
– 8-wide SIMD*
– Caches: 16KB/…
– texture sampling units
– transcendental units

Hardware

Bandwidth Bound Kernels
SAXPY (scalar alpha times X plus Y)
LBM (lattice Boltzmann method)
SpMV (sparse matrix × vector)
Low compute-to-memory ratio
Optimizations:
– Blocking reduces cache misses
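To make the low compute-to-memory ratio concrete, here is a minimal CUDA sketch of SAXPY (the kernel name and launch parameters are illustrative, not taken from the paper): each thread does one multiply-add against two loads and one store, so the kernel saturates memory bandwidth long before it saturates the ALUs.

```cuda
#include <cuda_runtime.h>

// SAXPY: y = a*x + y. Per element: two 4-byte loads, one 4-byte
// store, and a single multiply-add, so performance is bound by
// memory bandwidth rather than compute, on both CPU and GPU.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launch over n elements, 256 threads per block:
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```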

Computation Bound Kernels
SGEMM (single-precision dense matrix multiply)
MC (Monte Carlo option pricing)
Conv (image convolution)
FFT (fast Fourier transform)
Bilat (bilateral filter)
+ GPU:
– Higher FLOPS
– Hardware transcendentals
+ CPU: super-scalar execution
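As a rough illustration of why the hardware transcendentals matter here (a made-up kernel, not one of the paper's benchmarks): the CUDA fast-math intrinsic __expf is served by the GPU's special function units, so exponent-heavy inner loops of the Monte Carlo flavor stay compute bound at high throughput.

```cuda
#include <cuda_runtime.h>

// Illustrative compute-bound kernel: many dependent FLOPs (and
// __expf calls, handled by the special function units) per
// 4 bytes loaded and 4 bytes stored.
__global__ void exp_iterate(int n, const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i];
        for (int k = 0; k < 64; ++k)    // arithmetic dominates traffic
            v = __expf(-v) + 0.5f * v;  // SFU transcendental
        out[i] = v;
    }
}
```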

Gather/Scatter Bound Kernels
GJK (collision detection)
RC (ray casting)
Irregular memory access
+ GPU: texture lookup
+ CPU: little need for SIMD
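A minimal sketch of the access pattern that defines this class (names are illustrative): a gather through an index array, where neighboring threads read unrelated addresses so loads cannot coalesce. On GT200-era hardware, routing such reads through the texture unit could soften the penalty, which is the GPU advantage noted above.

```cuda
#include <cuda_runtime.h>

// Gather: dst[i] = src[idx[i]]. Adjacent threads hit arbitrary
// addresses, defeating coalescing and SIMD-friendly access; this
// is the pattern that makes GJK and ray casting gather/scatter
// bound.
__global__ void gather(int n, const int *idx,
                       const float *src, float *dst) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[idx[i]];
}
```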

Synchronization Bound Kernels
Hist (histogram)
Solv (constraint solver)
Atomic access to the same memory
+ CPU: hardware atomic access
Optimization:
– Reduce synchronization…
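A minimal sketch of the contention problem (illustrative, not the paper's implementation): a global-memory histogram in which many threads may update the same bin through atomicAdd. The usual "reduce synchronization" optimization is to privatize per-block histograms in shared memory and merge them at the end.

```cuda
#include <cuda_runtime.h>

// Naive 256-bin histogram: every update is an atomic
// read-modify-write on global memory, and threads hitting the
// same bin serialize, which is what makes Hist synchronization
// bound.
__global__ void hist256(int n, const unsigned char *data,
                        unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}
```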

Drama Courtesy of Prof. Harris

December 16, 2009
– One month after ISCA’s final papers were due.
– The Federal Trade Commission filed an antitrust-related lawsuit against Intel, accusing the chip maker of deliberately attempting to hurt its competition and, ultimately, consumers.
– The Federal Trade Commission's complaint against Intel for alleged anticompetitive practices has a new twist: graphics chips.

2009 was expensive for Intel
The European Commission fined Intel nearly 1.5 billion USD,
the US Federal Trade Commission sued Intel on anti-trust grounds, and
Intel settled with AMD for another 1.25 billion USD.
– If nothing else it was an expensive year, and while Intel settling with AMD was a significant milestone for the company, it was not the end of their troubles.

Finally, the settlement(s)
The EU fine ($1.45B) is still under appeal.
8/4/2010: Intel settles with the FTC.
"Mother of All Programs": "…code name Intel bestowed on a series of payments it made to Dell…"
– Intel: "rebates" if you don't use AMD

What is important about the context? The International Symposium on Computer Architecture (ISCA) in Saint-Malo, France, interestingly enough, is the same event where NVIDIA’s Chief Scientist Bill Dally received the prestigious 2010 Eckert-Mauchly Award for his pioneering work in architecture for parallel computing.

NVIDIA Blog Response:
“It’s a rare day in the world of technology when a company you compete with stands up at an important conference and declares that your technology is *only* up to 14 times faster than theirs.”
“The real myth here is that multi-core CPUs are easy for any developer to use and see performance improvements.”
only-up-to-14-times-faster-than-cpus-says-intel.html

Undergraduate students learning parallel programming at M.I.T. disputed this when they compared the performance increase they could get from different processor types against the amount of time they needed to spend rewriting their code. According to them, for the same investment of time as coding for a CPU, they could get more than 35x the performance from a GPU.

Fermi cards were almost certainly unavailable when Intel commenced its project, but it's still worth noting that some of the GF100's architectural advances partially address (or at least alleviate) certain performance-limiting handicaps Intel points to when comparing Nehalem to a GT200 processor.

Can’t We All Get Along? Parallelization is hard, whether you're working with a quad-core x86 CPU or a 240-core GPU; each architecture has strengths and weaknesses that make it better or worse at handling certain kinds of workloads.