Trends in Multicore Architecture
Kevin Skadron, University of Virginia, Dept. of Computer Science, LAVA Lab
© Kevin Skadron, 2008

Outline
- Objective of this segment: explain why this tutorial is relevant to researchers in systems design
- Why multicore?
- Why GPUs?
- Why CUDA?

Why Multicore?
- Why the dramatic paradigm shift? A combination of the "ILP wall" and the "power wall", not just power
- ILP wall: wider superscalar issue and more aggressive out-of-order execution have run out of steam
- Power wall: the only way to boost single-thread performance was to crank frequency
  - Aggressive circuits (expensive)
  - Very deep pipelines of 30+ stages (expensive)
  - Power-saving techniques weren't able to compensate
- This leaves only the natural frequency growth due to technology scaling (~20-30% per generation)
- Expensive microarchitectures aren't needed to obtain that scaling

[Figure: maximum power (Watts) of Intel processors across generations, from the i386 and i486 through the Pentium with MMX, Pentium Pro, Pentium II, Pentium III, and Pentium 4 to the Core 2 Duo. Source: Intel]

The Multi-core Revolution
- Can't make a single core faster
- Moore's Law → the same core is 2X smaller per generation
- Need to keep adding value to maintain average selling price
  - More and more cache doesn't cut it
- Use all those transistors to put multiple processors (cores) on a chip
  - 2X cores per generation
  - Cores can potentially be optimized for power
- But harder to program, except for independent tasks
  - How many independent tasks are there to run at once?

[Figure: single-core Watts/SPEC through 2005 (courtesy Mark Horowitz)]

Where Do GPUs Fit In?
- Big-picture goal for processor designers: improve user benefit with Moore's Law
  - Solve bigger problems
  - Improve the user experience → induce users to buy new systems
- Need scalable, programmable multicore
  - Scalable: doubling processing elements (PEs) ~doubles performance; multicore achieves this, if your program has scalable parallelism
  - Programmable: easy to realize the performance potential
- GPUs can do this!
  - GPUs provide a pool of cores with general-purpose instruction sets
  - Graphics has lots of parallelism that scales to large numbers of cores
  - CUDA leverages this background
- Need to maximize performance/mm²
- Need high volume to achieve commodity pricing
  - GPUs of course leverage the 3D market

How do GPUs differ from CPUs?
- Key: performance/mm²
- Emphasize throughput, not per-thread latency
  - Maximize the number of PEs and their utilization
  - Amortize hardware in time: multithreading
- Hide latency with computation, not caching
  - Spend area on PEs instead
  - Hide latencies with fast thread switching and many threads per PE
  - GPUs are much more aggressive than today's CPUs: 24 active threads/PE on G80
- Exploit SIMD efficiency
  - Amortize hardware in space: share fetch/control among multiple PEs (8 in the case of the Tesla architecture)
  - Note that SIMD ≠ vector: NVIDIA's architecture is "scalar SIMD" (SIMT); AMD does both
- High bandwidth to global memory
  - Minimizes the amount of multithreading needed
  - The G80 memory interface is 384-bit; R600's is 512-bit
  - Net result: 470 GFLOP/s and ~80 GB/s sustained on G80
- CPUs seem to be following similar trends
(A minimal CUDA sketch of this threading model follows below.)
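To make the throughput-oriented threading model concrete, here is a minimal, hypothetical sketch (not from the original slides; all names and launch parameters are illustrative): a SAXPY kernel launched with far more threads than there are PEs, so each multiprocessor can switch among resident threads to hide global-memory latency instead of relying on large caches.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SAXPY: y = a*x + y. Each thread walks the array with a grid-wide stride,
// so far more threads can be launched than there are PEs; the hardware
// overlaps one thread's memory accesses with another thread's arithmetic.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_x = (float *)malloc(bytes), *h_y = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // Tens of thousands of threads in flight; each multiprocessor keeps many
    // resident at once to cover memory latency.
    saxpy<<<128, 256>>>(n, 2.0f, d_x, d_y);
    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %f (expect 4.0)\n", h_y[0]);
    cudaFree(d_x); cudaFree(d_y); free(h_x); free(h_y);
    return 0;
}
```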

How do GPUs differ from CPUs? (2)
- Hardware thread creation and management
  - A new thread for each vertex/pixel
  - CPU: requires kernel or user-level software involvement
- Virtualized cores
  - The program is agnostic about the physical number of cores (true for both 3D and general-purpose code)
  - CPU: the number of threads is generally a function of the number of cores
- Hardware barriers
- These characteristics simplify problem decomposition, scalability, and portability
- Nothing prevents non-graphics hardware from adopting these features
(A sketch of the barrier and virtualized-core launch model follows below.)
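As an illustration (hypothetical, not from the slides), the hardware barrier is exposed in CUDA as __syncthreads(), and the launch configuration names units of work rather than physical cores, so the same binary runs unchanged on GPUs with different numbers of multiprocessors. The kernel below reverses each 256-element tile of an array via on-chip shared memory; the barrier guarantees every thread's write is visible before any thread reads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Reverse each blockDim.x-sized tile of d in place. For brevity, assumes the
// array length is a multiple of blockDim.x.
__global__ void reverse_tiles(float *d) {
    __shared__ float tile[256];                  // per-block scratchpad
    int base = blockIdx.x * blockDim.x;
    tile[threadIdx.x] = d[base + threadIdx.x];
    __syncthreads();                             // hardware barrier: all writes to tile are done
    d[base + threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
}

int main() {
    const int n = 1024;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // The launch names work (4 blocks of 256 threads), not hardware: the
    // scheduler maps blocks onto however many multiprocessors exist.
    reverse_tiles<<<n / 256, 256>>>(d);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h[0] = %f (expect 255.0)\n", h[0]);
    cudaFree(d);
    return 0;
}
```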

How do GPUs differ from CPUs? (3)
- Specialized graphics hardware (here I only discuss what is exposed through CUDA)
  - Texture path: high-bandwidth gather and interpolation
  - Constant memory: even higher-bandwidth access to small read-only data regions
  - Transcendentals (reciprocal sqrt, trig, log2, etc.)
- A different implementation of atomic memory operations
  - GPU: handled in the memory interface
  - CPU: generally handled with CPU involvement
- Local scratchpad in each core (a.k.a. per-block shared memory)
- The memory system exploits spatial, not temporal, locality
(A sketch exercising several of these features follows below.)
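A hypothetical histogram sketch (not from the slides; names are illustrative, and shared-memory atomics require a slightly newer compute capability than the original G80) that touches three of the features listed above: __constant__ memory for a small read-only value, the per-block shared-memory scratchpad, and atomic updates handled by the memory system.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NBINS 16

__constant__ float c_scale;   // small read-only value, broadcast to every thread

// Per-block histogram in the shared-memory scratchpad, then one global
// atomic per bin per block to merge the partial results.
__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    __shared__ unsigned int local[NBINS];
    if (threadIdx.x < NBINS) local[threadIdx.x] = 0;
    __syncthreads();

    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        int b = (int)(data[i] * c_scale);        // map a byte value to a bin
        if (b >= NBINS) b = NBINS - 1;
        atomicAdd(&local[b], 1u);                // cheap: stays on chip
    }
    __syncthreads();

    if (threadIdx.x < NBINS)
        atomicAdd(&bins[threadIdx.x], local[threadIdx.x]);
}

int main() {
    const int n = 1 << 20;
    unsigned char *d_data;
    unsigned int *d_bins;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_bins, NBINS * sizeof(unsigned int));
    cudaMemset(d_data, 7, n);                    // dummy input: every byte is 7
    cudaMemset(d_bins, 0, NBINS * sizeof(unsigned int));

    float scale = NBINS / 256.0f;                // 256 possible byte values mapped to NBINS bins
    cudaMemcpyToSymbol(c_scale, &scale, sizeof(float));

    histogram<<<64, 256>>>(d_data, n, d_bins);

    unsigned int h_bins[NBINS];
    cudaMemcpy(h_bins, d_bins, sizeof(h_bins), cudaMemcpyDeviceToHost);
    printf("bin[0] = %u (expect %d)\n", h_bins[0], n);
    cudaFree(d_data);
    cudaFree(d_bins);
    return 0;
}
```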

How do GPUs differ from CPUs? (4)
- The fundamental trends are actually very general: exploit parallelism in time and space
- Other processor families are following similar paths (multithreading, SIMD, etc.)
  - Niagara
  - Larrabee
  - Network processors
  - Clearspeed
  - Cell BE
  - Many others...
- Alternative: heterogeneous designs (nomenclature note: asymmetric = single-ISA, heterogeneous = multi-ISA)
  - Cell BE
  - Fusion

Why is CUDA Important?
- Mass-market platform
  - Easy to buy and set up a system
- Provides a solution for manycore parallelism
  - Not limited to small core counts
- Easy-to-learn abstractions for massive parallelism
  - Abstractions are not tied to a specific platform
  - Doesn't depend on the graphics pipeline; can be implemented on other platforms
  - Preliminary results suggest that CUDA programs run efficiently on multicore CPUs [Stratton'08]
- Supports a wide range of application characteristics
  - More general than streaming
  - Not limited to data parallelism
(A minimal CUDA program illustrating these abstractions follows below.)
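As a rough indication of what those abstractions look like in practice, here is a minimal, self-contained vector-addition program (illustrative, not taken from the tutorial): a kernel, a grid of thread blocks, and explicit copies between host and device memory are essentially the only concepts involved.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One scalar thread per element; the grid/block hierarchy supplies the index.
__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 16;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;        // enough blocks to cover n elements
    vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %f (expect 300.0)\n", h_c[100]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```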

Why is CUDA Important? (2)
- CUDA + GPUs facilitate multicore research at scale
  - 16 8-way SIMD cores = 128 PEs
  - A simple programming model allows exploration of new algorithms, hardware bottlenecks, and parallel programming features
  - The whole community can learn from this
- CUDA + GPUs provide a real platform, today
  - Results are not theoretical
  - Increases interest from potential users, e.g. computational scientists
  - Boosts opportunities for interdisciplinary collaboration
- The underlying ISA can be targeted with new languages
  - A great opportunity for research in parallel languages
- CUDA is teachable
  - Undergrads can start writing real programs within a couple of weeks

Thank you! Questions?