GPU Programming with CUDA – Accelerated Architectures Mike Griffiths


Overview
Why Accelerators?
Architectural Details
Latest Products
Accelerated Systems

Why Accelerators?

CPU Limitations
Power = Frequency × Voltage²
Performance improvements were traditionally realised by increasing frequency, with voltage decreased to keep power steady.
Voltage cannot be decreased much further: 1s and 0s are represented by different voltage levels, and the hardware must still be able to distinguish reliably between the two.

Moore's Law
Moore's Law: a doubling of transistors every couple of years, BUT clock speeds are no longer increasing and longer, more complex pipelines give diminishing returns.
Instead, increase performance by adding parallelism: perform many operations per clock cycle, with more cores and more operations per core, while keeping power per core low.

Accelerators
Much of the functionality of CPUs is unused for HPC: branch prediction, out-of-order execution, etc.
Ideally for HPC we want simple, low-power, highly parallel cores.
Problem: we still need operating systems, I/O and scheduling.
Solution: "hybrid systems" – CPUs provide management, while "accelerators" (or co-processors) provide the compute power.

Architectural Details

Designing an Accelerator
Chip fabrication is prohibitively expensive and the HPC market is relatively small, so HPC borrows hardware from larger markets.
Graphics Processing Units (GPUs) evolved from the desire for improved graphical realism in games.
They have a significantly different architecture: lots of number-crunching cores, highly parallel.
Initially GPUs were simply pressed into general-purpose use (GPGPU); NVIDIA and AMD now tailor their architectures for HPC.

What are the alternatives?
Intel Xeon Phi – Many Integrated Core (MIC) architecture: many simple, low-power Pentium-derived cores with wide vector units.
Closer to a traditional multi-core CPU, which simplifies programming (in principle).
Generations codenamed "Larrabee", "Knights Ferry", "Knights Corner" and "Knights Landing".

Accelerators in HPC
Accelerators already power the largest machines: one top system contains 48,000 Xeon Phi boards, while another pairs Opteron CPUs with an equal number of GPUs.

Architecture of a Multi-Core CPU
AMD 12-core: not much die area is dedicated to compute.
(Diagram: each compute unit = one core.)

Architecture of an NVIDIA GPU
NVIDIA Fermi GPU: much more die area is dedicated to compute, at the cost of cache and advanced features.
(Diagram: each compute unit = one streaming multiprocessor, each with 32 cores.)

Architecture of a Xeon Phi
The Xeon Phi similarly dedicates a large amount of die area to compute.
(Diagram: each compute unit = one core.)

Memory
Accelerators use dedicated graphics memory, separate from the CPU's "main" memory.
Many HPC applications require high memory bandwidth: GPUs and the Xeon Phi use graphics DRAM, while CPUs use ordinary DRAM.

Latest Products (with a focus on NVIDIA GPUs)

Latest Products
NVIDIA – Tesla GPUs: built specifically for HPC, using the same architecture as GeForce.
AMD – FirePro: HPC parts evolved from the ATI Radeon line.
Intel – Xeon Phi: recently emerged to compete with GPUs.

Tesla Series GPUs
The chip is partitioned into Streaming Multiprocessors (SMs), each containing multiple cores.
SMs are not cache coherent: no communication is possible across SMs.

NVIDIA Streaming Multiprocessor
There are fewer scheduling units than cores: threads are scheduled in groups of 32, called a warp.
Threads within a warp always execute the same instruction in lock-step (on different data elements).
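The lock-step model means a data-dependent branch within a warp serialises both paths. A minimal sketch (the kernel and array names are illustrative, not from the slides):

```cuda
#include <cstdio>

// Each thread handles one element. If threads 0-15 of a warp take the
// 'if' branch while threads 16-31 take the 'else', the warp executes
// both branches one after the other, with inactive threads masked off.
__global__ void branchy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.0f)          // divergent within a warp if signs are mixed
        out[i] = in[i] * 2.0f;
    else
        out[i] = -in[i];
}
```

Performance is best when all 32 threads of a warp follow the same path, e.g. when branches depend on the block index rather than per-element data.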

Tesla Range Specifications

                     "Fermi"   "Fermi"   "Fermi"   "Kepler"  "Kepler"  "Kepler"
                     2050      2070      2090      K20       K20X      K40
CUDA cores           448       448       512       2496      2688      2880
DP performance       515 GF    515 GF    665 GF    1.17 TF   1.31 TF   1.43 TF
Memory bandwidth     144 GB/s  144 GB/s  178 GB/s  208 GB/s  250 GB/s  288 GB/s
Memory               3 GB      6 GB      6 GB      5 GB      6 GB      12 GB

NVIDIA Roadmap

Accelerated Systems

Machine architectures
Shared-memory system (e.g. Sunfire, SGI Origin, symmetric multiprocessors): processors share a single memory through an interconnect, and the interconnect plus memory can serve as the communications network.
Distributed-memory system (e.g. Beowulf clusters): each processor has its own memory; the architecture matches the message-passing paradigm.
(Diagrams: P = processor, C = cache, M = memory.)

Accelerated Systems
CPUs and accelerators are used together; GPUs cannot be used instead of CPUs.
The GPU performs the compute-heavy parts, and communication between CPU and GPU is via the PCIe bus.
(Diagram: CPU with DRAM and GPU/accelerator with GDRAM, linked by PCIe.)
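In CUDA this division of labour is explicit: the host allocates device memory and copies data across PCIe before and after the compute runs. A minimal host-side sketch (variable names are illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);    // host (CPU) memory: DRAM
    for (int i = 0; i < n; ++i) h_a[i] = 1.0f;

    float *d_a;
    cudaMalloc(&d_a, bytes);                // device (GPU) memory: GDRAM

    // Each of these copies crosses the PCIe bus
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    // ... launch kernels operating on d_a here ...
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    free(h_a);
    return 0;
}
```

Because PCIe bandwidth is far lower than on-card graphics memory bandwidth, such transfers should be kept to a minimum.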

Larger Accelerated Systems
Each "shared-memory node" can contain multiple CPUs and accelerators.
The CPUs share memory, but the accelerators do not!
(Diagram: two CPU + GPU/accelerator pairs within a node, nodes joined by an interconnect.)

Accelerated Supercomputers
(Diagram: many such accelerated nodes connected by an interconnect.)

Multiple Accelerators in Parallel
(Normally) use one host CPU core (thread) per accelerator.
The program manages communication between the host CPUs: MPI for distributed memory, OpenMP for shared memory on the same node.
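The one-host-thread-per-accelerator pattern can be sketched with OpenMP on a single node (an assumed setup; across nodes the same pattern is typically driven by one MPI rank per GPU):

```cuda
#include <cuda_runtime.h>
#include <omp.h>
#include <cstdio>

int main(void)
{
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);

    // One OpenMP host thread per GPU; each thread binds to its own device.
    #pragma omp parallel num_threads(ngpus)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);   // subsequent CUDA calls on this thread
                              // target device 'dev'
        // ... allocate, copy and launch this device's share of the work ...
        printf("host thread %d driving GPU %d\n", dev, dev);
    }
    return 0;
}
```

Compiled with something like `nvcc -Xcompiler -fopenmp`; each thread then manages its own device memory and kernel launches independently.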

Simple Accelerated Workstation
Insert your accelerator into a PCIe slot, making sure that:
there is enough space;
your power supply unit (PSU) is up to the job;
you install the latest drivers.

GPU Workstation Server
Multiple servers can be connected via an interconnect.
Several vendors offer GPU servers, for example 2 multi-core CPUs + 4 GPUs.

Compute Blades
Dedicated HPC blades for scalable HPC, e.g. the Cray XK7: 4 CPUs + 4 GPUs + 2 interconnect chips (shared by 2 compute nodes).

The Iceberg GPU Nodes
C410X with 8 Fermi GPUs; 2× C6100 with dual Intel Westmere 6-core CPUs.

Programming Techniques
GPU-accelerated libraries and applications (MATLAB, Ansys, etc.): the GPU is mostly abstracted from the end user.
GPU-accelerated directives (OpenACC): help the compiler auto-generate code for the GPU.
CUDA for NVIDIA GPUs: an extension to the C language (more to follow).
OpenCL: similar to CUDA but cross-platform; no access to cutting-edge NVIDIA functionality.
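The "extension to C" can be seen in a complete minimal program (a standard vector-add sketch; names are illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// __global__ marks a function that runs on the GPU, one instance per thread.
__global__ void vadd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_a[1024], h_b[1024], h_c[1024];
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // <<<blocks, threads-per-block>>> is CUDA's kernel-launch syntax
    vadd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", h_c[10]);   // expect 10 + 20 = 30
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```

Everything outside the `__global__` function and the `<<<...>>>` launch is ordinary C, which is why CUDA is described as a language extension rather than a new language.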

Summary
Accelerators have higher compute and memory-bandwidth capabilities than CPUs: their silicon is dedicated to many simple cores, and they use graphics memory.
Accelerators are typically not used alone, but work in tandem with CPUs.
The most common are NVIDIA GPUs and Intel Xeon Phis, including the current top 2 systems on the Top500 list; their architectures differ.
GPU-accelerated systems scale from simple workstations to large-scale supercomputers.