NVIDIA Kepler GK110 Architecture (chip)
(As used in coit-grid08.uncc.edu K20 GPU server)
Highlights – To discuss in class
Extracted directly from: “Whitepaper: NVIDIA’s Next Generation CUDA™ Compute Architecture: Kepler™ GK110”, NVIDIA, 2012
http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
ITCS 4/5010 GPU Programming, B. Wilkinson, GK110ArchNotes.ppt Feb 11, 2013

Kepler GK110 Chip
Designed for performance and power efficiency:
- 7.1 billion transistors
- Over 1 TFLOP of double-precision throughput
- 3x the performance per watt of Fermi
New features in Kepler GK110:
- Dynamic Parallelism
- Hyper-Q
- Grid Management Unit (GMU)
- NVIDIA GPUDirect™ RDMA

Kepler GK110 full-chip block diagram

Kepler GK110 supports the new CUDA Compute Capability 3.5.
- GTX 470/480s have GF100s (Fermi, compute capability 2.0)
- C2050s on grid06 and grid07 are compute capability 2.0

New streaming multiprocessor (now called SMX)
- 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST) per SMX
- Full Kepler GK110 has 15 SMXs; some products may have 13 or 14 SMXs
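Since products ship with 13, 14, or 15 SMXs enabled (a K20, for example, exposes 13 of the 15), the actual count can be checked at runtime. A minimal sketch using the standard CUDA runtime API, querying device 0:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query device 0
        printf("%s: compute capability %d.%d, %d multiprocessors\n",
               prop.name, prop.major, prop.minor, prop.multiProcessorCount);
        return 0;
    }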

Quad Warp Scheduler
The SMX schedules threads in groups of 32 parallel threads called warps.
Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps (128 threads) to be issued and executed concurrently.
Kepler GK110 allows double-precision instructions to be paired with other instructions.
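To make the warp granularity concrete, here is a small sketch (kernel name illustrative) that prints which warp and lane each thread lands in; 128 threads per block is exactly the four warps a scheduler group can issue at once:

    #include <cstdio>

    __global__ void whoAmI() {
        int warp = threadIdx.x / warpSize;   // warp index within the block
        int lane = threadIdx.x % warpSize;   // lane index within the warp
        if (lane == 0)                       // one line per warp
            printf("block %d: warp %d starts at thread %d\n",
                   blockIdx.x, warp, threadIdx.x);
    }

    int main() {
        whoAmI<<<2, 128>>>();      // 2 blocks x 128 threads = 4 warps each
        cudaDeviceSynchronize();   // flush device-side printf output
        return 0;
    }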

One Warp Scheduler Unit

Each thread can access up to 255 registers (4x that of Fermi).
New shuffle instruction allows threads within a warp to share data without passing it through shared memory.
Atomic operations: improved by 9x to one operation per clock – fast enough to use frequently within kernel inner loops.
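As an illustration of the shuffle instruction, a warp-wide sum can be built with no shared memory at all. A minimal sketch using the Kepler-era __shfl_down intrinsic (CUDA 9 and later renames this __shfl_down_sync(0xffffffff, val, offset)):

    __inline__ __device__ int warpReduceSum(int val) {
        // Each step adds in the value held by the lane 'offset' positions
        // higher; after log2(32) = 5 steps lane 0 holds the warp's total.
        for (int offset = warpSize / 2; offset > 0; offset /= 2)
            val += __shfl_down(val, offset);
        return val;
    }

Combined with the faster atomics, each warp can then contribute its partial sum with a single atomicAdd instead of 32 of them.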

Texture unit improvements
Not considered in class.
For image processing: speed improvements when programs need to operate on image data.

New: 48 KB read-only data cache
Compiler/programmer can use it to advantage.
Shared memory/L1 cache split: each SMX has 64 KB of on-chip memory that can be configured as:
- 48 KB of shared memory with 16 KB of L1 cache, or
- 16 KB of shared memory with 48 KB of L1 cache, or
- (new) a 32 KB / 32 KB split between shared memory and L1 cache
Faster than L2.
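A sketch of how a kernel can exploit both features (kernel name and sizes illustrative): const __restrict__ plus __ldg() routes loads through the 48 KB read-only cache (compute capability 3.5), and cudaFuncSetCacheConfig selects the shared-memory/L1 split:

    __global__ void scale(const float* __restrict__ in, float* out,
                          float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = a * __ldg(&in[i]);   // load via the read-only cache
    }

    // Host side: request 48 KB shared memory / 16 KB L1 for this kernel
    cudaFuncSetCacheConfig(scale, cudaFuncCachePreferShared);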

Dynamic Parallelism
Fermi could only launch one kernel at a time on a single device; a kernel had to complete before the host could launch another GPU task.
“In Kepler GK110 any kernel can launch another kernel, and can create the necessary streams, events and manage the dependencies needed to process additional work without the need for host CPU interaction.”
“... makes it easier for developers to create and optimize recursive and data-dependent execution patterns, and allows more of a program to be run directly on GPU.”
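A minimal sketch of a device-side launch (kernel names illustrative; requires compute capability 3.5 and compiling with nvcc -arch=sm_35 -rdc=true -lcudadevrt):

    __global__ void childKernel(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    __global__ void parentKernel(float* data, int n) {
        if (threadIdx.x == 0 && blockIdx.x == 0) {
            // Launch a child grid directly from the GPU - no host round trip
            childKernel<<<(n + 255) / 256, 256>>>(data, n);
            cudaDeviceSynchronize();   // wait for the child grid to finish
        }
    }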

“Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image).”
Left (Fermi): control must be transferred back to the CPU before a new kernel can execute.
Right (Kepler): control returns to the CPU only when all GPU operations are completed.
Why is this faster?

“With Dynamic Parallelism, the grid resolution can be determined dynamically at runtime in a data-dependent manner. Starting with a coarse grid, the simulation can ‘zoom in’ on areas of interest while avoiding unnecessary calculation in areas with little change ....”
(Image attribution: Charles Reid)

Hyper-Q
“The Fermi architecture supported 16-way concurrency of kernel launches from separate streams, but ultimately the streams were all multiplexed into the same hardware work queue.”
“Kepler GK110 ... Hyper-Q increases the total number of connections (work queues) ... by allowing 32 simultaneous, hardware-managed connections.”
“... allows connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting GPU utilization, can see up to a 32x performance increase without changing any existing code.”
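A sketch of the usage pattern Hyper-Q accelerates (myKernel, blocks, threads, and d_data are illustrative): independent work placed in separate CUDA streams, which on GK110 map to separate hardware queues instead of falsely serializing:

    cudaStream_t streams[8];
    for (int i = 0; i < 8; ++i)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < 8; ++i)   // independent kernels, one per stream
        myKernel<<<blocks, threads, 0, streams[i]>>>(d_data, i);

    cudaDeviceSynchronize();      // wait for all streams to drain
    for (int i = 0; i < 8; ++i)
        cudaStreamDestroy(streams[i]);

The same code runs on Fermi; Hyper-Q simply removes the single hardware queue behind it.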

Hyper-Q
“Each CUDA stream is managed within its own hardware work queue ...”

“The redesigned Kepler HOST to GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch and hold pending and suspended grids.”

NVIDIA GPUDirect™
“Kepler GK110 supports the RDMA feature in NVIDIA GPUDirect, which is designed to improve performance by allowing direct access to GPU memory by third-party devices such as IB adapters, NICs, and SSDs. When using CUDA 5.0, GPUDirect provides the following important features:
· Direct memory access (DMA) between NIC and GPU without the need for CPU-side data buffering. (Huge improvement for GPU-only servers)
· Significantly improved MPISend/MPIRecv efficiency between GPU and other nodes in a network.
· Eliminates CPU bandwidth and latency bottlenecks.
· Works with variety of 3rd-party network, capture, and storage devices.”
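With GPUDirect RDMA and a CUDA-aware MPI build (e.g., MVAPICH2-GDR, or Open MPI with GPUDirect support), device pointers can be handed straight to MPI calls. A sketch under those assumptions (N and rank illustrative):

    float *d_buf;
    cudaMalloc((void**)&d_buf, N * sizeof(float));
    // The NIC reads/writes GPU memory directly - no CPU staging buffer
    if (rank == 0)
        MPI_Send(d_buf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);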

“GPUDirect RDMA allows direct access to GPU memory from 3rd‐party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.”

Questions