GPU Computing: The Democratization of Parallel Computing
David Luebke, NVIDIA Research

Presentation transcript:

GPU Computing: The Democratization of Parallel Computing
David Luebke, NVIDIA Research

© NVIDIA Corporation 2007
Tutorial Speakers
  David Luebke, NVIDIA Research
  Kevin Skadron, University of Virginia
  Michael Garland, NVIDIA Research
  John Owens, University of California, Davis

© NVIDIA Corporation 2007
Tutorial Schedule
  1:30 – 1:55   Introduction & Motivation                        Luebke
  1:55 – 2:15   Manycore architectural trends                    Skadron
  2:15 – 3:15   CUDA model & programming                         Garland
  3:15 – 3:30   Break
  3:30 – 4:00   GPU architecture & implications                  Luebke
  4:00 – 5:00   Advanced data-parallel programming               Owens
  5:00 – 5:30   Architectural lessons & research opportunities   Skadron

© NVIDIA Corporation 2007
Parallel Computing’s Golden Age
The 1980s and early ’90s were a golden age for parallel computing, particularly data-parallel computing.
Architectures: Connection Machine, MasPar, Cray
  True supercomputers: incredibly exotic, powerful, expensive
Algorithms, languages, & programming models:
  Solved a wide variety of problems
  Various parallel algorithmic models were developed: P-RAM, V-RAM, circuit, hypercube, etc.

Parallel Computing’s Dark Age
But the impact of data-parallel computing was limited:
  Thinking Machines sold only 7 CM-1s (hundreds of systems in total)
  MasPar sold ~200 systems
Commercial and research activity subsided:
  Massively parallel machines were replaced by clusters of ever-more-powerful commodity microprocessors
  Beowulf, Legion, grid computing, …
Massively parallel computing lost momentum to the inexorable advance of commodity technology.

© NVIDIA Corporation 2007
Enter the GPU
GPU = Graphics Processing Unit
  The chip in computer video cards, the PlayStation 3, the Xbox, etc.
  Two major vendors: NVIDIA and ATI (now AMD)

© NVIDIA Corporation 2007
Enter the GPU
GPUs are massively multithreaded manycore chips:
  NVIDIA Tesla products have up to 128 scalar processors
  Over 12,000 concurrent threads in flight
  Over 470 GFLOPS sustained performance
Users across science & engineering disciplines are achieving 100x or better speedups on GPUs.
CS researchers can use GPUs as a research platform for manycore computing: architecture, programming languages, numerics, …

© NVIDIA Corporation 2007
Enter CUDA
CUDA is a scalable parallel programming model and a software environment for parallel computing:
  Minimal extensions to the familiar C/C++ environment
  Heterogeneous serial-parallel programming model
NVIDIA’s TESLA GPU architecture accelerates CUDA:
  Exposes the computational horsepower of NVIDIA GPUs
  Enables general-purpose GPU computing
CUDA also maps well to multicore CPUs!

© NVIDIA Corporation 2007
The Democratization of Parallel Computing
GPU computing with CUDA brings data-parallel computing to the masses:
  Over 46,000,000 CUDA-capable GPUs sold
  A “developer kit” costs ~$200 (for 500 GFLOPS)
Data-parallel supercomputers are everywhere, and CUDA makes this power accessible.
We’re already seeing innovations in data-parallel computing.
Massively parallel computing has become a commodity technology!

GPU Computing: Motivation

[Figure: measured GPU application speedups, ranging from 13x to 457x (examples include 17x, 35x, 45x, and 100x).]

© NVIDIA Corporation 2007
GPUs Are Fast
Theoretical peak performance: 518 GFLOPS
Sustained microbenchmark performance:
  Raw math: 472 GFLOPS (8800 Ultra)
  Raw bandwidth: 80 GB per second (Tesla C870)
Actual application performance:
  Molecular dynamics: 290 GFLOPS (VMD ion placement)

© NVIDIA Corporation 2007 GPUs Are Getting Faster, Faster

© NVIDIA Corporation 2007
Manycore GPU – Block Diagram
G80 (launched Nov 2006 – GeForce 8800 GTX):
  128 thread processors execute kernel threads
  Up to 12,288 parallel threads active
  Per-block shared memory (PBSM) accelerates processing
[Block diagram: Host → Input Assembler → Thread Execution Manager → arrays of Thread Processors, each with per-block shared memory (PBSM), with a load/store path to Global Memory.]

CUDA Programming Model

Heterogeneous Programming
CUDA = a serial program with parallel kernels, all in C:
  Serial C code executes in a CPU thread
  Parallel kernel C code executes in thread blocks across multiple processing elements

  Serial Code
  Parallel Kernel:  KernelA<<< nBlk, nTid >>>(args);
  Serial Code
  Parallel Kernel:  KernelB<<< nBlk, nTid >>>(args);
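To make the pattern concrete, here is a minimal sketch of a heterogeneous CUDA program; the kernel name, array size, and launch dimensions are illustrative assumptions, not taken from the tutorial:

    // Parallel kernel: runs on the GPU, one lightweight thread per element.
    __global__ void KernelA(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = 2.0f * data[i];
    }

    int main(void)
    {
        float *d_data;                        // buffer in device memory
        cudaMalloc((void**)&d_data, 4096 * sizeof(float));

        // ... serial C code executes in a CPU thread ...

        KernelA<<<32, 128>>>(d_data);         // 32 blocks x 128 threads = 4096 threads
        cudaDeviceSynchronize();              // wait for the parallel kernel to finish

        // ... more serial C code ...

        cudaFree(d_data);
        return 0;
    }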

© NVIDIA Corporation 2007
GPU Computing with CUDA: A Highly Multithreaded Coprocessor
The GPU is a highly parallel compute device that:
  serves as a coprocessor for the host CPU
  has its own device memory on the card
  executes many threads in parallel
Parallel kernels run a single program in many threads.
GPU threads are extremely lightweight:
  Thread creation and context switching are essentially free
  The GPU expects thousands of threads for full utilization
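The coprocessor model implies a standard flow: the host allocates memory on the card, copies input over, launches thousands of lightweight threads, and copies results back. A sketch, with assumed names and sizes:

    #include <stdlib.h>

    __global__ void scale(float *x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;                 // one thread per element, bounds-checked
    }

    int main(void)
    {
        const int n = 1 << 20;                // ~1M threads keeps the GPU fully utilized
        size_t bytes = n * sizeof(float);

        float *h_x = (float*)malloc(bytes);   // host memory
        for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

        float *d_x;                           // device memory lives on the card
        cudaMalloc((void**)&d_x, bytes);
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

        scale<<<(n + 255) / 256, 256>>>(d_x, 3.0f, n);

        cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   // also synchronizes

        cudaFree(d_x);
        free(h_x);
        return 0;
    }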

CUDA: Programming GPU in C
Philosophy: provide the minimal set of extensions necessary to expose the hardware’s power.

Declaration specifiers indicate where things live:
  __global__ void KernelFunc(...);  // kernel function, runs on device
  __device__ int GlobalVar;         // variable in device memory
  __shared__ int SharedVar;         // variable in per-block shared memory

Extended function-invocation syntax for parallel kernel launch:
  KernelFunc<<<500, 128>>>(...);    // launch 500 blocks w/ 128 threads each

Special variables for thread identification in kernels:
  dim3 threadIdx; dim3 blockIdx; dim3 blockDim; dim3 gridDim;

Intrinsics that expose specific operations in kernel code:
  __syncthreads();                  // barrier synchronization within kernel
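These extensions compose naturally. As an illustrative sketch (the kernel and buffer names are assumptions), a per-block sum that combines __shared__ memory, the built-in index variables, and __syncthreads():

    // Each block reduces 128 input elements to one partial sum.
    // Must be launched with exactly 128 threads per block.
    __global__ void blockSum(const float *in, float *out)
    {
        __shared__ float partial[128];        // per-block shared memory

        int tid = threadIdx.x;
        partial[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                      // all loads complete before reducing

        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                partial[tid] += partial[tid + stride];
            __syncthreads();                  // barrier between reduction steps
        }

        if (tid == 0)
            out[blockIdx.x] = partial[0];     // one result per block
    }

    // Launched with the syntax shown above, e.g.:
    // blockSum<<<500, 128>>>(d_in, d_out);   // 500 blocks of 128 threads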

© NVIDIA Corporation 2007
Decoder Ring
  Tesla™ – High Performance Computing
  Quadro® – Design & Creation
  GeForce® – Entertainment
Architecture: TESLA
Chips: G80, G84, G92, …

© NVIDIA Corporation 2007
A New Platform: Tesla
HPC-oriented product line:
  C870: board (1 GPU)
  D870: deskside unit (2 GPUs)
  S870: 1U server unit (4 GPUs)

© NVIDIA Corporation 2007
Conclusion
GPUs are massively parallel manycore computers:
  Ubiquitous – the most successful parallel processor in history
  Useful – users achieve huge speedups on real problems
CUDA is a powerful parallel programming model:
  Heterogeneous – mixed serial-parallel programming
  Scalable – hierarchical thread execution model
  Accessible – minimal but expressive changes to C
Together, GPUs and CUDA provide tremendous scope for innovative, impactful research.

Questions? David Luebke