Intro to GPUs for Parallel Computing

Goals for Rest of Course

Learn how to program massively parallel processors and achieve:
– high performance
– functionality and maintainability
– scalability across future generations

Acquire the technical knowledge required to achieve the above goals:
– principles and patterns of parallel programming
– processor architecture features and constraints
– programming APIs, tools, and techniques

We overview the architecture first, then introduce further architectural detail as we go.

Equipment

Your own machine, if CUDA-enabled; we will use the CUDA SDK in C:
– CUDA = Compute Unified Device Architecture
– requires an NVIDIA G80 or newer
– the G80 emulator won't quite work

Lab machine ("tesla"):
– Ubuntu
– quad-core Xeon, 2 GHz
– 16 GB memory
– two Tesla C1060 "Computing Processor Board" cards: 240 cores, 1.3 GHz clock, 4 GB memory each
– MD5 test, average Mhash/s (for comparison, 2x 3.2 GHz Xeon: 42 Mhash/s)
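To check whether your machine is CUDA-enabled, a short device-query program can be compiled with nvcc and run. This is a minimal sketch in the spirit of the SDK's deviceQuery sample; the output formatting is illustrative:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA-capable device found.\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        /* A Tesla C1060 should report 30 multiprocessors and roughly 4 GB */
        printf("Device %d: %s, compute capability %d.%d, %d SMs, %zu MB\n",
               i, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount, prop.totalGlobalMem >> 20);
    }
    return 0;
}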

Equipment

We may use our Beowulf cluster for MPI, beancounter.math.uaa.alaska.edu:
– 13 custom-built boxes, each containing a dual-processor 1 GHz Pentium III with 768 MB of shared memory
– 27 nodes in total, including the master
– NetBSD 2.0F
– connected through a 100 Mbps switch
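For the MPI side, the canonical hello-world in C looks like the sketch below; compile with mpicc and launch with mpirun (exact commands depend on the cluster's MPI installation):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_processor_name(name, &len);     /* which node we landed on */
    printf("Rank %d of %d running on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}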

Why Massively Parallel Processors?

A quiet revolution and potential build-up:
– calculation (2006): 367 GFLOPS (G80) vs. 32 GFLOPS (high-end CPU)
– memory bandwidth: 86.4 GB/s (G80) vs. 8.4 GB/s (CPU)
– until recently, programmed only through a graphics API
– a GPU in every PC and workstation: massive volume and potential impact

[Figure: side-by-side block diagrams of a CPU and a GPU, each backed by DRAM. The CPU devotes most of its die to control logic and cache with a few ALUs; the GPU devotes most of its die to many ALUs with small control and cache.]

CPUs and GPUs have fundamentally different design philosophies.

Architecture of a CUDA-capable GPU

[Figure: the building block is the Streaming Multiprocessor (SM), each containing several Streaming Processors (SPs).]

The C1060 has 30 SMs, each with 8 SPs (240 cores in total).
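To connect this hardware picture to the programming model: a CUDA launch specifies a grid of thread blocks; the hardware distributes blocks across SMs, and each block's threads execute on that SM's SPs. A minimal illustrative kernel (names and sizes are hypothetical, not from the slides):

__global__ void scale(float *data, float s, int n) {
    /* one thread per element; blocks are scheduled onto SMs */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= s;
}

/* launch enough 256-thread blocks to cover n elements:   */
/* scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);      */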

GT200 Characteristics

1 TFLOPS peak performance (25-50x current high-end microprocessors)
265 GFLOPS sustained for applications such as Visual Molecular Dynamics (VMD)
Massively parallel: 240 cores
Massively threaded: sustains thousands of threads per application
Substantial speedups over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics

"I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny so I have to be careful what I say publicly until I triple check those numbers." -John Stone, VMD group, Physics UIUC

Future Apps Reflect a Concurrent World

Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications":
– molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
– these "super-apps" represent and model a physical, concurrent world

Various granularities of parallelism exist, but:
– the programming model must not hinder parallel implementation
– data delivery needs careful management

Stretching Traditional Architectures

Traditional parallel architectures cover some super-applications:
– DSP, GPU, network apps, scientific computing

The game is to grow mainstream architectures "out" or domain-specific architectures "in":
– CUDA is the latter

Sample of Previous GPU Projects

Application | Description | Source lines | Kernel lines | % time in kernel
H.264   | SPEC '06 version, change in guess vector | 34,… | … | …
LBM     | SPEC '06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72  | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM     | Finite element modeling, simulation of 3D graded materials | 1,… | … | …
RPES    | Rye Polynomial Equation Solver, quantum chemistry, 2-electron repulsion | 1,… | … | …
PNS     | Petri Net simulation of a distributed system | … | … | >99%
SAXPY   | Single-precision implementation of saxpy, used in Linpack's Gaussian elimination routine | 952 | 31 | >99%
TRACF   | Two Point Angular Correlation Function | … | … | …
FDTD    | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,… | … | …
MRI-Q   | Computing a matrix Q, a scanner's configuration in MRI reconstruction | 490 | 33 | >99%
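As one concrete example from the table, SAXPY computes y = a*x + y over single-precision vectors. A representative CUDA kernel (a sketch for illustration, not the benchmarked source) is:

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   /* one output element per thread */
}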

Speedup of Applications

GeForce 8800 GTX vs. 2.2 GHz Opteron:
– a large speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
– 25x to 400x speedup if the function's data requirements and control flow suit the GPU and the application is optimized
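Reaching those speedups requires giving the GPU enough work and amortizing host-device transfers. A minimal host-side sketch driving the saxpy kernel above (assumed in scope; error checks omitted for brevity):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    int n = 1 << 20;                       /* ~1M elements */
    size_t bytes = n * sizeof(float);
    float *h_x = (float *)malloc(bytes);
    float *h_y = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    /* thousands of threads: enough 256-thread blocks to cover n elements */
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %f (expect 4.0)\n", h_y[0]);   /* 2*1 + 2 */
    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}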