
1 Ceng 545 GPU Computing

2 Grading

- Midterm Exam: 20%
- Homeworks: 40%
  - Demo/knowledge: 25%
  - Functionality: 40%
  - Report: 35%
- Project: 40%
  - Design Document: 25%
  - Project Presentation: 25%
  - Demo/Final Report: 50%

3 Textbooks / References

- D. Kirk and W. Hwu, "Programming Massively Parallel Processors", Morgan Kaufmann, 2010, ISBN 978-0-12-381472-2
- J. Sanders and E. Kandrot, "CUDA by Example: An Introduction to General-Purpose GPU Programming", Pearson, 2010, ISBN 978-0-13-138768-3
- Draft textbook by Prof. Hwu and Prof. Kirk, available at the course website
- NVIDIA, NVIDIA CUDA Programming Guide, NVIDIA, 2007 (reference book)
- Videos (Stanford University): http://itunes.apple.com/us/itunes-u/programming-massively-parallel/id384233322
- Lecture notes (University of Illinois): http://courses.engr.illinois.edu/ece498/al/

Lecture notes will be posted at the class web site.

4 GPU vs CPU

- A GPU is tailored for highly parallel operation, while a CPU executes programs serially.
- For this reason, GPUs have many parallel execution units and higher transistor counts (the GTX 480 has about 3,200 million transistors), while CPUs have few execution units and higher clock speeds.
- GPUs have much deeper pipelines than the 10-20 stages typical of CPUs.
- GPUs have significantly faster and more advanced memory interfaces, as they need to move much more data than CPUs.

5 Many-core GPUs vs Multi-core CPUs

- Design philosophies: the design of a CPU is optimized for sequential code performance; large cache memories are provided to reduce instruction and data access latencies.
- Memory bandwidth:
  - CPU: it has to satisfy requirements from the OS, applications, and I/O devices.
  - GPU: small cache memories are provided to help control the bandwidth requirements, so multiple threads that access the same memory data do not all need to go to DRAM.
- Marketplace

6 CPU vs. GPU - Hardware

More transistors are devoted to data processing.

Supercomputing 2008 Education Program

7 Processing Element

Processing element = thread processor = ALU

8 Memory Architecture

- Constant Memory
- Texture Memory
- Device Memory

Supercomputing 2008 Education Program
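To make these spaces concrete, here is a minimal CUDA sketch (the kernel name scale and the 16-entry coeffs array are invented for this illustration; texture memory is omitted). Read-only coefficients go in constant memory, which is cached on-chip, while the bulk data lives in device (global) memory:

#include <cuda_runtime.h>

__constant__ float coeffs[16];            // constant memory: cached, read-only in kernels

// Each thread scales one element; all threads read coeffs through the constant cache.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= coeffs[i % 16];
}

int main(void) {
    const int n = 1024;
    float h_coeffs[16], h_data[n];
    for (int i = 0; i < 16; ++i) h_coeffs[i] = 1.0f + i;
    for (int i = 0; i < n; ++i)  h_data[i] = 1.0f;

    float *d_data;                         // device (global) memory
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d_data, n);

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return 0;
}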

9 Why a Massively Parallel Processor?

A quiet revolution and potential build-up:
- Calculation: 367 GFLOPS vs. 32 GFLOPS
- Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
- Until last year, programmed through a graphics API
- GPU in every PC and workstation: massive volume and potential impact

[Chart: GFLOPS over time. Legend: G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

10 GeForce 8800

16 highly threaded SMs (each with 8 SPs), 128 FPUs, 367 GFLOPS, 768 MB DRAM, 86.4 GB/s memory bandwidth, 4 GB/s bandwidth to the CPU.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

11 We have

GTX 480: 15 highly threaded SMs (each with 32 SPs), 480 CUDA cores, 1344 GFLOPS, 1536 MB DRAM, 177.4 GB/s memory bandwidth.

Tesla C2070: compared to the latest quad-core CPUs, the Tesla C2050 and C2070 computing processors deliver equivalent supercomputing performance at 1/10th the cost and 1/20th the power consumption.
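Figures such as the SM count and memory size can be checked at runtime with the standard cudaGetDeviceProperties call; a minimal sketch (the selection of printed fields is illustrative):

#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    // query device 0

    printf("Device:           %s\n", prop.name);
    printf("Multiprocessors:  %d\n", prop.multiProcessorCount);
    printf("Clock rate:       %d kHz\n", prop.clockRate);
    printf("Global memory:    %lu MB\n", (unsigned long)(prop.totalGlobalMem >> 20));
    printf("Shared mem/block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
    return 0;
}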

12 Terms

- GPGPU: General-Purpose computing on a Graphics Processing Unit; using graphics hardware for non-graphics computations.
- CUDA: Compute Unified Device Architecture; a software architecture for managing data-parallel programming.
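To illustrate the data-parallel model CUDA manages, a minimal vector-addition sketch (the kernel name vecAdd and the sizes are invented for this example): roughly a million lightweight threads are launched, and each computes one output element.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread computes one element: the data-parallel CUDA model in miniature.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);  // launch ~1M threads

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %f\n", h_c[100]);  // expect 300.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}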

13 Parallel Programming

- MPI: computing nodes do not share memory; all data sharing and interaction must be done through explicit message passing. CUDA, in contrast, provides shared memory.
- OpenMP: it has not been able to scale beyond a couple hundred computing nodes, due to thread-management overheads and cache-coherence hardware requirements. CUDA achieves simple thread management with no cache-coherence hardware requirement.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
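A sketch of the block-level shared memory mentioned above (the kernel name blockSum and the 256-thread block size are assumptions of this example): threads in a block cooperate through fast on-chip __shared__ memory and a __syncthreads() barrier, with no message passing and no cache-coherence traffic between blocks.

#include <cuda_runtime.h>

#define BLOCK 256  // the kernel assumes blockDim.x == BLOCK

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK];          // on-chip memory shared by the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                       // barrier: the tile is now fully loaded

    // tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];         // one partial sum per block
}

int main(void) {
    const int n = 1 << 20, blocks = (n + BLOCK - 1) / BLOCK;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));   // zero-filled placeholder input
    blockSum<<<blocks, BLOCK>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}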

14 Future Apps Reflect a Concurrent World

- Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications": molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products.
- These "super-apps" represent and model a physical, concurrent world.
- Various granularities of parallelism exist, but:
  - the programming model must not hinder parallel implementation
  - data delivery needs careful management

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

15 Stretching Traditional Architectures

- Traditional parallel architectures cover some super-applications: DSP, GPU, network apps, scientific computing.
- The game is to grow mainstream architectures "out" or domain-specific architectures "in". CUDA is the latter.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

16 Previous Projects

Each entry lists the application, its description, and (Source / Kernel / % time):
- H.264: SPEC '06 version, change in guess vector (34,811 / 194 / 35%)
- LBM: SPEC '06 version, change to single precision and print fewer reports (1,481 / 285 / >99%)
- RC5-72: Distributed.net RC5-72 challenge client code (1,979 / 218 / >99%)
- FEM: finite element modeling, simulation of 3D graded materials (1,874 / 146 / 99%)
- RPES: Rye Polynomial Equation Solver, quantum chemistry, 2-electron repulsion (1,104 / 281 / 99%)
- PNS: Petri Net simulation of a distributed system (322 / 160 / >99%)
- SAXPY: single-precision implementation of saxpy, used in Linpack's Gaussian elimination routine (952 / 31 / >99%)
- TRACF: Two Point Angular Correlation Function (536 / 98 / 96%)
- FDTD: Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation (1,365 / 93 / 16%)
- MRI-Q: computing a matrix Q, a scanner's configuration in MRI reconstruction (490 / 33 / >99%)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
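SAXPY from the list above is small enough to show in full. A minimal sketch of the standard BLAS operation y = a*x + y (this is the textbook form, not the project's actual code), with one element per thread:

#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];           // single-precision a*x plus y
}

int main(void) {
    const int n = 1 << 20;
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float)); // placeholder input; real code would
    cudaMemset(d_y, 0, n * sizeof(float)); // copy data from the host instead

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
    cudaDeviceSynchronize();

    cudaFree(d_x); cudaFree(d_y);
    return 0;
}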

17 Speedup of Applications

GeForce 8800 GTX vs. 2.2 GHz Opteron 248:
- 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads.
- 25× to 400× speedup if the function's data requirements and control flow suit the GPU and the application is optimized.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

