Published by Luke Carpenter. Modified over 9 years ago.
1
NVIDIA Fermi Architecture
Patrick Cozzi
University of Pennsylvania
CIS 565 - Spring 2011
2
Administrivia
Assignment 4 grades returned
Project checkpoint on Monday
  Post an update on your blog beforehand
Poster session: 04/28
  Three weeks from tomorrow
3
G80, GT200, and Fermi
November 2006: G80
June 2008: GT200
March 2010: Fermi (GF100)
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
4
New GPU Generation
What are the technical goals for a new GPU generation?
5
New GPU Generation
What are the technical goals for a new GPU generation?
  Improve existing application performance. How?
6
New GPU Generation
What are the technical goals for a new GPU generation?
  Improve existing application performance. How?
  Advance programmability. In what ways?
7
Fermi: What’s More?
More total cores (SPs), though not more SMs
More registers: 32K per SM
More shared memory: up to 48K per SM
More Special Function Units (SFUs)
8
Fermi: What’s Faster?
Faster double precision – 8x over GT200
Faster atomic operations. What for?
  5-20x
Faster context switches
  Between applications – 10x
  Between graphics and compute, e.g., OpenGL and CUDA
9
Fermi: What’s New?
L1 and L2 caches. For compute or graphics?
Dual warp scheduling
Concurrent kernel execution
C++ support
Full IEEE 754-2008 support in hardware
Unified address space
Error Correcting Code (ECC) memory support
Fixed function tessellation for graphics
10
G80, GT200, and Fermi Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
11
G80, GT200, and Fermi Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
12
GT200 and Fermi Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
13
Fermi Block Diagram
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
GF100
16 SMs
  Each with 32 cores
  512 total cores
Each SM hosts up to 48 warps, or 1,536 threads
In flight, up to 24,576 threads
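The totals on this slide follow from multiplication. A minimal Python sketch, assuming a warp size of 32 threads (implied by 48 warps = 1,536 threads); the constant names are illustrative and not from the slides:

```python
# GF100 figures from the slide; WARP_SIZE is inferred from 1,536 / 48.
WARP_SIZE = 32
NUM_SMS = 16
CORES_PER_SM = 32
MAX_WARPS_PER_SM = 48

total_cores = NUM_SMS * CORES_PER_SM            # cores on the full chip
threads_per_sm = MAX_WARPS_PER_SM * WARP_SIZE   # resident threads per SM
threads_in_flight = NUM_SMS * threads_per_sm    # resident threads chip-wide

print(total_cores, threads_per_sm, threads_in_flight)  # 512 1536 24576
```

Note that resident threads (1,536 per SM) far outnumber cores (32 per SM); the extra warps are what the SM switches between to hide latency.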
14
Fermi SM
Why 32 cores per SM instead of 8? Why not more SMs?
G80 – 8 cores
GT200 – 8 cores
GF100 – 32 cores
15
Fermi SM
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Dual warp scheduling. Why?
32K registers
32 cores
  Floating point and integer unit per core
16 load/store units
4 SFUs
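A rough sketch of what these per-SM resources imply, in Python. It assumes a 32-thread warp, 1,536 resident threads per SM, and that each of the two schedulers issues a warp instruction to a group of 16 cores, the 16 load/store units, or the 4 SFUs, one thread per lane per cycle. The function and dictionary names are hypothetical, and the model ignores real pipeline details.

```python
WARP_SIZE = 32
REGISTERS_PER_SM = 32 * 1024      # "32K registers" per SM
MAX_THREADS_PER_SM = 1536         # 48 warps * 32 threads (assumed from the block diagram)

# Registers left for each thread if the SM is fully populated.
regs_per_thread_full = REGISTERS_PER_SM // MAX_THREADS_PER_SM


def cycles_per_warp_instruction(lanes: int) -> int:
    """Cycles to push one 32-thread warp through `lanes` lanes (ceiling division)."""
    return -(-WARP_SIZE // lanes)


# Assumed unit groups seen by one scheduler (simplified model).
unit_groups = {"cores": 16, "load_store": 16, "sfu": 4}
issue_cycles = {name: cycles_per_warp_instruction(n) for name, n in unit_groups.items()}

print(regs_per_thread_full)  # 21
print(issue_cycles)          # {'cores': 2, 'load_store': 2, 'sfu': 8}
```

Under this model, a kernel that needs more than 21 registers per thread cannot keep all 48 warps resident, and SFU-heavy code occupies its unit group four times longer per warp instruction than ordinary arithmetic.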
16
Fermi SM
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
16 SMs * 32 cores/SM = 512 floating point operations per cycle
Why not in practice?
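To turn the per-cycle figure into a rate, multiply by a clock frequency. The sketch below is illustrative only: the 1.0 GHz clock and the 60% utilization are hypothetical placeholders rather than measured or published values, and the function names are not from any library.

```python
def peak_gflops(num_sms: int, cores_per_sm: int, clock_ghz: float,
                flops_per_core_per_cycle: int = 1) -> float:
    """Theoretical peak: every core retires an operation every cycle."""
    return num_sms * cores_per_sm * flops_per_core_per_cycle * clock_ghz


def sustained_gflops(peak: float, utilization: float) -> float:
    """Scale the peak by the fraction of issue slots doing useful work (0..1)."""
    return peak * utilization


# Hypothetical 1.0 GHz clock, one floating point operation per core per cycle.
peak = peak_gflops(16, 32, 1.0)
print(peak)                          # 512.0
# Counting a fused multiply-add as two operations doubles the headline number.
print(peak_gflops(16, 32, 1.0, 2))   # 1024.0
# Hypothetical 60% utilization (stalls, divergence, memory latency).
print(sustained_gflops(peak, 0.6))
```

The utilization factor is one way to frame the slide's question: memory stalls, branch divergence, and instructions that are not floating point arithmetic all keep cores from retiring a floating point operation every cycle.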
17
Fermi SM
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Each SM
  64KB on-chip memory
    48KB shared memory / 16KB L1 cache, or
    16KB shared memory / 48KB L1 cache
  Configurable by CUDA developer
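A small Python model of the trade-off, with the 64KB split into either 48KB shared / 16KB L1 or 16KB shared / 48KB L1. The split names and the helper are hypothetical, as is the choice of 8 resident blocks; in CUDA itself the preference is requested per kernel through the runtime call `cudaFuncSetCacheConfig`.

```python
KB = 1024

# Per-SM splits of the 64KB on-chip memory: (shared memory bytes, L1 cache bytes).
SPLITS = {
    "prefer_shared": (48 * KB, 16 * KB),
    "prefer_l1": (16 * KB, 48 * KB),
}


def shared_bytes_per_block(split: str, resident_blocks: int) -> int:
    """Shared memory each block can use if `resident_blocks` blocks share one SM."""
    shared, _l1 = SPLITS[split]
    return shared // resident_blocks


# Hypothetical example: 8 blocks resident on one SM.
print(shared_bytes_per_block("prefer_shared", 8))  # 6144
print(shared_bytes_per_block("prefer_l1", 8))      # 2048
```

Kernels that stage data in shared memory by hand tend to favor the first split; kernels with irregular global memory access and little explicit staging may benefit more from the larger L1.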
18
Fermi Dual Warp Scheduling
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
19
Slide from: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_luebke_Intro.pdf
20
Fermi Caches Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
21
Fermi Caches Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
22
Fermi: Unified Address Space
Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
23
64-bit virtual addresses
40-bit physical addresses (currently)
CUDA 4: Shared address space with CPU. Why?
24
Fermi: Unified Address Space
64-bit virtual addresses
40-bit physical addresses (currently)
CUDA 4: Shared address space with CPU. Why?
  No explicit CPU/GPU copies
  Direct GPU-GPU copies
  Direct I/O device to GPU copies
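For a sense of scale, the address widths translate into sizes as follows. This is plain arithmetic on the two bit widths from the slide; the variable names are illustrative.

```python
VIRTUAL_BITS = 64
PHYSICAL_BITS = 40

TIB = 2 ** 40  # bytes in one tebibyte

physical_bytes = 2 ** PHYSICAL_BITS          # addressable physical memory
physical_tib = physical_bytes // TIB         # the same, in TiB
virtual_to_physical_ratio = 2 ** (VIRTUAL_BITS - PHYSICAL_BITS)

print(physical_tib)               # 1
print(virtual_to_physical_ratio)  # 16777216
```

The virtual space is 2^24 times larger than what can be physically addressed, which leaves ample room to map CPU memory and several GPUs into one address range.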
25
Fermi ECC
ECC protected
  Register file, L1, L2, DRAM
Uses redundancy to protect data integrity against cosmic rays flipping bits
  For example, 64 bits are stored as 72 bits
Fixes single-bit errors, detects multiple-bit errors
What are the applications?
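The 64-to-72-bit figure matches the overhead of a standard Hamming code extended with one overall parity bit (SECDED: single error correct, double error detect). The slides do not name the code Fermi uses, so treating it as Hamming SECDED is an assumption; the function name is hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming code plus one overall parity bit.

    A Hamming code needs the smallest r with 2**r >= data_bits + r + 1
    so that every single-bit error position maps to a distinct syndrome.
    One extra parity bit lets double-bit errors be detected as well.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # +1 for the overall parity bit


check = secded_check_bits(64)
print(check, 64 + check)  # 8 72
```

Under the same assumption, a 32-bit word would need 7 check bits, so protecting wider words is proportionally cheaper (12.5% overhead at 64 bits versus about 22% at 32 bits).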
26
Fermi Tessellation Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
27
Fermi Tessellation Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
28
Fermi Tessellation
Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf
Fixed function hardware on each SM for graphics
  Texture filtering
  Texture cache
  Tessellation
  Vertex Fetch / Attribute Setup
  Stream Output
  Viewport Transform. Why?
29
Observations
Becoming easier to port CPU code to the GPU
  Recursion, fast atomics, L1/L2 caches, faster global memory
In fact…
30
Observations
Becoming easier to port CPU code to the GPU
  Recursion, fast atomics, L1/L2 caches, faster global memory
In fact…
GPUs are starting to look like CPUs
  Beefier SMs, L1 and L2 caches, dual warp scheduling, double precision, fast atomics