NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011.

1 NVIDIA Fermi Architecture Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011

2 Administrivia Assignment 4 grades returned. Project checkpoint on Monday: post an update on your blog beforehand. Poster session: 04/28, three weeks from tomorrow.

3 G80, GT200, and Fermi November 2006: G80. June 2008: GT200. March 2010: Fermi (GF100). Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

4 New GPU Generation What are the technical goals for a new GPU generation?

5 New GPU Generation What are the technical goals for a new GPU generation? Improve existing application performance. How?

6 New GPU Generation What are the technical goals for a new GPU generation? Improve existing application performance. How? Advance programmability. In what ways?

7 Fermi: What’s More? More total cores (SPs), though not more SMs. More registers: 32K per SM. More shared memory: up to 48 KB per SM. More Special Function Units (SFUs).

8 Fermi: What’s Faster? Faster double precision: 8x over GT200. Faster atomic operations: 5-20x. What for? Faster context switches: between applications (10x), and between graphics and compute, e.g., OpenGL and CUDA.

9 Fermi: What’s New? L1 and L2 caches (for compute or graphics?). Dual warp scheduling. Concurrent kernel execution. C++ support. Full IEEE 754-2008 support in hardware. Unified address space. Error Correcting Code (ECC) memory support. Fixed-function tessellation for graphics.

10 G80, GT200, and Fermi Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

11 G80, GT200, and Fermi Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

12 GT200 and Fermi Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

13 Fermi Block Diagram Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf GF100: 16 SMs, each with 32 cores, for 512 total cores. Each SM hosts up to 48 warps, or 1,536 threads. In flight across the chip: up to 24,576 threads.
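
The totals on this slide follow from multiplication. A minimal sketch of the arithmetic, assuming the usual warp size of 32 threads (a figure not stated on the slide):

```python
# Figures quoted on the slide for GF100.
SMS = 16
CORES_PER_SM = 32
MAX_WARPS_PER_SM = 48

# Assumption: 32 threads per warp, the standard NVIDIA warp size.
WARP_SIZE = 32

total_cores = SMS * CORES_PER_SM
threads_per_sm = MAX_WARPS_PER_SM * WARP_SIZE
threads_in_flight = SMS * threads_per_sm

print(total_cores)        # 512
print(threads_per_sm)     # 1536
print(threads_in_flight)  # 24576
```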

14 Fermi SM Why 32 cores per SM instead of 8? Why not more SMs? G80: 8 cores. GT200: 8 cores. GF100: 32 cores.

15 Fermi SM Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Dual warp scheduling. Why? 32K registers. 32 cores, each with a floating point and an integer unit. 16 load/store units. 4 SFUs.
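
The unit counts hint at how long one warp instruction occupies each kind of unit. The sketch below is a rough model only. It assumes a 32-thread warp, an even split of the 32 cores between the two warp schedulers, and one thread per unit per clock. The register estimate reuses the 1,536 threads per SM quoted on the block-diagram slide.

```python
WARP_SIZE = 32  # assumption: threads per warp


def cycles_per_warp_instruction(units_available: int) -> int:
    """Clocks to push all threads of one warp through `units_available` lanes."""
    return -(-WARP_SIZE // units_available)  # ceiling division


# Assumption: each of the two schedulers issues to half of the 32 cores.
cores_per_scheduler = 32 // 2

print(cycles_per_warp_instruction(cores_per_scheduler))  # 2 (arithmetic)
print(cycles_per_warp_instruction(16))                   # 2 (load/store)
print(cycles_per_warp_instruction(4))                    # 8 (SFU)

# Register budget per thread when an SM is fully occupied.
REGISTERS_PER_SM = 32 * 1024
MAX_THREADS_PER_SM = 1536
print(REGISTERS_PER_SM // MAX_THREADS_PER_SM)            # 21
```

Under this model an SFU instruction ties up its units four times longer than an ordinary arithmetic instruction, and a kernel needing more than about 21 registers per thread cannot keep all 1,536 threads resident.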

16 Fermi SM Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf 16 SMs * 32 cores/SM = 512 floating point operations per cycle. Why not in practice?
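
One way to reason about the gap between peak and practice is to treat the 512 figure as an upper bound and scale it by how many lanes do useful work. This is a hedged sketch: `peak_gflops`, `lane_utilization`, and the 1.0 GHz clock are illustrative names and values, not figures from the slides.

```python
def peak_gflops(sms, cores_per_sm, clock_ghz,
                flops_per_core_per_cycle=1, lane_utilization=1.0):
    """Upper bound on GFLOPS: lanes * flops per lane per cycle * GHz * utilization."""
    return (sms * cores_per_sm * flops_per_core_per_cycle
            * clock_ghz * lane_utilization)


CLOCK_GHZ = 1.0  # hypothetical clock, chosen only to keep the arithmetic readable

# Every core retires one floating point operation every cycle.
print(peak_gflops(16, 32, CLOCK_GHZ))                         # 512.0

# Hypothetical divergent code: only 8 of every 32 threads in a warp are active.
print(peak_gflops(16, 32, CLOCK_GHZ, lane_utilization=0.25))  # 128.0
```

Branch divergence is only one possible cause. Memory stalls, too few resident warps, and instructions that run on the load/store units or SFUs rather than the cores also keep the cores idle.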

17 Fermi SM Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Each SM has 64 KB of on-chip memory, split as either 48 KB shared memory / 16 KB L1 cache, or 16 KB shared memory / 48 KB L1 cache. Configurable by the CUDA developer.
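
In CUDA the split is requested per kernel with cudaFuncSetCacheConfig, passing cudaFuncCachePreferShared or cudaFuncCachePreferL1. The Python below does not call CUDA. It is a sketch of one consequence of the choice: how many thread blocks fit on an SM when shared memory is the only limit. The names `SPLITS` and `blocks_limited_by_shared` and the 12 KB block are illustrative assumptions.

```python
KB = 1024

# (shared memory bytes, L1 cache bytes) for the two splits of the 64 KB per SM.
SPLITS = {
    "prefer_shared": (48 * KB, 16 * KB),
    "prefer_l1": (16 * KB, 48 * KB),
}


def blocks_limited_by_shared(shared_bytes_per_block, split):
    """Resident blocks per SM if shared memory were the only constraint."""
    shared_bytes, _l1_bytes = SPLITS[split]
    if shared_bytes_per_block <= 0:
        raise ValueError("block must use a positive amount of shared memory")
    return shared_bytes // shared_bytes_per_block


# Hypothetical kernel whose blocks each use 12 KB of shared memory.
for split in SPLITS:
    print(split, blocks_limited_by_shared(12 * KB, split))
# prefer_shared 4
# prefer_l1 1
```

Real occupancy also depends on registers, threads per block, and the hardware limit on resident blocks, none of which this sketch models.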

18 Fermi Dual Warp Scheduling Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

19 Slide from: http://gpgpu.org/wp/wp-content/uploads/2009/11/SC09_CUDA_luebke_Intro.pdf

20 Fermi Caches Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

21 Fermi Caches Slide from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

22 Fermi: Unified Address Space Image from: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

23 Fermi: Unified Address Space 64-bit virtual addresses. 40-bit physical addresses (currently). CUDA 4: Shared address space with CPU. Why?

24 Fermi: Unified Address Space 64-bit virtual addresses. 40-bit physical addresses (currently). CUDA 4: Shared address space with CPU. Why? No explicit CPU/GPU copies. Direct GPU-GPU copies. Direct I/O device to GPU copies.
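
To put the address widths in perspective, the sketch below converts them to byte counts, assuming byte-addressable memory. The helper name `addressable_bytes` is illustrative.

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses expressible in `address_bits` bits."""
    return 1 << address_bits


TIB = 2 ** 40  # tebibyte
EIB = 2 ** 60  # exbibyte

print(addressable_bytes(40) // TIB)  # 1  -> 40-bit physical: 1 TiB
print(addressable_bytes(64) // EIB)  # 16 -> 64-bit virtual: 16 EiB
```

The virtual space is large enough that CPU memory and the memory of several GPUs can each occupy their own non-overlapping range, which is what allows a pointer value alone to identify where the data lives.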

25 Fermi ECC ECC protected: register file, L1, L2, DRAM. Uses redundancy to ensure data integrity against cosmic rays flipping bits; for example, 64 bits are stored as 72 bits. Fixes single-bit errors, detects double-bit errors. What are the applications?
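
The 64-to-72-bit ratio matches a textbook extended Hamming code: 7 check bits locate a single flipped bit among 71 positions, and 1 overall parity bit separates single errors from double errors. The sketch below implements that textbook scheme as an illustration. It is an assumption that this resembles what the hardware does, and the bit layout used here is arbitrary.

```python
CODE_LEN = 72                             # index 0: overall parity; 1..71: Hamming code
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
DATA_POSITIONS = [p for p in range(1, CODE_LEN) if p not in PARITY_POSITIONS]
assert len(DATA_POSITIONS) == 64


def syndrome(bits):
    """XOR of the positions of all set bits; 0 for a valid Hamming codeword."""
    s = 0
    for pos in range(1, CODE_LEN):
        if bits[pos]:
            s ^= pos
    return s


def encode(word):
    """Spread a 64-bit integer over 72 stored bits."""
    bits = [0] * CODE_LEN
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (word >> i) & 1
    s = syndrome(bits)
    for p in PARITY_POSITIONS:            # set check bits so the syndrome becomes 0
        if s & p:
            bits[p] = 1
    bits[0] = sum(bits) % 2               # make the total number of set bits even
    return bits


def decode(bits):
    """Return (word, status); word is None when the error cannot be corrected."""
    bits = list(bits)
    s = syndrome(bits)
    odd = sum(bits) % 2 == 1
    if s == 0 and not odd:
        status = "ok"
    elif odd:
        if s >= CODE_LEN:                 # more than two flips, outside the code
            return None, "uncorrectable error detected"
        bits[s] ^= 1                      # s == 0 means the overall parity bit flipped
        status = "corrected"
    else:
        return None, "double-bit error detected"
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        word |= bits[pos] << i
    return word, status


word = 0x0123456789ABCDEF
stored = encode(word)
print(len(stored))                                   # 72

one_flip = list(stored)
one_flip[37] ^= 1
print(decode(one_flip) == (word, "corrected"))       # True

two_flips = list(stored)
two_flips[5] ^= 1
two_flips[60] ^= 1
print(decode(two_flips))                             # (None, 'double-bit error detected')
```

The storage overhead is 8 extra bits per 64, or 12.5%.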

26 Fermi Tessellation Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

27 Fermi Tessellation Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf

28 Fermi Tessellation Image from: http://stanford-cs193g-sp2010.googlecode.com/svn/trunk/lectures/lecture_11/the_fermi_architecture.pdf Fixed-function hardware on each SM for graphics: texture filtering, texture cache, tessellation, vertex fetch / attribute setup, stream output, viewport transform. Why?

29 Observations It is becoming easier to port CPU code to the GPU: recursion, fast atomics, L1/L2 caches, faster global memory. In fact…

30 Observations It is becoming easier to port CPU code to the GPU: recursion, fast atomics, L1/L2 caches, faster global memory. In fact, GPUs are starting to look like CPUs: beefier SMs, L1 and L2 caches, dual warp scheduling, double precision, fast atomics.

