CMSC 611: Advanced Computer Architecture

1 CMSC 611: Advanced Computer Architecture
Complex Parallel Systems

2 Computational Examples
Connection Machine 5
SGI Origin
Intel Nehalem

3 Thinking Machines CM5 (1993)
MIMD, SPARC processors
Fat Tree communication network
D. Hillis and L. Tucker, “The CM-5 Connection Machine: A Scalable Supercomputer,” Communications of the ACM, v36n11, November 1993

4 SGI Origin (1998)
MIPS R10000 processor
Hypercube connected
ccNUMA / directory protocol
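A minimal sketch of what "hypercube connected" implies for routing, assuming nodes are labeled with binary addresses: two nodes are neighbors when their addresses differ in exactly one bit, and the minimum hop count is the number of differing bits. The function names are illustrative and do not come from SGI documentation; the actual Origin interconnect details are not covered on the slide.

```python
def hypercube_neighbors(node: int, dims: int) -> list[int]:
    """Neighbors of `node` in a `dims`-dimensional hypercube: flip one address bit."""
    return [node ^ (1 << d) for d in range(dims)]


def hop_count(src: int, dst: int) -> int:
    """Minimum hops between two nodes: the number of address bits that differ."""
    return bin(src ^ dst).count("1")


if __name__ == "__main__":
    print(hypercube_neighbors(0b101, 3))  # [4, 7, 1]
    print(hop_count(0b000, 0b111))        # 3
```

In a d-dimensional hypercube of 2^d nodes, the worst-case distance is d hops, so the diameter grows only logarithmically with node count.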

5 SGI Origin Node
Ammon, "Hypercube Connectivity within ccNUMA Architecture", Silicon Graphics, 1998.

6 Origin Communication
Level                      Latency (ns)
L1 cache                   5.1
L2 cache                   56.4
local memory               310
4P remote memory           540
8P avg. remote memory      707
16P avg. remote memory     726
32P avg. remote memory     773
64P avg. remote memory     867
128P avg. remote memory    945
Laudon and Lenoski, "SGI Origin: A ccNUMA Highly Scalable Server", Proceedings of Computer Architecture 1997
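As a rough illustration of how non-uniform these latencies are, the sketch below divides each remote latency from the table by the local-memory latency. The dictionary simply restates the slide's numbers; the "ratio to local" framing is added here for illustration and is not a metric taken from the cited paper.

```python
# Latencies in nanoseconds, copied from the slide's table.
latency_ns = {
    "local memory": 310,
    "4P remote memory": 540,
    "8P avg. remote memory": 707,
    "16P avg. remote memory": 726,
    "32P avg. remote memory": 773,
    "64P avg. remote memory": 867,
    "128P avg. remote memory": 945,
}

local = latency_ns["local memory"]

# Ratio of each remote latency to the local-memory latency.
ratio_to_local = {
    level: ns / local for level, ns in latency_ns.items() if level != "local memory"
}

if __name__ == "__main__":
    for level, ratio in ratio_to_local.items():
        print(f"{level}: {ratio:.2f}x local")
```

With these numbers, remote access costs roughly 1.7x local latency at 4 processors and about 3x at 128 processors.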

7 Intel Nehalem Design
Appaloosa, “Intel Nehalem Microarchitecture”, Wikimedia project, November 2008

8 Communication Performance
Michael Thomadakis: The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms

9 Graphics Hardware
Problem domain
Pixel-Planes 4
Pixel-Planes 5
SGI Reality Engine
Pixel Flow
NVIDIA GeForce 6
NVIDIA Maxwell

10 Graphics Rendering
Just model the surfaces (that’s all you can see)
Approximate them with a mesh of triangles
Get really good at rendering triangles

11 Graphics Pipeline
Transform: find where each vertex goes on the screen
Pipeline: Transform, Clip, Rasterize, Shade, Visibility/Blend, Display
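A minimal sketch of the Transform step, assuming a simple pinhole camera: the vertex is already in camera space with the viewer at the origin looking down +z, `focal` is a hypothetical focal-length scale, and the screen origin is at the top-left. Real pipelines use 4x4 matrices and homogeneous coordinates; this only shows the perspective divide and viewport mapping.

```python
def to_screen(vertex, focal, width, height):
    """Map a camera-space vertex (x, y, z), with z > 0, to pixel coordinates."""
    x, y, z = vertex
    # Perspective divide: farther points move toward the screen center.
    ndc_x = focal * x / z
    ndc_y = focal * y / z
    # Viewport mapping: [-1, 1] to [0, width] and [0, height], with y flipped
    # so that +y in camera space points up on the screen.
    px = (ndc_x + 1.0) * 0.5 * width
    py = (1.0 - ndc_y) * 0.5 * height
    return px, py


if __name__ == "__main__":
    print(to_screen((0.0, 0.0, 5.0), 1.0, 640, 480))  # (320.0, 240.0)
    print(to_screen((1.0, 1.0, 2.0), 1.0, 640, 480))  # (480.0, 120.0)
```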

12 Graphics Pipeline
Clip: get rid of off-screen parts (especially behind the viewer)
Pipeline: Transform, Clip, Rasterize, Shade, Visibility/Blend, Display
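One common way to remove the part of a polygon behind the viewer is Sutherland-Hodgman clipping against a near plane. The sketch below is an illustration under assumptions, not the method the slides prescribe: vertices are (x, y, z) tuples in camera space, the viewer looks down +z, and `near` is a hypothetical near-plane distance.

```python
def clip_near(polygon, near=0.1):
    """Clip a convex polygon (list of (x, y, z) tuples) to the half-space z >= near."""
    out = []
    for i, cur in enumerate(polygon):
        prev = polygon[i - 1]  # index -1 wraps around to the last vertex
        cur_in = cur[2] >= near
        prev_in = prev[2] >= near
        if cur_in != prev_in:
            # The edge crosses the plane: emit the intersection point.
            t = (near - prev[2]) / (cur[2] - prev[2])
            out.append(tuple(p + t * (c - p) for p, c in zip(prev, cur)))
        if cur_in:
            out.append(cur)
    return out


if __name__ == "__main__":
    tri = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
    # One vertex is behind the plane, so the triangle becomes a quad.
    print(clip_near(tri, near=0.5))
```

A polygon entirely behind the plane yields an empty list, and one entirely in front is returned unchanged.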

13 Graphics Pipeline
Rasterize: find which pixels are inside the triangle
Pipeline: Transform, Clip, Rasterize, Shade, Visibility/Blend, Display
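A small sketch of one way to find the covered pixels, using edge functions evaluated at pixel centers inside the triangle's bounding box. This is a simplified illustration: it treats points exactly on an edge as inside and omits the tie-breaking fill rules that hardware uses for shared edges. The function names are made up for this example.

```python
import math


def edge(a, b, p):
    """Signed area term: which side of the directed edge a->b the point p is on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(v0, v1, v2, width, height):
    """Return (x, y) pixels whose centers lie inside the 2D triangle v0, v1, v2."""
    xs = (v0[0], v1[0], v2[0])
    ys = (v0[1], v1[1], v2[1])
    # Clamp the bounding box to the screen.
    min_x = max(math.floor(min(xs)), 0)
    max_x = min(math.floor(max(xs)), width - 1)
    min_y = max(math.floor(min(ys)), 0)
    max_y = min(math.floor(max(ys)), height - 1)

    pixels = []
    for y in range(min_y, max_y + 1):
        for x in range(min_x, max_x + 1):
            p = (x + 0.5, y + 0.5)  # sample at the pixel center
            w0 = edge(v1, v2, p)
            w1 = edge(v2, v0, p)
            w2 = edge(v0, v1, p)
            # Accept either winding order: all non-negative or all non-positive.
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or (w0 <= 0 and w1 <= 0 and w2 <= 0):
                pixels.append((x, y))
    return pixels


if __name__ == "__main__":
    print(len(rasterize((0, 0), (4, 0), (0, 4), 4, 4)))  # 10
```

Each pixel's test is independent of every other pixel's, which is what makes this step a natural fit for the SIMD arrays described later in the deck.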

14 Graphics Pipeline
Shade: compute the color for each pixel
Pipeline: Transform, Clip, Rasterize, Shade, Visibility/Blend, Display

15 Graphics Pipeline
Visibility: throw out pixels covered by opaque stuff that’s already rendered
Blend: combine colors for partially transparent objects
Pipeline: Transform, Clip, Rasterize, Shade, Visibility/Blend, Display
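The two operations can be sketched as a depth-buffer test and an "over" blend. This assumes smaller z means closer to the viewer and that colors are non-premultiplied RGB tuples; both are conventions chosen for this example rather than anything stated on the slide.

```python
def depth_test(depth_buffer, x, y, z):
    """Visibility: keep a fragment only if it is nearer than what is already stored."""
    if z < depth_buffer[y][x]:
        depth_buffer[y][x] = z
        return True
    return False


def blend_over(src_rgb, src_alpha, dst_rgb):
    """Blend: composite a partially transparent source color over the destination."""
    return tuple(src_alpha * s + (1.0 - src_alpha) * d for s, d in zip(src_rgb, dst_rgb))


if __name__ == "__main__":
    depth = [[float("inf")]]               # 1x1 depth buffer, initially empty
    print(depth_test(depth, 0, 0, 0.5))    # True: nothing rendered yet
    print(depth_test(depth, 0, 0, 0.8))    # False: hidden behind z = 0.5
    print(blend_over((1.0, 0.0, 0.0), 0.5, (0.0, 0.0, 1.0)))  # (0.5, 0.0, 0.5)
```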

16 Graphics Pipeline
Display: show results to user
Pipeline: Transform, Clip, Rasterize, Shade, Visibility/Blend, Display

17 Graphics Pipeline
Pipeline: Transform, Clip, Rasterize, Shade, Visibility/Blend, Display
Data passed between stages (diagram labels): vertex, triangle, pixel, frame

18 Computation and Bandwidth
Based on:
• 100 Mtri/sec, 60Hz
• 256 Bytes vertex data
• 128 Bytes interpolated
• 68 Bytes pixel output
• 5x depth complexity
• 16 4-Byte textures
• 223 ops/vertex
• 1664 ops/pixel
• No caching
• No compression
Figure labels, in extracted order: Vertex, 75 GB/s, 67 GFLOPS, Triangle, 13 GB/s, 335 GB/s, Texture, 45 GB/s, Fragment, Pixel, 1.1 TFLOPS
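The vertex and triangle figures can be approximately reproduced from the listed parameters. The sketch below assumes three unshared vertices per triangle and that the 128 interpolated bytes are per triangle; neither assumption is stated on the slide, so treat this as a plausibility check, not the author's derivation. The fragment rate is back-calculated from the 1.1 TFLOPS figure because the screen resolution is not given.

```python
TRI_PER_SEC = 100e6          # from the slide: 100 Mtri/sec
VERTS_PER_TRI = 3            # assumption: no vertex sharing between triangles
BYTES_PER_VERTEX = 256       # from the slide
OPS_PER_VERTEX = 223         # from the slide
BYTES_INTERPOLATED = 128     # from the slide; assumed here to be per triangle
OPS_PER_PIXEL = 1664         # from the slide
PIXEL_FLOPS = 1.1e12         # from the slide: 1.1 TFLOPS

verts_per_sec = TRI_PER_SEC * VERTS_PER_TRI

vertex_bw_gb = verts_per_sec * BYTES_PER_VERTEX / 1e9      # about 76.8 GB/s
vertex_gflops = verts_per_sec * OPS_PER_VERTEX / 1e9       # about 66.9 GFLOPS
triangle_bw_gb = TRI_PER_SEC * BYTES_INTERPOLATED / 1e9    # about 12.8 GB/s

# Back-calculation: fragments per second implied by the pixel compute figure.
implied_fragments_per_sec = PIXEL_FLOPS / OPS_PER_PIXEL    # about 6.6e8

if __name__ == "__main__":
    print(f"vertex bandwidth   ~{vertex_bw_gb:.1f} GB/s")
    print(f"vertex compute     ~{vertex_gflops:.1f} GFLOPS")
    print(f"triangle bandwidth ~{triangle_bw_gb:.1f} GB/s")
    print(f"implied fragments  ~{implied_fragments_per_sec:.2e} per second")
```

Under these assumptions the results land near the slide's 75 GB/s, 67 GFLOPS, and 13 GB/s labels.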

19 UNC Pixel-Planes 4 (1985)
DSP vertex processor
Custom rasterizer
512x512 SIMD array
    Full screen
Fuchs et al., “Fast spheres, shadows, textures, transparencies, and image enhancements in pixel-planes”, SIGGRAPH 1985

20 UNC Pixel-Planes 5 (1989)
~40 i860 CPUs for vertex processing
~20 128x128 SIMD arrays for pixel processing
Fuchs et al., “Pixel-Planes 5: a heterogeneous multiprocessor graphics system using processor enhanced memory”, SIGGRAPH 1989

21 SGI Reality Engine (1993)
Akeley, “Reality Engine Graphics”, SIGGRAPH 1993

22 Pixel-Flow (1992-1997)
~35 nodes, each with
    2 HP-PA 8000 CPUs
    128x64 SIMD array (~160 tiles/screen)
Eyles, et al., "PixelFlow: The Realization", Graphics Hardware 1997

23 Pixel-Flow
Eyles, et al., "PixelFlow: The Realization", Graphics Hardware 1997

24 NVIDIA GeForce 6 (2004)
Kilgariff and Fernando, “The GeForce 6 GPU Architecture”, GPU Gems 2, 2005

25 GeForce 6 Parallelism
[Figure: data-parallel Vertex, Triangle, and Pixel units arranged as a pipeline; axis labels "More Parallel" and "More Pipeline"]

26 NVIDIA G80/Tesla (2006)
NVIDIA, “NVIDIA GeForce 8800 GPU Architecture Overview”, TB _v01, November 2006

27 NVIDIA Maxwell (2014)
NVIDIA, NVIDIA GeForce GTX 980 Whitepaper, 2014

28 Maxwell SIMD Processing Block
32 Cores
8 Special Function
NVIDIA Terminology:
    Warp = interleaved threads
        Hide memory latency
        Want at least 4-8
    Thread Block = Warps*Cores
Flexible Registers
    Trade registers for warps
NVIDIA, NVIDIA GeForce GTX 980 Whitepaper, 2014
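"Trade registers for warps" can be illustrated with a simplified occupancy estimate: the more registers each thread uses, the fewer warps fit in the register file at once. The defaults below (16384 32-bit registers and at most 16 resident warps per processing block) are assumptions based on commonly quoted Maxwell limits and do not appear on the slide; real hardware also allocates registers in granules, which this sketch ignores.

```python
WARP_SIZE = 32  # threads per warp


def resident_warps(regs_per_thread, regs_per_block=16384, max_warps=16):
    """Estimate how many warps can be resident in one processing block.

    regs_per_block and max_warps are assumed hardware limits (hypothetical
    defaults for this sketch), passed as parameters so they can be changed.
    """
    limit_from_registers = regs_per_block // (regs_per_thread * WARP_SIZE)
    return min(limit_from_registers, max_warps)


if __name__ == "__main__":
    for regs in (32, 64, 128, 255):
        print(f"{regs:3d} registers/thread -> {resident_warps(regs)} warps")
```

Under these assumptions, 64 registers per thread leaves 8 warps and 128 leaves 4, the bottom of the slide's suggested 4-8 range for hiding memory latency.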

29 Maxwell Streaming Multiprocessor (SMM)
4 SIMD blocks
Share L1 Caches
Share memory
Share tessellation HW
NVIDIA, NVIDIA GeForce GTX 980 Whitepaper, 2014

30 Maxwell Graphics Processing Cluster
4 SMM
Share rasterizer
NVIDIA, NVIDIA GeForce GTX 980 Whitepaper, 2014
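The hierarchy on the last three slides multiplies out as follows. The per-level counts come from the slides; the number of GPCs (4, as on the GeForce GTX 980) is not stated on these slides and is included as an assumption.

```python
CORES_PER_BLOCK = 32   # slide 28: 32 cores per SIMD processing block
BLOCKS_PER_SMM = 4     # slide 29: 4 SIMD blocks per SMM
SMM_PER_GPC = 4        # slide 30: 4 SMM per graphics processing cluster
GPC_COUNT = 4          # assumption: GTX 980 configuration, not given on the slides

cores_per_smm = CORES_PER_BLOCK * BLOCKS_PER_SMM
cores_per_gpc = cores_per_smm * SMM_PER_GPC
total_cores = cores_per_gpc * GPC_COUNT

if __name__ == "__main__":
    print(cores_per_smm, cores_per_gpc, total_cores)  # 128 512 2048
```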

31 Full Maxwell (again)
NVIDIA, NVIDIA GeForce GTX 980 Whitepaper, 2014

