
Slide 1: Streaming Architectures and GPUs
Ian Buck, Bill Dally & Pat Hanrahan
Stanford University, February 11, 2004

Slide 2: To Exploit VLSI Technology We Need
- Parallelism: to keep hundreds of ALUs per chip (thousands per board, millions per system) busy
- Latency tolerance: to cover ~500-cycle remote memory access times
- Locality: to match ~20 Tb/s of ALU bandwidth to ~100 Gb/s of off-chip bandwidth
- Moore's Law: growth in transistors, not performance

Arithmetic is cheap; global bandwidth is expensive:
local << global on-chip << off-chip << global system
(Courtesy of Bill Dally)

Slide 3: Arithmetic Intensity
Arithmetic intensity means lots of ops per word transferred: a compute-to-bandwidth ratio. High arithmetic intensity is desirable:
- The application is limited by ALU performance, not off-chip bandwidth
- More chip real estate goes to ALUs, not caches
(Courtesy of Pat Hanrahan)
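As a rough corollary of the previous slide's numbers (an illustrative estimate, under the assumption that each ALU op consumes about one word of operand traffic): keeping the ALUs busy requires an arithmetic intensity on the order of

    20 Tb/s / 100 Gb/s = 200 ops per word fetched off-chip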

Slide 4: Brook: Stream Programming Model
- Enforce data-parallel computing
- Encourage arithmetic intensity
- Provide fundamental ops for stream computing
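For concreteness, a minimal Brook kernel in the spirit of the saxpy example from the Brook for GPUs paper (a sketch, not code from this talk; the <> suffix marks stream arguments):

    kernel void saxpy(float a, float4 x<>, float4 y<>, out float4 result<>) {
        // Runs once per stream element; elements are independent.
        result = a * x + y;
    }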

Slide 5: Streams & Kernels
Streams:
- Collections of records requiring similar computation (vertex positions, voxels, FEM cells, ...)
- Provide data parallelism
Kernels:
- Functions applied to each element in a stream (transforms, PDEs, ...)
- No dependencies between stream elements
- Encourage high arithmetic intensity
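A hedged sketch of the host-side pattern that pairs streams with such a kernel, again following the Brook for GPUs paper (sizes and initialization are placeholders):

    void main(void) {
        float a = 2.0f;
        float4 X[100], Y[100], Result[100];   // ordinary arrays in memory
        float4 x<100>, y<100>, result<100>;   // streams of 100 records
        // ... initialize a, X, Y ...
        streamRead(x, X);                     // memory -> stream
        streamRead(y, Y);
        saxpy(a, x, y, result);               // kernel over every element
        streamWrite(result, Result);          // stream -> memory
    }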

Slide 6: Vectors vs. Streams
Vectors (v: array of floats), instruction sequence:

    LD   v0
    LD   v1
    ADD  v0, v1, v2
    ST   v2

=> Large set of temporaries: every intermediate is a full-length vector.

Streams (s: stream of records), instruction sequence:

    LD    s0
    LD    s1
    CALLS f, s0, s1, s2
    ST    s2

=> Small set of temporaries: intermediates live in registers inside f.

Higher arithmetic intensity: |f|/|s| >> |+|/|v|
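To make the |f|/|s| vs. |+|/|v| point concrete in Brook terms (an illustrative sketch, not code from the talk): splitting work into one kernel per operation behaves like the vector case, while fusing the whole function f into one kernel keeps temporaries local:

    // "Vector" style: one kernel per op; the intermediate stream t
    // round-trips through the stream register file between kernels.
    kernel void mul(float a<>, float b<>, out float t<>) { t = a * b; }
    kernel void add(float t<>, float a<>, out float c<>) { c = t + a; }

    // "Stream" style: the fused kernel f; t never leaves the ALU
    // cluster's local registers, so ops per word transferred go up.
    kernel void f(float a<>, float b<>, out float c<>) {
        float t = a * b;
        c = t + a;
    }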

Slide 7: Imagine
- Stream processor for image and signal processing
- 16 mm x 16 mm die in a 0.18 um TI process
- 21M transistors
[Figure: bandwidth hierarchy - SDRAM at 2 GB/s, Stream Register File at 32 GB/s, ALU clusters at 544 GB/s]

Slide 8: Merrimac Processor
- 90 nm ASIC technology (1 V)
- 1 GHz clock (37 FO4)
- 128 GOPS
- Inter-cluster switch between clusters
- 127.5 mm^2 die (small, ~12.5 mm x 10.2 mm)
  - Stanford Imagine is 16 mm x 16 mm
  - MIT Raw is 18 mm x 18 mm
- 25 W (Pentium 4 = 75 W); ~41 W with memories
[Figure: floorplan showing a microcontroller, 16 ALU clusters, 16 RDRAM interfaces with forward ECC, address generators, reorder buffers, a cache bank, a memory switch, two MIPS64 20Kc scalar processors, and the network interface]

Slide 9: Merrimac Streaming Supercomputer
[Figure: system overview of the Merrimac streaming supercomputer]

Slide 10: Streaming Applications
- Finite volume: StreamFLO (from TFLO)
- Finite element: StreamFEM
- Molecular dynamics code (ODEs): StreamMD
- Model PDEs (elliptic, hyperbolic, and parabolic)
- PCA applications: FFT, matrix multiply, SVD, sort

Slide 11: StreamFLO
StreamFLO is the Brook version of FLO82, a FORTRAN code written by Prof. Jameson for solving the inviscid flow around an airfoil. The code uses a cell-centered finite volume formulation with multigrid acceleration to solve the 2D Euler equations. The structure of the code is similar to TFLO, and the algorithm is found in many compressible flow solvers.

Slide 12: StreamFEM
A Brook implementation of the Discontinuous Galerkin (DG) Finite Element Method (FEM) in 2D triangulated domains.

Slide 13: StreamMD: Motivation
Application: study the folding of human proteins.
Molecular dynamics: computer simulation of the dynamics of macromolecules.
Why this application?
- Expected high arithmetic intensity.
- Requires variable-length neighbor lists.
- Molecular dynamics can also be used in engine simulation to model sprays, e.g. droplet formation and breakup, drag, and droplet deformation.
Test case chosen for initial evaluation: a box of water molecules.
[Figures: a DNA molecule and the human immunodeficiency virus (HIV)]

Slide 14: Summary of Application Results

    Application                     Sustained GFLOPS[1]  FP Ops/Mem Ref  LRF Refs        SRF Refs      Mem Refs
    StreamFEM2D (Euler, quadratic)  32.2                 23.5            169.5M (93.6%)  10.3M (5.7%)  1.4M (0.7%)
    StreamFEM2D (MHD, cubic)        33.5                 50.6            733.3M (94.0%)  43.8M (5.6%)  3.2M (0.4%)
    StreamMD                        14.2[2]              12.1            90.2M (97.5%)   1.6M (1.7%)   0.7M (0.8%)
    StreamFLO                       11.4[2]              7.4             234.3M (95.7%)  7.2M (2.9%)   3.4M (1.4%)

[1] Simulated on a machine with 64 GFLOPS peak performance.
[2] The low numbers are a result of many divide and square-root operations.

Slide 15: Streaming on Graphics Hardware?
Pentium 4 SSE, theoretical*: 3 GHz x 4 wide x 0.5 inst/cycle = 6 GFLOPS
GeForce FX 5900 (NV35) fragment shader, observed: 20 GFLOPS on MULR R0, R0, R0 - equivalent to a 10 GHz P4, and getting faster: 3x improvement over NV30 in 6 months.
* From the Intel P4 Optimization Manual
[Figure: GFLOPS over time, GeForce FX (NV30, NV35) vs. Pentium 4]

Slide 16: GPU Program Architecture
[Figure: a shader program reads per-element input registers, constant registers, and texture memory, and writes its results to output registers]

Slide 17: Example Program: Simple Specular and Diffuse Lighting

    !!VP1.0
    #
    # c[0-3]  = modelview projection (composite) matrix
    # c[4-7]  = modelview inverse transpose
    # c[32]   = eye-space light direction
    # c[33]   = constant eye-space half-angle vector (infinite viewer)
    # c[35].x = pre-multiplied monochromatic diffuse light color & diffuse mat.
    # c[35].y = pre-multiplied monochromatic ambient light color & diffuse mat.
    # c[36]   = specular color
    # c[38].x = specular power
    # outputs homogeneous position and color
    #
    DP4   o[HPOS].x, c[0], v[OPOS];      # Compute position.
    DP4   o[HPOS].y, c[1], v[OPOS];
    DP4   o[HPOS].z, c[2], v[OPOS];
    DP4   o[HPOS].w, c[3], v[OPOS];
    DP3   R0.x, c[4], v[NRML];           # Compute normal.
    DP3   R0.y, c[5], v[NRML];
    DP3   R0.z, c[6], v[NRML];           # R0 = N' = transformed normal
    DP3   R1.x, c[32], R0;               # R1.x = Ldir DOT N'
    DP3   R1.y, c[33], R0;               # R1.y = H DOT N'
    MOV   R1.w, c[38].x;                 # R1.w = specular power
    LIT   R2, R1;                        # Compute lighting values
    MAD   R3, c[35].x, R2.y, c[35].y;    # diffuse + ambient
    MAD   o[COL0].xyz, c[36], R2.z, R3;  # + specular
    END

Slide 18: Cg/HLSL: High-Level Language for GPUs (Specular Lighting)

    // Look up the normal map
    float4 normal = 2 * (tex2D(normalMap, I.texCoord0.xy) - 0.5);

    // Multiply a 3x2 matrix built from lightDir and halfAngle with the
    // scaled normal, then look up the result in the intensity map.
    float2 intensCoord = float2(dot(I.lightDir.xyz, normal.xyz),
                                dot(I.halfAngle.xyz, normal.xyz));
    float4 intensity = tex2D(intensityMap, intensCoord);

    // Look up color
    float4 color = tex2D(colorMap, I.texCoord3.xy);

    // Blend/modulate intensity with color
    return color * intensity;

Slide 19: GPU: Data Parallel
- Each fragment is shaded independently
  - No dependencies between fragments
  - Temporary registers are zeroed
  - No static variables
  - No read-modify-write textures
- Multiple "pixel pipes" provide data parallelism
  - Supports ALU-heavy architectures
  - Hides memory latency
[Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98]
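These restrictions correspond to Brook's kernel rules. A hedged sketch (hypothetical names; gather-stream syntax as in the Brook for GPUs paper): a kernel may read anywhere in a gather stream, which compiles down to a texture fetch, but may write only its own output element:

    kernel void gatherOnly(float i<>, float table[], out float o<>) {
        o = table[i];    // arbitrary reads are allowed (a texture fetch)
        // table[i] = o; // would be read-modify-write: not expressible
    }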

Slide 20: GPU: Arithmetic Intensity
Lots of ops per word transferred through the graphics pipeline:
- Vertex: BW: 1 triangle = 32 bytes; OP: 100-500 f32 ops per triangle
- Rasterization: creates 16-32 fragments per triangle
- Fragment: BW: 1 fragment = 10 bytes; OP: 300-1000 i8 ops per fragment
Shader programs supply the arithmetic at the vertex and fragment stages.
(Courtesy of Pat Hanrahan)
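Taking the slide's numbers at face value (an illustrative back-of-the-envelope, not from the talk), the per-stage intensities work out to:

    vertex stage:   100-500 ops / 32 bytes  ~  3-16 ops per byte
    fragment stage: 300-1000 ops / 10 bytes ~ 30-100 ops per byte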

Slide 21: Streaming Architectures
[Figure: SDRAM feeding a Stream Register File feeding an ALU cluster]

Slide 22: Streaming Architectures
[Figure: the same SDRAM / Stream Register File / ALU cluster diagram, now labeling the ALU cluster as the kernel execution unit, running a kernel such as:]

    MAD R3, R1, R2;
    MAD R5, R2, R3;

Slide 23: Streaming Architectures
[Figure: as above, additionally identifying the GPU's parallel fragment pipelines with the ALU clusters]

Slide 24: Streaming Architectures
[Figure: as above, asking what plays the role of the Stream Register File on a GPU - the texture cache? the F-Buffer [Mark et al.]?]

Slide 25: Conclusions
- The problem is bandwidth: arithmetic is cheap.
- Stream processing and stream architectures can provide VLSI-efficient scientific computing (Imagine, Merrimac).
- GPUs are first-generation streaming architectures: apply the same stream programming model for general-purpose computing on GPUs (GeForce FX).

