Brook for GPUs
  Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman
  Pat Hanrahan
  February 10th, 2003
February 11th,
Brook: general purpose streaming language
  developed for PCA Program/Merrimac
  – compiler: RStream (Reservoir Labs)
  – DARPA PCA Program
      Stanford: SmartMemories
      UT Austin: TRIPS
      MIT: RAW
  – Brook version 0.2 spec:
  – Brook for GPUs:
  [Figure: block diagram with labels Stream Execution Unit, Stream Register File, Memory System, Network Interface, Scalar Execution Unit, DRDRAM, Network]
Brook: general purpose streaming language
  stream programming model
  – enforce data parallel computing
      streams
  – encourage arithmetic intensity
      kernels
  C with streams
Brook for GPUs
  demonstrate the GPU as a streaming coprocessor
  – make programming GPUs easier
      hide texture/pbuffer data management
      hide graphics-based constructs in Cg/HLSL
      hide rendering passes
      virtualize resources
  – performance!
      … on applications that matter
  – highlight GPU areas for improvement
      features required for general purpose stream computing
system outline
  .br     Brook source files
  brcc    source-to-source compiler
  brt     Brook run-time library
Brook language: streams
  streams
  – collection of records requiring similar computation
      particle positions, voxels, FEM cell, …
      float3 positions ;
      float3 velocityfield ;
  – encourage data parallelism
Brook language: kernels
  kernels
  – functions applied to streams
      similar to for_all construct

      kernel void foo (float a<>, float b<>, out float result<>) {
        result = a + b;
      }

      float a<100>;
      float b<100>;
      float c<100>;
      foo(a,b,c);

      /* equivalent loop in plain C */
      for (i=0; i<100; i++)
        c[i] = a[i]+b[i];

  – no dependencies between stream elements
      encourage high arithmetic intensity
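The for_all reading of a kernel call can be sketched in plain C. This only illustrates the semantics and is not what brcc emits; the names `binary_kernel`, `add_kernel`, and `apply_kernel` are hypothetical and do not come from Brook.

```c
#include <stddef.h>

/* Hypothetical per-element function type standing in for a Brook kernel body. */
typedef float (*binary_kernel)(float a, float b);

/* Plain-C stand-in for the body of the slide's kernel foo. */
float add_kernel(float a, float b)
{
    return a + b;
}

/* Apply k to every pair of elements. Each result depends only on the inputs
   at the same index, so the iterations could run in any order or in parallel. */
void apply_kernel(binary_kernel k, const float *a, const float *b,
                  float *result, size_t n)
{
    for (size_t i = 0; i < n; i++)
        result[i] = k(a[i], b[i]);
}
```

Calling `apply_kernel(add_kernel, a, b, c, 100)` on three 100-element arrays corresponds to `foo(a,b,c)` on the slide.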
Brook language: kernels
  Ray Triangle Intersection

  kernel void krnIntersectTriangle(Ray ray<>, Triangle tris[],
                                   RayState oldraystate<>,
                                   GridTrilist trilist[],
                                   out Hit candidatehit<>) {
    float idx, det, inv_det;
    float3 edge1, edge2, pvec, tvec, qvec;

    if(oldraystate.state.y > 0) {
      idx = trilist[oldraystate.state.w].trinum;
      edge1 = tris[idx].v1 - tris[idx].v0;
      edge2 = tris[idx].v2 - tris[idx].v0;
      pvec = cross(ray.d, edge2);
      det = dot(edge1, pvec);
      inv_det = 1.0f/det;
      tvec = ray.o - tris[idx].v0;
      candidatehit.data.y = dot( tvec, pvec ) * inv_det;
      qvec = cross( tvec, edge1 );
      candidatehit.data.z = dot( ray.d, qvec ) * inv_det;
      candidatehit.data.x = dot( edge2, qvec ) * inv_det;
      candidatehit.data.w = idx;
    } else {
      candidatehit.data = float4(0,0,0,-1);
    }
  }
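The arithmetic in this kernel appears to follow the Möller–Trumbore ray/triangle formulation. A minimal CPU sketch of the same per-ray math in plain C is given here for checking values by hand. The `vec3` and `hit_params` types and all function names are hypothetical, and reading `data.x` as the ray distance t and `data.y`, `data.z` as the barycentric coordinates u, v is an interpretation, not something the slide states.

```c
/* Hypothetical 3-vector type; Brook's float3 is built in. */
typedef struct { float x, y, z; } vec3;

/* Assumed meaning of the kernel outputs: t = data.x, u = data.y, v = data.z. */
typedef struct { float t, u, v; } hit_params;

vec3 v_sub(vec3 a, vec3 b)
{
    vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

vec3 v_cross(vec3 a, vec3 b)
{
    vec3 r = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return r;
}

float v_dot(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Same sequence of operations as the kernel's "if" branch for one ray
   (origin o, direction d) and one triangle (v0, v1, v2). Like the slide,
   it does not guard against det == 0 or test u and v against the triangle. */
hit_params intersect_params(vec3 o, vec3 d, vec3 v0, vec3 v1, vec3 v2)
{
    vec3 edge1 = v_sub(v1, v0);
    vec3 edge2 = v_sub(v2, v0);
    vec3 pvec = v_cross(d, edge2);
    float det = v_dot(edge1, pvec);
    float inv_det = 1.0f / det;
    vec3 tvec = v_sub(o, v0);
    vec3 qvec = v_cross(tvec, edge1);
    hit_params h;
    h.u = v_dot(tvec, pvec) * inv_det;
    h.v = v_dot(d, qvec) * inv_det;
    h.t = v_dot(edge2, qvec) * inv_det;
    return h;
}
```

For the unit triangle (0,0,0), (1,0,0), (0,1,0) and a ray from (0.25, 0.25, 1) pointing along -z, this gives t = 1 and u = v = 0.25.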
Brook language: additional features
  reductions
  – scalar
  – stream
  stride & repeat
  GatherOp & ScatterOp
  – a[i] += p
  – p = a[i]++
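The scalar reduction and the two indexed forms listed above can be sketched as sequential C loops. This is an illustration of the semantics only, assuming sum as the reduction operator. The function names are hypothetical, and the slide does not say how Brook orders updates that hit the same index.

```c
#include <stddef.h>

/* Scalar reduction: collapse a whole stream to a single value (here, a sum). */
float reduce_sum(const float *s, size_t n)
{
    float total = 0.0f;
    for (size_t k = 0; k < n; k++)
        total += s[k];
    return total;
}

/* ScatterOp in the slide's "a[i] += p" form: stream element p[k] is combined
   into the array slot selected by index[k]; repeated indices accumulate. */
void scatter_add(float *a, const size_t *index, const float *p, size_t n)
{
    for (size_t k = 0; k < n; k++)
        a[index[k]] += p[k];
}

/* GatherOp in the slide's "p = a[i]++" form: fetch the slot selected by
   index[k] into the stream, then update the slot in place. */
void gather_inc(float *a, const size_t *index, float *p, size_t n)
{
    for (size_t k = 0; k < n; k++)
        p[k] = a[index[k]]++;
}
```

A plain gather (`p = a[i]`) only reads, so it maps onto a texture fetch; the forms above also modify `a`, which is why they need more than an indexed read.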
brcc compiler infrastructure
  based on ctool
  parser
  – build code tree
  – extend C grammar to accept Brook
  convert
  – tree transformations
  codegen
  – generate Cg & HLSL code
  – call cgc, fxc
  – generate stub function
Applications
  Ray-tracer
  FFT
  Segmentation
  Linear Algebra:
  – BLAS, LINPACK, LAPACK
Brook Performance
GPU Gotchas
  [Figure: plot with axes Time and Registers Used]
GPU Gotchas
  NVIDIA NV3x: Register usage vs. Time
  [Figure: plot with axes Time and Registers Used]
GPU Gotchas
  NVIDIA:
  Register Penalty
  Render to Texture Limitation
  – Requires explicit copy or heavy pbuffer solution
  – Superbuffer extension needed
      SIG03.pdf
GPU Gotchas
  ATI Radeon 9800 Pro
  Limited dependent texture lookup
  96 instructions
  24-bit floating point
  – s16e7
      Integers up to 131,072 (s23e8: 16,777,216)
  [Figure: four repeated pairs of blocks labeled Memory Refs and Math Ops]
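The two integer limits on this slide follow from the mantissa width. A small C sketch of the arithmetic is shown here, assuming both formats use an implicit leading 1 bit (which is consistent with the slide's numbers) and that the host `float` is IEEE 754 binary32. The function names are hypothetical.

```c
#include <stdint.h>

/* With m stored mantissa bits plus an implicit leading 1, every integer from
   0 through 2^(m+1) is exactly representable; 2^(m+1) + 1 is the first that
   is not. Valid here for m < 31. */
uint32_t max_contiguous_int(unsigned mantissa_bits)
{
    return (uint32_t)1 << (mantissa_bits + 1);
}

/* Check the s23e8 limit on the host: adding 1 to 2^24 in single precision
   rounds back to 2^24. volatile forces each value to be stored as a float. */
int float_absorbs_one(void)
{
    volatile float big = 16777216.0f;
    volatile float next = big + 1.0f;
    return next == big;
}
```

`max_contiguous_int(16)` gives 131,072 for s16e7 and `max_contiguous_int(23)` gives 16,777,216 for s23e8, so a kernel that stores array indices in 24-bit floats cannot address more than 131,072 consecutive elements exactly.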
GPU Catch-Up!
  Integer & Bit Ops & Double Precision
  Memory Addressing
  CGC/FXC Performance
  – Hand code performance critical code
  No native reduction support
  No native scatter support
  – p[i] = a (indirect write)
  No programmable blend
  – GatherOp / ScatterOp
  Limited 4x4 output
  – Brook virtualized kernel outputs
  Readback still slow
  – NV35 OpenGL: 600 MB/sec Download, 170 MB/sec Readback
  – ATI DirectX: 550 MB/sec Download, 50 MB/sec Readback
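To put the readback figures in perspective, the transfer time for one buffer can be estimated with a few lines of C. This is back-of-envelope arithmetic, not a measurement: it assumes 1 MB = 1,000,000 bytes (the slide does not say which megabyte it uses), 4-byte floats, and a hypothetical 1024 x 1024 float4 buffer.

```c
/* Milliseconds needed to move `bytes` at `mb_per_sec`,
   taking 1 MB = 1,000,000 bytes (an assumption). */
double transfer_ms(double bytes, double mb_per_sec)
{
    return bytes / (mb_per_sec * 1.0e6) * 1000.0;
}

/* Size in bytes of a width x height buffer holding four floats per element. */
double float4_buffer_bytes(unsigned width, unsigned height)
{
    return (double)width * (double)height * 4.0 * (double)sizeof(float);
}
```

Under these assumptions a 1024 x 1024 float4 buffer is 16,777,216 bytes, which takes about 28 ms to download and about 99 ms to read back at the NV35 OpenGL rates, and about 336 ms to read back at 50 MB/sec.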
GPUs of the future (we hope)
  Complete Instruction Sets
  – Integers, Bit Ops, Doubles, Mem Access
  Integration
  – Streaming coprocessor not just a rendering device
  Streaming architectures
  [Figure: diagram with labels SDRAM, Stream Register File, ALU Cluster]
Brook for GPUs
  Release v0.3 available on Sourceforge
  Project Page
  Source
  Over 4K downloads!
  Questions?
  Fly-fishing fly images from The English Fly Fishing Shop