1 Graphics on a Stream Processor. Peter Djeu, March 20, 2003
2 Polygon Rendering on a Stream Architecture. Owens, Dally, Kapasi, Rixner, Mattson, Mowery. Stanford Computer Sciences Lab
3 Goals: Create a working version of OpenGL on the Imagine Stream Processor. Why Imagine? Imagine is programmable; contemporary graphics cards were not. Use programmability and flexibility to head towards the goal of 80 million polygons / frame. The trend is toward more flexibility: there are over 200 proposed extensions to OpenGL, and people already use complex software rendering techniques (like multipass hacks, e.g. shadows).
6 Basic Definitions: Streams are sets of data elements. All elements are a single data type. Index streams. Kernels are pieces of code that operate on streams. They take a stream as input and produce a stream as output. Kernels can be chained together.
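The stream and kernel definitions above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a stream is modeled as a list of same-typed tuples, and the kernel names `scale_kernel`, `project_kernel`, and the helper `run_pipeline` are hypothetical, not part of the Imagine toolchain.

```python
from typing import Callable, List, Sequence, Tuple

Vertex = Tuple[float, float, float]


def scale_kernel(stream: Sequence[Vertex], factor: float) -> List[Vertex]:
    """Kernel: consumes a vertex stream, produces a scaled vertex stream."""
    return [(x * factor, y * factor, z * factor) for (x, y, z) in stream]


def project_kernel(stream: Sequence[Vertex]) -> List[Tuple[float, float]]:
    """Kernel: consumes a vertex stream, produces a stream of 2D points."""
    return [(x / z, y / z) for (x, y, z) in stream]


def run_pipeline(stream, kernels: Sequence[Callable]):
    """Chain kernels: the output stream of one is the input of the next."""
    for kernel in kernels:
        stream = kernel(stream)
    return stream


vertices = [(1.0, 2.0, 2.0), (4.0, 8.0, 4.0)]
points = run_pipeline(vertices, [lambda s: scale_kernel(s, 2.0), project_kernel])
```

Every element of a stream goes through the same code, which is the property the later slides rely on for data-level parallelism.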
7 Basic Definitions (page 2): Instruction-Level Parallelism - issuing independent instructions in the same cycle (ex: 6 functional units in an ALU cluster, VLIW). Data-Level Parallelism - performing the same operation on multiple pieces of data (ex: 8 ALU clusters operating on a single stream, vector computing).
8 Basic Definitions (page 3): Producer-Consumer Locality. Occurs when one component of a system produces something that is immediately consumed by another component of the system. In the polygon rendering pipeline, it occurs when one kernel produces a stream that is immediately consumed by another kernel. The Stream Register File (SRF) and local registers exploit producer-consumer locality.
9 Why are Streams Good for Graphics? Rendering is inherently parallel: triangles can be processed in parallel, but may need to be ordered when drawn to the screen. No need for caching: primitives are rendered and then discarded (textures excepted). Latency is not as important with streams; stream processors emphasize maximum throughput. VLSI has allowed us to build stream processors.
10 Homogeneous Streams: Kernels are most efficient if all elements in the stream require identical operations (less complexity and no stalls from conditionals). Use conditional streams to separate heterogeneous streams into homogeneous ones. Ex: backface culling (forward, backward).
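A minimal Python sketch of the backface-culling example: one heterogeneous triangle stream is separated into two homogeneous streams, so that each downstream kernel sees only one kind of element. The function names are hypothetical, and treating counter-clockwise screen-space winding as front-facing is an assumption made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]


def signed_area(tri: Triangle) -> float:
    """Signed screen-space area; positive for counter-clockwise winding."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return 0.5 * ((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))


def split_by_facing(triangles: Sequence[Triangle]) -> Tuple[List[Triangle], List[Triangle]]:
    """Route each triangle to exactly one of two output streams."""
    front: List[Triangle] = []
    back: List[Triangle] = []
    for tri in triangles:
        # Assumption: counter-clockwise triangles are front-facing.
        (front if signed_area(tri) > 0 else back).append(tri)
    return front, back


ccw = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
cw = ((0.0, 0.0), (0.0, 1.0), (1.0, 0.0))
front_stream, back_stream = split_by_facing([ccw, cw, ccw])
```

The branch happens once, at the split; the kernels that consume `front_stream` afterwards need no per-element conditional.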
11 Benefits of Using Imagine: Memory hierarchy: local registers are great for ALUs; the SRF is a big cache to exploit producer-consumer locality. SIMD architecture exploits parallelism and homogeneity in streams. One kernel runs at a time, on all 8 ALU clusters (data-level parallelism on streams). Granularity is at the stream level, so fewer instruction issues and less bandwidth are needed.
12 Three-Stage Pipeline: Geometry - generates triangles. Rasterization - generates fragments. Composite - generates pixels. See figure: each node is a kernel, each edge is a stream.
14 Load Balancing (A Problem): Large triangles take longer to be rasterized, potentially slowing down the pipeline. Solution: every cycle, ALU clusters fetch if they are idle and keep processing otherwise. This solution may not be good for all situations, since we incur the fetch cost even if no cluster really needs to fetch (ex: when triangles are all large but roughly equal in size).
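The trade-off can be illustrated with a toy cycle-count model in Python. It is a sketch, not a model of Imagine: `lockstep_cycles` assumes all clusters fetch together and wait for the slowest triangle, `fetch_when_idle_cycles` assumes a cluster takes the next triangle as soon as it is free, and the per-fetch overhead that the slide worries about is deliberately left out. The per-triangle costs are made-up numbers.

```python
import heapq
from typing import Sequence


def lockstep_cycles(costs: Sequence[int], clusters: int) -> int:
    """All clusters fetch together; each round lasts as long as its slowest triangle."""
    total = 0
    for i in range(0, len(costs), clusters):
        total += max(costs[i:i + clusters])
    return total


def fetch_when_idle_cycles(costs: Sequence[int], clusters: int) -> int:
    """A cluster fetches the next triangle as soon as it becomes idle."""
    finish_times = [0] * clusters
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)  # the cluster that idles first
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


mixed = [10, 1, 1, 1, 1, 1, 1, 1]   # one large triangle among small ones
uniform = [5, 5, 5, 5]              # large but equal-sized triangles
```

With the `mixed` costs on 2 clusters the idle-fetch scheme finishes earlier, while with the `uniform` costs both schemes take the same time, which is the case where any extra fetch cost would be pure overhead.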
15 Ordering (A Problem): OpenGL requires in-order completion. What situations require in-order completion? What situations do not require in-order completion? Can we solve this ordering problem without having to sort? For instance, ask the user to label triangles with a priority, or assign a priority based on the original order and use a second Z-buffer test on the priority value.
16 Ordering (A Problem) (page 2): Paper’s solution: Create 2 streams, then concatenate ordered onto unordered. Assign an in-order ID to each triangle in the stream (now an extra ID value is in the stream). Since overlaps are rare, use a hash function to find when two fragments overlap on the screen. Resolve overlaps by using the ID number. Unclear: how did they implement a hash table with 2 bits per hash entry (32 * 8 words total)?
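One way to read the scheme is sketched below in Python. It is an assumption-laden illustration, not the paper's implementation: the hash function, the table size, the saturating per-slot counter (0, 1, or "2 or more"), and the explicit sort by triangle ID are all choices made for this example.

```python
from collections import namedtuple
from typing import List, Sequence

Fragment = namedtuple("Fragment", ["x", "y", "tri_id"])


def hash_pixel(x: int, y: int, table_size: int) -> int:
    """Hypothetical hash of a screen position into a small table."""
    return (x * 31 + y * 17) % table_size


def order_fragments(fragments: Sequence[Fragment], table_size: int = 256) -> List[Fragment]:
    """Return the unordered stream followed by the ID-ordered conflicting stream."""
    # Pass 1: count fragments per hash slot, saturating at 2 ("conflict").
    counts = [0] * table_size
    for frag in fragments:
        slot = hash_pixel(frag.x, frag.y, table_size)
        counts[slot] = min(counts[slot] + 1, 2)

    # Pass 2: fragments alone in their slot cannot overlap anything, so their
    # order is free; the rest may overlap (or be false conflicts).
    unordered, conflicting = [], []
    for frag in fragments:
        slot = hash_pixel(frag.x, frag.y, table_size)
        (conflicting if counts[slot] == 2 else unordered).append(frag)

    # Resolve possible overlaps by triangle ID, then concatenate.
    conflicting.sort(key=lambda f: f.tri_id)
    return unordered + conflicting


frags = [Fragment(0, 0, 2), Fragment(0, 0, 1), Fragment(5, 5, 3)]
ordered = order_fragments(frags)
```

Two different pixels that land in the same slot are also sent down the ordered path; such false conflicts cost extra work but do not affect correctness.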
17 Producer-Consumer Locality (An Advantage): Break up the entire input into batches, then make sure the batches fit inside the SRF. Trips to main memory can really be reduced. But Owens et al. batch their input before they begin processing. Is this fair? Dynamic batching may not be as effective as their hand-crafted batches.
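A minimal Python sketch of static batching, assuming a fixed per-element footprint. The function name `make_batches` and the word counts in the example are hypothetical; a real footprint would also have to account for the intermediate streams each batch generates.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(items: Sequence[T], words_per_item: int, srf_words: int) -> List[List[T]]:
    """Split the input so that each batch fits in `srf_words` words of storage."""
    if words_per_item <= 0:
        raise ValueError("words_per_item must be positive")
    per_batch = srf_words // words_per_item
    if per_batch < 1:
        raise ValueError("a single item does not fit in the SRF budget")
    return [list(items[i:i + per_batch]) for i in range(0, len(items), per_batch)]


# Hypothetical sizes: 10 vertices, 4 words each, 16 words of budget.
batches = make_batches(list(range(10)), words_per_item=4, srf_words=16)
```

A dynamic scheme would have to choose `per_batch` without knowing in advance how many fragments each triangle produces, which is why it may do worse than hand-tuned batches.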
19 Other Advantages to the Imagine Implementation: Latency Tolerance - ALU clusters process the current batch while the memory system fetches the next batch. Pipeline reordering - ex: turn off blending and move the texturing kernel after the depth test kernel, so non-visible fragments are culled before texturing. Flexible resource allocation - hardware is reused at each stage, no a priori allocation.
20 Four Testing Platforms: Software - no hardware acceleration. Imagine. Nvidia Quadro (real). Nvidia Quadro (ideal) - a scaling of the Quadro’s real-world performance to its advertised peak performance. Is this a good idea or a bad idea?
22 Four Test Scenes and Batch Sizes: Sphere total triangles, B: 80 triangles. Advs-1 - point-sampled 512 x 512 texture mapped, B: 256 vertices. Advs-8 - mipmapped 512 x 512 base level texture, B: 24 vertices. Fill - fills entire window, 512 x 512 texture map, B: 16 vertices. Why is there such a dramatic drop in batch size between Advs-1 and Advs-8?
23 Results: Low memory-to-SRF traffic, compared with SRF-to-local-register-file traffic, means the current memory hierarchy is being well utilized (i.e. the hardware is exploiting producer-consumer locality). The hash function produces too many false conflicts, leading to too much computation (see Fig. 9). Future work: a better hash function.
24 Results (page 2): Not enough ALUs - ADVS-8 needed ~2.5 times the ALU ops of ADVS-1, but performance dropped > 50% (however, batch size also dropped significantly). Batch size is often too small to be efficient: startup costs are paid per batch, as is priming and draining of inner loops (sub-optimal saturation as loops are starting/winding down?).
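The effect of a fixed per-batch cost can be made concrete with a simple amortization model in Python. This is a back-of-the-envelope sketch: the batch sizes 256 and 24 are the Advs-1 and Advs-8 values quoted earlier, but `cycles_per_item` and `startup_cycles` are invented numbers, and the model ignores that Advs-8 also does more work per element.

```python
def batch_efficiency(batch_size: int, cycles_per_item: float, startup_cycles: float) -> float:
    """Fraction of a batch's cycles spent on useful per-item work."""
    useful = batch_size * cycles_per_item
    return useful / (useful + startup_cycles)


# Hypothetical costs: 10 cycles per vertex, 1000 cycles of startup/priming/draining.
large_batch = batch_efficiency(256, cycles_per_item=10, startup_cycles=1000)
small_batch = batch_efficiency(24, cycles_per_item=10, startup_cycles=1000)
```

Under these made-up costs the large batch spends roughly 72% of its cycles on useful work and the small batch roughly 19%, which shows how a shrinking batch size alone can account for a large drop in throughput.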
26 Batch Processing, Clearing Z Buffer in Background
27 Future Work via Hardware Extensions: Problem: The Imagine hardware cannot currently compete with the powerful rasterizers in graphics cards. Solution: Add hardware that is dedicated to graphics and polygon rendering, such as a texture cache, or an ALU in the memory system for doing on-the-spot depth test, alpha blend, and filtering.
28 Final Impressions: Impressive that they got the OpenGL pipeline working on a general-purpose stream processor. Not enough ALU power, and things may not improve with more hardware (ordering and communication constraints may still bottleneck). Programmability is good, but not worth it if the overall performance on existing applications is worse. The limited size of the SRF is very limiting when triangle size (and hence the number of generated fragments) is unbounded.
29 Comparing Reyes and OpenGL on a Stream Architecture. Owens, Khailany, Towles, Dally. Stanford Computer Sciences Lab
30 Motivation: Recent trends in graphics: smaller triangles in modeling, demand for more complex shading, and memory bandwidth becoming more scarce. The Reyes Rendering Pipeline addresses these issues.
31 OpenGL Pipeline Stages: Transformations and vertex ops - 1st pass for shading (interpolation, light, glColor). Assemble / Clip / Project - view frustum. Rasterize - triangles become fragments. Fragment ops - 2nd pass for shading (texturing, blending). Visibility / Filter - depth buffer, composite.
32 Reyes Rendering Pipeline: Dice / Split - primitives are subdivided into micropolygons. Shade - shading done in eye space (in the paper, a screen space calculation is used). Sample - projection to screen space. Visibility / Filter - depth buffer, composite.
35 Why are We Using Imagine? (again?) Imagine is parallel and exploits producer-consumer locality. Imagine is not specifically designed for either pipeline, so it is a good platform for comparing and contrasting the 2 pipelines. Better than OpenGL-optimized graphics cards because it is not so specialized.
36 Stanford Real-Time Shading Language (RTSL): High-level language for writing shader programs for targeted hardware (Imagine). For OpenGL: generates code for a Vertex Program and a Fragment Program. For Reyes: generates code for a Vertex Program (4 vertices to a micropolygon).
37 Subdividing Micropolygons: Primitives start out as a collection of B-Spline control points. 4 control points are treated as 1 quad. If all edges of a quad are shorter than a threshold length, then return the quad. Else subdivide the quad into 4 quads and repeat the test.
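The recursive test described above can be sketched in Python as follows. This is a simplified illustration: quads are four 2D corner points, and children are formed from edge midpoints (plain bilinear splitting) rather than by evaluating the B-spline surface. The function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Quad = Tuple[Point, Point, Point, Point]


def midpoint(a: Point, b: Point) -> Point:
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


def edge_lengths(quad: Quad) -> List[float]:
    """Lengths of the four edges, walking the corners in order."""
    return [math.dist(quad[i], quad[(i + 1) % 4]) for i in range(4)]


def subdivide(quad: Quad, threshold: float) -> List[Quad]:
    """Return the quad if all edges are short enough, else split into 4 and recurse."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if all(length < threshold for length in edge_lengths(quad)):
        return [quad]
    p0, p1, p2, p3 = quad
    m01, m12 = midpoint(p0, p1), midpoint(p1, p2)
    m23, m30 = midpoint(p2, p3), midpoint(p3, p0)
    center = midpoint(m01, m23)
    children = [
        (p0, m01, center, m30),
        (m01, p1, m12, center),
        (center, m12, p2, m23),
        (m30, center, m23, p3),
    ]
    result: List[Quad] = []
    for child in children:
        result.extend(subdivide(child, threshold))
    return result


unit_square: Quad = ((0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0))
micropolygons = subdivide(unit_square, threshold=0.3)
```

Because every split produces 4 children regardless of which edge was too long, the output count grows by a factor of 4 per level.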
38 Fixing Subdivision Cracks: Store equations of edges rather than the coordinates of vertices. Each edge is finalized immediately when it falls under the acceptable tolerance length, even if adjoining edges are still too long. Edge equations are extended to fill in cracks created by Catmull-Clark subdivision.
40 The Rest of the Reyes Pipeline: Shading - Perform all shading in this stage. Use Coherent Access Textures if possible: requires power-of-two sized diced subdivision; good: no texture filtering due to alignment; bad: can result in ~ twice as many micropolygons. Sampling - Better load balancing; primitives are bounded and flat shaded, unlike OpenGL. Composite and Filter - Identical to OpenGL.
41 Experimental Testing: Isim - Full Imagine simulator, used to test OpenGL. Idebug - Faster Imagine simulator (lacks kernel stalls and cluster occupancy effects), used to test Reyes.
42 Problems with the Experimental Testing System: The paper is unclear on why it chose to test Reyes on Idebug rather than Isim. The application of the +20% rule may be faulty here, since it is not shown that Reyes is an “average” application. Extra scratchpad memory, more microcode store, and more local registers were added for Reyes. Were they added for OpenGL?
43 Test Scenes: Teapot-N - N is either 20, 30, 40, or 64. Contains 2N^2 triangles and 3 light sources. Pin textures, point sampled textures. Pin textures, mipmapped textures. Armadillo - Complex marble procedural shader (1200 flops / fragment!).
44 Results: Scenes rendered by OpenGL have a much higher frame rate than scenes rendered by Reyes (an order of magnitude difference).
46 Results (page 2): Subdivision (in the geometry stage) is consuming most of the processing time. Recall this is a stage unique to Reyes. Even excluding the subdivision stage, Reyes is still twice as slow as OpenGL. The next largest bulk of total execution time is the vertex processor. Since it is now doing both forms of shading, it is reasonable that it will take longer.
47 Results (page 3): Too many 0-coverage quads are generated. These quads usually have extreme aspect ratios and cover no pixels at all in screen space. Cause: 2-way split at each subdivision iteration, producing 4 child quads. Solution: Use a 1-way split at each subdivision iteration, producing 2 child quads instead.
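A Python sketch of the proposed fix, under the assumption that a "1-way split" means halving the quad only along its longer direction. As in any such toy, quads are four 2D corner points split at edge midpoints, and the function name `subdivide_one_way` is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Quad = Tuple[Point, Point, Point, Point]


def midpoint(a: Point, b: Point) -> Point:
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


def subdivide_one_way(quad: Quad, threshold: float) -> List[Quad]:
    """Split only across the longer direction, producing 2 children per step."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    p0, p1, p2, p3 = quad
    u_len = max(math.dist(p0, p1), math.dist(p3, p2))  # edges along "u"
    v_len = max(math.dist(p1, p2), math.dist(p0, p3))  # edges along "v"
    if u_len < threshold and v_len < threshold:
        return [quad]
    if u_len >= v_len:
        a, b = midpoint(p0, p1), midpoint(p3, p2)
        children = [(p0, a, b, p3), (a, p1, p2, b)]
    else:
        a, b = midpoint(p1, p2), midpoint(p0, p3)
        children = [(p0, p1, a, b), (b, a, p2, p3)]
    result: List[Quad] = []
    for child in children:
        result.extend(subdivide_one_way(child, threshold))
    return result


thin: Quad = ((0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (0.0, 1.0))
pieces = subdivide_one_way(thin, threshold=1.5)
```

For the 4 x 1 input, the short direction is never split, so the result is 4 square quads; a split into 4 children at every step would instead preserve the 4:1 aspect ratio and keep shrinking the already-short edges.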
49 Results (page 4): Small triangles bog down OpenGL. There is less opportunity to interpolate on small triangles, so the full shading computation needs to be used more often. High triangle volume means that the start-up cost of the rasterizer is incurred more often.
51 Future Work: Use a better subdivision algorithm (more adaptive, more efficient, artifact free, with high performance). Add more ALU clusters. Experiment with hybrid systems that are based on Imagine but contain more dedicated, specialized hardware.
52 Conclusions: Although the Reyes implementation was slow, current trends in computer graphics (smaller triangles, complex shading, memory bandwidth limits) continue to make Reyes and similar pipelines attractive.
53 Final Impressions: Both pipelines were implemented by Owens et al., which is good for quasi-standardization. Like the first paper, this paper wanted to add specialized hardware, so we may be at the limit of Imagine’s graphical power, as it was initially designed. Good observations / ideas, but not very convincing experimental evidence. The goal of comparing Reyes and OpenGL was not achieved.