Graphics on a Stream Processor

Presentation transcript:

Graphics on a Stream Processor Peter Djeu March 20, 2003

Polygon Rendering on a Stream Architecture Owens, Dally, Kapasi, Rixner, Mattson, Mowery Stanford Computer Sciences Lab

Goals Create a working version of OpenGL on the Imagine Stream Processor. Why Imagine? Imagine is programmable; contemporary graphics cards were not. Use programmability and flexibility to head towards the goal of 80 million polys / frame. The trend is toward more flexibility: over 200 extensions to OpenGL have been proposed, and people already use complex software rendering techniques (like multipass hacks, e.g. shadows).

Imagine Review 1

Imagine Review 2

Basic Definitions Streams are sets of data elements. All elements are a single data type. Index streams Kernels are pieces of code that operate on streams. They take a stream as input and produce a stream as output. Kernels can be chained together.
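
The stream and kernel definitions above can be sketched in a few lines of Python. This is only an illustration of the programming model, not Imagine code: the names `scale_kernel`, `offset_kernel`, and `chain` are hypothetical, and Python generators stand in for hardware streams.

```python
from typing import Callable, Iterable, Iterator

# A kernel consumes one stream and produces another (hypothetical type alias).
Kernel = Callable[[Iterable[float]], Iterator[float]]


def scale_kernel(stream: Iterable[float]) -> Iterator[float]:
    """Apply the same operation (multiply by 2) to every stream element."""
    for x in stream:
        yield 2.0 * x


def offset_kernel(stream: Iterable[float]) -> Iterator[float]:
    """Apply the same operation (add 1) to every stream element."""
    for x in stream:
        yield x + 1.0


def chain(*kernels: Kernel) -> Kernel:
    """Chain kernels so the output stream of one feeds the next."""
    def run(stream: Iterable[float]) -> Iterator[float]:
        for kernel in kernels:
            stream = kernel(stream)
        return iter(stream)
    return run


pipeline = chain(scale_kernel, offset_kernel)
print(list(pipeline([1.0, 2.0, 3.0])))
```

Every element has the same type and passes through the same operations, which is the property the later slides rely on.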

Basic Definitions (page 2) Instruction-Level Parallelism - issuing independent instructions in the same cycle (ex: 6 functional units in an ALU cluster, VLIW) Data-Level Parallelism - performing the same operation on multiple pieces of data (ex: 8 ALU clusters operating on a single stream, vector computing)

Basic Definitions (page 3) Producer-Consumer Locality Occurs when one component of a system produces something that is immediately consumed by another component of the system. In the polygon rendering pipeline, it occurs when one kernel produces a stream that is immediately consumed by another kernel. The Stream Register File (SRF) and local registers exploit producer-consumer locality.

Why are Streams Good for Graphics? Rendering is inherently parallel: triangles can be processed in parallel, but may need to be ordered when drawn to the screen. No need to cache primitives, which are rendered and then discarded (textures are the exception). Latency is not as important with streams; stream proc’s emphasize max throughput. VLSI has allowed us to build stream proc’s.

Homogeneous Streams Kernels are most efficient if all elements in the stream require identical operations (less complexity and no stalls from conditionals). Use conditional streams to separate heterogeneous streams into homogeneous ones. Ex: backface culling (front-facing, back-facing)
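
A minimal sketch of the backface-culling example: one mixed input stream is split into two homogeneous output streams. It assumes 2D screen-space triangles and that counter-clockwise winding means front-facing; both are assumptions for illustration, and `split_by_facing` is a hypothetical name.

```python
def signed_area(tri):
    """Signed area of a 2D triangle; positive for counter-clockwise winding."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return 0.5 * ((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))


def split_by_facing(triangles):
    """Route each triangle to one of two output streams based on a condition.

    Assumption: counter-clockwise triangles are front-facing.
    """
    front, back = [], []
    for tri in triangles:
        if signed_area(tri) > 0:
            front.append(tri)
        else:
            back.append(tri)
    return front, back


ccw = ((0, 0), (1, 0), (0, 1))
cw = ((0, 0), (0, 1), (1, 0))
front, back = split_by_facing([ccw, cw, ccw])
print(len(front), len(back))
```

Downstream kernels then see only `front`, so every element they process takes the same code path.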

Benefits of Using Imagine Memory hierarchy: local registers are great for ALU’s; the SRF is a big cache to exploit producer-consumer locality. SIMD architecture exploits parallelism and homogeneity in streams. One kernel runs at a time, and it is run on all 8 ALU clusters (data-level parallelism on streams). Granularity is at the stream level, so fewer instruction issues and less bandwidth are needed.

3 Stage Pipeline Geometry - generates triangles Rasterization - generates fragments Composite - generates pixels See figure: each node is a kernel, each edge is a stream

Diagram of General Polygon Pipeline

Load Balancing (A Problem) Large triangles take longer to be rasterized, potentially slowing down the pipeline. Sol’n: every cycle, ALU clusters fetch if they are idle and keep processing otherwise Sol’n may not be good for all situations, since we incur fetch cost even if no one really needs to fetch (ex: when triangles are all large but roughly equal in size)

Ordering (A Problem) OpenGL requires in-order completion What situations require in-order completion? What situations do not require in-order comp.? Can we solve this ordering problem without having to sort? For instance, ask user to label triangles with priority. Or assign priority based on orig. order and use a second Z-buffer test using the priority val.

Ordering (A Problem) (page 2) Paper’s sol’n: Create 2 streams, then concatenate ordered onto unordered. Assign an in-order ID to each triangle in the stream (now an extra ID value is in the stream). Since overlaps are rare, use a hash func. to find when two fragments overlap on the screen. Resolve overlaps by using the ID number. Unclear: how did they implement a hash table with 2 bits per hash entry (32 * 8 words total)?
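
One way to read the two-stream idea is sketched below. It is not the paper's implementation: the hash function, the table size, and the dictionary layout of a fragment are all made up for illustration, and a Python `Counter` replaces the 2-bit hardware table. Fragments whose hash slot is hit only once cannot overlap anything and go to the unordered stream; the rest (true overlaps plus false conflicts) are sorted by triangle ID and appended.

```python
from collections import Counter


def pixel_hash(x, y, table_size=256):
    """Hypothetical hash of a screen position into a small table."""
    return (31 * x + 17 * y) % table_size


def order_fragments(fragments, table_size=256):
    """Return fragments with possible overlaps resolved in triangle-ID order."""
    slots = [pixel_hash(f["x"], f["y"], table_size) for f in fragments]
    hits = Counter(slots)
    # Slot hit once: no other fragment can share this pixel.
    unordered = [f for f, s in zip(fragments, slots) if hits[s] == 1]
    # Slot hit more than once: a real overlap or a false conflict.
    conflicting = [f for f, s in zip(fragments, slots) if hits[s] > 1]
    ordered = sorted(conflicting, key=lambda f: f["id"])
    return unordered + ordered


frags = [
    {"x": 1, "y": 1, "id": 5},
    {"x": 2, "y": 3, "id": 4},
    {"x": 1, "y": 1, "id": 3},
]
print([f["id"] for f in order_fragments(frags)])
```

A smaller table produces more false conflicts, which pushes more fragments through the sort even though they never overlap on screen.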

Producer-Consumer Locality (An Advantage) Break up the entire input into batches, then make sure the batches fit inside the SRF. Trips to main memory can really be reduced. But Owens et al. batch their input before they begin processing. Is this fair? Dynamic batching may not be as effective as their hand-crafted batches.
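
A rough sketch of static batching, assuming a fixed per-primitive footprint in the SRF. The parameter names and the numbers in the example are hypothetical; in practice the footprint would have to cover every intermediate stream a primitive generates, which is exactly what makes batch sizing hard when triangle size varies.

```python
def make_batches(primitives, srf_capacity_words, words_per_primitive):
    """Split the input into batches whose working set fits in the SRF.

    Both capacity arguments are illustrative values supplied by the caller.
    """
    batch_size = max(1, srf_capacity_words // words_per_primitive)
    return [
        primitives[i:i + batch_size]
        for i in range(0, len(primitives), batch_size)
    ]


batches = make_batches(list(range(10)), srf_capacity_words=12, words_per_primitive=4)
print(batches)
```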

Producer Consumer Locality Using the Stream Register File (SRF)

Other Advantages to the Imagine Implementation Latency Tolerance - ALU clusters process the current batch while the memory system fetches the next batch Pipeline reordering - ex: turn off blending and move the texturing kernel after the depth test kernel, cull non-visible fragments Flexible resource allocation - hardware is reused at each stage, no a priori allocation

4 Testing Platforms Software - no hardware acceleration. Imagine. Nvidia Quadro (real). Nvidia Quadro (ideal) - a scaling of the Quadro’s real-world performance to its advertised peak performance. Is this a good idea or a bad idea?

Frame Rates for the 4 Scenes

4 Test Scenes and Batch Sizes Sphere - 81920 total triangles, B: 80 triangles Advs-1 - point-sampled 512 x 512 texture mapped, B: 256 vertices Advs-8 - mipmapped 512 x 512 base level texture, B: 24 vertices Fill - fills entire window, 512 x 512 texture map, B: 16 vertices Why is there such a dramatic drop in batch size between Advs-1 and Advs-8?

Results Low memory-to-SRF traffic, compared to SRF-to-local-register-file traffic, means the current memory hierarchy is being well utilized (i.e. hardware is exploiting producer-consumer locality). Hash function produces too many false conflicts, leading to too much computation (see Fig. 9). Future work: better hash func.

Results (page 2) Not enough ALU’s - ADVS-8 needed ~2.5 times the ALU ops of ADVS-1, but performance dropped > 50% (however, batch size also dropped significantly). Batch size is often too small to be efficient: startup costs are paid per batch, as are priming and draining inner loops (sub-optimal saturation as loops are starting/winding down?)
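
The batch-size effect can be illustrated with a toy amortization model. This model and its numbers are assumptions for illustration only and are not taken from the paper: each batch pays a fixed startup cost, so the fraction of cycles spent on useful work grows with batch size.

```python
def batch_efficiency(batch_size, per_item_cycles, startup_cycles):
    """Fraction of cycles doing useful work when startup is paid once per batch."""
    useful = batch_size * per_item_cycles
    return useful / (useful + startup_cycles)


# Illustrative numbers only: 10 cycles per element, 160 cycles of startup.
for size in (16, 24, 80, 256):
    print(size, round(batch_efficiency(size, 10, 160), 3))
```

Under these made-up costs a 16-element batch spends half its cycles on startup, while a 256-element batch spends under a tenth.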

Measure of Relative Performance (Logarithmic)

Batch Processing, Clearing Z Buffer in Background

Future Work via Hardware Extensions Problem: The Imagine hardware cannot currently compete with the powerful rasterizers in Graphics Cards Sol’n: Add hardware that is dedicated to Graphics and polygon rendering, such as: texture cache ALU in memory system for doing on-the-spot depth test, alpha blend, filtering

Final Impressions Impressive that they got the OpenGL pipeline working on a general-purpose stream processor Not enough ALU power, and things may not improve with more hardware (ordering and communication constraints may still bottleneck) Programmability is good, but not worth it if the overall performance on existing applications is worse Limited size SRF very limiting when triangle size (and hence generated fragments) is unbounded

Comparing Reyes and OpenGL on a Stream Architecture Owens, Khailany, Towles, Dally Stanford Computer Sciences Lab

Motivation Recent trends in graphics: Smaller triangles in modeling Demand for more complex shading Memory bandwidth becoming more scarce The Reyes Rendering Pipeline addresses these issues.

OpenGL Pipeline Stages Transformations and vertex ops - 1st pass for shading (interpolation, light, glColor) Assemble / Clip / Project - view frustum Rasterize - triangles become fragments Fragment ops - 2nd pass for shading (texturing, blending) Visibility / Filter - depth buffer, composite

Reyes Rendering Pipeline Dice / Split - primitives are subdivided into micropolygons Shade - shading done in eye space (in the paper, a screen space calculation is used) Sample - projection to screen space Visibility / Filter - depth-buffer, composite

The Reyes and OpenGL Pipeline Stages

Differences Between OpenGL and Reyes

Why are We Using Imagine? (again?) Imagine is parallel, exploits producer-consumer locality Imagine is not specifically designed for either pipeline so it is a good platform to use to compare and contrast 2 pipelines Better than OpenGL optimized graphics cards because it is not so specialized

Stanford Real-Time Shading Language (RTSL) High level language for writing shader programs for targeted hardware (Imagine) For OpenGL: Generates code for a Vertex Program and a Fragment Program For Reyes: Generates code for a Vertex Program (4 vertices to a micropolygon)

Subdividing Micropolygons Primitives start out as a collection of B-Spline Control Points. 4 control points are treated as 1 quad. If all edges of a quad are less than a threshold length, then return the quad. Else subdivide the quad into 4 quads and repeat the test.
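
The recursive test can be sketched as follows. This is a simplification: it splits flat quads at edge midpoints rather than evaluating a B-spline surface, and the function names are hypothetical.

```python
import math


def midpoint(a, b):
    """Midpoint of two points given as coordinate tuples."""
    return tuple((p + q) / 2.0 for p, q in zip(a, b))


def edge_lengths(quad):
    """Lengths of the four edges of a quad (p0, p1, p2, p3)."""
    return [math.dist(quad[i], quad[(i + 1) % 4]) for i in range(4)]


def subdivide(quad, threshold):
    """Return quads whose edges are all shorter than threshold (must be > 0)."""
    if all(length < threshold for length in edge_lengths(quad)):
        return [quad]
    p0, p1, p2, p3 = quad
    m01, m12 = midpoint(p0, p1), midpoint(p1, p2)
    m23, m30 = midpoint(p2, p3), midpoint(p3, p0)
    center = midpoint(m01, m23)
    children = [
        (p0, m01, center, m30),
        (m01, p1, m12, center),
        (center, m12, p2, m23),
        (m30, center, m23, p3),
    ]
    result = []
    for child in children:
        result.extend(subdivide(child, threshold))
    return result


unit_square = ((0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0))
print(len(subdivide(unit_square, 0.6)), len(subdivide(unit_square, 0.3)))
```

Because the test looks at every edge, one long edge forces a split in both directions, which is how thin quads get produced.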

Fixing Subdivision Cracks Store equations of edges rather than the coordinates of vertices Each edge is finalized immediately when it falls under the acceptable tolerance length, even if adjoining edges are still too long Edge equations are extended to fill in cracks created by Catmull-Clark subdivision

Subdivision Algorithm that Avoids Cracking

The Rest of the Reyes Pipeline Shading - Perform all shading in this stage. Use Coherent Access Textures if possible: requires power-of-two sized diced subdivision good: no texture filtering due to alignment bad: can result in ~ twice as many micropolygons Sampling - Better load balancing, primitives are bounded and flat shaded, unlike OpenGL Composite and Filter - Identical to OpenGL

Experimental Testing Isim - Full Imagine simulator used to test OpenGL. Idebug - Faster Imagine simulator (lacks kernel stalls or cluster occupancy effects) used to test Reyes

Problems with the Experimental Testing System The paper is unclear on why it chose to test Reyes on Idebug rather than Isim. The application of the +20% rule may be faulty here since it is not shown Reyes is an “average” application. Extra scratchpad memory, more microcode store, and more local registers were added to Reyes. Were they added to OpenGL?

Test Scenes Teapot-N - N is either 20, 30, 40, or 64. Contains 2N^2 triangles and 3 light sources Pin-1 - 5 textures, point sampled textures Pin-8 - 5 textures, mipmapped textures Armadillo - Complex marble procedural shader (1200 flops / fragment!)

Results Scenes rendered by OpenGL have a much higher frame rate than scenes rendered by Reyes (an order of magnitude difference)

Frame Rate of OpenGL vs. Reyes (logarithmic)

Results (page 2) Subdivision (in the geometry stage) is consuming most of the processing time. Recall this is a stage unique to Reyes. Even excluding the subdivision stage, Reyes is still twice as slow as OpenGL. The next largest bulk of total execution time is the vertex processor. Since it is now doing both forms of shading, it is reasonable that it will take longer.

Results (page 3) Too many 0-coverage quads are generated. These quads usually have extreme aspect ratios and cover no pixels at all in screen space. Cause: 2-way split at each subdivision iteration, producing 4 child quads Sol’n: Use a 1-way split at each subdivision iteration, producing 2 child quads instead.
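
One interpretation of the proposed fix is sketched below: split only across the longer direction of the quad, so each step yields 2 children. As with any midpoint sketch, flat quads stand in for the real surface, and `subdivide_one_way` is a hypothetical name rather than the paper's algorithm.

```python
import math


def midpoint(a, b):
    """Midpoint of two points given as coordinate tuples."""
    return tuple((p + q) / 2.0 for p, q in zip(a, b))


def subdivide_one_way(quad, threshold):
    """Split a quad in its longer direction only, until all edges < threshold."""
    p0, p1, p2, p3 = quad
    e0, e1 = math.dist(p0, p1), math.dist(p1, p2)
    e2, e3 = math.dist(p2, p3), math.dist(p3, p0)
    if max(e0, e1, e2, e3) < threshold:
        return [quad]
    if max(e0, e2) >= max(e1, e3):
        # Edges p0-p1 and p2-p3 are longer: cut across them.
        m01, m23 = midpoint(p0, p1), midpoint(p2, p3)
        children = [(p0, m01, m23, p3), (m01, p1, p2, m23)]
    else:
        # Edges p1-p2 and p3-p0 are longer: cut across them.
        m12, m30 = midpoint(p1, p2), midpoint(p3, p0)
        children = [(p0, p1, m12, m30), (m30, m12, p2, p3)]
    result = []
    for child in children:
        result.extend(subdivide_one_way(child, threshold))
    return result


strip = ((0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (0.0, 1.0))
print(len(subdivide_one_way(strip, 1.5)))
```

For the 4 x 1 strip with a threshold of 1.5, this yields four 1 x 1 quads, whereas a four-child midpoint split of the same strip would yield sixteen 1 x 0.25 quads.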

Work Distribution in OpenGL and Reyes

Results (page 4) Small triangles bog down OpenGL Less opportunity to interpolate on small triangles, so full shading computation needs to be used more often High triangle volume means that the start-up cost of the rasterizer is incurred more often

OpenGL Runtime as a Function of Triangle Size

Future Work Use a better subdivision algorithm (more adaptive, more efficient, artifact free, with high performance) Add more ALU clusters Experiment with hybrid systems that are based on Imagine but contain more dedicated, specialized hardware

Conclusions Although the Reyes implementation was slow, current trends in computer graphics (smaller triangles, complex shading, memory bandwidth limits) continue to make Reyes and similar pipelines attractive

Final Impressions Both pipelines were implemented by Owens et al., which is good for quasi-standardization Like the first paper, this paper wanted to add specialized hardware, so we may be at the limit of Imagine’s graphical power, as it was initially designed. Good observations / ideas, but not very convincing experimental evidence. Goal of comparing Reyes and OpenGL was not achieved.