Partitioning Screen Space 2 Rui Wang. Architectural Implications of Hardware- Accelerated Bucket Rendering on the PC (97’) Dynamic Load Balancing for.

Slides:



Advertisements
Similar presentations
Visible-Surface Detection(identification)
Advertisements

COMPUTER GRAPHICS SOFTWARE.
COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.
CS 352: Computer Graphics Chapter 7: The Rendering Pipeline.
Quick Review of Apr 10 material B+-Tree File Organization –similar to B+-tree index –leaf nodes store records, not pointers to records stored in an original.
Render Cache John Tran CS851 - Interactive Ray Tracing February 5, 2003.
03/12/02 (c) 2002 University of Wisconsin, CS559 Last Time Some Visibility (Hidden Surface Removal) algorithms –Painter’s Draw in some order Things drawn.
Memory Management Chapter 7. Memory Management Subdividing memory to accommodate multiple processes Memory needs to be allocated efficiently to pack as.
Graphics Graphics Korea University cgvr.korea.ac.kr 1 Hidden Surface Removal 고려대학교 컴퓨터 그래픽스 연구실.
Chapter 6: Vertices to Fragments Part 2 E. Angel and D. Shreiner: Interactive Computer Graphics 6E © Addison-Wesley Mohan Sridharan Based on Slides.
1 Dr. Scott Schaefer Hidden Surfaces. 2/62 Hidden Surfaces.
Computer Graphics Hardware Acceleration for Embedded Level Systems Brian Murray
Hidden Surface Elimination Wen-Chieh (Steve) Lin Institute of Multimedia Engineering I-Chen Lin’ CG Slides, Rich Riesenfeld’s CG Slides, Shirley, Fundamentals.
3D Graphics Processor Architecture Victor Moya. PhD Project Research on architecture improvements for future Graphic Processor Units (GPUs). Research.
DDDDRRaw: A Prototype Toolkit for Distributed Real-Time Rendering on Commodity Clusters Thu D. Nguyen and Christopher Peery Department of Computer Science.
1 Memory Management Chapter 7. 2 Memory Management Subdividing memory to accommodate multiple processes Memory needs to be allocated to ensure a reasonable.
1 CSCE 441: Computer Graphics Hidden Surface Removal (Cont.) Jinxiang Chai.
IN4151 Introduction 3D graphics 1 Introduction to 3D computer graphics part 2 Viewing pipeline Multi-processor implementation GPU architecture GPU algorithms.
Introduction to Parallel Rendering: Sorting, Chromium, and MPI Mengxia Zhu Spring 2006.
Chapter 11 Operating Systems
Vertices and Fragments III Mohan Sridharan Based on slides created by Edward Angel 1 CS4395: Computer Graphics.
Parallel Graphics Rendering Matthew Campbell Senior, Computer Science
COOL Chips IV A High Performance 3D Graphics Rasterizer with Effective Memory Structure Woo-Chan Park, Kil-Whan Lee*, Seung-Gi Lee, Moon-Hee Choi, Won-Jong.
Hidden Surface Removal
Sort-Last Parallel Rendering for Viewing Extremely Large Data Sets on Tile Displays Paper by Kenneth Moreland, Brian Wylie, and Constantine Pavlakos Presented.
Memory Management Chapter 7.
The Visibility Problem In many environments, most of the primitives (triangles) are not visible most of the time –Architectural walkthroughs, Urban environments.
A Sorting Classification of Parallel Rendering Molnar et al., 1994.
Ramazan Bitirgen, Engin Ipek and Jose F.Martinez MICRO’08 Presented by PAK,EUNJI Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors.
Matrices from HELL Paul Taylor Basic Required Matrices PROJECTION WORLD VIEW.
Week 2 - Friday.  What did we talk about last time?  Graphics rendering pipeline  Geometry Stage.
10/26/04© University of Wisconsin, CS559 Fall 2004 Last Time Drawing lines Polygon fill rules Midterm Oct 28.
Visible-Surface Detection Jehee Lee Seoul National University.
Shared Memory Parallelization of Decision Tree Construction Using a General Middleware Ruoming Jin Gagan Agrawal Department of Computer and Information.
Introduction to Parallel Rendering Jian Huang, CS 594, Spring 2002.
3D Graphics for Game Programming Chapter IV Fragment Processing and Output Merging.
Stream Processing Main References: “Comparing Reyes and OpenGL on a Stream Architecture”, 2002 “Polygon Rendering on a Stream Architecture”, 2000 Department.
Hidden Surface Removal 1.  Suppose that we have the polyhedron which has 3 totally visible surfaces, 4 totally invisible/hidden surfaces, and 1 partially.
Real-time Graphics for VR Chapter 23. What is it about? In this part of the course we will look at how to render images given the constrains of VR: –we.
1 Memory Management Chapter 7. 2 Memory Management Subdividing memory to accommodate multiple processes Memory needs to be allocated to ensure a reasonable.
Interactive Rendering With Coherent Ray Tracing Eurogaphics 2001 Wald, Slusallek, Benthin, Wagner Comp 238, UNC-CH, September 10, 2001 Joshua Stough.
Computer Graphics Chapter 6 Andreas Savva. 2 Interactive Graphics Graphics provides one of the most natural means of communicating with a computer. Interactive.
1Computer Graphics Implementation II Lecture 16 John Shearer Culture Lab – space 2
Implementation II Ed Angel Professor of Computer Science, Electrical and Computer Engineering, and Media Arts University of New Mexico.
Binary Space Partitioning Trees Ray Casting Depth Buffering
Hidden Surface Removal
Implementation II.
1 Memory Management Chapter 7. 2 Memory Management Subdividing memory to accommodate multiple processes Memory needs to be allocated to ensure a reasonable.
Image Synthesis Rabie A. Ramadan, PhD 4. 2 Review Questions Q1: What are the two principal tasks required to create an image of a three-dimensional scene?
Advanced Computer Graphics Spring 2014 K. H. Ko School of Mechatronics Gwangju Institute of Science and Technology.
Review on Graphics Basics. Outline Polygon rendering pipeline Affine transformations Projective transformations Lighting and shading From vertices to.
Lecture 3 : Performance of Parallel Programs Courtesy : MIT Prof. Amarasinghe and Dr. Rabbah’s course note.
Implementation of a Renderer Consider Programs are processd by the system line & polygon, outside the view volume Efficiently Understanding of the implementation.
Adaptive Sorting “A Dynamically Tuned Sorting Library” “Optimizing Sorting with Genetic Algorithms” By Xiaoming Li, Maria Jesus Garzaran, and David Padua.
Memory Coherence in Shared Virtual Memory System ACM Transactions on Computer Science(TOCS), 1989 KAI LI Princeton University PAUL HUDAK Yale University.
Informationsteknologi Wednesday, October 3, 2007Computer Systems/Operating Systems - Class 121 Today’s class Memory management Virtual memory.
Computer Graphics I, Fall 2010 Implementation II.
Image Processing A Study in Pixel Averaging Building a Resolution Pyramid With Parallel Computing Denise Runnels and Farnaz Zand.
1 CSCE 441: Computer Graphics Hidden Surface Removal Jinxiang Chai.
CS559: Computer Graphics Lecture 12: Antialiasing & Visibility Li Zhang Spring 2008.
Computer Graphics Inf4/MSc 1 Computer Graphics Lecture 5 Hidden Surface Removal and Rasterization Taku Komura.
A Sorting Classification of Parallel Rendering Molnar et al., 1994.
Siggraph 2009 RenderAnts: Interactive REYES Rendering on GPUs Kun Zhou Qiming Hou Zhong Ren Minmin Gong Xin Sun Baining Guo JAEHYUN CHO.
Memory Management Chapter 7.
COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE
Ioannis E. Venetis Department of Computer Engineering and Informatics
Week 2 - Friday CS361.
Parallel Algorithm Design
The Graphics Rendering Pipeline
Improving cache performance of MPEG video codec
Presentation transcript:

Partitioning Screen Space 2 Rui Wang

Architectural Implications of Hardware- Accelerated Bucket Rendering on the PC (97’) Dynamic Load Balancing for Parallel Polygon Rendering (94’) Models of the Impact of Overlap in Bucket Rendering (98’)

Architectural Implications of Hardware- Accelerated Bucket Rendering on the PC Bucket Rendering Divide scene into screen-space tiles and render each tile independently Advantages Smaller memory requirement Parallel rendering Disadvantages Overlap of primitives over multiple buckets Overhead of sorting primitives into buckets

Bucket Rendering RenderMan, PixelPlanes 5, PixelFlow, Apple systems, Talisman Choosing tile size Smaller memory requirement, better load balancing, but more overlaps 32x32 tile with dense SRAM 128x128 tile with embedded DRAM

Architecture Buckets

Bucket Sorting Screen-aligned bounding box Exact bucket sorting

Analytical computation of overlap Molnar 91’, modified to consider r Equation: S w

More Equations … blah, blah, blah minimized when r=1or

Overhead incurred by overlap Main memory bandwidth AGP bandwidth Primitives set-up Experiments Intel Scene Management Trace polygons rendered Bucket size 16, 32, 64, 128, 256

Results Larger size tri doesn’t matter. Many primitives that are smaller than 1K pixel size have with large aspect ratio (Fig.2-9) Expected E o assuming r=1 closely fit observed overlap => Molnar equation is still a good prediction assuming r=1 (Tab.2) Bounding box area scales with screen resolution

Assume triangle set-up: 1M/s Bandwidth of memory: 200~300MB/s AGP bandwidth: 256MB/s 128x128 tile: resource consumption stay below 25%, 15%, 10% respectively for all resolutions (Tab.3) 32x32 tile: resource consumption prohibitively high above 1024x768 (Tab.4) Discussion

Dynamic Load Balancing for Parallel Polygon Rendering A New graphics renderer incorporates novel partitioning methods for efficient Execution on a parallel computer. It requires little overhead, executes efficiently, and demands minimal processor synchronization.

Horizontal, vertical strips, rectangular areas Load balancing: Data nonadaptive: statically, dynamically Data adaptive: less suitable for a single image Data nonadaptive Tile image space to RxP rectangular areas Tradeoff to choosing R (20 is good) This paper uses 2xP by applying a task adaptive work decomposition strategy Image-space Parallel Algorithms

Preprocessing: 13% Data read-in, xforming, normal calc., back-face culling, clipping, perspective proj. Rendering: 86% Hidden-face removal, shading, anti-aliasing, etc Post processing: 1% Display frame buffer or store in file Fig.2 Algorithm Description

Number of objects > P Round-robin distribution Number of objects < P Split into multiple sub objects Parallel xforming or clipping; Reader process assigns data to individual processors Sort Implementation on BBN TC2000 (96 processors), speedup 9.4 Preprocessing

Modified scan-line z-buffer algorithm with stochastic sampling (16 per pixel) R=2: preprocessing overhead reduced The first P regions assigned to P processors Dynamically get from the additional P tasks partition : (steal tasks from other processors) Search for the processor with most work left P max If there is sufficient work to do, proceed, otherwise return Set locks to prevent concurrent stealing Partition P max work into two segments, take the lower one Unset locks and work on the new task Rendering

The splitting strategy maintains coherence at P max side without interruption With task adaptive approach, load balancing is good even for R=2 Effect of initial region aspect ratio on splitting: 2:1 worst, 1:3 best

Find the remaining work on a processor Record and update the current remaining scan lines Fig.5 Avoid race conditions or deadlocks Test and lock: if not locked, store in p_max ; if locked, store in pot_p_max After getting through a lock, the work might have been completed meanwhile Post Processing … Some Issues

Locally Cached (LC) Distributing data among memories Copy exactly the amount of data necessary for computation to local memory After preprocessing, each processor queries and retrieves data for rendering from other processors Data movement during partitioning Quick clip after data retrieved, instead of obtain exactly the lower region task => reduce the amount of data for further partitioning Data Distribution

Load balancing overhead increases with P Communication Overhead also increases with P

Bucket Rendering Subdivide frame buffer into coherent regions that are rendered independently Tile size is a design decision Advantages and disadvantages Models of the Impact of Overlap in Bucket Rendering

Analytical Models Fig.1 defines key terms Per-pixel cost=1 Per-triangle cost for non-tiled=k, Tiled=kO knockout...

Modeling Overlap Overlap factor: Source of error: Assume unit aspect ratio, best for overlap Assume uniform tri area, worst for overlap Assume rho=3

Assume single processing unit for per-triangle and per-pixel work Software Model k=25, S=32, rho=3 a worst ≈ 32.7

Assume perfect pipelining (infinite buffer); processing time depends on the slowest unit Processing time on real systems (Fig.3) Hardware Model k=25k’≈47

Non-tiled: Tiled: Hardware Model a worst ≈ 25

Inefficiency of tiled rendering is limited The worst case occurs at exactly the triangle area where processing the triangle requires as much time as processing its pixels (a worst =k) => pipeline balanced Observations of HW Model

How well the mathematical model works Datasets Experiments large triangles avg small triangles avg. 8.4

OpenGL Stream Codec (GLS) Argus rendering system on top of SimOS (full machine simulator, measurement taken at pixel granularity) Experiments on SW Model

Decide k, O, a: k=9.3 (non-textured), k=3.6 (textured) O real is measured, O calc is predicted a=average tri area measured Experiments on SW Model

Worst efficiency occurs at moderate size (conforms to the model) Limitation: Error of overlap estimation using Molnar equ. Does not consider scan line processing time Analysis on SW Model

Single triangle buffer vs. Infinite Experiments on HW Model

Decide k, O, a: k=25 O real is measured, O calc is predicted a=average tri area measured Experiments on HW Model

The model correctly predict inefficiency peaks at triangle area close to k=25 Interesting thing to understand (???) More buffer space is most effective when average tri area match k (pipeline balanced) For large avg. triangles, tiling moves effective tri area closer to k, thus infinite buffer takes effect For small avg. triangles, tiling moves effective tri area below k, thus infinite buffer has little effect Overlap calc error still exists Analysis on HW Model

The impact of overlap is highest when work is balanced between triangle processing and pixel processing The impact of overlap in pipelined system is limited to a window of tri size around the design point of the system Has limitations Conclusion