A Reconfigurable Architecture for Load-Balanced Rendering Graphics Hardware July 31, 2005, Los Angeles, CA Jiawen Chen Michael I. Gordon William Thies.

Slides:

Advertisements

Similar presentations

Sven Woop Computer Graphics Lab Saarland University

Advertisements

COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.

Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.

Bottleneck Elimination from Stream Graphs S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.

The Raw Architecture Signal Processing on a Scalable Composable Computation Fabric David Wentzlaff, Michael Taylor, Jason Kim, Jason Miller, Fae Ghodrat,

Graphics Pipeline.

Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.

Prepared 5/24/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

Control Flow Virtualization for General-Purpose Computation on Graphics Hardware Ghulam Lashari Ondrej Lhotak University of Waterloo.

Computer Graphics Hardware Acceleration for Embedded Level Systems Brian Murray

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Chapter.

3D Graphics Processor Architecture Victor Moya. PhD Project Research on architecture improvements for future Graphic Processor Units (GPUs). Research.

IN4151 Introduction 3D graphics 1 Introduction to 3D computer graphics part 2 Viewing pipeline Multi-processor implementation GPU architecture GPU algorithms.

Introduction to Parallel Rendering: Sorting, Chromium, and MPI Mengxia Zhu Spring 2006.

ATI GPUs and Graphics APIs Mark Segal. ATI Hardware X1K series 8 SIMD vertex engines, 16 SIMD fragment (pixel) engines 3-component vector + scalar ALUs.

Evolution of the Programmable Graphics Pipeline Patrick Cozzi University of Pennsylvania CIS Spring 2011.

The Graphics Pipeline CS2150 Anthony Jones. Introduction What is this lecture about? – The graphics pipeline as a whole – With examples from the video.

Status – Week 260 Victor Moya. Summary shSim. shSim. GPU design. GPU design. Future Work. Future Work. Rumors and News. Rumors and News. Imagine. Imagine.

GPU Tutorial 이윤진 Computer Game 2007 가을 2007 년 11 월 다섯째 주, 12 월 첫째 주.

GPU Graphics Processing Unit. Graphics Pipeline Scene Transformations Lighting & Shading ViewingTransformations Rasterization GPUs evolved as hardware.

University of Texas at Austin CS 378 – Game Technology Don Fussell CS 378: Computer Game Technology Beyond Meshes Spring 2012.

GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.

1 A Hierarchical Shadow Volume Algorithm Timo Aila 1,2 Tomas Akenine-Möller 3 1 Helsinki University of Technology 2 Hybrid Graphics 3 Lund University.

Under the Hood: 3D Pipeline. Motherboard & Chipset PCI Express x16.

Ray Tracing and Photon Mapping on GPUs Tim PurcellStanford / NVIDIA.

REAL-TIME VOLUME GRAPHICS Christof Rezk Salama Computer Graphics and Multimedia Group, University of Siegen, Germany Eurographics 2006 Real-Time Volume.

Graphics on Key by Eyal Sarfati and Eran Gilat Supervised by Prof. Shmuel Wimer, Amnon Stanislavsky and Mike Sumszyk 1.

1 The Performance Potential for Single Application Heterogeneous Systems Henry Wong* and Tor M. Aamodt § *University of Toronto § University of British.

Orchestration by Approximation Mapping Stream Programs onto Multicore Architectures S. M. Farhad (University of Sydney) Joint work with Yousun Ko Bernd.

Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:

Cg Programming Mapping Computational Concepts to GPUs.

A Sorting Classification of Parallel Rendering Molnar et al., 1994.

Week 2 - Friday.  What did we talk about last time?  Graphics rendering pipeline  Geometry Stage.

Static Translation of Stream Programs S. M. Farhad School of Information Technology The University of Sydney.

Introduction to Parallel Rendering Jian Huang, CS 594, Spring 2002.

StreamX10: A Stream Programming Framework on X10 Haitao Wei School of Computer Science at Huazhong University of Sci&Tech.

Stream Processing Main References: “Comparing Reyes and OpenGL on a Stream Architecture”, 2002 “Polygon Rendering on a Stream Architecture”, 2000 Department.

Shadow Mapping Chun-Fa Chang National Taiwan Normal University.

A Closer Look At GPUs By Kayvon Fatahalian and Mike Houston Presented by Richard Stocker.

GRAPHICS PIPELINE & SHADERS SET09115 Intro to Graphics Programming.

Polygon Rendering on a Stream Architecture John D. Owens, William J. Dally, Ujval J. Kapasi, Scott Rixner, Peter Mattson, Ben Mowery Concurrent VLSI Architecture.

Department of Computer Science 1 Beyond CUDA/GPUs and Future Graphics Architectures Karu Sankaralingam University of Wisconsin-Madison Adapted from “Toward.

Advanced Computer Graphics Spring 2014 K. H. Ko School of Mechatronics Gwangju Institute of Science and Technology.

May 8, 2007Farid Harhad and Alaa Shams CS7080 Overview of the GPU Architecture CS7080 Final Class Project Supervised by: Dr. Elias Khalaf By: Farid Harhad.

Profile Guided Deployment of Stream Programs on Multicores S. M. Farhad The University of Sydney Joint work with Yousun Ko Bernd Burgstaller Bernhard Scholz.

A SEMINAR ON 1 CONTENT 2  The Stream Programming Model  The Stream Programming Model-II  Advantage of Stream Processor  Imagine’s.

Orchestration by Approximation Mapping Stream Programs onto Multicore Architectures S. M. Farhad (University of Sydney) Joint work with Yousun Ko Bernd.

A Common Machine Language for Communication-Exposed Architectures Bill Thies, Michal Karczmarek, Michael Gordon, David Maze and Saman Amarasinghe MIT Laboratory.

Fateme Hajikarami Spring  What is GPGPU ? ◦ General-Purpose computing on a Graphics Processing Unit ◦ Using graphic hardware for non-graphic computations.

From Turing Machine to Global Illumination Chun-Fa Chang National Taiwan Normal University.

COMPUTER GRAPHICS CS 482 – FALL 2015 SEPTEMBER 29, 2015 RENDERING RASTERIZATION RAY CASTING PROGRAMMABLE SHADERS.

Ray Tracing using Programmable Graphics Hardware

StreamIt on Raw StreamIt Group: Michael Gordon, William Thies, Michal Karczmarek, David Maze, Jasper Lin, Jeremy Wong, Andrew Lamb, Ali S. Meli, Chris.

Mapping Computational Concepts to GPUs Mark Harris NVIDIA.

The Graphics Pipeline Revisited Real Time Rendering Instructor: David Luebke.

Ray Tracing by GPU Ming Ouhyoung. Outline Introduction Graphics Hardware Streaming Ray Tracing Discussion.

High-Bandwidth Packet Switching on the Raw General-Purpose Architecture Gleb Chuvpilo Saman Amarasinghe MIT LCS Computer Architecture Group January 9,

A Sorting Classification of Parallel Rendering Molnar et al., 1994.

Static Translation of Stream Program to a Parallel System S. M. Farhad The University of Sydney.

GPU Architecture and Its Application

COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE

A Common Machine Language for Communication-Exposed Architectures

Graphics Processing Unit

From Turing Machine to Global Illumination

The Graphics Rendering Pipeline

A Reconfigurable Architecture for Load-Balanced Rendering

Graphics Processing Unit

A Hierarchical Shadow Volume Algorithm

RADEON™ 9700 Architecture and 3D Performance

CIS 6930: Chip Multiprocessor: GPU Architecture and Programming

Presentation transcript:

A Reconfigurable Architecture for Load-Balanced Rendering Graphics Hardware July 31, 2005, Los Angeles, CA Jiawen Chen Michael I. Gordon William Thies Matthias Zwicker Kari Pulli Frédo Durand

The Load Balancing Problem GPUs: fixed resource allocation –Fixed number of functional units per task –Horizontal load balancing achieved via data parallelism –Vertical load balancing impossible for many applications Our goal: flexible allocation –Both vertical and horizontal –On a per-rendering pass basis V R T F D V R T F D V R T F D V R T F D task parallel data parallel Parallelism in multiple graphics pipelines

Application-specific load balancing Screenshot from Counterstrike Input Vertex Sync Triangle Setup Pixel V P Simplified graphics pipeline

Application-specific load balancing Screenshot from Doom 3 Input Vertex Sync Triangle Setup V R Simplified graphics pipeline Rest of Pixel Pipeline Rest of Pixel Pipeline Rasterizer

Our Approach: Hardware Use a general-purpose multi-core processor –With a programmable communications network –Map pipeline stages to one or more cores MIT Raw Processor –16 general purpose cores –Low-latency programmable network Die Photo of 16-tile Raw chip Diagram of a 4x4 Raw processor

Our Approach: Software Specify graphics pipeline in software as a stream program –Easily reconfigurable Static load balancing –Stream graph specifies resource allocation –Tailor stream graph to rendering pass StreamIt programming language Input Vertex join split Triangle Setup split Pixel V P Sort-middle graphics pipeline stream graph

Benefits of Programmable Approach Compile stream program to multi-core processor Flexible resource allocation Fully programmable pipeline –Pipeline specialization Nontraditional configurations –Image processing –GPGPU Stream graph for graphics pipeline StreamIt Layout on 8x8 Raw

Related Work Scalable Architectures –Pomegranate [Eldridge et al., 2000] Streaming Architectures –Imagine [Owens et al., 2000] Unified Shader Architectures –ATI Xenos

Outline Background –Raw Architecture –StreamIt programming language Programmer Workflow –Examples and Results Future Work

The Raw Processor A scalable computation fabric –Mesh of identical tiles –No global signals Programmable interconnect –Integrated into bypass paths –Register mapped –Fast neighbor communications –Essential for flexible resource allocation Raw tiles –Compute processor –Programmable Switch Processor A 4x4 Raw chip Computation Resources Switch Processor Diagram

The Raw Processor Current hardware –180nm process –16 tiles at 425 MHz –6.8 GFLOPS peak –47.6 GB/s memory bandwidth Simulation results based on 8x8 configuration –64 tiles at 425 MHz –27.2 GFLOPS peak –108.8 GB/s memory bandwidth (32 ports) Die photo of 16-tile Raw chip 180nm process, 331 mm 2

StreamIt High-level stream programming language –Architecture independent Structured Stream Model –Computation organized as filters in a stream graph –FIFO data channels –No global notion of time –No global state Example stream graph

StreamIt Graph Constructs parallel computation may be any StreamIt language construct joiner splitter pipeline feedback loop joiner splitter splitjoin filter Graphics pipeline stream graph

Automatic Layout and Scheduling StreamIt compiler performs layout, scheduling on Raw –Simulated annealing layout algorithm –Generates code for compute processors –Generates routing schedule for switch processors Layout on 8x8 Raw Input Vertex Processor Sync Triangle Setup Rasterizer Pixel Processor Frame Buffer StreamIt Compiler Stream graph

Outline Background –Raw Architecture –StreamIt programming language Programmer Workflow –Examples and Results Future Work

Programmer Workflow For each rendering pass –Estimate resource requirements –Implement pipeline in StreamIt –Adjust splitjoin widths –Compile with StreamIt compiler –Profile application Input Vertex join split Triangle Setup split Pixel V P Sort-middle Stream Graph

Switching Between Multiple Configurations Multi-pass rendering algorithms –Switch configurations between passes –Pipeline flush required anyway (e.g. shadow volumes) Configuration 1Configuration 2

Experimental Setup Compare reconfigurable pipeline against fixed resource allocation Use same inputs on Raw simulator Compare throughput and utilization Manual layout on Raw Fixed Resource Allocation: 6 vertex units, 15 pixel pipelines Input Vertex Processor Sync Triangle Setup Rasterizer Pixel Processor Frame Buffer

Example: Phong Shading Per-pixel phong-shaded polyhedron 162 vertices, 1 light Covers large area of screen Allocate only 1 vertex unit Exploit task parallelism –Devote 2 tiles to pixel shader –1 for computing the lighting direction and normal –1 for shading Pipeline specialization –Eliminate texture coordinate interpolation, etc Output, rendered using the Raw simulator

Phong Shading Stream Graph Input Vertex Processor Triangle Setup Rasterizer Pixel Processor A Frame Buffer Pixel Processor B Phong Shading Stream GraphAutomatic Layout on Raw

Utilization Plot: Phong Shading Fixed pipeline Reconfigurable pipeline

Example: Shadow Volumes 4 textured triangles, 1 point light Very large shadow volumes cover most of the screen Rendered in 3 passes –Initialize depth buffer –Draw extruded shadow volume geometry with Z-fail algorithm –Draw textured triangles with stencil testing Different configuration for each pass –Adjust ratio of vertex to pixel units –Eliminate unused operations Output, rendered using the Raw simulator

Shadow Volumes Stream Graph: Passes 1 and 2 Input Vertex Processor Triangle Setup Rasterizer Frame Buffer

Shadow Volumes Stream Graph: Pass 3 Input Vertex Processor Triangle Setup Rasterizer Texture Lookup Frame Buffer Texture Filtering Shadow Volumes Pass 3 Stream GraphAutomatic Layout on Raw

Utilization Plot: Shadow Volumes Fixed pipeline Reconfigurable pipeline Pass 1Pass 2Pass 3 Pass 1Pass 2Pass 3

Limitations Software rasterization is extremely slow –55 cycles per fragment Memory system –Technique does not optimize for texture access

Future Work Augment Raw with special purpose hardware Explore memory hierarchy –Texture prefetching –Cache performance Single-pass rendering algorithms –Load imbalances may occur within a pass –Decompose scene into multiple passses –Tradeoff between throughput gained from better load balance and cost of flush Dynamic Load Balancing

Summary Reconfigurable Architecture –Application-specific static load balancing –Increased throughput and utilization Ideas: –General-purpose multi-core processor –Programmable communications network –Streaming characterization

Acknowledgements Mike Doggett, Eric Chan David Wentzlaff, Patrick Griffin, Rodric Rabbah, and Jasper Lin John Owens Saman Amarasinghe Raw group at MIT DARPA, NSF, MIT Oxygen Alliance