Beyond CUDA/GPUs and Future Graphics Architectures. Karu Sankaralingam, University of Wisconsin-Madison.

Presentation transcript:

Department of Computer Science 1 Beyond CUDA/GPUs and Future Graphics Architectures. Karu Sankaralingam, University of Wisconsin-Madison. Adapted from “Toward A Multicore Architecture for Real-time Raytracing,” MICRO-41, 2008, by Venkatraman Govindaraju, Peter Djeu, Karthikeyan Sankaralingam, Mary Vernon, and William R. Mark.

Department of Computer Science 2 Real-time Graphics Rendering Today

Department of Computer Science 3 Real-time Graphics Rendering Today Future

Department of Computer Science 4 Real-time Graphics Rendering What are the problems? How can we get there?

Department of Computer Science 5 What is wrong with this picture?

Department of Computer Science 6 GPU/CUDA [Diagram label: Z-buffer]

Department of Computer Science 7 “Ptolemaic” Graphic Universe [Diagram labels: Z-buffer, Arch, Application]
• Architecture and application are all optimized for the Z-buffer
• Difficult to render images with realistic effects
– Self-reflection, soft shadows, ambient occlusion
• Problems:
– Scene constraints, artist and programmer productivity

Department of Computer Science 8 Current Graphics Architectures (Courtesy: ACM Queue)

Department of Computer Science 9 How did we get here?
• Hardware rasterizers and perspective-correct texture mapping (RIVA 128)
• Single-pass multitexture (TNT / TNT2)
• Register combiners: a generalization of multitexture (GeForce 256)
• Per-pixel shading (GeForce 2 GTS)
• Programmable hardware pixel shading
• Programmable vertex shading
• CUDA

Department of Computer Science 10 “Copernican” Graphic Universe [Diagram labels: Algorithm, Arch, Application, Ray-tracing]
• Architecture and application revolve around the algorithm
• More general-purpose algorithm
• Easier to provide realistic effects
• Architecture can support other applications

Department of Computer Science 11 Future Graphics Architectures (Courtesy: ACM Queue)

Department of Computer Science 12 Executive Summary: Copernicus System
• Co-designed application, architecture, and analysis framework
• Path from a specialized graphics architecture to a more general-purpose architecture
• A detailed characterization and analysis framework
• Real-time frame rates possible for high-quality dynamic scenes

Department of Computer Science 13 Outline
• Motivation
• Copernicus system
– Graphics Algorithm: Razor
– Architecture
– Evaluation and Results
• Summary

Department of Computer Science 14 Ray-tracing [Diagram labels: Full scene, Cube, Cylinder]
• Simulates the behavior of light rays through a 3D scene
• Rays from eye to scene (primary rays)
• Rays from hit point to light (secondary rays)
• Acceleration structure (e.g., BSP tree) for efficiency
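
The two ray types on this slide can be sketched in a few lines. The following is a minimal illustration only, not the Razor implementation: the scene representation (dicts with `center` and `radius`), the function names, and the brute-force loop over spheres are all assumptions made for the example. A real tracer would replace the loops with an acceleration structure such as the BSP tree mentioned above.

```python
import math

def hit_sphere(origin, direction, center, radius):
    """Nearest positive distance along a unit-length ray to a sphere, or None."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * x for d, x in zip(direction, oc))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):
        if t > 1e-6:  # small epsilon avoids re-hitting the surface we start on
            return t
    return None

def trace(eye, direction, spheres, light):
    """Cast a primary ray; on a hit, cast one ray from the hit point to the light."""
    nearest = None
    for s in spheres:  # brute force; an acceleration structure would prune this loop
        t = hit_sphere(eye, direction, s["center"], s["radius"])
        if t is not None and (nearest is None or t < nearest):
            nearest = t
    if nearest is None:
        return "miss"
    hit = [e + nearest * d for e, d in zip(eye, direction)]
    to_light = [l - h for l, h in zip(light, hit)]
    dist = math.sqrt(sum(x * x for x in to_light))
    shadow_dir = [x / dist for x in to_light]
    for s in spheres:
        ts = hit_sphere(hit, shadow_dir, s["center"], s["radius"])
        if ts is not None and ts < dist:
            return "shadowed"  # something blocks the path to the light
    return "lit"
```

Calling `trace` with an eye position, a unit direction, a list of spheres, and a light position returns `"miss"`, `"lit"`, or `"shadowed"` for that one pixel's ray.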

Department of Computer Science 15 Disadvantages of Raytracing
• The acceleration structure must be rebuilt every frame for dynamic scenes
• Traversing the acceleration structure causes irregular data accesses
• Tracing secondary rays at higher resolution increases computation

Department of Computer Science 16 Razor: A Dynamic Multiresolution Raytracer [Diagram labels: Cube, Cylinder, Thread 1, Thread 2]
• Packet ray-tracer: traces a beam of rays instead of a single ray
– Opportunity for data-level parallelism
• Each thread lazily builds its own acceleration structure (kd-tree)
– Builds only the portion of the structure it needs
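
Lazy construction, as described on this slide, can be illustrated with a toy kd-tree that splits a node only when a query first reaches it. This is a sketch under simplifying assumptions, not Razor's data structure: it stores 2D points instead of scene geometry, answers point-location queries instead of ray traversals, and the names `LazyKdTree`, `locate`, and `splits_done` are hypothetical.

```python
class LazyKdTree:
    """Toy kd-tree over 2D points; nodes are split on first visit only."""

    def __init__(self, points, leaf_size=2):
        self.root = {"points": list(points), "depth": 0, "split": None}
        self.leaf_size = leaf_size
        self.splits_done = 0  # how many nodes have actually been built

    def _expand(self, node):
        axis = node["depth"] % 2  # alternate between x and y
        pts = sorted(node["points"], key=lambda p: p[axis])
        mid = len(pts) // 2
        node["split"] = pts[mid][axis]
        child_depth = node["depth"] + 1
        node["left"] = {"points": pts[:mid], "depth": child_depth, "split": None}
        node["right"] = {"points": pts[mid:], "depth": child_depth, "split": None}
        self.splits_done += 1

    def locate(self, q):
        """Return the leaf bucket for query point q, building nodes on demand."""
        node = self.root
        while len(node["points"]) > self.leaf_size:
            if node["split"] is None:
                self._expand(node)  # lazy: build only along the visited path
            axis = node["depth"] % 2
            node = node["left"] if q[axis] < node["split"] else node["right"]
        return node["points"]
```

Comparing `splits_done` after a handful of queries with the number of internal nodes in a fully built tree shows how much construction work was skipped for regions that no query touched.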

Department of Computer Science 17 Razor: A Dynamic Multiresolution Raytracer
• Multi-level resolution to reduce secondary-ray computation
• Replicates the kd-tree per thread to reduce synchronization across threads
– Hypothesis: duplication across threads will be limited
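
One way to make the duplication hypothesis concrete is to count how many tree nodes each thread builds privately and compare the total against the number of distinct nodes. The sketch below is a toy model, not a measurement of Razor: it assumes a 1D binary subdivision of the interval [0, 1), gives each thread a contiguous slice of queries, and uses hypothetical names (`nodes_touched`, `duplication_factor`). Each thread writes only to its own set, so no locks are needed.

```python
import threading

DEPTH = 6  # assumed depth of the toy binary subdivision

def nodes_touched(x, depth=DEPTH):
    """IDs (level, index) of the nodes a query at position x would build."""
    return {(d, int(x * (1 << d))) for d in range(depth)}

def worker(queries, private_nodes):
    # Each thread updates only its own private set: no synchronization.
    for x in queries:
        private_nodes.update(nodes_touched(x))

def duplication_factor(num_threads, queries_per_thread=64):
    """Total nodes built across all threads divided by distinct nodes built."""
    privates = [set() for _ in range(num_threads)]
    threads = []
    width = 1.0 / num_threads
    for i in range(num_threads):
        lo = i * width
        qs = [lo + width * (k + 0.5) / queries_per_thread
              for k in range(queries_per_thread)]
        t = threading.Thread(target=worker, args=(qs, privates[i]))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    total = sum(len(p) for p in privates)
    unique = len(set().union(*privates))
    return total / unique
```

A factor of 1.0 means no node was built twice. In this toy setup only the top few levels are shared between threads, because each thread works on its own spatial region; how closely that carries over to real scenes is exactly what the hypothesis on the slide is about.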

Department of Computer Science 18 Razor Implementation
• Linux/x86
– Implemented Razor on Intel Clovertown
– Parallelized using pthreads
• Optimized with SSE instructions
• Sustains 1 FPS on this prototype system
• Helps develop algorithms
• Designed with future hardware in mind

Department of Computer Science 19 Razor’s Memory Usage [Plot axes: # Threads, Memory footprint]

Department of Computer Science 20 Parallel Scalability [Plot axes: # Threads, Speedup]

Department of Computer Science 21 Outline
• Motivation
• Copernicus system
– Graphics Algorithm: Razor
– Architecture
– Evaluation and Results
• Summary

Department of Computer Science 22 Architecture: Core
• In-order core
• Private L1 data and instruction caches
• Supports SIMD instructions
• SMT threads to hide memory latency

Department of Computer Science 23 Architecture: Tile
• Shared L2 cache
• Shared accelerator for specialized instructions

Department of Computer Science 24 Architecture: Chip

Department of Computer Science 25 Architecture: Razor Mapping [Diagram labels: Assigned to Tile, Assigned to Core]

Department of Computer Science 26 Outline
• Motivation
• Copernicus system
– Graphics Algorithm: Razor
– Architecture
– Evaluation and Results
• Summary

Department of Computer Science 27 Benchmark Scenes: Courtyard, Fairyforest, Forest, Juarez, Saloon

Department of Computer Science 28 Evaluation Methodology
• Simulation with Multifacet/GEMS
– Simulates SSE instructions
– Simulates a full tile
– Validated with prototype data (Pin-based and PAPI-based performance counters)
– Randomly selected regions of scenes
• Full chip
– Simulating the full chip is too slow
– Built a customized analytic model instead

Department of Computer Science 29 Analytical Model
• Core level
– Pipeline stalls
– Multiple threads
• Tile level
– L2 contention
• Chip level
– Main memory contention
• Compared with our simulation results
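
The three levels listed on this slide suggest a layered bottleneck calculation. The sketch below is a deliberately simple stand-in and not the authors' analytic model: the formulas, function names, and every parameter (stall cycles per instruction, requests per instruction, bandwidth limits) are assumptions chosen only to show how core, tile, and chip limits might compose.

```python
def core_ipc(stall_cycles_per_instr, smt_threads, issue_width=1.0):
    """Core level: SMT threads overlap stalls until the issue width saturates."""
    single_thread_ipc = 1.0 / (1.0 + stall_cycles_per_instr)
    return min(issue_width, smt_threads * single_thread_ipc)

def tile_ipc(cores, per_core_ipc, l2_req_per_instr, l2_req_per_cycle):
    """Tile level: the shared L2 caps the instruction rate it can feed."""
    return min(cores * per_core_ipc, l2_req_per_cycle / l2_req_per_instr)

def chip_ipc(tiles, per_tile_ipc, mem_req_per_instr, mem_req_per_cycle):
    """Chip level: main memory bandwidth caps the whole chip."""
    return min(tiles * per_tile_ipc, mem_req_per_cycle / mem_req_per_instr)
```

Each level takes the minimum of "what the level below could deliver" and "what the shared resource can sustain", so the result shows which resource becomes the bottleneck as threads, cores, or tiles are added. A model like this would still need to be checked against simulation, as the slide's last bullet notes.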

Department of Computer Science 30 Single Core Performance (Single Issue) [Plot axis: IPC]

Department of Computer Science 31 Single Core Performance (Dual Issue) [Plot axis: IPC]

Department of Computer Science 32 Single Tile Performance [Plot axis: IPC]

Department of Computer Science 33 Full Chip Performance [Plot axes: # Tiles, Million rays per second]

Department of Computer Science 34 So, are we there yet?

Department of Computer Science 35 Results
• Goal: 100 million rays per second
• Achieved: 50 million rays per second
– With 16 tiles and 4 DIMMs
• Insights:
– 4-thread SMT with single issue is ideal for this workload
– Good parallel scalability
– Razor’s physically motivated optimizations work
• Potential for further architectural optimizations
– Shared accelerator
– Wide SIMD bundles
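
The gap between the goal and the achieved rate can be put into numbers with simple arithmetic. The snippet below uses only the three figures stated on this slide; the perfectly linear scaling in `tiles_for_goal` is an optimistic assumption introduced for the example, since the analytical model explicitly accounts for main memory contention at the chip level.

```python
import math

GOAL_MRAYS = 100.0     # goal from the slide, in million rays per second
ACHIEVED_MRAYS = 50.0  # achieved with 16 tiles and 4 DIMMs
TILES = 16

def per_tile_mrays(achieved=ACHIEVED_MRAYS, tiles=TILES):
    """Average throughput contributed by each tile."""
    return achieved / tiles

def fraction_of_goal(achieved=ACHIEVED_MRAYS, goal=GOAL_MRAYS):
    """Share of the target ray rate that was reached."""
    return achieved / goal

def tiles_for_goal(goal=GOAL_MRAYS, achieved=ACHIEVED_MRAYS, tiles=TILES):
    """Tiles needed under perfectly linear scaling (an optimistic assumption)."""
    return math.ceil(goal / per_tile_mrays(achieved, tiles))
```

The reported configuration averages 3.125 million rays per second per tile and reaches half of the goal, so closing the gap by adding tiles alone would take at least twice as many, and more if memory contention grows with tile count.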

Department of Computer Science 36 Outline
• Motivation
• Copernicus system
– Graphics Algorithm: Razor
– Architecture
– Evaluation and Results
• Summary

Department of Computer Science 37 Summary
• A transformation path to ray-tracing
– From the Ptolemaic to the Copernican graphics universe
• Unique architecture design point
– Trades data redundancy and re-computation for reduced synchronization
• Evaluation methodology interesting in its own right
– Prototype, simulation, and analytical framework to design and evaluate future systems
• Future work
– Instruction specialization and shared accelerator design
– Tradeoffs between SIMD width and area
– Memory system

Department of Computer Science 38 Other Questions?

Department of Computer Science 39 Raytracing

Department of Computer Science 40 Razor: A Dynamic Packet Ray-tracer
• Packet ray-tracer
– Traces a beam of rays instead of a single ray
– Opportunity for data-level parallelism
• Each thread lazily builds its own acceleration structure (kd-tree)
– Builds only the portion of the structure it needs
• Multi-level resolution to reduce secondary-ray computation
• Replicates the acceleration structure to reduce synchronization across threads
– Hypothesis: duplication across threads will be limited