XMT-GPU: A PRAM Architecture for Graphics Computation
Tom DuBois, Bryant Lee, Yi Wang, Marc Olano, and Uzi Vishkin

Bifurcation Between CPU and GPU
- CPUs: general purpose, serial
- GPUs: special purpose, parallel
- CPUs are becoming more parallel: dual and quad cores today, with roadmaps predicting many-cores
- It is unclear how to build or program these many-cores
- GPUs are becoming more general: gaining momentum for more general applications

Is Unification Possible?
Can a single general-purpose, many-core processor replace a CPU + GPU? It must be:
- Standalone: no coprocessor needed
- Easy and flexible to program
- Competitive with CPUs on any workload
- Competitive with GPUs on graphics
We choose XMT as our unification candidate in part because it satisfies the first three. During the Q&A session we welcome your thoughts on what else could be used.

Main Experiment and Results
Can XMT satisfy the fourth requirement? We simulate surface shading (a common graphics application) on general-purpose and GPU representatives.
Mixed results:
- XMT is slightly faster on some GPU tasks
- GPUs are significantly faster on others
Unification looks unlikely, but momentum may shift toward general-purpose processors.

CPU History Overview
Serial random-access machine (RAM) programming model:
- A great success story, the dominant model for decades
- Popular in both theory and practice
- Relies on faster serial hardware for performance gains, which is no longer sufficient
Multi-cores are available today (2-4 cores per chip); many-cores (hundreds or thousands of cores) are on the horizon, but how will they look?

What Will Future CPUs Look Like?
The future direction from major vendors is unclear:
- Proposals try to look like the serial RAM to programmers
- The long-term software spiral is broken
PRAM (parallel random-access machine):
- The model preferred by the programming community: a natural extension of the serial model, scalable, and included in major algorithm textbooks
- Long discounted because of the difficulty of building one
- Recently, building one has become feasible

XMT: eXplicit Multi-Threading
A PRAM-on-chip vision under development at the University of Maryland since 1997:
- Targeting ~1000 cores on chip
- PRAM-like programmability
- An on-chip shared L1 cache provides the memory bandwidth necessary for PRAM
- Previous work has established XMT's performance on a variety of applications
- Simulator and FPGA implementations are available

Programming XMT
- XMTC: a single-program multiple-data (SPMD) extension of standard C that resembles the CRCW PRAM
- Spawning creates lightweight, asynchronous threads; serial execution resumes once all threads complete (see the sketch below)
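
As a rough illustration of this programming style, here is an XMTC-flavored C sketch (not code from the talk): the array names, sizes, and the specific operation are assumptions, and the spawn/$ syntax is recalled from XMTC documentation rather than taken from these slides.

```c
/* Illustrative XMTC-style sketch: add two arrays element-wise using one
 * lightweight thread per element. In XMTC, spawn(low, high) starts threads
 * with IDs low..high, the special symbol $ names the current thread's ID,
 * and serial execution resumes only after every spawned thread finishes
 * (an implicit join). */
#define N 1024

int a[N], b[N], c[N];

void vector_add(void)
{
    spawn(0, N - 1) {
        c[$] = a[$] + b[$];   /* each thread handles one element */
    }
    /* serial code continues here after all N threads complete */
}
```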

XMT FPGA In Use
- 64-processor, 75 MHz prototype
- Used in an undergraduate theory class (also by non-major freshmen and 35 high-school students)
- Six significant projects
- No architecture discussion, minimal XMTC discussion

GPU History Overview
Stream programming model:
- Streams and kernels: simple, and easily exploits locality for some tasks (see the sketch below)
- Handles irregular, fine-grained, and serial code poorly
Originally very inflexible: "programming" meant setting bits for muxes.
Modern GPUs are much more flexible:
- C-like languages
- GPGPU
- Still tied to the stream model
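
To make the streams-and-kernels idea concrete, here is a minimal C sketch (my own illustration, not from the talk): a kernel is a small pure function applied independently to every element of an input stream, which is what lets the hardware run many instances in parallel.

```c
#include <stddef.h>

/* A "kernel": a small function applied independently to one stream element.
 * It sees only its own input and produces only its own output, which is
 * what lets a stream processor run many instances in parallel. */
static float brighten(float pixel)
{
    float out = pixel * 1.2f;
    return out > 1.0f ? 1.0f : out;
}

/* The "stream" execution: apply the kernel to every element. On a GPU this
 * loop is what the hardware parallelizes; irregular, fine-grained, or
 * inherently serial code does not fit this shape. */
void run_kernel(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = brighten(in[i]);
}
```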

Very High-Level GPU Pipeline
- Old pipelined architecture: fixed stages for vertex processing, fragment processing, and pixel processing
- New virtual pipeline architecture: computation units capable of vertex, fragment, pixel, and other processing, coordinated by flow control

More Detailed Modern GPU
From the NVIDIA GeForce 8800 GPU Architecture Overview: a virtually pipelined design.

GeForce and XMT Similarities: Clusters of processors and functional units

GeForce and XMT Similarities: On-chip memory and access network

GeForce and XMT Similarities: Control logic

Does XMT Meet the Unification Requirements?
Unification requirements:
- Ability to stand alone
- Easy and flexible to program
- Must perform general-purpose tasks well
- Competitive with modern GPUs
Graphics performance is the only unknown. GPUs are a unique competitor: successful, commodity, parallel hardware.

Our Experiment
A simulated XMT system vs. several real GPUs on fragment shading:
- Compute shading: memory-light, general
- Texture shading: memory-heavy, specialized
Only the fragment shading stage is compared (illustrative shader sketches follow below).
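
For intuition about the two workloads, here is a hedged C sketch of what a compute-style and a texture-style per-fragment function might look like; the type and function names, the lighting math, and the texture layout are illustrative assumptions, not the shaders actually used in the experiment.

```c
/* Hypothetical vector and texture types for illustration only. */
typedef struct { float x, y, z; } vec3;
typedef struct { const vec3 *texels; int width, height; } texture2d;

static float dot3(vec3 a, vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

/* "Compute" shading: memory-light, arithmetic per fragment
 * (here, a simple diffuse lighting term). */
vec3 compute_shade(vec3 normal, vec3 light_dir, vec3 base_color)
{
    float d = dot3(normal, light_dir);
    if (d < 0.0f) d = 0.0f;
    return (vec3){ base_color.x * d, base_color.y * d, base_color.z * d };
}

/* "Texture" shading: memory-heavy, dominated by texture lookups.
 * Nearest-neighbor fetch shown; real GPUs add filtering, caching, and
 * dedicated texture units, which this sketch omits. */
vec3 texture_shade(const texture2d *tex, float u, float v)
{
    int x = (int)(u * (tex->width  - 1));
    int y = (int)(v * (tex->height - 1));
    return tex->texels[y * tex->width + x];
}
```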

Simulating Fragment Shading
The simulated XMT is used in place of the fragment shading step in software: the application renders through Mesa OpenGL (vertex processing, rasterization, fragment shading, other fragment operations, display), with the fragment shading stage replaced by an XMT fragment-shading program running on the XMT simulator.
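
The following C sketch shows the general interposition pattern this slide describes: a single fragment-shading hook that can be pointed either at a software path or at an external simulator. None of these names come from Mesa or the XMT toolchain; they are placeholders for illustration only.

```c
#include <stddef.h>

typedef struct { float r, g, b, a; } color;
typedef struct { float u, v; } fragment;   /* interpolated attributes */

/* The pipeline calls one hook for the fragment-shading stage. */
typedef void (*fragment_shader_fn)(const fragment *in, color *out, size_t n);

/* Default path: shade each fragment in software in the host process. */
static void shade_in_software(const fragment *in, color *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = (color){ in[i].u, in[i].v, 0.0f, 1.0f };  /* trivial stand-in */
}

/* Experimental path: hand the fragment batch to an external simulator and
 * read the shaded colors back (stubbed here with the software path). */
static void shade_on_simulator(const fragment *in, color *out, size_t n)
{
    shade_in_software(in, out, n);   /* stand-in for the simulator round trip */
}

/* The rest of the pipeline stays unchanged; only this hook is swapped. */
fragment_shader_fn select_backend(int use_simulator)
{
    return use_simulator ? shade_on_simulator : shade_in_software;
}
```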

The Competitors
- Simulated XMT* (*scaled to the same FLOPS level as the GeForce 8800)
- NVIDIA GeForce 7900, released mid 2006
- NVIDIA GeForce 8800, released late 2006
- ATI x700, released mid 2004

XMT Variants
We used three variants of XMT:
- Version 1: unmodified
- Version 2: with a graphics ISA (floor, fraction, linear interpolation)
- Version 3: graphics ISA plus 4-way vector computations on 8-bit arithmetic (see the sketch below)
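
As a rough idea of what these extensions compute, the C sketch below shows scalar versions of the three graphics operations and one packed 4-way operation on 8-bit channels; the packing scheme and function names are my own assumptions, not the actual XMT ISA.

```c
#include <stdint.h>
#include <math.h>

/* Scalar forms of the three "graphics ISA" operations named on the slide. */
static float op_floor(float x)                  { return floorf(x); }
static float op_fraction(float x)               { return x - floorf(x); }
static float op_lerp(float a, float b, float t) { return a + t * (b - a); }

/* 4-way vector computation on 8-bit arithmetic: one 32-bit word holds four
 * 8-bit channels (e.g., RGBA), and a single operation processes all four.
 * The packing below is an illustrative assumption. */
static uint32_t add4_u8_saturate(uint32_t a, uint32_t b)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t ca = (a >> (8 * i)) & 0xFF;
        uint32_t cb = (b >> (8 * i)) & 0xFF;
        uint32_t s  = ca + cb;
        if (s > 0xFF) s = 0xFF;          /* saturate each channel */
        out |= s << (8 * i);
    }
    return out;
}
```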

Compute Shader Results
Frame rates (FPS) for the XMT variants vs. the GeForce 7900, GeForce 8800, and ATI x700.

Texture Shader Results
Frame rates (FPS) for the XMT variants vs. the GeForce 7900, GeForce 8800, and ATI x700.

Analysis
- XMT compute-shades faster
- XMT texture-shades much slower: acceptable for some applications, but not all
- The GeForce GPUs follow the same trend, sacrificing speed on the most-used applications for greater flexibility on others

Summary
The divide between CPU and GPU is blurring. A unified system gives:
- Ease of programming
- Good general-purpose performance
- Good graphics performance
Combination systems are still needed for truly high-performance applications:
- An XMT + GPU system could provide the best of both worlds
- How should work be partitioned between them? Can they cooperate?