General Purpose GPU (GPGPU) Aaron Smith University of Texas at Austin Spring 2003.

Presentation transcript:

General Purpose GPU (GPGPU) Aaron Smith University of Texas at Austin Spring 2003

Motivation
Graphics processors are becoming more programmable
  –DirectX/OpenGL - Vertex and Pixel Shaders
Explore the current state of the art
  –How would a typical application run on a GPU?
  –What are the difficulties? Requirements?

MPEG Overview
Format for storing compressed audio and video
Uses prediction between frames to achieve compression (exploits temporal redundancy between neighboring frames)
  –“I” or intra-frames
     simply a frame encoded as a still image (no history)
  –“P” or predicted frames
     predicted from the most recently reconstructed I or P frame
     can also be treated like I frames when there is no good match
  –“B” or bi-directional frames
     predicted from the closest two I or P frames, one in the past and one in the future
     if there is no good match, intra-coded like an I frame
Typical sequence looks like:
  –IBBPBBPBBPBBIBBPBBPB...
Remember what a B frame is???
  –decode the I frame, then the first P frame, then the first and second B frames
  –0xx312645
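The decode order on this slide (0xx312645) follows from one rule: a B frame cannot be decoded until the I or P frame that comes after it in display order is available. A minimal sketch of that reordering in C is given here. The function name `decode_order` and the character-per-frame representation are hypothetical, chosen only for illustration; a real decoder reads the order from the bitstream rather than computing it.

```c
#include <assert.h>
#include <stddef.h>

/* Reorder frames from display order to decode order.
 * types: frame types in display order ('I', 'P' or 'B'), length n.
 * out:   receives the display indices in the order they must be decoded.
 * Every B frame is held back until the next I or P frame (its future
 * reference) has been emitted. B frames at the end of the array, whose
 * future reference lies outside it, are emitted last.
 * Returns the number of indices written (always n). */
size_t decode_order(const char *types, size_t n, size_t *out)
{
    size_t written = 0;
    size_t first_pending_b = 0;   /* start of the run of held-back B frames */
    size_t i, k;

    assert(types != NULL && out != NULL);

    for (i = 0; i < n; i++) {
        if (types[i] == 'B')
            continue;             /* wait for the next reference frame */
        out[written++] = i;       /* the reference frame is decoded first */
        for (k = first_pending_b; k < i; k++)
            out[written++] = k;   /* then the B frames displayed before it */
        first_pending_b = i + 1;
    }
    for (k = first_pending_b; k < n; k++)
        out[written++] = k;       /* trailing B frames need a later reference */
    return written;
}
```

For the display sequence IBBPBBP this yields 0 3 1 2 6 4 5, the same order as on the slide without the two leading B frames that belong to the previous group.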

GPU Programming Model
Streams Programming
Pixel Shaders
  –store data in texture memory
  –use multiple passes to render and re-render to texture memory
Vertex Shaders???
  –more powerful than pixel shaders from an instruction standpoint
  –but...not very useful because of the restriction on accessing texture memory
What are the limitations?
  –branching ?

MPEG and the GPU
decoding is sequential
data structures are regular
  –typical video stream is 352x240
basic result is pixel color data

NVIDIA Cg
High Level Shading Language
Vertex and Pixel Shaders
OpenGL and DirectX Support
Can be compiled at runtime!

Cg Profiles
1) Which profile do we choose? Will the model fit?
2) What about portability? Can we move between architectures?

DirectX 9 – PS_2_0

PS_2_0 Cont.

MPEG -> Cg Challenges
Data Types
  –float/int are the basic types on the GPU
  –unsigned char is the dominant type in MPEG
Loops
  –Most profiles do not support loops unless they can be completely unrolled
  –e.g. loop.cg(49) : warning C7012: not unrolling loop that executes 352 times since maximum loop unroll count is 256
No recursion
  –Normally not a problem, since we can change to an iterative form
  –But on the GPU we have a problem with “Loops”
Arrays
  –Severe restrictions on index variables
  –Some profiles assign each array element to a register
     i.e. float array[10] uses ten registers
Pointers
  –Not supported
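The data type mismatch can be made concrete. When 8-bit data is stored in a texture, a shader typically reads each channel as a float in [0, 1] instead of an integer in 0..255. The sketch below shows that mapping and its inverse on the CPU. The helper names `byte_to_unit` and `unit_to_byte` are hypothetical, and the round-to-nearest write-back is an assumption about the render target, not something the slides state.

```c
#include <assert.h>

/* What sampling an 8-bit texture channel gives a shader: a value in [0, 1]. */
float byte_to_unit(unsigned char b)
{
    return (float)b / 255.0f;
}

/* Inverse mapping when a result is written back to an 8-bit target:
 * clamp to [0, 1], scale to 0..255, round to nearest. */
unsigned char unit_to_byte(float f)
{
    assert(f == f);               /* reject NaN */
    if (f < 0.0f) f = 0.0f;
    if (f > 1.0f) f = 1.0f;
    return (unsigned char)(f * 255.0f + 0.5f);
}
```

Under this mapping the chroma offset of 128 used in the C decoder becomes 128/255, roughly 0.5, in shader arithmetic.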

Implementation
Only support 352x240 resolution
Allocate fixed data structures to hold frame
  –352x240 = x x (yuv)
Hold data in texture memory
Use Cg pixel shaders
  –vertex shaders cannot access texture memory
Work backwards

An Example C -> Cg
Convert MPEG decoder store() routine into Cg shader
  –Simplify…simplify…simplify
  –Factor

store_ppm_tga() - Original

static void store_ppm_tga(outname,src,offset,incr,height,tgaflag)
char *outname;
unsigned char *src[];
int offset, incr, height;
int tgaflag;
{
  int i, j;
  int y, u, v, r, g, b;
  int crv, cbu, cgu, cgv;
  unsigned char *py, *pu, *pv;
  static unsigned char tga24[14] = {0,0,2,0,0,0,0, 0,0,0,0,0,24,32};
  char header[FILENAME_LENGTH];
  static unsigned char *u422, *v422, *u444, *v444;

  if (chroma_format==CHROMA444)
  {
    u444 = src[1];
    v444 = src[2];
  }
  else
  {
    if (!u444)
    {
      /* first call: allocate the chroma upsampling buffers */
      if (chroma_format==CHROMA420)
      {
        if (!(u422 = (unsigned char *)malloc((Coded_Picture_Width>>1)
                                             *Coded_Picture_Height)))
          Error("malloc failed");
        if (!(v422 = (unsigned char *)malloc((Coded_Picture_Width>>1)
                                             *Coded_Picture_Height)))
          Error("malloc failed");
      }

      if (!(u444 = (unsigned char *)malloc(Coded_Picture_Width
                                           *Coded_Picture_Height)))
        Error("malloc failed");
      if (!(v444 = (unsigned char *)malloc(Coded_Picture_Width
                                           *Coded_Picture_Height)))
        Error("malloc failed");
    }

    if (chroma_format==CHROMA420)
    {
      conv420to422(src[1],u422);
      conv420to422(src[2],v422);
      conv422to444(u422,u444);
      conv422to444(v422,v444);
    }
    else
    {
      conv422to444(src[1],u444);
      conv422to444(src[2],v444);
    }
  }

  strcat(outname,tgaflag ? ".tga" : ".ppm");

  if ((outfile = open(outname,O_CREAT|O_TRUNC|O_WRONLY|O_BINARY,0666))==-1)
  {
    sprintf(Error_Text,"Couldn't create %s\n",outname);
    Error(Error_Text);
  }

  optr = obfr;

  if (tgaflag)
  {
    /* TGA header */
    for (i=0; i<12; i++)
      putbyte(tga24[i]);
    putword(horizontal_size);
    putword(height);
    putbyte(tga24[12]);
    putbyte(tga24[13]);
  }

  crv = Inverse_Table_6_9[matrix_coefficients][0];
  cbu = Inverse_Table_6_9[matrix_coefficients][1];
  cgu = Inverse_Table_6_9[matrix_coefficients][2];
  cgv = Inverse_Table_6_9[matrix_coefficients][3];

  /* convert YUV to RGB, one row at a time */
  for (i=0; i<height; i++)
  {
    py = src[0] + offset + incr*i;
    pu = u444 + offset + incr*i;
    pv = v444 + offset + incr*i;

    for (j=0; j<horizontal_size; j++)
    {
      u = *pu++ - 128;
      v = *pv++ - 128;
      y = 76309 * (*py++ - 16); /* (255/219)*65536 */
      r = Clip[(y + crv*v )>>16];
      g = Clip[(y - cgu*u - cgv*v )>>16];
      b = Clip[(y + cbu*u )>>16];

      if (tgaflag)
      {
        /* TGA stores pixels as BGR */
        putbyte(b); putbyte(g); putbyte(r);
      }
      else
      {
        putbyte(r); putbyte(g); putbyte(b);
      }
    }
  }

  if (optr!=obfr)
    write(outfile,obfr,optr-obfr);

  close(outfile);
}

Quick Analysis
Pointers
  –Remove
Conditionals (if/else)
  –Remove
Dynamic Memory
  –Remove
File I/O
  –Remove
Table lookups
  –Remove
Constant array indexes
  –OK!
Constant loop invariants
  –OK!

store_tga() - Simplified

static void store_tga(unsigned char *src[])
{
  int i, j;
  int y, u, v, r, g, b;
  int crv, cbu, cgu, cgv;
  int incr = 352;
  int height = 240;
  int data_idx = 0; /* index into BitMap.data[] */
  static unsigned char u422[176*240];
  static unsigned char v422[176*240];
  static unsigned char u444[352*240];
  static unsigned char v444[352*240];

  /* 352 x 240 x 3 frame */
  BitMap.channels = 3;
  BitMap.size_x = 352;
  BitMap.size_y = 240;

  conv420to422(src[1],u422); /* u422 = src[1] */
  conv420to422(src[2],v422); /* v422 = src[2] */
  conv422to444(u422,u444);   /* u444 = u422 */
  conv422to444(v422,v444);   /* v444 = v422 */

  /* matrix coefficients; crv and cbu are missing from the transcript, the
     values here assume the Inverse_Table_6_9 row that contains
     cgu = 25675 and cgv = 53279 */
  crv = 104597;
  cbu = 132201;
  cgu = 25675;
  cgv = 53279;

  /* convert YUV to RGB */
  for (i=0; i<height; i++)
  {
    for (j=0; j<horizontal_size; j++)
    {
      u = u444[incr*i+j] - 128;
      v = v444[incr*i+j] - 128;
      y = 76309 * (src[0][incr*i+j] - 16); /* (255/219)*65536 */

#define CLIP(x) ( ((x) < 0) ? 0 : (((x) > 255) ? 255 : (x)) )
      r = CLIP((y + crv*v )>>16);
      g = CLIP((y - cgu*u - cgv*v )>>16);
      b = CLIP((y + cbu*u )>>16);

      BitMap.data[data_idx++] = r;
      BitMap.data[data_idx++] = g;
      BitMap.data[data_idx++] = b;
    }
  }

#ifdef _WIN32
  // output the frame
  DrawGLScene((tImageTGA *)&BitMap);
#endif
}

Quick Analysis
Removed
  –If/else
  –Pointers
  –File i/o
  –Table lookups
What’s Left?
  –Function calls (for chrominance conversion)
     conv420to422() and conv422to444()
  –YUV to RGB loop
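What remains of the YUV to RGB loop is a pure function of one pixel, which is what makes it a candidate for a pixel shader. A minimal sketch of that per-pixel kernel in C follows. The function name `yuv_to_rgb` is hypothetical, the coefficients are assumed to be the ITU-R BT.601 fixed-point values, and two details differ from the slide code on purpose: a rounding term of 32768 is added, and division by 65536 replaces the right shift so that negative intermediates behave the same on every compiler.

```c
#include <assert.h>

/* 16.16 fixed-point coefficients (assumed ITU-R BT.601 values). */
enum {
    Y_SCALE = 76309,   /* (255/219) * 65536 */
    CRV     = 104597,
    CBU     = 132201,
    CGU     = 25675,
    CGV     = 53279
};

/* Clamp an intermediate result to the 0..255 range of one color byte. */
static unsigned char clip255(int x)
{
    return (unsigned char)(x < 0 ? 0 : (x > 255 ? 255 : x));
}

/* Convert one pixel. yb, ub, vb are the Y, U and V samples of the pixel
 * (chroma already upsampled to 4:4:4); rgb receives red, green, blue. */
void yuv_to_rgb(unsigned char yb, unsigned char ub, unsigned char vb,
                unsigned char rgb[3])
{
    int y, u, v;

    assert(rgb != 0);

    y = Y_SCALE * ((int)yb - 16);   /* remove black level, scale to 0..255 */
    u = (int)ub - 128;              /* center chroma around zero */
    v = (int)vb - 128;

    rgb[0] = clip255((y + CRV * v + 32768) / 65536);
    rgb[1] = clip255((y - CGU * u - CGV * v + 32768) / 65536);
    rgb[2] = clip255((y + CBU * u + 32768) / 65536);
}
```

Calling it with (16, 128, 128) gives black and with (235, 128, 128) gives white. There are no pointers into the frame, no table lookups and no branches beyond the clamp, so each pixel can be computed independently of every other pixel.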

YUV -> RGB (cg)

float3 main(
    in float3 texcoords0 : TEXCOORD0,      /* texture coord */
    uniform sampler2D yImage : TEXUNIT0,   /* handle to texture with Y data */
    in float3 texcoords1 : TEXCOORD1,      /* texture coord */
    uniform sampler2D uImage : TEXUNIT1,   /* handle to texture with U data */
    in float3 texcoords2 : TEXCOORD2,      /* texture coord */
    uniform sampler2D vImage : TEXUNIT2    /* handle to texture with V data */
) : COLOR
{
    float3 yuvcolor; // f(xyz) -> yvu
    float3 rgbcolor;

    yuvcolor.x = tex2D(yImage, texcoords0).x;
    yuvcolor.z = tex2D(uImage, texcoords1).y-0.5;
    yuvcolor.y = tex2D(vImage, texcoords2).z-0.5;

    rgbcolor.r = 2*(yuvcolor.x/ /2 * yuvcolor.z);
    rgbcolor.g = 2*(yuvcolor.x/ * yuvcolor.y/ * yuvcolor.z/2);
    rgbcolor.b = 2*(yuvcolor.x/ /2 * yuvcolor.y);

    return rgbcolor;
}

dcl_2d s0
dcl_2d s1
dcl_2d s2
def c0, , , ,
def c1, , , ,
def c2, , , ,
def c3, , , ,
dcl t0.xyz
dcl t1.xyz
dcl t2.xyz
texld r0, t1, s1
texld r1, t0, s0
add r0.x, r0.y, -c1.y
mov r1.z, r0.x
texld r0, t2, s2
add r0.x, r0.z, -c1.y
mov r1.y, r0.x
dp3 r0.x, r1, c3
mul r0.x, c1.x, r0.x
dp3 r0.w, r1, c2
mov r0.y, r0.w
dp3 r0.w, r1, c1.x
mul r0.w, c1.x, r0.w
mov r0.z, r0.w
mov r1.w, c0.w
mov r1.xyz, r0
mov oC0, r1
// 17 instructions, 2 R-regs.
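The numeric constants of the shader did not survive in this transcript, so the exact arithmetic cannot be recovered from it. As a rough CPU-side reference for what such a shader computes, the sketch below redoes the conversion in floating point on samples in [0, 1], the way they would arrive from 8-bit textures. Everything specific here is an assumption: the function name `yuv_to_rgb_unit`, the BT.601 coefficients (the decoder's fixed-point values divided by 65536), and the offsets 16/255 and 128/255, where the shader itself subtracts 0.5 from chroma.

```c
#include <assert.h>

/* Clamp to the [0, 1] range a color output can hold. */
static float clamp01(float x)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

/* y, u, v: samples in [0, 1], as read from 8-bit textures.
 * rgb:     receives red, green, blue in [0, 1]. */
void yuv_to_rgb_unit(float y, float u, float v, float rgb[3])
{
    /* Assumed BT.601 coefficients: fixed-point values / 65536. */
    const float ky  = 76309.0f  / 65536.0f;   /* about 1.164 */
    const float krv = 104597.0f / 65536.0f;   /* about 1.596 */
    const float kbu = 132201.0f / 65536.0f;   /* about 2.017 */
    const float kgu = 25675.0f  / 65536.0f;   /* about 0.392 */
    const float kgv = 53279.0f  / 65536.0f;   /* about 0.813 */
    float yy, uu, vv;

    assert(rgb != 0);

    yy = ky * (y - 16.0f / 255.0f);   /* remove black level, rescale luma */
    uu = u - 128.0f / 255.0f;         /* center chroma around zero */
    vv = v - 128.0f / 255.0f;

    rgb[0] = clamp01(yy + krv * vv);
    rgb[1] = clamp01(yy - kgu * uu - kgv * vv);
    rgb[2] = clamp01(yy + kbu * uu);
}
```

Each output channel is a weighted sum of the three inputs, which matches the shape of the compiled code above: one dp3 per channel.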

Quick Analysis
YUV -> RGB
  –17 instructions and 2 registers
  –352x240 = 84,480 px * 17 = ~1.4M instr/frame

Just for Fun
What if we needed 1024 instructions??
  –352x240 = 84,480 px * 1024 = 86,507,520 instr/frame
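The per-frame figures on the last two slides are a single multiplication, and extending them to a per-second budget only needs a frame rate. The sketch below does both. The function names are hypothetical, and any frame rate passed in is the caller's assumption, since the slides do not state one.

```c
#include <assert.h>

/* Shader instructions executed for one frame when every pixel runs
 * a shader of instr_per_pixel instructions. */
unsigned long long instr_per_frame(unsigned width, unsigned height,
                                   unsigned instr_per_pixel)
{
    return (unsigned long long)width * height * instr_per_pixel;
}

/* Shader instructions per second at a given frame rate. */
unsigned long long instr_per_second(unsigned long long per_frame,
                                    unsigned frames_per_second)
{
    assert(frames_per_second != 0);   /* a video stream has a nonzero rate */
    return per_frame * frames_per_second;
}
```

For 352x240, instr_per_frame gives 1,436,160 for the 17-instruction shader and 86,507,520 for a 1024-instruction one. At an assumed 30 frames per second the second case comes to about 2.6 billion instructions per second.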