General Purpose GPU (GPGPU) Aaron Smith University of Texas at Austin Spring 2003.

1 General Purpose GPU (GPGPU)
Aaron Smith
University of Texas at Austin
Spring 2003

2 Motivation
Graphics processors are becoming more programmable
–DirectX/OpenGL - Vertex and Pixel Shaders
Explore the current state of the art
–How would a typical application run on a GPU?
–What are the difficulties? Requirements?

3 MPEG Overview
Format for storing compressed audio and video
Uses prediction between frames to achieve compression (exploits temporal redundancy)
–“I” or intra-frames
    simply a frame encoded as a still image (no history)
–“P” or predicted frames
    predicted from the most recently reconstructed I or P frame
    can also be treated like I frames when there is no good match
–“B” or bi-directional frames
    predicted from the closest two I or P frames, one in the past and one in the future
    if there is no good match, intra-coded like an I frame
Typical sequence looks like:
–IBBPBBPBBPBBIBBPBBPB...
Remember what a B frame is???
–decode the I frame, then the first P frame, then the first and second B frames
–0xx312645
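The reordering described on this slide can be sketched in C. This is a minimal illustration, not code from the decoder: the function name `decode_order` and its interface are hypothetical, and it assumes each B frame is held back only until the next I or P frame in display order.

```c
#include <stddef.h>

/*
 * Hypothetical helper: given frame types in display order (e.g. "IBBPBBP"),
 * write the display indices in the order a decoder needs them.
 * A B frame depends on a future I or P frame, so that reference frame
 * is emitted first and the pending B frames follow it.
 * Returns the number of indices written (always n).
 */
size_t decode_order(const char *types, size_t n, size_t *decode_idx)
{
    size_t out = 0;
    size_t first_b = 0;  /* display index of the first pending B frame */
    size_t pending = 0;  /* number of B frames waiting for a reference */

    for (size_t i = 0; i < n; i++) {
        if (types[i] == 'B') {
            if (pending == 0)
                first_b = i;
            pending++;
        } else {
            decode_idx[out++] = i;              /* I or P frame goes first */
            for (size_t k = 0; k < pending; k++)
                decode_idx[out++] = first_b + k; /* then the held-back B frames */
            pending = 0;
        }
    }
    /* B frames at the end have no future reference in this window; emit last. */
    for (size_t k = 0; k < pending; k++)
        decode_idx[out++] = first_b + k;
    return out;
}
```

For the display sequence IBBPBBP this yields the indices 0, 3, 1, 2, 6, 4, 5, which matches the 3-1-2-6-4-5 pattern in the slide.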

4 GPU Programming Model
Streams Programming
Pixel Shaders
–store data in texture memory
–use multiple passes to render and re-render to texture memory
Vertex Shaders???
–more powerful than pixel shaders from an instruction standpoint
–but... not very useful because of restrictions on accessing texture memory
What are the limitations?
–branching?

5 MPEG and the GPU
decoding is sequential
data structures are regular
–typical video stream is 352x240
basic result is pixel color data

6 NVIDIA Cg
High Level Shading Language
Vertex and Pixel Shaders
OpenGL and DirectX Support
Can be compiled at runtime!

7 Cg Profiles
1) Which profile do we choose? Will the model fit?
2) What about portability? Can we move between architectures?

8 DirectX 9 – PS_2_0

9 PS_2_0 Cont.

10 MPEG -> Cg Challenges
Data Types
–float/int are the basic types on the GPU
–unsigned char is the dominant type in MPEG
Loops
–Most profiles do not support loops unless they can be completely unrolled
–e.g.: warning C7012: not unrolling loop that executes 352 times since maximum loop unroll count is 256
No recursion
–Normally not a problem, since we can change to an iterative version
–But on the GPU we have a problem with “Loops”
Arrays
–Severe restrictions on index variables
–Some profiles assign each array element to a register
    i.e. float array[10] uses ten registers
Pointers
–Not supported
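The data-type mismatch can be illustrated with a small C sketch. It assumes the 8-bit MPEG samples are uploaded as 8-bit textures, which a shader reads back as floats normalized to [0, 1]; the helper names `byte_to_unit` and `unit_to_byte` are hypothetical and not part of any decoder or Cg API.

```c
/* Map an 8-bit sample to the [0, 1] float a shader would see
 * when sampling an 8-bit normalized texture (assumption). */
float byte_to_unit(unsigned char v)
{
    return (float)v / 255.0f;
}

/* Map a float in [0, 1] back to an 8-bit sample,
 * clamping out-of-range values and rounding to nearest. */
unsigned char unit_to_byte(float f)
{
    if (f <= 0.0f)
        return 0;
    if (f >= 1.0f)
        return 255;
    return (unsigned char)(f * 255.0f + 0.5f);
}
```

Under this mapping the chroma offset of 128 used in integer MPEG code becomes roughly 0.5 in shader arithmetic.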

11 Implementation
Only support 352x240 resolution
Allocate fixed data structures to hold frame
–352x240 = 84480 Y samples, plus 21120 U and 21120 V samples (176x120 each)
Hold data in texture memory
Use Cg pixel shaders
–vertex shaders cannot access texture memory
Work backwards
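A fixed-size frame structure along these lines could look as follows. This is only a sketch: the type name `Frame420`, the constant names, and the assumption of 4:2:0 chroma (half resolution in both directions) are illustrative and not taken from the decoder source.

```c
#include <stddef.h>

/* Fixed frame geometry: the only resolution supported. */
enum {
    FRAME_W  = 352,
    FRAME_H  = 240,
    CHROMA_W = FRAME_W / 2,  /* 4:2:0: chroma is halved horizontally */
    CHROMA_H = FRAME_H / 2   /* and vertically (assumption) */
};

/* Hypothetical statically sized frame: no malloc, no pointers to planes. */
typedef struct {
    unsigned char y[FRAME_W * FRAME_H];    /* 84480 luma samples   */
    unsigned char u[CHROMA_W * CHROMA_H];  /* 21120 chroma samples */
    unsigned char v[CHROMA_W * CHROMA_H];  /* 21120 chroma samples */
} Frame420;

/* Total storage for one frame, in bytes. */
size_t frame420_bytes(void)
{
    return sizeof(Frame420);
}
```

With one byte per sample, a frame occupies 84480 + 21120 + 21120 = 126720 bytes.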

12 An Example C -> Cg
Convert MPEG decoder store() routine into Cg shader
–Simplify…simplify…simplify
–Factor

13 store_ppm_tga() - Original

    static void store_ppm_tga(outname,src,offset,incr,height,tgaflag)
    char *outname;
    unsigned char *src[];
    int offset, incr, height;
    int tgaflag;
    {
      int i, j;
      int y, u, v, r, g, b;
      int crv, cbu, cgu, cgv;
      unsigned char *py, *pu, *pv;
      static unsigned char tga24[14] = {0,0,2,0,0,0,0, 0,0,0,0,0,24,32};
      char header[FILENAME_LENGTH];
      static unsigned char *u422, *v422, *u444, *v444;

      if (chroma_format==CHROMA444)
      {
        u444 = src[1];
        v444 = src[2];
      }
      else
      {
        if (!u444)
        {
          if (chroma_format==CHROMA420)
          {
            if (!(u422 = (unsigned char *)malloc((Coded_Picture_Width>>1)
                                                 *Coded_Picture_Height)))
              Error("malloc failed");
            if (!(v422 = (unsigned char *)malloc((Coded_Picture_Width>>1)
                                                 *Coded_Picture_Height)))
              Error("malloc failed");
          }

          if (!(u444 = (unsigned char *)malloc(Coded_Picture_Width
                                               *Coded_Picture_Height)))
            Error("malloc failed");
          if (!(v444 = (unsigned char *)malloc(Coded_Picture_Width
                                               *Coded_Picture_Height)))
            Error("malloc failed");
        }

        if (chroma_format==CHROMA420)
        {
          conv420to422(src[1],u422);
          conv420to422(src[2],v422);
          conv422to444(u422,u444);
          conv422to444(v422,v444);
        }
        else
        {
          conv422to444(src[1],u444);
          conv422to444(src[2],v444);
        }
      }

      strcat(outname,tgaflag ? ".tga" : ".ppm");

      if ((outfile = open(outname,O_CREAT|O_TRUNC|O_WRONLY|O_BINARY,0666))==-1)
      {
        sprintf(Error_Text,"Couldn't create %s\n",outname);
        Error(Error_Text);
      }

      optr = obfr;

      if (tgaflag)
      {
        /* TGA header */
        for (i=0; i<12; i++)
          putbyte(tga24[i]);
        putword(horizontal_size);
        putword(height);
        putbyte(tga24[12]);
        putbyte(tga24[13]);
      }

      /* matrix coefficients */
      crv = Inverse_Table_6_9[matrix_coefficients][0];
      cbu = Inverse_Table_6_9[matrix_coefficients][1];
      cgu = Inverse_Table_6_9[matrix_coefficients][2];
      cgv = Inverse_Table_6_9[matrix_coefficients][3];

      for (i=0; i<height; i++)
      {
        py = src[0] + offset + incr*i;
        pu = u444 + offset + incr*i;
        pv = v444 + offset + incr*i;

        for (j=0; j<horizontal_size; j++)
        {
          u = *pu++ - 128;
          v = *pv++ - 128;
          y = 76309 * (*py++ - 16); /* (255/219)*65536 */
          r = Clip[(y + crv*v + 32768)>>16];
          g = Clip[(y - cgu*u - cgv*v + 32768)>>16];
          b = Clip[(y + cbu*u + 32786)>>16];

          if (tgaflag)
          {
            putbyte(b); putbyte(g); putbyte(r);
          }
          else
          {
            putbyte(r); putbyte(g); putbyte(b);
          }
        }
      }

      if (optr!=obfr)
        write(outfile,obfr,optr-obfr);

      close(outfile);
    }

14 Quick Analysis
Pointers
–Remove
Conditionals (if/else)
–Remove
Dynamic Memory
–Remove
File I/O
–Remove
Table lookups
–Remove
Constant array indexes
–OK!
Constant loop invariants
–OK!

15 store_tga() - Simplified

    static void store_tga(unsigned char *src[])
    {
      int i, j;
      int y, u, v, r, g, b;
      int crv, cbu, cgu, cgv;
      int incr = 352;
      int height = 240;
      int data_idx = 0; /* index into[] */

      static unsigned char u422[176*240];
      static unsigned char v422[176*240];
      static unsigned char u444[352*240];
      static unsigned char v444[352*240];

      /* 352 x 240 x 3 frame */
      BitMap.channels = 3;
      BitMap.size_x = 352;
      BitMap.size_y = 240;

      conv420to422(src[1],u422); /* u422 = src[1] */
      conv420to422(src[2],v422); /* v422 = src[2] */
      conv422to444(u422,u444);   /* u444 = u422 */
      conv422to444(v422,v444);   /* v444 = v422 */

      /* matrix coefficients */
      crv = 104597;
      cbu = 132201;
      cgu = 25675;
      cgv = 53279;

      /* convert YUV to RGB */
      for (i=0; i<...) {
        ...
        r = CLIP((y + crv*v + 32768)>>16);
        g = CLIP((y - cgu*u - cgv*v + 32768)>>16);
        b = CLIP((y + cbu*u + 32786)>>16);
        [data_idx++] = r;
        [data_idx++] = g;
        [data_idx++] = b;
      }

    #ifdef _WIN32
      // output the frame
      DrawGLScene((tImageTGA *)&BitMap);
    #endif
    }
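The fixed-point YUV to RGB step can be sketched on its own in C. This is a minimal illustration under stated assumptions, not the presenter's code: the function names `clip255` and `yuv444_to_rgb` are hypothetical, the planes are assumed to be already upsampled to 4:4:4, the coefficients are the four constants shown on the slide, and 32768 (0.5 in 16.16 fixed point) is used as the rounding term for all three channels.

```c
/* Clamp an integer to the 0..255 range of an 8-bit channel. */
static unsigned char clip255(int x)
{
    if (x < 0)
        return 0;
    if (x > 255)
        return 255;
    return (unsigned char)x;
}

/*
 * Hypothetical per-pixel conversion using 16.16 fixed-point arithmetic.
 * yp/up/vp point to full-resolution planes of npixels samples each;
 * rgb receives 3 bytes per pixel in R, G, B order.
 * Note: >> on a negative int is implementation-defined in C; this sketch
 * assumes an arithmetic shift, as the slide's code implicitly does.
 */
void yuv444_to_rgb(const unsigned char *yp, const unsigned char *up,
                   const unsigned char *vp, unsigned char *rgb, int npixels)
{
    const int crv = 104597, cbu = 132201, cgu = 25675, cgv = 53279;

    for (int i = 0; i < npixels; i++) {
        int u = up[i] - 128;            /* centre chroma on zero   */
        int v = vp[i] - 128;
        int y = 76309 * (yp[i] - 16);   /* (255/219) * 65536 scale */

        rgb[3 * i + 0] = clip255((y + crv * v + 32768) >> 16);
        rgb[3 * i + 1] = clip255((y - cgu * u - cgv * v + 32768) >> 16);
        rgb[3 * i + 2] = clip255((y + cbu * u + 32768) >> 16);
    }
}
```

With these constants, video black (Y=16, U=V=128) maps to RGB 0,0,0 and video white (Y=235, U=V=128) maps to 255,255,255.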

16 Quick Analysis
Removed
–If/else
–Pointers
–File I/O
–Table lookups
What’s Left?
–Function calls (for chrominance conversion)
    conv420to422() and conv422to444()
–YUV to RGB loop
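To give a feel for what the chrominance conversion has to produce, here is a deliberately simplified C sketch. It is not the decoder's conv420to422() or conv422to444(): it collapses both steps into one and substitutes plain sample replication (nearest neighbour) for whatever interpolation those routines perform. The function name and parameters are hypothetical.

```c
/*
 * Hypothetical nearest-neighbour upsampler: expands a cw x ch chroma
 * plane (4:2:0) to a (2*cw) x (2*ch) plane (4:4:4) by repeating each
 * sample in a 2x2 block. dst must hold 4 * cw * ch bytes.
 */
void upsample420_to_444_nearest(const unsigned char *src, int cw, int ch,
                                unsigned char *dst)
{
    int w = cw * 2;  /* output width  */
    int h = ch * 2;  /* output height */

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            /* each output sample copies its co-located source sample */
            dst[y * w + x] = src[(y / 2) * cw + (x / 2)];
        }
    }
}
```

Each output sample depends on a single input sample at a computable coordinate, which is the kind of access pattern a texture lookup in a pixel shader handles naturally.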

17 YUV -> RGB (Cg)

    float3 main(
        in float3 texcoords0 : TEXCOORD0,     /* texture coord */
        uniform sampler2D yImage : TEXUNIT0,  /* handle to texture with Y data */
        in float3 texcoords1 : TEXCOORD1,     /* texture coord */
        uniform sampler2D uImage : TEXUNIT1,  /* handle to texture with U data */
        in float3 texcoords2 : TEXCOORD2,     /* texture coord */
        uniform sampler2D vImage : TEXUNIT2   /* handle to texture with V data */
    ) : COLOR
    {
        float3 yuvcolor; // f(xyz) -> yvu
        float3 rgbcolor;

        yuvcolor.x = tex2D(yImage, texcoords0).x;
        yuvcolor.z = tex2D(uImage, texcoords1).y-0.5;
        yuvcolor.y = tex2D(vImage, texcoords2).z-0.5;

        rgbcolor.r = 2*(yuvcolor.x/2 + 1.402/2 * yuvcolor.z);
        rgbcolor.g = 2*(yuvcolor.x/2 - 0.344136 * yuvcolor.y/2 - 0.714136 * yuvcolor.z/2);
        rgbcolor.b = 2*(yuvcolor.x/2 + 1.773/2 * yuvcolor.y);

        return rgbcolor;
    }

    dcl_2d s0
    dcl_2d s1
    dcl_2d s2
    def c0, 0.000000, 0.000000, 0.000000, 1.000000
    def c1, 2.000000, 0.500000, 0.886500, 0.000000
    def c2, 1.000000, -0.344000, -0.714000, 0.000000
    def c3, 0.500000, 0.000000, 0.701000, 0.000000
    texld r0, t1, s1
    texld r1, t0, s0
    add r0.x, r0.y, -c1.y
    mov r1.z, r0.x
    texld r0, t2, s2
    add r0.x, r0.z, -c1.y
    mov r1.y, r0.x
    dp3 r0.x, r1, c3
    mul r0.x, c1.x, r0.x
    dp3 r0.w, r1, c2
    mov r0.y, r0.w
    dp3 r0.w, r1, c1.x
    mul r0.w, c1.x, r0.w
    mov r0.z, r0.w
    mov r1.w, c0.w
    mov r1.xyz, r0
    mov oC0, r1
    // 17 instructions, 2 R-regs.
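For checking shader output on the CPU, the same arithmetic can be mirrored in C. This is a sketch only: the names `RGBf` and `shader_yuv_to_rgb` are hypothetical, the parameters x, y, z stand for the three components of `yuvcolor` after the 0.5 chroma offset has been subtracted, and the `2*(…/2)` factors in the shader are folded away since they cancel.

```c
/* Floating-point RGB triple, mirroring the shader's float3 result. */
typedef struct {
    float r, g, b;
} RGBf;

/*
 * Hypothetical CPU mirror of the shader arithmetic.
 * x is the luma sample in [0, 1]; y and z are the two chroma
 * components already centred on zero (sample - 0.5), in the same
 * component order the shader uses for yuvcolor.
 * No clamping is done here; the result may fall outside [0, 1].
 */
RGBf shader_yuv_to_rgb(float x, float y, float z)
{
    RGBf c;
    c.r = x + 1.402f * z;
    c.g = x - 0.344136f * y - 0.714136f * z;
    c.b = x + 1.773f * y;
    return c;
}
```

With both chroma components at zero, the result is a grey whose three channels equal the luma input.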

18 Quick Analysis
YUV -> RGB
–17 instructions and 2 registers
–352x240 = 84480 px * 17 = ~1.4M instr/frame

19 Just for Fun
What if we needed 1024 instructions??
–352x240 = 84480 px * 1024 = 86,507,520 instr/frame
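The per-frame instruction counts are simple products and can be reproduced with a few lines of C. The function name `shader_instr_per_frame` is hypothetical; the model assumes the shader runs exactly once per output pixel.

```c
/*
 * Hypothetical cost model: one shader invocation per pixel,
 * each executing instr_per_pixel instructions.
 */
unsigned long long shader_instr_per_frame(unsigned width, unsigned height,
                                          unsigned instr_per_pixel)
{
    return (unsigned long long)width * height * instr_per_pixel;
}
```

For a 352x240 frame this gives 1,436,160 instructions with the 17-instruction shader and 86,507,520 with a 1024-instruction shader.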
