ELIS – Multimedia Lab. Charles Hollemeersch, Bart Pieters. 28/11/2008. Guest lecture, GPU team, Multimedia Lab.

ELIS – Multimedia Lab GPGPU, Bart Pieters - MMLab, guest lecture OMMT – 28/11/2008
Who we are
Charles Hollemeersch
  – PhD student at Multimedia Lab
  – Charlesfrederik.Hollemeersch@ugent.be
Bart Pieters
  – PhD student at Multimedia Lab
  – Bart.Pieters@ugent.be
Visit our website
  – http://multimedialab.elis.ugent.be/ and http://multimedialab.elis.ugent.be/GPU

Our Research Topics
Video acceleration
  – accelerate state-of-the-art video codecs using the GPU
Game technology
  – texture compression, parallel game actors …
Medical visualization
  – reconstruction of medical images …
Multi-GPU applications
  – …

Introducing Multimedia Lab’s ‘Supercomputer’
Quad GPU PC
  – four GeForce GTX 280 video cards
  – 3732 gigaflops of GPU processing power

Agenda
8:30 – 9:45
  – Bart – GPGPU
10:00 – 11:15
  – Charles – Game Technology

ELIS – Multimedia Lab Bart Pieters Multimedia Lab – UGent 28/11/2008

Overview
Introduction
  – GPU
  – GPGPU
Programming Concepts and Mappings
  – Direct3D and OpenGL
  – NVIDIA CUDA
Case Study: Decoding H.264/AVC
  – motion compensation
  – results
Conclusions
Q&A

Graphics Processing Unit (GPU)
Programmable chip on graphics cards
Developed in a gaming context
  – 3-D scenery by means of rasterization
Programmable pipeline since DirectX 8
  – vertex and pixel shaders; geometry shaders were added in DirectX 10
  – high-level language support
Modern GPUs support high precision
  – 32-bit floating point
Massive floating-point processing power
  – 933 gigaflops (NVIDIA GeForce GTX 280)
  – 141.7 GB/s peak memory bandwidth
  – fast PCI-Express bus, up to 2 GB/s transfer speed
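
The 933 gigaflops figure can be reproduced with a back-of-the-envelope calculation. The sketch below is only an illustration: the 240 processors come from a later slide, while the 1.296 GHz shader clock and the 3 floating-point operations per cycle are assumptions taken from publicly quoted GTX 280 specifications, not from this lecture. The function name peakGflops is hypothetical.

```cpp
// Peak single-precision throughput in gigaflops:
// processors x clock (GHz) x floating-point operations per processor per cycle.
// The clock and flops-per-cycle values passed in are assumptions, not slide data.
inline double peakGflops(int processors, double shaderClockGHz, int flopsPerCycle) {
    return processors * shaderClockGHz * flopsPerCycle;
}
```

Under these assumptions, peakGflops(240, 1.296, 3) gives roughly 933 gigaflops for one card, and four such cards give roughly the 3732 gigaflops quoted for the quad GPU PC.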

CPU and GPU Comparison
                            Intel Xeon X5355        NVIDIA G80 (8800 GTX)
Clock Speed                 2.66 GHz                575 MHz
#Cores / SPEs               4                       128
Max. GFlop/s (float)        85                      500
Typical Instr. Duration     1-2 cycles (SSE)        min. 4 cycles
Die Size (mm²)              143                     480
Typical Memory Speed        8 GB/s (DDR2-1066)      86 GB/s (GDDR3-1800)
Power Usage (watt)          120                     185
Price (€)                   800                     500
Today’s GPUs are yesterday’s supercomputers

Why are GPUs so fast?
Parallelism
  – massively parallel, many-core architecture; needs a large amount of parallel work to be used efficiently
  – specialized hardware built for parallel tasks
  – more transistors mean more performance
Multi-billion dollar gaming industry drives innovation
[Figure: CPU and GPU chip layouts with their Control, Cache, and ALU areas, each next to its DRAM]

Computational Model: Stream Processing Model
GPU is practically a stream processor
Applications consist of streams and kernels
Each kernel takes relatively long to process (PCIe, memory latency)
  – latency hidden by throughput
[Figure: Input Stream → Kernel → Output Stream]
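
The stream model above can be sketched on the CPU. This is a minimal illustration, not code from the lecture: scaleKernel and runKernel are hypothetical names, and the kernel body (doubling each sample) is an arbitrary stand-in. The point is that the kernel touches one element at a time and no element depends on another, which is what lets a GPU process the whole stream in parallel.

```cpp
#include <algorithm>
#include <vector>

// Kernel: a small function applied independently to every stream element.
inline float scaleKernel(float sample) {
    return 2.0f * sample;
}

// Applies the kernel to each element of the input stream and returns the
// output stream. The input is never modified and the output is write-only.
std::vector<float> runKernel(const std::vector<float>& input) {
    std::vector<float> output(input.size());
    std::transform(input.begin(), input.end(), output.begin(), scaleKernel);
    return output;
}
```

For example, runKernel({1.0f, 2.5f}) yields a stream holding 2.0f and 5.0f.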

Inside a modern GPU
[Figure: GPU block diagram. Labels: Host, Data Assembler, Vtx Thread Issue, Geom Thread Issue, Pixel Thread Issue, Setup / Rstr / ZCull, Thread Processor, eight SP / TF / L1 groups, and six L2 / FB partitions.]

Introducing GPGPU
The GPU on commodity video cards has evolved into a processor that is
  – powerful
  – flexible
  – inexpensive
  – precise
Attractive platform for general-purpose computation

GPGPU
General-Purpose GPU
  – use the GPU for general-purpose algorithms
No magical GPU compiler
  – still no x86 processor (Larrabee?)
  – explicit mappings required using advanced APIs
Programming close to the hardware
  – trend towards higher abstraction, e.g. NVIDIA CUDA
Techniques are suited for future many-core architectures
  – future CPU/GPU projects, AMD Fusion, Larrabee, …
Dependency issues
  – hundreds of independent tasks required for efficient use

Stream Processing Model Revisited
GPU is practically a stream processor
Applications consist of streams and kernels
Read back is not possible
[Figure: Input Stream → Kernel → Output Stream]

GPGPU in Practice

GPGPU APIs
Classic way
  – (mis)use the graphics pipeline: render a special ‘scene’
  – Direct3D, OpenGL
  – pixel, geometry, and vertex shaders
New APIs specifically for GPGPU computations
  – NVIDIA CUDA, ATI CTM, DirectX 11 Compute Shader, OpenCL

3-D Pipeline
Deep pipeline
[Figure: the 3-D pipeline from CPU to GPU: Vertices (3D) → Xformed, Lit Vertices (2D) → Fragments (pre-pixels) → Final pixels (Color, Depth). Other labels: Graphics State, Programmable, Render-to-texture.]

3-D Pipeline - Transform
Vertex Shader
  – processing geometry data
  – input is a vertex: position, texture coordinates, vertex color, ...
  – output is a vertex
    struct Vertex {
        float3 position : POSITION;
        float4 color : COLOR0;
    };

    // waveParams is a hypothetical shader constant standing in for the
    // slide's undeclared IN.wave; the application is assumed to set it.
    float4 waveParams;

    Vertex wave(Vertex vin) {
        Vertex vout;
        vout.position.x = vin.position.x;
        vout.position.y = vin.position.y;
        // displace the height with a sine wave
        vout.position.z = (sin(vin.position.x) + sin(waveParams.x)) * 2.5f;
        vout.color = float4(1.0f, 1.0f, 1.0f, 1.0f);
        return vout;
    }

3-D Pipeline - Shading
Pixel (or fragment) Shader
  – input is interpolated vertex data: position, texture coordinates, normals, …
  – use texels from a texture
  – output is a fragment: pixel color, transparency, depth
  – result is stored in the frame buffer or in a texture (‘Render to Texture’)
    struct PSIn {
        float2 tex : TEXCOORD0;
    };
    struct PSOut {
        float4 color : COLOR0;
    };

    // texSampler replaces the slide's undeclared 'sampler', which is a
    // reserved type name in HLSL; the application is assumed to bind it.
    sampler2D texSampler;

    PSOut shade(PSIn pin) {
        PSOut pout;
        // fetch the texel at the interpolated texture coordinate
        pout.color = tex2D(texSampler, pin.tex);
        return pout;
    }

GPU-CPU Analogies
Explicit mapping onto 3-D concepts is necessary
Rewrite an algorithm and find parallelism
Use the GPU in parallel to the CPU
  1. upload data to the GPU
     – very fast PCI-Express bus, up to 2 GB/s transfer speed
  2. process the data
     – meanwhile the CPU is available
  3. download result to the CPU
     – recent GPU models have high download speed

GPU-CPU Pipelined Design
[Figure: CPU and GPU timelines connected by an intermediary buffer in system memory. CPU stages: work, prepare GPU data. GPU stages: process data, visualize.]

GPU-CPU Analogies (2)
Array (CPU) = Texture (GPU)

GPU-CPU Analogies (3)
Array Write (CPU) = Render to Texture (GPU)
CPU:
    …
    fish[] = createfish()
    …
    for all pixels
        bwfish[i][j] = bw(fish[i][j]);
    …
GPU:
    Render
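
The CPU side of this analogy, written out as a small sketch. The slide does not define bw(); the integer luma weights below (77, 150, 29, summing to 256) are an assumption chosen for illustration, and Rgb and toGray are hypothetical names. Each output element is written exactly once from one input element, which is the property that maps onto rendering into a texture.

```cpp
#include <cstdint>
#include <vector>

struct Rgb { std::uint8_t r, g, b; };

// Converts one pixel to a gray value. The weights are an assumed
// fixed-point approximation of a luma formula, not taken from the slides.
inline std::uint8_t bw(const Rgb& p) {
    return static_cast<std::uint8_t>((77 * p.r + 150 * p.g + 29 * p.b) >> 8);
}

// The "for all pixels" loop: one independent array write per pixel.
std::vector<std::uint8_t> toGray(const std::vector<Rgb>& image) {
    std::vector<std::uint8_t> gray;
    gray.reserve(image.size());
    for (const Rgb& p : image) {
        gray.push_back(bw(p));
    }
    return gray;
}
```

On the GPU, the loop disappears: the body of bw() becomes the pixel shader, and rendering a quad over the target texture runs it once per pixel.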

GPU-CPU Analogies (4)
Loop body / kernel / algorithm step (CPU) = Fragment Program (GPU)
Motion Compensation in C++ (CPU):
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            Vec2 mv = mvectors[y/4][x/4];   // one motion vector per 4x4 block
            int ox = Clip(x + mv.x);
            int oy = Clip(y + mv.y);
            output[y][x] = input[oy][ox];
        }
    }
Motion Compensation in Microsoft HLSL (GPU):
    // 'in', 'out', and 'sampler' are reserved words in HLSL, so the
    // slide's names are replaced by pin, pout, and texSampler.
    PSOut motioncompens(PSIn pin) {
        PSOut pout;
        float2 mv = pin.mv;
        float2 texcoords = pin.texcoords;
        texcoords += mv;                             // displace the lookup by the motion vector
        pout.color = tex2D(texSampler, texcoords);   // fetch from the reference picture
        return pout;
    }
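
A self-contained version of the CPU loop, as a sketch under stated assumptions: the slide leaves Clip, Vec2, and the picture type undefined, so the definitions below (clipTo clamping to the picture borders, pictures as nested vectors of int) are illustrative choices. As on the slide, one full-pel motion vector is stored per 4x4 block.

```cpp
#include <algorithm>
#include <vector>

struct Vec2 { int x, y; };
using Picture = std::vector<std::vector<int>>;

// Clamps a coordinate to [0, maxValue] so that lookups stay inside the picture.
inline int clipTo(int v, int maxValue) {
    return std::max(0, std::min(v, maxValue));
}

// Full-pel motion compensation: every output sample is copied from the
// reference picture at the position displaced by its block's motion vector.
Picture motionCompensate(const Picture& reference,
                         const std::vector<std::vector<Vec2>>& mvectors) {
    const int height = static_cast<int>(reference.size());
    const int width = height > 0 ? static_cast<int>(reference[0].size()) : 0;
    Picture output(height, std::vector<int>(width));
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const Vec2 mv = mvectors[y / 4][x / 4];
            const int ox = clipTo(x + mv.x, width - 1);
            const int oy = clipTo(y + mv.y, height - 1);
            output[y][x] = reference[oy][ox];
        }
    }
    return output;
}
```

No iteration reads a value written by another iteration, so the loop body can become a fragment program without any change to the logic.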

GPU Loop for Each Pixel
Render a Quad: Vertex Shader → Rasterizer → Pixel Shader
[Figure: the motioncompens pixel shader is shown many times after the rasterizer, one copy per pixel of the quad.]

GPGPU-specific APIs
NVIDIA CUDA
  – Compute Unified Device Architecture
  – C code with annotations, compiled to executable code
DirectX 11 Compute Shader
  – shader execution without rendering
  – technology preview available in latest DirectX SDK
OpenCL
  – Open Computing Language
  – C code (C99-based) with annotations
ATI CTM
  – Close To Metal
  – GPU assembler
  – deprecated

NVIDIA CUDA
General-Purpose GPU Computing Platform
GPU is a super-threaded co-processor
  – acceleration of massive amounts of GPU threads
Supported on NVIDIA G80 and higher
  – 50-500 EUR price range
No more (mis)use of 3-D API
C code with annotations for
  – memory location
  – host or device functions
  – thread synchronization
Compilation with CUDA compiler
  – split host and device code
  – linkable object code

NVIDIA CUDA - Example
    void runGPUTest() {
        CUT_DEVICE_INIT();
        ...
        float* d_data = NULL;
        // allocate gpu memory
        cudaMalloc((void**) &d_data, size);
        dim3 dimBlock(8, 8, 1);
        dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
        // run kernel on gpu, one thread per output element
        transformKernel<<<dimGrid, dimBlock>>>(d_data);
        // download
        cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
        ...
    }
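
The grid size in this example is computed as width / dimBlock.x, which only covers the whole picture when width and height are multiples of the block size. A common way to handle other sizes is to round the division up, sketched here in plain C++. Dim, divUp, and gridFor are hypothetical helper names, not part of the CUDA API, and a kernel launched this way would also need a bounds check for threads that fall outside the picture.

```cpp
struct Dim { unsigned x, y; };

// Integer division rounded up: the smallest count of blocks of size d covering n.
inline unsigned divUp(unsigned n, unsigned d) {
    return (n + d - 1) / d;
}

// Number of blocks needed in each direction to cover a width x height picture.
inline Dim gridFor(unsigned width, unsigned height, Dim block) {
    return Dim{divUp(width, block.x), divUp(height, block.y)};
}
```

With 8x8 blocks, a 1920x1080 picture needs 240x135 blocks either way, while a 1921x1081 picture needs 241x136 blocks and truncating division would leave the last column and row unprocessed.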

NVIDIA CUDA – Example (2)
    // mvtex, reftex, and width are declared elsewhere (not shown on the slide)
    __global__ void transformKernel(float* g_odata) {
        // calculate the pixel coordinates handled by this thread
        unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
        unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
        // fetch the motion vector and read the displaced reference sample
        int2 mv = tex2D(mvtex, x, y);
        int mx = x + mv.x;
        int my = y + mv.y;
        g_odata[y*width + x] = tex2D(reftex, mx, my);
    }

Programming Model
[Figure: the host (CPU) launches Kernel 1 and Kernel 2 on the device (GPU) as Grid 1 and Grid 2. A grid consists of blocks, Block (0, 0) to Block (2, 1); a block such as Block (1, 1) consists of threads, Thread (0, 0) to Thread (2, 3).]
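
How the grid, block, and thread indices combine into one global coordinate can be emulated on the CPU. The sketch below is an illustration only: Dim2 and countVisits are hypothetical names, and the nested loops stand in for what the GPU runs in parallel. It uses the same formula as the kernel example, x = blockIdx.x * blockDim.x + threadIdx.x, and counts how often each element is visited.

```cpp
#include <vector>

struct Dim2 { unsigned x, y; };

// Walks over every (block, thread) pair of a grid and records which element
// of a row-major array that thread would handle.
std::vector<int> countVisits(Dim2 grid, Dim2 block) {
    const unsigned width = grid.x * block.x;
    const unsigned height = grid.y * block.y;
    std::vector<int> visits(width * height, 0);
    for (unsigned by = 0; by < grid.y; ++by) {
        for (unsigned bx = 0; bx < grid.x; ++bx) {
            for (unsigned ty = 0; ty < block.y; ++ty) {
                for (unsigned tx = 0; tx < block.x; ++tx) {
                    const unsigned x = bx * block.x + tx;  // global column
                    const unsigned y = by * block.y + ty;  // global row
                    ++visits[y * width + x];
                }
            }
        }
    }
    return visits;
}
```

For the layout drawn on the slide, a 3x2 grid of blocks with 3x4 threads each, this covers 72 elements and every element is visited exactly once.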

Hardware Model
Multiprocessor – MP (16)
Streaming Processor (8 per MP)
  – handles one thread
Memory
  – very fast, but high-latency
  – uncached
  – special memory hardware for constants & texture (cached)
Registers
  – limited amount

CUDA Threads
Each Streaming Processor handles one thread
  – 240 on GeForce GTX 280!
Smart hardware can schedule thousands of threads on 240 processors
Extremely lightweight
  – not like CPU threads
Threads per Multiprocessor handled in SIMD manner
  – each thread executes the same instruction at a given clock cycle
  – lock-step execution
[Figure: Block (1, 1) with its threads, Thread (0, 0) to Thread (2, 3)]

Lock-step Execution
Threads 1 to 32 run the same program with a locked program counter:
    x = x * 2;
    if (x > 10)
        z = 0;
    else
        z = y / 2;
    ++x;
[Figure: per-thread state. Thread 1: x = 3; Thread 2: x = 100; …; Thread 31: x = 200; Thread 32: x = 1. The values of y and z are elided on the slide.]
Heavy branching needs to be avoided
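
Why branching is costly under a locked program counter can be shown with a small CPU simulation. This is a simplified model, not a description of the actual hardware scheduler: ThreadState and runLockStep are hypothetical names, and one "issue slot" is counted per statement. The group runs the slide's program, and whenever the threads disagree on the condition, both branches are issued while the threads on the other path sit idle.

```cpp
#include <vector>

struct ThreadState { int x, y, z; };

// Runs the slide's program for a group of threads that share one program
// counter and returns the number of instruction slots the group issues.
int runLockStep(std::vector<ThreadState>& threads) {
    int issued = 0;

    for (ThreadState& t : threads) t.x = t.x * 2;  // x = x * 2, all threads
    ++issued;

    bool anyThen = false;
    bool anyElse = false;
    for (const ThreadState& t : threads) {         // evaluate x > 10
        if (t.x > 10) anyThen = true; else anyElse = true;
    }
    ++issued;

    if (anyThen) {                                 // z = 0, other threads masked
        for (ThreadState& t : threads) if (t.x > 10) t.z = 0;
        ++issued;
    }
    if (anyElse) {                                 // z = y / 2, other threads masked
        for (ThreadState& t : threads) if (!(t.x > 10)) t.z = t.y / 2;
        ++issued;
    }

    for (ThreadState& t : threads) ++t.x;          // ++x, all threads
    ++issued;
    return issued;
}
```

With the x values from the figure (3, 100, 200, 1) the group diverges and issues five slots, whereas a group in which every thread takes the same branch issues only four. Longer branches make the difference proportionally larger.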

Decoding H.264/AVC
Many decoding steps are suitable for parallelization
  – quantization
  – transformation
  – motion compensation
  – deblocking
  – color space conversion
Others introduce dependencies
  – entropy coding
  – intra prediction

Video Coding Hardware
Specialized on-board 2-D video processing chips
  – one macroblock at a time
  – black boxes
  – limited support for non-Windows systems
  – limited support for various video codecs, e.g. H.264/AVC profiles
  – partly programmable
GPU
  – millions of transistors
  – accessible via 3-D API or General Purpose GPU API

Decoding an H.264/AVC bitstream
H.264/AVC
  – recent video coding standard
  – successor of MPEG-4 Visual
Computationally intensive
  – multiple reference frames (up to 16)
  – B-pictures
  – sub-pixel interpolations
Motion compensation, reconstruction, deblocking, and color space conversion
  – take up to 80% of total processing time
  – suitable for execution on the GPU

Pipelined Design for Video Decoding
[Figure: CPU and GPU pipeline stages connected by an intermediary buffer in system memory.
  CPU: read bitstream; variable-length decoding (VLD), inverse quantization (IQ), inverse transformation
  Buffer: motion vectors (MVs), residue, quantization parameters (QPs)
  GPU: motion compensation (MC), reconstruction, deblocking; color space conversion (CSC), visualization]

Motion Compensation
[Figure: encoder side. The input sequence over time provides a reference picture and the current picture. Motion vectors (x1,y1), (x2,y2), (x3,y3), … select the prediction from the reference picture; residual data = current picture - prediction.]

Motion Compensation: Decoder
[Figure: decoder side. Motion vectors (x1,y1), (x2,y2), (x3,y3), … select the prediction from the reference picture; prediction + residual data gives the decoded picture.]
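
The decoder-side addition can be written as a short per-sample operation. The sketch assumes 8-bit samples and clips the sum to the range 0 to 255; reconstruct is a hypothetical name, and both inputs are assumed to have the same length. Like motion compensation itself, each output sample depends only on its own inputs.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Adds the decoded residual to the motion-compensated prediction and clips
// the result to the 8-bit sample range.
std::vector<std::uint8_t> reconstruct(const std::vector<std::uint8_t>& prediction,
                                      const std::vector<int>& residual) {
    std::vector<std::uint8_t> out(prediction.size());
    for (std::size_t i = 0; i < prediction.size(); ++i) {
        const int sum = static_cast<int>(prediction[i]) + residual[i];
        out[i] = static_cast<std::uint8_t>(std::max(0, std::min(sum, 255)));
    }
    return out;
}
```

For instance, a prediction sample of 250 with a residual of 10 is clipped to 255, and a prediction of 5 with a residual of -9 is clipped to 0.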

Motion Compensation in CUDA
[Figure: Kernel 1 runs on the device (GPU) as a grid of blocks, Block (0, 0) to Block (2, 1); the threads of a block, Thread (0, 0) to Thread (2, 2), are mapped onto the output array.]

Motion Compensation in Direct3D
Put video picture in textures
Use vertices to represent a macroblock
Let texture coordinates point to the texture
Full-pel motion compensation
  – manipulate texture coordinates
Multiple pixel shaders fill macroblocks and interpolate
[Figure: macroblock vertices carrying texture coordinates such as [0.50,0.30], [0.60,0.30], [0.50,0.40], [0.60,0.40], [0.50,0.50], [0.60,0.50], which point into the reference texture during rasterization]

Interpolation Strategies for Sub-pixel MC
[Figure: a vertex grid covering the viewable area is processed by the vertex shaders and then by Pixel Shader 1, Pixel Shader 2, and Pixel Shader 3. Other labels: Full-Pel, Half-Pel, Q-Pel.]
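
The slides do not spell out the interpolation itself. As background, the sketch below shows the luma interpolation defined by the H.264/AVC standard, which is what the half-pel and quarter-pel shaders would have to compute: half-sample positions use a six-tap filter with weights (1, -5, 20, 20, -5, 1), and quarter-sample positions are the rounded average of two neighbouring samples. The function names halfPel and quarterPel are hypothetical, and how the lecture's shaders organize this work is not known from the text.

```cpp
#include <algorithm>

// Half-sample value between g and h from six neighbouring full samples
// e, f, g, h, i, j on one row or column, rounded and clipped to 8 bits.
inline int halfPel(int e, int f, int g, int h, int i, int j) {
    const int rounded = e - 5 * f + 20 * g + 20 * h - 5 * i + j + 16;
    if (rounded < 0) {
        return 0;  // clip negative filter results before shifting
    }
    return std::min(rounded >> 5, 255);
}

// Quarter-sample value: rounded average of the two nearest samples.
inline int quarterPel(int a, int b) {
    return (a + b + 1) >> 1;
}
```

In a flat area where all six samples are 100, the filter returns 100, since the weights sum to 32. Each interpolated sample reads a small fixed neighbourhood of the reference picture, which maps well onto texture fetches in a pixel shader.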

Experimental Results
GPU algorithm is faster than the CPU algorithm
CPU is offloaded, free for other tasks

Conclusions
GPU is an attractive platform for general-purpose computation
  – flexible, powerful, inexpensive
General-purpose APIs
  – approach the GPU as a super-threaded co-processor
GPGPU requires lots of parallel jobs
  – e.g. hundreds to thousands
GPGPU allows faster execution while offloading the CPU
  – e.g. decoding of H.264/AVC bitstreams
GPGPU techniques are suited for future architectures

Questions?
Multimedia Lab
  – http://multimedialab.elis.ugent.be/GPU
  – Bart.Pieters@ugent.be
