Presentation on theme: "Håkon Kvale Stensland, Simula Research Laboratory" — Presentation transcript:
1 INF5063 – GPU & CUDA
Håkon Kvale Stensland, Simula Research Laboratory
2 PC Graphics Timeline
Challenges:
- Render infinitely complex scenes
- And extremely high resolution
- In 1/60th of one second (60 frames per second)
Graphics hardware has evolved from a simple hardwired pipeline to a highly programmable multiword processor.
(Note: Games drive the market.)
Timeline (1998–2006): Riva 128 (DirectX 5) → Riva TNT (DirectX 6, multitexturing) → GeForce 256 (DirectX 7, T&L, TextureStageState) → GeForce 3 (DirectX 8, SM 1.x) → Cg → GeForceFX (DirectX 9, SM 2.0) → GeForce 6 (DirectX 9.0c, SM 3.0) → GeForce 7 (DirectX 9.0c, SM 3.0) → GeForce 8 (DirectX 10, SM 4.0)
3 Basic 3D Graphics Pipeline
Host: Application, Scene Management
GPU: Geometry, Rasterization, Pixel Processing, ROP/FBI/Display, Frame Buffer Memory
(Note: Geometry = sets up vertices (triangle edges) and builds triangles. Rasterization = determines which pixels belong to a triangle. Pixel = manipulates pixels and applies effects. ROP = Render OutPut = the final "touch" on triangles, AA, etc.; also decides whether we need the pixel at all (is it visible? Hidden Surface Removal).)
4 Graphics in the PC Architecture
- QuickPath (QPI) between processor and Northbridge (X58)
- Memory control in the CPU
- Northbridge (IOH) handles PCI Express
- PCIe 2.0 x16 bandwidth at 16 GB/s (8 GB/s in each direction)
- Southbridge (ICH10) handles all other peripherals
(Note: PCI Express is a serial bus with multiple lanes, 0.5 GB/s in each direction per lane. More power! PCI Express 3.0 is coming soon.)
5 High-end Hardware: nVIDIA Fermi Architecture
- The latest generation GPU, codenamed GF100
- 3.1 billion transistors
- 512 processing cores (SP)
- IEEE 754 capable
- Shared coherent L2 cache
- Full C++ support
- Up to 16 concurrent kernels
(Note: 40 nm process — the largest chip at TSMC.)
6 Lab Hardware
nVidia GeForce GTX 280:
- Based on the GT200 chip
- 1400 million transistors
- 240 processing cores (SP) at 1476 MHz
- 1024 MB memory with 159 GB/s bandwidth
nVidia GeForce 8800GT:
- Based on the G92 chip
- 754 million transistors
- 112 processing cores (SP) at 1500 MHz
- 256 MB memory with 57.6 GB/s bandwidth
(Note: Both are Compute 1.1. One machine has an 8800GTX – Compute 1.0.)
7 GeForce GF100 Architecture
A "mile-high" overview
8 nVIDIA GF100 vs. GT200 Architecture
- TPC = Texture Processor Cluster
- SM = Streaming Multiprocessor
- SP = Stream Processor (operates on scalar values)
- SFU = Special Function Unit
- TEX = Texturing Unit
9 TPC… SM… SP… Some more details…
- TPC: Texture Processing Cluster
- SM: Streaming Multiprocessor — in CUDA: the multiprocessor, and the fundamental unit for a thread block
- TEX: Texture Unit
- SP: Stream Processor — scalar ALU for a single CUDA thread
- SFU: Special Function Unit
(Diagram: a TPC contains SMs and a TEX unit; each SM has an instruction L1 cache, a data L1 cache, instruction fetch/dispatch, shared memory, eight SPs, and two SFUs.)
10 SP: The basic processing block
- The nVIDIA approach: a Stream Processor works on a single operation
- AMD GPUs work on up to five operations
Now, let's take a step back for a closer look!
11 Streaming Multiprocessor (SM)
- 8 Streaming Processors (SP)
- 2 Special Function Units (SFU)
- Multi-threaded instruction dispatch
- 1 to 1024 threads active
- Tries to cover the latency of texture/memory loads
- Local register file (RF)
- 16 KB shared memory
- DRAM texture and memory access
Foils adapted from nVIDIA
12 SM Register File
- Register File (RF): 32 KB
- Provides 4 operands/clock
- The TEX pipe can also read/write the Register File
- 3 SMs share 1 TEX
- The Load/Store pipe can also read/write the Register File
(Note: TEX is more important for graphics.)
13 Constants
- Immediate address constants
- Indexed address constants
- Constants stored in memory, and cached on chip
- The L1 cache is per Streaming Multiprocessor
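In CUDA, these cached constants are exposed through `__constant__` memory, which the host initializes and device code reads through the per-SM constant cache. A minimal sketch (the array name, size, and kernel are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>

__constant__ float coeffs[8];   // lives in constant memory, cached on chip per SM

__global__ void applyCoeffs(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] *= coeffs[i % 8];  // all threads read through the constant cache
}

void setupCoeffs(const float *host_coeffs)
{
    // Only the host can write constant memory; device code reads it.
    cudaMemcpyToSymbol(coeffs, host_coeffs, 8 * sizeof(float));
}
```

When all threads in a warp read the same constant address, the cached read is nearly as fast as a register access.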
14 Shared Memory
- Each Streaming Multiprocessor has 16 KB of Shared Memory
- 16 banks of 32-bit words
- CUDA uses Shared Memory as shared storage visible to all threads in a thread block
- Read and write access
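A short sketch of how a thread block uses this shared storage: each thread stages a value into a `__shared__` array, the block synchronizes, and threads then read each other's values. The kernel name and block size are illustrative, not from the slides:

```cuda
#define BLOCK_SIZE 256

// Each block reverses its own segment of the input via shared memory.
__global__ void reverseInBlock(float *d_out, const float *d_in)
{
    __shared__ float tile[BLOCK_SIZE];   // fits easily in the SM's 16 KB

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = d_in[i];         // every thread writes one element
    __syncthreads();                     // wait until the whole block has written

    // Read another thread's element - only safe after the barrier above.
    d_out[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```

Without the `__syncthreads()` barrier, a thread could read a slot its neighbor has not yet written.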
15 Execution Pipes
- Scalar MAD pipe: float multiply, add, etc.; integer ops, conversions; only one instruction per clock
- Scalar SFU pipe: special functions like sin, cos, log, etc.; only one operation per four clocks
- TEX pipe (external to the SM, shared by all SMs in a TPC)
- Load/Store pipe: CUDA has both global and local memory access through Load/Store
(Note: FMUL = floating-point multiply, FADD = floating-point add; hot clock/shader clock.)
17 What is really GPGPU?
- General-purpose computation using the GPU in applications other than 3D graphics
- The GPU can accelerate parts of an application
- Parallel data algorithms using the GPU's properties:
  - Large data arrays, streaming throughput
  - Fine-grain SIMD parallelism
  - Fast floating point (FP) operations
Applications for GPGPU:
- Game effects (physics): nVIDIA PhysX
- Image processing (Photoshop CS4)
- Video encoding/transcoding (Elemental RapidHD)
- Distributed processing (Stanford)
- RAID6, AES, MATLAB, etc.
(Note: Mark Harris, of nVidia, runs the gpgpu.org website.)
18 Previous GPGPU use, and limitations
Working with a graphics API:
- Special cases with an API like Microsoft Direct3D or OpenGL
- Addressing modes: limited by texture size
- Shader capabilities: limited outputs of the available shader programs
- Instruction sets: no integer or bit operations
- Communication is limited: between pixels, per thread, per shader, per context
(Diagram: fragment program with input registers, texture, constants, temp registers, output registers, and FB memory.)
19 nVIDIA CUDA
"Compute Unified Device Architecture"
- General-purpose programming model
- The user starts several batches of threads on a GPU
- The GPU is in this case a dedicated super-threaded, massively data-parallel co-processor
Software stack:
- Graphics driver, language compilers (Toolkit), and tools (SDK)
- The graphics driver loads programs into the GPU
- All drivers from nVIDIA now support CUDA
- The interface is designed for computing (no graphics)
- "Guaranteed" maximum download & readback speeds
- Explicit GPU memory management
20 Outline
- The CUDA Programming Model: basic concepts and data types
- The CUDA Application Programming Interface: basic functionality
- An example application: the good old Motion JPEG implementation!
(Note: Today we lay out the skeleton… in two weeks there will be more meat on the bones.)
21 The CUDA Programming Model
The GPU is viewed as a compute device that:
- Is a coprocessor to the CPU, referred to as the host
- Has its own DRAM, called device memory
- Runs many threads in parallel
Data-parallel parts of an application are executed on the device as kernels, which run in parallel on many threads.
Differences between GPU and CPU threads:
- GPU threads are extremely lightweight, with very little creation overhead
- The GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
22 Thread Batching: Grids and Blocks
- A kernel is executed as a grid of thread blocks
- All threads share the data memory space
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution (non-synchronous execution is very bad for performance!)
  - Efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate
(Diagram: the host launches Kernel 1 and Kernel 2; each kernel runs as one grid of blocks on the device, and each block is a 2D array of threads.)
(Note: One kernel executes as one grid… one block runs on one SM (multiprocessor).)
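The grid/block hierarchy above maps directly onto CUDA's launch syntax: the host chooses a block size and a grid size, and each thread computes its own position from `blockIdx`, `blockDim`, and `threadIdx`. A minimal sketch (the kernel and sizes are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>

// One thread per element; the index is reconstructed from the hierarchy.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // guard: the grid may overshoot n
        data[i] *= factor;
}

void launchScale(float *d_data, int n)
{
    dim3 block(256);                          // threads per block
    dim3 grid((n + block.x - 1) / block.x);   // enough blocks to cover n elements
    scaleKernel<<<grid, block>>>(d_data, 2.0f, n);
}
```

The rounding-up in the grid computation is why the `i < n` guard is needed: the last block may contain threads past the end of the array.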
23 CUDA Device Memory Space Overview
Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
The host can R/W the global, constant, and texture memories.
Global, constant, and texture memory spaces are persistent across kernels called by the same application.
24 Global, Constant, and Texture Memories
Global memory:
- Main means of communicating R/W data between host and device
- Contents visible to all threads
Texture and constant memories:
- Constants initialized by the host
25 Terminology Recap
- device = GPU = set of multiprocessors
- Multiprocessor = set of processors & shared memory
- Kernel = program running on the GPU
- Grid = array of thread blocks that execute a kernel
- Thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory   | Location | Cached         | Access     | Who
Local    | Off-chip | No             | Read/write | One thread
Shared   | On-chip  | N/A (resident) | Read/write | All threads in a block
Global   | Off-chip | No             | Read/write | All threads + host
Constant | Off-chip | Yes            | Read       | All threads + host
Texture  | Off-chip | Yes            | Read       | All threads + host
26 Access Times
- Register – dedicated HW – single cycle
- Shared memory – dedicated HW – single cycle
- Local memory – DRAM, no cache – "slow"
- Global memory – DRAM, no cache – "slow"
- Constant memory – DRAM, cached – 1…10s…100s of cycles, depending on cache locality
- Texture memory – DRAM, cached – 1…10s…100s of cycles, depending on cache locality
28 CUDA Highlights
- The API is an extension to the ANSI C programming language: lower learning curve than OpenGL/Direct3D
- The hardware is designed to enable a lightweight runtime and driver: high performance
29 CUDA Device Memory Allocation
cudaMalloc():
- Allocates an object in device global memory
- Requires two parameters: the address of a pointer to the allocated object, and the size of the allocated object
cudaFree():
- Frees an object from device global memory
- Takes a pointer to the object
30 CUDA Device Memory Allocation
Code example:
- Allocate a 64 * 64 single-precision float array
- Attach the allocated storage to Md.elements
- "d" is often used to indicate a device data structure

    BLOCK_SIZE = 64;
    Matrix Md;
    int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float);

    cudaMalloc((void**)&Md.elements, size);
    cudaFree(Md.elements);
31 CUDA Host-Device Data Transfer
cudaMemcpy():
- Memory data transfer
- Requires four parameters: pointer to destination, pointer to source, number of bytes copied, and type of transfer
- Transfer types: host to host, host to device, device to host, device to device
- Asynchronous operations available (streams)
32 CUDA Host-Device Data Transfer
Code example:
- Transfer a 64 * 64 single-precision float array
- M is in host memory and Md is in device memory
- cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

    cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M.elements, Md.elements, size, cudaMemcpyDeviceToHost);
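Putting the allocation and transfer calls together, a complete host-to-device round trip might look like the sketch below (error handling omitted; the plain float-pointer names stand in for the slides' Matrix structure):

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

#define BLOCK_SIZE 64

int main(void)
{
    int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float);
    float *M  = (float *)malloc(size);   // host array
    float *Md = NULL;                    // device array ("d" = device)

    cudaMalloc((void **)&Md, size);                     // allocate device global memory
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);    // host -> device

    /* ... launch kernels that operate on Md ... */

    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);    // device -> host
    cudaFree(Md);                                       // release device memory
    free(M);
    return 0;
}
```

Note the destination-first argument order of cudaMemcpy, matching standard C memcpy.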
33 CUDA Function Declarations

                                Executed on the: | Only callable from the:
__device__ float DeviceFunc()   device           | device
__global__ void KernelFunc()    device           | host
__host__ float HostFunc()       host             | host

- __global__ defines a kernel function; it must return void
- __device__ and __host__ can be used together
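The table above can be illustrated with a small sketch; the function names are hypothetical, chosen only to show each qualifier in use:

```cuda
// __device__: runs on the device, callable only from device code.
__device__ float square(float x)
{
    return x * x;
}

// __global__: a kernel - runs on the device, launched from the host.
// Must return void.
__global__ void squareAll(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = square(v[i]);   // device code may call __device__ functions
}

// __host__ and __device__ combined: the compiler emits both a CPU
// version and a GPU version of the same function.
__host__ __device__ float twice(float x)
{
    return 2.0f * x;
}
```

Launching the kernel from the host would use the `squareAll<<<grid, block>>>(v, n)` syntax from the earlier slides.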
34 CUDA Function Declarations
- __device__ functions cannot have their address taken
- Limitations for functions executed on the device:
  - No recursion
  - No static variable declarations inside the function
  - No variable number of arguments
36 Compilation
- Any source file containing CUDA language extensions must be compiled with nvcc
- nvcc is a compiler driver: it works by invoking all the necessary tools and compilers, like cudacc, g++, etc.
- nvcc can output:
  - Either C code, which must then be compiled with the rest of the application using another tool
  - Or object code directly
37 Linking & Profiling
Any executable with CUDA code requires two dynamic libraries:
- The CUDA runtime library (cudart)
- The CUDA core library (cuda)
Several tools are available to optimize your application:
- nVIDIA CUDA Visual Profiler
- nVIDIA Occupancy Calculator
- Windows users: NVIDIA Parallel Nsight for Visual Studio
38 Debugging Using Device Emulation
An executable compiled in device emulation mode (nvcc -deviceemu):
- Needs no device or CUDA driver
When running in device emulation mode, one can:
- Use the host's native debug support (breakpoints, inspection, etc.)
- Call any host function from device code
- Detect deadlock situations caused by improper usage of __syncthreads
Also: nVIDIA CUDA GDB, and printf is now available on the device! (cuPrintf)
39 Before you start…
Four lines have to be added to your group user's .bash_profile or .bashrc file:

    PATH=$PATH:/usr/local/cuda/bin
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib
    export PATH
    export LD_LIBRARY_PATH

The SDK is downloaded in the /opt/ folder; copy it and build it in your user's home directory.
40 Some useful resources
- nVIDIA CUDA Programming Guide 3.2
- nVIDIA CUDA C Programming Best Practices Guide
- nVIDIA CUDA Reference Manual 3.2
42 14 different MJPEG encoders on GPU
- Results from the course INF5063: advanced course in programming heterogeneous architectures
- Two assignments (home exams) on MJPEG (on Cell and GPU)
- The Y axis IS NOT at the SAME SCALE
- One assignment was "I/O bound" on the Cell
- It looks like it was easy to get better performance on the Nvidia GeForce GPU (more about this later)
Problems:
- Only used global memory
- Too much synchronization between threads
- Host part of the code not optimized
43 Profiling a Motion JPEG encoder on x86
A small selection of DCT algorithms:
- 2D-Plain: standard forward 2D DCT
- 1D-Plain: two consecutive 1D transformations, with a transpose in between and after
- 1D-AAN: optimized version of 1D-Plain
- 2D-Matrix: 2D-Plain implemented with matrix multiplication
Single-threaded application profiled on an Intel Core i5 750
44 Optimizing for GPU: use the memory correctly!!
Several different types of memory on the GPU:
- Global memory
- Constant memory
- Texture memory
- Shared memory
First commandment when using the GPUs: select the correct memory space, AND use it correctly!
45 How about using a better algorithm??
- Used the CUDA Visual Profiler to isolate DCT performance
- 2D-Plain Optimized is optimized for the GPU:
  - Shared memory
  - Coalesced memory access
  - Loop unrolling
  - Branch prevention
  - Asynchronous transfers
Second commandment when using the GPUs: choose an algorithm suited for the architecture!
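To illustrate the coalescing point from the optimization list (this is a generic sketch, not the course's DCT code): threads in a warp should touch consecutive addresses so the hardware can merge their loads into a few wide memory transactions.

```cuda
// Coalesced: thread i accesses element i, so one warp touches one
// contiguous segment of memory - few memory transactions.
__global__ void copyCoalesced(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

// Strided: thread i accesses element i*stride, so one warp scatters
// across memory - the hardware cannot merge the loads, and effective
// bandwidth drops as the stride grows.
__global__ void copyStrided(float *dst, const float *src, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        dst[i] = src[i * stride];
}
```

On compute 1.x hardware like the lab's GT200/G92 cards, uncoalesced access patterns can cost an order of magnitude in global-memory bandwidth.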
46 Effect of offloading VLC to the GPU
VLC (Variable Length Coding) can also be offloaded:
- One thread per macroblock
- The CPU does the bitstream merge
Even though the algorithm is not perfectly suited for the architecture, the offloading effect is still important!