12 Administrative: Assignments
Homework:
- 8 assignments
- Each worth 10% of your grade (100 pts. each)
Final Project:
- 2 weeks for a custom final project
- Details are up to you!
- 20% of your grade (200 pts.)
13 Administrative: Assignments
- Assignments will be due Wednesday, 5 PM
- Extensions may be granted… talk to TAs beforehand!
Office Hours: located in 104 ANB
- Connor: Tuesday, 8-10 PM
- Kevin: Monday, 8-10 PM
14 Administrative: Assignments
Doing the assignments:
- CUDA-capable machine required! Must have an NVIDIA GPU
- Setting up the environment can be tricky
- Three options:
  - DIY with your own setup
  - Use the provided instructions with the given environment
  - Use lab machines
15 Administrative: Assignments
Submitting assignments:
- Due date: Wednesday, 5 PM
- Submit the assignment as a .tar/.zip, or similar
- Include a README file! Name, compilation instructions, answers to conceptual questions on the sets, etc.
- Submit all assignments to
Receiving graded assignments:
- Assignments should be returned one week after submission
- We will get back to you with a grade and comments
16 GPU History: Early Days
Before GPUs:
- All graphics run on the CPU
- Each pixel drawn in series
- Super slow! (CS171, anyone?)
Early GPUs:
- 1980s: Blitters (fixed image sprites) allowed fast image memory transfers
- 1990s: Introduction of DirectX and OpenGL
  - Brought the fixed-function pipeline for rendering
17 GPU History: Early Days
Fixed Function Pipeline:
- "Fixed" OpenGL states
  - Phong or Gouraud shading?
  - Render as wireframe or solid?
- Very limiting, made early games look similar
18 GPU History: Shaders
- Early 2000s: shaders introduced
- Allow for much more interesting shading models
19 GPU History: Shaders
Shaders: expanded the world of rendering greatly
- Vertex shaders: apply operations per-vertex
- Fragment shaders: apply operations per-pixel
- Geometry shaders: apply operations to add new geometry
20 GPU History: Shaders
These are great when dealing with graphics data…
- Vertices, faces, pixels, etc.
What about general purpose?
- Can trick the GPU
- DirectX "compute" shader may be an option
- Anything slicker?
21 GPU History: CUDA
2007: NVIDIA introduces CUDA
- C-style programming API for the GPU
- Easier to do GPGPU
- Easier memory handling
- Better tools, libraries, etc.
22 GPU History: CUDA
New advantages on the table:
- Scattered reads
- Shared memory
- Faster memory transfers to/from the GPU
23 GPU History: Other APIs
Plenty of other APIs exist for GPGPU:
- OpenCL/WebCL
- DirectX Compute Shader
- Others
24 Using the GPU
- Highly parallelizable parts of computational problems
25 A simple problem…
Add two arrays: A + B -> C
On the CPU:
- (allocate memory for C)
- For (i from 1 to array length):
    C[i] <- A[i] + B[i]
- Operates sequentially… can we do better? (see the sketch below)
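A minimal C/C++ sketch of this sequential CPU version (the function and variable names are illustrative, not from the course code):

    // Sequential CPU version: one loop, one addition at a time.
    void add_arrays_cpu(const float *A, const float *B, float *C, int n) {
        for (int i = 0; i < n; ++i) {
            C[i] = A[i] + B[i];   // each element handled in series
        }
    }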
26 A simple problem…
On the CPU (multi-threaded):
- (allocate memory for C)
- Create # of threads equal to the number of cores on the processor (around 2, 4, perhaps 8)
- (Allocate portions of A, B, C to each thread...)
- In each thread:
    For (all i in the region assigned to this thread):
        C[i] <- A[i] + B[i]
    // lots of waiting involved for memory reads, writes, ...
- Wait for threads to synchronize...
- Slightly faster – 2-8x (slightly more with other tricks)
- (see the sketch below)
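A hedged C++ sketch of the multi-threaded version, assuming std::thread on the host and splitting the arrays into contiguous slices (names are illustrative):

    #include <thread>
    #include <vector>

    // Multi-threaded CPU version: one slice of the arrays per hardware thread.
    void add_arrays_cpu_mt(const float *A, const float *B, float *C, int n) {
        unsigned num_threads = std::thread::hardware_concurrency();  // ~2-8 on typical CPUs
        if (num_threads == 0) num_threads = 4;

        std::vector<std::thread> threads;
        for (unsigned t = 0; t < num_threads; ++t) {
            // Each thread handles a contiguous region of the arrays.
            int begin = (int)((long long)n * t / num_threads);
            int end   = (int)((long long)n * (t + 1) / num_threads);
            threads.emplace_back([=]() {
                for (int i = begin; i < end; ++i)
                    C[i] = A[i] + B[i];
            });
        }
        for (auto &th : threads) th.join();  // wait for all threads to finish
    }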
27 A simple problem…
- How many threads?
- How does performance scale?
- Context switching: high penalty on the CPU!
28 A simple problem…
On the GPU:
- (allocate memory for A, B, C on the GPU)
- Create the "kernel" – each thread will perform one (or a few) additions
- Specify the following kernel operation:
    For (all i's assigned to this thread):
        C[i] <- A[i] + B[i]
- Start ~20,000 (!) threads
- Wait for threads to synchronize...
- Speedup: Very high! (e.g. 10x, 100x)
- (see the kernel sketch below)
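A minimal CUDA kernel sketch for this addition (kernel name and launch configuration are illustrative). The grid-stride loop is one common way to let each thread handle "one or a few" elements when the array is larger than the number of launched threads:

    // Each thread computes one (or a few) elements of C = A + B.
    __global__ void add_arrays_kernel(const float *A, const float *B, float *C, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x) {
            C[i] = A[i] + B[i];
        }
    }

    // Example launch with ~20,000 threads (40 blocks of 512 threads):
    // add_arrays_kernel<<<40, 512>>>(dev_A, dev_B, dev_C, n);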
29 GPU: Strengths Revealed
- Parallelism
- Low context-switch penalty!
- We can "cover up" performance loss by creating more threads!
30 GPU Computing: Step by Step
1. Set up inputs on the host (CPU-accessible memory)
2. Allocate memory for inputs on the GPU
3. Copy inputs from host to GPU
4. Allocate memory for outputs on the host
5. Allocate memory for outputs on the GPU
6. Start the GPU kernel
7. Copy output from GPU to host
(Copying can be asynchronous)
(see the host-code sketch below)
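A host-side sketch of these steps for the array-addition example, using the CUDA runtime API (error checking omitted; function and variable names are illustrative, and add_arrays_kernel is the kernel sketched earlier):

    #include <cuda_runtime.h>

    void gpu_add(const float *host_A, const float *host_B, float *host_C, int n) {
        size_t bytes = n * sizeof(float);
        float *dev_A, *dev_B, *dev_C;

        // Allocate memory for inputs and output on the GPU.
        cudaMalloc(&dev_A, bytes);
        cudaMalloc(&dev_B, bytes);
        cudaMalloc(&dev_C, bytes);

        // Copy inputs from host to GPU.
        cudaMemcpy(dev_A, host_A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dev_B, host_B, bytes, cudaMemcpyHostToDevice);

        // Start the GPU kernel.
        add_arrays_kernel<<<40, 512>>>(dev_A, dev_B, dev_C, n);

        // Copy the output from GPU back to host (this call waits for the kernel).
        cudaMemcpy(host_C, dev_C, bytes, cudaMemcpyDeviceToHost);

        cudaFree(dev_A);
        cudaFree(dev_B);
        cudaFree(dev_C);
    }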
31 GPU: Internals
Blocks: groups of threads
- Can cooperate via shared memory
- Can synchronize with each other
- Max size: 512 or 1024 threads (hardware-dependent)
Warps: subgroups of threads within a block
- Execute in lockstep
- Size: 32 threads
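A rough illustration of block-level cooperation (a hypothetical kernel, not from the course materials; it assumes a block size of at most 512 and an array length that is a multiple of the block size):

    // Threads in a block stage data in shared memory and synchronize before using it.
    __global__ void reverse_block(float *data) {
        __shared__ float tile[512];          // shared memory visible to the whole block
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = data[i];         // each thread loads one element
        __syncthreads();                     // block-wide barrier: wait for all loads

        // Write this block's elements back in reverse order.
        data[i] = tile[blockDim.x - 1 - threadIdx.x];
    }

    // Launch with blocks of up to 512 threads, e.g.:
    // reverse_block<<<num_blocks, 512>>>(dev_data);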