CS6963 L17: Asynchronous Concurrent Execution, OpenGL Rendering


Administrative

Midterm
- In class April 5, open notes
- Review notes, readings and Lecture 15

Project Feedback
- Everyone should have feedback from me
- Follow up later today on a few responses

Design Review
- Intermediate assessment of progress on project (next slide)
- Due Monday, April 12, 5PM
- Sign up on doodle.com poll

Final projects
- Poster session, April 28 (dry run April 26)
- Final report, May 5

Design Reviews

Goal is to see a solid plan for each project and make sure projects are on track
- Plan to evolve the project so that results are guaranteed
- Show that at least one thing is working
- How work is being divided among team members

Major suggestions from proposals
- Project complexity: break it down into smaller chunks with an evolutionary strategy
- Add references: what has been done before? Known algorithm? GPU implementation?
- In some cases, the proposal claims no communication is needed, but it seems needed to me

Design Reviews

Oral, 10-minute Q&A session (April 14 in class, April 13/14 office hours, or by appointment)
- Each team member presents one part
- Team should identify a "lead" to present the plan

Three major parts:
I. Overview
- Define the computation and its high-level mapping to the GPU
II. Project Plan
- The pieces and who is doing what
- What is done so far? (Make sure something is working by the design review)
III. Related Work
- Prior sequential or parallel algorithms/implementations
- Prior GPU implementations (or similar computations)

Submit slides and a written document revising the proposal that covers these and cleans up anything missing from the proposal.

Final Project Presentation

Dry run on April 26
- Easels, tape and poster board provided
- Tape a set of PowerPoint slides to a standard 2'x3' poster, or bring your own poster

Poster session during class on April 28
- Invite your friends, profs who helped you, etc.

Final Report on Projects due May 5
- Submit code
- And a written document, roughly 10 pages, based on earlier submission
- In addition to the original proposal, include:
  - Project Plan and How Decomposed (from DR)
  - Description of CUDA implementation
  - Performance Measurement
  - Related Work (from DR)

Sources for Today's Lecture

Presentation (possibly related to particle project in SDK): nced_CUDA_Training_NVISION08.pdf
- This presentation also talks about finite differencing and molecular dynamics.

Asynchronous copies in the CUDA Software Developer Kit: the projects called asyncAPI and bandwidthTest

Chapter in CUDA 3.0 Programmer's Guide; chapter on OpenGL interoperability

Overview of Concurrent Execution

Review semantics of kernel launch

Key performance consideration of using the GPU as an accelerator? COPY COST!!!

Some applications are data intensive, and even large device memories are too small to hold the data
- Appears to be a feature of some of your projects, and probably of the MRI application we studied

Concurrent operation available on newer GPUs:
- Overlapping Data Transfer and Kernel Execution
- Concurrent Kernel Execution
- Concurrent Data Transfers

Review from L2: Semantics of Kernel Launch

Kernel launch is asynchronous (CC > 1.1); synchronize at the end.

Timing example (excerpt from simpleStreams in the CUDA SDK):

    cudaEvent_t start_event, stop_event;
    cudaEventCreate(&start_event);
    cudaEventCreate(&stop_event);
    cudaEventRecord(start_event, 0);
    init_array<<<blocks, threads>>>(d_a, d_c, niterations);
    cudaEventRecord(stop_event, 0);
    cudaEventSynchronize(stop_event);
    cudaEventElapsedTime(&elapsed_time, start_event, stop_event);

A bunch of runs in a row (excerpt from transpose in the CUDA SDK):

    for (int i = 0; i < numIterations; ++i) {
        transpose<<<grid, threads>>>(d_odata, d_idata, size_x, size_y);
    }
    cudaThreadSynchronize();

Optimizing Host-Device Transfers

Host-Device Data Transfers
- Device-to-host memory bandwidth is much lower than device-to-device bandwidth
- 8 GB/s peak (PCI-e x16 Gen 2) vs. 102 GB/s peak (Tesla C1060)

Minimize transfers
- Intermediate data can be allocated, operated on, and deallocated without ever copying it to host memory

Group transfers
- One large transfer is much better than many small ones (see the sketch below)

Slide source: NVIDIA, 2008
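A minimal sketch of the "group transfers" advice; the names (copy_grouped, h_chunks, d_buf) and sizes are made up for illustration, not taken from the SDK:

    #include <cuda_runtime.h>
    #include <string.h>

    #define N_CHUNKS 64
    #define CHUNK_BYTES 4096

    // h_chunks: N_CHUNKS small host arrays; d_buf: device buffer of
    // N_CHUNKS * CHUNK_BYTES bytes.
    void copy_grouped(char *h_chunks[N_CHUNKS], char *d_buf)
    {
        static char h_staging[N_CHUNKS * CHUNK_BYTES];
        // Pack the small arrays contiguously on the host...
        for (int i = 0; i < N_CHUNKS; i++)
            memcpy(h_staging + i * CHUNK_BYTES, h_chunks[i], CHUNK_BYTES);
        // ...then cross the PCI-e bus once instead of N_CHUNKS times,
        // amortizing the per-transfer overhead.
        cudaMemcpy(d_buf, h_staging, sizeof(h_staging), cudaMemcpyHostToDevice);
    }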

Asynchronous Copy To/From Host (compute capability 1.1 and above)

Warning: I have not tried this, and could not find a lot of information on it.

Concept:
- Memory bandwidth can be a limiting factor on GPUs
- Sometimes computation cost is dominated by copy cost
- But for some computations, data can be "tiled" and computation on tiles can proceed in parallel (some of our projects)
- Can we be computing on one tile while copying another?

Strategy:
- Use page-locked memory on the host, and asynchronous copies
- Primitive: cudaMemcpyAsync
- Effect: the GPU performs DMA from host memory
- Synchronize with cudaThreadSynchronize()
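A minimal sketch of this strategy end to end (buffer names and the 1 MB size are made up), using the era-appropriate cudaThreadSynchronize():

    #include <cuda_runtime.h>

    int main()
    {
        int nbytes = 1 << 20;
        float *h_data, *d_data;
        cudaMallocHost((void**)&h_data, nbytes);   // page-locked host memory
        cudaMalloc((void**)&d_data, nbytes);
        // Asynchronous DMA: the call returns to the CPU immediately
        // (stream 0 here).
        cudaMemcpyAsync(d_data, h_data, nbytes, cudaMemcpyHostToDevice, 0);
        // The CPU could do independent work here while the copy proceeds.
        cudaThreadSynchronize();                   // wait for the copy to finish
        cudaFree(d_data);
        cudaFreeHost(h_data);
        return 0;
    }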

Copying from Host to Device

cudaMemcpy(dst, src, nBytes, direction)
- Can only go as fast as the PCI-e bus, and is not eligible for asynchronous data transfer

cudaMallocHost(...): page-locked host memory
- Use this in place of standard malloc(...) on the host
- Prevents the OS from paging host memory
- Allows PCI-e DMA to run at full speed

Asynchronous data transfer
- Requires page-locked host memory

Enables highest cudaMemcpy performance
- 3.2 GB/s on PCI-e x16 Gen1, 5.2 GB/s on PCI-e x16 Gen2

See bandwidthTest in the SDK
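A hedged sketch contrasting the two allocation paths (function and variable names are illustrative; timing calls omitted, see bandwidthTest for the real measurement):

    #include <stdlib.h>
    #include <cuda_runtime.h>

    void compare_paths(float *d_a, size_t nbytes)
    {
        // Pageable host memory: the driver must stage the copy internally.
        float *h_pageable = (float*)malloc(nbytes);
        cudaMemcpy(d_a, h_pageable, nbytes, cudaMemcpyHostToDevice);

        // Page-locked host memory: DMA runs at full PCI-e speed, and only
        // this allocation is eligible for cudaMemcpyAsync.
        float *h_pinned;
        cudaMallocHost((void**)&h_pinned, nbytes);
        cudaMemcpy(d_a, h_pinned, nbytes, cudaMemcpyHostToDevice);

        cudaFreeHost(h_pinned);
        free(h_pageable);
    }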

What Does Page-Locked Host Memory Mean?

How the async copy works:
- DMA is performed by the GPU memory controller
- The CUDA driver takes virtual addresses and translates them to physical addresses
- The physical addresses are then handed to the GPU
- Now what happens if the host OS decides to swap out the page???

A special malloc holds the page in place on the host
- Prevents the host OS from moving the page
- cudaMallocHost()

But performance could degrade if this is done on lots of pages!
- Bypasses virtual memory mechanisms

Example of Asynchronous Data Transfer

    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaMemcpyAsync(dst1, src1, size, dir, stream1);
    kernel<<<grid, block, 0, stream1>>>(…);
    cudaMemcpyAsync(dst2, src2, size, dir, stream2);
    kernel<<<grid, block, 0, stream2>>>(…);

src1 and src2 must have been allocated using cudaMallocHost.
stream1 and stream2 identify the streams associated with the asynchronous calls (note the 4th "parameter" in the kernel invocation's execution configuration).

Code from asyncAPI SDK project

    // allocate host memory
    CUDA_SAFE_CALL( cudaMallocHost((void**)&a, nbytes) );
    memset(a, 0, nbytes);

    // allocate device memory
    CUDA_SAFE_CALL( cudaMalloc((void**)&d_a, nbytes) );
    CUDA_SAFE_CALL( cudaMemset(d_a, 255, nbytes) );
    …
    // declare grid and thread dimensions and create start and stop events

    // asynchronously issue work to the GPU (all to stream 0)
    cudaEventRecord(start, 0);
    cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
    increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);
    cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
    cudaEventRecord(stop, 0);

    // have CPU do some work while waiting for GPU to finish

    // release resources
    CUDA_SAFE_CALL( cudaFreeHost(a) );
    CUDA_SAFE_CALL( cudaFree(d_a) );

Tiling Idea

Use tiling to manage data too large to fit into the GPU (a CUDA sketch follows below):

    for each pair of tiles (make sure not to go off the end):
        if (not first iteration) synchronize stream1
        Initiate copy into GPU for stream1
        Launch kernel for stream1
        if (not first iteration) synchronize stream2
        Initiate copy for stream2
        Launch kernel for stream2
    // Clean up
    Last tile, final synchronization
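One way the pseudocode might map to CUDA streams; process, TILE, ntiles, nblocks/nthreads and the buffer names are assumptions for illustration, and h_data must have been allocated with cudaMallocHost:

    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    for (int t = 0; t < ntiles; t++) {       // one tile per iteration
        int s = t % 2;                       // alternate between the two streams
        // Wait until this stream's previous copy + kernel have finished
        // before reusing its device buffer (a no-op on the first two tiles).
        cudaStreamSynchronize(stream[s]);
        cudaMemcpyAsync(d_tile[s], h_data + t * TILE, TILE * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<nblocks, nthreads, 0, stream[s]>>>(d_tile[s], TILE);
    }
    cudaThreadSynchronize();                 // final synchronization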

Concurrent Kernel Execution (compute capability 2.0)

NOTE: Not available in current systems!

Execute multiple kernels concurrently
- Upper limit is 4 kernels at a time
- Keep in mind that sharing of all resources between kernels may limit concurrency and its profitability
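On such a device, concurrency would be expressed by launching into distinct non-default streams; kernelA/kernelB and their launch configurations are placeholders:

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    // Kernels launched into different non-default streams may overlap if
    // resources allow; launches in the same stream (or stream 0) serialize.
    kernelA<<<gridA, blockA, 0, s1>>>(d_x);
    kernelB<<<gridB, blockB, 0, s2>>>(d_y);
    cudaThreadSynchronize();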

Concurrent Data Transfers (compute capability 2.0)

NOTE: Not available in current systems!

Can concurrently copy from host to GPU and GPU to host using asynchronous Memcpy.

A Few More Details

Only available in "some" more recent architectures.

Does your device support cudaMemcpyAsync overlap?
- Call cudaGetDeviceProperties()
- Check the deviceOverlap property

Does your device support concurrent kernel execution?
- Call cudaGetDeviceProperties()
- Check the concurrentKernels property
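A minimal check along these lines:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // properties of device 0
        printf("deviceOverlap (copy/kernel overlap supported): %d\n",
               prop.deviceOverlap);
        printf("concurrentKernels: %d\n", prop.concurrentKernels);
        return 0;
    }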

OpenGL Rendering

OpenGL buffer objects can be mapped into the CUDA address space and then used as global memory
- Vertex buffer objects
- Pixel buffer objects

Allows direct visualization of data from computation
- No device-to-host transfer
- Data stays in device memory: very fast compute/viz
- Automatic DMA from Tesla to Quadro (via host for now)

Data can be accessed from the kernel like any other global data (in device memory).

OpenGL Interoperability

1. Register a buffer object with CUDA
   - cudaGLRegisterBufferObject(GLuint buffObj);
   - OpenGL can use a registered buffer only as a source
   - Unregister the buffer prior to rendering to it by OpenGL
2. Map the buffer object to CUDA memory
   - cudaGLMapBufferObject(void **devPtr, GLuint buffObj);
   - Returns an address in global memory
   - Buffer must be registered prior to mapping
3. Launch a CUDA kernel to process the buffer
   - Unmap the buffer object prior to use by OpenGL: cudaGLUnmapBufferObject(GLuint buffObj);
4. Unregister the buffer object
   - cudaGLUnregisterBufferObject(GLuint buffObj);
   - Optional: needed if the buffer is a render target
5. Use the buffer object in OpenGL code

Example from simpleGL in SDK

1. GL calls to create and initialize the buffer, which is then registered with CUDA:

    // create buffer object
    glGenBuffers(1, vbo);
    glBindBuffer(GL_ARRAY_BUFFER, *vbo);

    // initialize buffer object
    unsigned int size = mesh_width * mesh_height * 4 * sizeof(float) * 2;
    glBufferData(GL_ARRAY_BUFFER, size, 0, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    // register buffer object with CUDA
    cudaGLRegisterBufferObject(*vbo);

Example from simpleGL in SDK, cont.

2. Map the OpenGL buffer object for writing from CUDA:

    float4 *dptr;
    cudaGLMapBufferObject((void**)&dptr, vbo);

3. Execute the kernel to compute values for dptr:

    dim3 block(8, 8, 1);
    dim3 grid(mesh_width / block.x, mesh_height / block.y, 1);
    kernel<<<grid, block>>>(dptr, mesh_width, mesh_height, anim);

4. Unmap the OpenGL buffer object and return control to OpenGL:

    cudaGLUnmapBufferObject(vbo);
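For completeness, the kernel invoked in step 3 simply writes one float4 vertex per thread into the mapped buffer. A sketch in the spirit of simpleGL's sine-wave kernel; details may differ from the SDK version:

    // Each thread computes one animated vertex of the mesh and writes it
    // directly into the mapped vertex buffer object.
    __global__ void kernel(float4 *pos, unsigned int width,
                           unsigned int height, float time)
    {
        unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

        // normalized coordinates in [-1, 1]
        float u = x / (float)width  * 2.0f - 1.0f;
        float v = y / (float)height * 2.0f - 1.0f;

        // time-animated height value
        float freq = 4.0f;
        float w = sinf(u * freq + time) * cosf(v * freq + time) * 0.5f;

        pos[y * width + x] = make_float4(u, w, v, 1.0f);
    }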

Next Week

Exam on Monday
Sorting algorithms / OpenCL?