CS6963 L17: Lessons from Particle System Implementations

L17: Particle Systems 2 CS6963
Administrative
Still missing some design reviews
- Please e-mail me your slides from the presentation
- And updates to reports
- By Thursday, Apr 16, 5PM
Grading
- Lab2 problem 1 graded, problem 2 under construction
- Return exams by Friday AM
Upcoming cross-cutting systems seminar, Monday, April 20, 12:15-1:30PM, LCR: "Technology Drivers for Multicore Architectures," Rajeev Balasubramonian, Mary Hall, Ganesh Gopalakrishnan, John Regehr
Final reports on projects
- Poster session April 29, with dry run April 27
- Also, submit written document and software by May 6
- Invite your friends! I'll invite faculty, NVIDIA, graduate students, application owners, ...

L17: Particle Systems 3 CS6963
Particle Systems
- MPM/GIMP
- Particle animation and other special effects
- Monte Carlo transport simulation
- Fluid dynamics
- Plasma simulations
What are the performance/implementation challenges?
- Global synchronization
- Global memory access costs (and how to reduce them)
- Copy to/from host overlapped with computation
Many of these issues arise in other projects
- E.g., overlapping host copies with computation in image mosaicing

L17: Particle Systems 4 CS6963
Sources for Today's Lecture
- A particle system simulation in the CUDA Software Developer Kit, called particles
- Implementation description in /Developer/CUDA/projects/particles/doc/particles.pdf
- Possibly related presentation in nced_CUDA_Training_NVISION08.pdf (this presentation also talks about finite differencing and molecular dynamics)
- Asynchronous copy example in the CUDA Software Developer Kit, called asyncAPI

L17: Particle Systems 5 CS6963
Relevant Lessons from Particle Simulation
1. Global synchronization using atomic operations
2. Asynchronous copy from host to GPU
3. Use of shared memory to cache particle data
4. Use of texture cache to accelerate particle lookup
5. OpenGL rendering

L17: Particle Systems 6 CS6963
1. Global Synchronization
Concept:
- We need to perform some computations on particles, and others on grid cells
- Existing MPM/GIMP provides a mapping from particles to the grid nodes to which they contribute
- We would like the inverse mapping, from grid cells to the particles that contribute to their result
Strategy:
- Decompose the threads so that each computes results at a particle
- Use global synchronization to construct an inverse mapping from grid cells to particles
- Primitive: atomicAdd

L17: Particle Systems 7 CS6963
Example Code to Build Inverse Mapping

    __device__ void addParticleToCell(int3 gridPos, uint index,
                                      uint* gridCounters, uint* gridCells)
    {
        // calculate grid hash
        uint gridHash = calcGridHash(gridPos);

        // increment cell counter using atomics
        int counter = atomicAdd(&gridCounters[gridHash], 1);
        counter = min(counter, params.maxParticlesPerCell - 1);

        // write particle index into this cell (very uncoalesced!)
        gridCells[gridHash*params.maxParticlesPerCell + counter] = index;
    }

- index refers to the index of the particle
- gridPos represents the grid cell in 3-d space
- gridCells is the data structure in global memory for the inverse mapping
What this does: builds up gridCells as an array limited by the maximum number of particles per grid cell; atomicAdd returns how many particles had already been added to this cell.
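For context, a minimal sketch of how a kernel might consume this inverse mapping, with one thread per grid cell; the kernel name processCellsD and its structure are illustrative, not part of the SDK example:

    __global__ void processCellsD(uint* gridCounters, uint* gridCells,
                                  uint numCells)
    {
        uint cell = blockIdx.x * blockDim.x + threadIdx.x;
        if (cell >= numCells) return;

        // clamp, since atomicAdd may have counted past the per-cell capacity
        // (params is the same constant-memory struct used in addParticleToCell)
        uint count = min(gridCounters[cell], params.maxParticlesPerCell);

        // visit every particle that contributed to this cell
        for (uint i = 0; i < count; i++) {
            uint p = gridCells[cell * params.maxParticlesPerCell + i];
            // ... accumulate particle p's contribution to this cell ...
        }
    }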

L17: Particle Systems 8 CS6963
2. Asynchronous Copy To/From Host
Warning: I have not tried this, and could not find a lot of information on it.
Concept:
- Memory bandwidth can be a limiting factor on GPUs
- Sometimes the computation cost is dominated by the copy cost
- But for some computations, data can be "tiled" and computation of tiles can proceed in parallel (some of our projects)
- Can we be computing on one tile while copying another?
Strategy:
- Use page-locked memory on the host, and asynchronous copies
- Primitive: cudaMemcpyAsync
- Synchronize with cudaThreadSynchronize()

L17: Particle Systems 9 CS6963
Copying from Host to Device
- cudaMemcpy(dst, src, nBytes, direction) can only go as fast as the PCI-e bus, and is not eligible for asynchronous data transfer
- cudaMallocHost(…): page-locked host memory
  - Use this in place of standard malloc(…) on the host
  - Prevents the OS from paging host memory
  - Allows PCI-e DMA to run at full speed
- Asynchronous data transfer requires page-locked host memory
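A minimal sketch of the allocation pattern, assuming an element count N defined elsewhere; h_data and nBytes are illustrative names:

    float *h_data;                        // host pointer
    size_t nBytes = N * sizeof(float);    // N assumed defined elsewhere

    // page-locked (pinned) allocation in place of malloc
    cudaMallocHost((void**)&h_data, nBytes);

    // ... fill h_data, issue asynchronous transfers ...

    cudaFreeHost(h_data);                 // matching free for pinned memory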

L17: Particle Systems 10 CS6963
Example of Asynchronous Data Transfer

    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaMemcpyAsync(dst1, src1, size, dir, stream1);
    kernel<<<grid, block, 0, stream1>>>(…);
    cudaMemcpyAsync(dst2, src2, size, dir, stream2);
    kernel<<<grid, block, 0, stream2>>>(…);

- src1 and src2 must have been allocated using cudaMallocHost
- stream1 and stream2 identify the streams associated with each asynchronous call (note the 4th "parameter" to the kernel invocation)

L17: Particle Systems 11 CS6963
Particle Data Has Some Reuse
Two ideas:
- Cache particle data in shared memory (3.)
- Cache particle data in texture cache (4.)

L17: Particle Systems 12 CS6963
Code from Oster presentation
Newtonian mechanics on point masses:

    struct particleStruct {
        float3 pos;
        float3 vel;
        float3 force;
    };

    pos = pos + vel*dt
    vel = vel + force/mass*dt
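For contrast with the shared-memory version on the next slide, a minimal sketch of the same update reading and writing global memory directly; the kernel name integrateD and passing dt and mass as parameters are assumptions:

    __global__ void integrateD(particleStruct* P, float dt, float mass)
    {
        // assumes the launch configuration exactly covers the particle array
        int idx = threadIdx.x + blockIdx.x * blockDim.x;

        // every read and write below goes to global memory
        P[idx].pos.x += P[idx].vel.x * dt;
        P[idx].pos.y += P[idx].vel.y * dt;
        P[idx].pos.z += P[idx].vel.z * dt;
        P[idx].vel.x += P[idx].force.x / mass * dt;
        P[idx].vel.y += P[idx].force.y / mass * dt;
        P[idx].vel.z += P[idx].force.z / mass * dt;
    }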

L17: Particle Systems 13 CS6963
3. Cache Particle Data in Shared Memory

    __shared__ float3 s_pos[N_THREADS];
    __shared__ float3 s_vel[N_THREADS];
    __shared__ float3 s_force[N_THREADS];

    int tx = threadIdx.x;
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    s_pos[tx] = P[idx].pos;
    s_vel[tx] = P[idx].vel;
    s_force[tx] = P[idx].force;
    __syncthreads();
    // float3 operators come from the SDK's cutil_math.h helpers
    s_pos[tx] = s_pos[tx] + s_vel[tx] * dt;
    s_vel[tx] = s_vel[tx] + s_force[tx]/mass * dt;
    P[idx].pos = s_pos[tx];
    P[idx].vel = s_vel[tx];

L17: Particle Systems 14 CS6963
4. Use Texture Cache for Read-Only Data
Texture memory is a special section of device global memory
- Read only
- Cached by spatial locality (1D, 2D, 3D)
Can achieve high performance
- If there is reuse within the thread block, so accesses hit the cache
- Useful to eliminate the cost of uncoalesced global memory access
Requires special mechanisms for defining a texture and accessing a texture

L17: Particle Systems 15 CS6963
Using Textures: from Finite Difference Example
Declare a texture ref

    texture<float, 2> fTex;

Bind f to texture ref via an array

    cudaMallocArray(fArray, …);
    cudaMemcpy2DToArray(fArray, f, …);
    cudaBindTextureToArray(fTex, fArray, …);

Access with array texture functions

    f[x,y] = tex2D(fTex, x, y);

L17: Particle Systems 16 CS6963
Use of Textures in Particle Simulation
A macro determines whether texture is used.
a. Declaration of texture references in particles_kernel.cu

    #if USE_TEX
    // textures for particle position and velocity
    texture<float4, 1, cudaReadModeElementType> oldPosTex;
    texture<float4, 1, cudaReadModeElementType> oldVelTex;
    texture<uint2, 1, cudaReadModeElementType> particleHashTex;
    texture<uint, 1, cudaReadModeElementType> cellStartTex;
    texture<uint, 1, cudaReadModeElementType> gridCountersTex;
    texture<uint, 1, cudaReadModeElementType> gridCellsTex;
    #endif

L17: Particle Systems 17 CS6963
Use of Textures in Particle Simulation
b. Bind/unbind textures right before kernel invocation

    #if USE_TEX
    CUDA_SAFE_CALL(cudaBindTexture(0, oldPosTex, oldPos, numBodies*sizeof(float4)));
    CUDA_SAFE_CALL(cudaBindTexture(0, oldVelTex, oldVel, numBodies*sizeof(float4)));
    #endif

    reorderDataAndFindCellStartD<<<numBlocks, numThreads>>>(
        (uint2 *) particleHash,
        (float4 *) oldPos,
        (float4 *) oldVel,
        (float4 *) sortedPos,
        (float4 *) sortedVel,
        (uint *) cellStart);

    #if USE_TEX
    CUDA_SAFE_CALL(cudaUnbindTexture(oldPosTex));
    CUDA_SAFE_CALL(cudaUnbindTexture(oldVelTex));
    #endif

L17: Particle Systems 18 CS6963
Use of Textures in Particle Simulation
c. Texture fetch (hidden in a macro)

    #ifdef USE_TEX
    #define FETCH(t, i) tex1Dfetch(t##Tex, i)
    #else
    #define FETCH(t, i) t[i]
    #endif

Here's an access in particles_kernel.cu:

    float4 pos = FETCH(oldPos, index);

L17: Particle Systems 19 CS6963
5. OpenGL Rendering
OpenGL buffer objects can be mapped into the CUDA address space and then used as global memory
- Vertex buffer objects
- Pixel buffer objects
Allows direct visualization of data from computation
- No device-to-host transfer
- Data stays in device memory: very fast compute/viz
- Automatic DMA from Tesla to Quadro (via host for now)
Data can be accessed from the kernel like any other global data (in device memory)

L17: Particle Systems 20 CS6963
OpenGL Interoperability
1. Register a buffer object with CUDA
   - cudaGLRegisterBufferObject(GLuint buffObj);
   - OpenGL can use a registered buffer only as a source
   - Unregister the buffer prior to rendering to it by OpenGL
2. Map the buffer object to CUDA memory
   - cudaGLMapBufferObject(void** devPtr, GLuint buffObj);
   - Returns an address in global memory
   - Buffer must be registered prior to mapping
3. Launch a CUDA kernel to process the buffer
4. Unmap the buffer object prior to use by OpenGL
   - cudaGLUnmapBufferObject(GLuint buffObj);
5. Unregister the buffer object
   - cudaGLUnregisterBufferObject(GLuint buffObj);
   - Optional: needed if the buffer is a render target
6. Use the buffer object in OpenGL code
A sketch of this sequence in code follows the list.
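A minimal sketch of the sequence above, assuming a vertex buffer object vbo created by OpenGL code elsewhere; the toy kernel and launch configuration are illustrative:

    #include <cuda_gl_interop.h>

    // toy kernel: nudge each vertex in the mapped buffer
    __global__ void updateVerticesD(float4* pos, unsigned int n)
    {
        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) pos[i].y += 0.01f;
    }

    void runCudaPass(GLuint vbo, unsigned int numVertices)
    {
        // 1. register the buffer object with CUDA (typically once, after creation)
        cudaGLRegisterBufferObject(vbo);

        // 2. map it to obtain a device pointer usable as global memory
        float4* dptr;
        cudaGLMapBufferObject((void**)&dptr, vbo);

        // 3. process the buffer with a kernel
        updateVerticesD<<<(numVertices + 255) / 256, 256>>>(dptr, numVertices);

        // 4. unmap before OpenGL touches the buffer again
        cudaGLUnmapBufferObject(vbo);

        // 5. unregister at teardown, or before OpenGL renders into the buffer
        cudaGLUnregisterBufferObject(vbo);
    }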

L17: Particle Systems 21 CS6963
Final Project Presentation
Dry run on April 27
- Easels, tape, and poster board provided
- Tape a set of PowerPoint slides to a standard 2'x3' poster, or bring your own poster
Final report on projects due May 6
- Submit code
- And a written document, roughly 10 pages, based on the earlier submission
- In addition to the original proposal, include:
  - Project plan and how it was decomposed (from DR)
  - Description of the CUDA implementation
  - Performance measurement
  - Related work (from DR)

L17: Particle Systems 22 CS6963
Final Remaining Lectures
- This one: Particle Systems
- April 20: Sorting
- April 22: ?
  - Would like to talk about dynamic scheduling
  - If nothing else, the following paper: "Efficient Computation of Sum-products on GPUs Through Software-Managed Cache," M. Silberstein, A. Schuster, D. Geiger, A. Patney, J. Owens, ICS 2008.