Unified CUDA Memory
Rui (Ray) Wu
raywu1990@nevada.unr.edu
Outline
- Profiling with nvprof
- Unified Memory
- Ideas about unified vector dot product
- How to add vectors longer than the maximum thread number
- PA2
Profiling
What is nvprof? NVIDIA's command-line profiler:
nvprof ./PA0 <argv>
Unlike manual timing with cudaEvent_t, nvprof needs no changes to your code and reports more detailed information.
Unified Memory
Unified Memory
Key idea: allocate and access data from code running on any processor in the system, CPU or GPU.
No need for explicit cudaMemcpyHostToDevice / cudaMemcpyDeviceToHost calls.
Works with multiple GPUs and multiple CPUs.
Read more details:
https://devblogs.nvidia.com/unified-memory-cuda-beginners/
http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3120-Unified-Memory-CUDA-6.0.pdf
Unified Memory: Vector Addition
Example: https://devblogs.nvidia.com/unified-memory-cuda-beginners/
cudaDeviceSynchronize: synchronize before accessing the data on the host!
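The devblogs pattern can be sketched as follows. This is a minimal sketch, not the exact code from the post; the kernel name add and the launch configuration are illustrative:

```cuda
// Vector addition with Unified Memory: cudaMallocManaged returns a
// pointer valid on both CPU and GPU, so no cudaMemcpy is needed.
#include <cstdio>

__global__ void add(int n, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] + y[i];
}

int main() {
    int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // visible to CPU and GPU
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }  // init on CPU

    add<<<(n + 255) / 256, 256>>>(n, x, y);
    cudaDeviceSynchronize();   // synchronize before the CPU touches y!

    printf("y[0] = %f\n", y[0]);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Note that the only CUDA-specific memory calls are cudaMallocManaged and cudaFree; the explicit host/device copies from the non-unified version are gone.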
Unified Memory
How does it work?
Data is stored in pages: Unified Memory automatically migrates data between host and device memory at the level of individual pages.
Pages move between CPU memory and GPU memory on demand.
cudaMemcpy calls are replaced by a single cudaMallocManaged allocation.
A page behaves like a cache: performance is better when you reuse the migrated data multiple times.
Read: three methods to avoid page faults.
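One of the methods the reading describes for avoiding page faults is prefetching. A sketch, assuming x and y are managed allocations of n floats and add, numBlocks, and blockSize come from a vector-add setup like the devblogs example (all names illustrative):

```cuda
// Prefetch managed pages to the GPU before the kernel launches, so the
// first access does not fault. cudaMemPrefetchAsync requires a Pascal
// or later GPU.
int device;
cudaGetDevice(&device);
cudaMemPrefetchAsync(x, n * sizeof(float), device);  // migrate pages of x
cudaMemPrefetchAsync(y, n * sizeof(float), device);  // migrate pages of y
add<<<numBlocks, blockSize>>>(n, x, y);              // no faults on first touch
```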
Unified Memory
When the GPU accesses a page that is not resident, it stalls the accessing threads, and the Page Migration Engine migrates the page to the device before resuming them.
Pre-Pascal GPUs lack hardware page faulting, so coherence can't be guaranteed: an access from the CPU while a kernel is running causes a segmentation fault!
Pascal and Volta GPUs support system-wide atomic memory operations, so you can atomically operate on values anywhere in the system from multiple GPUs.
What are "Pascal" and "Volta"? https://en.wikipedia.org/wiki/CUDA
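A system-wide atomic on managed memory can be sketched like this (the kernel name countHits and the threshold are illustrative; atomicAdd_system requires compute capability 6.0, i.e. Pascal or later):

```cuda
// Count elements above a threshold into a managed counter. The _system
// suffix makes the atomic coherent with the CPU and all GPUs in the
// system, not just the current device.
__global__ void countHits(const float *data, int n, unsigned int *counter) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0.5f)
        atomicAdd_system(counter, 1u);   // atomic w.r.t. CPU and all GPUs
}
```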
Unified Memory
49-bit virtual addressing and on-demand page migration: 49-bit virtual addresses are large enough for GPUs to access the entire system memory plus the memory of all GPUs in the system.
49 bits means how many GB? Discuss in next class.
More reading materials: https://devblogs.nvidia.com/unified-memory-in-cuda-6/
Ideas about Unified Vector Dot Product
Step 1: compute the product of each pair within one block (serves PA2)
Step 2: __syncthreads() the threads in this block
Step 3: sum reduction
Ideas about Unified Vector Dot Product: Sum Reduction
__syncthreads() the threads in this block.
Book page 80 introduces how to do this using shared memory (the classic approach).
More details: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
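The three steps above can be sketched as one kernel. This is a minimal sketch assuming the vector fits in a single block whose size is a power of two; the names dotProduct and partial are illustrative, not from the book:

```cuda
// Dot product in one block: pairwise products in shared memory,
// then a tree-based sum reduction.
__global__ void dotProduct(const float *a, const float *b, float *result, int n) {
    __shared__ float partial[256];          // one slot per thread
    int tid = threadIdx.x;
    // Step 1: each thread computes one pairwise product
    partial[tid] = (tid < n) ? a[tid] * b[tid] : 0.0f;
    // Step 2: wait until every product is in shared memory
    __syncthreads();
    // Step 3: tree reduction, halving the active stride each pass
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                    // all sums at this level done
    }
    if (tid == 0)
        *result = partial[0];               // thread 0 holds the final sum
}
```

The __syncthreads() inside the loop is essential: every level of the tree must finish before the next one reads its results.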
How to add vectors longer than the maximum thread number?
Figure: the relations among thread, block, and grid indices (drawn on the board).
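One common way to handle vectors longer than the grid can cover in a single pass is a grid-stride loop. A sketch, with the kernel name addVectors illustrative:

```cuda
// Each thread starts at its global index and then strides by the total
// number of threads in the grid, so any n is covered by any launch size.
__global__ void addVectors(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int stride = gridDim.x * blockDim.x;            // threads in the grid
    for (; i < n; i += stride)                      // i, i+stride, i+2*stride, ...
        c[i] = a[i] + b[i];
}
```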
PA2: Matrix Multiplication
Now we know how to compute a dot product with one block. How about matrix multiplication? Each output element is the dot product of one row of the first matrix and one column of the second (drawn on the board).
More details: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
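A naive starting point can be sketched as below. This is not the PA2 reference solution; the name matMul, the square N×N shape, and row-major storage are assumptions:

```cuda
// Each thread computes one element of C = A * B as the dot product of
// row `row` of A with column `col` of B. Matrices are row-major.
__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
```

Launched with a 2D grid of 2D blocks (e.g. 16x16 threads), each thread owns exactly one output element; shared-memory tiling can speed this up later.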
Thank you! Questions?