
1 Unified CUDA Memory
Rui (Ray) Wu raywu1990@nevada.unr.edu

2 Outline
Profile
Unified Memory
Ideas about Unified Vector Dot Product
How to add vectors longer than the maximum thread number?
PA2

3 Profile
What is nvprof? Profile with: nvprof ./PA0 <argv>
"nvprof" does not need "cudaEvent_t" and gives more detailed information.
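For contrast, here is the manual timing that "cudaEvent_t" requires. This is a minimal sketch (the kernel name and launch configuration are placeholders, not from the slides); nvprof reports per-kernel times without adding any of this code.

```cuda
// Manual kernel timing with cudaEvent_t -- the boilerplate nvprof lets you skip.
// myKernel, blocks, and threads are illustrative placeholders.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
myKernel<<<blocks, threads>>>(/* args */);
cudaEventRecord(stop);
cudaEventSynchronize(stop);   // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

With nvprof you simply run `nvprof ./PA0 <argv>` and every kernel launch and memory transfer is instrumented automatically.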

4 Unified Memory

5 Unified Memory
Key idea: allocate and access data that can be used by code running on any processor in the system, CPU or GPU.
No need for "cudaMemcpyHostToDevice" and "cudaMemcpyDeviceToHost".
Works with multiple GPUs and multiple CPUs.
Read more details: beginners/

6 Unified Memory

7 Unified Memory: Vector Addition
Example: call cudaDeviceSynchronize() to synchronize before the host accesses the data!
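The slide's example can be sketched as follows. This is a minimal illustrative version (kernel and variable names are my own, not from the slides): one cudaMallocManaged allocation replaces the malloc/cudaMalloc/cudaMemcpy sequence, and cudaDeviceSynchronize() must run before the host reads the result.

```cuda
#include <cstdio>

// Vector addition with Unified Memory -- a minimal sketch.
__global__ void add(int n, float *x, float *y, float *z) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = x[i] + y[i];
}

int main() {
    int n = 1 << 20;
    float *x, *y, *z;
    // One allocation visible to both CPU and GPU -- no cudaMemcpy needed.
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    cudaMallocManaged(&z, n * sizeof(float));

    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    add<<<(n + 255) / 256, 256>>>(n, x, y, z);

    // Synchronize BEFORE the host touches the data!
    cudaDeviceSynchronize();
    printf("z[0] = %f\n", z[0]);

    cudaFree(x); cudaFree(y); cudaFree(z);
    return 0;
}
```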

8 Unified Memory
How does it work?
Data is stored in "pages": Unified Memory is able to automatically migrate data at the level of individual pages between host and device memory.
Pages move between CPU memory and GPU memory.
cudaMemcpy => cudaMallocManaged
A page is similar to a cache: it performs better if you use the loaded data multiple times.
Read: three methods to avoid page faults
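One of the standard ways to avoid page faults is to prefetch pages before the kernel needs them. A hedged sketch, assuming a managed allocation `x` of `n` floats and a placeholder `kernel` (cudaMemPrefetchAsync requires a Pascal or newer GPU):

```cuda
// Prefetch managed memory to the GPU before launching the kernel,
// so the kernel does not stall on page faults (Pascal+ only).
// x is assumed to come from cudaMallocManaged; kernel is a placeholder.
int device;
cudaGetDevice(&device);
cudaMemPrefetchAsync(x, n * sizeof(float), device);           // pages: host -> GPU
kernel<<<blocks, threads>>>(x);
cudaMemPrefetchAsync(x, n * sizeof(float), cudaCpuDeviceId);  // pages: GPU -> host
cudaDeviceSynchronize();
```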

9 Unified Memory
When the GPU accesses any absent pages, it stalls execution of the accessing threads, and the Page Migration Engine migrates the pages to the device before resuming the threads.
Pre-Pascal GPUs lack hardware page faulting, so coherence cannot be guaranteed: an access from the CPU while a kernel is running will cause a segmentation fault!
Pascal and Volta GPUs support system-wide atomic memory operations. That means you can atomically operate on values anywhere in the system from multiple GPUs.
What are "Pascal" and "Volta":

10 Unified Memory
49-bit virtual addressing and on-demand page migration: 49-bit virtual addresses are sufficient for GPUs to access the entire system memory plus the memory of all GPUs in the system.
49 bits means how many GB? Discuss in next class.
More reading materials: cuda-6/

11 Ideas about Unified Vector Dot Product
Step 1: calculate the product of each pair in one block (serves PA2)
Step 2: __syncthreads() the threads in this block
Step 3: sum reduction

12 Ideas about Unified Vector Dot Product: Sum Reduction

13 Ideas about Unified Vector Dot Product: Sum Reduction
__syncthreads() synchronizes the threads in this block.
Book page 80 introduces how to do this using shared memory (the older, pre-Unified-Memory approach).
More details:
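The three steps above can be sketched as one kernel in the style of the book's shared-memory example. This is an illustrative version for a single block (names and the THREADS constant are my own):

```cuda
#define THREADS 256

// Dot product within one block: pairwise products into shared memory,
// a barrier, then a tree-style sum reduction.
__global__ void dot(int n, float *a, float *b, float *result) {
    __shared__ float cache[THREADS];
    int tid = threadIdx.x;

    // Step 1: each thread computes one pairwise product.
    cache[tid] = (tid < n) ? a[tid] * b[tid] : 0.0f;

    // Step 2: wait until every thread in the block has written its product.
    __syncthreads();

    // Step 3: sum reduction, halving the number of active threads each pass.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }

    if (tid == 0) *result = cache[0];  // thread 0 holds the final sum
}
```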

14 How to add vectors longer than the maximum thread number?
Figure: show the relations between vector elements and threads. Draw on the board!
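The relation the board figure illustrates (one thread handling several elements) is commonly written as a grid-stride loop. A sketch, assuming names of my own choosing:

```cuda
// When n exceeds the total number of launched threads, each thread
// processes elements i, i + stride, i + 2*stride, ... -- a grid-stride loop.
__global__ void add(int n, float *x, float *y, float *z) {
    int stride = gridDim.x * blockDim.x;  // total threads in the whole grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        z[i] = x[i] + y[i];
}
```

With this pattern the kernel is correct for any n, regardless of the launch configuration.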

15 How to add vectors longer than the maximum thread number?

16 PA2: Matrix Multiplication
Now we know how to do a vector dot product with one block. How about matrix multiplication? Draw the graph on the board!
More details:
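Matrix multiplication builds directly on the dot product: each output element is the dot product of one row of A with one column of B. A naive sketch (not the required PA2 solution; names and the row-major layout are my own assumptions):

```cuda
// Naive matrix multiplication: one thread per output element C[row][col].
// Square n x n matrices stored in row-major order.
__global__ void matMul(int n, const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += A[row * n + k] * B[k * n + col];  // row of A dot column of B
        C[row * n + col] = sum;
    }
}

// Launch sketch:
//   dim3 threads(16, 16);
//   dim3 blocks((n + 15) / 16, (n + 15) / 16);
//   matMul<<<blocks, threads>>>(n, A, B, C);
```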

17 Thank you! Questions?

