Unified CUDA Memory
Rui (Ray) Wu raywu1990@nevada.unr.edu

Outline
Profile
Unified Memory
Ideas about Unified Vector Dot Product
How to add vectors longer than the maximum number of threads?
PA2

Profile
What is nvprof? Profile with: nvprof ./PA0 <argv>
"nvprof" does not need "cudaEvent_t" timing code and gives more detailed information.
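For comparison, here is a minimal sketch of the "cudaEvent_t" timing boilerplate that nvprof makes unnecessary (the kernel busyKernel and its launch configuration are illustrative, not from the assignment):

    #include <cstdio>

    __global__ void busyKernel(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *x;
        cudaMalloc(&x, n * sizeof(float));  // contents irrelevant for this timing demo

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        busyKernel<<<(n + 255) / 256, 256>>>(x, n);  // the kernel being timed
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // wait until the stop event has happened

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
        printf("kernel time: %f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(x);
        return 0;
    }

With nvprof you simply run "nvprof ./PA0 <argv>" and get per-kernel and per-memcpy timings without touching the source.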

Unified Memory

Unified Memory
Key idea: allocate and access data that can be used by code running on any processor in the system, CPU or GPU.
No need for "cudaMemcpyHostToDevice" and "cudaMemcpyDeviceToHost" copies.
Works with multiple GPUs and multiple CPUs.
Read more details:
https://devblogs.nvidia.com/unified-memory-cuda-beginners/
http://on-demand.gputechconf.com/supercomputing/2013/presentation/SC3120-Unified-Memory-CUDA-6.0.pdf

Unified Memory

Unified Memory: Vector Addition
Example: https://devblogs.nvidia.com/unified-memory-cuda-beginners/
cudaDeviceSynchronize: synchronize before accessing the data on the host!
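A minimal sketch of unified-memory vector addition in the spirit of the blog post's example (the kernel name add and the launch configuration are illustrative):

    #include <cstdio>
    #include <cmath>

    __global__ void add(int n, float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = x[i] + y[i];
    }

    int main() {
        int n = 1 << 20;
        float *x, *y;
        // One allocation, visible to both CPU and GPU: no cudaMemcpy needed.
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));

        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }  // initialize on the CPU

        add<<<(n + 255) / 256, 256>>>(n, x, y);
        cudaDeviceSynchronize();  // wait for the GPU before the CPU reads y

        float maxError = 0.0f;
        for (int i = 0; i < n; i++) maxError = fmaxf(maxError, fabsf(y[i] - 3.0f));
        printf("Max error: %f\n", maxError);

        cudaFree(x);
        cudaFree(y);
        return 0;
    }

Note that the same pointers x and y are used by host loops and by the kernel; the only extra obligation is the cudaDeviceSynchronize before host access.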

Unified Memory
How does it work? Data is stored in "pages": Unified Memory automatically migrates data at the level of individual pages between host and device memory.
Pages move between CPU memory and GPU memory as needed.
cudaMalloc + cudaMemcpy => cudaMallocManaged
A page behaves much like a cache: it performs better if you reuse the migrated data multiple times.
Read: three methods to avoid page faults.
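One of the ways the blog post suggests to avoid page faults is prefetching. A sketch, continuing the vector-addition example above (it assumes the x, y, n, and add from that sketch):

    // Prefetch the managed allocations to the GPU before the launch,
    // so the kernel does not stall on on-demand page migration (Pascal+).
    int device = 0;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(x, n * sizeof(float), device, 0);
    cudaMemPrefetchAsync(y, n * sizeof(float), device, 0);

    add<<<(n + 255) / 256, 256>>>(n, x, y);

    // Prefetch the result back to the CPU (cudaCpuDeviceId means the host)
    // before the host loop reads it.
    cudaMemPrefetchAsync(y, n * sizeof(float), cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();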

Unified Memory
When a kernel accesses any absent pages, the GPU stalls execution of the accessing threads, and the Page Migration Engine migrates the pages to the device before resuming the threads.
Pre-Pascal GPUs lack hardware page faulting, so coherence can't be guaranteed: an access from the CPU while a kernel is running will cause a segmentation fault!
Pascal and Volta GPUs support system-wide atomic memory operations. That means you can atomically operate on values anywhere in the system from multiple GPUs.
What are "Pascal" and "Volta"? https://en.wikipedia.org/wiki/CUDA
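As a sketch of what a system-wide atomic looks like in code (the kernel name is illustrative; counter would typically be a cudaMallocManaged allocation, and the _system suffix requires compute capability 6.0 or newer):

    __global__ void bumpCounter(int *counter) {
        // System-wide scope: the update is atomic with respect to the CPU
        // and every GPU in the system, not just this one device.
        atomicAdd_system(counter, 1);
    }

Plain atomicAdd, by contrast, is only guaranteed atomic within a single GPU.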

Unified Memory
49-bit virtual addressing and on-demand page migration.
49-bit virtual addresses are sufficient to enable GPUs to access the entire system memory plus the memory of all GPUs in the system.
49 bits means how many GB? Discuss in next class.
More reading materials: https://devblogs.nvidia.com/unified-memory-in-cuda-6/

Ideas about Unified Vector Dot Product
Step 1: calculate the product of each pair within one block (serves PA2)
Step 2: __syncthreads() the threads in this block
Step 3: sum reduction
(A code sketch follows the Sum Reduction slide below.)

Ideas about Unified Vector Dot Product: Sum Reduction

Ideas about Unified Vector Dot Product: Sum Reduction
__syncthreads() synchronizes the threads in this block.
Page 80 of the book introduces how to do this using shared memory. Shared memory: old version.
More details: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
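A minimal sketch of steps 1 through 3 for a single block (the kernel name dotProduct and the fixed block size are assumptions; this handles n up to BLOCK_SIZE only):

    #define BLOCK_SIZE 256  // assumed power of two for the tree reduction

    __global__ void dotProduct(const float *a, const float *b, float *result, int n) {
        __shared__ float cache[BLOCK_SIZE];
        int tid = threadIdx.x;

        // Step 1: each thread computes the product of one pair.
        cache[tid] = (tid < n) ? a[tid] * b[tid] : 0.0f;

        // Step 2: wait until every thread in the block has written its product.
        __syncthreads();

        // Step 3: tree-style sum reduction in shared memory.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride) cache[tid] += cache[tid + stride];
            __syncthreads();  // each level must finish before the next begins
        }

        if (tid == 0) *result = cache[0];  // thread 0 writes the block's sum
    }

Launched as dotProduct<<<1, BLOCK_SIZE>>>(a, b, result, n) with n <= BLOCK_SIZE; extending it to multiple blocks is exactly the generalization PA2 asks about.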

How to add vectors longer than the maximum number of threads?
Figure: show how blocks, threads, and vector elements relate to each other.
Draw on the board!

How to add vectors longer than the maximum number of threads?
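One common solution is a grid-stride loop, sketched below (this may differ in detail from the figure drawn in class):

    __global__ void addLong(const float *a, const float *b, float *c, int n) {
        // Each thread handles multiple elements, striding by the total
        // number of threads in the grid, so any n is covered no matter
        // how few blocks and threads are launched.
        int stride = gridDim.x * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            c[i] = a[i] + b[i];
    }

The same kernel works whether the grid has fewer threads than elements (each thread loops) or more (the loop body runs once or not at all).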

PA2: Matrix Multiplication
Now we know how to do a vector dot product with one block. How about matrix multiplication?
Draw a graph on the board!
More details: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
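For intuition only, a naive sketch with one thread per output element (square n x n matrices in row-major order are assumed; this is not the required PA2 solution):

    __global__ void matMul(const float *A, const float *B, float *C, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float sum = 0.0f;
            // C[row][col] is the dot product of row `row` of A
            // and column `col` of B: the one-block dot product idea,
            // repeated once per output element.
            for (int k = 0; k < n; k++)
                sum += A[row * n + k] * B[k * n + col];
            C[row * n + col] = sum;
        }
    }

Launched with a 2D grid, e.g. dim3 block(16, 16) and dim3 grid((n + 15) / 16, (n + 15) / 16), so the grid tiles the whole output matrix.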

Thank you! Questions?