CUDA

Assignment
• Subject: DES using CUDA
• Deliverables: des.c, des.cu, report
• Due: 12/14

Index
• What is a GPU?
• Programming model and a simple example
• The environment for CUDA programming
• What is DES?

What’s in a GPU?
• A GPU is a heterogeneous chip multiprocessor (highly tuned for graphics)

Slimming down
• Idea #1: Remove components that help a single instruction stream run fast

Parallel execution
• Two cores → four cores → sixteen cores: 16 simultaneous instruction streams
• Be able to share an instruction stream

SIMD processing
• Idea #2: Amortize the cost/complexity of managing an instruction stream across many ALUs
• 16 cores × 8 ALUs each = 128 ALUs

What about branches?
• When threads sharing an instruction stream diverge at a branch, the hardware executes both paths one after the other, masking off the ALUs whose threads did not take the current path, so divergent branches cost throughput
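A minimal CUDA sketch of the effect (illustrative code, not from the original slides; the kernel and its names are invented):

__global__ void branchy(float *data)
{
    int i = threadIdx.x;
    // Even and odd threads of the same warp take different paths here.
    // The hardware runs the "if" side with odd lanes masked off, then the
    // "else" side with even lanes masked off, so this region takes roughly
    // the time of both paths combined.
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}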

Throughput!
• Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high-latency operations

Summary: three key ideas of GPU design
1. Use many “slimmed down” cores to run in parallel
2. Pack cores full of ALUs (by sharing an instruction stream across groups of fragments)
3. Avoid latency stalls by interleaving execution of many groups of fragments: when one group stalls, work on another group

Programming Model
• The GPU is viewed as a compute device operating as a coprocessor to the main CPU (host)
• Data-parallel, compute-intensive functions should be off-loaded to the device
• Functions that are executed many times, but independently on different data, are prime candidates, e.g. the bodies of for-loops (see the sketch below)
• A function compiled for the device is called a kernel
• The kernel is executed on the device as many different threads
• Both host (CPU) and device (GPU) manage their own memory: host memory and device memory
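As a concrete illustration of off-loading a for-loop body, here is a hedged sketch (the vector-add example and all names in it are invented for illustration, not taken from the slides):

// CPU version: each iteration is independent of the others
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];

// CUDA version: one thread executes one loop iteration
__global__ void add(int n, const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n)                                      // guard: the grid may have extra threads
        c[i] = a[i] + b[i];
}

// Host-side launch (a, b, c must point to device memory):
//   add<<<(n + 255) / 256, 256>>>(n, a, b, c);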

Block and Thread Allocation
• Blocks are assigned to SMs (Streaming Multiprocessors)
• Threads are assigned to PEs (Processing Elements)
• Each thread executes the kernel
• Each block has a unique block ID
• Each thread has a unique thread ID within its block (see the sketch below)
• Warp: at most 32 threads
• GTX 280: 30 SMs; 1 SM: 8 SPs; 1 SM: 32 warps = 1,024 threads; total threads: 30 × 1,024 = 30,720
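Inside a kernel these IDs are exposed through CUDA's built-in variables; a small sketch (the kernel name and output array are invented):

__global__ void whoami(int *out)
{
    // blockIdx.x  = this block's ID within the grid
    // threadIdx.x = this thread's ID within its block
    // blockDim.x  = threads per block; warpSize = 32 on current hardware
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_in_block = threadIdx.x / warpSize;  // which warp this thread belongs to
    out[global_id] = warp_in_block;
}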

Memory model
• Memory types:
  - Registers (read/write per thread)
  - Local mem (read/write per thread)
  - Shared mem (read/write per block)
  - Global mem (read/write per kernel)
  - Constant mem (read-only per kernel)
• Device memory is separate from CPU memory
• The CPU can access global and constant mem via the PCIe bus
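How the memory types above appear in CUDA source, as a sketch (variable and kernel names are invented; assumes a block size of 256):

__constant__ float coeff[16];  // constant mem: read-only in kernels, written by the host

__global__ void memoryDemo(const float *global_in, float *global_out)
{
    __shared__ float tile[256];  // shared mem: read/write, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = global_in[i];      // global mem: read/write, lives on the device
    tile[threadIdx.x] = x;       // x itself occupies a register (per thread)
    __syncthreads();             // make the block's shared-memory writes visible
    global_out[i] = tile[threadIdx.x] * coeff[threadIdx.x % 16];
}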

Simple Example (C to CUDA conversion)

__global__ void ForceCalcKernel(int nbodies, struct Body *body, ...) {}  // __global__ marks a GPU kernel that the CPU can call
__global__ void AdvancingKernel(int nbodies, struct Body *body, ...) {}

int main(...)
{
  Body *body, *body1;  // separate address spaces, so we need two pointers
  ...
  cudaMalloc((void**)&body1, sizeof(Body)*nbodies);                       // allocate memory on the GPU
  cudaMemcpy(body1, body, sizeof(Body)*nbodies, cudaMemcpyHostToDevice);  // copy CPU data to the GPU
  for (timestep = ...) {
    ForceCalcKernel<<<1, 1>>>(nbodies, body1, ...);  // launch with 1 block and 1 thread per block
    AdvancingKernel<<<1, 1>>>(nbodies, body1, ...);
  }
  cudaMemcpy(body, body1, sizeof(Body)*nbodies, cudaMemcpyDeviceToHost);  // copy GPU data back to the CPU
  cudaFree(body1);
  ...
}

Environment
• The NVCC compiler
• CUDA kernels are typically stored in files ending with .cu
• NVCC uses the host compiler (cl on Windows, g++ on Linux) to compile the CPU code
• NVCC automatically handles #includes and linking
• You can download the CUDA Toolkit from: https://developer.nvidia.com/cuda-downloads
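For this assignment, a typical build using the deliverable file name from the first slide might look like this (exact flags depend on your setup):

  nvcc des.cu -o des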

What is DES?
• The archetypal block cipher
• An algorithm that takes a fixed-length string of plaintext bits and transforms it, through a series of complicated operations, into a ciphertext bit string of the same length
• The block size is 64 bits; the key is 64 bits, of which 56 are effective (the rest are parity bits)
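Since DES operates on fixed 64-bit blocks, a natural CUDA parallelization for the assignment is one thread per block of plaintext, assuming the simplest mode (ECB), where each block is encrypted independently. A hedged skeleton follows; des_encrypt_block is a placeholder for your own implementation, not an existing library function:

typedef unsigned long long u64;

// Placeholder: initial permutation, 16 Feistel rounds using subkeys[0..15],
// and the final permutation go here.
__device__ u64 des_encrypt_block(u64 block, const u64 *subkeys)
{
    return block;  // identity stub so the skeleton compiles
}

__global__ void des_kernel(const u64 *in, u64 *out, const u64 *subkeys, int nblocks)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nblocks)
        out[i] = des_encrypt_block(in[i], subkeys);  // one thread encrypts one 64-bit block
}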