
Jie Chen

30 multiprocessors, each containing 8 cores at 1.4 GHz (240 cores total); 4 GB of GDDR3 memory provides ~100 GB/s of memory bandwidth

 General-purpose computation using the GPU in applications other than 3D graphics
◦ The GPU accelerates the critical path of the application
 Data-parallel algorithms leverage GPU attributes
◦ Large data arrays, streaming throughput
◦ Fine-grained SIMD parallelism
◦ Low-latency floating-point (FP) computation
 Applications (see GPGPU.org)
◦ Game effects (FX), physics, image processing
◦ Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

 CUDA
◦ Compute Unified Device Architecture
◦ NVIDIA-specific
 OpenCL
◦ The open standard for parallel programming of heterogeneous systems
◦ Hardware-independent, industry standard
◦ Version 1.0 specification is out; beta implementations are available
 BrookGPU
◦ ATI-specific
◦ Stream programming (SIMD)

 The GPU is viewed as a compute device that:
◦ Is a coprocessor to the CPU (the host)
◦ Has its own DRAM (device memory)
◦ Runs the data-parallel portions of an application as kernels
◦ A single kernel runs many threads in parallel (see the sketch below)
 Differences between GPU and CPU threads
◦ GPU threads are extremely lightweight, with very little creation overhead
◦ A GPU needs thousands of threads for full efficiency; a multi-core CPU needs only a few
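To make the host/device split concrete, here is a minimal sketch (not from the slides; the kernel name add and the sizes are illustrative):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Kernel: the data-parallel portion; one lightweight GPU thread per element.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;                  // 1M elements: thousands of threads
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // The GPU is a coprocessor with its own DRAM: allocate there and copy in.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // A single kernel launch runs many threads in parallel.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);          // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}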

 A kernel is executed as a grid of thread blocks
◦ All threads share the data memory space
 A thread block is a batch of threads that can cooperate with each other by:
◦ Synchronizing their execution, for hazard-free shared memory accesses
◦ Efficiently sharing data through a low-latency shared memory
 Two threads from two different blocks cannot cooperate (see the sketch below)
[Figure: the host launches Kernel 1, which executes on the device as Grid 1, a 3 x 2 arrangement of blocks, then Kernel 2 as Grid 2; each block, e.g. Block (1, 1), is a 5 x 3 arrangement of threads]
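A minimal sketch of this hierarchy, using the grid and block shapes from the figure (the kernel name fill and its output pattern are illustrative):

#include <cuda_runtime.h>

// Each thread derives a unique (x, y) coordinate from its block and
// thread indices; threads in different blocks cannot cooperate.
__global__ void fill(float *out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        out[y * width + x] = (float)(x + y);
}

int main(void) {
    float *d_out;
    cudaMalloc(&d_out, 15 * 6 * sizeof(float));
    dim3 block(5, 3);   // 5 x 3 threads per block, as in the diagram
    dim3 grid(3, 2);    // 3 x 2 blocks, as in Grid 1 of the diagram
    fill<<<grid, block>>>(d_out, 15, 6);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}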

 Each thread can:
◦ Read/write per-thread registers
◦ Read/write per-thread local memory
◦ Read/write per-block shared memory
◦ Read/write per-grid global memory
◦ Read per-grid constant memory (read-only)
◦ Read per-grid texture memory (read-only)
 The host can read/write the global, constant, and texture memories (see the sketch below)
[Figure: device memory hierarchy; each thread has its own registers and local memory, each block has its own shared memory, and all blocks access the per-grid global, constant, and texture memories, which the host can also read and write]
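A sketch of where these memory spaces appear in code (the variable names are illustrative; texture memory is omitted):

#include <cuda_runtime.h>

__constant__ float coeff[16];     // per-grid constant memory: read-only
                                  // in kernels, written by the host
__device__ float table[1024];     // per-grid global memory: R/W

__global__ void scale(float *out) {
    __shared__ float buf[256];    // per-block shared memory: R/W
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float tmp = table[i % 1024] * coeff[0];   // tmp lives in a per-thread register
    buf[threadIdx.x] = tmp;       // write shared memory
    __syncthreads();
    out[i] = buf[threadIdx.x];    // write per-grid global memory
}

int main(void) {
    float h_coeff[16] = { 2.0f };
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));  // host writes constant memory
    float *d_out;
    cudaMalloc(&d_out, 256 * sizeof(float));
    scale<<<1, 256>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}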

 Declspecs
◦ global, device, shared, local, constant
 Keywords
◦ threadIdx, blockIdx
 Intrinsics
◦ __syncthreads
 Runtime API
◦ Memory, symbol, and execution management
 Function launch

__device__ float filter[N];

__global__ void convolve(float *image) {
    __shared__ float region[M];
    // ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    // ... compute result from region ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage;
cudaMalloc(&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>((float *)myimage);

 Wilson-Dirac matrix
◦ M = -1/2 D_{x,x'} + (4 + m) δ_{x,x'}
◦ Solve M(U) x = b, where U is the gauge field and b is the source spin vector
◦ Dslash is an application of D to a spinor field
 Quarks live on sites (24 floats each)
◦ Spinor: 4 x 3 complex numbers
 Gluons live on links (18 floats each)
◦ Gauge field: 3 x 3 complex color matrices (SU(3) matrices)
 Most calculations are multiplications of spinors and the gauge field (see the sketch below)
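A hedged sketch of how these objects could be laid out in CUDA C; the type and function names are illustrative, not the authors' code. A spinor is 4 spins x 3 colors x 2 floats = 24 floats; a gauge link is a 3 x 3 complex SU(3) matrix = 18 floats:

// Illustrative data layout, not the actual implementation.
typedef struct { float re, im; } Complex;

typedef struct {
    Complex c[4][3];     // 4 spins x 3 colors = 24 floats per site
} Spinor;

typedef struct {
    Complex m[3][3];     // SU(3) matrix: 18 floats per link
} GaugeLink;

// Core operation: multiply one color vector of a spinor by an SU(3) link.
__device__ void su3_mul(const GaugeLink *u, const Complex v[3], Complex out[3]) {
    for (int i = 0; i < 3; ++i) {
        out[i].re = 0.0f;
        out[i].im = 0.0f;
        for (int j = 0; j < 3; ++j) {
            // Complex multiply-accumulate: out[i] += u->m[i][j] * v[j]
            out[i].re += u->m[i][j].re * v[j].re - u->m[i][j].im * v[j].im;
            out[i].im += u->m[i][j].re * v[j].im + u->m[i][j].im * v[j].re;
        }
    }
}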

 Per site, Dslash uses:
◦ 1 output quark site (24 floats)
◦ 2 x 4 input quark sites (24 x 8 floats)
◦ 2 x 4 input gluon links (18 x 8 floats)
 Total memory: 4 x 32^4 x (24 + 4 x 18) bytes = 384 MB (for a 32^4 lattice)
 1320 flops per site
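A back-of-the-envelope check of these figures, assuming the 32^4 lattice implied by the total (this derivation is not from the slides):

Storage: 32^4 sites x (24 + 4 x 18) floats/site x 4 bytes/float = 402,653,184 bytes = 384 MB
Traffic per site: (24 + 8 x 24 + 8 x 18) floats x 4 bytes = 1440 bytes
Arithmetic intensity: 1320 flops / 1440 bytes ≈ 0.92 flops/byte

At ~100 GB/s of memory bandwidth, this intensity caps Dslash at roughly 0.92 x 100 ≈ 92 Gflops, which is consistent with the ~80 Gflops measured on a single C1060 below: the kernel is memory-bound.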

 Dslash/Clover on Intel CPUs
◦ ~20 Gflops on 8 Nehalem cores
 Dslash/Clover on Tesla GPUs
◦ ~80 Gflops on a single C1060
 Dslash/Clover on multiple GPUs
◦ ~240 Gflops on 4 C1060 GPUs with MPI

 Data parallelism
◦ Any non-data-parallel code will slow the application down
 Memory layout
◦ Avoid memory access conflicts
◦ The 16 threads of a half-warp can access memory in a single transaction if the accesses are coalesced (see the sketch below)
 Use shared memory or registers
◦ Local and global memory are too slow: at least 100 times slower
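A hedged sketch of the coalescing and shared-memory points above (the kernel names and the three-point average are illustrative):

#include <cuda_runtime.h>

// Coalesced: thread k of a half-warp reads element base + k, so the
// 16 accesses are served in one memory transaction.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Strided: thread k reads element k * stride; the half-warp's accesses
// scatter across memory and serialize into many transactions.
__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = 2.0f * in[i];
}

// Reuse: stage data once in fast per-block shared memory instead of
// re-reading slow global memory for each of the three neighbors.
__global__ void smoothed(const float *in, float *out, int n) {
    __shared__ float tile[258];              // 256 elements + 2 halo cells
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;
    if (i < n) tile[t] = in[i];
    if (threadIdx.x == 0)              tile[0]     = (i > 0)     ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1) tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();
    if (i < n) out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
}

int main(void) {
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    coalesced<<<n / 256, 256>>>(d_in, d_out, n);
    strided<<<n / 256, 256>>>(d_in, d_out, n, 16);   // same launch, far more transactions
    smoothed<<<n / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}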