To GPU Synchronize or Not GPU Synchronize? Wu-chun Feng and Shucai Xiao Department of Computer Science, Department of Electrical and Computer Engineering, Virginia Tech ISCAS 2010

Outline Introduction Preliminaries Related Work Proposed GPU-Based Synchronization Problems, Experiments, and Analysis Conclusions

Introduction The multi-/many-core era has arrived. General-purpose computation on GPUs (GPGPU) enables massively parallel computation at low cost. GPUs typically map well only to data-parallel or task-parallel applications – due to the lack of support for communication between streaming multiprocessors (SMs).

Introduction (cont.) Communication can be done via global memory. – This requires barrier synchronization. CPU barrier synchronization – Implements the barrier (inefficiently) via the host CPU. – Slow.

Introduction (cont.) GPU barrier synchronization – Improves performance by 10~40%. – Theoretically runs the risk that the barrier may release early, before all memory writes are visible to other blocks. CUDA 2.2 introduces a new function, __threadfence(), to solve this problem. – With it, correctness can be guaranteed.

Introduction (cont.) Unfortunately, __threadfence() incurs so much overhead in the proposed GPU barrier synchronization that CPU barrier synchronization performs as well as or better than GPU barrier synchronization in many cases. Hence the question: "To GPU synchronize or not to GPU synchronize?"

Preliminaries: CUDA Compute Unified Device Architecture, developed by NVIDIA. The CPU code runs the sequential part of an application. The highly parallel part is usually implemented in GPU code, called a kernel. Invoking a GPU function from CPU code is called a kernel launch. In a kernel, threads are grouped into a grid of thread blocks, and each thread block contains a number of threads. – Multiple blocks can execute on the same SM, but one block cannot execute across different SMs.
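For concreteness, a minimal sketch of a kernel and its launch (our illustration, not from the slides; the kernel name vecAdd and the launch sizes are hypothetical):

```
#include <cuda_runtime.h>

// Kernel: each thread computes one element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host code: the kernel launch specifies the grid of thread blocks.
void launchVecAdd(const float *a, const float *b, float *c, int n)
{
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);  // kernel launch
}
```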

Preliminaries: GPU architecture

Preliminaries: synchronization Synchronization in parallel programming – making sure that each thread gets the right data for its computation. CUDA provides a data-communication mechanism for threads within a single block via the barrier function __syncthreads(). – Intra-SM communication. However, there is no explicit software or hardware support for data communication among threads across different blocks. – Inter-SM communication.
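An illustrative sketch (ours, not the paper's code) of intra-block communication: threads in one block exchange data through shared memory, with __syncthreads() acting as the intra-block barrier. No such barrier exists for threads in different blocks, which is exactly the gap the paper addresses.

```
// Reverse the elements owned by one thread block (assumes blockDim.x <= 256).
__global__ void reverseInBlock(float *data)
{
    __shared__ float tile[256];   // visible only within this block
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];
    __syncthreads();              // barrier: all writes to tile are now visible

    // Safe to read an element written by another thread in this block.
    data[base + t] = tile[blockDim.x - 1 - t];
}
```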

Related Work When multiple GPU thread blocks are scheduled to execute on a single SM simultaneously, deadlock might occur – in a multi-core CPU environment a process can yield execution to other processes, but a CUDA block spinning at a barrier never yields its SM, so blocks that have not yet been scheduled can never arrive. [17] assigns only one block per SM to address this problem.

Related Work (cont.) When barrier synchronization is needed across different blocks, programmers traditionally use a kernel launch as a way to barrier synchronize implicitly [4], [7]; this idiom is sketched below. [14] proposes a protocol for data communication across multiple GPUs. – Data must be transferred to host memory first and then copied back to device memory, which performs poorly when used for communication between SMs on a single GPU.
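A sketch of the traditional approach (computeStep is a hypothetical kernel): the end of each kernel launch serves as an implicit global barrier, because a kernel must finish on all blocks before the next launch on the same stream begins.

```
// CPU (implicit) barrier synchronization via repeated kernel launches.
void runWithCpuBarrier(float *d_data, int numSteps, dim3 grid, dim3 block)
{
    for (int step = 0; step < numSteps; ++step)
        computeStep<<<grid, block>>>(d_data, step);  // implicit barrier between launches
    cudaDeviceSynchronize();  // wait for the final launch to finish
}
```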

Proposed GPU-Based Synchronization GPU synchronization – Lock-based synchronization: a single mutex variable shared by all thread blocks. Once a block finishes its computation on an SM, it atomically increments the mutex variable and then spins until the mutex shows that all blocks have arrived. – Lock-free synchronization: one distinct variable controls each block, eliminating the need for different blocks to contend for a single mutex variable; the need for atomic addition is thus removed. Minimal sketches of both schemes follow.
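These sketches are reconstructed from the description above (the names g_mutex, arrayIn, arrayOut, and goalVal are ours, not the paper's). Both assume at most one resident block per SM, per the deadlock discussion in Related Work, and that blockDim.x >= gridDim.x.

```
// Lock-based: one global mutex variable shared by all thread blocks.
__device__ volatile int g_mutex = 0;

__device__ void gpuSyncLockBased(int goalVal)  // goalVal grows by gridDim.x per barrier
{
    if (threadIdx.x == 0) {
        atomicAdd((int *)&g_mutex, 1);         // announce this block's arrival
        while (g_mutex < goalVal) { }          // spin until all blocks arrive
    }
    __syncthreads();                           // release the rest of the block
}

// Lock-free: one flag per block, so no contention on a single variable
// and no atomic addition.
__device__ void gpuSyncLockFree(volatile int *arrayIn, volatile int *arrayOut,
                                int goalVal)
{
    int bid = blockIdx.x;
    if (threadIdx.x == 0)
        arrayIn[bid] = goalVal;                // announce arrival

    if (bid == 0) {                            // block 0 collects and releases
        if (threadIdx.x < gridDim.x)
            while (arrayIn[threadIdx.x] != goalVal) { }
        __syncthreads();
        if (threadIdx.x < gridDim.x)
            arrayOut[threadIdx.x] = goalVal;
    }
    if (threadIdx.x == 0)
        while (arrayOut[bid] != goalVal) { }   // wait to be released
    __syncthreads();
}
```

Passing a goalVal that increases with each barrier (e.g., gridDim.x times the iteration count) avoids having to reset g_mutex or the flag arrays between iterations.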

Experiments Environment – GeForce GTX 280: 30 SMs with 8 cores each, running at 1.3 GHz (shader clock). – CUDA 2.2 SDK. – Further details are omitted here. Two experiments – Dynamic programming (DP) for genomic sequence alignment (specifically, the Smith-Waterman algorithm). – Bitonic sort (BS).

Performance comparisons

Problems, Experiments, and Analysis To eliminate the infinitesimal risk that the barrier may release early when the proposed synchronization runs, __threadfence() is used, and it is this call that incurs the overhead. The same experiments were repeated with the barrier modified to use __threadfence().
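The modification is small (a sketch based on the lock-based variant above): every thread fences before the block announces its arrival, so all of the block's global-memory writes are guaranteed visible to other blocks before the barrier releases.

```
__device__ void gpuSyncLockBasedFenced(int goalVal)
{
    __threadfence();          // each thread's prior global writes become visible device-wide
    __syncthreads();          // wait until every thread in the block has fenced
    if (threadIdx.x == 0) {
        atomicAdd((int *)&g_mutex, 1);
        while (g_mutex < goalVal) { }
    }
    __syncthreads();
}
```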

Performance comparisons

Problems, Experiments, and Analysis (cont.) GPU lock-based synchronization is analyzed as the example, since its operation set is a superset of the one used in lock-free synchronization. Synchronization overhead components – t_a is the overhead of the atomic add – t_c is the mutex-variable checking time – t_s is the time consumed by __syncthreads() – t_f is the __threadfence() execution time.

Problems, Experiments, and Analysis (cont.) Unfortunately, the execution times of these component operations cannot be measured directly on the GPU, so an indirect approach is used. – A kernel's execution time can be expressed as the sum of its computation time and its synchronization overhead: t_kernel = t_compute + t_s + t_a + t_c + t_f. – Measure the kernel execution time both with and without a specific operation and take the difference as the overhead of that operation.
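A sketch of this differencing method using CUDA events (timeKernel and the launcher names are ours; the two variants differ only by the operation under test):

```
#include <cuda_runtime.h>

// Measure one kernel variant's execution time in milliseconds.
float timeKernel(void (*launchVariant)(void))
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launchVariant();                       // launch the kernel variant
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);            // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Overhead of one operation = time(variant with it) - time(variant without it),
// e.g. t_f = timeKernel(launchWithFence) - timeKernel(launchWithoutFence);
```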

Execution time profiling A micro-benchmark is used for the measurements. – It computes the average of two floats over 10,000 iterations. CPU synchronization – Each kernel launch computes the average once, and the kernel is launched 10,000 times. GPU synchronization – The kernel is launched only once; a 10,000-iteration for loop inside the kernel calls the GPU barrier function in each iteration. Both variants are sketched below.
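A sketch following the stated protocol (avgOnce, avgLooped, and ITER are our names; gpuSyncLockBased refers to the barrier sketch above; the grid is assumed to cover the arrays exactly):

```
#define ITER 10000

// One averaging step per kernel launch; the implicit barrier between
// launches provides the CPU synchronization.
__global__ void avgOnce(float *x, const float *a, const float *b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] = (a[i] + b[i]) * 0.5f;
}

void cpuSyncVersion(float *x, const float *a, const float *b, dim3 g, dim3 blk)
{
    for (int it = 0; it < ITER; ++it)
        avgOnce<<<g, blk>>>(x, a, b);          // 10,000 launches
    cudaDeviceSynchronize();
}

// One launch; the loop and the GPU barrier live inside the kernel.
__global__ void avgLooped(float *x, const float *a, const float *b, int nBlocks)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int it = 0; it < ITER; ++it) {
        x[i] = (a[i] + b[i]) * 0.5f;
        gpuSyncLockBased(nBlocks * (it + 1));  // growing goal, no reset needed
    }
}
```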

Results Over the 10,000 executions, t_s = 0.541 and t_c = 5.564, while t_a and t_f grow linearly with n, the number of blocks in the kernel; the units are milliseconds. The components are derived from the measured times t1 through t5 of kernel variants: t_s = t3 - t1, t_a = t2, t_c = t4 - t3 - t2, t_f = t5 - t4.

Conclusions The efficiency of inter-SM communication using GPU-based barrier synchronization is demonstrated. To eliminate the risk of premature barrier release, __threadfence() is used, though it incurs high overhead. The authors grudgingly conclude that one should GPU synchronize (with or without __threadfence()) on the current generation of GPUs, with a more definitive "yes" for the next generation.