Provably Efficient GPU Algorithms

Volker Weichert, Goethe University
Nodari Sitchinava, KIT
MADALGO – Center for Massive Data Algorithmics, a Center of the Danish National Research Foundation

GPU Programming

GPU kernels run on C blocks (CTAs) of W warps each. A warp is a group of w = 32 threads that execute in SIMD fashion. Each CTA has a small but fast private cache that is shared among its threads (shared memory). All CTAs can access the large global memory, whose latency is roughly two orders of magnitude higher than that of shared memory. While intra-CTA synchronization is essentially free, inter-CTA synchronization requires a round trip through the CPU; GPU programs are therefore divided into R kernels.

Other Relevant Observations

- Accesses to global memory achieve maximum bandwidth when performed in blocks of size w; this models coalesced memory accesses.
- We limit the CTA size to one warp.
- Since some resources are shared among CTAs, there is a hardware limit on the number of CTAs that can be resident at the same time. Using more CTAs is possible, but not beneficial.

Experiments

We implement a bank-conflict-free algorithm for prefix sums using our design principles and compare it against thrust [3] and back40computing [4]. We extend our implementation to colored prefix sums; the design process shows how strongly bank conflicts influence kernel runtime. We also implement the first sorting algorithm for small inputs that causes no bank conflicts and replace the base case of thrust's mergesort with our shearsort-based implementation. (Illustrative code sketches follow the references.)

Cost of a GPU Algorithm

Our model [1] is based on the Parallel External Memory (PEM) model by Arge et al. [2]:

- There is unlimited global memory, while shared memory is of size M.
- We can view each CTA as a virtual processor with w cores that work in SIMD fashion.
- Generic kernel layout (a CUDA sketch follows the references):
  1) Load data from global memory to shared memory.
  2) Work on the data in shared memory.
  3) Write output data to global memory.
- In most applications n >> M, so each CTA has to work on multiple pages of input data.

The total runtime of a GPU algorithm on input size n, consisting of R(n) kernels and executed on p processors, is

    T(n) = Σ_{k=1}^{R(n)} (t_k + λ·q_k) + R(n)·σ,

where t_k is the amount of parallel computation performed in kernel k, λ is the latency of accesses to global memory, q_k is the amount of parallel data transfer (I/O), and σ is the time needed for a global synchronization.

Bank Conflicts

Shared memory is organized in S memory banks that can be accessed simultaneously without penalty; stored words are assigned to the banks cyclically. On modern GPUs, S = w. Bank conflicts occur when multiple threads access different words on the same memory bank at the same time. The conflicting accesses are then performed sequentially, which makes runtimes hard to predict for random or data-dependent memory accesses.

[Figures: (i) the GPU organization: blocks of warps, each block with its threads and private shared memory, attached to global memory; (ii) a program as a sequence of kernels 0 .. R-1 synchronized via the CPU; (iii) the cyclic assignment of array words A[0], A[1], A[2], ... to the shared memory banks.]

References

[1] N. Sitchinava and V. Weichert. Provably Efficient GPU Algorithms. Submitted to SPAA.
[2] L. Arge, M. Goodrich, M. Nelson, and N. Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. SPAA 2008.
[3]
[4]
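
The generic kernel layout above maps directly onto CUDA. Below is a minimal sketch under the model's assumptions (one-warp CTAs, w = 32, pages of w consecutive words); the kernel name, the TILE constant, and the placeholder computation in step 2 are ours, not the poster's.

```cuda
#include <cuda_runtime.h>

#define W    32   // warp width w; the model fixes the CTA size to one warp
#define TILE W    // one page of w consecutive words per iteration

// Hypothetical kernel following the three-step layout from the model.
__global__ void generic_kernel(const float *in, float *out, int n)
{
    __shared__ float page[TILE];   // the CTA's small, fast private memory
    int t = threadIdx.x;           // 0..31: one warp per CTA

    // In most applications n >> M, so each CTA walks over multiple pages.
    for (int base = blockIdx.x * TILE; base < n; base += gridDim.x * TILE) {
        int i = base + t;
        if (i < n) page[t] = in[i];           // 1) coalesced load: one block of size w
        __syncwarp();
        if (i < n) page[t] = 2.0f * page[t];  // 2) work on the data in shared memory
        __syncwarp();
        if (i < n) out[i] = page[t];          // 3) coalesced write back
        __syncwarp();
    }
}
```

Because consecutive threads touch consecutive addresses, the transfers in steps 1 and 3 are exactly the blocks of size w that the model rewards with maximum bandwidth.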
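
To make the bank-conflict mechanism concrete, here is a small demonstration kernel of our own (not from the poster). With S = w = 32 banks, word i of shared memory resides in bank i mod 32: a stride-32 access sends all threads of the warp to bank 0 and is serialized into 32 transactions, while padding the stride to 33 maps thread t to bank (33·t) mod 32 = t, which is conflict-free.

```cuda
// Assumes blockDim.x == 32 (one warp per CTA), as in the model.
__global__ void bank_conflict_demo(float *out)
{
    __shared__ float conflicted[32 * 32]; // stride-32 layout: every access hits bank 0
    __shared__ float padded[32 * 33];     // stride-33 layout: accesses spread over all banks

    int t = threadIdx.x;
    conflicted[t * 32] = (float)t;        // 32-way bank conflict, serialized
    padded[t * 33]     = (float)t;        // conflict-free, fully parallel
    __syncwarp();

    out[t] = conflicted[t * 32] + padded[t * 33];
}
```

Padding is one standard way to remove such conflicts; the poster's contribution is to design prefix-sum and sorting kernels that incur no bank conflicts in the first place.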
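
For the prefix sums from the Experiments section we do not know the poster's exact kernel, so the following is only a well-known conflict-free building block: a warp-level inclusive scan kept entirely in registers via shuffle intrinsics (CUDA 9+), which touches no shared memory and therefore cannot cause bank conflicts.

```cuda
// Kogge-Stone inclusive prefix sum across one warp of w = 32 threads.
__device__ float warp_inclusive_scan(float x)
{
    unsigned lane = threadIdx.x & 31;                 // lane id within the warp
    for (int d = 1; d < 32; d <<= 1) {
        float y = __shfl_up_sync(0xffffffffu, x, d);  // fetch value from lane - d
        if (lane >= d)
            x += y;                                   // accumulate partial sums
    }
    return x;   // lane i now holds the sum of lanes 0..i
}
```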