Understanding Performance of Concurrent Data Structures on Graphics Processors. Daniel Cederman, Bapi Chatterjee, Philippas Tsigas. Distributed Computing and Systems, Chalmers University of Technology.

Presentation transcript:

Understanding Performance of Concurrent Data Structures on Graphics Processors
Daniel Cederman, Bapi Chatterjee, Philippas Tsigas
Distributed Computing and Systems, D&IT, Chalmers University, Sweden
(Supported by PEPPHER, SCHEME, VR)
Euro-Par 2012

Parallelization on GPU (GPGPU)
- Main processor: uniprocessors are no more; multi-core and many-core are ubiquitous.
- Co-processor: graphics processors; SIMD, N x speedup.
- GPGPU programming: CUDA, OpenCL; increasingly independent of the CPU.
2/31

Concurrent Programming
- Data structures + multi-core/many-core = concurrent data structures (CDS).
- Rich and growing literature; many applications.
- Parallel slowdown when synchronization is done poorly.
CDS on GPU
- Synchronization-aware applications on GPU: challenging but required.
3/31

Concurrent Data Structures on GPU: Implementation Issues, Performance Portability
4/31

Outline of the talk
- GPU (Nvidia) Architecture Evolution: Support for Synchronization
- Concurrent Data Structures: Concurrent FIFO Queues
- CDS on GPU: Implementation & Optimization
- Performance Portability Analysis
5/31

Outline of the talk
- GPU (Nvidia) Architecture Evolution: Support for Synchronization
- Concurrent Data Structures: Concurrent FIFO Queues
- CDS on GPU: Implementation & Optimization
- Performance Portability Analysis
6/31

GPU Architecture Evolution

Processor            Atomics              Cache
Tesla (CC 1.0)       No atomics           No cache
Tesla (CC 1.x, x>0)  Atomics available    No cache
Fermi (CC 2.x)       Atomics on L2        Unified L2 and configurable L1
Kepler (CC 3.x)      Faster than earlier  L2 73% faster than Fermi

7/31

Compare-and-Swap (CAS) on GPU
[figure: CAS behavior on GPUs]
8/31
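The slide shows measured plots of CAS behavior. As a rough illustration, assumed rather than taken from the deck, the kind of kernel such a measurement might use has every thread hammer one shared word with CAS so that the cost of contended atomics dominates:

__device__ unsigned int sharedWord;   /* the contended word (assumed name) */

__global__ void casStress(int iterations)
{
    for (int i = 0; i < iterations; ++i) {
        unsigned int old = sharedWord;
        /* try to replace the observed value with old+1; if another
           thread won the race, re-read and retry */
        while (atomicCAS(&sharedWord, old, old + 1) != old)
            old = sharedWord;
    }
}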

CDS on GPU – Motivation & Challenges
- Transition from a pure co-processor to a more independent compute unit.
- CUDA and OpenCL.
- Synchronization primitives are getting cheaper with the availability of multilevel caches.
- Synchronization-aware programs vs. the inherently SIMD execution model.
9/31

Outline of the talk
- GPU (Nvidia) Architecture Evolution: Support for Synchronization
- Concurrent Data Structures: Concurrent FIFO Queues
- CDS on GPU: Implementation & Optimization
- Performance Portability Analysis
10/31

Concurrent Data Structure
1. Synchronization
2. Blocking
3. Non-blocking:
   1. Lock-free
   2. Wait-free
(Progress guarantee distinguishes the classes.)
11/31
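To make the blocking vs. lock-free distinction concrete, here is a minimal sketch, with names and structure assumed rather than taken from the talk, of two counter increments in CUDA:

__device__ int counter = 0;
__device__ int lock = 0;

/* Blocking: if the lock holder is delayed, every other thread is stuck.
   On the SIMD hardware discussed here, at most one thread per warp
   should contend for such a lock, or the warp can livelock. */
__device__ void incrementBlocking()
{
    while (atomicCAS(&lock, 0, 1) != 0)
        ;                          /* spin until the lock is acquired */
    ++counter;                     /* critical section */
    atomicExch(&lock, 0);          /* release */
}

/* Lock-free: a failed CAS means some other thread succeeded, so the
   system as a whole always makes progress. There is no per-thread
   bound, which is why this is lock-free rather than wait-free. */
__device__ void incrementLockFree()
{
    int old = counter;
    while (atomicCAS(&counter, old, old + 1) != old)
        old = counter;             /* retry with the freshly observed value */
}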

Concurrent FIFO Queues 12/31

Concurrent FIFO Queues
Single Producer Single Consumer (SPSC)
- Lamport 1983: Lamport Queue
- Giacomoni et al. 2008: FastForward Queue
- Lee et al. 2009: MCRingBuffer
- Preud'homme et al. 2010: BatchQueue
Multi Producer Multi Consumer (MPMC)
- Michael & Scott 1996: MS-Queue (blocking and non-blocking)
- Tsigas & Zhang 2001: TZ-Queue
13/31

Lamport [1] SPSC FIFO Queue
1. Lock-free, array-based.
2. Synchronization through atomic reads and writes of the shared head and tail, which causes cache thrashing.

enqueue (data) {
  if (NEXT(head) == tail)
    return FALSE;          /* queue full */
  buffer[head] = data;
  head = NEXT(head);
  return TRUE;
}

dequeue (data) {
  if (head == tail)
    return FALSE;          /* queue empty */
  data = buffer[tail];
  tail = NEXT(tail);
  return TRUE;
}

14/31
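For completeness, the shared state the pseudocode above relies on might be declared as follows; QSIZE and NEXT are assumed names, not from the slides, and volatile stands in for the atomic reads and writes the algorithm requires:

#define QSIZE 1024                        /* capacity; one slot is wasted */
#define NEXT(i) (((i) + 1) % QSIZE)       /* circular increment */

volatile int head = 0;   /* written by the producer, read by the consumer */
volatile int tail = 0;   /* written by the consumer, read by the producer */
int buffer[QSIZE];

Every operation reads the index owned by the other side, which is what drags the cache line holding it back and forth and causes the thrashing mentioned above.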

SPSC FIFO Queues
FastForward [2]
- head and tail are private to the producer and consumer, lowering cache thrashing.
BatchQueue [3]
- The queue is divided into two batches: the producer writes to one while the consumer reads from the other.
15/31
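A minimal sketch of the FastForward idea, assuming an EMPTY sentinel value that real data never takes; details differ from the paper's implementation:

#define QSIZE 1024
#define EMPTY 0                      /* sentinel: slot holds no data  */
#define NEXT(i) (((i) + 1) % QSIZE)

volatile int buffer[QSIZE];          /* all slots start at EMPTY      */
int head = 0;                        /* private to the producer       */
int tail = 0;                        /* private to the consumer       */

int enqueue(int data)                /* data must not equal EMPTY     */
{
    if (buffer[head] != EMPTY)
        return 0;                    /* queue full                    */
    buffer[head] = data;             /* the slot itself signals ready */
    head = NEXT(head);
    return 1;
}

int dequeue(int *data)
{
    if (buffer[tail] == EMPTY)
        return 0;                    /* queue empty                   */
    *data = buffer[tail];
    buffer[tail] = EMPTY;            /* hand the slot back            */
    tail = NEXT(tail);
    return 1;
}

Because producer and consumer each touch only their own index and synchronize through the slot contents, the control variables never bounce between caches; only the data slots move.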

SPSC FIFO Queues
MCRingBuffer [4]
1. Similar to BatchQueue but handles many batches.
2. Many smaller batches can lower latency when the producer is not fast enough.
16/31
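The batching idea can be sketched as follows for the consumer side; this is a simplification with assumed names, and the real MCRingBuffer additionally pads variables to cache lines:

#define QSIZE 1024
#define BATCH 64
#define NEXT(i) (((i) + 1) % QSIZE)

int buffer[QSIZE];
volatile int writeIdx = 0;           /* shared: published by the producer */
volatile int readIdx  = 0;           /* shared: published by the consumer */

/* consumer-private state */
int nextRead = 0, localWrite = 0, batchCount = 0;

int dequeue(int *data)
{
    if (nextRead == localWrite) {
        localWrite = writeIdx;       /* refresh snapshot of shared index  */
        if (nextRead == localWrite)
            return 0;                /* really empty                      */
    }
    *data = buffer[nextRead];
    nextRead = NEXT(nextRead);
    if (++batchCount == BATCH) {     /* publish progress to readIdx only  */
        readIdx = nextRead;          /* once per batch                    */
        batchCount = 0;
    }
    return 1;
}

The shared indices are read and written once per batch rather than once per operation, so the cache-coherence traffic on them drops by a factor of BATCH.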

MPMC FIFO Queues
MS-Queue (blocking) [5]
1. Linked-list based.
2. Synchronizes with mutual-exclusion locks.
3. Evaluated with CAS-based spinlocks and bakery locks, both fine-grained and coarse-grained.
17/31
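As a sketch of the blocking variant's building blocks, a CAS-based spinlock and a fine-grained enqueue might look like this; the structure is assumed, and the bakery lock would use only reads and writes instead of CAS:

typedef struct Node { int value; struct Node *next; } Node;
typedef struct { Node *head, *tail; int headLock, tailLock; } Queue;

__device__ void acquire(int *lock)
{
    /* assumes at most one contending thread per warp (SIMD caveat) */
    while (atomicCAS(lock, 0, 1) != 0)
        ;                                /* spin */
}

__device__ void release(int *lock)
{
    atomicExch(lock, 0);
}

/* Fine-grained locking: head and tail have separate locks, so an
   enqueuer and a dequeuer can run in parallel. The coarse-grained
   version takes a single lock around both operations. */
__device__ void enqueueBlocking(Queue *q, Node *n)
{
    n->next = NULL;
    acquire(&q->tailLock);
    q->tail->next = n;                   /* link after the current last node */
    q->tail = n;
    release(&q->tailLock);
}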

MPMC FIFO Queues
MS-Queue (non-blocking) [5]
1. Lock-free.
2. Uses CAS to add nodes at the tail and remove nodes from the head.
3. A helping mechanism between threads provides true lock-freedom.
18/31
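The helping mechanism can be sketched as follows, doing pointer CAS through atomicCAS on 64-bit words; ABA counters and memory fences are omitted for brevity, so this is an illustration rather than the full algorithm:

typedef struct Node { int value; struct Node *next; } Node;
typedef struct { Node *head, *tail; } Queue;  /* head points to a dummy node */

__device__ int casPtr(Node **addr, Node *expected, Node *desired)
{
    return atomicCAS((unsigned long long *)addr,
                     (unsigned long long)expected,
                     (unsigned long long)desired)
           == (unsigned long long)expected;
}

__device__ void enqueueLockFree(Queue *q, Node *node)
{
    node->next = NULL;
    for (;;) {
        Node *last = q->tail;
        Node *next = last->next;
        if (last != q->tail) continue;          /* tail moved, retry          */
        if (next != NULL) {                     /* tail is lagging behind:    */
            casPtr(&q->tail, last, next);       /* help the other enqueuer    */
            continue;
        }
        if (casPtr(&last->next, NULL, node)) {  /* link our node at the end   */
            casPtr(&q->tail, last, node);       /* swing tail; failure is ok, */
            return;                             /* someone else helped us     */
        }
    }
}

The helping step is the middle CAS: a thread that finds the tail lagging advances it on behalf of whoever linked the last node, so no thread's delay can block the others.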

MPMC FIFO Queues
TZ-Queue (non-blocking) [6]
1. Lock-free, array-based.
2. Uses CAS to insert elements and to move head and tail.
3. The head and tail pointers are moved only after every x-th operation.
19/31

Outline of the talk
- GPU (Nvidia) Architecture Evolution: Support for Synchronization
- Concurrent Data Structures: Concurrent FIFO Queues
- CDS on GPU: Implementation & Optimization
- Performance Portability Analysis
20/31

Implementation Platform

Processor         Clock Speed  Memory Clock  Cores  LL Cache  Architecture
8800GT            1.6GHz       1.0GHz        14     0         Tesla (CC 1.1)
GTX280            1.3GHz       1.1GHz        30     0         Tesla (CC 1.3)
Tesla C2050       1.15GHz      1.5GHz        14     768kB     Fermi (CC 2.0)
GTX680            1.0GHz       3.0GHz        8      512kB     Kepler (CC 3.0)
Intel E5645 (2x)  2.4GHz       0.7GHz        24     12MB      Intel HT

21/31

GPU Implementation
1. A thread block works either as a producer or as a consumer.
2. Varying numbers of thread blocks are used for the MPMC queues.
3. Shared memory holds the private variables of each producer and consumer.
22/31
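A sketch of how point 1 might look in CUDA; spscEnqueue and spscDequeue stand in for any of the SPSC queues above, and all names are assumed:

__device__ int spscEnqueue(int data);   /* provided by the queue under test */
__device__ int spscDequeue(int *data);

__global__ void producerConsumer(int operations)
{
    /* Each thread block plays exactly one role; for SPSC only thread 0
       of the block drives the queue. */
    if (threadIdx.x != 0) return;
    if (blockIdx.x == 0) {                      /* producer block */
        for (int i = 0; i < operations; ++i)
            while (!spscEnqueue(i))
                ;                               /* retry while full  */
    } else {                                    /* consumer block */
        int v;
        for (int i = 0; i < operations; ++i)
            while (!spscDequeue(&v))
                ;                               /* retry while empty */
    }
}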

GPU Optimization
1. BatchQueue and MCRingBuffer take advantage of shared memory to become buffered.
2. Memory transfers in the buffered queues are coalesced.
3. Empirical optimization of the TZ-Queue: move the pointers after every second operation.
23/31
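Points 1 and 2 can be sketched like this, with assumed names: the block stages a batch in shared memory, then all of its threads write the batch out together, so consecutive threads hit consecutive global addresses and the hardware coalesces the accesses:

#define BATCH 256

__device__ int globalBuffer[1 << 20];        /* queue storage in global memory */

__global__ void flushBatch(int base)
{
    __shared__ int stage[BATCH];             /* batch staged in shared memory  */

    for (int i = threadIdx.x; i < BATCH; i += blockDim.x)
        stage[i] = base + i;                 /* stand-in for produced data     */
    __syncthreads();                         /* make the staged batch visible  */

    for (int i = threadIdx.x; i < BATCH; i += blockDim.x)
        globalBuffer[base + i] = stage[i];   /* coalesced: thread t writes     */
}                                            /* address base + t               */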

Experimental Setup
1. Throughput = number of successful enqueues or dequeues per ms.
2. MPMC experiments: 25% enqueue and 75% dequeue operations.
3. Contention: high and low.
4. On the CPU, producers and consumers were placed on different sockets.
24/31

SPSC on CPU
[figures: cache profile and throughput; reducing cache thrashing increases throughput]
25/31

SPSC on GPU
- GPU without cache: no cache thrashing.
- GPU shared-memory advantage: buffering.
- High memory clock + faster cache: an advantage for the unbuffered queues.
[figure: throughput]
26/31

MPMC on CPU
- The CAS-based spinlock beats the bakery lock (read/write based).
- Lock-free is better than blocking.
[figures: best lock-based vs. lock-free, high and low contention]
27/31

MPMC on GPU (High Contention)
- Newer architectures scale better.
[figures: GTX280 (CC 1.3), C2050 (CC 2.0), GTX680 (CC 3.0)]
28/31

Compare-and-Swap (CAS) on GPU
[figure: CAS behavior on GPUs]
29/31

MPMC on GPU (Low Contention)
- Lower contention improves scalability.
[figures: GTX280 (CC 1.3), C2050 (CC 2.0), GTX680 (CC 3.0)]
30/31

Summary
1. Concurrent queues are, in general, performance portable from CPU to GPU.
2. The configurable caches are still NOT enough to remove the benefit of redesigning algorithms around GPU shared memory.
3. The significantly improved atomics in Fermi, and further in Kepler, are a big motivation for algorithmic design of CDS for GPU.
31/31

References
1. Lamport, L.: Specifying concurrent program modules. ACM Transactions on Programming Languages and Systems 5 (1983).
2. Giacomoni, J., Moseley, T., Vachharajani, M.: FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM (2008).
3. Preud'homme, T., Sopena, J., Thomas, G., Folliot, B.: BatchQueue: fast and memory-thrifty core to core communication. In: 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) (2010).
4. Lee, P.P.C., Bu, T., Chandranmenon, G.: A lock-free, cache-efficient shared ring buffer for multi-core architectures. In: Proceedings of the 5th ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS '09), ACM (2009).
5. Michael, M., Scott, M.: Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. In: Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, ACM (1996).
6. Tsigas, P., Zhang, Y.: A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems. In: Proceedings of the 13th Annual ACM Symposium on Parallel Algorithms and Architectures, ACM (2001).