Performance Evaluation of Lock-free Data Structures on GPUs


Performance Evaluation of Lock-free Data Structures on GPUs
Prabhakar Misra and Mainak Chaudhuri, Indian Institute of Technology, Kanpur

Sketch
Talk in one slide
Result highlights
Related work
Lock-free data structures
CUDA implementation
Evaluation methodology
Empirical results
Summary

Sketch
 Talk in one slide
 Result highlights
Related work
Lock-free data structures
CUDA implementation
Evaluation methodology
Empirical results
Summary

Talk in One Slide
Locks are expensive on GPUs
– Thousands of threads cause high contention
Lock-free data structures offer a possible way to implement irregular computations on GPUs
– Support for dynamically changing pointer-linked data structures is important in many applications
There is a large body of existing research on lock-free data structures for traditional multiprocessors
This is the first detailed study to explore lock-free linear lists, hash tables, skip lists, and priority queues on CUDA-enabled GPUs

Result highlights
Significant speedup on Tesla C2070 (Fermi GF100) over 24-core server execution
Maximum speedup
– 7.4x for linear lists
– 11.3x for hash tables
– 30.7x for skip list
– 30.8x for priority queue
Lock-free hash table shows best scalability for a wide range of operation mixes and key ranges
– Throughput ranges from 20.8 MOPS to 98.9 MOPS

Sketch
Talk in one slide
Result highlights
 Related work
Lock-free data structures
CUDA implementation
Evaluation methodology
Empirical results
Summary

Related work
Our lock-free linear list implementation follows a variation of the Harris-Michael construction
Our hash table implementation leverages the linear list implementation
Our lock-free skip list construction is due to Herlihy, Lev, and Shavit
We follow the construction due to Lotan and Shavit for our lock-free priority queue implementation
More related work is discussed in the paper

Sketch
Talk in one slide
Result highlights
Related work
 Lock-free data structures
CUDA implementation
Evaluation methodology
Empirical results
Summary

Lock-free data structures
Linear lists, hash tables, skip lists, priority queues
– Important computation building blocks
– We implement a set using these data structures
– Lock-free and wait-free operations on the set
– Lock-free operation: infinitely often, some instance of the operation finishes in a finite number of steps
– Wait-free operation: every instance of the operation finishes in a finite number of steps
– Correctness criterion: linearizability (except the priority queue, which is quiescently consistent; see the paper for the definition)

Lock-free linear list
Implemented using a sorted singly linked list
Supported ops: add, delete, search
– Add(x) returns 0 if x is already in the set; otherwise adds x at its sorted position and returns 1
– Delete(x) returns 0 if x is not in the set; otherwise removes x from the set and returns 1
– Search(x) returns 1 if x is found in the set and 0 otherwise
– Add and delete are lock-free
– Search is wait-free (it just walks the list)
– Delete only logically deletes a node by marking it
– Subsequent add and delete operations physically remove logically deleted nodes

Lock-free linear list: Add(x)
[Figure: the new node x is linked between the last node with key < x and the first node with key >= x by a CAS on the predecessor's 32-bit Mark+next field.]
The Mark bit is the least significant bit of the aligned 32-bit next field; it is needed for logical deletion.

Lock-free linear list: physical delete
[Figure: a CAS on the predecessor's 32-bit Mark+next field unlinks a marked node.]
Delete(x) logically marks a node; a subsequent add or delete physically removes it while walking the list.

Lock-free hash table
Leverages the lock-free linear list construction
– Implemented as a single linear list
– An array of pointers stores the starting point (head node) of each bucket
– The head node of each bucket stores a special key
– Add, delete, and search operations on a bucket start at the head node of that bucket
– The number of buckets is constant and fixed at the time of CUDA kernel launch
Supports the same three operations as the lock-free linear list

Lock-free hash table
[Figure: Delete(22) in a table with four buckets; 22 mod 4 = 2, so the operation starts at the head node of bucket 2.]

Skip list
A skip list is a hierarchy of linear lists
– Keys present in level n+1 form a subset of the keys present in level n
– Given that a key is present in level n, there is a probability p of finding the key in level n+1
– When a new key is inserted, the maximum level up to which this key can be present is decided by a random number r with expected value 1/(1-p)
A skip list offers expected logarithmic search complexity

Skip list
Keys are kept sorted in the lowest-level list
– Head and tail nodes maintain the smallest and largest keys
Upper-level lists provide probabilistic shortcuts into the lower-level lists, leading to an expected logarithmic search time

Lock-free skip list
Leverages the lock-free linear list implementation
Additional complications arise in linking up or removing multiple nodes in different lists
– It is not possible to make multiple Mark+next field modifications atomic using a single-word CAS
– Depending on the traversal paths of Add and Delete, a middle-level node of a marked key may get physically removed while the other levels remain unchanged, violating the subset property

Lock-free skip list
Add is made linearizable by adding a key bottom-up
Delete is made linearizable by logically marking the levels of the key to be deleted top-down
A key is defined to be present in the set if it is found unmarked in the lowest-level list
Two major performance bottlenecks
– Large number of CAS operations
– Complex code structure leading to a significant volume of control-flow divergence

Lock-free priority queue
Supports two operations on the underlying set: Add and DeleteMin
Leverages the lock-free skip list due to its logarithmic search complexity guarantee
– Makes the Add operation run in expected logarithmic time
– DeleteMin walks the lowest-level list until an unmarked key is found, marks it logically using CAS, and calls the skip list's Delete on that key
New performance bottleneck
– Heavy contention near the head due to concurrent DeleteMin operations

Sketch
Talk in one slide
Result highlights
Related work
Lock-free data structures
 CUDA implementation
Evaluation methodology
Empirical results
Summary

CUDA implementation
Extensive use of atomicCAS
All data structures use a generic node class
– All of them build on the basic linear list
A large number of nodes are pre-allocated
– Pointers to these nodes are stored in an array
– A global index points to the next free node
– An Add operation executes an atomicInc on this index and uses the node pointed to by the pointer at the returned index
Deleted nodes are not reused
– Reuse would require an elaborate solution to the ABA problem
– Left to future research

Sketch
Talk in one slide
Result highlights
Related work
Lock-free data structures
CUDA implementation
 Evaluation methodology
Empirical results
Summary

Evaluation methodology
Experiments are done on two platforms
– Tesla C2070 card featuring one GF100 Fermi GPU
14 streaming multiprocessors (SMs), each having 32 CUDA cores; thread blocks map to SMs
1.15 GHz core frequency and 1.49 GHz memory frequency
48 KB shared memory and 16 KB L1 cache per SM; 768 KB globally shared L2 cache
– Quad-processor SMP, each processor having six cores (Intel X7460) running at 2.66 GHz
16 MB L3 cache shared by the six cores in each processor
Lock-free CPU implementations use POSIX threads and rely on the x86 cmpxchg instruction to realize the atomicCAS primitive

Evaluation methodology
Each data structure is evaluated on
– A range of integer keys: [0, 100), [0, 1000), [0, 10000), and [0, )
– Keys generated uniformly at random from the range; these are input arguments to the operations
– Two different mixes of supported operations
– Different numbers of operations, ranging from to in steps of
The number of thread blocks and threads per block for the CUDA kernel are optimized
– In most cases, the number of thread blocks is such that each thread carries out one operation
– 64 threads per block for the linear list and 512 threads per block for the rest

Evaluation methodology
For evaluation on the CPU, the thread count that offers the best performance is picked
– 24 threads do not always offer the best performance
In summary, each experiment reports the best performance on the GPU as well as on the CPU
The lock-free hash table uses ten thousand buckets
The lock-free skip list uses p=0.5 and 32 levels
– The lock-free priority queue leverages the lock-free skip list with the same parameters

Sketch
Talk in one slide
Result highlights
Related work
Lock-free data structures
CUDA implementation
Evaluation methodology
 Empirical results
Summary

Lock-free linear list
[Chart: speedup for different Add %, Delete %, and Search % mixes.]
No major difference between search-heavy and add/delete-heavy op strings
Best performance on small key ranges and larger op counts
Best speedup = 7.3x

Lock-free hash table
Consistent speedup across all key ranges and op mixes
Best speedup = 11.3x

Lock-free skip list
Speedup drops with increasing key range
For an identical key range, speedup improves with the number of Add ops
Still good speedup (around 4x) at the large key range for Add-heavy op strings
Best speedup = 30.7x

Lock-free priority queue
[Chart: speedup for different Add % and DeleteMin % mixes.]
Trends are similar to the skip list: speedup increases with Add %
Best speedup = 30.8x

Hash table vs. linear list
The data are shown for the largest key range
 On the GPU, the hash table is 36x to 538x faster than the linear list
 On the CPU, the hash table is only 8x to 54x faster than the linear list
 The GPU exposes more concurrency in the lock-free hash table

Skip list vs. linear list
 On the GPU, the skip list is 2x to 20x faster than the linear list
 The GPU still exposes more concurrency than the CPU for the skip list
 The hash table shows far better scalability than the skip list

Throughput of hash table
The hash table is the best-performing data structure among the four we have evaluated
– For the largest key range, on a search-heavy op mix [20, 20, 60], the throughput ranges from 28.6 MOPS to 98.9 MOPS on the GPU
– For an add/delete-heavy op mix [40, 40, 20], the throughput ranges from 20.8 MOPS to 72.0 MOPS
Nearly 100 MOPS on a search-heavy op mix

Sketch
Talk in one slide
Result highlights
Related work
Lock-free data structures
CUDA implementation
Evaluation methodology
Empirical results
 Summary

Summary
First detailed evaluation of four lock-free data structures on CUDA-enabled GPUs
All four data structures offer moderate to high speedup on small to medium key ranges compared to CPU implementations
Benefits are low for large key ranges in linear lists, skip lists, and priority queues
– Primarily due to CAS overhead and complex control flow in skip lists and priority queues
Hash tables offer consistently good speedup on arbitrary key ranges and op mixes
– Nearly 100 MOPS throughput for search-heavy op mixes and more than 11x speedup over the CPU

Summary
Further improvement requires two key architectural innovations in GPUs
– Fast atomics and high synchronization throughput
Helpful for all kinds of scalable implementations
– Reduction in control-flow divergence overhead
Helpful for complex lock-free constructions such as skip lists and priority queues

Thank you