Computer Laboratory Practical non-blocking data structures Tim Harris Computer Laboratory.

Presentation transcript:

Computer Laboratory Practical non-blocking data structures Tim Harris Computer Laboratory

Overview
- Introduction
  - Lock-free data structures
  - Correctness requirements
- Linked lists using CAS
- Multi-word CAS
- Conclusions

Introduction

    class Counter {
        int next = 0;

        int getNumber () {
            int t;
            t = next;
            next = t + 1;
            return t;
        }
    }

- What can go wrong here? Both threads can read next = 0 before either writes it back, so both return 0 and one increment is lost:

    next = 0
    Thread1: getNumber()   t = 0   result=0
    Thread2: getNumber()   t = 0   result=0
    next = 1

Introduction (2)

    class Counter {
        int next = 0;

        synchronized int getNumber () {
            int t;
            t = next;
            next = t + 1;
            return t;
        }
    }

- What about now?

    next = 0
    Thread1: getNumber()   t = 0   next = 1   result=0   Lock released
    Thread2: getNumber()   Lock acquired   next = 2   result=1

Introduction (3)

    class Counter {
        int next = 0;

        synchronized int getNumber () {
            int t;
            t = next;
            next = t + 1;
            return t;
        }
    }

- Now the problem is liveness, with Thread1 and Thread2 both calling getNumber():
  - Priority inversion: thread 1 is low priority, thread 2 is high priority, but some other thread 3 (of medium priority) prevents 1 making any progress
  - Sharing: suppose that these operations may be invoked both in ordinary code and in interrupt handlers…
  - Failure: what if thread 1 fails while holding the lock? The lock’s still held and the state may be inconsistent

Introduction (4)
- In this case a non-blocking design is easy:

    class Counter {
        int next = 0;

        int getNumber () {
            int t;
            do {
                t = next;
                /* Atomic compare and swap: CAS (location, expected value, new value).
                   The comparison with t implies that CAS returns the value it found. */
            } while (CAS (&next, t, t + 1) != t);
            return t;
        }
    }
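The CAS loop on this slide maps closely onto java.util.concurrent.atomic.AtomicInteger. What follows is a minimal sketch under stated assumptions, not the presenter's code: the class name CasCounter is made up for illustration, and because AtomicInteger.compareAndSet returns a boolean rather than the old value, the loop retries while the call reports failure.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical name; a sketch of the slide's non-blocking counter.
final class CasCounter {
    private final AtomicInteger next = new AtomicInteger(0);

    /** Hands out 0, 1, 2, ... with no number given to two callers. */
    int getNumber() {
        int t;
        do {
            t = next.get();                       // read the current value
        } while (!next.compareAndSet(t, t + 1));  // retry if another thread got in first
        return t;
    }

    public static void main(String[] args) {
        CasCounter c = new CasCounter();
        System.out.println(c.getNumber());
        System.out.println(c.getNumber());
    }
}
```

A failed compareAndSet means some other thread's update succeeded, which is the system-wide progress guarantee the slides call non-blocking. In practice AtomicInteger.getAndIncrement() does the same job in one call.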

Correctness
- Safety: we usually want a ‘linearizable’ implementation (Herlihy 1990)
  - The data structure is only accessed through a well-defined interface
  - Operations on the data structure appear to occur atomically at some point between invocation and response
- Liveness: usually one of two requirements
  - A ‘wait free’ implementation guarantees per-thread progress
  - A ‘non-blocking’ implementation guarantees only system-wide progress

Overview
- Introduction
- Linked lists using CAS
  - Basic list operations
  - Alternative implementations
  - Extensions
- Multi-word CAS
- Conclusions

Lists using CAS
- Insert 20:
  [Diagram: list H -> 10 -> 30 -> T; a new node 20, pointing at 30, is linked in by a CAS on the next pointer of 10]

Lists using CAS (2)
- Insert 20:
  [Diagram: list H -> 10 -> 30 -> T with two new nodes, 20 and 25, both prepared for insertion between 10 and 30]

Lists using CAS (3)
- Delete 10:
  [Diagram: list H -> 10 -> 30 -> T; the next pointer of H is swung from 10 to 30]

Lists using CAS (4)
- Delete 10 & insert 20:
  [Diagram: successive states of the list H -> 10 -> 30 -> T while 10 is deleted and 20 is inserted after 10 concurrently]

Logical vs physical deletion
- Use a ‘spare’ bit to indicate logically deleted nodes:
  [Diagram: list H -> 10 -> 30 -> T with new node 20; the next pointer of 10 is first marked (X) to delete 10 logically, then 10 is unlinked physically]
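The two-step deletion on this slide can be sketched in Java, where AtomicMarkableReference stands in for the ‘spare’ bit stored alongside a node's next pointer. This is an illustrative sketch in the style of common textbook presentations of the technique, not the presenter's implementation. The names HarrisList, Node, search, insert, delete and contains are assumptions, keys are assumed to lie strictly between Integer.MIN_VALUE and Integer.MAX_VALUE (used as the H and T sentinels), and memory reclamation is left to the garbage collector.

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

// Hypothetical names throughout; a sketch of a sorted set as a CAS-based list.
final class HarrisList {
    private static final class Node {
        final int key;
        final AtomicMarkableReference<Node> next; // mark == "this node is logically deleted"
        Node(int key, Node next) {
            this.key = key;
            this.next = new AtomicMarkableReference<>(next, false);
        }
    }

    private final Node tail = new Node(Integer.MAX_VALUE, null); // T
    private final Node head = new Node(Integer.MIN_VALUE, tail); // H

    /** Returns {pred, curr} with pred.key < key <= curr.key, unlinking marked nodes it passes. */
    private Node[] search(int key) {
        retry:
        while (true) {
            Node pred = head;
            Node curr = pred.next.getReference();
            while (true) {
                boolean[] marked = {false};
                Node succ = curr.next.get(marked);
                while (marked[0]) {
                    // curr is logically deleted: try to unlink it physically
                    if (!pred.next.compareAndSet(curr, succ, false, false)) continue retry;
                    curr = succ;
                    succ = curr.next.get(marked);
                }
                if (curr.key >= key) return new Node[] {pred, curr};
                pred = curr;
                curr = succ;
            }
        }
    }

    boolean insert(int key) {
        while (true) {
            Node[] w = search(key);
            Node pred = w[0], curr = w[1];
            if (curr.key == key) return false;          // already present
            Node node = new Node(key, curr);
            // Fails if pred was marked or its next pointer changed: retry from scratch
            if (pred.next.compareAndSet(curr, node, false, false)) return true;
        }
    }

    boolean delete(int key) {
        while (true) {
            Node[] w = search(key);
            Node pred = w[0], curr = w[1];
            if (curr.key != key) return false;          // not present
            Node succ = curr.next.getReference();
            // Logical deletion: set the mark on curr's own next pointer
            if (!curr.next.compareAndSet(succ, succ, false, true)) continue;
            // Physical deletion: best effort, a later search() cleans up on failure
            pred.next.compareAndSet(curr, succ, false, false);
            return true;
        }
    }

    boolean contains(int key) {
        Node curr = head;
        while (curr.key < key) curr = curr.next.getReference();
        return curr.key == key && !curr.next.isMarked();
    }

    public static void main(String[] args) {
        HarrisList list = new HarrisList();
        list.insert(10); list.insert(30); list.insert(20);
        list.delete(10);
        System.out.println(list.contains(10) + " " + list.contains(20));
    }
}
```

Marking the next pointer of the node being deleted is what makes a concurrent insert after that node fail its CAS, so the new node cannot be lost when its predecessor is unlinked.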

Implementation problems
- Also need to consider visibility of updates
  [Diagram: insertion of node 20 into the list H -> 10 -> 30 -> T, annotated "Write barrier"]

Implementation problems (2)
- …and the ordering of reads too

    while (val < seek) {
        p = p->next;
        val = p->val;
    }

  [Diagram: list H -> 10 -> 30 -> T, annotated "val = ???"]

Overview
- Introduction
- Linked lists using CAS
- Multi-word CAS
  - Design
  - Results
- Conclusions

Multi-word CAS
- Atomic read-modify-write to a set of locations
- A useful building block:
  - Many existing designs (queues, stacks, etc) use CAS2 directly (e.g. Detlefs ’00)
  - More generally it can be used to move a structure between consistent states
- We’d like it to be non-blocking, disjoint-access parallel, linearizable, and efficient with natural data

Previous work
- Lots of designs…

    Design          Parallel   Requires        Reserved bits
    Anderson ’95    Yes        Strong LL/SC    p(w+l)+l, l = log2 p + log2 a
    I+R ’95         Yes        CAS             p + log2 p
    Herlihy ’93     No         CAS             0
                    Yes        CAS             0 or 2
    Moir ’97        Yes        Strong LL/SC    log2 p + log2 n
    I+R ’95         Yes        Strong LL/SC    log2 p

    (p processors, word size w, max n locations, max a addresses)

- …none of them practicable

Design
[Diagram: list from H to T laid out in memory, with words at addresses 0x100, 0x104, 0x108, 0x10C, 0x110, 0x114, 0x118, 0x11C]

Descriptor:

    status=UNDECIDED
    locations=2
    a1=0x10C  o1=0x110  n1=0x118
    a2=0x114  o2=0x118  n2=null

- Build descriptor
- Acquire locations

    DCSS (&status, UNDECIDED, 0x10C, 0x110, &descriptor)
    DCSS (&status, UNDECIDED, 0x114, 0x118, &descriptor)

- Decide outcome

    CAS (&status, UNDECIDED, SUCCEEDED)
    status=SUCCEEDED

- Release locations

    CAS (0x10C, &descriptor, 0x118)
    CAS (0x114, &descriptor, null)

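Reading
[Diagram: list from H to T laid out in memory, with words at addresses 0x100, 0x104, 0x108, 0x10C, 0x110, 0x114, 0x118, 0x11C]

Descriptor:

    status=UNDECIDED
    locations=2
    a1=0x10C  o1=0x110  n1=0x118
    a2=0x114  o2=0x118  n2=null

    word_t read (addr_t a) {
        word_t val = *a;
        if (!isDescriptor(val))
            return val;
        else {
            /* val refers to a descriptor: its status decides the logical value */
            SUCCEEDED => return new value;
            otherwise => return old value;
        }
    }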

Whither DCSS?
- Now we need DCSS from CAS:
  - Easier than full CAS2: the locations used for ‘control’ and ‘update’ addresses must not overlap, only the ‘update’ address may be changed, and we don’t need the result

[Diagram: node 10 at 0x108 with its next pointer at 0x10C]

DCSS descriptor:

    ac=0x200  oc=0
    au=0x10C  ou=0x110  nu=0x200

- DCSS (&status, UNDECIDED, 0x10C, 0x110, &descriptor):

    CAS (0x10C, 0x110, &DCSSDescriptor)
    if (*0x200 == 0)
        CAS (0x10C, &DCSSDescriptor, 0x200)
    else
        CAS (0x10C, &DCSSDescriptor, 0x110);
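The four phases (build descriptor, acquire with DCSS, decide, release), the descriptor-aware read and the DCSS-from-CAS construction can be put together in one small Java sketch. It is an illustration under stated assumptions rather than the presenter's code. Memory words are modelled as AtomicReference<Object>, descriptors are recognised with instanceof (so no reserved bits are needed), values are compared by reference identity, entries must be listed in one global order of locations so that helping terminates, and compareAndExchange requires Java 9 or later. All class and method names are hypothetical. Unlike the read shown on the slides, read here helps an in-progress operation finish before returning.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical names throughout; a sketch, not a tuned implementation.
final class Mcas {
    enum Status { UNDECIDED, SUCCEEDED, FAILED }

    /** One (location, expected old value, new value) triple. */
    static final class Entry {
        final AtomicReference<Object> addr;
        final Object oldVal, newVal;
        Entry(AtomicReference<Object> addr, Object oldVal, Object newVal) {
            this.addr = addr; this.oldVal = oldVal; this.newVal = newVal;
        }
    }

    /** Multi-word CAS descriptor: shared status plus the locations to update. */
    static final class CasnDescriptor {
        final AtomicReference<Status> status = new AtomicReference<>(Status.UNDECIDED);
        final Entry[] entries;
        CasnDescriptor(Entry... entries) { this.entries = entries; }
    }

    /** DCSS descriptor: put owner into entry.addr only while owner.status is UNDECIDED. */
    static final class DcssDescriptor {
        final CasnDescriptor owner;
        final Entry entry;
        DcssDescriptor(CasnDescriptor owner, Entry entry) { this.owner = owner; this.entry = entry; }
    }

    /** Second half of DCSS: check the control word, then install or back out. */
    static void completeDcss(DcssDescriptor d) {
        Object result = (d.owner.status.get() == Status.UNDECIDED) ? d.owner : d.entry.oldVal;
        d.entry.addr.compareAndSet(d, result);
    }

    /** Returns the value found at the update location (oldVal on success). */
    static Object dcss(DcssDescriptor d) {
        Object seen;
        do {
            seen = d.entry.addr.compareAndExchange(d.entry.oldVal, d);
            if (seen instanceof DcssDescriptor) completeDcss((DcssDescriptor) seen); // help
        } while (seen instanceof DcssDescriptor);
        if (seen == d.entry.oldVal) completeDcss(d);
        return seen;
    }

    /** Runs (or helps) the operation described by cd; true if it succeeded. */
    static boolean casn(CasnDescriptor cd) {
        if (cd.status.get() == Status.UNDECIDED) {
            Status outcome = Status.SUCCEEDED;
            for (int i = 0; i < cd.entries.length && outcome == Status.SUCCEEDED; ) {
                Entry e = cd.entries[i];
                Object seen = dcss(new DcssDescriptor(cd, e));           // acquire location
                if (seen instanceof CasnDescriptor) {
                    if (seen != cd) { casn((CasnDescriptor) seen); continue; } // help, retry
                } else if (seen != e.oldVal) {
                    outcome = Status.FAILED;                              // unexpected value
                }
                i++;
            }
            cd.status.compareAndSet(Status.UNDECIDED, outcome);           // decide outcome
        }
        boolean ok = cd.status.get() == Status.SUCCEEDED;
        for (Entry e : cd.entries)                                        // release locations
            e.addr.compareAndSet(cd, ok ? e.newVal : e.oldVal);
        return ok;
    }

    /** Reads a location, helping any operation whose descriptor is found there. */
    static Object read(AtomicReference<Object> addr) {
        while (true) {
            Object r = addr.get();
            if (r instanceof DcssDescriptor) completeDcss((DcssDescriptor) r);
            else if (r instanceof CasnDescriptor) casn((CasnDescriptor) r);
            else return r;
        }
    }

    public static void main(String[] args) {
        AtomicReference<Object> x = new AtomicReference<>("a"), y = new AtomicReference<>("b");
        boolean ok = casn(new CasnDescriptor(new Entry(x, "a", "b"), new Entry(y, "b", "a")));
        System.out.println(ok + " " + read(x) + " " + read(y));
    }
}
```

The single CAS on status is the point at which the whole update takes effect: before it every acquired location still logically holds its old value, after it every one holds its new value, and any thread that finds a descriptor can finish the release phase on the owner's behalf.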

Evaluation: method
- Attempt to permute elements in a vector. Can control:
  - Level of concurrency
  - Length of the vector
  - Number of elements being permuted
  - Padding between elements
  - Management of descriptors

Evaluation: small systems
- gargantubrain.cl: 4-processor IA-64 (Itanium)
- Vector=1024, Width=2-64, No padding
[Chart: µs per successful update against CASn width (words permuted per update), one series per algorithm used: HF, HF-RC, IR, MCS, MCS-FG]

Evaluation: large systems
- hodgkin.hpcf: 64-processor Origin-2000, MIPS R12000
- Vector=1024, Width=2
- One element per cache line
[Chart: ms per successful update against number of processors, for HF-RC, IR and MCS]

Overview
- Introduction
- Linked lists using CAS
- Multi-word CAS
- Conclusions

Conclusions
- Some general techniques
  - The descriptor pointers serve two purposes:
    - They allow ‘helpers’ to find out the information needed to complete their work
    - They indicate ownership of locations
  - Correctness seems clearest when thinking about the state of the shared memory, not the state of individual threads
  - Unlike previous work we need only a small and constant number of reserved bits (e.g. 2 to identify descriptor pointers if there’s no type information available at run time)

Conclusions (2)
- Our scheme is the first practical one:
  - Can operate on general pointer-based data structures
  - Competitive with lock-based schemes
  - Can operate on highly parallel systems
  - Disjoint-access parallel, non-blocking, linearizable