A Methodology for Implementing Highly Concurrent Data Objects, by Maurice Herlihy. Slides by Vincent Rayappa.

2 Introduction
Objective: describe a methodology for transforming sequential data structures into concurrent ones.
–Use LL/SC (load-linked/store-conditional) instructions to accomplish this.
Why LL/SC?
–Universally applicable: any sequential data structure can be transformed into a concurrent one.
–Easier to use than CAS.
Many modern architectures support LL/SC: PowerPC, ARM, MIPS, etc.
–Not on x86, which provides CAS (cmpxchg) instead.
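For contrast with LL/SC, here is a minimal C11 sketch (not from the paper; obj_t, update, and make_new_version are illustrative names) of the direct compare-and-swap style. Because CAS compares only values, it can succeed even if the location was changed and then changed back in between (the ABA problem); a store-conditional fails on any intervening store to the reserved location, which is one reason the LL/SC-based methodology is simpler to reason about.

#include <stdatomic.h>
#include <stdlib.h>

typedef struct { int value; } obj_t;        /* stand-in for a small sequential object */

_Atomic(obj_t *) shared_obj;                /* pointer to the current version; assumed
                                               to be initialized to a valid first version */

/* Copy the old version and apply a trivial "sequential operation" to the copy. */
static obj_t *make_new_version(const obj_t *old)
{
    obj_t *nv = malloc(sizeof *nv);
    nv->value = old->value + 1;
    return nv;
}

void update(void)
{
    for (;;) {
        obj_t *old = atomic_load(&shared_obj);   /* read the current version       */
        obj_t *nv  = make_new_version(old);      /* sequential operation on a copy */
        /* CAS compares pointer values only: it can succeed even if the pointer was
           changed and then changed back in the meantime (the ABA problem).  An SC
           fails on any intervening store to the reserved location, avoiding this. */
        if (atomic_compare_exchange_weak(&shared_obj, &old, nv))
            return;
        free(nv);                                /* lost the race: retry with a fresh copy */
    }
}

Note that nothing here reclaims the displaced old version; safe reuse of that memory is exactly what the ownership protocol on slides 5-6 (and slides 13-15 for large objects) provides.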

3 Example of LL/SC
PPC has load-linked (lwarx) and store-conditional (stwcx.) instructions. Example (in asm):

long               // <- Zero on failure, one on success (r3).
AtomicStore(
    long prev,     // -> Previous value (r3).
    long next,     // -> New value (r4).
    void *addr )   // -> Location to update (r5).
{
retry:
    lwarx  r6, 0, r5   // current = *addr;
    cmpw   r6, r3      // if( current != prev )
    bne    fail        //     goto fail;
    stwcx. r4, 0, r5   // if( reservation == addr ) *addr = next;
    bne-   retry       // else goto retry;
    li     r3, 1       // Return true.
    blr                // We're outta here.
fail:
    stwcx. r6, 0, r5   // Clear reservation.
    li     r3, 0       // Return false.
    blr                // We're outta here.
}
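A usage sketch (an assumption, not from the slide: the assembly routine is assembled separately and linked against C code with this prototype; shared_counter and atomic_increment are illustrative names). A caller typically wraps AtomicStore in a read-compute-retry loop:

/* Assumed C prototype for the PPC routine above. */
long AtomicStore(long prev, long next, void *addr);

long shared_counter;                     /* word shared by multiple processors */

void atomic_increment(void)
{
    long old;
    do {
        old = shared_counter;            /* read the current value              */
    } while (!AtomicStore(old, old + 1, &shared_counter));   /* retry if it changed */
}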

4 General Methodology
The programmer provides a non-concurrent (sequential) implementation of the data structure, subject to some restrictions.
Transformation techniques plus memory-management steps then make the data structure concurrent.
Small vs. large objects:
–Small: cheap to copy the whole data structure from one memory region to another.
–Large: too expensive to copy the whole data structure.

5 Small Object Transformation
Diagram: processes A and B both LL the version pointer, which points to block M. Process B copies M into its own block M', modifies M', and its SC succeeds, making M' the current version. Process A copies M into M'' and modifies it, but its SC fails because the pointer has already changed.
This synchronization is non-blocking: an SC can only fail because some other process's SC succeeded, so some process makes progress at any given time.
–Caveat: spurious SC failures (which real hardware allows) break this argument, since then no process need make progress.
Restrictions on the sequential operations:
–They must be free of side effects other than modifying the memory block the process owns.
–They must be well defined for all legal states of the object.
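In the style of the paper's pseudocode (load_linked and store_conditional as primitives; the other names here are illustrative, not the paper's), the transformed operation for a small object looks roughly like this. The full priority-queue version, including the consistency check described on slides 8-9, appears in the Pqueue_deq code further below.

typedef struct {
    seq_object_t version;         /* the sequential object being wrapped */
} object_t;

static object_t *my_block;        /* block this process owns, same size as the object */

int operation(object_t **Q)       /* Q points to the current version */
{
    object_t *old;
    int result;
    while (1) {
        old = load_linked(Q);                        /* read pointer to current version */
        copy(&old->version, &my_block->version);     /* copy it into the block we own   */
        result = seq_operation(&my_block->version);  /* sequential op on the copy       */
        if (store_conditional(Q, my_block))          /* try to swing the pointer        */
            break;                                   /* success: our copy is now current */
    }
    my_block = old;               /* acquire ownership of the displaced old block */
    return result;
}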

6 Small Object Memory Management
Each process owns a memory block big enough to hold a copy of the data structure.
–When the version pointer is successfully updated, the process releases ownership of the new block and acquires ownership of the old block.
Since only one process can swing the pointer at a time, each block has a well-defined owner.

7 Comparison to Type-Stable Memory
TSM: the value of t_stable (how long a type must remain stable) was left undefined.
Here we have clearer management (and recycling) of memory:
–Give up ownership of the 'new version' block when the SC succeeds.
–Acquire ownership of the 'old version' block when the SC succeeds.

8 Race Condition
Stale reads are still possible. In the previous example:
–Processes A and B both read the pointer to M.
–Process A updates the version from M to M'.
–A now owns M and, as part of its next operation, can reuse it as the target for copying the current version.
–If process B (because it is slow) is still reading M, it can observe the incomplete edits A is now making!
Operations on stale data can cause unpredictable behavior.
–This violates the 'operations must be well defined for all legal states' restriction.
Prevent operations on stale data by validating the copy before using it.

9 Validating Data
Use two counters, check[0] and check[1], stored alongside each version.
–With 32-bit (or larger) counters, wraparound during a single read is extremely unlikely, so validation is robust in practice.
Modify: check[0]++ -> modify the block -> check[1]++
Copy: read check[1], copy the block, read check[0]; is check[0] == check[1]?
–Yes: the copy is consistent.
–No: we were reading while edits were in progress; discard the copy and retry.
This check could also be implemented in hardware.
–It is not clear that modern architectures support this.
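Sketched in C (illustrative names, not the paper's code; a real implementation also needs memory-ordering fences, which are omitted here), the two sides of the protocol look like this. The owner is the only writer of its block, so the counters need no lock:

typedef struct {
    unsigned check[2];   /* check[0] bumped before edits, check[1] after */
    int data[16];        /* the object's state (example payload)         */
} block_t;

/* Owner (the only writer of this block) modifies it in place. */
void owner_modify(block_t *b, int i, int v)
{
    b->check[0]++;       /* announce: edits in progress */
    b->data[i] = v;      /* ... modify the block ...    */
    b->check[1]++;       /* announce: edits complete    */
}

/* Any process may attempt to copy the block; returns 1 if the snapshot is consistent. */
int try_copy(const block_t *b, block_t *dst)
{
    unsigned first = b->check[1];          /* read the 'after' counter first */
    *dst = *b;                             /* copy the whole block           */
    unsigned last = b->check[0];           /* then read the 'before' counter */
    return first == last;                  /* equal => no edit overlapped us */
}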

10 Example: Priority Queue
A binary tree in which each node's value is greater than or equal to the values in both its subtrees (a max priority queue).
Operations supported:
–Peek at the max (or min) value.
–Extract the max value.
–Insert a new value.
The priority queue is implemented as a binary heap encoded in an array.
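For concreteness, a minimal sequential array-encoded max-heap dequeue of the kind the transformed code below calls as pqueue_deq. The field names are assumptions; the paper's pqueue_type definition is not shown on the slide.

#define PQ_MAX 16

typedef struct {
    int size;
    int elem[PQ_MAX];    /* elem[0] is the root; children of i are 2i+1 and 2i+2 */
} pqueue_type;

/* Remove and return the maximum element (sequential code, no synchronization). */
int pqueue_deq(pqueue_type *p)
{
    int max = p->elem[0];
    p->elem[0] = p->elem[--p->size];        /* move the last element to the root  */
    int i = 0;
    for (;;) {                              /* sift it down to restore heap order */
        int left = 2*i + 1, right = 2*i + 2, largest = i;
        if (left  < p->size && p->elem[left]  > p->elem[largest]) largest = left;
        if (right < p->size && p->elem[right] > p->elem[largest]) largest = right;
        if (largest == i) break;
        int tmp = p->elem[i]; p->elem[i] = p->elem[largest]; p->elem[largest] = tmp;
        i = largest;
    }
    return max;
}

Insertion is the symmetric sift-up on the same array.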

typedef struct {
    pqueue_type version;                     /* sequential object */
    unsigned check[2];                       /* guard against concurrent read of new_version */
} Pqueue_type;

static Pqueue_type *new_pqueue;              /* a process 'owns' new_version; others can still read it */

int Pqueue_deq(Pqueue_type **Q) {            /* Q - shared global */
    Pqueue_type *old_pqueue;                 /* concurrent object */
    pqueue_type *old_version, *new_version;  /* seq object */
    int result;
    unsigned first, last;
    while (1) {
        old_pqueue = load_linked(Q);                 /* calls LL; concurrent reads of old_pqueue/old_version possible */
        old_version = &old_pqueue->version;
        new_version = &new_pqueue->version;
        first = old_pqueue->check[1];
        copy(old_version, new_version);
        last = old_pqueue->check[0];
        if (first == last) {                         /* consistency check */
            result = pqueue_deq(new_version);
            if (store_conditional(Q, new_version))   /* calls SC */
                break;
        } /* if */
    } /* while */
    new_pqueue = old_pqueue;
    return result;
} /* Pqueue_deq */

12 Experimental Results
Time to enqueue and dequeue on a 16-element priority queue.
Naïve retry caused many failures for enqueue, since enqueue is slower than dequeue.
Exponential back-off on failure was added to fix this.
(Benchmark figure: number of operations = …; n is the number of processes.)

13 Large Objects
Large objects are too big to copy in full on every operation.
The sequential operation instead creates a logically distinct version of the object:
–As opposed to changing it in place.
–Logically (not physically) distinct, since the programmer is free to share memory between the old and new versions.
Concurrent operation:
–Read the pointer using LL.
–Use the sequential operation to create the new version.
–Swing the pointer to the new version using SC.
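A minimal sketch (illustrative only; the paper's large-object benchmark uses a different structure) of what 'logically distinct but physically shared' means: inserting into a sorted linked list by copying only the nodes before the insertion point and sharing the rest with the old version. The old version is never modified, so it stays valid for concurrent readers, and the new head can be installed with SC.

#include <stdlib.h>

typedef struct node {
    int value;
    struct node *next;
} node_t;

/* Build a new version of a sorted list with 'v' inserted.  Nodes before the
   insertion point are copied; nodes after it are shared with the old version. */
node_t *insert_new_version(node_t *old, int v)
{
    node_t *new_node = malloc(sizeof *new_node);
    new_node->value = v;

    if (old == NULL || v <= old->value) {
        new_node->next = old;                      /* share the entire old list */
        return new_node;
    }
    node_t *copy = malloc(sizeof *copy);           /* copy the head ...         */
    node_t *head = copy, *p = old;
    while (p->next != NULL && p->next->value < v) {
        copy->value = p->value;
        copy->next = malloc(sizeof *copy->next);   /* ... and each node on the path */
        copy = copy->next;
        p = p->next;
    }
    copy->value = p->value;
    new_node->next = p->next;                      /* share the unchanged suffix */
    copy->next = new_node;
    return head;                                   /* root of the new version    */
}

If the SC that tries to install the new head fails, the freshly allocated nodes must be recycled and the operation retried; tracking those tentative allocations is what the recoverable set on the next slides is for.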

14 Large Object Memory Management
The number of memory blocks a sequential operation needs is not fixed.
–It depends on how much sharing happens between the old and new versions.
Each process owns a pool of memory blocks.
–If it runs out of blocks, it may have to borrow blocks from a common pool.
Memory is managed via a recoverable set data structure.

15 Recoverable Set
Each block is in one of three states:
–Committed, Allocated, or Freed.
State diagram (for a block B1): set_alloc moves a committed block to allocated, set_free moves it to freed, and set_commit returns a freed block to committed.
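A minimal sketch of one way a recoverable set could work (my own construction for illustration, not the paper's data structure; the set_abort name is an assumption for the roll-back that must happen when the SC fails). Allocations and frees made during an operation are tentative until commit:

#include <stddef.h>

typedef struct block {
    struct block *next;
    char mem[256];                 /* block payload (size illustrative) */
} block_t;

typedef struct {
    block_t *committed;            /* free to allocate from                     */
    block_t *allocated;            /* tentatively part of the new version       */
    block_t *freed;                /* tentatively removed from the old version  */
} recoverable_set_t;

static block_t *pop(block_t **list) {
    block_t *b = *list;
    if (b) *list = b->next;
    return b;
}
static void push(block_t **list, block_t *b) {
    b->next = *list;
    *list = b;
}

/* Tentatively allocate a block for the new version. */
block_t *set_alloc(recoverable_set_t *s) {
    block_t *b = pop(&s->committed);     /* NULL => would borrow from a common pool */
    if (b) push(&s->allocated, b);
    return b;
}

/* Tentatively free a block that belongs to the old version. */
void set_free(recoverable_set_t *s, block_t *b) {
    push(&s->freed, b);
}

/* SC succeeded: freed blocks become reusable; allocated blocks now belong to the object. */
void set_commit(recoverable_set_t *s) {
    block_t *b;
    while ((b = pop(&s->freed)) != NULL)
        push(&s->committed, b);
    s->allocated = NULL;
}

/* SC failed (name assumed): allocated blocks return to the pool;
   freed blocks are dropped from the set, since they are still part of the current version. */
void set_abort(recoverable_set_t *s) {
    block_t *b;
    while ((b = pop(&s->allocated)) != NULL)
        push(&s->committed, b);
    s->freed = NULL;
}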

16 Large Object Performance
Same experiment as for small objects, except that the heap has 512 elements instead of 16.

17 Comments
This paper shows a method for transforming sequential data structures into concurrent ones with reasonable performance.
–The transformed program performs within 2x of a spin-lock with back-off.
–Whether that is 'reasonable' depends on the application; Massalin, Michael, and Scott do not think the performance is good enough.
Reasoning about non-blocking synchronization (NBS) is not easy.
–It is difficult to reason about which memory locations are accessed concurrently and which are not.
–With locks, you know there is no concurrent access inside a critical section.
Automated transformation:
–If the transformation were done automatically by a compiler or pre-processor, NBS would be easy to use.
–Perhaps it would even be worth the performance penalty.