CS510 Concurrent Systems Jonathan Walpole. A Methodology for Implementing Highly Concurrent Data Objects.


CS510 Concurrent Systems Jonathan Walpole

A Methodology for Implementing Highly Concurrent Data Objects

Concurrent Object
- A data structure shared by concurrent processes
- Traditionally implemented using locks
- Problem: in asynchronous systems, slow or failed processes impede fast processes
- Alternative approaches include non-blocking and wait-free concurrent objects
  ... but these are hard to write! ... and hard to understand!

Goals
- Make it easy to write concurrent objects (an automatable method)
- Make them as easy to reason about as sequential objects (linearizability)
- Preserve as much performance as possible

The Plan
- Write the sequential object first
- Transform it into a concurrent one (automatically)
- Ensure it is linearizable
- Use load-linked/store-conditional for a non-blocking implementation
- Transform the non-blocking implementation into a wait-free one when necessary
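As a concrete starting point, here is a minimal sketch of the kind of sequential object the plan begins with: an array-based priority queue. Only the names pqueue_type, pqueue_enq, and pqueue_deq come from the later slides; the max-heap layout and the fixed size limit are assumptions made for illustration.

#define PQUEUE_SIZE 64

typedef struct {
    int size;
    int elements[PQUEUE_SIZE];
} pqueue_type;

/* Insert a value, keeping the max-heap property (sift up). */
int pqueue_enq(pqueue_type *p, int v)
{
    if (p->size == PQUEUE_SIZE)
        return -1;                      /* full: report failure rather than wait */
    int i = p->size++;
    while (i > 0 && p->elements[(i - 1) / 2] < v) {
        p->elements[i] = p->elements[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    p->elements[i] = v;
    return 0;
}

/* Remove and return the largest value (sift down). */
int pqueue_deq(pqueue_type *p)
{
    if (p->size == 0)
        return -1;                      /* empty: report failure rather than wait */
    int top = p->elements[0];
    int v = p->elements[--p->size];
    int i = 0;
    for (;;) {
        int c = 2 * i + 1;
        if (c >= p->size)
            break;
        if (c + 1 < p->size && p->elements[c + 1] > p->elements[c])
            c++;
        if (v >= p->elements[c])
            break;
        p->elements[i] = p->elements[c];
        i = c;
    }
    p->elements[i] = v;
    return top;
}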

Non-Blocking Concurrency
- Non-blocking: after a finite number of steps, at least one process must complete
- Wait-free: every process must complete after a finite number of steps
- A system that is merely non-blocking is prone to starvation
- A wait-free system protects against starvation

The Methodology
- Sequential objects must be total, i.e., well defined on all valid states of the data
- They must also have no side effects
- Why?
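To see why totality matters, compare a partial dequeue with a total one. This is an illustrative sketch using the sequential queue above; partial_deq and total_deq are hypothetical names. In the transformed object the operation runs on a private copy that no other process can change, so an operation that waits for the state to become "valid" would never terminate.

/* Partial: only defined when the queue is non-empty, so it waits. */
int partial_deq(pqueue_type *p)
{
    while (p->size == 0)
        ;                               /* on a private copy, this spins forever */
    return pqueue_deq(p);
}

/* Total: defined on every valid state, including "empty". */
int total_deq(pqueue_type *p, int *out)
{
    if (p->size == 0)
        return 0;                       /* report "empty" instead of waiting */
    *out = pqueue_deq(p);
    return 1;
}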

Load-Linked / Store-Conditional
- Load-linked reads the value of a memory location
- Store-conditional stores to that location, but only if the load-linked location has not been accessed in between
- No ABA problem, but it can suffer from spurious failures!
  - for example, due to cache line replacement
  - or execution of a new load-linked on a different location
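The canonical LL/SC usage pattern is a read-modify-write retry loop. A sketch, assuming load_linked and store_conditional are available as primitives (on real hardware they are instructions such as LDREX/STREX or lwarx/stwcx., usually reached through compiler builtins rather than plain C calls):

extern unsigned load_linked(unsigned *addr);               /* read and reserve a location */
extern int store_conditional(unsigned *addr, unsigned v);  /* store only if still reserved */

void atomic_increment(unsigned *counter)
{
    unsigned old;
    do {
        old = load_linked(counter);
    } while (!store_conditional(counter, old + 1));        /* retry on (possibly spurious) failure */
}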

Small Object Method
- Read the shared memory pointer using load_linked
- Copy the version it points to into another block
- Apply the sequential operation to the copy
- Call store_conditional to swing the pointer from the old version to the new one
- On success the operation is complete; on failure, retry
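A sketch of these four steps before the consistency check introduced on the next slide. Names follow the Pqueue code later in the deck; copy, load_linked and store_conditional are the primitives that code assumes, and new_pqueue is the per-process scratch block that is recycled on success. (The deck's later code passes new_version to store_conditional; new_pqueue is used here so the pointer types line up.)

typedef struct {
    pqueue_type version;                /* the sequential object, embedded in a copyable block */
} Pqueue_type;

static Pqueue_type *new_pqueue;         /* per-process scratch block, allocated at start-up */

int Pqueue_deq_basic(Pqueue_type **Q)
{
    Pqueue_type *old_pqueue;
    pqueue_type *old_version, *new_version;
    int result;

    while (1) {
        old_pqueue  = load_linked(Q);              /* 1. read the shared pointer */
        old_version = &old_pqueue->version;
        new_version = &new_pqueue->version;
        copy(old_version, new_version);            /* 2. copy the version into the scratch block */
        result = pqueue_deq(new_version);          /* 3. apply the sequential op to the copy */
        if (store_conditional(Q, new_pqueue))      /* 4. swing the pointer; on failure, retry */
            break;
    }
    new_pqueue = old_pqueue;            /* recycle the displaced block as the next scratch copy */
    return result;
}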

Is This Safe?
- What if one thread reads while another is modifying - can the copy be corrupted?
- To prevent use of incomplete state, two version counters are used (check[0] and check[1])
- Updates: a thread updates check[0], then does the modification, then updates check[1]
- Copies: a thread reads check[1], copies the version, then reads check[0]; the copy is used only if the two values match
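A schematic sketch of the counter discipline just described - not the paper's exact code, which folds the reader side into the copy loop on the next slide. The block now also carries the two counters; begin_update and end_update are hypothetical helpers, and real code would additionally need memory barriers around the counter accesses.

typedef struct {
    pqueue_type version;
    unsigned check[2];                  /* consistency counters carried in each block */
} Pqueue_type;

/* Writer: bump check[0] before modifying the block, bump check[1] after. */
void begin_update(Pqueue_type *b) { b->check[0]++; }
void end_update(Pqueue_type *b)   { b->check[1]++; }

/* Reader: read check[1], copy, read check[0]; equal values mean no writer
   overlapped the copy. */
int consistent_copy(Pqueue_type *src, Pqueue_type *dst)
{
    unsigned first = src->check[1];
    dst->version = src->version;        /* stands in for copy(&src->version, &dst->version) */
    unsigned last = src->check[0];
    return first == last;
}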

Small Objects Cont. - Code

int Pqueue_deq(Pqueue_type **Q)
{
    Pqueue_type *old_pqueue;
    pqueue_type *old_version, *new_version;
    int result;
    unsigned first, last;

    while (1) {
        old_pqueue  = load_linked(Q);
        old_version = &old_pqueue->version;
        new_version = &new_pqueue->version;
        first = old_pqueue->check[1];            /* read check[1] before copying */
        copy(old_version, new_version);          /* copy the old data into the new block */
        last = old_pqueue->check[0];             /* read check[0] after copying */
        if (first == last) {                     /* counters match: the copy is consistent */
            result = pqueue_deq(new_version);    /* perform the dequeue on the private copy */
            if (store_conditional(Q, new_version))
                break;                           /* try to publicize the new heap */
        }
    }                                            /* check or SC failed: loop and retry */
    new_pqueue = old_pqueue;                     /* the old block becomes the next scratch copy */
    return result;
}

Preventing the race condition: if the check values do not match, the copy may be corrupt, so loop again. If they DO match, the dequeue is performed on the copy and store_conditional tries to publicize the new heap, looping back on failure. Lastly, the old concurrent object block becomes this process's new scratch block, and the priority queue result is returned.

Simple Non-Blocking Algorithm vs. Spin-Lock

Why Does It Perform So Poorly?

Small Objects Cont. – Back Off

    ...
        if (first == last) {
            result = pqueue_deq(new_version);
            if (store_conditional(Q, new_version))
                break;
        }
        if (max_delay < DELAY_LIMIT)             /* double the backoff window, up to a limit */
            max_delay = 2 * max_delay;
        delay = random() % max_delay;            /* wait a random amount of time */
        for (i = 0; i < delay; i++)
            ;
    } /* end while */
    new_pqueue = old_pqueue;
    return result;
}

When the consistency check or the store_conditional fails, back off for a random amount of time before retrying!

Effect of Adding Backoff

Why Does It Still Perform Poorly?

Other Problems
- Long operations struggle to finish
- They are repeatedly forced to retry by short operations
- The approach is subject to starvation!
- How can we fix this?

Small Objects – Wait Free
- Each process declares its intended operation ahead of time, using an invocation structure
  - operation name, arguments, and a toggle bit to determine whether the invocation is old or new
- The results of each operation are recorded in a response structure
  - result value, toggle bit
- Every process tries to perform every other process's pending operation before doing its own!
  - but only one succeeds (exactly one store_conditional wins!)
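A sketch of the invocation and response structures implied by the code on the next slides. Field and type names are taken from that code; the exact layout and the MAX_PROCS value are assumptions, and the dequeue code refers to the per-process response array as "responses".

#define MAX_PROCS 64                    /* arbitrary bound on the number of processes */

typedef struct {
    int op_name;                        /* ENQ_CODE or DEQ_CODE */
    int arg;                            /* e.g. the value to enqueue */
    unsigned toggle;                    /* flipped each time this process announces a new op */
} inv_type;

typedef struct {
    int value;                          /* result of the operation */
    unsigned toggle;                    /* matches the invocation's toggle once applied */
} res_type;

/* One announced invocation per process, shared by everyone... */
inv_type announce[MAX_PROCS];

/* ...and one response slot per process inside every version of the object, so
   results travel with the data and stale operations can be re-applied. */
typedef struct {
    pqueue_type version;
    unsigned check[2];
    res_type res_types[MAX_PROCS];
} Pqueue_type;                          /* wait-free variant of the block */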

Small Objects – Wait Free

    ...
    announce[P].op_name = DEQ_CODE;                          /* record your operation */
    new_toggle = announce[P].toggle = !announce[P].toggle;   /* flip the toggle bit */
    if (max_delay > 1) max_delay = max_delay >> 1;
    while (((*Q)->responses[P].toggle != new_toggle) ||      /* ? */
           ((*Q)->responses[P].toggle != new_toggle)) {
        old_pqueue  = load_linked(Q);
        old_version = &old_pqueue->version;
        new_version = &new_pqueue->version;
        first = old_pqueue->check[1];
        memcopy(old_version, new_version, sizeof(pqueue_type));
        last = old_pqueue->check[0];
        if (first == last) {
            result = pqueue_deq(new_version);
            apply(announce, new_pqueue);                     /* apply all pending operations to the NEW version */
            if (store_conditional(Q, new_version))
                break;                                       /* commit the new version */
        }
        if (max_delay < DELAY_LIMIT) max_delay = 2 * max_delay;
        delay = random() % max_delay;
        for (i = 0; i < delay; i++)
            ;
    }
    new_pqueue = old_pqueue;
    return result;
}

Doing The Work of Others

void apply(inv_type announce[MAX_PROCS], Pqueue_type *object)
{
    int i;
    for (i = 0; i < MAX_PROCS; i++) {
        /* an announced invocation whose toggle differs from the recorded
           response has not been applied yet */
        if (announce[i].toggle != object->res_types[i].toggle) {
            switch (announce[i].op_name) {
            case ENQ_CODE:
                object->res_types[i].value =
                    pqueue_enq(&object->version, announce[i].arg);
                break;
            case DEQ_CODE:
                object->res_types[i].value =
                    pqueue_deq(&object->version);
                break;
            default:
                fprintf(stderr, "Unknown operation code\n");
                exit(1);
            }
            object->res_types[i].toggle = announce[i].toggle;   /* mark it as applied */
        }
    }
}

Small Objects Cont. – Wait Free
[Performance graph comparing Small Object, Non-Blocking (back-off) with Small Object, Wait-Free (back-off)]

Wait-Free Performance Sucks! Why?

Large Concurrent Objects
- Cannot be copied in a single block
- Represented by a set of blocks linked by pointers
- The programmer is responsible for determining which blocks of the object to copy
- The less copying, the better the performance
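To make the copying idea concrete, here is a sketch of a large object where an operation allocates only the block it changes and shares the rest with the old version. This is a hypothetical linked list standing in for the paper's pointer-linked heap example, and memory reclamation is ignored.

#include <stdlib.h>

typedef struct node {
    int value;
    struct node *next;
} node_t;

typedef struct {
    node_t *head;
} list_obj;

/* "Sequential" dequeue written for the transformation: it never modifies the
   old version; it returns a new root block that shares every remaining node. */
list_obj *list_deq(const list_obj *old, int *out)
{
    list_obj *new_obj = malloc(sizeof(*new_obj));
    if (old->head == NULL) {            /* total: defined on the empty list too */
        *out = -1;
        new_obj->head = NULL;
    } else {
        *out = old->head->value;
        new_obj->head = old->head->next;   /* share the tail; copy nothing else */
    }
    return new_obj;
}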

Summary & Contributions
A foundation for transforming sequential implementations into concurrent ones:
- uses LL and SC
- reduces programming complexity
- could be performed by a compiler
- maintains a “reasonable” level of performance?