The Synergy Between Non-blocking Synchronization and Operating System Structure
By Michael Greenwald and David Cheriton
Presented by Jonathan Walpole
CS510 – Concurrent Systems

The Cache Kernel: Yet Another Attempt at a µ-kernel
- Minimal privileged-mode code:
  - Caching model of kernel functionality
  - OS functionality in user-mode class libraries, allowing application customizability of the OS
- Shared memory and signal-based communication are the only communication mechanisms
- Goal: a scalable, robust, flexible operating system design

Synergy of NBS and Kernel Structure
Claim: NBS and good design go together in the Cache Kernel design and implementation
- NBS allows better OS structuring
- Good OS structure simplifies NBS

Non-blocking Synchronization
Basic idea of NBS:
- Associate a version number with the data structure
- Read the version number at the start of the critical section
- Atomically check and increment the version number at the end of the critical section, at the same time as applying the update(s), using a Double Compare and Swap (DCAS) instruction
- Roll back and retry on failure (i.e., if the version number changed while you were in the critical section)
An optimistic synchronization technique similar to that of the Synthesis Kernel

Example: Deletion from Linked List

    do {
     retry:
        backoffIfNeeded();
        version = list->version;                 /* Read version # */
        for (p = list->head; (p->next != elt); p = p->next) {
            if (p == NULL) {                     /* Not found */
                if (version != list->version) {
                    goto retry;                  /* List changed */
                }
                return NULL;                     /* Really not found */
            }
        }
    } while (!DCAS(&(list->version), &(p->next),
                   version, elt,
                   version+1, elt->next));
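For contrast, here is a hedged sketch (not the paper's code) of insertion at the head of the list in the same style: the new element is prepared off to the side and then published with a single DCAS that also bumps the version number. Type and field names follow the deletion example above and the slide's loose pointer/int typing.

    /* Illustrative sketch only: insert elt at the head of the list. */
    Elt *insertHead(List *list, Elt *elt)
    {
        int version;
        do {
            backoffIfNeeded();
            version = list->version;       /* Read version # */
            elt->next = list->head;        /* Prepare the link privately */
        } while (!DCAS(&(list->version), &(list->head),
                       version, elt->next,        /* expect unchanged version and head */
                       version+1, elt));          /* bump version, publish new head */
        return elt;
    }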

Double Compare and Swap

    int DCAS(int *addr1, int *addr2,
             int old1, int old2,
             int new1, int new2)
    {
        <begin atomic>
        if ((*addr1 == old1) && (*addr2 == old2)) {
            *addr1 = new1;
            *addr2 = new2;
            return(TRUE);
        } else {
            return(FALSE);
        }
        <end atomic>
    }

Hardware Support for DCAS
Could be supported through an extension to the LL/SC (load-linked/store-conditional) sequence: LLP/SCP
- LLP (load-linked-pipelined): load and link a second address after an earlier LL; the LLP is linked to the following SCP
- SCP (store-conditional-pipelined): the store depends on both the previous LLP and LL not having been invalidated in this CPU's cache
- If LLP/SCP fails, so does the enclosing LL/SC

Software Support for DCAS
Basic idea: make DCAS operations atomic by putting them in a critical section protected by a lock!
- The OS manages the lock (how many locks do you need?)
- Presumably, DCAS is a system call
- Must make sure that delayed lock holders release the lock and roll back
  - Ensures that DCAS still has non-blocking properties
Problems implementing this efficiently:
- How to support readers?
- How to avoid lock contention?
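As a rough illustration of the lock-based emulation described above, here is a minimal user-level sketch using a single global spinlock. The single-lock choice and names are assumptions; the scheme the slide describes (OS-managed lock, rollback of delayed lock holders) is more involved than this.

    #include <stdatomic.h>

    /* Sketch only: one global lock serializes all DCAS operations.
     * A real implementation would let the OS manage the lock and roll back
     * delayed lock holders so DCAS keeps its non-blocking properties. */
    static atomic_flag dcas_lock = ATOMIC_FLAG_INIT;

    int DCAS(int *addr1, int *addr2,
             int old1, int old2, int new1, int new2)
    {
        int success = 0;
        while (atomic_flag_test_and_set(&dcas_lock))
            ;                                   /* spin until the lock is free */
        if (*addr1 == old1 && *addr2 == old2) {
            *addr1 = new1;
            *addr2 = new2;
            success = 1;
        }
        atomic_flag_clear(&dcas_lock);          /* release */
        return success;
    }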

Back to the List Example

    do {
     retry:
        backoffIfNeeded();
        version = list->version;                 /* Read version # */
        for (p = list->head; (p->next != elt); p = p->next) {
            if (p == NULL) {                     /* Not found */
                if (version != list->version) {
                    goto retry;                  /* List changed */
                }
                return NULL;                     /* Really not found */
            }
        }
    } while (!DCAS(&(list->version), &(p->next),
                   version, elt,
                   version+1, elt->next));

Problem…
What happens if another thread deletes an element while we are traversing the list?
- We may end up with an invalid pointer in p
- What if we traverse it into the free pool?
- What if the memory has been reused?
- What if we end up in a different data structure?
- What if the memory is reused for a different type?
How can we prevent this kind of reader hijacking? Or at least, how can we prevent it from crashing the reader?

Type-Stable Memory Management (TSM)
A descriptor that is allocated as type T is guaranteed to remain of type T for at least t_stable
A generalization of an existing technique, allocation pools: e.g., process descriptors are statically allocated at system init and are type-stable for the lifetime of the system
- But is this sufficient to ensure that read-side code will reach the DCAS in the presence of concurrent modification?
- And how long is t_stable? When is it safe to reuse memory?
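To make the idea concrete, here is a hedged sketch of a type-stable allocation pool (not the Cache Kernel's allocator): descriptors are carved out of a static array at init time and recycled through a free list, so a pointer to a T always refers to memory laid out as a T. Synchronization of the pool itself is omitted here.

    #define POOL_SIZE 1024

    typedef struct T {
        struct T *next_free;   /* free-list link; reused as payload when allocated */
        /* ... type-specific fields ... */
    } T;

    static T pool[POOL_SIZE];  /* statically allocated at system init */
    static T *free_list;

    void pool_init(void)
    {
        for (int i = 0; i < POOL_SIZE - 1; i++)
            pool[i].next_free = &pool[i + 1];
        pool[POOL_SIZE - 1].next_free = NULL;
        free_list = &pool[0];
    }

    T *pool_alloc(void)        /* returns NULL when the pool is exhausted */
    {
        T *t = free_list;
        if (t != NULL)
            free_list = t->next_free;
        return t;
    }

    void pool_free(T *t)       /* the memory remains a T; it is only recycled */
    {
        t->next_free = free_list;
        free_list = t;
    }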

Other Solutions to Reader Hijacking
If memory is reclaimed immediately upon updates, readers must protect themselves from hijacking
But how?
- Must ensure that a pointer is valid before dereferencing it
- Additional checks at the end of the read-side critical section are required to ensure no concurrent write access took place

Claim: TSM Aids NBS
- Type stability ensures safety of pointer dereferences
- Without TSM, the delete example is too big to fit on a slide
  - And very expensive to execute: it would need to check for changes on each loop iteration
- TSM makes NBS code simpler and faster

Other Benefits of NBS in the Cache Kernel
- Signals are the only form of IPC in the Cache Kernel
- NBS simplifies synchronization in signal handlers
- Makes it easier to write efficient signal-safe code

Contention-Minimizing Data Structures (CMDS)
Locality-based structuring used in the Cache Kernel:
- Replication: per-processor structures (e.g., run queues)
- Hierarchical data structures with read-mostly high levels (e.g., hash tables with per-bucket synchronization)
- Cache-block alignment of descriptors to minimize false sharing and improve cache efficiency
Well-known techniques used in other systems (Tornado, Synthesis, …)
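A hedged sketch of the replication and alignment points (illustrative names, not Cache Kernel code): one run-queue header per processor, each padded to a cache line so that updates on one CPU do not invalidate another CPU's cache line.

    #define CACHE_LINE 64
    #define NCPUS      8

    struct thread;                         /* opaque here */

    struct run_queue {
        struct thread *head;               /* per-CPU ready list */
        int            length;
        char           pad[CACHE_LINE - sizeof(struct thread *) - sizeof(int)];
    };                                     /* padded to one cache line: no false sharing */

    static struct run_queue run_queues[NCPUS];   /* replication: one per processor */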

Benefits of CMDS
- Minimizes logical and physical contention
- Minimizes (memory) consistency overhead
- Minimizes false sharing
- Reduces lock conflicts/convoys in lock-based systems
- Reduces synchronization contention with NBS: fewer retries from conflicts at the point of DCAS
CMDS is good for locks, cache consistency, and NBS! NBS needs CMDS.

Minimizing the Window of Inconsistency
Delaying writes and grouping them together at the end minimizes the window of inconsistency
Advantages of a small window of inconsistency:
- Less probability of a failure leaving the data inconsistent
- Preemption-safe: easy to back out of the critical section
- Reduces lock hold time and contention (in lock-based systems)
It's good for system design whether you use blocking or non-blocking synchronization!
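A small illustrative example (not from the paper) of why grouping the writes matters: both sides of a transfer are computed locally and then applied with one DCAS, so no other thread can ever observe the intermediate, inconsistent state.

    /* Sketch: move 'amount' from one counter to another with a zero-length
     * window of inconsistency: the debit and credit land together. */
    int transfer(int *from, int *to, int amount)
    {
        int f, t;
        do {
            f = *from;
            t = *to;
            if (f < amount)
                return 0;                    /* nothing written on failure */
        } while (!DCAS(from, to, f, t, f - amount, t + amount));
        return 1;
    }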

How Does NBS Affect Contention?
What is the relationship between the window of inconsistency and the probability of conflict (contention)?
- NBS approaches have a critical section just like locking approaches
- The NBS critical section is defined by the scope of the version number
- This is NOT the same as the window of inconsistency!

Priority Inversion Issues
NBS allows synchronization to be subordinate to scheduling
- It avoids the priority inversion problem of locks
- It also avoids problems from page faults, I/O, and other blocking operations during synchronization
Does the highest-priority process always make progress? Is there a retry-based equivalent to priority inversion?

Why NBS Is Good for OS Structure
- Fail-stop safe: OS class libraries are tolerant of user threads being terminated
  - Most Cache Kernel OS functionality is implemented in user-mode class libraries
- Portable: the same code works on uniprocessors, multiprocessors, and in signal handlers
- Deadlock free (almost): see the examples of deletion and insertion without going through reclamation

Implementing Non-blocking Synchronization
Basic approach:
- Read the version number
- Rely on TSM for type safety of pointers
- Increment the version number and check it with every modification (abort if it changed)
Straightforward transformation from locking… so long as you have DCAS and can tolerate TSM:
- Replace acquire with a read of the version number
- Replace release with a DCAS…
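A hedged before/after sketch of that transformation (Obj, field, and compute are illustrative names, not from the paper): the lock acquire becomes a read of the version number, the update is computed without touching shared state, and the lock release becomes a DCAS that publishes the change and bumps the version.

    /* Lock-based original:
     *     lock(&obj->lock);
     *     obj->field = compute(obj->field);
     *     unlock(&obj->lock);
     */

    void nbs_update(Obj *obj)
    {
        int version, old;
        do {
            version = obj->version;          /* "acquire": read version number */
            old = obj->field;
            /* compute the new value locally; no shared writes yet */
        } while (!DCAS(&obj->version, &obj->field,
                       version, old,
                       version+1, compute(old)));   /* "release": publish + bump */
    }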

Implementing Non-blocking Synchronization (cont.)
Variants of the basic approach:
- N reads, 1 write: no backout
- 2 reads, 2 writes: no version number
In the Cache Kernel, every case of synchronization falls into one of these special cases

Complexity and Correctness
- DCAS reduces the size of algorithms (lines of code) by 75% compared to CAS, for linked lists, queues, and stacks
- Special-case uses of DCAS reduce complexity further
- Relatively straightforward transformation from locking, with similar code size

Performance Issues
Simulation-based study:
- With non-preemptive scheduling: DCAS-based NBS is almost as fast as spin-locks; CAS-based NBS is slower
- With preemptive scheduling: DCAS- and CAS-based NBS are better than spin-locks
Note, this was using mid-1990s CPUs
- The cost of CAS is much higher on today's CPUs
- What are the implications of this?

Hardware-Based Optimizations
A hardware-based optimization using advisory locking:
- Cache-based advisory locking can help avoid contention in the form of "useless parallelism"
- Instead of just checking at the end and forcing retries, check at the beginning and back off if a retry is inevitable
- Uses the Cload (conditional load) instruction, which succeeds only if the location does not have an advisory lock set, and sets the lock on success
- Initially load the version # with Cload; wait and retry on failure (seems like using TSL in a spin-lock?)

Conclusions
Claim: "Good OS structure" can support non-blocking synchronization
- Type-stable memory management (TSM)
- Data structures that minimize contention (CMDS)
- Minimizing the window of inconsistency
Claim: Non-blocking synchronization can support convenient OS structure
- Avoids deadlock; allows signals as the sole IPC mechanism
- Fault tolerant; kernel functionality can be moved to user space
- Performance: isolates scheduling from synchronization
Claim: Strong synergy between non-blocking synchronization and good OS structure

Advantages of Non-blocking Synchronization
- No deadlock! Especially useful for signal handlers
- Portability: the same code runs on uniprocessors, multiprocessors, and in signal handlers
- Performance: minimizes interference between synchronization and process scheduling
- Recovery: insulation from process failures
So why isn't it universally deployed?

Obstacles to Deployment
- Complexity: confusing to design and write efficient algorithms, especially in the absence of a DCAS primitive
- Correctness: it's hard to be convinced that there are no subtle bugs, especially on weak-memory-consistency architectures. Is TSM enough to preserve critical-section invariants?
- Performance: poor performance under contention due to excessive retries; high overhead due to the expense of CAS on modern CPUs