The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors Thomas E. Anderson Presented by David Woodard.

The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors Thomas E. Anderson Presented by David Woodard

Introduction Shared Memory Multiprocessors  Need to protect shared data structures (critical sections)  Often share resources, including ports to memory (bus, network, etc.)  Challenge: efficiently implement scalable, low-latency mechanisms to protect shared data

Introduction Spin Locks  One approach to protecting shared data on multiprocessors  Efficient on some systems, but greatly degrades performance on others

Multiprocessor Architecture Paper focuses on two dimensions for design  Interconnect type (bus/multistage network)  Cache coherence strategy Six Proposed Models  Bus: no cache coherence  Bus: snoopy write-through invalidation cache coherence  Bus: snoopy write-back invalidation cache coherence  Bus: snoopy distributed-write cache coherence  Multistage network: no cache coherence  Multistage network: invalidation-based cache coherence

Mutual Exclusion and Atomic Operations Most processors support atomic read-modify-write operations Test and Set  Atomically load the (old) value of the lock and store TRUE into it  If the loaded value was FALSE, the caller holds the lock and continues  else it retries until the lock is free (a spin lock, sketched below)
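
A minimal sketch of such a test-and-set spin lock in C11 (the spinlock_t type and function names here are illustrative, not from the paper; C11's atomic_exchange plays the role of the test-and-set instruction):

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_bool held;   /* true while some processor owns the lock */
    } spinlock_t;

    void spin_acquire(spinlock_t *l) {
        /* atomic_exchange stores true and returns the old value in one
         * atomic step -- the C11 analogue of test-and-set. */
        while (atomic_exchange(&l->held, true)) {
            /* old value was true: someone else holds the lock, so retry */
        }
    }

    void spin_release(spinlock_t *l) {
        atomic_store(&l->held, false);
    }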

Test and Set in a Spin Lock Advantages  Quickly gains access to the lock when it is available  Works well on systems with few processors or low contention Disadvantages  Slows down other processors (including the processor holding the lock!)  Shared resources are also consumed carrying out the test-and-set instructions  More complex algorithms that reduce the burden on those resources increase the latency of acquiring the lock

Spin on Read Intended for processors with per-CPU coherent caches  Each CPU spins testing the value of the lock in its own cache  If the lock appears free, it then issues a test-and-set transaction (sketched below) Problem  The cache coherence protocol itself slows down lock handoff  More pronounced in systems with invalidation-based policies
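
A minimal sketch of spin on read (test-and-test-and-set), reusing the illustrative spinlock_t above; the ordinary load spins harmlessly in the local cache, and the expensive atomic exchange is issued only when the lock appears free:

    void spin_acquire_ttas(spinlock_t *l) {
        for (;;) {
            /* Spin on an ordinary read; it hits in the local cache until
             * the holder's release invalidates the line. */
            while (atomic_load(&l->held))
                ;
            /* Lock looked free: now try the atomic test-and-set. */
            if (!atomic_exchange(&l->held, true))
                return;             /* acquired */
            /* Lost the race; fall back to read-only spinning. */
        }
    }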

Why Quiescence is Slow for Spin on Read  When the lock is released its value is modified, so all cached copies of it are invalidated  Subsequent reads on all processors miss in the cache, generating bus contention  Many processors see the lock free at the same time, because there is a delay in satisfying the cache miss of the one that will eventually succeed in getting the lock  Many then attempt to set it using test-and-set  Each attempt generates contention and invalidates all copies  All but one attempt fail, causing those CPUs to revert to reading  The first read misses in the cache!  By the time all this is over, the critical section has completed and the lock has been freed again!

Performance

Quiescence Time for Spin on Read

Proposed Software Solutions Based on CSMA (Carrier Sense Multiple Access)  Basic idea: adjust the length of time between attempts to access the shared resource  Set the delay statically or dynamically? When to delay?  After spin on read sees the lock free, delay before attempting to set it  After every memory access  The latter is better on models where spin on read itself generates contention

Proposed Software Solutions Delay on attempted set  Reduces the number of test-and-set attempts, thereby reducing contention  Works well when the delay is short and there is little contention, OR when the delay is long and there is a lot of contention Delay on every memory access  Works well on systems without per-CPU caches  Checking the lock less often reduces the number of spinning reads, and thereby the load on the shared bus or network

Length of Delay - Static Advantages  Each processor is given its own "slot"; this makes it easy to assign priority to CPUs  Few empty slots = good latency; few crowded slots = little contention Disadvantages  Doesn't adjust to environments prone to bursts  Processors with the same delay that conflict once will conflict every time

Length of Delay - Dynamic Advantages  Adjusts to evolving environments; increases the delay after each conflict, up to a ceiling (a backoff sketch follows below) Disadvantages  What criteria determine the amount of backoff?  Long critical sections could keep increasing the delay on some CPUs Bound the maximum delay: what if the bound is too high? Too low?
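
A hedged sketch of the dynamic (exponential backoff) variant, again reusing the illustrative spinlock_t; the delay constants are arbitrary tuning parameters, and a static scheme would instead assign each processor a fixed, distinct delay:

    enum { MIN_DELAY = 16, MAX_DELAY = 4096 };   /* arbitrary tuning constants */

    void spin_acquire_backoff(spinlock_t *l) {
        unsigned delay = MIN_DELAY;
        for (;;) {
            while (atomic_load(&l->held))
                ;                                /* spin on read */
            if (!atomic_exchange(&l->held, true))
                return;                          /* acquired */
            /* Failed set: back off before retrying, doubling the delay
             * after each conflict up to a fixed ceiling. */
            for (volatile unsigned i = 0; i < delay; i++)
                ;                                /* busy-wait delay loop */
            if (delay < MAX_DELAY)
                delay *= 2;
        }
    }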

Proposed Software Solution - Queuing Flag-Based Approach  Each CPU that must wait is added to a queue  Waiting CPUs spin on the flag of the processor ahead of them in the queue (a different flag for each CPU) No bus or cache contention  Queue insertion and deletion require locks Not useful for small critical sections (such as the queue operations themselves!)

Proposed Software Solution - Queuing Counter-Based Approach (sketched below)  Each CPU does an atomic read-and-increment to acquire a unique sequence number  When a processor releases the lock, it signals the processor with the next sequence number Sets a flag in a different cache block, unique to the waiting processor The processor spinning on its own flag sees the change and continues (incurring one invalidation and one read-miss cycle)
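
A sketch of this counter-based queue lock in C11 (NPROC, the cache-line size, and all names are illustrative assumptions; the scheme requires that no more than NPROC processors contend at once):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NPROC 64    /* assumed maximum number of contending processors */
    #define LINE  64    /* assumed cache-line size in bytes */

    typedef struct {
        /* One flag per queue slot, padded so each occupies its own
         * cache line and processors spin on distinct lines. */
        struct {
            atomic_bool has_lock;
            char pad[LINE - sizeof(atomic_bool)];
        } slot[NPROC];
        atomic_uint next_ticket;    /* the atomic read-and-increment counter */
    } queue_lock_t;

    void qlock_init(queue_lock_t *l) {
        for (int i = 0; i < NPROC; i++)
            atomic_store(&l->slot[i].has_lock, false);
        atomic_store(&l->slot[0].has_lock, true);   /* first ticket may proceed */
        atomic_store(&l->next_ticket, 0);
    }

    /* Returns the caller's slot index; pass it to qlock_release later. */
    unsigned qlock_acquire(queue_lock_t *l) {
        unsigned me = atomic_fetch_add(&l->next_ticket, 1) % NPROC;
        while (!atomic_load(&l->slot[me].has_lock))
            ;                       /* spin only on this processor's own flag */
        return me;
    }

    void qlock_release(queue_lock_t *l, unsigned me) {
        atomic_store(&l->slot[me].has_lock, false);              /* reset slot for reuse */
        atomic_store(&l->slot[(me + 1) % NPROC].has_lock, true); /* signal the next waiter */
    }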

Proposed Software Solution - Queuing Advantages  Separate flag locations in memory prevent saturation from multiple accesses Especially useful for multistage networks (flags can live in separate memory modules) Disadvantages  Still not efficient for models without per-processor caches  The sequence counter is still a single shared memory location that every processor must access  Increased lock latency due to the extra instructions (increment counter, check location, zero location, set another location)  Preempting a process causes all processes behind it in the queue to wait  Can't wait for multiple events

Results

Hardware Solutions: Network Combining Networks  Combine requests to the same lock in the network (forward one, return the other)  The benefit of combining increases with contention Hardware Queuing  Blocking enter and exit instructions queue processes at the memory module  Eliminates polling across the network Goodman's Queue Links  Stores the name of the next processor in the queue directly in each processor's cache  Informs the next processor asynchronously (via inter-processor interrupt?)

Hardware Solutions: Bus Use an additional bus with a specific coherence policy  Additional die space? Separate clock speed for the bus? Read broadcast  When one processor reads a value that other processors also need, fill all caches with one read  Eliminates the extended quiescence periods due to pending reads Monitor the bus for test-and-set instructions  Prevents bus contention  If one processor performs the test-and-set, it can share the result and the other processors can abort their own test-and-set instructions  Typically cache and bus controllers are not aware of instruction types; this information is handled by functional units (e.g., ALUs) further down the pipeline

Conclusions Traditional spin lock approaches are not effective for large numbers of processors When contention is low, models borrowed from CSMA work well  Delay slots When contention is high, queuing methods work well  Trades lock latency for more efficient, parallelized lock handoff Hardware approaches are very promising, but require additional logic  Additional cost in die size and in money to manufacture

Resources  Dr. Jonathan Walpole: 3/winter2008/home.html  Emma Kuo: 3/winter2007/slides/42.pdf