The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors. Paper: Thomas E. Anderson. Presentation: Emerson Murphy-Hill.

Presentation transcript:

MULTIVIEW Slide 1 (of 23): The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors
Paper: Thomas E. Anderson
Presentation: Emerson Murphy-Hill

MULTIVIEW Slide 2 (of 23): Introduction
Shared-memory multiprocessors
Mutual exclusion is required
Hardware primitives are almost always provided
– Direct mutual exclusion
– Mutual exclusion through locking
Interest here: short critical regions, spin locks
The problem: spinning processors consume communication bandwidth. How can we cut that cost?

MULTIVIEW Slide 3 (of 23): Range of Architectures
Two dimensions:
– Interconnect type (multistage network or bus)
– Cache type
So six architectures are considered:
– Multistage network without private caches
– Multistage network with invalidation-based cache coherence using remote directories (RD)
– Bus without coherent private caches
– Bus with snoopy write-through invalidation-based cache coherence
– Bus with snoopy write-back invalidation-based cache coherence
– Bus with snoopy distributed-write cache coherence
These architectures generally provide an atomic read-modify-write operation

MULTIVIEW Slide 4 (of 23): Why Spinlocks are Slow
Tradeoff: frequent polling gets you the lock faster, but slows everyone else down
Latency is an issue: a more complicated spinlock algorithm adds overhead to every acquisition

MULTIVIEW Slide 5 (of 23): A Spin-Waiting Algorithm
Spin on Test-and-Set
Lock: while (TestAndSet(lock) = BUSY);
Unlock: lock := CLEAR;
Slow, because:
– The lock holder must contend with non-lock holders
– Spinning requests slow other requests
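The spin-on-Test-and-Set loop from this slide can be sketched in Python. This is only an illustration under stated assumptions: Python exposes no hardware test-and-set, so a hypothetical `AtomicFlag` helper uses a mutex to stand in for the atomic instruction, and `SpinOnTestAndSet` and `run_counter` are names invented for this sketch.

```python
import threading
import time

BUSY, CLEAR = 1, 0


class AtomicFlag:
    """Stand-in for a hardware test-and-set word (hypothetical helper)."""

    def __init__(self):
        self._value = CLEAR
        self._guard = threading.Lock()  # emulates the atomicity of the instruction

    def test_and_set(self):
        # Atomically write BUSY and return the previous value.
        with self._guard:
            old, self._value = self._value, BUSY
            return old

    def clear(self):
        with self._guard:
            self._value = CLEAR


class SpinOnTestAndSet:
    def __init__(self):
        self._flag = AtomicFlag()

    def acquire(self):
        # Every probe is a read-modify-write, which on real hardware
        # means bus or network traffic for each spinning processor.
        while self._flag.test_and_set() == BUSY:
            time.sleep(0)  # yield so other Python threads can run

    def release(self):
        self._flag.clear()


def run_counter(lock, workers=4, iterations=100):
    """Increment a shared counter under the lock and return the final value."""
    total = 0

    def work():
        nonlocal total
        for _ in range(iterations):
            lock.acquire()
            total += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

Calling `run_counter(SpinOnTestAndSet())` exercises the lock from four threads; the sketch demonstrates mutual exclusion only and says nothing about the bus-traffic costs the slide describes.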

MULTIVIEW Slide 6 (of 23): Another Spin-Waiting Algorithm
Spin on Read (Test-and-Test-and-Set)
Lock: while (lock = BUSY or TestAndSet(lock) = BUSY);
Unlock: lock := CLEAR;
For architectures with per-processor caches
Like the previous algorithm, but reads hit in the local cache, so spinning causes no network/bus communication
For short critical sections this is still slow, because the time to quiesce (the time until all processors have resumed spinning) dominates
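A minimal Python sketch of spin on read follows, under the same kind of assumptions as any software emulation: the `AtomicFlag` helper is hypothetical, its mutex stands in for the atomic instruction, and its plain `read` stands in for a read that would hit in the local cache on a coherent machine.

```python
import threading
import time

BUSY, CLEAR = 1, 0


class AtomicFlag:
    """Stand-in for a lock word with test-and-set (hypothetical helper)."""

    def __init__(self):
        self._value = CLEAR
        self._guard = threading.Lock()  # emulates the atomic instruction

    def read(self):
        # Plain read: models a cache hit that needs no bus transaction.
        return self._value

    def test_and_set(self):
        with self._guard:
            old, self._value = self._value, BUSY
            return old

    def clear(self):
        with self._guard:
            self._value = CLEAR


class SpinOnRead:
    def __init__(self):
        self._flag = AtomicFlag()

    def acquire(self):
        # Short-circuit "or": the test-and-set is attempted only after
        # a read has seen the lock CLEAR, as in the slide's loop.
        while self._flag.read() == BUSY or self._flag.test_and_set() == BUSY:
            time.sleep(0)  # yield so other Python threads can run

    def release(self):
        self._flag.clear()


def run_counter(lock, workers=4, iterations=100):
    """Increment a shared counter under the lock and return the final value."""
    total = 0

    def work():
        nonlocal total
        for _ in range(iterations):
            lock.acquire()
            total += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

The only difference from spinning directly on Test-and-Set is the leading read, which is what keeps the waiting loop out of the interconnect while the lock is held.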

MULTIVIEW Slide 7 (of 23): Reasons Why Quiescence is Slow
Time elapses between the Read and the Test-and-Set, so several processors can see the lock free and all attempt a Test-and-Set
All cached copies of a lock are invalidated on a Test-and-Set, even if the test fails
Invalidation-based cache coherence requires O(P) bus/network cycles, because a written value has to be propagated to every processor, even though every processor is asking for the same value

MULTIVIEW Slide 8 (of 23): Validation

MULTIVIEW Slide 9 (of 23): Validation (a bit more)

MULTIVIEW Slide 10 (of 23): Now, Speed it Up…
The author presents 5 alternative approaches
Interesting point: 4 of them are based on the observation that communication during spin-waiting resembles CSMA (Ethernet) networking protocols

MULTIVIEW Slide 11 (of 23): 1/5: Static Delay on Lock Release
When a processor notices the lock has been released, it waits a fixed amount of time before trying a Test-and-Set
Each processor is assigned a static delay (slot)
Good performance with:
– Few slots when there are few spinning processors
– Many slots when there are many spinning processors
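A rough Python sketch of static delay on lock release, offered as an illustration rather than the paper's code. The `AtomicFlag` helper, the `SLOT_SECONDS` constant, and the mapping of thread index to slot are all assumptions made for this example. After its slot delay a waiter re-reads the lock and attempts the Test-and-Set only if the lock still looks free.

```python
import threading
import time

BUSY, CLEAR = 1, 0
SLOT_SECONDS = 0.0002  # hypothetical slot length, chosen only for the demo


class AtomicFlag:
    """Stand-in for a lock word with test-and-set (hypothetical helper)."""

    def __init__(self):
        self._value = CLEAR
        self._guard = threading.Lock()  # emulates the atomic instruction

    def read(self):
        return self._value

    def test_and_set(self):
        with self._guard:
            old, self._value = self._value, BUSY
            return old

    def clear(self):
        with self._guard:
            self._value = CLEAR


class StaticDelayOnRelease:
    def __init__(self):
        self._flag = AtomicFlag()

    def acquire(self, slot):
        while self._flag.read() == BUSY or self._flag.test_and_set() == BUSY:
            # Spin on the read until the lock is seen released...
            while self._flag.read() == BUSY:
                time.sleep(0)
            # ...then wait out this processor's fixed slot before retrying.
            time.sleep(slot * SLOT_SECONDS)

    def release(self):
        self._flag.clear()


def run_counter(workers=4, iterations=50):
    """Each worker uses its index as its static slot; returns the final count."""
    lock = StaticDelayOnRelease()
    total = 0

    def work(slot):
        nonlocal total
        for _ in range(iterations):
            lock.acquire(slot)
            total += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

With distinct slots, the waiter holding the smallest slot usually reaches the Test-and-Set first and the others find the lock busy again on their re-read, which is the collision avoidance the slide is after.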

MULTIVIEW Slide 12 (of 23): 2/5: Backoff on Lock Release
Like Ethernet backoff
Wait a small amount of time between the Read and the Test-and-Set
If a processor collides with another processor, it backs off for a greater random interval
Indirectly, processors base their backoff interval on the number of spinning processors
But…

MULTIVIEW Slide 13 (of 23): More on Backoff…
Processors should not change their mean delay if another processor acquires the lock
The maximum time to delay should be bounded
The initial delay on arrival should be a fraction of the last delay
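The three backoff rules on this slide can be put together in a small Python sketch. Several details are assumptions made for illustration and are not taken from the slides: the mean delay doubles on a collision, the starting mean is half of the last one, the bounds `min_delay` and `max_delay` are arbitrary, and the `AtomicFlag` helper emulates the atomic instruction with a mutex.

```python
import random
import threading
import time

BUSY, CLEAR = 1, 0


class AtomicFlag:
    """Stand-in for a lock word with test-and-set (hypothetical helper)."""

    def __init__(self):
        self._value = CLEAR
        self._guard = threading.Lock()  # emulates the atomic instruction

    def read(self):
        return self._value

    def test_and_set(self):
        with self._guard:
            old, self._value = self._value, BUSY
            return old

    def clear(self):
        with self._guard:
            self._value = CLEAR


class BackoffOnRelease:
    def __init__(self, min_delay=1e-5, max_delay=1e-3):
        self._flag = AtomicFlag()
        self._min, self._max = min_delay, max_delay
        self._local = threading.local()  # per-thread memory of the last delay

    def acquire(self):
        # Start from a fraction (assumed: one half) of the last mean delay.
        mean = max(self._min, getattr(self._local, "last", self._min) / 2)
        while True:
            if self._flag.read() == CLEAR:
                if self._flag.test_and_set() == CLEAR:
                    break
                # Collision: the lock looked free but another processor won.
                mean = min(2 * mean, self._max)  # bounded backoff
            # Lock is held by someone else: spin on read, mean unchanged.
            while self._flag.read() == BUSY:
                time.sleep(0)
            time.sleep(random.uniform(0, 2 * mean))  # random wait with this mean
        self._local.last = mean

    def release(self):
        self._flag.clear()


def run_counter(lock, workers=4, iterations=50):
    """Increment a shared counter under the lock and return the final value."""
    total = 0

    def work():
        nonlocal total
        for _ in range(iterations):
            lock.acquire()
            total += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

The mean grows only on a failed Test-and-Set, not when a re-read simply finds the lock busy, which is one way to read the rule that a processor should not change its mean delay when another processor acquires the lock.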

MULTIVIEW Slide 14 (of 23): 3/5: Static Delay before Reference
while (lock = BUSY or TestAndSet(lock) = BUSY) delay();
Here you just check the lock less often
Good when:
– Checking frequently, with few other spinners
– Checking infrequently, with many spinners

MULTIVIEW Slide 15 (of 23): 4/5: Backoff before Reference
while (lock = BUSY or TestAndSet(lock) = BUSY) begin delay(); delay += randomBackoff(); end
Analogous to backoff on lock release
Both dynamic and static delays perform badly when the critical section is long: processors just keep backing off while the lock is being held
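Delaying before every reference, with the delay growing by a random amount after each failed attempt, might look like the following in Python. This is a sketch under assumptions: the `AtomicFlag` helper, the growth rule (add a random amount up to the current delay), and the `min_delay` and `max_delay` bounds are illustrative choices, not values from the slides.

```python
import random
import threading
import time

BUSY, CLEAR = 1, 0


class AtomicFlag:
    """Stand-in for a lock word with test-and-set (hypothetical helper)."""

    def __init__(self):
        self._value = CLEAR
        self._guard = threading.Lock()  # emulates the atomic instruction

    def read(self):
        return self._value

    def test_and_set(self):
        with self._guard:
            old, self._value = self._value, BUSY
            return old

    def clear(self):
        with self._guard:
            self._value = CLEAR


class BackoffBeforeReference:
    def __init__(self, min_delay=1e-5, max_delay=1e-3):
        self._flag = AtomicFlag()
        self._min, self._max = min_delay, max_delay

    def acquire(self):
        delay = self._min
        while self._flag.read() == BUSY or self._flag.test_and_set() == BUSY:
            time.sleep(delay)  # wait before touching the lock again
            # Grow the delay by a random amount, capped at max_delay.
            delay = min(delay + random.uniform(0, delay), self._max)

    def release(self):
        self._flag.clear()


def run_counter(lock, workers=4, iterations=50):
    """Increment a shared counter under the lock and return the final value."""
    total = 0

    def work():
        nonlocal total
        for _ in range(iterations):
            lock.acquire()
            total += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

Keeping the delay constant instead of growing it gives the static variant (3/5). The cap matters for the weakness the slide names: with a long critical section an uncapped delay would keep growing while the lock is merely held.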

MULTIVIEW Slide 16 (of 23): 5/5: Queue
Can’t estimate backoff by the number of waiting processes, and can’t keep a process queue (it would be just as slow as the lock!)
This author’s contribution (finally):
Init: flags[0] := HAS_LOCK; flags[1..P-1] := MUST_WAIT; queueLast := 0;
Lock: myPlace := ReadAndIncrement(queueLast); while (flags[myPlace mod P] = MUST_WAIT);
Unlock: flags[myPlace mod P] := MUST_WAIT; flags[(myPlace+1) mod P] := HAS_LOCK;
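The Init/Lock/Unlock pseudocode of the queue lock translates fairly directly into Python. The sketch below makes several assumptions: a mutex emulates the atomic `ReadAndIncrement`, the class and function names are invented for this example, and at most P threads contend at once, so that places never wrap around onto a flag that is still in use.

```python
import threading
import time

HAS_LOCK, MUST_WAIT = 1, 0


class ArrayQueueLock:
    def __init__(self, max_threads):
        # Init: flags[0] := HAS_LOCK; flags[1..P-1] := MUST_WAIT; queueLast := 0
        self._p = max_threads
        self._flags = [MUST_WAIT] * max_threads
        self._flags[0] = HAS_LOCK
        self._queue_last = 0
        self._guard = threading.Lock()  # emulates an atomic ReadAndIncrement

    def _read_and_increment(self):
        with self._guard:
            old = self._queue_last
            self._queue_last += 1
            return old

    def acquire(self):
        # Lock: take a place in line, then spin on that place's own flag.
        my_place = self._read_and_increment()
        while self._flags[my_place % self._p] == MUST_WAIT:
            time.sleep(0)  # yield so other Python threads can run
        return my_place

    def release(self, my_place):
        # Unlock: reset our flag for reuse, then hand off to the next place.
        self._flags[my_place % self._p] = MUST_WAIT
        self._flags[(my_place + 1) % self._p] = HAS_LOCK


def run_in_queue_order(workers=4, iterations=50):
    """Return the places in the order they entered the critical section."""
    lock = ArrayQueueLock(workers)
    order = []

    def work():
        for _ in range(iterations):
            place = lock.acquire()
            order.append(place)  # critical section
            lock.release(place)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

Because each release enables exactly one waiter, `run_in_queue_order()` records the places in increasing order: the lock is granted first come, first served, and each waiter polls only its own flag.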

MULTIVIEW Slide 17 (of 23): More on Queuing
Works especially well for multistage networks: each flag can be on a separate memory module, so no single memory location is saturated with requests
Works less well on a bus without cache coherence, because each processor still has to poll across the bus for its value
Lock latency is increased (overhead), so performance is poor when there is no contention

MULTIVIEW Slide 18 (of 23): Benchmark Spin-lock Alternatives

MULTIVIEW Slide 19 (of 23): Overhead vs. Number of Slots

MULTIVIEW Slide 20 (of 23): Spin-waiting Overhead for a Burst

MULTIVIEW Slide 21 (of 23): Network Hardware Solutions
Combining Networks
– Multiple paths to the same memory location
Hardware Queuing
– Eliminates polling across the network
Goodman’s Queue Links
– Stores the name of the next processor in the queue directly in each processor’s cache
– Eliminates the need for memory access for queuing

MULTIVIEW Slide 22 (of 23): Bus Hardware Solutions
Invalidate cache copies ONLY when Test-and-Set succeeds
Read broadcast
– Whenever some other processor reads a value which I know is invalid in my cache, I get a copy of that value too (piggyback)
– Eliminates the cascade of read misses
Special handling of Test-and-Set
– Cache and bus controllers don’t touch the bus if the lock is busy
– Essentially, a Test-and-Set is not issued as long as there is a possibility it might fail

MULTIVIEW Slide 23 (of 23): Conclusions
Spin-locking performance doesn’t scale
A variant of Ethernet backoff has good results when there is little lock contention
Queuing (parallelizing lock handoff) has good results when there are waiting processors
A little supportive hardware goes a long way towards a healthy multiprocessor relationship