
Slide 1 (of 23): MULTIVIEW — The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors. Paper: Thomas E. Anderson. Presentation: Emerson Murphy-Hill

Slide 2 (of 23): Introduction
–Shared-memory multiprocessors require mutual exclusion
–Hardware primitives are almost always provided, either for direct mutual exclusion or for mutual exclusion through locking
–Interest here: short critical sections protected by spin locks
–The problem: spinning processors consume communication bandwidth. How can we cut it?

Slide 3 (of 23): Range of Architectures
Two dimensions: interconnect type (multistage network or bus) and cache type. So six architectures are considered:
–Multistage network without private caches
–Multistage network with invalidation-based cache coherence using RD
–Bus without coherent private caches
–Bus with snoopy write-through invalidation-based cache coherence
–Bus with snoopy write-back invalidation-based cache coherence
–Bus with snoopy distributed-write cache coherence
All architectures provide an atomic read-modify-write operation.

Slide 4 (of 23): Why Spin Locks are Slow
–Tradeoff: frequent polling gets you the lock faster, but slows everyone else down
–Latency is also an issue: a more complicated spin-lock algorithm adds overhead to every acquisition

Slide 5 (of 23): A Spin-Waiting Algorithm — Spin on Test-and-Set

Lock:   while (TestAndSet(lock) = BUSY) ;
Unlock: lock := CLEAR;

Slow, because:
–The lock holder must contend with non-lock-holders for bus/network access
–Spinning requests slow other requests

Slide 6 (of 23): Another Spin-Waiting Algorithm — Spin on Read (Test-and-Test-and-Set)

Lock:   while (lock = BUSY or TestAndSet(lock) = BUSY) ;
Unlock: lock := CLEAR;

–For architectures with per-processor caches
–Like the previous algorithm, but the read generates no network/bus communication while the lock is held
–Still slow for short critical sections, because the time to quiesce (for all processors to resume spinning) dominates
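A C11 sketch of spin-on-read: the ordinary load spins in the local cache while the lock is held, and the expensive atomic exchange (standing in for Test-and-Set) is attempted only when the lock looks free. The names and memory orders are choices for this sketch, not from the paper:

```c
#include <stdatomic.h>

/* 0 = CLEAR, 1 = BUSY */
static atomic_int lock = 0;

/* Test-and-test-and-set: poll with a cheap cached read, and only
   issue the atomic read-modify-write when the lock appears free. */
static void acquire(atomic_int *l) {
    for (;;) {
        /* cheap read: hits the local cache while the lock is held */
        while (atomic_load_explicit(l, memory_order_relaxed) == 1)
            ;
        /* lock looked free: now try the real test-and-set */
        if (atomic_exchange_explicit(l, 1, memory_order_acquire) == 0)
            return;
        /* lost the race: someone else grabbed it; go back to reading */
    }
}

static void release(atomic_int *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}
```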

Slide 7 (of 23): Reasons Why Quiescence is Slow
–Elapsed time between the Read and the Test-and-Set
–All cached copies of a lock are invalidated on a Test-and-Set, even if the test fails
–Invalidation-based cache coherence requires O(P) bus/network cycles, because the written value (the same one!) has to be propagated to every processor

Slide 8 (of 23): Validation

Slide 9 (of 23): Validation (a bit more)

Slide 10 (of 23): Now, Speed it Up…
–The author presents five alternative approaches
–Interestingly, four are based on the observation that communication during spin-waiting is like CSMA (Ethernet) networking protocols

Slide 11 (of 23): 1/5: Static Delay on Lock Release
–When a processor notices the lock has been released, it waits a fixed amount of time before trying a Test-and-Set
–Each processor is assigned a static delay (slot)
–Good performance when the number of slots matches the load: few slots with few spinning processors, many slots with many spinning processors

Slide 12 (of 23): 2/5: Backoff on Lock Release
–Like Ethernet backoff
–Wait a small amount of time between the Read and the Test-and-Set
–If a processor collides with another processor, it backs off for a larger random interval
–Indirectly, processors base their backoff interval on the number of spinning processors
–But…

Slide 13 (of 23): More on Backoff…
–A processor should not change its mean delay just because another processor acquires the lock
–The maximum delay should be bounded
–The initial delay on arrival should be a fraction of the processor's last delay
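The three rules above can be folded into a bounded exponential-backoff acquire. This is a minimal sketch: `MAX_DELAY`, the `pause_cycles` busy-wait, halving the remembered delay on arrival, and the per-thread `last_delay` variable are all illustrative assumptions, not values from the paper:

```c
#include <stdatomic.h>
#include <stdlib.h>

static atomic_int lock = 0;          /* 0 = CLEAR, 1 = BUSY */

/* Rule 3: a new arrival starts at a fraction of its last delay,
   so remember the delay per thread between acquisitions. */
static _Thread_local int last_delay = 2;

#define MAX_DELAY 1024               /* Rule 2: bound the maximum delay */

static void pause_cycles(int n) {
    for (volatile int i = 0; i < n; i++)
        ;  /* burn time without touching the bus */
}

static void acquire(atomic_int *l) {
    int delay = last_delay / 2 > 0 ? last_delay / 2 : 1;   /* Rule 3 */
    while (atomic_exchange_explicit(l, 1, memory_order_acquire) != 0) {
        pause_cycles(delay + rand() % delay);  /* randomized wait */
        if (delay < MAX_DELAY)
            delay *= 2;              /* collided: back off further */
    }
    /* Rule 1: the mean delay changes only on our own collisions,
       not because some other processor acquired the lock. */
    last_delay = delay;
}

static void release(atomic_int *l) {
    atomic_store_explicit(l, 0, memory_order_release);
}
```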

Slide 14 (of 23): 3/5: Static Delay before Reference

while (lock = BUSY or TestAndSet(lock) = BUSY)
    delay();

–Here you just check the lock less often
–Good when: checking frequently and there are few other spinners, or checking infrequently and there are many spinners

Slide 15 (of 23): 4/5: Backoff before Reference

while (lock = BUSY or TestAndSet(lock) = BUSY)
    delay();
    delay := delay + randomBackoff();

–Analogous to backoff on lock release
–Both dynamic and static delay are bad when the critical section is long: processors just keep backing off while the lock is being held

Slide 16 (of 23): 5/5: Queue
–Can't estimate backoff from the number of waiting processors, and can't keep a software process queue (just as slow as the lock!)
–This is the author's contribution (finally):

Init:    flags[0] := HAS_LOCK;
         flags[1..P-1] := MUST_WAIT;
         queueLast := 0;
Lock:    myPlace := ReadAndIncrement(queueLast);
         while (flags[myPlace mod P] = MUST_WAIT) ;
Unlock:  flags[myPlace mod P] := MUST_WAIT;
         flags[(myPlace+1) mod P] := HAS_LOCK;
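The pseudocode above maps fairly directly onto C11 atomics. In this sketch, `P`, the type names, and returning `myPlace` to the caller (the slide leaves it implicit how `myPlace` is carried from Lock to Unlock) are choices made for the illustration; in practice each flag would also be padded to its own cache line:

```c
#include <stdatomic.h>

#define P 8   /* number of processors / queue slots (assumption) */

enum { MUST_WAIT = 0, HAS_LOCK = 1 };

typedef struct {
    atomic_int flags[P];     /* ideally one per cache line / module */
    atomic_int queue_last;
} anderson_lock;

static void lock_init(anderson_lock *l) {
    atomic_store(&l->flags[0], HAS_LOCK);
    for (int i = 1; i < P; i++)
        atomic_store(&l->flags[i], MUST_WAIT);
    atomic_store(&l->queue_last, 0);
}

/* Each arrival claims a unique slot with one atomic increment and
   then spins on its OWN flag, so a release touches one location. */
static int lock_acquire(anderson_lock *l) {
    int my_place = atomic_fetch_add(&l->queue_last, 1);
    while (atomic_load_explicit(&l->flags[my_place % P],
                                memory_order_acquire) == MUST_WAIT)
        ;  /* spin on a private flag */
    return my_place;          /* caller passes this to lock_release */
}

static void lock_release(anderson_lock *l, int my_place) {
    atomic_store(&l->flags[my_place % P], MUST_WAIT);
    atomic_store_explicit(&l->flags[(my_place + 1) % P], HAS_LOCK,
                          memory_order_release);   /* hand off */
}
```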

Slide 17 (of 23): More on Queuing
–Works especially well on multistage networks: each flag can live on a separate memory module, so no single memory location is saturated with requests
–Works less well on a bus without cache coherence, because each processor still has to poll a value in a single place
–Lock latency is increased (per-acquisition overhead), so performance is poor when there is no contention

Slide 18 (of 23): Benchmark: Spin-lock Alternatives

Slide 19 (of 23): Overhead vs. Number of Slots

Slide 20 (of 23): Spin-waiting Overhead for a Burst

Slide 21 (of 23): Network Hardware Solutions
–Combining networks: multiple paths to the same memory location
–Hardware queuing: eliminates polling across the network
–Goodman's Queue Links: stores the name of the next processor in the queue directly in each processor's cache, eliminating the memory accesses needed for queuing

Slide 22 (of 23): Bus Hardware Solutions
–Invalidate cached copies ONLY when a Test-and-Set succeeds
–Read broadcast: whenever another processor reads a value that I know is invalid, I take a copy of that value too (piggybacking), eliminating the cascade of read misses
–Special handling of Test-and-Set: the cache and bus controllers stay off the bus while the lock is busy; essentially, a test-and-set is not issued as long as there is a possibility it might fail

Slide 23 (of 23): Conclusions
–Spin-lock performance doesn't scale
–A variant of Ethernet backoff gives good results when there is little lock contention
–Queuing (parallelizing lock handoff) gives good results when there are waiting processors
–A little supportive hardware goes a long way toward a healthy multiprocessor relationship
