EECE 550, Slide 1: 5.5 Synchronization
Issue: How can synchronization operations be implemented in bus-based cache-coherent multiprocessors?
Components of a synchronization event
–Acquire method
–Waiting algorithm
    Busy waiting (processor cannot do other work)
    Blocking (higher overhead, state must be saved)
–Release method

Slide 2: Implementing Mutual Exclusion (Lock-Unlock)
Hardware solution
–Use a set of LOCK bus lines
    Expensive and nonscalable
[Figure: processors P1, P2, P3, ..., Pp attached to bus lines LOCK1, LOCK2, LOCK3, LOCK4]

Slide 3
Software solution
–Requires hardware support for an atomic test-and-set operation
–Example
    lock:   ld   reg, location
            cmp  reg, #0
            bnz  lock
            st   location, #1
            ret
    unlock: st   location, #0
            ret
Does this work?
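
One way to approach the question is to replay a single interleaving by hand. The sketch below is not a lock; it is a deterministic, single-threaded replay (the function name `naive_lock_interleaving` is made up for illustration) in which two processors both execute the `ld` before either executes the `st`. Because the load and the store are separate instructions, both processors read 0 and both proceed.

```c
/* Replays one bad interleaving of the ld/cmp/st lock for two processors.
   Returns how many processors end up inside the critical section. */
int naive_lock_interleaving(void) {
    int location = 0;          /* 0 = free, 1 = held */
    int inside = 0;

    int reg_p1 = location;     /* P1: ld reg, location */
    int reg_p2 = location;     /* P2: ld reg, location (before P1's st) */

    if (reg_p1 == 0) {         /* P1: cmp reg, #0 ; bnz falls through */
        location = 1;          /* P1: st location, #1 */
        inside++;
    }
    if (reg_p2 == 0) {         /* P2: its register still holds the stale 0 */
        location = 1;          /* P2: st location, #1 */
        inside++;
    }
    return inside;             /* both entered: mutual exclusion is violated */
}
```

The replay suggests why the read and the write have to be a single atomic operation, which is what test-and-set provides.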

Slide 4
Simple software test-and-set lock
–lock:   t&s  reg, location
         bnz  reg, lock
         ret
–unlock: st   location, #0
         ret
Other possible atomic instructions
–Swap reg, location
–Fetch&op (operation) location
    fetch&inc location
    fetch&add reg, location
–Compare&swap reg1, reg2, location
    /* if (reg1 == M[location]) then M[location] ← reg2 */
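
As a rough illustration, the t&s lock can be sketched in C11, assuming `atomic_flag` stands in for `location` and `atomic_flag_test_and_set` for the t&s instruction. The type and function names (`tas_lock`, `tas_try_lock`, and so on) are invented for this sketch and do not come from the slides.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type; the flag plays the role of `location`. */
typedef struct { atomic_flag held; } tas_lock;

void tas_init(tas_lock *l) {
    atomic_flag_clear(&l->held);               /* location = 0 (free) */
}

/* One t&s attempt. atomic_flag_test_and_set returns the previous value
   and sets the flag in one atomic step; previous == false means we won. */
bool tas_try_lock(tas_lock *l) {
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* lock: t&s reg, location ; bnz reg, lock */
void tas_acquire(tas_lock *l) {
    while (!tas_try_lock(l)) {
        /* busy-wait: every failed attempt is still a write to the block */
    }
}

/* unlock: st location, #0 */
void tas_release(tas_lock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A caller would wrap its critical section as `tas_acquire(&l); ... tas_release(&l);`.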

Slide 5: Performance of t&s Locks
Figure 5.29
Based on following code
–lock(L); critical-section(c); /* c time in crit. sec. */ unlock(L);
Exponential backoff (like CSMA)
–If (lock is unsuccessful) then wait (k * f^i) time units before another attempt
    constants chosen based on experiments
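
The backoff rule can be sketched as follows, assuming `i` counts the failed attempts so far and that `f` is at least 2. The cap `MAX_DELAY`, the empty counting loop used as the "wait", and all function names are assumptions made for this sketch; the slide only says the constants are chosen experimentally.

```c
#include <stdatomic.h>

#define MAX_DELAY 1024UL   /* assumed cap on the wait; not from the slide */

/* Wait length after i failed attempts: k * f^i, clamped to MAX_DELAY.
   Assumes f >= 2 so the loop ends after a few iterations. */
unsigned long backoff_delay(unsigned long k, unsigned long f, unsigned i) {
    unsigned long d = k;
    while (i-- > 0) {
        if (d >= MAX_DELAY / f) {
            return MAX_DELAY;      /* next multiply would reach the cap */
        }
        d *= f;
    }
    return d < MAX_DELAY ? d : MAX_DELAY;
}

/* t&s lock that backs off between unsuccessful attempts. */
void backoff_lock(atomic_flag *location, unsigned long k, unsigned long f) {
    unsigned failures = 0;
    while (atomic_flag_test_and_set_explicit(location, memory_order_acquire)) {
        unsigned long d = backoff_delay(k, f, failures);
        if (failures < 64) {
            failures++;            /* delay is capped, so stop counting */
        }
        for (volatile unsigned long t = 0; t < d; t++) {
            /* idle for roughly d loop iterations */
        }
    }
}

void backoff_unlock(atomic_flag *location) {
    atomic_flag_clear_explicit(location, memory_order_release);
}
```

With `k = 1` and `f = 2` the waits grow as 1, 2, 4, 8, ... loop iterations until they reach the cap.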

Slide 6: Test-and-test-and-set Lock
Could be basis for better solution
Operation
–lock:   test reg, location
         bnz  reg, lock
         t&s  reg, location
         bnz  reg, lock
         ret
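
A minimal C11 sketch of the same idea, assuming an `atomic_int` holds `location` and `atomic_exchange` stands in for t&s. The names `ttas_lock`, `ttas_acquire`, and `ttas_release` are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical lock type: 0 = free, 1 = held. */
typedef struct { atomic_int location; } ttas_lock;

void ttas_init(ttas_lock *l) {
    atomic_store(&l->location, 0);
}

void ttas_acquire(ttas_lock *l) {
    for (;;) {
        /* test: plain loads, which can be satisfied from the local cache
           while the lock stays held */
        while (atomic_load_explicit(&l->location, memory_order_relaxed) != 0) {
            /* spin on the cached copy */
        }
        /* t&s: attempted only once the lock has been observed free */
        if (atomic_exchange_explicit(&l->location, 1, memory_order_acquire) == 0) {
            return;
        }
        /* another processor won the race; go back to testing */
    }
}

void ttas_release(ttas_lock *l) {
    atomic_store_explicit(&l->location, 0, memory_order_release);
}
```

The design point is that waiting processors only read while the lock is held, so the write traffic of repeated t&s attempts is limited to the moments just after a release.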

Slide 7: Performance Goals for Locks
Low latency
Low traffic
Scalability
Low storage cost
Fairness
–Starvation should be avoided
Evaluation of locks:
–swap lock?
–t&s lock?
–test-and-t&s lock?

Slide 8: (LL, SC) Primitives
LL (load-locked)
–Loads the synchronization variable into a register
SC (store-conditional)
–Tries to store the register value into the synchronization variable in memory; succeeds iff no other processor has written to that location (or cache block) since the LL

Slide 9
lock:   LL   reg1, location
        bnz  reg1, lock      /* if locked, try again */
        SC   location, reg2  /* reg2 holds the locked (non-zero) value */
        beqz lock            /* if SC failed, start again */
        ret
unlock: st   location, #0
        ret
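
C has no LL or SC operations, so the following is only an approximation of the lock above. It assumes `atomic_compare_exchange_weak` is an acceptable stand-in for SC: like SC it may fail and force a retry, but it compares values rather than tracking writes to the block, so it is not an exact equivalent. The function names are hypothetical.

```c
#include <stdatomic.h>

/* Approximation of the LL-SC lock; location is 0 when free, 1 when held. */
void llsc_style_lock(atomic_int *location) {
    for (;;) {
        /* LL reg1, location */
        int reg1 = atomic_load_explicit(location, memory_order_relaxed);
        if (reg1 != 0) {
            continue;                  /* bnz reg1, lock: held, try again */
        }
        /* SC location, reg2 (reg2 = 1): stores only if location still
           holds the value read above; may also fail spuriously */
        if (atomic_compare_exchange_weak_explicit(location, &reg1, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed)) {
            return;                    /* store succeeded: lock acquired */
        }
        /* beqz lock: the conditional store failed, start again */
    }
}

/* unlock: st location, #0 */
void llsc_style_unlock(atomic_int *location) {
    atomic_store_explicit(location, 0, memory_order_release);
}
```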

Slide 10: Comments on LL-SC
Only certain undo-able instructions are permitted between LL and SC
Many different types of fetch&op instructions can be implemented
SC does not generate invalidations upon a failure
Only one processor can perform LL or SC at any given time instant
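
The point about building fetch&op operations can be sketched with a retry loop. This assumes a C11 compare-and-exchange loop as a stand-in for the LL ... op ... SC sequence (C exposes no LL-SC directly); `fetch_and_op`, `op_inc`, and `op_double` are names invented for the example.

```c
#include <stdatomic.h>

/* Generic fetch&op: atomically replaces *location with op(old value)
   and returns the old value. */
int fetch_and_op(atomic_int *location, int (*op)(int)) {
    /* plays the role of LL: read the current value */
    int old = atomic_load_explicit(location, memory_order_relaxed);
    /* plays the role of SC: store op(old) only if nothing changed;
       on failure `old` is refreshed and op is applied to the new value */
    while (!atomic_compare_exchange_weak_explicit(location, &old, op(old),
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed)) {
        /* retry */
    }
    return old;
}

/* Two sample operations: fetch&inc and a made-up fetch&double. */
int op_inc(int v)    { return v + 1; }
int op_double(int v) { return v * 2; }
```

For example, `fetch_and_op(&x, op_inc)` behaves like fetch&inc on `x`. The work done between the read and the conditional store is a pure computation on registers, which is in the spirit of the restriction to undo-able instructions.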

Slide 11: Ticket Lock
LOCK:   LL    reg1, ticket
        add   reg2, reg1, #1
        SC    ticket, reg2
        beqz  LOCK
LOCK1:  load  reg3, LED
        cmp   reg1, reg3
        bnz   LOCK1
        ret
UNLOCK: load  reg1, LED
        inc   reg1
        store LED, reg1
        ret
[Figure: the two shared variables, ticket and LED]

Slide 12: Array-based LOCK
LOCK:   LL    reg1, ticket
        add   reg2, reg1, #1 (mod p)
        SC    ticket, reg2
        beqz  LOCK
        store ptr, reg2
LOCK1:  load  reg3, LED[reg1]
        cmp   reg3, #1
        bnz   LOCK1
        store LED[reg1], #0
        ret
UNLOCK: load  reg1, ptr
        store LED[reg1], #1
        ret
[Figure: the ticket counter and the LED array]

Slide 13: Comments on LL-SC
LL-SC does not generate bus traffic if LL fails
LL-SC does not generate invalidations if SC fails
LL-SC does generate read-miss bus traffic even if SC fails
O(p) traffic per lock acquisition
LL-SC is not a fair lock

Slide 14: Comments on Ticket Lock
Operates like the ticket system at a bank
Every process wanting to acquire the lock takes a ticket number and then busy-waits on a global now-serving number
To release the lock, a process increments the now-serving number
Ticket lock is fair, generates low bus traffic, and uses a small, constant amount of storage
Main problem: When now-serving changes, all processors' cached copies are invalidated, and they all incur a read miss
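
The ticket scheme can be sketched in C11 as below, assuming `atomic_fetch_add` is used to take a ticket (in place of an LL-SC sequence). The struct and function names are hypothetical; `now_serving` is the global number every waiter spins on.

```c
#include <stdatomic.h>

/* Hypothetical ticket lock: two counters, constant storage. */
typedef struct {
    atomic_uint next_ticket;   /* next ticket number to hand out */
    atomic_uint now_serving;   /* ticket currently allowed in */
} ticket_lock;

void ticket_init(ticket_lock *l) {
    atomic_store(&l->next_ticket, 0);
    atomic_store(&l->now_serving, 0);
}

/* Take a ticket, then busy-wait until it is being served.
   Returns the ticket number, only so the caller can inspect it. */
unsigned ticket_acquire(ticket_lock *l) {
    unsigned my_ticket =
        atomic_fetch_add_explicit(&l->next_ticket, 1, memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire)
           != my_ticket) {
        /* spin: all waiters read the same now_serving location */
    }
    return my_ticket;
}

/* Release: increment now-serving. Only the holder writes it, so a plain
   load followed by a store is enough here. */
void ticket_release(ticket_lock *l) {
    unsigned next =
        atomic_load_explicit(&l->now_serving, memory_order_relaxed) + 1;
    atomic_store_explicit(&l->now_serving, next, memory_order_release);
}
```

Tickets are granted in the order they were taken, which is where the fairness comes from. The single store in `ticket_release` is also the write that invalidates every waiter's cached copy.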

Slide 15: Comments on Array-Based Lock
Uses fetch&increment to obtain a unique location on which to busy-wait (not a value)
Lock data structure contains an array of p locations (each in a separate cache block)
Acquire
–Use fetch&increment to obtain next available location in lock array (with wraparound)
Release
–Write unlocked to the next array location
It is fair, uses O(p) space, and is more scalable than ticket lock since only 1 processor read-misses
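
A minimal sketch of the array-based lock, under several assumptions that are not in the slides: a 64-byte cache block (`CACHE_BLOCK`), an upper bound `MAX_P` on the number of processors, and a 64-bit ticket counter reduced modulo p for the wraparound. All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define CACHE_BLOCK 64   /* assumed cache block size in bytes */
#define MAX_P 8          /* assumed upper bound on processors */

/* One waiting location per cache block, so waiters do not share blocks. */
typedef struct { _Alignas(CACHE_BLOCK) atomic_int unlocked; } lock_slot;

typedef struct {
    lock_slot led[MAX_P];
    atomic_ullong ticket;   /* fetch&increment counter */
    unsigned p;             /* number of slots in use */
} array_lock;

void array_lock_init(array_lock *l, unsigned p) {
    assert(p > 0 && p <= MAX_P);
    for (unsigned i = 0; i < MAX_P; i++) {
        atomic_store(&l->led[i].unlocked, 0);
    }
    atomic_store(&l->led[0].unlocked, 1);   /* first arrival may enter */
    atomic_store(&l->ticket, 0);
    l->p = p;
}

/* Acquire: obtain the next slot (with wraparound) and spin on it alone.
   Returns the slot index, which the caller passes to release. */
unsigned array_lock_acquire(array_lock *l) {
    unsigned my = (unsigned)(atomic_fetch_add(&l->ticket, 1) % l->p);
    while (atomic_load_explicit(&l->led[my].unlocked,
                                memory_order_acquire) != 1) {
        /* spin on a location no other waiter is reading */
    }
    atomic_store_explicit(&l->led[my].unlocked, 0, memory_order_relaxed);
    return my;
}

/* Release: write "unlocked" to the next array location. */
void array_lock_release(array_lock *l, unsigned my) {
    atomic_store_explicit(&l->led[(my + 1) % l->p].unlocked, 1,
                          memory_order_release);
}
```

Because each waiter spins on its own block, a release invalidates only the copy held by the next processor in line. The sketch assumes at most p processors contend at once.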

Slide 16: Comparison
Comparative performance: Fig. 5.30
–LL-SC with exponential backoff is best
NOTE
–… if a process holding a lock stops or slows down while it is in its critical section, all other processes may have to wait. [pp. 350-351]
Try to avoid locks
Try to use LL-SC type operations instead of actual locks

Slide 17: 5.5.5. Barriers
Hardware barrier
–Use special bus line and wired-OR
Software barrier
–Use locks, shared counters, and flags
–E.g., refer to p. 354 of text

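Slide 18: Centralized barrier
    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;              /* first arrival resets the flag */
        mycount = ++bar_name.counter;       /* mycount is private; counts this arrival */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                 /* last to arrive */
            bar_name.counter = 0;
            bar_name.flag = 1;              /* release the waiters */
        } else
            while (bar_name.flag == 0) { }  /* busy-wait for release */
    }
Problem with this code?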

Slide 19
Centralized barrier has a potential problem with flag re-initialization
Centralized barrier with sense reversal
    BARRIER (bar_name, p) {
        local_sense = !(local_sense);       /* private to each processor */
        LOCK(bar_name.lock);
        mycount = ++bar_name.counter;       /* counts this arrival */
        if (mycount == p) {                 /* last to arrive */
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;
            bar_name.flag = local_sense;    /* release the waiters */
        } else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) { }
        }
    }
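
A C11 sketch of a sense-reversing barrier follows. It swaps in one technique that differs from the slide: an atomic fetch-and-add on the counter replaces the LOCK/UNLOCK pair. The names `sense_barrier`, `sense_barrier_init`, and `sense_barrier_wait` are hypothetical, and each thread is assumed to keep its own `local_sense`, initially 0.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int counter;   /* arrivals in the current barrier episode */
    atomic_int flag;      /* release flag, compared against local_sense */
    int p;                /* number of participants */
} sense_barrier;

void sense_barrier_init(sense_barrier *b, int p) {
    atomic_store(&b->counter, 0);
    atomic_store(&b->flag, 0);
    b->p = p;
}

/* local_sense points to the caller's private sense variable. */
void sense_barrier_wait(sense_barrier *b, int *local_sense) {
    *local_sense = !*local_sense;          /* flip the private sense */

    /* fetch_add returns the old value, so +1 counts this arrival */
    int mycount = atomic_fetch_add_explicit(&b->counter, 1,
                                            memory_order_acq_rel) + 1;
    if (mycount == b->p) {
        /* last to arrive: reset the counter, then release everyone */
        atomic_store_explicit(&b->counter, 0, memory_order_relaxed);
        atomic_store_explicit(&b->flag, *local_sense, memory_order_release);
    } else {
        while (atomic_load_explicit(&b->flag, memory_order_acquire)
               != *local_sense) {
            /* busy-wait until the flag matches this episode's sense */
        }
    }
}
```

The flag is never reset to a fixed value. Consecutive barrier episodes wait for opposite flag values, so a processor that races ahead into the next episode cannot trap one that is still leaving the previous episode.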

Slide 20: Improving Barrier Performance
Use software combining tree
–With a bus, this has no significant benefit
Use a special bus primitive to reduce the number of bus transactions for read misses in a centralized barrier
–A processor monitors the bus and aborts its read miss if it sees the response to a read miss to the same location (by another processor)

Slide 21: 5.6. Implications for Software
Use details of H/W design to design better, more efficient S/W
–Keep machine fixed and examine how to improve parallel programs
Programmer's Bag of Tricks
–Assign tasks to reduce spatial interleaving of access patterns
–Structure data to reduce spatial interleaving of access patterns
    E.g., 4D arrays instead of 2D arrays for equation solver kernel
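
The 4D-array idea can be illustrated by comparing where one grid element lands in memory under the two layouts. The sketch assumes an n-by-n grid split into square b-by-b blocks with b dividing n, one block per processor; the function names and this particular blocking are assumptions for illustration.

```c
#include <stddef.h>

/* Row-major 2D layout: a b-by-b block owned by one processor is spread
   over b separate rows, interleaved with other processors' data. */
size_t offset_2d(size_t n, size_t r, size_t c) {
    return r * n + c;
}

/* 4D layout A[bi][bj][i][j]: the first two indices select the block,
   the last two the element inside it, so each block is contiguous.
   Assumes b divides n. */
size_t offset_4d(size_t n, size_t b, size_t r, size_t c) {
    size_t nb = n / b;              /* blocks per grid dimension */
    size_t bi = r / b, bj = c / b;  /* which block */
    size_t i = r % b, j = c % b;    /* position inside the block */
    return ((bi * nb + bj) * b + i) * b + j;
}
```

For a 4-by-4 grid with 2-by-2 blocks, the four elements of the top-left block sit at offsets 0, 1, 4, 5 in the 2D layout but at 0, 1, 2, 3 in the 4D layout.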

Slide 22
–Beware of conflict misses
    Figure 5.34
–Sizing dimensions of allocated arrays to powers of 2 is bad
–This is a problem with direct-mapped caches
–Use per-processor heaps
    Heap: reservoir of memory space for a process
–Copy data to increase spatial locality
–Pad arrays
    Refer to Figure 5.36
    Try to avoid false sharing within a cache block
–Determine how to organize arrays of records
    Which data will be used together?
    Refer to Figure 5.36
–Align arrays to cache block boundaries
    Array should begin at cache block boundary
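
Padding and alignment can be sketched with per-processor counters. The 64-byte block size, the count of four processors, and all type and function names are assumptions for illustration; `blocks_spanned` additionally assumes the array begins at a cache block boundary.

```c
#include <stddef.h>

#define CACHE_BLOCK 64   /* assumed cache block size in bytes */
#define NPROCS 4         /* assumed number of processors */

/* Unpadded: neighbouring processors' counters share a cache block,
   so writes by different processors falsely share it. */
typedef struct { long hits; } packed_counter;

/* Padded and aligned: the alignment requirement rounds the struct size
   up to a full block, giving each counter a block of its own. */
typedef struct { _Alignas(CACHE_BLOCK) long hits; } padded_counter;

packed_counter packed[NPROCS];
padded_counter padded[NPROCS];

/* Cache blocks covered by n elements of the given size, for an array
   that starts at a cache block boundary. */
size_t blocks_spanned(size_t elem_size, size_t n) {
    return (elem_size * n + CACHE_BLOCK - 1) / CACHE_BLOCK;
}
```

Here the four packed counters fit in one block, while the padded ones occupy four. The padding spends memory to remove the false sharing.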
