Presentation on theme: "Computer Architecture II 1 Computer architecture II Lecture 9."— Presentation transcript:

1 Computer Architecture II 1 Computer architecture II Lecture 9

2 Computer Architecture II 2 Today Synchronization for shared-memory multiprocessors –test&set, LL-SC, array-based locks –barriers Scalable multiprocessors –What is a scalable machine?

3 Computer Architecture II 3 Synchronization Types of synchronization –mutual exclusion –event synchronization point-to-point group global (barriers) All solutions rely on hardware support for an atomic read-modify-write operation Today we look at synchronization for cache-coherent, bus-based multiprocessors

4 Computer Architecture II 4 Components of a Synchronization Event Acquire method –Acquire right to the synch (e.g. enter critical section) Waiting algorithm –Wait for synch to become available when it isn’t –busy-waiting, blocking, or hybrid Release method –Enable other processors to acquire

5 Computer Architecture II 5 Performance Criteria for Synch. Ops Latency (time per op) –especially important under light contention Bandwidth (ops per sec) –especially important under high contention Traffic –load on critical resources –especially on failed attempts under contention Storage Fairness

6 Computer Architecture II 6 Strawman Lock
lock:    ld  register, location  /* copy location to register */
         cmp register, #0        /* compare with 0 */
         bnz lock                /* if not 0, try again */
         st  location, #1        /* store 1 to mark it locked */
         ret                     /* return control to caller */
unlock:  st  location, #0        /* write 0 to location */
         ret                     /* return control to caller */
Busy-waiting; location is initially 0. Why doesn’t the acquire method work?

7 Computer Architecture II 7 Atomic Instructions Specifies a location, register, & atomic operation –Value in location read into a register –Another value (function of value read or not) stored into location Many variants –Varying degrees of flexibility in second part Simple example: test&set –Value in location read into a specified register –Constant 1 stored into location –Successful if value loaded into register is 0 –Other constants could be used instead of 1 and 0
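The test&set semantics above can be sketched with C11 atomics, where `atomic_exchange` plays the role of the atomic read-modify-write (the function name below is illustrative, not from the slides):

```c
#include <stdatomic.h>
#include <assert.h>

/* Model of test&set: atomically store 1 into *loc and return the old
 * value. The acquire attempt is successful if the value returned is 0. */
static inline int test_and_set(atomic_int *loc) {
    return atomic_exchange(loc, 1);
}
```

Any other pair of constants would work the same way, as the slide notes; 0/1 is just the convention.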

8 Computer Architecture II 8 Simple Test&Set Lock
lock:    t&s register, location  /* atomically read location and set it to 1 */
         bnz lock                /* if not 0, try again */
         ret                     /* return control to caller */
unlock:  st  location, #0        /* write 0 to location */
         ret                     /* return control to caller */
The same lock in pseudocode:
while (not acquired)        /* lock is held by another process */
    test&set(location);     /* try to acquire the lock */
Condition: the architecture supports an atomic test&set –copies location into a register and sets location to 1 Problem: –t&s modifies location in its cache on each acquire attempt => cache block invalidations => bus traffic (especially under high contention)

9 Computer Architecture II 9 T&S Lock Microbenchmark: SGI Challenge Each processor runs: lock; delay(c); unlock;
[Figure: time (μs, 0–20) vs. number of processors (1–15) for four cases: test&set with c = 0; test&set with exponential backoff, c = 3.64; test&set with exponential backoff, c = 0; ideal]
Why does performance degrade? –Bus transactions on t&s

10 Computer Architecture II 10 Other read-modify-write primitives Fetch&op –atomically read, modify (using operation op), and write a memory location –e.g. fetch&add, fetch&incr Compare&swap –three operands: a location, a register to compare with, and a register to swap with

11 Computer Architecture II 11 Enhancements to Simple Lock Problem of t&s: many invalidations if the lock cannot be taken. Reduce the frequency of issuing test&sets while waiting –Test&set lock with exponential backoff
i = 0;
while (!acquired) {         /* lock is held by another process */
    test&set(location);
    if (!acquired) {        /* test&set didn’t succeed */
        wait(t_i);          /* sleep for a time t_i that grows with i */
        i++;
    }
}
Fewer invalidations, but a processor may wait longer than necessary.
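The backoff loop above can be sketched with C11 atomics; `atomic_exchange` stands in for t&s, and the delay loop and cap below are illustrative choices, not part of the slide:

```c
#include <stdatomic.h>
#include <assert.h>

/* test&set lock with exponential backoff (sketch). Each failed
 * attempt roughly doubles the wait before retrying. */
static inline void backoff_lock(atomic_int *loc) {
    unsigned delay = 1;
    while (atomic_exchange(loc, 1) != 0) {          /* t&s failed: lock held */
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                        /* spin for a while */
        if (delay < 1024) delay <<= 1;               /* exponential backoff */
    }
}

static inline void backoff_unlock(atomic_int *loc) {
    atomic_store(loc, 0);                            /* release: write 0 */
}
```

With no contention the first exchange succeeds immediately, so the backoff path adds no latency.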

12 Computer Architecture II 12 T&S Lock Microbenchmark: SGI Challenge Each processor runs: lock; delay(c); unlock;
[Figure: time (μs, 0–20) vs. number of processors (1–15) for four cases: test&set with c = 0; test&set with exponential backoff, c = 3.64; test&set with exponential backoff, c = 0; ideal]
Why does performance degrade? –Bus transactions on t&s

13 Computer Architecture II 13 Enhancements to Simple Lock Reduce the frequency of issuing test&sets while waiting –Test-and-test&set lock
while (!acquired) {         /* lock is held by another process */
    if (location == 1)      /* test with an ordinary load */
        continue;
    test&set(location);
    if (acquired)           /* succeeded */
        break;
}
Keep testing with an ordinary load –just a hint: the cached lock variable will be invalidated when a release occurs –if location becomes 0, use t&s to modify the variable atomically –on failure, start over. Further reduces bus transactions –the load produces bus traffic only when the lock is released –t&s produces bus traffic each time it is executed
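The same test-and-test&set structure in compilable form, again using `atomic_exchange` as the t&s (a sketch; the relaxed load models the "ordinary load" that spins in the local cache):

```c
#include <stdatomic.h>
#include <assert.h>

/* test-and-test&set: spin with ordinary loads, and attempt the
 * expensive atomic t&s only when the lock looks free. */
static inline void ttas_lock(atomic_int *loc) {
    for (;;) {
        while (atomic_load_explicit(loc, memory_order_relaxed) != 0)
            ;                                  /* spin in the local cache */
        if (atomic_exchange(loc, 1) == 0)      /* t&s once it looks free */
            return;                            /* acquired */
    }
}

static inline void ttas_unlock(atomic_int *loc) {
    atomic_store(loc, 0);
}
```

While the lock is held, waiters generate no bus traffic at all; the invalidation on release is what wakes them up to retry.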

14 Computer Architecture II 14 Lock performance
Lock | Latency | Bus traffic | Scalability | Storage | Fairness
t&s | low contention: low; high contention: high | a lot | poor | low (does not increase with processor number) | no
t&s with backoff | low contention: low (as t&s); high contention: high | less than t&s | better than t&s | low (does not increase with processor number) | no
t&t&s | low contention: low, a little higher than t&s; high contention: high | less than t&s and t&s with backoff | better than t&s and t&s with backoff | low (does not increase with processor number) | no

15 Computer Architecture II 15 Improved Hardware Primitives: LL-SC Goals (test&set generates a lot of bus traffic): –failed read-modify-write attempts should not generate invalidations –nice if a single primitive can implement a range of r-m-w operations Load-Locked (or -Linked), Store-Conditional –LL reads the variable into a register –work on the value in the register –SC tries to store the register back to the location –it succeeds if and only if there has been no other write to the variable since this processor’s LL (indicated by a condition flag) If SC succeeds, all three steps happened atomically If it fails, it doesn’t write or generate invalidations –must retry the acquire

16 Computer Architecture II 16 Simple Lock with LL-SC
lock:    ll   reg1, location   /* LL location into reg1 */
         bnz  reg1, lock       /* lock already held: try again */
         sc   location, reg2   /* SC reg2 (holding 1) into location */
         beqz reg2, lock       /* if SC failed, start again */
         ret
unlock:  st   location, #0     /* write 0 to location */
         ret
Can simulate the atomic ops t&s, fetch&op, compare&swap by changing what’s between LL & SC (exercise) –only a couple of instructions, so SC is likely to succeed –don’t include instructions that would need to be undone (e.g. stores) SC can fail (without putting a transaction on the bus) if: –it detects an intervening write even before trying to get the bus –it tries to get the bus but another processor’s SC gets the bus first LL and SC are not lock and unlock respectively –they only guarantee no conflicting write to the lock variable between them –but they can be used directly to implement simple operations on shared variables
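C has no portable LL/SC, but a `atomic_compare_exchange_weak` retry loop follows the same read / compute / conditional-store / retry shape. As an illustrative sketch (not the slide's code), here is fetch&add built that way:

```c
#include <stdatomic.h>
#include <assert.h>

/* fetch&add built from a compare-and-swap retry loop -- the same
 * pattern the slide describes for simulating r-m-w ops with LL/SC. */
static inline int cas_fetch_add(atomic_int *loc, int delta) {
    int old = atomic_load(loc);            /* "LL": read current value */
    /* "SC": store old+delta only if *loc still equals old; on failure
     * the call refreshes old with the current value and we retry. */
    while (!atomic_compare_exchange_weak(loc, &old, old + delta))
        ;
    return old;                            /* value before the add */
}
```

As with LL/SC, only a couple of instructions sit between the read and the conditional store, so the retry loop usually succeeds quickly.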

17 Computer Architecture II 17 Advanced lock algorithms Problems with the approaches presented so far –unfair: the order of arrival does not count –all processors try to acquire the lock when it is released –many processors may incur a read miss when the lock is released Desirable: only one miss

18 Computer Architecture II 18 Ticket Lock Draw a ticket with a number, wait until that number is shown Two counters per lock (next_ticket, now_serving) –Acquire: fetch&inc next_ticket; wait until now_serving equals the ticket drawn atomic op when arriving at the lock, not when it’s free (so less contention) –Release: increment now_serving Performance –low latency under low contention –O(p) read misses at release, since all spin on the same variable –like the simple LL-SC lock, but no invalidation when the acquire succeeds, and fair (FIFO order)
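The two-counter scheme can be sketched directly with an atomic fetch&inc (names are illustrative):

```c
#include <stdatomic.h>
#include <assert.h>

typedef struct {
    atomic_int next_ticket;   /* fetch&inc'd by each arriving process */
    atomic_int now_serving;   /* incremented on release */
} ticket_lock_t;

static inline void ticket_acquire(ticket_lock_t *l) {
    int my = atomic_fetch_add(&l->next_ticket, 1);   /* draw a ticket */
    while (atomic_load(&l->now_serving) != my)
        ;                                            /* wait for my turn */
}

static inline void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);            /* serve the next ticket */
}
```

The single atomic op happens on arrival; the wait itself is ordinary loads, and release order is exactly arrival order.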

19 Computer Architecture II 19 Array-based Queuing Locks Waiting processes poll on different locations in an array of size p –Acquire: fetch&inc to obtain the address on which to spin (the next array element); ensure that these addresses are in different cache lines or memories –Release: set the next location in the array, waking up the process spinning on it –O(1) traffic per acquire with coherent caches –FIFO ordering, as in the ticket lock –but O(p) space per lock –not so great for non-cache-coherent machines with distributed memory: the array location I spin on is not necessarily in my local memory
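A minimal sketch of the array lock in C11 atomics. The processor count, padding size, and slot layout are illustrative assumptions; the padding keeps each flag in its own cache line, as the slide requires:

```c
#include <stdatomic.h>
#include <assert.h>

#define NPROCS 4   /* p, illustrative */
#define LINE   64  /* assumed cache-line size */

typedef struct {
    struct {
        atomic_int must_wait;                    /* 1 = keep spinning */
        char pad[LINE - sizeof(atomic_int)];     /* one flag per line */
    } slot[NPROCS];
    atomic_int next;                             /* fetch&inc'd on acquire */
} array_lock_t;

/* Returns my slot index; the caller passes it back to release. */
static int array_acquire(array_lock_t *l) {
    int me = atomic_fetch_add(&l->next, 1) % NPROCS;
    while (atomic_load(&l->slot[me].must_wait))
        ;                                        /* spin on my own location */
    atomic_store(&l->slot[me].must_wait, 1);     /* re-arm slot for reuse */
    return me;
}

static void array_release(array_lock_t *l, int me) {
    atomic_store(&l->slot[(me + 1) % NPROCS].must_wait, 0);  /* wake next */
}
```

Only the releasing processor's single write touches the next waiter's line, which is the O(1)-traffic property the slide claims.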

20 Computer Architecture II 20 Lock performance
Lock | Latency | Bus traffic | Scalability | Storage | Fairness
t&s | low contention: low; high contention: high | a lot | poor | O(1) | no
t&s with backoff | low contention: low (as t&s); high contention: high | less than t&s | better than t&s | O(1) | no
t&t&s | low contention: low, a little higher than t&s; high contention: high | less: no traffic while waiting | better than t&s with backoff | O(1) | no
ll/sc | low contention: low; high contention: better than t&t&s | like t&t&s, plus no traffic on a missed attempt | better than t&t&s | O(1) | no
ticket | low contention: low; high contention: better than ll/sc | a little less than ll/sc | like ll/sc | O(1) | yes (FIFO)
array | low contention: low, like t&t&s; high contention: better than ticket | less than ticket | more scalable than ticket (only one processor incurs the miss) | O(p) | yes (FIFO)

21 Computer Architecture II 21 Point-to-Point Event Synchronization Software methods: –busy-waiting: use ordinary variables as flags –blocking: semaphores –interrupts Full hardware support: a full-empty bit with each word in memory –set when the word is “full” with newly produced data (i.e. when written) –unset when the word is “empty” due to being consumed (i.e. when read) –natural for word-level producer-consumer synchronization producer: write if empty, set to full; consumer: read if full, set to empty –hardware preserves read/write atomicity –problem: inflexibility (e.g. multiple consumers, or multiple updates by a producer)
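The producer/consumer rule above can be modeled in software. This is only a single-threaded model of the semantics (in hardware the extra bit lives beside each memory word, and a blocked party would stall or spin rather than return 0):

```c
#include <assert.h>

/* Software model of a full-empty word; names are illustrative. */
typedef struct {
    int full;    /* the full-empty bit */
    int value;   /* the data word itself */
} fe_word_t;

/* Producer: write only if empty, then mark full. Returns 1 on success. */
static int fe_write(fe_word_t *w, int v) {
    if (w->full) return 0;    /* word full: producer would wait */
    w->value = v;
    w->full = 1;
    return 1;
}

/* Consumer: read only if full, then mark empty. Returns 1 on success. */
static int fe_read(fe_word_t *w, int *out) {
    if (!w->full) return 0;   /* word empty: consumer would wait */
    *out = w->value;
    w->full = 0;
    return 1;
}
```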

22 Computer Architecture II 22 Barriers Hardware barriers –Wired-AND line separate from address/data bus Set input 1 when arrive, wait for output to be 1 to leave –Useful when barriers are global and very frequent –Difficult to support arbitrary subset of processors even harder with multiple processes per processor –Difficult to dynamically change number and identity of participants e.g. latter due to process migration –Not common today on bus-based machines Software algorithms implemented using locks, flags, counters

23 Computer Architecture II 23 A Simple Centralized Barrier
struct bar_type { int counter; struct lock_type lock; int flag = 0; } bar_name;

BARRIER(bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
        bar_name.flag = 0;             /* reset flag if first to arrive */
    mycount = bar_name.counter++;      /* mycount is private */
    UNLOCK(bar_name.lock);
    if (mycount == p-1) {              /* last to arrive */
        bar_name.counter = 0;          /* reset for next barrier */
        bar_name.flag = 1;             /* release waiters */
    } else
        while (bar_name.flag == 0) {}; /* busy-wait for release */
}
A shared counter maintains the number of processes that have arrived –incremented on arrival (under the lock), checked until it reaches the number of processes –Problem?

24 Computer Architecture II 24 A Working Centralized Barrier Consecutively entering the same barrier doesn’t work –must prevent a process from entering until all have left the previous instance –could use another counter, but that increases latency and contention Sense reversal: wait for the flag to take a different value in consecutive instances –toggle this value only when all processes have reached the barrier
BARRIER(bar_name, p) {
    local_sense = !(local_sense);        /* toggle private sense variable */
    LOCK(bar_name.lock);
    mycount = bar_name.counter++;        /* mycount is private */
    if (bar_name.counter == p) {         /* last to arrive */
        UNLOCK(bar_name.lock);
        bar_name.counter = 0;
        bar_name.flag = local_sense;     /* release waiters */
    } else {
        UNLOCK(bar_name.lock);
        while (bar_name.flag != local_sense) {};
    }
}
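The same sense-reversing barrier can be sketched with C11 atomics; here an atomic fetch&inc replaces the LOCK/UNLOCK pair around the counter (an illustrative simplification, not the slide's exact code), and local_sense remains per-thread state:

```c
#include <stdatomic.h>
#include <assert.h>

typedef struct {
    atomic_int counter;   /* number of arrivals in this instance */
    atomic_int flag;      /* release flag, compared against local sense */
} bar_t;

/* Sense-reversing centralized barrier; *local_sense is per-thread. */
static void barrier(bar_t *b, int p, int *local_sense) {
    *local_sense = !*local_sense;                 /* toggle private sense */
    if (atomic_fetch_add(&b->counter, 1) + 1 == p) {  /* last to arrive */
        atomic_store(&b->counter, 0);             /* reset for next instance */
        atomic_store(&b->flag, *local_sense);     /* release waiters */
    } else {
        while (atomic_load(&b->flag) != *local_sense)
            ;                                     /* busy-wait for release */
    }
}
```

Because the counter is reset before the flag is flipped, no thread can re-enter and increment a stale counter, which is exactly the reuse problem sense reversal solves.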

25 Computer Architecture II 25 Centralized Barrier Performance Latency –critical path length at least proportional to p (the accesses to the critical region are serialized by the lock) Traffic –p bus transactions to obtain the lock –p bus transactions to modify the counter –2 bus transactions for the last processor to reset the counter and release the waiting processes –p-1 bus transactions for the first p-1 processors to read the flag Storage cost –very low: a centralized counter and flag Fairness –the same processor should not always be last to exit the barrier Key problems for the centralized barrier are latency and traffic –especially with distributed memory, where all traffic goes to the same node

26 Computer Architecture II 26 Improved Barrier Algorithms for a Bus Software combining tree: only k processors access the same location, where k is the degree of the tree (e.g. k = 2) –separate arrival and exit trees, and use sense reversal –valuable in a distributed network: communication proceeds along different paths –on a bus, all traffic goes on the same bus, and there is no less total traffic –higher latency (log p steps of work, and O(p) serialized bus transactions) –the advantage on a bus is the use of ordinary reads/writes instead of locks

27 Computer Architecture II 27 Scalable Multiprocessors

28 Computer Architecture II 28 Scalable Machines Scalability: the capability of a system to grow by adding processors, memory, and I/O devices Four important aspects of scalability –bandwidth increases with the number of processors –latency does not increase, or increases slowly –cost increases slowly with the number of processors –physical placement of resources

29 Computer Architecture II 29 Limited Scaling of a Bus Small configurations are cost-effective
Characteristic | Bus
Physical length | ~ 1 ft
Number of connections | fixed
Maximum bandwidth | fixed
Interface to comm. medium | extended memory interface
Global order | arbitration
Protection | virtual -> physical
Trust | total
OS | single
Comm. abstraction | HW

30 Computer Architecture II 30 Workstations in a LAN? No clear limit to physical scaling, little trust, no global order; independent failure and restart
Characteristic | Bus | LAN
Physical length | ~ 1 ft | km
Number of connections | fixed | many
Maximum bandwidth | fixed | ???
Interface to comm. medium | memory interface | peripheral
Global order | arbitration | ???
Protection | virtual -> physical | OS
Trust | total | none
OS | single | independent
Comm. abstraction | HW | SW

31 Computer Architecture II 31 Bandwidth Scalability Bandwidth limitation: single set of wires Must have many independent wires (remember bisection width?) => switches

32 Computer Architecture II 32 Dancehall MP Organization Network bandwidth demand: scales linearly with number of processors Latency: Increases with number of stages of switches (remember butterfly?) –Adding local memory would offer fixed latency

33 Computer Architecture II 33 Generic Distributed Memory Multiprocessor Most common structure

34 Computer Architecture II 34 Bandwidth scaling requirements Large number of independent communication paths between nodes: large number of concurrent transactions using different wires Independent transactions No global arbitration Effect of a transaction only visible to the nodes involved –Broadcast difficult (was easy for bus): additional transactions needed

35 Computer Architecture II 35 Latency Scaling T(n) = Overhead + ChannelTime(n) + RoutingDelay(h, n) + ContentionTime –Overhead: processing time to initiate and complete a transfer –ChannelTime(n) = n/B (channel occupancy for n bytes at bandwidth B) –RoutingDelay(h, n): delay through h hops (switches)
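The latency model above is easy to evaluate directly; all parameter values below are illustrative, not measurements from the slides:

```c
#include <assert.h>

/* T(n) = Overhead + ChannelTime(n) + RoutingDelay(h,n) + Contention.
 * Units: microseconds; bandwidth in bytes per microsecond. Per-hop
 * delay is modeled as a constant (an assumption of this sketch). */
static double transfer_time(double overhead_us, double n_bytes,
                            double bandwidth_b_per_us, int hops,
                            double per_hop_us, double contention_us) {
    double channel = n_bytes / bandwidth_b_per_us;  /* channel occupancy */
    double routing = hops * per_hop_us;             /* RoutingDelay(h,n) */
    return overhead_us + channel + routing + contention_us;
}
```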

36 Computer Architecture II 36 Cost Scaling Cost(p,m) = fixed cost + incremental cost (p,m) Bus Based SMP –Add more processors and memory Scalable machines –processors, memory, network Parallel efficiency(p) = Speedup(p) / p Costup(p) = Cost(p) / Cost(1) Cost-effective: Speedup(p) > Costup(p)
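The cost-effectiveness criterion above reduces to a one-line comparison:

```c
#include <assert.h>

/* A parallel machine is cost-effective when
 * Speedup(p) > Costup(p) = Cost(p) / Cost(1). */
static int cost_effective(double speedup_p, double cost_p, double cost_1) {
    return speedup_p > cost_p / cost_1;
}
```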

37 Computer Architecture II 37 Cost Effective? 2048 processors: 475-fold speedup at 206 times the cost => cost-effective, since Speedup(p) > Costup(p)

38 Computer Architecture II 38 Physical Scaling –Chip-level integration –Board-level integration –System-level integration

39 Computer Architecture II 39 Chip-level integration: nCUBE/2 Network integrated onto the processor chip –14 bidirectional links => up to 8192 nodes –entire machine synchronous at 40 MHz
[Figure: single-chip node (DRAM interface, DMA channels, router, MMU, instruction fetch & decode, 64-bit integer unit, IEEE floating point, operand cache, execution unit), the basic module, and a 1024-node hypercube network configuration]

40 Computer Architecture II 40 Board level integration: CM-5 Use standard microprocessor components Scalable network interconnect

41 Computer Architecture II 41 System-Level Integration Loose packaging: IBM SP2, cluster blades

