
1 Computer Architecture II, Lecture 10

2 Today
– Synchronization for SMPs: test&set, LL and SC, array-based locks, barriers
– Scalable multiprocessors: what is a scalable machine?

3 Synchronization
Types of synchronization
– Mutual exclusion
– Event synchronization: point-to-point, group, global (barriers)
All solutions rely on hardware support for an atomic read-modify-write operation
Today we look at synchronization for cache-coherent, bus-based multiprocessors

4 Components of a Synchronization Event
Acquire method
– Acquire the right to the synch (e.g. enter the critical section)
Waiting algorithm
– Wait for the synch to become available when it isn't
– Busy-waiting, blocking, or hybrid
Release method
– Enable other processors to acquire

5 Performance Criteria for Synch. Ops
Latency (time per op)
– especially under light contention
Bandwidth (ops per sec)
– especially under high contention
Traffic
– load on critical resources
– especially on failures under contention
Storage
Fairness

6 Strawman Lock
lock:   ld  register, location  /* copy location to register */
        cmp location, #0        /* compare with 0 */
        bnz lock                /* if not 0, try again */
        st  location, #1        /* store 1 to mark it locked */
        ret                     /* return control to caller */
unlock: st  location, #0        /* write 0 to location */
        ret                     /* return control to caller */
Busy-waiting; location is initially 0
Why doesn't the acquire method work? The load, test, and store are separate instructions: two processors can both read 0 and both enter the critical section.

7 Atomic Instructions
Specify a location, a register, and an atomic operation
– Value in location read into a register
– Another value (possibly a function of the value read) stored into location
Many variants
– Varying degrees of flexibility in the second part
Simple example: test&set
– Value in location read into a specified register
– Constant 1 stored into location
– Successful if the value loaded into the register is 0
– Other constants could be used instead of 1 and 0

8 Simple Test&Set Lock
lock:   t&s register, location
        bnz lock            /* if not 0, try again */
        ret                 /* return control to caller */
unlock: st  location, #0    /* write 0 to location */
        ret                 /* return control to caller */
The same lock in pseudocode:
while (not acquired)        /* lock is held by another processor */
    test&set(location);     /* try to acquire the lock */
Condition: the architecture supports an atomic test and set
– Copy location to register and set location to 1
Problem:
– t&s modifies location in its cache on every attempt to acquire the lock => cache block invalidations => bus traffic (especially under high contention)
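The t&s loop above can be sketched in C. This assumes a GCC-style compiler with the `__sync` atomic builtins (not something the lecture prescribes): `__sync_lock_test_and_set` atomically stores 1 and returns the old value, so a return of 0 means the lock was free.

```c
#include <pthread.h>

static volatile int lock_var = 0;
static int counter = 0;

/* test&set lock: every attempt writes the lock's cache line,
 * which is exactly the invalidation traffic the slide describes */
static void tas_lock(volatile int *l) {
    while (__sync_lock_test_and_set(l, 1) != 0)
        ;                          /* spin until the old value was 0 */
}

static void tas_unlock(volatile int *l) {
    __sync_lock_release(l);        /* atomic store of 0 */
}

static void *tas_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        tas_lock(&lock_var);
        counter++;                 /* critical section */
        tas_unlock(&lock_var);
    }
    return 0;
}

/* returns nthreads * 10000 iff mutual exclusion held */
int run_tas_demo(int nthreads) {
    pthread_t t[16];
    counter = 0;
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], 0, tas_worker, 0);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], 0);
    return counter;
}
```

On a real SMP every spin iteration writes the line; the variants on the following slides exist to reduce exactly that traffic.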

9 T&S Lock Microbenchmark: SGI Challenge
Each processor runs: lock; delay(c); unlock;
[Figure: time (us, 0-20) vs. number of processors (up to 16) for: test&set, c = 0; test&set with exponential backoff, c = 3.64; test&set with exponential backoff, c = 0; ideal]
Why does performance degrade?
– Bus transactions on t&s

10 Other Read-Modify-Write Primitives
Fetch&op
– Atomically read a memory location, modify it (with operation op), and write it back
– E.g. fetch&add, fetch&incr
Compare&swap
– Three operands: a location, a register to compare with, a register to swap with
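For illustration, fetch&add can be built out of compare&swap; this sketch uses GCC's `__sync_bool_compare_and_swap` builtin (a compiler-specific assumption):

```c
/* fetch&add from compare&swap: read the old value, try to install
 * old+amount, and retry if another processor changed the location
 * in between */
static int fetch_and_add(volatile int *loc, int amount) {
    for (;;) {
        int old = *loc;
        if (__sync_bool_compare_and_swap(loc, old, old + amount))
            return old;            /* value before the add */
    }
}

static volatile int fa_cell = 5;

/* small single-threaded demo: returns old_value * 100 + new_value */
int fa_demo(void) {
    int old = fetch_and_add(&fa_cell, 3);
    return old * 100 + fa_cell;
}
```

The retry loop is why compare&swap is the most flexible of these primitives: any read-modify-write can be expressed this way.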

11 Enhancements to Simple Lock
Problem of t&s: lots of invalidations if the lock cannot be taken
Reduce the frequency of issuing test&sets while waiting
– Test&set lock with exponential backoff
i = 0;
while (!acquired) {            /* lock is held by another processor */
    acquired = test&set(location);
    if (!acquired) {           /* test&set didn't succeed */
        wait(t_i);             /* sleep for a time that grows with i */
        i++;
    }
}
Fewer invalidations
May wait longer than necessary
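A hedged C sketch of the backoff loop, again on top of the GCC builtins; the base delay, growth factor, and cap are made-up illustration values (real systems tune these and often randomize the delay):

```c
#include <pthread.h>
#include <unistd.h>

static volatile int bk_lock = 0;
static int bk_counter = 0;

/* test&set with exponential backoff: roughly double the sleep
 * after each failed attempt, up to a cap */
static void backoff_lock(volatile int *l) {
    unsigned delay_us = 1;             /* assumed base delay */
    while (__sync_lock_test_and_set(l, 1) != 0) {
        usleep(delay_us);
        if (delay_us < 1024)
            delay_us *= 2;             /* exponential growth, capped */
    }
}

static void backoff_unlock(volatile int *l) { __sync_lock_release(l); }

static void *bk_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        backoff_lock(&bk_lock);
        bk_counter++;
        backoff_unlock(&bk_lock);
    }
    return 0;
}

int run_backoff_demo(int n) {
    pthread_t t[8];
    bk_counter = 0;
    for (int i = 0; i < n; i++) pthread_create(&t[i], 0, bk_worker, 0);
    for (int i = 0; i < n; i++) pthread_join(t[i], 0);
    return bk_counter;
}
```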

12 T&S Lock Microbenchmark: SGI Challenge (repeated)
Each processor runs: lock; delay(c); unlock;
[Figure: same graph as slide 9: test&set with and without exponential backoff vs. the ideal curve]
Why does performance degrade?
– Bus transactions on t&s

13 Enhancements to Simple Lock
Reduce the frequency of issuing test&sets while waiting
– Test-and-test&set lock
while (!acquired) {           /* lock is held by another processor */
    if (location == 1)        /* test with an ordinary load */
        continue;
    else {
        acquired = test&set(location);
        if (acquired)         /* succeeded */
            break;
    }
}
Keep testing with an ordinary load
– Just a hint: the cached lock variable will be invalidated when a release occurs
– If location becomes 0, use t&s to modify the variable atomically
– On failure, start over
Further reduces bus transactions
– The load produces bus traffic only when the lock is released
– t&s produces bus traffic each time it is executed
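The test-and-test&set idea in the same C sketch style (GCC builtins again assumed): spin with ordinary loads that hit in the cache, and issue the expensive t&s only when the lock looks free.

```c
#include <pthread.h>

static volatile int tt_lock = 0;
static int tt_counter = 0;

static void ttas_lock(volatile int *l) {
    for (;;) {
        while (*l != 0)
            ;                      /* ordinary load: hits in cache */
        if (__sync_lock_test_and_set(l, 1) == 0)
            return;                /* t&s only when lock looked free */
    }
}

static void ttas_unlock(volatile int *l) { __sync_lock_release(l); }

static void *tt_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        ttas_lock(&tt_lock);
        tt_counter++;
        ttas_unlock(&tt_lock);
    }
    return 0;
}

int run_ttas_demo(int n) {
    pthread_t t[8];
    tt_counter = 0;
    for (int i = 0; i < n; i++) pthread_create(&t[i], 0, tt_worker, 0);
    for (int i = 0; i < n; i++) pthread_join(t[i], 0);
    return tt_counter;
}
```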

14 Lock Performance
t&s
– Latency: low under low contention; high under high contention
– Bus traffic: a lot
– Scalability: poor
– Storage: low (does not increase with the number of processors)
– Fairness: no
t&s with backoff
– Latency: low under low contention (as t&s with no contention); high under high contention
– Bus traffic: less than t&s
– Scalability: better than t&s
– Storage: low (does not increase with the number of processors)
– Fairness: no
t&t&s
– Latency: low under low contention, a little higher than t&s; high under high contention
– Bus traffic: less than t&s and t&s with backoff
– Scalability: better than t&s and t&s with backoff
– Storage: low (does not increase with the number of processors)
– Fairness: no

15 Improved Hardware Primitives: LL-SC
Goals
– Avoid the bus traffic that test&set generates
– Failed read-modify-write attempts should not generate invalidations
– Nice if a single primitive can implement a range of r-m-w operations
Load-Locked (or -Linked), Store-Conditional
– LL reads the variable into a register
– Work on the value in the register
– SC tries to store the register back to the location
– SC succeeds if and only if there has been no other write to the variable since this processor's LL; success is indicated by a condition flag
If SC succeeds, all three steps happened atomically
If SC fails, it doesn't write or generate invalidations
– The acquire must be retried

16 Simple Lock with LL-SC
lock:   ll   reg1, location   /* LL location into reg1 */
        bnz  reg1, lock       /* lock already held: try again */
        sc   location, reg2   /* SC reg2 (holding 1) into location */
        beqz reg2, lock       /* if SC failed, start again */
        ret
unlock: st location, #0       /* write 0 to location */
        ret
Can simulate the atomic ops t&s, fetch&op, compare&swap by changing what's between LL and SC (exercise)
– Only a couple of instructions, so SC is likely to succeed
– Don't include instructions that would need to be undone (e.g. stores)
SC can fail (without putting a transaction on the bus) if:
– It detects an intervening write even before trying to get the bus
– It tries to get the bus but another processor's SC gets the bus first
LL and SC are not lock and unlock respectively
– They only guarantee no conflicting write to the lock variable between them
– But they can be used directly to implement simple operations on shared variables
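C has no LL/SC, but C11's compare-exchange gives a similar "the store succeeds only if nothing changed" effect for a lock; a sketch (an emulation, not the lecture's assembly):

```c
#include <stdatomic.h>
#include <pthread.h>

static atomic_int cs_lock = 0;
static int cs_counter = 0;

/* load the location (like LL), then try to install 1 only if it is
 * still 0 (like SC failing on an intervening write) */
static void cas_lock(atomic_int *l) {
    for (;;) {
        while (atomic_load(l) != 0)
            ;                              /* like the bnz after LL */
        int expected = 0;
        if (atomic_compare_exchange_weak(l, &expected, 1))
            return;                        /* the "SC" succeeded */
    }
}

static void cas_unlock(atomic_int *l) { atomic_store(l, 0); }

static void *cs_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        cas_lock(&cs_lock);
        cs_counter++;
        cas_unlock(&cs_lock);
    }
    return 0;
}

int run_cas_demo(int n) {
    pthread_t t[8];
    cs_counter = 0;
    for (int i = 0; i < n; i++) pthread_create(&t[i], 0, cs_worker, 0);
    for (int i = 0; i < n; i++) pthread_join(t[i], 0);
    return cs_counter;
}
```

Note the difference from real LL/SC: compare-exchange checks the value, while SC checks for any intervening write (so CAS can suffer the ABA problem in more general uses; for a 0/1 lock it does not matter).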

17 Advanced Lock Algorithms
Problems with the approaches presented so far
– Unfair: the order of arrival does not matter
– All processors try to acquire the lock when it is released
– Several processors may incur a read miss when the lock is released
Desirable: only one miss

18 Ticket Lock
Draw a ticket with a number, wait until that number is shown
Two counters per lock (next_ticket, now_serving)
– Acquire: my_ticket = fetch&inc(next_ticket); wait until now_serving == my_ticket
  atomic op when arriving at the lock, not when it's free (so less contention)
– Release: increment now_serving
Performance
– Low latency under low contention
– O(p) read misses at release, since all spin on the same variable
– FIFO order
– Like the simple LL-SC lock, but no invalidation when SC succeeds, and fair
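The two counters map directly onto code; a C sketch with `__sync_fetch_and_add` standing in for fetch&inc (a GCC assumption):

```c
#include <pthread.h>

typedef struct {
    volatile unsigned next_ticket;   /* fetch&inc'd on acquire */
    volatile unsigned now_serving;   /* incremented on release */
} ticket_lock_t;

static ticket_lock_t tl = { 0, 0 };
static int tk_counter = 0;

static void ticket_acquire(ticket_lock_t *t) {
    unsigned my = __sync_fetch_and_add(&t->next_ticket, 1); /* draw ticket */
    while (t->now_serving != my)
        ;                            /* spin until my number is shown */
}

static void ticket_release(ticket_lock_t *t) {
    __sync_synchronize();            /* order CS writes before release */
    t->now_serving++;                /* only the holder writes this */
}

static void *tk_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        ticket_acquire(&tl);
        tk_counter++;
        ticket_release(&tl);
    }
    return 0;
}

int run_ticket_demo(int n) {
    pthread_t t[8];
    tk_counter = 0;
    for (int i = 0; i < n; i++) pthread_create(&t[i], 0, tk_worker, 0);
    for (int i = 0; i < n; i++) pthread_join(t[i], 0);
    return tk_counter;
}
```

The non-atomic `now_serving++` is safe because only the current lock holder ever writes it.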

19 Array-Based Queuing Locks
Waiting processes poll on different locations in an array of size p
– Acquire
  fetch&inc to obtain the address on which to spin (next array element)
  ensure that these addresses are in different cache lines or memories
– Release
  set the next location in the array, waking up the process spinning on it
– O(1) traffic per acquire with coherent caches
– FIFO ordering, as in the ticket lock, but O(p) space per lock
– Not so great for non-cache-coherent machines with distributed memory
  the array location I spin on is not necessarily in my local memory
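A C sketch in the style of Anderson's array lock. Assumptions worth flagging: the `SLOTS` constant must be at least the number of threads that can wait, and the `pad` field is a crude stand-in for placing each flag in its own cache line.

```c
#include <pthread.h>

#define SLOTS 8                      /* must be >= number of waiters */

typedef struct {
    volatile unsigned next;          /* fetch&inc'd slot counter */
    struct { volatile int go; char pad[60]; } slot[SLOTS]; /* own lines */
} array_lock_t;

static array_lock_t al;
static int aq_counter = 0;

static unsigned array_acquire(array_lock_t *a) {
    unsigned s = __sync_fetch_and_add(&a->next, 1) % SLOTS;
    while (a->slot[s].go == 0)
        ;                            /* each waiter spins on its own line */
    a->slot[s].go = 0;               /* reset my slot for its next use */
    return s;
}

static void array_release(array_lock_t *a, unsigned s) {
    __sync_synchronize();            /* order CS writes before handoff */
    a->slot[(s + 1) % SLOTS].go = 1; /* hand the lock to the next slot */
}

static void *aq_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        unsigned s = array_acquire(&al);
        aq_counter++;
        array_release(&al, s);
    }
    return 0;
}

int run_array_demo(int n) {
    pthread_t t[4];
    al.next = 0;
    for (int i = 0; i < SLOTS; i++) al.slot[i].go = 0;
    al.slot[0].go = 1;               /* first arrival proceeds */
    aq_counter = 0;
    for (int i = 0; i < n; i++) pthread_create(&t[i], 0, aq_worker, 0);
    for (int i = 0; i < n; i++) pthread_join(t[i], 0);
    return aq_counter;
}
```

A release writes only the next waiter's line, which is the O(1)-traffic property the slide claims.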

20 Lock Performance
t&s
– Latency: low under low contention; high under high contention
– Bus traffic: a lot
– Scalability: poor
– Storage: O(1)
– Fairness: no
t&s with backoff
– Latency: low under low contention (as t&s); high under high contention
– Bus traffic: less than t&s
– Scalability: better than t&s
– Storage: O(1)
– Fairness: no
t&t&s
– Latency: low under low contention, a little higher than t&s; high under high contention
– Bus traffic: less (no traffic while waiting)
– Scalability: better than t&s with backoff
– Storage: O(1)
– Fairness: no
ll/sc
– Latency: low under low contention; under high contention better than t&t&s
– Bus traffic: like t&t&s, plus no traffic on a missed attempt
– Scalability: better than t&t&s
– Storage: O(1)
– Fairness: no
ticket
– Latency: low under low contention; under high contention better than ll/sc
– Bus traffic: a little less than ll/sc
– Scalability: like ll/sc
– Storage: O(1)
– Fairness: yes (FIFO)
array
– Latency: low under low contention, like t&t&s; under high contention better than ticket
– Bus traffic: less than ticket
– Scalability: more scalable than ticket (only one processor incurs the miss)
– Storage: O(p)
– Fairness: yes (FIFO)

21 Transactional Memory
Problems of mutexes
– Require thinking about overlapping and partial operations in distantly separated and seemingly unrelated sections of code: very difficult and error-prone for programmers
– Have to think about how to prevent deadlock and livelock
– Can lead to priority inversion: a high-priority thread waits on a low-priority thread holding exclusive access to a resource
Software transactional memory
– A concurrency control mechanism analogous to database transactions
– An alternative to lock-based synchronization

22 Transactional Memory
A transaction: a piece of code that executes a series of reads and writes to shared memory
– The reads and writes logically occur at a single instant in time
– Intermediate states are not visible to other (successful) transactions
How does it work?
– Optimistic: a thread completes its modifications to shared memory without regard for what other threads might be doing
– Every read and write performed is recorded in a log

23 Transactional Memory
Locks
– A write does not happen until the thread is sure that it has exclusive control
TM
– Writes are performed right away
– The burden is on reads: a thread reads and then verifies that other threads have not accessed the same location in the meantime
– If validation is successful: "commit"
– If conflicting changes are detected: the transaction is aborted, i.e. all its changes are rolled back, and it is re-executed until successful

24 Transactional Memory Benefits
Simpler to understand: each transaction can be considered in isolation => easier to maintain multithreaded programs
Deadlock and livelock: handled by an external transaction manager
Priority inversion: still an issue, but high-priority transactions can abort conflicting lower-priority transactions that have not committed yet
Increased concurrency
– No thread waits for access to a resource
– Different threads can modify different parts of the same resource
Despite the overhead of retrying transactions, conflicts arise rarely in real programs => large performance gain over lock-based code

25 Transactional Memory Drawbacks
Overhead of maintaining the log
Overhead of committing transactions
Cannot perform operations that cannot be undone (e.g. I/O)
– Solution: buffer irreversible operations; they are committed to I/O outside the transaction

26 Transactional Memory
Complexity
– n concurrent transactions
– O(n) space and time
Language support
– A simple atomic keyword
– When the end of the block is reached, the transaction is committed if possible
// Insert a node into a doubly-linked list atomically
atomic {
    newNode->prev = node;
    newNode->next = node->next;
    node->next->prev = newNode;
    node->next = newNode;
}
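A toy illustration of the optimistic read-validate-commit cycle in C, using a single version counter on a single variable (all helper names are hypothetical; a real STM tracks whole read and write sets, not one word):

```c
#include <pthread.h>

typedef struct {
    volatile unsigned version;   /* even = stable, odd = being written */
    volatile int value;
} tvar_t;

static tvar_t tv = { 0, 0 };

/* optimistic "transaction": snapshot, compute, and commit only if the
 * version is unchanged; otherwise abort and re-execute */
static void tx_increment(tvar_t *t) {
    for (;;) {
        unsigned v = t->version;
        if (v & 1) continue;                   /* writer active: retry */
        int snapshot = t->value;               /* the "read log" */
        if (__sync_bool_compare_and_swap(&t->version, v, v + 1)) {
            t->value = snapshot + 1;           /* apply the write set */
            __sync_synchronize();
            t->version = v + 2;                /* commit: even again */
            return;
        }
        /* validation failed: another transaction committed; retry */
    }
}

static void *tx_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 10000; i++) tx_increment(&tv);
    return 0;
}

int run_tx_demo(int n) {
    pthread_t t[8];
    tv.version = 0; tv.value = 0;
    for (int i = 0; i < n; i++) pthread_create(&t[i], 0, tx_worker, 0);
    for (int i = 0; i < n; i++) pthread_join(t[i], 0);
    return tv.value;
}
```

A failed compare-and-swap here plays the role of an aborted transaction: nothing was written, so there is nothing to roll back.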

27 Transactional Memory
Guard condition: similar to condition variables for POSIX threads
– If the condition is not satisfied, the transaction manager waits until another transaction has made a commit that affects the condition before retrying
– No need for explicit signalling
atomic (queueSize > 0) {
    // remove an item from the queue and use it
}

28 Point-to-Point Event Synchronization
Software methods
– Busy-waiting: use ordinary variables as flags
– Blocking: semaphores
– Interrupts
Full hardware support: a full-empty bit with each word in memory
– Set when the word is "full" with newly produced data (i.e. when written)
– Unset when the word is "empty" due to being consumed (i.e. when read)
– Natural for word-level producer-consumer synchronization
  producer: write if empty, set to full; consumer: read if full, set to empty
– Hardware preserves read or write atomicity
– Problem: inflexibility
  multiple consumers
  multiple updates by a producer
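The busy-waiting flag method from the first bullet, sketched in C; the explicit fence stands in for the ordering guarantee that full-empty hardware would provide (the value 42 is arbitrary):

```c
#include <pthread.h>

static volatile int data = 0;
static volatile int flag = 0;    /* software stand-in for a full-empty bit */

static void *producer(void *arg) {
    (void)arg;
    data = 42;                   /* produce the value */
    __sync_synchronize();        /* make data visible before the flag */
    flag = 1;                    /* mark "full" */
    return 0;
}

/* consumer: busy-wait on the flag, then read the data */
int run_flag_demo(void) {
    pthread_t p;
    data = 0; flag = 0;
    pthread_create(&p, 0, producer, 0);
    while (flag == 0)
        ;                        /* spin until "full" */
    __sync_synchronize();
    pthread_join(p, 0);
    return data;
}
```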

29 Barriers
Hardware barriers
– A wired-AND line separate from the address/data bus
  set your input to 1 on arrival, wait for the output to become 1 before leaving
– Useful when barriers are global and very frequent
– Difficult to support an arbitrary subset of processors
  even harder with multiple processes per processor
– Difficult to dynamically change the number and identity of participants
  e.g. the latter due to process migration
– Not common today on bus-based machines
Software barriers
– Algorithms implemented using locks, flags, counters

30 A Simple Centralized Barrier
struct bar_type {
    int counter;
    struct lock_type lock;
    int flag = 0;
} bar_name;

BARRIER (bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
        bar_name.flag = 0;             /* reset flag if first to reach */
    mycount = ++bar_name.counter;      /* mycount is private */
    UNLOCK(bar_name.lock);
    if (mycount == p) {                /* last to arrive */
        bar_name.counter = 0;          /* reset for next barrier */
        bar_name.flag = 1;             /* release waiters */
    } else
        while (bar_name.flag == 0) {}; /* busy-wait for release */
}

A shared counter maintains the number of processes that have arrived
– Increment on arrival (under the lock), spin until it reaches p
– Problem?

31 A Working Centralized Barrier
Consecutively entering the same barrier doesn't work
– Must prevent a process from entering until all have left the previous instance
– Could use another counter, but that increases latency and contention
Sense reversal: wait for the flag to take a different value in consecutive barriers
– Toggle this value only when all processes have reached the barrier

BARRIER (bar_name, p) {
    local_sense = !(local_sense);      /* toggle private sense variable */
    LOCK(bar_name.lock);
    mycount = bar_name.counter++;      /* mycount is private */
    if (bar_name.counter == p) {       /* last to arrive */
        UNLOCK(bar_name.lock);
        bar_name.counter = 0;
        bar_name.flag = local_sense;   /* release waiters */
    } else {
        UNLOCK(bar_name.lock);
        while (bar_name.flag != local_sense) {};
    }
}
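The sense-reversal barrier above, sketched as runnable C with a pthread mutex standing in for LOCK/UNLOCK; the thread count and round count are arbitrary test values:

```c
#include <pthread.h>

#define P 4
#define ROUNDS 5

static struct {
    pthread_mutex_t lock;
    volatile int counter;
    volatile int flag;
} bar = { PTHREAD_MUTEX_INITIALIZER, 0, 0 };

static volatile int arrived[ROUNDS];
static volatile int errors = 0;

/* local_sense is private to each thread */
static void barrier_wait(int *local_sense) {
    *local_sense = !*local_sense;        /* toggle private sense */
    pthread_mutex_lock(&bar.lock);
    bar.counter++;
    if (bar.counter == P) {              /* last to arrive */
        bar.counter = 0;
        bar.flag = *local_sense;         /* release waiters */
        pthread_mutex_unlock(&bar.lock);
    } else {
        pthread_mutex_unlock(&bar.lock);
        while (bar.flag != *local_sense)
            ;                            /* spin on the sense flag */
    }
}

static void *bar_worker(void *arg) {
    int local_sense = 0;
    (void)arg;
    for (int r = 0; r < ROUNDS; r++) {
        __sync_fetch_and_add(&arrived[r], 1);
        barrier_wait(&local_sense);
        if (arrived[r] != P)             /* everyone must have arrived */
            __sync_fetch_and_add(&errors, 1);
    }
    return 0;
}

int run_barrier_demo(void) {
    pthread_t t[P];
    for (int i = 0; i < P; i++) pthread_create(&t[i], 0, bar_worker, 0);
    for (int i = 0; i < P; i++) pthread_join(t[i], 0);
    return errors;                       /* 0 iff the barrier worked */
}
```

Because the sense alternates, a fast process re-entering the barrier spins on the new sense value and cannot race past slow processes still leaving the previous instance.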

32 Centralized Barrier Performance
Latency
– Critical path length at least proportional to p (the accesses to the critical section are serialized by the lock)
Traffic
– p bus transactions to obtain the lock
– p bus transactions to modify the counter
– 2 bus transactions for the last processor to reset the counter and release the waiting processes
– p-1 bus transactions for the first p-1 processors to read the flag
Storage cost
– Very low: a centralized counter and flag
Fairness
– The same processor should not always be last to exit the barrier
Key problems for the centralized barrier are latency and traffic
– Especially with distributed memory, all traffic goes to the same node

33 Improved Barrier Algorithms for a Bus
Software combining tree
– Only k processors access the same location, where k is the degree of the tree (k = 2 in the example)
– Separate arrival and exit trees, and use sense reversal
– Valuable in a distributed network: communication proceeds along different paths
– On a bus, all traffic goes on the same bus, and there is no less total traffic
– Higher latency (log p steps of work, and O(p) serialized bus transactions)
– The advantage on a bus is the use of ordinary reads/writes instead of locks

34 Scalable Multiprocessors

35 Scalable Machines
Scalability: the capability of a system to grow by adding processors, memory, and I/O devices
Four important aspects of scalability
– Bandwidth increases with the number of processors
– Latency does not increase, or increases slowly
– Cost increases slowly with the number of processors
– Physical placement of resources

36 Limited Scaling of a Bus
Small configurations are cost-effective

Characteristic             Bus
Physical length            ~1 ft
Number of connections      fixed
Maximum bandwidth          fixed
Interface to comm. medium  extended memory interface
Global order               arbitration
Protection                 virtual -> physical
Trust                      total
OS                         single
Comm. abstraction          HW

37 Workstations in a LAN?
No clear limit to physical scaling, little trust, no global order
Independent failure and restart

Characteristic             Bus                       LAN
Physical length            ~1 ft                     km
Number of connections      fixed                     many
Maximum bandwidth          fixed                     ???
Interface to comm. medium  memory interface          peripheral
Global order               arbitration               ???
Protection                 virtual -> physical       OS
Trust                      total                     none
OS                         single                    independent
Comm. abstraction          HW                        SW

38 Bandwidth Scalability
Bandwidth limitation: a single set of wires
Must have many independent sets of wires (remember bisection width?) => switches

39 Dancehall MP Organization
Network bandwidth demand: scales linearly with the number of processors
Latency: increases with the number of switch stages (remember the butterfly?)
– Adding local memory would offer fixed latency

40 Generic Distributed Memory Multiprocessor
The most common structure

41 Bandwidth Scaling Requirements
A large number of independent communication paths between nodes: many concurrent transactions using different wires
Independent transactions
No global arbitration
The effect of a transaction is visible only to the nodes involved
– Broadcast is difficult (it was easy on a bus): additional transactions are needed

42 Latency Scaling
T(n) = Overhead + ChannelTime(n) + RoutingDelay(h, n) + ContentionTime
Overhead: processing time to initiate and complete a transfer
ChannelTime(n) = n/B (channel occupancy for an n-byte message over a channel of bandwidth B)
RoutingDelay(h, n): a function of the number of hops h
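The latency model as a small C function. The cut-through assumption (per-hop delay independent of n) and all parameter values in the usage note are illustrative, not from the slide:

```c
/* T(n) = overhead + n/B + h * per-hop delay, with contention taken as 0 */
static double transfer_time_us(double overhead_us, double n_bytes,
                               double bytes_per_us, int hops,
                               double per_hop_us) {
    return overhead_us + n_bytes / bytes_per_us + hops * per_hop_us;
}
```

For example, a 1024-byte message at 128 bytes/us across 4 hops of 0.5 us each, with 1 us of overhead, costs 1 + 8 + 2 = 11 us under this model.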

43 Cost Scaling
Cost(p, m) = fixed cost + incremental cost(p, m)
Bus-based SMP
– Add more processors and memory
Scalable machines
– Add processors, memory, and network
Parallel efficiency(p) = Speedup(p) / p
Costup(p) = Cost(p) / Cost(1)
Cost-effective: Speedup(p) > Costup(p)
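The three definitions above translate directly into code; the numbers in the test come from the next slide's example (a 475-fold speedup at 206x the cost):

```c
/* Parallel efficiency(p) = Speedup(p) / p */
static double parallel_efficiency(double speedup, double p) {
    return speedup / p;
}

/* Costup(p) = Cost(p) / Cost(1) */
static double costup(double cost_p, double cost_1) {
    return cost_p / cost_1;
}

/* a machine is cost-effective when Speedup(p) > Costup(p) */
static int cost_effective(double speedup, double costup_p) {
    return speedup > costup_p;
}
```

Note that a machine can be cost-effective even at low parallel efficiency: 475/2048 is only about 23%, yet the speedup still exceeds the costup.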

44 Cost Effective?
2048 processors: 475-fold speedup at 206x the cost => cost-effective

45 Physical Scaling
Chip-level integration
– Multicore
– Cell
Board-level integration
– Several multicores on a board
System-level integration
– Clusters, supercomputers

46 Chip-Level Integration: nCUBE/2
Network integrated onto the chip
14 bidirectional links => up to 8192 nodes
Entire machine synchronous at 40 MHz
[Figure: single-chip node (DRAM interface, DMA channels, router, MMU, instruction fetch & decode, 64-bit integer unit, IEEE floating point, operand cache, execution unit), the basic module, and a 1024-node hypercube network configuration]

47 Chip-Level Integration: Cell
PPE (Power Processing Element) at 3.2 GHz
Synergistic Processing Elements (SPEs)

48 Board-Level Integration: CM-5
Uses standard microprocessor components
Scalable network interconnect

49 System-Level Integration
Loose packaging
– IBM SP2
– Cluster blades

50 Roadrunner
A next-generation supercomputer to be built at the Los Alamos National Laboratory in New Mexico
1 petaflops; funded by the US Department of Energy
Hybrid design
– More than 16,000 AMD Opteron cores (~2200 IBM x3755 4U servers, each holding four dual-core Opterons, connected by InfiniBand)
– A comparable number of Cell microprocessors
– Red Hat Linux operating system
When completed, expected in 2008, it will be the world's most powerful computer and cover approximately 12,000 square feet (1,100 square meters)
Purpose: simulating how nuclear materials age and whether the aging nuclear weapon arsenal of the United States is safe and reliable

