Computer architecture II, Lecture 9



Today
Synchronization for SMM
–test&set, LL/SC, array-based locks
–barriers
Scalable multiprocessors
–What is a scalable machine?

Synchronization
Types of synchronization
–Mutual exclusion
–Event synchronization: point-to-point, group, global (barriers)
All solutions rely on hardware support for an atomic read-modify-write operation
Today we look at synchronization for cache-coherent, bus-based multiprocessors

Components of a Synchronization Event
Acquire method
–acquire the right to the synchronization (e.g. enter the critical section)
Waiting algorithm
–wait for the synchronization to become available when it isn't
–busy-waiting, blocking, or hybrid
Release method
–enable other processors to acquire

Performance Criteria for Synchronization Operations
Latency (time per operation)
–especially under light contention
Bandwidth (operations per second)
–especially under high contention
Traffic
–load on critical resources
–especially on failures under contention
Storage
Fairness

Strawman Lock

lock:    ld   register, location   /* copy location to register */
         cmp  location, #0         /* compare with 0 */
         bnz  lock                 /* if not 0, try again */
         st   location, #1         /* store 1 to mark it locked */
         ret                       /* return control to caller */

unlock:  st   location, #0         /* write 0 to location */
         ret                       /* return control to caller */

Busy-waiting; location is initially 0.
Why doesn't the acquire method work? (The load, compare, and store are separate operations: two processors can both read 0 and both proceed to take the lock.)

Computer Architecture II 7 Atomic Instructions Specifies a location, register, & atomic operation –Value in location read into a register –Another value (function of value read or not) stored into location Many variants –Varying degrees of flexibility in second part Simple example: test&set –Value in location read into a specified register –Constant 1 stored into location –Successful if value loaded into register is 0 –Other constants could be used instead of 1 and 0

Simple Test&Set Lock

lock:    t&s  register, location
         bnz  lock                 /* if not 0, try again */
         ret                       /* return control to caller */

unlock:  st   location, #0         /* write 0 to location */
         ret                       /* return control to caller */

The same lock in pseudocode:

while (not acquired)          /* lock is held by another process */
    test&set(location);       /* try to acquire the lock */

Condition: the architecture supports an atomic test&set
–copy location to a register and set location to 1 in one atomic operation
Problem:
–t&s modifies location in its cache each time it tries to acquire the lock => cache block invalidations => bus traffic (especially under high contention)
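The t&s lock above can be sketched in C11 atomics, where atomic_exchange plays the role of the test&set primitive; the type and function names are illustrative, not from the lecture.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } tas_lock_t;   /* 0 = free, 1 = held */

/* atomic_exchange writes 1 and returns the previous value in one atomic
 * step, which is exactly the test&set operation described above. */
static void tas_acquire(tas_lock_t *l) {
    while (atomic_exchange(&l->locked, 1) != 0)
        ;   /* spin: every iteration is a read-modify-write, hence bus traffic */
}

static void tas_release(tas_lock_t *l) {
    atomic_store(&l->locked, 0);
}
```

Note that the spin loop itself is the source of the invalidation traffic the slide complains about: each failed exchange still writes the line.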

Computer Architecture II 9 s s s s s s s s s s s s s s s s l l l l l l l l l l l l l l l l n n n n n n n n n n n n n n n n u uuuuuuuuuuuuuuu Number of processors T ime (  s) s Test&set,c = 0 l Test&set, exponential backoff c = 3.64 n Test&set, exponential backoff c = 0 u Ideal 9753 T&S Lock Microbenchmark: SGI Challenge lock; delay(c); unlock; Why does performance degrade? –Bus Transactions on T&S

Computer Architecture II 10 Other read-modify-write primitives Fetch&op –Atomically read and modify (by using op operation) and write a memory location –E.g. fetch&add, fetch&incr Compare&swap –Three operands: location, register to compare with, register to swap with

Enhancements to Simple Lock
Problem of t&s: lots of invalidations if the lock cannot be taken
Reduce the frequency of issuing test&sets while waiting
–test&set lock with exponential backoff

i = 0;
while (!acquired) {           /* lock is held by another process */
    test&set(location);
    if (!acquired) {          /* test&set didn't succeed */
        wait(t_i);            /* sleep for a time that grows with i */
        i++;
    }
}

–fewer invalidations
–but a process may wait longer than necessary
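A minimal C11 sketch of the backoff scheme; the base delay and cap are illustrative values, not from the lecture.

```c
#include <stdatomic.h>
#include <time.h>

typedef struct { atomic_int locked; } tas_lock_t;

/* Test&set with exponential backoff: after each failed t&s, sleep for a
 * geometrically growing interval so waiters issue fewer bus transactions. */
static void backoff_acquire(tas_lock_t *l) {
    long delay_ns = 100;                    /* base delay (illustrative) */
    while (atomic_exchange(&l->locked, 1) != 0) {
        struct timespec ts = { 0, delay_ns };
        nanosleep(&ts, NULL);               /* wait t_i before retrying */
        if (delay_ns < 1000000)
            delay_ns *= 2;                  /* i++: double the delay, capped */
    }
}

static void backoff_release(tas_lock_t *l) {
    atomic_store(&l->locked, 0);
}
```

The cap keeps the "may wait longer than necessary" penalty bounded: without it, a late release can leave a waiter sleeping far past the point where the lock became free.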

Computer Architecture II 12 s s s s s s s s s s s s s s s s l l l l l l l l l l l l l l l l n n n n n n n n n n n n n n n n u uuuuuuuuuuuuuuu Number of processors T ime (  s) s Test&set,c = 0 l Test&set, exponential backoff c = 3.64 n Test&set, exponential backoff c = 0 u Ideal 9753 T&S Lock Microbenchmark: SGI Challenge lock; delay(c); unlock; Why does performance degrade? –Bus Transactions on T&S

Enhancements to Simple Lock
Reduce the frequency of issuing test&sets while waiting
–test-and-test&set lock

while (!acquired) {           /* lock is held by another process */
    if (location == 1)        /* test with an ordinary load */
        continue;
    else {
        test&set(location);
        if (acquired)         /* succeeded */
            break;
    }
}

Keep testing with an ordinary load
–just a hint: the cached lock variable will be invalidated when a release occurs
–if location becomes 0, use t&s to modify the variable atomically
–on failure, start over
Further reduces bus transactions
–the load produces bus traffic only when the lock is released
–t&s produces bus traffic each time it is executed
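The test-and-test&set loop above, sketched with C11 atomics (function and type names are illustrative):

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } tas_lock_t;

static void ttas_acquire(tas_lock_t *l) {
    for (;;) {
        /* Test with an ordinary load: it hits in the local cache and
         * causes bus traffic only when the holder's release invalidates it. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            ;
        /* The lock looks free: now try the expensive atomic test&set once. */
        if (atomic_exchange(&l->locked, 1) == 0)
            return;
        /* t&s failed (another processor won the race): start over. */
    }
}
```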

Lock performance

t&s
–latency: low under low contention, high under high contention
–bus traffic: a lot
–scalability: poor
–storage: low (does not increase with the number of processors)
–fairness: no

t&s with backoff
–latency: low under low contention (same as t&s), high under high contention
–bus traffic: less than t&s
–scalability: better than t&s
–storage: low (does not increase with the number of processors)
–fairness: no

t&t&s
–latency: low under low contention (a little higher than t&s), high under high contention
–bus traffic: less than t&s and t&s with backoff
–scalability: better than t&s and t&s with backoff
–storage: low (does not increase with the number of processors)
–fairness: no

Improved Hardware Primitives: LL-SC
Goals:
–avoid the bus traffic generated by test&set
–failed read-modify-write attempts should not generate invalidations
–a single primitive should be able to implement a range of read-modify-write operations
Load-Locked (or Load-Linked), Store-Conditional
–LL reads the variable into a register
–work on the value in the register
–SC tries to store the register back to the location
–SC succeeds if and only if no other write to the variable occurred since this processor's LL; success is indicated by a condition flag
If SC succeeds, all three steps happened atomically
If SC fails, it doesn't write or generate invalidations
–the acquire must be retried

Simple Lock with LL-SC

lock:    ll    reg1, location   /* LL location into reg1 */
         sc    location, reg2   /* SC reg2 into location */
         beqz  reg2, lock       /* if SC failed, start again */
         ret

unlock:  st    location, #0     /* write 0 to location */
         ret

Can simulate the atomic operations t&s, fetch&op, compare&swap by changing what's between LL and SC (exercise)
–only a couple of instructions, so the SC is likely to succeed
–don't include instructions that would need to be undone (e.g. stores)
SC can fail (without putting a transaction on the bus) if:
–it detects an intervening write even before trying to get the bus
–it tries to get the bus but another processor's SC gets the bus first
LL and SC are not lock and unlock respectively
–they only guarantee no conflicting write to the lock variable between them
–but they can be used directly to implement simple operations on shared variables
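C exposes no portable LL/SC, but the same "store only if nobody wrote since I read" idea can be sketched with compare-and-swap; on LL/SC architectures (ARM, RISC-V, Alpha, MIPS) compilers lower this loop to an LL/SC pair. The function name is illustrative.

```c
#include <stdatomic.h>

static void llsc_style_acquire(atomic_int *location) {
    int expected;
    do {
        expected = 0;   /* "LL": we must have observed the lock as free */
        /* "SC": store 1 only if the location still holds the value read;
         * a failed attempt writes nothing, so it invalidates nothing. */
    } while (!atomic_compare_exchange_weak(location, &expected, 1));
}
```

atomic_compare_exchange_weak is deliberately chosen over the strong form: like a real SC, it is allowed to fail spuriously, and the retry loop already handles that.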

Advanced lock algorithms
Problems with the presented approaches
–unfair: the order of arrival does not count
–all processors try to acquire the lock when it is released
–several processors may incur a read miss when the lock is released
Desirable: only one miss

Ticket Lock
Like drawing a ticket with a number and waiting until that number is shown
Two counters per lock (next_ticket, now_serving)
–acquire: fetch&inc next_ticket; wait until now_serving equals the ticket obtained (the atomic operation happens when arriving at the lock, not when it becomes free, so there is less contention)
–release: increment now_serving
Performance
–low latency under low contention
–O(p) read misses at release, since all processors spin on the same variable
–FIFO order
–like the simple LL-SC lock, but with no invalidation when SC succeeds, and fair
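The two-counter scheme translates directly to C11, with atomic_fetch_add as the fetch&inc; names are illustrative.

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* fetch&inc'd once per acquire */
    atomic_uint now_serving;   /* incremented on release */
} ticket_lock_t;

static void ticket_acquire(ticket_lock_t *l) {
    /* One atomic operation when arriving at the lock, not when it frees. */
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    while (atomic_load(&l->now_serving) != my_ticket)
        ;   /* spin with ordinary reads; FIFO order by ticket number */
}

static void ticket_release(ticket_lock_t *l) {
    atomic_fetch_add(&l->now_serving, 1);
}
```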

Array-based Queuing Locks
Waiting processes poll on different locations in an array of size p
–acquire: fetch&inc to obtain the address on which to spin (the next array element); ensure that these addresses are in different cache lines or memories
–release: set the next location in the array, waking up the process spinning on it
–O(1) traffic per acquire with coherent caches
–FIFO ordering, as in the ticket lock, but O(p) space per lock
–not so great for non-cache-coherent machines with distributed memory: the array location a processor spins on is not necessarily in its local memory
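A sketch of the array lock in C11. The slot count, padding size, and initialization scheme (slot 0 starts free, all others armed) are one possible construction, not taken from the lecture.

```c
#include <stdatomic.h>

#define NPROC 8   /* p: illustrative number of processors */

typedef struct {
    /* One flag per processor, padded so each sits in its own cache line. */
    struct { atomic_int must_wait; char pad[60]; } slot[NPROC];
    atomic_uint next;   /* fetch&inc'd to hand out spin locations */
} array_lock_t;

static void array_lock_init(array_lock_t *l) {
    for (unsigned i = 0; i < NPROC; i++)
        atomic_store(&l->slot[i].must_wait, i != 0);   /* slot 0 starts free */
    atomic_store(&l->next, 0);
}

static unsigned array_acquire(array_lock_t *l) {
    unsigned me = atomic_fetch_add(&l->next, 1) % NPROC;
    while (atomic_load(&l->slot[me].must_wait))
        ;   /* each waiter spins on its own line: O(1) traffic at release */
    return me;   /* caller hands this back to array_release */
}

static void array_release(array_lock_t *l, unsigned me) {
    atomic_store(&l->slot[me].must_wait, 1);               /* re-arm my slot */
    atomic_store(&l->slot[(me + 1) % NPROC].must_wait, 0); /* wake successor */
}
```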

Lock performance (summary)

t&s
–latency: low under low contention, high under high contention
–bus traffic: a lot
–scalability: poor
–storage: O(1)
–fairness: no

t&s with backoff
–latency: low under low contention (as t&s), high under high contention
–bus traffic: less than t&s
–scalability: better than t&s
–storage: O(1)
–fairness: no

t&t&s
–latency: low under low contention (a little higher than t&s), high under high contention
–bus traffic: less (no traffic while waiting)
–scalability: better than t&s with backoff
–storage: O(1)
–fairness: no

ll/sc
–latency: low under low contention; under high contention, better than t&t&s
–bus traffic: like t&t&s, plus no traffic on a missed attempt
–scalability: better than t&t&s
–storage: O(1)
–fairness: no

ticket
–latency: low under low contention; under high contention, better than ll/sc
–bus traffic: a little less than ll/sc
–scalability: like ll/sc
–storage: O(1)
–fairness: yes (FIFO)

array
–latency: low under low contention (like t&t&s); under high contention, better than ticket
–bus traffic: less than ticket
–scalability: more scalable than ticket (only one processor incurs the miss)
–storage: O(p)
–fairness: yes (FIFO)

Point-to-Point Event Synchronization
Software methods:
–busy-waiting: use ordinary variables as flags
–blocking: semaphores
–interrupts
Full hardware support: a full-empty bit with each word in memory
–set when the word is "full" with newly produced data (i.e. when written)
–cleared when the word is "empty" because the data has been consumed (i.e. when read)
–natural for word-level producer-consumer synchronization: the producer writes if empty and sets to full; the consumer reads if full and sets to empty
–hardware preserves the atomicity of the read or write
–problem: lack of flexibility (multiple consumers, multiple updates by a producer)

Barriers
Hardware barriers
–wired-AND line separate from the address/data bus: set the input to 1 on arrival, wait for the output to become 1 before leaving
–useful when barriers are global and very frequent
–difficult to support an arbitrary subset of processors (even harder with multiple processes per processor)
–difficult to dynamically change the number and identity of participants (the latter e.g. due to process migration)
–not common today on bus-based machines
Software algorithms are implemented using locks, flags, and counters

Computer Architecture II 23 struct bar_type {int counter; struct lock_type lock; int flag = 0;} bar_name; BARRIER (bar_name, p) { LOCK(bar_name.lock); if (bar_name.counter == 0) bar_name.flag = 0; /* reset flag if first to reach*/ mycount = bar_name.counter++; /* mycount is private */ UNLOCK(bar_name.lock); if (mycount == p) { /* last to arrive */ bar_name.counter = 0; /* reset for next barrier */ bar_name.flag = 1; /* release waiters */ } else while (bar_name.flag == 0) {}; /* busy wait for release */ } A Simple Centralized Barrier Shared counter maintains number of processes that have arrived –increment when arrive (lock), check until reaches numprocs –Problem?

A Working Centralized Barrier
Consecutively entering the same barrier doesn't work
–must prevent a process from entering until all have left the previous instance
–could use another counter, but that increases latency and contention
Sense reversal: wait for the flag to take a different value in consecutive barrier instances
–toggle this value only when all processes have arrived

BARRIER (bar_name, p) {
    local_sense = !(local_sense);       /* toggle private sense variable */
    LOCK(bar_name.lock);
    mycount = bar_name.counter++;       /* mycount is private */
    if (bar_name.counter == p) {
        UNLOCK(bar_name.lock);
        bar_name.counter = 0;
        bar_name.flag = local_sense;    /* release waiters */
    } else {
        UNLOCK(bar_name.lock);
        while (bar_name.flag != local_sense) {};
    }
}
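The sense-reversing barrier can be written in C11; here a single fetch&inc stands in for the LOCK/counter++/UNLOCK sequence of the slide's version, and local_sense is per-thread state. Names are illustrative.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int counter;   /* number of processes that have arrived */
    atomic_int flag;      /* release flag, compared against local_sense */
} barrier_t;

/* local_sense must be private to each thread (a stack or thread-local
 * variable), so consecutive barrier instances wait on opposite values. */
static void barrier_wait(barrier_t *b, int p, int *local_sense) {
    *local_sense = !*local_sense;                 /* toggle private sense */
    if (atomic_fetch_add(&b->counter, 1) + 1 == p) {
        atomic_store(&b->counter, 0);             /* last arrival: reset */
        atomic_store(&b->flag, *local_sense);     /* and release waiters */
    } else {
        while (atomic_load(&b->flag) != *local_sense)
            ;                                     /* busy wait for release */
    }
}
```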

Centralized Barrier Performance
Latency
–the critical path length is at least proportional to p (accesses to the critical region are serialized by the lock)
Traffic
–p bus transactions to obtain the lock
–p bus transactions to modify the counter
–2 bus transactions for the last processor to reset the counter and release the waiting processes
–p-1 bus transactions for the first p-1 processors to read the flag
Storage cost
–very low: a centralized counter and flag
Fairness
–the same processor should not always be the last to exit the barrier
The key problems for the centralized barrier are latency and traffic
–especially with distributed memory, since all traffic goes to the same node

Improved Barrier Algorithms for a Bus
Software combining tree
–only k processors access the same location, where k is the degree of the tree (k = 2 in the original figure)
–separate arrival and exit trees, and use sense reversal
–valuable in a distributed network: communication proceeds along different paths
–on a bus, all traffic goes on the same bus, and there is no less total traffic
–higher latency (log p steps of work, and O(p) serialized bus transactions)
–the advantage on a bus is the use of ordinary reads/writes instead of locks

Scalable Multiprocessors

Scalable Machines
Scalability: the capability of a system to grow by adding processors, memory, and I/O devices
Four important aspects of scalability
–bandwidth increases with the number of processors
–latency does not increase, or increases slowly
–cost increases slowly with the number of processors
–physical placement of resources

Limited Scaling of a Bus
Small configurations are cost-effective

Characteristic                Bus
Physical length               ~ 1 ft
Number of connections         fixed
Maximum bandwidth             fixed
Interface to comm. medium     extended memory interface
Global order                  arbitration
Protection                    virtual -> physical
Trust                         total
OS                            single
Comm. abstraction             HW

Workstations in a LAN?
No clear limit to physical scaling, little trust, no global order
Independent failure and restart

Characteristic                Bus                         LAN
Physical length               ~ 1 ft                      km
Number of connections         fixed                       many
Maximum bandwidth             fixed                       ???
Interface to comm. medium     memory interface            peripheral
Global order                  arbitration                 ???
Protection                    virtual -> physical         OS
Trust                         total                       none
OS                            single                      independent
Comm. abstraction             HW                          SW

Bandwidth Scalability
Bandwidth limitation: a single set of wires
Must have many independent wires (remember bisection width?) => switches

Dancehall MP Organization
Network bandwidth demand scales linearly with the number of processors
Latency increases with the number of stages of switches (remember the butterfly?)
–adding local memory would offer fixed latency

Generic Distributed Memory Multiprocessor
The most common structure

Bandwidth scaling requirements
A large number of independent communication paths between nodes: many concurrent transactions using different wires
Independent transactions
No global arbitration
The effect of a transaction is visible only to the nodes involved
–broadcast is difficult (it was easy on a bus): additional transactions are needed

Latency Scaling
T(n) = Overhead + Channel Time (channel occupancy) + Routing Delay + Contention Time
–Overhead: processing time to initiate and complete a transfer
–Channel Time(n) = n/B, for a message of n bytes over a channel of bandwidth B
–Routing Delay(h, n): a function of the number of hops h
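A direct transcription of the latency model into C; the function name, parameter names, and the linear per-hop routing delay are illustrative assumptions. Times are in microseconds, sizes in bytes.

```c
/* T(n) = Overhead + n/B + RoutingDelay(h, n) + Contention */
static double transfer_time(double overhead, double n, double B,
                            int hops, double per_hop, double contention) {
    double channel_time = n / B;            /* channel occupancy: n/B */
    double routing_delay = hops * per_hop;  /* RoutingDelay(h, n), linear in h here */
    return overhead + channel_time + routing_delay + contention;
}
```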

Cost Scaling
Cost(p, m) = fixed cost + incremental cost(p, m)
Bus-based SMP
–add more processors and memory
Scalable machines
–add processors, memory, and network
Parallel efficiency(p) = Speedup(p) / p
Costup(p) = Cost(p) / Cost(1)
Cost-effective: Speedup(p) > Costup(p)

Cost Effective?
2048 processors: 475-fold speedup at 206x the cost
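The slide's criterion and figures can be checked with a few lines of C; the helper names are illustrative.

```c
/* A machine is cost-effective when Speedup(p) > Costup(p),
 * even when parallel efficiency Speedup(p)/p is modest. */
static int cost_effective(double speedup, double costup) {
    return speedup > costup;
}

static double parallel_efficiency(double speedup, double p) {
    return speedup / p;
}
```

With the slide's numbers (p = 2048, 475-fold speedup, 206x cost), the machine is cost-effective although parallel efficiency is only about 23%.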

Physical Scaling
–chip-level integration
–board-level integration
–system-level integration

Chip-level integration: nCUBE/2
–network integrated onto the chip
–14 bidirectional links => up to 8192 nodes
–the entire machine is synchronous at 40 MHz

[Figure: the single-chip node (DRAM interface, DMA channels, router, MMU, instruction fetch and decode, 64-bit integer unit, IEEE floating point, operand cache, execution unit), the basic module, and a 1024-node hypercube network configuration]

Board-level integration: CM-5
–uses standard microprocessor components
–scalable network interconnect

System-Level Integration
–loose packaging
–IBM SP2
–cluster blades