CS 152 Computer Architecture and Engineering Lecture 19: Synchronization and Sequential Consistency Krste Asanovic Electrical Engineering and Computer.

Slides:



Advertisements
Similar presentations
Symmetric Multiprocessors: Synchronization and Sequential Consistency.
Advertisements

1 Episode III in our multiprocessing miniseries. Relaxed memory models. What I really wanted here was an elephant with sunglasses relaxing On a beach,
1 Lecture 20: Synchronization & Consistency Topics: synchronization, consistency models (Sections )
Synchronization. How to synchronize processes? – Need to protect access to shared data to avoid problems like race conditions – Typical example: Updating.
4/16/2013 CS152, Spring 2013 CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Snoopy Caches I Steve Ko Computer Sciences and Engineering University at Buffalo.
4/4/2013 CS152, Spring 2013 CS 152 Computer Architecture and Engineering Lecture 17: Synchronization and Sequential Consistency Krste Asanovic Electrical.
CS 162 Memory Consistency Models. Memory operations are reordered to improve performance Hardware (e.g., store buffer, reorder buffer) Compiler (e.g.,
© Krste Asanovic, 2014CS252, Spring 2014, Lecture 12 CS252 Graduate Computer Architecture Spring 2014 Lecture 12: Synchronization and Memory Models Krste.
Chapter 6: Process Synchronization
Silberschatz, Galvin and Gagne ©2013 Operating System Concepts – 9 th Edition Chapter 5: Process Synchronization.
Silberschatz, Galvin and Gagne ©2009 Operating System Concepts – 8 th Edition, Chapter 6: Process Synchronization.
Concurrent Programming James Adkison 02/28/2008. What is concurrency? “happens-before relation – A happens before B if A and B belong to the same process.
Process Synchronization. Module 6: Process Synchronization Background The Critical-Section Problem Peterson’s Solution Synchronization Hardware Semaphores.
CS492B Analysis of Concurrent Programs Consistency Jaehyuk Huh Computer Science, KAIST Part of slides are based on CS:App from CMU.
Parallel Processing (CS526) Spring 2012(Week 6).  A parallel algorithm is a group of partitioned tasks that work with each other to solve a large problem.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Snoopy Caches II Steve Ko Computer Sciences and Engineering University at Buffalo.
Concurrency.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Directory-Based Caches I Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Synchronization and Consistency II Steve Ko Computer Sciences and Engineering University at.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Synchronization and Consistency I Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Multithreading II Steve Ko Computer Sciences and Engineering University at Buffalo.
By Sarita Adve & Kourosh Gharachorloo Review by Jim Larson Shared Memory Consistency Models: A Tutorial.
CS 152 Computer Architecture and Engineering Lecture 19: Synchronization and Sequential Consistency Krste Asanovic Electrical Engineering and Computer.
1 Lecture 7: Consistency Models Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models.
Computer Architecture II 1 Computer architecture II Lecture 9.
1 Lecture 15: Consistency Models Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models.
CS 252 Graduate Computer Architecture Lecture 11: Multiprocessors-II Krste Asanovic Electrical Engineering and Computer Sciences University of California,
April 13, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 18: Snoopy Caches Krste Asanovic Electrical Engineering and Computer.
CPE 731 Advanced Computer Architecture Multiprocessor Introduction
April 8, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 19: Synchronization and Sequential Consistency Krste Asanovic Electrical.
CS 152 Computer Architecture and Engineering Lecture 20: Snoopy Caches Krste Asanovic Electrical Engineering and Computer Sciences University of California,
April 4, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 17: Synchronization and Sequential Consistency Krste Asanovic Electrical.
April 18, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering.
April 15, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 20: Snoopy Caches Krste Asanovic Electrical Engineering and Computer.
Operating Systems CSE 411 CPU Management Oct Lecture 13 Instructor: Bhuvan Urgaonkar.
Memory Consistency Models Some material borrowed from Sarita Adve’s (UIUC) tutorial on memory consistency models.
Multi-core systems System Architecture COMP25212 Daniel Goodman Advanced Processor Technologies Group.
Shared Memory Consistency Models: A Tutorial Sarita V. Adve Kouroush Ghrachorloo Western Research Laboratory September 1995.
Memory Consistency Models Alistair Rendell See “Shared Memory Consistency Models: A Tutorial”, S.V. Adve and K. Gharachorloo Chapter 8 pp of Wilkinson.
COMP 111 Threads and concurrency Sept 28, Tufts University Computer Science2 Who is this guy? I am not Prof. Couch Obvious? Sam Guyer New assistant.
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Mutual Exclusion.
UC Regents Spring 2014 © UCBCS 152 L13: Synchronization John Lazzaro (not a prof - “John” is always OK) CS 152 Computer Architecture and Engineering.
By Sarita Adve & Kourosh Gharachorloo Slides by Jim Larson Shared Memory Consistency Models: A Tutorial.
Memory Consistency Models. Outline Review of multi-threaded program execution on uniprocessor Need for memory consistency models Sequential consistency.
CS399 New Beginnings Jonathan Walpole. 2 Concurrent Programming & Synchronization Primitives.
4/13/2016 CS152, Spring 2016 CS 152 Computer Architecture and Engineering Lecture 18: Snoopy Caches Dr. George Michelogiannakis EECS, University of California.
Synchronization Questions answered in this lecture: Why is synchronization necessary? What are race conditions, critical sections, and atomic operations?
Multiprocessors – Locks
Symmetric Multiprocessors: Synchronization and Sequential Consistency
CS 152 Computer Architecture and Engineering Lecture 18: Snoopy Caches
Background on the need for Synchronization
Memory Consistency Models
Lecture 11: Consistency Models
Memory Consistency Models
CS 252 Graduate Computer Architecture Lecture 10: Multiprocessors
CMPE 382 / ECE 510 Computer Organization & Architecture Chapter 4 – Multiprocessors and Multithreading based on text: Computer Architecture : A Quantitative.
Symmetric Multiprocessors: Synchronization and Sequential Consistency
Krste Asanovic Electrical Engineering and Computer Sciences
Designing Parallel Algorithms (Synchronization)
Krste Asanovic Electrical Engineering and Computer Sciences
Symmetric Multiprocessors: Synchronization and Sequential Consistency
Lecture 15 Multi-core Chips
Dr. George Michelogiannakis EECS, University of California at Berkeley
Concurrency: Mutual Exclusion and Process Synchronization
Memory Consistency Models
CS 152 Computer Architecture and Engineering Lecture 20: Snoopy Caches
CS333 Intro to Operating Systems
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 22 Synchronization Krste Asanovic Electrical Engineering and.
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 19 Memory Consistency Models Krste Asanovic Electrical Engineering.
Presentation transcript:

CS 152 Computer Architecture and Engineering Lecture 19: Synchronization and Sequential Consistency Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley

4/17/ CS152-Spring’08 Summary: Multithreaded Categories Time (processor cycle) SuperscalarFine-GrainedCoarse-Grained Multiprocessing Simultaneous Multithreading Thread 1 Thread 2 Thread 3 Thread 4 Thread 5 Idle slot

4/17/ CS152-Spring’08 Uniprocessor Performance (SPECint) VAX : 25%/year 1978 to 1986 RISC + x86: 52%/year 1986 to 2002 RISC + x86: ??%/year 2002 to present From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, X

4/17/ CS152-Spring’08 Déjà vu all over again? “… today’s processors … are nearing an impasse as technologies approach the speed of light..” David Mitchell, The Transputer: The Time Is Now (1989) Transputer had bad timing (Uniprocessor performance  )  Procrastination rewarded: 2X seq. perf. / 1.5 years “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing” Paul Otellini, President, Intel (2005) All microprocessor companies switch to MP (2X CPUs / 2 yrs)  Procrastination penalized: 2X sequential perf. / 5 yrs Manufacturer/Year AMD/’07Intel/’07IBM/’07Sun/’07 Processors/chip 4228 Threads/Processor 1128 Threads/chip 42464

4/17/ CS152-Spring’08 symmetric All memory is equally far away from all processors Any processor can do any I/O (set up a DMA transfer) Symmetric Multiprocessors Memory I/O controller Graphics output CPU-Memory bus bridge Processor I/O controller I/O bus Networks Processor

4/17/ CS152-Spring’08 Synchronization The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system) Forks and Joins: In parallel programming, a parallel process may want to wait until several events have occurred Producer-Consumer: A consumer process must wait until the producer process has produced data Exclusive use of a resource: Operating system has to ensure that only one process uses a resource at a given time producer consumer fork join P1 P2

4/17/ CS152-Spring’08 A Producer-Consumer Example The program is written assuming instructions are executed in order. Producer posting Item x: Load R tail, (tail) Store (R tail ), x R tail =R tail +1 Store (tail), R tail Consumer: Load R head, (head) spin:Load R tail, (tail) if R head ==R tail goto spin Load R, (R head ) R head =R head +1 Store (head), R head process(R) Producer Consumer tailhead R tail R head R Problems?

4/17/ CS152-Spring’08 A Producer-Consumer Example continued Producer posting Item x: Load R tail, (tail) Store (R tail ), x R tail =R tail +1 Store (tail), R tail Consumer: Load R head, (head) spin:Load R tail, (tail) if R head ==R tail goto spin Load R, (R head ) R head =R head +1 Store (head), R head process(R) Can the tail pointer get updated before the item x is stored? Programmer assumes that if 3 happens after 2, then 4 happens after 1. Problem sequences are: 2, 3, 4, 1 4, 1, 2,

4/17/ CS152-Spring’08 Sequential Consistency A Memory Model “ A system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in the order specified by the program” Leslie Lamport Sequential Consistency = arbitrary order-preserving interleaving of memory references of sequential programs M PPPPPP

4/17/ CS152-Spring’08 Sequential Consistency Sequential concurrent tasks:T1, T2 Shared variables:X, Y (initially X = 0, Y = 10) T1:T2: Store (X), 1 (X = 1) Load R 1, (Y) Store (Y), 11 (Y = 11) Store (Y’), R 1 (Y’= Y) Load R 2, (X) Store (X’), R 2 (X’= X) what are the legitimate answers for X’ and Y’ ? (X’,Y’)  {(1,11), (0,10), (1,10), (0,11)} ? If y is 11 then x cannot be 0

4/17/ CS152-Spring’08 Sequential Consistency Sequential consistency imposes more memory ordering constraints than those imposed by uniprocessor program dependencies ( ) What are these in our example ? T1:T2: Store (X), 1 (X = 1) Load R 1, (Y) Store (Y), 11 (Y = 11) Store (Y’), R 1 (Y’= Y) Load R 2, (X) Store (X’), R 2 (X’= X) additional SC requirements Does (can) a system with caches or out-of-order execution capability provide a sequentially consistent view of the memory ? more on this later

4/17/ CS152-Spring’08 Multiple Consumer Example Producer posting Item x: Load R tail, (tail) Store (R tail ), x R tail =R tail +1 Store (tail), R tail Consumer: Load R head, (head) spin:Load R tail, (tail) if R head ==R tail goto spin Load R, (R head ) R head =R head +1 Store (head), R head process(R) What is wrong with this code? Critical section: Needs to be executed atomically by one consumer  locks tailhead Producer R tail Consumer 1 RR head R tail Consumer 2 RR head R tail

4/17/ CS152-Spring’08 Locks or Semaphores E. W. Dijkstra, 1965 A semaphore is a non-negative integer, with the following operations: P(s): if s>0, decrement s by 1, otherwise wait V(s): increment s by 1 and wake up one of the waiting processes P’s and V’s must be executed atomically, i.e., without interruptions or interleaved accesses to s by other processors initial value of s determines the maximum no. of processes in the critical section Process i P(s) V(s)

4/17/ CS152-Spring’08 Implementation of Semaphores Semaphores (mutual exclusion) can be implemented using ordinary Load and Store instructions in the Sequential Consistency memory model. However, protocols for mutual exclusion are difficult to design... Simpler solution: atomic read-modify-write instructions Test&Set (m), R: R  M[m]; if R==0 then M[m] 1; Swap (m), R: R t  M[m]; M[m] R; R R t ; Fetch&Add (m), R V, R: R  M[m]; M[m] R + R V ; Examples: m is a memory location, R is a register

4/17/ CS152-Spring’08 CS152 Administrivia CS194-6 class description Quiz results Quiz common mistakes

4/17/ CS152-Spring’08 CS194-6 Digital Systems Project Laboratory Prerequisites: EECS 150 Units/Credit: 4, may be taken for a grade. Meeting times: M 10:30-12 (lecture), F (lab) Instructor: TBA Fall 2008 CS 194 is a capstone digital design project course. The projects are team projects, with 2-4 students per team. Teams propose a project in a detailed design document, implement the project in Verilog, map it to a state-of-the-art FPGA-based platform, and verify its correct operation using a test vector suite. Projects may be of two types: general-purpose computing systems based on a standard ISA (for example, a pipelined implementation of a subset of the MIPS ISA, with caches and a DRAM interface), or special-purpose digital computing systems (for example, a real-time engine to decode an MPEG video packet stream). This class requires a significant time commitment (we expect students to spend 200 hours on the project over the semester). Note that CS 194 is a pure project course (no exams or homework). Be sure to register for both CS 194 P 006 and CS 194 S 601.

4/17/ CS152-Spring’08 Common Mistakes Need physical registers to hold committed architectural registers plus any inflight destination values An instruction only allocates physical registers for destination, not sources Not every instruction needs a physical register for destination (branches, stores don’t have destination)

4/17/ CS152-Spring’08 Critical Section P: Test&Set (mutex),R temp if (R temp !=0) goto P Load R head, (head) spin:Load R tail, (tail) if R head ==R tail goto spin Load R, (R head ) R head =R head +1 Store (head), R head V: Store (mutex),0 process(R) Multiple Consumers Example using the Test&Set Instruction Other atomic read-modify-write instructions (Swap, Fetch&Add, etc.) can also implement P’s and V’s What if the process stops or is swapped out while in the critical section?

4/17/ CS152-Spring’08 Nonblocking Synchronization Compare&Swap(m), R t, R s : if (R t ==M[m]) then M[m]=R s ; R s =R t ; status success; elsestatus fail; try: Load R head, (head) spin:Load R tail, (tail) if R head ==R tail goto spin Load R, (R head ) R newhead = R head +1 Compare&Swap(head), R head, R newhead if (status==fail) goto try process(R) status is an implicit argument

4/17/ CS152-Spring’08 Load-reserve & Store-conditional Special register(s) to hold reservation flag and address, and the outcome of store-conditional try: Load-reserve R head, (head) spin:Load R tail, (tail) if R head ==R tail goto spin Load R, (R head ) R head = R head + 1 Store-conditional (head), R head if (status==fail) goto try process(R) Load-reserve R, (m):  ; R  M[m]; Store-conditional (m), R: if == then cancel other procs’ reservation on m; M[m] R; status succeed; else status fail;

4/17/ CS152-Spring’08 Performance of Locks Blocking atomic read-modify-write instructions e.g., Test&Set, Fetch&Add, Swap vs Non-blocking atomic read-modify-write instructions e.g., Compare&Swap, Load-reserve/Store-conditional vs Protocols based on ordinary Loads and Stores Performance depends on several interacting factors: degree of contention, caches, out-of-order execution of Loads and Stores later...

4/17/ CS152-Spring’08 Issues in Implementing Sequential Consistency Implementation of SC is complicated by two issues Out-of-order execution capability Load(a); Load(b)yes Load(a); Store(b)yes if a  b Store(a); Load(b)yes if a  b Store(a); Store(b)yes if a  b Caches Caches can prevent the effect of a store from being seen by other processors M PPPPPP

4/17/ CS152-Spring’08 Memory Fences Instructions to sequentialize memory accesses Processors with relaxed or weak memory models (i.e., permit Loads and Stores to different addresses to be reordered) need to provide memory fence instructions to force the serialization of memory accesses Examples of processors with relaxed memory models: Sparc V8 (TSO,PSO): Membar Sparc V9 (RMO): Membar #LoadLoad, Membar #LoadStore Membar #StoreLoad, Membar #StoreStore PowerPC (WO): Sync, EIEIO Memory fences are expensive operations, however, one pays the cost of serialization only when it is required

4/17/ CS152-Spring’08 Using Memory Fences Producer posting Item x: Load R tail, (tail) Store (R tail ), x Membar SS R tail =R tail +1 Store (tail), R tail Consumer: Load R head, (head) spin:Load R tail, (tail) if R head ==R tail goto spin Membar LL Load R, (R head ) R head =R head +1 Store (head), R head process(R) Producer Consumer tailhead R tail R head R ensures that tail ptr is not updated before x has been stored ensures that R is not loaded before x has been stored

4/17/ CS152-Spring’08 Data-Race Free Programs a.k.a. Properly Synchronized Programs Process 1... Acquire(mutex); Release(mutex); Process 2... Acquire(mutex); Release(mutex); Synchronization variables (e.g. mutex) are disjoint from data variables Accesses to writable shared data variables are protected in critical regions no data races except for locks (Formal definition is elusive) In general, it cannot be proven if a program is data-race free.

4/17/ CS152-Spring’08 Fences in Data-Race Free Programs Process 1... Acquire(mutex); membar; membar; Release(mutex); Process 2... Acquire(mutex); membar; membar; Release(mutex); Relaxed memory model allows reordering of instructions by the compiler or the processor as long as the reordering is not done across a fence The processor also should not speculate or prefetch across fences

4/17/ CS152-Spring’08 Mutual Exclusion Using Load/Store A protocol based on two shared variables c1 and c2. Initially, both c1 and c2 are 0 (not busy) What is wrong? Process 1... c1=1; L: if c2=1 then go to L c1=0; Process 2... c2=1; L: if c1=1 then go to L c2=0; Deadlock!

4/17/ CS152-Spring’08 Mutual Exclusion: second attempt To avoid deadlock, let a process give up the reservation (i.e. Process 1 sets c1 to 0) while waiting. Deadlock is not possible but with a low probability a livelock may occur. An unlucky process may never get to enter the critical section starvation Process 1... L: c1=1; if c2=1 then { c1=0; go to L} c1=0 Process 2... L: c2=1; if c1=1 then { c2=0; go to L} c2=0

4/17/ CS152-Spring’08 A Protocol for Mutual Exclusion T. Dekker, 1966 Process 1... c1=1; turn = 1; L: if c2=1 & turn=1 then go to L c1=0; A protocol based on 3 shared variables c1, c2 and turn. Initially, both c1 and c2 are 0 (not busy) turn = i ensures that only process i can wait variables c1 and c2 ensure mutual exclusion Solution for n processes was given by Dijkstra and is quite tricky! Process 2... c2=1; turn = 2; L: if c1=1 & turn=2 then go to L c2=0;

4/17/ CS152-Spring’08 Analysis of Dekker’s Algorithm... Process 1 c1=1; turn = 1; L: if c2=1 & turn=1 then go to L c1=0;... Process 2 c2=1; turn = 2; L: if c1=1 & turn=2 then go to L c2=0; Scenario 1... Process 1 c1=1; turn = 1; L: if c2=1 & turn=1 then go to L c1=0;... Process 2 c2=1; turn = 2; L: if c1=1 & turn=2 then go to L c2=0; Scenario 2

4/17/ CS152-Spring’08 N-process Mutual Exclusion Lamport’s Bakery Algorithm Process i choosing[i] = 1; num[i] = max(num[0], …, num[N-1]) + 1; choosing[i] = 0; for(j = 0; j < N; j++) { while( choosing[j] ); while( num[j] && ( ( num[j] < num[i] ) || ( num[j] == num[i] && j < i ) ) ); } num[i] = 0; Initially num[j] = 0, for all j Entry Code Exit Code

4/17/ CS152-Spring’08 Acknowledgements These slides contain material developed and copyright by: –Arvind (MIT) –Krste Asanovic (MIT/UCB) –Joel Emer (Intel/MIT) –James Hoe (CMU) –John Kubiatowicz (UCB) –David Patterson (UCB) MIT material derived from course UCB material derived from course CS252