Synchronization
Todd C. Mowry
CS 740
November 1, 2000

Topics
- Locks
- Barriers
- Hardware primitives

Types of Synchronization

Mutual Exclusion
- Locks

Event Synchronization
- Global or group-based (barriers)
- Point-to-point

Busy Waiting vs. Blocking

Busy-waiting is preferable when:
- scheduling overhead is larger than the expected wait time
- processor resources are not needed for other tasks
- schedule-based blocking is inappropriate (e.g., in the OS kernel)

A Simple Lock

lock:    ld   register, location   /* read the lock value */
         cmp  register, #0         /* is it free (0)? */
         bnz  lock                 /* if not, spin */
         st   location, #1         /* mark it held */
         ret
unlock:  st   location, #0         /* mark it free */
         ret

Problem: the load and the store are separate operations, so two processors can both read 0 and both enter the critical section.

Need an Atomic Primitive!
- Test&Set
- Swap
- Fetch&Op: Fetch&Incr, Fetch&Decr
- Compare&Swap

Test&Set-Based Lock

lock:    t&s  register, location   /* atomically read location and set it to 1 */
         bnz  register, lock       /* if it was already held, retry */
         ret
unlock:  st   location, #0
         ret
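For concreteness, a minimal C11 sketch of the same test&set lock; atomic_flag_test_and_set is the test&set primitive, and the type and function names are choices of this sketch, not from the slide:

#include <stdatomic.h>

typedef struct { atomic_flag held; } ts_lock;

void ts_init(ts_lock *l) { atomic_flag_clear(&l->held); }

void ts_acquire(ts_lock *l) {
    /* test&set: returns the old value and leaves the flag set;
       every failed attempt is a bus/network transaction */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* spin */
}

void ts_release(ts_lock *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}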

T&S Lock Performance

Microbenchmark: lock; delay(c); unlock;
- same total number of lock calls as p increases
- measure the time per lock transfer

Test-and-Test-and-Set

A: while (lock != free)
       ;                          /* spin on the cached copy */
   if (test&set(lock) == free) {
       critical section;
   } else
       goto A;

(+) spinning happens in the cache
(-) can still generate a lot of traffic when many processors go to do the test&set
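The same idea as a minimal C11 sketch (names and memory orders are this sketch's choices): the inner load spins on the locally cached copy, and only when the lock looks free does the processor issue the expensive atomic exchange.

#include <stdatomic.h>

typedef struct { atomic_int held; } ttas_lock;  /* 0 = free, 1 = held */

void ttas_acquire(ttas_lock *l) {
    for (;;) {
        /* test: spin in the cache, generating no traffic while the lock is held */
        while (atomic_load_explicit(&l->held, memory_order_relaxed) != 0)
            ;
        /* test&set: one transaction; wins only if the lock is still free */
        if (atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0)
            return;
    }
}

void ttas_release(ttas_lock *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}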

Test and Set with Backoff

Upon failure, delay for a while before retrying:
- either constant delay or exponential backoff

Tradeoffs:
(+) much less network traffic
(-) exponential backoff can cause starvation for high-contention locks, since new requestors back off for shorter times

But exponential backoff is found to work best in practice.
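A sketch of how exponential backoff attaches to the acquire loop, reusing the ttas_lock above (the cap of 1024 and the busy-wait delay loop are arbitrary choices for illustration):

void ttas_backoff_acquire(ttas_lock *l) {
    unsigned delay = 1;  /* delay-loop iterations; doubles on each failure */
    while (atomic_exchange_explicit(&l->held, 1, memory_order_acquire) != 0) {
        for (volatile unsigned i = 0; i < delay; i++)
            ;  /* back off before retrying */
        if (delay < 1024)
            delay *= 2;  /* cap the backoff so waiters keep retrying */
    }
}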

Test and Set with Update

Test&Set sends updates to the processors that cache the lock.

Tradeoffs:
(+) good for bus-based machines
(-) still lots of traffic on distributed networks

The main problem with all test&set-based schemes: every lock release causes all waiters to issue a test&set in an attempt to grab the lock.

Ticket Lock (fetch&incr based)

Two counters:
- next_ticket (number of requestors)
- now_serving (number of releases that have happened)

Algorithm:
- to acquire, do a fetch&incr on next_ticket (not a test&set)
- when a release happens, poll the value of now_serving
  - if it equals my_ticket, then I win
- use a delay between polls; but how much?
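A minimal C11 sketch of the ticket lock, with fetch&incr expressed as atomic_fetch_add (names are this sketch's; the delay between polls is left as a comment):

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;  /* incremented once per requestor */
    atomic_uint now_serving;  /* incremented once per release */
} ticket_lock;

void ticket_acquire(ticket_lock *l) {
    /* fetch&incr hands out tickets in FIFO order */
    unsigned me = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    /* poll now_serving until my ticket comes up; a delay proportional
       to (me - now_serving) could be inserted in this loop */
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != me)
        ;
}

void ticket_release(ticket_lock *l) {
    /* only the lock holder writes now_serving, so a plain increment suffices */
    atomic_store_explicit(&l->now_serving,
        atomic_load_explicit(&l->now_serving, memory_order_relaxed) + 1,
        memory_order_release);
}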

Ticket Lock Tradeoffs
(+) guaranteed FIFO order; no starvation possible
(+) latency can be low if fetch&incr is cacheable
(+) traffic can be quite low
(-) but traffic is not guaranteed to be O(1) per lock acquire

Array-Based Queueing Locks

Every process spins on a unique location, rather than on a single now_serving counter:
- fetch&incr gives each process the address on which to spin

Tradeoffs:
(+) guarantees FIFO order (like the ticket lock)
(+) O(1) traffic with coherent caches (unlike the ticket lock)
(-) requires space per lock proportional to P
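A sketch of an array-based queueing lock in C11 (the slot count, the padding to one cache line per slot, and all names are assumptions of this sketch; it also assumes at most MAX_PROCS concurrent waiters and 64-byte cache lines):

#include <stdatomic.h>

#define MAX_PROCS 64

typedef struct {
    /* one flag per slot, padded so each slot sits in its own cache line */
    struct { atomic_int has_lock; char pad[60]; } slot[MAX_PROCS];
    atomic_uint next_slot;
} array_lock;

void array_init(array_lock *l) {
    for (int i = 0; i < MAX_PROCS; i++)
        atomic_init(&l->slot[i].has_lock, i == 0);  /* slot 0 starts as holder */
    atomic_init(&l->next_slot, 0);
}

/* returns my slot index; the caller passes it back to array_release */
unsigned array_acquire(array_lock *l) {
    /* fetch&incr assigns me a unique location to spin on */
    unsigned me = atomic_fetch_add_explicit(&l->next_slot, 1,
                                            memory_order_relaxed) % MAX_PROCS;
    while (!atomic_load_explicit(&l->slot[me].has_lock, memory_order_acquire))
        ;  /* spin only on my own cache line */
    return me;
}

void array_release(array_lock *l, unsigned me) {
    atomic_store_explicit(&l->slot[me].has_lock, 0, memory_order_relaxed);
    /* hand the lock directly to the next slot in FIFO order */
    atomic_store_explicit(&l->slot[(me + 1) % MAX_PROCS].has_lock, 1,
                          memory_order_release);
}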

List-Based Queueing Locks (MCS)

All the good properties above, plus O(1) traffic even without coherent caches (waiters spin locally):
- uses compare&swap to build linked lists in software
- a locally-allocated flag per list node to spin on
- can work with fetch&store instead, but then loses the FIFO guarantee

Tradeoffs:
(+) less storage than array-based locks
(+) O(1) traffic even without coherent caches
(-) compare&swap is not easy to implement
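A C11 sketch of the MCS lock (names are this sketch's; each caller supplies its own locally-allocated node). The enqueue uses an atomic exchange (fetch&store) on the tail, and the release uses compare&swap to detect an empty queue, matching the primitives above:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    struct mcs_node *_Atomic next;
    atomic_bool locked;               /* each waiter spins on its own node */
} mcs_node;

typedef struct { mcs_node *_Atomic tail; } mcs_lock;

void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* fetch&store: atomically append myself at the tail of the queue */
    mcs_node *pred = atomic_exchange_explicit(&l->tail, me,
                                              memory_order_acq_rel);
    if (pred != NULL) {
        /* lock is held: link behind my predecessor and spin locally */
        atomic_store_explicit(&pred->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;
    }
}

void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* no visible successor: compare&swap the tail back to empty */
        mcs_node *expect = me;
        if (atomic_compare_exchange_strong_explicit(&l->tail, &expect, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* a successor is mid-enqueue: wait for it to link itself in */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}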

Implementing Fetch&Op

Load Linked / Store Conditional:

lock:    ll    reg1, location    /* LL location into reg1 */
         bnz   reg1, lock        /* check if location is already locked */
         mov   reg2, #1          /* value meaning "locked" */
         sc    location, reg2    /* SC reg2 into location */
         beqz  reg2, lock        /* if SC failed, start again */
         ret
unlock:  st    location, #0      /* write 0 to location */
         ret
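C11 exposes the same retry pattern portably as a compare-exchange loop: the weak compare-exchange may fail spuriously, just like SC, so it sits in a loop. A sketch of fetch&incr built this way (the function name is an assumption of the sketch):

#include <stdatomic.h>

/* returns the old value, like fetch&incr */
int fetch_and_incr(atomic_int *loc) {
    int old = atomic_load_explicit(loc, memory_order_relaxed);
    /* on failure, compare-exchange reloads the current value into old,
       so the loop simply retries, as with LL/SC */
    while (!atomic_compare_exchange_weak_explicit(loc, &old, old + 1,
               memory_order_acq_rel, memory_order_relaxed))
        ;
    return old;
}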

Barriers

We will discuss five barriers:
- centralized
- software combining tree
- dissemination barrier
- tournament barrier
- MCS tree-based barrier

Centralized Barrier

Basic idea:
- notify a single shared counter when you arrive
- poll that shared location until all have arrived

A simple implementation requires polling/spinning twice:
- first, to ensure that all procs have left the previous barrier
- second, to ensure that all procs have arrived at the current barrier

Solution to get a single spin: sense reversal.
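A C11 sketch of the sense-reversing centralized barrier (names are this sketch's; count starts at 0, the global sense starts false, and each thread's local_sense starts true):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;   /* number of arrivals so far */
    atomic_bool sense;   /* flips once per barrier episode */
    int         nprocs;
} central_barrier;

void barrier_wait(central_barrier *b, bool *local_sense) {
    bool my_sense = *local_sense;
    if (atomic_fetch_add_explicit(&b->count, 1, memory_order_acq_rel)
            == b->nprocs - 1) {
        /* last to arrive: reset the counter, then release everyone */
        atomic_store_explicit(&b->count, 0, memory_order_relaxed);
        atomic_store_explicit(&b->sense, my_sense, memory_order_release);
    } else {
        /* the single spin: wait for the global sense to flip to mine */
        while (atomic_load_explicit(&b->sense, memory_order_acquire)
               != my_sense)
            ;
    }
    *local_sense = !my_sense;  /* sense reversal for the next episode */
}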

Software Combining Tree Barrier

- writes go into one tree for barrier arrival
- reads from another tree allow procs to continue
- sense reversal distinguishes consecutive barriers

Dissemination Barrier

- log P rounds of synchronization
- in round k, proc i synchronizes with proc (i + 2^k) mod P
- advantage: flags can be statically allocated, avoiding remote spinning
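A C11 sketch of the dissemination barrier (P, ROUNDS, and all names are assumptions of this sketch; the flags are statically allocated so each processor spins only on its own flag, and sense reversal lets the flags be reused across episodes):

#include <stdatomic.h>
#include <stdbool.h>

#define P      8   /* number of processors (assumption) */
#define ROUNDS 3   /* ceil(log2 P) */

/* flags[k][i] is set by proc (i - 2^k) mod P in round k and awaited by proc i */
static atomic_bool flags[ROUNDS][P];  /* all initially false */

/* each proc keeps its own sense flag, initially true */
void dissemination_barrier(int i, bool *sense) {
    bool s = *sense;
    for (int k = 0; k < ROUNDS; k++) {
        int partner = (i + (1 << k)) % P;
        /* signal my round-k partner, then wait for my own round-k flag */
        atomic_store_explicit(&flags[k][partner], s, memory_order_release);
        while (atomic_load_explicit(&flags[k][i], memory_order_acquire) != s)
            ;
    }
    *sense = !s;  /* sense reversal for the next barrier episode */
}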

Tournament Barrier

- binary combining tree
- the representative processor at a node is statically chosen, so no fetch&op is needed
- in round k, proc i = 2^k sets a flag for proc j = i - 2^k
  - i then drops out of the tournament, and j proceeds to the next round
- i waits for a global flag signalling completion of the barrier to be set
  - could use a combining wakeup tree instead

MCS Software Barrier

Modifies the tournament barrier to allow static allocation in the wakeup tree, and to use sense reversal.

Every processor is a node in two P-node trees:
- it has pointers to its parent, building a fanin-4 arrival tree
- it has pointers to its children, building a fanout-2 wakeup tree

Barrier Recommendations

Criteria:
- length of the critical path
- number of network transactions
- space requirements
- atomic operation requirements

Space Requirements
- centralized: constant
- MCS, combining tree: O(P)
- dissemination, tournament: O(P log P)

Network Transactions
- centralized, combining tree: O(P) with broadcast and coherent caches; unbounded otherwise
- dissemination: O(P log P)
- tournament, MCS: O(P)

Critical Path Length
- if independent parallel network paths are available: all are O(log P), except centralized, which is O(P)
- otherwise (e.g., on a shared bus): linear factors dominate

Primitives Needed
- centralized and combining tree: atomic increment, atomic decrement
- others: atomic read, atomic write

Barrier Recommendations

Without broadcast on distributed memory:
- use the dissemination barrier
  - MCS is also good; only its critical path length is about 1.5X longer
  - MCS has somewhat better network load and space requirements

With broadcast-based cache coherence (e.g., a bus):
- use MCS with flag wakeup
  - centralized is best for modest numbers of processors

Big advantage of the centralized barrier: it adapts to a changing number of processors across barrier calls.