1 Synchronization

"A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."

Types of Synchronization
Mutual Exclusion
Event synchronization
– point-to-point
– group
– global (barriers)

2 Ticket Lock

Only one read-modify-write (from only one processor) per acquire
Works like the waiting line at a deli or bank
Two counters per lock (next_ticket, now_serving)
Acquire: fetch&inc next_ticket; wait for now_serving to equal it
– atomic op when arriving at the lock, not when it is free (so less contention)
Release: increment now_serving
FIFO order, low latency for low contention if fetch&inc is cacheable
Still O(p) read misses at release, since all spin on the same variable
– like the simple LL-SC lock, but no invalidation when the SC succeeds, and fair
Can be difficult to find a good amount to delay on backoff
– exponential backoff is not a good idea due to FIFO order
– backoff proportional to my_ticket - now_serving may work well
Wouldn't it be nice to poll different locations...
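
To make the mechanics concrete, here is a minimal ticket-lock sketch using C11 atomics; the type and function names are illustrative, not from the slides.

    #include <stdatomic.h>

    /* Two counters per lock, both initialized to zero. */
    typedef struct {
        atomic_uint next_ticket;   /* next ticket to hand out */
        atomic_uint now_serving;   /* ticket currently holding the lock */
    } ticket_lock_t;

    void ticket_acquire(ticket_lock_t *l) {
        /* the only atomic read-modify-write, performed at arrival */
        unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
        while (atomic_load(&l->now_serving) != my_ticket)
            ;  /* spin; could back off proportional to my_ticket - now_serving */
    }

    void ticket_release(ticket_lock_t *l) {
        /* only the holder writes now_serving, so a plain increment suffices */
        atomic_store(&l->now_serving, atomic_load(&l->now_serving) + 1);
    }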

3 Array-based Queuing Locks

Waiting processes poll on different locations in an array of size p
Acquire
– fetch&inc to obtain the address on which to spin (the next array element)
– ensure that these addresses are in different cache lines or memories
Release
– set the next location in the array, thus waking up the process spinning on it
O(1) traffic per acquire with coherent caches
FIFO ordering, as in the ticket lock
But O(p) space per lock
Good performance for bus-based machines
Not so great for non-cache-coherent machines with distributed memory
– the array location I spin on is not necessarily in my local memory (solution later)
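
A minimal array-based queueing lock sketch in the style described above; NSLOTS and LINE are illustrative constants (it assumes at most NSLOTS concurrent contenders), and the names are ours, not from the slides.

    #include <stdatomic.h>

    #define NSLOTS 64   /* max concurrent contenders (assumption) */
    #define LINE   64   /* cache-line size in bytes (assumption) */

    typedef struct {
        struct {
            atomic_int must_wait;                 /* 1 = wait, 0 = lock is yours */
            char pad[LINE - sizeof(atomic_int)];  /* one flag per cache line */
        } slot[NSLOTS];
        atomic_uint next_slot;                    /* fetch&inc hands out spin slots */
    } array_lock_t;

    void array_lock_init(array_lock_t *l) {
        for (int i = 0; i < NSLOTS; i++)
            atomic_store(&l->slot[i].must_wait, i != 0);  /* slot 0 starts free */
        atomic_store(&l->next_slot, 0);
    }

    unsigned array_acquire(array_lock_t *l) {
        unsigned my = atomic_fetch_add(&l->next_slot, 1) % NSLOTS;
        while (atomic_load(&l->slot[my].must_wait))
            ;                                     /* spin on my own cache line */
        return my;                                /* caller passes this to release */
    }

    void array_release(array_lock_t *l, unsigned my) {
        atomic_store(&l->slot[my].must_wait, 1);                /* re-arm my slot */
        atomic_store(&l->slot[(my + 1) % NSLOTS].must_wait, 0); /* wake successor */
    }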

4 Array-based Lock on Scalable Machines

No bus
Avoid spinning on locations that are not in local memory
Useful (especially) for machines such as the Cray T3E
– no coherent cache
Solution based on distributed queues:
– each node in the linked list is stored on the processor that requested the lock
– head and tail pointers
– each process spins on its own local node
– on release: wake up the next process by writing the flag in the next processor's node
On cache-coherent machines: see Section 8.8 (L.V. Kale)
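
Below is a sketch of such a list-based queue lock in the spirit of the MCS lock, written with C11 atomics; on a distributed-memory machine each qnode would be allocated in the requesting processor's local memory. Names are illustrative, not from the slides.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct qnode {
        _Atomic(struct qnode *) next;
        atomic_bool locked;
    } qnode_t;

    typedef _Atomic(qnode_t *) mcs_lock_t;       /* tail pointer; NULL = free */

    void mcs_acquire(mcs_lock_t *lock, qnode_t *me) {
        atomic_store(&me->next, NULL);
        qnode_t *prev = atomic_exchange(lock, me);   /* swap self into the tail */
        if (prev != NULL) {                      /* queue nonempty: link and spin */
            atomic_store(&me->locked, true);
            atomic_store(&prev->next, me);
            while (atomic_load(&me->locked))
                ;                                /* spin on my own (local) node */
        }
    }

    void mcs_release(mcs_lock_t *lock, qnode_t *me) {
        qnode_t *succ = atomic_load(&me->next);
        if (succ == NULL) {
            qnode_t *expected = me;
            if (atomic_compare_exchange_strong(lock, &expected, NULL))
                return;                          /* no waiter: lock is now free */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;                                /* successor is mid-link: wait */
        }
        atomic_store(&succ->locked, false);      /* hand the lock to the next node */
    }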

5 Point-to-Point Event Synchronization

Software methods:
– interrupts
– busy-waiting: use ordinary variables as flags
– blocking: use semaphores
Full hardware support: a full-empty bit with each word in memory
– set when the word is full with newly produced data (i.e. when written)
– unset when the word is empty due to being consumed (i.e. when read)
Natural for word-level producer-consumer synchronization
– producer: write if empty, set to full; consumer: read if full, set to empty
Hardware preserves atomicity of the bit manipulation with the read or write
Problem: flexibility
– multiple consumers, or multiple writes before the consumer reads?
– needs language support to specify when to use it
– composite data structures?
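
In software, the busy-waiting flag method can mimic the full-empty semantics; a minimal C11 sketch with illustrative names, assuming a single producer and a single consumer:

    #include <stdatomic.h>

    int data;               /* the "word" being produced and consumed */
    atomic_int full = 0;    /* software full-empty bit: 0 = empty, 1 = full */

    void producer(void) {
        while (atomic_load_explicit(&full, memory_order_acquire))
            ;                                   /* write only if empty */
        data = 42;                              /* produce the value */
        atomic_store_explicit(&full, 1, memory_order_release);  /* set full */
    }

    int consumer(void) {
        while (!atomic_load_explicit(&full, memory_order_acquire))
            ;                                   /* read only if full */
        int v = data;                           /* consume the value */
        atomic_store_explicit(&full, 0, memory_order_release);  /* set empty */
        return v;
    }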

6 Barriers

Software algorithms implemented using locks, flags, counters
Hardware barriers
– wired-AND line separate from the address/data bus
– set input high on arrival, wait for the output to be high to leave
– in practice, multiple wires to allow reuse
– useful when barriers are global and very frequent
– difficult to support an arbitrary subset of processors
– even harder with multiple processes per processor
– difficult to dynamically change the number and identity of participants
– e.g. the latter due to process migration
– not common today on bus-based machines
Let's look at software algorithms with simple hardware primitives

7 A Simple Centralized Barrier

Shared counter maintains the number of processes that have arrived
– increment on arrival (under a lock), check until it reaches numprocs

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag = 0;
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;              /* reset flag if first to reach */
        mycount = bar_name.counter++;       /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p - 1) {             /* last to arrive */
            bar_name.counter = 0;           /* reset for next barrier */
            bar_name.flag = 1;              /* release waiters */
        } else
            while (bar_name.flag == 0) {};  /* busy wait for release */
    }

8 A Working Centralized Barrier

Consecutively entering the same barrier doesn't work
Must prevent a process from entering until all have left the previous instance
Could use another counter, but that increases latency and contention
Sense reversal: wait for the flag to take a different value in consecutive instances
Toggle this value only when all processes have reached the barrier

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);       /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = bar_name.counter++;       /* mycount is private */
        if (bar_name.counter == p) {        /* last to arrive */
            bar_name.counter = 0;           /* reset for next instance */
            UNLOCK(bar_name.lock);
            bar_name.flag = local_sense;    /* release waiters */
        } else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {};  /* busy wait */
        }
    }
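
For reference, here is a runnable C11-atomics rendering of the same sense-reversing barrier; it uses fetch&inc in place of an explicit lock and a thread-local sense variable. A sketch with illustrative names, not the slides' exact code:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_int counter;    /* arrivals this episode */
        atomic_bool flag;      /* release flag, sense-reversed */
    } barrier_t;

    static _Thread_local bool local_sense = false;

    void barrier_wait(barrier_t *b, int p) {
        local_sense = !local_sense;              /* toggle private sense */
        if (atomic_fetch_add(&b->counter, 1) == p - 1) {
            atomic_store(&b->counter, 0);        /* last: reset for reuse */
            atomic_store(&b->flag, local_sense); /* release the waiters */
        } else {
            while (atomic_load(&b->flag) != local_sense)
                ;                                /* busy-wait for release */
        }
    }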

9 Centralized Barrier Performance

Latency
– want a short critical path in the barrier
– centralized has critical path length at least proportional to p
Traffic
– barriers are likely to be highly contended, so want traffic that scales well
– about 3p bus transactions in the centralized barrier
Storage cost
– very low: a centralized counter and flag
Fairness
– the same processor should not always be last to exit the barrier
– no such bias in the centralized barrier
Key problems for the centralized barrier are latency and traffic
– especially with distributed memory, all traffic goes to the same node

10 Improved Barrier Algorithms for a Bus

Separate arrival and exit trees, and use sense reversal
– valuable in a distributed network: communicate along different paths
– on a bus, all traffic goes on the same bus, and there is no less total traffic
– higher latency (log p steps of work, and O(p) serialized bus transactions)
– the advantage on a bus is the use of ordinary reads/writes instead of locks
Software combining tree (sketched below)
– only k processors access the same location, where k is the degree of the tree
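
A minimal software combining-tree barrier sketch with sense reversal; building the tree (linking parents, assigning k processors per leaf) is assumed to be done by the caller, and all names are illustrative:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct cnode {
        int k;                    /* fan-in: arrivals expected at this node */
        atomic_int count;         /* arrivals so far this episode */
        atomic_bool sense;        /* release flag, sense-reversed */
        struct cnode *parent;     /* NULL at the root */
    } cnode_t;

    static _Thread_local bool my_sense = false;

    static void combine(cnode_t *n, bool s) {
        if (atomic_fetch_add(&n->count, 1) == n->k - 1) {
            if (n->parent)                    /* last arrival: go up the tree */
                combine(n->parent, s);
            atomic_store(&n->count, 0);       /* reset for the next episode */
            atomic_store(&n->sense, s);       /* wake this node's waiters */
        } else {
            while (atomic_load(&n->sense) != s)
                ;                             /* spin at this node only */
        }
    }

    void combining_barrier(cnode_t *my_leaf) {
        my_sense = !my_sense;                 /* toggle thread-local sense */
        combine(my_leaf, my_sense);
    }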

11 Dissemination Barrier

Goal is to allow statically allocated flags
– avoid remote spinning even without cache coherence
log p rounds of synchronization
In round k, proc i synchronizes with proc (i + 2^k) mod p
– can statically allocate flags to avoid remote spinning
Like a butterfly network
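
A dissemination-barrier sketch with statically allocated, counting flags (counting makes reuse safe even when consecutive episodes overlap); P and ROUNDS are illustrative constants:

    #include <stdatomic.h>

    #define P      8    /* number of threads (assumption) */
    #define ROUNDS 3    /* ceil(log2(P)) */

    /* flags[i][r] is the statically allocated flag that thread i spins on
       in round r; only thread (i - 2^r) mod P ever signals it. */
    static atomic_int flags[P][ROUNDS];

    void dissemination_barrier(int i) {
        for (int r = 0; r < ROUNDS; r++) {
            int partner = (i + (1 << r)) % P;
            atomic_fetch_add(&flags[partner][r], 1);   /* signal partner */
            while (atomic_load(&flags[i][r]) == 0)
                ;                                      /* await my signal */
            atomic_fetch_sub(&flags[i][r], 1);         /* consume it */
        }
    }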

12 Tournament Barrier

Like a binary combining tree
But the representative processor at a node is chosen statically
– no fetch-and-op needed
In round k, proc i (with i mod 2^(k+1) = 2^k) sets a flag for proc j = i - 2^k
– i then drops out of the tournament and j proceeds in the next round
– i waits for the global flag signalling completion of the barrier to be set by the root
– could use a combining wakeup tree
Without coherent caches and broadcast, suffers either traffic due to the single flag or the same problem as combining trees (for wakeup)
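
A tournament-barrier sketch along these lines, assuming P is a power of two and using a sense-reversed global release flag rather than a wakeup tree; names and constants are illustrative:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define P      8                     /* number of threads (assumption) */
    #define ROUNDS 3                     /* log2(P) */

    static atomic_int arrived[P][ROUNDS];    /* loser-to-winner flags */
    static atomic_bool champion_flag;        /* global release flag */
    static _Thread_local bool my_sense = false;

    void tournament_barrier(int i) {
        my_sense = !my_sense;
        for (int r = 0; r < ROUNDS; r++) {
            if (i % (1 << (r + 1)) != 0) {          /* loser this round */
                atomic_fetch_add(&arrived[i - (1 << r)][r], 1);
                while (atomic_load(&champion_flag) != my_sense)
                    ;                    /* wait for the global release */
                return;
            }
            while (atomic_load(&arrived[i][r]) == 0)
                ;                        /* winner: wait for this round's loser */
            atomic_fetch_sub(&arrived[i][r], 1);    /* consume for reuse */
        }
        atomic_store(&champion_flag, my_sense);     /* proc 0: release everyone */
    }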

13 MCS Barrier

Modifies the tournament barrier to allow static allocation in the wakeup tree, and to use sense reversal
Every processor is a node in two p-node trees
– has pointers to its parent, building a fan-in-4 arrival tree
– has pointers to its children, building a fan-out-2 wakeup tree
+ spins on local flag variables
+ requires O(P) space for P processors
+ theoretical minimum number of network transactions (2P - 2)
+ O(log P) network transactions on the critical path

14 Network Transactions

Centralized, combining tree: O(p) if broadcast and coherent caches; unbounded otherwise
Dissemination: O(p log p)
Tournament, MCS: O(p)

15 Critical Path Length

If independent parallel network paths are available: all are O(log P) except the centralized barrier, which is O(P)
If not (e.g. a shared bus): linear terms dominate

16 Synchronization Summary

Rich interaction of hardware-software tradeoffs
Must evaluate hardware primitives and software algorithms together
– the primitives determine which algorithms perform well
Evaluation methodology is challenging
– use of delays, microbenchmarks
– should use both microbenchmarks and real workloads
Simple software algorithms with common hardware primitives do well on a bus
We will see more sophisticated techniques for distributed machines
Hardware support is still a subject of debate
– theoretical research argues for swap or compare&swap, not fetch&op
– algorithms exist that ensure constant-time access, but they are complex

17 Reading Assignment (CS433)

Read Sections 7.9 and 8.8: synchronization issues on scalable multiprocessors and directory-based schemes