1 Shared Memory Programming: Synchronization Primitives. Ing. Andrea Marongiu (a.marongiu@unibo.it). Includes slides from UC Berkeley course CS162, by Prof. Anthony D. Joseph and Prof. Ion Stoica, and from course CS194, by Prof. Katherine Yelick.

2 Shared Memory Programming. A program is a collection of threads of control, which in some languages can be created dynamically, mid-execution. Each thread has a set of private variables (e.g., local stack variables) and a set of shared variables (e.g., static variables, shared common blocks, or the global heap). Threads communicate implicitly by writing and reading shared variables, and they coordinate by synchronizing on shared variables. (Figure: processors P0, P1, ..., Pn, each with its own private memory, all reading and writing a common shared memory.)

3 Shared Memory code for computing a sum

static int s = 0;

Thread 1:                              Thread 2:
for i = 0, n/2-1                       for i = n/2, n-1
    s = s + sqr(A[i])                      s = s + sqr(A[i])

The problem is a race condition on the shared variable s. A race condition (or data race) occurs when:
- two processors (or two threads) access the same variable, and at least one of them writes it;
- the accesses are concurrent (not synchronized), so they can happen simultaneously.
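To make the race concrete, here is a minimal runnable sketch of the same computation using POSIX threads. It is an illustration, not part of the original slides: the array size N, the all-ones contents of A, and the two-way split are assumptions chosen so the expected result is easy to check.

/* Racy shared sum: compile with gcc -pthread race_sum.c (illustrative example). */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static int A[N];                   /* shared, read-only */
static long s = 0;                 /* shared accumulator: the race is on this variable */

static void *partial_sum(void *arg) {
    long half = (long)arg;                       /* 0 -> first half, 1 -> second half */
    for (long i = half * (N / 2); i < (half + 1) * (N / 2); i++)
        s = s + (long)A[i] * A[i];               /* read-modify-write of s, NOT atomic */
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) A[i] = 1;       /* squares are all 1, so the true sum is N */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, partial_sum, (void *)0);
    pthread_create(&t2, NULL, partial_sum, (void *)1);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("s = %ld (expected %d)\n", s, N);     /* typically prints less than N */
    return 0;
}

Because the two threads' read-modify-write sequences on s can interleave, updates are lost and the printed total is usually smaller than N.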

4 Shared Memory code for computing a sum

static int s = 0;

Thread 1:                                  Thread 2:
... compute f(A[i]) and put in reg0        ... compute f(A[i]) and put in reg0
reg1 = s                                   reg1 = s
reg1 = reg1 + reg0                         reg1 = reg1 + reg0
s = reg1                                   s = reg1

Assume A = [3,5], f is the square function, and s = 0 initially. For this program to work, s should be 34 at the end, but it may end up as 34, 9, or 25. The atomic operations are the individual reads and writes: you never see half of one number, but the += operation as a whole is not atomic. All computation happens in (private) registers.

5 Shared Memory code for computing a sum

static int s = 0;

Thread 1:                                  Thread 2:
local_s1 = 0                               local_s2 = 0
for i = 0, n/2-1                           for i = n/2, n-1
    local_s1 = local_s1 + sqr(A[i])            local_s2 = local_s2 + sqr(A[i])
s = s + local_s1                           s = s + local_s2

Since addition is associative, it is OK to rearrange the order of the additions. Right? Most of the computation is now on private variables, and the sharing frequency is also reduced, which might improve speed. But there is still a race condition on the update of the shared s: that update needs to be ATOMIC.

6 Atomic Operations. To understand a concurrent program, we need to know what the underlying indivisible operations are. An atomic operation is an operation that always runs to completion, or not at all: it is indivisible, so it cannot be stopped in the middle and its state cannot be modified by someone else in the middle. Atomic operations are the fundamental building block: without them, threads have no way to work together. On most machines, memory references and assignments (i.e., loads and stores) of words are atomic.
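As a hedged illustration of the difference between a plain read-modify-write and a true atomic operation, C11 provides atomic types whose fetch-and-add is indivisible (the counter names below are invented for the example):

#include <stdatomic.h>

static long plain_count = 0;          /* plain_count++ is a load, an add, and a store */
static atomic_long safe_count;        /* zero-initialized; updates below are indivisible */

void hit(void) {
    plain_count++;                        /* racy if called concurrently from several threads */
    atomic_fetch_add(&safe_count, 1);     /* a single atomic read-modify-write */
}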

7 Role of Synchronization. "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast." Types of synchronization: mutual exclusion, and event synchronization (point-to-point, group, or global, i.e., barriers). How much hardware support is needed? These are the most used forms of synchronization in shared memory parallel programming.

8 Motivation: "Too much milk". Example: people need to coordinate:

Time   Person A                      Person B
3:00   Look in fridge. Out of milk
3:05   Leave for store
3:10   Arrive at store               Look in fridge. Out of milk
3:15   Buy milk                      Leave for store
3:20   Arrive home, put milk away    Arrive at store
3:25                                 Buy milk
3:30                                 Arrive home, put milk away

9 Definitions. Synchronization: using atomic operations to ensure cooperation between threads. For now, only loads and stores are atomic, and it is hard to build anything useful with only reads and writes. Mutual exclusion: ensuring that only one thread does a particular thing at a time; one thread excludes the others while doing its task. Critical section: a piece of code that only one thread can execute at once. Critical section and mutual exclusion are two ways of describing the same thing; the critical section defines the sharing granularity.

10 More Definitions. Lock: prevents someone from doing something. Lock before entering a critical section and before accessing shared data; unlock when leaving, after accessing the shared data; wait if it is locked. Important idea: all synchronization involves waiting. Example: fix the milk problem by putting a lock on the refrigerator: lock it and take the key if you are going to go buy milk. This fixes too much (coarse granularity): your roommate gets angry if they only want orange juice.

11 Too Much Milk: Correctness properties. We need to be careful about the correctness of concurrent programs, since they are non-deterministic. Always write down the desired behavior first: think first, then code. What are the correctness properties for the "Too much milk" problem? Never more than one person buys, and someone buys if needed. We restrict ourselves to using only atomic load and store operations as building blocks.

12 Too Much Milk: Solution #1. Use a note to avoid buying too much milk: leave a note before buying (a kind of "lock"), remove the note after buying (a kind of "unlock"), and don't buy if there is a note (wait). Suppose a computer tries this (remember, only memory reads and writes are atomic):

if (noMilk) {
    if (noNote) {
        leave Note;
        buy milk;
        remove note;
    }
}

Result?

13 Too Much Milk: Solution #1

Thread A                      Thread B
if (noMilk)
    if (noNote) {
                              if (noMilk)
                                  if (noNote) {
                                      leave Note;
                                      buy milk;
                                      remove note;
                                  }
        leave Note;
        buy milk;
        remove note;
    }

Both threads see no note, so both buy milk. We need to atomically check and update the "lock" variable (the note).
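The same failure can be written down as a hedged C sketch: the flags milk and note are illustrative globals, only individual loads and stores of them are atomic, and the window marked in the comment is exactly where the interleaving above goes wrong.

static int milk = 0;     /* 0 = fridge is empty */
static int note = 0;     /* 0 = no note on the fridge */

void roommate(void) {
    if (milk == 0) {             /* look in the fridge */
        if (note == 0) {         /* <-- the other thread can run here and also see no note */
            note = 1;            /* leave note */
            /* ... go to the store ... */
            milk = 1;            /* buy milk */
            note = 0;            /* remove note */
        }
    }
}

If both threads pass the note check before either sets note = 1, both buy milk.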

14 How to Implement a Lock? A lock prevents someone from accessing something: lock before entering the critical section (e.g., before accessing shared data), unlock when leaving, after accessing the shared data, and wait if it is locked. Important idea: all synchronization involves waiting, and a thread should sleep if it will be waiting for a long time. Can hardware atomic instructions help?

15 Examples of hardware atomic instructions. Each of these executes atomically:

test&set (&address) {            /* most architectures */
    result = M[address];
    M[address] = 1;
    return result;
}

swap (&address, register) {      /* x86 */
    temp = M[address];
    M[address] = register;
    register = temp;
}

compare&swap (&address, reg1, reg2) {   /* 68000 */
    if (reg1 == M[address]) {
        M[address] = reg2;
        return success;
    } else {
        return failure;
    }
}
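For reference, these primitives can be reached from portable C through the GCC/Clang __atomic builtins; the wrapper names and the choice of memory orderings below are illustrative assumptions, not part of the slide.

#include <stdbool.h>

int test_and_set(int *addr) {
    /* atomically: old = *addr; *addr = 1; return old */
    return __atomic_exchange_n(addr, 1, __ATOMIC_ACQUIRE);
}

int atomic_swap(int *addr, int newval) {
    /* atomically: old = *addr; *addr = newval; return old */
    return __atomic_exchange_n(addr, newval, __ATOMIC_ACQ_REL);
}

bool compare_and_swap(int *addr, int expected, int newval) {
    /* atomically: if (*addr == expected) { *addr = newval; return true; } else return false */
    return __atomic_compare_exchange_n(addr, &expected, newval,
                                       false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}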

16 Implementing Locks with test&set. A simple solution:

int value = 0;   // Free

Acquire() {
    while (test&set(value));   // while busy
}

Release() {
    value = 0;
}

Simple explanation: if the lock is free, test&set reads 0 and sets value = 1, so the lock is now busy; it returns 0, so the while loop exits. If the lock is busy, test&set reads 1 and sets value = 1 (no change); it returns 1, so the while loop keeps spinning. When we set value = 0, someone else can get the lock. Recall: test&set (&address) { result = M[address]; M[address] = 1; return result; }
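The same spinlock can be expressed in standard C11 with atomic_flag, whose test-and-set operation is guaranteed to be lock-free; a minimal sketch (the function and variable names are illustrative):

#include <stdatomic.h>

static atomic_flag lock_word = ATOMIC_FLAG_INIT;   /* clear == free */

void acquire(void) {
    while (atomic_flag_test_and_set_explicit(&lock_word, memory_order_acquire))
        ;                                          /* spin while the lock is busy */
}

void release(void) {
    atomic_flag_clear_explicit(&lock_word, memory_order_release);
}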

17 Too Much Milk: Solution #2. Lock.Acquire() waits until the lock is free, then grabs it; Lock.Release() unlocks, waking up anyone waiting. These are atomic operations: if two threads are waiting for the lock, only one succeeds in grabbing it. Then our milk problem is easy:

milklock.Acquire();
if (nomilk)
    buy milk;
milklock.Release();

Once again, the section of code between Acquire() and Release() is called a "critical section".

18 Shared Memory code for computing a sum

static int s = 0;
static lock lk;

Thread 1:                                  Thread 2:
local_s1 = 0                               local_s2 = 0
for i = 0, n/2-1                           for i = n/2, n-1
    local_s1 = local_s1 + sqr(A[i])            local_s2 = local_s2 + sqr(A[i])
lock(lk);                                  lock(lk);
s = s + local_s1                           s = s + local_s2
unlock(lk);                                unlock(lk);

Since addition is associative, it is OK to rearrange the order of the additions. Most of the computation is on private variables, and the sharing frequency is reduced, which might improve speed. The remaining race condition on the update of the shared s is now resolved by putting the update in a critical section protected by lk.
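With POSIX threads the same structure might look like the sketch below; the pthread_mutex_t plays the role of lk, and the array size, its contents, and the two-way split are illustrative assumptions.

#include <pthread.h>

#define N 1000000
static int A[N];
static long s = 0;
static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;

struct range { long lo, hi; };

static void *partial_sum(void *arg) {
    struct range *r = arg;
    long local = 0;
    for (long i = r->lo; i < r->hi; i++)
        local += (long)A[i] * A[i];      /* all of the work is on private data */
    pthread_mutex_lock(&lk);             /* short critical section ... */
    s += local;                          /* ... only around the shared update */
    pthread_mutex_unlock(&lk);
    return NULL;
}

int main(void) {
    struct range halves[2] = { {0, N / 2}, {N / 2, N} };
    pthread_t t[2];
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, partial_sum, &halves[i]);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    return 0;
}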

19 Performance Criteria for Synchronization Operations. Latency (time per op): how long it takes if you always win, which matters especially under light contention. Bandwidth (ops per sec): how long it takes (averaged over threads) when many others are trying for it, which matters especially under high contention. Traffic: how many events are generated on shared resources (bus, crossbar, ...). Storage: how much memory is required. Fairness: can any one thread be "starved" and never get the lock?

20 Barriers. Software barrier algorithms are implemented using locks, flags, and counters. Hardware barriers use a wired-AND line separate from the address/data bus: set the input high when you arrive, and wait for the output to go high before leaving (in practice, multiple wires are used to allow reuse). Hardware barriers are useful when barriers are global and very frequent, but it is difficult to support an arbitrary subset of processors (even harder with multiple processes per processor) and difficult to dynamically change the number and identity of participants (the latter arises, e.g., with process migration). They are not common today on bus-based machines.

21 A Simple Centralized Barrier. A shared counter maintains the number of processes that have arrived: increment it on arrival (under a lock), then check the flag until all have reached the barrier.

struct bar_type {
    int counter;
    struct lock_type lock;
    int flag = 0;
} bar_name;

BARRIER (bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
        bar_name.flag = 0;               /* reset flag if first to reach */
    mycount = ++bar_name.counter;        /* mycount is private */
    UNLOCK(bar_name.lock);
    if (mycount == p) {                  /* last to arrive */
        bar_name.counter = 0;            /* reset for next barrier */
        bar_name.flag = 1;               /* release waiters */
    }
    else
        while (bar_name.flag == 0) {};   /* busy wait for release */
}

Problem?

22 A Working Centralized Barrier. Consecutively entering the same barrier doesn't work: we must prevent a process from entering until all have left the previous instance. We could use another counter, but that increases latency and contention. Instead, use sense reversal: wait for the flag to take a different value in consecutive barriers, and toggle that value only when all processes have reached the barrier.

BARRIER (bar_name, p) {
    local_sense = !(local_sense);            /* toggle private sense variable */
    LOCK(bar_name.lock);
    mycount = bar_name.counter++;            /* mycount is private */
    if (bar_name.counter == p) {             /* last to arrive */
        bar_name.counter = 0;                /* reset for the next barrier */
        UNLOCK(bar_name.lock);
        bar_name.flag = local_sense;         /* release waiters */
    }
    else {
        UNLOCK(bar_name.lock);
        while (bar_name.flag != local_sense) {};   /* busy wait for release */
    }
}
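For comparison, here is a hedged C11 sketch of the same sense-reversing barrier; it replaces the LOCK/UNLOCK pair with an atomic fetch-and-add, and the struct and variable names are illustrative rather than taken from the slides.

#include <stdatomic.h>
#include <stdbool.h>

struct barrier {
    atomic_int counter;    /* number of arrivals in the current episode */
    atomic_bool flag;      /* global sense */
};

static _Thread_local bool local_sense = false;     /* private per-thread sense */

void barrier_wait(struct barrier *b, int p) {
    local_sense = !local_sense;                        /* toggle private sense */
    if (atomic_fetch_add(&b->counter, 1) == p - 1) {   /* last to arrive */
        atomic_store(&b->counter, 0);                  /* reset for the next barrier */
        atomic_store(&b->flag, local_sense);           /* release the waiters */
    } else {
        while (atomic_load(&b->flag) != local_sense)
            ;                                          /* busy-wait on the shared flag */
    }
}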

23 Centralized Barrier Performance. Latency: the centralized barrier has a critical path length at least proportional to p. Traffic: about 3p bus transactions. Storage cost: very low, just a centralized counter and flag. Fairness: the same processor should not always be the last to exit the barrier; there is no such bias in the centralized scheme. The key problems for the centralized barrier are latency and traffic; especially with distributed memory, all the traffic goes to the same node.

24 Improved Barrier Algorithm: Master-Slave barrier. Separate the gather and release phases: the master core gathers the slaves at the barrier and then releases them, using separate, per-core polling flags for the different wait stages. Advantage: ordinary reads and writes (an array of flags) instead of locks, with 2x(p-1) messages exchanged over the network. This is valuable in a distributed network, where communication can follow different paths. (Figure: contention at the centralized barrier vs. the master-slave organization.) A sketch of this scheme follows below.
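A minimal sketch of the master-slave scheme with per-core polling flags; core 0 acting as master, the core count, and the flag arrays are illustrative assumptions, not the slide's actual implementation.

#include <stdatomic.h>

#define NCORES 8                          /* illustrative core count */
static atomic_int arrived[NCORES];        /* slave i sets arrived[i] when it reaches the barrier */
static atomic_int go[NCORES];             /* master sets go[i] to release slave i */

void ms_barrier(int core_id) {
    if (core_id == 0) {                               /* master: gather phase ... */
        for (int i = 1; i < NCORES; i++) {
            while (atomic_load(&arrived[i]) == 0) ;   /* poll slave i's arrival flag */
            atomic_store(&arrived[i], 0);             /* re-arm it for the next barrier */
        }
        for (int i = 1; i < NCORES; i++)              /* ... then release phase */
            atomic_store(&go[i], 1);
    } else {                                          /* slave */
        atomic_store(&arrived[core_id], 1);           /* signal arrival (one "message") */
        while (atomic_load(&go[core_id]) == 0) ;      /* poll own release flag */
        atomic_store(&go[core_id], 0);                /* consume the release for reuse */
    }
}

The p-1 arrival writes plus the p-1 release writes give the 2x(p-1) messages mentioned above, and each slave spins only on its own flag.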

25 Improved Barrier Algorithm. Not all messages have the same latency, so a locality-aware implementation is needed. What if the barrier is implemented on top of a NUMA (cluster-based) shared memory system, e.g., p2012? (Figure: master-slave barrier mapped onto clusters, each with memory, processors, a crossbar, and a network interface.)

26 Improved Barrier Algorithm: software combining tree. Only k processors access the same location, where k is the degree of the tree, so there is little contention. Use separate arrival and exit trees, and sense reversal. This is valuable in a distributed network, where communication can follow different paths. The latency is higher (log p steps of work, and O(p) serialized bus transactions), but the advantage is the use of ordinary reads/writes instead of locks. (Figure: contention at the centralized barrier vs. the combining tree.)


28 Improved Barrier Algorithm: hierarchical synchronization. A locality-aware implementation: what if the tree barrier is implemented on top of a NUMA (cluster-based) shared memory system, e.g., p2012? (Figure: tree barrier mapped onto clusters, each with memory, processors, a crossbar, and a network interface.)

29 Barrier performance

30 Parallel programming models. A programming model is made up of the languages and libraries that create an abstract view of the machine. Control: how is parallelism created? How are dependencies (orderings) enforced? Data: can data be shared, or is it all private? How is shared data accessed, or private data communicated? Synchronization: what operations can be used to coordinate parallelism? What are the atomic (indivisible) operations?

31 Parallel programming models. In this and the upcoming lectures we will look at different programming models and the features each provides with respect to control, data, and synchronization: Pthreads, OpenMP, and OpenCL.

