Presentation on theme: "Concurrency". Presentation transcript:
Motivations To capture the logical structure of a problem Servers, graphical applications To exploit extra processors, for speed Ubiquitous multi-core processors To cope with separate physical devices Internet applications
HTC vs HPC High throughput computing: environments that can deliver large amounts of processing capacity over long periods of time (e.g., Condor). High performance computing: uses supercomputers and computer clusters to solve advanced computation problems. (Slide diagram: DACTAL, Condor, concurrent application.)
Concurrency Any system in which two or more tasks may be underway at the same time (at an unpredictable point in their execution) Parallel: more than one task physically active Requires multiple processors Distributed: processors are associated with people or devices that are physically separated from one another in the real world
Levels of Concurrency Instruction level Two or more machine instructions Statement level Two or more source language statements Unit level Two or more subprogram units Program level Two or more programs
Fundamental Concepts A task or process is a program unit that can be in concurrent execution with other program units Tasks differ from ordinary subprograms in that: A task may be implicitly started When a program unit starts the execution of a task, it is not necessarily suspended When a task’s execution is completed, control may not return to the caller Tasks usually work together
Task Categories Heavyweight tasks Execute in their own address space and have their own run-time stacks Lightweight tasks All run in the same address space, though each has its own run-time stack A task is disjoint if it does not communicate with or affect the execution of any other task in the program in any way
Synchronization A mechanism that controls the order in which tasks execute Cooperation: Task A must wait for task B to complete some specific activity before task A can continue its execution e.g. the producer-consumer problem Competition: Two or more tasks must use some resource that cannot be simultaneously used e.g. a shared counter, dining philosophers Competition synchronization is usually provided by mutually exclusive access
The Producer-Consumer Problem There are a number of "classic" synchronization problems, one of which is the producer-consumer problem: M producers put items into a shared buffer, and N consumers remove items from it. In the bounded-buffer version, the buffer has some fixed capacity. (Slide diagram: producer 1 … producer M feeding the buffer; consumer 1 … consumer N draining it.) Accesses to the buffer must be synchronized, because if multiple producers and/or consumers access it simultaneously, values may get lost, retrieved twice, etc. The problem is to devise a solution that synchronizes the producer & consumer accesses to the buffer.
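A minimal bounded-buffer sketch, written here in Java (which this transcript uses later for locks and monitors). The class name `BoundedBuffer` is hypothetical, not from the slides; `put()` blocks while the buffer is full and `take()` blocks while it is empty, so producers and consumers stay synchronized.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: a bounded buffer as a monitor-style Java class.
class BoundedBuffer<T> {
    private final Deque<T> items = new ArrayDeque<>();
    private final int capacity;

    BoundedBuffer(int capacity) { this.capacity = capacity; }

    public synchronized void put(T item) throws InterruptedException {
        while (items.size() == capacity) wait();   // full: producer waits
        items.addLast(item);
        notifyAll();                               // wake any waiting consumers
    }

    public synchronized T take() throws InterruptedException {
        while (items.isEmpty()) wait();            // empty: consumer waits
        T item = items.removeFirst();
        notifyAll();                               // wake any waiting producers
        return item;
    }
}
```

Note the `while` loops around `wait()`: a woken thread rechecks the condition, since another thread may have consumed the slot or item first.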
Dining Philosophers Five philosophers sit at a table, alternating between eating noodles and thinking. In order to eat, a philosopher must have two chopsticks. However, there is a single chopstick between each pair of plates, so if one is eating, neither neighbor can eat. A philosopher puts down both chopsticks when thinking. Devise a solution that ensures: no philosopher starves; and a hungry philosopher is only prevented from eating by his neighbor(s). A naive attempt:
philosopher := [
    [true] whileTrue: [
        self get: left.
        self get: right.
        self eat.
        self release: left.
        self release: right.
        self think. ] ]
Deadlock! (If every philosopher picks up his left chopstick at once, each waits forever for his right.)
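One standard fix, sketched in Java rather than the slides' Smalltalk: number the chopsticks and have every philosopher acquire the lower-numbered one first. A global acquisition order breaks the circular wait, so the deadlock above cannot occur. The class `DiningTable` and method `dine` are hypothetical names for illustration.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: deadlock avoidance via ordered lock acquisition.
class DiningTable {
    // Runs n philosophers for `meals` rounds each; returns total meals eaten.
    static int dine(int n, int meals) throws InterruptedException {
        ReentrantLock[] sticks = new ReentrantLock[n];
        for (int i = 0; i < n; i++) sticks[i] = new ReentrantLock();
        int[] eaten = new int[n];
        Thread[] ts = new Thread[n];
        for (int i = 0; i < n; i++) {
            final int id = i;
            // Always take the lower-numbered chopstick first,
            // regardless of which side of the plate it is on.
            final ReentrantLock first  = sticks[Math.min(id, (id + 1) % n)];
            final ReentrantLock second = sticks[Math.max(id, (id + 1) % n)];
            ts[i] = new Thread(() -> {
                for (int m = 0; m < meals; m++) {
                    first.lock();
                    try {
                        second.lock();
                        try { eaten[id]++; }          // "eat"
                        finally { second.unlock(); }
                    } finally { first.unlock(); }
                }
            });
            ts[i].start();
        }
        int total = 0;
        for (int i = 0; i < n; i++) { ts[i].join(); total += eaten[i]; }
        return total;
    }
}
```

Because no cycle of waiting threads can form, every philosopher eventually eats all assigned meals.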
Liveness and Deadlock Liveness is a characteristic that a program unit may or may not have In sequential code, it means the unit will eventually complete its execution In a concurrent environment, a task can easily lose its liveness If all tasks in a concurrent environment lose their liveness, it is called deadlock (livelock is the related failure in which tasks remain active but make no progress)
Race conditions A race condition occurs when the resulting value of a variable that two different threads of a program are writing to differs depending on which thread writes first. These cause transient errors that are hard to debug. E.g., c = c + 1 is really three steps: 1. load c; 2. add 1; 3. store c. Two threads can interleave these steps and lose an update. Solution: acquire access to the shared resource before execution can continue. Issues: lockout, starvation.
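A sketch of the fix in Java (hypothetical class name `Counter`): an atomic increment performs the load-add-store sequence indivisibly, so the result is deterministic no matter how the threads interleave. With a plain `c++` instead, some increments would typically be lost.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: eliminating the c = c + 1 race with an atomic counter.
class Counter {
    static int increment(int threads, int perThread) throws InterruptedException {
        AtomicInteger c = new AtomicInteger(0);
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++)
                    c.incrementAndGet();   // atomic load, add 1, store
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return c.get();
    }
}
```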
Task Execution States Assuming some mechanism for synchronization (e.g. a scheduler), tasks can be in a variety of states: New - created but not yet started Runnable or ready - ready to run but not currently running (no available processor) Running Blocked - has been running, but cannot now continue (usually waiting for some event to occur) Dead - no longer active in any sense
Design Issues Competition and cooperation synchronization Controlling task scheduling How and when tasks start and end execution Alternatives: Semaphores Monitors Message Passing
Semaphores Simple mechanism that can be used to provide synchronization of tasks Devised by Edsger Dijkstra in 1965 for competition synchronization, but can also be used for cooperation synchronization A data structure consisting of an integer and a queue that stores task descriptors A task descriptor is a data structure that stores all the relevant information about the execution state of a task
Semaphore operations Two atomic operations, P and V Consider a semaphore s: P (from the Dutch "passeren", to pass) P(s) – if s > 0 then assign s = s – 1; otherwise block (enqueue) the thread that calls P Often referred to as "wait" V (from the Dutch "verhogen"/"vrijgeven", to increment/release) V(s) – if a thread T is blocked on s, then wake up T; otherwise assign s = s + 1 Often referred to as "signal"
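In Java, `java.util.concurrent.Semaphore` exposes P as `acquire()` and V as `release()`. A sketch (the class `SemaphoreDemo` is hypothetical): initialized to 1, the semaphore admits one thread at a time to the critical section, so no increments of the shared counter are lost.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: a binary semaphore used for competition synchronization.
class SemaphoreDemo {
    static int count(int threads, int perThread) throws InterruptedException {
        Semaphore s = new Semaphore(1);   // s = 1: one thread in the critical section
        int[] c = {0};                    // the shared resource
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    try {
                        s.acquire();      // P(s): decrement, or block if s = 0
                        c[0]++;           // critical section
                        s.release();      // V(s): wake a blocked thread, or increment
                    } catch (InterruptedException e) { return; }
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        return c[0];
    }
}
```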
Dining Philosophers
wantBothSticks := Semaphore new.
philosopher := [
    [true] whileTrue: [
        [self haveBothSticks] whileFalse: [
            wantBothSticks wait.
            left available and right available ifTrue: [
                self get: left.
                self get: right. ].
            wantBothSticks signal. ].
        self eat.
        self release: left.
        self release: right.
        self think. ] ]
The trouble with semaphores No way to statically check for the correctness of their use Leaving out a single wait or signal event can create many different issues Getting them just right can be tricky Per Brinch Hansen (1973): “The semaphore is an elegant synchronization tool for an ideal programmer who never makes mistakes.”
Locks and Condition Variables A semaphore may be used for either of two purposes: Mutual exclusion: guarding access to a critical section Synchronization: making processes suspend/resume This dual use can lead to confusion: it may be unclear which role a semaphore is playing in a given computation… For this reason, newer languages may provide distinct constructs for each role: Locks: guarding access to a critical section Condition Variables: making processes suspend/resume Locks provide for mutually-exclusive access to shared memory; condition variables provide for thread/process synchronization.
Locks Like a Semaphore, a lock has two associated operations: acquire() try to lock the lock; if it is already locked, suspend execution release() unlock the lock; awaken a waiting thread (if any) These can be used to 'guard' a critical section:
Lock sharedLock;
Object sharedObj;
…
sharedLock.acquire();
// access sharedObj
sharedLock.release();
A Java class has a hidden lock accessible via the synchronized keyword
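Java's `ReentrantLock` plays the role of the Lock sketched above, with `lock()` for acquire and `unlock()` for release. A sketch (the class `Guarded` and method `bump` are hypothetical); the `try`/`finally` ensures the lock is released even if the critical section throws.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: guarding a critical section with an explicit lock.
class Guarded {
    private final ReentrantLock sharedLock = new ReentrantLock();
    private int sharedObj = 0;         // the shared data being guarded

    int bump() {
        sharedLock.lock();             // acquire(): blocks if already held
        try {
            return ++sharedObj;        // critical section
        } finally {
            sharedLock.unlock();       // release(): always runs, even on exception
        }
    }
}
```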
Condition Variables A Condition is a predefined type available in some languages that can be used to declare variables for synchronization. When a thread needs to suspend execution inside a critical section until some condition is met, a Condition can be used. There are three operations for a Condition: wait() suspend immediately; enter a queue of waiting threads signal(), aka notify() in Java awaken a waiting thread (usually the first in the queue), if any broadcast(), aka notifyAll() in Java awaken all waiting threads, if any Java originally had no Condition class (java.util.concurrent.locks.Condition was added in Java 5), but every Java object has an anonymous condition-variable that can be manipulated via wait, notify & notifyAll
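A sketch of the lock-plus-condition-variable pairing using Java 5's `Condition`, whose `await()`/`signal()`/`signalAll()` correspond to the wait/signal/broadcast operations above. The one-shot `Gate` class here is hypothetical: `awaitOpen()` suspends inside the critical section until `open()` broadcasts.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: a one-shot gate built from a lock and a condition variable.
class Gate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition opened = lock.newCondition();
    private boolean open = false;

    void awaitOpen() throws InterruptedException {
        lock.lock();
        try {
            while (!open) opened.await();   // wait(): release the lock and suspend
        } finally { lock.unlock(); }
    }

    void open() {
        lock.lock();
        try {
            open = true;
            opened.signalAll();             // broadcast(): wake all waiting threads
        } finally { lock.unlock(); }
    }
}
```

Note that `await()` atomically releases the lock while suspending and reacquires it before returning, which is what lets the waiter sleep inside the critical section without blocking the signaler.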
Monitor motivation A Java class has a hidden lock accessible via the synchronized keyword Deadlocks/livelocks/non-mutual-exclusion are easy to produce Just as control structures were "higher level" than the goto, language designers began looking for higher level ways to synchronize processes In 1973, Brinch Hansen and Hoare proposed the monitor, a class whose methods are automatically accessed in a mutually-exclusive manner. A monitor prevents simultaneous access by multiple threads
Monitors The idea: encapsulate the shared data and its operations to restrict access A monitor is an abstract data type for shared data Shared data is resident in the monitor (rather than in the client units) All access is through the monitor's procedures The monitor implementation guarantees synchronized access by allowing only one access at a time Calls to monitor procedures are implicitly queued if the monitor is busy at the time of the call
Monitor Visualization (Slide diagram: a monitor with a public interface put(obj)/get(obj); private data myHead, myTail, mySize (capacity N), myValues; condition variables notEmpty and notFull; an entry queue; and a hidden lock.) The compiler 'wraps' calls to put() and get() as follows:
buf.lock.acquire();
… call to put or get …
buf.lock.release();
If the lock is locked, the thread enters the entry queue. Each condition variable has its own internal queue, in which waiting threads wait to be signaled…
Evaluation of Monitors A better way to provide competition synchronization than semaphores Equally powerful as semaphores: Semaphores can be used to implement monitors Monitors can be used to implement semaphores Support for cooperation synchronization is very similar to that of semaphores, so it has the same reliability issues
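A sketch of one direction of the equivalence noted above: a semaphore built from a monitor. In Java, a class with synchronized methods acts as a (simplified) monitor, with the object's anonymous condition variable providing the blocking. The class name `MonitorSemaphore` is hypothetical.

```java
// Hypothetical sketch: implementing a counting semaphore with a Java monitor.
class MonitorSemaphore {
    private int s;

    MonitorSemaphore(int initial) { s = initial; }

    public synchronized void p() throws InterruptedException {
        while (s == 0) wait();   // block while the count is exhausted
        s--;
    }

    public synchronized void v() {
        s++;
        notify();                // wake one blocked p(), if any
    }

    public synchronized int value() { return s; }
}
```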
Distributed Synchronization Semaphores, locks, condition variables, and monitors are shared-memory constructs, and so are only useful on a tightly-coupled multiprocessor. They are of no use on a distributed multiprocessor. On a distributed multiprocessor, processes can communicate via message-passing -- using send() and receive() primitives. If the message-passing system has no storage, then the send/receive operations must be synchronized: 1. sender (ready); 2. receiver (ready); 3. message (transmitted). If the message-passing system has storage to buffer the message, then the send() can proceed asynchronously: 1. sender (ready); 2. message (buffered); 3. receiver (not ready). The receiver can then retrieve the message when it is ready...
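The buffered case can be sketched with Java's `BlockingQueue` (the `Mailbox` class here is hypothetical): `put()` plays the role of the asynchronous send, returning immediately while the buffer has space, and `take()` plays the role of receive, blocking until a message arrives. A `SynchronousQueue` would instead model the storage-free, synchronized case.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: buffered message passing via a blocking queue.
class Mailbox {
    private final BlockingQueue<String> box = new ArrayBlockingQueue<>(16);

    void send(String msg) throws InterruptedException { box.put(msg); }      // buffered send
    String receive() throws InterruptedException { return box.take(); }      // blocking receive
}
```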
Tasks In 1980, Ada introduced the task, with 3 characteristics: its own thread of control; its own execution state; and mutually exclusive subprograms (entry procedures). Entry procedures are self-synchronizing subprograms that another task can invoke for task-to-task communication. If task t has an entry procedure p, then another task t2 can execute t.p( argument-list ); In order for p to execute, t must execute: accept p ( parameter-list ); - If t executes accept p and t2 has not called p, t will automatically wait; - If t2 calls p and t has not accepted p, t2 will automatically wait.
Rendezvous This interaction is called a rendezvous between t and t2. It does not depend on shared memory, so t and t2 can be on a uniprocessor, a tightly-coupled multiprocessor, or a distributed multiprocessor. (Slide timeline:) t executes accept p(params); t2 executes t.p(args) and suspends; t2's argument-list is evaluated and passed to t.p's parameters. When t and t2 are both ready, p executes: t executes the body of p, using its parameter values; return-values (or out or in out parameters) are passed back to t2; t continues execution; t2 resumes execution.
Example Problem How can we rewrite what's below to complete more quickly?
procedure sumArray is
   N: constant integer := ;   -- value omitted on the slide
   type RealArray is array(1..N) of float;
   anArray: RealArray;
   function sum(a: RealArray; first, last: integer) return float is
      result: float := 0.0;
   begin
      for i in first..last loop
         result := result + a(i);
      end loop;
      return result;
   end sum;
begin
   -- code to fill anArray with values omitted
   put( sum(anArray, 1, N) );
end sumArray;
Divide-And-Conquer via Tasks
procedure parallelSumArray is
   -- declarations of N, RealArray, anArray, Sum() as before …
   task type ArraySliceAdder is
      entry SumSlice(Start: in Integer; Stop: in Integer);
      entry GetSum(Result: out Float);
   end ArraySliceAdder;

   task body ArraySliceAdder is
      i, j: Integer;
      Answer: Float;
   begin
      accept SumSlice(Start: in Integer; Stop: in Integer) do
         i := Start;  j := Stop;        -- get ready
      end SumSlice;
      Answer := Sum(anArray, i, j);     -- do the work
      accept GetSum(Result: out Float) do
         Result := Answer;              -- report outcome
      end GetSum;
   end ArraySliceAdder;
-- continued on next slide…
Divide-And-Conquer via Tasks (ii)
-- continued from previous slide …
   T1, T2 : ArraySliceAdder;            -- T1, T2 start & wait on accept
   firstHalfSum, secondHalfSum: Float;
begin
   -- code to fill anArray with values omitted
   T1.SumSlice(1, N/2);                 -- start T1 on 1st half
   T2.SumSlice(N/2 + 1, N);             -- start T2 on 2nd half
   …
   T1.GetSum( firstHalfSum );           -- get 1st half sum from T1
   T2.GetSum( secondHalfSum );          -- get 2nd half sum from T2
   put( firstHalfSum + secondHalfSum ); -- we're done!
end parallelSumArray;
Using two tasks T1 and T2, this parallelSumArray version requires roughly 1/2 the time required by sumArray (on a multiprocessor). Using three tasks, the time will be roughly 1/3 the time of sumArray.
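The same divide-and-conquer idea, sketched with Java threads instead of Ada tasks (the class `ParallelSum` is a hypothetical name): each thread sums one half of the array, and the main thread joins both before combining the partial sums, which corresponds to the GetSum rendezvous above.

```java
// Hypothetical sketch: two-way parallel array summation with Java threads.
class ParallelSum {
    static double sum(double[] a) throws InterruptedException {
        int mid = a.length / 2;
        double[] partial = new double[2];
        Thread t1 = new Thread(() -> { for (int i = 0;   i < mid;      i++) partial[0] += a[i]; });
        Thread t2 = new Thread(() -> { for (int i = mid; i < a.length; i++) partial[1] += a[i]; });
        t1.start(); t2.start();   // both halves run concurrently
        t1.join();  t2.join();    // wait for both partial sums
        return partial[0] + partial[1];
    }
}
```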
Producer-Consumer in Ada To give the producer and consumer separate threads, we can define the behavior of one in the 'main' procedure:
procedure ProducerConsumer is
   buf: Buffer;
   it: Item;
begin
   -- producer task
   loop
      -- produce an Item in it
      buf.put(it);
   end loop;
end ProducerConsumer;
and the behavior of the other in a separate task:
task consumer;
task body consumer is
   it: Item;
begin
   loop
      buf.get(it);
      -- consume Item it
   end loop;
end consumer;
We can then build a Buffer task with put() and get() as (auto-synchronizing) entry procedures...
Capacity-1 Buffer A single-value buffer is easy to build using an Ada task-type:
task type Buffer is
   entry get(it: out Item);
   entry put(it: in Item);
end Buffer;

task body Buffer is
   B: Item;
begin
   loop
      accept put(it: in Item) do
         B := it;
      end put;
      accept get(it: out Item) do
         it := B;
      end get;
   end loop;
end Buffer;
As a task-type, variables of this type (e.g., buf) will automatically have their own thread of execution. The body of the task is a loop that accepts calls to put() and get() in strict alternation, which causes the buffer to alternate between being empty and nonempty.
Capacity-N Buffer An N-value buffer is a bit more work. Ada provides the select-when statement to guard an accept, and perform it if and only if a given condition is true: we can accept any call to get() so long as we are not empty, and any call to put() so long as we are not full.
-- task declaration is as before …
task body Buffer is
   N: constant integer := 1024;
   package B is new Queue(N, Items);
begin
   loop
      select
         when not B.isFull =>
            accept put(it: in Item) do
               B.append(it);
            end put;
      or
         when not B.isEmpty =>
            accept get(it: out Item) do
               it := B.first;
               B.delete;
            end get;
      end select;
   end loop;
end Buffer;
The Importance of Clusters Scientific computation is increasingly performed on clusters Cost-effective: Created from commodity parts Scientists want more computational power Cluster computational power is easy to increase by adding processors Cluster size keeps increasing!
Clusters Are Not Perfect Failure rates are increasing The number of moving parts is growing (processors, network connections, disks, etc.) Mean Time Between Failures (MTBF) is shrinking Anecdotal: every 20 minutes for Google’s cluster How can we deal with these failures?
Options for Fault-Tolerance Redundancy in space Each participating process has a backup process Expensive! Redundancy in time Processes save state and then rollback for recovery Lighter-weight fault tolerance
Today’s Answer: Redundancy in Time Programmers place checkpoints Small checkpoint size Synchronous Every process checkpoints in the same place in the code Global synchronization before and after checkpoints
What’s the Problem? Future systems will be larger Checkpointing will hurt program performance Many processes checkpointing synchronously will result in network and file system contention Checkpointing to local disk not viable Application programmers are only willing to pay 1% overhead for fault-tolerance The solution: Avoid synchronous checkpoints
Understanding Staggered Checkpointing Today: a few processes checkpointing synchronously pose no problem. Tomorrow: more processes (e.g., 64K), more data, and synchronous checkpoints mean contention! That's easy, we'll stagger the checkpoints…. Not so fast… there is communication! A staggered set of checkpoints forms a recovery line [Randell 75]. A recovery line is VALID when the saved states are consistent. If a send is saved but its receive is not, the state is consistent: it could have existed. If a receive is saved but its send is not, the state is inconsistent: it could not have existed.
Identify All Possible Valid Recovery Lines There are so many! (Slide diagram: processes over time, with candidate recovery lines labeled by vectors [1,0,0], [1,1,0], [1,2,0], [2,0,0], [2,0,1], [2,0,2], [2,3,2], [2,4,2], [2,4,3], [2,5,2], [3,2,0], [4,5,2].)
Coroutine A coroutine is two or more procedures that share a single thread of execution, each exercising mutual control over the other:
procedure A;
begin
   -- do something
   resume B;
   -- do something
   resume B;
   -- do something
   -- …
end A;

procedure B;
begin
   -- do something
   resume A;
   -- do something
   resume A;
   -- …
end B;
Summary On a shared-memory multiprocessor: The Semaphore was the first synchronization primitive Locks and condition variables separated a semaphore’s mutual-exclusion usage from its synchronization usage Monitors are higher-level, self-synchronizing objects Java classes have an associated (simplified) monitor On a distributed system: Ada tasks provide self-synchronizing entry procedures Concurrent computations consist of multiple entities. Processes in Smalltalk Tasks in Ada Threads in Java OS-dependent in C++