
1 SEMINAR 236825: OPEN PROBLEMS IN DISTRIBUTED COMPUTING
Winter 2013-14
Hagit Attiya & Faith Ellen

2 INTRODUCTION

3 Distributed Systems
Distributed systems are everywhere:
– share resources
– communicate
– increase performance (speed & fault tolerance)
Characterized by
– independent activities (concurrency)
– loosely coupled parallelism (heterogeneity)
– inherent uncertainty
E.g., (distributed) operating systems, (distributed) database systems, software fault-tolerance, communication networks, multiprocessor architectures

4 Main Admin Issues
Goal: read some interesting papers related to open problems in the area
Mandatory (active) participation
– 1 absence w/o explanation
Tentative list of papers already published
– first come, first served
Lectures in English

5 Course Overview: Basic Models
[Figure: the two basic models: message passing (synchronous or asynchronous) and shared memory (e.g., PRAM)]

6 Message-Passing Model
Processors p0, p1, …, pn-1 are nodes of the graph; each is a state machine with a local state.
Bidirectional point-to-point channels are the undirected edges of the graph.
The channel from pi to pj is modeled in two pieces:
– outbuf variable of pi (physical channel)
– inbuf variable of pj (incoming message queue)
[Figure: four processors p0-p3 connected by numbered channels]

7 Modeling Processors and Channels
[Figure: p1's and p2's local variables, with inbuf and outbuf buffers in each direction between them]
(Same model as the previous slide: processors are state machines at the nodes, and each channel from pi to pj is split into pi's outbuf and pj's inbuf.)

8 Configuration
A snapshot of the entire system: accessible processor states (local variables & incoming msg queues) as well as communication channels.
Formally, a vector of processor states (including outbufs, i.e., channels), one per processor.

9 Deliver Event
Moves a message from the sender's outbuf to the receiver's inbuf; the message will be available the next time the receiver takes a step.
[Figure: messages m1, m2, m3 in transit from p1 to p2]

10 Computation Event
Occurs at one processor
Starts with the old accessible state (local vars + incoming messages)
Applies the processor's state machine transition function, handling all incoming messages
Ends with the new accessible state, with empty inbufs & new outgoing messages
[Figure: transition from an old local state to a new local state]

11 Execution
configuration, event, configuration, event, configuration, …
In the first configuration: each processor is in its initial state and all inbufs are empty
For each consecutive triple (configuration, event, configuration), the new configuration is the same as the old configuration except:
– if a delivery event: the specified msg is transferred from the sender's outbuf to the receiver's inbuf
– if a computation event: the specified processor's state (including its outbufs) changes according to the transition function
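
Slides 6-11 can be made concrete with a small simulation. The following is a minimal Python sketch, not course code: the names Processor, deliver, and compute are mine, and the transition function is assumed to return the new state plus a map from neighbor ids to lists of outgoing messages.

    from collections import deque

    class Processor:
        """One node: a state machine plus an inbuf and an outbuf per channel."""
        def __init__(self, pid, neighbors):
            self.pid = pid
            self.inbuf = {j: deque() for j in neighbors}   # inbuf[j]: messages from p_j
            self.outbuf = {j: deque() for j in neighbors}  # outbuf[j]: messages to p_j
            self.state = None                              # local variables

    def deliver(sender, receiver):
        """Deliver event: move one message from the sender's outbuf
        to the receiver's inbuf."""
        receiver.inbuf[sender.pid].append(sender.outbuf[receiver.pid].popleft())

    def compute(p, transition):
        """Computation event: consume all incoming messages, apply the
        transition function, leave the inbufs empty, fill the outbufs."""
        incoming = {j: list(q) for j, q in p.inbuf.items()}
        for q in p.inbuf.values():
            q.clear()                                      # inbufs end up empty
        p.state, outgoing = transition(p.state, incoming)  # outgoing: {pid: [msgs]}
        for j, msgs in outgoing.items():
            p.outbuf[j].extend(msgs)

An execution is then any alternating sequence of configurations and deliver/compute events starting from an initial configuration.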

12 Asynchronous Executions
An execution is admissible in the asynchronous model if
– every message in an outbuf is eventually delivered
– every processor takes an infinite number of steps
No constraints on when these events take place: arbitrary message delays and relative processor speeds are not ruled out
Models a reliable system (no message is lost and no processor stops working)

13 Example: Simple Flooding Algorithm
Each processor's local state consists of a variable color, either red or green
Initially:
– p0: color = green, all outbufs contain M
– others: color = red, all outbufs empty
Transition: if M is in an inbuf and color = red, then change color to green and send M on all outbufs
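
In the shape expected by the compute() sketch above, the flooding transition might look as follows (a sketch; the state layout (color, neighbors) is my choice, not from the slides):

    def flooding_transition(state, incoming):
        """Flooding (slide 13): on first receiving M while red,
        turn green and forward M to every neighbor."""
        color, neighbors = state
        received_M = any('M' in msgs for msgs in incoming.values())
        if received_M and color == 'red':
            return ('green', neighbors), {j: ['M'] for j in neighbors}
        return (color, neighbors), {}

    # initial configuration for a triangle p0-p1-p2, reusing Processor above
    procs = {i: Processor(i, [j for j in range(3) if j != i]) for i in range(3)}
    for i, p in procs.items():
        p.state = ('green' if i == 0 else 'red', list(p.outbuf))
        if i == 0:
            for q in p.outbuf.values():
                q.append('M')           # p0 starts with M in all outbufs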

14 Example: Flooding
[Figure: three processors p0, p1, p2; a deliver event at p1 from p0, a computation event by p1, a deliver event at p2 from p1, and a computation event by p2 propagate M]

15 Example: Flooding (cont'd)
[Figure: a deliver event at p1 from p2, a computation event by p1, a deliver event at p0 from p1, etc., to deliver the rest of the msgs]

16 (Worst-Case) Complexity Measures
Message complexity: maximum number of messages sent in any admissible execution
Time complexity: maximum "time" until all processors terminate in any admissible execution
How to measure time in an asynchronous execution?
– Produce a timed execution by assigning non-decreasing real times to events, so that the time between sending and receiving any message is at most 1
– Time complexity: maximum time until termination in any timed admissible execution

17 Complexities of the Flooding Algorithm
A state is terminated if color = green
One message is sent over each edge in each direction ⇒ message complexity is 2m, where m = number of edges
A node turns green once a "chain" of messages reaches it from p0 ⇒ time complexity is diameter + 1 time units

18 Synchronous Message-Passing Systems
An execution is admissible for the synchronous model if it is an infinite sequence of rounds
– A round is a sequence of deliver events moving all msgs in transit into inbufs, followed by a sequence of computation events, one for each processor
Captures the lockstep behavior of the model
Also implies
– every message sent is delivered
– every processor takes an infinite number of steps
Time is the number of rounds until termination
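
A round is easy to express on top of the earlier model sketch (again my helper names, not course code):

    def synchronous_round(procs, transition):
        """One synchronous round (slide 18): deliver every message in
        transit, then one computation event per processor."""
        for p in procs.values():
            for j, q in p.outbuf.items():
                while q:                # move all msgs in transit into inbufs
                    procs[j].inbuf[p.pid].append(q.popleft())
        for p in procs.values():
            compute(p, transition)      # compute() from the model sketch above

Running synchronous_round(procs, flooding_transition) repeatedly floods M through the triangle in diameter + 1 rounds.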

19 Example: Flooding in the Synchronous Model
[Figure: round 1 events and round 2 events flood M through p0, p1, p2]
Time complexity is diameter + 1
Message complexity is 2m

20 Broadcast Over a Rooted Spanning Tree
Processors have information about a rooted spanning tree of the communication topology
– parent and children local variables at each processor
The root initially sends M to its children
When a processor receives M from its parent
– it sends M to its children
– and terminates (sets a local Boolean to true)
Complexities (synchronous and asynchronous models)
– time is the depth of the spanning tree, which is at most n - 1
– number of messages is n - 1, since one message is sent over each spanning tree edge

21 Broadcast Over a Rooted Spanning Tree
(Same algorithm as the previous slide.)
Synchronous model:
– time is the depth of the spanning tree, which is at most n - 1
– number of messages is n - 1, since one message is sent over each spanning tree edge
Asynchronous model:
– same time and messages
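
A self-contained sketch of the broadcast and its complexity counts (the children map and function name are mine):

    from collections import deque

    def broadcast(children, root):
        """Tree broadcast (slides 20-21): children maps each node to its
        spanning-tree children; returns (message count, time)."""
        msgs, depth = 0, {root: 0}
        pending = deque([root])
        while pending:
            u = pending.popleft()
            for c in children.get(u, []):
                msgs += 1                  # one M per tree edge
                depth[c] = depth[u] + 1    # c receives M one time unit later
                pending.append(c)
        return msgs, max(depth.values())   # n - 1 messages, time = tree depth

    # e.g. broadcast({'a': ['b', 'c'], 'b': ['d']}, 'a') == (3, 2)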

22 Convergecast
Again, suppose a rooted spanning tree has already been computed by the processors
– parent and children variables at each processor
Do the opposite of broadcast:
– leaves send messages to their parents
– non-leaves wait to get a message from each child, then send the combined info to their parent
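
A recursive sketch of what each node ends up sending to its parent (the combine rule, simple concatenation of ids, is my placeholder; the algorithm works for any associative combining function):

    def convergecast(children, u):
        """Info that node u sends to its parent (slide 22): its own id
        combined with everything gathered from its subtree."""
        gathered = [u]
        for c in children.get(u, []):
            gathered += convergecast(children, c)  # "wait" for each child's msg
        return gathered

    # e.g. convergecast({'a': ['b', 'c'], 'b': ['d']}, 'a') == ['a', 'b', 'd', 'c']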

23 Convergecast
[Figure: convergecast on a tree rooted at a; leaves start, and each processor forwards its combined info to its parent, e.g., ⟨f,h⟩, ⟨b,d⟩, ⟨c,f,h⟩; solid arrows: parent-child relationships; dotted lines: non-tree edges]

24 Finding a Spanning Tree from a Root
The root sends M to all its neighbors
When a non-root first gets M
– set the sender as its parent
– send a "parent" msg to the sender
– send M to all other neighbors (if there are no other neighbors, then terminate)
When getting M otherwise
– send a "reject" msg to the sender
Use the "parent" and "reject" msgs to set the children variables and terminate (after hearing from all neighbors)
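
A self-contained simulation sketch under one admissible schedule (FIFO delivery from a single global queue; the names and the omission of the termination bookkeeping are my simplifications):

    from collections import deque

    def spanning_tree(adj, root):
        """Slide 24 under FIFO delivery; adj maps each node to its neighbors."""
        parent = {root: root}
        children = {u: set() for u in adj}
        pending = deque((root, v, 'M') for v in adj[root])
        while pending:
            s, r, kind = pending.popleft()               # one deliver event
            if kind == 'M':
                if r not in parent:                      # first M: adopt s as parent
                    parent[r] = s
                    pending.append((r, s, 'parent'))
                    pending.extend((r, v, 'M') for v in adj[r] if v != s)
                else:                                    # M otherwise: reject
                    pending.append((r, s, 'reject'))
            elif kind == 'parent':
                children[r].add(s)                       # s is now r's child
        return parent, children

FIFO delivery behaves like the synchronous schedule and yields a BFS tree; other admissible schedules can produce the non-BFS trees of the next two slides.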

25 Execution of the Spanning-Tree Algorithm
[Figure: the same graph rooted at a, explored synchronously and asynchronously]
Synchronous: always gives a breadth-first search (BFS) tree
Asynchronous: not necessarily a BFS tree
Both models: O(m) messages, O(diam) time

26 Execution of the Spanning-Tree Algorithm
An asynchronous execution gave a depth-first search (DFS) tree. Is the DFS property guaranteed?
No! Another asynchronous execution results in a tree that is neither BFS nor DFS
[Figure: two asynchronous executions on the same rooted graph, one yielding a DFS tree and one yielding neither BFS nor DFS]

27 Finding a DFS Spanning Tree from a Root
When the root first takes a step, or when a non-root first receives M:
– mark the sender as parent (if not the root)
– for each neighbor in series:
  send M to it
  wait to get a "parent" or "reject" msg in reply
– send a "parent" msg to the parent neighbor
When a processor receives M otherwise
– send "reject" to the sender
Use the "parent" and "reject" msgs to set the children variables and terminate
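
Because each node explores its neighbors one at a time and waits for a reply, the message pattern matches sequential DFS; a sketch (function names mine):

    def dfs_tree(adj, root):
        """Slide 27: exploring neighbors in series mirrors
        'send M to it, wait to get a "parent" or "reject" msg'."""
        parent, children = {root: root}, {u: set() for u in adj}
        def explore(u):
            for v in adj[u]:
                if v not in parent:          # v replies "parent"
                    parent[v] = u
                    children[u].add(v)
                    explore(v)               # wait until v finishes its subtree
                # else: v already has a parent and replies "reject"
        explore(root)
        return parent, children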

28 Finding a DFS Spanning Tree from a Root
The previous algorithm ensures that the spanning tree is always a DFS tree; it is analogous to the sequential DFS algorithm
Message complexity: O(m), since a constant number of messages is sent over each edge
Time complexity: O(m), since the edges are explored in series

29 Shared Memory Model
Processors (also called processes) communicate via a set of shared variables
Each shared variable has a type, defining a set of primitive operations (performed atomically):
– read, write
– compare&swap (CAS)
– LL/SC, DCAS, kCAS, …
– read-modify-write (RMW), kRMW
[Figure: processes p0, p1, p2 applying read, write, and RMW operations to shared variables X and Y]
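
For example, the semantics of compare&swap, one of the listed primitives (a sketch: in the model the whole body is a single atomic step, while plain Python offers no such guarantee, so this only illustrates the specification):

    def compare_and_swap(var, expected, new):
        """CAS semantics (slide 29): atomically replace var's contents
        with new iff they currently equal expected."""
        if var[0] == expected:       # var is a one-slot list standing for a register
            var[0] = new
            return True
        return False

    # X = [0]; compare_and_swap(X, 0, 5) -> True, and X == [5]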

30 Changes from the Message-Passing Model
No inbuf and outbuf state components
A configuration includes values for the shared variables
One event type: a computation step by a process
– pi's state in the old configuration specifies which shared variable is to be accessed and with which primitive
– the shared variable's value in the new configuration changes according to the primitive's semantics
– pi's state in the new configuration changes according to its old state and the result of the primitive
An execution is admissible if every process takes an infinite number of steps

31 Abstract Data Types
Abstract representation of data & a set of methods (operations) for accessing it
Implemented using primitives on base objects
Sometimes there is a hierarchy of implementations: operations are implemented from more low-level ones

32 Executing Operations
[Figure: processes P1, P2, P3 executing operations on a shared queue; each operation is an interval from invocation to response: P1 runs enq(1), P2 runs enq(2), P3 runs deq, which returns 1]

33 Interleaving Operations, or Not
[Figure: the operations enq(1), enq(2), deq returning 1, arranged in one sequence]
Sequential behavior: invocations & responses alternate and match (on process & object)
Sequential specification: all legal sequential behaviors

34 Correctness: Sequential Consistency [Lamport, 1979]
For every concurrent execution there is a sequential execution that
– contains the same operations
– is legal (obeys the sequential specification)
– preserves the order of operations by the same process

35 Example 1: Multi-Writer Registers
Using (multi-reader) single-writer registers: add logical time to the values
Write(v, X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write ⟨v, TSi⟩
Read(X):
  read ⟨v, TSi⟩ (read only own value)
  return v
Once in a while, read TS1, …, TSn and write to TSi
– need to ensure writes are eventually visible
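
A minimal sketch of this construction, assuming each single-writer register is modeled as one slot of a Python list and that each slot assignment is atomic (class and method names are mine):

    class SCRegister:
        """Slide 35: a sequentially consistent multi-writer register from
        n single-writer registers; cell[i] = (value, TS_i) is written
        only by process i."""
        def __init__(self, n):
            self.cell = [(None, 0)] * n

        def write(self, i, v):
            ts = max(t for _, t in self.cell) + 1   # read TS_1..TS_n; max + 1
            self.cell[i] = (v, ts)

        def read(self, i):
            return self.cell[i][0]                  # read only own value

        def refresh(self, i):
            """'Once in a while': copy the most recent value into own
            cell, so other processes' writes eventually become visible."""
            self.cell[i] = max(self.cell, key=lambda c: c[1])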

36 Timestamps
The timestamps generated by Write (read TS1, …, TSn; TSi = max TSj + 1; write ⟨v, TSi⟩) satisfy:
1. The timestamps of two write operations by the same process are ordered
2. If a write operation completes before another one starts, it has a smaller timestamp

37 Multi-Writer Registers: Proof
Create a sequential execution:
– place the writes in timestamp order
– insert the reads after the appropriate writes

38 Multi-Writer Registers: Proof
Create a sequential execution:
– place the writes in timestamp order
– insert the reads after the appropriate writes
Legality is immediate
Per-process order is preserved, since a read returns a value (with timestamp) larger than the preceding write by the same process

39 Correctness: Linearizability [Herlihy & Wing, 1990]
For every concurrent execution there is a sequential execution that
– contains the same operations
– is legal (obeys the specification of the ADTs)
– preserves the real-time order of non-overlapping operations
Each operation appears to take effect instantaneously at some point between its invocation and its response (atomicity)

40 Example 2: Linearizable Multi-Writer Registers [Vitanyi & Awerbuch, 1987]
Using (multi-reader) single-writer registers: add logical time to the values
Write(v, X):
  read TS1, …, TSn
  TSi = max TSj + 1
  write ⟨v, TSi⟩
Read(X):
  read TS1, …, TSn
  return the value with the max TS
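
A sketch of this variant; ties between equal timestamps are broken by the writer's id, which the slide leaves implicit (class name and tie-breaking rule are my assumptions):

    class MWRegister:
        """Slide 40 ([Vitanyi & Awerbuch, 1987]-style): reads scan all
        single-writer cells and return the value with the max timestamp."""
        def __init__(self, n):
            self.cell = [(None, 0, i) for i in range(n)]   # (value, TS, writer id)

        def write(self, i, v):
            ts = max(t for _, t, _ in self.cell) + 1       # read TS_1..TS_n; max + 1
            self.cell[i] = (v, ts, i)

        def read(self, i):
            v, _, _ = max(self.cell, key=lambda c: (c[1], c[2]))
            return v                                       # value with the max TS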

41 Multi-Writer Registers: Linearization Order
Create a linearization:
– place the writes in timestamp order
– insert each read after the appropriate write

42 Multi-Writer Registers: Proof
Create a linearization:
– place the writes in timestamp order
– insert each read after the appropriate write
Legality is immediate
Real-time order is preserved, since a read returns a value (with timestamp) larger than that of all preceding operations

43 Example 3: Atomic Snapshot
n components
Update a single component
Scan all the components "at once" (atomically)
Provides an instantaneous view of the whole memory
[Figure: update(v, k) returns ok; scan returns v1, …, vn]

44 Atomic Snapshot Algorithm [Afek, Attiya, Dolev, Gafni, Merritt, Shavit, JACM 1993]
Update(v, k):
  A[k] = ⟨v, seqi, i⟩
Scan():
  repeat
    read A[1], …, A[n] twice (double collect)
    if both collects are equal, return A[1, …, n]
Linearize:
– updates with their writes
– scans inside the double collects
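
A sketch of the double-collect scan (my class; in a single-threaded run the double collect trivially succeeds at once, so the repeat loop only matters under concurrent updates):

    class Snapshot:
        """Slide 44: each component carries a seq# so that every update
        is distinguishable from the previous one."""
        def __init__(self, n):
            self.A = [(None, 0, i) for i in range(n)]   # (value, seq, writer id)
            self.seq = [0] * n

        def update(self, i, v):
            self.seq[i] += 1                            # fresh seq#: every write is new
            self.A[i] = (v, self.seq[i], i)

        def scan(self):
            while True:
                c1 = list(self.A)                       # first collect
                c2 = list(self.A)                       # second collect
                if c1 == c2:                            # no update in between
                    return [v for v, _, _ in c1]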

45 Atomic Snapshot: Linearizability
Double collect (read the set of values twice)
If the collects are equal, there is no write between them
– assuming each write has a new value (seq#)
This creates a "safe zone" where the scan can be linearized
[Figure: two reads of A[1], …, A[n] with no write to any A[j] in between]

46 Liveness Conditions
Wait-free: every operation completes within a finite number of (its own) steps ⇒ no starvation for mutex
Nonblocking: some operation completes within a finite number of (some other process's) steps ⇒ deadlock-freedom for mutex
Obstruction-free: an operation (eventually) running solo completes within a finite number of (its own) steps
– also called solo termination
wait-free ⇒ nonblocking ⇒ obstruction-free
bounded wait-free ⇒ bounded nonblocking ⇒ bounded obstruction-free

47 Wait-Free Atomic Snapshot [Afek, Attiya, Dolev, Gafni, Merritt, Shavit, JACM 1993]
Embed a scan within the update:
Update(v, k):
  V = scan
  A[k] = ⟨v, seqi, i, V⟩
Scan():
  repeat
    read A[1], …, A[n] (double collect)
    if equal, return A[1, …, n] (a direct scan)
    else record the differences; if some pj changed twice, return Vj (a borrowed scan)
Linearize:
– updates with their writes
– direct scans as before
– borrowed scans in place
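
A sketch extending the earlier Snapshot class with embedded scans and the borrowed-scan rule (a simplification with my names; the real algorithm's bookkeeping of collects is more careful than this loop):

    class WFSnapshot:
        """Slide 47: each update stores the view of an embedded scan; a
        scan that sees some p_j change twice borrows p_j's stored view."""
        def __init__(self, n):
            self.n = n
            self.A = [(None, 0, i, None) for i in range(n)]  # (value, seq, id, view)
            self.seq = [0] * n

        def update(self, i, v):
            view = self.scan()                               # embedded scan
            self.seq[i] += 1
            self.A[i] = (v, self.seq[i], i, view)

        def scan(self):
            moved = set()
            while True:
                c1 = list(self.A)                            # double collect
                c2 = list(self.A)
                if c1 == c2:
                    return [v for v, _, _, _ in c1]          # direct scan
                for j in range(self.n):
                    if c1[j] != c2[j]:
                        if j in moved:                       # p_j changed twice:
                            return c2[j][3]                  # borrow its embedded view
                        moved.add(j)

If pj changed twice during our scan, its second update ran entirely inside our scan's interval, so the view from its embedded scan is a legal snapshot taken inside our interval.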

48 Atomic Snapshot: Borrowed Scans
Interference by process pj, and then another interference by pj ⇒ pj performed a complete embedded scan in between
Linearizing with the borrowed scan is therefore OK
[Figure: two writes to A[j] observed between reads of A[j], with pj's embedded scan in between]

49 List of Topics (Indicative)
– Atomic snapshots
– Space complexity of consensus
– Dynamic storage
– Vector agreement
– Renaming
– Maximal independent set
– Routing
and possibly others…

50 The Happened-Before Relation
a → b means that event a happened before event b:
– If a and b are events by the same process and a occurs before b, then a → b
– If event b obtains information from event a, then a → b
  (usually defined through message passing, but can be extended to read / write)
– Transitive closure: if a → b and b → c, then a → c
If events a and b by different processes do not exchange information, then neither a → b nor b → a holds

51 Timestamps Capture the Happened-Before Relation
For timestamps generated as in the previous algorithm, we have:
– if a → b then TS(a) < TS(b)
But not vice versa: we can have TS(a) < TS(b) but not a → b
Need to use vector timestamps
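
A vector-timestamp sketch (standard vector clocks; class and method names are mine). With vector timestamps, a → b holds iff VT(a) ≤ VT(b) pointwise and VT(a) ≠ VT(b), so the "not vice versa" gap disappears:

    def happened_before(vt_a, vt_b):
        """a -> b iff VT(a) <= VT(b) pointwise and the vectors differ."""
        return all(x <= y for x, y in zip(vt_a, vt_b)) and vt_a != vt_b

    class VectorClock:
        def __init__(self, i, n):
            self.i, self.vt = i, [0] * n

        def step(self):
            """Local or send event: tick own entry; the returned copy is
            the event's timestamp (piggybacked on the msg for a send)."""
            self.vt[self.i] += 1
            return list(self.vt)

        def receive(self, msg_vt):
            """Receive event: pointwise max with the msg's timestamp,
            then tick own entry."""
            self.vt = [max(x, y) for x, y in zip(self.vt, msg_vt)]
            return self.step()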

52 Causality Captures the Essence of the Computation
If two executions have the same happened-before relation
– i.e., they disagree only on the order of events a and b such that neither a → b nor b → a
⇒ the executions are indistinguishable to the processes
– each process obtains the same results when invoking primitives on the base objects

