Messaging and Group Communication
ICS 230 Distributed Systems (with some slides modified from S. Ghosh’s class notes)

Group Communication
- Communication to a collection of processes (a process group)
- Group communication can be exploited to provide
  - Simultaneous execution of the same operation on a group of workstations
  - Software installation on multiple workstations
  - Consistent network table management
- Who needs group communication?
  - Highly available servers
  - Conferencing
  - Cluster management
  - Distributed logging, ...

What type of group communication?
- Open group (anyone can join, e.g. customers of Walmart)
- Closed group (closed membership, e.g. class of 2000)
- Peer
  - All members are equal; all members send messages to the group
  - All members receive all the messages
  - E.g. UCI students, UCI faculty
- Client-Server
  - Common communication pattern
    - Replicated servers
  - Client may or may not care which server answers
- Diffusion group
  - Servers send to other servers and clients
- Hierarchical (one or more members are different from the rest)
  - Highly and easily scalable
[Figure omitted; surviving labels: Svrs, Clients]

Message Passing System
- A system consists of n objects a_0, …, a_{n-1}
- Each object a_i is modeled as a (possibly infinite) state machine with state set Q_i
- The edges incident on a_i are labeled arbitrarily with integers 1 through r, where r is the degree of a_i
- Each state of a_i contains 2r special components, outbuf_i[l] and inbuf_i[l], for every 1 ≤ l ≤ r
- A configuration is a vector C = (q_0, …, q_{n-1}), where q_i is the state of a_i
[Figure omitted: four objects a_0, a_1, a_2, a_3 joined by edges carrying local integer labels]
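
The model above can be sketched in a few lines. This is only an illustration: the class name `Obj`, the `links` table (mapping an object index and local edge label to the neighbor's index and its label for the same edge) and the `deliver` step are assumptions made for the example, not part of the slides.

```python
class Obj:
    """One object a_i: its state here is just the 2r buffers."""
    def __init__(self, degree):
        # One outbuf and one inbuf per incident edge, labeled 1..r.
        self.outbuf = {l: [] for l in range(1, degree + 1)}
        self.inbuf = {l: [] for l in range(1, degree + 1)}


def deliver(objs, links, i, l):
    """Move the oldest message in outbuf_i[l] to the inbuf at the other end of the edge."""
    j, l2 = links[(i, l)]
    msg = objs[i].outbuf[l].pop(0)
    objs[j].inbuf[l2].append(msg)


# Two objects joined by a single edge, labeled 1 at both ends (hypothetical topology).
objs = [Obj(1), Obj(1)]
links = {(0, 1): (1, 1), (1, 1): (0, 1)}
objs[0].outbuf[1].append("hello")
deliver(objs, links, 0, 1)
```

After the call, the message has left `outbuf_0[1]` and sits in `inbuf_1[1]`; the list of all object states plays the role of the configuration C.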

Message Passing System (II)
- A system is said to be asynchronous if there is no fixed upper bound on how long it takes a message to be delivered or on how much time elapses between consecutive steps
- Point-to-point messages
  - snd_i(m)
  - rcv_i(m, j)
- Group communication
  - Broadcast
    - One-to-all relationship
  - Multicast
    - One-to-many relationship
    - A variation of broadcast where an object can target its messages to a specified subset of objects

Multicast
- Basic multicast: does not consider failures
- Liveness: each process must receive every message
- Integrity: no spurious message received
- No duplicates: accepts exactly one copy of a message
- Reliable multicast: tolerates (certain kinds of) failures
- Atomic multicast: a multicast is atomic when the message is delivered to every correct member, or to no member at all
- In general, processes may crash, yet the atomicity of the multicast is to be guaranteed
- Reliable atomic multicast
- Scalability is a key issue

Steiner Trees and Core Based Trees
- Given a weighted graph (N, L) and a subset N’ of N, identify a subset L’ of L such that (N’, L’) is a subgraph of (N, L) that connects all the nodes of N’.
  - A minimal Steiner tree is a minimal-weight subgraph (N’, L’).
  - NP-complete; need heuristics
- Core-based trees
  - Multicast tree constructed dynamically, grows on demand.
  - Each group has one or more core nodes.
  - A node wishing to join the tree as a receiver sends a unicast join message to the core node.
  - The join marks the edges as it travels; it either reaches the core node or some node already part of the tree. The path followed by the join up to the core/multicast tree is grafted onto the multicast tree.
  - A node on the tree multicasts a message by flooding it on the core tree.
  - A node not on the tree sends a message towards the core node; as soon as the message reaches any node on the tree, it is flooded on the tree.
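
A minimal sketch of the core-based join step, assuming the unicast route toward the core is a shortest path found by breadth-first search. The function names, the adjacency-dict graph and the example topology are all hypothetical.

```python
from collections import deque


def shortest_path(graph, src, dst):
    """Breadth-first search; returns the list of nodes from src to dst."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nbr in graph[node]:
            if nbr not in parent:
                parent[nbr] = node
                queue.append(nbr)
    raise ValueError("core unreachable")


def join(graph, tree_nodes, tree_edges, core, joiner):
    """Send a join toward the core; graft the marked path at the first on-tree node."""
    path = shortest_path(graph, joiner, core)
    for a, b in zip(path, path[1:]):
        if a in tree_nodes:          # reached the existing tree: stop marking
            break
        tree_edges.add(frozenset((a, b)))
        tree_nodes.add(a)


# Hypothetical topology: core C, router A, receivers B and D.
graph = {"C": ["A"], "A": ["C", "B", "D"], "B": ["A"], "D": ["A"]}
tree_nodes, tree_edges = {"C"}, set()
join(graph, tree_nodes, tree_edges, "C", "B")   # marks B-A and A-C
join(graph, tree_nodes, tree_edges, "C", "D")   # stops at A, which is already on the tree
```

The second join adds only the edge D-A, which illustrates how the tree grows on demand instead of being rebuilt.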

Using Traditional Transport Protocols
- TCP/IP
  - Automatic flow control, reliable delivery, connection service, complexity; linear degradation in performance
- Unreliable broadcast/multicast
  - UDP, IP multicast: assumes hardware support
  - IP multicast: a bandwidth-conserving technology where the router reduces traffic by replicating a single stream of information and forwarding it to multiple clients. The sender sends a single copy to a special multicast IP address (Class D) that represents a group, where other members register.
  - Message losses high (30%) during heavy load
  - Reliable IP multicast is very expensive

IP Multicast
- Distribution trees
- Source is the root of a spanning tree
- Routers maintain and update distribution trees whenever members join or leave a group
- All multicasts are routed via a rendezvous point
- Too much load on routers; application-layer multicast overcomes this.

Group Communication Issues
- Ordering
- Delivery guarantees
- Membership
- Failure

Ordering Service
- Unordered
- Single-Source FIFO (SSF)
  - For all messages m_1, m_2 and all objects a_i, a_j: if a_i sends m_1 before it sends m_2, then m_2 is not received at a_j before m_1 is
- Totally Ordered
  - For all messages m_1, m_2 and all objects a_i, a_j: if m_1 is received at a_i before m_2 is, then m_2 is not received at a_j before m_1 is
- Causally Ordered
  - For all messages m_1, m_2 and all objects a_i, a_j: if m_1 happens before m_2, then m_2 is not received at a_i before m_1 is
[Figures omitted: one message diagram per ordering, showing m_1 and m_2 exchanged between a_i and a_j]
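
The SSF and total-order definitions can be turned directly into checks over recorded logs. The sketch below assumes each message has a unique identifier and appears at most once per log; the function names and log layout are made up for the example.

```python
from itertools import combinations


def is_fifo(sent, received):
    """sent: {sender: [msgs in send order]}; received: {receiver: [msgs in receive order]}."""
    for order in sent.values():
        rank = {m: k for k, m in enumerate(order)}
        for log in received.values():
            # Positions (in send order) of this sender's messages, as seen by the receiver.
            seen = [rank[m] for m in log if m in rank]
            if seen != sorted(seen):
                return False
    return True


def is_totally_ordered(received):
    """Any two receivers must agree on the relative order of the messages they share."""
    for log_a, log_b in combinations(received.values(), 2):
        common = set(log_a) & set(log_b)
        if [m for m in log_a if m in common] != [m for m in log_b if m in common]:
            return False
    return True


sent = {"a0": ["m1", "m2"], "a1": ["m3"]}
received = {"a2": ["m1", "m3", "m2"], "a3": ["m3", "m1", "m2"]}
fifo_ok = is_fifo(sent, received)            # each sender's order is respected
total_ok = is_totally_ordered(received)      # a2 and a3 disagree on m1 vs m3
```

In this example the logs satisfy single-source FIFO but not total order, which shows that the two guarantees are independent.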

Delivery guarantees
- Agreed delivery guarantees a total order of message delivery and allows a message to be delivered as soon as all of its predecessors in the total order have been delivered.
- Safe delivery requires, in addition, that if a message is delivered by the GC to any of the processes in a configuration, this message has been received and will be delivered to each of the processes in the configuration unless it crashes.

Membership
- Messages addressed to the group are received by all group members
- If processes are added to a group or deleted from it (due to a process crash, changes in the network or the user's preference), the change must be reported to all active group members while keeping consistency among them
- Every message is delivered in the context of a certain configuration, which is not always accurate. However, we may want to guarantee
  - Failure atomicity
  - Uniformity
  - Termination

Failure Model
- Failure types
  - Message omission and delay
    - Discover message omission and (usually) recover lost messages
  - Processor crashes and recoveries
  - Network partitions and re-merges
- Assume that faults do not corrupt messages (or that message corruption can be detected)
- Most systems do not deal with Byzantine behavior
- Faults are detected using an unreliable fault detector, based on a timeout mechanism

Some GC Properties
- Atomic multicast
  - Message is delivered to all processes or to none at all. May also require that messages are delivered in the same order to all processes.
- Failure atomicity
  - Failures do not result in incomplete delivery of multicast messages or holes in the causal delivery order
- Uniformity
  - A view change reported to a member is reported to all other members
- Liveness
  - A machine that does not respond to messages sent to it is removed from the local view of the sender within a finite amount of time.

Virtual Synchrony
- Virtual synchrony
  - Introduced in ISIS; orders group membership changes along with the regular messages
  - Ensures that failures do not result in incomplete delivery of multicast messages or holes in the causal delivery order (failure atomicity)
  - Ensures that, if two processes observe the same two consecutive membership changes, they receive the same set of regular multicast messages between the two changes
    - A view change acts as a barrier across which no multicast can pass
  - Does not constrain the behavior of faulty or isolated processes
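
As a rough illustration of the "same set of messages between the same two view changes" condition, the check below works on per-process event logs. The log encoding (`("view", id)` and `("msg", id)` tuples) and both function names are assumptions made for this sketch; it checks recorded logs and is not a protocol.

```python
def between_views(log):
    """Map (view, next_view) -> set of messages delivered between those two view changes."""
    result = {}
    current, msgs = None, set()
    for kind, value in log:
        if kind == "view":
            if current is not None:
                result[(current, value)] = msgs
            current, msgs = value, set()
        else:
            msgs.add(value)
    return result


def virtually_synchronous(logs):
    """True if every pair of processes agrees on the message set for each shared view pair."""
    intervals = [between_views(log) for log in logs.values()]
    for i, a in enumerate(intervals):
        for b in intervals[i + 1:]:
            for key in a.keys() & b.keys():
                if a[key] != b[key]:
                    return False
    return True


p = [("view", "v1"), ("msg", "m1"), ("msg", "m2"), ("view", "v2")]
q = [("view", "v1"), ("msg", "m2"), ("msg", "m1"), ("view", "v2")]   # same set, other order
r = [("view", "v1"), ("msg", "m1"), ("view", "v2")]                  # m2 is missing
s = [("view", "v1"), ("msg", "m1"), ("view", "v3")]                  # moved to a different view
```

Processes `p` and `q` satisfy the condition, `p` and `r` do not, and `p` and `s` are unconstrained because they did not observe the same two consecutive views.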

More Interesting GC Properties
- There exists a mapping k from the set of messages appearing in all rcv_i(m), for all i, to the set of messages appearing in snd_i(m), for all i, such that each message m in a rcv() is mapped to a message with the same content appearing in an earlier snd(), and:
- Integrity
  - k is well defined, i.e. every message received was previously sent
- No Duplicates
  - k is one-to-one, i.e. no message is received more than once
- Liveness
  - k is onto, i.e. every message sent is received
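
The three conditions on k can be checked on a finished trace. The sketch below makes simplifying assumptions: each send and receive event is a `(message id, destination)` pair, k maps a receive event to the send event with the same pair, and the "earlier snd()" timing requirement is not modeled. The function name is hypothetical.

```python
from collections import Counter


def check_properties(sends, receives):
    """sends, receives: lists of (message_id, destination) events."""
    sent = set(sends)
    counts = Counter(receives)
    return {
        # k is well defined: every receive event has a matching send event.
        "integrity": all(r in sent for r in counts),
        # k is one-to-one: no send event is the image of two receive events.
        "no_duplicates": all(c == 1 for c in counts.values()),
        # k is onto: every send event is the image of some receive event.
        "liveness": all(s in counts for s in sent),
    }


sends = [("m1", "a1"), ("m2", "a1")]
good = check_properties(sends, [("m1", "a1"), ("m2", "a1")])
bad = check_properties(sends, [("m1", "a1"), ("m1", "a1")])   # m1 twice, m2 never
```

In the second trace integrity still holds, but the duplicate of m1 breaks one-to-one and the missing m2 breaks onto.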

Reliability Service
- A service is reliable (in the presence of f faults) if there exists a partition of the object indices into faulty and non-faulty such that there are at most f faulty objects and the mapping k satisfies:
  - Integrity
  - No Duplicates
    - No message is received more than once at any single object
  - Liveness
    - Non-faulty liveness: when restricted to non-faulty objects, k is onto, i.e. all messages broadcast by a non-faulty object are eventually received by all non-faulty objects
    - Faulty liveness: every message sent by a faulty object is either received by all non-faulty objects or by none of them

Faults and Partitions
- When we have not heard from a processor P for a certain timeout, we issue a fault message
- When we get a fault message, we adopt it (and issue our own copy)
- Problem: maybe P is only slow
- When a partition occurs, we cannot always completely determine who received which messages (there is no solution to this problem)
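
A small sketch of the timeout-and-adopt rule described above. The class and method names are hypothetical, time is passed in explicitly to keep the example deterministic, and clearing a suspicion when a late message arrives is an assumption of this sketch (a real system may instead force the suspected processor to rejoin).

```python
class TimeoutDetector:
    """Unreliable fault detector: suspects a processor after `timeout` units of silence."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heard = {}
        self.suspected = set()

    def heard_from(self, proc, now):
        self.last_heard[proc] = now
        self.suspected.discard(proc)      # assumption: P was only slow, drop the suspicion

    def check(self, now):
        """Return processors newly suspected at time `now` (a fault message is issued for each)."""
        newly = [p for p, t in self.last_heard.items()
                 if now - t > self.timeout and p not in self.suspected]
        self.suspected.update(newly)
        return newly

    def on_fault_message(self, proc):
        """Adopt someone else's fault report; return True if we should issue our own copy."""
        if proc in self.suspected:
            return False
        self.suspected.add(proc)
        return True


d = TimeoutDetector(timeout=5)
d.heard_from("P", now=0)
first = d.check(now=3)      # still within the timeout
second = d.check(now=6)     # silence exceeded the timeout
```

Because the detector only observes silence, it cannot distinguish a crashed P from a slow one, which is the problem the slide points out.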

Extended Virtual Synchrony
- Introduced in Totem
- Processes can fail and recover
- Network can partition and remerge
- Does not solve all the problems of recovery in fault-tolerant distributed systems, but it avoids inconsistencies

Extended Virtual Synchrony (cont.)
- Virtual synchrony handles recovered processes as new processes
  - Can cause inconsistencies with network partitions
- Network partitions are real
  - Gateways, bridges, wireless communication

Extended Virtual Synchrony Model
- Network may partition into a finite number of components
  - Two or more may merge to form a larger component
- Each membership with a unique identifier is a configuration.
  - Membership ensures that all processes in a configuration agree on the membership of that configuration

Example: Network Partitions and Merges
[Figure sequence omitted; surviving labels per frame: (1) Logical Groups, Configuration, Partition; (2) Configurations; (3) Configurations, Subgroups; (4) Subgroups, Configuration; (5) Configuration, Logical Group]

Regular and Transitional Configurations
- To achieve safe delivery with partitions and remerges, the EVS model defines:
  - Regular configuration
    - New messages are broadcast and delivered
    - Sufficient for FIFO and causal communication modes
  - Transitional configuration
    - No new messages are broadcast; only remaining messages from the prior regular configuration are delivered.
  - A regular configuration may be followed and preceded by several transitional configurations.

Configuration change
- A process in a regular or transitional configuration can deliver a configuration change message such that it
  - follows delivery of every message in the terminated configuration, and
  - precedes delivery of every message in the new configuration.
- Algorithm for determining the transitional configuration
  - When a membership change is identified
    - Regular configuration members (that are still connected) start exchanging information
    - If another membership change is spotted (e.g. a failure cascade), this process is repeated all over again
    - Upon reaching a decision (on members and messages), the process delivers a transitional configuration message to members with the agreed list of messages
    - After delivery of all messages, the new configuration is delivered

Group Communication Semantics
- Virtual Synchrony Semantics [11,12]
  - Virtual synchrony
    - Every two members that participate in the same two consecutive view changes deliver the same set of messages between the two changes
  - Sending view delivery
    - Messages are delivered only to those members the sender thought were part of the group when the message was sent
    - Delay membership operations while other messages are being propagated (serialized transactions)
    - An extra round of messages is sent every time there is a view change, blocking other messages in the meantime (flush messages)
  - Closed group semantics
    - Only current members can send messages to the group
- Extended Virtual Synchrony Semantics [13]
  - Virtual synchrony
  - Same view delivery
    - Allows message delivery in a different view than it was sent in, as long as the message is delivered in the same view to all members
  - Open group semantics
[11] R. van Renesse, K. Birman, R. Friedman, M. Hayden and D. Karr, “A Framework for Protocol Composition in Horus”, PODC, 1995
[12] A. Fekete, N. Lynch and A. Shvartsman, “Specifying and using a Partitionable Group Communication Service”, PODC, 1997
[13] L. E. Moser, Y. Amir, P. M. Melliar-Smith and D. A. Agarwal, “Extended Virtual Synchrony”, ICDCS, 1994

Totem
- Provides a reliable, totally ordered multicast service over a LAN
- Intended for complex applications in which fault tolerance and soft real-time performance are critical
  - High throughput and low, predictable latency
  - Rapid detection of, and recovery from, faults
  - System-wide total ordering of messages
  - Scalable via hierarchical group communication
  - Exploits hardware broadcast to achieve high performance
- Provides 2 delivery services
  - Agreed
  - Safe
- Uses timestamps to ensure total order and sequence numbers to ensure reliable delivery

ISIS
- Tightly coupled distributed system developed over loosely coupled processors
- Provides a toolkit mechanism for distributed programming, whereby a DS is built by interconnecting fairly conventional non-distributed programs, using tools drawn from the kit
- Defines
  - How to create, join and leave a group
  - Group membership
  - Virtual synchrony
- Initially point-to-point (TCP/IP)
- Fail-stop failure model

Horus
- Aims to provide a very flexible environment for configuring groups of protocols specifically adapted to the problem at hand
- Provides efficient support for virtual synchrony
- Replaces point-to-point communication with group communication as the fundamental abstraction, which is provided by stacking protocol modules that have a uniform (upcall, downcall) interface
- Not every arrangement of protocol blocks makes sense
- HCPI
  - Stability of messages
  - Membership
- Electra
  - CORBA-compliant interface
  - Method invocation transformed into multicast

Transis
- How can different components of a partitioned network operate autonomously and then merge operations when they become reconnected?
- Are different protocols for fast local and slower cluster communication needed?
- A large-scale multicast service designed with the following goals
  - Tackling network partitions and providing tools for recovery from them
  - Meeting the needs of large networks through hierarchical communication
  - Exploiting fast clustered communication using IP multicast
- Communication modes
  - FIFO
  - Causal
  - Agreed
  - Safe

Future Challenges
- Next generations
  - Spread
  - Ensemble
- Other challenges
  - Security: secure group communication
  - Real-time: support for interactive and multimedia applications
- Group communication in wireless networks
  - Group-based communication with incomplete spatial coverage
  - Dynamic membership

Horus A Flexible Group Communication Subsystem

Horus: A Flexible Group Communication System
- Offers a flexible group communication model to application developers.
  1. System interface
  2. Properties of the protocol stack
  3. Configuration of Horus
     - Run in user space
     - Run in OS kernel/microkernel

Architecture
- Central protocol => Lego blocks
- Each Lego block implements a communication feature.
- Standardized top and bottom interface (HCPI)
  - Allows blocks to communicate
  - A block has entry points for upcall/downcall
  - Upcall = receive message, downcall = send message
- Create a new protocol by rearranging blocks.

Message_send
- Looks up the entry in the topmost block and invokes the function.
- The function adds a header.
- Message_send is recursively sent down the stack.
- The bottommost block invokes a driver to send the message.
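
The downcall path can be pictured with a toy stack. This is not the Horus API: the classes `Block`, `Driver` and `App`, the helper `build_stack`, the block names "ORDER" and "RELIABLE", and the loopback driver are all invented for the illustration.

```python
class Block:
    """A protocol block with a uniform downcall (send) / upcall (receive) interface."""

    def __init__(self, name):
        self.name = name
        self.above = self.below = None

    def down(self, headers, payload):
        headers.append(self.name)            # push this block's header
        self.below.down(headers, payload)

    def up(self, headers, payload):
        header = headers.pop()               # pop this block's header
        if header != self.name:
            raise ValueError("unexpected header " + header)
        self.above.up(headers, payload)


class Driver:
    """Bottommost element: a loopback 'wire' that hands each message straight back up."""

    def __init__(self):
        self.above = None
        self.wire = []

    def down(self, headers, payload):
        self.wire.append((list(headers), payload))
        self.above.up(list(headers), payload)


class App:
    """Topmost element: starts sends and collects received payloads."""

    def __init__(self):
        self.below = None
        self.inbox = []

    def send(self, payload):
        self.below.down([], payload)

    def up(self, headers, payload):
        self.inbox.append(payload)


def build_stack(*blocks):
    app, driver = App(), Driver()
    chain = [app, *blocks, driver]
    for upper, lower in zip(chain, chain[1:]):
        upper.below, lower.above = lower, upper
    return app, driver


app, driver = build_stack(Block("ORDER"), Block("RELIABLE"))
app.send("hello")
```

Each block pushes its header on the way down and pops it on the way up, so swapping or removing blocks changes the protocol without touching the application.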

- Each stack is shielded from the others.
- Each stack has its own threads and memory scheduler.

Endpoints, Group, and Message Objects
- Endpoints
  - Model the communicating entity
  - Have an address (used for membership); send and receive messages
- Group
  - Maintains local state on an endpoint
  - Group address: to which messages are sent
  - View: list of destination endpoint addresses of accessible group members
- Message
  - Local storage structure
  - Interface includes operations to pop/push headers
  - Passed by reference

Transis A Group Communication Subsystem

Transis: Group Communication System
- Network partitions and recovery tools.
  - Multiple disconnected components in the network operate autonomously.
  - Merge these components upon recovery.
- Hierarchical communication structure.
- Fast cluster communication.

Systems that depend on a primary component:
- Isis system: designates one component as primary and shuts down the non-primary components.
  - In the period before the partition is detected, non-primaries can continue to operate.
  - Operations are inconsistent with the primary
- Trans/Total system and Amoeba:
  - Allow continued operations
  - Inconsistent operations may occur in different parts of the system.
  - Don’t provide a recovery mechanism

Group Service
- The work of the collection of group modules.
- Manager of group messages and group views
- A group module maintains
  - Local view: list of currently connected and operational participants
  - Hidden view: like a local view, but indicates that the view has failed here yet may have formed in another part of the system.

Network partition wishlist
1. At least one component of the network should be able to continue making updates.
2. Each machine should know about the update messages that reached all of the other machines before they were disconnected.
3. Upon recovery, only the missing messages should be exchanged to bring the machines back into a consistent state.

Transis supports partition
- Not every application's progress depends on a primary component.
- In Transis, local views can be merged efficiently.
  - A representative replays messages upon merging.
- Supports recovering a primary component.
  - A non-primary can remain operational and wait to merge with the primary
  - A non-primary can generate a new primary if it is lost.
    - Members can totally order past view change events and recover possible loss.
    - Transis reports hidden views.

Hierarchical Broadcast

Reliable Multicast Engine
- In systems that do not lose messages often
  - Use negative acks
    - Messages not retransmitted
    - Positive acks are piggybacked onto regular messages
- Lost messages are detected ASAP
- Under high network traffic, the network and underlying protocol are driven to a high loss rate.
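
A minimal sketch of how a receiver might decide what to negatively acknowledge, assuming each sender numbers its messages 1, 2, 3, ... (the slides do not specify the numbering). Both function names are hypothetical.

```python
def missing(received_seqs):
    """Sequence numbers below the highest one seen that have not arrived: candidates for a NACK."""
    if not received_seqs:
        return []
    have = set(received_seqs)
    return [s for s in range(1, max(have)) if s not in have]


def contiguous_ack(received_seqs):
    """Highest n such that 1..n have all arrived: the value to piggyback as a positive ack."""
    have = set(received_seqs)
    ack = 0
    while ack + 1 in have:
        ack += 1
    return ack


gaps = missing([1, 2, 5])          # 3 and 4 were skipped
ack = contiguous_ack([1, 2, 5])    # everything up to 2 is safely received
```

A gap is noticed as soon as a later message arrives, which is why this scheme suits networks that rarely lose messages: in the common case no extra acknowledgement traffic is generated.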

Group Communication as an Infrastructure for Distributed System Management
- Table management
  - User accounts, network tables
- Software installation and version control
  - Speed up installation, minimize latency and network load during installation
- Simultaneous execution
  - Invoke the same commands on several machines

Management Server API
- Status: return status of the server and its host machine
- Chdir: change the server’s working directory
- Simex: execute a command simultaneously
- Siminist: install a software package
- Update-map: update a map while preserving consistency between replicas
- Query-map: retrieve information from the map
- Exit: terminate the management server process

Simultaneous Execution
- Identical management command on many machines.
  - Activate a daemon, run a script
- Management server maintains
  - Set M: most recent membership of the group reported by Transis
  - Set NR: set of currently connected servers that have not yet reported the outcome of a command execution to the monitor
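
The bookkeeping for M and NR can be sketched as follows. The class `CommandTracker` and its methods are hypothetical; the sketch assumes that a server dropped from the membership is also dropped from NR, which matches "currently connected" in the definition above.

```python
class CommandTracker:
    """Tracks which connected servers still owe a result for the current command."""

    def __init__(self, members):
        self.M = set(members)      # most recent membership reported by the group service
        self.NR = set()            # connected servers that have not reported yet
        self.results = {}

    def start(self):
        """Begin a new command: every current member is expected to report."""
        self.NR = set(self.M)
        self.results = {}

    def report(self, server, outcome):
        if server in self.NR:
            self.NR.remove(server)
            self.results[server] = outcome

    def view_change(self, members):
        """New membership: stop waiting for servers that are no longer connected."""
        self.M = set(members)
        self.NR &= self.M

    def done(self):
        return not self.NR


t = CommandTracker({"s1", "s2", "s3"})
t.start()
t.report("s1", "ok")
t.view_change({"s1", "s2"})     # s3 disconnected before reporting
```

At this point only `s2` remains in NR, so the command completes as soon as `s2` reports instead of blocking on the disconnected `s3`.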

Software Installation
- Transis disseminates files to group members.
  - The monitor multicasts a message advertising
    - Package P
    - Set of installation requirements Rp
    - Installation multicast group Gp
    - Target list Tp
  - A management server joins Gp if it belongs to Rp and Tp.
  - Status of all management servers is reported to the monitor
- Use the technique in “Simultaneous Execution” to execute installation commands.

Table Management
- Consistent management of replicated network tables.
- Servers sharing replicas of tables form a service group
- 1 primary server
  - Enforces a total order on update messages
  - If the network partitions, one component (the one containing the primary) can perform updates
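
One common way for a primary to enforce a total order is to stamp each update with a sequence number and have replicas apply updates strictly in that order. The slides do not say how the primary does it, so the following is an assumed sketch with hypothetical class names and made-up table contents.

```python
class Primary:
    """Assigns consecutive sequence numbers to updates."""

    def __init__(self):
        self.seq = 0

    def order(self, update):
        self.seq += 1
        return (self.seq, update)


class Replica:
    """Applies stamped updates in sequence order, holding back any that arrive early."""

    def __init__(self):
        self.table = {}
        self.applied = 0
        self.held = {}

    def receive(self, stamped):
        seq, update = stamped
        if seq > self.applied:               # ignore updates that were already applied
            self.held[seq] = update
        while self.applied + 1 in self.held:
            key, value = self.held.pop(self.applied + 1)
            self.table[key] = value
            self.applied += 1


primary = Primary()
u1 = primary.order(("host", "10.0.0.1"))
u2 = primary.order(("host", "10.0.0.2"))

r1, r2 = Replica(), Replica()
r1.receive(u1)
r1.receive(u2)
r2.receive(u2)      # arrives early and is held back
r2.receive(u1)      # now both are applied, in the primary's order
```

Both replicas end with the same table even though the network delivered the updates in different orders.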

Questions...
- Could provide tolerance for malicious intrusion
  - Many mechanisms for enforcing security policy in distributed systems rely on trusted nodes
  - While no single node needs to be fully trusted, the function performed by the group can be
- Problems
  - Network partitions and re-merges
  - Message omissions and delays
  - Communication primitives available in distributed systems are too weak (i.e. there is no guarantee regarding ordering or reliability)
- How can we achieve group communication?
  - Extending point-to-point networks

From Group Communication to Transactions...
- Adequate group communication can support a specific class of transactions in asynchronous distributed systems
- A transaction is a sequence of operations on objects (or on data) that satisfies
  - Atomicity
  - Permanence
  - Ordering
- Group for fault tolerance
  - Share common state
  - Update of the common state requires
    - Delivery permanence (majority agreement)
    - All-or-none delivery (multicast to multiple groups)
    - Ordered delivery (serializability of multiple groups)
- Transactions based on group communication primitives represent an important step toward extending the power and generality of GComm
