Fail-safe Communication Layer for DisplayWall Yuqun Chen

DisplayWall Software Architecture
[Diagram: a master node connected to multiple render nodes over a logical network, carrying command broadcast, synchronization, and data exchange.]

Motivation
– Complex communication patterns
  – programming the DisplayWall is difficult
  – e.g., the BitBlt operation in the Virtual Display Driver
– Nodes and network links do fail
  – a larger system is more likely to fail
  – the OS may not be stable under high load
  – applications have bugs

Design Goals [1]
– Ease writing distributed applications on the DisplayWall
  – supports some form of group concept
  – multicast or broadcast
  – no need to manage pair-wise connections
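A minimal sketch of the group abstraction this goal implies: callers broadcast to a named group instead of managing pair-wise connections themselves. The `Group` class and its method names are hypothetical, not the actual DisplayWall API.

```python
class Group:
    """A named set of member nodes; callers broadcast to the group
    instead of opening and tracking pair-wise connections."""

    def __init__(self, members):
        self.members = set(members)                  # node ids currently in the group
        self.inboxes = {m: [] for m in self.members} # stand-in for per-node delivery

    def broadcast(self, msg):
        # deliver to every current member; membership is hidden from the caller
        for m in self.members:
            self.inboxes[m].append(msg)

    def join(self, node):
        self.members.add(node)
        self.inboxes.setdefault(node, [])

    def leave(self, node):
        self.members.discard(node)

g = Group(["render0", "render1"])
g.broadcast("draw frame 1")
g.join("render2")            # a recovered node joins; senders don't change
g.broadcast("draw frame 2")
```

A late joiner only sees messages sent after it joined; bringing its state up to date is the soft-state problem covered later in the talk.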

Design Goals [2]
– An API for designing fail-safe DisplayWall apps
  – tolerates independent failures and recovery of render nodes
  – failures at the application master nodes are considered catastrophic
  – certain bugs cause all render nodes to fail; we may or may not be able to deal with this

What’s different from other systems?
– API-wise
  – mostly broadcast
  – some pair-wise exchange
– Fault-tolerance-wise
  – real-time characteristics: cannot wait for too long
  – OK to lose certain messages: dropping a few frames is OK

Some Requirements
– Simple abstraction
  – users shouldn’t deal with pair-wise connections
– Realizable on a variety of platforms
  – with and without a programmable NI
– Support storage and retrieval of application-dependent states (soft states)
– Synchronized clocks and barriers

Communication Patterns
– Command/data delivery
  – from the master to some render nodes
  – broadcast in nature
– Data exchange
  – among render nodes, e.g., bitblt and v-tiling
  – pair-wise in nature
– Synchronization: clocks and barriers
  – low-latency

Outline
– Communication patterns
– Soft states
– API issues

Command Broadcast
– Used by all applications
  – VDD, OpenGL, ImageViewer
– Issues:
  – efficiency
  – dynamic membership: live/dead nodes, overlapped windows
  – delivery semantics: best-effort, guaranteed, or sloppy

Guaranteed or Best-effort?
– Guaranteed delivery implies
  – a re-configurable (logical) topology for delivering data
  – lossless delivery to all nodes
    – an intermediate node may fail right after delivery
    – so the upstream node has to keep all the data for retransmission
– Best-effort
  – deliver as far as the current topology allows
  – or with limited flexibility

Transactions?
– Each broadcast is treated as a transaction
  – the sender keeps the data until the transaction commits
– Transactions are asynchronous
  – one doesn’t wait until the previous one commits
  – but they are applied in order
– The applied transactions mark the state at each node
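The "asynchronous but applied in order" rule can be sketched as a receiver that buffers early arrivals and commits transactions strictly by ID. The class and field names are illustrative, not the DisplayWall implementation.

```python
class TxnNode:
    """Applies broadcast transactions strictly in ID order,
    buffering any that arrive before their predecessors commit."""

    def __init__(self):
        self.next_id = 0    # ID of the next transaction we may apply
        self.pending = {}   # id -> payload, held until predecessors commit
        self.applied = []   # committed state, in order

    def receive(self, txn_id, payload):
        self.pending[txn_id] = payload
        # apply every consecutive transaction we now hold
        while self.next_id in self.pending:
            self.applied.append(self.pending.pop(self.next_id))
            self.next_id += 1

n = TxnNode()
n.receive(1, "resize window")   # arrives early: buffered, not applied
n.receive(0, "draw frame")      # unblocks both 0 and 1
n.receive(2, "swap buffers")
```

Because the sender does not wait for commits, several transactions can be in flight at once; the per-node ordering is what makes the applied prefix a well-defined state marker.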

Fail-safe Broadcast

recv(msg):
    send msg to children c1, c2, …, cn
    if a child c has failed then
        recompute c's subtree without c
        send to the new child cc
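The pseudocode above can be sketched concretely: when a child is dead, the live roots of its subtree are adopted and the message is re-sent to them. The tree representation and function names are assumptions for illustration.

```python
def live_roots(node, children, alive):
    """Roots of the live subtrees under a failed node: its live children,
    plus, recursively, the live roots under its dead children."""
    roots = []
    for c in children.get(node, []):
        if alive[c]:
            roots.append(c)
        else:
            roots.extend(live_roots(c, children, alive))
    return roots

def broadcast(node, msg, children, alive, delivered):
    """Forward msg down the broadcast tree, routing around dead children."""
    delivered.append(node)
    for c in children.get(node, []):
        if alive[c]:
            broadcast(c, msg, children, alive, delivered)
        else:
            # recompute c's subtree without c: adopt its live roots directly
            for r in live_roots(c, children, alive):
                broadcast(r, msg, children, alive, delivered)

tree = {"master": ["a", "b"], "a": ["c", "d"]}
alive = {"master": True, "a": False, "b": True, "c": True, "d": True}
got = []
broadcast("master", "swap frame", tree, alive, got)
```

Here node "a" has failed, so the master adopts "c" and "d" directly; every live node still receives the message exactly once.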

Questions
– How to detect failure in a timely fashion?
  – a global solution
    – the leaves send ACKs to a master
    – the master forces a global reconfiguration after a timeout
  – a local solution
    – periodic positive ACKs to the parent
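The local solution amounts to a heartbeat-style failure detector: a child is suspected once its last periodic ACK is older than the timeout. This is a sketch; the timing values and names are hypothetical.

```python
class HeartbeatMonitor:
    """A parent's view of its children: each child sends periodic
    positive ACKs, and a child is suspected dead once its last ACK
    is older than the timeout."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_ack = {}   # child -> timestamp of most recent ACK

    def ack(self, child, now):
        self.last_ack[child] = now

    def suspects(self, now):
        # children whose ACKs have gone quiet for longer than the timeout
        return [c for c, t in self.last_ack.items() if now - t > self.timeout]

m = HeartbeatMonitor(timeout=3)
m.ack("render0", now=0)
m.ack("render1", now=0)
m.ack("render0", now=4)   # render0 keeps ACKing; render1 goes silent
```

A suspected child then triggers the subtree recomputation from the previous slide; a false suspicion only costs an unnecessary reconfiguration, which matches the talk's "OK to lose a few frames" stance.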

Data Exchange
– Pair-wise sends and recvs
– High-level issues:
  – keep a recv from getting stuck on a failed sender
    – timeout, or periodic “I am alive” ACKs
– Implementation issues:
  – neighborhood communication?
    – probably sufficient for most apps, except for load balancing
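The timeout idea can be shown with a receive that never blocks forever: if the peer has failed and nothing arrives in time, the caller falls back to a default (e.g., a stand-in image, as in the BitBlt example later). This sketch uses Python's standard `queue` as a stand-in transport; the function name is illustrative.

```python
import queue
import threading

def recv_with_timeout(q, timeout, default):
    """Receive from a peer, but give up after `timeout` seconds and
    return a caller-supplied default instead of blocking forever."""
    try:
        return q.get(timeout=timeout)
    except queue.Empty:
        return default   # sender presumed failed; proceed with default

live = queue.Queue()
# simulate a healthy peer that delivers shortly after we start waiting
threading.Timer(0.05, live.put, args=("pixels",)).start()
ok = recv_with_timeout(live, timeout=2.0, default="blank image")

# a failed peer: its queue never receives anything
dead = recv_with_timeout(queue.Queue(), timeout=0.05, default="blank image")
```

The receiver makes progress in both cases, which is exactly the property the slide asks for.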

Synchronization
– Barrier
  – a special form of broadcast
  – mostly used for the global frame-buffer swap
– Clock synchronization
  – can be used to reduce the frequency of barriers
  – e.g., MPEG playback paced by the local clock
    – what if it misses the deadline?
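A counting barrier for the global frame-buffer swap can be sketched as follows: nobody proceeds until all n nodes arrive, and the last arrival releases the round. This single-process sketch elides the network messages and any failure timeout; the names are hypothetical.

```python
class SwapBarrier:
    """Counting barrier: the n-th arrival completes the round and
    triggers the global frame-buffer swap."""

    def __init__(self, n):
        self.n = n
        self.arrived = 0
        self.generation = 0   # how many swap rounds have completed

    def arrive(self):
        self.arrived += 1
        if self.arrived == self.n:
            self.arrived = 0
            self.generation += 1
            return True       # last arriver: perform the swap
        return False          # wait for the rest of the group

b = SwapBarrier(3)
results = [b.arrive(), b.arrive(), b.arrive()]
```

In a fail-safe version, the barrier would also time out on nodes the failure detector suspects, so one dead render node cannot stall every frame swap.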

Unification
– Transaction-based messages implement broadcast
  – implements some form of global ordering
– Message passing for pair-wise data exchange
  – the key is to detect failures quickly

Example 1: VDD BitBlt

BitBlt:
    calculate data to send
    calculate data to recv
    read frame buffer
    send data to some nodes
        if failure then take note
    recv data from some nodes
        if failure then use a default image

Failure Recovery
– What happens when a failed render node comes back up?
  – it puts itself back into the broadcast tree
  – it re-establishes peer-to-peer message connections
    – hopefully all hidden from the application
  – it has to bring its states up to date
    – highly application dependent

Soft States
– Example: OpenGL display lists
  – each list consists of a series of GL commands
    – must be re-executed to make the list meaningful
– Textures
  – may be bound to a texture name

Soft States API
– Tagged Safe Memory
  – a chunk of memory replicated on all nodes
  – tagged and ordered by an ID
  – a recovery handle/function is associated with it
– Operations: create, insert, and delete
– Upon recovery:
  – retrieve all “live” chunks and apply the handles in order
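A sketch of the Tagged Safe Memory idea: chunks keyed and ordered by ID, each carrying a recovery handler, replayed in ID order when a node comes back up. The class, method names, and handler signature are assumptions; the real API is not specified on the slide.

```python
class TaggedSafeMemory:
    """Soft-state store: chunks tagged and ordered by an ID, each with
    a recovery handler; on recovery, live chunks are replayed in order."""

    def __init__(self):
        self.chunks = {}   # id -> (data, recovery handler)

    def insert(self, chunk_id, data, handler):
        self.chunks[chunk_id] = (data, handler)

    def delete(self, chunk_id):
        self.chunks.pop(chunk_id, None)

    def recover(self):
        # replay every live chunk's handler in ID order
        log = []
        for cid in sorted(self.chunks):
            data, handler = self.chunks[cid]
            log.append(handler(data))
        return log

mem = TaggedSafeMemory()
mem.insert(2, "texture", lambda d: f"rebind {d}")
mem.insert(1, "display list", lambda d: f"re-execute {d}")
mem.insert(3, "old font", lambda d: f"reload {d}")
mem.delete(3)                 # deleted chunks are not replayed
replayed = mem.recover()
```

Ordering by ID matters because later chunks (e.g., a texture binding) can depend on earlier ones (the display list that references it), mirroring the OpenGL examples on the previous slide.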

Example 2: VDD Recovery
– Re-establish connections between nodes
– Restore states:
  – cached bitmaps from other running nodes
  – fonts and brushes from other nodes
– Or, force a re-draw from the master nodes

Messages?
– Transactions can be implemented on top of messages
  – add a transaction ID to each message
– People are familiar with message passing
  – as opposed to remote memory access, where you have to manage the memory yourself, which gets messy
– What about copy avoidance?

Communication API
– Message passing interface
  – send(node, type, data), recv(node, type, data)
  – only need to specify a remote node id
  – the connection is hidden from the API
  – copies can be avoided by returning a buffer pointer instead of filling the user buffer
  – very close to the sockets API, but more flexible
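The node-addressed surface described above can be sketched as follows: callers name a remote node and a message type, connections stay hidden, and recv hands back the buffer itself rather than copying into a user buffer. This in-process stand-in (class and queue layout hypothetical) only illustrates the API shape.

```python
class MsgLayer:
    """send/recv addressed by (node, type); connection management and
    routing are hidden behind the layer."""

    def __init__(self):
        self.queues = {}   # (node, type) -> list of in-flight buffers

    def send(self, node, mtype, data):
        self.queues.setdefault((node, mtype), []).append(data)

    def recv(self, node, mtype):
        # return a reference to the sender's buffer instead of
        # copying it into a caller-supplied buffer (copy avoidance)
        q = self.queues.get((node, mtype), [])
        return q.pop(0) if q else None

net = MsgLayer()
buf = bytearray(b"frame data")
net.send("render1", "pixels", buf)
out = net.recv("render1", "pixels")
```

The zero-copy contract shows up in the test: the receiver gets the very same buffer object, so the layer never allocated or filled a second copy.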

Copy-Avoiding Message Passing
– Trivial
  – just return a pointer to the buffer
– Remote memory semantics are not necessary
  – it is very rare to update a remote data structure
  – memory copy isn’t that bad (> 200 MB/sec)
– What about remote bitblt?
  – the only missing part is peer-to-peer, which we can’t do anyway

Copy-Avoiding Messages
[Diagram: a new message arrives through the NIC into a global buffer shared by the core logic and the graphics subsystem; recv(msg) returns the message, header and all, in place rather than copying it into a user buffer.]