A Progressive Fault Tolerant Mechanism in Mobile Agent Systems Michael R. Lyu and Tsz Yeung Wong July 27, 2003 SCI Conference Computer Science Department.

Slides:



Advertisements
Similar presentations
Remus: High Availability via Asynchronous Virtual Machine Replication
Advertisements

TRANSACTION PROCESSING SYSTEM ROHIT KHOKHER. TRANSACTION RECOVERY TRANSACTION RECOVERY TRANSACTION STATES SERIALIZABILITY CONFLICT SERIALIZABILITY VIEW.
1 Fault-Tolerance Techniques for Mobile Agent Systems Prepared by: Wong Tsz Yeung Date: 11/5/2001.
(c) Oded Shmueli Distributed Recovery, Lecture 7 (BHG, Chap.7)
Computer Science Lecture 18, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.
Making Services Fault Tolerant
LYU9905 Security in Mobile Agent E- Commerce Systems Prepared by : Wong Ka Ming, Caris Wong Tsz Yeung, Ah Mole Supervisor : LYU Rung Tsong Michael.
CS 582 / CMPE 481 Distributed Systems Fault Tolerance.
7. Fault Tolerance Through Dynamic (or Standby) Redundancy The lowest-cost fault-tolerance technique in multiprocessors. Steps performed: When a fault.
A Progressive Fault Detection and Service Recovery Mechanism in Mobile Agent Systems Wong Tsz Yeung Aug 26, 2002.
Computer Science Lecture 17, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.
MPICH-V: Fault Tolerant MPI Rachit Chawla. Outline  Introduction  Objectives  Architecture  Performance  Conclusion.
Authors: Lyu, M.R., Xinyu Chen, and Tsz Yeung Wong From : Intelligent Systems, IEEE [see also IEEE Intelligent Systems and Their Applications vol.19,Issue.
CS-550 (M.Soneru): Recovery [SaS] 1 Recovery. CS-550 (M.Soneru): Recovery [SaS] 2 Recovery Computer system recovery: –Restore the system to a normal operational.
Design, Implementation, and Experimentation on Mobile Agent Security for Electronic Commerce Applications Anthony H. W. Chan, Caris K. M. Wong, T. Y. Wong,
(C) 2004 Daniel SorinDuke Architecture Using Speculation to Simplify Multiprocessor Design Daniel J. Sorin 1, Milo M. K. Martin 2, Mark D. Hill 3, David.
16: Distributed Systems1 DISTRIBUTED SYSTEM STRUCTURES NETWORK OPERATING SYSTEMS The users are aware of the physical structure of the network. Each site.
SRDS’03 Performance and Effectiveness Analysis of Checkpointing in Mobile Environments Xinyu Chen and Michael R. Lyu The Chinese Univ. of Hong Kong Hong.
1 Making Services Fault Tolerant Pat Chan, Michael R. Lyu Department of Computer Science and Engineering The Chinese University of Hong Kong Miroslaw Malek.
1 Rollback-Recovery Protocols II Mahmoud ElGammal.
Distributed Deadlocks and Transaction Recovery.
Distributed Transactions March 15, Transactions What is a Distributed Transaction?  A transaction that involves more than one server  Network.
04/18/2005Yan Huang - CSCI5330 Database Implementation – Distributed Database Systems Distributed Database Systems.
S OFTWARE F AULT T OLERANCE I N A C LUSTERED A RCHITECTURE : T ECHNIQUES & R ELIABILITY M ODELING Hüsnü Şensoy.
Molecular Transactions G. Ramalingam Kapil Vaswani Rigorous Software Engineering, MSRI.
BFTCloud: A Byzantine Fault Tolerance Framework for Voluntary-Resource Cloud Computing Yilei Zhang, Zibin Zheng, and Michael R. Lyu
Chapter 19 Recovery and Fault Tolerance Copyright © 2008.
CS 5204 (FALL 2005)1 Leases: An Efficient Fault Tolerant Mechanism for Distributed File Cache Consistency Gray and Cheriton By Farid Merchant Date: 9/21/05.
Reliable Communication in the Presence of Failures Based on the paper by: Kenneth Birman and Thomas A. Joseph Cesar Talledo COEN 317 Fall 05.
Distributed Transactions Chapter 13
EEC 688/788 Secure and Dependable Computing Lecture 7 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
ISADS'03 Message Logging and Recovery in Wireless CORBA Using Access Bridge Michael R. Lyu The Chinese Univ. of Hong Kong
University of Tampere, CS Department Distributed Commit.
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved Chapter 8 Fault.
Chapter 2 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University Building Dependable Distributed Systems.
Totally Ordered Broadcast in the face of Network Partitions [Keidar and Dolev,2000] INF5360 Student Presentation 4/3-08 Miran Damjanovic
Improving TCP Performance over Wireless Networks
WS-DREAM: A Distributed Reliability Assessment Mechanism for Web Services Zibin Zheng, Michael R. Lyu Department of Computer Science & Engineering The.
Course: COMS-E6125 Professor: Gail E. Kaiser Student: Shanghao Li (sl2967)
Section 06 (a)RDBMS (a) Supplement RDBMS Issues 2 HSQ - DATABASES & SQL And Franchise Colleges By MANSHA NAWAZ.
1 Fault-Tolerant Computing Systems #1 Introduction Pattara Leelaprute Computer Engineering Department Kasetsart University
Building Dependable Distributed Systems, Copyright Wenbing Zhao
EEC 688/788 Secure and Dependable Computing Lecture 6 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Multidatabase Transaction Management COP5711. Multidatabase Transaction Management Outline Review - Transaction Processing Multidatabase Transaction Management.
1 Developing Aerospace Applications with a Reliable Web Services Paradigm Pat. P. W. Chan and Michael R. Lyu Department of Computer Science and Engineering.
Antidio Viguria Ann Krueger A Nonblocking Quorum Consensus Protocol for Replicated Data Divyakant Agrawal and Arthur J. Bernstein Paper Presentation: Dependable.
Relying on Safe Distance to Achieve Strong Partitionable Group Membership in Ad Hoc Networks Authors: Q. Huang, C. Julien, G. Roman Presented By: Jeff.
FTOP: A library for fault tolerance in a cluster R. Badrinath Rakesh Gupta Nisheeth Shrivastava.
Distributed DBMSPage © 1998 M. Tamer Özsu & Patrick Valduriez Outline Introduction Background Distributed DBMS Architecture Distributed Database.
Seminar On Rain Technology
SEMINAR TOPIC ON “RAIN TECHNOLOGY”
EEC 688/788 Secure and Dependable Computing Lecture 10 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Primary-Backup Replication
Outline Introduction Background Distributed DBMS Architecture
EEC 688/788 Secure and Dependable Computing
Outline Announcements Fault Tolerance.
Fault Tolerance Distributed Web-based Systems
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Outline Introduction Background Distributed DBMS Architecture
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Database Recovery 1 Purpose of Database Recovery
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Reliable Web Services: Methodology, Experiment and Modeling International Conference on Web Services (ICWS 2007) Pat. P. W. Chan, Michael R. Lyu Department.
Last Class: Fault Tolerance
Presentation transcript:

A Progressive Fault Tolerant Mechanism in Mobile Agent Systems Michael R. Lyu and Tsz Yeung Wong July 27, 2003 SCI Conference Computer Science Department Chinese University of Hong Kong

Outline Introduction of the Problem Problem Solutions – Server failure detection and recovery – Agent failure detection and recovery Reliability Evaluations – Using agent implementation – Using Stochastic Petri Net Simulation

Introduction of the problem Mobile agents are autonomous software codes sent out in netwroks to perform services on behalf of their host. We focus on designing a fault-tolerant mobile agent system The challenge is: – Guarantee service availability in the presence of server failures. – Guarantee service availability in the presence of agent failures. – Preserve data consistency in both agents and servers. – Preserve the exactly-once property. – Guarantee the agent can eventually finish its tasks.

Introduction of the problem Fault-tolerance is classified into levels – Level 0: No tolerance to faults – Level 1: Server failure detection and recovery – Level 2: Agent failure detection and recovery

Level 0 No tolerance to faults – Why agents die? because of server failure because of faults inside agent – Application has to restart manually. – Affected server may leave an inconsistent state after recovery.

Level 1 Server failure detection and recovery – Incorporate a failure detection program (monitor, watchdog). – When a server restarts, abort all uncommitted transactions in the server. This preserves data consistency – When the agent re-executes after the initial states Visited servers will be visited again Violates exactly-once execution property

Level 2 Agent failure detection and recovery – When a server fails, its residing agents are lost. We aim at recovering such loss in this level – By using checkpointing We checkpoint agent internal data We use checkpointed data to recover lost agents. Agent data consistency is preserved – Recovery of agent happens on the failed server This preserves the exactly-once execution property.

Design of Level 1 FT We have a global daemon which monitors all the servers. Single point of failure problem monitoring daemon server pool

Design of Level 1 FT When the daemon recovers a server – It aborts all the uncommitted transactions performed by those lost agents. – This preserves data consistency in the server. This technique is – Easy to implement – Can be deployed on every existing mobile agent platform, without modifying the platform.

Design of Level 2 FT We use cooperative agents. – Actual agent – Witness agent Actual agent performs actual computation for the user. Witness agent monitors the availability of actual agent. – It follows behind the actual agent.

Design of Level 2 FT In our protocol, actual agents are able to communicate with the witness agent – the message is not a broadcast one, but a peer-to- peer one – Actual agent assumes that the witness agent is in the previous server – Actual agent must know the address of the previous server

Protocol of Level 2 FT Server i-1 Server iServer i+1 Server log Arrive at i Agent messages box Leave i leave arrive Checkpointing happens!!

Protocol of Level 2 FT Server i-1 Server iServer i+1 Server log Agent messages box Arrive at i+1 Leave i+1 Arrive at i+1 Leave i+1 spawn Arrive at i Leave i

Failure and Recovery Scenarios We only cover stopping failures. (I.e., assuming Byzantine failures do not exist) We handle most kinds of failures: – Witness agents fail to receive “arrive at i” message – Witness agents fail to receive “leave i” message – Witness agent failures

Missing arrive message The reason may be: 1. message is lost message is lost 2. message arrives after timeout period message arrives after timeout period 3. actual agent dies when it is ready to leave server i-1 actual agent dies when it is ready to leave server i-1 4. actual agent dies when it has just arrive at server i, without logging. actual agent dies when it has just arrive at server i, without logging. 5. actual agent dies when it has just arrive at server i, with logging. actual agent dies when it has just arrive at server i, with logging. Arrive at i Zzz.. Next

Missing arrive message It is simple for the 1 st and 2 nd case. Server i-1 Server i Server log Arrive at i probe found log retransmit message Back

Missing arrive message For the 3 rd and 4 th cases, recovery takes place. Server i-1 Server i Server log checkpoint data no log retransmit message Back

Missing arrive message For the 5 th case, it results in missing detection. – since log appears in the server – the consequence is that “leave i” message never arrives. Back

Missing leave message The reason may be: 1. message is lost. 2. message arrives after timeout period 3. actual agent dies when it has just sent the “arrive at i” message actual agent dies when it has just sent the “arrive at i” message 4. actual agent dies when it has just logged the message “leave i” message. actual agent dies when it has just logged the message “leave i” message leave i Zzz.. Next

Missing leave message The 3 rd case is the same as the previous missing detection case. Server i-1 Server i Server log checkpoint data no log retransmit message Arrive at i

Missing leave message In this case, the recovery action is the same as the previous section. – When failure happens, the agent should be performing computation. – So, when server recovers, the agent’s computation has aborted. Back

Missing leave message This results in missing detection again. – This can be compensated by the 3 rd case in the previous discussion.3 rd case – It is because the witness will never receive “arrive i+1”.

Witness Failure Scenarios There is a chain of witness agents leaving on the itinerary of the agent – The latest witness monitors the actual agent. – Other witnesses monitor its preceding witness i-1i Witnessing dependency

Witness Failure Scenarios Server i-2 Server i-1Server i recovery

Simplification Assume that 2-server failure would not happen – We can simplify our witnessing dependency i-2i-1i

Simplification If failure strikes server i-1 – Witness on server i-2 can recover witness on server i-1 If failure strikes server i-2 – Will not recover it – Because within a short period, no failure would happen

Reliability Evaluation The results are obtained by – an agent system implementation using Concordia. – simulation using Stochastic Petri Net. – aim: to measure the percentage of successful round-trip-travel. Home

Reliability Evaluation

about 60% about 5%

Reliability Evaluation For agent failure detection only

Reliability Evaluation 100% about 60%

Reliability Evaluation about 140% # of original agent # of recovered agent

Conclusion Categorized the fault-tolerance of mobile agent system. Designed a scheme for both server and agent failure detection and recovery. Analyzed most failure scenarios in mobile agent systems. Conducted performance evaluations which show – Our scheme is a promising technique – Trade-off between cost and levels of reliability