Sapna E. George, Ing-Ray Chen Presented By Yinan Li, Shuo Miao April 14, 2009.

Agenda
- Introduction
- The Proposed Approach
- System Model
- Checkpointing, Logging, and Recovery
- Performance Analysis
- The SPN Model
- Results and Analysis
- Summary and Applicability

Failure-Prone Mobile Computing
- Mobile computing
  - A type of distributed computing involving hosts that are mobile while retaining a network connection through unreliable wireless communication
  - Prone to failures
- Why is mobile computing prone to failures?
  - Host mobility
    - Link breakage
    - Limits the use of a back-end power source
    - Limited battery life
  - Wireless communication
    - Unreliable wireless links
    - Intermittent connectivity
    - Bandwidth limitations

Failure Recovery in Traditional Distributed Systems
- Checkpointing
  - Process state is periodically saved to stable storage, such as a hard disk; the saved state is called a checkpoint
  - Checkpoints may be large, e.g., the size of a checkpoint may be several KB
  - Sometimes called a snapshot
  - Independent vs. coordinated
    - Independent checkpointing: each process independently and asynchronously takes local checkpoints
    - Coordinated checkpointing: all processes synchronize to jointly build global checkpoints
  - An expensive operation
    - While a checkpoint is being taken, the process has to be suspended
    - The size of checkpoints may be large
  - How can we reduce the number of checkpointing operations while still enabling rollback recovery?

Failure Recovery in Traditional Distributed Systems
- Logging
  - Proposed as a complement to checkpointing, to reduce the number of expensive checkpointing operations while still enabling effective recovery
  - Records all non-deterministic events in a process between two checkpoints that may change the state of the process (e.g., receipt of a message, user input), together with the information necessary to replay these events
  - Asynchronous recovery is achieved by combining checkpointing with logging

Failure Recovery in Traditional Distributed Systems
- Rollback recovery
  - Upon failure, the state of the failed process is rolled back to the most recent checkpoint
  - Events found in the logs that happened after the most recent checkpoint are replayed in their original order
  - Independent recovery
    - Each process recreates its pre-failure state independently (asynchronously)
  - Computation is restarted after rollback

New Challenges Brought by Mobile Computing
- Are traditional methods still directly applicable?
  - Rethinking …
- Properties of mobile computing that complicate the situation
  - Mobility of hosts
  - Unreliable wireless network connectivity
  - Low wireless bandwidth
  - Limited battery life of mobile devices
  - Lack of stable storage
  - Different types of failures
    - Voluntary disconnection (to save battery life)
    - Hardware failure
    - Transmission errors due to noise
    - Software failure

The Proposed Approach (1/2)
- Why is it proposed?
  - Traditional recovery schemes may not work
    - Due to the characteristics of mobile computing
    - And the challenges imposed by mobile applications
  - Limitations of existing methods
    - Checkpoints are taken periodically
      - If the checkpointing frequency is high, the additional overhead is large
      - If the frequency is low, the recovery cost may be very large (the distance between the current mobile support station (MSS) of the mobile host (MH) and the MSS that stores the checkpoint may be large, and the logs may be widely dispersed)
    - Rigid mechanism for the whole system, lacking the flexibility of per-user methods

The Proposed Approach (2/2)
- Proposed approach
  - Movement-based, instead of periodic
  - A checkpoint is taken only when a mobile host has made M handoffs
  - Goal: find the optimal M that minimizes the total recovery cost
- Advantages
  - Minimization of the total recovery cost
    - The optimal threshold M ensures that
      - Checkpoints stay close to the recovery MSS
      - Logs are not too widely dispersed
      - Overhead of unnecessary checkpoints and logs is avoided
  - Per-user based
    - M is a function of the failure rate, log arrival rate, and mobility rate of individual MHs, so it adapts to specific user behavior
  - Wide-range applicability

System Model (1/2)

System Model (2/2)
- MHs are not assumed to have stable storage
- MSSs are equipped with enough stable storage
  - However, efficient storage management is still necessary
- Interactions between the MH and the network infrastructure most relevant to failure recovery
  - Handoff, disconnect, and reconnect
- Process state
  - Normal: computing, receiving user inputs, sending/receiving messages
  - Save
    - Checkpointing: checkpoints are sent to the MSS for stable storage
    - Logging: events happening between two checkpoints that cause process state changes, e.g., receipt of messages or user inputs, are logged by the MSS to stable storage
  - Recovery: rollback

Checkpointing & Logging (1/3)

Checkpointing & Logging (2/3)
- Each MH maintains several variables locally
  - handoff_counter: stores the number of handoffs
  - cp_seq: stores the sequence number of the most recent checkpoint
  - cp_loc: stores the ID of the MSS that stores the most recent checkpoint
    - MSScp: the MSS that stores the most recent checkpoint
  - log_set: contains a list of IDs of the MSSs that store the logs
    - MSSlogs: the set of MSSs that store the logs
  - These variables must survive the host failure

Checkpointing & Logging (3/3)
- Procedure
  - When handoff_counter = M
    - A new checkpoint is taken
    - The checkpoint is sent to the current MSS for stable storage
    - cp_seq is set to the sequence number of this new checkpoint
    - cp_loc is updated with the current MSS
    - handoff_counter is reset to zero
    - log_set is cleared
  - When a new log is recorded
    - The log is sent to the current MSS for stable storage
    - The ID of the MSS is inserted into log_set if it is not present
  - When a handoff occurs
    - handoff_counter is increased by 1 and checked against M
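The bookkeeping in this procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `MobileHost`, its method names, and the use of string MSS IDs are hypothetical, and the actual transfer of checkpoints and logs to the MSS is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MobileHost:
    """Per-MH variables from the slides; class and method names are hypothetical."""
    threshold_m: int                   # movement threshold M
    current_mss: str                   # ID of the MSS currently serving the MH
    handoff_counter: int = 0           # handoffs since the last checkpoint
    cp_seq: int = 0                    # sequence number of the latest checkpoint
    cp_loc: Optional[str] = None       # ID of the MSS storing the latest checkpoint
    log_set: List[str] = field(default_factory=list)  # IDs of MSSs storing logs

    def take_checkpoint(self) -> None:
        # Sending the checkpoint to the current MSS is omitted in this sketch.
        self.cp_seq += 1
        self.cp_loc = self.current_mss
        self.handoff_counter = 0
        self.log_set.clear()

    def record_log(self) -> None:
        # Sending the log entry to the current MSS is omitted in this sketch.
        if self.current_mss not in self.log_set:
            self.log_set.append(self.current_mss)

    def handoff(self, new_mss: str) -> None:
        self.current_mss = new_mss
        self.handoff_counter += 1
        if self.handoff_counter == self.threshold_m:
            self.take_checkpoint()
```

With M = 3, for example, the third handoff triggers a checkpoint at the newly entered MSS and empties log_set.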

Independent Recovery
- Each MH recovers independently without global coordination
  - Asynchronous recovery
- Recovery procedure
  - When an MH reconnects to an MSS after a failure, it sends cp_seq and cp_loc to its current MSS
  - The MSS initiates the process of collecting the checkpoint and the logs on behalf of the MH
    - The current MSS sends a request to MSScp
    - MSScp responds with the most recent checkpoint
    - The current MSS sends requests to all MSSs in MSSlogs
    - Each MSS in MSSlogs responds with its log entries
  - The current MSS compiles the checkpoint and all the logs, and sends them back to the MH
  - The MH rolls back to the checkpoint and replays all the logs
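The recovery procedure above might look roughly as follows in Python. The storage layout is an assumption made for illustration: `mss_storage` maps an MSS ID to its stored checkpoints and logs, each log entry carries a sequence number so the original order can be restored, and an event is modelled as a simple key/value update. None of these details come from the slides.

```python
def recover(cp_seq, cp_loc, log_set, mss_storage):
    """Rebuild the pre-failure state of an MH (illustrative sketch).

    mss_storage: {mss_id: {"checkpoints": {seq: state_dict},
                           "logs": [(event_seq, (key, value)), ...]}}
    This layout is hypothetical, not taken from the original design.
    """
    # The current MSS requests the most recent checkpoint from MSScp.
    checkpoint = mss_storage[cp_loc]["checkpoints"][cp_seq]

    # The current MSS requests the log entries from every MSS in MSSlogs.
    entries = []
    for mss_id in log_set:
        entries.extend(mss_storage[mss_id]["logs"])

    # The MH rolls back to the checkpoint and replays events in original order.
    entries.sort(key=lambda entry: entry[0])
    state = dict(checkpoint)  # copy, so the stored checkpoint is not modified
    for _, (key, value) in entries:
        state[key] = value
    return state
```

The sort step is what lets logs scattered over several MSSs be replayed as a single ordered sequence.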

Performance Analysis: SPN Model

SPN Model Parameters

The SPN Model (1/3)
- Places
  - Handoff: represents the state in which handoff_counter since the last checkpoint does not exceed M
  - Failure: stands for the state in which an MH failure has occurred
  - Initially, mark("Handoff") = 0 and mark("Failure") = 0
- Transitions
  - Move (inter-cell): with a mobility rate of σ
  - Checkpointing: with a checkpoint rate of θ_k
  - Fail: with a failure rate of λ_f
  - Recovery: with a recovery rate of θ_i, which depends on i = handoff_counter

The SPN Model (2/3)
- Parameters elaboration
- θ_k: checkpointing rate
  - The MH takes a snapshot of its current state and sends it to the MSS for stable storage
  - The dominating factor is the transmission of the checkpoint to the MSS
  - Thus, θ_k = 1/T_ckp_w
- θ_i: recovery rate
  - The recovery rate is the reciprocal of the average recovery time
  - The average recovery time is the sum of
    1. The time needed to send recovery request information to all the MSSs that store the most recent checkpoint and the logs
    2. The time needed to transmit the most recent checkpoint from MSS_cp to MSS_rec, and from MSS_rec to the MH
    3. The time needed to transfer all the logs from the respective MSSs where they are located to MSS_rec, and from MSS_rec to the MH
    4. The time required to roll back and replay all the logs

The SPN Model (3/3)
- Transition rules
  - Whenever the MH encounters a handoff, the number of tokens in the place "Handoff" is increased by 1
  - The MH stays in the state represented by the place "Handoff" if the number of cumulative handoffs is less than M before a failure
  - When the number of tokens in the place "Handoff" is equal to M, the transition "Checkpointing" is fired, and this consumes all the tokens in the place "Handoff", i.e., handoff_counter is reset
  - When an MH failure occurs (the transition "Fail" is fired), all the tokens in the place "Handoff" are moved to the place "Failure"; the recovery rate θ_i depends on the number of handoffs since the last checkpoint, which is denoted as i = mark("Failure") = #(Handoffs)
  - The only inhibitor arc ensures that the number of handoffs between two checkpoints does not exceed the threshold M
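To get a feel for these rules, a small Monte Carlo sketch can estimate how many handoffs have accumulated when a failure strikes. It is a deliberate simplification of the SPN, not the model itself: checkpointing is treated as instantaneous (so θ_k plays no role), recovery is not simulated, and the function names are hypothetical. It relies on the standard fact that, for two competing exponential transitions with rates σ and λ_f, the next event is a move with probability σ / (σ + λ_f).

```python
import random


def sample_handoffs_at_failure(m, sigma, lambda_f, rng):
    """Return i, the number of handoffs since the last checkpoint at failure time.

    Simplifying assumption: a checkpoint completes instantly when i reaches M.
    """
    p_move = sigma / (sigma + lambda_f)  # chance the next event is a handoff
    i = 0
    while True:
        if rng.random() < p_move:
            i += 1
            if i == m:       # "Checkpointing" fires and consumes all tokens
                i = 0
        else:
            return i         # "Fail" fires; i tokens move to "Failure"


def estimate_distribution(m, sigma, lambda_f, runs=10000, seed=0):
    """Estimate P(i handoffs at failure) for i = 0 .. M-1 by simulation."""
    rng = random.Random(seed)
    counts = [0] * m
    for _ in range(runs):
        counts[sample_handoffs_at_failure(m, sigma, lambda_f, rng)] += 1
    return [c / runs for c in counts]
```

Because the recovery rate θ_i depends on i, a distribution like this is the ingredient that turns per-state recovery times into an average.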

Calculations (1/5)  Variables required to compute : The number of MSSs storing logs, this is the size of the list log_set The average hop count between two MSSs separated by j handoffs 1/6 probability to move backward 2/6 probability to have sideway moves 3/6 probability to move forward The first move For simplicity, it is assumed that this equals to the number of handoffs, i.e., i

Calculations (2/5)  Total time to recover after i handoffs is the sum of the following components Time for recovery info requests, assuming that the size of a request packet is no more than the size of a log Time for transferring the most recent checkpoint to the MH The time required for MSS rec to transfer one request packet, assuming that the time to send a request packet from MH to MSS rec through wireless network is The time required to transfer the checkpoint from MSS cp to MSS rec through wired link The time required to transfer the checkpoint from MSS rec to the MH through wireless link

Calculations (3/5)
- Time for transferring the logs to the MH; n is the number of logs since the most recent checkpoint, and n_mss is the number of log entries per MSS
  - The time required to transfer the logs from MSS_logs to MSS_rec through the wired link, assuming that each MSS in MSS_logs stores n_mss logs
  - The time required to transfer the logs from MSS_rec to the MH through the wireless link
- Time for rolling back and replaying all the n logs

Calculations (4/5)
- Thus we have:
  - The total time to recover after i handoffs; 1/θ_i
  - The average recovery time per failure
- The underlying Markov model has 2M+1 states, with the probability of state j denoted P_j
- The max operator takes the maximum of the checkpoint transfer time and the log transfer time, since these messages can be transmitted simultaneously
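A minimal Python sketch of how the pieces might combine is shown below. The slide's formulas are not reproduced in this transcript, so the composition here is an assumption: request time, plus the larger of the checkpoint and log transfer times, plus replay time. The averaging step weights each failure state by its probability and normalizes over the states supplied, which is also an assumption; all function and parameter names are hypothetical.

```python
def recovery_time(t_request, t_checkpoint, t_logs, t_replay):
    """Assumed composition of the recovery time after i handoffs (1/theta_i).

    Checkpoint and log transfers overlap, so only the longer one counts.
    """
    return t_request + max(t_checkpoint, t_logs) + t_replay


def average_recovery_time(state_probs, recovery_times):
    """Probability-weighted mean of per-state recovery times.

    state_probs: probabilities P_j of the failure states considered.
    recovery_times: matching recovery times, one per state.
    """
    total_p = sum(state_probs)
    weighted = sum(p * t for p, t in zip(state_probs, recovery_times))
    return weighted / total_p
```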

Calculations (5/5)
- The recovery probability, defined as the probability that the recovery time is less than or equal to the recovery time deadline T; θ_S corresponds to θ_i if the MH has made i handoffs in state S
- The total time spent on checkpointing and logging before a failure
  - Total number of checkpoints taken before a failure
  - Total number of log entries taken before a failure
- The total cost of recovery per failure is the weighted sum of the average recovery time per failure and the total time spent on checkpointing and logging per failure
  - w1 = w2 = 0.5 in the result analysis
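The two quantities on this slide can be sketched in Python as follows. The weighted sum follows the slide directly. The recovery probability rests on an assumption: recovery in state S is treated as exponentially distributed with rate θ_S (as timed SPN transitions usually are), so the chance of finishing within the deadline T is 1 − exp(−θ_S · T), averaged over the failure states. Function and parameter names are hypothetical.

```python
import math


def recovery_probability(state_probs, recovery_rates, deadline):
    """P(recovery time <= deadline), assuming exponential recovery times.

    state_probs: probabilities of the failure states considered.
    recovery_rates: theta_S for each of those states.
    """
    total_p = sum(state_probs)
    within = sum(p * (1.0 - math.exp(-theta * deadline))
                 for p, theta in zip(state_probs, recovery_rates))
    return within / total_p


def total_cost(avg_recovery_time, checkpoint_log_time, w1=0.5, w2=0.5):
    """Weighted sum of recovery time and checkpointing/logging time per failure."""
    return w1 * avg_recovery_time + w2 * checkpoint_log_time
```

The default weights match the w1 = w2 = 0.5 setting used in the result analysis.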

Results and Analysis (1/4)
- Log arrival rate ↑ → Recover Probability ↓
- Mobility Rate ↑ → Recover Probability ↑
  - Mobility Rate ↑ → Checkpoint interval ↓ → Number of logs ↓ → Recovery time ↓
- Flat segment: Recall
- Deadline ↑ → Recover Probability ↑
- Mobility Rate ↑ → Recover Probability ↑
  - Mobility Rate ↑ → Checkpoint interval ↓ → Number of logs ↓ → Recovery time ↓

Results and Analysis (2/4)  Failure Rate ↑  Number of logs ↓  Recovery time ↓  Recover Probability ↑  Checkpoint Size↑  ↑  Recovery Probability↓  When size of checkpoint is sufficiently large (32K), dominates Mobility Rate becomes insensitive.

Results and Analysis (3/4)
- M ↑ → Time interval ↑ → Number of logs ↑ → Recovery time ↑ → Recover Probability ↓
- A smaller M is not always better
  - A smaller M brings more overhead during failure-free operation, so there is a tradeoff between recovery time and total time.

Results and Analysis (4/4)
- The curves indicate there is an optimal M.
- Small M → Tc dominates Tr
  - M ↑ → Tc ↓, Tr ↑ and Total cost ↓
- Large M → Tr dominates Tc
  - Total cost ↑
- Crossover point at M = 8
- Large M → Tr dominates Tc
  - Mobility Rate ↓ → Checkpoint interval ↑ → # of logs ↑ → Recovery time ↑
- Tr: recovery time
- Tc: total time spent on checkpointing and logging per failure

Summary
- An efficient failure recovery scheme for mobile computing systems based on movement-based checkpointing and logging
- A performance model based on stochastic Petri nets
  - Identify the optimal movement threshold M
  - Minimize the cost of recovery per failure
  - Calculate failure recoverability

Applicability
- Build a table covering possible parameter values
  - Mobility rate, failure rate, log arrival rate, etc.
- Optimal M may be selected dynamically to minimize cost
- Optimal M must satisfy the specified recovery probability
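The table-driven selection described under Applicability could be sketched like this in Python. The table layout (threshold M mapped to a precomputed cost and recovery probability) and the function name are assumptions for illustration; in practice one such table would exist per combination of mobility rate, failure rate, and log arrival rate.

```python
def select_optimal_m(candidates, min_recovery_probability):
    """Pick the threshold M with the lowest cost that meets the probability bound.

    candidates: {M: (total_cost, recovery_probability)}, a hypothetical
    precomputed table for one combination of user parameters.
    Returns None when no candidate satisfies the bound.
    """
    feasible = {m: cost
                for m, (cost, prob) in candidates.items()
                if prob >= min_recovery_probability}
    if not feasible:
        return None
    return min(feasible, key=feasible.get)
```

The recovery probability acts as a constraint and the cost as the objective, so the cheapest M is rejected if it cannot meet the required probability.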

Questions?