Movement-Based Check-pointing and Logging for Recovery in Mobile Computing Systems Sapna E. George, Ing-Ray Chen, Ying Jin Dept. of Computer Science Virginia.

Presentation transcript:

Movement-Based Check-pointing and Logging for Recovery in Mobile Computing Systems
Sapna E. George, Ing-Ray Chen, Ying Jin
Dept. of Computer Science, Virginia Polytechnic Institute and State University

Outline
- Background
- Problem Definition: Failure Recovery in the Mobile Computing Environment
- Proposed Solution: Movement-Based Check-pointing and Logging
- Performance Analysis
  - Analytic Model of the System
  - Analysis Results and Conclusions
- Future Work

Background

Mobile Computing
- Advances in wireless networking and portable device technologies are revolutionizing computing
- Mobile computing is a type of distributed computing
  - Involves hosts that may be mobile
  - Host network connectivity is maintained through wireless communications

Fault-tolerance in Distributed systems: Check-pointing, Logging, Rollback recovery
- Check-pointing (during failure-free operation)
  - Save the system state to stable storage
  - This snapshot is called a checkpoint
- Logging (during failure-free operation)
  - In addition to checkpoints, all non-deterministic events and the information necessary to replay them are logged to stable storage

Fault-tolerance in Distributed systems
- Failure Recovery
  - The failed process rolls back to the latest checkpoint
  - It replays all the logged events in their original order
  - It recreates its pre-failure state independently
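The rollback-and-replay idea on this slide can be illustrated with a minimal sketch. It assumes a toy process whose state is a dictionary and whose events are simple key/value writes; the class name `LoggedProcess` and its attributes are hypothetical and stand in for real stable storage.

```python
import copy


class LoggedProcess:
    """Toy process: state is a dict, events are (key, value) writes."""

    def __init__(self):
        self.state = {}
        # The two attributes below stand in for stable storage (assumption).
        self.stable_checkpoint = {}
        self.stable_log = []  # events logged since the last checkpoint

    def apply(self, event):
        key, value = event
        self.state[key] = value

    def handle(self, event):
        # Log the event to stable storage, then apply it.
        self.stable_log.append(event)
        self.apply(event)

    def checkpoint(self):
        # Save a snapshot; logged events before it are no longer needed.
        self.stable_checkpoint = copy.deepcopy(self.state)
        self.stable_log.clear()

    def crash(self):
        # Volatile state is lost on failure.
        self.state = None

    def recover(self):
        # Roll back to the latest checkpoint, then replay the log in order.
        self.state = copy.deepcopy(self.stable_checkpoint)
        for event in self.stable_log:
            self.apply(event)


p = LoggedProcess()
p.handle(("x", 1))
p.checkpoint()
p.handle(("y", 2))
p.handle(("x", 3))
before = dict(p.state)
p.crash()
p.recover()
```

After `recover()`, `p.state` matches the state captured in `before`, without involving any other process.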

Problem Definition: Failure Recovery in the Mobile Computing Environment

Effects of Properties of MC Env.
- Mobility of hosts
  - If checkpointing requires coordination, the mobile host (MH) must be located before control messages can be delivered; this increases communication delay
  - Data related to recovery, such as checkpoints and logs, may be distributed over many mobile support stations (MSSs); a mechanism is required for efficient storage, retrieval and management of this dispersed information

Effects of Properties of MC Env.
- Low bandwidth and unreliable network connectivity
  - A recovery mechanism that requires many messages or large messages places an undue burden on wireless resources and increases the cost of providing fault tolerance

Effects of Properties of MC Env.
- Limited battery life of host devices
  - Communication is energy intensive
  - The recovery mechanism must keep communication (the number and size of messages) to a minimum

Effects of Properties of MC Env.
- Lack of stable storage on host devices
  - Devices are vulnerable to physical damage
  - Devices are small and are equipped with limited memory
  - The MH's disk cannot reliably function as the stable storage required to store recovery information

Effects of Properties of MC Env.
- Different types of 'failures'
  - Voluntary disconnection and hardware failure must be handled differently
  - A disconnected host may reconnect after a while and expect to resume operations
  - An MH that is currently unreachable cannot be expected to participate in a checkpointing or recovery operation
  - A scheme that requires synchronization or coordination with other MHs would either block until the MH reconnects or fail

The Problem…
- Traditional recovery schemes suffer from many shortcomings when applied to the mobile computing environment
- The failure-prone nature of the environment makes it essential to provide some form of explicit recovery mechanism

The Problem…
- In general, application recovery mechanisms try to balance
  - Recovery cost (failure-free operational cost)
  - Recovery time
  - Storage requirements for recovery-related information

The Problem…
- Adaptations of traditional recovery schemes for the mobile computing environment
  - Do not consider mobility when selecting the checkpointing interval
  - Use periodic checkpointing
  - Subsequently control the proliferation of recovery information using techniques that merge logs and move the information closer to the MH

Proposed Solution: Movement-Based Check-pointing and Logging

Assumed Mobile Computing System
- A set of mobile hosts (MHs)
- Each MH maintains network connectivity through a wireless link to a static mobile support station (MSS)
- An MSS handles all communications to and from the MHs within its area of influence, known as a cell
- Each MSS is equipped with enough stable storage to store the state and log information

Assumed Mobile Computing System
- Interactions between the MH and the network infrastructure relevant to failure recovery
  - Handoff: cell boundary crossing
  - Disconnection: for power conservation
  - Reconnection: possibly in a cell different from the one in which the MH disconnected

Assumed Mobile Computation
- A distributed computation consists of a number of processes executing concurrently on multiple hosts
- Process states:
  - Normal: executing application-related computations, receiving user inputs, or sending and receiving messages; between checkpoints, the process also logs all events while in this state
  - Save: saves its state as a checkpoint to stable storage
  - Recovery: loads the checkpoint and applies the logs

Movement-Based Checkpointing and Logging
- The interval between checkpoints is not fixed; it is governed by the number of handoffs experienced by the MH
- The MH maintains a handoff counter, which is incremented by 1 every time a handoff occurs
- When the value of the counter becomes greater than a threshold M, a checkpoint is taken
- Between checkpoints, all write events related to an MH are also logged to the local MSS
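The counter rule above can be sketched as follows. This is an illustration only: the class and attribute names are hypothetical, and resetting the counter to zero after a checkpoint is an assumption that the slide does not state explicitly.

```python
class MobileHost:
    """Toy MH that checkpoints based on handoffs rather than on a timer."""

    def __init__(self, threshold_m):
        self.threshold_m = threshold_m   # movement threshold M
        self.handoff_count = 0           # handoffs since the last checkpoint
        self.checkpoints_taken = 0

    def on_handoff(self):
        # Called each time the MH crosses a cell boundary.
        self.handoff_count += 1
        # Per the slide: checkpoint when the counter becomes greater than M.
        if self.handoff_count > self.threshold_m:
            self.take_checkpoint()

    def take_checkpoint(self):
        self.checkpoints_taken += 1
        # Assumption: the counter restarts after each checkpoint.
        self.handoff_count = 0


mh = MobileHost(threshold_m=3)
for _ in range(8):
    mh.on_handoff()
```

With M = 3, the fourth and eighth handoffs each trigger a checkpoint, so the time between checkpoints depends entirely on how quickly the MH moves.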

Movement-Based Checkpointing and Logging
- The threshold M is a configurable parameter that depends on:
  - User mobility rate
  - Network failure rate
  - Application log arrival rate

Movement-Based Checkpointing and Logging
- Thus, depending on the variability in the MH's mobility, the time interval between successive checkpoints differs
- Recovery: the MH recovers independently, without coordination with other MHs
  - Upon reconnection, the MH informs the local MSS
  - The local MSS contacts the MSS holding the latest checkpoint
  - The local MSS contacts all MSSs storing logs
  - All data is transferred to the local MSS via the wired network and then to the MH via the wireless link
  - The MH rolls back and applies the logs
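The recovery steps above might look roughly like the sketch below. It assumes each log entry carries a sequence number so that logs gathered from several MSSs can be replayed in their original order; that numbering, the dictionary layout of `mss_store`, and the MSS names are all hypothetical.

```python
def recover(mss_store, checkpoint_mss, log_msss):
    """Rebuild MH state at the local MSS.

    mss_store maps an MSS id to a dict with keys:
      'checkpoint': dict or None
      'logs': list of (seq, key, value) write events
    """
    # Fetch the latest checkpoint from the MSS that holds it.
    state = dict(mss_store[checkpoint_mss]["checkpoint"])
    # Gather the logs from every MSS that stored some of them.
    entries = []
    for mss_id in log_msss:
        entries.extend(mss_store[mss_id]["logs"])
    # Replay in original order (assumed to be given by the sequence number).
    for _seq, key, value in sorted(entries, key=lambda e: e[0]):
        state[key] = value
    return state


mss_store = {
    "MSS_A": {"checkpoint": {"balance": 100}, "logs": [(1, "balance", 90)]},
    "MSS_B": {"checkpoint": None, "logs": [(2, "balance", 70), (3, "note", "paid")]},
    "MSS_C": {"checkpoint": None, "logs": [(4, "balance", 65)]},
}
recovered = recover(mss_store, "MSS_A", ["MSS_C", "MSS_A", "MSS_B"])
```

The order in which the MSSs are contacted does not matter here, because the merged log is sorted before replay.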

Movement-Based Checkpointing and Logging
- The performance of this scheme depends on identifying the optimal movement threshold M per user and application
  - Checkpoints and logs remain within an acceptable range of the MH's current location, which eliminates the need for information consolidation
  - Recovery time stays acceptable, since M bounds the number of MSSs from which logs must be retrieved

Performance Analysis: Analytic Model

Stochastic Petri-Net (SPN) Model

SPN Model Parameters
Parameter: Description
- σ: MH mobility rate, i.e., the rate at which the MH crosses cell boundaries
- μ: Log arrival rate, i.e., the rate at which logs are created
- λ_f: MH failure rate, i.e., the rate at which the MH fails
- M: Movement threshold, i.e., the number of handoffs after which the MH takes a checkpoint
- r: Ratio of the bandwidth of the wireless network to that of the wired network
- T_ckp_w: Time required to transmit a checkpoint through the wireless link
- T_log_w: Time required to load a log entry through the wireless link
- T_elog: Time required to execute a log entry at the MH

SPN Model Parameters
- Θ_k: Checkpoint rate of the MH
- Θ_i: Recovery rate of the MH, i.e., the inverse of the recovery time
- i: Number of handoffs experienced by the MH since the last checkpoint and before failure

Analytic Model – Recovery Time

- T_req_rec: Time spent on recovery information requests
  - N_mss_logs: Number of MSSs storing logs
  - D_mss: Average hop count between MSS_cp and MSS_rec

Analytic Model – Recovery Time
- T_ckp_tx: Time spent on transmitting the latest checkpoint to the MH
- T_log_tx: Time spent on transmitting the logs to the MH
- T_rec: Time spent to roll back to the last checkpoint and apply the logs
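The slide equations did not survive in this transcript, so the following is only a sketch under the assumption that the four listed components simply add up to the total recovery time. The recovery rate is taken as the inverse of that total, as stated on the SPN parameters slide. The function names and the numeric values are placeholders, not figures from the presentation.

```python
def recovery_time(t_req_rec, t_ckp_tx, t_log_tx, t_rec):
    """Total recovery time, ASSUMING the four components are additive."""
    return t_req_rec + t_ckp_tx + t_log_tx + t_rec


def recovery_rate(total_recovery_time):
    """Recovery rate as the inverse of the recovery time."""
    return 1.0 / total_recovery_time


# Placeholder values in seconds, chosen only to illustrate the arithmetic.
total = recovery_time(t_req_rec=0.25, t_ckp_tx=0.5, t_log_tx=0.125, t_rec=0.125)
rate = recovery_rate(total)
```

With these placeholder inputs the total is 1.0 s and the rate is 1.0 recoveries per second.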

Analytic Model – Cost of Recovery
- T_r: Average recovery time per failure
- F_r: Recovery probability
- T_c: Cost of recovery
  - Number of checkpoints before failure
  - Number of logs before failure

SPN Evaluation Parameters
- Size of a log entry: 50 B
- Size of a checkpoint (in B)
- Bandwidth of the wired network: 2 Mbps
- Ratio of the bandwidth of the wireless network to the wired network (r)
- Time required to apply a log entry, T_elog (in s)
- Time required to transmit a log entry through the wireless channel, T_log_w (in s)
- Time required to transmit a checkpoint through the wireless channel, T_ckp_w (in s)
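As a worked illustration of how the transmission times relate to the sizes and bandwidths listed above, the sketch below assumes that transmission time is simply size divided by bandwidth (ignoring latency and protocol overhead) and that 2 Mbps means 2,000,000 bits per second. The value of r used in the example is hypothetical, since the slide's value is not available in this transcript.

```python
def transmit_time_s(size_bytes, bandwidth_bps):
    """Time in seconds to send size_bytes over a link of bandwidth_bps."""
    return size_bytes * 8 / bandwidth_bps


def wireless_transmit_time_s(size_bytes, wired_bps, r):
    """Same, over the wireless link, whose bandwidth is r times the wired one."""
    return transmit_time_s(size_bytes, r * wired_bps)


WIRED_BPS = 2_000_000      # 2 Mbps, from the slide
LOG_ENTRY_BYTES = 50       # 50 B, from the slide

# One log entry over the wired network: 400 bits / 2,000,000 bps.
t_log_wired = transmit_time_s(LOG_ENTRY_BYTES, WIRED_BPS)

# Hypothetical r = 0.5, purely for illustration.
t_log_wireless = wireless_transmit_time_s(LOG_ENTRY_BYTES, WIRED_BPS, r=0.5)
```

Under these assumptions a log entry takes 0.0002 s on the wired network, and twice that on a wireless link with half the bandwidth.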

Performance Analysis: Results and Conclusions

Recovery Probability vs. Recovery Time

Recovery Probability vs. Log Arrival Rate

Recovery Probability vs. Failure Rate

Recovery Probability & Recovery Time vs. Movement Threshold

Determining Optimal Movement Threshold that Minimizes Recovery Cost Per Failure

Conclusion: Proposed Scheme
- An efficient failure recovery scheme for mobile computing systems, based on movement-based checkpointing and logging
- The scheme takes a checkpoint only after the mobile node has made M movements (mobility handoffs)
- The value of M is governed by the failure rate, the log arrival rate, and the mobility rate of the application and MH
- Given the failure, mobility and log arrival rates, the optimal movement threshold M can be identified to minimize the cost of recovery per failure

Conclusion: Practical Application
- At configuration time, build a table that covers the possible values of the MH's mobility rate and failure rate and of the mobile application's log arrival rate, and lists the optimal M that minimizes the recovery cost per failure for each combination
- At runtime, select the optimal M dynamically based on the measured rates
- When the application specifies a deadline for recovering from a failure, the selected M must also satisfy the required recovery probability
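The table-driven approach described above could be sketched as follows. The helper names, the nearest-grid-point lookup, and especially `toy_cost` are assumptions made for illustration: `toy_cost` is a made-up stand-in, not the SPN-based recovery cost from the presentation, and `is_feasible` is a hypothetical hook for the recovery probability requirement.

```python
from itertools import product


def build_m_table(cost_fn, mobility_rates, failure_rates, log_rates,
                  candidate_ms, is_feasible=lambda m, *rates: True):
    """Configuration time: map each rate combination to its cheapest M."""
    table = {}
    for rates in product(mobility_rates, failure_rates, log_rates):
        feasible = [m for m in candidate_ms if is_feasible(m, *rates)]
        table[rates] = min(feasible, key=lambda m: cost_fn(m, *rates))
    return table


def nearest(value, grid):
    """Closest tabulated value to a measured rate."""
    return min(grid, key=lambda g: abs(g - value))


def select_m(table, grids, measured):
    """Runtime: look up M for the measured (mobility, failure, log) rates."""
    rates = tuple(nearest(v, g) for v, g in zip(measured, grids))
    return table[rates]


def toy_cost(m, sigma, lam, mu):
    # Made-up cost: checkpointing overhead falls with M, replay work grows.
    return 1.0 / m + lam * mu * m / sigma


grids = ([1.0, 4.0], [0.01, 0.04], [1.0])
table = build_m_table(toy_cost, *grids, candidate_ms=range(1, 31))
chosen = select_m(table, grids, measured=(3.2, 0.012, 1.1))
```

In practice the cost function would come from evaluating the analytic model, and `is_feasible` would reject values of M whose recovery probability within the application deadline is too low.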