Yavor Todorov. Introduction How it works OS level checkpointing Application level checkpointing CPR for parallel programming CPR functionality References.

Yavor Todorov

Introduction, How it works, OS level checkpointing, Application level checkpointing, CPR for parallel programming, CPR functionality, References. Contents

Checkpoint/restart (CPR) mechanisms allow a machine that crashes and is subsequently restarted to continue from the last checkpoint with no loss of data, just as if no failure had occurred. At some firms and supercomputing centers, it is common practice to break up long-running computational programs into several batches. Programs such as gene-sequencing applications search through enormous databases and execute complex algorithms that can take several weeks to complete. While the concept is easy to understand, the technical mechanism to checkpoint and restart an operating system or application is quite complex. Introduction

Checkpointing can occur either within the operating system or at the application level. Most high-end mainframes have automated CPR utilities. CPR at the operating system level saves the state of everything that is being done within a given application at periodic checkpoints and allows the system to restart from the last checkpoint. On very large computers with hundreds or thousands of processes running, saving the entire state of an operating system can take a long time. How it works

It also takes a long time to restart the machine at that state later; on large jobs, it could take several hours. Recovery is delayed because a large amount of data must be stored, whether or not the application requires that information to fully restart. When a process is checkpointed, its state, such as the register set and file handles, is saved. How it works (cont’d)

Text, data, stack, heap, register set values, status of open files, sockets, signals. Process image

"Checkpointing at the operating system is useful but very costly, in that the operating system does not know what data the application really needs to restore it later, so it blindly saves everything," according to James Kasdorf, director of supercomputing center. At the OS level, CPR saves the system state. That includes unneeded copies of data, program code and system libraries. Most supercomputing centers try to avoid CPR at the OS level. Checkpoint at OS level

Because the whole process image is saved, it is an expensive operation. It takes more time. It usually needs kernel modifications, since most operating systems, such as Linux, were not built with CPR functionality. Checkpointing at OS level drawbacks
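As a rough illustration of why saving the whole process image costs more than saving only what the application needs, the time to write a checkpoint can be approximated as its size divided by the storage bandwidth. The sizes and bandwidth below are hypothetical numbers chosen for the arithmetic, not measurements, and the model ignores latency and contention.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def checkpoint_seconds(image_bytes, bandwidth_bytes_per_s):
    """Approximate time to write (or read back) a checkpoint of the given size."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return image_bytes / bandwidth_bytes_per_s


# Hypothetical sizes: the whole process image vs. only the data needed to restart.
full_image = 64 * GIB
needed_data = 2 * GIB
bandwidth = 1 * GIB  # assumed storage throughput in bytes per second

print(checkpoint_seconds(full_image, bandwidth))   # 64.0
print(checkpoint_seconds(needed_data, bandwidth))  # 2.0
```

Under these assumed numbers the full image takes 32 times longer to save, and the same ratio applies when reading it back on restart.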

The application uses OS hooks to save the information needed for restart. It is more efficient: it takes less time to checkpoint and it is faster to restart the application. It allows you to choose the optimal checkpoint, which is typically at the end of a loop. Only the needed data gets saved. Checkpoint at application level
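A minimal sketch of application-level checkpointing, assuming a simple loop whose only restart state is a counter and a running total. The function names, the JSON file format and the `crash_at` parameter (used only to simulate a failure) are illustrative choices, not part of any particular CPR library.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_i": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n, checkpoint_every=10, crash_at=None):
    """Sum i*i for i in range(n), checkpointing at the end of every few iterations."""
    state = load_checkpoint(path)
    for i in range(state["next_i"], n):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += i * i
        state["next_i"] = i + 1
        # End of the loop body: the state is consistent, so this is a safe point.
        if state["next_i"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If a run crashes partway through, calling `run` again with the same path resumes from the last saved iteration rather than from zero, and only the two values the loop needs are ever written.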

It is difficult in some cases, e.g. when the application has an open communications channel to an external device or runs on a clustered computer. A distributed application's state is hard to save because the program's state is changing across multiple nodes. CPR for applications with large buffer memory takes longer. CPR at app level drawbacks

Each process is responsible for taking its own checkpoint. Checkpoint timing is the responsibility of a coordinating process. CPR data includes in-transit message data, the data section, file offsets, signal state, executable information, stack contents and register contents, CPU state, information about open files, and pending signals. The checkpoint file can be stored on either local or global storage. When the program is restarted, each process initiates its own restart. CPR for parallel programs

All migrating processes have to be stopped at the same time, to avoid the loss of a signal. The socket IP addressing space has to be taken into consideration (it can be virtualized). I/O speeds are pivotal for any CPR process (the faster the better). CPR for parallel programs
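The coordination described on the two slides above can be sketched as follows. This is a toy model under stated assumptions: threads stand in for the processes of a parallel program, a fixed set of step numbers stands in for the coordinator's timing decision, and an in-memory dictionary stands in for local or global checkpoint storage. All names are hypothetical.

```python
import threading

NUM_WORKERS = 4
STEPS = 6
CHECKPOINT_STEPS = {2, 4}  # timing chosen by the coordinator (hypothetical)

barrier = threading.Barrier(NUM_WORKERS)
checkpoints = {}           # stand-in for local or global storage
lock = threading.Lock()


def worker(rank):
    value = 0
    for step in range(1, STEPS + 1):
        value += rank * step  # stand-in for real computation
        if step in CHECKPOINT_STEPS:
            barrier.wait()    # every worker stops at the same step
            with lock:
                # Each worker takes its own checkpoint.
                checkpoints[(rank, step)] = {"step": step, "value": value}
            barrier.wait()    # nobody resumes until all have saved


def run_all():
    threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return checkpoints
```

The first barrier guarantees that no worker is still computing while another is saving, so the saved states all describe the same step; the second keeps any worker from running ahead before the checkpoint is complete.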

Process migration, load balancing, crash recovery, rollback transaction, job control. CPR functionality

Duell, J. (2005). The design and implementation of Berkeley Lab's Linux checkpoint/restart. Berkeley: Lawrence Berkeley National Lab.
Litzkow, M., Tannenbaum, T., Basney, J., & Livny, M. (1997). Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System.
Zhong, H., & Nieh, J. (2001). CRAK: Linux Checkpoint/Restart As a Kernel Module. Department of CS, Columbia University.
http://www.computerworld.com/s/article/68930/Checkpoint_and_Restart
Sources
