Advanced I/O Techniques for Efficient and Highly Available Process Crash Recovery Protocols Thesis Presentation Jason Cornwell 03/15/2011.

1 Advanced I/O Techniques for Efficient and Highly Available Process Crash Recovery Protocols Thesis Presentation Jason Cornwell 03/15/2011

2 Agenda Introduction Challenges Pertinent Background Proposed Techniques Implementations Experimental Setup & Results Conclusions Future Work

3 Computing Intensive Applications

4 Network Centric Services

5 Recent Advances

6 Motivation & Goals
Demand for more computing power and high-bandwidth network connections
Advances in Microprocessors and Networks
Parallel Computing
Performance and Scalability
Reliability and Availability
Simplicity and Accessibility

7 Agenda Introduction Challenges Pertinent Background Proposed Techniques Implementations Experimental Setup & Results Conclusions Future Work

8 Reliability Problems
Large numbers of CPUs, Memory Modules, Hard Disk Drives, Network Interfaces, Network Switches
Low Mean-Time-To-Failure (MTTF) and/or High Failure-In-Time (FIT)
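The link between component count and a low system MTTF can be made concrete with a small calculation. The sketch below assumes independent components with constant failure rates, where any single failure counts as a system failure; the disk figures are hypothetical and do not come from the presentation.

```python
HOURS_PER_FIT_UNIT = 1e9  # one FIT = one failure per 10^9 device-hours


def fit_to_mttf_hours(fit: float) -> float:
    """Convert a FIT rate into a mean time to failure in hours."""
    return HOURS_PER_FIT_UNIT / fit


def system_mttf_hours(component_mttf_hours: float, count: int) -> float:
    """MTTF of `count` identical components when any one failure
    counts as a system failure (the failure rates add up)."""
    return component_mttf_hours / count


# Hypothetical numbers: disks rated at 1,000,000 hours, 10,000 of them.
disk_mttf = 1_000_000.0
cluster_mttf = system_mttf_hours(disk_mttf, 10_000)
print(f"cluster MTTF: {cluster_mttf:.0f} hours")  # prints: cluster MTTF: 100 hours
```

Under these assumptions a part that fails once a century on its own produces a failure somewhere in the cluster every few days.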

9 Classification of Failure
Transient Failure
– Power glitch
– System patch and reboot
– ECC trap
Partial “Permanent” Failure
– Disk failure
– Partial network failure
Wholesale “Permanent” Failure
– Total hardware failure
– Natural disaster

10 Availability Problems
Large numbers of Processes, Threads, Software Barriers, Busy Waiting
Temporarily Unresponsive and/or Unavailable

11 Agenda Introduction Challenges Pertinent Background Proposed Techniques Implementations Experimental Setup & Results Conclusions Future Work

12 Possible Solutions
Transient Failure
– Restart/replay/resume on the same node
– Task-migration is possible
Permanent Partial Failure
– Rebalance the workload on surviving nodes
– Partial task-migration is needed
Permanent Wholesale Failure
– Reconfigure the applications and services
– Massive task-migration to new platform

13 Checkpointing
Common feature in high-performance computing (HPC) platforms
Saves the execution state
Application or system-level
Mechanism for task migration
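As an illustration of saving and restoring execution state, here is a minimal application-level sketch. It is not how BLCR works (BLCR snapshots the whole process from the kernel); the function names, the state layout and the simulated crash are all assumptions made for the example.

```python
import os
import pickle
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write the state to a temp file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the last saved state, or a copy of `default` if none exists."""
    if not os.path.exists(path):
        return dict(default)
    with open(path, "rb") as f:
        return pickle.load(f)


def run_sum(path: str, limit: int, crash_at: Optional[int] = None) -> int:
    """Sum 0..limit-1, checkpointing after every step.

    `crash_at` is a hypothetical hook that simulates a process crash.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < limit:
        if crash_at is not None and state["i"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["i"]
        state["i"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling `run_sum` again with the same path after a crash resumes from the last saved step instead of starting over, which is the property that also makes checkpoints usable for task migration.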

14 Application vs System Level
Application-level Recovery Point
– Application-specific development
– Generally smaller footprint
– Data accessibility restrictions
Kernel-level Recovery Point
– Snapshot processes
– Full resource restoration
– Flexibility due to system level preemption

15 Berkeley Lab Checkpoint/Restart (BLCR)
System-level
Kernel-module
Checkpoint creation implemented
Process recovery implemented
Linked to BLCR libraries at execution
Stores checkpoint data locally (stack, heap, registers, signals, etc.)

16 Agenda Introduction Challenges Pertinent Background Proposed Techniques Implementations Experimental Setup & Results Conclusions Future Work

17 Contribution
Enhanced BLCR performance through a latency-tolerant technique
Increased BLCR availability through a novel checkpoint creation technique

18 I/O Optimization
Avoided extreme modification to BLCR
Reduced the disk latency of checkpoint creation
Implemented a caching technique
Improved I/O performance 4-fold or more
System overhead of less than 300 KB in experimental tests

19 Checkpoint Caching
Buffer used as temporary storage
Storage block flushed in large volume
Trade-off between resource consumption and improved I/O efficiency
cr_copy(chkptData, count)
    if (chkptBuf is NULL)
        kmalloc size of count for chkptBuf space;
        copy chkptData into chkptBuf;
    else
        kmalloc size of count + chkptBuf size for tempBuf space;
        copy chkptBuf into tempBuf;
        append chkptData to tempBuf;
        krealloc chkptBuf for its expanded size;
        memmove tempBuf into chkptBuf;
        kfree memory for tempBuf;
    end if
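The accumulate-then-flush idea behind cr_copy can be sketched in user space. This is a Python stand-in, not the kernel code: a `bytearray` plays the role of chkptBuf and grows in place, so the temporary-buffer steps collapse into one `extend` call. The class and the `drain` method are names made up for this example.

```python
class CheckpointBuffer:
    """User-space stand-in for chkptBuf; the method name mirrors the slide."""

    def __init__(self) -> None:
        self.data = bytearray()

    def cr_copy(self, chkpt_data: bytes) -> int:
        """Append one piece of checkpoint data; return the buffered size."""
        self.data.extend(chkpt_data)  # grows in place, like the krealloc step
        return len(self.data)

    def drain(self) -> bytes:
        """Hand back everything buffered so far and empty the buffer."""
        block = bytes(self.data)
        self.data.clear()
        return block
```

Many small pieces (registers, stack, heap pages) end up in one contiguous block that can be written out in a single large operation, at the cost of holding that block in memory until the flush.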

20 Optimized Write Operation

21 Remote Checkpoint
BLCR is limited to local disk storage
Remote checkpoint offers off-site storage option
Uses sockets to transmit data
Needs predefined destination
Outperforms BLCR in some experimental tests

22 Remote Checkpoint Server
Single thread daemon
Compiled with GCC
Stores the recovery point external to the client node
Could be ported to a Microsoft platform
while (true)
    create socket;
    bind to address;
    listen for incoming connections;
    wait for client to connect;
    create file descriptor;
    while (buffered data is received)
        write checkpoint data;
    close file descriptor;
    close socket;
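The server loop above can be sketched in a few lines of Python. This is an illustration under assumptions, not the thesis implementation: the function names, the 64 KB receive size and the `checkpoint-N.img` file naming are hypothetical, and the listening socket is created once and reused rather than recreated on every iteration.

```python
import socket


def open_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Create, bind and listen; port 0 lets the OS pick a free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(1)
    return listener


def receive_checkpoint(listener: socket.socket, out_path: str) -> int:
    """Accept one client and store everything it sends; return the byte count."""
    conn, _addr = listener.accept()
    received = 0
    with conn, open(out_path, "wb") as out:
        while True:
            chunk = conn.recv(65536)
            if not chunk:  # the client closed the connection
                break
            out.write(chunk)
            received += len(chunk)
    return received


def serve_forever(listener: socket.socket, directory: str) -> None:
    """Single-threaded daemon loop: one checkpoint file per connection."""
    count = 0
    while True:
        receive_checkpoint(listener, f"{directory}/checkpoint-{count}.img")
        count += 1
```

Because the loop handles one connection at a time, a second client waits in the listen backlog until the first checkpoint has been stored.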

23 Modified Write Operation
TCP packets: MTU must be reached before delivery
Only modification is to the write operation of BLCR
if (remote chkpt)
    if (socket is NULL)
        create socket;
        establish connection; if handshake fails, break and perform the original_chkpt;
    end if
    package checkpoint data;
    send data message;
end if
if (original_chkpt)
    original BLCR write operation;
end if
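The connect-once, fall-back-to-local behaviour can be sketched as a small user-space writer. The class name, the two-second timeout and the append-mode local file are assumptions for illustration; as on the slide, only a failed connection attempt triggers the fallback.

```python
import socket
from typing import Optional, Tuple


class RemoteOrLocalWriter:
    """Send checkpoint data to a remote server, or write it locally
    if the first connection attempt fails (hypothetical helper)."""

    def __init__(self, local_path: str,
                 remote: Optional[Tuple[str, int]] = None) -> None:
        self.local_path = local_path
        self.remote = remote
        self.sock: Optional[socket.socket] = None
        self.used_fallback = False

    def write(self, chkpt_data: bytes) -> None:
        # Lazily connect on the first remote write.
        if self.remote is not None and self.sock is None and not self.used_fallback:
            try:
                self.sock = socket.create_connection(self.remote, timeout=2.0)
            except OSError:
                self.used_fallback = True  # handshake failed: use the local path
        if self.sock is not None:
            self.sock.sendall(chkpt_data)
        else:
            with open(self.local_path, "ab") as f:  # stands in for the original write
                f.write(chkpt_data)

    def close(self) -> None:
        if self.sock is not None:
            self.sock.close()
            self.sock = None
```

The caller issues the same `write` calls whether the data goes to the socket or to the local file, which keeps the change confined to the write operation.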

24 Agenda Introduction Challenges Pertinent Background Proposed Techniques Implementations Experimental Setup & Results Conclusions Future Work

25 Design
I/O Optimization Write
write(chkptData, count)
    if (chkptBuf has space for the incoming chkptData)
        cr_copy(chkptData, count);
    else
        vfs_write(chkptBuf);
        vfs_write(chkptData);
        kfree(chkptBuf);
    end if
Remote Checkpoint Write
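The write policy above can be simulated in user space to see how buffering reduces the number of writes that reach the disk. In this sketch an ordinary file-like object stands in for vfs_write, and the fixed `capacity`, the `sink_writes` counter and the final flush in `close` are assumptions that the slide does not show.

```python
import io


class BufferedCheckpointWriter:
    """Buffer small checkpoint writes; flush when the next piece will not fit."""

    def __init__(self, sink, capacity: int) -> None:
        self.sink = sink          # any object with a write(bytes) method
        self.capacity = capacity  # hypothetical fixed buffer size in bytes
        self.buf = bytearray()
        self.sink_writes = 0      # how many writes actually reached the sink

    def _write_through(self, data: bytes) -> None:
        self.sink.write(data)
        self.sink_writes += 1

    def write(self, chkpt_data: bytes) -> None:
        if len(self.buf) + len(chkpt_data) <= self.capacity:
            self.buf.extend(chkpt_data)           # the cr_copy branch
        else:
            if self.buf:
                self._write_through(bytes(self.buf))  # flush the buffer first
                self.buf.clear()
            self._write_through(chkpt_data)       # then the incoming data

    def close(self) -> None:
        """Flush whatever is still buffered (assumed final step)."""
        if self.buf:
            self._write_through(bytes(self.buf))
            self.buf.clear()


sink = io.BytesIO()
writer = BufferedCheckpointWriter(sink, capacity=8)
for piece in [b"aa", b"bb", b"cc", b"dddd", b"ee"]:
    writer.write(piece)
writer.close()
print(writer.sink_writes)  # prints: 3
```

Five incoming writes become three writes to the sink while the byte order is preserved; a larger capacity trades more memory for fewer, larger writes.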

26 Agenda Introduction Challenges Pertinent Background Proposed Techniques Implementations Experimental Setup & Results Conclusions Future Work

27 Experimental Setup
I/O Optimization
– Dell Workstation, 3.06 GHz Intel Pentium 4, 1 GB Memory, 5,400 RPM Hard Disk, Linux 2.6
– BLCR Implementation
– Optimized BLCR (O-BLCR) Implementation
Remote Checkpoint
– Dell PowerEdge 700, 2.80 GHz Dual-processor Intel Pentium 4, 3 GB Memory, 5,400 RPM Hard Disk, Linux 2.6
– Dell Workstation, 3.06 GHz Intel Pentium 4, 1 GB Memory, 5,400 RPM Hard Disk, Linux 2.6
– BLCR Implementation
– BLCR with NFS (BLCR+NFS)
– BLCR with our Remote Checkpoint Technique (BLCR+R)

28 Benchmarks
Program: NP-Complete, Data Encryption, Linear Equation Solver, File Compression
Resource Utilization
Benchmark | CPU | Memory | I/O
TSP | High | Low
AES | High | Low | Medium
GE | Low | High
HC | Medium

29 I/O Optimization Results

30 Remote Checkpoint Results

31 Agenda Introduction Challenges Pertinent Background Proposed Techniques Implementations Experimental Setup & Results Conclusions Future Work

32 Conclusion
Minimal modification to BLCR
I/O optimization technique reduced the write latency of BLCR
Remote checkpoint increases BLCR availability with a new feature
These techniques should be deployed into the foundation of the BLCR source code

33 Agenda Introduction Challenges Pertinent Background Proposed Techniques Implementations Experimental Setup & Results Conclusions Future Work

34 Future Work
Server authentication protocol
Data packet encryption
Automated process load balancing

35 Questions

