
1 Bugs, Bugs and Bugs

2 Bugs: Run Time Handling
Heisenbugs/Mandelbugs
– Heisenbugs are easier to take care of during run-time
– Higher chance that robust programming mechanisms are successful
Bohr bugs are typically easier to find and fix at design time
– But harder to take care of during run time

3 Perturbation Classifications/Coverage
Persistence
– Transient fault
– Intermittent fault
– Permanent fault
Creation time
– Design fault
– Operational fault
Intention
– Accidental fault
– Intentional fault
Crash failure
– Fail-silent and fail-stop
Omission failure
Timing failure
– System fails to respond within a specified time slice
– Both late and early responses might be "bad"
– Late timing failure = performance failure
Arbitrary failure
– System behaves arbitrarily

4 Robust Programming Mechanisms
Objective: Sustain the delivery of services despite perturbations!
– Process Pairs
– Graceful Degradation
– Selective Retry
– Checkpointing
– Rejuvenation
– Micro-reboots
– Recovery Blocks
– Diversity (NVP, NCP)
– ...

5 Process Pairs (Continual Service)
Implementation variants:
– Active replicas: both process client requests [+ fast; - complex]
– Primary/Backup: state transfer [+ simpler; - delay]

6 Process Pairs
The process pair scheme is robust to varied types of software faults (crashes, resource shortage/delays, load…):
– Study of print servers with process pair technology (primary/backup)
– 2000 systems; 10 million system hours
– 99.3% of failures affected only one server, i.e., 99.3% of failures were tolerated

7 Simple Process Pair (same host)
Server process:
  ...
  forever {                           // event loop
    wait_for_request(Request);
    process_request(Request);
  }

8 Simple Process Pair (same host)
Server process:
  int ft = backup();                  // create backup process; primary returns
  ...
  forever {                           // event loop
    wait_for_request(Request);
    process_request(Request);
  }

9 Simple Process Pair Implementation
(Diagram labels: backup, event loop.)
- Don't forget that we are assuming that the backup has the "full" state info, or that the needed state is stored on (external) stable storage
- Mostly focusing on crash failures; the primary can hang too (watchdog timers)
- Transients are OK too, except that this model is at a basic concept level

10 Syscalls
(Timeline diagram: the parent process calls into the kernel over time, alternating fork and waitpid.)

11 man page: fork
fork() creates a child process that differs from the parent process only in its PID and PPID, and in the fact that resource utilizations are set to 0. File locks and pending signals are not inherited.
RETURN VALUE
On success, the PID of the child process is returned in the parent's thread of execution, and a 0 is returned in the child's thread of execution. On failure, a -1 will be returned in the parent's context, no child process will be created, and errno will be set appropriately.
ERRORS
EAGAIN  fork() cannot allocate sufficient memory to copy the parent's page tables and allocate a task structure for the child.
EAGAIN  It was not possible to create a new process because the caller's RLIMIT_NPROC resource limit was encountered.
ENOMEM  fork() failed to allocate the necessary kernel structures because memory is tight.

12 man page: waitpid(pid, *status, options)
The waitpid() system call suspends execution of the current process until a child specified by the pid argument has changed state. By default, waitpid() waits only for terminated children.
The value of pid can be:
< -1  meaning wait for any child process whose process group ID is equal to the absolute value of pid.
-1    meaning wait for any child process.
0     meaning wait for any child process whose process group ID is equal to that of the calling process.
> 0   meaning wait for the child whose process ID is equal to the value of pid.
waitpid(): on success, returns the process ID of the child whose state has changed; on error, -1 is returned.
ERRORS
ECHILD  The process specified by pid does not exist or is not a child of the calling process. (This can happen for one's own child if the action for SIGCHLD is set to SIG_IGN. See also the LINUX NOTES section about threads.)
EINTR   WNOHANG was not set and an unblocked signal or a SIGCHLD was caught.
EINVAL  The options argument was invalid.
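
To make the two man pages concrete, here is a minimal sketch of one fork/waitpid round trip. The function name run_child_and_wait is hypothetical and not from the slides. It checks the fork return value, retries waitpid when a signal interrupts it (EINTR), and decodes the status word instead of discarding it.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits immediately with `code`,
 * then wait for exactly that child.
 * Returns the child's exit status, or -1 if fork/waitpid failed or the
 * child did not exit normally (for example, it was killed by a signal). */
int run_child_and_wait(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                  /* fork failed: no child was created */
    if (pid == 0)
        _exit(code);                /* child: terminate right away */

    int status;
    pid_t w;
    do {
        w = waitpid(pid, &status, 0);       /* wait for this specific child */
    } while (w == -1 && errno == EINTR);    /* interrupted by a signal: retry */

    if (w != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because pid is checked before waitpid is called, waitpid never receives -1 as its first argument, so it cannot accidentally wait for an unrelated child.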

13 Simple Process Pair
int backup() {
  int ret, restarts = 0;              // count number of child procs
  for (;; restarts++) {
    ret = fork();                     // create child
    if (ret == 0) {                   // child?
      return restarts;
    }
    while (ret != waitpid(ret, 0, 0)) // parent waits for child to terminate
      ;
  }
}
Signature: waitpid(PID, *status, options)

14 Robust?

15 Failed fork system call (looping?)
int backup() {
  int ret, restarts = 0;
  for (;; restarts++) {
    ret = fork();                     // returns -1 on error
    if (ret == 0) {                   // child?
      return restarts;
    }
    while (ret != waitpid(ret, 0, 0)) // returns with -1: no child created
      ;
  }
}
The parent does not return: it retries until fork succeeds.

16 Problem: forked another child
...
fork()                                // fork non-terminating child
...
backup()
...
  fork();                             // ret = fork() fails, returns -1
  waitpid(-1, 0, 0);                  // waitpid(ret,0,0) now waits for any child... might not return

17 Graceful Degradation
int backup() {
  int ret, restarts = 0;
  for (;; restarts++) {
    ret = fork();
    if (ret < 0) {                    // process can run without backup: just return if fork fails
      log("backup:...");
      return -1;
    }
    if (ret == 0) {                   // child returns
      return restarts;
    }
    while (ret != waitpid(ret, 0, 0))
      ;
  }
}

18 Selective Retries
Retries:
– repeat a call until it succeeds, or until we run out of time (timeout) or reach the max. number of retries
Selective Retries:
– only repeat calls when there is a chance that a retry can succeed
– e.g., memory shortage might disappear
– e.g., invalid argument will typically stay invalid
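
A minimal sketch of a selective retry around fork, assuming that EAGAIN and ENOMEM (the two errors listed on the fork man page slide) are the only ones worth retrying. The names is_retryable and fork_with_selective_retry, the retry budget, and the fixed delay are illustrative assumptions, not part of the original slides.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the error is plausibly transient, so a retry might succeed.
 * Assumption: only resource shortages (EAGAIN, ENOMEM) are worth retrying. */
int is_retryable(int err)
{
    return err == EAGAIN || err == ENOMEM;
}

/* Call fork(); on failure retry at most max_retries more times, and only
 * for retryable errors. Sleeps delay_seconds between attempts.
 * Returns what fork() returned last (-1 with errno set on failure). */
pid_t fork_with_selective_retry(int max_retries, unsigned delay_seconds)
{
    for (int attempt = 0; ; attempt++) {
        pid_t pid = fork();
        if (pid >= 0)
            return pid;                     /* success, in parent or child */
        if (!is_retryable(errno) || attempt >= max_retries)
            return -1;                      /* give up: permanent error or budget spent */
        sleep(delay_seconds);               /* give the shortage time to clear */
    }
}
```

The bounded retry count and the delay are two possible answers to "how often?" and "should we wait?"; other policies (timeouts, increasing delays) fit the same structure.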

19 Not always clear if retry could succeed
fork() creates a child process that differs from the parent process only in its PID and PPID, and in the fact that resource utilizations are set to 0. File locks and pending signals are not inherited.
RETURN VALUE
On success, the PID of the child process is returned in the parent's thread of execution, and a 0 is returned in the child's thread of execution. On failure, a -1 will be returned in the parent's context, no child process will be created, and errno will be set appropriately.
ERRORS
EAGAIN  fork() cannot allocate sufficient memory to copy the parent's page tables and allocate a task structure for the child.
EAGAIN  It was not possible to create a new process because the caller's RLIMIT_NPROC resource limit was encountered.
ENOMEM  fork() failed to allocate the necessary kernel structures because memory is tight.

20 Selective Retries
int ft = backup();                    // can fail, but might succeed when more memory is available or there are fewer processes
...
forever {
  wait_for_request(Request);
  process_request(Request);
}
Infinite retries? Delays?

21 Selective Retries
int ft = backup();
...
forever {
  if (ft < 0) {                       // retry if no backup
    ft = backup();
  }
  wait_for_request(Request);
  process_request(Request);
}
Might be a lot of retries... state might already be corrupted

22 Retry Questions...
How often should we retry?
– should we wait between retries?
Should we retry at some later point in time?
– how many times until we give up?
At what level should we retry?

23 Hierarchical Retries
(Diagram: function h calls function f; both levels retry.)
– potentially: exponential increase in retries!
– composability: retries should be independent of each other!
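
The growth in retries can be counted with a small sketch. It assumes a two-level chain in which h retries f three times and the caller of h retries h three times; the retry helper, the counter, and the limit of three are hypothetical. Each extra level multiplies the number of attempts, so k levels with r tries each cost r^k calls of the innermost function.

```c
/* Counts how often the innermost operation is attempted. */
static int attempts = 0;

/* Innermost operation f: always fails in this sketch. */
static int f(void)
{
    attempts++;
    return -1;
}

/* Hypothetical generic helper: call op up to `tries` times until it succeeds. */
static int retry(int (*op)(void), int tries)
{
    for (int i = 0; i < tries; i++)
        if (op() == 0)
            return 0;
    return -1;
}

/* h calls f and retries it three times. */
static int h(void)
{
    return retry(f, 3);
}

/* The caller retries h three times as well; returns the total number of
 * attempts on f, which is 3 * 3 = 9 for two levels. */
int count_attempts(void)
{
    attempts = 0;
    retry(h, 3);
    return attempts;
}
```

One way to keep retries composable is to retry at a single level only and let every other level report the failure upward.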

24 Selective Retries
Under high load, calls might fail due to resource shortage
We can use selective retries to increase the probability of success during resource allocation
Operating systems like Linux have a "killer process" (the out-of-memory killer) that terminates processes if too few resources exist
With selective retries, this will make sure that processes that survive can complete their requests

25 Bohrbugs

26 Continuous Crashing

27 Continuous Crashing
Finite number of retries by client?
– client will stop sending the request eventually
But what if we cannot control clients?
– clients might think it is fun to crash the server?
– DoS attacks take place like this!
– What happens if the retrying request activates Bohrbugs?

28 Graceful Degradation
Alternative approach:
– server needs to make sure that a failed request is only retried a fixed number of times
Problem:
– how can we know that a request has already been partially processed several times?
Solution:
– need to keep some state info between request instances!

29 State Handling

30 Using Session State
int ft = backup();
...
forever {
  wait_for_request(Request);
  get_session_state(Request);
  if (num_retries < N) {
    process_request(Request);
    store_session_state(Request);
  } else {
    return_error();
  }
}
(Slide annotation: updates number of retries)
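
A minimal sketch of the retry bookkeeping, assuming each request carries an integer id and that a small in-memory table stands in for the session state. All names here (admit_request, MAX_ATTEMPTS, the table layout) are hypothetical. In a real process pair the table would have to survive a crash of the server, for example by living on stable storage. The sketch counts an attempt before processing starts, so that a request which crashes the server is still counted.

```c
#define MAX_SESSIONS 64
#define MAX_ATTEMPTS 3          /* plays the role of N on the slide */

struct session {
    int in_use;
    int request_id;
    int attempts;
};

static struct session table[MAX_SESSIONS];

/* Find the entry for request_id, creating it if needed.
 * Returns NULL if the table is full. */
static struct session *get_session(int request_id)
{
    struct session *free_slot = 0;
    for (int i = 0; i < MAX_SESSIONS; i++) {
        if (table[i].in_use && table[i].request_id == request_id)
            return &table[i];
        if (!table[i].in_use && free_slot == 0)
            free_slot = &table[i];
    }
    if (free_slot != 0) {
        free_slot->in_use = 1;
        free_slot->request_id = request_id;
        free_slot->attempts = 0;
    }
    return free_slot;
}

/* Record one more attempt for request_id.
 * Returns 1 if the request may be processed, 0 if it must be rejected. */
int admit_request(int request_id)
{
    struct session *s = get_session(request_id);
    if (s == 0 || s->attempts >= MAX_ATTEMPTS)
        return 0;
    s->attempts++;              /* count first, then process */
    return 1;
}
```

With this policy the fourth delivery of the same request is answered with an error instead of being processed again, while other requests are unaffected.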

31 Crash of Parent!

32 What if parent process dies?
Parent could get killed. Possible reasons:
– Operator might kill the wrong process
– Parent might terminate for some other reason, e.g.,
  Linux: out of memory process killer (see earlier slide!)
  Kills processes that use too much memory: "more cpu time decreases the chance of being killed"

33 Detecting Parent Crashes

34 Detection of Process Crashes
Pipe used to communicate between processes
– Unix: ls | sort
Pipe end closed when
– process terminates
Process B can detect
– when process A terminated

35 Adding Parent Termination Detection
int fd[2];                            // pipe fd: fd[0] is the read end, fd[1] is the write end
int backup() {
  ...
  pipe(fd);
  ret = fork();
  if (ret == 0) {                     // child?
    close(fd[1]);                     // child closes the write end
    return restarts++;
  }
  // parent closes other end:
  close(fd[0]);                       // read end
  ...

36 Child can detect parent termination
int hasParentTerminated() {
  // check if other end of pipe has been closed
  ...
}
Has to be called periodically
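
One possible way to fill in the check is a non-blocking poll on the read end of the pipe. This is a sketch under two assumptions: the parent never writes data into the pipe, and the child has closed its own copy of the write end (otherwise the pipe never reports end-of-file). The name has_parent_terminated and the file descriptor parameter are illustrative.

```c
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if every write end of the pipe behind read_fd has been closed
 * (the parent is gone), 0 if a writer still exists, -1 on error.
 * Never blocks, so it can be called periodically from the event loop. */
int has_parent_terminated(int read_fd)
{
    struct pollfd p = { .fd = read_fd, .events = POLLIN, .revents = 0 };

    int n = poll(&p, 1, 0);             /* timeout 0: return immediately */
    if (n < 0)
        return -1;
    if (n == 0)
        return 0;                       /* nothing to report: writer still alive */
    if (p.revents & (POLLERR | POLLNVAL))
        return -1;
    if (p.revents & POLLHUP)
        return 1;                       /* all write ends closed */
    if (p.revents & POLLIN) {
        /* Some systems signal end-of-file as readable: read() returns 0.
         * Assumption: the parent never sends data, so consuming a byte is harmless. */
        char byte;
        ssize_t r = read(read_fd, &byte, 1);
        if (r == 0)
            return 1;
        if (r < 0)
            return -1;
    }
    return 0;
}
```

The kernel closes the parent's write end when the parent terminates for any reason, including SIGKILL, which is what makes the pipe usable as a failure detector.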

37 Problem: State Corruption

38 Parent Replacement
(Diagram label: already executed requests.)
e.g., new parent allocated resources that are never freed

39 Alternative Approach

40 Process Links
Generalized Crash Detection

41 Linking Processes
We can use a pipe as a failure detector:
– We can detect that a process has terminated
We can use that for:
– Replacing failed processes
– Providing some "termination atomicity":
  If one process fails, some other processes might not be able to work properly anymore
  One simple way is to terminate all such processes
  Garbage collection of processes

42 Process Links: "Termination Atomicity"
Set of cooperating processes
If some process p terminates, each linked process q must terminate
We can link processes via "process links":
– Programming language support: Java, Erlang, …

43 Pipe And Filter

44 Example: Farmer / Worker

45 Asymmetric Link Behavior

46 Master as Process Pair
Mitigates parent crash semantics by avoiding terminations as far as possible, for liveness

47 Error Recovery in Distributed Systems (DS)
Checkpointing

48 Handling Transients?
Transient fault: a fault that is no longer present after system restart
Many flavors:
– SW transients
– OS transients
– Middleware/Protocol transients
– Network transients
– Operational transients
– Power transients
Need to recover from the effects of transients, which means detecting them!
… let us assume simple local sanity checks (acceptance tests) exist!

49 So how does one handle these transients?
Objective:
- sustained ops (key driver: sustained performance)
- transparent handling of bugs (to users and application designers)
System model: Coupled/Distributed/Networked Processes

50 Periodic Checkpointing

51 Checkpointing
pid_t parent = getpid();
...
for (int nxt_ckpt = 0;; nxt_ckpt--) {
  if (nxt_ckpt <= 0) {
    pid_t newparent = getpid();
    if (backup() >= 0 && parent != newparent) {
      kill(parent, SIGKILL);
      parent = newparent;
      nxt_ckpt = N;
    }
  }
  wait_for_request(Request);
  process_request(Request);
}

52 Backup Code Revisited
Issue:
– If we have multiple generations, we want the ancestors only to take over if none of the children is alive
Use process links instead of waitpid
– waitpid in an endless loop is dangerous anyhow...

53 Temporal Redundancy
"Redo" tasks on error detection
(Diagram: process P makes task progress; a transient occurs (X) and is detected; P redoes the task.)

54 Backward Error Recovery
Save process state at predetermined (periodic) recovery points
– Called "checkpoints"
– Checkpoints stored on stable storage, not affected by same failures
Recover by rolling back to a previously saved (error-free) state
chkpt: complete set of (state) information needed to re-start task execution from chkpt.
(Diagram: process P makes task progress past a chkpt; a transient occurs (X) and is caught by the acceptance test.)
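
The checkpoint / acceptance test / rollback cycle can be sketched in a few lines. Everything here is an illustrative assumption: the state layout, the acceptance test (a balance must not go negative), and an in-memory copy standing in for stable storage. For brevity the sketch checkpoints before every step rather than periodically.

```c
/* Hypothetical task state. */
struct state {
    long balance;
    long ops;
};

/* Stands in for stable storage in this sketch. */
static struct state saved;

void take_checkpoint(const struct state *s) { saved = *s; }
void roll_back(struct state *s)             { *s = saved; }

/* Acceptance test: a simple local sanity check on the state. */
int acceptance_test(const struct state *s)
{
    return s->balance >= 0;
}

/* Apply delta to the state; if the result fails the acceptance test,
 * roll back to the checkpoint taken just before the step.
 * Returns 1 if the step was kept, 0 if it was undone. */
int guarded_step(struct state *s, long delta)
{
    take_checkpoint(s);
    s->balance += delta;
    s->ops++;
    if (!acceptance_test(s)) {
        roll_back(s);
        return 0;
    }
    return 1;
}
```

Note that the rollback needs no knowledge of what went wrong; it only needs the saved state to be error-free.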

55 Advantages of Backward Recovery
+ Requires no knowledge of the errors in the system state
+ Can handle arbitrary / unpredictable faults (as long as they do not affect the recovery mechanism)
+ Can be applied regardless of the sustained damage (the saved state must be error-free, though)
+ General scheme / application independent
+ Particularly suitable for recovering from transient faults

56 Disadvantages of Backward Recovery
- Requires significant resources (e.g. time, computation, stable storage) for checkpointing and recovery
- Checkpointing requires
  – identifying consistent states
  – the system to be halted / slowed down temporarily
- Care must be taken in concurrent systems to avoid the orphan, lost-message and domino effects (will cover later in the lecture...)

57 Forward Error Recovery
Detect the error
Damage assessment
Build a new error-free state from which the system can continue execution
– "Safe stop"
– Degraded mode
– Error compensation
  E.g., switching to a different component, etc.
(Diagram: fault manifests, fault detected, damage assessment, state reconstruction.)

58 Advantages of Forward Recovery
+ Efficient (time / memory)
  – If the characteristics of the fault are well understood, forward recovery is a very efficient solution
+ Well suited for real-time applications
  – Missed deadlines can be addressed
+ Anticipated faults can be dealt with in a timely way using redundancy

59 Disadvantages of Forward Recovery
- Application-specific
- Can only remove predictable errors from the system state
- Requires knowledge of the actual error
- Depends on the accuracy of error detection, potential damage prediction, and actual damage assessment
- Not usable if the system state is damaged beyond recoverability

60 Error Recovery
Save process state at predetermined (periodic) recovery points
– Called "checkpoints"
– Checkpoints stored on stable storage, not affected by same failures
Recover by rolling back to a previously saved (error-free) state
chkpt: complete set of (state) information needed to re-start task execution from chkpt.
(Diagram: process P makes task progress past a chkpt; a transient occurs (X) and is caught by the acceptance test.)

61 Logging Requests
request_no = 0;
...
for (int nxt_ckpt = 0;; nxt_ckpt--) {
  checkpoint(&nxt_ckpt);
  wait_for_request(Request);
  log_to_disk(++request_no, Request);
  process_request(Request);
}

62 Processing Log
request_no = 0;
...
for (int nxt_ckpt = 0;; nxt_ckpt--) {
  if (checkpoint(&nxt_ckpt) == recovery) {
    while ((request_no+1, R) in log) {  // pseudocode: replay logged requests in order
      process_request(R);
      request_no++;
    }
  }
  wait_for_request(Request);
  log_to_disk(++request_no, Request);
  process_request(Request);
}

63 Problems: Lost Updates, corrupted saved states... not easy to fix!
State diverges from original computation
– results of replayed requests might be different
  could detect this by keeping a log of replies
– new client requests might not be processed correctly
  e.g., ids in requests might not make sense to the current server instance
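
The idea of keeping a log of replies can be sketched as follows: during recovery, each logged request is replayed and the new reply is compared with the reply recorded the first time. The log record layout, the toy handler, and the function names are assumptions made for illustration only.

```c
#include <stddef.h>

/* Hypothetical log record: the request and the reply originally sent. */
struct log_entry {
    int request;
    int reply;
};

/* Hypothetical deterministic handler: the state accumulates the requests
 * and the reply is the new state. */
static int process_request(int *state, int request)
{
    *state += request;
    return *state;
}

/* Replay n logged requests starting from the checkpointed state.
 * Returns the index of the first request whose replayed reply differs
 * from the logged reply, or -1 if the replay matches the log. */
int replay_and_check(int *state, const struct log_entry *log, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int reply = process_request(state, log[i].request);
        if (reply != log[i].reply)
            return (int)i;              /* state has diverged from the original run */
    }
    return -1;
}
```

A mismatch only detects the divergence; deciding how to repair it remains application-specific.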

64 Frequency vs Completeness
Less complete checkpoint
– higher probability that error is purged from saved state
– omitted state needs to be recomputed on recovery
Less frequent checkpointing
– checkpoint becomes larger
– state information becomes stale
– …
"Application save" is (in practice) very robust
– might not always contain all info (e.g., window position) for transparent restart

65 Effectiveness of Checkpointing

66 Distributed Systems: Checkpointing
So how does one place the chkpts & where?
Should we synchronize process-es(-ors) & checkpoints?
(Diagram: processes P1, P2, P3.)
Note: A system can be synchronous though the msg. based comm. can still be async!

67 Options for Checkpoint Storage?
Key building block: stable storage
– Persistent: survives the failure of the entity that created/initialized/used it
– Reliable: very low probability of losing or corrupting info
Implementation
– Typically non-volatile media (disks)
– Single disk? Often replicated/multiple volatile memories
– Make sure one replica at least always survives!
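
On a single disk, a common approximation of stable storage is to write the checkpoint to a temporary file, flush it, and then rename it over the old checkpoint, so that a crash leaves either the old or the new file intact. The sketch below assumes POSIX; the function name save_checkpoint and the ".tmp" suffix are hypothetical. It does not provide the replication the slide asks for, and full durability would also require syncing the containing directory.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes from buf to path via a temporary file and rename().
 * Returns 0 on success, -1 on failure (the old checkpoint is untouched). */
int save_checkpoint(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;                          /* path too long */

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            if (errno == EINTR)
                continue;                   /* interrupted: try again */
            close(fd);
            unlink(tmp);
            return -1;
        }
        p += w;
        left -= (size_t)w;
    }

    if (fsync(fd) != 0) {                   /* force the data to the disk */
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0 || rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;                               /* path now names the new checkpoint */
}
```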

68 Options for Checkpoint Placement?
Uncoordinated: processes take checkpoints independently
– Pro: no delays
– Con: consistency?
Coordinated: have processes coordinate before taking a checkpoint
– Pro: globally consistent checkpoints
– Con: co-ordination delays
Communication-induced: checkpoint when receiving and prior to processing messages that may introduce conflicts

69 What happens when we don't synchronize?
– orphan msgs.
– lost msgs.
(Diagram: P1 and P2 take checkpoints C1 and C2 independently and exchange a message; a fault (X) occurs.)
Rollback to C1 & C2 gives an inconsistent state

70 ...and more problems...
– domino effects
(Diagram: P1 and P2; fault (X).)
* problems are fixable, though they require considerable pre-planning

71
P1 fails, recovers, rolls back to Ca
P2 finds it received message (mi) never sent, rolls back to Cb
P3 finds it received message (mj) never sent, rolls back to Cc
…
(Diagram: P1, P2, P3 with checkpoints Ca, Cb, Cc forming the recovery line; messages mi and mj. Boom!)

72 Consistent Checkpoints: No orphans, lost msgs or dominos!
All messages sent ARE recorded with a consistent cut!
(Diagram: P1, P2, P3 with a consistent cut.)
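
Whether a set of checkpoints suffers from orphan or lost messages can be checked mechanically. The sketch assumes each process numbers its local events and that ckpt[p] is the event number at which process p took its checkpoint; an event is recorded in the checkpoint if its number is at most ckpt[p]. The struct layout and function names are hypothetical. A message is an orphan if its receipt is recorded but its send is not, and lost if its send is recorded but its receipt is not.

```c
enum msg_class { MSG_OK, MSG_ORPHAN, MSG_LOST };

struct message {
    int sender;         /* index of the sending process */
    int receiver;       /* index of the receiving process */
    int send_time;      /* sender's local event number at send */
    int recv_time;      /* receiver's local event number at receive */
};

/* Classify one message against the checkpoints in ckpt[]. */
enum msg_class classify(const struct message *m, const int *ckpt)
{
    int send_recorded = m->send_time <= ckpt[m->sender];
    int recv_recorded = m->recv_time <= ckpt[m->receiver];

    if (recv_recorded && !send_recorded)
        return MSG_ORPHAN;      /* received, but after rollback never sent */
    if (send_recorded && !recv_recorded)
        return MSG_LOST;        /* sent, but after rollback never received */
    return MSG_OK;
}

/* Returns 1 if no message is an orphan or lost with respect to ckpt[]. */
int recovery_line_ok(const struct message *msgs, int n, const int *ckpt)
{
    for (int i = 0; i < n; i++)
        if (classify(&msgs[i], ckpt) != MSG_OK)
            return 0;
    return 1;
}
```

Only orphans make a cut inconsistent in the strict sense; lost (in-transit) messages can be tolerated if they are logged and replayed, which this sketch does not model.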

73 Synchronizing Checkpoints (not the processors!)
Processes co-ordinate (synchronize) to set checkpoints guaranteed to be consistent
– 2 Phase Consistent Checkpointing
  Phase I: An initiator node X takes a "tentative" checkpoint and requests all other processes to set checkpoints. All processes inform X when they are willing to checkpoint
  Phase II: If all other processes are willing to checkpoint, then X decides to make its checkpoint permanent; otherwise X decides that all checkpoints shall be discarded. X informs all of the decision
  Either all or none take permanent checkpoints!
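
The two phases can be simulated in a single address space. The sketch below replaces message exchange with direct array access and ignores failures during the protocol, so it only illustrates the all-or-none decision. The struct, the enum, and the convention that procs[0] is the initiator X are assumptions.

```c
/* Status of the checkpoint being negotiated in the current round. */
enum ckpt_state { CKPT_NONE, CKPT_TENTATIVE, CKPT_PERMANENT };

struct process {
    int willing;                /* answer to the initiator's phase I request */
    enum ckpt_state ckpt;
};

/* Run both phases over procs[0..n-1]; procs[0] is the initiator X.
 * Returns 1 if all checkpoints became permanent, 0 if all were discarded. */
int two_phase_checkpoint(struct process *procs, int n)
{
    int all_willing = 1;

    /* Phase I: X takes a tentative checkpoint and asks everyone else. */
    procs[0].ckpt = CKPT_TENTATIVE;
    for (int i = 1; i < n; i++) {
        if (procs[i].willing)
            procs[i].ckpt = CKPT_TENTATIVE;
        else
            all_willing = 0;
    }

    /* Phase II: X decides and informs all processes of the decision. */
    for (int i = 0; i < n; i++)
        procs[i].ckpt = all_willing ? CKPT_PERMANENT : CKPT_NONE;

    return all_willing;
}
```

A discarded round leaves any earlier permanent checkpoints in place; the sketch tracks only the checkpoint of the current round.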

74 2Phase Consistent Checkpoints
(Diagram: initiator X sends requests to R and S. {X1, R1, S1}: preliminary checkpoints; {X2, R2, S2}: consistent checkpoints.)

75 Atomic Commitment and Window of Vulnerability
So far, recovery of actions that can be individually rolled back…
Better idea:
– Encapsulate actions in sequences that cannot be undone individually
– Atomic transactions provide this
– Properties: ACID
  Atomicity: transaction is an indivisible unit of work
  Consistency: transaction leaves system in correct state or aborts
  Isolation (serializable): transactions' behavior not affected by other concurrent transactions
  Durability: transaction's effects are permanent after it commits

76 Atomic Commit (cont.)
To implement transactions, processes must coordinate!
– Bundling of related events
– Coordination between processes
One protocol: two-phase commit
(Diagram: commit and abort cases.)
Q: can this somehow block?

77 Two-phase commit (cont.)
Problem: coordinator failure after PREPARE & before COMMIT blocks participants waiting for decision (a)
Three-phase commit overcomes this (b)
– delay final decision until enough processes "know" which decision will be taken

78 State Transfer
Reintegrating a failed component requires state transfer!
– If checkpoint/log to stable storage, recovering replica can do incremental transfer
  Recover first from last checkpoint
  Get further logs from active replicas
– Goal: minimal interference with remaining replicas
– Problem: state is being updated!
  Might result in incorrect state transfer (have to coordinate with ongoing messages)
  Might change such that the new replica can never catch up!
– Solution: give higher priority to state-transfer messages
Lots of variations…

