1 Win XP/Vista/Win7+++ Win 2000. 2 Bugs, Bugs and Bugs.

Presentation transcript:

1 1 Win XP/Vista/Win7+++ Win 2000

2 2 Bugs, Bugs and Bugs

3 3 Bugs: Run Time Handling
Heisenbugs/Mandelbugs
–Heisenbugs are easier to take care of at run time
–Higher chance that robust programming mechanisms are successful
Bohrbugs
–Typically easier to find and fix at design time
–But harder to take care of at run time
We'll cover schemes later that handle both types, though let's try simple approaches first

4 4 Perturbation Classifications/Coverage
Persistence
–Transient fault
–Intermittent fault
–Permanent fault
Creation time
–Design fault
–Operational fault
Intention
–Accidental fault
–Intentional fault
Crash failure
–Fail-silent and fail-stop
Omission failure
Timing failure
–System fails to respond within a specified time slice
–Both late and early responses might be bad
–Late timing failure = performance failure
Arbitrary failure
–System behaves arbitrarily

5 5 Robust Programming Mechanisms
Objective: sustain the delivery of services despite perturbations!
–Process Pairs
–Graceful Degradation
–Selective Retry
–Checkpointing
–Rejuvenation
–Micro-reboots
–Recovery Blocks
–Diversity (NVP, NCP)
...

6 6 Process Pairs (Continual Service)
Client sends its request to the pair; as long as one process is correct, the client should get an answer.
Implementation variants:
(a) Active replicas: both process the client request [+ fast; - complex]
(b) Primary/backup: only one processes the request and transfers its state to the other [+ simpler; - delay]
(c) Only one processes the request and does not update the state of the other: fast, but state consistency problems later

7 7 Process Pairs
Process pair scheme is robust to varied types of software faults (crashes, resource shortage/delays, load…):
–Study of print servers with process pair technology (primary/backup)
–2000 systems; 10 million system hours
–99.3% of failures affected only one server, i.e., 99.3% of failures were tolerated

8 8 Simple Process Pair (same host)
Server process:
...
for (;;) {                          // event loop
    wait_for_request(Request);
    process_request(Request);
}
A process pair only takes care of crash failures; watchdogs are needed to take care of hang failures etc.

9 9 Simple Process Pair (same host)
Server process:
int ft = backup();                  // create backup process; primary returns
...
for (;;) {                          // event loop
    wait_for_request(Request);
    process_request(Request);
}
Simplicity!! Just call it as a function

10 10 Simple Process Pair Implementation
[Figure labels: backup, event loop]
- Don't forget that we are assuming that the backup has the full state info or that the needed state is stored on (external) stable storage
- Mostly focusing on crash failures; the primary can hang too, which calls for watchdog timers
- Transients are OK too, except that this model is at a basic concept level: state is lost during a crash
- The hope is that all needed state is stored externally, e.g., in the file system

11 11 Syscalls
[Figure: timeline of system calls from the parent process to the kernel: fork, waitpid, fork, waitpid, fork, ...]

12 12 man page: fork
fork() creates a child process that differs from the parent process only in its PID and PPID, and in the fact that resource utilizations are set to 0. File locks and pending signals are not inherited.
RETURN VALUE
On success, the PID of the child process is returned in the parent's thread of execution, and a 0 is returned in the child's thread of execution. On failure, a -1 will be returned in the parent's context, no child process will be created, and errno will be set appropriately.
ERRORS
EAGAIN fork() cannot allocate sufficient memory to copy the parent's page tables and allocate a task structure for the child.
EAGAIN It was not possible to create a new process because the caller's RLIMIT_NPROC resource limit was encountered.
ENOMEM fork() failed to allocate the necessary kernel structures because memory is tight.
Don't forget that there is a limit on the number of threads; exceeding it gives an EAGAIN error!

13 13 man page: waitpid(pid, *status, options)
The waitpid() system call suspends execution of the current process until a child specified by the pid argument has changed state. By default, waitpid() waits only for terminated children.
The value of pid can be:
< -1 meaning wait for any child process whose process group ID is equal to the absolute value of pid.
-1 meaning wait for any child process.
0 meaning wait for any child process whose process group ID is equal to that of the calling process.
> 0 meaning wait for the child whose process ID is equal to the value of pid.
waitpid(): on success, returns the process ID of the child whose state has changed; on error, -1 is returned.
ERRORS
- ECHILD The process specified by pid does not exist or is not a child of the calling process. (This can happen for one's own child if the action for SIGCHLD is set to SIG_IGN. See also the LINUX NOTES section about threads.)
- EINTR WNOHANG was not set and an unblocked signal or a SIGCHLD was caught.
- EINVAL The options argument was invalid.

14 14 Simple Process Pair
int backup() {
    int ret, restarts = 0;          // count number of child procs
    for (;; restarts++) {
        ret = fork();               // create child
        if (ret == 0) {             // child?
            return restarts;
        }
        while (ret != waitpid(ret, 0, 0))   // parent waits for child to terminate
            ;
    }
}
Create child; child returns; parent waits for child to terminate.
See: waitpid(PID, *status, options), fork

15 15 Robust?

16 16 Failed fork system call (looping?)
int backup() {
    int ret, restarts = 0;
    for (;; restarts++) {
        ret = fork();               // returns -1 on error: no child created
        if (ret == 0) {             // child?
            return restarts;
        }
        while (ret != waitpid(ret, 0, 0))   // parent does not return
            ;
    }
}
If fork fails, the parent loops and retries until fork succeeds, creating new children: an implicit retry.

17 17 Problem: forked another child
...
fork();                 // fork non-terminating child
...
backup();               // inside backup(): ret = fork(); waitpid(ret, 0, 0);
    fork();             // fails, returns -1
    waitpid(-1, 0, 0);  // waits for any child... might not return

18 18 Graceful Degradation
int backup() {
    int ret, restarts = 0;
    for (;; restarts++) {
        ret = fork();
        if (ret < 0) {              // process can run without backup: just return if fork fails
            log("backup:...");
            return -1;
        }
        if (ret == 0) {             // child returns
            return restarts;
        }
        while (ret != waitpid(ret, 0, 0))
            ;
    }
}
Why not retry in backup? We should do that; it might help in association with the killer process! But we need a top-level retry mechanism!

19 19 Selective Retries
Retries:
–repeat a call until it succeeds, or until we run out of time (timeout) or reach the max. number of retries
Selective retries:
–repeat a call only when there is a chance that the retry can succeed
–e.g., memory shortage might disappear
–e.g., invalid argument will typically stay invalid
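The distinction between plain and selective retries can be sketched in C as follows. This is only an illustration under stated assumptions: the names `retry_may_help`, `selective_retry`, and `demo_op` are hypothetical and do not come from the slides, and the choice of which errno values count as retryable (EAGAIN, ENOMEM, EINTR) is one plausible policy, not the only one.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper: is a failed call worth retrying, given its errno? */
bool retry_may_help(int err)
{
    switch (err) {
    case EAGAIN:   /* resource limit reached; may clear when load drops */
    case ENOMEM:   /* memory shortage might disappear */
    case EINTR:    /* call was interrupted by a signal */
        return true;
    default:       /* e.g. EINVAL: an invalid argument stays invalid */
        return false;
    }
}

/* Call op() until it succeeds, fails with a non-retryable error, or has
 * been tried max_tries times. Returns the result of the last attempt. */
int selective_retry(int (*op)(void), int max_tries)
{
    assert(max_tries > 0);
    int ret = -1;
    for (int i = 0; i < max_tries; i++) {
        errno = 0;
        ret = op();
        if (ret >= 0 || !retry_may_help(errno))
            break;
    }
    return ret;
}

/* Demo stand-in for a real call: fails twice with EAGAIN, then succeeds. */
int demo_calls = 0;
int demo_op(void)
{
    if (++demo_calls < 3) {
        errno = EAGAIN;
        return -1;
    }
    return 0;
}
```

Calling `selective_retry(demo_op, 5)` succeeds on the third attempt, whereas a call failing with EINVAL would be given up after the first attempt.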

20 20 Not always clear if retry could succeed
fork() creates a child process that differs from the parent process only in its PID and PPID, and in the fact that resource utilizations are set to 0. File locks and pending signals are not inherited.
RETURN VALUE
On success, the PID of the child process is returned in the parent's thread of execution, and a 0 is returned in the child's thread of execution. On failure, a -1 will be returned in the parent's context, no child process will be created, and errno will be set appropriately.
ERRORS
- EAGAIN fork() cannot allocate sufficient memory to copy the parent's page tables and allocate a task structure for the child.
- EAGAIN It was not possible to create a new process because the caller's RLIMIT_NPROC resource limit was encountered.
- ENOMEM fork() failed to allocate the necessary kernel structures because memory is tight.
RLIMIT_NPROC: The maximum number of threads that can be created for the real user ID of the calling process. Upon encountering this limit, fork() fails with the error EAGAIN.

21 21 Selective Retries
int ft = backup();                  // can fail, but might succeed when more memory is available or fewer processes exist
...
for (;;) {
    wait_for_request(Request);
    process_request(Request);
}
Infinite retries? Delays?
–Selective retries: retry calls for which a retry might help, e.g., resource problems
–Do not retry all failed calls, e.g., those that failed due to invalid arguments
–Do not retry infinitely often: it might lead to unacceptable delays

22 22 Selective Retries
int ft = backup();
...
for (;;) {
    if (ft < 0) {                   // retry if no backup
        ft = backup();
    }
    wait_for_request(Request);
    process_request(Request);
}
–Might be a lot of retries...
–State might already be corrupted
–Might need too much processing for each request; potentially too much processing power taken away
–Exponential backoff: generally a good trade-off as it balances overhead and eventual success of the retry
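One way to avoid retrying on every single request is to space the attempts out exponentially. The sketch below counts event-loop iterations rather than wall-clock time; the `struct backoff` type and its three functions are hypothetical names introduced for illustration, and the doubling policy with a cap is an assumption about what such a scheme could look like.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for retrying a failed call (such as backup())
 * only on some iterations of the event loop. */
struct backoff {
    unsigned wait;       /* iterations to skip after the next failure */
    unsigned remaining;  /* iterations still to skip right now */
    unsigned max_wait;   /* cap, so that retries never stop completely */
};

void backoff_init(struct backoff *b, unsigned max_wait)
{
    assert(max_wait >= 1);
    b->wait = 1;
    b->remaining = 0;
    b->max_wait = max_wait;
}

/* Returns true if the caller should attempt the retry on this iteration. */
bool backoff_should_retry(struct backoff *b)
{
    if (b->remaining > 0) {
        b->remaining--;
        return false;
    }
    return true;
}

/* Record a failed attempt: skip b->wait iterations, then double the gap. */
void backoff_failed(struct backoff *b)
{
    b->remaining = b->wait;
    b->wait = (b->wait * 2 > b->max_wait) ? b->max_wait : b->wait * 2;
}
```

In the event loop, the test `if (ft < 0)` would become `if (ft < 0 && backoff_should_retry(&b))`, followed by `backoff_failed(&b)` when `backup()` fails again. With a permanently failing call, attempts then happen on iterations 0, 2, 5, 10, and so on, instead of on every request.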

23 23 Retry Questions...
How often should we retry?
–should we wait between retries?
Should we retry at some later point in time?
–how many times until we give up?
At what level should we retry?

24 24 Hierarchical Retries
[Figure: function h calls function f; both retry]
–Potentially an exponential increase in retries!
–Composability: retries should be independent of each other!
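The blow-up from nested retries can be made concrete with a small counting sketch. It assumes a leaf operation `op` (a hypothetical name) that always fails, a function `f` that retries `op` up to n times, and a function `h` that retries `f` up to n times without knowing that `f` already retries.

```c
#include <assert.h>

int op_calls = 0;

/* Leaf operation that always fails, to show the worst case. */
int op(void)
{
    op_calls++;
    return -1;
}

/* f retries op up to n times. */
int f(int n)
{
    assert(n > 0);
    for (int i = 0; i < n; i++)
        if (op() == 0)
            return 0;
    return -1;
}

/* h retries f up to n times, unaware that f already retries. */
int h(int n)
{
    assert(n > 0);
    for (int i = 0; i < n; i++)
        if (f(n) == 0)
            return 0;
    return -1;
}
```

With n = 3, a single call to `h` invokes `op` 9 times; each additional retrying layer multiplies the count by n again, which is the exponential growth in the call depth that the slide warns about.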

25 25 Selective Retries
Under high load, calls might fail due to resource shortage
We can use selective retries to increase the probability of success during resource allocation
Operating systems like Linux have a killer process that terminates processes if too few resources exist
With selective retries, this makes sure that the processes that survive can complete their requests

26 26 Bohrbugs

27 27 Continuous Crashing

28 28 Continuous Crashing
Finite number of retries by client?
–client will stop sending the request eventually
But what if we cannot control clients?
–clients might think it is fun to crash the server? DoS attacks take place like this!
What happens if the retried request activates Bohrbugs?

29 29 Graceful Degradation
Alternative approach:
–server needs to make sure that a failed request is only retried a fixed number of times
Problem:
–how can we know that a request has already been partially processed several times?
Solution:
–need to keep some state info between request instances!

30 30 State Handling (load & store application states)

31 31 Using Session State
int ft = backup();
...
for (;;) {
    wait_for_request(Request);
    get_session_state(Request);
    if (num_retries < N) {
        process_request(Request);
        store_session_state(Request);
    } else {
        return_error();
    }
}
Note: the session state handling updates the number of retries.
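A minimal sketch of such a retry counter, assuming the session state for one request is kept in a stream that survives a crash of the serving process (for example a file on disk). The names `record_attempt` and `may_process` are hypothetical. Unlike the loop on the slide, this sketch bumps the counter before processing starts, an assumption made here so that an attempt that crashes midway is still counted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Increment the attempt counter stored in `state` and return the new
 * value, or -1 on an I/O error. An empty stream counts as zero attempts. */
int record_attempt(FILE *state)
{
    assert(state != NULL);
    int attempts = 0;
    rewind(state);
    if (fscanf(state, "%d", &attempts) != 1)
        attempts = 0;                 /* nothing stored yet */
    attempts++;
    rewind(state);
    if (fprintf(state, "%d\n", attempts) < 0 || fflush(state) != 0)
        return -1;
    return attempts;
}

/* Returns true if the request may be processed one more time. */
bool may_process(FILE *state, int max_attempts)
{
    int n = record_attempt(state);
    return n > 0 && n <= max_attempts;
}
```

With `max_attempts` set to 2, the first two calls to `may_process` on the same stream return true and the third returns false, at which point the server would answer with an error instead of processing the request again.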

32 32 Crash of Parent!

33 33 What if parent process dies?
Normally we would expect that the parent does not crash, as it just performs a waitpid, but the parent could get killed.
Possible reasons:
–Operator might kill the wrong process
–Parent might terminate for some other reason, e.g., the Linux out-of-memory process killer (see earlier slide!), which kills processes that use too much memory; more CPU time decreases the chance of being killed

34 34 Detecting Parent Crashes

35 35 Detection of Process Crashes
Pipe used to communicate between processes
–Unix: ls | sort
Pipe end closed when
–process terminates
Process B can detect
–when process A terminated

36 36 Adding Parent Termination Detection
int fd[2];                          // pipe fd: fd[0] is the read end, fd[1] is the write end
int backup() {
    ...
    pipe(fd);
    ret = fork();
    if (ret == 0) {                 // child?
        close(fd[1]);               // child closes the write end
        return restarts++;
    }
    // parent closes other end:
    close(fd[0]);
    ...

37 37 Child can detect parent termination
int hasParentTerminated() {         // has to be called periodically
    // check if other end of pipe has been closed
    ...
}
Pipe as detector: not completely satisfactory, since the child is already in a state that might be corrupted! We would need to start the application with an option saying that it is in parent mode, so that it jettisons most state!
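The check left open in the body of hasParentTerminated could be written with poll(), as in the sketch below. It assumes the child holds the read end of the pipe, the parent holds the only write end, and the parent never writes to it; the function name `has_parent_terminated` and its `read_fd` parameter are hypothetical (the slide's version takes no argument and would use the global fd).

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/* Returns true once every write end of the pipe behind read_fd has been
 * closed, which happens when the process holding it terminates.
 * Does not block: poll() is called with a timeout of 0. */
bool has_parent_terminated(int read_fd)
{
    assert(read_fd >= 0);
    struct pollfd p = { .fd = read_fd, .events = POLLIN };
    int n = poll(&p, 1, 0);
    return n > 0 && (p.revents & POLLHUP) != 0;
}
```

The child would call `has_parent_terminated(fd[0])` periodically, for example once per iteration of its event loop. Closing the write end in the same process has the same observable effect as the parent exiting, which makes the function easy to try out without forking.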

38 38 Problem: State Corruption

39 39 Parent Replacement
New parent has already executed requests
–e.g., new parent allocated resources that are never freed
–new children will also have that corrupted state!

40 40 Alternative Approach
Reinit may fail and might cost too much time...
No ideal solution as far as I know; I would stick with the first solution when possible, since the parent might not fail that often

41 41 Process Links: Generalized Crash Detection

42 42 Linking Processes
We can use a pipe as a failure detector:
–We can detect that a process has terminated
We can use that for:
–Replacing failed processes
–Providing some termination atomicity (if one dies all die!)
If one process fails, some other processes might not be able to work properly anymore
–One simple way is to terminate all such processes
–Garbage collection of processes

43 43 Process Links: Termination Atomicity
Set of cooperating processes
If some process p terminates, each linked process q must terminate
We can link processes via process links:
–Programming language support: Java, Erlang, …

44 44 Pipe And Filter

45 45 Example: Farmer / Worker
[Figure labels: Farmer process pair; Worker]

46 46 Asymmetric Link Behavior

47 47 Master as Process Pair
Mitigates parent crashes by avoiding terminations as far as possible, for liveness

48 48 Error Recovery in Distributed Systems (DS): Checkpointing

49 49 Handling Transients?
Transient fault: a fault that is no longer present after system restart
Many flavors:
–SW transients
–OS transients
–Middleware/Protocol transients
–Network transients
–Operational transients
–Power transients
Need to recover from the effects of transients: first detect them! Let us assume simple local sanity checks (acceptance tests) exist!

50 50 So how does one handle these transients?
Objective:
- sustained ops (key driver: sustained performance)
- transparent handling of bugs (to users and application designers)
System model: coupled/distributed/networked processes

51 51 Periodic Checkpointing

52 52 Checkpointing
pid_t parent = getpid();
...
for (int nxt_ckpt = 0;; nxt_ckpt--) {
    if (nxt_ckpt <= 0) {            // time for the next checkpoint
        pid_t newparent = getpid();
        if (backup() >= 0 && parent != newparent) {
            kill(parent, SIGKILL);  // old parent is replaced by the new one
            parent = newparent;
            nxt_ckpt = N;
        }
    }
    wait_for_request(Request);
    process_request(Request);
}

53 53 Backup Code Revisited
Issue:
–If we have multiple generations, we want the ancestors to take over only if none of the children is alive
Use process links instead of waitpid
–waitpid in an endless loop is dangerous anyhow...

54 54 Temporal Redundancy
Redo tasks on error detection
[Figure: task progress of process P; a transient occurs (and is detected) at X; the task is redone]

55 55 Backward Error Recovery
Save process state at predetermined (periodic) recovery points
–Called checkpoints
–Checkpoints stored on stable storage, not affected by same failures
Recover by rolling back to a previously saved (error-free) state
chkpt: complete set of (state) information needed to restart task execution from the checkpoint
[Figure: task progress of process P with checkpoints (chkpt); a transient X is caught by the acceptance test]
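Saving and rolling back to a checkpoint can be sketched with two small functions. This is an illustration only: `struct task_state`, `checkpoint_save`, and `checkpoint_restore` are hypothetical names, and a plain stdio stream stands in for stable storage, which in a real system would need durable, replicated media.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical task state: everything needed to restart the task. */
struct task_state {
    long steps_done;
    long sum;
};

/* Overwrite the checkpoint in `store` with *s. Returns 0 on success. */
int checkpoint_save(FILE *store, const struct task_state *s)
{
    assert(store != NULL && s != NULL);
    rewind(store);
    if (fwrite(s, sizeof *s, 1, store) != 1)
        return -1;
    return fflush(store) == 0 ? 0 : -1;
}

/* Roll *s back to the last saved checkpoint. Returns 0 on success. */
int checkpoint_restore(FILE *store, struct task_state *s)
{
    assert(store != NULL && s != NULL);
    rewind(store);
    return fread(s, sizeof *s, 1, store) == 1 ? 0 : -1;
}
```

A task would call `checkpoint_save` at each recovery point and `checkpoint_restore` whenever its acceptance test rejects the current state; the work done since the checkpoint is then redone.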

56 56 Advantages of Backward Recovery
+ Requires no knowledge of the errors in the system state
+ Can handle arbitrary / unpredictable faults (as long as they do not affect the recovery mechanism)
+ Can be applied regardless of the sustained damage (the saved state must be error-free, though)
+ General scheme / application independent
+ Particularly suitable for recovering from transient faults

57 57 Disadvantages of Backward Recovery
Requires significant resources (e.g., time, computation, stable storage) for checkpointing and recovery
Checkpointing requires
–Identifying consistent states
–The system to be halted / slowed down temporarily
Care must be taken in concurrent systems to avoid the orphan, lost-message and domino effects (will cover later in the lecture...)

58 58 Forward Error Recovery
Detect the error
Damage assessment
Build a new error-free state from which the system can continue execution
–Safe stop
–Degraded mode
–Error compensation, e.g., switching to a different component, etc.
[Figure: fault manifests, fault detected, damage assessment, state reconstruction]

59 59 Advantages of Forward Recovery
+ Efficient (time / memory)
–If the characteristics of the fault are well understood, forward recovery is a very efficient solution
+ Well suited for real-time applications
–Missed deadlines can be addressed
+ Anticipated faults can be dealt with in a timely way using redundancy

60 60 Disadvantages of Forward Recovery
–Application-specific
–Can only remove predictable errors from the system state
–Requires knowledge of the actual error
–Depends on the accuracy of error detection, potential damage prediction, and actual damage assessment
–Not usable if the system state is damaged beyond recoverability

61 61 Error Recovery
Save process state at predetermined (periodic) recovery points
–Called checkpoints
–Checkpoints stored on stable storage, not affected by same failures
Recover by rolling back to a previously saved (error-free) state
chkpt: complete set of (state) information needed to restart task execution from the checkpoint
[Figure: task progress of process P with checkpoints (chkpt); a transient X is caught by the acceptance test]

62 62 Logging Requests
request_no = 0;
...
for (int nxt_ckpt = 0;; nxt_ckpt--) {
    checkpoint(&nxt_ckpt);
    wait_for_request(Request);
    log_to_disk(++request_no, Request);
    process_request(Request);
}

63 63 Processing Log
request_no = 0;
...
for (int nxt_ckpt = 0;; nxt_ckpt--) {
    if (checkpoint(&nxt_ckpt) == recovery) {
        // pseudocode: replay every logged request that follows the checkpoint
        while ((request_no+1, R) in log) {
            process_request(R);
            request_no++;
        }
    }
    wait_for_request(Request);
    log_to_disk(++request_no, Request);
    process_request(Request);
}
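The replay loop can be sketched concretely under some simplifying assumptions: requests are single integers, processing a request means adding it to a running total, and the log is a text stream of "number payload" lines. The names `log_request` and `replay_log` are hypothetical stand-ins for log_to_disk and the while loop in the code.

```c
#include <assert.h>
#include <stdio.h>

/* Append request number `no` with its payload to the log. */
int log_request(FILE *log, long no, long payload)
{
    assert(log != NULL);
    if (fseek(log, 0, SEEK_END) != 0)
        return -1;
    if (fprintf(log, "%ld %ld\n", no, payload) < 0)
        return -1;
    return fflush(log) == 0 ? 0 : -1;
}

/* After restoring a checkpoint taken at *request_no, re-process every
 * logged request that follows it, in order. Returns how many requests
 * were replayed. */
int replay_log(FILE *log, long *request_no, long *state)
{
    assert(log != NULL && request_no != NULL && state != NULL);
    long no, payload;
    int replayed = 0;
    rewind(log);
    while (fscanf(log, "%ld %ld", &no, &payload) == 2) {
        if (no == *request_no + 1) {   /* next request after the checkpoint */
            *state += payload;         /* stand-in for process_request(R) */
            *request_no = no;
            replayed++;
        }
    }
    return replayed;
}
```

If requests 1 to 3 with payloads 10, 20, 30 are logged and the checkpoint was taken after request 1 (state 10), replay processes requests 2 and 3 and brings the state to 60.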

64 64 Problems: Lost Updates, corrupted saved states... not easy to fix!
State diverges from original computation
–results of a replayed request might be different; could detect this by keeping a log of replies
–new client requests might not be processed correctly, e.g., ids in requests might not make sense to the current server instance

65 65 Frequency vs Completeness
Less complete checkpoint
–higher probability that error is purged from saved state
–omitted state needs to be recomputed on recovery
Less frequent checkpointing
–checkpoint becomes larger
–state information becomes stale
–…
Application save is (in practice) very robust
–might not always contain all info (e.g., window position) for transparent restart

66 66 Effectiveness of Checkpointing

67 67 Distributed Systems: Checkpointing
So how does one place the checkpoints, and where?
Should we synchronize process-es(-ors) & checkpoints?
[Figure: processes P1, P2, P3]
Note: a system can be synchronous even though the message-based communication is still asynchronous!

68 68 Options for Checkpoint Storage?
Key building block: stable storage
–Persistent: survives the failure of the entity that created/initialized/used it
–Reliable: very low probability of losing or corrupting info
Implementation
–Typically non-volatile media (disks)
–Single disk? Often replicated/multiple volatile memories
–Make sure one replica at least always survives!

69 69 Options for Checkpoint Placement?
Uncoordinated: processes take checkpoints independently
–Pro: no delays
–Con: consistency?
Coordinated: have processes coordinate before taking a checkpoint
–Pro: globally consistent checkpoints
–Con: coordination delays
Communication-induced: checkpoint when receiving and prior to processing messages that may introduce conflicts

70 70 What happens when we don't synchronize?
Orphan messages and lost messages
[Figure: P1 and P2 with checkpoints C1 and C2, a message between them, and a fault X]
Rollback to C1 & C2 gives an inconsistent state

71 71 ...and more problems...
Domino effects
[Figure: P1 and P2; fault X]
* problems are fixable though require considerable pre-planning

72 72 P1 fails, recovers, rolls back to Ca
P2 finds it received message (mi) never sent, rolls back to Cb
P3 finds it received message (mj) never sent, rolls back to Cc
…
[Figure: P1, P2, P3 with checkpoints Ca, Cb, Cc, messages mi and mj, and the recovery line. Boom!]

73 73 Consistent Checkpoints: No orphans, lost msgs or dominos!
All messages sent ARE recorded with a consistent cut!
[Figure: P1, P2, P3 with a consistent cut]

74 74 Synchronizing Checkpoints (not the processors!)
Processes coordinate (synchronize) to set checkpoints guaranteed to be consistent
–2-Phase Consistent Checkpointing
Phase I: An initiator node X takes a tentative checkpoint and requests all other processes to set checkpoints. All processes inform X when they are willing to checkpoint.
Phase II: If all other processes are willing to checkpoint, then X decides to make its checkpoint permanent; otherwise X decides that all checkpoints shall be discarded. X informs all of the decision.
Either all or none take permanent checkpoints!
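The all-or-nothing decision of the two phases can be sketched as a single-process simulation. This deliberately leaves out messaging and failures of the initiator; the enum `ckpt_state` and the function `two_phase_checkpoint` are hypothetical names, and each process's willingness is simply given as an input array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ckpt_state { CKPT_NONE, CKPT_TENTATIVE, CKPT_PERMANENT };

/* Simulate one round for n processes. willing[i] says whether process i
 * is able to checkpoint. On return, ckpt[i] is CKPT_PERMANENT for all
 * processes or CKPT_NONE for all of them. Returns the decision. */
bool two_phase_checkpoint(const bool willing[], enum ckpt_state ckpt[], size_t n)
{
    assert(n > 0);
    bool all_willing = true;

    /* Phase I: willing processes take a tentative checkpoint and report. */
    for (size_t i = 0; i < n; i++) {
        ckpt[i] = willing[i] ? CKPT_TENTATIVE : CKPT_NONE;
        if (!willing[i])
            all_willing = false;
    }

    /* Phase II: the initiator's decision is applied by every process. */
    for (size_t i = 0; i < n; i++)
        ckpt[i] = all_willing ? CKPT_PERMANENT : CKPT_NONE;

    return all_willing;
}
```

A single unwilling process is enough for every tentative checkpoint to be discarded, which is what keeps the set of permanent checkpoints consistent.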

75 75 2-Phase Consistent Checkpoints
[Figure: processes X, R, S; X sends requests; {X1, R1, S1} are preliminary checkpoints; {X2, R2, S2} are consistent checkpoints]

76 76 Atomic Commitment and Window of Vulnerability
So far, recovery of actions that can be individually rolled back…
Better idea:
–Encapsulate actions in sequences that cannot be undone individually
–Atomic transactions provide this
–Properties: ACID
Atomicity: transaction is an indivisible unit of work
Consistency: transaction leaves system in correct state or aborts
Isolation: transaction's behavior not affected by other concurrent transactions (serializable)
Durability: transaction's effects are permanent after it commits

77 77 Atomic Commit (cont.)
To implement transactions, processes must coordinate!
–Bundling of related events
–Coordination between processes
One protocol: two-phase commit
[Figure: two-phase commit, Commit and Abort cases]
Q: can this somehow block?

78 78 Two-phase commit (cont.)
Problem: coordinator failure after PREPARE & before COMMIT blocks participants waiting for decision (a)
Three-phase commit overcomes this (b)
–delay final decision until enough processes know which decision will be taken

79 79 State Transfer
Reintegrating a failed component requires state transfer!
–If checkpoint/log to stable storage, recovering replica can do incremental transfer
  Recover first from last checkpoint
  Get further logs from active replicas
–Goal: minimal interference with remaining replicas
–Problem: state is being updated!
  Might result in incorrect state transfer (have to coordinate with ongoing messages)
  Might change such that the new replica can never catch up!
–Solution: give higher priority to state-transfer messages
Lots of variations…

