DEEDS SW Reliability
HW ages (physically): failure rate λ = 10⁻⁶ per hour (1 failure every million hours), giving R(t) = e^(−λt)
What about SW?
– What is λ for SW?
– Does SW physically age?
– …
[Figure: reliability curve R(t) decaying over time t]
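The exponential reliability model above can be evaluated directly. A minimal Python sketch, using the failure rate quoted on the slide:

```python
import math

# Constant-failure-rate HW reliability model from the slide:
# lambda = 1e-6 failures/hour, R(t) = exp(-lambda * t).
LAMBDA = 1e-6  # failures per hour

def reliability(t_hours: float, lam: float = LAMBDA) -> float:
    """R(t) = exp(-lambda * t): probability of surviving until time t."""
    return math.exp(-lam * t_hours)

# At t = 1/lambda (one million hours, the MTTF), R drops to 1/e ~ 0.37.
print(round(reliability(1e6), 3))  # → 0.368
```

Note that for software no such λ is readily defined, which is exactly the question the slide raises.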
Error Coverage: SW Faults/Errors? Difficulties?
Heisenbugs: run-time bugs that go away as state changes
Bohrbugs: design-level bugs that predictably lead to failures
How does one capture relations such as:
– # of detected bugs vs. inferring # of actual bugs, for relation with the failure rate
– how do Heisenbugs get modeled from a reliability viewpoint?
– how do we capture memory leaks, additive imprecision, processor-specific aspects?
– …
Design/Ops [Code Level]: incorrect initializations, pointer misdirection, buffer overflows, rounding errors, memory allocation/overlaps, data set and result exceptions/ranges, access levels (kernel, memory)
… Does SW code age, or is it the SW operational environment that deteriorates?
Operational Errors … degradation
OS resource exhaustion:
– memory leaks, fragmentation, alloc/de-alloc cycles
– file descriptor leaks
– unreleased file locks
– process memory retention
– data corruption (database, kernel file descriptors, …)
– numerical error accumulation
– FS ageing
… eventually leading to performance degradation of the SW/OS, crash/hang-type failures, or both
Error nuances
Two key observations from actual systems:
1. ~70% of SW errors are transient [Tandem, AT&T]
2. Most (~90%) failures are caused by peak conditions in workload and timing [IBM]
Process pairs, checkpointing, etc. (recovery mechanisms) "react" to errors and try to mask/correct them.
Instead of "reacting", can one take a "proactive" approach to avoid failures?
SW Rejuvenation/Re-Generation
Proactive, preventive & pre-emptive rollback of continuously running applications to prevent performance degradation and failures (Heisenbugs)
Controlled shutdown to clean/refresh SW state, avoiding uncontrolled shutdowns
Objective: maximize up-time!
– free OS resources
– address resource depletions
– clean error accumulations before they cause failures
[defragmentation, garbage collection, kernel flushes, file lock cleanups, file server tables, etc. … forced re-starts]
Works if the failure rate is increasing over time
Models
Static model: deterministic application – analytical basis
Dynamic: transaction-based server systems
– transaction rates
– input buffering
– request handling discipline (FCFS?)
– SW failure rates (transients? sporadic? …)
– rejuvenation time/success criteria
… steady-state availability, long-run probability of transaction loss, etc.
[Figure: state model with states init, rejuvenate, rejuvenation complete, failure, repair]
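As a concrete illustration of such a state model, a small continuous-time Markov chain in the spirit of the four-state rejuvenation model of Huang et al. (see Refs) can be solved for steady-state availability. All state names and rates below are assumptions for illustration, not values from the slide:

```python
import numpy as np

# Sketch of a 4-state rejuvenation CTMC (rates in 1/hour, illustrative):
# state 0: robust, 1: failure-probable, 2: failed, 3: rejuvenating
r_degrade = 1 / 240    # robust -> failure-probable
r_fail    = 1 / 720    # failure-probable -> failed
r_repair  = 2.0        # failed -> robust (30-minute mean repair)
r_rejuv   = 1 / 336    # failure-probable -> rejuvenating (weekly trigger)
r_done    = 6.0        # rejuvenating -> robust (10-minute mean)

# Generator matrix Q (rows sum to zero).
Q = np.array([
    [-r_degrade,           r_degrade,            0.0,       0.0    ],
    [ 0.0,       -(r_fail + r_rejuv),            r_fail,    r_rejuv],
    [ r_repair,            0.0,                 -r_repair,  0.0    ],
    [ r_done,              0.0,                  0.0,      -r_done ],
])

# Solve pi Q = 0 subject to sum(pi) = 1 (least squares on the stacked system).
A = np.vstack([Q.T, np.ones(4)])
b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

availability = pi[0] + pi[1]   # the system is up in states 0 and 1
print(f"steady-state availability ~ {availability:.5f}")
```

Varying `r_rejuv` in this sketch is one way to study the trade-off between rejuvenation overhead and avoided failures.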
Cost?
Depending on the re-start instance, some transactions can be lost [better to lose a few known transactions than suffer a system failure]
– ideally, stable states exist to minimize transaction loss
– rate of recovery post-rejuvenation?
Downtime; reduced processor access during cleanup operations
– cost also for handling queued messages, responsiveness, clean-up of in-memory data structures, re-spawning processes from a stable state, thread management, out-of-order processor execution modes, …
Rejuvenation and Checkpointing
Problem:
– rejuvenation: loses complete state
– checkpointing: saves all state
Need to find a balance between
– saving all state and
– saving no state
When to rejuvenate?
What is an optimal rejuvenation trigger point? [Keep in mind that while we would love to know "stable execution/transaction" locations, these are not at all easy to find in a standard event-driven system]
Optimal w.r.t.: minimizing failures, minimizing cost, minimizing missed transactions
Static rejuvenation trigger estimation: periodic
Dynamic/load-based/predictive: time- and load-driven [load: most failures occur at high load …]
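A static (periodic) trigger can be estimated numerically as a classic age-replacement problem: pick the interval T that minimizes expected cost per hour. The sketch below assumes an increasing-failure-rate Weibull lifetime and illustrative costs; none of these values come from the slide:

```python
import math

# Assumptions (illustrative): Weibull time-to-failure with shape k > 1,
# i.e. an increasing failure rate (the system "ages"), which is exactly
# the regime where rejuvenation pays off.
k, scale = 2.0, 1000.0          # Weibull shape / scale (hours)
c_fail, c_rejuv = 5000.0, 250.0  # cost of unplanned failure vs. planned rejuvenation

def F(t: float) -> float:
    """Weibull failure CDF."""
    return 1.0 - math.exp(-(t / scale) ** k)

def cost_rate(T: float, n: int = 2000) -> float:
    """Expected cost per hour when rejuvenating every T hours.

    Cycle cost: c_fail if the system fails before T, else c_rejuv.
    Expected cycle length: integral of the survival function over [0, T]
    (left Riemann sum).
    """
    dt = T / n
    expected_uptime = sum((1.0 - F(i * dt)) * dt for i in range(n))
    return (c_fail * F(T) + c_rejuv * (1.0 - F(T))) / expected_uptime

# Scan candidate intervals and keep the cheapest one.
best_T = min(range(50, 2001, 50), key=cost_rate)
print(best_T)
```

With these numbers the scan lands well below the Weibull scale: frequent cheap rejuvenations beat rare expensive failures, which mirrors the cost asymmetry on the next slide.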
Rejuvenation Policies
1. Periodic (scheduled maintenance on a load threshold; lost transactions T per maintenance)
– Free Real Memory
– File Table Size
– Process Table Size
– Used Swap Space
– # disk requests
– # new processes
2. Dynamic
– process size
– page-out counter
– response time
3. Random?
[Figure: load vs. time, with periodic maintenance windows and lost transactions T]
Example Cost Models: Cluster Servers
Mean time to node failure: 720 hrs
Mean time for node repair: 30 mins (spares available)
Mean time for system repair: 4 hrs
Mean time for rejuvenation: 10 mins
Cost (node failure): $5000/hr
Cost (rejuvenation): $250/hr !!!
– rejuvenation frequency, duration of rejuvenation, etc.
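A back-of-the-envelope comparison with the numbers above. The sketch (optimistically) assumes periodic rejuvenation prevents node failures entirely; a real model would also weigh the residual failure probability:

```python
# Parameters taken from the slide.
mttf_h        = 720.0        # mean time to node failure (hours)
repair_h      = 4.0          # system repair time (hours)
rejuv_h       = 10.0 / 60.0  # 10-minute rejuvenation (hours)
cost_fail_hr  = 5000.0       # $/hour during failure repair
cost_rejuv_hr = 250.0        # $/hour during rejuvenation

# Reactive operation: one 4-hour system repair every MTTF hours.
cost_per_failure = repair_h * cost_fail_hr          # $20,000 per failure
reactive_rate = cost_per_failure / mttf_h           # ~$27.8 per hour

def proactive_rate(T_h: float) -> float:
    """$/hour if we rejuvenate every T_h hours (assumed to avert failures)."""
    return (rejuv_h * cost_rejuv_hr) / T_h

print(reactive_rate, proactive_rate(168.0))  # weekly rejuvenation: ~$0.25/hr
```

Even under far less optimistic assumptions, the two-orders-of-magnitude gap explains the slide's "!!!".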
Ageing?
Modeling/monitoring factors to put in dynamic/predictive models:
– Free Real Memory
– File Table Size
– Process Table Size
– Used Swap Space
– Load
– …
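One simple predictive trigger fits a linear trend to such a monitored indicator and extrapolates to exhaustion. The indicator choice (free real memory) comes from the list above; the sample data and numbers are invented for illustration:

```python
# Sketch: least-squares linear fit over (hour, free-memory-MB) samples,
# then extrapolate to estimate when free memory hits zero.
samples = [(0, 900.0), (6, 860.0), (12, 815.0), (18, 774.0), (24, 730.0)]

n = len(samples)
sx  = sum(t for t, _ in samples)
sy  = sum(m for _, m in samples)
sxx = sum(t * t for t, _ in samples)
sxy = sum(t * m for t, m in samples)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # MB/hour; negative = leak
intercept = (sy - slope * sx) / n                    # fitted free MB at t = 0

if slope < 0:
    hours_to_exhaustion = -intercept / slope         # where the fit crosses 0 MB
    print(f"predicted exhaustion in ~{hours_to_exhaustion:.0f} h")
```

A real monitor would trigger rejuvenation when the predicted exhaustion time falls inside the next maintenance window, rather than waiting for the resource to actually run out.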
Rejuvenation Types/Granularities
On-line
– periodic
– predictive
Level 1 (partial) rejuvenation
– restart the service (only where stopping the service preserves the needed state)
Level 2 (full) rejuvenation
– OS reboot
Graceful service/state/node failover, re-start, upgrades, etc.
Examples
Apache servers – refresh (swap space exhaustion)
NASA probes (X200, Advanced Flight Systems) – state cleanup, preventive maintenance
Patriot missiles: 8-hr reboot (guidance error accumulation)
Win 9x servers; IIS 5.0 process recycling (file system ageing; ghost pointers)
IBM Director SW Rejuvenation:
"IBM touts new smaller servers … But hardware is only responsible for a third as many problems as software … To address software problems, IBM is adding 'software rejuvenation' features that accommodate problems with Windows 2000 and Windows NT. Those operating systems gradually become less stable as computing tasks that are supposed to be finished don't quite let go of all the resources that they used, such as memory. For that reason, many organizations periodically restart their Windows machines. Right now, software rejuvenation means simply a feature that lets Windows computers automatically restart themselves periodically. In the fourth quarter, the servers will monitor the operating system as well as several popular server software packages to judge more intelligently when they need to be rebooted."
Refs
– T. Huang, C. Kintala, N. Kolettis and N. Fulton, "Software Rejuvenation: Analysis, Module and Applications", Proc. FTCS-25, 1995
– S. Garg, …, K. Trivedi, "Analysis of SW Rejuvenation using Markov Regenerative Stochastic Petri Nets", Proc. ISSRE 1995
– T. Dohi, …, K. Trivedi, "Analysis of SW Cost Models with Rejuvenation", Proc. HASE 2000