Published by Clarence Merritt. Modified over 9 years ago.
1
Software Rejuvenation: Analysis, Module and Applications
Yennun Huang, Chandra Kintala, Nick Kolettis, N. Dudley Fulton, Chris L. Del Checcolo
2
For the lecture
What is software rejuvenation?
Why is it needed?
Probabilistic state transition models
Examples
3
What is Software Rejuvenation?
The act of gracefully terminating an application and immediately restarting it.
Goal: prevent unexpected error termination by terminating the program before it suffers an error.
4
Intended Use
Software rejuvenation is primarily intended for servers whose applications are expected to run indefinitely without failure.
5
Why do applications fail?
Process aging: the gradual degradation of application performance over time, which may lead to premature program termination.
6
Causes
Memory leaks
Unreleased file locks
File descriptor leaks
Etc.
7
Fine, so just fix the program!
Not always possible: bugs may lie in the application, in the libraries that the application uses, or in the OS itself (except for Windows, of course).
8
Reactive vs. Proactive
Some strategies involve watching to see whether an application fails and then restarting it. While that is not a bad solution, we want the ability to prevent errors so as to minimize any collateral damage.
9
Software Rejuvenation
Periodic preemptive rollback of continuously running applications to prevent failures in the future.
10
[Figure: Transition model for software without rejuvenation]
[Figure: Transition model for software with rejuvenation]
11
Downtime and cost without rejuvenation
P_f = 1 / (1 + r_1/λ + r_1/r_2)
Downtime_w/o_r(L) = P_f * L
Cost_w/o_r(L) = P_f * L * c_f
12
Downtime and cost with rejuvenation
P_p = 1 / (1 + λ/r_1 + r_4/r_3 + (λ + r_4)/r_2)
P_f = (λ/r_1) * P_p
P_r = (r_4/r_3) * P_p
P_0 = ((λ + r_4)/r_2) * P_p
Downtime_w_r(L) = (P_f + P_r) * L
Cost_w_r(L) = (P_f * c_f + P_r * c_r) * L
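The transition diagrams are not preserved in this transcript, so the following is a sketch under an assumed model: a continuous-time Markov chain with transitions S_0 → S_p at rate r_2, S_p → S_f at rate λ, S_p → S_r at rate r_4, S_f → S_0 at rate r_1, and S_r → S_0 at rate r_3. Under that assumption, the steady-state probabilities come from balancing the flow into and out of each state and normalizing:

```latex
\begin{align*}
r_2 P_0 &= (\lambda + r_4)\, P_p \\
r_1 P_f &= \lambda\, P_p \\
r_3 P_r &= r_4\, P_p \\
P_0 + P_p + P_f + P_r &= 1 \\
\Rightarrow\quad P_p &= \frac{1}{1 + \dfrac{\lambda}{r_1} + \dfrac{r_4}{r_3} + \dfrac{\lambda + r_4}{r_2}}
\end{align*}
```

Setting r_4 = 0 removes the state S_r and gives the model without rejuvenation, where P_f = (λ/r_1) P_p = 1 / (1 + r_1/λ + r_1/r_2).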
13
Thresholds - Goal
The goal is to stay in S_0 for the longest amount of time.
14
Thresholds cont.
To see how r_4 affects downtime and cost, let's differentiate the previous equations with respect to r_4.
15
Thresholds cont.
Downtime:
If r_3 is dominant, the derivative becomes negative and downtime decreases as r_4 increases, so rejuvenate as soon as the system enters state S_p.
If r_3 is small (slow recovery from S_r), downtime increases as r_4 increases.
16
Thresholds cont.
Cost:
When c_r is dominant, cost increases as r_4 increases, which implies no benefit from rejuvenation.
When c_r is small, cost decreases as r_4 increases.
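To make "dominant" and "small" concrete, the derivatives can be worked out explicitly. This is a sketch that assumes the four-state chain S_0 → S_p (rate r_2), S_p → S_f (rate λ), S_p → S_r (rate r_4), S_f → S_0 (rate r_1), S_r → S_0 (rate r_3), with D standing for the normalizing denominator of the steady-state probabilities:

```latex
\begin{align*}
D &= 1 + \frac{\lambda}{r_1} + \frac{r_4}{r_3} + \frac{\lambda + r_4}{r_2} \\[4pt]
\frac{d}{dr_4}\,\mathrm{Downtime}(L)
  &= \frac{L}{D^2}\cdot\frac{r_1 (r_2 + \lambda) - \lambda r_3}{r_1 r_2 r_3} \\[4pt]
\frac{d}{dr_4}\,\mathrm{Cost}(L)
  &= \frac{L}{D^2}\left[\frac{c_r}{r_3}\left(1 + \frac{\lambda}{r_1} + \frac{\lambda}{r_2}\right)
     - \frac{\lambda c_f}{r_1}\left(\frac{1}{r_3} + \frac{1}{r_2}\right)\right]
\end{align*}
```

In both cases r_4 appears only in the positive factor L/D², so the sign of each derivative does not depend on r_4. Downtime falls with r_4 exactly when r_3 > r_1 (1 + r_2/λ), and cost falls with r_4 exactly when the bracketed term is negative.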
18
I swear this is the last thresholds slide cont.
Overall, costs need to be calculated for individual programs.
For best results: either rejuvenate as soon as the system enters state S_p (r_4 = ∞) or don’t rejuvenate at all (r_4 = 0).
19
Example 1
MTBF = 12 months; λ = 1/(12*30*24)
Takes 30 min to recover from an unexpected error; r_1 = 2
Base longevity is seven days; r_2 = 1/(7*24)
If rejuvenation is performed, mean repair time after rejuvenation is 20 minutes; r_3 = 3
Average cost of unscheduled downtime due to failure, c_f, is $1,000/hour
Average cost of scheduled downtime during rejuvenation, c_r, is $40/hour
20
Schedule | r_4 | Hours of downtime | Cost of downtime ($)
No rejuvenation | 0 | 0.490 | 490
Once every three weeks | 1/(2*7*24) | 5.965 | 554
Once every two weeks | 1/(1*7*24) | 8.727 | 586
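The table above can be recomputed from the Example 1 parameters. The sketch below rests on two assumptions that the transcript does not state: the four-state chain S_0 → S_p (r_2), S_p → S_f (λ), S_p → S_r (r_4), S_f → S_0 (r_1), S_r → S_0 (r_3), and an observation interval L of 12*30*24 hours. The function and variable names are illustrative only.

```python
def steady_state(lam, r1, r2, r3, r4):
    """Return (P0, Pp, Pf, Pr) for the assumed four-state chain."""
    pp = 1.0 / (1.0 + lam / r1 + r4 / r3 + (lam + r4) / r2)
    return ((lam + r4) / r2 * pp, pp, lam / r1 * pp, r4 / r3 * pp)


def downtime_and_cost(lam, r1, r2, r3, r4, c_f, c_r, hours):
    """Expected downtime (hours) and cost ($) over an interval of `hours`."""
    _, _, pf, pr = steady_state(lam, r1, r2, r3, r4)
    return (pf + pr) * hours, (pf * c_f + pr * c_r) * hours


HOURS = 12 * 30 * 24  # assumed interval L: one year of 30-day months

# Example 1 parameters; all rates are per hour.
EXAMPLE_1 = dict(lam=1 / (12 * 30 * 24), r1=2, r2=1 / (7 * 24), r3=3,
                 c_f=1000, c_r=40)

SCHEDULES = {
    "No rejuvenation": 0.0,
    "Once every three weeks": 1 / (2 * 7 * 24),
    "Once every two weeks": 1 / (1 * 7 * 24),
}

for label, r4 in SCHEDULES.items():
    hours_down, cost = downtime_and_cost(r4=r4, hours=HOURS, **EXAMPLE_1)
    print(f"{label}: {hours_down:.3f} hours, ${cost:.0f}")
```

Under these assumptions the computed values agree with the table to within a few thousandths of an hour and less than one dollar, which suggests that the assumed interval and transition structure match the ones used on the slide.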
21
Example 2
MTBF = 3 months; λ = 1/(3*30*24)
Takes 30 min to recover from an unexpected error; r_1 = 2
Base longevity is three days; r_2 = 1/(3*24)
If rejuvenation is performed, mean repair time after rejuvenation is 10 minutes; r_3 = 6
Average cost of unscheduled downtime due to failure, c_f, is $5,000/hour
Average cost of scheduled downtime during rejuvenation, c_r, is $5/hour
22
Schedule | r_4 | Hours of downtime | Cost of downtime ($)
No rejuvenation | 0 | 1.94 | 9675.25
Once every two weeks | 1/(11*24) | 5.70 | 7672.43
Once every week | 1/(4*24) | 9.52 | 5643.31
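Examples 1 and 2 point in different directions: in Example 1 both downtime and cost rise with r_4, while in Example 2 downtime rises but cost falls. That direction can be checked without computing a full table, because differentiating the steady-state downtime and cost with respect to r_4 gives expressions whose sign does not depend on r_4. The sketch below assumes the four-state chain S_0 → S_p (r_2), S_p → S_f (λ), S_p → S_r (r_4), S_f → S_0 (r_1), S_r → S_0 (r_3); the function name is hypothetical.

```python
def rejuvenation_effect(lam, r1, r2, r3, c_f, c_r):
    """Return (downtime_decreases, cost_decreases) as r4 increases.

    Each flag is the sign of the corresponding derivative with respect
    to r4 under the assumed four-state model. The positive common
    factor L / D**2 is dropped because it does not affect the sign.
    """
    downtime_slope = r1 * (r2 + lam) - lam * r3
    cost_slope = ((c_r / r3) * (1 + lam / r1 + lam / r2)
                  - (lam * c_f / r1) * (1 / r3 + 1 / r2))
    return downtime_slope < 0, cost_slope < 0


# Example 1: neither downtime nor cost improves with rejuvenation.
example_1 = rejuvenation_effect(lam=1 / (12 * 30 * 24), r1=2, r2=1 / (7 * 24),
                                r3=3, c_f=1000, c_r=40)

# Example 2: downtime still rises, but cost falls with rejuvenation.
example_2 = rejuvenation_effect(lam=1 / (3 * 30 * 24), r1=2, r2=1 / (3 * 24),
                                r3=6, c_f=5000, c_r=5)

print(example_1, example_2)
```

Because neither sign depends on r_4, each quantity is monotone in r_4, so its best value sits at one of the two extremes rather than at an intermediate rejuvenation rate.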
23
Example 3
MTBF = 3 months; λ = 1/(3*30*24)
Takes 2 hours to recover from an unexpected error; r_1 = 0.5
Base longevity is 10 days; r_2 = 1/(10*24)
If rejuvenation is performed, mean repair time after rejuvenation is 10 minutes; r_3 = 6
Average cost of unscheduled downtime due to failure, c_f, is $5,000/hour
Average cost of scheduled downtime during rejuvenation, c_r, is $5/hour
24
Schedule | r_4 | Hours of downtime | Cost of downtime ($)
No rejuvenation | 0 | 7.19 | 3.6k
Once every month | 1/(20*24) | 6.83 | 2.48k
Once every two weeks | 1/(4*24) | 6.36 | 1.11k
25
Implementation
Implementing software rejuvenation is fairly easy.
Cron jobs can be set to restart the application at set intervals.
watchd can be used to detect whether applications have failed and restart them.
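The same idea can be sketched in a few lines of Python. This is a stand-in for illustration, not the interface of cron or watchd: the function name, its parameters, and the restart policy are all assumptions. It starts the application, waits for the rejuvenation interval, asks the process to terminate, and starts it again; if the application exits on its own before the interval is up, it is simply restarted.

```python
import subprocess
import time


def run_with_rejuvenation(cmd, interval_s, grace_s=10.0, cycles=None):
    """Run `cmd`, restarting it every `interval_s` seconds.

    Hypothetical helper: `cycles` limits the number of start/stop rounds
    (None means run forever). Returns the number of completed rounds.
    """
    completed = 0
    while cycles is None or completed < cycles:
        started = time.monotonic()
        proc = subprocess.Popen(cmd)
        try:
            # Returns early if the application exits (or fails) by itself.
            proc.wait(timeout=interval_s)
        except subprocess.TimeoutExpired:
            proc.terminate()  # polite request to stop (SIGTERM on POSIX)
            try:
                proc.wait(timeout=grace_s)
            except subprocess.TimeoutExpired:
                proc.kill()  # force the stop if the grace period runs out
                proc.wait()
        completed += 1
        print(f"round {completed}: ran {time.monotonic() - started:.1f}s, "
              f"exit code {proc.returncode}")
    return completed
```

A call such as `run_with_rejuvenation(["./server"], interval_s=7 * 24 * 3600)` would restart a hypothetical `./server` binary once a week. A graceful restart also depends on the application handling the termination request cleanly, for example by flushing its state before exiting.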
26
Real World Examples
BILL-DATS II Collector
Billing collection system used by the AT&T long-distance network
Set to rejuvenate after 1 week
Hasn’t prematurely failed after several years
27
“S” Scientific
Speech synthesis system
Long-running scientific application
Used to process several hundred sentences over the course of many days
Found to fail after 100 sentences
Rejuvenates after 15 sentences
28
Conclusions:
The decision to use software rejuvenation depends on predetermined failure rates and the associated costs.
r_4 = 0: no rejuvenation
r_4 = ∞: rejuvenation