DEEDS SW Ageing and Rejuvenation


DEEDS SW Ageing and Rejuvenation

DEEDS SW Reliability
HW ages (physically)
– Failure rate $\lambda = 10^{-6}$ per hour (1 failure every million hours)
– $R(t) = e^{-\lambda t}$
What about SW?
– What is $\lambda$ for SW?
– Does SW physically age?
– …
[Figure: plot of R(t) against t]
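
As a rough worked example of the constant-failure-rate model on this slide, the sketch below evaluates R(t) for the quoted hardware rate. The function names and the sample mission times are illustrative assumptions, not part of the slides.

```python
import math

# Failure rate quoted on the slide: 1 failure every million hours.
LAMBDA_PER_HOUR = 1e-6


def reliability(t_hours, lam=LAMBDA_PER_HOUR):
    """Probability of surviving t_hours under a constant failure rate lam."""
    return math.exp(-lam * t_hours)


def mean_time_to_failure(lam=LAMBDA_PER_HOUR):
    """For a constant failure rate, MTTF is simply 1 / lam."""
    return 1.0 / lam


if __name__ == "__main__":
    # Sample mission times (hypothetical): one year, ten years, one MTTF.
    for label, hours in [("1 year", 8760), ("10 years", 87600),
                         ("one MTTF", mean_time_to_failure())]:
        print(f"{label:>9}: R = {reliability(hours):.4f}")
```

Under this model the survival probability at t = MTTF is e^-1, roughly 0.37, regardless of the value of λ. The slide's open question is whether any single λ describes software at all.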

DEEDS Error Coverage: SW Faults/Errors? Difficulties?
– Heisenbugs: run time … go away as state changes
– Bohrbugs: design level; predictably lead to failures
How does one capture relations such as:
– # of detected bugs vs. inferring # of actual bugs for relation with failure rate
– how do Heisenbugs get modeled from a reliability viewpoint?
– how do we capture memory leaks, additive imprecision, proc. specific aspects?
– …
– Design/Ops [Code Level]: incorrect initializations, pointer misdirection, buffer overflows, rounding errors, memory alloc./overlaps, data set and result exceptions/ranges, access levels (kernel, mem.)

DEEDS … Does SW code age or is it the SW operational environment that deteriorates?

DEEDS Operational Errors … degradation
OS resource exhaustion
– memory leaks, fragmentation, alloc/de-alloc cycles
– file descriptor leaks
– unreleased file locks
– process memory retention
– data corruption (database, kernel file descriptors, …)
– numerical error accumulation
– FS ageing
… eventually lead to performance degradation of the SW/OS or crash/hang type failures or both
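
To make "numerical error accumulation" concrete, here is a minimal sketch of a clock that adds a chopped fixed-point approximation of 0.1 s on every tick. The 23 fractional bits and the tick counts are illustrative assumptions chosen for the example; they are not taken from the slides.

```python
import math
from fractions import Fraction


def chopped_tick(frac_bits):
    """Return 1/10 chopped (rounded down) to frac_bits binary fraction bits."""
    scale = 2 ** frac_bits
    return Fraction(math.floor(Fraction(1, 10) * scale), scale)


def accumulated_drift_seconds(ticks, frac_bits):
    """Clock error after `ticks` increments of the chopped 0.1 s tick."""
    per_tick_error = Fraction(1, 10) - chopped_tick(frac_bits)
    return float(per_tick_error * ticks)


if __name__ == "__main__":
    ticks_per_hour = 36_000  # ten ticks per second
    for hours in (8, 100):
        drift = accumulated_drift_seconds(hours * ticks_per_hour, frac_bits=23)
        print(f"after {hours:>3} h of uptime: drift = {drift:.4f} s")
```

The per-tick error is tiny, but it grows linearly with uptime. A restart resets the accumulated drift to zero, which is why a periodic reboot works as a countermeasure for this class of ageing.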

DEEDS Resource Depletion: Apache Servers

DEEDS Error nuances
Two key observations from actual systems
1. 70% of SW errors are transient [Tandem, ATT]
2. Most (~90%) failures are caused by peak conditions in workload and timing [IBM]
Process Pairs, Checkpointing etc. (recovery mechanisms) “react” to errors and try to mask/correct them
Instead of “reacting”, can one take a “proactive” approach to avoid failures?

DEEDS SW Rejuvenation/Re-Generation
Proactive, preventive & pre-emptive rollback of continuously running applications to prevent performance degradations and failures (Heisenbugs)
Controlled shut-down to clean/refresh SW state to avoid uncontrolled shutdowns
Objective: Maximize up-time!
– free OS resources
– address resource depletions
– clean error accumulations, preventing failures
[defragmentation, garbage collection, kernel flushes, file lock cleanups, file server tables etc. … forced re-starts]
Works if failure rate is increasing over time
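
The last point, that rejuvenation only helps when the failure rate increases over time, can be checked numerically. The sketch below assumes a Weibull lifetime, and assumes rejuvenation is instantaneous and restores the as-new state. Under those assumptions the mean time between failures with a restart every T hours is the integral of R(t) over [0, T] divided by 1 - R(T). The Weibull choice and all parameter values are illustrative and do not come from the slides.

```python
import math


def survival(t, shape, scale):
    """Weibull survival function R(t); shape == 1 is the constant-rate case."""
    return math.exp(-((t / scale) ** shape))


def mtbf_with_rejuvenation(interval, shape, scale, steps=10_000):
    """MTBF when the software is renewed every `interval` hours.

    Assumes instantaneous, perfect rejuvenation (a simplification).
    Uses the trapezoidal rule for the integral of R(t) over [0, interval].
    """
    h = interval / steps
    area = 0.5 * (survival(0.0, shape, scale) + survival(interval, shape, scale))
    for i in range(1, steps):
        area += survival(i * h, shape, scale)
    area *= h
    return area / (1.0 - survival(interval, shape, scale))


if __name__ == "__main__":
    for shape in (1.0, 2.0):  # 1.0: constant rate, 2.0: increasing rate
        for interval in (100.0, 1000.0):
            mtbf = mtbf_with_rejuvenation(interval, shape, scale=1000.0)
            print(f"shape={shape}, restart every {interval:6.0f} h: MTBF = {mtbf:8.1f} h")
```

With shape 1 the MTBF stays at the scale value whatever the interval, so restarting buys nothing. With shape 2 a shorter interval gives a markedly longer MTBF.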

DEEDS Models
Static Model: Deterministic application – analytical basis
Dynamic: Transaction Based Server Systems
– transaction rates
– input buffering
– request handling discipline (FCFS?)
– SW failure rates (transients? sporadic? …)
– rejuvenation time/success criteria
… steady state availability, long run prob. of transaction loss etc.
[Figure: state-transition diagram. Labels: rejuvenate failure init rejuvenation complete repair]
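
One way to obtain a steady-state availability figure is a small continuous-time Markov chain. The sketch below assumes four states (robust, failure-prone, failed, rejuvenating) with exponential transition rates. The state names, the transition structure and all rate values are assumptions made for illustration; the slide's own diagram did not survive extraction.

```python
def steady_state(r_age, r_fail, r_repair, r_trigger, r_rejuv):
    """Steady-state probabilities of a four-state rejuvenation model.

    Assumed transitions (rates per hour):
      robust        -> failure_prone  (r_age)
      failure_prone -> failed         (r_fail)
      failure_prone -> rejuvenating   (r_trigger)
      failed        -> robust         (r_repair)
      rejuvenating  -> robust         (r_rejuv)
    """
    # Solve the balance equations relative to robust = 1, then normalise.
    prone = r_age / (r_fail + r_trigger)
    failed = r_fail * prone / r_repair
    rejuvenating = r_trigger * prone / r_rejuv
    total = 1.0 + prone + failed + rejuvenating
    return {
        "robust": 1.0 / total,
        "failure_prone": prone / total,
        "failed": failed / total,
        "rejuvenating": rejuvenating / total,
    }


def availability(p):
    """The service is up in the robust and failure-prone states."""
    return p["robust"] + p["failure_prone"]


if __name__ == "__main__":
    # Hypothetical rates: ageing after ~240 h, failure ~720 h after that,
    # 30 min repair, 10 min rejuvenation.
    for trigger in (0.0, 1 / 100):
        p = steady_state(1 / 240, 1 / 720, 2.0, trigger, 6.0)
        print(f"trigger rate {trigger:.3f}/h: availability = {availability(p):.6f}, "
              f"P(failed) = {p['failed']:.6f}")
```

In this toy model rejuvenation lowers the time spent in the failed state, but planned restarts are downtime too, so raw availability does not necessarily rise. The benefit shows up once unplanned downtime is weighted as more expensive than planned downtime.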

DEEDS Cost?
Depending on re-start instance, some transactions can be lost [better to lose a few known transactions than suffer a system failure]
– ideally, stable states exist to minimize transaction loss
– rate of recovery post-rejuvenation?
Downtime; reduced processor access during cleanup ops.
– cost also for handling queued messages, responsiveness, clean-up of memory data structures, re-spawning process from stable state, thread mgmt., out of order processor execution modes, …

DEEDS Rejuvenation vs Checkpointing?

DEEDS Rejuvenation and Checkpointing
Problem:
– Rejuvenation: loses complete state
– Checkpointing: saves all state
Need to find a balance between
– saving all state and
– saving no state

DEEDS When to rejuvenate?
What is an optimal rejuvenation trigger point? [keep in mind that while we would love to know “stable execution/transaction” locations, it is not easy at all to find these in a standard event driven system]
Optimal: w.r.t. minimizing failures, min. cost, min. missed transactions
– Static rejuvenation trigger estimation: periodic
– Dynamic/Load-Based/Predictive: time and load driven [load = most failures at high load …]

DEEDS Rejuvenation Policies
1. Periodic [Figure: timeline with labels Maintenance, Lost Transaction, T]
2. Dynamic [Figure: load against time, with labels threshold, Maintenance]
– Free Real Memory
– File Table Size
– Process Table Size
– Used Swap Space
– # disk requests
– # new process
– process size
– page out counter
– response time
3. Random?
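
The periodic and dynamic policies can be sketched as a single trigger check. Everything named here (the `Sample` and `Policy` fields, the threshold values, the function name) is hypothetical and only mirrors a few of the metrics the slide lists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """One monitoring snapshot (fields are illustrative)."""
    uptime_hours: float
    free_memory_mb: float
    used_swap_mb: float
    response_time_ms: float


@dataclass
class Policy:
    """Hypothetical trigger thresholds; tune per system."""
    period_hours: float = 100.0
    min_free_memory_mb: float = 200.0
    max_used_swap_mb: float = 1024.0
    max_response_time_ms: float = 500.0


def rejuvenation_reason(sample: Sample, policy: Policy) -> Optional[str]:
    """Return why rejuvenation should start now, or None to keep running."""
    if sample.uptime_hours >= policy.period_hours:
        return "periodic"            # policy 1: fixed interval T
    if sample.free_memory_mb < policy.min_free_memory_mb:
        return "low free memory"     # policy 2: resource thresholds
    if sample.used_swap_mb > policy.max_used_swap_mb:
        return "swap exhaustion"
    if sample.response_time_ms > policy.max_response_time_ms:
        return "slow responses"
    return None


if __name__ == "__main__":
    policy = Policy()
    print(rejuvenation_reason(Sample(12.0, 150.0, 300.0, 80.0), policy))
    print(rejuvenation_reason(Sample(12.0, 900.0, 300.0, 80.0), policy))
```

A real deployment would also defer the restart to a low-load moment, since the slides note that transactions in flight at restart time can be lost.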

DEEDS Example Cost Models: Cluster Servers
Mean time to node failure: 720 hrs
Mean time for node repair: 30 mins (spares available)
Mean time for system repair: 4 hrs
Mean time for rejuvenation: 10 mins
Cost (node failure): $5000/hr
Cost (rejuvenation): $250/hr !!!
– rejuvenation frequency, duration of rejuvenation etc.
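
A back-of-the-envelope reading of these numbers: the sketch below compares the cost of one node failure with the cost of one rejuvenation. It assumes the failure cost is charged over the 30-minute node repair time and the rejuvenation cost over the 10-minute rejuvenation time. That pairing is an interpretation of the slide, not something it states.

```python
# Figures from the slide.
NODE_REPAIR_HOURS = 30 / 60
REJUVENATION_HOURS = 10 / 60
FAILURE_COST_PER_HOUR = 5000.0
REJUVENATION_COST_PER_HOUR = 250.0


def cost_per_failure():
    """Assumed: failure cost accrues for the duration of a node repair."""
    return FAILURE_COST_PER_HOUR * NODE_REPAIR_HOURS


def cost_per_rejuvenation():
    """Assumed: rejuvenation cost accrues for the duration of one rejuvenation."""
    return REJUVENATION_COST_PER_HOUR * REJUVENATION_HOURS


def break_even_rejuvenations():
    """Number of rejuvenations that cost as much as a single node failure."""
    return cost_per_failure() / cost_per_rejuvenation()


if __name__ == "__main__":
    print(f"one node failure:  ${cost_per_failure():.2f}")
    print(f"one rejuvenation:  ${cost_per_rejuvenation():.2f}")
    print(f"break-even ratio:  {break_even_rejuvenations():.1f} rejuvenations per failure")
```

Under these assumptions rejuvenation pays off as long as fewer than about 60 rejuvenations are spent per failure avoided. This is why rejuvenation frequency and duration are the parameters the slide singles out.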

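DEEDS Misc Experiences
[Two tables, largely lost in extraction. Columns: Rejuv. Interval, Cost, Downtime. Configurations: 8 Nodes/1 Spare and 8 Nodes/2 Spares. Surviving values: 100 hrs rejuvenation interval in both tables; 1 hrs, 5 hrs and .5, column unclear]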

DEEDS Ageing?
Modeling/Monitoring factors to put in dynamic/predictive models:
– Free Real Memory
– File Table Size
– Process Table Size
– Used Swap Space
– Load
– …
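
One simple way to feed such a factor into a predictive model is to fit a trend and extrapolate to exhaustion. The sketch below fits an ordinary least-squares line to free-memory samples. The linear-trend assumption, the function names and the sample data are all illustrative; real ageing trends may be noisy or non-linear.

```python
def fit_line(times, values):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def hours_until_exhaustion(times, free_values, floor=0.0):
    """Estimate hours from the last sample until the resource hits `floor`.

    Returns None when the fitted trend is flat or rising (no depletion seen).
    """
    slope, intercept = fit_line(times, free_values)
    if slope >= 0:
        return None
    time_at_floor = (floor - intercept) / slope
    return max(0.0, time_at_floor - times[-1])


if __name__ == "__main__":
    hours = [0, 1, 2, 3]                  # sample times (hypothetical)
    free_mb = [1000, 900, 800, 700]       # free real memory (hypothetical)
    print(hours_until_exhaustion(hours, free_mb))
```

The estimate can then drive a predictive trigger: schedule rejuvenation some safety margin before the projected exhaustion time.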

DEEDS Rejuvenation Types/Granularities
On-Line
– Periodic
– Predictive
Level 1 (Partial) Rejuvenation
– Restart Service (only when stopping of service saves state)
Level 2 (Full) Rejuvenation
– OS reboot
Graceful service/state/node failover, re-start, upgrades etc.

DEEDS Examples
– Apache Servers: refresh (swap space exhaustion)
– NASA probes (X200, Advanced Flight Systems): state cleanup, preventive maintenance
– Patriot missiles: 8 hr reboot (guidance error accumulation)
– Win 9x servers; IIS 5.0 process recycling (file system ageing; ghost pointers)
– IBM Director SW Rejuvenation
IBM touts new smaller servers
But hardware is only responsible for a third as many problems as software, … To address software problems, IBM is adding "software rejuvenation" features that accommodate problems with Windows 2000 and Windows NT. Those operating systems gradually become less stable as computing tasks that are supposed to be finished don't quite let go of all the resources that they used, such as memory. For that reason, many organizations periodically restart their Windows machines.
Right now, software rejuvenation means simply a feature that lets Windows computers automatically restart themselves periodically. In the fourth quarter, the servers will monitor the operating system as well as several popular server software packages to judge more intelligently when they need to be rebooted.

DEEDS Refs
– Software Rejuvenation: Analysis, Module and Applications. Y. Huang, C. Kintala, N. Kolettis and N. Fulton, Proc. of FTCS-25, 1995
– Analysis of SW Rejuvenation using Markov Regenerative Stochastic Petri Nets. S. Garg, …, K. Trivedi, Proc. of ISSRE 1995
– Analysis of SW Cost Models with Rejuvenation. T. Dohi, …, K. Trivedi, Proc. HASE 2000