The Role of RAS-Models in the Design and Evaluation of Self-Healing Systems Rean Griffith, Ritika Virmani, Gail Kaiser Programming Systems Lab (PSL) Columbia.

Slides:

Advertisements

Similar presentations

1 Tools and Techniques for Designing and Evaluating Self-Healing Systems Rean Griffith, Ritika Virmani, Gail Kaiser Programming Systems Lab (PSL) Columbia.

Advertisements

1 Dynamic Adaptation of Temporal Event Correlation Rules Rean Griffith‡, Gail Kaiser‡ Joseph Hellerstein*, Yixin Diao* Presented by Rean Griffith

Designing and Developing Decision Support Systems Chapter 4.

Module – 9 Introduction to Business continuity

Business Continuity Section 3(chapter 8) BC:ISMDR:BEIT:VIII:chap8:Madhu N PIIT1.

Ira Cohen, Jeffrey S. Chase et al.

Availability in Globally Distributed Storage Systems

© 2009 EMC Corporation. All rights reserved. Introduction to Business Continuity Module 3.1.

CS 795 – Spring  “Software Systems are increasingly Situated in dynamic, mission critical settings ◦ Operational profile is dynamic, and depends.

7-1 INTRODUCTION: SoA Introduced SoA in Chapter 6 Service-oriented architecture (SoA) - perspective that focuses on the development, use, and reuse of.

Objektorienteret Middleware Presentation 2: Distributed Systems – A brush up, and relations to Middleware, Heterogeneity & Transparency.

Making Services Fault Tolerant

Chapter 19: Network Management Business Data Communications, 4e.

ManageEngine TM Applications Manager 8 Monitoring Custom Applications.

University of Kansas Construction & Integration of Distributed Systems Jerry James Oct. 30, 2000.

1 Software Testing and Quality Assurance Lecture 30 – Testing Systems.

Enhancing Paper Mill Performance through Advanced Diagnostics Services Dr. BS Babji Process Automation Division ABB India Ltd, Bangalore IPPTA Zonal Seminar.

Fault Prediction and Software Aging

1 Evaluating Computing Systems Using Fault-Injection and RAS Metrics and Models Rean Griffith Thesis Proposal February 28 th 2007.

System Engineering Instructor: Dr. Jerry Gao. System Engineering Jerry Gao, Ph.D. Jan System Engineering Hierarchy - System Modeling - Information.

Failure Avoidance through Fault Prediction Based on Synthetic Transactions Mohammed Shatnawi 1, 2 Matei Ripeanu 2 1 – Microsoft Online Ads, Microsoft Corporation.

SIMULATING ERRORS IN WEB SERVICES International Journal of Simulation: Systems, Sciences and Technology 2004 Nik Looker, Malcolm Munro and Jie Xu.

1 RAS-Models: A Building Block for Self-Healing Benchmarks Rean Griffith, Ritika Virmani, Gail Kaiser Programming Systems Lab (PSL) Columbia University.

Understanding Network Failures in Data Centers: Measurement, Analysis and Implications Phillipa Gill University of Toronto Navendu Jain & Nachiappan Nagappan.

MCTS Guide to Microsoft Windows Server 2008 Network Infrastructure Configuration Chapter 11 Managing and Monitoring a Windows Server 2008 Network.

PROJECT PRESENTATION Prof: Daniel Amyot Presented By… ANVESH ALUWALA GURPREET SINGH DHADDA Evaluation of Load Testing Tools WebLOAD Professional Vs NeoLoad.

1 Chapter Overview Introduction to Windows XP Professional Printing Setting Up Network Printers Connecting to Network Printers Configuring Network Printers.

Microsoft ® Official Course Monitoring and Troubleshooting Custom SharePoint Solutions SharePoint Practice Microsoft SharePoint 2013.

©Ian Sommerville 2004Software Engineering, 7th edition. Chapter 27 Slide 1 Quality Management 1.

Software Faults and Fault Injection Models --Raviteja Varanasi.

ATIF MEHMOOD MALIK KASHIF SIDDIQUE Improving dependability of Cloud Computing with Fault Tolerance and High Availability.

1 Autonomic Computing An Introduction Guenter Kickinger.

1 BTEC HNC Systems Support Castle College 2007/8 Systems Analysis Lecture 9 Introduction to Design.

CPIS 357 Software Quality & Testing

Microsoft ® Official Course Module 10 Optimizing and Maintaining Windows ® 8 Client Computers.

Software Reliability SEG3202 N. El Kadri.

TRƯỜNG ĐẠI HỌC CÔNG NGHỆ Bộ môn Mạng và Truyền Thông Máy Tính.

Module 7: Fundamentals of Administering Windows Server 2008.

This chapter is extracted from Sommerville’s slides. Text book chapter

Advanced Computer Networks Topic 2: Characterization of Distributed Systems.

Module 15 Managing Windows Server® 2008 Backup and Restore.

OSIsoft High Availability PI Replication

1 Dynamic Emulation and Fault- Injection using Dyninst Presented by: Rean Griffith Programming Systems Lab (PSL)

Module 12: Responding to Security Incidents. Overview Introduction to Auditing and Incident Response Designing an Audit Policy Designing an Incident Response.

Module 9 Planning and Implementing Monitoring and Maintenance.

Survey of Tools to Support Safe Adaptation with Validation Alain Esteva-Ramirez School of Computing and Information Sciences Florida International University.

Test Plan: Introduction o Primary focus: developer testing –Implementation phase –Release testing –Maintenance and enhancement o Secondary focus: formal.

Configuring Debugging as Search: Finding the Needle in the Haystack Andrew Whitaker, Richard S. Cox and Steven D. Gribble. University of Washington Presented.

1 Manipulating Managed Execution Runtimes to support Self-Healing Systems Rean Griffith‡, Gail Kaiser‡ Presented by Rean Griffith

Improving the Reliability of Commodity Operating Systems Michael M. Swift, Brian N. Bershad, Henry M. Levy Presented by Ya-Yun Lo EECS 582 – W161.

Improving Product Realization Through Intelligent Information Management S.K. Gupta Mechanical Engineering Department and Institute for Systems Research.

Introduction to Performance Testing Performance testing is the process of determining the speed or effectiveness of a computer, network, software program.

Problem On a regular basis we use: –Java applets –JavaScript –ActiveX –Shockwave Notion of ubiquitous computing.

1 Chapter 2: Operating-System Structures Services Interface provided to users & programmers –System calls (programmer access) –User level access to system.

Cofax Scalability Document Version Scaling Cofax in General The scalability of Cofax is directly related to the system software, hardware and network.

OSIsoft High Availability PI Replication Colin Breck, PI Server Team Dave Oda, PI SDK Team.

Sun Tech Talk 3 Solaris 10 and OpenSolaris Pierre de Filippis Sun Campus Evangelist

Chapter 19: Network Management

Software Metrics and Reliability

Software Verification and Validation

OVERVIEW Impact of Modelling and simulation in Mechatronics system

Presented by Munezero Immaculee Joselyne PhD in Software Engineering

Maximum Availability Architecture Enterprise Technology Centre.

Rean Griffith‡, Gail Kaiser‡ Presented by Rean Griffith

Fault Tolerance Distributed Web-based Systems

Automated Analysis and Code Generation for Domain-Specific Models

Agenda The current Windows XP and Windows XP Desktop situation

Reliable Web Services: Methodology, Experiment and Modeling International Conference on Web Services (ICWS 2007) Pat. P. W. Chan, Michael R. Lyu Department.

Presentation transcript:

The Role of RAS-Models in the Design and Evaluation of Self-Healing Systems Rean Griffith, Ritika Virmani, Gail Kaiser Programming Systems Lab (PSL) Columbia University SOAS 2007 Leipzig, Germany September 26 th 2007 Presented by Rean Griffith

Overview Introduction Challenges Problem Hypothesis Experiments Conclusion & Future Work

Introduction A self-healing system “…automatically detects, diagnoses and repairs localized software and hardware problems” – The Vision of Autonomic Computing 2003 IEEE Computer Society

Challenges How do we evaluate our progress towards realizing self-healing systems?  How do we quantify the impact of the problems these systems should resolve? (Baseline)  How do we reason about expected benefits for systems currently lacking self-healing mechanisms?  How do we quantify the efficacy of individual and combined self-healing mechanisms and reason about tradeoffs?  How do we identify sub-optimal mechanisms?

Problem Evaluating self-healing systems and their mechanisms is non-trivial  Studying the failure behavior of systems can be difficult  Finding fault-injection tools that exercise the remediation mechanisms available is difficult  Multiple styles of healing to consider (reactive, preventative, proactive)  Accounting for imperfect repair scenarios  Partially automated repairs are possible

Hypotheses Reliability, Availability and Serviceability provide reasonable evaluation metrics Combining practical fault-injection tools with mathematical modeling techniques provides the foundation for a feasible and flexible methodology for evaluating and comparing the reliability, availability and serviceability (RAS) characteristics of computing systems

Objective To inject faults into the components a multi- component n-tier web application  Specifically the application server and Operating System components Observe its responses and the responses of any remediation mechanisms available Model and evaluate available mechanisms Identify weaknesses

Experiment Setup Target: 3-Tier Web Application TPC-W Web-application Resin Web-server and (Java) Application Server Sun Hotspot JVM v1.5 MySQL Linux Remote Browser Emulation clients to simulate user loads

Practical Fault-Injection Tools Kheiron/JVM (ICAC 2006)  Uses bytecode rewriting to inject faults into running Java applications  Faults include: memory leaks, hangs, delays etc.  Two other versions of Kheiron exist (CLR & C) Nooks Device-Driver Fault-Injection Tools  Uses the kernel module interface on Linux (2.4 and now 2.6) to inject device driver faults  Faults include: text faults, stack faults, hangs etc.

Healing Mechanisms Available Application Server  Automatic restarts Operating System  Nooks device driver protection framework  Manual system reboot

Mathematical Modeling Techniques Continuous Time Markov Chains (CTMCs)  Limiting/steady-state availability  Yearly downtime  Repair success rates (fault-coverage)  Repair times Markov Reward Networks  Downtime costs (time, money, #service visits etc.)  SLA penalty-avoidance

Resin Memory-Leak Handler Analysis Analyzing perfect recovery e.g. mechanisms addressing resource leaks/fatal crashes  S 0 – UP state, system working  S 1 – DOWN state, system restarting  λ failure = 1 every 8 hours  µ restart = 47 seconds Attaching a value to each state allows us to evaluate the cost/time impact associated with these failures. Results: Steady state availability: % Downtime per year: 866 minutes

Linux w/Nooks Recovery Analysis Analyzing imperfect recovery e.g. device driver recovery using Nooks  S 0 – UP state, system working  S 1 – UP state, recovering failed driver  S 2 – DOWN state, system reboot  λ driver_failure = 4 faults every 8 hrs  µ nooks_recovery = 4,093 mu seconds  µ reboot = 82 seconds  c – coverage factor/success rate

Resin + Linux + Nooks Analysis Composing Markov chains  S 0 – UP state, system working  S 1 – UP state, recovering failed driver  S 2 – DOWN state, system reboot  S 3 – DOWN state, Resin reboot  λ driver_failure = 4 faults every 8 hrs  µ nooks_recovery = 4,093 mu seconds  µ reboot = 82 seconds  c – coverage factor  λ memory_leak_ = 1 every 8 hours  µ restart_resin = 47 seconds Max availability = % Min downtime = 866 minutes

Benefits of CTMCs + Fault Injection Able to model and analyze different styles of self- healing mechanisms Quantifies the impact of mechanism details (success rates, recovery times etc.) on the system’s operational constraints (availability, production targets, production-delay reduction etc.)  Engineering view AND Business view Able to identify under-performing mechanisms Useful at design time as well as post-production Able to control the fault-rates

Caveats of CTMCs + Fault-Injection CTMCs may not always be the “right” tool  Constant hazard-rate assumption May under or overstate the effects/impacts True distribution of faults may be different  Fault-independence assumptions Limited to analyzing near-coincident faults Not suitable for analyzing cascading faults (can we model the precipitating event as an approximation?) Some failures are harder to replicate/induce than others  Better data on faults could improve fault-injection tools Getting detailed breakdown of types/rates of failures  More data should improve the fault-injection experiments and relevance of the results

Real-World Downtime Data* Mean incidents of unplanned downtime in a year: (n-tier web applications) Mean cost of unplanned downtime (Lost productivity #IT Hours):  2115 hrs ( hour work-weeks) Mean cost of unplanned downtime (Lost productivity #Non-IT Hours):  hrs** ( hour work-weeks) * “IT Ops Research Report: Downtime and Other Top Concerns,” StackSafe. July (Web survey of 400 IT professional panelists, US Only) ** "Revive Systems Buyer Behavior Research," Research Edge, Inc. June 2007

Quick Analysis – End User View Unplanned Downtime (Lost productivity Non-IT hrs) per year: hrs (30,942 minutes). Is this good? (94.11% Availability) Less than two 9’s of availability  Decreasing the down time by an order of magnitude could improve system availability by two orders of magnitude

Proposed Data-Driven Evaluation (7U) 1. Gather failure data and specify fault-model 2. Establish fault-remediation relationship 3. Select/create fault-injection tools to mimic faults in 1 4. Identify Macro-measurements  Identify environmental constraints governing system- operation (availability, production targets etc.) 5. Identify Micro-measurements  Identify metrics related to specifics of self-healing mechanisms (success rates, recovery time, fault-coverage) 6. Run fault-injection experiments and record observed behavior 7. Construct pre-experiment and post-experiment models

The 7U-Evaluation Method

Conclusions Dynamic instrumentation and fault-injection lets us transparently collect data “in-situ” and replicate problems “in-vivo” The CTMC-models are flexible enough to quantitatively analyze various styles and “impacts” of repairs We can use them at design-time or post- deployment time The math is the “easy” part compared to getting customer data on failures, outages, and their impacts.  These details are critical to defining the notions of “better” and “good” for these systems

Future Work More experiments on an expanded set of operating systems using more server-applications  Linux 2.6  OpenSolaris 10  Windows XP SP2/Windows 2003 Server Modeling and analyzing other self-healing mechanisms  Error Virtualization (From STEM to SEAD, Locasto et. al Usenix 2007)  Self-Healing in OpenSolaris 10 Feedback control for policy-driven repair- mechanism selection

Questions, Comments, Queries? Thank you for your time and attention For more information contact: Rean Griffith

Extra slides

Proposed Preventative Maintenance Non-Birth-Death process with 6 states, 6 parameters:  S 0 – UP state, first stage of lifetime  S 1 – UP state, second stage of lifetime  S 2 – DOWN state, Resin reboot  S 3 – UP state, inspecting memory use  S 4 – UP state, inspecting memory use  S 5 – DOWN state, preventative restart  λ 2ndstage = 1/6 hrs  λ failure = 1/2 hrs  µ restart_resin_worst = 47 seconds  λ inspect = Memory use inspection rate  µ inspect = 21,627 microseconds  µ restart_resin_pm = 3 seconds

Kheiron/JVM Operation SampleMethod Bytecode Method body SampleMethod Bytecode Method body _SampleMethodSampleMethod New Bytecode Method Body Call _Sample Method _SampleMethod Bytecode Method body A BC Prepare Shadow Create Shadow SampleMethod( args ) [throws NullPointerException] push args call _SampleMethod( args ) [throws NullPointerException] { try{…} catch (IOException ioe){…} } // Source view of _SampleMethod’s body return value/void

Application Server Memory Leaks Memory leak condition causing an automatic application server restart every hours (95% confidence interval)