CSE 598B: Self-* Systems Path Based Failure and Evolution Management Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox,

Presentation transcript:

CSE 598B: Self-* Systems Path Based Failure and Evolution Management Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, Eric Brewer (UC Berkeley, Stanford U, Tellme Networks, eBay Inc.) Presented by: Arjun R. Nath

The Problem
• Computing systems are increasing in complexity.
• They are tending towards large, complex, distributed systems.
• Sometimes thousands of machines are involved.
• Basic system management is becoming increasingly difficult.
• Everything from detecting and diagnosing failures to understanding application behaviour is becoming very difficult.

The Problem (continued)
• Existing techniques such as code-level debuggers, program slicing, process profiling and application logs fail to characterize overall system behaviour.
• Distributed debuggers are available, but they focus on a homogeneous subset of the system.

Goal of the paper
• Techniques to help us understand large distributed systems.
• Improve:
  – availability
  – reliability
  – manageability
• Why are we looking at this paper? (Self-* context)
  – This paper is about techniques for monitoring large, complex, distributed systems.

Two main principles
• Path-Based Measurement:
  – Model the system as a collection of paths through heterogeneous components.
  – Make local observations along the paths and store them. The stored observations can be accessed via queries and visualization techniques. (The focus is on correctness rather than performance.)
• Statistical Behaviour Analysis:
  – Large volumes of system requests are stored for statistical analysis, using classical techniques to identify deviations from normal behaviour. This can be applied to live systems or used for offline analysis.

What is a "Path"?
• Associated with a request
• Control flow
• Resources
• Paths may have inter-path dependencies: shared state, shared database tables, shared filesystems, shared memory.
• Multiple paths may be grouped together in sessions.
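To make the definition concrete, here is a minimal sketch of how a path could be represented as a data structure. The class and field names (`Observation`, `Path`, `latency_ms`, `session_id`) are illustrative assumptions, not taken from the paper or from any of its tools.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Observation:
    """One local observation made while a request passes through a component."""
    component: str      # hypothetical name of a server, bean, or database table
    timestamp: float    # when the observation was recorded, in seconds
    latency_ms: float   # time spent in this component
    ok: bool = True     # False if the component reported an error


@dataclass
class Path:
    """All observations that belong to a single request."""
    request_id: str
    session_id: Optional[str] = None   # several paths may share one session
    observations: List[Observation] = field(default_factory=list)

    def components(self) -> List[str]:
        """Components in the order the request visited them (the control flow)."""
        ordered = sorted(self.observations, key=lambda o: o.timestamp)
        return [o.component for o in ordered]

    def failed(self) -> bool:
        """A path counts as failed if any observation along it failed."""
        return any(not o.ok for o in self.observations)


def shared_components(a: Path, b: Path) -> Set[str]:
    """Components used by both paths: a simple stand-in for inter-path dependencies."""
    return set(a.components()) & set(b.components())
```

Two paths that both touch the same database table would show up in `shared_components`, which is one simple way to expose the inter-path dependencies listed on the slide.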

Coarse grained paths

Fine grained paths

How do paths help?
• Failure management
• Evolution (of the system)

Failure Management
• Detection:
  – Reduce the downtime associated with detection delays.
  – Paths can help in noticing developing problems before they become severe.
  – The key is to define "normal" behaviour statistically and then check for deviations.
• Diagnosis:
  – Isolate problems using solely the recorded path observations, then drive the diagnosis process with the path information.
  – Paths help identify which components are involved in a given failure and aid in identifying causes.
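The detection idea (define "normal" statistically, then check for deviations) can be sketched as follows. The slides do not name a specific test, so the z-score rule and the threshold of 3 standard deviations used here are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_anomalous(baseline, value, threshold=3.0):
    """Flag `value` if it deviates from the baseline mean by more than
    `threshold` sample standard deviations (a simple z-score rule).

    `baseline` is a list of at least two measurements taken during
    normal operation, for example per-path latencies in milliseconds.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A constant baseline: any different value is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Usage: latencies of one path shape observed during normal operation.
normal_latencies = [100, 102, 98, 101, 99]
print(is_anomalous(normal_latencies, 101))  # within normal variation
print(is_anomalous(normal_latencies, 150))  # far outside the baseline
```

The same check could be run per path shape or per component, on a live stream of observations or offline on stored paths.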

Failure Management (continued)
• Impact Analysis:
  – Helps in knowing the scale of the problem, and therefore in estimating time-to-repair.
  – Shows which other paths are at risk.
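Diagnosis and impact analysis from recorded paths alone can be sketched as below. This is a simplified illustration under stated assumptions: each path is reduced to a list of component names plus a failed flag, and suspects are ranked by a plain failure rate, which is not necessarily the statistic the paper's tools use. The function names are hypothetical.

```python
from collections import Counter


def rank_suspects(paths):
    """Rank components by the fraction of paths through them that failed.

    `paths` is a list of (components, failed) tuples, where `components`
    is a list of component names and `failed` is a bool.
    Returns (component, failure_rate) pairs, highest rate first.
    """
    used = Counter()
    failed = Counter()
    for components, is_failed in paths:
        for c in set(components):
            used[c] += 1
            if is_failed:
                failed[c] += 1
    rates = ((c, failed[c] / used[c]) for c in used)
    # Sort by descending failure rate, then by name for a stable order.
    return sorted(rates, key=lambda pair: (-pair[1], pair[0]))


def paths_at_risk(paths, suspect):
    """Paths that use the suspect component but have not failed so far."""
    return [components for components, is_failed in paths
            if suspect in components and not is_failed]


recorded = [
    (["web", "app", "db1"], True),
    (["web", "app", "db2"], False),
    (["web", "cache"], False),
    (["web", "app", "db1"], True),
]
print(rank_suspects(recorded)[0])      # the most suspicious component
print(paths_at_risk(recorded, "app"))  # healthy paths that share "app"
```

Counting the failed paths gives a rough measure of the scale of the problem, and `paths_at_risk` lists the requests that share a suspect component with the failures.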

Evolution (of the system)
• It is very difficult to get an overall picture of how a complex distributed system changes with time:
  – Software/hardware upgrades, patches, code changes, etc.
  – Systems evolve through changes to their components and also through changes in how those components interact.
• Paths help in revealing system structure and dependencies and in tracking changes.

Implementation

Implementation: Architecture

Implementation (continued)
• Tracers track a request through the target system.
  – Each request is assigned an identifier that is maintained along the whole path.
  – IDs may be stored in extensible headers (HTTP, SOAP).
  – Tracers are platform-specific but can be generic across applications that use the same platform (J2EE, .NET).
• Pinpoint, ObsLogs and SuperCal all have tracers.
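The tracer mechanism described above can be sketched as follows. The header name `X-Request-Id` and the function names are hypothetical choices for this example; the slides only say that IDs may be carried in extensible headers such as HTTP or SOAP headers.

```python
import time
import uuid

TRACE_HEADER = "X-Request-Id"  # hypothetical header name, not from the paper


def ensure_request_id(headers):
    """Return the request's ID, creating one at the first component.

    `headers` is a mutable dict standing in for the request's HTTP headers.
    Downstream components find the ID already present and reuse it.
    """
    if TRACE_HEADER not in headers:
        headers[TRACE_HEADER] = uuid.uuid4().hex
    return headers[TRACE_HEADER]


def trace(headers, component, log):
    """Record one observation for `component`, tagged with the request ID."""
    request_id = ensure_request_id(headers)
    log.append({
        "request_id": request_id,
        "component": component,
        "timestamp": time.time(),
    })
    return request_id


# Usage: the same headers travel with the request from tier to tier.
headers, log = {}, []
trace(headers, "web", log)
trace(headers, "app", log)
print(len({obs["request_id"] for obs in log}))  # both observations share one ID
```

Because the ID travels with the request, each component only needs to make local observations; joining them into a path happens later.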

Implementation: tools
Three systems that support path-based analysis

Implementation (continued)
• Aggregator and Repository
  – The aggregator receives observations from the tracers,
  – reconstructs paths using the request IDs,
  – and stores the paths in the repository.
  – There may also be a central repository that collects from distributed repositories.
• Analysis Engines and Visualization
  – Single-path and multi-path analysis
  – Dedicated engines for various statistical tests
  – Support for some data mining tools
  – Visualization: Tukey’s boxplots generated using Octave
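The aggregator's path reconstruction step can be sketched as a group-by on the request ID. This is a minimal illustration that assumes each observation is a dict with `request_id`, `component` and `timestamp` keys (a hypothetical format), and that timestamps from different machines are comparable, which a real deployment would have to ensure.

```python
from collections import defaultdict


def reconstruct_paths(observations):
    """Group observations by request ID and order each group by time.

    Returns a dict mapping request ID to the list of components the
    request visited, in order.
    """
    by_id = defaultdict(list)
    for obs in observations:
        by_id[obs["request_id"]].append(obs)
    return {
        request_id: [o["component"]
                     for o in sorted(group, key=lambda o: o["timestamp"])]
        for request_id, group in by_id.items()
    }


# Usage: observations arrive interleaved and out of order from the tracers.
incoming = [
    {"request_id": "a", "component": "db", "timestamp": 3.0},
    {"request_id": "b", "component": "web", "timestamp": 1.5},
    {"request_id": "a", "component": "web", "timestamp": 1.0},
    {"request_id": "a", "component": "app", "timestamp": 2.0},
]
print(reconstruct_paths(incoming))
```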

Implementation (continued)
A trend specific to recognition time in Tellme application A suggests a regression in a speech grammar in that application. The Tukey boxplots shown illustrate a distribution’s center, spread, and asymmetries by using rectangles to show the upper and lower quartiles and the median, and by explicitly plotting each outlier.
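The quantities a Tukey boxplot displays can be computed with a few lines of code. The slides mention Octave for the actual plots; this sketch uses Python's standard library instead and assumes the conventional 1.5 × IQR fences for deciding which points count as outliers. The function name and the sample values are made up for illustration.

```python
from statistics import quantiles


def tukey_summary(values, k=1.5):
    """Quartiles, median, and outliers as drawn in a Tukey boxplot.

    Points further than k * IQR below the lower quartile or above the
    upper quartile are reported as outliers.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Usage: hypothetical recognition times with one extreme value.
summary = tukey_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary["median"], summary["outliers"])
```

Comparing such summaries across releases or time windows is one way to spot the kind of trend the slide describes.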

Limitations and constraints
• Cannot resolve fault causes at a very detailed level.
• Overheads can be high for fine-grained paths.
• Need to decide which observations to include in paths. This is an iterative process.
• Can be difficult to implement, especially for existing systems.

It is important to understand that path-based analysis is an aid to fault detection and recovery, not a solution in itself. It is meant to be used in combination with traditional fault-handling techniques.

Conclusion
• As systems get more complex, path-based analysis tools will become increasingly important.
• Path-based fault analysis complements traditional techniques.
• Hardly any fully functional, path-based fault management tools are available.
• This paper:
  – Has breadth but lacks depth in some places.
  – Needs more data from experiments in production environments.
  – Should have concentrated on one or two implementations and included more details.
  – Gives little information on SuperCal and ObsLogs.

Other related stuff
• “Pinpoint” project at Stanford (some interesting papers here)
• Magpie project (Microsoft)
• Quest Software: JProbe, a Java performance profiler
• Borland's Optimizeit Enterprise Suite

That’s all, folks. Thank you.