CompSci 296.2 Self-Managing Systems Shivnath Babu.


1 CompSci 296.2 Self-Managing Systems Shivnath Babu

2 Today
Wrap up sample projects
ROC discussion

3 Sample Projects
NIMO
Fa
Combining structured & unstructured data
Projects using Nagios
Projects using IBM Autonomic Computing Toolkit

4 NIMO: NonInvasive Modeling for Optimization
Build performance models for scientific apps
  – Automatic, online, and noninvasive
Projects
  – Study many scientific apps (e.g., 140 bio apps in BioPortal) → characterize behavior, good models
  – “Steal app”, build and refine model
  – Incorporate NIMO in a “grid” scheduler (Condor, Globus)
  – Optimization problems in scheduling workflows

5 Fa
Testbed to study:
  – Whether we can automate problem prediction, diagnosis
  – Relationship among problems, causes, data, & models
Projects
  – Models for predicting performance problems (online)
  – Models and mechanisms for root-cause queries
  – Others

6 Structured and Unstructured Data
Combined querying/mining of structured and unstructured system data
  – Structured data: time series of CPU utilization
  – Unstructured data (free text): System error log
Ex: Characterize system state when a specific error occurs
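As an illustration of the "Ex" item above, here is a minimal sketch of combining the two kinds of data. Everything in it is assumed for illustration: the sample values, the log lines, and the helper names `cpu_at` and `characterize` are hypothetical and do not come from the course material.

```python
from bisect import bisect_right
from statistics import mean

# Hypothetical structured data: (timestamp in seconds, CPU utilization in %)
cpu_series = [(0, 20.0), (10, 25.0), (20, 90.0), (30, 95.0), (40, 30.0), (50, 92.0)]

# Hypothetical unstructured data: (timestamp in seconds, free-text log line)
error_log = [
    (22, "ERROR disk queue full on /dev/sda"),
    (41, "WARN retrying connection"),
    (52, "ERROR disk queue full on /dev/sdb"),
]

def cpu_at(ts, series):
    """Return the most recent CPU sample at or before ts, or None if there is none."""
    times = [t for t, _ in series]
    i = bisect_right(times, ts)
    return series[i - 1][1] if i > 0 else None

def characterize(pattern, log, series):
    """Mean CPU utilization at the moments when log lines containing pattern occur."""
    samples = [cpu_at(ts, series) for ts, line in log if pattern in line]
    samples = [s for s in samples if s is not None]
    return mean(samples) if samples else None

overall = mean(v for _, v in cpu_series)
during_error = characterize("disk queue full", error_log, cpu_series)
print(f"mean CPU overall: {overall:.1f}%, when error occurs: {during_error:.1f}%")
```

A substring match stands in for real text mining here; the point is only that the free-text log selects the time points and the time series describes the system state at those points.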

7 Add New Features to Current Systems
Add problem-prediction capability to Nagios
Add root-cause querying to Nagios
Similar projects using the IBM Autonomic Computing Toolkit + ABLE framework
Remember the “mechanism projects”
  – Undo, virtualization, active probing

8 ROC: Recovery-Oriented Computing
Complaints about current systems
  – Focus only on performance → Availability & maintainability are neglected
  – Focus on MTTF of individual components → MTTR neglected
  – MTTF of system << MTTF of individual components
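The last complaint can be made concrete with a small calculation. This is a sketch under assumptions the slide does not state: the system fails as soon as any one component fails (a series system), and component failures are independent with constant failure rates, so the rates (1/MTTF) add. The function name `system_mttf` and the disk numbers are hypothetical.

```python
def system_mttf(component_mttfs):
    """MTTF of a series system, assuming independent constant failure rates.

    Each component fails at rate 1/MTTF; the system fails at the sum of those rates.
    """
    total_rate = sum(1.0 / m for m in component_mttfs)
    return 1.0 / total_rate

# Hypothetical system: 1000 disks, each with an MTTF of 1,000,000 hours
disks = [1_000_000.0] * 1000
print(f"system MTTF: {system_mttf(disks):.0f} hours")
```

Under these assumptions, a thousand components that each last on the order of a century give a system that fails roughly every six weeks.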

9 ROC Philosophy
“If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time” - Shimon Peres (“Peres’s Law”)
People/HW/SW failures are facts, not problems
Recovery/repair is how we cope with the above facts
ROC focus is on fast repair
Vs. old focus on longer time between failures
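One way to see why fast repair is a reasonable focus is the standard steady-state availability formula, availability = MTTF / (MTTF + MTTR). The formula is not on the slide, and the numbers below are hypothetical; the sketch only shows that, under this formula, cutting MTTR by 10x buys the same availability as stretching MTTF by 10x.

```python
def availability(mttf, mttr):
    """Steady-state availability: fraction of time the system is up."""
    return mttf / (mttf + mttr)

# Hypothetical baseline: fails every 1000 hours, takes 10 hours to repair
base = availability(1000.0, 10.0)

# Old focus: make failures 10x rarer
longer_mttf = availability(10000.0, 10.0)

# ROC focus: make repair 10x faster
faster_repair = availability(1000.0, 1.0)

print(f"baseline:      {base:.6f}")
print(f"10x MTTF:      {longer_mttf:.6f}")
print(f"10x less MTTR: {faster_repair:.6f}")
```

Availability depends only on the ratio MTTR/MTTF, so the two improvements are interchangeable in this model; which one is cheaper to achieve is a separate question.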

10 ROC Principles
Recovery experiments: benchmarking recovery
Pinpoint: Automatic problem diagnosis
Recursive restart: Innovative use of reboot
App and system undo
Defense in depth: ROC at hardware level
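The recursive restart idea can be sketched roughly as follows: restart the smallest component suspected of the fault, and if the system is still unhealthy, escalate to the enclosing component. This is a simplified illustration, not the ROC project's implementation; the `Component` class, the component names, and the `is_healthy` callback are all hypothetical.

```python
class Component:
    """A node in a hypothetical restart tree; restarting a node restarts its subtree."""

    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

def recursive_restart(component, is_healthy):
    """Restart component, escalating to ancestors while is_healthy(node) is False.

    is_healthy(node) is an assumed callback that reports whether the system is
    healthy after the subtree rooted at node has been restarted.
    Returns the names of the subtree roots that were restarted, in order.
    """
    restarted = []
    node = component
    while node is not None:
        restarted.append(node.name)  # stand-in for actually restarting the subtree
        if is_healthy(node):
            break
        node = node.parent
    return restarted

# Hypothetical tree: system -> (web, app_server -> (session_cache, db_pool))
db_pool = Component("db_pool")
session_cache = Component("session_cache")
app_server = Component("app_server", [session_cache, db_pool])
web = Component("web")
system = Component("system", [web, app_server])

# Assume the fault shows up in db_pool but is only cured by restarting app_server
cured_at = {"app_server", "system"}
print(recursive_restart(db_pool, lambda node: node.name in cured_at))
```

The design choice worth noting is that the cheapest restart is tried first, so a full reboot happens only when the smaller restarts fail to cure the problem.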

11 Discussion
Strong point: Comprehensive, relates to other fields
Margin of safety for systems
  – Current examples?
  – How to incorporate?
Negative point: Evolution vs. revolution?
  – What approach is the project taking?
At what level should we support Undo?
  – Transaction, application, system
  – Pros and cons
Benchmarking availability/recovery (TOC?)
  – How can you claim that a system is 99.999% available?
Dealing with the automation irony
  – Fire drills
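For the 99.999% question, it helps to translate an availability claim into a downtime budget. The sketch below is plain arithmetic, assuming a 365-day year; the function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years

def downtime_minutes_per_year(availability):
    """Maximum downtime per year that is consistent with the claimed availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for a in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{a:.5f} available -> {downtime_minutes_per_year(a):.2f} minutes of downtime per year")
```

"Five nines" leaves a little over five minutes of downtime per year, so a single slow recovery can use up the whole budget, which is part of why the claim is hard to substantiate by observation alone.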

