CompSci 296.2 Self-Managing Systems Shivnath Babu.


1 CompSci 296.2 Self-Managing Systems Shivnath Babu

2 Today
Some current work in self-managing systems: ideas & resources for projects
– IBM
– ROC (discussion deferred to next class)
– Our projects at Duke
– HP

3 Project
Group size <= 2
Identify “general topic” by end of January, meet Shivnath
Feb 7: Scope problem and give 15-minute talk
Feb 21: 3-minute talk
March 7: 15-minute talk
March 28: 3-minute talk
April 4/6: 15-minute talk
April 20/24: 15-minute final in-class presentation (+ “demo”)

4 Work on Self-Managing Systems
IBM
– IBM Journal, Volume 42, Number 1, 2003
– Autonomic computing home page
– IBM autonomic home: library, demos
– Autonomic computing toolkit
– IBM Tivoli

5 Work on Self-Managing Systems
Berkeley-Stanford ROC project
– Reading for this class
– Interesting source of project ideas and source code
– Sample project reports/presentations (follow the CS444A/294-4 link)

6 The past: research goals and assumptions of last 15 years
Goal #1: Improve performance
Goal #2: Improve performance
Goal #3: Improve cost-performance

7 New research goals for a New Century: ACME
Availability
Changeability
– support rapid deployment of new software, apps, UI
Maintainability
– reduce burden on system administrators
– provide helpful, forgiving SysAdmin environments
Evolutionary Growth
– allow easy system expansion over time
Also Security/Privacy

8 Recovery-Oriented Computing (ROC) Philosophy
“If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time” (Shimon Peres, “Peres’s Law”)
People/HW/SW failures are facts, not problems
Recovery/repair is how we cope with the above facts
Since a major SysAdmin job is recovery after failure, ROC also helps with maintenance/TCO
ROC focus is on fast repair vs. the old focus on longer time between failures

9 An Example Project in ROC
Undo functionality for system administrators (useful for self-managing components as well)
– To recover from human errors
– To recover from failed operations like software upgrades, installs, and configuration updates
An interesting mechanism project for self-healing

10 Mechanism Projects
Required/useful mechanisms for self-managing systems
Take a goal related to self-managing (e.g., self-optimization, predicting problems) and a system (e.g., a database): what mechanisms are needed? Will current mechanisms suffice?
Ex: Data collection
– nonintrusive, distributed, “active probing”

11 Our Projects at Duke
QueS: Querying Systems (as data)
– Better tools for system administrators and self-managing system components
CoD: Cluster on Demand
– Allocate virtual clusters to applications on demand

12 Querying Systems as Data
[Diagram labels: WAN, Clients, Web server, Application servers, Database servers]

13 Querying Systems as Data
[Diagram labels: WAN, Clients, Web server, Application servers, Database servers, WAN]

14 Querying Systems as Data
Root-cause query: What are probable causes of the Service-Level-Agreement (SLA) violations rising to 12%?
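A root-cause query of this kind could be approximated, in its simplest form, by ranking candidate conditions by how much more often SLA violations occur when each condition holds. The sketch below is only an illustration and is not the QueS method: the record layout, the condition names (`db_cache_miss_high`, `app_cpu_high`), the sample observations, and the lift-based score are all assumptions introduced here.

```python
def rank_probable_causes(records, target="sla_violated"):
    """Rank boolean conditions by lift: P(violation | condition) / P(violation).

    `records` is a list of dicts mapping hypothetical condition names to
    booleans, plus the boolean `target` key. Returns a list of tuples
    (condition, violation_rate_when_true, lift, support), highest lift first.
    """
    if not records:
        return []
    base_rate = sum(1 for r in records if r[target]) / len(records)
    conditions = sorted({key for r in records for key in r if key != target})
    ranked = []
    for cond in conditions:
        matching = [r for r in records if r.get(cond)]
        if not matching:
            continue  # condition never observed, nothing to estimate
        rate = sum(1 for r in matching if r[target]) / len(matching)
        lift = rate / base_rate if base_rate > 0 else 0.0
        ranked.append((cond, rate, lift, len(matching)))
    ranked.sort(key=lambda item: item[2], reverse=True)
    return ranked


# Made-up observations, one per request window (assumed data, not from the slides).
observations = [
    {"db_cache_miss_high": True, "app_cpu_high": False, "sla_violated": True},
    {"db_cache_miss_high": True, "app_cpu_high": True, "sla_violated": True},
    {"db_cache_miss_high": False, "app_cpu_high": True, "sla_violated": False},
    {"db_cache_miss_high": False, "app_cpu_high": False, "sla_violated": False},
    {"db_cache_miss_high": True, "app_cpu_high": False, "sla_violated": False},
]

for cond, rate, lift, support in rank_probable_causes(observations):
    print(f"{cond}: violation rate {rate:.2f}, lift {lift:.2f}, support {support}")
```

The support count is returned alongside each score so that a ranking based on very few observations can be treated with appropriate caution.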

15 Queries: What if …
Given today’s workload, how will average response time change if my database fails?
If I double the memory on my application servers, how will SLA violation rate change?
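One way to picture how an analytic performance model might answer a what-if query is the textbook M/M/1 queue. The sketch below assumes, purely for illustration, that a server tier behaves as a single M/M/1 first-come-first-served queue and that a proposed change can be expressed as a new service rate. How a change such as doubling memory maps to a service rate is not modeled here, and all rates and function names are hypothetical.

```python
import math


def mean_response_time(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue: 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        return math.inf  # the queue is unstable and grows without bound
    return 1.0 / (service_rate - arrival_rate)


def sla_violation_rate(arrival_rate, service_rate, sla_seconds):
    """Fraction of requests slower than the SLA limit.

    In an M/M/1 FCFS queue the response time is exponentially distributed
    with rate (mu - lambda), so P(T > t) = exp(-(mu - lambda) * t).
    """
    if arrival_rate >= service_rate:
        return 1.0
    return math.exp(-(service_rate - arrival_rate) * sla_seconds)


def what_if(arrival_rate, current_rate, proposed_rate, sla_seconds):
    """Compare the predicted SLA violation rate before and after a change."""
    return {
        "current": sla_violation_rate(arrival_rate, current_rate, sla_seconds),
        "proposed": sla_violation_rate(arrival_rate, proposed_rate, sla_seconds),
    }


# Hypothetical numbers: 8 requests/s arriving, capacity raised from 10 to 12 requests/s.
answer = what_if(arrival_rate=8.0, current_rate=10.0, proposed_rate=12.0, sla_seconds=1.0)
print(f"violation rate now: {answer['current']:.3f}, after change: {answer['proposed']:.3f}")
```

A real multi-tier service would need a richer model, but the shape of the answer is the same: a predicted metric under the current configuration next to the prediction under the hypothetical one.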

16 Queries: Let me know …
Let me know if, with 75% probability, average response time will exceed 5 seconds in the next 30 minutes
– Prediction
– Continuous query
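A continuous query like this one re-evaluates a prediction every time new measurements arrive. The following sketch shows the general shape under strong simplifying assumptions that are not from the slides: each observation is the average response time of one past interval, the next interval is assumed to be normally distributed with the mean and standard deviation of a recent sliding window, and the class name `ExceedanceAlert` and its parameters are hypothetical.

```python
from collections import deque
from statistics import NormalDist, mean, stdev


class ExceedanceAlert:
    """Fires when P(next average response time > threshold) >= min_probability."""

    def __init__(self, threshold_s=5.0, min_probability=0.75, window=12):
        self.threshold_s = threshold_s
        self.min_probability = min_probability
        self.samples = deque(maxlen=window)  # sliding window of past interval averages

    def probability_of_exceeding(self):
        """Estimate the exceedance probability with a normal approximation."""
        if len(self.samples) < 2:
            return 0.0  # too little data to estimate a spread
        mu = mean(self.samples)
        sigma = stdev(self.samples)
        if sigma == 0:
            return 1.0 if mu > self.threshold_s else 0.0
        return 1.0 - NormalDist(mu, sigma).cdf(self.threshold_s)

    def observe(self, avg_response_time_s):
        """Add one interval's average and report whether the alert should fire."""
        self.samples.append(avg_response_time_s)
        return self.probability_of_exceeding() >= self.min_probability


# Made-up measurements: response times climb from about 1 s to more than 6 s.
alert = ExceedanceAlert(threshold_s=5.0, min_probability=0.75, window=4)
for value in [1.0, 1.1, 0.9, 6.0, 6.5, 7.0, 6.2]:
    fired = alert.observe(value)
    print(f"observed {value:.1f} s -> alert: {fired}")
```

The sliding window keeps the per-observation cost small, which matters for a query that runs continuously against a live system.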

17 Queries: What should I do?
What should I do to reduce SLA violations of requests A to <1%, without increasing violations of other requests?
– Root-cause + What-if

18 Querying Systems as Data
[Diagram label: DATA]
Instrumented traces, logs
System activity data
Data from active probing
Workload
System configuration data (e.g., buffer size, indexes)
Source code
Models
– Analytic performance models
– Machine learning models
– Rules from system experts
– Simulators

19 Querying Systems with QueS (30,000 ft)
[Diagram labels: DATA, Query Processor, Data Acquisition, Data Maintenance, Model-driven DB Engine, Queries, Answers, System mgmt. services]

20 Challenges: Query Complexity
Support for complex queries
– Rank probable causes of SLA violation rising to 12%?
– “What should I do” queries
Queries are ad-hoc
Queries may be acquisitional

21 Challenges: Query Specification
Declarative query language
– Expressibility of language
– Composition
Snapshot queries and continuous queries

22 Challenges: Query Processing
Model-based query processing
Many types of data sources
– Structured, semi-structured, and unstructured
Uncertainty in input data
– E.g., legacy systems may have partial/no instrumentation
Imprecise answers
– Answers may include quantification of accuracy
– Ranking

23 Challenges: Run-time Overhead
Real-time service for 24x7 systems
Tunable data acquisition
Active probing

24 Work in Progress
With Piyush Shivam
– Models for answering queries about expected performance given a resource assignment, feasible resource assignments to meet SLA, what-if queries for scientific applications
With Songyun Duan
– Use of Bayesian Networks for performance prediction and root-cause queries
With Wanhong Xu
– What-if queries on configuration-parameter settings

25 Projects at HP Research
Project 1: Predicting performance problems, finding root causes of problems
Project 2: Debugging complex systems
Project 3: Designing adaptive systems

