Presentation is loading. Please wait.

Presentation is loading. Please wait.

© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Toward automated diagnosis and forecasting.

Similar presentations


Presentation on theme: "© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Toward automated diagnosis and forecasting."— Presentation transcript:

1 © 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Toward automated diagnosis and forecasting of performance problems Moises Goldszmidt Moises.Goldszmidt@hp.com Joint work with: Ira Cohen, Terence Kelly, and Julie Symons of HPLabs and Jeff Chase of CS Duke

2 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.2 Motivation Complexity of deployed systems surpasses the ability of humans to diagnose, forecast, and characterize behavior −Scale and interconnectivity −Levels of abstraction −Modular design and features −Changes in input, applications, infrastructure −Lack of closed-form mathematical characterization −Legacy systems and OEMs We are competent at collecting data Need: Transform collected data from measurements to actionable items

3 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.3 SLIC: Statistical Learning, Inference, and Control A priori models of system structure and behavior −Difficult and costly −Likely to be inaccurate and/or short-lived −Brittle to changes and unanticipated conditions Models obtained automatically using induction techniques −Economical −Minimally invasive −Require little knowledge −Adapt to changes SLIC today SLIC tomorrow Guide and accelerate induction Automation and Adaptation

4 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.4 This talk The application of automated statistical pattern recognition and probabilistic modeling for performance diagnostics in a 3-tier service −Given a performance target: Service level objectives (SLO) = Avg response time −Goal: Analyze streams of instrumentation data collected from service components and select a focused set of metrics and thresholds that correlated to specific performance problems (SLO violations)

5 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.5 Overview: System under study W2K Server Apache W2K Server BEA WebLogic W2K Server Oracle httperf HP OpenViewSLO compliance indicatorStatistical analysis engine / model induction logs metrics classifier Models classifier Diagnosis classifier Forecasting

6 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.6 Metrics

7 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.7 Diagnosing SLOs Which system metrics correlate with violations? What values of these metrics correlate with the violations? Why those spikes? Are there different “types” of violations? Can these violations be forecasted?

8 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.8 Annotated Map of SLO violations

9 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.9 Decision boundary

10 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.10 Is it always the CPU? (need for adaptation)

11 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.11 A combination of metrics (need for automation)

12 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.12 Model – P(S,M*) (Bayesian network) Efficient Induction = 15millisecs 4 params per metric Representative 94% patterns of SLO behavior Interpretable   i [log P(mi|pa(mi),s-) – log P(mi|pa(mi),s+)] > 0 Fuse expert knowledge and statistical information

13 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.13 Model induction process Stat extractor Model induction loop Report Find relevant metrics Heuristic search Score: Balanced Accuracy 0.5*[P(s-= F(M)|s-) + P(s+=F(M)|s+)] Mean Variance Covariance

14 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.14 Model report Input file name: outMeansVarsexper23_5min ****SLO Threshold: 312 milliseconds – 20% violations Task: Diagnose Top metrics ranked score var_IDC2_CODA_CPU_1_USERTIME 0.83522 mean_IDC5_CODA_DISK_1_PHYSWRITEBYTE 0.88134 mean_IDC5_CODA_GBL_SWAPSPACEUSED 0.8996 var_IDC5_CODA_NETIF_2_INPACKET 0.9147 var_IDC5_CODA_NETIF_2_INBYTE 0.93675 var_IDC2_CODA_GBL_SWAPSPACEUSED 0.93675 mean_IDC5_CODA_DISK_1_PHYSREADBYTE 0.93596 var_IDC2_CODA_DISK_1_PHYSWRITE 0.94134 Number of metrics were chosen for classifiers: 8 --------------------------- Model False Alarms Detection TAN: 6.5 +- 1.9 94.8 +- 6.1 ---------------------------

15 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.15 Attribution report Total instances of SLO violation: 152 Alarm Attribution combination: 01011010 8 Alarm Attribution combination: 01011110 2 Alarm Attribution combination: 01100010 4 Alarm Attribution combination: 11001010 11 Alarm Attribution combination: 11001110 4 Alarm Attribution combination: 11011010 14 Alarm Attribution combination: 11100010 7 Alarm Attribution combination: 11101010 9 Alarm Attribution combination: 11101110 9 Alarm Attribution combination: 11110010 7 Alarm Attribution combination: 11110100 1 Alarm Attribution combination: 11110110 3 Alarm Attribution combination: 11111000 2 Alarm Attribution combination: 11111010 12 Alarm Attribution combination: 11111110 12 Total instances of sum of metrics likelihood in alarm: 148 varAppCPU_USER meanDBDISK1WRITEBYTE meanDBGBLSWAPUSED varDBNETIF2INPACK varDBNETIF2INBYTE varAppGBLSWAPUSED meanDBDISK1READBYTE varAppDISK1WRITE

16 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.16 Experiments Experiment 20: increasing ramp −Increase the number of simultaneous sessions −Add a new client every 20 minutes −Up to 100 sessions in 36 hours −Requests A sinusoidal wave over a ramp Experiment 23: bursts with two workload −Workload 1: steady background traffic of 1k requests per minute and 20 clients −Workload 2: On-off every hour, with bursts of 5, 15, 20 clients (each session of 50 requests per minute) −Overall time = 36 hours

17 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.17 Output (Avg resp times) from Experiments Experiment 20 Experiment 23

18 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.18 Experimental loop: 10 minutes

19 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.19 92-94% of patterns captured Failing to adapt to SLO increases False Alarms CPU alone is not enough Handful of metrics needed Summary of results

20 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.20 Metrics selected Need to adapt metrics to workload Handful of metrics

21 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.21 Metrics change across SLO defs.

22 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.22 Next steps Continue characterization (Steve Zhang) −Different workloads, scaling, learning curves, search, etc Data from Production environments From forensics to online (Steve Zhang) −Model management and fusion Forecasting SLO violations (Rob Powers) −Three epochs forecasting Exper20: 91.8 +- 1.2 (9.1FA and 93 Det) Exper23: 79.7 +-4.6 (2.4 FA and 83 Det) Fusing different SLT techniques: −Bnets and decision trees….

23 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.23 Concluding remarks Demonstrated a pattern recognition and probabilistic modeling approach to diagnosing service level objective problems −Automated: relieves the user from inspecting hundreds of time series of metrics (their combinations!) and the possible correlations with the SLO −Adaptive: works for a variety of SLO definitions and with different workloads −Economic: does not rely on costly expert, and it is minimally invasive. Does not rely on modifications to infrastructure Not such a big hammer: −Freeware available – not worse than a regression −We have enough computer power in a laptop

24 6/14/2015Copyright © 2004 HP corporate presentation. All rights reserved.24


Download ppt "© 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice Toward automated diagnosis and forecasting."

Similar presentations


Ads by Google