Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control
Ira Cohen, Jeffrey S. Chase et al.
Introduction
- Networked systems continue to grow in scale.
- Complex behavior stems from the interaction of workload, software structure, hardware, traffic conditions, and system goals.
- A pervasive management system is needed to manage such a system.
- Examples: HP's OpenView, IBM's Tivoli (aggregate instrumentation data and display it graphically).
Introduction
- Two approaches to building self-managing systems.
- First approach: a priori models, e.g., event-condition-action rules, constructed by hand rather than derived from the real system (a toy example follows).
- Disadvantages: difficult and costly to build, unreliable, and does not account for all conditions.
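To make the contrast concrete, here is a toy example of the kind of hand-written event-condition-action rule the a priori approach relies on; the metric name, threshold, and remedy are purely illustrative assumptions, not from the paper:

```python
# Hypothetical hand-written event-condition-action (ECA) rule.
# Metric name, threshold, and remedy are illustrative assumptions only.
def on_metric_sample(metrics, restart_app_server):
    # Event: a new metric sample arrives.
    # Condition: an expert-chosen threshold on a single metric.
    if metrics.get("app_server_cpu_util", 0.0) > 0.9:
        # Action: a fixed remedy encoded in advance by the operator.
        restart_app_server()
```

Rules like this are brittle precisely because the expert must anticipate every condition in advance; the learned models discussed next aim to avoid that.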
Introduction
- Second approach: statistical learning techniques.
- They assume little to no domain knowledge and are therefore "general".
- Problem: we still have to identify techniques powerful enough to induce effective models that are efficient, accurate, and robust.
Goals
- Automatically analyze instrumentation data from network services in order to forecast, diagnose, and repair failure conditions.
- Use Tree-Augmented Naïve Bayesian networks (TANs) as the basis for diagnosis and forecasting from system-level instrumentation in a 3-tier network service.
- TANs are widely used in other fields, but have not previously been applied in the context of computer systems.
Goals
- Analyzed data from 124 metrics gathered from a 3-tier e-commerce site under synthetic load (httperf driving Java PetStore).
- The induced TAN models select combinations of metrics and threshold values that correlate with compliance or violation of Service Level Objectives (SLOs) on average response time.
- Results are presented later.
What is a TAN?
- A Bayesian network is an annotated directed acyclic graph encoding a joint probability distribution.
- In a naïve Bayesian network, the state variable S is the only parent of all other vertices, which assumes all metrics are conditionally independent given S.
- A TAN also captures relationships among the metrics themselves, with the constraint that each metric has at most one parent other than S.
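The structural difference can be written as two factorizations of the joint distribution; here M_1..M_n are the metrics and Pa(M_i) denotes the single metric parent the TAN tree assigns to M_i (notation added here for clarity):

```latex
% Naive Bayes: metrics conditionally independent given the state S
P(S, M_1, \dots, M_n) = P(S) \prod_{i=1}^{n} P(M_i \mid S)

% TAN: each metric may have at most one additional metric parent Pa(M_i)
P(S, M_1, \dots, M_n) = P(S) \prod_{i=1}^{n} P\big(M_i \mid S,\ \mathrm{Pa}(M_i)\big)
```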
Why use a TAN?
- Based on the premise that a relatively small subset of metrics and threshold values is sufficient to approximate the distribution accurately (a selection sketch follows).
- Outperforms general Bayesian networks and other alternatives in both cost and accuracy.
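As an illustration of how such a small subset might be found in practice, here is a sketch of a greedy forward-selection loop; the score() function (e.g., cross-validated balanced accuracy of a classifier restricted to the candidate metrics) and the cap of 8 metrics are assumptions, not the paper's exact procedure:

```python
def greedy_metric_selection(all_metrics, score, max_metrics=8):
    """Greedily grow a metric subset: at each step add the metric that most
    improves score(subset), and stop when nothing helps or the cap is hit."""
    selected = []
    best_score = float("-inf")
    while len(selected) < max_metrics:
        candidates = [m for m in all_metrics if m not in selected]
        if not candidates:
            break
        top_score, top_metric = max((score(selected + [m]), m) for m in candidates)
        if top_score <= best_score:
            break  # no candidate improves the model further
        best_score = top_score
        selected.append(top_metric)
    return selected
```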
Why use a TAN?
- Useful for forecasting failures and violations: it is possible to induce models that predict SLO violations in the near future, even while the system is still stable.
- An automated controller can invoke the models directly to identify an impending violation and respond, e.g., by shedding load or adding resources (see the sketch below).
- The models are cheap to induce, so it is possible to maintain multiple models and refresh them periodically.
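A minimal sketch of what such a controller loop could look like; the model interface (predict_proba) and the hooks collect_metrics, shed_load, and add_capacity are hypothetical placeholders, not part of the paper:

```python
import time

POLL_INTERVAL_S = 30  # how often to sample metrics and consult the model

def control_loop(model, collect_metrics, shed_load, add_capacity):
    """Poll system metrics, ask the induced model whether an SLO violation
    is imminent, and take a simple corrective action if so."""
    while True:
        metrics = collect_metrics()                 # dict: metric name -> value
        p_violation = model.predict_proba(metrics)  # P(S = violation | metrics)

        if p_violation > 0.5:
            # Impending violation: try cheap mitigation first, then add resources.
            if not shed_load():
                add_capacity()

        time.sleep(POLL_INTERVAL_S)
```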
Setup
- The system is a 3-tier web service: Apache, middleware (BEA WebLogic), and an Oracle database.
- Three servers, with HP OpenView collecting statistics.
- The load generator is httperf.
- An SLO indicator processes the request logs to determine compliance (see the sketch below).
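A rough sketch of what an SLO indicator over the request logs could look like; the log record layout, the window size, and the threshold value are assumptions for illustration:

```python
from collections import defaultdict

def slo_compliance(records, window_s=300, slo_threshold_ms=100.0):
    """Group request records into fixed time windows, compute the average
    response time per window, and label each window as compliant (True)
    or in violation (False) of the SLO on average response time.

    records: iterable of (timestamp_s, response_time_ms) pairs.
    """
    windows = defaultdict(list)
    for ts, rt_ms in records:
        windows[int(ts // window_s)].append(rt_ms)

    labels = {}
    for window, times in sorted(windows.items()):
        avg_ms = sum(times) / len(times)
        labels[window] = avg_ms <= slo_threshold_ms
    return labels
```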
Interpretability and Modifiability
- TANs offer two further advantages: interpretability and modifiability.
- The influence of each metric can be quantified within the probabilistic model.
- The analysis catalogs each type of violation according to the metrics and values that correlate with the observed instances; the strength of a metric's influence is given by the probability of its value occurring in the different states.
- This gives insight into the causes of violations and how to repair them (see the sketch below).
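One simple way to quantify a metric's influence, assuming discretized conditional probability tables have already been learned for each state, is the per-metric log-likelihood ratio between the violation and compliance states; this is a sketch of the idea, not the paper's exact procedure:

```python
import math

def metric_influence(value, parent_value, cpt_violation, cpt_compliance):
    """Score how strongly one metric's observed value argues for an SLO
    violation versus compliance, given its single metric parent's value.

    cpt_violation / cpt_compliance: dicts mapping
    (metric_value, parent_value) -> conditional probability in that state.
    Positive scores favor 'violation', negative favor 'compliance'.
    """
    p_violation = cpt_violation.get((value, parent_value), 1e-6)
    p_compliance = cpt_compliance.get((value, parent_value), 1e-6)
    return math.log(p_violation / p_compliance)
```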
Workloads
- The workloads vary several characteristics: aggregate request rate, number of concurrent connections, and the fraction of data-intensive vs. application-intensive requests.
- This exercises the model-induction methodology by providing it with a wide range of (M, P) pairs, where M is a sample of values for the system metrics and P is a vector of application-level performance measurements (see the sketch below for turning such pairs into training data).
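As an illustration of how (M, P) pairs might become labeled training data for the classifier; the alignment by time window and the threshold parameter are assumptions, not details from the slides:

```python
def build_training_set(metric_samples, perf_samples, slo_threshold_ms=100.0):
    """Join per-window metric vectors M with per-window performance vectors P
    and label each joined sample by whether the SLO was met in that window.

    metric_samples: dict window_id -> dict of metric name -> value (M)
    perf_samples:   dict window_id -> dict with an 'avg_response_ms' entry (P)
    Returns a list of (M, label) pairs where label is True for compliance.
    """
    dataset = []
    for window, metrics in sorted(metric_samples.items()):
        perf = perf_samples.get(window)
        if perf is None:
            continue  # drop windows without matching performance data
        label = perf["avg_response_ms"] <= slo_threshold_ms
        dataset.append((metrics, label))
    return dataset
```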
Workloads
- RAMP: increasing concurrency.
- STEP: constant background traffic plus a step function of bursty, hour-long bursts.
- BUGGY: increasing aggregate request rate.
Results
- Varied the SLO thresholds to explore their effect on the induced models and to evaluate model accuracy under varying conditions.
- Trained and evaluated a TAN classifier for each of 31 different SLO definitions.
- Baselines: the classifier induced for the 60th-percentile SLO definition (MOD) and a classifier that uses CPU utilization as its only metric.
Results
- Overall balanced accuracy (BA) of the TAN models is 87-94%, 90% or higher for all experiments (see the sketch below for how BA and the false alarm rate are computed).
- False alarm rate is 6% for two of the experiments and 17% for BUGGY.
- A single metric (e.g., CPU utilization) is not sufficient to capture the pattern of SLO violations, but a small number of metrics (3-8) is.
- Accuracy is sensitive to the workload and the SLO definition: MOD always has a high detection rate, but generates false alarms at an increasing rate as the SLO threshold increases.
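For reference, a small sketch of how balanced accuracy and the false alarm rate are typically computed from classifier output against the true SLO labels; these are the standard definitions, stated here for clarity rather than taken from the slides:

```python
def evaluate_slo_classifier(y_true, y_pred):
    """Return (balanced_accuracy, false_alarm_rate).

    y_true, y_pred: sequences of booleans where True means 'SLO violation'.
    Balanced accuracy is the mean of the detection rate (fraction of true
    violations flagged) and the true negative rate (fraction of compliant
    periods correctly left alone)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)

    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    true_negative_rate = tn / (tn + fp) if (tn + fp) else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0

    balanced_accuracy = 0.5 * (detection_rate + true_negative_rate)
    return balanced_accuracy, false_alarm_rate
```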
Conclusion
- TANs are attractive building blocks for self-managing systems.
- They build system models automatically, with no a priori knowledge required.
- They generalize to a wide range of conditions, zero in on the most relevant metrics, and are practical.
Conclusion
- Possible future work: adapt the approach to changing conditions and close the loop for automated diagnosis and control.
- Ultimately, the most successful model is a hybrid of automatically induced models and a priori models.
Questions?