Presentation on theme: "Ira Cohen, Jeffrey S. Chase et al."— Presentation transcript:
1 Correlating Instrumentation Data to System States: A Building Block for Automated Diagnosis and Control
Ira Cohen, Jeffrey S. Chase et al.
2 Introduction
- Networked systems continue to grow in scale
- Complex behavior stemming from interaction of
  - Workload
  - Software structure
  - Hardware
  - Traffic conditions
  - System goals
- Pervasive system needed to manage such a system
  - Examples?
    - HP's OpenView
    - IBM's Tivoli
    - (Aggregates + displays graphically)
3 Introduction
- Two approaches to building self-managing systems
- A priori models
  - Event-condition-action rules
  - Not based on real systems
  - (Disadvantages?)
    - Difficult and costly
    - Unreliable, does not take account of all
4 Introduction
- Statistical learning techniques
  - Assume little to no domain knowledge
  - Hence “general”
- Problem!
  - Still have to identify techniques that are powerful enough to induce effective models that are:
    - Efficient
    - Accurate
    - Robust
5 Goals
- Automatically analyze instrumentation data from network services in order to
  - Forecast
  - Diagnose
  - Repair failure conditions
- We use Tree-Augmented Naïve Bayesian Networks (TANs) as the basis for
  - Diagnosis
  - Forecasting
- System-level instrumentation in a 3-tier network service
- TANs are widely used in various fields, but had not been used in the context of computer systems
6 Goals
- Analyzed data from 124 metrics gathered from a 3-tier e-commerce site under synthetic load
  - httperf
  - Java PetStore as platform
- TAN model selects combinations of metrics and threshold values that correlate with compliance with Service Level Objectives (SLOs) for average response time
- Results later
7 What is a TAN?
- A Bayesian network is an annotated directed acyclic graph encoding a joint probability distribution
- Naïve Bayesian network
  - State variable S is the only parent of all other vertices
  - Assumes all metrics are conditionally independent given S
- TANs also consider relationships among the metrics themselves, with the constraint that each metric has at most one parent other than S
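The TAN structure described on this slide can be illustrated with a minimal sketch. It assumes discretized metrics ("hi"/"lo"), hand-written probability tables, and hypothetical names (`validate_tan`, `log_joint`, `posterior_violation`, the `cpu` and `queue` metrics); none of these come from the presentation. The joint distribution factors as P(S) times the product of P(metric | its metric parent, S).

```python
import math

def validate_tan(parents):
    """Check the TAN constraint. `parents` maps each metric to its single
    metric parent (or None); S is an implicit parent of every metric.
    Returns False if a parent is unknown or the parent links form a cycle."""
    for metric in parents:
        seen = {metric}
        node = parents[metric]
        while node is not None:
            if node in seen or node not in parents:
                return False
            seen.add(node)
            node = parents[node]
    return True

def log_joint(state, sample, prior, cpt, parents):
    """log P(S=state, metrics=sample) = log P(S) + sum_i log P(m_i | parent_i, S)."""
    total = math.log(prior[state])
    for metric, value in sample.items():
        parent = parents[metric]
        parent_value = sample[parent] if parent is not None else None
        total += math.log(cpt[metric][(state, parent_value)][value])
    return total

def posterior_violation(sample, prior, cpt, parents):
    """P(S=violation | sample) for a two-state S."""
    lv = log_joint("violation", sample, prior, cpt, parents)
    lc = log_joint("compliance", sample, prior, cpt, parents)
    return 1.0 / (1.0 + math.exp(lc - lv))

# Illustrative (made-up) numbers: "queue" has "cpu" as its one metric parent.
parents = {"cpu": None, "queue": "cpu"}
prior = {"violation": 0.3, "compliance": 0.7}
cpt = {
    "cpu": {
        ("violation", None): {"hi": 0.8, "lo": 0.2},
        ("compliance", None): {"hi": 0.2, "lo": 0.8},
    },
    "queue": {
        ("violation", "hi"): {"hi": 0.9, "lo": 0.1},
        ("violation", "lo"): {"hi": 0.6, "lo": 0.4},
        ("compliance", "hi"): {"hi": 0.3, "lo": 0.7},
        ("compliance", "lo"): {"hi": 0.1, "lo": 0.9},
    },
}
sample = {"cpu": "hi", "queue": "hi"}
p_violation = posterior_violation(sample, prior, cpt, parents)
```

Setting every entry of `parents` to `None` reduces the same code to a naïve Bayesian classifier, which makes the difference between the two structures easy to see.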
8 Why Use a TAN?
- Based on the premise that a relatively small subset of metrics and threshold values is sufficient to approximate the distribution accurately
- Outperforms generalized Bayesian networks and other alternatives in both
  - Cost
  - Accuracy
9 Why use a TAN?
- Useful for forecasting failures and violations
  - Possible to induce models that predict SLO violations in the near future, even when the system is stable
- Automated controller can invoke the model directly
  - Identify impending violation
  - Respond
    - Loading
    - Adding resources
- Cheap model to induce
  - Possible to maintain multiple models
  - Periodic refresh
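One way to set up training data for forecasting is to pair the metrics observed in one interval with the SLO state observed a few intervals later. The sketch below assumes that approach; the function name `make_forecast_pairs` and the `horizon` parameter are hypothetical and are not taken from the presentation.

```python
def make_forecast_pairs(metric_samples, slo_states, horizon):
    """Pair the metrics at interval t with the SLO state at interval t + horizon.

    metric_samples: list of per-interval metric records (any type).
    slo_states: list of per-interval SLO states, aligned with metric_samples.
    horizon: number of intervals to look ahead (0 means plain classification).
    """
    if horizon < 0:
        raise ValueError("horizon must be non-negative")
    usable = min(len(metric_samples), len(slo_states)) - horizon
    return [(metric_samples[t], slo_states[t + horizon])
            for t in range(max(usable, 0))]

# Illustrative data: four intervals, violation starts at interval 2.
metrics = [{"cpu": 0.2}, {"cpu": 0.7}, {"cpu": 0.9}, {"cpu": 0.9}]
states = ["compliance", "compliance", "violation", "violation"]
pairs = make_forecast_pairs(metrics, states, horizon=1)
```

A classifier trained on such pairs learns which current metric values tend to precede a violation, which is what a controller would need in order to act before the violation occurs.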
10 Setup
- System is a 3-tier web service
  - Apache
  - Middleware (BEA WebLogic)
  - Oracle DB
- 3 servers with HP OpenView to collect statistics
- Load generator is httperf
- SLO indicator processes the logs to determine compliance
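The role of the SLO indicator can be sketched as follows, assuming the log has already been parsed into (timestamp, response time) pairs and that compliance means the average response time in an interval stays at or below a threshold. The function name `slo_states`, the log layout, and the numbers are illustrative assumptions, not details from the presentation.

```python
from statistics import mean

def slo_states(request_log, interval_seconds, threshold_seconds):
    """Label each time interval as "compliance" or "violation".

    request_log: iterable of (timestamp_seconds, response_time_seconds).
    An interval complies when its average response time is at or below
    threshold_seconds. Intervals with no requests are omitted.
    """
    buckets = {}
    for timestamp, response_time in request_log:
        index = int(timestamp // interval_seconds)
        buckets.setdefault(index, []).append(response_time)
    return {
        index: "compliance" if mean(times) <= threshold_seconds else "violation"
        for index, times in sorted(buckets.items())
    }

# Illustrative log: two requests per one-minute interval.
log = [(0, 0.1), (30, 0.3), (61, 1.0), (90, 2.0)]
labels = slo_states(log, interval_seconds=60, threshold_seconds=0.5)
```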
11 Interpretability and Modifiability
- TANs offer other advantages
  - Interpretability
  - Modifiability
- Influence of each metric can be quantified in a probabilistic model
- Analysis catalogs each type of violation according to the metrics and values that correlate with observed instances
- Strength is given by the probability of a value occurring in the different states
- Gives insight into the causes of violations and how to repair them
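A simple way to quantify each metric's influence is to compare how probable its observed value is under the violation state versus the compliance state. The sketch below uses a log-likelihood ratio for that comparison; this scoring rule, the name `metric_influence`, and the probability tables are illustrative assumptions rather than the presenters' stated method.

```python
import math

def metric_influence(sample, cpt, parents):
    """Return a per-metric score: log P(value | parent, violation)
    minus log P(value | parent, compliance).
    Positive scores point toward violation, negative toward compliance."""
    scores = {}
    for metric, value in sample.items():
        parent = parents[metric]
        parent_value = sample[parent] if parent is not None else None
        p_violation = cpt[metric][("violation", parent_value)][value]
        p_compliance = cpt[metric][("compliance", parent_value)][value]
        scores[metric] = math.log(p_violation) - math.log(p_compliance)
    return scores

# Made-up tables for two discretized metrics; "queue" depends on "cpu".
parents = {"cpu": None, "queue": "cpu"}
cpt = {
    "cpu": {
        ("violation", None): {"hi": 0.8, "lo": 0.2},
        ("compliance", None): {"hi": 0.2, "lo": 0.8},
    },
    "queue": {
        ("violation", "hi"): {"hi": 0.9, "lo": 0.1},
        ("violation", "lo"): {"hi": 0.6, "lo": 0.4},
        ("compliance", "hi"): {"hi": 0.3, "lo": 0.7},
        ("compliance", "lo"): {"hi": 0.1, "lo": 0.9},
    },
}
scores = metric_influence({"cpu": "hi", "queue": "hi"}, cpt, parents)
```

Sorting the metrics by score gives a ranked list of the metrics most implicated in a given violation, which is the kind of output an operator could use as a starting point for repair.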
12 Workloads
- Vary several characteristics
  - Aggregate request rate
  - Number of concurrent connections
  - Fraction of data-intensive vs. app-intensive requests
- This is to exercise the model-induction methodology by providing it with a wide range of (M, P) pairs, where
  - M = sample of values for system metrics
  - P = vector of app-level performance measurements
14 Results
- Varied SLO thresholds to explore the effect on induced models
  - To evaluate accuracy of models under varying conditions
- Trained and evaluated a TAN classifier for each of 31 different SLO definitions
- Baselines: accuracy of the 60th-percentile SLO classifier (MOD) and of CPU as a single metric
15 Results
- Overall balanced accuracy (BA) of TAN is 87-94%
- 90+% for all experiments
- 6% false alarm rate for 2 experiments, 17% for BUGGY
- A single metric (CPU) is not sufficient to capture the pattern of SLO violations
- A small number of metrics (3-8) is sufficient to capture the pattern
- Sensitive to workload and SLO definition (MOD always has a high detection rate, but generates false alarms at an increasing rate as the SLO threshold increases)
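For reference, balanced accuracy averages the detection rate (violations correctly flagged) with the rate of correctly identified compliance (one minus the false alarm rate). A minimal sketch, with a hypothetical function name and made-up labels where `True` means violation:

```python
def balanced_accuracy(actual, predicted):
    """Return (balanced_accuracy, detection_rate, false_alarm_rate).

    actual, predicted: equal-length sequences of booleans, True = violation.
    """
    tp = fn = fp = tn = 0
    for truth, guess in zip(actual, predicted):
        if truth and guess:
            tp += 1          # violation detected
        elif truth and not guess:
            fn += 1          # violation missed
        elif not truth and guess:
            fp += 1          # false alarm
        else:
            tn += 1          # compliance correctly identified
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("both states must appear in `actual`")
    detection_rate = tp / (tp + fn)
    false_alarm_rate = fp / (fp + tn)
    ba = (detection_rate + (1.0 - false_alarm_rate)) / 2.0
    return ba, detection_rate, false_alarm_rate

# Illustrative labels: 4 violation intervals, 6 compliant intervals.
actual = [True] * 4 + [False] * 6
predicted = [True, True, True, False, True, False, False, False, False, False]
ba, detection, false_alarm = balanced_accuracy(actual, predicted)
```

Unlike plain accuracy, this measure is not inflated when one state dominates the data, which matters when violations are rare.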
16 Conclusion
- TANs are attractive for self-managing systems
  - Build system models automatically
  - No a priori knowledge required
  - Generalize to a wide range of conditions
  - Zero in on the most relevant metrics
  - Practical
17 Conclusion
- Possible work to adapt this to changing conditions
- Close the loop for automated diagnosis and control
- Ultimately, the most successful model is a hybrid of
  - Automatically induced models
  - A priori models