Failure Avoidance through Fault Prediction Based on Synthetic Transactions Mohammed Shatnawi 1, 2 Matei Ripeanu 2 1 – Microsoft Online Ads, Microsoft Corporation.

1 Failure Avoidance through Fault Prediction Based on Synthetic Transactions
Mohammed Shatnawi (1, 2), Matei Ripeanu (2)
(1) Microsoft Online Ads, Microsoft Corporation, Redmond, Washington
(2) Electrical and Computer Engineering Department, University of British Columbia, Vancouver, BC, Canada
This work was supported in part by the Institute for Computing, Information and Cognitive Systems (ICICS) at UBC.

2 Microsoft Ad Exchange
The Exchange Market Place Model
  Advertisers, Publishers, and Brokers
  Ad Networks – Aggregators of Advertising Demands
  Value-Added Providers
  Exchange Operator
Exchange Characteristics
  Liquidity
  Auction – Bidding and Pricing
  Eligible Participants
  Federation
  Fairness and Neutrality vs. Arbitraging
Strict Requirements
  Performance, Reliability and Strict SLA

3 The Problem with Logs
Complexity
  Contain large, disparate content
  Verbose and large in size
  Often incomplete for data analysis and mining
Record problems after they have happened
  Users have already endured the bad experience
  Enterprises may lose customers’ trust in the service

4 Related Work
Lack of a holistic approach
  Focus on specific aspects of the problem
Pre-process logs to reduce their complexity
  Log aggregation – Snyder et al. [5]
  Event categorization – Zhen et al. [10]
  Log parsing and text mining – Xu et al. [7, 9]
Proactive fault avoidance
  System monitoring at run time – Pietrantuono [1]
  Runtime fault injection to ensure faults are detected – Cotroneo [2]
Problems with such approaches
  » They are not holistic: each addresses one aspect of the problem
  » Solving one problem may cause another (e.g. fault injection at runtime may interfere with the system behavior, and may add to log complexity)

5 Suggested Approach Goals
Address the problem holistically:
Proactivity
  Use synthetic transactions before going to production
Log design simplicity and completeness
  Use specialized logs for a specific set of metrics
  Log data is complete for data analysis (through iteration)
Data mining in mind
  Log schema is designed with data analysis and mining in mind

6 Suggested Approach Flowchart

7 Suggested Approach Goals in Detail
Proactivity – System functionality emulation in a test environment
System replica in a pre-production environment
  Replica of the production hardware system in a test environment
  Replica of the production software system in a test environment
Synthetic transactions
  Use of a software client to emulate the expected workload through the distribution of function calls and the data load intensity
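The slide does not show the software client itself. As a minimal sketch of the idea, the Python below draws synthetic calls from a weighted mix of operations and varies the payload size to model data load intensity. The operation names (CRUDQ is assumed here to mean create, read, update, delete, query), the mix weights, and the payload range are all hypothetical, not taken from the presentation.

```python
import random
from collections import Counter

# Hypothetical share of calls per operation; the slides give no actual mix.
OPERATION_MIX = {
    "create": 0.10,
    "read": 0.50,
    "update": 0.15,
    "delete": 0.05,
    "query": 0.20,
}


def generate_workload(n_calls, mix, max_payload_kb, seed=0):
    """Return a list of (operation, payload_kb) synthetic calls.

    Operations are sampled according to `mix` (distribution of function
    calls); payload_kb is a uniform stand-in for data load intensity.
    """
    rng = random.Random(seed)  # seeded so a test run can be replayed
    operations = list(mix)
    weights = [mix[op] for op in operations]
    calls = []
    for _ in range(n_calls):
        operation = rng.choices(operations, weights=weights, k=1)[0]
        payload_kb = rng.randint(1, max_payload_kb)
        calls.append((operation, payload_kb))
    return calls


workload = generate_workload(1000, OPERATION_MIX, max_payload_kb=64)
counts = Counter(operation for operation, _ in workload)
```

Seeding the generator is a design choice for this sketch: replaying the same workload makes it easier to compare log output across configuration changes.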

8 Suggested Approach Goals in Detail
Log design simplicity and completeness
  Tailored to the analysis of the problem(s) at hand (e.g. response time, correctness, error handling, etc.)
  Log schema design is advised by dimensional modeling, which ensures that all impactful data is accounted for
  The schema design is iterative, to allow isolation of the parameters with the most impact on the metrics at hand
    » This leads to more compact logs
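The presentation does not list the actual log schema. The sketch below shows one way a dimensionally modeled log record could look in Python: small dimension tables plus a fact row that holds foreign keys and measures. Every field name here is a hypothetical illustration. The `compact` helper mimics the iterative step of keeping only the columns found to matter.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class OperationDim:
    """Dimension: which kind of operation was called (hypothetical)."""
    operation_id: int
    name: str


@dataclass(frozen=True)
class HostDim:
    """Dimension: which machine served the call (hypothetical)."""
    host_id: int
    hostname: str


@dataclass(frozen=True)
class TransactionFact:
    """Fact row: one synthetic transaction with keys and measures."""
    timestamp_s: float
    operation_id: int   # foreign key into OperationDim
    host_id: int        # foreign key into HostDim
    cpu_load_pct: float
    payload_kb: int
    latency_ms: float


def to_log_line(fact):
    """Serialize a fact row as one comma-separated log line."""
    return ",".join(str(value) for value in asdict(fact).values())


def compact(fact, keep):
    """Keep only the named columns, as a later schema iteration might."""
    return {name: value for name, value in asdict(fact).items() if name in keep}


fact = TransactionFact(0.0, 1, 2, 50.0, 4, 12.5)
line = to_log_line(fact)
slim = compact(fact, keep={"cpu_load_pct", "latency_ms"})
```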

9 Suggested Approach Goals in Detail
Data mining in mind
  The dimensional model ensures data set completeness
  It also ensures:
    – Amenability to data mining
    – Completeness of the data that mining requires
    – Easy addition of dimensions and new data
Use of any available mining solution
  The goal is not to create new data mining models or techniques
  Generating data with data mining in mind simplifies the mining process

10 Pre-Deployment Based on Mining Findings
Before deployment, configure the system based on the mining results
  The findings identify the system parameters and conditions that cause faults
  The system configuration guards against the conditions that cause problems, and so prevents them from happening
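The slides do not say what form the guarding configuration takes. As one hedged illustration, suppose mining flags high CPU load and a rising recent trend as fault conditions (the Results slides report these two factors). A pre-deployment guard could then be sketched in Python as below. The threshold values and the names `GuardConfig` and `should_throttle` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardConfig:
    """Operational limits set before deployment (hypothetical values)."""
    warn_cpu_load_pct: float  # above this, a rising trend triggers the guard
    max_cpu_load_pct: float   # at or above this, the guard always triggers


def is_increasing(samples):
    """True when each sample is strictly larger than the one before it."""
    return len(samples) >= 2 and all(
        later > earlier for earlier, later in zip(samples, samples[1:])
    )


def should_throttle(config, recent_cpu_loads):
    """Decide whether to shed load, given recent CPU samples (oldest first)."""
    current = recent_cpu_loads[-1]
    if current >= config.max_cpu_load_pct:
        return True
    return current >= config.warn_cpu_load_pct and is_increasing(recent_cpu_loads)


config = GuardConfig(warn_cpu_load_pct=60.0, max_cpu_load_pct=80.0)
```

With this configuration, `should_throttle(config, [50.0, 60.0, 70.0])` triggers because the load is above the warning level and rising, while `should_throttle(config, [70.0, 65.0, 62.0])` does not.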

11 Experiment
Experiment goals
  Predict the conditions that delay the execution of CRUDQ operations (and so cause SLA misses)
  Accordingly, set operational limits and system configuration before deployment to prevent these problems from happening
Methodology:
  Devise synthetic transactions to emulate the CRUDQ operations
  Log results from the synthetic system
  Build a data mining model from these logs
  Find the system conditions with the most impact on system failure (SLA latency)
  Configure the system to guard against them
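Between logging and model building, the log records need a label that says whether the SLA was missed. The slides do not give the SLA threshold or any discretization, so the Python sketch below assumes a latency limit of 200 ms and three CPU load bands purely for illustration. The record field names are hypothetical as well.

```python
SLA_LATENCY_MS = 200.0  # hypothetical SLA limit; the slides do not state one


def cpu_band(cpu_load_pct):
    """Discretize CPU load into coarse bands (band edges are assumptions)."""
    if cpu_load_pct < 50.0:
        return "low"
    if cpu_load_pct < 75.0:
        return "medium"
    return "high"


def to_training_row(record):
    """Turn one logged transaction into a labeled row for a mining model."""
    return {
        "operation": record["operation"],
        "cpu_band": cpu_band(record["cpu_load_pct"]),
        "sla_missed": record["latency_ms"] > SLA_LATENCY_MS,
    }


log_records = [
    {"operation": "read", "cpu_load_pct": 35.0, "latency_ms": 40.0},
    {"operation": "query", "cpu_load_pct": 82.0, "latency_ms": 310.0},
]
training_rows = [to_training_row(record) for record in log_records]
```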

12 Experiment Baseline
Compare the data mining ability of the synthetic system to actual log data
We used synthetic log data to train and test the model
  » 8 hours’ worth of data with 28k transactions
  » Log size was 11 MB
We used actual log data to verify our results
  » Five weeks’ worth of data with 650M transactions
  » Log size was about 88 GB of data a day
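To put the two data sets side by side, the short Python calculation below works out the size and transaction ratios from the figures on this slide. It assumes five weeks means 35 days, that the 88 GB daily figure holds for every day, and decimal units (1 GB = 1000 MB). None of these assumptions is stated in the presentation.

```python
import math

SYNTHETIC_LOG_MB = 11
SYNTHETIC_TRANSACTIONS = 28_000

PRODUCTION_LOG_GB_PER_DAY = 88
PRODUCTION_DAYS = 5 * 7            # assumption: five full weeks
PRODUCTION_TRANSACTIONS = 650_000_000

# Assumption: decimal units, 1 GB = 1000 MB.
production_log_mb = PRODUCTION_LOG_GB_PER_DAY * 1000 * PRODUCTION_DAYS

size_ratio = production_log_mb / SYNTHETIC_LOG_MB
transaction_ratio = PRODUCTION_TRANSACTIONS / SYNTHETIC_TRANSACTIONS

print(f"production log total: {production_log_mb / 1000:.0f} GB")
print(f"log size ratio: {size_ratio:,.0f}x (10^{math.log10(size_ratio):.1f})")
print(f"transaction ratio: {transaction_ratio:,.0f}x (10^{math.log10(transaction_ratio):.1f})")
```

Under these assumptions the production logs total about 3,080 GB, roughly 280,000 times the synthetic log, with about 23,000 times as many transactions.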

13 Results
Naïve Bayes prediction using synthetic data
  » “CPU load” (77%) and “Recent historical trend” (increasing) impacted results the most
  » Using this model on actual log data showed 91% prediction accuracy
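The presentation used an existing mining solution and does not show its model. For readers unfamiliar with the technique, here is a minimal categorical Naïve Bayes classifier in plain Python with add-one (Laplace) smoothing. The six training rows are made-up toy data, not the experiment's logs, and the feature names `cpu_band` and `trend` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, label_key):
    """Count labels and per-label feature values from labeled rows."""
    label_counts = Counter(row[label_key] for row in rows)
    value_counts = defaultdict(Counter)   # (feature, label) -> value counts
    feature_values = defaultdict(set)     # feature -> distinct values seen
    for row in rows:
        for feature, value in row.items():
            if feature == label_key:
                continue
            value_counts[(feature, row[label_key])][value] += 1
            feature_values[feature].add(value)
    return label_counts, value_counts, feature_values


def predict(model, features):
    """Return the label with the highest smoothed log-probability."""
    label_counts, value_counts, feature_values = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for feature, value in features.items():
            seen = value_counts[(feature, label)][value]
            # Add-one smoothing avoids log(0) for unseen combinations.
            score += math.log((seen + 1) / (count + len(feature_values[feature])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy rows for illustration only.
rows = [
    {"cpu_band": "high", "trend": "increasing", "sla_missed": True},
    {"cpu_band": "high", "trend": "increasing", "sla_missed": True},
    {"cpu_band": "high", "trend": "stable", "sla_missed": True},
    {"cpu_band": "low", "trend": "stable", "sla_missed": False},
    {"cpu_band": "low", "trend": "increasing", "sla_missed": False},
    {"cpu_band": "medium", "trend": "stable", "sla_missed": False},
]
model = train_naive_bayes(rows, label_key="sla_missed")
```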

14 Results
Decision Trees prediction using synthetic data
  » “CPU load” impacted the results the most (77%)
  » Using this model on actual log data showed 89% prediction accuracy
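The slide does not say how the impact of each factor was measured. Decision tree learners commonly rank attributes by information gain, so the Python sketch below uses that measure as a stand-in, on made-up toy rows. It is not the presenters' method, and it does not reproduce the 77% figure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(rows, feature, label_key):
    """Entropy reduction obtained by splitting rows on `feature`."""
    labels = [row[label_key] for row in rows]
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[label_key])
    remainder = sum(
        len(group) / len(rows) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remainder


# Toy rows for illustration only.
rows = [
    {"cpu_band": "high", "trend": "increasing", "sla_missed": True},
    {"cpu_band": "high", "trend": "increasing", "sla_missed": True},
    {"cpu_band": "high", "trend": "stable", "sla_missed": True},
    {"cpu_band": "low", "trend": "stable", "sla_missed": False},
    {"cpu_band": "low", "trend": "increasing", "sla_missed": False},
    {"cpu_band": "medium", "trend": "stable", "sla_missed": False},
]
gains = {
    feature: information_gain(rows, feature, "sla_missed")
    for feature in ("cpu_band", "trend")
}
best_feature = max(gains, key=gains.get)
```

On this toy data `cpu_band` separates the labels perfectly, so a tree would split on it first.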

15 Summary and Conclusions
An approach to enhance an online service’s reliability, availability and performance through:
  Use of synthetic transactions in pre-production environments
  Use of specialized logs for failure prediction
  Use of data mining on the compact specialized logs
  Identifying the environment conditions that correlate with failures, and guarding against them
Advantages
  Analysis and predictor training occur before the system goes to production
  Data sets used in creating the synthetic predictor systems are orders of magnitude smaller, easier to use, and faster to process than their production counterparts

16 Challenges and Limitations
Main challenges:
  Understanding the service at hand
  Identifying the quality of service requirements
Tactical challenges and limitations
  Producing the service in pre-production environments
    – Replication makes it easy
    – Emulation techniques are required if replication is not feasible
  Generating accurate synthetic transactions
    – Requires understanding of the service, APIs, and usage patterns
    – Exacerbated by the complexity of the service and its inter-system dependencies
  Isolating the measures of interest
    – May not always be attainable
    – Measures of interest may be grouped

17 Q&A
