

Presentation on theme: "Automated Problem Diagnosis for Production Systems" — Presentation transcript:

1 Automated Problem Diagnosis for Production Systems
Soila P. Kavulya, Scott Daniels (AT&T), Kaustubh Joshi (AT&T), Matti Hiltunen (AT&T), Rajeev Gandhi (CMU), Priya Narasimhan (CMU)
Parallel Data Laboratory, Carnegie Mellon University

2 Automated Problem Diagnosis
Diagnosing problems creates major headaches for administrators, and worsens as scale and system complexity grow.
Goal: automate it and get proactive
– Failure detection and prediction
– Problem determination (or "fingerpointing")
– Problem visualization
How: instrumentation plus statistical analysis

3 Target Systems for Validation
VoIP system at large telecom provider
– 10s of millions of calls per day, diverse workloads
– 100s of heterogeneous network elements
– Labeled traces available
Hadoop: MapReduce implementation
– Hadoop clusters with homogeneous hardware
– Yahoo! M45 & Opencloud production clusters
– Controlled experiments in Amazon EC2 cluster
– Long-running jobs (> 100s): hard to label failures

4 Assumptions of Approach
– Majority of the system is working correctly
– Problems manifest as observable behavioral changes (exceptions or performance degradations)
– All instrumentation is locally timestamped
– Clocks are synchronized to enable system-wide correlation of data
– Instrumentation faithfully captures system behavior

5 Overview of Diagnostic Approach
[Pipeline diagram: performance counters and application logs feed end-to-end trace construction, followed by anomaly detection and localization, producing a ranked list of root causes]

6 Anomaly Detection Overview
Some systems have rules for anomaly detection, e.g.,
– Redialing a number immediately after disconnection
– Server-reported error codes and exceptions
If no rules are available, rely on peer comparison
– Identifies peers (nodes, flows) in distributed systems
– Detects anomalies by identifying the "odd man out"

7 Anomaly Detection Approach
Histogram comparison identifies anomalous nodes
– Build histograms (distributions) of flow durations per node, with counts normalized to total 1.0
– Pairwise comparison of node histograms
– Detect an anomaly if the difference between histograms exceeds a pre-specified threshold
[Figure: duration histograms for a normal node vs. a faulty node]
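A minimal sketch of this odd-man-out detection, assuming a total-variation distance between normalized duration histograms and a majority vote over peers; the distance metric, the `threshold` value, and the voting rule are illustrative assumptions rather than the deck's exact criterion.

```python
import numpy as np

def duration_histogram(durations, edges):
    """Normalized histogram (counts sum to 1.0) of flow durations for one node."""
    counts, _ = np.histogram(durations, bins=edges)
    total = counts.sum()
    return counts / total if total > 0 else counts.astype(float)

def find_anomalous_nodes(node_durations, bins=20, threshold=0.5):
    """Pairwise histogram comparison: a node is flagged as anomalous if its
    histogram differs from a majority of its peers by more than the threshold."""
    all_durations = np.concatenate(list(node_durations.values()))
    edges = np.histogram_bin_edges(all_durations, bins=bins)
    hists = {n: duration_histogram(d, edges) for n, d in node_durations.items()}

    anomalous = []
    nodes = list(hists)
    for n in nodes:
        # Total-variation distance to every peer (0 = identical, 1 = disjoint).
        dists = [0.5 * np.abs(hists[n] - hists[m]).sum() for m in nodes if m != n]
        # "Odd man out": differs from most of its peers.
        if sum(d > threshold for d in dists) > len(dists) / 2:
            anomalous.append(n)
    return anomalous
```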

8 Localization Overview
1. Obtain labeled end-to-end traces (labels indicate failures and successes)
– Telecom systems: use heuristics, e.g., redialing a number immediately after disconnection
– Hadoop: use peer comparison for anomaly detection, since labeling heuristics are unavailable
2. Localize the source of problems
– Score attributes based on how well they distinguish failed calls from successful ones

9 "Truth Table" Call Representation
Log snippet:
Call1: 09:31am, SUCCESS, Server1, Server2, Phone1
Call2: 09:32am, FAIL, Server1, Customer1, Phone1

Truth table (1 = attribute present in the call):
        Server1  Server2  Customer1  Phone1  Outcome
Call1      1        1         0         1    SUCCESS
Call2      1        0         1         1    FAIL

Scale: 10s of thousands of attributes, 10s of millions of calls
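A small sketch of how calls might be turned into such a truth table, assuming the log lines follow the format in the snippet above (call id, timestamp, outcome, then the attributes involved); the parsing and helper names are hypothetical.

```python
def build_truth_table(log_lines):
    """Map each call to a binary attribute vector plus its outcome."""
    calls = {}          # call id -> (set of attributes, outcome)
    attributes = set()  # every attribute seen across all calls
    for line in log_lines:
        call_id, rest = line.split(":", 1)      # split only at the first colon
        fields = [f.strip() for f in rest.split(",")]
        _timestamp, outcome, attrs = fields[0], fields[1], fields[2:]
        calls[call_id.strip()] = (set(attrs), outcome)
        attributes.update(attrs)

    columns = sorted(attributes)
    rows = {cid: ([1 if a in attrs else 0 for a in columns], outcome)
            for cid, (attrs, outcome) in calls.items()}
    return columns, rows

columns, rows = build_truth_table([
    "Call1: 09:31am, SUCCESS, Server1, Server2, Phone1",
    "Call2: 09:32am, FAIL, Server1, Customer1, Phone1",
])
```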

10 Identify Suspect Attributes
– Estimate conditional probability distributions: Prob(Success | Attribute) vs. Prob(Failure | Attribute)
– Update belief on each distribution with every call seen
– Anomaly score: distance between the two distributions
[Figure: degree of belief vs. probability for Success | Customer1 and Failure | Customer1]
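The slide does not spell out the belief model. One plausible realization, sketched below, keeps a Beta posterior per attribute over how often it appears in failed vs. successful calls and scores the attribute by how far apart the two posteriors sit; the Beta/Bernoulli model, the uniform prior, and the mean-minus-uncertainty distance are all illustrative assumptions, not confirmed details of the tool.

```python
from math import sqrt

class AttributeBelief:
    """Beta-distributed beliefs for one attribute (e.g., Customer1): one posterior
    over how often it appears in failed calls, one over how often it appears in
    successful calls. A sketch, not the deck's exact model."""

    def __init__(self):
        # Beta(1, 1) uniform priors: [attribute present, attribute absent] pseudo-counts.
        self.in_failures = [1.0, 1.0]
        self.in_successes = [1.0, 1.0]

    def observe(self, has_attribute, failed):
        """Update the belief with one call."""
        counts = self.in_failures if failed else self.in_successes
        counts[0 if has_attribute else 1] += 1

    def anomaly_score(self):
        def mean_std(c):
            a, b = c
            mean = a / (a + b)
            var = a * b / ((a + b) ** 2 * (a + b + 1))
            return mean, sqrt(var)
        mf, sf = mean_std(self.in_failures)
        ms, ss = mean_std(self.in_successes)
        # High score when the attribute is much more common in failures than in
        # successes, discounted by the uncertainty in both posteriors.
        return max(0.0, (mf - ms) - (sf + ss))
```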

11 Find Multiple Ongoing Problems
– Search for the combination of attributes that maximizes the anomaly score, e.g., (Customer1 and ServerOS4)
– Greedy search limits the combinations explored
– Iterative search identifies multiple problems
[UI screenshot: ranked list of chronics, e.g., 1. Chronic signature1: Customer1, ServerOS4; 2. Chronic signature2: PhoneType7; failed calls plotted over time of day (GMT)]
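A minimal sketch of the greedy, iterative search described above. The call representation (a set of attributes plus an outcome per call) and the simple `score` function (occurrence-rate gap between failed and successful calls) are stand-in assumptions; the deck's actual score is the distribution distance from the previous slide.

```python
def score(attr_set, calls):
    """Illustrative anomaly score for a set of attributes: how much more often
    the full set appears in failed calls than in successful ones."""
    fails = [c for c in calls if c["outcome"] == "FAIL"]
    succs = [c for c in calls if c["outcome"] == "SUCCESS"]
    def rate(group):
        return sum(attr_set <= c["attrs"] for c in group) / len(group) if group else 0.0
    return rate(fails) - rate(succs)

def greedy_signature(calls, all_attrs, max_size=3):
    """Greedily grow one attribute combination that maximizes the score."""
    chosen, best = set(), 0.0
    while len(chosen) < max_size:
        candidates = [(score(chosen | {a}, calls), a) for a in all_attrs - chosen]
        if not candidates:
            break
        gain, attr = max(candidates)
        if gain <= best:
            break
        chosen, best = chosen | {attr}, gain
    return chosen, best

def find_chronics(calls, all_attrs, max_problems=5):
    """Iteratively extract signatures: after each one, drop the failed calls it
    explains and search again, so multiple ongoing problems surface."""
    remaining = list(calls)
    chronics = []
    for _ in range(max_problems):
        sig, sc = greedy_signature(remaining, set(all_attrs))
        if not sig or sc <= 0:
            break
        chronics.append((sig, sc))
        remaining = [c for c in remaining
                     if not (c["outcome"] == "FAIL" and sig <= c["attrs"])]
    return chronics
```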

12 Evaluation
Prototype in use by Ops team
– Daily reports over the past 2 years
– Helped Ops quickly discover new chronics
For example, to analyze 25 million VoIP calls:
– 2 × 2.4 GHz Xeon cores, < 1 GB of memory
– Data loading: 1.75 minutes for 6 GB of data
– Diagnosis: ~4 seconds per signature (near-interactive)

13 Call Quality (QoS) Violations
Incident at ISP:
– Message loss (> 1%) used as the event failure indicator
– Draco showed most QoS issues were tied to specific customers (customer name, IP) and not to ISP network elements, as was previously believed
[UI screenshot: ranked chronics, e.g., 1. Chronic Signature1: Service_A, Customer_A; 2. Chronic Signature2: Service_A, Customer_N, IP_Address_N; failed calls plotted over time of day (GMT)]

14 In Summary…
– Use peer comparison for anomaly detection
– Localize the source of problems using statistics
– Applicable when end-to-end traces are available, e.g., customer, network element, version conflicts
– Approach used on Trone might vary: depends on the instrumentation available and on the fault model

