1 UC Berkeley Online System Problem Detection by Mining Console Logs Wei Xu* Ling Huang † Armando Fox* David Patterson* Michael Jordan* *UC Berkeley † Intel Labs Berkeley

2 Why console logs?
Detecting problems in large-scale Internet services often requires detailed instrumentation.
Instrumentation can be costly to insert and maintain:
– High code churn
– Services often combine open-source building blocks that are not all instrumented
Can we use console logs in lieu of instrumentation?
+ Easy for developers, so nearly all software has them
– Imperfect: not originally intended for instrumentation

3 Problems we are looking for
The easy case – rare messages
Harder but useful – abnormal sequences
NORMAL: receiving blk_1, received blk_1, receiving blk_2
ERROR: what is wrong with blk_2?
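The "easy case" above can be handled by simple frequency counting. A minimal, hypothetical sketch (the `min_fraction` cutoff is an assumption for illustration, not a value from this work):

```python
from collections import Counter

def rare_message_types(message_types, min_fraction=1e-4):
    """Flag message types that occur rarely in the log (the 'easy case').

    `message_types` is a list of parsed message-type strings; `min_fraction`
    is an assumed cutoff, not a value from this work.
    """
    counts = Counter(message_types)
    total = sum(counts.values())
    return {t for t, c in counts.items() if c / total < min_fraction}
```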

4 Overview and Contributions
Large-scale system problem detection by mining console logs (SOSP '09)*
Accurate online detection with small latency
[Pipeline figure: free-text logs from 200 nodes → Parsing* → frequent-pattern-based filtering (dominant cases → OK) → non-pattern events → PCA detection (normal cases → OK, real anomalies → ERROR)]

5 Constructing event traces from console logs
Parse: message type + variables
Group messages by identifiers (automatically discovered)
– Group ~= event trace
[Figure: interleaved log lines – receiving blk_1, received blk_1, reading blk_1, receiving blk_2, received blk_2, receiving blk_2 – grouped by identifier into per-block event traces]
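To make the parse-and-group step concrete, here is a minimal Python sketch. It is a simplification under stated assumptions: the actual parser in this work derives message templates from source code, whereas this sketch only abstracts the `blk_*` identifier with an illustrative regex.

```python
import re
from collections import defaultdict

# Assumed, simplified parsing: the real system derives message templates from
# the source code; here we only abstract the blk_* identifier out of each line.
BLOCK_ID = re.compile(r"blk_[-\d]+")

def parse(line):
    """Return (message_type, identifier) for one console-log line."""
    match = BLOCK_ID.search(line)
    identifier = match.group(0) if match else None
    # Message type = the line with the variable part (the block id) removed.
    message_type = BLOCK_ID.sub("(BLK)", line).strip()
    return message_type, identifier

def group_into_traces(lines):
    """Group parsed messages by identifier; each group ~= one event trace."""
    traces = defaultdict(list)
    for line in lines:
        message_type, identifier = parse(line)
        if identifier is not None:
            traces[identifier].append(message_type)
    return traces

log = ["receiving blk_1", "received blk_1", "reading blk_1",
       "receiving blk_2", "received blk_2"]
print(group_into_traces(log))
# {'blk_1': ['receiving (BLK)', 'received (BLK)', 'reading (BLK)'],
#  'blk_2': ['receiving (BLK)', 'received (BLK)']}
```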

6 Online detection: when to make a detection?
Cannot wait for the entire trace – it can last an arbitrarily long time
How long do we have to wait? Long enough to keep correlations; a wrong cut = false positive
Difficulties:
– No obvious boundaries
– Inaccurate message ordering
– Variations in session duration
[Figure: example timeline – receiving blk_1, received blk_1, reading blk_1, deleting blk_1, deleted blk_1, receiving blk_1, received blk_1]

7 Frequent patterns help determine session boundaries
Key insight: most messages/traces are normal
– Strong patterns
– “Make common paths fast”
– Tolerate noise

8 Two stage detection overview
[Pipeline figure: free-text logs from 200 nodes → Parsing → frequent-pattern-based filtering (dominant cases → OK) → non-pattern events → PCA detection (normal cases → OK, real anomalies → ERROR)]
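As a rough illustration of the two-stage flow in the diagram, the per-trace control logic might look like the sketch below; the helper names `timed_out` and `pca_detector` are hypothetical placeholders, not interfaces from this work.

```python
def detect(trace, frequent_patterns, timed_out, pca_detector):
    """Two-stage decision for one event trace (illustrative control flow only).

    Stage 1: traces that match a frequent pattern within its expected duration
    are dominant, normal cases and are filtered out cheaply.
    Stage 2: everything else (non-pattern or timed-out traces) goes to the
    noise-tolerant PCA detector.
    """
    if frozenset(trace) in frequent_patterns and not timed_out(trace):
        return "OK"                       # dominant case
    return "ERROR" if pca_detector.is_anomalous(trace) else "OK"
```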

9 Stage 1 - Frequent patterns (1): Frequent event sets
Coarse cut by time → find frequent item set → refine time estimation
Repeat until all patterns are found; remaining events (e.g. reading blk_1, error blk_1) go to PCA detection
[Figure: timeline of events – receiving blk_1, received blk_1, reading blk_1, deleting blk_1, deleted blk_1, receiving blk_1, received blk_1]
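A hedged sketch of one mining iteration described above, assuming each trace carries timestamps and using an illustrative support threshold; the real procedure also refines each pattern's duration estimate and repeats until no new patterns emerge.

```python
from collections import Counter

def frequent_event_sets(traces, window, support=0.2):
    """One pass of the mining loop (a sketch; `window` and `support` are assumed).

    `traces` maps an identifier to a list of (timestamp, event) pairs.
    1. Coarse cut by time: keep only events within `window` of the first event.
    2. Count the resulting event sets across all traces.
    3. Keep sets that occur in at least a `support` fraction of traces.
    """
    counts = Counter()
    for events in traces.values():
        if not events:
            continue
        start = events[0][0]
        cut = frozenset(e for t, e in events if t - start <= window)
        counts[cut] += 1
    total = len(traces)
    return {s for s, c in counts.items() if c / total >= support}
```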

10 Stage 1 - Frequent patterns (2): Modeling session duration time
Assuming Gaussian? The 99.95th percentile estimate is off by half → 45% more false alarms
Instead, use a mixture distribution: power-law tail + histogram head
[Figures: duration histogram (Count vs. Duration) and tail distribution Pr(X >= x)]
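One plausible way to realize the "power-law tail + histogram head" model is sketched below; the tail fraction, the Pareto maximum-likelihood fit, and the 99.95th-percentile target are assumptions for illustration, not the exact estimator used in this work.

```python
import numpy as np

def percentile_with_powerlaw_tail(durations, q=0.9995, tail_frac=0.05):
    """Estimate a high duration quantile from a histogram head + power-law tail.

    A sketch only: `tail_frac` (fraction of samples modeled by the tail) and the
    Pareto MLE fit are assumptions; durations are assumed positive and
    non-degenerate in the tail.
    """
    x = np.sort(np.asarray(durations, dtype=float))
    n_tail = max(int(np.ceil(tail_frac * len(x))), 2)
    tail = x[-n_tail:]                                  # largest tail_frac of the samples
    x_min = tail[0]                                     # tail cutoff
    alpha = len(tail) / np.sum(np.log(tail / x_min))    # Pareto MLE exponent
    p_exceed = 1.0 - q
    if p_exceed < n_tail / len(x):
        # Invert the tail CCDF: Pr(X >= t) = (n_tail/n) * (t / x_min) ** (-alpha)
        return x_min * (p_exceed * len(x) / n_tail) ** (-1.0 / alpha)
    return float(np.quantile(x, q))                     # quantile falls in the empirical head
```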

11 Stage 2 - Handling noise with PCA detection
Principal Component Analysis (PCA) based detection – more tolerant to noise
[Pipeline figure repeated: parsing → frequent-pattern-based filtering → PCA detection]
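A compact sketch of PCA-based detection on per-trace event-count vectors; the number of components `k` and the residual (squared prediction error) threshold rule are assumptions, not the exact choices in this work.

```python
import numpy as np

def fit_pca_detector(X, k=4, alpha=0.001):
    """Fit a PCA-based anomaly detector on event-count vectors.

    X has one row per trace and one column per message type. A sketch: `k` and
    the empirical SPE-quantile threshold are illustrative assumptions.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    # Principal components via SVD; the top k span the "normal" subspace.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                               # d x k projection basis
    residual = Xc - Xc @ P @ P.T               # projection onto residual subspace
    spe = np.sum(residual ** 2, axis=1)        # squared prediction error per trace
    threshold = np.quantile(spe, 1 - alpha)    # flag the top alpha fraction
    return mean, P, threshold

def is_anomalous(x, mean, P, threshold):
    """Flag a new event-count vector whose residual energy exceeds the threshold."""
    r = (x - mean) - (x - mean) @ P @ P.T
    return float(np.sum(r ** 2)) > threshold
```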

12 Frequent pattern matching filters most of the normal events
[Pipeline figure with event volumes: 100% of parsed events enter frequent-pattern-based filtering; 86% are dominant cases (OK); the remaining 14% go to PCA detection, of which 13.97% are normal cases and 0.03% are real anomalies]

13 Evaluation setup
Hadoop file system (HDFS) on Amazon’s EC2 cloud
– 203 nodes x 48 hours, running standard map-reduce jobs
– ~24 million lines of console logs
– 575,000 traces, ~680 distinct ones
Manual labels from previous work (for evaluation only)
– Normal/abnormal + why it is abnormal
– “Eventually normal” – did not consider time

14 Frequent patterns in HDFS
Frequent Pattern             | 99.95th percentile duration | % of messages
Allocate, begin write        | 13 sec                      | 20.3%
Done write, update metadata  | 8 sec                       | 44.6%
Delete                       | -                           | 12.5%
Serving block                | -                           | 3.8%
Read exception               | -                           | 3.2%
Verify block                 | -                           | 1.1%
Total                        |                             | 85.6%
(Total events ~20 million)
Covers most messages; short durations

15 Detection latency
Detection latency is dominated by the wait time
[Figure: latency by event type – single event pattern, frequent pattern (matched), frequent pattern (timed out), non-pattern events]

16 Detection accuracy
        | True Positives | False Positives | False Negatives | Precision | Recall
Online  | 16,916         | 2,748           | 0               | 86.0%     | 100.0%
Offline | 16,808         | 1,746           | 108             | 90.6%     | 99.3%
(Total traces = 575,319)
Ambiguity on “abnormal”: the manual labels are “eventually normal”
> 600 FPs in online detection are due to very long latency, e.g. a write session that takes > 500 sec to complete (the 99.99th percentile is 20 sec)

17 Future work
Distributed log stream processing
– Handle large-scale clusters + partial failures
Clustering alarms
Allowing feedback from operators
Correlation on logs from multiple applications / layers

18 Summary
Online detection: 86.0% precision, 100.0% recall
[Pipeline figure: 24 million lines of free-text logs from 200 nodes → parsing → frequent-pattern-based filtering → PCA detection]
Wei Xu – http://www.cs.berkeley.edu/~xuw/

