Experience Report: System Log Analysis for Anomaly Detection

Presentation transcript:

Experience Report: System Log Analysis for Anomaly Detection Shilin He, Jieming Zhu, Pinjia He, and Michael R. Lyu Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong 2016/10/26

Outline
Background & Motivation
Framework
Supervised Anomaly Detection
Unsupervised Anomaly Detection
Evaluation
Conclusion

Background Operating systems, software frameworks, distributed systems, etc.

Background Especially, many online services and applications are deployed on distributed systems. …

Background
System breakdown causes significant revenue loss.
System anomaly detection could pinpoint issues promptly and help resolve them immediately.

Background
Logs are the main data source for system anomaly detection.
Logs are routinely generated by systems (e.g., on a 24 x 7 basis).
Logs record detailed runtime information, e.g., timestamp, state, IP address.

Background
Check logs manually? Oh, NO! Manual inspection of logs becomes impossible:
Systems are often implemented by hundreds of developers.
Logs are generated at a high rate, and noisy data are hard to distinguish.
Systems generate duplicated logs due to fault-tolerance mechanisms.
Many automated log-based anomaly detection methods have been proposed!

Background
Log-based anomaly detection methods:
Failure diagnosis using decision trees [ICAC’04]
Failure prediction in IBM BlueGene/L event logs [ICDM’07]
Detecting large-scale system problems by mining console logs [SOSP’09]
Mining invariants from console logs for system problem detection [USENIX ATC’10]
Log clustering based problem identification for online service systems [ICSE’16]
…

Motivation
A gap between academia and industry:
Developers are not aware of the state-of-the-art log-based anomaly detection methods.
No open-source tools are currently available.
There is a lack of comparison among existing anomaly detection methods.

Framework

1. Log Collection

2. Log Parsing

3. Feature Extraction
Divide all logs into log sequences (windows); each log sequence corresponds to one row in the event count matrix.
Windows            Basis
Fixed windows      Time
Sliding windows    Time
Session windows    Identifiers
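The windowing step can be sketched as follows. This is a minimal illustration, not the toolkit's code: the function names, the `(timestamp, event)` and `(session_id, event)` input shapes, and the treatment of a fixed window as a sliding window whose step equals its size are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def sliding_windows(logs, window_size, step_size):
    """Split (timestamp, event) pairs into time windows.

    A fixed window is the special case step_size == window_size.
    """
    if not logs:
        return []
    logs = sorted(logs)
    start, end = logs[0][0], logs[-1][0]
    windows = []
    while start <= end:
        # Each window covers the half-open interval [start, start + window_size).
        windows.append([e for t, e in logs if start <= t < start + window_size])
        start += step_size
    return windows


def session_windows(logs):
    """Group (session_id, event) pairs by identifier, in first-seen order."""
    sessions = defaultdict(list)
    for session_id, event in logs:
        sessions[session_id].append(event)
    return list(sessions.values())


def event_count_matrix(windows, event_ids):
    """Build one row per log sequence and one column per event type."""
    rows = []
    for window in windows:
        counts = Counter(window)
        rows.append([counts[e] for e in event_ids])
    return rows
```

With five timed events and a window of 4 time units, a step of 4 yields two non-overlapping rows, while a step of 2 yields four overlapping rows, so sliding windows produce more (and correlated) log sequences from the same data.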

4. Anomaly Detection
Anomaly detection methods:
Supervised: Logistic Regression, Decision Tree, Support Vector Machine
Unsupervised: Log Clustering, PCA, Invariants Mining

Supervised Anomaly Detection
Methods: Logistic Regression, Decision Tree, Support Vector Machine (SVM)
General procedure: split all data into training and testing sets, build the model with the training data, apply the model on the testing data, and measure performance.
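The general procedure (split, build, apply) might look like the sketch below, using logistic regression as the example model. It is written in plain Python with batch gradient descent so that it is self-contained; the 80/20 chronological split, the learning rate, the epoch count, and all function names are assumptions for illustration and are not taken from the slides.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def split(X, y, train_ratio=0.8):
    """Use the first part of the data for training and the rest for testing."""
    cut = int(len(X) * train_ratio)
    return X[:cut], y[:cut], X[cut:], y[cut:]


def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, label in zip(X, y):
            error = _sigmoid(b + sum(wi * xi for wi, xi in zip(w, row))) - label
            for j, xj in enumerate(row):
                grad_w[j] += error * xj
            grad_b += error
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(model, X, threshold=0.5):
    """Label a row as anomalous (1) when its predicted probability reaches the threshold."""
    w, b = model
    return [int(_sigmoid(b + sum(wi * xi for wi, xi in zip(w, row))) >= threshold)
            for row in X]
```

Each row of `X` is an event count vector and each label in `y` is 0 (normal) or 1 (anomaly); the predictions on the held-out rows are then compared against their labels to measure performance.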

Supervised Anomaly Detection
Trained decision tree example [figure: decision tree with a leaf labeled "Anomaly"]

Supervised Anomaly Detection
Trained SVM example [figure: anomalies and normal instances]

Log Clustering
Phases: knowledge base initialization, online learning, detection.
Knowledge base initialization: log vectorization, log clustering, representative extraction.
Online learning: compute distance (new instance, representatives), add into cluster, update representatives.
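The online learning and detection steps can be sketched roughly as below. Several choices here are assumptions made for illustration and are not stated on the slide: Euclidean distance, the cluster centroid as the representative, a single distance threshold, and the rule that an instance far from every representative is reported as an anomaly.

```python
import math


def distance(a, b):
    """Euclidean distance between two event count vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(members):
    """Representative of a cluster, taken here as the mean vector (assumption)."""
    return [sum(col) / len(members) for col in zip(*members)]


def nearest(clusters, vector):
    """Return (index, distance) of the cluster whose representative is closest."""
    distances = [distance(centroid(members), vector) for members in clusters]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]


def learn_online(clusters, vector, threshold):
    """Add the vector to the nearest cluster, or open a new cluster for it."""
    if clusters:
        index, dist = nearest(clusters, vector)
        if dist <= threshold:
            # The representative is recomputed from the members on the next call.
            clusters[index].append(vector)
            return clusters
    clusters.append([vector])
    return clusters


def is_anomaly(clusters, vector, threshold):
    """Flag a vector that is farther than the threshold from every representative."""
    _, dist = nearest(clusters, vector)
    return dist > threshold
```

Recomputing the centroid on every call keeps the sketch short; a real implementation would more likely cache and incrementally update the representatives.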

PCA
Two subspaces are generated by PCA:
S_n: normal space, constructed by the first k principal components.
S_a: anomaly space, constructed by the remaining (n-k) components.
Project y into the anomaly space using
$$ y_a = (I - P P^T)\, y $$
where P = [v_1, v_2, \dots, v_k] is the matrix whose columns are the first k principal components.
An event count vector is regarded as an anomaly if
$$ \mathrm{SPE} \equiv \lVert y_a \rVert^2 > Q $$
where SPE is the squared prediction error and Q is the threshold.
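A small NumPy sketch of the projection and the squared prediction error, SPE = ||(I - P P^T) y||^2, is given below. It assumes the event count matrix is mean-centered before the principal components are taken and that the threshold is supplied by the caller; neither detail is stated on the slide, and the function names are hypothetical.

```python
import numpy as np


def fit_normal_space(X, k):
    """Return the column means and the first k principal components of X."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k].T  # P has shape (n_events, k)


def squared_prediction_error(y, mean, P):
    """Squared length of y after removing its part in the normal space."""
    centered = y - mean
    y_a = centered - P @ (P.T @ centered)  # equals (I - P P^T) applied to centered y
    return float(y_a @ y_a)


def is_anomaly(y, mean, P, threshold):
    """Flag an event count vector whose SPE exceeds the threshold."""
    return squared_prediction_error(y, mean, P) > threshold
```

If two event counts always move together in the training rows, a vector where they diverge lies mostly in the anomaly space and receives a large SPE, while a vector that follows the usual pattern scores close to zero.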

Invariants Mining
[Figure: example code and its program execution flow]

Invariants Mining
Main process:
1. Build the event count matrix.
2. Estimate the invariant space (r invariants) using SVD.
3. Search invariants with a brute-force algorithm.
4. Validate the mined invariants, repeating until r invariants are obtained.
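The main process can be illustrated with the simplified NumPy sketch below. It is a toy version under stated assumptions: invariants are exact linear relations X @ theta = 0 over integer event counts, candidate coefficients are restricted to -1, 0, and 1, and validation means the relation holds on every row. The function names are hypothetical and the search is not the authors' algorithm.

```python
import itertools

import numpy as np


def invariant_space_dimension(X, tol=1e-8):
    """Estimate r as the number of columns minus the numerical rank of X."""
    singular_values = np.linalg.svd(X, compute_uv=False)
    return X.shape[1] - int(np.sum(singular_values > tol))


def mine_invariants(X, coefficients=(-1, 0, 1)):
    """Brute-force search for r independent vectors theta with X @ theta == 0."""
    r = invariant_space_dimension(X)
    found = []
    for candidate in itertools.product(coefficients, repeat=X.shape[1]):
        if len(found) == r:
            break
        theta = np.array(candidate)
        if not theta.any():
            continue  # skip the all-zero vector
        # Validate the candidate: it must hold on every log sequence.
        if not np.all(X @ theta == 0):
            continue
        # Keep it only if it is independent of the invariants found so far.
        if np.linalg.matrix_rank(np.array(found + [theta])) == len(found) + 1:
            found.append(theta)
    return found


def violates(row, invariants):
    """A log sequence is anomalous if it breaks any mined invariant."""
    return any(int(np.dot(row, theta)) != 0 for theta in invariants)
```

For instance, if the first and third columns count "open" and "close" events and every normal sequence closes what it opens, the search recovers the relation count(open) = count(close), and a sequence with an unmatched open violates it.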

Evaluation
Data sets and performance metric.
Windowing: fixed windows & sliding windows, or session windows.
Where labels are given for each block, we can only use the session window to extract features.

Evaluation
Q1: What is the accuracy of supervised anomaly detection?
Q2: What is the accuracy of unsupervised anomaly detection?
Q3: What is the efficiency of these anomaly detection methods?

Evaluation 1. Accuracy of Supervised Methods
Finding 1: Supervised anomaly detection achieves high precision, while recall varies.
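Precision and recall, as used in Finding 1, can be computed from predicted and true labels as in the sketch below. The F-measure is included as an assumed summary metric (the harmonic mean of the two); the transcript does not spell out the metric definitions, and the function name is hypothetical.

```python
def precision_recall_f(y_true, y_pred):
    """Compute precision, recall, and F-measure with 1 marking an anomaly."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # anomalies correctly reported
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed anomalies
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

A detector that reports only one of three true anomalies and raises no false alarms has perfect precision but a recall of one third, which is the kind of gap Finding 1 describes.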

Evaluation 1. Accuracy of Supervised Methods Finding 2: Sliding windows achieve higher accuracy than fixed windows

Evaluation 2. Accuracy of Unsupervised Methods
Finding 3: Unsupervised methods are not as good as supervised methods, except Invariants Mining.

Evaluation 3. Effects of window setting on supervised & unsupervised methods
Finding 4: Different window sizes and step sizes affect the methods differently.

Evaluation 4. Efficiency of Anomaly Detection Methods
Finding 5: Most anomaly detection methods scale linearly with log size, except Log Clustering and Invariants Mining.

Conclusion
In this paper, we fill the gap by:
providing a detailed review and evaluation of six state-of-the-art anomaly detection methods (over 4000 lines of Python code);
comparing their accuracy and efficiency on two representative production log datasets;
releasing an open-source toolkit of these anomaly detection methods for easy reuse and further study.

Demo https://github.com/cuhk-cse/loglizer

Thanks! Q & A