Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology): Mining Logs Files for Data-Driven System Management


1 Mining Logs Files for Data-Driven System Management
Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology)
Advisor: Dr. Hsu
Presenter: Wen-Hsiang Hu
Authors: Wei Peng; Tao Li; Sheng Ma
SIGKDD Explorations, 2005, pages 44-51

2 Outline
  Motivation
  Objective
  Introduction
  Common Categories
  Message Categorization
  Naïve Bayes Classifier
  Modified Naïve Bayes algorithm
  Hidden Markov Model
  Experiments
  Conclusion
  Future Work

3 Motivation
A popular approach to system management is based on analyzing system log files.
However, some aspects of log files, such as their temporal characteristics, have received less emphasis in existing methods from the data mining and machine learning communities.

4 Objective
We describe our research efforts on mining system log files for automatic management.
Automated log data analysis can be performed without much domain knowledge, and its results give network managers guidance for doing their jobs more effectively.

5 Introduction
We apply text mining techniques to automatically categorize text messages with disparate formats into a set of common categories.
  Example (start-up): component A reports “A has started” in its log file, while component B reports “B has begun execution”.
We improve categorization accuracy by considering the temporal characteristics of log messages.
We use visualization tools to help users understand and interpret patterns in the data.

6 Common Categories
We transform the messages in the log files into a set of common categories.
We first manually determine a set of categories as the basis for the transformation.
The set of categories is based on the CBE (Common Base Event) format established by an IBM initiative [26].
The categories are start, stop, dependency, create, connection, report, request, configuration, and other.

7 Message Categorization
The goal is to assign predefined category labels to unlabeled documents based on the likelihood inferred from a training set of labeled documents.
We use Naïve Bayes as our classification approach.
It uses the training data to calculate Bayes-optimal estimates of the model parameters.
The most probable class is then assigned to each test document.
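The categorization step described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the multinomial bag-of-words model, the add-one smoothing, the whitespace tokenizer, and the toy training messages are all assumptions introduced here.

```python
import math
from collections import Counter, defaultdict


def tokenize(message):
    """Lowercase whitespace tokenization (an assumption, not from the slides)."""
    return message.lower().split()


def train_naive_bayes(labeled_messages):
    """Collect class counts and per-class word counts from (message, category) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for message, category in labeled_messages:
        class_counts[category] += 1
        for word in tokenize(message):
            word_counts[category][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def log_scores(message, model):
    """Return log P(C_j) + sum of log P(word | C_j) for every category."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for category, count in class_counts.items():
        score = math.log(count / total)
        # Add-one (Laplace) smoothing so unseen words do not zero out a class.
        denom = sum(word_counts[category].values()) + len(vocabulary)
        for word in tokenize(message):
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[category][word] + 1) / denom)
        scores[category] = score
    return scores


def classify(message, model):
    """Assign the most probable category to a message."""
    scores = log_scores(message, model)
    return max(scores, key=scores.get)


# Hypothetical training messages, loosely modeled on the start-up example.
TRAINING = [
    ("A has started", "start"),
    ("B has begun execution", "start"),
    ("service C started successfully", "start"),
    ("A has stopped", "stop"),
    ("service C shut down", "stop"),
    ("B stopped execution", "stop"),
]

model = train_naive_bayes(TRAINING)
predicted = classify("component D has started", model)
print(predicted)
```

Working in log space avoids numerical underflow when messages are long; the ranking of categories is the same as with the raw products.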

8 Naïve Bayes Classifier
Suppose there are L categories, denoted by C_1, C_2, ..., C_L.
We can characterize the likelihood of a document as a sum of probabilities over all the categories.
Given a set of training samples S, the Naïve Bayes classifier uses S to estimate P(d_i | C_j) and P(C_j).
To classify a new sample, it uses Bayes' rule to compute the class posterior.
The predicted class for the document d_i is then just argmax_j P(C_j | d_i).
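The formulas on this slide did not survive in the transcript. The standard forms that match the surrounding text, using the slide's own symbols, are sketched below; they are reconstructed from the textbook definition of Naïve Bayes rather than copied from the original slide.

```latex
% Likelihood of a document as a sum over all L categories
P(d_i) = \sum_{j=1}^{L} P(d_i \mid C_j)\, P(C_j)

% Bayes' rule for the class posterior
P(C_j \mid d_i) = \frac{P(d_i \mid C_j)\, P(C_j)}{P(d_i)}

% Predicted class; P(d_i) does not depend on j, so it can be dropped
\hat{C}(d_i) = \arg\max_{j} P(C_j \mid d_i) = \arg\max_{j} P(d_i \mid C_j)\, P(C_j)
```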

9 Incorporating the temporal information: Modified Naïve Bayes algorithm
In many scenarios, the text messages in log files carry timestamps.
If a sequence of log messages is considered, the categorization accuracy for each message can be improved.
For example, a component usually first starts up a process and then stops it at a later time.

10 Incorporating the temporal information: Modified Naïve Bayes algorithm (cont.)
Suppose we are given a sequence of adjacent messages D = (d_1, d_2, ..., d_T).
Let Q_i be the category label for message d_i (i.e., Q_i is one of C_1, C_2, ..., C_L).
Now we want to classify d_{i+1}, weighing two quantities:
  text classification probability P(C_j | d_{i+1})
  versus
  state transition probability P(C_j | Q_i)
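One way to read the comparison above as a single decision rule is sketched below. The product form is taken from the expression P(C_j | d_{i+1}) * P(C_j | Q_i) shown on a later slide; writing it as an argmax, and leaving out any normalization, is an assumption made here for illustration.

```latex
% Label for message d_{i+1}, given the label Q_i of the previous message
\hat{Q}_{i+1} = \arg\max_{j} \; P(C_j \mid d_{i+1}) \, P(C_j \mid Q_i)
```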

11 Incorporating the temporal information: Modified Naïve Bayes algorithm (cont.)
One example of the effectiveness of modified Naïve Bayes: text classification probability P(C_j | d_{i+1})
[Figure: previous state and current state, with time considered vs. time not considered]

12 Incorporating the temporal information: Modified Naïve Bayes algorithm (cont.)
State transition probability P(C_j | Q_i).
Combined score P(C_j | d_{i+1}) * P(C_j | Q_i):
  configuration to start:          0.4430199 * 0.1869
  configuration to configuration:  0.4981456 * 0.1111
  0.4430199 * 0.1869 > 0.4981456 * 0.1111
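The arithmetic behind this comparison can be checked with a short sketch. The pairing of the numbers with the two transitions is read off the flattened slide layout and should be treated as an assumption; the function name and dictionaries are hypothetical, and only the two categories shown on the slide are included.

```python
def modified_naive_bayes(text_probs, transition_probs):
    """Pick the category maximizing P(C_j | d_{i+1}) * P(C_j | Q_i)."""
    scores = {c: text_probs[c] * transition_probs[c] for c in text_probs}
    return max(scores, key=scores.get), scores


# Text classification probabilities P(C_j | d_{i+1}) from the slide.
text_probs = {"start": 0.4430199, "configuration": 0.4981456}

# Transition probabilities P(C_j | Q_i) when the previous label Q_i is
# "configuration" (assumed pairing, see the note above).
transition_from_configuration = {"start": 0.1869, "configuration": 0.1111}

# Without temporal information: text classification alone.
text_only = max(text_probs, key=text_probs.get)

# With temporal information: weight each category by the transition probability.
with_time, scores = modified_naive_bayes(text_probs, transition_from_configuration)

print(text_only)   # label chosen by text classification alone
print(with_time)   # label chosen once the previous state is considered
print(scores)
```

Under this reading, text classification alone slightly prefers configuration, while the combined score (about 0.0828 against about 0.0553) prefers start.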

13 Incorporating the temporal information: Hidden Markov Model
Associated with each of a set of states, S = {s_1, ..., s_n}, is a probability distribution over the symbols in the emission vocabulary K = {k_1, ..., k_m}.
There is also a prior state distribution π(s).
Training data consists of several sequences of observed emissions, one of which would be written {o_1, ..., o_x}.
Here the category labels serve as the states and the log messages as the emission vocabulary.
The HMM explicitly considers the state transition sequence.

14 Incorporating the temporal information: Hidden Markov Model (cont.)
Consider the case where one log message has been assigned several competing state labels by text classification.
For example, the message “The Windows Installer initiated a system restart to complete or continue the configuration of ’Microsoft XML Parser’.”
  has probability 0.4327 of being categorized into the configuration state
  has probability 0.4164 of being labeled as the stop state
The state transition probability is calculated from the training log data sets.
The probability of emitting a message can be estimated as the ratio of the occurrence probability of the log message to the occurrence probability of its state in the training data.
The Viterbi algorithm is used to find the most probable state sequence that emits the given log messages.
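The decoding step can be illustrated with a small Viterbi sketch. Everything numeric here is hypothetical: the three states, the symbolic message identifiers ("install", "restart", "begun", "halted"), and all probabilities are invented for illustration and are not the values estimated in the paper.

```python
import math


def viterbi(observations, states, prior, transition, emission):
    """Return the most probable state sequence for the observations (log space)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: best log-probability of any path that ends in state s at step t
    best = [{s: log(prior[s]) + log(emission[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(transition[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emission[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: category labels are states, messages are emissions.
states = ["configuration", "start", "stop"]
prior = {"configuration": 0.5, "start": 0.3, "stop": 0.2}
transition = {
    "configuration": {"configuration": 0.3, "start": 0.5, "stop": 0.2},
    "start": {"configuration": 0.2, "start": 0.2, "stop": 0.6},
    "stop": {"configuration": 0.3, "start": 0.6, "stop": 0.1},
}
emission = {
    "configuration": {"install": 0.50, "restart": 0.45, "begun": 0.05},
    "start": {"install": 0.10, "restart": 0.10, "begun": 0.80},
    "stop": {"restart": 0.40, "halted": 0.55, "begun": 0.05},
}

path = viterbi(["begun", "restart", "begun"], states, prior, transition, emission)
print(path)
```

In this toy model the ambiguous "restart" message is slightly more likely under configuration than under stop when viewed in isolation, but the surrounding start messages make stop the better label along the whole sequence, which is the kind of disambiguation the slide describes.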

15 Experiments
The raw log files used as training data are labeled with nine categories: configuration, connection, create, dependency, other, report, request, start, and stop.
Table 2 lists the keywords and their probabilities in the corresponding classes.

16 Experiments (cont.)

17 Experiments (cont.)
Figure 1: The 2D plot of the “raven” data set. The X axis is time; the Y axis is the state.
We observe that the dotted lines in the create and connection categories occur alternately.
We note that the connection problems occur after the create problems for at least three days.

18 Experiments (cont.)
The 2D plot of the “raven” data set. The X axis is time; the Y axis is the component.
Component 5 reports problems synchronously whenever there are problems in component 6.
After component 0 starts, it tends to keep generating report messages.

19 Experiments (cont.)
A 3D plot helps users understand and interpret complicated patterns.

20 Conclusion
We propose two approaches (Modified Naïve Bayes and HMM) for incorporating temporal information to improve the performance of categorizing log messages.

21 Future work
Instead of manually determining the set of common categories, we could develop techniques (e.g., clustering) to automatically infer them from historical data.
The number of different common categories for system management can be large, so we could utilize the dependence relationships among different categories.
We could develop methods that efficiently discover interesting temporal patterns from the transformed log files.

