
TraceBench: An Open Data Set for Trace-Oriented Monitoring
Jingwen Zhou 1, Zhenbang Chen 1, Ji Wang 1, Zibin Zheng 2, and Michael R. Lyu 1,2
1 PDL, National University of Defense Technology, Changsha, China
2 Shenzhen Research Institute, CUHK, Shenzhen, China
{jwzhou, 18 December

Motivation

Cloud System
Benefiting our daily life …
Supporting different fields …

Cloud System
Benefiting our daily life and supporting different fields.
Increasing in complexity. Increasing in scale.
In August 2013, meltdowns successively happened in:
Company   Loss
Amazon    Lost 7 million dollars in 100 minutes.
Google    Lost 550,000 dollars in less than 5 minutes, and the global internet traffic dropped 40%.
…

System Monitoring
System monitoring is an important method to improve the reliability of cloud systems at runtime.
In our category, system monitoring includes the activities of:
- Supervising
- Analyzing
- Adjusting
such as the topics of tracing, failure detection, fault diagnosis, system recovery …

System Monitoring
Based on the recorded/used data, monitoring systems can be divided into:
- Resource-oriented monitoring: records/uses resource metrics, such as CPU and memory. Examples: Ganglia, Chukwa …
- Trace-oriented monitoring: records/uses the paths of system execution, called traces. Examples: X-Trace, Dapper, Zipkin …
Generally speaking, traces provide more valuable information than resource metrics in revealing system behaviors.
Currently, more and more attention is paid to trace-oriented monitoring, which effectively improves the reliability of cloud systems.

Motivation
Trace data is essential for trace-oriented monitoring topics, e.g., for evaluating the designed algorithms. However, trace data is difficult to obtain:
1. Collecting traces by hand is a tedious and time-consuming process:
   - choosing or implementing a tracing system
   - instrumenting and deploying a target system
   - designing and collecting traces
   - …
2. Manually synthesized traces are weak in authenticity and usability.
3. Few trace data sets exist in academia and industry.
4. Companies do not want to release their traces, which record the internal details of their systems.

Motivation
TraceBench targets the trace-oriented monitoring topics: supervising, analyzing, adjusting.
Outline:
1. Data Format
2. Other Details
3. Applications
4. Discussion

Data Format

Trace
The trace records the process of system running, and can be expressed as:
- Linear event sequence
- Trace tree
- Request flow graph
The figure arranges these forms along two axes: complexity (low to high) and amount of information.
TraceBench stores the traces in the form of trace trees.

Trace Tree

Trace Tree
Trace = events + relationships
- Event: function name, latency …
- Relationship: local and remote function calls …
Trace => trace tree
- Nodes correspond to events.
- Edges correspond to relationships.
Trace tree => linear event sequence, by depth-first search (DFS): 1, 2, 4, 3, 5
Example events:
  id: 1  name: fs -mv ……
  id: 2  name: RPC:getFileInfo ……
  ……
Example relationships:
  (edge a) fatherId: 1  childId: 2  type: local call ……
  ……
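
The mapping from events and relationships to a tree, and from the tree to a linear sequence, can be sketched as follows. This is a minimal illustration, not TraceBench's own code: the field names `id`, `name`, `fatherId`, `childId`, and `type` come from the slide, while the names of events 3 to 5, the edges other than 1 -> 2, and their call types are assumptions chosen so that the DFS order matches the slide's 1, 2, 4, 3, 5.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One node of the trace tree."""
    id: int
    name: str
    children: list = field(default_factory=list)


def build_tree(events, relationships):
    """Link events into a tree using fatherId/childId pairs and return the root."""
    nodes = {e["id"]: Event(e["id"], e["name"]) for e in events}
    child_ids = set()
    for rel in relationships:
        nodes[rel["fatherId"]].children.append(nodes[rel["childId"]])
        child_ids.add(rel["childId"])
    # The root is the only event that never appears as a child.
    roots = [node for node_id, node in nodes.items() if node_id not in child_ids]
    return roots[0]


def dfs_sequence(root):
    """Flatten the tree into a linear event sequence (pre-order DFS)."""
    order = [root.id]
    for child in root.children:
        order.extend(dfs_sequence(child))
    return order


events = [
    {"id": 1, "name": "fs -mv"},            # from the slide (rest elided there)
    {"id": 2, "name": "RPC:getFileInfo"},   # from the slide (rest elided there)
    {"id": 3, "name": "event3"},            # hypothetical placeholder
    {"id": 4, "name": "event4"},            # hypothetical placeholder
    {"id": 5, "name": "event5"},            # hypothetical placeholder
]
relationships = [
    {"fatherId": 1, "childId": 2, "type": "local call"},   # edge a on the slide
    {"fatherId": 2, "childId": 4, "type": "remote call"},  # assumed
    {"fatherId": 1, "childId": 3, "type": "local call"},   # assumed
    {"fatherId": 3, "childId": 5, "type": "remote call"},  # assumed
]

sequence = dfs_sequence(build_tree(events, relationships))
print(sequence)
```

Children are visited in the order their relationships are listed, so the linear sequence depends on how relationships are ordered in the input.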

Samples
A normal trace vs. a trace that meets a killDN fault (function fault)

Samples
A normal trace vs. a trace that meets a slowHDFS fault (performance fault)

Other Details

Class Structure
Each trace file corresponds to a certain scenario, considering different cluster sizes, request types, workload speeds, injected faults, etc.
Trace file classes and types:
- Normal: collected when the system is running normally
  - Workload: collected under different workload speeds
  - Datanode: collected with various cluster sizes
- Abnormal: collected when a permanent fault is injected
  - Process: affect the processes on HDFS nodes
  - Network: disrupt the network in the cluster
  - Data: introduce errors in the data on datanodes
  - System: inject faults into the OSs of the HDFS nodes
- Combination: collected when the system encounters temporary faults
  - Single: all faults are chosen from a single fault type
  - All: faults are chosen from all four types
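
The class structure can be written down as a small lookup table. The sketch below only restates the classes and types named on the slide; the names `CLASS_STRUCTURE`, `Scenario`, and `scenarios` are hypothetical and do not reflect how TraceBench itself names or stores its files.

```python
from dataclasses import dataclass

# Classes and types as listed on the slide; the dictionary layout is an assumption.
CLASS_STRUCTURE = {
    "Normal": {
        "Workload": "collected under different workload speeds",
        "Datanode": "collected with various cluster sizes",
    },
    "Abnormal": {
        "Process": "affect the processes on HDFS nodes",
        "Network": "disrupt the network in the cluster",
        "Data": "introduce errors in the data on datanodes",
        "System": "inject faults into the OSs of the HDFS nodes",
    },
    "Combination": {
        "Single": "all faults are chosen from a single fault type",
        "All": "faults are chosen from all four types",
    },
}


@dataclass(frozen=True)
class Scenario:
    """One (class, type) pair that a trace file can belong to."""
    trace_class: str
    trace_type: str
    description: str


def scenarios():
    """Enumerate every (class, type) pair in the structure."""
    return [
        Scenario(cls, typ, desc)
        for cls, types in CLASS_STRUCTURE.items()
        for typ, desc in types.items()
    ]


for s in scenarios():
    print(f"{s.trace_class:12} {s.trace_type:9} {s.description}")
```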

Statistics
The whole collection work lasted for more than half a year.
- Hosts: 50 clients + (50+1) HDFS nodes + others > 100 hosts
- Faults: 14 faults of 4 types injected
- Whole size of TraceBench ≈ 3.2 GB, including:
  - 361 trace files
  - 366,487 traces
  - 14,724,959 events
  - 6,273,497 relationships
- trace length = [5, 420]
- nodes per trace = [2, 44]
- ……
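
A few averages follow directly from these totals, for instance roughly 40 events per trace. The sketch below only divides the numbers given on the slide; the derived averages are not stated in the original and the variable names are illustrative.

```python
# Totals copied from the statistics slide.
stats = {
    "trace_files": 361,
    "traces": 366_487,
    "events": 14_724_959,
    "relationships": 6_273_497,
}

# Derived averages (computed here, not reported on the slide).
traces_per_file = stats["traces"] / stats["trace_files"]
events_per_trace = stats["events"] / stats["traces"]
relationships_per_trace = stats["relationships"] / stats["traces"]

print(f"traces per file:         {traces_per_file:.1f}")
print(f"events per trace:        {events_per_trace:.1f}")
print(f"relationships per trace: {relationships_per_trace:.1f}")
```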

Collection Process
Components in the figure (hosted on CloudStack):
- Clients: client001, client002, …, clientN
- HDFS: Namenode; Datanode001, Datanode002, …, DatanodeM
- MTracer Server
- Controller
- Ganglia Server
Edge labels in the figure: HDFS requests, track, control, inject faults, monitor.

Collection Process
- Normal: start MTracer and HDFS, start Clients, stop Clients, stop MTracer and HDFS. The request handling period lies inside the trace collection period.
- Abnormal: during the request handling period, inject a fault and later recover the system.
- Combination: during the request handling period, inject a fault several times (five injections are shown in the figure).

Applications

Data Analysis
Based on TraceBench, we analyzed the behavior of HDFS in terms of request handling, workload balancing, fault influence, etc.

Data Analysis
Example: analysis of the influence of a performance fault.

Detecting Failed Requests
The same kind of user request usually results in traces with similar topologies, which can be extracted as properties. For example, reading a data block
- starts with invoking blockSeekTo (B), and
- if successful, ends with checksumOK (K).
So, a successful file read request should satisfy an LTL property (the formula itself is not preserved in this transcript) whose parts are annotated as:
- read at least one data block
- the last data block reading
- successfully reading
Operators used: Exist, Always, Next.
If a trace of a read request violates the property, we say a failure happens.
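
A property of this kind can be checked on the linear event sequence of a trace. The sketch below is not the authors' formula, which is missing here. It assumes one plausible reading of the annotations: at least one `blockSeekTo` occurs, and the last `blockSeekTo` is eventually followed by `checksumOK`. The function name `read_succeeded` and the extra event names in the examples are hypothetical.

```python
def read_succeeded(events):
    """Check the assumed read property on a list of event names.

    Assumed reading (not the original LTL formula):
      1. the trace contains at least one blockSeekTo (B), and
      2. after the last B, a checksumOK (K) eventually occurs.
    """
    if "blockSeekTo" not in events:
        return False
    # Index of the last B in the sequence.
    last_b = len(events) - 1 - events[::-1].index("blockSeekTo")
    return "checksumOK" in events[last_b + 1:]


ok_trace = ["open", "blockSeekTo", "checksumOK", "blockSeekTo", "checksumOK"]
failed_trace = ["open", "blockSeekTo", "checksumOK", "blockSeekTo"]

print(read_succeeded(ok_trace))      # True
print(read_succeeded(failed_trace))  # False
```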

Detecting Failed Requests
Similarly, we extracted properties for other requests (table not preserved in this transcript).
Besides detecting failures, we also extracted properties for detecting faults in requests, e.g., for the read request (property not preserved in this transcript).
We expressed the above properties as SQL queries and checked the traces in some TraceBench sets collected with faults injected. 100% of the failed traces were identified, without false positives (FPs).
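
Checking a property as an SQL query might look like the sketch below. Everything about the schema is an assumption made for illustration: the table `event(trace_id, seq, name)` is hypothetical and is not TraceBench's actual schema, and the query encodes the assumed reading "a `checksumOK` occurs after the last `blockSeekTo`" rather than the authors' queries. An in-memory SQLite database keeps the example self-contained.

```python
import sqlite3

# Flags traces with no checksumOK after their last blockSeekTo.
# A trace without any blockSeekTo is flagged too (the comparison with NULL fails).
FAILED_READS = """
SELECT t.trace_id
FROM (SELECT DISTINCT trace_id FROM event) AS t
WHERE NOT EXISTS (
    SELECT 1
    FROM event AS k
    WHERE k.trace_id = t.trace_id
      AND k.name = 'checksumOK'
      AND k.seq > (SELECT MAX(b.seq)
                   FROM event AS b
                   WHERE b.trace_id = t.trace_id
                     AND b.name = 'blockSeekTo')
)
ORDER BY t.trace_id
"""


def failed_read_traces(rows):
    """Load (trace_id, seq, name) rows and return ids of traces violating the property."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE event (trace_id INTEGER, seq INTEGER, name TEXT)")
    conn.executemany("INSERT INTO event VALUES (?, ?, ?)", rows)
    result = [row[0] for row in conn.execute(FAILED_READS)]
    conn.close()
    return result


sample_rows = [
    (1, 1, "blockSeekTo"), (1, 2, "checksumOK"),  # successful read
    (2, 1, "blockSeekTo"),                        # last block never verified
    (3, 1, "open"),                               # no block read at all
]
failed = failed_read_traces(sample_rows)
print(failed)
```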

Mining Temporal Invariants
Synoptic*: a tool for mining temporal invariants from logs.
With Synoptic, we mined temporal invariants in TraceBench, for both totally ordered and partially ordered logs. Invariant types:
- Always followed by
- Always precedes
- Never followed by
* From (source link not preserved in this transcript)

Mining Temporal Invariants
When dealing with partially ordered (PO) logs:
1. Too many invariants are generated.
2. Some false invariants arise.
3. Many invariants contain the same information.
Reason: Synoptic treats the same kind of event generated on different hosts as different events.
Conclusion: when dealing with PO logs, Synoptic seems more suitable for systems with few hosts.
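
The effect of host-qualified event names can be illustrated with a toy miner for the three invariant types. This is a simplified sketch over totally ordered traces and is not Synoptic's implementation. The event names, the `@host` suffix convention, and the function `mine_invariants` are assumptions made for the example.

```python
from itertools import permutations


def mine_invariants(traces):
    """Mine three invariant types over ordered pairs of distinct event types.

    Returns a set of (a, kind, b) triples with kind in
    {"AlwaysFollowedBy", "AlwaysPrecedes", "NeverFollowedBy"}.
    """
    types = sorted({e for trace in traces for e in trace})
    invariants = set()
    for a, b in permutations(types, 2):
        always_followed = always_precedes = never_followed = True
        for trace in traces:
            for i, e in enumerate(trace):
                if e == a:
                    if b in trace[i + 1:]:
                        never_followed = False   # some a is followed by b
                    else:
                        always_followed = False  # some a has no later b
                if e == b and a not in trace[:i]:
                    always_precedes = False      # some b has no earlier a
        if always_followed:
            invariants.add((a, "AlwaysFollowedBy", b))
        if always_precedes:
            invariants.add((a, "AlwaysPrecedes", b))
        if never_followed:
            invariants.add((a, "NeverFollowedBy", b))
    return invariants


# The same request handled on two different hosts (hypothetical event names).
qualified = [
    ["open@h1", "read@h1", "close@h1"],
    ["open@h2", "read@h2", "close@h2"],
]
# The same traces with the host suffix stripped.
stripped = [[e.split("@")[0] for e in trace] for trace in qualified]

n_qualified = len(mine_invariants(qualified))
n_stripped = len(mine_invariants(stripped))
print(n_stripped, n_qualified)
```

With host-qualified names the miner reports several times as many invariants, most of them "never followed by" relations between events of different hosts that merely reflect which host served the request.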

Mining Temporal Invariants
Synoptic on synthesized logs [1] vs. Synoptic on TraceBench: the experiment results based on TraceBench are more convincing.
[1] I. Beschastnikh, Y. Brun, M. D. Ernst, A. Krishnamurthy, and T. E. Anderson, "Mining temporal invariants from partially ordered logs," ACM SIGOPS Operating Systems Review, 45(3), pp. 39–46.

Diagnosing Performance Anomalies
We implemented a PCA-based algorithm for diagnosing performance anomalies [2] and evaluated it with TraceBench.
Results are reported as triples: (total anomalies), (found anomalies), (incorrectly found anomalies).
[2] A. Lakhina, M. Crovella, and C. Diot, "Diagnosing network-wide traffic anomalies," in Proceedings of SIGCOMM 2004, pp. 219–230.
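
The general idea of PCA-based anomaly detection can be sketched as follows. This is a simplified illustration, not the algorithm of [2] or the authors' implementation. It assumes each trace is a row of per-event latencies, models the normal subspace with only the first principal component (found by power iteration), and flags rows whose squared residual exceeds mean + 3 standard deviations, a stand-in for the statistical threshold used in the literature. The data is synthetic.

```python
import math
import random


def first_component(rows, iters=100):
    """Top principal direction of already-centered rows, via power iteration."""
    dim = len(rows[0])
    # Unnormalised covariance matrix; scaling does not change the direction.
    cov = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    v = [1.0] * dim
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def pca_anomalies(data, n_sigma=3.0):
    """Return indices of rows with an unusually large residual outside the first PC."""
    dim = len(data[0])
    means = [sum(r[j] for r in data) / len(data) for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in data]
    pc = first_component(centered)
    spe = []  # squared prediction error of each row
    for r in centered:
        score = sum(a * b for a, b in zip(r, pc))
        spe.append(sum((a - score * b) ** 2 for a, b in zip(r, pc)))
    mean = sum(spe) / len(spe)
    std = math.sqrt(sum((s - mean) ** 2 for s in spe) / len(spe))
    return [i for i, s in enumerate(spe) if s > mean + n_sigma * std]


# Synthetic latencies: four events per trace that scale together with the load.
rng = random.Random(0)
weights = [10.0, 20.0, 5.0, 8.0]
data = []
for _ in range(200):
    load = rng.uniform(1.0, 2.0)
    data.append([load * w + rng.gauss(0.0, 0.1) for w in weights])
data[42][2] += 15.0  # one event in trace 42 is unexpectedly slow

flagged = pca_anomalies(data)
print(flagged)
```

Latency changes that follow the overall load stay inside the principal subspace and are not flagged; only deviations from the usual correlation between events show up in the residual.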

Diagnosing Performance Anomalies
Observations:
1. The algorithm finds all anomalies in some cases.
2. However, it sometimes finds only a small part of them.
3. It raised no false alarms in our experiments.
4. Analysis time increases very fast with the trace length,
5. but slowly with the number of traces.
Conclusions:
- The algorithm is sensitive to the features of the data (1, 2).
- The algorithm is pretty accurate (3).
- The algorithm is more feasible for short traces (4, 5).

Discussion

Discussion
? Traces are collected only on HDFS.
! Traces from HDFS are representative, because:
  - HDFS is a widely used system,
  - and many mechanisms and procedures in HDFS are shared by other systems.
? During collection, the HDFS cluster is small.
! It is enough for exhibiting various features of HDFS, because:
  - traces are collected in different scenarios.
? Many faults exist other than the injected ones.
! The injected faults are representative, because they:
  - cover different types,
  - involve both function and performance faults,
  - and are selected from the most frequent faults.

Thanks and Any Questions?