Towards An Open Data Set for Trace-Oriented Monitoring Jingwen Zhou 1, Zhenbang Chen 1, Ji Wang 1, Zibin Zheng 2, and Michael R. Lyu 1,2 1 National University.

Towards An Open Data Set for Trace-Oriented Monitoring
Jingwen Zhou¹, Zhenbang Chen¹, Ji Wang¹, Zibin Zheng², and Michael R. Lyu¹,²
¹ National University of Defense Technology, Changsha, China
² Chinese University of Hong Kong, Hong Kong, China
{jwzhou,
This work is supported by: National 973 Program of China (No. 2011CB302603), NSFC (No and No ), SRFDP (No ).

Motivation
– Benefiting our daily life
– Supporting different fields
– Increasing in complexity
– Increasing in scale

Motivation
In August 2013, meltdowns successively happened at:
– Amazon: lost 7 million dollars in 100 minutes.
– Google: lost 550,000 dollars in less than 5 minutes, and global Internet traffic dropped 40%.
Trace-oriented monitoring is one of the methods for improving system reliability at runtime. It supports three stages: detection, diagnosis, and remediation.

Trace-based research
Trace-based research supports detection, diagnosis, and remediation, but few trace data sets are freely available. Without such a data set, researchers must choose or implement a tracing system, instrument and deploy a target system, collect traces, and so on. We are collecting a trace data set.
Each element in our data set is a trace, which records the execution path of a user request.
– Trace = events + relationships
– Event: function name, latency, …
– Relationship: local and remote function calls, …
– Trace → trace tree: nodes and edges correspond to events and relationships, respectively.
Sample: a trace tree for a read-file request in HDFS.
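The trace-to-tree mapping above can be sketched as follows. This is a minimal illustration, not MTracer's actual format: it assumes each event carries a hypothetical id, its parent's id, a function name, a latency, and the type (local or remote) of the call from its parent. The function names in the usage example are also hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Event:
    """One traced event (hypothetical field names)."""
    event_id: str
    parent_id: Optional[str]         # None marks the root event of the request
    function: str
    latency_ms: float
    call_type: Optional[str] = None  # "local" or "remote": the call edge from the parent


@dataclass
class Node:
    event: Event
    children: List["Node"] = field(default_factory=list)


def build_trace_tree(events: List[Event]) -> Node:
    """Turn a flat list of events into a tree: events become nodes, calls become edges."""
    nodes: Dict[str, Node] = {e.event_id: Node(e) for e in events}
    roots = [nodes[e.event_id] for e in events if e.parent_id is None]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root event, found {len(roots)}")
    for e in events:
        if e.parent_id is not None:
            # Raises KeyError if the parent event is missing from the trace.
            nodes[e.parent_id].children.append(nodes[e.event_id])
    return roots[0]


def render(node: Node, depth: int = 0) -> List[str]:
    """Return an indented text view of the tree, one line per event."""
    e = node.event
    call = f", {e.call_type} call" if e.call_type else ""
    lines = [f"{'  ' * depth}{e.function} ({e.latency_ms} ms{call})"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


if __name__ == "__main__":
    # Hypothetical read request: one root event with two remote calls.
    events = [
        Event("e1", None, "client_read", 12.0),
        Event("e2", "e1", "lookup_block", 3.0, "remote"),
        Event("e3", "e1", "read_block", 8.0, "remote"),
    ]
    print("\n".join(render(build_trace_tree(events))))
```

Keeping the call type on the child event, rather than in a separate edge list, is just a convenience here, since every non-root event has exactly one incoming call.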

Structure of Data Set (Class → Type → Trace File)
Normal: collected when HDFS is running normally.
– Workload: collected under different workload speeds.
– Datanode: collected with various cluster sizes.
– Trace files: considering different requests, etc.
Abnormal: collected when a fault is injected.
– Process: faults that affect the processes on HDFS nodes.
– Network: faults that bring anomalies to the network in the cluster.
– Data: faults that introduce errors in the data on datanodes.
– System: faults injected into the OSs of the HDFS nodes.
– Trace files: considering different requests, numbers of faulty nodes, etc.
Combination: collected when faults are randomly injected and later eliminated.
– Single: all faults are chosen from a single fault type.
– All: faults are chosen from all four types.
– Trace files: collection is repeated many times.
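The class/type hierarchy and the two Combination modes can be encoded as below. This is a hedged sketch: `FaultType`, `TAXONOMY`, and `plan_combination` are hypothetical names, the sketch only picks fault *types* (concrete faults within each type are left open), and it assumes "All" means each fault is drawn independently from the four types, so not every type is guaranteed to appear in a run.

```python
import random
from enum import Enum
from typing import Dict, List


class FaultType(Enum):
    """The four injected fault types, described as on the slide."""
    PROCESS = "affects the processes on HDFS nodes"
    NETWORK = "brings anomalies to the network in the cluster"
    DATA = "introduces errors in the data on datanodes"
    SYSTEM = "injects faults into the OSs of the HDFS nodes"


# Class -> types, mirroring the Class -> Type -> Trace File hierarchy.
TAXONOMY: Dict[str, List[str]] = {
    "Normal": ["Workload", "Datanode"],
    "Abnormal": [t.name.title() for t in FaultType],
    "Combination": ["Single", "All"],
}


def plan_combination(mode: str, n_faults: int, rng: random.Random) -> List[FaultType]:
    """Pick the type of each fault in one hypothetical Combination run."""
    types = list(FaultType)
    if mode == "Single":
        chosen = rng.choice(types)  # one type, reused for every fault
        return [chosen] * n_faults
    if mode == "All":
        # Each fault independently drawn from any of the four types.
        return [rng.choice(types) for _ in range(n_faults)]
    raise ValueError(f"unknown Combination type: {mode}")
```

Passing an explicit `random.Random` instance keeps a run's fault plan reproducible from its seed.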

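Details in Collection
Figure: collection setup on CloudStack. Clients (client001, client002, …, clientN) send HDFS requests to HDFS (a Namenode plus Datanode001, Datanode002, …, DatanodeM). The MTracer server tracks the requests, the Ganglia server monitors the nodes, and a controller controls the clients, MTracer, and Ganglia and injects faults into HDFS.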

Details in Collection
Figure: collection procedure for each class.
– Normal: start MTracer and HDFS, start the clients, stop the clients, then stop MTracer and HDFS. The figure marks a request handling period and a trace collection period.
– Abnormal: during the request handling period, inject a fault, then recover the system.
– Combination: during the request handling period, inject faults repeatedly (five injections are shown).
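The per-class procedure can be sketched as a timeline generator. This is not the authors' controller: the durations (`handling_s`, `settle_s`, `fault_s`), the uniformly random injection times, and the choice of one fault per Abnormal run are all assumptions made for illustration.

```python
import random
from typing import List, Tuple


def collection_schedule(kind: str, rng: random.Random, handling_s: float = 600.0,
                        settle_s: float = 60.0, fault_s: float = 60.0,
                        n_faults: int = 5) -> List[Tuple[float, str]]:
    """Return (seconds since run start, action) pairs for one hypothetical run."""
    start, end = settle_s, settle_s + handling_s  # request handling period
    if kind == "Normal":
        injections = 0
    elif kind == "Abnormal":
        injections = 1
    elif kind == "Combination":
        injections = n_faults
    else:
        raise ValueError(f"unknown class: {kind}")
    if injections and fault_s >= handling_s:
        raise ValueError("fault_s must be shorter than the request handling period")

    steps = [(0.0, "start MTracer and HDFS"), (start, "start clients")]
    undo = "recover the system" if kind == "Abnormal" else "eliminate the fault"
    for _ in range(injections):
        t = rng.uniform(start, end - fault_s)  # each fault lies inside the period
        steps += [(t, "inject a fault"), (t + fault_s, undo)]
    steps += [(end, "stop clients"), (end + settle_s, "stop MTracer and HDFS")]
    steps.sort(key=lambda step: step[0])  # stable sort: ties keep insertion order
    return steps
```

For example, `collection_schedule("Combination", random.Random(7))` yields five inject/eliminate pairs between "start clients" and "stop clients". Faults may overlap in this sketch, which is one possible reading of "randomly injected and later eliminated".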

Threats to Validity
? Traces are collected only on HDFS.
! Traces from HDFS are representative, because:
– HDFS is a widely used system, and
– many mechanisms and procedures in HDFS are shared by other systems.
? During collection, the maximum number of datanodes (DNs) is 50.
! This is enough to exhibit various features of HDFS, because:
– traces are collected in different scenarios.
? Many faults exist other than the injected ones.
! The injected faults are representative, because they:
– cover different types,
– involve both functional and performance faults, and
– include the most frequent faults.

Thank you for your attention!