

1 College of Computer, National University of Defense Technology
Jingwen Zhou, Zhenbang Chen, Haibo Mi, and Ji Wang
{jwzhou, zbchen}@nudt.edu.cn
This work is supported by: iVCE National Basic Research Program of China

2 Background

3 Background
… A distributed system (DS) containing tens to hundreds of nodes is medium-scale.


5 Background
- Resource-oriented monitoring: records resource consumption, such as CPU and memory (Ganglia, Chukwa, …)
- Trace-oriented monitoring: records execution paths, also called the traces of requests (X-Trace, P-Tracer, Zipkin, …)
Generally, trace-oriented monitoring collects more valuable information than resource-oriented monitoring.

6 Background
- In prototypes (X-Trace): traces are stored in text files with simple visualization, …; too few functions to be usable.
- For large-scale DS (P-Tracer): call trees are constructed using a MapReduce process, …; exceptions also occur in the monitors themselves and are hard to recover from.
- For medium-scale DS (?? → MTracer): a lightweight, efficient, real-time, visualized monitor is needed.

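7 Architecture
[Figure: the distributed system (DS) consists of Node 1, Node 2, …, Node n. On each node, DS instrumentation records info and a Reporter sends it over the network to the Monitor Server. The Monitor Server contains a Receiver, an Extractor, Database writers (…), Event Recovering, a Manager, and a UI. The three stages are Recording, Storing, and Visualizing.]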

8 Architecture (focus: Recording)

9 Trace Recording
- Trace = Events + Relationships, identified by a TraceID
- Event: function name, latency, … (automatically collected), plus an NID; for overhead reasons, NID ≈ temporary thread ID
- Relationship: remote procedure call, communication between nodes, …

10 Trace Recording
- A new NID is assigned when a trace starts.
- Each time a node communicates with a remote node, it assigns a new NID for the remote node and preserves its local NID.
- When generating an event, a node records its local NID, together with the start and end timestamps.
- The first time an event is generated after a new NID is assigned, an additional record, called an edge, is also created to capture the causal relationship.
[Figure: a request starts and ends on Node 1, which runs F1 (ST1 to ET1), F2 (ST2 to ET2), and F5 (ST5 to ET5) under NID1; Node 2 runs F3 (ST3 to ET3) under NID2 and F4 (ST4 to ET4) under NID3.]
Event records (Node, TraceID, NID, Timestamp, Name):
R1 Node1 Trace1 NID1 ST1 ET1 F1
R2 Node1 Trace1 NID1 ST2 ET2 F2
R3 Node2 Trace1 NID2 ST3 ET3 F3
R4 Node2 Trace1 NID3 ST4 ET4 F4
R5 Node1 Trace1 NID1 ST5 ET5 F5
Edge records (TraceID, FatherNID, FatherST, ChildNID):
E1 Trace1 0 0 NID1
E2 Trace1 NID1 T2 NID2
E3 Trace1 NID1 T2 NID3
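The four recording rules can be replayed in a short Python sketch. The names `Context`, `Recorder`, `start_trace`, `remote_context`, and `record`, the single shared NID counter, and the integer timestamps are illustrative assumptions, not MTracer's implementation; the sketch only checks that the rules reproduce the event and edge records of the example above.

```python
import itertools
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Context:
    """Hypothetical per-thread tracing context carried along a request."""
    node: str
    trace_id: str
    nid: str
    # (father_nid, father_st) of the edge not yet emitted; None once emitted
    pending_edge: Optional[Tuple[str, int]]


class Recorder:
    """Collects event and edge records following the four NID rules."""

    def __init__(self) -> None:
        self.events: List[tuple] = []  # (node, trace_id, nid, st, et, name)
        self.edges: List[tuple] = []   # (trace_id, father_nid, father_st, child_nid)
        # Assumption: one counter stands in for NID generation; a real
        # deployment needs NIDs that are unique across nodes.
        self._ids = itertools.count(1)

    def _new_nid(self) -> str:
        return f"NID{next(self._ids)}"

    def start_trace(self, node: str, trace_id: str) -> Context:
        # Rule 1: a new NID when the trace starts; its edge has no father ("0", 0).
        return Context(node, trace_id, self._new_nid(), ("0", 0))

    def remote_context(self, local: Context, remote_node: str, father_st: int) -> Context:
        # Rule 2: a fresh NID for the remote side; the caller keeps its own NID.
        return Context(remote_node, local.trace_id, self._new_nid(), (local.nid, father_st))

    def record(self, ctx: Context, name: str, st: int, et: int) -> None:
        # Rule 4: the first event under a new NID also emits the causal edge.
        if ctx.pending_edge is not None:
            father_nid, father_st = ctx.pending_edge
            self.edges.append((ctx.trace_id, father_nid, father_st, ctx.nid))
            ctx.pending_edge = None
        # Rule 3: every event carries the local NID plus start/end timestamps.
        self.events.append((ctx.node, ctx.trace_id, ctx.nid, st, et, name))


# Replaying the slide's example with made-up integer timestamps.
rec = Recorder()
n1 = rec.start_trace("Node1", "Trace1")            # NID1
rec.record(n1, "F1", 1, 2)
c1 = rec.remote_context(n1, "Node2", father_st=3)  # NID2, first call from F2
rec.record(c1, "F3", 4, 5)
c2 = rec.remote_context(n1, "Node2", father_st=3)  # NID3, second call from F2
rec.record(c2, "F4", 6, 7)
rec.record(n1, "F2", 3, 10)   # F2 is recorded when it ends, after its remote calls
rec.record(n1, "F5", 11, 12)
print(rec.edges)
```

Because each remote call receives its own NID, the two calls from F2 produce two edges with the same father (NID1, start of F2), matching E2 and E3.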

11 Trace Reconstruction
Reconstruction takes the event records (R1 to R5) and edge records (E1 to E3) of Trace1 as input.

12 Trace Reconstruction
[Figure: the trace tree reconstructed from the event and edge records: F1, F2, and F5 (NID1, Node 1); F3 (NID2) and F4 (NID3) from Node 2, attached under F2 through edges E2 and E3. Each tree node carries name, latency, node, …; each tree edge is typed as local call, remote call, ….]
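A rough sketch of how the records could be turned back into a tree. It assumes two rules not spelled out on the slides: events sharing an NID are nested by interval containment (treated as local calls), and an edge's FatherST equals the start timestamp of the father event, whose new children become remote calls. The names `Event`, `Edge`, `nest_local`, `reconstruct`, and `render`, and the integer timestamps, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    node: str
    trace_id: str
    nid: str
    st: int      # start timestamp
    et: int      # end timestamp
    name: str
    children: List["Event"] = field(default_factory=list)
    call_type: str = "top"   # "top", "local", or "remote"


@dataclass
class Edge:
    trace_id: str
    father_nid: str   # "0" marks the root edge
    father_st: int    # start timestamp of the father event (0 for the root)
    child_nid: str


def nest_local(events: List[Event]) -> List[Event]:
    """Nest events of one NID by interval containment; return the outermost ones."""
    tops: List[Event] = []
    stack: List[Event] = []
    for ev in sorted(events, key=lambda e: (e.st, -e.et)):
        # Pop candidates on the stack that do not enclose this event.
        while stack and not (stack[-1].st <= ev.st and ev.et <= stack[-1].et):
            stack.pop()
        if stack:
            ev.call_type = "local"
            stack[-1].children.append(ev)
        else:
            tops.append(ev)
        stack.append(ev)
    return tops


def reconstruct(events: List[Event], edges: List[Edge]) -> List[Event]:
    """Group events by NID, nest them locally, then hang child NIDs under fathers."""
    by_nid: Dict[str, List[Event]] = {}
    for ev in events:
        by_nid.setdefault(ev.nid, []).append(ev)
    tops = {nid: nest_local(evs) for nid, evs in by_nid.items()}
    # Assumed lookup key: (NID, start timestamp) identifies the father event.
    index = {(ev.nid, ev.st): ev for ev in events}
    roots: List[Event] = []
    for edge in edges:
        children = tops.get(edge.child_nid, [])
        if edge.father_nid == "0":
            roots.extend(children)
            continue
        father = index[(edge.father_nid, edge.father_st)]
        for child in children:
            child.call_type = "remote"
            father.children.append(child)
    return roots


def render(nodes: List[Event], depth: int = 0) -> List[str]:
    """Flatten the tree into indented lines with call type and latency."""
    lines: List[str] = []
    for n in nodes:
        lines.append(f"{'  ' * depth}{n.name} [{n.call_type}] latency={n.et - n.st}")
        lines.extend(render(n.children, depth + 1))
    return lines


events = [
    Event("Node1", "Trace1", "NID1", 1, 2, "F1"),
    Event("Node1", "Trace1", "NID1", 3, 10, "F2"),
    Event("Node2", "Trace1", "NID2", 4, 5, "F3"),
    Event("Node2", "Trace1", "NID3", 6, 7, "F4"),
    Event("Node1", "Trace1", "NID1", 11, 12, "F5"),
]
edges = [
    Edge("Trace1", "0", 0, "NID1"),
    Edge("Trace1", "NID1", 3, "NID2"),
    Edge("Trace1", "NID1", 3, "NID3"),
]
roots = reconstruct(events, edges)
print("\n".join(render(roots)))
```

Keying father events by (NID, start time) breaks if two events under one NID start at the same timestamp; a real implementation would need a tie-breaker.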


14 Architecture (focus: Storing: Extractor, Database writers)

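15 Trace Storing
[Figure: naive storing path for each incoming event over the tables T_Trace, T_Event, T_Edge, and T_Operation: SELECT from T_Trace to check whether the trace exists, then UPDATE or INSERT; INSERT into T_Event; if the event carries an edge, INSERT into T_Edge; SELECT from T_Operation to check whether the operation exists, then UPDATE or INSERT.]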

16 Trace Storing
[Figure: the Extractor splits each incoming event into four queues (Q_Trace, Q_Event, Q_Edge, Q_Operation); each queue is drained by a dedicated writer (TraceWriter, EventWriter, EdgeWriter, OperationWriter) into its table (T_Trace, T_Event, T_Edge, T_Operation).]
Optimization 1: Batch inserting
Optimization 2: Information updating in memory
Both optimizations reduce the number of database operations.
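A minimal sketch of the two optimizations, assuming SQLite (Python's built-in `sqlite3`) in place of the actual database and a simplified, hypothetical schema for T_Event, T_Edge, and T_Trace (T_Operation is omitted). `BatchWriter` drains a queue with one `executemany` call per batch (Optimization 1); `TraceSummary` keeps per-trace information in a dictionary and writes it out once, instead of a SELECT followed by an INSERT or UPDATE for every event (Optimization 2). The batch size and the `extract` routing function are illustrative only.

```python
import sqlite3


class BatchWriter:
    """Buffers rows and inserts them with one executemany call per batch."""

    def __init__(self, conn, sql, batch_size=100):
        self.conn = conn
        self.sql = sql
        self.batch_size = batch_size
        self.pending = []
        self.round_trips = 0  # number of batched database calls issued

    def put(self, row):
        self.pending.append(row)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.conn.executemany(self.sql, self.pending)
            self.conn.commit()
            self.round_trips += 1
            self.pending = []


class TraceSummary:
    """Keeps per-trace info in memory instead of SELECT-then-UPDATE per event."""

    def __init__(self, conn):
        self.conn = conn
        self.traces = {}  # trace_id -> [first_st, last_et, n_events]

    def update(self, trace_id, st, et):
        info = self.traces.setdefault(trace_id, [st, et, 0])
        info[0] = min(info[0], st)
        info[1] = max(info[1], et)
        info[2] += 1

    def flush(self):
        self.conn.executemany(
            "INSERT OR REPLACE INTO T_Trace VALUES (?, ?, ?, ?)",
            [(tid, *info) for tid, info in self.traces.items()])
        self.conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T_Event (node TEXT, trace_id TEXT, nid TEXT, st INTEGER, et INTEGER, name TEXT);
CREATE TABLE T_Edge (trace_id TEXT, father_nid TEXT, father_st INTEGER, child_nid TEXT);
CREATE TABLE T_Trace (trace_id TEXT PRIMARY KEY, st INTEGER, et INTEGER, n_events INTEGER);
""")
event_writer = BatchWriter(conn, "INSERT INTO T_Event VALUES (?, ?, ?, ?, ?, ?)", batch_size=2)
edge_writer = BatchWriter(conn, "INSERT INTO T_Edge VALUES (?, ?, ?, ?)", batch_size=2)
summary = TraceSummary(conn)


def extract(event, edge=None):
    """Hypothetical extractor: route one incoming record to the per-table writers."""
    event_writer.put(event)
    summary.update(event[1], event[3], event[4])
    if edge is not None:
        edge_writer.put(edge)


incoming = [
    (("Node1", "Trace1", "NID1", 1, 2, "F1"), ("Trace1", "0", 0, "NID1")),
    (("Node2", "Trace1", "NID2", 4, 5, "F3"), ("Trace1", "NID1", 3, "NID2")),
    (("Node2", "Trace1", "NID3", 6, 7, "F4"), ("Trace1", "NID1", 3, "NID3")),
    (("Node1", "Trace1", "NID1", 3, 10, "F2"), None),
    (("Node1", "Trace1", "NID1", 11, 12, "F5"), None),
]
for event, edge in incoming:
    extract(event, edge)
for writer in (event_writer, edge_writer):
    writer.flush()
summary.flush()
```

With a batch size of 2, the five events cost three batched calls instead of five single-row inserts, and T_Trace is written once rather than queried for every event.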


18 Architecture (focus: Visualizing: UI)

19 Visualization
1. Trace Tree

20 Visualization

21 Visualization
2. Trace Tree Classification

22 Visualization
3. Performance Problem Diagnosis

23 Experiments: Overhead
- Generating an event: 0.046 ms, vs. seconds or minutes for operations in the DS
- Size of an event: 0.315 KB; 2 MB of bandwidth vs. a GB-level network
- Generating an ID: 0.057 ms, more than 50% less with our method
The overhead on a client is negligible!

24 Experiments: Effectiveness
[Figure panels: capability of receiving events; speed of database operations]
T_Operation limits the global speedup to 6X.

25 Experiments: Usability
- HDFS: RPC and data-access processes
- 50 clients + (50 + 1) HDFS nodes
- 14 faults: functional and performance problems
- Handled easily, with correct visualization
- Trace classification and diagnosis

26 Conclusion
- MTracer: a lightweight, efficient, real-time monitor for medium-scale DS, with a visualized frontend.
- Future work: an easier way to add instrumentation, a dataset for trace-based monitoring research, fault detection, …


28 Optimizations
1. Batch Inserting
[Figure: events flowing from Q_Event into T_Event]

29 Optimizations
2. Information Updating in Memory
[Figure: trace information flowing from Q_Trace into T_Trace]

