
1 Lightweight Task Graph Inference for Distributed Applications
Bin Xin, Patrick Eugster, Xiangyu Zhang, Dept. of Computer Science, Purdue University, {xinb, peugster, xyzhang}@cs.purdue.edu
Jinlin Yang, Center for Software Excellence, Microsoft Corp., jiliny@microsoft.com
2010 29th IEEE International Symposium on Reliable Distributed Systems

2 Introduction
- New challenges to reliability arise as applications move to the Cloud.
- Distinct corporate entities manage the infrastructure and own the deployed application.
- Application developers do not have access to lower-level debugging information in case of failures/faults.
- Diagnosis depends on application output or custom application-level logs.
- Goal: describe the high-level structural view of a distributed program execution to facilitate easy “after the fact” diagnosis.

3 Contributions
- Define an abstraction for representing distributed executions: “Tasks”.
- A lightweight approach to generate “Task Graphs” from the application event logs.
- A declarative formulation in Prolog of the rules to generate Task Graphs.
- Demonstrate the use of Task Graphs to help understand distributed executions, including anomaly detection.

4 Relevance to SmartGrid and CiC
Extensions
- Fault detection by real-time log processing (CEP?).
- The patterns for CEP can be defined by the application developer OR can be auto-generated using code augmentation and static code analysis.
- On fault detection, the task graph can be used to decide “recovery” mechanisms (other than a naïve restart-the-process strategy).
Shortcomings
- Does not explicitly consider the “Data Repository”; it is considered only as one of the ‘tasks’.
- Not sure how it handles transactions.

5 Definitions
- Event: the execution of an operation that sends (or receives) data/a signal to (or from) a different thread/process. Events are the smallest building blocks.
- Signaling Event: the operation of sending.
- Acting Event: the operation of receiving.
- Happens Before (a → b): a partial ordering of events, where a is the sender and b is the receiver that acts on that signal.
- Task: an autonomous computation within a thread between two “acting” events, [A start, A end). A task contains exactly one acting event and zero or more signaling events.
- Task Graph: a DAG whose nodes are tasks and whose edges represent Happens Before relations.
- Request: a pair of signaling and acting events, where the signaling event originates from outside the system.
- Reply: a pair of signaling and acting events, where the acting event is triggered outside the system.
- E2E service Graph:
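As a rough illustration of the Task Graph definition above, the sketch below lifts Happens Before edges between events to edges between the tasks that contain them, then checks that the result is a DAG. The function names, the dictionary-based representation, and the task ids are assumptions made for this example; the slides do not show the authors' data structures.

```python
from graphlib import CycleError, TopologicalSorter


def lift_to_task_graph(event_task, hb_edges):
    """Lift event-level Happens Before edges to task-level edges.

    event_task: dict mapping event id -> id of the task containing it.
    hb_edges:   iterable of (signaling_event_id, acting_event_id) pairs.
    Returns a dict mapping task id -> set of successor task ids.
    """
    graph = {task: set() for task in event_task.values()}
    for src_event, dst_event in hb_edges:
        src_task, dst_task = event_task[src_event], event_task[dst_event]
        if src_task != dst_task:  # ignore edges that stay inside one task
            graph[src_task].add(dst_task)
    return graph


def is_dag(graph):
    """Return True if the graph has no cycle (cycle detection ignores edge direction)."""
    try:
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


# Hypothetical example: events 1..4 spread over tasks A, B, C.
event_task = {1: "A", 2: "B", 3: "B", 4: "C"}
hb_edges = [(1, 2), (3, 4)]
task_graph = lift_to_task_graph(event_task, hb_edges)
```

Here task A signals task B, and task B later signals task C, so the graph is the chain A, B, C.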

6 Example

7 System Setup
- Uses HDFS as the example Cloud application.
- HDFS's own logs are not sufficient/standardized.
- Uses instrumentation with a tool called “AspectJ”.
- AspectJ lets the developer insert code based on specific “rules” during compilation.
- Each event is logged as a 7-field tuple: (EventID, ProcID, ThreadID, SourceLocation, Type, Tag, Value).
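A minimal sketch of how one such 7-field record could be read back into a typed value. The slide names the fields but not the on-disk format, so the `|` separator, the field types, the `"signal"`/`"act"` type values, and the sample line are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One logged event; field order follows the 7-tuple on the slide."""
    event_id: int
    proc_id: str
    thread_id: str
    source_location: str
    type: str   # assumed values: "signal" or "act"
    tag: str    # assumed to identify the channel or sync object
    value: str


def parse_event(line, sep="|"):
    """Parse one log line into an Event (separator is an assumption)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    event_id, proc_id, thread_id, location, etype, tag, value = fields
    return Event(int(event_id), proc_id, thread_id, location, etype, tag, value)


# Made-up sample line, not taken from a real HDFS run.
example = parse_event("17|proc-1|t4|Example.java:42|act|socket-9|4096")
```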

8 Constructing Task Graphs (Prolog formulation) - I
Events
- A “fact” is used to parse and store all events.
- An entry for hb is made only if the rules on the right are true for events X and Y.
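The Prolog rules themselves are not part of this transcript. As a stand-in, the sketch below shows one plausible reading of an hb rule in Python: hb(X, Y) holds when X is a signaling event, Y is an acting event, both carry the same tag, and they run in different threads. This pairing condition is an assumption, not the authors' rule set, and the event values are invented.

```python
from typing import NamedTuple


class Event(NamedTuple):
    event_id: int
    thread_id: str
    type: str  # assumed values: "signal" or "act"
    tag: str   # assumed to identify the channel or sync object


def infer_hb(events):
    """Return the set of (signal_id, act_id) pairs satisfying the assumed rule."""
    signals = [e for e in events if e.type == "signal"]
    acts = [e for e in events if e.type == "act"]
    return {
        (s.event_id, a.event_id)
        for s in signals
        for a in acts
        if s.tag == a.tag and s.thread_id != a.thread_id
    }


# Invented events: t1 signals on a queue and on a socket; t2 and t3 receive.
events = [
    Event(1, "t1", "signal", "queue"),
    Event(2, "t2", "act", "queue"),
    Event(3, "t1", "signal", "socket"),
    Event(4, "t3", "act", "socket"),
]
hb = infer_hb(events)
```

Matching on the tag alone over-approximates when one tag is reused many times, which is the kind of false positive the later "Issues & Solutions" slides address.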

9 Constructing Task Graphs (Prolog formulation) - II Tasks
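Based on the earlier definition of a task (exactly one acting event followed by zero or more signaling events within one thread), a task split could be sketched as below. The tuple layout, the `"signal"`/`"act"` labels, and the choice to skip signaling events that precede a thread's first acting event are assumptions of this example, not details taken from the slides.

```python
def split_into_tasks(events):
    """Group each thread's events into tasks.

    events: list of (event_id, thread_id, type) in log order,
            where type is "act" or "signal" (assumed labels).
    Returns a dict: thread_id -> list of tasks, each task a list of
    event ids that starts with its single acting event.
    """
    tasks = {}
    for event_id, thread_id, etype in events:
        if etype == "act":
            # An acting event opens a new task on its thread.
            tasks.setdefault(thread_id, []).append([event_id])
        elif thread_id in tasks:
            # A signaling event belongs to the thread's current task.
            tasks[thread_id][-1].append(event_id)
        # Signaling events before any acting event are skipped in this sketch.
    return tasks


# Invented log for two threads.
log = [
    (1, "t1", "signal"),
    (2, "t2", "act"),
    (3, "t2", "signal"),
    (4, "t2", "signal"),
    (5, "t2", "act"),
    (6, "t1", "act"),
    (7, "t1", "signal"),
]
tasks = split_into_tasks(log)
```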

10 Issues & Solutions - I
Problem: False positives caused by common sync objects.
Proposed Solution: A notion of “time” is required, but global clocks or vector clocks are expensive and complex. Heuristic: use the order of events in the event logs.
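One way to read the log-order heuristic: when the same sync object is signaled repeatedly, pair each acting event with the most recent signaling event on that object that appears earlier in the log, rather than with every signaling event on it. The slide does not spell out the exact rule, so this pairing policy and the tuple layout are assumptions.

```python
def pair_by_log_order(events):
    """Pair each acting event with the latest earlier signal on the same tag.

    events: list of (event_id, type, tag) in log order,
            where type is "signal" or "act" (assumed labels).
    Returns a list of (signal_id, act_id) edges.
    """
    last_signal = {}  # tag -> id of the most recent signaling event seen
    edges = []
    for event_id, etype, tag in events:
        if etype == "signal":
            last_signal[tag] = event_id
        elif etype == "act" and tag in last_signal:
            edges.append((last_signal[tag], event_id))
    return edges


# Invented log: the same lock is signaled and acted on twice.
log = [
    (1, "signal", "lock"),
    (2, "act", "lock"),
    (3, "signal", "lock"),
    (4, "act", "lock"),
]
edges = pair_by_log_order(log)
```

Matching on the tag alone would also yield the spurious pairs (1, 4) and (3, 2); using log order keeps only (1, 2) and (3, 4).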

11 Issues & Solutions - II
Problem: False positives caused by communication: multiple writes on the same socket.
Proposed Solution: Heuristic: use the “packet size” and the total received so far to decide which write to associate with which reads.
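The byte-counting heuristic can be sketched as interval matching: treat each write and each read as a byte range in the socket's stream, and associate a write with a read only when their ranges overlap. The function name, the input layout, and the overlap rule are assumptions for this example; the slide only names the quantities used.

```python
def match_writes_to_reads(writes, reads):
    """Associate socket writes with the reads that consume their bytes.

    writes, reads: lists of (event_id, num_bytes) in stream order.
    Returns a list of (write_id, read_id) pairs whose byte ranges overlap.
    """
    edges = []
    w_start = 0
    for w_id, w_len in writes:
        w_end = w_start + w_len          # write covers bytes [w_start, w_end)
        r_start = 0
        for r_id, r_len in reads:
            r_end = r_start + r_len      # read covers bytes [r_start, r_end)
            if r_start < w_end and w_start < r_end:
                edges.append((w_id, r_id))
            r_start = r_end
        w_start = w_end
    return edges


# Invented sizes: two writes (100 and 50 bytes) consumed by three reads.
writes = [(1, 100), (2, 50)]
reads = [(10, 60), (11, 40), (12, 50)]
pairs = match_writes_to_reads(writes, reads)
```

With these sizes, the first write is consumed by reads 10 and 11 and the second by read 12, so no edge links write 1 to read 12 even though all three reads use the same socket.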

12 Issues & Solutions - III
Problem: False negatives caused by guarded waits. Multiple waiting threads are notified, and the lock condition may be updated before the current thread executes; hence a condition check is required after waking up.
Proposed Solution: Manually update such cases: remove the augmented code within the loop and add a marker just after the loop.

13 Evaluation - I
Performance Impact
- Runtime: 22.2% increase in binary size; 38% increase in execution time.
- TaskGraph building using Prolog:

14 Evaluation – II (Demo)
Helping a new HDFS developer analyze an HDFS execution.


