1 Pensieve: Non-Intrusive Failure Reproduction for Distributed Systems using the Event Chaining Approach
Yongle Zhang, Serguei Makarov, Xiang Ren, David Lion, Ding Yuan

2 Failure Reproduction is Time-Consuming
$156 Billion in 2013
Understand bug
Verify fix

3 A Real-World Debugging Procedure
HDFS-6130
Reporter: Provided error description, log & reproduction steps.
Developer: Can’t reproduce.
Reporter: Added some reproduction steps.
Developer: Still cannot reproduce. Could you upload fsimage?
Reporter: Fsimage uploaded. Revised reproduction steps.
Developer: Still cannot reproduce.

4 A Real-World Debugging Procedure
HDFS-6130
Reporter: Provided error description, log & reproduction steps.
Developer: Can’t reproduce.
Reporter: Added some reproduction steps.
Developer: Still cannot reproduce. Could you upload fsimage?
Reporter: Fsimage uploaded. Revised reproduction steps.
Developer: Still cannot reproduce.
. . .
[5 days, 29 discussions]
Developer: Reproduced…
[after another 8 minutes]
Developer: Posted the working patch.

5 Pensieve: Non-Intrusive Failure Reproduction
Analyzes Java bytecode & log
Non-intrusive
Works on JVM based system

6 Pensieve: Non-Intrusive Failure Reproduction
Analyzes Java bytecode & log
Non-intrusive
Works on JVM based system
Command Line Input (HDFS-4022):
    ./pensieve -jar ./hadoop-hdfs alpha.jar            // Java bytecode
               -log ./HDFS-logs/                       // failure logs
               -error ./HDFS-logs/datanode-2.log#800   // symptoms

7 Pensieve: Non-Intrusive Failure Reproduction
Analyzes Java bytecode & log
Non-intrusive
Works on JVM based system
Command Line Input (HDFS-4022):
    ./pensieve -jar ./hadoop-hdfs alpha.jar            // Java bytecode
               -log ./HDFS-logs/                       // failure logs
               -error ./HDFS-logs/datanode-2.log#800   // symptoms
datanode-2.log:
    . . .
    800: ERROR: invalid block

8 Pensieve: Non-Intrusive Failure Reproduction
Analyzes Java bytecode & log
Non-intrusive
Works on JVM based system
Output Unit Test (HDFS-4022):
    initCluster(1);       // start 1-machine cluster
    create("a.txt",2);    // create 2-replica file
    append("a.txt","X");  // append to file
    addDataNode();        // add a datanode on the fly

9 Existing Solutions Are Limited
Record-and-replay (deterministic replay)
    Intrusive: modifies existing software stack
    Incurs performance overhead
Symbolic execution
    E.g., ESD [Zamfir EuroSys’10], SherLog [Yuan ASPLOS’10]
    Pros: precise & non-intrusive
    Cons: hard to scale to large systems

10 Scalability of Symbolic Execution
Simplified code snippet for HDFS-4022:
    for(int i=0;i<blocks.length;i++){
      if(blocks[i].genStamp!=VALID_GS)
        log("invalid block. . .");
    }
Symbolic execution enumerates every possible execution path.

11 Scalability of Symbolic Execution
Simplified code snippet for HDFS-4022:
    for(int i=0;i<blocks.length;i++){
      if(blocks[i].genStamp!=VALID_GS)
        log("invalid block. . .");
    }
Symbolic execution enumerates every possible execution path.
Condition to reach the log statement:
    blocks[i].genStamp!=VALID_GS

12 Scalability of Symbolic Execution
Simplified code snippet for HDFS-4022:
    for(int i=0;i<blocks.length;i++){
      if(blocks[i].genStamp!=VALID_GS)
        log("invalid block. . .");
    }
Symbolic execution enumerates every possible execution path.
Condition to reach the log statement:
    blocks[i].genStamp!=VALID_GS
which expands, path by path, to:
    blocks[0].genStamp!=VALID_GS
    blocks[1].genStamp!=VALID_GS && blocks[0].genStamp==VALID_GS
    . . .

13 Scalability of Symbolic Execution
Simplified code snippet for HDFS-4022:
    for(int i=0;i<blocks.length;i++){
      if(blocks[i].genStamp!=VALID_GS)
        log("invalid block. . .");
    }
Symbolic execution enumerates every possible execution path.
Condition to reach the log statement:
    blocks[i].genStamp!=VALID_GS
which expands, path by path, to:
    blocks[0].genStamp!=VALID_GS
    OR
    blocks[1].genStamp!=VALID_GS && blocks[0].genStamp==VALID_GS
    OR
    . . .

14 Scalability of Symbolic Execution
Simplified code snippet for HDFS-4022:
    for(int i=0;i<blocks.length;i++){
      if(blocks[i].genStamp!=VALID_GS)
        log("invalid block. . .");
    }
Symbolic execution enumerates every possible execution path.

Failure   | Total Branch Instructions | S.E. Stopped At (Instructions) | Condition Size | Pensieve (Instructions)
HDFS-4022 | 72,943,652                | 693                            | 109,018,324    | 166
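
The growth of the path condition for this loop can be made concrete with a small sketch. It is not a symbolic execution engine and not part of Pensieve; the class and method names (`PathEnumerationSketch`, `firstHitConditions`, `countPaths`) are hypothetical. It only prints, for each iteration `i`, the condition under which the log statement is first reached at that iteration, which is the disjunction the slides start to write out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Pensieve and not a real symbolic executor.
class PathEnumerationSketch {

    // Each loop iteration either takes the branch or skips it,
    // so a loop with n iterations has 2^n execution paths (valid for 0 <= n <= 62).
    static long countPaths(int n) {
        return 1L << n;
    }

    // For each i, builds the condition under which the log statement
    // is reached for the first time at iteration i.
    static List<String> firstHitConditions(int n) {
        List<String> conditions = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            StringBuilder sb = new StringBuilder();
            sb.append("blocks[").append(i).append("].genStamp!=VALID_GS");
            // All earlier blocks must have had a valid generation stamp.
            for (int j = i - 1; j >= 0; j--) {
                sb.append(" && blocks[").append(j).append("].genStamp==VALID_GS");
            }
            conditions.add(sb.toString());
        }
        return conditions;
    }
}
```

Joining the returned strings with `OR` gives the full condition for "the error was logged at least once"; the i-th disjunct has i+1 terms, so the formula grows quickly with `blocks.length` even for this four-line loop.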

15 Core Idea – Partial Trace Observation
Developers almost never debug a failure by reconstructing its complete execution path. Instead, they construct a simplified trace which only contains events that are likely to be causally relevant to the failure.

16 How do developers debug HDFS-4022?
    for(int i=0;i<blocks.length;i++)
      if(blocks[i].genStamp!=VALID_GS)
        log("invalid block. . .");

17 How do developers debug HDFS-4022?
    if(this.stage==APPEND){
      log("Appending" + b);
      b.genStamp=newGS;
    }

    void rollLog(. . .){
      b.genStamp=logGS;
    }

    for(int i=0;i<blocks.length;i++)
      if(blocks[i].genStamp!=VALID_GS)
        log("invalid block. . .");

19 How do developers debug HDFS-4022?
    if(this.stage==APPEND){
      log("Appending" + b);
      b.genStamp=newGS;
    }

    for(int i=0;i<blocks.length;i++)
      if(blocks[i].genStamp!=VALID_GS)
        log("invalid block. . .");

21 How do developers debug HDFS-4022?
Client.java:
    void append(. . .){
      stage=APPEND;
      . . .
    }

    if(this.stage==APPEND){
      log("Appending" + b);
      b.genStamp=newGS;
    }

    for(int i=0;i<blocks.length;i++)
      if(blocks[i].genStamp!=VALID_GS)
        log("invalid block. . .");

22 How do developers debug HDFS-4022?
Client.java:
    void append(. . .){
      stage=APPEND;
      . . .
    }

Network: serialization, de-serialization

DataNode.java:
    if(this.stage==APPEND){
      log("Appending" + b);
      b.genStamp=newGS;
    }

    for(int i=0;i<blocks.length;i++)
      if(blocks[i].genStamp!=VALID_GS)
        log("invalid block. . .");

23 How do developers debug HDFS-4022?
Client.java:
    void append(. . .){
      stage=APPEND;
      . . .
    }

Network: serialization, de-serialization

DataNode.java:
    if(this.stage==APPEND){
      log("Appending" + b);
      b.genStamp=newGS;
    }

    for(int i=0;i<blocks.length;i++)
      if(blocks[i].genStamp!=VALID_GS)
        log("invalid block. . .");

Got one user command (append) by looking at 8 instructions!

24 Event Chaining Approach
Event: a point in time during execution
    Location event: a program location reached
    Condition event: a condition holds
    Invocation event: a function invoked

25 Event Chaining Approach
Event types: location event, condition event, invocation event
An event is explained by other events
Location event: explained by path conditions
    0. for(int i=0;i<blocks.length;i++)
    1.   if(blocks[i].genStamp!=VALID_GS)
    2.     log("invalid block. . .");
datanode-2.log:
    888: ERROR: invalid block…
Event chain: e1:line2(L2)

26 Event Chaining Approach
Event types: location event, condition event, invocation event
An event is explained by other events
Location event: explained by path conditions
    0. for(int i=0;i<blocks.length;i++)
    1.   if(blocks[i].genStamp!=VALID_GS)
    2.     log("invalid block. . .");
datanode-2.log:
    888: ERROR: invalid block…
Event chain: e2:blocks[i].genStamp → e1:L2

27 Event Chaining Approach
Event types: location event, condition event, invocation event
An event is explained by other events
Condition event: explained by definitions
    3. if(this.stage==APPEND){
    4.   log("Appending" + b);
    5.   b.genStamp=newGS;
         …
       }

    0. for(int i=0;i<blocks.length;i++)
    1.   if(blocks[i].genStamp!=VALID_GS)
    2.     log("invalid block. . .");
datanode-2.log:
    888: ERROR: invalid block…
Event chain: e3:L5 → e2:blocks[i].genStamp → e1:L2

28 Event Chaining Approach
Event types: location event, condition event, invocation event
An event is explained by other events
    6. void append(. . .){
    7.   stage=APPEND;
         …
       }

    3. if(this.stage==APPEND){
    4.   log("Appending" + b);
    5.   b.genStamp=newGS;
         …
       }

    0. for(int i=0;i<blocks.length;i++)
    1.   if(blocks[i].genStamp!=VALID_GS)
    2.     log("invalid block. . .");
datanode-2.log:
    888: ERROR: invalid block…
Event chain: e5:L7 → e3:L5 → e2:blocks[i].genStamp → e1:L2

29 Event Chaining Approach
Event types: location event, condition event, invocation event
An event is explained by other events
Location event: explained by function invocation
    6. void append(. . .){
    7.   stage=APPEND;
         …
       }

    3. if(this.stage==APPEND){
    4.   log("Appending" + b);
    5.   b.genStamp=newGS;
         …
       }

    0. for(int i=0;i<blocks.length;i++)
    1.   if(blocks[i].genStamp!=VALID_GS)
    2.     log("invalid block. . .");
datanode-2.log:
    888: ERROR: invalid block…
Event chain: e6:append() → e5:L7 → e3:L5 → e2:blocks[i].genStamp → e1:L2

30 Event Chaining Approach
Event types: location event, condition event, invocation event
Captures dependency on shared variables
    6. void append(. . .){
    7.   stage=APPEND;
         …
       }

    3. if(this.stage==APPEND){
    4.   log("Appending" + b);
    5.   b.genStamp=newGS;
         …
       }

    0. for(int i=0;i<blocks.length;i++)
    1.   if(blocks[i].genStamp!=VALID_GS)
    2.     log("invalid block. . .");
datanode-2.log:
    888: ERROR: invalid block…
Event chain (spanning Thread 1 and Thread 2): e6:append() → e5:L7 → e3:L5 → e2:blocks[i].genStamp → e1:L2
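
The chain built up over the last few slides can be sketched as a backward walk over "explained by" links. This is only an illustration under stated assumptions: the links are written out by hand rather than derived from bytecode, the names `EventChainSketch`, `chainFrom` and `hdfs4022Chain` are hypothetical, and the condition event `this.stage==APPEND` between e3 and e5 is an assumption (the slides do not label an e4).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of backward event chaining; not Pensieve's implementation.
class EventChainSketch {
    enum Kind { LOCATION, CONDITION, INVOCATION }

    static final class Event {
        final Kind kind;
        final String label;

        Event(Kind kind, String label) {
            this.kind = kind;
            this.label = label;
        }

        @Override
        public String toString() {
            return kind + "(" + label + ")";
        }
    }

    // Starts at the symptom and follows "explained by" links until an event
    // has no recorded explanation. Assumes the links contain no cycle.
    static List<Event> chainFrom(Event symptom, Map<Event, Event> explainedBy) {
        List<Event> chain = new ArrayList<>();
        for (Event e = symptom; e != null; e = explainedBy.get(e)) {
            chain.add(e);
        }
        return chain;
    }

    // Hand-written links for the HDFS-4022 example on the slides.
    static List<Event> hdfs4022Chain() {
        Event e1 = new Event(Kind.LOCATION, "L2");
        Event e2 = new Event(Kind.CONDITION, "blocks[i].genStamp!=VALID_GS");
        Event e3 = new Event(Kind.LOCATION, "L5");
        Event e4 = new Event(Kind.CONDITION, "this.stage==APPEND"); // assumed, not labeled on the slides
        Event e5 = new Event(Kind.LOCATION, "L7");
        Event e6 = new Event(Kind.INVOCATION, "append()");

        Map<Event, Event> explainedBy = new HashMap<>();
        explainedBy.put(e1, e2); // location event explained by its path condition
        explainedBy.put(e2, e3); // condition event explained by a definition
        explainedBy.put(e3, e4);
        explainedBy.put(e4, e5);
        explainedBy.put(e5, e6); // location event explained by a function invocation
        return chainFrom(e1, explainedBy);
    }
}
```

Printing `hdfs4022Chain()` lists the events from the logged symptom back to the `append()` invocation, which is the user command the analysis is looking for.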

31 Forking for Multiple Possibilities
    0. for(int i=0;i<blocks.length;i++)
    1.   if(blocks[i].genStamp!=VALID_GS)
    2.     log("invalid block. . .");
Event chain: e2:blocks[i].genStamp → e1:L2

32 Forking for Multiple Possibilities
    3. if(this.stage==APPEND){
    4.   log("Appending" + b);
    5.   b.genStamp=newGS;
         …
       }
       …
       void rollLog(. . .){
    8.   b.genStamp=logGS;
         …
       }

    0. for(int i=0;i<blocks.length;i++)
    1.   if(blocks[i].genStamp!=VALID_GS)
    2.     log("invalid block. . .");
Event chain: e2:blocks[i].genStamp → e1:L2
Fork, one chain per possible definition:
    e3:L5 → e2 → e1
    e7:L8 → e2 → e1

33 Priority based scheduling
Before the fork:
    e2:blocks[i].genStamp → e1:L2    P:1000
After the fork:
    e3:L5 → e2 → e1    P:500
    e7:L8 → e2 → e1    P:500

34 Priority based scheduling
    3. if(this.stage==APPEND){
    4.   log("Appending" + b);
    5.   b.genStamp=newGS;
         …
       }
       …
       void rollLog(. . .){
    8.   b.genStamp=logGS;
         …
       }

    0. for(int i=0;i<blocks.length;i++)
    1.   if(blocks[i].genStamp!=VALID_GS)
    2.     log("invalid block. . .");
Forked event chains:
    e3:L5 → e2:blocks[i].genStamp → e1:L2    P:500
    e7:L8 → e2:blocks[i].genStamp → e1:L2    P:500

35 Priority based scheduling
    3. if(this.stage==APPEND){
    4.   log("Appending" + b);
    5.   b.genStamp=newGS;
         …
       }
       …
       void rollLog(. . .){
    8.   b.genStamp=logGS;
         …
       }

    0. for(int i=0;i<blocks.length;i++)
    1.   if(blocks[i].genStamp!=VALID_GS)
    2.     log("invalid block. . .");
Log files:
    . . .
    Appending…
Forked event chains:
    e3:L5 → e2:blocks[i].genStamp → e1:L2    P:1000
    e7:L8 → e2:blocks[i].genStamp → e1:L2    P:0

36 Priority based scheduling
Favors event chains with most matched logs
Favors simpler reproduction paths
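
The two preferences above map naturally onto a priority queue. The following is a sketch under assumptions, not Pensieve's scheduler: it does not reproduce the P:1000 / P:500 / P:0 values from the slides, and the scoring (count of matched log messages first, chain length as tie-breaker) and all names (`ChainSchedulerSketch`, `Chain`, `nextToExplore`) are hypothetical.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch of priority-based scheduling of forked event chains.
class ChainSchedulerSketch {

    static final class Chain {
        final String name;
        final int matchedLogs; // log messages on this chain that appear in the failure logs
        final int length;      // number of events in the chain

        Chain(String name, int matchedLogs, int length) {
            this.name = name;
            this.matchedLogs = matchedLogs;
            this.length = length;
        }
    }

    // More matched logs first; among equals, the shorter (simpler) chain first.
    static final Comparator<Chain> ORDER =
            Comparator.comparingInt((Chain c) -> c.matchedLogs)
                    .reversed()
                    .thenComparingInt(c -> c.length);

    // Returns the name of the chain that would be explored next, or null if there is none.
    static String nextToExplore(Chain... candidates) {
        PriorityQueue<Chain> queue = new PriorityQueue<>(ORDER);
        for (Chain c : candidates) {
            queue.add(c);
        }
        Chain best = queue.peek();
        return best == null ? null : best.name;
    }

    // The fork from the slides: the append path logs "Appending", which is in the log files.
    static String demo() {
        return nextToExplore(
                new Chain("via e7:L8 (rollLog)", 0, 3),
                new Chain("via e3:L5 (append)", 1, 3));
    }
}
```

In this sketch `demo()` picks the chain through `e3:L5`, because its path contains the `log("Appending" + b)` statement whose output is present in the log files.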

37 Eliminating Infeasible Event Chains
Path conditions
    Variable substitution
    Logical conjunction
    0. for(int i=0;i<blocks.length;i++)
    1.   if(blocks[i].genStamp!=VALID_GS)
    2.     log("invalid block. . .");
Event chain: e2:blocks[i].genStamp → e1:L2

38 Eliminating Infeasible Event Chains
Path conditions
    Variable substitution
    Logical conjunction
    3. if(this.stage==APPEND){
    4.   log("Appending" + b);
    5.   b.genStamp=newGS;
         …
       }

    0. for(int i=0;i<blocks.length;i++)
    1.   if(blocks[i].genStamp!=VALID_GS)
    2.     log("invalid block. . .");
Event chain: e3:L5 → e2:blocks[i].genStamp → e1:L2
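
A toy version of the feasibility check can show how substitution and conjunction rule a chain out. This sketch makes strong simplifying assumptions: values are concrete integers rather than symbolic expressions, no constraint solver is involved, and `VALID_GS = 1000` as well as the names `FeasibilitySketch`, `Cond` and `feasible` are made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.IntPredicate;

// Hypothetical sketch: substitute known definitions into path conditions,
// then require the conjunction of all conditions to hold.
class FeasibilitySketch {
    static final int VALID_GS = 1000; // made-up value, for illustration only

    static final class Cond {
        final String variable;
        final IntPredicate holds;

        Cond(String variable, IntPredicate holds) {
            this.variable = variable;
            this.holds = holds;
        }
    }

    // A chain is kept only if no condition is contradicted by a definition on it.
    // Variables without a known definition are treated as unconstrained.
    static boolean feasible(Map<String, Integer> definitions, List<Cond> pathConditions) {
        for (Cond c : pathConditions) {
            Integer value = definitions.get(c.variable); // variable substitution
            if (value != null && !c.holds.test(value)) {
                return false;                            // conjunction is unsatisfiable
            }
        }
        return true;
    }

    // e2 requires genStamp != VALID_GS; the chain defines genStamp as the given value.
    static boolean demo(int assignedGenStamp) {
        List<Cond> conditions = Arrays.asList(new Cond("genStamp", v -> v != VALID_GS));
        Map<String, Integer> definitions = Collections.singletonMap("genStamp", assignedGenStamp);
        return feasible(definitions, conditions);
    }
}
```

With these made-up values, `demo(1001)` keeps the chain, while `demo(1000)` discards it because the definition contradicts the condition needed to reach the log statement.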

39 Skip Less-Relevant Loops
Skips loops when there’s no loop-carried dependency
    77% of randomly sampled loops in HDFS
Follows loop iterations otherwise
    3. if(this.stage==APPEND){
    4.   log("Appending" + b);
    5.   b.genStamp=newGS;
         …
       }

    0. for(int i=0;i<blocks.length;i++)
    1.   if(blocks[i].genStamp!=VALID_GS)
    2.     log("invalid block. . .");
Event chain: e3:L5 → e2:blocks[i].genStamp → e1:L2

40 Verification & Testcase Refinement
Sacrifices precision for efficiency
Event chain:
    e3: i=2;
    e2: i>1
    e1: log("ERROR");

41 Verification & Testcase Refinement
Sacrifices precision for efficiency
Event chain (checked against an execution):
    e3: i=2;
    e2: i>1
    e1: log("ERROR");

42 Verification & Testcase Refinement
Sacrifices precision for efficiency
Event chain (checked against an execution):
    e3: i=2;
    e2: i>1            X Diverged!
    e1: log("ERROR");

43 Verification & Testcase Refinement
Sacrifices precision for efficiency
Event chain vs. execution:
    e3: i=2
    e5: N>0
    e4: i=0;
    e2: i>1            X Diverged!
    e1: log("ERROR");

44 Verification & Testcase Refinement
Sacrifices precision for efficiency
Execution:
    e3: i=2
    e5: N>0
    e4: i=0;
    e2: i>1            X Diverged!
    e1: log("ERROR");
Event chain:
    e3: i=2
    e6: N<=0
    e2: i>1
    e1: log("ERROR");

45 Verification & Testcase Refinement
Variable modified in a different thread (in paper)
Execution:
    e3: i=2
    e5: N>0
    e4: i=0;
    e2: i>1            X Diverged!
    e1: log("ERROR");
Event chain:
    e3: i=2
    e6: N<=0
    e2: i>1
    e1: log("ERROR");
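
The divergence check in this example can be sketched as an ordered comparison between the predicted event chain and the events observed when the generated test runs. This is an assumption-laden illustration, not Pensieve's verifier: events are plain strings, matching is exact string equality, and the names `VerificationSketch` and `firstDivergence` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: find where an execution stops following the predicted event chain.
class VerificationSketch {

    // Returns the index of the first expected event that does not occur, in order,
    // in the observed execution; returns -1 if the whole chain is observed.
    static int firstDivergence(List<String> expected, List<String> observed) {
        int pos = 0;
        for (int i = 0; i < expected.size(); i++) {
            int found = -1;
            for (int j = pos; j < observed.size(); j++) {
                if (observed.get(j).equals(expected.get(i))) {
                    found = j;
                    break;
                }
            }
            if (found < 0) {
                return i; // the execution diverged before this event
            }
            pos = found + 1;
        }
        return -1;
    }

    // Original chain: the run takes the N>0 branch, i is reset to 0, and i>1 never holds.
    static int demoDiverged() {
        return firstDivergence(
                Arrays.asList("i=2", "i>1", "log(\"ERROR\")"),
                Arrays.asList("i=2", "N>0", "i=0"));
    }

    // Refined chain: adding the condition N<=0 steers the run away from the reset.
    static int demoRefined() {
        return firstDivergence(
                Arrays.asList("i=2", "N<=0", "i>1", "log(\"ERROR\")"),
                Arrays.asList("i=2", "N<=0", "i>1", "log(\"ERROR\")"));
    }
}
```

In the first case the divergence is reported at the `i>1` event, which is the point where a refinement step would look for the interfering definition (`i=0`) and add a condition such as `N<=0` to avoid it.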

46 Evaluation
Evaluated on 18 cases from JVM distributed systems
    HDFS, HBase, ZooKeeper, Cassandra
    with noisy logs generated from manual reproduction
Overall result
    Successfully reproduces 72%
    Finishes analysis within 10 min
Scalability result
    Average # of events in an event chain: 105.2
    # of forked event chains: 1367.2

47 Case Study: HDFS-6130
Useful for hard bugs
Path conditions inferred:
    fsimage.layoutVersion!=TXID_LAYOUT

    // Use old fsimage
    initCluster(UPGRADE);
    restartCluster();

    initCluster(UPGRADE);
    restartCluster();
Developers’ reproduction; Pensieve’s reproduction

48 Case Study: HDFS-4022
Finds different reproduction than developers’
    initCluster(3);
    setConfig("policy", "ALWAYS");
    create("a.txt",2);
    stopDataNode(1);
    append("a.txt","data");

    initCluster(4);
    create("a.txt",3);
    stopDataNode(3);
    append("a.txt","data");
Developers’ reproduction; Pensieve’s reproduction (fewer nodes!)

49 Limitations
Error is not logged (e.g., silent data loss)
Bugs involving resource exhaustion
Systems need to have clearly defined input events (e.g., not for compilers)

50 Related Work
Static program slicing [Weiser81]
    Obtains static trace but not dynamic partial trace
Symbolic execution based approach
    ESD, SherLog
Record-and-replay based approach
    BugRedux [Jin ICSE’12], etc.

51 Conclusion
Pensieve: automated failure reproduction
    Based on Partial Trace Observation
    Scales to real-world distributed systems
    Non-intrusive and relies on logs
Pensieve leverages the natural way human beings do failure reproduction.
Thanks!

