Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, Haryadi S. Gunawi.


1 Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, Haryadi S. Gunawi


4 4/5/16 TaxDC @ ASPLOS ’16
- More people develop distributed systems
- Distributed systems are hard
- Hard largely because of concurrency
- Concurrency leads to unexpected timings: X should arrive before Y, but X can arrive after Y
- Unexpected timings lead to distributed concurrency (DC) bugs

5 “… be able to reason about the correctness of increasingly more complex distributed systems that are used in production” (Azure engineers & managers, “Uncovering Bugs in Distributed Storage Systems during Testing (Not in Production!)” [FAST ’16])
Understanding distributed system bugs is important!

6
- Bugs caused by non-deterministic timing
- Non-deterministic timing of concurrent events involving more than one node
- Events: messages, crashes, reboots, timeouts, computations

7 (LC bug: local concurrency bug in multi-threaded, single-machine software) Top 10 most cited ASPLOS paper

8
- 104 bugs
- 4 varied distributed systems
- Bugs in 2011-2014
- Study description, source code, patches

9 TaxDC
- Trigger
  - Timing: Order Violation, Atomicity Violation, Fault Timing, Reboot Timing
  - Input: Fault, Reboot, Workload
  - Scope: Nodes, Messages, Protocols
- Error & Failure
  - Error: Local (Memory, Semantic, Hang, Silence), Global (Wrong, Miss, Silence)
  - Failure: Downtime, Data Loss, Op Fail, Performance
- Fix
  - Timing: Global Sync, Local Sync
  - Handling: Retry, Ignore, Accept, Others

10 ZooKeeper-1264
1. Follower F crashes, reboots, and joins cluster
2. Leader L syncs snapshot with F
3. Client requests new update, F applies this only in memory
4. Sync finishes
5. Client requests other update, F writes this to disk correctly
6. F crashes, reboots, and joins cluster again
7. This time L sends only diff after update in step 5
8. F loses update in step 3

11 ZooKeeper-1264 (same eight steps as slide 10), classified with TaxDC:
- Timing: atomicity violation, fault timing
- Input: 4 protocols, 2 faults, 2 reboots
- Error: global
- Failure: data inconsistency
- Fix: delay msg.
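The eight ZooKeeper-1264 steps can be replayed in a toy model to see where the update disappears. This is only a sketch under stated assumptions, not ZooKeeper code: the `Follower` class, the `syncing` flag, and the update names `u1`/`u2` are hypothetical, and the bug is modelled simply as "updates applied during a snapshot sync reach memory but not disk".

```python
class Follower:
    """Toy follower: `disk` survives a crash, `memory` does not."""

    def __init__(self):
        self.disk = []
        self.memory = []
        self.syncing = False

    def apply(self, update):
        self.memory.append(update)
        # Assumed model of the bug: while a snapshot sync is in progress,
        # the update is applied in memory but never written to disk.
        if not self.syncing:
            self.disk.append(update)

    def crash_and_reboot(self):
        # Anything that lived only in memory is gone after the reboot.
        self.memory = list(self.disk)


def replay_zk1264():
    leader_log = []
    f = Follower()
    f.crash_and_reboot()          # step 1
    f.syncing = True              # step 2: snapshot sync starts
    leader_log.append("u1")
    f.apply("u1")                 # step 3: applied in memory only
    f.syncing = False             # step 4: sync finishes
    leader_log.append("u2")
    f.apply("u2")                 # step 5: written to disk
    f.crash_and_reboot()          # step 6
    # Step 7: the leader sends only the diff after the follower's last
    # on-disk update (u2), so the diff is empty.
    last = f.disk[-1]
    for update in leader_log[leader_log.index(last) + 1:]:
        f.apply(update)
    return leader_log, f.memory   # step 8: u1 is missing on the follower


leader_log, follower_state = replay_zk1264()
print("leader:", leader_log, "follower:", follower_state)
```

The leader ends with both updates while the follower holds only `u2`, which is the data inconsistency named on the slide.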


13 TaxDC taxonomy, Trigger
Conditions that make bugs happen

14 TaxDC taxonomy, Trigger / Timing
What: Untimely moment that makes the bug happen
Why: Help design bug detection tools

15 Trigger / Timing: Message
“Does the timing involve many messages?”
Ex: MapReduce-3274

16 Trigger / Timing: Message, order violation (44%)
“Does the timing involve many messages?”
Two events, X and Y: Y must happen after X, but Y happens before X.
Ex: MapReduce-3274

17 Trigger / Timing: Message, order violation (44%), msg-msg race
“Does the timing involve many messages?”
Two events, X and Y: Y must happen after X, but Y happens before X.
Ex: MapReduce-3274 (figure labels: Submit, Kill)
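A message-message race of this kind can be illustrated by delivering the same two messages in both orders. The sketch below is hypothetical and is not taken from MapReduce: the `deliver` function, the message tuples, and the job name `job1` are assumptions chosen only to show how "Kill before Submit" leaves the node in a different state than "Submit before Kill".

```python
from itertools import permutations


def deliver(order):
    """Apply messages in the given order; return (running jobs, errors)."""
    jobs = set()
    errors = []
    for kind, job in order:
        if kind == "submit":
            jobs.add(job)
        elif kind == "kill":
            if job in jobs:
                jobs.remove(job)
            else:
                # The untimely case: Kill overtook Submit.
                errors.append(f"kill for unknown job {job}")
    return jobs, errors


submit = ("submit", "job1")
kill = ("kill", "job1")

# Try every delivery order of the two messages.
results = {order: deliver(order) for order in permutations([submit, kill])}
in_order = results[(submit, kill)]
reordered = results[(kill, submit)]
print("submit, kill:", in_order)
print("kill, submit:", reordered)
```

In the expected order the job is submitted and then killed. In the reordered case the kill is reported as an error and the job is left running.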

18 Trigger / Timing: Message, order violation (44%), msg-msg race
- Receive-receive race
- Send-send race
- Receive-send race
Examples: HBase-5780, MapReduce-3274, MapReduce-5358
[Figure: message diagrams between nodes A and B. Recoverable labels: New key, Old key, New key (late), Expired!, Kill, End report, Kill what job?]

19 Trigger / Timing: Message, order violation (44%)
- Msg-msg race
- Msg-compute race
Order violation: two events, X and Y; Y must happen after X, but Y happens before X.
Ex: MapReduce-4157

20 Trigger / Timing: Message, order violation (44%), atomicity violation (20%)
A message comes in the middle of an atomic operation.
Ex: Cassandra-1011, HBase-4729, MapReduce-5009, ZooKeeper-1496
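To make "a message comes in the middle of an atomic operation" concrete, the sketch below slides a read message across every point of a two-step update. It is an illustrative assumption, not code from any of the listed bugs: the `version`/`data` fields and the `run` function are hypothetical.

```python
def run(read_position):
    """Perform a two-step update; handle a read message at `read_position`.

    Position 0 is before the update, 1 is between its two steps,
    2 is after it. Returns what the read message observed.
    """
    state = {"version": 1, "data": "v1"}
    steps = [
        lambda: state.update(version=2),   # first half of the update
        lambda: state.update(data="v2"),   # second half of the update
    ]
    observed = None
    for i in range(len(steps) + 1):
        if i == read_position:
            observed = (state["version"], state["data"])
        if i < len(steps):
            steps[i]()
    return observed


observations = [run(position) for position in range(3)]
for position, seen in enumerate(observations):
    consistent = seen in {(1, "v1"), (2, "v2")}
    print(position, seen, "ok" if consistent else "ATOMICITY VIOLATION")
```

Only the middle position sees version 2 paired with the old data, so the bug appears for exactly one of the three arrival times.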

21 Trigger / Timing: Fault (21%)
Fault at specific timing.
Ex: Cassandra-6415, HBase-5806, MapReduce-3858, ZooKeeper-1653
No fault timing in LC bugs; only in DC bugs.
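"Fault at specific timing" means a crash is harmless at most points and harmful only inside a narrow window. The sketch below injects a crash before each step of a made-up hand-off protocol and reports the window. The protocol (node A hands a task to node B and deletes its copy once B acknowledges, while B acknowledges before persisting) and all names are hypothetical.

```python
STEPS = ["b_receives_and_acks", "a_deletes_copy", "b_persists"]


def task_survives(crash_before=None):
    """Run the hand-off; crash and reboot B just before step `crash_before`.

    `crash_before == len(STEPS)` means B crashes after the last step.
    Returns True if at least one durable copy of the task remains.
    """
    a_copy, b_memory, b_disk, acked = True, False, False, False
    for i in range(len(STEPS) + 1):
        if i == crash_before:
            b_memory = False          # the crash wipes B's memory
        if i == len(STEPS):
            break
        step = STEPS[i]
        if step == "b_receives_and_acks":
            b_memory, acked = True, True
        elif step == "a_deletes_copy":
            if acked:
                a_copy = False
        elif step == "b_persists":
            b_disk = b_memory         # persists only what is still in memory
    return a_copy or b_disk


# Inject the crash at every possible point and keep the harmful ones.
harmful = [p for p in range(len(STEPS) + 1) if not task_survives(p)]
print("crash points that lose the task:", harmful)
```

A crash before the hand-off or after the persist is tolerated. Only a crash between the acknowledgement and the persist loses the task.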

22 Trigger / Timing: Reboot (11%)
Reboot at specific timing.
Ex: Cassandra-2083, Hadoop-3186, MapReduce-5489, ZooKeeper-975

23 Trigger / Timing: Mix (4%)
ZooKeeper-1264
1. Follower F crashes, reboots, and joins cluster
2. Leader L syncs snapshot with F
3. Client requests new update, F applies this only in memory (in the middle of sync snapshot)
4. Sync finishes
5. Client requests other update, F writes this to disk correctly
6. F crashes, reboots, and joins cluster again
7. This time L sends only diff after update in step 5
8. F loses update in step 3
Annotations on the slide: atomicity violation, fault timing, failure

24 Trigger / Timing: message timing, fault timing, reboot timing
Implication: simple patterns can inform pattern-based bug detection tools, etc.

25 TaxDC taxonomy, Trigger / Input
What: Input to exercise buggy code
Why: Improve testing coverage

26 Trigger / Input: ZooKeeper-1264 (same eight steps as slide 10)
Fault & reboot: 2 crashes, 2 reboots

27 Trigger / Input: Fault
“How many bugs require fault injection?” 37% = No fault, 63% = Yes
“What kind of fault? & How many times?”
- Timeouts: 88% = No timeout, 12% = with timeout
- Crashes: 53% = No crash, 35% = 1 crash, 12% = more than 1 crash
Real-world DC bugs are NOT just about message re-ordering, but about faults as well.

28 Trigger / Input: Reboot
“How many reboots?” 73% = No reboot, 20% = 1 reboot, 7% = more than 1 reboot

29 Trigger / Input: Workload
Cassandra Paxos bug (Cassandra-6023): 3 concurrent user requests!
“How many protocols to run as input?” 20% = 1, 80% = 2+ protocols
Implication: multiple protocols for DC testing

30 TaxDC taxonomy, Error
What: First effect of untimely ordering
Why: Help failure diagnosis and bug detection

31 Error: Local
Error can be observed in one triggering node (46%): null pointer, false assertion, etc.
Implication: identify opportunities for failure diagnosis and bug detection

32 Error: Global
Error cannot be observed in one node (54%).
Many are silent errors and hard to diagnose (hidden errors, no error messages, long debugging).
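The difference between a local and a global error can be shown with two checkers: one that looks at a single node and one that needs the whole cluster. The sketch below is a hypothetical illustration; the node dictionaries, the "exactly one leader" invariant, and both function names are assumptions and do not come from the studied systems.

```python
def local_errors(node):
    """Checks that need only one node's state (e.g. an assertion)."""
    errors = []
    if node["role"] not in {"leader", "follower"}:
        errors.append(f"{node['name']}: invalid role {node['role']!r}")
    return errors


def global_errors(nodes):
    """Checks that need the state of several nodes at once."""
    leaders = [n["name"] for n in nodes if n["role"] == "leader"]
    if len(leaders) != 1:
        return [f"expected exactly 1 leader, found {len(leaders)}: {leaders}"]
    return []


# Each node looks healthy on its own, but two nodes believe they lead.
cluster = [
    {"name": "A", "role": "leader"},
    {"name": "B", "role": "leader"},
    {"name": "C", "role": "follower"},
]

per_node = [local_errors(node) for node in cluster]
cluster_wide = global_errors(cluster)
print("local:", per_node)
print("global:", cluster_wide)
```

No node raises a local error, so nothing is logged on any single machine, yet the cluster-wide invariant is broken. This is one way such errors can stay silent.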

33 TaxDC taxonomy, Fix
What: How developers fix bugs
Why: Help design runtime prevention and automatic patch generation

34 Fix: Complex
Are patches complicated? Are patches adding synch.?
- Add global synchronization: similar to fixing LC bugs, add synchronization, e.g. lock()
- Add new states & transitions

35 Fix: Complex vs. Simple. Simple: Delay

36 Fix: Complex vs. Simple. Simple: Delay, Ignore/Discard

37 Fix: Complex vs. Simple. Simple: Delay, Ignore/Discard, Retry

38 Fix: Complex vs. Simple. Simple: Delay, Ignore/Discard, Retry, Accept
Code shown on the slide: f(msg); g(msg);

39 Fix: Complex vs. Simple. Simple: Delay, Ignore, Retry, Accept
40% are easy to fix (no new computation logic).
Implication: many fixes can inform automatic runtime prevention
Code shown on the slide: f(msg); g(msg);
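The four simple fix strategies can be compared on one untimely message, here a Kill that arrives before its Submit. This is a sketch under stated assumptions rather than any system's actual patch: the `Node` class, the reply strings, and the interpretation of each strategy (delay = hold the message until it becomes timely, ignore = drop it, retry = ask the sender to resend later, accept = record it and handle it when the counterpart arrives) are hypothetical.

```python
class Node:
    def __init__(self, strategy):
        self.strategy = strategy
        self.jobs = set()
        self.delayed = []          # kills held back by "delay"
        self.killed_early = set()  # kills remembered by "accept"

    def on_submit(self, job):
        if job in self.killed_early:   # accept: the kill was already recorded
            self.killed_early.discard(job)
            return "dropped"
        self.jobs.add(job)
        if job in self.delayed:        # delay: replay the held-back kill
            self.delayed.remove(job)
            self.on_kill(job)
        return "submitted"

    def on_kill(self, job):
        if job in self.jobs:           # the timely case
            self.jobs.discard(job)
            return "killed"
        if self.strategy == "delay":
            self.delayed.append(job)
            return "delayed"
        if self.strategy == "ignore":
            return "ignored"
        if self.strategy == "retry":
            return "retry-later"
        if self.strategy == "accept":
            self.killed_early.add(job)
            return "accepted"
        raise ValueError(f"unknown strategy {self.strategy!r}")


def outcome(strategy):
    """Deliver Kill before Submit and report the reply and running jobs."""
    node = Node(strategy)
    reply = node.on_kill("job1")
    node.on_submit("job1")
    if reply == "retry-later":         # the sender resends the kill
        node.on_kill("job1")
    return reply, sorted(node.jobs)


outcomes = {s: outcome(s) for s in ("delay", "ignore", "retry", "accept")}
for strategy, result in outcomes.items():
    print(strategy, result)
```

None of the four strategies adds new computation logic; they only change when or whether the existing handler runs. In this toy model, ignoring the message leaves the job running, so that strategy fits only when the untimely message is safe to drop.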

40 Fix: DC bugs vs. LC bugs
Labels on the slide: Simple (Delay, Ignore/Discard, Retry, Accept), Complex, Sync.

41
- Distributed system model checker
- Formal verification
- DC bug detection
- Runtime failure prevention

42 Reality vs. model checkers: MODIST [NSDI ’11], Demeter [SOSP ’11], MaceMC [NSDI ’07], SAMC [OSDI ’14]
Events: message, crash, multiple crashes, reboot, multiple reboots, timeout, computation, disk fault
Let’s find out how to re-order all events without exploding the state space!
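The state-space concern can be quantified with a small count: n independent events have n! orderings, and every ordering constraint the checker knows about prunes some of them. The sketch below is a brute-force illustration only, not how MODIST, Demeter, MaceMC, or SAMC work; the event names and the `count_orderings` helper are hypothetical.

```python
from itertools import permutations
from math import factorial


def count_orderings(events, happens_before=()):
    """Count orderings of `events` that respect each (earlier, later) pair."""
    def respects(order):
        position = {event: i for i, event in enumerate(order)}
        return all(position[a] < position[b] for a, b in happens_before)

    return sum(1 for order in permutations(events) if respects(order))


events = ["msg1", "msg2", "crash", "reboot", "timeout"]

unconstrained = count_orderings(events)
# A node can only reboot after it has crashed, which halves the space.
constrained = count_orderings(events, happens_before=[("crash", "reboot")])

print("all orderings:", unconstrained, "=", factorial(len(events)))
print("with crash before reboot:", constrained)
print("orderings for 10 events:", factorial(10))
```

Five events already give 120 orderings and ten give 3,628,800, which is why adding crashes, reboots, and timeouts to message re-ordering needs pruning rather than plain enumeration.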

43 State-of-the-art:
- Verdi [PLDI ’15]: Raft update; ~6,000 lines of proof
- IronFleet [SOSP ’15]: Paxos update; lease-based read/write; ~5,000–10,000 lines of proof
Challenges:
- Foreground & background: 19% = FG, 52% = BG, 29% = Mix (only verify foreground protocols)
- #Protocol interactions: 20% = 1, 80% = 2+ protocols
Let’s find out how to better verify more protocol interactions!

44 State-of-the-art: LC bug detection
- Pattern-based detection
- Error-based detection
- Statistical bug detection
Opportunities: DC bug detection?
- Pattern-based detection: message timing, fault timing, reboot timing
- Error-based detection: 53% = Explicit, 47% = Silent
Let’s leverage these timing patterns and explicit errors to do DC bug detection!

45 State-of-the-art: LC bug prevention
- Deadlock Immunity [OSDI ’08]
- Aviso [ASPLOS ’13]
- ConAir [ASPLOS ’13]
- Etc.
Opportunities: DC bug prevention
- Fixes: 40% = Simple, 60% = Complex
Let’s build runtime prevention techniques that leverage this simplicity!

46 “Why seriously address DC bugs now?”
- Everything is distributed and large-scale!
- DC bugs are not uncommon!
“Why is tackling DC bugs possible now?”
1. Open access to source code
2. Pervasive documentation
3. Detailed bug descriptions

47 http://ucare.cs.uchicago.edu

