Presentation is loading. Please wait.

Presentation is loading. Please wait.

Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University.

Similar presentations


Presentation on theme: "Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University."— Presentation transcript:

1 Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University of Pennsylvania + Georgetown University 1

2 Internet HTTP Server Data Center Network Controller Why is the HTTP server getting DNS queries? Motivation: Network debugging DNS Packet 2 -Example: Software Defined Networks HTTP Packet -Need good debuggers -SDN offers flexibility, but can have bugs

3 Why is the HTTP server getting DNS queries? -Existing tools: SNP (SOSP ‘11), NetSight (NSDI ‘14) Approach: Provenance Internet HTTP Server Data Center Network Controller DNS Packet DNS Packet arrived at HTTP Server DNS Packet received at Switch Broken FlowEntry existed at Switch … … … Program DNS Packet Broken FlowEntry 3 - They produce “backtrace”, or provenance

4 Internet HTTP Server Data Center Network Controller -What if something expected is not happening? Challenge: Missing events ??? 4 Why is the HTTP server NOT getting requests? -Cannot be handled by existing tools -Reason: No starting point

5 Survey: How common are missing events? 5 -Missing events are consistently in the majority -Lengthier email threads for missing events Missing eventsPositive events NANOG-user Floodlight-dev Outages

6 Approach: Counter-factual reasoning Find all the ways a missing event could have occurred, Why did Bob NOT arrive at SIGCOMM? 6 Philly Chicago and show why each of them did not happen.

7 Result: debugger for missing events Internet HTTP Server Data Center Network Controller No HTTP Packet arrived at HTTP Server No Forwarding-FlowEntry installed at Switch HTTP Packet received at Switch Dropping-FlowEntry existed at Switch … … Program … ??? HTTP Packet Dropping- FlowEntry Why is the HTTP server NOT getting requests? 7

8 Challenge: Too many possible explanations 8 Why did Bob NOT arrive at SIGCOMM?

9 Approach Generating Negative Provenance Improving readability Background: Provenance 9 Overview Challenge: Too many explanations Goal: Diagnose missing events WHY NOT ? Approach: Counter-factual reasoning System Y! R-tree indexing Explanation usability Evaluation Query speed Explanation size reduction Experiments

10 Background: provenance -Captures causality between events 10 DNS Packet arrived at HTTP Server DNS Packet received at Switch Broken FlowEntry existed at Switch … … -Example: ExSPAN (SIGMOD ’10) PacketSent :- PacketReceived, FlowEntry. network datalog (NDLOG)

11 ??? FlowEntry PacketReceived time t1 t2 t3 t4 t5 now PacketSent PacketOut PacketSent :- FlowEntry, PacketReceived. PacketSent during [t4,t5] 11 Background: How to generate provenance? PacketSent :- PacketOut. t6 FlowEntry during [t4,t5] PacketReceived during [t4,t5] Step 2: Admin issues query when bug occursStep 1: Collecting events from distributed system Step 3: Provenance graph is generated

12 Approach Generating Negative Provenance Improving readability Background: Provenance 12 Overview Challenge: Too many explanations Goal: Diagnose missing events WHY NOT ? Approach: Counter-factual reasoning System Y! R-tree indexing Explanation usability Evaluation Query speed Explanation size reduction Experiments

13 Generating negative provenance graphs FlowEntry PacketSent :- FlowEntry, PacketReceived. PacketReceived No PacketSent during [t1,now] ??? time t1 t2 t3 t4 t5 now 13 PacketSent -Explaining why something does not exist -Use missing conditions to explain missing events

14 Generating negative provenance graphs -explanation can be unnecessarily complex PacketSent FlowEntry PacketReceived No PacketSent during [t1,now] time t1 t2 t3 t4 t5 now No PacketReceived during [t1,t2] No FlowEntry during [t2,t3] No PacketReceived during [t3,t4] No FlowEntry during [t4,t5] No PacketReceived during [t5,now] 14

15 Generating negative provenance graphs -We want simple explanations No PacketSent during [t1,now] No FlowEntry during [t1,now] 15 PacketSent FlowEntry PacketReceived time t1 t2 t3 t4 t5 now -Key challenge: explanation can be complex -This is hard (Set-Cover)

16 Generating negative provenance graphs -pseudo-code for building negative provenance graph 16

17 17 Challenge: explanation too complicated No A at at X No B at at Y No C at at Z

18 Approach Generating Negative Provenance Improving readability Background: Provenance 18 Overview Challenge: Too many explanations Goal: Diagnose missing events WHY NOT ? Approach: Counter-factual reasoning System Y! R-tree indexing Explanation usability Evaluation Query speed Explanation size reduction Experiments

19 Readability: how to improve? 19 No chicken. … No egg. No chicken. … … No Packet arrived at Server No Packet arrived at S1 No Packet arrived at S2 No Packet arrived at S3 … -Pattern: logical inconsistency. Solution: pruning. -Pattern: transient event chains. Solution: hiding.

20 Readability: improving techniques Prune logical inconsistencies. Prune failed assertions. Branch coalescing. Application-specific invariants. Hide transient event chains. Summarize super-vertex. 20

21 21 Readability: concise explanations

22 Approach Generating Negative Provenance Improving readability Background: Provenance 22 System Y! R-tree indexing Overview Challenge: Too many explanations Goal: Diagnose missing events WHY NOT ? Approach: Counter-factual reasoning Explanation usability Evaluation Query speed Explanation size reduction Experiments

23 System: Y! 23 Supports arbitrary NDLOG programs. Efficient event storage: R-tree. Supports general programs: Pyretic frontend.

24 System: better index for faster queries Was there a FlowTable from 3pm to 8pm, whose priority is higher than 255? Any hotels within 3 miles of SIGCOMM? ≈ 24

25 System: R-tree for faster queries -R-tree: designed to handle high-dimensional queries Used material from Wikipedia. -basic idea: multi-dimensional B-tree 25

26 Approach Generating Negative Provenance Improving readability Background: Provenance 26 Overview Challenge: Too many explanations Goal: Diagnose missing events WHY NOT ? Approach: Counter-factual reasoning System Y! R-tree indexing Explanation usability Evaluation Query speed Explanation size reduction Experiments

27 Evaluation: Experiments 27 Are negative provenance graphs concise? Are negative provenance graphs useful? What is the query turnaround time? What is the runtime storage overhead? Will Y! slow down the distributed system? How runtime storage overhead scales? How query turnaround time scales? How readability heuristics scales?

28 Evaluation: How fast are the queries? 28 Query turnaround (seconds) 1.2 0.6 0.9 0.3 -Query turnaround less than one second 1 second SDN1 SDN2 SDN3 SDN4 SDN5 BGP1 BGP2 BGP3 BGP4 -Nine scenarios reproduced: SDN & BGP

29 Evaluation: Are answers concise? 29 SDN1 SDN2 SDN3 SDN4 SDN5 BGP1 BGP2 BGP3 BGP4 Original Inconsistencies pruned Final - 90% -Size reduced by over 90% -Absolute size is always less than 25 25 # Vertices in answers 40 0 20 0 30 0 10 0

30 Evaluation: Are answers useful? Why is the HTTP server NOT getting requests? No HTTP Packet arrived at HTTP Server No Forwarding FlowEntry arrived at Intermediate Switch HTTP Packets arrived at Border Switch Forwarding FlowEntry arrived at Border Switch Broken FlowEntry arrived at Intermediate Switch Internet HTTP Server Data Center Network Controller ??? Broken FlowEntry S1 S2 ??? HTTP Packet 30

31 -Goal: a debugger for missing events Example: Why HTTP server is not getting requests? -Approach: Negative Provenance Use counterfactual reasoning to explain how the missing events could have occurred. -Explanation can be simplified. The initial explanation can be complex, but we can generate concise answers. -Implementation: Y! Efficient queries due to a special index. Usability and scalability evaluations. Questions? 31 Summary


Download ppt "Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University."

Similar presentations


Ads by Google