Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University.

Slides:



Advertisements
Similar presentations
Categories of I/O Devices
Advertisements

Ditto: Speeding Up Runtime Data Structure Invariant Checks AJ Shankar and Ras Bodik UC Berkeley.
SDN Controller Challenges
Software-defined networking: Change is hard Ratul Mahajan with Chi-Yao Hong, Rohan Gandhi, Xin Jin, Harry Liu, Vijay Gill, Srikanth Kandula, Mohan Nanduri,
SIMPLE-fying Middlebox Policy Enforcement Using SDN
Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan MIT and Berkeley presented by Daniel Figueiredo Chord: A Scalable Peer-to-peer.
Networking Problems in Cloud Computing Projects. 2 Kickass: Implementation PROJECT 1.
Background for “KISS: Keep It Simple and Sequential” cs264 Ras Bodik spring 2005.
VeriCon: Towards Verifying Controller Programs in SDNs (PLDI 2014) Thomas Ball, Nikolaj Bjorner, Aaron Gember, Shachar Itzhaky, Aleksandr Karbyshev, Mooly.
A. Haeberlen Having your Cake and Eating it too: Routing Security with Privacy Protections 1 HotNets-X (November 15, 2011) Alexander Gurney * Andreas Haeberlen.
SDN and Openflow.
Web Caching Schemes1 A Survey of Web Caching Schemes for the Internet Jia Wang.
CPSC 322, Lecture 12Slide 1 CSPs: Search and Arc Consistency Computer Science cpsc322, Lecture 12 (Textbook Chpt ) January, 29, 2010.
Shadow Configurations: A Network Management Primitive Richard Alimi, Ye Wang, Y. Richard Yang Laboratory of Networked Systems Yale University.
Shadow Configurations: A Network Management Primitive Richard Alimi, Ye Wang, and Y. Richard Yang Laboratory of Networked Systems Yale University February.
EEC-681/781 Distributed Computing Systems Lecture 3 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Progress Report 11/1/01 Matt Bridges. Overview Data collection and analysis tool for web site traffic Lets website administrators know who is on their.
PathExpander: Architectural Support for Increasing the Path Coverage of Dynamic Bug Detection S. Lu, P. Zhou, W. Liu, Y. Zhou, J. Torrellas University.
Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University.
DARPA NMS PI Meeting November 14, 2002 Understanding BGP in Action Dan Massey USC/ISI.
CSE 490dp Resource Control Robert Grimm. Problems How to access resources? –Basic usage tracking How to measure resource consumption? –Accounting How.
Rex: Replication at the Speed of Multi-core Zhenyu Guo, Chuntao Hong, Dong Zhou*, Mao Yang, Lidong Zhou, Li Zhuang Microsoft ResearchCMU* 1.
Replay Debugging for Distributed Systems Dennis Geels, Gautam Altekar, Ion Stoica, Scott Shenker.
Hands-On Microsoft Windows Server 2008 Chapter 11 Server and Network Monitoring.
CH 13 Server and Network Monitoring. Hands-On Microsoft Windows Server Objectives Understand the importance of server monitoring Monitor server.
Windows Server 2008 Chapter 11 Last Update
Presenter: Chi-Hung Lu 1. Problems Distributed applications are hard to validate Distribution of application state across many distinct execution environments.
SIMPLE-fying Middlebox Policy Enforcement Using SDN Zafar Ayyub Qazi Cheng-Chun Tu Luis Chiang Vyas Sekar Rui Miao Minlan Yu.
Distributed Data Stores – Facebook Presented by Ben Gooding University of Arkansas – April 21, 2015.
Chapter 1: Hierarchical Network Design
 Zhichun Li  The Robust and Secure Systems group at NEC Research Labs  Northwestern University  Tsinghua University 2.
(Business) Process Centric Exchanges
Y. WuHotNets-XII (Nov 22, 2013)1 Answering Why-Not Queries in Software-Defined Networks with Negative Provenance Yang Wu* Andreas Haeberlen* Wenchao Zhou.
Frontiers in Massive Data Analysis Chapter 3.  Difficult to include data from multiple sources  Each organization develops a unique way of representing.
Extending SDN to Handle Dynamic Middlebox Actions via FlowTags (Full version to appear in NSDI’14) Seyed K. Fayazbakhsh, Luis Chiang, Vyas Sekar, Minlan.
Automated Bandwidth Allocation Problems in Data Centers Yifei Yuan, Anduo Wang, Rajeev Alur, Boon Thau Loo University of Pennsylvania.
Developer TECH REFRESH 15 Junho 2015 #pttechrefres h Understand your end-users and your app with Application Insights.
A data-centric platform for analyzing distributed systems Provenance Maintenance and Querying on Log-structured Databases.
Distributed Information Systems. Motivation ● To understand the problems that Web services try to solve it is helpful to understand how distributed information.
Reporter : Yu Shing Li 1.  Introduction  Querying and update in the cloud  Multi-dimensional index R-Tree and KD-tree Basic Structure Pruning Irrelevant.
SIGCOMM 2012 (August 16, 2012) Private and Verifiable Interdomain Routing Decisions Mingchen Zhao * Wenchao Zhou * Alexander Gurney * Andreas Haeberlen.
CISC Machine Learning for Solving Systems Problems Presented by: Suman Chander B Dept of Computer & Information Sciences University of Delaware Automatic.
M. Wang, T. Xiao, J. Li, J. Zhang, C. Hong, & Z. Zhang (2014)
Motivation: Finding the root cause of a symptom
Automated Network Repair with Meta Provenance
Custom Computing Machines for the Set Covering Problem Paper Written By: Christian Plessl and Marco Platzner Swiss Federal Institute of Technology, 2002.
NetEgg: Scenario-based Programming for SDN Policies Yifei Yuan, Dong Lin, Rajeev Alur, Boon Thau Loo University of Pennsylvania 1.
Flashback : A Lightweight Extension for Rollback and Deterministic Replay for Software Debugging Sudarshan M. Srinivasan, Srikanth Kandula, Christopher.
Jennifer Rexford Princeton University MW 11:00am-12:20pm Testing and Debugging COS 597E: Software Defined Networking.
Logically Centralized? State Distribution Trade-offs in Software Defined Networks.
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved DISTRIBUTED SYSTEMS.
NFP: Enabling Network Function Parallelism in NFV
Seyed K. Fayaz, Tushar Sharma, Ari Fogel
Yiting Xia, T. S. Eugene Ng Rice University
SDN challenges Deployment challenges
Problem: Internet diagnostics and forensics
SDN Network Updates Minimum updates within a single switch
Programming SDN Newer proposals Frenetic (ICFP’11) Maple (SIGCOMM’13)
Enhanced Provenance Model (TAP): Time-aware Provenance for Distributed Systems Original Article: Wenchao Zhou, Ling Ding, Andreas Haeberlen, Zachary Ives,
NFP: Enabling Network Function Parallelism in NFV
Abstractions for Model Checking SDN Controllers
Software Defined Networking (SDN)
NFP: Enabling Network Function Parallelism in NFV
Evaluating Transaction System Performance
Data collection methodology and NM paradigms
What's New in eCognition 9
Cluster Load Balancing for Fine-grain Network Services
EE 122: Lecture 22 (Overlay Networks)
Overview of Networking
What's New in eCognition 9
Presentation transcript:

Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University of Pennsylvania + Georgetown University 1

Internet HTTP Server Data Center Network Controller Why is the HTTP server getting DNS queries? Motivation: Network debugging DNS Packet 2 -Example: Software Defined Networks HTTP Packet -Need good debuggers -SDN offers flexibility, but can have bugs

Why is the HTTP server getting DNS queries? -Existing tools: SNP (SOSP ‘11), NetSight (NSDI ‘14) Approach: Provenance Internet HTTP Server Data Center Network Controller DNS Packet DNS Packet arrived at HTTP Server DNS Packet received at Switch Broken FlowEntry existed at Switch … … … Program DNS Packet Broken FlowEntry 3 - They produce “backtrace”, or provenance

Internet HTTP Server Data Center Network Controller -What if something expected is not happening? Challenge: Missing events ??? 4 Why is the HTTP server NOT getting requests? -Cannot be handled by existing tools -Reason: No starting point

Survey: How common are missing events? 5 -Missing events are consistently in the majority -Lengthier threads for missing events Missing eventsPositive events NANOG-user Floodlight-dev Outages

Approach: Counter-factual reasoning Find all the ways a missing event could have occurred, Why did Bob NOT arrive at SIGCOMM? 6 Philly Chicago and show why each of them did not happen.

Result: debugger for missing events Internet HTTP Server Data Center Network Controller No HTTP Packet arrived at HTTP Server No Forwarding-FlowEntry installed at Switch HTTP Packet received at Switch Dropping-FlowEntry existed at Switch … … Program … ??? HTTP Packet Dropping- FlowEntry Why is the HTTP server NOT getting requests? 7

Challenge: Too many possible explanations 8 Why did Bob NOT arrive at SIGCOMM?

Approach Generating Negative Provenance Improving readability Background: Provenance 9 Overview Challenge: Too many explanations Goal: Diagnose missing events WHY NOT ? Approach: Counter-factual reasoning System Y! R-tree indexing Explanation usability Evaluation Query speed Explanation size reduction Experiments

Background: provenance -Captures causality between events 10 DNS Packet arrived at HTTP Server DNS Packet received at Switch Broken FlowEntry existed at Switch … … -Example: ExSPAN (SIGMOD ’10) PacketSent :- PacketReceived, FlowEntry. network datalog (NDLOG)

??? FlowEntry PacketReceived time t1 t2 t3 t4 t5 now PacketSent PacketOut PacketSent :- FlowEntry, PacketReceived. PacketSent during [t4,t5] 11 Background: How to generate provenance? PacketSent :- PacketOut. t6 FlowEntry during [t4,t5] PacketReceived during [t4,t5] Step 2: Admin issues query when bug occursStep 1: Collecting events from distributed system Step 3: Provenance graph is generated

Approach Generating Negative Provenance Improving readability Background: Provenance 12 Overview Challenge: Too many explanations Goal: Diagnose missing events WHY NOT ? Approach: Counter-factual reasoning System Y! R-tree indexing Explanation usability Evaluation Query speed Explanation size reduction Experiments

Generating negative provenance graphs FlowEntry PacketSent :- FlowEntry, PacketReceived. PacketReceived No PacketSent during [t1,now] ??? time t1 t2 t3 t4 t5 now 13 PacketSent -Explaining why something does not exist -Use missing conditions to explain missing events

Generating negative provenance graphs -explanation can be unnecessarily complex PacketSent FlowEntry PacketReceived No PacketSent during [t1,now] time t1 t2 t3 t4 t5 now No PacketReceived during [t1,t2] No FlowEntry during [t2,t3] No PacketReceived during [t3,t4] No FlowEntry during [t4,t5] No PacketReceived during [t5,now] 14

Generating negative provenance graphs -We want simple explanations No PacketSent during [t1,now] No FlowEntry during [t1,now] 15 PacketSent FlowEntry PacketReceived time t1 t2 t3 t4 t5 now -Key challenge: explanation can be complex -This is hard (Set-Cover)

Generating negative provenance graphs -pseudo-code for building negative provenance graph 16

17 Challenge: explanation too complicated No A at at X No B at at Y No C at at Z

Approach Generating Negative Provenance Improving readability Background: Provenance 18 Overview Challenge: Too many explanations Goal: Diagnose missing events WHY NOT ? Approach: Counter-factual reasoning System Y! R-tree indexing Explanation usability Evaluation Query speed Explanation size reduction Experiments

Readability: how to improve? 19 No chicken. … No egg. No chicken. … … No Packet arrived at Server No Packet arrived at S1 No Packet arrived at S2 No Packet arrived at S3 … -Pattern: logical inconsistency. Solution: pruning. -Pattern: transient event chains. Solution: hiding.

Readability: improving techniques Prune logical inconsistencies. Prune failed assertions. Branch coalescing. Application-specific invariants. Hide transient event chains. Summarize super-vertex. 20

21 Readability: concise explanations

Approach Generating Negative Provenance Improving readability Background: Provenance 22 System Y! R-tree indexing Overview Challenge: Too many explanations Goal: Diagnose missing events WHY NOT ? Approach: Counter-factual reasoning Explanation usability Evaluation Query speed Explanation size reduction Experiments

System: Y! 23 Supports arbitrary NDLOG programs. Efficient event storage: R-tree. Supports general programs: Pyretic frontend.

System: better index for faster queries Was there a FlowTable from 3pm to 8pm, whose priority is higher than 255? Any hotels within 3 miles of SIGCOMM? ≈ 24

System: R-tree for faster queries -R-tree: designed to handle high-dimensional queries Used material from Wikipedia. -basic idea: multi-dimensional B-tree 25

Approach Generating Negative Provenance Improving readability Background: Provenance 26 Overview Challenge: Too many explanations Goal: Diagnose missing events WHY NOT ? Approach: Counter-factual reasoning System Y! R-tree indexing Explanation usability Evaluation Query speed Explanation size reduction Experiments

Evaluation: Experiments 27 Are negative provenance graphs concise? Are negative provenance graphs useful? What is the query turnaround time? What is the runtime storage overhead? Will Y! slow down the distributed system? How runtime storage overhead scales? How query turnaround time scales? How readability heuristics scales?

Evaluation: How fast are the queries? 28 Query turnaround (seconds) Query turnaround less than one second 1 second SDN1 SDN2 SDN3 SDN4 SDN5 BGP1 BGP2 BGP3 BGP4 -Nine scenarios reproduced: SDN & BGP

Evaluation: Are answers concise? 29 SDN1 SDN2 SDN3 SDN4 SDN5 BGP1 BGP2 BGP3 BGP4 Original Inconsistencies pruned Final - 90% -Size reduced by over 90% -Absolute size is always less than # Vertices in answers

Evaluation: Are answers useful? Why is the HTTP server NOT getting requests? No HTTP Packet arrived at HTTP Server No Forwarding FlowEntry arrived at Intermediate Switch HTTP Packets arrived at Border Switch Forwarding FlowEntry arrived at Border Switch Broken FlowEntry arrived at Intermediate Switch Internet HTTP Server Data Center Network Controller ??? Broken FlowEntry S1 S2 ??? HTTP Packet 30

-Goal: a debugger for missing events Example: Why HTTP server is not getting requests? -Approach: Negative Provenance Use counterfactual reasoning to explain how the missing events could have occurred. -Explanation can be simplified. The initial explanation can be complex, but we can generate concise answers. -Implementation: Y! Efficient queries due to a special index. Usability and scalability evaluations. Questions? 31 Summary