Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University.

Slides:



Advertisements
Similar presentations
SDN Controller Challenges
Advertisements

Dynamic Scheduling of Network Updates Xin Jin Hongqiang Harry Liu, Rohan Gandhi, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Jennifer Rexford, Roger Wattenhofer.
Logically Centralized Control Class 2. Types of Networks ISP Networks – Entity only owns the switches – Throughput: 100GB-10TB – Heterogeneous devices:
SIMPLE-fying Middlebox Policy Enforcement Using SDN
Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan MIT and Berkeley presented by Daniel Figueiredo Chord: A Scalable Peer-to-peer.
CloudWatcher: Network Security Monitoring Using OpenFlow in Dynamic Cloud Networks or: How to Provide Security Monitoring as a Service in Clouds? Seungwon.
Nanxi Kang Princeton University
Cs/ee 143 Communication Networks Chapter 6 Internetworking Text: Walrand & Parekh, 2010 Steven Low CMS, EE, Caltech.
Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University.
Background for “KISS: Keep It Simple and Sequential” cs264 Ras Bodik spring 2005.
VeriCon: Towards Verifying Controller Programs in SDNs (PLDI 2014) Thomas Ball, Nikolaj Bjorner, Aaron Gember, Shachar Itzhaky, Aleksandr Karbyshev, Mooly.
SDN and Openflow.
Scalable Flow-Based Networking with DIFANE 1 Minlan Yu Princeton University Joint work with Mike Freedman, Jennifer Rexford and Jia Wang.
Internet Networking Spring 2006 Tutorial 12 Web Caching Protocols ICP, CARP.
Shadow Configurations: A Network Management Primitive Richard Alimi, Ye Wang, Y. Richard Yang Laboratory of Networked Systems Yale University.
1 Spring Semester 2007, Dept. of Computer Science, Technion Internet Networking recitation #13 Web Caching Protocols ICP, CARP.
Shadow Configurations: A Network Management Primitive Richard Alimi, Ye Wang, and Y. Richard Yang Laboratory of Networked Systems Yale University February.
Internet Networking Spring 2002 Tutorial 13 Web Caching Protocols ICP, CARP.
1 Design and implementation of a Routing Control Platform Matthew Caesar, Donald Caldwell, Nick Feamster, Jennifer Rexford, Aman Shaikh, Jacobus van der.
EXPERT SYSTEMS Part I.
CSC458/2209 PA1 Simple Router Based on slides by: Antonin Seyed Amir Hejazi 19/09/2014 CSC458/ Computer Networks, University of Toronto.
Understanding Network Failures in Data Centers: Measurement, Analysis and Implications Phillipa Gill University of Toronto Navendu Jain & Nachiappan Nagappan.
Firewalls and VPNS Team 9 Keith Elliot David Snyder Matthew While.
Presenter: Chi-Hung Lu 1. Problems Distributed applications are hard to validate Distribution of application state across many distinct execution environments.
NET-REPLAY: A NEW NETWORK PRIMITIVE Ashok Anand Aditya Akella University of Wisconsin, Madison.
 Zhichun Li  The Robust and Secure Systems group at NEC Research Labs  Northwestern University  Tsinghua University 2.
Scaling IXPs Scalable Infrastructure Workshop. Objectives  To explain scaling options within the IXP  To introduce the Internet Routing Registry at.
Google File System Simulator Pratima Kolan Vinod Ramachandran.
SIGCOMM 2002 New Directions in Traffic Measurement and Accounting Focusing on the Elephants, Ignoring the Mice Cristian Estan and George Varghese University.
VeriFlow: Verifying Network-Wide Invariants in Real Time
NetPilot: Automating Datacenter Network Failure Mitigation Xin Wu, Daniel Turner, Chao-Chih Chen, David A. Maltz, Xiaowei Yang, Lihua Yuan, Ming Zhang.
TELE202 Lecture 5 Packet switching in WAN 1 Lecturer Dr Z. Huang Overview ¥Last Lectures »C programming »Source: ¥This Lecture »Packet switching in Wide.
Network Instruments VoIP Analysis. VoIP Basics  What is VoIP?  Packetized voice traffic sent over an IP network  Competes with other traffic on the.
Y. WuHotNets-XII (Nov 22, 2013)1 Answering Why-Not Queries in Software-Defined Networks with Negative Provenance Yang Wu* Andreas Haeberlen* Wenchao Zhou.
Black-box Testing.
A data-centric platform for analyzing distributed systems Provenance Maintenance and Querying on Log-structured Databases.
Reporter : Yu Shing Li 1.  Introduction  Querying and update in the cloud  Multi-dimensional index R-Tree and KD-tree Basic Structure Pruning Irrelevant.
Lecture 2 Agenda –Finish with OSPF, Areas, DR/BDR –Convergence, Cost –Fast Convergence –Tools to troubleshoot –Tools to measure convergence –Intro to implementation:
SIGCOMM 2012 (August 16, 2012) Private and Verifiable Interdomain Routing Decisions Mingchen Zhao * Wenchao Zhou * Alexander Gurney * Andreas Haeberlen.
1 Root-Cause VoIP Troubleshooting Optimizing the Process Tim Titus CTO, PathSolutions.
Performance Debugging for Distributed Systems of Black Boxes Marcos K. Aguilera Jeffrey C. Mogul Janet L. Wiener HP Labs Patrick Reynolds, Duke Athicha.
CISC Machine Learning for Solving Systems Problems Presented by: Suman Chander B Dept of Computer & Information Sciences University of Delaware Automatic.
General Writing - Audience What is their level of knowledge? Advanced, intermediate, basic? Hard to start too basic – but have to use the right terminology.
An Efficient Gigabit Ethernet Switch Model for Large-Scale Simulation Dong (Kevin) Jin.
Motivation: Finding the root cause of a symptom
Automated Network Repair with Meta Provenance
32nd International Conference on Very Large Data Bases September , 2006 Seoul, Korea Efficient Detection of Empty Result Queries Gang Luo IBM T.J.
An Efficient Gigabit Ethernet Switch Model for Large-Scale Simulation Dong (Kevin) Jin.
NetEgg: Scenario-based Programming for SDN Policies Yifei Yuan, Dong Lin, Rajeev Alur, Boon Thau Loo University of Pennsylvania 1.
NETWORKING (2) Dr. Andy Wu BCIS 4630 Fundamentals of IT Security.
Jennifer Rexford Princeton University MW 11:00am-12:20pm Testing and Debugging COS 597E: Software Defined Networking.
Logically Centralized? State Distribution Trade-offs in Software Defined Networks.
SDN challenges Deployment challenges
Problem: Internet diagnostics and forensics
SDN Network Updates Minimum updates within a single switch
Programming SDN Newer proposals Frenetic (ICFP’11) Maple (SIGCOMM’13)
Shadow Configurations: A Network Management Primitive
The DPIaaS Controller Prototype
Network Anti-Spoofing with SDN Data plane Authors:Yehuda Afek et al.
Dispersing Asymmetric DDoS Attacks with SplitStack
Large Distributed Systems
NOX: Towards an Operating System for Networks
Enhanced Provenance Model (TAP): Time-aware Provenance for Distributed Systems Original Article: Wenchao Zhou, Ling Ding, Andreas Haeberlen, Zachary Ives,
Understanding the OSI Reference Model
Software Defined Networking (SDN)
SDN Based IoT-Cloud Comm.
Data collection methodology and NM paradigms
Cluster Load Balancing for Fine-grain Network Services
COS 461: Computer Networks
Overview of Networking
Presentation transcript:

Diagnosing Missing Events in Distributed Systems with Negative Provenance Yang Wu* Mingchen Zhao* Andreas Haeberlen* Wenchao Zhou + Boon Thau Loo* * University of Pennsylvania + Georgetown University 1

Internet HTTP Server Data Center Network SDN Controller Why is the HTTP server getting DNS queries? Motivation: Network debugging DNS Query 2 -Example: Software Defined Networks HTTP Request -Need good debuggers! -SDN offers flexibility, but can have bugs

Why is the HTTP server getting DNS queries? -Existing tools: SNP (SOSP ‘11), NetSight (NSDI ‘14) Approach: Provenance Internet HTTP Server Data Center Network SDN Controller DNS Query DNS Query arrived at HTTP Server DNS Query received at Switch Broken FlowEntry existed at Switch … … … Program DNS Query Broken FlowEntry 3 - They produce “backtraces”, or provenance

Internet HTTP Server Data Center Network SDN Controller -What if an expected event does not happen? Challenge: Missing events ??? 4 Why is the HTTP server NOT getting requests? -Cannot be handled by existing tools -No starting point for a backtrace

Survey: How common are missing events? 5 -Missing events are consistently in the majority - threads for missing events are longer Missing eventsPositive events NANOG-user Floodlight-dev Outages

Approach: Counter-factual reasoning Find all the ways a missing event could have occurred, Why did Bob NOT arrive at SIGCOMM? 6 Philadelphia Chicago and show why each of them did not happen.

Result: Debugger for missing events Internet HTTP Server Data Center Network Controller No HTTP Request arrived at HTTP Server No Forwarding-FlowEntry installed at Switch HTTP Request received at Switch Dropping-FlowEntry existed at Switch … … Program … ??? HTTP Request Dropping- FlowEntry Why is the HTTP server NOT getting requests? 7

Challenge: Too many possible explanations! 8 Why did Bob NOT arrive at SIGCOMM? When an event happens, there is one reason. When an event does not happen, there can be many reasons.

Approach Generating Negative Provenance Improving readability Background: Provenance 9 Overview Challenge: Too many explanations Goal: Diagnose missing events WHY NOT ? Approach: Counter-factual reasoning System Y! R-tree indexing Usability Evaluation Query speed Size reduction Experiments

Background: Provenance -Captures causality between events 10 DNS Query arrived at HTTP Server DNS Query received at Switch Broken FlowEntry existed at Switch … … -Example: SNP (SOSP ’11) PacketSent :- PacketReceived, FlowEntry. network datalog (NDLOG) Event Causal relationship Provenance graph

??? FlowEntry PacketReceived time now PacketSent PacketOut PacketSent :- PacketReceived, FlowEntry. PacketSent during [t4,t5] 11 Background: How to generate provenance? PacketSent :- PacketOut. FlowEntry during [t4,t5] PacketReceived during [t4,t5] Step 2: Issue query when relevant event occursStep 1: Collect events from distributed system Step 3: Provenance graph is generated t4 t5

Approach Generating Negative Provenance Improving readability Background: Provenance 12 Overview Challenge: Too many explanations Goal: Diagnose missing events WHY NOT ? Approach: Counter-factual reasoning System Y! R-tree indexing Usability Evaluation Query speed Size reduction Experiments

Generating negative provenance graphs FlowEntry PacketSent :- PacketReceived, FlowEntry. PacketReceived No PacketSent during [t1,now] ??? time 13 PacketSent -Goal: Explain why something does not exist -Use missing preconditions to explain missing events t1 now t2 t3 t4 t5

Generating negative provenance graphs -Explanation can be unnecessarily complex PacketSent FlowEntry PacketReceived No PacketSent during [t1,now] time No PacketReceived during [t1,t2] No FlowEntry during [t2,t3] No PacketReceived during [t3,t4] No FlowEntry during [t4,t5] No PacketReceived during [t5,now] 14 t1 t2 t3 t4 t5 now

Generating negative provenance graphs -We want simple explanations No PacketSent during [t1,now] No FlowEntry during [t1,now] 15 PacketSent FlowEntry PacketReceived time t1 t2 t3 t4 t5 now -This is hard (Set-Cover) -But greedy heuristics tend to work well

Generating negative provenance graphs 16

17 Challenge: Explanation is complicated! No A at at X No B at at Y No C at at Z Why NOT … ?

Approach Generating Negative Provenance Improving readability Background: Provenance 18 Overview Challenge: Too many explanations Goal: Diagnose missing events WHY NOT ? Approach: Counter-factual reasoning System Y! R-tree indexing Explanation usability Evaluation Query speed Explanation size reduction Experiments

Readability: How to simplify the provenance? 19 No chicken. … No egg. No chicken. … … No Packet arrived at Server No Packet arrived at S1 No Packet arrived at S2 No Packet arrived at S3 … -Heuristic #1: Prune logical inconsistencies -Heuristic #2: Summarize transient event chains Prune Summarize

Readability: Other heuristics Prune logical inconsistencies. Prune failed assertions. Branch coalescing. Application-specific invariants. Summarize transient event chains. Summarize super-vertex. 20

21 Readability: Concise explanations Why NOT … ?

Approach Generating Negative Provenance Improving readability Background: Provenance 22 System Y! R-tree indexing Overview Challenge: Too many explanations Goal: Diagnose missing events WHY NOT ? Approach: Counter-factual reasoning Explanation usability Evaluation Query speed Explanation size reduction Experiments

System: Y! 23 General: Works for any NDLOG program (not just SDN) Uses R-tree to speed up queries Supports general programs: Pyretic frontendMore details are in the paper

System: Better index for faster queries Was there a FlowTable from 3pm to 8pm, whose priority is higher than 255? 24 Any hotels within 3 miles of SIGCOMM? ≈ -Event storage must provide fast spatial query

System: R-tree for faster queries -R-tree: Designed to handle high-dimensional queries Used material from Wikipedia. -Basic idea: Multi-dimensional boxes as indexes 25

Approach Generating Negative Provenance Improving readability Background: Provenance 26 Overview Challenge: Too many explanations Goal: Diagnose missing events WHY NOT ? Approach: Counter-factual reasoning System Y! R-tree indexing Usability Evaluation Query speed Size reduction Experiments

Evaluation: Setup 27 -Two case studies: SDN and BGP -Buggy scenarios reproduced from literature and survey -SDN1: Broken flow entry -SDN2: MAC spoofing -SDN3: Incorrect ACL -SDN4: Ping traceback -SDN5: Internal access -BGP1: Off-path change -BGP2: Black hole -BGP3: Link failure -BGP4: Bogon List -Simulation stack: RapidNet + Mininet + Trema

Evaluation: Questions 28 Are negative provenance graphs concise? Are negative provenance graphs useful? What is the query turnaround time? What is the runtime storage overhead? Will Y! slow down the distributed system? How runtime storage overhead scales? How query turnaround time scales? How readability heuristics scales?

Evaluation: Time to answer a query 29 -Query turnaround less than one second SDN1 SDN2 SDN3 SDN4 SDN5 BGP1 BGP2 BGP3 BGP4 Query turnaround (seconds) Less than one second

Evaluation: Size of the returned answer 30 SDN1 SDN2 SDN3 SDN4 SDN5 BGP1 BGP2 BGP3 BGP4 Original Inconsistencies pruned All heuristics applied - 90% -Heuristics reduce size of the provenance by over 90% -No answers had more than 25 vertices 25 # Vertices in answers

Evaluation: How useful are the answers? Why is the HTTP server NOT getting requests? No HTTP Request arrived at HTTP Server No Forwarding FlowEntry arrived at Intermediate Switch HTTP Requests arrived at Border Switch Forwarding FlowEntry arrived at Border Switch Broken FlowEntry arrived at Intermediate Switch Internet HTTP Server Data Center Network SDN Controller ??? Broken FlowEntry S1 S2 ??? HTTP Request 31

More information: Goal: Diagnose events with negative symptoms Example: Why is the HTTP server not getting any requests? -Approach: Negative Provenance Uses counterfactual reasoning to find all the ways in which the missing event could have occurred. Then Explains why each did not come to pass. -Challenge: Explanation can be very large Uses a combination of several heuristics to remove redundancy and improve readability. -Implementation: Y! Can be applied to any distributed system. Supports both positive and negative provenance. -Two case studies: SDN and BGP Provenance is readable and can be computed quickly.