Toward Interactive Debugging for ISP Networks Chia-Chi Lin, Matthew Caesar, Jacobus Van der Merwe § University of Illinois at Urbana-Champaign § AT&T Labs.

Presentation transcript:

Toward Interactive Debugging for ISP Networks Chia-Chi Lin, Matthew Caesar, Jacobus Van der Merwe § University of Illinois at Urbana-Champaign § AT&T Labs – Research

1 Debugging in ISP Networks Internet: most complex distributed system ever created –Leads to complex failure modes –Bugs, vulnerabilities, compromise, misconfigurations Major challenges in debugging in ISP Networks –Lack of visibility –High rates of change of protocols –Complex interdependencies These could cause devastating effects –Long-term outages, slow repair –February 2009 BGP outage

2 Interactive Debugging is Necessary Problems exist with fully automated techniques –Focus on detection rather than diagnosis –Modeling could be inexact –Logical and semantic errors seem to require human knowledge to solve Our position: –Humans must be in-the-loop –Tools are required to facilitate the process

3 A Scenario [Figure labels: ISP; Customer; Cloned Network; "Pause when the outage occurs"]

4 Our Vision Isolation of the operational network –Prevent diagnostic procedure from interfering with live network operation –Solution: virtualization technologies Reproducibility of network execution –Enable operator to replay execution, narrow in on rare events –Solution: instill a pseudorandom ordering over events, messages Interactive stepping through execution –Operator can slowly step through operation, trace messages –Solution: protocols providing tight control over distributed execution

5 The Architecture [Figure labels: User (human troubleshooter); Debugging Coordinator; Virtual Service Coordinator; Virtual Service Platforms; Virtual Service Nodes; Application 1: e.g. BGP; Application 2: e.g. OSPF; Physical Network Node; Physical Network Infrastructure]

6 Key Challenge: Reproducibility Reproducibility simplifies interactive debugging –Can run multiple times, varying inputs to narrow down the cause –When a rare bug occurs, don't need to wait for it to reoccur One option: generate comprehensive logs of all events –e.g., log all packet sends/receives, all data –Problem: not scalable to large networked software Our approach: eliminate randomness in execution –Starting with the same initial state will produce the same execution –Make execution pseudorandom to explore different execution paths –Key challenge: how to eliminate randomness in large-scale software execution?

7 An Algorithm for Distributed, Reproducible Execution Approach: –Encapsulate software in a virtual environment –Intercept the software's inputs/outputs, instill an ordering over them –Make sure that the ordering is the same every time the software is run How this is done: –Network is run in lockstep fashion –On every cycle: messages from neighbors are buffered –Before delivery to the application, a pseudorandom ordering is instilled by a consistent hash of each packet's contents –Human sends step commands to move to the next lockstep cycle
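The buffering and ordering step on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `LockstepNode`, the `step` method, and the choice of SHA-256 as the consistent hash are all assumptions made for the example.

```python
import hashlib


def ordering_key(packet: bytes) -> str:
    # Assumption: SHA-256 stands in for the "consistent hash of packet contents".
    return hashlib.sha256(packet).hexdigest()


class LockstepNode:
    """Hypothetical wrapper that buffers inputs and releases them once per cycle."""

    def __init__(self, app):
        self.app = app              # callable that consumes one packet
        self.receive_buffer = []    # packets received during the current cycle

    def receive(self, packet: bytes) -> None:
        # Arrival order is nondeterministic, so nothing is delivered yet.
        self.receive_buffer.append(packet)

    def step(self) -> list:
        # Triggered by the operator's step command: impose a content-based
        # order, then hand the packets to the application.
        ordered = sorted(self.receive_buffer, key=ordering_key)
        self.receive_buffer = []
        for packet in ordered:
            self.app(packet)
        return ordered
```

Because the order depends only on packet contents, two runs that receive the same packets in different arrival orders deliver them to the application identically.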

8 Improving Performance for the Production Network Problem: running the application in lockstep fashion slows operation –Might be okay for some protocols (e.g., BGP) –Probably not okay for others (e.g., OSPF) Solution: optimistic execution of events –Choose in advance a pseudorandom ordering that is likely to happen anyway –Don't buffer packets, deliver them immediately –If we guess wrong, roll back the application to an earlier state
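One way to picture optimistic execution with rollback is the sketch below. It is illustrative only: `OptimisticNode`, the `rank` table, and the requirement that `apply` returns a new state without mutating the old one are assumptions of this example, not details given in the slides.

```python
class OptimisticNode:
    """Hypothetical node that delivers packets immediately and rolls back on misorder."""

    def __init__(self, rank, initial_state, apply):
        self.rank = rank            # source name -> position in the predicted ordering
        self.state = initial_state  # assumed immutable (e.g. a tuple)
        self.apply = apply          # pure function: (state, packet) -> new state
        self.log = []               # (state_before, source, packet) per delivery
        self.rollbacks = 0

    def deliver(self, source, packet):
        # Find the first delivered packet that should have come after this one.
        idx = next(
            (i for i, (_, s, _) in enumerate(self.log)
             if self.rank[s] > self.rank[source]),
            None,
        )
        replay = []
        if idx is not None:
            # The guess was wrong: restore the earlier state and redo later packets.
            self.rollbacks += 1
            self.state = self.log[idx][0]
            replay = [(s, p) for _, s, p in self.log[idx:]]
            del self.log[idx:]
        for s, p in [(source, packet)] + replay:
            self.log.append((self.state, s, p))
            self.state = self.apply(self.state, p)
```

When packets happen to arrive in the predicted order, no rollback occurs and the node pays only the cost of keeping the log; a late packet from an earlier-ranked source triggers one rollback and ends in the same state as the in-order run.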

9 Example: Running the Lockstep Algorithm in a Cloned Network [Figure: an App with a Sending Buffer and a Receiving Buffer alternates between a Transmission Phase and a Processing Phase. Messages shown: "I finished transmitting. I am ready to process." and "I finished processing. I am ready to transmit." Delivery ordering shown: 1. S 2. L 3. K 4. ……]

10 Example: Live Algorithm in Production Network The live algorithm does two things: –Determine the ordering of events –Roll back events violating the ordering [Figure: topology with nodes Seattle, Los Angeles, Salt Lake City, Kansas City, Houston, Atlanta, New York, Washington, Chicago. Ordering shown: 1. Seattle 2. Los Angeles 3. Kansas City 4. Chicago 5. …… Callouts: "Packets from Seattle should come before those from Los Angeles"; "Pseudorandom ordering is violated!"]

11 Connecting the Two Algorithms We can run the production network using the live algorithm –Achieves a fixed ordering over messages –But how to actually debug it? Solution: replay using the lockstep algorithm –First let the production network run, checkpoint the starting state –To debug, start the lockstep algorithm with the same starting state –Lockstep algorithm will traverse the same execution Can replay multiple times, narrow in on the problem, experiment by changing inputs, etc.
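The checkpoint-and-replay idea can be sketched as a generator that an operator steps through one cycle at a time. This is a rough illustration under assumptions: the `replay` function, the dictionary-shaped state, and the SHA-256 ordering are hypothetical and not taken from the slides.

```python
import copy
import hashlib


def replay(checkpoint, cycles, handler):
    """Yield a snapshot of the state after each lockstep cycle.

    checkpoint: starting state saved from the production run (never modified)
    cycles:     list of per-cycle packet lists (bytes)
    handler:    function (state, packet) that updates state in place
    """
    state = copy.deepcopy(checkpoint)   # every replay starts from the same state
    for packets in cycles:
        # Same content-based order on every run (SHA-256 is an assumption).
        for packet in sorted(packets, key=lambda p: hashlib.sha256(p).digest()):
            handler(state, packet)
        yield copy.deepcopy(state)      # one snapshot per operator "step"
```

Calling `next()` on the generator corresponds to one step command. Starting a fresh generator from the same checkpoint repeats the same execution, and passing modified `cycles` is one way to experiment with changed inputs.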

12 Simulation Settings Protocol evaluated: OSPF Topologies used: BRITE, Internet2 backbone Link delay model: 1 ms + (0, 0.5] exponentially distributed random delay Events simulated: Abilene IS-IS traces over the month of January 2009 (giving 209 events) Measure performance overheads of our approach

13 Results – Overhead in Production Networks Live algorithm suffers from rollbacks, incurring 4x inflation in traffic overhead Using delay-estimation optimization reduces overhead to 0.02x traffic inflation

14 Results – Response Time in Cloned Networks Low response time is beneficial to interactive debugging Response time is low for a variety of network sizes

15 Conclusion Humans are required to be in-the-loop to diagnose problems Our architecture is a first step towards interactive debugging –Builds on known techniques, e.g., virtualization technologies and distributed semaphores –Develops techniques to reproduce distributed executions Simulations on real-world events show the scheme incurs low overheads

17 The State of the Art: Automated Techniques Logging observations –X-Trace, Friday, etc. Model checking –rcc, OD flow, etc. Debugging standalone programs –Coverity, AVIO, etc.

18 Optimized Ordering in the Production Network Goal: avoid rollbacks by selecting an ordering likely to happen anyway –Events separated by a long period will fall into different groups, which means ordering is easy –Problem: some failure events are correlated, e.g., multiple overlay links sharing the same physical link –How to order events in the same group? Solution: if we know link delays, we can reliably estimate the expected arrival of events –In practice we don't know exact link delays –But we can estimate them –Can improve estimation by giving protocol messages high priority
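The grouping and delay-based ordering described here might look like the sketch below. It is an assumption-laden illustration: the function names, the `gap` threshold, the `(origin_time, source)` event shape, and the per-source delay table are hypothetical, and the slides do not say how ties are broken (this sketch uses the source name).

```python
def group_events(events, gap):
    """Split (origin_time, source) events into groups separated by more than `gap`."""
    groups = []
    for event in sorted(events, key=lambda e: e[0]):
        if groups and event[0] - groups[-1][-1][0] <= gap:
            groups[-1].append(event)    # close in time: same group
        else:
            groups.append([event])      # long quiet period: start a new group
    return groups


def order_group(group, est_delay):
    """Order events in one group by estimated arrival time at this node.

    est_delay maps a source name to the estimated delay from that source.
    """
    return sorted(group, key=lambda e: (e[0] + est_delay[e[1]], e[1]))
```

For example, with events at times 0.0 (Los Angeles) and 0.1 (Seattle) and estimated delays of 5.0 and 2.0, the Seattle event is predicted to arrive first, so choosing that ordering in advance would avoid a rollback whenever the estimate holds.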

19 Results – Storage in Production Network State required for rolling back packets is small and increases slowly with network size