Toward Interactive Debugging for ISP Networks Chia-Chi Lin, Matthew Caesar, Jacobus Van der Merwe § University of Illinois at Urbana-Champaign § AT&T Labs – Research
1 Debugging in ISP Networks Internet: most complex distributed system ever created –Leads to complex failure modes –Bugs, vulnerabilities, compromise, misconfigurations Major challenges in debugging ISP networks –Lack of visibility –High rates of protocol change –Complex interdependencies These can have devastating effects –Long-term outages, slow repair –February 2009 BGP outage
2 Interactive Debugging is Necessary Problems exist with fully automated techniques –Focus on detection rather than diagnosis –Modeling could be inexact –Logical and semantic errors seem to require human knowledge to solve Our position: –Humans must be in-the-loop –Tools are required to facilitate the process
3 A Scenario (Diagram: ISP, Customer, Cloned Network) An outage occurs between the ISP and its customer; the operator pauses execution when the outage occurs and continues diagnosis in a cloned network.
4 Our Vision Isolation of the operational network –Prevent diagnostic procedure from interfering with live network operation –Solution: virtualization technologies Reproducibility of network execution –Enable operator to replay execution, narrow in on rare events –Solution: instill a pseudorandom ordering over events, messages Interactive stepping through execution –Operator can slowly step through operation, trace messages –Solution: protocols providing tight control over distributed execution
5 The Architecture (Diagram) Components: Physical Network Infrastructure with Physical Network Nodes; Virtual Service Platforms hosting Virtual Service Nodes running applications (Application 1: e.g., BGP; Application 2: e.g., OSPF); a Virtual Service Coordinator; a Debugging Coordinator; and the User (human troubleshooter).
6 Key Challenge: Reproducibility Reproducibility simplifies interactive debugging –Can run multiple times, varying inputs to narrow down cause –When a rare bug occurs, don't need to wait for it to reoccur One option: generate comprehensive logs of all events –e.g., log all packet sends/receives, all data –Problem: not scalable to large networked software Our approach: eliminate randomness in execution –Starting with the same initial state will produce the same execution –Make execution pseudorandom to explore different execution paths –Key challenge: how to eliminate randomness in large-scale software execution?
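The "eliminate randomness" idea on this slide can be sketched as a deterministic ordering function: a content hash gives every run the same order over a set of buffered messages, regardless of arrival order. A minimal sketch in Python (the paper does not specify the exact hash; SHA-256 is an assumption here):

```python
import hashlib

def pseudorandom_order(packets):
    # Sort buffered packets by a hash of their contents, so every run
    # starting from the same state delivers them in the same order.
    return sorted(packets, key=lambda p: hashlib.sha256(p).hexdigest())

# Two different (nondeterministic) arrival orders of the same packets
# yield the same delivery order.
arrival_1 = [b"update from S", b"update from L", b"update from K"]
arrival_2 = [b"update from K", b"update from S", b"update from L"]
assert pseudorandom_order(arrival_1) == pseudorandom_order(arrival_2)
```

Because the order depends only on packet contents, no per-event log is needed to replay it.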
7 An Algorithm for Distributed, Reproducible Execution Approach: –Encapsulate software in a virtual environment –Intercept the software's inputs/outputs, instill an ordering over them –Make sure that ordering is the same every time the software is run How this is done: –Network is run in lockstep fashion –On every cycle: messages from neighbors are buffered –Before delivery to the application, a pseudorandom ordering is instilled by a consistent hash of each packet's contents –Human sends step commands to move to the next lockstep cycle
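The lockstep cycle above might look like the following sketch: a node buffers everything it receives during the transmission phase, and delivers it hash-ordered only when the operator's step command arrives (class and method names are illustrative, not the paper's code):

```python
import hashlib

class LockstepNode:
    """One node in the lockstep algorithm (illustrative sketch)."""
    def __init__(self, deliver):
        self.deliver = deliver    # application callback: deliver(message)
        self.recv_buffer = []

    def on_receive(self, message):
        # Transmission phase: only buffer; nothing reaches the app yet.
        self.recv_buffer.append(message)

    def step(self):
        # Processing phase, triggered by the operator's "step" command:
        # instill the pseudorandom (hash-based) ordering, then deliver.
        for msg in sorted(self.recv_buffer,
                          key=lambda m: hashlib.sha256(m).digest()):
            self.deliver(msg)
        self.recv_buffer.clear()

delivered = []
node = LockstepNode(delivered.append)
for m in (b"S", b"L", b"K"):   # arbitrary arrival order
    node.on_receive(m)
node.step()
# `delivered` now holds S, L, K in a fixed content-determined order.
```

Every node running the same cycle sees the same delivery order, which is what makes the cloned network's execution reproducible.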
8 Improving Performance for the Production Network Problem: running the application in lockstep fashion slows operation –Might be okay for some protocols (e.g., BGP) –Probably not okay for others (e.g., OSPF) Solution: optimistic execution of events –Choose a pseudorandom ordering in advance that is likely to happen anyway –Don't buffer packets, deliver them immediately –If we guess wrong, roll back the application to an earlier state
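A sketch of the optimistic variant, under the simplifying assumption of a single checkpoint taken at the start: packets are applied immediately on arrival, and if a later packet shows the guess was wrong, the node restores the checkpoint and redelivers everything in the pseudorandom order:

```python
import copy
import hashlib

def rank(msg):
    # Content hash fixes the pseudorandom ordering in advance.
    return hashlib.sha256(msg).digest()

class OptimisticNode:
    """Optimistic execution sketch: deliver immediately, roll back on a
    wrong guess. Simplification: one checkpoint taken at the start."""
    def __init__(self, state):
        self.checkpoint = copy.deepcopy(state)
        self.state = state
        self.delivered = []        # kept sorted by rank

    def apply(self, msg):
        self.state.append(msg)     # stand-in for real protocol processing

    def on_receive(self, msg):
        if self.delivered and rank(msg) < rank(self.delivered[-1]):
            # Guessed wrong: restore the checkpoint and redeliver
            # everything in the pseudorandom order.
            self.state = copy.deepcopy(self.checkpoint)
            self.delivered.append(msg)
            self.delivered.sort(key=rank)
            for m in self.delivered:
                self.apply(m)
        else:
            # Arrival order matched the chosen ordering so far.
            self.delivered.append(msg)
            self.apply(msg)

node = OptimisticNode(state=[])
for m in (b"L", b"S", b"K"):   # arbitrary arrival order
    node.on_receive(m)
# node.state now equals the messages applied in the fixed hash order.
```

When arrivals happen to match the chosen ordering, no rollback occurs and the node pays no buffering delay, which is the point of the optimization.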
9 Example: Running the Lockstep Algorithm in a Cloned Network (Diagram) Each cycle alternates a Transmission Phase ("I finished transmitting. I am ready to process.") and a Processing Phase ("I finished processing. I am ready to transmit."). Messages accumulate in each node's receiving buffer during transmission, then are delivered from the sending buffer to the application in the pseudorandom order (1. S, 2. L, 3. K, ...).
10 Example: Live Algorithm in Production Network (Diagram: topology with Seattle, Los Angeles, Salt Lake City, Kansas City, Houston, Atlanta, New York, Washington, Chicago) The live algorithm does two things: determine the ordering of events, and roll back events violating the ordering. In the chosen ordering (1. Seattle, 2. Los Angeles, 3. Kansas City, 4. Chicago, ...), packets from Seattle should come before those from Los Angeles; when a packet arrives out of that order, the pseudorandom ordering is violated and a rollback occurs.
11 Connecting the Two Algorithms We can run the production network using the live algorithm –Achieves a fixed ordering over messages –But how to actually debug it? Solution: replay using the lockstep algorithm –First let the production network run, checkpoint the starting state –To debug, start the lockstep algorithm with the same starting state –Lockstep algorithm will traverse the same execution Can replay multiple times, narrow in on the problem, experiment by changing inputs, etc.
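Connecting the two algorithms can be sketched as checkpoint-plus-replay: the production side records only its starting state and external inputs, and a debugging clone re-executes from that checkpoint, relying on the deterministic ordering (not a full event log) to reproduce the run. Names below are illustrative:

```python
import copy

class ReplayableNetwork:
    """Sketch of checkpoint-plus-replay (illustrative names). The
    production side records only the starting state and external
    inputs; determinism makes re-execution reproduce the run."""
    def __init__(self, state):
        self.checkpoint = copy.deepcopy(state)   # starting state
        self.event_log = []                      # external inputs only

    def record(self, external_input):
        self.event_log.append(external_input)

    def clone_for_debugging(self):
        # The clone starts from the checkpoint; replaying the inputs
        # under the lockstep ordering traverses the same execution.
        return copy.deepcopy(self.checkpoint), list(self.event_log)

net = ReplayableNetwork(state={"rib": ["10.0.0.0/8"]})
net.record("link_down: Seattle-LosAngeles")
state, inputs = net.clone_for_debugging()
# `state` and `inputs` can be fed to the lockstep algorithm as many
# times as needed, without touching the live network.
```

Because clones are deep copies, experimenting on them (changing inputs, re-running) cannot perturb the recorded production state.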
12 Simulation Settings Protocol evaluated: OSPF Topologies used: BRITE, Internet2 backbone Link delay model: 1 ms + (0, 0.5] exponentially distributed random delay Events simulated: Abilene IS-IS traces over the month of January 2009 (giving 209 events) Measure performance overheads of our approach
13 Results – Overhead in Production Networks Live algorithm suffers from rollbacks, incurring 4x inflation in traffic overhead Using delay-estimation optimization reduces overhead to 0.02x traffic inflation
14 Results – Response Time in Cloned Networks Low response time is beneficial to interactive debugging Response time is low for variety of network sizes
15 Conclusion Humans are required to be in-the-loop to diagnose problems Our architecture is a first step towards interactive debugging –Builds on known techniques, e.g., virtualization technologies and distributed semaphores –Develops techniques to reproduce distributed executions Simulations on real-world events show the scheme incurs low overhead
17 The State of the Art: Automated Techniques Logging observations –X-Trace, Friday, etc. Model checking –rcc, OD flow, etc. Debugging standalone programs –Coverity, AVIO, etc.
18 Optimized Ordering in the Production Network Goal: avoid rollbacks by selecting an ordering likely to happen anyway –Events separated by a long period fall into different groups, which makes ordering easy –Problem: some failure events are correlated, e.g., multiple overlay links sharing the same physical link –How to order events in the same group? Solution: if we know link delays, we can reliably estimate the expected arrival of events –In practice we don't know exact link delays –But we can estimate them –Can improve estimation by giving protocol messages high priority
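The delay-estimation idea can be sketched as predicting each event's arrival as send time plus estimated one-way link delay, and choosing the ordering that matches that prediction; the delay values below are made-up examples, not measurements from the paper:

```python
def expected_order(events, est_delay_ms):
    # Predicted arrival = send time + estimated one-way link delay.
    # Choosing the ordering that matches the prediction means the live
    # (optimistic) algorithm rarely guesses wrong and rolls back.
    return sorted(events,
                  key=lambda e: e["send_ms"] + est_delay_ms[e["src"]])

events = [
    {"src": "Los Angeles", "send_ms": 0.0},
    {"src": "Seattle",     "send_ms": 0.0},
]
est_delay_ms = {"Seattle": 1.2, "Los Angeles": 3.4}   # assumed estimates
order = [e["src"] for e in expected_order(events, est_delay_ms)]
# → ["Seattle", "Los Angeles"]
```

Better delay estimates (e.g., from prioritized protocol messages, as the slide suggests) tighten the prediction and further reduce rollbacks.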
19 Results – Storage in Production Network State required for rolling back packets is small and increases slowly with network size