Enterprise Network Troubleshooting Nick Feamster Georgia Tech (joint with Russ Clark, Yiyi Huang, Anukool Lakhina, Manas Khadilkar, Aditi Thanekar)


1 Enterprise Network Troubleshooting Nick Feamster Georgia Tech (joint with Russ Clark, Yiyi Huang, Anukool Lakhina, Manas Khadilkar, Aditi Thanekar)

2 Three Disjoint Views of the Network
Policy: The operators' wish list
Static: What the configurations say
Dynamic: The behavior that users witness
[Diagram labels: Policy, Static, Dynamic; Generation; Error Checking and Deployment]
- rancid/rcc
- FIREMAN/Lumeta
- ping
- traceroute
- …
Independent analyses!

3 A Closer Look
Proactive analysis
–Fault avoidance
–Policy conformance
Reactive diagnosis
–Correcting network faults
Detection
Localization
–Active and passive measurements
–Need users' perspective
Idea: These analyses should inform each other
Two studies
1. Routing
2. Firewalls

4 Catastrophic Configuration Faults
…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint. -- news.com, April 25, 1997
Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes. -- wired.com, January 25, 2001
WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to "a route table issue." -- cnn.com, October 3, 2002
A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration). -- dslreports.com, February 23, 2004

5 Case 1: Network-Wide Routing Analysis
Proactive routing configuration analysis
Idea: Analyze configuration before deployment
[Diagram labels: Configure, Detect Faults, Deploy; rcc]
Many faults can be detected with static analysis.

6 Operators Find Static Analysis Useful
That's wicked! -- Nicolas Strina, ip-man.net
Thanks again for a great tool. -- Paul Piecuch, IT Manager
...good to finally see more coverage of routing as distributed programming. From my experience, the principles of software engineering eliminate a vast majority of errors. -- Joe Provo, rcn.com
I find your approach useful, it is really not fun (but critical for the health of the network) to keep track of the inconsistencies among different routers…a configuration verifier like yours can give the operator a degree of confidence that the sky won't fall on his head real soon now. -- Arnaud Le Tallanter, clara.net

7 Yes, but Surprises Happen!
Link failures
Node failures
Traffic volumes shift
Network devices wedged
…
Two problems
–Detection
–Localization

8 Detection: Analyze Routing Dynamics
Idea: Routers exhibit correlated behavior
Blips across signals may be more operationally interesting than any spike in one.

9 Detection: Three Types of Events
Single-router bursts
Correlated bursts
Multi-router bursts
[Figure annotations: Common; Commonly missed using thresholds]
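The three event types can be illustrated with a minimal sketch. The definitions used here are assumptions for illustration, not taken from the slides: a "burst" is an interval in which a router's routing-update count is at least spike_factor times its own median, and a "correlated burst" is an interval in which no router crosses that threshold but at least min_routers routers show a smaller simultaneous rise (blip_factor). The function name, parameters, and sample counts are hypothetical.

```python
from statistics import median


def classify_intervals(counts, spike_factor=5.0, blip_factor=2.0, min_routers=3):
    """Label each time interval with an event type (illustrative definitions).

    counts maps a router name to a list of per-interval update counts.
    Returns a list of (interval_index, event_type, routers_involved).
    """
    # Per-router baseline: the median count, floored at 1 to avoid dividing by zero.
    baselines = {r: max(median(series), 1) for r, series in counts.items()}
    n_intervals = min(len(series) for series in counts.values())

    events = []
    for t in range(n_intervals):
        ratio = {r: counts[r][t] / baselines[r] for r in counts}
        spiking = sorted(r for r in ratio if ratio[r] >= spike_factor)
        blipping = sorted(r for r in ratio if ratio[r] >= blip_factor)
        if len(spiking) == 1:
            events.append((t, "single-router burst", spiking))
        elif len(spiking) > 1:
            events.append((t, "multi-router burst", spiking))
        elif len(blipping) >= min_routers:
            # No router crosses the per-router threshold, but several rise together.
            events.append((t, "correlated burst", blipping))
    return events


# Hypothetical update counts for four routers over eight intervals.
sample = {
    "r1": [10, 10, 100, 10, 10, 10, 10, 60],
    "r2": [10, 10, 10, 10, 10, 25, 10, 60],
    "r3": [10, 10, 10, 10, 10, 25, 10, 10],
    "r4": [10, 10, 10, 10, 10, 25, 10, 10],
}
for event in classify_intervals(sample):
    print(event)
```

In this sample, a per-router threshold alone would report intervals 2 and 7 but not interval 5, where three routers rise together without any one of them crossing the threshold. The factors and the minimum router count are arbitrary and would need tuning against real routing data.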

10 Localization: Joint Dynamic/Static
Which routers are border routers for that burst
Topological properties of routers in the burst
[Diagram labels: Static, Dynamic; Proactive, Analysis, Deployment; Reactive, Detection, Diagnosis/Correction]

11 Case 2: Firewalls
Georgia Tech Campus Network
–Research and Administrative Network
–180 buildings
–130+ firewalls
–1700+ switches
–55000+ ports
Problem: Availability/Reachability
–Flux in firewall, router, switch configurations
–No common authority over changes made

12 Causes of Reachability Problems
BGP policies
Firewall misconfigurations
Router misconfigurations
Switch misconfigurations
Network element failures
Changes in traffic loads
…

13 Specific Focus: Firewall Configuration
Difficult to understand and audit configs
Subject to continual modifications
–Roughly 1-2 touches per day
Federated policy, distributed dependencies
–Each department has independent policies
–Local changes may affect global behavior

14 Reactive Monitoring/Diagnosis: CPR
Warren Matthews, Russ Clark, Matt Sanders, et al.
Campus-Wide Network Performance Monitoring and Recovery
–Monitor hosts are co-located with routers and switches
Continual performance monitoring
Multiple views of the network
–Get the users' perspective of the network
–Isolate real network problems
–Eliminate non-network issues

15 How CPR works
Distributed probing
–Smokeping
–Nagios
–Pathload
Centralized analysis
[Figure labels: SI, Rich, Lyman, EDI, French, OHR]

16 Measurements
Active Measurement
–Ping and traceroute connectivity
–OWAMP - one-way delay
–Iperf and Pathrate - bandwidth testing
–Application tests - web, mail, DHCP, printing
Passive Measurement
–Packet capture
–NetFlow
–Firewall logs
Device Data
–SNMP counter data from switches, routers, wireless APs
User Sessions
–Login/logout session data

17 Deployment

18 CPR Data Flow
Distributed collection
Centralized storage

19 Firewall-Induced Reachability
Step 1: Proactive checking
Step 2: Reactive measurement using CPR
–Detection
–Localization

20 Packet Probes
One-way packet probes
–Initiated by central command
–No acknowledgements required
–Recipient directly notifies central monitoring node

21 Packet Probes
[Figure labels: Core Routers; SI, NI, GW, OHR, Lyman; hosts A through M; A, M]

22 Output: Reachability Matrix
  A B C D E F G H I J K L M
A 1 1 1 1 1 1 1 1 1 1 1 1 1
B 1 1 1 1 1 1 1 1 1 1 1 1 1
C 1 1 1 1 1 1 1 1 1 1 1 1 1
D 1 1 1 1 1 1 1 1 1 1 1 1 1
E 1 1 1 1 1 1 1 1 1 1 1 1 1
F 1 1 1 1 1 1 1 1 1 1 1 1 1
G 1 1 1 1 1 1 1 1 1 1 1 1 1
H 1 1 1 1 1 1 1 1 1 1 1 1 1
I 1 1 1 1 1 1 1 1 1 1 1 1 1
J 1 1 1 1 1 1 1 1 1 1 1 1 1
K 1 1 1 1 1 1 1 1 1 1 1 1 1
L 1 1 1 1 1 1 1 1 1 1 1 1 1
M 0 1 1 1 1 1 1 1 1 1 1 1 1
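A reachability matrix like the one above could be assembled from the notifications that probe recipients send to the central monitoring node. The sketch below is one way to do that, under stated assumptions: rows are senders and columns are recipients (the slide does not say which is which), a host is treated as reaching itself, and the function names and the input format (a set of sender/recipient pairs) are hypothetical.

```python
def build_reachability(hosts, received):
    """Build a 0/1 matrix from probe notifications.

    received is a set of (sender, recipient) pairs for which the recipient
    notified the monitor. Assumption: rows are senders, columns recipients.
    """
    return {
        src: {dst: 1 if src == dst or (src, dst) in received else 0 for dst in hosts}
        for src in hosts
    }


def unreachable_pairs(matrix):
    """Return every (row, column) pair whose entry is 0."""
    return [(src, dst) for src, row in matrix.items() for dst, ok in row.items() if not ok]


def format_matrix(hosts, matrix):
    """Render the matrix as plain text with a header row."""
    lines = ["  " + " ".join(hosts)]
    for src in hosts:
        lines.append(src + " " + " ".join(str(matrix[src][dst]) for dst in hosts))
    return "\n".join(lines)


hosts = list("ABCDEFGHIJKLM")
# Hypothetical input: every probe was reported except the one in row M, column A.
received = {(s, d) for s in hosts for d in hosts if s != d} - {("M", "A")}
matrix = build_reachability(hosts, received)
print(format_matrix(hosts, matrix))
print(unreachable_pairs(matrix))
```

Because the probes are one-way and unacknowledged, a missing notification cannot by itself distinguish a dropped probe from a dropped notification; the sketch simply records the absence as 0.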

23 The Suspects
[Figure labels: Core Routers; SI, GW, OHR; A, B, C, D, E, K, L, M; B, C, D, E, K, L]
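The slides do not spell out how the suspect set is computed. One common first-pass heuristic, shown here purely as an assumption, is to take the network elements on the path of a failed probe and discard any element that also lies on the path of a probe that succeeded. The function name, the path data, and the element names (fw-A, core-1, and so on) are hypothetical.

```python
def suspects(paths, failed):
    """Narrow down which elements might explain each failed probe.

    paths maps a (src, dst) pair to the ordered list of elements it traverses.
    failed is the set of (src, dst) pairs whose probes were not reported.
    Returns a dict mapping each failed pair to its remaining suspects.
    """
    # Elements seen on at least one working path are provisionally cleared.
    cleared = set()
    for pair, path in paths.items():
        if pair not in failed:
            cleared.update(path)

    return {
        pair: [hop for hop in paths[pair] if hop not in cleared]
        for pair in sorted(failed)
    }


# Hypothetical paths between edge hosts through two core routers.
paths = {
    ("A", "M"): ["fw-A", "core-1", "core-2", "fw-M"],
    ("B", "M"): ["fw-B", "core-1", "core-2", "fw-M"],
}
print(suspects(paths, failed={("A", "M")}))
```

This only fits failures where an element is down for all traffic. A firewall rule can drop one source/destination pair while passing others, so an element "cleared" by a working path may still be the culprit, and the result is best treated as a starting list rather than a verdict.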

24 Spoofing and Firewalls
[Figure labels: Core Routers; SI, NI, GW, OHR, Lyman; hosts A through M; X, Y, Z; A M (repeated); A, B, C; Deny A M]
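The "Deny A M" label suggests a firewall rule that blocks traffic from A to M. The sketch below models address-based filtering under assumptions that are not in the slides: rules are evaluated in order with first-match semantics, unmatched traffic is permitted by default, and the Rule type and evaluate function are hypothetical. Real firewall platforms differ in syntax and defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    action: str  # "permit" or "deny"
    src: str     # host label, or "any"
    dst: str     # host label, or "any"


def evaluate(rules, src, dst, default="permit"):
    """Return the action of the first rule matching (src, dst), else the default."""
    for rule in rules:
        if rule.src in (src, "any") and rule.dst in (dst, "any"):
            return rule.action
    return default


rules = [Rule("deny", "A", "M")]
print(evaluate(rules, "A", "M"))  # the rule matches the A-to-M probe
print(evaluate(rules, "B", "M"))  # no rule matches, so the default applies
```

A rule of this kind matches on the source address carried in the packet, not on the host that actually sent it, so under this model a probe sent from elsewhere with A as its source address is treated the same way as one sent by A.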

25 (Immediate) Open Issues
Reachability and reliability of controller
Service-level probes
–Diagnostic tools != Service-level Happiness
Policy conformance

26 Holy Grail: Joint Analysis of 3 Views
[Diagram labels: Policy, Static, Dynamic; Generation; Error Checking and Deployment]
- rancid/rcc
- FIREMAN/Lumeta
- ping
- traceroute
- …
Static firewall analysis
–Configurations
–Logs
Policy conformance

