Enterprise Network Troubleshooting Nick Feamster Georgia Tech (joint with Russ Clark, Yiyi Huang, Anukool Lakhina, Manas Khadilkar, Aditi Thanekar)



2 Three Disjoint Views of the Network
Policy: The operator's wish list
Static: What the configurations say
Dynamic: The behavior that users witness
[Diagram labels: Policy, Static, Dynamic; Generation; Error Checking and Deployment; rancid/rcc; FIREMAN/Lumeta; ping; traceroute; …]
Independent analyses!

3 A Closer Look
Proactive analysis
–Fault avoidance
–Policy conformance
Reactive diagnosis
–Correcting network faults
  Detection
  Localization
–Active and passive measurements
–Need user's perspective
Idea: These analyses should inform each other
Two studies
1. Routing
2. Firewalls

4 Catastrophic Configuration Faults

"…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint."
-- news.com, April 25, 1997

"Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes."
-- wired.com, January 25, 2001

"WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to 'a route table issue.'"
-- cnn.com, October 3, 2002

"A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration)."
-- dslreports.com, February 23, 2004

5 Case 1: Network-Wide Routing Analysis
Proactive routing configuration analysis
Idea: Analyze configuration before deployment
[Diagram: Configure → Detect Faults (rcc) → Deploy]
Many faults can be detected with static analysis.

6 Operators Find Static Analysis Useful

"That's wicked!"
-- Nicolas Strina, ip-man.net

"Thanks again for a great tool."
-- Paul Piecuch, IT Manager

"...good to finally see more coverage of routing as distributed programming. From my experience, the principles of software engineering eliminate a vast majority of errors."
-- Joe Provo, rcn.com

"I find your approach useful, it is really not fun (but critical for the health of the network) to keep track of the inconsistencies among different routers…a configuration verifier like yours can give the operator a degree of confidence that the sky won't fall on his head real soon now."
-- Arnaud Le Tallanter, clara.net

7 Yes, but Surprises Happen!
Link failures
Node failures
Traffic volumes shift
Network devices wedged
…
Two problems
–Detection
–Localization

8 Detection: Analyze Routing Dynamics
Idea: Routers exhibit correlated behavior
Blips across signals may be more operationally interesting than any spike in one.
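
The contrast between one large spike and many small simultaneous blips can be sketched as follows. This is a simplified stand-in, not the network-wide analysis used in the work behind this talk: it assumes each router reports a count of routing updates per time bin, and the function name, the z-score test and the default thresholds are all hypothetical.

```python
from statistics import mean, pstdev


def correlated_blips(series, z_min=1.5, min_routers=3):
    """Return (time_bin, routers) for bins where several routers deviate together.

    series maps a router name to a list of per-bin update counts; all lists
    are assumed to have the same length. A router "blips" in a bin when its
    count is at least z_min standard deviations above its own mean.
    """
    # Per-router baseline, computed once.
    stats = {r: (mean(c), pstdev(c)) for r, c in series.items()}
    n_bins = len(next(iter(series.values())))

    flagged = []
    for t in range(n_bins):
        blipping = []
        for router, counts in series.items():
            mu, sigma = stats[router]
            # Routers with a perfectly flat signal carry no information here.
            if sigma > 0 and (counts[t] - mu) / sigma >= z_min:
                blipping.append(router)
        # A modest deviation seen on many routers at once is what gets flagged;
        # a large spike on a single router is not.
        if len(blipping) >= min_routers:
            flagged.append((t, sorted(blipping)))
    return flagged
```

A per-router threshold set high enough to suppress noise would miss the small simultaneous blips, while this count-across-routers test ignores an isolated spike by design.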

9 Detection: Three Types of Events
Single-router bursts
Correlated bursts
Multi-router bursts
[Figure labels: Common; Commonly missed using thresholds]

10 Localization: Joint Dynamic/Static
Which routers are border routers for that burst
Topological properties of routers in the burst
[Diagram labels: Static, Dynamic; Proactive Analysis; Deployment; Reactive Detection; Diagnosis/Correction]

11 Case 2: Firewalls
Georgia Tech Campus Network
–Research and Administrative Network
–180 buildings
–130+ firewalls
–1700+ switches
– ports
Problem: Availability/Reachability
–Flux in firewall, router, switch configurations
–No common authority over changes made

12 Causes of Reachability Problems
BGP policies
Firewall misconfigurations
Router misconfigurations
Switch misconfigurations
Network element failures
Changes in traffic loads
…

13 Specific Focus: Firewall Configuration
Difficult to understand and audit configs
Subject to continual modifications
–Roughly 1-2 touches per day
Federated policy, distributed dependencies
–Each department has independent policies
–Local changes may affect global behavior
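
To make "local changes may affect global behavior" concrete, the sketch below treats end-to-end reachability as the conjunction of every firewall on the path. The rule format, the first-match semantics, the default-deny fallback and both function names are assumptions for illustration; they are not taken from the campus configurations.

```python
def firewall_allows(rules, src, dst, default="deny"):
    """Evaluate one firewall's rule list with first-match semantics.

    rules is a list of (action, src, dst) tuples, where action is
    "permit" or "deny" and "any" matches every host (hypothetical format).
    """
    for action, rule_src, rule_dst in rules:
        if rule_src in (src, "any") and rule_dst in (dst, "any"):
            return action == "permit"
    # No rule matched: fall back to the default policy.
    return default == "permit"


def path_allows(firewalls_on_path, src, dst):
    """Traffic gets through only if every firewall on the path permits it."""
    return all(firewall_allows(rules, src, dst) for rules in firewalls_on_path)
```

For example, if one department prepends `("deny", "A", "M")` to its own rule list, `path_allows` turns false for A to M even though no other department changed anything, which is the kind of dependency that is hard to see when auditing one configuration at a time.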

14 Reactive Monitoring/Diagnosis: CPR
Campus-Wide Network Performance Monitoring and Recovery
–Monitor hosts are co-located with routers and switches
Continual performance monitoring
Multiple views of the network
–Get the user's perspective of the network
–Isolate real network problems
–Eliminate non-network issues
Warren Matthews, Russ Clark, Matt Sanders, et al.

15 How CPR Works
Distributed probing
–Smokeping
–Nagios
–Pathload
Centralized analysis
[Figure labels: SI, Rich, Lyman, EDI, French, OHR]

16 Measurements
Active Measurement
–Ping and traceroute: connectivity
–OWAMP: one-way delay
–Iperf and Pathrate: bandwidth testing
–Application tests: web, mail, DHCP, printing
Passive Measurement
–Packet capture
–NetFlow
–Firewall logs
Device Data
–SNMP counter data from switches, routers, wireless APs
User Sessions
–Login, logout session data

17 Deployment

18 CPR Data Flow
Distributed collection
Centralized storage

19 Firewall-Induced Reachability
Step 1: Proactive checking
Step 2: Reactive measurement using CPR
–Detection
–Localization

20 Packet Probes
One-way packet probes
–Initiated by central command
–No acknowledgements required
–Recipient directly notifies central monitoring node

21 Packet Probes
[Figure labels: Core Routers; SI, NI, GW, OHR, Lyman; hosts A to M; probe A M]

22 Output: Reachability Matrix
[Figure: matrix with rows and columns labeled A through M]
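
A minimal sketch of how a central monitoring node could turn one-way probe reports into a reachability matrix. It assumes each report is a `(sender, recipient)` pair sent by the recipient when a probe arrives, and that a host is treated as reachable from itself; the function names and data layout are hypothetical.

```python
def reachability_matrix(hosts, reports):
    """Build matrix[src][dst] = True if a probe from src was reported by dst.

    hosts is the list of monitor hosts; reports is an iterable of
    (sender, recipient) pairs collected by the central node.
    """
    received = set(reports)
    return {
        src: {dst: src == dst or (src, dst) in received for dst in hosts}
        for src in hosts
    }


def unreachable_pairs(matrix):
    """List the (src, dst) pairs for which no probe report arrived."""
    return sorted(
        (src, dst)
        for src, row in matrix.items()
        for dst, ok in row.items()
        if not ok
    )
```

Because the probes are one-way, the matrix need not be symmetric: a report for `("A", "M")` says nothing about whether M can reach A.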

23 The Suspects
[Figure labels: Core Routers; SI, GW, OHR; A, B, C, D, E, K, L, M; B, C, D, E, K, L]
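
One simple way to narrow down suspects, sketched under stated assumptions: the path each probe takes is known (for example from static configuration), and any device that carried a successful probe is considered cleared. This is a common elimination heuristic, not necessarily the procedure used in the talk, and the function name and device names are hypothetical.

```python
def suspects(paths, reachable):
    """For each failed pair, return the devices on its path that no
    successful probe has traversed.

    paths maps (src, dst) to the ordered list of devices on the path;
    reachable maps (src, dst) to True or False from the reachability matrix.
    """
    # Every device on a working path is treated as cleared.
    cleared = set()
    for pair, path in paths.items():
        if reachable[pair]:
            cleared.update(path)

    # What remains on each failed path is the suspect list for that pair.
    return {
        pair: [device for device in path if device not in cleared]
        for pair, path in paths.items()
        if not reachable[pair]
    }
```

The clearing step is too generous for firewalls, since a firewall can pass one pair and deny another, so in practice the result is only a starting point for checking the relevant configurations.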

24 Spoofing and Firewalls
[Figure labels: Core Routers; SI, NI, GW, OHR, Lyman; hosts A to M; X, Y, Z; A M; firewall rule: Deny A M]

25 (Immediate) Open Issues
Reachability and reliability of controller
Service-level probes
–Diagnostic tools != Service-level Happiness
Policy conformance

26 Holy Grail: Joint Analysis of 3 Views
[Diagram labels: Policy, Static, Dynamic; Generation; Error Checking and Deployment; rancid/rcc; FIREMAN/Lumeta; ping; traceroute; …]
Static firewall analysis
–Configurations
–Logs
Policy conformance