PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services. Presentation by Michael Smathers and Usman Jafarey, CS395/495 IMRE, April 24, 2006.

Presentation transcript:

Presentation by Michael Smathers, Usman Jafarey CS395/495 IMRE, April 24, 2006 PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services

Detecting Path Anomalies
A large volume of traffic data is required to characterize misbehavior; wide-area services provide it:
– Peer-to-peer (P2P) systems
– Content distribution networks (CDN)
Solution: combine passive monitoring of wide-area networks with active probes to quantify and characterize anomalies.

Traditional Detection…
Traceroute only maps the forward path; without cooperation from the destination it is difficult to infer whether a problem lies on the forward or the reverse path.
BGP/OSPF propagate failure information.
Traceroute may stop at a hop that is not the source of the failure.
High variance in failure duration makes it difficult to respond in time.
Few sites had enough coverage to identify all paths affected by a failure.

Advantages of this approach
More accurate, complete view of failures, thanks to the geographical diversity of nodes.
Minimal overhead: active probing is initiated only after passive monitoring detects an anomaly.
High rate of failure detection, thanks to large volumes of traffic.

PlanetLab Test Bed
Passively monitoring traffic on PlanetLab since February 2004 to detect anomalous behaviour.
– Coordinate active probes between PlanetLab sites to confirm and characterize each anomaly and measure its scope.
~90,000 anomalies confirmed each month with PlanetSeer.

Components
Wide-area service network: CoDeeN
– 7-12K clients/day
– GB/day
– 5-7 million requests/day
– 120 nodes in North America (350 world-wide)
Passive Monitoring Daemons (MonD) run on all CoDeeN nodes to detect anomalous TCP traffic behaviour.
Active Probing Daemons (ProbeD) run on all PlanetLab nodes, including CoDeeN nodes, awaiting requests from MonDs.

Operation
1. MonD detects an anomaly and sends a request to the local ProbeD.
2. ProbeD contacts ProbeDs on other nodes to coordinate a planet-wide probe.
3. ProbeDs are organized in groups for distributed probing.

MonD - Operation
Uses PlanetLab's tcpdump to observe all incoming and outgoing TCP packets.
Uses this information to generate path-level and flow-level statistics, which are used to identify possible anomalies in real time.
Two indicators of anomalies:
– Change in the TTL (Time To Live) field
– Multiple consecutive timeouts (current threshold: 4 timeouts)
If MonD is on the receiving side, ACKs are not reaching the sender, so we can assume the forward path is at fault.
If MonD is the sender, we cannot determine from timeouts alone which path contains the problem.

MonD - Timeout Detection
When MonD is the sender, it maintains two variables for each flow:
– SendSeqNo: sequence number of the most recently sent packet.
– SendRtxCount: number of times the packet has been retransmitted.
For each outgoing packet with sequence number CurrentSeqNo:
– CurrentSeqNo > SendSeqNo: the flow is making progress; clear SendRtxCount and set SendSeqNo to CurrentSeqNo.
– CurrentSeqNo < SendSeqNo: fast retransmit; set SendSeqNo to CurrentSeqNo.
– CurrentSeqNo = SendSeqNo: timeout; increment SendRtxCount.
If SendRtxCount exceeds the threshold, MonD notifies ProbeD of a possible anomaly.
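
The sender-side bookkeeping can be sketched in a few lines of Python. This is only an illustration of the three cases on the slide, not PlanetSeer's actual code: the names SenderFlowState and on_outgoing_packet are made up here, the initial SendSeqNo of -1 is an assumption, and the check fires when the counter reaches the threshold of 4 (the slide says "exceeds", so the exact comparison is also an assumption).

```python
from dataclasses import dataclass

TIMEOUT_THRESHOLD = 4  # "current threshold: 4 timeouts" from the MonD slide


@dataclass
class SenderFlowState:
    """Per-flow state kept when MonD is the sender (hypothetical class name)."""
    send_seq_no: int = -1     # SendSeqNo: sequence number of the most recently sent packet
    send_rtx_count: int = 0   # SendRtxCount: consecutive retransmissions of that packet


def on_outgoing_packet(state, current_seq_no, threshold=TIMEOUT_THRESHOLD):
    """Update the flow state for one outgoing packet.

    Returns True when ProbeD should be notified of a possible anomaly.
    """
    if current_seq_no > state.send_seq_no:
        # Flow is making progress: clear the counter and advance SendSeqNo.
        state.send_seq_no = current_seq_no
        state.send_rtx_count = 0
        return False
    if current_seq_no < state.send_seq_no:
        # Fast retransmit: move SendSeqNo back to the retransmitted packet.
        state.send_seq_no = current_seq_no
        return False
    # Same sequence number sent again: treat as a timeout retransmission.
    state.send_rtx_count += 1
    # Assumption: notify once the counter reaches the threshold.
    return state.send_rtx_count >= threshold
```

Feeding the same sequence number four times in a row after the original transmission triggers the notification; any packet with a higher sequence number resets the counter.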

MonD - cont’d…
On the receiver side, MonD maintains the largest sequence number seen per flow.
If the current packet has the same sequence number, increment a counter.
When the counter hits the threshold, notify ProbeD that the sender is not seeing ACKs.
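
A matching sketch for the receiver side, again in Python and again only illustrative. ReceiverFlowState and on_incoming_packet are hypothetical names. Resetting the counter when a larger sequence number arrives is an assumption, since the slide does not say when the counter is cleared.

```python
class ReceiverFlowState:
    """Per-flow state kept when MonD is the receiver (hypothetical class name)."""

    def __init__(self):
        self.max_seq_no = None   # largest sequence number seen so far
        self.repeat_count = 0    # times the largest sequence number was seen again


def on_incoming_packet(state, seq_no, threshold=4):
    """Return True when ProbeD should be told the sender is not seeing ACKs."""
    if state.max_seq_no is None or seq_no > state.max_seq_no:
        # New data: remember it. Resetting the counter here is an assumption.
        state.max_seq_no = seq_no
        state.repeat_count = 0
        return False
    if seq_no == state.max_seq_no:
        # The sender repeated the newest packet, so our ACK likely never arrived.
        state.repeat_count += 1
        return state.repeat_count >= threshold
    # Older packet: ignored in this sketch.
    return False
```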

ProbeD
Three probing operations:
1. Baseline probes: run when a new IP is added to the MonD path table.
2. Forward probes: traceroutes invoked at multiple geographically distributed nodes when MonD detects an anomaly. Rate limited: ProbeD will not forward-probe the same destination more than once in 10 minutes.
3. Reprobes: if an anomaly is confirmed by the forward probes, reprobes are sent by the initial ProbeD to determine the duration and effects of the anomaly. Reprobes are sent at 0.5, 1.5, 3.5 and 7.5 hours after the anomaly detection time and are compared to the original baseline and forward probes.
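
The rate limit and the reprobe schedule are simple enough to sketch. The Python below is a minimal illustration under stated assumptions: ForwardProbeLimiter and reprobe_times are hypothetical names, timestamps are plain seconds, and "once in 10 minutes" is read as a minimum spacing of 600 seconds per destination.

```python
FORWARD_PROBE_INTERVAL_S = 10 * 60          # at most one forward probe per destination per 10 minutes
REPROBE_OFFSETS_H = (0.5, 1.5, 3.5, 7.5)    # hours after anomaly detection, from the slide


class ForwardProbeLimiter:
    """Remembers when each destination was last forward-probed (hypothetical class)."""

    def __init__(self, interval_s=FORWARD_PROBE_INTERVAL_S):
        self.interval_s = interval_s
        self._last_probe = {}

    def allow(self, destination, now_s):
        """Return True and record the probe if the destination may be probed now."""
        last = self._last_probe.get(destination)
        if last is not None and now_s - last < self.interval_s:
            return False
        self._last_probe[destination] = now_s
        return True


def reprobe_times(detection_time_s):
    """Absolute times (in seconds) at which reprobes would be sent."""
    return [detection_time_s + hours * 3600 for hours in REPROBE_OFFSETS_H]
```

The gaps between consecutive reprobes double (1, 2, then 4 hours), so short anomalies are sampled finely while long ones cost only a few probes.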

ProbeD - Operation
353 ProbeDs running on 145 PlanetLab sites, distributed across North/South America, Europe, Asia and elsewhere.
Membership information is kept for ProbeDs to avoid unnecessary communication with dead nodes.
30 ProbeD node groups, based on geographic diversity.
ProbeD receives a request from the local MonD, then:
– forwards the request to one ProbeD from each group
– the ProbeDs perform the probe and send results to the requester
– the originator collects the data
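
One way to picture the fan-out step is the Python sketch below. It is an assumption-laden illustration: pick_probers is a hypothetical helper, groups are plain lists of node names, the membership information is reduced to a set of live nodes, and choosing a random live member per group is a guess, since the slide only says "one ProbeD from each group".

```python
import random


def pick_probers(groups, alive, rng=random):
    """Choose one live ProbeD from each group, skipping groups with no live member.

    groups: list of lists of node names (the slide mentions 30 geographic groups)
    alive:  set of node names currently believed to be up
    rng:    object with a choice() method; random selection is an assumption
    """
    chosen = []
    for group in groups:
        candidates = [node for node in group if node in alive]
        if candidates:
            chosen.append(rng.choice(candidates))
    return chosen
```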

ProbeD - Dataset
887,521 unique client IPs from 9232 ASes.
Probes traversed ASes (over half the ASes on the Internet).
2,259,558 possible anomalies.
271,898 confirmed.

Repairing Traceroute Data
Unusable hops, identified by * in place of a name, are removed.
Relative hop count is maintained.
Missing hops are found by comparing traceroutes that share a destination.

Anomaly Detection
An anomaly is confirmed if any of the following conditions is met:
– There is a loop in the traceroute
– The local traceroute disagrees with the baseline
– The local traceroute doesn't reach the destination but other traceroutes do
– The traceroute returns ICMP destination unreachable
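
The four conditions can be read as a single predicate. The Python sketch below is illustrative only: the Traceroute record and confirmation_reasons are hypothetical, "disagrees with the baseline" is simplified to the hop lists being unequal, and a loop is simplified to any router address appearing more than once.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Traceroute:
    """Minimal traceroute record (hypothetical structure for this sketch)."""
    hops: List[str]
    reached_destination: bool
    icmp_unreachable: bool = False


def confirmation_reasons(local, baseline, remote_probes):
    """Return the list of conditions that confirm an anomaly (empty if none)."""
    reasons = []
    # Simplification: any repeated router address counts as a loop.
    if len(set(local.hops)) < len(local.hops):
        reasons.append("loop")
    # Simplification: "disagrees" means the hop lists are not identical.
    if local.hops != baseline.hops:
        reasons.append("differs_from_baseline")
    if not local.reached_destination and any(r.reached_destination for r in remote_probes):
        reasons.append("unreachable_locally_only")
    if local.icmp_unreachable:
        reasons.append("icmp_destination_unreachable")
    return reasons
```

An anomaly would count as confirmed whenever the returned list is non-empty.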

Routing Loops
Detected if the same sequence is observed at least 3 times in a traceroute.
Persistent loops: the traceroute stays in the loop until max hops.
Temporary loops: the loop is resolved before max hops.
Reprobes determine the duration of a persistent loop.
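
A possible reading of the loop rule, sketched in Python. The details are assumptions made for illustration: find_loop and classify_loop are hypothetical names, "same sequence at least 3 times" is taken to mean a cycle of two or more distinct routers repeated back to back, and max_hops defaults to 30 because that is traceroute's usual default limit.

```python
def find_loop(hops, min_repeats=3):
    """Find the first router sequence repeated back to back at least min_repeats times.

    Returns (start_index, cycle_length, end_index) or None. end_index is the
    position just past the last complete repetition.
    """
    n = len(hops)
    for start in range(n):
        for length in range(2, (n - start) // min_repeats + 1):
            cycle = hops[start:start + length]
            if len(set(cycle)) < 2:
                continue  # the same router repeated is not treated as a loop here
            repeats, pos = 1, start + length
            while hops[pos:pos + length] == cycle:
                repeats += 1
                pos += length
            if repeats >= min_repeats:
                return start, length, pos
    return None


def classify_loop(hops, max_hops=30):
    """Return 'persistent', 'temporary', or None when no loop is found."""
    found = find_loop(hops)
    if found is None:
        return None
    start, length, end = found
    cycle = hops[start:start + length]
    tail = hops[end:]
    # Persistent: the trace hit max hops while still walking the cycle.
    still_looping = tail == cycle[:len(tail)]
    if len(hops) >= max_hops and still_looping:
        return "persistent"
    return "temporary"
```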

Measuring Scope
Number of routers/ASes involved in the loop.
Loop length: number of routers involved.
Temporary loops have longer lengths than persistent loops.
Persistent loops generally involve a single AS.
Loops are mapped by the tiers of the ASes involved.

Loop Effects
Temporary loops overload routers.
Persistent loops cause loss of connectivity.
Degraded latency.

Reference Paths
Distinguish between forward and reverse anomalies.
Scope of anomaly: hops between the anomaly and the end host.
Classify as either path change or path outage.
Evaluating reference paths:
– Hazards: destination behind a firewall, intermediate router filtering
– Firewall heuristics: choosing an appropriate distance n between host and anomaly, 0 < RevHop(dst) - RevHop(Sx) < n

Non-Loop Anomalies
Comparing the reference path (R) with the local path (L):
– Path change: L reaches last hop of R
– Path outage: L cuts out before R
– Path outage + path change: L diverges from R, arrives at R’s last hop
Breakdown of all anomalies observed:
– Path Change: 48%
– Forward Outage: 10%
– Other: 24%
– Temporary: 18%

Path Changes
Define scope: the number of hops on R that could have changed their next-hop value.
Run remote traceroutes from various locations to find an intercept path.
– The intercept path narrows the scope.
Find the relative location of the anomaly, e.g. near the host.
– Find the distance of the path change by averaging the distances of all paths in scope.

Path Outage
Distinguish between forward and reverse paths.
Forward path:
– Route change on forward path, in addition to outage
– ICMP destination unreachable
– Reported as timeout on forward path by MonD
35% of anomalies found to be forward timeouts (inferred by MonD)
– Indistinguishable without passive/active probes

Path Change Detection - AS

Bypassing Anomalies
How many failures can be bypassed?
– For all clients with reference path, reachability failures
– Of these, PlanetSeer nodes able to reach destination in cases (43% of failures)
– Same results achieved using 15 vantage points as with all 30
Bypass ratio: ratio of the minimum RTT of any bypass path to the RTT of the baseline path
– Improves latency in 23% of new paths
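
As a worked illustration of the bypass ratio, the Python sketch below divides the best bypass RTT by the baseline RTT. The function name bypass_ratio and the choice of milliseconds are assumptions made for this example; a result below 1 means the best bypass path is faster than the baseline path.

```python
def bypass_ratio(bypass_rtts_ms, baseline_rtt_ms):
    """Minimum RTT over all bypass paths divided by the baseline path's RTT."""
    if not bypass_rtts_ms:
        raise ValueError("need at least one bypass path RTT")
    if baseline_rtt_ms <= 0:
        raise ValueError("baseline RTT must be positive")
    return min(bypass_rtts_ms) / baseline_rtt_ms
```

For example, bypass RTTs of 120, 90 and 200 ms against a 100 ms baseline give a ratio of 90 / 100, so the best detour would cut latency by about a tenth.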

Related Work
BGP
– Misconfiguration classification
– Locate origin via time, prefix, view
Traceroute
– Path symmetry: 49% asymmetric; 91% persist for more than several hours
Ping/Traceroute hybrids

Conclusions
Passive monitoring
– Enables much faster detection of anomalies
– Better resolution, temporary anomaly detection
Failure distribution (AS topology)
– Tier 1 most stable, Tier 3 least stable
Loop behaviour
– Temporary loops have much longer lengths
– Most span 4 routers
Path change resolution
– 63% of outages occur within 3 hops of end host
– Over half confined to 2 ASes, 50% confined within 3 hops
Alternate path discovery
– Largely unsuccessful; most outages near the network edge lack any redundancy