Presentation by Michael Smathers, Usman Jafarey CS395/495 IMRE, April 24, 2006 PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area.


1 Presentation by Michael Smathers, Usman Jafarey CS395/495 IMRE, April 24, 2006 PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services

2 Large volume of traffic data required to characterize misbehavior, wide-area services –Peer-to-peer (P2P) systems –Content distribution networks (CDN) Solution: Combine passive monitoring of wide area networks with active probes to quantify and characterize anomalies. Detecting Path Anomalies

3 Traceroute only maps forward path; difficult to infer if problem is with forward or reverse path without destination cooperation. BGP/OSPF propagate failure information. Traceroute may stop at a hop that is not the source of the failure. High variance in failure duration makes it difficult to respond in time. Few sites had enough coverage to identify all affected paths of a failure. Traditional Detection…

4 More accurate, complete view of failures thanks to geographical diversity of nodes Minimum overhead; active probing is initiated only after passive monitoring detects anomaly High rate of failure detection thanks to large volumes of traffic Advantages of this approach

5 Passively monitoring traffic on PlanetLab since February 2004 to detect anomalous behaviour – Coordinate active probes between PlanetLab sites to confirm/characterize anomaly and measure scope ~90,000 anomalies confirmed each month with PlanetSeer. PlanetLab Test Bed

6 Wide-area service network: CoDeeN 7-12K clients/day 100-200GB/day 5-7 million requests/day 120 nodes in North America (350 world-wide) Passive Monitoring Daemons (MonD) run on all CoDeeN nodes to detect anomalous TCP traffic behaviour. Active Probing Daemons (ProbeD) run on all PlanetLab nodes, including CoDeeN nodes, awaiting requests from MonDs. Components

7 1. MonD detects anomaly, sends request to local ProbeD. 2. ProbeD contacts ProbeDs on other nodes to coordinate planet-wide probe. 3. ProbeDs are organized in groups for distributed probe. Operation

8 Uses PlanetLab's tcpdump to observe all incoming and outgoing TCP packets. Uses this information to generate path- and flow-level statistics, which are used to identify possible anomalies in real time. Two indicators of anomalies: –Change in TTL (Time To Live) field –Multiple consecutive timeouts Current threshold: 4 timeouts If MonD is on the receiving side, the timeouts mean its ACKs are not reaching the sender, so we can assume the forward path is at fault. If MonD is the sender, we cannot determine from timeouts alone which path contains the problem. MonD - Operation

9 When MonD is sender, maintain two variables for each flow: SendSeqNo, sequence number of most recently sent packet. SendRtxCount, count of times the packet has been retransmitted. CurrentSeqNo > SendSeqNo; flow is making progress, clear SendRtxCount and set SendSeqNo to current. CurrentSeqNo < SendSeqNo; fast retransmit. Set SendSeqNo to current. CurrentSeqNo = SendSeqNo, timeout; Increment SendRtxCount. If SendRtxCount exceeds threshold, MonD notifies ProbeD of possible anomaly. MonD - Timeout Detection
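The sender-side rules on this slide can be sketched as a small per-flow state machine. This is only an illustration: the class and function names are hypothetical, and it assumes ProbeD is notified on the 4th consecutive timeout (change `>=` to `>` for a strict reading of "exceeds threshold").

```python
from dataclasses import dataclass

TIMEOUT_THRESHOLD = 4  # "Current threshold: 4 timeouts" (from the MonD slide)


@dataclass
class SenderFlowState:
    send_seq_no: int = -1    # SendSeqNo: sequence number of most recently sent packet
    send_rtx_count: int = 0  # SendRtxCount: times that packet has been retransmitted


def on_outgoing_packet(state: SenderFlowState, current_seq_no: int) -> bool:
    """Update the flow state; return True when ProbeD should be notified."""
    if current_seq_no > state.send_seq_no:
        # Flow is making progress: clear the counter and advance.
        state.send_rtx_count = 0
        state.send_seq_no = current_seq_no
        return False
    if current_seq_no < state.send_seq_no:
        # Treated as a fast retransmit: only move SendSeqNo back.
        state.send_seq_no = current_seq_no
        return False
    # Same sequence number again: treated as a timeout retransmission.
    state.send_rtx_count += 1
    # Assumption: notify once the count reaches the threshold.
    return state.send_rtx_count >= TIMEOUT_THRESHOLD
```

Feeding the same sequence number four times after the initial send makes the fourth call return `True`; any later packet with a larger sequence number resets the counter.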

10 On the receiver side, MonD maintains the largest seq. no seen per flow. If the current packet has the same seq. no, increment a counter. When the counter hits the threshold, notify ProbeD that the sender is not seeing ACKs. MonD - cont’d…
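A minimal sketch of the receiver-side counter follows. The `ReceiverMonitor` class and its method names are hypothetical, and resetting the counter when a larger sequence number arrives is an assumption, since the slide does not say when the counter is cleared.

```python
from collections import defaultdict


class ReceiverMonitor:
    """Per-flow duplicate counter for the receiver side (illustrative only)."""

    def __init__(self, threshold: int = 4):
        self.threshold = threshold           # same 4-timeout threshold, by assumption
        self.largest_seq = {}                # flow id -> largest seq. no seen
        self.repeat_count = defaultdict(int)

    def on_incoming_packet(self, flow_id, seq_no: int) -> bool:
        """Return True when ProbeD should be told the sender is not seeing ACKs."""
        largest = self.largest_seq.get(flow_id)
        if largest is None or seq_no > largest:
            # New data: remember it and (assumption) clear the counter.
            self.largest_seq[flow_id] = seq_no
            self.repeat_count[flow_id] = 0
            return False
        if seq_no == largest:
            # The sender is resending data the receiver already acknowledged.
            self.repeat_count[flow_id] += 1
            return self.repeat_count[flow_id] >= self.threshold
        return False
```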

11 Three probing operations: 1. Baseline probes, run when new IP is added to MonD path table. 2. Forward probes, traceroutes invoked at multiple geographically distributed nodes when MonD detects anomaly. Rate limited: ProbeD will not forward probe the same destination more than once in 10 minutes. 3. Reprobes, if anomaly is confirmed by forward probe, reprobes sent by initial ProbeD to determine duration and effects of anomaly. Reprobes sent at 0.5, 1.5, 3.5 and 7.5 hours after anomaly detection time. Reprobes compared to original baseline and forward probes. ProbeD
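The rate limit and the reprobe schedule can be illustrated as below. This is a sketch under stated assumptions: `ForwardProbeLimiter` and `reprobe_times` are hypothetical names, and a probe exactly 10 minutes after the previous one is treated as allowed.

```python
from datetime import datetime, timedelta

FORWARD_PROBE_INTERVAL = timedelta(minutes=10)  # from the slide
REPROBE_OFFSETS_HOURS = (0.5, 1.5, 3.5, 7.5)    # from the slide


class ForwardProbeLimiter:
    """Allow at most one forward probe per destination per 10 minutes."""

    def __init__(self):
        self._last_probe = {}  # destination -> time of last forward probe

    def allow(self, destination: str, now: datetime) -> bool:
        last = self._last_probe.get(destination)
        if last is not None and now - last < FORWARD_PROBE_INTERVAL:
            return False
        self._last_probe[destination] = now
        return True


def reprobe_times(detected_at: datetime):
    """Absolute reprobe times for an anomaly detected at `detected_at`."""
    return [detected_at + timedelta(hours=h) for h in REPROBE_OFFSETS_HOURS]
```

The gaps between successive reprobes (0.5, 1, 2 and 4 hours) double each time, so short-lived anomalies are sampled densely while long-lived ones cost few extra probes.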

12 353 ProbeDs running on 145 PlanetLab sites. Distributed across North/South America, Europe, Asia and elsewhere. Membership information kept for ProbeDs to avoid unnecessary communication to dead nodes. 30 ProbeD node groups based on geographic diversity. ProbeD receives request from local MonD, then –forwards request to one ProbeD from each group –ProbeDs perform probe, send results to requester. –originator collects data ProbeD - Operation
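The fan-out step (one ProbeD from each group, skipping dead nodes) might look like the following sketch. The function name and data shapes are hypothetical, and choosing the group member at random is an assumption; the slide does not say how the member is selected.

```python
import random


def pick_probers(groups, alive, rng=random):
    """Pick one live ProbeD from each group.

    groups: dict mapping group name -> list of ProbeD node names
    alive:  set of node names currently believed to be up (membership info)
    """
    chosen = []
    for members in groups.values():
        candidates = [m for m in members if m in alive]
        if candidates:  # skip groups with no live member
            chosen.append(rng.choice(candidates))  # assumption: random choice
    return chosen
```

With 30 groups, a request therefore triggers at most 30 remote traceroutes, regardless of how many ProbeDs are running.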

13 887,521 unique client IPs from 9232 ASes. Probes traversed 10090 ASes. (over half the ASes on the Internet) 2,259,558 possible anomalies 271,898 confirmed ProbeD - Dataset

14 Unusable hops identified by * in place of name, removed. Relative hop count maintained. Missing hops found by comparing traceroutes that share destination. Repairing Traceroute Data
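As a small illustration of the first repair step, the sketch below drops `*` hops while keeping each remaining hop's original position. Reading "relative hop count maintained" as "keep the original hop numbers" is an assumption, and the function name is hypothetical.

```python
def drop_unusable_hops(raw_hops):
    """Return (hop_number, name) pairs, skipping '*' entries.

    Hop numbers start at 1 and are taken from the raw traceroute, so the
    distance between surviving hops is preserved.
    """
    return [(i, name) for i, name in enumerate(raw_hops, start=1) if name != "*"]
```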

15 Anomaly confirmed if any of the following conditions are met: –There is a loop in the traceroute –Local traceroute disagrees with baseline –Local traceroute doesn't reach destination but other traceroutes make it –Traceroute returns ICMP destination unreachable Anomaly Detection
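The four confirmation conditions can be written as a single check. This is a sketch only: the `Traceroute` record and its fields are hypothetical, and "disagrees with baseline" is approximated here by an exact comparison of hop lists.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Traceroute:
    hops: List[str]                 # router names in order, unusable hops removed
    reached_destination: bool
    has_loop: bool = False          # result of a separate loop check
    icmp_unreachable: bool = False  # ICMP destination unreachable was returned


def confirmation_reasons(local: Traceroute, baseline: Traceroute,
                         remote: List[Traceroute]) -> List[str]:
    """Return the conditions that hold; the anomaly is confirmed if any do."""
    reasons = []
    if local.has_loop:
        reasons.append("loop")
    if local.hops != baseline.hops:  # assumption: exact hop-list comparison
        reasons.append("disagrees with baseline")
    if not local.reached_destination and any(t.reached_destination for t in remote):
        reasons.append("unreachable locally, reachable elsewhere")
    if local.icmp_unreachable:
        reasons.append("ICMP destination unreachable")
    return reasons
```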

16 Detected if same sequence observed at least 3 times in a traceroute. Persistent loops, traceroute stays in loops until max hops. Temporary loops, loops resolved before max hops. Reprobes determine duration of persistent loop. Routing Loops
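One way to sketch the loop rule is shown below. It assumes "same sequence observed at least 3 times" means back-to-back repetitions of the same hop sequence, that a loop involves at least two routers, and that a loop is persistent when the trace used all of its hops while still inside the cycle. All names are hypothetical.

```python
from typing import List, Optional, Tuple

MIN_REPEATS = 3      # from the slide
MIN_LOOP_LENGTH = 2  # assumption: a loop involves at least two routers


def find_loop(hops: List[str]) -> Optional[Tuple[int, int]]:
    """Return (start_index, loop_length) of the first hop sequence that
    repeats MIN_REPEATS times back to back, or None if there is none."""
    n = len(hops)
    for start in range(n):
        for length in range(MIN_LOOP_LENGTH, (n - start) // MIN_REPEATS + 1):
            cycle = hops[start:start + length]
            if all(hops[start + k * length:start + (k + 1) * length] == cycle
                   for k in range(1, MIN_REPEATS)):
                return start, length
    return None


def classify_loop(hops: List[str], max_hops: int) -> str:
    """Return 'none', 'persistent' or 'temporary'."""
    loop = find_loop(hops)
    if loop is None:
        return "none"
    start, length = loop
    cycle = set(hops[start:start + length])
    # Persistent: the trace ran to max_hops and never left the cycle.
    if len(hops) >= max_hops and hops[-1] in cycle:
        return "persistent"
    return "temporary"
```

The returned `loop_length` corresponds to the "number of routers involved" used later to measure loop scope.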

17 Number of routers/ASes involved in loop. Loop length – number of routers involved Temporary loops have longer lengths than persistent loops Persistent loops generally involve a single AS Loops mapped by tiers of ASes involved Measuring Scope

18 Temporary loops overload routers Persistent loops cause loss of connectivity Degrade latency Loop Effects

19 Distinguish between forward/reverse anomalies Scope of anomaly; hops between anomaly & end host Classify as either path change or path outage Evaluating Reference Paths –Hazards; destination behind firewall, intermediate router filtering –Firewall heuristics; choosing appropriate distance n between host & anomaly 0 < RevHop(dst) - RevHop(Sx) < n Reference Paths

20 Comparing reference path (R) with local path (L) –Path change; L reaches last hop of R –Path outage; L cuts out before R –Path outage + Path change; L diverges from R, arrives at R’s last hop Breakdown of all anomalies observed: –Path Change: 48% –Forward Outage: 10% –Other: 24% –Temporary: 18% Non-Loop Anomalies

21 Define scope; # hops on R that could change next hop value Remote traceroute from various locations, find Intercept path –Intercept path narrows scope Find relative location of anomaly, i.e. near host –Find distance of path change by average distances of all paths in scope Path Changes

22 Distinguish between forward, reverse paths Forward path: –Route change on forward path, in addition to outage –ICMP dest. unreachable –Reported as timeout on forward path by MonD 35% of anomalies found to be Fwd Timeout (inferred by MonD) –Indistinguishable without passive/active probes Path Outage

23 Path Change Detection - AS

24 How many failures can be bypassed? –For all clients with reference path, 62815 reachability failures –Of these, PlanetSeer nodes able to reach destination in 27263 cases (43% of failures) –Same results achieved using 15 vantage points as all 30 Bypass ratio: ratio of the minimum RTT of any bypass path to the RTT of the baseline path –Improves latency in 23% of new paths Bypassing Anomalies
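The arithmetic behind these figures can be checked with a few lines. The failure counts come from the slide; the `bypass_ratio` helper is hypothetical and assumes the bypass ratio is the minimum bypass RTT divided by the baseline RTT, so a value below 1 means the bypass path is faster.

```python
reachability_failures = 62815  # from the slide
bypassed = 27263               # from the slide

# Fraction of reachability failures that some PlanetSeer node could bypass.
bypass_fraction = bypassed / reachability_failures  # roughly 0.43


def bypass_ratio(bypass_rtts_ms, baseline_rtt_ms):
    """Minimum RTT over all bypass paths, relative to the baseline RTT."""
    return min(bypass_rtts_ms) / baseline_rtt_ms
```

For example, with made-up RTTs of 120 ms and 80 ms on two bypass paths and a 100 ms baseline, the ratio is 0.8, which would count as a latency improvement.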

25 BGP –misconfiguration classification –Locate origin via time, prefix, view Traceroute; Path symmetry; 49% asymmetric, 91% persist for more than several hours Ping/Traceroute hybrids Related Work

26 Passive Monitoring –Enables much faster detection of anomalies –Better resolution, temporary anomaly detection Failure distribution (AS topology) –Tier 1 most stable, Tier 3 least stable Loop Behaviour –Temporary loops have much longer lengths –Most span 4 routers Path Change resolution –63% of outages occur within 3 hops of end host –Over half confined to 2 ASes, 50% confined within 3 hops Alternate path discovery –Largely unsuccessful, most outages near network edge lack any redundancy Conclusions

