Effective Diagnosis of Routing Disruptions from End Systems
Ying Zhang, Z. Morley Mao, Ming Zhang


1 Effective Diagnosis of Routing Disruptions from End Systems
Ying Zhang, Z. Morley Mao, Ming Zhang

2 Routing disruptions impact application performance
- More applications today have high QoS requirements
- Routing events can cause high loss and long delays
[Diagram labels: Src, AS A, AS B, AS C, Internet, AS D, AS E, Dst]

3 Existing approaches to diagnose routing disruptions are ISP-centric
- Require routing data from many routers in ISPs [Feldmann04, Teixeira04, Wu05]
- Passive and accurate
[Diagram labels: AS A, AS B, AS C, AS D, Internet, BGP collectors]

4 Limitations of ISP-centric approaches
- Difficult to gain access to data from many ISPs
- BGP data reflects the “expected” data-plane paths, which may differ from the actual ones
[Diagram labels: AS A, AS B, AS C, AS D, Internet, ISP, End-systems, with question marks on the unobserved parts]

5 Can we diagnose entirely from end systems?
- Goal: infer data-plane paths of many routers
[Diagram labels: Probing host, ISP A, AS B, AS C, AS D, Dst]

6 Our approach: end-system-based monitoring
- Only requires probing from end hosts
- Covers all the PoPs of a target ISP
[Diagram labels: Probing host, Target ISP, AS B, AS C, AS D, Dst]

7 Our approach: end-system-based monitoring
- Covers most of the destinations on the Internet
[Diagram labels: Probing host, ISP A, AS B, AS C, AS D, Dst]

8 Our approach: end-system-based monitoring
- Identifies routing changes by comparing paths measured consecutively
[Diagram labels: Probing host, ISP A, AS B, AS C, AS D, Dst]
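The comparison of consecutively measured paths can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `(host, destination)` key, the hop labels, and the function name `find_path_changes` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# A path is the ordered sequence of hops seen by one probe (hypothetical labels).
Path = Tuple[str, ...]
# Hypothetical key identifying one measured path: (probing host, destination).
Key = Tuple[str, str]


def find_path_changes(
    prev: Dict[Key, Path], curr: Dict[Key, Path]
) -> List[Tuple[Key, Path, Path]]:
    """Return (key, old_path, new_path) for every key measured in both
    snapshots whose path differs between them."""
    changes = []
    for key, new_path in curr.items():
        old_path = prev.get(key)
        # Skip keys that were not measured in the previous snapshot.
        if old_path is not None and old_path != new_path:
            changes.append((key, old_path, new_path))
    return changes


# Two consecutive snapshots; only the path to dst1 changes.
prev_snapshot = {
    ("host1", "dst1"): ("ISP A", "AS B", "AS D"),
    ("host1", "dst2"): ("ISP A", "AS C"),
}
curr_snapshot = {
    ("host1", "dst1"): ("ISP A", "AS C", "AS D"),
    ("host1", "dst2"): ("ISP A", "AS C"),
}
changes = find_path_changes(prev_snapshot, curr_snapshot)
```

A real system would also have to tolerate measurement noise such as unresponsive hops, which this sketch ignores.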

9 Advantages and challenges
- Advantages:
  - No need for access to ISP-proprietary data
  - Identify actual data-plane paths
  - Monitor data-plane performance
- Challenges:
  - Limited resources to probe
  - Coverage of probed paths
  - Timing granularity
  - Measurement noise

10 System architecture
- Collaborative probing
- Event identification and classification
- Event correlation and inference
- Event impact analysis
[Diagram labels: Target ISP, Reports]

11 Outline
- Collaborative probing
- Event identification and classification
- Event correlation and inference
- Result and validation

12 Collaborative probing
- Using a set of hosts
  - To learn the routing state
  - To improve coverage
  - To reduce overhead
[Diagram labels: Probing host, ISP A, AS B, AS C, AS D]

13 Outline
- Collaborative probing
- Event identification and classification
- Event correlation and inference
- Result and validation

14 Event classification
- Classify events according to ingress/egress changes
  - Type 1: ingress PoP changes
  - Type 2: ingress PoP same, egress PoP different
  - Type 3: ingress PoP same, egress PoP same
[Diagram labels: Probing host, Target ISP, Destination prefix P]
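The three event types can be expressed as a small decision function. This is a sketch under assumptions: the `PoPPath` record, its field names, and the PoP names in the example are hypothetical, and the slides do not say how ingress and egress PoPs are extracted from a measured path.

```python
from typing import NamedTuple, Tuple


class PoPPath(NamedTuple):
    """Hypothetical summary of one measured path through the target ISP."""
    ingress: str            # PoP where the path enters the target ISP
    egress: str             # PoP where the path leaves the target ISP
    hops: Tuple[str, ...]   # full hop sequence, for reference


def classify_event(old: PoPPath, new: PoPPath) -> int:
    """Return 1, 2, or 3 following the three types on the slide."""
    if old.ingress != new.ingress:
        return 1  # Type 1: ingress PoP changes
    if old.egress != new.egress:
        return 2  # Type 2: same ingress PoP, different egress PoP
    return 3      # Type 3: same ingress PoP, same egress PoP


old_path = PoPPath("NYC", "CHI", ("NYC", "DC", "CHI"))
type1 = classify_event(old_path, PoPPath("BOS", "CHI", ("BOS", "CHI")))
type2 = classify_event(old_path, PoPPath("NYC", "DAL", ("NYC", "DC", "DAL")))
type3 = classify_event(old_path, PoPPath("NYC", "CHI", ("NYC", "CLE", "CHI")))
```

Checking the ingress first means that an event where both ingress and egress change is counted as Type 1, which matches the slide's wording but is otherwise a choice made for this sketch.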

15 Outline
- Collaborative probing
- Event identification and classification
- Event correlation and inference
- Result and validation

16 Likely causes: link failures
[Diagram labels: Probing host, Target ISP, Old path, New path, Old egress PoP, New egress PoP, Neighbor AS, Destination prefix P]

17 Likely causes: internal distance changes
- Hot potato changes
  - Cost of old internal path increases
  - Cost of new internal path decreases
[Diagram labels: Probing host, Old egress PoP, New egress PoP, Neighbor AS; distance annotations: 120, 80, 100, 120]

18 Event correlation
- Spatial correlation: a single network failure often affects multiple routers
- Temporal correlation: routing events occurring close together are likely due to only a few causes
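One simple way to exploit temporal correlation is to group events whose timestamps fall close together before looking for a shared cause. The sketch below assumes a gap-based grouping with a hypothetical `max_gap` threshold; the slides do not state how the window is actually chosen.

```python
from typing import List


def cluster_by_time(timestamps: List[float], max_gap: float) -> List[List[float]]:
    """Group event timestamps so that consecutive events in a cluster are at
    most max_gap apart (same time unit as the timestamps)."""
    clusters: List[List[float]] = []
    for t in sorted(timestamps):
        # Extend the current cluster if this event is close to its last event.
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)
        else:
            clusters.append([t])
    return clusters


# Event times in seconds (illustrative values), grouped with a 60-second gap.
event_times = [420, 0, 30, 1000, 50, 400]
clusters = cluster_by_time(event_times, max_gap=60)
```

Each resulting cluster can then be handed to the inference step as a set of events that plausibly share a small number of causes.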

19 Inference methodology
- Evidence: an event that supports the cause
[Diagram labels: Probing host, Target ISP, New path, New egress, Link L, Destination prefix P; Cause: Link L is down]

20 Inference methodology
- Conflict: a measurement trace that conflicts with the cause
[Diagram labels: Probing host, Target ISP, New path, New egress, Link L, Destination prefix P; Cause: Link L is down]

21 Inference methodology
- Evidence node: [1,2,3]->[1,2,4]
- Cause: link 2-3 down
- Cause: node 3 withdraws the route
[Diagram labels: AS 1, AS 2, AS 3, AS 4, Withdrawal]

22 Inference methodology
Evidence graph:
- Evidence node: [1,2,3]->[1,2,4]
- Evidence node: [0,2,3]->[0,2,4]
- Cause: link 2-3 down
- Cause: node 3 withdraws the route
[Diagram labels: AS 0, AS 1, AS 2, AS 3, AS 4, Withdrawal]

23 Inference methodology
Conflict graph:
- Conflict node: [1,2,3,6]
- Conflict node: [0,2,3,6]
- Conflict node: [0,2,3]
- Cause: link 2-3 down
- Cause: node 3 withdraws the route
[Diagram labels: AS 0, AS 1, AS 2, AS 3, AS 6]

24 Inference methodology
- Greedy algorithm: find the minimum set of causes that can explain all the evidence while minimizing conflicts
Evidence graph:
- Evidence node: [1,2,3]->[1,2,4]
- Evidence node: [0,2,3]->[0,2,4]
Conflict graph:
- Conflict node: [1,2,3,6]
- Conflict node: [0,2,3,6]
- Conflict node: [0,2,3]
[Diagram annotations: "Evidence: 2, Conflicts: 3" and "Evidence: 2, Conflicts: 0"]
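A greedy selection over the two bipartite graphs might look like the sketch below. It is an assumption-laden illustration, not the authors' algorithm: the scoring rule (most unexplained evidence first, fewest conflicts as tie-breaker), the node labels `e1`, `e2`, `c1` to `c3`, and the mapping of the "Evidence: 2, Conflicts: 3" annotation to the link-down cause are all guesses made for the example.

```python
from typing import Dict, List, Set


def greedy_causes(
    evidence: Dict[str, Set[str]], conflicts: Dict[str, Set[str]]
) -> List[str]:
    """Pick causes until every evidence node is explained.

    evidence[c]  = evidence nodes that cause c would explain
    conflicts[c] = conflict nodes (measured traces) that contradict cause c
    """
    unexplained: Set[str] = set().union(*evidence.values())
    chosen: List[str] = []
    while unexplained:
        # Assumed scoring: cover the most unexplained evidence,
        # and break ties by preferring the cause with fewer conflicts.
        best = max(
            evidence,
            key=lambda c: (len(evidence[c] & unexplained),
                           -len(conflicts.get(c, set()))),
        )
        gained = evidence[best] & unexplained
        if not gained:  # defensive: nothing left that any cause explains
            break
        chosen.append(best)
        unexplained -= gained
    return chosen


# Example loosely mirroring the slide: both causes explain the two evidence
# nodes, but (by assumption) only "link 2-3 down" is contradicted by traces.
evidence_graph = {
    "link 2-3 down": {"e1", "e2"},
    "node 3 withdraws the route": {"e1", "e2"},
}
conflict_graph = {
    "link 2-3 down": {"c1", "c2", "c3"},
    "node 3 withdraws the route": set(),
}
selected = greedy_causes(evidence_graph, conflict_graph)
```

Under these assumptions the withdrawal is selected, because it explains the same evidence with no conflicting traces. Greedy set cover of this kind is a heuristic and does not guarantee a true minimum set of causes.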

25 Outline
- Collaborative probing
- Event identification and classification
- Event correlation and inference
- Result and validation

26 ISPs studied
AS name | ASN (tier) | Periods | # of src | # of PoPs | # of probes | Probe gap
AT&T | | 3/23-4/9 | 230 | 111 | 61453 | 18.3 min
Verio | | 4/10-4/22, 9/13-9/22 | 218 | 46 | 81024 | 19.3 min
Deutsche Telekom | | 4/23-5/22 | 149 | 64 | 27958 | 17.5 min
Savvis | | 5/23-6/24 | 178 | 39 | 40989 | 17.4 min
Abilene | | 9/23-9/30, 2/3-2/17 | 113 | 11 | 51037 | 18.4 min

27 Results of event classification
- Many events are internal changes
- Abilene has many ingress changes
Target AS | Total events (% of all traces) | Diff egress | Same ingress and egress: internal PoP path | Same ingress and egress: external AS path | Diff ingress
AT&T | 0.35% | 12.1% | 51% | 35% | 11%
Verio | 0.31% | 27.3% | 48% | 19% | 9.8%
Deutsche Telekom | 0.66% | 4.9% | 8.5% | 80.7% | 7.2%
Savvis | 0.35% | 11% | 45% | 31% | 14%
Abilene | 0.24% | 13.6% | 37% | 40% | 17%

28 Validation with BGP-based approach [Wu05]
- Hot potato changes: egress point changes due to internal distance changes
Hot potato changes | BGP based | Our method | Both (false negatives, false positives)
Tier-1 AS | 147 | 185 | 101 (31%, 45%)
Abilene network | 79 | 88 | 60 (24%, 31%)
Columns: number of incidents identified by the BGP method, by our method, and by both.
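The percentages in the "Both" column appear to follow from the three counts. The sketch below assumes that the false negative rate is the share of BGP-identified incidents not found by the end-system method, and the false positive rate is the share of end-system incidents not confirmed by BGP. These definitions are an inference, but they reproduce the hot potato figures to within rounding.

```python
from typing import Tuple


def mismatch_rates(bgp_count: int, our_count: int, both_count: int) -> Tuple[float, float]:
    """Return (false_negative_rate, false_positive_rate) as fractions.

    Assumed definitions:
      false negatives = incidents seen by the BGP method but not by ours
      false positives = incidents seen by our method but not by the BGP method
    """
    false_negative_rate = (bgp_count - both_count) / bgp_count
    false_positive_rate = (our_count - both_count) / our_count
    return false_negative_rate, false_positive_rate


# Hot potato counts from the table: (BGP based, our method, both).
tier1_fn, tier1_fp = mismatch_rates(147, 185, 101)
abilene_fn, abilene_fp = mismatch_rates(79, 88, 60)

print(f"Tier-1 AS: FN {tier1_fn:.1%}, FP {tier1_fp:.1%}")
print(f"Abilene:   FN {abilene_fn:.1%}, FP {abilene_fp:.1%}")
```

For the Tier-1 AS this gives roughly 31% and 45%; for Abilene it gives about 24% and just under 32%, which the slide reports as 31%.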

29 Validation with BGP-based approach
- Session resets: peering link up/down
- Inaccuracy reasons:
  - Limited coverage
  - Coarse-grained probing
  - Measurement noise
Session reset | BGP based | Our method | Both (false negatives, false positives)
Tier-1 AS | 9 | 15 | 6 (33%, 50%)
Abilene network | 7 | 11 | 7 (0%, 36%)

30 System performance
- Can keep up with generated routing state
- Applicable for real-time diagnosis and mitigation
  - Reactive: construct alternate paths to bypass the problem
  - Proactive: avoid paths with many historical routing disruptions

31 Conclusion
- Developed the first system to diagnose routing disruptions purely from end systems
- Used a simple greedy algorithm on two bipartite graphs to infer causes
- Comprehensively validated the accuracy

32 Thank you! Questions?

33 Performance impact analysis
- End-to-end latency changes caused by different types of routing events

34 Validation with BGP data
- BGP feeds from RouteViews, RIPE, Abilene, and 29 BGP feeds from a Tier-1 ISP
- The destination prefix coverage and the routing event detection rate
Target AS | Dst. prefix coverage | Dst. prefixes traversing PoPs with BGP feeds | Detected events (AS change, next hop change) | Missed events (short-duration, filtering, other)
AT&T | 15% | 1.5% | 11% (10.3%, 3.2%) | 89% (75%, 13%, 1%)
Verio | 18.6% | 18.1% | 23% (19.1%, 8.6%) | 77% (73%, 4%, 0%)
Savvis | 7.8% | 1.1% | 6% (5.8%, 0.5%) | 94% (80%, 9%, 5%)
Abilene | 6% | | 21% (17.3%, 5.8%) | 79% (61%, 15%, 3%)

35 Event classification: same ingress PoP, different egress PoP
- Policy changes
  - Local preference in the old route decreases
  - Local preference in the new route increases
[Diagram labels: Probing host, Target ISP, Old path, New path, Old egress PoP, New egress PoP, Neighbor AS; Local Pref: 100->50; Local Pref: 60->110]

36 Event classification: same ingress PoP, different egress PoP
- External routing changes
  - Old route worsens due to external factors (withdrawal, longer AS path)
  - New route improves due to external factors
[Diagram labels: Probing host, Target ISP, Old path, New path, Old egress PoP, New egress PoP, AS A, AS B; AS paths: ABCD->ABEFD, BCEFD->BEFD]

37 Event classification: same ingress PoP, same egress PoP
- Internal PoP path changes
  - Cost of old internal path increases
  - Cost of new internal path decreases
- External AS path changes
[Diagram labels: Probing host, Target ISP, Old path, New path, Destination prefix P]

38 Results of cause inference
- Effectiveness of inference algorithm
- Cluster: a group of events with the same root cause

39 Event identification
- A routing event: a path change
- Event identification: comparing consecutive routing snapshots
[Diagram labels: Probing host, ISP A, AS B, AS C, AS D, Dst]

