RRAPID: Real-time Recovery based on Active Probing, Introspection, and Decentralization Takashi Suzuki Matthew Caesar.


1 RRAPID: Real-time Recovery based on Active Probing, Introspection, and Decentralization Takashi Suzuki Matthew Caesar

2 Motivation
- Today's Internet core has bursty losses
  - Backbones have low average loss rates (<0.2%), but experience large bursts in loss
  - Loss durations vary from 10 ms to 33.72 s
  - 6 out of 7 providers experienced large outage periods of 10-220 s, 1-2 times per day
  - Difficult for multimedia applications to recover from repeated loss (e.g. with FEC)
- Commonly used restoration techniques insufficient
  - Link-layer recovery, MPLS not yet uniformly deployed
  - RON too slow (20 s), not scalable
- → Real-time recovery desired
Source: "Assessment of VoIP Quality over Internet Backbones," Markopoulou, Tobagi, Karam (INFOCOM 2002)

3 Approach
RRAPID: Real-time Recovery based on Adaptive Probing, Introspection, and Dampening
- Technique: overlay-based, real-time recovery
  - Use link-state routing
  - Determine link cost from packet receipt delay
  - Adaptively dampen route advertisements
- Desirable properties:
  - Speed: low end-to-end failure time
  - Stability: few route oscillations
  - Accuracy: avoid reacting to transient failures
  - Scalability: low probing/communication overhead
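
The link-state step above can be illustrated with a small route computation: every overlay node knows all advertised link costs and picks the cheapest path. The following is a minimal sketch using standard Dijkstra; the node names, the cost values, and the `shortest_path` helper are hypothetical and are not taken from the slides, which do not say how RRAPID computes its routes.

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra over a dict-of-dicts graph: graph[u][v] = advertised link cost."""
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == dst:
            break
        for nbr, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if dst not in dist:
        return None, float("inf")
    # Walk the predecessor chain back from dst to src.
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], dist[dst]

# Hypothetical 2-path overlay: A reaches B via relay R1 or relay R2.
overlay = {"A": {"R1": 1.0, "R2": 2.0}, "R1": {"B": 1.0}, "R2": {"B": 2.0}}
before = shortest_path(overlay, "A", "B")

# A failure on R1-B is advertised as a much higher link cost.
overlay["R1"]["B"] = 10.0
after = shortest_path(overlay, "A", "B")
print(before, after)
```

Raising the advertised cost of one link is enough to move traffic to the other overlay path, which is why the stability of those advertisements matters.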

4 System Architecture: Reaction Mechanism
- Route Stabilization (RS):
  - Dampens route flaps
- Adaptive Tracking (AT):
  - Filters noise
  - Reacts quickly to changes
- Link Cost Estimation (LCE):
  - Estimates failure probability from packet loss
  - "Delay-deficit algorithm"
[Figure: layer stack, top to bottom: RS, AT, LCE]
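
To make the layering concrete, here is a minimal sketch of a three-stage pipeline in the same order (LCE feeding AT feeding RS). The slides do not specify the algorithms, so every stage is a stand-in chosen for illustration: a sliding-window loss fraction in place of the "delay-deficit algorithm", a two-gain exponential filter in place of Adaptive Tracking, and threshold hysteresis in place of Route Stabilization. All class names, parameters, and default values are assumptions.

```python
from collections import deque

class LinkCostEstimator:
    """Stand-in for LCE: fraction of the last `window` probes that were lost."""
    def __init__(self, window=5):
        self.recent = deque(maxlen=window)

    def update(self, probe_lost):
        self.recent.append(1.0 if probe_lost else 0.0)
        return sum(self.recent) / len(self.recent)

class AdaptiveTracker:
    """Stand-in for AT: smooth small changes slowly, follow large jumps quickly."""
    def __init__(self, slow_gain=0.2, fast_gain=0.8, jump=0.5):
        self.slow_gain, self.fast_gain, self.jump = slow_gain, fast_gain, jump
        self.value = 0.0

    def update(self, sample):
        error = sample - self.value
        gain = self.fast_gain if abs(error) >= self.jump else self.slow_gain
        self.value += gain * error
        return self.value

class RouteStabilizer:
    """Stand-in for RS: hysteresis so the advertised state does not flap."""
    def __init__(self, up=0.6, down=0.2):
        self.up, self.down = up, down
        self.advertised_failed = False

    def update(self, value):
        if not self.advertised_failed and value >= self.up:
            self.advertised_failed = True
        elif self.advertised_failed and value <= self.down:
            self.advertised_failed = False
        return self.advertised_failed

def run_pipeline(losses):
    """Feed a sequence of probe outcomes (True = lost) through LCE -> AT -> RS."""
    lce, at, rs = LinkCostEstimator(), AdaptiveTracker(), RouteStabilizer()
    return [rs.update(at.update(lce.update(lost))) for lost in losses]

isolated = [False] * 5 + [True] + [False] * 5    # one transient loss
sustained = [False] * 5 + [True] * 5             # link goes down
print(any(run_pipeline(isolated)), run_pipeline(sustained)[-1])
```

With these illustrative settings a single lost probe never changes the advertised state, while a run of losses does, which is the division of labor the slide describes.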

5 Simulation Results: Layered Control
- Show detailed actions of layers
  - LCE output: metric representing probability link has failed
  - AT output: metric with noise filtered
  - RS output: advertised value for link
  - Red spikes result from back-to-back packet losses
- Setup
  - Link failure at t=[150s-170s]
  - Probe every 300ms, 10% loss
- Results
  - First detection in 0.92s, next at 5.42
  - Several false positives due to cold start. Stabilizes in 100s.
  - 0.92s corresponds to 3 lost probes plus propagation delay of 0.02s
[Figure: LCE output, AT output, and RS output over time]
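
The 0.92 s figure follows directly from the setup: three probe intervals of 300 ms plus the 0.02 s propagation delay. The short sketch below reproduces that arithmetic; the function name is hypothetical. The last line additionally estimates how often three back-to-back losses would occur by chance at 10% loss, under the assumption (not made in the slides, and at odds with the bursty losses described in the motivation) that losses are independent.

```python
def detection_time(lost_probes, probe_interval_s, propagation_delay_s):
    """Time to detect a failure after `lost_probes` consecutive missed probes."""
    return lost_probes * probe_interval_s + propagation_delay_s

t = detection_time(lost_probes=3, probe_interval_s=0.300, propagation_delay_s=0.02)
print(f"{t:.2f} s")

# Assumption: independent losses. Chance that 3 consecutive probes are all
# lost on a healthy link with 10% loss.
p_three_in_a_row = 0.10 ** 3
print(f"{p_three_in_a_row:.4f}")
```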

6 Simulation Results: Reaction Speed
- Reaction speed
  - Probing faster improves speed
  - Probing every <400ms can give ~1s reaction times
  - Loss decreases reaction time
- Overhead
  - Probing every >50ms gives reasonable overhead
- Effect of packet loss
  - Increasing packet loss decreases accuracy
  - Advertisements and probes are dropped
  - Subsecond reactions even at 5% loss

7 Simulation Results: Comparison
- Compared RRAPID, RON, and "Oracle-based" routing
- Results:
  - RON requires 4 to 10x more advertisements than RRAPID
  - RON's overhead increases exponentially with probe speed; RRAPID's overhead increases linearly
  - Packet loss has an extreme effect on RON, moderate effect on RRAPID

8 Emulation Results: Real Internet Workload
- Method
  - Measured performance on real Internet workload
  - Traces acquired between UIUC and Stanford
  - Emulated 2-path overlay topology, one trace for each path
  - 1 natural failure at time t=[123.4s to 133.7s]; introduced two failures from t=[40s to 50s] and t=[60s to 70s]
- Result
  - Stable, sub-second reactions
[Figure: number of flows on link #1 and on link #2 over time, for overlay path 1 and overlay path 2]

9 Analysis
- Simplified model of system
  - Modeled RS layer as MIAD → increase by 1, decrease by 1/k
  - Advertisement threshold limited to n
  - Ignored AT layer effects
  - → n*k state Markov chain
- Given:
  - Probe loss probability p
  - Number of paths N
  - Probe interval I
- We can determine:
  - Speed: average reaction time
  - Overhead: average advertisement rate
- Found best-case expected overhead and reaction time for variable transient loss rates
- Results
  - Can react quickly, stably for fairly large amounts of transient packet loss
  - Overhead and reaction time increase super-linearly with loss rate
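
One way to read the simplified model is as a bounded counter that rises by 1 on each lost probe and falls by 1/k on each received probe, with an advertisement triggered when the counter reaches n. The sketch below simulates that reading rather than solving the Markov chain analytically. It is an interpretation, not the authors' analysis: the function names, the independent-loss assumption, the single-path setting (N is ignored), and the example values of n and k are all assumptions.

```python
import random

def probes_until_threshold(p_loss, n, k, rng, max_probes=1_000_000):
    """Count probes until the counter first reaches n; None if it never does."""
    level = 0        # counter kept in integer units of 1/k, so n is n*k units
    top = n * k
    for sent in range(1, max_probes + 1):
        if rng.random() < p_loss:
            level = min(top, level + k)   # lost probe: increase by 1
        else:
            level = max(0, level - 1)     # received probe: decrease by 1/k
        if level >= top:
            return sent
    return None

def mean_time_to_trigger(p_loss, n, k, interval_s, trials=2000, seed=0):
    """Average time until the threshold is reached, over seeded random trials."""
    rng = random.Random(seed)
    counts = [probes_until_threshold(p_loss, n, k, rng) for _ in range(trials)]
    hits = [c for c in counts if c is not None]
    return interval_s * sum(hits) / len(hits) if hits else float("inf")

# Hard failure (every probe lost): the trigger takes exactly n probe intervals.
hard_failure = mean_time_to_trigger(p_loss=1.0, n=3, k=4, interval_s=0.3, trials=10)

# Healthy but lossy link: mean time until a spurious trigger at 10% loss.
spurious = mean_time_to_trigger(p_loss=0.1, n=3, k=4, interval_s=0.3)
print(hard_failure, spurious)
```

Under this reading, n sets the reaction time to a real failure (n probe intervals), while k controls how quickly isolated losses are forgotten and therefore how often transient loss alone triggers an advertisement.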

10 Conclusions
1. Can achieve sub-second reactions on most links with reasonable stability
  - Congested links increase reaction time
  - Can react well on most Internet links
2. Trade-off relationship between overhead and reaction speed
3. Lossy links worsen reaction time
  - Hard to react quickly, stably if all paths have >10% loss
Future work:
  - Improve scalability with route aggregation
  - Extend evaluation of system parameters
  - Consider wider range of topologies, cross traffic, offered loads
