Traffic Engineering with Forward Fault Correction (FFC)

Presentation on theme: "Traffic Engineering with Forward Fault Correction (FFC)"— Presentation transcript:

1 Traffic Engineering with Forward Fault Correction (FFC)
Hongqiang “Harry” Liu, Srikanth Kandula, Ratul Mahajan, Ming Zhang, David Gelernter (Yale University) Today we will take a look at Forward Fault Correction (FFC) in traffic engineering, which proactively makes a network eliminate congestion automatically when faults happen. This is joint work with the MSR folks Srikanth, Ratul, and Ming, and my advisor David at Yale.

2 Cloud services require large network capacity
Growing traffic Cloud Networks Nowadays, cloud services are growing rapidly. This growth puts huge pressure on cloud infrastructure, especially cloud networks. Coping with the growing traffic is a big challenge because the networks are expensive, which makes it difficult to upgrade capacity or expand the network. Expensive (e.g. cost of WAN: $100M/year)

3 TE is critical to effectively utilizing networks
Traffic Engineering Microsoft SWAN Google B4 …… Devoflow MicroTE …… This is also why, in recent years, cloud providers have been making great efforts to deploy traffic engineering systems in both Wide Area Networks and datacenter networks to highly utilize their network resources. WAN Datacenter Network

4 But, TE is also vulnerable to faults
TE controller Network configuration Network view Frequent updates for high utilization Control-plane faults While TE has been shown to be very helpful for highly utilizing network resources, its vulnerability to faults is an essential dark side. This is because TE monitors and controls the network remotely, and it also has to update the network frequently to sustain high utilization. A remote and highly dynamic system can be undermined by various faults that are common in reality. These faults fall into two major categories: control-plane faults and data-plane faults. Data-plane faults Network

5 Control plane faults Failures or long delays in configuring a network device TE Controller Switch TE configurations RPC failure Firmware bugs Overloaded CPU Memory shortage

6 Congestion due to control plane faults
Four new flows (traffic demands), each of size 10. Configuration faults can cause severe congestion. For example, in this TE update from left to right, link s1-s4 will be overloaded if s2 fails to implement the new configuration, which moves the traffic on tunnel s2-s1-s4 onto tunnel s2-s4.

7 Rescaling: Sending traffic proportionally to residual paths
Data plane faults: link and switch failures. Rescaling: sending traffic proportionally to residual paths. Because of the importance of TE, many network providers deploy TE systems in their networks. In such a system, a conceptually centralized TE controller is the core that runs TE algorithms. It relies on monitoring systems to get the current traffic demands and network topology. After obtaining a new TE solution, it configures the traffic rate limiters and ingress routers in the network via a controlling system. Nevertheless, the input and output of the TE controller are not 100% reliable. For example, link and router failures can change the network topology, which might not be reported by the monitoring system in a timely manner. For another example, configurations to some routers might be rejected or lost, which makes the controlling system only partially implement a new TE.
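The proportional rescaling described above can be sketched as follows. The tunnel names and load values here are illustrative, not taken from the talk's example:

```python
def rescale(tunnel_loads, failed):
    """Redistribute traffic from failed tunnels onto a flow's residual
    tunnels, proportionally to each residual tunnel's current load.
    (Hypothetical helper; names and numbers are illustrative.)"""
    residual = {t: load for t, load in tunnel_loads.items() if t not in failed}
    lost = sum(load for t, load in tunnel_loads.items() if t in failed)
    total = sum(residual.values())
    # Each residual tunnel absorbs a share proportional to its current load.
    return {t: load + lost * load / total for t, load in residual.items()}

# A flow of 30 units split over three tunnels; one tunnel fails.
loads = {"t1": 10.0, "t2": 15.0, "t3": 5.0}
print(rescale(loads, failed={"t3"}))  # {'t1': 12.0, 't2': 18.0}
```

Note that rescaling preserves the flow's total rate, so a link shared by the residual tunnels can easily become overloaded, which is exactly the data-plane risk FFC guards against.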

8 Control and data plane faults in practice
In production networks: Faults are common. Faults cause severe congestion. Control plane: fault rate = 0.1%–1% per TE update. Data plane: fault rate = 25% per 5 minutes. Congestion due to faults like those shown in the previous examples happens regularly in practice. Our measurements in a production network show that both control- and data-plane faults are very common. For instance, the control-plane fault rate can be as high as one percent, which means that with 100 switches in the network, we are likely to see one control-plane fault each time we update the network. For data-plane faults, even within 5 minutes it is likely that one or more faults such as link failures occur. Our evaluation shows that even a single control-plane fault can be sufficient to cause large congestion. For example, a CDF of the maximum link over-subscription rate when each network update has one control-plane fault shows a 10% chance that some link is overloaded by 10% of its capacity or more. A single link failure can lead to an even worse situation than control-plane faults.

9 State of the art for handling faults
Heavy over-provisioning. Reactive handling of faults: Control plane faults: retry. Data plane faults: re-compute TE and update the network. Because faults are so common and their consequences so bad, we must handle them. However, the state of the art offers only limited ways to do so. Typically operators use a reactive approach: detect faults and then handle them. For example, when control-plane faults happen, they retry the switches that are not responding; when data-plane faults happen, they compute a new TE and update the network. Nevertheless, reactive approaches are far from satisfactory: they cannot prevent congestion, they are slow (seconds to minutes), and the recovery itself can be blocked by control-plane faults. Heavy over-provisioning, in turn, means a big loss in throughput.

10 How about handling congestion proactively?

11 Forward fault correction (FFC) in TE
[Bad News] Individual faults are unpredictable. [Good News] The number of simultaneous faults is small. FEC guarantees no information loss under up to k arbitrary packet drops: it handles packet loss with careful data encoding. FFC guarantees no congestion under up to k arbitrary faults: it handles network faults with careful traffic distribution. Either bound the congestion, or bound the level of over-provisioning.

12 Example: FFC for control plane faults
Non-FFC. Looking at control plane faults, we already know that configuration failures on s2 (or s3) will cause congestion. So the new TE on the right is non-FFC, or we say k=0.

13 Example: FFC for control plane faults
Control Plane FFC (k=1). Just like FEC, FFC can have overhead in terms of throughput.

14 Trade-off: network utilization vs. robustness
Non-FFC: throughput 50. Control Plane FFC, k=1: throughput 47. Control Plane FFC, k=2: throughput 44. Just like FEC, FFC has overhead: higher protection levels trade throughput for robustness.

15 Systematically realizing FFC in TE
Formulation: How to merge FFC into existing TE framework? Computation: How to find FFC-TE efficiently?

16 Basic TE linear programming formulations
LP formulation. TE decisions: sizes of flows b_f and traffic on paths l_{f,t}. TE objective: maximize throughput, max Σ_f b_f. Basic TE constraints: deliver all granted flows, ∀f: Σ_t l_{f,t} ≥ b_f; no overloaded link, ∀e: Σ_f Σ_{t∋e} l_{f,t} ≤ c_e. FFC constraints: no overloaded link under up to k_c control-plane faults, k_e link failures, and k_v switch failures.
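The basic constraints can be sketched as a feasibility check on a candidate TE solution. The topology, tunnels, and numbers below are illustrative, not from the talk:

```python
# Check a candidate TE against the basic constraints:
#   deliver all granted flows:  sum_t l[f][t] >= b[f]
#   no overloaded link:         sum of loads of tunnels crossing e <= c[e]

tunnels = {                      # flow -> {tunnel: links it traverses}
    "f1": {"t1": ["e1", "e3"], "t2": ["e2", "e3"]},
    "f2": {"t3": ["e3"]},
}
cap = {"e1": 5, "e2": 5, "e3": 10}
b = {"f1": 6, "f2": 4}           # granted flow sizes
l = {"f1": {"t1": 3, "t2": 3}, "f2": {"t3": 4}}  # candidate tunnel loads

def feasible(b, l, tunnels, cap):
    # All granted demand delivered?
    if any(sum(l[f].values()) < b[f] for f in b):
        return False
    # No link over capacity?
    load = {e: 0 for e in cap}
    for f, ts in tunnels.items():
        for t, links in ts.items():
            for e in links:
                load[e] += l[f][t]
    return all(load[e] <= cap[e] for e in cap)

print(feasible(b, l, tunnels, cap))  # True: link e3 is exactly full
```

In the actual system these constraints, together with the objective max Σ_f b_f, go into an LP solver; the sketch only shows what the constraints mean.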

17 Formulating control plane FFC
Flows f1, f2, f3 share a link. What is the total load under faults? A flow whose configuration fails keeps its load from the old TE, while the others carry their new-TE loads: Fault on f1: l1_old + l2_new + l3_new ≤ link capacity. Fault on f2: l1_new + l2_old + l3_new ≤ link capacity. Fault on f3: l1_new + l2_new + l3_old ≤ link capacity. Challenge: too many constraints. With n flows and FFC protection level k, #constraints = C(n,1) + … + C(n,k) for each link.
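The naive enumeration on this slide can be sketched directly: for every set of up to k flows stuck on their old loads, check that the mixed total fits. The loads and capacity below are illustrative:

```python
from itertools import combinations
from math import comb

def naive_ffc_ok(old, new, cap, k):
    """Enumerate every set of up to k faulty flows on one link and
    check the mixed old/new load against the link capacity."""
    n = len(old)
    for r in range(1, k + 1):
        for faulty in combinations(range(n), r):
            total = sum(old[i] if i in faulty else new[i] for i in range(n))
            if total > cap:
                return False
    return True

old = [4, 5, 3, 6]        # per-flow load on this link in the old TE
new = [2, 3, 1, 4]        # per-flow load in the new TE
print(sum(comb(4, r) for r in range(1, 3)))   # 10 constraints for n=4, k=2
print(naive_ffc_ok(old, new, cap=14, k=2))    # True: worst 2 faults add 4
```

Even in this toy case a single link needs C(4,1) + C(4,2) = 10 checks; with realistic n and k the count explodes, which motivates the compression on the next slide.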

18 An efficient and precise solution to FFC
Our approach: a lossless compression from O(n^k) constraints to O(kn) constraints. Total load under faults ≤ link capacity is equivalent to: total additional load due to faults ≤ link spare capacity. Let x_i be the additional load due to fault i. Given X = {x_1, x_2, …, x_n}, FFC requires that the sum of any k elements of X is ≤ the link's spare capacity, which takes O(n^k) constraints. Define y_m as the m-th largest element of X; then a single constraint suffices: Σ_{m=1}^{k} y_m ≤ link spare capacity, which is O(1) per link. How do we express y_m in terms of X?
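The equivalence behind the compression can be sketched numerically: the exponential family of checks collapses to one comparison of the k largest extra loads against the spare capacity. The numbers are illustrative:

```python
def ffc_single_check(old, new, cap, k):
    """One check replaces the O(n^k) enumeration: the k largest extra
    loads x_i = max(old_i - new_i, 0) must fit in the spare capacity."""
    x = [max(o - nw, 0) for o, nw in zip(old, new)]  # extra load if flow i is stuck
    spare = cap - sum(new)
    return sum(sorted(x, reverse=True)[:k]) <= spare

old = [4, 5, 3, 6]
new = [2, 3, 1, 4]
print(ffc_single_check(old, new, cap=14, k=2))  # True: worst 2 faults add 4, spare is 4
```

This single check returns the same verdict as enumerating all fault sets, because the worst case over any k faults is attained by the k largest x_i. The remaining problem, solved on the next slide, is expressing "m-th largest" with linear constraints so an LP solver can use it.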

19 Sorting network
A single comparison takes x1, x2 and outputs z1 = max{x1, x2}, z2 = min{x1, x2}. Arranging such comparisons in rounds sorts the x_i, yielding y_1 (1st largest), y_2 (2nd largest), and so on, so that y_1 + y_2 + … + y_k ≤ link spare capacity can be encoded with linear constraints. Complexity: O(kn) additional variables and constraints. Throughput: optimal for the control plane, and for the data plane if paths are disjoint.
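A comparator network of this kind can be sketched in plain code: only pairwise max/min operations are used, which is what allows each comparator to be written as LP constraints. The odd-even transposition layout below is one simple choice of network, not necessarily the one used in the paper:

```python
def compare_exchange(a, i, j):
    # One comparator: the larger value moves to the lower index.
    a[i], a[j] = max(a[i], a[j]), min(a[i], a[j])

def sort_network(values):
    """Odd-even transposition network: n rounds of fixed, data-independent
    comparators sort the list in descending order."""
    a = list(values)
    n = len(a)
    for rnd in range(n):
        for i in range(rnd % 2, n - 1, 2):
            compare_exchange(a, i, i + 1)
    return a

x = [3, 1, 4, 2]          # extra loads x_i (illustrative)
y = sort_network(x)
print(y)                  # [4, 3, 2, 1] -- y_1, y_2, ...
print(sum(y[:2]))         # 7: sum of the k=2 largest extra loads
```

Because the comparator positions are fixed in advance (data-independent), each max/min pair becomes a handful of linear inequalities, giving the O(kn) variables and constraints stated above.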

20 FFC extensions Differential protection for different traffic priorities. Minimizing congestion risks without rate limiters. Control plane faults on rate limiters. Uncertainty in the current TE. Different TE objectives (e.g. max-min fairness). The FFC formulation is general.

21 Evaluation overview Testbed experiment Large-scale simulation
FFC can be implemented in commodity switches. FFC has no data loss due to congestion under faults. Large-scale simulation: a WAN with O(100) switches and O(1000) links; faults injected based on real failure reports; single-priority traffic in a well-provisioned network; multiple-priority traffic in a well-utilized network.

22 FFC prevents congestion with negligible throughput loss
Single priority. High priority (high FFC protection), medium priority (low FFC protection), low priority (no FFC protection). Metrics shown: FFC data loss / non-FFC data loss (<0.01%) and FFC throughput / optimal throughput. From the figures we can also observe that priority queuing alone is not sufficient to protect higher-priority traffic: if the traffic layout is not robust, high-priority flows can congest each other. However, using both priority queuing and FFC to protect higher-priority traffic, we can indeed realize the vision that when problems happen, the network always sacrifices low-priority traffic first.

23 Conclusions Centralized TE is critical to high network utilization but is vulnerable to control- and data-plane faults. FFC proactively handles these faults. Guarantee: no congestion when #faults ≤ k. Efficiently computable, with low throughput overhead in practice. To end this talk, I want to mention that FFC essentially offers a new knob for network operators to better balance the trade-off between the risk of congestion and heavy network over-provisioning. FFC provides a simple interface, the FFC protection level k, to the operators, and achieves an efficient balance in this trade-off.

