Ethernet Data Center Routing Challenges and 802.1aq/SPB new work PETER ASHWOOD-SMITH email@example.com
802.1aq’s 16 ECTs can give a perfect spread across 16 uplinks when going 2 hops. However: A) The 2nd-layer switch bridge priorities need to be tweaked to guarantee that all 16 are used. B) At least 16 subnets (C/S-VLANs) are needed so that one can be assigned per 802.1aq B-VID. [Figure labels: S1 … S16; A) “Tweak Bridge Priorities Here”; B).]
Can we eliminate ‘tweaking’*? David Allan et al. have a presentation on this, so I won’t spend much time on it. In general, a network with N equal-cost paths from some source to some destination requires #ECT about 25-40% greater than N (to statistically capture them all). Therefore, when #ECT == N, some ‘tweaking’ is usually required (for a DC it is trivial to do, however). Dave et al. suggest non-independence between ECT algorithms as a way to address this (maximize diversity). *Tweaking = adjusting Bridge Priorities up/down from defaults.
“Example” 802.1aq switching cluster: a 48-switch, non-blocking, 2-layer L2 fabric. Assume 100GE NNI links/groups.
16 switches at the “upper” layer, A1..A16, with 32 downlinks (32 x 100GE) per A_n.
32 switches at the “lower” layer, B1..B32, with 16 uplinks (16 x 100GE) and 160 UNI links (160 x 10GE server links) per B_n.
NNI capacity: (16 x 100GE per B_n) x 32 = 512 x 100GE = 51.2T, using 48 x 2T switches.
Servers: 32 x 160 = 5120 x 10GE links; (32 x 160)/2 = 2560 servers @ 2 x 10GE per server.
uFIB = 16 x 48 B-MAC = 768 entries; mFIB = 16 subnets x 48 sources = 768 entries; 1536 FIB entries per node.
Good numbers: “16” and “2” levels.
[Figure: A1..A16 above B1..B32, with servers S1,1..S1,160 through S32,1..S32,160 attached to the B switches.]
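The arithmetic on this slide can be re-derived in a few lines. The following is only a sanity-check sketch; the variable names are made up for illustration, and the inputs are the figures quoted on the slide.

```python
# Fabric dimensions as quoted on the slide (names are illustrative only).
UPPER = 16            # A1..A16
LOWER = 32            # B1..B32
UPLINKS_PER_B = 16    # 100GE NNI links per B switch
UNI_PER_B = 160       # 10GE server links per B switch
LINKS_PER_SERVER = 2  # each server attaches with 2 x 10GE
ECT_COUNT = 16        # one B-VID (and one subnet) per ECT algorithm

switches = UPPER + LOWER                      # 48 switches in total
nni_links = UPLINKS_PER_B * LOWER             # 512 x 100GE
nni_capacity_tbps = nni_links * 100 / 1000    # 51.2 Tb/s
uni_links = UNI_PER_B * LOWER                 # 5120 x 10GE
servers = uni_links // LINKS_PER_SERVER       # 2560 dual-homed servers
ufib_entries = ECT_COUNT * switches           # 16 x 48 B-MAC
mfib_entries = ECT_COUNT * switches           # 16 subnets x 48 sources
fib_per_node = ufib_entries + mfib_entries    # 1536

print(switches, nni_links, nni_capacity_tbps, uni_links, servers, fib_per_node)
```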
For a given ECT-ALG_k, A_j is a member of every SPF-TREE(B_*, ECT-ALG_k). Properly tuned, no two ECT-ALGorithms will use the same A_j as a fork point. [Figure labels: S1 … S16; ECT-ALG #12; Source Node (1).]
Subnet N_i maps to I-SID_j and then to a unique A_(j mod 16). So load spreading allows each A_i to transit a complete subnet. Problem #1: unable to spread further such that A_i and A_j (i != j) each handle a subset of the flows in I-SID_j.
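A minimal sketch of the mapping described here, assuming the transit switch is selected purely by j mod 16. How a result of 0..15 maps onto the labels A1..A16 is not stated on the slide, so the code reports the raw index only; the I-SID values and flow ids are made up.

```python
from collections import defaultdict

NUM_A = 16  # upper-layer switches A1..A16


def transit_index(isid: int) -> int:
    """Return j mod 16 for I-SID j, the index of the transit A switch."""
    return isid % NUM_A


# Hypothetical traffic: three I-SIDs with four flows each.
flows = [(isid, flow) for isid in (100, 101, 116) for flow in range(4)]

flows_per_a = defaultdict(list)
for isid, flow in flows:
    flows_per_a[transit_index(isid)].append((isid, flow))

for index in sorted(flows_per_a):
    print(index, len(flows_per_a[index]))
```

Every flow of a given I-SID lands on the same index, which is the granularity limit the slide calls Problem #1.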
This is an issue under failure of A_j: recovery will move the entire subnet’s traffic to another A_i node. A preferable solution is to spread the affected load over the remaining A_*.
Possible solution: head-end hashing (unicast only). Allow unicast I-SID_i and I-SID_j traffic to be hashed, based on smaller flows, to different B-VIDs (ECT-ALGorithms). This breaks the symmetry and congruence rules but allows edge balancing at smaller granularity. No changes to multicast. Requires learning that is independent of B-VID.
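One way to picture head-end hashing is sketched below. It assumes the hash inputs listed under R2 in the requirements slide of this deck (seed, C.SA, C.DA, C-VID). The use of CRC32, the B-VID values 4001..4016, and the function name `pick_b_vid` are illustrative choices, not anything the deck or 802.1aq specifies.

```python
import zlib

# 16 hypothetical B-VID values, one per ECT algorithm.
B_VIDS = list(range(4001, 4017))


def pick_b_vid(seed: int, c_sa: str, c_da: str, c_vid: int) -> int:
    """Hash a customer flow onto one of the available B-VIDs at the head end."""
    key = f"{seed}|{c_sa}|{c_da}|{c_vid}".encode()
    return B_VIDS[zlib.crc32(key) % len(B_VIDS)]


# 256 hypothetical flows inside a single C-VLAN (one I-SID).
chosen = {
    pick_b_vid(1, f"00:00:00:00:00:{i:02x}", "00:00:00:00:ff:01", 10)
    for i in range(256)
}
print(len(chosen), "distinct B-VIDs used by one subnet")
```

Because the choice depends on the flow rather than on the I-SID alone, flows of a single subnet can use several B-VIDs, at the cost of the symmetry and congruence noted on the slide.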
Interconnection of fabrics creates more than 16 paths (exponential). The number of paths can grow exponentially with increasing levels. A constant number of ECT paths will always be << the number of paths available in many networks. Growing 802.1aq ECT to, say, 32 or even 100 ECMP causes larger unicast FIBs. [Figure: two fabrics (A1..A16, B1..B32 each) interconnected through C1 and C2; path counts by level O(16), O(16x2), O(16x2x16).]
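The growth quoted in the figure is just the product of the fan-out at each level. A small sketch, using the slide’s own fan-outs of 16, 2 and 16 (the function name is made up):

```python
from math import prod


def path_count(fanouts):
    """Number of distinct paths when each level multiplies the choices."""
    return prod(fanouts)


levels = [16, 2, 16]  # fan-out per level, as in the figure
for depth in range(1, len(levels) + 1):
    print(levels[:depth], path_count(levels[:depth]))
```

With only 16 ECT algorithms available, the third line of output (512 paths) already exceeds what head-end selection can address.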
Horizontal growth: not too bad, but needs more ECT-ALGORITHMS. Horizontal growth by 1 just increases the number of ECT by 1. Not too big a problem, but we would need to define new ECT (via Opaque). [Figure: the fabric extended with A17, B33 and B34.]
General issue: with O(degree) choices per hop and O(diameter) hops, #paths ~= O(degree^diameter). So head-end ECT in the worst case requires an exponential number of B-VIDs. [Figure: at S, choose a path to D from N x B-VID.]
A feasible solution: re-assign traffic to a path at each hop. Tandem “ECMP”, just like IP. Need to keep O(degree) number of next hops. Only need one B-VID, which removes O(diameter) from the state cost. The flip side is that you have no control; you just hope for fine-scale statistical distribution. [Figure: a single B-VID from S to D; each hop chooses a path from N x next hop.]
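A toy model of tandem hashing follows, under several assumptions: the topology, the node names, and the use of CRC32 mixed with the node name are all invented for illustration. Each node keeps only its own list of equal-cost next hops toward D and picks one per flow.

```python
import zlib

# Hypothetical equal-cost next hops toward D, one table entry per node.
NEXT_HOPS = {
    "S": ["A1", "A2", "A3", "A4"],
    "A1": ["C1", "C2"], "A2": ["C1", "C2"],
    "A3": ["C1", "C2"], "A4": ["C1", "C2"],
    "C1": ["D"], "C2": ["D"],
}


def next_hop(node: str, flow_key: str, seed: int = 0) -> str:
    """Pick one next hop at this node by hashing the flow key."""
    choices = NEXT_HOPS[node]
    digest = zlib.crc32(f"{seed}|{node}|{flow_key}".encode())
    return choices[digest % len(choices)]


def forward(src: str, dst: str, flow_key: str) -> list:
    """Follow the per-hop choices from src until dst is reached."""
    path = [src]
    while path[-1] != dst:
        path.append(next_hop(path[-1], flow_key))
    return path


print(forward("S", "D", "c_sa=aa,c_da=bb,c_vid=10"))
```

State per node is proportional to its number of next hops, and no node knows the full path, which is the loss of control the slide mentions.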
What about loops in this mode? The 802.1aq Ingress Check is very strong in the case of a single next hop, and hence a single possible ingress for an SA. The 802.1aq Ingress Check is weakened in the case of multiple next hops, and hence multiple possible ingresses for an SA. However, the 802.1aq Agreement Protocol functions correctly in the context of multiple possible next hops for the same B-VID (refer to Mick’s proof). But …
Agreement Protocol concerns. Is it too complex? It is clearly non-trivial; we need implementation/emulation experience. Is it overly Draconian? For example, the bounds on movement are what is required for a mathematical proof by induction. However, there are probably many cases where further movement would not loop. What is the degree of ‘overkill’? Is it marketable? This is unfortunately a legitimate concern! 802.1aq can be deployed without AP until we introduce hash-based forwarding, at which point we require a symmetric AP and/or an on-data-path loop detection/drop mechanism. Believe that an on-data-path loop detection mechanism is required for hash-based ECMP until we have more experience with AP. Recommend we standardize a TTL TAG, either stand-alone or as a new form of I-TAG.
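To illustrate why a TTL bounds the damage of a transient loop, here is a minimal simulation. The TTL semantics (decrement per hop, drop at zero), the initial value of 8, and the forwarding tables are all assumptions; the slide only recommends that a TTL TAG be standardized and does not define its behaviour.

```python
def forward_with_ttl(start, next_hop, dst, ttl):
    """Walk a forwarding table until dst is reached or the TTL runs out."""
    node, hops = start, 0
    while node != dst:
        if ttl == 0:
            return ("dropped", hops)   # loop detected on the data path
        ttl -= 1
        node = next_hop[node]
        hops += 1
    return ("delivered", hops)


# Hypothetical tables: a consistent one, and one with a transient X<->Y loop.
good = {"S": "X", "X": "D"}
looping = {"S": "X", "X": "Y", "Y": "X"}

print(forward_with_ttl("S", good, "D", ttl=8))
print(forward_with_ttl("S", looping, "D", ttl=8))
```

In the looping case the frame is discarded after a bounded number of hops instead of circulating indefinitely.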
View of new work requirements:
R1) New ECT-ALGorithms with improved spreading properties.
R2) Allow optional head-end hash assignment of 802.1aq SPBM UNI known unicast traffic to one of multiple next-hop interfaces/B-VIDs. Very similar to Link Aggregation. Minimally HASH(seed, C.SA, C.DA, C-VID, [IP.SA, IP.DA, IP.PROTO]).
R3) Allow optional tandem hash assignment of 802.1aq SPBM B-VID NNI unicast traffic to one of multiple next-hop interfaces. Essentially a new SPBM ECT-ALG with its own B-VID (i.e. new ECT-ALGorithms, all usable at the same time). Minimally HASH(seed, B-VID, C.SA, C.DA, C-VID, [IP.SA, IP.DA, IP.PROTO]).
R4) Minor OA&M changes in support of R2 and R3, because symmetry/congruence is broken.
R5) More experience with AP (emulations, simulations, etc.), plus the addition of a TTL to a new I-TAG or a TTL-TAG.