
1 Transport Layer Enhancements for Unified Ethernet in Data Centers
K. Kant, Raj Ramanujan, Intel Corp.
Exploratory work only, not a committed Intel position.

2 Context
The data center is evolving; the fabric should too.
Last talk:
–Enhancements to Ethernet, already on track.
This talk:
–Enhancements to the transport layer.
–Exploratory, not on any standards track.

3 Outline
–Data center evolution and its transport impact
–Transport deficiencies and remedies
–Many areas of deficiency …
–Only congestion control and QoS addressed in detail
–Summary and call to action

4 Data Center Today
Tiered structure.
Multiple incompatible fabrics:
–Ethernet, Fibre Channel, IBA, Myrinet, etc.
–Management complexity.
Dedicated servers for applications; inflexible resource usage.
[Diagram: client requests/responses arrive over a network fabric to business-transaction servers; database queries travel over an IPC fabric; SAN storage is reached over a storage fabric.]

5 Future DC, Stage 1: Fabric Unification
Ethernet is dominant, but convergence is really on IP.
–New layer-2 options: PCI Express, optical, WLAN, UWB, …
Most ULPs run over a transport over IP.
Need to comprehend the transport implications.
[Diagram: business transactions, client request/response, iSCSI storage, and database queries all converging on a unified IP fabric.]

6 Future DC, Stage 2: Clustering and Virtualization
SMP giving way to clusters (cost, flexibility, …).
Virtualization:
–Of nodes, network, storage, …
Virtual clusters (VCs):
–Each VC may carry multiple traffic types inside.
[Diagram: sub-clusters 1-3 and storage nodes on an IP network, overlaid by virtual clusters 1-3.]

7 Future DC: New Usage Models
Dynamically provisioned virtual clusters.
Distributed (per-node) storage.
Streaming traffic (VoIP/IPTV plus data services).
HPC in the DC:
–Data mining for focused advertising, pricing, …
Special-purpose nodes:
–Protocol accelerators (XML, authentication, etc.)
New models bring new fabric requirements.

8 Fabric Impact
More types of traffic, with more demanding needs.
Protocol impact at all levels:
–Ethernet: previous presentation.
–IP: change affects the entire infrastructure.
–Transport: this talk.
Why focus on the transport?
–Change is primarily confined to endpoints.
–Many application needs relate to the transport layer.
–The application interface (Sockets/RDMA) stays mostly unchanged.
DC evolution drives transport evolution.

9 Transport Issues and Enhancements
Transport (TCP) enhancement areas:
–Better congestion control and QoS.
–Support for media evolution.
–Support for high availability.
–Many others: message-based and unordered data delivery, connection migration in virtual clusters, transport-layer multicasting.
How do we enhance the transport?
–A new TCP-compatible protocol?
–An existing protocol (SCTP)?
–Evolutionary changes to TCP from a DC perspective?

10 What's Wrong with TCP Congestion Control?
TCP congestion control (CC) works independently for each connection.
–By default TCP equalizes throughput across connections, which is undesirable.
–Sophisticated QoS can change this, but …
Lower-level CC exerts backpressure on the transport:
–Transport-layer congestion control is therefore crucial.
[Diagram: ECN/ICMP congestion feedback from a router/switch flowing back to transport-layer congestion control in the endpoint stacks (app/transport/IP/MAC).]
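To make the backpressure idea concrete, here is a minimal sketch (Python; the function and flag names are illustrative assumptions) of a transport sender reacting to lower-level congestion feedback, an ECN echo carried on an ack in the spirit of RFC 3168, instead of waiting for a loss:

```python
# Minimal sketch: transport reaction to lower-layer congestion feedback
# (an ECN echo carried on an ack).  Names are illustrative assumptions.

def on_ack(cwnd: float, ecn_echo: bool) -> float:
    """Update the congestion window (in segments) on each ack."""
    if ecn_echo:
        # A switch or router marked the packet: back off immediately,
        # before any packet has actually been dropped.
        return max(cwnd / 2.0, 1.0)
    # Otherwise standard additive increase (~1 segment per RTT).
    return cwnd + 1.0 / cwnd
```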

11 What's Wrong with QoS?
Elaborate mechanisms exist:
–IntServ (RSVP), DiffServ, bandwidth brokers, …
… but they are a nightmare to use:
–Application knowledge, many parameters, sensitivity, …
What do we need?
–Simple, intuitive parameters: e.g., streaming or not, normal vs. premium.
–Automatic estimation of bandwidth needs.
–An application focus, not a flow focus!
QoS is relevant primarily under congestion.
So: fix TCP congestion control, and use IP QoS sparingly.

12 TCP Congestion Control Enhancements
1) Collective control of all flows of an application
–Applicable to both TCP and UDP.
–Ensures proportional fairness among multiple inter-related flows.
–Connections are tagged to identify related flows.
2) Packet loss is highly undesirable in a DC
–Move toward a delay-based TCP variant (see the sketch below).
3) Multilevel coordination
–Socket vs. RDMA apps, TCP vs. UDP, …
–A layer above the transport for coordination.
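To make item 2 concrete, here is a minimal sketch of a Vegas-style delay-based window update that backs off when RTT inflation signals queue buildup, before any loss occurs. The thresholds, names, and units are illustrative assumptions, not a committed design:

```python
# Minimal sketch of a TCP Vegas-style delay-based window update.
# Thresholds and names are illustrative assumptions.

ALPHA = 2.0   # queued segments below which the window grows
BETA = 4.0    # queued segments above which the window shrinks

def update_cwnd(cwnd: float, base_rtt: float, rtt: float) -> float:
    """Adjust cwnd (segments) from RTT inflation instead of loss.

    base_rtt -- smallest RTT seen (propagation-delay estimate)
    rtt      -- latest smoothed RTT sample
    """
    expected = cwnd / base_rtt               # rate with empty queues
    actual = cwnd / rtt                      # rate actually achieved
    queued = (expected - actual) * base_rtt  # segments sitting in queues

    if queued < ALPHA:
        cwnd += 1.0    # queues nearly empty: probe for more bandwidth
    elif queued > BETA:
        cwnd -= 1.0    # queues building: back off before any drop
    return max(cwnd, 1.0)
```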

13 Collective Congestion Control
Connections through a congested device are controlled together (as a "control set").
Determining the control set is challenging.
Bandwidth requirements are estimated automatically during non-congested periods.
[Diagram: clients CL1 and CL2 reaching servers S11-S13 and S21-S23 through switches SW0-SW2, with congestion control applied across the shared switch.]
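A minimal sketch of the allocation step, assuming the control set and per-connection demands (estimated while uncongested) are already known; the data structures and the 6 Mb/s capacity figure are illustrative assumptions:

```python
# Sketch of collective rate allocation: every connection in a control
# set (all flows through one congested device) is scaled together, in
# proportion to demand estimated during uncongested periods.

def allocate(control_set: dict[str, float], capacity: float) -> dict[str, float]:
    """control_set maps connection id -> estimated demand (Mb/s);
    capacity is what the congested device can currently sustain."""
    total_demand = sum(control_set.values())
    if total_demand <= capacity:
        return dict(control_set)         # no congestion: all demands met
    scale = capacity / total_demand      # proportional-fair scaling
    return {cid: d * scale for cid, d in control_set.items()}

# With the demands from the next slide and an assumed 6 Mb/s bottleneck,
# the 2:1 ratio between the two apps is preserved under congestion:
print(allocate({"app1": 5.0, "app2": 2.5}, capacity=6.0))
# -> {'app1': 4.0, 'app2': 2.0}
```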

14 Sample Collective Control Scenario
App 1: client1 to server1.
–Database queries over a single connection; drives ~5.0 Mb/s.
App 2: client2 to server1.
–Similar to App 1; drives ~2.5 Mb/s.
App 3: client3 to server2.
–FTP over 25 connections, starting at t = 30 s; drives 8 Mb/s.

15 Sample Results
Collective control is highly desirable within a DC.
Modified TCP maintains the 2:1 throughput ratio between App 1 and App 2.
–It also yields lower losses and smaller RTTs.
[Results chart not reproduced.]

16 Adaptation to Media
Problem: TCP assumes loss implies congestion, and was designed for the WAN (high loss and delay).
Effects:
–Wireless (e.g., UWB) is attractive in the DC (wiring reduction, mobility, self-configuration) …
–… but TCP is not a suitable transport for it.
–TCP is also overkill for communication within a DC.
Solution: a self-adjusting transport.
–Supports multiple congestion/flow-control regimes.
–The regime is selected automatically during connection setup (sketched below).
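A minimal sketch of regime selection at connection setup; the regime names and the path hints driving the choice are illustrative assumptions:

```python
# Sketch: the transport picks a congestion/flow-control regime per
# connection from simple path hints.  Names are illustrative assumptions.

from enum import Enum

class CCRegime(Enum):
    LOSS_BASED = "loss"    # classic WAN TCP: loss means congestion
    DELAY_BASED = "delay"  # intra-DC: react to RTT inflation before loss
    LINK_AWARE = "link"    # wireless (e.g., UWB): loss may be corruption

def select_regime(intra_dc: bool, wireless_hop: bool) -> CCRegime:
    if wireless_hop:
        return CCRegime.LINK_AWARE   # don't read radio loss as congestion
    if intra_dc:
        return CCRegime.DELAY_BASED  # low RTT, losses highly undesirable
    return CCRegime.LOSS_BASED       # conservative default for WAN peers
```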

17 High Availability Issues
Problem: a single failure breaks the connection; robustness checks are weak; …
Effect: high availability is difficult to achieve.
Solution:
–Multi-homed connections with load sharing among paths (sketched below).
–Ideally, controlled diversity and path management.
–Difficult: needs topology awareness; the spanning-tree problem; …
[Diagram: endpoints A and B connected by two disjoint paths, Path 1 and Path 2.]
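A minimal sketch of weighted load sharing across a multi-homed connection's paths, where a path failure shifts traffic rather than breaking the connection; the Path structure and health model are illustrative assumptions:

```python
# Sketch of weighted load sharing over a multi-homed connection.
# The Path structure and health flag are illustrative assumptions.

import random
from dataclasses import dataclass

@dataclass
class Path:
    name: str
    weight: float     # share of traffic when healthy
    healthy: bool = True

def pick_path(paths: list[Path]) -> Path:
    """Spread segments over healthy paths in proportion to weight;
    a single path failure shifts load instead of killing the flow."""
    live = [p for p in paths if p.healthy]
    if not live:
        raise ConnectionError("all paths down")
    r = random.uniform(0, sum(p.weight for p in live))
    for p in live:
        r -= p.weight
        if r <= 0:
            return p
    return live[-1]   # guard against floating-point rounding
```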

18 Summary and Call to Action
Data centers are evolving.
–The transport must evolve too, but that is a difficult proposition.
–TCP is heavily entrenched; change needs an industry-wide effort.
Call to action:
–Get an industry effort going to define the new features, their implementation, and the deployment and compatibility issues.
–Change will need a push from data center administrators and planners.

19 Additional Resources
The presentation can be downloaded from the IDF web site; when prompted enter:
–Username: idf
–Password: fall2005
Additional backup slides follow.
Several relevant papers are available at http://kkant.ccwebhost.com/download.html:
–Analysis of collective bandwidth control.
–SCTP performance in data centers.

20 Backup

21 Comparative Fabric Features

DC requirement                      | TCP       | SCTP      | IBA
Scalability to 100 Gb/s             | Difficult | Difficult | Easy?
Message-based & ULP support         | No        | Yes       | Yes
QoS-friendly transport              | No        | No        | Yes
Virtual channel support             | No        | No        | Yes
DC-centric flow/congestion control  | No        | No        | Yes
Point-to-multipoint communication   | No        | No        | Yes
High-availability features          | Poor      | Fair      | Good
Offload latency (endpoint only)     | ~1 us     | >1 us     | <0.5 us
Compatible with TCP/IP base         | Yes       | Limited   | No
Unordered data delivery             | No        | Yes       | Yes
Protection against DoS attacks      | Poor      | Good      | Poor
Multiple traffic streams            | No        | Yes       | Yes

TCP lacks many desirable features; SCTP has some.

22 Transport-Layer QoS
Needed at multiple levels:
–Between transport users (inter-app).
–Between connections of a given transport (intra-app).
–Between logical streams (intra-connection).
[Diagram: a DB app (control + data), iSCSI, network IPC, and a web app (text, images, page), possibly split across two VMs on the same physical machine.]
What is the best bandwidth subdivision to maximize performance? (See the sketch below.)
Requirements:
–Must be compatible with lower-level QoS (PCI Express, MAC, etc.).
–Automatic estimation of bandwidth requirements.
–Automatic bandwidth control.
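One way to frame the subdivision question is to split capacity by weight at each level of the hierarchy. The recursive sketch and the weight tree below are illustrative assumptions, not the slide's own scheme:

```python
# Sketch of hierarchical bandwidth subdivision: capacity is split by
# weight at each level (inter-app, intra-app, intra-connection).
# The weight tree is an illustrative assumption.

def subdivide(capacity: float, weights: dict) -> dict:
    """weights: name -> number (leaf) or nested dict (another level)."""
    def total(w):
        return w if isinstance(w, (int, float)) else sum(map(total, w.values()))
    grand = sum(total(w) for w in weights.values())
    return {name: capacity * total(w) / grand if isinstance(w, (int, float))
                  else subdivide(capacity * total(w) / grand, w)
            for name, w in weights.items()}

# Example: a 10 Mb/s link shared 3:2 by iSCSI and a web app, with the
# web app's share split again between text and images.
print(subdivide(10.0, {"iSCSI": 3, "web": {"text": 1, "images": 1}}))
# -> {'iSCSI': 6.0, 'web': {'text': 2.0, 'images': 2.0}}
```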

23 Multicasting in the DC
Software/patch distribution:
–Multicast to all machines with the same version.
–Characteristics: medium-to-large file transfers; time to finish matters, bandwidth doesn't; scale of 10s to 1000s of nodes.
High-performance computing:
–MPI collectives need multicasting.
–Characteristics: small but frequent transfers; latency at a premium, bandwidth mostly not an issue; scale of 10s to 100s of nodes.

24 Transport-Layer Multicasting

DC need                     | IP multicasting                     | TL multicasting
Legacy infrastructure       | Needs specialized routers           | Standard routers adequate
Short msgs, dynamic groups  | Usually designed for long transfers | Appropriate mechanism?
Topology awareness          | Yes (routing-algorithm based)       | No (needs new mechanisms)
Low overhead                | No (complex management)             | Simpler, done in the TL engine
Low latency                 | Primarily BW-focused                | Needs a latency-centric design
Reliable multicast          | Built on top                        | Part of the TL

[Diagram: two subnets behind an outer router, drawn once for IP multicasting and once for TL multicasting.]

25 TL Multicasting Value
Assumptions:
–A 16-node cluster with 4-node subclusters.
–Multicast group: 2 nodes in each subcluster.
–Latencies: endpoint 2 us, ack processing 1 us, switch 1 us, app-TL interface 5 us.
Latency without multicast:
–send: 7x2 + 3x1 + 2 = 19 us
–ack: 1 + 3x1 + 7x1 = 11 us
–reply: 5 + 2 + 7x2 = 21 us
–Total: 19 + 11 + 21 = 51 us
Latency with multicast:
–send: 3x2 + 3x1 + 2 + 2x(1+1) + 2 = 17 us
–ack: 1 + 1 + 2x1 + 3x1 + 3x1 = 10 us
–Total: 17 + 10 + 5 = 32 us
Even larger savings for a full-network multicast.
[Diagram: four subnets behind an outer router, with multicast targets A-D.]

26 Hierarchical Connections
Choose a leader in each subnet (topology-directed).
Multicast connections reach the other nodes via the leaders:
–Ack consolidation at the leaders in the multicast direction (sketched below).
–Message consolidation at the leaders in the reverse-multicast direction.
Done by a layer above the transport? (Layer 4.5?)
[Diagram: sender A multicasting to four subnets, each with a leader (S2-S4) and member nodes n1, n2, behind an outer router.]
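A minimal sketch of ack consolidation at a subnet leader, so the sender sees one ack per subnet instead of one per node; class and method names are illustrative assumptions:

```python
# Sketch of ack consolidation at a subnet leader.  The leader tracks
# which subnet members still owe an ack for each multicast sequence
# number, and acks upstream only once all of them have answered.

class SubnetLeader:
    def __init__(self, members: set[str]):
        self.members = members
        self.pending: dict[int, set[str]] = {}  # seq -> members still owed

    def on_multicast(self, seq: int) -> None:
        # Forwarding to the subnet is not shown; just start tracking acks.
        self.pending[seq] = set(self.members)

    def on_member_ack(self, seq: int, node: str) -> bool:
        """Return True when the consolidated ack should go to the sender."""
        waiting = self.pending.get(seq)
        if waiting is None:
            return False          # duplicate or unknown ack: ignore
        waiting.discard(node)
        if not waiting:
            del self.pending[seq]
            return True           # every member has acked this seq
        return False
```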

