1 Transport protocols (Xuan Zheng, Univ. of Virginia)
Outline:
- Features of a transport protocol suitable for the CHEETAH solution (e2e circuit + TCP path)
- ST review
- ST usage on end-to-end circuits
- Comparison with other transport protocols
- Demos

2 Transport on e2e circuit + TCP path
[Diagram: a PC with two NICs. NIC I connects through an Ethernet switch to the TCP/IP path (FTP, TCP, control and signaling); NIC II connects through an Ethernet/OC3 crossconnect circuit carrying ST.]
The FTP session starts on a TCP connection through NIC I; the large file transfer runs through NIC II using the ST transport protocol.

3 Features of a transport protocol suitable for e2e circuits + TCP path
- Congestion control: not required on the user plane.
  - Reason: congestion control is "preventive", not reactive; contention for resources is resolved during circuit setup.
- Flow control: of the three options (ON/OFF, window based, rate based), rate based is the best answer.
  - Determine the rate at which data can be removed from the receive buffer, and set the circuit rate and sending rate to match this receive rate (see the sketch below).
  - ON/OFF and window-based schemes leave open the possibility of the circuit lying idle; idle circuits should be avoided for utilization reasons.
- Error control:
  - A negative-ACK-based scheme is possible because of in-sequence delivery.
  - Negative ACKs are more desirable, to limit positive ACKs on the TCP path.
  - A positive ACK is still needed at the end to confirm completion of the transfer.
  - Retransmissions in the middle of the transfer can be sent on the circuit, but retransmissions needed toward the "end" should be sent on the TCP path.
  - Selective repeat.
- Goal: high end-to-end throughput (implementation optimizations).
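To make the rate-matching bullet concrete, here is a minimal sketch in C (with hypothetical numbers, not taken from the slides) that sets the circuit and sending rate to the receiver's measured drain rate:

```c
/* rate_match.c: pick a sending rate that matches the receiver's drain rate.
 * Minimal sketch with hypothetical numbers; a real CHEETAH receiver would
 * measure how fast the application empties its receive buffer. */
#include <stdio.h>

int main(void) {
    double link_rate_bps  = 1e9;   /* circuit (link) rate: 1 Gbps           */
    double recv_drain_bps = 360e6; /* measured rate at which the receiving
                                      application drains its buffer         */
    /* Rate-based flow control: never send faster than the receiver drains,
       so the circuit neither sits idle nor overflows the receive buffer. */
    double send_rate_bps = recv_drain_bps < link_rate_bps ? recv_drain_bps
                                                          : link_rate_bps;
    printf("circuit/sending rate = %.0f Mbps\n", send_rate_bps / 1e6);
    return 0;
}
```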

4 Scheduled Transfer Protocol (ANSI standard)
Objective: enable very high bandwidth and low latency transfers across networks.
Features:
- Hierarchy of data units: VC, transfer, blocks, STU
- Connection-oriented Virtual Connection (VC)
- OS-bypass implementation possible (a network extension of DMA)
- Flexible end-to-end flow control and error control
- Independent of the Lower Layer Protocol (LLP)
The Scheduled Transfer (ST) protocol is an ANSI-standard protocol designed to achieve very high bandwidth and low latency transfers across networks. Its features: 1) ST is independent of the LLP, which may be ATM, Ethernet, or Fibre Channel. 2) Data units form a hierarchy: VC, transfer, blocks, STU. 3) The VC is a bi-directional logical connection used for ST between two devices. 4) ST offers flexibility in its error control and flow control schemes. 5) The key distinguishing feature of ST is its OS-bypass implementation, which delivers low latency and high bandwidth between communicating applications; it reduces inter-application latency across the network by an order of magnitude compared with a heavily optimized TCP stack. 6) ST has neither congestion control nor dynamic routing capabilities.

5 System overview
[Diagram: local and remote end hosts connected across interconnect networks; ST runs above the Lower Layer on each host; the Source and Destination each have a Control Channel and Data Channel(s); Request_Connection and Connection_Answer messages set up the VC.]
A Virtual Connection is a bi-directional path between two end devices and is set up before any data are transmitted. Either end device can be the Initiator of a VC setup operation; the other end device is the Responder. The Initiator issues a Request_Connection to establish a VC. The Responder issues a Connection_Answer upon receipt of a Request_Connection, and may reject the request. If accepted, a VC has been established for use with subsequent operations. Many parameters are negotiated in the VC establishment phase to inform the remote end of capabilities and preferences: for example, buffer size, maximum STU size, and upper layer protocol port number. Once established, a VC contains a logical control channel and one or more logical data channels in each direction; the control channel carries control messages and the data channels carry the data payload.
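As an illustration only, the parameters negotiated at VC setup might be grouped as below; the field names are hypothetical, not the ANSI wire format:

```c
/* vc_params.c: illustrative grouping of parameters negotiated at VC setup
 * via Request_Connection / Connection_Answer. Field names are hypothetical;
 * the ANSI ST spec defines the actual message format. */
#include <stdint.h>
#include <stdio.h>

struct vc_params {
    uint32_t buffer_size;    /* size of each receive buffer (bytes)   */
    uint32_t max_stu_size;   /* largest STU this end will accept      */
    uint16_t ulp_port;       /* upper layer protocol port number      */
    uint16_t num_data_chans; /* data channels per direction (>= 1)    */
};

int main(void) {
    struct vc_params p = { 1 << 20, 1408, 3000, 1 };  /* example values */
    printf("negotiated: buf=%u stu=%u port=%u chans=%u\n",
           p.buffer_size, p.max_stu_size, p.ulp_port, p.num_data_chans);
    return 0;
}
```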

6 System overview
[Diagram: same layout as the previous slide: ST over the Lower Layer on the local and remote end hosts, with Source/Destination Control Channel and Data Channel(s) across the interconnect networks.]

7 Data Hierarchy
[Diagram: a Transfer is split into Blocks 0 through N; each Block is split into STUs 0 through M; each STU is carried in an LLP packet as a 40-byte Scheduled Header followed by the data payload. 2^8 ≤ Block size ≤ 2^48 bytes; 2^8 ≤ STU size ≤ 2^32 bytes.]
1) A VC may be used to carry multiple independent Read and Write operations; each Read/Write operation is called a Transfer.
2) A Transfer is composed of one or more Blocks. The Block is the unit of flow control. All of the Blocks within one Transfer have the same size, except that the first and/or last Block of the Transfer can be smaller.
3) Blocks are composed of one or more Scheduled Transfer Units (STUs), the basic unit of transmission in ST. Blocks are segmented into STUs such that no STU crosses a buffer, Transfer, or Block boundary in the Destination.
4) ST then packages the STUs into LLP packets, such as Ethernet frames, and delivers them on the physical media; each packet carries a 40-byte Scheduled Header followed by the data payload.
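A minimal sketch of the segmentation arithmetic; the block and STU sizes below are hypothetical examples within the limits above:

```c
/* hierarchy.c: how a Transfer splits into Blocks and STUs.
 * Example sizes are hypothetical; ST allows 2^8 <= Block size <= 2^48
 * and 2^8 <= STU size <= 2^32 bytes. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t transfer_len = 725667176ULL; /* file size used in the demos   */
    uint64_t block_size   = 8ULL << 20;   /* 8 MB Blocks (flow-control unit) */
    uint64_t stu_size     = 1408;         /* STU payload that fits an
                                             Ethernet frame after the
                                             40-byte Scheduled Header      */
    uint64_t n_blocks = (transfer_len + block_size - 1) / block_size;
    uint64_t stus_per_full_block = (block_size + stu_size - 1) / stu_size;
    printf("%llu Blocks, up to %llu STUs per Block\n",
           (unsigned long long)n_blocks,
           (unsigned long long)stus_per_full_block);
    return 0;
}
```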

8 Support for OS-bypass implementation
Steps:
1) The sender specifies the Transfer length using an RTS message.
2) The receiver pins the memory in user space.
3) The receiver specifies the pinned memory address for each Block using CTSs.
4) The sender computes the correct memory address for each STU of the Block and includes the corresponding memory address in each STU of the Block.
5) The NIC receiving the data unit uses Direct Memory Access (DMA) to write the payload silently into the receiver's memory.
ST achieves low latency by having the communicating hosts reserve memory resources during a "scheduling phase" before the actual transfer. The sender sends a control message specifying the length of a data transfer; the receiver in turn pins memory in user space and replies with a control message that specifies the pinned memory address. The data Source computes the correct Bufx and Offset for each STU and includes a (Bufx, Offset) pair in each STU of the Block sent. When the sender transmits the data, the header of each data unit carries the corresponding memory address, which allows the receiving NIC to use DMA to write the payload silently into the receiver's memory. With the OS-bypass implementation, data is transmitted directly from application to application without OS intervention, eliminating OS system-call overhead such as buffering and processing delays. The result is that the transport protocol adds only a small end-host delay to the total file transfer delay.
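A sketch of the (Bufx, Offset) computation, assuming the Destination pinned consecutive fixed-size buffers; the indexing rule shown here is illustrative, not the spec's exact definition:

```c
/* bufx_offset.c: compute the (Bufx, Offset) pair carried in each STU.
 * Assumes the Destination pinned consecutive fixed-size buffers; this
 * indexing scheme is illustrative, not the spec's exact rule. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t buf_size  = 1ULL << 20; /* each pinned buffer: 1 MB (example) */
    uint64_t stu_size  = 1408;       /* STU payload size (example)        */
    uint64_t block_off = 5ULL << 20; /* byte offset of this Block in the
                                        Transfer, taken from the CTS      */
    for (uint64_t i = 0; i < 4; i++) {            /* first 4 STUs          */
        uint64_t byte   = block_off + i * stu_size;
        uint64_t bufx   = byte / buf_size;        /* which pinned buffer   */
        uint64_t offset = byte % buf_size;        /* where inside it       */
        printf("STU %llu -> Bufx %llu, Offset %llu\n",
               (unsigned long long)i, (unsigned long long)bufx,
               (unsigned long long)offset);
    }
    return 0;
}
```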

9 OS-bypass implementation
[Diagram: the Destination holds pinned Buffers A, B, and C (buffer size of L bytes). Messages exchanged: RTS from the Source: Transfer length L bytes. CTS from the Destination: Block X, Block size N bytes, Buffer index A, Offset M bytes. Data from the Source: Block X, Buffer index A, Offset M, whose N-byte payload lands at an offset of M bytes within Buffer A.]

10 Data movement - Write
[Sequence diagram: Initiator sends Request_To_Send to request a Write; Responder may reply Request_Answer if rejected; Responder sends Clear_To_Send to enable a Block (or multiple CTSs to enable multiple Blocks); Initiator sends Data, each Block sent as one or more STUs; Responder sends Request_State_Response with state information, if requested in Data.]
A Write sequence moves a Transfer, which contains one or more Blocks, from an Initiator to a Responder:
- The RTS is issued by the Initiator to request memory space in the Responder for a data Transfer. It specifies the Transfer length, the sequence number, the maximum number of outstanding Blocks the Initiator can support, the maximum Block size, etc. The Responder can reject the RTS with a Request_Answer.
- After the Responder pins the memory space for the data Transfer, it issues one or more CTS messages to tell the Initiator that it is ready to receive subsequent Data operations, one CTS for each Block within a Transfer. The CTS carries options such as the Block size, the memory address, and the sequence number.
- The Initiator sends the data payload to the Responder using Data packets, one Data packet per STU. Each carries the Block number, STU number, memory address, etc. The Responder places the STU data in the pre-allocated memory area pointed to by the memory-address parameters in the STU.
- The Initiator can request the transfer-state information of the Responder to determine the status of a Transfer, either with a Request_State message or by setting an indication in the Data packets. The Responder issues an RSR to answer the request. The requested information can include the highest Block received correctly, or whether the Block specified in the Data packet was received correctly.
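The shape of the Initiator's Write sequence, as a runnable sketch; the message helpers below are dummy stand-ins so the program runs, not a real ST implementation:

```c
/* write_loop.c: shape of the Initiator's Write sequence.
 * The helpers are dummy stand-ins so this sketch runs; a real ST stack
 * would put actual RTS/CTS/Data/RSR messages on the wire. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct cts { uint32_t block, bufx, size; uint64_t offset; };

static uint32_t blocks_left = 3;                      /* pretend: 3 Blocks */
static bool send_rts(uint64_t len) {
    printf("RTS len=%llu\n", (unsigned long long)len);
    return true;                                      /* not rejected      */
}
static bool recv_cts(struct cts *c) {                 /* one CTS per Block */
    if (!blocks_left) return false;
    c->block = 3 - blocks_left--;
    c->bufx = c->block; c->offset = 0; c->size = 8 << 20;
    return true;
}
static void send_block(const struct cts *c) {         /* Block -> STUs     */
    printf("Data: block %u -> bufx %u\n", c->block, c->bufx);
}

int main(void) {
    if (!send_rts(24ULL << 20)) return 1; /* Responder could have rejected  */
    struct cts c;
    while (recv_cts(&c)) send_block(&c);  /* each CTS enables one Block     */
    printf("RSR: transfer complete\n");   /* final positive confirmation    */
    return 0;
}
```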

11 Error control
- A Block is the basic unit of error control; if an error within a Block is detected, the whole Block is retransmitted.
- Error detection (bit errors and packet losses):
  - 16-bit checksum
  - Timeout, if the STU carried an indication requesting an ACK
- Error correction: ARQ; both go-back-N and selective repeat options are supported.
A Block is the basic unit of error control. Blocks can be delivered out of order, but out-of-order STU delivery is not supported by ST. If an error (corruption or loss) is detected, the whole Block is retransmitted. Errors are detected using checksums and timeouts: if a data unit carries an indication requesting a Request_State_Response, timeouts are used to detect packet losses. Error correction uses Automatic Repeat reQuest (ARQ): the receiver reissues the CTS message to request a retransmission, and may either request retransmission of only the flawed or missing Block (selective repeat) or go back N by requesting retransmission of all Blocks since the error.
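ST specifies a 16-bit checksum; the classic Internet ones'-complement checksum below is shown purely as an illustration of per-packet error detection (ST's exact algorithm may differ):

```c
/* checksum16.c: classic 16-bit ones'-complement checksum, shown as an
 * illustration of per-STU error detection; ST specifies a 16-bit
 * checksum, but its exact algorithm may differ from this one. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint16_t checksum16(const uint8_t *p, size_t n) {
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < n; i += 2)
        sum += (uint32_t)p[i] << 8 | p[i + 1];
    if (n & 1) sum += (uint32_t)p[n - 1] << 8;            /* pad odd byte */
    while (sum >> 16) sum = (sum & 0xffff) + (sum >> 16); /* fold carries */
    return (uint16_t)~sum;
}

int main(void) {
    const char *stu = "example STU payload";
    printf("checksum = 0x%04x\n",
           checksum16((const uint8_t *)stu, strlen(stu)));
    return 0;
}
```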

12 Flow control
- Flow control is achieved with CTS messages.
- The RTS carries a parameter called CTS_req that specifies the number of outstanding Blocks the sender would like to send back-to-back or concurrently.
- Flow control scheme (see the sketch below):
  - If CTS_req = 1, it is ON/OFF.
  - If CTS_req > 1, it is effectively window based.
Data flow control in Read and Write operations is achieved with Clear_To_Send operations: each Clear_To_Send received gives the data Source permission to send one Block, one time. The data receiver can specify the number of outstanding Blocks it would like to see exposed at any given time for maximum performance. If CTS_req = 1, flow control is ON/OFF; if CTS_req > 1, it is effectively window based.
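A tiny sketch of the credit semantics: each CTS is one Block of credit and CTS_req sets the window; the counts are hypothetical:

```c
/* cts_credit.c: CTS as per-Block credit. CTS_req is the number of Blocks
 * the receiver keeps enabled at once; with CTS_req = 1 this degenerates
 * to ON/OFF, with CTS_req > 1 it behaves like a window. */
#include <stdio.h>

int main(void) {
    int total_blocks = 6, cts_req = 3;         /* example values           */
    int sent = 0, credits = cts_req;           /* receiver pre-issues CTSs */
    while (sent < total_blocks) {
        while (credits > 0 && sent < total_blocks) {
            printf("send block %d\n", sent++); /* one CTS = one Block      */
            credits--;
        }
        if (sent < total_blocks) {
            printf("block consumed -> new CTS\n"); /* receiver refills     */
            credits++;
        }
    }
    return 0;
}
```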

13 ST usage on e2e circuits
[Diagram: local and remote end hosts as before, with ST over the Lower Layer; the Control Channel between Source and Destination runs over the TCP path, while the Data Channel(s) run over the EoS circuit path.]

14 ST usage on e2e circuits
A combination of TCP on the IP path and ST on the Ethernet/EoS circuit is used for the data transfer (server-client mode).
[Diagram: all control messages travel between the Source (server) and Destination (client) control channels over the TCP path; all data payload travels over the unidirectional EoS circuit on the data channel.]

15 ST usage on e2e circuits (cont.)
- An end-to-end TCP connection (the ST control channel) is established for the exchange of control messages.
- The EoS circuit (the ST data channel) is unidirectional, from the server to the client.
- Error control uses negative acknowledgements (NAKs) and selective ARQ.
- Flow control can be rate based (see the sketch below):
  - CTS_req = transfer size / block size
  - CTS sending rate (CTS/s) = link rate (bps) / (block size (bytes) × 8)
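Plugging hypothetical but representative numbers (the demo file size, 8 MB blocks, a 1 Gbps circuit) into these formulas:

```c
/* cts_rate.c: rate-based flow control bookkeeping for the EoS circuit.
 * The formulas are from the slide; the example numbers are hypothetical. */
#include <stdio.h>

int main(void) {
    double transfer_bytes = 725667176.0;       /* demo file size          */
    double block_bytes    = 8.0 * 1024 * 1024; /* 8 MB blocks (example)   */
    double link_bps       = 1e9;               /* 1 Gbps EoS circuit      */

    double cts_req  = transfer_bytes / block_bytes;   /* CTSs for transfer */
    double cts_rate = link_bps / (block_bytes * 8.0); /* CTSs per second   */
    printf("CTS_req ~ %.0f CTSs, issued at %.2f CTS/s\n", cts_req, cts_rate);
    return 0;
}
```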

16 Comparison with other options
|                | ST                                  | SABUL/UDT                 | TSUNAMI                | RBUDP                  |
|----------------|-------------------------------------|---------------------------|------------------------|------------------------|
| Principle      | Designed for supercomputer networks | UDP data + TCP control    | UDP data + TCP control | UDP data + TCP control |
| Error control  | Selective or go-back-N ARQ          | Selective ARQ             | Selective ARQ          | Selective ARQ          |
| Flow control   | All three                           | Rate based + window based | Rate based             | Rate based             |
| Target network | Dedicated circuits                  | IP networks               | IP networks            | IP networks            |
| Implementation | Kernel level                        | Application level         | Application level      | Application level      |
| API            | Yes                                 | Yes                       | No                     | Yes                    |
| Operation mode | Memory-to-memory                    | Memory-to-memory          | Disk-to-disk only      | Memory-to-memory       |

17 Demos
- Transport protocol implementations tested: SABUL, UDT, TSUNAMI, and RBUDP
- Transfer types: memory-to-memory and disk-to-disk
- Link rates: 100 Mbps and 1 Gbps Ethernet links

18 Experiment configuration
PC configuration: Dell Precision WorkStation 650
- Intel Xeon™ processor, 2.4 GHz with 512 KB L2 cache
- 533 MHz system bus
- 512 MB DDR 266 MHz SDRAM memory
- Intel 82545EM 64-bit PCI-X Gigabit (10/100/1000) Ethernet card
- UATA/100 IDE hard drive
Operating system: Red Hat 9 (Linux kernel )

19 SABUL
Parameters:
- packet size (1470 bytes)
- sender's UDP buffer size ( bytes)
- receiver's UDP buffer size ( bytes)
- initial sending rate (300 Mbps default)
- lower limit of the sending rate (0 default)
- upper limit of the sending rate (1 Gbps default)
- block size (8 MB default)
Congestion control is disabled by setting initial rate = upper rate = lower rate = link rate.

20 SABUL - Results
Memory-to-memory:
- 100 Mbps link: throughput > 90 Mbps, no loss
- 1 Gbps link: throughput of 910 Mbps seen, loss rate < 0.5%
Disk-to-disk (725,667,176 bytes):
- 100 Mbps link: throughput > 90 Mbps, loss rate < 0.5%
- 1 Gbps link: throughput of 200 Mbps seen; loss rate depends on the sending rate (the hard disk is the bottleneck)

21 SABUL demos (on GbE link)
- Memory-to-memory: 1 GB file, 1 Gbps sending rate
- Disk-to-disk: sending rate set to 200 Mbps, then to 500 Mbps

22 Experiment to find disk rate
- Functions used: fread() and fwrite()
- Reading rate ≈ 45 MB/s (360 Mbps)
- Writing rate ≈ 38 MB/s (270 Mbps)
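A minimal sketch of such a measurement with fread()/fwrite(); the file path and sizes are placeholders, and the OS page cache can inflate results on a warm run:

```c
/* disk_rate.c: measure sequential fwrite()/fread() rates, in the spirit
 * of the experiment above. File path and sizes are placeholders; the OS
 * page cache can inflate the numbers. */
#include <stdio.h>
#include <time.h>

#define CHUNK (1 << 20)            /* 1 MB per call    */
#define TOTAL (256L * (1 << 20))   /* 256 MB test file */

static double now(void) {          /* wall-clock seconds */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    static char buf[CHUNK];
    FILE *f = fopen("testfile.bin", "wb");
    if (!f) return 1;
    double t0 = now();
    for (long n = 0; n < TOTAL; n += CHUNK) fwrite(buf, 1, CHUNK, f);
    fclose(f);
    double wsec = now() - t0;

    f = fopen("testfile.bin", "rb");
    if (!f) return 1;
    t0 = now();
    while (fread(buf, 1, CHUNK, f) == CHUNK) ;
    fclose(f);
    double rsec = now() - t0;

    printf("write ~ %.1f MB/s, read ~ %.1f MB/s\n",
           TOTAL / 1048576.0 / wsec, TOTAL / 1048576.0 / rsec);
    return 0;
}
```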

23 UDT
Parameters:
- MTU (1500 bytes default)
- maximum flow window size (256 KB default)
- UDT buffer size (4 MB default)
- UDP sending buffer size (65536 bytes default)
- UDP receiving buffer size ( bytes default)
Slow start was disabled in the experiments.

24 UDT - Results
Memory-to-memory:
- 100 Mbps link: throughput > 95 Mbps, no loss
- 1 Gbps link: throughput > 800 Mbps, some loss
Disk-to-disk (725,667,176 bytes):
- 100 Mbps link: throughput > 90 Mbps, loss rate < 0.4%
- 1 Gbps link: throughput > 200 Mbps, loss rate ≈ 15%

25 UDT demos (1 GbE)
- Compared to SABUL, the memory-to-memory rate drops a little: from 910 Mbps to 800 Mbps.
- Disk-to-disk is higher with UDT than with SABUL: 200 Mbps vs. 150 Mbps.

26 Tsunami
Parameters:
- UDP buffer size (20 MB default)
- maximum sending rate (1 Gbps default)
- error rate threshold (7.5% default)
- speedup factor (5/6 default)
- slowdown factor (24/25 default)
Congestion control is disabled by setting a very high error rate threshold.
In our experiments, maximum sending rate = link rate and buffer size = 2 MB.

27 TSUNAMI - Results
Disk-to-disk (725,667,176 bytes):
- 100 Mbps link: throughput > 80 Mbps, loss rate ≈ 10%
- 1 Gbps link: throughput ≈ 130 Mbps, loss rate ≈ 20%
The UDP buffer size has a great impact on both the throughput and the loss rate.

28 Tsunami demo (1 GbE)
- Loss rate of almost 50% with a 1 Gbps sending rate
- Loss rate decreases when a lower sending rate is used (sending rate set to 200 Mbps)
- Buffer size is a critical parameter

29 RBUDP
Arguments:
- packet size (1452 bytes default)
- sending rate = link rate in the experiments
The RBUDP implementation obtained from Starlight is memory-to-memory only; we added code for disk-to-disk transfers.

30 RBUDP - Results
Memory-to-memory:
- 100 Mbps link: throughput > 95 Mbps, no loss
- 1 Gbps link: throughput > 900 Mbps, loss rate ≈ 2%
Disk-to-disk (725,667,176 bytes):
- 100 Mbps link: throughput > 80 Mbps, loss rate < 0.8%
- 1 Gbps link: throughput > 220 Mbps, loss rate ≈ 5%

31 RBUDP demos (1 Gbps)
- Memory-to-memory: same as SABUL (sending rate set to 1 Gbps)
- Disk-to-disk: same as UDT (sending rate set to 200 Mbps)

32 The end

