Anders Magnusson TCP Tuning and E2E Performance TREFpunkt - October 20, 2004
The speed-of-light problem: The sender must store every sent packet until it has received an ACK from the receiver. Because of speed-of-light limitations this can take a while, even in small countries like Sweden. The theoretical RTT Luleå-Stockholm is (1000 km / 300,000 km/s) * 2 = 6.7 ms; in reality it is 20 ms. The TCP window size needed to keep up with 1 Gbit/s must then be (1000 Mbit/s / 8) * 0.02 s = 2.5 Mbyte.
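The arithmetic on this slide can be reproduced with a short sketch. The function names are illustrative, and the propagation speed of 300,000 km/s is the vacuum figure used on the slide (real fiber is slower, which is part of why the measured RTT is higher).

```python
def theoretical_rtt_ms(distance_km, speed_km_per_s=300_000):
    """Round-trip time in milliseconds for a one-way distance in km."""
    return distance_km / speed_km_per_s * 2 * 1000


def window_bytes(bandwidth_bit_per_s, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return bandwidth_bit_per_s / 8 * rtt_s


rtt = theoretical_rtt_ms(1000)      # Luleå-Stockholm, roughly 1000 km
window = window_bytes(1e9, 0.020)   # 1 Gbit/s at the observed 20 ms RTT

print(f"theoretical RTT: {rtt:.1f} ms")
print(f"window needed:   {window / 1e6:.1f} Mbyte")
```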
Operating system buffers: Inside the operating system kernel there are usually a number of different buffers affecting performance. The term "buffers" is somewhat misleading: usually it is just some sort of data structure that is used to reference data in memory (but in theory it could just as well be real buffers).
TCP window buffers: The TCP window sizes can be adjusted on virtually all operating systems. There are two windows, send and receive. The window size for one direction of flow is set to MIN(sender's send window, receiver's receive window). The send window must be large enough to hold all segments sent during one RTT.
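As a rough illustration of why both windows matter, the sketch below takes the MIN rule from this slide and derives the throughput ceiling of one window per RTT. The function names and the example values (a 2.5 Mbyte send window against an unscaled 65,535-byte receive window at 20 ms RTT) are assumptions chosen for illustration.

```python
def effective_window(send_window, recv_window):
    """Window for one direction of flow: the smaller of the two sides."""
    return min(send_window, recv_window)


def max_throughput_mbit(window_bytes, rtt_s):
    """Upper bound on throughput: at most one window can be sent per RTT."""
    return window_bytes * 8 / rtt_s / 1e6


win = effective_window(send_window=2_500_000, recv_window=65_535)
rate = max_throughput_mbit(win, 0.020)

print(f"effective window: {win} bytes")
print(f"throughput cap:   {rate:.1f} Mbit/s")
```

With these numbers the small receive window caps the flow at roughly 26 Mbit/s, no matter how large the send window is.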
Socket buffers: A socket buffer limits the amount of data an application may write to the kernel before being blocked. It is often combined with the TCP send window; when ACKs are received, the socket buffer data is adjusted accordingly. It must be >= the TCP window size so that it does not become the limiting factor.
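An application can also ask for larger socket buffers itself. The following is a minimal sketch using the standard SO_SNDBUF and SO_RCVBUF socket options; the helper name `request_buffer` and the 2.5 Mbyte request are illustrative assumptions. What the kernel actually grants depends on the system-wide maxima, and some systems report a different value than the one requested.

```python
import socket

WINDOW = 2_500_000  # bytes; hypothetical target taken from the 1 Gbit/s, 20 ms example


def request_buffer(sock, option, size):
    """Ask for a socket buffer of `size` bytes and return what the kernel reports."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, option, size)
    except OSError:
        # Some systems reject a request above the system maximum instead of clamping it.
        pass
    return sock.getsockopt(socket.SOL_SOCKET, option)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sndbuf = request_buffer(sock, socket.SO_SNDBUF, WINDOW)
    rcvbuf = request_buffer(sock, socket.SO_RCVBUF, WINDOW)
finally:
    sock.close()

print(f"requested {WINDOW} bytes, send buffer is {sndbuf}, receive buffer is {rcvbuf}")
```

If the reported sizes are smaller than the request, the system-wide limits are the ones to raise.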
MBUF clusters: There are limits on how many network buffers (called MBUFs in many OSes) may be allocated. MBUFs may have external storage associated with them, allocated out of a separate (limited) area. These buffers are often allocated at compile time, and it is not uncommon that physical memory is statically allocated for them.
Other knobs to turn. RFC1323: turns on the window scaling option needed to use TCP windows larger than 64k. Initial window size: avoid slow-start by injecting many packets into the network at connection startup. Interface queues: be able to store the packets that are ready to send until the network interface can transmit them.
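To make the 64k limit concrete: the TCP header carries a 16-bit window field, and RFC 1323 window scaling shifts that value left by up to 14 bits. The sketch below finds the smallest shift that can advertise a given window; the function name and the example sizes are illustrative assumptions.

```python
MAX_UNSCALED_WINDOW = 65_535  # largest value of the 16-bit window field
MAX_SHIFT = 14                # largest shift count RFC 1323 permits


def window_scale_shift(window_bytes):
    """Smallest window-scale shift able to advertise `window_bytes`."""
    for shift in range(MAX_SHIFT + 1):
        if (MAX_UNSCALED_WINDOW << shift) >= window_bytes:
            return shift
    raise ValueError("window too large for RFC 1323 window scaling")


shifts = {size: window_scale_shift(size) for size in (65_535, 2_500_000, 25_000_000)}
for size, shift in shifts.items():
    print(f"{size:>10} bytes -> shift {shift}")
```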
Problems often seen: Packet loss. On a long-distance high-speed connection, packet loss in a TCP flow will reduce the speed significantly. If the sender enters congestion avoidance, the congestion window will open linearly, and with large windows this will be really slow. With an RTT of 185 ms and a window size of 25 MB it will take around 50 minutes to reach full speed.
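A back-of-the-envelope sketch of the linear growth, under stated assumptions: 25 MB is taken as 25,000,000 bytes, the segment size is assumed to be 1460 bytes (the slide does not give one), and the congestion window grows by one segment per RTT. The function name and the `start_fraction` parameter are illustrative.

```python
def linear_growth_minutes(window_bytes, rtt_s, mss_bytes=1460, start_fraction=0.0):
    """Minutes for the congestion window to grow linearly up to `window_bytes`.

    Assumes growth of one segment per RTT, starting from `start_fraction`
    of the target window.
    """
    segments = window_bytes / mss_bytes
    rtts_needed = segments * (1 - start_fraction)
    return rtts_needed * rtt_s / 60


full = linear_growth_minutes(25_000_000, 0.185)
half = linear_growth_minutes(25_000_000, 0.185, start_fraction=0.5)

print(f"opening the whole window linearly: {full:.0f} minutes")
print(f"recovering from a halved window:   {half:.0f} minutes")
```

Under these assumptions, opening the whole window takes about 53 minutes, in line with the figure on the slide; recovering from a window that was only halved takes about half of that.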
Problems often seen: Packet bursts. During the startup of a TCP bulk flow, the exponential increase in packet injection into the network during slow-start may cause packet bursts on links with a large bandwidth-delay product. The result may be that intermediate switches/routers must drop packets, even though TCP self-clocking would not permit more packets to be sent than could be received.
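The exponential growth can be illustrated with an idealized model in which the congestion window doubles every RTT. This is a simplification (it ignores delayed ACKs and any limit on the initial window), and the 1460-byte segment size, the 25,000,000-byte target window, and the function name are assumptions made for the example.

```python
import math


def slow_start_rounds(window_bytes, mss_bytes=1460, initial_segments=1):
    """Segments sent in each RTT while doubling until the target window is reached."""
    target = math.ceil(window_bytes / mss_bytes)
    cwnd = initial_segments
    per_round = [cwnd]
    while cwnd < target:
        cwnd *= 2
        per_round.append(cwnd)
    return per_round


bursts = slow_start_rounds(25_000_000)
rounds = len(bursts) - 1  # number of doublings

print(f"doublings needed: {rounds}")
print(f"last two rounds:  {bursts[-2]} then {bursts[-1]} segments")
```

In this model the final doubling alone adds over sixteen thousand segments within a single RTT, which is the kind of burst that can overflow queues in intermediate equipment.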
Problems often seen: ACK/window updates. The traditional approach for bulk flows is for the receiver to send an ACK for every second received packet. Window updates are sent as soon as data is delivered to the receiving process. Together this causes the return traffic to be more than half the number of transmitted packets. Interrupts and packet handling in the sending host may use a significant amount of CPU.
Problems often seen: ARP timeouts. When an ARP entry times out, it is usually just removed from the ARP cache, and the next packet will initiate a new ARP request. If there is an ongoing packet flow, this approach may cause packets to be dropped until an ARP reply is received.
Tuning of NetBSD
sysctl -w net.inet.tcp.rfc1323=1
    Activate window scaling and timestamp options according to RFC 1323.
sysctl -w kern.somaxkva=[sbmax]
    Set maximum size for all socket buffers together in the system.
sysctl -w kern.sbmax=[sbmax]
    Set maximum size of socket buffer for one TCP flow.
sysctl -w net.inet.tcp.recvspace=[wstd]
sysctl -w net.inet.tcp.sendspace=[wstd]
    Set the default size of the TCP windows.
sysctl kern.mbuf.nmbclusters
    View maximum number of mbuf clusters. Used for storage of data packets to/from the network interface. Can only be set by recompiling your kernel.
Tuning of FreeBSD
sysctl net.inet.tcp.rfc1323=1
    Activate window scaling and timestamp options according to RFC 1323.
sysctl kern.ipc.maxsockbuf=[sbmax]
    Set maximum socket buffer size, which caps the size of the TCP window.
sysctl net.inet.tcp.recvspace=[wstd]
sysctl net.inet.tcp.sendspace=[wstd]
    Set the default size of the TCP windows.
sysctl kern.ipc.nmbclusters
    View maximum number of mbuf clusters. Used for storage of data packets to/from the network interface. Can only be set at boot time.
Tuning of Linux
echo "1" > /proc/sys/net/ipv4/tcp_window_scaling
    Activate window scaling according to RFC 1323.
echo [wmax] > /proc/sys/net/core/rmem_max
echo [wmax] > /proc/sys/net/core/wmem_max
    Set maximum size of TCP windows.
echo [wmax] > /proc/sys/net/core/rmem_default
echo [wmax] > /proc/sys/net/core/wmem_default
    Set default size of TCP windows.
echo "[wmin] [wstd] [wmax]" > /proc/sys/net/ipv4/tcp_rmem
echo "[wmin] [wstd] [wmax]" > /proc/sys/net/ipv4/tcp_wmem
    Set min, default, max windows. Used by the autotuning function.
echo "[bmin] [bdef] [bmax]" > /proc/sys/net/ipv4/tcp_mem
    Set maximum total TCP buffer space allocatable (these values are counted in pages, not bytes). Used by the autotuning function.
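As a sketch of how the placeholders might be filled in, the snippet below builds the values for a chosen maximum window and prints the corresponding echo commands without writing anything to /proc. The /proc paths are the ones from this slide; the function name and the `wmin` and `wstd` values are illustrative placeholders, not recommendations.

```python
def linux_tcp_settings(wmax, wstd=87_380, wmin=4_096):
    """Map the /proc paths from the slide to values for a maximum window of `wmax` bytes.

    `wstd` and `wmin` are placeholder values chosen for illustration only.
    """
    return {
        "/proc/sys/net/ipv4/tcp_window_scaling": "1",
        "/proc/sys/net/core/rmem_max": str(wmax),
        "/proc/sys/net/core/wmem_max": str(wmax),
        "/proc/sys/net/ipv4/tcp_rmem": f"{wmin} {wstd} {wmax}",
        "/proc/sys/net/ipv4/tcp_wmem": f"{wmin} {wstd} {wmax}",
    }


settings = linux_tcp_settings(wmax=2_500_000)  # window from the 1 Gbit/s, 20 ms example
for path, value in settings.items():
    print(f'echo "{value}" > {path}')
```

Printing the commands rather than applying them keeps the sketch safe to run as an ordinary user; the output can be reviewed and then executed as root.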
Tuning of Windows (2k, XP, 2k3)
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Tcp1323Opts=1
    Turn on the window scaling option.
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpWindowSize=[wmax]
    Set maximum size of TCP window.
How to set a Land Speed Record. Recipe: really high-quality networks; hardware capable of sending/receiving fast enough; an operating system without foolish bottlenecks; enthusiasts who spend weekends sending an obscene amount of data between Luleå and San Jose.
SUNET Internet Land Speed Record - Network setup. [Diagram: end host in Luleå, Sweden; GigaSunet OC-192 core; Sprintlink OC-192 core; end host in San Jose, CA; links labeled 10GE and OC192.] The network path consists of 42(!) router hops, using paths shared with other users of the networks.
Records submitted September 12: 1 966 080 000 000 bytes in 3648 real seconds = 4310 Mbit/second. 1831 Gbytes in almost exactly an hour. 120 000 packets/second transferred with an MTU of 4470 bytes. The record submitted for the IPv4 single and multiple stream classes is 124.935 Petabit-meters/second (a 78% increase over our previous record).
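The record figures can be cross-checked with a few lines of arithmetic. The byte count, duration, and MTU are taken from this slide and the 28,983 km distance from the fiber-path slide; the variable names, the assumption that every packet is MTU-sized, and the reading of "Gbytes" as 2^30 bytes are illustrative.

```python
bytes_sent = 1_966_080_000_000
seconds = 3648
mtu_bytes = 4470
distance_km = 28_983  # Luleå to San Jose, from the fiber-path slide

rate_bit_s = bytes_sent * 8 / seconds
gbytes = bytes_sent / 2**30                        # reading "Gbytes" as 2^30 bytes
packets_per_s = bytes_sent / seconds / mtu_bytes   # assumes MTU-sized packets
petabit_meters = rate_bit_s * distance_km * 1000 / 1e15

print(f"rate:           {rate_bit_s / 1e6:.0f} Mbit/s")
print(f"volume:         {gbytes:.0f} Gbytes")
print(f"packet rate:    {packets_per_s:.0f} packets/s")
print(f"record metric:  {petabit_meters:.2f} Petabit-meters/s")
```

The results land within about 0.1% of the figures on the slide; the small differences presumably come from rounding in the submitted numbers.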
Compared with others: Compared to the previous record, we can note that we achieved this using: less powerful end hosts; a 200% longer distance; less than half the MTU size (which generates heavier CPU load on the end hosts); the normal GigaSunet and Sprintlink production infrastructures.
Fiber path for the Internet LSR: The distance from Luleå, Sweden to San Jose, CA is approximately 28,983 km (18,013 miles).