ECE544: Communication Networks-II Spring 2015

Transport Protocols
Includes teaching materials from L. Peterson, David Wetherall and Tom Anderson (Univ. of Washington), and Brendan Gregg, Systems Performance: Enterprise and the Cloud

Today’s Lecture
Introduction to transport protocols
UDP
TCP
RTP

The Disconnect
[Figure: protocol stacks of hosts and routers R1, R2, R3 joined by ETH, FDDI and PPP links; applications expect guaranteed service, IP offers best-effort service]
Applications running on hosts need to communicate
Require some guarantees from the underlying layer
Network layer (IP) provides only best-effort communication services
Only between hosts (not applications)

Transport Protocol
[Figure: Host1 and Host8 each run Appl. over TCP/UDP over IP; routers R1, R2, R3 forward IP over ETH, FDDI and PPP links]
Transport protocol
Provides services required by applications using the services provided by the network layer
The transport layer is the lowest layer in the network stack that is an end-to-end protocol

Transport Protocols Applications requirements vs. IP layer limitations
Guarantee message delivery: the network may drop messages.
Deliver messages in the same order they are sent: messages may be reordered in the network and may incur long delays.
Deliver at most one copy of each message: messages may be duplicated in the network.
Support arbitrarily large messages: the network may limit message size.
Support synchronization between sender and receiver.
Allow the receiver to apply flow control to the sender.
Support multiple application processes on each host: the network only supports communication between hosts.
Many more.
Design just a few transport protocols to meet most of the current and future application requirements
Each satisfies the requirements for a class of applications
Many applications => few transport protocols

Most Popular Transport Protocols
User Datagram Protocol (UDP)
    Supports multiple application processes on each host
    Option to check messages for correctness with a checksum
Transmission Control Protocol (TCP)
    Ensures reliable delivery of packets between source and destination processes
    Ensures in-order delivery of packets to destination process
    Other services
Real-time Transport Protocol (RTP)
    Serves real-time multimedia applications
    Moves decision making to the applications
    Runs over UDP
TCP, UDP and RTP satisfy the needs of the most common applications
Applications requiring other functionality usually use UDP as the transport protocol and implement additional features as part of the application

UDP Demultiplexing
Service: Support for multiple processes on each host to communicate
Issue: IP only provides communication between hosts (IP addresses)
Solution: Add port numbers and associate a process with a port number
4-tuple unique connection identifier: [SrcPort, SrcIPAddr, DestPort, DestIPAddr]
[Figure: application processes on two hosts, each above UDP and IP, communicating across the network]
UDP packet format:
    SrcPort (16 bits) | DestPort (16 bits)
    Length (16 bits)  | Checksum (16 bits)
    Payload

UDP Error Detection
Service: Ensure message correctness
Issue: Packet corruption in transit
Solution: Use a checksum. Why isn’t the IP checksum enough?
The checksum covers the UDP header, payload, and pseudo header
Pseudo header: protocol number, source IP address, destination IP address, and UDP length
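A minimal sketch of the checksum computation, assuming IPv4 and the standard 16-bit one's-complement Internet checksum. The function names `internet_checksum` and `udp_checksum` are illustrative, not part of any library.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, then complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def udp_checksum(src_ip: bytes, dst_ip: bytes, udp_segment: bytes) -> int:
    """Checksum over pseudo header + UDP header + payload (IPv4).

    src_ip and dst_ip are 4-byte addresses; the checksum field inside
    udp_segment must be zero when computing the value to transmit.
    """
    # Pseudo header: source, destination, zero byte, protocol 17 (UDP), UDP length.
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, len(udp_segment))
    return internet_checksum(pseudo + udp_segment)
```

A receiver can run the same function over the segment as received (checksum field filled in); a result of 0 means no error was detected. Because the pseudo header includes the IP addresses, a datagram delivered to the wrong host also fails the check.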

UDP Properties and Applications
It is transaction-oriented, suitable for simple query-response protocols such as the Domain Name System (DNS) or Network Time Protocol (NTP).
It provides datagrams, suitable for modeling other protocols such as in IP tunneling, or Remote Procedure Call (RPC) and the Network File System (NFS).
It is simple, suitable for bootstrapping without a full protocol stack, such as the DHCP and Trivial File Transfer Protocol.
It is stateless, suitable for very large numbers of clients, such as in streaming media applications.
The lack of retransmission delays makes it suitable for real-time applications such as VoIP, online games, and many protocols built on top of the Real Time Streaming Protocol.
It works well in unidirectional communication, suitable for broadcast information such as in many kinds of service discovery and shared information such as broadcast time or Routing Information Protocol (RIP).
Numerous key Internet applications use UDP, including: the Domain Name System (DNS), where queries must be fast and only consist of a single request followed by a single reply packet, the Simple Network Management Protocol (SNMP), the Routing Information Protocol (RIP)[2] and the Dynamic Host Configuration Protocol (DHCP).
Voice and video traffic is generally transmitted using UDP. Real-time video and audio streaming protocols are designed to handle occasional lost packets, so only slight degradation in quality occurs, rather than large delays if lost packets were retransmitted.
Because both TCP and UDP run over the same network, many businesses are finding that a recent increase in UDP traffic from these real-time applications is hindering the performance of applications using TCP, such as point of sale, accounting, and database systems. When TCP detects packet loss, it will throttle back its data rate usage. Since both real-time and business applications are important to businesses, developing quality of service solutions is seen as crucial by some.[10]

Datagram Sockets
Applications use datagram sockets to establish host-to-host communications. An application binds a socket to its endpoint of data transmission, which is a combination of an IP address and a service port. A port is a software structure that is identified by the port number, a 16-bit integer value, allowing for port numbers between 0 and 65535. Port 0 is reserved, but is a permissible source port value if the sending process does not expect messages in response.
The Internet Assigned Numbers Authority (IANA) has divided port numbers into three ranges.[3]
Port numbers 0 through 1023 are used for common, well-known services. On Unix-like operating systems, using one of these ports requires superuser operating permission.
Port numbers 1024 through 49151 are the registered ports used for IANA-registered services.
Ports 49152 through 65535 are dynamic ports that are not officially designated for any specific service, and may be used for any purpose. They are also used as ephemeral ports, from which software running on the host may randomly choose a port in order to define itself.[3] In effect, they are used as temporary ports primarily by clients when communicating with servers.
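A short sketch of binding and using datagram sockets with Python's standard `socket` module, assuming the loopback interface is available. The helper name `udp_roundtrip` is made up for illustration.

```python
import socket


def udp_roundtrip(payload: bytes):
    """Send one datagram over loopback; return (data, sender's source port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as server, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        # Bind the receiving endpoint: IP address plus port.
        # Port 0 asks the OS to pick any free port.
        server.bind(("127.0.0.1", 0))
        server.settimeout(2.0)
        server_addr = server.getsockname()

        # The client never calls bind(); the OS assigns it an ephemeral
        # source port on the first send.
        client.sendto(payload, server_addr)

        data, (_, client_port) = server.recvfrom(2048)
        return data, client_port
```

The exact ephemeral range the OS draws from is a system setting and may differ from the IANA dynamic range.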

Transmission Control Protocol (TCP)
First proposed by Vinton Cerf and Robert Kahn, 1974
TCP/IP enabled computers of all sizes, from different vendors, with different OSs, to communicate with each other
Used by 80% of all traffic on the Internet
Reliable, in-order delivery, connection-oriented, byte-stream service

TCP: Connection-oriented
Service: Connection-oriented Application states the destination once Issue: IP is connection-less Solution: TCP maintains the connection state Connection Establishment Connection Termination

Evolution of TCP
1974: TCP described by Vint Cerf and Bob Kahn in IEEE Trans Comm
1975: Three-way handshake, Raymond Tomlinson, in SIGCOMM 75
1982: TCP & IP, RFC 793 & 791
1983: BSD Unix 4.2 supports TCP/IP
1984: Nagle’s algorithm to reduce overhead of small packets; predicts congestion collapse
1986: Congestion collapse observed
1987: Karn’s algorithm to better estimate round-trip time
1988: Van Jacobson’s algorithms: congestion avoidance and congestion control (most implemented in 4.3BSD Tahoe)
1990: 4.3BSD Reno: fast retransmit, delayed ACKs

TCP Through the 1990s
1993: TCP Vegas (Brakmo et al), real congestion avoidance
1994: ECN (Floyd), Explicit Congestion Notification
1994: T/TCP (Braden), Transaction TCP
1996: SACK TCP (Floyd et al), Selective Acknowledgement
1996: Hoe, improving TCP startup
1996: FACK TCP (Mathis et al), extension to SACK

TCP: Packet Format
Flags: SYN, FIN, ACK, RESET, URG, PUSH
Sequence number: sequence number of the first byte of data in the segment. It is an abstract number (more later)
Acknowledgement: next sequence number expected from the sender

Reliable Byte-stream Bidirectional data transfer
Control information (e.g., ACK) piggybacks on data segments in reverse direction

TCP Connection Management
Setup (asymmetric): 3-way handshake
Transfer: sliding window; data and acks in both directions
Teardown (symmetric): 2-way handshake
Client-server model: initiator (client) contacts server; listener (server) responds, provides service

Three-Way Handshake
Opens both directions for transfer
Active participant (client) -> passive participant (server): SYN, SequenceNum = x
Server -> client: SYN + ACK, SequenceNum = y, Acknowledgment = x + 1
Client -> server: ACK, Acknowledgment = y + 1 (+data)

Do we need 3-way handshake?
Allows both sides to allocate state for buffer size, state variables, … calculate estimated RTT, estimated MTU, etc. Helps prevent Duplicates across incarnations Intentional hijacking random nonces => weak form of authentication Short-circuit? Persistent connections in HTTP (keep connection open) Transactional TCP (save seq #, reuse on reopen) But congestion control effects dominate

TCP Transfer Connection is bi-directional acks can carry response data
(client) (server) Seq = x + MSS; Ack = y+1 Seq = x + 2*MSS; Ack = y+1 Seq = y+MSS; Ack = x+2MSS+1 Seq = x + 3*MSS; Ack = y+MSS+1

TCP Connection Teardown
Symmetric: either side can close connection (or RST!) Web server Web browser FIN data, ACK Half-open connection; data can continue to be sent data, ACK FIN ACK Can reclaim connection after 2 MSL ACK Can reclaim connection right away (must be at least 1 MSL after first FIN)

Connection Establishment
Both sender and receiver must be ready before we start the transfer of data Need to agree on a set of parameters e.g., the Maximum Segment Size (MSS) This is (in-band) signaling It sets up state at the endpoints

Connection Establishment
Active participant (client), passive participant (server)
Connection establishment:
    Client -> server: SYN, Seq#=x
    Server -> client: SYN+ACK, Seq#=y, Ack#=x+1
    Client -> server: ACK, Ack#=y+1
Data transport:
    Data+ACK
Server: informs TCP about the listening port (up-call registration)
Client: performs three-way handshake
SYN and ACK flags in the header are used
Initial sequence numbers x and y selected at random
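The sequence-number arithmetic of the handshake can be sketched as follows. This is a toy model, not a TCP implementation: the tuple layout and the function name `three_way_handshake` are assumptions made for illustration.

```python
import random

SEQ_MOD = 2 ** 32  # TCP sequence numbers are 32 bits and wrap around


def three_way_handshake(x=None, y=None):
    """Return the three handshake segments as (sender, flags, seq, ack)."""
    # Initial sequence numbers are picked at random unless supplied.
    x = random.getrandbits(32) if x is None else x  # client's ISN
    y = random.getrandbits(32) if y is None else y  # server's ISN
    return [
        ("client", "SYN", x, None),
        ("server", "SYN+ACK", y, (x + 1) % SEQ_MOD),
        # The SYN consumes one sequence number, so the client continues at x + 1.
        ("client", "ACK", (x + 1) % SEQ_MOD, (y + 1) % SEQ_MOD),
    ]
```

Calling `three_way_handshake(100, 300)` gives acknowledgment 101 in the SYN+ACK and 301 in the final ACK, matching Ack#=x+1 and Ack#=y+1 in the exchange above.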

Connection Termination
FIN FIN-ACK ACK DATA Data write Data ACK Any side can terminate the connection Each side closes its half of the connection independently A connection may be half-opened Can only receive data

What if packets can be delayed?
Solutions? Never reuse an ID? Change IP layer to eliminate packet reordering? Prevent very late delivery? IP routers keep hop count per pkt, discard if exceeded ID’s not reused within delay bound TCP won’t work without some bound on how late packets can arrive! 1 1 Accept! Reject!

TCP Connection Setup, with States
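Active participant (client) starts in SYN_SENT; passive participant (server) starts in LISTEN
Client -> server: SYN, SequenceNum = x (server enters SYN_RCVD)
Server -> client: SYN + ACK, SequenceNum = y, Acknowledgment = x + 1 (client enters ESTABLISHED)
Client -> server: ACK, Acknowledgment = y + 1 (+data) (server enters ESTABLISHED)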

TCP Connection Teardown
Web server Web browser FIN_WAIT_1 FIN CLOSE_WAIT ACK LAST_ACK FIN_WAIT_2 FIN TIME_WAIT ACK … CLOSED CLOSED

The TIME_WAIT State We wait 2MSL (two times the maximum segment lifetime of 60 seconds) before completing the close Why? ACK might have been lost and so FIN will be resent Could interfere with a subsequent connection

TCP Handshake in an Uncooperative Internet
TCP Hijacking
If seq # is predictable, attacker can insert packets into TCP stream
Many implementations of TCP simply bumped previous seq # by 1
Attacker can learn seq # by setting up a connection
Solution: use random initial sequence #’s (weak form of authentication)
[Diagram: malicious attacker, client, server]
Client -> server: SYN, SequenceNum = x
Server -> client: SYN + ACK, y, x + 1
Client -> server: ACK, y + 1
Client -> server: “HTTP get URL”, x + MSS
Attacker -> client: fake web page, y + MSS
Server -> client: web page, y + MSS

TCP Handshake in an Uncooperative Internet
TCP SYN flood
Server maintains state for every open connection
If attacker spoofs source addresses, can cause server to open lots of connections
Eventually, server runs out of memory
[Diagram: malicious attacker sends SYNs with SequenceNum = x, p, q, r, s to the server; the server answers SYN + ACK, y, x + 1]

How can TCP choose segment size?
Pick LAN MTU as segment size? LAN MTU can be larger than WAN MTU E.g., Gigabit Ethernet jumbo frames Pick smallest MTU across all networks in Internet? Most traffic is local! Local file server, web proxy, DNS cache, ... Increases packet processing overhead Discover MTU to each destination? (IP DF bit) Guess?

How do we keep the pipe full?
Unless the bandwidth*delay product is small, stop and wait can’t fill pipe Solution: Send multiple packets without waiting for first to be acked Reliable, unordered delivery: Send new packet after each ack Sender keeps list of unack’ed packets; resends after timeout Receiver same as stop&wait How easy is it to write apps that handle out of order delivery? How easy is it to test those apps?

Sliding Window: Reliable, ordered delivery
Two constraints: Receiver can’t deliver packet to application until all prior packets have arrived Sender must prevent buffer overflow at receiver Solution: sliding window circular buffer at sender and receiver packets in transit <= buffer size advance when sender and receiver agree packets at beginning have been received How big should the window be? bandwidth * round trip delay

Sender/Receiver State
Sender: packets sent and acked (LAR = last ack recvd); packets sent but not yet acked; packets not yet sent (LFS = last frame sent)
Receiver: packets received and acked (NFE = next frame expected); packets received out of order; packets not yet received (LFA = last frame ok)

Sliding window Allows multiple packets, up to the size of the window, to be sent on the network before acknowledgments are received. Provides high throughput even on high-latency networks. The size of the window is advertised by the receiver to indicate how many packets it is willing to receive at that time.

Sliding Window
[Figure: send window over packets 1 to 6 with sent/acked marks and the LAR and LFS pointers; receive window over packets 1 to 6 with recvd/acked marks and the NFE and LFA pointers]

TCP State-Transition Max segment lifetime (MSL): 120 sec (recommended)

What if we lose a packet? Go back N (original TCP)
receiver acks “got up through k” (“cumulative ack”) ok for receiver to buffer out of order packets on timeout, sender restarts from k+1 Selective retransmission (RFC 2018) receiver sends ack for each pkt in window on timeout, resend only missing packet

Can we shortcut timeout?
If packets usually arrive in order, out of order delivery is (probably) a packet loss Negative ack receiver requests missing packet Fast retransmit (TCP) receiver acks with NFE-1 (or selective ack) if sender gets acks that don’t advance NFE, resends missing packet

Sender Algorithm
Send full window, set timeout
On receiving an ack:
    if it increases LAR (last ack received)
        send next packet(s) -- no more than window size outstanding at once
    else (already received this ack)
        if receive multiple acks for LAR, next packet may have been lost; retransmit LAR + 1 (and more if selective ack)
On timeout:
    resend LAR + 1 (first packet not yet acked)

Receiver Algorithm
On packet arrival:
    if packet is the NFE (next frame expected)
        send ack
        increase NFE
        hand any packet(s) below NFE to application
    else if < NFE (packet already seen and acked)
        send ack and discard // Q: why is ack needed?
    else (packet is > NFE, arrived out of order)
        buffer and send ack for NFE - 1
        -- signal sender that NFE might have been lost
        -- and with selective ack: which packets correctly arrived
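The receiver rules above can be sketched as a small class. This is a simplified model under stated assumptions: sequence numbers count whole packets starting at 1, the receive window limit (LFA) is not enforced, and selective acks are left out. The class name `Receiver` is illustrative.

```python
class Receiver:
    """Cumulative-ack receiver; every arrival is answered with NFE - 1."""

    def __init__(self, first_seq=1):
        self.nfe = first_seq   # next frame expected
        self.buffer = {}       # out-of-order packets, keyed by sequence number
        self.delivered = []    # data handed to the application, in order

    def on_packet(self, seq, data):
        """Process one arrival and return the ack number to send."""
        if seq == self.nfe:
            self.delivered.append(data)
            self.nfe += 1
            # Drain buffered packets that are now in order.
            while self.nfe in self.buffer:
                self.delivered.append(self.buffer.pop(self.nfe))
                self.nfe += 1
        elif seq > self.nfe:
            # Out of order: hold the packet until the gap is filled.
            self.buffer[seq] = data
        # seq < NFE is a duplicate and is discarded. The ack is still sent,
        # because the earlier ack may have been lost on the way back.
        return self.nfe - 1
```

Feeding packets in the order 1, 3, 4, 2 yields acks 1, 1, 1, 4: the repeated ack for 1 is the duplicate-ack signal that lets the sender suspect packet 2 was lost.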

Sequence Number Selection
Initial sequence number (ISN) selection: Why not simply chose 0? Must avoid overlap with earlier incarnation Security issues Requirements for ISN selection Must operate correctly Without synchronized clocks Despite node failures

Sequence Number Wrap Around
Protect against SequenceNum wrap around
Sliding window: Seq # space >= 2 x WinSize
For TCP: 2^32 >> 2 x 2^16
Seq # should not wrap around within an MSL (120 sec) period of time
For OC-48 (2.5 Gbps), time until wraparound: 14 sec
TCP extension for protecting against seq # wrapping around: add 32-bit timestamp as optional header
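The wraparound time follows from the size of the sequence space, since each sequence number labels one byte. A quick check of the OC-48 figure, assuming the link is kept fully busy (the function name is illustrative):

```python
def wraparound_seconds(bits_per_second, seq_bits=32):
    """Seconds needed to send 2**seq_bits bytes at the given line rate."""
    return (2 ** seq_bits) * 8 / bits_per_second


# 2.5 Gbps link: about 13.7 s, well below the 120 s MSL.
oc48 = wraparound_seconds(2.5e9)
# 10 Mbps link: about 57 minutes, comfortably above the MSL.
ethernet = wraparound_seconds(10e6)
```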

How do we determine timeouts?
If timeout too small, useless retransmits can lead to congestion collapse (and did in 86) as load increases, longer delays, more timeouts, more retransmissions, more load, longer delays, more timeouts … Dynamic instability! If timeout too big, inefficient wait too long to send missing packet Timeout should be based on actual round trip time (RTT) varies with destination subnet, routing changes, congestion, …

Estimating RTTs Idea: Adapt based on recent past measurements
For each packet, note time sent and time ack received
Compute RTT samples and average recent samples for timeout
EstimatedRTT = α x EstimatedRTT + (1 - α) x SampleRTT
This is an exponentially-weighted moving average (low pass filter) that smooths the samples. Typically, α = 0.8 to 0.9.
Set timeout to a small multiple (2) of the estimate
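A minimal sketch of the moving average. The weight is called `alpha` here, and seeding the estimate with the first sample is an assumption of this sketch rather than something the slide specifies.

```python
def ewma_rtt(samples, alpha=0.875):
    """Return the running EstimatedRTT after each SampleRTT."""
    estimated = samples[0]  # assumption: seed the estimate with the first sample
    history = []
    for sample in samples:
        estimated = alpha * estimated + (1 - alpha) * sample
        history.append(estimated)
    return history


def timeout_from_estimate(estimated_rtt, multiple=2):
    """Timeout as a small multiple of the smoothed estimate."""
    return multiple * estimated_rtt
```

With a large `alpha` a single slow sample moves the estimate only slightly, which is the low-pass behavior described above; the price is that the estimate also adapts slowly when the RTT really changes.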

Keep the Pipe Full
AdvertisedWindow: 2^16 => 64 KB
Big enough to allow the sender to keep the pipe full (assume that the receiver has enough buffer to handle the data)
If RTT = 100 ms:
    Delay x Bandwidth = 122 KB for 10 Mbps link
    Delay x Bandwidth = 1.2 MB for 100 Mbps link (AdvertisedWindow is not large enough)
TCP Extension: Scaling factor option for AdvertisedWindow, e.g., use 16-byte units of data
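The delay x bandwidth figures can be reproduced with a few lines. The helper names are illustrative, and `min_window_shift` assumes the scaling factor is a power of two applied to the 16-bit window field, which is how the window scale option is usually described.

```python
def bandwidth_delay_bytes(bits_per_second, rtt_seconds):
    """Bytes that must be in flight to keep the pipe full."""
    return bits_per_second * rtt_seconds / 8


def min_window_shift(bdp_bytes, base_window=2 ** 16 - 1):
    """Smallest power-of-two scale (as a shift count) that covers the BDP."""
    shift = 0
    while (base_window << shift) < bdp_bytes:
        shift += 1
    return shift


bdp_10m = bandwidth_delay_bytes(10e6, 0.1)    # about 125,000 bytes (122 KB)
bdp_100m = bandwidth_delay_bytes(100e6, 0.1)  # about 1,250,000 bytes (1.2 MB)
```

A shift of 4 corresponds to the "16-byte units" example: the advertised value is multiplied by 16, so the 16-bit field can describe a window of roughly 1 MB.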

TCP Error Control Cumulative ACK: ACK the highest contiguous bytes received Same as studied before Extension: Selective ACK (SACK), ACK additional blocks of received data in TCP optional header Timeout Timer If timeout too soon unnecessarily retransmit → adds load to network If timeout too late Increases latency Limits the throughput.

TCP Timeout Issue: RTT in a wide area network varies substantially
Solution: Adaptive Timeout
Original Algorithm:
    EstimatedRTT = α x EstimatedRTT + (1 - α) x SampleRTT
    Timeout = β x EstimatedRTT (β = 2)
Problem:
    Does not distinguish whether the ACK is for the original transmission or a retransmission (suggestions?)
    Constant β is not good: it assumes constant variance

TCP Timeout Karn/Partridge Algorithm
Whenever TCP retransmits a segment, it stops taking samples of the RTT
Only measure SampleRTT for segments that have been sent only once
Each time TCP retransmits, set the next timeout to be twice the last timeout (relieves congestion)
Jacobson/Karels Algorithm: adaptive variance (uses mean deviation)
    Difference = SampleRTT - EstimatedRTT
    EstimatedRTT = EstimatedRTT + (d x Difference)  (same as in original)
    Deviation = Deviation + d x (|Difference| - Deviation)
    Timeout = m x EstimatedRTT + f x Deviation (default: m = 1 and f = 4)
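Both algorithms can be combined in one small estimator. This is a sketch under assumptions: the gain `d = 0.125`, the initial deviation of 0, and the class and method names are choices made here for illustration; `m` and `f` use the defaults stated above.

```python
class RttEstimator:
    """Jacobson/Karels timeout with Karn/Partridge sampling and backoff."""

    def __init__(self, first_sample, d=0.125, m=1.0, f=4.0):
        self.estimated = first_sample
        self.deviation = 0.0   # assumption: start with no observed deviation
        self.d, self.m, self.f = d, m, f
        self.backoff = 1       # doubled on every retransmission timeout

    def on_sample(self, sample_rtt, retransmitted=False):
        if retransmitted:
            # Karn/Partridge: the ack could belong to either transmission,
            # so the sample is ambiguous and is ignored.
            return
        difference = sample_rtt - self.estimated
        self.estimated += self.d * difference
        self.deviation += self.d * (abs(difference) - self.deviation)
        self.backoff = 1       # a clean sample ends the backoff

    def on_timeout(self):
        self.backoff *= 2      # next timeout is twice the last one

    def timeout(self):
        base = self.m * self.estimated + self.f * self.deviation
        return self.backoff * base
```

Starting from an estimate of 100 ms, a single 200 ms sample gives EstimatedRTT = 112.5, Deviation = 12.5 and Timeout = 162.5 ms; the deviation term is what lets the timeout widen quickly when the RTT becomes erratic.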

Triggering Transmission
When to transmit a segment: small segments subject to large overhead Reach max segment size (MSS): the size of the largest segment TCP can send without causing the local IP to fragment MSS = local MTU – IP & TCP header The sending process explicitly ask the TCP to transmit, “push”

TCP Silly Window Syndrome
Sender has MSS bytes of data to send, but window is closed
ACK arrives with a small window
Sender sends a small segment (high overhead)
Receiver advertises a small window
Sender sends a small segment
Repeat the above
To solve: Nagle’s Algorithm
When the application has data to send:
    If both available data and the window >= MSS
        Send a full segment
    Else
        If there is unACKed data in flight
            Buffer the new data until an ACK arrives
        Else
            Send all the new data now
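The decision rule of Nagle's algorithm is small enough to write out directly. The function name and the returned labels are illustrative only.

```python
def nagle_decision(available_bytes, window_bytes, mss, unacked_in_flight):
    """Decide what to do when the application hands data to TCP."""
    if available_bytes >= mss and window_bytes >= mss:
        return "send_full_segment"
    if unacked_in_flight:
        # Something is still unacknowledged: hold the small write and let
        # more data accumulate until the next ACK arrives.
        return "buffer_until_ack"
    # Nothing in flight: sending a small segment right away costs little.
    return "send_now"
```

The arriving ACK acts as the clock: on a fast network ACKs return quickly and small writes go out almost immediately, while on a slow network they are coalesced into fewer, larger segments.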

TCP Deadlock
Receiver advertises a window size of 0, the sender stops sending data
The window size update from the receiver is lost
To solve it:
    The sender starts the persist timer when AdvertisedWindow = 0
    When the persist timer expires, the sender sends a small packet

Example – Exchange of Packets
Receiver has buffer of size 4 and application doesn’t read
T=1: SEQ=1 sent; ACK=2; WIN=3 returned
T=2: SEQ=2 sent; ACK=3; WIN=2 returned
T=3: SEQ=3 sent
T=4: SEQ=4 sent
T=5: ACK=4; WIN=1
T=6: ACK=5; WIN=0
The sender stalls due to flow control once the advertised window is used up

Example – Buffer at Sender
[Figure: sender buffer holding bytes 1 to 9 at successive times up to T=6, with shading for the acked, sent, and advertised regions]

How does sender know when to resume sending?
If receive window = 0, sender stops no data => no acks => no window updates Sender periodically pings receiver with one byte packet receiver acks with current window size Why not have receiver ping sender?

Should sender be greedy?
Should sender transmit as soon as any space opens in receive window? Silly window syndrome receive window opens a few bytes sender transmits little packet receive window closes Solution (Clark, 1982): sender doesn’t resume sending until window is half open

Should sender be greedy?
App writes a few bytes; send a packet? Don’t want to send a packet for every keystroke If buffered writes >= max segment size if app says “push” (ex: telnet, on carriage return) after timeout (ex: 0.5 sec) Nagle’s algorithm Never send two partial segments; wait for first to be acked, before sending next Self-adaptive: can send lots of tinygrams if network is being responsive But (!) poor interaction with delayed acks (later)

Congestion
[Figure: Source 1, Source 2 and Source 3 send through shared network elements toward Dest 1 and Dest 2]
Even with flow control, packets might not reach the destination
When the network cannot support the sender’s rate, queues at the network elements overflow

Congestion Control vs. Flow Control
Congestion Control: mechanism to prevent the sender from overrunning the capacity of the network (when the network is the bottleneck)
Flow Control: mechanism to prevent the sender from overrunning the capacity of the receiver (when the receiver is the bottleneck)

Misbehaving TCP Receivers
On server side, little incentive to cheat TCP Mostly competing against other flows from same server On client side, high incentive to induce server to send faster How?

Congestion Control: Design Approach
Maintain another window at the sender called CongestionWindow (cwnd) CongestionWindow is the max number of packets allowed in the network Number of unACKed packets at the sender. Key: How to calculate congestion window (cwnd) Various approaches possible TCP estimates it based on observed packet losses Assumes packet loss as indication of congestion Since we don’t know whether the network or the receiver is the bottleneck MaxWindow = MIN(CongestionWindow, AdvertisedWindow) EffectiveWin = MaxWindow – (LastByteSent – LastByteAcked)
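The two window formulas translate directly into code. Clamping the result at zero is an addition made in this sketch (the window can shrink below the amount already in flight); the function name is illustrative.

```python
def effective_window(cwnd, advertised, last_byte_sent, last_byte_acked):
    """Bytes the sender may still put on the wire right now."""
    # The tighter of the network limit and the receiver limit applies.
    max_window = min(cwnd, advertised)
    in_flight = last_byte_sent - last_byte_acked
    # Assumption of this sketch: never report a negative allowance.
    return max(0, max_window - in_flight)
```

For example, with cwnd = 8000, an advertised window of 16000 and 3000 bytes in flight, the sender may transmit 5000 more bytes: here the network, not the receiver, is the limiting factor.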

Congestion Avoidance
Congestion avoidance: to prevent sending too much data and causing saturation, which can cause packet drops and worse performance.
Slow-start: Part of TCP congestion control, this begins with a small congestion window and then increases it as acknowledgments (ACKs) are received within a certain time. When they are not, the congestion window is reduced.
Important topics for TCP performance include the three-way handshake, duplicate ACK detection, congestion control algorithms, Nagle, delayed ACKs, SACK, and FACK.

Optimizations Selective acknowledgments (SACKs): allow TCP to acknowledge discontinuous packets, reducing the number of retransmits required. Fast retransmit: Instead of waiting on a timer, TCP can retransmit dropped packets based on the arrival of duplicate ACKs. These are a function of round-trip time and not the typically much slower timer.

Optimizations Fast recovery: This recovers TCP performance after detecting duplicate ACKs, by resetting the connection to perform slow-start. In some cases these are implemented by use of extended TCP options added to the protocol header.

Congestion Avoidance: (AIMD)
If no congestion in the network (increase conservatively)
    Increase the congestion window additively every RTT
    Every RTT: w = w + 1 (w = cwnd in segments)
    Every ACK reception: w = w + 1/w (w = cwnd in segments)
    Every ACK reception: cwnd = cwnd + MSS*(MSS/cwnd) (cwnd in bytes)
If congestion in the network (decrease aggressively)
    Decrease the congestion window multiplicatively, immediately
    cwnd = cwnd/2 (cwnd in bytes)
How is congestion detected? Estimated (more later)

Congestion Avoidance: (AIMD)
CongestionWindow Size Startup time Time TCP’s saw tooth pattern Issues with additive increase takes too long to ramp up a connection from the beginning The entire advertised window may be reopened when a lost packet retransmitted and a single cumulative ACK is received by the sender

TCP “Slow Start”: To start quickly!
Maintain another variable: slow start threshold (ssthresh), the last known stable rate
If (cwnd > ssthresh) State = congestion avoidance
Else State = slow start
In slow start: increase the congestion window exponentially every RTT
    Every ACK reception: cwnd = cwnd + MSS (cwnd in bytes)
    Every ACK reception: w = w + 1 (w = cwnd in segments)
Key: How is ssthresh calculated?
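To see how the per-ACK rules produce exponential and then linear growth per RTT, here is a toy simulation in units of segments. It assumes no losses, one ACK per segment, and that a whole window is sent and acknowledged each RTT; the function name is made up for this sketch.

```python
def window_after_rtts(rtts, ssthresh, w=1.0):
    """Window size in segments at the end of each RTT."""
    trace = []
    for _ in range(rtts):
        # One ACK arrives for every segment sent during this RTT.
        for _ in range(int(w)):
            if w > ssthresh:
                w += 1.0 / w   # congestion avoidance: about +1 per RTT
            else:
                w += 1.0       # slow start: window doubles per RTT
        trace.append(w)
    return trace


trace = window_after_rtts(5, ssthresh=8)
```

With `ssthresh=8` the window goes 2, 4, 8 over the first three RTTs, then grows by roughly one segment per RTT once it has crossed the threshold.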

TCP: Congestion Detection and Retransmit
Loss of packet indicates congestion Timer Timeouts (No ACK) Set according to Jacobson/Karels algorithm On timer timeout ssthresh = max(2*MSS, effwin/2); cwnd = MSS Notice this will cause TCP to go into slow start Issue: takes a long time to detect a packet loss Affects throughput Any other quicker way of detecting a packet loss?

Fast Retransmit
Observation: A series of duplicate ACKs might mean a packet loss
Solution:
    Every time the receiver receives an out-of-order packet, it sends a duplicate ACK
    Sender retransmits the missing packet after it receives some number of duplicate ACKs (e.g., 3 duplicate ACKs)
Fast Retransmit does not replace timeouts
Issue: Reduces latency (early retransmit) but still incurs loss in throughput (slow start after packet loss)
[Diagram: sender transmits PKT 1 to PKT 6 and PKT 3 is lost; receiver returns ACK 1, ACK 2, then a duplicate ACK 2 for each of PKT 4, PKT 5 and PKT 6; sender retransmits PKT 3 and receives ACK 6]

Fast Recovery
Transmit a packet for every ACK received till the retransmitted packet is ACK’d
ssthresh = max(2*MSS, cwnd/2); cwnd = ssthresh + 3*MSS (cwnd in bytes)
On every ACK until the ACK of the retransmitted packet: cwnd = cwnd + MSS
On reception of the ACK of the retransmitted packet:
    Start congestion avoidance instead of slow start
    cwnd = ssthresh

TCP backlog queues Bursts of connections are handled by using backlog queues . There are two such queues, one for incomplete connections while the TCP handshake completes (also known as the SYN backlog), and one for established sessions waiting to be accepted by the application (also known as the listen backlog). These are pictured in the figure. Only one queue was used in earlier kernels, and it was vulnerable to SYN floods. A SYN flood is a type of DoS attack that involves sending numerous SYNs to the listening TCP port from bogus IP addresses. This fills the backlog queue while TCP waits to complete the handshake, preventing real clients from connecting. With two queues, the first can act as a staging area for potentially bogus connections, which are promoted to the second queue only once the connection is established. The first queue can be made long to absorb SYN floods and optimized to store only the minimum amount of metadata necessary. The length of these queues can be tuned independently. The second can also be set by the application as the backlog argument to listen().

TCP send & receive buffers
Data throughput is improved by using send and receive buffers associated with the socket. For the write path, the data is buffered in the TCP send buffer, and then sent to IP for delivery. While the IP protocol has the capability to fragment packets, TCP tries to avoid this by sending data as MSS-size segments to IP. This means the unit of (re-)transmission matches the unit of fragmentation; otherwise a dropped fragment would require retransmission of the entire pre-fragmented packet. This approach can also improve TCP/IP stack efficiency, as it avoids fragmentation and assembly of regular packets. The size of both the send and receive buffers is tunable. Larger sizes improve throughput performance, at the cost of more main memory spent per connection. One buffer may be set to be larger than the other if the server is expected to perform more sending or receiving. The Linux kernel will also dynamically increase the size of these buffers based on the connection activity.

Duplicate ACK detection
Used by the fast retransmit and fast recovery algorithms. It is performed on the sender and works as follows: The sender sends a packet with sequence number 10. The receiver replies with an ACK for sequence number 11. The sender sends 11, 12, and 13. Packet 11 is dropped. The receiver replies to both 12 and 13 by sending an ACK for 11, which it is still expecting. The sender receives the duplicate 11 ACKs. Also used by TCP Reno and Tahoe congestion avoidance algorithms.
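A sender-side sketch of duplicate ACK counting, following the example above where the ACK number names the next packet the receiver expects. The threshold of 3 duplicates is the commonly cited value; the function name is an assumption of this sketch.

```python
def fast_retransmit_points(acks, threshold=3):
    """Return the ack numbers whose duplicates triggered a fast retransmit."""
    last_ack = None
    dup_count = 0
    triggers = []
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:
                # The receiver keeps asking for `ack`: treat it as lost
                # and retransmit it without waiting for the timer.
                triggers.append(ack)
        else:
            # A new, higher ack resets the duplicate counter.
            last_ack, dup_count = ack, 0
    return triggers
```

In the example above only two duplicate ACKs for 11 arrive (for packets 12 and 13), so with a threshold of 3 the sender would need one more out-of-order arrival before retransmitting.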

TCP Performance (Steady State)
Bandwidth as a function of RTT? Loss rate? Packet size? Receive window?

What if TCP connection is short?
Slow start dominates performance What if network is unloaded? Burstiness causes extra drops Packet losses unreliable indicator for short flows can lose connection setup packet Can get loss when connection near done Packet loss signal unrelated to sending rate In limit, have to signal congestion (with a loss) on every connection 50% loss rate as increase # of connections

Example: 100KB transfer 100Mb/s Ethernet, 100ms RTT, 1.5KB MSS
Ethernet ~ 100 Mb/s 64KB window, 100ms RTT ~ 6 Mb/s slow start (delayed acks), no losses ~ 500 Kb/s slow start, with 5% drop ~ 200 Kb/s Steady state, 5% drop rate ~ 750 Kb/s

Improving Short Flow Performance
Start with a larger initial window RFC 3390: start with 3-4 packets Persistent connections HTTP: reuse TCP connection for multiple objects on same page Share congestion state between connections on same host or across host Skip slow start? Ignore congestion signals?

TCP Modeling Given the congestion behavior of TCP can we predict what type of performance we should get? What are the important factors Loss rate Affects how often window is reduced RTT Affects increase rate and relates BW to window RTO Affects performance during loss recovery MSS Affects increase rate

Overall TCP Behavior Let’s concentrate on steady state behavior with no timeouts and perfect loss recovery Window Time

Simple TCP Model Some additional assumptions
Fixed RTT
No delayed ACKs
In steady state, TCP loses a packet each time the window reaches W packets
Window drops to W/2 packets
Each RTT the window increases by 1 packet, so it takes W/2 * RTT before the next loss
BW = MSS * avg window / RTT = MSS * (W + W/2) / (2 * RTT) = .75 * MSS * W / RTT

Simple Loss Model
What was the loss rate?
Packets transferred between losses = Avg BW * time = (.75 * W / RTT) * (W/2 * RTT) = 3W^2/8
1 packet lost => loss rate = p = 8/(3W^2)
W = sqrt(8 / (3 * p))
BW = .75 * MSS * W / RTT
BW = MSS / (RTT * sqrt(2p/3))
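The model can be checked numerically: substituting W = sqrt(8 / (3p)) into BW = .75 * MSS * W / RTT gives .75 * sqrt(8/3) = sqrt(3/2), hence BW = MSS / (RTT * sqrt(2p/3)). A small sketch, with illustrative function names and the model's assumptions (fixed RTT, no timeouts, one loss per sawtooth):

```python
import math


def window_at_loss(loss_rate):
    """Peak window W (in packets) implied by loss rate p."""
    return math.sqrt(8.0 / (3.0 * loss_rate))


def tcp_throughput(mss_bytes, rtt_seconds, loss_rate):
    """Steady-state throughput in bytes per second under the simple model."""
    return mss_bytes / (rtt_seconds * math.sqrt(2.0 * loss_rate / 3.0))
```

The 1/sqrt(p) dependence means that quadrupling the loss rate halves the throughput, and halving the RTT doubles it.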

TCP Friendliness What does it mean to be TCP friendly?
TCP is not going away Any new congestion control must compete with TCP flows Should not clobber TCP flows and grab bulk of link Should also be able to hold its own, i.e. grab its fair share, or it will never become popular How is this quantified/shown? Has evolved into evaluating loss/throughput behavior If it shows 1/sqrt(p) behavior it is ok But is this really true?

TCP Performance Can TCP saturate a link? Congestion control
Increase utilization until… link becomes congested React by decreasing window by 50% Window is proportional to rate * RTT Doesn’t this mean that the network oscillates between 50 and 100% utilization? Average utilization = 75%?? No…this is *not* right!

TCP Congestion Control
Rule for adjusting W
If an ACK is received: W ← W+1/W
If a packet is lost: W ← W/2
Only W packets may be outstanding
[Figure: packets between source and destination over time t, and the resulting window size]
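The per-ACK rule can be sketched directly. The code below is an illustrative simulation, not an implementation of any real TCP stack: it applies W ← W + 1/W for each ACK and W ← W/2 on a loss, and shows that one window's worth of ACKs grows W by roughly one packet.

```python
def aimd_step(window, lost):
    """Apply the rule to one event: halve on loss, otherwise add 1/W."""
    if lost:
        return window / 2.0
    return window + 1.0 / window


def run_round_trip(window):
    """Process about W ACKs (one round trip's worth) with no loss."""
    for _ in range(int(window)):
        window = aimd_step(window, lost=False)
    return window


after_one_rtt = run_round_trip(10.0)           # a little under 11 packets
after_loss = aimd_step(after_one_rtt, lost=True)
```

The increase per round trip is slightly below one packet because 1/W shrinks as W grows within the round; this is why the rule is usually summarized as "one packet per RTT".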

Congestion Control: Reno and Tahoe
Reno: triple duplicate ACKs trigger halving of the congestion window, halving of the slow-start threshold, fast retransmit, and fast recovery.
Tahoe: triple duplicate ACKs trigger fast retransmit, halving of the slow-start threshold, the congestion window set to one maximum segment size (MSS), and the slow-start state.
Some operating systems (e.g., Linux and Oracle Solaris 11) allow the algorithm to be selected as part of system tuning. Newer algorithms that have been developed for TCP include Vegas, New Reno, and Hybla.
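The difference between the two reactions can be sketched as a small state update. This is a simplified illustration: it assumes the slow-start threshold is set to half of the current congestion window (a common reading of "halving"), and the function name and returned field names are hypothetical.

```python
def on_triple_duplicate_ack(algorithm, cwnd_bytes, mss_bytes):
    """Return the sender state after three duplicate ACKs.

    Assumption: ssthresh becomes half of the current congestion window.
    Both algorithms retransmit the missing segment immediately.
    """
    half = cwnd_bytes // 2
    if algorithm == "reno":
        # Reno keeps sending from half the window (fast recovery).
        return {"cwnd": half, "ssthresh": half,
                "state": "fast_recovery", "fast_retransmit": True}
    if algorithm == "tahoe":
        # Tahoe falls back to one MSS and re-enters slow start.
        return {"cwnd": mss_bytes, "ssthresh": half,
                "state": "slow_start", "fast_retransmit": True}
    raise ValueError("unknown algorithm: " + algorithm)


mss = 1460
reno = on_triple_duplicate_ack("reno", 20 * mss, mss)
tahoe = on_triple_duplicate_ack("tahoe", 20 * mss, mss)
```

With a 20-segment window, Reno continues from 10 segments while Tahoe restarts from 1 segment and must climb back to the threshold through slow start.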

Nagle This algorithm [RFC 896] reduces the number of small packets on the network by delaying their transmission to allow more data to arrive and coalesce. This delays packets only if there is data in the pipeline and delays are already being encountered. The system may provide a tunable parameter to disable Nagle, which may be necessary if its operation conflicts with delayed ACKs.
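Besides any system-wide tunable, applications commonly disable Nagle per socket with the standard `TCP_NODELAY` socket option. The sketch below uses Python's `socket` module; the helper name is illustrative, and whether disabling Nagle helps depends on the workload.

```python
import socket


def make_socket_without_nagle():
    """Create a TCP socket with Nagle disabled via TCP_NODELAY."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero value turns off coalescing of small writes for this socket.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_socket_without_nagle()
# Read the option back to confirm it was applied (non-zero means disabled).
nagle_disabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

This is a per-socket setting, so it affects only the application that sets it, unlike a system-wide tunable.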

Delayed ACKs This algorithm [RFC 1122] delays the sending of ACKs by up to 500 ms, so that multiple ACKs may be combined. Other TCP control messages can also be combined, reducing the number of packets on the network.

Selective ACK (SACK) Allows the receiver to inform the sender that it received a noncontiguous block of data. Without this, a packet drop would eventually cause the entire send window to be retransmitted, to preserve a sequential acknowledgment scheme. This harms TCP performance and is avoided by most modern operating systems that support SACK.
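To illustrate what SACK information buys the sender, here is a minimal sketch that computes which byte ranges are still unacknowledged given a cumulative ACK and a set of SACK blocks. It is a simplification (real senders also track timers and duplicate-ACK counts), and the function name and sample sequence numbers are illustrative.

```python
def unacknowledged_ranges(cumulative_ack, sack_blocks, highest_sent):
    """Return byte ranges [start, end) not yet reported as received.

    cumulative_ack: next byte the receiver expects in order.
    sack_blocks: list of (start, end) ranges received out of order.
    highest_sent: one past the last byte the sender has transmitted.
    """
    holes = []
    cursor = cumulative_ack
    for start, end in sorted(sack_blocks):
        if start > cursor:
            holes.append((cursor, start))   # gap before this SACK block
        cursor = max(cursor, end)
    if cursor < highest_sent:
        holes.append((cursor, highest_sent))
    return holes


with_sack = unacknowledged_ranges(1000, [(2000, 3000), (4000, 5000)], 5000)
without_sack = unacknowledged_ranges(1000, [], 5000)
```

With the two SACK blocks only 2000 bytes are candidates for retransmission; without them the sender knows only that everything from byte 1000 onward is unacknowledged.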

Forward ACKs (FACK) SACK has been extended by forward acknowledgments (FACK), which are supported in Linux by default. FACKs track additional state and better regulate the amount of outstanding data in the network, improving overall performance.

Fast Recovery

Latency Analysis

Key Metrics for Network Monitoring

tcpdump

Network performance analysis tools

Linux sar Network Statistics

netstat -s The output lists various network statistics, mostly from TCP, that are grouped by their protocol. Fortunately, many of these have long descriptive names, so their meaning may be obvious. Unfortunately, the output is inconsistent and includes spelling errors, which is a nuisance when processing this text programmatically. A number of performance-related metrics have been highlighted in bold, to show the kind of information that is available. Many of these require an advanced understanding of TCP behavior, including the newer features and algorithms that have been introduced in recent years. Here are some example metrics to look for:
A high rate of forwarded versus total packets received: check that the server is supposed to be forwarding (routing) packets.
Passive connection openings: this can be monitored to show load in terms of client connections.
A high rate of segments retransmitted versus segments sent out: can show an unreliable network. This may be expected (for example, with Internet clients).
Packets pruned from the receive queue because of socket buffer overrun: this is a sign of network saturation and may be fixable by increasing socket buffers, provided there are sufficient system resources for the application to keep up.
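Because these counters are cumulative, a ratio such as retransmitted versus sent segments is most informative when computed over an interval from two snapshots. The sketch below assumes the counters have already been extracted into dictionaries; the key names and the sample values are hypothetical and do not reproduce actual netstat output.

```python
def retransmit_ratio(before, after):
    """Fraction of segments retransmitted between two counter snapshots.

    The dictionary keys are hypothetical labels for the 'segments sent out'
    and 'segments retransmitted' counters.
    """
    sent = after["segments_sent"] - before["segments_sent"]
    retransmitted = (after["segments_retransmitted"]
                     - before["segments_retransmitted"])
    if sent <= 0:
        return 0.0   # nothing sent in the interval, ratio is undefined
    return retransmitted / sent


# Made-up snapshots taken some seconds apart, for illustration only.
snapshot_1 = {"segments_sent": 1000, "segments_retransmitted": 10}
snapshot_2 = {"segments_sent": 3000, "segments_retransmitted": 50}
ratio = retransmit_ratio(snapshot_1, snapshot_2)
```

In this made-up interval 40 of 2000 segments were retransmitted, a ratio of 2%.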

netstat –s (cont.)

netstat -i

traceroute The traceroute command sends a series of test packets to experimentally determine the current route to a host. This is performed by increasing the IP protocol time to live (TTL) by one for each packet, causing the sequence of gateways to the host to reveal themselves by sending ICMP time exceeded response messages (provided a firewall doesn’t block them). The example tests the current route between a host in California and a target in Virginia: each hop shows a series of three RTTs, which can be used as a coarse source of network latency statistics. As with ping(8), the packets used are low-priority and may show higher latency than for other application protocols. The path taken can also be studied as part of static performance tuning. Networks are designed to be dynamic and responsive to outages. Performance may have degraded as the path has changed.
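The TTL mechanism can be illustrated with a toy simulation that sends no real packets. The path and the gateway names below are hypothetical; each simulated router decrements the TTL and replies with "time exceeded" when it reaches zero.

```python
def probe(path, ttl):
    """Simulate one probe along path (routers followed by the target).

    Each router decrements the TTL; the router where it reaches zero
    answers with an ICMP-style 'time exceeded' reply.
    """
    for router in path[:-1]:
        ttl -= 1
        if ttl == 0:
            return ("time_exceeded", router)
    return ("reached", path[-1])


def trace(path, max_ttl=30):
    """Increase the TTL by one per probe until the target answers."""
    discovered = []
    for ttl in range(1, max_ttl + 1):
        kind, node = probe(path, ttl)
        discovered.append(node)
        if kind == "reached":
            break
    return discovered


# Hypothetical path: three gateways, then the target host.
hops = trace(["gw1", "gw2", "gw3", "target"])
```

Each increment of the TTL reveals exactly one more gateway, which is why the output of traceroute lists hops in path order.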

tcpdump Network packets can be captured and inspected using the tcpdump(8) utility. This can either print packet summaries on STDOUT, or write packet data to a file for later analysis. The latter is usually more practical: packet rates can be too high to follow their summaries in real time. Each line of output shows the time of the packet (with microsecond resolution), its source and destination IP addresses, and TCP header values. By studying these, the operation of TCP can be understood in detail, including how advanced features are working for your workload. The -n option was used to not resolve IP addresses as host names. Various other options are available, including printing verbose details where available (-v), link-layer headers (-e), and hex-address dumps (-x or -X). During performance analysis, it can be useful to change the timestamp column to show delta times between packets (-ttt), or elapsed time since the first packet (-ttttt). An expression can also be provided to describe how to filter packets (see pcap-filter(7)), to focus on the packets of interest. This is performed in-kernel for efficiency (except on Linux 2.0 and older). Packet capture is expensive to perform, in terms of both CPU cost and storage. If possible, use tcpdump(8) only for short periods to limit the performance cost.
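The two timestamp views mentioned above (delta between consecutive packets, and elapsed time since the first packet) are simple to derive from absolute timestamps. The sketch below works on a plain list of timestamps in seconds; the sample values are made up and the function names are illustrative.

```python
def delta_times(timestamps):
    """Time since the previous packet, as in the -ttt view (first is 0)."""
    if not timestamps:
        return []
    deltas = [0.0]
    for previous, current in zip(timestamps, timestamps[1:]):
        deltas.append(current - previous)
    return deltas


def elapsed_times(timestamps):
    """Time since the first packet, as in the -ttttt view."""
    if not timestamps:
        return []
    first = timestamps[0]
    return [t - first for t in timestamps]


# Made-up capture times in seconds.
packet_times = [10.0, 10.5, 10.75]
deltas = delta_times(packet_times)
elapsed = elapsed_times(packet_times)
```

Delta times make gaps such as retransmission timeouts stand out, while elapsed times are convenient for lining packets up against an application-level timeline.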

Homework (4th and 5th eds.)
5.13 5.16 5.28 5.34 5.39 Due 4/24
