Presentation on theme: "ECE544: Communication Networks-II Spring 2015"— Presentation transcript:
1 ECE544: Communication Networks-II, Spring 2015
Transport Protocols
Includes teaching materials from L. Peterson; David Wetherall and Tom Anderson (Univ. of Washington); and Brendan Gregg, Systems Performance: Enterprise and the Cloud.
2 Today’s Lecture
Introduction to transport protocols
UDP
TCP
RTP
3 The Disconnect
Applications running on hosts need to communicate and require some guarantees from the underlying layer (guaranteed service).
The network layer (IP) provides only best-effort communication services, and only between hosts (not applications).
[Figure: hosts connected through routers R1, R2, and R3 over Ethernet, FDDI, and PPP links, all running IP.]
4 Transport Protocol
A transport protocol provides the services required by applications, using the services provided by the network layer.
The transport layer is the lowest layer in the network stack that is an end-to-end protocol.
[Figure: Host1 and Host8 run applications over TCP/UDP; routers R1, R2, and R3 between them run only IP over Ethernet, FDDI, and PPP links.]
5 Transport Protocols
Application requirements vs. IP layer limitations:
  Guarantee message delivery. Limitation: the network may drop messages.
  Deliver messages in the same order they are sent. Limitation: messages may be reordered in the network and incur long delays.
  Deliver at most one copy of each message. Limitation: messages may be duplicated in the network.
  Support arbitrarily large messages. Limitation: the network may limit message size.
  Support synchronization between sender and receiver; allow the receiver to apply flow control to the sender.
  Support multiple application processes on each host. Limitation: the network only supports communication between hosts.
  Many more.
Design just a few transport protocols to meet most of the current and future application requirements.
Each satisfies the requirements for a class of applications.
Many applications => few transport protocols.
6 Most Popular Transport Protocols
User Datagram Protocol (UDP)
  Supports multiple application processes on each host.
  Optionally checks messages for correctness with a checksum.
Transmission Control Protocol (TCP)
  Ensures reliable delivery of packets between source and destination processes.
  Ensures in-order delivery of packets to the destination process.
  Other services.
Real-time Transport Protocol (RTP)
  Serves real-time multimedia applications.
  Moves decision making to the applications.
  Runs over UDP.
TCP, UDP, and RTP satisfy the needs of the most common applications.
Applications requiring other functionality usually use UDP as the transport protocol and implement additional features as part of the application.
7 UDP Demultiplexing
Service: support for multiple processes on each host to communicate.
Issue: IP only provides communication between hosts (IP addresses).
Solution: add a port number and associate a process with a port number.
4-tuple unique connection identifier: [SrcPort, SrcIPAddr, DestPort, DestIPAddr]
UDP packet format (two 16-bit fields per 32-bit word): SrcPort, DestPort; Length, Checksum; Payload.
[Figure: application processes sit above UDP, which sits above IP, on each side of the network.]
8 UDP Error Detection
Service: ensure message correctness.
Issue: packet corruption in transit.
Solution: use a checksum. Why isn’t the IP checksum enough?
The checksum includes the UDP header, the payload, and a pseudo header.
Pseudo header: protocol number, source IP address, destination IP address, and UDP length.
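A minimal sketch of the checksum described above, assuming IPv4 and the standard 16-bit one's-complement Internet checksum. The function names `internet_checksum` and `udp_checksum` are illustrative, not taken from the slides.

```python
import socket
import struct

def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of all 16-bit words, then complemented."""
    if len(data) % 2:
        data += b"\x00"                       # pad to a whole number of words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:                        # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def udp_checksum(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                 payload: bytes) -> int:
    """Checksum over pseudo header + UDP header + payload (IPv4 assumed)."""
    length = 8 + len(payload)                 # UDP header is 8 bytes
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 17, length))   # zero, protocol 17 = UDP, UDP length
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)  # checksum field zeroed
    csum = internet_checksum(pseudo + header + payload)
    return csum or 0xFFFF                     # 0 is reserved for "no checksum" in IPv4
```

A receiver can verify a datagram by running `internet_checksum` over the same bytes with the checksum field filled in; an undamaged datagram yields 0.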
9 UDP Properties and Applications
It is transaction-oriented, suitable for simple query-response protocols such as the Domain Name System (DNS) or Network Time Protocol (NTP).
It provides datagrams, suitable for modeling other protocols such as IP tunneling, Remote Procedure Call (RPC), and the Network File System (NFS).
It is simple, suitable for bootstrapping without a full protocol stack, as in DHCP and the Trivial File Transfer Protocol.
It is stateless, suitable for very large numbers of clients, such as in streaming media applications.
The lack of retransmission delays makes it suitable for real-time applications such as VoIP, online games, and many protocols built on top of the Real Time Streaming Protocol.
It works well in unidirectional communication, suitable for broadcast information such as in many kinds of service discovery, and for shared information such as broadcast time or the Routing Information Protocol (RIP).
Numerous key Internet applications use UDP, including the Domain Name System (DNS), where queries must be fast and consist only of a single request followed by a single reply packet; the Simple Network Management Protocol (SNMP); the Routing Information Protocol (RIP); and the Dynamic Host Configuration Protocol (DHCP).
Voice and video traffic is generally transmitted using UDP. Real-time video and audio streaming protocols are designed to handle occasional lost packets, so only slight degradation in quality occurs, rather than the large delays that retransmitting lost packets would cause. Because both TCP and UDP run over the same network, many businesses are finding that a recent increase in UDP traffic from these real-time applications is hindering the performance of applications using TCP, such as point of sale, accounting, and database systems. When TCP detects packet loss, it throttles back its data rate. Since both real-time and business applications are important to businesses, developing quality of service solutions is seen as crucial by some.
10 Datagram Sockets
Applications use datagram sockets to establish host-to-host communications. An application binds a socket to its endpoint of data transmission, which is a combination of an IP address and a service port. A port is a software structure that is identified by the port number, a 16-bit integer value, allowing for port numbers between 0 and 65535. Port 0 is reserved, but is a permissible source port value if the sending process does not expect messages in response.
The Internet Assigned Numbers Authority (IANA) has divided port numbers into three ranges. Port numbers 0 through 1023 are used for common, well-known services. On Unix-like operating systems, using one of these ports requires superuser operating permission. Port numbers 1024 through 49151 are the registered ports used for IANA-registered services. Ports 49152 through 65535 are dynamic ports that are not officially designated for any specific service and may be used for any purpose. They are also used as ephemeral ports, from which software running on the host may randomly choose a port in order to define itself. In effect, they are used as temporary ports, primarily by clients when communicating with servers.
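As an illustration of binding and ephemeral ports, here is a small sketch using Python's standard `socket` module on the loopback interface. The helper name `udp_roundtrip` is made up for this example; note that the ephemeral range an operating system actually uses may differ from the IANA dynamic range.

```python
import socket

def udp_roundtrip(message: bytes):
    """Send one datagram over loopback; return (data received, sender's port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as server, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        server.bind(("127.0.0.1", 0))        # port 0 asks the OS for any free port
        server.settimeout(2.0)               # avoid blocking forever if the datagram is lost
        client.sendto(message, server.getsockname())  # client gets an ephemeral source port
        data, (_client_ip, client_port) = server.recvfrom(2048)
        return data, client_port
```

No connection is set up: the server learns the client's address and ephemeral port from the datagram itself, which is what it would reply to.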
11 Transmission Control Protocol (TCP)
First proposed by Vinton Cerf and Robert Kahn, 1974.
TCP/IP enabled computers of all sizes, from different vendors and with different OSs, to communicate with each other.
Carries 80% of all traffic on the Internet.
Reliable, in-order delivery, connection-oriented, byte-stream service.
12 TCP: Connection-oriented
Service: connection-oriented; the application states the destination once.
Issue: IP is connection-less.
Solution: TCP maintains the connection state.
  Connection establishment
  Connection termination
13 Evolution of TCP
1974: TCP described by Vint Cerf and Bob Kahn in IEEE Trans Comm
1975: Three-way handshake, Raymond Tomlinson, in SIGCOMM 75
1982: TCP & IP, RFC 793 & 791
1983: BSD Unix 4.2 supports TCP/IP
1984: Nagle’s algorithm to reduce overhead of small packets; predicts congestion collapse
1986: Congestion collapse observed
1987: Karn’s algorithm to better estimate round-trip time
1988: Van Jacobson’s algorithms: congestion avoidance and congestion control (most implemented in 4.3BSD Tahoe)
1990: 4.3BSD Reno: fast retransmit, delayed ACKs
14 TCP Through the 1990s
1993: TCP Vegas (Brakmo et al), real congestion avoidance
1994: T/TCP (Braden), Transaction TCP
1994: ECN (Floyd), Explicit Congestion Notification
1996: SACK TCP (Floyd et al), Selective Acknowledgement
1996: Hoe, improving TCP startup
1996: FACK TCP (Mathis et al), extension to SACK
15 TCP: Packet Format
Flags: SYN, FIN, ACK, RESET, URG, PUSH
Sequence number: sequence number of the first byte of data in the segment. It is an abstract number (more later).
Acknowledgement: next sequence number expected from the sender.
16 Reliable Byte-stream
Bidirectional data transfer.
Control information (e.g., ACK) piggybacks on data segments in the reverse direction.
17 TCP Connection Management
Setup: asymmetric 3-way handshake.
Transfer: sliding window; data and acks in both directions.
Teardown: symmetric 2-way handshake.
Client-server model: the initiator (client) contacts the server; the listener (server) responds and provides service.
18 Three-Way Handshake
Opens both directions for transfer.
Active participant (client) to passive participant (server): SYN, SequenceNum = x
Server to client: SYN + ACK, SequenceNum = y, Acknowledgment = x+1
Client to server: ACK, Acknowledgment = y+1 (+data)
19 Do we need 3-way handshake?
Allows both sides to:
  allocate state for buffer size, state variables, …
  calculate estimated RTT, estimated MTU, etc.
Helps prevent:
  duplicates across incarnations
  intentional hijacking (random nonces => weak form of authentication)
Short-circuit?
  Persistent connections in HTTP (keep connection open)
  Transactional TCP (save seq #, reuse on reopen)
  But congestion control effects dominate
20 TCP Transfer
Connection is bi-directional; acks can carry response data.
Client: Seq = x + MSS; Ack = y+1
Client: Seq = x + 2*MSS; Ack = y+1
Server: Seq = y+MSS; Ack = x+2MSS+1
Client: Seq = x + 3*MSS; Ack = y+MSS+1
21 TCP Connection Teardown
Symmetric: either side can close the connection (or RST!).
[Figure: message exchange between web server and web browser. In order: FIN; data, ACK (half-open connection; data can continue to be sent); data, ACK; FIN; ACK. One side can reclaim the connection after 2 MSL; the other can reclaim the connection right away on the final ACK (must be at least 1 MSL after first FIN).]
22 Connection Establishment
Both sender and receiver must be ready before we start the transfer of data.
Need to agree on a set of parameters, e.g., the Maximum Segment Size (MSS).
This is (in-band) signaling.
It sets up state at the endpoints.
23 Connection Establishment
Server: informs TCP about the listening port (up-call registration).
Client: performs the three-way handshake.
SYN and ACK flags in the header are used.
Initial sequence numbers x and y are selected at random.
Connection establishment, active participant (client) and passive participant (server):
  Client: SYN, Seq# = x
  Server: SYN+ACK, Seq# = y, Ack# = x+1
  Client: ACK, Ack# = y+1
Data transport: Data+ACK
24 Connection Termination
Any side can terminate the connection.
Each side closes its half of the connection independently.
A connection may be half-open: the side that has closed can only receive data.
[Figure: message exchange with FIN, FIN-ACK, ACK, DATA, data write, and data ACK.]
25 What if packets can be delayed?
Solutions?
  Never reuse an ID?
  Change IP layer to eliminate packet reordering?
  Prevent very late delivery?
    IP routers keep hop count per pkt, discard if exceeded
    IDs not reused within delay bound
TCP won’t work without some bound on how late packets can arrive!
[Figure: two packets numbered 1; one is accepted, the other rejected.]
26 TCP Connection Setup, with States
Server (passive participant) starts in LISTEN.
Client (active participant): sends SYN, SequenceNum = x; enters SYN_SENT.
Server: enters SYN_RCVD; sends SYN + ACK, SequenceNum = y, Acknowledgment = x+1.
Client: enters ESTABLISHED; sends ACK, Acknowledgment = y+1 (+data).
Server: enters ESTABLISHED.
27 TCP Connection Teardown
Web server states: FIN_WAIT_1, FIN_WAIT_2, TIME_WAIT, …, CLOSED
Web browser states: CLOSE_WAIT, LAST_ACK, CLOSED
Messages, in order: FIN, ACK, FIN, ACK
28 The TIME_WAIT State
We wait 2MSL (two times the maximum segment lifetime of 60 seconds) before completing the close.
Why?
  The ACK might have been lost, and so the FIN will be resent.
  It could interfere with a subsequent connection.
29 TCP Handshake in an Uncooperative Internet
TCP hijacking
  If seq # is predictable, an attacker can insert packets into the TCP stream.
  Many implementations of TCP simply bumped the previous seq # by 1.
  An attacker can learn the seq # by setting up a connection.
Solution: use random initial sequence #’s (weak form of authentication).
[Figure: client, server, and malicious attacker. Messages: SYN, SequenceNum = x; SYN + ACK, y, x + 1; ACK, y+1; “HTTP get URL”, x + MSS; fake web page, y+MSS (from the attacker); web page, y + MSS (from the server).]
30 TCP Handshake in an Uncooperative Internet
TCP SYN flood
  The server maintains state for every open connection.
  If an attacker spoofs source addresses, it can cause the server to open lots of connections.
  Eventually, the server runs out of memory.
[Figure: malicious attacker sends SYN, SequenceNum = x; SYN, p; SYN, q; SYN, r; SYN, s to the server, which answers with SYN + ACK, y, x + 1.]
31 How can TCP choose segment size?
Pick LAN MTU as segment size?
  LAN MTU can be larger than WAN MTU (e.g., Gigabit Ethernet jumbo frames).
Pick smallest MTU across all networks in the Internet?
  Most traffic is local! (local file server, web proxy, DNS cache, ...)
  Increases packet processing overhead.
Discover MTU to each destination? (IP DF bit)
Guess?
32 How do we keep the pipe full?
Unless the bandwidth*delay product is small, stop and wait can’t fill the pipe.
Solution: send multiple packets without waiting for the first to be acked.
Reliable, unordered delivery:
  Send a new packet after each ack.
  Sender keeps a list of unack’ed packets; resends after timeout.
  Receiver same as stop & wait.
How easy is it to write apps that handle out of order delivery?
How easy is it to test those apps?
33 Sliding Window: Reliable, ordered delivery
Two constraints:
  Receiver can’t deliver a packet to the application until all prior packets have arrived.
  Sender must prevent buffer overflow at the receiver.
Solution: sliding window
  circular buffer at sender and receiver
  packets in transit <= buffer size
  advance when sender and receiver agree packets at the beginning have been received
How big should the window be? bandwidth * round trip delay
34 Sender/Receiver State
Sender:
  packets sent and acked (LAR = last ack recvd)
  packets sent but not yet acked
  packets not yet sent (LFS = last frame sent)
Receiver:
  packets received and acked (NFE = next frame expected)
  packets received out of order
  packets not yet received (LFA = last frame ok)
35 Sliding window
Allows multiple packets, up to the size of the window, to be sent on the network before acknowledgments are received.
Provides high throughput even on high-latency networks.
The size of the window is advertised by the receiver to indicate how many packets it is willing to receive at that time.
36 Sliding Window
[Figure: send window over packets 1 to 6, marking which are sent and which are acked, with LAR and LFS indicated; receive window over packets 1 to 6, marking which are received and which are acked, with NFE and LFA indicated.]
38 What if we lose a packet?
Go back N (original TCP)
  receiver acks “got up through k” (“cumulative ack”)
  ok for receiver to buffer out of order packets
  on timeout, sender restarts from k+1
Selective retransmission (RFC 2018)
  receiver sends ack for each pkt in window
  on timeout, resend only missing packet
39 Can we shortcut timeout?
If packets usually arrive in order, out of order delivery is (probably) a packet loss.
Negative ack
  receiver requests missing packet
Fast retransmit (TCP)
  receiver acks with NFE-1 (or selective ack)
  if sender gets acks that don’t advance NFE, resends missing packet
40 Sender Algorithm
Send full window, set timeout.
On receiving an ack:
  if it increases LAR (last ack received)
    send next packet(s) -- no more than window size outstanding at once
  else (already received this ack)
    if receive multiple acks for LAR, next packet may have been lost; retransmit LAR + 1 (and more if selective ack)
On timeout:
  resend LAR + 1 (first packet not yet acked)
41 Receiver Algorithm
On packet arrival:
  if packet is the NFE (next frame expected)
    send ack
    increase NFE
    hand any packet(s) below NFE to application
  else if < NFE (packet already seen and acked)
    send ack and discard // Q: why is ack needed?
  else (packet is > NFE, arrived out of order)
    buffer and send ack for NFE – 1
    -- signal sender that NFE might have been lost
    -- and with selective ack: which packets correctly arrived
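The receiver algorithm can be sketched in a few lines of Python. This is an illustrative model only: packets are numbered per packet rather than per byte, the class name `Receiver` is invented here, and packets beyond the window are simply dropped (an assumption the slide does not spell out).

```python
class Receiver:
    """Sliding-window receiver with cumulative acks (illustrative sketch)."""

    def __init__(self, window_size: int):
        self.nfe = 0                    # next frame expected
        self.window_size = window_size
        self.buffer = {}                # out-of-order packets keyed by sequence number
        self.delivered = []             # payloads handed to the application, in order

    def on_packet(self, seq: int, payload) -> int:
        """Process one arriving packet and return the cumulative ack (NFE - 1)."""
        if self.nfe <= seq < self.nfe + self.window_size:
            self.buffer[seq] = payload
            # Deliver the in-order prefix and slide the window forward.
            while self.nfe in self.buffer:
                self.delivered.append(self.buffer.pop(self.nfe))
                self.nfe += 1
        # Duplicates (seq < NFE) and packets outside the window are discarded,
        # but an ack is still sent so the sender learns the current NFE.
        return self.nfe - 1
```

Feeding it packets 0, 2, 3, 1 produces acks 0, 0, 0, 3: the repeated ack for 0 is the duplicate-ack signal that packet 1 may be missing.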
42 Sequence Number Selection
Initial sequence number (ISN) selection:
  Why not simply choose 0?
  Must avoid overlap with an earlier incarnation.
  Security issues.
Requirements for ISN selection:
  Must operate correctly
    without synchronized clocks
    despite node failures
43 Sequence Number Wrap Around
Protect against SequenceNum wrap around.
Sliding window: Seq # space >= 2 x WinSize
  For TCP: 2^32 >> 2 x 2^16
Seq # should not wrap around within an MSL (120 sec) period of time.
  For OC-48 (2.5 Gbps), time until wraparound: 14 sec
TCP extension to the sequence # space for protecting against seq # wrapping around:
  Add a 32-bit timestamp as an optional header.
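The 14-second figure follows from dividing the sequence space by the link's byte rate. A quick check, assuming one sequence number per byte and a link running at full rate (the function name is illustrative):

```python
def seconds_until_wrap(link_bps: float, seq_bits: int = 32) -> float:
    """Time for a sender at full link rate to consume the whole sequence space."""
    bytes_per_second = link_bps / 8
    return (2 ** seq_bits) / bytes_per_second

def wraps_within_msl(link_bps: float, msl_seconds: float = 120.0) -> bool:
    """True if sequence numbers can wrap while an old segment may still be alive."""
    return seconds_until_wrap(link_bps) < msl_seconds

oc48 = seconds_until_wrap(2.5e9)   # about 13.7 seconds, i.e. the slide's ~14 sec
```

At 10 Mbps the same calculation gives close to an hour, which is why wraparound only became a concern on fast links.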
44 How do we determine timeouts?
If timeout too small, useless retransmits
  can lead to congestion collapse (and did in 1986)
  as load increases: longer delays, more timeouts, more retransmissions, more load, longer delays, more timeouts …
  Dynamic instability!
If timeout too big, inefficient
  wait too long to send missing packet
Timeout should be based on actual round trip time (RTT)
  varies with destination subnet, routing changes, congestion, …
45 Estimating RTTs
Idea: adapt based on recent past measurements.
For each packet, note time sent and time ack received.
Compute RTT samples and average recent samples for the timeout:
  EstimatedRTT = α x EstimatedRTT + (1 - α) x SampleRTT
This is an exponentially-weighted moving average (low pass filter) that smooths the samples. Typically, α = 0.8 to 0.9.
Set timeout to a small multiple (2) of the estimate.
46 Keep the Pipe Full
AdvertisedWindow: 2^16 => 64 KB
It should be big enough to allow the sender to keep the pipe full (assume that the receiver has enough buffer to handle the data).
If RTT = 100 ms:
  Delay x Bandwidth = 122 KB for a 10 Mbps link
  Delay x Bandwidth = 1.2 MB for a 100 Mbps link (AdvertisedWindow is not large enough)
TCP extension: scaling factor option for AdvertisedWindow, e.g., use 16-byte units of data.
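The delay x bandwidth numbers can be reproduced directly. The sketch below also estimates the smallest power-of-two scaling (a left shift of the 16-bit window) that covers a given product; modeling the scaling factor as a shift count is an assumption of this example, and both function names are illustrative.

```python
def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product in bytes: the data needed in flight to fill the pipe."""
    return bandwidth_bps * rtt_seconds / 8

def window_scale_shift(required_bytes: float, max_unscaled: int = 65535) -> int:
    """Smallest shift s such that (max_unscaled << s) covers required_bytes."""
    shift = 0
    while (max_unscaled << shift) < required_bytes:
        shift += 1
    return shift

slow = bdp_bytes(10e6, 0.1)    # 125,000 bytes, about 122 KB
fast = bdp_bytes(100e6, 0.1)   # 1,250,000 bytes, about 1.2 MB
```

A shift of 4 corresponds to the slide's "16-byte units"; for the 100 Mbps case in this sketch a slightly larger scaling is needed to cover the full product.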
47 TCP Error Control
Cumulative ACK: ACK the highest contiguous bytes received (same as studied before).
Extension: Selective ACK (SACK), which ACKs additional blocks of received data in a TCP optional header.
Timeout timer
  If timeout too soon: unnecessary retransmits, which add load to the network.
  If timeout too late: increases latency; limits the throughput.
48 TCP Timeout
Issue: RTT in a wide area network varies substantially.
Solution: adaptive timeout.
Original algorithm:
  EstimatedRTT = α x EstimatedRTT + (1-α) x SampleRTT
  Timeout = β x EstimatedRTT (β = 2)
Problems:
  Does not distinguish whether the ACK is for the original transmission or a retransmission (suggestions?).
  Constant β is not good: it assumes constant variance.
49 TCP Timeout
Karn/Partridge Algorithm
  Whenever TCP retransmits a segment, it stops taking samples of the RTT.
  Only measure SampleRTT for segments that have been sent only once.
  Each time TCP retransmits, set the next timeout to be twice the last timeout (relieves congestion).
Jacobson/Karels Algorithm: adaptive variance (uses mean deviation)
  Difference = SampleRTT - EstimatedRTT
  EstimatedRTT = EstimatedRTT + (δ x Difference) (same as in original)
  Deviation = Deviation + δ(|Difference| - Deviation)
  Timeout = μ x EstimatedRTT + φ x Deviation (default: set μ = 1 and φ = 4)
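A compact sketch of the two algorithms together. The gain of 0.125, the initial deviation of 0, and the class and method names are assumptions chosen for illustration; only the update equations and the defaults of 1 and 4 for the timeout multipliers come from the slide.

```python
class RttEstimator:
    """Jacobson/Karels timeout with Karn/Partridge sampling and backoff (sketch)."""

    def __init__(self, first_sample: float, gain: float = 0.125,
                 mu: float = 1.0, phi: float = 4.0):
        self.estimated_rtt = first_sample
        self.deviation = 0.0            # assumed starting value
        self.gain, self.mu, self.phi = gain, mu, phi
        self.backoff = 1                # doubles on each retransmission

    def on_ack(self, sample_rtt: float, was_retransmitted: bool) -> None:
        if was_retransmitted:
            return                      # Karn/Partridge: ambiguous sample, ignore it
        difference = sample_rtt - self.estimated_rtt
        self.estimated_rtt += self.gain * difference
        self.deviation += self.gain * (abs(difference) - self.deviation)
        self.backoff = 1                # a clean sample ends the backoff

    def on_retransmit(self) -> None:
        self.backoff *= 2               # next timeout is twice the last

    def timeout(self) -> float:
        base = self.mu * self.estimated_rtt + self.phi * self.deviation
        return base * self.backoff
```

Starting from a 100 ms estimate, a single 200 ms sample moves the estimate to 112.5 ms and the deviation to 12.5 ms, giving a 162.5 ms timeout; a stable RTT lets the deviation term decay so the timeout hugs the estimate.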
50 Triggering Transmission
When to transmit a segment (small segments are subject to large overhead):
  When the data reaches the max segment size (MSS): the size of the largest segment TCP can send without causing the local IP to fragment.
    MSS = local MTU – IP & TCP header
  When the sending process explicitly asks TCP to transmit (“push”).
51 TCP Silly Window Syndrome
Sender has MSS bytes of data to send, but the window is closed.
An ACK arrives with a small window.
Sender sends a small segment (high overhead).
Receiver advertises a small window.
Sender sends a small segment.
Repeat the above.
To solve: Nagle’s Algorithm
  When the application has data to send:
    if both available data and the window >= MSS
      send a full segment
    else
      if there is unACKed data in flight
        buffer the new data until an ACK arrives
      else
        send all the new data now
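Nagle's decision rule is small enough to express as a single function. The function name and the string return values are illustrative choices for this sketch.

```python
def nagle_decision(available_bytes: int, window_bytes: int, mss: int,
                   unacked_in_flight: bool) -> str:
    """Decide what to do when the application hands TCP new data."""
    if available_bytes >= mss and window_bytes >= mss:
        return "send_full_segment"
    if unacked_in_flight:
        return "buffer_until_ack"     # wait, so small writes coalesce
    return "send_now"                 # pipe is idle: a small segment costs little
```

The rule is self-clocking: on a responsive network ACKs return quickly, so small segments still go out often, while on a slow path more data accumulates per segment.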
52 TCP Deadlock
The receiver advertises a window size of 0, and the sender stops sending data.
The window size update from the receiver is lost.
To solve it:
  The sender starts the persist timer when AdvertisedWindow = 0.
  When the persist timer expires, the sender sends a small packet.
53 Example: Exchange of Packets
Receiver has a buffer of size 4 and the application doesn’t read.
[Figure: timeline of segments and ACKs, in order: SEQ=1; ACK=2, WIN=3; T=2: SEQ=2; ACK=3, WIN=2; T=3: SEQ=3 (stall due to flow control here); T=4: SEQ=4; T=5: ACK=4, WIN=1; T=6: ACK=5, WIN=0.]
54 Example: Buffer at Sender
[Figure: sender buffer holding packets 1 to 9, shown at T=2 through T=6, with a legend marking packets as acked, sent, or advertised.]
55 How does sender know when to resume sending?
If receive window = 0, sender stops
  no data => no acks => no window updates
Sender periodically pings receiver with a one byte packet
  receiver acks with current window size
Why not have receiver ping sender?
56 Should sender be greedy?
Should sender transmit as soon as any space opens in the receive window?
Silly window syndrome
  receive window opens a few bytes
  sender transmits little packet
  receive window closes
Solution (Clark, 1982): sender doesn’t resume sending until window is half open
57 Should sender be greedy?
App writes a few bytes; send a packet?
  Don’t want to send a packet for every keystroke
  Send if buffered writes >= max segment size
  Send if app says “push” (ex: telnet, on carriage return)
  Send after timeout (ex: 0.5 sec)
Nagle’s algorithm
  Never send two partial segments; wait for the first to be acked before sending the next
  Self-adaptive: can send lots of tinygrams if network is being responsive
But (!) poor interaction with delayed acks (later)
58 Congestion
Even with flow control, packets might not reach the destination.
When the network cannot support the sender’s rate, queues at the network elements overflow.
[Figure: Source1, Source2, and Source3 send through a shared network to Dest1 and Dest2.]
59 Congestion Control vs. Flow Control
Congestion control
  Mechanism to prevent the sender from overrunning the capacity of the network.
  When the network is the bottleneck.
Flow control
  Mechanism to prevent the sender from overrunning the capacity of the receiver.
  When the receiver is the bottleneck.
60 Misbehaving TCP Receivers
On server side, little incentive to cheat TCP
  Mostly competing against other flows from same server
On client side, high incentive to induce server to send faster
  How?
61 Congestion Control: Design Approach
Maintain another window at the sender called CongestionWindow (cwnd).
  CongestionWindow is the max number of packets allowed in the network, i.e. the number of unACKed packets at the sender.
Key: how to calculate the congestion window (cwnd)
  Various approaches possible.
  TCP estimates it based on observed packet losses; it assumes packet loss is an indication of congestion.
Since we don’t know whether the network or the receiver is the bottleneck:
  MaxWindow = MIN(CongestionWindow, AdvertisedWindow)
  EffectiveWin = MaxWindow – (LastByteSent – LastByteAcked)
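The two window equations translate directly into code. Clamping the result at zero is an addition made in this sketch (the slide's formula can go negative when a window shrinks), and the function name is illustrative.

```python
def effective_window(congestion_window: int, advertised_window: int,
                     last_byte_sent: int, last_byte_acked: int) -> int:
    """Bytes the sender may still put on the wire right now."""
    max_window = min(congestion_window, advertised_window)
    in_flight = last_byte_sent - last_byte_acked
    return max(0, max_window - in_flight)   # clamp: never report a negative allowance
```

With cwnd = 8000, an advertised window of 16000, and 4000 bytes in flight, the sender may send 4000 more bytes: the network, not the receiver, is the limit.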
62 Congestion Avoidance
Congestion avoidance: to prevent sending too much data and causing saturation, which can cause packet drops and worse performance.
Slow-start: part of TCP congestion control, this begins with a small congestion window and then increases it as acknowledgments (ACKs) are received within a certain time. When they are not, the congestion window is reduced.
Important topics for TCP performance include the three-way handshake, duplicate ACK detection, congestion control algorithms, Nagle, delayed ACKs, SACK, and FACK.
63 Optimizations
Selective acknowledgments (SACKs): allow TCP to acknowledge discontinuous packets, reducing the number of retransmits required.
Fast retransmit: instead of waiting on a timer, TCP can retransmit dropped packets based on the arrival of duplicate ACKs. These are a function of round-trip time and not the typically much slower timer.
64 Optimizations
Fast recovery: this recovers TCP performance after detecting duplicate ACKs, by halving the congestion window and continuing in congestion avoidance instead of falling back to slow-start.
In some cases these are implemented by use of extended TCP options added to the protocol header.
65 Congestion Avoidance (AIMD)
If there is no congestion in the network (increase conservatively): increase the congestion window additively every RTT.
If there is congestion in the network (decrease aggressively): decrease the congestion window multiplicatively, immediately.
How is congestion detected? Estimated (more later).
Additive increase:
  Every RTT: w = w + 1 (w = cwnd in segments)
  Every ACK reception: w = w + 1/w (w = cwnd in segments)
  Every ACK reception: cwnd = cwnd + MSS*(MSS/cwnd) (cwnd in bytes)
Multiplicative decrease:
  cwnd = cwnd/2 (cwnd in bytes)
66 Congestion Avoidance (AIMD)
[Figure: congestion window size vs. time, showing the startup time and TCP’s sawtooth pattern.]
Issues with additive increase:
  It takes too long to ramp up a connection from the beginning.
  The entire advertised window may be reopened when a lost packet is retransmitted and a single cumulative ACK is received by the sender.
67 TCP “Slow Start”: To start quickly!
Maintain another variable, the slow start threshold (ssthresh): the last known stable rate.
  If (cwnd > ssthresh)
    State = congestion avoidance
  Else
    State = slow start
In slow start, increase the congestion window exponentially every RTT:
  Every ACK reception: cwnd = cwnd + MSS (cwnd in bytes)
  Every ACK reception: w = w + 1 (w = cwnd in segments)
Key: how is ssthresh calculated?
68 TCP: Congestion Detection and Retransmit
Loss of a packet indicates congestion.
Timer timeouts (no ACK)
  Set according to the Jacobson/Karels algorithm.
  On timer timeout: ssthresh = max(2*MSS, effwin/2); cwnd = MSS
  Notice this will cause TCP to go into slow start.
Issue: it takes a long time to detect a packet loss, which affects throughput.
Any other quicker way of detecting a packet loss?
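To see slow start, additive increase, and the timeout reaction together, here is a coarse per-RTT simulation in units of segments. It is a simplified sketch: ssthresh is derived from cwnd rather than the effective window, slow start is capped at ssthresh, a window equal to ssthresh is treated as congestion avoidance, and the function name and `timeout_at` parameter are invented for this example.

```python
def simulate_cwnd(rtts: int, initial_ssthresh: int, timeout_at=frozenset()):
    """Return the congestion window (in segments) at the start of each RTT."""
    cwnd, ssthresh = 1, initial_ssthresh
    history = []
    for rtt in range(rtts):
        history.append(cwnd)
        if rtt in timeout_at:
            ssthresh = max(2, cwnd // 2)       # remember half the window that failed
            cwnd = 1                           # restart from one segment (slow start)
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)     # slow start: doubles every RTT
        else:
            cwnd += 1                          # congestion avoidance: +1 per RTT
    return history

trace = simulate_cwnd(10, initial_ssthresh=8, timeout_at={6})
# trace is [1, 2, 4, 8, 9, 10, 11, 1, 2, 4]
```

The trace shows the exponential ramp to ssthresh, the linear climb afterwards, and the collapse to one segment after the timeout in RTT 6.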
69 Fast Retransmit
Observation: a series of duplicate ACKs might mean a packet loss.
Solution
  Every time the receiver receives an out-of-order packet, it sends a duplicate ACK.
  The sender retransmits the missing packet after it receives some number of duplicate ACKs (e.g., 3 duplicate ACKs).
Fast retransmit does not replace timeouts.
Issue: reduces latency (early retransmit) but still incurs a loss in throughput (slow start after packet loss).
[Figure: packet exchange, in order: PKT 1, PKT 2, ACK 1, PKT 3, ACK 2, PKT 4, ACK 2, PKT 5, PKT 6, ACK 2, ACK 2, PKT 3 retransmitted, ACK 6.]
70 Fast Recovery
Transmit a packet for every ACK received until the retransmitted packet is ACKed.
  ssthresh = max(2*MSS, cwnd/2); cwnd = ssthresh + 3*MSS
On every ACK until the ACK of the retransmitted packet:
  cwnd = cwnd + MSS
On reception of the ACK of the retransmitted packet:
  Start congestion avoidance instead of slow start.
  cwnd = ssthresh
71 TCP backlog queues
Bursts of connections are handled by using backlog queues. There are two such queues, one for incomplete connections while the TCP handshake completes (also known as the SYN backlog), and one for established sessions waiting to be accepted by the application (also known as the listen backlog). These are pictured in the figure. Only one queue was used in earlier kernels, and it was vulnerable to SYN floods. A SYN flood is a type of DoS attack that involves sending numerous SYNs to the listening TCP port from bogus IP addresses. This fills the backlog queue while TCP waits to complete the handshake, preventing real clients from connecting. With two queues, the first can act as a staging area for potentially bogus connections, which are promoted to the second queue only once the connection is established. The first queue can be made long to absorb SYN floods and optimized to store only the minimum amount of metadata necessary. The length of these queues can be tuned independently. The second can also be set by the application as the backlog argument to listen().
72 TCP send & receive buffers
Data throughput is improved by using send and receive buffers associated with the socket. For the write path, the data is buffered in the TCP send buffer, and then sent to IP for delivery. While the IP protocol has the capability to fragment packets, TCP tries to avoid this by sending data as MSS-size segments to IP. This means the unit of (re-)transmission matches the unit of fragmentation; otherwise a dropped fragment would require retransmission of the entire pre-fragmented packet. This approach can also improve TCP/IP stack efficiency, as it avoids fragmentation and assembly of regular packets. The size of both the send and receive buffers is tunable. Larger sizes improve throughput performance, at the cost of more main memory spent per connection. One buffer may be set to be larger than the other if the server is expected to perform more sending or receiving. The Linux kernel will also dynamically increase the size of these buffers based on the connection activity.
73 Duplicate ACK detection
Used by the fast retransmit and fast recovery algorithms. It is performed on the sender and works as follows:
  The sender sends a packet with sequence number 10.
  The receiver replies with an ACK for sequence number 11.
  The sender sends 11, 12, and 13.
  Packet 11 is dropped.
  The receiver replies to both 12 and 13 by sending an ACK for 11, which it is still expecting.
  The sender receives the duplicate 11 ACKs.
Also used by TCP Reno and Tahoe congestion avoidance algorithms.
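A sketch of sender-side duplicate ACK counting, using the same convention as the steps above (an ACK names the next sequence number the receiver expects). The threshold of 3 matches the triple-duplicate rule mentioned elsewhere in the lecture; the function name is illustrative.

```python
def find_fast_retransmits(acks, threshold: int = 3):
    """Return the sequence numbers a sender would fast-retransmit.

    `acks` is the stream of ACK numbers seen by the sender, in arrival order.
    """
    retransmits = []
    last_ack, duplicates = None, 0
    for ack in acks:
        if ack == last_ack:
            duplicates += 1
            if duplicates == threshold:     # fire once, exactly at the threshold
                retransmits.append(ack)     # the receiver is still waiting for `ack`
        else:
            last_ack, duplicates = ack, 0   # new cumulative ack resets the count
    return retransmits
```

In the walk-through above the sender sees only two duplicate ACKs for 11 (from packets 12 and 13), so with a threshold of 3 it would need one more out-of-order arrival before retransmitting.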
74 TCP Performance (Steady State)
Bandwidth as a function of:
  RTT?
  Loss rate?
  Packet size?
  Receive window?
75 What if TCP connection is short?
Slow start dominates performance
  What if network is unloaded?
  Burstiness causes extra drops
Packet losses unreliable indicator for short flows
  can lose connection setup packet
  can get loss when connection near done
  packet loss signal unrelated to sending rate
In limit, have to signal congestion (with a loss) on every connection
  50% loss rate as # of connections increases
76 Example: 100KB transfer
100 Mb/s Ethernet, 100 ms RTT, 1.5 KB MSS
  Ethernet: ~ 100 Mb/s
  64KB window, 100 ms RTT: ~ 6 Mb/s
  slow start (delayed acks), no losses: ~ 500 Kb/s
  slow start, with 5% drop: ~ 200 Kb/s
  steady state, 5% drop rate: ~ 750 Kb/s
77 Improving Short Flow Performance
Start with a larger initial window
  RFC 3390: start with 3-4 packets
Persistent connections
  HTTP: reuse TCP connection for multiple objects on same page
Share congestion state between connections on same host or across hosts
Skip slow start?
Ignore congestion signals?
78 TCP Modeling
Given the congestion behavior of TCP, can we predict what type of performance we should get?
What are the important factors?
  Loss rate: affects how often window is reduced
  RTT: affects increase rate and relates BW to window
  RTO: affects performance during loss recovery
  MSS: affects increase rate
79 Overall TCP Behavior
Let’s concentrate on steady state behavior with no timeouts and perfect loss recovery.
[Figure: window vs. time.]
80 Simple TCP Model
Some additional assumptions:
  Fixed RTT
  No delayed ACKs
In steady state, TCP loses a packet each time the window reaches W packets.
  Window drops to W/2 packets.
  Each RTT, the window increases by 1 packet, so W/2 * RTT before the next loss.
BW = MSS * avg window / RTT = MSS * (W + W/2) / (2 * RTT) = .75 * MSS * W / RTT
81 Simple Loss Model
What was the loss rate?
BW = .75 * MSS * W / RTT
Packets transferred between losses = Avg BW * time = (.75 * W / RTT) * (W/2 * RTT) = 3W^2/8
1 packet lost per cycle, so loss rate = p = 8/(3W^2)
W = sqrt(8 / (3 * loss rate))
BW = .75 * MSS * W / RTT = MSS / (RTT * sqrt(2p/3))
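The closed-form result is easy to evaluate and to cross-check against the intermediate window formula. A minimal sketch, with illustrative function names and throughput expressed in bits per second:

```python
import math

def window_at_loss(loss_rate: float) -> float:
    """Peak window W (in packets) implied by the loss rate: W = sqrt(8 / (3p))."""
    return math.sqrt(8 / (3 * loss_rate))

def tcp_throughput_bps(mss_bytes: float, rtt_seconds: float, loss_rate: float) -> float:
    """Steady-state model: BW = MSS / (RTT * sqrt(2p/3)), converted to bits/s."""
    return 8 * mss_bytes / (rtt_seconds * math.sqrt(2 * loss_rate / 3))

# Example inputs (assumed, not from the slides): 1460-byte MSS, 100 ms RTT, 1% loss.
bw = tcp_throughput_bps(1460, 0.1, 0.01)   # roughly 1.4 Mb/s
```

Quartering the loss rate doubles the predicted throughput, which is the 1/sqrt(p) behavior used as the yardstick for TCP friendliness.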
82 TCP Friendliness
What does it mean to be TCP friendly?
  TCP is not going away.
  Any new congestion control must compete with TCP flows.
    Should not clobber TCP flows and grab the bulk of the link.
    Should also be able to hold its own, i.e. grab its fair share, or it will never become popular.
How is this quantified/shown?
  Has evolved into evaluating loss/throughput behavior.
  If it shows 1/sqrt(p) behavior it is ok.
  But is this really true?
83 TCP Performance
Can TCP saturate a link?
Congestion control
  Increase utilization until… link becomes congested.
  React by decreasing window by 50%.
  Window is proportional to rate * RTT.
Doesn’t this mean that the network oscillates between 50 and 100% utilization?
  Average utilization = 75%??
  No… this is *not* right!
84 TCP Congestion Control
Rule for adjusting W:
  If an ACK is received: W ← W+1/W
  If a packet is lost: W ← W/2
Only W packets may be outstanding.
[Figure: source and destination exchanging packets over time t, with a plot of window size.]
85 Congestion Control: Reno and Tahoe
Reno: triple duplicate ACKs trigger halving of the congestion window, halving of the slow-start threshold, fast retransmit, and fast recovery.
Tahoe: triple duplicate ACKs trigger fast retransmit, halving of the slow-start threshold, the congestion window being set to one maximum segment size (MSS), and the slow-start state.
Some operating systems (e.g., Linux and Oracle Solaris 11) allow the algorithm to be selected as part of system tuning.
Newer algorithms that have been developed for TCP include Vegas, New Reno, and Hybla.
86NagleThis algorithm [RFC 896] reduces the number of small packets on the network by delaying their transmission to allow more data to arrive and coalesce.This delays packets only if there is data in the pipeline and delays are already being encountered.The system may provide a tunable parameter to disable Nagle, which may be necessary if its operation conflicts with delayed ACKs.
87Delayed ACKs This algorithm [RFC 1122] delays the sending of ACKs by up to 500 ms, so that multiple ACKs may be combined. Other TCP control messages can also be combined, reducing the number of packets on the network.
88Selective ACK (SACK) Allows the receiver to inform the sender that it received a noncontiguous block of data. Without this, a packet drop would eventually cause the entire send window to be retransmitted, to preserve a sequential acknowledgment scheme. This harms TCP performance and is avoided by most modern operating systems that support SACK.
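The saving from SACK can be illustrated with a toy model. This sketch numbers whole segments rather than bytes (real TCP uses byte sequence numbers), treats SACK blocks as inclusive (start, end) segment ranges, and uses hypothetical function names.

```python
def holes_with_sack(cum_ack, sack_blocks):
    """Segments missing below the highest SACKed segment.

    cum_ack is the next segment the receiver expects in order.
    """
    if not sack_blocks:
        return []
    sacked = set()
    for start, end in sack_blocks:
        sacked.update(range(start, end + 1))
    highest = max(sacked)
    return [seg for seg in range(cum_ack, highest) if seg not in sacked]

def resend_without_sack(cum_ack, highest_sent):
    """Without SACK, everything from cum_ack onward may be retransmitted."""
    return list(range(cum_ack, highest_sent + 1))
```

For example, if segments 3 through 9 are outstanding and the receiver reports blocks (4, 6) and (8, 9), only segments 3 and 7 need to be resent instead of all seven.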
89Forward ACKs (FACK) SACK has been extended by forward acknowledgments (FACK), which are supported in Linux by default. FACKs track additional state and better regulate the amount of outstanding data in the network, improving overall performance.
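One way to picture the "additional state" is an estimate of the data still in the network. The sketch below follows the commonly cited formulation from the original FACK proposal, in which the outstanding data is the next sequence number to send, minus the most forward SACKed sequence number, plus any retransmitted data still unacknowledged. The variable names snd_nxt, snd_fack, and retran_data are assumptions and are not defined in this lecture.

```python
def fack_outstanding(snd_nxt, snd_fack, retran_data):
    """Estimated bytes in flight: snd_nxt - snd_fack + retran_data."""
    return snd_nxt - snd_fack + retran_data

def may_send(snd_nxt, snd_fack, retran_data, cwnd):
    """The sender may transmit while the estimate stays below cwnd."""
    return fack_outstanding(snd_nxt, snd_fack, retran_data) < cwnd
```

The point of the estimate is that data below the forward-most acknowledgment is assumed to have left the network, whether it was delivered or lost.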
96netstat -s The output lists various network statistics, mostly from TCP, that are grouped by their protocol. Fortunately, many of these have long descriptive names, so their meaning may be obvious. Unfortunately, the output is inconsistent and includes spelling errors, which is a nuisance when processing this text programmatically. A number of performance-related metrics have been highlighted in bold, to show the kind of information that is available. Many of these require an advanced understanding of TCP behavior, including the newer features and algorithms that have been introduced in recent years. Here are some example metrics to look for: A high rate of forwarded versus total packets received: check that the server is supposed to be forwarding (routing) packets. Passive connection openings: this can be monitored to show load in terms of client connections. A high rate of segments retransmitted versus segments sent out: this can show an unreliable network, which may be expected (for example, with Internet clients). Packets pruned from the receive queue because of socket buffer overrun: this is a sign of network saturation and may be fixable by increasing socket buffers, provided there are sufficient system resources for the application to keep up.
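The retransmit ratio mentioned above can be computed from the text output. The sketch below is an assumption-heavy illustration: the sample text is made up (the numbers are not real measurements), the line wording is modeled on Linux netstat output and varies by operating system and version, and the patterns deliberately tolerate spelling variants such as "send out" and "retransmited".

```python
import re

# Hypothetical sample; the counts are invented for illustration only.
SAMPLE = """\
Tcp:
    1200 passive connection openings
    500000 segments received
    400000 segments send out
    2000 segments retransmited
"""

def retransmit_ratio(text):
    """Return retransmitted / sent segments, or None if either line is missing."""
    sent = re.search(r"(\d+) segments sen[dt] out", text)
    retrans = re.search(r"(\d+) segments retransmit+ed", text)
    if not sent or not retrans:
        return None
    sent_count = int(sent.group(1))
    if sent_count == 0:
        return None
    return int(retrans.group(1)) / sent_count
```

In practice the text would come from running the command and capturing its output; tolerant patterns help because of the inconsistencies the slide warns about.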
99traceroute The traceroute command sends a series of test packets to experimentally determine the current route to a host. This is performed by increasing the IP protocol time to live (TTL) by one for each packet, causing the sequence of gateways to the host to reveal themselves by sending ICMP time exceeded response messages (provided a firewall doesn’t block them). For example, testing the current route between a host in California and a target in Virginia. Each hop shows a series of three RTTs, which can be used as a coarse source of network latency statistics. As with ping(8), the packets used are low-priority and may show higher latency than for other application protocols. The path taken can also be studied as part of static performance tuning. Networks are designed to be dynamic and responsive to outages. Performance may have degraded as the path has changed.
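The TTL mechanism can be illustrated with a pure simulation, which avoids the raw sockets a real tool needs. Everything here is hypothetical: the path is a plain list of hop names ending with the destination, and "silent" stands in for hops whose replies are blocked by a firewall.

```python
def send_probe(path, ttl, silent=()):
    """Forward one probe along path; return (reply_kind, hop)."""
    if not path:
        raise ValueError("path must contain at least the destination")
    for index, hop in enumerate(path):
        if index == len(path) - 1:
            return ("reached", hop)       # the probe arrived at the destination
        ttl -= 1                           # each router decrements the TTL
        if ttl == 0:
            if hop in silent:
                return ("no_reply", None)  # reply blocked, nothing comes back
            return ("time_exceeded", hop)  # router reveals itself

def simulate_traceroute(path, max_ttl=30, silent=()):
    """Send probes with TTL 1, 2, 3, ... and list the hops that answer."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        kind, hop = send_probe(path, ttl, silent)
        hops.append("*" if kind == "no_reply" else hop)
        if kind == "reached":
            break
    return hops
```

A hop that never replies shows up as "*" while later hops can still be discovered, which mirrors the firewall caveat above.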
100tcpdump Network packets can be captured and inspected using the tcpdump(8) utility. This can either print packet summaries on STDOUT, or write packet data to a file for later analysis. The latter is usually more practical: packet rates can be too high to follow their summaries in real time. Each line of output shows the time of the packet (with microsecond resolution), its source and destination IP addresses, and TCP header values. By studying these, the operation of TCP can be understood in detail, including how advanced features are working for your workload. The -n option was used to not resolve IP addresses as host names. Various other options are available, including printing verbose details where available (-v), link-layer headers (-e), and hex-address dumps (-x or -X). During performance analysis, it can be useful to change the timestamp column to show delta times between packets (-ttt), or elapsed time since the first packet (-ttttt). An expression can also be provided to describe how to filter packets (see pcap-filter(7)), to focus on the packets of interest. This is performed in-kernel for efficiency (except on Linux 2.0 and older). Packet capture is expensive to perform, in terms of both CPU cost and storage. If possible, use tcpdump(8) only for short periods to limit the performance cost.
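The advice above (write to a file, filter, and keep captures short) can be sketched as two command builders. These only assemble argument lists and do not run anything; the helper names, interface name, file name, and filter are hypothetical, and the packet-count limit (-c) is used here as one assumed way to keep a capture short.

```python
def tcpdump_capture_command(interface, out_file, packet_count, filter_expr=""):
    """Build a bounded capture that writes raw packets to a file."""
    cmd = ["tcpdump", "-n",            # do not resolve addresses to host names
           "-i", interface,            # interface to capture on
           "-c", str(packet_count),    # stop after this many packets
           "-w", out_file]             # write packet data to a file
    if filter_expr:
        cmd.append(filter_expr)        # pcap-filter expression
    return cmd

def tcpdump_read_command(in_file):
    """Build a command that prints a saved capture with delta timestamps."""
    return ["tcpdump", "-n", "-ttt", "-r", in_file]
```

For example, tcpdump_capture_command("eth0", "/tmp/out.pcap", 1000, "tcp port 80") yields a list that could be passed to subprocess.run, typically with root privileges.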
101Homework (4th and 5th eds.) 184.108.40.2065.345.39 Due 4/24