Reliable Byte-Stream (TCP)


1 Reliable Byte-Stream (TCP)
Outline:
- Connection Establishment/Termination
- Sliding Window Revisited
- Flow Control
- Adaptive Timeout

2 End-to-End Protocols
Underlying best-effort network:
- drops messages
- re-orders messages
- delivers duplicate copies of a given message
- limits messages to some finite size
- delivers messages after an arbitrarily long delay
Common end-to-end services:
- guarantee message delivery
- deliver messages in the same order they are sent
- deliver at most one copy of each message
- support arbitrarily large messages
- support synchronization (between sender and receiver)
- allow the receiver to flow control the sender
- support multiple application processes on each host

3 Simple Demultiplexer (UDP)
- Extends host-to-host service into process-to-process
- Unreliable and unordered datagram service
- Adds multiplexing
- No flow control
- Endpoints identified by ports (why not PIDs?)
  - servers have well-known ports (clients don't need this); see /etc/services on Unix
- Often just a starting point
- Implemented as a message queue

4 Simple Demultiplexer (UDP)
Header format:
- 16-bit SrcPort and DstPort fields (so only 64K ports)
- a process is really identified by the <port, host> pair
- 16-bit Length and Checksum fields, then the data
Checksum (optional in IPv4, mandatory in IPv6) covers the pseudo header + UDP header + data.
Pseudo header: protocol number, source IP, destination IP, and the UDP length field.
Why? The SourceIP and DestIP are included to ensure that the packet has been delivered between the correct two endpoints (it's UDP's way of checking that the packet arrived at the right place). And yes, the length field is included twice.
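A minimal sketch of how a checksum over that pseudo header could be computed; the addresses, ports, and payload are made-up illustrative values, not from the slides:

```python
import struct

def ones_complement_sum16(data: bytes) -> int:
    """One's-complement sum of 16-bit big-endian words (pad odd length with a zero byte)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total

def udp_checksum(src_ip: bytes, dst_ip: bytes, udp_header: bytes, payload: bytes) -> int:
    length = len(udp_header) + len(payload)
    # Pseudo header: SrcIP, DstIP, zero byte, protocol number (17 = UDP), UDP length
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, length)
    return (~ones_complement_sum16(pseudo + udp_header + payload)) & 0xFFFF

# Hypothetical example; the checksum field itself is zero while computing.
src, dst = bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])
header = struct.pack("!HHHH", 12345, 53, 8 + 5, 0)   # SrcPort, DstPort, Length, Checksum=0
print(hex(udp_checksum(src, dst, header, b"hello")))
```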

5 TCP Overview
- Connection-oriented
- Byte-stream
  - app writes bytes, TCP sends segments, app reads bytes
- Full duplex
- Flow control: keep sender from overrunning receiver
- Congestion control: keep sender from overrunning network
(Figure: the application process writes bytes into TCP's send buffer; TCP transmits segments; the receiving TCP fills a receive buffer from which the application reads bytes.)

6 Flow Control vs Congestion Control
Flow Control:
- Prevent sender from overloading receiver
- An end-to-end issue
Congestion Control:
- Prevent too much data from being injected into the network
- Concerned with how hosts and the network interact

7 Data Link Reliability (text 2.5)
Wherein we look at reliability issues on a point-to-point link! Error-correcting codes can't handle all possible errors (at least not without introducing lots of overhead, and that much overhead would not be designing for the common case), so badly garbled frames are dropped. We need a way to recover from these lost frames.

8 Acks and Timeouts
Acknowledgement (ACK):
- Small frame sent to peer indicating receipt of a frame
- No data
- Piggybacking
Timeout:
- If the ACK is not received within a reasonable time, the original frame is retransmitted
Automatic Repeat Request (ARQ):
- General strategy of using ACKs and timeouts to implement reliable delivery

9 Acknowledgements & Timeouts

10 Acknowledgements & Timeouts

11 A Subtlety… Consider scenarios (c) and (d) in previous slide.
- Receiver receives two good frames (a duplicate)
- It may deliver both to the higher-layer protocol (not good!)
- Solution: a 1-bit sequence number in the frame header

12 Stop-and-Wait
Problem: keeping the pipe full.
Example:
- 1.5 Mbps link x 45 ms RTT = 67.5 Kb (about 8 KB)
- 1 KB frames imply roughly 1/8th link utilization (next slide)
- Stop-and-wait can only send one frame every 45 ms, which implies 1024 x 8 bits per 0.045 s, an effective bandwidth of 182 Kbps, or about 1/8 of the link bandwidth
(Figure: sender/receiver timeline with a single frame outstanding per RTT.)

13 Bandwidth x Delay Product
- Sending a 1 KB packet in 45 ms implies sending at a rate of (1024 x 8)/0.045 = 182 Kbps, or 1/8 of the link bandwidth.
- Bandwidth-delay product: the number of bits that fit in the pipe in a single round trip (i.e., the amount of data that could be "in transit" at any given time).
- Goal: we want to be able to send this much data before getting the first ACK (called keeping the pipe full). See the sketch below.
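A quick sketch of the arithmetic above, using only the numbers already given on the slides:

```python
link_bps = 1.5e6        # 1.5 Mbps link
rtt_s = 0.045           # 45 ms round-trip time
frame_bits = 1024 * 8   # 1 KB frame

bw_delay_bits = link_bps * rtt_s          # bits that fit "in the pipe" per RTT
effective_bps = frame_bits / rtt_s        # stop-and-wait: one frame per RTT
utilization = effective_bps / link_bps

print(f"bandwidth-delay product: {bw_delay_bits / 1000:.1f} Kb "
      f"({bw_delay_bits / 8 / 1024:.1f} KB)")
print(f"effective bandwidth: {effective_bps / 1000:.0f} Kbps "
      f"({utilization:.0%} of the link)")
```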

14 Sliding Window Allow multiple outstanding (un-ACKed) frames
- Upper bound on un-ACKed frames, called the window
(Figure: sender/receiver timeline with multiple frames in flight at once.)

15 Sliding Window: Sender
- Assign a sequence number to each frame (SeqNum)
- Maintain three state variables:
  - send window size (SWS)
  - last acknowledgment received (LAR)
  - last frame sent (LFS)
- Maintain invariant: LFS - LAR ≤ SWS
- Advance LAR when an ACK arrives
- Buffer up to SWS frames (must be prepared to retransmit frames until they are ACKed)
(Diagram: a window of size SWS spanning the frames between LAR and LFS.)
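A minimal sender sketch of these rules, assuming cumulative ACKs and a send(seq, data) callback supplied by the link layer (both assumptions for illustration, not from the slides):

```python
from collections import deque

class SlidingWindowSender:
    def __init__(self, sws: int, send):
        self.SWS = sws          # send window size
        self.LAR = -1           # last acknowledgment received
        self.LFS = -1           # last frame sent
        self.send = send
        self.unacked = deque()  # buffered (seq, frame) pairs awaiting ACK

    def can_send(self) -> bool:
        # Preserve the invariant LFS - LAR <= SWS
        return self.LFS - self.LAR < self.SWS

    def send_frame(self, data: bytes) -> int:
        assert self.can_send(), "window full: block until an ACK advances LAR"
        self.LFS += 1
        self.unacked.append((self.LFS, data))   # keep a copy for retransmission
        self.send(self.LFS, data)
        return self.LFS

    def handle_ack(self, ack_seq: int) -> None:
        # Cumulative ACK: everything up to and including ack_seq has been received.
        if ack_seq > self.LAR:
            self.LAR = ack_seq
            while self.unacked and self.unacked[0][0] <= ack_seq:
                self.unacked.popleft()
```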

16 Sliding Window: Receiver
- Maintain three state variables:
  - receive window size (RWS): upper bound on the number of out-of-order frames accepted
  - largest frame acceptable (LFA)
  - (sequence number of) last frame received (LFR)
- Maintain invariant: LFA - LFR ≤ RWS
- When frame SeqNum arrives:
  - if LFR < SeqNum ≤ LFA, accept
  - if SeqNum ≤ LFR or SeqNum > LFA, discard
- Send cumulative ACKs
(Diagram: a window of size RWS spanning the frames between LFR and LFA.)
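A matching receiver sketch under the same assumptions, with a deliver(data) callback and an in-memory dictionary standing in for the frame buffer:

```python
class SlidingWindowReceiver:
    def __init__(self, rws: int, deliver):
        self.RWS = rws
        self.LFR = -1                 # last frame received (in order)
        self.deliver = deliver
        self.buffer = {}              # out-of-order frames keyed by SeqNum

    @property
    def LFA(self) -> int:
        # Invariant: LFA - LFR <= RWS
        return self.LFR + self.RWS

    def handle_frame(self, seq: int, data: bytes) -> int:
        if self.LFR < seq <= self.LFA:
            self.buffer[seq] = data
            # Slide the window over every contiguous frame and deliver in order.
            while self.LFR + 1 in self.buffer:
                self.LFR += 1
                self.deliver(self.buffer.pop(self.LFR))
        # Otherwise discard; either way, return a cumulative ACK for LFR.
        return self.LFR
```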

17 Note: When packet loss occurs, pipe is no longer kept full!
- The longer it takes to notice a lost packet, the worse the condition becomes
- Possible solutions:
  - Send NACKs
  - Selective acknowledgements (just ACK exactly those frames received, not the highest frame received)
  - Not used: too much added complexity

18 Sequence Number Space
- The SeqNum field is finite; sequence numbers wrap around
- The sequence number space must be larger than the number of outstanding frames (e.g., stop-and-wait has a sequence space of size 2)
- Example: if the sequence number space is of size 8 (say 0..7) and 10 outstanding frames are allowed, then the sender can send sequence numbers 0,1,2,3,4,5,6,7,0,1 all at once. Now if the receiver sends back an ACK with sequence number 1, which packet 1 is it ACKing?

19 Sequence Number Space
- Even SWS < SequenceSpaceSize is not sufficient:
  - suppose a 3-bit SeqNum field (0..7), so SequenceSpaceSize = 8
  - let SWS = RWS = 7
  - sender transmits frames 0..6
  - frames arrive successfully, but the ACKs are lost
  - sender retransmits 0..6
  - receiver is expecting 7, 0..5, but receives the second incarnation of 0..5 (because the receiver has at this point updated its various pointers)
- The rule is SWS ≤ (SequenceSpaceSize + 1)/2 (if SWS = RWS)
- Intuitively, SeqNum "slides" between the two halves of the sequence number space
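A tiny check of this rule: under wraparound, ask whether a retransmitted old window can overlap the receiver's new window of acceptable sequence numbers. The lost-ACK scenario above is exactly such an overlap; this is an illustrative sketch, not a full simulation:

```python
def windows_can_alias(space: int, sws: int, rws: int) -> bool:
    old_window = {seq % space for seq in range(0, sws)}            # frames 0..SWS-1 resent
    new_window = {seq % space for seq in range(sws, sws + rws)}    # receiver now expects these
    return bool(old_window & new_window)

print(windows_can_alias(space=8, sws=7, rws=7))  # True: retransmissions look like new frames
print(windows_can_alias(space=8, sws=4, rws=4))  # False: the two halves never overlap
```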

20 Easy to overlook…
The relationship between window size and sequence number space depends on the assumption that frames are not reordered in transit (easy to assume on a point-to-point link).

21 Back to Chapter 5…

22 Data Link Versus Transport
- Transport potentially connects many different hosts
  - need explicit connection establishment and termination
- Transport has potentially different RTTs (over different routes and at different times, even on the scale of minutes)
  - need an adaptive timeout mechanism
- Transport has potentially long delays in the network
  - need to be prepared for the arrival of very old packets
- Transport has potentially different capacity at the destination
  - need to accommodate different node capacities
- Transport has potentially different network capacity
  - need to be prepared for network congestion
Notes: It's not really out-of-order packets that get you here, because sliding window can handle those, up to a point; it's that on the Internet packets can arrive REALLY late, confounding sliding window. The capacity comment means that on a point-to-point link you can tune the endpoints so buffer sizes work well with the link bandwidth; on the Internet you simply don't know what buffer sizes and other parameters you may run up against, so TCP needs a way to learn what resources the other side can apply to the connection. As for network capacity, on a point-to-point link the sending side is the only one using the link and knows its capacity; on the Internet the sender has no way of knowing the bandwidths of the links its data will travel over, and it isn't the only one transmitting over many of those links.

23 The “End-to-End” Argument
Consider TCP vs. X.25:
- TCP: consider the underlying IP network unreliable and use sliding window to provide end-to-end, in-order, reliable delivery
- X.25: use sliding window within the network on a hop-by-hop basis (which supposedly guarantees end-to-end reliability). Several problems with this:
  - No guarantee that an added hop preserves the service
  - In a path from A to B to C, there is no guarantee that B behaves perfectly (nodes are known to introduce errors and mix up packet order)

24 End-to-End
"A function should not be provided in the lower levels of the system unless it can be completely and correctly implemented at that level."
- Does allow for functions to be incompletely provided at lower levels as an optimization
- E.g., detecting and retransmitting a single corrupt packet across one hop is preferable to retransmitting an entire file end-to-end
- See the reading assignment on the class homework page

25 Segment Format
(Figure: TCP segment header format.)

26 Segment Format (cont) Each connection identified with 4-tuple:
(SrcPort, SrcIPAddr, DestPort, DestIPAddr)
- Sliding window and flow control: Acknowledgment, SequenceNum, and AdvertisedWindow fields
- Flags: SYN, FIN, RESET, PUSH, URG, ACK
- Checksum: pseudo header + TCP header + data
(Figure: the sender sends Data (SequenceNum); the receiver responds with Acknowledgment + AdvertisedWindow.)

27 Connection Establishment and Termination
(Figure: three-way handshake between the active participant (client) and the passive participant (server):
1. Client → Server: SYN, SequenceNum = x
2. Server → Client: SYN + ACK, SequenceNum = y, Acknowledgment = x + 1
3. Client → Server: ACK, Acknowledgment = y + 1)
Note: SequenceNum contains the sequence number of the first data byte contained in the segment. The ACK field always gives the sequence number of the next data byte expected (except for the SYN segments).
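A toy, in-memory walkthrough of the handshake sequence numbers, assuming arbitrary initial values x and y; a sketch of the exchange above, not real TCP:

```python
import random

def three_way_handshake():
    x = random.randrange(2**32)   # client's initial sequence number
    y = random.randrange(2**32)   # server's initial sequence number

    syn     = {"flags": {"SYN"},        "SequenceNum": x}
    syn_ack = {"flags": {"SYN", "ACK"}, "SequenceNum": y, "Acknowledgment": x + 1}
    ack     = {"flags": {"ACK"},                          "Acknowledgment": y + 1}

    # The SYN consumes one sequence number, which is why the ACKs carry x+1 and y+1.
    assert syn_ack["Acknowledgment"] == syn["SequenceNum"] + 1
    assert ack["Acknowledgment"] == syn_ack["SequenceNum"] + 1
    return syn, syn_ack, ack

for segment in three_way_handshake():
    print(segment)
```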

28 State Transition Diagram
(Figure: TCP state transition diagram with states CLOSED, LISTEN, SYN_RCVD, SYN_SENT, ESTABLISHED, CLOSE_WAIT, LAST_ACK, CLOSING, TIME_WAIT, FIN_WAIT_1, and FIN_WAIT_2. Edges are labeled event/action, e.g. passive open, active open /SYN, send /SYN, SYN/SYN + ACK, SYN + ACK/ACK, ACK, close /FIN, FIN/ACK, ACK + FIN/ACK, and a timeout after two segment lifetimes out of TIME_WAIT; the top of the diagram covers opening a connection, the bottom closing one.)

29 Sliding Window Revisited
(Figure: the sending application writes into TCP's send buffer, tracked by LastByteWritten, LastByteSent, and LastByteAcked; the receiving application reads from TCP's receive buffer, tracked by LastByteRead, NextByteExpected, and LastByteRcvd.)
Sending side:
- LastByteAcked ≤ LastByteSent
- LastByteSent ≤ LastByteWritten
- buffer bytes between LastByteAcked and LastByteWritten
Receiving side:
- LastByteRead < NextByteExpected
- NextByteExpected ≤ LastByteRcvd + 1
- buffer bytes between LastByteRead and LastByteRcvd

30 Flow Control Send buffer size: MaxSendBuffer
Receive buffer size: MaxRcvBuffer
Receiving side:
- LastByteRcvd - LastByteRead ≤ MaxRcvBuffer
- AdvertisedWindow = MaxRcvBuffer - (LastByteRcvd - LastByteRead)
Sending side:
- LastByteSent - LastByteAcked ≤ AdvertisedWindow
- EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)
- LastByteWritten - LastByteAcked ≤ MaxSendBuffer
- block sender if (LastByteWritten - LastByteAcked) + y > MaxSendBuffer, where y is the number of bytes to send for the next message
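A minimal sketch of this flow-control arithmetic, with the byte counters collected into one illustrative object (not how a real stack organizes them):

```python
class TCPFlowControlState:
    def __init__(self, max_send_buffer: int, max_rcv_buffer: int):
        self.MaxSendBuffer = max_send_buffer
        self.MaxRcvBuffer = max_rcv_buffer
        # Sending-side counters
        self.LastByteWritten = 0
        self.LastByteSent = 0
        self.LastByteAcked = 0
        # Receiving-side counters
        self.LastByteRcvd = 0
        self.LastByteRead = 0

    def advertised_window(self) -> int:
        # Receiver advertises whatever buffer space is still free.
        return self.MaxRcvBuffer - (self.LastByteRcvd - self.LastByteRead)

    def effective_window(self, advertised_window: int) -> int:
        # Sender may put at most this many additional bytes in flight.
        return advertised_window - (self.LastByteSent - self.LastByteAcked)

    def sender_must_block(self, y: int) -> bool:
        # Block the writing application if y more bytes would overflow the send buffer.
        return (self.LastByteWritten - self.LastByteAcked) + y > self.MaxSendBuffer
```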

31 Flow Control Always send ACK in response to arriving data segment
- This response contains the latest Acknowledgment and AdvertisedWindow fields, even if they haven't changed
- Problem: how does the sending side know when the advertised window is no longer 0? It can't get this information, since the receiver only sends window advertisements in response to received segments, and the sender can't send anything because it believes the window size is zero.
- Solution: persist when AdvertisedWindow = 0
  - Periodically send a probe segment with one byte of data. Although most won't be accepted, they trigger responses, and eventually one will come back with a nonzero advertised window.
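A schematic sketch of that persist behaviour; send_probe, wait_for_ack, and the ack.advertised_window attribute are hypothetical placeholders, and a real implementation also backs the probe interval off:

```python
import time

def persist_until_window_opens(send_probe, wait_for_ack, interval_s: float = 5.0) -> int:
    """Keep probing a zero window until an ACK advertises space; return the new window."""
    while True:
        send_probe(num_bytes=1)                   # probe segment carrying one byte of data
        ack = wait_for_ack(timeout_s=interval_s)  # hypothetical: returns None on timeout
        if ack is not None and ack.advertised_window > 0:
            return ack.advertised_window          # sender can resume normal transmission
        time.sleep(interval_s)                    # window still closed; probe again later
```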

32 Protection Against Wrap Around
32-bit SequenceNum
Bandwidth            Time Until Wrap Around
T1 (1.5 Mbps)        6.4 hours
Ethernet (10 Mbps)   57 minutes
T3 (45 Mbps)         13 minutes
FDDI (100 Mbps)      6 minutes
STS-3 (155 Mbps)     4 minutes
STS-12 (622 Mbps)    55 seconds
STS-24 (1.2 Gbps)    28 seconds
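A quick sketch reproducing the wrap-around column: a 32-bit sequence number space covers 2^32 bytes, so the wrap time is simply how long the link takes to send that many bytes:

```python
links_bps = {
    "T1 (1.5 Mbps)": 1.5e6,
    "Ethernet (10 Mbps)": 10e6,
    "T3 (45 Mbps)": 45e6,
    "FDDI (100 Mbps)": 100e6,
    "STS-3 (155 Mbps)": 155e6,
    "STS-12 (622 Mbps)": 622e6,
    "STS-24 (1.2 Gbps)": 1.2e9,
}

for name, bps in links_bps.items():
    seconds = (2**32 * 8) / bps          # bytes of sequence space -> bits / link rate
    print(f"{name:22s} {seconds / 60:8.1f} minutes")
```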

33 Keeping the Pipe Full 16-bit AdvertisedWindow
Results below assume an RTT of 100 ms, typical for a cross-country link.
Bandwidth            Delay x Bandwidth Product
T1 (1.5 Mbps)        18 KB
Ethernet (10 Mbps)   122 KB
T3 (45 Mbps)         549 KB
FDDI (100 Mbps)      1.2 MB
STS-3 (155 Mbps)     1.8 MB
STS-12 (622 Mbps)    7.4 MB
STS-24 (1.2 Gbps)    14.8 MB

34 TCP Extensions Implemented as header options
- Store a timestamp in outgoing segments
- Extend the sequence space with the 32-bit timestamp: PAWS (Protection Against Wrapped Sequence Numbers)
- Shift (scale) the advertised window

35 Adaptive Retransmission (Original Algorithm)
- Measure SampleRTT for each segment/ACK pair
- Compute a weighted average of the RTT: EstRTT = a x EstRTT + (1 - a) x SampleRTT
  - a between 0.8 and 0.9 (recommended value 0.9)
  - Note: a in this range has a strong smoothing effect
- Set the timeout based on EstRTT: TimeOut = 2 x EstRTT (rather conservative)
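A small sketch of this estimator, assuming RTTs measured in seconds and an arbitrary initial estimate of 1 s (both assumptions for illustration):

```python
def make_original_estimator(alpha: float = 0.9, initial_rtt_s: float = 1.0):
    est_rtt = initial_rtt_s

    def on_sample(sample_rtt_s: float) -> float:
        """Fold one SampleRTT into EstRTT and return the new retransmit timeout."""
        nonlocal est_rtt
        est_rtt = alpha * est_rtt + (1 - alpha) * sample_rtt_s
        return 2 * est_rtt          # conservative: timeout at twice the estimate

    return on_sample

timeout = make_original_estimator()
for sample in (0.8, 0.9, 2.0, 0.85):      # made-up RTT samples
    print(f"SampleRTT={sample:.2f}s -> TimeOut={timeout(sample):.2f}s")
```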

36 Karn/Partridge Algorithm
(Figure: two sender/receiver timelines. In one, the original transmission is lost and the ACK follows the retransmission; in the other, the ACK for the original transmission arrives only after the retransmission has been sent. Either way, it is ambiguous which transmission the SampleRTT should be measured against.)
- Problem: an ACK doesn't acknowledge a transmission (it acks a receive)
- Do not sample RTT when retransmitting
- Double the timeout after each retransmission (exponential backoff). Why?
Note: this means the new timeout value is based on this doubling rather than on the old EstimatedRTT formula.
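A sketch of the Karn/Partridge refinements layered on the estimator above: discard RTT samples for retransmitted segments and double the timeout on each retransmission. The 1 s initial estimate is again an assumption:

```python
def make_karn_partridge(alpha: float = 0.9, initial_rtt_s: float = 1.0):
    est_rtt = initial_rtt_s
    timeout = 2 * est_rtt

    def on_ack(sample_rtt_s: float, was_retransmitted: bool) -> float:
        """Update the timeout from an ACK; samples for retransmitted segments are discarded."""
        nonlocal est_rtt, timeout
        if not was_retransmitted:
            est_rtt = alpha * est_rtt + (1 - alpha) * sample_rtt_s
            timeout = 2 * est_rtt
        return timeout

    def on_timeout() -> float:
        """Exponential backoff: double the timeout each time a retransmission is needed."""
        nonlocal timeout
        timeout *= 2
        return timeout

    return on_ack, on_timeout
```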

37 A Problem
- Problem with both these approaches: they can't keep up with wide RTT fluctuations, thus causing unnecessary retransmissions
- When the network is already loaded, unnecessary retransmissions add to the network load (as Stevens notes, "It is the network equivalent of pouring gasoline on a fire")
- What's needed: keep track of the variance in the RTT measurements AND use a smoothed RTT estimator

38 Jacobson/ Karels Algorithm
New calculations for the average RTT:
- Diff = SampleRTT - EstRTT
- EstRTT = EstRTT + (g x Diff)
  - Recommended value for g is 0.125
  - EstRTT is just the smoothed RTT as before
- Dev = Dev + h x (|Diff| - Dev)
  - Recommended value for h is 0.25
  - Dev is the smoothed mean deviation (it's easier to compute the mean deviation than the standard deviation, which requires a square root)
- TimeOut = EstRTT + 4 x Dev
  - The larger gain for the deviation makes the TimeOut value increase faster when the RTT changes
  - Note these values?
Notes:
- The algorithm is only as good as the granularity of the clock (500 ms on Unix)
- An accurate timeout mechanism is important to congestion control (later)
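A sketch of the Jacobson/Karels computation with the slide's recommended gains; the initial estimate is again an assumed 1 s:

```python
def make_jacobson_karels(initial_rtt_s: float = 1.0, g: float = 0.125, h: float = 0.25):
    est_rtt = initial_rtt_s
    dev = 0.0

    def on_sample(sample_rtt_s: float) -> float:
        """Fold in one SampleRTT and return the new retransmit timeout."""
        nonlocal est_rtt, dev
        diff = sample_rtt_s - est_rtt
        est_rtt = est_rtt + g * diff
        dev = dev + h * (abs(diff) - dev)
        return est_rtt + 4 * dev

    return on_sample

timeout = make_jacobson_karels()
for sample in (0.8, 0.9, 2.0, 0.85):      # made-up RTT samples
    print(f"SampleRTT={sample:.2f}s -> TimeOut={timeout(sample):.2f}s")
```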

39 TCP Interactive Data Flow
Material here is from TCP/IP Illustrated, Vol. 1 (study by Caceres et al., 1991):
- On a packet-count basis, about half of all TCP segments contain bulk data (ftp, electronic mail, Usenet news) and half contain interactive data (telnet, rlogin)
- On a byte-count basis, the ratio is around 90% bulk transfer, 10% interactive
- Bulk data tends to be full size (normally 512 bytes of data); interactive data is much smaller (90% of telnet and rlogin packets carry less than 10 bytes of data)

40 Rlogin and Telnet
Surprisingly, each interactive keystroke typically generates a packet (as opposed to a line generating a packet). Moreover, a single rlogin keystroke can generate 4 segments (though usually 3):
- Interactive keystroke from client
- ACK of keystroke from server (typically piggybacked on the echo of the data byte; see next slide)
- Echo of the data byte from server
- ACK of the echoed byte from client

41 Delayed ACKs
- Normally, TCP does not send an ACK the instant it receives data. Instead, it delays the ACK, hoping to have data going in the other direction on which it can piggyback the ACK.
- Most implementations use a 200 ms delay (TCP delays the ACK up to 200 ms before sending the ACK by itself)
- This is why, on the previous slide, the ACK would normally piggyback with the echoed character

42 Nagle Algorithm
- A 1-byte data segment generates a 41-byte packet (20 bytes of IP header + 20 bytes of TCP header + 1 byte of data). Such small packets are called tinygrams.
- On LANs this is usually not an issue, but on WANs it can be a problem (it adds congestion)
- Solution: the Nagle algorithm (RFC 896, Nagle, 1984): when a TCP connection has outstanding data that has not yet been ACKed, small segments cannot be sent until the outstanding data is acknowledged (see the sketch below)
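A schematic sketch of that decision rule (not an actual kernel implementation); the MSS value in the example is illustrative:

```python
def nagle_can_send(segment_len: int, mss: int, unacked_bytes: int,
                   nagle_enabled: bool = True) -> bool:
    """Return True if this segment may be transmitted now."""
    if not nagle_enabled:
        return True                       # TCP_NODELAY behaviour: always send immediately
    if segment_len >= mss:
        return True                       # full-sized segments always go out
    return unacked_bytes == 0             # small segment: only if nothing is outstanding

# Example: a 1-byte keystroke with earlier data still unACKed is held back and coalesced.
print(nagle_can_send(segment_len=1, mss=536, unacked_bytes=10))   # False -> hold
print(nagle_can_send(segment_len=1, mss=536, unacked_bytes=0))    # True  -> send now
```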

43 Nagle Algorithm (continued)
- Nagle is self-clocking: the faster the ACKs come back, the faster the data is sent. But on a slow WAN, where tinygrams can be a problem, fewer segments are sent.
- Example: on a LAN, the time for a single byte to be sent, ACKed, and echoed is around 16 ms. To generate data faster than that, you would need to be typing around 60 characters per second, so on a LAN Nagle rarely kicks in.
- On a WAN, you'll often kick in Nagle.

44 Disabling the Nagle Algorithm
Why would you want to?
- X Window system: small messages (mouse movements) need to be delivered without delay
- Typing one of the terminal's special function keys during an interactive login
  - Function keys normally generate multiple bytes of data, beginning with the ASCII escape character. If TCP gets the data a byte at a time, it can potentially send the first byte and then hold the rest of the characters. The server wouldn't generate the ACK until it received the rest of the command, so Nagle would kick in, meaning the rest of the bytes aren't sent for up to 200 ms, which can be a noticeable delay.
- With the sockets API, the TCP_NODELAY option disables Nagle (see the sketch below)
- The Host Requirements RFCs (1122, 1123) specify that there must be a way for an application to disable Nagle on an individual TCP connection
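A minimal sketch of doing this from an application with the sockets API; TCP_NODELAY is the standard option, while the commented-out peer address is just a placeholder:

```python
import socket

# Create the socket and turn Nagle off before (or after) connecting.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)       # disable the Nagle algorithm
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))   # nonzero -> Nagle is disabled

# sock.connect(("x-server.example", 6000))   # placeholder peer, e.g. a latency-sensitive service
# sock.sendall(b"\x1b[21~")                  # multi-byte function-key sequence goes out at once
sock.close()
```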

45 TCP Teardown

