How to use it. Press Space to advance the slide animation. Don’t hurry to press Space again: wait for the animation to end. If you want to go back, use the «PgUp» key. Version: 08 June 1999. Come back later - the presentation is under construction now.
Encapsulation of data into an Ethernet frame. User data gets an application header and becomes application data. TCP adds a TCP header, which gives a TCP segment. IP adds an IP header, which gives an IP datagram. The Ethernet driver adds an Ethernet header and an Ethernet trailer, which gives an Ethernet frame carrying 46 to 1500 bytes.
IEEE 802.2/802.3 Encapsulation (RFC 1042). Frame fields (sizes in bytes): destination address 6, source address 6, length 2 (802.3 MAC); DSAP 0xAA 1, SSAP 0xAA 1, cntl 03 1 (802.2 LLC); org code 00 3, type 2 (SNAP); DATA; CRC 4. The data is an IP datagram, an ARP request/reply (28 bytes + 10 bytes PAD) or a RARP request/reply (28 bytes + 10 bytes PAD). The TYPE field identifies the data that follows. For example, type 0x0800 (hex) means that an IP datagram follows. LENGTH contains the length of the packet from the next byte up to the CRC (the CRC isn’t included). DSAP (Destination Service Access Point) and SSAP (Source Service Access Point) are both set to 0xAA. CNTL (control field) is set to 3. ORG CODE is always 0 in all 3 bytes.
Ethernet Encapsulation (RFC 894). Frame fields (sizes in bytes): destination address 6, source address 6, type 2, DATA, CRC 4. The data is an IP datagram, an ARP request/reply (28 bytes + 18 bytes PAD) or a RARP request/reply (28 bytes + 18 bytes PAD).
IP packet structure. Header fields: 4-bit version, 4-bit IHL, TOS, 16-bit total packet length, 16-bit identification, 3-bit flags, 13-bit fragment offset, TTL, protocol, header checksum, source address, destination address, options (+ padding); then DATA. VERSION - the current protocol version is 4. IHL - IP header length, the number of 32-bit words in the IP header. This field is 4 bits long => the maximum header length is 60 bytes. TOS - type of service consists of 3 precedence bits (ignored), 4 TOS bits and an unused bit which must be 0. The 4 TOS bits are: minimize delay, maximize throughput, maximize reliability, minimize monetary cost. Only 1 of these 4 bits can be turned on. TPL - total packet length is the length of the whole IP packet in bytes. The field is 16 bits long, so the maximum length of an IP packet is 65535 bytes. IDENTIFICATION - this field is used when IP needs to fragment datagrams. It identifies each datagram and is incremented each time a datagram is sent. We’ll see the meaning of this field, and of FLAGS and FRAGMENT OFFSET, when we talk about fragmentation. Continue...
IP packet structure (continued). TTL - time-to-live sets an upper limit on the number of routers through which a datagram can pass. This field is decremented each time the datagram passes through a router. When the field becomes 0, the router drops the datagram and sends an ICMP message to the datagram’s sender. PROTOCOL - this field identifies the DATA portion of the datagram (which protocol is encapsulated in the IP datagram). HEADER CHECKSUM is calculated over the IP header only. SOURCE and DESTINATION addresses are the sender’s and receiver’s IP addresses. OPTIONS is a variable-length field which contains some options. We’ll discuss some of them later. The options field always ends on a 32-bit boundary. PAD bytes (value 0) are added if necessary. DATA is data.
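The field layout from the two slides above can be checked with a short decoder. This is only a sketch: the function name `parse_ipv4_header` and the sample addresses are made up for illustration, and options are not decoded.

```python
import socket
import struct

def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header (options are ignored)."""
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,               # upper 4 bits
        "header_len": (ver_ihl & 0x0F) * 4,    # IHL counts 32-bit words
        "tos": tos,
        "total_len": total_len,
        "identification": ident,
        "flags": flags_frag >> 13,             # upper 3 bits
        "frag_offset": flags_frag & 0x1FFF,    # lower 13 bits
        "ttl": ttl,
        "protocol": proto,
        "checksum": checksum,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# Hypothetical sample: version 4, IHL 5, DF flag set, TTL 64, protocol 6 (TCP).
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0x4000, 64, 6, 0,
                     socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"))
header = parse_ipv4_header(sample)
```

With IHL at its maximum value of 15 the same arithmetic gives the 60-byte limit mentioned on the slide.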
Special case IP addresses. IP address classes (class and address range): A, B, C, D (multicast), E.
ARP and RARP. ARP maps a 32-bit IP address to a 48-bit Ethernet address; RARP maps a 48-bit Ethernet address to a 32-bit IP address. ARP: for example, we are working on an Ethernet network. The Ethernet driver and adapter use MAC addresses. TCP/IP uses IP addresses. When a host wants to send data to another host, it knows only the receiver’s IP address and passes this information to the TCP/IP stack. The TCP/IP stack then needs a mechanism that gives the correspondence between MAC and IP addresses. IP has two algorithms to solve this. RARP: if a system doesn’t have a hard or floppy drive and must boot from the network, it can’t take its IP address from local resources. Such a system has only a MAC address. RARP is the algorithm which allows the system to obtain an IP address from the network.
ARP. Sending host: IP wants to send an IP datagram to an IP address, so the IP address must be resolved to a hardware address. Do I know the hardware address? Yes - pass the datagram to the Ethernet driver. No - send an ARP request through the Ethernet driver. Every host that receives the request asks: is somebody looking for my address? No - ignore the request. Yes - send an ARP reply.
RARP. A diskless workstation boots, reads its own hardware network address and sends a RARP request. RARP server: somebody wants to have an IP address! Give somebody an IP address from my table and send a RARP reply. Workstation: I have an IP address!!!
ARP packet. Fields (sizes in bytes): Ethernet dest address 6, Ethernet source address 6, frame type 2, hard type 2, prot type 2, hard size 1, prot size 1, op 2, sender Ethernet address 6, sender IP address 4, target Ethernet address 6, target IP address 4. Frame type - 0x0806. Hardware type - specifies the type of hardware address; 1 for Ethernet. Protocol type - 0x0800 for IP. Hardware size - size of the hardware address; 6 for Ethernet. Protocol size - size of the protocol address; 4 for IP. Op - type of operation (request or reply): ARP request - 1, ARP reply - 2, RARP request - 3, RARP reply - 4. For a request the Ethernet dest address is the broadcast address.
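As an illustration of the field sizes above, an ARP request frame can be assembled like this. It is a sketch only: `build_arp_request`, the MAC address and the IP addresses are made-up example values, and the 18 bytes of padding and the CRC are left to the network adapter.

```python
import socket
import struct

ETH_BROADCAST = b"\xff" * 6

def build_arp_request(sender_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Return the 14-byte Ethernet header plus the 28-byte ARP request."""
    ethernet = struct.pack("!6s6sH", ETH_BROADCAST, sender_mac, 0x0806)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                              # hardware type: Ethernet
        0x0800,                         # protocol type: IP
        6,                              # hardware address size
        4,                              # protocol address size
        1,                              # op: ARP request
        sender_mac,
        socket.inet_aton(sender_ip),
        b"\x00" * 6,                    # target Ethernet address is not known yet
        socket.inet_aton(target_ip),
    )
    return ethernet + arp

frame = build_arp_request(b"\x00\x11\x22\x33\x44\x55", "192.168.1.10", "192.168.1.1")
```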
ICMP - Internet Control Message Protocol (RFC 792). Packet structure: a 20-byte IP header followed by the ICMP message. The ICMP message starts with an 8-bit type, an 8-bit code and a 16-bit checksum (for the entire ICMP message); these are the same for all types of messages. The remaining contents depend on type and code.
ICMP address mask request and reply (12 bytes): type 17 - request, 18 - reply; code 0; 16-bit checksum (for the entire ICMP message); identifier (anything); sequence number (anything); subnet mask. ICMP timestamp request and reply (20 bytes): type 13 - request, 14 - reply; code 0; 16-bit checksum (for the entire ICMP message); identifier (anything); sequence number (anything); 32-bit originate timestamp; 32-bit receive timestamp; 32-bit transmit timestamp.
ICMP port unreachable error. Layout of the frame (sizes in bytes): Ethernet header 14, IP header 20, ICMP header 8, then the data portion of the ICMP message: the IP header of the datagram that generated the error 20 and the UDP header 8. The ICMP message must include the IP header of the datagram that generated the error and at least the 8 bytes that followed this IP header. In this example that is the UDP header. General format of an ICMP unreachable message: type 3, code, 16-bit checksum (for the entire ICMP message), unused (must be 0), which together make the 8-byte ICMP header, followed by the IP header including options + the first 8 bytes of the original IP datagram data.
ICMP echo request and echo reply (PING). Client: I want to know whether the server is alive - send an echo request. Server: I received a “ping” to my address - answer the client, send an echo reply. Client: the server is alive. Packet: type 0 - reply, 8 - request; code 0; 16-bit checksum (for the entire ICMP message); identifier; sequence number (8 bytes so far); optional data. The server must echo the identifier and sequence number fields. Historically ping has operated in a mode where it sends an echo request once a second. Identifier - the process ID of the sending process. Sequence number - starts at 0 and is incremented every time a new echo request is sent.
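The 16-bit checksum over the entire ICMP message is the usual Internet checksum (one's complement of the one's complement sum of 16-bit words). A minimal sketch of building an echo request follows; the helper names and the identifier value are made up for the example.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """One's complement of the one's complement sum of all 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of words
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_echo_request(identifier: int, sequence: int, payload: bytes = b"") -> bytes:
    """Type 8, code 0; the checksum is computed with the checksum field set to 0."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload

packet = build_echo_request(identifier=0x1234, sequence=0, payload=b"ping")
```

A receiver can verify the message by running the same checksum over everything it received: the result is 0 when the message is intact.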
IP record route option (ping -r option). The client sends an echo request with the -r option; the packet passes Router 1, Router 2 and Router 3 on its way to the server, and the server sends an echo reply. IP option in the packet (sizes in bytes): code 1, len 1, ptr 1, IP addr R1 4, IP addr R2 4, IP addr R3 4, IP addr of server 4 (its incoming interface). Code - a 1-byte field specifying the type of IP option. For the RR option its value is 7. Len - the total number of bytes of the RR option. Ping always provides a 39-byte option, to record up to 9 IP addresses - the maximum. Room for the list of IP addresses is limited, because the entire IP header is limited to 15 32-bit words (60 bytes). There are only up to 40 bytes for the options field in the IP header. Ptr - points to the place where the next IP address will be stored. Routers put the IP addresses of their outgoing interfaces into the RR option.
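The limit of 9 recorded addresses follows from the header size arithmetic. A small sketch, assuming a 20-byte base IP header and an initial pointer value of 4 (the first byte after code, len and ptr); the function names are illustrative:

```python
MAX_IP_HEADER = 15 * 4     # IHL is 4 bits: at most 15 32-bit words
BASE_IP_HEADER = 20        # header without options
RR_FIXED_BYTES = 3         # code, len, ptr

def record_route_capacity() -> tuple:
    """Return (number of address slots, option length in bytes)."""
    room = MAX_IP_HEADER - BASE_IP_HEADER          # 40 bytes for options
    slots = (room - RR_FIXED_BYTES) // 4           # each IP address takes 4 bytes
    return slots, RR_FIXED_BYTES + 4 * slots

def build_empty_rr_option(slots: int = 9) -> bytes:
    """RR option with code 7, all address slots zeroed, ptr at the first slot."""
    return bytes([7, RR_FIXED_BYTES + 4 * slots, 4]) + bytes(4 * slots)

capacity = record_route_capacity()
option = build_empty_rr_option()
```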
BROADCASTING. Four types of IP broadcast (name, address, description).
Limited: 255.255.255.255. The limited broadcast is never forwarded by a router.
Net-directed: netid with host ID all 1 bits. Routers forward this kind of broadcast. The broadcast is addressed to the IP network netid.
Subnet-directed: host ID all 1 bits. A broadcast for a specific subnet. Knowledge of the subnet mask is required.
All-subnets-directed: subnet ID all 1 bits, host ID all 1 bits. Knowledge of the subnet mask is required. If the network is subnetted, this is an all-subnets-directed broadcast. If the network isn’t subnetted, this is a net-directed broadcast.
MULTICASTING. Addressing: class D, 224.0.0.0 to 239.255.255.255, multicast (do you remember?). The set of hosts listening to a particular IP multicast address is called a host group. A host group can span multiple networks. Membership in a host group is dynamic - hosts may join and leave a host group at will. There is no restriction on the number of hosts in a host group, and a host does not have to belong to a group to send a message to that group. Here is the format of a class D IP address: the first four bits are 1110, followed by a 28-bit multicast group ID. Note! On Ethernet a multicast address has the low-order bit of the first byte set: 01:00:00:00:00:00.
MULTICASTING. Converting multicast group addresses to Ethernet addresses. The Ethernet addresses corresponding to IP multicasting are in the range 01:00:5e:00:00:00 through 01:00:5e:7f:ff:ff. We have 23 bits in the 48-bit Ethernet address to correspond to the IP multicast group ID. The mapping copies the low-order 23 bits of the multicast group ID from the class D IP address into these 23 bits of the Ethernet address. The upper 5 bits of the multicast group ID are not used to form the Ethernet address. Since these 5 bits are ignored in the mapping, it is not unique: 32 different multicast group IDs map to the same Ethernet address (binary 11111 = 31). The device driver or the IP software must perform filtering, since the interface card may receive multicast frames in which the host is really not interested.
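The 23-bit mapping can be written in a few lines. This is a sketch with an illustrative function name; it also shows two group addresses that collide on the same Ethernet address because their upper 5 group ID bits are dropped.

```python
import socket
import struct

def multicast_ip_to_mac(ip: str) -> str:
    """Map a class D IP address to its Ethernet multicast address."""
    addr = struct.unpack("!I", socket.inet_aton(ip))[0]
    if addr >> 28 != 0b1110:
        raise ValueError("not a class D (multicast) address: " + ip)
    low23 = addr & 0x7FFFFF                  # keep the low-order 23 bits only
    mac = 0x01005E000000 | low23             # prefix 01:00:5e, bit 23 stays 0
    return ":".join("%02x" % ((mac >> shift) & 0xFF) for shift in range(40, -8, -8))

mac_a = multicast_ip_to_mac("224.0.0.1")
mac_b = multicast_ip_to_mac("224.128.0.1")   # differs only in the ignored bits
mac_top = multicast_ip_to_mac("239.255.255.255")
```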
IGMP (Internet Group Management Protocol) reports and queries. The host keeps a table of the multicast groups in which its processes participate, and the router keeps a list of the multicast groups present on each interface. When a process joins a group, the host sends an IGMP report in which the destination IP address and the group address are both the group’s IP address, and then sends another IGMP report. When the router’s timer expires, it sends an IGMP query with the group address set to 0. The host answers with an IGMP report for each group that still has members, so the router knows that the group is alive. When the last process leaves group 2, the host doesn’t report group 2 next time and reports the remaining group only.
IGMP packet. An IP datagram carries a 20-byte IP header and an 8-byte IGMP message. IGMP message: IGMP version (1), IGMP type (1-2), unused, 16-bit checksum, 32-bit group address (class D IP address). Version - 1. Type - 1 is a query sent by a multicast router, 2 is a response sent by a host. Group address - a class D IP address. For a query the address is set to 0.
UDP packet. Header fields: source port, destination port, UDP length, UDP checksum; then DATA (if any).
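TFTP - Trivial File Transfer Protocol. A TFTP message is carried in a UDP datagram inside an IP datagram: IP header 20 bytes, UDP header 8 bytes, TFTP message. Packet types: Requests - 2-byte opcode (1 = RRQ, 2 = WRQ), N-byte filename, 1 byte of 0, N-byte mode, 1 byte of 0. Data packet - 2-byte opcode (3 = data), 2-byte block number, data. ACK packet - 2-byte opcode (4 = ACK), 2-byte block number. Error packet - 2-byte opcode (5 = error), 2-byte error number, N-byte error message, 1 byte of 0. Mode is netascii or octet.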
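The packet formats above translate directly into byte strings. A sketch with made-up helper names; it assumes the usual TFTP rule that a data packet with fewer than 512 data bytes is the last one.

```python
import struct

def tftp_rrq(filename: str, mode: str = "octet") -> bytes:
    """Read request: opcode 1, filename, 0, mode, 0."""
    return (struct.pack("!H", 1) + filename.encode("ascii") + b"\x00"
            + mode.encode("ascii") + b"\x00")

def tftp_data(block: int, data: bytes) -> bytes:
    """Data packet: opcode 3, block number, up to 512 bytes of data."""
    if len(data) > 512:
        raise ValueError("a TFTP data packet carries at most 512 bytes")
    return struct.pack("!HH", 3, block) + data

def tftp_ack(block: int) -> bytes:
    """ACK packet: opcode 4, block number."""
    return struct.pack("!HH", 4, block)

def is_last_block(data_packet: bytes) -> bool:
    """A data packet with fewer than 512 data bytes ends the transfer."""
    return len(data_packet) - 4 < 512

request = tftp_rrq("File")
first = tftp_data(1, b"x" * 512)
last = tftp_data(2, b"x" * 356)
```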
TFTP operations. Client: I need file “File” from the server. It sends a read request for “File”: opcode 1, dest UDP port 69, source UDP port chosen by the application. Server process: can the file be read by the client? YES - file transfer: opcode 3, block number 1, 512 bytes. The dest UDP port is the application’s port; the source UDP port is a new port number appointed for this file transfer by the TFTP server. These port numbers will be used during the file transfer. The client receives block 1 and sends ACK: opcode 4, block number 1. File transfer: opcode 3, block number 2, 356 bytes (the last block of “File”). A data size smaller than 512 bytes marks the last block of the file. The client receives block 2 and sends ACK: opcode 4, block number 2. To write a file the client sends the WRQ. If all is OK, the server responds with an ACK with block number 0. And so on. Error messages: the server responds with this type of packet if a read request or write request can’t be processed. A read or write error during file transmission can also cause this message to be sent, and the transmission is then terminated.
BOOTP: Bootstrap Protocol. BOOTP packet format: a 300-byte BOOTP request/reply is carried in a UDP datagram (UDP header 8 bytes) inside an IP datagram (IP header 20 bytes).
BOOTP datagram (300 bytes). Fields: opcode, hardware type, hardware address length, hop count, transaction ID, number of seconds, unused, client IP address, your IP address, server IP address, gateway IP address, client hardware address (16 bytes), server hostname (64 bytes), boot filename (128 bytes), vendor-specific information (64 bytes). Opcode - 1 is a request, 2 is a reply. H type - 1 for Ethernet. H addr length - 6 for Ethernet. Hop count - set to 0 by the client. Trans ID - set by the client and returned by the server. Number of seconds - set by the client. Client IP - set by the client; if the client doesn’t have an address => 0. Your IP - filled in by the server with the client’s IP address. Server IP - filled in by the server. Gateway IP - filled in by a proxy server, if there is one. Client H address - must be set by the client. Server hostname - a null-terminated string that is optionally filled in by the server. Boot filename - the fully qualified, null-terminated pathname of a file to bootstrap from.
BOOTP. Port numbers: server 67, client 68. Vendor-specific information: if information is provided in the vendor-specific field, the first 4 bytes of this area are set to the IP address 99.130.83.99. This is called the magic cookie. The rest of the area is a list of items; most items have a 1-byte tag, a 1-byte length and a value. Examples: Pad - tag 0. Subnet mask - tag 1, length 4, subnet mask (4 bytes). Gateway - tag 3, length N, IP address of preferred gateway (4 bytes), further gateway addresses... End - tag 255, the end of the items. Any bytes after this should be set to 255.
BOOTP operations. Client: port 68. Server: port 67. Boot process: the client sends its request to dest UDP port 67. The BOOTP server’s reply carries your IP (the address for the client), server IP, gateway IP and the boot file name - BFILE. Receiving the information, the client asks: is my IP address unique? It sends an ARP request to see if anyone else on the network has the same address. The client sends a second ARP request 0.5 second later, and a third ARP request 0.5 second after that. In the third ARP request the sender IP address is the client’s own address. NOBODY ANSWERS: my IP address is unique! The client then sends its request again, now with its own source IP address, and the server replies again with your IP, server IP, gateway IP and boot file name - BFILE. Next the client sends an ARP request, “who is the server?”, and the ARP reply carries the target hardware address - the server’s. TFTP: the client reads the boot file BFILE from the server. I have an IP address, I have a loadable image. I can start!
TCP packet. Header fields (each row is 32 bits wide, bits 0 to 31): source port, destination port; sequence number; acknowledgment number; header length (4), reserved (6), flags (6), window; TCP checksum, urgent pointer; options (+ padding); then DATA. The MSS option is used only in SYN segments.
TCP sequence and acknowledgment. Client: send 10 bytes, SEQ 10, no ACK. Server: receiving SEQ 10 and 10 bytes, ACK = 10 (SEQ) + 10 bytes = 20. Send my own data with my own SEQ and ACK = 20: send 20 bytes, SEQ 30, ACK 20. Client: receiving SEQ 30, DATA 20, ACK 20. My ACK = 30 + 20 = 50. The server received my data, his ACK = 20. My current SEQ = previous SEQ plus data sent = 10 + 10 = 20. Send 10 bytes, SEQ 20, ACK 50. Server: receiving SEQ 20, DATA 10, ACK 50. My ACK = 20 + 10 = 30. The client received my data, his ACK = 50. My current SEQ = previous SEQ plus data sent = 30 + 20 = 50. Send 20 bytes, SEQ 50, ACK 30. And so on...
TCP connection establishment. Client (active open): send a packet with the S (SYN) flag (a SYN segment). The packet contains the port number of the server that the client wants to connect to and the client’s ISN (initial sequence number). ISN = 145: SEQ 145, no ACK, flags S. Server (passive open): receiving the packet. Respond with its own SYN segment containing its own ISN and an ACK for the client’s SYN plus one (a SYN consumes one sequence number). ACK = 145 + 1 = 146, ISN = 348: SEQ 348, ACK 146, flags SA. Client: receiving the server’s response; the response contains the correct ACK. Acknowledge the server’s SYN with ACK = server’s ISN + 1 = 348 + 1 = 349: ACK 349, flags A. The connection establishment is completed. These three segments complete the connection establishment. This is often called the three-way handshake.
TCP connection termination. A TCP connection is full duplex, and each direction must be shut down independently. Client (active close): the user types “quit”, for example. The next ACK should be, for example, 426 and my own SN must be 658. Send a FIN - a packet with the FIN flag: SEQ 658, ACK 426, flags FA. Server (passive close): receiving the FIN packet. Respond with the corresponding ACK: ACK 659, flags A. Now the connection is «half-closed». The server may still send some data to the client, with corresponding ACKs. Then the server closes the other direction of the connection. Server: I should close the second direction: SEQ 426, ACK 659, flags FA. Client: receiving the FIN packet. Respond with the corresponding ACK: ACK 427, flags A. The connection is closed.
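TCP states for connection establishment and termination. Client (active open): sends SYN J and enters SYN_SENT. Server (passive open): sends SYN K, ack J+1 and enters SYN_RCVD. Client: sends ack K+1 and enters ESTABLISHED; the server enters ESTABLISHED when this ack arrives. Client (active close): sends FIN M and enters FIN_WAIT_1. Server (passive close): sends ack M+1 and enters CLOSE_WAIT; the client enters FIN_WAIT_2. Server: sends FIN N and enters LAST_ACK. Client: sends ack N+1 and enters TIME_WAIT; the server enters CLOSED. The client stays in the TIME_WAIT state for twice the MSL.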
2MSL state. While a connection is in the 2MSL wait, all datagrams received for it are discarded. It is impossible to open another connection for this socket pair (IP addresses and port numbers). Quiet time. Suppose a host in the 2MSL wait crashes, reboots within MSL seconds and immediately establishes new connections using the same local and foreign IP addresses and port numbers. To protect against this scenario RFC 793 states that TCP should not create any connection for MSL seconds after rebooting. This is called the quiet time. Reset segments. A reset segment has the “reset” bit in the TCP header set to 1. Any queued data is thrown away and the reset is sent immediately. The receiver of the RST can tell that the other end did an abort instead of a normal close. FIN - orderly release. RST - abortive release. Example: we are trying to connect to a server on a port number that’s not in use on the destination. With UDP an ICMP “port unreachable” message is sent in this case. TCP sends a reset segment. Client: SEQ 400, flags S. The server doesn’t have a process on that port: SEQ 0, ACK 401, flags RA.
Half-open connection. Usually all is fine! But sometimes something can crash. The computer that is still alive doesn’t know that its peer has died. The peer hasn’t sent a FIN or RST segment. The connection is half-open.
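Simultaneous open. Usual connection open: the active open end sends SYN J (SYN_SENT); the passive open end replies SYN K, ack J+1 (SYN_RCVD); the active end sends ack K+1; both ends are ESTABLISHED. Simultaneous open: both ends do an active open. One sends SYN J and the other sends SYN K (both SYN_SENT). Each receives the other’s SYN, enters SYN_RCVD and replies: SYN J, ack K+1 and SYN K, ack J+1. Both ends then enter ESTABLISHED. Result - one connection, not two.
Simultaneous close. Usual connection close: the active close end sends FIN M (FIN_WAIT_1); the passive close end replies ack M+1 (CLOSE_WAIT) and the active end enters FIN_WAIT_2; the passive end sends FIN N (LAST_ACK); the active end replies ack N+1 (TIME_WAIT) and the passive end enters CLOSED. Simultaneous close: both ends do an active close. One sends FIN J and the other sends FIN K (both FIN_WAIT_1). Each receives the other’s FIN, enters CLOSING and replies: ack K+1 and ack J+1. When its own FIN is acknowledged, each end enters TIME_WAIT.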
TCP options (RFC 793 and 1323), examples. End of option list: kind=0 (1 byte). No operation: kind=1 (1 byte). These two options don’t have a length field. The others do. Maximum segment size: kind=2 (1 byte), len=4 (1 byte), MSS (2 bytes). Window scale factor: kind=3 (1 byte), len=3 (1 byte), shift count (1 byte). Timestamp: kind=8 (1 byte), len=10 (1 byte), timestamp value (4 bytes), timestamp echo reply (4 bytes). Length is the total length, including the kind and len bytes.
Delayed acknowledgment (delayed ACK). The delayed ACK timer runs from kernel start; for example, the delayed ACK interval here is 200 ms. Look at the client. The client doesn’t send an ACK immediately. It delays the ACK, hoping to have data to send in the same direction as the ACK. It can wait until the next “delayed ACK” boundary. If the boundary arrives while an ACK is still waiting, the ACK is sent alone (ack 6) and the delayed ACK flag is turned off. If TCP decides to send a data packet while an ACK is waiting, the ACK is piggybacked on the data (PSH 11:15 (4) ack 12).
Nagle algorithm. The application writes data to the TCP buffer a few bytes at a time; the MSS in this example is 20 bytes. The first byte is sent at once: PSH 2:3 (1) ack 2. While we are waiting for the first packet’s ACK, TCP doesn’t send another small packet, and the bytes written by the application are collected in the buffer. When ack 3 is received, TCP can send the data from the buffer: PSH 3:5 (2) ack 2. When TCP has enough data to send an entire packet, it does so: PSH 5:25 (20) ack 2. Some time has passed, and data is now flowing in both directions. If a packet from the server is received before the client’s packet has been pushed onto the physical medium, the client has a “free” ACK from the server: the ACK is received, I have data, so I prepare and send a packet.
TCP timers. The retransmission timer is used when expecting an acknowledgment from the other end. The persist timer keeps window size information flowing even if the other end closes its receive window. The keepalive timer detects when the other end of an otherwise idle connection crashes or reboots. The 2MSL timer measures the time a connection has been in the TIME_WAIT state.
Round-trip time. The measured RTT (M) is the time between sending bytes (PSH 2:3 (1) ack 2) and receiving the ACK for those bytes (ack 3). These formulas are used to calculate the retransmission timeout value (RTO): Err = M - A; A <- A + g*Err; D <- D + h*(|Err| - D); RTO = A + 4D. A - smoothed RTT (an estimator of the average). D - smoothed mean deviation. g = 1/8, h = 1/4. Karn’s algorithm specifies that when a retransmission occurs, we cannot update the RTT estimators when the acknowledgment for the retransmitted data finally arrives.
RTT example. Measurement. (Figure: an exchange of 256-byte segments, 1:257 through 1793, and their ACKs, with three timed intervals RTT №1, RTT №2 and RTT №3 marked by “start timer” and “stop timer”.) Most implementations measure only one RTT value per connection at any time. If the timer for a given connection is already in use when a data segment is transmitted, that segment is not timed.
RTT example. Measurement. The timing is done by incrementing a counter every time the 500-ms TCP timer routine is invoked. The figure shows the relationship in our example between the actual RTT, which we can determine with a network analyzer, and the counted clock ticks: RTT №1 - 3 ticks, RTT №2 - 1 tick, RTT №3 - 2 ticks.
RTT example. Calculation. Formulas: Err = M - A; A <- A + g*Err; D <- D + h*(|Err| - D); RTO = A + 4D. Measurements: RTT №1 = 3 ticks, RTT №2 = 1 tick, RTT №3 = 2 ticks (one tick is 500 ms). A is initialized to 0 and D is initialized to 3. Initial RTO = A + 2D = 0 + 2*3 = 6 seconds (the factor 2 is used only for the initial calculation). When the ACK for the first data segment arrives (segment 2), the measured RTT is 3 ticks (M = 1.5 seconds) and our estimators are initialized as A = M + 0.5 = 1.5 + 0.5 = 2, D = A/2 = 1, RTO = A + 4D = 2 + 4*1 = 6 seconds. When the ACK for the second data segment arrives (segment 5), the measured RTT is 1 tick (M = 0.5 seconds) and the update is Err = M - A = 0.5 - 2 = -1.5, A = A + g*Err = 2 - 0.125*1.5 = 1.8125, D = D + h*(|Err| - D) = 1 + 0.25*(1.5 - 1) = 1.125, RTO = A + 4D = 1.8125 + 4*1.125 = 6.3125 seconds. But most implementations use an RTO that is a multiple of 500 ms. In our instance the RTO will be 6 seconds.
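The calculation above can be replayed in code. This is a sketch under stated assumptions: one tick is taken as 0.5 seconds, h is taken as 1/4, and the first measurement initializes the estimators as A = M + 0.5 and D = A/2; the function names are illustrative.

```python
TICK = 0.5          # seconds per clock tick (500-ms timer)
G = 1 / 8           # gain for the smoothed RTT
H = 1 / 4           # gain for the smoothed mean deviation (assumed value)

def init_estimators(m: float) -> tuple:
    """First measurement: A = M + 0.5, D = A / 2."""
    a = m + 0.5
    return a, a / 2

def update_estimators(a: float, d: float, m: float) -> tuple:
    """Later measurements: Err = M - A, A <- A + g*Err, D <- D + h*(|Err| - D)."""
    err = m - a
    return a + G * err, d + H * (abs(err) - d)

def rto(a: float, d: float) -> float:
    return a + 4 * d

a, d = init_estimators(3 * TICK)            # RTT No.1 = 3 ticks
rto_first = rto(a, d)
a, d = update_estimators(a, d, 1 * TICK)    # RTT No.2 = 1 tick
rto_second = rto(a, d)
```

Feeding RTT №3 (2 ticks) through `update_estimators` continues the series in the same way.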
Congestion example. Normal data flow: the sender transmits 256-byte segments (6401:6657, 6657:6913, ...) and the receiver passes the data to the application and acknowledges it. Congestion: for example, a router loses the packet 6913:7169. The receiving host knows that the previous packet is missing. For each later segment it sends an ACK for the last data received in order (ack 6913) and saves the received packet (save 256). These are the first, second and third duplicate ACKs. TCP counts the number of duplicate ACKs received, and when the third one is received it assumes that a segment has been lost. TCP retransmits only one segment, starting with that sequence number (6913:7169). When the missing packet is received, the host has all the data bytes, passes all the saved data to the application and sends ack 8449. We discuss the fast retransmit algorithm later.
Slow start. Slow start works with the congestion window - CWND. CWND is initialized to 1 (one) segment and is increased by one segment each time an ACK is received. The sender can transmit up to the minimum of the congestion window and the advertised window. CWND is flow control imposed by the sender. CWND is maintained in bytes. Example with 512-byte segments: with cwnd = 1 the sender sends 1:513 (512) ack 1. When ack 513 arrives, cwnd = 2 and the sender sends 513:1025 and 1025:1537. When ack 1025 arrives, cwnd = 3. The sender sends only two more segments, because the ACK for segment 1025:1537 hasn’t been received. Result: we have CWND = 3 and 3 sent (unacknowledged) segments. When ack 1537 arrives, cwnd = 4. And so on. At some point the capacity of the network can be reached and some packets can be discarded. This tells the sender that its CWND is too large. We’ll see the mechanism of CWND adjustment later.
Congestion avoidance algorithm. There are two indications of packet loss: a timeout occurs, or duplicate ACKs are received. Congestion avoidance and slow start are different algorithms, but in practice they are implemented together. When congestion occurs, TCP slows down the transmission rate of packets into the network and then invokes slow start to get things going again. Congestion avoidance and slow start require that two variables be maintained for each connection: CWND and a slow start threshold size, ssthresh.
Congestion avoidance algorithm. Combined algorithm’s work. Initialization: CWND = 1 segment, SSTHRESH = 65535 bytes. Normal data flow: CWND is growing. Congestion occurs! SSTHRESH = CWS/2 (CWS - current window size). Is congestion indicated by a timeout? Yes - CWND = 1 segment. Retransmission, bla-bla-bla... At last an ACK is received. TCP increases CWND, but the way it increases depends on whether TCP is performing slow start or congestion avoidance. CWND <= SSTHRESH? Yes - TCP is doing SLOW START. Slow start has CWND start at one segment and be incremented by one segment every time an ACK is received. (Do you remember the slide before?) Slow start continues until we are halfway to where congestion occurred (since we recorded half of the window size that got us into trouble), and then congestion avoidance takes over. No - CONGESTION AVOIDANCE. Congestion avoidance dictates that CWND be incremented by 1/CWND each time an ACK is received. So we increase CWND by at most one segment each RTT, whereas slow start increments CWND by the number of ACKs received in an RTT.
Congestion avoidance algorithm. Illustration (CWND in segments against round-trip times). Starting point: we assume that congestion has just occurred when CWND had a value of 32 segments. Congestion was indicated by a timeout. SSTHRESH = 32 / 2 = 16, CWND = 1. 1 segment is sent at time 0. At time 1 the ACK is returned and CWND is incremented to 2 segments. At time 2 two ACKs are returned and CWND is incremented to 4 segments (CWND was 2 and two ACKs were received). And so on, until CWND = SSTHRESH: slow start is stopped and congestion avoidance is started. Now congestion avoidance is working. The increase of CWND is linear, with a maximum increase of one segment per round-trip time.
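The shape of the illustration can be reproduced with a per-round-trip model. This is a simplified sketch, not an implementation: it counts CWND in whole segments, assumes every segment is acknowledged within one round trip, and follows the illustration in switching to congestion avoidance as soon as CWND reaches SSTHRESH.

```python
def cwnd_by_round_trip(ssthresh: int, round_trips: int) -> list:
    """CWND in segments after each round trip, starting from 1 segment."""
    cwnd = 1
    history = [cwnd]
    for _ in range(round_trips):
        if cwnd < ssthresh:
            # Slow start: one extra segment per ACK doubles CWND each round trip.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: at most one extra segment per round trip.
            cwnd += 1
        history.append(cwnd)
    return history

history = cwnd_by_round_trip(ssthresh=16, round_trips=7)
```

With SSTHRESH = 16 the window grows exponentially for four round trips and linearly afterwards.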
Fast retransmit and fast recovery algorithms. TCP host: I am able to send 3 packets, for example 1:513 (512) ack 1 and 513:1025 (512) ack 1, into the network. 1st duplicate ACK (ack 513): this duplicate ACK may be generated by reordered segments. 2nd duplicate ACK (ack 513): this duplicate ACK may also be generated by reordered segments. 3rd duplicate ACK: I think a segment is lost. The host doesn’t wait for the retransmission timer to expire. It sends the lost segment. This is the FAST RETRANSMIT ALGORITHM. Slow start isn’t performed, but the congestion avoidance algorithm is working. This is the FAST RECOVERY ALGORITHM.
Fast retransmit and fast recovery algorithms. Combined algorithm’s work. When the 3rd duplicate ACK is received: SSTHRESH = CWS/2; retransmit the missing segment; CWND = SSTHRESH + 3 * segment size. If another duplicate ACK arrives: INC(CWND; segment size) and transmit a packet (if CWND allows). When an ACK is received which acknowledges all data segments sent between the lost packet and the 1st duplicate ACK: CWND = SSTHRESH. Congestion avoidance is now working.
Slow start and congestion avoidance example (figure: CWND plotted against time; 1 segment = 256 bytes). Initialize: CWND = MSS = 256. A timeout occurs on the SYN: SSTHRESH = CWS/2, limited to its minimum value = 512; CWND = 1 segment = 256. SYN, then SYN and ACK: there are no changes here, because new data is not being acknowledged. Data goes. Here is an ACK for data! CWND <= SSTHRESH, we are in slow start: CWND = CWND + 1 segment = 256 + 256 = 512. Data goes. CWND <= SSTHRESH, slow start: CWND = CWND + 1 segment = 512 + 256 = 768. Data goes. CWND > SSTHRESH, congestion avoidance: CWND <- 768 + 256*256/768 + 256/8. We are using integer arithmetic: CWND = 885. The real formula for the 1/CWND increase is cwnd <- cwnd + (segsize*segsize)/cwnd + segsize/8. Data goes. CWND > SSTHRESH, congestion avoidance: CWND <- 885 + 256*256/885 + 256/8. We are using integer arithmetic: CWND = 991. CWND > SSTHRESH, congestion avoidance: CWND <- 991 + 256*256/991 + 256/8. We are using integer arithmetic: CWND = 1089.
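The per-ACK update used in this example, with integer arithmetic, can be sketched as follows. The function name is illustrative; the formula is the one quoted on the slide, cwnd <- cwnd + (segsize*segsize)/cwnd + segsize/8, and the starting values are CWND = 256 with SSTHRESH = 512.

```python
def cwnd_after_ack(cwnd: int, ssthresh: int, segsize: int = 256) -> int:
    """New CWND in bytes after an ACK for new data."""
    if cwnd <= ssthresh:
        # Slow start: open the window by one segment.
        return cwnd + segsize
    # Congestion avoidance with integer division, as on the slide.
    return cwnd + (segsize * segsize) // cwnd + segsize // 8

cwnd = 256
trace = []
for _ in range(5):
    cwnd = cwnd_after_ack(cwnd, ssthresh=512)
    trace.append(cwnd)
```

The first two ACKs are handled by slow start, the next three by congestion avoidance.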
Slow start and congestion avoidance example (continued). The first two duplicate ACKs are received and counted, and CWND is left alone. The third duplicate ACK arrives: SSTHRESH = CWND/2 = 2426/2 = 1024 (rounded down to a multiple of the segment size). The retransmission is sent. CWND = SSTHRESH + number of duplicate ACKs * segment size = 1024 + 3 * 256 = 1792. A duplicate ACK is received: CWND = CWND + 1 segment = 1792 + 256 = 2048. But CWND is not big enough to send data. NOTE: here we have 2304 bytes of unacknowledged data from previous segments. A duplicate ACK is received: CWND = CWND + 1 segment = 2048 + 256 = 2304. But CWND is not big enough to send data. A duplicate ACK is received: CWND = CWND + 1 segment = 2304 + 256 = 2560. We can send data. Data is sent. There are some more segments in the same situation. An ACK for new data is received: CWND <= SSTHRESH, slow start!!! CWND = SSTHRESH + segment size = 1024 + 256 = 1280.
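The window arithmetic of this fast retransmit and fast recovery example can be replayed step by step. A sketch with illustrative function names; it assumes SSTHRESH is rounded down to a multiple of the segment size and that, on the ACK for new data, CWND drops to SSTHRESH and is then opened by one segment because CWND <= SSTHRESH.

```python
SEGSIZE = 256

def on_third_duplicate_ack(cwnd: int) -> tuple:
    """Return (ssthresh, cwnd) after the third duplicate ACK."""
    ssthresh = (cwnd // 2) // SEGSIZE * SEGSIZE     # round down to whole segments
    return ssthresh, ssthresh + 3 * SEGSIZE

def on_further_duplicate_ack(cwnd: int) -> int:
    """Each additional duplicate ACK inflates CWND by one segment."""
    return cwnd + SEGSIZE

def on_new_data_ack(ssthresh: int) -> int:
    """CWND falls back to SSTHRESH, then slow start adds one segment."""
    return ssthresh + SEGSIZE

ssthresh, cwnd = on_third_duplicate_ack(2426)
inflated = []
for _ in range(3):
    cwnd = on_further_duplicate_ack(cwnd)
    inflated.append(cwnd)
final_cwnd = on_new_data_ack(ssthresh)
```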
TCP keepalive timer. A TCP implementation may use the keepalive option. This option is used to find out: is my peer alive? Usually the keepalive timer is 2 hours. One example is a half-open connection. One peer has died but the other end doesn’t know about it. It keeps a socket (IP address + port number) for the dead peer. But the peer doesn’t need anything any more... And the live end must find that out! There are 4 scenarios when there is no activity on a connection and one peer sends a keepalive probe to the other.
TCP keepalive timer. Scenario 1. The peer is alive and reachable. That’s all... The peers don’t have any data to send to each other, but the connection is established. 2 (two) hours pass... Client: my keepalive timer has expired. Is my peer alive? But I forgot his MAC address... So the exchange is: ARP request, ARP reply, keepalive probe, ACK. The keepalive probe has a SEQ that is one less than it should be (for example, the receiver waits for SEQ = 14, but the keepalive probe has SEQ = 13). The receiver gets a packet with an incorrect SEQ and is forced to respond with an ACK which contains the next SEQ that the server is expecting. The client receives the answer from the server. It knows that the server is alive and resets its keepalive timer. Scenario 2. The peer has crashed and is down or in the process of rebooting. The peers don’t have any data to send to each other, but the connection is established. 2 hours have passed. But the peer has crashed. Client: my keepalive timer has expired. Is my peer alive? TCP sends the probe. (We don’t look at the lower level, ARP, now. We want to know whether the peer is alive or not.) Keepalive probe... 75 seconds, no answer. Keepalive probe... 75 seconds, no answer. The client sends 10 keepalive probes. If it doesn’t receive a response, it considers the peer’s host down and terminates the connection.
TCP keepalive timer. Once again... Scenario 3. The peer has crashed and rebooted. 2 hours have passed. Client: my keepalive timer has expired. Is my peer alive? It sends a keepalive probe. Server: are they crazy? I don’t have such a socket! The host has crashed and rebooted. It has a working TCP stack but doesn’t have a socket for that connection, so it answers with a reset and the connection is reset. Scenario 4. The peer is running, but unreachable. I’ll be laconic... From the client’s point of view the situation in this scenario is the same as in scenario 2. This situation may be caused by a failure of an intermediate router.
Path MTU Discovery. Connection established: segment size = MIN (my interface MTU minus the IP and TCP headers; MSS announced by the other end). If the other end doesn’t specify an MSS, it defaults to 536. We send datagrams with the DF (don’t fragment) bit set. It is possible to save the path MTU on a per-route basis. But things change... For example, a router fails and the route changes. Another router needs to fragment our datagram, but the datagram has the DF bit set. The router sends an ICMP error to our host. We have received the ICMP error “can’t fragment”: decrease the MTU. If the router generates the newer form of the ICMP error message, the message contains the next-hop MTU, and MSS = MTU - IP header - TCP header. If the router generates the older form of the ICMP error message, we take the next smaller MTU. Things keep changing... After a timeout we can try a bigger MTU (depending on the implementation). RFC 1191 recommends 10 minutes.
TCP packet with MSS option. TCP packet header fields: source port, destination port, sequence number, acknowledgment number, data offset, reserved, flags, window, TCP checksum, urgent pointer, options (+ padding); then DATA. Maximum segment size option: kind=2 (1 byte), len=4 (1 byte), MSS (2 bytes).
Path MTU Discovery. Example. Host 1 has MTU = 1500, Host 2 has MTU = 552, and a link behind Router 1 has MTU = 296. Host 1 sends SYN, mss = 1460. Host 2: my MTU is 552! It answers SYN, ACK, mss = 512. Host 1: I can send datagrams with 512 bytes of data. It sends 1:513 (512) ACK. Router: I can’t send such a big datagram without fragmentation. But the DF bit is set => an error occurs! ICMP error message: unreachable, need to frag, mtu = 296 (the newer form, from a newer router implementation). Host 1: my MSS is now 256 (MTU = 296). It sends 1:257 (256) ACK.
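The segment sizes in this example follow from simple header arithmetic. A sketch with illustrative function names, assuming 20-byte IP and TCP headers without options and the default MSS of 536 when the peer announces none:

```python
IP_HEADER = 20
TCP_HEADER = 20
DEFAULT_MSS = 536

def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits into one unfragmented datagram."""
    return mtu - IP_HEADER - TCP_HEADER

def initial_segment_size(interface_mtu: int, peer_mss: int = DEFAULT_MSS) -> int:
    """Segment size right after connection establishment."""
    return min(mss_for_mtu(interface_mtu), peer_mss)

def segment_size_after_icmp(next_hop_mtu: int) -> int:
    """Newer-form ICMP 'need to frag' error: shrink to fit the reported MTU."""
    return mss_for_mtu(next_hop_mtu)

announced_by_host1 = mss_for_mtu(1500)
announced_by_host2 = mss_for_mtu(552)
start_size = initial_segment_size(1500, peer_mss=announced_by_host2)
reduced_size = segment_size_after_icmp(296)
```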
Window Scale Option. Networks are growing and buffers are becoming bigger, and the maximum window size allowed by the window field in the TCP header is not enough: the window field has only 16 bits. Newer implementations use the WINDOW SCALE OPTION, which is carried in the options field of the TCP header: kind=3 (1 byte), len=3 (1 byte), shift count (1 byte). Newer implementations can still work with older implementations. The WINDOW SCALE OPTION can be advertised only in a SYN segment. The scale factor is fixed in each direction when the connection is established. Shift count 0: no scaling is performed.
Window Scale Option. Setting. To enable window scaling both ends must have this option in their SYN segments. Active open: I think my window scale should be 1. It sends SYN, wscale 1. Passive end: the active peer is going to use window scale! I understand it and choose my own window scale; if I don’t want to scale my own window, I must still send this option, set to 0. In the example it answers SYN, ACK, wscale 3. How the scale works: the window scale is used to shift the value from the window field to get the real window size. For example, the window scale was set to 1 and the window size in the received packet is 4 (it’s only an example). Using the window scale, shift the value left by 1 bit: the real advertised window is 8.
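The shift described above is a one-line operation. A sketch with an illustrative function name; the upper limit of 14 for the shift count comes from RFC 1323 and is not stated on the slide.

```python
MAX_SHIFT = 14          # largest shift count allowed by RFC 1323
MAX_WINDOW_FIELD = 0xFFFF

def real_window(window_field: int, shift_count: int) -> int:
    """Real advertised window: the 16-bit field shifted left by the scale."""
    if not 0 <= window_field <= MAX_WINDOW_FIELD:
        raise ValueError("the window field is 16 bits wide")
    if not 0 <= shift_count <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift_count

example = real_window(4, 1)                      # the example from the slide
unscaled_max = real_window(0xFFFF, 0)
scaled_max = real_window(0xFFFF, MAX_SHIFT)
```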
Timestamp option. kind=8 (1 byte), len=10 (1 byte), timestamp value (4 bytes), timestamp echo reply (4 bytes). The timestamp option is used for better calculation of the RTT. The sender places a 32-bit value in the first field and the receiver echoes it back in the reply field. To use this option both ends must be able to work with it. To establish the option, the active peer must set the timestamp option in its SYN and the other (passive) end must answer with the option too. Only one timestamp value is kept per connection. How does TCP do it? The receiver’s TCP keeps two variables: tsrecent, the timestamp value to be echoed, and lastack, the ACK number from the last ACK which was sent. The ACK number is the next sequence number which we are waiting for. When a segment arrives: if the segment contains the sequence number lastack, then tsrecent = the timestamp value from the segment. When an ACK is sent, tsrecent is placed in the timestamp echo reply field and lastack is set to the ACK value of that ACK.
PAWS: Protection Against Wrapped Sequence Numbers