Presentation on theme: "How to use it: press Space to go along the slide animation" - Presentation transcript:
1 How to use it
Press Space to go along the slide animation. Don't hurry to press Space again: wait for the end of the animation. If you want to go back, use the «PgUp» key.
Proxy ARP. Incoming connection request queue. Interactive data flow. Push flag. Urgent mode. IP header in detail, broken down by bits, with its maximum length in bytes. Repacketization, p. 320. ICMP errors, p. 317. Exponential backoff, p. 299. Path MTU discovery for UDP, 118. TCP flags.
Version: 08 June 1999. Come back later: the presentation is under construction now.
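2 Encapsulation of data into an Ethernet frame
Application: application header + user data = application data
TCP: TCP header + application data = TCP segment
IP: IP header + TCP header + application data = IP datagram
Ethernet: Ethernet header + IP header + TCP header + application data + Ethernet trailer = Ethernet frame
The part between the Ethernet header and the Ethernet trailer is 46 to 1500 bytes.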
3 IEEE 802.2/802.3 Encapsulation (RFC 1042)
Frame layout (field, size in bytes):
802.3 MAC: destination address (6), source address (6), length (2)
802.2 LLC: DSAP 0xAA (1), SSAP 0xAA (1), cntl 03 (1)
802.2 SNAP: org code 00 (3), type (2)
DATA, then CRC (4)
Data by type: type 0800 (2) is followed by an IP datagram; type 0806 (2) by an ARP request/reply (28) and PAD (18); type 8035 (2) by a RARP request/reply (28) and PAD (18).
LENGTH contains the length of the packet from the next byte up to the CRC (the CRC isn't included).
DSAP (Destination Service Access Point) and SSAP (Source Service Access Point) are both set to 0xAA.
CNTL (control field) is set to 3.
ORG CODE is always 0 in all bytes.
The TYPE field identifies the data that follows. For example, type 0x0800 (hex) means that an IP datagram follows.
5 IP packet structure
Header layout (each row is 32 bits, bits 0 to 31):
4-bit ver | 4-bit IHL | TOS | 16-bit total packet length
16-bit identification | 3-bit flags | 13-bit fragment offset
TTL | Protocol | Header checksum
Source address
Destination address
Options (+ padding)
DATA
VERSION. The current protocol version is 4.
IHL - IP header length. IHL is the number of 32-bit words in the IP header. This field is 4 bits long => the maximum header length is 60 bytes.
TOS - type of service. It consists of 3 precedence bits (ignored), 4 TOS bits, and an unused bit which must be 0. The 4 TOS bits are: minimize delay, maximize throughput, maximize reliability, minimize monetary cost. Only 1 of these 4 bits can be turned on.
TPL - total packet length is the total length of the IP packet in bytes. The field is 16 bits long, so the maximum length of an IP packet is 65535 bytes.
IDENTIFICATION - this field is used when IP needs to fragment datagrams. It identifies each datagram and is incremented each time a datagram is sent. We'll see the meaning of this field when we talk about fragmentation.
FLAGS and FRAGMENT OFFSET - we'll also look at these when we talk about fragmentation.
Continue...
6 IP packet structure
Header layout (each row is 32 bits, bits 0 to 31):
4-bit ver | 4-bit IHL | TOS | 16-bit total packet length
16-bit identification | 3-bit flags | 13-bit fragment offset
TTL | Protocol | Header checksum
Source address
Destination address
Options (+ padding)
DATA
TTL - time-to-live sets an upper limit on the number of routers through which a datagram can pass. This field is decremented each time the datagram passes through a router. When the field reaches 0, the datagram is dropped by the router and an ICMP message is sent to the datagram's sender.
PROTOCOL - this field identifies the DATA portion of the datagram (which protocol is encapsulated in the IP datagram).
HEADER CHECKSUM is calculated for the IP header only.
SOURCE and DESTINATION addresses are the sender's and receiver's IP addresses.
OPTIONS is a variable-length field which contains some options. We'll discuss some of them later. The option field always ends on a 32-bit boundary. PAD bytes (value 0) are added if necessary.
DATA is data.
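The field descriptions above can be illustrated with a short sketch that unpacks the fixed 20-byte part of an IPv4 header and verifies the header checksum. The function names and the sample header bytes (`SAMPLE`) are illustrative assumptions, not part of the slides; the checksum is the usual one's-complement sum of 16-bit words.

```python
import socket
import struct

def internet_checksum(data: bytes) -> int:
    """One's-complement sum of 16-bit words, then complemented."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def parse_ipv4_header(packet: bytes) -> dict:
    """Decode the fixed part of an IPv4 header (options are not decoded)."""
    (ver_ihl, tos, total_length, ident, flags_frag,
     ttl, protocol, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    header_len = (ver_ihl & 0x0F) * 4       # IHL counts 32-bit words
    return {
        "version": ver_ihl >> 4,
        "header_len": header_len,
        "tos": tos,
        "total_length": total_length,
        "identification": ident,
        "flags": flags_frag >> 13,
        "fragment_offset": flags_frag & 0x1FFF,
        "ttl": ttl,
        "protocol": protocol,
        "checksum": checksum,
        # Summing a valid header including its checksum field yields 0.
        "checksum_ok": internet_checksum(packet[:header_len]) == 0,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# Hypothetical sample header: UDP datagram, TTL 64, DF flag set.
SAMPLE = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")

if __name__ == "__main__":
    for name, value in parse_ipv4_header(SAMPLE).items():
        print(name, value)
```

Since IHL is a 4-bit field, `header_len` can never exceed 15 * 4 = 60 bytes, which is the limit stated on the slide.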
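7 Special case IP addresses. IP address classes
Class: range
A: 0.0.0.0 to 127.255.255.255
B: 128.0.0.0 to 191.255.255.255
C: 192.0.0.0 to 223.255.255.255
D: 224.0.0.0 to 239.255.255.255 (multicast)
E: 240.0.0.0 to 255.255.255.255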
8 ARP and RARP
32-bit IP address -> (ARP) -> 48-bit Ethernet address
48-bit Ethernet address -> (RARP) -> 32-bit IP address
ARP. For example, we are working on an Ethernet network. The Ethernet driver and adapter use MAC addresses. TCP/IP uses IP addresses. When a host wants to send data to another host, it knows only the receiver's IP address and passes this information to the TCP/IP stack. The TCP/IP stack then needs a mechanism that maps between MAC and IP addresses. IP has two algorithms to solve this.
RARP. If a system has no hard or floppy drive and must boot from the network, it can't take its IP address from local resources. Such a system has only a MAC address. RARP is the algorithm that allows the system to obtain an IP address from the network.
9 ARP
Sending host: IP wants to send an IP datagram to an IP address and asks ARP to resolve the IP address to a hardware address. ARP checks: do I know the hardware address? Yes: the datagram is passed to the Ethernet driver. No: ARP sends an ARP request through the Ethernet driver.
Every other host: the Ethernet driver passes the ARP request to ARP, which checks: is somebody looking for my address? No: ignore the request. Yes: send an ARP reply.
10 RARP
Diskless workstation: boot, read own hardware network address, send RARP request.
RARP server: somebody wants to have an IP address! Give the requester an IP address from my table and send a RARP reply.
Diskless workstation: I have an IP address!!!
11 ARP packet
Frame layout (field, size in bytes):
Ethernet header: dest address (6), source address (6), type (2)
ARP request/reply: hard type (2), prot type (2), hard size (1), prot size (1), op (2), sender Ethernet address (6), sender IP address (4), target Ethernet address (6), target IP address (4)
dest address: broadcast
type: 0x806
hardware type: specifies the hardware type. 1 for Ethernet
protocol type: 0x800 for IP
hardware size: size of the hardware address. 6 for Ethernet
protocol size: size of the protocol address. 4 for IP
op: type of operation (request or reply). ARP request - 1, ARP reply - 2, RARP request - 3, RARP reply - 4.
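As a minimal sketch of the layout above, the following builds the bytes of a broadcast ARP request with `struct`. The function name and the example addresses are assumptions for illustration; actually sending the frame would need a raw socket and is left out.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6
ETHERTYPE_ARP = 0x0806
ETHERTYPE_IP = 0x0800
OP_ARP_REQUEST = 1

def build_arp_request(sender_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Return Ethernet header (14 bytes) + ARP request (28 bytes)."""
    ethernet_header = struct.pack("!6s6sH", BROADCAST_MAC, sender_mac, ETHERTYPE_ARP)
    arp_body = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                             # hardware type: 1 for Ethernet
        ETHERTYPE_IP,                  # protocol type: 0x800 for IP
        6,                             # hardware size
        4,                             # protocol size
        OP_ARP_REQUEST,                # op: 1 = ARP request
        sender_mac,
        socket.inet_aton(sender_ip),
        b"\x00" * 6,                   # target Ethernet address is not known yet
        socket.inet_aton(target_ip),
    )
    return ethernet_header + arp_body

if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    frame = build_arp_request(bytes.fromhex("020000000001"), "10.0.0.1", "10.0.0.2")
    print(len(frame), frame.hex())
```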
12 ICMP - Internet Control Message Protocol (RFC 792): packet structure
IP datagram: IP header (20 bytes), then the ICMP message.
ICMP message: 8-bit type | 8-bit code | 16-bit checksum (for the entire ICMP message). These fields are the same for all types of messages.
The contents that follow depend on type and code.
13 ICMP address mask request and reply
Type: 17 - request, 18 - reply | Code: 0 | 16-bit checksum (for the entire ICMP message)
identifier (anything) | sequence number (anything)
32-bit subnet mask
Total: 12 bytes
ICMP timestamp request and reply
Type: 13 - request, 14 - reply | Code: 0 | 16-bit checksum (for the entire ICMP message)
identifier (anything) | sequence number (anything)
32-bit originate timestamp
32-bit receive timestamp
32-bit transmit timestamp
Total: 20 bytes
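14 ICMP port unreachable error
Frame layout (size in bytes): Ethernet header (14), IP header (20), ICMP header (8), IP header of the datagram that generated the error (20), UDP header (8). Everything after the Ethernet header is the IP datagram; everything after the first IP header is the ICMP message; everything after the ICMP header is the data portion of the ICMP message.
The data portion must include:
the IP header of the datagram that generated the error;
at least 8 bytes that followed this IP header. In this example it is the UDP header.
General format of an ICMP unreachable message:
type 3 | code 0-15 | 16-bit checksum (for the entire ICMP message)
unused (must be 0)
IP header including options + first 8 bytes of the original IP datagram data
The first two rows are 8 bytes.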
15 ICMP echo request and echo reply (PING)
Client: I want to know whether the server is alive. Send echo request.
Server: I received a "ping" to my address. Answer to the client: send echo reply.
Client: the server is alive.
Packets:
type: 0 - reply, 8 - request | code 0 | 16-bit checksum (for the entire ICMP message)
identifier | sequence number
optional data
The first two rows are 8 bytes.
identifier - process ID of the sending process
sequence number - starts at 0 and is incremented every time a new echo request is sent
The server must echo the identifier and sequence number fields. Historically, ping has operated in a mode where it sends an echo request once a second.
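16 IP record route option (-r option)
Path: Client, Router 1, Router 2, Router 3, Server. The client sends an echo request with the -r option; the server sends an echo reply.
Routers put the IP addresses of their outgoing interfaces into the RR option of the packet.
Packet IP option (field, size in bytes): code (1), len (1), ptr (1), then a list of 4-byte IP addresses. In the example the list holds the router addresses recorded on the way to the server, the IP address of the server, the router addresses recorded on the way back, and the incoming interface.
Ptr takes the values 4, 8, 12, 16, 20, 24, 28 as the addresses are recorded.
Code: 1-byte field specifying the type of IP option. For the RR option its value is 7.
Len: total number of bytes of the RR option. Ping always provides a 39-byte option (3 bytes of code, len and ptr plus 9 * 4 bytes), to record up to 9 IP addresses, which is the maximum.
There is limited room in the IP header for the list of IP addresses, because the entire IP header is limited to 15 32-bit words (60 bytes). There are only up to 40 bytes for the option field in the IP header.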
17 BROADCASTING
Four types of IP broadcast
Name: address, description
Limited: 255.255.255.255. A limited broadcast is never forwarded by a router.
Net-directed: netid with a host ID of all 1 bits. Routers forward this kind of broadcast. The broadcast is destined for the IP network netid.
Subnet-directed: host ID of all 1 bits. Broadcast for a specific subnet. Knowledge of the subnet mask is required.
All-subnets-directed: subnet ID of all 1 bits and host ID of all 1 bits. Knowledge of the subnet mask is required. If the network is subnetted, this is an all-subnets-directed broadcast. If the network isn't subnetted, this is a net-directed broadcast.
18 MULTICASTING
Addressing. Do you remember? Class D: 224.0.0.0 to 239.255.255.255, multicast.
Here is the format of a class D IP address: the first four bits are 1110 (so the first byte is 224 to 239), followed by the 28-bit multicast group ID.
Note! On Ethernet, a multicast address has the low-order bit of the first byte set: 01:00:00:00:00:00.
The set of hosts listening to a particular IP multicast address is called a host group. A host group can span multiple networks. Membership in a host group is dynamic: hosts may join and leave a host group at will. There is no restriction on the number of hosts in a host group, and a host does not have to belong to a group to send a message to that group.
19 MULTICASTING
Converting multicast group addresses to Ethernet addresses
The Ethernet addresses corresponding to IP multicasting are in the range 01:00:5e:00:00:00 through 01:00:5e:7f:ff:ff.
We have 23 bits in the Ethernet address to correspond to the IP multicast group ID. The mapping places the low-order 23 bits of the multicast group ID into these 23 bits of the 48-bit Ethernet address. The upper 5 bits of the multicast group ID in the class D IP address are not used to form the Ethernet address.
Since the upper 5 bits of the multicast group ID are ignored in this mapping, it is not unique. 32 different multicast group IDs map to the same Ethernet address (5 bits, 11111 = 31). The device driver or the IP software must perform filtering, since the interface card may receive multicast frames in which the host is really not interested.
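The mapping described above can be sketched in a few lines. The function name is an assumption; the prefix 01:00:5e and the 23-bit mask come from the address range given on the slide.

```python
import ipaddress

ETHERNET_MULTICAST_PREFIX = 0x01005E   # upper 24 bits of the Ethernet address
LOW_23_BITS = 0x7FFFFF

def multicast_ip_to_mac(group: str) -> str:
    """Map a class D IP address to its Ethernet multicast address."""
    address = ipaddress.IPv4Address(group)
    if not address.is_multicast:
        raise ValueError(f"{group} is not a class D (multicast) address")
    # Keep only the low-order 23 bits of the multicast group ID.
    mac = (ETHERNET_MULTICAST_PREFIX << 24) | (int(address) & LOW_23_BITS)
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -1, -8))

if __name__ == "__main__":
    # Two different groups that collide on the same Ethernet address.
    for group in ("224.128.64.32", "224.0.64.32"):
        print(group, "->", multicast_ip_to_mac(group))
```

The two groups in the usage example differ only in the 5 ignored bits, so they map to the same Ethernet address, which is why the receiver still has to filter.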
20 IGMP reports and queries (Internet Group Management Protocol)
Participants: a host with Process 1 and Process 2, and a router that keeps the list of multicast groups on interface 1.
Join: a process joins group 1. The host sends an IGMP report (dest IP = group IP, group address = group IP). It waits for a random timer (for example, 2 seconds) and sends another IGMP report. The router records: group 1 reported. Joining group 2 works the same way (random timer, for example, 3 seconds).
Query: timer! The router sends an IGMP query (group IP = 0). For each group the host waits 0-10 seconds and then sends an IGMP report (dest IP = group IP). The router records: group 1 alive, group 2 alive.
Leave: a process leaves group 2. The host does not report group 2 next time; on the next query it reports only the groups it still belongs to.
21 IGMP packet
IP datagram: IP header (20 bytes), then the IGMP message (8 bytes).
IGMP message (bits 0 to 31 per row):
4-bit IGMP version (1) | 4-bit IGMP type (1-2) | unused | 16-bit checksum
32-bit group address (class D IP address)
Version: 1
Type: 1 - query sent by a multicast router, 2 - response sent by a host
Group address: a class D IP address. For a query the address is set to 0.
25 TFTP operations
Client process: I need the file "File" from the server.
Client: read request for "File". Opcode 1, dest UDP port 69, source UDP port of the application.
Server: can the file be read by the client? YES.
Server: file transfer. Opcode 3, block number 1, 512 bytes. Dest UDP port is the application's port; source UDP port is a new port number appointed by the TFTP server for this file transfer. These port numbers are used for the rest of the file transfer.
Client: received block 1. ACK, opcode 4, block number 1.
Server: file transfer. Opcode 3, block number 2, 356 bytes (last block of "File").
Client: receiving block 2. Data size < 512 bytes => last block of the file. ACK, opcode 4, block number 2.
To write a file, the client sends a WRQ. If all is OK, the server responds with an ACK and block number 0. And so on.
Error messages. The server responds with this type of packet if a read request or write request can't be processed. A read or write error during file transmission can also cause this message to be sent, and the transmission is then terminated.
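The packet formats behind this exchange can be sketched as follows. The helper names and the "octet" transfer mode are assumptions for illustration; the opcodes, the 512-byte block size and the short-last-block rule are the ones shown on the slide.

```python
import struct

OP_RRQ, OP_WRQ, OP_DATA, OP_ACK, OP_ERROR = 1, 2, 3, 4, 5
BLOCK_SIZE = 512

def read_request(filename: str, mode: str = "octet") -> bytes:
    """RRQ: opcode 1, then filename and mode as null-terminated strings."""
    return struct.pack("!H", OP_RRQ) + filename.encode("ascii") + b"\0" + mode.encode("ascii") + b"\0"

def data_packet(block_number: int, payload: bytes) -> bytes:
    """DATA: opcode 3, block number, then up to 512 bytes of file data."""
    return struct.pack("!HH", OP_DATA, block_number) + payload

def ack_packet(block_number: int) -> bytes:
    """ACK: opcode 4 and the block number being acknowledged."""
    return struct.pack("!HH", OP_ACK, block_number)

def split_into_blocks(content: bytes) -> list:
    """Cut a file into 512-byte blocks; the last one is always shorter than 512."""
    return [content[i:i + BLOCK_SIZE] for i in range(0, len(content) + 1, BLOCK_SIZE)]

def is_last_block(packet: bytes) -> bool:
    """A DATA packet carrying fewer than 512 bytes ends the transfer."""
    return len(packet) - 4 < BLOCK_SIZE

if __name__ == "__main__":
    blocks = split_into_blocks(b"x" * 868)          # 512 + 356 bytes, as on the slide
    for number, block in enumerate(blocks, start=1):
        packet = data_packet(number, block)
        print(number, len(block), is_last_block(packet), ack_packet(number).hex())
```

Starting the range at 0 and ending at `len(content) + 1` means a file whose size is an exact multiple of 512 gets a final empty block, so the receiver can still detect the end of the transfer.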
27 BOOTP datagram
Layout (300 bytes in total; rows are 32 bits wide, the first row holds four 8-bit fields):
opcode | hardware type | hardware address length | hop count
transaction ID
number of seconds | unused
client IP address
your IP address
server IP address
gateway IP address
client hardware address (16 bytes)
server hostname (64 bytes)
boot filename (128 bytes)
vendor-specific information (64 bytes)
Opcode - request or reply
H type - 1 for Ethernet
H addr length - 6 for Ethernet
Hop count - set to 0 by the client
Trans ID - set by the client and returned by the server
Number of seconds - set by the client
Client IP - set by the client. If the client doesn't have an address => 0
Your IP - filled in by the server with the client's IP address
Server IP - filled in by the server
Gateway IP - filled in by a proxy server, if one is used
Client H address - must be set by the client
Server hostname - null-terminated string that is optionally filled in by the server
Boot filename - fully qualified, null-terminated pathname of a file to bootstrap from
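28 BOOTP
Port numbers: server 67, client 68.
Vendor-specific information
If information in the vendor-specific field is provided, the first 4 bytes of this area are set to the IP address 99.130.83.99. This is called the magic cookie.
The rest of the area is a list of items, each starting with a 1-byte tag; most items also carry a 1-byte length.
Pad: a single tag byte.
End: tag 255 (1 byte). End of the items. Any bytes after this should be set to 255.
Examples
Subnet mask: tag 1 (1 byte), length 4 (1 byte), subnet mask (4 bytes).
Gateway: tag (1 byte), length N (1 byte), IP address of the preferred gateway (4 bytes), then further 4-byte gateway addresses.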
29 BOOTP operations
Client: BOOTP process on UDP port 68. Server: BOOTP server on UDP port 67.
Boot process. The client sends a request (dest UDP port 67, with source IP and dest IP set). NOBODY ANSWERS, so the client sends the request again.
Server's reply: source IP, your IP, server IP, gateway IP, boot file name - BFILE. The client receives the information.
Is my IP address unique? The client sends an ARP request to see if anyone else on the network has the same address. It sends a second ARP request 0.5 second later, and a third ARP request 0.5 second after that. In the third ARP request the source IP address is the client's own address. NOBODY ANSWERS: my IP address is unique!
ARP request "who is server" (sender IP, target IP). ARP reply: sender, target IP, target hardware address - the server's.
TFTP: the client reads the boot file BFILE from the server.
I have an IP address, I have a loadable image. I can start!
31 TCP packet
Header layout (each row is 32 bits, bits 0 to 31):
Source port | Destination port
Sequence number
Acknowledgment number
Header length (4) | Reserved (6) | Flags (6) | Window
Checksum | Urgent pointer
Options (+ padding)
DATA
The MSS option is used only in SYN packets.
32 TCP sequence and acknowledgment numbers
Client: send 10 bytes. SEQ 10, no ACK.
Server: receiving SEQ 10 and 10 bytes. ACK = 10 (SEQ) + 10 bytes = 20. Send my own data with my own SEQ and ACK = 20: send 20 bytes, SEQ 30, ACK 20.
Client: receiving SEQ 30, DATA 20, ACK 20. The server received my data, its ACK = 20. My current SEQ = previous SEQ plus data sent = 10 + 10 = 20. My ACK = 30 + 20 = 50. Send 10 bytes, SEQ 20, ACK 50.
Server: receiving SEQ 20, DATA 10, ACK 50. The client received my data, its ACK = 50. My current SEQ = previous SEQ plus data sent = 30 + 20 = 50. My ACK = 20 + 10 = 30. Send 20 bytes, SEQ 50, ACK 30.
And so on...
33 TCP connection establishment
Client: active open, ISN = 145. Server: passive open, ISN = 348. ISN - initial sequence number.
Client: send a packet with the S (SYN) flag (SYN segment). The packet contains the port number of the server that the client wants to connect to. SEQ 145, ACK -, flags S.
Server: receiving the packet. Respond with its own SYN segment containing its own SN and an ACK for the client's SYN plus one (a SYN consumes one sequence number): ACK = 145 + 1 = 146. SEQ 348, ACK 146, flags SA.
Client: receiving the server's response. The server's response contains the correct ACK. Acknowledge the server's SYN with ACK = server's SN + 1 = 348 + 1 = 349. ACK 349, flags A.
The connection establishment is completed.
The three segments described complete the connection establishment. This is often called the three-way handshake.
34 TCP connection termination
A TCP connection is full duplex, and each direction must be shut down independently.
Client: active close. Server: passive close.
Client: the user types "quit", for example. Send FIN - a packet with the FIN flag. SEQ 426, ACK 659, flags FA.
Server: receiving the FIN packet. Respond with the corresponding ACK. ACK 427, flags A.
Now the connection is "half-closed". The server may still send some data to the client, with corresponding ACKs. Then the server closes the other direction of the connection.
Server: I should close the second direction. SEQ 658, ACK 426, flags FA.
Client: receiving the FIN packet. Respond with the corresponding ACK. ACK 659, flags A.
The connection is closed.
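35 TCP states for connection establishment and termination
Establishment (client: active open, server: passive open):
Client sends SYN J and enters SYN_SENT.
Server enters SYN_RCVD and sends SYN K, ack J+1.
Client enters ESTABLISHED and sends ack K+1.
Server enters ESTABLISHED.
Termination (client: active close, server: passive close):
Client sends FIN M and enters FIN_WAIT_1.
Server enters CLOSE_WAIT and sends ack M+1.
Client enters FIN_WAIT_2.
Server sends FIN N and enters LAST_ACK.
Client enters TIME_WAIT and sends ack N+1. The client stays in this state for twice the MSL.
Server enters CLOSED.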
36 2MSL state
All received datagrams are discarded. It is impossible to open another connection for this socket pair (IP tuple).
Quiet time
Suppose a host in the 2MSL wait crashes, reboots within MSL seconds and immediately establishes new connections using the same local and foreign IP addresses and port numbers. To protect against this scenario, RFC 793 states that TCP should not create any connections for MSL seconds after rebooting. This is called the quiet time.
Reset segments
Reset segment - the "reset" bit in the TCP header is set to 1. Any queued data is thrown away and the reset is sent immediately. The receiver of the RST can tell that the other end did an abort instead of a normal close.
Example. We are trying to connect to a server with a port number that is not in use on the destination. With UDP, an ICMP "port unreachable" message is sent in this case. TCP sends a reset segment.
Client: SEQ 400, flags S, port 10000.
Server: the server doesn't have a process on port 10000. SEQ 0, ACK 401, flags RA.
FIN - orderly release. RST - abortive release.
37 Half-Open
Packets flow in both directions. All is fine!
But sometimes something can crash.
The live computer doesn't know that its peer has died. The peer hasn't sent a FIN or RST segment.
The connection is half-open.
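38 Simultaneous Open
Usual connection open (one side active open, the other passive open):
The active side sends SYN J and enters SYN_SENT.
The passive side enters SYN_RCVD and sends SYN K, ack J+1.
The active side enters ESTABLISHED and sends ack K+1.
The passive side enters ESTABLISHED.
Simultaneous open (both sides active open):
One side sends SYN J, the other sends SYN K; both enter SYN_SENT.
Each side receives the other's SYN and enters SYN_RCVD.
One side sends SYN J, ack K+1; the other sends SYN K, ack J+1.
Both sides enter ESTABLISHED.
Result - one connection, not two.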
40 TCP options (RFC 793 and 1323) (examples)
End of option list: kind=0 (1 byte)
No operation: kind=1 (1 byte)
These two options don't have a length field. The others do. Length is the total length, including the kind and len bytes.
Maximum segment size: kind=2 (1 byte), len=4 (1 byte), MSS (2 bytes)
Window scale factor: kind=3 (1 byte), len=3 (1 byte), shift count (1 byte)
Timestamp: kind=8 (1 byte), len=10 (1 byte), timestamp value (4 bytes), timestamp echo reply (4 bytes)
41 Delayed Acknowledgment (delayed ACK)
For example, the delayed ACK timer here is 200 ms. Look at the client. The time axis is divided into 200 ms intervals; the kernel timer was started a long time ago.
Server sends: PSH 2:6 (4) ack 11.
Client: is waiting. The client doesn't send the ACK immediately. It delays the ACK, hoping to have data to send in the same direction as the ACK. It can wait until the next "delayed ACK" boundary.
And now, at the boundary, the client sends: ack 6. Here the delayed ACK flag is turned off.
Another instant. Server sends: PSH 6:12 (4) ack 11.
Client: is waiting. TCP has decided to send a data packet, so the ACK is piggybacked: PSH 11:15 (4) ack 12.
42 Nagle algorithm
The application on the client writes data into the TCP buffer 1 byte at a time.
Client: PSH 2:3 (1) ack 2. Send packet.
More bytes arrive in the TCP buffer. TCP doesn't send a packet: we are waiting for the first packet's ACK.
Server: ack 3. TCP has received the ACK. Now it can send the data from the buffer.
Client: PSH 3:5 (2) ack 2.
The buffer fills up to the mss (20 bytes). TCP has enough data to send an entire packet. And TCP does it.
Client: PSH 5:25 (20) ack 2. Server: ack 5, ack 25.
Bla, bla, bla... time has passed.
Client: PSH 8:10 (2) ack 55.
Server: PSH 55:56 (1) ack 10 (packet *).
Client: ack 56.
The ACK is received, I have data, I prepare and send a packet: PSH 10:12 (2) ack 56.
* Before the packet was pushed onto the physical medium, another packet from the server had been received.
Server: PSH 56:58 (2) ack 10.
Now I have data to send again. And I have a "free" ACK from the server (packet *).
Server: PSH 56:58 (2) ack 12.
43 TCP timers
Retransmission timer. This timer is used when expecting an acknowledgment from the other end.
Persist timer. It keeps window size information flowing even if the other end closes its receive window.
Keepalive timer. It detects when the other end of an otherwise idle connection crashes or reboots.
2MSL timer. It measures the time a connection has been in the TIME_WAIT state.
44 Round-Trip Time
Send bytes: PSH 2:3 (1) ack 2. Receive the ACK for those bytes: ack 3. The time between the two is the measured RTT (M).
The following formulas are used to calculate the retransmission timeout value (RTO):
Err = M - A
A <- A + g*Err
D <- D + h*(|Err| - D)
RTO = A + 4D
A - smoothed RTT (an estimator of the average)
D - smoothed mean deviation
g = 1/8
h = 1/4
Karn's algorithm. The algorithm specifies that when a retransmission occurs, we cannot update the RTT estimators when the acknowledgment for the retransmitted data finally arrives.
45 RTT example. Measurement.
Most implementations measure only one RTT value per connection at any time. If the timer for a given connection is already in use when a data segment is transmitted, that segment is not timed.
Segment trace (segment number: contents):
1: 1:257 (256) ack 1 (start timer, RTT No. 1)
2: ack 257 (stop timer)
3: 257:513 (256) ack 1 (start timer, RTT No. 2)
4: 513:769 (256) ack 1
5: ack 513 (stop timer)
6: 769:1025 (256) ack 1 (start timer, RTT No. 3)
7: 1025:1281 (256) ack 1
8: ack 769
9: 1281:1537 (256) ack 1
10: ack 1025 (stop timer)
11: 1537:1793 (256) ack 1
12: ack 1281
...
46 RTT example. Measurement.
Segment trace: segments 1 to 12, from 1:257 (256) ack 1 to ack 1281, with three timed round trips (RTT No. 1, RTT No. 2, RTT No. 3).
The timing is done by incrementing a counter every time the 500-ms TCP timer routine is invoked. The figure shows the relationship in our example between the actual RTT, which we can determine with a network analyzer, and the counted clock ticks.
Timer ticks occur at 0.03, 0.53, 1.03, 1.53, 2.03, 2.53 and 3.03 seconds.
RTT No. 1: 3 ticks. RTT No. 2: 1 tick. RTT No. 3: 2 ticks.
47 RTT example. Calculation.
Segment trace: segments 1 to 12, from 1:257 (256) ack 1 to ack 1281, with three timed round trips.
Err = M - A
A <- A + g*Err
D <- D + h*(|Err| - D)
RTO = A + 4D
RTT No. 1 = 3 ticks, RTT No. 2 = 1 tick, RTT No. 3 = 2 ticks (one tick is 500 ms).
A is initialized to 0 and D is initialized to 3. Initial RTO = A + 2D = 0 + 2*3 = 6 seconds (the factor 2 is used only for the initial calculation).
When the ACK for the first data segment arrives (segment 2), the measured RTT is 3 ticks (M = 1.5 seconds) and our estimators are initialized as
A = M + 0.5 = 1.5 + 0.5 = 2
D = A/2 = 1
RTO = A + 4D = 2 + 4*1 = 6 seconds
When the ACK for the second data segment arrives (segment 5), the measured RTT is 1 tick (M = 0.5 seconds) and the update is
Err = M - A = 0.5 - 2 = -1.5
A = A + g*Err = 2 - 0.125*1.5 = 1.8125
D = D + h*(|Err| - D) = 1 + 0.25*(1.5 - 1) = 1.125
RTO = A + 4D = 1.8125 + 4*1.125 = 6.3125
But most implementations use an RTO that is a multiple of 500 ms. In our instance the RTO will be 6 seconds.
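A minimal sketch of the estimator, assuming measurements are given in seconds and g = 1/8, h = 1/4. The class name is hypothetical, and the first-measurement rule A = M + 0.5 is an assumption chosen because it reproduces the slide's A = 2 for a 3-tick (1.5 s) measurement. Rounding the RTO to a multiple of 500 ms is left out.

```python
class RttEstimator:
    """Smoothed RTT (A), smoothed mean deviation (D) and the resulting RTO."""

    G = 1 / 8   # gain for the average
    H = 1 / 4   # gain for the deviation

    def __init__(self) -> None:
        self.a = 0.0
        self.d = 3.0
        self.rto = self.a + 2 * self.d      # factor 2 only for the initial RTO
        self._first = True

    def update(self, measured_rtt: float) -> float:
        """Feed one measurement M (seconds) and return the new RTO."""
        if self._first:
            # Assumed initialization rule for the first measurement.
            self.a = measured_rtt + 0.5
            self.d = self.a / 2
            self._first = False
        else:
            err = measured_rtt - self.a
            self.a += self.G * err
            self.d += self.H * (abs(err) - self.d)
        self.rto = self.a + 4 * self.d
        return self.rto

if __name__ == "__main__":
    estimator = RttEstimator()
    print("initial RTO", estimator.rto)
    for ticks in (3, 1, 2):                 # tick counts from the example, 500 ms each
        print(ticks, "ticks ->", estimator.update(ticks * 0.5))
```

Under Karn's algorithm, `update` would simply not be called for a segment that had to be retransmitted.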
48 Congestion example.
Normal data flow:
Sender: 6401:6657 (256) ack 1, then 6657:6913 (256) ack 1. Receiver: data goes to the application, ack 6657, ack 6913.
Congestion. For example, a router loses a packet:
Sender: 6913:7169 (256) ack 1 (lost), then 7169:7425 (256) ack 1.
Receiver: the host knows that the previous packet is missing. It sends an ACK for the last packet received in order and saves the packet it has just received: ack 6913 (save 256).
Sender: 7425:7681 (256) ack 1, 7681:7937 (256) ack 1, 7937:8193 (256) ack 1.
Receiver: ack 6913 (save 256) for each of them: first duplicate ACK, second duplicate ACK, third duplicate ACK.
Sender: the third duplicate ACK has arrived. Retransmission: 6913:7169 (256) ack 1.
Receiver: received the missing packet. Now this host has all the data bytes. All saved data goes to the application, ack 8193.
Sender: 8193:8449 (256) ack 1. Receiver: data goes to the application, ack 8449.
TCP counts the number of duplicate ACKs received, and when the third one is received it assumes that a segment has been lost. TCP retransmits only one segment, starting with that sequence number. We discuss the fast retransmit algorithm later.
49 Slow start.
Slow start works with the congestion window, CWND. CWND is initialized to 1 (one) segment and is increased by one segment each time an ACK is received.
cwnd = 1: 1:513 (512) ack 1. Receiver: ack 513.
cwnd = 2: 513:1025 (512) ack 1, 1025:1537 (512) ack 1. Receiver: ack 1025.
cwnd = 3: 1537:2049 (512) ack 1, 2049:2561 (512) ack 1. The sender sends only two segments because the ACK for segment 1025:1537 hasn't been received. Result: we have CWND = 3 and 3 segments sent without an ACK. Receiver: ack 1537.
cwnd = 4: 2561:3073 (512) ack 1, 3073:3585 (512) ack 1.
And so on.
At some point the capacity of the network can be reached and some packets can be discarded. This situation tells the sender that its CWND is too large. We'll see the mechanism of CWND adjustment later.
The sender can transmit up to the minimum of the congestion window and the advertised window. CWND is flow control imposed by the sender.
CWND is maintained in bytes.
50 Congestion avoidance algorithm.
Congestion avoidance and slow start are different algorithms. But in practice congestion avoidance and slow start are implemented together. When congestion occurs, TCP slows down the transmission rate of packets into the network and then invokes slow start to get things going again.
Congestion avoidance and slow start require that two variables be maintained for each connection:
CWND
a slow start threshold size, ssthresh
There are two indications of packet loss:
a timeout occurs
the receipt of duplicate ACKs
51 Congestion avoidance algorithm. How the combined algorithm works.
Initialization: CWND = 1 segment, SSTHRESH = 65535 bytes.
Normal data flow, CWND is growing.
Congestion occurs! SSTHRESH = CWS/2 (CWS - current window size).
Is the congestion indicated by a timeout? Yes: CWND = 1 segment. No: CWND is left as it is.
Retransmission, bla-bla-bla... At last: an ACK is received.
TCP increases CWND, but the way it increases depends on whether TCP is performing slow start or congestion avoidance. CWND <= SSTHRESH?
Yes: TCP is doing SLOW START. Slow start has CWND start at one segment and be incremented by one segment every time an ACK is received. (Do you remember the slide before?) Slow start continues until we are halfway to where congestion occurred (since we recorded half of the window size that got us into trouble), and then congestion avoidance takes over.
No: CONGESTION AVOIDANCE. Congestion avoidance dictates that CWND be incremented by 1/CWND each time an ACK is received. So we want to increase CWND by at most one segment each RTT, whereas slow start increments CWND by the number of ACKs received in an RTT.
52 Congestion avoidance algorithm. Illustration.
Figure: CWND (in segments) plotted against round-trip times, with the line SSTHRESH = 16 marked.
Starting point: we assume that congestion has just occurred when CWND had a value of 32 segments. The congestion was indicated by a timeout.
SSTHRESH = 32 / 2 = 16, CWND = 1. 1 segment is sent at time 0.
At time 1 the ACK is returned and CWND is incremented to 2 segments.
At time 2 two ACKs are returned and CWND is incremented to 4 segments (CWND was 2 and two ACKs were received).
And so on.
CWND = SSTHRESH. Slow start is stopped and congestion avoidance is started.
Now congestion avoidance is working. The increase of CWND is linear, with a maximum increase of one segment per round-trip time.
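The curve in this illustration can be approximated with a short per-round-trip sketch. It works in whole segments and assumes every segment sent in one round trip is acknowledged in the next, so CWND doubles during slow start and grows by one segment per round trip afterwards. The function name and the floor of 2 segments for SSTHRESH are assumptions.

```python
def cwnd_per_round_trip(window_at_loss: int, round_trips: int) -> tuple:
    """Return (ssthresh, list of CWND values in segments, one per round trip)."""
    ssthresh = max(2, window_at_loss // 2)   # half of the window that got us into trouble
    cwnd = 1                                 # a timeout restarts slow start at one segment
    history = [cwnd]
    for _ in range(round_trips):
        if cwnd < ssthresh:
            # Slow start: one ACK per segment, so CWND doubles each round trip.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: at most one extra segment per round trip.
            cwnd += 1
        history.append(cwnd)
    return ssthresh, history

if __name__ == "__main__":
    ssthresh, history = cwnd_per_round_trip(window_at_loss=32, round_trips=8)
    print("ssthresh =", ssthresh)
    print("cwnd per round trip:", history)
```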
53 Fast retransmit and fast recovery algorithms.
TCP host: I am able to send 3 packets. It sends 1:513 (512) ack 1, 513:1025 (512) ack 1 and so on into the NETWORK.
Receiver: ack 513, then ack 513 (1st duplicate ACK), ack 513 (2nd duplicate ACK), ack 513 (3rd duplicate ACK).
After the first and second duplicate ACKs: a duplicate ACK may also be generated by reordered segments.
After the third duplicate ACK: I think the segment is lost.
The host doesn't wait for the retransmission timer to expire. It sends the lost segment. This is the FAST RETRANSMIT ALGORITHM.
Slow start isn't performed, but the congestion avoidance algorithm is working. This is the FAST RECOVERY ALGORITHM.
54 Fast retransmit and fast recovery algorithms. How the combined algorithm works.
The 3rd duplicate ACK is received: SSTHRESH = CWS/2. Retransmit the missing segment. CWND = SSTHRESH + 3 * segment size.
If another duplicate ACK arrives: INC(CWND; segment size); transmit a packet (if CWND allows).
An ACK is received which acknowledges all data segments sent between the lost packet and the 1st duplicate ACK: CWND = SSTHRESH. Congestion avoidance is now working.
55 Slow start and congestion avoidance example
Figure: CWND and SEQ x 1000 plotted against time, with segment numbers taken from the table. The connection opens with SYN, SYN+ACK and ACK, followed by several bursts of data.
Initialize: CWND = MSS = 256, SSTHRESH = 65535.
A timeout occurs: SSTHRESH = CWS/2 = minimum value = 512. CWND = 1 segment = 256.
Here there are no changes, because new data is not being acknowledged.
Here is an ACK for data! CWND <= SSTHRESH, so we are in slow start. 1 segment = 256. CWND = 256 + 256 = 512.
CWND <= SSTHRESH: slow start. CWND = CWND + 1 segment. CWND = 512 + 256 = 768.
CWND > SSTHRESH: congestion avoidance. CWND <- 768 + 256*256/768 + 256/8. We are using integer arithmetic. CWND = 885.
CWND > SSTHRESH: congestion avoidance. CWND <- 885 + 256*256/885 + 256/8. We are using integer arithmetic. CWND = 991.
CWND > SSTHRESH: congestion avoidance. CWND <- 991 + 256*256/991 + 256/8. We are using integer arithmetic. CWND = 1089.
The real formula for 1/CWND is cwnd <- cwnd + (segsize*segsize)/cwnd + segsize/8
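The byte-based update used in this example can be sketched directly from the formula on the slide, with integer arithmetic. The function names are hypothetical, and the floor of two segments for SSTHRESH after a timeout is an assumption that matches the slide's "minimum value = 512" for 256-byte segments.

```python
def on_ack(cwnd: int, ssthresh: int, segsize: int) -> int:
    """New CWND (bytes) after an ACK that acknowledges new data."""
    if cwnd <= ssthresh:
        # Slow start: one more segment per ACK.
        return cwnd + segsize
    # Congestion avoidance: cwnd <- cwnd + (segsize*segsize)/cwnd + segsize/8
    return cwnd + (segsize * segsize) // cwnd + segsize // 8

def on_timeout(current_window: int, segsize: int) -> tuple:
    """Return (ssthresh, cwnd) after a retransmission timeout."""
    ssthresh = max(2 * segsize, current_window // 2)   # assumed floor: two segments
    return ssthresh, segsize

if __name__ == "__main__":
    segsize = 256
    ssthresh, cwnd = 512, 256        # state right after the timeout in the example
    for _ in range(5):
        cwnd = on_ack(cwnd, ssthresh, segsize)
        print(cwnd)
```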
56 Slow start and congestion avoidance example
Figure: CWND and SEQ x 1000 plotted against time, with segment numbers taken from the table.
The first two duplicate ACKs are received and counted, and CWND is left alone.
The third duplicate ACK arrives. SSTHRESH = CWND/2 = 2426/2 = 1024 (rounded down to the next multiple of the segment size). CWND = SSTHRESH + number of duplicate ACKs * segment size = 1024 + 3 * 256 = 1792. The retransmission is sent.
A duplicate ACK is received. CWND = CWND + 1 segment = 1792 + 256 = 2048. But CWND is not big enough to send data.
A duplicate ACK is received. CWND = CWND + 1 segment = 2048 + 256 = 2304. But CWND is not big enough to send data.
A duplicate ACK is received. CWND = CWND + 1 segment = 2304 + 256 = 2560. We can send data. Data is sent. NOTE: here we have unacknowledged data from previous segments.
There are some more segments in the same situation.
An ACK for new data is received. CWND <= SSTHRESH: slow start!!! CWND = SSTHRESH + segment size = 1024 + 256 = 1280.
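A minimal sketch of the window arithmetic in this example, using hypothetical function names. It assumes, as the slide's numbers suggest, that SSTHRESH is half the window rounded down to a multiple of the segment size, and that the first ACK for new data sets CWND back to SSTHRESH and then applies one slow-start increment.

```python
def on_third_duplicate_ack(cwnd: int, segsize: int) -> tuple:
    """Fast retransmit: return (ssthresh, cwnd) once the third duplicate ACK arrives."""
    half_window = cwnd // 2
    ssthresh = max(2 * segsize, half_window // segsize * segsize)  # round down to whole segments
    return ssthresh, ssthresh + 3 * segsize

def on_further_duplicate_ack(cwnd: int, segsize: int) -> int:
    """Fast recovery: each further duplicate ACK inflates CWND by one segment."""
    return cwnd + segsize

def on_new_data_ack(ssthresh: int, segsize: int) -> int:
    """First ACK for new data: deflate CWND to SSTHRESH, then one slow-start step."""
    cwnd = ssthresh
    return cwnd + segsize          # CWND <= SSTHRESH, so slow start adds a segment

if __name__ == "__main__":
    segsize = 256
    ssthresh, cwnd = on_third_duplicate_ack(2426, segsize)
    print(ssthresh, cwnd)
    for _ in range(3):
        cwnd = on_further_duplicate_ack(cwnd, segsize)
        print(cwnd)
    print(on_new_data_ack(ssthresh, segsize))
```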
57 TCP keepalive timer
A TCP implementation may use the keepalive option. This option is used to find out: is my peer alive?
One example is a half-open connection. One peer has died but the other end doesn't know about it. It keeps the socket (IP address + port number) for the dead peer. But the peer doesn't need anything any more... And the live one must know it!
Usually the keepalive timer is 2 hours.
There are 4 scenarios if there is no activity on the connection and one peer sends a keepalive probe to the other.
58 TCP keepalive timer
Scenario 1. The peer is alive and reachable.
Client and server exchange packets. That's all... The peers don't have any data to send to each other, but the connection is established.
2 (two) hours have passed...
Client: my keepalive timer has expired. Is my peer alive? But I forgot its MAC address... ARP request, ARP reply, then the keepalive probe.
The keepalive probe has a SEQ that is one less than it should be (for example, the receiver waits for SEQ = 14, but the keepalive probe has SEQ = 13). The receiver gets a packet with an incorrect SEQ and is forced to respond with an ACK which contains the next SEQ that the server is expecting.
The client received the answer from the server. It knows that the server is alive and resets its keepalive timer.
Scenario 2. The peer crashed or is in the process of rebooting.
Client and server exchange packets. That's all... The peers don't have any data to send to each other, but the connection is established. But the peer has crashed.
2 hours have passed.
Client: my keepalive timer has expired. Is my peer alive? TCP sends a probe. Don't look at the lower level (ARP) now. We should find out whether the peer is alive or not.
75 seconds... No answer. Another keepalive probe. 75 seconds... No answer.
The client sends 10 keepalive probes. If it doesn't receive a response, it considers the peer's host to be down and terminates the connection.
59 TCP keepalive timer
Scenario 3. The peer has crashed and rebooted.
The host has crashed and rebooted. It has a working TCP stack but doesn't have a socket for that connection.
I'll be laconic... 2 hours have passed.
Client: once again... My keepalive timer has expired. Is my peer alive? It sends a keepalive probe.
Server: are they crazy? I don't have such a socket! It resets the connection.
Scenario 4. The peer is running, but unreachable.
In this scenario the situation is the same as in scenario 2, from the client's point of view. This situation may be caused by a failure of an intermediate router.
60 Path MTU Discovery
Connection established. The starting segment size is MIN(MTU of my outgoing interface less the headers; MSS announced by the other end). If the other end doesn't specify an MSS, it defaults to 536.
MSS = MTU - IP header - TCP header.
It is possible to save the path MTU on a per-route basis.
We send datagrams with the DF (don't fragment) bit set.
But things change... For example, a router fails and the route changes. Another router needs to fragment our datagram, but the datagram has the DF bit set, so the router sends an ICMP error to our host.
We have received the ICMP error "can't fragment". Decrease the MTU:
If the router generates the newer form of the ICMP error message, the message contains the MTU of the next hop, and we use it.
If the router generates the older form of the ICMP error message, we take the next smaller MTU.
Things keep changing... After a timeout we can try a bigger MTU (depending on the implementation). RFC 1191 recommends 10 minutes.
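The rules on this slide can be sketched as follows. This is a simplified illustration, not a real stack: it assumes 20-byte IP and TCP headers without options, the function names are hypothetical, and the table of "next smaller" MTU values is an illustrative list of plateau values in the spirit of RFC 1191.

```python
IP_TCP_HEADERS = 40   # assumed: 20-byte IP header + 20-byte TCP header
DEFAULT_MSS = 536     # used when the other end announces no MSS

# Illustrative plateau table of MTU values, largest first.
PLATEAUS = [65535, 32000, 17914, 8166, 4352, 2002, 1492, 1006, 508, 296, 68]


def initial_segment_size(interface_mtu, peer_mss=None):
    """Starting segment size when the connection is established."""
    if peer_mss is None:
        peer_mss = DEFAULT_MSS
    return min(interface_mtu - IP_TCP_HEADERS, peer_mss)


def next_smaller_mtu(current_mtu):
    """Pick the next plateau below the MTU that just failed."""
    for plateau in PLATEAUS:
        if plateau < current_mtu:
            return plateau
    return PLATEAUS[-1]


def segment_size_after_icmp(current_segment_size, next_hop_mtu=None):
    """React to an ICMP "can't fragment" error.

    Newer routers report the next-hop MTU; older ones report nothing
    (None or 0 here), so we fall back to the next smaller plateau.
    """
    if next_hop_mtu:
        new_mtu = next_hop_mtu
    else:
        new_mtu = next_smaller_mtu(current_segment_size + IP_TCP_HEADERS)
    return min(current_segment_size, new_mtu - IP_TCP_HEADERS)


print(initial_segment_size(1500, peer_mss=512))
print(segment_size_after_icmp(512, next_hop_mtu=296))
print(segment_size_after_icmp(512))
```

The periodic attempt to try a bigger MTU again after the timeout is left out to keep the sketch short.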
61 TCP packet with MSS option
TCP header fields: Source port, Destination port, Sequence number, Acknowledgment number, Data offset, Reserved, Flags, Window, Checksum, Urgent pointer, Options (+padding), then DATA.
Maximum segment size option: kind=2 (1 byte), len=4 (1 byte), MSS (2 bytes).
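As a small illustration of the option layout, the sketch below packs and unpacks the 4 bytes of the MSS option (kind 2, length 4, then a 16-bit MSS in network byte order). The helper names are hypothetical.

```python
import struct

MSS_KIND = 2
MSS_LEN = 4  # the length counts the kind and len bytes too


def encode_mss_option(mss):
    """Pack the MSS option: kind (1 byte), len (1 byte), MSS (2 bytes)."""
    return struct.pack("!BBH", MSS_KIND, MSS_LEN, mss)


def decode_mss_option(data):
    """Unpack an MSS option and return the MSS value."""
    kind, length, mss = struct.unpack("!BBH", data[:4])
    if kind != MSS_KIND or length != MSS_LEN:
        raise ValueError("not an MSS option")
    return mss


option = encode_mss_option(1460)
print(option.hex())               # 020405b4
print(decode_mss_option(option))  # 1460
```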
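62 Path MTU Discovery. Example.
Diagram: Host 2, Router 1, Host 1. MTU values shown on the diagram: 1500, 1500, 552 and 296.
Host 2 sends SYN, mss = 1460.
Host 1 answers SYN, ACK, mss = 512.
Host 2: "The MTU is 552! I can send a datagram with 512 bytes of data." It sends 1:513(512) ACK with the DF bit set.
Router 1: "I can't send such a big datagram without fragmentation. But the DF bit is set => an error occurs!"
Router 1 returns an ICMP error message: Host 1 unreachable, need to frag, mtu = 296 (the newer form of the message, generated by a newer router implementation).
Host 2: "My MSS is now 256 (MTU = 296)." It sends 1:257(256) ACK.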
63 Window Scale Option
Networks are growing and buffers are getting bigger, so the maximum window size allowed by the window field in the TCP header is no longer enough: the field has only 16 bits.
Newer implementations use the WINDOW SCALE OPTION. They can still work with older implementations.
TCP header fields: Source port, Destination port, Sequence number, Acknowledgment number, Data offset, Reserved, Flags, Window, Checksum, Urgent pointer, Options (+padding), then DATA. The Options field can contain the WINDOW SCALE OPTION.
Window scale option: kind=3 (1 byte), len=3 (1 byte), shift count (1 byte).
The WINDOW SCALE OPTION can be advertised only in a SYN segment. The scale factor is fixed in each direction when the connection is established.
Shift count 0: no scaling is performed.
64 Window Scale Option. Setting.
To enable window scaling, both ends must include this option in their SYN segments.
Diagram: SYN, wscale 1 (active open); SYN, ACK, wscale 3.
Active peer: "I think my window scale should be 1."
Passive peer: "The active peer is going to use window scaling! I understand the option and choose my own window scale. Even if I needed no scaling, I would still have to send the option, with the shift count set to 0."
How the scale works.
The window scale is the number of bits by which the value from the window field is shifted left to get the real window size.
For example, the window scale was set to 1 and the window size in the received packet is 4 (it's only an example). Shifting the value left by 1 bit gives a real advertised window of 8.
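The negotiation rule and the shift itself can be sketched as below. The helper names are hypothetical; `None` stands for "the SYN carried no window scale option".

```python
import struct


def encode_wscale_option(shift):
    """Pack the window scale option: kind=3, len=3, shift count (1 byte each)."""
    return struct.pack("!BBB", 3, 3, shift)


def negotiated_shifts(syn_shift, synack_shift):
    """Return (active_shift, passive_shift) actually used on the connection.

    Scaling is enabled only if both SYN segments carry the option;
    otherwise both directions fall back to a shift count of 0.
    """
    if syn_shift is None or synack_shift is None:
        return (0, 0)
    return (syn_shift, synack_shift)


def real_window(window_field, shift):
    """Shift the 16-bit window field left to get the real window size."""
    return window_field << shift


print(real_window(4, 1))            # 8, the example from the slide
print(negotiated_shifts(1, 3))      # (1, 3)
print(negotiated_shifts(1, None))   # (0, 0): the peer did not send the option
```

Each end's shift count applies to the windows that end advertises, so the two directions can use different scale factors.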
65 Timestamp option
Timestamp option: kind=8 (1 byte), len=10 (1 byte), timestamp value (4 bytes), timestamp echo reply (4 bytes).
The timestamp option is used for better RTT calculation. The sender places a 32-bit value in the first field and the receiver echoes it back in the reply field.
To use this option, both ends must support it. To establish it, the active peer must set the timestamp option in its SYN, and the other (passive) end must answer with the option too.
Only one timestamp value is kept per connection. How does TCP do it?
The receiver's TCP keeps two values: the timestamp value to send in the next ACK (tsrecent), and the ACK number from the last ACK that was sent (lastack). The ACK number is the next sequence number we are waiting for.
Segment arrives: if the segment contains the byte numbered lastack, tsrecent = timestamp value from the segment.
ACK is sent: tsrecent is placed in the timestamp echo reply field, and the ACK value being sent is saved in lastack.
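The receiver-side bookkeeping can be sketched as a small class. It keeps the timestamp to echo and the last-ACK variable (called `lastack` here). This is a simplified illustration: the class and method names are hypothetical, segments are assumed to carry at least one byte of data, and 32-bit sequence number wraparound is ignored.

```python
class TimestampEcho:
    """Receiver-side state for the timestamp option (simplified)."""

    def __init__(self, lastack):
        self.tsrecent = 0        # timestamp value to echo in the next ACK
        self.lastack = lastack   # ACK number from the last ACK we sent

    def on_segment(self, seq, length, tsval):
        """Remember the timestamp only if the segment covers byte lastack."""
        if seq <= self.lastack < seq + length:
            self.tsrecent = tsval

    def build_ack(self, ack):
        """Send an ACK: echo tsrecent and remember the ACK number."""
        self.lastack = ack
        return ack, self.tsrecent


state = TimestampEcho(lastack=1)
state.on_segment(seq=1, length=100, tsval=500)
print(state.build_ack(101))                      # (101, 500)
state.on_segment(seq=201, length=100, tsval=700) # out of order, ignored
print(state.tsrecent)                            # 500
state.on_segment(seq=101, length=100, tsval=600) # fills the gap
print(state.tsrecent)                            # 600
```

Ignoring the timestamp of the out-of-order segment keeps the echoed value tied to the segment that actually advances the window.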
66 PAWS: Protection Against Wrapped Sequence Numbers