LINUX NETWORK IMPLEMENTATION
Jianyong Zhang

2 Introduction
The layer structure of the network implementation:
1) BSD socket layer: general data structures shared by the different protocols
2) INET socket layer: end points for the IP-based protocols TCP and UDP
3) ARP layer
4) Link layer: Ethernet, SLIP, PLIP
5) Hardware: NIC, serial port, parallel port

3 Socket system call
C interface system call routines: socket(), bind(), listen(), connect(), accept(), send(), sendto(), recv(), recvfrom(), getsockopt(), setsockopt(). All are based on the system call socketcall().
socket() returns a file descriptor, so read(), write(), select() and ioctl() also work on a socket. They reach the socket code through struct file: file -> f_op -> sock_read.
Socket inode:
    struct socket *sock_alloc(void)
    {
        …
        inode->i_mode = S_IFSOCK|S_IRWXUGO;   /* file type "socket", rwx for user, group, other */
        inode->i_sock = 1;
        inode->i_uid = current->fsuid;
        inode->i_gid = current->fsgid;
        sock->inode = inode;
        …
    }
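The S_IFSOCK mode set in sock_alloc() is visible from user space. The following is a minimal sketch, assuming a POSIX system; the helper names fd_is_socket() and socket_vs_pipe_demo() are hypothetical and exist only for this illustration. It checks that a descriptor returned by socket() reports the socket file type through fstat(), while a pipe descriptor does not.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if fd refers to a socket, 0 if it refers to something else,
 * and -1 if fstat() fails (for example on an invalid descriptor). */
int fd_is_socket(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    /* S_ISSOCK() tests the S_IFSOCK file-type bits in st_mode. */
    return S_ISSOCK(st.st_mode) ? 1 : 0;
}

/* Creates one socket and one pipe. Returns 1 if only the socket reports
 * the socket file type, 0 otherwise, and -1 if creation fails. */
int socket_vs_pipe_demo(void)
{
    int pipefd[2];
    int ok;
    int sock = socket(AF_UNIX, SOCK_STREAM, 0);

    if (sock < 0)
        return -1;
    if (pipe(pipefd) != 0) {
        close(sock);
        return -1;
    }
    ok = fd_is_socket(sock) == 1 && fd_is_socket(pipefd[0]) == 0;
    close(sock);
    close(pipefd[0]);
    close(pipefd[1]);
    return ok;
}
```

An AF_UNIX socket is used here only because it needs no network configuration; the file-type check is the same for an INET socket.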

4 Generic system call
socketcall() function:
    asmlinkage int sys_socketcall(int call, unsigned long *args)
    {
        …
        unsigned long a[6];     /* kernel copy of the user-space argument block */
        unsigned long a0,a1;
        int err;
        /* copy_from_user should be SMP safe. */
        if (copy_from_user(a, args, nargs[call]))
            return -EFAULT;
        a0=a[0];
        a1=a[1];
        switch(call) {
        case SYS_SOCKET:
            err = sys_socket(a0,a1,a[2]);
            break;
        case SYS_BIND:
            err = sys_bind(a0,(struct sockaddr *)a1, a[2]);
            break;
        …
        }
        …
    }
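The multiplexing pattern of sys_socketcall() can be modelled in user space. This is a sketch under stated assumptions, not kernel code: the call numbers DEMO_SOCKET and DEMO_BIND, the table demo_nargs and the stub handlers are hypothetical, and memcpy() stands in for copy_from_user(). It shows the three steps: validate the call number, copy the per-call number of argument bytes, then dispatch.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical call numbers for this sketch. */
enum { DEMO_SOCKET = 1, DEMO_BIND = 2, DEMO_MAX = DEMO_BIND };

/* Argument list size in bytes for x unsigned long arguments. */
#define AL(x) ((x) * sizeof(unsigned long))

/* Bytes of arguments to copy for each call number; index 0 is unused. */
static const unsigned char demo_nargs[DEMO_MAX + 1] = { AL(0), AL(3), AL(3) };

/* Stub handlers standing in for the real per-call functions. They return
 * arbitrary values that are easy to check. */
static long demo_socket(unsigned long family, unsigned long type,
                        unsigned long protocol)
{
    (void)protocol;
    return (long)(family * 100 + type);
}

static long demo_bind(unsigned long fd, unsigned long addr, unsigned long len)
{
    (void)addr;
    return (long)(fd + len);
}

/* One entry point for all calls: validate, copy arguments, dispatch. */
long demo_socketcall(int call, const unsigned long *args)
{
    unsigned long a[6];

    if (call < 1 || call > DEMO_MAX)
        return -EINVAL;
    if (args == NULL)
        return -EFAULT;
    memcpy(a, args, demo_nargs[call]);  /* stands in for copy_from_user() */

    switch (call) {
    case DEMO_SOCKET:
        return demo_socket(a[0], a[1], a[2]);
    case DEMO_BIND:
        return demo_bind(a[0], a[1], a[2]);
    }
    return -EINVAL;
}
```

Copying only demo_nargs[call] bytes means each call reads exactly as many arguments as it declares, which is the role the nargs table plays in the fragment above.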

5 Important structures
1. struct socket
    struct socket {
        socket_state state;     /* SS_FREE, SS_UNCONNECTED, SS_CONNECTING, SS_CONNECTED, SS_DISCONNECTING */
        unsigned long flags;
        struct proto_ops *ops;
        struct inode *inode;
        struct fasync_struct *fasync_list;  /* Asynchronous wake up list */
        struct file *file;      /* File back pointer */
        struct sock *sk;
        struct wait_queue *wait;
        short type;             /* SOCK_STREAM, SOCK_DGRAM, SOCK_RAW */
        unsigned char passcred;
        unsigned char tli;
    };

6 Important structures
2. struct proto_ops (argument lists shown as () are omitted on the slide)
    struct proto_ops {
        int family;
        int (*dup) (struct socket *newsock, struct socket *oldsock);
        int (*release) (struct socket *sock, struct socket *peer);
        int (*bind) ();
        int (*connect) ();
        int (*socketpair) (struct socket *sock1, struct socket *sock2);
        int (*accept) ();
        int (*getname) ();
        unsigned int (*poll) ();
        int (*ioctl) ();
        int (*listen) (struct socket *sock, int len);
        int (*shutdown) (struct socket *sock, int flags);
        int (*setsockopt) (struct socket *sock, int level, int optname, …);
        int (*getsockopt) ();
        int (*fcntl) ();
        int (*sendmsg) ();
        int (*recvmsg) ();
    };
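struct proto_ops is a table of function pointers, so the generic socket layer can call a protocol without knowing which one it is. The sketch below is a cut-down user-space model of that idea; every demo_* name is hypothetical, and only two of the operations are modelled.

```c
#include <assert.h>
#include <stddef.h>

struct demo_socket;

/* Cut-down analogue of struct proto_ops: one table per protocol family. */
struct demo_proto_ops {
    int family;
    int (*listen)(struct demo_socket *sock, int len);
    int (*shutdown)(struct demo_socket *sock, int flags);
};

/* Cut-down analogue of struct socket: it only carries the ops pointer
 * plus two fields the handlers below modify. */
struct demo_socket {
    const struct demo_proto_ops *ops;
    int backlog;
    int shut_flags;
};

static int stream_listen(struct demo_socket *sock, int len)
{
    sock->backlog = len;
    return 0;
}

static int dgram_listen(struct demo_socket *sock, int len)
{
    (void)sock;
    (void)len;
    return -1;  /* in this model, datagram sockets cannot listen */
}

static int common_shutdown(struct demo_socket *sock, int flags)
{
    sock->shut_flags = flags;
    return 0;
}

const struct demo_proto_ops demo_stream_ops = { 1, stream_listen, common_shutdown };
const struct demo_proto_ops demo_dgram_ops  = { 2, dgram_listen,  common_shutdown };

/* Generic layer: follows sock->ops and never names a protocol. */
int demo_listen(struct demo_socket *sock, int len)
{
    return sock->ops->listen(sock, len);
}
```

Adding a protocol family in this model means adding one more table; demo_listen() stays unchanged.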

7 Important structures
3. struct sk_buff { ... }: manages individual communication packets; sk_buffs are kept in doubly linked lists.
4. struct sock { … }: the INET socket.
5. struct device { … }: controls an abstract network device (a network interface).

8 Getting the data from A to B
1. A and B each call socket(); they are then connected by calling connect() and accept().
2. A calls write(socket, data, len). The user buffer is checked with verify_area().
    {
        …
        file = fget(socket);
        inode = file->f_dentry->d_inode;
        if (!file->f_op || !(write = file->f_op->write))
            goto out;
        down(&inode->i_sem);
        ret = write(file, data, len, &file->f_pos);   /* for a socket this is sock_write() */
        up(&inode->i_sem);
        …
    }
3. sock_write()
    {
        …
        struct socket *sock;
        sock = socki_lookup(file->f_dentry->d_inode);
        …
        msg.msg_iov = &iov;
        iov.iov_base = (void *)ubuf;
        …
        return sock_sendmsg(sock, &msg, size);
    }
4. For an INET socket, this calls inet_sendmsg().
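sock_write() wraps the user buffer in a msghdr with a single iovec before handing it to sock_sendmsg(). The same wrapping can be done explicitly from user space with sendmsg(). The sketch below assumes a POSIX system and uses an AF_UNIX socketpair() instead of a TCP connection so that it needs no network setup; send_one_iovec() and demo_a_to_b() are hypothetical helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sends len bytes from buf as a message made of one iovec, mirroring the
 * msghdr that sock_write() builds inside the kernel. */
ssize_t send_one_iovec(int fd, const void *buf, size_t len)
{
    struct iovec iov;
    struct msghdr msg;

    memset(&msg, 0, sizeof msg);
    iov.iov_base = (void *)buf;  /* same cast as in sock_write() */
    iov.iov_len = len;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    return sendmsg(fd, &msg, 0);
}

/* Endpoint A sends text, endpoint B reads it with a plain read().
 * Returns the number of bytes B received, or -1 on error. */
ssize_t demo_a_to_b(const char *text, char *out, size_t cap)
{
    int sv[2];
    ssize_t n = -1;
    size_t len = strlen(text);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (send_one_iovec(sv[0], text, len) == (ssize_t)len)
        n = read(sv[1], out, cap);
    close(sv[0]);
    close(sv[1]);
    return n;
}
```

For a short message such as "hello", the read() on the receiving end returns the bytes that the single-iovec message carried.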

9 Getting the data from A to B
5. inet_sendmsg()
    {
        struct sock *sk = sock->sk;
        …
        return sk->prot->sendmsg(sk, msg, size);   /* calls tcp_v4_sendmsg() */
    }
6. Call tcp_do_sendmsg(sk, msg)
    {
        …
        struct sk_buff *skb;
        tmp = MAX_HEADER + sk->prot->max_header;
        skb = sock_wmalloc(sk, tmp, 0, GFP_KERNEL);
        skb_reserve(skb, MAX_HEADER + sk->prot->max_header);
        skb->csum = csum_and_copy_from_user(from, skb_put(skb, copy), copy, 0, &err);
        /* TCP data bytes are SKB_PUT() on top, later TCP+IP+DEV headers are SKB_PUSH()'d beneath. */
        tcp_send_skb(sk, skb, queue_it);
        …
    }
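The reserve/put/push pattern in tcp_do_sendmsg() is easier to follow with a small model. The following user-space sketch is not the kernel's sk_buff; struct demo_skb and the demo_* functions are hypothetical stand-ins that keep only the four buffer pointers. Headroom is reserved first, payload is appended at the tail, and headers are later prepended into the headroom without moving the payload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for struct sk_buff: one linear buffer, four pointers.
 * head..end is the allocation, data..tail is the part currently in use. */
struct demo_skb {
    unsigned char *head, *data, *tail, *end;
    size_t len;
};

struct demo_skb *demo_alloc_skb(size_t size)
{
    struct demo_skb *skb = malloc(sizeof *skb);

    if (skb == NULL)
        return NULL;
    skb->head = malloc(size);
    if (skb->head == NULL) {
        free(skb);
        return NULL;
    }
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    skb->len = 0;
    return skb;
}

void demo_free_skb(struct demo_skb *skb)
{
    if (skb != NULL) {
        free(skb->head);
        free(skb);
    }
}

/* Leave n bytes of headroom. Only valid while the buffer is still empty. */
void demo_skb_reserve(struct demo_skb *skb, size_t n)
{
    assert(skb->len == 0 && (size_t)(skb->end - skb->tail) >= n);
    skb->data += n;
    skb->tail += n;
}

/* Append n bytes at the tail; returns where the caller should write them. */
unsigned char *demo_skb_put(struct demo_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    assert((size_t)(skb->end - skb->tail) >= n);
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Prepend n bytes into the headroom; returns the new start of the data. */
unsigned char *demo_skb_push(struct demo_skb *skb, size_t n)
{
    assert((size_t)(skb->data - skb->head) >= n);
    skb->data -= n;
    skb->len += n;
    return skb->data;
}
```

Because each lower layer pushes its header in front of data that is already in place, the payload is written once and never shifted.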

10 Getting the data from A to B
5. tcp_send_skb() calls tcp_transmit_skb(sk, skb_clone(skb, GFP_KERNEL));
6. tcp_transmit_skb(struct sock *sk, struct sk_buff *skb)
    {
        …
        struct tcp_opt *tp = &(sk->tp_pinfo.af_tcp);
        /* Build TCP header and checksum it. */
        …
        tp->af_specific->queue_xmit(skb);
        …
    }
7. ip_queue_xmit()
    /* Queues a packet to be sent, and starts the transmitter if necessary.
       This routine also needs to put in the total length and compute the checksum. */
    {
        …
        /* Make sure we can route this packet. */
        skb->dst = dst_clone(sk->dst_cache);
        /* OK, we know where to send it, allocate and build IP header. */
        …
        /* Do we need to fragment. Again this is inefficient. We need to
           somehow lock the original buffer and use bits of it. */
        …
        /* Add an IP checksum. */
        …
        (ip_queue_xmit() continues on the next slide)

11 Getting the data from A to B
    (ip_queue_xmit(), continued)
        skb->dst->output(skb);
        …
    }
7. Bottom-half (bh) synchronization with a barrier: start_bh_atomic(void), end_bh_atomic(void)
8. dev_queue_xmit()
    {
        …
        start_bh_atomic();
        q = dev->qdisc;
        if (q->enqueue) {
            /* the device has a queueing discipline: enqueue and kick the queue */
            q->enqueue(skb, q);
            qdisc_wakeup(dev);
            end_bh_atomic();
            …
            return;
        }
        if (dev->flags&IFF_UP) {
            /* no queue: hand the packet straight to the driver */
            dev->hard_start_xmit(skb, dev);
            end_bh_atomic();
            return;
        }
    }
9. For the WD8013 card, ei_start_xmit() is called. It passes the data to the network adapter, which in turn sends the packet onto the Ethernet.
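The "Add an IP checksum" step in ip_queue_xmit() computes the standard Internet checksum over the IP header. The kernel does this with its own optimized routines; the function below is only a portable user-space sketch of the algorithm described in RFC 1071, and the name demo_inet_checksum() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Internet checksum (RFC 1071): one's-complement sum of the buffer taken as
 * big-endian 16-bit words, then complemented. The result is in host order.
 * The caller must zero the checksum field before computing it. */
uint16_t demo_inet_checksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (len & 1)                        /* odd trailing byte is padded with zero */
        sum += (uint32_t)buf[len - 1] << 8;
    while (sum >> 16)                   /* fold carries back into the low 16 bits */
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}
```

A receiver can verify a header by running the same function over it with the checksum field filled in: a valid header yields 0.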

12 Getting the data from A to B
10. The data, embedded in an Ethernet packet, is received by the NIC in B. (The NIC is assumed to be a WD8013.)
11. The NIC triggers an interrupt. This is handled by ei_interrupt(), which calls ei_receive(). (The ei_* functions are chip-specific code shared by many 8390-based Ethernet adapters.)
12. ei_receive()
    {
        …
        struct sk_buff *skb;
        skb = dev_alloc_skb(pkt_len+2);
        …
        netif_rx(skb);
        …
    }
13. netif_rx() receives a packet from a device driver and queues it for the upper (protocol) levels:
    {
        skb_queue_tail(&backlog, skb);
        mark_bh(NET_BH);
    }
14. There is only one backlog list in the entire system.
15. do_bottom_half() calls net_bh().
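The backlog is a FIFO built on a doubly linked list: netif_rx() appends at the tail and net_bh() later removes from the front. The sketch below models that queue in user space. The names demo_pkt, demo_queue and the demo_* functions are hypothetical, and the model deliberately omits the interrupt protection that a queue shared with an interrupt handler needs.

```c
#include <assert.h>
#include <stddef.h>

/* A queued packet: only the list links and an id for illustration. */
struct demo_pkt {
    struct demo_pkt *next, *prev;
    int id;
};

/* Circular doubly linked list with a sentinel node as list head. */
struct demo_queue {
    struct demo_pkt head;
    unsigned int qlen;
};

void demo_queue_init(struct demo_queue *q)
{
    q->head.next = q->head.prev = &q->head;
    q->qlen = 0;
}

/* Append at the tail (what the receive interrupt path does). */
void demo_queue_tail(struct demo_queue *q, struct demo_pkt *p)
{
    p->next = &q->head;
    p->prev = q->head.prev;
    q->head.prev->next = p;
    q->head.prev = p;
    q->qlen++;
}

/* Remove from the front (what the bottom half does); NULL if empty. */
struct demo_pkt *demo_dequeue(struct demo_queue *q)
{
    struct demo_pkt *p = q->head.next;

    if (p == &q->head)
        return NULL;
    q->head.next = p->next;
    p->next->prev = &q->head;
    p->next = p->prev = NULL;
    q->qlen--;
    return p;
}
```

Splitting the work this way keeps the interrupt handler short: it only links the packet into the queue, and protocol processing happens later in the bottom half.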

13 Getting the data from A to B
10. net_bh()
    {
        …
        skb = skb_dequeue(&backlog);
        /* Bump the pointer to the next structure. skb->data and
           skb->nh.raw point to the MAC and encapsulated data. */
        skb->h.raw = skb->nh.raw = skb->data;
        /* Fetch the packet protocol ID. */
        type = skb->protocol;
        /* We got a packet ID. Now loop over the "known protocols" list.
           There are two lists. The ptype_all list of taps (normally empty)
           and the main protocol list which is hashed perfectly for normal
           protocols. */
        …
        /* a NULL ptype->dev matches packets from any device */
        if (ptype->type == type && (!ptype->dev || ptype->dev == skb->dev)) {
            /* We already have a match queued. Deliver to it. */
            skb2 = skb_clone(skb, GFP_ATOMIC);
            pt_prev->func(skb2, skb->dev, pt_prev);
            …
        }
        …
    }
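The loop over the protocol list in net_bh() is a dispatch on protocol type and receiving device. The following user-space sketch models only that matching; all demo_* names are hypothetical. It assumes the kernel convention that a handler registered with a NULL device accepts frames from every device, and it leaves out the skb_clone() bookkeeping that the kernel performs when more than one handler matches.

```c
#include <assert.h>
#include <stddef.h>

struct demo_dev { const char *name; };

/* A received frame: protocol ID and the device it arrived on. */
struct demo_frame {
    unsigned short protocol;
    struct demo_dev *dev;
};

/* Analogue of a registered protocol handler. */
struct demo_packet_type {
    unsigned short type;
    struct demo_dev *dev;   /* NULL: accept frames from any device */
    int (*func)(struct demo_frame *f, struct demo_packet_type *pt);
    struct demo_packet_type *next;
    int hits;               /* counts deliveries, for illustration only */
};

/* Example handler: just records that it was called. */
int demo_count_hit(struct demo_frame *f, struct demo_packet_type *pt)
{
    (void)f;
    pt->hits++;
    return 0;
}

/* Calls every handler whose type and device match the frame.
 * Returns the number of handlers that received it. */
int demo_deliver(struct demo_packet_type *list, struct demo_frame *f)
{
    struct demo_packet_type *pt;
    int delivered = 0;

    for (pt = list; pt != NULL; pt = pt->next) {
        if (pt->type == f->protocol && (pt->dev == NULL || pt->dev == f->dev)) {
            pt->func(f, pt);
            delivered++;
        }
    }
    return delivered;
}
```

In this model an IP handler registered with a NULL device sees IP frames from every interface, while a handler bound to one device sees only frames from that device.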

14 Getting the data from A to B
10. Call ip_rcv()
    {
        …
        /* check the header for correctness and deal with all the IP options. ip_forward() and ip_defrag() */
        …
        return skb->dst->input(skb);
    }
11. ip_local_deliver()
    {
        …
        /* Reassemble IP fragments. */
        skb = ip_defrag(skb);
        /* Deliver to raw sockets. This is fun as to avoid copies we want to make no surplus copies. */
        …
        /* Pass on the datagram to each protocol that wants it, based on the datagram protocol. */
        ...
        ipprot->handler(skb2, ntohs(iph->tot_len) - (iph->ihl * 4));
        …
    }
12. The handler is one of tcp_v4_rcv(), udp_rcv(), icmp_rcv().
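The length passed to the protocol handler, ntohs(iph->tot_len) - (iph->ihl * 4), is the total datagram length minus the header length, where ihl counts 32-bit words. The sketch below computes the same value from raw header bytes in user space and picks a handler name from the protocol field. The demo_* functions are hypothetical; the protocol numbers 1, 6 and 17 are the standard values for ICMP, TCP and UDP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Header length in bytes: low nibble of byte 0 counts 32-bit words. */
unsigned int demo_ip_header_len(const uint8_t *hdr)
{
    return (hdr[0] & 0x0fu) * 4u;
}

/* Total datagram length: bytes 2..3, big-endian (network order). */
unsigned int demo_ip_total_len(const uint8_t *hdr)
{
    return ((unsigned int)hdr[2] << 8) | hdr[3];
}

/* Payload length handed to the transport protocol, or -1 if the header
 * is not IPv4 or its length fields are inconsistent. */
int demo_ip_payload_len(const uint8_t *hdr)
{
    unsigned int ihl = demo_ip_header_len(hdr);
    unsigned int tot = demo_ip_total_len(hdr);

    if ((hdr[0] >> 4) != 4 || ihl < 20 || tot < ihl)
        return -1;
    return (int)(tot - ihl);
}

/* Name of the receive function selected by the protocol field (byte 9),
 * or NULL when this sketch knows no handler for it. */
const char *demo_ip_handler(const uint8_t *hdr)
{
    switch (hdr[9]) {
    case 1:  return "icmp_rcv";
    case 6:  return "tcp_v4_rcv";
    case 17: return "udp_rcv";
    default: return NULL;
    }
}
```

The consistency checks in demo_ip_payload_len() correspond to the kind of header validation that the slide attributes to ip_rcv(); they are part of this sketch, not a copy of the kernel's checks.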

15 Getting the data from A to B
13. tcp_v4_rcv()
    {
        …
        /* check the header for correctness */
        …
        if (!atomic_read(&sk->sock_readers))
            return tcp_v4_do_rcv(sk, skb);
        /* the socket is in use by a process: defer the segment */
        __skb_queue_tail(&sk->back_log, skb);
        …
    do_time_wait:
        …
        case TCP_TW_ACK:
            tcp_v4_send_ack();
        …
    }
14. tcp_v4_do_rcv()
    {
        …
        __skb_queue_tail(&nsk->back_log, skb);
        …
        if (sk->state == TCP_ESTABLISHED) { /* Fast path */
            if (tcp_rcv_established(sk, skb, skb->h.th, skb->len))
                goto reset;
            return 0;
        }
        tcp_rcv_state_process(sk, skb, skb->h.th, skb->len);
        …
    }

16 Getting the data from A to B
15. TCP receive function for the ESTABLISHED state, tcp_rcv_established(). It is split into a fast path and a slow path. The fast path is disabled when:
    - A zero window was announced from us; zero window probing is only handled properly in the slow path.
    - Out of order segments arrived.
    - Urgent data is expected.
    - There is no buffer space left.
    - Unexpected TCP flags/window values/header lengths are received (detected by checking the TCP header against pred_flags).
    - Data is sent in both directions. The fast path only supports pure senders or pure receivers (this means either the sequence number or the ack value must stay constant).
When the fast-path conditions are not satisfied, processing drops into a standard receive procedure patterned after RFC 793 to handle all cases. The first three cases are guaranteed by proper pred_flags setting; the rest is checked inline. Fast processing is turned on in tcp_data_queue when everything is OK.
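The pred_flags check can be pictured as a single comparison on the 32-bit TCP header word that holds the data offset, the flags and the window. The sketch below is a simplified user-space model in host byte order, not the kernel's code: all demo_* names are hypothetical, the bit layout follows the TCP header format (data offset in the top 4 bits, flags in bits 16 to 23, window in the low 16 bits), and ignoring PSH in the comparison is an assumption of this model.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_TCP_FIN 0x01u
#define DEMO_TCP_PSH 0x08u
#define DEMO_TCP_ACK 0x10u

/* Builds the header word from its three fields (host byte order). */
uint32_t demo_flag_word(unsigned int doff_words, unsigned int flags,
                        unsigned int window)
{
    return ((uint32_t)(doff_words & 0xfu) << 28) |
           ((uint32_t)(flags & 0xffu) << 16) |
           (uint32_t)(window & 0xffffu);
}

/* Predicted word for the next segment: expected header length,
 * ACK as the only flag, and the expected window. */
uint32_t demo_pred_flags(unsigned int doff_words, unsigned int window)
{
    return demo_flag_word(doff_words, DEMO_TCP_ACK, window);
}

/* Returns 1 if the segment may take the fast path: its header word matches
 * the prediction (PSH ignored) and it is the next segment in sequence. */
int demo_fast_path_ok(uint32_t pred, uint32_t word,
                      uint32_t seq, uint32_t rcv_nxt)
{
    uint32_t ignore = (uint32_t)DEMO_TCP_PSH << 16;

    return (word & ~ignore) == pred && seq == rcv_nxt;
}
```

A segment with an unexpected flag such as FIN, a changed window, a different header length, or an out-of-order sequence number fails the comparison and would be handled by the slow path.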

17 Getting the data from A to B
16. tcp_data() enters the sk_buff into the socket's receive list.
17. data_ready() wakes up the waiting processes.
18. The preceding steps are carried out in the kernel, outside of any process.
19. B executes read(socket, data, len).
20. The call goes through sys_read() -> sock_read() -> inet_recvmsg() -> tcp_recvmsg().
21. This completes the data's travels from process A to process B.
22. The data is copied only four times:
    1) From the user space of A to kernel memory
    2) From kernel memory to the network card
    3) From the network card to the other computer's kernel memory
    4) From B's kernel memory to B's user space

