
Linux high-performance network programming: the sending and receiving process of TCP at the bottom level

Author: Linux technology stack

Today I want to explore high-performance network programming, but before we get to the system APIs, it is worth walking through the underlying Linux packet sending and receiving process. Let's start from a simple piece of socket programming code:

int main() {
    ...

    fd = socket(AF_INET, SOCK_STREAM, 0);
    bind(fd, ...);
    listen(fd, ...);

    // how to establish a connection
    ...
    afd = accept(fd, ...);

    // how to receive data
    ...
    read(afd, ...);

    // how to send data
    ...
    send(afd, ...);

    // how to close the connection
    ...
    close(fd);
    ...
}

Part I: Establishing a Connection

[Figure: the layers of the TCP/IP protocol stack]

We know that the TCP/IP protocol family is divided into the application layer, the TCP transport layer, the IP network layer, and the link layer (Ethernet, driven by the NIC driver).

Looking at the application layer in the figure above: in network programming we usually call the accept API to establish a TCP connection, so how does TCP actually do it?

[Figure: the TCP three-way handshake with the SYN queue and the ACCEPT queue]

As can be seen from the flow in the diagram above:

(1) The client initiates the TCP handshake by sending a SYN packet;

(2) After receiving the packet, the kernel first inserts the information for the current connection into the SYN queue;

(3) After successful insertion, it replies with the handshake confirmation (SYN+ACK);

(4) The client continues the TCP handshake by replying with an ACK confirmation;

(5) When the kernel receives the packet that completes the TCP handshake, it first takes the corresponding connection information out of the SYN queue;

(6) It then puts the connection information into the ACCEPT queue;

(7) The application-layer server can obtain this connection by calling accept, and the whole socket connection setup is complete;

Based on this picture, what problems could come up here? Careful readers should be able to spot them: 1. There are two queues, and either of them will inevitably fill up at some point, so how does the kernel deal with that?

(1) If the SYN queue is full, the kernel drops the connection;

(2) If the ACCEPT queue is full, the kernel stops moving connections from the SYN queue into the ACCEPT queue; if the SYN queue is large enough to hold them, the client's subsequent packets will simply time out;

(3) If the SYN queue then also fills up, new connections are dropped as in (1);

2. How to control the size of the SYN queue and the ACCEPT queue?

(1) Before kernel version 2.2, the listen backlog set the upper limit of both the SYN queue (half-open connections in the SYN_RCVD state) and the ACCEPT queue (fully established connections);

(2) Since kernel version 2.2, the backlog only caps the ACCEPT queue, and the upper limit of the SYN queue is set through /proc/sys/net/ipv4/tcp_max_syn_backlog;
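
As a concrete illustration (a minimal sketch, not taken from the kernel or from the article's code), the backlog is simply the second argument of listen; on top of that, the kernel silently truncates it to net.core.somaxconn, while the SYN queue is capped separately by the tcp_max_syn_backlog setting above. The port number here is only an example:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); exit(1); }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);              /* example port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind"); exit(1);
    }

    /* The backlog argument caps the ACCEPT (fully established) queue;
     * the kernel truncates it to net.core.somaxconn, and the SYN queue
     * is capped by /proc/sys/net/ipv4/tcp_max_syn_backlog. */
    if (listen(fd, 128) < 0) {
        perror("listen"); exit(1);
    }

    /* ... accept loop would go here ... */
    close(fd);
    return 0;
}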

3. If the server side just waits in accept, won't the calling thread be blocked?

In Linux network programming we pursue high performance; if the thread calling accept is stuck, performance cannot keep up, which is why socket programming offers both blocking and non-blocking modes.

(1) In blocking mode, accept blocks and the current thread cannot do anything else;

(2) In non-blocking mode, you can poll accept while handling other work: if it returns EAGAIN, the ACCEPT queue is empty; if it returns connection information, the new connection can be processed (see the sketch below);
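
Here is a minimal sketch of the non-blocking pattern described in (2), assuming the listening socket fd from the code at the top of the article (the helper names are just for illustration):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>

/* Put the listening socket into non-blocking mode. */
static int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Poll accept: returns the new connection fd, or -1 if the
 * ACCEPT queue is currently empty (EAGAIN/EWOULDBLOCK) or on error. */
static int try_accept(int listen_fd) {
    int afd = accept(listen_fd, NULL, NULL);
    if (afd < 0) {
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return -1;            /* ACCEPT queue is empty, try again later */
        perror("accept");         /* a real error */
        return -1;
    }
    return afd;                   /* new connection, ready to be processed */
}

In practice this polling would normally be driven by epoll or another readiness-notification mechanism rather than a busy loop.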


Part II: Receiving Data

[Figure: the TCP receive path, from the network card through the kernel queues to recv]

(1) When the network card receives a packet and it is judged to be TCP, the kernel's tcp_v4_rcv method is called; if the packet (say S1) arrives in order, it is inserted directly into the receive queue;

(2) Packet S3 is then received; after step 1 the expected sequence number is S2, but the packets have arrived out of order, so S3 is inserted into the out_of_order queue (the queue that stores out-of-order packets);

(3) Packet S2 arrives next and, as in step 1, goes directly into the receive queue; because the out_of_order queue is no longer empty, step 4 is triggered;

(4) Every time a packet is inserted into the receive queue, the out_of_order queue is checked; if the expected sequence number (here S3) is found there, it is removed from the out_of_order queue and written into the receive queue;

(5) Now the application starts calling the recv method;

(6) After layers of wrapper calls, receiving a TCP message eventually reaches the tcp_recvmsg method;

(7) Data now needs to be copied from kernel space to user space. If the receive queue is empty, the SO_RCVLOWAT threshold is checked first (the minimum number of bytes that must be available before recv returns; the system default is 1, meaning return as soon as any data has been read). If fewer bytes than the threshold have been copied so far, the process may go to sleep and wait for more data (see the sketch after this list);

(8) The data is copied from kernel space to user space, and recv returns the size of the copied data;

(9) To let you choose between reducing packet latency and improving throughput, the system provides the tcp_low_latency parameter: if its value is 0 and the user has not read the data yet, the packet enters the prequeue queue to improve throughput; otherwise the prequeue queue is skipped and the packet goes straight into tcp_v4_do_rcv, reducing latency;
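
A minimal user-space sketch of step (7), assuming a connected socket afd as in the code at the top of the article; the threshold of 16 bytes is only an illustration:

#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Ask the kernel not to wake a blocking recv until at least
 * low_water bytes are available (SO_RCVLOWAT, default 1). */
static int set_rcvlowat(int afd, int low_water) {
    return setsockopt(afd, SOL_SOCKET, SO_RCVLOWAT,
                      &low_water, sizeof(low_water));
}

static void read_example(int afd) {
    char buf[4096];

    if (set_rcvlowat(afd, 16) < 0)        /* illustrative threshold */
        perror("setsockopt(SO_RCVLOWAT)");

    /* On a blocking socket, recv sleeps until at least 16 bytes are in
     * the receive queue (or the connection is closed), then returns the
     * number of bytes copied to user space. */
    ssize_t n = recv(afd, buf, sizeof(buf), 0);
    if (n > 0)
        printf("received %zd bytes\n", n);
    else if (n == 0)
        printf("peer closed the connection\n");
    else
        perror("recv");
}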

Part III: Sending Data

[Figure: the TCP send path, from send through the kernel protocol stack to the network card]

(1) Suppose send is called with data larger than one MSS (say 2 KB);

(2) The kernel calls tcp_sendmsg to copy the data, write it to the send queue and assemble the TCP protocol header;

(3) Inside tcp_sendmsg, the kernel first obtains an skb and copies the user-space data into kernel space. The kernel's actual transmission of the packet is not synchronized with the send call: send returning successfully does not mean the IP packets have been put on the wire. That is why the data to be sent must be copied from user-space memory into kernel-space memory; the kernel then no longer depends on the user-space buffer, and the process can quickly release the user-space memory occupied by the data being sent. This copy is not a simple memcpy, though: the data is split into multiple segments no larger than the MSS and copied into sk_buff structures in the kernel (see the sketch after this list);

(4) The data is copied into the send queue tcp_write_queue;

(5) tcp_push is called to hand the data down to the IP layer; this is where the sliding window, slow start and congestion window controls are applied, and where the kernel decides whether to use the Nagle algorithm to merge small messages;

(6) The IP header is assembled, the data passes through netfilter hooks such as those used by iptables (and can be captured by tools like tcpdump), and it is then handed to the neighbor subsystem (whose main job is to find the destination MAC address, send ARP requests, encapsulate the MAC header, and so on);

(7) The network card driver is called to send the data out;
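
A minimal sketch of the user-space side of this, assuming a connected socket afd; disabling Nagle via TCP_NODELAY is optional and is shown only because step (5) mentions the algorithm:

#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

/* Optionally disable the Nagle algorithm mentioned in step (5),
 * trading some throughput for lower latency on small messages. */
static void disable_nagle(int afd) {
    int one = 1;
    if (setsockopt(afd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
        perror("setsockopt(TCP_NODELAY)");
}

/* send only copies data into the kernel send queue: success does not
 * mean the peer has received anything, and it may accept only part of
 * the buffer, so loop until everything has been handed to the kernel. */
static int send_all(int afd, const char *buf, size_t len) {
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = send(afd, buf + sent, len - sent, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR) continue;   /* interrupted, retry */
            return -1;                      /* real error */
        }
        sent += (size_t)n;
    }
    return 0;
}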

Part IV: Closing the Connection

Closing a connection is the TCP teardown (four-way wave) process. We all know a TCP connection is a reliable connection, so how do we close it completely and reliably? Linux provides two functions:

  • close corresponds to the tcp_close method; it closes the socket by decrementing its reference count, and only triggers tcp_close when the reference count drops to 0;
  • shutdown corresponds to the tcp_shutdown method; it does not care how many references the socket has and directly closes the corresponding connection;

(1) shutdown takes a parameter with 3 possible values, meaning: close only the read side, close only the write side, or close both read and write;

(2) If shutdown is called on a half-open connection, an RST is sent to close the connection;

(3) If shutdown is called on a normal connection, closing the read side is purely local and has nothing to do with the peer;

(4) If the parameter includes the close-write flag, then what follows is the same as close: a FIN packet is sent, telling the other side that this machine will not send any more data (see the sketch after this list);
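
A minimal sketch of a graceful close using shutdown, assuming a connected socket afd; the half-close pattern shown here is one common use, not the only one:

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Gracefully close the sending direction: a FIN is sent, telling the
 * peer we will not send any more data, while we can still read the
 * peer's remaining data until it closes its side as well. */
static void graceful_close(int afd) {
    char buf[4096];
    ssize_t n;

    if (shutdown(afd, SHUT_WR) < 0)       /* close only the write side */
        perror("shutdown(SHUT_WR)");

    /* Drain whatever the peer still has to send; recv returning 0
     * means the peer has sent its FIN as well. */
    while ((n = recv(afd, buf, sizeof(buf), 0)) > 0)
        ;                                 /* discard or process the data */

    /* Drop our last reference; when the count reaches 0 this
     * triggers tcp_close in the kernel. */
    close(afd);
}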

Part V: Thinking Questions

To wrap up, here are a few thinking questions based on this article; they will be answered in the next article.

(1) After the send method returns successfully, has the data necessarily reached the TCP peer? (Even after the IP layer's method has been called, there is no guarantee at that point that the data was sent successfully.)

(2) A single socket may be used by multiple processes; how does the kernel handle such concurrent access?

(3) If the socket is the default blocking socket and recv is called with a given len, will recv return when the network packet carries less data than len?

(4) When the socket is shared by multiple processes or multiple threads, what is the difference when the connection is closed?