[转帖]探索惊群 ⑥ - nginx

https://wenfh2020.com/2021/10/12/thundering-herd-tcp-reuseport/

SO_REUSEPORT (reuseport) 是网络的一个选项设置，它能开启内核功能：网络链接分配 内核负载均衡。

该功能允许多个进程/线程 bind/listen 相同的 IP/PORT，提升了新链接的分配性能。

nginx 开启 reuseport 功能后，性能有立竿见影的提升，我们结合 tcp 协议分析 nginx 的 reuseport 功能。

reuseport 也是内核解决 惊群问题 的优秀方案。

每个进程可以 bind/listen 相同的 IP/PORT，相当于每个进程拥有独立的 listen socket 的完全队列，避免了共享 listen socket 的资源争抢，提升了并发的吞吐。
内核通过哈希算法，将新链接相对均衡地分配到各个开启了 reuseport 属性的进程，所以资源的负载均衡得到解决。

1. 概述

1.1. what

从下面这段英文提取一些关键信息：

SO_REUSEPORT 是网络的一个选项设置，它允许多个进程/线程 bind/listen 相同的 IP/PORT，在 TCP 的应用中，它是一个新链接分发的（内核）负载均衡功能，它提升了新链接的分配性能（针对 accept ）。

Socket options

    The socket options listed below can be set by using setsockopt(2)

    and read with getsockopt(2) with the socket level set to

    SOL_SOCKET for all sockets.  Unless otherwise noted, optval is a

    pointer to an int.

...

    SO_REUSEPORT (since Linux 3.9)

                Permits multiple AF_INET or AF_INET6 sockets to be bound

                to an identical socket address.  This option must be set

                on each socket (including the first socket) prior to

                calling bind(2) on the socket.  To prevent port hijacking,

                all of the processes binding to the same address must have

                the same effective UID.  This option can be employed with

                both TCP and UDP sockets.

                For TCP sockets, this option allows accept(2) load

                distribution in a multi-threaded server to be improved by

                using a distinct listener socket for each thread.  This

                provides improved load distribution as compared to

                traditional techniques such using a single accept(2)ing

                thread that distributes connections, or having multiple

                threads that compete to accept(2) from the same socket.

                For UDP sockets, the use of this option can provide better

                distribution of incoming datagrams to multiple processes

                (or threads) as compared to the traditional technique of

                having multiple processes compete to receive datagrams on

                the same socket.

注释文字来源 (链接需要FQ）：socket(7) — Linux manual page

1.2. why

SO_REUSEPORT 功能解决了什么问题？我们先看看 2013 年 3.9+ 版本内核提交的这个 Linux 内核功能补丁的注释。

soreuseport: TCP/IPv4 implementation

Allow multiple listener sockets to bind to the same port.

Motivation for soresuseport would be something like a web server

binding to port 80 running with multiple threads, where each thread

might have it's own listener socket.  This could be done as an

alternative to other models: 1) have one listener thread which

dispatches completed connections to workers. 2) accept on a single

listener socket from multiple threads.  In case #1 the listener thread

can easily become the bottleneck with high connection turn-over rate.

In case #2, the proportion of connections accepted per thread tends

to be uneven under high connection load (assuming simple event loop:

while (1) { accept(); process() }, wakeup does not promote fairness

among the sockets.  We have seen the  disproportion to be as high

as 3:1 ratio between thread accepting most connections and the one

accepting the fewest.  With so_reusport the distribution is

uniform.

Signed-off-by: Tom Herbert <therbert@google.com>

Signed-off-by: David S. Miller <davem@davemloft.net>

 master

 v5.13

…

 v3.9-rc1

@davem330

Tom Herbert authored and davem330 committed on 24 Jan 2013

1 parent 055dc21 commit da5e36308d9f7151845018369148201a5d28b46d

reuseport 选项主要解决了两个问题：

（A 图）单个 listen socket 遇到的性能瓶颈。
（B 图）单个 listen socket 多个线程同时 accept，但是多个线程资源分配不均。

其实它还解决了一个很重要的问题：

在 tcp 多线程场景中，（B 图）服务端如果所有新链接只保存在一个 listen socket 的 全链接队列 中，那么多个线程去这个队列里获取（accept）新的链接，势必会出现多个线程对一个公共资源的争抢，争抢过程中，大量资源的损耗。

而（C 图）有多个 listener 共同 bind/listen 相同的 IP/PORT，也就是说每个进程/线程有一个独立的 listener，相当于每个进程/线程独享一个 listener 的全链接队列，不需要多个进程/线程竞争某个公共资源，能充分利用多核，减少竞争的资源消耗，效率自然提高了。

1.3. how

SO_REUSEPORT 功能使用，可以通过网络选项进行设置，在 bind 前面设置即可，使用比较简单。

int fd, reuse = 1;

fd = socket(PF_INET, SOCK_STREAM, IPPROTO_IP);

setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, (const void *)&reuse, sizeof(int));

2. 原理

TCP 客户端链接服务端，第一次握手，服务端被动收到第一次握手 SYN 包，内核就通过哈希算法，将客户端的链接分派到内核半链接队列，三次握手成功后，再将这个链接从半链接队列移动到某个 listener 的全链接队列中，提供 accept 获取。

三次握手流程。

图片来源：TCP 三次握手（内核）

服务端被动第一次握手，查找合适的 listener，详看源码（Linux 5.0.1）。

/* include/net/inet_hashtables.h */

static inline struct sock *__inet_lookup(struct net *net,

                     struct inet_hashinfo *hashinfo,

                     struct sk_buff *skb, int doff,

                     const __be32 saddr, const __be16 sport,

                     const __be32 daddr, const __be16 dport,

                     const int dif, const int sdif,

                     bool *refcounted)

{

    u16 hnum = ntohs(dport);

    struct sock *sk;

    /* skb 包，从 established 哈希表中查找是否已有 established 的包。*/

    sk = __inet_lookup_established(net, hashinfo, saddr, sport,

                       daddr, hnum, dif, sdif);

    *refcounted = true;

    if (sk)

        return sk;

    *refcounted = false;

    /* 上面没找到，那么就找一个合适的 listener，去走三次握手流程。 */

    return __inet_lookup_listener(net, hashinfo, skb, doff, saddr,

                      sport, daddr, hnum, dif, sdif);

}

/* net/ipv4/inet_hashtables.c */

struct sock *__inet_lookup_listener(struct net *net,

                    struct inet_hashinfo *hashinfo,

                    struct sk_buff *skb, int doff,

                    const __be32 saddr, __be16 sport,

                    const __be32 daddr, const unsigned short hnum,

                    const int dif, const int sdif) {

    struct inet_listen_hashbucket *ilb2;

    struct sock *result = NULL;

    unsigned int hash2;

    /* 通过目标 ip/port 哈希值从 hashinfo.lhash2 查找对应的 slot。*/

    hash2 = ipv4_portaddr_hash(net, daddr, hnum);

    ilb2 = inet_lhash2_bucket(hashinfo, hash2);

    /* 再从对应的 slot 上，搜索哈希链上的数据。 */

    result = inet_lhash2_lookup(net, ilb2, skb, doff,

                    saddr, sport, daddr, hnum,

                    dif, sdif);

    ...

    return result;

}

/* called with rcu_read_lock() : No refcount taken on the socket */

static struct sock *inet_lhash2_lookup(struct net *net,

                struct inet_listen_hashbucket *ilb2,

                struct sk_buff *skb, int doff,

                const __be32 saddr, __be16 sport,

                const __be32 daddr, const unsigned short hnum,

                const int dif, const int sdif) {

    bool exact_dif = inet_exact_dif_match(net, skb);

    struct inet_connection_sock *icsk;

    struct sock *sk, *result = NULL;

    int score, hiscore = 0;

    u32 phash = 0;

    /* 遍历哈希链，获取合适的 listener。 */

    inet_lhash2_for_each_icsk_rcu(icsk, &ilb2->head) {

        sk = (struct sock *)icsk;

        score = compute_score(sk, net, hnum, daddr, dif, sdif, exact_dif);

        /* 统计分数，获取最大匹配分数的 socket。*/

        if (score > hiscore) {

            if (sk->sk_reuseport) {

                /* 算出哈希值。 */

                phash = inet_ehashfn(net, daddr, hnum, saddr, sport);

                /* 在数组里，通过哈希获得 sk。 */

                result = reuseport_select_sock(sk, phash, skb, doff);

                if (result)

                    return result;

            }

            result = sk;

            hiscore = score;

        }

    }

    return result;

}

参考：[内核源码] 网络协议栈 listen (tcp)

3. nginx

2013 年 Linux 内核添加了 reuseport 功能后，nginx 在 2015 年，1.9.1 版本也增加对应功能的支持，nginx 开启 reuseport 功能后，性能是原来的 2-3 倍，效果可谓立竿见影！

详细请参考 nginx 官方文档：Socket Sharding in NGINX Release 1.9.1

接下来我们看看 nginx 是如何支持 reuseport 的。

3.1. 开启 SO_REUSEPORT

修改 nginx 配置，在 nginx.conf 里，listen 关键字后面添加 ‘reuseport’。

# nginx.conf

# vim /usr/local/nginx/conf/nginx.conf

# 启动 4 个子进程。

worker_processes  4;

http {

    ...

    server {

        listen 80 reuseport;

        server_name localhost;

        ...

    }

    ...

}

3.2. 工作流程

启动测试 nginx，1 master / 4 workers，监听 80 端口。

3.2.1. 进程

父子进程。

# ps -ef | grep nginx

root      88994   1770  0 14:57 ?        00:00:00 nginx: master process /usr/local/nginx/sbin/nginx

nobody    88995  88994  0 14:57 ?        00:00:00 nginx: worker process

nobody    88996  88994  0 14:57 ?        00:00:00 nginx: worker process

nobody    88997  88994  0 14:57 ?        00:00:00 nginx: worker process

nobody    88998  88994  0 14:57 ?        00:00:00 nginx: worker process

父子进程 LISTEN 80 端口情况。

因为配置文件设置了 worker_processes 4 需要启动 4 个子进程， nginx 进程发现配置文件关键字 listen 后添加了 reuseport 关键字，那么主进程先创建 4 个 socket 并设置 SO_REUSEPORT 选项，然后进行 bind 和 listen。

当 fork 子进程时，子进程拷贝了父进程的这 4 个 socket，所以你看到每个子进程都有相同 LISTEN 的 socket fd（6，7，8，9）。

# lsof -i:80 | grep nginx

nginx   88994   root    6u  IPv4 909209      0t0  TCP *:http (LISTEN)

nginx   88994   root    7u  IPv4 909210      0t0  TCP *:http (LISTEN)

nginx   88994   root    8u  IPv4 909211      0t0  TCP *:http (LISTEN)

nginx   88994   root    9u  IPv4 909212      0t0  TCP *:http (LISTEN)

nginx   88995 nobody    6u  IPv4 909209      0t0  TCP *:http (LISTEN)

nginx   88995 nobody    7u  IPv4 909210      0t0  TCP *:http (LISTEN)

nginx   88995 nobody    8u  IPv4 909211      0t0  TCP *:http (LISTEN)

nginx   88995 nobody    9u  IPv4 909212      0t0  TCP *:http (LISTEN)

nginx   88996 nobody    6u  IPv4 909209      0t0  TCP *:http (LISTEN)

nginx   88996 nobody    7u  IPv4 909210      0t0  TCP *:http (LISTEN)

nginx   88996 nobody    8u  IPv4 909211      0t0  TCP *:http (LISTEN)

nginx   88996 nobody    9u  IPv4 909212      0t0  TCP *:http (LISTEN)

nginx   88997 nobody    6u  IPv4 909209      0t0  TCP *:http (LISTEN)

nginx   88997 nobody    7u  IPv4 909210      0t0  TCP *:http (LISTEN)

nginx   88997 nobody    8u  IPv4 909211      0t0  TCP *:http (LISTEN)

nginx   88997 nobody    9u  IPv4 909212      0t0  TCP *:http (LISTEN)

nginx   88998 nobody    6u  IPv4 909209      0t0  TCP *:http (LISTEN)

nginx   88998 nobody    7u  IPv4 909210      0t0  TCP *:http (LISTEN)

nginx   88998 nobody    8u  IPv4 909211      0t0  TCP *:http (LISTEN)

nginx   88998 nobody    9u  IPv4 909212      0t0  TCP *:http (LISTEN)

3.2.2. 网络初始流程

nginx 是多进程模型，Linux 环境下一般使用 epoll 事件驱动。

strace 监控 nginx 进程的系统调用流程。

# strace -f -s 512 -o /tmp/nginx.log /usr/local/nginx/sbin/nginx

# grep -n -E 'socket\(PF_INET|SO_REUSEPORT|listen|bind|clone|epoll' /tmp/nginx.log

# master

208:88993 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 6

210:88993 setsockopt(6, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0

212:88993 bind(6, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("0.0.0.0")}, 16) = 0

213:88993 listen(6, 511)                    = 0

214:88993 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7

216:88993 setsockopt(7, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0

218:88993 bind(7, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("0.0.0.0")}, 16) = 0

219:88993 listen(7, 511)                    = 0

220:88993 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 8

222:88993 setsockopt(8, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0

224:88993 bind(8, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("0.0.0.0")}, 16) = 0

225:88993 listen(8, 511)                    = 0

226:88993 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9

228:88993 setsockopt(9, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0

230:88993 bind(9, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("0.0.0.0")}, 16) = 0

231:88993 listen(9, 511)                    = 0

# master --> fork

250:88993 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f0d875dfa10) = 88994

274:88994 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f0d875dfa10) = 88995

305:88994 clone( <unfinished ...>

308:88994 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f0d875dfa10) = 88996

349:88995 epoll_create(512 <unfinished ...>

351:88995 <... epoll_create resumed> )      = 11

413:88995 epoll_ctl(11, EPOLL_CTL_ADD, 6, {EPOLLIN|EPOLLRDHUP, {u32=2270846992, u64=139696082149392}} <unfinished ...>

443:88996 <... epoll_create resumed> )      = 13

445:88994 <... clone resumed> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f0d875dfa10) = 88998

524:88997 epoll_create(512 <unfinished ...>

526:88997 <... epoll_create resumed> )      = 15

543:88998 epoll_create(512 <unfinished ...>

545:88998 <... epoll_create resumed> )      = 17

564:88997 epoll_ctl(15, EPOLL_CTL_ADD, 8, {EPOLLIN|EPOLLRDHUP, {u32=2270846992, u64=139696082149392}} <unfinished ...>

565:88996 epoll_ctl(13, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLRDHUP, {u32=2270846992, u64=139696082149392}} <unfinished ...>

606:88998 epoll_ctl(17, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLRDHUP, {u32=2270846992, u64=139696082149392}}) = 0

分析总结一下 strace 采集的系统调用日志。

# 如果有 N 个子进程就创建 N 个 socket 并对其设置，绑定地址和监听端口。

N * (socket --> setsockopt SO_REUSEPORT --> bind --> listen) -->

# fork 子进程

fork -->

# 每个子进程创建 epoll_create 实例进行网络读写工作。

# 注意，listen fd 是父进程创建的，父进程在 fork 子进程时，为每个子进程打上编号了，

# 每个编号的子进程会处理一个 listen fd.

epoll_create --> epoll_ctl EPOLL_CTL_ADD listen fd -->

epoll_wait --> accept/accept4/read/write

源码函数调用层次。

# 函数调用层次关系。

# ------------------------ master ------------------------

main

|-- ngx_init_cycle

    |-- ngx_open_listening_sockets

        |-- socket # 如果有 N 个子进程，那么创建 N 个 socket.

        |-- setsockopt SO_REUSEPORT

        |-- bind

        |-- listen

|-- ngx_master_process_cycle

    |-- ngx_start_worker_processes

        |-- ngx_spawn_process # 每个 fork 出来的子进程，主进程都会传递一个顺序的数字编号进行标识，保存到 ngx_worker

            |-- fork

# ------------------------ worker ------------------------

            |-- ngx_worker_process_cycle

                |-- ngx_worker_process_init

                    |-- ngx_event_process_init

                        |-- ngx_epoll_init

                            |-- epoll_create

                        |-- ngx_epoll_add_event # ngx_add_event

                            |-- epoll_ctl # EPOLL_CTRL_ADD - 对应子进程编号的 listen fd.

                |-- ngx_process_events_and_timers

                    |-- ngx_epoll_process_events # ngx_process_events

                        |-- epoll_wait

                        |-- ngx_event_accept

                            |-- accept/accept4

4. 参考

本文作者： wenfh2020
本文链接： https://wenfh2020.com/2021/10/12/thundering-herd-tcp-reuseport/
版权声明： 本文可以自由拷贝，转载，请注明出处，谢谢！