
Deep Dive into Docker Overlay Networking

Original article: https://neuvector.com/network-security/docker-swarm-container-networking/

The original article digs into how Docker Swarm networking is implemented and covers details the official documentation does not mention; it is very helpful for learning and understanding Docker networking.

Deployment

First, deploy a Docker swarm cluster consisting of two nodes, named node 1 and node 2 (creating the swarm itself is not covered here). Then create one overlay network and three services, each with a single replica, as follows:

docker network create --opt encrypted --subnet 100.0.0.0/24 -d overlay net1

docker service create --name redis --network net1 redis
docker service create --name node --network net1 nvbeta/node
docker service create --name nginx --network net1 -p 1080:80 nvbeta/swarm_nginx
           

The commands above create a typical three-tier application. nginx is the front-end load balancer that distributes incoming user requests to the node service; node is a web service that queries redis and returns the result to the user through nginx. For simplicity, only one node replica is created.

The logical view of the application:

[Figure: logical view of the three-tier application]

Networks

Take a look at the networks that now exist in the Docker swarm:

$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
cac91f9c60ff        bridge              bridge              local
b55339bbfab9        docker_gwbridge     bridge              local
fe6ef5d2e8e8        host                host                local
f1nvcluv1xnf        ingress             overlay             swarm
8vty8k3pejm5        net1                overlay             swarm
893a1bbe3118        none                null                local
           

net1

The overlay network we just created; it carries east-west traffic between containers.

docker_gwbridge

A bridge network created by Docker; it allows containers to communicate with the host they run on.

ingress

An overlay network created by Docker; the swarm uses it to expose services to the outside world and to provide the routing mesh feature.

The net1 network

Each service was created with the "--network net1" option, so each container instance must have one interface attached to net1. Looking at node 1, two containers are deployed on this node:

$ docker ps
CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS              PORTS               NAMES
eb03383913fb        nvbeta/node:latest   "nodemon /src/index.j"   2 hours ago         Up 2 hours          8888/tcp            node.1.10yscmxtoymkvs3bdd4z678w4
434ce2679482        redis:latest         "docker-entrypoint.sh"   2 hours ago         Up 2 hours          6379/tcp            redis.1.1a2l4qmvg887xjpfklp4d6n7y
           

Create a symbolic link to Docker's network namespace directory, then list all the network namespaces on node 1:

$ cd /var/run
$ sudo ln -s /var/run/docker/netns netns
$ sudo ip netns
be663feced43
6e9de12ede80
2-8vty8k3pej
1-f1nvcluv1x
72df0265d4af
           

Comparing the namespace IDs with the network IDs in the swarm, we can guess that the net1 network, whose ID is 8vty8k3pej, belongs to the 2-8vty8k3pej namespace. This can be confirmed by comparing the interfaces in that namespace with the interfaces in the containers.
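As a small aside, the naming rule used here (an overlay namespace is named with a numeric prefix followed by the truncated network ID) can be captured in a tiny helper; the function name is ours, not a Docker API:

```shell
# ns_to_netid NAMESPACE - strip the "N-" prefix from an overlay namespace
# name (e.g. "2-8vty8k3pej" -> "8vty8k3pej"), leaving the truncated network
# ID that `docker network ls` prints.
ns_to_netid() {
  printf '%s\n' "${1#*-}"
}

# On a live node the result can be cross-referenced, e.g.:
#   docker network ls --format '{{.ID}} {{.Name}}' | grep "^$(ns_to_netid 2-8vty8k3pej)"
```

With the namespaces listed above, `ns_to_netid 2-8vty8k3pej` yields `8vty8k3pej`, matching net1.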

Interfaces inside the containers:

$ docker exec node.1.10yscmxtoymkvs3bdd4z678w4 ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
11040: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:65:00:00:03 brd ff:ff:ff:ff:ff:ff
11042: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ac:12:00:04 brd ff:ff:ff:ff:ff:ff
           
$ docker exec redis.1.1a2l4qmvg887xjpfklp4d6n7y ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
11036: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:65:00:00:08 brd ff:ff:ff:ff:ff:ff
11038: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff
           

Interfaces in the 2-8vty8k3pej namespace:

$ sudo ip netns exec 2-8vty8k3pej ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
    link/ether 22:37:32:66:b0:48 brd ff:ff:ff:ff:ff:ff
11035: vxlan1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default
    link/ether 2a:30:95:63:af:75 brd ff:ff:ff:ff:ff:ff
11037: veth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default
    link/ether da:84:44:5c:91:ce brd ff:ff:ff:ff:ff:ff
11041: veth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default
    link/ether 8a:f9:bf:c1:ec:09 brd ff:ff:ff:ff:ff:ff
           

Note br0: it is a Linux bridge device, and all the other interfaces, vxlan1, veth2 and veth3, are attached to it. vxlan1 is a VTEP-type Linux virtual network device enslaved to br0; it is what implements the VXLAN functionality.

veth2 and veth3 are veth-type virtual devices, which always come in pairs: one end sits in the namespace and the other in a container, and the ifindex of the container-side device is always exactly one less than that of the namespace-side device. Therefore veth2 in the namespace pairs with eth0 in the redis container, and veth3 pairs with eth0 in the node container.
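The ifindex rule just described can be written down as a sketch. It only captures this article's observation; on a live system the authoritative way to find the peer is to read the peer index directly, e.g. `ethtool -S vethX` reports a `peer_ifindex` statistic for veth devices:

```shell
# veth_peer_index IFINDEX - per the pairing observed above, the
# namespace-side veth has an ifindex one greater than the container-side
# device (container eth0 at 11040 <-> namespace veth3 at 11041).
veth_peer_index() {
  echo $(( $1 + 1 ))
}
```

For example, `veth_peer_index 11036` gives 11037, pairing redis's eth0 with veth2.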

At this point we can confirm that the net1 network lives in the 2-8vty8k3pej namespace. Based on what we have learned so far, the topology looks like this:

node 1

  +-----------+      +-----------+
  |  nodejs   |      |   redis   |
  |           |      |           |
  +--------+--+      +--------+--+
           |                  |
           |                  |
           |                  |
           |                  |
      +----+------------------+-------+ net1
       101.0.0.3          101.0.0.8
       101.0.0.4(vip)     101.0.0.2(vip)
           

The docker_gwbridge network

Now compare the interfaces in the redis and node containers with the interfaces on the node 1 host. The host interfaces:

$ ip link
...
4: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 02:42:24:f1:af:e8 brd ff:ff:ff:ff:ff:ff
5: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 02:42:e4:56:7e:9a brd ff:ff:ff:ff:ff:ff
11039: veth97d586b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
    link/ether 02:6b:d4:fc:8a:8a brd ff:ff:ff:ff:ff:ff
11043: vethefdaa0d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
    link/ether 0a:d5:ac:22:e7:5c brd ff:ff:ff:ff:ff:ff
10876: vethceaaebe: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
    link/ether 3a:77:3d:cc:1b:45 brd ff:ff:ff:ff:ff:ff
...
           

As you can see, three veth devices, with ifindexes 11039, 11043 and 10876, are attached to docker_gwbridge, which can be confirmed with:

$ brctl show
bridge name            bridge id                      STP enabled            interfaces
docker0                8000.0242e4567e9a              no
docker_gwbridge        8000.024224f1afe8              no                     veth97d586b
                                                                             vethceaaebe
                                                                             vethefdaa0d
           

Applying the veth pairing rule established for net1, we know that 11039 pairs with eth1 (11038) in redis, and 11043 pairs with eth1 (11042) in node. The topology now looks like this:

node 1

  172.18.0.4         172.18.0.3
 +----+------------------+----------------+ docker_gwbridge
      |                  |
      |                  |
      |                  |
      |                  |
   +--+--------+      +--+--------+
   |  nodejs   |      |   redis   |
   |           |      |           |
   +--------+--+      +--------+--+
            |                  |
            |                  |
            |                  |
            |                  |
       +----+------------------+----------+ net1
        101.0.0.3          101.0.0.8
        101.0.0.4(vip)     101.0.0.2(vip)
           

docker_gwbridge plays a role similar to the default docker0 bridge network in standalone Docker (it may also be named bridge, depending on the Docker version), but there is a difference: docker0 also provides connectivity to the outside world, while docker_gwbridge does not; it only handles communication between containers on the same host. When a container is exposed to the outside world with the -p option, another network, called ingress, takes over.

The ingress network

List the network namespaces on the node 1 host and the Docker swarm networks once more:

$ sudo ip netns
be663feced43
6e9de12ede80
2-8vty8k3pej
1-f1nvcluv1x
72df0265d4af

$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
cac91f9c60ff        bridge              bridge              local
b55339bbfab9        docker_gwbridge     bridge              local
fe6ef5d2e8e8        host                host                local
f1nvcluv1xnf        ingress             overlay             swarm
8vty8k3pejm5        net1                overlay             swarm
893a1bbe3118        none                null                local
           

Clearly, the ingress network belongs to the 1-f1nvcluv1x namespace. But what is the 72df0265d4af namespace for? First, look at its interfaces:

$ sudo ip netns exec 72df0265d4af ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
10873: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 02:42:0a:ff:00:03 brd ff:ff:ff:ff:ff:ff
    inet 10.255.0.3/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:aff:feff:3/64 scope link
       valid_lft forever preferred_lft forever
10875: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.2/16 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::42:acff:fe12:2/64 scope link
       valid_lft forever preferred_lft forever
           

eth1 (10875) pairs with vethceaaebe (10876) on the host, and eth0 (10873) is attached to the ingress network. Both points can be corroborated by inspecting the details of ingress and docker_gwbridge:

$ docker network inspect ingress
[
    {
        "Name": "ingress",
        "Id": "f1nvcluv1xnfa0t2lca52w69w",
        "Scope": "swarm",
        "Driver": "overlay",
        ....
        "Containers": {
            "ingress-sbox": {
                "Name": "ingress-endpoint",
                "EndpointID": "3d48dc8b3e960a595e52b256e565a3e71ea035bb5e77ae4d4d1c56cab50ee112",
                "MacAddress": "02:42:0a:ff:00:03",
                "IPv4Address": "10.255.0.3/16",
                "IPv6Address": ""
            }
        },
        ....
    }
]

$ docker network inspect docker_gwbridge
[
    {
        "Name": "docker_gwbridge",
        "Id": "b55339bbfab9bdad4ae51f116b028ad7188534cb05936bab973dceae8b78047d",
        "Scope": "local",
        "Driver": "bridge",
        ....
        "Containers": {
            ....
            "ingress-sbox": {
                "Name": "gateway_ingress-sbox",
                "EndpointID": "0b961253ec65349977daa3f84f079ec5e386fa0ae2e6dd80176513e7d4a8b2c3",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": ""
            }
        },
        ....
    }
]
           

The MAC/IP of the endpoints in the output above match the MAC/IP in network namespace 72df0265d4af. So namespace 72df0265d4af exists to hide the container "ingress-sbox", which has two interfaces: one attached to the host side and one attached to the ingress network.

One of Docker Swarm's features is the routing mesh: a container that publishes a port to the outside world can be reached by contacting any node in the cluster, no matter which node it actually runs on. How is that achieved? Let's keep digging.

In our application, only the nginx service exposes a port, mapping its own port 80 to port 1080 on the host, and nginx is not running on node 1.

Keep examining node 1:

$ sudo iptables -t nat -nvL
...
...
Chain DOCKER-INGRESS (2 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:1080 to:172.18.0.2:1080
 176K   11M RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0

$ sudo ip netns exec 72df0265d4af iptables -nvL -t nat
...
...
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    9   576 REDIRECT   tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:1080 redir ports 80
...
...
Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DOCKER_POSTROUTING  all  --  *      *       0.0.0.0/0            127.0.0.11
   14   896 SNAT       all  --  *      *       0.0.0.0/0            10.255.0.0/16        ipvs to:10.255.0.3
           

As you can see, the iptables rules forward traffic arriving on host port 1080 directly to the hidden container ingress-sbox; the POSTROUTING chain then puts the packets on IP address 10.255.0.3, whose interface is attached to the ingress network.

Note the 'ipvs' in the SNAT rule. IPVS is a load balancer implemented natively in the Linux kernel:

$ sudo ip netns exec 72df0265d4af iptables -nvL -t mangle
Chain PREROUTING (policy ACCEPT 144 packets, 12119 bytes)
 pkts bytes target     prot opt in     out     source               destination
   87  5874 MARK       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:1080 MARK set 0x12c
...
...
Chain OUTPUT (policy ACCEPT 15 packets, 936 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 MARK       all  --  *      *       0.0.0.0/0            10.255.0.2           MARK set 0x12c
...
...
           

iptables規則将流标記為0x12c(=300),然後如此配置ipvs:

$ sudo ip netns exec 72df0265d4af ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
FWM  300 rr
  -> 10.255.0.5:0                 Masq    1      0          0
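A side note on the mark value: the 0x12c set by the MARK rule and the FWM 300 matched by ipvs are the same number in different bases, which is easy to check:

```shell
# The mangle-table MARK rule tags packets with 0x12c; ipvs selects the
# virtual service by the same firewall mark, shown in decimal (FWM 300).
printf 'fwmark 0x12c = %d\n' 0x12c   # -> fwmark 0x12c = 300
```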
           

The nginx container, running on the other node, has been assigned IP address 10.255.0.5, and it is the only backend managed by this load balancer. Putting everything together, the complete network connectivity is shown in the following figure:

[Figure: complete Docker swarm network topology]

Summary

There are a lot of cool things going on behind Docker Swarm networking; they make it easy to build applications that span multiple hosts, and even multiple clouds. Digging into the low-level details helps with troubleshooting and debugging during development.