Original article: https://neuvector.com/network-security/docker-swarm-container-networking/
The original article digs deep into how Docker Swarm networking is implemented and covers details that the official documentation leaves out, which makes it very helpful for understanding Docker networking.
Deployment
First, deploy a Docker swarm cluster with two nodes, named node 1 and node 2; the steps for creating the swarm itself are not repeated here. Next, create an overlay network and three services, each with a single replica, as follows:
docker network create --opt encrypted --subnet 100.0.0.0/24 -d overlay net1
docker service create --name redis --network net1 redis
docker service create --name node --network net1 nvbeta/node
docker service create --name nginx --network net1 -p 1080:80 nvbeta/swarm_nginx
The commands above create a typical three-tier application. nginx is the front-end load balancer that distributes incoming requests to the node service; the node service is a web service that queries redis and returns the result to the user through nginx. For simplicity, only one replica of the node service is created.
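Before digging into the networking, it is worth a quick sanity check that all three services have converged. A hedged example; the exact columns, IDs and image tags vary with the Docker version:
$ docker service ls
ID            NAME   REPLICAS  IMAGE
...           redis  1/1       redis
...           node   1/1       nvbeta/node
...           nginx  1/1       nvbeta/swarm_nginx
Each service should report 1/1 replicas before the rest of this walkthrough applies.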
The logical view of this application is shown below:

Networks
Take a look at the networks that have been created in the Docker swarm:
$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
cac91f9c60ff        bridge              bridge              local
b55339bbfab9        docker_gwbridge     bridge              local
fe6ef5d2e8e8        host                host                local
f1nvcluv1xnf        ingress             overlay             swarm
8vty8k3pejm5        net1                overlay             swarm
893a1bbe3118        none                null                local
net1
The overlay network we just created; it carries east-west traffic between containers.
docker_gwbridge
A bridge network created by Docker; it lets containers communicate with their host.
ingress
An overlay network created by Docker; the swarm uses it to expose services to the outside world and to provide the routing mesh feature.
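To focus on just the swarm-scoped overlays, the same list can be filtered by driver (assuming a Docker version whose docker network ls supports the --filter driver=... option):
$ docker network ls --filter driver=overlay
NETWORK ID          NAME                DRIVER              SCOPE
f1nvcluv1xnf        ingress             overlay             swarm
8vty8k3pejm5        net1                overlay             swarm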
The net1 network
Every service was created with the "--network net1" option, so each container instance must have one interface attached to net1. Looking at node 1, two containers are scheduled on this node:
$ docker ps
CONTAINER ID        IMAGE                 COMMAND                   CREATED             STATUS              PORTS               NAMES
eb03383913fb        nvbeta/node:latest    "nodemon /src/index.j"    2 hours ago         Up 2 hours          8888/tcp            node.1.10yscmxtoymkvs3bdd4z678w4
434ce2679482        redis:latest          "docker-entrypoint.sh"    2 hours ago         Up 2 hours          6379/tcp            redis.1.1a2l4qmvg887xjpfklp4d6n7y
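Another way to see which containers on this host are attached to net1 is to inspect the network; the Containers section only lists the endpoints that live on the local node. A sketch, with the output trimmed to the relevant part (field layout may vary by Docker version):
$ docker network inspect net1
...
"Containers": {
    "eb03383913fb...": { "Name": "node.1.10yscmxtoymkvs3bdd4z678w4", ... },
    "434ce2679482...": { "Name": "redis.1.1a2l4qmvg887xjpfklp4d6n7y", ... }
},
...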
By creating a symbolic link to Docker's network-namespace directory, we can list all the network namespaces on node 1:
$ cd /var/run
$ sudo ln -s /var/run/docker/netns netns
$ sudo ip netns
be663feced43
6e9de12ede80
2-8vty8k3pej
1-f1nvcluv1x
72df0265d4af
Comparing the namespace IDs with the network IDs in the Docker swarm, we can guess that the net1 network, whose ID is 8vty8k3pej..., lives in the 2-8vty8k3pej namespace. This can be confirmed by comparing the interfaces in that namespace with the interfaces in the containers.
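The guess is easy to double-check, because the namespace name embeds a prefix of the network ID. Printing the full ID of net1 should show that it begins with 8vty8k3pej (output truncated here):
$ docker network inspect --format '{{.Id}}' net1
8vty8k3pejm5...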
Interfaces inside the containers:
$ docker exec node.1.10yscmxtoymkvs3bdd4z678w4 ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
11040: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:65:00:00:03 brd ff:ff:ff:ff:ff:ff
11042: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:ac:12:00:04 brd ff:ff:ff:ff:ff:ff
$ docker exec redis.1.1a2l4qmvg887xjpfklp4d6n7y ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
11036: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:65:00:00:08 brd ff:ff:ff:ff:ff:ff
11038: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff
Interfaces in the 2-8vty8k3pej namespace:
$ sudo ip netns exec 2-8vty8k3pej ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
link/ether 22:37:32:66:b0:48 brd ff:ff:ff:ff:ff:ff
11035: vxlan1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default
link/ether 2a:30:95:63:af:75 brd ff:ff:ff:ff:ff:ff
11037: veth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default
link/ether da:84:44:5c:91:ce brd ff:ff:ff:ff:ff:ff
11041: veth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default
link/ether 8a:f9:bf:c1:ec:09 brd ff:ff:ff:ff:ff:ff
Note br0: it is a Linux bridge device, and all the other interfaces are attached to it, including vxlan1, veth2 and veth3. vxlan1 is a VTEP (VXLAN Tunnel End Point) device; it is a slave of br0 and implements the VXLAN overlay.
veth2 and veth3 are veth devices, which always come in pairs: one end sits in the namespace and the other end sits in a container, and the interface index of the container-side end is always exactly one less than the index of the namespace-side end. Therefore veth2 (11037) in the namespace pairs with eth0 (11036) in redis, and veth3 (11041) pairs with eth0 (11040) in node.
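If you would rather not rely on the index-off-by-one heuristic, the pairing can be verified directly: a veth interface reports its peer's interface index through ethtool (assuming ethtool is installed on the host). For example, veth3 in the namespace should report 11040, the index of eth0 inside the node container:
$ sudo ip netns exec 2-8vty8k3pej ethtool -S veth3
NIC statistics:
     peer_ifindex: 11040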
We can now confirm that the net1 network lives in namespace 2-8vty8k3pej. Based on what we know so far, the network topology looks like this:
                  node 1
   +-----------+          +-----------+
   |  nodejs   |          |   redis   |
   |           |          |           |
   +--------+--+          +--------+--+
            |                      |
            |                      |
            |                      |
            |                      |
   +--------+----------------------+--------+  net1
        101.0.0.3              101.0.0.8
        101.0.0.4(vip)         101.0.0.2(vip)
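The addresses marked (vip) in the diagram are not container addresses but the virtual IPs the swarm assigns to each service on net1; connections to a service name resolve to the VIP and are then load-balanced across the service's tasks. One way to list them (a sketch: it assumes a Docker version that exposes Endpoint.VirtualIPs in docker service inspect, and the values shown simply mirror the diagram above):
$ docker service inspect --format '{{json .Endpoint.VirtualIPs}}' node
[{"NetworkID":"8vty8k3pejm5...","Addr":"101.0.0.4/24"}]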
The docker_gwbridge network
Compare the interfaces inside the redis and node containers with the interfaces on the node 1 host. The host's interfaces are as follows:
$ ip link
...
4: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:24:f1:af:e8 brd ff:ff:ff:ff:ff:ff
5: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:e4:56:7e:9a brd ff:ff:ff:ff:ff:ff
11039: veth97d586b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
link/ether 02:6b:d4:fc:8a:8a brd ff:ff:ff:ff:ff:ff
11043: vethefdaa0d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
link/ether 0a:d5:ac:22:e7:5c brd ff:ff:ff:ff:ff:ff
10876: vethceaaebe: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP mode DEFAULT group default
link/ether 3a:77:3d:cc:1b:45 brd ff:ff:ff:ff:ff:ff
...
As you can see, three veth devices are attached to docker_gwbridge, with interface indexes 11039, 11043 and 10876. This can be confirmed with the following command:
$ brctl show
bridge name       bridge id               STP enabled     interfaces
docker0           8000.0242e4567e9a       no
docker_gwbridge   8000.024224f1afe8       no              veth97d586b
                                                          vethceaaebe
                                                          vethefdaa0d
Applying the veth-pair matching rule we derived for net1, we know that 11039 pairs with eth1 (11038) in redis, and 11043 pairs with eth1 (11042) in node. The network topology now looks like this:
                  node 1
  172.18.0.4              172.18.0.3
  +---+----------------------+---------------+  docker_gwbridge
      |                      |
      |                      |
      |                      |
      |                      |
   +--+--------+          +--+--------+
   |  nodejs   |          |   redis   |
   |           |          |           |
   +--------+--+          +--------+--+
            |                      |
            |                      |
            |                      |
            |                      |
   +--------+----------------------+--------+  net1
        101.0.0.3              101.0.0.8
        101.0.0.4(vip)         101.0.0.2(vip)
docker_gwbridge plays a role similar to the default docker0 network (which may also be named bridge, depending on the Docker version) in standalone Docker. The difference is that docker0 also provides connectivity to the outside world, while docker_gwbridge does not: it only handles communication between containers on the same host. When a container is exposed to the outside with the -p option, another network, called ingress, takes over.
The ingress network
List the network namespaces on the node 1 host and the networks in the Docker swarm once more:
$ sudo ip netns
be663feced43
6e9de12ede80
2-8vty8k3pej
1-f1nvcluv1x
72df0265d4af
$ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
cac91f9c60ff        bridge              bridge              local
b55339bbfab9        docker_gwbridge     bridge              local
fe6ef5d2e8e8        host                host                local
f1nvcluv1xnf        ingress             overlay             swarm
8vty8k3pejm5        net1                overlay             swarm
893a1bbe3118        none                null                local
Clearly, the ingress network lives in the 1-f1nvcluv1x namespace. But what is the 72df0265d4af namespace for? Let's first look at the interfaces inside it:
$ sudo ip netns exec 72df0265d4af ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
10873: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 02:42:0a:ff:00:03 brd ff:ff:ff:ff:ff:ff
inet 10.255.0.3/16 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::42:aff:feff:3/64 scope link
valid_lft forever preferred_lft forever
10875: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff
inet 172.18.0.2/16 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::42:acff:fe12:2/64 scope link
valid_lft forever preferred_lft forever
eth1 (10875) pairs with vethceaaebe (10876) on the host, and we can also tell that eth0 (10873) is attached to the ingress network. Both points can be corroborated by inspecting the details of the ingress and docker_gwbridge networks:
$ docker network inspect ingress
[
{
"Name": "ingress",
"Id": "f1nvcluv1xnfa0t2lca52w69w",
"Scope": "swarm",
"Driver": "overlay",
....
"Containers": {
"ingress-sbox": {
"Name": "ingress-endpoint",
"EndpointID": "3d48dc8b3e960a595e52b256e565a3e71ea035bb5e77ae4d4d1c56cab50ee112",
"MacAddress": "02:42:0a:ff:00:03",
"IPv4Address": "10.255.0.3/16",
"IPv6Address": ""
}
},
....
}
]
$ docker network inspect docker_gwbridge
[
{
"Name": "docker_gwbridge",
"Id": "b55339bbfab9bdad4ae51f116b028ad7188534cb05936bab973dceae8b78047d",
"Scope": "local",
"Driver": "bridge",
....
"Containers": {
....
"ingress-sbox": {
"Name": "gateway_ingress-sbox",
"EndpointID": "0b961253ec65349977daa3f84f079ec5e386fa0ae2e6dd80176513e7d4a8b2c3",
"MacAddress": "02:42:ac:12:00:02",
"IPv4Address": "172.18.0.2/16",
"IPv6Address": ""
}
},
....
}
]
The MAC/IP addresses of the endpoints in the output above match the MAC/IP addresses in network namespace 72df0265d4af. It follows that namespace 72df0265d4af belongs to the hidden container "ingress-sbox", which has two interfaces: one connected to the host and the other connected to the ingress network.
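Note that "ingress-sbox" never shows up in docker ps: it is not a regular container but a network sandbox (a namespace plus its interfaces) that Docker creates on each node participating in the ingress network, which is why we could only find it by walking the namespaces. A quick check, which should return only the header line:
$ docker ps -a --filter name=ingress
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES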
One of Docker Swarm's features is the "routing mesh": a container that publishes a port to the outside can be reached by contacting any node in the cluster, no matter which node it actually runs on. How is this done? Let's keep digging into the containers.
In our application, only the nginx service publishes a port, mapping its port 80 to port 1080 on the host, yet nginx is not running on node 1.
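This is easy to verify from outside the cluster: requests to port 1080 on node 1 (or any other node) still reach nginx, even though its container lives elsewhere. Assuming node 1 is reachable at 192.168.99.101 (a placeholder address for this example), something like the following should succeed:
$ curl -I http://192.168.99.101:1080
HTTP/1.1 200 OK
...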
Keep examining node 1:
$ sudo iptables -t nat -nvL
...
...
Chain DOCKER-INGRESS (2 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:1080 to:172.18.0.2:1080
 176K   11M RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0
$ sudo ip netns exec 72df0265d4af iptables -nvL -t nat
...
...
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    9   576 REDIRECT   tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:1080 redir ports 80
...
...
Chain POSTROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 DOCKER_POSTROUTING  all  --  *      *       0.0.0.0/0            127.0.0.11
   14   896 SNAT       all  --  *      *       0.0.0.0/0            10.255.0.0/16        ipvs to:10.255.0.3
As you can see, the iptables rules on the host forward traffic arriving on port 1080 straight to the hidden container 'ingress-sbox' (172.18.0.2); inside it, the PREROUTING chain redirects port 1080 to port 80, and the POSTROUTING chain then puts the packets on IP address 10.255.0.3, whose interface is attached to the ingress network.
Note the 'ipvs' keyword in the SNAT rule. IPVS is a load balancer implemented natively in the Linux kernel:
$ sudo ip netns exec 72df0265d4af iptables -nvL -t mangle
Chain PREROUTING (policy ACCEPT 144 packets, 12119 bytes)
 pkts bytes target     prot opt in     out     source               destination
   87  5874 MARK       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:1080 MARK set 0x12c
...
...
Chain OUTPUT (policy ACCEPT 15 packets, 936 bytes)
 pkts bytes target     prot opt in     out     source               destination
    0     0 MARK       all  --  *      *       0.0.0.0/0            10.255.0.2           MARK set 0x12c
...
...
iptables規則将流标記為0x12c(=300),然後如此配置ipvs:
$ sudo ip netns exec 72df0265d4af ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
FWM  300 rr
  -> 10.255.0.5:0                 Masq    1      0          0
On the other node, the nginx service's container has been assigned the IP address 10.255.0.5 on the ingress network, and it is the only backend managed by this load balancer. Putting it all together, we now have the complete picture of how all of these networks are connected.
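To close the loop, you can log in to node 2, where the nginx task actually runs, and confirm that its container owns 10.255.0.5 on the ingress network (the container name below is illustrative; use the name shown by docker ps on node 2):
$ docker inspect --format '{{.NetworkSettings.Networks.ingress.IPAddress}}' nginx.1.<task-id>
10.255.0.5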
Summary
A lot of cool things happen behind the scenes of Docker Swarm networking, and they make it much easier to build applications that span multiple hosts, or even multiple clouds. Digging into the low-level details helps with troubleshooting and debugging during development.