Keywords: Cannot connect to the Docker daemon at, containerd cannot properly do "clean-up" with shim process during start up, a Synology-like PaaS built with standard methods, with debuggable appliance built inside
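In the earlier post "Deploying a function serverless panel on your cloud host with openfaas faasd" we installed openfaas with a script extracted from https://github.com/openfaas/faasd/tree/0.9.2/cloud-config.txt (we later moved to 0.9.5), and "Installing onemanager on the openfaas panel" (parts 1 and 2) covered how to use it on a cloud host. Those three posts were mainly about basic installation, troubleshooting, and debugging; starting with this post, the focus shifts to hardening and improving the script. The results of the earlier three posts still apply.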
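The first problem: the script must install on a clean Ubuntu 18.04 machine, ideally succeeding on the first run. If it fails, it must be safe to run again over the previous attempt without damaging the system. This requires that the components the script installs be kept separate, each with its own configuration placed standalone, so that a reinstall can unplug, replace, and overwrite them individually.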
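The second problem is one I never hit in the previous three posts. In later attempts on Ubuntu 18.04, with the same installation method and component versions (containerd v1.3.3 + cni 0.4.0 + cni plugins 0.8.5 + faasd 0.9.5), the gateway container stopped shortly after starting, so port 8080 could not be reached at all.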
Without further ado, here is the new script:
Prerequisites
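The installation notes are updated and the global variables are now collected in one place. Note the deps prepare section: bridge-utils is there to control the openfaas0 virtual interface that cni creates. docker.io is installed because on Ubuntu it pulls in both containerd 1.3.3 and runc.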
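Since Docker 1.11, a Docker container is no longer started by the Docker daemon alone; Docker integrates several components such as containerd and runC. A quick search shows that docker-containerd is simply containerd and docker-runc is simply runc. containerd is the daemon that actually manages containers, and it uses runc to execute them.
Why split things up this way? To keep Docker from dominating the ecosystem alone, its original implementation was broken into several standardized modules, so that each module can be replaced by another implementation. It also gives an LLVM-like componentized development model (in software abstraction, parts that start out separate can be combined, while something built as one piece is hard to split later). It serves distribution as well: like git, docker became a set of distributed parts, a client and a server, and the same architecture works both locally and remotely.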
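And why docker.io rather than docker-ce:
I found that on some systems, installing docker-ce and then containerd breaks things with the error "Cannot connect to the Docker daemon at". docker-ce and containerd conflict there, so I install the Ubuntu-maintained docker.io, which resolves the containerd dependency: it depends on containerd and runc by default (we will replace containerd with a newer version later). After adding the latest cn Ubuntu deb sources and running apt-get update, the version listed by sudo apt-cache madison docker.io is the one we pin: sudo apt install docker.io=19.03.6-0ubuntu1~18.04.1.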
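Before pinning, it can help to confirm that the pinned version is actually offered by the configured sources. The following is a minimal sketch, not part of the original script: `madison_has_version` is a hypothetical helper that reads `apt-cache madison` output on stdin and assumes its usual `package | version | source` column layout.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script).
# Reads `apt-cache madison <pkg>` style lines from stdin and succeeds
# if the given version appears in the second "|"-separated column.
madison_has_version() {
  local wanted="$1"
  awk -F'|' -v want="$wanted" '
    { v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v) }   # trim the version column
    v == want { found = 1 }
    END { exit !found }
  '
}

# Intended use on a machine with apt (assumption, not run here):
#   apt-cache madison docker.io | madison_has_version '19.03.6-0ubuntu1~18.04.1' \
#     && echo "pinned version is available"
```

If the check fails, the pinned apt-get install line would fail too, so this only moves the error earlier and makes it easier to read.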
#!/bin/bash
## currently tested under ubuntu1804 64b, easy to port to centos (can be tested by replacing apt-get and /etc/systemd/system)
## How to use this script on a cloud host
## su root and then: ./panel.sh -d 'domain to bind' -m 'email to pass to certbot' -p 'your initial password' (email and password are not necessary; pass the email only if you encounter the "toomanyrequestofagivetype" error)
## (no https/http prefix needed; the domain should already point to the right ip so that certbot works later)
export DOMAIN_NAME=''
export EMAIL_NAME='[email protected]'
export PANEL_TYPE='0'
export PASS_INIT='5cTWUsD75ZgL3VJHdzpHLfcvJyOrUnza1jr6KXry5pXUUNmGtqmCZU4yGoc9yW4'
MIRROR_PATH="http://default-8g95m46n2bd18f80.service.tcloudbase.com/d/demos"
# the pai backend
SERVER_PATH=${MIRROR_PATH}/pai/pai-agent/stable/pai_agent_framework
PAI_MATE_SERVER_PATH=${MIRROR_PATH}/pai/pai-mate/stable/install
# the openfaas backend
OPENFAAS_PATH=${MIRROR_PATH}/faasd
# the code-server web ide
CODE_SERVER_PATH=${MIRROR_PATH}/codeserver
#install dir
INSTALL_DIR="/root/.local"
CONFIG_DIR="/root/.config"
# datadir only for pai and common data
DATA_DIR="/data"
while [[ $# -ge 1 ]]; do
case $1 in
-d|--domain)
shift
DOMAIN_NAME="$1"
shift
;;
-m|--mail)
shift
EMAIL_NAME="$1"
shift
;;
-t|--paneltype)
shift
PANEL_TYPE="$1"
shift
;;
-p|--passinit)
shift
PASS_INIT="$1"
shift
;;
*)
if [[ "$1" != 'error' ]]; then echo -ne "\nInvalid option: '$1'\n\n"; fi
echo -ne " Usage(args are self explained):\n\tbash $(basename "$0")\t-d/--domain\n\t\t\t\t-m/--mail\n\t\t\t\t-t/--paneltype\n\t\t\t\t-p/--passinit\n\t\t\t\t\n"
exit 1;
;;
esac
done
[[ "$EUID" -ne '0' ]] && echo "Error:This script must be run as root!" && exit 1;
beginTime=$(date +%s)
# write log with time
writeProgressLog() {
echo "[`date '+%Y-%m-%d %H:%M:%S'`][$1][$2]"
echo "[`date '+%Y-%m-%d %H:%M:%S'`][$1][$2]" >> ${DATA_DIR}/h5/access.log
}
# update install progress
updateProgress() {
progress=$1
message=$2
status=$3
installType=$4
# echo "=====================$installType progress======================="
echo "=======================$installType progress=======================" >> ${DATA_DIR}/h5/access.log
writeProgressLog "installType" $installType
writeProgressLog "progress" $progress
writeProgressLog "status" $status
echo $message >> ${DATA_DIR}/h5/access.log
if [ $status == "0" ]; then
code=0
message="success"
else
code=1
message="$installType error"
# exit 1
fi
cat << EOF > ${DATA_DIR}/h5/progress.json
{
"code": $code,
"message": "$message",
"data": {
"installType": "$installType",
"progress": $progress
}
}
EOF
if [ $status != "0" ]; then
echo $message >> ${DATA_DIR}/h5/installErr.log
fi
}
echo "=====================begin .....====================="
echo "PANEL_TYPE: ${PANEL_TYPE}"
echo "DOMAIN_NAME: ${DOMAIN_NAME}"
echo "SERVER_PATH: ${SERVER_PATH}"
echo "OPENFAAS_PATH: ${OPENFAAS_PATH}"
echo "PAI_MATE_SERVER_PATH: ${PAI_MATE_SERVER_PATH}"
echo "CODE_SERVER_PATH: ${CODE_SERVER_PATH}"
echo "INSTALL_DIR: ${INSTALL_DIR}"
rm -rf ${DATA_DIR}/h5
mkdir -p ${DATA_DIR}/h5
rm -rf ${DATA_DIR}/h5/index.json
rm -rf ${DATA_DIR}/logs
mkdir -p ${DATA_DIR}/logs
mkdir -p ${INSTALL_DIR}/bin
mkdir -p ${CONFIG_DIR}
echo "=====================deps prepare progress(this may take long...)======================="
msg=$( #begin
if [ $PANEL_TYPE == "0" ]; then
apt-key adv --recv-keys --keyserver keyserver.Ubuntu.com 3B4FE6ACC0B21F32
echo deb http://cn.archive.ubuntu.com/ubuntu/ bionic main restricted universe multiverse >> /etc/apt/sources.list
echo deb http://cn.archive.ubuntu.com/ubuntu/ bionic-security main restricted universe multiverse >> /etc/apt/sources.list
echo deb http://cn.archive.ubuntu.com/ubuntu/ bionic-updates main restricted universe multiverse >> /etc/apt/sources.list
echo deb http://cn.archive.ubuntu.com/ubuntu/ bionic-proposed main restricted universe multiverse >> /etc/apt/sources.list
echo deb http://cn.archive.ubuntu.com/ubuntu/ bionic-backports main restricted universe multiverse >> /etc/apt/sources.list
apt-get update
apt-get install docker.io=19.03.6-0ubuntu1~18.04.1 --no-install-recommends bridge-utils -y
apt-get install nginx python git python-certbot-nginx -y
# sed '1{:a;N;5!b a};$d;N;P;D' -i /etc/apt/sources.list
# apt-get update
else
apt-get update && apt-get install git nginx gcc python3.6 python3-pip python3-virtualenv python-certbot-nginx golang -y
fi 2>&1)
status=$?
updateProgress 30 "$msg" "$status" "deps prepare"
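Because the deps prepare step appends the mirror entries to /etc/apt/sources.list with `>>`, each re-run of the script adds the same lines again. One possible way to keep the step repeatable is to append a line only when it is not already present. This is a sketch under that assumption; `append_once` is a hypothetical helper, not something the original script defines.

```shell
#!/bin/bash
# Hypothetical helper: append a line to a file only if that exact line
# is not already there, so repeated installs do not pile up duplicates.
append_once() {
  local line="$1" file="$2"
  touch "$file"
  # -x: match the whole line, -F: treat the line as a fixed string
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

In the script, each `echo deb ... >> /etc/apt/sources.list` line could then become a call such as `append_once "deb http://cn.archive.ubuntu.com/ubuntu/ bionic main restricted universe multiverse" /etc/apt/sources.list`.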
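Base component code: nginx front and docker backend
This part hardcodes each forwarding rule, but the point is how to lay out the rules for a given forwarding need. The principle: if the proxy server address (the part after proxy_pass) includes a URI, that URI replaces the part of the request URI matched by the location. If the address has no URI, the full original request URI is forwarded to the proxy server.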
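The rule can be checked against the locations the script configures. The following sketch models only the simple prefix-location case described here. `upstream_uri` is a hypothetical helper that mimics the rewrite in plain shell and does not call nginx, and the request paths are made-up examples.

```shell
#!/bin/bash
# Hypothetical model of how a prefix location plus proxy_pass decides the
# URI sent upstream. Arguments: location prefix, proxy_pass value, request URI.
upstream_uri() {
  local location="$1" proxy_pass="$2" request="$3"
  local rest="${proxy_pass#*://}"      # drop the scheme, e.g. "http://"
  local uri_part=""
  case "$rest" in
    */*) uri_part="/${rest#*/}" ;;     # everything from the first "/" is the URI
  esac
  if [ -n "$uri_part" ]; then
    # proxy_pass has a URI: it replaces the part matched by the location
    local tail="${request#"$location"}"
    printf '%s\n' "${uri_part}${tail}"
  else
    # proxy_pass has no URI: the request URI is forwarded unchanged
    printf '%s\n' "$request"
  fi
}

upstream_uri "/faasd/" "http://localhost:8080/ui/" "/faasd/index.html"    # /ui/index.html
upstream_uri "/pai/" "http://localhost:5523" "/pai/status"                # /pai/status
upstream_uri "/codeserver/" "http://localhost:5000/" "/codeserver/login"  # /login
```

This is why /pai/ keeps its prefix when it reaches port 5523, while /faasd/ and /codeserver/ have theirs swapped for /ui/ and / respectively.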
confignginx() {
echo "=====================certbot renew+start+init progress======================="
systemctl enable nginx.service
systemctl start nginx
#cp -f /lib/systemd/system/certbot.service /etc/systemd/system/certbot-renew.service
#echo '[Install]' >> /etc/systemd/system/certbot-renew.service
#echo 'WantedBy=multi-user.target' >> /etc/systemd/system/certbot-renew.service
#cp -f /lib/systemd/system/certbot.timer /etc/systemd/system/certbot-renew.timer
# sed -i "s/renew/renew --nginx/g" /etc/systemd/system/certbot-renew.service
rm -rf /etc/systemd/system/certbot-renew.service
cat << 'EOF' > /etc/systemd/system/certbot-renew.service
[Unit]
Description=Certbot
Documentation=file:///usr/share/doc/python-certbot-doc/html/index.html
Documentation=https://letsencrypt.readthedocs.io/en/latest/
[Service]
Type=oneshot
ExecStart=/usr/bin/certbot -q renew
PrivateTmp=true
[Install]
WantedBy=multi-user.target
EOF
rm -rf /etc/systemd/system/certbot-renew.timer
cat << 'EOF' > /etc/systemd/system/certbot-renew.timer
[Unit]
Description=Run certbot twice daily
[Timer]
OnCalendar=*-*-* 00,12:00:00
RandomizedDelaySec=43200
Persistent=true
[Install]
WantedBy=timers.target
EOF
msg=$(
#first time renew
certbot certonly --quiet --standalone --agree-tos --non-interactive -m ${EMAIL_NAME} -d ${DOMAIN_NAME} --pre-hook "systemctl stop nginx"
systemctl daemon-reload
systemctl enable certbot-renew.service
systemctl start certbot-renew.service
systemctl start certbot-renew.timer 2>&1)
status=$?
updateProgress 40 "$msg" "$status" "certbot renew+start+init"
echo "=====================nginx reconfig progress======================="
# add nginx conf
rm -rf /etc/nginx/conf.d/default.conf
cat << 'EOF' > /etc/nginx/conf.d/default.conf
server {
listen 443 http2 ssl;
listen [::]:443 http2 ssl;
server_name DOMAIN_NAME;
ssl on;
ssl_certificate /etc/letsencrypt/live/DOMAIN_NAME/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/DOMAIN_NAME/privkey.pem;
ssl_session_timeout 5m;
ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:HIGH:!aNULL:!MD5:!RC4:!DHE;
ssl_prefer_server_ciphers on;
location / {
proxy_pass http://localhost:PORT;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection upgrade;
proxy_set_header Accept-Encoding gzip;
}
location /pai/ {
proxy_pass http://localhost:5523;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection upgrade;
proxy_set_header Accept-Encoding gzip;
}
location /faasd/ {
proxy_pass http://localhost:8080/ui/;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection upgrade;
proxy_set_header Accept-Encoding gzip;
}
location /codeserver/ {
proxy_pass http://localhost:5000/;
proxy_redirect http:// https://;
proxy_set_header Host $host:443/codeserver;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection upgrade;
proxy_set_header Accept-Encoding gzip;
}
}
server {
listen 80;
server_name DOMAIN_NAME;
if ($host = DOMAIN_NAME) {
return 301 https://$host$request_uri;
}
return 404;
}
EOF
sed -i "s#DOMAIN_NAME#${DOMAIN_NAME}#g" /etc/nginx/conf.d/default.conf
if [ $PANEL_TYPE == "0" ]; then
sed -i "s#PORT#8080/functions/#g" /etc/nginx/conf.d/default.conf
else
sed -i "s#PORT#3000#g" /etc/nginx/conf.d/default.conf
fi
# restart nginx
msg=$( #begin
[[ $(systemctl is-active nginx.service) == "activating" ]] && systemctl reload nginx.service
systemctl restart nginx 2>&1)
status=$?
updateProgress 50 "$msg" "$status" "nginx reconfig"
}
confignginx
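The nginx config above is written as a template with DOMAIN_NAME and PORT placeholders that sed fills in afterwards, using `#` as the delimiter so that slashes in the replacement need no escaping. Here is a small sketch of the same technique applied to stdin; `render_conf` is a hypothetical function and example.com is a placeholder value.

```shell
#!/bin/bash
# Hypothetical helper: fill DOMAIN_NAME and PORT placeholders in text
# read from stdin. "#" is the sed delimiter, so values may contain "/".
# Assumes the values themselves contain no "#" or "&".
render_conf() {
  local domain="$1" port="$2"
  sed -e "s#DOMAIN_NAME#${domain}#g" -e "s#PORT#${port}#g"
}

printf 'server_name DOMAIN_NAME;\nproxy_pass http://localhost:PORT;\n' \
  | render_conf "example.com" "8080/functions/"
```

Rendering to stdout first makes it possible to inspect the result before it replaces /etc/nginx/conf.d/default.conf.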
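So that docker can be installed over an existing installation, the next part of the script begins by clearing the existing configuration. The key issue here is the complicated relationship among containerd, cni, and faasd: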
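Container Network Interface (CNI) was originally a container networking specification initiated by CoreOS and is the basis of Kubernetes network plugins. The basic idea: when creating a container, the container runtime first creates the network namespace, then calls a CNI plugin to configure networking for that netns, and only then starts the process inside the container. CNI has since joined the CNCF and become the networking model it promotes. CNI handles all network-related operations during container creation and deletion, and creates all the rules needed for traffic into and out of the container to work, but it is not responsible for setting up the network medium, such as creating bridges or distributing routes to connect containers on different hosts.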
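That work is done by faasd and similar tools. These docker components (containerd + cni + ctr + runc) are configured and run by faasd. Starting a freshly installed containerd + cni + ctr + runc on its own does not bring up cni or any interface (containerd reports cni conf not found when started alone, which is harmless; it still starts). It takes faasd's actions to supply the cni and interface configuration. But the coupling is tight, which makes fully cleaning up the containers afterwards difficult.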
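To remove the containers, use ctr tasks kill && ctr tasks delete && ctr container delete. Afterwards, ps aux|grep manual shows that the shim tasks in the host namespace and their /proc/<pid>/ns entries are gone, yet some residue remains. The two are hard to separate: the task-to-container association opened by the shim and /var/run/containerd cannot be cleaned up, so containerd is hard to unplug or unconfigure on its own, and the next overwrite install cannot start from a clean slate.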
This is actually caused by a bug, "containerd cannot properly do "clean-up" with shim process during start up? #3971" (https://github.com/containerd/containerd/issues/3971), which was not fixed until 1.4.0 beta (https://github.com/containerd/containerd/pull/4100/commits/488d6194f2080709d9667e00ff244fbdc7ff95b2). In my test (cd /var/lib/faasd/, then faasd up) the fix only helps somewhat: 1.3.3 reports id exists and cannot recreate the container, while 1.4.0 reports that files under /run/container exist. Neither meets the need to fully clean up and reinstall containerd from scratch, so the containerd install step in the script prints a warning that it may hang for a long time and that installing over a previous run may hit a /run/containerd device busy error. If you hit that error about var/run not being deletable, let the installer finish and then reboot.
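So I chose containerd 1.4.0, which also fixes the gateway failure mentioned at the start. The cni plugins are still 0.8.5. I had wanted to use cri-containerd-cni-1.4.0-linux-amd64.tar.gz, but the cni bundled in it is 0.7.1, which does not match the 0.4.0 that faasd requires.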
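Since the gateway fix depends on running containerd 1.4.0 or later, a re-run could compare versions before replacing binaries. This is a minimal sketch, assuming GNU `sort -V` is available (it is part of coreutils on Ubuntu). `version_ge` is a hypothetical helper, and how the installed version string is obtained is deliberately left out.

```shell
#!/bin/bash
# Hypothetical helper: succeed if version $1 is greater than or equal to $2.
# Relies on GNU sort's -V (version sort): the smaller version sorts first.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

version_ge "1.4.0" "1.3.3" && echo "1.4.0 is new enough"
version_ge "1.3.3" "1.4.0" || echo "1.3.3 is too old"
```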
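Uninstalling and cleaning up cni is outside what ctr controls. cni has no control surface on the host: you would have to restore the process network namespace into a host directory, or run ip commands inside the container network namespace to check whether the interfaces are set up correctly, and both are tedious. Running the same three steps used for deleting containers (ctr tasks kill && ctr tasks delete && ctr container delete) also removes the five virtual interfaces for the five tasks from the ifconfig output, so I did not dig further into cni teardown.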
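The three-step removal can be wrapped in a loop with a dry-run switch, which makes it possible to review the commands before running them against a live containerd. This is a sketch, not the script's own code: `run` and `teardown` are hypothetical helpers, DRY_RUN is an assumed variable, and the container names are the five the script uses.

```shell
#!/bin/bash
# Hypothetical dry-run wrapper: print the command unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Kill the task, delete the task, then delete the container, per name.
teardown() {
  local name
  for name in "$@"; do
    run ctr tasks kill -s SIGKILL "$name"
    run ctr tasks delete "$name"
    run ctr container delete "$name"
  done
}

# Dry run by default: only prints the fifteen ctr commands.
teardown basic-auth-plugin nats prometheus gateway queue-worker
```

Setting DRY_RUN=0 executes the commands instead of printing them. A failing step does not stop the loop, so a container whose task is already gone is still deleted.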
configdocker() {
# is-active prints "active" for a running unit, so match that as well as "activating"
[[ $(systemctl is-active faasd-provider) =~ ^(active|activating)$ ]] && systemctl stop faasd-provider
[[ $(systemctl is-active faasd) =~ ^(active|activating)$ ]] && systemctl stop faasd
[[ $(systemctl is-active containerd) =~ ^(active|activating)$ ]] && ctr image remove docker.io/openfaas/basic-auth-plugin:0.18.18 docker.io/library/nats-streaming:0.11.2 docker.io/prom/prometheus:v2.14.0 docker.io/openfaas/gateway:0.18.18 docker.io/openfaas/queue-worker:0.11.2 && for i in basic-auth-plugin nats prometheus gateway queue-worker; do ctr tasks kill -s SIGKILL $i;ctr tasks delete $i;ctr container delete $i; done && systemctl stop containerd && sleep 10
ps -ef|grep containerd|awk '{print $2}'|xargs kill -9
rm -rf /var/run/containerd /run/containerd
[[ ! -z "$(brctl show|grep openfaas0)" ]] && ifconfig openfaas0 down && brctl delbr openfaas0
rm -rf /etc/cni
echo "===============================cniplugins installonly================================="
msg=$( #begin
if [ ! -f "/tmp/cni-plugins-linux-amd64-v0.8.5.tar.gz" ]; then
wget --no-check-certificate -qO- ${MIRROR_PATH}/docker/containernetworking/plugins/v0.8.5/cni-plugins-linux-amd64-v0.8.5.tar.gz > /tmp/cni-plugins-linux-amd64-v0.8.5.tar.gz
fi
mkdir -p /opt/cni/bin
tar -xf /tmp/cni-plugins-linux-amd64-v0.8.5.tar.gz -C /opt/cni/bin
/sbin/sysctl -w net.ipv4.conf.all.forwarding=1 2>&1)
status=$?
updateProgress 50 "$msg" "$status" "cniplugins installonly"
echo "======containerd install+start progress(this may hang long and if you over install the script you may encounter /run/containerd device busy error,for this case you need to reboot to fix after scripts finished)====="
msg=$( #begin
# del original deb by docker.io
rm -rf /usr/bin/containerd* /usr/bin/ctr
# replace with new bins
if [ ! -f "/tmp/containerd-1.4.0-linux-amd64.tar.gz" ]; then
wget --no-check-certificate -qO- ${MIRROR_PATH}/docker/containerd/v1.4.0/containerd-1.4.0-linux-amd64.tar.gz > /tmp/containerd-1.4.0-linux-amd64.tar.gz
fi
tar -xf /tmp/containerd-1.4.0-linux-amd64.tar.gz -C ${INSTALL_DIR}/bin/ --strip-components=1 && ln -sf ${INSTALL_DIR}/bin/containerd* /usr/local/bin/ && ln -sf ${INSTALL_DIR}/bin/ctr /usr/local/bin/ctr
rm -rf /etc/systemd/system/containerd.service
cat << 'EOF' > /etc/systemd/system/containerd.service
[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target
#After=network.target containerd.socket containerd.service
#Requires=containerd.socket containerd.service
[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/containerd
Type=notify
Delegate=yes
KillMode=process
#changed to mixed to let systemctl stop containerd kill shims
#KillMode=mixed
Restart=always
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
LimitNOFILE=1048576
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload && systemctl enable containerd
systemctl start containerd --no-pager 2>&1)
status=$?
updateProgress 50 "$msg" "$status" "containerd install+start"
}
configdocker
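One caveat about the `msg=$( ... 2>&1); status=$?` pattern used throughout the script: `$?` reflects only the last command inside the substitution, so an earlier failure is reported as success if a later command succeeds. The sketch below contrasts that with chaining the steps with `&&`. The function names are hypothetical and `false` stands in for a failing install step.

```shell
#!/bin/bash
# Pattern as used in the script: steps separated by newlines or ";".
# Only the status of the last step reaches the caller.
last_only() {
  local msg rc=0
  msg=$( false; echo "second step ran" 2>&1 ) || rc=$?
  echo "$rc"
}

# Variant: steps chained with "&&", so the first failure stops the chain
# and its status is what the caller sees.
chained() {
  local msg rc=0
  msg=$( false && echo "second step ran" 2>&1 ) || rc=$?
  echo "$rc"
}

last_only   # prints 0
chained     # prints 1
```

Whether to chain is a design choice: chaining surfaces the first failure in progress.json, while the original form keeps going and lets later steps run.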
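Once this containerd bug is fully resolved, it may become possible to stop and remove containerd's shim tasks completely. On a different note: remember the idea from my post "enginx" that openresty could script-forward links to a multi-component game server cluster and so enable demo based programming? (This is much like the five containers that make up openfaas, a typical unit of responsibility for a single-node distributed cluster, with authentication, a gateway, and business logic.) And the jupyter-based engitor? We now implement both with openfaas + vscodeonline. openfaas builds a "city" of distributed functions; a world made of such cities calls distributed APIs at the binary level and composes demos into applications. It is true demo building-block programming, because it can Turn Any CLI into a Function, even a local native cli. For example, it can turn shell into a fully distributed language, programming directly on binaries.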