Alibaba Cloud container fails to start: failed to unshare namespaces, running exec setns process for init, Unable to create nf_co

A container failed to start on one node of an Alibaba Cloud Swarm cluster. The logs and events showed the following errors:

"failed to unshare namespaces: Cannot allocate memory"

Failed to start container: Error response from daemon: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:247:running exec setns process for init caused \"exit status 34\""

Starting the container by hand produced the same error:

Error response from daemon: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:247:running exec setns process for init caused \"exit status 34\"

The first error points at memory: an allocation failed, apparently because too little memory was available.

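The second error says that exec setns failed while the container's init process was being started. Starting a container means starting a process and setting up its namespaces, so this failure may be memory-related as well.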

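(As an aside, runC is essentially libcontainer plus a thin client. A container is an execution environment that shares the host kernel but is isolated from the other processes and resources on the system. Docker builds that isolated environment by calling the libcontainer package to manage and configure namespaces, cgroups, capabilities, and the filesystem.)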

Log in to the node and check the kernel log with dmesg:

[2751018.215519] docker_gwbridge: port 16(veth4c02c8b) entered disabled state

[2751018.239907] runc:[1:CHILD]: page allocation failure: order:6, mode:0x10c0d0 // failed to allocate 2^6 * page_size, i.e. 256K of contiguous memory

[2751018.239917] CPU: 5 PID: 839206 Comm: runc:[1:CHILD] Tainted: G E ------------ T 3.10.0-514.26.2.el7.x86_64 #1

[2751018.239919] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014

[2751018.239931] Call Trace:

[2751018.239940] [<ffffffff81687133>] dump_stack+0x19/0x1b

[2751018.239945] [<ffffffff811870a0>] warn_alloc_failed+0x110/0x180

[2751018.239990] [<ffffffff811a62d0>] kmem_cache_create_memcg+0x110/0x230

[2751018.239994] [<ffffffff811a641b>] kmem_cache_create+0x2b/0x30

[2751018.240003] [<ffffffffa03029e1>] nf_conntrack_init_net+0x101/0x250 [nf_conntrack]

[2751018.240009] [<ffffffffa03032b4>] nf_conntrack_pernet_init+0x14/0x150 [nf_conntrack]

[2751018.240025] [<ffffffff810b5bb9>] create_new_namespaces+0xf9/0x180

[2751018.240028] [<ffffffff810b5dfa>] unshare_nsproxy_namespaces+0x5a/0xc0

[2751018.240032] [<ffffffff810852c3>] SyS_unshare+0x193/0x300

[2751018.240036] [<ffffffff81697809>] system_call_fastpath+0x16/0x1b

[2751018.240038] Mem-Info:

[2751018.240044] active_anon:1311792 inactive_anon:68376 isolated_anon:0

active_file:59538 inactive_file:93653 isolated_file:0

unevictable:1 dirty:8384 writeback:0 unstable:0

slab_reclaimable:1510499 slab_unreclaimable:802998

mapped:104765 shmem:68700 pagetables:50862 bounce:0

free:66051 free_pcp:32 free_cma:0

[2751018.240049] Node 0 DMA free:15860kB min:64kB low:80kB high:96kB all_unreclaimable? yes

[2751018.240056] lowmem_reserve[]: 0 2815 15868 15868

[2751018.240060] Node 0 DMA32 free:153112kB min:11976kB low:14968kB high:17964kB active_anon:925112kB inactive_anon:41540kB active_file:43200kB inactive_file:180168kB unevictable:0kB isolated(anon):0kB isolated(file):0kB all_unreclaimable? no

[2751018.240125] lowmem_reserve[]: 0 0 13053 13053

[2751018.240129] Node 0 Normal free:95232kB min:55536kB low:69420kB high:83304kB active_anon:4322056kB inactive_anon:231964kB active_file:194952kB inactive_file:194444kB unevictable:4kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13367060kB mlocked:4kB dirty:1936kB writeback:0kB mapped:366104kB shmem:232948kB slab_reclaimable:5108320kB slab_unreclaimable:2751412kB kernel_stack:27808kB pagetables:175532kB unstable:0kB bounce:0kB free_pcp:264kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:27 all_unreclaimable? no

[2751018.240137] lowmem_reserve[]: 0 0 0 0

[2751018.240140] Node 0 DMA: 1*4kB (U) 0*8kB 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15860kB

[2751018.240149] Node 0 DMA32: 1515*4kB (UEM) 2456*8kB (UEM) 907*16kB (UEM) 2236*32kB (UEM) 567*64kB (UEM) 52*128kB (UEM) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 154972kB

[2751018.240159] Node 0 Normal: 15952*4kB (UEM) 2235*8kB (UEM) 859*16kB (UEM) 17*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 95976kB

[2751018.240167] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB

[2751018.240169] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB

[2751018.240170] 221859 total pagecache pages

[2751018.240171] 0 pages in swap cache

[2751018.240172] Swap cache stats: add 0, delete 0, find 0/0

[2751018.240173] Free swap = 0kB

[2751018.240174] Total swap = 0kB

[2751018.240175] 4194174 pages RAM

[2751018.240176] 0 pages HighMem/MovableOnly

[2751018.240177] 127317 pages reserved

[2751018.240179] kmem_cache_create(nf_conntrack_ffff88036ea66180) failed with error -12

[2751018.240181] CPU: 5 PID: 839206 Comm: runc:[1:CHILD] Tainted: G E ------------ T 3.10.0-514.26.2.el7.x86_64 #1

[2751018.240183] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014

[2751018.240184] ffff8803b2254e60 000000009f4901bf ffff880066f2fd60 ffffffff81687133

[2751018.240186] ffff880066f2fdb0 ffffffff811a6322 0000000000080000 0000000000000000

[2751018.240188] 00000000fffffff4 ffff88036ea66180 ffffffff81ae6580 ffff88036ea66180

[2751018.240190] Call Trace:

[2751018.240193] [<ffffffff81687133>] dump_stack+0x19/0x1b

[2751018.240196] [<ffffffff811a6322>] kmem_cache_create_memcg+0x162/0x230

[2751018.240198] [<ffffffff811a641b>] kmem_cache_create+0x2b/0x30

[2751018.240203] [<ffffffffa03029e1>] nf_conntrack_init_net+0x101/0x250 [nf_conntrack]

[2751018.240206] [<ffffffffa03032b4>] nf_conntrack_pernet_init+0x14/0x150 [nf_conntrack]

[2751018.240224] [<ffffffff81568a9c>] copy_net_ns+0x7c/0x130

[2751018.240227] [<ffffffff810b5bb9>] create_new_namespaces+0xf9/0x180

[2751018.240229] [<ffffffff810b5dfa>] unshare_nsproxy_namespaces+0x5a/0xc0

[2751018.240231] [<ffffffff810852c3>] SyS_unshare+0x193/0x300

[2751018.240235] Unable to create nf_conn slab cache

The key words in the call stacks (unshare, namespace, kmem_cache_create, page allocation failure) match the errors reported when the container was started.
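The per-zone free lists in the log above can be checked against the failed order:6 request with a few lines of Python. This is only a rough sketch: it assumes 4 kB pages, looks at a single zone, and ignores watermarks, zone fallback and migrate types, all of which the real allocator takes into account. The function names are made up for illustration.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, as is usual on x86_64


def parse_free_list(line):
    """Return {block_size_kB: count} from a 'Node 0 Normal: 15952*4kB ...' log line."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def can_satisfy(free_list, order):
    """True if the zone has a free block of at least 2**order pages."""
    need_kb = PAGE_KB * 2 ** order
    return any(count > 0 and size >= need_kb
               for size, count in free_list.items())


# The Normal-zone line copied from the dmesg output above.
line = ("Node 0 Normal: 15952*4kB (UEM) 2235*8kB (UEM) 859*16kB (UEM) "
        "17*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB "
        "0*4096kB = 95976kB")

free_list = parse_free_list(line)
total_kb = sum(size * count for size, count in free_list.items())

print(total_kb)                   # 95976, matches the total printed by the kernel
print(can_satisfy(free_list, 6))  # False: no free block of 256 kB or larger
print(can_satisfy(free_list, 3))  # True: 32 kB blocks are still available
```

So the Normal zone still had about 94 MB free, but none of it in a block large enough for an order-6 request.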

Since the allocation failed for lack of memory, let's see how much memory is still available:

[root@ ~]# free -mh
              total        used        free      shared  buff/cache   available
Mem:            15G        5.5G        270M        278M        9.7G        6.3G
Swap:            0B          0B          0B

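free shows only 270M, but counting buff/cache there is roughly 10G that could be used (the available column reports 6.3G). That is not little, and should be more than enough for a container.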

Going back to the log, there is the line "page allocation failure: order:6", so let's look at the buddy allocator's free lists:

[root@beta_004 ~]# cat /proc/buddyinfo
Node 0, zone      DMA      1      0      1      1      1      1      1      0      1      1      3
Node 0, zone    DMA32   8319   4025    634    266      0      0      0      0      0      0      0
Node 0, zone   Normal   8860  11464    488    174     27      0      0      0      0      0      0

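To manage and allocate memory of different sizes quickly, the kernel's buddy allocator keeps free pages in blocks of 2^order * page_size. buddyinfo shows that there are currently no free blocks of order 5 or above (ignore the DMA zones and look only at Normal). There are still plenty of small blocks: 8860 of 4K, 11464 of 8K, and so on.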

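In the kernel log, blocks of order 4 were in fact already at 0 (15952 of 4K, 2235 of 8K, and so on). The log and the "cat /proc/buddyinfo" output were captured at different times, so the counts per block size differ, but the overall picture is the same: lots of fragments, not enough large blocks.

[2751018.240159] Node 0 Normal: 15952*4kB (UEM) 2235*8kB (UEM) 859*16kB (UEM) 17*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 95976kB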
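Reading /proc/buddyinfo by eye gets tedious when comparing nodes, so here is a small sketch that parses the format shown above and reports, per zone, the highest order that still has a free block. The helper names are hypothetical, and the sample text is the failing node's output from this post; on a live machine the text would come from reading /proc/buddyinfo.

```python
def parse_buddyinfo(text):
    """Map (node, zone) -> list of free block counts, where the list index is the order."""
    zones = {}
    for raw in text.splitlines():
        parts = raw.replace(",", " ").split()
        # Expected shape: Node <n> zone <name> <count for order 0> <order 1> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        zones[(int(parts[1]), parts[3])] = [int(p) for p in parts[4:]]
    return zones


def highest_free_order(counts):
    """Largest order with at least one free block, or -1 if the zone has none."""
    return max((order for order, n in enumerate(counts) if n > 0), default=-1)


# Output of the failing node, copied from above.
sample = """\
Node 0, zone      DMA      1      0      1      1      1      1      1      0      1      1      3
Node 0, zone    DMA32   8319   4025    634    266      0      0      0      0      0      0      0
Node 0, zone   Normal   8860  11464    488    174     27      0      0      0      0      0      0
"""

zones = parse_buddyinfo(sample)
for (node, zone), counts in zones.items():
    print(node, zone, "highest free order:", highest_free_order(counts))
```

For this sample the Normal zone tops out at order 4 and DMA32 at order 3, both below the order 6 that the failed allocation asked for.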

On a healthy node, by contrast, cat /proc/buddyinfo shows free blocks at every order:

[root@ ~]# cat /proc/buddyinfo
Node 0, zone      DMA      1      0      0      1      2      1      1      0      1      1      3
Node 0, zone    DMA32    571    288    125     38     19     12      5      5    401      0      0
Node 0, zone   Normal   2121    754    758    419    735   1411   1202   1160   1075      3      1

What can be done to reduce memory fragmentation?

1. The extfrag_ parameters and settings described in https://www.kernel.org/doc/Documentation/sysctl/vm.txt

In the kernel's memory management, this kind of fragmentation is called external fragmentation, abbreviated extfrag.

extfrag_threshold

This parameter affects whether the kernel will compact memory or direct
reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in
debugfs shows what the fragmentation index for each order is in each zone in
the system. Values tending towards 0 imply allocations would fail due to lack
of memory, values towards 1000 imply failures are due to fragmentation and -1
implies that the allocation will succeed as long as watermarks are met.

The kernel will not compact memory in a zone if the
fragmentation index is <= extfrag_threshold. The default value is 500.

[root@beta_004 vm]# cat /sys/kernel/debug/extfrag/extfrag_index
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000  0.864  0.932  0.966  0.983  0.992  0.996  0.998
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000  0.919  0.960  0.980  0.990  0.995  0.998  0.999

[root@beta_004 vm]# cat extfrag_threshold
500

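In the DMA32 and Normal zones, only the first four entries of extfrag_index are -1, meaning allocations at those orders should succeed. The remaining seven, for order 4 and above, are all close to 1 (that is, close to 1000 on the scale used in the documentation), which points to fragmentation rather than lack of memory as the reason allocations at those orders fail.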
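The rule quoted from vm.txt can be written out as a short sketch. It assumes that the debugfs file prints each index divided by 1000 (so 0.919 stands for 919 on the documentation's 0 to 1000 scale) and applies the documented rule that a zone is not compacted when the index is <= extfrag_threshold. The function names and labels are made up for illustration.

```python
def parse_extfrag_index(text):
    """Map zone name -> list of indexes on the 0..1000 scale (-1000 stands for -1.000)."""
    zones = {}
    for raw in text.splitlines():
        parts = raw.replace(",", " ").split()
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        zones[parts[3]] = [round(float(v) * 1000) for v in parts[4:]]
    return zones


def classify(index, threshold=500):
    """Interpret one index following the extfrag_threshold text in vm.txt."""
    if index < 0:
        return "ok"        # allocation succeeds as long as watermarks are met
    if index <= threshold:
        return "reclaim"   # closer to lack of memory; the kernel will not compact
    return "compact"       # closer to fragmentation; compaction is attempted


# Values from the failing node, copied from above.
sample = """\
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000  0.864  0.932  0.966  0.983  0.992  0.996  0.998
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000  0.919  0.960  0.980  0.990  0.995  0.998  0.999
"""

extfrag = parse_extfrag_index(sample)
for order, index in enumerate(extfrag["Normal"]):
    print(order, index, classify(index))
```

With the default threshold of 500, every order from 4 upward in the Normal zone lands in the "compact" bucket, i.e. the failures are attributed to fragmentation.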

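extfrag_threshold defaults to 500. We tried changing it to 0.5, waited ten minutes, and saw no real change in buddyinfo. (The threshold is on the documentation's 0 to 1000 scale, so 500 already corresponds to 0.5 in the extfrag_index output, and the indexes above were already over it.) Perhaps memory was simply too fragmented to be merged any further.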

2. The compact_memory parameter, also in https://www.kernel.org/doc/Documentation/sysctl/vm.txt

compact_memory

Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
all zones are compacted such that free memory is available in contiguous
blocks where possible. This can be important for example in the allocation of
huge pages although processes will also directly compact memory as required.

This option is enabled by default in this kernel:

[root@beta_004 vm]# uname -a
Linux beta_004 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

[root@beta_004 vm]# cat /boot/config-3.10.0-514.26.2.el7.x86_64 | grep COMPACTION
CONFIG_BALLOON_COMPACTION=y
CONFIG_COMPACTION=y

The corresponding control is /proc/sys/vm/compact_memory, which is a write-only file: echo 1 > /proc/sys/vm/compact_memory

Checking buddyinfo again, nothing much had changed; if anything, there were even fewer of the larger blocks.

[root@beta_004 vm]# cat /proc/buddyinfo
Node 0, zone      DMA      1      0      1      1      1      1      1      0      1      1      3
Node 0, zone    DMA32   7502   3725   2606    144      0      0      0      0      0      0      0
Node 0, zone   Normal  26138  10751   1001     22      0      0      0      0      0      0      0

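3. https://events.static.linuxfound.org/sites/events/files/slides/%5BELC-2015%5D-System-wide-Memory-Defragmenter.pdf This is a defragmentation scheme proposed by an engineer from Samsung India at a Linux conference in 2015. It has not been merged into the kernel version in use here, so it could not be tried.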

In short, it adds a kernel tunable, used as echo 1 > /proc/sys/vm/shrink_memory followed by echo 1 > /proc/sys/vm/compact_memory.

Use cat /proc/vmstat | grep compact to watch how the compact_ counters change. Usually there will be some pages/blocks moved, and the degree of fragmentation drops somewhat.
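To see whether a compaction run did anything, the compact_ counters can be snapshotted before and after and then diffed. The sketch below works on /proc/vmstat-style text. The numbers in the two samples are invented purely for illustration, and the set of compact_ counters differs between kernel versions, so the code just picks up whatever carries that prefix.

```python
def compaction_counters(text):
    """Return the compact_* counters found in /proc/vmstat-style text."""
    counters = {}
    for raw in text.splitlines():
        parts = raw.split()
        if len(parts) == 2 and parts[0].startswith("compact_"):
            counters[parts[0]] = int(parts[1])
    return counters


def counter_delta(before, after):
    """Per-counter change between two snapshots."""
    return {name: value - before.get(name, 0) for name, value in after.items()}


# Hypothetical snapshots; on a real node read /proc/vmstat before and
# after writing 1 to /proc/sys/vm/compact_memory.
before_text = "nr_free_pages 66051\ncompact_stall 120\ncompact_fail 80\ncompact_success 40\n"
after_text = "nr_free_pages 65000\ncompact_stall 125\ncompact_fail 84\ncompact_success 41\n"

delta = counter_delta(compaction_counters(before_text),
                      compaction_counters(after_text))
print(delta)
```

If the failure counter grows much faster than the success counter, as in these made-up numbers, compaction is running but is mostly unable to build larger blocks.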

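After trying these approaches the container still would not start. Going back to the problem itself: the failure happens while allocating a large block of memory, so where does the container ask for one? Going through the configuration carefully, I found that the component was configured with a maximum memory of 256M, and that was the only candidate. Lowering it to 192M still failed. Finally I tried not setting the parameter at all, redeployed, and to my surprise it worked. Of course, the node already had over a thousand processes and more than two thousand threads, and that system pressure is the root cause: the same container was deployed without trouble in another cluster, as 7 instances with 1G of memory each.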

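After this long detour, it turns out that besides the shortage of system memory, the problem was also tied to the container's own settings. The parameter is meant as a limit, to stop a running container from requesting and holding memory without bound. I did not expect that in a special situation it could make deployment fail.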

4. Use echo X > /proc/sys/vm/drop_caches as described in an earlier post; see http://blog.csdn.net/wqhlmark64/article/details/78469401 for details.

Things that still need further study:

1. The meaning of the fields in slabtop, in particular why Active / Total Slabs (% used) is always 100% and why some object caches show very high usage (99%). Does that have any effect?

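2. slab_reclaimable accounts for more than 5G. It does not seem to be the same thing as the per-order free blocks in buddyinfo. How are these reclaimable objects reclaimed, and when? None of the defragmentation attempts seemed to affect this figure.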

slab_reclaimable:5108320kB

3. Some say this is a kernel bug that may have been fixed somewhere between 4.9.12 and 4.9.18; this is still to be confirmed.

Further investigation indicates that I'm probably hitting this kernel bug: OOM but no swap used. – Mark Feb 24 at 21:36

For anyone else who's experiencing this issue, the bug appears to have been fixed somewhere between 4.9.12 and 4.9.18. – Mark Apr 11 at 20:23

References:

1. How runc creates a container: http://blog.csdn.net/liuliuzi_hz/article/details/78649004

2. The write-up closest to this problem, which I drew on heavily: http://www.lijiaocn.com/%E9%97%AE%E9%A2%98/2017/11/13/problem-unable-create-nf-conn.html
