Analyzing and resolving a Linux kernel crash

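I. Problem scenario and environment

System environment:

Red Hat Enterprise Linux 6.4, kernel 2.6.32-358

Problem:

An iptables rule was added to the mangle table with NFQUEUE as its target. When an HTTP request matched the rule, the machine rebooted immediately. This happened twice, seemingly at random, but the problem could not be reproduced on the machine after it had rebooted.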

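II. Investigation

1. Checked messages, the kernel log, and dmesg: nothing abnormal.

2. Checked the machine's load before the reboot: CPU, memory, disk I/O, and network I/O were all normal.

3. Since the reboots only happened with the NFQUEUE target in use, the system itself was the suspect, and the symptoms pointed at iptables' NFQUEUE handling. NFQUEUE hands packets from the kernel to a user-space program for processing, so the fault had to be in either the kernel or libnetfilter_queue.

4. The server's console might have shown useful output at reboot time, but the server sits in the customer's data center, so looking at it was impractical.

5. Checking the reboot history with last turned up something unexpected: the entry for the NFQUEUE-triggered reboot carried a crash record, meaning the system had crashed and then rebooted. That confirmed a system or kernel crash.

6. Linux systems usually have kdump installed and configured by default, so when the kernel crashes, kdump captures the memory contents from just before the crash and writes a dump file, vmcore, into a dated directory under /var/crash/. The crash utility can analyze the vmcore file and extract key information about the state before the kernel crashed. A search on the machine did indeed turn up a vmcore file from the crash.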
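Steps 5 and 6 can be sketched as two small shell helpers. This is a minimal sketch, not the author's exact commands: the function names crash_sessions and list_vmcores are made up for illustration, and /var/crash is assumed to be the kdump target directory, as stated in step 6.

```shell
# Hypothetical helper: keep only `last` lines for sessions that ended in a
# crash (no logout record before the next boot). Reads text on stdin.
crash_sessions() {
    grep -w 'crash' || true   # `|| true`: finding no match is not an error here
}

# Hypothetical helper: list vmcore files under a kdump target directory.
# The default of /var/crash is an assumption taken from step 6.
list_vmcores() {
    dir="${1:-/var/crash}"
    [ -d "$dir" ] || return 0
    find "$dir" -type f -name 'vmcore*'
}

# Typical use on the affected host (not run here):
#   last -x | crash_sessions
#   list_vmcores
```

Both helpers only read data, so they are safe to run on a production machine.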

III. Analyzing the vmcore file

1. Install the debuginfo package for the exact kernel version:

# yum install kernel-debuginfo-2.6.32-358.el6.x86_64
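The debuginfo package has to match the release of the kernel that crashed, which is not necessarily the kernel currently running. A minimal sketch for deriving the package name follows. debuginfo_pkg is a hypothetical helper, and the commented lines assume that the installed crash utility supports the --osrelease option and that a debuginfo repository is enabled for yum.

```shell
# Hypothetical helper: map a kernel release string to its debuginfo package.
debuginfo_pkg() {
    printf 'kernel-debuginfo-%s\n' "$1"
}

# On the affected host (not run here; <dir> is the dated dump directory):
#   release=$(crash --osrelease /var/crash/<dir>/vmcore)   # release recorded in the dump
#   yum install "$(debuginfo_pkg "$release")"
```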

2. Analyze the vmcore with the crash utility shipped with the system:

<code># crash /usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64/vmlinux vmcore</code>

<code>crash 7.1.0-6.el6</code>

<code>This program is free software, covered by the GNU General Public License,</code>

<code>and you are welcome to change it and/or distribute copies of it under</code>

<code>certain conditions. Enter "help copying" to see the conditions.</code>

<code>This program has absolutely no warranty. Enter "help warranty" for details.</code>

<code>License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html></code>

<code>This is free software: you are free to change and redistribute it.</code>

<code>There is NO WARRANTY, to the extent permitted by law. Type "show copying"</code>

<code>and "show warranty" for details.</code>

<code>This GDB was configured as "x86_64-unknown-linux-gnu"...</code>

<code>WARNING: kernel version inconsistency between vmlinux and dumpfile</code>

<code> </code><code>KERNEL: vmlinux</code>

<code> </code><code>DUMPFILE: vmcore [PARTIAL DUMP]</code>

<code> </code><code>UPTIME: 342 days, 12:15:26</code>

<code>LOAD AVERAGE: 0.00, 0.02, 0.00</code>

<code> </code><code>TASKS: 1050</code>

<code> </code><code>NODENAME: web_yp_49_202.mobileztgame</code>

<code> </code><code>RELEASE: 2.6.32-358.el6.x86_64</code>

<code> </code><code>VERSION: #1 SMP Tue Jan 29 11:47:41 EST 2013</code>

<code> </code><code>MACHINE: x86_64 (2499 Mhz)</code>

<code> </code><code>MEMORY: 128 GB</code>

<code> </code><code>PANIC: </code><code>"BUG: unable to handle kernel NULL pointer dereference at (null)"</code>

<code> </code><code>COMMAND: </code><code>"swapper"</code>

<code> </code><code>TASK: ffff882069324080 (1 of 40) [THREAD_INFO: ffff881068896000]</code>

<code> </code><code>STATE: TASK_RUNNING (PANIC)</code>

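The crash output shows that the kernel panicked because it dereferenced a NULL pointer.

The bt command shows the stack backtrace at the time of the crash.

Its output is as follows: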

<code>crash> bt</code>

<code>PID: 0 TASK: ffff882069324080 CPU: 5 COMMAND: </code><code>"swapper"</code>

<code> </code><code>#0 [ffff8800618a3750] machine_kexec at ffffffff81035b7b</code>

<code> </code><code>#1 [ffff8800618a37b0] crash_kexec at ffffffff810c0db2</code>

<code> </code><code>#2 [ffff8800618a3880] oops_end at ffffffff815111d0</code>

<code> </code><code>#3 [ffff8800618a38b0] no_context at ffffffff81046bfb</code>

<code> </code><code>#4 [ffff8800618a3900] __bad_area_nosemaphore at ffffffff81046e85</code>

<code> </code><code>#5 [ffff8800618a3950] bad_area_nosemaphore at ffffffff81046f53</code>

<code> </code><code>#6 [ffff8800618a3960] __do_page_fault at ffffffff810476b1</code>

<code> </code><code>#7 [ffff8800618a3a80] do_page_fault at ffffffff8151311e</code>

<code> </code><code>#8 [ffff8800618a3ab0] page_fault at ffffffff815104d5</code>

<code> </code><code>[exception RIP: nf_queue+152]</code>

<code> </code><code>RIP: ffffffff81475718 RSP: ffff8800618a3b60 RFLAGS: 00010207</code>

<code> </code><code>R13: 0000000000000000 R14: ffffffff8147e8b0 R15: 0000000000000000</code>

<code> </code><code>ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018</code>

<code> </code><code>#9 [ffff8800618a3bd8] nf_hook_slow at ffffffff81474800</code>

<code>#10 [ffff8800618a3c58] ip_rcv at ffffffff8147ef54</code>

<code>#11 [ffff8800618a3c98] __netif_receive_skb at ffffffff8144819b</code>

<code>#12 [ffff8800618a3cf8] netif_receive_skb at ffffffff8144a578</code>

<code>#13 [ffff8800618a3d38] napi_skb_finish at ffffffff8144a680</code>

<code>#14 [ffff8800618a3d58] napi_gro_receive at ffffffff8144cc29</code>

<code>#15 [ffff8800618a3d78] ixgbe_poll at ffffffffa015e44c [ixgbe]</code>

<code>#16 [ffff8800618a3e68] net_rx_action at ffffffff8144cd43</code>

<code>#17 [ffff8800618a3ec8] __do_softirq at ffffffff81076fb1</code>

<code>#18 [ffff8800618a3f38] call_softirq at ffffffff8100c1cc</code>

<code>#19 [ffff8800618a3f50] do_softirq at ffffffff8100de05</code>

<code>#20 [ffff8800618a3f70] irq_exit at ffffffff81076d95</code>

<code>#21 [ffff8800618a3f80] do_IRQ at ffffffff81516c95</code>

<code>#22 [ffff881068897db8] ret_from_intr at ffffffff8100b9d3</code>

<code> </code><code>[exception RIP: intel_idle+222]</code>

<code> </code><code>RIP: ffffffff812d37ae RSP: ffff881068897e68 RFLAGS: 00000206</code>

<code> </code><code>RBP: ffffffff8100b9ce R8: 0000000000000004 R9: 0000000000000050</code>

<code> </code><code>ORIG_RAX: ffffffffffffff62 CS: 0010 SS: 0018</code>

<code>#23 [ffff881068897ee0] cpuidle_idle_call at ffffffff81414ef7</code>

<code>#24 [ffff881068897f00] cpu_idle at ffffffff81009fc6</code>

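Reading the bt output from bottom to top gives the chain of kernel function calls that led up to the crash. The last exception before the panic is a page fault whose instruction pointer (RIP) is at nf_queue+152. The dis command disassembles that address: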

<code>crash> dis -l ffffffff81475718</code>

<code>/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/net/netfilter/nf_queue.c: 221</code>

<code>0xffffffff81475718 <nf_queue+152>: mov (%rbx),%r12</code>

This pinpoints the code where the fault occurred:

<code># vim /usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/net/netfilter/nf_queue.c +221</code>

<code>215 segs = skb_gso_segment(skb, 0);</code>

<code>216 kfree_skb(skb);</code>

<code>218 return 1;</code>

<code>221 struct sk_buff *nskb = segs->next;</code>

<code>224 if (!__nf_queue(segs, elem, pf, hook, indev, outdev, okfn,</code>

<code>225 queuenum))</code>

<code>226 kfree_skb(segs);</code>

<code>228 } while (segs);</code>

<code>229 return 1;</code>

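The comment above the skb_gso_segment function (in the source shown below) says that it may return NULL in some cases. When that happens, the segs->next access at line 221 above dereferences a NULL pointer and the kernel crashes. Since the problem comes from GSO, it should be possible to work around it by adjusting the system's GSO-related settings: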

<code># vim /usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/net/core/dev.c +1728</code>

<code>1729 * skb_gso_segment - Perform segmentation on skb.</code>

<code>1730 * @skb: buffer to segment</code>

<code>1731 * @features: features for the output path (see dev->features)</code>

<code>1733 * This function segments the given skb and returns a list of segments.</code>

<code>1735 * It may return NULL if the skb requires no segmentation. This is</code>

<code>1736 * only possible when GSO is used for verifying header integrity.</code>

<code>1738 struct sk_buff *skb_gso_segment(struct sk_buff *skb, int features)</code>

<code>1740 struct sk_buff *segs = ERR_PTR(-EPROTONOSUPPORT);</code>

<code>1741 struct packet_type *ptype;</code>

<code>1742 __be16 type = skb->protocol;</code>

The corresponding patch, found online:

<a href="https://patchwork.kernel.org/patch/6615071/" target="_blank">https://patchwork.kernel.org/patch/6615071/</a>

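IV. Reproducing the problem

1. The first attempt to reproduce the problem was a plain request, curl "t.test.com", which did not trigger it.

2. After reading up on TSO/GSO/LRO/GRO, the likely explanation was that the request was too small to trigger packet segmentation and reassembly, so the faulty code path was never exercised. Sending a larger request reproduced the problem with the following URL: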

# curl "t.test.com/v2/user-manage/css/bootstrap.min.css?test1=sdfsfsdfsdfa&test2_id=2234234234234234234&test_id=50129009890098&test_token=1670056402|_80_m_lxxj1298|1493196793|c726299f2d03b8462764bacf20e2395f|sdfsdfdsfsdffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffsdfsdfsdfdsfsdfhgjgjghjghjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjfhjgjfghjfjfhjjjjjjjjjjjjjjjjjjjjjfffffadfsfsdfsdfsdfsdfsdfdsfdssdfsdfsdfsdfsdfsdf"
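The long query string does not have to be typed by hand. The sketch below builds a padded URL of a chosen size. padded_url is a hypothetical helper and the pad parameter name is invented for illustration. Whether a given size actually triggers the crash depends on the NIC offload settings and has not been verified against the original setup.

```shell
# Hypothetical helper: print a URL for host $1 whose query string carries
# $2 bytes of filler, to make the HTTP request larger.
padded_url() {
    pad=$(head -c "$2" /dev/zero | tr '\0' 'f')
    printf '%s/?pad=%s\n' "$1" "$pad"
}

# Example (not run here; this can crash a machine with the affected kernel):
#   curl "$(padded_url t.test.com 600)"
```

Quoting the command substitution matters: an unquoted URL containing & or | would be split by the shell instead of being passed to curl as one argument.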

The relevant iptables rules:

<code># ipset create lee hash:ip hashsize 819200 maxelem 100000 timeout 300</code>

<code># ipset add lee 1.1.1.1 timeout 300</code>

<code># iptables -t mangle -I PREROUTING -p tcp -m multiport --dports 80,443 -m set --match-set lee src -m string --string t.test.com --algo kmp --from 0 --to 1480 -j NFQUEUE</code>

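V. Conclusion

This is a Linux kernel bug.

VI. Solutions

1. Upgrade the kernel. Judging from the patch and the source, kernels after 3.0 should have this fixed; the 3.10 kernel source was checked and does contain the fix.

2. Use DROP instead of NFQUEUE as the target of the iptables rule (this is the recommended option).

3. Adjust the NIC's GSO-related offload settings. Disabling LRO turned out to stop the reboots. The command: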

# ethtool -K eth0 lro off
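After changing the setting, it is worth confirming that LRO is really disabled. A minimal sketch: offload_state is a hypothetical helper that parses the feature list printed by ethtool -k (lowercase k queries, uppercase K sets), and the feature name large-receive-offload is the label ethtool prints for LRO.

```shell
# Hypothetical helper: print the state of offload feature $1 from
# `ethtool -k` output on stdin (lines look like "name: on" or "name: off").
offload_state() {
    awk -F': *' -v feat="$1" '$1 == feat { print $2 }'
}

# On the affected host (not run here; eth0 is the interface used above):
#   ethtool -k eth0 | offload_state large-receive-offload
```

A setting changed with ethtool -K does not survive a reboot by itself, so it also needs to be added to the interface's startup configuration.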

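About LRO:

Linux 2.6.24 added LRO (Large Receive Offload) support for IPv4 TCP. LRO aggregates multiple TCP segments into a single skb and later passes it up the network stack as one large packet, which reduces the per-skb processing cost in the upper layers and improves the system's capacity to receive TCP traffic. All of this requires support from the NIC driver.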

VII. References

<a href="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/kernel_crash_dump_guide/sect-crash-running-the-utility" target="_blank">https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/kernel_crash_dump_guide/sect-crash-running-the-utility</a>

https://www.ibm.com/developerworks/cn/linux/l-cn-network-pt/index.html

This article is reposted from leejia1989's 51CTO blog. Original link: http://blog.51cto.com/leejia/1978729. To repost it, please contact the original author.
