MySQL Case Study: Why a Thread Cannot Be Killed

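Background

In day-to-day operation, one or many connections occasionally pile up in MySQL. The usual response is to use the KILL statement to forcibly terminate these long-running connections, so that connection slots and CPU on the database server are freed as quickly as possible.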

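Problem Description

When running KILL in practice, you sometimes find that the connection is not terminated right away. It still shows up in the processlist, but its Command column reads Killed instead of the usual Query or Execute. For example: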

mysql> show processlist;
+----+------+--------------------+--------+---------+------+--------------+---------------------------------+
| Id | User | Host               | db     | Command | Time | State        | Info                            |
+----+------+--------------------+--------+---------+------+--------------+---------------------------------+
| 31 | root | 192.168.1.10:50410 | sbtest | Query   |    0 | starting     | show processlist                |
| 32 | root | 192.168.1.10:50412 | sbtest | Query   |   62 | User sleep   | select sleep(3600) from sbtest1 |
| 35 | root | 192.168.1.10:51252 | sbtest | Killed  |   47 | Sending data | select sleep(100) from sbtest1  |
| 36 | root | 192.168.1.10:51304 | sbtest | Query   |   20 | Sending data | select sleep(3600) from sbtest1 |
+----+------+--------------------+--------+---------+------+--------------+---------------------------------+

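Sessions like the ones in the listing above can also be found with a query instead of by reading the full SHOW PROCESSLIST output. The following is a minimal sketch, assuming the information_schema.processlist table is available (newer MySQL versions also provide performance_schema.processlist); the 60-second threshold is an arbitrary value chosen for illustration, and 35 is simply the thread Id from the listing above.

```sql
-- Statements that have been running for more than 60 seconds
-- (the threshold is an assumption for illustration, tune it to your workload).
SELECT id, user, host, db, command, time, state, info
FROM information_schema.processlist
WHERE command NOT IN ('Sleep', 'Killed')
  AND time > 60
ORDER BY time DESC;

-- Threads that have already been killed but have not exited yet.
SELECT id, user, host, db, time, state, info
FROM information_schema.processlist
WHERE command = 'Killed';

-- Abort only the statement the thread is running; the connection stays open.
KILL QUERY 35;

-- Abort the statement and close the connection; plain "KILL 35" does the same.
KILL CONNECTION 35;
```

Both forms of KILL only set the kill flag, so either one can leave the thread visible for a while if the flag is not checked promptly.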

Root Cause Analysis

When in doubt, check the official documentation first. Here is an excerpt:

When you use KILL, a thread-specific kill flag is set for the thread. In most cases, it might take some time for the thread to die because the kill flag is checked only at specific intervals:

During SELECT operations, for ORDER BY and GROUP BY loops, the flag is checked after reading a block of rows. If the kill flag is set, the statement is aborted.

ALTER TABLE operations that make a table copy check the kill flag periodically for each few copied rows read from the original table. If the kill flag was set, the statement is aborted and the temporary table is deleted.

The KILL statement returns without waiting for confirmation, but the kill flag check aborts the operation within a reasonably small amount of time. Aborting the operation to perform any necessary cleanup also takes some time.

During UPDATE or DELETE operations, the kill flag is checked after each block read and after each updated or deleted row. If the kill flag is set, the statement is aborted. If you are not using transactions, the changes are not rolled back.

GET_LOCK() aborts and returns NULL.

If the thread is in the table lock handler (state: Locked), the table lock is quickly aborted.

If the thread is waiting for free disk space in a write call, the write is aborted with a “disk full” error message.

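The first paragraph of the documentation states the mechanism clearly: KILL sets a thread-level kill flag on the connection's thread, and the flag only takes effect at the next "flag check". It follows that if the next flag check is slow to happen, the symptom described above can appear.

The documentation lists quite a few scenarios. Based on its description, here are several of the more common problem cases: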

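When a SELECT statement performs ORDER BY or GROUP BY and the server is short on CPU, reading a block of rows takes longer, which delays the next flag check.
When a DML statement that touches a large amount of data is killed, a transaction rollback is triggered (InnoDB engine). The statement itself is killed, but the rollback can take a very long time.
When an ALTER operation is killed while the server is under heavy load, processing each batch of rows takes longer, which delays the next flag check.

To generalize from how KILL works: anything that blocks or slows down the normal execution of a SQL statement delays the next flag check, or prevents it from happening at all, and the KILL then fails to take effect in time.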
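For the DML rollback case, one way to tell whether a killed statement is still undoing its changes is to look at InnoDB's transaction table. This is a minimal sketch, assuming read access to information_schema.innodb_trx. Treating trx_rows_modified as a rough progress hint is an assumption here, since how it changes during rollback may differ between versions.

```sql
-- Transactions that InnoDB is currently rolling back.
-- trx_mysql_thread_id matches the Id column of SHOW PROCESSLIST.
SELECT trx_id,
       trx_state,
       trx_started,
       trx_mysql_thread_id,
       trx_rows_modified
FROM information_schema.innodb_trx
WHERE trx_state = 'ROLLING BACK';
```

Running the query a few times and comparing the output gives a sense of whether the rollback is still making progress.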

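Simulation

Here the innodb_thread_concurrency parameter is borrowed to simulate a situation in which SQL statements are blocked from executing normally. The official documentation describes it as follows: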

Defines the maximum number of threads permitted inside of InnoDB. A value of 0 (the default) is interpreted as infinite concurrency (no limit). This variable is intended for performance tuning on high concurrency systems.

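According to the documentation, when this parameter is set to a low value, InnoDB queries beyond the limit are blocked. For this simulation the parameter is therefore set to a very low value.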

mysql> show variables like '%innodb_thread_concurrency%';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| innodb_thread_concurrency | 1     |
+---------------------------+-------+
1 row in set (0.00 sec)

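The article shows only the resulting value, not how it was set. One plausible way to reach this state, assuming a test instance and an account allowed to change global system variables, is sketched below.

```sql
-- Test environment only: allow a single thread inside InnoDB at a time.
-- This is a dynamic global variable, so no restart is needed.
SET GLOBAL innodb_thread_concurrency = 1;
```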

Next, open two database connections (Session 1 and Session 2), run select sleep(3600) from sbtest.sbtest1 in each of them, and then kill the query of Session 2 from a third connection:

Session 1:
mysql> select sleep(3600) from sbtest.sbtest1;

Session 2:
mysql> select sleep(3600) from sbtest.sbtest1;

mysql>

Session 3:
mysql> show processlist;
+----+------+--------------------+------+---------+------+--------------+----------------------------------------+
| Id | User | Host               | db   | Command | Time | State        | Info                                   |
+----+------+--------------------+------+---------+------+--------------+----------------------------------------+
| 44 | root | 172.16.64.10:39290 | NULL | Query   |   17 | User sleep   | select sleep(3600) from sbtest.sbtest1 |
| 45 | root | 172.16.64.10:39292 | NULL | Query   |    0 | starting     | show processlist                       |
| 46 | root | 172.16.64.10:39294 | NULL | Query   |    5 | Sending data | select sleep(3600) from sbtest.sbtest1 |
+----+------+--------------------+------+---------+------+--------------+----------------------------------------+
3 rows in set (0.00 sec)

mysql> kill 46;
Query OK, 0 rows affected (0.00 sec)

mysql> show processlist;
+----+------+--------------------+------+---------+------+--------------+----------------------------------------+
| Id | User | Host               | db   | Command | Time | State        | Info                                   |
+----+------+--------------------+------+---------+------+--------------+----------------------------------------+
| 44 | root | 172.16.64.10:39290 | NULL | Query   |   26 | User sleep   | select sleep(3600) from sbtest.sbtest1 |
| 45 | root | 172.16.64.10:39292 | NULL | Query   |    0 | starting     | show processlist                       |
| 46 | root | 172.16.64.10:39294 | NULL | Killed  |   14 | Sending data | select sleep(3600) from sbtest.sbtest1 |
+----+------+--------------------+------+---------+------+--------------+----------------------------------------+
3 rows in set (0.00 sec)

mysql>


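As the output shows, the connection of Session 2 is dropped as soon as the kill command runs, but the query that Session 2 issued lingers in MySQL. Of course, if a problem like this is caused by innodb_thread_concurrency, it can be resolved by raising the limit with SET GLOBAL or by setting it to 0. A change to this parameter takes effect for all connections immediately.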
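The fix described above can be sketched as follows, assuming the statements are run from the mysql command-line client (the \G terminator is a client feature) by an account allowed to change global system variables.

```sql
-- Remove the concurrency limit; 0 is the default and means no limit.
SET GLOBAL innodb_thread_concurrency = 0;

-- Confirm the new value.
SHOW GLOBAL VARIABLES LIKE 'innodb_thread_concurrency';

-- The ROW OPERATIONS section reports how many queries are inside InnoDB
-- and how many are waiting in the queue.
SHOW ENGINE INNODB STATUS\G

-- Check whether the thread that showed Command = Killed has exited.
SHOW PROCESSLIST;
```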

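Summary

MySQL's KILL does not forcibly terminate a database connection outright, as one might imagine. It only sends a termination signal. If the SQL statement itself runs too slowly, or is affected by other factors (high server load, a large rollback), the KILL may well fail to stop the problem queries in time. Worse, once the connection is dropped on the application side, the application may reconnect and issue even more inefficient queries, dragging the database down further.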