MySQL：産生大量小relay log的故障一例

能力有限有誤請諒解，源碼版本5.7.22

歡迎關注我的《深入了解MySQL主從原理 32講》，如下：

一、案例來源和現象

這個案例是朋友@peaceful遇到的線上問題，最終線索也是他自己找到的。現象如下：

1、出現了大量很小的relay log如下，堆積量大約2600個：

...
-rw-r----- 1 mysql dba    12827 Oct 11 12:28 mysql-relay-bin.036615
-rw-r----- 1 mysql dba     4908 Oct 11 12:28 mysql-relay-bin.036616
-rw-r----- 1 mysql dba     1188 Oct 11 12:28 mysql-relay-bin.036617
-rw-r----- 1 mysql dba     5823 Oct 11 12:29 mysql-relay-bin.036618
-rw-r----- 1 mysql dba      507 Oct 11 12:29 mysql-relay-bin.036619
-rw-r----- 1 mysql dba     1188 Oct 11 12:29 mysql-relay-bin.036620
-rw-r----- 1 mysql dba     3203 Oct 11 12:29 mysql-relay-bin.036621
-rw-r----- 1 mysql dba    37916 Oct 11 12:30 mysql-relay-bin.036622
-rw-r----- 1 mysql dba      507 Oct 11 12:30 mysql-relay-bin.036623
-rw-r----- 1 mysql dba     1188 Oct 11 12:31 mysql-relay-bin.036624
-rw-r----- 1 mysql dba     4909 Oct 11 12:31 mysql-relay-bin.036625
-rw-r----- 1 mysql dba     1188 Oct 11 12:31 mysql-relay-bin.036626
-rw-r----- 1 mysql dba      507 Oct 11 12:31 mysql-relay-bin.036627
-rw-r----- 1 mysql dba      507 Oct 11 12:32 mysql-relay-bin.036628
-rw-r----- 1 mysql dba     1188 Oct 11 12:32 mysql-relay-bin.036629
-rw-r----- 1 mysql dba      454 Oct 11 12:32 mysql-relay-bin.036630
-rw-r----- 1 mysql dba     6223 Oct 11 12:32 mysql-relay-bin.index

2、主庫錯誤日志有如下錯誤

2019-10-11T12:31:26.517309+08:00 61303425 [Note] While initializing dump thread for slave with UUID <eade0d03-ad91-11e7-8559-c81f66be1379>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(61303421).
2019-10-11T12:31:26.517489+08:00 61303425 [Note] Start binlog_dump to master_thread_id(61303425) slave_server(19304313), pos(, 4)
2019-10-11T12:31:44.203747+08:00 61303449 [Note] While initializing dump thread for slave with UUID <eade0d03-ad91-11e7-8559-c81f66be1379>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(61303425).
2019-10-11T12:31:44.203896+08:00 61303449 [Note] Start binlog_dump to master_thread_id(61303449) slave_server(19304313), pos(, 4)

二、slave_net_timeout參數分析

實際上第一眼看這個案例我也覺得很奇怪，因為很少有人會去設定slave_net_timeout參數，同樣我們也沒有設定過，是以關注較少。但是@peaceful自己找到了可能出現問題的設定就是目前從庫slave_net_timeout參數設定為10。我就順着這個線索往下分析，我們先來看看slave_net_timeout參數的功能。

目前看來從庫的slave_net_timeout有如下兩個功能：

1、設定IO線程在空閑情況下（沒有Event接收的情況下）的連接配接逾時時間。

這個參數5.7.7過後是60秒，以前是3600秒，修改後需要重新開機主從才會生效。

2、如果change master沒有指定MASTER_HEARTBEAT_PERIOD的情況下會設定為slave_net_timeout/2

一般我們配置主從都沒有去指定這個心跳周期，是以就是slave_net_timeout/2，它控制的是如果在主庫沒有Event産生的情況下，多久發送一個心跳Event給從庫的IO線程，用于保持連接配接。但是一旦我們配置了主從（change master）這個值就定下來了，不會随着slave_net_timeout參數的更改而更改，我們可以在slave_master_info表中找到相應的設定如下：

mysql> select Heartbeat from slave_master_info \G
*************************** 1. row ***************************
Heartbeat: 30
1 row in set (0.01 sec)

如果我們要更改這個值隻能重新 change master才行。

三、原因總結

如果滿足下面三個條件，将會出現案例中的故障：

主從中的MASTER_HEARTBEAT_PERIOD的值大于從庫slave_net_timeout
主庫目前壓力很小持續slave_net_timeout設定時間沒有産生新的Event
之前主從有一定的延遲

那麼這種情況下在主庫心跳Event發送給從庫的IO線程之前，IO線程已經斷開了。斷開後IO線程會進行重連，每次重連将會生成新的relay log，但是這些relay log由于延遲問題不能清理就出現了案例中的情況。

下面是官方文檔中關于這部分說明：

If you are logging master connection information to tables, MASTER_HEARTBEAT_PERIOD can be seen
as the value of the Heartbeat column of the mysql.slave_master_info table.
Setting interval to 0 disables heartbeats altogether. The default value for interval is equal to the
value of slave_net_timeout divided by 2.
Setting @@global.slave_net_timeout to a value less than that of the current heartbeat interval
results in a warning being issued. The effect of issuing RESET SLAVE on the heartbeat interval is to
reset it to the default value.

四、案例模拟

有了理論基礎就很好了模拟了，但是延遲這一點不太好模拟，是以我模拟的時候關閉了從庫的SQL線程來模拟積壓的情況。

提前配置好主從，檢視目前的心跳周期和slave_net_timeout參數如下：

mysql> show variables like '%slave_net_timeout%';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| slave_net_timeout | 60    |
+-------------------+-------+
1 row in set (0.01 sec)

mysql> select Heartbeat from slave_master_info \G
*************************** 1. row ***************************
Heartbeat: 30
1 row in set (0.00 sec)

1、停止從庫的SQL線程

stop slave sql_thread;

2、設定slave_net_timeout為10

mysql> set global slave_net_timeout=10;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                                         |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Warning | 1704 | The requested value for the heartbeat period exceeds the value of `slave_net_timeout' seconds. A sensible value for the period should be less than the timeout. |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

可以看到這裡實際上已經有一個警告了。

3、重新開機IO線程

這樣才會讓slave_net_timeout參數生效

mysql> stop slave ;
Query OK, 0 rows affected (0.01 sec)

mysql> start slave io_thread;
Query OK, 0 rows affected (0.01 sec)

4、觀察現象

大概每10秒會生成一個relay log檔案如下：

-rw-r----- 1 mysql mysql        500 2019-09-27 23:48:32.655001361 +0800 relay.000142
-rw-r----- 1 mysql mysql        500 2019-09-27 23:48:42.943001355 +0800 relay.000143
-rw-r----- 1 mysql mysql        500 2019-09-27 23:48:53.293001363 +0800 relay.000144
-rw-r----- 1 mysql mysql        500 2019-09-27 23:49:03.502000598 +0800 relay.000145
-rw-r----- 1 mysql mysql        500 2019-09-27 23:49:13.799001357 +0800 relay.000146
-rw-r----- 1 mysql mysql        500 2019-09-27 23:49:24.055001354 +0800 relay.000147
-rw-r----- 1 mysql mysql        500 2019-09-27 23:49:34.280001827 +0800 relay.000148
-rw-r----- 1 mysql mysql        500 2019-09-27 23:49:44.496001365 +0800 relay.000149
-rw-r----- 1 mysql mysql        500 2019-09-27 23:49:54.789001353 +0800 relay.000150
-rw-r----- 1 mysql mysql        500 2019-09-27 23:50:05.485001371 +0800 relay.000151
-rw-r----- 1 mysql mysql        500 2019-09-27 23:50:15.910001430 +0800 relay.000152

大概每10秒主庫的日志會輸出如下日志：

2019-10-08T02:27:24.996827+08:00 217 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(216).
2019-10-08T02:27:24.998297+08:00 217 [Note] Start binlog_dump to master_thread_id(217) slave_server(953340), pos(, 4)
2019-10-08T02:27:35.265961+08:00 218 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(217).
2019-10-08T02:27:35.266653+08:00 218 [Note] Start binlog_dump to master_thread_id(218) slave_server(953340), pos(, 4)
2019-10-08T02:27:45.588074+08:00 219 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(218).
2019-10-08T02:27:45.589814+08:00 219 [Note] Start binlog_dump to master_thread_id(219) slave_server(953340), pos(, 4)
2019-10-08T02:27:55.848558+08:00 220 [Note] While initializing dump thread for slave with UUID <010fde77-2075-11e9-ba07-5254009862c0>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(219).
2019-10-08T02:27:55.849442+08:00 220 [Note] Start binlog_dump to master_thread_id(220) slave_server(953340), pos(, 4)

這個日志就和案例中的一模一樣了。

解決問題

知道原因後解決也就很簡單了我們隻需設定slave_net_timeout參數為MASTER_HEARTBEAT_PERIOD的2倍就可以了，設定後重新開機主從即可。

五、實作方式

這裡我們将通過簡單的源碼調用分析來看看到底slave_net_timeout參數和MASTER_HEARTBEAT_PERIOD對主從的影響。

1、從庫使用參數slave_net_timeout

從庫IO線程啟動時候會通過參數slave_net_timeout設定逾時：

->connect_to_master
  -> mysql_options

case MYSQL_OPT_CONNECT_TIMEOUT: //MYSQL_OPT_CONNECT_TIMEOUT
    mysql->options.connect_timeout= *(uint*) arg;
    break;

而在建立和主庫的連接配接時候會使用這個值

connect_to_master
 ->mysql_real_connect
   -> get_vio_connect_timeout

timeout_sec= mysql->options.connect_timeout;

是以我們也看到了slave_net_timeout參數隻有在IO線程重新開機的時候才會生效

2、從庫設定MASTER_HEARTBEAT_PERIOD值

在每次使用從庫change master時候會設定這個值如下，預設為slave_net_timeout/2：

->change_master
  ->change_receive_options


    mi->heartbeat_period= min<float>(SLAVE_MAX_HEARTBEAT_PERIOD,
                                     (slave_net_timeout/2.0f));

是以我們看到隻有change master才會重新設定這個值，重新開機主從是不會重新設定的。

3、使用MASTER_HEARTBEAT_PERIOD值

每次IO線程啟動時候會将這個值傳遞給主庫的DUMP線程，方式應該是通過建構語句‘SET @master_heartbeat_period’來完成的。如下：

->handle_slave_io
 ->get_master_version_and_clock


if (mi->heartbeat_period != 0.0)
  {
    char llbuf[22];
    const char query_format[]= "SET @master_heartbeat_period= %s";
    char query[sizeof(query_format) - 2 + sizeof(llbuf)];

主庫啟動DUMP線程的時候會通過搜尋的方式找到這個值如下

->Binlog_sender::init
   ->Binlog_sender::init_heartbeat_period

user_var_entry *entry=
    (user_var_entry*) my_hash_search(&m_thd->user_vars, (uchar*) name.str,
                                     name.length);
  m_heartbeat_period= entry ? entry->val_int(&null_value) : 0;

4、DUMP線程使用MASTER_HEARTBEAT_PERIOD發送心跳Event

這裡主要是通過一個逾時等待來完成，如下：

->Binlog_sender::wait_new_events
  ->Binlog_sender::wait_with_heartbeat
 

set_timespec_nsec(&ts, m_heartbeat_period); //心跳逾時
    ret= mysql_bin_log.wait_for_update_bin_log(m_thd, &ts);//等待
    if (ret != ETIMEDOUT && ret != ETIME) //如果是正常收到則收到信号，說明有新的Event到來，否則如果是逾時則發送心跳Event
      break; //正常傳回0 是逾時傳回ETIMEDOUT 繼續循環

      if (send_heartbeat_event(log_pos)) //發送心跳Event
        return 1;

5、重連會殺掉可能的存在的DUMP線程

根據UUID進行比對如下：

->kill_zombie_dump_threads


Find_zombie_dump_thread find_zombie_dump_thread(slave_uuid);
THD *tmp= Global_THD_manager::get_instance()->
                                find_thd(&find_zombie_dump_thread);
  if (tmp)
  {
    /*
      Here we do not call kill_one_thread() as
      it will be slow because it will iterate through the list
      again. We just to do kill the thread ourselves.
    */
    if (log_warnings > 1)
    {
      if (slave_uuid.length())
      {
        sql_print_information("While initializing dump thread for slave with "
                              "UUID <%s>, found a zombie dump thread with the "
                              "same UUID. Master is killing the zombie dump "
                              "thread(%u).", slave_uuid.c_ptr(),
                              tmp->thread_id());
      }//這裡就是本案例中的日志了
.....

這裡我們看到了案例中的日志。

6、關于DUMP線程流程圖

最後給出一張來自我《MySQL主從原理32講》第17節中DUMP線程的流程圖如下：

在圖中可以看到心跳Event發送的位置。

作者微信：gp_22389860

MySQL：産生大量小relay log的故障一例

一、案例來源和現象

1、出現了大量很小的relay log如下，堆積量大約2600個：

2、主庫錯誤日志有如下錯誤

二、slave_net_timeout參數分析

三、原因總結

四、案例模拟

1、停止從庫的SQL線程

2、設定slave_net_timeout為10

3、重新開機IO線程

解決問題

五、實作方式

1、從庫使用參數slave_net_timeout

2、從庫設定MASTER_HEARTBEAT_PERIOD值

3、使用MASTER_HEARTBEAT_PERIOD值

4、DUMP線程使用MASTER_HEARTBEAT_PERIOD發送心跳Event

5、重連會殺掉可能的存在的DUMP線程

6、關于DUMP線程流程圖

繼續閱讀

2022秋招面試總結（cpp+java+測開）百度測開一面位元組後端一面蝦皮後端一面蝦皮後端二面

資料庫之DDL操作資料庫DDL操作資料庫DDL操作資料表

資料庫之DQL操作資料庫

mysql優化（sql優化）

資料遷移方法資料遷移原則資料遷移之雙寫方案資料遷移之級聯同步方案

redis叢集資料一緻性_RedisRaft為Redis叢集帶來強大的資料一緻性

寶塔面闆mysql恢複2018.1.8更新

Centos7 MySQL 5.7 安裝MySQL 5.7 安裝

查找入職員工時間排名倒數第三的員工所有資訊

Hibernate使用Hibernate的“3個準備，7個步驟”Hibernate API簡介操作實體對象對象識别

雲計算面試題——mysql/存儲引擎/備份

SQL語言基礎：常用的資料查詢語句

Ubuntu16.04安裝Apache+MySQL+PHP1. 安裝Apache2. 安裝MySQL3. 安裝PHP4. 安裝phpMyAdmin

MySQL的4種隔離級别？出現問題

neo4j之cypher使用文檔

mysql使用source指令導入.sql檔案