MySQL: How the slave_skip_errors Parameter Affects MGR Availability

The problem was raised and tested by @gc, @甘露寺的姑子 and @乙酉; the write-up was done by @gc and @乙酉.

I only did the problem analysis and organized the document.


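1. Case Description

When MGR hit a missing table, the node did not leave the group; it only logged a warning, and its member state stayed normal. The warning was: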

2019-10-17T21:16:11.564211+08:00 10 [Warning] Slave SQL for channel
 'group_replication_applier': Worker 1 failed executing transaction 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:8' at master log , 
end_log_pos 220; Error executing row event: 'Table 'test.a_1' doesn't exist', Error_code: 1146

The cluster state was:

[[email protected]][test]>select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 9fd479bb-f0d8-11e9-9381-000c29105312 | mysql_1     |        3306 | ONLINE       |
| group_replication_applier | a8833a96-f0d8-11e9-a9f4-000c291fd9a5 | mysql_2     |        3306 | ONLINE       |
| group_replication_applier | b2968fe2-f0d8-11e9-a8ff-000c29c89e42 | mysql_3     |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

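This seemed strange at the time. Even in ordinary master-slave replication, this error makes the SQL thread stop with an error, yet here the MGR member stayed online. The data was already out of sync, so the node should have reported an error and been expelled from the group.

2. Problem Analysis

Some interested colleagues tested this right away, and their result differed from the above: they got an error rather than a warning, as follows: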

2019-10-17T09:16:34.317542Z 84 [ERROR] Slave SQL for channel
 'group_replication_applier': Error executing row event:
 'Table 'test.emp1' doesn't exist', Error_code: 1146

In that test, the nodes missing the table were removed from the group. The node state in this normal case is shown below:

Secondary 1:
[[email protected]][test]>select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | a8833a96-f0d8-11e9-a9f4-000c291fd9a5 | mysql_2     |        3306 | ERROR        |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
1 row in set (0.00 sec)
 
Secondary 2:
[[email protected]][test]>select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | b2968fe2-f0d8-11e9-a8ff-000c29c89e42 | mysql_3     |        3306 | ERROR        |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
1 row in set (0.00 sec)

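So the question is why, with MGR in both cases, one test produces a warning and the other an error, and why the former even stays in an apparently normal replicating state. As the title suggests, this is related to the slave_skip_errors parameter.

3. Reproducing the Issue

We know that in a master-slave setup, a missing table on the slave always causes an error unless slave_skip_errors is set. I have never set this parameter in production, and this case shows that it also affects MGR. The test procedure is as follows:

We set slave-skip-errors=ddl_exist_errors on all three nodes.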

Then we built a three-node MGR cluster in single-primary mode.

The cluster came up normally.

Then we ran the following on the primary:

[[email protected]][(none)]>set sql_log_bin=0;
Query OK, 0 rows affected (0.00 sec)

[[email protected]][(none)]>create table test.a_1(id bigint auto_increment primary key,name varchar(20));
Query OK, 0 rows affected (0.01 sec)

[[email protected]][(none)]>set sql_log_bin=1;
Query OK, 0 rows affected (0.00 sec)

At this point the primary has table a_1, but because binary logging was disabled for that session, the two secondaries do not have it.

Then we insert a row:

[[email protected]][test]>insert into test.a_1 values(null,'tom');
Query OK, 1 row affected (0.02 sec)

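The primary has table a_1, so the insert succeeds there. The two secondaries do not have the table, so applying the insert fails on them and the data becomes inconsistent. Normally this inconsistency should cause both secondaries to be expelled from the group, but in fact all three nodes remain ONLINE and the cluster does not fail.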

[[email protected]][test]>select * from test.a_1;
+----+------+
| id | name |
+----+------+
|  1 | tom  |
+----+------+
1 row in set (0.00 sec)

[[email protected]][test]>select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 9fd479bb-f0d8-11e9-9381-000c29105312 | mysql_1     |        3306 | ONLINE       |
| group_replication_applier | a8833a96-f0d8-11e9-a9f4-000c291fd9a5 | mysql_2     |        3306 | ONLINE       |
| group_replication_applier | b2968fe2-f0d8-11e9-a8ff-000c29c89e42 | mysql_3     |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

Reading test.a_1 on the two secondaries at this point shows that the table does not exist.

Secondary 1:

[[email protected]][test]>select * from test.a_1;
ERROR 1146 (42S02): Table 'test.a_1' doesn't exist
[[email protected]][test]>

Secondary 2:

[[email protected]][test]>select * from test.a_1;
ERROR 1146 (42S02): Table 'test.a_1' doesn't exist

Error log output (with set global log_error_verbosity = 3;):

2019-10-17T21:16:11.564211+08:00 10 [Warning] Slave SQL for channel
 'group_replication_applier': Worker 1 failed executing transaction 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:8' at master log , 
end_log_pos 220; Error executing row event: 'Table 'test.a_1' doesn't exist', Error_code: 1146

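4. Where slave_skip_errors Takes Effect in the Source Code

This setting takes effect in the Rows_log_event::do_apply_event function, that is, when a DML event starts to be applied. This function is called by the regular SQL thread (or a Worker thread).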

#ifdef HAVE_REPLICATION
  if (opt_slave_skip_errors)
    add_slave_skip_errors(opt_slave_skip_errors);
#endif

    if (open_and_lock_tables(thd, rli->tables_to_lock, 0)) // open the tables
    {
      uint actual_error= thd->get_stmt_da()->mysql_errno();
      if (thd->is_slave_error || thd->is_fatal_error)
      {
        // Controlled by slave_skip_errors: ignored_error_code() checks the
        // error against the codes configured through that parameter.
        if (ignored_error_code(actual_error))
        {
          // Skippable error: at most a warning is logged, then the error
          // state is cleared and the event is treated as applied.
          if (log_warnings > 1)
            rli->report(WARNING_LEVEL, actual_error,
                        "Error executing row event: '%s'",
                        (actual_error ? thd->get_stmt_da()->message_text() :
                         "unexpected success or fatal error"));
          thd->get_stmt_da()->reset_condition_info(thd);
          clear_all_errors(thd, const_cast<Relay_log_info*>(rli));
          error= 0;
          goto end;
        }
        else
        {
          // Any other error is reported at ERROR level and returned.
          rli->report(ERROR_LEVEL, actual_error,
                      "Error executing row event: '%s'",
                      (actual_error ? thd->get_stmt_da()->message_text() :
                       "unexpected success or fatal error"));
          thd->is_slave_error= 1;
          const_cast<Relay_log_info*>(rli)->slave_close_thread_tables(thd);
          DBUG_RETURN(actual_error);
        }
      }

As we can see, MGR's apply logic is affected by this parameter.
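The branch in the excerpt above can be modelled in isolation. The following is a minimal sketch and not the server's implementation: `SkipErrorMask`, `classify_apply_failure`, `ddl_exist_errors_subset` and the bound `kMaxErrorCode` are hypothetical names introduced only for illustration. The code list is an illustrative subset of what `ddl_exist_errors` covers as far as I understand it; 1146 is the code observed in this case.

```cpp
#include <bitset>
#include <cassert>
#include <initializer_list>

// Hypothetical upper bound on the error codes this sketch tracks.
constexpr unsigned kMaxErrorCode = 5000;

// Hypothetical stand-in for the set of error codes that
// slave_skip_errors tells the applier to ignore.
class SkipErrorMask {
 public:
  // Mark each listed error code as skippable.
  void add(std::initializer_list<unsigned> codes) {
    for (unsigned code : codes) {
      assert(code < kMaxErrorCode);
      mask_.set(code);
    }
  }

  // True when the code was registered as skippable.
  bool ignored(unsigned code) const {
    return code < kMaxErrorCode && mask_.test(code);
  }

 private:
  std::bitset<kMaxErrorCode> mask_;
};

enum class ReportLevel { Warning, Error };

// Mirrors the if/else in the excerpt: a skippable error is reported as a
// warning at most and the event counts as applied; any other error is
// reported at error level and stops the applier.
inline ReportLevel classify_apply_failure(const SkipErrorMask& mask,
                                          unsigned actual_error) {
  return mask.ignored(actual_error) ? ReportLevel::Warning
                                    : ReportLevel::Error;
}

// Illustrative subset of the codes behind ddl_exist_errors (assumption);
// 1146 is "Table doesn't exist", the error seen on the secondaries.
inline SkipErrorMask ddl_exist_errors_subset() {
  SkipErrorMask mask;
  mask.add({1007, 1008, 1050, 1051, 1146});
  return mask;
}
```

With the subset mask, error 1146 is classified as a warning, which matches the first test where the secondaries stayed ONLINE. With an empty mask the same error is classified as an error, which matches the second test where the members went to ERROR state.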
