
Solution: DM relay processing unit error

Author: 代曉磊

A DM problem that involves both the relay and the syncer: when an upstream MySQL binlog file grows past 4 GB, both the relay and the syncer units have to be handled. The problem originates upstream and blocks relay replication, so fix relay replication first and then fix the syncer.

Symptoms:

(1) Received an email alert.

Log in and check DM's error status:

query-status mba_app_repl_online1

The syncer unit reports an error:

{
    "Type": "UnknownError",
    "msg": "[code=10001:class=database:scope=downstream:level=high] database driver error: bad connection\ngithub.com/pingcap/dm/pkg/terror.(*Error).Delegate\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/terror/terror.go:267\ngithub.com/pingcap/dm/pkg/terror.DBErrorAdaptArgs\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/terror/adapter.go:39\ngithub.com/pingcap/dm/pkg/terror.DBErrorAdapt\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/terror/adapter.go:46\ngithub.com/pingcap/dm/pkg/conn.(*BaseDB).GetBaseConn\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/conn/basedb.go:92\ngithub.com/pingcap/dm/syncer.createConns.func1\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/syncer/db.go:299\ngithub.com/pingcap/dm/syncer.(*DBConn).resetConn\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/syncer/db.go:157\ngithub.com/pingcap/dm/syncer.(*Syncer).resetDBs\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/syncer/syncer.go:513\ngithub.com/pingcap/dm/syncer.(*Syncer).Resume\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/syncer/syncer.go:2277\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337",
    "error": null
}

The relay unit (binlog pulling) also reports an error: handle a potential duplicate event

{
    "Type": "UnknownError",
    "msg": "[code=30029:class=relay-unit:scope=internal:level=high] handle a potential duplicate event \u0026{Timestamp:1583906704 EventType:DeleteRowsEventV2 ServerID:2084430 EventSize:8132 LogPos:11943 Flags:0} in 4723-binlog.002270: check event \u0026{Timestamp:1583906704 EventType:DeleteRowsEventV2 ServerID:2084430 EventSize:8132 LogPos:11943 Flags:0} whether duplicate in /data/dm/relay_log/3793c524-c319-11e8-95ba-6c92bf2f9975.000001/4723-binlog.002270: event from 3811 in /data/dm/relay_log/3793c524-c319-11e8-95ba-6c92bf2f9975.000001/4723-binlog.002270 diff from passed-in event \u0026{Timestamp:1583906704 EventType:DeleteRowsEventV2 ServerID:2084430 EventSize:8132 LogPos:11943 Flags:0}\ngithub.com/pingcap/dm/pkg/terror.(*Error).Generatef\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/pkg/terror/terror.go:237\ngithub.com/pingcap/dm/relay/writer.checkIsDuplicateEvent\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/writer/file_util.go:157\ngithub.com/pingcap/dm/relay/writer.(*FileWriter).handleDuplicateEventsExist\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/writer/file.go:369\ngithub.com/pingcap/dm/relay/writer.(*FileWriter).handlePotentialHoleOrDuplicate\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/writer/file.go:312\ngithub.com/pingcap/dm/relay/writer.(*FileWriter).handleEventDefault\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/writer/file.go:287\ngithub.com/pingcap/dm/relay/writer.(*FileWriter).WriteEvent\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/writer/file.go:130\ngithub.com/pingcap/dm/relay.(*Relay).handleEvents\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/relay.go:434\ngithub.com/pingcap/dm/relay.(*Relay).process\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/relay.go:292\ngithub.com/pingcap/dm/relay.(*Relay).Process\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/relay/relay.go:191\ngithub.com/pingcap/dm/dm/worker.(*realRelayHolder).run\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/relay.go:164\ngithub.com/pingcap/dm/dm/worker.(*realRelayHolder).Start.func1\n\t/home/jenkins/agent/workspace/build_dm_master/go/src/github.com/pingcap/dm/dm/worker/relay.go:140\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337"
}

(2) Investigation shows that the upstream MySQL has a large binlog file, more than 4 GB in size. In DM's relay-log directory the pulled binlog also stops right at 4 GB; the newest binlog is no longer being pulled, and both SQL replay and binlog pulling have stopped. The upstream MySQL binlog is in fact over 6 GB.

Take the 4723-binlog.002270 file named in the error and check its size on the upstream MySQL:

show binary logs;

+--------------------+------------+
| Log_name           | File_size  |
+--------------------+------------+
| 4723-binlog.002268 | 1074943453 |
| 4723-binlog.002269 | 1074017204 |
| 4723-binlog.002270 | 6977972415 |
| 4723-binlog.002271 | 1081537286 |
| 4723-binlog.002272 | 1073806137 |
+--------------------+------------+

(3) Fixing the problem:

The official solution is documented in the DM troubleshooting guide:

https://pingcap.com/docs-cn/stable/reference/tools/data-migration/troubleshoot/error-handling/

First, fix relay log replication.

While DM is pulling relay logs and performing incremental replication, these two errors can appear if it encounters an upstream binlog file larger than 4 GB.

The reason is that when DM writes the relay log, it validates each event against the binlog position and the file size, and it also stores the replicated binlog position as a checkpoint. However, MySQL officially defines the binlog position as a uint32, so the offset of any position beyond 4 GB overflows, which produces the errors above.
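To make the overflow concrete, here is a minimal Go sketch (an illustration only, not DM's actual code); 6977972415 is the size of 4723-binlog.002270 from the show binary logs output above:

package main

import "fmt"

func main() {
	// Real offset inside the ~6.5 GB binlog file 4723-binlog.002270
	// (size taken from the `show binary logs` output above).
	realOffset := uint64(6977972415)

	// MySQL binlog positions are defined as uint32, so any offset past
	// 4 GB (2^32 - 1 = 4294967295) wraps around when stored.
	stored := uint32(realOffset)

	fmt.Println("real offset  :", realOffset) // 6977972415
	fmt.Println("stored uint32:", stored)     // 2683005119 (wrapped, no longer matches the file)
}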

For the relay processing unit, you can recover manually with the following steps:

1. Confirm on the upstream that the binlog file involved in the error is larger than 4 GB.

2. Stop the DM-worker.

ansible-playbook stop.yml --tags=dm-worker -l dm-worker1

Note: be especially careful here. Do not assume that stop-task stops the worker; the worker process itself must be stopped, and it only counts once ps -ef | grep 'dm-worker' shows no process on the host.

3、将上遊對應的 binlog 檔案複制到 relay log 目錄作為 relay log 檔案 (替掉源檔案時記得備份) 。

4. Update the relay.meta file in the relay log directory so that pulling resumes from the next binlog (back it up before editing); a sketch of the resulting file is shown after these steps.

For example: if the error reports binlog-name = "mysql-bin.004451" and binlog-pos = 2453, update them to binlog-name = "mysql-bin.004452" and binlog-pos = 4.

Note: if DM is configured with GTID, then in addition to binlog-name and binlog-pos you must also update binlog-gtid. The upstream MySQL binlog GTID can be obtained like this:

MySQL [(none)]> show binlog events in '4723-binlog.002271' from 4 limit 4;

+--------------------+-----+----------------+-----------+-------------+----------------------------------------------------------------------------+
| Log_name           | Pos | Event_type     | Server_id | End_log_pos | Info                                                                       |
+--------------------+-----+----------------+-----------+-------------+----------------------------------------------------------------------------+
| 4723-binlog.002271 |   4 | Format_desc    |  20849203 |         120 | Server ver: 5.6.36-log, Binlog ver: 4                                      |
| 4723-binlog.002271 | 120 | Previous_gtids |  20849203 |         191 | f0e914ef-54cf-11e7-813d-6c92bf2fa791:1-138218058                           |
| 4723-binlog.002271 | 191 | Gtid           |   2084430 |         239 | SET @@SESSION.GTID_NEXT= 'f0e914ef-54cf-11e7-813d-6c92bf2fa791:138218059'  |
| 4723-binlog.002271 | 239 | Query          |   2084430 |         302 | BEGIN                                                                      |
+--------------------+-----+----------------+-----------+-------------+----------------------------------------------------------------------------+
4 rows in set (0.03 sec)

5. Restart the DM-worker.

ansible-playbook start.yml --tags=dm-worker -l dm-worker1
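For reference, here is what the updated relay.meta from step 4 could look like in this case. Treat it as a sketch under assumptions: the key/value layout and the location of relay.meta inside the relay UUID subdirectory follow common DM deployments, and the binlog-gtid value is inferred from the Previous_gtids event of 4723-binlog.002271 shown above.

# /data/dm/relay_log/3793c524-c319-11e8-95ba-6c92bf2f9975.000001/relay.meta (path assumed; back up the original before editing)
binlog-name = "4723-binlog.002271"
binlog-pos = 4
binlog-gtid = "f0e914ef-54cf-11e7-813d-6c92bf2fa791:1-138218058"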

(4) Solving the binlog pulling problem is not enough; the syncer still fails to run.

For the binlog replication processing unit, you can recover manually with the following steps:

1. Confirm on the upstream that the binlog file involved in the error is larger than 4 GB.

2. Stop the replication task with stop-task.

3、将下遊 dm_meta 資料庫中 global checkpoint 與每個 table 的 checkpoint 中的 binlog_name 更新為出錯的 binlog 檔案,将 binlog_pos 更新為已同步過的一個合法的 position 值,比如 4。

For example: if the failing task is named dm_test, the corresponding source-id is replica-1, and the binlog file at the time of the error is mysql-bin|000001.004451, then execute: UPDATE dm_test_syncer_checkpoint SET binlog_name='mysql-bin|000001.004451', binlog_pos = 4 WHERE id='replica-1';

Note: the exact values can be found by logging into the dm_meta database and querying the corresponding syncer checkpoint table (a query sketch follows below). In this case the statement actually executed was:

update mba_app_repl_online1_syncer_checkpoint set binlog_name='4723-binlog|000001.002270', binlog_pos=978592437;
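To inspect the current checkpoint rows before running the UPDATE, a query along these lines can be used. This is a hedged sketch: the column names cp_schema, cp_table and is_global are assumed from a typical DM dm_meta checkpoint schema and may differ between DM versions, so confirm them with DESC first.

-- Columns assumed from a typical DM *_syncer_checkpoint schema; verify with DESC before relying on them.
SELECT id, cp_schema, cp_table, binlog_name, binlog_pos, is_global
FROM dm_meta.mba_app_repl_online1_syncer_checkpoint
ORDER BY is_global DESC, cp_schema, cp_table;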