最近PXC 5.7出現腦裂,前端Navicate連接配接到MySQL時,提示WSREP has not yet prepared node for application use錯誤。腦裂可是個不容易忽視的問題啊,嚴重的導緻資料錯亂。你懂的,就好比酒駕,迷糊啊,都亂了。下文是本次故障的具體描述。
一、環境說明
目前的叢集采用了2個資料節點+1個仲裁節點,兩節點均有寫入資料
重新開機了仲裁節點後,資料節點發生腦裂
作業系統CentOS 7
資料庫版本Percona XtraDB Cluster 5.7.20-18-57-log
故障現象
Navicate用戶端可以連接配接,但是查詢資料報錯:WSREP has not yet prepared node for application use
2節點狀态檢視都顯示為non-Primary
mysql> show status like '%wsrep_cluster%';
+--------------------------+--------------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------------+
| wsrep_cluster_conf_id | 6 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | cd96b06a-0a1d-11e8-99d2-837e6f3b95a9 |
| wsrep_cluster_status | non-Primary |
+--------------------------+--------------------------------------+
二、故障分析
節點1日志
2018-05-29T11:36:56.595913+08:00 0 [Note] WSREP: (45125952, 'tcp://0.0.0.0:4567') connection to peer 17b576f8 with addr tcp://192.168.1.253:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2018-05-29T11:36:56.602911+08:00 0 [Note] WSREP: (45125952, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.1.253:4567
2018-05-29T11:37:05.600073+08:00 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,029e6b40,17) ##此處已提示為非主節點
memb {
45125952,0
}
joined {
}
left {
}
partitioned {
029e6b40,0
17b576f8,0
}
)
2018-05-29T11:37:05.631952+08:00 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,45125952,18)
memb {
45125952,0
}
joined {
}
left {
}
partitioned {
029e6b40,0
17b576f8,0
}
)
2018-05-29T11:37:05.634073+08:00 0 [Note] WSREP: declaring 029e6b40 at tcp://192.168.1.249:4567 stable
2018-05-29T11:37:05.634674+08:00 0 [Note] WSREP: Current view of cluster as seen by this node
view (view_id(NON_PRIM,029e6b40,19)
memb {
029e6b40,0
45125952,0
}
joined {
}
left {
}
partitioned {
17b576f8,0
}
)
2018-05-29T11:37:05.663943+08:00 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2018-05-29T11:37:05.670701+08:00 151042 [Warning] WSREP: Send action {(nil), 563, TORDERED} returned -107 (Transport endpoint is not connected)
2018-05-29T11:37:05.670763+08:00 151042 [Note] WSREP: --------- CONFLICT DETECTED -------- ###檢查到沖突
2018-05-29T11:37:05.670778+08:00 151042 [Note] WSREP: cluster conflict due to certification failure for threads:
節點2日志
2018-05-28T23:59:07.757523+08:00 0 [Note] WSREP: Member 2.0 (pxc-cluster-node-1) synced with group.
2018-05-29T09:47:36.679991+08:00 6095593 [Note] Aborted connection 6095593 to db: 'DB71' user: 'user71' host: '192.168.1.253' (Got an error reading communication packets)
2018-05-29T09:47:36.680062+08:00 6095607 [Note] Aborted connection 6095607 to db: 'DB71' user: 'user71' host: '192.168.1.253' (Got an error reading communication packets)
.............................................#Author : Leshami Blog : http://blog.csdn.net/leshami
2018-05-29T11:34:11.981171+08:00 6104858 [Note] Aborted connection 6104858 to db: 'DB71' user: 'user71' host: '192.168.1.253' (Got an error reading communication packets)
2018-05-29T11:36:02.203516+08:00 0 [Note] WSREP: (029e6b40, 'tcp://0.0.0.0:4567') connection to peer 17b576f8 with addr tcp://192.168.1.253:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2018-05-29T11:36:02.203708+08:00 0 [Note] WSREP: (029e6b40, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.1.253:4567
2018-05-29T11:36:03.203906+08:00 0 [Note] WSREP: (029e6b40, 'tcp://0.0.0.0:4567') reconnecting to 17b576f8 (tcp://192.168.1.253:4567), attempt 0
2018-05-29T11:36:03.704413+08:00 0 [Note] WSREP: declaring node with index 1 suspected, timeout PT5S (evs.suspect_timeout)
2018-05-29T11:36:03.704482+08:00 0 [Note] WSREP: evs::proto(029e6b40, OPERATIONAL, view_id(REG,029e6b40,17)) suspecting node: 17b576f8
2018-05-29T11:36:03.704507+08:00 0 [Note] WSREP: evs::proto(029e6b40, OPERATIONAL, view_id(REG,029e6b40,17)) suspected node without join message, declaring inactive
2018-05-29T11:36:04.204630+08:00 0 [Note] WSREP: declaring node with index 1 inactive (evs.inactive_timeout)
2018-05-29T11:36:06.204943+08:00 0 [Note] WSREP: (029e6b40, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://192.168.1.253:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2018-05-29T11:36:09.206069+08:00 0 [Note] WSREP: declaring node with index 2 suspected, timeout PT5S (evs.suspect_timeout)
2018-05-29T11:36:09.206168+08:00 0 [Note] WSREP: evs::proto(029e6b40, GATHER, view_id(REG,029e6b40,17)) suspecting node: 45125952
2018-05-29T11:36:09.706352+08:00 0 [Note] WSREP: evs::proto(029e6b40, GATHER, view_id(REG,029e6b40,17)) suspecting node: 45125952
2018-05-29T11:36:10.206318+08:00 0 [Note] WSREP: (029e6b40, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://192.168.1.253:4567 timed out, no messages seen in PT3S (gmcast.peer_timeout)
2018-05-29T11:36:10.206514+08:00 0 [Note] WSREP: evs::proto(029e6b40, GATHER, view_id(REG,029e6b40,17)) suspecting node: 45125952
2018-05-29T11:36:10.706763+08:00 0 [Note] WSREP: evs::proto(029e6b40, GATHER, view_id(REG,029e6b40,17)) suspecting node: 45125952
2018-05-29T11:36:11.204659+08:00 0 [Warning] WSREP: evs::proto(029e6b40, GATHER, view_id(REG,029e6b40,17)) install timer expired
evs::proto(evs::proto(029e6b40, GATHER, view_id(REG,029e6b40,17)), GATHER) {
current_view=Current view of cluster as seen by this node
view (view_id(REG,029e6b40,17)
memb {
029e6b40,0
17b576f8,0
45125952,0
}
joined {
}
left {
}
partitioned {
}
),
input_map=evs::input_map: {aru_seq=5278704,safe_seq=5278704,node_index=node: {idx=0,range=[5278710,5278709],safe_seq=5278704} node: {idx=1,range=[5278705,5278704],safe_seq=5278704} node: {idx=2,range=[5278710,5278709],safe_seq=5278704} },
fifo_seq=10557624,
last_sent=5278709,
known:
三、故障解決
對于目前的腦裂,由于在測試環境,是以直接停止其中一個節點(1.248),然後将剩餘的節點強制提升為主節點
首先停止仲裁節點,然後停止1.249節點
如下在1.249節點上做如下操作
mysql> SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';
Query OK, 0 rows affected (0.00 sec)
--再次檢視此時wsrep_cluster_status為Primary
mysql> show status like 'wsrep_cluster%';
+--------------------------+--------------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------------+
| wsrep_cluster_conf_id | 4 |
| wsrep_cluster_size | 1 |
| wsrep_cluster_state_uuid | cd96b06a-0a1d-11e8-99d2-837e6f3b95a9 |
| wsrep_cluster_status | Primary |
+--------------------------+--------------------------------------+
4 rows in set (0.01 sec)
在第一個節點248上啟動
[root@node248 ~]# systemctl start mysql
--叢集狀态檢視,此時也已正常
mysql> show status like 'wsrep_cluster%';
+--------------------------+--------------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------------+
| wsrep_cluster_conf_id | 5 |
| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | cd96b06a-0a1d-11e8-99d2-837e6f3b95a9 |
| wsrep_cluster_status | Primary |
+--------------------------+--------------------------------------+
在第三個節點啟動仲裁節點
[root@ydq4 ~]# /etc/init.d/garb start
正在啟動 /usr/bin/garbd: [确定]
mysql> show status like 'wsrep_cluster%';
+--------------------------+--------------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------------+
| wsrep_cluster_conf_id | 6 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | cd96b06a-0a1d-11e8-99d2-837e6f3b95a9 |
| wsrep_cluster_status | Primary |
+--------------------------+--------------------------------------+
四、後記
1、此次腦裂時應用程式端已出現節點無法連接配接,并且有日志記錄到mysql error log
2、在資料節點已出現異常時,冒然重新開機仲裁節點不可取
3、建議使用3個資料節點最佳
4、多節點資料寫入,最好按子產品将寫劃分到不同節點
5、節點修複參考:
http://galeracluster.com/documentation-webpages/twonode.html上傳日志: