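Earlier articles covered PostgreSQL high availability and switchover with repmgr. This article looks at failover with repmgr and automatic failover with repmgrd.
It assumes that a PostgreSQL primary/standby pair is already deployed and that repmgr is set up on both nodes.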
[postgres@node1 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 3        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 3        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2
failover
Stop the primary to simulate a primary failure:
[postgres@node1 ~]$ pg_ctl stop -D /pgdata/
waiting for server to shut down..... done
server stopped
Seen from the standby, the primary is now reported as unreachable:
[postgres@node2 .ssh]$ repmgr cluster show
 ID | Name  | Role    | Status        | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+---------------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | ? unreachable |          | default  | 100      | ?        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running     | ? node1  | default  | 100      | 3        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2
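Reading the status column by eye works for two nodes, but the same check can be scripted. The following is a minimal sketch, not part of repmgr: unhealthy_nodes is a hypothetical helper name, and it assumes the table layout printed above (node name in the second column, status in the fourth).

```shell
# Print "name: status" for every node that "repmgr cluster show" does not
# report as running. The table is read from standard input.
unhealthy_nodes() {
  awk -F'|' '
    # Data rows are the ones whose first column is a numeric node ID.
    $1 ~ /^ *[0-9]+ *$/ {
      name = $2; status = $4
      gsub(/^ +| +$/, "", name)
      gsub(/^ +| +$/, "", status)
      # "* running" (primary) and "running" (standby) count as healthy.
      if (status !~ /^(\* )?running$/) print name ": " status
    }'
}

# Usage: repmgr cluster show | unhealthy_nodes
```

Header, separator and WARNING lines are skipped because their first column is not a number.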
Promote the standby to primary:
[postgres@node2 ~]$ repmgr standby promote
NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using "pg_ctl -w -D '/pgdata' promote"
waiting for server to promote.... done
server promoted
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary
Check the cluster status on the new primary:
[postgres@node2 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | - failed  |          | default  | 100      | ?        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 4        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "node1" (ID: 1)
On the former primary, run a rejoin to bring it back into the cluster, first as a dry run and then for real:
[postgres@node1 pgdata]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose --dry-run
[postgres@node1 pgdata]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
INFO: prerequisites for using pg_rewind are met
INFO: 2 files copied to "/tmp/repmgr-config-archive-node1"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "pg_rewind -D '/pgdata' --source-server='host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2'"
NOTICE: 2 files copied to /pgdata
INFO: directory "/tmp/repmgr-config-archive-node1" deleted
INFO: deleting "recovery.done"
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "pg_ctl -w -D '/pgdata' start"
INFO: demoted primary is pingable
INFO: node 1 has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
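The dry run followed by the real rejoin can be wrapped so that the second command only runs when the first succeeds. This is a rough sketch using the same flags as the commands above; rejoin_after_dry_run and the REPMGR_BIN variable are hypothetical names introduced here, not part of repmgr.

```shell
# Rejoin this node to the cluster, but only if the dry run passes.
# $1 is the conninfo string of the current primary.
# REPMGR_BIN (hypothetical) overrides the repmgr binary, e.g. for testing.
rejoin_after_dry_run() {
  upstream_conninfo=$1
  repmgr_bin=${REPMGR_BIN:-repmgr}

  # Rehearse first: --dry-run checks the prerequisites without changing anything.
  if ! "$repmgr_bin" node rejoin -d "$upstream_conninfo" --force-rewind \
        --config-files=postgresql.conf,postgresql.auto.conf --dry-run; then
    echo "dry run failed; not rejoining" >&2
    return 1
  fi

  # The rehearsal passed, so run the real rejoin.
  "$repmgr_bin" node rejoin -d "$upstream_conninfo" --force-rewind \
      --config-files=postgresql.conf,postgresql.auto.conf --verbose
}

# Usage: rejoin_after_dry_run 'host=192.168.1.2 dbname=repmgr user=repmgr'
```

Stopping after a failed dry run matters here because --force-rewind lets pg_rewind overwrite files in the data directory.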
Check the cluster status:
[postgres@node1 pgdata]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 3        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 4        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2
auto failover
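The repmgrd daemon can perform failover automatically. First, the location parameter in repmgr.conf must be the same on every node; if it is not set at all, the default is the same everywhere too. In addition, repmgrd can only be started if shared_preload_libraries='repmgr' is set in postgresql.conf.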
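Before starting repmgrd it is worth confirming that the library really is preloaded on the running server. A small sketch of such a check follows; preload_has_repmgr is a hypothetical helper name, and the psql call in the usage comment assumes a local connection as the postgres user.

```shell
# Succeed when the comma-separated library list in $1 contains "repmgr".
preload_has_repmgr() {
  # Strip spaces and quotes, then wrap the list in commas so that only a
  # whole entry matches (for example, not "repmgr_extra").
  list=$(printf '%s' "$1" | tr -d " '\"")
  case ",$list," in
    *,repmgr,*) return 0 ;;
    *)          return 1 ;;
  esac
}

# Usage on a running server (psql -At prints the bare setting value):
#   preload_has_repmgr "$(psql -Atc 'SHOW shared_preload_libraries')" \
#     || echo "add repmgr to shared_preload_libraries and restart" >&2
```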
Edit repmgr.conf on both the primary and the standby:
failover=automatic
promote_command='/pgsql/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='/pgsql/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
log_file=/home/postgres/repmgrd.log
# enable monitoring history
monitoring_history=true
# interval, in seconds, at which monitoring data is written
monitor_interval_secs=5
# number of attempts to reconnect to the primary before failing over (default 6)
reconnect_attempts=10
# seconds between reconnection attempts
reconnect_interval=5
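With these settings repmgrd retries the primary 10 times at 5-second intervals, so roughly 10 × 5 = 50 seconds pass before it starts a failover. The sketch below reads that figure out of a repmgr.conf file. failover_delay_secs is a hypothetical helper, the fallback values 6 and 10 are assumed to be repmgr's defaults, and the result is only an estimate because it ignores connection timeouts and the monitoring interval.

```shell
# Print reconnect_attempts * reconnect_interval for the repmgr.conf given as $1.
failover_delay_secs() {
  awk -F'=' '
    BEGIN { attempts = 6; interval = 10 }   # assumed repmgr defaults
    /^[ \t]*#/ { next }                     # skip comment lines
    {
      key = $1; val = $2
      gsub(/[ \t]/, "", key)
      sub(/#.*/, "", val)                   # drop any trailing comment
      gsub(/[^0-9]/, "", val)               # keep the digits only
      if (val == "") next
      if (key == "reconnect_attempts") attempts = val
      if (key == "reconnect_interval") interval = val
    }
    END { print attempts * interval }' "$1"
}

# Usage: failover_delay_secs /etc/repmgr.conf
```

A larger product tolerates short network glitches at the cost of a longer outage when the primary is really gone.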
Restart PostgreSQL on both the primary and the standby so that the changes take effect:
[postgres@node1 ~]$ repmgr node service --action=restart
DETAIL: executing server command "pg_ctl -w -D '/pgdata' restart"
Start repmgrd on both the primary and the standby:
[postgres@node1 ~]$ repmgrd -f /etc/repmgr.conf --pid-file /tmp/repmgrd.pid
[2019-09-20 11:51:23] [NOTICE] redirecting logging output to "/home/postgres/repmgrd.log"
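Since repmgrd goes to the background and logs to a file, a quick liveness check against the PID file can confirm that it stayed up. This is a minimal sketch; repmgrd_running is a hypothetical helper, and the default path simply mirrors the --pid-file value used above.

```shell
# Succeed when the PID recorded in the PID file belongs to a live process.
repmgrd_running() {
  pid_file=${1:-/tmp/repmgrd.pid}
  [ -r "$pid_file" ] || return 1
  pid=$(cat "$pid_file")
  # Reject empty or non-numeric content.
  case "$pid" in ''|*[!0-9]*) return 1 ;; esac
  # kill -0 sends no signal; it only tests that the process exists.
  # It does not prove that the process is repmgrd itself.
  kill -0 "$pid" 2>/dev/null
}

# Usage: repmgrd_running /tmp/repmgrd.pid || echo "repmgrd is not running" >&2
```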
Simulate a primary failure:
[postgres@node1 ~]$ pg_ctl stop -D /pgdata/
waiting for server to shut down..... done
server stopped
The repmgrd log on the standby shows that it has been promoted to primary:
[2019-09-20 12:02:52] [NOTICE] promoting standby to primary
[2019-09-20 12:02:52] [DETAIL] promoting server "node2" (ID: 2) using "pg_ctl -w -D '/pgdata' promote"
[2019-09-20 12:02:52] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2019-09-20 12:02:52] [NOTICE] STANDBY PROMOTE successful
[2019-09-20 12:02:52] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2019-09-20 12:02:52] [INFO] 0 followers to notify
[2019-09-20 12:02:52] [INFO] switching to primary monitoring mode
[2019-09-20 12:02:52] [NOTICE] monitoring cluster primary "node2" (ID: 2)
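The promotion can also be picked out of the log without reading it line by line. The sketch below is a hypothetical helper, promoted_nodes, that assumes the DETAIL message wording shown in the log above and prints the name of each node the log reports as promoted.

```shell
# Print the name of every node that the repmgrd log given as $1 reports as
# promoted, based on lines such as:
#   [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
promoted_nodes() {
  sed -n 's/.*server "\([^"]*\)" (ID: [0-9]*) was successfully promoted to primary.*/\1/p' "$1"
}

# Usage: promoted_nodes /home/postgres/repmgrd.log
```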
The cluster status confirms that the standby is now the primary:
[postgres@node2 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | - failed  |          | default  | 100      | ?        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 5        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "node1" (ID: 1)
On the former primary, run a rejoin to add it back to the cluster:
[postgres@node1 ~]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose --dry-run
[postgres@node1 ~]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
INFO: prerequisites for using pg_rewind are met
INFO: 2 files copied to "/tmp/repmgr-config-archive-node1"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "pg_rewind -D '/pgdata' --source-server='host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2'"
NOTICE: 2 files copied to /pgdata
INFO: directory "/tmp/repmgr-config-archive-node1" deleted
INFO: deleting "recovery.done"
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "pg_ctl -w -D '/pgdata' start"
INFO: demoted primary is pingable
INFO: node 1 has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
[postgres@node1 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 5        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 6        |