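Implementing PostgreSQL failover and automatic failover with repmgrd

Earlier articles covered PostgreSQL high availability and switchover based on repmgr. This article looks at failover: first done by hand with repmgr, then automatically with repmgrd.

The prerequisite is a working PostgreSQL primary/standby pair with repmgr already deployed on both nodes.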

[postgres@node1 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 3        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 3        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

failover

Stop the primary to simulate a primary failure

[postgres@node1 ~]$ pg_ctl stop -D /pgdata/
waiting for server to shut down..... done
server stopped

Seen from the standby, the primary is now in the unreachable state

[postgres@node2 .ssh]$ repmgr cluster show
 ID | Name  | Role    | Status        | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+---------------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | ? unreachable |          | default  | 100      | ?        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | standby |   running     | ? node1  | default  | 100      | 3        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

Promote the standby to primary

[postgres@node2 ~]$ repmgr standby promote
NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using "pg_ctl  -w -D '/pgdata' promote"
waiting for server to promote.... done
server promoted
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary

Check the cluster status from the new primary

[postgres@node2 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | - failed  |          | default  | 100      | ?        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 4        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "node1" (ID: 1)
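When this check is repeated often, it can help to reduce the table to just the nodes that need attention. The following is a minimal sketch, assuming the column layout shown above (node ID first, name second, status fourth); unhealthy_nodes is a hypothetical helper name, not something repmgr provides.

```shell
# Sketch: read `repmgr cluster show` output on stdin and print every node
# whose Status column does not end in "running".
# unhealthy_nodes is a hypothetical helper name, not part of repmgr.
unhealthy_nodes() {
    awk -F'|' '
        # data rows start with a numeric node ID; header and separator rows do not
        $1 ~ /^[ ]*[0-9]+[ ]*$/ && NF >= 4 {
            name = $2; status = $4
            gsub(/^[ ]+|[ ]+$/, "", name)
            gsub(/^[ ]+|[ ]+$/, "", status)
            if (status !~ /running$/) print name ": " status
        }
    '
}

# Usage (on any node): repmgr cluster show | unhealthy_nodes
```

For the status table above, this would print a single line, `node1: - failed`.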

Run rejoin on the old primary to bring it back into the cluster (first as a dry run, then for real)

[postgres@node1 pgdata]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose --dry-run
[postgres@node1 pgdata]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
INFO: prerequisites for using pg_rewind are met
INFO: 2 files copied to "/tmp/repmgr-config-archive-node1"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "pg_rewind -D '/pgdata' --source-server='host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2'"
NOTICE: 2 files copied to /pgdata
INFO: directory "/tmp/repmgr-config-archive-node1" deleted
INFO: deleting "recovery.done"
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "pg_ctl  -w -D '/pgdata' start"
INFO: demoted primary is pingable
INFO: node 1 has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
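The dry run followed by the real run can be wrapped so that the real rejoin only happens when the dry run succeeds. This is a sketch rather than a recommended procedure: rejoin_node and the REPMGR_BIN variable are names made up for this example, and it assumes the dry run exits non-zero when the prerequisites are not met.

```shell
# Sketch: run `repmgr node rejoin` as a dry run first and only perform the
# real rejoin if the dry run succeeds.
# rejoin_node and REPMGR_BIN are hypothetical names used for illustration.
rejoin_node() {
    # $1: conninfo string of the current primary
    repmgr_bin="${REPMGR_BIN:-repmgr}"
    if ! "$repmgr_bin" node rejoin -d "$1" --force-rewind \
            --config-files=postgresql.conf,postgresql.auto.conf --dry-run; then
        echo "dry run failed, not rejoining" >&2
        return 1
    fi
    "$repmgr_bin" node rejoin -d "$1" --force-rewind \
        --config-files=postgresql.conf,postgresql.auto.conf
}

# Usage (on the old primary, with PostgreSQL stopped):
#   rejoin_node 'host=192.168.1.2 dbname=repmgr user=repmgr'
```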

Check the cluster status

[postgres@node1 pgdata]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 3        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 4        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

auto failover

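The repmgrd daemon can carry out failover automatically. First, the location parameter in repmgr.conf must be the same on all nodes; if it is not set, every node uses the same default anyway. In addition, repmgrd can only be started if postgresql.conf sets shared_preload_libraries='repmgr'; changing that setting requires a PostgreSQL restart.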
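Before starting repmgrd it is worth confirming that the library is actually configured. A rough sketch, assuming the setting lives directly in postgresql.conf (it could also come from postgresql.auto.conf or an included file, which this does not look at); has_repmgr_preload is a hypothetical helper name. On a running server, `SHOW shared_preload_libraries;` in psql is the authoritative check.

```shell
# Sketch: succeed if the last active shared_preload_libraries line in the
# given postgresql.conf mentions repmgr. Commented-out lines are ignored.
# has_repmgr_preload is a hypothetical helper name.
has_repmgr_preload() {
    # $1: path to postgresql.conf
    grep -E '^[[:space:]]*shared_preload_libraries[[:space:]]*=' "$1" \
        | tail -n 1 \
        | grep -q 'repmgr'
}

# Usage:
#   has_repmgr_preload /pgdata/postgresql.conf || echo "repmgr is not preloaded"
```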

Edit repmgr.conf on both the primary and the standby

failover=automatic
promote_command='/pgsql/bin/repmgr standby promote -f /etc/repmgr.conf --log-to-file'
follow_command='/pgsql/bin/repmgr standby follow -f /etc/repmgr.conf --log-to-file --upstream-node-id=%n'
log_file=/home/postgres/repmgrd.log
# enable monitoring
monitoring_history=true
# interval, in seconds, at which monitoring data is written
monitor_interval_secs=5
# number of attempts to reconnect to the primary before failing over (default 6)
reconnect_attempts=10
# wait 5 seconds between reconnection attempts
reconnect_interval=5
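The two reconnect settings together determine how long the primary has to stay unreachable before repmgrd fails over: roughly reconnect_attempts × reconnect_interval, which is 10 × 5 = 50 seconds here, not counting the time each connection attempt itself takes. A small sketch that computes this from a repmgr.conf file; failover_window_secs is a hypothetical helper name, and the fallback values 6 and 10 are the repmgr defaults as I understand them, so check them against your version.

```shell
# Sketch: estimate how long the primary must be unreachable before repmgrd
# starts a failover, as reconnect_attempts * reconnect_interval.
# failover_window_secs is a hypothetical helper name.
# Assumption: unset parameters fall back to 6 attempts and 10 seconds.
failover_window_secs() {
    # $1: path to repmgr.conf
    attempts=$(sed -n 's/^[[:space:]]*reconnect_attempts[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1)
    interval=$(sed -n 's/^[[:space:]]*reconnect_interval[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1)
    echo $(( ${attempts:-6} * ${interval:-10} ))
}

# Usage: failover_window_secs /etc/repmgr.conf
```

Lower values make failover faster but also more likely to trigger on a short network interruption.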

Restart PostgreSQL on both the primary and the standby so that the changes take effect

[postgres@node1 ~]$ repmgr node service --action=restart
DETAIL: executing server command "pg_ctl  -w -D '/pgdata' restart"

Start repmgrd on both the primary and the standby

[postgres@node1 ~]$ repmgrd -f /etc/repmgr.conf --pid-file /tmp/repmgrd.pid
[2019-09-20 11:51:23] [NOTICE] redirecting logging output to "/home/postgres/repmgrd.log"
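Because repmgrd was started with --pid-file, the PID file can be used later to check that the daemon is still alive on each node. A minimal sketch; repmgrd_running is a hypothetical helper name, and note that `kill -0` only confirms that a process with that PID exists and can be signalled by the current user, not that it is repmgrd.

```shell
# Sketch: succeed if the PID recorded in the given file belongs to a live
# process. repmgrd_running is a hypothetical helper name.
repmgrd_running() {
    # $1: path to the PID file passed to repmgrd via --pid-file
    [ -r "$1" ] || return 1
    pid=$(cat "$1")
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Usage:
#   repmgrd_running /tmp/repmgrd.pid || echo "repmgrd is not running"
```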

Simulate a primary failure

[postgres@node1 ~]$ pg_ctl stop -D /pgdata/
waiting for server to shut down..... done
server stopped

The repmgrd log on the standby shows that it has been promoted to primary

[2019-09-20 12:02:52] [NOTICE] promoting standby to primary
[2019-09-20 12:02:52] [DETAIL] promoting server "node2" (ID: 2) using "pg_ctl  -w -D '/pgdata' promote"
[2019-09-20 12:02:52] [NOTICE] waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
[2019-09-20 12:02:52] [NOTICE] STANDBY PROMOTE successful
[2019-09-20 12:02:52] [DETAIL] server "node2" (ID: 2) was successfully promoted to primary
[2019-09-20 12:02:52] [INFO] 0 followers to notify
[2019-09-20 12:02:52] [INFO] switching to primary monitoring mode
[2019-09-20 12:02:52] [NOTICE] monitoring cluster primary "node2" (ID: 2)
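To find out when a promotion happened without reading the whole log, the success line can be picked out directly. A small sketch, assuming the log line format shown above; promotion_time is a hypothetical helper name.

```shell
# Sketch: print the timestamp of the most recent successful promotion
# recorded in a repmgrd log, or nothing if there is none.
# promotion_time is a hypothetical helper name.
promotion_time() {
    # $1: path to the repmgrd log file
    sed -n 's/^\[\([^]]*\)] \[NOTICE] STANDBY PROMOTE successful.*/\1/p' "$1" | tail -n 1
}

# Usage: promotion_time /home/postgres/repmgrd.log
```

For the log excerpt above, this would print `2019-09-20 12:02:52`.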

The cluster status confirms that the standby is now the primary

[postgres@node2 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | primary | - failed  |          | default  | 100      | ?        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 5        | host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "node1" (ID: 1)

Run rejoin on the old primary to bring it back into the cluster (first as a dry run, then for real)

[postgres@node1 ~]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose --dry-run
[postgres@node1 ~]$ repmgr node rejoin -d 'host=192.168.1.2 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.conf,postgresql.auto.conf --verbose
INFO: looking for configuration file in /etc
INFO: configuration file found at: "/etc/repmgr.conf"
INFO: prerequisites for using pg_rewind are met
INFO: 2 files copied to "/tmp/repmgr-config-archive-node1"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "pg_rewind -D '/pgdata' --source-server='host=192.168.1.2 user=repmgr dbname=repmgr connect_timeout=2'"
NOTICE: 2 files copied to /pgdata
INFO: directory "/tmp/repmgr-config-archive-node1" deleted
INFO: deleting "recovery.done"
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "pg_ctl  -w -D '/pgdata' start"
INFO: demoted primary is pingable
INFO: node 1 has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2

[postgres@node1 ~]$ repmgr cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+---------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 5        | host=192.168.1.1 user=repmgr dbname=repmgr connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 6        |
