
Installing and Verifying Pacemaker on CentOS

    • Setting up SSH trust between the two nodes (optional)
    • Adding authentication between the nodes (run on one node)
    • Configuring the cluster
    • Exclusive volume group activation
    • Creating the resource group
      • pcs status shows the resource failed to start
      • Adding the remaining resources
    • Verification

Reference: Cluster Software Installation

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_administration/ch-startup-HAAA#s1-clusterinstall-HAAA

# yum install pcs pacemaker fence-agents-all
           

In practice corosync was installed explicitly as well:

# yum install -y fence-agents-all corosync pacemaker pcs

Post-install checks:

# rpm -q pacemaker
pacemaker-1.1.20-5.el7_7.1.x86_64
# grep hacluster /etc/passwd
hacluster:x:189:189:cluster user:/home/hacluster:/sbin/nologin
           

Configure the hostnames:

[root@server4 ~]# hostnamectl set-hostname server4.example.com

vi /etc/hosts

192.168.122.143 server3.example.com s3

192.168.122.58 server4.example.com s4
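A quick sanity check, for what it's worth, that the /etc/hosts entries resolve on each machine (any of the names will do):

# ping -c1 s4
# ping -c1 server4.example.com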

Setting up SSH trust between the two nodes (optional)

[root@server3 ~]# ssh-keygen          (accept the defaults)

[root@server3 ~]# ssh-copy-id s4      (copy the public key to the peer)

[root@server3 ~]# ssh s4              (verify passwordless login to the peer)

Start pcsd on both servers:

systemctl start pcsd

systemctl enable pcsd
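Before moving on, it is worth confirming pcsd is actually up and listening; a quick hedged check (2224 is pcsd's default TCP port):

# systemctl is-active pcsd
# ss -tnlp | grep 2224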

Adding authentication between the nodes (run on one node)

[root@server3 ~]# pcs cluster auth server3.example.com server4.example.com
Username: root
Password: 
Error: s3: Username and/or password is incorrect
Error: Unable to communicate with s4
[root@server3 ~]#
[root@server3 ~]# pcs cluster auth server3.example.com server4.example.com
Username: hacluster
Password: 
Error: Unable to communicate with server4.example.com
server3.example.com: Authorized
[root@server3 ~]#
           

The root user cannot be used here; authentication must use the hacluster account.

Following the official docs, add the firewall configuration:

[root@server4 .ssh]# firewall-cmd --permanent --add-service=high-availability

[root@server4 .ssh]# firewall-cmd --add-service=high-availability

I also restarted the pcsd service.

A password must be set for the hacluster user, on both machines.
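For reference, the password is set with passwd:

# passwd hacluster        # run this on both server3 and server4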

[root@server3 ~]# pcs cluster auth server3.example.com server4.example.com
Username: hacluster
Password: 
server4.example.com: Authorized
server3.example.com: Authorized
           

After authentication succeeded, and since I had earlier manually given the hacluster user a login shell, I removed it again per the official docs:

# usermod  -s /sbin/nologin hacluster
           

Configuring the cluster

Run on one of the nodes:

pcs cluster setup --start --name mytest_cluster server3.example.com server4.example.com

Both machines show the same output:

# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: server3.example.com (version 1.1.20-5.el7_7.1-3c4c782f70) - partition with quorum
 Last updated: Wed Sep 25 23:30:26 2019
 Last change: Wed Sep 25 23:28:19 2019 by hacluster via crmd on server3.example.com
 2 nodes configured
 0 resources configured

PCSD Status:
  server3.example.com: Online
  server4.example.com: Online
# 
           

Shut the machines down for the night.

After powering back on, pcs cluster status reported the cluster was not running. Enable autostart, from either machine:

# pcs cluster enable --all

server3.example.com: Cluster Enabled

server4.example.com: Cluster Enabled

The cluster still has to be started by hand: # pcs cluster start

If you start it only on server4, both nodes show as online there, but running pcs cluster status on server3 still reports the cluster as not running.
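To avoid that asymmetry, both nodes can be started from a single machine in one step:

# pcs cluster start --all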

After rebooting server3, pcs status showed normal:

PCSD Status:

server4.example.com: Online

server3.example.com: Online

I skipped the fencing configuration; reference:

Chapter 5. Fencing: Configuring STONITH

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/ch-fencing-haar#s1-stonithlist-HAAR

# pcs stonith list|grep -i virt
fence_virt - Fence agent for virtual machines
fence_xvm - Fence agent for virtual machines
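Fencing is skipped here, but the parameters an agent accepts can be inspected before committing to one; for libvirt guests fence_xvm is the usual candidate:

# pcs stonith describe fence_xvm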
           

Following the document below, create an active/passive web service.

Chapter 2. An active/passive Apache HTTP Server in a Red Hat High Availability Cluster

Resources to prepare: one floating IP address and one shared disk.

In virt-manager, click the lightbulb (details) icon for server3; Disk 2's path is /var/lib/libvirt/images/hd4ks-clone.raw. Add the same file to server4.

# fdisk -l

Disk /dev/vdb: 209 MB, 209715200 bytes, 409600 sectors

It can be used directly.

Run on either node.

In practice fdisk -l on server3 could not see the vdb disk, so this was done on server4, wiping the existing ext signature.

[root@server4 ~]# pvcreate /dev/vdb
WARNING: ext3 signature detected on /dev/vdb at offset 1080. Wipe it? [y/n]: y
  Wiping ext3 signature on /dev/vdb.
  Physical volume "/dev/vdb" successfully created.
[root@server4 ~]#
           

Rebooting s3 still did not make vdb visible. I ticked the "shareable" option for the disk on both machines, shut both down, and started them again.

[root@server3 ~]# vgcreate my_vg /dev/vdb
  Volume group "my_vg" successfully created
[root@server3 ~]# lvcreate -L 200 my_vg -n my_lv
  Volume group "my_vg" has insufficient free space (49 extents): 50 required.
[root@server3 ~]# lvs     (the 200 MB LV was not created for lack of space, so lvs prints nothing)
[root@server3 ~]# lvcreate -L 190 my_vg -n my_lv
  Rounding up size to full physical extent 192.00 MiB
  Logical volume "my_lv" created.
[root@server3 ~]# mkfs.ext4 /dev/my_vg/my_lv
           

At this point server4 could see only vdb and the PV; even after running vgscan the volume group still did not show up there.

2.2. Web Server Configuration

Install on both machines: yum install -y httpd wget

So that the resource agent can probe Apache's status, /etc/httpd/conf/httpd.conf also needs this addition:

<Location /server-status>

SetHandler server-status

Require local

</Location>
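With the block in place (and Apache running), the status page the agent polls can be exercised by hand; Require local means it only answers requests from the machine itself:

# curl http://127.0.0.1/server-status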

The agent does not drive Apache through systemd, so make the following change so that logrotate can still reload it:

/etc/logrotate.d/httpd

Delete the line:

/bin/systemctl reload httpd.service > /dev/null 2>/dev/null || true

and replace it with:

/usr/sbin/httpd -f /etc/httpd/conf/httpd.conf -c "PidFile /var/run/httpd.pid" -k graceful > /dev/null 2>/dev/null || true

Run on one of the nodes:

# mount /dev/my_vg/my_lv /var/www/
# mkdir /var/www/html
# mkdir /var/www/cgi-bin
# mkdir /var/www/error
# restorecon -R /var/www
# cat <<-END >/var/www/html/index.html
<html>
<body>Hello</body>
</html>
END
# umount /var/www
           

I tested removing the dash before END (i.e. <<END instead of <<-END); the result is the same, since <<- only strips leading tabs and this content has none.

Exclusive volume group activation

2.3. Exclusive Activation of a Volume Group in a Cluster

The cluster requires that the volume group never be activated outside the cluster software's control.

/etc/lvm/lvm.conf

VGs listed in volume_list are activated automatically, so the list must not include the volume group the cluster will manage.
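For example, on a machine whose system disk had a VG of its own, the line would list just that VG so it still activates at boot; the name centos_root below is purely a hypothetical placeholder (on these two machines the list ends up empty, as shown later):

volume_list = [ "centos_root" ]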

# lvmconf --enable-halvm --services --startstopservices
           

The command changes the following parameters and stops lvmetad:

locking_type is set to 1 (the default)

use_lvmetad is set to 0 (the default was 1)

grep -E "locking_type|use_lvmetad" /etc/lvm/lvm.conf

List the VG names: # vgs --noheadings -o vg_name (surprisingly many flags for such a simple query)

Rebuild the initramfs and reboot, so that the boot image does not activate the volume group itself (I skipped this step at first):

# dracut -H -f /boot/initramfs-$(uname -r).img $(uname -r)
           

-H installs only the drivers needed to boot this host; -f overwrites the existing file.

If the kernel has been updated, reboot first and then run the command above.
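To double-check that the rebuilt image really carries the restricted lvm.conf, lsinitrd (from the dracut package) can print a file embedded in the initramfs; a hedged check:

# lsinitrd -f etc/lvm/lvm.conf /boot/initramfs-$(uname -r).img | grep volume_list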

Manually deactivate the volume group on both machines:

# vgchange -a n my_vg

0 logical volume(s) in volume group "my_vg" now active

Creating the resource group

2.4. Creating the Resources and Resource Groups with the pcs Command

Four resources (the LVM volume group, a Filesystem, a floating IPaddr2 address, and the application) make up the resource group apachegroup, which guarantees they all run on the same machine.

The LVM volume group:

# pcs resource create my_lvm LVM volgrpname=my_vg \
 exclusive=true --group apachegroup
Assumed agent name 'ocf:heartbeat:LVM' (deduced from 'LVM')

           

pcs status shows the resource failed to start

Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	FAILED (Monitoring)[ server3.example.com server4.example.com ]

Failed Resource Actions:
* my_lvm_monitor_0 on server3.example.com 'unknown error' (1): call=5, status=complete, exitreason='The volume_list filter must be initialized in lvm.conf for exclusive activation without clvmd'
           

Per the hint, when clvmd is not running, the volume_list parameter must be configured in lvm.conf; the guide mentions this, but I had skipped it.

/etc/lvm/lvm.conf

volume_list = []

Since this machine has no other VGs, the value is empty, but the parameter itself must be present.

None of the following resolved it:

pcs resource restart my_lvm     (restart the resource)

pcs resource disable my_lvm     (stop the resource)

pcs resource enable my_lvm      (start the resource)

pcs resource cleanup my_lvm     (clear the resource's failure state)

pcs cluster stop --all          (stop all cluster nodes)

pcs cluster start --all         (start all cluster nodes)

reboot

pcs resource show reported the resource as stopped (the word show can be omitted):

Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	Stopped
           

Rebuilding the initramfs

Back up first; it turned out the two machines were even running different kernels:

# cp -p /boot/initramfs-3.10.0-1062.1.1.el7.x86_64.img /boot/initramfs-3.10.0-1062.1.1.el7.x86_64.img.bakok1
# cp -p /boot/initramfs-3.10.0-957.el7.x86_64.img /boot/initramfs-3.10.0-957.el7.x86_64.img.bakok1
# dracut -H -f /boot/initramfs-$(uname -r).img $(uname -r)
# reboot

It still would not start.

Deleting and recreating the resource, even with exclusive=false, did not help:

pcs resource delete my_lvm
pcs resource create my_lvm LVM volgrpname=my_vg \
> exclusive=false --group apachegroup
           

Check the logs:

[root@server3 ~]# journalctl -xe
Sep 26 17:13:54 server3.example.com pengine[3312]:    error: Resource start-up disabled since no STONITH resources have been defined
Sep 26 17:13:54 server3.example.com pengine[3312]:    error: Either configure some or disable STONITH with the stonith-enabled option
Sep 26 17:13:54 server3.example.com pengine[3312]:    error: NOTE: Clusters with shared data need STONITH to ensure data integrity
Sep 26 17:13:54 server3.example.com pengine[3312]:   notice: Removing my_lvm from server3.example.com
Sep 26 17:13:54 server3.example.com pengine[3312]:   notice: Removing my_lvm from server4.example.com
[root@server3 ~]# pcs property show --all
stonith-enabled: true
[root@server3 ~]# pcs property set stonith-enabled=false     (run on one node only; properties apply cluster-wide)
[root@server3 ~]# pcs property show --all |grep stonith-enabled
 stonith-enabled: false
           

After re-adding the resource, it started successfully:

[root@server3 ~]# pcs resource create my_lvm LVM volgrpname=my_vg \
>  exclusive=true --group apachegroup
Assumed agent name 'ocf:heartbeat:LVM' (deduced from 'LVM')
[root@server3 ~]# pcs status
[root@server3 ~]# pcs resource
 Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	Started server3.example.com
lvdisplay now reports: LV Status              available

           

Adding the remaining resources

pcs resource create my_fs Filesystem \
device="/dev/my_vg/my_lv" directory="/var/www" fstype="ext4" --group apachegroup

Check the mount with df.

pcs resource create VirtualIP IPaddr2 ip=192.168.122.30 \
cidr_netmask=24 --group apachegroup

Check the floating IP with ip ad.

pcs resource create Website apache \
configfile="/etc/httpd/conf/httpd.conf" \
statusurl="http://127.0.0.1/server-status" --group apachegroup

Verification

Status check:

# pcs status
Cluster name: mytest_cluster
Stack: corosync
Current DC: server3.example.com (version 1.1.20-5.el7_7.1-3c4c782f70) - partition with quorum
Last updated: Thu Sep 26 17:27:24 2019
Last change: Thu Sep 26 17:26:27 2019 by root via cibadmin on server3.example.com

2 nodes configured
4 resources configured

Online: [ server3.example.com server4.example.com ]

Full list of resources:

 Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	Started server3.example.com
     my_fs	(ocf::heartbeat:Filesystem):	Started server3.example.com
     VirtualIP	(ocf::heartbeat:IPaddr2):	Started server3.example.com
     Website	(ocf::heartbeat:apache):	Started server3.example.com

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
# 
           

Accessing the application

Visiting http://192.168.122.30 in Firefox shows Hello.

$ curl http://192.168.122.30
Hello

At this point systemctl status httpd on both machines shows Apache as not started (Active: inactive (dead)); the cluster's apache agent runs httpd directly rather than through systemd.

Failover tests

(1) Reboot

[root@server3 ~]# reboot

Failover to server4 took about a second.

(2) Killing the process, and stopping the cluster software

[root@server4 ~]# ps -ef|grep httpd
root      7389     1  0 17:36 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache    7391  7389  0 17:36 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache    7392  7389  0 17:36 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache    7393  7389  0 17:36 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache    7394  7389  0 17:36 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
apache    7395  7389  0 17:36 ?        00:00:00 /sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf -c PidFile /var/run/httpd.pid
root      7642  3468  0 17:36 pts/0    00:00:00 grep --color=auto httpd
[root@server4 ~]# kill -9 7389
           

The process came back automatically; four kills in a row caused no failover, only this warning:

Failed Resource Actions:
* Website_monitor_10000 on server4.example.com 'not running' (7): call=42, status=complete, exitreason='',
    last-rc-change='Thu Sep 26 17:37:11 2019', queued=0ms, exec=0ms
           

The configuration shows the monitor interval is 10 seconds, while a kill-and-respawn cycle probably takes only about 3 seconds, so the monitor never catches the service down.

# pcs config

Operations: monitor interval=10s timeout=20s (Website-monitor-interval-10s)
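If faster detection were wanted, the monitor operation could be tightened; a sketch (5s is an arbitrary choice, and shorter intervals mean more monitoring load):

# pcs resource update Website op monitor interval=5s timeout=20s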

Time for something more drastic:

# mv /sbin/httpd /sbin/httpdbak
           

After that, killing the process caused an immediate failover, within about a second.

The logs show the agent is reasonably smart about it; it did not actually retry 1,000,000 times:

Sep 26 17:54:08 server3.example.com apache(Website)[9380]: ERROR: apache httpd program not found
Sep 26 17:54:08 server3.example.com apache(Website)[9396]: ERROR: environment is invalid, resource considered stopped
Sep 26 17:54:08 server3.example.com lrmd[3325]:   notice: Website_monitor_10000:9320:stderr [ ocf-exit-reason:apache httpd program not found ]
Sep 26 17:54:08 server3.example.com lrmd[3325]:   notice: Website_monitor_10000:9320:stderr [ ocf-exit-reason:environment is invalid, resource considered stopped ]
Sep 26 17:54:08 server3.example.com crmd[3332]:   notice: server3.example.com-Website_monitor_10000:41 [ ocf-exit-reason:apache httpd program not found\nocf-exit-reason:environment is invalid, resource considered stopped\n ]
...
Sep 26 17:54:09 server3.example.com pengine[3331]:  warning: Processing failed start of Website on server3.example.com: not installed
Sep 26 17:54:09 server3.example.com pengine[3331]:   notice: Preventing Website from re-starting on server3.example.com: operation start failed 'not installed' (5)
Sep 26 17:54:09 server3.example.com pengine[3331]:  warning: Forcing Website away from server3.example.com after 1000000 failures (max=1000000)
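That 1000000 is the default migration-threshold. Lowering it makes a handful of real failures push the resource to the other node; the value 3 below is only an illustrative choice:

# pcs resource meta Website migration-threshold=3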
           

Recovery may fail

After stopping the cluster software on server4 with pcs cluster stop:

2 nodes configured
4 resources configured

Online: [ server3.example.com ]
OFFLINE: [ server4.example.com ]

Full list of resources:

 Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	Started server3.example.com
     my_fs	(ocf::heartbeat:Filesystem):	Started server3.example.com
     VirtualIP	(ocf::heartbeat:IPaddr2):	Started server3.example.com
     Website	(ocf::heartbeat:apache):	Stopped

Failed Resource Actions:
* Website_start_0 on server3.example.com 'not installed' (5): call=44, status=complete, exitreason='environment is invalid, resource considered stopped',
           

Manually clear the failure state:

# pcs resource cleanup Website

It still would not start:

  • Website_start_0 on server3.example.com 'unknown error' (1): call=63, status=Timed Out, exitreason='',

After a reboot, all resources showed as stopped.

Checking with journalctl was not very helpful; I did not dig far enough to find the key message:

Sep 26 18:13:53 server3.example.com LVM(my_lvm)[3460]: WARNING: LVM Volume my_vg is not available (stopped)
Sep 26 18:13:53 server3.example.com crmd[3321]:   notice: Result of probe operation for my_lvm on server3.example.com: 7 (not running)
Sep 26 18:13:53 server3.example.com crmd[3321]:   notice: Initiating monitor operation my_fs_monitor_0 locally on server3.example.com
Sep 26 18:13:53 server3.example.com Filesystem(my_fs)[3480]: WARNING: Couldn't find device [/dev/my_vg/my_lv]. Expected /dev/??? to exist
           

The cluster on server4 had to be started: [root@server4 ~]# pcs cluster start

Everything then recovered automatically.

Summary:

1. After the damaged-httpd problem, anomalies can persist even once the file is restored; it is best to reboot (a recovery sketch follows below).

2. When the cluster software has been stopped on one machine, do not reboot the other one.
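For point 1, recovery presumably amounts to putting the binary back and clearing the failure state, along these lines:

# mv /sbin/httpdbak /sbin/httpd
# pcs resource cleanup Website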

(3) Removing the floating IP address

[root@server4 ~]# ip addr del 192.168.122.30/24 dev eth0

The address is re-added automatically on the same node.

[root@server4 ~]# ip link set down dev eth0

After that the machine can only be managed through KVM or the virt-manager console; on s4 the status still does not show a switchover.

On server3 the failover had already succeeded, and the web page was accessible:

Online: [ server3.example.com ]
OFFLINE: [ server4.example.com ]

Full list of resources:

 Resource Group: apachegroup
     my_lvm	(ocf::heartbeat:LVM):	Started server3.example.com
     my_fs	(ocf::heartbeat:Filesystem):	Started server3.example.com
     VirtualIP	(ocf::heartbeat:IPaddr2):	Started server3.example.com
     Website	(ocf::heartbeat:apache):	Started server3.example.com
$ curl http://192.168.122.30
<html>
<body>Hello</body>
</html>
           

Bringing s4's NIC back up (ifconfig eth0 up) automatically removes the leftover floating address on s4 and rejoins the node to the cluster; pcs status then shows the same output on both machines.
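For what it's worth, a gentler failover rehearsal than downing NICs is standby mode, which drains resources off a node and later allows them back:

# pcs cluster standby server3.example.com
# pcs cluster unstandby server3.example.com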
