Resource monitor timeouts occur when a sub path of the multipath device fails.

2021-10-25 15:00:07

Environment

Red Hat Enterprise Linux Server release 7

Issue

Resource monitor timeouts occur when a sub path of the multipath device fails.

Errors seen in "/var/log/messages":


Aug 16 13:19:21 xxxxx kernel: qla2xxx [0000:08:00.0]-801c:1: Abort command issued nexus=1:3:0 --  1 2002.
Aug 16 13:19:44 xxxxx  lrmd[4223]: warning: KEP_vg_arch_monitor_60000 process (PID 22244) timed out
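
To check whether a node shows the same pattern as the two messages above, the log can be filtered for both events. This is a minimal sketch: the function name find_path_failure_events is hypothetical, and the patterns are derived only from the two sample lines shown here, so other HBA drivers or cluster versions may log different text.

```shell
# Hypothetical helper: list qla2xxx command aborts and lrmd monitor
# timeouts from a syslog file (defaults to /var/log/messages).
find_path_failure_events() {
    logfile="${1:-/var/log/messages}"
    # Pattern 1: HBA driver abort messages; pattern 2: lrmd operation timeouts
    grep -E 'qla2xxx .*Abort command issued|lrmd\[[0-9]+\]: .*timed out' "$logfile"
}
```

Calling find_path_failure_events with no argument reads "/var/log/messages". An abort followed shortly by a monitor timeout, as in the sample above, suggests the same sequence of events.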

Resolution

The following tuning options for dm-multipath and SCSI error handling should reduce the time needed to complete error handling on the affected paths and to fail the IO over to the remaining available sub paths. This should help avoid monitor timeouts when only a few of the sub paths to the SAN devices are affected:

[1] Edit the "defaults" section of "/etc/multipath.conf" to reduce "checker_timeout" and "polling_interval":

defaults {
        checker_timeout      10
        polling_interval     5
}

Then update the "devices" section to reduce "fast_io_fail_tmo" and "dev_loss_tmo":

devices {
        device {
                vendor             "3PARdata"
                product            "VV"
                failback           "immediate"
                rr_weight          "uniform"
                no_path_retry      2       ### Edited
                rr_min_io_rq       1
                fast_io_fail_tmo   5       ### Edited
                dev_loss_tmo       10      ### Edited
        }
}

In the above snippet, "no_path_retry" is reduced to 2. This is because in RHEL 7 the minimum value for "dev_loss_tmo" is calculated as the product of the current "polling_interval" and "no_path_retry" for the devices. With "polling_interval" set to 5 and "no_path_retry" set to 2, that minimum is 5 x 2 = 10 seconds, which is the "dev_loss_tmo" used above. So to reduce "dev_loss_tmo" for the affected paths, "no_path_retry" must also be reduced.
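
The relationship between these three settings can be checked with a little shell arithmetic before editing the file. This is only a sketch of the rule as stated above; the function name min_dev_loss_tmo is hypothetical and is not part of the multipath tools.

```shell
# Hypothetical helper: smallest usable dev_loss_tmo, following the rule
# described above (polling_interval multiplied by no_path_retry).
min_dev_loss_tmo() {
    polling_interval=$1
    no_path_retry=$2
    echo $(( polling_interval * no_path_retry ))
}

# Values from the configuration above: polling_interval 5, no_path_retry 2
min_dev_loss_tmo 5 2
```

If the result is larger than the "dev_loss_tmo" you intend to set, lower "no_path_retry" or "polling_interval" first.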

After making the above changes, reload the "multipathd" service to make them effective.
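
The reload command itself is not preserved in this copy of the article. The sketch below assumes a RHEL 7 system managed by systemd and must be run as root; the function name reload_multipath_config is hypothetical. It reloads the daemon and then prints the tunables discussed above from the daemon's active configuration so the new values can be confirmed.

```shell
# Assumption: RHEL 7 with systemd, run as root. This only defines a helper;
# nothing is executed until reload_multipath_config is called.
reload_multipath_config() {
    # Ask the running daemon to re-read /etc/multipath.conf
    systemctl reload multipathd || return 1
    # Show the tunables edited above as the daemon now sees them
    multipathd show config |
        grep -E 'checker_timeout|polling_interval|no_path_retry|fast_io_fail_tmo|dev_loss_tmo'
}
```

The output covers both the "defaults" section and every device entry, so look for the "3PARdata" entry to confirm the per-device values.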

[2] Then set the following two SCSI error handling tunables (values in seconds):

eh_timeout    5   Per device
eh_deadline   5   Per SCSI host; overall cap on the error handler, which only starts counting when the error handler kicks in.

To set the above options:

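# Run as root. Per-device error handler timeout for every sd device:
cd /sys/block
for d in sd*
do
    echo 5 > "$d/device/eh_timeout"
done

# Overall error handling cap for each FC HBA host (host1 and host2 here):
cd /sys/class/scsi_host
for host in host1 host2
do
    echo 5 > "$host/eh_deadline"
done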
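
To confirm that the values above were applied, the same sysfs attributes can be read back. This is a minimal sketch; the function name show_eh_settings and its optional sysfs root argument are hypothetical conveniences. Values written to sysfs do not survive a reboot, so this check is also useful after a restart.

```shell
# Hypothetical helper: print the current eh_timeout and eh_deadline values.
# The optional argument is the sysfs mount point (defaults to /sys).
show_eh_settings() {
    sysfs_root="${1:-/sys}"
    for f in "$sysfs_root"/block/sd*/device/eh_timeout \
             "$sysfs_root"/class/scsi_host/host*/eh_deadline
    do
        # Skip unexpanded globs and attributes that cannot be read
        if [ -r "$f" ]; then
            printf '%s = %s\n' "${f#"$sysfs_root"/}" "$(cat "$f")"
        fi
    done
}
```

Running show_eh_settings with no argument reads from "/sys" and lists every sd device and SCSI host, not only host1 and host2.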

Root Cause

The cluster node has SAN devices connected through two QLogic FC HBAs. Command timeouts and command aborts occurred only on the sub paths connected through QLogic FC HBA host1, so the sub paths through the other HBA, host2, were still available. Because those sub paths were still available, the monitor IO operation on the above resources was not expected to fail.
However, SCSI error handling followed by IO failover from the affected sub paths to the unaffected paths took slightly longer than the monitor timeout configured on the resource. As a result, the monitor operation on the resources timed out and Pacemaker initiated a recovery action on the resources.
