Error Handling in Pacemaker Resource Agents

1. Introduction  2. Error types  3. Error handling  4. Takeaways

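1. Introduction

Pacemaker controls resources by calling the operations (such as start and stop) that each resource agent provides. When one of these operations fails, Pacemaker handles the error differently depending on which operation was executed and which type of error was returned.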

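2. Error types

Pacemaker divides errors into three classes: soft, hard, and fatal. The latter two indicate environment or configuration problems that cannot be repaired automatically without manual intervention. Ordinary failures, such as a crashed service process or an unreachable network, are normally reported with the return code OCF_ERR_GENERIC, which belongs to the soft class.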

<a href="http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html#_how_are_ocf_return_codes_interpreted">http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html#_how_are_ocf_return_codes_interpreted</a>

The first thing the cluster does is to check the return code against the expected result. If the result does not match the expected value, then the operation is considered to have failed and recovery action is initiated.

There are three types of failure recovery:

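Table B.3. Types of recovery performed by the cluster

| Type | Description | Action Taken by the Cluster |
|------|-------------|-----------------------------|
| soft | A transient error occurred | Restart the resource or move it to a new location |
| hard | A non-transient error that may be specific to the current node occurred | Move the resource elsewhere and prevent it from being retried on the current node |
| fatal | A non-transient error that will be common to all cluster nodes (eg. a bad configuration was specified) | Stop the resource and prevent it from being started on any cluster node |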

Assuming an action is considered to have failed, the following table outlines the different OCF return codes and the type of recovery the cluster will initiate when it is received.

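Table B.4. OCF Return Codes and their Recovery Types

| Return code | OCF Alias | Recovery type |
|-------------|-----------|---------------|
| 0 | OCF_SUCCESS | soft |
| 1 | OCF_ERR_GENERIC | soft |
| 2 | OCF_ERR_ARGS | hard |
| 3 | OCF_ERR_UNIMPLEMENTED | hard |
| 4 | OCF_ERR_PERM | hard |
| 5 | OCF_ERR_INSTALLED | hard |
| 6 | OCF_ERR_CONFIGURED | fatal |
| 7 | OCF_NOT_RUNNING | N/A |
| 8 | OCF_RUNNING_MASTER | soft |
| 9 | OCF_FAILED_MASTER | soft |
| other | N/A | soft |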

Although counterintuitive, even actions that return 0 (aka. OCF_SUCCESS) can be considered to have failed.
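The mapping from return code to recovery type can be sketched as a small shell function. This is only an illustration, assuming the standard OCF numeric values 0 through 9 and the recovery types given in Table B.4 of the Pacemaker 1.1 manual; `recovery_type` is a hypothetical helper name, not something Pacemaker or the resource-agents package provides.

```shell
#!/bin/sh
# Hypothetical helper: print the recovery type for an OCF return code,
# following Table B.4 of the Pacemaker 1.1 manual.
recovery_type() {
    case "$1" in
        0|1|8|9) echo "soft" ;;   # OCF_SUCCESS, OCF_ERR_GENERIC, OCF_RUNNING_MASTER, OCF_FAILED_MASTER
        2|3|4|5) echo "hard" ;;   # OCF_ERR_ARGS, OCF_ERR_UNIMPLEMENTED, OCF_ERR_PERM, OCF_ERR_INSTALLED
        6)       echo "fatal" ;;  # OCF_ERR_CONFIGURED
        7)       echo "N/A" ;;    # OCF_NOT_RUNNING
        *)       echo "soft" ;;   # any other (custom) return code
    esac
}

recovery_type 1   # prints "soft"
recovery_type 6   # prints "fatal"
```

The recovery type only matters once the cluster has decided that the action failed, that is, when the return code differs from the expected one.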

3. Error handling

Every operation of a resource has an on-fail attribute that controls how a failure of that operation is handled.

<a href="http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html#_monitoring_resources_for_failure">http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html#_monitoring_resources_for_failure</a>

Table 5.3. Properties of an Operation

* name
* interval
* timeout
* on-fail - The action to take if this action ever fails. Allowed values:
    * ignore - Pretend the resource did not fail
    * block - Don’t perform any further operations on the resource
    * stop - Stop the resource and do not start it elsewhere
    * restart - Stop the resource and start it again (possibly on a different node)
    * fence - STONITH the node on which the resource failed
    * standby - Move all resources away from the node on which the resource failed
* enabled

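Testing, however, revealed two problems, or rather bugs.

Problem 1:

On the older Pacemaker (1.1.7), the value of on-fail makes no difference: the default behavior always applies. On the latest Pacemaker (1.1.14) the problem does not occur, and on-fail takes effect.

Problem 2:

I made each operation of the resource agent return OCF_ERR_GENERIC in turn and observed how the cluster resource manager reacted. The default on-fail behavior turned out not to match the manual, which says: "The default for the stop operation is fence when STONITH is enabled and block otherwise. All other operations default to stop." The observed behavior is listed below. The actual defaults are more reasonable than the documented ones, so this can be regarded as a bug in the Pacemaker manual.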

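| Operation | Error handling | Equivalent on-fail value |
|-----------|----------------|--------------------------|
| start | Sets fail-count=1000000; calls stop on the local node; starts the resource on another node | restart |
| stop | Blocks all further operations on the resource, which enters the unmanaged FAILED state, shown as: dummy (ocf::heartbeat:Dummy2): Started srdsdevapp69 (unmanaged) FAILED | block |
| monitor | Increments fail-count by 1; calls stop, start, monitor in turn on the local node. If monitor still fails, repeats stop, start, monitor until fail-count reaches migration-threshold, then leaves the resource stopped | |
| promote | Calls demote, stop, start in turn on the local node; calls promote on another node to make the instance there the master | |
| demote | Calls stop, start, demote in turn on the local node. If demote still fails, repeats stop, start, demote until fail-count reaches migration-threshold, then leaves the resource stopped | |
| notify | Ignored | ignore |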

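Note 1: A timeout is handled the same way as OCF_ERR_GENERIC.

Note 2: Pacemaker does not call the post-stop notify for a resource that is already stopped.

Note 3: Test environments: Pacemaker 1.1.7-6 + CentOS 6.3, and Pacemaker 1.1.14 + CentOS 6.3.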
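The monitor recovery loop observed in these tests can be modelled with a toy shell function. It is a sketch under strong assumptions: monitor fails every time, only the local node is considered, and `simulate_monitor_failures` with its output format is hypothetical. It does not call Pacemaker; it only counts failures against migration-threshold.

```shell
#!/bin/sh
# Toy model of the observed default recovery for a failing monitor.
# Assumption: monitor fails on every attempt; no other node is involved.
simulate_monitor_failures() {
    threshold=$1      # stands in for migration-threshold
    fail_count=0
    while :; do
        fail_count=$((fail_count + 1))   # each monitor failure: fail-count += 1
        if [ "$fail_count" -ge "$threshold" ]; then
            # Threshold reached: the resource is stopped and stays stopped.
            echo "fail-count=$fail_count: stop, keep stopped"
            return 0
        fi
        # Below the threshold: recover in place and monitor again.
        echo "fail-count=$fail_count: stop, start, monitor"
    done
}

simulate_monitor_failures 3
```

With a threshold of 3 the model prints two recovery rounds followed by the final stop, which shows why a small migration-threshold bounds the loop.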

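4. Takeaways

The test results above suggest several lessons for resource agent authors:

1. Unless it is truly necessary, do not let the stop operation return an error.

2. monitor and start must judge the resource state consistently: a monitor executed immediately after a successful start should never fail, otherwise the cluster may loop.

3. A demote executed after a successful restart should not fail, otherwise the cluster may loop.

4. Setting migration-threshold to a fairly small value (the default is INFINITY, i.e. 1000000) also limits the impact of points 2 and 3.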
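Points 1 and 2 can be sketched as a minimal agent skeleton. This is an illustration only, assuming a resource whose running state is represented by a state file; `STATE_FILE` and the `ra_*` function names are hypothetical, and a real agent would source ocf-shellfuncs and manage an actual service.

```shell
#!/bin/sh
# Sketch of a consistent start/stop/monitor contract (hypothetical names).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7
STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/demo-ra.state}"

ra_monitor() {
    # Single source of truth for "is the resource running?"
    [ -f "$STATE_FILE" ] && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}

ra_start() {
    ra_monitor && return $OCF_SUCCESS            # already running
    touch "$STATE_FILE" || return $OCF_ERR_GENERIC
    # Report success only once monitor agrees (point 2).
    ra_monitor || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}

ra_stop() {
    # Stopping an already stopped resource counts as success (point 1).
    ra_monitor || return $OCF_SUCCESS
    rm -f "$STATE_FILE" || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}
```

Because start and stop both decide through ra_monitor, a monitor run right after a successful start cannot disagree with it, and stop fails only when the state file really cannot be removed.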
