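1. Introduction
Pacemaker controls resources by invoking the operations (such as start and stop) that each resource agent provides. When one of these operations fails, Pacemaker handles the error differently depending on which operation was running and what type of error occurred.
Pacemaker divides errors into three types: soft, hard, and fatal. The latter two indicate environment or configuration problems that cannot be repaired automatically without manual intervention. Ordinary failures, such as a crashed service process or an unreachable network, generally use OCF_ERR_GENERIC as the return value, and OCF_ERR_GENERIC belongs to the soft type.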
<a href="http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html#_how_are_ocf_return_codes_interpreted">http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html#_how_are_ocf_return_codes_interpreted</a>
The first thing the cluster does is to check the return code against the expected result. If the result does not match the expected value, then the operation is considered to have failed and recovery action is initiated.
There are three types of failure recovery:
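Table B.3. Types of recovery performed by the cluster

| Type | Description | Action Taken by the Cluster |
|------|-------------|-----------------------------|
| soft | A transient error occurred | Restart the resource or move it to a new location |
| hard | A non-transient error that may be specific to the current node occurred | Move the resource elsewhere and prevent it from being retried on the current node |
| fatal | A non-transient error that will be common to all cluster nodes (eg. a bad configuration was specified) | Stop the resource and prevent it from being started on any cluster node |
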
Assuming an action is considered to have failed, the following table outlines the different OCF return codes and the type of recovery the cluster will initiate when it is received.
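Table B.4. OCF Return Codes and their Recovery Types

| RC | OCF Alias | RT |
|----|-----------|----|
| 0 | OCF_SUCCESS | soft |
| 1 | OCF_ERR_GENERIC | soft |
| 2 | OCF_ERR_ARGS | hard |
| 3 | OCF_ERR_UNIMPLEMENTED | hard |
| 4 | OCF_ERR_PERM | hard |
| 5 | OCF_ERR_INSTALLED | hard |
| 6 | OCF_ERR_CONFIGURED | fatal |
| 7 | OCF_NOT_RUNNING | N/A |
| 8 | OCF_RUNNING_MASTER | soft |
| 9 | OCF_FAILED_MASTER | soft |
| other | NA | soft |
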
Although counterintuitive, even actions that return 0 (aka. OCF_SUCCESS) can be considered to have failed.
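
The rule quoted above (compare the return code with the expected result, then pick a recovery type) can be sketched in a few lines of Python. This is only an illustration, not Pacemaker code: the function name `recovery_type` is made up for this example, and the mapping is my reading of Table B.4 in the linked Pacemaker Explained document.

```python
from enum import IntEnum
from typing import Optional


class OCF(IntEnum):
    """OCF return codes as listed in Table B.4."""
    OCF_SUCCESS = 0
    OCF_ERR_GENERIC = 1
    OCF_ERR_ARGS = 2
    OCF_ERR_UNIMPLEMENTED = 3
    OCF_ERR_PERM = 4
    OCF_ERR_INSTALLED = 5
    OCF_ERR_CONFIGURED = 6
    OCF_NOT_RUNNING = 7
    OCF_RUNNING_MASTER = 8
    OCF_FAILED_MASTER = 9


# Recovery type per return code (assumption: transcribed from Table B.4).
RECOVERY_TYPE = {
    OCF.OCF_SUCCESS: "soft",
    OCF.OCF_ERR_GENERIC: "soft",
    OCF.OCF_ERR_ARGS: "hard",
    OCF.OCF_ERR_UNIMPLEMENTED: "hard",
    OCF.OCF_ERR_PERM: "hard",
    OCF.OCF_ERR_INSTALLED: "hard",
    OCF.OCF_ERR_CONFIGURED: "fatal",
    OCF.OCF_NOT_RUNNING: "N/A",
    OCF.OCF_RUNNING_MASTER: "soft",
    OCF.OCF_FAILED_MASTER: "soft",
}


def recovery_type(rc: int, expected: int) -> Optional[str]:
    """Return None if the action succeeded, else the recovery type.

    An action fails whenever rc differs from the expected value,
    even if rc is 0 (OCF_SUCCESS).
    """
    if rc == expected:
        return None
    # Codes outside the table fall under the "other" row, which is soft.
    return RECOVERY_TYPE.get(rc, "soft")
```

For example, `recovery_type(0, 7)` models a probe that expected the resource to be stopped but found it running: the return code is OCF_SUCCESS, yet the action counts as failed.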
Each operation of a resource has an on-fail attribute that controls how a failure of that operation is handled.
<a href="http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html#_monitoring_resources_for_failure">http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html-single/Pacemaker_Explained/index.html#_monitoring_resources_for_failure</a>
Table 5.3. Properties of an Operation
Fields: id, name, interval, timeout, on-fail, enabled

on-fail: The action to take if this action ever fails. Allowed values:
* ignore - Pretend the resource did not fail
* block - Don't perform any further operations on the resource
* stop - Stop the resource and do not start it elsewhere
* restart - Stop the resource and start it again (possibly on a different node)
* fence - STONITH the node on which the resource failed
* standby - Move all resources away from the node on which the resource failed
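
To make the operation properties concrete, here is a minimal Python sketch that builds an `<op>` element of the kind stored in the cluster configuration and rejects on-fail values outside the list above. The helper `build_op` and the id naming scheme are assumptions made for this example; they are not part of Pacemaker.

```python
import xml.etree.ElementTree as ET

# Allowed on-fail values, taken from the list above.
ALLOWED_ON_FAIL = {"ignore", "block", "stop", "restart", "fence", "standby"}


def build_op(resource_id, name, interval, timeout, on_fail=None):
    """Build an <op> element; on_fail=None leaves the default in effect."""
    if on_fail is not None and on_fail not in ALLOWED_ON_FAIL:
        raise ValueError(f"invalid on-fail value: {on_fail!r}")
    attrs = {
        # Hypothetical id scheme; the id only has to be unique.
        "id": f"{resource_id}-{name}-{interval}",
        "name": name,
        "interval": interval,
        "timeout": timeout,
    }
    if on_fail is not None:
        attrs["on-fail"] = on_fail
    return ET.Element("op", attrs)


# A monitor operation that explicitly asks for restart on failure.
op = build_op("dummy", "monitor", "10s", "20s", on_fail="restart")
print(ET.tostring(op, encoding="unicode"))
```

Validating the value before it reaches the cluster is a design choice of this sketch; it simply catches typos such as `on_fail="reboot"` early.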
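However, actual testing revealed two problems, or arguably bugs.
Problem 1:
On the old Pacemaker version (1.1.7), the on-fail setting has no effect no matter how it is set, so the default behavior always applies. On the newer Pacemaker 1.1.14 the problem does not occur and on-fail takes effect.
Problem 2:
Making each operation of the Resource Agent return OCF_ERR_GENERIC and observing how the resource manager handles it shows that the default on-fail behavior is not what the manual states: "The default for the stop operation is fence when STONITH is enabled and block otherwise. All other operations default to stop." The observed behavior is listed below. By comparison, the actual default behavior is the more reasonable one, so this can be regarded as a bug in the Pacemaker manual.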
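
| Operation | Error handling | Corresponding on-fail value |
|-----------|----------------|-----------------------------|
| start | Set fail-count=1000000; call stop on this node; start the resource on another node | restart |
| stop | Block further operations on the resource; the resource enters the unmanaged FAILED state, as in: dummy (ocf::heartbeat:Dummy2): Started srdsdevapp69 (unmanaged) FAILED | block |
| monitor | Set fail-count+=1; call stop, start, monitor in sequence on this node. If monitor still fails, repeat stop, start, monitor until fail-count reaches migration-threshold, then keep the resource stopped | |
| promote | Call demote, stop, start in sequence on this node; call promote on another node to promote the resource there to master | |
| demote | Call stop, start, demote in sequence on this node. If demote still fails, repeat stop, start, demote until fail-count reaches migration-threshold, then keep the resource stopped | |
| notify | Ignored | ignore |
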
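Note 1: A timeout is handled the same way as OCF_ERR_GENERIC.
Note 2: Pacemaker does not call the post-stop notify for a resource that is already stopped.
Note 3: Test environments: Pacemaker 1.1.7-6 + CentOS 6.3, and Pacemaker 1.1.14 + CentOS 6.3.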
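
The monitor handling observed above can be modeled with a small Python simulation. It is a toy model of a single node, based only on the behavior described in this post; it does not talk to Pacemaker, and the function name `simulate_monitor_failures` is invented for the example.

```python
INFINITY = 1_000_000  # Pacemaker's INFINITY score


def simulate_monitor_failures(monitor_results, migration_threshold=INFINITY):
    """Replay a sequence of monitor results on one node.

    monitor_results: iterable of booleans, True meaning monitor succeeded.
    Returns (fail_count, final_state, actions).
    """
    fail_count = 0
    state = "Started"
    actions = []
    for ok in monitor_results:
        actions.append("monitor")
        if ok:
            continue
        fail_count += 1  # every monitor failure adds 1 to fail-count
        if fail_count >= migration_threshold:
            # Threshold reached: stop the resource and leave it stopped.
            actions.append("stop")
            state = "Stopped"
            break
        # Otherwise recover in place: stop, start, then monitor again.
        actions += ["stop", "start"]
    return fail_count, state, actions


# A monitor that always fails, with migration-threshold=3.
print(simulate_monitor_failures([False] * 10, migration_threshold=3))
```

With the default threshold of INFINITY the loop in this model effectively never ends for a permanently failing monitor, which is why a small migration-threshold limits the damage.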
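The test results above suggest a few lessons for Resource Agent authors:
1. Unless truly necessary, do not let the stop operation return an error.
2. The checks in monitor and start must be consistent: a monitor run immediately after a successful start should not fail, otherwise a loop may result.
3. A demote run after a successful restart should not fail, otherwise a loop may result.
4. Setting migration-threshold to a fairly small value (the default is INFINITY, i.e. 1000000) also reduces the impact of points 2 and 3.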
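
Points 1 and 2 can be illustrated with a minimal agent skeleton. The sketch below is written in Python for brevity and tracks its "service" with a state file; the file name and the function layout are assumptions for this example, and a real agent would also dispatch on its first command-line argument and implement meta-data.

```python
import os
import tempfile

OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7

# Hypothetical state file standing in for a real service.
STATE_FILE = os.path.join(tempfile.gettempdir(), "demo-agent.state")


def monitor(state_file=STATE_FILE):
    """The resource counts as running if and only if the state file exists."""
    return OCF_SUCCESS if os.path.exists(state_file) else OCF_NOT_RUNNING


def start(state_file=STATE_FILE):
    """Start, then confirm with the same check that monitor uses (point 2)."""
    if monitor(state_file) == OCF_SUCCESS:
        return OCF_SUCCESS  # already running
    try:
        with open(state_file, "w") as f:
            f.write(str(os.getpid()))
    except OSError:
        return OCF_ERR_GENERIC
    return OCF_SUCCESS if monitor(state_file) == OCF_SUCCESS else OCF_ERR_GENERIC


def stop(state_file=STATE_FILE):
    """Stopping an already stopped resource is a success (point 1)."""
    try:
        os.remove(state_file)
    except FileNotFoundError:
        pass  # nothing to stop
    except OSError:
        return OCF_ERR_GENERIC  # the resource genuinely could not be stopped
    return OCF_SUCCESS
```

Because start reports success only after monitor's own check passes, a monitor that runs right after start cannot disagree with it, and stop returns an error only when the resource really could not be stopped.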