<b>2.6.4 Custom Development Scripts</b>
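Business requirements change constantly, and the open-source solutions available on the Internet do not always cover them. When that happens, you need to write your own scripts to meet the needs of the job. Many of these scripts can run on their own, but I still prefer to have them return their results in a format that Nagios recognizes, so that Nagios can send the alert emails and messages.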
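A minimal sketch of what "a format that Nagios recognizes" usually means: one line of status text on standard output plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The helper name `nagios_result` and the `SERVICE LABEL - detail` layout are illustrative assumptions, not taken from the scripts in this section.

```python
# Exit codes defined by the Nagios plugin convention.
STATUS_OK = 0
STATUS_WARNING = 1
STATUS_CRITICAL = 2
STATUS_UNKNOWN = 3

LABELS = {
    STATUS_OK: "OK",
    STATUS_WARNING: "WARNING",
    STATUS_CRITICAL: "CRITICAL",
    STATUS_UNKNOWN: "UNKNOWN",
}


def nagios_result(service, status, detail):
    """Return (status_line, exit_code) for a plugin run.

    `service` and `detail` are free text chosen by the plugin author
    (hypothetical names used only for this illustration).
    Any status outside 0..3 is reported as UNKNOWN.
    """
    code = status if status in LABELS else STATUS_UNKNOWN
    line = "%s %s - %s" % (service, LABELS[code], detail)
    return line, code
```

A plugin would print the returned line and pass the code to `sys.exit()`, for example `line, code = nagios_result("REDIS", STATUS_WARNING, "used memory 1.2G")`.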
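1. Monitoring whether Redis is running normally
The production NoSQL workload I deal with is mainly Redis, used mostly for high-traffic access to large volumes of data. To make the most of the available resources, each Redis instance is given only a modest amount of memory, and an instance sometimes crashes when colleagues on the development team import a large IP list. I therefore wrote a Redis monitoring script that works together with Nagios. The script is shown below (tested on Amazon Linux AMI x86_64):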
#!/usr/bin/python
# check redis: Nagios plugin; please install the redis-py module.
# Arguments: host port warning critical (both thresholds in GB of used memory)
import redis
import sys

status_ok = 0
status_warning = 1
status_critical = 2

host = sys.argv[1]
port = int(sys.argv[2])
warning = float(sys.argv[3])
critical = float(sys.argv[4])


def connect_redis(host, port):
    # 5-second timeouts keep the check from hanging on a dead instance.
    r = redis.Redis(host=host, port=port, socket_timeout=5, socket_connect_timeout=5)
    return r


def main():
    r = connect_redis(host, port)
    try:
        r.ping()
    except Exception:
        print('%s %s down' % (host, port))
        sys.exit(status_critical)

    redis_info = r.info()
    # Convert bytes to GB using float division throughout.
    used_mem = redis_info['used_memory'] / (1024.0 ** 3)
    used_mem_human = redis_info['used_memory_human']

    if warning <= used_mem < critical:
        print('%s %s use memory warning %s' % (host, port, used_mem_human))
        sys.exit(status_warning)
    elif used_mem >= critical:
        print('%s %s use memory critical %s' % (host, port, used_mem_human))
        sys.exit(status_critical)
    else:
        print('%s %s use memory ok %s' % (host, port, used_mem_human))
        sys.exit(status_ok)


if __name__ == '__main__':
    main()
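The threshold logic of the check can be exercised without a running Redis server. The sketch below isolates it in two small functions; `bytes_to_gb` and `classify_memory` are hypothetical names introduced for illustration, and the thresholds are assumed to be expressed in GB, as in the script above.

```python
def bytes_to_gb(used_bytes):
    """Convert a byte count (such as Redis's used_memory) to GB."""
    return used_bytes / (1024.0 ** 3)


def classify_memory(used_gb, warning_gb, critical_gb):
    """Map memory usage to a Nagios exit code.

    Returns 2 (CRITICAL) at or above critical_gb,
    1 (WARNING) at or above warning_gb, otherwise 0 (OK).
    """
    if used_gb >= critical_gb:
        return 2
    if used_gb >= warning_gb:
        return 1
    return 0
```

Checking the critical threshold first means the two comparisons never overlap, which is the same outcome as the `warning <= used_mem < critical` test in the script.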
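2. Monitoring the number of IP connections on a machine
The requirement is simple. First count the IP connections: if ip_conns is below 15,000 the status is OK, between 15,000 and 20,000 it is a warning, and above 20,000 it is critical. The script is shown below (tested on Amazon Linux AMI x86_64):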
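#!/bin/bash
# Nagios plugin for IP connections
# $1 = 15000  $2 = 20000
# Number of established TCP connections.
ip_conns=`netstat -an | grep tcp | grep EST | wc -l`
# Per-state connection counts, joined into a single comma-separated line.
messages=`netstat -ant | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}' | tr -s '\n' ',' | sed -r 's/(.*),/\1\n/g'`

if [ $ip_conns -lt $1 ]
then
    echo "$messages,OK -connect counts is $ip_conns"
    exit 0
fi
# -ge is used so that a count exactly equal to a threshold is still reported.
if [ $ip_conns -ge $1 -a $ip_conns -lt $2 ]
then
    echo "$messages,Warning -connect counts is $ip_conns"
    exit 1
fi
if [ $ip_conns -ge $2 ]
then
    echo "$messages,Critical -connect counts is $ip_conns"
    exit 2
fi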
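To make the awk one-liner easier to follow, here is a rough Python equivalent that counts connections per TCP state from `netstat -ant` text and applies the same thresholds. The function names and the sample output are made up for illustration; only the column layout of netstat is assumed.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count connections per TCP state (the last column of each tcp line)."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Like awk's /^tcp/, this matches both tcp and tcp6 rows.
        if fields and fields[0].startswith("tcp"):
            states[fields[-1]] += 1
    return states


def classify_connections(established, warning, critical):
    """Return the Nagios exit code for an established-connection count."""
    if established >= critical:
        return 2
    if established >= warning:
        return 1
    return 0


# Invented sample in the layout printed by `netstat -ant`.
SAMPLE = """Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:22             10.0.0.9:51234          ESTABLISHED
tcp        0      0 10.0.0.5:80             10.0.0.7:40000          ESTABLISHED
tcp        0      0 10.0.0.5:80             10.0.0.8:40001          TIME_WAIT
"""
```

Calling `count_tcp_states(SAMPLE)` yields one count per state, and the `ESTABLISHED` count can be passed to `classify_connections` together with the 15000 and 20000 thresholds.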
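3. Script for monitoring CPU utilization
On our production bidder machines, CPU utilization (sys% + user%) can reach 100% during busy peak periods, and traffic that is subsequently directed at those machines cannot get through at all, even though the machine itself, the system load, and the nginx+lua processes all look completely normal. For this situation we need a CPU utilization script that raises an alert once a user-defined threshold is exceeded, so that operations staff can add bidder AMI machines in batches to absorb the peak; AWS EC2 instances can be billed by the hour. Take care here to distinguish system load from CPU utilization. The script is shown below (tested on Amazon Linux AMI x86_64):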
#!/bin/bash
# ==============================================================================
# CPU Utilization Statistics plugin for Nagios
#
# Usage  : ./check_cpu_utili.sh [-w <user,system,iowait>] [-c <user,system,iowait>] ( [ -i <intervals in second> ] [ -n <report number> ])
#
# Example: ./check_cpu_utili.sh
#          ./check_cpu_utili.sh -w 70,40,30 -c 90,60,40
#          ./check_cpu_utili.sh -w 70,40,30 -c 90,60,40 -i 3 -n 5
#-------------------------------------------------------------------------------
# Paths to commands used in this script. These may have to be modified to match your system setup.
iostat="/usr/bin/iostat"

# Nagios return codes
state_ok=0
state_warning=1
state_critical=2
state_unknown=3

# Plugin parameters value if not defined
list_warning_threshold="70,40,30"
list_critical_threshold="90,60,40"
interval_sec=1
num_report=1

# Plugin variable description
progname=$(basename $0)
# $release is used by print_usage but is never set in this listing; the empty
# value is a placeholder, not the author's version string.
release=""

if [ ! -x $iostat ]; then
    echo "UNKNOWN: iostat not found or is not executable by the nagios user."
    exit $state_unknown
fi

print_usage() {
    echo ""
    echo "$progname $release - CPU Utilization check script for Nagios"
    echo ""
    echo "Usage: check_cpu_utili.sh -w -c (-i -n)"
    echo ""
    echo "  -w  Warning threshold in % for warn_user,warn_system,warn_iowait CPU (default : 70,40,30)"
    echo "      Exit with WARNING status if cpu exceeds warn_n"
    echo "  -c  Critical threshold in % for crit_user,crit_system,crit_iowait CPU (default : 90,60,40)"
    echo "      Exit with CRITICAL status if cpu exceeds crit_n"
    echo "  -i  Interval in seconds for iostat (default : 1)"
    echo "  -n  Number report for iostat (default : 1)"
    echo "  -h  Show this page"
    echo ""
    echo "Usage: $progname"
    echo "Usage: $progname --help"
    echo ""
}

# print_release is called by the -v option but is not defined in this listing;
# this minimal definition is an assumption.
print_release() {
    echo "$progname $release"
}

print_help() {
    print_usage
    echo ""
    echo "This plugin will check cpu utilization (user,system,CPU_Iowait in %)"
    echo ""
}

# Parse parameters
while [ $# -gt 0 ]; do
    case "$1" in
        -h | --help)
            print_help
            exit $state_ok
            ;;
        -v | --version)
            print_release
            exit $state_ok
            ;;
        -w | --warning)
            shift
            list_warning_threshold=$1
            ;;
        -c | --critical)
            shift
            list_critical_threshold=$1
            ;;
        -i | --interval)
            shift
            interval_sec=$1
            ;;
        -n | --number)
            shift
            num_report=$1
            ;;
        *)
            echo "Unknown argument: $1"
            print_usage
            exit $state_unknown
            ;;
    esac
    shift
done

# List to table for warning threshold
tab_warning_threshold=(`echo $list_warning_threshold | sed 's/,/ /g'`)
if [ "${#tab_warning_threshold[@]}" -ne "3" ]; then
    echo "Error : Bad count parameter in Warning Threshold"
    exit $state_warning
else
    user_warning_threshold=`echo ${tab_warning_threshold[0]}`
    system_warning_threshold=`echo ${tab_warning_threshold[1]}`
    iowait_warning_threshold=`echo ${tab_warning_threshold[2]}`
fi

# List to table for critical threshold
tab_critical_threshold=(`echo $list_critical_threshold | sed 's/,/ /g'`)
if [ "${#tab_critical_threshold[@]}" -ne "3" ]; then
    echo "Error : Bad count parameter in Critical Threshold"
    exit $state_warning
else
    user_critical_threshold=`echo ${tab_critical_threshold[0]}`
    system_critical_threshold=`echo ${tab_critical_threshold[1]}`
    iowait_critical_threshold=`echo ${tab_critical_threshold[2]}`
fi

# Each warning threshold must be strictly lower than its critical threshold.
if [ ${tab_warning_threshold[0]} -ge ${tab_critical_threshold[0]} -o ${tab_warning_threshold[1]} -ge ${tab_critical_threshold[1]} -o ${tab_warning_threshold[2]} -ge ${tab_critical_threshold[2]} ]; then
    echo "Error : Critical CPU Threshold lower than Warning CPU Threshold"
    exit $state_warning
fi

# Take the last iostat report line and turn its columns into ';'-separated fields.
# Fields: 2=%user 3=%nice 4=%system 5=%iowait 6=%steal 7=%idle
cpu_report=`$iostat -c $interval_sec $num_report | sed -e 's/,/./g' | tr -s ' ' ';' | sed '/^$/d' | tail -1`
cpu_report_sections=`echo ${cpu_report} | grep ';' -o | wc -l`
cpu_user=`echo $cpu_report | cut -d ";" -f 2`
cpu_system=`echo $cpu_report | cut -d ";" -f 4`
cpu_iowait=`echo $cpu_report | cut -d ";" -f 5`
cpu_steal=`echo $cpu_report | cut -d ";" -f 6`
cpu_idle=`echo $cpu_report | cut -d ";" -f 7`
nagios_status="user=${cpu_user}%,system=${cpu_system}%,iowait=${cpu_iowait}%,idle=${cpu_idle}%"
nagios_data="cpuuser=${cpu_user};${tab_warning_threshold[0]};${tab_critical_threshold[0]};0"

# Keep only the integer part, because [ ] cannot compare floating-point values.
cpu_user_major=`echo $cpu_user | cut -d "." -f 1`
cpu_system_major=`echo $cpu_system | cut -d "." -f 1`
cpu_iowait_major=`echo $cpu_iowait | cut -d "." -f 1`
cpu_idle_major=`echo $cpu_idle | cut -d "." -f 1`
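
# Return
# The status word now matches the exit code (the listing printed "ok" in every
# branch), and the performance data carries the configured user thresholds
# instead of the hard-coded 70;90.
perf="cpu_user=${cpu_user}%;${user_warning_threshold};${user_critical_threshold};0;100"
if [ ${cpu_user_major} -ge $user_critical_threshold ]; then
    echo "CPU STATISTICS CRITICAL:${nagios_status} | $perf"
    exit $state_critical
elif [ ${cpu_system_major} -ge $system_critical_threshold ]; then
    echo "CPU STATISTICS CRITICAL:${nagios_status} | $perf"
    exit $state_critical
elif [ ${cpu_iowait_major} -ge $iowait_critical_threshold ]; then
    echo "CPU STATISTICS CRITICAL:${nagios_status} | $perf"
    exit $state_critical
elif [ ${cpu_user_major} -ge $user_warning_threshold ] && [ ${cpu_user_major} -lt $user_critical_threshold ]; then
    echo "CPU STATISTICS WARNING:${nagios_status} | $perf"
    exit $state_warning
elif [ ${cpu_system_major} -ge $system_warning_threshold ] && [ ${cpu_system_major} -lt $system_critical_threshold ]; then
    echo "CPU STATISTICS WARNING:${nagios_status} | $perf"
    exit $state_warning
elif [ ${cpu_iowait_major} -ge $iowait_warning_threshold ] && [ ${cpu_iowait_major} -lt $iowait_critical_threshold ]; then
    echo "CPU STATISTICS WARNING:${nagios_status} | $perf"
    exit $state_warning
else
    echo "CPU STATISTICS OK:${nagios_status} | $perf"
    exit $state_ok
fi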
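This script is based on material from the official Nagios site, https://exchange.nagios.org/, with the code simplified and ported. The original ran under ksh and has been ported to bash here; ksh and bash define arrays differently. Also note that the shell itself does not support floating-point arithmetic, although bc or awk can be used to handle it.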
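The difference between system load and CPU utilization can be made concrete with a small sketch. Load average, available in Python through `os.getloadavg()`, reflects how many tasks are queued to run, while utilization is the share of CPU time spent in each mode between two samples. The functions below compute user% and system% from two snapshots of the aggregate `cpu` line in `/proc/stat` on Linux; the function names and the sample numbers in the usage note are illustrative assumptions, and this is not how the iostat-based script obtains its figures.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into integer tick counts.

    Column order on Linux: user nice system idle iowait irq softirq steal ...
    """
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in parts[1:]]


def user_system_percent(before, after):
    """Return (user%, system%) for the interval between two samples."""
    a = parse_cpu_line(before)
    b = parse_cpu_line(after)
    deltas = [y - x for x, y in zip(a, b)]
    # Sum user..steal only; guest time is already included in user.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0, 0.0
    user = 100.0 * deltas[0] / total
    system = 100.0 * deltas[2] / total
    return user, system
```

With the made-up samples `"cpu 100 0 50 800 50 0 0 0 0 0"` and `"cpu 160 0 80 900 60 0 0 0 0 0"`, the deltas total 200 ticks, of which 60 are user and 30 are system. Because Python handles floating-point values natively, no bc or awk step is needed for the division.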
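In addition, if the goal is to produce graphs with pnp4nagios (which lets you observe CPU utilization peaks over a period of time), the script can be simplified further. The script is shown below (tested on Amazon Linux AMI x86_64):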
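#!/bin/bash
# Nagios return codes
state_ok=0
state_warning=1
state_critical=2
state_unknown=3
# Plugin parameters value if not defined
list_warning_threshold="90"
list_critical_threshold="95"
# The interval setting is missing from this listing; 1 second, the default of
# the full script above, is assumed here.
interval_sec=1
num_report=5
cpu_report=`iostat -c $interval_sec $num_report | sed -e 's/,/./g' | tr -s ' ' ';' | sed '/^$/d' | tail -1`
cpu_user=`echo $cpu_report | cut -d ";" -f 2`
cpu_system=`echo $cpu_report | cut -d ";" -f 4`
# Add for integer shell issue
cpu_user_major=`echo $cpu_user | cut -d "." -f 1`
cpu_system_major=`echo $cpu_system | cut -d "." -f 1`
# user% + system% as a float (bc), then its integer part for the [ ] tests.
cpu_utili_cou=`echo ${cpu_user} + ${cpu_system} | bc`
cpu_utili_counter=`echo $cpu_utili_cou | cut -d "." -f 1`
# bc prints values below 1 as ".75", which leaves an empty integer part.
cpu_utili_counter=${cpu_utili_counter:-0}
# Performance data uses the thresholds above instead of the hard-coded 80;90.
perf_limits="${list_warning_threshold};${list_critical_threshold}"
if [ ${cpu_utili_counter} -lt ${list_warning_threshold} ]
then
    echo "OK - CPUCOU=${cpu_utili_cou}% | CPUCOU=${cpu_utili_cou}%;${perf_limits}"
    exit ${state_ok}
fi
# -ge is used so that a value exactly equal to a threshold is still reported.
if [ ${cpu_utili_counter} -ge ${list_warning_threshold} -a ${cpu_utili_counter} -lt ${list_critical_threshold} ]
then
    echo "Warning - CPUCOU=${cpu_utili_counter}% | CPUCOU=${cpu_utili_counter}%;${perf_limits}"
    exit ${state_warning}
fi
if [ ${cpu_utili_counter} -ge ${list_critical_threshold} ]
then
    echo "Critical - CPUCOU=${cpu_utili_counter}% | CPUCOU=${cpu_utili_counter}%;${perf_limits}"
    exit ${state_critical}
fi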
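pnp4nagios draws its graphs from the performance data that follows the `|` in the plugin output. As a rough sketch, the helpers below build that part in the conventional `label=value[unit];warn;crit;min;max` layout. The names `perfdata` and `status_line` are hypothetical, and the values in the usage note are examples rather than measured results.

```python
def perfdata(label, value, warn, crit, uom="%", minimum=None, maximum=None):
    """Build one performance-data item: label=value[uom];warn;crit[;min;max]."""
    fields = ["%s=%s%s" % (label, value, uom), str(warn), str(crit)]
    # min and max are optional trailing fields; emit both slots if either is set.
    if minimum is not None or maximum is not None:
        fields.append("" if minimum is None else str(minimum))
        fields.append("" if maximum is None else str(maximum))
    return ";".join(fields)


def status_line(text, *items):
    """Join human-readable status text and perfdata items with ' | '."""
    return "%s | %s" % (text, " ".join(items))
```

For example, `status_line("OK - CPUCOU=42.5%", perfdata("CPUCOU", 42.5, 90, 95))` produces a line in the same shape as the output of the simplified script, with the warning and critical values taken from its thresholds.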