Linux Clusters and Automated Operations: 2.6.4 Development Scripts

<b>2.6.4 Development Scripts</b>

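Business requirements change constantly, and the open-source solutions available on the Internet do not always cover everything. In those cases you need to write your own development scripts to meet the needs of the job. Although such scripts can often run standalone, I still try to make their return results conform to the format Nagios recognizes, so that they can work with Nagios to send alert emails and messages.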
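As a rough illustration of what a Nagios-recognizable result looks like: a plugin prints a single status line and reports its state through the exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The sketch below only shows that convention; the names `STATES` and `format_status` are hypothetical and are not used by the scripts in this section.

```python
# Minimal sketch of the Nagios plugin convention: one status line, one exit code.
# STATES and format_status are illustrative names, not part of any Nagios library.

STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def format_status(code, message):
    """Return the single line a plugin prints before exiting with `code`."""
    return "%s - %s" % (STATES[code], message)


# A real plugin would print the line and then call sys.exit(code).
print(format_status(1, "connect counts is 16000"))
```

Keeping the formatting in one small function makes it easy to give every custom check the same output shape.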

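1. Monitoring whether Redis is running normally

The NoSQL workload I deal with in production is mainly Redis, mostly used to serve high-access loads over large volumes of data. To make the most of the available resources, each Redis instance is allocated only a modest amount of memory, and an instance sometimes crashes when colleagues on the development team import a large IP list. I therefore developed a Redis monitoring script that works together with Nagios. It takes the host, the port, and the warning and critical memory thresholds (in GB) as arguments. The script is shown below (tested on Amazon Linux AMI x86_64):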

#!/usr/bin/python
# check redis
# Nagios plugin, please install the redis-py module.
import redis
import sys

# Nagios exit codes
status_ok = 0
status_warning = 1
status_critical = 2

# Arguments: host, port, warning threshold (GB), critical threshold (GB)
host = sys.argv[1]
port = int(sys.argv[2])
warning = float(sys.argv[3])
critical = float(sys.argv[4])


def connect_redis(host, port):
    r = redis.Redis(host, port, socket_timeout=5, socket_connect_timeout=5)
    return r


def main():
    r = connect_redis(host, port)
    try:
        r.ping()
    except Exception:
        print('%s %s down' % (host, port))
        sys.exit(status_critical)

    redis_info = r.info()
    # used_memory is reported in bytes; convert it to GB
    used_mem = redis_info['used_memory'] / 1024 / 1024 / 1024.0
    used_mem_human = redis_info['used_memory_human']

    if warning <= used_mem < critical:
        print('%s %s use memory warning %s' % (host, port, used_mem_human))
        sys.exit(status_warning)
    elif used_mem >= critical:
        print('%s %s use memory critical %s' % (host, port, used_mem_human))
        sys.exit(status_critical)
    else:
        print('%s %s use memory ok %s' % (host, port, used_mem_human))
        sys.exit(status_ok)


if __name__ == '__main__':
    main()
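The threshold logic of the Redis check can be isolated as a pure function, which makes it testable without a running Redis instance. This is only a sketch: `classify_memory` is a hypothetical helper name, and the thresholds are assumed to be in GB as in the script.

```python
# Sketch of the threshold logic from the Redis check, isolated as a pure function.
# classify_memory is a hypothetical helper; thresholds are in GB, input is in bytes.

STATUS_OK, STATUS_WARNING, STATUS_CRITICAL = 0, 1, 2


def classify_memory(used_bytes, warning_gb, critical_gb):
    """Return the Nagios exit code for a used_memory value given in bytes."""
    used_gb = used_bytes / 1024.0 / 1024.0 / 1024.0
    if used_gb >= critical_gb:
        return STATUS_CRITICAL
    if used_gb >= warning_gb:
        return STATUS_WARNING
    return STATUS_OK


# 1.5 GB used against warning=1 GB and critical=2 GB
print(classify_memory(1.5 * 1024 ** 3, 1.0, 2.0))  # 1
```

Checking the critical threshold first means the two comparisons cannot overlap, so no value falls between the branches.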

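2. Monitoring the number of IP connections on a machine

The requirement is fairly simple: count the IP connections first. If ip_conns is below 15,000 the state is OK, between 15,000 and 20,000 it is a warning, and above 20,000 it is critical. The script is shown below (tested on Amazon Linux AMI x86_64):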

#!/bin/bash
# Nagios plugin for IP connection counts
# $1 = 15000 (warning)  $2 = 20000 (critical)

# Number of established TCP connections
ip_conns=`netstat -an | grep tcp | grep EST | wc -l`
# Per-state summary, joined into one comma-separated line
messages=`netstat -ant | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}' | tr -s '\n' ',' | sed -r 's/(.*),/\1\n/g'`

if [ $ip_conns -lt $1 ]
then
    echo "$messages,OK -connect counts is $ip_conns"
    exit 0
elif [ $ip_conns -ge $1 -a $ip_conns -lt $2 ]
then
    echo "$messages,WARNING -connect counts is $ip_conns"
    exit 1
elif [ $ip_conns -ge $2 ]
then
    echo "$messages,CRITICAL -connect counts is $ip_conns"
    exit 2
fi
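The awk one-liner in the connection-count script tallies the last column of every `tcp` line. A Python sketch of the same counting is shown here for readers less familiar with awk. `SAMPLE` is made-up data in the shape of `netstat -ant` output, and `count_states` and `classify` are hypothetical helper names.

```python
# Sketch: count TCP connection states from `netstat -ant` style text.
# SAMPLE is made-up illustrative data; count_states and classify are hypothetical helpers.
from collections import Counter

SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:80             10.0.0.9:51234          ESTABLISHED
tcp        0      0 10.0.0.5:80             10.0.0.7:40022          ESTABLISHED
tcp        0      0 10.0.0.5:80             10.0.0.8:40100          TIME_WAIT
"""


def count_states(netstat_text):
    """Mirror the awk one-liner: tally the last column of every tcp line."""
    states = Counter()
    for line in netstat_text.splitlines():
        if line.startswith("tcp"):
            states[line.split()[-1]] += 1
    return states


def classify(established, warning, critical):
    """Map a connection count to a Nagios exit code (0, 1, or 2)."""
    if established >= critical:
        return 2
    if established >= warning:
        return 1
    return 0


counts = count_states(SAMPLE)
print(counts["ESTABLISHED"], classify(counts["ESTABLISHED"], 15000, 20000))
```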

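3. Script for monitoring CPU utilization on a machine

On the production bidder machines, CPU utilization (sys% + user%) can reach 100% during busy peak periods, so that subsequent traffic sent to them cannot get in at all, even though the machine itself, the system load, and the nginx+lua processes all look completely normal. This calls for a CPU utilization script that alerts when a user-defined threshold is exceeded, so that operations staff can add bidder AMI machines in batches to absorb the peak (AWS EC2 instances can be billed by the hour). Note the distinction between system load and CPU utilization here. The script is shown below (tested on Amazon Linux AMI x86_64):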

#!/bin/bash
# ==============================================================================
# CPU utilization statistics plugin for Nagios
#
# Usage  : ./check_cpu_utili.sh [-w <user,system,iowait>] [-c <user,system,iowait>] ( [ -i <intervals in second> ] [ -n <report number> ])
#
# Example: ./check_cpu_utili.sh
#          ./check_cpu_utili.sh -w 70,40,30 -c 90,60,40
#          ./check_cpu_utili.sh -w 70,40,30 -c 90,60,40 -i 3 -n 5
# ------------------------------------------------------------------------------

# Paths to commands used in this script. These may have to be modified to match your system setup.
iostat="/usr/bin/iostat"

# Nagios return codes
state_ok=0
state_warning=1
state_critical=2
state_unknown=3

# Plugin parameter values if not defined
list_warning_threshold="70,40,30"
list_critical_threshold="90,60,40"
interval_sec=1
# The first iostat report is an average since boot, so take more than one
# (3 matches the default documented in the usage text)
num_report=3

# Plugin variable description
progname=$(basename $0)

if [ ! -x $iostat ]; then
    echo "UNKNOWN: iostat not found or is not executable by the nagios user."
    exit $state_unknown
fi

print_usage() {
    echo ""
    echo "$progname - CPU utilization check script for Nagios"
    echo ""
    echo "Usage: check_cpu_utili.sh -w -c (-i -n)"
    echo ""
    echo "  -w  Warning threshold in % for warn_user,warn_system,warn_iowait CPU (default : 70,40,30)"
    echo "      Exit with WARNING status if CPU exceeds warn_n"
    echo "  -c  Critical threshold in % for crit_user,crit_system,crit_iowait CPU (default : 90,60,40)"
    echo "      Exit with CRITICAL status if CPU exceeds crit_n"
    echo "  -i  Interval in seconds for iostat (default : 1)"
    echo "  -n  Number of reports for iostat (default : 3)"
    echo "  -h  Show this page"
    echo ""
    echo "Usage: $progname"
    echo "Usage: $progname --help"
    echo ""
}

# Called by -v below. Its body is not shown in this listing, so this minimal
# version is an assumption that only prints the program name.
print_release() {
    echo "$progname"
}

print_help() {
    print_usage
    echo ""
    echo "This plugin will check CPU utilization (user,system,cpu_iowait in %)"
    echo ""
}

# Parse parameters
while [ $# -gt 0 ]; do
    case "$1" in
        -h | --help)
            print_help
            exit $state_ok
            ;;
        -v | --version)
            print_release
            exit $state_ok
            ;;
        -w | --warning)
            shift
            list_warning_threshold=$1
            ;;
        -c | --critical)
            shift
            list_critical_threshold=$1
            ;;
        -i | --interval)
            shift
            interval_sec=$1
            ;;
        -n | --number)
            shift
            num_report=$1
            ;;
        *)
            echo "Unknown argument: $1"
            print_usage
            exit $state_unknown
            ;;
    esac
    shift
done

# List to table for warning threshold
tab_warning_threshold=(`echo $list_warning_threshold | sed 's/,/ /g'`)
if [ "${#tab_warning_threshold[@]}" -ne "3" ]; then
    echo "ERROR : Bad count parameter in warning threshold"
    exit $state_warning
else
    user_warning_threshold=`echo ${tab_warning_threshold[0]}`
    system_warning_threshold=`echo ${tab_warning_threshold[1]}`
    iowait_warning_threshold=`echo ${tab_warning_threshold[2]}`
fi

# List to table for critical threshold
tab_critical_threshold=(`echo $list_critical_threshold | sed 's/,/ /g'`)
if [ "${#tab_critical_threshold[@]}" -ne "3" ]; then
    echo "ERROR : Bad count parameter in critical threshold"
    exit $state_warning
else
    user_critical_threshold=`echo ${tab_critical_threshold[0]}`
    system_critical_threshold=`echo ${tab_critical_threshold[1]}`
    iowait_critical_threshold=`echo ${tab_critical_threshold[2]}`
fi

# Each warning threshold must be lower than the matching critical threshold
if [ ${tab_warning_threshold[0]} -ge ${tab_critical_threshold[0]} -o ${tab_warning_threshold[1]} -ge ${tab_critical_threshold[1]} -o ${tab_warning_threshold[2]} -ge ${tab_critical_threshold[2]} ]; then
    echo "ERROR : Critical CPU threshold lower than warning CPU threshold"
    exit $state_warning
fi

# Take the last report line from iostat and turn it into ';'-separated fields
cpu_report=`iostat -c $interval_sec $num_report | sed -e 's/,/./g' | tr -s ' ' ';' | sed '/^$/d' | tail -1`
cpu_report_sections=`echo ${cpu_report} | grep ';' -o | wc -l`
# Field 1 is empty (leading separator), field 3 is %nice
cpu_user=`echo $cpu_report | cut -d ";" -f 2`
cpu_system=`echo $cpu_report | cut -d ";" -f 4`
cpu_iowait=`echo $cpu_report | cut -d ";" -f 5`
cpu_steal=`echo $cpu_report | cut -d ";" -f 6`
cpu_idle=`echo $cpu_report | cut -d ";" -f 7`

nagios_status="user=${cpu_user}%,system=${cpu_system}%,iowait=${cpu_iowait}%,idle=${cpu_idle}%"
nagios_data="cpuuser=${cpu_user};${tab_warning_threshold[0]};${tab_critical_threshold[0]};0"

# Integer parts only, because [ ] cannot compare floating-point values
cpu_user_major=`echo $cpu_user | cut -d "." -f 1`
cpu_system_major=`echo $cpu_system | cut -d "." -f 1`
cpu_iowait_major=`echo $cpu_iowait | cut -d "." -f 1`
cpu_idle_major=`echo $cpu_idle | cut -d "." -f 1`

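# Return
if [ ${cpu_user_major} -ge $user_critical_threshold ]; then
    echo "CPU STATISTICS CRITICAL:${nagios_status} | cpu_user=${cpu_user}%;${user_warning_threshold};${user_critical_threshold};0;100"
    exit $state_critical
elif [ ${cpu_system_major} -ge $system_critical_threshold ]; then
    echo "CPU STATISTICS CRITICAL:${nagios_status} | cpu_user=${cpu_user}%;${user_warning_threshold};${user_critical_threshold};0;100"
    exit $state_critical
elif [ ${cpu_iowait_major} -ge $iowait_critical_threshold ]; then
    echo "CPU STATISTICS CRITICAL:${nagios_status} | cpu_user=${cpu_user}%;${user_warning_threshold};${user_critical_threshold};0;100"
    exit $state_critical
elif [ ${cpu_user_major} -ge $user_warning_threshold ] && [ ${cpu_user_major} -lt $user_critical_threshold ]; then
    echo "CPU STATISTICS WARNING:${nagios_status} | cpu_user=${cpu_user}%;${user_warning_threshold};${user_critical_threshold};0;100"
    exit $state_warning
elif [ ${cpu_system_major} -ge $system_warning_threshold ] && [ ${cpu_system_major} -lt $system_critical_threshold ]; then
    echo "CPU STATISTICS WARNING:${nagios_status} | cpu_user=${cpu_user}%;${user_warning_threshold};${user_critical_threshold};0;100"
    exit $state_warning
elif [ ${cpu_iowait_major} -ge $iowait_warning_threshold ] && [ ${cpu_iowait_major} -lt $iowait_critical_threshold ]; then
    echo "CPU STATISTICS WARNING:${nagios_status} | cpu_user=${cpu_user}%;${user_warning_threshold};${user_critical_threshold};0;100"
    exit $state_warning
else
    echo "CPU STATISTICS OK:${nagios_status} | cpu_user=${cpu_user}%;${user_warning_threshold};${user_critical_threshold};0;100"
    exit $state_ok
fi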
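The core of the CPU check, parsing the values line printed by `iostat -c` and comparing user, system, and iowait against the two threshold triples, can be sketched in Python as follows. The helper names `parse_cpu_line` and `check_cpu` are hypothetical, and the sample line is made-up data in the column order %user %nice %system %iowait %steal %idle.

```python
# Sketch: parse the values line of `iostat -c` and compare it with the
# "user,system,iowait" threshold triples used by the shell plugin.
# parse_cpu_line and check_cpu are hypothetical helper names.

def parse_cpu_line(line):
    """Columns printed by iostat -c: %user %nice %system %iowait %steal %idle."""
    # Some locales print a decimal comma; normalize it like the sed step does.
    fields = [float(v) for v in line.replace(",", ".").split()]
    names = ("user", "nice", "system", "iowait", "steal", "idle")
    return dict(zip(names, fields))


def check_cpu(cpu, warning="70,40,30", critical="90,60,40"):
    """Return 2 if any value reaches its critical threshold, 1 for warning, else 0."""
    warn = [float(v) for v in warning.split(",")]
    crit = [float(v) for v in critical.split(",")]
    values = [cpu["user"], cpu["system"], cpu["iowait"]]
    if any(v >= c for v, c in zip(values, crit)):
        return 2
    if any(v >= w for v, w in zip(values, warn)):
        return 1
    return 0


cpu = parse_cpu_line("          75.20    0.00   12.30    1.50    0.00   11.00")
print(check_cpu(cpu))  # 1, because user (75.2) is between 70 and 90
```

Because Python compares floats directly, the integer-truncation step the shell version needs does not appear here.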

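This script is based on material from the official Nagios site, https://exchange.nagios.org/, with the code trimmed down and ported. The original ran under ksh and has been ported to bash here; note that arrays are defined differently in ksh and bash. Also keep in mind that the shell itself does not support floating-point arithmetic, but bc or awk can be used to handle it.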
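Since the difference between system load and CPU utilization matters for this check, a small sketch may help. Load average (what `uptime` reports) counts tasks that are runnable or waiting, while utilization is the share of CPU time spent doing work over an interval. The code below computes user + system utilization from two snapshots of the aggregate `cpu` line in `/proc/stat`. The snapshot strings are made-up sample data, and `cpu_utilization` is a hypothetical helper name.

```python
# Sketch: user + system CPU utilization from two /proc/stat style "cpu" lines.
# Column order after the "cpu" label: user nice system idle iowait irq softirq steal ...
# The snapshot strings are made-up sample data; cpu_utilization is a hypothetical helper.

def cpu_utilization(before, after):
    """Percentage of time spent in user + system between two snapshots."""
    a = [int(v) for v in before.split()[1:]]
    b = [int(v) for v in after.split()[1:]]
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    if total == 0:
        return 0.0
    busy = deltas[0] + deltas[2]  # user + system, matching sys% + user% in the text
    return 100.0 * busy / total


before = "cpu  1000 0 500 8000 100 0 0 0"
after = "cpu  1900 0 600 8000 100 0 0 0"
print(cpu_utilization(before, after))  # 100.0
```

In this sample the CPU was fully busy during the interval, which is the situation described for the bidder machines, and nothing in the calculation depends on how many tasks were queued.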

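In addition, if the script is to be used with pnp4nagios for graphing (pnp4nagios lets you observe peak CPU utilization over a period of time), it can be trimmed down further. The script is shown below (tested on Amazon Linux AMI x86_64):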

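#!/bin/bash
# Nagios return codes
state_ok=0
state_warning=1
state_critical=2
state_unknown=3

# Thresholds for user% + system%
list_warning_threshold="90"
list_critical_threshold="95"
interval_sec=1
num_report=5

cpu_report=`iostat -c $interval_sec $num_report | sed -e 's/,/./g' | tr -s ' ' ';' | sed '/^$/d' | tail -1`
cpu_user=`echo $cpu_report | cut -d ";" -f 2`
cpu_system=`echo $cpu_report | cut -d ";" -f 4`

# add for integer shell issue
cpu_user_major=`echo $cpu_user | cut -d "." -f 1`
cpu_utili_cou=`echo ${cpu_user} + ${cpu_system} | bc`
cpu_utili_counter=`echo $cpu_utili_cou | cut -d "." -f 1`
# bc prints values below 1 as ".75", which leaves the integer part empty
cpu_utili_counter=${cpu_utili_counter:-0}

if [ ${cpu_utili_counter} -lt ${list_warning_threshold} ]
then
    echo "OK - cpucou=${cpu_utili_cou}% | cpucou=${cpu_utili_cou}%;${list_warning_threshold};${list_critical_threshold}"
    exit ${state_ok}
elif [ ${cpu_utili_counter} -ge ${list_warning_threshold} -a ${cpu_utili_counter} -lt ${list_critical_threshold} ]
then
    echo "WARNING - cpucou=${cpu_utili_cou}% | cpucou=${cpu_utili_cou}%;${list_warning_threshold};${list_critical_threshold}"
    exit ${state_warning}
else
    echo "CRITICAL - cpucou=${cpu_utili_cou}% | cpucou=${cpu_utili_cou}%;${list_warning_threshold};${list_critical_threshold}"
    exit ${state_critical}
fi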
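Graphing tools such as pnp4nagios read the performance data that follows the `|` in the plugin output, in the form `label=value[UOM];warn;crit;min;max`. The sketch below builds such a line for the combined user + system value. The function names `perfdata` and `cpu_check_line` are hypothetical, and the 90/95 defaults are taken from the simplified script.

```python
# Sketch: build the status line and perfdata for a combined user+system CPU check.
# perfdata and cpu_check_line are hypothetical helper names.

def perfdata(label, value, warn, crit, uom="%", minimum=0, maximum=100):
    """Format one perfdata field: label=value[UOM];warn;crit;min;max."""
    return "%s=%.2f%s;%s;%s;%s;%s" % (label, value, uom, warn, crit, minimum, maximum)


def cpu_check_line(user, system, warn=90, crit=95):
    """Return (exit_code, output_line) for user% + system%."""
    total = user + system
    if total >= crit:
        code, state = 2, "CRITICAL"
    elif total >= warn:
        code, state = 1, "WARNING"
    else:
        code, state = 0, "OK"
    line = "%s - cpucou=%.2f%% | %s" % (state, total, perfdata("cpucou", total, warn, crit))
    return code, line


code, line = cpu_check_line(80.25, 12.5)
print(code, line)
```

Passing the real thresholds into the perfdata field keeps the warning and critical lines on the graph in step with the values the check actually uses.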
