
Linux Clusters and Automated Operations, 2.6.4: Development Scripts

<b>2.6.4 Development Scripts</b>

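Business requirements change constantly, and the open-source solutions available on the internet sometimes cannot cover all of them. When that happens, you need to write your own development-type scripts to meet the needs of the job. Although these scripts can usually run on their own, I still try to have them return results in a format that nagios recognizes, so that they can work with nagios to send alert emails and messages.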
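As a rough illustration of what "a format that nagios recognizes" means, the sketch below builds the two things a plugin hands back: an exit code (0, 1, 2, or 3, the same values the scripts in this section use) and a single status line, optionally followed by `|` and performance data. The helper names `classify` and `format_result` are hypothetical and are not part of nagios or of the scripts in this section.

```python
# Exit codes nagios expects from a plugin.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warning, critical):
    """Map a measurement to a state, assuming higher values are worse."""
    if value >= critical:
        return CRITICAL
    if value >= warning:
        return WARNING
    return OK


def format_result(label, value, warning, critical, unit=""):
    """Return (exit_code, status_line); the part after '|' is performance data."""
    state = classify(value, warning, critical)
    line = "%s - %s=%s%s | %s=%s%s;%s;%s" % (
        STATE_NAMES[state], label, value, unit,
        label, value, unit, warning, critical)
    return state, line


code, line = format_result("conns", 16000, 15000, 20000)
print(line)
```

A real plugin would print the line and then call `sys.exit(code)`, so that nagios can read both the text and the state.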

1. Monitoring whether redis is running normally

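The production NoSQL workload I deal with is mainly redis, which is mostly used to serve high-traffic access to large volumes of data. To make the most of the available resources, each redis instance is allocated a fairly small amount of memory, and an instance sometimes crashes when a colleague on the development team imports a large IP list. I therefore wrote a redis monitoring script that works together with nagios. The script is shown below (tested on Amazon Linux AMI x86_64):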

#!/usr/bin/python
# check redis
# nagios plugin, please install the redis-py module.
# usage: <script> <host> <port> <warning_gb> <critical_gb>
import sys

import redis

status_ok = 0
status_warning = 1
status_critical = 2

host = sys.argv[1]
port = int(sys.argv[2])
warning = float(sys.argv[3])
critical = float(sys.argv[4])


def connect_redis(host, port):
    r = redis.Redis(host, port, socket_timeout=5, socket_connect_timeout=5)
    return r


def main():
    r = connect_redis(host, port)
    try:
        r.ping()
    except Exception:
        # any failure to reach the instance is reported as critical
        print('%s %s down' % (host, port))
        sys.exit(status_critical)

    redis_info = r.info()
    # used_memory is reported in bytes; the thresholds are given in GB
    used_mem = redis_info['used_memory'] / 1024.0 / 1024 / 1024
    used_mem_human = redis_info['used_memory_human']

    if warning <= used_mem < critical:
        print('%s %s use memory warning %s' % (host, port, used_mem_human))
        sys.exit(status_warning)
    elif used_mem >= critical:
        print('%s %s use memory critical %s' % (host, port, used_mem_human))
        sys.exit(status_critical)
    else:
        print('%s %s use memory ok %s' % (host, port, used_mem_human))
        sys.exit(status_ok)


if __name__ == '__main__':
    main()
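The threshold logic of the redis check can be tried without a live instance by isolating it from the connection code. The sketch below assumes an input shaped like the dict returned by redis-py's `Redis.info()`, where `used_memory` is a byte count; the function name `memory_state` and the sample values are hypothetical.

```python
STATUS_OK, STATUS_WARNING, STATUS_CRITICAL = 0, 1, 2
BYTES_PER_GB = 1024.0 ** 3


def memory_state(info, warning_gb, critical_gb):
    """Classify a redis INFO dict against thresholds given in GB.

    `info` is expected to look like the dict returned by redis-py's
    Redis.info(), where 'used_memory' is a byte count.
    """
    used_gb = info['used_memory'] / BYTES_PER_GB
    if used_gb >= critical_gb:
        return STATUS_CRITICAL
    if used_gb >= warning_gb:
        return STATUS_WARNING
    return STATUS_OK


# Hypothetical INFO excerpt: 1.5 GB in use.
sample_info = {'used_memory': 1610612736, 'used_memory_human': '1.50G'}
print(memory_state(sample_info, warning_gb=1.0, critical_gb=2.0))
```

Keeping the decision in a pure function like this makes it easy to check the boundary cases (exactly at the warning or critical threshold) before wiring the plugin into nagios.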

2. Monitoring the number of IP connections on a machine

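The requirement is fairly simple: count the established IP connections first. If ip_conns is below 15,000 the state is OK, between 15,000 and 20,000 it is a warning, and above 20,000 it is critical. The script is shown below (tested on Amazon Linux AMI x86_64):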

#!/bin/bash
# nagios plugin for ip connects
# $1 = 15000 $2 = 20000

# number of established tcp connections
ip_conns=`netstat -an | grep tcp | grep EST | wc -l`
# per-state connection counts, joined into one comma-separated line
messages=`netstat -ant | awk '/^tcp/ {++s[$NF]} END {for(a in s) print a, s[a]}' | tr -s '\n' ',' | sed -r 's/(.*),/\1\n/g'`

if [ $ip_conns -lt $1 ]
then
    echo "$messages,ok -connect counts is $ip_conns"
    exit 0
elif [ $ip_conns -lt $2 ]
then
    echo "$messages,warning -connect counts is $ip_conns"
    exit 1
else
    echo "$messages,critical -connect counts is $ip_conns"
    exit 2
fi
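The awk step in the script tallies TCP connections by the last column of `netstat -ant`, which holds the connection state. The same counting can be sketched in Python as follows; the function name `count_tcp_states` and the sample output (including its addresses) are made up for illustration.

```python
from collections import Counter

# Illustrative `netstat -ant` style output, not taken from a real host.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:22             10.0.0.9:51234          ESTABLISHED
tcp        0      0 10.0.0.5:80             10.0.0.7:40000          ESTABLISHED
tcp        0      0 10.0.0.5:80             10.0.0.8:40001          TIME_WAIT
"""


def count_tcp_states(netstat_output):
    """Count tcp lines by their last column, like awk '/^tcp/ {++s[$NF]}'."""
    counts = Counter()
    for line in netstat_output.splitlines():
        if line.startswith("tcp"):
            counts[line.split()[-1]] += 1
    return counts


states = count_tcp_states(SAMPLE)
print(", ".join("%s %d" % (name, states[name]) for name in sorted(states)))
```

In a live check the input would come from running `netstat -ant` (for example through `subprocess`), and the ESTABLISHED count would be compared with the two thresholds.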

3. A script for monitoring CPU utilization on a machine

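On the production bidder machines, CPU utilization (sys% + user%) can reach 100% during peak business hours, so that traffic subsequently routed to them cannot get in at all, even though the machine, the system load, and the nginx+lua processes all look completely normal. This calls for a CPU utilization script that alerts when a custom threshold is exceeded, so that operations staff can add bidder AMI machines in batches to absorb the peak (AWS EC2 instances can be billed by the hour). Take care here to distinguish system load from CPU utilization. The script is shown below (tested on Amazon Linux AMI x86_64):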

#!/bin/bash
# ==============================================================================
# cpu utilization statistics plugin for nagios
#
# usage   : ./check_cpu_utili.sh [-w <user,system,iowait>] [-c <user,system,iowait>] ( [ -i <intervals in second> ] [ -n <report number> ])
#
# example : ./check_cpu_utili.sh
#           ./check_cpu_utili.sh -w 70,40,30 -c 90,60,40
#           ./check_cpu_utili.sh -w 70,40,30 -c 90,60,40 -i 3 -n 5
# ------------------------------------------------------------------------------
# paths to commands used in this script. these may have to be modified to match your system setup.
iostat="/usr/bin/iostat"

# nagios return codes
state_ok=0
state_warning=1
state_critical=2
state_unknown=3

# plugin parameters value if not define
list_warning_threshold="70,40,30"
list_critical_threshold="90,60,40"
interval_sec=1
num_report=1

# variable description
progname=$(basename $0)

if [ ! -x $iostat ]; then
    echo "unknown: iostat not found or is not executable by the nagios user."
    exit $state_unknown
fi

print_usage() {
    echo ""
    echo "$progname - cpu utilization check script for nagios"
    echo ""
    echo "usage: check_cpu_utili.sh -w -c (-i -n)"
    echo ""
    echo "  -w  warning threshold in % for warn_user,warn_system,warn_iowait cpu (default : 70,40,30)"
    echo "  exit with warning status if cpu exceeds warn_n"
    echo "  -c  critical threshold in % for crit_user,crit_system,crit_iowait cpu (default : 90,60,40)"
    echo "  exit with critical status if cpu exceeds crit_n"
    echo "  -i  interval in seconds for iostat (default : 1)"
    echo "  -n  number report for iostat (default : 1)"
    echo "  -h  show this page"
    echo ""
    echo "usage: $progname"
    echo "usage: $progname --help"
    echo ""
}

print_help() {
    print_usage
    echo ""
    echo "this plugin will check cpu utilization (user,system,cpu_iowait in %)"
    echo ""
}

# the listing defines no release string, so only the program name is printed
print_release() {
    echo "$progname"
}

# parse parameters
while [ $# -gt 0 ]; do
    case "$1" in
        -h | --help)
            print_help
            exit $state_ok
            ;;
        -v | --version)
            print_release
            exit $state_ok
            ;;
        -w | --warning)
            shift
            list_warning_threshold=$1
            ;;
        -c | --critical)
            shift
            list_critical_threshold=$1
            ;;
        -i | --interval)
            shift
            interval_sec=$1
            ;;
        -n | --number)
            shift
            num_report=$1
            ;;
        *)
            echo "unknown argument: $1"
            print_usage
            exit $state_unknown
            ;;
    esac
    shift
done

# list to table for warning threshold
tab_warning_threshold=(`echo $list_warning_threshold | sed 's/,/ /g'`)
if [ "${#tab_warning_threshold[@]}" -ne "3" ]; then
    echo "error : bad count parameter in warning threshold"
    exit $state_warning
else
    user_warning_threshold=${tab_warning_threshold[0]}
    system_warning_threshold=${tab_warning_threshold[1]}
    iowait_warning_threshold=${tab_warning_threshold[2]}
fi

# list to table for critical threshold
tab_critical_threshold=(`echo $list_critical_threshold | sed 's/,/ /g'`)
if [ "${#tab_critical_threshold[@]}" -ne "3" ]; then
    echo "error : bad count parameter in critical threshold"
    exit $state_warning
else
    user_critical_threshold=${tab_critical_threshold[0]}
    system_critical_threshold=${tab_critical_threshold[1]}
    iowait_critical_threshold=${tab_critical_threshold[2]}
fi

if [ ${tab_warning_threshold[0]} -ge ${tab_critical_threshold[0]} -o ${tab_warning_threshold[1]} -ge ${tab_critical_threshold[1]} -o ${tab_warning_threshold[2]} -ge ${tab_critical_threshold[2]} ]; then
    echo "error : critical cpu threshold lower as warning cpu threshold "
    exit $state_warning
fi

# last line of the iostat cpu report, with columns separated by ';'
# (fields: 2 = user, 3 = nice, 4 = system, 5 = iowait, 6 = steal, 7 = idle)
cpu_report=`$iostat -c $interval_sec $num_report | sed -e 's/,/./g' | tr -s ' ' ';' | sed '/^$/d' | tail -1`
cpu_user=`echo $cpu_report | cut -d ";" -f 2`
cpu_system=`echo $cpu_report | cut -d ";" -f 4`
cpu_iowait=`echo $cpu_report | cut -d ";" -f 5`
cpu_steal=`echo $cpu_report | cut -d ";" -f 6`
cpu_idle=`echo $cpu_report | cut -d ";" -f 7`

nagios_status="user=${cpu_user}%,system=${cpu_system}%,iowait=${cpu_iowait}%,idle=${cpu_idle}%"
# performance data uses the configured thresholds instead of hard-coded 70;90
nagios_data="cpu_user=${cpu_user}%;${user_warning_threshold};${user_critical_threshold};0;100"

# integer parts only, because [ ] cannot compare floating-point values
cpu_user_major=`echo $cpu_user | cut -d "." -f 1`
cpu_system_major=`echo $cpu_system | cut -d "." -f 1`
cpu_iowait_major=`echo $cpu_iowait | cut -d "." -f 1`

# return
if [ ${cpu_user_major} -ge $user_critical_threshold ] || [ ${cpu_system_major} -ge $system_critical_threshold ] || [ ${cpu_iowait_major} -ge $iowait_critical_threshold ]; then
    echo "cpu statistics critical:${nagios_status} | ${nagios_data}"
    exit $state_critical
elif [ ${cpu_user_major} -ge $user_warning_threshold ] || [ ${cpu_system_major} -ge $system_warning_threshold ] || [ ${cpu_iowait_major} -ge $iowait_warning_threshold ]; then
    echo "cpu statistics warning:${nagios_status} | ${nagios_data}"
    exit $state_warning
else
    echo "cpu statistics ok:${nagios_status} | ${nagios_data}"
    exit $state_ok
fi

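This script is based on material from the official Nagios site https://exchange.nagios.org/, trimmed down and ported: the source ran under ksh and has been moved to bash here, and the two shells define arrays differently. Note also that the shell itself does not support floating-point arithmetic, although bc or awk can be used to work around that.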
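To see why the floating-point limitation matters, the sketch below parses one data row of `iostat -c` into floats and compares a float sum with the result of truncating each reading first, which is what integer-only shell tests force. It is written in Python purely for illustration; the function name `parse_iostat_cpu` and the sample row are assumptions, and the column order is the one iostat prints under its `avg-cpu:` header.

```python
def parse_iostat_cpu(row):
    """Parse the data row printed under iostat's 'avg-cpu:' header."""
    names = ("user", "nice", "system", "iowait", "steal", "idle")
    # Some locales print a decimal comma; normalise it as the sed step does.
    values = [float(field.replace(",", ".")) for field in row.split()]
    return dict(zip(names, values))


# Illustrative values, not measured on a real machine.
SAMPLE_ROW = "          89.97    0.00    5.25    0.10    0.00    4.68"
cpu = parse_iostat_cpu(SAMPLE_ROW)

threshold = 95
float_sum = cpu["user"] + cpu["system"]                # about 95.22
truncated_sum = int(cpu["user"]) + int(cpu["system"])  # 89 + 5 = 94

print(float_sum >= threshold)      # the float comparison crosses the threshold
print(truncated_sum >= threshold)  # the truncated one does not
```

With these numbers the truncated comparison would miss an alert that the float comparison raises, which is why summing with bc or awk before truncating is the safer order in shell.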

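In addition, if the check is meant to feed pnp4nagios graphs (pnp4nagios can show the peak CPU utilization over a period of time), the script can be trimmed down further. The script is shown below (tested on Amazon Linux AMI x86_64):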

#!/bin/bash
# nagios return codes (same values as in the full script)
state_ok=0
state_warning=1
state_critical=2

list_warning_threshold="90"
list_critical_threshold="95"
interval_sec=1
num_report=5

cpu_report=`iostat -c $interval_sec $num_report | sed -e 's/,/./g' | tr -s ' ' ';' | sed '/^$/d' | tail -1`
cpu_user=`echo $cpu_report | cut -d ";" -f 2`
cpu_system=`echo $cpu_report | cut -d ";" -f 4`

# add for integer shell issue: awk sums the floats and keeps the leading zero
# (bc prints a sum below 1 as ".75", which leaves no integer part to cut)
cpu_utili_cou=`awk -v u="$cpu_user" -v s="$cpu_system" 'BEGIN {printf "%.2f", u + s}'`
cpu_utili_counter=`echo $cpu_utili_cou | cut -d "." -f 1`
perfdata="cpucou=${cpu_utili_cou}%;${list_warning_threshold};${list_critical_threshold}"

if [ ${cpu_utili_counter} -lt ${list_warning_threshold} ]
then
    echo "ok - cpucou=${cpu_utili_cou}% | ${perfdata}"
    exit ${state_ok}
elif [ ${cpu_utili_counter} -lt ${list_critical_threshold} ]
then
    echo "warning - cpucou=${cpu_utili_cou}% | ${perfdata}"
    exit ${state_warning}
else
    echo "critical - cpucou=${cpu_utili_cou}% | ${perfdata}"
    exit ${state_critical}
fi
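The user + system figure that this script reports is a share of CPU time, which is a different quantity from the system load (the number of runnable or waiting tasks). As a minimal sketch of where such a percentage comes from, the code below derives it from two samples of the aggregate `cpu` line in `/proc/stat` on Linux. The function names and the sample counter values are hypothetical; only the field order (user, nice, system, idle, iowait, irq, softirq, steal) is taken from the kernel's format.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def read_cpu_line(path="/proc/stat"):
    """Return the aggregate 'cpu' line from /proc/stat (Linux only)."""
    with open(path) as handle:
        return handle.readline()


def parse_cpu_line(line):
    """Turn 'cpu  u n s i ...' into a dict of cumulative tick counters."""
    parts = line.split()
    # Guest counters (if present) are skipped: they are already part of user/nice.
    return dict(zip(FIELDS, (int(p) for p in parts[1:1 + len(FIELDS)])))


def user_plus_system_percent(before, after):
    """Share of elapsed CPU time spent in user + system between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = dict((name, b[name] - a[name]) for name in FIELDS)
    total = sum(delta.values())
    if total == 0:
        return 0.0
    return 100.0 * (delta["user"] + delta["system"]) / total


# Hypothetical samples taken a short interval apart.
first = "cpu  100 0 50 800 50 0 0 0 0 0"
second = "cpu  160 0 70 900 70 0 0 0 0 0"
print(user_plus_system_percent(first, second))
```

On a live machine the two samples would come from calling `read_cpu_line()` twice with a pause in between, and `os.getloadavg()` returns the load averages for comparison, which makes it easy to observe a host with high utilization and unremarkable load.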
