Installing and Configuring Nagios on Linux

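1. About This Article

    This article was written with reference to David_Tang's article at http://www.cnblogs.com/mchina/archive/2013/02/20/2883404.html and other material found online. The great majority of the content is reproduced from David_Tang's article.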

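2. Introduction to Nagios

    Nagios is an open-source tool for monitoring computer systems and networks. It can monitor the state of Windows, Linux, and Unix hosts, network devices such as switches and routers, printers, and so on. When a system or service enters an abnormal state, it sends an email or SMS alert so that operations staff are notified immediately, and it sends a recovery notification by email or SMS once the state returns to normal.

    Nagios was originally named NetSaint. It was created by Ethan Galstad, who has maintained it ever since. NAGIOS is an acronym for "Nagios Ain't Gonna Insist On Sainthood"; "Sainthood" refers to a saint, and "Agios" is the Greek word for "saint". Nagios was developed to run on Linux, but it also works very well on Unix.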

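Main features

    • Monitoring of network services (SMTP, POP3, HTTP, NNTP, ICMP, SNMP, FTP, SSH)

    • Monitoring of host resources (CPU load, disk usage, system logs), including Windows hosts (via the NSClient++ plugin)

    • User-written plugins can collect data over the network to monitor anything (temperature, warnings, ...)

    • Scripts can be executed remotely by configuring the Nagios Remote Plugin Executor (NRPE)

    • Remote monitoring is supported through SSH or SSL-encrypted tunnels

    • A simple plugin design lets users easily develop the service checks they need, in many languages (shell scripts, C++, Perl, Ruby, Python, PHP, C#, and others)

    • Many graphing plugins are available (Nagiosgraph, Nagiosgrapher, PNP4Nagios, and others)

    • Service checks can run in parallel

    • A hierarchy of network hosts can be defined, so that checks proceed level by level, starting from the parent host and working downward

    • Notifications are sent when a service or host has a problem, via email, pager, SMS, or any user-defined plugin

    • Custom event handlers can be defined to restart a failed service or host

    • Automatic log rotation

    • Support for redundant monitoring

    • A web interface for viewing current network status, notifications, problem history, log files, and so on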

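3. How Nagios Works

    Nagios exists to monitor services and hosts, but it does not contain that functionality itself. All monitoring and checking is done by plugins.

    After Nagios starts, it periodically invokes plugins to check server status. Nagios also maintains a queue: all status information returned by plugins enters the queue, and Nagios reads from the head of the queue, processes each entry, and displays the resulting status through the web interface.

    Nagios provides many plugins that make it easy to monitor the status of many services. After installation, all bundled plugins are located in the libexec directory under the Nagios home directory. For example, check_disk checks disk space and check_load checks CPU load. Run ./check_xxx -h to see how a plugin is used and what it does.

    Nagios recognizes four return states: 0 (OK) means normal and is shown in green; 1 (WARNING) means a warning and is shown in yellow; 2 (CRITICAL) means a very serious error and is shown in red; 3 (UNKNOWN) means an unknown error and is shown in dark yellow. Nagios uses the value returned by a plugin to determine the state of the monitored object and displays it through the web interface, so that administrators can spot failures promptly.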
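    The return-code convention can be illustrated with a small shell sketch of a plugin. This is not one of the bundled plugins: the function name check_value and the thresholds are hypothetical, chosen only to show how a check maps a measurement to the exit statuses 0 to 3 and prints one line of status text.

```shell
#!/bin/sh
# Hypothetical plugin sketch: compare a number against thresholds and report
# it with the Nagios plugin return-code convention (0, 1, 2, 3).
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

check_value() {
    value=$1
    warn=$2
    crit=$3
    # Anything that is not a non-negative integer is reported as UNKNOWN.
    case "$value" in
        ''|*[!0-9]*)
            echo "UNKNOWN - invalid value '$value'"
            return $STATE_UNKNOWN
            ;;
    esac
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value is $value"
        return $STATE_CRITICAL
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value is $value"
        return $STATE_WARNING
    fi
    echo "OK - value is $value"
    return $STATE_OK
}

# Example: usage of the / filesystem in percent, warning at 80, critical at 90.
usage=$(df -P / | awk 'NR==2 { sub("%", "", $5); print $5 }')
rc=0
check_value "$usage" 80 90 || rc=$?
echo "exit status: $rc"
```

    A real plugin would end with exit "$rc" so that Nagios receives the status; the sketch only prints it so that it can be tried interactively.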

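    The four monitoring states:

Return value  State     Meaning             Color
0             OK        Normal              Green
1             WARNING   Warning             Yellow
2             CRITICAL  Very serious error  Red
3             UNKNOWN   Unknown error       Dark yellow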

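    Next, alerting. A monitoring system that detects a problem but cannot raise an alert is pointless, so alerting is one of the most important functions of Nagios. Here too, however, Nagios itself contains no alerting code, not even a plugin; this is left to the user or to other related open-source projects.

    Installing Nagios means installing the base platform, that is, the Nagios package itself. It is the framework of the monitoring system and the foundation for all monitoring.

    The official Nagios documentation shows that Nagios has almost no dependencies; it only requires Linux or another system that Nagios supports. Without Apache (an HTTP service), however, there is no convenient interface for viewing the monitoring information, so Apache can be treated as a prerequisite. Many guides to installing Apache are available online. After installing it, check that it works correctly.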

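    Having seen how Nagios manages server objects through plugins, we now turn to how it manages remote servers. Nagios provides an add-on called NRPE, and it obtains status information from remote servers by running it periodically. The relationship between them is shown in the following figure:

[Figure: relationship between Nagios, check_nrpe, and the remote NRPE daemon]

   Nagios manages remote services through NRPE: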

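   1. Nagios runs the check_nrpe plugin installed on the monitoring server and tells check_nrpe which service to check.

   2. check_nrpe connects over SSL to the NRPE daemon on the remote machine.

   3. NRPE runs the local plugins (check_disk, etc.) to check local services and status.

   4. NRPE passes the check result back to check_nrpe on the monitoring server, and check_nrpe places the result in the Nagios status queue.

   5. Nagios reads the entries in the queue in order and displays the results.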
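   Step 1 can be sketched as the command line that the monitoring server would run. The helper nrpe_command is hypothetical and only prints the command instead of executing it; the remote command name check_load is an assumption and must match a command defined in the NRPE configuration on the remote host.

```shell
#!/bin/sh
# Sketch: build (but do not run) the check_nrpe command line for a remote check.
PLUGIN_DIR=/usr/local/nagios/libexec   # plugin directory used in this article

nrpe_command() {
    host=$1           # address of the machine running the NRPE daemon
    remote_check=$2   # command name defined on the remote side (assumed)
    printf '%s/check_nrpe -H %s -c %s\n' "$PLUGIN_DIR" "$host" "$remote_check"
}

# node2 from the test environment, asking for its CPU load check.
nrpe_command 192.168.1.152 check_load
```

   Here -H names the remote host and -c names the check that the NRPE daemon should run on it.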

4. Test Environment

Host Name  OS       IP                             Software
node1      rhel5.4  192.168.1.151, 192.168.11.164  hadoop0.20.2, namenode, dns, nfs, apache, php, nagios, nagios-plugins
node2               192.168.1.152, 192.168.11.167  hadoop0.20.2, datanode, mysql, nagios-plugins, nrpe
node3               192.168.1.153, 192.168.11.166  hadoop0.20.2, datanode, hive

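   node1 has Nagios installed. It processes the monitoring data and provides the web interface for viewing and management. It can of course also monitor itself.

   node2 has NRPE and the other client components installed. It runs checks at the request of the monitoring server and returns the results to it.

   The firewall is turned off (iptables: Firewall is not running).

   SELinux is disabled (SELINUX=disabled).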

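5. Test Goals

Host name  Services to monitor
CPU load
Number of currently logged-in users
Whether port 80 is open
Whether the host is alive
Usage of the / partition
Total number of processes
Whether the ssh service is running
Usage of the swap partition
Whether the DNS service is running
datanode process
MySQL database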

 6. Installing the Nagios Server

    6.1 Prerequisite packages: gcc glibc glibc-common gd gd-devel xinetd openssl-devel

[root@node1 nagios]# rpm -q gcc glibc glibc-common gd gd-devel xinetd openssl-devel
gcc-4.1.2-46.el5
glibc-2.5-42
glibc-common-2.5-42
gd-2.0.33-9.4.el5_4.2
gd-devel-2.0.33-9.4.el5_4.2
xinetd-2.3.14-10.el5
openssl-devel-0.9.8e-26.el5_9.1
---- If any of these packages are missing, install them with yum
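
    A minimal sketch of that step: list whichever of the packages above are not installed and print the matching yum command. The function missing_packages is hypothetical; its optional argument exists only so the logic can be tried on a machine without rpm, and the default query is rpm -q as used above.

```shell
#!/bin/sh
# Packages listed in section 6.1.
PACKAGES="gcc glibc glibc-common gd gd-devel xinetd openssl-devel"

# Print the packages that the query command reports as not installed.
# The query command defaults to "rpm -q".
missing_packages() {
    query=${1:-rpm -q}
    for pkg in $PACKAGES; do
        $query "$pkg" >/dev/null 2>&1 || printf '%s ' "$pkg"
    done
}

missing=$(missing_packages)
if [ -n "$missing" ]; then
    echo "Run as root: yum install -y $missing"
else
    echo "All prerequisite packages are installed."
fi
```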

    6.2 Create the nagios user and group

[root@node1 app]# useradd nagios
[root@node1 app]# mkdir /usr/local/nagios
[root@node1 app]# chown -R nagios.nagios /usr/local/nagios
[root@node1 app]# ll -d /usr/local/nagios/
drwxr-xr-x 2 nagios nagios 4096 Sep 24 12:02 /usr/local/nagios/      

    6.3 Compile and install Nagios

[root@node1 app]# cd nagios
[root@node1 nagios]# ./configure --prefix=/usr/local/nagios
*** Configuration summary for nagios 3.3.1 07-25-2011 ***:

 General Options:
 -------------------------
        Nagios executable:  nagios
        Nagios user/group:  nagios,nagios
       Command user/group:  nagios,nagios
            Embedded Perl:  no
             Event Broker:  yes
        Install ${prefix}:  /usr/local/nagios
                Lock file:  ${prefix}/var/nagios.lock
   Check result directory:  ${prefix}/var/spool/checkresults
           Init directory:  /etc/rc.d/init.d
  Apache conf.d directory:  /etc/httpd/conf.d
             Mail program:  /bin/mail
                  Host OS:  linux-gnu

 Web Interface Options:
 ------------------------
                 HTML URL:  http://localhost/nagios/
                  CGI URL:  http://localhost/nagios/cgi-bin/
 Traceroute (used by WAP):  /bin/traceroute


Review the options above for accuracy.  If they look okay,
type 'make all' to compile the main program and CGIs.      
[root@node1 nagios]# make all
cd ./base && make
make[1]: Entering directory `/app/nagios/base'
*** Support Notes *******************************************

If you have questions about configuring or running Nagios,
please make sure that you:

     - Look at the sample config files
     - Read the documentation on the Nagios Library at:
           http://library.nagios.com

before you post a question to one of the mailing lists.
Also make sure to include pertinent information that could
help others help you.  This might include:

     - What version of Nagios you are using
     - What version of the plugins you are using
     - Relevant snippets from your config files
     - Relevant error messages from the Nagios log file

For more information on obtaining support for Nagios, visit:

       http://support.nagios.com

*************************************************************

Enjoy.      
[root@node1 nagios]# make install
*** Main program, CGIs and HTML files installed ***

You can continue with installing Nagios as follows (type 'make'
without any arguments for a list of all possible options):

  make install-init
     - This installs the init script in /etc/rc.d/init.d

  make install-commandmode
     - This installs and configures permissions on the
       directory for holding the external command file

  make install-config
     - This installs sample config files in /usr/local/nagios/etc

make[1]: Leaving directory `/app/nagios'      
[root@node1 nagios]# make install-init
/usr/bin/install -c -m 755 -d -o root -g root /etc/rc.d/init.d
/usr/bin/install -c -m 755 -o root -g root daemon-init /etc/rc.d/init.d/nagios

*** Init script installed ***      
[root@node1 nagios]# make install-commandmode
/usr/bin/install -c -m 775 -o nagios -g nagios -d /usr/local/nagios/var/rw
chmod g+s /usr/local/nagios/var/rw

*** External command directory configured ***

[root@node1 nagios]# make install-config
/usr/bin/install -c -m 775 -o nagios -g nagios -d /usr/local/nagios/etc
/usr/bin/install -c -m 775 -o nagios -g nagios -d /usr/local/nagios/etc/objects
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/nagios.cfg /usr/local/nagios/etc/nagios.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/cgi.cfg /usr/local/nagios/etc/cgi.cfg
/usr/bin/install -c -b -m 660 -o nagios -g nagios sample-config/resource.cfg /usr/local/nagios/etc/resource.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/templates.cfg /usr/local/nagios/etc/objects/templates.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/commands.cfg /usr/local/nagios/etc/objects/commands.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/contacts.cfg /usr/local/nagios/etc/objects/contacts.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/timeperiods.cfg /usr/local/nagios/etc/objects/timeperiods.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/localhost.cfg /usr/local/nagios/etc/objects/localhost.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/windows.cfg /usr/local/nagios/etc/objects/windows.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/printer.cfg /usr/local/nagios/etc/objects/printer.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/switch.cfg /usr/local/nagios/etc/objects/switch.cfg

*** Config files installed ***

Remember, these are *SAMPLE* config files.  You'll need to read
the documentation for more information on how to actually define
services, hosts, etc. to fit your particular needs.      
[root@node1 nagios]# chkconfig --add nagios
[root@node1 nagios]# chkconfig --level 35 nagios on
[root@node1 nagios]# chkconfig --list nagios
nagios             0:off    1:off    2:off    3:on    4:on    5:on    6:off      

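    6.4 Verify that the program was installed correctly

    Change to the installation path (here /usr/local/nagios) and check that the five directories etc, bin, sbin, share, and var exist. If they do, the program has been installed correctly. The purpose of each Nagios directory is as follows:

Directory     Purpose
bin           Nagios executables
etc           Nagios configuration files
sbin          Nagios CGI files, that is, the files needed to execute external commands
share         Nagios web pages
libexec       Nagios external plugins
var           Nagios log files, lock file, and similar files
var/archives  Automatically archived Nagios logs
var/rw        External command file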
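
    The check described above can be sketched as a short loop. The function name check_nagios_dirs is hypothetical; it only tests for the five directories named in this section under the prefix passed to it.

```shell
#!/bin/sh
# Report which of the expected sub-directories exist under an install prefix.
# Returns 0 when all five are present, 1 otherwise.
check_nagios_dirs() {
    prefix=$1
    status=0
    for dir in etc bin sbin share var; do
        if [ -d "$prefix/$dir" ]; then
            echo "found:   $prefix/$dir"
        else
            echo "MISSING: $prefix/$dir"
            status=1
        fi
    done
    return $status
}

check_nagios_dirs /usr/local/nagios || echo "installation looks incomplete"
```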

    6.5 Install the Nagios plugins

[root@node1 nagios-plugins-1.4.15]# ./configure --prefix=/usr/local/nagios
config.status: creating po/Makefile
            --with-apt-get-command: 
              --with-ping6-command: /bin/ping6 -n -U -w %d -c %d %s
               --with-ping-command: /bin/ping -n -U -w %d -c %d %s
                       --with-ipv6: yes
                      --with-mysql: no
                    --with-openssl: yes
                     --with-gnutls: no
               --enable-extra-opts: no
                       --with-perl: /usr/bin/perl
             --enable-perl-modules: no
                     --with-cgiurl: /nagios/cgi-bin
               --with-trusted-path: /bin:/sbin:/usr/bin:/usr/sbin
                   --enable-libtap: no
[root@node1 nagios-plugins-1.4.15]# make && make install      

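    6.6 Install and configure Apache and PHP

    Apache and PHP are not required to install Nagios, but Nagios provides a web interface that clearly shows the running state of the monitored hosts and resources, so installing a web service is well worth it.

    Note that from Nagios 3.1.x onward, the web interface requires PHP support. The Nagios version used here is 3.3.1, so after compiling and installing Apache, the PHP module must also be compiled. The PHP version chosen here is php5.4.10.

    a. Install Apache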

# wget http://archive.apache.org/dist/httpd/httpd-2.2.23.tar.gz

# tar zxvf httpd-2.2.23.tar.gz

# cd httpd-2.2.23

# ./configure --prefix=/usr/local/apache2

# make && make install       

    If an error occurs, add --with-included-apr to the ./configure options to resolve it.

    b. Install PHP

# wget http://cn2.php.net/distributions/php-5.4.10.tar.gz

# tar zxvf php-5.4.10.tar.gz

# cd php-5.4.10

# ./configure --prefix=/usr/local/php --with-apxs2=/usr/local/apache2/bin/apxs 

# make && make install        

    c. Configure Apache

    Open the Apache configuration file /usr/local/apache2/conf/httpd.conf

---- Find:
User daemon
Group daemon
---- Change to:
User nagios
Group nagios
---- Then find:
<IfModule dir_module>
   DirectoryIndex index.html
</IfModule>
---- Change to:
<IfModule dir_module>
   DirectoryIndex index.html index.php
</IfModule>
---- Then add the following:
AddType application/x-httpd-php .php

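    For security, access to the Nagios web interface should normally require authorization. This requires authentication settings, which are added by appending the following to the end of httpd.conf: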

#setting for nagios
ScriptAlias /nagios/cgi-bin "/usr/local/nagios/sbin"
<Directory "/usr/local/nagios/sbin">
     AuthType Basic
     Options ExecCGI
     AllowOverride None
     Order allow,deny
     Allow from all
     AuthName "Nagios Access"
     # File used to authenticate access to this directory
     AuthUserFile /usr/local/nagios/etc/htpasswd
     Require valid-user
</Directory>
Alias /nagios "/usr/local/nagios/share"
<Directory "/usr/local/nagios/share">
     AuthType Basic
     Options None
     AllowOverride None
     Order allow,deny
     Allow from all
     AuthName "nagios Access"
     AuthUserFile /usr/local/nagios/etc/htpasswd
     Require valid-user
</Directory>      

    d. Create the Apache authentication file

    The configuration above names the authentication file htpasswd. Create it as follows:

    # /usr/local/apache2/bin/htpasswd -c /usr/local/nagios/etc/htpasswd david

    This creates an htpasswd authentication file in /usr/local/nagios/etc. A user name and password are now required when accessing 192.168.11.164/nagios/.

    e. View the contents of the authentication file

    # cat /usr/local/nagios/etc/htpasswd

    f. Start the Apache service

    # /usr/local/apache2/bin/apachectl start

    At this point the Nagios installation is essentially complete, and you can access it through the web.

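 7. Configuring Nagios

    Nagios is mainly used to monitor one or more local and remote hosts, including local resources and externally offered services. The default Nagios configuration monitors nothing; it consists only of template files. To make Nagios do useful work, you must edit the configuration files and add the hosts and services to be monitored, as described in detail below.

    7.1 The default configuration files

    After installation, the default configuration files are in /usr/local/nagios/etc.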

[root@node1 ~]# cd /usr/local/nagios/
[root@node1 nagios]# ls
bin  etc  include  libexec  sbin  share  var
[root@node1 nagios]# cd etc/
[root@node1 etc]# ls
cgi.cfg  contacts.cfg  hosts.cfg  htpasswd  nagios.cfg  objects  resource.cfg  services.cfg  timeperiods.cfg
[root@node1 etc]# cd objects/
[root@node1 objects]# ls
commands.cfg  localhost.cfg  switch.cfg     templates.cfg.bak  windows.cfg
contacts.cfg  printer.cfg    templates.cfg  timeperiods.cfg      

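    The purpose of each file or directory is shown in the table below:

File or directory        Purpose
cgi.cfg                  Configuration file controlling CGI access
nagios.cfg               Main Nagios configuration file
resource.cfg             Variable-definition file, also called the resource file; variables such as $USER1$ are defined here so that other configuration files can reference them
objects                  A directory containing many configuration file templates used to define Nagios objects
objects/commands.cfg     Command definitions, which other configuration files can reference
objects/contacts.cfg     Definitions of contacts and contact groups
objects/localhost.cfg    Definitions for monitoring the local host
objects/printer.cfg      A template for monitoring printers; not enabled by default
objects/switch.cfg       A template for monitoring routers; not enabled by default
objects/templates.cfg    Host and service templates that other configuration files can reference
objects/timeperiods.cfg  Definitions of Nagios monitoring time periods
objects/windows.cfg      A template for monitoring Windows hosts; not enabled by default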

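     7.2 Relationships between the configuration files

     Configuring Nagios involves several kinds of definition: hosts and host groups, services and service groups, contacts and contact groups, monitoring time periods, monitoring commands, and so on. As these definitions suggest, the Nagios configuration files are interrelated and reference one another.

     To configure a working Nagios monitoring system, you must understand the dependencies between the configuration files. Four points matter most:

     First: define which hosts, host groups, services, and service groups to monitor;

     Second: define which command performs each check;

     Third: define the time periods during which monitoring takes place;

     Fourth: define the contacts and contact groups to notify when a host or service has a problem.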
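
     The four points can be seen together in one small object file. The sketch below writes a hypothetical example to a scratch file: the host example-host, the address 192.0.2.10, and the contact group example-admins are invented for illustration and would need matching definitions in a real setup. check-host-alive, check_ping, and 24x7 are names from the sample commands.cfg and timeperiods.cfg.

```shell
#!/bin/sh
# Write a small, hypothetical object file that shows how host, service,
# command, time period, and contact group definitions refer to each other.
out=${TMPDIR:-/tmp}/nagios-relations-example.cfg

cat > "$out" <<'EOF'
define host{
        host_name               example-host        ; what to monitor (hypothetical)
        address                 192.0.2.10          ; documentation-only address
        check_command           check-host-alive    ; which command runs the check
        check_period            24x7                ; when checks may run
        max_check_attempts      5
        contact_groups          example-admins      ; who is notified (hypothetical)
        notification_period     24x7
        notification_interval   120
        }

define service{
        host_name               example-host        ; refers to the host above
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          example-admins
        notification_period     24x7
        notification_interval   60
        }
EOF

echo "wrote $out"
grep -c '^define ' "$out"
```

     Each directive on the right-hand side of a reference (check_command, check_period, contact_groups) names an object that must be defined in some other configuration file, which is why the files depend on one another.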

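     7.3 Configuring Nagios

     To keep things clear and easier to maintain, it is recommended to put each kind of Nagios object definition in its own configuration file:

     Create hosts.cfg to define hosts and host groups

     Create services.cfg to define services

     Use the default contacts.cfg to define contacts and contact groups

     Use the default commands.cfg to define commands

     Use the default timeperiods.cfg to define monitoring time periods

     Use the default templates.cfg as the template file referenced by the others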
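
     Newly created files such as hosts.cfg and services.cfg are only read if the main configuration file names them with a cfg_file line. The sketch below is one way to add those lines without duplicating them; the helper register_cfg_files is hypothetical, and the demonstration runs against a scratch file rather than the live nagios.cfg.

```shell
#!/bin/sh
# Append a cfg_file line for each object file unless it is already present.
register_cfg_files() {
    main_cfg=$1
    shift
    for f in "$@"; do
        line="cfg_file=$f"
        # -x matches the whole line, -F treats the text literally.
        grep -qxF "$line" "$main_cfg" || echo "$line" >> "$main_cfg"
    done
}

# Demonstration on a scratch file.
demo=$(mktemp)
register_cfg_files "$demo" /usr/local/nagios/etc/hosts.cfg /usr/local/nagios/etc/services.cfg
register_cfg_files "$demo" /usr/local/nagios/etc/hosts.cfg   # second call adds nothing
cat "$demo"
rm -f "$demo"
```

     To apply it for real, pass /usr/local/nagios/etc/nagios.cfg as the first argument, after making a backup copy.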

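     a. The templates.cfg file

     Nagios is mainly used to monitor host resources and services, which are called objects in the Nagios configuration. To avoid defining the same attributes repeatedly for each monitored object, Nagios provides a template configuration file in which common attributes are defined as templates that can be referenced many times. That is the purpose of templates.cfg.

---- contact_groups in this file may need to be changed ----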
[root@node1 objects]# cat templates.cfg
###############################################################################
# TEMPLATES.CFG - SAMPLE OBJECT TEMPLATES
#
# Last Modified: 10-03-2007
#
# NOTES: This config file provides you with some example object definition
#        templates that are refered by other host, service, contact, etc.
#        definitions in other config files.
#       
#        You don't need to keep these definitions in a separate file from your
#        other object definitions.  This has been done just to make things
#        easier to understand.
#
###############################################################################



###############################################################################
###############################################################################
#
# CONTACT TEMPLATES
#
###############################################################################
###############################################################################

# Generic contact definition template - This is NOT a real contact, just a template!

define contact{
        name                            generic-contact        ; The name of this contact template
        service_notification_period     24x7            ; service notifications can be sent anytime
        host_notification_period        24x7            ; host notifications can be sent anytime
        service_notification_options    w,u,c,r,f,s        ; send notifications for all service states, flapping events, and scheduled downtime events
        host_notification_options       d,u,r,f,s        ; send notifications for all host states, flapping events, and scheduled downtime events
        service_notification_commands   notify-service-by-email    ; send service notifications via email
        host_notification_commands      notify-host-by-email    ; send host notifications via email
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL CONTACT, JUST A TEMPLATE!
        }




###############################################################################
###############################################################################
#
# HOST TEMPLATES
#
###############################################################################
###############################################################################

# Generic host definition template - This is NOT a real host, just a template!

define host{
        name                            generic-host    ; The name of this host template
        notifications_enabled           1           ; Host notifications are enabled
        event_handler_enabled           1           ; Host event handler is enabled
        flap_detection_enabled          1           ; Flap detection is enabled
        failure_prediction_enabled      1           ; Failure prediction is enabled
        process_perf_data               1           ; Process performance data
        retain_status_information       1           ; Retain status information across program restarts
        retain_nonstatus_information    1           ; Retain non-status information across program restarts
        notification_period             24x7        ; Send host notifications at any time
        register                        0           ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }


# Linux host definition template - This is NOT a real host, just a template!

define host{
    name                linux-server    ; The name of this host template
    use                generic-host    ; This template inherits other values from the generic-host template
    check_period            24x7        ; By default, Linux hosts are checked round the clock
    check_interval            1        ; Actively check the host every minute
    retry_interval            1        ; Schedule host check retries at 1 minute intervals
    max_check_attempts        2        ; Check each Linux host 2 times (max)
        check_command               check-host-alive ; Default command to check Linux hosts
    notification_period        workhours    ; Linux admins hate to be woken up, so we only notify during the day
                            ; Note that the notification_period variable is being overridden from
                            ; the value that is inherited from the generic-host template!
    notification_interval        120        ; Resend notifications every 2 hours
    notification_options        d,u,r        ; Only send notifications for specific host states
    contact_groups            ts        ; Notifications get sent to the admins by default
    notifications_enabled           1
        register            0        ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
    }
---- linux-server3 and linux-server2 below were newly added ----
define host{
        name                            linux-server3    ; The name of this host template
        use                             generic-host    ; This template inherits other values from the generic-host template
        check_period                    24x7            ; By default, Linux hosts are checked round the clock
        check_interval                  1               ; Actively check the host every minute
        retry_interval                  1               ; Schedule host check retries at 1 minute intervals
        max_check_attempts              2               ; Check each Linux host 2 times (max)
        check_command                   check-host-alive ; Default command to check Linux hosts
        notification_period             workhours       ; Linux admins hate to be woken up, so we only notify during the day
                                                        ; Note that the notification_period variable is being overridden from
                                                        ; the value that is inherited from the generic-host template!
        notification_interval           120             ; Resend notifications every 2 hours
        notification_options            d,u,r           ; Only send notifications for specific host states
        contact_groups                  ts              ; Notifications get sent to the admins by default
        notifications_enabled           1
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }

define host{
        name                            linux-server2    ; The name of this host template
        use                             generic-host    ; This template inherits other values from the generic-host template
        check_period                    24x7            ; By default, Linux hosts are checked round the clock
        check_interval                  5               ; Actively check the host every 5 minutes
        retry_interval                  1               ; Schedule host check retries at 1 minute intervals
        max_check_attempts              10              ; Check each Linux host 10 times (max)
        check_command                   check-host-alive ; Default command to check Linux hosts
        notification_period             workhours       ; Linux admins hate to be woken up, so we only notify during the day
                                                        ; Note that the notification_period variable is being overridden from
                                                        ; the value that is inherited from the generic-host template!
        notification_interval           120             ; Resend notifications every 2 hours
        notification_options            d,u,r           ; Only send notifications for specific host states
        contact_groups                  ts              ; Notifications get sent to the admins by default
        notifications_enabled           1
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
        }


# Windows host definition template - This is NOT a real host, just a template!

define host{
    name            windows-server    ; The name of this host template
    use            generic-host    ; Inherit default values from the generic-host template
    check_period        24x7        ; By default, Windows servers are monitored round the clock
    check_interval        5        ; Actively check the server every 5 minutes
    retry_interval        1        ; Schedule host check retries at 1 minute intervals
    max_check_attempts    10        ; Check each server 10 times (max)
    check_command        check-host-alive    ; Default command to check if servers are "alive"
    notification_period    24x7        ; Send notification out at any time - day or night
    notification_interval    30        ; Resend notifications every 30 minutes
    notification_options    d,r        ; Only send notifications for specific host states
    contact_groups        ts        ; Notifications get sent to the admins by default
    hostgroups        windows-servers ; Host groups that Windows servers should be a member of
    register        0        ; DONT REGISTER THIS - ITS JUST A TEMPLATE
    }


# We define a generic printer template that can be used for most printers we monitor

define host{
    name            generic-printer    ; The name of this host template
    use            generic-host    ; Inherit default values from the generic-host template
    check_period        24x7        ; By default, printers are monitored round the clock
    check_interval        5        ; Actively check the printer every 5 minutes
    retry_interval        1        ; Schedule host check retries at 1 minute intervals
    max_check_attempts    10        ; Check each printer 10 times (max)
    check_command        check-host-alive    ; Default command to check if printers are "alive"
    notification_period    workhours        ; Printers are only used during the workday
    notification_interval    30        ; Resend notifications every 30 minutes
    notification_options    d,r        ; Only send notifications for specific host states
    contact_groups        ts        ; Notifications get sent to the admins by default
    register        0        ; DONT REGISTER THIS - ITS JUST A TEMPLATE
    }


# Define a template for switches that we can reuse
define host{
    name            generic-switch    ; The name of this host template
    use            generic-host    ; Inherit default values from the generic-host template
    check_period        24x7        ; By default, switches are monitored round the clock
    check_interval        5        ; Switches are checked every 5 minutes
    retry_interval        1        ; Schedule host check retries at 1 minute intervals
    max_check_attempts    10        ; Check each switch 10 times (max)
    check_command        check-host-alive    ; Default command to check if routers are "alive"
    notification_period    24x7        ; Send notifications at any time
    notification_interval    30        ; Resend notifications every 30 minutes
    notification_options    d,r        ; Only send notifications for specific host states
    contact_groups        ts        ; Notifications get sent to the admins by default
    register        0        ; DONT REGISTER THIS - ITS JUST A TEMPLATE
    }




###############################################################################
###############################################################################
#
# SERVICE TEMPLATES
#
###############################################################################
###############################################################################

# Generic service definition template - This is NOT a real service, just a template!

define service{
        name                            generic-service     ; The 'name' of this service template
        active_checks_enabled           1               ; Active service checks are enabled
        passive_checks_enabled          1                   ; Passive service checks are enabled/accepted
        parallelize_check               1               ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1               ; We should obsess over this service (if necessary)
        check_freshness                 0               ; Default is to NOT check service 'freshness'
        notifications_enabled           1               ; Service notifications are enabled
        event_handler_enabled           1               ; Service event handler is enabled
        flap_detection_enabled          1               ; Flap detection is enabled
        failure_prediction_enabled      1               ; Failure prediction is enabled
        process_perf_data               1               ; Process performance data
        retain_status_information       1               ; Retain status information across program restarts
        retain_nonstatus_information    1               ; Retain non-status information across program restarts
        is_volatile                     0               ; The service is not volatile
        check_period                    24x7            ; The service can be checked at any time of the day
        max_check_attempts              3            ; Re-check the service up to 3 times in order to determine its final (hard) state
        normal_check_interval           10            ; Check the service every 10 minutes under normal conditions
        retry_check_interval            2            ; Re-check the service every two minutes until a hard state can be determined
        contact_groups                  ts            ; Notifications get sent out to everyone in the 'ts' group
    notification_options        w,u,c,r            ; Send notifications about warning, unknown, critical, and recovery events
        notification_interval           60            ; Re-notify about service problems every hour
        notification_period             24x7            ; Notifications can be sent out at any time
         register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }


# Local service definition template - This is NOT a real service, just a template!

define service{
    name                local-service         ; The name of this service template
    use                generic-service        ; Inherit default values from the generic-service definition
        max_check_attempts              4            ; Re-check the service up to 4 times in order to determine its final (hard) state
        normal_check_interval           5            ; Check the service every 5 minutes under normal conditions
        retry_check_interval            1            ; Re-check the service every minute until a hard state can be determined
        register                        0               ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
    }      

    b. The resource.cfg file

    resource.cfg is the Nagios variable-definition file. It contains a single line:

[root@node1 etc]# cat resource.cfg 
$USER1$=/usr/local/nagios/libexec      

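    Here the variable $USER1$ specifies the path where the Nagios plugins are installed. If the plugins are installed elsewhere, only this line needs to change. Note that a variable must be defined before it can be referenced in other configuration files.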
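
    To make the role of $USER1$ concrete, the following sketch imitates how a command_line template is filled in. It is only an illustration written for this article, not the code Nagios itself uses: the function expand_command is hypothetical, and it handles just $USER1$ and the $ARGn$ arguments given after the command name, separated by "!".

```shell
#!/bin/sh
USER1=/usr/local/nagios/libexec    # value defined in resource.cfg

# Expand $USER1$ and $ARG1$..$ARGn$ in a command_line template.
# check_command has the form: command_name!arg1!arg2!...
expand_command() {
    template=$1
    check_command=$2
    # Keep only the '!'-separated arguments, dropping the command name.
    case "$check_command" in
        *!*) args=${check_command#*!} ;;
        *)   args="" ;;
    esac
    printf '%s\n' "$template" | awk -v user1="$USER1" -v args="$args" '
    {
        n = split(args, a, "!")
        line = $0
        gsub(/\$USER1\$/, user1, line)
        for (i = 1; i <= n; i++)
            gsub("\\$ARG" i "\\$", a[i], line)
        print line
    }'
}

# Template as found in commands.cfg, arguments as a service would pass them.
expand_command '$USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$' \
               'check_local_disk!20%!10%!/'
```

    The result is the full check_disk command line under /usr/local/nagios/libexec, with the warning threshold, critical threshold, and partition filled in from the three arguments.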

    c. The commands.cfg file

     This file exists by default and can be used without modification. If new commands are needed, add them to this file.

[root@node1 etc]# cat objects/commands.cfg 
###############################################################################
# COMMANDS.CFG - SAMPLE COMMAND DEFINITIONS FOR NAGIOS 3.3.1
#
# Last Modified: 05-31-2007
#
# NOTES: This config file provides you with some example command definitions
#        that you can reference in host, service, and contact definitions.
#       
#        You don't need to keep commands in a separate file from your other
#        object definitions.  This has been done just to make things easier to
#        understand.
#
###############################################################################


################################################################################
#
# SAMPLE NOTIFICATION COMMANDS
#
# These are some example notification commands.  They may or may not work on
# your system without modification.  As an example, some systems will require 
# you to use "/usr/bin/mailx" instead of "/usr/bin/mail" in the commands below.
#
################################################################################


# 'notify-host-by-email' command definition
define command{
    command_name    notify-host-by-email
    command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" |/bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
    }

# 'notify-service-by-email' command definition
define command{
    command_name    notify-service-by-email
    command_line    /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" |/bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
    }





################################################################################
#
# SAMPLE HOST CHECK COMMANDS
#
################################################################################


# This command checks to see if a host is "alive" by pinging it
# The check must result in a 100% packet loss or 5 second (5000ms) round trip 
# average time to produce a critical error.
# Note: Five ICMP echo packets are sent (determined by the '-p 5' argument)

# 'check-host-alive' command definition
define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
        }




################################################################################
#
# SAMPLE SERVICE CHECK COMMANDS
#
# These are some example service check commands.  They may or may not work on
# your system, as they must be modified for your plugins.  See the HTML 
# documentation on the plugins for examples of how to configure command definitions.
#
# NOTE:  The following 'check_local_...' functions are designed to monitor
#        various metrics on the host that Nagios is running on (i.e. this one).
################################################################################

# 'check_local_disk' command definition
define command{
        command_name    check_local_disk
        command_line    $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
        }


# 'check_local_load' command definition
define command{
        command_name    check_local_load
        command_line    $USER1$/check_load -w $ARG1$ -c $ARG2$
        }


# 'check_local_procs' command definition
define command{
        command_name    check_local_procs
        command_line    $USER1$/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$
        }


# 'check_local_users' command definition
define command{
        command_name    check_local_users
        command_line    $USER1$/check_users -w $ARG1$ -c $ARG2$
        }


# 'check_local_swap' command definition
define command{
    command_name    check_local_swap
    command_line    $USER1$/check_swap -w $ARG1$ -c $ARG2$
    }


# 'check_local_mrtgtraf' command definition
define command{
    command_name    check_local_mrtgtraf
    command_line    $USER1$/check_mrtgtraf -F $ARG1$ -a $ARG2$ -w $ARG3$ -c $ARG4$ -e $ARG5$
    }


################################################################################
# NOTE:  The following 'check_...' commands are used to monitor services on
#        both local and remote hosts.
################################################################################

# 'check_ftp' command definition
define command{
        command_name    check_ftp
        command_line    $USER1$/check_ftp -H $HOSTADDRESS$ $ARG1$
        }


# 'check_hpjd' command definition
define command{
        command_name    check_hpjd
        command_line    $USER1$/check_hpjd -H $HOSTADDRESS$ $ARG1$
        }


# 'check_snmp' command definition
define command{
        command_name    check_snmp
        command_line    $USER1$/check_snmp -H $HOSTADDRESS$ $ARG1$
        }


# 'check_http' command definition
define command{
        command_name    check_http
        command_line    $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
        }


# 'check_ssh' command definition
define command{
    command_name    check_ssh
    command_line    $USER1$/check_ssh $ARG1$ $HOSTADDRESS$
    }


# 'check_dhcp' command definition
define command{
    command_name    check_dhcp
    command_line    $USER1$/check_dhcp $ARG1$
    }


# 'check_ping' command definition
define command{
        command_name    check_ping
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
        }


# 'check_pop' command definition
define command{
        command_name    check_pop
        command_line    $USER1$/check_pop -H $HOSTADDRESS$ $ARG1$
        }


# 'check_imap' command definition
define command{
        command_name    check_imap
        command_line    $USER1$/check_imap -H $HOSTADDRESS$ $ARG1$
        }


# 'check_smtp' command definition
define command{
        command_name    check_smtp
        command_line    $USER1$/check_smtp -H $HOSTADDRESS$ $ARG1$
        }


# 'check_tcp' command definition
define command{
    command_name    check_tcp
    command_line    $USER1$/check_tcp -H $HOSTADDRESS$ -p $ARG1$ $ARG2$
    }


# 'check_udp' command definition
define command{
    command_name    check_udp
    command_line    $USER1$/check_udp -H $HOSTADDRESS$ -p $ARG1$ $ARG2$
    }


# 'check_nt' command definition
define command{
    command_name    check_nt
    command_line    $USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v $ARG1$ $ARG2$
    }



################################################################################
#
# SAMPLE PERFORMANCE DATA COMMANDS
#
# These are sample performance data commands that can be used to send performance
# data output to two text files (one for hosts, another for services).  If you
# plan on simply writing performance data out to a file, consider using the 
# host_perfdata_file and service_perfdata_file options in the main config file.
#
################################################################################


# 'process-host-perfdata' command definition
define command{
    command_name    process-host-perfdata
    command_line    /usr/bin/printf "%b" "$LASTHOSTCHECK$\t$HOSTNAME$\t$HOSTSTATE$\t$HOSTATTEMPT$\t$HOSTSTATETYPE$\t$HOSTEXECUTIONTIME$\t$HOSTOUTPUT$\t$HOSTPERFDATA$\n" >> /usr/local/nagios/var/host-perfdata.out
    }


# 'process-service-perfdata' command definition
define command{
    command_name    process-service-perfdata
    command_line    /usr/bin/printf "%b" "$LASTSERVICECHECK$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICESTATE$\t$SERVICEATTEMPT$\t$SERVICESTATETYPE$\t$SERVICEEXECUTIONTIME$\t$SERVICELATENCY$\t$SERVICEOUTPUT$\t$SERVICEPERFDATA$\n" >> /usr/local/nagios/var/service-perfdata.out
    }

#'check_nrpe' command definition
  define command{
            command_name   check_nrpe
            command_line   $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
            }
# ---- The following three commands were added ----
define command{
        command_name    check_jps
        command_line    /usr/local/nagios/libexec/check_jps $ARG1$ $ARG2$
        }

define command{
        command_name    check_zhulh
        command_line    /usr/local/nagios/libexec/check_zhulh $ARG1$ $ARG2$
        }

define command{
        command_name    check_jps2
        command_line    /usr/local/nagios/libexec/check_jps2 $ARG1$ $ARG2$
        }      
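The source of the three added plugins (check_jps, check_zhulh, check_jps2) is not shown in this article. As a rough idea of what such a plugin might look like, here is a minimal sketch of a check that receives a process name as $ARG1$ and a host name as $ARG2$ and follows the Nagios exit-code convention (0 = OK, 2 = CRITICAL, 3 = UNKNOWN). The function name check_jps_output and the idea of feeding it `jps` output on standard input are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical sketch of a check_jps-style plugin (NOT the author's script).
# $1 = Java process name to look for, $2 = host label used only in the message.
# Reads jps-style output ("<pid> <name>" per line) from standard input.
check_jps_output() {
    proc="$1"
    host="$2"
    if [ -z "$proc" ] || [ -z "$host" ]; then
        echo "UNKNOWN - usage: check_jps_output <process> <host>"
        return 3
    fi
    # Match the process name exactly in the second column.
    if awk -v p="$proc" '$2 == p { found = 1 } END { exit !found }'; then
        echo "OK - $proc is running on $host"
        return 0
    fi
    echo "CRITICAL - $proc is not running on $host"
    return 2
}

# Example with canned input; a real plugin might pipe in `ssh "$host" jps` instead.
printf '2101 DataNode\n2345 Jps\n' | check_jps_output DataNode node2
```

Nagios only looks at the first line of output and the exit status, so a plugin of this shape can be wired in with a command_line such as the ones above.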

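    d. hosts.cfg

     This file does not exist by default and must be created by hand. hosts.cfg specifies the addresses of the monitored hosts and their related attributes. For this setup it is configured as follows: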

[root@node1 etc]# cat hosts.cfg 
define host{
        use                     linux-server2
        host_name               node2
        alias                   Nagios-node2
        address                 192.168.11.167
        }
define host{
        use                     linux-server3
        host_name               node3
        alias                   Nagios-node3
        address                 192.168.11.166
        }
define hostgroup{      
        hostgroup_name          bsmart-servers      
        alias                   bsmart servers        
        members                 node2,node3
        }      
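When more hosts are added, writing each block by hand gets repetitive. The sketch below prints host definitions in the same layout as the file above; gen_host is a hypothetical helper name, and the template names are simply the ones used in this article.

```shell
#!/bin/sh
# Hypothetical helper: print one Nagios host definition.
# Arguments: template name, host name, alias, IP address.
gen_host() {
    printf 'define host{\n'
    printf '        use                     %s\n' "$1"
    printf '        host_name               %s\n' "$2"
    printf '        alias                   %s\n' "$3"
    printf '        address                 %s\n' "$4"
    printf '        }\n'
}

# Values taken from the hosts.cfg shown above.
gen_host linux-server2 node2 Nagios-node2 192.168.11.167
gen_host linux-server3 node3 Nagios-node3 192.168.11.166
```

The output can be redirected into hosts.cfg and then checked with the verification command described in section 7.4.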

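    Note: /usr/local/nagios/etc/objects contains localhost.cfg and windows.cfg by default. localhost.cfg defines the monitoring of the monitoring host itself, and windows.cfg defines the monitoring of Windows hosts; both contain host definitions and the related service definitions. Adjust them to your needs, as follows: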

    localhost.cfg

[root@node1 etc]# cat objects/localhost.cfg 
###############################################################################
# LOCALHOST.CFG - SAMPLE OBJECT CONFIG FILE FOR MONITORING THIS MACHINE
#
# Last Modified: 05-31-2007
#
# NOTE: This config file is intended to serve as an *extremely* simple 
#       example of how you can create configuration entries to monitor
#       the local (Linux) machine.
#
###############################################################################




###############################################################################
###############################################################################
#
# HOST DEFINITION
#
###############################################################################
###############################################################################

# Define a host for the local machine

define host{
        use                     linux-server            ; Name of host template to use
                            ; This host definition will inherit all variables that are defined
                            ; in (or inherited by) the linux-server host template definition.
        host_name               node1
        alias                   node1
        address                 192.168.11.164
        }



###############################################################################
###############################################################################
#
# HOST GROUP DEFINITION
#
###############################################################################
###############################################################################

# Define an optional hostgroup for Linux machines

define hostgroup{
        hostgroup_name  linux-servers ; The name of the hostgroup
        alias           Linux Servers ; Long name of the group
        members         node1     ; Comma separated list of hosts that belong to this group
        }



###############################################################################
###############################################################################
#
# SERVICE DEFINITIONS
#
###############################################################################
###############################################################################


# Define a service to "ping" the local machine

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             PING
        check_command                   check_ping!100.0,20%!500.0,60%
        }


# Define a service to check the disk space of the root partition
# on the local machine.  Warning if < 20% free, critical if
# < 10% free space on partition.

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             Root Partition
        check_command                   check_local_disk!20%!10%!/
        }



# Define a service to check the number of currently logged in
# users on the local machine.  Warning if > 20 users, critical
# if > 50 users.

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             Current Users
        check_command                   check_local_users!20!50
        }


# Define a service to check the number of currently running procs
# on the local machine.  Warning if > 250 processes, critical if
# > 400 processes.

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             Total Processes
        check_command                   check_local_procs!250!400!RSZDT
        }



# Define a service to check the load on the local machine. 

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             Current Load
        check_command                   check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
        }



# Define a service to check the swap usage the local machine. 
# Critical if less than 10% of swap is free, warning if less than 20% is free

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             Swap Usage
        check_command                   check_local_swap!20!10
        }



# Define a service to check SSH on the local machine.
# Notifications are enabled for this service here; set notifications_enabled to 0 if SSH is not running on this host.

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             SSH
        check_command                   check_ssh
        notifications_enabled           1
        }



# Define a service to check HTTP on the local machine.
# Notifications are enabled for this service here; set notifications_enabled to 0 if HTTP is not running on this host.

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             HTTP
        check_command                   check_http
        notifications_enabled           1
        }

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node1
        service_description             dns on node1
        check_command                   check_jps!dns!node1
        notifications_enabled           1
        }      

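    windows.cfg is omitted here.

    e. services.cfg

    This file does not exist by default either and must be created by hand. services.cfg defines the services and host resources to monitor, for example the HTTP service, the FTP service, disk space, and system load.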

[root@node1 etc]# cat services.cfg 

define service{
        use                     local-service
        host_name               node3
        service_description     check-host-alive
        check_command           check-host-alive
        }  

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node3
        service_description             datanode on node3
        check_command                   check_jps2!DataNode!node3
        notifications_enabled           1
        }

define service{
        use                     local-service
        host_name               node2
        service_description     check-host-alive
        check_command           check-host-alive
        }  

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node2
        service_description             datanode on node2
        check_command                   check_jps2!DataNode!node2
        notifications_enabled           1
        }


define service{
        use                             local-service
        host_name                       node2
        service_description             mysql
        check_command                   check_nrpe!check_mysql
        notifications_enabled           1
        check_interval                  1               ; Actively check the service every 1 minute
        retry_interval                  1               ; Schedule service check retries at 1 minute intervals
        max_check_attempts              2    
        }      

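    f. contacts.cfg

    contacts.cfg defines contacts and contact groups. When a monitored host or service fails, Nagios sends a notification through the configured method (email or SMS) to the contacts or contact groups defined here.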

[root@node1 etc]# cat contacts.cfg 
define contact{
        contact_name                    David           
        use                             generic-contact 
        alias                           Nagios Admin
        email                           [email protected]
        }
define contact{
        contact_name                    Jack
        use                             generic-contact
        alias                           Nagios Admin2
        email                           [email protected]
        }

define contactgroup{
        contactgroup_name       ts                             
        alias                   Technical Support               
        members                 David,Jack                 
        }      

    g. timeperiods.cfg

    This file mainly defines the time periods used for monitoring. Below is a working example:

[root@node1 etc]# cat timeperiods.cfg 

define timeperiod{  
        timeperiod_name 24x7  
        alias           24 Hours A Day, 7 Days A Week  
        sunday          00:00-24:00  
        monday          00:00-24:00  
        tuesday         00:00-24:00  
        wednesday       00:00-24:00  
        thursday        00:00-24:00  
        friday          00:00-24:00  
        saturday        00:00-24:00  
        }
define timeperiod{  
        timeperiod_name workhours   
        alias           Normal Work Hours  
        monday          09:00-17:00  
        tuesday         09:00-17:00  
        wednesday       09:00-17:00  
        thursday        09:00-17:00  
        friday          09:00-17:00  
        }        
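To make the HH:MM-HH:MM notation concrete, the following sketch tests whether a given time falls inside one of these ranges. It is only an illustration of the notation, not how Nagios evaluates time periods internally; the helper names to_minutes and in_range are made up, and treating the end of the range as exclusive is an assumption of this sketch.

```shell
#!/bin/sh
# Convert HH:MM to minutes since midnight (leading zeros are stripped so that
# values such as 09 are not misread as octal in shell arithmetic).
to_minutes() {
    h=${1%%:*}; m=${1##*:}
    h=${h#0}; m=${m#0}
    echo $(( ${h:-0} * 60 + ${m:-0} ))
}

# in_range <HH:MM> <HH:MM-HH:MM>: succeed if the time falls inside the range
# (start inclusive, end exclusive).
in_range() {
    now=$(to_minutes "$1")
    start=$(to_minutes "${2%%-*}")
    end=$(to_minutes "${2##*-}")
    [ "$now" -ge "$start" ] && [ "$now" -lt "$end" ]
}

if in_range 10:30 09:00-17:00; then echo "10:30 is inside workhours"; fi
if ! in_range 18:00 09:00-17:00; then echo "18:00 is outside workhours"; fi
```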

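    h. cgi.cfg

    This file controls the CGI scripts. To run CGI actions from the Nagios web interface, such as restarting the Nagios process, disabling notifications, or stopping host checks, cgi.cfg has to be configured. Because the web interface authenticates as the user david, it is enough to grant that user the relevant permissions in cgi.cfg. The settings to change are: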

default_user_name=david
authorized_for_system_information=nagiosadmin,david  
authorized_for_configuration_information=nagiosadmin,david  
authorized_for_system_commands=david
authorized_for_all_services=nagiosadmin,david  
authorized_for_all_hosts=nagiosadmin,david
authorized_for_all_service_commands=nagiosadmin,david  
authorized_for_all_host_commands=nagiosadmin,david       

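    i. nagios.cfg

    nagios.cfg lives at /usr/local/nagios/etc/nagios.cfg by default and is the core Nagios configuration file. An object configuration file only takes effect if it is referenced here, so each object file simply needs a cfg_file entry in nagios.cfg.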

# You can specify individual object config files as shown below:
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
#cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
#cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/templates.cfg


# Definitions for monitoring the local (Linux) host
cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
#cfg_file=/usr/local/nagios/etc/contactgroups.cfg
cfg_file=/usr/local/nagios/etc/contacts.cfg
#cfg_file=/usr/local/nagios/etc/hostgroups.cfg
cfg_file=/usr/local/nagios/etc/hosts.cfg
cfg_file=/usr/local/nagios/etc/services.cfg
cfg_file=/usr/local/nagios/etc/timeperiods.cfg

# Definitions for monitoring a Windows machine
#cfg_file=/usr/local/nagios/etc/objects/windows.cfg

# Definitions for monitoring a router/switch
#cfg_file=/usr/local/nagios/etc/objects/switch.cfg

status_update_interval=10

nagios_user=nagios
nagios_group=nagios

check_external_commands=0

command_check_interval=10s

interval_length=60      
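Because commented-out and active cfg_file lines sit side by side, it is easy to lose track of which object files are really loaded. A small sketch for listing only the active entries follows; active_cfg_files is a hypothetical helper name, and the sample input is a shortened copy of the excerpt above.

```shell
#!/bin/sh
# List the object files that a nagios.cfg actually loads: lines starting with
# "cfg_file=" count, commented-out lines (leading "#") do not.
active_cfg_files() {
    sed -n 's/^cfg_file=//p'
}

# Canned sample based on the excerpt above; on a real system one might run
#   active_cfg_files < /usr/local/nagios/etc/nagios.cfg
active_cfg_files <<'EOF'
cfg_file=/usr/local/nagios/etc/objects/commands.cfg
#cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
cfg_file=/usr/local/nagios/etc/contacts.cfg
cfg_file=/usr/local/nagios/etc/hosts.cfg
EOF
```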

    7.4 Verifying the Nagios configuration files

    Nagios makes configuration checking easy; a single command does it:

[root@node1 etc]# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Total Warnings: 0
Total Errors:   0

Things look okay - No serious problems were detected during the pre-flight check      

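    This check is very useful: error messages usually name the offending configuration file and the line in it, which makes Nagios much easier to configure. Warnings can usually be ignored, because they are generally only advisory.

    If you see the output above, the configuration is fine and the Nagios service can be started.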
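The check can also gate a restart in a script. The sketch below reads the text printed by `nagios -v` and succeeds only when it reports zero errors; preflight_ok is a hypothetical helper name and the sample text is copied from the output above.

```shell
#!/bin/sh
# preflight_ok: read the output of `nagios -v` on standard input and succeed
# only if it reports "Total Errors: 0".
preflight_ok() {
    errors=$(sed -n 's/^Total Errors:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
    [ -n "$errors" ] && [ "$errors" -eq 0 ]
}

# Canned sample copied from the output above.
sample='Total Warnings: 0
Total Errors:   0'

if printf '%s\n' "$sample" | preflight_ok; then
    echo "configuration looks clean; safe to (re)start Nagios"
else
    echo "configuration has errors; not restarting"
fi
```

On a real system the sample would be replaced by piping the output of /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg into the function.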

8. Starting and Stopping Nagios

    8.1 Starting Nagios

service nagios start      

    8.2 Starting Nagios manually

    Start the Nagios daemon with the "-d" option of the nagios command:

# /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg      

    8.3 Stopping Nagios manually

#kill <nagios_pid>      
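One way to obtain <nagios_pid> is to read it from the daemon's lock file. The sketch below assumes the lock file is /usr/local/nagios/var/nagios.lock, which is only a guess for a source install under /usr/local/nagios; the lock_file setting in nagios.cfg gives the real path. read_pid is a hypothetical helper name, and the script only prints the kill command rather than running it.

```shell
#!/bin/sh
# read_pid <file>: print the PID stored in a lock file, failing if the file is
# missing or does not contain a plain number.
read_pid() {
    [ -r "$1" ] || return 1
    pid=$(head -n 1 "$1" | tr -d '[:space:]')
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac
    echo "$pid"
}

# Assumed location; check the lock_file setting in nagios.cfg for the real path.
LOCK_FILE=${LOCK_FILE:-/usr/local/nagios/var/nagios.lock}

if pid=$(read_pid "$LOCK_FILE"); then
    echo "would run: kill $pid"
else
    echo "no usable PID in $LOCK_FILE"
fi
```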

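9. Monitoring "Local Information" on Remote Linux Hosts with NRPE

    The previous sections already monitor whether the remote Linux hosts are alive, which only takes a ping. Services such as FTP, SSH, and HTTP are also open to the network, so they can be tested from any machine even without Nagios. "Local information" such as disk usage and CPU load is different: Nagios can measure it only on the host it runs on, because reading it from another machine requires appropriate access to that machine. The NRPE add-on solves this problem and makes it possible to monitor the "local information" of Linux hosts.

    9.1 How NRPE works

[Figure: NRPE architecture diagram]

    NRPE consists of two parts:

    the check_nrpe plugin, which lives on the monitoring host;

    the NRPE daemon, which runs on the remote Linux host (usually the monitored machine).

    As the figure above shows, when Nagios needs to check a service or resource on a remote Linux host, the process is:

    Nagios runs the check_nrpe plugin and tells it what to check;

    the check_nrpe plugin connects to the remote NRPE daemon over SSL;

    the NRPE daemon runs the corresponding Nagios plugin to perform the check;

    the NRPE daemon returns the result to the check_nrpe plugin, which hands it to Nagios for processing.

    Note: the NRPE daemon needs the Nagios plugins installed on the remote Linux host; without them the daemon cannot perform any checks.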
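The third step, where the daemon maps the requested check name to a local plugin command, can be sketched as a simple lookup. NRPE's nrpe.cfg binds names to command lines with entries of the form command[name]=command line; the helper nrpe_lookup below is a hypothetical stand-in for that lookup, not NRPE's own code, and the two sample entries use thresholds chosen for illustration.

```shell
#!/bin/sh
# nrpe_lookup <name>: read nrpe.cfg-style text on standard input and print the
# command line bound to command[<name>]; fail if the name is not defined.
nrpe_lookup() {
    awk -v key="command[$1]=" '
        index($0, key) == 1 { print substr($0, length(key) + 1); found = 1; exit }
        END { exit !found }
    '
}

# Sample entries in nrpe.cfg syntax (thresholds are illustrative).
sample='command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20'

printf '%s\n' "$sample" | nrpe_lookup check_load
printf '%s\n' "$sample" | nrpe_lookup check_mysql || echo "check_mysql is not defined"
```

A name that is missing from nrpe.cfg, like check_mysql in this sample, cannot be executed by the daemon until an entry for it is added on the monitored host.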

    9.2 On the monitored hosts (node2, node3)

    a. Add the user and set its password

    # useradd nagios

    # passwd nagios

    b. Install the Nagios plugins

# tar zxvf nagios-plugins-1.4.16.tar.gz
# cd nagios-plugins-1.4.16
# ./configure --prefix=/usr/local/nagios
# make && make install      

    After this step, three directories (include, libexec, and share) are created under /usr/local/nagios/.

    Change the directory ownership:

# chown nagios:nagios /usr/local/nagios
# chown -R nagios:nagios /usr/local/nagios/libexec

    c. Install NRPE

# wget http://prdownloads.sourceforge.net/sourceforge/nagios/nrpe-2.13.tar.gz
# tar zxvf nrpe-2.13.tar.gz
# cd nrpe-2.13
# ./configure
*** Configuration summary for nrpe 2.13 11-11-2011 ***:

 General Options:
 -------------------------
 NRPE port:    5666
 NRPE user:    nagios
 NRPE group:   nagios
 Nagios user:  nagios
 Nagios group: nagios


Review the options above for accuracy.  If they look okay,
type 'make all' to compile the NRPE daemon and client.      
[root@node2 nrpe-2.13]# make all
cd ./src/; make ; cd ..
make[1]: Entering directory `/app/nrpe-2.13/src'
gcc -g -O2 -I/usr/include/openssl -I/usr/include -DHAVE_CONFIG_H -o nrpe nrpe.c utils.c acl.c -L/usr/lib  -lssl -lcrypto -lnsl -lwrap  
gcc -g -O2 -I/usr/include/openssl -I/usr/include -DHAVE_CONFIG_H -o check_nrpe check_nrpe.c utils.c -L/usr/lib  -lssl -lcrypto -lnsl 
make[1]: Leaving directory `/app/nrpe-2.13/src'

*** Compile finished ***

If the NRPE daemon and client compiled without any errors, you
can continue with the installation or upgrade process.

Read the PDF documentation (NRPE.pdf) for information on the next
steps you should take to complete the installation or upgrade.      

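    Next, install the NRPE plugin, the daemon, and the sample configuration file.

    c.1 Install check_nrpe

    Only the monitoring host needs the check_nrpe plugin; the monitored hosts do not. It is installed here purely for testing.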

[root@node2 nrpe-2.13]# make install-plugin
cd ./src/ && make install-plugin
make[1]: Entering directory `/app/nrpe-2.13/src'
/usr/bin/install -c -m 775 -o nagios -g nagios -d /usr/local/nagios/libexec
/usr/bin/install -c -m 775 -o nagios -g nagios check_nrpe /usr/local/nagios/libexec
make[1]: Leaving directory `/app/nrpe-2.13/src'      

     c.2 Install the daemon

[root@node2 nrpe-2.13]# make install-daemon
cd ./src/ && make install-daemon
make[1]: Entering directory `/app/nrpe-2.13/src'
/usr/bin/install -c -m 775 -o nagios -g nagios -d /usr/local/nagios/bin
/usr/bin/install -c -m 775 -o nagios -g nagios nrpe /usr/local/nagios/bin
make[1]: Leaving directory `/app/nrpe-2.13/src'      

    c.3 Install the configuration file

[root@node2 nrpe-2.13]# make install-daemon-config
/usr/bin/install -c -m 775 -o nagios -g nagios -d /usr/local/nagios/etc
/usr/bin/install -c -m 644 -o nagios -g nagios sample-config/nrpe.cfg /usr/local/nagios/etc      

     Following the installation guide, the NRPE daemon runs as a service under xinetd, so xinetd must be installed first; most systems ship with it by default.

     d. Install the xinetd script

[root@node2 nrpe-2.13]# make install-xinetd
/usr/bin/install -c -m 644 sample-config/nrpe.xinetd /etc/xinetd.d/nrpe       

    This creates the file /etc/xinetd.d/nrpe.

    Edit this script:

[root@node2 ~]# cat /etc/xinetd.d/nrpe 
# default: on
# description: NRPE (Nagios Remote Plugin Executor)
service nrpe
{
        flags           = REUSE
        socket_type     = stream
        port            = 5666
        wait            = no
        user            = nagios
        group           = nagios
        server          = /usr/local/nagios/bin/nrpe
        server_args     = -c /usr/local/nagios/etc/nrpe.cfg --inetd
        log_on_failure  += USERID
        disable         = no
        only_from       = 192.168.11.164 127.0.0.1
}

    Add the IP address of the monitoring host after only_from.

    Edit /etc/services and add the NRPE service:

[root@node2 ~]# tail -n 4 /etc/services 
iqobject    48619/tcp            # iqobject
iqobject    48619/udp            # iqobject
# Local services
nrpe            5666/tcp                        #nrpe      
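If this step is scripted, it helps to make it safe to run more than once. The sketch below appends the entry only when it is missing; add_service_entry is a hypothetical helper name, and the demonstration writes to a scratch file rather than to /etc/services.

```shell
#!/bin/sh
# add_service_entry <file> <name> <port/proto>: append "<name> <port/proto>" to
# a services-style file unless a line for that name and port already exists.
add_service_entry() {
    file=$1; name=$2; port=$3
    if grep -Eq "^${name}[[:space:]]+${port}([[:space:]]|\$)" "$file"; then
        return 0
    fi
    printf '%s\t\t%s\t\t\t# %s\n' "$name" "$port" "$name" >> "$file"
}

# Demonstration on a scratch file; pointing it at /etc/services needs root.
demo=$(mktemp)
add_service_entry "$demo" nrpe 5666/tcp
add_service_entry "$demo" nrpe 5666/tcp   # second call is a no-op
cat "$demo"
rm -f "$demo"
```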

     Restart the xinetd service:

[root@node2 ~]# service xinetd restart
Stopping xinetd:                                           [  OK  ]
Starting xinetd:                                           [  OK  ]      

     Check whether NRPE has started:

[root@node2 ~]# netstat -an|grep 5666
tcp        0      0 0.0.0.0:5666                0.0.0.0:*                   LISTEN            

    Port 5666 is now listening.
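The same check can be automated by parsing the netstat output instead of reading it by eye. This is a minimal sketch; port_listening is a hypothetical helper name, and the sample line is the one shown above.

```shell
#!/bin/sh
# port_listening <port>: read `netstat -an` output on standard input and succeed
# if some TCP socket is in LISTEN state on that local port.
port_listening() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $NF == "LISTEN" {
            n = split($4, parts, ":")
            if (parts[n] == port) found = 1
        }
        END { exit !found }
    '
}

sample='tcp        0      0 0.0.0.0:5666                0.0.0.0:*                   LISTEN'
if printf '%s\n' "$sample" | port_listening 5666; then
    echo "NRPE port 5666 is listening"
fi
```

On the monitored host itself the sample would be replaced by `netstat -an | port_listening 5666`.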

    e. Test that NRPE works

    Use the check_nrpe plugin installed on the monitored host above to test whether NRPE works:

    # /usr/local/nagios/libexec/check_nrpe -H localhost

    It returns the current NRPE version:

[root@node2 ~]# /usr/local/nagios/libexec/check_nrpe -H localhost
NRPE v2.13      

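    This confirms that check_nrpe can connect to the NRPE daemon locally.

   Note: for the following steps to work, the local firewall must allow the external monitoring host to reach port 5666.

   9.3 On the monitoring host (node1)

   Nagios is already running, so what remains is to:

   install the check_nrpe plugin;

   create the check_nrpe command definition in commands.cfg, because only commands defined in commands.cfg can be used in services.cfg;

   create the service checks for the monitored hosts.

    9.3.1 Install the check_nrpe plugin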

# tar zxvf nrpe-2.13.tar.gz 
# cd nrpe-2.13
# ./configure
# make all
# make install-plugin      

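     These steps are enough, because only the check_nrpe plugin is needed.

     NRPE is already installed on node2 and node3, so now test the communication between check_nrpe on the monitoring host and the NRPE daemon running on a monitored host: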

[root@node1 etc]# /usr/local/nagios/libexec/check_nrpe -H 192.168.11.167
NRPE v2.13      

    The NRPE version is returned correctly, so everything works.

    9.3.2 Add the check_nrpe definition to commands.cfg

[root@node1 etc]# cat objects/commands.cfg

#'check_nrpe' command definition
  define command{
            command_name   check_nrpe
            command_line   $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
            }      

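    The $ARG1$ value after -c is the check command passed to the NRPE daemon for execution. As noted earlier, it must be one of the commands defined in nrpe.cfg (the sample file defines five). Any other command, such as the check_mysql used below, has to be added to nrpe.cfg on the monitored host first. In services.cfg, pass this argument to check_nrpe after a "!".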
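The "!" convention can be illustrated with a short sketch that splits a check_command value into the command name and the values bound to $ARG1$, $ARG2$, and so on. It is a simplification (it ignores any escaping rules), and split_check_command is a hypothetical helper name.

```shell
#!/bin/sh
# split_check_command <check_command>: print the command name and the values
# that would be bound to $ARG1$, $ARG2$, ... (simplified: no escaping rules).
split_check_command() {
    old_ifs=$IFS
    IFS='!'
    set -f              # no filename expansion while splitting
    set -- $1           # split the value on "!"
    set +f
    IFS=$old_ifs
    echo "command_name=$1"
    shift
    i=1
    for arg in "$@"; do
        echo "ARG$i=$arg"
        i=$((i + 1))
    done
}

split_check_command 'check_nrpe!check_mysql'
split_check_command 'check_jps2!DataNode!node3'
```

For check_nrpe!check_mysql this yields ARG1=check_mysql, which is the name the NRPE daemon then looks up in its own nrpe.cfg.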

    9.3.3 Define the monitoring of the Nagios-Linux hosts

    The checks for the Nagios-Linux hosts can now be defined in services.cfg.

[root@node1 etc]# cat services.cfg 

define service{
        use                     local-service
        host_name               node3
        service_description     check-host-alive
        check_command           check-host-alive
        }  

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node3
        service_description             datanode on node3
        check_command                   check_jps2!DataNode!node3
        notifications_enabled           1
        }

define service{
        use                     local-service
        host_name               node2
        service_description     check-host-alive
        check_command           check-host-alive
        }  

define service{
        use                             local-service         ; Name of service template to use
        host_name                       node2
        service_description             datanode on node2
        check_command                   check_jps2!DataNode!node2
        notifications_enabled           1
        }


define service{
        use                             local-service
        host_name                       node2
        service_description             mysql
        check_command                   check_nrpe!check_mysql
        notifications_enabled           1
        check_interval                  1               ; Actively check the service every 1 minute
        retry_interval                  1               ; Schedule service check retries at 1 minute intervals
        max_check_attempts              2    
        }      

    9.3.4 Check the result:

[Figure: screenshot of the resulting configuration]

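10. Configuring Nagios Email Alerts

    10.1 Install the sendmail components

    First make sure the sendmail components are fully installed. The following command installs them:

    # yum install -y sendmail*

    Then restart the sendmail service:

    # service sendmail restart

    Then send a test message to verify that sendmail works: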

    # echo "Hello World" | mail [email protected]

    10.2 Configure email alerts

    The file /usr/local/nagios/etc/contacts.cfg was already configured above; Nagios sends alert emails to the E-mail addresses listed in it.
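To preview what such an alert looks like, the body built by the notify-host-by-email command can be rendered by hand. In the sketch below, shell variables stand in for the Nagios macros, and all of their values (including the plugin output text) are made up for illustration; render_host_mail is a hypothetical helper name.

```shell
#!/bin/sh
# Stand-in values for the Nagios macros; Nagios substitutes the real ones
# before it runs the notification command. All values here are made up.
NOTIFICATIONTYPE=PROBLEM
NAGIOS_HOSTNAME=node2
HOSTSTATE=DOWN
HOSTADDRESS=192.168.11.167
HOSTOUTPUT='CRITICAL - Host Unreachable (192.168.11.167)'
LONGDATETIME='(date/time goes here)'

# Render the message body the same way the command definition does:
# printf "%b" expands the \n sequences into real line breaks.
render_host_mail() {
    printf '%b' "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE\nHost: $NAGIOS_HOSTNAME\nState: $HOSTSTATE\nAddress: $HOSTADDRESS\nInfo: $HOSTOUTPUT\n\nDate/Time: $LONGDATETIME\n"
}

render_host_mail
echo "Subject: ** $NOTIFICATIONTYPE Host Alert: $NAGIOS_HOSTNAME is $HOSTSTATE **"
```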

    10.3 Nagios notifications

[Figure: screenshot of a Nagios notification]

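11. Important Notes

11.1 Monitoring remote MySQL

     Monitoring remote MySQL with Nagios

11.2 Because the DataNode processes on node2 and node3 have to be monitored, passwordless SSH login must be set up between node1, node2, and node3.

11.3 Error when starting Nagios: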

[root@rhel5 etc]# service nagios start
Starting nagios:This account is currently not available.
 done.      

    Fix: edit /etc/passwd

    and change the nagios user's login shell from /sbin/nologin to /bin/bash.
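The message "This account is currently not available." is what /sbin/nologin prints, so it is worth confirming the nagios user's shell before and after the change. A minimal sketch follows; login_shell is a hypothetical helper name, and the sample passwd line uses made-up UID and GID values.

```shell
#!/bin/sh
# login_shell <user>: read passwd-format text on standard input and print the
# login shell (7th colon-separated field) of the given user.
login_shell() {
    awk -F: -v u="$1" '$1 == u { print $7; found = 1; exit } END { exit !found }'
}

# Canned sample line (the UID/GID values are made up for illustration).
sample='nagios:x:500:500::/home/nagios:/sbin/nologin'
shell=$(printf '%s\n' "$sample" | login_shell nagios)
case "$shell" in
    */nologin|*/false) echo "nagios has a non-login shell: $shell" ;;
    *) echo "nagios login shell is $shell" ;;
esac
```

On a real system the input would be /etc/passwd, and `usermod -s /bin/bash nagios` is an alternative to editing the file by hand.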

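12. References

    •Nagios official website: http://www.nagios.org/

    •yahoon的小屋, 《nagios全攻略》 (The Complete Nagios Guide): http://yahoon.blog.51cto.com/

    •技術成就夢想, 《運維監控利器Nagios》 (Nagios, a Powerful Operations Monitoring Tool): http://ixdba.blog.51cto.com/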