Monitoring Multiple Tomcat Services on One Web Application Server with Nagios check_http

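Monitoring Tomcat with Nagios is both simple and complicated. Checking whether a single Tomcat service on a web application server is running is simple. Monitoring other Tomcat metrics, such as connection counts or JVM memory usage, is harder, and I could not find a suitable monitoring script through Google. Monitoring several Tomcat instances on one web server, many of which redirect requests elsewhere, takes a good deal of extra work.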

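The usual approach is a TCP check against the Tomcat port. Its weakness is that when Tomcat hangs, the TCP port still accepts connections and the check reports OK, even though the web applications deployed in Tomcat can no longer be reached. In that situation Tomcat has to be monitored over HTTP instead.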

This article therefore records how to monitor multiple Tomcat application servers on a single web server over HTTP.

1 Install the NRPE client on the Tomcat web server

1.1 Install the NRPE client from RPM packages

[root@localhost nagios]# ll
total 768
-rw-r--r-- 1 root root 713389 12-16 12:08 nagios-plugins-1.4.11-1.x86_64.rpm
-rw-r--r-- 1 root root  32706 12-16 12:09 nrpe-2.12-1.x86_64.rpm
-rw-r--r-- 1 root root  18997 12-16 12:08 nrpe-plugin-2.12-1.x86_64.rpm
 
[root@localhost nagios]# rpm -ivh *.rpm --nodeps  --force
Preparing...                ########################################### [100%]
   1:nagios-plugins         ########################################### [33%]
id: nagios: No such user
   2:nrpe                   ########################################### [67%]
   3:nrpe-plugin            ########################################### [100%]
[root@cache-1 ~]#

1.2 Append the check commands and the monitoring server's IP address to the end of the configuration file

[root@ localhost nagios]# vim /etc/nagios/nrpe.cfg
# add by tim on 2014-06-11
command[check_users]=/usr/local/nagios/libexec/check_users -w 8 -c 15
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_sda1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
#command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 50 -c 80
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 750 -c 800
command[check-host-alive]=/usr/local/nagios/libexec/check_ping -H 10.xx.xx.10 -w 3000.0,80% -c 5000.0,100% -p 5
allowed_hosts = 127.0.0.1,10.xx.xxx.xx1

Check that a command works:

[root@webserver nrpe-2.15]# /usr/local/nagios/libexec/check_users -w 8 -c 15
USERS OK - 2 users currently logged in |users=2;8;15;0
[root@webserver nrpe-2.15]#

The USERS OK - ... output shows that the command works.

1.3 Starting NRPE fails with the following error:

[root@webserver ~]# service nrpe restart
Shutting down nrpe:                                        [FAILED]
Starting nrpe: /usr/sbin/nrpe: error while loading shared libraries: libssl.so.6: cannot open shared object file: No such file or directory
                                                           [FAILED]
[root@webserver ~]#
[root@db-m2-slave-1 nagios_client]# service nrpe start
Starting nrpe: /usr/sbin/nrpe: error while loading shared libraries: libssl.so.6: cannot open shared object file: No such file or directory
                                                           [FAILED]
[root@db-m2-slave-1 nagios_client]#

Create a symbolic link:

[root@db-m2-slave-1 nagios_client]# ln -s /usr/lib64/libssl.so /usr/lib64/libssl.so.6

(If libssl.so does not exist, link another version such as libssl.so.10 instead: ln -s /usr/lib64/libssl.so.10 /usr/lib64/libssl.so.6)

[root@db-m2-slave-1 nagios_client]#

Start NRPE again:

[root@webserver nagios_client]# service nrpe start
Starting nrpe: /usr/sbin/nrpe: error while loading shared libraries: libcrypto.so.6: cannot open shared object file: No such file or directory
                                                           [FAILED]
[root@web-10 ~]# ll /usr/lib64/libcrypto.so
lrwxrwxrwx. 1 root root 18 Oct 13  2013 /usr/lib64/libcrypto.so -> libcrypto.so.1.0.0
[root@webserver nagios_client]#

Create another symbolic link:

[root@webserver nagios_client]# ln -s /usr/lib64/libcrypto.so /usr/lib64/libcrypto.so.6
(Or, if libcrypto.so does not exist, link libcrypto.so.10 instead: ln -s /usr/lib64/libcrypto.so.10 /usr/lib64/libcrypto.so.6)
[root@webserver nagios_client]# service nrpe start
Starting nrpe:                                             [  OK  ]
[root@webserver nagios_client]#
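
Rather than discovering the missing libraries one failed start at a time, they can be listed in a single pass. The following is a minimal sketch, assuming ldd is available and the binary sits at /usr/sbin/nrpe as in the log above; missing_libs is a hypothetical helper name, not part of NRPE.

```shell
# Hypothetical helper: reads `ldd` output on stdin and prints the name of
# every library that ldd reports as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage (path taken from the error message above):
#   ldd /usr/sbin/nrpe | missing_libs
```

Linking an old soname such as libssl.so.6 to a newer OpenSSL build is a workaround that happened to work here; packages built for the target distribution are likely the more robust choice.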

1.4 Check that NRPE is running

Run the check from the Nagios server:

[root@cache-2 ~]#  /usr/local/nagios/libexec/check_nrpe -H 10.xx.xx.10
NRPE v2.12
[root@cache-2 ~]#

A version string in the reply (NRPE v2.12 here) means the connection works.

1.5 Add a test JSP file to the web application

(1) Create the test file

vim ./webapps/nagios_test_0611/nagios_test_0611.jsp
<%@ page language="java" contentType="text/html; charset=gb2312"
pageEncoding="gb2312"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<title>nagios test here</title>
</head>
<body>
 <center>Now time is: <%=new java.util.Date()%></center>
</body>
</html>

(2) Run the check_http command

[root@webserver~]# /usr/local/nagios/libexec/check_http -I 10.xx.xx.10 -p 8300 -u /nagios_test_0611/nagios_test_0611.jsp -e 200
HTTP CRITICAL - Invalid HTTP response received from host on port 8300: HTTP/1.1 404 Not Found

Tomcat has to be restarted before the newly added JSP can be served. Run the following stop and start commands:

/usr/local/app/apache-tomcat-6.0.37_8300/bin/shutdown.sh

/usr/local/app/apache-tomcat-6.0.37_8300/bin/startup.sh

Then run check_http again:

[root@webserver~]# /usr/local/nagios/libexec/check_http -I 10.xx.xx.10 -p 8300 -u /nagios_test_0611/nagios_test_0611.jsp -e 200
HTTP OK: Status line output matched "200" - 571 bytes in 0.882 second response time |time=0.882479s;;;0.000000 size=571B;;;0
[root@ webserver ~]#

1.6 View the NRPE monitoring commands

[root@webserver nrpe-2.15]# cat /etc/nagios/nrpe.cfg | grep -v "^#" | grep -v "^$"
log_facility=daemon
pid_file=/var/run/nrpe.pid
server_port=5666
nrpe_user=nagios
nrpe_group=nagios
dont_blame_nrpe=0
debug=0
command_timeout=60
connection_timeout=300
command[check_users]=/usr/local/nagios/libexec/check_users -w 8 -c 15
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_sda1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 750 -c 800
command[check-host-alive]=/usr/local/nagios/libexec/check_ping -H 10.xx.xx.10 -w 3000.0,80% -c 5000.0,100% -p 5
allowed_hosts=127.0.0.1,10.xx.xxx.xx1
[root@webserver nrpe-2.15]#

2 Add the host and service definitions on the Nagios server

2.1 Add the host to hosts.cfg

define host{
        use                     linux-server
        host_name               webserver
        alias                   webserver
        address                 10.xx.xx.10
        check_command           check-host-alive
        max_check_attempts              5
        check_period                    24x7
        contact_groups                  ops
        notification_interval           60
        notification_period             24x7
        notification_options            d,u,r
        }

2.2 Add the service checks for the web server to service.cfg

# No.007 webserver
#  service definition
define service{
        host_name               webserver
        service_description     check_load
        check_command           check_nrpe!check_load
        max_check_attempts      5
        normal_check_interval   3
        retry_check_interval    2
        check_period            24x7
        notification_interval   10
        notification_period     24x7
        notification_options    w,u,c,r
        contact_groups          opsweb
        }
 
define service{
        host_name               webserver
        service_description     check-host-alive
        check_command           check-host-alive
        max_check_attempts      5
        normal_check_interval   3
        retry_check_interval    2
        check_period            24x7
        notification_interval   10
        notification_period     24x7
        notification_options    w,u,c,r
        contact_groups          opsweb
        }
 
define service{
        host_name               webserver
        service_description     Check Disk sda1
        check_command           check_nrpe!check_sda1
        max_check_attempts      5
        normal_check_interval   3
        retry_check_interval    2
        check_period            24x7
        notification_interval   10
        notification_period     24x7
        notification_options    w,u,c,r
        contact_groups          opsweb
        }
 
define service{
        host_name               webserver
        service_description     Total Processes
        check_command           check_nrpe!check_total_procs
        max_check_attempts      5
        normal_check_interval   3
        retry_check_interval    2
        check_period            24x7
        notification_interval   10
        notification_period     24x7
        notification_options    w,u,c,r
        contact_groups          opsweb
        }
 
define service{
        host_name               webserver
        service_description     Current Users
        check_command           check_nrpe!check_users
        max_check_attempts      5
        normal_check_interval   3
        retry_check_interval    2
        check_period            24x7
        notification_interval   10
        notification_period     24x7
        notification_options    w,u,c,r
        contact_groups          opsweb
        }

define service{
        host_name               webserver
        service_description     Check Zombie Procs
        check_command           check_nrpe!check_zombie_procs
        max_check_attempts      5
        normal_check_interval   3
        retry_check_interval    2
        check_period            24x7
        notification_interval   10
        notification_period     24x7
        notification_options    w,u,c,r
        contact_groups          opsweb
        }
 
define service{
        host_name               webserver
        service_description     Check Tomcat 9300 Status
        check_command           check_nrpe!check_tomcat_9300_status
        max_check_attempts      5
        normal_check_interval   3
        retry_check_interval    2
        check_period            24x7
        notification_interval   10
        notification_period     24x7
        notification_options    w,u,c,r
        contact_groups          opsweb
        }
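
The NRPE-backed definitions above differ only in service_description and check_command, so they could be generated instead of copied by hand. This is a minimal sketch, not how the original setup was produced; gen_service is a hypothetical helper, and the fixed field values simply mirror the definitions above.

```shell
# Hypothetical generator: prints one NRPE-backed service definition.
# Arguments: host name, service description, NRPE command name.
gen_service() {
    host=$1; desc=$2; nrpe_cmd=$3
    cat <<EOF
define service{
        host_name               $host
        service_description     $desc
        check_command           check_nrpe!$nrpe_cmd
        max_check_attempts      5
        normal_check_interval   3
        retry_check_interval    2
        check_period            24x7
        notification_interval   10
        notification_period     24x7
        notification_options    w,u,c,r
        contact_groups          opsweb
        }
EOF
}

# Usage: append the output to service.cfg after reviewing it.
#   gen_service webserver "Check Tomcat 8300 Status" check_tomcat_8300_status
```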

2.3 Add the new opsweb contact group to contacts.cfg

define contactgroup{
        contactgroup_name       opsweb
        alias                   pl ops team
        members                 tim,mch,nagiosadmin
        }

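2.4 Add a new Tomcat monitoring command, check_tomcat_9300_status

check_tcp!8080 is not used here because, in practice, when the Tomcat service hangs the JSP pages no longer load while the monitored port 8080 still looks healthy, so no alert is raised. check_http is used instead: a generic /nagios_test_0611/nagios_test_0611.jsp file is created, and the check requests that JSP, as shown below: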

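vim commands.cfg
# add by tim on 20140611
# Note: $PORT$, $URL$, $N200$, $Warning$ and $Cri$ are not standard Nagios macros.
# The services in 2.2 call check_nrpe!check_tomcat_9300_status, so the arguments
# that actually run are the ones in the client's nrpe.cfg (see 2.5).
define command{
        command_name    check_tomcat_9300_status
        command_line    $USER1$/check_http -I $HOSTADDRESS$ -p $PORT$ -u $URL$ -e $N200$ -w $Warning$ -c $Cri$
        }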

The JSP file contains:

[root@webserver webapps]# vim ./nagios_test_0611/nagios_test_0611.jsp
<%@ page language="java" contentType="text/html; charset=gb2312"
pageEncoding="gb2312"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<title>nagios test here</title>
</head>
<body>
 <center>Now time is: <%=new java.util.Date()%></center>
</body>
</html>

2.5 Add the Tomcat port checks to nrpe.cfg on the monitored client:

command[check_tomcat_9300_status]=/usr/local/nagios/libexec/check_http -I 10.xx.xx.10 -p 9444 -u /nagios_test_0611/nagios_test_0611.jsp -e 200 -w 5 -c 10
command[check_tomcat_8300_status]=/usr/local/nagios/libexec/check_http -I 10.xx.xx.10 -p 8300 -u /nagios_test_0611/nagios_test_0611.jsp -e 200 -w 5 -c 10
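
The -e 200 -w 5 -c 10 options in the commands above decide which state Nagios sees. The decision can be sketched roughly as follows. http_state is a hypothetical helper written only for illustration; it is not part of the plugin, and the real check_http may treat values exactly on a threshold differently.

```shell
# Hypothetical helper, not part of Nagios: maps an HTTP status code and a
# response time in seconds to a plugin exit code (0=OK, 1=WARNING, 2=CRITICAL).
http_state() {
    status=$1; elapsed=$2; warn=$3; crit=$4
    # Any status other than 200 is CRITICAL, mirroring -e 200.
    if [ "$status" != "200" ]; then
        echo 2
        return
    fi
    # awk does the floating-point comparison that POSIX sh lacks.
    awk -v t="$elapsed" -v w="$warn" -v c="$crit" 'BEGIN {
        if (t + 0 > c + 0) print 2
        else if (t + 0 > w + 0) print 1
        else print 0
    }'
}

# Usage, with the 0.882 s response measured in section 1.5:
#   http_state 200 0.882 5 10
```

A hung Tomcat usually shows up here as a timeout or a non-200 status, which is exactly what a plain TCP port check cannot see.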

2.6 Testing reports an error

[root@cache-2 objects]# /usr/local/nagios/libexec/check_nrpe -H 10.xx.xx.10  -c check_load
NRPE: Unable to read output
[root@cache-2 objects]#

Run the check from the shell on the Nagios server:

/usr/local/nagios/libexec/check_nrpe -H 192.168.15.178 -c check_mysql_myisam_lock
[root@cache-2 etc]# /usr/local/nagios/libexec/check_nrpe -H 10.xx.xx.10  -c check_load
NRPE: Unable to read output
[root@cache-2 etc]#

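The same error appears, so the problem is probably on the monitored Nagios client.

The cause turned out to be a wrong path in nrpe.cfg. A source install puts the plugins under /usr/local/nagios/libexec/ (for example /usr/local/nagios/libexec/check_http), while the RPM install puts them under /usr/lib/nagios/plugins/. This host was installed from RPM, so the paths in nrpe.cfg were replaced with /usr/lib/nagios/plugins/. After service nrpe restart, the problem was resolved.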
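
A wrong plugin path like this can be caught before NRPE answers "Unable to read output". The following is a minimal sketch that assumes every command[...] line starts with the plugin path and that paths contain no spaces; check_nrpe_paths is a hypothetical helper name.

```shell
# Hypothetical helper: prints every plugin path referenced by a command[...]
# line in the given nrpe.cfg that is missing or not executable.
check_nrpe_paths() {
    cfg=$1
    # Keep the first word after "command[name]=", which is the plugin path.
    sed -n 's/^command\[[^]]*]=\([^ ]*\).*/\1/p' "$cfg" |
    while read -r plugin; do
        [ -x "$plugin" ] || echo "missing: $plugin"
    done
}

# Usage on the monitored client:
#   check_nrpe_paths /etc/nagios/nrpe.cfg
```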

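3 Monitoring and alerting for multiple Tomcat ports

The Tomcat on port 9300 is already monitored; now add a second Tomcat on port 8300.

3.1 Add the configuration to nrpe.cfg on the client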

[root@webserver root]# vim /etc/nagios/nrpe.cfg
command[check_tomcat_8300_status]=/usr/lib/nagios/plugins/check_http -I 10.xx.xx.10 -p 8300 -u /xx_xx_xx/index.html -e 200 -w 5 -c 10

3.2 On the Nagios server
Add the command definition:

[root@cache-2 etc]# vim ./objects/commands.cfg
define command{
        command_name    check_tomcat_8300_status
        command_line     $USER1$/check_http -I $HOSTADDRESS$ -p $PORT$ -u $URL$ -e $N200$ -w $Warning$ -c$Cri$
        }

Add the service definition:

define service{
        host_name               webserver
        service_description     Tomcat_8300_Status
        check_command           check_nrpe!check_tomcat_8300_status
        max_check_attempts      5
        normal_check_interval   3
        retry_check_interval    2
        check_period            24x7
        notification_interval   10
        notification_period     24x7
        notification_options    w,u,c,r
        contact_groups          opsweb
        }

3.3 Verify on the Nagios server that the new command works

[root@cache-2 etc]# /usr/local/nagios/libexec/check_nrpe -H 10.xx.xx.10  -c check_tomcat_8300_status
HTTP OK HTTP/1.1 200 OK - 611 bytes in 0.003 seconds |time=0.003152s;5.000000;10.000000;0.000000 size=611B;;;0
[root@cache-2 etc]#

The command works.
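
Once there are several NRPE commands on a host, it helps to run them all in one go. The sketch below does that and turns each plugin exit code into a state name (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names state_name and check_all and the CHECK_NRPE override variable are assumptions made for this example; the default check_nrpe path is the one used above.

```shell
# Translate a Nagios plugin exit code into its state name.
# Anything outside 0-2 (including 3) is reported as UNKNOWN.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Run every listed NRPE command against one host and print its state.
# CHECK_NRPE may be set to override the check_nrpe location.
check_all() {
    host="$1"; shift
    for cmd in "$@"; do
        output=$("${CHECK_NRPE:-/usr/local/nagios/libexec/check_nrpe}" -H "$host" -c "$cmd" 2>&1)
        rc=$?   # exit status of check_nrpe itself
        printf '%-30s %-8s %s\n' "$cmd" "$(state_name "$rc")" "$output"
    done
}

# Example:
# check_all 10.xx.xx.10 check_tomcat_8300_status check_load
```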

3.4 Reload Nagios and check the result

[root@cache-2 etc]# service nagios reload
Running configuration check...
Reloading nagios configuration...
done
[root@cache-2 etc]#

About 3 minutes after the reload, the new Tomcat 8300 service appears in the Nagios web interface and is being checked.

To verify the monitoring, stop the Tomcat on port 9300 on the web server. Shortly afterwards an alert email arrives, and the service turns red in the Nagios web interface.

This shows that the two Nagios services monitor two separate ports, 9300 and 8300.

4 Fixing the -e 200 error when adding a check for the new port 8200

[root@webserver OCC_MANAGER_Web]#  /usr/lib/nagios/plugins/check_http -I 10.xx.xx.10 -p 8200 -u /OCC_REPORT_Web/index.html -e 200 -w 5 -c 10
HTTP CRITICAL - Invalid HTTP response received from host on port 8200
[root@webserver OCC_MANAGER_Web]#

4.1 Access the Tomcat service and index.html directly

http://10.xx.xx.10:8200/OCC_REPORT_Web/index.html is reachable, but it redirects to

http://www.xxxx.xx/OCC_SSO_Web/login.htm?redirect=http%3A%2F%2F10.xx.xx.10%3A8200%2FOCC_REPORT_Web%2Findex.html

This shows that the web application is working; the request is simply redirected to a page on another domain.

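4.2 Detailed analysis with -v

At this point the Tomcat server is running normally and the web application responds normally. The error message says that an invalid HTTP response was received from port 8200. The key part of the command is that it fetches /OCC_REPORT_Web/index.html and uses -e 200 to decide whether the HTTP response is OK. So drop the -w 5 -c 10 thresholds and the -e 200 status match, and look at what the check returns: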

[root@webserver OCC_MANAGER_Web]# /usr/lib/nagios/plugins/check_http -I 10.xx.xx.10 -p 8200 -u /OCC_REPORT_Web/index.html
HTTP OK - HTTP/1.1 302 Found - 0.003 second response time |time=0.003367s;;;0.000000 size=317B;;;0

The response is HTTP/1.1 302 Found. According to the HTTP status code definitions, this means the resource is being served from a new URL:

……

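301 Moved Permanently: the requested document is elsewhere; the new URL is given in the Location header, and the browser should follow it automatically.
302 Found: similar to 301, but the new URL should be treated as a temporary replacement rather than a permanent one. Note that in HTTP/1.0 the corresponding status message is "Moved Temporarily".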

……
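
The -e 200 failure comes down to a string comparison against the status line. The sketch below is a simplified imitation of that comparison, not the plugin's actual code: it assumes -e behaves as a plain substring match on the first line of the response, and the function name status_matches is made up for this example.

```shell
# Return 0 when the expected string occurs in the HTTP status line.
# Simplified stand-in for the -e comparison, assuming a substring match.
status_matches() {
    status_line="$1"
    expected="$2"
    case "$status_line" in
        *"$expected"*) return 0 ;;
        *) return 1 ;;
    esac
}

# The port 8200 Tomcat answers with a redirect, so "200" never appears:
status_matches "HTTP/1.1 302 Found" "200" || echo "CRITICAL: expected 200, got: HTTP/1.1 302 Found"
# Expecting "302" instead matches the same status line:
status_matches "HTTP/1.1 302 Found" "302" && echo "OK: 302 matches"
```

Under that assumption, a Tomcat that always redirects to a login page can only pass the check when the expected string is the redirect status itself.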

Finally, add the -v option to see the full request and response:

[root@webserver OCC_MANAGER_Web]# /usr/lib/nagios/plugins/check_http -H www.xxxx.com -I 10.xx.xx.10 -p 8200 -u /OCC_REPORT_Web/index.html -v
GET /OCC_REPORT_Web/index.html HTTP/1.0
User-Agent: check_http/v1861 (nagios-plugins 1.4.11)
Connection: close
Host: www.xxxx.com
 
http://10.xx.xx.10:8200/OCC_REPORT_Web/index.html is 323 characters
STATUS: HTTP/1.1 302 Found
**** HEADER ****
Server: Apache-Coyote/1.1
Set-Cookie: ploccSessionId=45CD9C9921A5B89C59FCB2E34FE52734; Path=/
Location: http://www.xxx.com/OCC_SSO_Web/login.htm?redirect=http%3A%2F%2Fwww.xxx.com%2FOCC_REPORT_Web%2Findex.html
Content-Length: 0
Date: Thu, 12 Jun 2014 02:52:45 GMT
Connection: close
**** CONTENT ****
HTTP OK - HTTP/1.1 302 Found - 0.003 second response time |time=0.003268s;;;0.000000 size=323B;;;0

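The page is redirected to the public domain while the Tomcat server is running normally, so 302 Found can also be taken as proof that Tomcat is working.

The architecture uses LVS load balancing. If the check targeted the shared public domain that the redirect points to, there would be no way to tell whether this particular host's Tomcat answered, because each request to the public domain lands on only one of the Tomcat instances. The goal here is to monitor the Tomcat on one specific web server, so checking for the 302 status is a reasonable approach. Change the port 8200 check command in the client's nrpe.cfg to expect the 302 status: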

vim /etc/nagios/nrpe.cfg
/usr/lib/nagios/plugins/check_http -I 10.xx.xx.10 -p 8200 -u /OCC_REPORT_Web/index.html -e 302 -w 3 -c 10

Error log (1): NRPE: Unable to read output

[1402557345] SERVICE ALERT: webserver;Tomcat_6100_OCC_SSO_Service_Status;UNKNOWN;SOFT;3;NRPE: Unable to read output

Fix: this usually means a path is wrong, typically the plugin path in the command line defined in nrpe.cfg.
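
One way to hunt for a wrong path is to test every plugin path that nrpe.cfg refers to. The sketch below lists the command[...] entries whose first word is not an executable file. The function name find_bad_plugins is an assumption for this example, and the test only reflects the permissions of the user running it, so it is best run as the account NRPE itself runs under.

```shell
# List every command[...] entry in an nrpe.cfg-style file whose plugin
# path is missing or not executable for the current user.
find_bad_plugins() {
    cfg="$1"
    # Turn "command[name]=/path args" into "name /path args";
    # lines that are not command definitions are dropped.
    sed -n 's/^command\[\([^]]*\)]=\(.*\)$/\1 \2/p' "$cfg" |
    while read -r name plugin _rest; do
        if [ ! -x "$plugin" ]; then
            echo "$name: $plugin is missing or not executable"
        fi
    done
}

# Example:
# find_bad_plugins /etc/nagios/nrpe.cfg
```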

Error log (2): CHECK_NRPE: Error - Could not complete SSL handshake.

[root@cache-2 etc]# /usr/local/nagios/libexec/check_http -I 10.xx.3.xx -p 8100 -u /tradeAdmin/index.html
HTTP OK: HTTP/1.1 302 Found - 319 bytes in 0.064 second response time |time=0.064033s;;;0.000000 size=319B;;;0
[root@cache-2 etc]#
[root@cache-2 etc]# /usr/local/nagios/libexec/check_nrpe -H 10.xx.3.xx -c check_load
CHECK_NRPE: Error - Could not complete SSL handshake.
[root@cache-2 etc]#

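Fix: the direct check_http from the Nagios server works, but check_nrpe fails because the Nagios server's IP address has not been added to allowed_hosts in /etc/nagios/nrpe.cfg on the client. Add it:

vim /etc/nagios/nrpe.cfg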

allowed_hosts=127.0.0.1,10.xx.xxx.xx1

Then restart NRPE with service nrpe restart and verify again from the Nagios server:

[root@cache-2 etc]# /usr/local/nagios/libexec/check_nrpe -H 10.xx.3.xx -c check_load
OK - load average: 0.43, 0.17, 0.06|load1=0.430;15.000;30.000;0; load5=0.170;10.000;25.000;0; load15=0.060;5.000;20.000;0;
[root@cache-2 etc]#
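
Before restarting NRPE, the allowed_hosts entry can be checked on the client with a small helper. This is a sketch under simple assumptions: the function name host_allowed is made up, it only looks at the last allowed_hosts line, and it does an exact match on each comma-separated entry, so hostnames or network ranges in the list are not resolved.

```shell
# Succeed when the given IP appears as an exact entry in the
# allowed_hosts list of an nrpe.cfg-style file.
host_allowed() {
    cfg="$1"
    ip="$2"
    # Take the last allowed_hosts line, drop the key, put one entry per
    # line, strip spaces, then look for an exact whole-line match.
    sed -n 's/^allowed_hosts=//p' "$cfg" | tail -n 1 | tr ',' '\n' |
        tr -d ' ' | grep -Fxq "$ip"
}

# Example (replace the address with the Nagios server's IP):
# host_allowed /etc/nagios/nrpe.cfg 10.xx.xxx.xx1 || echo "Nagios server is not in allowed_hosts"
```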
