Monitoring a remote host with Nagios: host liveness, disk space, load, process count, and IP connections

Technology 2025-12-28

http://qubaoquan.blog.51cto.com/1246748/292596

The checks covered here are: host liveness, disk space, load, process count, and IP connection count.

(1) Define the host in hosts.cfg on the monitoring server

    define host {
        host_name               cacti.com
        alias                   nagios server
        address                 192.168.10.195
        contact_groups          admins
        check_command           check-host-alive
        max_check_attempts      5
        notification_interval   10
        notification_period     24x7
        notification_options    d,u,r
    }

Notes:
● The contact group named in contact_groups has not been created yet; that is done in a later step.
● The host check command is normally check-host-alive.
● Avoid setting max_check_attempts to 1; 3 to 4 attempts is generally reasonable.
● notification_interval is measured in minutes; set it to suit your environment.
● The notification_options values mean: d = down, u = unreachable, r = recovery.

(2) Define the services in services.cfg on the monitoring server

    define service {
        host_name               cacti.com
        service_description     check-host-alive
        check_period            24x7
        max_check_attempts      4
        normal_check_interval   3
        retry_check_interval    2
        contact_groups          admins
        notification_interval   10
        notification_period     24x7
        notification_options    w,u,c,r
        check_command           check-host-alive
    }

    define service {
        host_name               cacti.com
        service_description     check-disk
        check_command           check_nrpe!check_df
        max_check_attempts      4
        normal_check_interval   3
        retry_check_interval    2
        check_period            24x7
        notification_interval   10
        notification_period     24x7
        notification_options    w,u,c,r
        contact_groups          admins
    }

    define service {
        host_name               cacti.com
        service_description     check-load
        check_command           check_nrpe!check_load
        max_check_attempts      4
        normal_check_interval   3
        retry_check_interval    2
        check_period            24x7
        notification_interval   10
        notification_period     24x7
        notification_options    w,u,c,r
        contact_groups          admins
    }

    define service {
        host_name               cacti.com
        service_description     total_procs
        check_command           check_nrpe!check_total_procs
        max_check_attempts      4
        normal_check_interval   3
        retry_check_interval    2
        check_period            24x7
        notification_interval   10
        notification_period     24x7
        notification_options    w,u,c,r
        contact_groups          admins
    }

    define service {
        host_name               cacti.com
        service_description     ip_connets
        check_command           check_nrpe!check_ip_connets
        max_check_attempts      4
        normal_check_interval   3
        retry_check_interval    2
        check_period            24x7
        notification_interval   10
        notification_period     24x7
        notification_options    w,u,c,r
        contact_groups          admins
    }

Notes:
● host_name must be a host that is defined in hosts.cfg.
● check_command must be defined either in the command configuration file on the monitoring server or, for the name passed to check_nrpe (for example check_df), in nrpe.cfg on the monitored host.
● Setting max_check_attempts to 3 or 4 keeps a brief network interruption from triggering a false alarm.
● normal_check_interval and retry_check_interval are both measured in minutes.
● notification_interval is how often an alert is re-sent once a failure has been detected. It is measured in minutes.
● notification_options works the same way as in the host definition; for services the values are w = warning, u = unknown, c = critical, r = recovery.
● The contact group named in contact_groups is defined in contactgroup.cfg.
● Checking host resources requires NRPE to be installed and configured, which is covered in the following steps.

(3) Configure NRPE on the monitored host

Edit /usr/local/nagios/etc/nrpe.cfg. The lines that were changed are:

    # Run as a standalone daemon
    server_address=192.168.10.195
    #command[check_hda1]=/usr/local/nrpe/libexec/check_disk -w 20 -c 10 -p /dev/hda1
    command[check_df]=/usr/local/nagios/libexec/check_disk -x /dev -w 20 -c 10
    command[check_ip_connets]=/usr/local/nagios/libexec/ip_conn.sh 8000 10000

Notes:
● command[check_df]=/usr/local/nagios/libexec/check_disk -w 20 -c 10 checks disk usage for the whole server. On FreeBSD the /dev file system always shows 100% usage and has to be excluded, so the command line becomes "command[check_df]=/usr/local/nagios/libexec/check_disk -x /dev -w 20 -c 10".
● command[check_ip_connets]=/usr/local/nagios/libexec/ip_conn.sh 8000 10000 checks the number of IP connections, using the script created in the next step.

(4) Create the monitoring script on the monitored host

    [root@cacti nagios]# cd /usr/local/nagios/libexec/
    [root@cacti libexec]# vi ip_conn.sh

The contents of the script:

    #!/bin/sh
    # $1 = warning threshold, $2 = critical threshold (both passed in by nrpe.cfg)
    #if [ $# -ne 2 ]
    #then
    #    echo "Usage:$0 -w num1 -c num2"
    #    exit 3
    #fi
    # Count the established TCP connections
    ip_conns=`netstat -an | grep tcp | grep EST | wc -l`
    if [ $ip_conns -le $1 ]
    then
        echo "OK -connect counts is $ip_conns"
        exit 0
    elif [ $ip_conns -le $2 ]
    then
        echo "Warning -connect counts is $ip_conns"
        exit 1
    else
        echo "Critical -connect counts is $ip_conns"
        exit 2
    fi

    [root@cacti libexec]# chmod +x ip_conn.sh

The two arguments the script needs are already written into nrpe.cfg, so the script does not have to validate them. Once the current number of IP connections goes above 8000 a warning alert is raised, and above 10000 a critical alert is raised.

(5) Restart the NRPE service and verify its configuration

    [root@cacti libexec]# killall -9 nrpe
    [root@cacti libexec]# /usr/local/nagios/bin/nrpe -c /usr/local/nagios/etc/nrpe.cfg -d
    [root@cacti libexec]# netstat -nltp | grep 5666
    tcp        0      0 192.168.10.195:5666         0.0.0.0:*                   LISTEN      780/nrpe

(6) Verify the plugins from the monitoring server

Check the NRPE service:

    [root@nagios libexec]# ./check_nrpe -H 192.168.10.195
    NRPE v2.12

Check disk usage through NRPE:

    [root@nagios libexec]# ./check_nrpe -H 192.168.10.195 -c check_df
    DISK OK - free space: / 21565 MB (82% inode=98%); /boot 82 MB (88% inode=99%); /dev/shm 505 MB (100% inode=99%);| /=4723MB;27699;27709;0;27719 /boot=10MB;78;88;0;98 /dev/shm=0MB;485;495;0;505

Check the IP connection count:

    [root@nagios libexec]# ./check_nrpe -H 192.168.10.195 -c check_ip_connets
    OK -connect counts is 5

Check the load:

    [root@nagios libexec]# ./check_nrpe -H 192.168.10.195 -c check_load
    OK - load average: 0.93, 1.12, 1.21|load1=0.930;15.000;30.000;0; load5=1.120;10.000;25.000;0; load15=1.210;5.000;20.000;0;

Check the process count:

    [root@nagios libexec]# ./check_nrpe -H 192.168.10.195 -c check_total_procs
    PROCS OK: 92 processes
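The threshold logic of ip_conn.sh in step (4) can also be tried out without a live netstat. The sketch below is not part of the original setup: classify_conns is a hypothetical helper name, and it assumes the usual Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It takes the connection count as an argument and also covers the argument check that the original script leaves commented out.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): classify a connection
# count against warning/critical thresholds.
# Usage: classify_conns COUNT WARN CRIT
# Return codes follow the Nagios plugin convention:
#   0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
classify_conns() {
    # Wrong number of arguments: report UNKNOWN.
    if [ "$#" -ne 3 ]; then
        echo "UNKNOWN - usage: classify_conns COUNT WARN CRIT"
        return 3
    fi
    # Empty or non-numeric arguments: report UNKNOWN.
    for arg in "$@"; do
        case "$arg" in
            ''|*[!0-9]*)
                echo "UNKNOWN - arguments must be non-negative integers"
                return 3
                ;;
        esac
    done
    if [ "$1" -le "$2" ]; then
        echo "OK - connect counts is $1"
        return 0
    elif [ "$1" -le "$3" ]; then
        echo "WARNING - connect counts is $1"
        return 1
    else
        echo "CRITICAL - connect counts is $1"
        return 2
    fi
}
```

In a real plugin the first argument would be the output of the netstat pipeline from step (4), and the script would exit with the function's return code so that NRPE passes the state back to Nagios.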
