
monitor SG with nagios

 
SOLVED
Rick Garland
Honored Contributor

monitor SG with nagios

Hi all:

Working with Nagios 2.9 on HP-UX 11.23 IA systems.

Configuring nagios to monitor. Question - how do I keep tabs on SG? When it fails over, I need to watch the other node.

I have a couple of ideas but I am looking for more. Anybody out there doing so?

Many thanks!

6 REPLIES
Robert-Jan Goossens
Honored Contributor
Solution

Re: monitor SG with nagios

Hi,

nagios scripts for hpux

http://tinyurl.com/2oz9hw

The check_heartbeat script in combination with the nagios log file script (check_logfiles) could be a start.

Hope this helps a bit,
Robert-Jan
P.S. I have some other nagios links I will check tomorrow. If I find something useful I will add a message.
Steven E. Protter
Exalted Contributor

Re: monitor SG with nagios

Shalom,

I personally would consider using cmviewcl output to monitor the status of Serviceguard.

Whether it's up or down, as well as failovers, can be monitored with a simple grep script.

Failovers are also noted in the /var/adm/syslog/syslog.log file.

I'm sure nagios can do it, but you may need to write your own nagios monitor script. I'd look at two things: the Serviceguard daemon and the status from cmviewcl.
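
A rough sketch of what such a grep-based check could look like (the match patterns, messages and paths are only examples; the user running it needs read access to the cluster status):

#!/bin/sh
# Sketch of a cmviewcl/grep based Nagios plugin - illustrative only.
PATH=/usr/sbin:/usr/bin:$PATH

OUT=$(cmviewcl 2>&1)
if [ $? -ne 0 ]; then
    echo "SG CRITICAL - cmviewcl failed: $OUT"
    exit 2
fi

# anything not "up" (halted packages, down nodes, failed services) is critical
BAD=$(echo "$OUT" | grep -E 'down|halted|failed')
if [ -n "$BAD" ]; then
    echo "SG CRITICAL -" $BAD
    exit 2
fi

echo "SG OK - cluster, nodes and packages up"
exit 0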

SEP
back after a week off the net.
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Stephen Doud
Honored Contributor

Re: monitor SG with nagios

You can add an entry in the customer_defined_run_cmds section of the package control script to email your pager when the package starts, including the hostname it started on.
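
For example (a sketch only; the pager address is a placeholder, and PACKAGE would have to be set earlier in the control script):

customer_defined_run_cmds() {

# mail the pager when the package starts, noting the node it started on
/usr/bin/mailx -s "SG package $PACKAGE started on $(hostname)" \
oncall-pager@example.com </dev/null

}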
Ralph Grothe
Honored Contributor

Re: monitor SG with nagios

Hi Rick,

There are basically two ways you could monitor the packages' state: by active or by passive checks.
Are you familiar with NRPE (Nagios Remote Plugin Executor), and NSCA (Nagios Service Check Acceptor)?
The former are actively executed by your Nagios server through an nrpe daemon (usually spawned by inetd) on the monitored remote host,
while the latter are (comparable to SNMP traps) executed on demand on the monitored host (or on another, distributed Nagios server, e.g. to bridge a firewalled LAN) and are sent via send_nsca to an nsca daemon that (again usually inetd-spawned) listens on the central Nagios server and pipes the results of the passive checks into the server's command FIFO.

I tried both with our HP-UX MC/SG clusters, and both work well.

For nrpe checks I have written a small Nagios plugin that I attached to this posting.
This needs to be placed in the directory where your Nagios check commands reside on the SG cluster node, and an appropriate check command needs to be defined in the nrpe.cfg file of that nrpe host, which could look like this:

# grep check_SG_PKG_STATE /usr/local/nagios/etc/nrpe.cfg
command[check_SG_PKG_STATE]=/usr/local/nagios/libexec/check_sg_pkg_state.pl -i sms

Note, in this definition I "ignored" a test cluster package sms that I only set up to play and experiment with SMS notifications.
I wrote my plugin such that it parses the cmviewconf output to build up a hash holding the configured primary node of every package in the cluster.
The check itself simply parses cmviewcl and compares the current package distribution with the configured one (except for "ignored" packages, such as sms in the example above).
If anything deviates, a critical state is signalled.
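
A stripped-down shell sketch of the same idea could look roughly like this; here the expected package-to-primary-node mapping is simply hard-coded (the real plugin derives it from the cluster configuration), and the cmviewcl parsing is illustrative only:

#!/bin/sh
# sketch: compare where each package currently runs with its expected primary node
PATH=/usr/sbin:/usr/bin:$PATH

# expected "package:primary_node" pairs - adapt to your cluster
EXPECTED="pkg1:vaila pkg2:vaila pkg3:samoa pkg4:lanai"

BAD=""
for PAIR in $EXPECTED; do
    PKG=${PAIR%%:*}
    PRIMARY=${PAIR##*:}
    # package lines in plain cmviewcl output: PACKAGE STATUS STATE AUTO_RUN NODE
    CURRENT=$(cmviewcl | awk -v p="$PKG" 'NF == 5 && $1 == p {print $5}')
    [ "$CURRENT" = "$PRIMARY" ] || BAD="$BAD $PKG(on ${CURRENT:-none}, expected $PRIMARY)"
done

if [ -n "$BAD" ]; then
    echo "SG_PKG_STATE CRITICAL -$BAD"
    exit 2
fi
echo "SG_PKG_STATE OK - all packages on their primary nodes"
exit 0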

For this to work you need to define the user that the nrpe daemon executes under as part of the monitor role in your cluster config, so that it may execute the non-destructive SG commands cmviewconf and cmviewcl.

e.g.

# grep nrpe /etc/inetd.conf /etc/services
/etc/inetd.conf:nrpe stream tcp nowait tivoli /usr/local/nagios/sbin/nrpe nrpe -c /usr/local/nagios/etc/nrpe.cfg -i
/etc/services:nrpe 5666/tcp # Nagios Remote Plug-in Executor

It may look pretty daft that the user here is called tivoli.
Yes, we started out with Tivoli monitoring but shifted to Nagios for obvious reasons ;-)

# cmviewconf | sed -n '/Access Policy/,/role:/p'
Cluster Access Policy Information:

user name: tivoli
user host: CLUSTER_MEMBER_NODE
user role: monitor

When you have set up everything correctly, from your Nagios server you can simply execute

e.g.

$ check_nrpe -H samoa -c check_SG_PKG_STATE
SG_PKG_STATE OK - pkg1 up vaila enabled running, pkg2 up vaila enabled running,
pkg3 up samoa enabled running, pkg4 up lanai enabled running

In your Nagios server you then could define a service similar to this:

define service {
use generic-service
service_description SG_CLUSTER_PKGs_UP
servicegroups sg_services
hostgroup_name sg_clusters
normal_check_interval 15
check_command check-nrpe!check_SG_PKG_STATE
contact_groups sg_admins
}
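
The check-nrpe command referenced above is the usual wrapper around the check_nrpe binary and would be defined on the Nagios server along these lines (adjust the plugin path to your installation):

define command {
command_name check-nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}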


Now, for passive checks you would best place the send_nsca call, as Stephen suggested, in each SG package's start/stop script, i.e. in the customer_defined_*_cmds function bodies.

e.g.

NSCA_CLIENT=/usr/local/nagios/libexec/send_nsca_with_mcrypt
NSCA_CONF=/usr/local/nagios/etc/send_nsca.cfg
NSCA_SERVER=123.123.123.123
NSCA_PORT=5667

customer_defined_halt_cmds() {

printf "%s;%s;%u;CRITICAL - MC/SG Package %s halting on %s\n" \
$PACKAGE sms_pkg_state 2 $PACKAGE $(uname -n) \
|LD_LIBRARY_PATH=/usr/local/lib \
$NSCA_CLIENT -H $NSCA_SERVER -p $NSCA_PORT -d ';' -c $NSCA_CONF

}

Note, the LD_LIBRARY_PATH is just a kludge to satisfy a bad compilation of the send_nsca binary.
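
For symmetry, the package start script could send the matching OK along the same lines (same variables as above, sketch only):

customer_defined_run_cmds() {

printf "%s;%s;%u;OK - MC/SG Package %s running on %s\n" \
$PACKAGE sms_pkg_state 0 $PACKAGE $(uname -n) \
|LD_LIBRARY_PATH=/usr/local/lib \
$NSCA_CLIENT -H $NSCA_SERVER -p $NSCA_PORT -d ';' -c $NSCA_CONF

}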

For this approach you need to configure your Nagios server to accept passive checks, which mainly means

accept_passive_service_checks=1

in your main nagios.cfg.

Also, you need to set up nsca in the inetd of your Nagios server.

$ grep nsca /etc/inetd.conf
#nsca stream tcp nowait nagios /opt/sw/nagios/bin/nsca nsca -c /opt/sw/nagios/etc/nsca.cfg --inetd

Note, my Nagios server used to run on an AIX box, which is why the syntax may deviate slightly from HP-UX's inetd.conf.

For passive checks you could define a service similar to this

define service {
use generic-service
service_description sms_pkg_state
servicegroups sg_clusters
host_name sms
;notification_options c,r,u
notifications_enabled 0
;contact_groups nagiosadmin,admin_mobile,service_center
contact_groups nagiosadmin,admin_mobile
max_check_attempts 1
is_volatile 1
active_checks_enabled 0
passive_checks_enabled 1
check_freshness 0
check_period never
check_command passive-check-pad
}


Here it is important to set is_volatile and max_check_attempts to 1.
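
The passive-check-pad check_command is only a placeholder here, since the service is never actively checked; one way to define such a pad is with the stock check_dummy plugin, e.g. (illustrative only):

define command {
command_name passive-check-pad
command_line $USER1$/check_dummy 3 "no active check - waiting for passive result"
}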

Happy Checking

Ralph
Madness, thy name is system administration
Ralph Grothe
Honored Contributor

Re: monitor SG with nagios

Reconsidering this, I thought of a much simpler solution.
Usually you would add a host definition to your nagios server config for every single package, each reachable by its relocatable or virtual IP address (VIP for short).
Because you (as I hope) already have host checks defined which simply run the check_host command, these checks will already tell you when a package's VIP becomes unreachable.
Note, check_host is a hard link to the check_icmp command, the latter of which must be owned by root and have the suid bit set, because only root may emit ICMP packets.
Now you must know that check_icmp behaves quite differently when invoked as check_host.
Just run the --help option on both invocations to find out.
The main difference, however, is that check_host regards the check as OK as soon as the first ICMP packet has returned, whereas check_icmp waits for every packet to return.
This can be a big performance boost.
Also never, ever define a check_interval in your host definitions as this could impair the performance significantly.
This is what the Nagios doc says about it:

check_interval: NOTE: Do NOT enable regularly scheduled checks of a host unless you absolutely need to! Host checks are already performed on-demand when necessary, so there are few times when regularly scheduled checks would be needed. Regularly scheduled host checks can negatively impact performance - see the performance tuning tips for more information. This directive is used to define the number of "time units" between regularly scheduled checks of the host. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. More information on this value can be found in the check scheduling documentation.


So I have defined this host template, which I simply "use" with every new host definition and where I only override those directives that need to be overridden (a bit OO-like).


define host {
name generic-host
alias Host Class Definition
register 0
max_check_attempts 5
active_checks_enabled 1
passive_checks_enabled 0
check_period 24x7
check_command check-host-alive
obsess_over_host 0
check_freshness 1
freshness_threshold 1800
event_handler passive-check-pad
event_handler_enabled 0
flap_detection_enabled 0
process_perf_data 0
retain_status_information 1
retain_nonstatus_information 0
contact_groups pb22_unix
notification_interval 30
notification_period 24x7
notification_options d,u,r
notifications_enabled 1
process_perf_data 0
}


And specifically this is what my check-host-alive definition looks like

define command {
command_name check-host-alive
command_line $USER1$/check_host -H $HOSTADDRESS$ -t 15 -c 10000
}


Of course, for our contacts we usually suppress all host notifications with "host_notification_options n" in the contacts_template.cfg
(note that pre-3.x Nagios versions lacked a configuration directive to prevent being flooded by service alerts for a failed host, where a single host alert would suffice to alert the admin).
Also, one has to consider that an SG package could fail over much more quickly than Nagios can verify a hard state change to critical.

Therefore, I would rather concentrate on service checks for services that are running on your SG packages.
Often you have a database or some similar service running under a certain VIP.
Then it is much better to run specific checks, like e.g. check_oracle, via nrpe (since it requires an sqlplus binary).
Or you can download for free from Oracle the so-called Instant Client, which contains a working sqlplus binary without the need for a full-blown Oracle installation.
Place this on your Nagios server and you can even do without nrpe by running checks directly from your Nagios server against any Oracle DBMS.
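
A minimal sqlplus-based check run from the Nagios server could look like the sketch below; the Instant Client path, the nagios/secret credentials and the EZConnect target are all placeholders:

#!/bin/sh
# illustrative Oracle check via the Instant Client's sqlplus
# usage: check_ora_vip.sh <package_vip> <service_name>
ORACLE_HOME=/opt/oracle/instantclient
PATH=$ORACLE_HOME:$PATH
LD_LIBRARY_PATH=$ORACLE_HOME
export ORACLE_HOME PATH LD_LIBRARY_PATH

OUT=$(echo "select 'alive' from dual;" | \
    sqlplus -S nagios/secret@//$1:1521/$2 2>&1)

if echo "$OUT" | grep alive >/dev/null; then
    echo "ORACLE OK - $2 reachable via $1"
    exit 0
fi
echo "ORACLE CRITICAL - $OUT"
exit 2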

So if any of the package-bound services fails, you will be notified about these failures (given a max_check_attempts or retry_check_interval that results in a critical hard state change sooner than the package can fail over ;-)

Also consider service dependency definitions for all those services that form a common cluster package.
This will reduce notifications to a sane amount.
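
Such a dependency could be expressed along these lines (host and service names are made up; here the front-end check is suppressed while the database check on the same package VIP is already critical or unknown):

define servicedependency {
host_name pkg1-vip
service_description oracle_db
dependent_host_name pkg1-vip
dependent_service_description app_frontend
execution_failure_criteria c,u
notification_failure_criteria c,u
}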
Madness, thy name is system administration
Rick Garland
Honored Contributor

Re: monitor SG with nagios

Hi folks:

Many thanks!
My ideas were based on the output of cmviewcl, but I also like the heartbeat idea.

I am using nrpe for remote checking. I see that HP includes nagios and nrpe with the iexpress bundle, but the versions are at 2.0. I downloaded the source and compiled everything (plugins & nrpe). Using gcc I compiled the latest stable with no trouble. I found lots of references to problems with check_swap not working, but I am having no problems.

Nagios itself runs on a CentOS 5 server (an HP ProLiant DL360), and I have the nrpe and plugins for Linux, Solaris, and HP-UX. The HP-UX build is compiled for PA-RISC 2.x. I am not going to bother with anything lower since those systems are not supported and are scheduled for decommissioning.



Ralph - much appreciated on the script and config.

SEP - hope you are on vacation. Have a great time off the net.

Again, many thanks to all of you!