
Configuring MC/SG to send hpmcSGSubnetDown trap

 
Ralph Grothe
Honored Contributor


Hello,

In the wake of a scheduled total power-off of cells in our data center some weeks ago, it went unnoticed for several days that the standby NIC on one cluster node had lost its link.
This was caused by a failed media converter which, being an unmanaged dumb device, wasn't on the radar of the network management monitors either.
The dead link was only noticed when the primary VIP NIC also lost its link for long enough to trigger a cluster reformation and package failover.

Though both links are fixed by now,
I would like to guard against another unnoticed link loss by providing a Nagios check.
Right after the event I knocked together a plugin that runs linkloop checks between all involved NICs.
But this will only catch losses that last longer than the 5 min check interval or (highly unlikely) happen to coincide with a check.

Therefore, I would like to set up a passive Nagios check which would catch some sort of linkDown trap.
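
As a rough sketch of the receiving end of such a passive check, assuming net-snmp's snmptrapd hands each trap to a handler program on stdin (host name on line 1, transport address on line 2, then one "OID value" line per varbind): the function below turns that text into one line in the Nagios external command format. The function name, the service description SG_LINK and the fixed CRITICAL state are illustrative choices, not anything Serviceguard or Nagios prescribes.

```shell
#!/bin/sh
# Convert snmptrapd traphandle input (stdin) into one Nagios passive result:
#   [epoch] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<state>;<output>
# Arguments: $1 = epoch timestamp, $2 = Nagios service description
# (both supplied by the caller; the service name is a hypothetical one).
trap_to_passive_result() {
    awk -v now="$1" -v svc="$2" '
        BEGIN   { host = "unknown"; trap = "unknown" }
        NR == 1 { host = $1 }
        # snmpTrapOID.0 may be printed by name or as a numeric OID
        NR > 2 && ($1 ~ /snmpTrapOID\.0$/ || $1 ~ /\.1\.3\.6\.1\.6\.3\.1\.1\.4\.1\.0$/) {
            trap = $2
        }
        END {
            # State 2 = CRITICAL; every trap that reaches us is treated as an alarm
            printf "[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;2;trap %s received\n", now, host, svc, trap
        }
    '
}
```

A handler script registered with a traphandle line in snmptrapd.conf could call `trap_to_passive_result "$(date +%s)" SG_LINK` and append the output to the Nagios external command file (the path configured as command_file in nagios.cfg). The Nagios host name must match what snmptrapd reports on line 1, which is an assumption of this sketch.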

A quick Google search turned up the OID for the HP SG Cluster MIB's trap table at
1.3.6.1.4.1.11.2.3.1.6.3.1.0 (hpmcSGTraps)
Unfortunately, there is no per-NIC linkDown trap, but there is at least an hpmcSGSubnetDown [16] entry, which I think could be useful.

However, I haven't found any reference yet
on where and how to configure cmsnmpd to send such a trap to my Nagios server.

Could anyone help?

Thanks

Ralph
Madness, thy name is system administration
Asif Sharif
Honored Contributor

Re: Configuring MC/SG to send hpmcSGSubnetDown trap

Hi Ralph,

As I understand it, you are looking for this technical knowledge base document:

http://www12.itrc.hp.com/service/cki/docDisplay.do?docLocale=en&docId=ucr_na-KMN8606299725_ssb-1

Regards,
Asif Sharif
Ralph Grothe
Honored Contributor

Re: Configuring MC/SG to send hpmcSGSubnetDown trap

Hi Asif,

the TKB doc that you referred me to
exactly describes our situation.

Unfortunately, they mentioned that they had no intention to fix this.
However, as this was issued in 2003 things may have changed by now.

Also, I haven't been able to download the PDF document it refers to.
The link might be stale by now anyway.
Asif Sharif
Honored Contributor

Re: Configuring MC/SG to send hpmcSGSubnetDown trap

Hi Ralph,

This document is available on HP's internal network, so HP Support personnel can obtain it and pass it on to any customer who asks for it.

Regards,
Asif Sharif
Ralph Grothe
Honored Contributor

Re: Configuring MC/SG to send hpmcSGSubnetDown trap

Ok, thank you for the hint.
Stephen Doud
Honored Contributor
Solution

Re: Configuring MC/SG to send hpmcSGSubnetDown trap

Attached is the current drawer statement pertaining to HP support for SNMP traps for Serviceguard.

Ralph Grothe
Honored Contributor

Re: Configuring MC/SG to send hpmcSGSubnetDown trap

Hi Stephen,

that is really sad to read.

Sounds as if services offered by cmsnmpd were exclusively targeted at SG Cluster Manager,
a product that we don't use.

So there was never any entry point in SG to integrate with one's own monitoring solutions?

I understand that SNMP may no longer be considered state of the art and thus abandoned in future releases altogether.
Nevertheless, such an open interface always offers a relatively easy to use hook for extensible monitors like Nagios.

Do you have any idea how else I could monitor the link states of cluster relevant NICs (apart from my weird linkloop checks)?


Btw, this is the release of SG on the affected cluster.
As this is a production system, I have no chance to upgrade or patch it along the way.

$ swlist|grep -i guard
B3935DA A.11.14 MC / Service Guard
Stephen Doud
Honored Contributor

Re: Configuring MC/SG to send hpmcSGSubnetDown trap

Serviceguard does link-level DLPI checking every NETWORK_POLLING_INTERVAL (typically every second), and if a transmission failure occurs, it marks the NIC down. If a standby is available, SG moves traffic to the standby NIC.

If a package requires the NIC, ensure the package configuration has a dependency on the NIC's SUBNET; otherwise Serviceguard will be blind to a network failure and will allow the package to operate without network connectivity (assuming heartbeat traffic is still functioning on at least one network).

To check whether you have a package configured to monitor a subnet, use:
# cmviewconf | grep -i -e "package name" -e "package subnet"

Example output:
package name: P1
package subnet: 16.113.0.0
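
To spot packages that lack a subnet entry without reading the listing by eye, something like the following might do. It is a minimal sketch that assumes cmviewconf prints the "package name:" and "package subnet:" labels as in the example output above; the function name is made up for illustration.

```shell
#!/bin/sh
# Read cmviewconf output on stdin and print each package that has no
# "package subnet:" line before the next "package name:" line.
packages_without_subnet() {
    awk -F': *' '
        # Print the package collected so far if it never showed a subnet
        function report() { if (name != "" && !has_subnet) print name }
        /package name:/   { report(); name = $2; has_subnet = 0 }
        /package subnet:/ { has_subnet = 1 }
        END               { report() }
    '
}
```

Usage would be `cmviewconf | packages_without_subnet`; empty output means every package references at least one subnet.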

Edit the package configuration file to add a SUBNET reference to the needed network and cmapplyconf the file to update the cluster binary (cmapplyconf requires the package be down).

Providing a SUBNET reference for a package causes Serviceguard to fail the package over to the adoptive node if the subnet (primary and standby NICs) is not working.

If this is not sufficient, a package RESOURCE based on an EMS monitor could be configured. To verify that the monitor can be created, try:
# resls /net/interfaces/lan/status
and
# resls /net/interfaces/lan/status/lan0
Then configure a monitor in the package control script.

The package configuration file contains this example:

# Example : RESOURCE_NAME /net/interfaces/lan/status/lan0
# RESOURCE_POLLING_INTERVAL 120
# RESOURCE_START automatic
# RESOURCE_UP_VALUE = running
# RESOURCE_UP_VALUE = online
#
# Means that the value of resource /net/interfaces/lan/status/lan0
# will be checked every 120 seconds, and is considered to
# be 'up' when its value is "running" or "online".
#
Ralph Grothe
Honored Contributor

Re: Configuring MC/SG to send hpmcSGSubnetDown trap

Stephen,

the subnet configuration is already in place
(see below).
However, this did not keep it from escaping our notice that the standby link had silently died, until the primary was also hit for long enough to make the node leave the cluster.

I cannot do any intrusive package reconfiguration, such as setting up EMS-backed SG resource monitors,
since this is a production cluster.

Here are the dumps, stripped of names and addresses:

# cmviewconf|grep -E 'package (name|subnet)'|cut -d: -f1
package name
package subnet
package name
package subnet
package name
package subnet
package name
package subnet
package name
package subnet
package name
package subnet
package name
package subnet
package name
package subnet
package name
package subnet
package name
package subnet
package name
package subnet
package name
package subnet
package name
package subnet
package name
package subnet
package name
package subnet
package name
package subnet


# grep -h ^SUBNET /etc/cmcluster/*/*cntl|cut -d= -f1
SUBNET[0]
SUBNET[0]
SUBNET[0]
SUBNET[0]
SUBNET[0]
SUBNET[0]
SUBNET[0]
SUBNET[0]
SUBNET[0]
SUBNET[0]
SUBNET[0]
SUBNET[0]
SUBNET[0]
SUBNET[0]
SUBNET[0]
SUBNET[0]
SUBNET[1]
SUBNET[2]
SUBNET[3]
SUBNET[4]
SUBNET[5]
SUBNET[6]
SUBNET[7]
SUBNET[8]
SUBNET[9]

Stephen Doud
Honored Contributor

Re: Configuring MC/SG to send hpmcSGSubnetDown trap

Okay - since you can't update each package config with a resource for NIC outages, how about monitoring syslog.log?

Serviceguard sends lan failure and recovery messages to syslog.
# grep -e fail -e recover /var/adm/syslog/syslog.log| grep lan

Check it periodically, and send an alarm email when any NIC has failed that day.

NOTE: I investigated /etc/opt/resmon/lbin/monconfig (EMS resource monitor) but it doesn't have a lan monitor capability.

Ralph Grothe
Honored Contributor

Re: Configuring MC/SG to send hpmcSGSubnetDown trap

Hi Stephen,

Periodic checking of the syslog files (e.g. syslog.log on HP-UX, messages on Solaris and Linux) is already a standard check that I set up for each new NRPE-enabled host I add to my Nagios config.
So far I've been using single-tag matching (no elaborate regex) with the check_log2.pl plugin on occurrences of "vmunix" to catch messages from the kernel.
Thus, I have to admit, the "cmcld" tagged entries like these

May 3 03:23:22 lech cmcld: lan1 failed
May 3 03:23:32 lech cmcld: lan1 recovered

have escaped my overly simple filter
(which I have since adapted),
as I didn't know beforehand that a failed NIC isn't necessarily reported by the kernel.
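
A filter for entries of this kind could be sketched roughly as follows, assuming the syslog line layout shown in the two sample entries (month, day, time, host, "cmcld:" tag, NIC name, then "failed" or "recovered"). The function names and the status texts are invented for this example; only the exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```shell
#!/bin/sh
# Read syslog lines on stdin and print, sorted, every NIC whose most
# recent cmcld event is "failed" (no later "recovered" line follows).
failed_lans() {
    awk '
        $5 ~ /^cmcld/ && $6 ~ /^lan[0-9]+$/ && ($7 == "failed" || $7 == "recovered") {
            state[$6] = $7      # remember only the latest event per NIC
        }
        END {
            for (nic in state) if (state[nic] == "failed") print nic
        }
    ' | sort
}

# Nagios-style wrapper: prints one status line, returns 0 (OK) or 2 (CRITICAL).
check_cmcld_lans() {
    failed=$(failed_lans | paste -s -d ' ' -)
    if [ -n "$failed" ]; then
        echo "CRITICAL - cmcld reports failed NIC(s): $failed"
        return 2
    fi
    echo "OK - no unrecovered cmcld lan failures"
    return 0
}
```

Called as `check_cmcld_lans < /var/adm/syslog/syslog.log`, this only flags NICs that are still down. A brief flap that has already recovered passes as OK, so alerting on every "failed" line instead would be the stricter choice.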

I would still have much preferred to catch a trap as well, even though a UDP datagram is by no means guaranteed to be received and acted on by a manager like my net-snmp snmptrapd/Nagios combo.