Ralph Grothe
Honored Contributor

Need help on output interpretation for writing a Nagios quorum server check plug-in

Hi,

I am about to script a custom Nagios plug-in to monitor the availability of a quorum server from those cluster nodes that might require it as a tie breaker during cluster reformation.

A colleague of mine set up a quorum server on an RHEL Linux box

# rpm -q qs
qs-A.04.00.00-0.rhel5

Though I could merely check for the existence of the qsc process(es) on the quorum server host and the port to which their listening socket is bound

e.g.

$ /usr/lib64/nagios/plugins/check_procs -c 1:4 -C qsc
PROCS OK: 2 processes with command name 'qsc'

# /usr/sbin/lsof -nP -c qsc -a -i4 -i tcp|grep LISTEN
qsc 20800 root 5u IPv4 1172935 TCP *:1238 (LISTEN)

# /usr/lib64/nagios/plugins/check_tcp -H localhost -p 1238
TCP OK - 0.000 second response time on port 1238|time=0.000085s;;;0.000000;10.000000
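
If I went that simple route, the two stock checks could be glued together into one local plug-in on the QS host, roughly like this (only a sketch; the wrapper name and the combination logic are mine, paths and port 1238 are taken from the output above):

#!/bin/sh
# check_qs_local.sh -- sketch only: glue the two stock plug-ins shown above
# into a single local check on the quorum server host.
PLUGINS=/usr/lib64/nagios/plugins

OUT1=$($PLUGINS/check_procs -c 1:4 -C qsc); RC1=$?
OUT2=$($PLUGINS/check_tcp -H localhost -p 1238); RC2=$?

# keep the numerically higher exit code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN)
RC=$RC1
[ $RC2 -gt $RC ] && RC=$RC2

echo "${OUT1} / ${OUT2}"
exit $RC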


I thought that it would be a more realistic check if I ran the cmquerycl command from each potential quorum requesting cluster node instead.
(probably after having set USER_NAME to the uid under which inetd spawns the nrpe daemon, and USER_ROLE to "monitor")

However, it isn't yet clear to me how to interpret the 195 seconds displayed, for instance.
And what does it want to convey by outputting "Replacing Quorum Server..." to stderr?

# cmquerycl -w none -l net -q asterix -c nbr02 -n $(uname -n)|tail -1
Replacing Quorum Server asterix with asterix
Quorum Server: asterix 195 seconds
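
A first sketch of the plug-in could just pattern-match that last line; it assumes the "Quorum Server: <host> ... seconds" shape shown above stays stable (how the output changes on failure is exactly what I still need to find out):

#!/bin/sh
# check_qs_cmquerycl.sh -- rough draft; assumes the last output line keeps
# the "Quorum Server: <host> ... seconds" shape shown above.
QS_HOST=asterix
CLUSTER=nbr02

LAST=$(cmquerycl -w none -l net -q $QS_HOST -c $CLUSTER -n $(uname -n) 2>/dev/null | tail -1)

case "$LAST" in
  "Quorum Server: ${QS_HOST}"*)
      echo "QS OK - $LAST"
      exit 0 ;;
  *)
      echo "QS CRITICAL - unexpected cmquerycl output: $LAST"
      exit 2 ;;
esac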

If I look at the cluster configuration, however, I see a 120-second polling interval and a 2-second timeout extension.


# cmviewconf|grep -E 'qs (host:|polling|timeout)'
qs host: asterix
qs polling interval: 120.00 (seconds)
qs timeout extension: 2.00 (seconds)


How does this break down to 195 seconds?

Also, I would have to check how the cmquerycl output changes if the quorum server is down or unreachable, and whether the command might hang (in which case I would need to set an alarm timer).
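For the potential hang I was thinking of something along these lines; an untested sketch, with an arbitrary 30-second limit:

#!/bin/sh
# Untested sketch of a watchdog around cmquerycl, in case it hangs when the
# quorum server is unreachable.
TIMEOUT=30
TMP=/var/tmp/qscheck.$$

cmquerycl -w none -l net -q asterix -c nbr02 -n $(uname -n) >$TMP 2>&1 &
CMDPID=$!
( sleep $TIMEOUT; kill $CMDPID 2>/dev/null ) &   # kill cmquerycl if it is still around
WATCHPID=$!

wait $CMDPID
RC=$?
kill $WATCHPID 2>/dev/null

if [ $RC -ne 0 ]; then
    echo "QS CRITICAL - cmquerycl failed or was killed after ${TIMEOUT}s (rc=$RC)"
    rm -f $TMP
    exit 2
fi
tail -1 $TMP
rm -f $TMP
exit 0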
I think that it would not hurt if I stopped the quorum server process for these tests, since the quorum server would only be required and queried by the cluster nodes if a cluster reformation was needed (e.g. owing to a split brain or similar).
Is this correct, or could I inadvertently imperil running cluster states?


Rgds
Ralph
Madness, thy name is system administration
5 REPLIES
Michael Steele_2
Honored Contributor

Re: Need help on output interpretation for writing a Nagios quorum server check plug-in

Hi and Whoa. Stop.

cmquerycl is used once, to build the initial cluster template file, cluster.ascii. In MC/SG it is found under /etc/cmcluster/cluster_name/cluster.ascii.

Use 'cmgetconf -c cluster_name /tmp/TEMP_cluster.ascii'

And perform any parsing with this TEMP file.
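
For example, with your cluster name (the QS_* parameter names are from memory, so double-check them against the generated file):

# cmgetconf -c nbr02 /tmp/TEMP_cluster.ascii
# grep -E '^QS_(HOST|POLLING_INTERVAL|TIMEOUT_EXTENSION)' /tmp/TEMP_cluster.ascii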

############################
This comment confuses me and I'll need more explanation.

"...I am about to script a custom Nagios plug-in to monitor the availability of a quorum server from those cluster nodes that might require it as a tie breaker during cluster reformation...."

a) Are you saying you have a nagios application, and you want to place the responsibility of failing over in the Nagios application?

If so, then this is also not how MC/SG works. MC/SG is a compiled application dependent upon a compiled binary running in the kernel.

Level One: Physical Resource
Level Two: MC/SG
Level Three: HP-UX
Level Four: Application / Nagios

The execution of a quorum is performed with the heartbeat mechanism and, as you stated, polled by the MC/SG binary every few seconds. If one node in the cluster fails to receive a heartbeat, it fails over.

How, exactly, are you going to fit in a third application server, outside the cluster, to initiate a failover?

And why would you want to tamper with something created by the Manufacturer to perform this function?

I guarantee you, you would be making yourself NON-SUPPORTED by HP, and forced to go back to MC/SG if you ever needed their help. And since your company is probably paying a billion a year in support, I can't believe for a second that you have management approval.

Finally, there would be absolutely no one coming after you, should you leave, who could support this.
Support Fatherhood - Stop Family Law
Michael Steele_2
Honored Contributor

Re: Need help on output interpretation for writing a Nagios quorum server check plug-in

Sorry, mis-statement here.

You are using the term 'quorum' in the same way as 'failover' and have confused both me and yourself.
Support Fatherhood - Stop Family Law
Ralph Grothe
Honored Contributor

Re: Need help on output interpretation for writing a Nagios quorum server check plug-in

Hi Michael,

you have confused me too with your reply.

Nagios is a monitoring system that runs on a server of its own and that has nothing to do with ServiceGuard or any other HA clustering SW (though you can replicate the Nagios server itself for HA, but that's not meant here).

One could compare Nagios to proprietary monitoring solutions such as HP OpenView,
although OV relies more or less completely on SNMP, I assume (which Nagios can be configured to use as well, but usually isn't; the protocol of the checks is entirely up to the user, or rather to the implementation of the plug-ins used).
Also, OV is more of a check-"pushing" system (e.g. SNMP traps), whereas Nagios is usually set up to behave like a check-"polling" system (although one can also set up Nagios to rely mostly on so-called passive checks, which would make it more similar to OV in this respect).
One could probably best think of Nagios as a construction kit for scheduling check plug-ins that can be extended ad lib.

But sorry for digressing.

All I want is to check by such a custom plug-in that the MC/SG quorum server really is up and available.
I never remotely thought about tampering with MC/SG cluster logic.

The reason why I want these checks to be run from the cluster nodes (via the Nagios Remote Plug-in Executor, nrpe) is, for one, that I simply lack (or don't want to install) on my RHEL Nagios server an MC/SG "client" that can talk to the MC/SG quorum server in its own protocol to query its state, and, second, that such querying would be more realistic if performed from the cluster nodes rather than from the Nagios server.
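In other words, the intended setup would look roughly like this (the command name and the plug-in path on the cluster nodes are just placeholders of mine):

# on each HP-UX cluster node, in nrpe.cfg:
command[check_quorum_server]=/usr/local/nagios/libexec/check_qs_cmquerycl.sh

# on the RHEL Nagios server the service check then boils down to:
$ /usr/lib64/nagios/plugins/check_nrpe -H <cluster-node> -c check_quorum_server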

I know that the cmquerycl command is usually only run to create an MC/SG configuration template dump to be edited.
However, the qs manpage of the quorum server SW cites cmquerycl in a usage example, which led me to believe that cmquerycl is the only (scriptable) user-space command that can "talk" to the QS and obtain some sort of status information showing my Nagios plug-in that the QS is alive and servicing requests.
I would gladly use another command to this end if you could tell me which.
Madness, thy name is system administration
Michael Steele_2
Honored Contributor

Re: Need help on output interpretation for writing a Nagios quorum server check plug-in

Nagios has a ping utility to verify whether a server is up or down.

LVM uses 'vgchange' in the same way with MC/SG to check for 'quorum not present' error messages. But again, you are confusing the purpose of the heartbeat.

If a subnet that connects the Quorum Server to a cluster is also used for the cluster heartbeat, configure the heartbeat on at least one other network as well, so that Quorum Server and heartbeat communication are not likely to fail at the same time.
Support Fatherhood - Stop Family Law
Ralph Grothe
Honored Contributor

Re: Need help on output interpretation for writing a Nagios quorum server check plug-in

> Nagios has a ping utilty to verify server up or down.

I know (btw. for host checks I use check_host, which is a hard link to check_icmp), but I consider a mere check of whether the host that runs the QS service is up or down insufficient
(besides, this gets checked automatically as soon as I integrate a new host into my Nagios config).

>LVM uses 'vgchange' in the same way with MC/S
>to check for quorum not present error
>messages. But again, you are confusing the
>purpose of the heartbeat

I can't follow; what has vgchange to do with the QS?
Here we aren't using a quorum disk (the only case where vgchange activation/deactivation would make any sense to me).
And what has the heartbeat to do with the QS?
As far as I have understood, a QS is only needed to fulfill a quorum when it is due, during a cluster reformation.
Or am I completely wrong?

>a subnet that connects the Quorum Server to
>a cluster is also used for the cluster
>heartbeat, configure the heartbeat on at
>least one other network, so that both Quorum
>Server and heartbeat communication are not
>likely to fail at the same time.

The QS resides in a completely separate LAN and can only be reached from the clustered nodes via the default gw but not through a NIC in the same segment.
So how should this be achieved?
Btw. this cluster's design and configuration wasn't done by me.
I am only the guy who was asked to include this thing in his monitoring.
Even if this cluster had LAN connections that could provide what you are suggesting, I am surely by no means entitled to tamper with its configuration in such a way.

I apologize, Michael; though it may sound so,
I am not trying to be rude.
I appreciate your help very much, but I can't quite follow you, probably as much as you can't follow what I am driveling about.
Madness, thy name is system administration