
strange behavior - serviceguard 11.16

 
g3jza
Esteemed Contributor

strange behavior - serviceguard 11.16

Hi,
Some strange things happened last night in our 2-node Serviceguard 11.16 cluster (patched according to the release notes).

Let me describe the environment of this 2-node cluster:

Two redundant dedicated heartbeat networks are established between node A and node B. One heartbeat link goes through one LAN switch and the other heartbeat link goes through the other LAN switch.

What happened last night:

Our quorum server, providing tie-breaker services for those 2 nodes, was down. That should not really matter, as the possibility of 2 separate heartbeat networks failing at the same time is very low.

Suddenly, the LAN interface on one node (not a heartbeat interface), carrying the package IP address, just STOPPED working. There was nothing in syslog saying anything about a "bad cable connection"; I even checked the logs on the Cisco switches and checked nettl.log with 'netfmt', and found nothing indicating a bad HBA or anything else.

These were the symptoms of the non-functional LAN card: I could not ping its gateway, and even 'linkloop' from that LAN card to another LAN card on the other node in the cluster didn't work. I tried unplumbing/plumbing the interface and pulling the cable out and back in on the switch / HBA side. /sbin/init.d/net start didn't help. And the physical layer-2 connection (judging by the LAN switch / dmesg on the host) was OK. It just made no sense.
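
For reference, the checks looked roughly like this (the PPA number and the remote MAC below are placeholders, not the real values from our nodes):

    # find the PPA of the interface
    lanscan
    # show the card's link settings (speed/duplex) as the driver sees them
    lanadmin -x 2
    # layer-2 loopback test towards a NIC on the other node
    linkloop -i 2 0x00306EXXXXXX
    # IP-level view of the configured interfaces
    netstat -in
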
Has anybody encountered something similar?

If there's nothing in the Cisco switch syslog, I really have no idea whether it was a bug in the OS. I had to reboot the server, and the LAN connection on that interface started working again.

Another question to ask:
If the heartbeat is OK between these 2 nodes (I'm 100% sure it was, because of the 2 separate HB networks), then there shouldn't be any need for the quorum server to be up, right?

Thank you for your help.



In the attachment is the output from syslog. The IP address of the QS has been replaced by 'X.X' and the node name by 'NODE_A'.


10 REPLIES
Turgay Cavdar
Honored Contributor

Re: strange behavior - serviceguard 11.16

Hi,
Quorum is required only when a cluster re-formation happens. If I understand correctly, there was no cluster re-formation in your situation.

Serviceguard normally monitors the health of a LAN card by polling it; if there is a problem, it will mark the card as down. The important parameter here is "NETWORK_FAILURE_DETECTION". The value of this parameter defines in which situations Serviceguard marks the card as bad (is your setting INOUT?). In your situation, since the card was not marked as dead, we can assume the LAN card was healthy, at least as far as Serviceguard is concerned.

>> I could not ping its gateway, and even 'linkloop' from that LAN card to another LAN card on the other node in the cluster didn't work.
I think this could be a network problem where L2 is operational but L3 is not. A similar issue occurs when someone changes the VLAN of the server port: L2 is okay, but you can't reach your gateway.
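
In the cluster ASCII file that parameter is a single entry; the excerpt below is only an illustration (the node name, interfaces and addresses are placeholders, not from your cluster):

    NETWORK_FAILURE_DETECTION   INOUT

    NODE_NAME           NODE_A
      NETWORK_INTERFACE lan1
        HEARTBEAT_IP    10.0.0.1
      NETWORK_INTERFACE lan2
        STATIONARY_IP   192.168.10.1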
g3jza
Esteemed Contributor

Re: strange behavior - serviceguard 11.16

The NETWORK_FAILURE_DETECTION is set to INOUT.

But if everything was OK on the switch and the status of the ports on the switch was okay (nothing in the Cisco syslog), then why didn't the linkloop (a layer-2 test, right?) pass? That's weird, isn't it? I had to reboot the server to get linkloop working again.



Turgay Cavdar
Honored Contributor

Re: strange behavior - serviceguard 11.16

Hi,
Did the network problem on NODE_A start just after the quorum server went down?
g3jza
Esteemed Contributor

Re: strange behavior - serviceguard 11.16

Hi again :)

The quorum server was down from the beginning, when the cluster FIRST started. So the cluster was doing fine without the quorum server for about a day and 20 hours. Those messages in syslog just suddenly appeared.

But from the output in the attachment there seems to be some connection between the quorum server not being reachable (even though the quorum server had been down much longer) and the whole subnet being reported as lost, which is just weird. If it's some kind of Serviceguard bug, where after it cannot connect to the quorum server it simply considers the whole subnet to be down, that would be strange.
And the quorum server is on a different subnet than the node which lost LAN connectivity.
Stephen Doud
Honored Contributor

Re: strange behavior - serviceguard 11.16

Serviceguard checks accessibility to the arbitration device every hour, so I would expect to see a complaint in the syslog file about its unavailability.

A graceful cluster re-formation (such as a node leaving or joining the cluster) does not invoke a need to consult the quorum server (arbitration device), so inaccessibility of the QS should not cause a problem unless ALL HB networks fail unexpectedly while a 2-node cluster is running. A 1-node cluster does not generate HB and does not need a quorum server.

If the problem recurs, use landiag to reset the NIC and see whether that restores the NIC's ability to transmit.
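
The non-interactive equivalent of the landiag reset is roughly this (PPA 2 is only an example; check lanscan for the real one):

    # find the PPA of the suspect NIC
    lanscan
    # reset the card, which makes it run its self-test again
    lanadmin -r 2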
g3jza
Esteemed Contributor

Re: strange behavior - serviceguard 11.16

That's the one thing I didn't try (resetting the LAN interface).
Anyway, how come Serviceguard didn't consider that particular LAN interface to be down if it couldn't communicate?

Turgay Cavdar
Honored Contributor

Re: strange behavior - serviceguard 11.16

Serviceguard marks the interface down with the INOUT setting when:

INOUT: When both the inbound and outbound counts stop incrementing for a certain amount of time, Serviceguard will declare the card as bad. (Serviceguard calculates the time depending on the type of LAN card.) Serviceguard will not declare the card as bad if only the inbound or only the outbound count stops incrementing. Both must stop. This is the default.

In your situation this means that the server could still send or receive packets (possibly only send). So the important thing here is whether, at that time, the switch could forward the packets to their destinations. Also make sure that nobody changed the VLAN of the server port.
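
You can double-check what the running cluster has configured, and how Serviceguard currently sees the interfaces, with something like this (the cluster name is a placeholder):

    # dump the running cluster configuration to an ASCII file
    cmgetconf -c my_cluster /tmp/cluster.ascii
    grep NETWORK_FAILURE_DETECTION /tmp/cluster.ascii
    # show the cluster, package and network interface status
    cmviewcl -v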
g3jza
Esteemed Contributor

Re: strange behavior - serviceguard 11.16

OK,
so what you are suggesting is changing the cluster config parameter to INONLY_OR_INOUT?
Would Serviceguard then react more "accurately" to networking problems (since I don't know yet whether the fault was on the switch or HBA side), am I right?
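
If I understand right, the change itself would just be editing the cluster ASCII file and re-applying it, something like this (the cluster name is a placeholder):

    cmgetconf -c my_cluster /tmp/cluster.ascii
    # change NETWORK_FAILURE_DETECTION from INOUT to INONLY_OR_INOUT in the file
    cmcheckconf -C /tmp/cluster.ascii
    cmapplyconf -C /tmp/cluster.ascii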

One more note. This interface card is not configured as a HEARTBEAT interface, just stationary. The other 2 NICs are configured as HB.
I'm no networking expert, but let's say there is a night with obviously no application activity (it's SAP running on that cluster node) because no users are using the application. Considering that there would be no network activity on that NIC (not even HB packets, as it's not configured as an HB NIC) and the setting were INONLY_OR_INOUT, wouldn't that cause Serviceguard to think "why isn't the NIC sending/receiving any packets? I will mark it as down"? Maybe it's just nonsense I wrote here, but I always try to consider side effects when changing something :).
Turgay Cavdar
Honored Contributor

Re: strange behavior - serviceguard 11.16

Actually it depends on your environment. Here is what the documentation says about INONLY_OR_INOUT:

INONLY_OR_INOUT: This option will also declare the card as bad if both inbound and outbound counts stop incrementing. However, it will also declare it as bad if only the inbound count stops. This option is not suitable for all environments. Before choosing it, be sure these conditions are met:
- All bridged nets in the cluster should have more than two interfaces each.
- Each primary interface should have at least one standby interface, and it should be connected to a standby switch.
- The primary switch should be directly connected to its standby.
- There should be no single point of failure anywhere on all bridged nets.

Newer Serviceguard versions (A.11.19?) have a new directive, "IP_MONITOR", which polls a target at the IP layer. For example, you can specify the default gateway and Serviceguard will monitor IP connectivity to that gateway.
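
As far as I remember from the newer documentation, it is configured per subnet in the cluster file, something like this (the subnet and gateway below are just placeholders):

    SUBNET            192.168.10.0
      IP_MONITOR      ON
      POLLING_TARGET  192.168.10.1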

With older versions of Serviceguard we use the INONLY_OR_INOUT directive. We have 2 cards for the data network: the primary is connected to switch 1 and the standby to switch 2. Switch 1 and switch 2 are also connected to each other, and HSRP is running on them. This way we can still survive when the active switch is operational at L2 but not at L3.
g3jza
Esteemed Contributor

Re: strange behavior - serviceguard 11.16

Thank you for your time :)