
SG & network failure

emha_1
Valued Contributor

SG & network failure

Hi,

I would like to clarify the behaviour of SG in one specific situation.
My config is as follows: I have a 2-node RH Linux SG cluster + quorum server. Both servers have 4 NICs: 2 of them are dedicated to 2 separate heartbeats, the other 2 are bonded together for access to the public/data LAN. Both servers have 3 HBAs, 2 of them for failover access to the SAN, the 3rd one for backup. Multipathing is managed by the qlogic driver, SW mirroring is managed using md devices.

Let's imagine a failure scenario where, on one server, both NICs making up the bond for the data LAN fail, or for some reason the network connection is completely lost. Both heartbeat connections are alive, and the connection to the SAN is alive.
What should the behaviour of SG be in such a scenario?

I would expect SG to relocate all the packages from the server with the failing LAN access to the other node. When the LAN connection on the first server is reestablished, I should be able to fail the packages back without any problems. Am I right?

In my configuration the failover is done correctly, but at failback (I do it manually, automatic failback is disabled) the packages are not started and errors about starting up the md devices are reported. The problem persists until I reboot the server.
I guess something is misconfigured, as a LAN failure should not cause a SAN access failure.

thanks for clarification.

emha.
8 REPLIES
Steven E. Protter
Exalted Contributor

Re: SG & network failure

Shalom,

Since the SG package configuration includes a floating IP address on the public network, the scenario you give should cause all packages to fail over to the node that has the working public NIC interface.

As to the errors when you try to fail back manually, I'd need to see the error messages from /var/log/messages.

I imagine the fix should be:
cmhaltnode (on the downed node)
cmrunnode (on the downed node)
cmhaltpkg (may be useful in solving the problem on the disconnected node.)

Before trying to fail back, you must be sure networking is up and the system has normal connectivity.
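
For reference, the halt/run and manual failback sequence discussed here could be sketched roughly as below. This is only a dry-run sketch, not a verified procedure: the package name pkg1 and node name node1 are placeholders, the helper failback_plan is hypothetical, and the exact options should be checked against the man pages for your Serviceguard release. It only builds and prints the commands; it does not run anything.

```python
from typing import List


def failback_plan(package: str, target_node: str,
                  rejoin_cluster: bool = False) -> List[List[str]]:
    """Return the ordered commands for a manual failback (nothing is executed).

    package and target_node are placeholders supplied by the caller.
    """
    plan: List[List[str]] = []
    if rejoin_cluster:
        # Optional step from the reply above: take the recovered node out of
        # the cluster and let it rejoin before moving anything back.
        plan.append(["cmhaltnode", target_node])
        plan.append(["cmrunnode", target_node])
    # Check cluster, node and package state before touching the package.
    plan.append(["cmviewcl", "-v"])
    # Halt the package where it currently runs, start it on the target node,
    # then re-enable package switching (halting a package disables it).
    plan.append(["cmhaltpkg", package])
    plan.append(["cmrunpkg", "-n", target_node, package])
    plan.append(["cmmodpkg", "-e", package])
    return plan


# Print the plan for the placeholder names; review before running by hand.
for command in failback_plan("pkg1", "node1", rejoin_cluster=True):
    print(" ".join(command))
```

Keeping it as a printed plan is deliberate: the commands should only be run once networking on the recovered node is confirmed to be up.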

Seeing the actual cluster configuration ascii file would also be useful.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
emha_1
Valued Contributor

Re: SG & network failure

Hi Steven,

Yes, it's absolutely clear that all the packages have to be moved, and that was in fact done. My doubts are mainly about failback.

At the moment I tried to move the package back the network was already up; I had checked it in /var/log/messages and tried to ping the server.

My question was more or less how SG should behave in principle after a LAN connectivity failure. I didn't want you to analyze my logs (however, if you are interested, I can post them).

Why do you think halt/run node is necessary?
Network connectivity is just one of the package resources, like a managed process, so if it fails SG should perform a failover. A later failback shouldn't be restricted in any way.

emha.
Steven E. Protter
Exalted Contributor

Re: SG & network failure

Shalom,

>>>
why do you think halt/run node is necessary?
network connectivity is just one of the package resources, like a managed process, so if it fails SG should perform failover. later failback shouldn't be restricted anyhow.
>>>

I deal with very stable configurations that rarely deal with anything other than disk failure.

But I do remember when we pulled network plugs on our SG in SG class that sometimes we had to run cmhaltnode or cmhaltpkg before we could get the cluster to reform after a catastrophic loss of service, such as a TOC.

Someone is going to need to look at the logs to see why the failback does not work. Once network connectivity is re-established, you should be able to fail back.

I don't think cmhaltnode is strictly necessary, but I think cmhaltnode/cmrunnode will clear the errors in your logs and permit you to fail the packages back, unless the network connection has not recovered sufficiently.

I just crashed a Red Hat cluster by pulling its only NIC connection and was forced to reboot both nodes to recover it. SG should, with some intervention, be able to recover from the same thing. It's just a matter of identifying the issue and writing the procedure.

SEP
emha_1
Valued Contributor

Re: SG & network failure

Hi Steven,

------
Someone is going to need to look at the logs to see why the failback does not work. Once network connectivity is re-established, you should be able to fail back.
------
Thanks, this is the fact I wanted to confirm for myself (the system vendor keeps telling me SG is not able to handle a LAN connectivity outage on a bonded interface, and that I experienced 2 separate issues, LAN and md).



------
But I do remember when we pulled network plugs on our SG in SG class that sometimes we had to run cmhaltnode or cmhaltpkg before we could get the cluster to reform after a catastrophic loss of service, such as TOC.
------
Yes, it's very common (and I think also recommended) to halt/reboot the node if you pull out all the network connections including the heartbeats and a TOC is invoked.




------
I just crashed a red hat cluster by pulling its only NIC connection and was forced to reboot both nodes to recover it. SG should with some intervention be able to recover from the same thing. It's just a matter of identifying the issue and writing the procedure.
------
If it was the only NIC then I guess it was used for the heartbeat as well. If so, a reboot of that node seems OK to me, but I would expect the 2nd node to survive if it had an accessible quorum.


Anyway, thanks for sharing your experience.

emha.

Serviceguard for Linux
Honored Contributor

Re: SG & network failure

First a question - how are you using MD? MD is only supported in a very limited set of circumstances - specifically with XDC. I didn't want to get you off the main topic but wanted to make sure you didn't wind up with an unsupportable configuration.

For LAN failover, and package failover in general:

Yes - when both connections in a bonded pair fail then a package should fail over. From your description that should happen.

From your description, when you try to "failback" the package startup process happens, but the package is not starting correctly. Am I correct? Package failback with XDC requires close attention to those docs. If you are not using XDC, you CANNOT use MD and there are no ways around your problems.
emha_1
Valued Contributor

Re: SG & network failure

Hi,


---------------------
First a question - how are you using MD? MD is only supported in a very limited set of circumstances - specifically with XDC.
---------------------
yes, I use XDC.


---------------------
From your description, when you try to "failback" the package startup process happens, but the package is not starting correctly. Am I correct?
---------------------
exactly.
But from my point of view they fail to start for different reasons (the md mirror cannot be built due to "resource busy") than the reason for the failover (LAN outage).

emha.
Serviceguard for Linux
Honored Contributor

Re: SG & network failure

I agree, that's what it sounds like. One other question: is the remirroring process completing on the first server? That is, when you do the first failover by pulling the LAN cables, MD will generally start a remirror. Are you waiting for that to complete before doing the manual failback?
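
One way to check whether a remirror is still running before the manual failback is to look at /proc/mdstat. Below is a minimal sketch, assuming the usual mdstat layout where an array in resync or recovery shows a progress line with a percentage. The sample text is made up for illustration (device names and numbers are not from this cluster), and the helper name arrays_still_syncing is hypothetical.

```python
import re
from typing import Dict

# Illustrative /proc/mdstat content; devices and numbers are invented.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sdd1[1] sdc1[0]
      2096384 blocks [2/1] [U_]
      [==>..................]  recovery = 12.6% (264192/2096384) finish=1.2min speed=25000K/sec

unused devices: <none>
"""

ARRAY_RE = re.compile(r"^(md\d+)\s*:")
PROGRESS_RE = re.compile(r"\b(resync|recovery)\s*=\s*([\d.]+)%")


def arrays_still_syncing(mdstat_text: str) -> Dict[str, float]:
    """Map each md array that shows a resync/recovery progress line to its percentage.

    Arrays whose resync is only queued (no percentage shown) are not reported.
    """
    syncing: Dict[str, float] = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            # A new "mdN : ..." block starts here.
            current = header.group(1)
            continue
        progress = PROGRESS_RE.search(line)
        if progress and current is not None:
            syncing[current] = float(progress.group(2))
    return syncing


print(arrays_still_syncing(SAMPLE_MDSTAT))
```

On a live node the same function could be fed the contents of /proc/mdstat instead of the sample text; an empty result would suggest no remirror is in progress.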

If you are waiting and still see the problem, can you attach logs from both systems? Please extract just the time period around the manual "failback". Also, I believe there is a separate XDC-related log associated with the package. This and any package logs might be helpful.

I'll ask the developers to take a look. There may not be a response until Monday because of timezones.
emha_1
Valued Contributor

Re: SG & network failure

Hi,

I didn't check whether remirroring was finished.
But as far as I know, if a mirror is stopped without the resync being completed, it's not a problem: the next time the mirror is brought up, the resync is started again.

I'm not waiting for a solution here, I already resolved it. However, I attached the logs, so you can check them.
The network outage happened on 15.8. at 12:21:56, the manual failback at 12:45:50.
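
To cut a log down to the period around those two timestamps, something like the sketch below could be used. It assumes classic syslog lines that start with a "Mon DD HH:MM:SS" stamp. The year is not recorded in such lines and is not given in this thread, so ASSUMED_YEAR is a placeholder, the 5-minute margin is arbitrary, and the sample lines are placeholders rather than real Serviceguard messages.

```python
from datetime import datetime, timedelta
from typing import Iterable, Iterator, Optional

ASSUMED_YEAR = 2000  # placeholder: syslog stamps carry no year
STAMP_FORMAT = "%Y %b %d %H:%M:%S"


def parse_stamp(line: str, year: int) -> Optional[datetime]:
    """Parse the leading 'Mon DD HH:MM:SS' stamp, or return None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", STAMP_FORMAT)
    except ValueError:
        return None


def lines_in_window(lines: Iterable[str], start: datetime,
                    end: datetime, year: int) -> Iterator[str]:
    """Yield only the lines whose stamp falls inside [start, end]."""
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is not None and start <= stamp <= end:
            yield line


# Outage and failback times from the post, with an arbitrary 5-minute margin.
MARGIN = timedelta(minutes=5)
START = datetime(ASSUMED_YEAR, 8, 15, 12, 21, 56) - MARGIN
END = datetime(ASSUMED_YEAR, 8, 15, 12, 45, 50) + MARGIN

# Placeholder lines standing in for /var/log/messages content.
SAMPLE_LINES = [
    "Aug 15 12:10:00 node1 placeholder: before the window",
    "Aug 15 12:21:56 node1 placeholder: outage time from the post",
    "Aug 15 12:45:50 node1 placeholder: manual failback time from the post",
    "Aug 15 13:30:00 node1 placeholder: after the window",
]

for kept in lines_in_window(SAMPLE_LINES, START, END, ASSUMED_YEAR):
    print(kept)
```

Lines without a leading stamp (for example continuation lines) are dropped by this sketch; widen the margin if the interesting messages start earlier.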

I'm not aware of any special XDC log, where can I find it?

emha.