08-15-2007 09:43 PM
SG & network failure
I would like to clarify the behaviour of Serviceguard (SG) in one specific situation.
My config is the following: a 2-node RH Linux SG cluster plus a quorum server. Both servers have 4 NICs - 2 of them dedicated to 2 separate heartbeats, the remaining 2 bonded together for access to the public/data LAN. Both servers have 3 HBAs, 2 of them for failover access to the SAN and the 3rd one for backup. Multipathing is managed by the QLogic driver; SW mirroring is managed using md devices.
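For illustration, a typical channel-bonding setup matching this description (two slave NICs in active-backup mode with link monitoring) might look like the fragment below on a RHEL system of that era. The interface names and values are illustrative assumptions, not taken from the thread:

```
# /etc/modprobe.conf - illustrative bonding options
alias bond0 bonding
options bond0 mode=active-backup miimon=100

# /etc/sysconfig/network-scripts/ifcfg-eth2 (one of the two slaves)
DEVICE=eth2
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```

With miimon link monitoring, the bond fails over between its slaves; only when both slaves lose link does the bond itself go down, which is the scenario discussed here.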
Let's imagine a failure scenario where, on one server, both NICs making up the bond for the data LAN fail, or the network connection is completely lost for some other reason. Both heartbeat connections are alive, and the connection to the SAN is alive.
What should SG's behaviour be in such a scenario?
I would expect SG to relocate all the packages from the server with the failing LAN access to the other node. When the LAN connection on the first server is re-established, I should be able to fail the packages back without any problems. Am I right?
In my configuration the failover is performed correctly, but at failback (I do it manually; automatic failback is forbidden) the packages are not started and errors starting up the md devices are reported. The problem persists until I reboot the server.
I guess something is misconfigured, as a LAN failure should not cause a SAN access failure.
thanks for clarification.
emha.
08-16-2007 12:49 AM
Re: SG & network failure
Since the SG package configuration includes a floating IP address on the public network, the scenario you describe should cause all packages to fail over to the node with the working public NIC interface.
As to the errors when you try to fail back manually, I'd need to see the error messages from /var/log/messages.
I imagine the fix should be:
cmhaltnode (on the downed node)
cmrunnode (on the downed node)
cmhaltpkg (may be useful in solving the problem on the disconnected node)
Before trying to fail back, you must be sure networking is up and the system has normal connectivity.
Seeing the actual cluster configuration ASCII file would also be useful.
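As a hedged sketch, the halt/run sequence suggested here could look like the following. It is wrapped in a dry-run helper so it can be read without a live cluster; the node and package names (node1, pkg1) and the exact flags are placeholder assumptions, not taken from this thread:

```shell
# Dry-run sketch of the manual failback recovery sequence.
# DRYRUN=1 (the default here) prints each command instead of executing it,
# since the cm* tools only exist on a Serviceguard node.
run() {
  if [ "${DRYRUN:-1}" = 1 ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

run cmhaltpkg pkg1            # stop the stuck package on the disconnected node
run cmhaltnode -f node1       # take the node out of the cluster
run cmrunnode node1           # rejoin it once networking is confirmed up
run cmrunpkg -n node1 pkg1    # then fail the package back manually
```

The point of the halt/run pair is to give the node a clean cluster state before the package is started on it again.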
SEP
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
08-16-2007 04:35 AM
Re: SG & network failure
Yes, it's absolutely clear that all the packages have to be moved, and in fact that is what happened. My doubts are mainly about the failback.
At the moment I tried to move the package back, the network was already up - I checked it in /var/log/messages and tried to ping the server.
My question, more or less, was how SG should behave in principle after a LAN connectivity failure. I didn't want you to analyze my logs (although I can post them if you are interested).
Why do you think halt/run node is necessary?
Network connectivity is just one of the package resources, like a managed process, so if it fails SG should perform a failover. A later failback shouldn't be restricted in any way.
emha.
08-16-2007 05:56 AM
Re: SG & network failure
>>>
why do you think halt/run node is necessary?
network connectivity is just one of the package resources, like a managed process, so if it fails SG should perform failover. later failback shouldn't be restricted anyhow.
>>>
I deal with very stable configurations that rarely see anything other than disk failure.
But I do remember that when we pulled network plugs on our SG cluster in SG class, we sometimes had to run cmhaltnode or cmhaltpkg before we could get the cluster to reform after a catastrophic loss of service, such as a TOC.
Someone is going to need to look at the logs to see why the failback does not work. Once network connectivity is re-established, you should be able to fail back.
I don't think cmhaltnode is strictly necessary, but I do think cmhaltnode/cmrunnode will clear the errors in your logs and permit you to fail the packages back, unless the network connection has not recovered sufficiently.
I just crashed a Red Hat cluster by pulling its only NIC connection and was forced to reboot both nodes to recover it. SG should, with some intervention, be able to recover from the same thing. It's just a matter of identifying the issue and writing the procedure.
SEP
08-16-2007 06:30 AM
Re: SG & network failure
------
Someone is going to need to look at the logs to see why the failback does not work. Once network connectivity is re-established, you should be able to fail back.
------
Thanks, this is the fact I wanted to confirm for myself (the system vendor keeps telling me that SG is not able to handle a LAN connectivity outage on a bonded interface, and that I experienced 2 separate issues, LAN/md).
------
But I do remember that when we pulled network plugs on our SG cluster in SG class, we sometimes had to run cmhaltnode or cmhaltpkg before we could get the cluster to reform after a catastrophic loss of service, such as a TOC.
------
Yes, it's very common (and I think also recommended) to halt/reboot a node if you pull out all the network connections, including the heartbeats, and a TOC is invoked.
------
I just crashed a Red Hat cluster by pulling its only NIC connection and was forced to reboot both nodes to recover it. SG should, with some intervention, be able to recover from the same thing. It's just a matter of identifying the issue and writing the procedure.
------
If it was the only NIC, then I guess it was used for the heartbeat as well. If so, a reboot of that node would seem OK to me, but I would expect the 2nd node to survive if it had an accessible quorum.
Anyway, thanks for sharing your experience.
emha.
08-17-2007 03:19 AM
Re: SG & network failure
For LAN failover, and package failover in general:
Yes - when both connections in a bonded pair fail, the package should fail over. From your description, that is what happened.
First a question - how are you using MD? MD is only supported in a very limited set of circumstances - specifically with XDC.
From your description, when you try to "fail back", the package startup process runs, but the package does not start correctly. Am I correct? Package failback with XDC requires close attention to the XDC documentation. If you are not using XDC, MD is not supported at all, and there is no way around your problems.
08-17-2007 03:39 AM
Re: SG & network failure
---------------------
First a question - how are you using MD? MD is only supported in a very limited set of circumstances - specifically with XDC.
---------------------
Yes, I use XDC.
---------------------
From your description, when you try to "fail back", the package startup process runs, but the package does not start correctly. Am I correct?
---------------------
Exactly.
But from my point of view, the packages fail to start for a different reason (the md mirror cannot be assembled due to "resource busy") than the reason for the failover (the LAN outage).
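One plausible cause of "resource busy" on failback is that the array is still assembled on the node. As an illustrative sketch (not a diagnosis from this thread's logs), /proc/mdstat can be checked before the package tries to rebuild the mirror; the sample content and device names below are assumptions:

```shell
# Sketch: check whether an md array is still assembled before failback.
# Sample /proc/mdstat content is inlined; on a real node you would read
# the file itself (e.g. mdstat=$(cat /proc/mdstat)).
mdstat='Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      1048512 blocks [2/2] [UU]
unused devices: <none>'

if printf '%s\n' "$mdstat" | grep -q '^md0 : active'; then
  state="active"
  echo "md0 is still assembled - stop it (e.g. mdadm --stop /dev/md0) before failback"
else
  state="stopped"
  echo "md0 is not assembled - package startup should be able to build the mirror"
fi
```

If the array really is left active after the failover, that would explain why only a reboot (which tears it down) clears the problem.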
emha.
08-17-2007 03:48 AM
Re: SG & network failure
If you are still seeing the problem after waiting, can you attach the logs from both systems? Please extract just the time period around the manual "failback". Also, I believe there is a separate XDC-related log associated with the package. That log and any package logs might be helpful.
I'll ask the developers to take a look. There may not be a response until Monday because of timezones.
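Extracting just the relevant window from a syslog-format file like /var/log/messages can be done with a small awk filter. The sample lines, hostname, and message texts below are illustrative assumptions (the window matches the times mentioned later in the thread):

```shell
# Filter syslog-format lines to a time window (Aug 15, 12:21-12:50).
# Sample lines are inlined; in practice you would pipe /var/log/messages in.
log='Aug 15 12:10:03 node1 kernel: bonding: bond0: link status down for eth0
Aug 15 12:21:56 node1 kernel: bonding: bond0: now running without any active interface!
Aug 15 12:45:50 node1 cmcld: request to start package pkg1
Aug 15 13:02:11 node1 sshd: session opened'

window=$(printf '%s\n' "$log" | awk '$1 == "Aug" && $2 == 15 {
  split($3, t, ":")
  m = t[1] * 60 + t[2]                  # minutes since midnight
  if (m >= 12*60 + 21 && m <= 12*60 + 50) print
}')
printf '%s\n' "$window"
```

This keeps only the lines between 12:21 and 12:50, which is roughly the span from the outage to the manual failback.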
08-19-2007 09:28 PM
Re: SG & network failure
I didn't check whether the remirroring had finished.
But as far as I know, if the mirror is stopped before the resync completes, it's not a problem: the next time the mirror is brought up, the resync starts again.
I'm not waiting for a solution here, as I have already resolved it. However, I have attached the logs so you can check them.
The network outage happened on 15.8 at 12:21:56, the manual failback at 12:45:50.
I'm not aware of any special XDC log - where can I find it?
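Whether a resync/recovery was still running can be read off /proc/mdstat, which shows a progress line while the mirror is rebuilding. A minimal sketch, with sample content and values that are illustrative assumptions:

```shell
# Sketch: detect an in-progress md resync/recovery from /proc/mdstat output.
# Sample content is inlined; on a real node, read the file itself.
mdstat='md0 : active raid1 sda1[0] sdb1[1]
      1048512 blocks [2/1] [U_]
      [==>..................]  recovery = 12.5% (131064/1048512) finish=0.8min'

if printf '%s\n' "$mdstat" | grep -Eq 'resync|recovery'; then
  resync="running"
else
  resync="done"
fi
echo "resync status: $resync"
```

As noted above, stopping an array mid-resync is not fatal - the rebuild restarts the next time the array is assembled.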
emha.