10-26-2010 08:47 AM
strange behavior - serviceguard 11.16
Some strange things happened last night in our two-node Serviceguard 11.16 cluster (patched according to the release notes).
Let me introduce the environment of this two-node cluster:
Two redundant, dedicated heartbeat networks are established between node A and node B; one heartbeat link goes through one LAN switch and the other heartbeat link goes through the other LAN switch.
What happened last night:
Our quorum server, which provides tie-breaker services for these two nodes, was down. That should not really matter, as the probability of two separate heartbeat networks failing at the same time is very low.
Suddenly, the LAN interface on one node (not a heartbeat interface), carrying the package IP address, just STOPPED working. There was nothing in syslog about a "bad cable connection"; I even checked the logs on the Cisco switches and the nettl.log with 'netfmt', and nothing indicated a bad NIC or anything else.
These were the symptoms of the non-functional LAN card: I could not ping its gateway, and even 'linkloop' from that LAN card to another LAN card on the other node in the cluster didn't work. I tried unplumbing/plumbing the interface and pulling the cable out and back in on both the switch and NIC side; '/sbin/init.d/net start' didn't help either. And the physical layer-2 connection (judging by the LAN switch and dmesg on the host) was OK. It just made no sense.
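For completeness, the low-level checks looked roughly like this (the PPA number and remote MAC address below are placeholders, not the real values from our boxes):

    # List NICs with hardware path, station (MAC) address and PPA number
    lanscan
    # Check link speed/duplex as the driver sees it (assuming PPA 1)
    lanadmin -x 1
    # Link-level loopback test against the remote node's MAC address
    linkloop -i 1 0x001122334455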
Has anybody encountered something similar?
If there's nothing in the Cisco switch syslog, I really have no idea whether it was a bug in the OS. I had to reboot the server, and then the LAN connection on that interface started working again.
Another question to ask:
If the heartbeat is OK between these two nodes (I'm 100% sure it was, because of the two separate HB networks), then there shouldn't be any need for the quorum server to be up, right?
Thank you for your help.
Attached is the output from syslog. The IP address of the QS has been replaced by 'X.X' and the node name by 'NODE_A'.
10-26-2010 09:49 PM
Re: strange behavior - serviceguard 11.16
Quorum is only required when a cluster re-formation happens. If I understand correctly, there was no cluster re-formation in your situation.
Serviceguard normally monitors the health of a LAN card by polling it; if there is a problem, it marks the card as down. The important parameter here is NETWORK_FAILURE_DETECTION; its value defines in which situations Serviceguard marks a card as bad (is your setting INOUT?). Since the card was not marked as dead, we can assume your LAN card looked healthy, at least to Serviceguard.
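You can verify the current setting by dumping the running cluster configuration to an ASCII file; a quick sketch (the cluster name 'cluster1' and the file path are placeholders):

    # Dump the binary cluster configuration to a readable file
    cmgetconf -c cluster1 /tmp/cluster.ascii
    # Check the failure-detection setting
    grep NETWORK_FAILURE_DETECTION /tmp/cluster.ascii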
>> I could not ping its gateway, even 'linkloop' from that lan card to another lan card on the other node in the cluster didn't work.
I think this can be a network problem where L2 is operational but L3 is not. A similar issue occurs when someone changes the VLAN of the server's switch port: your L2 link is okay, but you can't reach your gateway.
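A quick way to spot that condition from the HP-UX side is to check whether ARP for the gateway even resolves; a minimal sketch (the gateway address is a placeholder):

    # Ping the gateway (64-byte packets, 3 of them); this forces an ARP request
    ping 192.168.1.1 64 3
    # With a wrong VLAN, no complete ARP entry for the gateway will appear
    arp -a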
10-27-2010 01:31 AM
Re: strange behavior - serviceguard 11.16
But if everything was OK on the switch and the status of the ports on the switch was okay (nothing in the Cisco syslog), then why didn't the linkloop (a layer-2 test, right?) pass? That's weird, isn't it? I had to reboot the server to get linkloop working again.
10-27-2010 02:59 AM
Re: strange behavior - serviceguard 11.16
On NODE_A, did the network problem start just after the quorum server went down?
10-27-2010 03:25 AM
Re: strange behavior - serviceguard 11.16
The quorum server was down from the beginning, from when the cluster FIRST started. So the cluster had been doing fine without the quorum server for about a day and 20 hours; those messages in syslog just suddenly appeared.
But from the output in the attachment, there seems to be some connection between the quorum server not being reachable (even though the quorum server had been down for much longer) and the whole subnet being lost. If it's some kind of Serviceguard bug where, once it cannot connect to the quorum server, it simply considers the whole subnet to be down, that's just weird.
But the quorum server is on a different subnet than the node which lost LAN connectivity.
10-27-2010 01:36 PM
Re: strange behavior - serviceguard 11.16
A graceful cluster re-formation (such as a node leaving or joining the cluster) does not invoke a need to consult the quorum server (the arbitration device), so inaccessibility of the QS should not cause a problem unless ALL HB networks fail unexpectedly while a 2-node cluster is running. A 1-node cluster does not generate HB and does not need a quorum server.
If the problem recurs, use landiag to reset the NIC and see whether that restores the NIC's transmission capabilities.
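A rough sketch of that reset (the PPA number is a placeholder; get the real one from lanscan):

    # Find the card's PPA number
    lanscan
    # Check speed/duplex before and after the reset
    lanadmin -x 1
    # Reset the card (landiag is the interactive front end to the same tool)
    lanadmin -r 1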
10-27-2010 09:36 PM
Re: strange behavior - serviceguard 11.16
Anyway, how come Serviceguard didn't consider that particular LAN interface to be down if it couldn't communicate?
10-27-2010 11:33 PM
Re: strange behavior - serviceguard 11.16
INOUT: When both the inbound and outbound counts stop incrementing for a certain amount of time, Serviceguard declares the card bad (Serviceguard calculates the time depending on the type of LAN card). Serviceguard will not declare the card bad if only the inbound or only the outbound count stops incrementing; both must stop. This is the default.
In your situation, that means the server could still send or receive packets (possibly only send). So the important thing is whether, at that time, the switch could actually forward the packets to their destinations. Also make sure that nobody changed the VLAN of the server's port.
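To see whether those inbound/outbound counters are actually moving, you can compare the interface statistics a few seconds apart; a minimal sketch:

    # Ipkts/Opkts per interface; both should keep incrementing on a healthy card
    netstat -in
    sleep 5
    netstat -in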
10-28-2010 12:15 AM
Re: strange behavior - serviceguard 11.16
So what you are suggesting is changing the cluster config parameter to INONLY_OR_INOUT?
Would Serviceguard then react more "accurately" to networking problems (as I don't know yet whether the fault was on the switch or the NIC side)? Am I right?
One more note: this interface card is not configured as a HEARTBEAT interface, just as a stationary one. The other two NICs are configured as HB.
I'm no networking expert, but suppose there is a night with obviously no application activity (it's SAP running on that cluster node) because no users are using the application. Considering that there would be no network activity on that NIC (not even HB packets, as it's not configured as an HB NIC) and the setting were INONLY_OR_INOUT, wouldn't that cause Serviceguard to think, "Why isn't the NIC sending/receiving any packets? I'll mark it as down"? Maybe that's nonsense, but I always try to consider the side effects before changing something :).
10-28-2010 12:38 AM
Re: strange behavior - serviceguard 11.16
INONLY_OR_INOUT: This option will also declare the card bad if both the inbound and outbound counts stop incrementing. However, it will additionally declare the card bad if only the inbound count stops. This option is not suitable for all environments. Before choosing it, be sure these conditions are met (a sketch of the change procedure follows the list):
- All bridged nets in the cluster should have more than two interfaces each.
- Each primary interface should have at least one standby interface, and it should be connected to a standby switch.
- The primary switch should be directly connected to its standby.
- There should be no single point of failure anywhere on all bridged nets.
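If you decide to change it, it's the usual edit/check/apply cycle on the cluster ASCII file; a rough sketch, assuming a cluster named 'cluster1' (name and path are placeholders):

    # Dump the current configuration
    cmgetconf -c cluster1 /tmp/cluster.ascii
    # Edit: NETWORK_FAILURE_DETECTION INOUT -> INONLY_OR_INOUT
    # Then verify and apply
    cmcheckconf -C /tmp/cluster.ascii
    cmapplyconf -C /tmp/cluster.ascii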
Newer Serviceguard versions (A.11.19?) have a new directive, IP_MONITOR, which polls a target at the IP layer. For example, you can give it the default gateway and Serviceguard will monitor IP connectivity to the gateway.
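In the A.11.19 cluster ASCII file that looks roughly like this (subnet and gateway addresses are placeholders, and I'm quoting the syntax from memory):

    SUBNET 192.168.1.0
      IP_MONITOR ON
      POLLING_TARGET 192.168.1.1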
With older versions of Serviceguard we use the INONLY_OR_INOUT directive. We have two cards for the data network: the primary is connected to switch 1 and the standby to switch 2. Switches 1 and 2 are also connected to each other, and HSRP runs between them. This way we can still survive when the active switch is operational at L2 but not at L3.