Re: RHEL3AS u3 and SG cmrunnode failure

 
Timo J
Frequent Advisor

RHEL3AS u3 and SG cmrunnode failure

ServiceGuard cluster won't start up after boot.
Let's say that nodes hostA & hostB are halted. I start node hostA, it boots ok, but it won't start a one-node cluster. cmviewcl claims that the status of the cluster is 'unknown' and the status of node hostA is 'down' (node hostB status is unknown). cmviewcl also reports that it can't talk to all nodes.

I start node hostB, it boots up ok, but when it tries to join the cluster, it times out after ~10 minutes. AUTOSTART_CMCLD is 1 in $SGCONF/cmcluster.rc on both nodes. During that 10-minute period, cmviewcl reports the status of the cluster as 'starting'.

On both nodes, the deadman module is loaded ok by the OS before it tries to start the cluster.

After that, I run cmruncl from command line and wohooo, cluster starts up ok.

Attachment contains cmviewcl outputs and syslog entries.
Timo J
Frequent Advisor

Re: RHEL3AS u3 and SG cmrunnode failure

Uh, new try with attachment with linefeeds (at least for windows users)
Steven E. Protter
Exalted Contributor

Re: RHEL3AS u3 and SG cmrunnode failure

I see that you are using bonding. This is teaming two NICs so that they share a single IP address. This is usually good, and I do it on one of my NAS servers to improve reliability and throughput.

The problem is SG might not like it.

A few things to remember (you may already know this).

SG is a High Availability system. You can not have a volume group activated on two nodes at the same time. You can't have a package running on two nodes at the same time.

Packages and volume groups pass from node to node when a node goes down.

My guess, based on the information provided, is that nodeb is trying to activate a volume group that nodea has already activated.

Or: That the formation of the NIC bonding is confusing SG or is happening at the wrong point in the boot sequence.

I suggest you investigate these two possibilities. Please report back if you solve the issue without further assistance.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Timo J
Frequent Advisor

Re: RHEL3AS u3 and SG cmrunnode failure

Yes, I've configured the systems so that the shared volume group (on MSA1000 through fibres) is not activated on boot. (Edited /etc/rc.d/rc.sysinit so that both systems run vgscan on boot on local and shared disks normally but activate _only_ the local vg.)

And that works fine; if I change AUTOSTART_CMCLD to 0 and boot the nodes, the shared volume group gets scanned (to get the entry into /etc/lvmtab) but not activated on either node. And that's the right way.
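
The boot-time split described above (scan everything, activate only the local volume group) can be sketched as a small helper that builds the command list. This is only an illustration, not the actual rc.sysinit change: the volume group names vg00 and vgshared are hypothetical, and the helper only returns the commands rather than running them.

```python
from typing import Iterable, List


def boot_lvm_commands(all_vgs: Iterable[str], shared_vgs: Iterable[str]) -> List[List[str]]:
    """Build the LVM commands to run at boot.

    All disks are scanned, but only volume groups that are NOT shared
    are activated. Shared volume groups are left for the cluster software.
    """
    shared = set(shared_vgs)
    commands = [["vgscan"]]  # scan local and shared disks alike
    for vg in all_vgs:
        if vg not in shared:
            # activate local volume groups only
            commands.append(["vgchange", "-a", "y", vg])
    return commands


if __name__ == "__main__":
    # vg00 and vgshared are made-up names for illustration.
    for cmd in boot_lvm_commands(["vg00", "vgshared"], ["vgshared"]):
        print(" ".join(cmd))
```

Keeping the shared volume group out of the activation list is what leaves it free for the package to activate on whichever node runs it.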

I also have a lock LUN defined on the MSA1000. That should decide which node activates the shared vg. And that works fine, I haven't seen any cases where both nodes are trying to concurrently activate the shared vg.

In fact, in RH you can activate the shared vg manually on nodeB even if it's activated on nodeA by SG. But as far as I can see, the lock LUN works ok: whichever node gets the lock LUN first starts the package, and the other node stays on standby. (Linux vgchange is missing the '-c' switch that HP-UX vgchange has for marking a volume group as a member of the high availability cluster, so that it can't be activated even manually on e.g. nodeA if nodeB has already activated it.)

Also, before the nodes try to form a cluster at bootup, SG reports on the console that network verification is ok, so I don't think that this is a bonding issue. (I'll check tomorrow if all bonds are up before SG tries to form the cluster, but as far as I can remember, the bonds are up ok before the SG commands.)

And I think that this is not a shared vg issue, because if nodeB is halted when nodeA is trying to form a single-node cluster and activate that shared vg, there is definitely no possibility that nodeB has that vg activated.

Now it's 09:00PM here in Finland and I'd like to sleep a little bit....
Serviceguard for Linux
Honored Contributor

Re: RHEL3AS u3 and SG cmrunnode failure

Bonding is fine with Serviceguard for Linux. It is actually the only way to have NIC failover. An extra heartbeat network is recommended but not required.

I believe you need to execute the cmruncl command (it may be either with -f or -n nodename).

As I understand it, SG doesn't want to start without all nodes. It is assumed that the user is starting all the nodes & will (therefore) know that the cluster is not up and take appropriate action.
Timo J
Frequent Advisor

Re: RHEL3AS u3 and SG cmrunnode failure

Rick, you're right, executing cmruncl on bootup was one of the things that I was thinking of. But on the other hand, in HP-UX SG it's enough to set AUTOSTART_CMCLD to 1 to start even a single-node cluster. I can't find anything related to this in the Linux SG manuals. And it doesn't sound fair that the SG installation expects the user to add that cmruncl command manually to the configuration?
Timo J
Frequent Advisor

Re: RHEL3AS u3 and SG cmrunnode failure

I noticed that if the cluster is up when I shut down both hosts, then the cluster won't start up on next boot. But if I do cmhaltcl before I reboot the hosts, then the cluster will form ok on next reboot. Maybe there's a lock file or something somewhere....
melvyn burnard
Honored Contributor

Re: RHEL3AS u3 and SG cmrunnode failure

If only one node is available at cluster start time, i.e. one node is down and the other has rebooted and is trying to start the cluster, this will not work.
Serviceguard REQUIRES 100% of the nodes configured in the cluster to be available at the time the cluster is trying to form to be able to work.
This is by design!
If you do wish to start the cluster after booting only one node, then you must manually intervene and do:
cmruncl -n


And bonding is supported within SG/Linux provided you do not use the load-balancing mode. It must be set to use failover mode in the ifcfg-bond0 file.
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
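
As a rough aid for checking the bonding point above: in the Linux bonding driver, failover corresponds to the active-backup mode (mode 1), and a missing mode option means the round-robin default. The sketch below assumes the bonding options are available as a plain string such as "mode=1 miimon=100"; where that string lives differs between distributions and releases, so the example values are illustrative only.

```python
from typing import Optional

# Names the Linux bonding driver accepts for failover (active-backup).
FAILOVER_MODES = {"1", "active-backup"}


def bonding_mode(options: str) -> Optional[str]:
    """Return the value of the 'mode' option, or None if it is not set."""
    for token in options.split():
        key, sep, value = token.partition("=")
        if key == "mode" and sep:
            return value
    return None


def is_failover_mode(options: str) -> bool:
    """True only if the options explicitly select active-backup.

    With no mode option the driver falls back to round-robin
    load balancing, so that case is reported as not failover.
    """
    return bonding_mode(options) in FAILOVER_MODES


if __name__ == "__main__":
    for opts in ("mode=1 miimon=100", "mode=0 miimon=100", "miimon=100"):
        print(opts, "->", is_failover_mode(opts))
```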
Timo J
Frequent Advisor

Re: RHEL3AS u3 and SG cmrunnode failure

Melvyn:

Linux ServiceGuard Manual says:
"Although a cluster quorum of more than 50% is generally required,
exactly 50% of the previously running nodes may re-form as a new
cluster provided that the other 50% of the previously running nodes do
not also re-form. This is guaranteed by the use of a tie-breaker to choose
between the two equal-sized node groups, allowing one group to form the
cluster and forcing the other group to shut down. This tie-breaker is
known as a cluster lock. The cluster lock is implemented either by
means of a lock LUN or a quorum server. A cluster lock is required on
two-node clusters."

So I doubt that it's impossible to start a one-node cluster.

Here again is a simple test sequence that I tried:

- package on node AAA
- node BBB: shutdown -h now
- after node BBB has halted, node AAA: shutdown -r now
RESULT: the cluster won't start up automatically as a single-node cluster on node AAA.
Not even manually with the command cmrunnode. It can be started by issuing the command cmruncl -n AAA.
Node BBB joins the running cluster automatically ok after it's been powered up.
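
The quorum rule in the manual passage quoted above can be written down as a small decision function. This is a simplified model of the quoted text only, not Serviceguard's implementation: the function and parameter names are made up, and it covers re-formation among previously running nodes, which is the only case the passage describes.

```python
def can_reform(previously_running: int, survivors: int, holds_cluster_lock: bool) -> bool:
    """Decide whether a group of surviving nodes may re-form the cluster.

    - more than 50% of the previously running nodes: yes
    - exactly 50%: only the group that wins the cluster lock (tie-breaker)
    - less than 50%: no
    """
    if previously_running <= 0 or not 0 <= survivors <= previously_running:
        raise ValueError("survivors must be between 0 and previously_running")
    if 2 * survivors > previously_running:
        return True
    if 2 * survivors == previously_running:
        return holds_cluster_lock
    return False


if __name__ == "__main__":
    # Two-node cluster, one node fails: the survivor is exactly 50%,
    # so the outcome depends on the cluster lock.
    print(can_reform(2, 1, holds_cluster_lock=True))
    print(can_reform(2, 1, holds_cluster_lock=False))
```

In a two-node cluster a single survivor is always exactly 50%, which is why the manual requires a cluster lock there.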
melvyn burnard
Honored Contributor

Re: RHEL3AS u3 and SG cmrunnode failure

This is the correct behaviour of Serviceguard.
The section you have quoted is ONLY applicable when the cluster is running and you have a node (or nodes) failing. This is completely different.

As stated, you reboot node A with node B unavailable. Node A tries to form a cluster automatically, but because the quorum of 100% (which is MANDATORY) is not met with node B down, this fails after the default 10 minutes.
By then issuing a manual cmruncl -n nodeA, you will get your single-node cluster running, as you have seen, and nodeB will then join the running cluster once you reboot it.
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
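
The startup behaviour described in this reply can be summarised in a short sketch: automatic formation needs every configured node, otherwise an operator has to start the cluster by hand. The function name and the returned dictionary are made up for illustration, and nothing here runs a Serviceguard command; the only command string used is the cmruncl -n form quoted in the thread.

```python
from typing import Dict, Iterable


def plan_startup(configured_nodes: Iterable[str],
                 reachable_nodes: Iterable[str],
                 local_node: str) -> Dict[str, object]:
    """Model the cluster start rule from the thread.

    Automatic start forms a cluster only when 100% of the configured
    nodes are available. If any node is missing, the automatic attempt
    times out and the operator must start the cluster manually on the
    node that is up.
    """
    missing = sorted(set(configured_nodes) - set(reachable_nodes))
    if not missing:
        return {"autostart_forms_cluster": True, "missing": [], "manual_command": None}
    return {
        "autostart_forms_cluster": False,
        "missing": missing,
        # Built only as a suggestion for the operator; it is not executed.
        "manual_command": ["cmruncl", "-n", local_node],
    }


if __name__ == "__main__":
    plan = plan_startup(["hostA", "hostB"], ["hostA"], "hostA")
    print(plan["missing"], plan["manual_command"])
```

Returning the command instead of running it keeps the manual step visible, which matches the point that starting with fewer than all nodes is meant to be a deliberate operator decision.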
Timo J
Frequent Advisor

Re: RHEL3AS u3 and SG cmrunnode failure

OK, I'm starting to believe you ;) So it's not possible without manual intervention, at least in ServiceGuard for Linux. I don't have HP-UX ServiceGuard available right now to test that same situation, but I remember that in HP-UX it doesn't need manual intervention. I might also be wrong about that.
melvyn burnard
Honored Contributor

Re: RHEL3AS u3 and SG cmrunnode failure

Serviceguard is designed to do the same thing whether it is on HP-UX or Linux, so you will get the same result.
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!