Samuel Scott
Frequent Advisor

ServiceGuard / cells

I have a 2-node SG cluster. One of the nodes is a 2-cell partition (cell 0 and cell 1).

On cell 0 I have:

1 Dedicated Heartbeat Interface
1 Primary Interface (also acts as a Heartbeat)

on Cell 1, I have:

1 Standby Interface

My question: If Cell 0 were removed or failed, would the partition be able to rejoin the cluster in the present configuration?


If I move the dedicated Heartbeat Interface from Cell 0 to Cell 1, would this configuration allow the partition to rejoin the cluster on reboot if cell 0 were removed or failed?

Can I make these network changes to the partition (moving heartbeat from cell 0 to 1) by removing the node from the cluster, making the network change, then adding the node back into the cluster, or do I have to halt the cluster and reapply the changes (cmapplyconf)?
Stephen Doud
Honored Contributor

Re: ServiceGuard / cells

If both cells are partitioned as one server and half of the server fails (1 cell), I'd think an HPMC (hardware failure triggered) kernel reboot would occur. The resulting server may be missing network components, which may not permit it to rejoin the cluster. At that point, it is possible to remove the node from the cluster online. However, if it doesn't have a matching set of heartbeat or data networks to the running partner node, it cannot be re-added to the cluster. Only the lack of standby NICs will permit the re-addition of the node to the cluster, and that is a guarded statement, based on the total number of HB networks.
Samuel Scott
Frequent Advisor

Re: ServiceGuard / cells

Thanks, Stephen, for the reply. I agree that if one of the cells fails, the system would indeed reboot, and with the current configuration I'm not sure it would be able to rejoin the cluster.

What if I move the dedicated Heartbeat Interface from Cell 0 to Cell 1? Would this configuration allow the partition to rejoin the cluster on reboot if cell 0 were removed or failed?

And if I moved the dedicated HB from 0 to 1, would I have to cmapplyconf or could I just re-add the node back into the cluster after the change?
Stephen Doud
Honored Contributor

Re: ServiceGuard / cells

NICs are identified in the cluster binary by path, so moving a NIC to another cell would certainly require an update to the cluster binary file. Under restricted circumstances dictated in the latest Managing Serviceguard manual (for A.11.18), online network changes are "possible". If yours runs A.11.17, updating the cluster binary file would require cluster downtime. So long as the same STATIONARY_IP or HEARTBEAT_IP references can be supported by the remaining cell, you should be able to remove the node and re-add it to the cluster.
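
To illustrate why moving a NIC to another cell invalidates what the cluster binary recorded, here is a minimal Python sketch. It is not Serviceguard code: the interface labels and hardware paths are hypothetical, and it assumes that the leading field of a hardware path is the cell number.

```python
# Hypothetical interface labels and hardware paths, for illustration only.
# Assumption: the first field of a path is the cell number.
configured = {
    "primary": "0/0/1/1/0",
    "dedicated_hb": "0/0/2/1/0",
    "standby": "1/0/1/1/0",
}

# The same interfaces after the dedicated heartbeat card is moved to cell 1.
discovered = {
    "primary": "0/0/1/1/0",
    "dedicated_hb": "1/0/2/1/0",
    "standby": "1/0/1/1/0",
}


def cell_of(path):
    """Return the cell number encoded in the leading path field."""
    return int(path.split("/")[0])


def moved_interfaces(configured, discovered):
    """Return the labels whose recorded path no longer matches the hardware."""
    return sorted(
        name for name, path in configured.items()
        if discovered.get(name) != path
    )


for name in moved_interfaces(configured, discovered):
    old, new = configured[name], discovered[name]
    print(f"{name}: cell {cell_of(old)} -> cell {cell_of(new)}")
```

Any interface reported here has a path that differs from the recorded one, which is the situation that calls for an updated cluster configuration (cmapplyconf).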
Samuel Scott
Frequent Advisor

Re: ServiceGuard / cells

Thanks again Stephen. I appreciate your answer on the network changes. That's what I thought and thanks for verifying that for me. On the case of a node rejoining a SG cluster after cell board failure, I'm still not clear. I know you're busy and I don't want to monopolize your time, but let me present this in a different way. Any help you could offer would be appreciated.


Current configuration:
One partition with two cell boards.
Cell 0 Primary Interface (also acting as HeartBeat)
Cell 0 Dedicated Heartbeat Interface
Cell 1 Standby Interface


Would both/either of these scenarios allow this node to rejoin the cluster?

1st Scenario (Cell board 0 failure)
Note: One partition with two cell boards

Cell Board 0 (Failed)

Cell Board 1
Standby Interface
Dedicated Heartbeat

Will this rejoin the cluster?

2nd Scenario (Cell board 1 failure)
Note: One partition with two cell boards

Cell Board 0
Primary Interface (also acting as Heartbeat)

Cell Board 1 (Failed)

Will this rejoin the cluster?


I guess my question is this: what is required for a node to join a cluster? Is the dedicated Heartbeat or a Standby card required if it is listed in the configuration?
Matti_Kurkela
Honored Contributor

Re: ServiceGuard / cells

For a 2-cell partition to even boot after losing one of the cells, both cells must be core-capable and configured so that at least one mirror of the system disk is directly accessible to each cell. In older nPar servers, some cells may lack a Core IO card and thus not be core-capable.

As far as I've understood, as long as a partition still has a heartbeat interface and can access the network(s) required by the package(s), it will attempt to rejoin the cluster after a reboot.

But after losing a cell, a 2-cell partition has lost about one half of its CPU, memory and I/O resources. Do you really want such a crippled machine to automatically rejoin the cluster? Usually I wouldn't, not without doing at least some minimal testing to ensure that the machine is stable in its new reduced configuration.

There is probably some reason why you have set up your nPar machine into "1 node of 2 cells" configuration, as opposed to "2 nodes of 1 cell each". This same reason would limit the usefulness of the partition that has lost a cell.
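
The heartbeat part of the rejoin condition described above can be sketched in Python against the layouts discussed in this thread. This is only a rough model: the `Nic` type and the interface labels are hypothetical, it checks nothing but "does a heartbeat interface survive the cell failure", and it ignores both package network access and the checks Serviceguard itself performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nic:
    name: str        # hypothetical label, not a real lanN device name
    cell: int        # cell board that hosts the card
    heartbeat: bool  # True if the interface carries cluster heartbeat


def surviving(nics, failed_cell):
    """Interfaces still present after the given cell board is lost."""
    return [n for n in nics if n.cell != failed_cell]


def has_heartbeat_after(nics, failed_cell):
    """True if at least one heartbeat interface survives the cell failure."""
    return any(n.heartbeat for n in surviving(nics, failed_cell))


# Layout from the original post: both heartbeat carriers sit on cell 0.
current = [
    Nic("primary", 0, True),
    Nic("dedicated_hb", 0, True),
    Nic("standby", 1, False),
]

# Proposed layout: the dedicated heartbeat is moved to cell 1.
proposed = [
    Nic("primary", 0, True),
    Nic("dedicated_hb", 1, True),
    Nic("standby", 1, False),
]

for label, layout in (("current", current), ("proposed", proposed)):
    for cell in (0, 1):
        ok = has_heartbeat_after(layout, cell)
        print(f"{label}: cell {cell} fails, heartbeat left: {ok}")
```

In this model the current layout keeps no heartbeat interface when cell 0 fails, while the proposed layout keeps one for either cell failure. Whether the node is actually allowed back into the cluster still depends on the Serviceguard version and on the data networks.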

MK
Stephen Doud
Honored Contributor
Solution

Re: ServiceGuard / cells

Matti makes some good observations.

As to the technical feasibility of a node rejoining a cluster after a NIC failure, A.11.18 was enhanced to permit this when a standby NIC is available on the same bridged network.
It took 40 minutes and the help of 2 Expert Center engineers, but we located a publicly viewable statement discussing this topic, in the A.11.18 Release Notes, "Fixed in this version" section. See JAGaf46654 (SR8606386500) for details.
Samuel Scott
Frequent Advisor

Re: ServiceGuard / cells

Thanks Matti and Stephen. You have both been very helpful. I agree with you Matti and argued the point that when a cell is lost, the first priority is to fix the problem. That's why we have the failover node! The Engineer in charge seemed to want something like a failover within the cell, which may or may not be possible. I will relay the info you have given me. Thanks again.
Samuel Scott
Frequent Advisor

Re: ServiceGuard / cells


Info received and noted.