Re: Problem running only one node

SnIphe · ‎05-23-2007

Hi all,

I have an SG cluster on two DL580 servers with an MSA1000, using QLogic HBAs.

The cluster works perfectly: all the switchovers work, we have run the whole battery of tests, and everything works.

But when both nodes are down and I switch on only one node, everything starts well. The cluster waits for one minute, because I set AUTO_START_TIMEOUT to one minute in the configuration file.
After that, once I am past the login screen, I run cmviewcl and find the cluster in the "unknown" state.
If I try to run cmruncl, it says that the cluster is waiting for the other node to start.

Is there another parameter that tells the cluster not to wait for the other node?
If the cluster and both nodes are up and running and I switch off one node, everything works OK, so I don't know whether there is a lock LUN problem. I don't think so.

Thanks a lot once again.

melvyn burnard · ‎05-24-2007

This is expected behaviour. When the cluster starts, it requires 100% of the nodes to be available.
If you wish to start it on just one node, you will need to wait for the autostart interval to run out, then run:
cmruncl -n <nodename>
on the node you wish to run as a single-node cluster.
See: man cmruncl
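
A minimal sketch of that manual step as a shell helper. The function names, the node name node1 and the "yes" confirmation argument are assumptions made up for illustration; none of them is part of Serviceguard. The helper only prints the cmruncl command line, and it refuses unless someone has confirmed by hand that the other node is really down.

```shell
#!/bin/sh
# Sketch only. The function names, the node name "node1" and the "yes"
# confirmation argument are illustrative assumptions, not Serviceguard features.

# Print the command that forms the cluster on a single named node.
single_node_start_cmd() {
    if [ -z "$1" ]; then
        echo "usage: single_node_start_cmd <nodename>" >&2
        return 1
    fi
    echo "cmruncl -v -n $1"
}

# Print the command only if an operator has confirmed the peer is down.
start_single_node() {
    if [ "$2" != "yes" ]; then
        echo "refusing: nobody has confirmed that the other node is down" >&2
        return 1
    fi
    single_node_start_cmd "$1"
}

# Example: prints the command line for node1 after manual confirmation.
start_single_node node1 yes
```

Printing instead of executing keeps the decision with the operator, who would copy the command to the console only after checking the other node.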

My house is the bank's, my money the wife's, But my opinions belong to me, not HP!

SnIphe · ‎05-24-2007

OK, thanks for the reply.

Now I want to know whether there is some script that I can put in the init level to run the cmruncl -n command automatically if the other node is down.
This cluster is going to an unmanned server farm, and I need to take control of the cluster if the power fails and only one node starts.

I'm sure there is some script that can solve this problem.

Thanks a lot.

SnIphe · ‎05-28-2007

What if I put something like this in the cmcluster.init script?

At this point:
#
# Check to see if the daemon is already running
#
findproc cmcld
if [ "$pid" = "" ]
then
#
# The daemon isn't running already
#
+ isnodeup ingrids2
+ if [ "$node_status" = "down" ]
+ then
+ action "Node ingrids2 is down, starting the cluster with node ingrids1 only"
+ ${SGSBIN}/cmruncl -v -f -n ingrids1
+ exit 0
+ fi

if [ -f ${SGSBIN}/cmrunnode ]
then
#
# Attempt to join the cluster
#

The lines beginning with a + mark are the ones I added.

That is, on this node I check whether node "ingrids2" is up, and on the other node I check for "ingrids1".

Could this cause any problems?

Thanks a lot.

Matti_Kurkela · ‎05-28-2007

There is a reason why the standard cmcluster.init script does not work like you're suggesting.

The "cmruncl -n" is intended to be used only when it is *absolutely certain* that the other node is not running.

ServiceGuard cannot tell these two situations apart:
1.) one node is being started while the other node has lost power or has failed in some other way

2.) both nodes are actually starting at the same time, but the network connections between them have completely failed, i.e. the heartbeat of each node has no way of reaching the other node.

In situation 1), you can start the cluster using one node and have the other node join the cluster later when its problems have been fixed.

In situation 2), one or the other node *MUST NOT BE ALLOWED TO START*, since both nodes would assume the other one has failed, would mount the shared disks and start the applications. If two nodes use the same filesystem simultaneously without knowledge of each other, the result is *CERTAIN FILESYSTEM CORRUPTION*.
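
The two situations can be put into a toy model to show why they look identical from the starting node. This is only a sketch: observe_peer and its "up/down" and "ok/failed" arguments are hypothetical names for illustration, not anything ServiceGuard provides.

```shell
#!/bin/sh
# Toy model, not ServiceGuard code. All names here are hypothetical.
# A starting node can observe only one thing about its peer:
# whether heartbeats arrive or not.
#   $1 = real state of the peer   (up | down)
#   $2 = state of the network     (ok | failed)
observe_peer() {
    if [ "$1" = "up" ] && [ "$2" = "ok" ]; then
        echo "heartbeat"
    else
        echo "no_heartbeat"
    fi
}

s1=$(observe_peer down ok)      # situation 1: the peer has lost power
s2=$(observe_peer up failed)    # situation 2: the peer is up, the network is broken

echo "situation 1 looks like: $s1"
echo "situation 2 looks like: $s2"
```

Both calls yield the same observation, so a script that sees only "no heartbeat" has no basis for deciding that the peer is down.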

The idea behind the cluster lock is that whenever the 2-node cluster loses the heartbeat connections, only one node may continue processing while the other does a hard reboot (to stop the use of the package resources *instantly*) and stops in ServiceGuard startup phase to wait until the network connections are restored. The waiting node will not touch the package resources, because it must assume the other node is using them. Your change would allow the rebooting node to avoid this wait and just blindly assume the other node is down - the problem is that your script *cannot know that* for sure.
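
The tie-breaking idea can be illustrated with a generic atomic lock. This sketch uses mkdir, which either creates the directory or fails, as a stand-in for the cluster lock. It is an analogy under that assumption, not how ServiceGuard implements its lock, and LOCK_DIR and the node names are hypothetical.

```shell
#!/bin/sh
# Analogy only: mkdir stands in for the cluster lock. LOCK_DIR and the node
# names are hypothetical; the real cluster lock lives on shared storage.
LOCK_DIR="${TMPDIR:-/tmp}/demo_cluster_lock.$$"

# The first caller creates the directory and wins; every later caller fails.
try_lock() {
    if mkdir "$LOCK_DIR" 2>/dev/null; then
        echo "$1: got the lock, keeps running the packages"
        return 0
    fi
    echo "$1: lost the lock, must stop and wait for the heartbeat to return"
    return 1
}

# Both nodes lose the heartbeat and race for the lock; only one can win.
try_lock node1
try_lock node2 || true   # losing is the expected outcome here
rmdir "$LOCK_DIR"        # clean up the demo lock
```

A start script that skips the wait is the equivalent of the losing node ignoring the result of try_lock and carrying on anyway.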

If your servers are located in an unmanned server farm, you should implement remote consoles. Which generation is your DL580? If it's reasonably modern, it should have iLO remote console functionality built-in. You can even control the server's power switch through iLO.

You could also use Wake-on-LAN to restart your servers after a power interruption, but in my opinion, WOL is a poor substitute for a real remote console.

MK

melvyn burnard · ‎05-29-2007

If you make a change to the init script, you are then in an unsupported state.
Worse, you leave an opening for possible corruption of your data.

SG is working the way it was designed.
Meddle with this at your peril

SnIphe · ‎05-29-2007

Hi all,

Finally, the installation is finished.
The customer accepts that this is the correct behaviour of SG.

I agree with you: SG for Linux works exactly this way, and no other way.
SG is designed to tolerate N-1 failures. So SG is not designed to handle both nodes going down at the same time and then one node staying broken.

Thanks a lot for everything.

melvyn burnard · ‎05-29-2007

I would recommend you contact your local HP Education office and look at attending a Serviceguard course. This would answer a lot of your questions.
