Cluster hangs when member that is CFS server dies

SOLVED
Occasional Advisor

Cluster hangs when member that is CFS server dies

Tru64 UNIX V5.1B PK 5, TruCluster V5.1B PK 5

I have a two-node cluster with a quorum disk, so it can survive as a single-node cluster if one member dies or is shut down.

All disks are on shared MSA1000 storage arrays and are accessible from both members.

If I shut down (using shutdown -h now) the cluster member that is the CFS server (verified with cfsmgr -e) for cluster_root, cluster_usr and cluster_var, the remaining cluster member hangs and never recovers. Messages appear on its console about other filesystems being recovered to it, but I never see the messages about recovering the cluster_root, cluster_usr and cluster_var filesystems.

If I shut down the member that is NOT the CFS server for root, usr and var, all is fine and the other member carries on happily.

This seems to be such a fundamental problem that I can't believe it hasn't been thought of or catered for by now.

Any ideas?
14 REPLIES
Honored Contributor

Re: Cluster hangs when member that is CFS server dies

You should be able to shut down any member and the recovery should occur.

What is the status of:

drdmgr dskN

For example:

View of Data from member esc as of 2006-03-23:08:56:11

Device Name: dsk3
Device Type: Direct Access IO Disk
Device Status: OK
Number of Servers: 2
Server Name: node1
Server State: Server
Server Name: node2
Server State: Server
Access Member Name: node1
Open Partition Mask: 0xc1 < a g h >
Statistics for Client Member: node1
Number of Read Operations: 1170229
Number of Write Operations: 9124428
Number of Bytes Read: 14322024448
Number of Bytes Written: 98324430848

Where dskN is the disk for the cluster_root, cluster_usr and cluster_var.

You may have this problem if a node is listed as "Not Server" instead of "Server". If it is listed as "Not Server", this is a persistent reservation problem.
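If there are many disks to check, the output can be scanned for anything other than "Server". Below is a minimal sketch in Python, assuming the output has the "Key: value" layout of the example above; the function names and the sample text (a shortened copy of the example with node2 changed to "Not Server") are made up for illustration, and real drdmgr output may differ in spacing or fields.

```python
def server_states(drdmgr_text):
    """Parse drdmgr-style text into {device: [(server_name, state), ...]}."""
    devices = {}
    device = None
    pending_name = None
    for raw in drdmgr_text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # skip blank lines and lines without "Key: value"
        key, value = key.strip(), value.strip()
        if key == "Device Name":
            device = value
            devices[device] = []
        elif key == "Server Name" and device is not None:
            pending_name = value
        elif key == "Server State" and device is not None and pending_name is not None:
            devices[device].append((pending_name, value))
            pending_name = None
    return devices


def not_serving(drdmgr_text):
    """List (device, member, state) for every member whose state is not 'Server'."""
    return [
        (dev, name, state)
        for dev, servers in server_states(drdmgr_text).items()
        for name, state in servers
        if state != "Server"
    ]


# Hypothetical sample: the example above, shortened, with node2 altered.
SAMPLE = """\
View of Data from member esc as of 2006-03-23:08:56:11

Device Name: dsk3
Device Status: OK
Number of Servers: 2
Server Name: node1
Server State: Server
Server Name: node2
Server State: Not Server
"""

if __name__ == "__main__":
    for dev, name, state in not_serving(SAMPLE):
        print(f"{dev}: {name} is '{state}'")
```

An empty result means every listed member is serving the device, which is what you want to see for the disks holding cluster_root, cluster_usr and cluster_var.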
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Honored Contributor

Re: Cluster hangs when member that is CFS server dies

Sounds like the second member cannot get to the quorum disk for a tiebreaker vote to avoid cluster partitioning.

fwiw,
Hein.
Honored Contributor

Re: Cluster hangs when member that is CFS server dies

Does the same thing happen if you shut down the CFS server with "# init 0" and with "# shutdown -n now"?
What additional patches did you install?
In vino veritas, in VMS cluster
Occasional Advisor

Re: Cluster hangs when member that is CFS server dies

All the disks are showing as having both cluster members as servers from drdmgr so that's not the problem.

Surely if there are quorum problems I should get messages about loss of quorum and suspension of cluster activities which I don't.

I don't believe there is a problem with the quorum disk as I can make the problem happen whichever member is the CFS server for root, usr and var, i.e. I can relocate the CFS server for these domains and the problem follows the member that is the CFS server.

Also I have had one occurrence of the problem where I left the hung member overnight and it did actually recover successfully after about 5 hours - without the other member being rebooted.

Also if I reboot the member that was shutdown then it hangs during the boot sequence just after it has joined the cluster and in fact gets a CFS error 9 when trying to access the cluster_root disk.
Honored Contributor

Re: Cluster hangs when member that is CFS server dies

Are you using memory channel or lan interconnect?
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Occasional Advisor

Re: Cluster hangs when member that is CFS server dies

Same problem occurs with init 0 as well.

No additional patches over and above Patch Kit 5 have been installed - are there any relevant ones available?

We are using LAN interconnect via a crossover cable. We have also tried a switch but it makes no difference to the problem.
Honored Contributor

Re: Cluster hangs when member that is CFS server dies

I would troubleshoot at the LAN interconnect level. Set the network adapter speeds manually at the console level and in rc.config. 100 FD should be enough for testing. Check the LAN interconnect best practices white paper (I don't remember the exact name, but there is a white paper with tuning recommendations).
Por que hacerlo dificil si es posible hacerlo facil? - Why do it the hard way, when you can do it the easy way?
Honored Contributor

Re: Cluster hangs when member that is CFS server dies

I do not think it is a LAN interconnect problem, nor a quorum disk problem.
Try to install this patch:
ftp://ftp.itrc.hp.com/tru64_patches/tru64/5.1X/T64KIT0025601-V51BB26-E-20050513.tar
It may not help, but you can try.
It could be a time-out value or kernel parameter problem.
In vino veritas, in VMS cluster
Honored Contributor

Re: Cluster hangs when member that is CFS server dies


Note that rc.config has no influence on the speed/fdx of the LAN-interface used by ics0.

For completeness, check whether the MSA entries in /etc/ddr.dbase are up to date.
A good entry is attached to this reply.

Johan.

_JB_
Occasional Advisor

Re: Cluster hangs when member that is CFS server dies

The network cards in use are all DE602s. It was my understanding that the driver doesn't take any notice of the console level settings for these cards.

Also we have lan_config settings in /etc/inet.local to explicitly set them to the desired configuration (100 Mbit, FD).

The MSA entry in /etc/ddr.dbase matches the one attached above.

I have got the patch mentioned previously and will install it and give it a try.
Honored Contributor

Re: Cluster hangs when member that is CFS server dies

When both nodes are up, can you try manual relocation of the filesystem using 'cfsmgr' to see if that succeeds?
Valued Contributor

Re: Cluster hangs when member that is CFS server dies

Hi Martin,

This is a known issue and you should ask your HP services representative for the following CSP:

- TCRKIT1000058-V51BB26-20051027 which includes:

> Patch C 01362.02 - Fixes a deadlock issue during cluster root failover
and
> Patch C 00316.02 - Fixes a deadlock issue during cluster root failover

These patches fix a deadlock that can happen during failover of the cluster root domain.

Hope this will help you.

Kind regards,

VINCENT Jean-Marc
HP France
Groupe Support Unix
Tru64(tm) UNIX Technical Consultant - Tru64(tm) UNIX Ambassador HP France
+33 1 5762-8861
jean-marc.vincent@hp.com
Occasional Advisor

Re: Cluster hangs when member that is CFS server dies

Jean-Marc,

Many thanks for your reply. We got in touch with our local HP support and passed on your information. They replied saying that a workaround is to disable vfast on the cluster_root, cluster_usr and cluster_var domains.

We have tried this and it does indeed fix the problem.

Thanks again.
Honored Contributor

Re: Cluster hangs when member that is CFS server dies

Thank you all for sharing this information!
Both the pointer to the patch and the vfast workaround should be valuable for others.

Regards,
Hein
