cluster node goes down

Ninad_1
Honored Contributor

Hello,

We have a 2-node cluster of two ES40s connected by a Memory Channel cable, running Tru64 UNIX version 4.0F and TruCluster 1.6. Server 1 runs all the applications and services, whereas Server 2 is the standby; all services will be transferred to Server 2 if Server 1 fails. We have a tie-breaker disk, and the tie-breaker service also runs on Server 1 on startup.

The problem we are facing is that after we boot the two systems, Server 2 goes down at an unpredictable time, anywhere from a few hours to 1 or 2 days later. If Server 2 has not gone down after 2-3 days, it stays in the cluster until we shut it down. Sometimes, after booting both servers, Server 2 does not go down at all. This problem started 6-7 months ago; before that it had never occurred. With only one or two exceptions, Server 2 has gone down at night, after 20:00 or before 07:00. This may just be a coincidence, but I thought the observation was worth mentioning.

It seems that Server 2 is not able to communicate with Server 1, so each server thinks the other is down and tries to become the cluster manager. Since Server 1 already holds the tie-breaker disk, Server 2 cannot get access to it, and hence Server 2 shuts down and drops out of the cluster. This is all we have concluded from the various logs and some manuals, but we are unable to understand the reason behind the problem.
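
The sequence described above can be sketched as a toy model. This is illustrative Python only, not TruCluster code: the `Member` class and `resolve_partition` function are hypothetical names, and the rule "the member holding the tie-breaker disk survives" is simply the conclusion stated in this post, not a verified description of the cluster software.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """Hypothetical stand-in for one cluster member."""
    name: str
    holds_tiebreaker: bool
    in_cluster: bool = True


def resolve_partition(members):
    """Model a loss of communication between members.

    Assumption taken from the post: only the member that already
    holds the tie-breaker disk stays in the cluster.
    """
    for m in members:
        m.in_cluster = m.holds_tiebreaker
    return [m.name for m in members if m.in_cluster]


server1 = Member("Server 1", holds_tiebreaker=True)
server2 = Member("Server 2", holds_tiebreaker=False)

survivors = resolve_partition([server1, server2])
print(survivors)
```

In this model the outcome is decided entirely by which member holds the tie-breaker at the moment communication is lost, which matches Server 2 always being the one to leave.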
I am attaching extracts from the following log files:
1) daemon.log, kern.log from Server 1
2) daemon.log, kern.log from Server 2
3) messages file from Server 2
4) binary.errlog file (output of uerf -R) from Server 2

If anyone can help us reach a solution, we would be grateful.


Thanks in advance

Ninad
1 REPLY
Ralf Puchner
Honored Contributor

Re: cluster node goes down

According to the log files, a network partition occurs: mc2 goes down and member 1 cannot ping member 2 over this interface, which leads to a split cluster.

Please check the Memory Channel (mc_diag, mc_cable), and check whether host routes for all interfaces exist in /etc/routes, so that each member can reach the other over the remaining interfaces.

Is any problem reported by clu_ivp -v on each member?

Has a new machine been added to the network since the problem first occurred, or is a backup job or other heavy network load running at the times in question?
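
To compare the outages with backup or other scheduled jobs, it may help to bucket the times at which Server 2 left the cluster against the 20:00 to 07:00 window mentioned in the original post. A minimal sketch in Python, assuming the timestamps are copied by hand from the logs; the function names are hypothetical and the sample values are placeholders, not real data from this cluster.

```python
from datetime import datetime, time

# Night window as described in the original post: after 20:00 or before 07:00.
NIGHT_START = time(20, 0)
NIGHT_END = time(7, 0)


def in_night_window(ts):
    """Return True if the timestamp falls at or after 20:00, or before 07:00."""
    t = ts.time()
    return t >= NIGHT_START or t < NIGHT_END


def split_by_window(timestamps):
    """Split timestamps into (night, day) lists."""
    night = [ts for ts in timestamps if in_night_window(ts)]
    day = [ts for ts in timestamps if not in_night_window(ts)]
    return night, day


# Placeholder values only; replace with the real outage times from the logs.
example = [
    datetime(2000, 1, 1, 22, 30),
    datetime(2000, 1, 2, 3, 15),
    datetime(2000, 1, 3, 14, 0),
]

night, day = split_by_window(example)
print(len(night), "night outages,", len(day), "daytime outages")
```

If nearly all real outages land in the night list, the next step would be to line them up against the start times of backup and batch jobs on both members.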

Please provide the output of asemgr -d -c.

Help() { FirstReadManual(urgently); Go_to_it;; }