Tru64 4.0F cluster - tiebreaker service shows unassigned

Ninad_1
Honored Contributor


Hello all,

I have a cluster of 2 ES40s running Tru64 4.0F with TruCluster Production Server 1.6. The 2nd cluster member was down for a few days for some reason. Today I tried to boot it with the 1st member down, but after booting, the system shows Suspended: Yes in cnxshow.
It also does not start the tiebreaker service or the drd service. The attached daemon.log file shows errors such as "can't unreserve disk" and "fru parse error", which I am unable to understand. Please help me and guide me to solve the problem.

Thanks in advance

Ninad
8 REPLIES
Michael Schulte zur Sur
Honored Contributor

Re: Tru64 4.0F cluster - tiebreaker service shows unassigned

Hi Ninad,

***ALERT: possible device failure: /dev/rzc17g

check that disk.

greetings,

Michael
Ralf Puchner
Honored Contributor

Re: Tru64 4.0F cluster - tiebreaker service shows unassigned

In an ASE Production Server setup, a tiebreaker disk acts like a quorum disk to prevent cluster partitioning if there are more than 2 cluster members.

Troubleshooting steps:
1. Set the loglevel to informational.
2. Shut down and switch off all members and the storage.
3. Boot the 1st member with the storage and tail -f its daemon.log.
4. Boot the 2nd member and tail -f its daemon.log.
5. Verify that everything works and is accessible.
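
While following daemon.log in steps 3 and 4, it can help to filter for the error strings already quoted in this thread. This is only a minimal sketch: scan_daemon_log is a made-up helper name, the pattern list is taken from the messages mentioned here and is certainly not complete, and the log path is passed in because it depends on your syslog setup.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of a daemon.log that match the
# error strings quoted in this thread. The pattern list is an assumption;
# extend it with whatever your own log shows.
scan_daemon_log() {
    # $1 = path to the daemon.log file to scan
    grep -i -E "ALERT|can't unreserve|fru parse error" "$1"
}
```

Call it as scan_daemon_log /path/to/daemon.log after each member has booted. It exits non-zero when nothing matches, so it can also be used in an if test.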

By the way, set ASE_PARTIAL_MIRRORING to ON so that LSM starts even if one of the plexes is gone; the service should stay highly available if LSM partially fails.

Without this it is not clear whether a partitioning happened or a storage problem led to the messages.
Help() { FirstReadManual(urgently); Go_to_it;; }
Johan Brusche
Honored Contributor

Re: Tru64 4.0F cluster - tiebreaker service shows unassigned


Ninad,

There seems to be a SCSI-bus or disk drive problem, with the consequence that a plex (or plexes) of the LSM mirror sets is not online.

This inhibits the startup of the involved ASE service, since you have ASE_PARTIAL_MIRROR=OFF.

If you know that the other half of the LSM mirror(s) is OK (i.e. has good data), then try booting into single-user mode first and run:

    mount -u /
    bcheckrc
    rcmgr set ASE_PARTIAL_MIRROR ON
    init 3

With this setting, the ASE service should be able to come online.
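
To confirm what is currently stored before and after changing it, rcmgr get ASE_PARTIAL_MIRROR on the member itself should print the value. If you only have a copy of the file to look at (for example from a backup), something like the sketch below can read it. rc_value is a made-up helper, it assumes the usual NAME="value" line format of /etc/rc.config, and the variable name is used as written above; another reply in this thread spells it ASE_PARTIAL_MIRRORING, so check which one your system actually has.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one variable from an
# rc.config-style file (lines of the form NAME="value").
# If the variable is set more than once, the last assignment wins.
rc_value() {
    # $1 = variable name, $2 = file to read (defaults to /etc/rc.config)
    awk -F= -v name="$1" '
        $1 == name { v = $2; gsub(/"/, "", v); print v }
    ' "${2:-/etc/rc.config}" | tail -n 1
}
```

For example, rc_value ASE_PARTIAL_MIRROR /tmp/rc.config.copy prints ON or OFF, or nothing if the variable is not set in that file.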

Good luck,
Johan.

_JB_
Ninad_1
Honored Contributor

Re: Tru64 4.0F cluster - tiebreaker service shows unassigned

Hello

Michael Schulte
>***ALERT: possible device failure: /dev/rzc17g
>check that disk.

Let me explain in a bit more detail.
We have an RA8000 box connected via a fibre-optic SAN switch. The RA8000 has a RAID set with 2 partitions, presented as units D1 and D2.
These partitions are available to the system as /dev/rzb17 and /dev/rzc17 respectively.
The disk rzb17 is used as the tiebreaker disk, on partition rzb17a.
The disk rzc17 is used by LSM: rzc17g is LSMpubl, rzc17h is LSMpriv and has the database.

Everything works fine on ServerI: both disks are available and working, and there is no device failure for /dev/rzc17g.

ServerII had been down because of a power supply problem. Some days back the power was OK again, but the customer wanted to install 833 MHz CPUs (from an ES40 in another setup) in place of the existing 500 MHz CPUs, to check performance and compatibility for a possible upgrade. So we installed the 833 MHz CPUs, upgraded the ServerII firmware to version 6.4 (it was 5.6), and booted ServerII after shutting down ServerI as a precaution. It booted without errors, so without doing any other tests (mounting the database, checking applications, etc.) we reverted to the original configuration, i.e. reinstalled the four 500 MHz CPUs.
A few days back we wanted to synchronize the home directory data on ServerII with that of ServerI, so we took a backup from ServerI and restored the /home2 and /home4 directories on ServerII. When we booted ServerII that time, we got some errors for emx, and scu> show edt showed that 2 devices visible from ServerI were missing. These were the same 2 devices, rzb17 and rzc17.
We then found that in the HSG80 configuration, access to units D1 and D2 was restricted to particular connections, and that after the firmware upgrade 2 new connections had appeared in the HSG80 show connections output (these were the connections for ServerII through the SAN switch). So yesterday we gave the 2 new connections access to units D1 and D2. We then booted ServerII and could see the devices in scu> show edt. But ServerII still gives the errors: the tiebreaker and drd services show unassigned, and cnxshow on ServerII shows Suspended: Yes.
We then shut down ServerII again, downgraded the firmware to 5.6, and booted it again. The devices are still visible in scu> show edt, but the errors persist.
We tried to mount the tiebreaker disk manually and it mounted without any errors. The tiebreaker disk is UFS.
We really cannot understand why this problem has occurred or what the solution is.



Ralf Puchner

>In an ase production server setup a tiebreaker disk acts like a quorum disk to prevent cluster partitioning if more than 2 cluster members.

>Troubleshooting issues:
>1. set loglevel to informational
>2. shutdown and switch off all members and storage
>3. boot 1st member with storage, tail -f to daemon.log
>4. boot 2nd member, tail -f to daemon.log
>5. verify proper work and access.


When we boot ServerI with storage, it boots and works fine. After ServerII boots, it is added to the cluster, but cnxshow shows only an entry for itself and shows Suspended: Yes.

punetelecom2# cnxshow

Cluster View from mc02

Director: mc02 Suspended: Yes

Node monitor using tie-breaking disk: /dev/rrzb17a

Hostname       Cluster I/F   CS_ID       Incarnation        Comm Okay   Member
-----------------------------------------------------------------------------
punetelecom2   mc02          0001,0001   000000000000db00   Yes         ?
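
For checking this from a script after each boot, the Suspended field can be pulled out of the cnxshow output with a small filter. This is only a sketch: cnx_suspended is a made-up name, and it assumes the output layout shown above, with the value following the "Suspended:" label on the same line.

```shell
#!/bin/sh
# Hypothetical helper: read cnxshow output on stdin and print the value
# that follows the "Suspended:" label (e.g. Yes or No).
# Assumes the layout pasted above; prints nothing if the label is absent.
cnx_suspended() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "Suspended:") { print $(i + 1); exit }
    }'
}
```

Usage on a member would be cnxshow | cnx_suspended, which for the output above prints Yes.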

>btw. set ASE_PARTIAL_MIRRORING to on to be sure lsm starts if one of the plexes are gone and service should be higly available if lsm partial fails.

>Without this it is not clear if a partitioning happened or a storage problem leads to the messages....

We do not have mirroring in LSM, so do we need to set ASE_PARTIAL_MIRRORING to ON?

Johan Brusche

>There seems to be a SCSI-bus or disk drive problem, with as consequence a plex (or plexes) of LSM mirror sets not being online.
>This inhibits the startup of the involved ASE service, since you have ASE_PARTIAL_MIRROR=OFF.
>If you know that the other half of the LSM mirror(s) is OK (ie good data), then try to
>boot first into single user mode, do a
>mount -u /
>bcheckrc
>and
>rcmgr set ASE_PARTIAL_MIRROR ON
>init 3

>With this setting the ASE service, should be able to come on-line.

We do not have mirroring in LSM, so do we need to set ASE_PARTIAL_MIRRORING to ON?
Ralf Puchner
Honored Contributor

Re: Tru64 4.0F cluster - tiebreaker service shows unassigned

As I read your problem description, you restored content from server I to server II and the problem started after rebooting server II, right?

If so, there seems to be some footprint belonging to server I on server II that prevents it from joining the cluster. Solution in this case: reinstall server II from a server II backup or from scratch, which is 3-5 hours of work.

Help() { FirstReadManual(urgently); Go_to_it;; }
Ninad_1
Honored Contributor

Re: Tru64 4.0F cluster - tiebreaker service shows unassigned

Hello,

Ralf - I have not restored anything that has a footprint of ServerI. I have restored only the users' home directories, which hold the updated programs and info; these are the same on both servers and we do this often, so that is not the source of the problem. I mentioned it only as additional info, and it is not actually related to the problem.

Regards
Ninad
Ralf Puchner
Honored Contributor

Re: Tru64 4.0F cluster - tiebreaker service shows unassigned

OK, that was not clear from the information given. By the way, have you set logtype informational and started the members as requested?
Help() { FirstReadManual(urgently); Go_to_it;; }
Johan Brusche
Honored Contributor

Re: Tru64 4.0F cluster - tiebreaker service shows unassigned


With Tru64 V4.xx the HSG80 had to be configured in transparent failover and SCSI-2 mode (not SCSI-3). Is that still OK after its firmware upgrade?

Johan.

_JB_