recovering from disk failure RAID 0 TruCluster LSM/AdvFS

Jane W.
Occasional Advisor

We have an old two-node TruCluster (it used to be four nodes, but part of the hardware died two years ago). Each node has a local data store of 9 disks in RAID 0: three disks in the ES45 and six in an external shelf. For reasons no one could ever explain to me, each of these was set up as a 9-disk LSM volume in its own disk group; each volume was then used to create an AdvFS file domain, and one AdvFS fileset was created in each of those domains. All was fine for several years. While I was away, a disk failed in one of those LSM volumes and, without asking, someone tried to recover using a procedure we had that worked "some of the time". Unfortunately, that person did not keep track of the commands they issued, and they also rebooted the cluster several times. At this point I can tell that they did replace the failed disk and ran commands like the following, but I cannot sort out what they did before replacing the disk:
# hwmgr -v d
# hwmgr -show scsi
# volprint -ht -g hwste01_dg
# umount /hwste01_data1
(they claim the fileset was already unmounted)
# voldg -g hwste01_dg -k rmdisk dsk7
# voldisk rm disk7
# hwmgr -scan scsi
# hwmgr -view device
# dsfmgr -m dsk77 dsk7
This did not work, so they tried
# hwmgr -delete scsi -did 77
(77 was the failed disk's HWID), but they said the command failed, so they ran
# hwmgr -delete component -id 77
# dsfmgr -m dsk77 dsk7
and this worked.
# dsfmgr -vI
# disklabel -r dsk7
disk is unlabelled
# disklabel -wrn dsk7c
Tried to use voldg to add the new disk;
they aren't sure what they used.
They read back over the procedure and realized they had not done
# rmfdmn hwste01_data1_domain
so they tried it, but never got a confirmation prompt, not even after several hours.

At this point it seems that they did some fiddling in /etc/fdmns, such as removing the lock file for the domain.
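Much of the difficulty above comes from having no record of what was run. A minimal sketch of a wrapper that records each command and its exit status before any further recovery attempt, assuming a POSIX-style shell such as ksh; the names `logged` and `RECOVERY_LOG` and the default log path are hypothetical, not part of Tru64:

```shell
# Hypothetical log location; override by setting RECOVERY_LOG beforehand.
RECOVERY_LOG=${RECOVERY_LOG:-/tmp/recovery.log}

# logged COMMAND [ARGS...]
# Appends a timestamped copy of the command line to the log, runs the
# command, then appends its exit status and returns that same status.
logged() {
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$RECOVERY_LOG"
    "$@"
    status=$?
    printf '    exit status: %s\n' "$status" >> "$RECOVERY_LOG"
    return "$status"
}
```

With this in place, a step would be run as, for example, `logged volprint -ht -g hwste01_dg`. The command line is written before the command starts, so a command that hangs still leaves a trace in the log.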

The data on this storage was volatile, so all I am trying to do at this point is get back to a working 9-disk volume and, if necessary, remake the domain and fileset and restore the directory structure.
The expected directory /etc/fdmns/hwste01_data1_domain exists and has a symbolic link to /dev/vol/hwste01_dg/hwste01_vol01 and "volprint -g hwste01_dg_vh" looks good to me and shows the plex and volume as ACTIVE. Unfortunately the command
# showfdmn hwste01_data1_domain
neither gives an error nor returns.
The command
# /sbin/advfs/advscan -g hwste01_dg
lists information and says the domain was created Jun 10 00:39:44 2003 (which is the original setup date), but it gives the Lastmount as Jun 22 13:09:34 2009.
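Because showfdmn and rmfdmn hang here rather than fail, a small watchdog wrapper can keep a probe from tying up the terminal indefinitely. This is a minimal sketch, assuming a POSIX-style shell such as ksh; `run_with_timeout` is a hypothetical helper, not a Tru64 utility:

```shell
# run_with_timeout SECONDS COMMAND [ARGS...]
# Runs COMMAND in the background and starts a watchdog that sends it
# SIGTERM after SECONDS. Returns the command's own exit status, or 124
# if the command ended with SIGTERM (status 143), which is assumed here
# to mean the watchdog fired.
run_with_timeout() {
    limit=$1
    shift
    "$@" &
    cmd_pid=$!
    # Watchdog: output discarded so it never holds the caller's pipe open.
    ( sleep "$limit"; kill -TERM "$cmd_pid" ) >/dev/null 2>&1 &
    dog_pid=$!
    wait "$cmd_pid"
    status=$?
    # Stop the watchdog if the command finished first; ignore failure.
    kill -TERM "$dog_pid" 2>/dev/null || true
    if [ "$status" -eq 143 ]; then
        status=124
    fi
    return "$status"
}
```

A probe would then look like `run_with_timeout 60 showfdmn hwste01_data1_domain`, with a return status of 124 meaning it had to be killed. One caveat: a process blocked inside the kernel may not respond to the signal, in which case the wrapper itself will keep waiting, so it may still be safer to run such probes from a separate session.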

Suggestions on how to proceed?