Operating System - Tru64 Unix

AdvFS Domain Panicked / Failed Disk?

Platform: Alpha DS20
OS: V5.1B rev.2650 alpha

Hello all,
I have a problem with one of the old systems in house (although I'm new to Tru64).

The problem is this: the system boots up properly, all the mount points are accessible, and "df" output looks normal. But after a couple of hours I cannot access any of the mount points under core_dmn, and when I try to 'cd' to one of them I get the following error message:
# ksh: /u01/cet: permission denied

When I reboot, I have access to core_dmn again, but it becomes inaccessible after an hour or so...

I suspect the cause is the failed disk dsk10c, but I am not quite sure. How can I verify this?
Could it also be a controller problem? How can I properly identify the source of the problem?

PS: The logs since boot are attached.

Thanks in advance...
CET
----------------------------------------------
Below is a short overview of the system messages...
Jan 26 16:36:56 ds20-1 vmunix: AdvFS Domain Panic; Domain core_dmn Id 0x452b5421.000d53e6

Jan 26 16:36:56 ds20-1 vmunix: An AdvFS domain panic has occurred due to either a metadata write error or an internal inconsistency. This domain is being rendered inaccessible.

Jan 26 16:36:56 ds20-1 vmunix: Domain#Fileset: core_dmn#cet
Jan 26 16:36:56 ds20-1 vmunix: Mounted on: /u01/cet
Jan 26 16:36:56 ds20-1 vmunix: Volume: /dev/disk/dsk10c

Jan 26 16:36:56 ds20-1 vmunix: I/O error appears to be due to a hardware problem.
Jan 26 16:36:56 ds20-1 vmunix: Check the binary error log for details.
-----------------------------------------------
# df -h
Filesystem          Size   Used  Available  Capacity  Mounted on
root_domain#root    977M   185M       786M       20%  /
/proc                  0      0          0      100%  /proc
usr_domain#usr       19G  3927M        15G       21%  /usr
var_domain#var       10G   869M      9866M        9%  /var
usr_domain#tmp       19G    25K        15G        1%  /cluster/members/member0/tmp
core_dmn#rsn        204G    30G        62G       33%  /u01/rsn
core_dmn#cet        204G    46G        62G       43%  /u01/cet
core_dmn#cettest    204G    34G        62G       36%  /u01/cettest
core_dmn#cettest2   204G    30G        62G       33%  /u01/cettest2
core_dmn#u02        204G  2220M        62G        4%  /u02
core_dmn#u03        204G    16K        62G        1%  /u03
core_dmn#u04        204G  2032K        62G        1%  /u04
core_dmn#u05        204G  9660K        62G        1%  /u05
eva#backup          343G    64G       279G       19%  /backup
eva#test            343G    16K       279G        1%  /u01/rsn1
-----------------------------------------------
# hwmgr view hierarchy
24: scsi_bus scsi2
60: disk bus-2-targ-0-lun-0 dsk0
94: disk bus-2-targ-1-lun-0 dsk9
61: disk bus-2-targ-2-lun-0 dsk1
62: disk bus-2-targ-3-lun-0 dsk2
97: disk bus-2-targ-4-lun-0 dsk12
64: disk bus-2-targ-5-lun-0 dsk4
95: disk bus-2-targ-6-lun-0 dsk10
-----------------------------------------------
# /etc/fdmns/core_dmn> ls
dsk10c dsk12c dsk1c dsk2c dsk4c dsk9c

-----------------------------------------------
### The last 50 messages I received are below:

# cat /var/adm/messages | tail -50
Jan 26 15:48:20 ds20-1 vmunix: scsi2: HTH intr. on bus 2, SBCL = 0x26
Jan 26 15:48:23 ds20-1 vmunix: scsi2: SCSI Bus was reset
Jan 26 15:49:31 ds20-1 vmunix: scsi2: HTH intr. on bus 2, SBCL = 0x26
Jan 26 15:49:34 ds20-1 vmunix: scsi2: SCSI Bus was reset
Jan 26 16:30:31 ds20-1 vmunix: scsi2: SCSI Bus was reset
Jan 26 16:31:48 ds20-1 vmunix: scsi2: HTH intr. on bus 2, SBCL = 0x2e
Jan 26 16:32:23 ds20-1 vmunix: scsi2: SCSI Bus was reset
Jan 26 16:33:39 ds20-1 vmunix: scsi2: SCSI Bus was reset
Jan 26 16:35:36 ds20-1 vmunix: scsi2: SCSI Bus was reset
Jan 26 16:36:49 ds20-1 vmunix: scsi2: SCSI Bus was reset
Jan 26 16:36:56 ds20-1 vmunix: AdvFS I/O error:
Jan 26 16:36:56 ds20-1 vmunix: Volume: /dev/disk/dsk10c
Jan 26 16:36:56 ds20-1 vmunix: Tag: 0xffffffd8.0000
Jan 26 16:36:56 ds20-1 vmunix: Page: 4142
Jan 26 16:36:56 ds20-1 vmunix: Block: 28520144
Jan 26 16:36:56 ds20-1 vmunix: Block count: 32
Jan 26 16:36:56 ds20-1 vmunix: Type of operation: Write
Jan 26 16:36:56 ds20-1 vmunix: Error: 5 (see /usr/include/errno.h)
Jan 26 16:36:56 ds20-1 vmunix: EEI: 0x6200 (Advfs cannot retry this)
Jan 26 16:36:56 ds20-1 vmunix: AdvFS initiated retries: 0
Jan 26 16:36:56 ds20-1 vmunix: Total AdvFS retries on this volume: 0
Jan 26 16:36:56 ds20-1 vmunix: I/O error appears to be due to a hardware problem.
Jan 26 16:36:56 ds20-1 vmunix: Check the binary error log for details.
Jan 26 16:36:56 ds20-1 vmunix:
Jan 26 16:36:56 ds20-1 vmunix: bs_osf_complete: metadata write failed
Jan 26 16:36:56 ds20-1 vmunix: AdvFS Domain Panic; Domain core_dmn Id 0x452b5421.000d53e6
Jan 26 16:36:56 ds20-1 vmunix: An AdvFS domain panic has occurred due to either a metadata write error or an internal inconsistency. This domain is being rendered inaccessible.
Jan 26 16:36:56 ds20-1 vmunix: Please refer to guidelines in AdvFS Guide to File System Administration regarding what steps to take to recover this domain.
Jan 26 16:36:56 ds20-1 vmunix: Domain panic appears to be due to a hardware problem
Jan 26 16:36:56 ds20-1 vmunix: Check the binary error log for more information.
Jan 26 16:36:56 ds20-1 vmunix: AdvFS I/O error:
Jan 26 16:36:56 ds20-1 vmunix: Domain#Fileset: core_dmn#cet
Jan 26 16:36:56 ds20-1 vmunix: Mounted on: /u01/cet
Jan 26 16:36:56 ds20-1 vmunix: Volume: /dev/disk/dsk10c
Jan 26 16:36:56 ds20-1 vmunix: Tag: 0x0000ea0f.8006
Jan 26 16:36:56 ds20-1 vmunix: Page: 3
Jan 26 16:36:56 ds20-1 vmunix: Block: 70050624
Jan 26 16:36:56 ds20-1 vmunix: Block count: 16
Jan 26 16:36:56 ds20-1 vmunix: Type of operation: Read
Jan 26 16:36:56 ds20-1 vmunix: Error: 5 (see /usr/include/errno.h)
Jan 26 16:36:56 ds20-1 vmunix: EEI: 0x6200 (Advfs cannot retry this)
Jan 26 16:36:56 ds20-1 vmunix: AdvFS initiated retries: 0
Jan 26 16:36:56 ds20-1 vmunix: Total AdvFS retries on this volume: 0
Jan 26 16:36:56 ds20-1 vmunix: I/O error appears to be due to a hardware problem.
Jan 26 16:36:56 ds20-1 vmunix: Check the binary error log for details.
Jan 26 16:36:56 ds20-1 vmunix: To obtain the name of the file on which
Jan 26 16:36:56 ds20-1 vmunix: the error occurred, type the command:
Jan 26 16:36:56 ds20-1 vmunix: /sbin/advfs/tag2name /u01/cet/.tags/59919
Jan 26 16:37:57 ds20-1 vmunix: scsi2: HTH intr. on bus 2, SBCL = 0x2e
Jan 26 16:37:59 ds20-1 vmunix: scsi2: SCSI Bus was reset



Martin Moore
HPE Pro

Re: AdvFS Domain Panicked / Failed Disk?

Yes, the domain becomes inaccessible because AdvFS detects an error and does a "domain panic" to take the domain off line. The supporting information indicates a problem with dsk10, as you suspected. The error message information also indicates a problem with SCSI bus 2, which dsk10 is connected to. So you could have either a disk problem or some other problem with the bus, or both. You could get more information on the errors by decoding the binary error log with wsea.
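
To make the correlation between the bus resets and the AdvFS errors easier to see, here is a minimal sketch that counts both from a saved copy of /var/adm/messages. It assumes the message layout shown in the first post; `summarize_messages` is a made-up helper name, not a Tru64 utility, and this is no substitute for decoding the binary error log.

```shell
# summarize_messages FILE
# Hypothetical helper: counts "SCSI Bus was reset" lines per bus and
# "AdvFS I/O error:" reports per volume, assuming the syslog layout
# shown in the original post.
summarize_messages() {
    awk '
        # e.g. "... vmunix: scsi2: SCSI Bus was reset" -> one reset on scsi2
        /SCSI Bus was reset/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^scsi[0-9]+:$/) {
                    bus = $i
                    sub(/:$/, "", bus)
                    resets[bus]++
                }
        }
        # Each "AdvFS I/O error:" header is followed by a "Volume:" line
        /AdvFS I\/O error:/ { pending = 1; next }
        pending && /Volume:/ { errors[$NF]++; pending = 0 }
        END {
            for (b in resets) printf "resets %s %d\n", b, resets[b]
            for (v in errors) printf "advfs_io_errors %s %d\n", v, errors[v]
        }
    ' "$1" | sort
}
```

Run as `summarize_messages /var/adm/messages`. If the resets are all on one bus and the AdvFS errors are all on one volume attached to that bus, that narrows the search to the disk, its cabling, or the controller, but it does not tell you which of those is at fault.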

My suggestion: first track down and fix the hardware problem(s). Then run fixfdmn on core_dmn to ensure there is no metadata corruption within the domain. But don't run fixfdmn until the hardware problem is resolved.

Martin
I work for HPE

Re: AdvFS Domain Panicked / Failed Disk?

Thank you for the quick response. If I need to replace a disk on this system, how should I proceed?

As far as I can see, this machine doesn't have a RAID controller (the 'hwmgr view hierarchy' output is attached).

So that means I will lose any data on dsk10.

- Will I be able to reboot the system after replacing the disk?
- Is there anything else I need to do at the OS level after physically replacing the disk (e.g. hwmgr -scan scsi ...)?
- Is it possible to recover or rebuild the data on core_dmn (this may be a basic AdvFS question)?

Sorry for so many questions...
Regards,
CET
Rob Leadbeater
Honored Contributor

Re: AdvFS Domain Panicked / Failed Disk?

Hi,

Do you know what the underlying hardware configuration is?

It would be useful to see the output of

# hwmgr show scsi
and/or
# hwmgr show scsi -full

If it is a stand alone disk, then you're probably going to have to start recovering from your backups...

It might also be useful to see the output of

# ls -lR /etc/fdmns

and

# showfdmn core_dmn

to see how the domain is set up.

Cheers,

Rob
Venkatesh BL
Honored Contributor
Solution

Re: AdvFS Domain Panicked / Failed Disk?

Fix the hardware problem first. If the disk cannot be repaired, you need to restore the entire file system from backup. If you don't have a backup, you can still use the 'salvage' command to retrieve files that reside on the other disks in the same AdvFS domain.

If you correct the hardware issue but fear that the disk could fail at any time, you can run 'fixfdmn' on the domain, then 'addvol' a new disk (of similar size) and 'rmvol' the suspect disk to remove it from the AdvFS domain (all data will be intact at the end of this operation).
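
The sequence described above can be written down as a dry run before touching the system. This is only a sketch: `print_replace_plan` is a made-up helper that prints the commands instead of running them, and the new volume name in the usage note (dsk13c) is a placeholder, not a device from this system. As far as I know, fixfdmn expects the domain's filesets to be unmounted; check the reference pages before running anything.

```shell
# print_replace_plan DOMAIN OLD_VOLUME NEW_VOLUME
# Hypothetical helper: prints, but does not execute, the order of
# operations suggested in this thread for swapping a suspect volume
# out of an AdvFS domain once the hardware fault is fixed.
print_replace_plan() {
    # 1. Check and repair the domain metadata first
    echo "fixfdmn $1"
    # 2. Add the replacement volume to the domain
    echo "addvol $3 $1"
    # 3. Remove the suspect volume; its data migrates to the remaining volumes
    echo "rmvol $2 $1"
    # 4. Confirm the resulting volume layout
    echo "showfdmn $1"
}
```

For example, `print_replace_plan core_dmn /dev/disk/dsk10c /dev/disk/dsk13c` prints the four commands in order so they can be reviewed first.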

Re: AdvFS Domain Panicked / Failed Disk?

I'd like to thank everyone for their help.

I have more or less solved the problem:
- The salvage command recovered all but 1 of about 25,000 files.

Now I'll try to identify the missing file, because the log file does not give its name.

Then I'll remove the domain, create a new one, and transfer the recovered files back to their original locations.
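
One possible lead for the missing file: the kernel messages in the first post report a failed read with tag 0x0000ea0f.8006 and suggest a tag2name command using the decimal form of that tag. A minimal sketch of the conversion follows; `tag_to_decimal` is a made-up helper name. It is only an assumption that the file salvage could not recover is the same file as the one in that read error, and the .tags path is only usable while the fileset is mounted.

```shell
# tag_to_decimal HEXTAG
# Hypothetical helper: converts the hex tag from an "AdvFS I/O error"
# report (e.g. 0x0000ea0f.8006) into the decimal number used under
# <mountpoint>/.tags. Only the part before the dot is converted.
tag_to_decimal() {
    hex=${1%%.*}      # drop the part after the dot
    hex=${hex#0x}     # drop the 0x prefix
    printf '%d\n' "0x$hex"
}

# Example: rebuild the command the kernel message suggested
# echo "/sbin/advfs/tag2name /u01/cet/.tags/$(tag_to_decimal 0x0000ea0f.8006)"
```

For the read error in the log this gives 59919, which matches the path in the kernel's own suggestion (/u01/cet/.tags/59919). The earlier write error (tag 0xffffffd8.0000) was reported as a metadata write, so it probably does not correspond to a user file.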

CET