
Re: System disk errors when running "analyze/disk"

 
Wim Van den Wyngaert
Honored Contributor

Re: System disk errors when running "analyze/disk"

With DFU you can find the file name.

E.g. DFU search dev: /fid=17665

Don't know why we don't get the file name. Must be old software and HP prefers to implement new stuff.

Wim
IFX_1
Frequent Advisor

Re: System disk errors when running "analyze/disk"

DFU was not able to find the files either.

I also ran DFU verify/fix, but it was not able to fix the errors either.
Oswald Knoppers_1
Valued Contributor

Re: System disk errors when running "analyze/disk"

Which version of DFU are you using? Older versions of DFU can cause MULTALLOC problems in certain circumstances.

Oswald
Jur van der Burg
Respected Contributor

Re: System disk errors when running "analyze/disk"

In general, neither analyze/disk nor dfu can fix multalloc errors. They can't read your mind about which file to delete.

multalloc means that you have a serious problem with the integrity of the disk structure, and that two or more files have the same logical blocks allocated to them. So you have to delete at least one of them, but you can't be sure which one. I would be very careful with this, and preferably reinstall vms on a new volume. Or do manual recovery, but then you REALLY have to know what you are doing.

Jur.
IFX_1
Frequent Advisor

Re: System disk errors when running "analyze/disk"

Oswald,
I'm using DFU v3.2. As suggested above, I tried using it to search for other files based on file id.

Jur,
Thanks to the early responses, I have already deleted most of the errant files, which has significantly reduced the disk errors. Now, short of jumping straight into a re-installation of the OS, I need your 'expert' thoughts on how to delete/deal with the remaining errant files.
Jon Pinkley
Honored Contributor

Re: System disk errors when running "analyze/disk"

IFX,

What event led you to run analyze disk in the first place?

We know you have a shadow set with errors, but we don't know how many members are in the shadow set, and how the members are connected to the system. Do any of the members have direct connections to multiple systems? By that I mean are any of the shadow set members connected to a shared storage bus (shared SCSI bus, Fibre Channel, CI or DSSI)? I see your last analyze output was for a FC DG device. What type of Fibre Channel controller do you have (MSA, EVA, XP, HSG80, something else)? Was (at least) one of the shadow set members ever presented to more than one system that were not part of the same cluster?

Can you please give us a bit of information about your hardware/software configuration?

Can you do the following?

$ define/job DFU$NOSMG T ! disable the pesky SMG interface
$ dfu report $1$DGA4995:

Cut and paste the output into notepad, save as a .txt file and attach to the comment. This output will provide the info we need to be able to have you dump out the blocks of indexf.sys that contain the file headers of these files. It is possible that something has overwritten parts of the indexf.sys file.

If you go back to your original posting, you will see that the first errors are reporting files marked for delete, but then there is a block of 14 contiguous file headers (may not be contiguous on disk, but probably are) that have %ANALDISK-W-BADHEADER. These are the headers from 17711 to 17724.

I will reproduce the first and last one, see the first message for the rest.

%ANALDISK-W-BADHEADER, file (17711,24624,0)
invalid file header
-ANALDISK-I-FIDNUM_ZERO, file number zero but not a valid deleted header
-ANALDISK-I-INVHEADER_BUSY, invalid file header marked "busy"
in index file bitmap

...

%ANALDISK-W-BADHEADER, file (17724,0,0)
invalid file header
-ANALDISK-I-IDLEHEADER_BUSY, idle file header marked "busy"
no user action necessary

This is the portion of the indexf.sys file that I think had a good chance of being overwritten. We can't determine what to dump without knowing more about the disk, and that is reported in the output of dfu report. (The line "First header VBN" is the most important, but unless there is something sensitive in the output, please provide the complete report.)
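As a rough aid for working through a long analyze/disk listing, the file IDs in the warning lines can be pulled out with a small script. This is only a sketch: the regular expression is based solely on the two message lines quoted above, and the names `FID_MESSAGE` and `flagged_files` are made up for illustration.

```python
import re

# Pattern inferred from the warning lines quoted in this thread
# (an assumption, not an official message specification).
FID_MESSAGE = re.compile(
    r"^%ANALDISK-W-(?P<code>[A-Z_]+), file "
    r"\((?P<num>\d+),(?P<seq>\d+),(?P<rvn>\d+)\)"
)

def flagged_files(report_text):
    """Return (message code, file ID tuple) for each warning line naming a file."""
    found = []
    for line in report_text.splitlines():
        match = FID_MESSAGE.match(line.strip())
        if match:
            fid = (int(match["num"]), int(match["seq"]), int(match["rvn"]))
            found.append((match["code"], fid))
    return found

# The first and last BADHEADER messages reproduced above.
sample = """%ANALDISK-W-BADHEADER, file (17711,24624,0)
invalid file header
-ANALDISK-I-FIDNUM_ZERO, file number zero but not a valid deleted header
%ANALDISK-W-BADHEADER, file (17724,0,0)
invalid file header
-ANALDISK-I-IDLEHEADER_BUSY, idle file header marked "busy"
"""

for code, fid in flagged_files(sample):
    print(code, fid)
```

Continuation lines (those starting with "-ANALDISK-I-" or plain text) are skipped on purpose, since they do not carry a file ID of their own.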

Jon
it depends
IFX_1
Frequent Advisor

Re: System disk errors when running "analyze/disk"

Jon,

*** What event led you to run analyze disk in the first place?

I ran analyze/disk on the system when I encountered the following errors upon accessing the audit journal.

%AUDSRV-W-BADRECORD, invalid data in record 184669
%RMS-F-IRC, illegal record encountered; VBN or record number = 169176


*** Do any of the members have direct connections to multiple systems? By that I mean are any of the shadow set members connected to a shared storage bus (shared SCSI bus, Fibre Channel, CI or DSSI)? Was (at least) one of the shadow set members ever presented to more than one system that were not part of the same cluster?

We are using EMC storage on a two-node ES40 cluster which has common system disk. Most of the shadow sets (including the system disk) have two members. The system is now up with only one volume ($1$DGA899). I'm running the DFU and analyze/disk/repair on the 2nd member, $1$DGA4995 which I mounted privately.

Most of the errant files that showed in my original posting were created at the time when HP tried to configure the second node in the cluster. I recall that some problems were encountered at that time, and I believe the second node connected to the common system disk as a separate node, which may have caused these errors.


*** Can you do the following?
$ define/job DFU$NOSMG T ! disable the pesky SMG interface
$ dfu report $1$DGA4995:

Please refer to the attached DFU report.


Much thanks for the help.
Jon Pinkley
Honored Contributor

Re: System disk errors when running "analyze/disk"

"Most of the errant files that showed in my original posting were created at the time when HP tried to configure the second node in the cluster. I recall some problems were encountered that time and I believe it connected to the common system disk as a separate node which may have caused these errors."

I would think twice before inviting them back for system upgrades.

I would not trust anything on the disk. You did see some of the effects when you analyzed the audit file. There are probably other errors you are not yet aware of, and these errors are being copied to your system backups. Hopefully you still have a backup from prior to the addition of the second node. If you do have backups from prior to the event, write protect them and keep them safe. You may need them to restore data files from.

You even have multiply allocated blocks in the same file (17670,16,0)

%ANALDISK-W-MULTALLOC, file (17670,16,0)
multiply allocated blocks
VBN 110161 to 110192
LBN 7931120 to 7931151, RVN 1
%ANALDISK-W-MULTALLOC, file (17670,16,0)
multiply allocated blocks
VBN 110977 to 111008
LBN 7931120 to 7931151, RVN 1

Here LBN 7931120 to 7931151 are mapped to VBN 110161 to 110192 and mapped again to VBN 110977 to 111008, both in the same file (17670,16,0).
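To make the idea of "multiply allocated" concrete, here is a minimal sketch that checks a list of extents for overlapping LBN ranges. The extent values are copied from the two MULTALLOC messages above; the dictionary layout and the function name `overlapping_lbns` are purely illustrative and are not how analyze/disk or DFU work internally.

```python
from itertools import combinations

def overlapping_lbns(extents):
    """Return (extent_a, extent_b, first_lbn, last_lbn) for each pair sharing LBNs."""
    hits = []
    for a, b in combinations(extents, 2):
        first = max(a["lbn_start"], b["lbn_start"])
        last = min(a["lbn_end"], b["lbn_end"])
        if first <= last:  # the two inclusive ranges intersect
            hits.append((a, b, first, last))
    return hits

# Extents taken from the two MULTALLOC messages for file (17670,16,0).
extents = [
    {"fid": (17670, 16, 0), "vbn_start": 110161, "vbn_end": 110192,
     "lbn_start": 7931120, "lbn_end": 7931151},
    {"fid": (17670, 16, 0), "vbn_start": 110977, "vbn_end": 111008,
     "lbn_start": 7931120, "lbn_end": 7931151},
]

for a, b, first, last in overlapping_lbns(extents):
    print(f"LBN {first} to {last} ({last - first + 1} blocks) mapped by "
          f"VBN {a['vbn_start']} of {a['fid']} and VBN {b['vbn_start']} of {b['fid']}")
```

The same check applies whether the two extents belong to different files or, as here, to the same file.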

If you run the DFU report and look for "first header VBN", take that number, subtract one (the first file number is 1, not 0), then add the file number of a file you want to dump the header of (for example 17670 for the file with file id (17670,16,0)). This will give the VBN of [000000]INDEXF.SYS that has the file header for the file.

Since on your $1$DGA4995:, the First header VBN is 827, to dump the file headers, add 826 to the file number, and dump that block of [000000]indexf.sys

For example to dump the header of the file that has the same LBNs mapped twice, (17670,16,0), dump VBN 18496 (17670+826) of [000000]indexf.sys

$ dump/file_header/block=(start:18496,count:1) $1$DGA4995:[000000]indexf.sys

or to see it in hex/ascii format

$ dump/block=(start:18496,count:1) $1$DGA4995:[000000]indexf.sys

Look at the retrieval pointers and you will see that the same LBNs are mapped more than once.
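The header-location arithmetic described above can be written down as a short sketch. It assumes only what was stated: header VBN = (First header VBN - 1) + file number. The function names `header_vbn` and `header_vbn_range` are hypothetical helpers, not part of any VMS tool.

```python
def header_vbn(first_header_vbn, file_number):
    """VBN in [000000]INDEXF.SYS holding the header for the given file number."""
    if file_number < 1:
        raise ValueError("file numbers start at 1")
    return (first_header_vbn - 1) + file_number

def header_vbn_range(first_header_vbn, first_file, last_file):
    """Return (start VBN, block count) covering a run of consecutive file numbers."""
    if last_file < first_file:
        raise ValueError("last_file must not be below first_file")
    start = header_vbn(first_header_vbn, first_file)
    return start, last_file - first_file + 1

# $1$DGA4995: reports First header VBN 827.
print(header_vbn(827, 17670))                # header of (17670,16,0)
print(header_vbn_range(827, 17711, 17724))   # the 14 BADHEADER headers
```

For the block of bad headers 17711 to 17724, this gives a start VBN and count that could be put into the same /block=(start:...,count:...) qualifier used in the dump commands above.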

Example: (much left out for brevity), see attachment for full details.

$ dfu report disk$user1
...
First header VBN : 998
...
$ dir login.com;/file

Directory ROOT$USERS:[JON]

LOGIN.COM;241 (340977,27,0)

Total of 1 file.
$ vbn = (998-1)+340977
$ sho sym vbn
VBN = 341974 Hex = 000537D6 Octal = 00001233726
$ dump/file/block=(start:341974,count:1) disk$user1:[000000]indexf.sys

Dump of file DSA1200:[000000]INDEXF.SYS;1 on 10-SEP-2008 21:08:12.46
File ID (1,1,0) End of file block 444376 / Allocated 801008

Virtual block number 341974 (000537D6), 512 (0200) bytes

Header area
...
File identification: (340977,27,0)
...
Identification area
File name: LOGIN.COM;241
...
$

If you use dump without /file_header, it will dump the contents of the file header in a hex dump instead of formatting it as a file header.

$ dump/block=(start:341974,count:1) disk$user1:[000000]indexf.sys/width=80

You may be able to see some clues in the ascii text if some other file got mapped over that portion of indexf.sys.

Good luck (you will need it),

Jon
it depends
Jur van der Burg
Respected Contributor

Re: System disk errors when running "analyze/disk"

>Most of the errant files that showed in my original posting were created at the time when HP tried to configure the second node in the cluster.

Right, so there probably has been a partitioned cluster. Someone clearly did not know enough about clusters.

I would for sure toss this disk and rebuild it, or restore a backup from before the event, as you don't know which blocks are corrupt. You may very well run into subtle (or not so subtle) issues later on. And as said by Jon, don't let these people touch your system ever again.

Jur.
Oswald Knoppers_1
Valued Contributor

Re: System disk errors when running "analyze/disk"

With DFU V3.2 the MULTALLOC problem was fixed.