
Re: System disk corruption

 
Art Wiens
Respected Contributor

System disk corruption

I got a service ticket today from a client who's having some "difficulties" with their system disk (VMS v6.2 on a 4GB SCSI disk). They were having some space issues and did some "maintenance" on their own (I'm still trying to find out what they did exactly!!). Now their backup logs contain entries similar to:

%BACKUP-E-READDIR, error reading directory DSA1:[SYSE.SYSCOMMON.DECW$BOOK]
-SYSTEM-F-ILLBLKNUM, illegal logical block number

I did an ANALYZE/DISK/NOREPAIR to see what I was up against...WOW! 36,055 lines of output, although 36,019 of those lines look like (example) %ANALDISK-W-LOSTHEADER, file (21319,73,1) .; (the app is crap...it creates temporary files on the fly and doesn't clean up!).

If I let ANALYZE/DISK/REPAIR run, will it be able to "fix" the illegal block numbers AND actually retain the contents of the bad directories? Fortunately it's in the DECwindows area and not really needed but...

As well, if there are more than 32,768 lost "unnamed" files, this will probably take a couple of passes to fix, correct? And many, many hours to delete them :-(

What a mess,
Art

ps. just to make things "interesting" - DSA1: it's a 3rd party disk shadowing product!
20 REPLIES
John Gillings
Honored Contributor
Solution

Re: System disk corruption

Art,

As well as ANALYZE/DISK/REPAIR, see DFU on the freeware distribution. It can perform repairs on directories. You may not save everything, but it should be able to make a clean directory for you. For stuff like DECW$BOOK, it should be easy to recover anything you need from distribution media. DFU can also delete large numbers of files very quickly. Let ANALYZE/DISK/REPAIR put them in [SYSLOST], salvage anything you want and use DFU to delete SYSLOST.DIR. It will be very quick.
A crucible of informative mistakes
Willem Grooters
Honored Contributor

Re: System disk corruption

Just a hint:
they might have done a 'cleanup' by "deleting unneeded files", forgetting that these are actually just directory entries: removing [VMSCOMMON]DEC$WINDOWS stuff, for instance, since "VMSCOMMON is no user". Indeed, find out WHAT they did exactly...

As it is only the DECwindows environment, this might well be the case. But beware, they MAY have other problems because of that. Someone who made that mistake doesn't know VMS as a system manager (perhaps some Unix/Windows sysadmin "doing some VMS as well"). I've seen that before (and had to rebuild the system disk from backup).

ANALYZE/DISK/REPAIR may help: lost files (files that are no longer entered in any directory) will be placed in [SYSLOST], a directory created by the system, but then you have the task of putting them back in their proper place. For a quick repair, I'm afraid that only a restore of the system disk will help you. Or, if it really is DECwindows only, try to reinstall DECwindows.


Willem
Willem Grooters
OpenVMS Developer & System Manager
Bojan Nemec
Honored Contributor

Re: System disk corruption

Art,

32,768 lost files seems a lot for the DECwindows area; maybe some other area is also affected. Did you find out what was done for the "maintenance"? Maybe some forced delete on the system directories, where the same file is entered in two directories with the SET FILE/ENTER command. I am on vacation now and have no VMS machine to test what happens when you delete such a directory.

For me it would be hard to trust such a system disk, so my approach to your problem would be:
Save all user data on the system disk, including system files modified by the user (startup procedures, SYSUAF, RIGHTSLIST etc.), and do a fresh VMS installation.

And if the system was not rebooted after this "maintenance", don't try to reboot before backing up the user data. There is a good chance that the system will not boot!

Bojan
Art Wiens
Respected Contributor

Re: System disk corruption

John - I've never used DFU, but if it'll save the extraordinary time to delete 32K files, I'll give it a go.

Willem - I should have given them the benefit of the doubt, they didn't do anything "weird". Turns out they were trying to solve a lack of contiguous space problem and did a backup/restore. The only unusual thing (and perhaps this is what introduced the "corruption") was they took the system disk and mounted it (privately) on another system - an Alpha running VMS v7.1 . They did an image backup to tape, followed by an analyze on the disk, and then an image restore. Not sure what the point of the analyze was (in the order it was done).

Bojan - "32,768 lost files seems a lot for the DECwindows area"

The 32K files aren't from the DecWindows dirs, they're created (and not deleted) by a "logically challenged" app!

The system does still boot from this disk, so it's seemingly not a fatal mess! The odd thing is the corruption is only since the restore.

We'll see what ANALYZE/REPAIR can do.

Thanks,
Art
Robert Atkinson
Respected Contributor

Re: System disk corruption

Art - before I read your response, I was thinking along the lines of 'clustered disk mounted privately', and you've confirmed this.

The level of corruption you're talking about (and I know from my own stupid mistakes) points to the disk being mounted on 2 unclustered nodes.

Basically, parts of the bitmap are updated by one system, while the other is overwriting them with what it knows, so all hell is let loose!

You say they mounted it privately - are you absolutely sure about that?

Rob.
Mohamed  K Ahmed
Trusted Contributor

Re: System disk corruption

I just wanted to mention that 32,768 is the largest number VMS can reach. For a 16-bit system it is 2^15 + 1.

There might be more than that, but the system didn't count them and/or stopped counting.

Mohamed
Uwe Zessin
Honored Contributor

Re: System disk corruption

I am afraid your math is broken:

$ create a.tmp;32768
%CREATE-E-OPENOUT, error opening USER01:[ZESSIN]A.TMP;32768 as output
-RMS-E-CRE, ACP file create failed
-SYSTEM-W-BADFILEVER, bad file version number
$ write sys$output f$getsyi("version")
V7.3-2
$

>>> print (2**15)
32768
>>> print (2**15)+1
32769
>>>
>>> m = (2**15)-1
>>> print m, hex(m)
32767 0x7fff
>>>
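
The same arithmetic as a Python 3 sketch. It assumes, as the BADFILEVER error above suggests, that a file version number is held as a positive signed 16-bit value; `MAX_VERSION` and `is_valid_version` are illustrative names, not VMS identifiers:

```python
# Assumption: a file version number is a positive signed 16-bit integer,
# so the highest usable version would be 2**15 - 1.
MAX_VERSION = 2**15 - 1


def is_valid_version(version: int) -> bool:
    """Return True if `version` fits the assumed 1..MAX_VERSION range."""
    return 1 <= version <= MAX_VERSION


if __name__ == "__main__":
    # Show the boundary: the last accepted value and the first rejected one.
    for v in (MAX_VERSION, MAX_VERSION + 1):
        print(v, hex(v), is_valid_version(v))
```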
.
Art Wiens
Respected Contributor

Re: System disk corruption

Robert - "You say they mounted it privately - are you absolutely sure about that?"

As sure as I can be...the system disk was physically removed from this StorageWorks shelf and put into another and a straight MOUNT command issued.

I sourced DFU (v2.7 for VAX) and loaded it up. Aside from the 64,013 entries that look like:

%DFU-E-INVBAKFID, file (60152,39,1) .; has invalid backlink

I have attached a file that has the remaining errors.

According to DFU help, the INVBAKFID's should be handled by ANALYZE rather than DFU to better determine if they are actual directories rather than lost files.

I plan to:

1) shut system down
2) get a standalone image backup
3) come up without app running
4) run ANALYZE/REPAIR
5) run DFU's VERIFY/FIX
6) book an airline ticket to get to the site when it all falls apart! ;-)

Sound reasonable?
Art
Uwe Zessin
Honored Contributor

Re: System disk corruption

%DFU-E-MULTALLOC, file (52352,563,1) [SCGTS]DFU.LOG;1 ,
blocks LBN 5737617 through 5737625 multiple allocated

Ouch, ouch, ouch! I strongly recommend that you do not use this disk in production any further.

You can try to recover any configuration files from that disk and check if they are intact, but I would not spend any more time attempting to recover this disk's contents. It is obvious (MULTALLOC) that there are files which share the same data blocks - not a nice thing, believe me.

Pick up the installation media before you board the plane.
.
Robert Atkinson
Respected Contributor

Re: System disk corruption

I know it's not likely, but did they dismount the disk before they moved it?

Is it possible that they moved it making the disk go into mount-verify, corrupted it, and then moved it back?

I agree with Uwe. Even if the files look OK, the data in them is likely to be corrupt and to come from other files.

You've got to take the hit on this one and restore or rebuild the disk from backup, I'm afraid.

Rob.
Jan van den Ende
Honored Contributor

Re: System disk corruption

and if you need another vote to convince the client:

re-label this disk DYNAMITE or TNT or some other explosive name.

ANY file you try to salvage from this disk should be considered suspect until you have inspected the contents COMPLETELY.

Don't be surprised to find a file that in the beginning looks like a good file A, then find several disk-clusters from file B, and continue as A.


Kind of: Keep a safe distance, smoking and open fire prohibited. Don't try this at home.


Good luck (afraid you'll need it)


Jan
Don't rust yours pelled jacker to fine doll missed aches.
Art Wiens
Respected Contributor

Re: System disk corruption

Although I agree with all of your concerns about not using this disk anymore, due to the location of this system, its role in production and the amount of time required to rebuild, I have to try the repair first. I know, I know, time spent rebuilding vs. cost of downtime, but hopefully the repair will buy me a bit more time to plan a rebuild. The app is very old, and it's not any kind of "vanilla" VMSINSTAL installation.

In my mind rebuilding is probably more scary than trying to fix it.

One quick question: in DFU, what would the exact syntax be to delete the SYSLOST directory? Do I just give it SYS$SYSDEVICE:[SYSLOST] ?

Thanks for all your concerns, I'll let you know on Monday how it went.

Art
Ian Miller.
Honored Contributor

Re: System disk corruption

"One quick question: in DFU, what would the exact syntax be to delete the SYSLOST directory? Do I just give it SYS$SYSDEVICE:[SYSLOST} ?"

MCR DFU DELETE SYS$SYSDEVICE:[000000]SYSLOST.DIR/DIRECTORY/TREE
____________________
Purely Personal Opinion
Jan van den Ende
Honored Contributor

Re: System disk corruption

Art,

look in
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=625667

Locate the answer by Mark Hopkins, about running ANA/REPA with /CONF and an input file.

That looks exactly like what you are looking for!

Jan
Don't rust yours pelled jacker to fine doll missed aches.
Art Wiens
Respected Contributor

Re: System disk corruption

The results are in:

First of all I found out what happened to the disk (sorry if it wraps):

http://h18000.www1.hp.com/support/asktima/operating_systems/0098FB46-44F7E940-1C02A1.html

The system is a VAX 4100 w/HSD05 (DSSI <-> SCSI) w/RZ29B-VW . That disk was taken to an Alpha with a different disk controller, image backup to a scratch disk, and then restored back onto the original disk, allowing BACKUP to initialize it.

So we have files/dirs on the disk that are outside the maximum LBN of the disk!!

I ran ANALYZE/REPAIR and DFU VERIFY/DIR/FIX but neither could fix them. ANALYZE did reclaim all ~36,000 lost files - took two passes because it stopped at ;32,768. FYI, ANALYZE was "creating" 10,000 files / 20 minutes, DFU was able to delete 10,000 files / 10 minutes so it wasn't too painful.
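
For anyone planning a similar cleanup, the figures above can be turned into a rough estimate. This is only a sketch: it assumes the lost files all end up under one name in the lost-files directory and are told apart only by version number, that the per-name limit is 32767 versions, and that the rates stay constant. `passes_needed` and `minutes_needed` are hypothetical helper names:

```python
import math

# Assumption: at most 2**15 - 1 versions of one file name can exist.
MAX_VERSION = 2**15 - 1


def passes_needed(lost_files: int, limit: int = MAX_VERSION) -> int:
    """Number of repair passes if each pass can recover at most `limit` files."""
    return math.ceil(lost_files / limit)


def minutes_needed(files: int, files_per_batch: int, minutes_per_batch: int) -> float:
    """Elapsed minutes at a constant rate of files_per_batch per minutes_per_batch."""
    return files * minutes_per_batch / files_per_batch


if __name__ == "__main__":
    lost = 36_000  # approximate count reported in this thread
    print("repair passes:", passes_needed(lost))
    print("ANALYZE minutes:", minutes_needed(lost, 10_000, 20))
    print("DFU delete minutes:", minutes_needed(lost, 10_000, 10))
```

With the numbers from this thread that works out to two passes, a little over an hour of ANALYZE time and roughly half that for the DFU delete.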

So, I have a "clean" disk with 3 directories and a handful of files that are "buggered". Again, they all seem to be DECwindows related so no panic, but they still cause problems...recursive directory listings, image backups etc. complain w/ SYSTEM-F-ILLBLKNUM message.

My thinking is to:

1) put a spare RZ29 in the shelf
2) INITIALIZE it
3) shutdown and S/A BACKUP to it
4) swap disks and come back up

My question is (and I've also asked HP), will this "fix" it? I don't care if I lose the files/dirs, I can recover from another system. Will the corruption be left behind, or will it come along because it's an image backup?

Art
Uwe Zessin
Honored Contributor

Re: System disk corruption

Aha! That explains a lot. A storage controller usually reserves a certain number of disk blocks for its own metadata. Some of the early HSD controllers also have the ability to 'truncate' a disk to a multiple of 127 blocks, if I recall correctly. And finally they very likely present the new disk with a different geometry, which gives inconsistent homeblock locations unless it was initialized post-V6.1.

I do not understand your steps.
1) which shelf? the one connected to the HSD?
4) swap which disks? between which locations?

You CANNOT easily move a disk between a local attachment and a RAID-controller one.

There is one exception: see if the HSD05 understands TRANSPORTABLE disks (I think it does, but I can't check right now) - that means that it will not store any meta data on it and keep the geometry unchanged. The downside, of course is, that you cannot use any RAID functionality of the HSD.
.
Art Wiens
Respected Contributor

Re: System disk corruption

The problem was introduced because the disk was initialized on a different system (hardware). My plan means to initialize the new disk on the system it will be used on - it has two StorageWorks "shelves" in between two VAX 4100's using HSD05's to interface to them. That way the disk will be the correct size. Swap the disks means - use the newly created system disk copy as production ... put the new disk in the old disk's slot so no command procedures need to change.

The question remains, will the image backup to the new disk leave the corruption behind, or propagate it? i.e. (will BACKUP say) "I can't read these blocks so I won't try and copy them", or "I can't read these files, but here's the crap header anyway".

Art
Jan van den Ende
Honored Contributor

Re: System disk corruption

Art,

Probably just the other way around.

You got multiply allocated blocks, right?

Just a simple demo example, but you get the point:

File A consists of
Blocks 1-10, 21-50, 91-100

File B
Blocks 11-20, 101-120, 41-60

Image backup (makes files logically contiguous): A
1-10=> 1-10 + 21-50=> 11-40 + 91-100=> 41-50

B:
11-20=> 1-10 + 101-120=> 11-30 + 41-60=>31-50


So now, the doubly allocated blocks 41-50 are separate parts of both A and B.
Of course, in at least one of those cases, the content is non-consistent, but you DO get rid of your double allocation.
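
The demo can be sketched in Python. This is only an illustration of the example above, not how BACKUP works internally; `overlap` and `to_virtual` are made-up helper names, and extents are inclusive (first, last) block pairs. With the extents listed, the blocks claimed by both files work out to 41-50:

```python
def overlap(extents_a, extents_b):
    """Return the inclusive block ranges claimed by both files."""
    shared = []
    for a_lo, a_hi in extents_a:
        for b_lo, b_hi in extents_b:
            lo, hi = max(a_lo, b_lo), min(a_hi, b_hi)
            if lo <= hi:
                shared.append((lo, hi))
    return shared


def to_virtual(extents):
    """Lay the extents end to end, as a contiguous copy of the file would.

    Returns a list of ((old_lo, old_hi), (new_lo, new_hi)) pairs.
    """
    mapping = []
    next_block = 1
    for lo, hi in extents:
        length = hi - lo + 1
        mapping.append(((lo, hi), (next_block, next_block + length - 1)))
        next_block += length
    return mapping


FILE_A = [(1, 10), (21, 50), (91, 100)]
FILE_B = [(11, 20), (101, 120), (41, 60)]

if __name__ == "__main__":
    print("shared blocks:", overlap(FILE_A, FILE_B))
    print("A:", to_virtual(FILE_A))
    print("B:", to_virtual(FILE_B))
```

After the copy each file has its own 50 blocks, so the double allocation is gone, but whatever was in the shared blocks is simply copied into both files.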

hth

Jan
Don't rust yours pelled jacker to fine doll missed aches.
Art Wiens
Respected Contributor

Re: System disk corruption

Jan - "You got multiply allocated blocks, right?"

No, I have SYSTEM-F-ILLBLKNUM's

$ SHOW DEV/FULL SYS$SYSDEVICE

Disk $1$DIA0: (LEFT), device type RF72, is online, mounted, file-oriented
...
Total blocks 8380008
...

$ DUMP/HEADER/BLOCKS=COUNT=0 $1$DIA0:[SYS0.SYSCOMMON]DECW$BOOK.DIR
...
Map area
Retrieval pointers
Count: 9 LBN: 8380017
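
To spell out why that retrieval pointer is illegal, here is a small Python sketch. It assumes LBNs are numbered from 0 and that the Count shown by DUMP is the number of blocks in the extent; `extent_fits` is a hypothetical helper name:

```python
def extent_fits(start_lbn: int, count: int, total_blocks: int) -> bool:
    """Return True if the extent lies entirely within the volume.

    Assumes LBNs run from 0 to total_blocks - 1.
    """
    last_lbn = start_lbn + count - 1
    return start_lbn >= 0 and count > 0 and last_lbn <= total_blocks - 1


if __name__ == "__main__":
    total_blocks = 8380008          # from SHOW DEVICE/FULL
    start_lbn, count = 8380017, 9   # from DUMP/HEADER
    print("highest valid LBN:", total_blocks - 1)
    print("extent ends at LBN:", start_lbn + count - 1)
    print("fits on volume:", extent_fits(start_lbn, count, total_blocks))
```

Under those assumptions the extent starts past the last block of the volume, which is consistent with the SYSTEM-F-ILLBLKNUM errors.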

Art
Uwe Zessin
Honored Contributor

Re: System disk corruption

A BACKUP/IMAGE from a corrupted disk might create a consistent file system on the destination disk, but it cannot repair any data corruption.

On 19. AUG you posted an attachment that contains:
%DFU-W-MULTFND, reporting multiple allocated blocks...
%DFU-E-MULTALLOC, file (52352,563,1) [SCGTS]DFU.LOG;1 ,
blocks LBN 5737617 through 5737625 multiple allocated

Of course, at least one file contains invalid data, now.
.