Backup, Copy, RMS Errors (OpenVMS 7.3-2)

 
Jim Geier_1
Regular Advisor

I help manage a two-system cluster (AlphaServer ES40 systems) running OpenVMS 7.3-2 and patched up to Update V6. The cluster is seldom rebooted, perhaps every 10 months or so. The application on this cluster uses indexed RMS files. The system had been up for 320 days and it was discovered that a few of the RMS files had become corrupted. The storage is an MSA1000 with LUNs built using the ADG format. Working with HP hardware support and OpenVMS support, no hardware errors (storage, memory, other hardware) or problems could be found. In fact, after being up for 320 days, SHOW ERROR only showed a very small number of errors on the network adapter and 1 error on the diskette drives. In the diagnosis of the possible causes, the following was noticed with a medium-sized RMS file:

Step 1. Analyze/RMS/Check SRC-FILE -- No errors
Step 2. Copy SRC-FILE to COPY-FILE
Step 3. Backup SRC-FILE to BACKUP-FILE
Step 4. Analyze/RMS/Check COPY-FILE -- No errors
Step 5. Analyze/RMS/Check BACKUP-FILE -- Errors occurred
The output showed several errors of the type:
VBN 341: Bucket check byte is out of phase

This was repeated consistently a few times, using different disks and directories.

Step 6. One system rebooted
Step 7. Entire sequence repeated with no errors
Step 8. Other system rebooted

Because of time constraints, after the first system was rebooted we did not repeat the test on the system that was still up. My understanding is that such errors often indicate a hardware problem, but none can be found.

Backup and Copy must do some things differently internally. What might cause this problem? How can we prevent this from occurring again?
25 REPLIES
P Muralidhar Kini
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Hi Jim,

>> The system had been up for 320 days and it was discovered that a few of
>> the RMS files had become corrupted.
How was this corruption detected?
Using the ANALYZE/RMS/CHECK command?
Or
was VERIFY (ANAL/DISK/LOCK) run on the disk to check for any other errors
on the disk?

>> Step 1. Analyze/RMS/Check SRC-FILE -- No errors
>> Step 2. Copy SRC-FILE to COPY-FILE
>> Step 3. Backup SRC-FILE to BACKUP-FILE
>> Step 4. Analyze/RMS/Check COPY-FILE -- No errors
>> Step 5. Analyze/RMS/Check BACKUP-FILE -- Errors occurred
>> The output showed several errors of the type:
>> VBN 341: Bucket check byte is out of phase

The steps here indicate that ANALYZE/RMS/CHECK of the file has not indicated
any error. When the file is copied using COPY, the destination file has no errors
but when the file is copied using BACKUP the destination file has some set of
errors reported.

BACKUP and COPY do things a little differently, but it is
interesting to note that the file copied using BACKUP is the one giving an error.

Are all other errors similar to "VBN 341: Bucket check byte is out of phase"
or are there any other errors?

>> My understanding is that such errors often indicate a hardware problem,
>> but none can be found.
Have you checked the ERRLOG.SYS for any hardware problems reported
corresponding to the disk?

Regards,
Murali
Let There Be Rock - AC/DC
John Gillings
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Jim,

Was your Step 3 BACKUP/IGNORE=INTERLOCK? If not, could you please post the exact command and any output? (Also sanity check any possible symbol definitions in the command.)

>VBN 341: Bucket check byte is out of phase

The first and last bytes of a bucket are the check bytes. They're kept at the same value, but incremented each time a bucket is rewritten. If they're out of phase (ie: different), that implies an I/O rewriting the bucket was interrupted. That shouldn't happen if everything is correctly interlocked at the RMS level.
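A minimal sketch of that check-byte comparison, assuming the file has been read as a raw block-mode image of 512-byte blocks and that the caller already knows the starting VBN and bucket size of the bucket in question. The function names and the bucket size in the usage note are hypothetical; the sketch only compares the first and last byte of the bucket as described above and does not parse any other RMS structure.

```python
BLOCK_SIZE = 512  # bytes per disk block


def bucket_check_bytes(image: bytes, start_vbn: int, bucket_blocks: int):
    """Return (first_byte, last_byte) of the bucket starting at start_vbn.

    VBNs are 1-based, so VBN 1 is the first 512 bytes of the image.
    """
    if start_vbn < 1 or bucket_blocks < 1:
        raise ValueError("start_vbn and bucket_blocks must be >= 1")
    offset = (start_vbn - 1) * BLOCK_SIZE
    length = bucket_blocks * BLOCK_SIZE
    bucket = image[offset:offset + length]
    if len(bucket) != length:
        raise ValueError("bucket extends past the end of the image")
    return bucket[0], bucket[-1]


def out_of_phase(image: bytes, start_vbn: int, bucket_blocks: int) -> bool:
    """True when the two check bytes of the bucket differ."""
    first, last = bucket_check_bytes(image, start_vbn, bucket_blocks)
    return first != last
```

For example, `out_of_phase(image, 341, 4)` would test the bucket reported at VBN 341 if its bucket size were 4 blocks; the real bucket size has to come from the file's own attributes.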

>Backup and Copy must do some things
>differently internally

For an indexed file, I suspect COPY and BACKUP are VERY different. BACKUP just shovels bits, where COPY may create an empty file with the same attributes as the original, then reads records from the source and inserts them into the destination. You'll need to look at the source to see for sure.

If that's the case, the original and BACKUP file should be bit for bit identical, but the COPY file could be very different because the records may be inserted in a different sequence.

What does CONVERT SRC-FILE CVT-FILE do?
A crucible of informative mistakes
P Muralidhar Kini
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Hi Jim,

To rule out any possibility of disk corruption (for example, multiply allocated
blocks) of files on the disk, execute
$ANAL/DISK/LOCK
on the disk on which you have the corrupted indexed files.

However, from the steps that you have performed, the errors appear to have been
introduced by the BACKUP command.

As John has pointed out, it would be interesting to know the exact BACKUP
command you used to perform the copy operation. It could turn out that
the file was in use when you were performing the BACKUP operation and the
BACKUP was performed using "/IGNORE=INTERLOCK", causing the file to be
copied while it was in an inconsistent state.
(Note - that's what the /IGNORE=INTERLOCK qualifier directs BACKUP to do.)

Regards,
Murali
Let There Be Rock - AC/DC
Hein van den Heuvel
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Intriguing! Repeatable (IO) errors are rare.

Any issues with backup/ignore=interlock would certainly not be readily repeatable and typically do not end in check-byte errors, but bucket pointers beyond EOF and such.
Anyway, the test setup suggests to me that a dedicated file is used for testing.... otherwise copy would not work, would it now?

You indicate a medium sized file. So after a first copy and until a reboot that might just live entirely in the XFC cache. But then backup does novcache IO's. That means it will not cause blocks to be loaded into the cache, but I believe it may use previously used cached blocks.

Next time... how about :
$ SHOW MEM/CACHE=FILE=SRC-FILE around copy and backup?

Is there host-based shadowing in play?

>> Backup and Copy must do some things differently internally.

Yes, copy just double buffers and processes VBN after VBN. Backup can look at LBNs and SORT the file to minimize head movement (assuming old-style disk behaviour). Backup will also issue many read IOs, up to process quotas, making for a much more intense load.
Are you using the old, recommended backup quotas for the process running backup? Those were just crazy! ASTLM/DIRIOLM=4096 and the like? Nuts!

Was the SRC-FILE heavily fragmented? Backup would know, copy would not.

Finally... with the bucket check byte out of phase, and the easy access to the original data, did you take a peek at the corruption?

I like a DUMP /BLOCK=(START=vbn-with-error-minus-one, COUNT=bucket-size-plus-two)
So I'd dump one block before the reported VBN (in case data runs into the bucket from below), and one block after the bucket, in case data spilled over. Admittedly that's more for internal RMS issues.
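The arithmetic behind those START and COUNT values can be sketched as follows. This is only an illustration of the "one block before, one block after" rule; the function name is hypothetical, and the bucket size used in the usage note is an assumed value, since the thread does not state the file's bucket size.

```python
def dump_range(error_vbn: int, bucket_blocks: int):
    """Return (start, count) covering the bucket plus one block on each side.

    error_vbn is the 1-based VBN reported for the bucket and bucket_blocks
    is the bucket size in blocks. The start is clamped at VBN 1, in which
    case there is no preceding block and the count shrinks by one.
    """
    if error_vbn < 1 or bucket_blocks < 1:
        raise ValueError("error_vbn and bucket_blocks must be >= 1")
    start = max(1, error_vbn - 1)
    last = error_vbn + bucket_blocks  # the block just after the bucket
    return start, last - start + 1
```

With the reported VBN 341 and an assumed bucket size of 4 blocks, `dump_range(341, 4)` gives a start of 340 and a count of 6, which are the values to plug into START and COUNT.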

At least do an ANALYZE/RMS/INT ... POS/BUCK=341 both in SRC-FILE and BACKUP-FILE.

Good luck!
Hein
Jim Geier_1
Regular Advisor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

There were no application users on the system when the test was done -- they were locked out, and no processes were accessing the file other than the process copying and then using backup to copy the file. The exact backup command was:
$ BACKUP/LOG

ERRLOG.SYS was analyzed several times with HP hardware and OpenVMS support -- there was no indication of any hardware errors at all. We emphasized this because we were certain that the problem was a hardware problem.

We cannot reproduce the problem now -- since the systems were rebooted, there have been no errors.
John Gillings
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Jim,

A quick experiment with a small indexed file... it looks like COPY just shovels bits, like BACKUP, treating the file in block mode.

However, I got a difference in the resulting EOF - remember that EOF doesn't make a whole lot of sense for an indexed file, so, in theory it shouldn't matter. In my case the BACKUP file EOF was before the COPY file EOF, but I don't suppose there's any reason why it might be the other way around for a different input file.

I reset the attributes of the files to sequential fixed so I could compare the bytes with DIFF:

$ set file/attr=(rfm:fix,mrs:512,org:seq,lrl:512)

All I saw were apparent extra null characters in the COPY file.

Try comparing F$FILE(file,"EOF") and F$FILE(file,"FFB") between the files.

Are the VBNs reporting errors near the physical end of file allocation? Could they be "junk" at the end of the file, after the apparent EOF for the COPY file, but not for the BACKUP file?

Try SET FILE/ATTR=(EBK:eof-of-copy) file.BACKUP

Does that change anything?

If your values are different, perhaps it's a bug that COPY and BACKUP get different values for a field that shouldn't be applicable to the file type, and/or that the difference possibly affects ANALYZE/RMS, but which utility is at fault?

It all begs the question... apart from the errors called out by ANALYZE, are there any other symptoms? If you read the files with your application are there any errors, or anything that looks like corruption?

What happens if you now CONVERT both the COPY and BACKUP files? Are the resulting files (logically) identical and error free? Any error messages?
A crucible of informative mistakes
P Muralidhar Kini
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Hi Hein,

>> That means it will not cause blocks to be loaded into the cache, but I believe
>> it may use previously used cached blocks.

BACKUP:
The I/Os issued by BACKUP are NOVCACHE I/Os, so both the read and write
I/Os issued by BACKUP skip the XFC cache.
Even if the file is already in the XFC cache, BACKUP ends up fetching the
data from the disk.

COPY:
In the case of COPY, the I/Os go through the XFC cache and pick up the
data from the XFC cache if it is already there.

What if the data is present in the RMS local buffers (or global buffers, if those
are enabled)? Then the request to fetch the data may be satisfied from the RMS
cache itself. COPY may get it from the RMS cache, but what about BACKUP?

>> Next time... how about :
>> $ SHOW MEM/CACHE=FILE=SRC-FILE around copy and backup?
Yes, best way to confirm whether the IO goes through the XFC cache or not
would be to look at the above XFC statistic.

Are you suggesting that the data in cache (XFC or RMS cache) may have
the problem while the data on disk is OK, and that this is the reason why COPY
works but BACKUP does not?

Regards,
Murali
Let There Be Rock - AC/DC
Volker Halle
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Jim,

If this happens again, consider using DUMP/BL=(COUNT:n,START:vbn) (n=bucket size) to compare the contents of the blocks in both the SRC-FILE and BACKUP-FILE. As BACKUP copies BLOCKS of data, both blocks must be in identical positions within both files. When you do this comparison for multiple blocks, you might be able to find out the 'extent of corruption'. Note that there may well be more bytes corrupted than ANAL/RMS will see!
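A rough sketch of the same block-by-block comparison done programmatically, assuming both files are available as raw block-mode images of 512-byte blocks (for example, after a binary transfer to another machine). The function name is hypothetical, and this is not a substitute for DUMP on the system itself; it only lists which VBNs differ so the extent of any corruption is easier to see.

```python
BLOCK_SIZE = 512  # bytes per disk block


def differing_vbns(src: bytes, copy: bytes):
    """Return the 1-based VBNs at which the two images differ.

    If one image is longer than the other, every block past the end of
    the shorter image is reported as different.
    """
    longest = max(len(src), len(copy))
    nblocks = (longest + BLOCK_SIZE - 1) // BLOCK_SIZE
    result = []
    for i in range(nblocks):
        lo, hi = i * BLOCK_SIZE, (i + 1) * BLOCK_SIZE
        if src[lo:hi] != copy[lo:hi]:
            result.append(i + 1)
    return result
```

An empty list means the two images are identical block for block; a run of consecutive VBNs points at a damaged region rather than a single bad byte.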

Volker.

Shriniketan Bhagwat
Trusted Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Jim,

It would be interesting to check the characteristics of the file.

$ dir/full SRC-FILE

Also, is the error count on the disk increasing for any operation done on the disk?

Regards,
Ketan