Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Jim Geier_1
Regular Advisor

I help manage a two-system cluster (AlphaServer ES40 systems) running OpenVMS 7.3-2 and patched up to Update V6. The cluster is seldom rebooted, perhaps every 10 months or so. The application on this cluster uses indexed RMS files. The system had been up for 320 days and it was discovered that a few of the RMS files had become corrupted. The storage is an MSA1000 with LUNs built using the ADG format. Working with HP hardware support and OpenVMS support, no hardware errors (storage, memory, other hardware) or problems could be found. In fact, after being up for 320 days, SHOW ERROR only showed a very small number of errors on the network adapter and 1 error on the diskette drives. In the diagnosis of the possible causes, the following was noticed with a medium-sized RMS file:

Step 1. Analyze/RMS/Check SRC-FILE -- No errors
Step 2. Copy SRC-FILE to COPY-FILE
Step 3. Backup SRC-FILE to BACKUP-FILE
Step 4. Analyze/RMS/Check COPY-FILE -- No errors
Step 5. Analyze/RMS/Check BACKUP-FILE -- Errors occurred
The output showed several errors of the type:
VBN 341: Bucket check byte is out of phase

This was repeated consistently a few times, using different disks and directories.

Step 6. One system rebooted
Step 7. Entire sequence repeated with no errors
Step 8. Other system rebooted

Because of time constraints, we did not try the test on the system that had not been rebooted after the first was rebooted. My understanding is that such errors often indicate a hardware problem, but none can be found.

Backup and Copy must do some things differently internally. What might cause this problem? How can we prevent this from occurring again?
25 REPLIES
P Muralidhar Kini
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Hi Jim,

>> The system had been up for 320 days and it was discovered that a few of
>> the RMS files had become corrupted.
How was this corruption detected?
With the ANALYZE/RMS/CHECK command?
Or was VERIFY (ANAL/DISK/LOCK) run on the disk, to check for any other errors
on the disk?

>> Step 1. Analyze/RMS/Check SRC-FILE -- No errors
>> Step 2. Copy SRC-FILE to COPY-FILE
>> Step 3. Backup SRC-FILE to BACKUP-FILE
>> Step 4. Analyze/RMS/Check COPY-FILE -- No errors
>> Step 5. Analyze/RMS/Check BACKUP-FILE -- Errors occurred
>> The output showed several errors of the type:
>> VBN 341: Bucket check byte is out of phase

The steps here indicate that ANALYZE/RMS/CHECK of the source file reported
no errors. When the file is copied using COPY, the destination file has no errors,
but when the file is copied using BACKUP, the destination file has a set of
errors reported.

BACKUP and COPY do things a little differently, but it is still
interesting to note that a file copied using BACKUP gives an error.

Are all other errors similar to "VBN 341: Bucket check byte is out of phase"
or are there any other errors?

>> My understanding is that such errors often indicate a hardware problem,
>> but none can be found.
Have you checked the ERRLOG.SYS for any hardware problems reported
corresponding to the disk?

Regards,
Murali
Let There Be Rock - AC/DC
John Gillings
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Jim,

Was your Step 3 BACKUP/IGNORE=INTERLOCK? If not could you please post the exact command and any output? (also sanity check any possible symbol definitions in the command)

>VBN 341: Bucket check byte is out of phase

The first and last bytes of a bucket are the check bytes. They're kept at the same value, but incremented each time a bucket is rewritten. If they're out of phase (ie: different), that implies an I/O rewriting the bucket was interrupted. That shouldn't happen if everything is correctly interlocked at the RMS level.
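
The check-byte rule described above can be illustrated with a small sketch. It is written in Python rather than DCL so that it can be run anywhere, and it rests on assumptions that are not from this thread: the file contents are available as raw bytes (for example from a block-mode copy), the bucket starts at the VBN reported by ANALYZE, and the bucket size in blocks is known. The function names are hypothetical.

```python
BLOCK_SIZE = 512  # bytes per disk block


def bucket_bytes(data: bytes, vbn: int, bucket_blocks: int) -> bytes:
    """Return the raw bytes of the bucket starting at a 1-based VBN.

    Assumption: the caller knows the bucket size in blocks.
    """
    start = (vbn - 1) * BLOCK_SIZE
    return data[start:start + bucket_blocks * BLOCK_SIZE]


def check_bytes_in_phase(bucket: bytes) -> bool:
    """True when the first and last bytes of the bucket hold the same value."""
    return len(bucket) > 0 and bucket[0] == bucket[-1]


if __name__ == "__main__":
    # Synthetic two-block bucket whose check bytes disagree (7 versus 8),
    # as they would if a rewrite of the bucket had been interrupted.
    torn = bytes([7]) + bytes(1022) + bytes([8])
    print(check_bytes_in_phase(torn))
```

A mismatch found this way only says that the two ends of the bucket come from different writes; it does not say which end is the stale one.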

>Backup and Copy must do some things
>differently internally

For an indexed file, I suspect COPY and BACKUP are VERY different. BACKUP just shovels bits, where COPY may create an empty file with the same attributes as the original, then reads records from the source and inserts them into the destination. You'll need to look at the source to see for sure.

If that's the case, the original and BACKUP file should be bit for bit identical, but the COPY file could be very different because the records may be inserted in a different sequence.

What does CONVERT SRC-FILE CVT-FILE do?
A crucible of informative mistakes
P Muralidhar Kini
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Hi Jim,

To rule out any possibility of disk corruption (such as multiply allocated
blocks) affecting files on the disk, execute
$ANAL/DISK/LOCK
against the disk on which you have the corrupted indexed files.

However, from the steps that you have performed, the errors were
introduced by the BACKUP command.

As John has pointed out, it would be interesting to know the exact BACKUP
command you used to perform the copy operation. It could turn out that
the file was in use when you were performing the BACKUP operation and the
BACKUP was performed using "/IGNORE=INTERLOCK", causing the file to be
copied while it was in an inconsistent state
(Note - that's what the /IGNORE=INTERLOCK qualifier directs BACKUP to do).

Regards,
Murali
Let There Be Rock - AC/DC
Hein van den Heuvel
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Intriguing! Repeatable (IO) errors are rare.

Any issues with backup/ignore=interlock would certainly not be readily repeatable and typically do not end in check-byte errors, but bucket pointers beyond EOF and such.
Anyway, the test setup suggests to me that a dedicated file is used for testing.... otherwise copy would not work, would it now?

You indicate a medium sized file. So after a first copy and until a reboot that might just live entirely in the XFC cache. But then backup does novcache IO's. That means it will not cause blocks to be loaded into the cache, but I believe it may use previously used cached blocks.

Next time... how about :
$ SHOW MEM/CACHE=FILE=SRC-FILE around copy and backup?

Is there host-based shadowing in play?

>> Backup and Copy must do some things differently internally.

Yes, COPY just double-buffers and processes VBN after VBN. BACKUP can look at LBNs and SORT the file to minimize head movement (assuming old-style disk behaviour). BACKUP will also issue many read IOs, up to process quotas, making for a much more intense load.
Are you using the old recommended backup quotas for the process running BACKUP? Those were just crazy! ASTLM/DIRIOLM=4096 and the like? Nuts!

Was the SRC-FILE heavily fragmented? Backup would know, copy would not.

Finally... with the bucket check byte out of phase, and the easy access to the original data, did you take a peek at the corruption?

I like a DUMP /BLOCK=(START=vbn-with-error-minus-one, COUNT=bucket-size-plus-two)
So I'd dump one block before the reported VBN (in case data runs into the bucket from below), and one block after the bucket, in case data spilled over. Admittedly that's more for internal RMS issues.

At least do an ANALYZE/RMS/INT ... POS/BUCK=341 both in SRC-FILE and BACKUP-FILE.

Good luck!
Hein
Jim Geier_1
Regular Advisor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

There were no application users on the system when the test was done -- they were locked out, and no processes accessing the file other than the process copying and then using backup to copy the file. The exact backup command was:
$ BACKUP/LOG

ERRORLOG.SYS was analyzed several times with HP hardware and OpenVMS support -- there was no indication of any hardware errors at all. We emphasized this because we were certain that the problem was a hardware problem.

We cannot reproduce the problem now -- since the systems were rebooted, there have been no errors.
John Gillings
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Jim,

A quick experiment with a small indexed file... it looks like COPY just shovels bits, like BACKUP, treating the file in block mode.

However, I got a difference in the resulting EOF. Remember that EOF doesn't make a whole lot of sense for an indexed file, so in theory it shouldn't matter. In my case the BACKUP file EOF was before the COPY file EOF, but I don't suppose there's any reason why it couldn't be the other way around for a different input file.

I reset the attributes of the files to sequential fixed so I could compare the bytes with DIFF:

$ set file/attr=(rfm:fix,mrs:512,org:seq,lrl:512)

All I saw were apparent extra null characters in the COPY file.

Try comparing F$FILE(file,"EOF") and F$FILE(file,"FFB") between the files.

Are the VBNs reporting errors near the physical end of file allocation? Could they be "junk" at the end of the file, after the apparent EOF for the COPY file, but not for the BACKUP file?

Try SET FILE/ATTR=(EBK:eof-of-copy) file.BACKUP

Does that change anything?

If your values are different, perhaps it's a bug that COPY and BACKUP get different values for a field that shouldn't be applicable to the file type, and/or that the difference possibly affects ANALYZE/RMS, but which utility is at fault?

It all begs the question... apart from the errors called out by ANALYZE, are there any other symptoms? If you read the files with your application are there any errors, or anything that looks like corruption?

What happens if you now CONVERT both the COPY and BACKUP files. Are the resulting files (logically) identical and error free? Any error messages?
A crucible of informative mistakes
P Muralidhar Kini
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Hi Hein,

>> That means it will not cause blocks to be loaded into the cache, but I believe
>> it may use previously used cached blocks.

BACKUP:
The IOs issued by BACKUP are NOVCACHE IOs. Hence both the
read and write IOs issued by BACKUP skip the XFC cache.
Even if the file is already in the XFC cache, BACKUP ends up fetching the
data from the disk.

COPY:
In the case of COPY, though, the IOs go through the XFC cache and pick up the
data from the XFC cache if it's already there.

What if the data is present in the RMS local buffers (or global buffers, if those are
enabled)? Then the request to fetch the data may be satisfied from the RMS
cache itself. COPY may get it from the RMS cache, but what about BACKUP?

>> Next time... how about :
>> $ SHOW MEM/CACHE=FILE=SRC-FILE around copy and backup?
Yes, best way to confirm whether the IO goes through the XFC cache or not
would be to look at the above XFC statistic.

Are you suggesting that the data in cache (XFC or RMS cache) may have
the problem while the data on disk is OK, and that this is the reason why COPY
works but BACKUP does not?

Regards,
Murali
Let There Be Rock - AC/DC
Volker Halle
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Jim,

if this happens again, consider using DUMP/BL=(COUNT:n,START:vbn) (n=bucket size) to compare the contents of the blocks in both the SRC-FILE and BACKUP-FILE. As BACKUP copies BLOCKS of data, the blocks must be in identical positions within both files. When you do this comparison for multiple blocks, you might be able to find out the 'extent of corruption'. Note that there may well be more bytes corrupted than ANAL/RMS will see!
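
The block-by-block comparison suggested here can also be sketched outside of DUMP. The following is a minimal Python illustration, assuming the raw contents of both files are available as byte strings (for example after transferring them in block mode); the function names are hypothetical and nothing here reflects how DUMP or BACKUP work internally. It lists the 1-based VBNs whose 512-byte blocks differ and folds them into ranges, which gives a rough picture of the 'extent of corruption'.

```python
BLOCK_SIZE = 512  # bytes per disk block


def differing_vbns(a: bytes, b: bytes) -> list:
    """Return the 1-based VBNs at which the two byte images differ."""
    n_blocks = (max(len(a), len(b)) + BLOCK_SIZE - 1) // BLOCK_SIZE
    diffs = []
    for i in range(n_blocks):
        lo = i * BLOCK_SIZE
        # A block missing from the shorter image counts as a difference.
        if a[lo:lo + BLOCK_SIZE] != b[lo:lo + BLOCK_SIZE]:
            diffs.append(i + 1)
    return diffs


def as_ranges(vbns: list) -> list:
    """Fold a sorted VBN list into inclusive (first, last) ranges."""
    ranges = []
    for v in vbns:
        if ranges and v == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], v)
        else:
            ranges.append((v, v))
    return ranges


if __name__ == "__main__":
    src = bytes(BLOCK_SIZE * 5)
    bad = bytearray(src)
    bad[BLOCK_SIZE * 1 + 10] = 1  # damage inside VBN 2
    bad[BLOCK_SIZE * 2] = 1       # damage inside VBN 3
    print(as_ranges(differing_vbns(src, bytes(bad))))
```

Comparing whole blocks rather than only the check bytes is what lets this catch damage that ANAL/RMS would not report.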

Volker.

Shriniketan Bhagwat
Trusted Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Jim,

It would be interesting to check the characteristics of the file.

$ dir/full SRC-FILE

And also are there any error counts getting increased on the disk for any operation done on the disk?

Regards,
Ketan
Hein van den Heuvel
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Murali>> What if the data is present in the RMS Local buffers(or global buffers if thats
enabled). Then the request to fetch the data may get satisfied from the RMS
cache itself. COPY may get it from RMS cache but what about BACKUP?

RMS buffers do not come into play.
Local buffers come and go with the open/close.
So a fresh COPY run would open the file fresh and get its own, fresh, local buffers.
Global buffers are only active for record IO under non-readonly sharing. That does not apply to COPY nor BACKUP.

Murali>> Even if the file is already in the XFC cache, BACKUP would end up fetching the
data from the disk.

Thanks for confirming.

Volker>> consider to use DUMP/BL

Right, I mentioned that also.
Need to see what it looks like, especially if you know what it should look like (SRC-FILE)

Hein





Volker Halle
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Jim,

testing with SYSUAF.DAT (as an example) seems to show that COPY does about the same kind of operations as BACKUP. This indexed file can be copied with COPY, and BACKUP/COMPARE shows no differences between the original file and the copied file, indicating that COPY indeed did a block-by-block copy of the file. If you really want to know, use the IO$SDA extension and trace the IOs.

Trying to answer your question: How can we prevent this from occurring again?

You can NOT prevent this problem from recurring if you don't UNDERSTAND the problem! To understand this problem, you need to do more intense ANALYSIS. Now that the problem is gone, you would want to do more experiments to better understand what could have gone wrong and be better prepared for the next occurrence.

Some of the answers in here may help you prepare...

Volker.
Shriniketan Bhagwat
Trusted Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Hi,

My understanding is that BACKUP will use the RMS buffers only if that is explicitly enabled through the $ set rms/buffer command.
It would be interesting to check whether there are any differences between the original file, the backup file and the copy file:

$ diff SRC-FILE BACKUP-FILE
$ diff SRC-FILE COPY-FILE

Regards,
Ketan
P Muralidhar Kini
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Hein,
>> Local buffers come and go with the open/close.
>> So a fresh copy run would open the file fresh and gets its own, fresh,
>> local buffers.
>> Global buffer are only active for record IO under non-readonly sharing.
>> That does not apply to COPY nor BACKUP

Record IO would go through RMS and hence would use the RMS global
buffers. Block IO would not go through RMS and hence would not use the RMS
global buffers. Thanks for this information.

If we want to rule out any involvement of XFC caching:
Currently,
COPY uses the XFC cache
BACKUP skips the XFC cache

To make COPY also skip the XFC cache, either:
1) SET FILE /CACHING=NO_CACHING
XFC caching is disabled on the file.

2) Set the dynamic SYSGEN parameter VCC_MAX_IO_SIZE = 0
This would mean all subsequent IOs larger than size 0 skip the XFC
cache. This applies to all IOs in the system.

Issue the COPY command and observe its behavior.
After issuing the COPY command, revert the no-cache settings.

If the problem is reproducible then we can use this method to rule out any
involvement of XFC cache.

Regards,
Murali
Let There Be Rock - AC/DC
Volker Halle
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Ketan,

BACKUP only uses RMS to write/read the SAVESET (on disk), but NOT to read or write individual files; it uses QIOs for that purpose.

SET RMS/BUFFER does NOT enable or disable something, it changes default values for the RMS multibuffer count.

Volker.
Volker Halle
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Murali,

why do we enter a discussion of XFC here ? It's the backup operation that seems to produce a corrupt file, not the COPY operation.

And we agree, that BACKUP is using IO$M_NOVCACHE ...

Volker.
Hein van den Heuvel
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Volker>> why do we enter a discussion of XFC here ?

That was me. All the other methods take blocks from the XFC, and BACKUP does not. Therefore that is a potential source of discrepancy. But it would mean that the original on disk was bad, which was not the case, so forget that angle.

But before we forget that angle...
Murali>> To make COPY also skip XFC cache,...

Another simple way to make COPY (V8.3) skip the cache is to use /BLOCK=256 ... or any other value larger than SYSGEN VCC_MAX_IO_SIZE.

Volker>> COPY does about the same kind of operations than BACKUP.

It's all in the listings, but it might be a little interesting to lay out a carefully fragmented file on an LD device and trace the IOs for COPY and BACKUP. I'm not sure when/if BACKUP uses LBN access for its copy option.

Hein
Volker Halle
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Hein,

before wading through the [BACKUP] source listings, I did a simple experiment:

$ COPY some-large-file somewhere-else:

$ BACKUP some-large-file somewhere-else:

The nice SDA> XFC show file/id=^d/br command proved beyond doubt, that COPY is using XFC, whereas BACKUP is NOT !

Volker.
P Muralidhar Kini
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Hein,

>> But before we forget that angle...
>> An other simple way to make copy V8.3 skip the cache is to use
>> /BLOCK=256 ... or any other value larger than SYSGEN VCC_MAX_IO_SIZE.
Nice. Using the COPY/BLOCK would be more suited to make COPY skip the
cache than changing the VCC_MAX_IO_SIZE parameter. Thanks for sharing
this information.

>> But it would mean that the original on disk was bad which was not
>> that case, so forget that angle
That's right. This would probably rule out the cache/disk data
being out of sync. The problem has something to do with the way COPY
and BACKUP do their file copy operations.

Regards,
Murali
Let There Be Rock - AC/DC
Jim Geier_1
Regular Advisor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

There is a lot of good information to wade through here. Thanks to everyone.

One thing to know is that the XFC is off on this cluster (VCC_FLAGS = 0), so while the XFC discussion is interesting and useful information, the XFC does not come into play at all in this particular situation.

Global buffers are used by the application, so I assume that global buffers are set for this file.
Cass Witkowski
Trusted Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Are you using an LD disk? If so, it needs to be mounted with /NOCACHE.
GuentherF
Trusted Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

Unless it has been changed, BACKUP does not re-order QIOs by VBN unless /IMAGE or /FAST is used.

Cheers,
Guenther
GuentherF
Trusted Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

I doubt that any "virtual" read would bypass the cache if the disk block is in the XFC cache. As far as I remember, IO$M_NOVCACHE was for writes only.

But I don't have the resources to confirm that this is truly the case.

Cheers,
Guenther
Cass Witkowski
Trusted Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)

All I know is that when we tried to use BACKUP to copy a directory tree on an LD disk to another tree on the same disk, it would not report an error but it would not work correctly either. A search on Google pointed to the fact that LD devices should not be mounted with cache.

I don't know why, but there seems to be some interaction, or lack thereof, with XFC.

That is why I was asking if you were using an LD device.
Hein van den Heuvel
Honored Contributor

Re: Backup, Copy, RMS Errors (OpenVMS 7.3-2)



Cass, I realize you are trying to help, but IMHO you are needlessly and dangerously muddling the waters.

Kindly give us explicit data instead of handwaving.

Are you perhaps referring to:
http://h71000.www7.hp.com/doc/732final/6668/6668pro_005.html
"4.11 Logical Disk (LD) Utility: Error when Using RMS V7.3-2"
(included below)

That is explicit about using the LD device from 'two angles': as a file and as a container file system. That is not the general usage.
And it mentions that the file used as a container should not be cached, which is rather (completely) different from mounting an LD device with or without (XQP) cache.

I would go as far as saying that if there were (still) a known or suspected issue, then Jur would mention it on his site (http://www.digiater.nl/lddriver). Since he doesn't, there must not be an issue :-).

Doug Gordon also mentions this in : http://de.openvms.org/TUD2005/20_VMS_Hints_and_Kinks_Doug_Gordon.pdf.
It suggests this is an old, resolved issue, only to be mentioned while mentioning the exact versions used, but admittedly it is not explicit about that.
(I happened to have lunch with Doug today! :- Szechuan Chef! :-)

Cheers,
Hein




4.11 Logical Disk (LD) Utility: Error when Using RMS
V7.3-2

A feature of the Logical Disk (LD) utility can cause problems for end users who are using logical disks. This problem can occur on any version of OpenVMS that uses the LD utility.

The LD utility bypasses the cache for the container file when it is operating on the disk. If RMS is used to read or write to the container file, RMS will have stale data if the LD utility is used to connect to the file and subsequently to the logical disk that is being written.

This problem occurs mainly when the LD utility is used to create disk images that are to be burned onto a CD-ROM.

To work around the problem, execute the following DCL command to turn caching off for any file that will be used as a container file for the LD utility:


$ SET FILE/CACHING_ATTRIBUTE=NO_CACHING CONTAINER_FILE.DSK
This DCL command is not executed as part of the CDRECORD.COM command procedure. Therefore, if you reuse a container file for a logical disk created by CDRECORD.COM, be sure to turn off the caching using this command.