
Has backup/image/ignore=interlock become useless?

 
Korendyk
Advisor

Has backup/image/ignore=interlock become useless?

I have recently become aware of a possible problem with long-standing procedures used to provide system and data backups. I am currently investigating and performing tests, but thought I might raise the issue here in case someone has heard of a similar problem or might offer some insight during my investigations.

A simple site, with a system disk and a data disk, each a 2-drive shadow set. The procedures simply perform an image backup to tape using, for example, the following:

$ backup/ignore=(label,interlock)/image/verify -
dsa1: mkb600:dsa1.sav /media=compac/norewind

The backup is performed at idle times (no one logged in) and there is the occasional report of files marked for backup (all expected) and accessed for write (also expected). There appear to be no other errors or warnings reported, expected or not.
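
One way to check the "idle" assumption before the save starts is to record which files are open on the volume. This is only a sketch: the output file name DSA1_OPEN.LIS is made up for illustration, the check says nothing about files opened after it runs, and as far as I know it reports only files opened from the local node.

```dcl
$ ! Sketch only: DSA1_OPEN.LIS is a hypothetical file name.
$ ! Record the files currently open on the volume about to be saved.
$ SHOW DEVICE/FILES/OUTPUT=DSA1_OPEN.LIS DSA1:
$ TYPE DSA1_OPEN.LIS
```

Keeping this listing alongside the batch log makes it easier to match any later ACCONFLICT warnings against what was open at the start.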

The problem is that when the tape is examined, the saveset is valid, but there are MANY files missing. The first detected instance was that an early part of a directory is copied, but not the rest of the directory. Entire subdirectories are also missing. And there does not (yet) appear to be a pattern.

As I mentioned, I am continuing to investigate and will provide additional information as it becomes available. And of course I should mention... OpenVMS/Alpha V8.3 on a DS20.

I am looking to see if there are any patches that might apply. What prompted this message was something that I did discover. I came across the following line in the V8.3 documents describing the "/IGNORE qualifier":

"Also, because of the way BACKUP scans directories, any activity in a directory (such as creating or deleting files) can cause files to be excluded from the backup."

Now, if this is what is happening here, then I am not impressed. For something like this to happen without any warnings, errors, or even informational messages is not what I've come to expect from OpenVMS!! I've been using OpenVMS for a lot of years, and I don't ever remember reading this before. A quick scan of previous (pre-8.x) documents suggests they do not include this statement, so I have to assume it is recent.

It leaves me wondering about the "way BACKUP scans directories", and if it is known to "cause files to be excluded from the backup" then why wasn't it addressed?!

Anyways. Any suggestions or insights are welcome.

thnx
\bill
Hoff
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

That /IGNORE=INTERLOCK does not produce a reliable BACKUP is well known.

This detail has been in the OpenVMS FAQ for a very long time, and I've made myself somewhat of a nuisance on this topic (see the other thread going here in the forums, and see the comp.os.vms newsgroup) pointing out the risks of the qualifier.

Silent data corruptions.

There has been a request to get the hazards more clearly documented, and it looks like the risks have finally made it into the manuals. (The older documentation tends to presume you knew that the interlocks were present for a reason: to flag questionable data access.) This is the same basic reason why there's been a longstanding recommendation to use standalone BACKUP (OpenVMS VAX), or to boot the CD (OpenVMS Alpha), the DVD (OpenVMS I64), or another system disk, to get a backup of an OpenVMS system disk.
Korendyk
Advisor

Re: Has backup/image/ignore=interlock become useless?

Hi Hoff,

Thanks. Yes, I am well aware of the silent data corruptions possible with /ignore=interlock. I have dealt with it on many a system recovery. My concern is not with files being corrupted, since those are identified when the saveset is created, and can (should) be appropriately handled in the rare event of a recovery.

My concern is that files simply do not appear in the saveset. And when I said many, I mean hundreds of small data files. Directories in the saveset contain varying numbers of the files that they should: many are there, many are not. Some entire directories are missing. And none of the files are open during the backup.

It is a puzzler. Sadly, backups are slow, idle time is rare, and so testing is a tad tedious.

\bill
John Gillings
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

\bill,

BACKUP/IGNORE=INTERLOCK has always been useless. Indeed, any attempt to BACKUP an active disk is mostly useless. This is not a fault in BACKUP, it's a fault in expectations.

There are many ways that changes in a directory could prune off large branches in the directory tree, with no way to guarantee it will even be detected. There are many ways files can change between the start of a backup operation and the completion. Some are detectable as potentially affecting the state of the backup, some are not.

BACKUP/IMAGE is really only useful for saving and restoring a static system disk. Any potentially changing files need to be saved independently. Any application data needs to be handled by the application, NOT the operating system. Only the application can know when the data is in a quiescent state. Backup should be an architecturally integral part of any serious application.

This is not the fault of OpenVMS or any other operating system, it's a simple issue of time. Things change many orders of magnitude faster than state can be saved, so it's simply not possible, even in theory to have a generic, covers-all-cases mechanism for creating a backup that can be restored with the system in a guaranteed known state.

There's an OpenVMS Technical Journal article (in V1?) covering some of the issues. The take home message is stop thinking in terms of getting the data off the system. Turn it around, think about how you will restore your system if something fails, work out what you'll need and work backwards to figure out how to save it.
A crucible of informative mistakes
Jan van den Ende
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

(sorry if double post; previous try seemingly failed)

bill,

of course, Hoff and John G. are very right!

Yet, your situation need not be as bleak as it obviously is now. And implicitly John G. indicated such:

>>>
or any other operating system, it's a simple issue of time.
<<<

And the main reason for my much more optimistic view you gave yourself:

>>>
each a 2-drive shadow set.
<<<

So, if you dismount one member of the set, mount that (process-private to avoid label conflict), and backup THAT drive, you will have brought the time issue down to only those activities that modify different locations on disk, and have already started but not yet finished.

Orders of magnitude less likely than such changes between reading a directory and processing what has to be done according to that info. Or processing a (database, RMS, ...) index and processing the associated data. Or ... (any non-atomic activity or activity involving different disk locations.)

And HostBasedMiniMerge is fully integrated into VMS (patched 7.3-2 and) 8.x, so any pre-existing issues with merge performance have vanished.

Bottom line: modify your backup to profit from shadowing, and 99% +++ of your issues are past.

Success.

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Hoff
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

You're certainly welcome to log a direct problem report if there's a support contract around. I would personally doubt you're going to get traction with HP via ITRC, given the known restrictions around this particular command.

>Thanks. Yes, I am well aware of the silent data corruptions possible with /ignore=interlock. I have dealt with it on many a system recovery. My concern is not with files being corrupted, since those are identified when the saveset is created, and can (should) be appropriately handled in the rare event of a recovery.

I'd have to assume you're not familiar with /IGNORE = INTERLOCK because you're (still) using it. (I thought it was bad and was discussing getting the badness better documented, and while talking with the then-current maintainers of the BACKUP utility, I realized I hadn't understood half of the possible badness here.)

>My concern is that files simply do not appear in the saveset. And when I said many, I means hundreds of small data files. Directories in the saveset contain varying numbers of the files that they should: many are there, many are not. Some entire directories are missing. And none of the files are open during the backup.

Those interlocks were designed and implemented for a reason. (The same sort of model holds with the cluster quorum scheme; it wasn't implemented to cause folks boot or run-time problems, that stuff was implemented to prevent data corruptions.)

I'm not sure which I'd consider better here: entirely missing, or silently corrupt.

>It is a puzzler. Sadly, backups are slow, idle time is rare, and so testing is a tad tedious.

How to split an OpenVMS software RAID-1 shadowset volume is in the host-based volume shadowing manual, IIRC. That (greatly) reduces the window, but you can still have the potential for inconsistency corruptions.

With OpenVMS, the only way this archival stuff can be done (reliably) is either with the applications quiescent, or with application-integrated archival support. BACKUP /IGNORE=INTERLOCK can't reliably copy a system disk (which is how I realized there were problems early on), and HBVS might (though this is usually rare, we are looking at enterprise applications) miss part of a multi-block or cached or inflight change. (StorageWorks disks could drop multiblock writes; that was the reason that the shelves and the controllers could optionally have batteries.)

I get reasonably good and consistent backups off the local OpenVMS and Unix databases because I use the databases, and because the databases have archival support. The applications - the databases, in this case - have archival processing integrated.
Robert Gezelter
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

Bill,

The issues here are as Hoff, John, and Jan have mentioned.

An amplification on what Jan commented about the RAID set, however, is in order. Actually, it is a combination of something John and Hoff noted and the splitting of the RAID set.

Often, the best solution to backup issues is to add a scratch volume to the RAID set, temporarily increasing it (in this case, from two to three members). When the three members are fully up-to-date, disconnect the third member, remount it privately with writes disabled and make the backup from the private copy (/IGNORE=INTERLOCK will not be necessary).

However, one must be careful that the volume is quiescent when disconnecting the temporary shadow set member. If a directory is being updated at the precise instant that the disconnect is happening, the disconnected shadow set member will also have the directory in an inconsistent state. There is no magic here.

That said, the pause in system activity is straightforward to architect, because the disconnect can be done very quickly.
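
The sequence described above might look roughly like the following sketch. The scratch device DKA300: and the volume label DATA1 are hypothetical, the tape drive and BACKUP qualifiers are simply carried over from the command in the original post (minus INTERLOCK), and the MOUNT and DISMOUNT details should be checked against the Volume Shadowing manual for your version before relying on them.

```dcl
$ ! Sketch only: DKA300: (scratch disk) and DATA1 (volume label) are hypothetical.
$ ! 1. Add the scratch disk as a third member; a copy to it begins.
$ MOUNT/SYSTEM DSA1: /SHADOW=(DKA300:) DATA1
$ ! 2. Check the display until the copy to the new member has finished.
$ SHOW DEVICE DSA1:
$ ! 3. With the application paused, remove the third member from the set.
$ DISMOUNT DKA300:
$ ! 4. Mount it process-private and read-only (no /SYSTEM, so no label conflict).
$ MOUNT/NOWRITE DKA300: DATA1
$ ! 5. Image backup from the now-static copy; no interlock override is needed.
$ BACKUP/IGNORE=LABEL/IMAGE/VERIFY -
  DKA300: MKB600:DSA1.SAV /MEDIA=COMPAC/NOREWIND
$ DISMOUNT DKA300:
```

The only step that needs the volume quiescent is step 3, which is why the pause can be kept short.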

Often, what allows people to "get away" with backing up system volumes with /IGNORE=INTERLOCK is that they "know" that only a small set of files on THEIR system volume are actually ever modified (e.g., SYSUAF, error logs), and they take separate steps to preserve those files (e.g., using CONVERT/SHARE and other utilities).

- Bob Gezelter, http://www.rlgsc.com
Hoff
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

To further Bob G's approach....

As for backing up the system disk on a regular schedule, I usually don't bother with that.

No point, really.

I do back up the system disk once in a while (after ECO kits or upgrades, or significant configuration changes), but I do archive the core files (see the SYLOGICALS.TEMPLATE file) regularly.

But the system disk in most OpenVMS configurations doesn't change all that often.
comarow
Trusted Contributor

Re: Has backup/image/ignore=interlock become useless?

All this just demonstrates how versatile
and ahead of its time Host Based Shadowing
is.

Simply removing a disk and backing it up,
with host based mini merge, it should
go back quickly.

With the flexibility of adding and removing members, these operational issues have a simple solution.

That said, I have restored 100s of systems
backed up with /ignore=interlock and frankly,
they've always worked, though I always point out it's unsupported.

EMC says host based shadowing is obsolete, but it has nothing that solves operational
issues like host based shadowing. Me thinks it is they just don't want to bother coding
a long word.

If you want active backups, they generally
are part of an application, like Oracle has its own backup, and RDB, with transaction
journals and such.

Bob
Martin Hughes
Regular Advisor

Re: Has backup/image/ignore=interlock become useless?

FWIW, note that MINIMERGE (HBMM) is being referenced in some of the above responses where I believe MINICOPY is what is meant.
For the fashion of Minas Tirith was such that it was built on seven levels, each delved into a hill, and about each was set a wall, and in each wall was a gate. (J.R.R. Tolkien). Quote stolen from VAX/VMS IDSM 5.2
Jon Pinkley
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

Bill,

I haven't seen any responses that try to explain the behavior you saw. Specifically, why any backup made with /FAST (which /IMAGE implies) would have files skipped by a backup/ignore=interlock. In fact, to me this sounds more like the effect of a backup/image without /ignore=interlock used on a disk with open files.

We still use 7.3-2 for production, and I have never seen the problem you describe.

I am not sure what the warning in the 8.3 documentation is really warning about. Can you provide a reference to the warning in the 8.3 documentation? I was unable to find it in the BACKUP chapter of the "HP OpenVMS System Management Utilities Reference Manual"

What does not make sense to me is that files would be missed in an image backup, since a /FAST file scan is implied, and this scans the INDEXF.SYS file and generates a list of FIDs to backup

This is what the BACKUP chapter of "HP OpenVMS System Management Utilities Reference Manual: A-L"

http://h71000.www7.hp.com/doc/83final/6048/ovms_83_sysman_util1.pdf

says about /IGNORE=INTERLOCK

Command Qualifier
Specifies that a BACKUP save or copy operation will override restrictions placed on files or will not perform tape label processing checks.

Note

--------------------------------------------------------------------------------
File system interlocks are expressly designed to prevent data corruptions, and to allow applications to detect and report data access conflicts.
Use of the INTERLOCK keyword overrides these file data integrity interlocks. The data that BACKUP subsequently transfers can then contain corrupted data for open files. Also, all cases in which these data corruptions can occur in the data that BACKUP transfers are not reliably reported to you; in other words, silent data corruptions are possible within the transferred data.
--------------------------------------------------------------------------------

INTERLOCK Processes files that otherwise cannot be processed due to file access conflicts. Use this option to save or copy files currently open for writing. No synchronization is made with the process writing the file, so the file data that is copied might be inconsistent with the input file, depending on the circumstances (for example, if another user is editing the file, the contents might change). When a file open for writing is processed, BACKUP issues the following message:

%BACKUP-W-ACCONFLICT, 'filename' is open for write by another user.

The INTERLOCK option is especially useful if you have files that are open so much of the time that they might not otherwise be saved. The use of this option requires the user privilege SYSPRV, a system UIC, or ownership of the volume.
See the Note before this table for more information about this keyword

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Unfortunately, there is no information about what conditions are necessary for the "silent data corruptions" to occur. I have tried to create a case where a file is open for write while BACKUP is backing it up without getting a "%BACKUP-W-ACCONFLICT, 'filename' is open for write by another user." message, and I have not been successful. Just because I am not able doesn't mean it isn't possible, but if it is, then why can't someone give us a reproducer, instead of just repeating the "silent corruption" dogma? I don't consider corruption of a file for which a warning message was issued, stating that the file is open for write by another user, to be "silent corruption".

If you do a backup/list/out=files.lis of the backup saveset, does this list not have the files? In an image backup, directory files are copied as is, so it is theoretically possible that a directory file in the process of being modified could be copied in an inconsistent state, but I would still have expected the files that existed at the time of the initial file scan, to have been copied to the saveset. These may show up as lost files if an image restore is done, and possibly would show up as being in the [] directory in a listing of the tape saveset.
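
A minimal sketch of that check, assuming the tape drive and save-set name from the original post. FILES.LIS is just an example name, and whether lost files really show up under "[]" in the listing is my expectation, not something I have confirmed.

```dcl
$ ! Sketch only: MKB600: and DSA1.SAV are taken from the original post.
$ ! Rewind the tape and write a listing of the save set to FILES.LIS.
$ BACKUP/LIST=FILES.LIS MKB600:DSA1.SAV/REWIND
$ ! Look for entries recorded without a directory (possible lost files).
$ SEARCH FILES.LIS "[]"
```

The same listing can then be compared against a DIRECTORY of the disk to see exactly which files did not make it into the save set.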

Can you confirm that the account doing the backup has SYSPRV? I see that /ignore=interlock lists that as a requirement, although it isn't clear to me why this would be a requirement, i.e. why READALL would not be sufficient.
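
A quick way to check this is to run something like the following from the account (ideally from the batch job itself) that performs the backup. It is only a sketch; it reports the privileges of the current process, not what is authorized in SYSUAF.

```dcl
$ ! Show the privileges the current process actually holds.
$ SHOW PROCESS/PRIVILEGES
$ ! F$PRIVILEGE returns the string TRUE or FALSE for the named privilege.
$ WRITE SYS$OUTPUT "SYSPRV:  ", F$PRIVILEGE("SYSPRV")
$ WRITE SYS$OUTPUT "READALL: ", F$PRIVILEGE("READALL")
```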

Here are other threads to read.

How to backup a shadowed system disk ?

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1191154

Backup/Restore system disk

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1209276

Restore System Disks

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=802643

BACKUP/IMAGE

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1028893

VAX/VMS image backup of system disk

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1312636

taking backup of disks of a production system

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=974326

Are this command the same ?

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=910829

How Vms backup works ?

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1094410

Process crashes while backup is activ

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=964756

Jon
it depends
GuentherF
Trusted Contributor

Re: Has backup/image/ignore=interlock become useless?

In contradiction to all the folklore.../IGNORE=INTERLOCK does not produce a corrupted save set.

It "CAN" create an inconsistent copy of a file in the save set.

Btw. if you use /IGNORE=INTERLOCK from one node in a cluster and a file is opened for write on another node you do not get any info or warning message (that's bad).

The /IMAGE backup reads directory files directly off the disk bypassing the file system. BACKUP does not lock down a directory file or synchronize this access with the file system. While BACKUP uses a copy of directory blocks the actual directory on disk can be modified by the file system. Hence BACKUP may copy a file that actually had just been deleted or, miss a file that has just been created. This is more troublesome when directories themselves are created/deleted while BACKUP walks down the directory tree.

And another piece: BACKUP/IMAGE processes INDEXF.SYS before any other task. From the index file it creates an in-memory list of file IDs to save (=/FAST). Again, INDEXF.SYS may change while BACKUP/IMAGE is running.

Even without /IGNORE=INTERLOCK these problems still exist.

Baseline: DO NOT USE VMS BACKUP TO BACKUP AN ACTIVE DISK VOLUME.

/Guenther
Korendyk
Advisor

Re: Has backup/image/ignore=interlock become useless?

I wish to thank everyone who responded. I understand and agree with virtually everything everyone has said. I was perhaps not as clear as I could have been in expressing my concern. Let me try a different approach, beginning with a summary of the motivation.

The backup procedures in question are for recovery after some catastrophic event, and intended for (hopefully) prompt recovery of the system and the disk volumes. The method involves performing the aforementioned image
backup to establish the foundation for the recovery. The errors, warnings, and informational messages during the saveset creation and as a result of the verification pass, serve to identify any parts (directories and files) that should be considered as missing or corrupt in the resulting saveset. Special procedures (as appropriate for system files and for any applications) then ensure that those parts can be restored.

In the event of the recovery, the image saveset restores the volume to the state at or shortly before the time of the backup. The special procedures then restore applications to the state saved shortly before then. This is the method I have suggested at a number of sites, and have even taken some site's "backup tapes" to a similarly configured box and performed the "disaster recovery" to confirm that it can be done :-]

What raised my concern, and resulted in this thread, was that a random, accidental examination of one of these backups showed that a large number of files were missing from the saveset. Knowing how and why these files became missing is necessary to determine what special procedures are needed to ensure recovery from a catastrophic event.

With no disrespect to others, I really do appreciate all the comments, but as Jon Pinkley points out, I'm really only interested in why the files are missing from the backup. And why it appears that that fact is not reported. Any comments on the risks of /ignore=interlock or using HBVS will be quietly ignored in an attempt to stay on topic. :-}

John Gillings's comment is the one that also scares me:

"There are many ways that changes in a directory could prune off large branches in the directory tree, with no way to guarantee it will even be detected."

So I have to ask: What are the many ways that a directory can be pruned off? And if it occurs without detection (even during the verification pass?), how can one hope to establish a valid foundation for recovery from a catastrophic event?

I considered Backup/Image to be the best (and only?) way to establish a foundation, from which a recovery method can be built that addresses any of the known limitations. What do you do if your foundation can not be assured?

I'm hoping further investigation and testing will determine whether there is a flaw in the method or a peculiarity in the site configuration. I should point out that this is all related to a data disk. The system disk, which does remain "active" during the backup, uses the same process, and examination of those savesets (so far) shows them to be complete.

Clues, suggestions, are always welcome!

\bill
Korendyk
Advisor

Re: Has backup/image/ignore=interlock become useless?


Responding to comments from Jon Pinkley.

>>>> We still use 7.3-2 for production, and I have never seen the problem you describe.
<<<<

Nor have I; this is the first. Part of my investigation is to see if it is specific to 8.3.

>>>> I am not sure what the warning in the 8.3 documentation is really warning about. Can you provide a reference to the warning in the 8.3 documentation? I was unable to find it in the BACKUP chapter of the "HP OpenVMS System Management Utilities Reference Manual"
<<<<

The statement occurs twice in the "System Manager's Manual, Volume 1: Essentials": in Section 11.15.1 (Backing Up User Disks) and again in Section 15.18.3 (Ensuring Data Integrity).

>>>> What does not make sense to me is that files would be missed in an image backup, since a /FAST file scan is implied, and this scans the INDEXF.SYS file and generates a list of FIDs to backup.
<<<<

Unless the fast file scan is no longer implied. Something else to check into...

>>>> If you do a backup/list/out=files.lis of the backup saveset, does this list not have the files?
<<<<

Sadly, the Site considered it sufficient to only retain the batch log (listing the problems and not the successes). I would have insisted on a journal file... which is what it does now. :-/

>>>>> Can you confirm that the account doing the backup has SYSPRV? I see that /ignore=interlock lists that as a requirement, although it isn't clear to me why this would be a requirement, i.e. why READALL would not be sufficient.
<<<<<

I'm also looking to see if there are issues around the process quotas.

thnx.
\bill
Korendyk
Advisor

Re: Has backup/image/ignore=interlock become useless?

Hi Guenther.

>>>> The /IMAGE backup reads directory files directly off the disk bypassing the file system. BACKUP does not lock down a directory file or synchronize this access with the file system. While BACKUP uses a copy of directory blocks the actual directory on disk can be modified by the file system. Hence BACKUP may copy a file that actually had just been deleted or, miss a file that has just been created. This is more troublesome when directories themselves are created/deleted while BACKUP walks down the directory tree.
<<<<

I understand this happens, seen it often, but can it occur in a way that the "difference" is not detected (and reported) either when the saveset is created or during the verification pass?

>>>>
And another piece: BACKUP/IMAGE processes INDEXF.SYS before any other task. From the index file it creates an in-memory list of file IDs to save (=/FAST). Again, INDEXF.SYS may change while BACKUP/IMAGE is running.

Even without /IGNORE=INTERLOCK these problems still exist.
<<<<

Again, can you think of how these changes might be undetected during the backup/verify process? I suppose that if the in-memory list is not refreshed prior to the verification, and changes are made just so...

It remains puzzling, since I have confirmed that the files and associated directories for those missing from the saveset still exist on the disk volume, in a state that appears to be "unchanged" in a long time.

>>>>
Baseline: DO NOT USE VMS BACKUP TO BACKUP AN ACTIVE DISK VOLUME.
<<<<<

There's a scary notion. Many an archiving solution uses VMS Backup as the underlying copy mechanism. So you're saying that Backup is only usable as standalone, or on private read-only volumes. In all other circumstances it may be unreliable.

I'll need to mull that over a bit...

thnx
\bill

/Guenther
AEFAEF
Advisor

Re: Has backup/image/ignore=interlock become useless?

Responding to Guenther:

>
Btw. if you use /IGNORE=INTERLOCK from one node in a cluster and a file is opened for write on another node you do not get any info or warning message (that's bad).
<

Despite the fact that it says this somewhere in the docs, it simply isn't true. When you open a file from another node, there is a FAL process on the local node that has the file open.

Example (merged and edited for clarity):

LOCAL> DIR FTEND.LOG;

Directory _DSA1:[FT]

FTEND.LOG;7 748/750 6-MAY-2009 23:07:00.48

Total of 1 file, 748/750 blocks.

LOCAL> SHOW DEV /FILES

Files accessed on device DSA1: on 11-MAY-2009 19:51:12.11

Process name PID File name
00000000 [000000]INDEXF.SYS;1

REMOTE> OPEN/WRITE/READ SPOOK node_x::FTTOP:FTEND.LOG

LOCAL> SHOW DEV /FILES

Files accessed on device DSA1: on 11-MAY-2009 19:52:07.43

Process name PID File name
00000000 [000000]INDEXF.SYS;1
FAL_16734 000004DB [FT-2-1-0]FTEND.LOG;7

LOCAL> BACK/LOG FTEND.LOG;7 NL:A.B/SAVE
%BACKUP-E-OPENIN, error opening _DSA1:[FT]FTEND.LOG;7 as input
-SYSTEM-W-ACCONFLICT, file access conflict
%BACKUP-W-NOFILES, no files selected from _DSA1:[FT]FTEND.LOG;7

LOCAL> BACK/LOG/IGNORE=INTERLOCK FTEND.LOG;7 NL:A.B/SAVE
%BACKUP-W-ACCONFLICT, _DSA1:[FT]FTEND.LOG;7 is open for write by another user
%BACKUP-S-COPIED, copied _DSA1:[FT]FTEND.LOG;7

REMOTE> CLOSE SPOOK

LOCAL> BACK/LOG/IGNORE=INTERLOCK FTEND.LOG;7 NL:A.B/SAVE
%BACKUP-S-COPIED, copied _DSA1:[FT]FTEND.LOG;7
LOCAL>

>
The /IMAGE backup reads directory files directly off the disk bypassing the file system. BACKUP does not lock down a directory file or synchronize this access with the file system. While BACKUP uses a copy of directory blocks the actual directory on disk can be modified by the file system. Hence BACKUP may copy a file that actually had just been deleted or, miss a file that has just been created. This is more troublesome when directories themselves are created/deleted while BACKUP walks down the directory tree.
<

Well, the doc says BACKUP does synchronize with the file system, but it does lock files, so I completely concur with your bottom line. And I know from experience (V5.5-2 long ago) that it does copy directory files as you say: it copies them block for block. So I'm not sure just exactly how "BACKUP opens the index file to synchronize with the file system (no update is made)" (see below) affects the directory-copy operation.

BACKUP doc:
"To use the /IMAGE qualifier, you need write access to the volume index file (INDEXF.SYS) and the bit map file (BITMAP.SYS), or the input medium must be write-locked. BACKUP opens the index file to synchronize with the file system (no update is made). Finally, you must have read access to all files on the input medium."

AEF
Hoff
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

>Despite the fact that it says this somewhere in the docs, it simply isn't true. When you open a file from another node, there is a FAL process on the local node that has the file open

The issue here is with remote file access within a cluster. Not with DECnet FAL-level access, which is itself (and specifically the FAL server) arguably local to the process running FAL.

And BTW, the fellow you're discussing this utility with here (GF) has worked on BACKUP itself for a while, adding various support and debugging various problems within that tool. While I do not know if that is still the case, GF is quite familiar with the tool.


Jon Pinkley
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

I had done some testing a while back, and tried different scenarios, like a file being opened for write and closed during the time that the file was being backed up. I had also tried the simple cases, like a file being open at the time the backup started backing up the file, but I must not have tried that simple case on a file opened on another node.

At any rate, I just verified that what GF said is true.

See attachment for log

Guenther, thanks for sharing a condition that doesn't get a warning message. Ian has also stated this in other threads, but I was "sure" I had tested that case.

Jon
it depends
Jon Pinkley
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

Sorry the last attachment has incorrect comments with the first backup command.

Correction attached.
it depends
Jon Pinkley
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

Sorry the last attachment has incorrect comments with the first backup command.

Correction attached (hopefully it will make it this time).
it depends
John Gillings
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

\bill,

>So I have to ask: What are the many ways
>that a directory can be pruned off? And if
>it occurs without detection (even during the
>verification pass?),

As Guenther has pointed out:

>The /IMAGE backup reads directory files
>directly off the disk bypassing the file
>system. BACKUP does not lock down a
>directory file or synchronize this access
>with the file system.

If a directory is modified while this read is occurring, all manner of weird things can happen. Remember directories are "bucket" structured, so imagine a bucket with a single directory entry. Deleting that file will cause the whole directory to be shuffled down. If that happens at the same time as backup is copying the directory, the entire contents of the next bucket will be skipped, thus pruning off the branches leading from any subdirectories contained in that bucket.

Access is unsynchronized (that's what you asked for!), so anything can happen.

>how can one hope to establish a valid
>foundation for recovery from a
>catastrophic event?

Your foundation disk images MUST be taken with the system quiesced. In practice that means STANDALONE.

The incremental updates then need to be taken in a manner that is strictly and deliberately synchronised with your application. OpenVMS can't do that for you.

Again I urge you to turn the question on its head. Think about how you will do your restore. What data do you need and how can you obtain it? Expecting BACKUP to magically know what you need is NOT a solution.

The BACKUP utility isn't much more than a glorified COPY command. Its design goals are to move data as fast as possible from one place to another. Interlocks and synchronization just get in the way. You, the user, are expected to deal with that.
A crucible of informative mistakes
AEFAEF
Advisor

Re: Has backup/image/ignore=interlock become useless?




Hoff> May 11, 2009 20:34:56 GMT Unassigned

AEF>> Despite the fact that it says this somewhere in the docs, it simply isn't true. When you open a file from another node, there is a FAL process on the local node that has the file open

Hoff> The issue here is with remote file access within a cluster. Not with DECnet FAL-level access, which is itself (and specifically the FAT server) arguably local to the process running FAL.

OK, I missed the cluster part. My fault.

Hoff> And BTW, the fellow you're discussing this utility with here (GF) has worked on BACKUP itself for a while, adding various support and debugging various problems within that tool. While I do not know if that is still the case, GF is quite familiar with the tool.

Yeah, well. . . . Hey -- I did agree with him on everything else!

But I must say I am confused. I looked at the log posted by Jon Pinkley. If BACKUP knows that the file is locked by a process on another node, why doesn't it know that *independent* of whether /IGNORE=INTERLOCK was specified or not? Why doesn't backup give an error message in the /IGNORE=INTERLOCK case?

Restated: BACKUP must know the file is locked in both cases, as it clearly knows it in the regular command. And it knows to copy the file anyway in the /IGNORE=INTERLOCK case. (Well I *assume* it was copied. Was it?) So if BACKUP knows all that, why does it "forget" to give out the

%BACKUP-W-ACCONFLICT, is open for write by another user

message?

AEF
Jon Pinkley
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

Bill,

Do you still have the tape with the backup saveset that has the missing files? If so, the listing file can be recreated, although the syntax I previously gave was incorrect (there isn't a distinct /out= qualifier, the output file is associated with the /list qualifier).

Please get a listing of what is on the tape:

Assuming your tape drive is mkb600:

$ mount/foreign/nounload/nowrite mkb600:
$ backup/list=sys$scratch:mkb600.lis mkb600:*.* ! for every saveset on the tape

or

$ backup/list=sys$scratch:mkb600.lis mkb600:dsa1.sav ! for just the DSA1.SAV saveset

This can take a while depending on the tape drive and the size of the saveset.

Then search the listing for the files that you know were "missing".
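
If the listing has been copied to another machine, a few lines of Python can do the same check in bulk. This is a minimal sketch, assuming only that each file name appears somewhere in the listing text; the function name and the sample names are hypothetical, and the sample "listing" is made up rather than real BACKUP output.

```python
def names_not_in_listing(listing_text, expected_names):
    """Return the expected names that never appear in the listing text.

    The match is a case-insensitive substring test, so it does not depend
    on the exact column layout of the listing.
    """
    haystack = listing_text.upper()
    return [name for name in expected_names if name.upper() not in haystack]


# Made-up listing text and file names, for illustration only.
sample_listing = """
[DATA]REPORT.TXT;3
[DATA.ARCHIVE]OLD.DAT;1
"""
expected = ["[DATA]REPORT.TXT", "[DATA.ARCHIVE]OLD.DAT", "[DATA.ARCHIVE]NEW.DAT"]

print(names_not_in_listing(sample_listing, expected))
```

To run it against a real listing, read the file into a string first (for example with `open(path).read()`) and pass that as `listing_text`.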

The first thing to verify is what the actual backup command that created the saveset was. It is contained in the header of the listing. Here's an example:



Listing of save set(s)

Save set: TEST_NOINTERLOCK.BCK
Written by: JON
UIC: [000002,000016]
Date: 11-MAY-2009 16:53:54.44
Command: BACKUP/IMAGE/IGNORE=INTERLOCK DISK$TEST: CNVS72:TEST_NOINTERLOCK.BCK/SAVE

Rest deleted (see attachment for more complete example)

For even more info, you can use the undocumented/unsupported /analyze switch along with /list and get a listing that includes a "formatted dump" of the backup saveset records. This has the list of FIDs that the /FAST file scan creates.

Here is part of the output, showing the first section of FIDs found during the /FAST INDEXF.SYS file scan:

Record header
RSIZE = 144 = %X'0090'
RTYPE = FID (7)
FLAGS = %X'00000000'
ADDRESS = 0
BLOCKFLAGS = %X'0000'

STRUCLEV = 0101
FID_COUNT = 64
FID = (1,1,1)
FID = (2,2,1)
FID = (3,3,1)
FID = (4,4,1)
FID = (5,5,1)
FID = (6,6,1)

Rest deleted (see attachment for more complete example)

If you also have the disk that "still has the files", you can determine the FIDs of some of the "missing" files using directory/file and you can then search for the FID in the /analyze listing.
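
For a large /analyze listing, the FID lookup can be scripted. The sketch below is in Python and assumes the FID lines look exactly like the excerpt above ("FID = (n,n,n)" at the start of a line). Since /analyze is unsupported, that layout may differ on other versions, and the function name is made up for illustration.

```python
import re

# Matches lines such as "FID = (3,3,1)"; it deliberately does not match
# "FID_COUNT = 64" or "RTYPE = FID (7)".
FID_LINE = re.compile(r"^\s*FID\s*=\s*\((\d+),(\d+),(\d+)\)", re.MULTILINE)


def fids_in_listing(listing_text):
    """Collect every FID found in the listing as a set of 3-tuples of ints."""
    return {tuple(int(part) for part in m.groups())
            for m in FID_LINE.finditer(listing_text)}


# Excerpt copied from the example record above.
excerpt = """
RTYPE = FID (7)
STRUCLEV = 0101
FID_COUNT = 64
FID = (1,1,1)
FID = (2,2,1)
FID = (3,3,1)
FID = (4,4,1)
FID = (5,5,1)
FID = (6,6,1)
"""

found = fids_in_listing(excerpt)
print((3, 3, 1) in found)  # this FID is in the excerpt
print((7, 7, 1) in found)  # this one is not
```

A FID taken from directory/file on the surviving disk can then be tested for membership in `found` to see whether the /FAST scan ever picked the file up.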

We need to verify that /ignore=interlock was actually used, because if it was not, any file open for write will generate the warning and will not be copied to the saveset.

Your description sounds like only part of the files in a directory were copied, so it seems more likely that something was creating or deleting files in the directories that had missing files. I would expect the probability of problems to increase with the size of the directory file and with the activity (files being entered in or removed from the .DIR file). If what Guenther says is true (and he should know), that "BACKUP does not lock down a directory file or synchronize this access with the file system", then the longer the "critical section" during which a directory is in an inconsistent state, the higher the probability that BACKUP will read it while it is in that state.

I am surprised that BACKUP doesn't take out a serialization lock (F11B$s) by accessing the file, at least when the volume is mounted shared /write.

However, even if the directories were not copied intact, I still think that the files should be in the saveset, although they may not be "cataloged" in a directory, and may be in the "[]" null directory, like lost files.

Jon
it depends
Jon Pinkley
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

AEF,

Re: Why does backup not warn about writers on another cluster node?

Several things. BACKUP uses $QIO to read the VBNs associated with the files it is backing up. RMS is used too, but for savesets, listing files, etc.

In other words, backup copies blocks, not records, and uses low level access (the ACP interface) to read (non-saveset) files instead of RMS.

I just looked at section 8.4 "Access Arbitration" of Kirby McCoy's "VMS File System Internals". It discusses a routine ARBITRATE_ACCESS that is called to coordinate file activity in several places, one being "To open a file (except for explicit interlock ignore)" (top of page 353). It says that it first checks for local access, and only takes a lock if the volume is cluster accessible. It uses information in the FCB to determine if there are other accessors on the local node, so I am guessing that BACKUP is only looking at the FCB, and that is how it detects other writers on the same node. Since "explicit interlock ignore" is specified when /ignore=interlock is used, it probably never calls ARBITRATE_ACCESS, and doesn't actually enqueue a lock, so it doesn't detect writers on another node.

Perhaps Guenther will be able to shed more light than I can.

Jon
it depends
Volker Halle
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

If /IGNORE=INTERLOCK has been specified, the ACCONFLICT check in BACKUP works like this (see [BACKUP]DISKSCAN):

Once a file has been completely copied into the saveset, backup accesses the file again and checks

- the current revision date against the revision date saved when the file was first accessed for backup

- the writer count (SBK$W_WCNT) from the statistics block (ATR$C_STATBLK). Note that this value only reflects other WRITERS on the SAME node in a cluster. This data comes from the FCB (i.e. FCB$W_WCNT).

If the revision dates are different or the writer count is .ne. 0, the %BACKUP-W-ACCONFLICT message is signalled.
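
The check described above can be modelled in a few lines of Python. This is a sketch of the logic only, not the DISKSCAN code: the class, the function names, and the node names are invented for illustration, and the per-node dictionary stands in for the fact that each node's FCB only counts writers on that node.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFile:
    """Toy stand-in for a file as seen across a cluster (hypothetical model)."""
    revision_date: str
    writers_by_node: dict = field(default_factory=dict)  # node name -> writer count


def local_writer_count(f, node):
    # Models the writer count from the statistics block: it only reflects
    # writers on the node where the check runs.
    return f.writers_by_node.get(node, 0)


def acconflict_after_copy(saved_revision_date, f, backup_node):
    """True when the post-copy check would signal the ACCONFLICT warning."""
    return (f.revision_date != saved_revision_date
            or local_writer_count(f, backup_node) != 0)


saved = "11-MAY-2009 16:00:00.00"

local_writer = ToyFile(saved, {"NODEA": 1})
remote_writer = ToyFile(saved, {"NODEB": 1})
remote_writer_rewritten = ToyFile("11-MAY-2009 16:55:00.00", {"NODEB": 1})

print(acconflict_after_copy(saved, local_writer, "NODEA"))             # warning
print(acconflict_after_copy(saved, remote_writer, "NODEA"))            # no warning
print(acconflict_after_copy(saved, remote_writer_rewritten, "NODEA"))  # warning
```

The middle case is the one in Jon's log: a writer on another node, with the revision date still unchanged when BACKUP looks again, passes both tests and so produces no message.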

Volker.