Operating System - OpenVMS

Re: Has backup/image/ignore=interlock become useless?

 
John Gillings
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

Bill,

>So I have to ask: What are the many ways
>that a directory can be pruned off? And if
>it occurs without detection (even during the
>verification pass?),

As Guenther has pointed out:

>The /IMAGE backup reads directory files
>directly off the disk bypassing the file
>system. BACKUP does not lock down a
>directory file or synchronize this access
>with the file system.

If a directory is modified while this read is occurring, all manner of weird things can happen. Remember directories are "bucket" structured, so imagine a bucket with a single directory entry. Deleting that file will cause the whole directory to be shuffled down. If that happens at the same time as backup is copying the directory, the entire contents of the next bucket will be skipped, thus pruning off the branches leading from any subdirectories contained in that bucket.

Access is unsynchronized (that's what you asked for!), so anything can happen.
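To make the "shuffle down" failure concrete, here is a toy model in Python. It is only a sketch: the list-of-buckets layout, the function name and the file names are assumptions made up for illustration, and it does not reflect the real on-disk directory format or how BACKUP actually reads it.

```python
# Toy model only: a "directory" is a list of buckets, each a list of entry names.
# Nothing here mirrors the real ODS directory layout or BACKUP's code.

def unsynchronized_copy(directory, delete_after_reading):
    """Copy buckets by position while a 'concurrent' delete happens mid-scan."""
    copied = []
    position = 0
    while position < len(directory):
        copied.extend(directory[position])
        if position == delete_after_reading:
            # Concurrent writer: the only entry in this bucket is deleted,
            # the empty bucket goes away and all later buckets shuffle down.
            del directory[position]
        position += 1  # the reader advances, unaware of the shuffle
    return copied


directory = [["A.DIR"], ["B.TXT"], ["SUB1.DIR", "SUB2.DIR"], ["Z.TXT"]]
seen = unsynchronized_copy(directory, delete_after_reading=1)
print(seen)  # SUB1.DIR and SUB2.DIR never show up in the copy
```

The bucket holding SUB1.DIR and SUB2.DIR slides into the position the reader has just left, so the reader never sees it, and everything below those two subdirectories is pruned from the copy.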

>how can one hope to establish a valid
>foundation for recovery from a
>catastrophic event?

Your foundation disk images MUST be taken with the system quiesced. In practice that means STANDALONE.

The incremental updates then need to be taken in a manner that is strictly and deliberately synchronised with your application. OpenVMS can't do that for you.

Again I urge you to turn the question on its head. Think about how you will do your restore. What data do you need and how can you obtain it? Expecting BACKUP to magically know what you need is NOT a solution.

The BACKUP utility isn't much more than a glorified COPY command. Its design goals are to move data as fast as possible from one place to another. Interlocks and synchronization just get in the way. You, the user, are expected to deal with that.
A crucible of informative mistakes
AEFAEF
Advisor

Re: Has backup/image/ignore=interlock become useless?




Hoff> May 11, 2009 20:34:56 GMT Unassigned

AEF>> Despite the fact that it says this somewhere in the docs, it simply isn't true. When you open a file from another node, there is a FAL process on the local node that has the file open.

Hoff> The issue here is with remote file access within a cluster. Not with DECnet FAL-level access, which is itself (and specifically the FAL server) arguably local to the process running FAL.

OK, I missed the cluster part. My fault.

Hoff> And BTW, the fellow you're discussing this utility with here (GF) has worked on BACKUP itself for a while, adding various support and debugging various problems within that tool. While I do not know if that is still the case, GF is quite familiar with the tool.

Yeah, well. . . . Hey -- I did agree with him on everything else!

But I must say I am confused. I looked at the log posted by Jon Pinkley. If BACKUP knows that the file is locked by a process on another node, why doesn't it know that *independent* of whether /IGNORE=INTERLOCK was specified or not? Why doesn't backup give an error message in the /IGNORE=INTERLOCK case?

Restated: BACKUP must know the file is locked in both cases, as it clearly knows it in the regular command. And it knows to copy the file anyway in the /IGNORE=INTERLOCK case. (Well I *assume* it was copied. Was it?) So if BACKUP knows all that, why does it "forget" to give out the

%BACKUP-W-ACCONFLICT, is open for write by another user

message?

AEF
Jon Pinkley
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

Bill,

Do you still have the tape with the backup saveset that has the missing files? If so, the listing file can be recreated, although the syntax I previously gave was incorrect (there isn't a distinct /out= qualifier; the output file is associated with the /list qualifier).

Please get a listing of what is on the tape:

Assuming your tape drive is mkb600:

$ mount/foreign/nounload/nowrite mkb600:
$ backup/list=sys$scratch:mkb600.lis mkb600:*.* ! for every saveset on the tape

or

$ backup/list=sys$scratch:mkb600.lis mkb600:dsa1.sav ! for just the DSA1.SAV saveset

This can take a while depending on the tape drive and the size of the saveset.

Then search the listing for the files that you know were "missing".

The first thing to verify is what the actual backup command that created the saveset was. It is contained in the header of the listing. Here's an example:



Listing of save set(s)

Save set: TEST_NOINTERLOCK.BCK
Written by: JON
UIC: [000002,000016]
Date: 11-MAY-2009 16:53:54.44
Command: BACKUP/IMAGE/IGNORE=INTERLOCK DISK$TEST: CNVS72:TEST_NOINTERLOCK.BCK/SAVE

Rest deleted (see attachment for more complete example)

For even more info, you can use the undocumented/unsupported /analyze switch along with /list and get a listing that includes a "formatted dump" of the backup saveset records. This has the list of FIDs that the /FAST file scan creates.

An example of part of the output showing the first section of FIDs found during /FAST INDEXF.SYS file scan.

Record header
RSIZE = 144 = %X'0090'
RTYPE = FID (7)
FLAGS = %X'00000000'
ADDRESS = 0
BLOCKFLAGS = %X'0000'

STRUCLEV = 0101
FID_COUNT = 64
FID = (1,1,1)
FID = (2,2,1)
FID = (3,3,1)
FID = (4,4,1)
FID = (5,5,1)
FID = (6,6,1)

Rest deleted (see attachment for more complete example)

If you also have the disk that "still has the files", you can determine the FIDs of some of the "missing" files using directory/file and you can then search for the FID in the /analyze listing.

We need to verify that /ignore=interlock was actually used. If it was not, any file open for write will generate the warning and will not be copied to the saveset. Your description, however, sounds like only some of the files in a directory were copied, so it seems more likely that something was creating or deleting files in the directories that had missing files. I would expect the probability of problems to increase with the size of the directory file and with the activity (files being entered into or removed from the .DIR file). If what Guenther says is true (and he should know), namely "BACKUP does not lock down a directory file or synchronize this access with the file system.", then the longer the "critical section" during which a directory is in an inconsistent state, the higher the probability that BACKUP will read it while it is in that state.

I am surprised that BACKUP doesn't take out a serialization lock (F11B$s) by accessing the file, at least when the volume is mounted shared /write.

However, even if the directories were not copied intact, I still think that the files should be in the saveset, although they may not be "cataloged" in a directory, and may be in the "[]" null directory, like lost files.

Jon
it depends
Jon Pinkley
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

AEF,

Re: Why does backup not warn about writers on another cluster node?

Several things. BACKUP uses $QIO to read the VBNs associated with the files it is backing up. RMS is used too, but for savesets, listing files, etc.

In other words, backup copies blocks, not records, and uses low level access (the ACP interface) to read (non-saveset) files instead of RMS.

I just looked at section 8.4 "Access Arbitration" of Kirby McCoy's "VMS File System Internals". It discusses a routine ARBITRATE_ACCESS that is called to coordinate file activity in several places, one being "To open a file (except for explicit interlock ignore)" (top of page 353). It says that it first checks for local access, and only takes a lock if the volume is cluster accessible. It uses information in the FCB to determine if there are other accessors on the local node, so I am guessing that BACKUP is only looking at the FCB, and that is how it detects other writers on the same node. Since "explicit interlock ignore" is specified when /ignore=interlock is used, it probably never calls ARBITRATE_ACCESS, and doesn't actually enqueue a lock, so it doesn't detect writers on another node.

Perhaps Guenther will be able to shed more light than I can.

Jon
it depends
Volker Halle
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

The ACCONFLICT check in BACKUP works like this (see [BACKUP]DISKSCAN), if /IGNORE=INTERLOCK has been specified:

Once a file has been completely copied into the saveset, backup accesses the file again and checks

- the current revision date against the revision date saved when the file was first accessed for backup

- the writer count (SBK$W_WCNT) from the statistics block (ATR$C_STATBLK). Note that this value only reflects other WRITERS on the SAME node in a cluster. This data comes from the FCB (i.e. FCB$W_WCNT).

If the revision dates are different or the writer count is .ne. 0, the %BACKUP-W-ACCONFLICT message is signalled.
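The check described above can be restated as a small Python sketch. This is my paraphrase of the logic, not the code in [BACKUP]DISKSCAN; the class and field names are hypothetical stand-ins, and `local_writer_count` plays the role of SBK$W_WCNT, which only counts writers on the same node.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FileState:
    """Hypothetical snapshot of what is looked at for one file."""
    revision_date: datetime
    local_writer_count: int  # stand-in for SBK$W_WCNT; same-node writers only


def acconflict(at_first_access: FileState, after_copy: FileState) -> bool:
    """True if the ACCONFLICT warning would be signalled under this paraphrase."""
    return (after_copy.revision_date != at_first_access.revision_date
            or after_copy.local_writer_count != 0)


t0 = datetime(2009, 5, 11, 16, 53, 54)
t1 = datetime(2009, 5, 11, 16, 55, 0)

# Writer on the same node: the writer count is non-zero, so a warning.
local_writer = acconflict(FileState(t0, 1), FileState(t0, 1))

# Writer on another node that keeps the file open: the local count stays 0
# and the revision date has not changed yet, so no warning.
remote_writer_still_open = acconflict(FileState(t0, 0), FileState(t0, 0))

# File opened and closed for update on another node during the copy:
# the revision date (usually) changes, so a warning.
remote_update_closed = acconflict(FileState(t0, 0), FileState(t1, 0))

print(local_writer, remote_writer_still_open, remote_update_closed)
```

The middle case is the one AEF asked about: with neither signal set, there is nothing for the check to react to.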

Volker.
Jon Pinkley
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

Correction in terminology:

I said "In other words, backup copies blocks, not records, and uses low level access (the ACP interface) to read (non-saveset) files instead of RMS."

That should be: "In other words, backup copies blocks, not records, and uses low level access (the ACP interface IO$M_ACCESS) to open (non-saveset) files instead of RMS file opens."

And while I was entering this, I see Volker has provided the answer to why only local writers are detected (and why a file open/close for update on another node is detected: under most conditions it changes the revision date).

Jon





it depends
Volker Halle
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

Bill,

if you've really used /VERIFY on the BACKUP command (as shown in your example, which probably is NOT from the batch .LOG file itself) and you have not received VERIFYERR errors, then consider that the verification pass seems to compare the files in the saveset against the corresponding files on the source disk, and not vice versa. A file that never made it into the saveset would therefore not be reported.

Volker.

Jon Pinkley
Honored Contributor

Re: Has backup/image/ignore=interlock become useless?

Expanding on what Volker said about verification:

Verification was probably originally meant to provide a way to know that there were no uncorrectable errors introduced during the copy process, and to allow a check of the complete end-to-end data path between the source disk and saveset. Remember that BACKUP was written before tape drives with ECC were available. In my opinion, doing a backup/listing of the saveset provides a reasonably good verification that the saveset is readable, although it isn't a confirmation that the data is identical to what is on the disk. But the only way for it to be identical is if the disk hasn't changed.

Verification is really only useful on a static disk. Otherwise you will get verification errors on any file which has changed between the time the file was written to the saveset, and when it is read from the saveset in the verification pass. The verification pass starts after the whole saveset has been written, and in the case of a saveset on tape, after the tape has been repositioned to the start of the saveset. This repositioning isn't optimized (at least that is my experience with 7.3-2); it rewinds the tape, then checks each saveset until it finds the correct one. In other words, if there are many savesets on the tape, it takes progressively longer to reposition the tape as the saveset number on tape gets higher. This shouldn't be necessary with modern tape drives that can report position, and skip to a position on the tape, but there is no TMSCP command or even $QIO function to do that. It can be done with a $QIO DIAGNOSE SCSI pass-through, but I am not expecting that this functionality will ever be added to BACKUP.
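A rough way to see the "progressively longer" effect is to count how many savesets get checked after each rewind. This Python sketch assumes the behaviour described above (rewind, then check savesets in order until the right one is found) and counts savesets as the unit of work; real elapsed time depends on saveset sizes and the drive, which are not modelled. The function names are made up for illustration.

```python
def savesets_checked_to_reposition(saveset_number: int) -> int:
    """After a rewind, savesets 1..saveset_number are checked in turn."""
    return saveset_number


def total_checks(saveset_count: int) -> int:
    """Total savesets checked if every saveset on the tape is written with /VERIFY."""
    return sum(savesets_checked_to_reposition(k)
               for k in range(1, saveset_count + 1))


for n in (1, 5, 10, 20):
    print(n, total_checks(n))
```

Under these assumptions the repositioning work grows roughly with the square of the number of savesets on the tape, not linearly.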

Summary: /VERIFY is expensive from a time point of view, and unless the source disk is static, it doesn't provide much useful information.

Note that the same "verification errors" would happen if you used convert/share before your backup, a second convert/share after backup, and a differences on the two copies. If the files had changed during the interval between the two copies, differences will report that the files are different. This doesn't mean the first copy was "bad", only that the file is no longer the same as it was at a previous point in time. I.e. if you do get verification errors, that is not necessarily an indication that what is on the tape is inconsistent, only that it is different than what is on the disk at the time the verification pass re-read the disk.

Jon
it depends
AEFAEF
Advisor

Re: Has backup/image/ignore=interlock become useless?

I find /VERIFY more useful than that. I always use it when making tapes. If nothing else, it tells you that there is something readable on the tape!

Re ECC: I now use /GROUP=0 because of that, but I still use /VERIFY for all save-to-tape operations.

Re BACKUP/LIST: Someone once posted on cov how he trusts BACKUP/LIST for determining save-set integrity. Excuse me a minute while I try to find it . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Working . . . . . . . . . . . . . OK, I'm back. (Yeah, I know -- silly me.)

I couldn't find the original by Jerry Leichter, but here is the url for my quote of it. In it, he deliberately corrupts a save set by changing one byte in it and shows how BACKUP/LIST picks it up. (If you prefer, just search comp.os.vms for RECORD BOUNDARY BACKUP RUTGERS and read my only post in the thread, or maybe you'll find the original! Google groups appears not to have it. The title of the thread is "LD062 Install Question".)

http://groups.google.com/group/comp.os.vms/browse_frm/thread/7ae42920ab4a3a19/cd52ae44c694abe7?lnk=gst&q=record+boundary+backup+rutgers#cd52ae44c694abe7

And sure, there'll be some differences causing verification errors, but you can see if they look reasonable. No, that doesn't guarantee that the save operation went flawlessly, but it will reveal any major problems. It's a quick (okay, not so quick!) spot check.

And I've had situations where the verify pass resulted in a (fatal!) parity error part way through the save set. I'd call that useful information!

And if the CRC values are checked during the verification pass, and no bad values are found -- that's also a good sign. (I would assume that CRC values are checked, but I don't know for sure. Anyone?)
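For what it's worth, here is what a CRC check would catch, sketched in Python. Whether the verification pass actually checks CRCs is exactly the open question above, and this uses zlib's CRC-32 as a stand-in, not BACKUP's own CRC or saveset block format; the block contents and names are made up.

```python
import zlib


def block_is_intact(data: bytes, stored_crc: int) -> bool:
    """Recompute the CRC over the data and compare it with the stored value."""
    return zlib.crc32(data) == stored_crc


block = bytes(range(256)) * 2   # made-up "saveset block"
stored_crc = zlib.crc32(block)  # CRC recorded when the block was written

corrupted = bytearray(block)
corrupted[100] ^= 0x01          # flip a single bit in one byte

print(block_is_intact(block, stored_crc))
print(block_is_intact(bytes(corrupted), stored_crc))
```

A single-bit (or single-byte) change always alters a CRC-32, so if the values are checked on read, that kind of damage cannot slip through unnoticed.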

I've had enough problems with tapes that I always use /VERIFY. In fact, when it comes to causing trouble, printers are just hopeless tape-drive wannabes.

Summary: I always use /VERIFY when saving data to tape.

AEFAEF
comarow
Trusted Contributor

Re: Has backup/image/ignore=interlock become useless?

I'm unclear. Are you trying to create a disaster recovery tape?

Then arrange some down time and boot standalone backup (still supported on 5.5-2), and you'll get a clean, supported backup.

You can then backup/image/VERIFY for your own security.