Operating System - OpenVMS
mustafa_12
Frequent Advisor

taking backup of disks of a production system

Hi,

Our cluster system is online 7x24 and there is always something reading from or writing to the disks. We take a full image backup on Sunday and incremental backups on the other days of the week, without shutting down any program on the system. In the backup log files there are always warning messages saying "X file is open for write by another user", but the file is still backed up.

My question is: suppose we are using 20 disks on our storage and all of them are lost in a disaster. I have the image tape backups of the disks (taken with the method above) and want to restore them to another storage device with 20 disks (suppose it is also OK to lose some files produced or changed after the backup). But I am not sure whether this will work, because the backup log files always say things like

"diskX:[000000]QUOTA.SYS;1 is open for write by another user".

If some files gave this kind of warning while the image backup of a disk was taken, is it possible to restore that backup to another disk and still continue to work? Or is there another way to take a snapshot of the disk and then back up that snapshot?


Note: My OS version is OpenVMS 7.3-2 and my disks are on HSG80 storage.
Hein van den Heuvel
Honored Contributor
Solution

Re: taking backup of disks of a production system

This is a universal problem, going well beyond operating systems. OpenVMS BACKUP happens to be nice enough to warn you; other backup tools silently ignore the problem, giving you an unreliable backup without any warning.

Those warnings will have to be analyzed each on their own merit.

For example, the QUOTA.SYS one is for a file whose data can be regenerated from other data if needed, so that one keeps the backup valid.
A warning on, say, OPERATOR.LOG is fine also. You may not have obtained the last few hundred lines, but they'll come in on the incremental backup tomorrow. Be aware though that the end of OPERATOR.LOG may stop in the middle of its data and cause errors on future reads from a restored version, but again that can be worked around as needed.

If your systems happen to use RMS indexed files as the main database, then such a file may or may not be valid. The backed-up version may have major inconsistencies: a new pointer near the end of the file pointing to information near the beginning that was not there yet when BACKUP came by.
How likely the backed-up data is to be usable depends on:
- whether deferred writes were used (dangerous for backups)
- whether the file is 'clean', only adding records at the end, or actively bucket-splitting all over the place.
- your skill in repairing, or 'dealing' with a bad file if the restore shows problems.
- the availability of After Image Journal files.


For those indexed files you may need to decide to exclude them from the backup and, for example, use CONVERT/SHARE to capture valid records. Or you may decide to use shadowing and unhook a member for the backup, thereby minimizing the time window of bad changes within a file, as well as between files.
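
As a rough illustration only (file and device names invented; a CONVERT/SHARE snapshot plus /EXCLUDE is just one way to arrange this), the idea might look like:

$! Capture a consistent copy of a busy indexed file that is open for
$! shared write, then exclude the live file from the image backup and
$! save the snapshot separately.
$ CONVERT/SHARE DKA100:[APP]ORDERS.IDX DKA300:[SNAP]ORDERS.IDX
$ BACKUP/IMAGE/EXCLUDE=[APP]ORDERS.IDX DKA100: DKA300:[BCK]WEEKLY.BCK/SAVE_SET
$ BACKUP DKA300:[SNAP]ORDERS.IDX DKA300:[BCK]ORDERS.BCK/SAVE_SET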

SYSUAF and RIGHTSLIST are often modified together, notably when adding records. You may decide that you know no records are added during the backup, and that you are willing to miss, for example, a truly up-to-date 'last login' timestamp. (Hey... if you had started the backup a minute earlier, that fresh last login might not have been there at all!)

Finally, if you are working with a serious database (Oracle?) then you should completely exclude those files from the backup and only use the database vendor's integrated backup tools, such as RMAN for Oracle.

In the end it is your call... 'do you feel lucky'.
What is the cost of scheduled downtime every time, versus the potential cost of making the data on a 'somewhat inconsistent backup' usable? For many applications the cost of having a backup that is not readily usable is infinite, and scheduled downtime is the only reasonable thing to do.

Hope this helps some,

Hein.
Mike Reznak
Trusted Contributor

Re: taking backup of disks of a production system

Hi,

I agree with Hein. (his answer is really comprehensive !)

If you wish to have a consistent image backup, the best way is to stop all processing, so that nothing is writing to the disk at the time of the backup.
To shorten the downtime to a minimum, implement a storage solution (Business Copy, Continuous Access on EVA/XP). Making a clone is a matter of seconds, and you can make the backup from an unaccessed disk with plenty of time to finish it.

Mike
...and I think to myself, what a wonderful world ;o)
Jan van den Ende
Honored Contributor

Re: taking backup of disks of a production system

Mike,

pardon my relative ignorance of SAN details, but ISTR that 'standard' SAN solutions like Business Copy or Continuous Access are TOTALLY INCOMPATIBLE with HBVS, and offer less functionality and reliability.

If you think along those lines, re-read Hein: DISMOUNT one member from a shadow set (and, for open indexed RMS files, do a CONVERT/SHARE on the disengaged member), effectively creating a snapshot to BACKUP, and then rebuild the shadow set. MINICOPY is a great help!
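
Purely as a sketch (device names and label invented; whether /POLICY=MINICOPY applies depends on how the shadow set was mounted, and the tape or disk destination is up to you), one backup cycle could look like:

$! Split one member out of shadow set DSA10:, back it up read-only,
$! then return it; a minicopy bitmap keeps the rebuild short.
$ DISMOUNT/POLICY=MINICOPY $1$DGA101:
$ MOUNT/OVERRIDE=SHADOW_MEMBERSHIP/NOWRITE $1$DGA101: APPDATA SNAPDISK
$ BACKUP/IMAGE SNAPDISK: DKA300:[BCK]APPDATA.BCK/SAVE_SET
$ DISMOUNT SNAPDISK:
$ MOUNT/SYSTEM DSA10:/SHADOW=$1$DGA101: APPDATA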

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Allan Bowman
Respected Contributor

Re: taking backup of disks of a production system

Just to add a little to the above responses: I recently came from a position where, for the past 15 years, we had been doing backups of a production environment with many open files. We would get hundreds of warning messages with each backup. However, since we knew the application, and in 99% of the cases the files just remained open during the backup, we were able to survive by crossing our fingers. In 15 years I probably only needed to restore a full disk from backup about a dozen times, and not once did I have a problem - possibly some missing data, but we were able to live with that since our data feeds were sent to a second site in real time. Our main concern was just to get the disk structure and directory layout restored - the data files could be recovered from the other system if need be (we only had to do that once, on a fairly large file).

Of course I'm not saying that the way we did it was a "good practice", but it worked for us.

Allan in Atlanta
John Gillings
Honored Contributor

Re: taking backup of disks of a production system

Re: Jan,

DISMOUNTing a shadow set member will guarantee the integrity of the DISK STRUCTURE, but not necessarily that of the internal structure of data files.

What is needed is a way for your application code to quiesce data files. Notionally something like, flush all transactions and CLOSE files at a point from which operations can be continued. This is not something any operating system can do for you. Only the application knows when it's in a state where all transactions are in a consistent state.

Bottom line is backup and restore must be built into application code from the initial design. With storage controller cloning or host based shadowing, you can get it down to:

1 Send message to application to quiesce
2 Dismount shadow member or initiate clone
3 Send message to application to continue

The "interruption" of service could be subsecond, but without it you're running a 24x7 mission critical operation with fingers crossed!

A crucible of informative mistakes
Willem Grooters
Honored Contributor

Re: taking backup of disks of a production system

If you have a lot of RMS files that constitute the database of your applications, there always lurks the danger of inconsistency - within a file itself (when the data is backed up but the index isn't, or just the other way around), or between files, when an update takes place in files already copied (the backup won't hold the update) and files still to be copied (the backup will hold the update).
It doesn't have to be a problem, if you know your applications, data and users.
If you have the ability within your applications to block all updates for some time, it won't happen that easily - however, all buffers need to be flushed before the backup is ever started, to ensure all data is written to disk (and thus copied).

You must take into account that if the database consists of several RMS files, you cannot restore just one file - you have to restore ALL of them, from the SAME backup, to ensure at least the best consistency you can get from backup.

Database files, no matter what system you use, MUST be treated differently. AFAIK every vendor has one or more methods to back up database files. USE THEM. This ensures consistency within the database and its internal data. Most vendors allow hot backup.

On SYSUAF and RIGHTSLIST - often named as reasons why NOT to rely on /IGNORE=INTERLOCK:
I have no problem - whatsoever - with backing up these files that way.
Why? Because the changes I NEED to preserve in case of a restore are changes that I, as a system manager, will not initiate while backing up the system disk: changes of credentials, quotas, privileges. A user's last login is of no importance when I have to restore that file.
Since these will normally be done during office hours, and backup (mostly) isn't, I don't see the conflict.

However, if you can avoid it, that is best. But when you cannot, calculate the risk of missing some updates and of inconsistency (within one file, or between several).
Like Allan says: it will work - most of the time. I second that.
Willem Grooters
OpenVMS Developer & System Manager
Wim Van den Wyngaert
Honored Contributor

Re: taking backup of disks of a production system

Backups over here were also as you described.

There were a number of problems which I solved.

1. Files open by other users.
After each backup we now verify whether any of the open files are in a configured list. If so, an alarm is raised in the monitoring tools. E.g. Sybase dumps may not be open. One application had to be stable, so not one of its files should be open. (See the sketch after this list.)

2. Files no longer existing.
After each backup we verify whether any files were missing (a 'non-existent file' message). Again the Sybase dumps: if they disappear between the start and the end of the backup, something is wrong.

3. Timeframes.
All backups are made manually by operators. To avoid errors we configured when taking a backup is allowed. This avoids a mix of old and new data being backed up.

4. Backup sizes.
To avoid backup overflow, we monitor the size of the files to be backed up. This flags unforeseen growth or cleanup problems, or things moved without adjusting the backup parameters.
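
A rough sketch of the kind of post-backup check described above (the log, list, and alarm procedure names are all invented for the example):

$! Pull all warning and error messages out of the backup log, then alarm
$! if a file from the site's "must never be open" list shows up in them.
$ SEARCH BCK$LOGS:NIGHTLY_BACKUP.LOG "-W-","-E-" -
        /OUTPUT=BCK$LOGS:NIGHTLY_WARNINGS.LIS
$ SEARCH BCK$LOGS:NIGHTLY_WARNINGS.LIS "SYBASE_DUMP"
$ IF $SEVERITY .EQ. 1 THEN @SITE$TOOLS:RAISE_ALARM "Sybase dump open during backup"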

Wim
Jan van den Ende
Honored Contributor

Re: taking backup of disks of a production system

Re John:

I admit to over-simplifying.
I should have stated explicitly that databases need to be quiesced with the tools procured from their DBMS vendors. I concentrated on RMS files because here there is much to be gained by CONVERT/SHARE, _after_ the dismount, _on the split-off member_.
Willem also had a point that deserves to be named explicitly: IF the database consists of multiple RMS files, AND these are on different drives, it becomes IMPERATIVE to dismount them as close together as possible before their respective CONVERT/SHARE, AND any restore should be of the WHOLE database.

hoping to have been more clear this time.

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Robert_Boyd
Respected Contributor

Re: taking backup of disks of a production system

Mustafa,

I think from all the excellent responses you can see that this is not an easy issue to resolve. I have found over the years that the single most important thing to establish right up front is a clear statement of purpose and priority concerning backups.

Different sites and systems have different priorities and values for why backups are done, and the methods selected are best founded on those.

I find it helpful to evaluate several questions:

Who owns the data being backed up?
Who owns the systems being backed up?
How much will it cost the owner(s) of the data in the event of a complete loss?
How much will it cost the owner(s) of the data in the event of a partial loss?
How much will it cost per hour to the business of downtime if the need should arise to restore data from backups? (This must be answered by the owner(s) of the data and the people who depend on the availability of the data)
If you could implement redundancy and backup systems to minimize the potential for downtime and maximize business continuity even in the event of partial equipment failures, what is the price point where the solution becomes too expensive? That is, when does the cost exceed the value to the business?
If your business is not willing to spend enough to implement disaster tolerant solutions, so that equipment failures or software failures present potential singularities that could interrupt business, what are the next best approaches?
How reliable do backups need to be to make sure that restoration of service can be done in as short a time as possible for the most likely failures?
How thorough do the backup/vaulting/recovery testing methods need to be in order to guarantee business continuity in the case of a disaster in the computer facility (fire, flood, power malfunction, chemical spill, toxic gas release, biohazard, storm, earthquake, etc...)?
Are backups/archival data required for regulatory reasons?
Are backups required to do occasional restores of accidentally deleted files?
Are there other reasons that mandate any particular type of backups?
What service level do the owners/users of the data demand in terms of expected time to restoration for file restores?
How are these requirements ranked in priority?

There are probably some other issues that go along with these. This is a beginning for having the necessary discussions with the decision makers about how to come to reasonable decisions about how to structure backup/recovery/continuity solutions.

I welcome comments on these or additional questions/discussions that are relevant to making the necessary tradeoffs for any particular site/system.

Cheers,

Robert
Master you were right about 1 thing -- the negotiations were SHORT!
John Gillings
Honored Contributor

Re: taking backup of disks of a production system

Willem,

>On SYSUAF and RIGHTSLIST - often named as
>reasons why NOT to rely on /IGNORE=INTERLOCK:
>I have no problem - whatsoever - with
>backing up these files that way.

>Why? Because the changes I NEED to preserve
>in case of a restore are changes that I,
>as a system manager, will not initiate
>while backing up the system disk: changes
>of credentials, quotas, privileges.
...
>Since these will normally be done during
>office hours, and backup (mostly) isn't,
>I don't see the conflict.

Sorry, this is all irrelevant. What if your SYSUAF.DAT goes out to tape as an INDEXED file, but comes back SEQUENTIAL? How will you recover if the file is completely unreadable?

BACKUP/IGNORE=INTERLOCK DOES NOT GUARANTEE ANYTHING! All it does is change the status of attempting to back up a locked file from BACKUP-E-OPENIN to BACKUP-W-ACCONFLICT. Severity E to severity W. That's all. There are no guarantees that there will be anything of use in the file, or that any of the file attributes are intact. That's REGARDLESS of how long it has been between updates to the file.

Please get this into your heads! Any file that gets ACCONFLICT from BACKUP/IGNORE=INTERLOCK is only as good as random bits. You CANNOT depend on the contents of the file. I don't care how many thousands of times you've gotten away with it and received a usable file; Murphy's law says the one time you WON'T get a good file is exactly the time you most need it.

Trust me! I've had the 3am phone calls from people who can't boot their system because the restored disk was from a BACKUP/IGNORE=INTERLOCK. There is NOTHING we can do to help in those circumstances.

You have the tools to obtain reliable, guaranteed backups, it's just a matter of using them properly.

The base note in this thread is a classic case. You shouldn't be asking about "how do I restore" *after* the disaster! You should START with that question, and make sure you implement a backup strategy that will allow you to do it. When disaster strikes, invoke your (written) recovery plan.
A crucible of informative mistakes
Ian Miller.
Honored Contributor

Re: taking backup of disks of a production system

Note that with BACKUP/IGNORE=INTERLOCK it is possible to save junk with no more than a warning such as the 'open for write' message. See Steve Hoffman's many postings on this topic.

What I do now is not use /IGNORE=INTERLOCK and check the log file for the files that are reported.
Then, if I want to be able to restore one of those files, I change the backup DCL procedure to do something that gets a good copy of the file which can be written to tape.
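
As a hedged example (device and log names invented), the nightly procedure might run the main pass without /IGNORE=INTERLOCK and then report what was skipped, so the procedure can be amended file by file:

$! Main pass deliberately without /IGNORE=INTERLOCK: open files fail with
$! BACKUP-E-OPENIN and are left out instead of being saved as possible junk.
$ BACKUP/IMAGE/VERIFY DKA100: DKA300:[BCK]FULL.BCK/SAVE_SET
$! Afterwards, list what was skipped and decide per file how to get a good
$! copy next time (CONVERT/SHARE, application export, scheduled close, ...).
$ SEARCH BCK$LOGS:FULL_BACKUP.LOG "OPENIN"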
____________________
Purely Personal Opinion
Lawrence Czlapinski
Trusted Contributor

Re: taking backup of disks of a production system

PRODUCTION SYSTEMS: Some of our manufacturing production systems normally run about 360x24. All you can do is reduce the risk of having unusable files. It would be nice if backup had a way of knowing whether a file was temporarily or permanently locked. If it was temporary, it could test the lock again; then the question is how long it is reasonable to wait for a lock to clear before ignoring it. In our environment, the customer is willing to risk losing data rather than take the downtime to stop production. Some production environments can't afford to lose data. Our production disks are mirrored most of the time. Our biggest exposure is when we lose the mirroring of our disks, such as when a RAID controller or an HSD goes down.
DEVELOPMENT SYSTEMS: We have had more exposure with development clusters/systems, especially for disks that aren't mirrored. So far our customer is willing to risk the loss of the data rather than purchase redundancy on two of our older development clusters. Our newest development system has redundant system disks on the same controller.
1. Unfortunately for many of us, BACKUP/IGNORE=INTERLOCK is a necessity until there are better backup options. Unfortunately, files are often opened no-share and kept open. Getting quiet files for backup in a 360x24 production environment is hit or miss. Files that aren't changing frequently are more likely to be consistent. The elapsed time it takes to back up related non-database files is also an important factor. CONVERT/SHARE can help with important files. In our DAILY_MGR.COM I do a CONVERT/SHARE of some important system files that were recommended in BACKUP discussions in this forum (see the sketch after this list).
2. Database backups for consistency of data are very important.
3. Unfortunately, applications often leave files open and we seldom get a say about that unless the files get quite large; then we can ask that a new file be started periodically.
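
A minimal sketch of the DAILY_MGR.COM idea mentioned in point 1 (the destination directory is invented; the usual candidates are the cluster authorization files):

$! Take shareable, consistent snapshots of key system files so the nightly
$! backup always has a readable copy even while the originals stay open.
$ CONVERT/SHARE SYS$SYSTEM:SYSUAF.DAT DKA300:[SNAPSHOT]SYSUAF.DAT
$ CONVERT/SHARE SYS$SYSTEM:RIGHTSLIST.DAT DKA300:[SNAPSHOT]RIGHTSLIST.DAT
$ CONVERT/SHARE SYS$SYSTEM:VMSMAIL_PROFILE.DATA DKA300:[SNAPSHOT]VMSMAIL_PROFILE.DATA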
Lawrence
Doug Phillips
Trusted Contributor

Re: taking backup of disks of a production system


>Any file that gets ACCONFLICT from
>BACKUP/IGNORE=INTERLOCK is only as good
>as random bits. You CANNOT depend on the
>contents of the file.


An ACCONFLICT doesn't tell you anything about the integrity of the backup. An error on the /VERIFY step does. Any backup that doesn't do /VERIFY can't be trusted. (DP's 6th rule of backups:-)
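
For what it's worth, a hedged example of what that rule looks like on the command line (device and save-set names invented):

$! /VERIFY makes BACKUP re-read the save set after the copy pass and compare
$! it with the source; any miscompare is reported (BACKUP-E-VERIFYERR).
$ MOUNT/FOREIGN MKA500:
$ BACKUP/IMAGE/VERIFY/IGNORE=INTERLOCK DKA100: MKA500:FULL.BCK/SAVE_SET/REWIND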
mustafa_12
Frequent Advisor

Re: taking backup of disks of a production system

First of all, I thank you all very much for your valuable answers. As a reply to your comments, I want to add some information. There is no database on our disks, so this greatly reduces the risk of transactional inconsistency. Although the programs are up and running at the time of the backup, there is no interaction with the programs, AFAIK. However, in the logs there are sometimes "open for write by another user" warnings, and these warnings are produced more for the system disk than for the others. From your comments, I understand that it is too risky to use a newly restored system disk made from this kind of backup.

You mention about "volume shadowing". I've read from the HP book that it is not possible to use incremental backup in volume shadowing. Because when the disk that is backed up is entered back to the shadow set, the backup date attribute of the file on it is modified according to the one on the disk that remained in the shadow set. I wonder, if it is possible somehow to take incremental backup using volume shadowing

best regards...
labadie_1
Honored Contributor

Re: taking backup of disks of a production system

To John Gillings:
I fully agree with what you have said about backup/ign=int.

A friend of mine used to say that instead of doing a backup/ign=int, he preferred to do no backup at all; that way he knew (and all the other people knew too) that he had no backup.
Rather provocative, but worthy of attention

:-)
John Gillings
Honored Contributor

Re: taking backup of disks of a production system

>I've read from the HP book that it
>is not possible to use incremental
>backup in volume shadowing. Because
>when the disk that is backed up is
>entered back to the shadow set, the
>backup date attribute of the file on
>it is modified according to the one
>on the disk that remained in the
>shadow set.

Correct, but that doesn't stop you from using incremental backups altogether.

You can still do BACKUP/MOD/SINCE=BACKUP, you just can't do /RECORD because the recorded date will be lost (and will also break MINICOPY).

Suppose you do a full BACKUP/RECORD on Sunday on your "live" data. Each subsequent day you'll get everything since Sunday, instead of everything since the day before. This may be reasonable for your data, but it will affect your restoration procedure. In practice it will make the restore easier, because you just lay down the full backup and restore the most recent incremental, rather than all intervening incrementals.

If you want to get clever about it, you could use DFU to set the backup date on the live files after completing the backup.
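
A hedged sketch of that schedule (device names invented; note the daily pass deliberately omits /RECORD, and the exact DFU qualifier for resetting backup dates should be checked with DFU HELP):

$! Sunday: full image backup, recording the backup date on each file.
$ MOUNT/FOREIGN MKA500:
$ BACKUP/IMAGE/RECORD DKA100: MKA500:FULL.BCK/SAVE_SET/REWIND
$! Monday-Saturday: everything modified since the recorded (Sunday) date.
$! Without /RECORD each daily picks up everything since Sunday, so a restore
$! is just the full save set plus the most recent daily.
$ BACKUP/MODIFIED/SINCE=BACKUP DKA100:[*...]*.*;* MKA500:DAILY.BCK/SAVE_SET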

Bottom line here is you should all be lobbying your application developers to build backup and restore into the application from the ground up.

I'd be very wary of any application that claims to be "24x7" which doesn't have a well integrated backup and restore capability. Most of the time it's very easy to do if it's considered at the design stage.
A crucible of informative mistakes
John Gillings
Honored Contributor

Re: taking backup of disks of a production system

Re: Doug,

>An ACCONFLICT doesn't tell you anything
>about the integrity of the backup. An
>error on the /VERIFY step does. Any
>backup that doesn't do /VERIFY can't
>be trusted. (DP's 6th rule of backups:-)

BZZZT! Sorry, NO, your 6th rule is not necessarily correct. A file with an ACCONFLICT cannot be trusted EVEN IF THE VERIFY SUCCEEDS! That could just mean that the verify pass got the same junk as the initial pass. It DOES NOT mean the bits in the backup copy are useable.

Please drum this into your heads. A locked file is LOCKED. Only the application that has it locked has a 100% reliable view of it.

>If it was temporary, it could test the
>lock again. Then the question is how
>long is reasonable to wait for a lock
>to clear before ignoring it.

Retrying doesn't always help either. The objective of a BACKUP is to get a copy of ALL your data in a consistent state at a specific point in time. It needs to be a state from which the application can run, AND you need to have some idea of that state. Retrying random files at arbitrary times decreases the chances that your complete set of files is in a useful coherent state.

Only the application itself can "know". Some data is relatively easy to track. For example, a series of discrete transactions, you can just record and replay the sequence.

Other data is much more difficult. A simple example, a financial transaction might involve adjusting debits and credits across several accounts, stored in multiple locations (files). If you backup those files at skewed times, you may miss some parts of the transactions, so your accounts won't balance.
A crucible of informative mistakes
Doug Phillips
Trusted Contributor

Re: taking backup of disks of a production system

John,

The "junk" you're refering to; do you mean inconsistency in the application data? You don't mean RMS, do you? Never had backup corrupt RMS in 25 years but I've had it cause inconsistent application data (usually between related files).

I stand behind /VERIFY as the best detector of problems. At many, many sites, if we couldn't use /IGNORE=INTERLOCK then we couldn't do backups at all --- and I can recover the small percentage of inconsistent application data *much* more easily than the 100% of the data lost because there wasn't a backup.

True, mirroring & such is getting cheaper and more available to the smaller company, but the reality is that there are still a bunch of systems that won't be replaced or upgraded for years. That's just the way the real world works.

Rather than say /ignore=interlock should never be used, I say: know your data and understand what /ignore does; then decide whether or not to use it. Recovery tools are very important, and data consistency test and recovery is one of the first tools that should be built.

If backup has a problem with maintaining RMS file structure integrity, then shame on the engineers who designed it. If it doesn't have a problem, and I can trust that the RMS container will be whole, then I'll worry about my data, thank you.