Operating System - OpenVMS

Deleted files blocks being reallocated (and visible) to new files?

 
Mark Corcoran
Frequent Advisor

Deleted files blocks being reallocated (and visible) to new files?

I've been looking off and on at a problem that's been plaguing us for the last two months (seemingly after an RDB patch was installed)...

There is a process which creates its own log file (in addition to the "redirected" SYS$OUTPUT) once a day, just after midnight.

The original programmers simply have 3 calls for writing into the log file:

FILE *handle = fopen(filename, "a");
fprintf(handle, "log message");
fclose(handle);

For each message written to the log file, there's a call to fopen, fprintf and fclose (I guess the original developer didn't work out how to make the file shareable to interactive VMS users "at the same time" as it was being written to) - so you can force a new file to be created simply by renaming the existing one, rather than waiting for the process to detect a day rollover.
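Concretely, the pattern amounts to something like this (a minimal sketch only; `log_message` and its arguments are my placeholders, not the application's real names):

```c
#include <stdio.h>

/* Minimal sketch of the per-message pattern described above: every
   message opens the file in append mode, writes, and closes it.
   log_message and its parameters are illustrative placeholders. */
static int log_message(const char *filename, const char *msg)
{
    FILE *handle = fopen(filename, "a");   /* creates the file if absent */
    if (handle == NULL)
        return -1;
    fprintf(handle, "%s\n", msg);
    return fclose(handle);                 /* flush and release per call */
}
```

The side effect of this style is exactly what the original developer presumably wanted: between messages the file is closed, so it can be renamed (or TYPEd) from an interactive session at any time.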

What has been happening is that the start of the file appears to be getting corrupted with "random stuff".

I've looked at the calls for writing into the log file, and it's as basic as I've just outlined - apart from a couple of places where %s is used (to substitute one element of an array of fixed strings, or to substitute a string containing the current date & time), it's either fixed text, or integers that are being substituted using %d.

The "random stuff" that's appearing in the log file is not anything that the process would have access to from any global section it maps to or even if running off beyond the end of an array - it's part of other (log) files.

During some experimentation today, I found what appears to be happening: when the file is being created (rather than appended to), the blocks allocated to it appear to include 1152 blocks from files that have (recently) been deleted...

The log file created by the process during the test (when I renamed its existing log file) had two extents:

Count: 552 LBN:35395413
Count: 621 LBN:35396172

The first 492 blocks of the first extent came from another log file I'd manually created, and the remaining 60 blocks from some other log file.

Of the 621 blocks in the second extent, only the last 21 actually contained log messages written by the process.

It's clear to me that something is going awry here, but this is the only place (file) where it has been observed.

The disk on which the log files exist (it was moved, but the problem continued) has a cluster size of 69 blocks, is ODS-2, and does not have highwater file marking enabled.

The deleted files which appear as part of this process's log file are (to the best of my knowledge) not set to /ERASE_ON_DELETE, and the file is a "bog standard" file as would be created using fopen(filename, "a") from the DEC C library - i.e.:

File Organisation : SEQUENTIAL
Shelved State     : ONLINE
Caching Attributes: WRITETHROUGH
File Attributes   : EXTEND=0, GLOBAL BUFFER COUNT=0
Record Format     : STREAM_LF
Record Attributes : CARRIAGE RETURN CARRIAGE CONTROL
RMS Attributes    : NONE
Journaling Enabled: NONE
Client Attributes : NONE

Despite there being no highwater marking and the apparent lack of use of ERASE_ON_DELETE, my understanding is that the old contents of the blocks allocated to the file SHOULDN'T be visible (you can see them from DUMP, but generally get an error on TYPE, because the file contents now have lines >32768 characters long).

Even if this were the case, what I really don't understand is why, for 21 blocks' worth of use (over about 70 minutes), the file should have had 1173 blocks allocated to it?

It's a standard C call (not directly using RMS, so isn't specifying an initial allocation size or DEQ), and the .COM file which starts up the executable doesn't change RMS settings with the DCL command SET RMS_DEFAULT.

There are no disk errors being reported (bearing in mind that even after moving to another disk, the problem continues, though I'm not sure whether it is served by the same controller), nor anything odd in ERRLOG.SYS or the operator log file.

A previous attempt at ANA /DISK /NOREPAIR hasn't shown anything untoward, like doubly-allocated blocks.

A defragger utility (part of the Disk File Optimizer suite) does run in the early hours of the morning, but it does so on most disks, and stopping it made no difference.

The nodes in the cluster have even been rebooted since the problem first occurred.

It strikes me that it's probably poisoning of a cache (XQP?), and I seem to recall that this RDB patch was partly intended to fix some (of its own?) cache corruption.

The RDB patch installed was RDBV71521AM, onto an OVMS-AXP v7.3-2 cluster.

Don't suppose anyone has any thoughts on the matter (or might even have encountered it themselves)?

Regards,

Mark
4 REPLIES
Wim Van den Wyngaert
Honored Contributor

Re: Deleted files blocks being reallocated (and visible) to new files?

I would think this is only possible if the file is created and a SET FILE /END_OF_FILE is done.

fwiw

Wim
Gregg Parmentier
Frequent Advisor

Re: Deleted files blocks being reallocated (and visible) to new files?


Looks to me like the write is being done with an offset into the file that corresponds to the next write that would have been done to the renamed file.

I'm not sure how to track that, but hopefully it might get you looking in the right direction.
RBrown_1
Trusted Contributor

Re: Deleted files blocks being reallocated (and visible) to new files?

I saw that you could rename the log file and your application would automatically create a new one.

What happens at midnight? Does some agent rename the log file so that the application creates a new one? Or does the application simply close the file and create a new one with a different name?

Does the problem happen only at midnight? Or only when you rename the existing log file? Or both?

And this only happens with this one application? Do any other applications do something similar?

If you make a test program with only the fopen, fprintf, and fclose, maybe in a loop with a time delay, can you make the problem happen by renaming the log file? If so, you could play with the debugger and get a better idea of what is going on. If you stop after the fopen creates a new file, but before the fprintf, I expect that $ ANALYZE/RMS would show EOF VBN=0, offset 0. After the fclose, EOF VBN & offset should point to the byte after the message you sent.

What happens if you rename the file between the fopen and the fprintf? Or between the fprintf and the fclose?
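Such a test program might look roughly like this (a sketch under my own assumptions; the filename, message count, and delay are all arbitrary):

```c
#include <stdio.h>
#include <unistd.h>

/* Rough reproducer: one open/append/close per message, with an
   optional delay between iterations so the log file can be renamed
   mid-run from another terminal.  Names and counts are arbitrary. */
static int write_messages(const char *filename, int count, unsigned delay)
{
    for (int i = 0; i < count; i++) {
        FILE *handle = fopen(filename, "a");
        if (handle == NULL)
            return -1;
        fprintf(handle, "message %d\n", i);
        fclose(handle);
        if (delay)
            sleep(delay);   /* window in which to RENAME the file */
    }
    return 0;
}
```

Run it with a delay of a second or two, RENAME the log file partway through, and compare the ANALYZE/RMS_FILE output of the old and new files.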


hth
- Rob
Mark Corcoran
Frequent Advisor

Re: Deleted files blocks being reallocated (and visible) to new files?

>What happens at midnight? Does some agent rename the log file so that the application creates a new one? Or does the application simply close the file and create a new one with a different name?

Each time the process writes to the file, it does so by fopen in append mode, using a created-on-the-fly filename, which contains the current system date.

Consequently, on a day rollover, a log file containing the new date does not exist, and gets created.
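For illustration, the date-stamped name would be built along these lines (a sketch only; the "PROCESS_" prefix and ".LOG" extension are my assumptions, and I use gmtime here for determinism where the real code would presumably use local time):

```c
#include <stdio.h>
#include <time.h>

/* Sketch of building a log-file name containing the current date.
   The "PROCESS_" prefix and ".LOG" extension are illustrative
   assumptions; gmtime is used so the result is deterministic,
   whereas the real application would presumably use local time. */
static void make_log_name(char *buf, size_t len, time_t now)
{
    struct tm tm_utc = *gmtime(&now);
    strftime(buf, len, "PROCESS_%Y%m%d.LOG", &tm_utc);
}
```

At midnight the date portion changes, so the next fopen(..., "a") finds no file of that name and creates one.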

During testing today, every time I renamed the existing log file, when the process fopen()ed the file in append mode, it created a new one, and the new file always had 1152 blocks of other deleted files' contents at the start of it.



>Does the problem happen only at midnight? Or only when you rename the existing log file? Or both?

See above. Also, whilst the process was running on "the other" node in the cluster (two main nodes, plus a quorum node on which the software doesn't run), it apparently ran for ~5 days without encountering the problem.

How/why/when the process failed over onto the other node, I don't know - I wasn't looking at the problem at the time.


>And this only happens with this one application? Do any other applications do something similar?

Well, it's the only one we've observed it on, though that doesn't mean it's the only one it's happening on!

Normally, we wouldn't even look in the log file unless something else has failed, but on looking in the log file "after" the RDB patch, the problem was observed.

Of course, it could well have been happening before the RDB patch, just nobody ever noticed...


>If you make a test program with only the fopen, fprintf, and fclose, maybe in a loop with a time delay, can you make the problem happen by renaming the log file?

Well, I did try this with a.n.other program I wrote, but it was really for the purpose of checking whether the corruption occurred on initial allocation, or once the file grew beyond the initial allocation of the volume's cluster size (with my test program's log file, the problem wasn't observed).


>What happens if you rename the file between the fopen and the fprintf? Or between the fprintf and the fclose?

In terms of the log file created by the "offending" process, this is impossible without a rebuild, a change request being raised, and new software being installed (the fprintf() call immediately follows the fopen(); I may be quick on the keyboard, but not that quick!)


I should also note that I've tried putting an ACL on the directory that contains the log file (since you can't put it on a file which hasn't yet been created), but it only shows the process accessing the file, and us support folks DIR/TYPE/DUMPing the file.

The version of RDB on the build cluster (alas, we are not responsible for it) doesn't match the version on the production cluster (I know, I know!).

Since the fopen() call is (eventually) making use of RMS to create the file, the RMS structures (RAB/FAB etc) must be in the process space somewhere.

I wondered whether or not a mismatch of RDB versions (possibly even VMS versions - I'm not logged in at the moment to check) between the build and production systems could result in the RMS structures on the production system being partially trashed...

Not trashed to the extent that an exception occurs or that file access is impossible, but what if, for example, the EOF block/marker in the structure is overwritten with a value that happens to be >=1152 blocks, and somehow (perhaps after the fprintf() and/or fclose()) the old contents of the allocated blocks "become visible again"?


Mark