Operating System - OpenVMS

RMS - How does SET FILE /ATTR=EBK:blk differ from a $QIO using FIB$M_TRUNC?

Mark Corcoran
Frequent Advisor

RMS - How does SET FILE /ATTR=EBK:blk differ from a $QIO using FIB$M_TRUNC?

I've been looking at some existing code written by someone else, with a view to altering its behaviour in a manner that will avoid an outage.

The problem is that an application creates a relative record file for use as a queue, but the way the queue is used means that new entries are added to the end of the file rather than re-using existing (processing-completed) records - i.e. no linked list is in operation.

Once in a while, the code that I am looking to change is used to modify these queue files, shuffling still-in-play records up to the head of the file and resetting the head/tail pointers for the queue.

[This is all done with the file being opened for share, and supposedly guarded against inappropriate updating by other processes at the same time, by use of locks (locks used by this code and the other processes which share the file, rather than RMS locking).

I'm not entirely convinced that all pieces of code that use this locking, do so correctly, but in any case, when this utility is run, other processes that write to the queue file are killed off.

The only other processes which share it are ones which read from it, but these are not stopped, because that basically involves shutting down the entire application]

It's all very well that this utility allows recovery of allocated blocks for re-use, but on occasion, the file can grow very large, and causes an issue with disk space.

What I'm looking to do, is to modify the utility to free up the recovered allocated blocks.

The DCL command SET FILE/ATTR=EBK:blk cannot be used, because aside from the file being RMS-locked at the time you would attempt to issue the command, you cannot accurately determine the end block on-the-fly.

[I suppose that if the file had say 1,000,000 blocks allocated, but only 20,000 in use, you could "safely" assume that you could free up the last 750,000 blocks, because the processing time required to add transactions to the queue file to use up blocks 20,001 through 250,000 would be several tens of minutes if not hours]
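The "safe margin" reasoning above reduces to simple arithmetic. A hypothetical helper (not part of the actual utility - the function name and margin value are illustrative) might look like this:

```python
def safe_truncate_block(allocated, highest_in_use, margin):
    """Return a conservative block number to truncate at, keeping
    `margin` spare blocks beyond the highest block currently in use.
    Returns None if truncation would not actually free anything."""
    candidate = highest_in_use + margin
    if candidate >= allocated:
        return None  # nothing worth freeing
    return candidate

# The example from the text: 1,000,000 blocks allocated, 20,000 in use;
# keeping a 230,000-block margin truncates at block 250,000, freeing
# the last 750,000 blocks.
print(safe_truncate_block(1_000_000, 20_000, 230_000))  # 250000
```

The margin buys time: writers would have to fill the whole margin before the truncated space could have mattered.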

I started looking at how I might achieve truncation of the file, and happened upon FSTIOCPY.C in a posting to comp.os.vms

I endeavoured to use this as a basis for standalone attempts at proving the concept of being able to truncate a file that is open and shared by other processes.

However, my code always gets a return status of %X24 (%SYSTEM-F-NOPRIV) on the $QIO using FIB$M_TRUNC

The RMS Reference manual says that the truncation only works for sequential files (but the logic of what FSTIOCPY.C does, suggests it can work for any file organization).

However, irrespective of whether I try to use this for a sequential or relative file, I still get the same error.

It's not a privilege issue as such, because I can quite happily use the SET FILE /ATTR=EBK command, so I'm assuming that there is something that I haven't set (or have set incorrectly) in the RAB/FAB/XAB/FIB.

Although I'm quite happy to post the code, I wonder whether anyone (Hein?) can confirm whether what I'm trying to do is even remotely possible...

i.e. if you have one relative file opened by two processes (one for read, one for write/truncate, and both with appropriate sharing options), is there any way that the process with truncate-access can free up the disk blocks whilst the read-access process still has the file open?

[Even if the answer is no, I'm still prepared to post the code - I have to confess that I'm not terribly experienced with RABs/FABs/XABs et al, so it would be a useful learning experience to know what I've done wrong]


Mark
11 REPLIES
Hein van den Heuvel
Honored Contributor

Re: RMS - How does SET FILE /ATTR=EBK:blk differ from a $QIO using FIB$M_TRUNC?

>>> The problem is that an application creates a relative record file for use as a queue, but the way in which the queue is used, means that new entries are added to the end of the file

Do you know whether they use the RMS option $CONNECT to EOF, or do they maintain their own next-record-number in some master record?

>> rather than re-use existing (processing-completed) records - i.e. no linked list is in operation.



Typically that would not be a linked list, but a circular buffer: wrap around to some low record number when some high number is reached. You'd have to maintain a fence at the low side, or watch for existing records on $PUT (and not use UIF).
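For illustration only (the application is not structured this way), the circular scheme Hein describes might be sketched like this - the tail pointer is the "fence", and a put into a still-occupied slot is refused, the moral equivalent of not using UIF:

```python
class CircularQueue:
    """Fixed-capacity circular buffer of records: record numbers wrap
    to the low end once the high end is reached, and the tail (oldest
    unprocessed record) acts as a fence against overwrites."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.head = 0   # next slot to write
        self.tail = 0   # oldest unprocessed record (the fence)
        self.count = 0

    def put(self, record):
        if self.count == self.capacity:
            raise RuntimeError("queue full: would overwrite a live record")
        self.slots[self.head] = record
        self.head = (self.head + 1) % self.capacity  # wrap around
        self.count += 1

    def get(self):
        if self.count == 0:
            return None
        record = self.slots[self.tail]
        self.slots[self.tail] = None
        self.tail = (self.tail + 1) % self.capacity
        self.count -= 1
        return record
```

The file never grows: once a record at the tail is consumed, its slot is re-used when the head wraps around.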

>> However, my code always gets a return status of %X24 (%SYSTEM-F-NOPRIV) on the $QIO using FIB$M_TRUNC

Sounds like a coding error.

>> The RMS Reference manual says that the truncation only works for sequential

RMS must have chosen not to implement truncate for relative files, and even added defensive code, for a reason. The reason is not documented, but likely involves timing windows in shared environments in the general case. It sounds like you have a specific case where the added control might make this safe.


>> is there any way that the process with truncate-access can free up the disk blocks whilst the read-access process still has the file open?

Not in RMS directly. The problem is that in a shared environment the real EBK is NOT maintained on disk, but in the shared-file-lock.

So if you wanted to write a tool, then that tool would have to diddle an exec-mode lock.
Not very difficult, but not a beginner's task either. To do it properly, similar to what RMS does, the tool would also need to take out a 'prologue lock' - that is, a lock on VBN 1, again in exec mode.

Sounds like a nice little project!
Send Email if you want me to help.

Do I understand that the records know about each other, such that a simple CONVERT (with application down) would not work anyway?

If you have the external locks, can they not be used to communicate that a new version of the file must be used?
That is...
Take 'file lock'
Create new, smaller, relative file
Shuffle sparse records from old file into dense structure in new file
Release lock with value block indicating the need to re-open.

New FID in the lock value block if there are 6 bytes to spare? Or a file version number in the lock block if there is just 1 byte to spare? Even one flipped bit would be good enough if the re-shuffle will only be done 'every so often', less frequently than the longest idle wait time in the application processes.
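The signalling idea can be mocked up outside VMS: treat the lock value block as a small shared byte string carrying a generation counter (and optionally a new FID); a reader compares the generation it opened against the current one and reopens when they differ. Everything here - the class names, the layout - is illustrative, not the real DLM API:

```python
import struct

VALUE_BLOCK_SIZE = 16  # a DLM lock value block is 16 bytes

class SharedValueBlock:
    """Stand-in for a lock value block: the writer bumps a generation
    counter (and could publish a new FID) when it swaps in a new file."""
    def __init__(self):
        self.raw = bytes(VALUE_BLOCK_SIZE)

    def publish_new_file(self, generation, fid=(0, 0, 0)):
        # 4-byte generation + three 2-byte FID words, zero-padded to 16
        packed = struct.pack("<IHHH", generation, *fid)
        self.raw = packed.ljust(VALUE_BLOCK_SIZE, b"\0")

    def read(self):
        generation, *fid = struct.unpack_from("<IHHH", self.raw)
        return generation, tuple(fid)

class Reader:
    def __init__(self, vb):
        self.vb = vb
        self.generation, self.fid = vb.read()  # snapshot at open time

    def must_reopen(self):
        current, _ = self.vb.read()
        return current != self.generation
```

A reader polls `must_reopen()` whenever it takes the lock; the check is cheap because the value block travels with the lock anyway.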

Hope this helps some,
Hein van den Heuvel (at gmail dot com)
HvdH Performance Consulting
Ian Miller.
Honored Contributor

Re: RMS - How does SET FILE /ATTR=EBK:blk differ from a $QIO using FIB$M_TRUNC?

I would consider either creating a smaller file and getting the other processes to read it, or preventing the file from growing in the first place.

Fiddling with the RMS exec mode locks is wandering into unsupported territory and I don't think you want to go there.
____________________
Purely Personal Opinion
Mark Corcoran
Frequent Advisor

Re: RMS - How does SET FILE /ATTR=EBK:blk differ from a $QIO using FIB$M_TRUNC?

>Do you know wether they use the RMS option $CONNECT to EOF, or do they maintain their own next-record-number in some master record.

The latter, I'm afraid.



>Typically that would not be a linked list, but a circular buffer.

Ah, but a circular buffer does rather limit the number of records you could have in use.

[Some days, with delays/problems, it can be in the 10s or 100s of thousands :-(]


>So if you wanted to write a tool, then that tool would have to diddle an exec mode lock.
>Not very difficult, but not a beginners task either.

I was trying to establish the feasibility of doing this; unfortunately, the coding has long since been off-shored, and I fear that there would not be enough limbs left in the budget to pay for the contractors to do this!


>Do I understand that the records know about each other, such that a simple CONVERT (with application down) would not work anyway?

Not quite sure I follow?


>If you have the external locks, can they not be used to communicate that a new version of the file must be used?

It's looking increasingly like this is the way it would have to be; now I'll need to take a look at the reader code, to see how they use the files, and whether it makes sense (always a trade off - cost vs convenience, and unfortunately, the former normally overrules the latter).



>New FID in lock value block if there are 6 bytes to spare?

>Or File version number in lock block if there is just 1 byte to spare?

Now you're asking! I'm not sure that I would be able to determine this...


>Even one flipped bit would be good enough if the re-shuffle will only be done 'every so often', less frequently than the longest idle wait time in the application processes.

I think maybe once every few months the app gets shut down (generally as part of an upgrade), and we take the opportunity to sort this out at the same time; in fact, as I'm planning another release soon, I'll see if I can do that at the same time.


Mark
Hein van den Heuvel
Honored Contributor

Re: RMS - How does SET FILE /ATTR=EBK:blk differ from a $QIO using FIB$M_TRUNC?

Mark>> Ah, but a circular buffer does rather limit the number of records you could have in use.

Yes, you would have to pick a size 2x that of the largest acceptable queue backlog.


Hein>> Do I understand that the records know about each other, such that a simple CONVERT (with application down) would not work anyway?
Mark>Not quite sure I follow?

Well, an RMS CONVERT for a relative file shuffles down the records, starting from 1 and leaving no holes. So it renumbers the records. Say you had valid records 1, 5, 7.
After the convert those will be 1, 2, 3.
That's fine if each record just conveys a work element and maybe an order of arrival.
It is bad if records hold the numbers of other records, as those numbers have changed.
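Hein's point can be shown with a toy renumbering (pure illustration, not CONVERT itself): records that merely carry payload survive, but a record that stores another record's number ends up pointing at a slot that no longer exists:

```python
def convert(records):
    """Mimic what CONVERT does to a relative file: keep only the
    occupied cells and renumber them densely from 1."""
    occupied = sorted(records)  # e.g. [1, 5, 7]
    return {new: records[old] for new, old in enumerate(occupied, start=1)}

# Record 1 holds a pointer to record 5 (a head/tail-style reference).
before = {1: {"next": 5}, 5: {"next": 7}, 7: {"next": None}}
after = convert(before)       # records are now numbered 1, 2, 3
print(after[1]["next"])       # still says 5 - but record 5 no longer exists
```

This is exactly why the header record's head/tail pointers in the queue file rule out a naive CONVERT.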

Hein>>If you have the external locks, can they not be used to communicate that a new version of the file must be used?
Mark> It's looking increasingly like this is the way it would have to be;

It's the clean solution.

>> I think maybe once every few months the app gets shut down (generally as part of an upgrade), and we take the opportunity to sort this out at the same time;

If that's good enough for years, then why muck with it now?
So it is a big file... what is the problem with that? It just sits there and eats no $$$, does it?

Or in other words... what problem are you really trying to fix?

Is the file a significant cost in the total backup picture perhaps? Then maybe it deserves a special backup plan rather than a special usage plan. For example: take the tool you are writing now to shuffle records... make it write to a new file. Run it once with every backup. Set the main file as NOBACKUP and just back up the 'compressed', cloned, but unused data file.

KISS!

Good luck,
Hein.
Mark Corcoran
Frequent Advisor

Re: RMS - How does SET FILE /ATTR=EBK:blk differ from a $QIO using FIB$M_TRUNC?

>Well, an RMS CONVERT for a relative file shuffles down the records
[DELETIA]
>It is bad if records have the numbers of other records, as those changed.

Unfortunately, this is the problem - a header record(s) notes where the head/tail of the queue is, and this is used by the readers.


>If that's good enough for years, then why muck with it now.
>So it is a big file... what is the problem with that? It just sits there and eats no $$$ does it?
>Or in other words... what problem are you really trying to fix?

The problem is that the file keeps growing until the disk eventually runs out of space.

In the normal course of things, the file wouldn't grow excessively large (unless a large amount of work was placed on the queue).

However, when a transaction fails (because e.g. a downstream system is not available, or responds with an error because of a fault on it), then the transaction is requeued.

If there is only a single transaction on the queue, and it keeps on failing because of a data discrepancy between this system and the downstream system, then the transaction will spin round on the queue as fast as both systems can process it...

Which leads to the queue file getting very big very quickly. In such cases, the transaction will pertain to one customer, and it might take that customer a long time to notice that they haven't got whatever new service they ordered.

i.e. the spinning on the queue won't necessarily be noticeable to anyone looking at queue monitors, as they will simply show a single failed transaction on the queue - it won't be obvious that that transaction is spinning around 100s of times per minute.

[the queue monitors show the number of items on the queue; a single transaction on the queue as shown by the monitor could mean that

A) There's only one transaction, and it is spinning very fast
B) There's an even flow of transactions, but not so much work as to cause the queue monitor to show >1 transaction on the queue - i.e. GOOD TRANS, BAD TRANS, GOOD TRANS, BAD TRANS etc.

Because of the rate at which the queue can be monitored (locks on the file - both RMS and the "internal" lock shared by readers/writer), and the rate at which the monitor can update a VT terminal, both A and B appear the same.

Consequently, you won't know about A until either the customer complains, or you run out of disk space.]


Now, before you say:

1) Arguably, the queueing should be performed in memory rather than using queue files.

[Except you then have a problem that if you're using memory structures, you're typically going to have to give them a defined length, which may be problematic if a downstream system is down for any length of time - meaning that users can't add more to the queue for processing when the downstream system is back up again; yes, you can dynamically reallocate memory (up to process/SYSGEN limits), but it will only delay the inevitable, and may cause more/different problems than it solves]


2) The way in which requeueing is performed is perhaps not desirable; if a transaction gets an error, why remove it from the queue then add it back on again?

Why not just leave it there? Except again, depending on why the transaction failed, you don't want it to spin around (like it currently does) - you want more control over the error handling. Are certain conditions caused by resource depletion?

Are those resource depletion conditions likely to be permanent without admin intervention?

If so, leave it on the queue but don't process it; if not, leave it on the queue for reprocessing, but only reprocess once every X amount of time.

If not a resource depletion, then most likely the fault is caused either by DB discrepancies between the two systems (in which case it will almost certainly require manual intervention), or a permanent/transient fault on the downstream system, which has the same error-handling requirements as resource depletion.

3) Whilst queue files are currently in use (replacing them would mean a substantial code rewrite, testing, and cost $$$$), there are monitors in place that check for disk utilisation.

However, depending on the nature of the fault causing the queue files to be extended, the amount of time between an alert being raised (and then noticed; like most places manpower is always an issue) and the disk running out, could be very short.

Mark
Guenther Froehlin
Valued Contributor

Re: RMS - How does SET FILE /ATTR=EBK:blk differ from a $QIO using FIB$M_TRUNC?

Instead of using a relative file you could use a sequential file. Probably with fixed length record format. Use the Record File Address (RFA) to index into the file instead of the record number in the relative file case. You can extend and truncate such a file.
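Guenther's scheme can be sketched with an ordinary flat file: with fixed-length cells, a record's address is just its byte offset (the role an RFA plays in RMS), and the file can be truncated from the end once the trailing records are dead. A Python mock-up under those assumptions - the record length and helper names are invented for illustration:

```python
import os
import tempfile

RECLEN = 32  # fixed record length, so offset = (recno - 1) * RECLEN

def write_record(f, offset, payload):
    """Write a fixed-length cell and return its offset - the 'RFA'
    the caller remembers in order to re-read the record later."""
    f.seek(offset)
    f.write(payload.ljust(RECLEN, b"\0")[:RECLEN])
    return offset

def read_record(f, offset):
    f.seek(offset)
    return f.read(RECLEN).rstrip(b"\0")

with tempfile.TemporaryFile() as f:
    rfa_a = write_record(f, 0 * RECLEN, b"first")
    rfa_b = write_record(f, 1 * RECLEN, b"second")
    assert read_record(f, rfa_b) == b"second"
    # Once trailing records are dead, give the space back:
    f.truncate(1 * RECLEN)              # frees record 2's cell
    assert read_record(f, rfa_a) == b"first"
```

The key property is that truncation from the end never invalidates the addresses of surviving records, which is what makes a sequential file friendlier here than a relative one.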

/Guenther
Hoff
Honored Contributor

Re: RMS - How does SET FILE /ATTR=EBK:blk differ from a $QIO using FIB$M_TRUNC?

Random drive-by comments...

How much does an outage cost per unit time? That value is also used to determine your project funding here.

Here's a related thread -- same requirement, different implementation, and an issue -- that might provide some insight into implementations and alternatives:

http://forums12.itrc.hp.com/service/forums/questionanswer.do?threadId=1190031

Kelly Stewart_1
Frequent Advisor

Re: RMS - How does SET FILE /ATTR=EBK:blk differ from a $QIO using FIB$M_TRUNC?

Mark,

Perhaps I missed something here, but it seems to me that truncating the file after the reorganization will not correct the fault. Since the problem occurs when a given transaction is repeatedly re-queued, the final size of the file is set by the number of re-queues, not its size when the transaction first failed. That is, if a failed transaction is re-queued a million times, the file is going to have a million records. (Unless it was already bigger than that!)

In addition to your idea of delaying between attempts to recover the bad transaction, you might consider counting failures per transaction and killing or stalling one that goes over a given limit - and of course notifying somebody via mail or whatever.
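Kelly's suggestion reduces to a per-transaction retry budget. A minimal sketch (hypothetical names, no real queue I/O; the threshold of 5 is arbitrary):

```python
MAX_RETRIES = 5

def requeue_or_park(txn, retry_counts, parked, notify):
    """Decide what to do with a failed transaction: requeue it until it
    exhausts its retry budget, then park it and alert somebody instead
    of letting it spin on the queue indefinitely."""
    retry_counts[txn] = retry_counts.get(txn, 0) + 1
    if retry_counts[txn] > MAX_RETRIES:
        parked.add(txn)
        notify(f"transaction {txn} failed {retry_counts[txn]} times; parked")
        return "parked"
    return "requeued"
```

In the real application the counter would have to live somewhere persistent - which is why Mark notes below that it depends on whether there is spare space in the record to store it.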

Kelly
Robert Gezelter
Honored Contributor

Re: RMS - How does SET FILE /ATTR=EBK:blk differ from a $QIO using FIB$M_TRUNC?

Mark,

From the last couple of comments here, there seem to be more unresolved issues with this queuing system than merely file truncation.

Perhaps a good review of the coding is in order. The case of a runaway requeue seems particularly bad.

As Hoff noted, the cost of a downtime, particularly a downtime at an "inconvenient" moment, is the justification for the project budget.

It should be possible to correct the aberrant behaviors WITHOUT impacting any of the producers and consumers of the code (at least this is what I gather from the comments describing how the code functions.)

Having done these in the past, as I suspect Hoff and Hein have also, getting these cases correct is the difference between a system that runs for years without interruption, and a system that is constantly an operational problem. With proper interlocking (which can be done in eminently safe ways), compression with the file online and working should be possible.

- Bob Gezelter, http://www.rlgsc.com
Mark Corcoran
Frequent Advisor

Re: RMS - How does SET FILE /ATTR=EBK:blk differ from a $QIO using FIB$M_TRUNC?

>As Hoff noted, the cost of a downtime, particularly a downtime at an "inconvenient" moment, is the justification for the project budget.

Agreed, but bean-counters can't see beyond the end of their noses.

All managers are being told repeatedly to cut costs (who isn't, these days?), and will continue to be told to do so until there's one man and a dog looking after the system (with the dog being there to bite the man if he tries to change anything).

Since all of the development was outsourced, the suppliers charge significantly more than when the same developers worked internally within the company, making any changes at least twice as expensive.

Having seen previous quotes from the supplier, the amount of time they are likely to plan for such a rewrite (including extensive testing) is likely to be months, and unfortunately, we don't have a blank chequebook from which to issue cheques.

Don't get me wrong - the way in which the code "works" doesn't of itself lead to outages.

It simply means that the queue files get excessively large over time, and if not fixed, would consume all the disk space and then cause a problem.

As big as they do grow, I've never seen an occasion where it's caused a problem - we merely include shrinking the file as part of upgrades which involve intentional outages.

It would have been nice if we could have done this "on-the-fly", although it seems that the original developers went for ease-of-development rather than ease-of-operational-maintenance (again, most likely to do with costs).

Mark
Mark Corcoran
Frequent Advisor

Re: RMS - How does SET FILE /ATTR=EBK:blk differ from a $QIO using FIB$M_TRUNC?

>Perhaps I missed something here, but it seems to me that truncating the file after the reorganization will not correct the fault.

Well, yes and no.

Invariably, whatever fault occurs to cause a transaction to be repeatedly requeued, it will eventually be fixed.

Obviously, this leaves "ghost" instances of the record in the queue file, and these need to be removed.

As per my last update, the disk space has never (to my knowledge) run out - transactions would have to spin around for weeks before this would happen.

It's just the hassle of having to take down parts of the application in order to recreate the file to avoid getting into the situation where it would cause a problem.

I don't want to put myself or other ops staff/system managers out of a job, but ideally, systems should be as self-sufficient as possible.

If the original developers had only designed this a different way....


>you might consider counting failures per transaction and kill or stall one that goes over a given limit - and of course notify somebody via mail or whatever.

It's certainly worth considering, but much will probably depend on whether or not there is any unused space in the record that could be used to store the counter, otherwise it'll be a much larger ($$$) change.

Mark