Unexpected "SYSTEM-F-EXQUOTA, process quota exceeded" from BACKUP ?

 
Mark_Corcoran
Frequent Advisor


We have a job that archives some application log files into .ZIP files.

Ordinarily, there would be a single file per day unless support staff have had to roll it during the day to investigate issues.

Due to the possibility of multiple versions of a log file, the DCL code bundles (all) the version(s) of the day's file into a temporary file, using a combination of BACKUP and (if >1 version exists) APPEND - this is primarily to maintain the original CDT of the first version of the file.
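
For readers unfamiliar with the technique, the bundling step described above might look roughly like the sketch below. The file names (APP.LOG, APP_DAY.TMP) are hypothetical and are not those used by the real job; the only point illustrated is that a file-to-file BACKUP carries the creation date of the first version across, which COPY would not.

```dcl
$! Sketch only; APP.LOG and APP_DAY.TMP are hypothetical names.
$! Copy the oldest version with BACKUP so that its original CDT is kept.
$ BACKUP APP.LOG;1 APP_DAY.TMP
$! Append any later version of the same day's file.
$ APPEND APP.LOG;2 APP_DAY.TMP
$! Display both creation dates to confirm the CDT survived the copy.
$ WRITE SYS$OUTPUT F$FILE_ATTRIBUTES("APP.LOG;1", "CDT")
$ WRITE SYS$OUTPUT F$FILE_ATTRIBUTES("APP_DAY.TMP", "CDT")
```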

Yesterday morning, BACKUP reported the following as it attempted to copy one file to a target file (on the same disk, not to tape; effectively renaming it, but without modifying the CDT or RDT fields):

-RMS-E-CRE, ACP file create failed
-SYSTEM-F-EXQUOTA, process quota exceeded

The file would theoretically have been ~3700 blocks at the time (unfortunately during my absence no-one took corrective action, so the file then had the following day's log file appended to it).

I've done an ANA /RMS of the file, and the longest record is 77 bytes.

The account which performs the BACKUP/RENAME/APPEND & ZIP has fairly modest quotas as far as DIOLM, BIOLM and ASTLM are concerned:

Maxjobs:         0  Fillm:       300  Bytlm:       100000
Maxacctjobs:     0  Shrfillm:      0  Pbytlm:           0
Maxdetach:       0  BIOlm:        40  JTquota:       4096
Prclm:           2  DIOlm:        40  WSdef:         1500
Prio:            4  ASTlm:        40  WSquo:         4000
Queprio:         0  TQElm:        40  WSextent:     18000
CPU:        (none)  Enqlm:       200  Pgflquo:      80000

As the job regularly processes a file of this size (and didn't then have a problem the next day), it strikes me that the size of the file or any records therein is not the problem.

The code requests the application logger process to close the log file and open a new one (the recently closed one is then acted upon by the job).

The request completes synchronously (in that it is queued for delivery to the logger process), and ordinarily it would probably be actioned instantaneously.

There is another job that snapshots various network counters to .CSV files, and the stats suggest that there might have been a delay around the time the job encountered the problem, possibly due to high CPU usage (or that the snapshotting is in itself the cause; not sure if this is a common occurrence with CharonVAX).

That delay might have then impeded the closing of the old application log file and creation of a new one, so the BACKUP might have attempted to copy the file whilst it was in the throes of being closed.

However, I'm at a loss to explain how that might cause SS$_EXQUOTA to be returned to BACKUP - at best, I'd guess it might cause SS$_ACCONFLICT to be returned.

I'm clutching at straws here, so throwing it out for any pearls of wisdom that might help me see the wood for the trees.

I could - of course - increase quotas, but I'd much rather understand what "new" behaviour might have caused this job to have failed than randomly bump quotas and hope that that fixes it until the next bottleneck is encountered.

 

Any thoughts/suggestions much appreciated.

 

Mark

 

[Formerly appearing as woeisme]
Steven Schweda
Honored Contributor

Re: Unexpected "SYSTEM-F-EXQUOTA, process quota exceeded" from BACKUP ?

> [...] with CharonVAX).

   Is that the "hardware" involved?  VMS version?

> However, I'm at a loss to explain how that might cause SS$_EXQUOTA to
> be returned to BACKUP - [...]

   SS$_EXQUOTA famously reveals little about which quota was exceeded.
Knowing nothing, I'd double Pgflquo. Possibly twice.

> I could - of course - increase quotas, but I'd much rather understand
> what "new" behaviour might have caused this job to have failed than
> randomly bump quotas and hope that that fixes it until the next
> bottleneck is encountered.

   One common possibility involves multiple processes eating (and
exhausting) a shared quota.  Replicating such situations can be tough.
On the other hand, especially with an emulator, adding "physical" memory
and disk space is cheap and easy enough that 2X or 4X applied to some
quota might be a good first step.  
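
Since SS$_EXQUOTA does not say which quota ran out, one option is to have the batch job log its remaining quotas immediately before the BACKUP step. The following is a minimal sketch, assuming it is placed in the command procedure just ahead of the BACKUP command. It only shows the headroom at that instant, not what BACKUP itself consumes once it is running, so treat it as a baseline rather than a diagnosis.

```dcl
$! Sketch: list remaining/limit pairs for the current process.
$! Each pair is "remaining-count item / limit item" for F$GETJPI.
$ pairs = "ASTCNT/ASTLM,BIOCNT/BIOLM,DIOCNT/DIOLM,BYTCNT/BYTLM," + -
          "FILCNT/FILLM,ENQCNT/ENQLM,TQCNT/TQLM,PAGFILCNT/PGFLQUOTA"
$ i = 0
$next_pair:
$ pair = F$ELEMENT(i, ",", pairs)
$ IF pair .EQS. "," THEN GOTO pairs_done    ! F$ELEMENT returns the delimiter past the end
$ rem_item = F$ELEMENT(0, "/", pair)
$ lim_item = F$ELEMENT(1, "/", pair)
$ WRITE SYS$OUTPUT F$FAO("!10AS remaining !8UL of !8UL", -
        rem_item, F$GETJPI("", rem_item), F$GETJPI("", lim_item))
$ i = i + 1
$ GOTO next_pair
$pairs_done:
```

If the job fails again, the batch log then shows how much headroom each quota had at the moment BACKUP was started.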

Steven Schweda
Honored Contributor


> The account which performs the BACKUP/RENAME/APPEND & ZIP has fairly
> modest quotas as far as DIOLM, BIOLM and ASTLM are concerned:

   True.  If you want to be organized about it, the "OpenVMS System
Manager's Manual" has some advice on this:

http://h30266.www3.hpe.com/odl/vax/opsys/vmsos73/vmsos73/6017/6017pro_045.html#proc_sec

That's focused on big BACKUP jobs, but might contain some clues.

Mark_Corcoran
Frequent Advisor


Thanks Steven for your replies, and apologies for the delay in replying - unfortunately, a couple of other issues occurred since my post which took up most of my time at work yesterday.

I think you probably hit the nail on the head about "One common possibility involves multiple processes eating (and exhausting) a shared quota.  Replicating such situations can be tough."

Off the back of the problem that was encountered, when I did some testing on Thursday, using the command procedure to roll the log files a couple of times, I found that whilst the job had bundled up the multiple versions into a single concatenated file, it hadn't deleted the original files.

I modified the controlling job to run with VERIFY enabled overnight, and looked at the log file today (sometimes it's easier to work out what is going wrong that way than by re-reading the code and speculating about what values various symbols were set to).

It transpired that despite me having put comments in the code to say it was supposed to delete the files, in the end-of-day processing mode, it didn't actually call the subroutine to do it, so I've modified that today.

What has that got to do with anything, you may well ask...

Well, as part of the comment header block, I wanted to reference the original EXQUOTA failure that resulted in me doing testing which then picked up an omission in the code, so I ended up searching my handover emails for EXQUOTA, and found an email from a month ago about another EXQUOTA issue (written by me, but I sometimes have a memory like a sieve).

In that case, there is a .COM that is run /DETACHED, and it is the one which I alluded to in my original posting - snapshotting and resetting various counters (via LATCP, LANCP, NETCP, and UCX).

The process unexpectedly terminated because an EXQUOTA error was encountered when UCX SHOW INTERFACE ZE0 /FULL command was issued.

A subsequent UCX SHOW COMMUNICATION revealed that at some point (almost certainly, at the time the UCX SHOW INTERFACE ZE0 /FULL was issued), the peak count for Device_sockets was the same as the Maximum.

On one of our test systems, I dropped the Maximum Device_sockets count, established as many Telnet/FTP sessions as was necessary, and an attempt to do UCX SHOW INTERFACE ZE0 /FULL reported:

%UCX-E-INTEERROR, Error processing interface request
-SYSTEM-F-EXQUOTA, process quota exceeded

It seemed therefore that something on the network was doing some kind of port-scan or similar and hitting the TCP/IP stack with a lot of requests (more likely accidental DoS than intentional;  at least, I hope so).

The problem with that is that even if something is hitting the node for ports that have no services enabled on them, a device socket has to be temporarily allocated to deal with the request before tearing it down (OPERATOR.LOG didn't show a flurry of inbound Telnet connections, and the FTP log didn't show anything either).

If something hits the node with nearly Maximum Device_sockets connections (obviously, we are already using a few for Telnet &etc.), per second, then it becomes a DoS attack, and you can't issue any UCX commands to increase the Maximum or confirm what your current usage (or connections) is.

I'm not sure whether or not a similar issue occurred whilst the BACKUP was being attempted, and whether or not any resources that UCX uses in that context would be shared with BACKUP, leading to the EXQUOTA.

Unfortunately, there's no way of resetting the Peak count back to 0 (short of restarting UCX or rebooting the node), so I can't tell whether or not the Peak has been hit since we first encountered it a month ago (nor does there appear to be any logging option that might alert us to the condition having occurred).

[In answer to your question, yes, the "hardware" is CharonVAX, we are running at OpenVMS/VAX v6.2 (yes, yes, I know), and before you ask, UCX is V4.2 ECO 4]

It's possible of course that it wasn't another (unintentional) DoS attack on TCP/IP, and it may be that the sum load of processes running at the time exceeded some shared resource (which, like you say, would be very difficult to reproduce).

I will have a look at the accounting records next week to see what processes were running around the time of the EXQUOTA failure that BACKUP encountered - of course, whilst I can see what the total consumed resources were over the processes' lifetimes, I won't be able to say that at HH:MM:SS.hh, process X was using 95% of the resource that it consumed during (say) 10 minutes of run time.

It may be the case that one or more processes "happened" to use significantly more resources than they normally would, which might be an avenue to investigate, but I suspect that probably won't be the case.

The two applications log most commands that users enter, so I can see whether or not somebody happened to be running a resource-intensive query at the time (but there wouldn't likely be anyone other than 1st/2nd line support logged on to the system, so that's probably going to be a dead end too).

I'll take a look at the documentation to see what might be worth changing for the quotas, but realistically, that account is used for running the application's detached processes (so wouldn't want to unnecessarily "reserve" lots more resources for these processes).

As BACKUP is only really being used by that account (when one of the application processes submits the batch job containing the BACKUP command, and therefore runs under the account that detached process is running as) to copy a ~1.75MB-2MB file to another file in the same directory, it doesn't seem that it should really need to have much higher quotas (i.e. it's not doing an entire disk /IMAGE backup or similar).

I'm not sure, but I suspect that the way in which resources are used by BACKUP is different (read: greater) than COPY;  COPY unfortunately generates a new CDT/RDT for the file whereas BACKUP doesn't.

I have created a .COM to emulate the U*ix TOUCH command, by using F$FILE_ATTRIBUTES and CONVERT/FDL, so I could potentially use that as a way of preserving the CDT/RDT, but I'm not sure whether CONVERT's resource usage would (in this context) be as "bad" as BACKUP...

Hein van den Heuvel
Honored Contributor


Maybe something to do with the directory? Are there (many) ACLs on the file? Many headers?

Also check SYSGEN MAXBUF and try 16K, or 32K instead of the default 8K

According to HELP/MESS EXQUOTA:

"This message may also occur if the size of a buffered I/O request exceeds the value of the SYSGEN parameter MAXBUF."

It is NOT the result of a disk quota being exceeded, as that would look like:

-RMS-E-CRE, ACP file create failed
-SYSTEM-F-EXDISKQUOTA, disk quota exceeded

Does SET WATCH FILE [/CLASS=MAJOR] tell anything more?

Are there spawned subprocesses involved, so that you sometimes hit the AUTHORIZE PRCLM limit?
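
The checks suggested above could be run from the affected account along the lines of the sketch below. The file names are hypothetical, and SET WATCH is an undocumented command that needs elevated privilege, so this is an outline to adapt rather than a recipe.

```dcl
$! Sketch only; APP.LOG and APP_DAY.TMP are hypothetical names.
$! Current value of the SYSGEN parameter MAXBUF.
$ WRITE SYS$OUTPUT "MAXBUF = ", F$GETSYI("MAXBUF")
$! Subprocess count for this process against its PRCLM.
$ WRITE SYS$OUTPUT "PRCCNT = ", F$GETJPI("", "PRCCNT"), -
        " of PRCLM = ", F$GETJPI("", "PRCLM")
$! Trace file-system activity around the failing command only.
$ SET WATCH FILE /CLASS=MAJOR
$ BACKUP APP.LOG;1 APP_DAY.TMP
$ SET WATCH FILE /CLASS=NONE
```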

 

hth,

Hein

 

Mark_Corcoran
Frequent Advisor


Just to give an update to my last post, and respond to Hein...

I had a look at the accounting records, and there were no processes that terminated around the same time as the batch job that encountered the error, which were either new (or didn't normally run at that time), or which reported higher-than-normal quota usage.

I did observe that the failing process had a significantly higher accumulated DIO than normal (around 17000 vs the ~10000 it usually accumulated, IIRC), which might suggest that the DIOLM setting for the account needs (in this case) to be increased. That doesn't explain why the same operation succeeded subsequently, though, and I can't imagine that the job was using (and running out of) a resource shared with other processes.

I can't remember now why I looked, but although the original input file that would have been the input source of the BACKUP has since been deleted, other similar files were (for their size) comparatively fragmented - in the region of 60-80 fragments for a ~3600-block file.

I plan on trying on one of the test systems to find a similarly fragmented file, and drop the account's DIOLM further and further down to see if a similar problem occurs when using them as the source of a BACKUP.

I was going to say that I can't imagine why (if the fragmentation caused BACKUP to use more DIO and then reach DIOLM) it would then subsequently succeed on another run, but the error was reported for the output file, not the input file - could excessive fragmentation of the disk cause BACKUP to barf when it hits the account's DIOLM whilst trying to create the output file?

[Of course, the next time it ran, who knows what files were subsequently deleted, freeing up more contiguous disk blocks, that didn't require BACKUP to issue more DIO operations that might exceed DIOLM?]

 

As that particular disk has not been /IMAGE backed up & restored for at least the 4.5 years I've been working here, as part of our shutdown procedures today, in addition to /IMAGE backing up & restoring the other disk on which indexed RMS files live (and from which I'd had an issue deleting 180K+ records), I've instigated an /IMAGE backup & restore of this disk too.

The INDEXF.SYS file for it wasn't particularly fragmented (I think about 3 extensions), but I've taken a copy of BITMAP.SYS before & after the restore (at a previous job, I wrote a utility to analyse BITMAP.SYS for fragmentation;  I just need to recompile it on the test system, and throw these files at it).

 

[Actually, this is a curiosity - I'd tried doing a BACKUP of BITMAP.SYS, got no error ($STATUS indicated it was successful), but the destination of the BACKUP wasn't created.

I thought that maybe a procedure I had run earlier and aborted, had done a SET MESSAGE /NOFACILITY /NOSEVERITY /NOID /NOMESSAGE, so I tried re-enabling it, and still got no error (or informational or warning) message.

I then checked the file with F$FILE_ATTRIBUTES for NOBACKUP, but it indicated that the NOBACKUP wasn't set;  a further attempt with SET WATCH FILE /CLASS=ALL didn't (appear to) reveal any issues either;  a COPY src dest did work however, so I'm a little bit curious as to why BACKUP didn't work (and didn't whine about something that prevented it from working)]

 

Hein in answer to your questions:

>Maybe something to do with the directory? Are there (many) ACLs on the file? Many headers?

There's certainly no ACLs - there's only one system where the original developers went (IMHO) overboard on ACL usage, but this isn't one of them.  As I've implied, a DUMP /HEAD /BLOCK=COUNT=0 revealed that similar files that would be used as input to the BACKUP command, had about 60+ extents.

 

>Also check SYSGEN MAXBUF and try 16K, or 32K instead of the default 8K

I was busy doing shutdown procedures & backups today, so didn't see your message until tonight;  I'll check what MAXBUF is set to tomorrow, but I think like Steven said, reproducing this fault is going to be a nightmare.

 

>Does SET WATCH FILE [/CLASS=MAJOR] tell anything more?

The original source file has since been deleted, so I can't reproduce this, but the more I think about it (the fact that BACKUP was complaining about the output file), the more I (wrongly?) think that maybe "excessive" fragmentation on the disk (which was both the source & target of the BACKUP) combined with "low" DIOLM might be the issue.

 

>Are there spawned subprocesses involved, so that you sometimes hit the AUTHORIZE PRCLM limit?

Nope, just a single batch job which in-line (@) calls another one; I think there are only about two command procedures across 3 systems that do a SPAWN (Fortran and C .EXEs on the other hand...)

But, good call nonetheless.

 

If I can get access to the test system to rebuild my BITMAP analyser, I'll try to post summarised results;  I'll also give it another month to see how much the disk has fragmented up to then (the disk has a high level of traffic as far as log and temporary files (of varying sizes) goes - with differing purge/deletion criteria depending on their nature and usefulness);  it may be that we ought to start defragmenting this disk during our shutdowns too...

 

Many thanks,

 

Mark

Mark_Corcoran
Frequent Advisor


Okay folks, so I've done rather a lot of investigation since my last post, and it's time to give an update...

After rebuilding on a test system my program that analyses fragmentation of BITMAP.SYS, I ran it against copies of BITMAP.SYS from the disk on which the BACKUP encountered the EXQUOTA error.

Sadly, it doesn't appear to work as well as it did when I wrote it back in 2005 and had it running on a V5.5-2 system - the reported totals at the end don't match the total free blocks reported by SHOW DEV /FULL (and an ANA /DISK /NOREPAIR indicated no disk issues like lost clusters which might account for this).

Notwithstanding, I created a very fragmented disk by populating it with 1000s of files, each one cluster in size, then, when disk space ran out, deleting every second file.

BACKUP succeeded, even if the resulting file had >600 fragments.

Whilst searching handover emails in which I'd mentioned EXQUOTA, I re-read one from April in which a .COM running as a detached process had encountered EXQUOTA when it tried to issue the command UCX SHOW INTERFACE ZE0 /FULL to get counter information - I reproduced this by using UCX SET COMMUNICATION /DEVICE_SOCKET to set an abysmally small value, then ramping up Telnet sessions until the UCX SHOW INTERFACE command reported the EXQUOTA error.

When I then issued a UCX SHOW COMMUNICATION, it indicated that the Peak Device_socket count had (since reboot a few weeks previously) reached ~260 (we have the maximum set to 300).

I then thought I had found the culprit when - whilst looking at the output of UCX SHOW DEVICE_SOCKET /FULL - I noticed some device sockets that had a local IP address & port but no remote one.

These were for REXEC and RLOGIN (an artefact, it seems, of the original default behaviour of the UCX install, which enables these services).

I tried making multiple connections to the port (TELNET ipaddr portno) on OpenVMS but didn't get far due to an EXQUOTA error of my own (not MAXDETACH in UAF, but some shared quota between the parent and child processes).

I didn't go down the route of investigating it, as I suspected that attempts to create several hundred processes on the system might then encounter some other issue - so I ended up creating a .BAT file with 100s of "START TELNET ipaddr portno", opening up a "DOS box", then executing the .BAT file.

On issuing a UCX SHOW COMMUNICATION when the .BAT file finished creating lots of connections, surprisingly, it didn't actually reach the limit of 300 - it peaked at ~270 (on the test system there were far fewer user Telnet sessions and - I dare say - listener sockets).

Subsequent investigation showed that REXEC and RLOGIN weren't responsible...

UCX SHOW SERVICE REXEC /FULL (and the same for RLOGIN) indicated that the Peak session count was 0. I've never used REXEC or RLOGIN (sounds suspiciously U*ixy to me), so I'm not exactly sure what needs to be sent, but SYS$SYSDEVICE:[UCX$REXEC]UCX$REXECD_STARTUP.COM appears to split apart the NCB, and as I was only doing Telnet to the port from the DOS box, I was clearly not passing whatever it needed/expected in order to authenticate.

A UCX SHOW DEVICE /PORT=512 /FULL for the REXEC and RLOGIN services showed that the "I/O completed" count on the test system incremented by 1 for each of these test Telnet connections, even if no "real" REXEC or RLOGIN session was ever established (because the NCB didn't have what it wanted).

On the production system, these counts were still 0.

I made changes to the .COM that runs as a detached process to get network counter information once a minute, and added execution of the UCX SHOW COMMUNICATION command, extracting all the values and writing them to a .CSV file along with the timestamp (with some assistance from DEFINE /USER and then CONVERT/FDL because UCX outputs it as a "continuous" stream of characters, without "proper" CR/LF).

This allowed me to at least see when the UCX Device_socket count was getting higher than one would ordinarily expect.

Nothing relevant showed up in OPERATOR.LOG, ACCOUNTING.DAT, SECURITY.AUDIT$JOURNAL, UCX$FTPD.LOG or ERRLOG.SYS.

One of the other .CSV files of network counters indicated that the increase in Device_socket usage coincided with an increase in TCP/IP traffic.

It was either something that was bouncing off the UCX stack and being discarded/not logged/recorded, or it was hiding in plain sight.

I had a look at one of the application log files, and it indicated that around these times, large numbers of files were being created for another server to pull via FTP, and this consistently occurred at each peak.

Spanning the network switch port so we could use Wireshark required a change request and all the associated baggage, so we ended up getting a test version of the application running on one of our test networks today.

With Wireshark capturing it, it did show that new connections were being established for each of these files, and at first I thought it was an issue to do with the application - failing to drop the connection after it pulls each file, then establishing a new connection.

The application connects via Passive FTP, and when I researched a bit more on what it does and looked at the Wireshark capture again, it was clear(ish) that the sequence of events as required in the specifications (such as they are) was being followed - it appeared to be the case that it was the UCX stack itself that wasn't closing down the data connections in passive FTP.

I then did some more testing, and found that from the "DOS box" on the PC on the test network, issuing a QUOTE PASV command in the FTP connection resulted in the same hanging of data connections.

However, sadly, even establishing a new FTP session (without using QUOTE PASV) then doing this:

PROMPT OFF

LCD C:\TEST

MGET %%.TXT (of which there were 11)

still left 12 Device_sockets in use (1 for the control connection, 11 for the data connections) - so it occurs irrespective of whether or not you are in passive mode (I also reviewed the Wireshark capture for this small test, and the DOS box FTP didn't "behind the scenes" set it into passive mode).

When testing, I ALT-TABbed to the terminal session after kicking off the MGET command, hit CTRL-T, then repeatedly entered UCX SHOW COMMUNICATION until the Current Device_socket dropped back down to its original level.

Taking into account human reaction time, it's near enough (as makes no odds) 30 seconds for UCX to time out the data connection (it uses a device_socket, but doesn't create a BG device, so you can't even use UCX SHOW DEVICE_SOCKET /FULL to find lots of device sockets in a CLOSE_WAIT state or similar).
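
As a back-of-envelope illustration of what that timeout implies, the sketch below uses only figures quoted in this thread (a Maximum of 300 Device_sockets and a hold time of roughly 30 seconds). It ignores the sockets already held by Telnet sessions, listeners and FTP control connections, so the real ceiling would be somewhat lower.

```dcl
$! Rough estimate only, using the figures quoted in this thread.
$ max_sockets = 300    ! current Maximum Device_sockets
$ linger_secs = 30     ! approximate time before a data-connection socket is freed
$! If each retrieved file holds one socket for linger_secs, the steady-state
$! count is roughly (new connections per second) * linger_secs.
$ WRITE SYS$OUTPUT "New data connections/sec that would fill the table: ", -
        max_sockets / linger_secs
```

On those assumptions, a client that sustains about 10 file transfers per second would be enough to hold the socket count at its maximum.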

I've had a look at the UCX documentation, and there were discrepancies in what it said. One part referred to a logical name of UCX$FTPD_IDLE_TIMEOUT, another to UCX$FTPD_IDLETIMEOUT (a search of the .EXEs revealed it was the latter). One part said that the UCX$FTPD_KEEPALIVE logical name was created as a result of UCX SET SERVICE (FTP) /SOCKET_OPTIONS=KEEPALIVE and restarting UCX, but this didn't happen. I think there was also confusion as to the way the value is specified for UCX$FTPD_IDLETIMEOUT: one place suggested it was an integer value for the number of seconds, another suggested it was a delta time. I tried both, which resulted in inbound FTP sessions being disconnected as soon as they came in, so it must have set a timeout period of 0.00 seconds; in the end, it appears to be HH:MM:SS.hh rather than DDDD HH:MM:SS.hh.

The UCX$FTPD_KEEPALIVE in this context at least, did nothing;  the UCX$FTPD_IDLETIMEOUT is only for a main (control) connection to FTP on the node, not the myriad of data connections that are established for each file that is retrieved with an FTP GET/RETR.

I've searched other .EXEs and .COMs, and I can't find any reference to any (configurable) timer option that alters this (and if I DISC from the DOS box, the data connections still remain connected - killing the control connection doesn't kill off the "child" data connections).

Whilst I was searching in Google, a hit/result directed me to this:

https://superuser.com/questions/983350/ftp-multiple-files-from-vax-vms-to-pc-fails

In that case, the user has effectively the same symptoms, and they appear to be using TCP/IP Services v5.6 (from the "220 remote.server.location.com FTP Server (Version 5.6) Ready." message).

A quick Google search suggests that the latest version of TCP/IP Services for OpenVMS/VAX is v5.4 - if that's the case, then the problem persists in that, in v5.6 and possibly beyond, which implies that even if we upgraded to the latest version of OpenVMS/VAX (not without its hurdles, not least because the last Condist OpenVMS/VAX CD I had before I left my second employer is v7.2), we still wouldn't be shot of this problem.

Our UCX maximum Device_sockets count is 300, but you're probably wondering why these hanging-around FTP data connections have got anything to do with the EXQUOTA error that BACKUP got...

Well, each device_socket comes at a cost of consumption of NPAGEDYN - in SDA, it showed a significant increase in NPAGEDYN being used for NET, UCB and "UNKNOWN" (probably the internal UCX (BG?) socket structures).

On the test system even though I didn't reach the maximum of 300 Device_sockets concurrently in use, the NPAGEDYN dropped by half.

The test system doesn't have the same number of users & Telnet sessions, batch jobs, and other activity.

The executable image for BACKUP.EXE is larger than COPY.EXE, and whilst that is not a fantastic barometer of how much (and what type of) memory it uses (I didn't do SET WATCH FILE /CLASS=ALL to check image activations), a simple test of two sessions that, once logged in, only issued a BACKUP on one and a COPY on the other before logging off revealed from the accounting records that the Peak working set and Buffered IO (which will use NPAGEDYN) were significantly higher for BACKUP.

I'm of the opinion that the EXQUOTA possibly (probably?) occurred because the NPAGEDYN might have dipped too low to service the demands placed on it by BACKUP (thus, increasing the Maximum Device_sockets count in UCX is only going to delay the inevitable, and potentially lead to more issues when NPAGEDYN is depleted).

Anyone with more internals knowledge on BACKUP and NPAGEDYN than me, feel free to chip in :-D

I'd also be interested to know whether or not the issue of FTP data connections taking a long time to timeout is still a problem in more recent versions of the TCP/IP services stack (if SHOW COMMUNICATION is even still a usable command in more recent versions, and if it shows the same kind of metrics).

I would guess that a similar metrics comparison in MultinetTGV, TCPware and Wollongong Pathway (new product names these days?) might not be possible, so I couldn't even determine if they have a similar issue (or other issues that we don't currently have;  though older versions of them for running on OpenVMS/VAX 6.2 might not in any case be "available for new supply", meaning we'd possibly still need to upgrade from V6.2).

 

Thoughts, on the back of a reply post...

 

Mark

 

Mark_Corcoran
Frequent Advisor


Following my last posting, the same problem has occurred on three further occasions.

In the intervening time, I did some more testing...

  1. Creating a .COM to run as a detached process and snapshot memory statistics every 10 seconds into a .CSV file;  when analysed after the first instance of the fault, it didn't reveal any (significant) depletion of the NPAGEDYN.
  2. On our test network, colleagues set up a test instance of the system that normally connects to our OpenVMS system via FTP to pull files from it.

    I forcibly generated thousands of files for it to pick up, and increased the Maximum Device_sockets count to 600 (twice what we are currently using), then opened the floodgates to allow it to connect & pull all the files.

    Again, there was no significant depletion of the NPAGEDYN, so it seems that FTP data connections use significantly less resources than RLOGIN.

    More significantly, we found that the decay rate at which FTP data connections were finally disconnected by UCX (or at least, the Device_sockets were freed up), meant that we never reached anywhere near the 600 - old Device_sockets would be freed-up after 30s, and the system that was pulling the FTP files wasn't able to create new Data connections quickly enough to get much beyond the 300 limit that we originally had.
  3. So, we're going to be increasing the Maximum Device_socket count to 600 next week, as well as increasing the account quotas for the account under which the .COM was submitted that was encountering the EXQUOTA issue (it's definitely an issue to do with how fragmented the source of the BACKUP is, and/or the available disk space where the target of the BACKUP will be).
  4. I also resurrected a disk fragmentation analysis utility I'd written 10 years ago, and found that there were discrepancies in the total amount of free disk space it reported versus what SHOW DEVICE did.

    Three issues, one self-inflicted - it requires the cluster size parameter to be passed as a CLI argument, and I had been specifying an off-by-one value for the BITMAP.SYS that it was processing.

    The other two were that some files were showing up under ANALYZE DISK /NOREPAIR as "lost", and that even once that was sorted out, a SET VOLUME /REBUILD=FORCE was still required.

 

Mark
