
SOLVED
Art Wiens
Respected Contributor

%QMAN-W-LOWMEMORY

%QMAN-W-LOWMEMORY, the queue manager process may require more virtual memory than is currently available

VMS v7.2-2, AS800 512MB, uptime 328 01:25:12

"Suddenly" this morning, I'm receiving the above message regarding the queue manager. It is month-end, but I don't think it's any busier a month-end than others, job wise.

Thus far I have reduced the journal file size (DIAG 7) from 2M blocks to ~6K, added a pagefile and restarted some processes to make use of it.

Is there anything else I can do to help it right now? I'm waiting for a quiet time to restart the queue manager with a larger value for page file quota.

ES47's with gobs of memory are patiently waiting on the sidelines so no need to suggest I upgrade, we are progressing at glacial speeds in that direction.

Cheers,
Art
26 REPLIES
Robert Gezelter
Honored Contributor

Re: %QMAN-W-LOWMEMORY

Art,

Clustered or non-clustered environment? Shared system disk or separate system disks?

I'm thinking along the lines of fixing the queue manager on each different node in turn.
However, I don't have access to verify everything at this instant.

- Bob Gezelter, http://www.rlgsc.com
Art Wiens
Respected Contributor

Re: %QMAN-W-LOWMEMORY

4 Alpha's in a cluster with a shared (36GB HSG80) system disk. Single queue manager.

Art
Hoff
Honored Contributor

Re: %QMAN-W-LOWMEMORY

That smells like a (slow) leak. Check for patches.

I suspect you'll be restarting the queue manager.

Presuming it's not some other system-level leak.

And if I were suggesting an upgrade, memory would be a factor, but I'd also be looking at OpenVMS Alpha V8.3 or at OpenVMS I64 V8.3-1H1.

I'd recommend getting away from those long up-times, too; they look appealing on the surface, but they tend to leave your servers down-revision on patches. My preference is shorter up-times, preferably with some redundancy of servers, and with clustering or fail-over. In other words, plan for failures and for upgrades rather than trying to avoid them. (There's often no prize for long up-times, and there can be costs.)

Barring key applications that are locked into Alpha, I'd be moving to OpenVMS I64, too.
Art Wiens
Respected Contributor

Re: %QMAN-W-LOWMEMORY

I know it's behind on patches ... these systems will not receive any further "maintenance". Life support until we can get the apps over to the ES47's (which are VMS v8.3). There isn't (unfortunately) any great interest in going with Integrity either. One last push to the ES47's and I think that will be it for these apps, but you never know ... we're in the 8th year of the "in 3-5 years, all the VMS systems will be gone" decree.

Our company will be changing hands "soon", we'll see what the new owners think of VMS.

Cheers,
Art
John Gillings
Honored Contributor

Re: %QMAN-W-LOWMEMORY

Art,

Have you checked for accumulating queue entries? Maybe a queue with /RETAIN?

I run a job daily to look for retained jobs, or jobs with procedures that don't match the most recent file.
A crucible of informative mistakes
Art Wiens
Respected Contributor

Re: %QMAN-W-LOWMEMORY

Not really too much crud. I cleaned up some old retained on error print jobs, but for the most part it's all current jobs waiting to run (timed release batch) or waiting to print.

Art
P Muralidhar Kini
Honored Contributor

Re: %QMAN-W-LOWMEMORY

Hi Art,

Looks like QMAN is not getting enough virtual memory to perform its work.

What is the JOB_LIMIT of the queue? JOB_LIMIT indicates the number of jobs
in the queue that can execute in parallel.
Also, how many jobs in the queue actually execute in parallel?

>> restart the queue manager with a larger value for page file quota.
Yes. This would give the queue manager some more virtual memory to work with.

Regards,
Murali
Let There Be Rock - AC/DC
Shriniketan Bhagwat
Trusted Contributor

Re: %QMAN-W-LOWMEMORY

Hi,

Looks like a fix is available for this problem,
from the kits VMS732_QMAN-V0100 and VMS82A_QMAN-V010:

=============================================

Problem Description:

Systems which have a very large physical memory (more
than 10GB) and a large PAGEFILE.SYS sometimes produce the
following QMAN error messages.

%QMAN-W-LOWMEMORY, the queue manager process may require
more virtual memory than is currently available.

%QMAN-F-ALLOCMEM, error allocating virtual memory
-LIB-F-INSVIRMEM, insufficient virtual memory

%JBC-E-QMANDEL, unexpected queue manager process termination

=============================================

Regards,
Ketan


P Muralidhar Kini
Honored Contributor

Re: %QMAN-W-LOWMEMORY

Hi Art,

Also,
>> VMS v7.2-2, AS800 512MB, uptime 328 01:25:12
Your current version of VMS is V7.2-2.
You would have to upgrade VMS to V7.3-2 or later in order to be able
to install the QMAN patches which have the fix for the "QMAN-W-LOWMEMORY"
problem.

Regards,
Murali
Shriniketan Bhagwat
Trusted Contributor

Re: %QMAN-W-LOWMEMORY

Hi,

OK, the fix is not for V7.2-2. Did you check the queue manager's page file quota?

SDA> SET PROCESS/INDEX=
SDA> READ SYSDEF
SDA> FORMAT JIB
...
FFFFFFFF.81DC0C80 JIB$L_PGFLQUOTA 00000A00
FFFFFFFF.81DC0C84 JIB$L_PGFLCNT 00000000
...

Regards,
Ketan
Shriniketan Bhagwat
Trusted Contributor

Re: %QMAN-W-LOWMEMORY

Hi,

>> I'm waiting for a quiet time to restart the queue manager with a larger value for page file quota.

This can be the workaround for the problem.

Regards,
Ketan


Volker Halle
Honored Contributor

Re: %QMAN-W-LOWMEMORY

Art,

you might want to have a look at the problem analysis section of the QMAN-W-LOWMEMORY problem/solution in the patches referenced before:


5.2.5.3 Problem Analysis:

With a large PAGEFILE.SYS, the check for available memory in the Queue Manager may overflow a longword. The results of this overflow are the unnecessary LOWMEMORY warnings and the possible system crash.
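The arithmetic failure described in that analysis can be sketched outside of VMS. This is a hypothetical Python illustration, not the actual QMAN code: if the free-pagelet count is multiplied by the 512-byte pagelet size inside a signed 32-bit longword, anything past 2 GB of free pagefile wraps negative, and a "memory below threshold" comparison then fires spuriously.

```python
PAGELET = 512  # bytes per OpenVMS pagelet

def to_int32(n):
    """Wrap an integer into a signed 32-bit longword, as the buggy
    arithmetic effectively did."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

def free_bytes_longword(free_pagelets):
    # Hypothetical sketch of the flawed check: free bytes computed in a
    # 32-bit longword. Past 2 GB of free pagefile the product wraps
    # negative, so a "low memory" test succeeds when it should not.
    return to_int32(free_pagelets * PAGELET)

small = 1_000_000                    # ~0.5 GB free: result is sane
large = (10 * 1024**3) // PAGELET    # ~10 GB free: product overflows
print(free_bytes_longword(small))    # positive, as expected
print(free_bytes_longword(large))    # negative: spurious LOWMEMORY
```

The fix in the patch kits presumably widens this computation; on V7.2-2, where no kit exists, only a queue manager restart resets the accumulated state.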


As this patch does NOT seem to be available for V7.2-2, your only real 'workaround' seems to be a restart of the queue manager.

Volker.
Volker Halle
Honored Contributor

Re: %QMAN-W-LOWMEMORY

Art,

I've seen this problem in 2004 and have the original IPMT text (from the problem escalation). It's some simple math problem (overflow) in the code.

Note that you can easily force a QUEUE_MANAGER process dump and have it restart automatically (queue commands will hang for as long as it takes to write the process dump, i.e. less than a minute):

$ MCR JBC$COMMAND
JBC$COMMAND> diag 4
%JBC-I-DIAGNOSTIC,
Log for playback = 0
Save old Journal files = 0
Log all requests = 0
Dump on error = 0
Checkpoint: State = 0, In-memory blocks = 100
PersAlpha CHAALP-E8.4 $
%%%%%%%%%%% OPCOM 4-MAY-2010 08:39:33.24 %%%%%%%%%%%
Message from user SYSTEM on CHAALP
%QMAN-F-DIAGNOSTIC, A request was made to dump the queue manager.

Note that increasing the amount of pagefile space is counterproductive in this case!

Look at PAGFILCNT of the QUEUE_MANAGER process with F$GETJPI("pid-of-queue_manager","PAGFILCNT"). If it's getting near 214 million, you may see that problem.

Volker.
P Muralidhar Kini
Honored Contributor

Re: %QMAN-W-LOWMEMORY

Hi Art,

>> It's some simple math problem (overflow) in the code.
Volker's right.
Systems with large physical memory and a large PAGEFILE.SYS would
face this problem. The problem was that the check for available memory
in the queue manager could overflow a longword.
The "QMAN-W-LOWMEMORY" messages logged were a result of this overflow.

>> Note that increasing the amount of pagefile space is contraproductive
>> in this case !
Yes. Even after increasing the pagefile space, you could see the same old
error messages again, possibly even more often.

Installing the QMAN patch would be the way forward. But for this you
would have to upgrade the current version of OpenVMS on the system to
V7.3-2 or later.

Regards,
Murali
Art Wiens
Respected Contributor

Re: %QMAN-W-LOWMEMORY

"Murali: What is the JOB_LIMIT of the queue. JOB_LIMIT would indicate number of jobs
in the queue that execute in parallel.
Also, how many jobs in the queue do actually execute in parallel?"

Well, as in most VMS systems/clusters, we have more than one queue! There are about 370 ... ~110 batch queues and the rest print queues. Impossible to say how many jobs execute in parallel at any given time.

"Ketan: Looks like fix is available for this problem."

Great, except these systems are v7.2-2 and aren't going to be upgraded.

"Murali: You have to upgrage VMS to V73-2 or onwards ..."

Not going to happen.

"Ketan: pagefile quota"

FFFFFFFF.812EF840 JIB$L_PGFLQUOTA 00009EB0
FFFFFFFF.812EF844 JIB$L_PGFLCNT 00000950

Which "matches" what I see at DCL:

$ write sys$output f$getjpi(20307833,"PGFLQUOTA")
649984  (%x9EB00)

$ write sys$output f$getjpi(20307833,"PAGFILCNT")
38144  (%x9500)
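As an aside on why those values "match": the JIB fields shown by SDA appear to be counted in CPU-specific pages (8 KB on Alpha), while F$GETJPI reports 512-byte pagelets, a factor of 16. Assuming that unit difference, the arithmetic works out exactly (sketched in Python purely for the conversion):

```python
ALPHA_PAGE = 8192   # CPU-specific page size on Alpha, in bytes
PAGELET = 512       # F$GETJPI quota/count units are 512-byte pagelets
FACTOR = ALPHA_PAGE // PAGELET  # 16 pagelets per Alpha page

jib_pgflquota = 0x9EB0  # JIB$L_PGFLQUOTA from SDA> FORMAT JIB
jib_pgflcnt = 0x950     # JIB$L_PGFLCNT

print(jib_pgflquota * FACTOR)  # 649984, the PGFLQUOTA value DCL reports
print(jib_pgflcnt * FACTOR)    # 38144, the PAGFILCNT value DCL reports
```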

"Volker: With a large PAGEFILE.SYS ..."

The original single pagefile is 1,300,000 blocks and I added another 1,300,000 block one. That's not "large" is it?

"Volker: ...you can easily force a QUEUE_MANAGER process dump and have it restart automatically (queue commands will hang for as long as it takes to write the process dump)."

What happens to currently executing / printing jobs? Evaporate, or also just hang until the mgr comes back? The points for that suggestion depend on the answer. ;-)

Cheers,
Art
Art Wiens
Respected Contributor

Re: %QMAN-W-LOWMEMORY

"Volker: Note that you can easily force a QUEUE_MANAGER process dump and have it restart automatically (queue commands will hang for as long as it takes to write the process dump, i.e. less than a minute):"

Well, the time was right, so I did the DIAG 4. What followed were some of the longest 3 or 4 minutes of my life ... the queue manager restarted and went solid computable, and the cluster "hung" ... not quite, as I was seeing OPCOM messages about users trying to log in and timing out. But all the commands I entered stalled. It did finish up whatever it was doing and things are back to "normal".

One exception, I can't use the f$getjpi lexical to get the pagfilcnt and pgflquota:

$ pipe show sys | search sys$pipe queue
2030CD04 QUEUE_MANAGER HIB 9 1893 0 00:08:31.90 4434 3267
$ write sys$output f$getjpi(2030CD04,"PAGFILCNT")
%DCL-W-IVCHAR, invalid numeric value - check for invalid digits
\2030CD04\
$ write sys$output f$getjpi(2030CD04,"PGFLQUOTA")
%DCL-W-IVCHAR, invalid numeric value - check for invalid digits
\2030CD04\

I can't do this for any process. WTF?

Art
Volker Halle
Honored Contributor
Solution

Re: %QMAN-W-LOWMEMORY

Art,

please include the double-quotes around the process-id:

AXPVMS $ write sys$output f$getjpi(26600E13,"PAGFILCNT")
%DCL-W-IVCHAR, invalid numeric value - check for invalid digits
\26600E13\
AXPVMS $ write sys$output f$getjpi("26600E13","PAGFILCNT")
495136

Volker.
Art Wiens
Respected Contributor

Re: %QMAN-W-LOWMEMORY

Never mind that last WTF ... too early in the day:

$ pipe show sys | search sys$pipe queue
2030CD04 QUEUE_MANAGER HIB 9 2805 0 00:08:32.74 4434 3267
$ write sys$output f$getjpi("2030CD04","PGFLQUOTA")
649984
$ write sys$output f$getjpi("2030CD04","PAGFILCNT")
592080

All's well again. ;-)

Cheers,
Art
Robert Gezelter
Honored Contributor

Re: %QMAN-W-LOWMEMORY

Art,

With all due respect, you need to put the hexadecimal Process ID in quotes, to wit:


$ WRITE SYS$OUTPUT F$GETJPI("05CE","BIOCNT")

Otherwise, DCL does not parse the first parameter as a literal constant; it identifies it as the name of a DCL symbol (hint: a process ID of ACED would otherwise be ambiguous).

- Bob Gezelter, http://www.rlgsc.com
Volker Halle
Honored Contributor

Re: %QMAN-W-LOWMEMORY

Art,

executing batch and print jobs continue unharmed if you restart the QUEUE_MANAGER. On my standalone V8.3 test system, it took about 21 seconds to write the .DMP file - I at least verified that the command worked before I posted it.

The time for writing the forced dump certainly depends on the virtual memory used by the QUEUE_MANAGER. Did you check the size and the creation/modification date of SYS$SYSTEM:QMAN$QUEUE_MANAGER.DMP ?

Volker.
P Muralidhar Kini
Honored Contributor

Re: %QMAN-W-LOWMEMORY

Hi Art,

>> Great, except these systems are v7.2-2 and aren't going to be upgraded.
Upgrading would have given you access to the patches which have the
fix for the problem.

As upgrading is not an option, the other alternative is a workaround,
i.e. increasing the pagefile size. But even with this change, the problem
might come back. It's a tricky situation.

As of now, how frequently are you seeing the "QMAN-W-LOWMEMORY" messages?

Regards,
Murali
Art Wiens
Respected Contributor

Re: %QMAN-W-LOWMEMORY

Thanks Bob. Hadn't quite made it out of first gear yet ;-)

Volker, the dump file only took 45 seconds to write, but for some reason the queue manager was working "very hard" when it restarted. Definitely "paused" the cluster for several minutes while it COMputed something.

Thanks,
Art
Art Wiens
Respected Contributor

Re: %QMAN-W-LOWMEMORY

Murali, since the queue manager reset 40 minutes ago, I have not seen the message again. The queue manager's pagefile quota and count are back to "normal":

$ write sys$output f$getjpi("2030CD04","PGFLQUOTA")
649984
$ write sys$output f$getjpi("2030CD04","PAGFILCNT")
592080

Cheers,
Art

P Muralidhar Kini
Honored Contributor

Re: %QMAN-W-LOWMEMORY

Hi Art,

>> since the queue manager reset 40 minutes ago, I have not seen the
>> message again. The queue manager's pagefile quota and count are
>> back to "normal"
Hmm. Looks like the workaround would be to restart the queue manager
at a quiet time.
Increasing the pagefile quota might not be required, as the problem may not be
related to that (i.e. the problem is due to a longword overflow in the code).

Regards,
Murali