Operating System - OpenVMS

Re: SET FILE/GLOB=DEFAULT -> several BugCheck/Crash

Ruslan R. Laishev
Super Advisor

SYSGEN> SHO NPA
Parameter Name       Current   Default    Min.        Max.  Unit       Dynamic
--------------       -------   ------- -------     -------  ----       -------
NPAGEDYN           200712192   4194304  163840  1879048192  Bytes
NPAGEVIR           687226880  16777216  163840  1879048192  Bytes
NPAG_BAP_MIN           40960         0       0          -1  Bytes
NPAG_BAP_MAX          131072         0       0          -1  Bytes
NPAG_BAP_MIN_PA            0         0       0          -1  Mbytes
NPAG_BAP_MAX_PA   2147483647        -1       0          -1  Mbytes
NPAG_RING_SIZE          2048      2048       0          -1  Entries
NPAGECALC                  0         1       0           2  Coded-valu
NPAGERAD                   0         0       0          -1  Bytes
NPAG_INTERVAL             30        30       0          -1  Seconds    D
NPAG_GENTLE              100       100       1         100  Percent    D
NPAG_AGGRESSIVE          100       100       1         100  Percent    D
SYSGEN>
Hein van den Heuvel
Honored Contributor
Solution

Ok, I saw the CLUE output,

- CPUSPINWAIT, CPU spinwait timer expired
- Current process Oracle
- Cause: timeout acquiring spinlock
- Spinlock name: POOL
- Non-Paged Pool:
- Unsuccessful Expansions 5456

Oracle is NOT likely to actually be touching files with RMS global buffers other than SYSUAF/RIGHTLIST.

So Oracle was LIKELY a victim.
Another process, possibly using global buffers or creating some for the first time, may have been the cause.

I would first and foremost raise this as a formal support call with HP.

I would verify the SMP_SPINWAIT and SMP_LNGSPINWAIT settings and possibly bump those as a potential workaround.

I would aggressively increase the POOL pre-allocation.

Seeing that CACHEALLMAX is 'only' 50,000, I would switch back to the tedious, manual, per-file non-default GBC setting with something like: (not actual command)
SET FILE/GLO= MIN( 32000, (35 * ALQ) / (BLS * 100) )

See if that also suffers.
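
A minimal sketch of that per-file calculation, reading the pseudo-command above as "the smaller of 32000 and 35% of the file's buckets". The function name and parameter names are hypothetical, and treating BLS as the bucket size in blocks (one global buffer per bucket) is an assumption, not something stated in the post.

```python
def manual_global_buffer_count(alq_blocks: int, bucket_size_blocks: int,
                               percent: int = 35, cap: int = 32000) -> int:
    """Return a per-file global buffer count: `percent` of the buckets, capped.

    alq_blocks         -- file allocation in blocks (ALQ in the post)
    bucket_size_blocks -- assumed meaning of BLS: bucket size in blocks
    """
    if alq_blocks < 0 or bucket_size_blocks <= 0:
        raise ValueError("allocation must be >= 0 and bucket size > 0")
    # Integer form of (percent * ALQ) / (BLS * 100), rounded down.
    wanted = (percent * alq_blocks) // (bucket_size_blocks * 100)
    return min(cap, wanted)


if __name__ == "__main__":
    # Hypothetical files: (allocation in blocks, bucket size in blocks)
    for alq, bks in [(100_000, 4), (1_000_000, 2)]:
        print(alq, bks, manual_global_buffer_count(alq, bks))
```

The result would then be typed into the real per-file command by hand; the sketch only does the arithmetic.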

IFF you can tolerate a potential crash, I would try the above BEFORE changing anything else. But I would only want you to try that to un-couple this problem from the SET FILE/GLO=DEF. I do not expect that the problem is caused by the RMS options that this command triggers, nor do I expect that the 50,000 triggered it. I suspect that the 'old' 32,000 will also cause this, and am curious to know.

How aggressively had you set RMS global buffers before this? None? 5,000? 32,000?

fyi... below the details on how SET FILE/GLO=DEF is used.

Hein.



XAB$M_GBC_DEFAULT --- Requests RMS at run time to recalculate the global cache size based on an algorithm that makes use of two global buffer (GB) SYSGEN parameters: GB_CACHEALLMAX and GB_DEFPERCENT. If the default option is enabled, and if the size (in blocks) of the file is less than or equal to the specified size for the GB_CACHEALLMAX parameter, RMS allocates sufficient global buffers to cache the whole file. If the size (in blocks) is greater than the specified size for the GB_CACHEALLMAX parameter, RMS allocates sufficient global buffers to cache the percentage of the file specified by the GB_DEFPERCENT (global buffer default percent) parameter.
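
The rule quoted above can be sketched as a two-branch calculation. This is only an illustration of the documented behaviour, not RMS code: the function name is hypothetical, the one-buffer-per-bucket conversion is an assumption, GB_CACHEALLMAX = 50000 is the value mentioned earlier in this post, and GB_DEFPERCENT = 35 is assumed from the 35 used in the manual formula.

```python
def default_global_buffer_count(file_blocks: int, bucket_size_blocks: int,
                                gb_cacheallmax: int = 50000,
                                gb_defpercent: int = 35) -> int:
    """Approximate the buffer count chosen when the GBC default option is set."""
    if file_blocks < 0 or bucket_size_blocks <= 0:
        raise ValueError("file size must be >= 0 and bucket size > 0")
    # Buckets needed to hold the whole file, rounded up (assumption: 1 buffer per bucket).
    buckets = -(-file_blocks // bucket_size_blocks)
    if file_blocks <= gb_cacheallmax:
        return buckets                      # small file: cache all of it
    return buckets * gb_defpercent // 100   # large file: cache a percentage


if __name__ == "__main__":
    for blocks in (40_000, 50_000, 50_004, 200_000):
        print(blocks, default_global_buffer_count(blocks, 4))
```

Under these assumptions a file just under the GB_CACHEALLMAX threshold ends up with more buffers than one just over it, which is one reason the per-file request can be much larger than an old hand-set value.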
Volker Halle
Honored Contributor

Ruslan,

which CPU is holding the spinlock and what's running there? SDA> CLUE CRASH should output this information in the spinlock section.

Note that nonpaged pool allocation and de-allocation to a severely fragmented variable nonpaged pool list can cause this type of crash. Depending on CPU speed, this may happen if the number of packets on the variable list exceeds 20000...

SDA> SHO MEM/POOL/FULL
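
Why a long variable list hurts can be shown with a toy model. This is not the OpenVMS pool code: it is a simplified, hypothetical address-ordered free list with first-fit allocation and no coalescing, and it only counts how many packets each operation has to visit. In the real system that walk happens while the POOL spinlock is held.

```python
class VariableFreeList:
    """Toy address-ordered free list; `steps` counts packets visited."""

    def __init__(self):
        self.packets = []   # list of (address, size), sorted by address
        self.steps = 0

    def allocate(self, size):
        """First fit: walk from the head until a packet is large enough."""
        for i, (addr, psize) in enumerate(self.packets):
            self.steps += 1
            if psize >= size:
                if psize == size:
                    del self.packets[i]
                else:
                    # Keep the unused tail of the packet on the list.
                    self.packets[i] = (addr + size, psize - size)
                return addr
        return None

    def deallocate(self, address, size):
        """Walk to the address-ordered insertion point (no coalescing here)."""
        i = 0
        while i < len(self.packets) and self.packets[i][0] < address:
            i += 1
            self.steps += 1
        self.packets.insert(i, (address, size))


def build_fragmented_list(n_small, small=64, gap=128, big=8192):
    """n_small tiny packets followed by one large packet at the highest address."""
    fl = VariableFreeList()
    fl.packets = [(i * gap, small) for i in range(n_small)]
    fl.packets.append((n_small * gap, big))
    return fl


if __name__ == "__main__":
    fl = build_fragmented_list(20000)
    addr = fl.allocate(4096)
    print("packets visited to allocate:", fl.steps)
    fl.steps = 0
    fl.deallocate(addr, 4096)
    print("packets visited to deallocate:", fl.steps)
```

In this model both operations scale with the number of packets ahead of the target, so a request that only the last packet can satisfy visits the whole list.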

Volker.
Volker Halle
Honored Contributor

Ruslan,

in the CPUSPINWAIT crash, the CPU owning the POOL spinlock (CPU 05) was executing at EXE$DEALLOCATE_C+00014, i.e. trying to allocate a packet of nonpaged pool. I bet the variable list was huge, so it just took 'too long' to find a suitable packet.

Diagnosing the CLUEXIT crash would need more configuration information.

Volker.
Volker Halle
Honored Contributor

Ruslan,

the CLUEXIT crash also shows massive nonpaged pool problems on the local node. Either increase nonpaged pool a lot or consider whether there may be a nonpaged pool leak on these systems. If the other node (xxx1), which sent the DISCONNECT message, also has similar pool problems and is a non-SMP system, this could also explain such a CLUEXIT crash. There could also have been CI problems due to the nonpaged pool shortage.

Volker.