Operating System - OpenVMS

DS15 - LOCKMGRERR crash

 
SOLVED
Jan van den Ende
Honored Contributor

Hei,

Yesterday the DS15 in our cluster crashed. This was especially unpleasant for the users of one application that is dedicated to that node.

Also, the crash error does not sound very reassuring...

Output of SDA SHOW CRASH and SDA CLUE CRASH attached.

The crash has also been forwarded through our support channel, but I guess asking here will be quicker, not least because the support channel is not exactly a direct route...

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
16 REPLIES
Jan van den Ende
Honored Contributor

Re: DS15 - LOCKMGRERR crash

and the cluë:

Proost.

Have one on me.

jpe
comarow
Trusted Contributor

Re: DS15 - LOCKMGRERR crash

How about the complete output of:
CLUE CRASH
CLUE CONFIG
CLUE REGISTER
CLUE STACK

That would be helpful.
Jan van den Ende
Honored Contributor

Re: DS15 - LOCKMGRERR crash

Comarow,

For CLUE CRASH see my previous post;
the other three are attached.

Proost.

Have one on me.

jpe
Wim Van den Wyngaert
Honored Contributor

Re: DS15 - LOCKMGRERR crash

I once had a RWCLU problem that ended in a LOCKMGR crash (V7.3). In ANALYZE/CRASH, SHOW SUMMARY will confirm whether that is the case here (check the process states).

Wim
Volker Halle
Honored Contributor
Solution

Re: DS15 - LOCKMGRERR crash

Jan,

the 'key' piece of information is in the CLUE REGISTER output:

R0 = 00000000.00000124 %SYSTEM-F-INSFMEM, insufficient dynamic memory

The fast remastering code (new since V7.3) is sensitive to resource problems.
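
The R0 value can also be decoded by hand. Below is a minimal Python sketch, assuming the usual OpenVMS condition-value layout (severity in bits 0-2, message number in bits 3-15, facility number in bits 16-27). The function name `decode_condition` is made up for illustration and is not part of SDA or any VMS tool.

```python
# Severity codes held in the low three bits of an OpenVMS condition value.
SEVERITY = {
    0: "W (warning)",
    1: "S (success)",
    2: "E (error)",
    3: "I (informational)",
    4: "F (fatal)",
}


def decode_condition(value):
    """Split a condition value into facility, message number and severity.

    Assumes the layout stated in the lead-in; only the low 32 bits are used.
    """
    low = value & 0xFFFFFFFF
    return {
        "severity": SEVERITY.get(low & 0x7, "reserved"),
        "message_number": (low >> 3) & 0x1FFF,
        "facility": (low >> 16) & 0xFFF,
    }


if __name__ == "__main__":
    # R0 from the CLUE REGISTER output quoted above.
    print(decode_condition(0x124))
```

For 0x124 this gives facility 0 (the SYSTEM facility) and fatal severity, which matches the %SYSTEM-F- prefix of the message text.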

Volker.
Jan van den Ende
Honored Contributor

Re: DS15 - LOCKMGRERR crash

Volker,

I KNEW you would beat the official support!
So, just a matter of a parameter adjustment after all.
Boy, am I glad it is not something more serious. (Well, _I_ suspected it could hardly be some inherent fault, but there ARE those who would like nothing better than to point at VMS with evidence of potential harm to data integrity, as could easily happen if the lock manager were at fault!)

Proost.

Have one on me.

jpe
Volker Halle
Honored Contributor

Re: DS15 - LOCKMGRERR crash

Jan,

this system seems to be badly tuned in general, look at:

System Uptime: 1 00:37:35.82
EXE$GL_FLAGS: poolpging,init,bugdump,pgflfrag,pgflcrit,pagfildmp

To find out about nonpaged pool expansion problems, see:

SDA> CLUE MEM/STAT

The LKBs and RSBs are allocated from S2 space:

SDA> SHOW PAGE/S2/FREE

To look at LCKMGR pool zone counters, use:

SDA> exa @LCK$AR_POOLZONE_REGION+80;20

The counters are (quadwords from right to left): hits, misses, expansions, failures.

Volker.
Ian Miller.
Honored Contributor

Re: DS15 - LOCKMGRERR crash

Look at page file space also: the MMG flags pgflfrag and pgflcrit show that the page file was full, or nearly so, at some time.
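
A quick way to spot these flags when working through saved dump summaries is sketched below in Python. It assumes the flags line has been copied as text in the `EXE$GL_FLAGS: a,b,c` form shown earlier in this thread; the helper name `pagefile_warnings` and the wording of the messages are illustrative only, and flags other than the two page file ones are deliberately left uninterpreted.

```python
# Only the two page file flags discussed in this thread are interpreted.
PAGEFILE_FLAGS = {
    "pgflfrag": "page file was fragmented at some point since boot",
    "pgflcrit": "page file was critically full at some point since boot",
}


def pagefile_warnings(flags_line):
    """Return warnings for page file flags found in an 'EXE$GL_FLAGS: a,b,c' line."""
    # Everything after the first colon is the comma-separated flag list.
    _, _, names = flags_line.partition(":")
    flags = [name.strip().lower() for name in names.split(",") if name.strip()]
    return [PAGEFILE_FLAGS[flag] for flag in flags if flag in PAGEFILE_FLAGS]


if __name__ == "__main__":
    line = "EXE$GL_FLAGS: poolpging,init,bugdump,pgflfrag,pgflcrit,pagfildmp"
    for warning in pagefile_warnings(line):
        print(warning)
```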
____________________
Purely Personal Opinion
Jan van den Ende
Honored Contributor

Re: DS15 - LOCKMGRERR crash

Volker,

This node was rebooted after 159 days because of the tape MDR: the driver for $2$MGA had received a wrong SCSI bitmask. Obviously a known problem, and one that can only be cleared by a reboot. (And NO patches are coming anymore, because MDR is EOL! How did that stuff EVER qualify for use under VMS?)
24 hours after the reboot this crash happened.


Clue mem/stat:
Successful pool expansions : 0
Unsuccessful pool exp : 0
Various "Failed" stats: all are 0

SHOW PAGE/S2/FREE:
not sure how to interpret what I see.
Mapped addr:
counting down in steps of %X4000, 8000, C000, 10000, 20000 for the first couple of pages
PTE addr:
counting down in (irregular?) multiples of 4, like 18, 30, 1C, C0
PTE:
counting down in rather big steps (all ending 0000)
Count:
small numbers, single digit except the last one: 3F7
But what does that mean?


exa @LCK$AR_POOLZONE_REGION+80;20

4F9A6A - 25C1 - 445A 1A
Again, what does that mean?


"system seems to be badly tuned in general"

Care to elaborate?
Any suggestions for improvement?


Proost.

Have one on me.

jpe
Jan van den Ende
Honored Contributor

Re: DS15 - LOCKMGRERR crash

Ian,

Indeed, that is what HELP/MESS INSVIRMEM offers as a possibility, and I have already installed an extra GB of page file. But it makes me wonder WHY all of a sudden (after a reboot!!) so much page file was needed, because we monitor page file use and try never to need it at all.
(Then again, this IS the one small machine in the cluster).

Proost.

Have one on me.

jpe
Volker Halle
Honored Contributor

Re: DS15 - LOCKMGRERR crash

Jan,

EXE$GL_FLAGS: ...,pgflfrag,pgflcrit,...

This says that the page file has been severely fragmented and critically full during the uptime of the system (which is just 1 day). Look at the situation at the time of the crash with:

SDA> CLUE MEM/FILES

SDA> SHOW PAGE/S2/FREE shows the number of free PTEs in the S2 free page list. If the lock manager needs to allocate more RSBs and LKBs, it may need to expand its pool zone in S2 space, and for that it needs some free S2 PTEs. Only the Count fields are of interest.

Were there any free physical pages (SDA> SHOW PFN/FREE)?

If you've copied the LCKMGR POOLZONE counters from right to left, it would be:

hits: 4F9A6A
misses: 25C1
expansions: 445A
failures: 1A <<< normally this counter is 0
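
For anyone collecting these counters from several dumps, here is a small Python sketch that converts the four hex values and flags a non-zero failure count. It assumes the values have already been put in the order hits, misses, expansions, failures, and that hits plus misses approximates the total number of allocation requests; that second point is an assumption of the sketch, not something SDA documents. The function name `poolzone_summary` is made up for illustration.

```python
def poolzone_summary(hits, misses, expansions, failures):
    """Convert the four hex counters to integers and derive simple indicators."""
    counters = {
        "hits": int(hits, 16),
        "misses": int(misses, 16),
        "expansions": int(expansions, 16),
        "failures": int(failures, 16),
    }
    # Assumption: every request ends up as either a hit or a miss.
    requests = counters["hits"] + counters["misses"]
    counters["miss_ratio"] = counters["misses"] / requests if requests else 0.0
    # A failure count other than zero means a pool zone expansion was refused.
    counters["healthy"] = counters["failures"] == 0
    return counters


if __name__ == "__main__":
    # Values as posted in this thread.
    summary = poolzone_summary("4F9A6A", "25C1", "445A", "1A")
    for name, value in summary.items():
        print(name, value)
```

With the values above, the failure count converts to 26 decimal, so the `healthy` indicator comes out False.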

NOTE: you've seen an INSFMEM error, not an INSVIRMEM! Lock manager resources are in S2 space, which is NOT paged, so page file space problems cannot cause this crash.

If this is 'the small machine' in the cluster, it might just not have had enough resources to receive the lock/resource tree being moved to it.

Volker.
Jan van den Ende
Honored Contributor

Re: DS15 - LOCKMGRERR crash

Volker,

"NOTE: you've seen an INSFMEM error, not an INSVIRMEM"
Sorry, typo in the posting. I used the actual message in HELP.

SHOW PFN/FREE
*** List is empty ***

Looks like we pinned it down!
Maybe a budget request for more memory is in order.
A bigger page file has already been installed.

Thanks!

Proost.

Have one on me.

jpe

Jan van den Ende
Honored Contributor

Re: DS15 - LOCKMGRERR crash

Sufficiently explained.

Proost.

Have one on me.

jpe
Ian Miller.
Honored Contributor

Re: DS15 - LOCKMGRERR crash

Perhaps you need to set the LOCKDIRWT system parameter to keep lock directory load off this node.

More memory is always a good thing.
Volker Halle
Honored Contributor

Re: DS15 - LOCKMGRERR crash

Jan,

maybe - just maybe - you ran BACKUP to test access to the tape after the reboot? And BACKUP used lots of memory and pulled over the resource tree of the disk (due to its lock activity)?

Volker.