Re: DS15 - LOCKMGRERR crash

Jan van den Ende · ‎02-02-2006

Hei,

Yesterday the DS15 in our cluster crashed. This was especially unpleasant for the users of one application that was assigned exclusively to that node.

Also, the crash error does not sound very reassuring...

Output of SDA SHOW CRASH and SDA CLUE CRASH attached.

The crash has also been forwarded through our support channel, but I guess this will be quicker, also because the support channel is not exactly a direct route...

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Jan van den Ende · ‎02-02-2006

and the cluë:

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

comarow · ‎02-02-2006

How about the complete output of:
CLUE CRASH
CLUE CONFIG
CLUE REGISTER
CLUE STACK

That would be helpful.

Jan van den Ende · ‎02-02-2006

Comarow

For CLUE CRASH see my previous post;
find the other 3 attached.

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Wim Van den Wyngaert · ‎02-02-2006

I once had an RWCLU problem that ended in a LOCKMGR crash (V7.3). ANALYZE/CRASH followed by SHOW SUMMARY will confirm whether that is the case here (check the process states).

Wim

Volker Halle · ‎02-02-2006

Jan,

the 'key' piece of information is in the CLUE REGISTER output:

R0 = 00000000.00000124 %SYSTEM-F-INSFMEM, insufficient dynamic memory
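
For anyone following along, the R0 value can be picked apart by hand. A minimal Python sketch, assuming the standard OpenVMS condition value layout (severity in bits 0-2, message number in bits 3-15, facility number in bits 16-27); the function name decode_condition is made up for illustration and is not part of SDA:

```python
# Severity codes held in the low three bits of an OpenVMS condition value.
SEVERITY = {
    0: "W (warning)",
    1: "S (success)",
    2: "E (error)",
    3: "I (informational)",
    4: "F (severe/fatal)",
}


def decode_condition(value):
    """Split a 32-bit condition value into its standard fields.

    Hypothetical helper, assuming the usual field layout.
    """
    value &= 0xFFFFFFFF  # only the low longword carries the status
    return {
        "severity": SEVERITY.get(value & 0x7, "reserved"),
        "message_number": (value >> 3) & 0x1FFF,
        "facility": (value >> 16) & 0xFFF,
    }


# Low longword of R0 from the CLUE REGISTER output above.
fields = decode_condition(0x124)
print(fields)
```

Facility 0 is the SYSTEM facility and the severity comes out as F, which matches the %SYSTEM-F- prefix of the message text.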

The fast remastering code (new since V7.3) is sensitive to resource problems.

Volker.

Jan van den Ende · ‎02-02-2006

Volker,

I KNEW you would beat the official support!
So, just a matter of a parameter adjustment after all.
Boy, am I glad it is not something more serious! (Well, _I_ suspected it could hardly be some inherent fault, but there ARE those who would like nothing better than to point at VMS with evidence of potential harm to data integrity, as could easily happen if the Lock Manager were at fault!)

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Volker Halle · ‎02-02-2006

Jan,

this system seems to be badly tuned in general, look at:

System Uptime: 1 00:37:35.82
EXE$GL_FLAGS: poolpging,init,bugdump,pgflfrag,pgflcrit,pagfildmp

To find out about nonpaged pool expansion problems, see:

SDA> CLUE MEM/STAT

The LKBs and RSBs are allocated from S2 space:

SDA> SHOW PAGE/S2/FREE

To look at LCKMGR pool zone counters, use:

SDA> exa @LCK$AR_POOLZONE_REGION+80;20

The counters are (quadwords from right to left): hits, misses, expansions, failures.

Volker.

Ian Miller. · ‎02-03-2006

Look at page file space also - the mmg flags
pgflfrag, pgflcrit show that the pagefile was full or nearly so at some time.
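
A quick way to pull those two flags out of an EXE$GL_FLAGS line, as a sketch only: the line format is copied from the CLUE CRASH output quoted in this thread, the meanings in the table are the ones given by posters here, and the function name pagefile_warnings is hypothetical.

```python
# Flags that, according to this thread, record page file trouble
# at some point during the current uptime.
PAGEFILE_FLAGS = {
    "pgflfrag": "page file was fragmented at some point",
    "pgflcrit": "page file was full or nearly full at some point",
}


def pagefile_warnings(line):
    """Return the page-file-related flags found in an EXE$GL_FLAGS line."""
    # Everything after the first colon is the comma-separated flag list.
    _, _, flag_text = line.partition(":")
    flags = [f.strip().lower() for f in flag_text.split(",") if f.strip()]
    return {f: PAGEFILE_FLAGS[f] for f in flags if f in PAGEFILE_FLAGS}


line = "EXE$GL_FLAGS: poolpging,init,bugdump,pgflfrag,pgflcrit,pagfildmp"
warnings = pagefile_warnings(line)
for flag, meaning in warnings.items():
    print(f"{flag}: {meaning}")
```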

____________________
Purely Personal Opinion

Jan van den Ende · ‎02-03-2006

Volker,

This node was rebooted after 159 days because of the tape MDR: the driver for $2$MGA had received a wrong SCSI bitmask. Obviously a known problem, and one that can only be cleared by a reboot. (And NO patches coming anymore, because MDR is EOL! How did that stuff EVER qualify for use under VMS?)
24 hours after the reboot this crash happened.

Clue mem/stat:
Successful pool expansions : 0
Unsuccessful pool exp : 0
Various "Failed" stats: all are 0

SHOW PAGE/S2/FREE:
not sure how to interpret what I see.
Mapped addr:
counting down in steps of %X4000, 8000, C000, 10000, 20000 for the first couple of pages
PTE addr:
counting down in (irregular?) multiples of 4, like 18, 30, 1C, C0
PTE:
counting down in rather big steps (all ending 0000)
Count:
small numbers, single digit except the last one: 3F7
But what does that mean?

exa @LCK$AR_POOLZONE_REGION+80;20

4F9A6A - 25C1 - 445A 1A
Again, what does that mean?

"system seems to be badly tuned in general"

Care to elaborate?
Any suggestions for improvement?

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Jan van den Ende · ‎02-03-2006

Ian,

Indeed, that is what HELP/MESS INSVIRMEM offers as a possibility, and I already installed an extra GB of pagefile. But it makes me wonder WHY all of a sudden (after a reboot!!) so much pagefile was needed, because we monitor pagefile use, and try to never need it whatsoever.
(Then again, this IS the one small machine in the cluster).

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Volker Halle · ‎02-03-2006

Jan,

EXE$GL_FLAGS: ...,pgflfrag,pgflcrit,...

This says that the page file has been severely fragmented and critically full during the uptime of the system (which is just 1 day). Look at the situation at the time of the crash with:

SDA> CLUE MEM/FILES

SDA> SHOW PAGE/S2/FREE shows the number of free PTEs in the S2 free page list. If the lock manager needs to allocate more RSBs and LKBs, it may need to expand its pool zone in S2 space and would need some free S2 PTEs. Only the count fields would be interesting.

Were there any free physical pages (SDA> SHOW PFN/FREE)?

If you've copied the LCKMGR POOLZONE counters from right to left, it would be:

hits: 4F9A6A
misses: 25C1
expansions: 445A
failures: 1A <<< normally this counter is 0
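
To put those hex values into decimal, a small Python sketch. It assumes the right-to-left reading above is correct; the dictionary layout is purely illustrative and not an SDA format.

```python
# Hex values as posted, in the order hits, misses, expansions, failures.
raw = {
    "hits": "4F9A6A",
    "misses": "25C1",
    "expansions": "445A",
    "failures": "1A",
}

# Convert each counter from hexadecimal text to an integer.
counters = {name: int(text, 16) for name, text in raw.items()}

for name, value in counters.items():
    print(f"{name:<11}{value:>10}")

# A non-zero failure counter is the interesting part here.
if counters["failures"] != 0:
    print("failure counter is non-zero")
```

With the posted values the failure counter works out to 26 decimal.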

NOTE: you've seen an INSFMEM error, not an INSVIRMEM! Lock manager resources are in S2 space, which is NOT paged, so pagefile space problems cannot cause this crash.

If this is 'the small machine' in the cluster, it might just not have had enough resources to receive the lock/resource tree being moved to it.

Volker.

Jan van den Ende · ‎02-03-2006

Volker,

NOTE: you've seen an INSFMEM error, not an INSVIRMEM
Sorry, typo in the posting. I used the actual message in HELP.

SHOW PFN/FREE
*** List is empty ***

Looks like we pinned it down!
Maybe a budget request for more memory is in order.
A bigger pagefile has already been installed.

Thanks!

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Jan van den Ende · ‎02-03-2006

Sufficiently explained.

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Ian Miller. · ‎02-03-2006

Perhaps you need to set the LOCKDIRWT system parameter to keep lock directory load off this node.

More memory is always a good thing.

____________________
Purely Personal Opinion

Volker Halle · ‎02-03-2006

Jan,

maybe - just maybe - you've run BACKUP to test access to the tape after the reboot? And BACKUP has used lots of memory and pulled over the resource tree of the disk (due to its lock activity)?

Volker.
