Re: LOCKMGRERR bugcheck (V7.3-2)

Jerry Eckert · ‎04-07-2008

A standalone DS20 (2 CPU) running OpenVMS V7.3-2 (Update V13.0 and a handful of other ECOs) crashed with a LOCKMGRERR bugcheck. S2 space, free memory, pool, etc. are OK.

The granted queue (GRQ) and conversion queue (CVTQ) on one of the RSBs were corrupted: the two queues were linked together such that the CVTQ merges into the GRQ. The MS PowerPoint diagram attached to this entry shows the state of the two queues.
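For readers without the diagram, here is a rough Python model of the kind of check that exposes this sort of damage. It is only a sketch: the `Node` class, the `check_queue` helper and the particular bad link set up in `build_example` are made up for illustration, and the exact linkage found in the dump is in the attachment, not reproduced here. It assumes each queue is a circular doubly linked list whose header points to itself when empty.

```python
class Node:
    """Queue header or entry with forward/backward links (hypothetical model)."""
    def __init__(self, name):
        self.name = name
        self.flink = self   # an empty queue header points to itself
        self.blink = self

def insert_tail(head, node):
    """Link node in at the tail of the circular queue anchored at head."""
    last = head.blink
    node.flink = head
    node.blink = last
    last.flink = node
    head.blink = node

def check_queue(head, limit=1000):
    """Walk forward from head and return a list of inconsistencies found."""
    problems = []
    seen = set()
    prev, cur = head, head.flink
    steps = 0
    while cur is not head:
        if cur.blink is not prev:
            problems.append(f"{cur.name}: blink does not point back to {prev.name}")
        if id(cur) in seen or steps >= limit:
            problems.append(f"walk from {head.name} never returns to its header")
            return problems
        seen.add(id(cur))
        prev, cur = cur, cur.flink
        steps += 1
    if head.blink is not prev:
        problems.append(f"{head.name}: blink does not point to last entry {prev.name}")
    return problems

def build_example(corrupt):
    """Two queues on one resource; optionally splice the CVTQ into the GRQ."""
    grq, cvtq = Node("GRQ"), Node("CVTQ")
    lkb1, lkb2, lkb3 = Node("LKB1"), Node("LKB2"), Node("LKB3")
    insert_tail(grq, lkb1)
    insert_tail(grq, lkb2)
    insert_tail(cvtq, lkb3)
    if corrupt:
        # Assumed form of the damage: the CVTQ entry's flink leads into the GRQ.
        lkb3.flink = lkb1
    return grq, cvtq
```

With `corrupt=False` both walks come back clean. With `corrupt=True` the walk that starts at the CVTQ header wanders into the GRQ entries and never gets back to its own header, which is what a queue-walking routine would trip over.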

Is anyone aware of known synchronization problems in the V7.3-2 lock manager that might cause this type of corruption?

Thanks,
Jerry

Volker Halle · ‎04-07-2008

Jerry,

only HP can tell - if at all ;-)

Could you provide the CLUE file from the crash (CLUE$COLLECT:CLUE$node_ddmmyy_hhmm.LIS)? Either post it as a .TXT attachment or send it to me via mail (look at my ITRC profile).

This looks like some queue manipulation and/or synchronization problem. None of the patches for V7.3-2 seem to contain an obvious description of a similar problem.

Volker.

Jerry Eckert · ‎04-07-2008

Thanks Volker. I do have a case open with HP, but thought that on the outside chance it was a known problem I might get a response here first.

The CLUE output is attached.

Jerry

Volker Halle · ‎04-07-2008

Jerry,

thanks for providing the CLUE file. Did this system see other unusual crashes in the past? Could you also provide the crash history (file pointed to by the CLUE$HISTORY logical) of this system?

The crash seems to have happened on the first deadlock search performed during the uptime (only 7 hours) of this system. The basic crash footprint is:

Bugcheck Type: LOCKMGRERR, Error detected by Lock Manager
Failing PC: FFFFFFFF.801E44DC LCK$SEARCHDLCK_C+0027C

The deadlock search code within the lock manager is not at fault, but seems to be a victim of a previous corruption of the RSB queues.

Consider referencing this ITRC topic in case 3601524270 and asking the specialist working this call to send the CLUE file to the CCAT tool...

There is a tool inside HP called CCAT (previously: CANASTA), to which all specialists working on crash dumps were (are?) supposed to send all CLUE files. This tool would extract the most important parameters of a crash and compare them to a knowledge base of known crash problems and also to the crash footprints of all other crashes ever reported to this tool. This would immediately and automatically point out other system crashes with the same or similar footprints. I have maintained this tool and the knowledge base within Digital/Compaq/HP for about 10 years.
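To give an idea of what "footprint" means here, a small Python sketch follows. It is not how CCAT works internally (that is not described in this thread); the `parse_footprint` and `same_footprint` helpers and the matching rule are assumptions for illustration only. It just pulls the bugcheck type and the failing routine plus offset out of the two footprint lines quoted above.

```python
import re

def parse_footprint(text):
    """Extract bugcheck type, failing routine and offset from CLUE-style lines."""
    bugcheck = re.search(r"Bugcheck Type:\s*(\w+)", text)
    pc = re.search(r"Failing PC:\s*(\S+)\s+(\S+)\+([0-9A-Fa-f]+)", text)
    if bugcheck is None or pc is None:
        raise ValueError("footprint lines not found")
    return {
        "bugcheck": bugcheck.group(1),
        "routine": pc.group(2),
        "offset": int(pc.group(3), 16),   # offset is printed in hex
    }

def same_footprint(a, b):
    """Assumed matching rule: same bugcheck type in the same routine."""
    return a["bugcheck"] == b["bugcheck"] and a["routine"] == b["routine"]

# The two footprint lines as posted in this thread.
CRASH = """Bugcheck Type: LOCKMGRERR, Error detected by Lock Manager
Failing PC: FFFFFFFF.801E44DC LCK$SEARCHDLCK_C+0027C"""
```

Two crashes in the same routine at different offsets would still count as similar under this rule; a real tool would presumably weigh more parameters than that.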

If the HP specialist working this call does not know about CCAT (CANASTA), let me know and I'll ask around whether this tool still exists and is being used within HP.

Volker.

Martin Hughes · ‎04-07-2008

CCAT (CANASTA) is definitely still around.

For the fashion of Minas Tirith was such that it was built on seven levels, each delved into a hill, and about each was set a wall, and in each wall was a gate. (J.R.R. Tolkien). Quote stolen from VAX/VMS IDSM 5.2

Jur van der Burg · ‎04-07-2008

These crashes are very often caused by privileged software mismanaging nonpaged pool or having synchronisation problems. Since the lock manager can be a big consumer of pool, it is very often the victim of that. So I suggest looking at other privileged software and drivers as well, and seeing what has changed recently on the system.

Jur.

Jerry Eckert · ‎04-08-2008

Volker, Martin, and Jur, thanks for your replies.

This is the only crash recorded for the system.

The applications are completely user mode and there are no foreign drivers. The last change was on 16 Jan 08 to install OpenVMS updates (UPDATE V13, CLIUTL V1, DCL V9, PTHREAD V6, TCPIP V5.4-156, and MOTIF V1.3-1). We have this same set of updates running on 21 other similarly configured DS20s running the same application and about 30 other systems and have not seen any similar problems. None of the applications have changed since that time.

As Volker noted, the system had been rebooted just under 7 hours before the crash. It was shut down to replace a power supply wiring harness.

The same system showed one deadlock scan in CLUE MEM/STAT when I first checked about two hours after it rebooted from this crash. This morning, 3.5 days later, the count is still one. One of the comparable systems which has been up for 33 days also shows one deadlock scan.

Non-paged pool was 44% at the crash. One thing I found interesting is that SHOW POOL/STAT shows a negative number for "Packets (approx)" for the 576 byte and 2176 byte lookaside lists; the actual count is positive. I see the same on several different running systems, so I don't believe this to be significant, or at least not a fatal condition.

Volker Halle · ‎04-08-2008

Jerry,

so you're saying that this is the first crash that ever occurred on this system, or the first crash from which a dump and a CLUE file have been captured?

I've checked some dumps and I also see some negative numbers in the Packets (approx) columns, so apparently there is no need to worry.

If this is a one-off problem, it is most likely impossible to diagnose it any further from just one dump. You may still want to ask for the call to be escalated to OpenVMS engineering, and ask the HP specialist to find out whether crashes with similar footprints have been reported in CCAT (CANASTA) recently.

Volker.

Jerry Eckert · ‎04-08-2008

It is the only crash recorded in the CLUE history file, and the only one I can find a record of (although, with 290+ OpenVMS servers, sometimes the record keeping is not all it should be...)

Jerry Eckert · ‎04-09-2008

The CSC has advised that my crash is very similar to those that were caused by a problem in the Lock Manager that was corrected in ECO VMS732_SYS-V1400, which is included in VMS732_UPDATE-V1400. The image file containing the fix is LOCKING.EXE with link date 29-AUG-2007.

The patch details document for VMS732_SYS-V1400 shows that a new version of LOCKING.EXE is included in the kit, but none of the problem descriptions list LOCKING.EXE as an affected image, which is why neither Volker nor I found the fix.

I am told the problem involves a very short timing window during which the lock manager does not properly synchronize the queue manipulations. The only known occurrences have been on multiprocessor systems.
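For anyone wondering how a short window like that can damage a queue, here is a generic Python illustration. It is not the lock manager code and says nothing about where the actual window in LOCKING.EXE was; the `Entry` class and the helper names are hypothetical. It replays, in a fixed order, what happens when two CPUs each read the tail pointer before either has finished linking its entry in.

```python
class Entry:
    """Queue header or entry with forward/backward links (hypothetical model)."""
    def __init__(self, name):
        self.name = name
        self.flink = self
        self.blink = self

def read_tail(head):
    """Step 1 of a tail insert: look up the current last entry."""
    return head.blink

def link_at_tail(head, entry, last):
    """Step 2 of a tail insert: link entry in after the 'last' read in step 1."""
    entry.flink = head
    entry.blink = last
    last.flink = entry
    head.blink = entry

def walk(head):
    """Return the names of the entries reachable by following flink."""
    names = []
    cur = head.flink
    while cur is not head:
        names.append(cur.name)
        cur = cur.flink
    return names

def serialized_insert():
    """Each insert completes both steps before the next one starts."""
    head, a, b = Entry("HEAD"), Entry("A"), Entry("B")
    link_at_tail(head, a, read_tail(head))
    link_at_tail(head, b, read_tail(head))
    return head, a, b

def interleaved_insert():
    """Both 'CPUs' read the tail first, then both link in: B uses a stale tail."""
    head, a, b = Entry("HEAD"), Entry("A"), Entry("B")
    tail_seen_by_a = read_tail(head)
    tail_seen_by_b = read_tail(head)   # stale as soon as A is linked in
    link_at_tail(head, a, tail_seen_by_a)
    link_at_tail(head, b, tail_seen_by_b)
    return head, a, b
```

In the serialized case a walk sees A then B. In the interleaved case the walk only sees B: entry A still points at the header but nothing points at A any more. On a uniprocessor the two steps cannot overlap this way if the insert runs without being interrupted, which would fit with the problem only having been seen on multiprocessor systems.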

Thanks again to Volker and Jur for their assistance.
