Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

Michael Moroney · ‎02-23-2010

Volker: The VCB queue is consistently around 168 items.

SDA> v q vcb+vcb$l_fcbfl
Queue is complete, total of 168 elements in the queue

Hein:

The system is a cluster of 3 nodes running V8.3 (no -1Hx), and was up to date with all V8.3 patches as of a year ago. The application runs on only one now, the other's mostly idle, the third is a quorum system.

Mon Mode shows 1% user mode, ~70% kernel mode, ~10% MPSYNC, on a 4 processor rx3600.

The PC at DEL_EXTFCB_C+360 matches what you have, +350 = cmp4.eq p7, p0=11,r28 and +380 = cmp4.eq p0, p8=11, r30

I'm attaching the output of the SPL script. As you can see, tons of FILSYS time.

I neglected to mention that the process is running IBM MQ Series client code. I don't see how this could relate, there are 4 channels to the fragmented disk, none seem to relate to MQ. The executable image lives there, and 3 logging files do as well. One's 1000 blocks, the other two are much smaller.

Volker Halle · ‎02-24-2010

Michael,

interesting data. Let me give you my interpretation:

Some XQP operation is being executed around 55000 times per second. This operation involves 5 routines, which acquire and release the FILSYS spinlock during operation:

MAPVBLK, UPDATE_EOF, CREATE_FCB, UPDATE_FILESIZE, ALLOCATE

Routine BUILD_EXT_FCBS calls at least 4 of these routines directly (3) or indirectly (1).

DEL_EXTFCB acquires the FILSYS spinlock about 23 times per second, but holds it for a very LONG time (about 28 ms !), so the DEL_EXTFCB operation under the FILSYS lock seems to be causing a problem. This is being confirmed by the PCS data !

Somehow a file with many extension headers must be involved in this...

Did you look at MONITOR FCP ? This should also show some unusual numbers.

Volker.

Ian Miller. · ‎02-24-2010

What about using DFU to search for files with extension headers.

$ DFU SEARCH/MULTIPLE diskname

____________________
Purely Personal Opinion

Volker Halle · ‎02-24-2010

Michael,

another question comes to mind:

why is all the FILSYS spinlock activity happening on CPU 1 ? Is there a process with affinity set to CPU 1 ?

You may also want to edit SPL.COM to add

$ write outcom "spl start col/spin=FILSYS"
$ write outcom "wait 00:00:05"
$ write outcom "spl show col"

because this is really the only 'interesting' spinlock on this system.

Volker.

Michael Moroney · ‎02-26-2010

I found the problem, it was an *extremely* fragmented file, but not on the drive I was talking about. The IBM MQ product, which was installed on a different drive, was writing a "vmerr.log" file on this drive. The contents appear to be MQ debug info (not actual errors) and it was 25 million+ blocks, and more importantly, it had 100,000+ fragments in 2000+ headers! It was the most pathological case of file fragmentation I've ever seen!

What I don't understand is why a channel to this file doesn't show up in a SDA> SHOW PROC/CHAN, otherwise I would have found this much sooner. I don't know what IBM/MQ is doing.

Also, the drive I was discussing all along has never been more than 5% full. It is surprising that the drive was as fragmented as it was.

Volker Halle · ‎02-26-2010

Michael,

maybe that file was not permanently open ?!

DEL_EXTFCB was being called 23 times per second to delete those 2000+ FCBs (one FCB per extension file header). This gives about 46000+ FILSYS acquisition and releases per seconds for each of all the 6 routines involved, giving about a total 276000+ acq/s. SPL.COM reported about 332000 acq/s, so this was pretty close...

So everything I concluded from the data you've provided seems to have been true ;-)

Nice problem and analysis,

Volker.

Michael Moroney · ‎02-26-2010

From what I can tell, MQ opens, writes one line to and closes the file each time, so it appears you are correct. Perhaps the file is considered actually open for such a short period of time I never caught it with SHOW PROCESS/CHANNEL.

H.Becker · ‎02-27-2010

As far as I know, for deleting the extension FCBs the DEL_EXTFCB code aquires the FILSYS spinlock once. It holds it until it moved all the extension FCBs to a local queue. After releasing it, it actually deletes the FCBs from the local queue. Moving one FCB however requires to decrement a count in the VCB. To find the VCB the code always walks all the remaining extension FCBS which "end" in the VCB.

The high acquisition and release rate for FILSYS are in other routines for example MAP VBNs. The long waits are in MAP VBNs from other PIDs. It didn't only take much time to delete all the extension FCBs, it also blocked other applications "on this disk".

It very likely looks like MQ tried (still tries?!) to mimic Unix behavior on VMS, where you can always see what's just written at the end of a (log) file. At least for some situations there are better ways than to open/close.

John Gillings · ‎02-28-2010

>It very likely looks like MQ tried (still
>tries?!) to mimic Unix behavior on VMS,
>where you can always see what's just written
>at the end of a (log) file. At least for
>some situations there are better ways than
>to open/close.

Not that it helps with your MQ code...

If you want to be able to see the end of a "live" file, rather than force an update of the file on every write (frequent and expensive), it's better to put the resource burden on the reader. So, every time you want to read the file, you do something to make sure the EOF is up to date. Since reading is likely to be much less frequent than writing, the overall cost is far less.

The simplest way to achieve this in OpenVMS is for the writer to open the file SHARE=READ+WRITE. A process wanting to read the file opens it for APPEND access SHARE=READ+WRITE then closes the file. This causes the existing writer to flush all buffers and update the EOF mark. You can now re-open the file READ access SHARE=READ and see recent output.

Note that even if you don't have access to the source code, you can usually trick any program into shared write using a process permanent file with a small DCL jacket:

$ CREATE/FDL="RECORD;FORMAT STREAM_LF"
$ OPEN/APPEND/SHARE=(READ,WRITE) LOG
$ DEFINE/USER SYS$OUTPUT LOG
$
$ CLOSE LOG

now the reader side:

$ OPEN/APPEND/SHARE=WRITE LOG
$ CLOSE LOG
$ TYPE/TAIL

A crucible of informative mistakes

Categories

Company

Local Language

Forums

Discussions

Forums

Discussions

Discussions

Forums

Discussions

Forums

Discussions

Forums

Forums

Discussions

Forums

Discussions

Forums

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Community

Resources

Other HPE Sites

Discussions

Forums

Blogs

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?