Operating System - OpenVMS
cancel
Showing results for 
Search instead for 
Did you mean: 

SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

SOLVED
Go to solution
Michael Moroney
Frequent Advisor

SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

I did a SHOW PROCESS/CONTINUOUS of a wayward process and often it is shown as being in CUR mode, but with a PC = 2BAD2BAD.CODE2BAD and a PSL = CODE2BAD, obviously bogus values. Other tools show that it is spending excessive time in kernel mode. Am I correct in assuming that SHOW PROCESS/CONTINUOUS displays those values when the process is at high IPL and not interruptable? (yes I know the concept of a "current" process is ambiguous above sched IPl)

This is OpenVMS V8.3 on an Integrity rx3600.
28 REPLIES
John Gillings
Honored Contributor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

Michael,

This is unlikely to be documented except in the source code. Addresses like 2BAD2BAD (and a few others, like DEADBEEF are scattered in the source code as flags).

This *may* be something in SHOW.EXE, but it's also possible it's in the scheduler or $GETJPI. FWIW, I can't see any instances of the string "2BAD" in SHOW.EXE.

Having worked on a derivative of SHOW PROCESS/CONTINUOUS back in VAX days, at that time it got most of its information from globally visible (though privileged) data structures, and the rest from $GETJPI, so chances are SHOW is just displaying what it found. Something else is planting the data. It must have changed fairly radically since then as a lot of the VAX code wasn't portable to Alpha, let alone Integrity.

As they say "Use the Source (Luke)".
A crucible of informative mistakes
Michael Moroney
Frequent Advisor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

There is no "C0DE2BAD" anywhere in the source, list or map of the executable, and I can't find either that string nor its binary equivalent on sys$sysdevice:, which is puzzling.
H.Becker
Honored Contributor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

As far as I know the SHOW PROCESS queues a special kernel mode AST to the target process. Once that is delivered it walks the call stack and retrieves the PC/PSL of the code thread that was interrupted by this AST. That pair is reported back and displayed by SHOW PROCESS.

If walking the stack fails, for whatever reason, special values are reported: 2BAD...

While walking the stack is simple for VAX and Alpha, it is very different and complex for I64.

Unfortunately there is no indication from the special PC/PSL what the actual failure in walking the stack was. So it is difficult to say why it failed. But these codes do not indicate a problem in your image. You should be able to find the string in sys$loadable_images, in the process management image.
Michael Moroney
Frequent Advisor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

I know Shadowing used 2BAD codes. I know how bad frame tracing is on the Itanium, I had to deal with that a couple of years ago.

I used the PC sampler in SDA to see what's going on and why the process is using more CPU time and is visibly slower. I see lots of time in F11BXQP at IPL 8, routine DEL_EXTFCB_C. Now I have to find out what that means. Weird in that it doesn't do much file IO.
Volker Halle
Honored Contributor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

Michael,

DEL_EXTFCB in [F11X]CLENUP removes and de-allocates Extension FCBs (File Control Blocks) linked to a primary FCB during a cleanup operation after a successfully completed file operation.

Is this program working with very fragmented files or files with man ACLs and therefore with many extension file headers ? Or are there a huge number of files open on the associated volumes ? High Trans Count on the mounted devices ?

DEL_EXTFCB runs in kernel mode and follows the FCB queues to the VCB to decrement the transaction count. Long lists of FCBs to be followed may cause this behaviour.

Volker.
Jur van der Burg
Respected Contributor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

The code comes from [sys]proc_read_write_reg.c, routine exe$read_process_reg_ast. It is walking the stack and has a problem interpreting that.

Fwiw,

Jur.

Michael Moroney
Frequent Advisor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

I'm going to guess that exe$read_process_reg_ast cannot access the process' registers when the process is at IPL 8, hence the bogus values.

I looked at the drive and it is somewhat fragmented. What is interesting is the process doesn't do much I/O to the drive but another process which uses it heavily is hardly affected. Anyway, I told them to defrag the drive the next chance they get.
Volker Halle
Honored Contributor
Solution

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

Michael,

DEL_EXTFCB is called to clean up after 'file operations', so just doing IO on open files may not call this routine. Just opening and closing (fragmented) files may cause this behaviour !

Volker.
H.Becker
Honored Contributor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

Any guess why the name has the "_ast" suffix at all? Or in other words, if at lower IPL, how can code in one process get the PC/PSL of another process?

2BAD2BAD.CODE2BAD shows a problem in walking the call stack not finding the AST frame. It is not an indicator that the target process is at a high IPL.

There were changes in the unwind code after V8.3. But I don't know why the AST frame isn't found in your environment and if any changes so far can fix that.

But I'm sure, this is not the real problem you want to solve. You seem to have a performance problem and SHOW PROCESS/CONT can't help you to find out whats going on.

PS: I was somehow wrong about finding the string in the image. I forgot that moving a 64 bit literal into a register (MOVL) splits the immediate operand into pieces and then into two slots so you can't see the literal as a whole in the code segment of the image. However, ANALYZE/IMAGE/SEGMENT=CODE can show the MOVL and the literal.
Ian Miller.
Honored Contributor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

You may want to use PRF or PCS to see where that processing is spending it's excessive kernel mode time.

____________________
Purely Personal Opinion
Michael Moroney
Frequent Advisor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

Yes this is a performance issue. A process wakes up once a second and checks for things to do from a network socket, something that should take milliseconds but has crept up to almost 1/2 second. I initially used show process/cont to try to see where in the code it was spending its time, and most of its time was shown at the bogus address. PCS from SDA has told me it spends most of its time in the XQP at IPL 8.

I noticed the code has a C fsynch() call on each pass. If the disk the file being fsynched is on is fragmented, but the file itself isn't very fragmented, would there be excessive XQP time? There would be 0 or 1 fprintf() calls between the fsynch() calls. The file currently has 5 extents.
Volker Halle
Honored Contributor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

Michael,

I described the operation of DEL_EXTFCB in a previous reply. The code walks FCB chains to the VCB. Are there many files open on that disk ?

Why not write a simple example program doing fsync calls in a tight loop and watch the performance of that test program ?

Volker.

Michael Moroney
Frequent Advisor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

That was a good idea, I ran a program that was a loop with a single fprintf(), an fsync() and sleep(1) with the file on the fragmented disk. This is a simplification of the main loop of the program. It ran using very little CPU time, so the problem isn't the fsync to the fragmented disk.
Hoff
Honored Contributor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

Entirely for giggles, try this on the file involved:

SET FILE /CACHING_ATTRIBUTE=NO_CACHING file
Michael Moroney
Frequent Advisor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

We can't apply the cache attribute to an open file. Customer pretty much doesn't want to shut down the application unless there's essentially a near-guarantee of a fix. They're open to get the disk defragmented although I can't explain while another process that's a much heavier user of the drive doesn't seem to have any issues.

I just noticed that SDA also has the "2BADCODE" (as well as "1BADCODE") problem with SHOW PROCESS/FULL but SDA PCS module doesn't.

Those who are into XQP may be amused by the attached pcs sh trace/stat output.
Hein van den Heuvel
Honored Contributor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

fflush just gives records from the Crtl to RMS
fsync maps to an RMS $FLUSH which will tell the xqp to update the file header.

You did not mention revision (-H1?) or patch level. The makes the PCS trace (nice! ) data ambiguous.
Maybe you can use ANAL/SYS.. EXA/INST for one or two of the popular instructions to make sure we are looking at the same sources.
Like that 624 DEL_EXTFCB_C+00360
Is that a "ld4 r18 = [r27]" per chance?
I'm thinking it may be in...
UNTIL .P[VCB$B_TYPE] EQL DYN$C_VCB DO P = .P[FCB$L_FCBFL];

Also, is the system burning significant MPsync?
How about a spinlock trace, perhaps using @ SYS$EXAMPLES:SPL
Is there a lot of FILSYS time?

- Is this a cluster or standalone
- Any TRUNCATE calls happening?

fwiw,
Hein.





Volker Halle
Honored Contributor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

Michael,

this process is clearly spending a lot of CPU cycles in DEL_EXTFCB. One question would be, why does it get there ? The other question is, why does it spend so much time in there once it gets there ?

There are a couple of loops in this routine, which walk the FCB chains.

$ ANAL/SYS
SDA> SHOW DEV disk-device-name
SDA> READ SYSDEF
SDA> VALI QUE vcb+vcb$l_fcbfl

How many FCBs are in that queue ?

Volker.
Jur van der Burg
Respected Contributor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

>I just noticed that SDA also has the "2BADCODE" (as well as "1BADCODE") problem with SHOW PROCESS/FULL but SDA PCS module doesn't.<

Of course pcs does not display that. The pc is fabricated and never actually sess that address.

Jur (PCS author).

Volker Halle
Honored Contributor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

Michael,

when RMS is flushing the buffers, it looks like it only updates the file header (via a call to the XQP), if the end-of-file block number (IFB$L_EBK) has changed. So a simple fsync test without actually writing to the file may not trigger the behaviour seen in the application.

Volker.
Michael Moroney
Frequent Advisor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

Volker: The VCB queue is consistently around 168 items.

SDA> v q vcb+vcb$l_fcbfl
Queue is complete, total of 168 elements in the queue


Hein:

The system is a cluster of 3 nodes running V8.3 (no -1Hx), and was up to date with all V8.3 patches as of a year ago. The application runs on only one now, the other's mostly idle, the third is a quorum system.

Mon Mode shows 1% user mode, ~70% kernel mode, ~10% MPSYNC, on a 4 processor rx3600.

The PC at DEL_EXTFCB_C+360 matches what you have, +350 = cmp4.eq p7, p0=11,r28 and +380 = cmp4.eq p0, p8=11, r30

I'm attaching the output of the SPL script. As you can see, tons of FILSYS time.

I neglected to mention that the process is running IBM MQ Series client code. I don't see how this could relate, there are 4 channels to the fragmented disk, none seem to relate to MQ. The executable image lives there, and 3 logging files do as well. One's 1000 blocks, the other two are much smaller.
Volker Halle
Honored Contributor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

Michael,

interesting data. Let me give you my interpretation:

Some XQP operation is being executed around 55000 times per second. This operation involves 5 routines, which acquire and release the FILSYS spinlock during operation:

MAPVBLK, UPDATE_EOF, CREATE_FCB, UPDATE_FILESIZE, ALLOCATE

Routine BUILD_EXT_FCBS calls at least 4 of these routines directly (3) or indirectly (1).

DEL_EXTFCB acquires the FILSYS spinlock about 23 times per second, but holds it for a very LONG time (about 28 ms !), so the DEL_EXTFCB operation under the FILSYS lock seems to be causing a problem. This is being confirmed by the PCS data !

Somehow a file with many extension headers must be involved in this...

Did you look at MONITOR FCP ? This should also show some unusual numbers.

Volker.
Ian Miller.
Honored Contributor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

What about using DFU to search for files with extension headers.

$ DFU SEARCH/MULTIPLE diskname

____________________
Purely Personal Opinion
Volker Halle
Honored Contributor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

Michael,

another question comes to mind:

why is all the FILSYS spinlock activity happening on CPU 1 ? Is there a process with affinity set to CPU 1 ?

You may also want to edit SPL.COM to add

$ write outcom "spl start col/spin=FILSYS"
$ write outcom "wait 00:00:05"
$ write outcom "spl show col"

because this is really the only 'interesting' spinlock on this system.

Volker.
Michael Moroney
Frequent Advisor

Re: SHOW PROCESS/CONT shows PC=2BAD2BAD.CODE2BAD?

I found the problem, it was an *extremely* fragmented file, but not on the drive I was talking about. The IBM MQ product, which was installed on a different drive, was writing a "vmerr.log" file on this drive. The contents appear to be MQ debug info (not actual errors) and it was 25 million+ blocks, and more importantly, it had 100,000+ fragments in 2000+ headers! It was the most pathological case of file fragmentation I've ever seen!

What I don't understand is why a channel to this file doesn't show up in a SDA> SHOW PROC/CHAN, otherwise I would have found this much sooner. I don't know what IBM/MQ is doing.

Also, the drive I was discussing all along has never been more than 5% full. It is surprising that the drive was as fragmented as it was.