Re: Find why process uses 100% cpu.

Edward Alekxandr · ‎09-13-2006

Hi All,

I'm a little lost with debugging this, and not sure where to turn....

When running an RPC Broker (in M using GT.M) everything works fine.
GT.M starts a detached process that runs the M code providing the RPC Broker.
If you then telnet to the RPC port and disconnect, the detached process then goes crazy using 100% of the CPU.

How can I find out *why* it is doing this?
Is it TCPware that is feeding it nul's? or is it a bug in GT.M or the M code itself?

I can do a SHOW PROC/CONT on the detached process and can see that the image is GTM$DMOD, but will any of the other details tell me where to look for any bugs?

The detached process is being started by an unprivileged account (NETMBX and TMPMBX)

This is the show proc/image

Process VSTCPRS_3466J1 01:03:08

State COM Working set 724

Cur/base priority 4/4 Virtual pages 12010

Current PC FFFFFFFF.8025DD2C CPU time 0 00:08:58.74

Current PSL 00000000 Direct I/O 227

Current user SP 7ADFA870 Buffered I/O 1996126

PID 0000346A Page faults 895

UIC [X990099,VSTCPRS] Event flags C00000C9
80000000

DSA0:[SYS0.SYSCOMMON.][GTM_DIST]GTM$DMOD.EXE

Any pointers on how to find the issue (which I suspect is in the M code itself) would be most appreciated.

Hein van den Heuvel · ‎09-13-2006

Did it used to work and something changed?
First time you tried?

>> Buffered I/O 1996126

Is that counter going up rapidly?

My guess is that the process is rapidly beating its head against a wall trying to perform IO to a no longer function 'terminal'

I would poke around with ANAL/SYST... SET PROC VSTCPR... SHOW PROC/CHAN...

fwiw.
Hein.

John Gillings · ‎09-13-2006

Edward,

On an OpenVMS system, using 100% of a CPU means the process is compute bound, and no one else wants the CPU. The way the priority system works, your CPU hungry process will rapidly become a replacement for the Idle loop. Other processes should not be significantly affected.

If you have source code for the errant process, one way to work out what it's doing is to "freeze" the process with

$ SET PROCESS/SUSPEND/ID=pid

This will put the process into SUSP state, preventing it from consuming more CPU time. Now you can use SDA ($ ANALYZE/SYSTEM) to examine the process.

SDA> SHOW CALL
SDA> SHOW CALL/NEXT

will walk up the call chain. You should be able to use the addresses, correlated against a MAP file for the program to determine what it thinks it's doing.

You can also use DEBUG or a process dump (SET PROCESS/DUMP=NOW/ID=pid) to catch the process and work out what it's doing.

All that aside, I agree with Hein. I'd guess the process is not handling the sudden loss of the telnet channel. An I/O is failing, and instead of halting or returning to a "listen and wait" state, it's just iterating around a retry loop.

A crucible of informative mistakes

Edward Alekxandr · ‎09-13-2006

Hi All,

Yes the Buffered I/O count is incrementing like crazy,
and I just discovered something odd.
If you telnet to the port and type a couple of enter's,
the broker disconnects the telnet connection and seems to work fine.
However if you disconnect before sending *any* data then you get the problem.

I think your analysis of a missed I/O is spot on, but how can I find out *who* is causing the issue.
VMS, TCPware or GT.M?

I have looked at the SDA stuff, but not really sure what i'm looking for so heres a copy:

Show/CALL:
Memory Stack Frame at 00000000.7FF9CAA0
---------------------------------------
Stack Frame Procedure Descriptor
Flags: Base Register = FP, No Jacket, Native
Procedure Entry: FFFFFFFF.8015C680 PROCESS_MANAGEMENT+24680
Return address on stack = FFFFFFFF.801815F0 EXE$REI_TO_AST_C+00890

Registers saved on stack
------------------------
7FF9CAB0 FFFFFFFF.818EFCD8 Saved R13 SCH_STD$REMOVACB+00100
7FF9CAB8 00000000.7FF9CAC0 Saved R29
SDA>

Show call/next

AST Dispatcher at 00000000.7FF9CAC0
-----------------------------------
Stack Frame Procedure Descriptor
Flags: Base Register = FP, No Jacket, Native, AST Dispatcher
Procedure Entry: FFFFFFFF.80181F20 SCH$CHAIN_AST_C+000A0
Return address on stack = 00000000.00000001

Registers saved on stack
------------------------
7FF9CAD8 00000000.00307BF8 Saved R13
7FF9CAE0 00000000.0001C0EC Saved R14 GTM$DMOD+1C0EC
7FF9CAE8 00000000.7ADFA870 Saved R29
SDA>

Show chan

Process active channels
-----------------------

Channel CCB Window Status Device/file accessed
------- --- ------ ------ --------------------
0010 7FF70000 00000000 DSA0:
0020 7FF70020 81E92EC0 DSA0:[VMS$COMMON.GTM_DIST]GTM$DMOD.EXE;2 (section file)
0030 7FF70040 81CCA040 DSA0:[VMS$COMMON.SYSLIB]LIBRTL.EXE;1 (section file)
0040 7FF70060 81E92640 DSA0:[VMS$COMMON.GTM_DIST]GTMSHR.EXE;1 (section file)
0050 7FF70080 81CD47C0 DSA0:[VMS$COMMON.SYSEXE]DCL.EXE;1 (section file)
0060 7FF700A0 81CC9F80 DSA0:[VMS$COMMON.SYSLIB]DCLTABLES.EXE;89 (section file)
0070 7FF700C0 00000000 NLA0:
0080 7FF700E0 00000000 MBA12131:
0090 7FF70100 81CCB440 DSA0:[VMS$COMMON.SYSLIB]SMGSHR.EXE;1 (section file)
00A0 7FF70120 81CCA0C0 DSA0:[VMS$COMMON.SYSLIB]LIBOTS.EXE;1 (section file)
00B0 7FF70140 81CCCDC0 DSA0:[VMS$COMMON.SYSLIB]DPML$SHR.EXE;1 (section file)
00C0 7FF70160 81CCBC40 DSA0:[VMS$COMMON.SYSLIB]CMA$TIS_SHR.EXE;1 (section file)
00D0 7FF70180 81CCAF40 DSA0:[VMS$COMMON.SYSLIB]LBRSHR.EXE;1 (section file)
00E0 7FF701A0 81CCD340 DSA0:[VMS$COMMON.SYSLIB]DECC$SHR_EV56.EXE;1 (section file)
00F0 7FF701C0 81E91F40 DSA0:[VMS$COMMON.GTM_DIST]GTMSECSHR.EXE;1 (section file)
0100 7FF701E0 82023C80 DSA2:[FILEMAN]XWBTCPL.MJO;3
0110 7FF70200 8201F040 $1$DKA0:[VISTA]FILEMAN.DAT;1
0120 7FF70220 81D7FB00 DSA0:[VMS$COMMON.TCPWARE]UCX$IPC_SHR.EXE;1 (section file)
0130 7FF70240 00000000 BG8204:
0140 7FF70260 00000000 BG8205:
0150 7FF70280 00000000 BG26220:

Volker Halle · ‎09-13-2006

Edward,

you did not look at enough call frames in SDA. You need to continue with SDA> SHOW CALL/NEXT, until you get a return address in P0 space of your process. SDA> CLUE CALL might do a better job here.

This is a typical scenario, an IO seems to fail and the software just re-tries the IO without any retry limits and such.

You need to find the most recent call frame in the application code and then goto the .MAP and the source code. If you don't have either, you need to take this problem to the vendor/maintainer of that piece of software.

If you are running a recent version of OpenVMS (V7.3-2 or higher), you could use the PCS$SDA extension (PC sampling) to collect PC samples from your looping process.

For older version, you could try to collect this type of info with running SHOW PROC/CONT/ID=xxx after a SET HOST/LOG 0. Once run for a while, stop it and logout the SET HOST 0 session. Then edit the SETHOST.LOG file, which includes all the updated PC values and you should be able to extract those values and sort them. You are looking for PC values in P0 space, to be able to map them to your image. SDA> PCS is much better !

Volker.

Edward Alekxandr · ‎09-14-2006

Hi Volker!

Umm this is really new territory for me so i'm not sure how to get the details regarding P0 space.

This is the clue call output:

7FF9CAA0 Stack 8015C680 PROCESS_MANAGEMENT+24680 801815F0 EXE$AST_RETURN
7FF9CAC0 Stack 80181F20 SCH$ASTDEL_C 00000001
7ADFA870 Stack 802DE820 SYS$OPEN_C 80AF3DF4 DECC$SHR_EV56+5FDF4
7ADFA8A0 Stack 80AF2DC0 DECC$SHR_EV56+5EDC0 80BBECC4 DECC$SHR_EV56+0012ACC4
7ADFB050 Stack 80BBEC40 DECC$SHR_EV56+0012AC40 00144E64 GTMSHR+00102E64
7ADFB0E0 Stack 001445A0 GTMSHR+001025A0 000F1914 GTMSHR+000AF914
7ADFB1F0 Null 7AF77860 DCL+7B860
7ADFB200 Stack 000F1890 GTMSHR+000AF890 003B7F54
7ADFBA50 Stack 00030000 GTM$DMOD+00030000 7FF4FA70

Now, I think that this is a problem in GT.M, but before I can point fingers I need to make sure its not TCPware causing issues.

One reason I suspect GT.M is that for another issue with SQLServices and RDB, we did the same telnet to the port and disconnect, and RDB has no issues.
GT.M on the other hand.....

Thank you for your help, and I'm sorry I don't know much about what I'm asking.

Richard Whalen · ‎09-14-2006

As others have said, the M code is not getting nulls, but is having an I/O fail and is not detecting it, hence it looks like it is receiving nulls. Since the process is using 100% of the CPU time it is unlikely that you can capture the data required to determine whether the problem is in M or in TCPware's RPC code.

Is the M code seeing nulls and trying to read more? If so, then the bug is probably in GT.M or M for not noticing a failure from the RPC call.

Edward Alekxandr · ‎09-14-2006

I can run the broker in interactive mode (rather than as a detached process) and it looks like it keeps trying to read the BG device in a loop.

I'm pointing fingers at GT.M and or the VistA RPC broker (in M)

The M process is looping, my only thought regarding TCPware was if the UCX emulation was different to TCP Services with regards to a connection that was closed before being written to.

Richard Whalen · ‎09-14-2006

We try to have TCPware's UCX service (the BG device and UCX$IPC_SHR) perform identically to TCP/IP services. We are not currently aware of any differences.

Richard Whalen
Process Software

Edward Alekxandr · ‎09-14-2006

I thought that any differences would have been picked up *long* ago, just following my (wondering) train of thought.

One question though,
If the process is trying to read from a BG device, how with TCPware can I find out what state that connection / device is in.

I have done a SHOW DEVICE BG32410/FULL, with no real info. I'm looking through the NETCU help ATM, to see if there is anything here that may help.

Richard Whalen · ‎09-14-2006

$ @TCPWARE:TCPWARE_COMMANDS
$ NETCU SHOW CONNECTIONS
Will list active connections and their state.
If you don't see the BG device that your program is using, then that BG device doesn't have an active connection.

Note that a program that uses BG devices and uses the SELECT socket library call (and I believe that the RPC code does) will always have at least one BG device that does not have a connection associated with it. This BG device is used for the qiow that does the SELECT operation. (This is true for TCP/IP services, MultiNet and TCPware.)

Volker Halle · ‎09-14-2006

Edward,

the call stack shows various routines on the stack at the time you issued the SDA> CLUE CALL command.

There is a call to SYS$OPEN_C (called from the C runtime library), called from a routine in GTMSHR.

You would need to execute a couple of those SDA> CLUE CALL commands against the looping process and find out, which of the call frames are constantly shown on the stack. Then go to the highest frame (lowest Procedure Frame address), which shows a return address in your code (GTMSHR or other). When this call frame is ALWAYS on the stack of the looping process, you need to analyze the call made from that return address.

You need the map and the source listing (with machine code !) of the current version of that module for further analysis.

Based on the fact, that there is a SYS$OPEN call, this might give a hint...

Volker.

Edward Alekxandr · ‎09-14-2006

Thanks everyone for all your help.

I think i've traced the error to the following M code:

F D Q:DONE
. S S=L-$L(R),R=R_$E(XWBRBUF,1,S),XWBRBUF=$E(XWBRBUF,S+1,999999)
. I ($L(R)=L)!(R[$C(4))!(C>TO) S DONE=1 Q
. R XWBRBUF:2 S:'$T C=C+1 S:$L(XWBRBUF) C=0
. I $G(XWBDEBUG)>2,$L(XWBRBUF) D LOG^XWBDLOG("rd: "_$E(XWBRBUF,1,252))

When the process is run interactively rather than detached, I can step through the M code and it seems to loop rapidly through this at maximum throttle.

I'm going to have another look tomorrow, but I'm somewhat zombie-fied with flu today so thinking is out of the window.

Jan van den Ende · ‎09-15-2006

@John:

On an OpenVMS system, using 100% of a CPU means the process is compute bound, and no one else wants the CPU. The way the priority system works, your CPU hungry process will rapidly become a replacement for the Idle loop. Other processes should not be significantly affected.

Well, in these matters I usually consider you to be WAY better informed than I am, but this time I feel things are not exactly as you portray them.

As I have (regrettably) experienced all to often, in cases like this the process actually gets an interactive terminal IO request boost, which as soon as it is satisfied gets a re-boost, etc. Effectively tying up one CPU.
If you have several such (eg, Progress) processes, then as soon as "several" reaches "number of CPU's" it becomes VERY hard to take corrective measures!

The solution (ahh, call that workaround) for us was to enable Virtual Terminals, and have a (short) Disconnected Terminal process termination.

fwiw

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Categories

Company

Local Language

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Discussions

Forums

Forums

Discussions

Forums

Discussions

Forums

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Community

Resources

Other HPE Sites

Discussions

Forums

Blogs

Re: Find why process uses 100% cpu.

Find why process uses 100% cpu.