Re: Looping installed image

Mick O'Brien · ‎06-28-2010

I have an installed image running under ACMS (DCL server with dynamic username) that intermittently loops (CPU-bound, not I/O) and then needs to be stopped (killed).

I have looked at the code and can see no obvious reason for the loop and need to know if there is any way to 'trace' (for want of a better word) where the program is looping.

Note: The image is installed and runs against an Rdb database as an ACMS DCL server under a dynamic username.

Any help appreciated.

OpenVMS V8.3
COBOL V2.8-1286
RDB 7.2-321
ACMS 5.1B

abrsvc · ‎06-28-2010

There are a number of methods for this; perhaps the simplest is to use SDA to examine the PCs of the looping process. There is a trace utility within SDA, but I have found that it is too quick. I usually create an X.COM file with 10-20 "exam PC" commands in it and set the SDA focus to the process that is looping. Do an "@X.COM" to a file or the screen and map the PCs to the actual image.

While a little tedious, it gives you a gross idea of the code flow. Once you narrow that part down, the PC trace utility can give you more details.

Dan

Bhadresh · ‎06-28-2010

>I have looked at the code and can see no obvious reason for the loop and need to know if there is any way to 'trace' (for want of a better word) where the program is looping.

Run the application and collect the PC sampling data.
To collect PC samples:
$anal/sys
SDA>pcs load
SDA>pcs start trace
SDA>pcs stop trace

Regards,
Bhadresh

P Muralidhar Kini · ‎06-28-2010

Hi Mick,

>> I have looked at the code and can see no obvious reason for the loop and
>> need to know if there is any way to 'trace' (for want of a better word)
>> where the program is looping.
Yes, there is.
You can make use of the PC sampling feature in order to find out what the
looping process is doing. From the PC sampling output, you should see some
set of PC's getting repeatedly logged in case of a loop in the program.
Mapping such a PC back to your source code would tell you where in the
program you have the loop.

PC Sampling Usage -
$ ANALYZE/SYSTEM
SDA> PCS LOAD
SDA> PCS START TRACE
... wait for some time ...
SDA> PCS STOP TRACE
SDA> SET OUTPUT PC.DAT
SDA> PCS SHOW TRACE
SDA> PCS UNLOAD
SDA> EXIT

Then analyze the PC.DAT file for PC's that are getting repeatedly logged.

If you know the PID of the process, you can narrow the sampling down so that
PCS samples only the PCs of that particular process.

That is, in the sequence above,
SDA> PCS START TRACE --> traces the PCs of all processes

Instead use
SDA> PCS START TRACE/PID=XXX --> traces only the PCs of PID XXX

This causes PCS to sample PCs for only the process with the specified PID.
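
A minimal sketch of the "analyze PC.DAT" step, in Python. It assumes each sample line carries a symbolized PC of the form IMAGE+OFFSET (something like FDVSHR+3DB04); the exact PCS SHOW TRACE layout is not shown in this post, so the regular expression and the helper name `hot_pcs` are assumptions for illustration, not part of SDA.

```python
import re
from collections import Counter

# Assumed token shape: NAME+HEXOFFSET, e.g. "FDVSHR+3DB04".
PC_TOKEN = re.compile(r"\b([A-Z][A-Z0-9_$]*\+[0-9A-F]+)\b")

def hot_pcs(lines, min_share=0.05):
    """Return (pc, count) pairs that make up at least min_share of all samples."""
    counts = Counter()
    for line in lines:
        match = PC_TOKEN.search(line)
        if match:                      # skip headers and lines without a PC
            counts[match.group(1)] += 1
    total = sum(counts.values())
    # most_common() yields the pairs in descending count order
    return [(pc, n) for pc, n in counts.most_common()
            if total and n / total >= min_share]
```

Calling `hot_pcs(open("PC.DAT"))` would then list the PCs that dominate the trace; a tight loop tends to show up as a small set of PCs with a large share of the samples.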

Hope this helps.

Regards,
Murali

Let There Be Rock - AC/DC

Mick O'Brien · ‎06-28-2010

Murali,

Next time the process loops I will use the code you supplied - very clearly explained. Once I get the PC trace file I will no doubt be asking more questions.

Mick

P Muralidhar Kini · ‎06-28-2010

Hi Mick,

Also, for more information on the VMS SDA extensions, refer to:
* OpenVMS SDA Extensions
http://www.connect-community.de/Events/OpenVMS2009/folien/05-sda_extensions.html

For PCS sampling information, see the section of that page titled
"PC Sampling Utility PCS commands:"

It also has an example of PCS sampling usage with the corresponding output.

Regards,
Murali

Let There Be Rock - AC/DC

Hoff · ‎06-28-2010

With respect to the other respondents and their literal (and correct) replies here, the PC stuff (in conjunction with the link maps and compiler listings) will get you into the general area of the loop.

Which is interesting.

But not entirely useful.

While the tools cited are all functional, it's also easily feasible to snag a few PCs via a simple SHOW PROCESS /CONTINUOUS command.

A half-dozen PCs are often enough data. Particularly if you see a repeated PC or two somewhere in P0 space, then you're usually ready to proceed.

See where those addresses exist within the application.

How to get to the source code from a virtual address? How to translate from those PCs you have? The sequence is described here:

http://labs.hoffmanlabs.com/node/800
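
As a rough illustration of the lookup involved (not the procedure from the link above), here is a Python sketch that takes an offset within the image and finds which module's code range contains it. The ranges would come from the link map; the tuple layout and the function name `find_module` are assumptions made for this example. If the map lists absolute link-time addresses rather than offsets, subtract the image base first.

```python
from bisect import bisect_right

def find_module(offset, sections):
    """Locate offset in sections, a list of (start, end, module) tuples.

    start and end are inclusive offsets from the image base, and the list
    is sorted by start. Returns (module, offset_within_module) or None.
    """
    starts = [start for start, _end, _module in sections]
    i = bisect_right(starts, offset) - 1   # last section starting at or before offset
    if i >= 0 and sections[i][0] <= offset <= sections[i][1]:
        start, _end, module = sections[i]
        return module, offset - start
    return None
```

The module-relative offset that comes back is what you would then look up in the compiler's machine-code listing to get to a source line.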

If that source code does not point to an obvious trigger, then you can use more specific tools and techniques. In particular, run this application under the debugger. Yes, you can debug detached processes. Here's how:

http://labs.hoffmanlabs.com/node/803

You won't be able to debug an installed image, but you can set up the same privileged context for the detached process and run without the INSTALL.

If necessary, you can program the debugger if you can identify a trigger but not its cause. The debugger can then run in the background, waiting for the initial conditions for the loop to be met, and you'll then have some visibility into the run-up to the trigger.

http://labs.hoffmanlabs.com/node/848

If you're having difficulty with spotting the trigger with the debugger and with the debugger-level programming, then you can also build the application with the debugger activated in a signal handler, and send the debug signal over.

And if you're using the word "killed" or analogous in conjunction with this effort, then consider switching techniques and using the process dump mechanisms. Dial back the brute-force setting slightly. The debugging sequence and the preference for creating process dumps or crashdumps is analogous to using the >>> BOOT rather than the SRM crashdump command; sure, the immediate freak-out is fixed, but these tend to arise anew, and, well, you can choose to repeat the >>> BOOT or you can write a dump and go looking. In other words, capture a dump or other evidence, and +then+ restart.

As for how, you can use the SET PROCESS /DUMP command, or toss the appropriate $forcex signal over.

That signal can be allowed to cause the dump, or can be captured by a signal handler. There's a simple signal handler (in C) here:

http://labs.hoffmanlabs.com/node/1438

For typical non-trivial production applications, it's usually best to instrument the code, too. To integrate debugging.

And given the use of COBOL (which avoids many of the foibles of C or Macro or Bliss), my initial suspicion would be in ACMS and system service and RTL error handling; not checking the return status or IOSB, for instance. Even so-called optional IOSB arguments aren't really optional; you should always specify the IOSB, and always check the return status and the IOSB. Otherwise, errors can tend to accrete, and get weird.
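
To make the "check both" rule concrete, here is a small sketch of the logic in Python (the application itself is COBOL, so this only illustrates the test, not the real calls). It relies on the OpenVMS convention that a condition value with its low bit set means success. The helper names are made up for the example.

```python
SS_NORMAL = 0x0001  # SS$_NORMAL, normal successful completion
SS_ABORT = 0x002C   # SS$_ABORT, used here only as a sample failure value

def succeeded(status):
    """OpenVMS convention: the low bit of a condition value is set on success."""
    return (status & 1) == 1

def first_failure(return_status, iosb_status):
    """Describe the first failure found, or return None if both checks pass.

    return_status is what the service call itself returned; iosb_status is
    the completion status from the first word of the IOSB.
    """
    if not succeeded(return_status):
        return f"call rejected, status %X{return_status:08X}"
    if not succeeded(iosb_status):
        return f"I/O completed with error, IOSB status %X{iosb_status:04X}"
    return None
```

A read loop that only looks at the return status would treat `first_failure(SS_NORMAL, SS_ABORT)` as fine and go straight back around, which is one way a failed terminal read can turn into a CPU-bound loop.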

The other and ancillary question would involve why the image is installed. If it is installed for security (privilege) reasons, then subsystem identifiers can be a good alternative.

Stephen Hoffman
HoffmanLabs LLC

Hein van den Heuvel · ‎06-29-2010

I start out with a quick MONI MODE... all USER? EXEC?

PC samples are typically highly useful, as others replied. I use SHOW PROC/CONT, SDA> PCS, or SDA> PRF as I see fit.
But for my (COBOL) customer the top hits are often in OTS$mumblfratz routines. So be sure to look for the infrequent but lower PC values to find out where in the code the program was when it called the OTS helpers.

Or, just cut out the (PC trace) middle man and try to find out how the process got there.
How? $ ANALYZE / SYSTEM ... SET PROC ACMSxxxSPxx ... SHOW STAC/USER/SUMM

Repeat 2 - 5 times. Pattern?

Good luck!
Hein

Hein van den Heuvel · ‎06-29-2010

>> my initial suspicion would be in ACMS

No way. In this setup ACMS only launches.

But it reminded me... maybe you can use :
$ ACMS /DEBUG /SERVER ACMSxxxSPxxx

With that you can break in to a running ACMS SP process and the DBG$INPUT and DBG$OUTPUT will be managed for you.

I happen to have used it yesterday, but for a procedure server process, on a DCL server. I needed it because ACMS V5.1 on Itanium does not support its normal ACMS/DEBUG and we had (have!) an issue with TDMS not returning filled fields for a specific request with a long input list. Using %ALL worked.

http://h71000.www7.hp.com/doc/721final/6607/6607pro_017.html

Hein

Hoff · ‎06-29-2010

>> my initial suspicion would be in ACMS

Might have misread the "and"? I was pointing to the potential for errors in the error handling from all the wheels that are spinning here. Not to ACMS itself.

Mick O'Brien · ‎06-29-2010

Abrsvc, Bhadresh, Murali, Stephen (Hoff),

I had another looping process last night, did as suggested, and got out some PC details. I noticed that there were a lot of calls to FDVSHR (sorry, I forgot to mention that the process uses FMS), which seems to indicate that the process is hung up on a screen somewhere. These looping processes only happen in the evening, but when I talked to the looping process's owner this morning I was advised that they definitely logged out before going home. Setting the end user's comments aside, I was wondering whether a default timeout on an FDV$GETAL call is not being handled correctly within the code, thus causing the CPU loop. I looked for an FDV$STIME call but could not find one. Does anyone know if FDV$GETAL has a default timeout?

Hein,

I like the idea of sticking the server into debug, but I suspect I would not be allowed to run debug on a production server within SLA (service level agreement) just in case it brings the rest of the application down! Instead, next time I'll do what you suggested to "...try to find out how the process got there."

Mick

abrsvc · ‎06-29-2010

Without seeing the actual PCs involved and looking at the code, I can say this much:

I ran into a problem where a mis-handled error condition would cause the SMG routines to loop forever in a COM state. The problem in initially tracking this down was that it occurred during a process shutdown. At the time of the loop, the majority of the process context was broken down. This means that "normal" debugging techniques were useless.

Check for return status checks that don't handle all status possibilities that might corrupt some buffers.

If you can post parts of the listings here with the PC values, we can perhaps provide some specific suggestions.

Dan

P Muralidhar Kini · ‎06-29-2010

Hi Mick,

>> I noticed that there were a lot of calls to FDVSHR (sorry forgot to
>> mention process uses FMS)
Now that you have the PC samples, the interesting thing would be to know the
source code to which the repeated PC values map.
Are you seeing some set of PC values getting repeatedly logged in the
PC sampling output?
If yes, to which source code lines do those PC values map?

This would give an idea as to where exactly in your source code there is a loop.

Regards,
Murali

Let There Be Rock - AC/DC

P Muralidhar Kini · ‎06-29-2010

Hi Mick,

>> Does anyone know if FDV$GETAL has a default timeout?
FDV$STIME can be used to specify timeout value but you said that you did not
find that in your code.

Also, refer to the following link:
http://h71000.www7.hp.com/doc/73final/6619/6619pro_004.html
-> Section 5.3.14.2, DECforms Timeout Values

Check whether that's applicable in your case.

Regards,
Murali

Let There Be Rock - AC/DC

Hoff · ‎06-29-2010

I suspect you'll have to roll up your sleeves and compile with debugging enabled and launch the debugger and isolate the error.

Either directly.

Or by instrumenting the code.

Or both.

The virtual addresses that other replies discuss will usually help you find the loop, but there's still work to be done in the source code to isolate the trigger for the loop.

I've seen all manner of network and lower-level errors triggering this sort of misbehavior, and those can lead application code into all manner of dark corners.

And yes, application errors are most definitely entirely in play here, too.

Old code often has issues with its error handling. The old stuff probably once worked fine, but just wasn't designed for modern systems and modern networks and modern problems.

Folks don't necessarily think about how much stuff has actually changed within the whole of the stack, and what that means for the applications.

As a testament to the degree of compatibility that has been achieved here, comparatively simple user applications that once used FMS to chat with VT100 or VT220 terminals are now using various wholly new intermediate layers out a telnet connection to some random terminal emulation on some other operating system, and you can be assured that there are whole zoos filled with the errors that can now arise here.

Networking in particular offers a well-populated zoo.

How much old COBOL code was ever implemented with the intermittent connectivity and with the requirements for reconnections and refreshes that can arise with mobile IP networking?

There's no magic bullet here. The PC is just the jumping off point. You're just going to have to isolate and debug your code.

And yes, this could well be an error in lower-level code, but (in most any non-trivial application environment) you're generally left to have to prove that is the case.

John Gillings · ‎06-29-2010

The only problem using SDA for this is that the process is executing at the time. Fix that with SUSPEND:

When a process is looping:

$ SET PROCESS/SUSPEND/ID=looping-process

Now use SDA to examine the process. See SHOW CALL, SHOW CALL/NEXT etc... to walk up the call stack. The first few frames will be related to the SUSPEND, but you should get some idea of where the process is looping. Sanity check with:

$ SET PROCESS/RESUME
$ SET PROCESS/SUSPEND

and repeat the traceback. Do this a few times to find the deepest routine common to all. That will contain the loop.
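
If you note the routine names down from each traceback, the "deepest routine common to all" can be picked out mechanically. A small Python sketch, assuming you have already dropped the frames that belong to the SUSPEND itself and reversed each list so the outermost caller comes first (SHOW CALL/NEXT walks the other way). The function name and the routine names in the usage note are hypothetical.

```python
def deepest_common_routine(stacks):
    """Return the deepest routine shared by every call stack, or None.

    stacks is a list of call stacks, each a list of routine names ordered
    from the outermost caller down to the innermost frame.
    """
    if not stacks:
        return None
    common = None
    # zip() stops at the shortest stack; walk down while all stacks agree
    for frames in zip(*stacks):
        if all(frame == frames[0] for frame in frames):
            common = frames[0]
        else:
            break
    return common
```

For example, with one sample of `["MAIN", "SCREEN_LOOP", "READ_FORM", "FDV$GETAL"]` and another of `["MAIN", "SCREEN_LOOP", "REDRAW"]`, the result is `SCREEN_LOOP`, which is where you would start reading the source.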

A crucible of informative mistakes

Mick O'Brien · ‎06-29-2010

All,

I got another looping process last night for the same image - I have done as suggested and got out a PC file and have attached. I have also attached a link map BUT note that the map was generated recently on development and does not relate to the image on production.

Hein,

Tried to use command 'SHOW STAC/USER/SUMM' but it wants a range of memory locations?

John,

I'll give your suggested investigation route a try but I noticed from 'HELP' that command required a 'starting-address'?

Regards,
Mick

John Gillings · ‎06-30-2010

Mick,

That set of PC samples doesn't tell you much. The process is obviously doing stuff, but you're really not interested in anything below your own code, so ignore any of the samples in RTLs. The exceptions may be interesting. Perhaps something is failing, but your code doesn't notice?

Try SUSPENDing and examining the call stack. That should help localise the issue into your own code, somewhere you can check. Also see SDA> SHOW IMAGE to determine base addresses, and if you don't have a current MAP file, please get one!

>but I noticed from 'HELP' that command
>required a 'starting-address'?

The starting address parameter is optional.

From SDA use SHOW CALL:

SDA> SHOW CALL

this will show you the current call frame. To step to the frame of the caller:

SDA> SHOW CALL/NEXT

repeat until you reach the top.

Minimally, take note of the return PC. There may be other things of interest. You can examine the instruction sequence leading up to the call with:

SDA> EXAMINE/INSTR -20;20

SDA> SHOW CALL/SUMMARY

gives one line per call frame, a bit like a traceback.

A crucible of informative mistakes

Hein van den Heuvel · ‎06-30-2010

>>> SHOW STAC/USER/SUMM

My bad. I mixed up SHOW CALL and SHOW STACK. Sorry.

btw... Are you also using the full ACMS tasks with their (TDMS?) terminal request and procedure calls into server procedure images?

The system you describe is Alpha, right? Any Itaniums? Shoot me an email?

Anyway...

Thanks for the trace.
You gave the raw one, not the statistics.

Here is the 'top 20' + user image top 10...

$ perl -ne "$pc{$1}++ if / U [0-9A-F]+ (\S+)/; }{ for (sort {$pc{$b}<=>$pc{$a}} keys %pc){ print qq($pc{$_}\t$_\n) if /COIN/ or $i++<20}" pc.dat

592 FDVSHR+3DB04
568 FDVSHR+377C4
541 FDVSHR+37680
537 FDVSHR+378F4
503 FDVSHR+377FC
503 FDVSHR+376CC
486 FDVSHR+40098
451 LIBOTS+2417C
435 MMG_STD$SWAP_PTBR_C+00838
394 FDVSHR+3DBD4
392 FDVSHR+3D9C4
371 MMG_STD$SWAP_PTBR_C+00840
342 FDVSHR+40070
293 FDVSHR+3DA84
263 MMG_STD$SWAP_PTBR_C+00830
196 FDVSHR+36C00
185 EXE$SYNCH_LOOP_C+00DF4
175 LIBOTS+24170
152 LIBOTS+20290
145 LIBRTL+76A0C
10 COIN_DCL_OSIP+7E294
8 COIN_DCL_OSIP+7C150
8 COIN_DCL_OSIP+7E448
7 COIN_DCL_OSIP+869C4
6 COIN_DCL_OSIP+7E2EC
6 COIN_DCL_OSIP+86940
5 COIN_DCL_OSIP+7E460
5 COIN_DCL_OSIP+86A54
5 COIN_DCL_OSIP+7E344
5 COIN_DCL_OSIP+7E258
5 COIN_DCL_OSIP+7BED0
4 COIN_DCL_OSIP+7C1A4
4 COIN_DCL_OSIP+7E240
4 COIN_DCL_OSIP+7E1D8
4 COIN_DCL_OSIP+86960
3 COIN_DCL_OSIP+7E2C4
3 COIN_DCL_OSIP+7E4C0
3 COIN_DCL_OSIP+7E280
3 COIN_DCL_OSIP+869A4
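
To see at a glance how the samples split between the application and the layers below it, the counts can be rolled up by the name in front of the `+`. A small Python sketch that works on "count name+offset" lines like the ones above; the function name `totals_by_image` is made up for this example, and for the system entries the prefix is a routine name rather than an image name.

```python
from collections import defaultdict

def totals_by_image(stat_lines):
    """Sum sample counts per prefix (the part before the last '+')."""
    totals = defaultdict(int)
    for line in stat_lines:
        parts = line.split()
        # expect exactly "<count> <name>+<offset>"; skip anything else
        if len(parts) != 2 or not parts[0].isdigit() or "+" not in parts[1]:
            continue
        name = parts[1].rsplit("+", 1)[0]
        totals[name] += int(parts[0])
    return dict(totals)
```

Fed the list above, this makes the imbalance obvious: the FDVSHR entries add up to thousands of samples, while each COIN_DCL_OSIP entry has ten or fewer.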

Hein

Richard J Maher · ‎06-30-2010

Biggus Mickus,

Would it be fair to say this code has been running for over 25 years without the looping behaviour? What's changed? New VMS version, layered products, itanium test? When was the last change to OSIP?

Does it only happen after running normally for X hours? Quotas? Leaks?

Over time, has an internal COBOL array grown past its limits and started overwriting the memory below? Number of transactions, number of customers, etc.?

While you're waiting for the correct oscilloscope settings, it might be worth just inspecting the code for CALLs to services that don't check the return status at all or, as Hoff mentioned, the IOSB.

What username do the DCL servers run under? (Oops! IIRC, don't answer that :-( )

Anyway it seemed time for clutching at straws :-)

Cheers Richard Maher

PS. It's very cold here :-(

Mick O'Brien · ‎07-01-2010

John,

Thanks for the pointer on 'SHOW CALL' - I'll give that a try next time

Hein,

How do I get the statistics out?

Richard,

OSIP is 17 years old, but its current 'ACMS' version is 12 years old (it's an image [that uses FMS] called from an ACMS DCL server). The process started to loop about a year ago, but as our version of ACMS was out of support we sort of hoped that the upgrade to the current level would sort it out (it did not). The underlying code itself has not changed: the last time the image was released was November 2006 (a new version, recompiled and relinked, was released post upgrade). The code has looped twice so far this week (Monday and Tuesday) and the alert comes out between 6pm and 6:30pm. I contacted the two users that the image was running under (remember it's a DCL server with dynamic username) and they tell me:

1) They did not do anything unusual
2) Logged off properly (from what they can remember)

I'm going to rebuild/link the image with compile options of /list/machine and link options of /map, try to get that installed, and THEN get a few more PC stats out.

Regards,
Mick

(PS It's supposed to be 27C here but I'm in my cardi [but no vest])

abrsvc · ‎07-01-2010

Mick,

Where are you located? Perhaps a visit might be in order.

Dan

Mick O'Brien · ‎07-01-2010

Costa Del Coutts (London)

Hein van den Heuvel · ‎07-01-2010

Richard M... great question! Why now?
My WAG is emulated terminal sessions being disconnected and the resulting terminal IO errors leading the program astray.

>> Hein, How do I get the statistics out?

Read onwards for alternative thoughts, but the simple, and often sufficient, way is:

SDA> PCS SHOW TRACE /STAT

As with any SDA extension, you can get a quick command overview by just typing the extension name, here PCS.

For this case, PRF may get you more and better data quicker.

Beware as to how to interpret the traces...
The raw, timestamped, full trace from either PCS or PRF may lead you to believe you are looking at a natural flow through a program. Not so! They are just samples and only indicate how often, not in which sequence.
And the samples will pick up PCs at 'slow' memory accesses more often than those at register moves.

One can speculate at a sequence by 'fuzzy' sorting the data roughly by counts first and by address next.

As a coarse example I changed the perl 'one-liner' presented earlier to a program and called out the anonymous simple sort function used earlier into a proper labeled subroutine.

------------------ sda_pcs_trace.pl -------
sub my_sort {
    # if the sample counts of the two PCs are close
    #   then sort ascending by address
    #   else sort descending by count
    #
    if ( abs( $pc{$a} - $pc{$b} ) <= $pc{$a} / 3 ) {
        return ( $a cmp $b );
    } else {
        return ( $pc{$b} <=> $pc{$a} );
    }
}

while ( <> ) {
    # count each symbolized PC found on a trace line carrying the U marker
    $pc{$1}++ if / U [0-9A-F]+ (\S+)/;
}

for ( sort my_sort keys %pc ) {
    # print the first 20 entries overall, plus every PC in the COIN image
    next unless /COIN/ or $i++ < 20;
    printf qq(%6d %s\n), $pc{$_}, $_;
}
-----------------------

Now when we run this we get 'zones' of program activity. Because of the low sample count I used a coarse 30% range to group PCs, because I wanted a count of 3 to sort equal to a count of 4. With more samples you want to change the divider from 3 to 10 or some such and just eyeball whether the result speaks more clearly to you.
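
One caveat with a pairwise "close enough" comparison is that it is not transitive (3 is close to 4 and 4 is close to 5, but 3 is not close to 5 under the rule above), so the exact order can depend on how the sort happens to compare the elements. A variation, sketched here in Python and explicitly not the script above, is to put each count into a logarithmic band first and then sort by band (descending) and address (ascending). The band ratio of 1.5 and the name `zone_sort` are arbitrary choices for illustration.

```python
import math

def zone_sort(counts, ratio=1.5):
    """Order (pc, count) pairs by count band (high first), then by address.

    counts maps a symbolized PC such as "FDVSHR+3DB04" to its sample count
    (at least 1). Counts whose logarithms fall in the same band of width
    log(ratio) are treated as one zone.
    """
    def band(n):
        return math.floor(math.log(n) / math.log(ratio))
    # negative band gives descending zones; the PC string gives ascending address
    return sorted(counts.items(), key=lambda kv: (-band(kv[1]), kv[0]))
```

The band edges are fixed rather than relative to each pair, so two counts just either side of an edge land in different zones; adjusting `ratio` plays the same role as changing the divider in the Perl version.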

Sample resorted output below.

Hope this helps some more.
Hein van den Heuvel
HvdH Performance Consulting

------------ sample stats run -----------
# perl sda_pcs_trace.pl pc.dat
541 FDVSHR+37680
503 FDVSHR+376CC
568 FDVSHR+377C4
503 FDVSHR+377FC
537 FDVSHR+378F4
592 FDVSHR+3DB04
392 FDVSHR+3D9C4
394 FDVSHR+3DBD4
486 FDVSHR+40098
451 LIBOTS+2417C
293 FDVSHR+3DA84
342 FDVSHR+40070
435 MMG_STD$SWAP_PTBR_C+00838
371 MMG_STD$SWAP_PTBR_C+00840
196 FDVSHR+36C00
145 EXCEPTION+08544
185 EXE$SYNCH_LOOP_C+00DF4
263 MMG_STD$SWAP_PTBR_C+00830
122 EXE$SYNCH_LOOP_C+00DB0
141 FDVSHR+36BAC
8 COIN_DCL_OSIP+7E448
8 COIN_DCL_OSIP+7C150
10 COIN_DCL_OSIP+7E294
7 COIN_DCL_OSIP+869C4
6 COIN_DCL_OSIP+7E2EC
5 COIN_DCL_OSIP+7E460
5 COIN_DCL_OSIP+86A54
4 COIN_DCL_OSIP+7E1D8
6 COIN_DCL_OSIP+86940
5 COIN_DCL_OSIP+7E344
5 COIN_DCL_OSIP+7BED0
5 COIN_DCL_OSIP+7E258
4 COIN_DCL_OSIP+7C1A4
4 COIN_DCL_OSIP+7E240
3 COIN_DCL_OSIP+7E280
3 COIN_DCL_OSIP+7E2C4
3 COIN_DCL_OSIP+7E4C0
3 COIN_DCL_OSIP+7E1C0
3 COIN_DCL_OSIP+869A4
2 COIN_DCL_OSIP+869A0
2 COIN_DCL_OSIP+86CB0
2 COIN_DCL_OSIP+870D0
4 COIN_DCL_OSIP+86960
2 COIN_DCL_OSIP+7C1C0
2 COIN_DCL_OSIP+869E0
2 COIN_DCL_OSIP+7C178
2 COIN_DCL_OSIP+869B0
1 COIN_DCL_OSIP+7C190
1 COIN_DCL_OSIP+7C1A0
1 COIN_DCL_OSIP+7C200
1 COIN_DCL_OSIP+7E190
1 COIN_DCL_OSIP+7E230
1 COIN_DCL_OSIP+7E284
1 COIN_DCL_OSIP+7E2A0
1 COIN_DCL_OSIP+7E2D8
1 COIN_DCL_OSIP+7E2F8
1 COIN_DCL_OSIP+7E304
1 COIN_DCL_OSIP+7E310
1 COIN_DCL_OSIP+7E350

Richard J Maher · ‎07-01-2010

Hi Mick,

Looks like the PC analysis way is the only one likely to yield useful results but, in the meantime:

Is there any i/o happening in the loop or just CPU?

I hear what Hein's saying but the users have always been clicking the X rather than exiting gracefully so I don't think there should be anything new there. But certainly very few people I know check every FDV$ status or have the signal error option set, so who knows?

Given what you've told us, I still have to opt for it being the data that's changed. I remember having to extend ACMS workspaces about once a year because there were now more than 100 depts, account types, widgets, whatever. Is it the same account/user that it's happening on each time? Any COBOL arrays in working-storage at all?

Can't remember what OSIP does. Is it the general account screen or the statement thing? IIRC some of these options connected with RDO as well as SQLMOD and did not always share a handle. How many database connections per user? Any transactions have NOWAIT option set or locking loops?

You say it's installed shared; are all the PSECT attributes set to NOSHR for the EXTERNAL stuff like database handles?

Sorry to be as useful as usual :-)

Cheers Richard Maher

PS. Finally a few drops of rain so I can leave the push-bike at home! I ride to work and am still pushing 100kg :-(
