Operating System - OpenVMS

Re: Looping installed image

 
Mick O'Brien
Advisor

Re: Looping installed image

Abrsvc, Bhadresh, Murali, Stephen (Hoff),

I had another looping process last night, did as suggested, and got out some PC details. I noticed that there were a lot of calls to FDVSHR (sorry, I forgot to mention that the process uses FMS), which seems to indicate that the process is hung up on a screen somewhere. These looping processes only happen in the evening, but when I talked to the looping process's owner this morning I was advised that they definitely logged out before going home. Ignoring the end user's comments, I was wondering whether a default timeout on an FDV$GETAL call is not being handled correctly within the code, thus causing the CPU loop. I looked for an FDV$STIME call but could not find one. Does anyone know if FDV$GETAL has a default timeout?

Hein,

I like the idea of sticking the server into debug, but I suspect I would not be allowed to run debug on a production server within the SLA (service level agreement), just in case it brings the rest of the application down! Instead, next time I'll do what you suggested and "...try to find out how the process got there."

Mick
abrsvc
Respected Contributor

Re: Looping installed image

Without seeing the actual PCs involved and looking at the code, I can say this much:

I ran into a problem where a mis-handled error condition would cause the SMG routines to loop forever in a COM state. The problem in initially tracking this down was that it occurred during a process shutdown. At the time of the loop, the majority of the process context was broken down. This means that "normal" debugging techniques were useless.

Check for return status checks that don't handle all possible status values; an unhandled status might leave some buffers corrupted.
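As a rough illustration of that advice (in Python rather than the application's own language, and not taken from the code under discussion): OpenVMS condition values carry success in the low-order bit and severity in the low three bits, so a read loop can treat every status it does not explicitly expect as fatal instead of retrying forever. The timeout and failure values and the `get_input` callback below are hypothetical, invented for the sketch.

```python
from typing import Callable, Optional, Tuple

STATUS_OK = 1                      # SS$_NORMAL
HYPOTHETICAL_TIMEOUT = 0x0000802A  # made-up value for the sketch; low bit clear


def is_success(status: int) -> bool:
    """A condition value signals success when its low-order bit is set."""
    return (status & 1) == 1


def severity(status: int) -> int:
    """Low three bits: 0 warning, 1 success, 2 error, 3 info, 4 severe."""
    return status & 7


def read_field(get_input: Callable[[], Tuple[int, str]]) -> Optional[str]:
    """Call the (hypothetical) input routine once and handle every outcome.

    Success returns the data, the expected timeout returns None so the caller
    can end the session, and anything else raises instead of being retried.
    """
    status, data = get_input()
    if is_success(status):
        return data
    if status == HYPOTHETICAL_TIMEOUT:
        return None
    raise RuntimeError(
        f"unexpected status {status:#010x} (severity {severity(status)})"
    )


if __name__ == "__main__":
    print(read_field(lambda: (STATUS_OK, "hello")))
    print(read_field(lambda: (HYPOTHETICAL_TIMEOUT, "")))
```

The pattern to look for in the real code is the opposite one: a loop of the form "repeat until status is success" with no branch for a status that will never turn into success, such as a terminal that has gone away.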

If you can post parts of the listings here with the PC values, we can perhaps provide some specific suggestions.

Dan
P Muralidhar Kini
Honored Contributor

Re: Looping installed image

Hi Mick,

>> I noticed that there were a lot of calls to FDVSHR (sorry forgot to
>> mention process uses FMS)
Now that you have the PC samples, the interesting thing would be to know the
source code to which the repeated PC values map.
Are you seeing some set of PC values being logged repeatedly in the
PC sampling output?
If yes, which source code lines do those PC values map to?

This would give an idea of exactly where in your source code the loop is.
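One way to do that mapping by hand is to take the routine start offsets from a link map that matches the running image and look up each sampled offset against them. The sketch below is Python; the routine names and start offsets are hypothetical placeholders, not values from the actual map.

```python
from bisect import bisect_right
from typing import Optional, Tuple

# Hypothetical (start offset, routine name) pairs, as they might be copied
# from a link map. These are NOT taken from the real COIN_DCL_OSIP map.
ROUTINES = sorted([
    (0x7BE00, "ROUTINE_A"),
    (0x7E1C0, "ROUTINE_B"),
    (0x86900, "ROUTINE_C"),
])
STARTS = [start for start, _ in ROUTINES]


def locate(offset: int) -> Optional[Tuple[str, int]]:
    """Return (routine name, offset within routine) for an image offset.

    The routine is the one with the greatest start offset that does not
    exceed `offset`. Offsets below the first routine return None.
    """
    i = bisect_right(STARTS, offset) - 1
    if i < 0:
        return None
    start, name = ROUTINES[i]
    return name, offset - start


if __name__ == "__main__":
    for sampled in (0x7E294, 0x7C150, 0x869C4):
        print(hex(sampled), locate(sampled))
```

The lookup is only meaningful if the map was produced by the same link as the image that is looping; a map from a later development build will give plausible but wrong routine names.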

Regards,
Murali
Let There Be Rock - AC/DC
P Muralidhar Kini
Honored Contributor

Re: Looping installed image

Hi Mick,

>> Does anyone know if FDV$GETAL has a default timeout?
FDV$STIME can be used to specify a timeout value, but you said that you did not
find it in your code.

Also, refer to the following link -
http://h71000.www7.hp.com/doc/73final/6619/6619pro_004.html
-> Section - 5.3.14.2 DECforms Timeout Values

Check whether that's applicable in your case.

Regards,
Murali
Let There Be Rock - AC/DC
Hoff
Honored Contributor

Re: Looping installed image

I suspect you'll have to roll up your sleeves and compile with debugging enabled and launch the debugger and isolate the error.

Either directly.

Or by instrumenting the code.

Or both.

The virtual addresses that other replies discuss will usually help you find the loop, but there's still work to be done in the source code to isolate the trigger for the loop.

I've seen all manner of network and lower-level errors triggering this sort of misbehavior, and those can lead application code into all manner of dark corners.

And yes, application errors are most definitely entirely in play here, too.

Old code often has issues with its error handling. The old stuff probably once worked fine, but just wasn't designed for modern systems and modern networks and modern problems.

Folks don't necessarily think about how much stuff has actually changed within the whole of the stack, and what that means for the applications.

As a testament to the degree of compatibility that has been achieved here, comparatively simple user applications that once used FMS to chat with VT100 or VT220 terminals are now using various wholly new intermediate layers out a telnet connection to some random terminal emulation on some other operating system, and you can be assured that there are whole zoos filled with the errors that can now arise here.

Networking in particular offers a well-populated zoo.

How much old COBOL code was ever implemented with the intermittent connectivity and with the requirements for reconnections and refreshes that can arise with mobile IP networking?

There's no magic bullet here. The PC is just the jumping off point. You're just going to have to isolate and debug your code.

And yes, this could well be an error in lower-level code, but (in most any non-trivial application environment) you're generally left to have to prove that is the case.
John Gillings
Honored Contributor

Re: Looping installed image

The only problem using SDA for this is that the process is executing at the time. Fix that with SUSPEND:

When a process is looping:

$ SET PROCESS/SUSPEND/ID=looping-process

Now use SDA to examine the process. See SHOW CALL, SHOW CALL/NEXT etc... to walk up the call stack. The first few frames will be related to the SUSPEND, but you should get some idea of where the process is looping. Sanity check with:

$ SET PROCESS/RESUME
$ SET PROCESS/SUSPEND

and repeat the traceback. Do this a few times to find the deepest routine common to all. That will contain the loop.
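The comparison step can be sketched as follows (Python, with made-up routine names; the real input would be the routine names read off each SHOW CALL walk, with the SUSPEND-related frames dropped and the list reversed so the outermost caller comes first):

```python
from typing import List, Optional


def deepest_common_frame(stacks: List[List[str]]) -> Optional[str]:
    """Return the innermost routine shared by every traceback.

    Each traceback lists routines outermost caller first, so the frames common
    to all samples form a shared prefix; its last entry is the deepest routine
    common to all.
    """
    if not stacks:
        return None
    common = None
    for frames in zip(*stacks):
        if all(frame == frames[0] for frame in frames):
            common = frames[0]
        else:
            break
    return common


if __name__ == "__main__":
    # Hypothetical tracebacks from three suspend/resume cycles.
    samples = [
        ["MAIN", "PROCESS_SCREEN", "READ_FIELD", "FDV_CALL"],
        ["MAIN", "PROCESS_SCREEN", "READ_FIELD", "FDV_CALL"],
        ["MAIN", "PROCESS_SCREEN", "SHOW_FORM", "FDV_CALL"],
    ]
    print(deepest_common_frame(samples))  # PROCESS_SCREEN
```

Comparing routine names rather than raw return PCs matters here: a looping routine that calls several different routines will show a different return PC in each sample even though the routine itself is the same.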

A crucible of informative mistakes
Mick O'Brien
Advisor

Re: Looping installed image

All,

I got another looping process last night for the same image. I have done as suggested, got out a PC file, and attached it. I have also attached a link map, BUT note that the map was generated recently on development and does not relate to the image on production.

Hein,

I tried to use the command 'SHOW STAC/USER/SUMM' but it wants a range of memory locations?

John,

I'll give your suggested investigation route a try, but I noticed from 'HELP' that the command requires a 'starting-address'?

Regards,
Mick
John Gillings
Honored Contributor

Re: Looping installed image

Mick,

That set of PC samples doesn't tell you much. The process is obviously doing stuff, but you're really not interested in anything below your own code, so ignore any of the samples in RTLs. The exceptions may be interesting. Perhaps something is failing, but your code doesn't notice?

Try SUSPENDing and examining the call stack. That should help localise the issue into your own code, somewhere you can check. Also see SDA> SHOW IMAGE to determine base addresses, and if you don't have a current MAP file, please get one!

>but I noticed from 'HELP' that command
>required a 'starting-address'?

The starting address parameter is optional.

From SDA use SHOW CALL:

SDA> SHOW CALL

this will show you the current call frame. To step to the frame of the caller:

SDA> SHOW CALL/NEXT

repeat until you reach the top.

Minimally, take note of the return PC. There may be other things of interest. You can examine the instruction sequence leading up to the call with:

SDA> EXAMINE/INSTR -20;20

SDA> SHOW CALL/SUMMARY

gives one line per call frame, a bit like a traceback.
A crucible of informative mistakes
Hein van den Heuvel
Honored Contributor

Re: Looping installed image

>>> SHOW STAC/USER/SUMM

My bad. I mixed up SHOW CALL and SHOW STACK. Sorry.

btw... Are you also using the full ACMS tasks with their (TDMS?) terminal request and procedure calls into server procedure images?

The system you describe is Alpha, right? Any Itaniums? Shoot me an email?

Anyway...

Thanks for the trace.
You gave the raw one, not the statistics.

Here is the 'top 20' + user image top 10...

$ perl -ne "$pc{$1}++ if / U [0-9A-F]+ (\S+)/; }{ for (sort {$pc{$b}<=>$pc{$a}} keys %pc){ print qq($pc{$_}\t$_\n) if /COIN/ or $i++<20}" pc.dat

592 FDVSHR+3DB04
568 FDVSHR+377C4
541 FDVSHR+37680
537 FDVSHR+378F4
503 FDVSHR+377FC
503 FDVSHR+376CC
486 FDVSHR+40098
451 LIBOTS+2417C
435 MMG_STD$SWAP_PTBR_C+00838
394 FDVSHR+3DBD4
392 FDVSHR+3D9C4
371 MMG_STD$SWAP_PTBR_C+00840
342 FDVSHR+40070
293 FDVSHR+3DA84
263 MMG_STD$SWAP_PTBR_C+00830
196 FDVSHR+36C00
185 EXE$SYNCH_LOOP_C+00DF4
175 LIBOTS+24170
152 LIBOTS+20290
145 LIBRTL+76A0C
10 COIN_DCL_OSIP+7E294
8 COIN_DCL_OSIP+7C150
8 COIN_DCL_OSIP+7E448
7 COIN_DCL_OSIP+869C4
6 COIN_DCL_OSIP+7E2EC
6 COIN_DCL_OSIP+86940
5 COIN_DCL_OSIP+7E460
5 COIN_DCL_OSIP+86A54
5 COIN_DCL_OSIP+7E344
5 COIN_DCL_OSIP+7E258
5 COIN_DCL_OSIP+7BED0
4 COIN_DCL_OSIP+7C1A4
4 COIN_DCL_OSIP+7E240
4 COIN_DCL_OSIP+7E1D8
4 COIN_DCL_OSIP+86960
3 COIN_DCL_OSIP+7E2C4
3 COIN_DCL_OSIP+7E4C0
3 COIN_DCL_OSIP+7E280
3 COIN_DCL_OSIP+869A4
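
For anyone who would rather not decode the one-liner, here is a rough Python equivalent. It assumes the same raw line shape the Perl pattern matches (a " U " mode flag, a hex PC, then image+offset); the demo lines are invented, not copied from pc.dat.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Same pattern as the Perl one-liner: " U <hex PC> <symbol+offset>".
SAMPLE = re.compile(r" U [0-9A-F]+ (\S+)")


def tally(lines: Iterable[str]) -> Counter:
    """Count how often each symbolized user-mode PC appears."""
    counts: Counter = Counter()
    for line in lines:
        match = SAMPLE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def report(counts: Counter, top: int = 20, image: str = "COIN") -> List[Tuple[int, str]]:
    """Rows sorted by descending count: every location whose name contains
    `image`, plus the first `top` locations that do not."""
    rows = []
    shown = 0
    for location, n in counts.most_common():
        if image in location:
            rows.append((n, location))
        elif shown < top:
            rows.append((n, location))
            shown += 1
    return rows


def by_image(counts: Counter) -> Counter:
    """Collapse per-location counts into per-image totals."""
    totals: Counter = Counter()
    for location, n in counts.items():
        totals[location.split("+", 1)[0]] += n
    return totals


if __name__ == "__main__":
    # Invented sample lines in the assumed format.
    demo = [
        "1 U 7B43DB04 FDVSHR+3DB04",
        "2 U 7B43DB04 FDVSHR+3DB04",
        "3 U 7B4377C4 FDVSHR+377C4",
        "4 U 0007E294 COIN_DCL_OSIP+7E294",
        "5 K 80012345 SOMETHING_ELSE+00010",
    ]
    counts = tally(demo)
    for n, location in report(counts, top=1):
        print(f"{n}\t{location}")
    print(by_image(counts))
```

The per-image totals make the split between time spent in FDVSHR and time spent in the application image easier to see than the per-PC list does.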


Hein

Richard J Maher
Trusted Contributor

Re: Looping installed image

Biggus Mickus,

Would it be fair to say this code has been running for over 25 years without the looping behaviour? What's changed? New VMS version, layered products, Itanium test? When was the last change to OSIP?

Does it only happen after running normally for X hours? Quotas? Leaks?

Over time, has an internal COBOL array grown past its limits, so that it is overwriting memory below? Number of transactions, no. of customers etc?

While you're waiting for the correct oscilloscope settings, it might be worth just inspecting the code for CALLs to services that don't check the return status at all or, as Hoff mentioned, the iosb.

What username do the DCL servers run under? (Ooops! I IIRC don't answer that :-( )

Anyway it seemed time for clutching at straws :-)

Cheers Richard Maher

PS. It's very cold here :-(