Re: Remote Procedure Call problem.

Tim-Ellis · 03-12-2012

We have a process which makes use of RPC calls to fetch data from a remote (non-VMS) server. This process was originally built on an Alpha using OpenVMS 7.3-2 and TCPIP 5.4, and has worked without any problems for 2+ years.

We recently migrated to a new VMS server running OpenVMS 8.4 and TCPIP 5.7. We rebuilt and relinked the procedure under this version, and it runs for between 15 minutes and 4 hours, then starts to fail. Once it starts failing it seems to continue to do so unless one or other of the servers involved is rebooted.

Running the executable built under OpenVMS 7.3-2 continues to work without displaying this problem.

The error returned is "JCI_SERVER: RPC: Remote system error - Connection refused"

Any idea what might be causing the problem?

abrsvc · 03-12-2012

The most common cause for behavior changes like this in my experience have been alignment issues. You don't indicate the language used for the application. Are there changes in the compiler(s) that may change the data layout? I'd expect the procedure calls to be the same, but look for any data issues first. Update this with more details about the languages used etc. and we may be able to provide additional suggestions.

Thanks,

Dan

Tim-Ellis · 03-12-2012

The process on the Remote (non VMS) server is unchanged.

The calling procedure was generated using RPCGEN, which produces C code.

The old (working) version was compiled using DEC C V5.7-004.

The new version was compiled using HP C V7.3-009-48GBT.

But note that the new version has worked. I would have thought that, were it an alignment issue, it would never work? It appears to be the case that once the procedure fails it will not restart again without a reboot of one or both servers.

abrsvc · 03-12-2012

If you can trace the network packets, I would compare a "working" session with the "non-working" one. You should see a difference in the packet construction. While you may not have control over the packets directly (driver level), you should be able to see the differences. Once that is available, further research into the reasons for the change can occur. I would suspect a subtle change in the underlying TCPIP code as well. Again, examining the packets themselves should reveal the problem with the call(s).

Dan

Hoff · 03-12-2012

That's a generic RPC error. Off the top, look for dangling network connections, and for errors that would prevent a connection.

You went from a functional release to a new architecture, a new release and a new compiler, on a version that's had issues in other areas. Instruction timings and performance characteristics change. Stack contents change. In aggregate, that's a significant change.

First stop is loading the current relevant patches. Then call the HP support center, asking for anything relevant that hasn't been released as a patch that's related to networking or RPC or TCP/IP Services. (There have been various issues with VMS and TCP, and your own code is always going to be a big suspect.)

If the patches are not successful and if HP Support doesn't know of any relevant details or fixes or errors, then you're going to have to locate which resource is being depleted; a channel leak looks like a potential candidate. You're going to be debugging this.
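A minimal sketch of what a channel leak looks like in C, and one cheap way to spot it. This is not the application's code: it uses POSIX socket descriptors as a stand-in for VMS channels, and `do_request` and `lowest_free_fd` are hypothetical names made up for illustration.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical per-request helper: opens a socket, does some work, and
   releases the socket on every path. `fail` simulates an error partway
   through the request. Returns 0 on success, -1 on failure. */
int do_request(int fail)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int rc = fail ? -1 : 0;   /* stand-in for the real exchange */

    /* The classic leak is an early `return` on the error path that
       skips this close(); here both paths fall through to it. */
    close(fd);
    return rc;
}

/* Returns the descriptor number the next socket() call would get.
   POSIX hands out the lowest free number, so if this value keeps
   climbing while the process runs, something is not being released. */
int lowest_free_fd(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd >= 0)
        close(fd);
    return fd;
}
```

Calling `lowest_free_fd()` before and after a batch of requests, including failed ones, should give the same number if nothing leaks. The same idea applies to RPC client handles: each handle that is created needs a matching destroy on the error paths as well as the success path.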

That this application worked before is not as relevant as any of us might hope. There can be latent bugs exposed by a port, and there can be latent bugs lurking in otherwise working code for decades.

When approaching this, consider your source code to be buggy, and work to prove that it's not. Once you've proved your code is not the trigger, then you now have fodder to work with HP support to resolve the error. Or if not, you've found your bug.

When you've got the wedge, crash the process (writing out a process dump) and have a look at the carcass. That might involve building with debugging information. But you've found a good reproducer, so you've got a path to resolve this. Watching the network traffic is one tactic here, but I'd look at the source code, too; if this is a channel leak (for instance), that might or might not show up in a packet trace.

Hein van den Heuvel · 03-12-2012

>> The most common cause for behavior changes like this in my experience have been alignment issues.

I beg to differ.

When Dan mentions "alignment issues" I originally assumed he was referring to incorrect alignment generating alignment faults. I do not think there has been a single reported case where alignment faults changed functionality (directly). Alignment faults only impact performance (and with that possibly the ordering of things, but if there is an issue with that then the code is broken already). On re-read I see this was just a version change, staying on Alpha.

Now maybe Dan meant that alignment issues changed data packing / record / structure layout.

Yes, that can break code and has broken code, but in my experience such issues show up quickly, not after hours of running successfully as indicated (15 min - 4 hr).

>> It appears to be the case that once the procedure fails it will not restart again without a reboot of one or both servers.

That's odd!? Must be some resource issue.

Does the server fall over, or just stop talking? What does the server resource usage look like?

Instead of a full reboot, have they tried a TCPIP stop - start?

fwiw,

Hein.

abrsvc · 03-12-2012

Sorry about the confusion. I did indeed mean alignment in terms of packing of the individual items within a structure. I should have been more clear.

Dan

GuentherF · 03-12-2012

The allocation change of structures due to member alignment may have increased the size of the structures. If the code uses some hardcoded structure sizes for VM allocation this may now cause memory corruptions. And it may take a while until that memory corruption happens at the right spot. So there IS a possibility.

Can the code be compiled with /NOMEMBER_ALIGNMENT and tested for a couple of hours?
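To illustrate the hazard, here is a small sketch in C. The structure `reply_rec` and the helper names are hypothetical and not taken from the application in this thread; the point is only that a size worked out by hand for a packed layout can be smaller than what the compiler actually allocates once members are naturally aligned.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record, invented for this example. */
struct reply_rec {
    char   status;   /* 1 byte */
    int    count;    /* usually padded up to an int boundary */
    char   flag;     /* 1 byte */
    double value;    /* usually padded up to a double boundary */
};

/* Sum of the member sizes: the figure a hand-computed, hardcoded size
   would use if the author assumed a packed layout. */
#define PACKED_GUESS \
    (sizeof(char) + sizeof(int) + sizeof(char) + sizeof(double))

/* Padding bytes the compiler inserted; 0 only when the layout is packed.
   Allocating PACKED_GUESS bytes and then writing a whole struct would
   overrun the allocation by this many bytes. */
size_t reply_rec_padding(void)
{
    return sizeof(struct reply_rec) - PACKED_GUESS;
}

/* Safe allocation: let the compiler supply the size. */
struct reply_rec *reply_rec_new(void)
{
    struct reply_rec *r = malloc(sizeof *r);
    if (r != NULL)
        memset(r, 0, sizeof *r);
    return r;
}
```

If `reply_rec_padding()` is non-zero under one set of compiler qualifiers and zero under another, any size constant that was computed by hand is suspect. Searching the source for numeric literals passed to allocation or copy routines is a quick way to find them.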

/Guenther

Tim-Ellis · 03-16-2012

Thanks for all the suggestions. I finally managed to track down the problem with the aid of TCPTRACE: the process that was failing was calling the wrong ports. The C code produced by RPCGEN was originally compiled on a third machine due to the availability of C licences, and a change made there had not been copied back to the source code used to build the application on the VMS 8.4 machine. Once the correct version of the program was compiled and built, the interface works fine.
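For anyone hitting the same "Connection refused" message, a quick way to check whether anything is listening on the port the client is using is a plain TCP connect. The sketch below is a generic POSIX example, not the code from this application; `probe_tcp_port` is a hypothetical helper name, and the address and port passed to it would be whatever your client is configured to call.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try a TCP connect to ip:port (ip is a dotted-quad IPv4 string).
   Returns 0 if something accepted the connection, otherwise the errno
   value from the failing call. ECONNREFUSED is the code behind the
   "Connection refused" text and means nothing is listening there. */
int probe_tcp_port(const char *ip, unsigned short port)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return EINVAL;              /* not a valid IPv4 address string */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return errno;

    int rc = 0;
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) != 0)
        rc = errno;                 /* save before close() can change it */

    close(fd);
    return rc;
}
```

Passing the result to `strerror()` gives the familiar message text. A refusal on the port the client uses, together with success on the port the server actually listens on, points at a client/server mismatch like the one described above rather than at a resource problem.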
