Re: RX2800 node crash when on same network as a Redhat 6 server

Volker Halle · ‎05-04-2016

Hi Brian,

You could diagnose nonpaged pool corruption problems by setting the system parameter SYSTEM_CHECK=1 or POOLCHECK and analyzing the resulting system crashes. The crashes may become more frequent with these parameters set, and there is some hope that the problems will be detected 'earlier'.

You could also look at the current crashes, but you need to save each of them in order to detect common patterns of nonpaged pool corruption.
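
A minimal sketch of that pattern hunt, assuming each saved crash has already been reduced by hand to a few fields. The field names and values here are hypothetical placeholders, not real SDA output, and the helper `common_patterns` is invented for illustration:

```python
from collections import Counter

# Hypothetical per-crash summaries, one dict per saved dump.
# Field names and values are placeholders, not real SDA output.
crashes = [
    {"bugcheck": "BUGCHECK_A", "image": "IMAGE_X", "packet_size": 192},
    {"bugcheck": "BUGCHECK_A", "image": "IMAGE_Y", "packet_size": 192},
    {"bugcheck": "BUGCHECK_B", "image": "IMAGE_X", "packet_size": 192},
    {"bugcheck": "BUGCHECK_A", "image": "IMAGE_Z", "packet_size": 64},
]


def common_patterns(crashes, min_share=0.5):
    """Return (field, value, count) for values seen in at least
    min_share of the crashes, most frequent first."""
    counts = Counter()
    for crash in crashes:
        for field, value in crash.items():
            counts[(field, value)] += 1
    threshold = min_share * len(crashes)
    hits = [(f, v, n) for (f, v), n in counts.items() if n >= threshold]
    # Sort by descending count, then field name, for a stable report.
    return sorted(hits, key=lambda item: (-item[2], item[0], str(item[1])))


for field, value, count in common_patterns(crashes):
    print(f"{field}={value} seen in {count} of {len(crashes)} crashes")
```

A value that recurs across most dumps (the same corrupted packet size, say) is a hint about where to look next. It is not proof of the cause.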

Volker.

abrsvc · ‎05-04-2016

As for why some machines crash and others do not: I would suggest that this is a matter of timing. It is quite possible that the corruption is occurring on all machines. The particular workload of the ones failing is such that the corruption is seen.

As an example, I had a client with software that worked at many sites without problems. At one site, the application would fail with an ACCVIO. Same software, same machine hardware. The problem was traced to a variable that was not initialized properly. This resulted in a memory location being used that was not expected. This particular client's set parameters resulted in the consumption of that location (high end of virtual memory) where others did not. Thus, the problem reported.

Here, the corrupt area may not be used very often by the other machines because of their workload, so the corruption (while still there) is avoided.

Dan

Hoff · ‎05-04-2016

Ah, this old chestnut. May I translate the customer's request for you? This request is either "please spend more than a little time and effort to re-debug the known crashes that have already been fixed by the patches, and prove to me which one is involved here" or my favorite variation "exactly which patch do I need to install to fix this, because I can't install all mandatory patches for {reasons}."

I've gone on more than a few of these rock-fetches over the years, and the best and simplest answer is usually that somebody screwed up and didn't load the mandatory patches, and that there should be a policy of installing mandatory patches as they become available and can be tested. Once the mandatory patches are all loaded and once any subsequent crashes have been run through a crash scanner, then the system crashes get far more interesting to everybody involved.

If your customer wants to know the specific cause here, then you're going to be using the source listings for OpenVMS itself in conjunction with the system dump analyzer to determine what has apparently corrupted pool and — in this case — there's a non-trivial chance you'll be reverse-engineering the binary code for TCP/IP Services as there are no source listings available for that. Probably the first step here is to wander around and see what's getting corrupted in pool, what's building up in pool, and what patterns might exist to the corruptions, or if there are registers or some other resource getting corrupted. (Pool corruptions and register corruptions can be some of the most wonderfully difficult bugs to locate, too — the triggers can be subtle, and the faulty code can be somewhere completely unexpected. There was an NFS floating point register corruption from a ~dozen years ago that is still one of my benchmarks for bizarre crashes.)

Now once you're done with the rock-fetch and know the trigger, the information you'll have gathered will usually lead to one of two outcomes: "apply the patches to fix this" or "apply the patches and submit a crashdump". Knowing the specific trigger doesn't solve any of this, unless you're also going to be creating the patch yourself. Either of the usual outcomes here can be predicted with some certainty, and the rock-fetch usually only serves to delay the actual and desired outcome of a stable system, too.

TL;DR: install the mandatory patches, and escalate any subsequent crashes to HPE or VSI, and figure out why the mandatory patches and updates aren't being loaded expeditiously.

Brian Reiter · ‎05-05-2016

Hi Folks,

Applying the patches appears to have resolved the problem, although more testing is required to satisfy ourselves. The next interesting job is rolling out the patches to a number of sites.

Thanks for all your help and advice.

cheers

Brian
