04-29-2016 05:23 AM
Hi folks,
Not sure if this is a networking issue or not. We have a two-node cluster comprising a SAN array and two RX2800s, both on OpenVMS V8.4 Update 7 and TCP/IP Services V5.7 ECO 3.
Both nodes boot into the cluster quite happily and will sit there; however, as soon as a Redhat server is introduced onto the network, both nodes bugcheck with NOTWCBWCB, Corrupted WCB list. Both nodes use an NFS share running on the Redhat system. Rolling back the Redhat system allows the nodes to reboot and then function as expected. The network cards are plugged into the PCI riser card. I haven't verified whether or not the issue will occur with the network ports on the motherboard.
We run several installations of this system, and at least one of these works correctly with the updated Redhat server, the primary difference being that that particular system is a few months older than the one crashing.
A secondary concern is that at around the same time the RX2800 managed to lose its boot options and iLO passwords. I'm not sure if this is related or just plain bad luck. In any event the system no longer boots.
thanks in advance
Brian
04-29-2016 05:29 AM
Re: RX2800 node crash when on same network as a Redhat 6 server
Hi Brian,
a NOTWCBWCB crash is most likely a software issue (pool corruption?). Could you provide the CLUE file (CLUE$COLLECT:CLUE$node_yymmdd_hhmm.LIS) as an attachment?
The TCPIP NFS client would be the most likely culprit. What happens if you just DO NOT mount those NFS shares on that Redhat NFS server from your rx2800?
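A quick sketch of how to check and remove those mounts (syntax recalled from TCP/IP Services V5.7; the DNFS device name and paths here are examples, not taken from this system, so verify against HELP TCPIP first):

```dcl
$ ! List current NFS client mounts and the server paths behind them
$ TCPIP SHOW MOUNT /FULL
$ ! Record the TCPIP component versions while you are at it
$ TCPIP SHOW VERSION /ALL
$ ! Dismount the NFS client device so nothing touches the Redhat share
$ DISMOUNT DNFS1:
```

If the crashes stop while the shares are dismounted, that points strongly at the NFS client path.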
Volker.
04-29-2016 05:41 AM
Hi Volker,
I've attached one of the CLUE files. We've rolled the Redhat server back to give the customer a working system. The manufacturers will be in the office next week to do some more testing, at which point we'll try to work out which protocol caused the issue.
The concern from my point of view is that the new version of the system works at their office but not on the live site.
cheers
Brian
04-29-2016 06:02 AM - edited 04-29-2016 06:03 AM
Hi Brian,
the code in [F11X]WITURN (in routine MARK_COMPLETE) walks a list of Files-11 related data structures in nonpaged pool. If it finds a packet which is NOT of the expected type (in this case a WCB = Window Control Block), it bugchecks:
BUG_CHECK (NOTWCBWCB, FATAL, 'Corrupted WCB list');
So my initial analysis still holds: most likely pool corruption by some software component, and something in the TCPIP stack or the TCPIP NFS client is the most likely culprit.
Compare NFS versions (TCPIP SHOW VERS/ALL) between a working system and a failing one.
Volker.
04-29-2016 06:35 AM
Hi Volker,
I've just had a quick look at some of the other bugcheck reasons; I got:
28-APR-2016 11:54 V8.4 HP rx2800 i2 (1.60 MAINT1 CPUSPINWAIT 80193B20 SYSTEM_SYNCHRONIZATION_ 00010F20
28-APR-2016 12:08 V8.4 HP rx2800 i2 (1.60 MAINT1 FATALEXCPT MSS_20_SY_50199 80704462 LOCKING 00056C62
28-APR-2016 12:21 V8.4 HP rx2800 i2 (1.60 MAINT1 NOTFCBFCB SYS_MONITOR 80787430 F11BXQP 00043030
28-APR-2016 12:50 V8.4 HP rx2800 i2 (1.60 MAINT1 SSRVEXCEPT SIG_ACTPRO 801EFBC0 SYSTEM_SYNCHRONIZATION_ 0006CFC0
And
28-APR-2016 11:53 V8.4 HP rx2800 i2 (1.60 MAINT2 INVEXCEPTN NULL 80A33430 SECURITY 0002C430
28-APR-2016 12:08 V8.4 HP rx2800 i2 (1.60 MAINT2 NOTWCBWCB CIMDAEMON 807898A0 F11BXQP 000454A0
I think one of the nodes will have 20 or so bugchecks from Tuesday night. I suspect they're all related.
cheers
Brian
04-29-2016 02:41 PM
> [...] TCPIP/IP 5-7 ECO 3
If you do suspect an NFS-related problem, then you might start by
considering getting the TCPIP software up to date. I have an
ill-maintained hobbyist system with newer than that, and the
availability of newer than mine would not amaze me.
REX $ tcpip show version
HP TCP/IP Services for OpenVMS Industry Standard 64 Version V5.7 - ECO 4
on an HP rx2600 (1.50GHz/6.0MB) running OpenVMS V8.4
05-03-2016 06:54 AM
OK, I've been able to experiment a bit (without having a customer breathing down my neck):
It looks as though the core networking is fine. Without the application starting up, I can manually mount the NFS share hosted on the Redhat host, I can ping the host, SSH to it and so on. So at that level it looks as though things are OK.
As soon as I start the application up, things go awry; the bugchecks seem to be inconsistent:
3-MAY-2016 11:22 V8.4 HP rx2800 i2 (1.60 NWRCC1 NOTFCBFCB SIG_20_SY_2674 80787430 F11BXQP 00043030
3-MAY-2016 12:14 V8.4 HP rx2800 i2 (1.60 NWRCC1 UNXSIGNAL NWRCC1_HW_IA64 00000000 <not available> 00000000
3-MAY-2016 12:27 V8.4 HP rx2800 i2 (1.60 NWRCC1 INVEXCEPTN TCPIP$RE_BG2101 80118240 SYSTEM_PRIMITIVES_MIN 00108240
3-MAY-2016 12:38 V8.4 HP rx2800 i2 (1.60 NWRCC1 SSRVEXCEPT DNFS2011ACP 80704351 LOCKING 00056B51
3-MAY-2016 12:48 V8.4 HP rx2800 i2 (1.60 NWRCC1 INVEXCEPTN DNFS2012ACP 80102820 SYSTEM_PRIMITIVES_MIN 000F2820
3-MAY-2016 13:42 V8.4 HP rx2800 i2 (1.60 NWRCC1 UNXSIGNAL SIG_20_SY_46329 90AFB7CF <not available> 00000000
The exceptions may be network related, but I'm more concerned by the seemingly random nature of the crashes.
05-03-2016 06:58 AM
Hi Brian,
these are the TYPICAL symptoms of nonpaged pool corruption: crashes all over the place! You may even be able to reproduce these crashes WITHOUT starting the application, by copying files from/to the NFS share on the Redhat server.
Get and install the most recent TCPIP ECO first!
Volker.
05-03-2016 09:14 AM
Brian,
I will echo Volker's recommendation. Spending time now trying to locate crash causes will be a waste of time. With pool corruption, the crashes will be random and will not show any particular cause. Upgrade TCPIP at least to the most recent available patch you can. I would also look at any release notes available for more recent VMS releases to see if there are any pool-related "updates". If these problems continue after the upgrade, a more drastic investigation effort may be necessary.
Dan
05-04-2016 12:08 AM
Hi Folks,
I'll try the patches today, although I still have to explain to the customer why, out of four two-node RX2800 clusters and one RX2660 running the same version of the OS (including patches) and the same application software, two of the clustered systems fail with this error.
Are there any tools etc. I can use (now and in the future) to investigate these problems?
cheers
Brian
05-04-2016 12:28 AM
Hi Brian,
you could diagnose nonpaged pool corruption problems by setting the system parameter SYSTEM_CHECK=1 or POOLCHECK and analyzing the system crashes. The crashes may become more frequent with these parameters set, and there is some hope that the problems will be detected 'earlier'.
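For reference, a sketch of how these parameters are usually set (via MODPARAMS and AUTOGEN; a reboot is required, and the exact procedure should be checked against the SYSGEN documentation):

```dcl
$ ! Add this line to SYS$SYSTEM:MODPARAMS.DAT:
$ !     SYSTEM_CHECK = 1
$ ! Then regenerate parameters and reboot:
$ @SYS$UPDATE:AUTOGEN GETDATA REBOOT NOFEEDBACK
```

SYSTEM_CHECK=1 enables a number of consistency checks (including pool checking) and carries a performance cost, so it is best left on only while hunting this problem.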
You could also look at the current crashes, but you need to save each of them, to try to detect common patterns of nonpaged pool corruption.
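Saving and inspecting each dump might look something like this (SDA commands recalled from memory; the output file name is only an example):

```dcl
$ ! Save the dump before the next crash overwrites it, then inspect it
$ ANALYZE/CRASH_DUMP SYS$SYSTEM:SYSDUMP.DMP
SDA> COPY DKA100:[CRASHES]NODE1_CRASH.DMP   ! preserve this dump file
SDA> SHOW CRASH                             ! bugcheck type and failing PC
SDA> SHOW POOL/SUMMARY                      ! overview of nonpaged pool
SDA> SHOW SUMMARY                           ! processes at the time of the crash
```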
Volker.
05-04-2016 05:25 AM
As far as an explanation for why some machines fail and others don't: I would suggest that this is a matter of timing. It is quite possible that the corruption is occurring on all machines, and the particular workload of the ones failing is such that the corruption is seen.
As an example, I had a client with software that worked at many sites without problems. At one site, the application would fail with an ACCVIO. Same software, same machine hardware. The problem was traced to a variable that was not initialized properly. This resulted in a memory location being used that was not expected. This particular client's set parameters resulted in the consumption of that location (high end of virtual memory) where others did not. Thus, the problem reported.
Here, the corrupt area may not be used very often by the other machines, based upon their workload, and so the corruption (while still there) is avoided.
Dan
05-04-2016 06:37 AM
Ah, this old chestnut. May I translate the customer's request for you? This request is either "please spend more than a little time and effort to re-debug the known crashes that have already been fixed by the patches, and prove to me which one is involved here" or my favorite variation, "exactly which patch do I need to install to fix this, because I can't install all mandatory patches for {reasons}."
I've gone on more than a few of these rock-fetches over the years, and the best and simplest answer is usually that somebody screwed up and didn't load the mandatory patches, and that there should be a policy of installing mandatory patches as they become available and can be tested. Once the mandatory patches are all loaded and once any subsequent crashes have been run through a crash scanner, then the system crashes get far more interesting to everybody involved.
If your customer wants to know the specific cause here, then you're going to be using the source listings for OpenVMS itself in conjunction with the system dump analyzer to determine what has apparently corrupted pool and — in this case — there's a non-trivial chance you'll be reverse-engineering the binary code for TCP/IP Services as there are no source listings available for that. Probably the first step here is to wander around and see what's getting corrupted in pool, what's building up in pool, and what patterns might exist to the corruptions, or if there are registers or some other resource getting corrupted. (Pool corruptions and register corruptions can be some of the most wonderfully difficult bugs to locate, too — the triggers can be subtle, and the faulty code can be somewhere completely unexpected. There was an NFS floating point register corruption from a ~dozen years ago that is still one of my benchmarks for bizarre crashes.)
Now once you're done with the rock-fetch and know the trigger, then the information you'll have gathered will usually either lead to the outcomes "apply the patches to fix this" or "apply the patches and submit a crashdump" — knowing the specific trigger doesn't solve any of this, unless you're also going to be creating the patch yourself. Either of the usual outcomes here can be predicted with some certainty, and usually only serve to delay the actual and desired outcome of a stable system, too.
TL;DR: install the mandatory patches, and escalate any subsequent crashes to HPE or VSI, and figure out why the mandatory patches and updates aren't being loaded expeditiously.
05-05-2016 12:26 AM
Hi Folks,
Applying the patches appears to have resolved the problem, although more testing is required to satisfy ourselves. The next interesting job is rolling out the patches to a number of sites.
Thanks for all your help and advice.
cheers
Brian