Re: RX2800 node crash when on same network as a Redhat 6 server

Brian Reiter · 04-29-2016

Hi folks,

Not sure if this is a networking issue or not. We have a 2 node cluster comprising a SAN array and two RX2800s, both on OpenVMS 8-4 Update 7 and TCPIP/IP 5-7 ECO 3.

Both nodes boot into the cluster quite happily and will sit there; however, as soon as a Redhat server is introduced onto the network, both nodes bugcheck with NOTWCBWCB, Corrupted WCB list. Both nodes use an NFS share running on the Redhat system. Rolling back the Redhat system allows the nodes to reboot and then function as expected. The network cards are plugged into the PCI riser card. I haven't verified whether or not the issue will occur with the network ports on the motherboard.

We run several installations of this system and at least one of these works correctly with the updated Redhat server, the primary difference being that that particular system is a few months older than the one crashing.

A secondary concern is that at around the same time the RX2800 managed to lose its boot options and iLO passwords; I'm not sure if this is related or just plain bad luck. In any event the system no longer boots.

thanks in advance

Brian

Volker Halle · 04-29-2016

Hi Brian,

a NOTWCBWCB crash is most likely a software issue (pool corruption?). Could you provide the CLUE file (CLUE$COLLECT:CLUE$node_yymmdd_hhmm.LIS) as an attachment?

TCPIP NFS client would be the most likely culprit. What happens if you just DO NOT mount the NFS shares from that Redhat NFS server on your rx2800?

Volker.

Brian Reiter · 04-29-2016

Hi Volker,

I've attached one of the CLUE files. We've rolled the Redhat server back to give the customer a working system. The manufacturers will be in the office next week to do some more testing, at which point we'll try and work out which protocol caused the issue.

The concern from my point of view is that the new version of the system works at their office but not live.

cheers

Brian

Volker Halle · 04-29-2016

Hi Brian,

the code in [F11X]WITURN (in routine MARK_COMPLETE) walks a list of Files-11 related data structures in nonpaged pool. If it finds a packet that is NOT of the expected type (in this case a WCB, Window Control Block), it bugchecks:

BUG_CHECK (NOTWCBWCB, FATAL, 'Corrupted WCB list');
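
As a rough illustration of that kind of check, here is a minimal Python sketch. It is not the Files-11 XQP source; the `Packet` class, the type codes and the `BugCheck` exception are hypothetical stand-ins, and the real list layout in nonpaged pool is not reproduced.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical type codes; the real OpenVMS structure type values are not used here.
TYPE_WCB = "WCB"
TYPE_FCB = "FCB"


class BugCheck(Exception):
    """Stands in for a fatal system bugcheck."""


@dataclass
class Packet:
    type_code: str
    next: Optional["Packet"] = None


def walk_wcb_list(head):
    """Walk a linked list and fail on the first packet that is not a WCB."""
    count = 0
    packet = head
    while packet is not None:
        if packet.type_code != TYPE_WCB:
            # Something overwrote this packet or a link pointing to it.
            raise BugCheck("NOTWCBWCB, Corrupted WCB list")
        count += 1
        packet = packet.next
    return count
```

The point of the sketch is that the code raising the error is only the detector: whatever scribbled over the packet ran earlier and may be an unrelated component.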

So my initial analysis still holds: most likely pool corruption by some software component, and something in the TCPIP stack or the TCPIP NFS client is the most likely culprit.

Compare NFS versions (TCPIP SHOW VERS/ALL) between a working system and a failing one.
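
One way to do that comparison, sketched in Python under the assumption that the output of the command has been saved to a text file on each system and copied somewhere Python is available. The sample text below is placeholder data only, not real TCPIP SHOW VERSION/ALL output, and `diff_listings` is a hypothetical helper name.

```python
import difflib


def diff_listings(working, failing):
    """Return unified-diff lines between two saved version listings."""
    return list(difflib.unified_diff(
        working.splitlines(), failing.splitlines(),
        fromfile="working", tofile="failing", lineterm=""))


# Placeholder text only; these are NOT real TCPIP SHOW VERSION/ALL lines.
working = "component_a  version-2\ncomponent_b  version-1\n"
failing = "component_a  version-1\ncomponent_b  version-1\n"

for line in diff_listings(working, failing):
    print(line)
```

Lines prefixed with `-` appear only in the working listing and lines prefixed with `+` only in the failing one; an empty result means the two listings are identical.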

Volker.

Brian Reiter · 04-29-2016

Hi Volker,

I've just had a quick look at some of the other bugcheck reasons. I got:

28-APR-2016 11:54 V8.4 HP rx2800 i2 (1.60 MAINT1 CPUSPINWAIT 80193B20 SYSTEM_SYNCHRONIZATION_ 00010F20
28-APR-2016 12:08 V8.4 HP rx2800 i2 (1.60 MAINT1 FATALEXCPT MSS_20_SY_50199 80704462 LOCKING 00056C62
28-APR-2016 12:21 V8.4 HP rx2800 i2 (1.60 MAINT1 NOTFCBFCB SYS_MONITOR 80787430 F11BXQP 00043030
28-APR-2016 12:50 V8.4 HP rx2800 i2 (1.60 MAINT1 SSRVEXCEPT SIG_ACTPRO 801EFBC0 SYSTEM_SYNCHRONIZATION_ 0006CFC0

And

28-APR-2016 11:53 V8.4 HP rx2800 i2 (1.60 MAINT2 INVEXCEPTN NULL 80A33430 SECURITY 0002C430
28-APR-2016 12:08 V8.4 HP rx2800 i2 (1.60 MAINT2 NOTWCBWCB CIMDAEMON 807898A0 F11BXQP 000454A0

I think one of the nodes will have 20 or so bugchecks for Tuesday night. I suspect they're all related.

cheers

Brian

Steven Schweda · 04-29-2016

> [...] TCPIP/IP 5-7 ECO 3

If you do suspect an NFS-related problem, then you might start by
considering getting the TCPIP software up to date. I have an
ill-maintained hobbyist system with newer than that, and the
availability of newer than mine would not amaze me.

REX $ tcpip show version

HP TCP/IP Services for OpenVMS Industry Standard 64 Version V5.7 - ECO 4
on an HP rx2600 (1.50GHz/6.0MB) running OpenVMS V8.4

Brian Reiter · 05-03-2016

OK, been able to experiment a bit (without having a customer breathing down my neck):

It looks as though the core networking is fine: without the application starting up, I can manually mount the NFS share hosted on the Redhat host, I can ping the host, SSH to it and so on. So at that level it looks as though things are OK.

As soon as I start the application up, things go awry, and the bugchecks seem to be inconsistent:

3-MAY-2016 11:22 V8.4 HP rx2800 i2 (1.60 NWRCC1 NOTFCBFCB SIG_20_SY_2674 80787430 F11BXQP 00043030
3-MAY-2016 12:14 V8.4 HP rx2800 i2 (1.60 NWRCC1 UNXSIGNAL NWRCC1_HW_IA64 00000000 <not available> 00000000
3-MAY-2016 12:27 V8.4 HP rx2800 i2 (1.60 NWRCC1 INVEXCEPTN TCPIP$RE_BG2101 80118240 SYSTEM_PRIMITIVES_MIN 00108240
3-MAY-2016 12:38 V8.4 HP rx2800 i2 (1.60 NWRCC1 SSRVEXCEPT DNFS2011ACP 80704351 LOCKING 00056B51
3-MAY-2016 12:48 V8.4 HP rx2800 i2 (1.60 NWRCC1 INVEXCEPTN DNFS2012ACP 80102820 SYSTEM_PRIMITIVES_MIN 000F2820
3-MAY-2016 13:42 V8.4 HP rx2800 i2 (1.60 NWRCC1 UNXSIGNAL SIG_20_SY_46329 90AFB7CF <not available> 00000000

The exceptions may be network related but I'm more concerned by the seemingly random nature of the crashes.
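
To put a number on that spread, a minimal Python sketch that tallies the bugcheck codes from listing lines like the ones above. It assumes the column layout shown in this thread, with the bugcheck code as the ninth whitespace-separated field; other crash history layouts may differ, and `tally_bugchecks` is a hypothetical helper name.

```python
from collections import Counter

# The six history lines quoted above, copied as-is.
HISTORY = """\
3-MAY-2016 11:22 V8.4 HP rx2800 i2 (1.60 NWRCC1 NOTFCBFCB SIG_20_SY_2674 80787430 F11BXQP 00043030
3-MAY-2016 12:14 V8.4 HP rx2800 i2 (1.60 NWRCC1 UNXSIGNAL NWRCC1_HW_IA64 00000000 <not available> 00000000
3-MAY-2016 12:27 V8.4 HP rx2800 i2 (1.60 NWRCC1 INVEXCEPTN TCPIP$RE_BG2101 80118240 SYSTEM_PRIMITIVES_MIN 00108240
3-MAY-2016 12:38 V8.4 HP rx2800 i2 (1.60 NWRCC1 SSRVEXCEPT DNFS2011ACP 80704351 LOCKING 00056B51
3-MAY-2016 12:48 V8.4 HP rx2800 i2 (1.60 NWRCC1 INVEXCEPTN DNFS2012ACP 80102820 SYSTEM_PRIMITIVES_MIN 000F2820
3-MAY-2016 13:42 V8.4 HP rx2800 i2 (1.60 NWRCC1 UNXSIGNAL SIG_20_SY_46329 90AFB7CF <not available> 00000000
"""


def tally_bugchecks(text):
    """Count occurrences of each bugcheck code (assumed to be field 9)."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue  # skip blank or truncated lines
        counts[fields[8]] += 1
    return counts


for code, n in tally_bugchecks(HISTORY).most_common():
    print(code, n)
```

Several distinct codes across a handful of crashes, with no single code dominating, is the pattern being described here.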

Volker Halle · 05-03-2016

Hi Brian,

these are the TYPICAL symptoms of nonpaged pool corruptions: crashes all over the place ! You may even be able to reproduce these crashes WITHOUT starting the application by copying files from/to the NFS share on the Redhat server.

Get and install the most recent TCPIP ECO first !

Volker.

abrsvc · 05-03-2016

Brian,

I will echo Volker's recommendation. Spending time now trying to locate crash causes will be a waste of time. With pool corruption, the crashes will be random and not show any particular cause. Upgrade TCPIP at least to the most recent available patch you can. I would look too at any release notes available for more recent VMS releases to see if there are any pool related "updates". If these problems continue after the upgrade, a more drastic investigation effort may be necessary.

Dan

Brian Reiter · 05-04-2016

Hi Folks,

I'll try the patches today, although I still have to explain to the customer why, out of 4 two node RX2800 clusters and 1 RX2660 running the same version of the OS (including patches) and the same application software, two of the clustered systems fail with this error.

Are there any tools etc. I can use (now and in the future) to investigate these problems?

cheers

Brian
