Operating System - OpenVMS

DECserver: NAS release notes? Known software exceptions? Version derived from server ROM or *ENG*.*?

 
Mark_Corcoran
Frequent Advisor


I've posted here, because I thought that if anyone might know/have information, it would be the OVMS folks - IMHO there's no suitable forum under networking products to post under.

 

We have a large number of DECservers on site, and some recent issues started me thinking...

 

Of the 26 DECservers in use, all but one of them are DECserver 700-16s (the other is a DS90M).

 

Of the 25 DECserver 700-16s, there are:

3 versions of (NAS) software - V1.0, V1.1A, and V2.3A

3 versions of BL* - BL34-10 (on NAS V1.0), BL45-12 (on NAS V1.1A) and BL47-60 (on NAS V2.3A)

*From a copy of a NAS Problem Solving Guide, the BL seems to be an acronym for Base Level.

5 versions of DECserver ROM - V3.4-9, V4.0-0, V6.3-0, and V7.1-0 (with NAS V1.1A), and V7.2-0 (with NAS V1.1A and V2.3A)

 

On the DECserver 90M, NAS V2.2 BL29-52 ROM V4.1

 

It's not clear whether or not the BLnn-nn value is derived from the downline-loaded software, or the ROM inside the DECserver itself, but I suspect it is the latter - different DECservers all loading WWENG1.SYS have different BLnn-nn values.

Even for DECservers with the same ROM version (reported as V7.2-0), we have BL47-60 and BL45-12, so I don't know if they are essentially sub-variants of the V7.2-0 ROM version.

 

Q1: Does anyone know the significance of the BLnn-nn value, and its relationship to either the NAS software or the ROM (can you only expect to see certain combinations)?

 

We only have one DECserver that is running at NAS V2.3A and which loads WWENG2.SYS (others are running at V1.0, V1.1A, V2.2), and it maintains per-port "Input Characters" and "Output Characters" counters (essentially, DECserver-side equivalent to the Bytes Transmitted and Bytes Received counters in MC LATCP SHOW PORT LTAnnnn: /COUNT).

 

Q2: Is the NAS version number reported by SHOW SERVER STATUS (and the fact that only the DECserver with NAS V2.3A reports Input/Output Character counts for SHOW PORT n COUNT) an artefact of the ROM inside the DECserver, or the WWENG1.SYS / WWENG2.SYS that we downline load (it doesn't readily appear as a text string in the file, but it may be derived from "binary" byte or word values)?

 

 

I ask because changing one of the spare DECservers to load WWENG2.SYS (as used by the DECserver running NAS V2.3A) rather than WWENG1.SYS alters neither the reported NAS S/W version, nor the ability to count per port Input/Output Characters.

 

[WWENG1.SYS and WWENG2.SYS are different sizes, and (not surprisingly) generate different values for the CHECKSUM$CHECKSUM symbol when running them through CHECKSUM.

 

Curiously, when DUMPing the files to search for any tid-bits of text, they both contain the string Whitewater V1.4 DRAM]

 

 

We have one DECserver which has rebooted 5 times in the last 3 months, and following the reboot, a SHOW SERVER STATUS has reported PC=07834010, SP=078BC354, SR=2004, ME=00000000, CO=004

 

This indicates it has encountered a software exception; the PC (Program Counter), SP (Stack Pointer), SR (Status Register) and ME (MEmory address (potentially) attempted to be accessed by the code at the PC address) are effectively meaningless without access to the source code.

 

A copy of the DEC Communications Options Minireference Manual Volume 5 Ethernet Devices (Part 1) from August 1988 (part number EK-CMIV5-RM-005) suggests that the error code (CO=004) is Motorola 68000 Illegal Instruction (given DEC's penchant for making things backwardly compatible, I am surmising that this code value from the DECserver 200 is /possibly/ also the same in the DECserver 700-16).

 

None of this really helps me determine the root cause, and there appears to be a dearth of Release Notes for older NAS versions (though given the issues I have with the NAS version number and feature functionality not appearing to change between WWENG1.SYS and WWENG2.SYS, I wonder whether it's really a software exception at all, rather than a hardware issue with the DECserver).

 

Q3: Does anyone have any release notes for DNAS V1.0, V1.1A, V2.2 or V2.3A?

[The only versions that I have been able to find online are for V2.6 and V3.6, and neither mentions software exceptions being a known or fixed issue (so whatever is causing the exception that we are having, I presume has possibly been fixed prior to V2.6).

I am disinclined to change the downline-loaded image from WWENG1.SYS to WWENG2.SYS for the DECserver 700-16 that has rebooted 5 times, because at first glance, it doesn't /appear/ to offer any differences (NAS and BL version numbers do not change, there is no new ability to get per-port bytes received/transmitted counts, and neither SHOW MEMORY CONFIGURATION nor SHOW MEMORY STATUS becomes available) - I also don't have any real information on the difference between WWENG1.SYS and WWENG2.SYS, so don't know whether or not I might be jumping out of the frying pan into the fire]

 

Regards,

Mark

 

[Formerly appearing as woeisme]
Bill Hall
Honored Contributor

Re: DECserver: NAS release notes? Known software exceptions? Version derived from server ROM or *ENG

I can't answer all of your questions, but I can answer the most important ones.  AFAIK, the BL numbers are the build numbers for that version of DNAS.  I have a DS90M running "Network Access SW V2.4 BL50 for DS90M" and a DS700-16 running "Network Access SW V3.6 BL01 for DS700-16".  The DECserver's firmware version is not as important as the amount of RAM (and the size of the flash memory card, if one is installed) for determining what version of DNAS it can run and what the load file name is.  Use the show memory command on the DECserver to determine how much RAM it has and whether flash is installed.  You can downline load WWENG1 to any DECserver 700, but you have to have at least 2MB of memory in the 700 to download WWENG2.  See the table that I have put together below; hopefully it will be readable.

Check your output from show server status more closely.  It sounds to me like you have more than one load host and they have different versions of WWENG1.SYS and WWENG2.SYS. Are you sure that the DECservers are not loading from flash and have different versions of DNAS in their flash memory?

Firmware filenames:

Model            Minimum Memory   Firmware Filename to be Downloaded
---------------  --------------   ----------------------------------
DECserver 90L+   none             PROM-based; no firmware downloaded
DECserver 90TL   1 MB             MNENG1.SYS
DECserver 90TL   1 MB             MNENG2.SYS (Some descriptions say 4 MB is required)
DECserver 90M    1 MB             MNENG2.SYS (Some descriptions say 4 MB is required)
DECserver 90M    2 MB             MNENG3.SYS
DECserver 90M+   4 MB             MNENG4.SYS
DECserver 100                     PS0801ENG.SYS
DECserver 200                     PR0801ENG.SYS
DECserver 250                     DP0601ENG.SYS
DECserver 300                     SH1601ENG.SYS
DECserver 500                     DS5node-name.SYS
DECserver 700    1 MB             WWENG1.SYS
DECserver 700    >2 MB            WWENG2.SYS
DECserver 900TM                   WWENG2.SYS

Bill

Bill Hall
Bill Hall
Honored Contributor

Re: DECserver: NAS release notes? Known software exceptions? Version derived from server ROM or *ENG

Forgot to mention that you can upgrade all of your DECservers to the latest version of DNAS by purchasing an upgrade kit from Vnetek Communications LLC.  The cost for the version 3.6 kit was less than $400 US years ago.  We were also sent the latest firmware files for the memory-challenged DS90s (version 2.4 maybe) that didn't have enough RAM to run version 3.6.

 

 

Bill Hall
Mark_Corcoran
Frequent Advisor

Re: DECserver: NAS release notes? Known software exceptions? Version derived from server ROM or *ENG

Thanks for your replies, Bill.

> Use the show memory command on the DECserver to determine how much RAM and if flash is installed in a DECserver
That seems to be a function of the version of NAS that is installed - on almost all of the DECservers, the MEMORY
keyword is not a recognised parameter for the SHOW verb (only the server with NAS V2.3A accepts it - and it's not
a SET PRIVILEGED issue either).

I think the only way to determine the memory would be to configure (where necessary) a console port on the DECserver,
connect a VT terminal to it, and force a reboot during non-production hours with a CRASH or power-cycle.

 

>Check your output from show server status more closely. It sounds to me like you have more than one load host and they have different versions of WWENG1.SYS and WWENG2.SYS.

There are three load hosts, all of which appear to have the same WWENG1.SYS and WWENG2.SYS (at least, a CHECKSUM on the files followed by a SHOW SYMBOL CHECKSUM$CHECKSUM, returns the same values).
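The comparison across load hosts can be sketched off-host as well. This is only an illustration: it uses SHA-256 from Python's standard library, which is not the algorithm behind the OpenVMS CHECKSUM utility, and the function names `file_digest` and `group_by_digest` are hypothetical. It assumes copies of the load images have been pulled from each host into local files.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_digest(paths):
    """Map each digest to the list of files that share it.

    One group means every copy is byte-identical; more than one
    group means at least one load host holds a different image.
    """
    groups = {}
    for path in paths:
        groups.setdefault(file_digest(path), []).append(str(path))
    return groups
```

Calling `group_by_digest` on the copies of WWENG1.SYS taken from each of the three hosts and getting a single group back would confirm what the matching CHECKSUM$CHECKSUM values already suggest.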

Having looked more closely and added the information to the spreadsheet I created for quick comparison:

NAS V1.1A comes from a downline-load of WWENG1.SYS
NAS V1.0 comes from a downline-load of WWENG2.SYS
NAS V2.2 comes from a Flash RAM boot of MNENG2.SYS (this is a DS90M)
NAS V2.3A comes from a Flash RAM boot of WWENG2.SYS

It's clear therefore that the WWENG2.SYS on the load hosts doesn't appear to be what it says on the tin - a file name with an implied more recent version is actually older (which is hardly surprising, given that the WWENG2.SYS (1101 blocks) on disk is 55 blocks shorter than WWENG1.SYS (1156 blocks), or that WWENG2.SYS has a creation date 6 months earlier than WWENG1.SYS).

 

>Forgot to mention that you can upgrade all of your DECservers to the latest version of DNAS by purchasing an upgrade kit from Vnetek Communications LLC
I guessed that that would be the case, but as is the case in most shops, reluctance to spend pennies on something that is due to be replaced "soon" is the order of the day.


The ultimate reason I posted this is because of repeated reboots of one server ("coincidentally" almost every 48 hours).

Following the reboot, the server status as previously indicated, was shown, indicating that a software exception had occurred.


On further examination, after reboot, SHOW SERVER COUNTERS has regularly reported an "Illegal Messages Rcv'd" count that has normally been 2 (except on the last reboot, when it has remained at zero).

The manual indicates that this is a LAT counter, and the fact that it is not the "Illegal Multicasts Rcv'd" counter that has increased implies that the offending LAT frame was not multicast, so by default it must have been a LAT frame targeted at the MAC address of the DS700.

The fact that it is the "Illegal Messages Rcv'd" counter in SHOW SERVER COUNTERS that is non-zero, rather than one of the nodes in SHOW NODE COUNTERS, indicates that either the host name could not be determined, or the sanity checking of the frame rejected it before it could determine the host to which the erroneous count could be ascribed.


My guess is that when the exception occurs, it is because a (very) badly formatted LAT frame is received, and the code is perhaps not as robust as one would ideally like.

e.g. an RLE length indicator might indicate a length of say 50 bytes, but only 40 bytes are received, so attempts to access byte 41 would extend beyond the bounds of an array/buffer, and cause an exception (I don't think that's what is happening here - the CO=004 indicates a Motorola 68000 Illegal Instruction (not sure if this means the opcode is illegal, or any parameters it requires are illegal), which suggests either the code having been instructed to jump to a memory address that contains data rather than code, or the memory containing the code has been overwritten (it is perhaps not as well defended against such an operation as in OpenVMS)).
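The length-indicator failure mode described above can be illustrated with a minimal sketch. The one-byte length prefix and the function name `read_field` are assumptions made purely for illustration; this is not the real LAT frame layout, nor the NAS code. It only shows the kind of check that would reject a frame whose declared length exceeds the bytes actually received.

```python
from typing import Tuple


def read_field(frame: bytes, offset: int) -> Tuple[bytes, int]:
    """Read one length-prefixed field and return (payload, next_offset).

    Raises ValueError instead of reading past the end of the frame.
    """
    if offset >= len(frame):
        raise ValueError("no length byte at offset %d" % offset)
    declared = frame[offset]          # hypothetical one-byte length prefix
    start = offset + 1
    end = start + declared
    if end > len(frame):
        remaining = len(frame) - start
        raise ValueError(
            "field declares %d bytes but only %d remain" % (declared, remaining)
        )
    return frame[start:end], end


# A well-formed field, then one that claims 50 bytes but carries only 40.
well_formed = bytes([3]) + b"abc"
truncated = bytes([50]) + b"x" * 40

print(read_field(well_formed, 0))
try:
    read_field(truncated, 0)
except ValueError as exc:
    print("rejected:", exc)
```

Code that skips the `end > len(frame)` check would read whatever happens to follow the buffer, which is the scenario speculated about above.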

I've had a look at forcing a server dump, but there's no documentation on the format of the dump file (I have identified that the PC and SP values appear near the first ~32 bytes, but their position largely depends on the CO= error code value), so I don't think it will really help (there is a MOM$SYSTEM:DS7_010_CRASH_DISPLAY.COM, but either someone locally here renamed an extant .COM file to this, or DEC/Compaq/HP re-packaged a DS300 .COM file, because that's what the comments inside it refer to, and it doesn't work with a dump file on the test system where I purposefully CRASHed a test DS700).

Updating to a later version of software for the DECservers might "help", but only if I knew what it was that the more recent (and interim) version(s) fixed.

I use the term "help" loosely - it would merely mask the problem of (I believe) something sending badly formatted LAT frames that causes the DECserver to generate an exception;  how long before an even worse formatted LAT frame defeats even the most recent version of NAS software.

[I know of a mobile network operator who originally had everything on the same VLAN until one day someone plugged something into the data network;  it sent out rubbish that caused the VAX systems to crash repeatedly until the offending item was unplugged;  needless to say, they subsequently segmented their data network;  I'll spare their blushes by not mentioning their name here :-D

I'd hate to get the bosses to cough up for the latest version of NAS (and potentially, memory upgrades), only for the sender of the rubbish to have a hissy fit, send even worse frames onto the network in multicast, and cause all the DECservers to crash - I'd really rather get to the bottom of who is sending rubbish, and why, and get that fixed]

I'm badgering to get a switch port on the same VLAN to be set up to allow a PC with Wireshark to be connected and capture the traffic before an expected reboot, to identify what messages the DECserver is the target of, and depending on how badly formatted the messages (and any preceding ones) are, perhaps identify at least the MAC address of the sender, and possibly from the payload of the message (and preceding ones) determine what was being attempted (and maybe what process on the sender system (assuming it is a server rather than some rogue bit of hardware) was responsible).
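As a rough sketch of the first pass over such a capture, the Ethernet header alone is enough to tell which frames are LAT, who sent them, and whether they were multicast or addressed directly to the DECserver. The EtherType 0x6004 is the value assigned to DEC LAT; the function name `summarise` and the MAC addresses in the example are made up for illustration, and the sketch assumes raw untagged Ethernet II frames (no 802.1Q tag).

```python
import struct

LAT_ETHERTYPE = 0x6004  # DEC LAT


def format_mac(raw: bytes) -> str:
    """Render six raw bytes as a DEC-style hyphenated MAC address."""
    return "-".join("%02X" % octet for octet in raw)


def summarise(frame: bytes, server_mac: bytes):
    """Return basic facts about one Ethernet II frame, or None if too short."""
    if len(frame) < 14:
        return None
    destination, source = frame[0:6], frame[6:12]
    (ethertype,) = struct.unpack("!H", frame[12:14])
    return {
        "source": format_mac(source),
        "is_lat": ethertype == LAT_ETHERTYPE,
        # The group (multicast) bit is the low-order bit of the first octet.
        "multicast": bool(destination[0] & 0x01),
        "to_server": destination == server_mac,
    }


# Hypothetical addresses, for illustration only.
server = bytes.fromhex("08002B112233")
sender = bytes.fromhex("08002B445566")
directed = server + sender + b"\x60\x04" + b"\x00" * 46
print(summarise(directed, server))
```

Frames for which `is_lat` and `to_server` are both true in the seconds before a reboot would be the ones worth decoding by hand.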

I'm convinced it's not a hardware issue (on the basis of the "Illegal Msgs Rcv'd" counter, the fact that it reboots every ~48 hours, and only since production resumed after the Christmas break) - if it had been a thermal one, a warm reboot from a software exception wouldn't allow things to cool down, and I can't see framing errors on the serial port (not that we have any) causing the issue.

When I hopefully get an update on the ultimate cause, I'll update here.

I did find another community posting along similar lines overnight (subject is "DECServers Rebooting"), from 2005 by Andy Bustamente, saying in his reply that "There was an Alpha VMS 7.1x bug in LAT", but I can't find any ECO/release notes relating to this (there are two AXP systems here, but they are at V7.3-2, so I would presume that any ECO for that bug would be included).

 

Regards,

Mark

[Formerly appearing as woeisme]
Mark_Corcoran
Frequent Advisor

Re: DECserver: NAS release notes? Known software exceptions? Version derived from server ROM or *ENG

We managed to get a Wireshark capture of the traffic on the VLAN that the DS700 is connected to, when it rebooted this morning.

 

My colleague in the network team who had set up spanning of the port that the DS700 was connected to, to another port on the switch (where the laptop with Wireshark was connected) observed a flurry of Remote Console commands prior to the reboot  (he was at the cabinet where the network switch and DS700 are co-located, and he observed the LED on the DS700 go through its POST status "letters").

 

I was surprised by this as I had already disconnected from the DS700, so wasn't issuing any commands.

 

Analysis of the payload of the LAT frames indicates that they were SHOW SERVER STATUS commands, which were being issued by a colleague in our Operations team in the run-up to the "witching hour".

 

Wireshark reported a large number of malformatted frames, including standard LAT service announcements from the OpenVMS systems on the same VLAN.

 

I've not manually decoded these frames, but at first glance, nothing appeared to be inherently wrong with them, so I suspect it is bad/missing logic in the LAT dissector in Wireshark (I had previously used Wireshark on a different DECserver issue, where the version of Wireshark didn't have a LAT dissector, so I had to manually decode those LAT frames;  I plan on manually decoding some of these "malformatted" ones tonight, to see if there is actually something wrong with them).

 

The capture showed two things:

 

1) The DS700 had stopped responding to LAT messages, and it could be clearly seen that LAT on the OpenVMS nodes was retransmitting the messages because they hadn't received a response from the DS700 within the timeout period (of ~200ms).

 

2) The last SHOW SERVER STATUS command that the DS700 responded to, showed an uptime of 1 23:59:45, with current memory usage at 7%, and similar low values for CPU, local services, reachable nodes, active circuits, connected nodes & sessions;  the Wireshark capture records the number of seconds since the capture started at which each frame is received (it probably gives a DD-MMM-YYYY HH:MM:SS.hh time stamp, but I was purely looking at the elapsed time since the capture started).

 

Subtracting the elapsed time of first part of the SHOW SERVER STATUS response (it doesn’t fit in the payload of a single LAT frame) from the elapsed time of last response from the DS700 before it stopped working, gave a value of 14.75-something-or-other seconds.

 

Taking into account the fact that the uptime reported by SHOW SERVER STATUS is only granular to HH:MM:SS, adding the 14.75-something-or-other to 1 23:59:45 gives (as near as makes no difference) an uptime of 48 hours.
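The arithmetic above can be checked with a short sketch. The helper name `uptime_to_seconds` is hypothetical, and 14.75 is the rounded figure quoted above rather than the exact capture value.

```python
def uptime_to_seconds(text: str) -> int:
    """Convert 'DDD HH:MM:SS' (or just 'HH:MM:SS') to whole seconds."""
    parts = text.split()
    days = int(parts[0]) if len(parts) == 2 else 0
    hours, minutes, seconds = (int(p) for p in parts[-1].split(":"))
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


last_reported = uptime_to_seconds("1 23:59:45")   # last SHOW SERVER STATUS reply
elapsed_after = 14.75                             # seconds until the last frame seen
estimated_uptime = last_reported + elapsed_after
shortfall = 48 * 3600 - estimated_uptime

print(last_reported, estimated_uptime, shortfall)
```

Because the reported uptime is truncated to whole seconds, a shortfall of less than one second is consistent with the server stopping at the 48-hour mark.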

 

48 hours seems a rather odd amount of time for (e.g.) a counter to have overflowed…

It's not clear how (internally) the DS700 records the uptime – whether it records it in an OpenVMS quadword (unlikely, especially as the chipset inside is Motorola 68000), or a count of the total number of seconds,  or separate counters for each of the number of days, hours, minutes and seconds (given that the reported uptime is only granular to HH:MM:SS, it seems unlikely that it counts either milliseconds or centiseconds).

 

Some of our DECservers have been up for a significant period of time – looking at the last snapshot I took of the output of SHOW SERVER STATUS across the DECserver estate, indicates the highest one being 593 14:44:47.

 

That is 51288287 seconds, which (if the uptime is stored as the number of seconds then converted into DDD HH:MM:SS for display) would need to be stored in 4 bytes (which if it was stored this way, would cater for >136 years of uptime).

 

A day's worth of seconds is 86400, requiring 3 bytes of storage (if you considered the count of uptime days was stored separately, so you'd only ever need to store a seconds value up to 86399).

 

If that was the case, a rollover of uptime into 48 hours would still only require 3 bytes, so there's no prospect for counter overflow, and in any case, our DS700 with the longest uptime far exceeds 48 hours.
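The storage sizes discussed above can be verified with a few lines. This is only a check of the arithmetic; how the DS700 actually stores its uptime is unknown, and the helper name `bytes_needed` is made up for the sketch.

```python
def bytes_needed(value: int) -> int:
    """Smallest number of whole bytes that can hold an unsigned value."""
    return max(1, (value.bit_length() + 7) // 8)


SECONDS_PER_DAY = 86400

# Longest uptime seen on site: 593 14:44:47
longest = 593 * SECONDS_PER_DAY + 14 * 3600 + 44 * 60 + 47

# How long an unsigned 32-bit seconds counter lasts, in years.
years_in_32_bits = 2 ** 32 / (365.25 * SECONDS_PER_DAY)

print(longest)                               # 51288287
print(bytes_needed(longest))                 # 4
print(bytes_needed(SECONDS_PER_DAY - 1))     # 3
print(bytes_needed(2 * SECONDS_PER_DAY))     # 3
print(round(years_in_32_bits, 1))            # 136.1
```

None of these widths has a boundary at 172800 seconds, which supports the conclusion that a simple seconds-counter overflow does not explain a failure at 48 hours.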

 

It leads me to believe that if the problem is "simply" caused by the uptime extending into a DDD value of 2, then there must be something very odd going on either with the timing circuitry in the DS700 (but which doesn't impact normal traffic or retransmit timers etc.), or how/where the DS700 stores the DDD uptime value.

 

My network colleague posited the notion of "well, maybe a power cycle of the DS700 would cause the problem to 'go away'".

 

I'd like to think that this wouldn't be the case, and that a warm reboot would have the same initialisation effect as a cold boot.

 

However, I don't know what the DS700 does internally, particularly in terms of where it is storing the uptime – maybe the storage location for the DDD value is (bizarrely) actually only updated if the DDD value is 2 or greater (i.e. it is stored as the count of number of seconds, and when DDD is 2 or greater, the seconds counter stores (uptime in seconds MINUS (2 * 86400))).

 

If that is the case, then if the code has a pointer to the storage location and that pointer is only initialised on a cold boot, then it could explain why a warm reboot doesn't fix the problem if the pointer has an invalid value.

 

We plan to swap out the DS700, and then conduct some tests on it in the office environment on a test network…

 

Obviously, relocating it to the office environment would mean loss of power, so the "cold boot fixes it" theory might come in to play…

 

Reconnect it to the test environment, power it up, then configure a console port and connect a VT terminal to it, and see if after 48 hours it still reboots (there is still the possibility that malformatted LAT frames could be causing it, though the traffic on the test network of 3 machines would be largely limited to LAT service announcements).

 

If it still reboots, then power-cycle it, allowing it to downline-load, then disconnect the network cable, and see if it still reboots (if so, this removes the possibility that malformatted LAT frames are causing it, and more or less confirms it as a hardware issue).

 

Having a VT connected to the console port might elicit further diagnostic information prior to the software exception that might point to the root cause.

 

I realise that a lot of folks might consider I'm over-analysing this, but I'm like a dog with a bone – I really don't like faults that "go away" without ever establishing the root cause.

 

This is only the second time in 28 years that I've had a spurious DECserver fault – and both times, at the same site (the environment is a little harsher than a typical office or data centre though).

 

The previous fault on another DS700 was caused by bearings in the internal fan – the fan was occasionally seizing up, leading to the main system board overheating, and once you start getting thermal events, all bets are off.

 

Likewise, if you have errors in timing circuitry, all bets can similarly be off - I've encountered an application issue recently with a deadly embrace caused by two separate processes, unaware of each other's actions, both targeting the same record in the database (where one has 25 lots of record lock & unlock, and the other has 18 lots);  ordinarily, the second process wouldn't have attempted to access the database record whilst the first process was still in the throes of doing so, but a number of different factors caused "the stars to align" and create a perfect set of conditions for it to happen (the code uses three different types of locks, and between the two processes, causes interplay;  I've worked out a fix for this, and just need to find time to make code changes).

[Formerly appearing as woeisme]
Mark_Corcoran
Frequent Advisor

Re: DECserver: NAS release notes? Known software exceptions? Version derived from server ROM or *ENG

I thought I would belatedly follow up my last post, prior to posting a new query.

The DECserver was swapped out from production, and tests were conducted in the office on a test LAN, and initially, the DS700 crashed every 24 hours (not 48) with the same software exception.

It was permitted to boot as usual from a downline-load host, then the network cable was removed (to avoid the possibility of dodgy packets and the NAS software perhaps not being as robust as one would ideally like, causing it to crash) - still it crashed.

The image that was downline loaded was the same as that used by most of the other DECservers, and though they hadn't recently needed to downline-load, it seemed unlikely that the image would be corrupt (as I'd expect other DS700s to similarly crash).

I compared the image on one host with another (just using CHECKSUM and examining the CHECKSUM$CHECKSUM global symbol afterwards) - both the same.

On the basis that the image has to be loaded in to memory, I considered the possibility that there was a spurious bit rot-type fault with the SIMM in the DS700, so it was replaced;  still, it crashed.

We then decided it would need to be sent to the company that we have a maintenance agreement with. Although I fully expect it to be returned as "beyond economic repair", I knew that the chances of them having a load host that could service its downline-load request (or which they'd be prepared to put on the same network as the DS700) were zero.

So, it needed to be able to boot from a PCMCIA Flash RAM card;  it didn't have one, but one of the other previously-returned b.e.r. DS700s did, so that was plugged in, but initially the DS700 wouldn't boot - the SERVER SOFTWARE was set to WWENG1, whereas what was on the card was WWENG2.

Once the config on the DS700 was changed to expect WWENG2, it booted from Flash RAM (but only after a network cable was plugged in, otherwise the POST hangs).

This gave us a new version of NAS (v1.5), but still it crashed.

 

It seems therefore that it is a hardware fault, which is probably going to remain undiagnosed.  If our hardware maintenance company still have any old-timers that are permitted to do component-level diagnostics and manage to determine the fault, I'll post a further update.

 

Mark

[Formerly appearing as woeisme]