Re: PEA0 unknown port error type

Stuart Kendrick · ‎03-01-2008

our four alphas are intermittently logging the following -- any suggestions on what might be causing this?

Unit SIDHU$PEA0
Product Name NI-SCA Port

Logged Message Type Code 3. Port Message
Error Type/SubType x400B Unknown Port

we're working a case with HP: we've swapped out the NICs, the Ethernet cables, the Ethernet switch ports, and we've hard-coded both the NICs and the ports to 100/half (that part didn't make sense to me, but hey); no change in behavior

i don't see a pattern to the timing of the events. sometimes, there is a flurry (5-10) spanning a few seconds; sometimes the event occurs once and then hours go by without another event. some days, the issue only occurs a couple times; others, several dozen times. ~6am is a popular, though not consistent, time for a flurry (we've been looking around for what else happens ~6am, no clues yet)

AlphaServer ES45 Model 2 running v7.3-2

has anyone else seen these?

--sk

stuart kendrick
fhcrc

Volker Halle · ‎03-01-2008

Stuart,

did you check the console for any messages at the times those errors are seen ? This may be some error generated from PEDRIVER, when it detects packet loss on a channel in the cluster.

Search Google Groups for: Error Type/SubType x400B

Also check LANCP SHOW DEV/INT Exn0 on your LAN interfaces for any errors.

And check the counters and errors in SCACP for LAN, BUS and/or CHANNELS.

Volker.

Volker Halle · ‎03-01-2008

Stuart,

this Error Type/Sub Type x400B has the following meaning:

PAER$K_ET_PKT = 0x40 - Packet error, general case
PAER$K_ES_CHLS = 0x0B - Channel Lossy Message

This indicates that a CHANNEL used for cluster communications (i.e. a channel between 2 LAN interfaces in a cluster) has a communication problem.
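As a rough illustration of the decoding above, the sketch below splits the logged value x400B into its two bytes. It assumes the high byte is the error type and the low byte is the subtype, which is what the two constants quoted above imply; the function name and the lookup tables are made up for this example and cover only the two values named in this thread.

```python
def split_error_code(code: int) -> tuple[int, int]:
    """Split a 16-bit Error Type/SubType value into (type, subtype) bytes.

    Assumption: high byte = error type, low byte = subtype.
    """
    error_type = (code >> 8) & 0xFF  # high byte
    subtype = code & 0xFF            # low byte
    return error_type, subtype


# Only the two values quoted in this thread; not a complete table.
ERROR_TYPES = {0x40: "PAER$K_ET_PKT (packet error, general case)"}
ERROR_SUBTYPES = {0x0B: "PAER$K_ES_CHLS (channel lossy message)"}

error_type, subtype = split_error_code(0x400B)
print(f"type=0x{error_type:02X} {ERROR_TYPES.get(error_type, 'unknown')}")
print(f"subtype=0x{subtype:02X} {ERROR_SUBTYPES.get(subtype, 'unknown')}")
```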

You might want to check with

$ SET TERM/WID=132
$ MC SCACP SHOW CHAN/COUNTER

Also check on the console and find out from the full errlog entry, which LAN interface and remote system is involved each time this type of error is logged. This could tell you, which network path is affected.

Volker.

Hoff · ‎03-01-2008

After the HP hardware support folks involved here decide they've had enough hardware swapping for an afternoon or two, do consider some LAN packet monitoring.

For errors that hit multiple boxes, I tend to look for a common trigger. That can certainly be hardware, but it can also be spikes in traffic, whether expected and transient and authorized traffic, or it can be malware or network attacks. Traffic tends to be more common.

The NIC counters will show you some of your network activity (and quite possibly some packet storms or some network glitches), but a network monitor will get you a much better picture of what's happening on your LAN.

If you don't have a network monitor handy (and you'll want to work on that, and there are some reasonable open-source monitors such as wireshark, and there are also commercial options), look to reduce what's sharing the LAN with the AlphaServer ES45 boxes; to segment what you have with a switch (layer 2 or 3 managed, preferably, as higher-layer switches can allow you to set up vlans). Or somewhat less desirably, segment your LAN with an IP router. Use the switch to segment the network, and move or vlan it to try to isolate and identify the trigger.

Do also ensure your existing switch (if you're using one) has current firmware, and has enough bandwidth to deal with your LAN load. And make sure your network folks aren't messing with your switch(es) or your existing vlans.

That this is timing based -- 6 am is popular -- does tend to point toward the LAN and LAN traffic, or toward LAN reconfiguration. (I have encountered a LAN that traffic- and server-spiked at 8am weekdays, due to a massive number of people all logging in circa 7:58 to 8:00, as per their contract. FWIW.)

I'm surprised this isn't near the top of the HP network troubleshooting playbook, too. Drop a LAN monitor box onto the customer LAN, and watch...

And for completeness, do ensure that you are current on your patches for OpenVMS Alpha V7.3-2, as there can be subtle errors here. I'd look for any involving NIC and clustering, and all of the mandatory patches.

Stephen Hoffman
HoffmanLabs LLC

DECxchange · ‎03-01-2008

Hello,
Are these errors showing up in the Opcom messages? Maybe you already know everything I'm about to type here (if so, I apologize), but in any case ... You can see these not only at the console, but at any terminal in which a user is logged in and who has the OPER privilege. You can see these messages by doing a $ REPLY/ENA.

If you do a reply/log, you can start a new version of sys$manager:operator.log. You can then use the $ search command to find all occurrences of particular messages. You can then see if these messages are occurring at about the same time on each node or not. Are all nodes logging this message at the same time or at different times?
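Once the message times have been pulled out of each node's log, checking whether two nodes log the error "at about the same time" could be sketched along these lines. Everything here is illustrative: the function name, the 10-second window and the sample timestamps are invented for the example and do not come from any real log.

```python
from datetime import datetime, timedelta


def events_near(events_a, events_b, window=timedelta(seconds=10)):
    """Return (a, b) pairs whose timestamps differ by at most `window`."""
    pairs = []
    for a in events_a:
        for b in events_b:
            if abs(a - b) <= window:
                pairs.append((a, b))
    return pairs


# Made-up timestamps for two hypothetical nodes, for illustration only.
node_a = [datetime(2008, 3, 1, 6, 0, 5), datetime(2008, 3, 1, 14, 22, 40)]
node_b = [datetime(2008, 3, 1, 6, 0, 9), datetime(2008, 3, 1, 19, 3, 0)]

pairs = events_near(node_a, node_b)
for a, b in pairs:
    print(f"node A at {a:%H:%M:%S} matches node B at {b:%H:%M:%S}")
```

The same comparison works for lining up error times against any other list of timestamps, such as collision spikes seen on a sniffer.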

Or are you seeing these messages in the ACCOUNTING log file? You can check this file by doing a $ acc. You probably don't want to do just that, as it will give you everything since the accounting log was started, who knows when.

If you want just today's accounting, you can do a $ acc/since[=today]. You need to be patient with acc, as it takes a while to search through its file. If you want to narrow it down to a shorter, more recent period of time, you can say, acc/sin=09:00/before=10:00 and so on. Substitute the time range you want.

You can then choose a particular accounting record and look at it in more detail by doing a $ acc/id=
This could point to the process that is having a problem.

But I really think since all nodes apparently see the same error, it is not the Alphas. It might be a networking problem. Perhaps a network wiring issue, a hub issue, a network switch issue. Or such.

Do you experience any degradation in communication performance on your network?

Richard Stockdale · ‎03-02-2008

>>we've hard-coded both the NICs and the ports to 100/half (that part didn't make sense to me, but hey)

Doesn't make any sense to me either.

The only reason to set half-duplex is to avoid a duplex mode mismatch because the switch is set to do auto-negotiation. Why not set to 100/full/auto-negotiate? I'd look at these underlying network settings and counters before worrying about PEDRIVER.

- Dick

Stuart Kendrick · ‎03-02-2008

hi folks,

i appreciate the responses; i'll post again after i've gathered more information from logs and double-checked patch levels, per your suggestions

more information, particularly wrt larger network issues. [all four alphas are plugged into the same Catalyst 4503 w/Sup V switch, running IOS 12.2(40)SG, an image released in Q42007, same image we deploy on ~100 other Catalyst 4500s, company-wide]

-i have a hardware sniffer (Finisar THG) inserted in-line with 'SIDHU', the node on which i'm focusing my trouble-shooting efforts. using this device, i've graphed collision rates across time. there are periods when collision rates jump from the usual 1-10/second to several thousand per second ... but to date, these times do not correlate with the times when the alphas are recording their PEA0 errors. [the two 'times' are occasionally separated by minutes but more generally separated by hours]

-i also graph interface utilization on the ethernet switch across time (standard MRTG five minute polling interval) -- traffic on the alpha ports tends to peak ~ 7Mb/s (i.e. ~7% of the available throughput on these 100Mb/s links), with occasional spikes to ~10Mb/s ... to my way of thinking, these Ethernet links aren't even breaking a sweat

-i'm capturing traces this weekend around the 6am window ... assuming that at least one of these windows coincides with a SubType x400B event, i will see if i can correlate what is happening on the wire with the error message event

-i graph the collision rates on the Ethernet switch ports (MRTG again) ... peak on this port is ~30/second (though, as i say, using the hardware sniffer, i occasionally see spikes to several thousand per second, an event which is invisible to MRTG's five minute polling rate) ... but i'm not particularly concerned with ~30 collisions per second ... collisions are normal, for half-duplex Ethernet. what does intrigue me is that the hardware sniffer sees a fragment (7-12 bytes long) for *every* collision. i don't see this behavior on other half-duplex ports (i.e. from the Ethernet switch point of view, their collision counters increment, but their fragment counters stay steady at 0). not sure what this implies

-i'm confident that the network folks aren't messing with the switch ... because we *are* the network folks (we own and configure our company's ethernet switches & routers, including this one). we deploy a single VLAN across the entire switch -- all ports belong to it. so no complexities there
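To put rough numbers on why a short collision spike can be invisible at a five minute polling interval, here is a minimal sketch. The baseline and burst rates are picked from the ranges mentioned above; the 5-second burst length and the function name are assumptions made only for this example.

```python
def average_rate(baseline_per_s, burst_per_s, burst_seconds, interval_seconds=300):
    """Average event rate over one polling interval containing a single burst."""
    total = (burst_per_s * burst_seconds
             + baseline_per_s * (interval_seconds - burst_seconds))
    return total / interval_seconds


# Assumed figures: 5/s baseline, one 5-second burst at 3000/s, 300 s interval.
avg = average_rate(baseline_per_s=5, burst_per_s=3000, burst_seconds=5)
print(f"{avg:.1f} collisions/s averaged over the interval")
```

Under those assumptions, the spike to several thousand per second shows up on the five minute graph as only a few tens per second.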

at any rate, i appreciate the input, as i've been running out of ideas on my end -- i'm re-energized now. i'll post again later this week, after i've gathered more information

--sk

Richard Brodie_1 · ‎03-03-2008

"what does intrigue me is that the hardware sniffer sees a fragment (7-12 bytes long) for *every* collision. i don't see this behavior on other half-duplex ports (i.e. from the Ethernet switch point of view, their collision counters increment, but their fragment counters stay steady at 0)."

I'd expect a decent hardware sniffer to see a collision fragment for every collision. It's likely that a switch port wouldn't double count it as a fragment, though. If only the ES45 switch ports show fragments, that would be odd.
