Up to 50% ping loss when ARP cache grows over ~100 items (HP-UX 11.31, RX2620-2)

SemihBATTAL · ‎07-05-2017

Hi All,
Please help me solve this strange problem.
In order to make this post readable, I'll leave out all the troubleshooting steps I have taken so far and only point out the temporary fix I've found, which should give a clue as to what may be wrong.

This Itanium server (HP-UX 11.31, fully patched, working happily since 2004) has recently started failing to respond to some of the pings from other servers when the ARP cache grows over ~100 items. Within a few minutes the server becomes completely unusable, because even an incoming ssh session can't be maintained due to lost packets.
As a workaround, I run a shell script which continuously monitors the number of lines in the output of "arp -a"; if there are more than 100 items, it executes the command "/usr/sbin/ifconfig lan0 192.168.8.8", which, among other things, clears the ARP cache. This action instantly fixes the "lost pings" problem.
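A minimal sketch of the kind of watchdog described above, for readers who want to reproduce the workaround. This is not the original script: the threshold variable, the polling interval, and the function names are assumptions made for illustration; only the "arp -a" and "ifconfig" commands come from the post.

```shell
#!/bin/sh
# Sketch of an ARP-cache watchdog (not the original script).
# THRESHOLD, INTERVAL and the function names are illustrative assumptions;
# the interface, address and commands are the ones quoted in the post.

THRESHOLD=100     # act once "arp -a" prints more than this many lines
INTERVAL=10       # assumed polling period in seconds

# Count the non-empty lines read from stdin (one line per ARP entry).
count_entries() {
    awk 'NF { n++ } END { print n + 0 }'
}

# Succeed (exit status 0) when the given count exceeds the threshold.
over_threshold() {
    [ "$1" -gt "$THRESHOLD" ]
}

# Poll forever; re-running ifconfig clears the ARP cache as a side effect.
watch_arp() {
    while :; do
        entries=$(arp -a | count_entries)
        if over_threshold "$entries"; then
            /usr/sbin/ifconfig lan0 192.168.8.8
        fi
        sleep "$INTERVAL"
    done
}

# Start the loop only when explicitly asked, e.g. "sh arpwatch.sh run".
if [ "${1:-}" = "run" ]; then
    watch_arp
fi
```

Keeping the counting and the comparison in separate functions makes it possible to check the logic without touching the live interface.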
I may be completely wrong in associating this problem with the size of the ARP cache, there may be something else which the "ifconfig" command clears/fixes? (Resets the NIC card? )
Needless to say, there are no cabling/routing/switching problems; the problem starts in the server and can be fixed completely, though only temporarily, within the server itself.
What could be wrong?
Sorry about my rusty English...

Semih BATTAL

Steven Schweda · ‎07-05-2017

   I know nothing, but ...

> I may be completely wrong in associating this problem with the size of
> the ARP cache,

   That would be my guess.

> there may be something else which the "ifconfig" command
> clears/fixes? (Resets the NIC card? )

   Or triggers an ARP broadcast?

   My first guess would be something like an IP address conflict. I
(knowing nothing) can imagine that if two systems have the same IP
address, they'd be arguing about who should be getting the traffic for
the conflicted address. A fresh "ifconfig" command might help to tilt
the dispute temporarily (while coincidentally clearing the ARP cache),
but the conflicting system might do things which tilt it back the other
way. A growing ARP cache might indicate only that other folks on the
network get to talking, and, as that happens, the other competitor is
gaining audience share.

   But what do I know?

   If you take down the HP-UX system, does "ping" from elsewhere to the
HP-UX system's IP address get any responses (which must come from the
competitor)?

Bill Hassell · ‎07-06-2017

Do you have any networking messages in the command: dmesg?

Do you have any error counts in the command: lanadmin -g 0
(the errors follow the line: Index)

How about network errors in /var/adm/syslog/syslog.log? Specifically errors from ssh would be useful.

Bill Hassell, sysadmin

SemihBATTAL · ‎07-06-2017

Hi Steven,

Thank you for the useful tips...
I'll unplug the server LAN cable and ping its IP to see if someone else has it...
I'll be reporting back in an hour or so...

SemihBATTAL · ‎07-06-2017

Hi Bill,

Nothing unusual in dmesg.
Nothing unusual in syslog.
Nothing unusual in the "port status" pages of the switches involved. ( Procurve 1810G24 )

Here is lanadmin -g 0 output: ( with ifconfig script disabled for at least 30 minutes )

LAN INTERFACE STATUS DISPLAY
Thu, Jul 6,2017 20:56:02

PPA Number = 0
Description = lan0 HP PCI-X 1000Base-T Release B.11.31.1103
Type (value) = ethernet-csmacd(6)
MTU Size = 1500
Speed = 1000000000
Station Address = 0x1438eb4b62
Administration Status (value) = up(1)
Operation Status (value) = up(1)
Last Change = 345285005
Inbound Octets = 11825894
Inbound Unicast Packets = 814834779
Inbound Non-Unicast Packets = 12480874
Inbound Discards = 0
Inbound Errors = 0
Inbound Unknown Protocols = 769002
Outbound Octets = 525592820
Outbound Unicast Packets = 541306995
Outbound Non-Unicast Packets = 6249701
Outbound Discards = 1
Outbound Errors = 0
Outbound Queue Length = 2
Specific = 655367

Ethernet-like Statistics Group

Index = 1
Alignment Errors = 0
FCS Errors = 0
Single Collision Frames = 0
Multiple Collision Frames = 0
Deferred Transmissions = 0
Late Collisions = 0
Excessive Collisions = 0
Internal MAC Transmit Errors = 0
Carrier Sense Errors = 0
Frames Too Long = 0
Internal MAC Receive Errors = 0

Bill Hassell · ‎07-06-2017

> Nothing unusual in dmesg.
> Nothing unusual in syslog.

So are you redirecting all sshd logging to another file?
What does that file show when ssh fails?
Or do you mean that all the sshd failure messages are not unusual?

Do ftp or telnet fail to login?

There are no protocol errors in the lanadmin output.

Bill Hassell, sysadmin

SemihBATTAL · ‎07-06-2017

Hi Bill,
No, sshd logs do go to syslog, but I don't get a log line for a broken connection; I don't know why. (LogLevel is ERROR)
But I do get an error from the "putty" client which we use for ssh access from PCs.
The error message says: "Network error: Software caused connection abort".
Actually, you can predict that your ssh connection will break any minute by looking at the latency of the character echoes from the server. They begin to take longer and longer and in a few seconds/minutes the ssh connection breaks.
For telnet access we use 700/32 terminals connected to 2340 DTCs, and the same thing happens with them.

Here is an sshd log line from /usr/adm/syslog/syslog.log
Jul 6 09:34:16 everest sshd[10097]: error: PAM: Authentication failed for sanurt from 192.168.11.70
But, as I said, there are no log lines for broken connections.

This problem is not confined to telnet or ssh; we have Oracle listeners running on this server and they suffer from the same "broken connection" problem...

Here is a "flood-ping" test result from another server running Oracle Linux. These two servers are connected to a Procurve 1910G switch by short patch cables...

Thu Jul 6 21:46:50 GMT-3 2017 : 400 packets transmitted, 195 received, 51% packet loss, time 6324ms
Thu Jul 6 21:50:22 GMT-3 2017 : 400 packets transmitted, 331 received, 17% packet loss, time 6869ms
Thu Jul 6 21:52:37 GMT-3 2017 : 400 packets transmitted, 270 received, 32% packet loss, time 6883ms
Thu Jul 6 21:55:01 GMT-3 2017 : 400 packets transmitted, 361 received, 9% packet loss, time 6582ms
Thu Jul 6 22:05:39 GMT-3 2017 : 400 packets transmitted, 260 received, 35% packet loss, time 6786ms
Thu Jul 6 22:05:49 GMT-3 2017 : 400 packets transmitted, 390 received, 2% packet loss, time 6656ms
Thu Jul 6 22:06:45 GMT-3 2017 : 400 packets transmitted, 375 received, 6% packet loss, time 6302ms
Thu Jul 6 22:06:55 GMT-3 2017 : 400 packets transmitted, 301 received, 24% packet loss, time 6498ms
Thu Jul 6 22:08:22 GMT-3 2017 : 400 packets transmitted, 244 received, 39% packet loss, time 6868ms
Thu Jul 6 22:08:32 GMT-3 2017 : 400 packets transmitted, 342 received, 14% packet loss, time 6748ms
Thu Jul 6 22:11:56 GMT-3 2017 : 400 packets transmitted, 328 received, 18% packet loss, time 7036ms
Thu Jul 6 22:13:03 GMT-3 2017 : 400 packets transmitted, 398 received, 0% packet loss, time 6764ms
Thu Jul 6 22:21:08 GMT-3 2017 : 400 packets transmitted, 344 received, 14% packet loss, time 6799ms

This test illustrates the severity of our problem.
In short, this server is unusable unless the "ifconfig" command is run every few minutes...
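For anyone comparing runs, the result lines above can be totalled with a few lines of awk. This is only a sketch: the function name and the log file name in the comment are assumptions, and it relies on the "N packets transmitted, M received" wording shown in the results.

```shell
# Total up ping summary lines read from stdin.
# summarise_loss is a hypothetical helper name, not part of any tool.
summarise_loss() {
    awk '
        {
            for (i = 3; i <= NF; i++) {
                # "400 packets transmitted," -> the count sits two fields back
                if ($i ~ /^transmitted/) sent += $(i - 2)
                # "195 received," -> the count sits one field back
                if ($i ~ /^received/) rcvd += $(i - 1)
            }
        }
        END {
            if (sent > 0)
                printf "%d sent, %d received, %.1f%% lost\n",
                       sent, rcvd, 100 * (sent - rcvd) / sent
        }'
}

# Example use (ping.log is an assumed file holding the lines above):
#   summarise_loss < ping.log
```

Across the thirteen runs listed above (5200 packets sent, 4139 received) the overall loss works out to roughly 20%.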

SemihBATTAL · ‎07-06-2017

Hi Steven,

I unplugged the server LAN cable and left it like that for about 10 minutes. During this period no one else replied to the pings...
Having said that, might the "duplicate" simply not be replying to ICMP packets?
I've now reconnected the server to the LAN but via a different Procurve switch.
And, nothing changed, we are still experiencing high packet loss...

Thu Jul 6 23:30:42 GMT-3 2017 : 400 packets transmitted, 372 received, 7% packet loss, time 6513ms
Thu Jul 6 23:36:01 GMT-3 2017 : 400 packets transmitted, 264 received, 34% packet loss, time 6749ms

Bill Hassell · ‎07-07-2017

>> latency of the character echoes from the server...

This sounds like a possible ARP storm. You'll need to get your network admins to look at network traces. You can use nettl to trace the network when things get bad and then run the result through Wireshark. Your network team can help with the output from Wireshark.

Here's a sample trace:

# nettl -traceon all -e all -f /var/tmp/net-trc

after a few seconds

# nettl -traceoff -e all

Then use Wireshark to open the file /var/tmp/net-trc.TRC000

Bill Hassell, sysadmin

SemihBATTAL · ‎07-10-2017

Hi Bill,
I traced the server network for 30 seconds; during this period ~12000 packets were captured, 216 of which were ARP requests.
It doesn't look like an ARP storm, does it?

"This frame is a (suspected) retransmission", TCP, count: 4763
"This frame is a (suspected) fast retransmission", TCP, count: 627
The port numbers involved for these two reports belong to our Oracle listeners.
Do these indicate any trouble?
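To put the trace figures in proportion, here is a quick calculation using only the numbers quoted above. The packet total is approximate (~12000), so the percentages are rough; the variable and function names are assumptions for illustration.

```shell
# Figures quoted in the post; the packet total is approximate (~12000).
total=12000; seconds=30; arp=216; retrans=4763; fast_retrans=627

# trace_summary is a hypothetical helper name.
trace_summary() {
    awk -v t="$total" -v s="$seconds" -v a="$arp" \
        -v r="$retrans" -v f="$fast_retrans" 'BEGIN {
        printf "ARP requests per second: %.1f\n", a / s
        printf "ARP share of packets: %.1f%%\n", 100 * a / t
        printf "Retransmission share: %.1f%%\n", 100 * (r + f) / t
    }'
}

trace_summary
```

On these numbers ARP is about 7 requests per second and under 2% of the capture, while suspected retransmissions make up roughly 45% of it, which is at least consistent with the packet loss seen in the ping tests.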
