
Up to 50% ping loss when ARP cache grows over ~100 items (HP-UX 11.31 RX2620-2)

SOLVED
SemihBATTAL
Advisor

Hi All,
Please help me solve this strange problem.
In order to make this post readable, I'll leave out all the troubleshooting steps I have taken so far and only point out the temporary fix I've found, which should give a clue as to what may be wrong.

This Itanium server (HP-UX 11.31, fully patched, running happily since 2004) has recently started failing to respond to some of the pings from other servers once the ARP cache grows beyond ~100 items. Within a few minutes the server becomes completely unusable, because even an incoming ssh session can't be maintained due to lost packets.
As a workaround, I run a shell script which continuously monitors the number of lines in the output of "arp -a"; if there are more than 100, it executes the command "/usr/sbin/ifconfig lan0 192.168.8.8", which, among other things, clears the ARP cache. This action instantly fixes the "lost pings" problem.
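A minimal sketch of such a watchdog, for illustration only; it is not the exact script in use. The function names, the ARP_CMD / RESET_CMD / THRESHOLD / INTERVAL variables and the 10-second poll interval are assumptions. Only "arp -a", the threshold of 100 and the ifconfig command come from the description above.

```shell
#!/bin/sh
# Hypothetical watchdog sketch; all names and the poll interval are assumptions.
ARP_CMD=${ARP_CMD:-"arp -a"}
RESET_CMD=${RESET_CMD:-"/usr/sbin/ifconfig lan0 192.168.8.8"}
THRESHOLD=${THRESHOLD:-100}

# Print the number of lines in the ARP listing (one line per entry).
arp_entry_count() {
    $ARP_CMD | wc -l | tr -d ' '
}

# Run the reset command when the count exceeds the threshold.
# Prints "reset N" or "ok N" so the caller can log the decision.
check_once() {
    count=$(arp_entry_count)
    if [ "$count" -gt "$THRESHOLD" ]; then
        $RESET_CMD >/dev/null 2>&1
        echo "reset $count"
    else
        echo "ok $count"
    fi
}

# Poll forever. Not started automatically, so the functions above can be
# tried on their own first.
watch_arp() {
    while :; do
        check_once
        sleep "${INTERVAL:-10}"
    done
}
```

Calling watch_arp as root would start the loop; check_once alone performs a single check.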
I may be completely wrong in associating this problem with the size of the ARP cache, there may be something else which the "ifconfig" command clears/fixes? (Resets the NIC card? )
Needless to say, there are no cabling/routing/switching problems; the problem starts in the server and can be fixed completely, but only temporarily, from within the server itself.
What could be wrong?
Sorry about my rusty English...

Semih BATTAL

17 REPLIES
Steven Schweda
Honored Contributor

   I know nothing, but ...

> I may be completely wrong in associating this problem with the size of
> the ARP cache,

   That would be my guess.

>  there may be something else which the "ifconfig" command
> clears/fixes? (Resets the NIC card? )

   Or triggers an ARP broadcast?

   My first guess would be something like an IP address conflict.  I
(knowing nothing) can imagine that if two systems have the same IP
address, they'd be arguing about who should be getting the traffic for
the conflicted address.  A fresh "ifconfig" command might help to tilt
the dispute temporarily (while coincidentally clearing the ARP cache),
but the conflicting system might do things which tilt it back the other
way.  A growing ARP cache might indicate only that other folks on the
network get to talking, and, as that happens, the other competitor is
gaining audience share.

   But what do I know?

   If you take down the HP-UX system, does "ping" from elsewhere to the
HP-UX system's IP address get any responses (which must come from the
competitor)?

Bill Hassell
Honored Contributor

Are there any networking messages in the output of the command: dmesg?

Are there any error counts in the output of the command: lanadmin -g 0
(the error counters follow the line: Index)

How about network errors in /var/adm/syslog/syslog.log? Specifically errors from ssh would be useful.



Bill Hassell, sysadmin
SemihBATTAL
Advisor

Hi Steven,

Thank you for the useful tips...
I'll unplug the server LAN cable and ping its IP to see if someone else has it...
I'll be reporting back in an hour or so...

SemihBATTAL
Advisor

Hi Bill,

Nothing unusual in dmesg.
Nothing unusual in syslog.
Nothing unusual in the "port status" pages of the switches involved. ( Procurve 1810G24 )

Here is lanadmin -g 0 output: ( with ifconfig script disabled for at least 30 minutes )

LAN INTERFACE STATUS DISPLAY
Thu, Jul 6,2017 20:56:02

PPA Number = 0
Description = lan0 HP PCI-X 1000Base-T Release B.11.31.1103
Type (value) = ethernet-csmacd(6)
MTU Size = 1500
Speed = 1000000000
Station Address = 0x1438eb4b62
Administration Status (value) = up(1)
Operation Status (value) = up(1)
Last Change = 345285005
Inbound Octets = 11825894
Inbound Unicast Packets = 814834779
Inbound Non-Unicast Packets = 12480874
Inbound Discards = 0
Inbound Errors = 0
Inbound Unknown Protocols = 769002
Outbound Octets = 525592820
Outbound Unicast Packets = 541306995
Outbound Non-Unicast Packets = 6249701
Outbound Discards = 1
Outbound Errors = 0
Outbound Queue Length = 2
Specific = 655367

Ethernet-like Statistics Group

Index = 1
Alignment Errors = 0
FCS Errors = 0
Single Collision Frames = 0
Multiple Collision Frames = 0
Deferred Transmissions = 0
Late Collisions = 0
Excessive Collisions = 0
Internal MAC Transmit Errors = 0
Carrier Sense Errors = 0
Frames Too Long = 0
Internal MAC Receive Errors = 0

Bill Hassell
Honored Contributor

>> Nothing unusual in dmesg.
>> Nothing unusual in syslog.

So are you redirecting all sshd logging to another file?
What does that file show when ssh fails?
Or do you mean that all the sshd failure messages are not unusual?

Do ftp or telnet fail to login?

There are no protocol errors in the lanadmin output.

 



Bill Hassell, sysadmin
SemihBATTAL
Advisor

Hi Bill,
No, sshd logs do go to syslog, but I don't get a log line for a broken connection; I don't know why. (LogLevel is ERROR.)
But I do get an error from the "putty" client, which we use for ssh access from PCs.
The error message says: "Network error: Software caused connection abort".
Actually, you can predict that your ssh connection is about to break by watching the latency of the character echoes from the server. They take longer and longer, and within a few seconds or minutes the ssh connection breaks.
For telnet access we use 700/32 terminals connected to 2340 DTCs, and the same thing happens with them.

Here is an sshd log line from /usr/adm/syslog/syslog.log
Jul 6 09:34:16 everest sshd[10097]: error: PAM: Authentication failed for sanurt from 192.168.11.70
But, as I said, there are no log lines for broken connections.

This problem is not confined to telnet or ssh; we have Oracle listeners running on this server and they suffer from the same "broken connection" problem...

Here is a "flood-ping" test result from another server running Oracle Linux. These two servers are connected to a Procurve 1910G switch by short patch cables...

Thu Jul 6 21:46:50 GMT-3 2017 : 400 packets transmitted, 195 received, 51% packet loss, time 6324ms
Thu Jul 6 21:50:22 GMT-3 2017 : 400 packets transmitted, 331 received, 17% packet loss, time 6869ms
Thu Jul 6 21:52:37 GMT-3 2017 : 400 packets transmitted, 270 received, 32% packet loss, time 6883ms
Thu Jul 6 21:55:01 GMT-3 2017 : 400 packets transmitted, 361 received, 9% packet loss, time 6582ms
Thu Jul 6 22:05:39 GMT-3 2017 : 400 packets transmitted, 260 received, 35% packet loss, time 6786ms
Thu Jul 6 22:05:49 GMT-3 2017 : 400 packets transmitted, 390 received, 2% packet loss, time 6656ms
Thu Jul 6 22:06:45 GMT-3 2017 : 400 packets transmitted, 375 received, 6% packet loss, time 6302ms
Thu Jul 6 22:06:55 GMT-3 2017 : 400 packets transmitted, 301 received, 24% packet loss, time 6498ms
Thu Jul 6 22:08:22 GMT-3 2017 : 400 packets transmitted, 244 received, 39% packet loss, time 6868ms
Thu Jul 6 22:08:32 GMT-3 2017 : 400 packets transmitted, 342 received, 14% packet loss, time 6748ms
Thu Jul 6 22:11:56 GMT-3 2017 : 400 packets transmitted, 328 received, 18% packet loss, time 7036ms
Thu Jul 6 22:13:03 GMT-3 2017 : 400 packets transmitted, 398 received, 0% packet loss, time 6764ms
Thu Jul 6 22:21:08 GMT-3 2017 : 400 packets transmitted, 344 received, 14% packet loss, time 6799ms

This test illustrates the severity of our problem.
In short, this server is unusable unless the "ifconfig" command is run every few minutes...
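To put a single number on a log like the one above, the lines can be summarised with a small awk filter. This is only a sketch: the function name summarize_loss is made up, and it assumes every line has the same "N packets transmitted, M received, P% packet loss" layout as shown.

```shell
#!/bin/sh
# Summarise ping log lines of the form:
#   <date> : 400 packets transmitted, 195 received, 51% packet loss, time 6324ms
# Reads stdin. Prints the number of runs, the worst per-run loss as reported
# by ping, and the overall loss computed from the transmitted/received totals.
summarize_loss() {
    awk '
        /packets transmitted/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "packets" && $(i + 1) ~ /^transmitted/) sent = $(i - 1)
                if ($i ~ /^received/) recv = $(i - 1)
                if ($i ~ /^[0-9]+%$/) loss = $i + 0
            }
            runs++
            total_sent += sent
            total_recv += recv
            if (loss > worst) worst = loss
        }
        END {
            if (runs == 0) { print "runs=0"; exit }
            printf "runs=%d worst=%d%% overall=%.1f%%\n", runs, worst,
                   100 * (total_sent - total_recv) / total_sent
        }'
}
```

Usage would be something like "summarize_loss < ping.log", where ping.log is a hypothetical file holding the lines above.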

SemihBATTAL
Advisor

Hi Steven,

I unplugged the server's LAN cable and left it unplugged for about 10 minutes. During this period nobody else replied to the pings.
Having said that, might the "duplicate" simply not be replying to ICMP packets?
I have now reconnected the server to the LAN, but via a different Procurve switch.
Nothing changed; we are still experiencing high packet loss...

Thu Jul 6 23:30:42 GMT-3 2017 : 400 packets transmitted, 372 received, 7% packet loss, time 6513ms
Thu Jul 6 23:36:01 GMT-3 2017 : 400 packets transmitted, 264 received, 34% packet loss, time 6749ms

Bill Hassell
Honored Contributor

>> latency of the character echoes from the server...

This sounds like a possible ARP storm. You'll need to get your network admins to look at network traces. You can use nettl to trace the network when things get bad and then run the result through Wireshark. Your network team can help with the output from Wireshark.

Here's a sample trace:

# nettl -traceon all -e all -f /var/tmp/net-trc

after a few seconds

# nettl -traceoff -e all

Then use Wireshark to open the file /var/tmp/net-trc.TRC000



Bill Hassell, sysadmin
SemihBATTAL
Advisor

Hi Bill,
I traced the server's network traffic for 30 seconds; during this period ~12000 packets were captured, 216 of which were ARP requests.
That doesn't look like an ARP storm, does it?

"This frame is a (suspected) retransmission", TCP, count: 4763
"This frame is a (suspected) fast retransmission", TCP, count: 627
The port numbers involved for these two reports belong to our Oracle listeners.
Do these indicate any trouble?
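For a rough sense of scale, the counts above can be turned into percentages and rates. The two helper names below (share, rate) are made up for this sketch, and the total of ~12000 packets is only approximate, so the results are rough too.

```shell
#!/bin/sh
# share COUNT TOTAL: print COUNT as a percentage of TOTAL, one decimal place.
share() {
    awk -v c="$1" -v t="$2" 'BEGIN { printf "%.1f%%\n", 100 * c / t }'
}

# rate COUNT SECONDS: print events per second, one decimal place.
rate() {
    awk -v c="$1" -v s="$2" 'BEGIN { printf "%.1f/s\n", c / s }'
}

# Figures taken from the trace described above:
#   share 216 12000                 ARP requests as a share of all packets
#   rate 216 30                     ARP requests per second
#   share $((4763 + 627)) 12000     frames flagged as (fast) retransmissions
```

With these inputs the ARP requests come to about 1.8% of the capture (about 7.2 per second), while the frames flagged as retransmissions come to roughly 45%, which would be consistent with the packet loss seen in the ping tests.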

donna hofmeister
Trusted Contributor

what results do you get for the following commands?

ndd -get /dev/arp arp_cache_report
ndd -get /dev/arp arp_cleanup_interval

what's in your nddconf file?

SemihBATTAL
Advisor

Hi Donna,
Thank you for giving a hand...

# ndd -get /dev/arp arp_cleanup_interval
600000

# /etc/rc.config.d/nddconf
# As per PHNE_43814. Semih BATTAL 2014-11-19
TRANSPORT_NAME[0]=tcp
NDD_NAME[0]=tcp_sack_enable
NDD_VALUE[0]=2
#
TRANSPORT_NAME[1]=ip
NDD_NAME[1]=ip_ire_gw_probe
NDD_VALUE[1]=0

# ndd -get /dev/arp arp_cache_report
ifname proto addr proto mask hardware addr flags
lan0 192.168.007.111 255.255.255.255 ec:1f:72:b7:d3:18
lan0 192.168.007.110 255.255.255.255 dc:85:de:ba:d4:2c
lan0 192.168.007.108 255.255.255.255 dc:85:de:ba:d4:29
lan0 192.168.011.164 255.255.255.255 00:8c:fa:61:c7:5c
lan0 192.168.011.042 255.255.255.255 00:25:86:e3:2f:07
lan0 192.168.011.041 255.255.255.255 00:25:86:e3:1f:66
lan0 192.168.011.110 255.255.255.255 10:bf:48:05:23:21
lan0 192.168.011.045 255.255.255.255 40:b0:34:29:82:5b
lan0 192.168.007.097 255.255.255.255 00:0c:43:ce:42:de
lan0 192.168.008.048 255.255.255.255 00:1e:68:1e:1a:02
lan0 192.168.011.113 255.255.255.255 00:8c:fa:61:ba:55
lan0 192.168.008.251 255.255.255.255 aa:aa:aa:00:cb:65
lan0 192.168.011.191 255.255.255.255 1c:87:2c:42:22:b4
lan0 192.168.007.112 255.255.255.255 c8:d5:fe:f1:ae:4d
lan0 192.168.011.194 255.255.255.255 1c:87:2c:41:ab:a7
lan0 192.168.008.001 255.255.255.255 00:17:08:59:b4:60
lan0 192.168.011.193 255.255.255.255 1c:87:2c:42:1e:7d
lan0 192.168.002.014 255.255.255.255 00:14:38:eb:4b:62 UNRESOLVED
lan0 192.168.010.006 255.255.255.255 00:80:92:6d:62:d2
lan0 192.168.011.070 255.255.255.255 00:1e:8c:df:60:6e
lan0 192.168.002.012 255.255.255.255 00:14:38:eb:4b:62 UNRESOLVED
lan0 192.168.002.013 255.255.255.255 00:14:38:eb:4b:62 UNRESOLVED
lan0 192.168.008.072 255.255.255.255 a0:2b:b8:1f:35:28
lan0 192.168.008.008 255.255.255.255 00:14:38:eb:4b:62 PERM PUBLISH LOCAL
lan0 192.168.011.010 255.255.255.255 c8:60:00:56:e6:4f
lan0 192.168.000.002 255.255.255.255 00:0b:86:6e:cb:54
lan0 192.168.015.010 255.255.255.255 ac:9b:f4:82:69:1c
lan0 192.168.011.014 255.255.255.255 88:51:fb:57:57:38
lan0 192.168.011.012 255.255.255.255 c8:60:00:56:e4:ac
lan0 192.168.011.083 255.255.255.255 00:0f:fe:f3:cb:2d
lan0 192.168.011.082 255.255.255.255 00:0f:fe:f2:73:88
lan0 192.168.011.022 255.255.255.255 00:1e:90:28:79:1c
lan0 192.168.007.088 255.255.255.255 00:0c:43:ce:44:41
lan0 192.168.011.026 255.255.255.255 00:24:d6:3b:43:60
lan0 192.168.008.026 255.255.255.255 00:22:64:2a:30:3c
lan0 192.168.011.024 255.255.255.255 00:16:e6:64:d0:e5
lan0 192.168.011.088 255.255.255.255 00:1e:33:d1:60:ba
lan0 192.168.011.159 255.255.255.255 d8:5d:4c:80:e3:e9
lan0 192.168.007.211 255.255.255.255 00:0c:43:ce:42:bf
lan0 192.168.011.028 255.255.255.255 10:bf:48:04:7b:9d
lan0 224.000.000.000 240.000.000.000 01:00:5e:00:00:00 PERM MAPPING
(The hosts with UNRESOLVED entries are powered down, but the server still pings them.)
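To track the size of the cache over time, the report can be reduced to two counts. This is a sketch only: the function name arp_report_summary is made up, and it assumes the layout shown above (a header line starting with "ifname", then one entry per line with the flags at the end).

```shell
#!/bin/sh
# Count entries in "ndd -get /dev/arp arp_cache_report" output read from stdin.
# The header line (first field "ifname") is skipped; every other line with at
# least four fields is counted as one entry.
arp_report_summary() {
    awk '
        $1 == "ifname" { next }
        NF >= 4 {
            total++
            if ($0 ~ /UNRESOLVED/) unresolved++
        }
        END { printf "entries=%d unresolved=%d\n", total, unresolved }'
}
```

Usage would be "ndd -get /dev/arp arp_cache_report | arp_report_summary". Fed the 41 data lines listed above, it should report entries=41 unresolved=3.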

donna hofmeister
Trusted Contributor

do you know why arp_cleanup_interval was changed from 300000 (5 min)? i'm thinking running cleanup every 5 minutes should resolve your issue...

BUT before you make any changes, please do the following:

netstat -s > before.txt
<wait for 10 minutes>
netstat -s > after.txt

please attach "before" and "after" so i can see how your network is performing.
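One possible way to compare the two snapshots is sketched below. The function name netstat_delta is made up, and the sketch assumes the usual "netstat -s" layout: a section line such as "tcp:" followed by indented "<number> <description>" lines. Lines in any other shape (for example "name: number" histogram rows) are ignored.

```shell
#!/bin/sh
# netstat_delta BEFORE AFTER: print the counters that changed between two
# "netstat -s" snapshots, prefixed with their section name (tcp:, icmp:, ...).
netstat_delta() {
    awk '
        /^[a-zA-Z0-9_]+:[ \t]*$/ { section = $1; next }
        $1 ~ /^[0-9]+$/ {
            n = $1 + 0
            $1 = ""
            sub(/^[ \t]+/, "")
            key = section " " $0
            if (FNR == NR) {
                # First file: remember the starting value.
                before[key] = n
            } else if ((key in before) && n != before[key]) {
                # Second file: report the increase since the first snapshot.
                printf "%s %d\n", key, n - before[key]
            }
        }' "$1" "$2"
}
```

Usage would be "netstat_delta before.txt after.txt"; each output line is a counter description followed by how much it grew during the interval.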

SemihBATTAL
Advisor

Hi Donna,
While I wait for the "after" report...
I've tried a wide range of values for arp_cleanup_interval, from as low as 10 seconds right up to 10 minutes...
They did not make the problem any better or any worse.
The title of this thread is a bit misleading, isn't it? As I said earlier, I am probably wrong in making this association.
And, as I also said earlier, "ifconfig" does other things besides clearing the ARP cache.
The size of the ARP cache seems merely to indicate the point in time at which things start to go wrong (counted from when the ifconfig command last cleared the cache).
before.txt and after.txt are attached...
No, they are NOT... Only jpg, bmp etc. are accepted.
Should I cheat by changing the extension? Or include them here in the text?

 

SemihBATTAL
Advisor

Hi again Donna,
See below how bad the problem is when the "ifconfig" script is not running...
Mon Jul 10 22:16:27 GMT-3 2017 : 400 packets transmitted, 335 received, 16% packet loss, time 6462ms
Mon Jul 10 22:16:46 GMT-3 2017 : 400 packets transmitted, 351 received, 12% packet loss, time 6612ms
Mon Jul 10 22:18:22 GMT-3 2017 : 400 packets transmitted, 397 received, 0% packet loss, time 6626ms
Mon Jul 10 22:19:41 GMT-3 2017 : 400 packets transmitted, 357 received, 10% packet loss, time 7181ms
Mon Jul 10 22:21:09 GMT-3 2017 : 400 packets transmitted, 395 received, 1% packet loss, time 6782ms
Mon Jul 10 22:23:05 GMT-3 2017 : 400 packets transmitted, 181 received, 54% packet loss, time 6364ms
Mon Jul 10 22:32:44 GMT-3 2017 : 400 packets transmitted, 291 received, 27% packet loss, time 6882ms
Mon Jul 10 22:36:16 GMT-3 2017 : 400 packets transmitted, 375 received, 6% packet loss, time 6746ms
Mon Jul 10 22:43:02 GMT-3 2017 : 400 packets transmitted, 303 received, 24% packet loss, time 6565ms
Mon Jul 10 22:43:22 GMT-3 2017 : 400 packets transmitted, 350 received, 12% packet loss, time 6876ms
Mon Jul 10 22:47:51 GMT-3 2017 : 400 packets transmitted, 367 received, 8% packet loss, time 6534ms
Mon Jul 10 22:48:01 GMT-3 2017 : 400 packets transmitted, 309 received, 22% packet loss, time 6757ms
Mon Jul 10 22:53:29 GMT-3 2017 : 400 packets transmitted, 374 received, 6% packet loss, time 6692ms
Mon Jul 10 22:56:05 GMT-3 2017 : 400 packets transmitted, 377 received, 5% packet loss, time 6846ms
Mon Jul 10 22:59:28 GMT-3 2017 : 400 packets transmitted, 227 received, 43% packet loss, time 6811ms
Mon Jul 10 23:00:56 GMT-3 2017 : 400 packets transmitted, 399 received, 0% packet loss, time 6683ms
Mon Jul 10 23:01:45 GMT-3 2017 : 400 packets transmitted, 372 received, 7% packet loss, time 6771ms
Mon Jul 10 23:02:05 GMT-3 2017 : 400 packets transmitted, 390 received, 2% packet loss, time 6951ms
Mon Jul 10 23:06:36 GMT-3 2017 : 400 packets transmitted, 387 received, 3% packet loss, time 6949ms
Mon Jul 10 23:07:53 GMT-3 2017 : 400 packets transmitted, 389 received, 2% packet loss, time 6602ms
Mon Jul 10 23:09:01 GMT-3 2017 : 400 packets transmitted, 119 received, 70% packet loss, time 6808ms
Mon Jul 10 23:09:11 GMT-3 2017 : 400 packets transmitted, 123 received, 69% packet loss, time 6593ms
Mon Jul 10 23:11:26 GMT-3 2017 : 400 packets transmitted, 285 received, 28% packet loss, time 6892ms
Mon Jul 10 23:11:36 GMT-3 2017 : 400 packets transmitted, 389 received, 2% packet loss, time 6777ms
Mon Jul 10 23:20:26 GMT-3 2017 : 400 packets transmitted, 287 received, 28% packet loss, time 6541ms
Mon Jul 10 23:21:53 GMT-3 2017 : 400 packets transmitted, 394 received, 1% packet loss, time 6738ms
Mon Jul 10 23:26:35 GMT-3 2017 : 400 packets transmitted, 373 received, 6% packet loss, time 6876ms
Mon Jul 10 23:27:42 GMT-3 2017 : 400 packets transmitted, 356 received, 11% packet loss, time 6357ms
Mon Jul 10 23:31:06 GMT-3 2017 : 400 packets transmitted, 359 received, 10% packet loss, time 6857ms
Mon Jul 10 23:31:15 GMT-3 2017 : 400 packets transmitted, 210 received, 47% packet loss, time 6883ms
Mon Jul 10 23:32:23 GMT-3 2017 : 400 packets transmitted, 390 received, 2% packet loss, time 6376ms

SemihBATTAL
Advisor

I cheated... After downloading, the file extensions should be changed back to "txt"...

donna hofmeister
Trusted Contributor

this is strange.....  the following reflect the differences between your before and after file, where the elapsed time is ~10 minutes:

tcp:
33372 packets sent
36924 packets received

however

icmp:
25174 calls to generate an ICMP error message
0 ICMP messages dropped
Output histogram:
echo reply: 25131

you've got nearly as many pings as you have with all other network activity! pinging too much is not better. (in the old days there was such a thing as 'the ping of death' (large packets with large counts)...) maybe you can dial down your paranoia level and only ping (say) every 10-15 minutes with 10 bytes for 10 iterations?

that's the only thing i'm seeing that may be an immediately addressable issue.  i suspect there's a larger issue with your lan.  i suspect too that if you were to actually open a call with the RC you'd get a better answer.

SemihBATTAL
Advisor
Solution

Hi Donna,
Sorry it was my fault...
I had forgotten to stop the "flood-ping" script before running the netstat commands.
I've done the same test again and there were only 37 pings in 27186 sent tcp packets in the 10 minute period.
I only started to run the "flood-ping" script when we started to encounter broken connections due to lost packets.
I also have to run the "ifconfig" script at the same time, to periodically "reset" the network stack (or is it to clear the ARP cache?).
This is the only way that I can make the server "usable"...
Actually, these two scripts have been running continuously since I booted the system on 16th May 2017.
Before that boot, the server had been shut down for a few hours due to UPS maintenance.
I remember installing the March 2017 QPKBASE and QPKAPPS bundles before shutting down the system.
Should I uninstall both bundles and see what happens?

Today, after everyone went home, I stopped our "global systems daemon" program.
This program checks about 180 "must-be-alive" systems/services such as our SMTP/POP3/HTTP/Oracle/etc. services as well as IoT nodes, some sensor values, security cameras/recording equipment, personnel attendance readers and so on.
With this program stopped, the ARP cache grows very slowly; in fact it took 4 hours to reach 100 MACs.
And, to my surprise, there was not a single lost packet in this 4-hour period.
As soon as the ARP count exceeded ~100, some packet loss started.
This problem is starting to look like it is directly related to the ARP cache size...