Procurve 4000m management interface freezes

Michael Robbert
Occasional Visitor

We are having occasional problems with a few of our 4000m switches. The switches are still passing traffic as far as we can tell, but we lose all connectivity to the management interface: we can't ping it, and the serial console is frozen. Are there any common things to look for when this happens?

Here is some more info. The problem may be tied to an automated snmpbulkwalk that we do to track the MAC address table on the switch, but the problem lasts for a few minutes after the walk is done. This is happening to 4 switches right now that are all uplinked to the same switch. The uplink switch is having no problems and two other switches (same model, similar configs) that uplink to the same place do not have the problem at all. Nothing shows up in the logs and no errors are reported on any of the interfaces. One other funny thing to mention is that this only seems to happen when the building that these switches feed is closed to the public. We are a college campus and it is spring break this week. The problem has happened during other breaks in the past, but does not regularly happen every night when the building is also closed.

Any ideas? If you want more info let me know.

Thanks,
Mike Robbert
7 REPLIES
Matt Hobbs
Honored Contributor

Re: Procurve 4000m management interface freezes

Hi Mike,

I haven't heard of anything like this before. As you suspect, it may be related to the snmpbulkwalk, which could be temporarily overwhelming the switch's CPU.

You say the serial console is frozen too when it happens? I would try the serial console to one of the switches and open the Status & Counters > General System Info, run the snmpbulkwalk and see if anything interesting happens there.

Make sure you're running the latest firmware.

If you could reliably reproduce it in a lab environment it would definitely help.
Eirram_1
Frequent Advisor

Re: Procurve 4000m management interface freezes

Hello Mike,

So basically what you are saying is that Management hangs. During the problem, you can still ping OVER the switch? This would be an important data point. What do you do to recover? Reboot the switch?

Anyway, there is an interesting fix in C.09.16 relating to CERT SNMPv1 "req-enc" test #1150. The bug it addresses can make the console freeze and telnet sessions hang. If the firmware is older than .16, then definitely update the software.
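
A minimal sketch of the "older than .16" check, in case it needs to be run across many switches. It assumes versions look like C.09.16 (letter, two numbers, separated by dots or underscores); the function names and the sample versions in the loop are made up for illustration.

```python
import re


def parse_firmware(version):
    """Split a version such as 'C.09.16' or 'C_09_16' into ('C', 9, 16)."""
    parts = re.split(r"[._]", version.strip())
    if len(parts) != 3:
        raise ValueError(f"unexpected firmware version format: {version!r}")
    family, major, minor = parts
    return family.upper(), int(major), int(minor)


def is_older_than(version, baseline):
    """True if `version` is an earlier release than `baseline` in the same family."""
    v = parse_firmware(version)
    b = parse_firmware(baseline)
    if v[0] != b[0]:
        raise ValueError("versions belong to different firmware families")
    # Compare the numeric parts as integers, not as strings.
    return v[1:] < b[1:]


if __name__ == "__main__":
    # Example inputs only; substitute the versions your switches report.
    for version in ("C.09.15", "C.09.22"):
        if is_older_than(version, "C.09.16"):
            print(version, "predates the fix, update recommended")
        else:
            print(version, "already includes the C.09.16 fix")
```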

Good luck
Michael Robbert
Occasional Visitor

Re: Procurve 4000m management interface freezes

Thanks for the quick replies. Here are some more data points to work with. We are running the latest firmware on all of our switches. They were upgraded over the winter break. For the 4000m this is C.09.22.

Yesterday I disabled our program that runs the snmpbulkwalk, and our monitoring software hasn't seen the switches disappear since. We really do need to run this software, so this isn't a fix, just a temporary workaround.
I haven't had a chance to bring a laptop down to the data closet again, but I did telnet into the switch and watch the general system information screen while running a snmpbulkwalk. On one of the affected switches the clock stops refreshing as soon as the command is executed and the management interface becomes unpingable. A ping to a device plugged into this switch is unaffected. In order to fix the problem we just wait. After about 4-5 minutes everything goes back to normal. One other thing to mention is that the snmpbulkwalk fails with a timeout.
Meanwhile, on another switch that doesn't seem to have this problem, everything freezes when the snmpbulkwalk starts, just like on the affected box, and I can't ping management. The difference is that the snmpbulkwalk returns data after less than 30 seconds, and as soon as it does the management interface comes back.
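
To put numbers on "4-5 minutes" versus "less than 30 seconds", one option is to log ping results with timestamps and compute the outage windows afterwards. The sketch below is only that calculation; the sample data is invented for illustration, not measured on these switches, and how the ping samples are collected is left open.

```python
def outage_windows(samples):
    """Return (start, end) pairs for each run of failed pings.

    samples: iterable of (timestamp_seconds, reachable) tuples in time order.
    An outage starts at the first failed sample and ends at the next
    successful one. An outage still open at the end of the data is closed
    at the last timestamp seen.
    """
    windows = []
    start = None
    last_ts = None
    for ts, reachable in samples:
        if not reachable and start is None:
            start = ts
        elif reachable and start is not None:
            windows.append((start, ts))
            start = None
        last_ts = ts
    if start is not None:
        windows.append((start, last_ts))
    return windows


if __name__ == "__main__":
    # Invented samples: one ping every 10 seconds, management down for a while.
    samples = [(0, True), (10, False), (20, False), (30, False), (40, True), (50, True)]
    for start, end in outage_windows(samples):
        print(f"management unreachable from t={start}s to t={end}s ({end - start}s)")
```

Running the same logger against an affected and an unaffected switch during one walk would give directly comparable outage lengths.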
I never see the CPU utilization rise, but it could spike before it has time to update. I don't think that I'm going to see different results on the serial console since it seems to lock up at the same time. Here is the command that is causing the issue:
snmpbulkwalk -v 2c x.x.x.x -t 60 -r 0 public enterprises.11.2.14.2.10.5.1.3.1

I'll run some more tests to see if other OIDs will cause the problem or if the size of the MAC address table or maybe number of interfaces on the switch affects this.
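
A rough sketch of how those per-OID tests could be scripted. It assumes the Net-SNMP command-line tools are installed and uses their -v, -c, -t and -r options; the helper names, the choice of the standard system group (1.3.6.1.2.1.1) as a small comparison OID, and the assumption that a failed or timed-out walk exits non-zero are mine, not something confirmed on these switches.

```python
import subprocess
import sys
import time

# First OID is the one from the command above; the second is the standard
# MIB-2 system group, chosen here only as a small baseline for comparison.
OIDS = ["enterprises.11.2.14.2.10.5.1.3.1", "1.3.6.1.2.1.1"]


def build_walk_command(tool, host, community, oid, timeout_s=60, retries=0):
    """Build the argument list for snmpwalk or snmpbulkwalk (SNMP v2c)."""
    return [tool, "-v", "2c", "-c", community,
            "-t", str(timeout_s), "-r", str(retries), host, oid]


def time_walk(command, runner=subprocess.run):
    """Run one walk and report success, elapsed seconds and rows returned."""
    start = time.monotonic()
    result = runner(command, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    return {
        "ok": result.returncode == 0,  # assumption: non-zero exit means failure
        "seconds": elapsed,
        "rows": len(result.stdout.splitlines()),
    }


def main(argv):
    if len(argv) != 3:
        print("usage: walk_timing.py HOST COMMUNITY")
        return
    host, community = argv[1], argv[2]
    for oid in OIDS:
        outcome = time_walk(build_walk_command("snmpbulkwalk", host, community, oid))
        print(oid, outcome)


if __name__ == "__main__":
    main(sys.argv)
```

Recording the row count next to the elapsed time should show whether the delay tracks the size of the table or the particular OID.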

Keep the suggestions coming.

Mike
Matt Hobbs
Honored Contributor

Re: Procurve 4000m management interface freezes

Hi Mike,

What if you try snmpwalk instead of snmpbulkwalk? My understanding is that snmpbulkwalk asks the agent on the switch for all that information in one transaction, whereas snmpwalk will divide it up.

An snmpwalk command will take longer though.
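
As a hedged aside on why snmpwalk takes longer: in SNMPv2c a GETBULK response carries at most max-repetitions rows, so a bulk walk is normally several larger transactions rather than literally one, while a plain walk issues one GETNEXT per row. The back-of-the-envelope sketch below compares request counts. It assumes max-repetitions of 10 (believed to be the Net-SNMP default, adjustable with -Cr), ignores the final request that detects the end of the table, and uses illustrative entry counts.

```python
import math


def getnext_requests(entries):
    """snmpwalk style: one GETNEXT request per row returned."""
    return entries


def getbulk_requests(entries, max_repetitions=10):
    """snmpbulkwalk style: each GETBULK returns up to max_repetitions rows."""
    if max_repetitions < 1:
        raise ValueError("max_repetitions must be at least 1")
    return math.ceil(entries / max_repetitions)


if __name__ == "__main__":
    # Illustrative table sizes, not measurements from any particular switch.
    for entries in (61, 1000):
        print(f"{entries} entries: "
              f"{getnext_requests(entries)} GETNEXT requests vs "
              f"{getbulk_requests(entries)} GETBULK requests")
```

Fewer, larger requests mean more work for the agent per request, which fits the suspicion that the bulk variant is the one that stalls the management CPU.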

I've tried both myself, using the same OID as you, on C.09.22 and it went through fine, but I have only 61 MAC addresses on my 4000M. How many entries would you say you have on these switches?

Don't forget to assign points to posts that have helped you.
Magnus_18
Advisor

Re: Procurve 4000m management interface freezes

Actually, the latest firmware is C.09.26.
Cleary KingLoom
Occasional Visitor

Re: Procurve 4000m management interface freezes

I bought a 4000M at a flea market out of curiosity (paid 8 USD) - I hadn't seen the insides of this model before :)

Anyway, I have the same symptoms described here, plus some new ones:

1. During the start sequence, at most 3 modules can be fitted before the FAULT LED starts to blink. With 1-2 modules installed, the FAULT LED mostly does not blink. With 3 modules, the probability of the FAULT LED blinking for some of the modules is about 50%. I have not yet seen 4 modules installed without the FAULT LED blinking for some of them.

2. I ran dozens of experiments with various physical modules installed in various positions from A through J (I have six HP J4111A modules for the purpose).

It is rather random WHICH module triggers "FAULT" and in which position (A-J) it is installed. However, all modules are actually working and switching packets, even those that were flagged as faulty during the self-test sequence.

3. With all 6 modules installed, the LEDs light up OK, so the PSU is not the source of the problem.

4. After a restart, telnet and WWW initially work, but only for a short time. When these services are not accessed, a ping to the switch IP address keeps working for as long as 8 to 10 minutes before failing. When telnet or WWW is accessed, IP connectivity to the switch lasts 2-3 minutes at most. One single SNMP request freezes the network momentarily.

5. The SW version was C_09_26; I installed C_09_30. The change of software introduced Java. All the faulty behaviour continued. Yeah, I know, upgrading at that very moment was a big risk under these conditions ;)

SUMMARY - My guess is that the master module has some fault on it, something closely related to RAM stability. It could be a capacitor, an aged RAM chip, or even dust particles. Any use of TCP/IP services makes memory usage more aggressive, and then the module fails.

Nevertheless, it's nice to see that HP engineers have created a beast that is able to work even without its brains.
Cleary KingLoom
Occasional Visitor

Re: Procurve 4000m management interface freezes

OK, here are the error messages in the log:

System went down 10/10/09 14:41:00
Saved crash information:
NMI occured: IP=0x00290154 PCW:0x00000003 ACW 0x000001002 Task='eGeordiMon' Task pfp: 0x0099f190 sp:0x0099f450 rio:0x00290154

Different numbers on the next crash:
NMI occured: IP=0x002925cc PCW:0x00000003 ACW 0x000001004 Task='eGeordiMon' Task pfp: 0x0099f4d0 sp:0x00995450 rio:0x002925cc
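
For anyone comparing several of these crash records, here is a small sketch that splits an NMI line into its fields and shows which ones change between crashes. The field labels (IP, PCW, ACW, pfp, sp, rio) are taken straight from the log text and treated as opaque; their exact meaning is not documented here, and the regular expressions are only fitted to the two lines above.

```python
import re

# Hex-valued fields appear as "IP=0x..", "PCW:0x..", "ACW 0x.." or "pfp: 0x..".
HEX_FIELD = re.compile(r"\b(IP|PCW|ACW|pfp|sp|rio)\s*[=:]?\s*(0x[0-9a-fA-F]+)")
TASK_FIELD = re.compile(r"Task='([^']*)'")


def parse_nmi_line(line):
    """Return a dict of the hex fields (as ints) plus the task name."""
    record = {name: int(value, 16) for name, value in HEX_FIELD.findall(line)}
    task = TASK_FIELD.search(line)
    record["task"] = task.group(1) if task else None
    return record


def differing_fields(a, b):
    """Names of fields whose values differ between two parsed records."""
    return sorted(key for key in a if a[key] != b.get(key))


if __name__ == "__main__":
    first = parse_nmi_line(
        "NMI occured: IP=0x00290154 PCW:0x00000003 ACW 0x000001002 "
        "Task='eGeordiMon' Task pfp: 0x0099f190 sp:0x0099f450 rio:0x00290154")
    second = parse_nmi_line(
        "NMI occured: IP=0x002925cc PCW:0x00000003 ACW 0x000001004 "
        "Task='eGeordiMon' Task pfp: 0x0099f4d0 sp:0x00995450 rio:0x002925cc")
    print("task in both crashes:", first["task"], second["task"])
    print("fields that differ:", differing_fields(first, second))
```

On the two records above, the task name and PCW stay the same while the other fields change.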