Switches, Hubs, and Modems

Les Ligetfalvy
Esteemed Contributor

16-port 10/100/1000 Module (J4907A) failure

I initially posted my experiences with these modules to another thread, http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=431664. Rather than hijack that thread further, I started this one and will repost what I had posted there Sept. 17th.

I recently received a bunch of 5308xl switches with this gig module and two of them were failing on my test bench. I was testing the power supply redundancy by unplugging/plugging one of the two power cords and sometimes a module would fail.
At no time was there any trap sent nor was there anything in the syslog. A soft boot would not bring back the module but would log the failure locally. An issue with the OS prevents boot-time errors from being sent as traps. A cold boot would resurrect the module.
HP replaced both modules and both chassis and I have not been able to reproduce the problem since.
The switches are still on the test bed and will not go in production until some software notification issues are resolved. I still have a nagging concern about the lack of notification when the module failed as that points to an OS issue that hardware replacement does not solve.
26 REPLIES
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Yesterday I had a third module exhibit the same symptoms. I am waiting for HP to arrive at a prognosis but I have to wonder if anyone else is experiencing the same failure. One module failure is easy to dismiss, two is suspicious but possible, but three is getting to look like a production run spec issue and if that were the case, others should be reporting this also.

I have put off deploying these seven 5308xl switches until we get to the bottom of the module failures and I see some progress on the lack of notification issue. Meanwhile, I am testing them further in the lab and have encountered a couple more issues for which I have separate threads.
SCOOTER
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Les,

Out of curiosity, what firmware are you running and do the J4907A modules fail in any slot or specific slots?

Regards,

SCOOTER
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Scooter,
I am running the latest (E.08.42) software.

As I had mentioned, it is only sometimes that they fail. On one switch it was in slot E and on the other it was in C. I was instructed by HP support to move the modules to a different slot and never saw a recurrence of the failures under the same conditions. Not satisfied that simply moving them was the solution to the problem, I pressed HP support and they suggested I put them back in their original slots for more testing. When I did, they failed immediately on boot. HP had me ship both modules and both chassis back to them so no further testing was done by me.

Now a third mondule on yet another switch failed in slot E as well but this one failed on boot and not during RPS testing (I did manage to ferret out a faulty RPS though but that is yet another topic). I am still working with HP on the issue and will do more tests to see if it will also fail under RPS cycling.

It is this random intermittent failing that really concerns me because there is no trap sent when it happens. I have yet another incident (2 actually) open with HP on the no trap notification issue.
Marc Villeneuve_1
Regular Advisor

Re: 16-port 10/100/1000 Module (J4907A) failure

Hi.
Just an idea I had while reading the thread!

Maybe one good thing you can do with this switch is put a UPS on it. If your power failures are short, you can buy a small UPS that does not cost a lot, and then you avoid the failure.

It's only a patch on the problem, but while you wait for HP, you can put the switches into production sooner.
SCOOTER
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Les,

I grabbed a 5308 running E.08.42.
2 power supplies
Slot A J4907A module
Slot B J4820A module
Slot E J4907A module

Configured only a Static IP to check eventlog.

Booted the switch, all fine.
Powercycled PS 1, all fine.
Powercycled PS 2, all fine.
Removed PS 1, reinserted PS 1, all fine.
Removed PS 2, reinserted PS 2, all fine.
Removed all power, booted, all fine.
Moved the J4907A modules around in the empty slots; all slots recognized the module and no faults on the slots or modules occurred.

Performed this procedure twice without any problems.

FYI:
S/N Switch SG419JZ030
S/N J4907A SG421PM0IU
S/N J4907A SG421PM040

Sorry, I could not reproduce your problem.

Kind regards,

SCOOTER
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

WOW Scooter,
I did not expect anyone except perhaps HP to go through that level of testing. I am not surprised that it would not fail for you. Most times it would not fail for me either!

This is a very obscure failure that is almost impossible to deliberately reproduce. It only manifests itself rarely but I have my eyes as a witness and show techs (after a warm boot) to prove it.

show running-config
; J4819A Configuration Editor; Created on release #E.08.42

hostname "MO PBX HP 5308xl"
snmp-server contact "Les Ligetfalvy"
snmp-server location "Main Office PBX Room"
time timezone -300
module 1 type J4907A
module 2 type J4907A
module 3 type J4907A
module 4 type J4907A
module 5 type J4907A
module 6 type J4907A
module 8 type J4907A
module 7 type J4907A
...
show modules

Status and Counters - Module Information

 Slot  Module Description                   Serial Number
 ----- ------------------------------------ --------------
 A     HP J4907A XL Gig-T/GBIC module       SG425PM0X6
 B     HP J4907A XL Gig-T/GBIC module       SG425PM0ZM
 D     HP J4907A XL Gig-T/GBIC module       SG425PM0PS
 E     HP J4907A XL Gig-T/GBIC module       SG425PM0M0
 F     HP J4907A XL Gig-T/GBIC module       SG425PM0LG
...
PowerDsine Show:

Slot 3

CRASHLogfileshow

slot 3:
-------
ERROR: slot 3 not ready

CRASHData

slot 3:
-------
ERROR: slot 3 not ready

poe_status_port all

slot 3:
-------
ERROR: slot 3 not ready

pdshow

slot 3:
-------
ERROR: slot 3 not ready
...
W 08/24/04 09:26:53 snmp: SNMP Security access violation from 10.198.10.12
I 08/24/04 09:26:54 tftp: Transfer completed
M 08/24/04 09:26:57 sys: 'Config updated via network tftp'
I 08/24/04 09:26:57 system: --------------------------------------------------
I 08/24/04 09:26:57 system: System went down: 08/24/04 09:26:57
I 08/24/04 09:26:57 system: Config updated via network tftp
I 08/24/04 09:27:02 lacp: Passive Dynamic LACP enabled on all ports
I 08/24/04 09:27:07 chassis: Slot A Inserted
I 08/24/04 09:27:07 chassis: Slot B Inserted
I 08/24/04 09:27:07 chassis: Slot C Inserted
I 08/24/04 09:27:07 chassis: Slot D Inserted
I 08/24/04 09:27:08 chassis: Slot E Inserted
I 08/24/04 09:27:08 dhcpr: DHCP relay agent feature enabled
I 08/24/04 09:27:08 chassis: Slot F Inserted
W 08/24/04 09:27:08 chassis: Power Supply failure: Supply: 2, Failures: 1
I 08/24/04 09:27:08 chassis: Slot A Downloading
I 08/24/04 09:27:08 tftp: Enable succeeded
I 08/24/04 09:27:08 system: System Booted.
I 08/24/04 09:27:08 cdp: CDP enabled
I 08/24/04 09:27:08 chassis: Slot B Downloading
I 08/24/04 09:27:09 chassis: Slot D Downloading
I 08/24/04 09:27:09 chassis: Slot E Downloading
I 08/24/04 09:27:09 chassis: Slot F Downloading
I 08/24/04 09:27:10 chassis: Slot A Download Complete
I 08/24/04 09:27:10 chassis: Slot B Download Complete
I 08/24/04 09:27:10 chassis: Slot D Download Complete
I 08/24/04 09:27:11 chassis: Slot E Download Complete
I 08/24/04 09:27:11 chassis: Slot F Download Complete
I 08/24/04 09:27:25 chassis: Slot A Ready
I 08/24/04 09:27:26 chassis: Slot F Ready
I 08/24/04 09:27:26 chassis: Slot E Ready
I 08/24/04 09:27:26 chassis: Slot D Ready
I 08/24/04 09:27:26 chassis: Slot B Ready
W 08/24/04 09:27:27 chassis: Module in Slot C not Supported or may be Faulty
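In the log above, every slot except C goes from Inserted through Downloading and Download Complete to Ready, and slot C is only reported as faulty at the end. As a rough sketch (not an HP tool), a saved copy of the event log could be scanned for slots that were inserted but never reached Ready. The function name `slots_not_ready`, the sample text, and the slot-letter range A-H are assumptions made for illustration; the message wording is copied from the log lines above.

```python
import re

# Message formats copied from the event log above, for example:
#   I 08/24/04 09:27:07 chassis: Slot A Inserted
#   W 08/24/04 09:27:27 chassis: Module in Slot C not Supported or may be Faulty
# The slot range A-H is an assumption about the chassis.
SLOT_EVENT = re.compile(
    r"chassis: Slot ([A-H]) (Inserted|Downloading|Download Complete|Ready)\s*$"
)
SLOT_FAULT = re.compile(
    r"chassis: Module in Slot ([A-H]) not Supported or may be Faulty"
)


def slots_not_ready(log_text):
    """Return sorted slot letters that were inserted but never logged
    'Ready', plus any slot explicitly reported as unsupported or faulty."""
    inserted, ready, faulted = set(), set(), set()
    for line in log_text.splitlines():
        event = SLOT_EVENT.search(line)
        if event:
            slot, state = event.groups()
            if state == "Inserted":
                inserted.add(slot)
            elif state == "Ready":
                ready.add(slot)
            continue
        fault = SLOT_FAULT.search(line)
        if fault:
            faulted.add(fault.group(1))
    return sorted((inserted - ready) | faulted)


# A shortened excerpt of the log above, used only as sample input.
SAMPLE_LOG = """\
I 08/24/04 09:27:07 chassis: Slot A Inserted
I 08/24/04 09:27:07 chassis: Slot B Inserted
I 08/24/04 09:27:07 chassis: Slot C Inserted
I 08/24/04 09:27:08 chassis: Slot A Downloading
I 08/24/04 09:27:08 chassis: Slot B Downloading
I 08/24/04 09:27:10 chassis: Slot A Download Complete
I 08/24/04 09:27:10 chassis: Slot B Download Complete
I 08/24/04 09:27:25 chassis: Slot A Ready
I 08/24/04 09:27:26 chassis: Slot B Ready
W 08/24/04 09:27:27 chassis: Module in Slot C not Supported or may be Faulty
"""

if __name__ == "__main__":
    print(slots_not_ready(SAMPLE_LOG))
```

This only works on a log captured after the fact (for example from a show tech), so it does not replace the missing trap; it just makes the odd slot easier to spot in a long log.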


I obviously left out a lot of the show tech. I am starting to wonder if my "SNMP Security access violation" issue (another thread) may be overflowing a buffer (stack) and the change of RPS status is what triggered it. This one may be a tough nut to crack.

As for the UPS suggestion, I would never consider putting cheapy UPSes on the switches but I do have expensive redundant UPSes in most of my rack rooms. Cheap UPSes are generally unmanaged and yet another point of failure.
OLARU Dan
Trusted Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Les,
I don't know why, but my feeling is that SCOOTER _IS_ connected to HP somehow.

Right, SCOOTER? C'mon, you can tell us!!!
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Looks like Scooter's not 'fessing up.

I made a typo when I said "Now a third mondule on yet another switch failed in slot E"...It was slot C and I have not been able to reproduce the error.

I did get word today from HP on the "trap notification failure on boot" issue and it sounds like I may be getting a fix for Christmas.
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

I feel like this is turning into a blog. :(

Yesterday I took all my switches up to 8.50 and decided I would take them from my test lab and press them into service since I have not been able to repro the module faulting mentioned above.

Well, today I mounted three of them into racks and meshed them together. I did not connect any servers or clients yet but did connect them to my Cisco. They were not up for an hour when a module faulted.

One of the three locations (L3) has only a single UPS so one of the two RPSs was connected to raw town power. There was a scheduled town power outage today and the one (and only) UPS dumped. When that happened, the switch at L3 rebooted and faulted module C. While there is no knowing where in this chain of events the module faulted, there was no trap thrown nor was there any syslog entry.
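With no trap and no syslog entry, one stopgap would be to capture the `show modules` output periodically and compare it against the slots that should be populated. The sketch below covers only the comparison step, assuming the output looks like the listing earlier in this thread; how the text is collected from the switch is left out, and the names `present_slots`, `missing_slots`, and the expected-slot list are hypothetical.

```python
def present_slots(show_modules_text):
    """Map slot letter -> (module part number, serial number) for each
    module row in 'show modules' output. Assumes rows look like:
        A HP J4907A XL Gig-T/GBIC module SG425PM0X6
    which is the format pasted earlier in this thread."""
    slots = {}
    for line in show_modules_text.splitlines():
        parts = line.split()
        # A module row starts with a single slot letter followed by "HP".
        if len(parts) >= 4 and len(parts[0]) == 1 and parts[0].isalpha() \
                and parts[1] == "HP":
            slots[parts[0].upper()] = (parts[2], parts[-1])
    return slots


def missing_slots(expected, show_modules_text):
    """Return the expected slot letters that do not appear in the output."""
    return sorted(set(expected) - set(present_slots(show_modules_text)))


# Sample input copied from the 'show modules' listing earlier in the thread.
SAMPLE_SHOW_MODULES = """\
Status and Counters - Module Information

Slot Module Description Serial Number
----- ------------------------------------ --------------
A HP J4907A XL Gig-T/GBIC module SG425PM0X6
B HP J4907A XL Gig-T/GBIC module SG425PM0ZM
D HP J4907A XL Gig-T/GBIC module SG425PM0PS
E HP J4907A XL Gig-T/GBIC module SG425PM0M0
F HP J4907A XL Gig-T/GBIC module SG425PM0LG
"""

if __name__ == "__main__":
    # The expected slots (A through F) are an assumption for this example.
    print(missing_slots("ABCDEF", SAMPLE_SHOW_MODULES))
```

Polling like this is only a patch; it would notice a module that has dropped out of the listing, but it says nothing about why, and it still depends on something outside the switch doing the checking.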

I guess my old Cisco core and Nortel edge switches will be around for a while longer.