Switches, Hubs, and Modems

16-port 10/100/1000 Module (J4907A) failure

Les Ligetfalvy
Esteemed Contributor

16-port 10/100/1000 Module (J4907A) failure

I initially posted my experiences with these modules to another thread: http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=431664. Rather than hijack that thread further, I started this one and will repost what I had posted there on Sept. 17th.

I recently received a bunch of 5308xl switches with this gig module and two of them were failing on my test bench. I was testing the power supply redundancy by unplugging/plugging one of the two power cords and sometimes a module would fail.
At no time was there any trap sent nor was there anything in the syslog. A soft boot would not bring back the module but would log the failure locally. An issue with the OS prevents boot-time errors from being sent to the trap. A cold boot would resurrect the module.
HP replaced both modules and both chassis and I have not been able to reproduce the problem since.
The switches are still on the test bed and will not go in production until some software notification issues are resolved. I still have a nagging concern about the lack of notification when the module failed as that points to an OS issue that hardware replacement does not solve.
26 REPLIES
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Yesterday I had a third module exhibit the same symptoms. I am waiting for HP to arrive at a prognosis but I have to wonder if anyone else is experiencing the same failure. One module failure is easy to dismiss, two is suspicious but possible, but three is getting to look like a production run spec issue and if that were the case, others should be reporting this also.

I have put off deploying these seven 5308xl switches until we get to the bottom of the module failures and I see some progress on the lack of notification issue. Meanwhile, I am testing them further in the lab and have encountered a couple more issues for which I have separate threads.
SCOOTER
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Les,

Out of curiosity, what firmware are you running and do the J4907A modules fail in any slot or specific slots?

Regards,

SCOOTER
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Scooter,
I am running the latest (E.08.42) software.

As I had mentioned, it is only sometimes that they fail. On one switch it was in slot E and on the other it was in C. I was instructed by HP support to move the modules to a different slot and never saw a recurrence of the failures under the same conditions. Not satisfied that simply moving them was the solution to the problem, I pressed HP support and they suggested I put them back in their original slots for more testing. When I did, they failed immediately on boot. HP had me ship both modules and both chassis back to them so no further testing was done by me.

Now a third module on yet another switch failed in slot E as well, but this one failed on boot and not during RPS testing (I did manage to ferret out a faulty RPS though, but that is yet another topic). I am still working with HP on the issue and will do more tests to see if it will also fail under RPS cycling.

It is this random intermittent failing that really concerns me because there is no trap sent when it happens. I have yet another incident (2 actually) open with HP on the no trap notification issue.
Marc Villeneuve_1
Regular Advisor

Re: 16-port 10/100/1000 Module (J4907A) failure

Hi.
Just an idea I had while reading the thread!

Maybe one good thing you can do with this switch is put a UPS on it. If your power failures are short, you can buy a small UPS that does not cost a lot, and then you avoid the failure.

It's only a patch on the problem, but while you wait for HP, you can get the switches into production faster.
SCOOTER
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Les,

I grabbed a 5308 running E.08.42.
2 power supplies
Slot A J4907A module
Slot B J4820A module
Slot E J4907A module

Configured only a Static IP to check eventlog.

Booted the switch, all fine.
Powercycled PS 1, all fine.
Powercycled PS 2, all fine.
Removed PS 1, reinserted PS 1, all fine.
Removed PS 2, reinserted PS 2, all fine.
Removed all power, booted, all fine.
Moved the J4907A modules around in the empty slots; all slots recognized the module and no faults on the slots or modules occurred.

Performed this procedure twice without any problems.

FYI:
S/N Switch SG419JZ030
S/N J4907A SG421PM0IU
S/N J4907A SG421PM040

Sorry, I could not reproduce your problem.

Kind regards,

SCOOTER
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

WOW Scooter,
I did not expect anyone except perhaps HP to go through that level of testing. I am not surprised that it would not fail for you. Most times it would not fail for me either!

This is a very obscure failure that is almost impossible to deliberately reproduce. It only manifests itself rarely but I have my eyes as a witness and show techs (after a warm boot) to prove it.

show running-config
; J4819A Configuration Editor; Created on release #E.08.42

hostname "MO PBX HP 5308xl"
snmp-server contact "Les Ligetfalvy"
snmp-server location "Main Office PBX Room"
time timezone -300
module 1 type J4907A
module 2 type J4907A
module 3 type J4907A
module 4 type J4907A
module 5 type J4907A
module 6 type J4907A
module 8 type J4907A
module 7 type J4907A
...
show modules

Status and Counters - Module Information

Slot Module Description Serial Number
----- ------------------------------------ --------------
A HP J4907A XL Gig-T/GBIC module SG425PM0X6
B HP J4907A XL Gig-T/GBIC module SG425PM0ZM
D HP J4907A XL Gig-T/GBIC module SG425PM0PS
E HP J4907A XL Gig-T/GBIC module SG425PM0M0
F HP J4907A XL Gig-T/GBIC module SG425PM0LG
...
PowerDsine Show:

Slot 3

CRASHLogfileshow

slot 3:
-------
ERROR: slot 3 not ready

CRASHData

slot 3:
-------
ERROR: slot 3 not ready

poe_status_port all

slot 3:
-------
ERROR: slot 3 not ready

pdshow

slot 3:
-------
ERROR: slot 3 not ready
...
W 08/24/04 09:26:53 snmp: SNMP Security access violation from 10.198.10.12
I 08/24/04 09:26:54 tftp: Transfer completed
M 08/24/04 09:26:57 sys: 'Config updated via network tftp'
I 08/24/04 09:26:57 system: --------------------------------------------------
I 08/24/04 09:26:57 system: System went down: 08/24/04 09:26:57
I 08/24/04 09:26:57 system: Config updated via network tftp
I 08/24/04 09:27:02 lacp: Passive Dynamic LACP enabled on all ports
I 08/24/04 09:27:07 chassis: Slot A Inserted
I 08/24/04 09:27:07 chassis: Slot B Inserted
I 08/24/04 09:27:07 chassis: Slot C Inserted
I 08/24/04 09:27:07 chassis: Slot D Inserted
I 08/24/04 09:27:08 chassis: Slot E Inserted
I 08/24/04 09:27:08 dhcpr: DHCP relay agent feature enabled
I 08/24/04 09:27:08 chassis: Slot F Inserted
W 08/24/04 09:27:08 chassis: Power Supply failure: Supply: 2, Failures: 1
I 08/24/04 09:27:08 chassis: Slot A Downloading
I 08/24/04 09:27:08 tftp: Enable succeeded
I 08/24/04 09:27:08 system: System Booted.
I 08/24/04 09:27:08 cdp: CDP enabled
I 08/24/04 09:27:08 chassis: Slot B Downloading
I 08/24/04 09:27:09 chassis: Slot D Downloading
I 08/24/04 09:27:09 chassis: Slot E Downloading
I 08/24/04 09:27:09 chassis: Slot F Downloading
I 08/24/04 09:27:10 chassis: Slot A Download Complete
I 08/24/04 09:27:10 chassis: Slot B Download Complete
I 08/24/04 09:27:10 chassis: Slot D Download Complete
I 08/24/04 09:27:11 chassis: Slot E Download Complete
I 08/24/04 09:27:11 chassis: Slot F Download Complete
I 08/24/04 09:27:25 chassis: Slot A Ready
I 08/24/04 09:27:26 chassis: Slot F Ready
I 08/24/04 09:27:26 chassis: Slot E Ready
I 08/24/04 09:27:26 chassis: Slot D Ready
I 08/24/04 09:27:26 chassis: Slot B Ready
W 08/24/04 09:27:27 chassis: Module in Slot C not Supported or may be Faulty


I obviously left out a lot of the show tech. I am starting to wonder if my "SNMP Security access violation" issue (another thread) may be overflowing a buffer (stack) and the change of RPS status is what triggered it. This one may be a tough nut to crack.
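
For anyone who wants to check their own logs for the same pattern (a slot that logs "Inserted" but never reaches "Ready"), here is a minimal Python sketch. It assumes the event log has been saved as plain text in the format quoted above and covers a single boot. The function name `slots_not_ready` and the shortened sample log are illustrative only, not anything from HP.

```python
import re

# Matches chassis slot events in the log format quoted above, e.g.
# "I 08/24/04 09:27:07 chassis: Slot A Inserted"
SLOT_EVENT = re.compile(
    r"chassis: Slot ([A-H]) (Inserted|Downloading|Download Complete|Ready)\s*$"
)


def slots_not_ready(log_text):
    """Return sorted slot letters that logged 'Inserted' but never 'Ready'.

    Assumes log_text holds the entries for one boot only.
    """
    inserted, ready = set(), set()
    for line in log_text.splitlines():
        match = SLOT_EVENT.search(line)
        if not match:
            continue
        slot, event = match.groups()
        if event == "Inserted":
            inserted.add(slot)
        elif event == "Ready":
            ready.add(slot)
    return sorted(inserted - ready)


# Shortened sample modeled on the log above (slots B and C only).
sample = """\
I 08/24/04 09:27:07 chassis: Slot B Inserted
I 08/24/04 09:27:07 chassis: Slot C Inserted
I 08/24/04 09:27:08 chassis: Slot B Downloading
I 08/24/04 09:27:10 chassis: Slot B Download Complete
I 08/24/04 09:27:26 chassis: Slot B Ready
W 08/24/04 09:27:27 chassis: Module in Slot C not Supported or may be Faulty
"""
print(slots_not_ready(sample))
```

On the sample this flags slot C, which matches the "not Supported or may be Faulty" warning at the end of the log.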

As for the UPS suggestion, I would never consider putting cheapy UPSes on the switches but I do have expensive redundant UPSes in most of my rack rooms. Cheap UPSes are generally unmanaged and yet another point of failure.
OLARU Dan
Trusted Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Les,
I don't know why, but my feeling is that SCOOTER _IS_ connected to HP somehow.

Right, SCOOTER? C'mon, you can tell us!!!
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Looks like Scooter's not 'fessing up.

I made a typo when I said "Now a third module on yet another switch failed in slot E"... It was slot C and I have not been able to reproduce the error.

I did get word today from HP on the "trap notification failure on boot" issue and it sounds like I may be getting a fix for Christmas.
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

I feel like this is turning into a blog. :(

Yesterday I took all my switches up to 8.50 and decided I would take them from my test lab and press them into service since I have not been able to repro the module faulting mentioned above.

Well, today I mounted three of them into racks and meshed them together. I did not connect any servers or clients yet but did connect them to my Cisco. They were not up for an hour when a module faulted.

One of the three locations (L3) has only a single UPS so one of the two RPSs was connected to raw town power. There was a scheduled town power outage today and the one (and only) UPS dumped. When that happened, the switch at L3 rebooted and faulted module C. While there is no knowing where in this chain of events the module faulted, there was no trap thrown nor was there any syslog entry.
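
Since the switch raises neither a trap nor a syslog entry, one stopgap is to poll the module list and compare it with what should be installed. The following is only a sketch: it assumes the text of `show modules` has already been captured somehow (how you fetch it is left out), that a faulted module simply has no row in that output as in the listing earlier in this thread, and that the expected slot letters are known. The function name `missing_slots` is made up for illustration.

```python
def missing_slots(show_modules_output, expected_slots):
    """Return expected slot letters that have no row in 'show modules' output.

    expected_slots is any iterable of slot letters, e.g. "ABCDEF".
    """
    present = set()
    for line in show_modules_output.splitlines():
        fields = line.split()
        # Module rows start with a single slot letter (A-H on an 8-slot
        # chassis) followed by the module description.
        if len(fields) >= 2 and fields[0] in tuple("ABCDEFGH"):
            present.add(fields[0])
    return sorted(set(expected_slots) - present)


# Rows copied from the 'show modules' output posted earlier in this thread.
sample = """\
 Status and Counters - Module Information

  Slot  Module Description                   Serial Number
  ----- ------------------------------------ --------------
  A     HP J4907A XL Gig-T/GBIC module       SG425PM0X6
  B     HP J4907A XL Gig-T/GBIC module       SG425PM0ZM
  D     HP J4907A XL Gig-T/GBIC module       SG425PM0PS
  E     HP J4907A XL Gig-T/GBIC module       SG425PM0M0
  F     HP J4907A XL Gig-T/GBIC module       SG425PM0LG
"""
print(missing_slots(sample, "ABCDEF"))
```

A non-empty result could then be fed to whatever alerting is already in place. This only detects a module that has dropped out of the list; it says nothing about why.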

I guess my old Cisco core and Nortel edge switches will be around for a while longer.
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Update:
The module that faulted is a replacement module from HP. The original module and chassis were returned to HP. The replacement chassis took a few weeks to arrive, so I ended up taking my spare off the shelf and placing the replacement on the shelf as a spare when it finally arrived.

So, to summarize... I have had three failures, two of which HP has replaced. All three chassis were close to each other in serial number (I had ordered seven at one time).
They end in:
08S - current failure
090 - replaced
099 - replaced

The remaining siblings are:
09C
0BY
0KJ
0KM

The units that failed were populated with six or more J4907A modules. The remaining units have fewer modules.
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

blogging...
Tried running some diags but the reboot into diag mode cleared the fault. :(

I removed the switch from its service location and hauled it back to my test bench. I tried 21 times to fault it by cycling RPS2 but to no avail.

One thing I have noticed is that in every case where there was a module fault, the switch had six or eight modules, and that the fault followed closely after a cycling of RPS2. I inserted two more modules to stack the odds in my favor, taking the total to eight, and cycled RPS2 13 more times.

I have the faulting 5308xl connected via module B port 7 to my 2524 on port 1. My WUG syslog shows this port #1 toggling off and on several times one minute later for the next minute. This I believe is when the switch module C faulted/crashed/rebooted.

I moved all the modules to another chassis (maintaining slot for slot) with different RPSs and will continue the testing.
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

I also got the module to fault in another chassis with different RPSs. Next I will put another identical module in slot C and try to fault it.
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Well... I must say that I am not impressed. Eight days after reporting this (10/21/04), I am still waiting for a replacement and if (big IF) it were to arrive on Monday, it will be eleven. I am also not impressed that it took until this week for HP to start looking at the original two faulted units that I sent out on 9/21/04.

Now if I started counting when the first fault was reported on 8/25/04... my $100,000 that is tied up in these switches that I have yet to put in production...

If I get chastised for saying "the emperor has no clothes" then so be it. I have run out of patience! Like Dan said in another thread, have I paid $100,000 for you, folks at HP, to sleep well?
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Well... I may as well update this blog. I got word on the results of the forensic testing commissioned by HP Division, and was told that there was an off-spec component. There is no plan to recall any other modules so I assume any possible future failures would be processed as regular warranty replacements through normal channels.
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Yesterday I had yet another module fail. This module had no connections to it so one cannot attribute the failure to outside forces like ESD. I have had so many issues with these switches that the deployment has halted. This particular switch presently serves only to perpetuate the mesh. It has three 16-port gig modules that were to serve workstations on our production floor but for obvious reasons, I halted the migration. Beside this HP switch, I still have my Nortel Baystack switches serving the production wrapline.

Every time my confidence starts to grow to the point I consider resuming the migration, along comes another speedbump. How do other people deploy this product in a 24/7 production environment and sleep at night (or do they)?
William_169
Occasional Advisor

Re: 16-port 10/100/1000 Module (J4907A) failure

Hi Les,

Are the problems with the 5308XL solved?
Is this a bad switch?
Do others have the same issues as yours?
What is the equivalent in Cisco?
Thanks.

William
william@gndt.com
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

William,
I really don't know how to answer your question. If I really said how I felt right now, my comments would be censored. See also my other posts:
http://forums1.itrc.hp.com/service/forums/pageList.do?userId=CA1185869&admit=716493758+1113785918617+28353475&listType=question&forumId=1

You will just have to draw your own conclusions.
William_169
Occasional Advisor

Re: 16-port 10/100/1000 Module (J4907A) failure

We have several 5372XL with J4820A and J4907A. So far, we haven't had many problems. The one problem we had was that the J4820A modules were the "older" version, and HP replaced them all quickly.
Do you think your problems are related to the meshing?
Did you have such problems with only the 10/100/1000 J4907A modules?
What if you only use the 10/100 J4820A?
Thanks.

Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

William,
The J4907A modules are the only copper modules that I use. I purchased 8 of the J8167A bundles (HP ProCurve Switch 5308XL-48G 8-slot chassis with three 16-port modules pre-installed) so that gives me 24 of these gig modules and I have had 4 of them go bad already. HP says that statistically I should not be experiencing this many failures. With these odds, I should have put the $100,000 on lottery tickets!
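
To put a rough number on "statistically I should not be experiencing this many failures", here is a small Python sketch of the binomial arithmetic. The 2% per-module failure rate is purely a hypothetical placeholder (HP quoted no figure in this thread), and the calculation assumes modules fail independently, which an off-spec component in one production run would violate.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): k or more failures among n modules."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


ASSUMED_FAILURE_RATE = 0.02  # hypothetical per-module rate, NOT an HP figure

# 4 or more failures out of the 24 modules mentioned above
print(f"{prob_at_least(4, 24, ASSUMED_FAILURE_RATE):.4f}")
```

Under those assumptions the chance of seeing four or more bad modules out of 24 comes out very small, which is the kind of result that points toward a shared cause rather than bad luck. Plug in a different rate to see how sensitive the conclusion is.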

I have several J4852A 12-port 100FX modules that I use to connect to my Cisco core switch and my Nortel Baystack edge switches. I already have 100 meg copper on my Nortels so there is no reason for me to put in HP 100 meg copper modules.

As to whether these modules are responsible for my mesh problem, that is the 100,000 dollar question that even a team of engineers at HP Division cannot answer.
William_169
Occasional Advisor

Re: 16-port 10/100/1000 Module (J4907A) failure

What Cisco core switch do you use?
Are you planning to use the HP 5308XL as your core?
We have had 4 of the J8167A (J4907A bundles) for over 6 months now, and they have seemed fine.
We plan to use two of them in an XRRP config.
How do you like PCM 1.6?
Does it help you with configuration and monitoring?
What software do you use to monitor your servers and your network equipment?
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

I plan to retire my Cisco Catalyst 5000 core switch, not because of any problems with it but because the backplane does not have the bandwidth for gig and the Cat5000 is at EOL.

The cost of LH GBICs and the way my multi-mode fibre is strung over two kilometres makes it impractical to have a true core switch which is why I decided to build a mesh instead.

I do use PCM+ but to be honest, it still looks like a beta to me. I also use WhatsUp Pro and a suite of Fluke products to manage my network.
William_169
Occasional Advisor

Re: 16-port 10/100/1000 Module (J4907A) failure

Hi Les,
Thanks and good luck.
Les Ligetfalvy
Esteemed Contributor

Re: 16-port 10/100/1000 Module (J4907A) failure

Unfortunately, I need more than good luck. I had a 5th module fail me. :`(
William_169
Occasional Advisor

Re: 16-port 10/100/1000 Module (J4907A) failure

My J4907A is still working.
I am beginning to think it may be the meshing that causes the problems.