Unrecoverable fault on PoE controller 4

Dennis_HP-ATP
Occasional Advisor

Hi guys,

Please be informed that the ProCurve 3800 switches, as well as modules for the 54XX and 82XX chassis, have a known issue regarding PoE on the ports.

The show log output looks like this:

W 09/12/16 15:22:51 00567 ports: ST1-CMDR: port 1/16 PD Other Fault indication.
W 09/12/16 15:23:13 00578 chassis: ST1-CMDR: Ports 1/1-24 Unrecoverable fault on PoE controller 4
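
For anyone who wants to scan saved logs for this event, here is a minimal Python sketch. The regular expressions are derived only from the two sample lines above, so the line layout on other firmware versions is an assumption, and the function name find_controller_faults is made up for illustration.

```python
import re

# Layout assumed from the two sample lines in this post:
# severity, date, time, event id, subsystem, message.
LOG_LINE = re.compile(
    r"^(?P<sev>[A-Z]) (?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<event_id>\d{5}) (?P<subsystem>\w+): (?P<message>.*)$"
)
CONTROLLER_FAULT = re.compile(
    r"Ports (?P<ports>\S+) Unrecoverable fault on PoE controller (?P<controller>\d+)"
)


def find_controller_faults(log_text):
    """Return one dict per 'Unrecoverable fault on PoE controller' event."""
    faults = []
    for line in log_text.splitlines():
        entry = LOG_LINE.match(line.strip())
        if entry is None:
            continue  # not a log line in the assumed layout
        fault = CONTROLLER_FAULT.search(entry.group("message"))
        if fault is None:
            continue  # some other event, e.g. "PD Other Fault indication"
        faults.append({
            "date": entry.group("date"),
            "time": entry.group("time"),
            "ports": fault.group("ports"),
            "controller": int(fault.group("controller")),
        })
    return faults


SAMPLE = """\
W 09/12/16 15:22:51 00567 ports: ST1-CMDR: port 1/16 PD Other Fault indication.
W 09/12/16 15:23:13 00578 chassis: ST1-CMDR: Ports 1/1-24 Unrecoverable fault on PoE controller 4
"""

for fault in find_controller_faults(SAMPLE):
    print(fault)
```

Run against a collection of saved show log outputs, this gives a quick list of which port ranges and controllers reported the fault, and when.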

From personal experience, I know that the following are affected:

HPE 3800 switches

HPE 54XX zl modules

HPE 82XX zl modules

The customer is running flash version K.15.18.0008.

Known affected product IDs are:

J9092A

J9534A

J9536A

 

Solution:

Replace the hardware and start an RMA.

8 REPLIES
Arimo
Respected Contributor

Re: Unrecoverable fault on PoE controller 4

Interesting post.

A "known issue" is one for which public documentation exists, meaning it is described in Release Notes (if fixed) or in Customer / Product advisories. I'm not able to find any such advisory, so no, it doesn't look like we have a "known issue". If such an advisory exists, please provide the link here.

Here's what to do if you run into an issue with a module:

  1. If you're running an outdated FW version (K.15.18.0008 is almost a year old), first check the Release Notes for a possible fix. If a fix is found, upgrade and re-test.
  2. If a fix isn't found, cross-test the module in a known good slot or a known good chassis. Testing with a current FW version is always a good idea, even if the device is afterwards reverted to the older version.
  3. Save the output of the show tech all command from the switch
  4. Contact support

 


HTH,

Arimo
HPE Networking Engineer
Dennis_HP-ATP
Occasional Advisor

Re: Unrecoverable fault on PoE controller 4

Hi Arimo,

Your statement is true, but there is more to the story ... So let me provide you with some more details about this, from my own experience.

This issue concerns a healthcare organization in the Netherlands and is still ongoing.

For the past 2 years, the production environment has been completely HP networking and consists of approx. 270 (former) ProCurve switches, a mix of 3800, 54xx and 8212 zl switches and linecards.

In the periodic reports with the customer, we noticed an unusually high number of incidents, all with the same (PoE-related) symptoms. That was the trigger to notify the vendor, and after several discussions HP agreed to proactively replace a batch of 24 3800 switches.

Fast forward to the current situation: the PoE incidents still pop up, and after a network scan in iMC with the Data Collection feature, I noticed in April 2016 that several more switches as well as 54xx and 8212 linecards were affected and needed to be replaced.

For this, HP referred to the warranty procedure, and we are replacing them continuously.

I've performed a lab test with a 54xx linecard (sh power-over-ethernet br), and the PoE status column showed the DISABLED state.

Using a PoE tester, I can confirm that some ports do deliver PoE but others (while on the same PoE controller) show no sign of PoE. Although 4 UTP ports share 1 controller, it looks like ports fail randomly.

Also, the POE-RESET for the PoE controllers did not have any effect.

For me, this is now a "known issue", and I'd like to share it with the rest of the community in case you advise your customers about new kit. Please take this matter into consideration.

Unfortunately, ProCurve is nowhere near the reliability of the Comware line.

Looks like I am not the only one sharing such findings: http://pyarra.blogspot.nl/2012/03/mysterious-case-of-hp-procurve-poe.html

 Rgds,

Dennis

doaren
Visitor

Re: Unrecoverable fault on PoE controller 4

I can confirm Dennis's observations: We have a fleet of about 40 5406 and we have already replaced about 12 modules in 2 years, mostly J9535A and some J9534A. We have run different firmware versions through time. All our switches are behind UPSes (not the same UPS unit for all switches).

For me, the failure rate is high enough to point to a hardware design problem with the device, hence a "known issue", whether it's recognized by the manufacturer or not. And I believe that HP recognizes this problem at least to a certain degree: they no longer ask us to send a "show tech all", they just send the replacement part once we mention the error in the log.

Cheers,

|_eo

 

parnassus
Honored Contributor

Re: Unrecoverable fault on PoE controller 4


doaren wrote: We have a fleet of about 40 5406 and we have already replaced about 12 modules in 2 years, mostly J9535A and some J9534A. We have run different firmware versions through time. All our switches are behind UPSes (not the same UPS unit for all switches).

Since the named modules are PoE/PoE+:

  • HPE 24-Port Gig-T PoE+ v2 zl Module (J9534A)
  • HPE 20-Port Gig-T PoE+ / 4 Port SFP v2 zl Module (J9535A)

and the issues reported are basically all related to PoE/PoE+, it would be interesting to discover whether there is any relationship with the power supply configuration used at each site [*].

In other words: are the experienced issues a problem of faulty PoE hardware on individual modules (showing a high hardware failure rate), or is there also a relationship with the power supply configuration deployed on each 5400R zl2 switch involved? What PoE planning, if any, was implemented?

Just curious.

[*] Through the usage (mixing is supported but not recommended) of:

  • one (or more) J9828A providing a base of 275W (254W if more than one is used) of PoE power.
  • one (or more) J9829A providing a base of 900W (832W if more than one is used) of PoE power.
  • one (or more) J9830A providing a base of 2500W (2312W if more than one PS is used) of PoE power.

See the HPE ArubaOS-Switch Power Over Ethernet (PoE/PoE+) Planning and Implementation Guide (November 2016).
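
As a rough illustration of how the figures above might feed into PoE planning, here is a small Python sketch. The wattages are the ones listed above; how they combine is an assumption on my part (every supply contributes the lower figure once more than one identical supply is installed, and the contributions simply add up), and estimated_poe_budget is a made-up helper name. Redundancy settings and mixed supply models are ignored, so check the planning guide before relying on the numbers.

```python
# Per-supply PoE wattage as listed above:
# (single supply, each supply when more than one is used).
POE_WATTS = {
    "J9828A": (275, 254),
    "J9829A": (900, 832),
    "J9830A": (2500, 2312),
}


def estimated_poe_budget(model, count):
    """Estimate total PoE watts for `count` identical supplies.

    ASSUMPTION: with more than one supply, every supply contributes the
    lower figure and the contributions add up. Redundancy modes and
    mixed supply models are not modelled.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    single, shared = POE_WATTS[model]
    return single if count == 1 else shared * count


for model in POE_WATTS:
    print(model, estimated_poe_budget(model, 1), estimated_poe_budget(model, 2))
```

Comparing such an estimate with the actual PoE load per site would be one way to tell a power planning problem apart from faulty module hardware.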

doaren
Visitor

Re: Unrecoverable fault on PoE controller 4

All switches here have 2x J9306A power supplies. None of them has failed so far.

We don't load modules beyond 50% of the available power (power-over-ethernet slot X threshold 50), and we try to distribute the load equally across all modules. We basically power only Polycom VoIP phones.

Also we have A/C in all switch locations.
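
To make the 50% rule concrete, here is a minimal Python sketch of the check. It only mirrors the arithmetic and does not talk to a switch; the function name slots_over_threshold and the readings are hypothetical, not values from our switches.

```python
def slots_over_threshold(slot_usage, threshold_percent=50):
    """Return slot names whose PoE draw exceeds threshold_percent of available power.

    slot_usage maps a slot name to (watts_drawn, watts_available).
    """
    over = []
    for slot, (drawn, available) in sorted(slot_usage.items()):
        if available <= 0:
            raise ValueError(f"slot {slot}: available power must be positive")
        if drawn / available * 100 > threshold_percent:
            over.append(slot)
    return over


# Hypothetical readings for three slots (watts drawn, watts available).
example = {"A": (120.0, 300.0), "B": (180.0, 300.0), "C": (150.0, 300.0)}
print(slots_over_threshold(example))
```

With these made-up numbers only slot B (60%) is over the limit; slot C sits exactly at 50% and is not flagged.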

Cheers,

|_eo

parnassus
Honored Contributor

Re: Unrecoverable fault on PoE controller 4

That definitely favours the theory of premature hardware failure in some PoE/PoE+ modules (cause: an engineering/manufacturing issue? an issue with the HP 5400/8200 zl backplane?), a theory that looks valid at least when those PoE/PoE+ modules are installed in the older HP ProCurve 5400/8200 zl Switch Series (not in the newer HPE/Aruba 5400R zl2 [*]).

[*] That series uses the PS units listed above and doesn't support the J9306A PS.

Arimo
Respected Contributor

Re: Unrecoverable fault on PoE controller 4

I just searched again, and still do not find any notifications about this kind of thing. There are some cases with this issue with different switch platforms, but I still can't find traces of a "known issue" as defined above. A few customers out of our whole user base experiencing an issue still doesn't qualify it as a "known issue", even if it's multiple times.

"And I believe that HP recognizes this problem at least to a certain degree: they no longer ask us to send a "show tech all", they just send the replacement part once we mention the error in the log."

If we did, there would be a public document about it. This isn't the way things should work unless TEC (L1) has been specifically instructed by management to replace without troubleshooting. If a TEC gets multiple calls on the same issue, he should alert higher support levels.

Class issues can only be determined by the HW lab. Let's assume there's a production line with some fault, producing parts of sub-par quality. A TEC guy on the phone gets some calls, replaces a few cards, and thinks "well, this seems to fix the problem"; hence he proceeds with replacing the cards without collecting the required data. The cases will never even be sent to RTCC (i.e. L2), so they will never end up in the HW lab. So the faulty production line keeps on producing, and sub-par parts keep being replaced. Replacement parts may also fail and get replaced again.

Instead of the modules, the root cause of the problem might also be the switch backplane; something in the power distribution system may be faulty. Or the problem might be in the local mains power or UPS; I've worked on this kind of situation myself several times. And this isn't by any means an exhaustive list of possibilities. The point is that if the cases don't end up in the lab, we will never be able to determine and eliminate the root cause. In the long run nobody benefits.

Hence the best way is to follow the steps I posted above. This way we get cases with verifiable, comparable data, and we can effectively address the issue. The more cases we get, the more evident this kind of issue will become.


HTH,

Arimo
HPE Networking Engineer
ictadmin
Occasional Visitor

Re: Unrecoverable fault on PoE controller 4

Since December I have changed 5 modules on a 5412 with the same issue that everyone is mentioning in this thread. There is definitely an issue. The firmware I am running is K.15.18.0013.