BladeSystem - General

Re: VCM-OA Communication Down on FW v3.60

 
didwenger
Occasional Advisor

VCM-OA Communication Down on FW v3.60

Hi there,

 

I'm having an issue whereby the VCM loses connectivity with the OA, up to several times a day. It has now happened 40+ times since Dec 2012.

 

The chassis has a standard config:

  • C7000 Enclosure
  • 2x Flex-10 in Bay1/2
  • 2x FC 8 Gb in Bay3/4
  • 8 blades, all BL460c G7 except one BL460c G8
  • FW level is 3.60 on OA and VC
  • FW/driver levels on servers as per SPP 2012.06
  • Mix of ESXi 5.0 U2 and Windows 2008 R2

From the VC log, I can see that after the VCM reconnects to the OA (usually 3 minutes later), the VCM seems to "re-discover" the content of the enclosure, including how many blades are present, what their power state is and whether they have a profile assigned.
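To put numbers on how often this happens and how long each outage lasts, the down/reconnect pairs can be pulled out of an exported log. The following is only a minimal sketch: the line layout (an ISO timestamp, a space, then the message) and the "Restored" marker text are assumptions made for illustration, not the actual VCM log format, so both markers are left as parameters.

```python
from datetime import datetime


def outage_durations(lines,
                     down_marker="VCM-OA Communication Down",
                     up_marker="VCM-OA Communication Restored"):
    """Pair each 'down' line with the next 'up' line.

    Returns the outage durations in seconds. The marker strings and the
    'TIMESTAMP MESSAGE' layout are hypothetical; adjust them to the real log.
    """
    durations = []
    down_at = None
    for line in lines:
        stamp, _, message = line.partition(" ")
        if down_marker in message and down_at is None:
            down_at = datetime.fromisoformat(stamp)
        elif up_marker in message and down_at is not None:
            up_at = datetime.fromisoformat(stamp)
            durations.append((up_at - down_at).total_seconds())
            down_at = None
    return durations


# Made-up sample lines, for illustration only.
sample = [
    "2013-02-11T03:14:07 VCM-OA Communication Down",
    "2013-02-11T03:17:10 VCM-OA Communication Restored",
    "2013-02-11T09:00:00 Some other event",
    "2013-02-11T21:40:30 VCM-OA Communication Down",
    "2013-02-11T21:43:30 VCM-OA Communication Restored",
]
print(outage_durations(sample))
```

Lines that match neither marker are skipped, and a "down" with no matching "up" is simply left unpaired.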

 

95% of the time, the blades do not suffer any I/O interruption, but sometimes during that "discovery" phase one or two blades are not identified although they are running, and as a result these servers immediately lose all NIC/FC connections. The VC profile of these servers will show as <Not Present>. At this point, your only solution is to force a power off from the iLO; upon reboot, the VC profile loads just fine and the box regains NIC/FC connectivity.

 

Questions:

  • Has anyone seen this before?
  • Is the VCM-OA communication happening inside the chassis, or does it go out to the external switch and back in?

 

So far HP tells me they can't spot anything wrong...so any help would be much appreciated :-)

 

Did

15 REPLIES
Psychonaut
Respected Contributor

Re: VCM-OA Communication Down on FW v3.60

I am running OA 3.60 and VC 3.70. I am seeing the same behavior on multiple domains, but fortunately without the outages you are seeing, and not as often: the logs show this happening once per domain at random times. A couple of the domains are in the same row and connected to the same switches, yet the drops happen at separate times.

 

So that would lead me to guess that it's not something on the switch.  

 

Do you have the OA connected to a 1GbE switch?  Is there a lot of traffic (ISO load to a server or something) when you see those drops?

Michael Leu
Honored Contributor

Re: VCM-OA Communication Down on FW v3.60

On the BL460c G7, I would recommend updating the LOM firmware to the latest version, as IMHO many horrible bugs have been fixed.

 

Easiest way to update is probably with the offline or online ISOs:

http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareIndex.jsp?lang=en&cc=us&prodNameId=4324630&prodTypeId=329290&prodSeriesId=4324629&swLang=8&taskId=135&swEnvOID=54

didwenger
Occasional Advisor

Re: VCM-OA Communication Down on FW v3.60

Hi Psychonaut,

 

Thanks for your reply. In a way I'm glad to hear we're not alone with this problem. Like I said, 95% of the time we don't see I/O disconnections when it happens, but this behavior is not normal, and a "VCM-OA Communication Down" event is clearly marked as Critical in the VCM logs.

 

I've logged a call with HP, but so far they refuse to admit there's something wrong with VCM and they're going the route of replacing the motherboard on the BL460c G8. Since Monday I have switched the management of the VCM from interconnect #1 to #2 using the "reset vcm -failover" command and I haven't seen any errors yet, but it might be too soon.

 

The chassis has 1x 10 Gbit connection per Flex-10 module (bays 1-2) and some other 1 Gbit links on the pass-through modules (bays 5-6). There should be plenty of bandwidth available for the blades to use but I have to admit I didn't measure the load on the uplinks to prove my point.
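For anyone who does want to measure the uplink load, average utilization can be worked out from two readings of an interface octet counter taken a known interval apart. This is a rough sketch under assumptions: it presumes the upstream switch exposes such a counter (for example the standard IF-MIB `ifHCInOctets`), and the function name and the sample numbers are made up for illustration.

```python
def link_utilization(octets_start, octets_end, interval_s, link_bps,
                     counter_bits=64):
    """Average link utilization (0.0 to 1.0) between two counter samples.

    octets_start / octets_end: octet counter readings (hypothetical inputs)
    interval_s: seconds between the two readings
    link_bps: link speed in bits per second
    counter_bits: counter width, used to absorb a single counter wrap
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Modulo arithmetic handles one wrap of the counter between samples.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    return (delta_octets * 8) / (interval_s * link_bps)


# Made-up example: 15 GB transferred in 60 s on a 10 Gbit uplink.
print(link_utilization(0, 15_000_000_000, 60, 10_000_000_000))
```

This gives an average over the sampling interval only, so short bursts between samples will not show up.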

 

The OAs are connected at 1 Gbit and seem fine; they're constantly monitored by HP SIM and I've never seen them go down. Additionally, the HP technician I've talked to tells me that VC-OA communication happens intra-chassis, so it does not need to go out and back in through an external network switch.

 

I'll move this G8 blade to a different chassis this week and I'll see if the problem follows the blade...

didwenger
Occasional Advisor

Re: VCM-OA Communication Down on FW v3.60

Hi Michael,

Thanks for your reply. All the blades were updated using HP SUM and the SPP 2012.06 bundle to make sure that firmware and drivers for all OS are part of the same family. With regards to ESXi, I've aligned the NIC/HBA/Smart Array drivers with the "HP-VMware Software Recipe" doc from June 2012 (http://vibsdepot.hp.com/hpq/recipes/June2012VMwareRecipe1.0.pdf).
Michael Leu
Honored Contributor

Re: VCM-OA Communication Down on FW v3.60

There is this one issue that I've had with the old BL460c G7 NIC firmware, but only after upgrading VC: HP NC55X Adapters - FIRMWARE UPDATE RECOMMENDED: Device Control Channel (DCC) May Be Unavailable with a 10Gb Physical Link (c036000279)

Maybe your affected G7 blades are having these DCC issues? It's just a guess...

didwenger
Occasional Advisor

Re: VCM-OA Communication Down on FW v3.60

Thanks Michael, I did not know about that DCC problem; definitely something to watch out for. After our G7 boxes were patched with the SPP 2012.06 FW/driver bundle, they seem to be fine throughout the VCM-OA NO_COMM storm; in fact, there are no traces of disconnection in the event logs. Recently only the BL460c G8 running ESXi 5.0 U1 gets disconnected. However, since the G8 and G7 share the same firmware for the Emulex 10 Gb adapter (although they are different models), I think the advisory applies to G8s as well.

Interestingly I noticed that HP just released the Feb 2013 Software Recipe doc in which they recommend Emulex NIC FW 4.2.401.6...together with VC FW 3.75
Psychonaut
Respected Contributor

Re: VCM-OA Communication Down on FW v3.60

My firmware is newer: we are running 4.1.450.16, and we are at SPP 2012.02 for the servers (G7s). I would suggest updating your firmware; hopefully that fixes your problem.

 

It still makes no sense that we are seeing the OA and VC lose communication. If it's internal, how is that possible?

Michael Leu
Honored Contributor

Re: VCM-OA Communication Down on FW v3.60

Hmm, another wild guess: maybe there are duplicate OA IPs assigned on the network where the OAs/VCs are, which might lead to VC sometimes not getting the expected reply and losing communication? Or your OAs are failing over all the time because of a flaky OA cable?

 

So when the VC domain gets re-imported after the NO_COMM to OA, some blades might lose I/O if their iLOs are hung or not reachable at that moment. But I have only had this happen with multi-bay Integrity servers; not sure if it also applies to ProLiants.

didwenger
Occasional Advisor

Re: VCM-OA Communication Down on FW v3.60

Switching the VC management to interconnect module #2 did nothing for us, as I just spotted 2x NO_COMM errors during the night (no impact on the servers this time). I moved the G8 blade to another chassis today to see if the disconnections follow the server. If the problem happens again, I'll definitely upgrade the Emulex FW.

 

HP is asking me to upgrade to VC FW 3.70 and now admits there's a "DSS (direct) communication problem between OA and VCM". They now have 2 support engineers from the BladeSystem and Networking depts looking at the issue. I've also told them the same errors appear on FW 3.70.

 

If this happens to anyone out there, you can always reference my case #4693499393.

 

Oh, by the way, I've checked all our OA and VC IPs: they're all unique and statically assigned, no conflict on the horizon.
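
For a larger estate, the same uniqueness check can be scripted against an inventory list. A minimal sketch, assuming the OA/VC addresses have been exported as (device name, IP) pairs; the device names and addresses below are invented, and this only validates the inventory, it does not detect a rogue device answering on the live network.

```python
from collections import defaultdict
import ipaddress


def find_duplicate_ips(assignments):
    """Return {ip: [device names]} for every IP assigned more than once.

    assignments: iterable of (device_name, ip_string) pairs (hypothetical
    inventory export). Addresses are parsed so that formatting differences
    such as stray whitespace do not hide a duplicate.
    """
    by_ip = defaultdict(list)
    for name, ip in assignments:
        by_ip[ipaddress.ip_address(ip.strip())].append(name)
    return {str(ip): names for ip, names in by_ip.items() if len(names) > 1}


# Invented inventory using documentation-range addresses.
inventory = [
    ("enc1-oa1", "192.0.2.10"),
    ("enc1-oa2", "192.0.2.11"),
    ("enc1-vc1", "192.0.2.20"),
    ("enc2-oa1", "192.0.2.10 "),  # clashes with enc1-oa1
]
print(find_duplicate_ips(inventory))
```

An empty result means every address in the list is unique.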