
Mike O.
Regular Advisor

Virtual machines on blades seeing intermittent PING/ARP issues.

We are having a problem with some of our virtual machines running on our blades intermittently losing communication with each other, and I’m at a loss as to the source.

 

We have about 250 VMs running on about 20 HP BL465c blades installed in two HP c7000 chassis, using the HP Virtual Connect interconnect modules.  The blades are G7 machines with the Flex-10 NICs, but the interconnect modules are the 1/10G, non-Flex-10 modules.  The blade chassis are connected to our core Cisco 6500 switches.  The VMware hosts are at 5.0; the guest VMs are a mix of Windows 2003, 2008, and 2008 R2.

 

We've had this configuration for a couple of years.  The firmware in the chassis, interconnect modules, and blades was updated in one chassis about a month back, and we're one or two versions back on the other one.

 

What’s going on is that everything seems to be OK, but then out of nowhere, we will get communication failures between specific machines.    It looks like it’s an ARP issue.  Using PING, it works fine in one direction, but we get an “unreachable” error when going the other way, unless we ping from the target back to the source first.

 

For example: we have servers, “A” and “B”.   Ping A to B fails with “unreachable”. Ping “B” to “A” works fine.   However after pinging “B” to “A”, we can now ping “A” to “B”, at least for a while until the entry falls out of the ARP cache.  If we go into server “A” and set a static ARP entry (“arp –s”) for server “B”, everything works OK.  Through all this both server “A” and server “B” have no issues communicating with any other machines.

 

We tried using vMotion to move the servers to a different host, different blade chassis, etc.  Nothing worked except putting both VMs on the same host.  Then everything worked OK.  When we moved one of the servers to a different host, the problem came back.

 

It seems like either the ARP broadcast from the one server, or the reply back from the target, isn't making it through.  However, according to our networking group, there are no issues showing up on the Cisco switches.

 

Early this year, we had an issue where it happened on about a third of machines at the same time (it caused significant outages to production systems!).   It seemed like it was limited to machines on one chassis (but not all of the machines on that chassis).  At that time, we opened up tickets with VMWare and HP.  Neither found anything wrong with our configurations, but somewhere in the various server moves, configuration resets, etc., everything started working.

 

Since that time we’ve seen it very intermittently on a few machines, but then it seems to go away after a few days.

 

The issue we found today was that the server we’re using for the Microsoft WSUS server hadn’t been receiving updates from a couple of the member servers.  We could ping from the WSUS to the member server, but not back from the member server unless we put a static ARP entry in the member server.  The member servers are working fine otherwise, talking to other machines OK, etc.   They are a production environment, so we’re limited on the testing we can do.

 

Also, when it has happened, it seems to have always been between machines on the same subnet.  However, most of our servers are on the same subnet, so it might just be coincidence.

 

I’ve done a lot of internet searching, and have found some postings with similar issues, but haven’t found any solution.  I don’t know if it’s a VMWare issue, HP, Cisco, or Windows issue.

 

Any assistance would be appreciated.

 

Mike O'Donnell

Jan Soska
Honored Contributor

Re: Virtual machines on blades seeing intermittent PING/ARP issues.

Hello Mike,

We use the same AMD blades without problems; just keep everything updated. There were some problems with the Emulex NIC in these servers. For example, there is a quite critical firmware update, released 5 days ago, relating to a PSOD on VMware - check here: http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=sk&prodTypeId=329290&prodSeriesId=5033632&swItem=co-108976-1&prodNameId=5033634&swEnvOID=54&swLang=8&taskId=135&mode=4&idx=1

 

And of course, check the rest of your firmware and drivers (if you haven't done so yet). There is a great HP page dedicated to VMware: http://vibsdepot.hp.com. In particular, check the latest "recommended matrix": http://vibsdepot.hp.com/hpq/recipes/September2012VMwareRecipe2.0.pdf

 

Please let us know progress.

 

Regards,

 

Jan

Mike O.
Regular Advisor

Re: Virtual machines on blades seeing intermittent PING/ARP issues.

Thank you very much for the information.    We had already gone through the chassis/blade firmware and VMWare updates as a result of an issue we had last month with an interconnect on one of the chassis.  We're planning on updating the other chassis next week.

 

However, the ARP/Ping issue we're having is showing up on the blades and chassis that we already updated.     I went through the document, and everything is current except for three items.  However, on all three of those items, the NC551i (Emulex BE2Net, 4.1.450.7), Onboard Admin (3.56), and Virtual Connect (3.6), we were only 1 version out, and the release notes of the latest version didn't have any fixes relating to the ARP items.

 

 

Mike O.
Regular Advisor

Re: Virtual machines on blades seeing intermittent PING/ARP issues.

Is there any way to remove the "solved" setting on the thread?

Dennis Handly
Acclaimed Contributor

Re: Virtual machines on blades seeing intermittent PING/ARP issues.

>Is there any way to remove the "solved" setting on the thread?

 

Sure.  On the Post Options > Not the Solution

 

The FAQ has the wrong text for the item:

http://h30499.www3.hp.com/t5/help/faqpage/faq-category-id/solutions#solutions

Mike O.
Regular Advisor

Re: Virtual machines on blades seeing intermittent PING/ARP issues.

Thanks.  I had looked at the "post options", but I was on the original message, looking for something like "not solved".

 

 

Mike O.
Regular Advisor

Re: Virtual machines on blades seeing intermittent PING/ARP issues.

I still haven't found a solution to this.  I have done some more testing and research.  Someone in the VMWare forum suggested the Microsoft hotfix related to a "gratuitous arp" issue in Windows 2008.  However, that didn't resolve it.

 

I ran a script on all of our machines, doing a "netsh interface ip show neighbors" and searching for anything that had an "unreachable" entry.
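
For anyone wanting to repeat this kind of scan, here is a minimal sketch in Python. It is not the script used above. It assumes English-language netsh output laid out as an "Interface N: name" line followed by rows of address, physical address, and state; the function names find_unreachable and scan_local are made up for illustration.

```python
import subprocess


def find_unreachable(netsh_output):
    """Return (interface, ip, mac) for each neighbor row whose state is Unreachable.

    Assumes the English layout: an 'Interface N: name' line, then rows of
    'address  physical-address  state'. The physical address may be blank.
    """
    results = []
    interface = None
    for raw in netsh_output.splitlines():
        line = raw.strip()
        if line.startswith("Interface "):
            # Keep only the interface name after the colon, if present.
            interface = line.split(":", 1)[1].strip() if ":" in line else line
            continue
        parts = line.split()
        if len(parts) < 2:
            continue  # blank line or something that is not a neighbor row
        if parts[-1].lower() == "unreachable":
            mac = parts[1] if len(parts) == 3 else ""
            results.append((interface, parts[0], mac))
    return results


def scan_local():
    """Run netsh on the local Windows machine and return unreachable neighbors."""
    proc = subprocess.run(
        ["netsh", "interface", "ip", "show", "neighbors"],
        capture_output=True, text=True, check=True,
    )
    return find_unreachable(proc.stdout)
```

Calling scan_local() on each Windows guest and collecting the results gives the list of source/target pairs to look at. The parsing is kept separate from the netsh call so it can be checked against saved output.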

 

The issue did show up on multiple subnets, but in each source/target pair, the servers were in the same subnet, with no router.

 

The ones with "unreachable"  all had at least one of the servers in the VMWare environment, passing through the blade chassis Virtual Connect.

 

If the two VMs were on the same host, the "unreachable" issue went away.  When we moved them back to different hosts, the "unreachable" came back.

 

There were some repeats in the "targets", but most other servers could ping those target servers OK.

 

We are using VLANs, with Virtual Connect separating the networks before sending them to the blades.  I recall seeing something about a Virtual Connect issue where, when the VC environment stripped the VLAN tag from an ARP packet, the resulting packet would be too small and would be dropped by other network devices.  However, I thought that was fixed in a later firmware release.  Also, wouldn't it affect all ARPs going through Virtual Connect, and not just some?
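
To make the "too small" idea concrete, here is a rough sketch of the frame-size arithmetic in Python. It uses the standard Ethernet figures (14-byte header, 4-byte 802.1Q tag, 4-byte FCS, 64-byte minimum frame, 28-byte IPv4 ARP payload) and assumes the sender pads the tagged frame to exactly the 64-byte minimum and that the device removing the tag does not re-pad. It illustrates the mechanism only; it is not a confirmed diagnosis of this problem, and the function names are made up.

```python
ETH_HEADER = 14   # destination MAC + source MAC + EtherType
VLAN_TAG = 4      # 802.1Q tag
FCS = 4           # frame check sequence
MIN_FRAME = 64    # minimum Ethernet frame length, FCS included
ARP_PAYLOAD = 28  # ARP for IPv4 over Ethernet


def tagged_frame_len(payload_len):
    """Tagged frame length, padded up to the 64-byte minimum (assumed sender behavior)."""
    return max(ETH_HEADER + VLAN_TAG + payload_len + FCS, MIN_FRAME)


def after_strip_without_repad(payload_len):
    """Frame length if the tag is removed and the padding is not adjusted."""
    return tagged_frame_len(payload_len) - VLAN_TAG


def is_runt(frame_len):
    """True if the frame is below the Ethernet minimum."""
    return frame_len < MIN_FRAME


arp_len = after_strip_without_repad(ARP_PAYLOAD)
print(arp_len, is_runt(arp_len))    # 60 True

# A default Windows ping: 20-byte IP header + 8-byte ICMP header + 32 bytes of data.
ping_len = after_strip_without_repad(20 + 8 + 32)
print(ping_len, is_runt(ping_len))  # 78 False
```

Under those assumptions, a stripped ARP frame would fall below the minimum while an ordinary ping frame would not, which would match ARP failing while other traffic gets through. It does not explain why only some ARPs would be affected.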

 

 

Re: Virtual machines on blades seeing intermittent PING/ARP issues.

Did you ever find a solution for this?  We're having a very similar issue.  We did just add a new blade, but the issue didn't occur until it had been up with VMs for almost 24 hours. We shut it down after the first problem, but new VMs continued to have the issue pop up 12 hours later.

Jags_21
HPE Pro

Re: Virtual machines on blades seeing intermittent PING/ARP issues.

Hi Mike,

 

I hope the forum thread below leads to a solution. Also check whether there are any proxy ARP settings on the core switches.

 

http://h30499.www3.hp.com/t5/HP-BladeSystem-Virtual-Connect/LACP-Etherchannel-with-HP-BL460c-Blade-enclosure-Connectivity/td-p/2301508

 

 

Any personal opinions expressed are mine, and not official statements on behalf of Hewlett Packard Enterprise.
I work for HPE


gazJones
Occasional Collector

Re: Virtual machines on blades seeing intermittent PING/ARP issues.

Hi Mike,

 

We experienced exactly the same issue, including the behaviour where a ping in the opposite direction allows the guests to communicate. We raised a call with HP, and they identified that the Flex-10 NIC firmware on our ESX hosts wasn't aligned with the VC module firmware.  Updating the firmware fixed the issue; however, we're now experiencing a similar issue again. This time, pinging guest B -> A doesn't add an entry to the ARP cache. We've run Wireshark on both VMs, and the ARP packet never arrives at the destination VM.

 

I'd definitely check your NIC firmware against the VC modules and OAs. The VC guy gave me the impression that even if they're slightly out, you can experience issues like this.

 

Cheers, Gareth.