BladeSystem Infrastructure and Application Solutions

C7000 enclosure loss and effect on a VMware cluster

 
chuckk281
Trusted Contributor


Clustering questions around VMware have prompted some discussion, and we are looking for your input. Here is the scenario:

"I have a very loyal HP Blade customer who runs just about everything they have on c-class, including their ever-growing VMware environment. With the later-technology G6 servers they are approaching 300+ VMs hosted per enclosure (16 half-height BL46x blades). They are not looking to move off c-class, but they are starting to be concerned about the effect on the VM community of losing an entire enclosure. They have asked me both for hypothetical scenarios of total enclosure loss (planned or unplanned) and for best-practice VMware designs to minimize any disruption. I have my own thoughts below but would like any feedback, which you can address to the community if you like.

First, they know about and have a high-availability design in their enclosures: all the normal redundancies at the power / network / SAN layer, VC user, good SAN design, etc. They are talking about full loss of the enclosure, as unlikely as that may be.

The scenarios for loss of the enclosure seem easy enough. As for an unplanned cataclysmic event that brings down a known-good enclosure (one that was not damaged from the start), there are very few I can think of. In fact, the only one that comes to mind is the power supply advisory of last year (where supposedly a bad supply could cause the others to trip and result in enclosure power loss), and we proactively fixed that one.

However, I do feel there are situations where an enclosure may need to be taken down, most notably damage to a signal connector on the blade or I/O side, which I have seen in rare instances (I think these may sometimes be factory or shipping defects that were not discovered until all slots were populated). I categorize this as a planned outage: you can limp along without the affected blade or I/O module, but you will need to shut down the entire enclosure at some point to correct it.
Can anyone think of any other planned or unplanned outages to an entire c-class enclosure?

Which brings me to the second part of this exercise: how best to construct the VMware farm to handle this. For argument's sake, say an entire enclosure represents 200 VMs. How do I best construct an environment to move the 200 VMs somewhere else while the enclosure is serviced? I have bounced around some ideas, but each has drawbacks:

- Combine multiple enclosures into one large VMware cluster. Say 4 enclosures together in one VMware cluster, so we need only 25% capacity per enclosure to handle the loss of one. But this flies against the best-practice designs of VMware clusters (too many nodes / reservation locks, etc.).
- Split VMware clusters across enclosures (say 4 enclosures again, each with 4 blades in each of 4 different clusters). This reduces the number of blades per cluster, but it could increase the cabling for network and SAN quite a bit.
- Perhaps use a standby enclosure with no blades but fully SAN / network connected; blades could then be physically moved from the affected enclosure to this standby enclosure. This seems pretty time-consuming and expensive.
- Other ideas?

I am leaning towards option 2 as the best way to mitigate, although I am still thinking about the network and SAN connection issues. That is what I have so far; again, I am interested in any feedback you might have."
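
For anyone who wants to play with the numbers in options 1 and 2, here is a rough Python sketch. It is only arithmetic, not anything from VMware or HP: the function names (`max_safe_utilization`, `split_cluster_layout`, `hosts_lost`) are made up for illustration, it assumes identical enclosures with evenly spread load, and it reads the "25%" figure as headroom held back on each enclosure.

```python
from fractions import Fraction


def max_safe_utilization(enclosures: int) -> Fraction:
    """Highest average load per enclosure that still lets the surviving
    enclosures absorb everything from one failed enclosure.
    Assumes identical enclosures and evenly spread load."""
    if enclosures < 2:
        raise ValueError("need at least two enclosures to survive losing one")
    return Fraction(enclosures - 1, enclosures)


def split_cluster_layout(enclosures: int, clusters: int,
                         blades_per_enclosure: int = 16):
    """Option 2: every enclosure gives an equal slice of its bays to each
    cluster. Returns {cluster name: [(enclosure, bay), ...]}."""
    if blades_per_enclosure % clusters:
        raise ValueError("blades per enclosure must divide evenly among clusters")
    per_cluster = blades_per_enclosure // clusters
    layout = {f"cluster-{c}": [] for c in range(1, clusters + 1)}
    for enc in range(1, enclosures + 1):
        for bay in range(1, blades_per_enclosure + 1):
            # Bays 1-4 go to cluster-1, 5-8 to cluster-2, and so on.
            name = f"cluster-{(bay - 1) // per_cluster + 1}"
            layout[name].append((enc, bay))
    return layout


def hosts_lost(layout, failed_enclosure: int):
    """Count how many hosts each cluster loses when one enclosure goes dark."""
    return {name: sum(1 for enc, _ in hosts if enc == failed_enclosure)
            for name, hosts in layout.items()}


if __name__ == "__main__":
    # With 4 enclosures, each can run at up to 3/4 load (25% held back).
    print(max_safe_utilization(4))
    layout = split_cluster_layout(enclosures=4, clusters=4)
    # Each 16-host cluster loses 4 hosts (a quarter) if enclosure 1 fails.
    print(hosts_lost(layout, failed_enclosure=1))
```

Either way the headroom works out the same for 4 enclosures; what option 2 changes is the cluster size (16 hosts each here instead of 64), at the cost of every cluster's networks and LUNs having to be presented to every enclosure.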
chopper3
Frequent Advisor


I'm in a very similar situation (almost sounds like my company in fact!) and as such I have been investigating this problem for months now. The specific issue I have is that every single VM I have is part of some form of load-balanced pool or cluster, so I'm happy to lose a single VM from a pool but not more than one. Obviously you can use anti-affinity rules to ensure that no two cluster-member VMs are running on the same host, but there's no 'second level' of anti-affinity that you could use to define the enclosure.

To deal with this I have ESX clusters that have one host per enclosure, so my risk is spread out 'horizontally'. This way, if one enclosure goes down I lose a single cluster member from each application/LB cluster. Now this depends on how many enclosures you have and on spreading your servers out smartly between them, but it does stop me from losing more than one VM of any type in an enclosure loss. You end up with quite a few clusters too.

I've raised this with VMware and we've discussed there being additional 'enclosure', 'rack' and 'room' data that you could assign to a host and that could be used for anti-affinity, but I was told this won't even be in the next major release.

Hope this complex/expensive solution at least answers one or two questions.

[Updated on 10/9/2009 11:23 AM]
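
To make the 'horizontal' layout a bit more concrete, here is a minimal Python model of it. It is only a sketch: it does not call any VMware API, the names (`build_clusters`, `place_pool`, `lost_per_pool`, the `enc-A` style labels) are hypothetical, and it simply assumes the host-level anti-affinity rule is honoured, i.e. at most one member of a pool per host.

```python
def build_clusters(enclosures, cluster_count):
    """Each ESX cluster gets exactly one host from every enclosure.
    A host is modelled as an (enclosure, cluster number) pair."""
    return {f"cluster-{c}": [(enc, c) for enc in enclosures]
            for c in range(1, cluster_count + 1)}


def place_pool(members, hosts):
    """Mimic host-level anti-affinity: one pool member per host.
    Because each host in the cluster sits in a different enclosure,
    this also puts each member in a different enclosure."""
    if len(members) > len(hosts):
        raise ValueError("more pool members than hosts; anti-affinity cannot hold")
    return dict(zip(members, hosts))


def lost_per_pool(placements, failed_enclosure):
    """Count the members each pool loses when one enclosure fails."""
    return {pool: sum(1 for enc, _ in placement.values() if enc == failed_enclosure)
            for pool, placement in placements.items()}


if __name__ == "__main__":
    enclosures = ["enc-A", "enc-B", "enc-C", "enc-D"]
    clusters = build_clusters(enclosures, cluster_count=2)
    placements = {
        "web": place_pool(["web1", "web2", "web3"], clusters["cluster-1"]),
        "db": place_pool(["db1", "db2"], clusters["cluster-2"]),
    }
    # No pool should ever lose more than one member to a single enclosure.
    for enc in enclosures:
        print(enc, lost_per_pool(placements, enc))
```

The catch shows up in `place_pool`: a pool can never have more members than you have enclosures, which is part of why this ends up with many small clusters.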