StoreVirtual Storage


 
Holger_68623
Advisor

P4000 VSA | No Quorum after restriping fails

Hi everybody,

 

Hopefully somebody can help with this major issue. I've already tried to get (paid) support from HP, but the manager declined the request due to possible weekend bottlenecks in available technicians...

 

The setup:

Two HP DL385 running vSphere 5.1 (named ESX-01 and ESX-02), each consisting of

- 16 Cores

- 48 GB RAM

- 8 GBit NICs

- Five hard drives: 2 in a RAID1 (datastore 1), 3 in a RAID5 (datastore 2)

One Switch HP ProCurve 2510-24G for iSCSI

One Switch HP ProCurve 1810-24G for company network

Each server runs a P4000 VSA on datastore 1 (named VSA-01 and VSA-02).

 

The task:

Due to a lack of free space I had to replace all hard drives. So I wanted to remove the five smaller hard drives in each server and replace them with eight larger ones.

 

What I did:

  1. moved all VMs from ESX-02 to ESX-01 via vCenter Server
  2. shut down VSA-01 on ESX-01 via the CMC
  3. shut down ESX-01 via vCenter Server
  4. removed the existing five drives and installed eight new drives
  5. started ESX-01 and booted into SmartStart, where I configured the appropriate RAID level for all disks
  6. installed vSphere 5.1 and configured the settings appropriately
  7. rolled out the VSA via .ovf file and configured everything correctly
  8. added the new VSA to the cluster via CMC and waited for the re-striping to finish – NO PROBLEMS OCCURRED
  9. The next day I made sure that all VMs would be able to run without VSA-02 and ESX-02
  10. Then I repeated steps 1 to 7 for ESX-02

 

For some reason, the new VSA on ESX-02 was rebooted while restriping took place and that’s when the mess began.

 

All of a sudden, the store was not online anymore. I tried several things to bring it back online, without success.

So I did the following:

- shut down the freshly installed ESX-01

- replaced the new set of hard drives in ESX-02 with the old set to be able to access the store

- started the server (ESX-02) and booted into vSphere

- started the old VSA-02

Then I opened the CMC and found VSA-02, but now the CMC states “No quorum (log out and log back in to the management group to try to detect quorum)”

 

 

I urgently need to access the VMs that reside inside that store. So what happened?

 

Can anybody help? If somebody is willing to help me solve the problem here in Germany, I’m also willing to pay for the help.

 

Thanks-

Holger

 

Dirk Trilsbeek
Valued Contributor

Re: P4000 VSA | No Quorum after restriping fails

You don't seem to have a failover manager present in your configuration. The failover manager provides quorum in case at least 50 % of your nodes are unavailable (= no quorum for the remaining nodes). I'm not sure if you can add a failover manager to an already unavailable cluster, probably not.

 

What does your VSA-02 show in terms of status / managers? There is a feature with which you can replace a node with another one, preparing the cluster for the new node. That would probably be a better way to change all drives at once. You could also just replace one disk at a time and add the additional space as soon as all disks are replaced and the VMFS datastore has been expanded.

Holger_68623
Advisor

Re: P4000 VSA | No Quorum after restriping fails

Hi Dirk,

 

Many thanks for your reply. Well, I tried to get it working quickly, but obviously it was the wrong approach. I already had a FOM in place; sorry that I forgot to mention this.

 

What I did in the meantime: 

  1. shut down the old VSA-02 again
  2. started copying the vmdk files from the old VSA-02 to a local disk drive, so that the VMs residing inside the VSA's vmdks are in a "safe place"
  3. started the new VSA-01 on ESX-01 to try to recover the quorum somehow

 

Unfortunately I didn't find a way to do so. The PNG files show the mess...

Background info:

  • 192.168.138.233 was the new VSA-02 (due to a MAC address issue I had to roll it out two times)
  • 192.168.138.132 was the old VSA-02

 

Questions that came to my mind:

  • Would it be an option to start all over again, for example by using "recoverQuorum" on the command line, or by simply removing the FOM and rolling out a new one?
  • Shouldn't it be possible to use a (hopefully) intact VSA with a new cluster by simply assigning or importing it to the new cluster?

Can you help?

 

Thanks-

Holger

Dirk Trilsbeek
Valued Contributor

Re: P4000 VSA | No Quorum after restriping fails

A new cluster means you're losing your data. I'd say the cluster configuration is a bit messed up, as you have two nodes in your cluster which both have the same IP address.

 

That might also be your problem: you now have a total of 4 systems in your cluster (3 nodes + FOM), and you need more than 50 % of all managers available for quorum. For 2 nodes + FOM you would only need one node and the FOM, but with 3 nodes + FOM you'd need 3 managers (= 2 nodes + FOM) for a quorum majority.
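
To make the manager arithmetic concrete, here is a small Python sketch of the majority rule as described in this post (more than 50 % of all managers must be reachable). It only illustrates the counting; the function names are made up for this example and are not part of the CMC or any HP tool.

```python
def managers_needed(total_managers: int) -> int:
    """Smallest number of managers that is strictly more than half of the total."""
    return total_managers // 2 + 1


def has_quorum(reachable: int, total_managers: int) -> bool:
    """True if the reachable managers form a strict majority."""
    return reachable >= managers_needed(total_managers)


# 2 nodes + FOM = 3 managers: one node plus the FOM is enough
print(managers_needed(3))  # 2

# 3 nodes + FOM = 4 managers: 2 nodes plus the FOM are required
print(managers_needed(4))  # 3

# With 4 managers, only 2 reachable is exactly 50 %, which is not a majority
print(has_quorum(2, 4))    # False
```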

 

I never tried this (so it's sort of a last-resort procedure), but it's the one thing I can think of to bring your cluster back online: you could try using the "replace node" function to remove vsa-02. Then log in to the vsa-02 shell, assign a new name and IP address, reset the management group assignment and add the "new" node to the cluster as a replacement for vsa-02. Other than that, I'd rely on HP support.

Holger_68623
Advisor

Re: P4000 VSA | No Quorum after restriping fails

Hi Dirk,

 

thanks for your reply.

 

Might sound a little strange but where do I find the "replace node" function?

 

And: once VSA-02 has been assigned a new IP address and been renamed, will the existing data be accessible?

 

Will that solve the issue you've mentioned regarding the number of managers, too?

 

Thanks,

Holger

Holger_68623
Advisor

Re: P4000 VSA | No Quorum after restriping fails

Can't even log in to the FOM via CMC anymore... Don't know why. :-(

Dirk Trilsbeek
Valued Contributor

Re: P4000 VSA | No Quorum after restriping fails

That would probably add a 4th node. It might still work, as it would get you 3 managers, which would be enough for quorum in a cluster with 5 systems.

 

I just checked our environment. I think the function was called "repair storage system", but according to the manual it is only used to repair systems where you just swapped a disk, so it is probably only relevant for the hardware P4000 systems.

 

Can you remove the vsa-02 systems from the cluster? Maybe it helps to clean them out and re-add the system.

 

Do you have a full backup of all vmdk files of your working vsa-01? If you completely destroy your cluster, you could use them to revert destructive actions.

Holger_68623
Advisor

Re: P4000 VSA | No Quorum after restriping fails

I'm just copying the vmdk files for VSA-01 to another hard drive. Done that for VSA-02 already.

 

Since I can't log in to the cluster (all storage systems have been switched off for backup purposes), I cannot remove the VSA-02 systems from the cluster either.

 

I tried this before but received an error message stating that I should bring up a quorum first (HAHA!)...

 

This is really weird...! :-(

Dirk Trilsbeek
Valued Contributor

Re: P4000 VSA | No Quorum after restriping fails

No, that was what I feared. You need quorum to change anything in the cluster; otherwise the nodes that are not available might override the cluster changes you made while they were offline. There is no "master" that controls all configuration, so you always need quorum to propagate changes throughout the cluster.
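
As a rough illustration of why the removal is refused, here is a toy Python model of a management group that rejects configuration changes unless a majority of its managers is online. This is not how the VSA software is implemented; the class, the method names and the manager labels (including which ones are offline) are purely hypothetical and only mirror the rule described above.

```python
class NoQuorumError(Exception):
    """Raised when a configuration change is attempted without a majority."""


class ManagementGroup:
    """Toy model: every manager has one vote, there is no 'master'."""

    def __init__(self, managers):
        # Map each manager name to its online state (all online at first).
        self.online = {name: True for name in managers}

    def set_online(self, name, state):
        self.online[name] = state

    def has_quorum(self):
        # Strictly more than half of all known managers must be online.
        up = sum(self.online.values())
        return 2 * up > len(self.online)

    def remove_manager(self, name):
        # Changes are only accepted when a majority can agree on them.
        if not self.has_quorum():
            raise NoQuorumError("bring up a quorum first")
        del self.online[name]


# Hypothetical state: 3 nodes + FOM known to the group, only 2 of them online.
group = ManagementGroup(["VSA-01", "VSA-02-old", "VSA-02-new", "FOM"])
group.set_online("VSA-02-old", False)
group.set_online("VSA-02-new", False)

try:
    group.remove_manager("VSA-02-old")
except NoQuorumError as err:
    print("change refused:", err)
```

In this model the stale entry can only be removed after a third manager comes back online, which is the same chicken-and-egg situation described in this thread.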

 

Difficult situation. There are probably some shell commands known to HP support that can remove the invalid vsa-02 nodes so that you can just add it again, but I don't know them.

Holger_68623
Advisor

Re: P4000 VSA | No Quorum after restriping fails

Unfortunately HP won't help even though I want to pay for support... Bad user experience...!

 

Anyway, thanks for your help.