HPE SimpliVity

Disaster recovery steps

 
AubreyT
Visitor

My manager has asked me to come up with a DR test process using our SimpliVity environment. We are using ESXi 6.0, which is not compatible with RapidDR.

Scenario 1:
1. Cut our network link between two 3.7.2 SVT nodes at our main office and two nodes at our colo site
2. Create copy VMs from SVT backups at the colo site, and verify functionality
3. Restore network connectivity and resume backups from our main office to colo site

Scenario 2:
1. Cut our network link between two 3.7.2 SVT nodes at our main office and two nodes at our colo site
2. Bring up all VMs at our colo site during a planned outage (no production processing at the main office during the cutover), and run production processing through the colo internet connection
3. Copy all VMs with production data back to the VMs at our main office
4. Bring up all VMs during a planned outage at our main office, and run production from there.

Is there a document/white paper that discusses best practice steps for failover/failback when not using RapidDR?

LiamP
Advisor

Re: Disaster recovery steps

Disaster recovery methods rely on backups located at the destination, i.e. the remote site, so that VMs can be restored from backup.

Both scenarios also require the VMs to be removed from the source once the restore has succeeded and the restored VMs are operational. Otherwise, there will be issues with duplicate IPs, etc.
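
To make the duplicate-IP risk concrete, here is a minimal sketch of a pre-flight check you could run before powering anything back on. It assumes you have exported an inventory of VMs from both sites by whatever means you prefer; the record layout (`site`, `name`, `ip`, `powered_on`) and the function name are hypothetical and not part of any HPE or VMware tool.

```python
from collections import defaultdict


def find_duplicate_ips(inventory):
    """Return {ip: [(site, vm_name), ...]} for IPs used by more than one powered-on VM.

    `inventory` is a hypothetical list of dicts with the keys
    'site', 'name', 'ip' and 'powered_on'.
    """
    claims = defaultdict(list)
    for vm in inventory:
        # Powered-off VMs cannot conflict on the network, so skip them.
        if vm["powered_on"] and vm["ip"]:
            claims[vm["ip"]].append((vm["site"], vm["name"]))
    # Keep only the addresses claimed by two or more running VMs.
    return {ip: owners for ip, owners in claims.items() if len(owners) > 1}


if __name__ == "__main__":
    # Illustrative data only: one original VM was left running at the main site.
    sample = [
        {"site": "main", "name": "app01", "ip": "10.0.0.5", "powered_on": True},
        {"site": "colo", "name": "app01-restored", "ip": "10.0.0.5", "powered_on": True},
        {"site": "main", "name": "db01", "ip": "10.0.0.6", "powered_on": False},
        {"site": "colo", "name": "db01-restored", "ip": "10.0.0.6", "powered_on": True},
    ]
    for ip, owners in find_duplicate_ips(sample).items():
        print(ip, owners)
```

An empty result means no two running VMs share an address, which is the state you want before reconnecting the sites.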

A failback requires a new backup to be taken at the colo, sent back to the main site, and then restored there.

RapidDR follows the same principles, but uses a set of scripts to automate and customise the process, i.e. the recovery plan, etc.

I am a HPE Employee
LiamP
Advisor

Re: Disaster recovery steps

The method of restoring/creating backups will also need to be taken into consideration if the vCenter(s) become unavailable during these scenarios, i.e. it depends on whether vCenter is located on HPE SimpliVity storage or externally.

I am a HPE Employee
AubreyT
Visitor

Re: Disaster recovery steps

Liam, based on your feedback, would you say the process for scenario 2 would be the following? (I'm guessing there is no white paper with more details than you included previously.)

I. Broad failover steps (planned outage)
  1a. Turn off all production VMs and pause all backup policies originating at the main site OR
  1b. Cut the network connection between the primary and colo site.
  2. Bring up all the VMs at the colo site by creating a new VM with a date/time stamp (the OS would contain the same SID as the original machine)
  3. After verifying the colo VMs are behaving normally, delete all production VMs at the primary site
  4. Bring up the primary to colo network (if 1b was executed)
  5. Create backup jobs to copy VMs at the colo to the primary site, and copy all VMs back to their original location.

II. Broad failback steps (planned outage)
    1. Turn off all colo production VMs and pause all backup jobs sending data to the primary site
    2. Bring up all VMs at the production site by choosing to create a new VM, without a date/time stamp.
    3. After verifying the primary site VMs are behaving normally, delete all VM copies at the colo site
    4. Enable backup jobs to start copying VM data back to the colo site.

Is there a way to write pre-written scripts, PowerCLI, or something else, to automate bringing up VMs on both sites? We want to run this test without RapidDR, and would like to streamline things as much as possible.

Thanks!   

AubreyT
Visitor

Re: Disaster recovery steps

If I turned off VMs at the production site, then copied backups from the colo site, would that recover our primary site any faster? Could the backups from colo just copy delta changes, or would it re-create the VM from scratch?

LiamP
Advisor

Re: Disaster recovery steps

I. Broad failover steps (planned outage)
  1a. Turn off all production VMs and pause all backup policies originating at the main site OR
  1b. Cut the network connection between the primary and colo site.
  2. Bring up all the VMs at the colo site by restoring from backup, creating a new VM with a date/time stamp (the OS would contain the same SID as the original machine)
  3. After verifying the colo VMs are behaving normally, delete all production VMs at the primary site.

Maybe just remove them from inventory (they would need to be cleaned up later) if no backups are present on the main site. Otherwise, re-seeding backups from colo to main might take an extended amount of time, delaying the process.

  4. Bring up the primary to colo network (if 1b was executed)
  5. Create backup jobs to copy VMs at the colo to the primary site, and copy all VMs back to their original location.

A new rule will need to be added to a policy for VMs to back up remotely from colo to main.

"Is there a way to write pre-written scripts, PowerCLI, or something else, to automate bringing up VMs on both sites? We want to run this test without RapidDR, and would like to streamline things as much as possible."

Maybe now you see why RapidDR was released. Of course you can write scripts to streamline this process, but this will be unsupported if you run into issues.
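
Purely as an unsupported illustration, the ordering of the failover steps above could be captured in a small dry-run runbook. The sketch below is in Python and makes no calls to vCenter or SimpliVity: `placeholder` is a hypothetical stand-in for whatever PowerCLI or REST call you would write yourself, and `Step`, `run_plan` and `FAILOVER` are names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    label: str
    action: Callable[[], bool]  # should return True on success


def run_plan(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the labels of the steps that completed (or, in dry-run
    mode, the labels that would have been attempted).
    """
    done = []
    for step in steps:
        if dry_run:
            print(f"[dry-run] {step.label}")
            done.append(step.label)
            continue
        print(f"[run] {step.label}")
        if not step.action():
            print(f"[stop] {step.label} failed; later steps skipped")
            break
        done.append(step.label)
    return done


def placeholder() -> bool:
    # Hypothetical stand-in: replace with your own PowerCLI or REST call.
    return True


# Order taken from the broad failover steps discussed in this thread.
FAILOVER = [
    Step("Power off production VMs and pause main-site backup policies", placeholder),
    Step("Restore VMs from backup at the colo site", placeholder),
    Step("Verify the colo VMs behave normally", placeholder),
    Step("Remove the original VMs at the main site", placeholder),
    Step("Add a policy rule to back up from colo to main", placeholder),
]

if __name__ == "__main__":
    run_plan(FAILOVER, dry_run=True)
```

Stopping at the first failed step matters here: removing the original VMs should never run unless the verification step before it has passed.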

The HPE SimpliVity support team will be able to assist with the outage, but not the process.

I am not sure I would risk running scripts when HPE SimpliVity has already released a supported product, i.e. RapidDR, especially when dealing with an outage/disaster where a certain RTO value might be critical to a company.

What RTO & RPO can your company tolerate?
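
As a rough way to put numbers on that question, the sketch below estimates both values under simple assumptions: the worst-case RPO is taken as the backup interval plus the time a backup needs to finish copying to the remote site, and the RTO is taken as the sum of sequential recovery steps. The function names and all figures are hypothetical; substitute timings measured in your own test.

```python
def worst_case_rpo_minutes(backup_interval_min, replication_min):
    """Worst-case data loss window, in minutes.

    Assumption: data written just after a backup is taken is not protected
    until the next backup has been taken and has finished copying to the
    remote site.
    """
    return backup_interval_min + replication_min


def estimated_rto_minutes(step_minutes):
    """Total recovery time, assuming the steps run one after another."""
    return sum(step_minutes.values())


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    print("RPO (min):", worst_case_rpo_minutes(backup_interval_min=60, replication_min=10))
    print("RTO (min):", estimated_rto_minutes({
        "restore VMs at colo": 15,
        "verify applications": 30,
        "redirect network/DNS": 20,
    }))
```

If the estimates exceed what the business can tolerate, the backup frequency or the amount of manual work in the recovery steps is where to look first.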

I am a HPE Employee