disaster test - what to check?

 
SOLVED
Viktor Balogh
Honored Contributor

Hi,

In about two weeks we will have a disaster test in our datacenter, and I need to check some things and prepare for it. The power will suddenly be turned off and turned back on after a few minutes. We use Serviceguard and external XP storage; the outage affects only one half of each two-node cluster and only one half of the storage subsystem. I need to predict what will happen to the systems and packages, and how the LVM resynchronization will be done, i.e. how the state between the two storage boxes will be synchronized.

The SG part is clear to me, but I haven't worked with storage yet, so I am most curious about the storage part. And if you have a disaster recovery plan, it is welcome too! Points will be awarded. ;)
****
Unix operates with beer.
18 REPLIES
Viveki
Trusted Contributor

Re: disaster test - what to check?

Hi

I do not know what you are looking for on the storage side if the power goes down. For XP there is usually disaster recovery software called Continuous Access. Is that implemented?

Michael Steele_2
Honored Contributor

Re: disaster test - what to check?

Hi

a) Have your uninterruptible power supply (UPS) vendor come out to check for bad batteries.

b) Verify that all boxes are on battery backup.

c) Then there is nothing else to do. You're not failing over unless the network is disrupted. So if all of your network nodes are on batteries...
Support Fatherhood - Stop Family Law
Johnson Punniyalingam
Honored Contributor

Re: disaster test - what to check?

Check all power supplies and the UPS.

Log in to the MP on every server and check the power status:

CM>PS

UPS
===
Check with the UPS vendor.

Backup
======
I would also make sure there is a current OS backup and a current file system backup
for all servers included in the disaster test, and run the
nickel script to collect all the system information details.
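
Collecting system information before and after the test can also be sketched in a few lines of Python. This is only an illustration, not a replacement for the nickel script: it assumes Python is available on the host (or that the commands are run remotely), and the function names `snapshot` and `changed_sections` are made up for this example. The commands listed are ordinary read-only HP-UX and Serviceguard commands; adjust the list to your environment.

```python
import subprocess
from datetime import datetime, timezone

# Read-only commands worth capturing before and after the power cut.
# The selection is an assumption; extend it for your own environment.
COMMANDS = {
    "cluster": ["cmviewcl", "-v"],
    "volume_groups": ["vgdisplay", "-v"],
    "filesystems": ["bdf"],
    "disks": ["ioscan", "-fnC", "disk"],
}


def run_command(argv):
    """Run one command and return its combined stdout/stderr as text."""
    try:
        done = subprocess.run(argv, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True, timeout=120)
        return done.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing or hanging command is recorded instead of aborting the run.
        return "FAILED: %s\n" % exc


def snapshot(commands=COMMANDS, runner=run_command):
    """Return {label: output}, plus a UTC timestamp under 'taken_at'."""
    result = {"taken_at": datetime.now(timezone.utc).isoformat()}
    for label, argv in commands.items():
        result[label] = runner(argv)
    return result


def changed_sections(before, after):
    """List the labels whose output differs between two snapshots."""
    return sorted(label for label in before
                  if label != "taken_at" and before[label] != after.get(label))
```

Taking one snapshot before the power cut and one after the systems are back, then calling `changed_sections(before, after)`, gives a quick list of the areas that need a closer look.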

Rgds,
Johnson
Problems are common to all, but attitude makes the difference
Viktor Balogh
Honored Contributor

Re: disaster test - what to check?

You must have misunderstood me: the electricity will be completely off, including the UPSs! The test is meant to show what happens to the packages in this case. So one half of each cluster and of the storage will go offline without a clean shutdown!

My question is: how will the filesystems be synchronized? If the package fails over, I think the LVM resync will be initiated by the surviving cluster partner, where the package then runs. But what if failover isn't permitted, e.g. the package starts automatically after the powered-off node reboots? In this case the sync would be initiated by the rebooted node. I'm afraid the correct data will be overwritten by the stale data. Can LVM auto-resync be turned off? With what command?

****
Unix operates with beer.
Viktor Balogh
Honored Contributor

Re: disaster test - what to check?

> Usually, for XP there is a disaster recovery software called continuous access. Is the same implemented?

No, HP XP Continuous Access isn't implemented.

****
Unix operates with beer.
Viveki
Trusted Contributor

Re: disaster test - what to check?

Hi Viktor,

We are still not clear on what you are looking for. If you are going to perform a power-off test on the XP and test its disaster recovery, I would say: ensure you have a good backup. Nothing else ....
Viveki
Trusted Contributor
Solution

Re: disaster test - what to check?

Sorry Balogh,

I didn't see your post above.

I do not know the current setup. But normally, if you power off one node of the cluster, the package should automatically switch over to the other node. If a filesystem repair is needed, fsck will be called automatically by the other node.

Just for your info, I will share one of my experiences. At one site, the power failed. The storage (not XP) and one of the nodes were suddenly powered off. They came back after a while, but the cluster failed to start and fsck was taking hours to repair a file system. Finally, a complete shutdown and proper restart of the full setup solved the issue without any further delay. No fsck was done this time. So again, do not go by docs or information from other sites. A power-down test is unpredictable and, in my opinion, there is only a rare chance that you won't have a problem after it, whatever your precautions may be.

Again, please make sure of your backup before the activity, since the machines may not be aware you are just testing ;)
Viktor Balogh
Honored Contributor

Re: disaster test - what to check?

Hi Viveki,

Thanks for your help. I know the Serviceguard part: if AUTO_RUN is enabled for the package, then the package will fail over to the surviving node. (For the test packages it isn't always enabled.) We have several LVM-mirrored filesystems; they are mirrored with the help of PVGs. Each physical volume group consists of LUNs from a different XP box, so in case of a storage box failure only one half of the mirror will be affected.

But this part isn't clear to me: after a package switch, the package runs with one half of the mirror. After the other half of the system is powered on, how will the data be synchronized? The surviving node will initiate a sync, but we must make sure that the failed XP is synchronized from the surviving one, and not the reverse.

And what about the test packages? They will be started on the powered-off node after it has come back to life. Will a sync be needed here? Or only an fsck? We are using VxFS filesystems; do you think a full fsck (nolog) would be recommended?

In any case, we will create an extra backup of the OS and the data...
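
The sync-direction question can be pictured with a small toy model. As far as I understand LVM mirroring, extents written while one copy is unavailable are marked stale on that copy, and a later resync copies from the current copy to the stale one, whichever node happens to run it. The sketch below only illustrates that idea: the class and the box names are invented, it is not LVM's real on-disk bookkeeping, and it does not cover a node activating the volume group with only the old half visible.

```python
class MirroredVolume:
    """Toy two-way mirror: one copy per storage box, one stale flag per extent."""

    def __init__(self, extents):
        self.copies = {"box_a": [0] * extents, "box_b": [0] * extents}
        self.stale = {"box_a": [False] * extents, "box_b": [False] * extents}
        self.online = {"box_a": True, "box_b": True}

    def write(self, extent, value):
        """Write to every online copy; mark the extent stale on offline copies."""
        for box in self.copies:
            if self.online[box]:
                self.copies[box][extent] = value
            else:
                self.stale[box][extent] = True

    def resync(self):
        """Refresh stale extents from the current copy; return how many were copied."""
        copied = 0
        for box in self.copies:
            other = "box_b" if box == "box_a" else "box_a"
            if not (self.online[box] and self.online[other]):
                continue
            for extent, is_stale in enumerate(self.stale[box]):
                # The direction comes from the stale flag, not from the caller.
                if is_stale and not self.stale[other][extent]:
                    self.copies[box][extent] = self.copies[other][extent]
                    self.stale[box][extent] = False
                    copied += 1
        return copied


vol = MirroredVolume(4)
vol.write(0, 11)               # both boxes up
vol.online["box_b"] = False    # power cut on one storage box
vol.write(1, 22)               # package keeps running on the surviving half
vol.write(2, 33)
vol.online["box_b"] = True     # failed box comes back
copied = vol.resync()          # only the stale extents on box_b are refreshed
```

On a real system, `lvdisplay -v` shows whether extents are current or stale, `vgsync` and `lvsync` start a resync by hand, and `vgchange` has an `-s` option that, per its man page, disables synchronization of stale extents at activation. Check the man pages for your HP-UX release before relying on any of this.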
****
Unix operates with beer.
Tingli
Esteemed Contributor

Re: disaster test - what to check?

If one node of a two-node cluster is powered off, the root mirror of the surviving node won't be affected. It runs as usual; only the processes running on the failed node will be failed over to the surviving node.

When the failed node is up again, it is just a normal system boot. The failed-over processes can be brought back manually to the original node.
Viktor Balogh
Honored Contributor

Re: disaster test - what to check?

Thanks Tingli,

and could you point me towards what will happen on the storage side? One side of the mirror (on one storage box) will be without electricity, so my question is: what happens after restarting the failed storage? How will the resynchronization occur? Will it happen manually or automatically? Could we set this synchronization to manual?
****
Unix operates with beer.
Viktor Balogh
Honored Contributor

Re: disaster test - what to check?

My other question would be: will the server be powered off again after we give the electricity back? Where can I check whether it will reboot itself automatically? On the MP/GSP?
****
Unix operates with beer.
Viktor Balogh
Honored Contributor

Re: disaster test - what to check?

> will the server be powered off again after we give the electricity back?

I meant: will the server be powered ON again after we give the electricity back?
****
Unix operates with beer.
Johnson Punniyalingam
Honored Contributor

Re: disaster test - what to check?

> my other question would be: will the server be powered off again after we give the electricity back? where can I check whether it will reboot itself automatically? on MP/GSP?

Are you referring to raw power or UPS power, or are you doing a power maintenance test?

Once you have shut down the server gracefully (shutdown -hy 0),
you have to unplug the power cables from the server. Does your power source come from a UPS?
Once you have completed your power maintenance activity, connect the power cables back; once power resumes, you need to power on manually.

> will the server be powered off again after we give the electricity back?

This depends on the power source. If the server powered off, you can check

the console logs (E - error logs) on the MP/GSP.
Problems are common to all, but attitude makes the difference
Viktor Balogh
Honored Contributor

Re: disaster test - what to check?

I am referring to RAW power. And this is not a power maintenance activity: our customer plans to simply remove the power without a graceful shutdown. Switches, storage boxes, and servers will be affected by this, a whole datacenter. The other half of the infrastructure resides in another building, will be unaffected, and hopefully will take over the packages.

****
Unix operates with beer.
Tingli
Esteemed Contributor

Re: disaster test - what to check?

Why not feed both servers and both storage boxes from the two raw power supplies, so you don't need to worry about a sudden power-off?

If you have a UPS, it can supply power for a few minutes and you don't need to worry about it either.

But if one server and half of the storage are down, then everything related to that half of the storage will be down for sure and the database might be corrupted. Meanwhile, the processes residing on the failed system will fail over to the surviving system. Usually the failover takes several minutes, and if the original failed system is back, then the result is unpredictable.



Viktor Balogh
Honored Contributor

Re: disaster test - what to check?

Tingli: yeah, with UPSs it would be much easier. AFAIK there is an alternate power source, but it won't be an option.

>and if the original failed system is back, then the result is unpredictable.

Yes, that's my task: to predict the unpredictable. This whole exercise was organized only to check what would happen if... the electricity went off all of a sudden. :(

I will leave this thread open and share the details. The test will take place on 27-29 November...
****
Unix operates with beer.
Tor-Arne Nostdal
Trusted Contributor

Re: disaster test - what to check?

I assume you've tested ordinary package switching and know that it works ok ;)
And that you know your backups are running and can be restored ;)
...

Tip:
Ensure that they really cut the power to all components at once. We found a plausible case for errors if failures came in a specific sequence...

Tip:
Check in your MC/SG setup whether the package will switch back or not when your primary node comes up again.
You might want a controlled failback to the primary node, and not want packages switching automatically when the power is back.
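
One way to review this across many packages is to read the relevant keywords out of each package configuration file. The sketch below assumes the ASCII package configuration format with one `KEYWORD value` pair per line and `#` comments; the sample text, the package name and the node names are made up for illustration, and the helper names are not part of Serviceguard.

```python
# Hypothetical package configuration excerpt, for illustration only.
SAMPLE = """
PACKAGE_NAME     pkg_db        # made-up package name
AUTO_RUN         YES
FAILOVER_POLICY  CONFIGURED_NODE
FAILBACK_POLICY  MANUAL
NODE_NAME        node1
NODE_NAME        node2
"""


def read_package_settings(text):
    """Parse 'KEYWORD value' lines into {KEYWORD: [values]}, ignoring comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            # Upper-case the keyword so lower-case keyword spellings also match.
            settings.setdefault(parts[0].upper(), []).append(parts[1].strip())
    return settings


def failback_report(text):
    """Summarise the settings that decide automatic start and failback."""
    settings = read_package_settings(text)

    def first(key):
        return settings.get(key, ["NOT SET"])[0].upper()

    return {
        "package": settings.get("PACKAGE_NAME", ["?"])[0],
        "auto_run": first("AUTO_RUN"),
        "failback_policy": first("FAILBACK_POLICY"),
        "nodes": settings.get("NODE_NAME", []),
    }


report = failback_report(SAMPLE)
```

A package whose report shows `failback_policy` as `AUTOMATIC` is one that may move back on its own once the primary node rejoins; `MANUAL` leaves the move to the administrator.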

/2r
I'm trying to become President of the state I'm in...
Viktor Balogh
Honored Contributor

Re: disaster test - what to check?

Thank you for your replies. Thread closed.
****
Unix operates with beer.