Disaster scenarios

SOLVED
Paula J Frazer-Campbell
Honored Contributor

Disaster scenarios

Hi to all

I am writing scenarios to throw at my system admins so that they can practice DR/recovery.

I am concentrating at this stage on situations that do not require a component change.

Whilst I can invent many, I would like some real-world situations.

My servers are D/K/L/N class running HP-UX 11.00 - UniVerse and COBOL, IP and X.25.

I will post my completed document.

Paula
If you can spell SysAdmin then you is one - anon
20 REPLIES
Alex Glennie
Honored Contributor
Solution

Re: Disaster scenarios

Santosh Nair_1
Honored Contributor

Re: Disaster scenarios

When you say no component changes, I assume you're talking about non-hardware issues?

How about the case where someone (with root abilities) does an rm -rf a * (note the space) in /. That's always a fun scenario.

All the other things that I am thinking of relate back to hardware issues, i.e. disks failing, CPUs failing, etc.

-Santosh
Life is what's happening while you're busy making other plans
Paula J Frazer-Campbell
Honored Contributor

Re: Disaster scenarios

Alex

Good one, I had forgotten ;^) that Bill had asked that.
10 points are on the way, but not yet, as I do not want the rabbit to appear yet.

Paula
If you can spell SysAdmin then you is one - anon
Tim D Fulford
Honored Contributor

Re: Disaster scenarios

Paula,

I'm not sure if you are after practical disasters or theoretical scenarios. I'm giving you some practical problems (& a dummy cause).

The obvious one is a full system rebuild, after say a fire etc.

I have been involved in DR for real, and usually the data is far too precious to warrant going all the way back to yesterday's tapes & rolling forward db logical logs.

My definition of DR is ANYTHING that causes the system to be out of service. That way you may have many DR problems in a year, but the guys/gals dealing with them have a wide spectrum of experience to pull from, not just the plane crash, fire, quake etc. & system rebuild scenario, which to be frank is easy to do; whilst managers will go loopy, there really is no effort in making the decision. That said:

o Corruption of the root disk: say rm -r /dev/* would be pretty devastating (also let them diagnose that /dev/* is missing), or something equally bad.
o Test failure of root disks & check they reboot, e.g. the data is OK but you only have 1 of a mirrored pair to boot from.
o Test your LAN failure scenarios. Say, make the MAC addresses the same on two LAN cards in different machines on the same subnet.
o Test a catastrophic database crash, say drop database ... or drop a critical table, something that will force a database restore + roll forward from logical logs.
o Test restores from tapes of filesystems.

Personally I would set up the scenario and let the guys figure out what has gone wrong & how to fix it. I would also include trivial problems like a stupid file missing from the filesystem. Keep the guys on their toes. Lastly, if you want to make it really real, do it at 01:00 am & drag them into work!!! How cruel can one guy be !-)

As you may have gathered from the above, I believe that DR is a very underestimated & badly dealt with area. It should be as much about management/technical communication, decision making and problem diagnosis as fixing problems. People only ever think of the worst scenario and test that, which is just a rebuild, v. easy.

Anyway, I'll get off my soap box now

Tim
-
Stefan Farrelly
Honored Contributor

Re: Disaster scenarios


Hmm, some disaster scenarios - this could be fun.

1. If you don't want to warn your admins you're going to do a test, then ask them to simply reboot a server, but before they do, remove the /etc/services file and replace it with one 0 bytes in size (use touch), and do a chmod 4555 /usr/sbin/mount. These 2 events cause a server to boot in the most amazingly bad way and scare the crap out of whoever is watching it start up! Takes some time to diagnose and fix also (can't even boot properly to single-user mode).

2. If you're talking real-world scenarios, then how about asking your admin to do a restore from tape, but blast (erase) the tape they need beforehand so it's unusable. Then see what they do to get around it. Or disable the tape drive they're going to use to restore from and see if they can figure out what to do then.

3. Another good one is to get them to restore to a different server (as your primary server is lost in the "disaster"). Give them the backup tapes; trying to use Ignite or restore to a different server (without access to the "disastered" server) always causes problems.

There are lots more great ideas, but these usually involve hardware.
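
For whoever has to diagnose scenario 1, a rough sketch of a pre-reboot sanity check might look like the following. It is written in Python purely for illustration; the function name check_file is made up here, the demo runs against throwaway files in a temp directory rather than the real /etc/services and /usr/sbin/mount, and the "expected" mode of 0555 is an assumption that should be checked against a healthy box.

```python
import os
import stat
import tempfile


def check_file(path, expected_mode=None):
    """Return a list of human-readable problems found for one file."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return [f"{path}: missing"]
    problems = []
    if st.st_size == 0:
        problems.append(f"{path}: zero bytes")
    actual = stat.S_IMODE(st.st_mode)  # permission bits, including setuid/setgid
    if expected_mode is not None and actual != expected_mode:
        problems.append(f"{path}: mode {actual:04o}, expected {expected_mode:04o}")
    return problems


# Demo on stand-in files only; nothing on the real system is touched.
with tempfile.TemporaryDirectory() as root:
    services = os.path.join(root, "services")
    mount = os.path.join(root, "mount")
    open(services, "w").close()        # zero-byte file, as left behind by touch
    with open(mount, "w") as f:
        f.write("#!/sbin/sh\n")
    os.chmod(mount, 0o4555)            # setuid bit added, as in the scenario
    problems = check_file(services)
    problems += check_file(mount, expected_mode=0o555)  # 0555 is an assumption
    problems += check_file(os.path.join(root, "inittab"))

for line in problems:
    print(line)
```

Run against a list of the real critical files before a planned reboot, something like this would flag both of the booby traps above, though it obviously only catches what it has been told to look for.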

Cheers,

Stefan
Im from Palmerston North, New Zealand, but somehow ended up in London...
Andreas D. Skjervold
Honored Contributor

Re: Disaster scenarios

Hi

Oracle database failures if you need that:

User errors:
- Drop table, truncate table, unintentional row deletes.
+ Resolved by table export file or point-in-time recovery.

Instance failure:
- Background processes have failed.
+ Resolved by restarting the database.

Media failure:
- Disk crash, file erase, I/O problems.
+ Restore of missing datafile, logfile, controlfile and database recovery, depending on backup method.

And I just hate it when these things happen...

Andreas
Only by ignoring what everyone think is important, can you be aware of what everyone ignores!
Paula J Frazer-Campbell
Honored Contributor

Re: Disaster scenarios

Good answers

Further info: -

I do not intend to surprise them, just give them the test server and the problem.

i.e. server down in a 24/7 environment with critical data on board, so that they can work out sequences and the decision points to bring the server/business back on line.

One thing I have found in the past is that each step in a "situation" is not documented. So the day after, when trying to find out what happened, the accounts may vary sufficiently so as not to give clear enough information to say that "X" caused the problem, which is not very impressive when the IT Director wants answers to give to the board.
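
As a rough illustration of what "documenting each step" could mean in practice, here is a minimal sketch of a timestamped incident log. It is only an example: the class name IncidentLog, the admin names and the times are all made up, and a paper log sheet would do the same job.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class IncidentLog:
    """Collects who did what, when, and with what result during an incident."""
    entries: list = field(default_factory=list)

    def record(self, who, action, result="", when=None):
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, who, action, result))

    def timeline(self):
        """Return the entries as text lines in chronological order."""
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return [
            f"{when:%Y-%m-%d %H:%M:%S} {who}: {action}"
            + (f" -> {result}" if result else "")
            for when, who, action, result in ordered
        ]


# Hypothetical entries, deliberately recorded out of order.
start = datetime(2002, 1, 14, 1, 0, tzinfo=timezone.utc)
log = IncidentLog()
log.record("admin2", "fsck on /home", "clean", when=start + timedelta(minutes=12))
log.record("admin1", "server found down, console checked", when=start)

for line in log.timeline():
    print(line)
```

The point is simply that every action gets a time, a name and an outcome at the moment it happens, so the next day's account does not depend on memory.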

Paula
If you can spell SysAdmin then you is one - anon
Alex Glennie
Honored Contributor

Re: Disaster scenarios

That's not what I want to hear !!!!
...... blaming it on X ;)

Paula J Frazer-Campbell
Honored Contributor

Re: Disaster scenarios

Hi
"X" is an unknown and as always "nobody" did it.

Paula
If you can spell SysAdmin then you is one - anon
Thierry Poels_1
Honored Contributor

Re: Disaster scenarios

hi,

the mentioned cases are all direct hits, they require a "simple" recovery to a certain time before the incident.
But what if you have no clue when the problem commenced: after a program modification, data is selectively being corrupted, and it might take days before anybody discovers this. What if somebody who got fired has built in little traps, or still has a backdoor to access the system and corrupts data selectively? It might be hopeless to define the exact beginning of the problem, and since most of the data and other applications are correct, it is probably not desirable to restore an old backup.
The good part of this story is that this is merely a programmer's problem, and the sysadmin can go home for the weekend ;-)
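
If (and it is a big if) there is some consistency check that can tell a damaged copy from a good one, and the damage stays put once it has appeared, then the search for "when did it start" can at least be narrowed by bisecting the backup history. The sketch below is only an illustration under those two assumptions; first_corrupt, the backup records and the invoice check are all invented for the example.

```python
def first_corrupt(backups, is_corrupt):
    """Return the index of the oldest corrupt backup, or None if all are clean.

    Assumes backups are ordered oldest to newest and that every backup taken
    after the first corrupt one is also corrupt.
    """
    lo, hi = 0, len(backups)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_corrupt(backups[mid]):
            hi = mid          # the first bad backup is at mid or earlier
        else:
            lo = mid + 1      # the first bad backup is later than mid
    return lo if lo < len(backups) else None


# Invented data: an invoice whose line items should add up to its total.
backups = [
    ("Mon", {"total": 100, "items": [60, 40]}),
    ("Tue", {"total": 100, "items": [60, 40]}),
    ("Wed", {"total": 100, "items": [60, 4]}),
    ("Thu", {"total": 100, "items": [60, 4]}),
    ("Fri", {"total": 100, "items": [60, 4]}),
]


def totals_disagree(backup):
    _day, record = backup
    return sum(record["items"]) != record["total"]


index = first_corrupt(backups, totals_disagree)
print(backups[index][0])
```

With selective or intermittent corruption the second assumption may well not hold, in which case every backup has to be checked individually.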

regards,
Thierry.
All unix flavours are exactly the same . . . . . . . . . . for end users anyway.
A. Clay Stephenson
Acclaimed Contributor

Re: Disaster scenarios

Ok Paula:

1) Wipe a box. (That was the fire).
2) You are dead - you didn't get out of the building quickly enough because you were trying to save the pet fishies in the aquarium. Your Jr. Admins had poached fish for breakfast and now they have to get to work. All the onsite media was destroyed in the fire including your OS Media / Ignite Data.
3) They have to start with getting the media from the off-site facility except someone forgot to authorize the Jr. Admins to request media.
4) If it ain't in the DR Plan - they don't know how to do it. I'll cut you a little slack - they know how to run vi but not a whole bunch more. They even know the difference between "/" and "\" and most of the time they remember. They are allowed to call HP but only for specific questions.

Believe it or not - this is how I test DR plans; I'm there but I'm a ghost so that I know what needs to be added to the DR Plan.

---------------------------------------------

If I were really mean, Ignite wouldn't get you back up because of budgetary/probability constraints - 1 server has to do the work of 3 servers (though not as fast) so that you have to mix and match in the rebuild/restore.

Enjoy, Clay


If it ain't broke, I can fix that.
Darrell Allen
Honored Contributor

Re: Disaster scenarios

Hi,

How about...

Server physically lost - rebuild on an identical server and also on one with slight differences (i.e. different SCSI paths, NICs, standalone tape drives instead of tape library, etc.)

Patches installed now system unbootable or unstable... What patch was it?

LIF corrupted on boot disk

boot disk or alt boot disk replaced

Latest make_recovery bad

Darrell
"What, Me Worry?" - Alfred E. Neuman (Mad Magazine)
Bernie Vande Griend
Respected Contributor

Re: Disaster scenarios

A lot of this depends on what your disaster recovery plans are.
Do you have a recovery site? Do you have hot servers at the site, warm servers (running but not updated), servers in a warehouse, or use something like SunGard to get equipment from?
A good disaster recovery plan specifies how important each piece of equipment is, how soon the business needs it back, what the plan is to get replacement equipment, and how to restore the system. Each scenario should be tested completely. In most cases you probably can't bring down the current production box, but you should test everything else, including getting the tapes from off site, drop shipping (if needed), configuring the hardware, building the OS, applications and databases, and restoring data.
The key to everything is documentation. A good test of your documentation would be to have an admin who is not aware of your procedures attempt to follow them to restore. After all, there's no guarantee your current admins will be around in a disaster.
Hope this helps a bit.
Ye who thinks he has a lot to say, probably shouldn't.
harry d brown jr
Honored Contributor

Re: Disaster scenarios

Years back I worked for a Software firm and our DR documentation was a little morbid. We included previous employees, how to possibly locate them - family/friends, and their expertise. I was always concerned with us losing everyone or 50% of the staff - our office was in Kansas - remember the Wizard of OZ?

Some good tests that I can add to the others:

(01) Corrupt /etc/passwd - especially the root entry. See if they can fix it on the fly.

(02) Change the root passwd, and make sure to remove any trusted access (.rhosts, su, etc.), then see if they remember how to change the passwd to something they know.

(03) Printer dies, they need to redirect the output, but it needs to be filtered to a different type of printer.

(04) Building can not be entered (fire, bomb threat, whatever), power is still on, do they have the ability to remotely administer the machines - like say from home?

(05) Root disk fails, there is no make_recovery and it is not mirrored, there are 3 other VG's that can't be restored for some reason. Make them rebuild the root disk and import the other 3 VG's.

(06) A system with two external scsi cards, one tape drive on one, a hass rack (2 disks) on the second. The first scsi card with tape drive fails, do they know how to move the tape drive to the other path, while correctly changing the SCSI ID? (assume HP can't get a new card for two days, but you need a backup now!)

(07) Similar to 06, but tape drive fries, do they know how to do a backup from one system to another using tar or cpio?

(08) Take a console, modify every setting - make it useless, don't tell them what you have done, and ask them to shutdown the system from the console, but don't let them replace it with another.

(09) Kill inetd, don't tell them what you did, have them figure out what happened and how to fix it.

(10) A VG has 5 disks, striped using hp striping, their task is to add another disk to the VG. Let them figure out how to do it.

(11) Your network router is down, you have no other routers, but you do have a crossover cable (give them some network cables in a box, but don't tell them what they are; if they ask for a crossover cable, point it out). You need to transfer data from system one to system two - neither has tape drives or removable disks.

(12) Fill up the /opt directory with a very large file. Start a daemon process that opens the file and sits there. Remove the file. Put the utility lsof on the system, but don't tell them it's there unless they ask about it. Have them figure out where that disk space went.
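
For anyone who has not met (12) before, the effect can be reproduced harmlessly. The sketch below is only a demonstration, assuming a POSIX filesystem: it uses a 1 MB temp file instead of a very large file in /opt, and the function name held_after_delete is made up.

```python
import os
import tempfile


def held_after_delete(size):
    """Create a file, delete it while still open, and report what remains."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * size)      # stand-in for the "very large file"
        os.unlink(path)                 # rm: the directory entry is gone ...
        st = os.fstat(fd)               # ... but the open descriptor pins the data
        return os.path.exists(path), st.st_nlink, st.st_size
    finally:
        os.close(fd)                    # only now can the blocks be freed


name_exists, link_count, bytes_held = held_after_delete(1024 * 1024)
print(name_exists, link_count, bytes_held)
```

The name no longer shows up in ls or du, yet the space stays allocated until the last process holding the file closes it, which is why a tool that lists open files is the quickest way to find the culprit.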

live free or die
harry


Live Free or Die
Eileen Millen
Trusted Contributor

Re: Disaster scenarios

The password file gets messed up.
Especially the root entry.
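
A quick structural check of the file is one way to spot this sort of damage. The following is a minimal sketch only: check_passwd and the sample entries are invented, it knows nothing about NIS "+" entries or trusted-system password storage, and it is no substitute for the system's own checking tools.

```python
def check_passwd(lines):
    """Return (line_number, problem) pairs for obviously malformed entries."""
    problems = []
    seen_root = False
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            problems.append((number, "blank line"))
            continue
        fields = line.split(":")
        if len(fields) != 7:
            problems.append((number, f"expected 7 fields, found {len(fields)}"))
            continue
        name, _password, uid, gid, _gecos, _home, _shell = fields
        if not uid.isdigit() or not gid.isdigit():
            problems.append((number, "non-numeric UID or GID"))
        if name == "root":
            seen_root = True
            if uid != "0":
                problems.append((number, "root does not have UID 0"))
    if not seen_root:
        problems.append((0, "no root entry"))
    return problems


# Invented sample: the third entry has lost a field.
sample = [
    "root:x:0:3::/:/sbin/sh",
    "daemon:x:1:5::/:/sbin/sh",
    "oper:x:10220:Operator:/home/oper:/usr/bin/sh",
]
print(check_passwd(sample))

# Invented sample: letter O typed instead of zero in the root entry.
print(check_passwd(["root:x:O:3::/:/sbin/sh"]))
```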

Eileen
Roger Baptiste
Honored Contributor

Re: Disaster scenarios

Paula,

chmod -R 555 /usr
-> We have seen that happen, and even
in this forum I remember cries of
help for this scenario.

In /etc/fstab, swap the mountpoints
of lv's in the VG00 ; eg swap the lv
pointing to /home to the lv pointing
to the /usr (and so on).
-> I ran into this problem when,
by mistake, some admin had copied
over an fstab file from another box.
Everything was OK until the box
was rebooted for scheduled maintenance.
It ran into weird problems on bootup.
Imagine /usr containing /home data ;-)



rm /usr/lib/dld.sl
-> I ran into this with new admins
long ago.

- Dumping a huge file into the /dev directory
under a non-existent device file name, filling up
the root filesystem
-> May not classify as a "disaster" , but
it sure does cause confusion for
newbies

- Restoring a huge database to another system
-> Rattles the nerves of even the best of
admins, since it involves interaction
with the DBAs, tape library, bureaucracy
etc. I have just been through restoring
a terabyte database in response to
an emergency call.

- If the disks are connected to the system
through a switch, power the switch off!
If there are two switches (alternate paths),
switch off only one at a time. Good test for
PV links.

- Disconnect the lan connection and
have them swinstall a product from
a local CD ;-)

- Introduce a typo in the middle of the
password file.

There are many more scenarios, but I
guess you have your file overflowing by now.
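
For the swapped-mountpoint fstab scenario above, comparing the live file against a known-good copy makes the damage obvious. This is just a sketch: parse_fstab and changed_mounts are made-up names, and the volume layout in the sample is hypothetical rather than taken from any real vg00.

```python
def parse_fstab(text):
    """Map each device to its mount point, skipping comments and blank lines."""
    mounts = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) >= 2:
            mounts[fields[0]] = fields[1]
    return mounts


def changed_mounts(known_good, current):
    """Return (device, good_mount, current_mount) for every difference."""
    good, cur = parse_fstab(known_good), parse_fstab(current)
    return sorted(
        (device, good.get(device), cur.get(device))
        for device in good.keys() | cur.keys()
        if good.get(device) != cur.get(device)
    )


# Hypothetical layout; in the second copy /home and /usr have been swapped.
good_copy = """\
/dev/vg00/lvol3 /     vxfs delaylog 0 1
/dev/vg00/lvol4 /home vxfs delaylog 0 2
/dev/vg00/lvol7 /usr  vxfs delaylog 0 2
"""
live_copy = """\
/dev/vg00/lvol3 /     vxfs delaylog 0 1
/dev/vg00/lvol4 /usr  vxfs delaylog 0 2
/dev/vg00/lvol7 /home vxfs delaylog 0 2
"""
for device, was, now in changed_mounts(good_copy, live_copy):
    print(device, was, "->", now)
```

It does rely on somebody having kept a known-good copy somewhere other than the box itself.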

-raj


-
Take it easy.
G. Vrijhoeven
Honored Contributor

Re: Disaster scenarios

Hi Paula,

Tests...

1. Try removing or corrupting (r/c) /etc/inittab, /etc/ioconfig & /stand/ioconfig and reboot the system.

2. Try r/c the oratab file.

3. Try system name change correction. (hostname -S name)

4. A low quorum boot (lvm) and fix a mixed up lvmtab.

5. You must have some documentation about disaster scenarios. What if you cannot access it instantaneously due to a fire? Are they able to set up a system and restore a backup of the documentation without documentation?

Hope this will help.

Gideon

Re: Disaster scenarios

Paula, you have had a lot of good suggestions here, but personally I believe A. Clay has hit the ones most often missed in planning:
1) personnel (especially knowledgeable ones) disabled or unavailable
2) documentation missing/damaged/destroyed
3) access unavailable for equipment/media/documentation.

Murphy's Law: whatever can go wrong, will go wrong!
My house is the bank's, my money the wife's, But my opinions belong to me, not HP!
Michael Tully
Honored Contributor

Re: Disaster scenarios

Hi Paula,

I was thinking about this particular item
over the weekend, and went through some old
course notes that I had. These are some
items that I came up with where the system
is unstable, but could easily be fixed.
Yes, a disaster.... but some good things that
could be practised by junior SAs.

Turn the root mirror off, change the boot
disk to the mirror in the primary boot path.

Change the root shell

Corrupt or missing config files:
hosts, passwd, fstab, netconf, inittab
(host IP address different to netconf)

NFS hanging on system boot (control \)

no pseudo ttys

/stand/vmunix* missing

corrupted or missing /sbin/sh

Swap the tape labels (This is a fun one!)

Hope these scenarios are useful.
It will be very interesting to see the
document when you post it.

-Michael
Anyone for a Mutiny ?
John Bolene
Honored Contributor

Re: Disaster scenarios

We have done partial tests, the problems encountered were:

Incomplete documentation
Recovery tapes could not be found
Recovery tapes could not be loaded (backup was incomplete or not done and nobody was notified or they deleted the message that it was not complete)

It was determined by management that personnel issues would not be considered (lack of planning I thought).

I think beyond the technical and equipment issues that people would be the hardest resource to handle as was found in the WTC disaster.
It is always a good day when you are launching rockets! http://tripolioklahoma.org, Mostly Missiles http://mostlymissiles.com