
Downtime checklists

 
Joe Robinson_2
Super Advisor

Downtime checklists

I'm putting together a checklist of actions to be taken in the event of a system failure and was wondering if any of you had something similar already created (yes, I hate re-inventing wheels)?

Thanks in advance,
Joe
9 REPLIES
Michael Tully
Honored Contributor

Re: Downtime checklists

This is a good thread. I've not seen a white paper on it, but perhaps one could be developed here. I suggest this be split into two different chapters: 1) The system is down and is not coming back up. 2) The system has rebooted in some form after a problem.

1)
Has the system got power?
Can you get into the GSP logs?
Do you know how to interpret the GSP logs?
If not, place a hardware support call.
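
For reference, on GSP-equipped servers the logs are usually reached from the system console roughly like this (the key sequence and command names can vary by model and firmware revision, so treat this as a reminder rather than gospel):

    Ctrl-B   - switch from the console to the GSP prompt
    SL       - show the GSP event/chassis logs
    CO       - return to console mode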

2)
System comes up normally; what should be checked? (A collection sketch follows this list.)
- /var/adm/syslog/syslog.log
- /var/adm/syslog/OLDsyslog.log
- root mail
- /var/tombstones
- /etc/rc.log
- /etc/shutdownlog
- crashdump (/var/adm/crash)
- stm logs
- GSP logs
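
To make chapter 2) repeatable, here is a minimal collection sketch assuming the standard HP-UX log locations listed above; the output directory and the script itself are only an illustration, not a supported tool:

#!/usr/bin/sh
# Gather post-reboot evidence into a time-stamped directory for later review.
OUT=/var/tmp/reboot-evidence.$(date +%Y%m%d%H%M)   # directory name is arbitrary
mkdir -p $OUT
for f in /var/adm/syslog/syslog.log /var/adm/syslog/OLDsyslog.log \
         /etc/rc.log /etc/shutdownlog
do
    [ -f $f ] && cp $f $OUT
done
# Tombstones and crash dumps can be large; just record what is there.
ls -l /var/tombstones > $OUT/tombstones.list 2>/dev/null
ls -l /var/adm/crash  > $OUT/crash.list 2>/dev/null
# Root mail, stm (cstm/mstm/xstm) and the GSP logs are interactive; note them for manual review.
echo "Also check: root mail, stm logs, GSP logs" > $OUT/manual-steps.txt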
Anyone for a Mutiny ?
A. Clay Stephenson
Acclaimed Contributor

Re: Downtime checklists

In general, if a system crashes, the only thing of value is examining the dump image using q4. If the crash is bad enough, you won't even have that --- though those cases are very rare. I have to tell you that preparing a checklist for crashes is really giving up the battle. The real goal is to never have crashes --- or to prepare so that if they do happen they don't matter very much (e.g. MC/ServiceGuard).

I think you will find that if you mirror (or RAID) everything, only use hot-plug disks, have multiple NICs and switches, and keep clean power and a clean environment, your checklist is almost useless --- and that's a good thing. If you filter all your patches and upgrades through a sandbox, then crashes will become extremely rare. I speak from experience, as I have recently passed the 5-year mark with zero unplanned production downtime. I don't need no stinkin' checklist for crashes.

If it ain't broke, I can fix that.
Jeff_Traigle
Honored Contributor

Re: Downtime checklists

Hmmm... not to sound stuck-up, but if anyone is trying to troubleshoot from an all-purpose checklist, they're in trouble, in my experience. Troubleshooting involves knowing the system well from many angles. By understanding how it works, you can better interpret error messages and general system behavior to determine what is causing the problem when a failure occurs. (While I agree with Clay about engineering a system so failures aren't fatal, that's not always possible when management with unwisely tight purse strings is involved... been there, done that.)

This kind of checklist reminds me of the "disaster/recovery" documentation that was requested from site personnel at my previous job by one of the other big IT companies. They wanted detailed procedures of how we would go about troubleshooting a problem on a system that was down. I laughed and told the IT Lead (who was quite UNIX-savvy himself and laughing right along with me) that I'd be more than happy to hand them a copy of all the HP-UX manuals because I wasn't writing volumes of documentation that already existed from the vendor. I'm not sure what, if anything, got submitted.
--
Jeff Traigle
Carlo Corthouts
Frequent Advisor

Re: Downtime checklists

The checklist might be interesting not so much because you can troubleshoot everything yourself, but because you can prepare some information for when you log a HW/SW support call to figure out why the system went down.

From my work experience I know that a lot of time is wasted before you get a result from the vendor, simply because of the time spent gathering the necessary information.

What you should always do when you have a crash is see whether you can get your system up and running again, and have a co-worker log a call right away.

Even if you get the system back into production, it is still a good idea to have the support people help you figure out the cause of the crash.
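
To put that into practice, something like the following could be run while the call is being logged; it is only a sketch built from ordinary HP-UX commands, and the output file name is arbitrary:

# Snapshot basic facts to hand to the support engineer (illustrative only).
OUT=/var/tmp/support-call.$(hostname).$(date +%Y%m%d%H%M).txt
{
  echo "=== uname ===";         uname -a
  echo "=== last boot ===";     who -b
  echo "=== uptime ===";        uptime
  echo "=== recent syslog ==="; tail -200 /var/adm/syslog/syslog.log
  echo "=== crash dumps ===";   ls -l /var/adm/crash
} > $OUT 2>&1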

Joe Robinson_2
Super Advisor

Re: Downtime checklists

This thread got EVERYBODY going! Just to clarify, I was also thinking along a functional train of thought, i.e. a) notify the help desk, b) notify management, etc. Fortunately (for me), I've only had about 20 minutes of downtime in the last 10 years (despite what any of you might think, I call it blind luck), but I would like to come up with a good tool that delineates the functional steps in the event of downtime, so that we cover all the bases and do it the same way every time that I (we) get hit with the "whammy".
Rick Garland
Honored Contributor

Re: Downtime checklists

Part of what you are asking depends on your environment: hardware, software, and policies. I say policies because not everybody needs to notify the helpdesk. (I'm not saying it's a bad idea, just that it is not a requirement everywhere.)

These policies are just as important (if not more important) to develop as a hardware/software checklist. This is a conversation to have with your fellow team members/employees.
Rick Garland
Honored Contributor

Re: Downtime checklists

Some examples from past employers:

If a system went down, I would notify the 1st-tier support personnel. The 1st-tier folks would in turn notify those who needed to know, and the info was then disseminated out to the user community.

At another location I would notify the helpdesk directly, even though I was 3rd tier.

Still another location would post the system status on a web site and consider that notification of all users.

Again, this is a question for your co-workers, peers, supervisors, etc...
Gavin Clarke
Trusted Contributor

Re: Downtime checklists

I can think of two occasions where we've had unplanned downtime.

The last one was a change to the application, which we had to get people off the system to rectify. I had made the change, and I backed it out after some worrying that it was some other issue.

The one before that was a suspected controller failure. We have Serviceguard, but the load seemed to be too much for one server; it wasn't particularly comfortable here for a while.

Anyway, I think I tend to react the same way each time:

1. What has changed? Can I think of anything recent that might have caused the problem?
2. What do the forums say; has anyone seen this before? Yes, I quite often come here for a search before logging a call with HP.
3. Check logs: syslog, dmesg (or /var/adm/messages), cmviewcl, bdf, vmstat (see the sketch after this list).
4. Log a call with HP.
5. Look at the disk array and LAN cards.
6. HP went through armlog -e, armdsp -a, armdiag, etc.
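
Step 3 is easy to script so the same things get looked at the same way every time; this is a rough sketch using the commands above (cmviewcl only applies where Serviceguard is installed, and the file name is just an example):

# Quick first look; keep the output so it can go straight into the HP call.
LOG=/var/tmp/firstlook.$(date +%Y%m%d%H%M)
{
  echo "### syslog tail"; tail -100 /var/adm/syslog/syslog.log
  echo "### dmesg";       dmesg
  echo "### bdf";         bdf
  echo "### vmstat";      vmstat 5 3
  echo "### cluster";     cmviewcl -v    # Serviceguard systems only
} > $LOG 2>&1
more $LOG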

While this is going on, my manager is telling the users what is happening (via the helpdesk) and our DBA is looking for database problems.

So there it is, my rather haphazard route to rectifying unplanned downtime.
I hope this helps a little.
Kent Ostby
Honored Contributor

Re: Downtime checklists

Joe -- For the application that I manage, my checklist is pretty much:

#1) Start up application on failover system.

#2) Check to see if this looks like a long outage or a short outage (just a best guess) (e.g. short outage -- the sysadmin rebooted the machine my app runs on without telling me of any planned downtime; long outage -- no one knows why the machine crashed).

If it's a short outage, then just wait for the app to come back up.

If it's a long outage, then email a predefined list of managers and ask them to have their people log into the failover machine (a sketch of that notification follows below).

For a system, I would think it would be similar. Get ready to move your critical applications if necessary and have a list of people you need to notify.
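
The "long outage" notification in step #2 can also be canned in advance; this is only a minimal sketch assuming a maintained address-list file and plain mailx, where the file path, host name, and wording are placeholders:

#!/usr/bin/sh
# Declare a long outage: mail the predefined manager list, pointing users at the failover box.
MGRS=/usr/local/etc/outage-managers.list    # hypothetical file, one address per line
FAILOVER=failover-host                      # hypothetical failover system name
mailx -s "Outage: please have your people log into $FAILOVER" $(cat $MGRS) <<EOF
The primary system is down and the cause is not yet known.
The application is running on $FAILOVER until further notice.
EOF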

Best regards,

Kent M. Ostby
"Well, actually, she is a rocket scientist" -- Steve Martin in "Roxanne"