Operating System - HP-UX

What to do when you aren't sure a reboot will work (*on a remote server)

 
Richard Briggs
Regular Advisor

So, one of my admins fat-fingered a command... and several files under /etc got clobbered.

I've restored what I could from backups, and recreated what I could with many of the lv and vg commands... ioscans yadda yadda...

What I really would like to know (if you have any hints)...is how I can reasonably test if a reboot will work... without rebooting... it's a remote server in an unmanned office and I don't have remote console... just ssh & telnet... which work! Any recommendations on commands to run to kinda test a reboot?

Thanks!
#find / -name coffee | cup < cream
Rita C Workman
Honored Contributor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

Do you have your recovery on tape, or do you use make_net_recovery?

You really can't be 100% sure everything is solid till you reboot. And then...if you don't have a solid current recovery solution....well....I think you know the rest.

If you don't have a recovery solution, then you need to make fresh exports of all your data vg's and put a copy on another "safe" box. It will save you a ton of work if you choke on reboot. Other thing, any vendor "hook" files or anything unique on that box - copy it over to that "safe" place.

Holler back - folks here love a challenge !
Rgrds,
Rita

Torsten.
Acclaimed Contributor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

Would be better to know what you and your admins did ...
So far the only solid test if a reboot works is in fact a reboot.
Somebody should be prepared to put the ignite tape into the drive ...

Hope this helps!
Regards
Torsten.

__________________________________________________
There are only 10 types of people in the world -
those who understand binary, and those who don't.

__________________________________________________
No support by private messages. Please ask the forum!

Matti_Kurkela
Honored Contributor
Solution

Re: What to do when you aren't sure a reboot will work (*on a remote server)

I'm assuming you know for a fact that the damage was indeed limited to the files under /etc.

With good backups, you should be able to restore a copy of the entire /etc sub-tree to some other location, e.g. /var/tmp/etc and then run "diff -rc /var/tmp/etc /etc" to find the differences. Then review to see which changes are meaningful and/or desirable.
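If the raw "diff -rc" output is too long to review in one go, a short script can first list which files are affected. This is only a minimal sketch: the function name compare_trees is made up for illustration, and /var/tmp/etc is the example restore location from the paragraph above. It reports files that exist in the restored copy but are missing or different on the live system; it does not show the differences themselves.

```shell
# compare_trees REF LIVE
# Walk every regular file under REF (the restored copy) and report whether
# the same relative path is missing or has different content under LIVE.
compare_trees() {
    ref=$1
    live=$2
    ( cd "$ref" && find . -type f -print ) | sort | while IFS= read -r rel; do
        rel=${rel#./}                       # strip the leading "./" from find
        if [ ! -e "$live/$rel" ]; then
            echo "MISSING $rel"
        elif ! cmp -s "$ref/$rel" "$live/$rel"; then
            echo "DIFFERS $rel"
        fi
    done
}

# Example (not run here):
#   compare_trees /var/tmp/etc /etc
```

Anything reported as DIFFERS can then be inspected individually with diff. Files that exist only on the live system are not reported by this sketch.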

If you cannot do that, someone should prepare to travel to that remote office for a reboot test.

If that is not possible either, you have a lot of careful checking to do, and a high-anxiety-factor reboot after that. I would not recommend this.

In any case, when you do the first real reboot, you should at the very *least* have someone present at the local console who will not be mentally paralyzed by the sight of the command line *and* can reliably relay information and follow instructions to the letter.

This is *not* an exhaustive list, just a few things that come to mind:

1.) /etc/ioconfig - is it OK and does it match /stand/ioconfig? Any problems here might (in worst case) cause a prompt that must be answered on the console.

2.) Does /etc/lvmtab contain all the volume groups and their disks? Although the system *should* be able to complete the reboot with just vg00, it saves some effort to fix this now if necessary.

3.) Is /etc/inittab OK? How about the files it refers to?

With these checks successful, the system at least has a chance to come up to _single user mode_ without complications.

4.) The usual suspects: check the files in /etc/rc.config.d. If you don't have a remote console, be *extremely* careful about network parameters. Also make sure that /etc/rc.config is OK. And don't leave any backup copies or any other junk in /etc/rc.config.d!

5.) Is the PAM configuration OK? (It probably is, because the system allowed you to log in now. Check anyway.)

6.) For network access, check /etc/inetd.conf and the name service configuration files: /etc/nsswitch.conf, /etc/resolv.conf, /etc/hosts. If you use NIS or LDAP, you may have even more things to check here.

7.) The big one: start with /etc/inittab, find any scripts it runs, then examine those scripts to find out what sub-scripts they might call and what files under /etc/ are used. In essence, step through the scripted parts of the HP-UX boot sequence in your head.
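As a starting point for steps 3 and 7, the first pass over /etc/inittab can be scripted. The sketch below assumes the usual id:rstate:action:process layout of inittab entries; the function name check_inittab is made up for illustration. It only checks that the program named at the start of each process field exists and is executable. It does not follow the scripts any further, so the stepping-through still has to be done by hand.

```shell
# check_inittab FILE
# For each entry (id:rstate:action:process), take the first word of the
# process field and complain if it is an absolute path that does not point
# to an executable file. Blank lines and comment lines are skipped.
check_inittab() {
    file=$1
    while IFS=: read -r id rstate action proc; do
        case $id in
            ''|\#*) continue ;;
        esac
        cmd=${proc%% *}                     # first word of the process field
        case $cmd in
            /*) [ -x "$cmd" ] || echo "entry $id: $cmd is missing or not executable" ;;
        esac
    done < "$file"
}

# Example (not run here):
#   check_inittab /etc/inittab
```

Entries whose process field does not start with an absolute path (for example an initdefault line with an empty process field) are silently skipped.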

MK
Richard Briggs
Regular Advisor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

MK - Thanks. I did every one of those except inittab. But checked and it's good.

I've been doing HPUX since version 9.0 - and there *used* to be a little white paper floating around with "50 files you have to have to boot" or something like that (perhaps more catchy) Maybe it was 49 files. I dunno. Was hoping someone remembered and had a copy.

/etc/rc.config.d/netconf was a big one. Also nsswitch.conf, resolv.conf, fstab, etc.

Anyway, thanks for the good tips. No one will be onsite, so it will be a white-knuckle reboot. Hate those.
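Until that white paper turns up, a home-made checklist can at least catch files that are gone or zero-length. This is a rough sketch only: the function name check_files is made up, and the file list in the example is just the set of files named so far in this thread, not the list from the white paper. Some of them (resolv.conf, for instance) may legitimately be absent on a given system.

```shell
# check_files PREFIX FILE...
# Report every listed file that is missing or empty under PREFIX.
# Pass an empty PREFIX ("") to check the live system.
check_files() {
    prefix=$1
    shift
    for f in "$@"; do
        if [ ! -e "$prefix$f" ]; then
            echo "MISSING $f"
        elif [ ! -s "$prefix$f" ]; then
            echo "EMPTY $f"
        fi
    done
}

# Example (not run here), using only files mentioned in this thread:
#   check_files "" /etc/inittab /etc/fstab /etc/lvmtab /etc/ioconfig \
#       /stand/ioconfig /etc/rc.config /etc/rc.config.d/netconf \
#       /etc/nsswitch.conf /etc/resolv.conf /etc/hosts /etc/inetd.conf
```

A file that passes this check can still have wrong contents, so it complements the diff against the backup rather than replacing it.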
#find / -name coffee | cup < cream
TwoProc
Honored Contributor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

I'd go with Matti's suggestion - and by hand recover any missing files you find.

Now it's time to try and simulate a reboot as much as possible - well, you sort of can by stepping through all the elements, in reverse order in /sbin/rcX.d.

I'm assuming you're running at run level 3. Keep in mind that services start from the lowest numbers in the /sbin/rcX.d directories and go up in value. So we just need to work the problem in reverse - from the highest-numbered service back down to the lowest.

Go to /sbin/rc3.d and individually, stop and restart every service in the file list that starts with SXXX[someservicename]. Make sure you pick the highest numbered service first. You can issue for example:
./S693[someservicename] status
and if that's OK then
./S693[someservicename] restart
and finally, to leave it down to keep moving back downward through the list
./S693[someservicename] stop

Keep in mind that some services possibly may not have a restart command, so you'd just issue a stop, then a start instead - you get the idea here.

If that piece bounces up, then you're OK; proceed BACKWARDS (HIGHEST NUMBERS FIRST) until you've gone through everything in the list.
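To get the working list in the right order, something like the following could be used. It is a minimal sketch that only prints the S* entries of a run-level directory, highest number first, and does not stop or start anything. The function name list_start_scripts_reverse is made up for illustration.

```shell
# list_start_scripts_reverse DIR
# Print the names of the S* entries in a run-level directory,
# highest sequence number first.
list_start_scripts_reverse() {
    for s in "$1"/S*; do
        [ -e "$s" ] && echo "${s##*/}"
    done | LC_ALL=C sort -r
}

# Example (not run here):
#   list_start_scripts_reverse /sbin/rc3.d
```

A plain reverse sort is enough here on the assumption that the sequence numbers all have the same number of digits, as in the S693 example above.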

If that all works, then it's time to do an "init 2" and begin working through all the items in run level 2. If that all works, then run "init 1" and repeat for all items in run level 1. If you make it through all that testing, it's just time to verify that your root is bootable and that / and /stand are there.


Now, verify that each and every file system - one at a time will unmount and remount.
Look at each line in /etc/fstab to get your working list.
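The working list can be pulled out of /etc/fstab with a few lines of script. This is a sketch under the assumption that the file uses the usual whitespace-separated layout (device, mount point, type, ...); the function name fstab_mount_points is made up. It only prints mount points and does not unmount anything. Root is left out because it cannot be unmounted on a running system, and swap entries are skipped.

```shell
# fstab_mount_points FILE
# Print the mount points from an fstab-style file, deepest paths first,
# skipping comment lines, swap entries and the root file system.
fstab_mount_points() {
    awk '!/^[ \t]*#/ && NF >= 3 && $3 != "swap" && $2 != "/" && $2 ~ /^\// { print $2 }' "$1" |
        LC_ALL=C sort -r
}

# Example (not run here):
#   fstab_mount_points /etc/fstab
```

Sorting in reverse puts nested mount points (such as a file system mounted below /var) ahead of their parents, which is the order they would have to be unmounted in. Expect some file systems to be busy and refuse to unmount while the system is in use.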

To verify that you have a bootable kernel, the easiest way is just to go into "SAM" and create a new one. You can do this by choosing "kernel configuration" and then "configurable parameters". From that menu choose actions and see if you can select "process new kernel". It will probably let you do that without making any changes to the kernel; you'll just get a new one to boot from. If it won't let you without making a change, just change something like "semmns" to one more than it already is, and then you can generate a new kernel.

If you think any other file systems might have gotten nuked (like /opt, /usr, etc.) - restore those and compare them too. But it sounds like you know where the admin was when the command was issued, so you're probably in better shape than you think.

Now, you'll have verified all of the disks, all the contents of /etc that you need, and you definitely have a bootable kernel in place.

If you've passed all this stuff, then you've pretty much covered it all as much as possible, and I think there is an excellent chance that you'd come up fine on reboot. I can't see why not at that point.
We are the people our parents warned us about --Jimmy Buffett
TwoProc
Honored Contributor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

Oh, just one more thing.
Keep in mind that after doing an "init 2" for the next level of testing, you'd have to do a "cd /sbin/rc2.d" - don't just stay in the /sbin/rc3.d directory.
We are the people our parents warned us about --Jimmy Buffett
Richard Briggs
Regular Advisor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

TwoProc - Thanks Good points all. One question though... at one point in init 2 or init 1 - the network connectivity is going to go down...and that would be baaaaaaad

right?
#find / -name coffee | cup < cream
TwoProc
Honored Contributor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

Auuugh crud. I forgot about that. Well, over here, we use those little ssh boxes that you can connect to a box, and plug one end to a serial port and the other end to a network line. We don't have tons of those, but when I want console access for a long period of time to somewhere I'll go stick one on a server (or have the operator do it for me) so I don't have to work in the computer room - and still have secure console level communication with the box.

So you could get one of those, and send it there, and have them hook one of those up. There are several brands to choose from.

Or, if you can figure out what the network pieces are, just leave them out of the test, and don't init between levels: just bring down everything in level 3 you want except anything networking, and then repeat in the level 2 directory. This, of course, isn't NEARLY as good a test, so I'd recommend the ssh box strategy.

You might also have lying around some old hardware that HP called "remote console", which was a little box that you'd hook to the serial port, and it had a URL to hit to use the console. It was both slow and unsecured, but it would work if you can believe that tunnelling over to the remote site leaves you secure "enough" to recover that box. Alternatively, if you've got a box near that box, you could set up an ssh tunnel with a squid server going to that remote console box, and hit that squid server with your browser (by telling it to use a proxy server). I've done this before, but it's quite a bit of setup, especially if you've not done it before.

I'm suggesting, as the best solution, shipping over an ssh console box to that location, already configured by you to work off of the IP address range at that location. You should be able to locate a vendor with it in stock and have it overnighted to you, have it figured out at your site in a day, have it overnighted that evening to the remote site (well, I guess it depends on how remote), and have it plugged in and up on the next day. Meaning that you'd be a total of three days out from a test reboot.
We are the people our parents warned us about --Jimmy Buffett
Richard Briggs
Regular Advisor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

It would be easy with budget $$$$ and people onsite. ;) Anyone find that list?
#find / -name coffee | cup < cream
TwoProc
Honored Contributor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

OK, Richard ...

I'm done. Out of ideas.
Do everything you possibly can to test and then boot that sucker!!! If it totally crashes from the boot, then they HAVE to let you go fix it!
:-)

Where's the box? Maybe it's near me and I can just go fix it for you.

:-)
We are the people our parents warned us about --Jimmy Buffett
Richard Briggs
Regular Advisor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

Thanks again. I'm leaving it open just on the off chance someone finds that white paper on the top 50 files to check before reboot... ...i can dream.
#find / -name coffee | cup < cream
TwoProc
Honored Contributor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

Just thought of another one, and it wouldn't be hard to do at all.

Make an ignite backup of the server (which can be done remotely over to a tape drive near you). Restore that backup onto one of your own servers locally, and see if it starts up.

This method should work for you very well to see if you really can boot, and if not, why not. You could then proceed to fix each item one at a time, both locally and on the remote server until the local one comes up.

When done, get a new ignite backup and replay the whole scenario. When you've got it working here, and all fixes in place over there - you're bound to be up and working fine on your next reboot at the remote site.
We are the people our parents warned us about --Jimmy Buffett
Richard Briggs
Regular Advisor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

The server's gzip-compressed golden image is 70 GB... I can't find a tape that big. My DDS-4 tapes are only 20 GB. Are you talking about just a boot tape? And wouldn't the existing (possibly foobar'd) server's info be duplicated to the ignite tape anyway?
#find / -name coffee | cup < cream
TwoProc
Honored Contributor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

Yes, I'm talking about just an ignite boot tape. It should fit comfortably on your DDS tape; however, an ignite backup to an ignite network server goes to disk, so you're not limited to a tape drive. Got a 36 Gig hd on an HPUX box somewhere? Make an ignite server out of it, and then ignite that remote server to it. Then you can proceed with my recommendation. In your specs for making a boot tape, you can specify just the boot volume group, that is, vg00 usually. Most people don't put lots of other stuff in vg00; that usually goes on other disks and therefore in other vg's. So the size is usually small(ish).
We are the people our parents warned us about --Jimmy Buffett
Richard Briggs
Regular Advisor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

Sounds good. Thanks for the info.
#find / -name coffee | cup < cream
TwoProc
Honored Contributor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

Under the FWIW department:

http://www.circlingcycle.com.au/Unix-sources/HP-UX-check-OAT.pl.txt

It's a script from a forum member (here) who had problems going to customer sites and then not knowing whether or not the systems could stand up to patching, rebooting, etc.

So he wrote a script that checks for a wide variety of issues. I've not run it or tested it, but its goals at least sound similar to your needs. Sadly, I snagged the link but not the name of the guy who posted it, and he didn't put his name in the script itself (which I was kinda counting on). So, I hate this, but I lost the original poster's name.
We are the people our parents warned us about --Jimmy Buffett
Richard Briggs
Regular Advisor

Re: What to do when you aren't sure a reboot will work (*on a remote server)

wow... i'm kinda scared to let that beast run -- hafta look into it... but thanks! !
#find / -name coffee | cup < cream