Proper way to recover from hung system

I have an HP-UX 11.23 system that has hung and become unresponsive. I am running audsys, and it appears that the filesystem it writes to has reached 100% full, although that filesystem is not in vg00. What is the proper way to get the system back up and running?
5 REPLIES
Steven E. Protter
Exalted Contributor

Re: Proper way to recover from hung system

Shalom,

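Boot the system into single user mode at the console.

On PA-RISC, at the ISL prompt:

ISL> hpux -is

On Integrity (IA-64), at the HPUX> loader prompt:

HPUX> boot -is vmunix

Once you have a shell, mount the filesystems:

mount -a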

Then try to free up space on the full filesystem.
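A rough sketch of one way to hunt for the files eating the space once the filesystem is mounted. The function name big_files and the 100 MB default are my own picks for illustration, not anything from this thread; it relies only on the standard find options -xdev (stay on one filesystem) and -size (counted in 512-byte blocks).

```shell
# List regular files under a directory that are larger than a given size.
# Hypothetical helper: the name and the default threshold are illustrative.
big_files() {
    dir="$1"
    min_blocks="${2:-204800}"   # 204800 x 512 bytes = 100 MB
    # -xdev keeps find on the filesystem that holds "$dir"
    find "$dir" -xdev -type f -size +"$min_blocks" -print
}
```

For example, big_files /audit would list files over roughly 100 MB under /audit (a made-up mount point here). Review the list before removing or truncating anything.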

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Bill Hassell
Honored Contributor
Solution

Re: Proper way to recover from hung system

The first step is to define the 'hang'. This is a general guide for hangs:

A system may be perfectly fine while a switch port has gone bad or been turned off by mistake by the network department. So start with the real console. Ideally, the console LAN port is on an isolated (and secure) network. If it is not, a network outage will make all remote access impossible. In that case, get to the computer room and hook up a serial console terminal, or a PC with a known-to-work serial adapter.

Start with console access to see if the server is even powered on. If you can get a GSP/MP prompt, then connect to the console. If you can't get a login prompt there, try CTRL-C to kill anything talking to the console. Then try CTRL-\ to send a SIGQUIT signal. If you get a system prompt and can log in, start by shutting down any applications or databases. Now you can take a look at the system. Start with bdf -l (-l = local, skips NFS). If the filesystems are OK, check the NFS servers that are supposed to be working, because a dead NFS server can hang its clients.
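bdf wraps its output onto a second line when the device name is long, which makes the listing awkward to scan by eye or by script. A minimal sketch that picks out the nearly full filesystems, assuming the usual six-column bdf layout; full_filesystems is a made-up helper name and the 90% default is arbitrary.

```shell
# Print "mountpoint used%" for filesystems at or above a usage threshold.
# Reads bdf-style output on stdin. Hypothetical helper for illustration.
full_filesystems() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR == 1 { next }              # skip the header line
        NF == 1 { dev = $1; next }    # wrapped device name, wait for the rest
        {
            if (dev != "") { $0 = dev " " $0; dev = "" }   # rejoin wrapped line
            pct = $5; sub(/%/, "", pct)
            if (pct + 0 >= limit) print $6, $5
        }'
}
```

Typical use would be: bdf -l | full_filesystems 95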

Since you mentioned audsys, you must have some access to the system, so shutting down auditing is your top priority. The audit subsystem that audsys controls is very extensive, and it can generate gigabytes of data in a few minutes if you audit everything. Always choose the audit settings very carefully to record just what you need. It is not meant as a general tool for watching user activities.
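For reference, a minimal sketch of turning auditing off from a root shell. audsys -f is the switch that halts the audit system; the wrapper function and the AUDSYS override (handy for a dry run) are assumptions of mine, not part of HP-UX.

```shell
# Path to audsys. Override for a dry run, e.g. AUDSYS="echo audsys".
AUDSYS="${AUDSYS:-/usr/sbin/audsys}"

# Hypothetical wrapper: halt the audit system so it stops writing audit files.
stop_auditing() {
    $AUDSYS -f    # -f turns auditing off
}
```

Running audsys with no arguments afterwards should report the current auditing status, which is a quick way to confirm that it is off before you start cleaning up.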

Still no response? You'll probably need to TC the system (Transfer Of Control), which forces a crash dump. Once the system finishes selftest, interrupt the boot process and come up in single user mode. Run fsck on the lvols for /usr, /tmp and /var. You'll need to find the raw device (rlvol) name for each of those mountpoints. Then mount those three filesystems (do not use mount -a). Now you'll have access to commands like bdf, du, vi and so on.
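A small sketch of how the rlvol name can be derived from /etc/fstab. It uses only shell builtins, on the assumption that tools under /usr/bin are not available until /usr is mounted. raw_lvol_for is a hypothetical helper name, and it assumes the usual LVM naming where /dev/vg00/lvol7 has the raw device /dev/vg00/rlvol7.

```shell
# Print the raw (character) device for a mount point listed in an fstab file.
# Hypothetical helper; the second argument defaults to /etc/fstab.
raw_lvol_for() {
    mnt="$1"
    fstab="${2:-/etc/fstab}"
    while read -r dev dir rest; do
        case "$dev" in "#"*|"") continue ;; esac   # skip comments and blank lines
        if [ "$dir" = "$mnt" ]; then
            # /dev/vg00/lvol7 -> /dev/vg00/rlvol7
            echo "${dev%/*}/r${dev##*/}"
            return 0
        fi
    done < "$fstab"
    return 1    # mount point not found
}
```

In single user mode this could be used as fsck -F vxfs "$(raw_lvol_for /usr)" followed by mount /usr, and then repeated for /tmp and /var. The vxfs type is an assumption; use whatever type /etc/fstab shows for each filesystem.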

If you find and fix the problem then you can reboot into single user mode. If not, reboot and monitor the startup steps. They are also recorded in /etc/rc.log. Once the system is running, you can offload the /var/adm/crash directory so you can submit the dump to HP for analysis. Hangs in the HP-UX kernel are very unusual unless this is an old system with no patches. The dump itself will require HP-UX internals training to analyze the reason for the hang.

If you don't have HP software support, then bring your system up to the current (within 6 months) patch level.


Bill Hassell, sysadmin

Re: Proper way to recover from hung system

From what I could tell, a problem generated a ton of semop errors through audsys, which filled up the external vg02 that I use for logging. I had two open ssh sessions that stopped responding. The MP console let me in, but when I tried to log in to the system console it accepted the user ID and then hung at the password. All traffic stopped and the system became unresponsive. I had no way to get in other than to cycle the power and force a reboot. Not the best choice of the day, but I couldn't figure out a more graceful way to get back in. I appreciate both of your answers and information.
N,Vipin
Frequent Advisor

Re: Proper way to recover from hung system

First, make sure the hang is not caused by a network issue (switch port, bad cable, etc.). If the system does not respond even at the console, the only option is to restart it. I would use the MP:CM> TC command for that. TC creates a crash dump under /var/adm/crash, which helps to find the root cause of the hang later.
Wim Rombauts
Honored Contributor

Re: Proper way to recover from hung system

The most important question is: how do I change my system configuration to prevent this from happening again? It's not wrong to make mistakes. It's wrong not to learn from them.
- Increase the size of the filesystem that was full?
- Maybe set the "nolargefiles" option so that multi-gigabyte files cannot be created and cannot fill the filesystem to 100%?
- Modify your auditing configuration?