Lost a SCSI disk and system hangs

 
Michael D'Aulerio
Regular Advisor

I'm running HP-UX 10.20 on a 744. My system is a little out of date. We have 4 of the 744s networked together on a local network. Each node monitors the availability of the other nodes using a TCP/IP ping. The software is installed on 2 external SCSI drives. Recently we ran into a problem where the external boot drive stopped responding. The workstation hung. All the X display screens and the keyboard and mouse froze and you could not log into the system. But it still responded to the ping command. Any ideas?


The monitor process we have feeds information to another process on each node, which distributes processes among the 4. When one workstation goes down, the processes assigned to it are relocated to another box. Most of the time this works fine. In the case of the drive failure, though, we run into major problems.

The process that monitors node availability and the process that allocates processes are both custom programs written by our project. We inherited a lot of legacy code from a former project.
Email: michael.n.daulerio@lmco.com
Patrick Wallek
Honored Contributor

Re: Lost a SCSI disk and system hangs

If your boot drive is not mirrored, which it sounds like it isn't, then this makes sense.

The ping command works at the network level and requires very little from the other parts of the OS, specifically no disk IO.

Any other process requires some sort of IO to run, be it reading the program from disk, opening device files, etc. When the boot disk went bad, any IO that was or is pointed toward anything on that disk will hang indefinitely. You most likely cannot even get logged in because you've got to read things like /etc/profile, /home/????/.profile, etc.

With your process as it is designed, you are really out of luck in this case. The only real way around this is to mirror the disks on the box. If you don't have Mirror Disk already you would have to buy it but unfortunately HP-UX 10.20 is WAY out of support so I doubt you could even buy that anymore.
Bill Hassell
Honored Contributor

Re: Lost a SCSI disk and system hangs

ping responds because it is a very low level response and essentially takes place in memory. It is not an adequate indicator of the system's health. Xwindows is critically dependent on having a running system and if the boot disk is dead, virtually everything stops. This is a case where mirroring the disks would help. The monitoring you are describing is a crude form of Service Guard but the broken root disk will almost always cause the hang you describe. The reason is that the kernel needs the disk for swap (it's part of memory), for logging and to refresh screens. There is no easy solution short of a Service Guard environment. MirrorDisk would help but since you are running 10.20, it has not been available for purchase for a couple of years.
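
To illustrate why a health check has to touch the disk rather than just the network stack, here is a minimal sketch in Python (an assumption; the legacy monitor is presumably written in something else). The names `touch_and_read` and `disk_responsive` are hypothetical. The idea is that each node runs a small write-and-read-back probe in a worker thread and treats "no answer within the timeout" as a failure, since IO against a dead disk tends to block rather than return an error.

```python
import os
import tempfile
import threading


def touch_and_read(directory):
    """Write a small file, force it to disk, and read it back."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"heartbeat")
        os.fsync(fd)  # ask the OS to push the data to the device
    finally:
        os.close(fd)
    try:
        with open(path, "rb") as f:
            return f.read() == b"heartbeat"
    finally:
        os.remove(path)


def disk_responsive(directory, timeout=5.0, probe=touch_and_read):
    """Return True only if the probe finishes successfully within timeout seconds."""
    result = []

    def worker():
        try:
            result.append(bool(probe(directory)))
        except OSError:
            result.append(False)

    # Daemon thread: if the probe blocks forever on a dead disk,
    # it is simply abandoned instead of keeping the process alive.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    return bool(result) and result[0]
```

A node could report the result of `disk_responsive("/some/dir/on/the/boot/disk")` to its peers instead of relying on ping alone. Note that a thread stuck in disk IO cannot be killed, only abandoned, and the monitor itself may still hang if it needs to page anything in from the failed disk, so this is a sketch of the principle rather than a guaranteed fix.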


Bill Hassell, sysadmin
Ted Buis
Honored Contributor

Re: Lost a SCSI disk and system hangs

The 744 is a "workstation" system in HP terms and so ServiceGuard was never an option for that product.
Mom 6
Denver Osborn
Honored Contributor

Re: Lost a SCSI disk and system hangs

It sounds like you have your own type of failover in place, but it's only good if the ping doesn't respond. So in the case of your disk failure, the box was up and responded to a ping... so the other nodes thought all was well.

Maybe what you need is to improve upon your monitoring checks. Without having to do too much, if ftp is running on the boxes, you could script ftp and replace your ping test with that... maybe ftp a file to each host to show the system status?

Or, if sendmail is running and listening on all the nodes... have them send a message that contains info about the node's status. Should be easy enough to implement. Instead of another mailbox to manage, add an alias and redirect the output to a script that would parse the results of the other node's email w/ status info.... Assuming your entire issue was that the box was down but still answered ping, the above ideas may help.
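
A rough sketch of the "check a real service instead of ping" idea, written in Python purely for illustration (the function name `service_answers` is made up). It connects to a TCP port and waits for the server's greeting; an FTP server normally opens with a `220` reply on port 21. A box whose kernel still accepts connections but whose daemons are stuck on disk IO would fail this check by timing out.

```python
import socket


def service_answers(host, port, expected_prefix=b"220", timeout=5.0):
    """Return True if host:port accepts a connection and sends the expected banner in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(128)  # read the greeting line, if any arrives
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
    return banner.startswith(expected_prefix)
```

For example, `service_answers("node2", 21)` would stand in for the ping test, where `node2` is a placeholder hostname. The same function could probe sendmail on port 25, which also greets with `220`.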

-denver