ARMServer problems or hardware problems?

Alex Georgiev
Regular Advisor

I'm new to HP-UX and the HP RAID arrays, so please bear with me.

A few days ago I noticed an error message from the array monitor daemon (arraymond) saying that /dev/rdsk/c1t0d0 is inaccessible. I then called the people who support our HP-UX box and asked for help. They said it's a software problem with the ARMServer, and that the arraymond message comes up when the ARMServer isn't running.

The bottom line: ARMServer stops running by itself!? Crashing might be a better word, but I did not find any core files. Nobody seems to know why the ARMServer would stop all by itself... Could a hardware problem be causing that?
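
One way to pin down when ARMServer actually dies is to poll the process table and log a timestamp whenever it is missing; those times can then be lined up against the SCSI messages in the syslog. The following is a minimal sketch, assuming the daemon's command line contains "ARMServer" (as in /opt/hparray/bin/ARMServer). The log path and the 60-second interval are arbitrary, hypothetical choices.

```sh
#!/bin/sh
# Sketch: append a timestamp to LOG whenever no ARMServer process is listed,
# so the times can be compared with the SCSI messages in the syslog.
# Assumption: the daemon's command line contains "ARMServer".

LOG=${LOG:-/tmp/armserver_watch.log}   # hypothetical log location
INTERVAL=${INTERVAL:-60}               # seconds between checks (arbitrary)

# Reads a "ps -ef" style listing on stdin; succeeds if ARMServer is listed.
# The [A] bracket keeps grep from matching its own entry in the listing.
armserver_listed() {
    grep '[A]RMServer' >/dev/null
}

watch_armserver() {
    while :; do
        if ! ps -ef | armserver_listed; then
            echo "$(date) ARMServer not running" >> "$LOG"
        fi
        sleep "$INTERVAL"
    done
}

# Only start the loop when called as: sh armwatch.sh watch
if [ "${1:-}" = "watch" ]; then
    watch_armserver
fi
```

Start it in the background with something like "nohup sh armwatch.sh watch &" (armwatch.sh being whatever you call the file), then compare the logged times with the syslog.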

I get the feeling that the company responsible for supporting the HP-UX 10.20 machine is not as knowledgeable as it should be, and of course I myself am more into the Linux/Solaris world and have little experience troubleshooting RAID arrays and SCSI problems.

I did find a bunch of register dumps and SCSI error/diag messages in the syslog. They say that the SCSI bus was reset. The messages sometimes appear every 2 minutes, other times every 20 or so. If anyone could tell me what the syslog messages mean and why ARMServer is stopping, I would very, very much appreciate it! Even just ideas as to where I should look for clues would be helpful. A small chunk of the syslog file is attached as syslog.txt.
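
To put numbers on how often the bus resets happen, a small awk filter can print the gap between successive reset messages. This is a sketch that assumes the standard "Mon DD HH:MM:SS" syslog prefix and that the relevant lines contain both "SCSI" and "reset" (case-insensitive); the real HP-UX message text may differ, so adjust the match to what your syslog actually shows. It only handles a single midnight rollover between two messages.

```sh
#!/bin/sh
# Sketch: print the gap in minutes between successive syslog lines that
# mention both "scsi" and "reset" (case-insensitive).
# Assumes the "Mon DD HH:MM:SS host ..." syslog prefix.

reset_intervals() {
    awk '
    {
        line = tolower($0)
        if (index(line, "scsi") == 0 || index(line, "reset") == 0) next
        split($3, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (seen) {
            gap = now - prev
            if (gap < 0) gap += 86400    # crossed midnight (assumes < 1 day apart)
            printf "%s %s %s  +%.1f min\n", $1, $2, $3, gap / 60
        }
        prev = now
        seen = 1
    }' "$@"
}

# With a file argument, process it; otherwise just define the function.
if [ -n "${1:-}" ]; then
    reset_intervals "$@"
fi
```

For example: "sh resets.sh /var/adm/syslog/syslog.log" (resets.sh is a hypothetical file name).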

To cap this off, I tried getting some info with arraydsp and arraylog, and those commands don't seem to work. For example, arraydsp -i just sits there without printing anything. I would try tracing it when I get a chance, but I doubt that will help me much. Things are just looking pretty scary...

Let me know if I need to post any more info. Thanks for any responses in advance!
Insu Kim
Honored Contributor

ARMServer dying intermittently can be solved by applying the latest ARMServer patch.

Run "/sbin/init.d/hparaymgr start" or /opt/hparray/bin/ARMServer to start the array manager, then check the status of the array:

# arraydsp -i
# arraydsp -a

or verify which component has failed by examining the logs with the logprint command.

You said that this error message comes up frequently, so there must be a fault in the AutoRAID.
In my experience, the error never stops until you fix the problem.

First, take a look at the front panel to see if the array is in the READY state, especially the controller with its SCSI ID set to 0.
Then use "arraydsp -a" and "logprint".
Finally, you can also use mstm -> logtools to get a clue.

Honored Contributor

There's no question you need to install the latest SCSI patches.

arraydsp -R
should rescan for hardware.

If your file systems and VGs on the AutoRAID are okay, i.e. data access is fine, then there is typically no problem with the hardware.

ioscan -fnkCdisk
look for the C5447A device files
do a diskinfo -v /dev/rdsk/c...

The ARMServer won't start properly if LUN 0 isn't accessible.

You should always see LUN 0 in ioscan, whether or not it actually exists on the array.

After arraydsp -R,
run top
and wait for about 2 full ioscan processes to finish,
the second one after you see the ARMServer process appear.

Then quit top,
arraydsp -i
should work.
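
The rescan sequence above can be scripted roughly as follows. This is a sketch, not a tested procedure: it uses only the arraydsp -R and arraydsp -i commands from this thread plus ps and grep, the 10-second poll interval is an assumption, and "no ioscan process left running" stands in for waiting out the roughly two full ioscan runs.

```sh
#!/bin/sh
# Sketch of the sequence: arraydsp -R, wait for ARMServer to appear,
# wait until no ioscan is running, then query the array with arraydsp -i.

POLL=${POLL:-10}   # seconds between process-table checks (assumed value)

# Succeeds if a line on stdin matches the pattern $1. Bracketing the first
# letter (e.g. [i]oscan) keeps grep from matching its own ps entry.
listed() {
    grep "$1" >/dev/null
}

rescan_and_query() {
    arraydsp -R
    # Wait for the ARMServer process to show up.
    until ps -ef | listed '[A]RMServer'; do sleep "$POLL"; done
    # Then wait until no ioscan process is left running.
    while ps -ef | listed '[i]oscan'; do sleep "$POLL"; done
    arraydsp -i
}

# Only run the sequence when called as: sh rescan.sh run
if [ "${1:-}" = "run" ]; then
    rescan_and_query
fi
```

Invoke it as "sh rescan.sh run" (rescan.sh being a hypothetical file name). If arraydsp -i still hangs at the end, the wait is not the issue.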

Read the logprint man page
and run it to interrogate the AutoRAID for hardware errors.
Your problem, though, is possibly a bad cable or a missing terminator.

Are there any other devices on the same bus as the AutoRAID? What is the AutoRAID's SCSI ID/priority relative to the other IDs on the same bus? Think about increasing it. Don't put the AutoRAID on the same bus as a streaming device such as a tape drive.

arraydsp -a
shows your AutoRAID firmware.
The ARMServer PHCO patch must match the firmware revision, i.e. don't install the latest PHCO for ARMServer unless you have read the release notes and have upgraded the AutoRAID controller firmware before the PHCO upgrade.

what /opt/hparray/bin/ARMServer

swlist -l product | grep -i arm


Alex Georgiev
Regular Advisor

Thanks for the responses! The problem turned out to be a SCSI controller gone bad. Because we have 2 SCSI channels to "talk" to the RAID, we could access the data just fine, but all the array* commands would hang. What bugs me the most is that the 3 HP engineers who worked on this problem could not figure anything out from the error logs and the output of various diagnostic commands. It was a machine reboot that finally brought this server to its knees and had us calling HP hardware people on site.