Operating System - HP-UX

server problems - need help

 
Mark Vollmers
Esteemed Contributor

I've been having some serious problems with my server for the past few months. I'm not quite sure what's going on, but I'll tell you what I know. Any help or ideas (or anything) would be greatly appreciated.

We first noticed the problems with the backup we were running. I was backing up the server using a UNIX agent on our NT server, since it was handy. The backup would skip whole folders at a time and complain about others. Occasionally, /home would totally shut down and not let anyone see it (a reboot fixed this every time). There was also a time when the system froze up during shutdown or reboot, but that seems to have passed. I went back to the old backup system we used (fbackup, I think) and the server just crashed again, so it is not the backup that is causing the problem. I have two ideas.
a) A corrupt file or files being accessed during the backup is bringing down /home. I have deleted the files that the backup flagged as problems, but that did not fix it. On a side note, there was a folder that I wanted to delete, but the rm process kept hanging and I could not kill it. After a reboot, I could remove the folder. Maybe related?
b) The RAID drive that /home is mounted on is whacked and causing problems. When the server reboots, it runs fsck and fixes the problems that it sees. This last time it said that everything was clean, but sometimes it finds problems. Does fsck check the mounted drives, and can it handle a RAID drive?

I have looked through the most recent syslog and can't find anything that jumps out at me. I will attach the last few. I know the UPS doesn't work, and that is in the log.

Any ideas on where to go from here would be greatly appreciated. I can give any info anyone wants to see (past logs, etc). Thanks in advance!!
"We apologize for the inconvience" -God's last message to all creation, from Douglas Adams "So Long and Thanks for all the Fish"
6 REPLIES
Steven Sim Kok Leong
Honored Contributor

Re: server problems - need help

Hi,

From your syslog, it looks very much like a SCSI device is failing, possibly the SCSI disk housing /home or the SCSI controller connected to it. Try to identify the device by looking in /dev/dsk for the file with device number 0x1f000000 (major 31, minor 0x000000).

If possible, run an mstm exercise or verify on the disk. This does not require downtime but will slow down the system.

If you have predictive support, run psconfig and take a look at your predictive logs on any hardware errors.

Hope this helps. Regards.

Steven Sim Kok Leong
Brainbench MVP for Unix Admin
http://www.brainbench.com

==
Feb 7 08:52:01 munix vmunix: lsp: 5ae7d80
Feb 7 08:52:01 munix vmunix: bp->b_dev: 1f000000
Feb 7 08:52:01 munix vmunix: scb->io_id: 6faf2
Feb 7 08:52:01 munix vmunix: scb->cdb: 28 00 00 9f 25 80 00 00 80 00
Feb 7 08:52:01 munix vmunix: lbolt_at_timeout: 0, lbolt_at_start: 0
Feb 7 08:52:01 munix vmunix: lsp->state: 1
Feb 7 08:52:01 munix vmunix: lbp->owner: 60c5900
Feb 7 08:52:01 munix vmunix: bp->b_dev: 1f000000
Feb 7 08:52:01 munix vmunix: scb->io_id: 6faf4
Feb 7 08:52:01 munix vmunix: scb->cdb: 28 00 00 9f 29 80 00 00 80 00
Feb 7 08:52:01 munix vmunix: lbolt_at_timeout: 0, lbolt_at_start: 0
Feb 7 08:52:01 munix vmunix: lsp->state: 15
Feb 7 08:52:01 munix vmunix: scratch_lsp: 0
Feb 7 08:52:01 munix vmunix:
Feb 7 08:52:01 munix vmunix: SCSI: Ignoring redundant reset request -- lbolt: 7537735, bus: 0
Feb 7 08:52:01 munix vmunix: LVM: vg[1]: pvnum=0 (dev_t=0x1f000000) is POWERFAILED
Feb 7 08:52:01 munix vmunix: LVM: PV 0 has been returned to vg[1].
==
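
The scb->cdb lines in the excerpt above can be read as SCSI command blocks. Opcode 0x28 is READ(10), so the timed-out requests were plain reads. The following is a minimal sketch in Python, not an HP tool, that pulls the starting block and block count out of such a line; the function name decode_read10 is made up for illustration.

```python
def decode_read10(cdb_text):
    """Decode a 10-byte SCSI READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_text)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    if cdb[0] != 0x28:
        raise ValueError("not a READ(10) command (opcode 0x28)")
    # Bytes 2-5 hold the logical block address, bytes 7-8 the transfer
    # length in blocks, both big-endian.
    lba = int.from_bytes(cdb[2:6], "big")
    blocks = int.from_bytes(cdb[7:9], "big")
    return {"opcode": cdb[0], "lba": lba, "blocks": blocks}


# The two CDBs logged at 08:52:01 in the excerpt above.
for text in ("28 00 00 9f 25 80 00 00 80 00",
             "28 00 00 9f 29 80 00 00 80 00"):
    info = decode_read10(text)
    print(f"READ(10) at LBA {info['lba']:#x}, {info['blocks']} blocks")
```

Both requests are 128-block reads in the same region of the disk, which fits a sequential reader such as a backup, but it does not by itself say whether the disk, the cabling, or the controller is at fault.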
Patrick Wallek
Honored Contributor

Re: server problems - need help

Volker Borowski
Honored Contributor

Re: server problems - need help

Hello Mark,
there are several entries that do not look good:
- btlan4 has trouble negotiating autosense
- the / filesystem is full
- pv0 is losing power
- and the rest of the SCSI errors do not look good either

I would recommend:
- fix the free space problem on / first!
- check disk 0 for loose powercables
- check overall SCSI cable length

After this, go for btlan4 checking esp. half/full-duplex mismatches, to reestablish your network backup.

Hope this helps
Volker
Rita C Workman
Honored Contributor

Re: server problems - need help

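SCSI messages mentioning lbolt are generally indicative of hardware failure.
The dev = 1f gives you the hex major number for the actual device
(0x1f = 31, and 31 = disk).
The rest, 000000, is the minor number. Drop the last two 0's and what is left reads as card 0, target 0, LUN 0, so 1f000000 is c0t0d0.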
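
The decoding above can be sketched in a few lines of Python. This assumes the legacy HP-UX layout of an 8-bit major number followed by a 24-bit minor number, with the minor split into card instance (8 bits), target (4 bits), LUN (4 bits) and driver options (8 bits). The function name decode_dev is made up for illustration; check the layout against your own /dev/dsk entries before relying on it.

```python
def decode_dev(dev):
    """Split a dev_t like 0x1f000000 into major, minor and a cXtYdZ name.

    Assumed layout: 8-bit major, then a 24-bit minor made of
    instance (8 bits), target (4 bits), LUN (4 bits), options (8 bits).
    """
    major = (dev >> 24) & 0xFF
    minor = dev & 0xFFFFFF
    instance = (minor >> 16) & 0xFF
    target = (minor >> 12) & 0xF
    lun = (minor >> 8) & 0xF
    return major, minor, f"c{instance}t{target}d{lun}"


major, minor, name = decode_dev(0x1F000000)
print(f"major {major}, minor {minor:#08x}, device {name}")
```

For the value in the syslog this gives major 31 and device c0t0d0, matching the reading above.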

Now my question is this: is this disk by any chance in a disk array connected by a fiber adapter?
Your problem could be the actual disk, OR it could be caused by a bad fiber adapter card. I received these messages with the timeout, and it turned out to be bad fiber cards. You may want to check both of these possibilities. The fact that you're having sporadic problems makes me think it is the fiber card. If it were the drive, you would be seeing more than timeouts.

/rcw
Mark Vollmers
Esteemed Contributor

Re: server problems - need help

I don't know offhand how the RAID drive is attached to the server (fiber card or whatever); I will have to look into it. It appears to be connected securely on the outside; the card connections may not be. Looking at past syslogs shows that some of the logs have the SCSI device violation error and some don't. This message probably correlates with the times when /home has crashed on us, since that folder is mounted on that drive. I have not been able to figure out what causes it, though (a file or files being accessed, the system just deciding not to play nice, whatever).
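
One way to check that correlation is to count the SCSI and LVM error lines per day across the saved syslogs and compare the dates with the days /home went away. The following is a rough sketch in Python, assuming the usual syslog line format shown earlier in the thread; the marker strings and the function name error_days are illustrative choices, not anything HP-UX provides.

```python
from collections import Counter

# Substrings treated as signs of the disk problem (an assumption based on
# the excerpt posted earlier in this thread).
MARKERS = ("POWERFAILED", "SCSI:")


def error_days(lines, markers=MARKERS):
    """Count matching syslog lines per 'Mon D' date prefix."""
    counts = Counter()
    for line in lines:
        if any(marker in line for marker in markers):
            parts = line.split()
            if len(parts) >= 2:
                counts[" ".join(parts[:2])] += 1
    return counts


sample = [
    "Feb 7 08:52:01 munix vmunix: SCSI: Ignoring redundant reset request -- lbolt: 7537735, bus: 0",
    "Feb 7 08:52:01 munix vmunix: LVM: vg[1]: pvnum=0 (dev_t=0x1f000000) is POWERFAILED",
    "Feb 7 08:52:01 munix vmunix: LVM: PV 0 has been returned to vg[1].",
]
for day, count in error_days(sample).items():
    print(day, count)
```

To run it against a real log, pass an open file instead of the sample list, for example the current /var/adm/syslog/syslog.log or one of the saved copies.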
"We apologize for the inconvience" -God's last message to all creation, from Douglas Adams "So Long and Thanks for all the Fish"
Michael Lampi
Trusted Contributor

Re: server problems - need help

The disk drive at SCSI ID 0 is acting poorly. The drive is reporting "Contingent Allegiance" errors. These are caused by flaky firmware, bad SCSI cabling and/or termination, or bugs in the HP SCSI driver.

I'd check the patch level of your C720 SCSI drivers on your system, and also check the SCSI cabling.

If the errors persist, then replace the drive.
A journey of 1000 steps ends in a mile.