
server hang

 
T G Manikandan
Honored Contributor

server hang

Hello,
Yesterday one of my L2000-class servers stopped responding, i.e. the server hung.
I had to go to the GSP and issue an 'RS' to reset the system. The only other option was to switch off the main power and bring it back.
This morning I was monitoring the server and found that all 4 CPUs were constantly at 100%.
We have a compilation environment with two database instances and Tuxedo running on it. Users also do a lot of compilations.
When I ran sar 5 5 I found that

%user time was 95 on average

%sys was 5 on average.
Should I go for PRM to restrict CPU usage?

My questions are

1. Was this the reason the system went for a toss?
2. I restarted the system using GSP --> RS. Is this the right way in such a case?
3. Are there any patches that need to be applied?
Please send in your suggestions.
Thanks, Professors

GM
Shahul
Esteemed Contributor

Re: server hang

Hi

I suggest you check the following.

Please check /var/adm/syslog/syslog.log and OLDsyslog.log for logged errors.

Please monitor the output of swapinfo regularly.

Please watch for any networking-related problems.

Do you have GlancePlus? If yes, check the hard disk integrity.

I hope these will give you some pointers...

Best of luck

Shahul
T G Manikandan
Honored Contributor

Re: server hang

 
Carsten Krege
Honored Contributor

Re: server hang

This looks very much like a h/w problem. The logs indicate that the device : bp->b_dev: 1f010000 (/dev/dsk/c1t0d0) is affected, major number of the device file is 31 (run "lsdev" to list the driver for the major).

You should let HP check this device.
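
To show how the dev value in the log maps to the device file, here is a small shell sketch. It assumes the usual HP-UX 11.x layout of a 32-bit dev_t (top 8 bits major, low 24 bits minor) and the sdisk minor layout of card instance, target and LUN. The function name decode_dev is made up for this example and is not an HP-UX command.

```shell
#!/bin/sh
# decode_dev is a hypothetical helper, not an HP-UX command.
# Assumed layout: 8-bit major, then a 24-bit minor of the form 0xIITLOO
# (II = card instance, T = SCSI target, L = LUN, OO = option bits).
decode_dev() {
    dev=$((0x$1))
    major=$(( (dev >> 24) & 0xff ))
    card=$(( (dev >> 16) & 0xff ))
    target=$(( (dev >> 12) & 0xf ))
    lun=$(( (dev >> 8) & 0xf ))
    echo "major=$major device=c${card}t${target}d${lun}"
}

decode_dev 1f010000   # prints: major=31 device=c1t0d0
```

Under those assumptions 1f010000 decodes to major 31 and c1t0d0, which matches the device named above.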

Carsten
-------------------------------------------------------------------------------------------------
In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. -- HhGttG
Rainer von Bongartz
Honored Contributor

Re: server hang

You could first check whether your disk at /dev/dsk/c1t0d0 has problems.
Read the entire disk and watch out for errors:

dd if=/dev/dsk/c1t0d0 of=/dev/null bs=4096

If this is successful you should get
xxx+0 records in
xxx+0 records out
otherwise an error message should be displayed.
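
If you want a plain pass/fail answer instead of reading the dd output by eye, the check above could be wrapped roughly like this. It is only a sketch: read_check is a made-up helper name, and it relies on dd returning a non-zero exit status when a read fails.

```shell
#!/bin/sh
# read_check is a hypothetical wrapper around the dd command shown above.
# It only reads from the given path and discards the data; nothing is written to it.
read_check() {
    if dd if="$1" of=/dev/null bs=4096 2>/dev/null; then
        echo "read OK: $1"
    else
        echo "read FAILED: $1"
        return 1
    fi
}

# Example (device path taken from this thread):
# read_check /dev/dsk/c1t0d0
```

A full read of a large disk takes a while, and a disk that only fails intermittently may still pass.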

He's a real UNIX Man, sitting in his UNIX LAN making all his UNIX plans for nobody ...
T G Manikandan
Honored Contributor

Re: server hang

Hello,
Thanks for the replies. They were really useful.
Luckily the disk was a data disk.
The disk works fine after the server is rebooted, but at a specific time the disk hangs and everything goes for a toss.
One more question:
What are the advantages of putting two or more disks in a volume group rather than a single disk?
Is it that the I/O performance will be better? Anything else?

Thanks

GM
Carsten Krege
Honored Contributor

Re: server hang

There is no real advantage in adding multiple disks to a volume group (besides more storage capacity and the flexibility to extend your lvols). For the root VG (vg00) I recommend configuring as few disks as possible.

Things become more interesting if you mirror your disks onto additional drives using MirrorDisk/UX. In the majority of cases you can continue working after a disk failure, since the data is simply accessed from the mirror disk. Mirroring also adds a little performance for read access, but on the other hand costs some time writing the additional admin data (both are usually negligible).

MirrorDisk/UX is widely used and a very efficient means of providing high availability for disk mechs.

Carsten
-------------------------------------------------------------------------------------------------
In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. -- HhGttG
T G Manikandan
Honored Contributor

Re: server hang

Hello,
Thanks for the replies. Krege,
I went through the documents and found one that gives a patch as the solution:
PHKL_24004 1.0 SCSI IO Subsystem Cumulative Patch.
But after this patch was installed the server stopped writing to the syslog file. I got the same problem again today: the server hung.
I am attaching a document which gives the analysis regarding this.
When the system hangs, can we restart it using GSP> RS? Is that OK?
It makes me compare an L2000 server with an NT server. Oops!

Thanks again
G Manikandan
T G Manikandan
Honored Contributor

Re: server hang

Hello,
My server keeps going for a toss.
I have already rebooted it once today. I have no option other than switching the power off and back on.
I am planning to remove the hard disk that is causing the problem.
I intend to follow these steps.
1. Remove the specific volume group.
#vgremove
2. Re-create a volume group with the remaining disks.
#vgcreate
3. Create the volumes.
#lvcreate
4. Restore the files.
5. Mount the file systems as usual.
Is this OK?
Requesting suggestions!
I have pasted the latest syslog file.

polnec vmunix: Unexpected interrupt on EIRR bit

polnec vmunix: scb->cdb: 28 00 00 02 fa 38 00 00 08 00
Aug 1 11:09:16 polnec vmunix: scb->cdb: 28 00 00 06 71 28 00 00 08 00
Aug 1 11:09:19 polnec vmunix: SCSI: Abort abandoned -- lbolt: 897064, dev: 1f010000, io_id: 1027930, status: 200
Aug 1 11:09:19 polnec vmunix:
Aug 1 11:09:19 polnec vmunix: SCSI: Read error -- dev: b 31 0x010000, errno: 126, resid: 16384,
Aug 1 11:09:19 polnec vmunix: blkno: 77412, sectno: 154824, offset: 79269888, bcount: 16384.
Aug 1 11:09:19 polnec vmunix: SCSI: Read error -- dev: b 31 0x010000, errno: 126, resid: 2048,
Aug 1 11:09:19 polnec vmunix: blkno: 8, sectno: 16, offset: 8192, bcount: 2048.
Aug 1 11:09:19 polnec vmunix: LVM: vg[1]: pvnum=0 (dev_t=0x1f010000) is POWERFAILED
Aug 1 11:09:24 polnec vmunix: SCSI: Read error -- dev: b 31 0x010000, errno: 126, resid: 2048,
Aug 1 11:09:32 polnec vmunix: SCSI: Read error -- dev: b 31 0x010000, errno: 126, resid: 4096,
Aug 1 11:09:32 polnec vmunix: blkno: 216836, sectno: 433672, offset: 222040064, bcount: 4096.
Aug 1 11:09:36 polnec vmunix: SCSI: Read error -- dev: b 31 0x010000, errno: 126, resid: 8192,
Aug 1 11:09:36 polnec vmunix: blkno: 21776, sectno: 43552, offset: 22298624, bcount: 8192.
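
To see at a glance which device the read errors belong to, the syslog file can be summarized along these lines. This is a rough sketch that assumes one message per line in the file; count_read_errors is a made-up helper name, and the match pattern is taken from the messages pasted above.

```shell
#!/bin/sh
# count_read_errors is a hypothetical helper: it counts "SCSI: Read error"
# messages per device in a syslog-style file (one message per line assumed).
count_read_errors() {
    awk '/SCSI: Read error/ {
        sub(/.*dev: /, "")   # drop everything up to the device field
        sub(/,.*/, "")       # drop everything after the device field
        n[$0]++
    }
    END { for (d in n) print n[d], d }' "$1"
}

# Example (log path as used on this system):
# count_read_errors /var/adm/syslog/syslog.log
```

Each output line is a count followed by the device as it appears in the log, for example the "b 31 0x010000" form seen here.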

I am unable to umount a file system in order to run fsck.
Should I comment out the entry in the fstab file and then try unmounting the file system?

Thanks.
Rainer von Bongartz
Honored Contributor

Re: server hang

You cannot unmount your FS as long as some process is accessing files on it.
Find the mount points of the LVOLs on this disk and check (using the fuser command) who is accessing files there.

Only after killing these processes will you be able to unmount.
He's a real UNIX Man, sitting in his UNIX LAN making all his UNIX plans for nobody ...