Re: Disk error messages in root mail.

Kinzyagulov Yunir · ‎10-21-2004

Root's mail contains messages about disk I/O errors. This is an HP cluster, and the disk reporting the errors is the cluster lock disk. I checked the disk status with the ioscan command and the disk shows as OK. The outputs and logs are in the attachment.

scp2:/ #strings /etc/lvmtab
/dev/vg00
/dev/dsk/c1t2d0
/dev/dsk/c2t2d0
/dev/vglock
Yv=i
/dev/dsk/c4t8d0
/dev/vg01
/dev/dsk/c1t0d0
/dev/dsk/c2t0d0

Thank you!

Steven E. Protter · ‎10-21-2004

ioscan is not enough.

Run dmesg and look for lbolt or powerfail messages.

Fire up cstm, mstm or xstm.

Test all disks with the exercise command.

Run diskinfo.

If a disk is near failure, do a backup and arrange replacement.
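A minimal sketch of the dmesg check above, assuming a POSIX shell. The function name `scan_kernel_msgs` and the pattern list are illustrative assumptions, not a complete list of HP-UX error strings; the function only filters text on standard input for the keywords mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines that mention the keywords named in
# this thread. The pattern list is an assumption; extend it as needed.
scan_kernel_msgs() {
    grep -i -E 'lbolt|powerfail|SCSI: .*error'
}

# Typical use on the affected host (not run here):
#   dmesg | scan_kernel_msgs
```

An empty result only means none of these keywords appeared; it does not prove the disk is healthy.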

SEP

Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

Sridhar Bhaskarla · ‎10-21-2004

Hi,

Do a 'diskinfo /dev/rdsk/c4t8d0' and see if you get good details. Then try a 'dd' on that disk to check that there are no bad blocks.

dd if=/dev/rdsk/c4t8d0 of=/dev/null bs=1024k
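A small sketch of how the dd read test could be wrapped so that the result is explicit, assuming a POSIX shell. The name `read_check` is hypothetical. It relies only on dd's exit status, so a pass shows that every block was readable at that moment and nothing more.

```shell
#!/bin/sh
# Hypothetical wrapper around the dd read test: read the whole device (or
# file) and report whether dd finished without an I/O error.
read_check() {
    dev=$1
    if dd if="$dev" of=/dev/null bs=1024k 2>/dev/null; then
        echo "read OK: $dev"
    else
        echo "read FAILED: $dev"
        return 1
    fi
}

# Typical use on the affected host (not run here):
#   read_check /dev/rdsk/c4t8d0
```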

You may need to get this disk replaced. If the disk system supports hot-pluggability, then replace the disk on-line and run 'vgcfgrestore' to restore the VG headers.

vgcfgrestore -n vglock /dev/rdsk/c4t8d0
vgchange -a e vglock

If it is not hot-pluggable, then you will need to shut down the system, replace the disk and bring the system back up. Before bringing up the cluster, run only the 'vgcfgrestore' command shown above and skip the 'vgchange' command, since it will be run when you bring up the package.
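The two cases can be sketched as a small script, assuming a POSIX shell. The helper names `run` and `restore_lock_disk` and the `DRY_RUN` variable are illustrative assumptions; only the vgcfgrestore and vgchange invocations come from the commands in this post. By default the script prints the commands instead of executing them.

```shell
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to execute them for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper for the steps after the disk has been swapped.
restore_lock_disk() {
    vg=$1        # volume group name, e.g. vglock
    rdev=$2      # raw device of the replaced disk
    activate=$3  # "yes" for the on-line case, "no" before cluster start
    run vgcfgrestore -n "$vg" "$rdev"
    if [ "$activate" = "yes" ]; then
        run vgchange -a e "$vg"
    fi
}

# On-line (hot-plug) replacement:
#   restore_lock_disk vglock /dev/rdsk/c4t8d0 yes
# After a shutdown and swap, before starting the cluster:
#   restore_lock_disk vglock /dev/rdsk/c4t8d0 no
```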

-Sri

You may be disappointed if you fail, but you are doomed if you don't try

Kinzyagulov Yunir · ‎10-21-2004

I have run the diskinfo command:

scp2:/ #diskinfo /dev/rdsk/c4t8d0
SCSI describe of /dev/rdsk/c4t8d0:
vendor: HP 18.2G
product id: ST318406LC
type: direct access
size: 17783240 Kbytes
bytes per sector: 512

I have also run the dmesg command and the cstm tool; the outputs are in the attachment.

Sridhar Bhaskarla · ‎10-21-2004

Hi,

'diskinfo' is not a 100% reliable check. It may help with obvious errors but not with all of them. Moreover, diskinfo may not actually query the disk in every case.

In your case, try 'dd'.

dd if=/dev/rdsk/c4t8d0 of=/dev/null bs=1024k

to confirm the failures. The error

SCSI: Read error -- dev: b 31 0x048000, errno: 126, resid: 1024,
blkno: 8, sectno: 16, offset: 8192, bcount: 1024.
scb->cdb: 28 00 00 00 00 10 00 00 02 00

clearly indicates that there is a problem with this disk.
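For reference, the cdb bytes in that message can be decoded by hand. Opcode 0x28 is a SCSI READ(10), which carries a 4-byte logical block address in bytes 2-5 and a 2-byte transfer length in bytes 7-8. The sketch below assumes a POSIX shell with hex constants in arithmetic; the function name `decode_read10` is hypothetical.

```shell
#!/bin/sh
# Decode a 10-byte SCSI READ(10) CDB given as space-separated hex bytes.
# Layout: byte 0 opcode, bytes 2-5 logical block address (big-endian),
# bytes 7-8 transfer length in blocks.
decode_read10() {
    set -- $1   # split the string into positional parameters $1..${10}
    lba=$(( (0x$3 << 24) | (0x$4 << 16) | (0x$5 << 8) | 0x$6 ))
    blocks=$(( (0x$8 << 8) | 0x$9 ))
    echo "opcode=0x$1 lba=$lba blocks=$blocks"
}

decode_read10 "28 00 00 00 00 10 00 00 02 00"
# prints: opcode=0x28 lba=16 blocks=2
```

With 512-byte sectors this agrees with the rest of the message: block 16 x 512 = offset 8192, and 2 blocks x 512 = bcount 1024. So the failed request was a 1 KB read near the start of the disk.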

-Sri


Kinzyagulov Yunir · ‎10-21-2004

I have run the dd command:

scp2:/ #dd if=/dev/rdsk/c4t8d0 of=/dev/null bs=1024k
17366+1 records in
17366+1 records out
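A quick arithmetic check, assuming a POSIX shell, of whether that record count covers the whole disk. diskinfo reported 17783240 Kbytes, and the dd block size was 1024 KB. The function name `expected_dd_records` is hypothetical.

```shell
#!/bin/sh
# Predict dd's "full+partial" record count for a device of size_kb KB
# read with a block size of bs_kb KB.
expected_dd_records() {
    size_kb=$1
    bs_kb=$2
    full=$(( size_kb / bs_kb ))
    if [ $(( size_kb % bs_kb )) -gt 0 ]; then
        partial=1
    else
        partial=0
    fi
    echo "${full}+${partial}"
}

expected_dd_records 17783240 1024
# prints: 17366+1
```

The prediction matches the dd output, so dd read the disk end to end without hitting an unreadable block on this pass.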

Sridhar Bhaskarla · ‎10-21-2004

These could be intermittent failures. However, if the errors last too long, you would see failures such as 'lock disk missing' in your syslog.log. Do you see any issues with your 'vgdisplay vglock' command?

-Sri


Kinzyagulov Yunir · ‎10-21-2004

vglock is the lock disk in the HP cluster. Right now scp2 is not the standby host; the active host is scp1.

scp2:/ #vgdisplay vglock
vgdisplay: Volume group not activated.
vgdisplay: Cannot display volume group "vglock".

Some syslog.log messages are in the attachment.

Kinzyagulov Yunir · ‎10-21-2004

Sorry, scp2 is the standby host.

Sridhar Bhaskarla · ‎10-21-2004

OK. A few more details are needed now.

1. Hardware models of your systems.
2. SCSI bus type.
3. Do you see any other messages related to Serviceguard when the SCSI errors are occurring?
4. Do you see any other errors in syslog.log related to LVM and Serviceguard?
5. Are you seeing similar errors on the primary node?

SCSI RESET messages are quite common on the secondary node with older SCSI buses during cluster reformations involving cluster lock disks.

-Sri


Kinzyagulov Yunir · ‎10-21-2004

1. Hardware:

MCP-HP-L2000
2CPU/440MHz/1GB Mem/4*9GB in HD/9GB Disk Cabinet/DVD/2 Ethernet/2 SCSI/Console/1.6M Cabinet/HP unix 11.0)+4GB Type

2. SCSI bus type: Ultra160

scp2:/ #diskinfo -v /dev/rdsk/c4t8d0
SCSI describe of /dev/rdsk/c4t8d0:
vendor: HP 18.2G
product id: ST318406LC
type: direct access
size: 17783240 Kbytes
bytes per sector: 512
rev level: HP04
blocks per disk: 35566480
ISO version: 0
ECMA version: 0
ANSI version: 2
removable media: no
response format: 2

3. There are no other Serviceguard messages among the errors in syslog.log (I have attached more).

5. There are no such error messages on the primary node.
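As a consistency check on the diskinfo -v output above, the reported size should equal blocks per disk times bytes per sector. A minimal sketch, assuming a POSIX shell with 64-bit arithmetic; the function name `disk_size_kb` is hypothetical.

```shell
#!/bin/sh
# Convert a block count and sector size (bytes) to a size in Kbytes.
disk_size_kb() {
    blocks=$1
    sector_bytes=$2
    echo $(( blocks * sector_bytes / 1024 ))
}

disk_size_kb 35566480 512
# prints: 17783240
```

The result equals the "size: 17783240 Kbytes" line, so the geometry the disk reports is self-consistent.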

Sridhar Bhaskarla · ‎10-21-2004

I didn't realize you posted your 'ioscan' output.

ext_bus 4 0/3/0/0 c720 CLAIMED INTERFACE SCSI C895 Ultra2 Wide LVD

Since you are not seeing any errors on the primary node, we can be sure that the disk is OK. So the suspicion now falls on the controller in the system, the controller in the disk cabinet, or the cable. You can run 'cstm' or 'stm', select this interface and check whether it shows any errors. If you have a spare card, try swapping it in to eliminate the card from the picture.

-Sri


Kinzyagulov Yunir · ‎10-24-2004

I have run the mstm tool, but it does not list the disk at 0/3/0/0.8.0 (see the attachment).

Also, I couldn't find the SCSI bus:

ext_bus 4 0/3/0/0 c720 CLAIMED INTERFACE SCSI C895 Ultra2 Wide LVD

Kinzyagulov Yunir · ‎10-27-2004

I checked the SCSI bus and the lock disk and both look normal (see the attachment).

Could the problem be with the monitoring program that sends the error messages to root?

accent · ‎10-28-2004

You can test your disk using the stm utility (you already have it installed, since I can see it in the mail).

Run the following steps:

#stm
--> then select your disk and do:
--> tools --> information --> run (the report runs, and when it finishes a screen shows information such as errors, serial number, ...)

You can also run an exercise and then look at the result.

This tool is the best way to test the hardware.

Sanjay_6 · ‎10-28-2004

Hi,

How is the termination on the bus? If these disks are connected through SCSI cables, do you have inline SCSI terminator cables on the server side on both servers? You may be seeing these errors because of a termination issue too.

Hope this helps.

regds

Kinzyagulov Yunir · ‎10-28-2004

I have run the Information tool for the disk:

-- Information Tool Log for SCSI Disk on path 0/3/0/0.8.0 --

Log creation time: Fri Oct 29 09:11:22 2004

Hardware path: 0/3/0/0.8.0

Product Id:          ST318406LC           Vendor:            HP 18.2G
Device Type:         SCSI Disk            Firmware Rev:      HP04
Device Qualifier:    HP18.2GST318406LC    Logical Unit:      0
Serial Number:       3FE0WSV0
Capacity (M Byte):   17366.45
Block Size:          512
Max Block Address:   35566479

Error Logs
Total Retries:       0                    Buffer Overruns:   N/A
Read Reverse Errors: N/A                  Buffer Underruns:  N/A
Write Errors:        0                    Non-Medium Errors: 1
Verify Errors:       0
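
The capacity line in this log can be reproduced from the max block address and block size, which is a quick way to confirm that the tool is looking at the same disk as the earlier dd run. A minimal sketch, assuming a POSIX shell with awk available; the function name `capacity_mbyte` is hypothetical.

```shell
#!/bin/sh
# Capacity in MBytes (two decimals) from the highest block address and
# the block size in bytes. Block addresses start at 0, hence the +1.
capacity_mbyte() {
    awk -v max="$1" -v bs="$2" \
        'BEGIN { printf "%.2f\n", (max + 1) * bs / 1048576 }'
}

capacity_mbyte 35566479 512
# prints: 17366.45
```

That is also the 17366 full 1024k records plus one partial record reported by dd.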

Andrew Merritt_2 · ‎11-02-2004

One thing I can see is that you have an old version of OnlineDiags installed, A.35.00, so you should consider upgrading to the latest version.

That said, the fact that disk_em is reporting these events is telling you that there is a real problem with the disk at path 0/3/0/0.8.0, regardless of whether this shows up in any other tool. These events are generated when the device driver encounters problems. You're seeing event 17 and event 100091 from disk_em, only for the one device, which again points to a specific hardware problem rather than a generic system one.

The text of the event pretty much describes the possible causes:
The bus device reset may have occurred or the device has failed. The bus
device reset could have occurred because the power was cycled or that the
device is a part of an enclosure and a device in that enclosure was pulled
out or put in, or that the interface card had a problem and it reset the
bus. If these errors continue, there may be a problem with the device or
the card.

The fact that the events are repeating, and for only the one device, strongly suggests this isn't a transient problem.

The hardware problem could be with the disk itself, or with the SCSI card to which it is attached. Can you try switching SCSI cards, if you have more than one, and seeing if the problem follows the card?

Andrew
