Re: /storage/events/disks/default/0_0_2_0.0.0 :Media failure

kate32 · ‎08-30-2007

Hi,

I received this mail from Event monitor
Description of Error:
/storage/events/disks/default/0_0_2_0.0.0 :Media failure

How do you troubleshoot this kind of error message on HP-UX?
Please list all the commands and logs you check.

I have checked the disk status with ioscan and it is CLAIMED. (By the way, this disk is a new one; it was replaced 5 months ago.) It would be strange if the disk had really failed...
Thanks in advance,
Cheers
Al.

Joelmel Roche · ‎08-30-2007

Hi,

Need some changes:

1) change the polling frequency for disk_em to reduce the testing time:

Edit /var/stm/config/tools/monitor/disk_em.cfg and change the line:
# POLL_INTERVAL 60 # polling interval in minutes

To be:
POLL_INTERVAL 5 # polling interval in minutes
(you need to remove the '#' and change the polling frequency)

2)Run 'ps -ef|grep disk_em' to get the process ID of disk_em

3) Run 'aplsrv', and give the same filter file as before

4) In a different window, run 'kill -9' on the process ID found in step 2. After at most a couple
of minutes, you should get a trace in the 'aplsrv' window

5) If available, run 'tusc' on the new disk_em process when it starts
tusc -rall -wall -l -f -p -a -E -v -T%T -o /tmp/tusc.out

After about 10 minutes, stop the 'aplsrv' with and 'T', and send me
the APL.LOG file, and the tusc.out if available.

6)Edit /var/stm/config/tools/monitor/disk_em.cfg and restore the
default polling time by undoing the previous changes.
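
For steps 1 and 6, here is a rough sketch of how the edit could be scripted so that the original file is easy to put back. set_poll_interval is a made-up helper name (it is not part of STM or EMS), and it assumes the file contains a single POLL_INTERVAL line in the form shown above.

```shell
# Sketch only: set_poll_interval is a hypothetical helper, not an STM/EMS tool.
# It keeps a one-time backup in <file>.orig, then rewrites the POLL_INTERVAL
# line, whether or not it is commented out with '#'.
set_poll_interval() {
    cfg=$1        # e.g. /var/stm/config/tools/monitor/disk_em.cfg
    minutes=$2    # new polling interval in minutes

    # Keep the first backup only, so step 6 can restore the true original.
    [ -f "$cfg.orig" ] || cp "$cfg" "$cfg.orig" || return 1

    # Optional leading '#', optional blanks, POLL_INTERVAL, blanks, a number.
    sed "s/^#*[[:space:]]*POLL_INTERVAL[[:space:]][[:space:]]*[0-9][0-9]*.*/POLL_INTERVAL $minutes # polling interval in minutes/" \
        "$cfg" > "$cfg.new" &&
    cp "$cfg.new" "$cfg" &&
    rm -f "$cfg.new"
}

# Example (step 1):  set_poll_interval /var/stm/config/tools/monitor/disk_em.cfg 5
# Example (step 6):  cp /var/stm/config/tools/monitor/disk_em.cfg.orig \
#                       /var/stm/config/tools/monitor/disk_em.cfg
```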

kate32 · ‎08-30-2007

Hi

What is aplsrv? I don't have this command on my system. It is HP-UX 11.00 (yep, an old one).

Cheers
K.

Amit Parui · ‎08-30-2007

Hi Kate

I don't know what you've checked since this error, but what I normally do in such situations, other than ioscan, is:

1) Check the pv status for any stale blocks or the pv being available
- pvdisplay -v /dev/dsk/c?t?d?

2) Check the disk info for data i/o
- diskinfo -v /dev/rdsk/c?t?d?

3) And lastly,
dd if=/dev/dsk/c?t?d? of=/dev/null bs=1024

Even if all of this looks right, the disk may still be dodgy; it is best not to take chances and to replace it.

If Life gives u a ROCK, its upto u to build a BRIDGE or a WALL !!!

Bill Hassell · ‎08-30-2007

aplsrv is located in /usr/sbin/stm/uut/bin/tools/monitor/aplsrv

Since you got an EMS message, there should be a lot of diagnostic error messages in syslog.log. Electronic devices can fail 5 minutes after replacing them or run for 20 years without a problem. Disks are no different except they are part electronics and part mechanical.

ioscan is a very primitive test -- it simply asks the disk controller for a SCSI ID. CLAIMED means that the driver knows about this device and that it returned an ID string. The disk could be broken and the ID could still return.

diskinfo is a bit better in that it queries more of the electronics but does not verify the disk itself.

The dd test is the best for verifying (sequential) access to the disk itself. But unless you have very small disks (less than 10 GB), change the bs value to 256k and ALWAYS use the rdsk (not dsk) device file for dd.

dd if=/dev/rdsk/what_ever of=/dev/null bs=256k

The default for dd is extremely small (512 bytes) and 1024 is far too small. Even a 10GB disk will take many minutes to read. Just to be safe, start a backup of the data on this disk.
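
To put rough numbers on the block size point, a small sketch follows. The helper names (raw_device, read_calls, read_test) and the device name in the examples are made up for illustration; only dd itself is a real command here. Sizes are given in KB to keep the shell arithmetic small.

```shell
# Hypothetical helpers for illustration only.

# Map a block device file (/dev/dsk/...) to its raw counterpart (/dev/rdsk/...).
# A path that is already under /dev/rdsk is returned unchanged.
raw_device() {
    echo "$1" | sed 's#^/dev/dsk/#/dev/rdsk/#'
}

# Number of reads dd needs to cover a disk of size_kb at a block size of bs_kb
# (rounded up, since the last block may be partial).
read_calls() {
    size_kb=$1
    bs_kb=$2
    echo $(( (size_kb + bs_kb - 1) / bs_kb ))
}

# Sequential read of the whole disk through the raw device file.
# Not run here; call it by hand on the system under test.
read_test() {
    dd if="$(raw_device "$1")" of=/dev/null bs=256k
}

# Examples (c1t0d0 is a made-up device name):
#   raw_device /dev/dsk/c1t0d0     prints /dev/rdsk/c1t0d0
#   read_calls 10485760 1          10 GB at bs=1024  -> 10485760 reads
#   read_calls 10485760 256        10 GB at bs=256k  -> 40960 reads
```

So for a 10 GB disk, bs=256k needs 256 times fewer reads than bs=1024, which is where most of the time saving comes from.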

Bill Hassell, sysadmin

kate32 · ‎09-02-2007

Hi,

Thanks for the reply.
The first thing that I did in this case was:
ioscan
pvdisplay
diskinfo
check logs

But from my point of view this is not very informative: each time, the disk shows CLAIMED status, which doesn't tell me whether or not there is a disk problem.

So now I have found aplsrv... thanks Bill.
I get the same error every day, during the night, after the full data backup.
Do you think that if I test it now (following Joelmel's instructions) it will give useful information, even though I only get the error at night after the backup?

Thanks in advance.
Cheers
K.

Steven E. Protter · ‎09-02-2007

Shalom Kate,

Are you sure /storage/events/disks/default/0_0_2_0.0.0 is a valid disk assigned to the system?

I've run into a situation a number of times when a system is fiber connected and no LUN0 is assigned, the disk array itself is picked up as a disk in EMS.

The problem is that it's not really a disk, and it confuses EMS. My solution was to ask the SAN admin to assign the next disk allocation to LUN0.

It's worth checking to see if this is the case. That would involve matching all the disk paths like this one to ioscan output.
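
A minimal sketch of that matching, assuming the last component of the EMS resource name is simply the hardware path with '/' written as '_'. hw_path_from_resource is a made-up helper name.

```shell
# hw_path_from_resource is a hypothetical helper name.
# It takes the last component of the EMS resource name and turns the
# underscores back into the slashes used in ioscan hardware paths.
# Assumption: the resource name encodes the hardware path this way.
hw_path_from_resource() {
    basename "$1" | tr '_' '/'
}

# Example use on the system (ioscan -H restricts the scan to one hardware path):
#   ioscan -fn -H "$(hw_path_from_resource /storage/events/disks/default/0_0_2_0.0.0)"
```

If ioscan shows a disk with a device file at that path, the resource is a real disk; if it shows the array controller instead, the LUN0 situation described above is the more likely explanation.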

SEP

Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

kate32 · ‎09-02-2007

Hello,

Yes, it is a valid disk. I have no SAN disks, only 4 internal disks: 2 for the system (vg00) and 2 for the data (vg01). The alert I receive every night (4am) is for one of the vg01 disks.
Those disks are new; we replaced them 2 or 3 months ago.

Regards,
K.

Andrew Merritt_2 · ‎09-03-2007

Hi Kate,
First of all, what's the full EMS event that's being reported in event.log?

These events are reported by the monitor in response to the device driver reporting an error. If you run the logtool command in STM, you should see the corresponding I/O errors there to confirm this. A failure to read the media won't stop the disk being shown as CLAIMED.

You say the event is logged at the same time every day; that may mean you have quite a serious problem, since the OnlineDiags monitors typically filter events so that (in most cases) a particular event is only reported once in 24 hours. Running logtool should show whether you've got multiple errors being reported.

You can also run the Disk Exercise tool in STM, to see if that shows any errors when accessing the disk.

My initial reaction is that you've got a bad disk which will need replacing. If you've got data on it, make sure it's backed up as soon as possible.

Finally, make sure you've got the latest OnlineDiags installed (though I don't think that's a factor in this case, unless as suggested above the device path is not actually a physical disk but a LUN).

Andrew

kate32 · ‎09-03-2007

hi Andrew,
I have the event.log; here is the entry that I now get every day at 4am:

>------------ Event Monitoring Service Event Notification ------------<

Notification Time: Tue Sep 4 04:00:08 2007

MYMACHINE sent Event Monitor notification information:

/storage/events/disks/default/0_0_2_0.0.0 is >= 1.
Its current value is CRITICAL(5).

Event data from monitor:

Event Time..........: Tue Sep 4 04:00:07 2007
Severity............: CRITICAL
Monitor.............: disk_em
Event #.............: 100512
System..............: MYMACHINE

Summary:
Disk at hardware path 0/0/2/0.0.0 : Media failure

Description of Error:

The device was unsuccessful in reading or writing data for the current I/O
request due to an error on the medium. The data could not be recovered.

Probable Cause / Recommended Action:

Reformatting the medium may fix the problem. If the medium is
removable, replace the medium with a fresh one.

Alternatively, if the medium is not removable, the device has experienced
a hardware failure. Repair or replace the device, as necessary.

Additional Event Data:
System IP Address...: MYIPXXXXXXXX
Event Id............: xxxxxxxxxxxx
Monitor Version.....: B.01.00
Event Class.........: I/O
Client Configuration File...........:
/var/stm/config/tools/monitor/default_disk_em.clcfg
Client Configuration File Version...: A.01.00
Qualification criteria met.
Number of events..: 1
Associated OS error log entry id(s):
xxxxxxxxxxxxxxxxxxxxxxxx
Additional System Data:
System Model Number.............: 9000/800
OS Version......................: B.11.00
STM Version.....................: A.25.00
EMS Version.....................: A.03.20
Latest information on this event:
http://docs.hp.com/hpux/content/hardware/ems/scsi.htm#100512

v-v-v-v-v-v-v-v-v-v-v-v-v D E T A I L S v-v-v-v-v-v-v-v-v-v-v-v-v

Component Data:
Physical Device Path...: 0/0/2/0.0.0
Physical Device Path...: 0/0/2/0.0.0
Device Class...........: Disk
Inquiry Vendor ID......: SEAGATE
Inquiry Product ID.....: ST39204LC
Firmware Version.......: HP06
Serial Number..........: xxxxxxxxxxxxxxxxxx

Product/Device Identification Information:

Logger ID.........: sdisk
Product Identifier: SCSI Disk
Product Qualifier.: SEAGATEST39204LC
SCSI Target ID....: 0x00
SCSI LUN..........: 0x00

I/O Log Event Data:

Driver Status Code..................: 0x0000007C
Length of Logged Hardware Status....: 22 bytes.
Offset to Logged Manager Information: 24 bytes.
Length of Logged Manager Information: 34 bytes.

SCSI Command Data Block: (not present in log record)

Manager-Specific Data Fields:
Request ID.............: 0x018404D7
Data Residue...........: 0x0000C000
CDB status.............: 0x00000002
Sense Status...........: 0x00000000
Bus ID.................: 0x01
Target ID..............: 0x00
LUN ID.................: 0x00
Sense Data Length......: 0x12
Sense Data Length......: 0x12
Q Tag..................: 0x74
Retry Count............: 0

>---------- End Event Monitoring Service Event Notification ----------<
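
To see how many nights this has fired, the event times could be pulled out of event.log with a small awk filter along these lines. media_failure_times is a made-up helper name, and the default log location in it is an assumption; point it at wherever event.log lives on the system. It relies only on the "Event Time" and "Disk at hardware path ... Media failure" lines shown in the notification above.

```shell
# Sketch: print the "Event Time" of every notification whose summary line
# reports a media failure. media_failure_times is a hypothetical helper.
# Assumption: the default path below is where event.log is kept.
media_failure_times() {
    log=${1:-/var/opt/resmon/log/event.log}
    awk '
        # Remember the most recent event time, without the dotted label.
        /Event Time/ { when = $0; sub(/^.*Event Time[.]*: */, "", when) }
        # The summary line follows the event time within the same record.
        /Disk at hardware path/ && /Media failure/ { print when }
    ' "$log"
}

# Example: count the media failure events logged so far
#   media_failure_times | wc -l
```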

Well, this disk is a new one, and if it is really already corrupted I would like to be able to see that somewhere...
Maybe it is only related to how the disks are used.
I mean that during the night there is a full backup (a real one: the database is shut down, the full backup runs, and then the database is started again). We are not going to change this way of doing backups, just in case someone wants to suggest it; it is beyond me... :-)

Thanks in advance for any advice on where/how to investigate.

Cheers
K.

kate32 · ‎09-03-2007

Oops, sorry, I forgot to ask: as I have never used logtool, do you know how to use it?
/usr/sbin/stm/uut/bin/tools/utility/logtool
Thanks in advance.
Cheers
K.

D Block 2 · ‎09-03-2007

So you get repeated errors or EMS alerts during the full backup?

Maybe you can try a read of all the files or inodes.
Try running a find command in one window and watching the system log in another.

terminal 1:
su root
# find / -print 2>&1 1>/dev/null

terminal 2:
tail -f /var/adm/syslog/syslog.log

Does this also produce the errors? Just wondering. If so, maybe you can use this method to troubleshoot.

And since these are internal disks, are you using mirrored disks? You might want to print out some details with vgdisplay -v on the VG.

Golf is a Good Walk Spoiled, Mark Twain.

kate32 · ‎09-04-2007

Hi,

I have done
# find / -print 2>&1 1>/dev/null
tail -f /var/adm/syslog/syslog.log

but nothing happened...
I don't know where to look now... I still receive the alerts at 4am.
The disk is new and vg01 is mirrored across 2 new disks (identical disks).
Thanks for any help.

kate32 · ‎09-11-2007

Hi,

I still have this issue. I have run cstm to try to get more information on the disk.
At first it was very bizarre: when I ran the map command, the Active Tool Status fields were blank...
I tried to run il (infolog) but it sent a warning: -- (InfoLog) is currently disabled.

So I ran info, which updated the map so that I could see the status Information Successful. Then I ran il (infolog) and unfortunately got some errors, as follows:

Error Logs
Read Errors:         4      Buffer Overruns:    N/A
Read Reverse Errors: N/A    Buffer Underruns:   N/A
Write Errors:        0      Non-Medium Errors:  0
Verify Errors:       0

So what do you think about the 4 read errors?
Do you think the disk is faulty (even though it is a new one; I changed it 3 months ago), or do you have any other idea?

Thanks very much in advance.
Cheers
K.

kate32 · ‎09-11-2007

I have tested the disk with the dd command and here is the result:
dd read error: I/O error
112416+0 records in
112416+0 records out
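
The record count gives a rough idea of how far into the disk the read got before the I/O error: it is the number of full records times the block size. The block size used for this run is not stated, so the sketch below takes it as a parameter. dd_kb_read is a made-up helper name, and it works in KB to keep the shell arithmetic small.

```shell
# dd_kb_read is a hypothetical helper: full records read times block size.
# records : the number before "+0 records in"
# bs_kb   : the block size given to dd, in KB (not stated in the post above)
dd_kb_read() {
    records=$1
    bs_kb=$2
    echo $(( records * bs_kb ))
}

# Examples for the result above:
#   dd_kb_read 112416 1      if bs=1024 was used -> 112416 KB (about 110 MB)
#   dd_kb_read 112416 256    if bs=256k was used -> 28778496 KB (about 27 GB)
```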

Andrew Merritt_2 · ‎09-11-2007

> It was very bizarre first as when I run
> the map command on the fields Active
> Tool Status were blanks...
Not bizarre if you've never run the tools. There won't be a previous status to report.

>I tried to run il (infolog) but it send a
> warning: -- (InfoLog) is currently
> disabled.
Again, the info log will only be available if you've just run the info tool.

> Do you think the disk is faulty (even if
> it is a new one (I changed it 3 months
> ago) or any other idea ?

Even new disks can go wrong. Personally, I'd be inclined to change it, and if it's new, I'd contact the vendor since it's presumably under warranty. However, disks are so large these days that there are bound to be a few defects, and the means to cope with these are built in, so what you really need to find out is whether the performance is within specifications. The EMS HW monitors are set up so that they should only report errors you need to worry about; they should ignore acceptable levels of errors (which the system is designed to cope with).

Andrew

kate32 · ‎09-12-2007

Thanks, Andrew, for taking the time to reply to my post again! Much appreciated. And thanks for explaining that the "bizarre" behaviour was not really bizarre; it was the first time I used cstm, so now I know.
The disk is under warranty, so I will ask for a replacement again.

Thank you again. I wanted the point of view of an HP specialist before deciding to replace the disk again.
Cheers
K.

kate32 · ‎09-12-2007

Thank you all. I have learned more about investigating hardware issues, and next time I will be more confident with this :-) hi hi hi

Cheers
K.
