
Re: EMS verses stm

 
John Waller
Esteemed Contributor

EMS verses stm

We have a situation where EMS is reporting a problem with a disk. It is reporting a CRITICAL error and complaining about being unable to perform a write I/O.
Checking the disk using STM, all appears OK: no errors are logged at all and verification works fine.
Which is more likely to be correct?
Steven Sim Kok Leong
Honored Contributor

Re: EMS verses stm

Hi,

You won't want to perform a "write" test on your production disk just to verify whether it is faulty for writes. In STM, you are likely to have performed an exercise test only.

To check it out, schedule downtime, perform a full backup of the data on the disk, and then initiate a write test using a tool like STM or dd. That will help verify whether your disk is faulty. If it isn't, then the fault could be in your disk controller. Check whether other disks on the same bus are also seeing the same errors.
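For what it's worth, a dd-based write test could look roughly like the sketch below. This is only an illustration under assumptions: the function name write_test, the scratch file names under /tmp, and the 0x55 fill pattern are all made up for the example, and it is written for a generic POSIX shell rather than verified on HP-UX. It overwrites the target, so it is only for a disk that has been backed up and taken out of service.

```shell
#!/bin/sh
# Destructive write test (illustrative sketch, not an STM replacement).
# Writes a known pattern to TARGET, reads it back, and compares.
# WARNING: everything in the first BLOCKS kilobytes of TARGET is overwritten.
write_test() {
    target=$1                       # file or raw device file to overwrite
    blocks=$2                       # number of 1 KB blocks to test
    pattern=/tmp/wt_pattern.$$      # hypothetical scratch file names
    readback=/tmp/wt_readback.$$
    status=0

    # Build a pattern file of 0x55 bytes (alternating bits).
    dd if=/dev/zero bs=1024 count="$blocks" 2>/dev/null |
        tr '\000' '\125' > "$pattern"

    # Write the pattern; dd exits non-zero if a write fails.
    dd if="$pattern" of="$target" bs=1024 conv=notrunc 2>/dev/null || status=1

    # Read the same region back and compare it with what was written.
    if [ "$status" -eq 0 ]; then
        dd if="$target" of="$readback" bs=1024 count="$blocks" 2>/dev/null || status=1
    fi
    if [ "$status" -eq 0 ]; then
        cmp -s "$pattern" "$readback" || status=1
    fi

    rm -f "$pattern" "$readback"
    return "$status"
}

# Example against a scratch file (safe to run):
#   : > /tmp/scratch.img && write_test /tmp/scratch.img 1024 && echo "write test passed"
```

Pointing it at a raw device file instead of a scratch file is what makes it a real disk test, and also what makes it destructive, so double-check the device name before running anything like this.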

Hope this helps. Regards.

Steven Sim Kok Leong
John Waller
Esteemed Contributor

Re: EMS verses stm

Please Don't Laugh.

The main reason for asking this question is that the disk concerned is the root disk, one of two in /dev/vg00. The company didn't want to buy mirrordisk, to save on costs, so we have no fallback plan. This message is only appearing on one disk: so far, 2 messages in 3 days.
Steven Sim Kok Leong
Honored Contributor

Re: EMS verses stm

Hi,

Sorry if I gave the impression that I was laughing. I seriously wasn't.

Usually, if there is a bad sector on the disk, you will not be able to read from it either. As such, I personally think that the exercise test you performed in STM would be rather thorough.

Have you run an exercise test on the SCSI controller as well?

Can you post the EMS error message for us forumers to take a look at? The EMS error might give more insight.

Hope this helps. Regards.

Steven Sim Kok Leong
James R. Ferguson
Acclaimed Contributor

Re: EMS verses stm

Hi John:

I'd make sure that I had a current Ignite recovery tape of all of vg00 -- just for insurance's sake, should your disk go bad ;-)

In my opinion, no server should be without a mirrored boot volume.

# /opt/ignite/bin/make_tape_recovery -x inc_entire=vg00 -I -v -a /dev/rmt/0mn

Regards!

...JRF...
John Waller
Esteemed Contributor

Re: EMS verses stm

Attached a copy of the ems report.

We already have an up-to-date make_recovery tape, taken recently, on hand just in case it is needed.

One interesting thing I noticed was that the first message occurred Sunday pm, when nothing was running on the system.
S.K. Chan
Honored Contributor

Re: EMS verses stm

I would replace 52.4.0 as soon as possible.
Steven Sim Kok Leong
Honored Contributor

Re: EMS verses stm

Hi,

I would trust the EMS output and give it the benefit of the doubt.

Since EMS has been reporting write errors and not read errors, an exercise test of your disk via STM may not reflect the problem accurately, because that is a read test. However, if the cause were bad sector(s), then a disk exercise should fail during the read as well.
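As a rough cross-check outside STM, a plain read pass with dd will also trip over sectors that cannot be read. The sketch below is only an illustration: the function name read_pass is made up, and the device file in the usage comment is hypothetical, so substitute the raw device file for your own disk. It reads only and writes nothing to the disk, which also means it says nothing about write failures.

```shell
#!/bin/sh
# Read-only pass (illustrative sketch): read every block of a disk or file
# and report whether any read failed. Nothing is written to the source.
read_pass() {
    src=$1
    # dd exits non-zero if a read error stops the copy.
    if dd if="$src" of=/dev/null bs=1024k 2>/dev/null; then
        echo "read pass OK: $src"
    else
        echo "read pass FAILED: $src"
        return 1
    fi
}

# Example (hypothetical raw device file; use the one for your disk):
#   read_pass /dev/rdsk/c1t4d0
```

A clean pass here, like a clean STM exercise, only shows that the data can be read back; it does not rule out a write-side fault.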

Did you perform just a verification test or an exercise test on your hard disk? If you did perform an exercise test and it passed, then the failure might be due to a hard-disk hardware problem other than bad sectors (such as the write head). Note that the EMS error stated:

Reallocating the data to a spare area on the medium was
attempted, but failed.

Since this is the only disk on the bus that has been receiving the errors, it should be safe to deduce that your SCSI controller is functioning fine.

Since this is your root disk (and not a data disk), you should perform an updated make_tape_recovery and replace the disk, as the others have already mentioned. Otherwise, you risk some of your system data being corrupted.

Hope this helps. Regards.

Steven Sim Kok Leong
John Waller
Esteemed Contributor

Re: EMS verses stm

Steven,

Many thanks for the information. I actually performed a verification to start with, but after your previous post I then performed an exercise. Both worked fine and completed 100% with no new messages reported by EMS. As this machine was purchased second-hand, I have found out that my colleague who loaded the OS had actually performed a read/write test on the disk prior to the OS load at the beginning of December, which again was fine.
I am interested to see that another thread has been started regarding EMS reporting critical failures on a fully working system. Is EMS oversensitive?
Michael Tully
Honored Contributor

Re: EMS verses stm

Hi John,

One quick piece of advice:

If the system is production then replace
the disk, and then make arrangements to have
a mirrored disk installed. I guess you could
answer your own question... how much would
really be lost cost wise if a production
system was down when it is being used.
We had a production system down today
because of a faulty HBA card and it has
cost us in the tens of thousands. Having
the right redundancies helps, but you can
never really predict what time of what
day you have a hardware failure.

-Michael
Anyone for a Mutiny ?
John Waller
Esteemed Contributor

Re: EMS verses stm

Michael,

We both know that a simple purchase like mirrordisk saves thousands in lost production time, but when the IT department falls under the control of the accounting trolls (Dilbert cartoon), trying to get even a box of DDS2 tapes is a nightmare.