Operating System - OpenVMS

Understanding disk errors

 
SOLVED
Jim Lahman_1
Advisor

Understanding disk errors

Guys:

My system disk is reporting multiple errors:

Device            Error Count
CGLHD1$DKB0:              195

Attached is a translated dump of the error log file for some of the errors. Can you please help me understand what I'm seeing and how to interpret these errors?

Thanks, Jim
Cheers!
9 REPLIES
Ian Miller.
Honored Contributor

Re: Understanding disk errors

What did you use to translate that error log?
I think for a DS25 you have to use SEA (unfortunately).
http://h18023.www1.hp.com/support/svctools/webes/index.html#SEA

Any disk that has 195 errors is unwell. I hope you have a good backup.
____________________
Purely Personal Opinion
Jan van den Ende
Honored Contributor

Re: Understanding disk errors

Jim,

your first question, so: WELCOME.

I concur with Ian, you should VERY MUCH distrust that disk!

Do you have ANY means of transferring the contents to another one? Then do so NOW.
Or go to any length to arrange it....

I am afraid that I have to wish you GOOD LUCK, because that is what I think you need most now.
But I REALLY wish you ARE lucky on this!

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Aaron Sakovich
Super Advisor
Solution

Re: Understanding disk errors

Let's see if you have any better options than just "luck"! 8^)

First off, it's a SCSI drive, so it has bad block remapping. That's automatic, so by the time the OS sees an error, it may have already exhausted the re-mappable blocks. I can't tell from the log you posted whether it's gotten that far or not.

I'd surely recommend kicking everyone off the system if possible, then Analyze /Disk /Repair Sys$SysDevice: on it. That should give a better clue as to whether any data or programs on the disk have been corrupted. Note that it's not unusual to find some recoverable errors on a disk; it's the unrecoverable errors you should be wary of. The /Repair switch will ensure that any recoverable errors are fixed prior to the next step.

If it finds any unrecoverable errors, you will definitely need to resort to a prior backup.

If not, or if the errors are in files you can live without, I'd recommend a backup ASAP. Shut the system down, boot from your VMS732 CD, and use the command-line BACKUP to back up your system disk. Full details on how to accomplish this task are available here:

http://h71000.www7.hp.com/doc/732FINAL/aa-pv5mh-tk/00/01/129-con.html

Hopefully you've got a sufficiently capacious tape drive; alternatively, you could run the backup to copy from one drive to another.
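As a rough sketch of what that looks like at the DCL prompt of the CD's menu environment (the "$$$" prompt): DKB0: is the system disk from this thread, but MKA500: (tape), DKB100: (spare disk) and the save-set name SYSDISK.BCK are made-up names for illustration only. Check the manual linked above for the exact procedure on your version.

```dcl
! Assumed devices: DKB0: = failing system disk (from this thread),
! MKA500: = tape drive (hypothetical), DKB100: = spare disk (hypothetical).

! Option A: image backup to a tape save set
$$$ MOUNT/OVERRIDE=IDENTIFICATION DKB0:
$$$ MOUNT/FOREIGN MKA500:
$$$ BACKUP/IMAGE/VERIFY DKB0: MKA500:SYSDISK.BCK/SAVE_SET

! Option B: image copy disk to disk (this overwrites everything on DKB100:)
$$$ MOUNT/OVERRIDE=IDENTIFICATION DKB0:
$$$ MOUNT/FOREIGN DKB100:
$$$ BACKUP/IMAGE/VERIFY DKB0: DKB100:
```

/VERIFY makes BACKUP compare what it wrote against the source, which costs time but seems worth it on a disk that is already throwing errors.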

Hope that helps!
Aaron
Volker Halle
Honored Contributor

Re: Understanding disk errors

Jim,

unfortunately, ANAL/ERR/ELV (the new OpenVMS Error Log Viewer) cannot translate SCSI disk errors. You would need either SEA (as pointed to by Ian) or - even better - DECevent V3.4 (DIAGNOSE), which is still the best tool to translate SCSI related errors.

You can download DECevent from the Analysis Service Tools page:

http://h18023.www1.hp.com/support/svctools/

If you have DECevent installed on any of your other OpenVMS systems, you can copy ERRLOG.SYS there and diagnose it on that node. If you have WEBES/SEA installed on your Windows Notebook/PC, you can also translate and analyze ERRLOG.SYS there.

Please note that ANAL/DISK DKB0: will ONLY detect errors in the file system metadata. These are unlikely to go undetected, as the file system uses write-check with all its I/Os. If there are file system errors, you could try to repair them with ANAL/DISK/REPAIR DKB0:

Only ANAL/DISK/READ disk: reads all blocks on the disk that are actually allocated to files, so it would show whether there are problems physically reading any blocks in your data files.
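For reference, the three abbreviated commands above written out in full. This is only a sketch; it assumes the abbreviations expand to the ANALYZE/DISK_STRUCTURE qualifiers shown, with DKB0: being the disk from this thread.

```dcl
$ ! Report file-structure (metadata) errors only; changes nothing on disk
$ ANALYZE/DISK_STRUCTURE DKB0:
$ ! Same check, but also repair the file-structure errors it finds
$ ANALYZE/DISK_STRUCTURE/REPAIR DKB0:
$ ! Additionally read every block that is allocated to a file
$ ANALYZE/DISK_STRUCTURE/READ_CHECK DKB0:
```

Only the last form touches the data blocks themselves, so it is the one that can surface unreadable blocks inside files.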

Disks can sometimes show increasing error counts for weeks before they suddenly fail. You should be prepared to replace that disk, and make sure you regularly run BACKUP.

Volker.
Ian Miller.
Honored Contributor

Re: Understanding disk errors

Before doing anything get an image backup of that disk. Then ANAL/DISK/READ will read all the allocated blocks and if the error count goes up you can work out what files are bad.
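One way to see whether the error count moves during the read check is sketched below. It assumes DKB0: is the device in question; the symbol names before, after and delta are just illustrative.

```dcl
$ ! Record the device error count, run the read check, then compare
$ before = F$GETDVI("DKB0:", "ERRCNT")
$ ANALYZE/DISK_STRUCTURE/READ_CHECK DKB0:
$ after = F$GETDVI("DKB0:", "ERRCNT")
$ delta = after - before
$ WRITE SYS$OUTPUT "New errors during read check: ", delta
$ ! SHOW ERROR lists the error counts for all devices that have any
$ SHOW ERROR
```

Any files that ANALYZE reports while the count climbs are the ones to restore from backup.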
____________________
Purely Personal Opinion
Jim Lahman_1
Advisor

Re: Understanding disk errors

First, I want to give everyone a BIG thanks for all your input.

Here's what I did so far. I immediately booted everyone off the computer and told them to log on to the other node (we have a 2 node cluster).

I booted from the VMS732 CD, started the image backup and got these messages...

DKB0 is offline
Mount verification is in progress.

and this just repeats itself until I see a fatal drive error.

Now, my next step is to get a new drive. I do have an image backup from 7/3/06.

Jim
Cheers!
Ian Miller.
Honored Contributor

Re: Understanding disk errors

time to test your recovery strategy :-(
____________________
Purely Personal Opinion
Wim Van den Wyngaert
Honored Contributor

Re: Understanding disk errors

I had disks in shadow sets that gave hundreds of errors when I started to use them. After a few weeks they became stable and have now been in use for over 4 years.

Fwiw

Wim
Robert Gezelter
Honored Contributor

Re: Understanding disk errors

Jim,

Wim brings up a good point, which is worth emphasizing. I will keep the discussion at a very high level, for the sake of clarity.

"Disk errors" is a generic term; the appropriate action depends on the specific error.

Many of us have seen a disk suddenly generate a series of errors, and then "settle down". This is not uncommon, and typically happens when a disk is first placed into service, and again after some time in service.

This phenomenon is generally relatively benign (providing your data is backed up or can be regenerated). Remember, disk media can degrade over time. OpenVMS deals with this situation by revectoring (moving the block to a different location--if one is lucky, the data can be automatically reconstructed through the use of CRC/ECC). Often, multiple related disk blocks will be affected by this.

This is generally disconcerting, but not actually dangerous in a global sense.

Since critical FILES-11 structures are replicated, this generally does not compromise the file system, though it may introduce inconsistencies.

Situations which involve erratic functioning of the motors or positioners are more problematic. They cause complete failure of the drive. It is not uncommon for these failures to lead directly to Mount Verification Timeouts. Many things can cause these, from physical failure to failure of the underlying disk logic (on some older drives, you can switch the drive-resident logic board with that of a second drive, but you must be careful).

In a production system, I would, at the least, move users off of the drive and start a period of intense exercising before trusting the drive again. If it goes into Mount Verification timeout, and switching the drive to a different shelf, etc. does not help (these timeouts can also be caused by power supply problems), I would remove the drive from service (with appropriate scrubbing procedures).

- Bob Gezelter, http://www.rlgsc.com