Understanding disk errors

Jim Lahman_1 · 08-10-2006

Guys:

My system disk is reporting multiple errors:

Device              Error Count
CGLHD1$DKB0:                195

Attached is a translated dump of the error log file for some of the errors. Can you please help me understand what I'm seeing and how to interpret these errors?

Thanks, Jim

Cheers!

Ian Miller. · 08-10-2006

What did you use to translate that error log?
I think for a DS25 you have to use SEA (unfortunately).
http://h18023.www1.hp.com/support/svctools/webes/index.html#SEA

Any disk that has 195 errors is unwell. I hope you have a good backup.

____________________
Purely Personal Opinion

Jan van den Ende · 08-10-2006

Jim,

your first question, so: WELCOME.

I concur with Ian: you should VERY MUCH distrust that disk!

Do you have ANY means of transferring the contents to another one? Then do so NOW.
Or go to any length to arrange it....

I am afraid that I have to wish you GOOD LUCK, because that is what I think you need most now.
But I REALLY wish you ARE lucky on this!

Proost.

Have one on me.

jpe

Don't rust yours pelled jacker to fine doll missed aches.

Aaron Sakovich · 08-10-2006

Let's see if you have any better options than just "luck"! 8^)

First off, it's a SCSI drive, so it has bad block remapping. That's automatic, so by the time the OS sees an error, it may have already exhausted the re-mappable blocks. I can't tell from the log you posted whether it's gotten that far or not.

I'd surely recommend kicking everyone off the system if possible, then Analyze /Disk /Repair Sys$SysDevice: on it. That should give a better clue as to whether any data or programs on the disk have been corrupted. Note that it's not unusual to find some recoverable errors on a disk; it's the unrecoverable errors you should be wary of. The /Repair switch will ensure that any recoverable errors are fixed prior to the next step.

If it finds any unrecoverable errors, you will definitely need to resort to a prior backup.

If not, or they are in files you can live without, I'd recommend a backup ASAP. Shut the system down, boot from your VMS732 CD, and use the command-line BACKUP to back up your system disk. Full details on how to accomplish this task are available here:

http://h71000.www7.hp.com/doc/732FINAL/aa-pv5mh-tk/00/01/129-con.html

Hopefully you've got a sufficiently capacious tape drive; alternatively, you could run the backup to copy from one drive to another.

Hope that helps!
Aaron

Volker Halle · 08-10-2006

Jim,

unfortunately, ANAL/ERR/ELV (the new OpenVMS Error Log Viewer) cannot translate SCSI disk errors. You would need either SEA (as Ian pointed out) or, even better, DECevent V3.4 (DIAGNOSE), which is still the best tool for translating SCSI-related errors.

You can download DECevent from the Analysis Service Tools page:

http://h18023.www1.hp.com/support/svctools/

If you have DECevent installed on any of your other OpenVMS systems, you can copy ERRLOG.SYS there and diagnose it on that node. If you have WEBES/SEA installed on your Windows Notebook/PC, you can also translate and analyze ERRLOG.SYS there.

Please note that ANAL/DISK DKB0: will ONLY detect errors in the file system metadata. These are unlikely to go undetected, as the file system uses write-check with all its I/Os. If there are file system errors, you could try to repair them with ANAL/DISK/REPAIR DKB0:

Only ANAL/DISK/READ disk: reads all the blocks on the disk that are actually allocated to files, so it would show whether there are problems physically reading any blocks in your data files.

Disks can sometimes show increasing error counts for weeks before they suddenly fail. You should be prepared to replace that disk, and make sure you run BACKUP regularly.

Volker.
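
To make the "increasing error counts for weeks" pattern concrete, here is a minimal Python sketch that takes periodic readings of a device's error count and reports whether it is still climbing. It assumes you note the count yourself at intervals; the function names and every reading except the final 195 are hypothetical and not taken from the thread.

```python
from datetime import date

def error_growth(samples):
    """Return (date, increase) for each interval between consecutive readings.

    samples: list of (date, count) pairs in chronological order.
    """
    growth = []
    for (_, c0), (d1, c1) in zip(samples, samples[1:]):
        growth.append((d1, c1 - c0))
    return growth

def still_climbing(samples, recent=2):
    """True if the count rose in each of the last `recent` intervals."""
    deltas = [delta for _, delta in error_growth(samples)]
    tail = deltas[-recent:]
    return len(tail) == recent and all(d > 0 for d in tail)

# Hypothetical weekly readings; only the last value (195) is from the thread.
samples = [
    (date(2006, 7, 20), 12),
    (date(2006, 7, 27), 12),
    (date(2006, 8, 3), 90),
    (date(2006, 8, 10), 195),
]

print(error_growth(samples))
print(still_climbing(samples))
```

A count that keeps rising across consecutive readings is the case Volker warns about; a count that has stopped moving is closer to the "settled down" behaviour described later in the thread. A negative increase would most likely mean the counter was reset between readings, so treat such an interval with care.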

Ian Miller. · 08-10-2006

Before doing anything, get an image backup of that disk. Then ANAL/DISK/READ will read all the allocated blocks, and if the error count goes up you can work out which files are bad.

____________________
Purely Personal Opinion
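
The "if the error count goes up" check can be sketched in Python as a comparison of two saved error-count listings, one taken before the read pass and one after. This is only an illustration: the line layout is assumed from the two-line excerpt in the original post, the "after" value of 203 is made up, and the function names are hypothetical.

```python
import re

# Matches lines such as "CGLHD1$DKB0:   195": a device name ending in a
# colon, followed by an integer. Layout assumed from the excerpt in the thread.
LINE = re.compile(r"^\s*(\S+:)\s+(\d+)\s*$")

def parse_error_counts(text):
    """Map device name -> error count for every line that matches LINE."""
    counts = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts

def new_errors(before, after):
    """Return devices whose count increased, with the size of the increase."""
    b = parse_error_counts(before)
    a = parse_error_counts(after)
    return {dev: a[dev] - b.get(dev, 0)
            for dev in a if a[dev] > b.get(dev, 0)}

before = """Device Error Count
CGLHD1$DKB0: 195
"""
# Hypothetical listing captured after the read pass.
after = """Device Error Count
CGLHD1$DKB0: 203
"""

print(new_errors(before, after))
```

An empty result would mean the read pass got through the allocated blocks without adding to the count. Any increase says there is a read problem but not which file it is in; that still has to come from the ANAL/DISK/READ output itself.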

Jim Lahman_1 · 08-11-2006

First, I want to give everyone a BIG thanks for all your input.

Here's what I've done so far. I immediately booted everyone off the computer and told them to log on to the other node (we have a 2-node cluster).

I booted from the VMS732 CD, started the image backup and got these messages...

DKB0 is offline
Mount verification is in progress.

These just repeat until I see a fatal drive error.

Now, my next step is to get a new drive. I do have an image backup from 7/3/06.

Jim

Cheers!

Ian Miller. · 08-11-2006

Time to test your recovery strategy :-(

____________________
Purely Personal Opinion

Wim Van den Wyngaert · 08-13-2006

I had disks in shadow sets that gave hundreds of errors when I first started using them. After a few weeks they became stable, and they have now been in use for over 4 years.

Fwiw

Wim

Robert Gezelter · 08-13-2006

Jim,

Wim brings up a good point, which is worth emphasizing. I will keep the discussion at a very high level, for the sake of clarity.

"Disk errors" is a generic term; the appropriate action depends on the specific error.

Many of us have seen a disk suddenly generate a series of errors, and then "settle down". This is not uncommon, and typically happens when a disk is first placed into service, and again after it has been in service for some time.

This phenomenon is generally relatively benign (providing your data is backed up or can be regenerated). Remember, disk media can degrade over time. OpenVMS deals with this situation by revectoring (moving the block to a different location--if one is lucky, the data can be automatically reconstructed through the use of CRC/ECC). Often, multiple related disk blocks will be affected by this.

This is generally disconcerting, but not actually dangerous in a global sense.

Since critical FILES-11 structures are replicated, this generally does not compromise the file system, though it may introduce inconsistencies.

Situations that involve erratic functioning of the motors or positioners are more problematic. They can cause complete failure of the drive. It is not uncommon for these failures to lead directly to Mount Verification timeouts. Many things can cause these, from physical failure to failure of the underlying disk logic (on some older drives, you can swap the drive-resident logic board with one from a second drive, but you must be careful).

In a production system, I would at the least move users off the drive and start a period of intense exercising before trusting the drive again. If it goes into Mount Verification timeout, and switching the drive to a different shelf, etc. does not help (these timeouts can also be caused by power supply problems), I would remove the drive from service (with appropriate scrubbing procedures).

- Bob Gezelter, http://www.rlgsc.com
