Understanding disk errors
08-10-2006 08:24 AM
My system disk is reporting multiple errors:
Device Error Count
CGLHD1$DKB0: 195
Attached is a translated dump of the error log file for some of the errors. Can you please help me understand what I'm seeing and how to interpret these errors?
Thanks, Jim
08-10-2006 08:41 AM
Re: Understanding disk errors
I think for a DS25 you have to use SEA (unfortunately).
http://h18023.www1.hp.com/support/svctools/webes/index.html#SEA
Any disk that has 195 errors is unwell. I hope you have a good backup.
Purely Personal Opinion
08-10-2006 08:52 AM
Re: Understanding disk errors
Your first question, so: WELCOME.
I concur with Ian, you should VERY MUCH distrust that disk!
Do you have ANY means of transferring the contents to another one? Then do so NOW.
Or go to any length to arrange it...
I am afraid that I have to wish you GOOD LUCK, because that is what I think you need most now.
But I REALLY wish you ARE lucky on this!
Proost.
Have one on me.
jpe
08-10-2006 09:11 AM
Solution
First off, it's a SCSI drive, so it has bad block remapping. That's automatic, so by the time the OS sees an error, it may have already exhausted the re-mappable blocks. I can't tell from the log you posted whether it's gotten that far or not.
I'd surely recommend kicking everyone off the system if possible, then running Analyze /Disk /Repair Sys$SysDevice: on it. That should give a better clue as to whether any data or programs on the disk have been corrupted. Note that it's not unusual to find some recoverable errors on a disk; it's the unrecoverable errors you should be wary of. The /Repair switch will ensure that any recoverable errors are fixed prior to the next step.
If the system finds any unrecoverable errors, you will definitely need to resort to a prior backup.
If not, or they are in files you can live without, I'd recommend a backup ASAP. Shut the system down, boot from your VMS732 CD, and use the command-line BACKUP to back up your system disk. Full details on how to accomplish this task are available here:
http://h71000.www7.hp.com/doc/732FINAL/aa-pv5mh-tk/00/01/129-con.html
Hopefully you've got a sufficiently capacious tape drive; alternatively, you could run the backup to copy from one drive to another.
Hope that helps!
Aaron
08-10-2006 05:58 PM
Re: Understanding disk errors
Unfortunately, ANAL/ERR/ELV (the new OpenVMS Error Log Viewer) is not capable of translating SCSI disk errors. You would need either SEA (as pointed to by Ian) or, even better, DECevent V3.4 (DIAGNOSE), which is still the best tool for translating SCSI-related errors.
You can download DECevent from the Analysis Service Tools page:
http://h18023.www1.hp.com/support/svctools/
If you have DECevent installed on any of your other OpenVMS systems, you can copy ERRLOG.SYS there and diagnose it on that node. If you have WEBES/SEA installed on your Windows Notebook/PC, you can also translate and analyze ERRLOG.SYS there.
Please note that ANAL/DISK DKB0: will ONLY detect errors in the file system metadata. These are unlikely to go undetected, as the file system uses write-check with all its I/Os. If there are file system errors, you could try to repair them with ANAL/DISK/REPAIR DKB0:
Only ANAL/DISK/READ disk: would read all the blocks on the disk that are actually allocated to files, so this would show whether there are problems physically reading any blocks in your data files.
Disks can sometimes show increasing error counts for weeks before they suddenly fail. You should be prepared to replace that disk, and make sure you regularly run BACKUP.
Volker.
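For anyone who wants to keep an eye on the "increasing error counts" Volker mentions, here is a minimal Python sketch, not something posted in the thread. It assumes you have saved two error-count listings as text, in the same "device: count" layout Jim quoted in his first post, and it simply reports which devices have climbed in between. The function names and the second count (212) are made up for illustration; only the 195 figure comes from the thread.

```python
import re

# Matches one data line such as "CGLHD1$DKB0: 195" (device name, colon, count).
_LINE = re.compile(r"^\s*(\S+):\s+(\d+)\s*$")


def parse_error_counts(text):
    """Return {device: error_count} from a saved error-count listing."""
    counts = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if match:  # the "Device Error Count" header and blank lines are skipped
            counts[match.group(1)] = int(match.group(2))
    return counts


def rising_devices(earlier, later):
    """Return {device: increase} for devices whose count grew between listings."""
    increases = {}
    for device, count in later.items():
        # A device missing from the earlier listing is treated as starting at 0.
        delta = count - earlier.get(device, 0)
        if delta > 0:
            increases[device] = delta
    return increases


if __name__ == "__main__":
    # Hypothetical snapshots taken a few days apart.
    first = "Device Error Count\nCGLHD1$DKB0: 195\n"
    second = "Device Error Count\nCGLHD1$DKB0: 212\n"
    print(rising_devices(parse_error_counts(first), parse_error_counts(second)))
```

A count that keeps rising between snapshots is the warning sign described above. The script only compares numbers, so you would still need SEA or DECevent to see what the individual errors actually are.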
08-10-2006 09:31 PM
Re: Understanding disk errors
ANAL/DISK/READ will read all the allocated blocks, and if the error count goes up you can work out which files are bad.
Purely Personal Opinion
08-11-2006 03:10 AM
Re: Understanding disk errors
Here's what I've done so far. I immediately booted everyone off the computer and told them to log on to the other node (we have a two-node cluster).
I booted from the VMS732 CD, started the image backup and got these messages...
DKB0 is offline
Mount verification is in progress.
and these just repeat until I see a fatal drive error.
Now, my next step is to get a new drive. I do have an image backup from 7/3/06.
Jim
08-11-2006 03:45 AM
Re: Understanding disk errors
Purely Personal Opinion
08-13-2006 06:04 PM
Re: Understanding disk errors
Fwiw
Wim
08-13-2006 10:01 PM
Re: Understanding disk errors
Wim brings up a good point, which is worth emphasizing. I will keep the discussion at a very high level, for the sake of clarity.
"Disk errors" is a generic term; depending on the error, the appropriate action differs.
Many of us have seen a disk suddenly generate a series of errors, and then "settle down". This is not uncommon, and typically happens when a disk is first placed into service, and then again after some time in service.
This phenomenon is generally relatively benign (providing your data is backed up or can be regenerated). Remember, disk media can degrade over time. OpenVMS deals with this situation by revectoring (moving the block to a different location; if one is lucky, the data can be automatically reconstructed through the use of CRC/ECC). Often, multiple related disk blocks will be affected by this.
This is generally disconcerting, but not actually dangerous in a global sense.
Since critical FILES-11 structures are replicated, this generally does not compromise the file system, though it may introduce inconsistencies.
Situations which involve erratic functioning of the motors or positioners are more problematic. They cause complete failure of the drive. It is not uncommon for these failures to lead directly to Mount Verification Timeouts. Many things can cause these, from physical failure to failure of the underlying disk logic (on some older drives, you can switch the drive-resident logic board with that of a second drive, but you must be careful).
In a production system, I would at the least move users off the drive and start a period of intense exercising before trusting the drive again. If it goes into Mount Verification timeout, and switching the drive to a different shelf, etc., does not help (these timeouts can also be caused by power supply problems), I would remove the drive from service (with appropriate scrubbing procedures).
- Bob Gezelter, http://www.rlgsc.com