Wim Van den Wyngaert
Honored Contributor

How to interpret device errors

I have a problem with device errors. I thought that when a fatal error occurred on a disk, the disk was dropped. But now I have found a disk with an exe that returned "parity error" when doing ana/rms. The file was corrupted and unusable, yet the disk simply logged a device error.

So my questions :
1. Why is the disk not dropped ?
2. How are bad blocks exactly handled ?
3. If I get medium errors, I see that retries allowable is 16 and remaining is also 16. If I don't get subsequent messages, did the next try succeed ?

Example from diag under 7.3

**** V3.3 ********************* ENTRY 10 ********************************


Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.3
Event sequence number 33646.
Timestamp of occurrence 29-AUG-2004 20:34:01
Time since reboot 32 Day(s) 14:23:50
Host name ARBXY

System Model Digital AlphaStation 500/400

Entry Type 1. Device Error


---- Device Profile ----
Unit ARBXY$DKA0
Product Name RZ1BB-BS
Vendor DEC

-- Driver Supplied Info -
Device Firmware Revision 0658
VMS SCSI Error Type 5. Extended Sense Data from Device
SCSI ID x00
SCSI LUN x00
SCSI SUBLUN x00
Port Status x00000001 NORMAL - normal successful completion
SCSI Command Opcode x28 Read (10 byte command)
Command Data
x00
x00
x00
x5F
x10
x00
x00
x7E
x00

SCSI Status x02 Check Condition
Remaining Byte Length 18.

--- Sense Data For Device RZ1BB-BS, 2GB 68 PIN Wide - Fast 10 & Fast 20 - 7200RPM
Error Code xF0 Current Error
Information Bytes are Valid
Segment # x00
Information Byte 3 x00
Byte 2 x00
Byte 1 x5F
Byte 0 x78 LBA: x00005F78
Sense Key x03 MEDIUM ERROR
Additional Sense Length x0A
CMD Specific Info Byte 3 x00
Byte 2 x00
Byte 1 x00
Byte 0 x00
ASC & ASCQ x1100 Unrecovered Read Error.
FRU Code xEA
Sense Key Specific Byte 0 x80 Valid Sense Key Data
Byte 1 x01
Byte 2 x85 Retry Count: x0185

----- Software Info -----
UCB$x_ERTCNT 16. Retries Remaining
UCB$x_ERTMAX 16. Retries Allowable
IRP$Q_IOSB x0000000000000000
UCB$x_STS x18021810 Online
Software Valid
Unload At Dismount
Volume is Valid on the local node
Unit supports the Extended Function bit
IRP$L_PID x000C0027 Requestor "PID"
IRP$x_BOFF 3072. Byte Page Offset
IRP$x_BCNT 64512. Transfer Size In Byte(s)
UCB$x_ERRCNT 11. Errors This Unit
UCB$L_OPCNT 3431484. QIO's This Unit
ORB$L_OWNER x00010004 Owners UIC
UCB$L_DEVCHAR1 x1C4D4408 Directory Structured
File Oriented
Sharable
Available
Mounted
Error Logging
Capable of Input
Capable of Output
Random Access


Wim
Wim
Ian Miller.
Honored Contributor

Re: How to interpret device errors

There is an error with one part of the disk, but that does not mean the rest of the disk is bad. When the corrupt file is deleted and the file is marked as having bad blocks, all the blocks in the file are tested and the bad blocks should get marked as unusable. They may get re-vectored, i.e. that range of logical block numbers will refer to a different place on the physical disk.
I think DIR/FU on the file displays whether the file is marked as containing bad blocks.

What do you mean the disk gets dropped?

What I usually do after deleting the bad file is to create a file that fills the disk and do an ANAL/DISK/READ, which will read every allocated block. This will detect any other unreadable blocks that you may not know about.
____________________
Purely Personal Opinion
Wim Van den Wyngaert
Honored Contributor

Re: How to interpret device errors

Ian,

Dropped=dismounted

If I understand correctly, a single disk will continue to operate, even when there are read/write failures. Only an error that makes it inoperable will "dismount" it.

In shadow sets a write error will remove the member. But a read error ?

And what about 3)

Wim
Wim
Ian Miller.
Honored Contributor

Re: How to interpret device errors

I think for this point

"3. If I get medium errors, I see that retries allowable is 16 and remaining is also 16. If I don't get subsequent messages, did the next try succeed ?"

For the error listed the I/O was not retried.
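
To make the retry counters concrete, here is a small Python sketch using the values from the log entry. It is an illustration only: the function names are invented, and the reading of the Sense Key Specific bytes assumes the SCSI-2 fixed sense format, where bit 7 of the first byte is the "valid" bit and, for a MEDIUM ERROR, the next two bytes are the drive's own retry count.

```python
# Values copied from the error log entry in the first post.
ERTMAX = 16            # UCB$x_ERTMAX, retries allowable
ERTCNT = 16            # UCB$x_ERTCNT, retries remaining
SENSE_KEY_SPECIFIC = bytes([0x80, 0x01, 0x85])   # sense key specific bytes 0..2


def host_retries_used(ertmax: int, ertcnt: int) -> int:
    """Retries consumed by the host driver for this I/O."""
    return ertmax - ertcnt


def drive_retry_count(sks: bytes):
    """Return the drive's own retry count, or None if the valid bit is clear.

    Assumes the SCSI-2 fixed sense layout: bit 7 of the first byte marks the
    field as valid, and the next two bytes hold the count, big-endian.
    """
    if not sks[0] & 0x80:
        return None
    return int.from_bytes(sks[1:3], "big")


print("host retries used:", host_retries_used(ERTMAX, ERTCNT))
print("drive retry count:", drive_retry_count(SENSE_KEY_SPECIFIC))
```

With 16 allowable and 16 remaining, the host driver used none of its retries. The x0185 in the sense data is a separate counter reported by the drive, which suggests the drive did its own internal retries before returning the unrecovered read error.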

I don't know the reasons that cause a shadow set member to be dropped.
____________________
Purely Personal Opinion
Willem Grooters
Honored Contributor

Re: How to interpret device errors

Wim,


If I understand correctly, a single disk will continue to operate, even when there are read/write failures. Only an error that makes it inoperable will "dismount" it.


In case of a read error, it _may_ continue, if it is certain that it's a surface error (bad block). In case of a head, movement or controller error, disable it.


In shadow sets a write error will remove the member. But a read error ?


Of course on a write error. One "bad" disk is enough. I don't want the other member(s) to be corrupted.
In case of a read error - no need. The correct data will come from another member.

3:
UCB$x_ERTCNT 16. Retries Remaining
UCB$x_ERTMAX 16. Retries Allowable

Guess this is not updated for READ errors.

On VAX, there used to be a utility to locate and revector bad blocks. What happened to that program?
I don't think data on a bad block can be recovered, even with error correction, can it?

Willem
Willem Grooters
OpenVMS Developer & System Manager
Wim Van den Wyngaert
Honored Contributor

Re: How to interpret device errors

Did some testing on a single disk station (of which I have about 70).

1) Some files are corrupt (parity errors). Did delete/erase of them. Then took all available diskspace and analyzed the files that were created with it. NO ERRORS.

2) Just to be sure : I power cycled the node. No change.

Correct me if I am wrong, but if a read error occurs in a shadow set, the other disk will be tried. If this one succeeds, both disks mark the block as bad and use another one. If it fails too ... I don't know what happens. If a write error occurs, the same happens.

But for single disks, you are in danger : in case of a write, the same happens as in a shadow set but in case of a read : the system continues and returns errors to the programs. But what about corrupt exe's, com's etc ? How will VMS react ?

So, should I do an ana/dis/rep/read every weekend and repair the files that are damaged ? Or replace the disk ? Or simply implement shadowing ?

Luckily my servers use shadowing ...

Wim
Wim
Wim Van den Wyngaert
Honored Contributor

Re: How to interpret device errors

Btw : dfg reports the file as "error" when trying to move it and says "parity error" afterwards.

Wim
Wim
Bojan Nemec
Honored Contributor

Re: How to interpret device errors

Wim,

I think that modern SCSI disks can handle some parity errors. For that reason analyze/media is obsolete for such disks (I think it gives you an error that the disk is not suitable for this operation). Bad block replacement is implemented in the disk logic. The bad block is replaced from an internal pool of spare blocks which are reserved for this purpose. Probably this operation is signaled to the operating system and logged as an error.


Bojan
Bojan Nemec
Honored Contributor
Solution

Re: How to interpret device errors

Hi,

I found an Ask the Wizard post which briefly explains what happens when bad blocks are located.

http://h71000.www7.hp.com/wizard/wiz_6926.html

Bojan
Wim Van den Wyngaert
Honored Contributor

Re: How to interpret device errors

Bojan,

Just what I needed.

The parts that are vague are :

1) in my test : how can I know if the bad blocks are put on the bad block list of VMS or of the disk itself ?
2) How is VMS reacting on parity errors ?

E.g. I found a disk with a corrupt queue manager db. I tried to stop the queue manager : it won't stop. Nowhere an alarm.

Conclusion : I must repair the bad blocks and do the anal/dis/rep/read. Or replace the disk if too many files are concerned.

Wim
Wim
Wim Van den Wyngaert
Honored Contributor

Re: How to interpret device errors

I have a GS160 with a system disk behind an HSG80. On the HSG80 it is a mirrored disk.

Badblk.sys contains 35 (bad) blocks.

So, the mirror software on the hsg80 can't handle the errors ?

Wim
Wim
Wim Van den Wyngaert
Honored Contributor

Re: How to interpret device errors

Notice also in the error report of my original message (that concerned a parity error !!!).

Port Status x00000001 NORMAL - normal successful completion

Information Bytes are Valid

One should think that everything went well.
Wim
Wim Van den Wyngaert
Honored Contributor

Re: How to interpret device errors

Did a new test.

1 file with parity error. I copy the same file from another system and delete the file (with /erase to be certain). The free disk space remained the same. This while certain blocks should have been invalidated. So, the next file will have the same problem.

So, the only way to solve it is to rename the file containing the bad blocks to .badblocks. Don't delete the file or you get another file with the same problem.

Right ?

Wim
Wim
labadie_1
Honored Contributor

Re: How to interpret device errors

Well, I do not know if it is the only good approach, but I am convinced it is a good approach, that works.
Bojan Nemec
Honored Contributor

Re: How to interpret device errors

Wim,

Just a speculation on disk bad block replacement (or maybe a way to bring more confusion into my mind and yours).
The disk is capable of handling some errors, probably with parity bits and LRC and/or CRC. So it is probably capable of handling simple parity errors (with these methods you can repair the bad bits) and replacing the bad block. The replacement can be done on write and also on read. It is obvious how this is done on write. On read, when an error is found, the disk tries to handle it. If it can be resolved, it revectors one new block from the pool and rewrites it from the valid data. Maybe the algorithm is simpler: on a read error, try to resolve the error, then revector and rewrite a new block no matter whether the data was restored to its original content or not. This would explain the behavior that the error is reported only once, and it means that you could have a good physical block whose data is not valid.
Another thing that I don't know is what happens when the disk runs out of spare blocks in the bad block pool. Is there a different error (or different severity)? Such a disk must be replaced in the shortest possible time.

By the way, how do you simulate bad blocks in your testing?

Bojan
Wim Van den Wyngaert
Honored Contributor

Re: How to interpret device errors

Bojan,

I have 70 stations, of which 25% have device errors on the disk (AS 500 of 1997).

I did a ana/dis/read on some of them and tried the repair methods.

My main problem is however to understand what diag is reporting.

Preventive maintenance is not simple ...

Wim
Wim
Uwe Zessin
Honored Contributor

Re: How to interpret device errors

The error log entry from your first message is about device 'ARBXY$DKA0', right? Fibre channel disk drives are named '$1$DGAunit#:'.


BADBLK.SYS - I suggest you check the retrieval headers with

$ dump/header/blocks=count=0/page badblk.sys

I bet it is mapping the last (incomplete) cluster of the disk. That's a 'trick' to avoid special code in the allocation bitmap handling.


Port Status x00000001 - if I recall correctly, it is a message from the SCSI port driver and indicates that there were no problems on the SCSI bus itself.
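
As a rough illustration of how the status fields in that entry stack up, here is a minimal Python sketch (the function names and the small lookup tables are only for illustration, not a VMS or driver API). It relies on two conventions: an OpenVMS status value indicates success when its low bit is set, and in the sense data the top bit of the error code byte only says that the information (LBA) field is filled in.

```python
def vms_success(status: int) -> bool:
    """OpenVMS status values indicate success when the low bit is set."""
    return bool(status & 1)


# Partial tables, only the values needed for this entry and its neighbours.
SCSI_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}
SENSE_KEY = {0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
             0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR"}


def decode_error_code(byte: int):
    """Split sense byte 0 into (information_valid, response_code)."""
    return bool(byte & 0x80), byte & 0x7F


port_ok = vms_success(0x00000001)              # Port Status from the log
info_valid, response = decode_error_code(0xF0)  # Error Code from the log

print("transport (port status) ok:", port_ok)
print("device status:", SCSI_STATUS[0x02])
print("information field valid:", info_valid, "response code:", hex(response))
print("sense key:", SENSE_KEY[0x03])
```

So the "normal successful completion" and "Information Bytes are Valid" lines describe the bus transfer and the presence of a block address. The actual outcome of the read is carried by the Check Condition status and the MEDIUM ERROR sense key.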


'The free disk space remained the same.' - Of course! The space that is set aside for bad block reallocation is not drawn from the space that is available to the operating system! It is reserved and maintained inside the disk drive by its own firmware.


'to handle simple parity errors' - no, disk drives have been using EDC (error detection and correction codes) for quite a number of years. DEC has always made big noise of how many bad bits in a row they were able to detect and correct.

With 'simple parity' you can only detect 1-bit errors, but you cannot correct them, because there is not enough information to find out _which_ bit is wrong.
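
A tiny Python sketch of that point, using one even-parity bit over an 8-bit value (the function names and the sample value are arbitrary, and this is not how a real drive's EDC works):

```python
def parity_bit(data: int) -> int:
    """Even-parity bit for an integer: 1 if the number of set bits is odd."""
    return bin(data).count("1") % 2


def check(data: int, stored_parity: int) -> bool:
    """True if the data still agrees with the parity bit stored with it."""
    return parity_bit(data) == stored_parity


original = 0b10110010
stored = parity_bit(original)

# Flip each of the 8 bits in turn: every single-bit error is detected ...
single_flips = [original ^ (1 << i) for i in range(8)]
print("all 1-bit errors detected:",
      all(not check(d, stored) for d in single_flips))

# ... but the check gives the same answer for all of them, so the position
# of the failing bit cannot be recovered from the parity bit alone.
print("distinct check results:", {check(d, stored) for d in single_flips})

# A 2-bit error leaves the parity unchanged and goes unnoticed.
double_flip = original ^ 0b00000011
print("2-bit error detected:", not check(double_flip, stored))
```

Every single-bit flip fails the check in exactly the same way, which is why parity alone can flag the block as bad but cannot repair it.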
Wim Van den Wyngaert
Honored Contributor

Re: How to interpret device errors

In the link of Bojan :

"OpenVMS will set a forced-error flag in the file header".

How can I find these files or how can I see the flag using DCL ?

Wim
Wim
Bojan Nemec
Honored Contributor

Re: How to interpret device errors

Wim,

Searching for FORCEDERROR, I found another Ask the Wizard article: http://h71000.www7.hp.com/wizard/wiz_2607.html . You can get the text if you do a HELP/MESSAGE PAGRDERR. If I understand correctly, ANALYZE/DISK/READ_CHECK will do the search for you.

Bojan
Wim Van den Wyngaert
Honored Contributor

Re: How to interpret device errors

Bojan (means Beau Jean ?),

I think the file will be found by anal, but not based upon the file header flag.
I have the impression that all programs must set the flag when they encounter the error.
Then when deleting the file, a delete/erase will be done that will check all blocks and place the bad ones on the bad block list.

But maybe I am dreaming ...

Wim

Wim
Bojan Nemec
Honored Contributor

Re: How to interpret device errors

Have you looked at HELP/MESSAGE FORCEDERROR? It seems that this error is on a per-sector basis and not on a per-file basis.

(Bojan is an old Slavic name which is widely used in Slovenia)

Bojan
Bojan Nemec
Honored Contributor

Re: How to interpret device errors

Hi,

There is more reading about bad blocks, forced errors and parity errors: http://h71000.www7.hp.com/doc/731FINAL/6136/6136pro_007.html#dsa_devices
It seems quite old (not updated).
The link goes to the 7.3-1 documentation so that it points directly at the section. You can read the same in the 7.3-2 documentation set.

(With Beau Jean you probably mean how Bojan is pronounced. Sorry, my French is quite rusty, more than 25 years old. Beau Jean will work if you replace the J with a Y as in voyage)

Bojan
Wim Van den Wyngaert
Honored Contributor

Re: How to interpret device errors

Bojan : very old indeed.

I found how to find the files marked with "contains badblocks".

$ dfu search disk/char=badblocks

But I can't confirm that it is working because all my 65 stations returned 0 files found.

Btw : defragmenter stops when it finds a parity error. If you check your dfg log files you may find a corrupt disk.

Wim
Wim
Bojan Nemec
Honored Contributor

Re: How to interpret device errors

Wim,

I don't know DFU internals, but it seems that search /characteristic looks at the file header characteristics (FH2$L_FILECHAR). There is a bit called FH2$V_BADBLOCK, so with
$ dfu search disk/char=badblocks
you search for files which have this bit set. According to the Ask the Wizard article (I posted a link in one of my previous posts in this thread) this bit is used to force a bad block scan on a file.

"You can request a scan of bad blocks (using BADBLOCK_SCAN) during file deletion, by setting the FH2$V_BADBLOCK bit in the file header."

Bojan