Operating System - OpenVMS

RECOVERED ERROR what is may be ?

 
Alex Chupahin
Super Advisor

RECOVERED ERROR what is may be ?

Hello!
Maybe this is a dumb question.
I see in
$show dev dka100

Error count
435

I looked into the error log with
anal/err
and see many entries like
RECOVERED ERROR
RECOVERED READ WITH READ RETRIES
Is it dangerous?

Here is a log fragment:
******************************* ENTRY 15. *******************************
ERROR SEQUENCE 5. LOGGED ON: CPU_TYPE 00000002
DATE/TIME 14-DEC-2007 13:29:54.52 SYS_TYPE 00000006
SYSTEM UPTIME: 0 DAYS 00:00:26
SCS NODE: A1 OpenVMS AXP V8.2

HW_MODEL: 00000414 Hardware Model = 1044.

DEVICE ERROR DEC 2000 Model 30

GENERIC DK SUB-SYSTEM, UNIT _A1$DKA100:, CURRENT LABEL ""
IBM DDYS-T18350M


HW REVISION 48363953
HW REVISION = S96H
ERROR TYPE 05
EXTENDED SENSE DATA RECEIVED
SCSI ID 01
SCSI ID = 1.
SCSI LUN 00
SCSI LUN = 0.
SCSI SUBLUN 00
SCSI SUBLUN = 0.
PORT STATUS 00000001
%SYSTEM-S-NORMAL, NORMAL SUCCESSFUL
COMPLETION
SCSI CMD 829E0808
0040
READ
SCSI STATUS 02
CHECK CONDITION

EXTENDED SENSE DATA

EXTENDED SENSE 000100F0
18949E08
00000000
80000117
00000200
00000000
FF04BB00
0000C001
RECOVERED ERROR
RECOVERED READ WITH READ RETRIES
UCB$L_ERTCNT 00000010
16. RETRIES REMAINING
UCB$L_ERTMAX 00000010
16. RETRIES ALLOWABLE
ORB$L_OWNER 00010001
OWNER UIC [001,001]
UCB$L_CHAR 1C4D4008
DIRECTORY STRUCTURED
FILE ORIENTED
SHARABLE
AVAILABLE
MOUNTED
ERROR LOGGING
CAPABLE OF INPUT
CAPABLE OF OUTPUT
RANDOM ACCESS
UCB$L_STS 08021910
ONLINE
BUSY
SOFTWARE VALID
UNLOAD AT DISMOUNT
UCB$L_OPCNT 00000236
566. QIO'S THIS UNIT
UCB$L_ERRCNT 00000004
4. ERRORS THIS UNIT
IRP$L_BCNT 00008000
TRANSFER SIZE 32768. BYTE(S)
IRP$L_BOFF 00000400
1024. BYTE PAGE OFFSET
IRP$L_PID 00010003
REQUESTOR "PID"
IRP$Q_IOSB 00000000
00000000 IOSB, 0. BYTE(S) TRANSFERRED
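
The raw SCSI CMD and EXTENDED SENSE longwords in the entry above can be cross-checked against the decoded text. The following is a minimal Python sketch, not part of the original post. It assumes the log prints each longword with its lowest-addressed byte as the rightmost hex pair (little-endian storage) and that the sense data uses the fixed format from the SCSI standard. The helper names are hypothetical.

```python
import struct

def longwords_to_bytes(longwords):
    """Turn hex longwords, as printed in the log, into a byte string.

    Assumption: each longword is stored little-endian, so the rightmost
    hex pair on the line is the first byte on the wire.
    """
    return b"".join(struct.pack("<I", int(lw, 16)) for lw in longwords)

def decode_read6(cdb):
    """Decode a 6-byte SCSI READ(6) command descriptor block."""
    opcode = cdb[0]
    lba = ((cdb[1] & 0x1F) << 16) | (cdb[2] << 8) | cdb[3]
    blocks = cdb[4] or 256  # a length byte of 0 means 256 blocks
    return opcode, lba, blocks

def decode_fixed_sense(sense):
    """Pick out the main fields of fixed-format SCSI sense data."""
    return {
        "valid": bool(sense[0] & 0x80),
        "response_code": sense[0] & 0x7F,
        "sense_key": sense[2] & 0x0F,
        "information": int.from_bytes(sense[3:7], "big"),
        "asc": sense[12],
        "ascq": sense[13],
    }

# Values copied from ENTRY 15; the 16-bit word 0040 is padded to a longword.
cdb = longwords_to_bytes(["829E0808", "00000040"])[:6]
sense = longwords_to_bytes([
    "000100F0", "18949E08", "00000000", "80000117",
    "00000200", "00000000", "FF04BB00", "0000C001",
])

opcode, lba, blocks = decode_read6(cdb)
fields = decode_fixed_sense(sense)
print(f"opcode={opcode:#04x} lba={lba:#x} blocks={blocks}")
print({name: hex(value) for name, value in fields.items()})
```

Under those assumptions the command decodes as a READ(6) of 64 blocks starting at block 0x89E82, which matches the 32768-byte transfer size at 512 bytes per block. The sense data carries sense key 1 (RECOVERED ERROR) with additional sense code 17h/01h, and its information field, 0x89E94, falls inside the range being read.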


14 REPLIES
Heuser-Hofmann
Frequent Advisor

Re: RECOVERED ERROR what is may be ?

Jur van der Burg
Respected Contributor

Re: RECOVERED ERROR what is may be ?

I beg to differ that this can be ignored. Your drive is dying, and I would say make a backup while you can.

Jur.
Jan van den Ende
Honored Contributor

Re: RECOVERED ERROR what is may be ?

Alex,

I am with Jur on this one!

>>>
SYSTEM UPTIME: 0 DAYS 00:00:26
.
.
.

4. ERRORS THIS UNIT
<<<

The occasional correctable read error does not overly concern me.

But FOUR of them in 26 SECONDS?

Time for a backup NOW, and replace ASAP.
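
To put rough numbers on that, here is a small Python calculation using only figures from the log entry (the uptime, UCB$L_ERRCNT and UCB$L_OPCNT). It is a back-of-the-envelope sketch, and it assumes all four errors were logged since that boot.

```python
uptime_seconds = 26   # SYSTEM UPTIME: 0 DAYS 00:00:26
errors = 4            # UCB$L_ERRCNT: 4. ERRORS THIS UNIT
qios = 566            # UCB$L_OPCNT: 566. QIO'S THIS UNIT

# Error rate over wall-clock time and per I/O request.
errors_per_minute = errors / uptime_seconds * 60
errors_per_thousand_qios = errors / qios * 1000

print(f"{errors_per_minute:.2f} errors per minute")
print(f"{errors_per_thousand_qios:.2f} errors per 1000 QIOs")
```

That works out to roughly nine recovered errors per minute, or about seven per thousand I/O requests, in the first half minute after boot.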

hth

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Alex Chupahin
Super Advisor

Re: RECOVERED ERROR what is may be ?

I forgot to mention that the errors appear
while I am copying a huge number of files onto that drive.

Maybe I should format the drive with SCU?
Hoff
Honored Contributor

Re: RECOVERED ERROR what is may be ?

Extended sense data in isolation is normal, and arises with some disks and with SCSI bus resets in multi-host configurations.

That written, the:

RECOVERED ERROR
RECOVERED READ WITH READ RETRIES

is severe badness.

The extended sense data referenced over in the ATW topic does not manifest itself in this fashion.

This disk is very likely in the process of transitioning over into the great disk slag heap in the sky.

When disks are logging errors, the first step is to verify the integrity and timeliness of the data archives.

I've posted up details on typical disk failures and pointers to various papers on observed disk lifetimes and failure rates, and on how depending on SMART can be risky -- but I'd bet that SMART monitoring on this drive is probably screaming "danger". Start here:

http://64.223.189.234/node/93
http://64.223.189.234/node/188

Stephen Hoffman
HoffmanLabs LLC
Hoff
Honored Contributor

Re: RECOVERED ERROR what is may be ?

I've found re-formatting a disk drive to be a waste of time and effort in most any case. Replace the drive.

I've yet to encounter a case where formatting a failing drive works, and the process risks your data twice, and it requires more work.

Formatting (or erasing) can be a good way to push a drive over into full failure. Simply getting your data off the disk can sometimes push the drive over into failure.

Existing bad block handling recovers from transient errors by rewriting blocks; when the drive starts showing errors, you're usually headed for failure. (nb: once the disk sector is sufficiently corrupted, the contents can't be recovered. In cases such as this, recovery from a hard-failed ECC block is only feasible using RAID or HBVS or such; that is, from another copy of the block.)

If there's one solidly bad block (soft and recoverable, or a hard and unrecoverable error) under a busy read-only file, you might make a case for deleting and re-writing the file. But not for formatting the disk. And if the failures are scattered around -- as is the norm -- you're almost certainly headed for failure.

And at the price of replacement disks and disk arrays, how much is your time and your data worth?

Stephen Hoffman
HoffmanLabs LLC





Alex Chupahin
Super Advisor

Re: RECOVERED ERROR what is may be ?

An interesting development, to be continued.

I took this "dead" drive and installed it in
my Itanium server, then just ran reads and writes.
No errors for hours!

I put it back into the old Alpha (the SCSI adapter is a very old AHA-1742) and the errors returned. But I still can't see any real trouble: all files read and write OK, apart from the error count of course. Maybe the hard drive is not bad after all?

Hoff
Honored Contributor

Re: RECOVERED ERROR what is may be ?

Ah. DEC 2000 model 300, DEC 2000 model 500, or DECpc 150 AXP. A Jensen. Missed that detail.

Not my favorite Alpha box, and not one for the "faint of SCSI". The Jensen SCSI Adaptec AHA-1742 is quite sensitive to the bus configuration and bus timing.

If the SCSI bus is incorrectly extended (eg: disks both inside and outside the box at the same time), or otherwise configured beyond the supported configuration, SCSI problems can arise.

If the AHA-1742 firmware isn't at G.2 (and it probably is a compatible firmware version, if you've gotten as far as you have here), bad things can happen, too.

HP retired OpenVMS Alpha support for this box at V7.3-1; that's the highest version here. Jensen is one of two Alpha boxes where support has been retired.
Alex Chupahin
Super Advisor

Re: RECOVERED ERROR what is may be ?

You will be surprised, but 8.2 still works
on the DECpc AXP-150, so this hardware is *supported*. I've seen it with my own eyes.
It would be very interesting to test 8.3, but I (and the other people around me) don't have it.
The ROM on the AHA-1742 is G.2-A, if I remember correctly.
What do you think? I hope my hard drive is still
in good health. :)

The Jensen was my first and only Alpha for a long time (since 1998 or 1999), and I learned and used Tru64 Unix, Windows NT, Linux and OpenVMS on it. I did most of my OpenVMS ports on this machine. Of course, its architecture and bus (EISA) are far from ideal, but I still love it ;)

Interestingly, I have never seen any Seagate SCSI drive work with the AHA-1742. There is a very strange effect: only the first cylinder (or the first couple of cylinders, heads, etc., I don't know exactly) works.
So I can initialize the drive, but can't read or write anything.
But any IBM, Quantum, Conner, etc. drive works fine.

Alex Chupahin
Super Advisor

Re: RECOVERED ERROR what is may be ?

Yes, the ROM is G.2A on that box, not G.2. Maybe this is the reason for the errors, as you said.

About the Seagate SCSI disks: I still don't know what could cause that trouble.
Any ideas are welcome :)
Ian Miller.
Honored Contributor

Re: RECOVERED ERROR what is may be ?

'Works' and 'supported' are not the same thing. Yes, VMS V8.2 runs on that Alpha, but it is not supported. This means it may mostly work, but HP has not performed any testing.

I consider recovered errors to be a hint that it's time to be looking for a replacement disk.
I know acquiring a replacement disk can be non-trivial sometimes, and that you have been seeking for some while an Alpha system which is not quite so vintage.
____________________
Purely Personal Opinion
Heuser-Hofmann
Frequent Advisor

Re: RECOVERED ERROR what is may be ?

In some cases a change to the drive's setup helps, e.g. disabling tagged queuing.
Alex Chupahin
Super Advisor

Re: RECOVERED ERROR what is may be ?

Yes, you are right.
But in Tru64, for example, "not supported" means "can't boot at all". I'm very happy to see that OpenVMS can be booted, to my surprise, even without HP support.

Wim Van den Wyngaert
Honored Contributor

Re: RECOVERED ERROR what is may be ?

We have about 70 old AS500 stations, each with 1 RZ1BB disk. We get errors weekly (mostly during defrag).

Because they are no longer under a support contract, we **repair** the disks as follows.

Create dummy files in a loop until the disk is nearly full.
Do anal/dis/read for the whole disk. For each file on which you get a parity error, do "rename to .bad_blocks" and "set file/nomov". This way the bad block is frozen in an unused file.
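
The logic of that procedure (read every file end to end, then park the ones that fail under a name nobody uses) can be sketched in generic Python. This is an illustration only, not the DCL used on those stations: the function names are hypothetical, it does not create the filler files, and it has no equivalent of "set file/nomov", which is what keeps the file from being relocated off the bad block.

```python
import os

def read_fully(path, chunk_size=1 << 20):
    """Read a file end to end so that every block is touched."""
    with open(path, "rb") as handle:
        while handle.read(chunk_size):
            pass

def quarantine_unreadable(directory, reader=read_fully, suffix=".bad_blocks"):
    """Rename files that fail to read, leaving their blocks allocated.

    Returns the original names of the files that were renamed.
    """
    quarantined = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories and files that were quarantined earlier.
        if not os.path.isfile(path) or name.endswith(suffix):
            continue
        try:
            reader(path)
        except OSError:
            # A read error: keep the file, but under a name nobody uses.
            os.rename(path, path + suffix)
            quarantined.append(name)
    return quarantined
```

The `reader` parameter exists only so the read step can be swapped out, for example to simulate a failing file when trying the sketch on a healthy disk.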

Many stations have 1 bad block. Some have a few, and the maximum is about 15. But none of the disks is getting really bad. 1 disk turned unreliable (giving the extended messages but no parity errors) and was taken out of production.

But we have the advantage that there is no irreplaceable data on them!

Fwiw

Wim