Operating System - HP-UX

LBOLT errors indicative of impending hard drive failure?

 
SOLVED
Mark Tunnell
Advisor


On the HP9000 systems I've been managing over the past 6 years or so, I've occasionally encountered LBOLT errors on external SCSI drives that appear out of the blue, with no hardware or software changes preceding them. In some cases powering everything off, tightening all the SCSI connections and turning everything back on fixes the problem. However, in most cases I've had to replace the drive generating the errors to make the errors go away.

What confuses me about this is that the drive in question will often continue to work after the initial errors. Diskinfo may show the disk being fine. Small amounts of data may be written to it. Sometimes, if a large amount of data is written to it, the LBOLT errors return; sometimes they don't. When the drive is replaced the problems are gone.

What I've gradually come to assume is that the disk is "going bad" and I need to replace it as soon as possible. However, I've never understood what "going bad" would actually mean. And why would it generate LBOLT errors? Could a hard drive begin operating inefficiently and therefore generate LBOLT errors because its timeout value needs to be increased?

Has anyone had the same experience but come to a more satisfactory explanation of the cause? Any insight would be appreciated.

Mark
6 REPLIES
Bharat Katkar
Honored Contributor

Re: LBOLT errors indicative of impending hard drive failure?

Hi,
The link below should shed some light on the issue.

http://unix.derkeiler.com/Mailing-Lists/HP-UX-Admin/2003-09/0006.html

Regards.
You need to know a lot to actually know how little you know
A. Clay Stephenson
Acclaimed Contributor

Re: LBOLT errors indicative of impending hard drive failure?

We've all come to think of disk drives as so ordinary and so ubiquitous that we tend to forget just how many things have to go (almost) exactly right for these "ordinary" things to work. Many times, I've seen hot-plug disks fail and the "fix" was to simply unplug them and reinsert them. The fix might last for a few days and could even be repeated but eventually the drive would fail. A slight error in the positioning of the head, a slight error in the speed, or any number of similar problems can lead to a drive that almost works perfectly. Generally, I've found that the biggest enemy of these drives is heat -- anything that you can do to keep them better cooled is good. One of the things that you learn in a solid state physics lab is just how much even a one-time modest overtemperature exposure can permanently alter the characteristics of a semiconductor. Often not enough to completely disable a junction but enough to make it marginal.
If it ain't broke, I can fix that.
LoC_1
Frequent Advisor
Solution

Re: LBOLT errors indicative of impending hard drive failure?

Mark.

These lbolts can be caused by hardware or software. A timeout can occur because of a hardware issue as noted above or by a software issue.

If the operating system holds a spinlock for a long time (several
seconds), it can cause interrupts to be missed by the SCSI driver.

This results in the SCSI driver erroneously thinking that the device
did not complete the operation (in fact, it did, but the SCSI driver
never got the notification). This is recorded as a "Request Timeout".
An "Abort averted" is a timeout that "almost happened"; it actually
did time out, but the notification was received by the driver just as
it was about to send an Abort msg to the device to cancel the request.
And a "First party detected bus hang" is often the result of the driver
losing communication w/ the card, which can occur when the driver doesn't
acknowledge interrupts in a timely fashion -- one of several possible
consequences of holding a spinlock for "too long".
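
As a rough illustration of the three message types described above, here is a small Python sketch that tallies them from syslog text. The sample lines and their layout are assumptions made up for illustration, and the function name is hypothetical; real HP-UX syslog entries may be worded or formatted differently, so the matched strings should be checked against an actual log before relying on this.

```python
from collections import Counter

# Message fragments taken from the descriptions above. Matching them as
# case-insensitive substrings is an assumption about how they appear
# in a real log.
KINDS = ("Request Timeout", "Abort averted", "First party detected bus hang")


def tally_scsi_messages(log_text):
    """Count how often each message type appears in log_text."""
    counts = Counter()
    for line in log_text.splitlines():
        lowered = line.lower()
        for kind in KINDS:
            if kind.lower() in lowered:
                counts[kind] += 1
    return counts


# Hypothetical sample lines, not copied from a real system.
SAMPLE_LOG = """\
vmunix: SCSI: Request Timeout -- lbolt: 1001, dev: 1f022000
vmunix: SCSI: Abort averted -- lbolt: 1002, dev: 1f022000
vmunix: SCSI: Request Timeout -- lbolt: 1500, dev: 1f022000
vmunix: SCSI: First party detected bus hang -- lbolt: 1600, dev: 1f022000
"""

if __name__ == "__main__":
    for kind, n in tally_scsi_messages(SAMPLE_LOG).items():
        print(f"{kind}: {n}")
```

A steadily growing count of one message type against a single device would be one hint about where to look first, though the counts alone do not distinguish a hardware cause from a software one.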


This type of error can be corrected by increasing the I/O timeout
on the affected physical volume with pvchange -t.

Louis
Mark Tunnell
Advisor

Re: LBOLT errors indicative of impending hard drive failure?

Thanks for the responses. Any idea why these LBOLT errors would appear out of the blue, with no preceding changes to the system, either hardware or software? Could a disk drive getting old or "going bad" somehow cause it to respond more slowly and generate these timeout errors?
A. Clay Stephenson
Acclaimed Contributor

Re: LBOLT errors indicative of impending hard drive failure?

I've never seen a need to increase the timeout of a local disk (as opposed to a LUN in an array) beyond the default 30 seconds. I'm willing to bet that replacing the drive will make the error messages disappear. Marginal electronics will cause all these problems. The drive might receive the i/o request and actually perform the i/o but fail to acknowledge the operation.

Your so-called "LBOLT" errors result from many causes and w/o seeing those, it's difficult to be very specific. Timeouts are but one form of error. As a general rule, if the i/o errors are confined to a single disk then replace the disk; if the i/o errors appear on multiple devices on the same SCSI bus then look at the controller, termination, and cabling. One of the most surprising aspects of SCSI technology is that often even an unterminated SCSI bus will almost work well.
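
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function name, the bus names, and the device names are hypothetical, and it assumes you have already worked out which bus each failing device sits on.

```python
from collections import defaultdict


def suggest_action(errors):
    """Apply the single-disk vs. shared-bus rule of thumb.

    errors: iterable of (bus, device) pairs, one per logged I/O error.
    Returns a dict mapping each bus to a suggestion string.
    """
    devices_by_bus = defaultdict(set)
    for bus, device in errors:
        devices_by_bus[bus].add(device)

    suggestions = {}
    for bus, devices in devices_by_bus.items():
        if len(devices) == 1:
            # Errors confined to one disk on this bus: suspect the disk.
            (device,) = devices
            suggestions[bus] = f"replace disk {device}"
        else:
            # Several devices on the same bus: suspect the shared parts.
            suggestions[bus] = "check controller, termination, and cabling"
    return suggestions


if __name__ == "__main__":
    # Hypothetical bus and device names for illustration.
    sample = [
        ("scsi0", "c0t2d0"), ("scsi0", "c0t2d0"),
        ("scsi1", "c1t3d0"), ("scsi1", "c1t5d0"),
    ]
    print(suggest_action(sample))
```

It is a starting point for triage, not a diagnosis; a single noisy drive can also disturb its neighbours on the bus.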
If it ain't broke, I can fix that.
Dave Unverhau_1
Honored Contributor

Re: LBOLT errors indicative of impending hard drive failure?

Mark,

What kind of hardware are you experiencing these SCSI timeouts on?

On the older FWD SCSI interfaces with external storage, these problems would occur relatively randomly due to cabling issues. In fact, many of the problems were caused by using cables between the HBA and disk enclosures, and between daisy-chained enclosures, that were too *short*. The fix was to use slightly longer cables.

It's also possible that one or more of your cables has been bent too severely and the wiring inside has been distorted, causing impedance "bumps" that cause noise on the bus.

Of course, there's also the possibility of flaky terminators and rogue electronics on drives, causing the drives to "not play well with others".

SCSI can provide lots of entertaining puzzle-solving activity time...entertaining when your job isn't in jeopardy, anyway...

Best Regards,

Dave
Romans 8:28