Re: Tape drive scsi problems

Fredrik.eriksson · ‎01-09-2009

Hi again guys :)

Thanks for the help in my last post about the broken tape drive.

Now it's acting up again, but this time with a different type of error; the drive itself actually seems to be working properly this time.

During the holidays the tape drive worked for the most part. But when I got back to work it started reporting device timeout errors.

To explain a bit more: when I run "mount /for mke500:" it first reports that the medium is offline and then prints "%MOUNT-F-TIMEOUT, device timeout". After that, every mount attempt returns the error message immediately (within 1 second).

I've tried rebooting the machine in the hope that it would solve something, but when I did, it didn't even show up as a device. After some work I got it working again, but it seemed more like a coincidence than a solution. I've also tried changing the SCSI channel, which worked for about 2 hours before failing with the same result.

By my reasoning this could be one of three things:
1) The SCSI cable is broken?
2) The SCSI card is broken? (Qlogic ISP1020 SCSI)
3) The tape drive is broken; this is unlikely thou... since I got it to work fine for periods of time before the timeout errors occurred.

Is there a more common cause than these, or am I on the right track?

Best regards
Fredrik Eriksson

Steven Schweda · ‎01-09-2009

> [...] it didn't even show up as a device.

If SYSMAN IO AUTOCONFIGURE (or a reboot)
does not detect the device now (but did
before), then you would seem to have some bad
hardware somewhere in the chain.

> [...] this is unlikely thou[gh]... [...]

Working things can fail. (Often, it's
exactly the working things which do fail.)
Replacing things other than the (new) tape
drive would certainly be a reasonable way to
start, however. What is the system here?
SCSI cables and old Qlogic PCI SCSI cards are
normally pretty easy to find for close to no
money.

> [...] my last post [...]

Including an URL would make that easier to
find. (For best results, leave out the XX in
"forumsXX.itrc.hp.com", and any "admit=X+Y+Z"
segment in the query string.)

Fredrik.eriksson · ‎01-09-2009

Yes, of course, I should've included a link.
(http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1295322)

It's an OpenVMS 7.3-2 system. I might've gotten hold of a proper SCSI cable, but it probably won't be available to try out until Monday.

I know that "working" hardware can malfunction in weird ways, and it is a possibility since we just got a "new" one from HP just before Christmas.
I'd rather it were just a SCSI cable problem, mostly because that would be the simplest fix ;P

Running $ MC SYSMAN IO AUTO /LOG works fine until the drive starts reporting device timeouts, and after that it just can't detect it.

Best regards
Fredrik Eriksson

Allan Large · ‎01-09-2009

We have actually encountered a very similar problem with an rx2660. The solution was simple in that the controller card had to be reseated. The problem was resolved.

Dennis Handly · ‎01-09-2009

>should've included a link.

Also be careful about punctuation (your trailing ")"), better to use a line by itself.
http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1295322

cnb · ‎01-13-2009

Did HP swap out the entire drive and enclosure or just the drive?

In addition to your suspect list don't overlook the terminator.

It could be the enclosure p/s as the older table-top supplies had a high failure history.

Turn off the enclosure for a while, then power up and retry. If you can connect, then most likely the p/s is failing when hot.
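One way to put numbers on the "failing when hot" idea is to note how long the enclosure was powered off before each attempt and how long it then ran before the first timeout. The sketch below is only an illustration in Python: the `observations` values are made up, and `pearson` is a hypothetical helper name, not anything from VMS or the drive. A clearly positive correlation would be consistent with a thermal fault; a value near zero would point elsewhere.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("a sample with no variation has no defined correlation")
    return cov / sqrt(vx * vy)


# ASSUMPTION: hypothetical hand-kept notes, not measurements from this thread.
# Each pair is (minutes powered off, minutes until the first device timeout).
observations = [(2, 35), (10, 80), (30, 150), (60, 240)]

off_minutes = [o[0] for o in observations]
run_minutes = [o[1] for o in observations]
r = pearson(off_minutes, run_minutes)
print(f"correlation between cool-down and run time: {r:.2f}")
```

With only a handful of attempts this is a rough indication at best, but it is cheap to collect while waiting for replacement parts.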

Just a thought.

HTH,

Fredrik.eriksson · ‎01-13-2009

Hi cnb,

You're correct, they didn't change the enclosure, just the internal bits.
I haven't checked the terminator, so you could be right there. But it does behave somewhat like you describe.
If I turn it off for a while, it does reconnect and work again. I didn't notice this in quite the way you described; it has usually started working after I disconnected and reconnected the SCSI cable, which might give it enough time to cool down, I guess.
But it's weird... It has run for several hours before it stops responding with device timeout errors, even when I'd only powered it off for about 2 minutes.
I haven't checked yet, but I moved it to another ES40 machine yesterday around 1pm and it was still working when I went home around 5pm. Hopefully (the simplest outcome, actually) it fails there too, which would mean the tape unit itself is causing these issues.
If it does keep working, then by my reasoning there are 2 possible sources of the error: either my OpenVMS installation is doing this or the SCSI card is broken.
Replacing the SCSI card isn't much of an issue, but if it's an operating system problem then I need to find some other temporary solution, since these machines are to be shut down within 6 months.

Best regards
Fredrik Eriksson

cnb · ‎01-13-2009

Is there anything in the VMS error log to indicate GROSS or SCSI Phase Errors?

I'll bet my Carlsberg on the power supply being 'noisy'.
;-)

HTH,

David Lethe · ‎01-17-2009

There is an industry-standard spec for monitoring health and decoding errors for tapes and autochangers. Google TapeAlert and you'll find info. One software product, which is not ported to VMS but is ported to HP-UX, Windows, and just about everything else, has some screenshots and further info. If you are able, temporarily hook the tape to a host running a supported O/S. Check out the manual and links for TapeAlert at http://www.santools.com/smart/unix/manualo

Hoff · ‎01-17-2009

Certainly do watch the SMART data (as there are some data points that do tend to predict failure), but do keep your data archives or recovery strategies current.

There are standards for all sorts of things to do with SCSI, too. Some of which sort-of match reality. (The best part of working with storage standards is that there are so many to choose from.)

The quote from Smart Reseller magazine over at that cited web site aside, the SMART monitoring (for disks) has been found to detect and report only a surprisingly small fraction of disk device failures. Prior to catastrophic failure, that is. SMART simply isn't a reliable predictor of failure, based on some large-scale empirical studies from folks at CMU and Google.

As for tools, here's some open source that might well be (reasonably) portable:

http://sourceforge.net/projects/smartmontools/

The ioctl() code that's very likely included (I haven't looked at the source code) would need to be switched over to IO$_DIAGNOSE calls to send the SCSI command packets, etc.

With OpenVMS and the specific device timeout case, that's already a failure somewhere in the chain. SMART and related tools likely won't help all that much. HP SIM / HP SEA / WBEM+WEBES / whateverthisstuffiscallednow might be worth a look. But I'd start swapping some SCSI parts here first, and see if or where the bug moved to. That's simpler.
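The swap-and-see approach can be written down as a small bookkeeping exercise: record which parts were in the chain for each attempt and whether it timed out, then keep only the parts common to every failure. This is a minimal Python sketch under stated assumptions: `narrow_suspects` is a hypothetical helper, the part labels are illustrative, and the trial log (in particular the result on the second host) is invented for the example.

```python
def narrow_suspects(trials):
    """Return the parts that were present in every failing trial.

    trials: list of (parts, failed) pairs, where parts is a set of labels
    for everything in the SCSI chain and failed is True if a device timeout
    occurred. Passing trials are deliberately NOT used to clear a part:
    the fault is intermittent, so a part can look fine for hours.
    """
    failing = [set(parts) for parts, failed in trials if failed]
    if not failing:
        return set()
    return set.intersection(*failing)


# ASSUMPTION: hypothetical trial log; labels and outcomes are illustrative.
trials = [
    ({"drive", "enclosure_ps", "cable_A", "terminator", "card_1", "host_1"}, True),
    ({"drive", "enclosure_ps", "cable_B", "terminator", "card_1", "host_1"}, True),
    ({"drive", "enclosure_ps", "cable_B", "terminator", "card_2", "host_2"}, True),
]

suspects = narrow_suspects(trials)
print(sorted(suspects))
```

Each swap that still fails removes the swapped-out part from the list, so changing one part at a time narrows things fastest. If the list ever comes back empty, more than one part is probably involved.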
