Tape Libraries and Drives

MSL2024 1 Drive 1840 backup failure

SOLVED
CLEB
Valued Contributor

MSL2024 1 Drive 1840 backup failure

I have an MSL2024 1 drive 1840 SCSI library that keeps failing at random points during the backup window.

It's attached via an HP SC11Xe in an HP DL580 G5 server.

I get an error "Unexpected SCSI Sense Code[ABSL:5040 CMD SC:28h 40]"

Firmware on the drive is B59W. I ran an HP LTT device read/write test and it passed after testing a full cartridge.
34 REPLIES
CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

I've set the reg key for SCSI retry to 250.

CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

LTT logs attached
Johan Guldmyr
Honored Contributor

Re: MSL2024 1 Drive 1840 backup failure

Hi, they're not attached.

What about a wellness test from the OCP - does it pass that?

Are you attaching the ltt support ticket too?
SUBHAJIT KHANBARMAN_1
Respected Contributor

Re: MSL2024 1 Drive 1840 backup failure

"I get an error "Unexpected SCSI Sense Code[ABSL:5040 CMD SC:28h 40]""

Where did you receive the above error?

Which error code is shown on the OCP of the library?


Can you please send the L&TT support ticket for the entire library?
CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

The error message is appearing in the ARCserve activity log.

[04/21 07:11:17 211c 1 WD 16316 16317 4] Unexpected SCSI Sense Code [ABSL:5040 CMD SC:28h 40]

[04/21 07:13:27 0b78 1 TK 16316 3] DRV:4 [HP Ultrium 4-SCSI B59W] recovered [1] error for [writing.].

Looks like the attachment wasn't added as it was too big.

The server is running W2K8 R2 with PSP 8.70.
SUBHAJIT KHANBARMAN_1
Respected Contributor

Re: MSL2024 1 Drive 1840 backup failure

I don't think there is any issue with the hardware as the ticket looks clean to me.
CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

I just had another failure last night.

I will run another test with HP LTT on the full library.



CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

I disabled the HP Insight Storage Agents service and I've had the first successful backup in five attempts.
Curtis Ballard
Honored Contributor

Re: MSL2024 1 Drive 1840 backup failure

Are you certain that you set the busy retry count to 250 on the right SCSI registry entry? That behavior really sounds like the busy retry issue.

The operating system creates another entry every time you update drive firmware, so there are often a lot of similar entries. I've seen a number of cases where, with multiple entries present, the one the OS was actually using didn't get configured. To help with that we will be posting a tool shortly that sets the entry automatically, but I don't have an official final copy of that tool that I can post yet.
CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

Curtis

I made the registry change and this fixed the issue. I had several weeks to a month's worth of successful backups.

This has just started to fail again recently.

The only change has been the installation of PSP 8.70.

This is the reg key:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\SCSI\Sequential&Ven_HP&Prod_Ultrium_4-SCSI\8&2f5f9346&0&000400\Device Parameters\Storport]
"BusyRetryCount"=dword:000000fa
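
Since Windows can create several similar entries for the same drive, it may help to check an export of the whole Enum\SCSI branch rather than a single key. The following is a minimal sketch, not an HP tool: it assumes the branch has been exported to text in the usual .reg format, and it lists every Storport key with its BusyRetryCount, or None where the value is missing. The second key in the sample (instance 8&11111111&0&000400) is made up for illustration, and the function name is hypothetical.

```python
import re

# Sample text in .reg export format. The first key is the one posted in this
# thread; the second instance ID is HYPOTHETICAL, added only to show a
# duplicate entry that never received the setting.
SAMPLE_EXPORT = r'''Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\SCSI\Sequential&Ven_HP&Prod_Ultrium_4-SCSI\8&2f5f9346&0&000400\Device Parameters\Storport]
"BusyRetryCount"=dword:000000fa

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\SCSI\Sequential&Ven_HP&Prod_Ultrium_4-SCSI\8&11111111&0&000400\Device Parameters\Storport]
'''

KEY_RE = re.compile(r'^\[(.+)\]$')
DWORD_RE = re.compile(r'^"BusyRetryCount"=dword:([0-9a-fA-F]{8})$')


def busy_retry_counts(export_text):
    # Returns {storport_key_path: BusyRetryCount as int, or None if not set}.
    result = {}
    current = None
    for raw in export_text.splitlines():
        line = raw.strip()
        key_match = KEY_RE.match(line)
        if key_match:
            key = key_match.group(1)
            # Only track keys that end in "\Storport"; ignore everything else.
            current = key if key.lower().endswith("\\storport") else None
            if current is not None:
                result[current] = None
            continue
        value_match = DWORD_RE.match(line)
        if value_match and current is not None:
            # dword values are written as 8 hex digits, e.g. 000000fa == 250.
            result[current] = int(value_match.group(1), 16)
    return result


if __name__ == "__main__":
    for key, count in busy_retry_counts(SAMPLE_EXPORT).items():
        print(count, key)
```

Exports written by regedit are normally UTF-16, so a real file would likely need to be read with encoding="utf-16" before being passed to the function.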
CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

I've had the HP Insight Storage Agents service disabled and I'm not seeing the issue now.
Curtis Ballard
Honored Contributor

Re: MSL2024 1 Drive 1840 backup failure

If you don't mind operating with the storage agents disabled that is fine.

If you want to turn the storage agents back on, check whether installing the PSP caused Windows to create a new registry entry for that drive. Depending on what the PSP installs, Windows can create new registry entries during PSP installation.
CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

It's not ideal as I'd like the functionality that the storage agents provide.

I've attached the two reg locations.

There is only one entry for the actual tape drive but there is another entry for the SCSI HBA.
Curtis Ballard
Honored Contributor
Solution

Re: MSL2024 1 Drive 1840 backup failure

Thanks for posting the registry entries. That was very helpful.

That is an unusual configuration as there is an MSA1000 on the same controller as the tape drive. That isn't a recommended configuration and would probably be considered unsupported but it usually can be made to work.

Since the array is on the same controller as the tape drive it is likely that there are some additional delays happening somewhere.

I would start out by increasing the BusyRetryCount quite a bit higher. The 0xfa value was determined to be a good setting for a single tape drive on the controller, but having other devices on it could easily require a much higher value. I would recommend changing it to 0xffff.

All that parameter does is say how long to wait before giving up when a device is busy. Before Microsoft created the registry entry the wait time was infinite. That was the default when this card was first made. That worked fine most of the time but some devices could get stuck reporting Busy and cause a hang condition. To fix that a timeout was put into the lower layer drivers but whoever picked it chose a value that is frequently too low.
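
To get a feel for what the retry count means in time, here is a rough back-of-the-envelope sketch. It assumes that the driver pauses once before each retry and that the pause is 250 ms; neither assumption is confirmed in this thread, so treat the results as an upper bound for a device that stays stuck reporting Busy, not as measured behavior. The function name is made up for illustration.

```python
def worst_case_busy_wait_s(retry_count, pause_ms=250):
    # ASSUMPTION: one pause of pause_ms milliseconds per retry, nothing else.
    return retry_count * pause_ms / 1000.0


if __name__ == "__main__":
    for label, count in (("0xfa", 0xFA), ("0xffff", 0xFFFF)):
        seconds = worst_case_busy_wait_s(count)
        print("%s = %d retries -> about %.2f s (%.2f h)"
              % (label, count, seconds, seconds / 3600.0))
```

Under those assumptions 0xfa works out to about a minute and 0xffff to a few hours, which only matters if a device never stops reporting Busy.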

There are a couple of other registry entries that were created at the same time that you might try:

Value - BusyPauseTime
Type - DWORD
Data - 250 Decimal (default)
Range - number of milliseconds

If you change the pause time I would recommend trying 500 or 1000.

Value - QueueFullWaitIoPercentage
Type - DWORD
Data - 25 Decimal (default)
Range - 1 to 100 percentage of time

For tape, a value of 50 to 75 would be better, but be careful with an array attached: making this number too high could hurt performance on a heavily loaded array.

It is a real pain messing with these registry entries, especially in a production environment, but there are too many potential interactions to calculate precisely what you need. For the retry count, too high a value does nothing except cause a slightly longer delay in reporting errors on a fatal permanent busy condition (really rare). For the pause time, you can cause a few milliseconds of extra delay in detecting the end of a busy condition, which normally means nothing but can add up if the system is really heavily loaded and busy occurs frequently. The queue full parameter won't affect tape performance at all but can affect disk.
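
For anyone who wants to try these values on a test system, the sketch below renders a .reg fragment for review before anything is imported. It only builds text and does not touch the registry. It assumes that BusyPauseTime and QueueFullWaitIoPercentage belong under the same Storport key as BusyRetryCount, which this thread does not confirm, and it reuses the device key posted earlier, whose instance ID will differ on other machines. The specific numbers (0xffff, 500, 50) are picked from the ranges suggested above, and the function name is hypothetical.

```python
# Key as posted earlier in this thread; the instance ID is machine specific.
DEVICE_KEY = (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\SCSI"
              r"\Sequential&Ven_HP&Prod_Ultrium_4-SCSI\8&2f5f9346&0&000400"
              r"\Device Parameters\Storport")

# Values chosen from the suggestions in this thread.
# ASSUMPTION: all three live under the same Storport key.
SETTINGS = {
    "BusyRetryCount": 0xFFFF,
    "BusyPauseTime": 500,             # milliseconds
    "QueueFullWaitIoPercentage": 50,  # percent, 1 to 100
}


def render_reg(key, values):
    # Build the text of a .reg file with one key and DWORD values only.
    lines = ["Windows Registry Editor Version 5.00", "", "[%s]" % key]
    for name, number in values.items():
        if not 0 <= number <= 0xFFFFFFFF:
            raise ValueError("%s does not fit in a DWORD" % name)
        # DWORDs are written as exactly 8 lowercase hex digits.
        lines.append('"%s"=dword:%08x' % (name, number))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_reg(DEVICE_KEY, SETTINGS))
```

Printing the fragment first makes it easy to compare against an export of the current key before importing anything.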
CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

Thanks for all the information Curtis.

The MSA1000 is on an FC1242SR 4Gb PCI-e DC FC HBA.

The tape drive is part of the MSL2024 library which is attached via the SC11Xe which is parallel SCSI.

Perhaps there is something wrong with the registry keys.

The MSA1000 is due to be replaced with a spare MSA70 soon.

I'll have a look at making those changes you recommend. I have an exact hardware copy at another site that I can do testing on.
Curtis Ballard
Honored Contributor

Re: MSL2024 1 Drive 1840 backup failure

Interesting that the MSA is on a different HBA. Sorry about that. In the registry output everything was on the same "Bus" but I missed that Windows has a fourth qualifier for card which didn't show in the registry entries.

If you are willing to experiment a bit, hearing how it goes for you would be very helpful. I have requested quite a bit of testing of this specific configuration, trying to reproduce problems like the ones you have seen with the BusyRetryCount registry entry set to 250, but none of the lab tests have experienced any failures after setting that entry. You obviously have it set, so we might be able to learn something new.

Since you have a mirror system outside of production where you can run tests, I'll mention a software SCSI analyzer that HP uses. It has a client you can download and run to take low-level traces and possibly catch a SCSI bus trace of a failure. The tool is called busTRACE, and the busTRACE capture client on the following page can take traces that we can analyze back at the lab.

http://bustrace.com/downloads/free_utilities.php

If the failure happens at the physical level (HBA or on the wire), that tool won't capture it and we have to use a hardware analyzer, but frequently it captures everything we need.
CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

Ok I'll grab this tool.

Are there any specific instructions for using it?
CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

I've got the trace output but unfortunately the backup job didn't fail. Which is just typical.

I filtered on only the LSI adapter and MSL G3 and 1840 tape drive.
Curtis Ballard
Honored Contributor

Re: MSL2024 1 Drive 1840 backup failure

Thanks for the update that you have the tracing utility. It looks like you've figured out how to use it. I'll continue to monitor this thread for any updates. I'd like to figure this one out and appreciate your help.
CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

Had a failure last night. Going to re-run the backup with the util monitoring.
CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

Curtis, annoyingly when I ran the backup again it worked with no issues. So I presume there is no reason to need the trace file?

I'm going to reseat all the interface cards in the server as I have an MSA70 and P800 to replace the MSA1000 FC.

Hopefully this will cure any gremlins in the system.

CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

Curtis

I had another failure yesterday. It seems to work every other time I run a backup.

I haven't made any changes yet though.

I do have a server set up with Command View TL, which is configured to pull tickets and info from the MSL2024. I have this happening on two MSL4048 libraries though. Do you think this could be an issue?

CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

Are these extra reg entries supposed to be under the SCSI HBA or the actual tape drive? I notice there is an empty storport entry under the SCSI HBA keys.

CLEB
Valued Contributor

Re: MSL2024 1 Drive 1840 backup failure

I've had a failure twice in a row now.

I'm currently installing the latest library and drive FW using the May update bundle.

If I have issues after that I'm going to increase the timeout value from 250.