Tape Libraries and Drives

Backup problems with MSL2024

Ian Pennington_1
Occasional Advisor

Hi
I have a DL380 G5 8TB storage server. It has an SC11Xe HBA fitted, to which an MSL2024 library with an LTO-4 Ultrium 1840 SCSI drive is connected. All firmware and drivers are the latest versions.
The backup software is CA BrightStor 11.5 SP3. When running a backup, it fails with the CA software reporting an unknown hardware SCSI error, but the server and tape library show no hardware errors.
I rebuilt the system with the recovery DVD, installed CA BrightStor 12, and got the same error.
I replaced the server and library hardware completely with new units and got the same error with CA 11.5 and 12.
I rebuilt with the DVD, installed Backup Exec v12, and got the same error!!! However, the Backup Exec forum said that the HP Management Agents were known to cause this problem. I rebuilt the original hardware, disabled the HP Agents, and installed CA 11.5 SP3, and backups now run fine: 1.2TB in 4.5 hours.
So, how do I get round this, as I want to remotely monitor the server with the Agents?
Anyone seen this before?
29 REPLIES
Arend Lensen
Trusted Contributor

Re: Backup problems with MSL2024

Ian,

make sure you are using the latest firmware in the library controller. Does it happen at around the same time each time?
Is that perhaps 24 hours after a library reboot? The native fibre drives do not have such problems with the fibre agents as the SCSI variant with the NSR does, but it is still better to disable them.

Regards,
Arend
Ian Pennington_1
Occasional Advisor

Re: Backup problems with MSL2024

As I said, firmware/drivers are the latest versions. The failure occurs anywhere from 45GB to 250GB into the backup, totally at random. It has been replicated on two separate sets of hardware and with two different backup software packages. What is HP playing at!!
Arend Lensen
Trusted Contributor

Re: Backup problems with MSL2024

Ian,

Can you post an L&TT support ticket here, taken as soon as possible after a failure, please?
Many, many things can cause a backup to abort; it may have nothing to do with the hardware. You wrote that you used the SC11Xe adapter. That is worth emphasizing, as not all adapters are supported with all libraries.
Can you please have a look at the compatibility table at hp.com/go/ebs?

Regards,
Arend
Curtis Ballard
Honored Contributor

Re: Backup problems with MSL2024

There is one issue being worked on that sounds similar to this, which only seems to occur with Microsoft storport.sys drivers after 5.2.3790.3959.

For that issue, it's possible to work around the problem temporarily by loading an older version of storport.sys.

HP is working with all of the different companies involved to figure out where the issue is and a solution will be made available as soon as possible.
Curtis Ballard
Honored Contributor

Re: Backup problems with MSL2024

I've heard good results for working around this problem by changing a registry entry for the StorPort Driver. See http://support.microsoft.com/kb/932755 for details.

An excerpt from that document for the registry entry that needs to be changed:
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Enum\SCSI\\\DeviceParameters\Storport\
Value - BusyRetryCount
Type - DWORD
Data - 20 Decimal (default)
Range - number of retries

I don't know what the magic value is that works but 20 was probably just a guess for a good retry value. I would recommend you change it to 32 or 64.
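
For anyone who prefers to script the change, here is a minimal sketch that only generates the text of a .reg file for the entry described above. It does not touch the registry itself. The names `busy_retry_reg`, `device_id` and `instance_id` are made up for illustration; the two arguments stand in for the two path components that are missing from the excerpt, and you would have to look up the real ones for your own tape drive. The default of 32 is simply the lower of the two values suggested above, not a tested recommendation.

```python
def busy_retry_reg(device_id: str, instance_id: str, retries: int = 32) -> str:
    """Return .reg file text that sets BusyRetryCount for one SCSI device.

    device_id and instance_id are hypothetical placeholders for the two
    path components that are elided in the excerpt above.
    """
    if not 0 < retries <= 0xFFFFFFFF:
        raise ValueError("retries must fit in a positive DWORD")
    key = (
        "HKEY_LOCAL_MACHINE\\System\\CurrentControlSet\\Enum\\SCSI\\"
        f"{device_id}\\{instance_id}\\DeviceParameters\\Storport"
    )
    # .reg files store DWORD values as eight hexadecimal digits,
    # so 32 decimal is written as 00000020.
    return (
        "Windows Registry Editor Version 5.00\n\n"
        f"[{key}]\n"
        f'"BusyRetryCount"=dword:{retries:08x}\n'
    )


if __name__ == "__main__":
    # Placeholder identifiers only; substitute the values from your system.
    print(busy_retry_reg("EXAMPLE_DEVICE", "EXAMPLE_INSTANCE", 32))
```

Keeping the output as text means it can be reviewed before it is imported on the server.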
Steve Higgins_1
Occasional Advisor

Re: Backup problems with MSL2024

We have been experiencing the same exact problems with the following setup: DL320s G1 server, SC44Ge HBA connected to MSL2024 library with 2 SAS LTO-3 Ultrium 920 drives. All firmware is up to date. I just updated the Storport driver to 5.2.3790.4021, but don't see the option in the registry to change the BusyRetryCount. I also just updated the Insight agents from 7.80 to 8.10.
Hopefully this resolves the issue.
I don't want to disable the HP agents, that is not a valid workaround in my opinion.
Some other thoughts: we only see the issue when both tape drives are powered on. If I disable one, backups are successful. When both are enabled, it doesn't matter whether they are both actively being used; the backup will fail with a bus reset.

Crossing fingers for tonight's backups.
Curtis Ballard
Honored Contributor

Re: Backup problems with MSL2024

You have to create the BusyRetryCount registry entry. It isn't there by default.
Steve Higgins_1
Occasional Advisor

Re: Backup problems with MSL2024

I tried that and it did not work. Disabling the Insight agents also didn't help, though the drives themselves don't seem to be going offline for 2 minutes and causing the backup to fail completely. One backup will complete, while the other sits there and waits for a long time and finally starts backing up 10-15 minutes after the first one is done. There are still LSI_SAS controller errors in the system log every 5 minutes while this is happening.
Curtis Ballard
Honored Contributor

Re: Backup problems with MSL2024

Can you post the LSI SAS event binary data from the event log in Words format?
Steve Higgins_1
Occasional Advisor

Re: Backup problems with MSL2024

HP support recommended that I increase the BusyRetryCount to 75 and test again. I will post the results.

Here are the errors for LSI_SAS:

Event Type: Error
Event Source: Lsi_sas
Event Category: None
Event ID: 11
Date: 7/30/2008
Time: 11:52:40 PM
User: N/A
Computer: KSTLON0AP002
Description:
The driver detected a controller error on \Device\RaidPort1.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 0018000f 00680001 00000000 c004000b
0010: ad010030 00000000 00000000 00000000
0020: 00000000 00000000 00000000 00000000
0030: 00000000 c004000b 00000000 00000000

Then this error saying the drive went offline:

Event Type: Error
Event Source: Storage Agents
Event Category: Events
Event ID: 1223
Date: 7/30/2008
Time: 11:52:40 PM
User: N/A
Computer: KSTLON0AP002
Description:
SAS Tape Drive Status Change. The tape drive in Slot 1, Device 6 with serial number "HU10726KDG", has a new status of 3.
(Tape Drive status values: 1=other, 2=ok, 3=offline)
[SNMP TRAP: 5025 in CPQSCSI.MIB]
Data:
0000: 01c0009a 00000003 00000006 03010304
0010: 746f6c53 00003120 00000000 00000000
0020: 00000000 00000000 00000000 00000000
0030: 00000000 00000000 00000000 00000000
0040: 00000000 00000000 00000000 00000000
0050: 00000000 00000000 00000000 00000000
0060: 00000000 00000000 00000000 00000000
0070: 00000000 00000000 00000000 00000000
0080: 00000000 00000000 00000000 00000000
0090: 69766544 36206563 00000000 00000000
00a0: 00000000 00000000 00000000 00000000
00b0: 00000000 00000000 00000000 00000000
00c0: 00000000 00000000 00000000 00000000
00d0: 00000000 00000000 00000000 00000000
00e0: 20504800 52544c55 394d5549 44203032
00f0: 00005652 00000000 00000000 00000000
0100: 00000000 00000000 00000000 00000000
0110: 00000000 00000000 00000000 00000000
0120: 00000000 00000000 00000000 00000000
0130: 00000000 00000000 00000000 00000000
0140: 00000000 00000000 00000000 00000000
0150: 00000000 00000000 00000000 00000000
0160: 32430000 00005734 48000000 37303155
0170: 444b3632 00000047 00000000 00000000
0180: 00000000 00000000 00000000 00000000
0190: 00000000 00000000 00000000 00000000
01a0: 00000000 00000000 00000000 31303035
01b0: 30413031 38383030 41333239 00000000
01c0: 00000000 00000000 00000000 00000000
01d0: 00000003 00000006 00000000 00000000
01e0: 843504c7 0001000f 00000001 000000ff
01f0: 002c2f00 00000000 009a01c0 ffff000a
0200: 0000ffff 00000000 000c0000 0167008c
0210: 8080000b 80418051 00010001 00000002
0220: 00000004 00040000 00000004 00000000
0230: 00000000 00000000 00000000 00000000
0240: 00000002 00000000 0029cb80 00000000
0250: 00000008 00000000

Then 2 minutes later the drive is online again:

Event Type: Information
Event Source: Storage Agents
Event Category: Events
Event ID: 1223
Date: 7/30/2008
Time: 11:54:40 PM
User: N/A
Computer: KSTLON0AP002
Description:
SAS Tape Drive Status Change. The tape drive in Slot 1, Device 6 with serial number "HU10726KDG", has a new status of 2.
(Tape Drive status values: 1=other, 2=ok, 3=offline)
[SNMP TRAP: 5025 in CPQSCSI.MIB]
Data:
0000: 01c0009a 00000003 00000006 02020202
0010: 746f6c53 00003120 00000000 00000000
0020: 00000000 00000000 00000000 00000000
0030: 00000000 00000000 00000000 00000000
0040: 00000000 00000000 00000000 00000000
0050: 00000000 00000000 00000000 00000000
0060: 00000000 00000000 00000000 00000000
0070: 00000000 00000000 00000000 00000000
0080: 00000000 00000000 00000000 00000000
0090: 69766544 36206563 00000000 00000000
00a0: 00000000 00000000 00000000 00000000
00b0: 00000000 00000000 00000000 00000000
00c0: 00000000 00000000 00000000 00000000
00d0: 00000000 00000000 00000000 00000000
00e0: 20504800 52544c55 394d5549 44203032
00f0: 00005652 00000000 00000000 00000000
0100: 00000000 00000000 00000000 00000000
0110: 00000000 00000000 00000000 00000000
0120: 00000000 00000000 00000000 00000000
0130: 00000000 00000000 00000000 00000000
0140: 00000000 00000000 00000000 00000000
0150: 00000000 00000000 00000000 00000000
0160: 32430000 00005734 48000000 37303155
0170: 444b3632 00000047 00000000 00000000
0180: 00000000 00000000 00000000 00000000
0190: 00000000 00000000 00000000 00000000
01a0: 00000000 00000000 00000000 31303035
01b0: 30413031 38383030 41333239 00000000
01c0: 00000000 00010000 00000000 00000000
01d0: 00000003 00000006 00000000 00000000
01e0: 843504c7 0001000f 00000001 000000ff
01f0: 002c2f00 00000000 009a01c0 ffff000a
0200: 0000ffff 00000000 000c0000 0167008c
0210: 8080000b 80418051 00010001 00000002
0220: 00000004 00040000 00000004 00000000
0230: 00000000 00000000 00000000 00000000
0240: 00000002 00000000 0029cb80 00000000
0250: 00000008 00000000
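
The Storage Agents data above is easier to read once the text fields are decoded. Below is a rough sketch, assuming each 8-digit group in the dump is a 32-bit little-endian word, which is how the slot, device and drive model strings appear to be stored. The function names and the `SAMPLE` constant are made up for illustration; the sample lines are copied from the dump above.

```python
import re

# Four lines copied from the Storage Agents event data above.
SAMPLE = """
0010: 746f6c53 00003120 00000000 00000000
0090: 69766544 36206563 00000000 00000000
00e0: 20504800 52544c55 394d5549 44203032
00f0: 00005652 00000000 00000000 00000000
"""


def dump_to_bytes(dump: str) -> bytes:
    """Convert 'offset: word word ...' lines to raw bytes.

    Assumes each 8-digit group is a 32-bit little-endian word.
    The offsets are ignored; the words are simply concatenated.
    """
    out = bytearray()
    for line in dump.strip().splitlines():
        _, _, rest = line.partition(":")
        for word in rest.split():
            out += int(word, 16).to_bytes(4, "little")
    return bytes(out)


def ascii_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII at least min_len characters long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


if __name__ == "__main__":
    for text in ascii_strings(dump_to_bytes(SAMPLE)):
        print(text)
```

On the sample lines this gives "Slot 1", "Device 6" and "HP ULTRIUM920 DRV", which matches the slot and device named in the event description.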
Curtis Ballard
Honored Contributor

Re: Backup problems with MSL2024

Very interesting data in the LSI SAS event log entry. The data looks good, with a valid LSI header, but the error code doesn't look like any I have ever seen with this card. The error code is at offset 0x10 and in this data is 0xad010030, but LSI SAS errors should always start with 0x30, 0x31, or 0x32. I suspect that this may have been set at some higher level in the LSI driver, and I don't have the specs for all the layers. I'll have to dig a bit and see if I can come up with something.

Do all of the LSI SAS event entries have the same data starting at offset 0x10? If there are any 0x3...... values that would be useful.
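
If there are a lot of entries to go through, the check described here can be scripted. This is only a sketch of that one check: it pulls the word at offset 0x10 out of a dump in the format posted above and tests whether its leading byte is 0x30, 0x31 or 0x32. The function names and `SAMPLE` are made up for illustration, and the code does not attempt to say what any particular error code means.

```python
EXPECTED_PREFIXES = {0x30, 0x31, 0x32}  # leading bytes named in the post above

# Event ID 11 data as posted earlier in the thread.
SAMPLE = """
0000: 0018000f 00680001 00000000 c004000b
0010: ad010030 00000000 00000000 00000000
0020: 00000000 00000000 00000000 00000000
0030: 00000000 c004000b 00000000 00000000
"""


def parse_words(dump: str) -> dict[int, int]:
    """Map byte offset -> 32-bit word for an 'offset: word ...' dump."""
    words = {}
    for line in dump.strip().splitlines():
        offset_text, _, rest = line.partition(":")
        offset = int(offset_text, 16)
        for i, word in enumerate(rest.split()):
            words[offset + 4 * i] = int(word, 16)
    return words


def error_code(dump: str) -> int:
    """Return the word stored at offset 0x10 of the event data."""
    return parse_words(dump)[0x10]


def looks_like_lsi_code(code: int) -> bool:
    """True if the most significant byte is one of the expected prefixes."""
    return (code >> 24) in EXPECTED_PREFIXES


if __name__ == "__main__":
    code = error_code(SAMPLE)
    print(f"0x{code:08x}", looks_like_lsi_code(code))
```

For the sample above the code is 0xad010030, which fails the check, in line with the observation that it does not look like a normal LSI SAS error.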
Steve Higgins_1
Occasional Advisor

Re: Backup problems with MSL2024

They all seem to be similar. Here is the error when both tape drives went offline:

Event Type: Error
Event Source: Lsi_sas
Event Category: None
Event ID: 11
Date: 7/30/2008
Time: 4:24:39 PM
User: N/A
Computer: KSTLON0AP002
Description:
The driver detected a controller error on \Device\RaidPort1.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 0018000f 00680001 00000000 c004000b
0010: 31140000 00000000 00000000 00000000
0020: 00000000 00000000 00000000 00000000
0030: 00000000 c004000b 00000000 00000000
Steve Higgins_1
Occasional Advisor

Re: Backup problems with MSL2024

I just re-read your message, so I believe that is the error you were used to seeing. I seem to get both of those sometimes. They also appear with this error:

Event Type: Warning
Event Source: Lsi_sas
Event Category: None
Event ID: 129
Date: 7/28/2008
Time: 11:44:30 PM
User: N/A
Computer: KSTLON0AP002
Description:
Reset to device, \Device\RaidPort1, was issued.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 0010000f 00680001 00000000 80040081
0010: 00000004 00000000 00000000 00000000
0020: 00000000 00000000 00000000 00000000
0030: 00000600 80040081
Curtis Ballard
Honored Contributor

Re: Backup problems with MSL2024

Thanks for posting another event data dump. The second posting of Event ID 11 data shows that the HBA received an ABORT command from the operating system while a command to the device was outstanding.

That usually means a timeout but doesn't say what timed out. A timeout is consistent with the failure that is corrected by changing the retry count registry entry but doesn't have to be caused by that issue.
Steve Higgins_1
Occasional Advisor

Re: Backup problems with MSL2024

I had more of the same LSI_SAS errors and tape drives going offline again for 2 minutes and coming back online during the backup. For some reason, the backup didn't fail and seemed to keep writing even though the drive was reported offline. This is strange. I am not sure if this was just a fluke or not. I am still unable to perform 2 backups at once to utilize both tape drives at the same time. Is it possible there is still an issue with the controller or library?
Curtis Ballard
Honored Contributor

Re: Backup problems with MSL2024

Yes it's very possible that there is a problem with the controller or the software/driver stack. The two drives are entirely independent from the library perspective so I can't think of anything that could happen to cause problems with having both drives running at once. I've had a lot more than two drives running at once connected to that HBA.

Even from the HBA perspective the drives should be almost completely independent but they would share the same PCIe link and there is some shared hardware on the HBA.

I would start by making certain that the application is configured properly to use both drives at once. It sounds to me like there may be some hardware that isn't quite right somewhere in the system, though. If you haven't already, I would suggest you try a different SAS cable. If you have a standard SAS cable lying around with IB to mini-SAS ends, you can connect it to one drive and see if that solves your communication issues. I'm not convinced the cable is the issue, however. That cable is actually 4 cables bundled into one casing but physically completely independent, so when there are cable issues it is usually just one link that is bad, and switching to a different cable end at the library will make the problem go away.
mrwallis2
Occasional Visitor

Re: Backup problems with MSL2024

We have had very similar issues to the above, DL380 G5 also with the SC44Ge, 920 drives in an MSL2024. We get the LSI_SAS events 11 and 129 when the backup jobs fail.

Our system seems to run for longer with only a single drive when we switch one off (but still fails occasionally).

All the firmware, BIOS, drivers are up to date and we have had a drive and SAS card and cable swapped out by support. L&TT reports healthy drives. Still the problem continues, so it looks to be a driver or inherent hardware fault related to "high" loads.

Does anyone at HP have further updates on this? It seems like a serious issue for many people.
Curtis Ballard
Honored Contributor

Re: Backup problems with MSL2024

Thanks for posting the note about the issue you were seeing with the DL380G5 SC44Ge and the MSL 2024. As you note there have been several postings here of a similar nature and HP is aggressively working on figuring out what is causing the problems. I'll post more as soon as I learn more.
mrwallis2
Occasional Visitor

Re: Backup problems with MSL2024

Thank you Curtis, I would welcome any more information you can gather.
Ian Pennington_1
Occasional Advisor

Re: Backup problems with MSL2024

Just an update for all of you. We have had a difficult time with HP trying to resolve this. HP tried to blame the CA BrightStor software for a few months, but all testing proved there was an issue with the SCSI HW or drivers. We switched to fibre channel drives in the MSL2024, for testing, and all problems disappeared! The trouble is that HP have tried to shed all responsibility for this trouble and to date have yet to agree a way forward to finally fix this issue. My company is so disgusted with HP that I am now in talks with Dell to replace all 107 HP servers we use as well as the 1400 desktops.
Netics
Occasional Visitor

Re: Backup problems with MSL2024

Hi everyone!
I have found a solution for this problem with my tape library.
The problem was with the library device, which has 2 SCSI interfaces; on the free interface there was no Warp Tool inserted. I think this SC11Xe HBA does not support automatic termination, so the Warp Tool should be installed.
Curtis Ballard
Honored Contributor

Re: Backup problems with MSL2024

A problem has been identified with the SC44Ge and the MSL2024/4048 libraries with LTO-3 SAS drives in a very small number of configurations. We have been able to resolve the issue with a drive firmware change. New firmware was approved today for open escalations working with HP support and will be posted as soon as testing is complete. The revision is C25W. A full test cycle takes several weeks so the final release won't be for a few weeks yet.
Ian Pennington_1
Occasional Advisor

Re: Backup problems with MSL2024

Thanks but the previous reply does not cover my HW setup.
jacksbox
Advisor

Re: Backup problems with MSL2024

I don't know if it's too late, or if this is the advice you're looking for, but we experienced something similar with our MSL4048.

In the end, the solution was to stop and disable the HP WMI Storage Provider service in Windows.

Just my 2 cents.