StoreEver Tape Storage

Re: Backup problems with MSL2024

 
Steve Higgins_1
Occasional Advisor

Re: Backup problems with MSL2024

HP support recommended that I increase the BusyRetryCount to 75 and test again. I will post the results.

Here are the errors for LSI_SAS:

Event Type: Error
Event Source: Lsi_sas
Event Category: None
Event ID: 11
Date: 7/30/2008
Time: 11:52:40 PM
User: N/A
Computer: KSTLON0AP002
Description:
The driver detected a controller error on \Device\RaidPort1.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 0018000f 00680001 00000000 c004000b
0010: ad010030 00000000 00000000 00000000
0020: 00000000 00000000 00000000 00000000
0030: 00000000 c004000b 00000000 00000000

Then this error saying the drive went offline:

Event Type: Error
Event Source: Storage Agents
Event Category: Events
Event ID: 1223
Date: 7/30/2008
Time: 11:52:40 PM
User: N/A
Computer: KSTLON0AP002
Description:
SAS Tape Drive Status Change. The tape drive in Slot 1, Device 6 with serial number "HU10726KDG", has a new status of 3.
(Tape Drive status values: 1=other, 2=ok, 3=offline)
[SNMP TRAP: 5025 in CPQSCSI.MIB]
Data:
0000: 01c0009a 00000003 00000006 03010304
0010: 746f6c53 00003120 00000000 00000000
0020: 00000000 00000000 00000000 00000000
0030: 00000000 00000000 00000000 00000000
0040: 00000000 00000000 00000000 00000000
0050: 00000000 00000000 00000000 00000000
0060: 00000000 00000000 00000000 00000000
0070: 00000000 00000000 00000000 00000000
0080: 00000000 00000000 00000000 00000000
0090: 69766544 36206563 00000000 00000000
00a0: 00000000 00000000 00000000 00000000
00b0: 00000000 00000000 00000000 00000000
00c0: 00000000 00000000 00000000 00000000
00d0: 00000000 00000000 00000000 00000000
00e0: 20504800 52544c55 394d5549 44203032
00f0: 00005652 00000000 00000000 00000000
0100: 00000000 00000000 00000000 00000000
0110: 00000000 00000000 00000000 00000000
0120: 00000000 00000000 00000000 00000000
0130: 00000000 00000000 00000000 00000000
0140: 00000000 00000000 00000000 00000000
0150: 00000000 00000000 00000000 00000000
0160: 32430000 00005734 48000000 37303155
0170: 444b3632 00000047 00000000 00000000
0180: 00000000 00000000 00000000 00000000
0190: 00000000 00000000 00000000 00000000
01a0: 00000000 00000000 00000000 31303035
01b0: 30413031 38383030 41333239 00000000
01c0: 00000000 00000000 00000000 00000000
01d0: 00000003 00000006 00000000 00000000
01e0: 843504c7 0001000f 00000001 000000ff
01f0: 002c2f00 00000000 009a01c0 ffff000a
0200: 0000ffff 00000000 000c0000 0167008c
0210: 8080000b 80418051 00010001 00000002
0220: 00000004 00040000 00000004 00000000
0230: 00000000 00000000 00000000 00000000
0240: 00000002 00000000 0029cb80 00000000
0250: 00000008 00000000

Then 2 minutes later the drive is online again:

Event Type: Information
Event Source: Storage Agents
Event Category: Events
Event ID: 1223
Date: 7/30/2008
Time: 11:54:40 PM
User: N/A
Computer: KSTLON0AP002
Description:
SAS Tape Drive Status Change. The tape drive in Slot 1, Device 6 with serial number "HU10726KDG", has a new status of 2.
(Tape Drive status values: 1=other, 2=ok, 3=offline)
[SNMP TRAP: 5025 in CPQSCSI.MIB]
Data:
0000: 01c0009a 00000003 00000006 02020202
0010: 746f6c53 00003120 00000000 00000000
0020: 00000000 00000000 00000000 00000000
0030: 00000000 00000000 00000000 00000000
0040: 00000000 00000000 00000000 00000000
0050: 00000000 00000000 00000000 00000000
0060: 00000000 00000000 00000000 00000000
0070: 00000000 00000000 00000000 00000000
0080: 00000000 00000000 00000000 00000000
0090: 69766544 36206563 00000000 00000000
00a0: 00000000 00000000 00000000 00000000
00b0: 00000000 00000000 00000000 00000000
00c0: 00000000 00000000 00000000 00000000
00d0: 00000000 00000000 00000000 00000000
00e0: 20504800 52544c55 394d5549 44203032
00f0: 00005652 00000000 00000000 00000000
0100: 00000000 00000000 00000000 00000000
0110: 00000000 00000000 00000000 00000000
0120: 00000000 00000000 00000000 00000000
0130: 00000000 00000000 00000000 00000000
0140: 00000000 00000000 00000000 00000000
0150: 00000000 00000000 00000000 00000000
0160: 32430000 00005734 48000000 37303155
0170: 444b3632 00000047 00000000 00000000
0180: 00000000 00000000 00000000 00000000
0190: 00000000 00000000 00000000 00000000
01a0: 00000000 00000000 00000000 31303035
01b0: 30413031 38383030 41333239 00000000
01c0: 00000000 00010000 00000000 00000000
01d0: 00000003 00000006 00000000 00000000
01e0: 843504c7 0001000f 00000001 000000ff
01f0: 002c2f00 00000000 009a01c0 ffff000a
0200: 0000ffff 00000000 000c0000 0167008c
0210: 8080000b 80418051 00010001 00000002
0220: 00000004 00040000 00000004 00000000
0230: 00000000 00000000 00000000 00000000
0240: 00000002 00000000 0029cb80 00000000
0250: 00000008 00000000
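
The Storage Agents data above appears to hold plain ASCII text packed into 32-bit words. The following is a minimal Python sketch for pulling that text out, assuming the dump is the Event Viewer "Words" view, where each 8-digit group is a little-endian DWORD. The function names and the `SAMPLE` excerpt (a few non-empty rows copied from the event above) are illustrative only, not part of any HP or Microsoft tool.

```python
import re
import struct

def dump_to_bytes(dump: str) -> bytes:
    """Turn 'offset: dword dword ...' rows into the underlying byte sequence."""
    out = bytearray()
    for line in dump.strip().splitlines():
        _, _, words = line.partition(":")
        for word in words.split():
            # Assumption: each group is a 32-bit value stored little-endian.
            out += struct.pack("<I", int(word, 16))
    return bytes(out)

def ascii_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII that are at least min_len bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]

# Non-empty rows copied from the Event ID 1223 data above.
SAMPLE = """
0010: 746f6c53 00003120 00000000 00000000
0090: 69766544 36206563 00000000 00000000
00e0: 20504800 52544c55 394d5549 44203032
00f0: 00005652 00000000 00000000 00000000
0160: 32430000 00005734 48000000 37303155
0170: 444b3632 00000047 00000000 00000000
"""

if __name__ == "__main__":
    for text in ascii_strings(dump_to_bytes(SAMPLE)):
        print(text)
```

Among the recovered strings are the slot, device, and serial number that already appear in the event description, which suggests the byte-order assumption is reasonable for this dump.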
Curtis Ballard
Honored Contributor

Re: Backup problems with MSL2024

Very interesting data in the LSI SAS event log entry. The data looks good, with a valid LSI header, but the error code doesn't look like any I have ever seen with this card. The error code is at offset 0x10 and in this data is 0xad010030, but LSI SAS errors should always start with 0x30, 0x31, or 0x32. I suspect that this may have been set at some higher level in the LSI driver, and I don't have the specs for all the layers. I'll have to dig a bit and see if I can come up with something.

Do all of the LSI SAS event entries have the same data starting at offset 0x10? If there are any 0x3...... values that would be useful.
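
For anyone checking a batch of these entries, here is a rough Python sketch of that test. It assumes the dump is copied in the "offset: dword dword ..." form shown in this thread, and it reads "start with" as the most significant byte of the DWORD as displayed. Both are my assumptions, and the function names are made up for illustration.

```python
def dword_at(dump: str, offset: int) -> int:
    """Return the 32-bit value displayed at the given byte offset of a dump."""
    for line in dump.strip().splitlines():
        base, _, words = line.partition(":")
        values = [int(w, 16) for w in words.split()]
        index, remainder = divmod(offset - int(base, 16), 4)
        if remainder == 0 and 0 <= index < len(values):
            return values[index]
    raise ValueError(f"offset {offset:#x} not found in dump")

def has_expected_prefix(code: int) -> bool:
    """True if the top byte is 0x30, 0x31 or 0x32 (assumed reading of 'start with')."""
    return (code >> 24) in (0x30, 0x31, 0x32)

# First two rows of the Event ID 11 data from 11:52:40 PM, as posted in this thread.
EVENT_11 = """
0000: 0018000f 00680001 00000000 c004000b
0010: ad010030 00000000 00000000 00000000
"""

if __name__ == "__main__":
    code = dword_at(EVENT_11, 0x10)
    print(f"{code:#010x}", has_expected_prefix(code))
```

Running the same check over every Lsi_sas entry would quickly show whether any of them carry a 0x3... value.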
Steve Higgins_1
Occasional Advisor

Re: Backup problems with MSL2024

They all seem to be similar. Here is the error when both tape drives went offline:

Event Type: Error
Event Source: Lsi_sas
Event Category: None
Event ID: 11
Date: 7/30/2008
Time: 4:24:39 PM
User: N/A
Computer: KSTLON0AP002
Description:
The driver detected a controller error on \Device\RaidPort1.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 0018000f 00680001 00000000 c004000b
0010: 31140000 00000000 00000000 00000000
0020: 00000000 00000000 00000000 00000000
0030: 00000000 c004000b 00000000 00000000
Steve Higgins_1
Occasional Advisor

Re: Backup problems with MSL2024

I just re-read your message, and I believe the one I just posted is the error you are used to seeing. I seem to get both of those at different times. They also appear alongside this error:

Event Type: Warning
Event Source: Lsi_sas
Event Category: None
Event ID: 129
Date: 7/28/2008
Time: 11:44:30 PM
User: N/A
Computer: KSTLON0AP002
Description:
Reset to device, \Device\RaidPort1, was issued.

For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
Data:
0000: 0010000f 00680001 00000000 80040081
0010: 00000004 00000000 00000000 00000000
0020: 00000000 00000000 00000000 00000000
0030: 00000600 80040081
Curtis Ballard
Honored Contributor

Re: Backup problems with MSL2024

Thanks for posting another event data dump. The second posting of Event ID 11 data shows that the HBA received an ABORT command from the operating system while a command to the device was outstanding.

That usually means a timeout but doesn't say what timed out. A timeout is consistent with the failure that is corrected by changing the retry count registry entry but doesn't have to be caused by that issue.
Steve Higgins_1
Occasional Advisor

Re: Backup problems with MSL2024

I had more of the same LSI_SAS errors, with the tape drives going offline for 2 minutes and then coming back online during the backup. For some reason, the backup didn't fail and seemed to keep writing even though the drive was reported offline. This is strange, and I am not sure whether it was just a fluke. I am still unable to run 2 backups at once to use both tape drives at the same time. Is it possible there is still an issue with the controller or library?
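
To measure how long each outage lasts, the Storage Agents status changes can be paired up. Below is a small Python sketch that assumes the Date and Time fields have been copied by hand from the Event ID 1223 entries (status 3 = offline, 2 = ok) and that the event log uses US-style timestamps. The helper names are hypothetical.

```python
from datetime import datetime, timedelta

def parse_stamp(date: str, time: str) -> datetime:
    """Parse the Date and Time fields as they appear in the event text."""
    return datetime.strptime(f"{date} {time}", "%m/%d/%Y %I:%M:%S %p")

def offline_durations(events: list[tuple[datetime, int]]) -> list[timedelta]:
    """Pair each offline event (status 3) with the next ok event (status 2)."""
    durations = []
    went_offline = None
    for stamp, status in sorted(events):
        if status == 3 and went_offline is None:
            went_offline = stamp
        elif status == 2 and went_offline is not None:
            durations.append(stamp - went_offline)
            went_offline = None
    return durations

# Values taken from the two Event ID 1223 entries posted earlier in this thread.
EVENTS = [
    (parse_stamp("7/30/2008", "11:52:40 PM"), 3),
    (parse_stamp("7/30/2008", "11:54:40 PM"), 2),
]

if __name__ == "__main__":
    for gap in offline_durations(EVENTS):
        print(gap.total_seconds(), "seconds offline")
```

If every outage comes out at the same length, that would point toward a fixed timeout somewhere rather than a random hardware glitch, though that is only a guess on my part.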
Curtis Ballard
Honored Contributor

Re: Backup problems with MSL2024

Yes, it's very possible that there is a problem with the controller or the software/driver stack. The two drives are entirely independent from the library's perspective, so I can't think of anything in the library that would cause problems with running both drives at once. I've had a lot more than two drives running at once connected to that HBA.

Even from the HBA's perspective the drives should be almost completely independent, although they share the same PCIe link and there is some shared hardware on the HBA.

I would start by making certain that the application is configured properly to use both drives at once. It sounds to me, though, like some hardware somewhere in the system isn't quite right. If you haven't already, I would suggest trying a different SAS cable. If you have a standard SAS cable lying around with IB to mini-SAS ends, you can connect it to one drive and see if that solves your communication issues. I'm not convinced the cable is the issue, however. That cable is actually four cables bundled into one casing but physically completely independent, so when there are cable issues it is usually just one link that is bad, and switching to a different cable end at the library will make the problem go away.
mrwallis2
New Member

Re: Backup problems with MSL2024

We have had very similar issues to the above: also a DL380 G5 with the SC44Ge and 920 drives in an MSL2024. We get the LSI_SAS events 11 and 129 when the backup jobs fail.

Our system seems to run for longer with only a single drive when we switch one off (but it still fails occasionally).

All the firmware, BIOS, and drivers are up to date, and we have had a drive, the SAS card, and the cable swapped out by support. L&TT reports healthy drives. Still the problem continues, so it looks to be a driver or inherent hardware fault related to "high" loads.

Does anyone at HP have further updates on this? It seems like a serious issue for many people.
Curtis Ballard
Honored Contributor

Re: Backup problems with MSL2024

Thanks for posting the note about the issue you are seeing with the DL380 G5, SC44Ge, and MSL2024. As you note, there have been several postings here of a similar nature, and HP is aggressively working on figuring out what is causing the problems. I'll post more as soon as I learn more.
mrwallis2
New Member

Re: Backup problems with MSL2024

Thank you, Curtis. I would welcome any more information you can gather.