Error 1040 (again) in Dataprotector Express 3.50SP2 / Linux

 
AdmiralThrawn
Occasional Contributor

Hello!

I, too, am running into Error 1040 when trying to run any kind of tape-related job. Here is the configuration:

 

CentOS Linux 5.8 (RHEL5), kernel 2.6.18-348.6.1.el5PAE

HP Ultrium 1840 LTO-4 (firmware B63D) on parallel SCSI, attached to an LSI 53C1030 running the latest firmware and BIOS

 

I tested the LSI MPT/Fusion controller driver 4.00.43.00 as well as the stock 3.04.15rh. I tried port/bus/target resets using "lsiutil" and tried limiting the drive speed to U160 instead of U320. The bus is properly terminated: an LVD/SE terminator sits on the cable right after the drive, and no other drives are attached to that SCSI controller. The cable between HBA and drive is 30cm long and is proper twisted-pair LVD/SE. I also tried swapping controller AND cable just for testing, to no avail.

 

The software is HP DataProtector Express SSE 3.50 SP2, build #56936 (the license is not valid for >=4.0).

 

After a reboot, the drive is no longer properly recognized. It now shows "Element Status: Unknown" in the devices section of DataProtector Express. The backup software can apparently still read the firmware version and status of the drive, but no backup, identify, eject or any other job will run. They all terminate immediately after starting with the following error:

 

Error 1040: No devices specified or all devices are now offline

 

I double-checked that the drive is selected for all those jobs in DataProtector Express, and it is.

 

What I have tried so far: using older kernels; removing the "st" kernel driver in case it interferes with how DataProtector Express accesses the drive via "sg"; running L&TT, the HP diagnostics tool; and running the regular Linux tape tools ("mt" and "tar") on the drive.
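Since DataProtector Express reaches the drive through "sg", one more check that leaves the backup software out of the picture is whether the kernel's sg layer itself lists the drive as online. Below is a minimal sketch, not a definitive tool. It assumes the sg driver exposes /proc/scsi/sg/devices with nine whitespace-separated numeric columns (host, chan, id, lun, type, opens, qdepth, busy, online), which is the layout I know from the Linux sg driver documentation and which should be verified on the 2.6.18 kernel. The function names are made up for illustration, and the code is Python 3, so the file contents may need to be copied to a newer machine for parsing.

```python
# Column order is an assumption taken from the sg driver's
# /proc/scsi/sg/device_hdr; verify it on the host in question.
FIELDS = ("host", "chan", "id", "lun", "type", "opens", "qdepth", "busy", "online")
TAPE_TYPE = 1  # SCSI peripheral device type 1 = sequential access (tape)


def parse_sg_devices(text):
    """Parse the contents of /proc/scsi/sg/devices.

    Line N of the file describes /dev/sgN, so the line index is kept
    in the "sg" key of each returned dict.
    """
    devices = []
    for index, line in enumerate(text.splitlines()):
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # blank or unexpected line; the index still advances
        entry = dict(zip(FIELDS, (int(p) for p in parts)))
        entry["sg"] = index
        devices.append(entry)
    return devices


def tape_devices(devices):
    """Keep only the entries whose SCSI device type is tape."""
    return [d for d in devices if d["type"] == TAPE_TYPE]


def read_sg_devices(path="/proc/scsi/sg/devices"):
    """Read and parse the proc file (path is the assumed default)."""
    with open(path) as handle:
        return parse_sg_devices(handle.read())
```

Calling tape_devices(read_sg_devices()) on the affected host and looking at the "online" value of the tape entry would show which side reports the problem. If the kernel says 1 (online) while DataProtector Express still raises Error 1040, that would point more toward the application's own device handling than toward the SCSI layer.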

 

L&TT reports no problems: the buffer test is OK with many patterns and 10,000 iterations over the entire cache, and the read+write+compare test runs fine. When the "st" driver is loaded, mt can control and erase the drive, and I tested a simple tar backup, restore AND diff on the drive with 2GB of data. All of it works fine (stuff like "mt -f /dev/st0 erase" or "tar -czf /dev/st0 /home" and "tar -xzf /dev/st0").

 

I will attach the diagnostics report of DataProtector Express (HTML) and the support ticket of L&TT in both LZT and XML formats. You will need to rename the files, removing the TXT suffixes.

 

Oh, and I have also tried completely removing DataProtector Express, including its entire catalog, and did a fresh install afterwards. That didn't help; the same problem is still there. I need to diagnose this properly and find the culprit. It kind of looks like a drive failure, as there are also some Domain Validation problems (only with the older controller driver, though) and some stuff like:

 

kernel: INFO: task scsi_id:5051 blocked for more than 120 seconds.

 

or:

 

kernel: INFO: task mt:31372 blocked for more than 120 seconds.

 

 

or:

 

kernel: st0: Current: sense key: Illegal Request
kernel: Add. Sense: Invalid field in cdb

 

or, with the older LSI controller driver that ships with the Red Hat kernel, the following, which appears when playing around with the sg/st drivers and an lsiutil port reset:

 

kernel: mptbase: ioc0: Initiating recovery
kernel: target8:0:3: Beginning Domain Validation
kernel: target8:0:3: Domain Validation Initial Inquiry Failed
kernel: target8:0:3: Ending Domain Validation
kernel: target8:0:3: asynchronous
kernel: mptbase: ioc0: Initiating recovery
kernel: target8:0:3: Beginning Domain Validation
kernel: target8:0:3: Ending Domain Validation
kernel: target8:0:3: FAST-160 WIDE SCSI 320.0 MB/s DT IU HMCS (6.25 ns, offset 64)
kernel: st: Unloaded.
kernel: sd 0:0:0:0: Attached scsi generic sg0 type 0
kernel: scsi 0:1:0:0: Attached scsi generic sg1 type 0
kernel: scsi 0:1:1:0: Attached scsi generic sg2 type 0
kernel: scsi 0:1:2:0: Attached scsi generic sg3 type 0
kernel: scsi 0:1:3:0: Attached scsi generic sg4 type 0
kernel: scsi 0:1:4:0: Attached scsi generic sg5 type 0
kernel: scsi 0:1:5:0: Attached scsi generic sg6 type 0
kernel: scsi 0:1:6:0: Attached scsi generic sg7 type 0
kernel: scsi 0:1:7:0: Attached scsi generic sg8 type 0
kernel: scsi 8:0:3:0: Attached scsi generic sg9 type 1
kernel: mptscsih: ioc0: attempting task abort! (sc=c02f8c00)
kernel: scsi 8:0:3:0:
kernel: command: Inquiry: 12 01 00 00 fe 00
kernel: mptscsih: ioc0: task abort: FAILED (rv=2003) (sc=c02f8c00)
kernel: mptscsih: ioc0: attempting target reset! (sc=c02f8c00)
kernel: scsi 8:0:3:0:
kernel: command: Inquiry: 12 01 00 00 fe 00
kernel: mptscsih: ioc0: WARNING - Issuing Reset from mptscsih_IssueTaskMgmt!! doorbell=0x24000000
kernel: mptscsih: ioc0: target reset: SUCCESS (sc=c02f8c00)
kernel: scsi 8:0:3:0: timing out command, waited 5s
kernel: target8:0:3: Beginning Domain Validation
kernel: target8:0:3: Ending Domain Validation
kernel: target8:0:3: FAST-160 WIDE SCSI 320.0 MB/s DT IU HMCS (6.25 ns, offset 64)
kernel: mptscsih: ioc0: attempting task abort! (sc=c08780c0)
kernel: scsi 8:0:3:0:
kernel: command: Inquiry: 12 01 00 00 fe 00
kernel: mptscsih: ioc0: task abort: FAILED (rv=2003) (sc=c08780c0)
kernel: mptscsih: ioc0: attempting target reset! (sc=c08780c0)

 

...and so on.
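To get an overview of how often each of these symptoms occurs, the kernel messages can be tallied instead of read by eye. The following is a rough sketch only: the regular expressions are modelled on the exact lines quoted above, the name summarize is made up, and other driver versions may word their messages differently. It is Python 3, so it is meant to be run over a copy of the log on a newer machine.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this post (an assumption:
# other mptscsih versions may format these messages differently).
RECOVERY = re.compile(
    r"mptscsih: ioc\d+: (task abort|target reset): (SUCCESS|FAILED)"
)
HUNG_TASK = re.compile(r"INFO: task (\S+):\d+ blocked for more than \d+ seconds")
DV_FAILED = re.compile(r"target\d+:\d+:\d+: Domain Validation .*Failed")


def summarize(lines):
    """Count recovery outcomes, hung tasks and Domain Validation failures."""
    counts = Counter()
    for line in lines:
        match = RECOVERY.search(line)
        if match:
            counts[f"{match.group(1)} {match.group(2)}"] += 1
            continue
        match = HUNG_TASK.search(line)
        if match:
            counts[f"hung task {match.group(1)}"] += 1
            continue
        if DV_FAILED.search(line):
            counts["domain validation failed"] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python3 summarize.py < /var/log/messages
    for event, count in sorted(summarize(sys.stdin).items()):
        print(f"{count:6d}  {event}")
```

A log dominated by failed task aborts followed by successful target resets, all on Inquiry commands, would at least narrow down which commands the drive stops answering, even though it cannot say by itself whether the drive or the software is at fault.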

 

However, tar and L&TT still complete all their tasks, while HP DataProtector Express can't use the drive at all anymore.

 

What my boss wants to know is whether the drive is actually broken, or whether it's just the software being bitchy. So far I have not reached a definite conclusion. I could still try to rip out the entire SCSI subsystem (PCI-X) plus the drive and try it in another machine, but that's a lot of work.

 

Could anyone here throw me a bone on this issue? I'm running out of ideas here fast.

AdmiralThrawn
Occasional Contributor

Re: Error 1040 (again) in Dataprotector Express 3.50SP2 / Linux

I have now tried the drive, including its SCSI controller and cable, in my local CentOS 6.3 workstation, and there it works just fine.

 

HAS to be something with the software...

 

Nobody? :(