Re: LTO2 and 5 drives with PMR "wedged" - can't unset and can't eject

Alan Brown · ‎02-02-2011

The software I'm using is Bacula, but it only uses the standard Linux st lock/unlock calls to set/unset PMR.

Users of other backup packages have reported similar problems.

The OS is RHEL5, but this has manifested for me on RHEL4 and SUSE 8/9 too. Others have reported similar events on a wide variety of other OSes including BSD and Windows.

The tape drives are FC-attached via QLogic SANbox switches and QLogic FC controllers, but others have reported the same issue on SAS and LC-SCSI installations.

I've had this happen on HP LTO2 and LTO5 drives, but this has been reported across a wide range of hardware from various manufacturers.

Every so often (about every 5-50 loads), a drive seems to wedge PMR on and won't unset it.

Repeated attempts to unlock the drive using mt, sg_prevent or L&TT don't work.

Rebooting the server doesn't help, nor does reloading the FC or SCSI driver modules, disconnecting/reconnecting FC, sending a SCSI reset or power cycling the FC switches.

The only way out seems to be to power cycle the tape drive.

That's fine for a standalone tape drive, but it's a disaster if the drive is in a changer, because the changer needs power cycling too and that affects backups running on other drives. I've had this happen repeatedly in HP MSL6060 and Overland Neo8000 robots.

I've repeatedly reported this in the past to HP support and been blown off as the only person who's ever reported it, etc. Ditto Overland.

I've attached a L&TT report from a currently wedged drive.

Here's what happens at command level.

# mt -f /dev/nst15 status
SCSI 2 tape drive:
File number=-1, block number=-1, partition=0.
Tape block size 0 bytes. Density code 0x58 (no translation).
Soft error count since last status=0
General status bits on (1010000):
ONLINE IM_REP_EN

# mt -f /dev/nst15 unlock

# mt -f /dev/nst15 eject
/dev/nst15: Input/output error

# sg_prevent -a /dev/nst15

# mt -f /dev/nst15 eject
/dev/nst15: Input/output error

# mt -f /dev/nst15 offline
/dev/nst15: Input/output error

# sg_prevent -vv /dev/nst15
open /dev/nst15 with flags=0x802
Prevent allow medium removal cdb: 1e 00 00 00 01 00

# sg_prevent -vv -a /dev/nst15
open /dev/nst15 with flags=0x802

# mt -f /dev/nst15 status
SCSI 2 tape drive:
File number=-1, block number=-1, partition=0.
Tape block size 0 bytes. Density code 0x58 (no translation).
Soft error count since last status=0
General status bits on (1010000):
ONLINE IM_REP_EN

# mt -f /dev/nst15 eject
/dev/nst15: Input/output error

Does anyone have any ideas?

Marino Meloni_1 · ‎02-02-2011

The purpose of the SCSI "reserve" command is to prevent interference during SCSI operations.
The reserve status can only be reset ("release") by the host that set it.
It could be that, due to an I/O error, the connection between the host and the device broke.
At that point the "reserve" flag can no longer be removed.
A power cycle of the drive is usually used to "release" the status of the drive.

Alan Brown · ‎02-02-2011

This is _NOT_ a scsi reservation problem.

If it were, all communications to the drive would be rejected, not just the eject/offline commands.

Curtis Ballard · ‎02-02-2011

Most of the time this behavior is traced to the tape drive being on a Fibre Channel SAN with a Windows system somewhere on the same SAN. Do you have any Windows systems on your SAN? If you do, and they can see the library, you need to either zone your Fibre Channel switch so those systems are locked out or make certain that Removable Storage Manager is disabled on the Windows systems.

The "PREVENT" function is defined on a per-host basis. If you send the "ALLOW" from your host but the drive still reports that unloading is prevented then that means that some other host sent a prevent. Unfortunately L&TT isn't able to clear a prevent from another host either as it runs into the same problem - only the host that set it can clear it.

Alan Brown · ‎02-02-2011

There's zero MS on the fabric, and the fabric zoning is set up so that only one host can see the drives (and vice versa).

The unlock command is being sent from the same host which sent the lock command.

This is what the tape drive says:

# tapeinfo -f /dev/nst15
Product Type: Tape Drive
Vendor ID: 'HP '
Product ID: 'Ultrium 5-SCSI '
Revision: 'I33H'
Attached Changer: No
SerialNumber: 'HU1036C7AR'
TapeAlert[10]: No Removal: Cannot unload, initiator is preventing media removal.
MinBlock:1
MaxBlock:16777215
SCSI ID: 8
SCSI LUN: 0
Ready: yes
BufferedMode: yes
Medium Type: Not Loaded
Density Code: 0x58
BlockSize: 0
DataCompEnabled: yes
DataCompCapable: yes
DataDeCompEnabled: yes
CompType: 0x1
DeCompType: 0x1
BOP: yes
Block Position: 0
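
A wedged drive can be spotted from a script by looking for TapeAlert flag 10 in tapeinfo output like the listing above. A minimal sketch, assuming the "TapeAlert[10]:" line format shown here; the function name removal_prevented is made up for illustration:

```shell
# Reads tapeinfo output on stdin; succeeds (exit 0) if TapeAlert flag 10
# ("No Removal") is reported, fails otherwise.
# Assumes the "TapeAlert[10]:" line format seen in the listing above.
removal_prevented() {
    grep -q '^TapeAlert\[10\]:'
}

# Example use (the device path is site-specific):
#   tapeinfo -f /dev/nst15 | removal_prevented && echo "PMR still set"
```

This only reports what the drive says; it does not tell you which initiator set the prevent.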

Alan Brown · ‎02-02-2011

Hmmmm...... You were close with saying the lock is per host.

It's not.

It's Per INITIATOR - so on hosts with dual FC connections you can probably guess what can happen.

Curtis Ballard · ‎02-09-2011

Good clarification Alan. Yes, prevent/allow settings are per initiator/target connection, so a single host can have multiple connections. It can send a prevent over one port and then, for some reason, switch to a different port, but the allow has to come from the same port as the prevent.

I reviewed the log that was attached to the first posting. That log shows four different connections to the drive:

||__ WW Node Name | WW Port Name
||__ 10:00:00:c0:dd:0d:94:cd | 2f:fc:00:c0:dd:0d:94:cd
||__ 20:01:00:1b:32:b1:63:8b | 21:01:00:1b:32:b1:63:8b
||__ 20:00:00:1b:32:91:63:8b | 21:00:00:1b:32:91:63:8b
||__ 10:00:00:e0:02:02:64:d5 | 10:00:00:e0:02:22:64:d5

All of the activity in the recent command history is from:
20:00:00:1b:32:91:63:8b | 21:00:00:1b:32:91:63:8b

However, the full history since the drive was turned on shows that two commands setting PREVENT were received from initiator 1, which is:
20:01:00:1b:32:b1:63:8b | 21:01:00:1b:32:b1:63:8b

Those PREVENT settings were never cleared so they are still in effect.

I have attached a file showing the full PREVENT/ALLOW history seen by the drive for this power cycle. You can see two instances of PAMR (Prevent Allow Medium Removal) set to 1 by initiator 1 at the bottom of the list of events. There are no cases where initiator 1 attempts to set PAMR back to 0.

The SCSI standards require the drive to logically OR the PAMR settings of all initiators: if any one of them is set to 1, media removal is prevented. In this case media removal is being prevented by the host port with the WWN shown above.
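
To make the OR rule concrete, here is a toy model of it. This is only an illustration, not drive firmware: each argument stands for one initiator's PAMR flag, and the function name is made up.

```shell
# Succeeds (exit 0) if media removal is prevented, i.e. if any
# initiator's PAMR flag is 1. Each argument is one initiator's flag.
media_removal_prevented() {
    for flag in "$@"; do
        if [ "$flag" = "1" ]; then
            return 0
        fi
    done
    return 1
}

# Initiator 0 has sent ALLOW (0), initiator 1 never cleared its PREVENT (1):
#   media_removal_prevented 0 1 && echo "eject refused"
```

So an ALLOW from one initiator changes nothing as long as another initiator's flag is still 1.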

Alan Brown · ‎02-10-2011

Thanks for the explanation. This should be a FAQ answer somewhere... :)

In my case this means that to ensure ejects happen I'll have to either send drive unlocks to ALL nst instances of the same tape drive or prevent drive locks being sent at all.

One is a matter of hacking up some scripting while the other involves a bit more messing around with the Bacula storage-daemon.

Either way will solve the issue.
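
The scripting option could look roughly like this. It is a sketch under assumptions: you already know which nst nodes map to the same physical drive (one per FC path), and sg_prevent -a is used for the unlock as earlier in the thread. UNLOCK_CMD is a made-up override hook for dry runs, and /dev/nst31 below is a hypothetical second path.

```shell
# Sends ALLOW MEDIUM REMOVAL down every device path given as an argument.
# Returns non-zero if the unlock command failed on any path.
# UNLOCK_CMD is a hypothetical override (e.g. UNLOCK_CMD=echo for a dry run);
# the default is "sg_prevent -a", as used earlier in this thread.
unlock_all_paths() {
    rc=0
    for dev in "$@"; do
        # Unquoted on purpose so the value splits into command + options.
        ${UNLOCK_CMD:-sg_prevent -a} "$dev" || rc=1
    done
    return $rc
}

# Example (device names are site-specific and must be the same drive):
#   unlock_all_paths /dev/nst15 /dev/nst31
```

Running it just before the eject means the allow goes out over each path, whichever port the earlier prevent was sent from.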
