Operating System - OpenVMS

Ziggy Filek
Frequent Advisor

Fiber Channel tape drive doesn't work on one node of a cluster only

I have a 3-node cluster running VMS 7.3-2. One of my Fibre Channel-connected tape drives started giving parity errors on all 3 nodes (A, B and C), and eventually gave the following on node A only:
HNAA..SYSTEM> mount/foreign $2$mga3
%MOUNT-F-DRVERR, fatal drive error
HNAA..SYSTEM>
The drive was replaced inside the library, the library was power-cycled, and the following was attempted:
HNAA..SYSTEM> mcr sysman
SYSMAN> set envir/node=(HNAA,HNAB,HNAC)
%SYSMAN-I-ENV, current command environment:
Individual nodes: HNAA,HNAB,HNAC
Username SYSTEM will be used on nonlocal nodes

SYSMAN> io replace_wwid $2$MGA3
%SYSMAN-I-NODERR, error returned from node HNAA
-SYSTEM-F-DEVACTIVE, device is active
SYSMAN> exit
HNAA..SYSTEM>
I confirmed that the command worked OK on B and C, that SYS$COMMON:SYS$DEVICES.DAT has been properly updated, and that the new drive $2$MGA3 works just fine on nodes B and C. However, on node A it still gives:
HNAA..SYSTEM> mount/foreign $2$mga3
%MOUNT-F-DRVERR, fatal drive error
How can I fix this without rebooting A? Thank you for any insights. Node A is an extremely important production node and cannot easily be rebooted.
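
For anyone scripting this check, here is a minimal sketch (in Python, run against captured text, not on the VMS node itself) of how the failing node could be picked out of the SYSMAN output above. It assumes the message layout is exactly as in the transcript: a %SYSMAN-I-NODERR line followed by a secondary status line. The function name `failed_nodes` is made up for illustration.

```python
import re

# A NODERR line followed by its secondary status line, as in the transcript above.
NODERR_RE = re.compile(
    r"%SYSMAN-I-NODERR, error returned from node (\S+)\s*\n\s*-(\S+?), (.+)"
)

def failed_nodes(sysman_output):
    """Map each node that returned an error to (status code, message text)."""
    return {
        node: (code, text.strip())
        for node, code, text in NODERR_RE.findall(sysman_output)
    }

# Sample text copied from the SYSMAN session in this post.
sample = """SYSMAN> io replace_wwid $2$MGA3
%SYSMAN-I-NODERR, error returned from node HNAA
-SYSTEM-F-DEVACTIVE, device is active
SYSMAN> exit
"""

print(failed_nodes(sample))
```

Nodes that do not appear in the result (here HNAB and HNAC) returned no error for the command.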
The Brit
Honored Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Ziggy,
What do you get if you do a "show dev/full" on the tape device? (Just making sure that the device is not locked up by some old process, although I know you probably checked that already.)

Dave
Thomas Ritter
Respected Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Following on from The Brit... execute a $show device $2$mga3 on each node and post the results.

Ziggy Filek
Frequent Advisor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Here's show dev/full $2$mga3 on both the "bad" and the "good" nodes:
HNAA..ZFILEK> sho dev /full $2$mga3

Magtape $2$MGA3: (HNAA), device type HP Ultrium 2-SCSI, is online, record-
oriented device, file-oriented device, available to cluster, device has
multiple I/O paths, error logging is enabled, controller supports compaction
(compaction disabled), device supports fastskip (per_io).

Error count 0 Operations completed 15744460
Owner process "" Owner UIC [SYSTEM]
Owner process ID 00000000 Dev Prot S:RWPL,O:RWPL,G:R,W
Reference count 0 Default buffer size 512
WWID 02000008:5006-0B00-0033-3170
Density default Format Normal-11
Allocation class 2

Volume status: no-unload on dismount, beginning-of-tape, odd parity.

I/O paths to device 2
Path PGB0.1000-00E0-0222-F492 (HNAA), primary path, current path.
Error count 0 Operations completed 15744382
Path PGB0.1000-00E0-0222-F492 (HNAA).
Error count 0 Operations completed 78

HNAA..ZFILEK>
And the good:
HNAC..SYSTEM> sh dev /full $2$mga3:

Magtape $2$MGA3: (HNAC), device type HP Ultrium 2-SCSI, is online, record-
oriented device, file-oriented device, available to cluster, error logging
is enabled, controller supports compaction (compaction disabled), device
supports fastskip (per_io).

Error count 0 Operations completed 105770463
Owner process "" Owner UIC [SYSTEM]
Owner process ID 00000000 Dev Prot S:RWPL,O:RWPL,G:R,W
Reference count 0 Default buffer size 512
WWID 02000008:5006-0B00-0029-1683
Density default Format Normal-11
Allocation class 2

Volume status: no-unload on dismount, beginning-of-tape, odd parity.

HNAC..SYSTEM>
Joanne Korol_1
Occasional Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

I'm not an expert, but did you perform:

MCR SYSMAN
SYSMAN> IO AUTO

Are you connected to an MDR? Has that been power-cycled?

I was told by an FE that it's best to pull the power plug on the tape drive if you want a complete initialization. Some errors/problems don't clear by turning the tape drive off and on using the power switch.

We had a similar problem, and it ended up being a GBIC on one of the switches.

The Brit
Honored Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Ziggy,
The only thing that stands out to me is that on HNAA the device shows up as "multipathed", whereas on HNAC it doesn't. (Are you a Cerner client, by any chance?) Anyway, I would look into that, unless it is how you would expect it to look based on your configuration, or how it looked before the drive was replaced.

dave
The Brit
Honored Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Ziggy,
Here is a more likely cause for your problem. Looking at your listing of the "show dev /full" from the two different nodes, if both cluster members are looking at the same device, then I'm pretty sure that they should both display the same WWID.
In your case they don't! I suspect that the WWID displayed on the "working" nodes is the correct one. What you need to do is to get the A system to look at the correct WWID.
Look at SYSMAN HELP for "IO REPLACE_WWID"
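
As a rough illustration of the comparison described above, here is a minimal Python sketch that pulls the WWID line out of captured "show dev /full" text for each node and reports any disagreement. The sample text is copied from the listings posted earlier in the thread. The function names `extract_wwid` and `wwid_mismatches` are made up for illustration, and how you capture the output to text is left to you.

```python
import re

# Matches the "WWID ..." line as printed in the SHOW DEVICE/FULL listings above.
WWID_RE = re.compile(r"^\s*WWID\s+(\S+)", re.MULTILINE)

def extract_wwid(show_dev_output):
    """Return the WWID string from one listing, or None if no WWID line exists."""
    match = WWID_RE.search(show_dev_output)
    return match.group(1) if match else None

def wwid_mismatches(listings):
    """listings maps node name -> captured listing text.

    Returns {wwid: [nodes]} when the nodes disagree, or an empty dict
    when every node reports the same WWID.
    """
    by_wwid = {}
    for node, text in listings.items():
        by_wwid.setdefault(extract_wwid(text), []).append(node)
    return by_wwid if len(by_wwid) > 1 else {}

# Excerpts from the two listings posted in this thread.
listings = {
    "HNAA": "Reference count 0 Default buffer size 512\n"
            "WWID 02000008:5006-0B00-0033-3170\n",
    "HNAC": "Reference count 0 Default buffer size 512\n"
            "WWID 02000008:5006-0B00-0029-1683\n",
}

for wwid, nodes in wwid_mismatches(listings).items():
    print(wwid, "seen on", ", ".join(nodes))
```

With the two excerpts above, the nodes end up in separate groups, which is the mismatch in question.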

Dave
Rick Dyson
Valued Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

You may have to manually edit or delete/rebuild Sys$System:Sys$Devices.dat. If your nodes don't all share this file, check that every copy agrees on the WWIDs, device names, etc.

In some cases, you might have to clear out device name caches with a full cluster reboot. At least, that is the only way I could get things reset once; HP had me schedule a full shutdown and reboot of all nodes.

I had what sounds like a similar problem once and needed to go that route. Replacing a tape drive in a SAN can be tricky.

Rick
Jan van den Ende
Honored Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

@Rick:
>>>
In some cases, you might have to clear out device name caches with a full cluster reboot.
<<<
That would be the day!
Having a fully redundant VMS cluster targeted at never-down, and then REBOOT just to change a tape drive?
Not in my (our) book!!
But it looks like The Brit gave a good answer:
>>>
Look at SYSMAN HELP for "IO REPLACE_WWID"
<<<

fwiw

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Rick Dyson
Valued Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Jan: I know. I run a hospital Cerner system, and that solution was (in the end) the only resolution. I had to wait a while with one node (the one that normally ran the backups) unable to access the tape library!

Maybe it was just my lack of detailed knowledge combined with a support tech who also didn't know every trick, but they told me that was my only option. A few months before that, when adding or changing drives, I had to issue "MCR SYSMAN Io Autoconfig", which generated a BUGCHECK and crashed my backup node. That forced a full HBVS shadow merge (this was prior to HBMM!) of all 30+ volumes in the middle of the afternoon... I was not popular!

I have been a little gun-shy of messing with such things since. Later, I believe CSC had a patch that I think addressed the kind of BUGCHECK I had. It has never happened again and I have used it repeatedly. :)

Sorry Ziggy! Back to your problems!!!

Rick