Operating System - OpenVMS

Fiber Channel tape drive doesn't work on one node of a cluster only

Ziggy Filek
Frequent Advisor

I have a 3-node cluster running VMS 7.3-2. One of my fibre channel connected tape drives started to give parity errors on all 3 nodes (A, B and C), and finally gave the following on node A only:
HNAA..SYSTEM> mount/foreign $2$mga3
%MOUNT-F-DRVERR, fatal drive error
HNAA..SYSTEM>
The drive was replaced inside the library, the library power-cycled and the following attempted:
HNAA..SYSTEM> mcr sysman
SYSMAN> set envir/node=(HNAA,HNAB,HNAC)
%SYSMAN-I-ENV, current command environment:
Individual nodes: HNAA,HNAB,HNAC
Username SYSTEM will be used on nonlocal nodes

SYSMAN> io replace_wwid $2$MGA3
%SYSMAN-I-NODERR, error returned from node HNAA
-SYSTEM-F-DEVACTIVE, device is active
SYSMAN> exit
HNAA..SYSTEM>
I confirmed that the command worked OK on B and C, that SYS$COMMON:SYS$DEVICES.DAT has been properly updated, and that the new drive $2$MGA3 works just fine on nodes B and C. However, on node A it still gives:
HNAA..SYSTEM> mount/foreign $2$mga3
%MOUNT-F-DRVERR, fatal drive error
How can I fix this without rebooting A? Thank you for any insights. Node A is an extremely important production node and cannot easily be rebooted.
18 REPLIES
The Brit
Honored Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Ziggy,
What do you get if you do a "show dev/full" on the tape device? (Just making sure that the device is not locked up by some old process, although I know you probably checked that already.)

Dave
Thomas Ritter
Respected Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Following on from The Brit... execute a $show device $2$mga3 on each node and post the results.

Ziggy Filek
Frequent Advisor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Here's "show dev/full $2$mga3" on both the "bad" and the "good" nodes:
HNAA..ZFILEK> sho dev /full $2$mga3

Magtape $2$MGA3: (HNAA), device type HP Ultrium 2-SCSI, is online, record-
oriented device, file-oriented device, available to cluster, device has
multiple I/O paths, error logging is enabled, controller supports compaction
(compaction disabled), device supports fastskip (per_io).

Error count 0 Operations completed 15744460
Owner process "" Owner UIC [SYSTEM]
Owner process ID 00000000 Dev Prot S:RWPL,O:RWPL,G:R,W
Reference count 0 Default buffer size 512
WWID 02000008:5006-0B00-0033-3170
Density default Format Normal-11
Allocation class 2

Volume status: no-unload on dismount, beginning-of-tape, odd parity.

I/O paths to device 2
Path PGB0.1000-00E0-0222-F492 (HNAA), primary path, current path.
Error count 0 Operations completed 15744382
Path PGB0.1000-00E0-0222-F492 (HNAA).
Error count 0 Operations completed 78

HNAA..ZFILEK>
And the good:
HNAC..SYSTEM> sh dev /full $2$mga3:

Magtape $2$MGA3: (HNAC), device type HP Ultrium 2-SCSI, is online, record-
oriented device, file-oriented device, available to cluster, error logging
is enabled, controller supports compaction (compaction disabled), device
supports fastskip (per_io).

Error count 0 Operations completed 105770463
Owner process "" Owner UIC [SYSTEM]
Owner process ID 00000000 Dev Prot S:RWPL,O:RWPL,G:R,W
Reference count 0 Default buffer size 512
WWID 02000008:5006-0B00-0029-1683
Density default Format Normal-11
Allocation class 2

Volume status: no-unload on dismount, beginning-of-tape, odd parity.

HNAC..SYSTEM>
Joanne Korol_1
Occasional Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

I'm not an expert, but did you perform:

MCR SYSMAN
SYSMAN> IO AUTO

Are you connected to an MDR? If so, has that been power cycled?

I was told by an FE that it's best to pull the power plug on the tape drive if you want a complete initialization. Some errors/problems don't clear by turning the tape drive off and on using the power switch.

We had a similar problem, and it ended up being a GBIC on one of the switches.

The Brit
Honored Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Ziggy,
The only thing that stands out to me is that on HNAA the device shows up as "multipathed" whereas on HNAC it doesn't. (Are you a Cerner client, by any chance?) Anyway, I would look into that, unless it is how you would expect it to look based on your configuration, or how it was before being repaired.

dave
The Brit
Honored Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Ziggy,
Here is a more likely cause for your problem. Looking at your listing of the "show dev /full" from the two different nodes, if both cluster members are looking at the same device, then I'm pretty sure that they should both display the same WWID.
In your case they don't! I suspect that the WWID displayed on the "working" nodes is the correct one. What you need to do is to get the A system to look at the correct WWID.
Look at SYSMAN HELP for "IO REPLACE_WWID"
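The comparison above can also be done mechanically. This is only a rough sketch: it assumes you have captured the "show dev /full" text from each node into strings, and that the listing has a "Magtape <device>: (<node>)" header and a "WWID <value>" line as in the output posted earlier. The function names are made up for illustration.

```python
import re

# Patterns assume the layout shown in the listings posted in this thread.
# WWIDs that contain spaces (quoted vendor strings) are not handled here.
DEVICE_RE = re.compile(r"^\s*Magtape\s+(\S+):\s+\((\w+)\)", re.MULTILINE)
WWID_RE = re.compile(r"^\s*WWID\s+(\S+)", re.MULTILINE)


def parse_show_device(text):
    """Return (device, node, wwid) from one SHOW DEVICE/FULL listing."""
    dev = DEVICE_RE.search(text)
    wwid = WWID_RE.search(text)
    if not dev or not wwid:
        raise ValueError("listing lacks a Magtape header or a WWID line")
    return dev.group(1), dev.group(2), wwid.group(1)


def wwids_by_node(listings):
    """Map node name -> WWID that node reports for the device."""
    result = {}
    for text in listings:
        _device, node, wwid = parse_show_device(text)
        result[node] = wwid
    return result


def nodes_disagree(listings):
    """True if the nodes report more than one distinct WWID."""
    return len(set(wwids_by_node(listings).values())) > 1
```

Fed the HNAA and HNAC listings from this thread, nodes_disagree() should come back True, which is the mismatch pointed out above.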

Dave
Rick Dyson
Valued Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

You may have to manually edit or delete/rebuild Sys$System:Sys$Devices.dat. If your nodes don't all share this file, check that every copy agrees on the WWIDs, devices, etc.
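If the nodes do keep separate copies, a small script can do the comparison. This is a sketch only: it assumes each copy has been pulled off as plain text and that entries look like "[Device $2$MGA3]" followed by a "WWID = ..." line, with "!" starting a comment line. That layout is an assumption; check it against your own file. The function names are hypothetical.

```python
import re


def parse_devices_dat(text):
    """Map device name -> WWID string from SYS$DEVICES.DAT-style text."""
    entries = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and (assumed) comment lines
        header = re.match(r"\[Device\s+(\S+)\]$", line, re.IGNORECASE)
        if header:
            current = header.group(1).upper()
            continue
        key, sep, value = line.partition("=")
        if current and sep and key.strip().upper() == "WWID":
            entries[current] = value.strip()
    return entries


def diff_entries(a, b):
    """Return devices whose WWID differs, or that exist in only one copy."""
    return {
        dev: (a.get(dev), b.get(dev))
        for dev in sorted(set(a) | set(b))
        if a.get(dev) != b.get(dev)
    }
```

An empty result from diff_entries() means the two copies agree. Note that this only compares the files; as this thread shows, a node's in-memory device data can still differ from what the file says.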

In some cases, you might have to clear out device name caches with a full cluster reboot. At least, that is the only way I could get things reset once: HP had me schedule a full shutdown and reboot of all nodes.

I had what sounds like a similar problem once and needed to go that route. Replacing a tape drive in a SAN can be tricky.

Rick
Jan van den Ende
Honored Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

@Rick:
>>>
In some cases, you might have to clear out device name caches with a full cluster reboot.
<<<
That would be the day!
Having a fully redundant VMS cluster targeted at never-down, and then REBOOT just to change a tape drive?
Not in my (our) book!!
But it looks like The Brit gave a good answer:
>>>
Look at SYSMAN HELP for "IO REPLACE_WWID"
<<<

fwiw

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Rick Dyson
Valued Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Jan: I know, and I run a hospital Cerner system and that solution (in the end) was the only resolution. I had to wait a while with one node (the one that normally ran the backups) unable to access the tape library!

Maybe it was just my lack of detailed knowledge combined with a support tech who also didn't know every trick, but they told me that was my only option. A few months before that, when adding or changing drives, I had to issue "MCR SYSMAN Io Autoconfig" which generated a BUGCHECK and crashed my backup node. That forced a full HBV Shadow merge (this was prior to HBMM!) of all 30+ volumes in the middle of the afternoon... I was not popular!

I have been a little gun-shy of messing with such things since. Later, I believe CSC had a patch that I think addressed the kind of BUGCHECK I had. It has never happened again and I have used it repeatedly. :)

Sorry Ziggy! Back to your problems!!!

Rick
Ziggy Filek
Frequent Advisor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Everybody: As I specified in the original question, I did everything by the book: changed the tape drive inside the library, power-cycled the library to make sure the fibre router inside the library is set to automatic discovery, then issued SYSMAN> IO REPLACE_WWID $2$MGA3 clusterwide. This command was successful on nodes B and C, but bombed on A with a "device is active" error. SYS$DEVICES.DAT has been correctly updated with the new WWID and the drive works just fine on B and C. The WWID on A is different from the one on B and C because the SYSMAN REPLACE_WWID bombed on this node! I logged a call with HP, and this problem has already been escalated to engineering, since indeed it is silly to boot a "never down" machine because of a lousy tape drive. I will keep you posted, since this can happen to anyone.
Jan van den Ende
Honored Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

@ Rick:

>>>
That forced a full HBV Shadow merge (this was prior to HBMM!) of all 30+ volumes in the middle of the afternoon... I was not popular!
<<<
Oh yeah! Been there, also have those scars! In a police cluster running the callroom this did not win the popularity poll either :-)

If you ever went to European DECUSes, or the 1999 San Diego one, you may remember me from the Engineering Panel discussions.
I have brought up the mini-merge missing from SCSI devices every time since SCSI became popular, pointing out the giant step back from DSA devices.
... and I also was the one to ask the audience of that same panel during one of the first Bootcamps to loudly applaud the realisation of HBMM.

But. Ziggy, it need not be a cluster reboot.
A well-planned one-node reboot should be relatively unnoticeable for the users. And the accompanying inconvenience for system management should be part of the job. A good occasion to demonstrate to Management that your job _IS_ important :-)

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Ziggy Filek
Frequent Advisor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Jan: A "well-planned reboot" would be fine without any applications running. Unfortunately we run applications using Oracle RAC in active-passive mode, and the node in question happens to be the "active" one. If you reboot it, Oracle is supposed to fail over. The application is also supposed to fail over, but our application people are scared to death to induce the failover even though it has supposedly been tested... Real life bites again. Fortunately there WILL be a reboot in the middle of the night one day next week to install application patches, so my problem will disappear.
By the way, the HP people offered to walk me through booting DELTA and fixing some UCBs manually, but I said I was not brave enough. It is strange they don't have a fix for this common problem (it happened to me a few times before, but in a non-critical scenario). Let's hope the VMS engineering people come up with a patch, or something.

Brit and Rick: Yeah, we are talking Cerner Millennium here, with the world-famous "High Availability Toolkit".
The Brit
Honored Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Ziggy,
I know this shouldn't be necessary, but did you follow the instructions in the "Description" section, i.e. get the WWID from an "io list_wwid" command, then include the "/WWID=" qualifier, followed by an "IO AUTO"?

SYSMAN> io replace_wwid /wwid=02000008:5006-0B00-0029-1683
SYSMAN> io auto

??

(Just trying to rule out everything.)

Dave
Ziggy Filek
Frequent Advisor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

No, same error message "device is active".
The Brit
Honored Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

hi again Ziggy.

Just a couple of final questions,

Does the NEW tape device appear in an "IO LIST_WWID" display?

I was thinking that you might be able to create a new "mga" device with the new WWID, possibly by editing it into the SYS$DEVICES.DAT file directly, i.e. put in something like

[Device $2$MGA4]
WWID=

(just throwing out suggestions, but I suspect this will still require a reboot)

I think this is probably the end of my contribution -- the idea bucket is empty.

Wish you luck, and if you do find a solution be sure to post it.

Dave.
Ziggy Filek
Frequent Advisor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

No, it does not show, because it is already configured in SYS$DEVICES.DAT! As to your suggestion, I know I could have tried to sidestep the problem by creating a new device: replace the tape drive, then instead of running REPLACE_WWID, run LIST_WWID on all 3 nodes and then CREATE_WWID $2$MGA4/wwid=... on all three nodes. But that would leave me with a permanently screwed-up MGA3, and I would have to change all command procedures using tape devices, etc. In any case my problem disappears tomorrow at 4AM, since I'll have a reboot...
Thanks for your efforts! Ziggy
Ziggy Filek
Frequent Advisor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only

Will have a re-boot tomorrow, so that the problem will be fixed. Thank you everybody for your input. -Ziggy
Tom O'Toole
Respected Contributor

Re: Fiber Channel tape drive doesn't work on one node of a cluster only


Ziggy,

Please let us know of any further results of your escalation, in particular why you are getting the "device is active" message. VMS still seems to have a few bugs like this in the handling of tape devices, which are very annoying because there aren't satisfactory workarounds.
Can you imagine if we used PCs to manage our enterprise systems? ... oops.