
Re: MVTIMEOUT

 
Wim Van den Wyngaert
Honored Contributor

MVTIMEOUT

When a disk becomes unreachable, it goes into mount verification until the disk comes back or until a (MV)timeout occurs.

To my knowledge, when a disk fails, it is thrown out of the shadow set.

Is there any situation in which a disk or its controller fails permanently and VMS keeps waiting for it to come back, thus blocking the application?
Wim
John Eerenberg
Valued Contributor

Re: MVTIMEOUT

Really good question!

I don't have a definitive answer, but it doesn't look like such a scenario exists.
Here is a quick test I just performed:

2 GS140, 1 HSG80 fibre channel.

Node A - Started one process with a continuous stream of IO (actually a $ Analyze/Disk in a loop) to DGA1001.

Node B - no IO yet.

Went into the HSG80 and *deleted* both DGA1001's Ident (nothing happened) and the unit number (I assume this is good enough to call a "permanent" failure). Deleting the unit number hung Node A's process.

Mount verify did not display on Node A (yet).
Node A could not do a Show Device dga1001 - hung (rather odd . . .).
Node B's Show Device dga1001 showed nothing unusual (expected).
After a few seconds Node A put dga1001 into mount verify (it was about time).
Node B's Show Device dga1001 still showed nothing unusual (expected).
Started IO from Node B and it errored out (expected).
Node B's Show Device dga1001 shows "offline mounted." (which makes sense)

Went into the HSG80 and restored dga1001's unit and ident.
Ran mcr sysman io auto on Node A.
Mount verify completed on Node A.
Ran mcr sysman io auto on Node B.
Node B's dga1001 went into mount verify then completed (well, it took a little IO from Node B to do this).

Everything is back to normal.

Looks like MVTIMEOUT comes into play eventually.

May I ask what problem you are trying to solve?

john
It is better to STQ then LDQ
Mike Naime
Honored Contributor

Re: MVTIMEOUT

Yes, I have seen the waiting situation several times over the last few years with our SAN.

In version 7.2-1h1, we discovered, and had Compaq fix, a problem in which the drives would immediately go into MVtimeout when a connection was lost. We got a new SYS$CLUSTER.EXE file to fix that problem. The state reported was incorrect, and it caused the MVTIMEOUT flag to get set.

Another time we lost a SCSI bus, and both controllers froze. One cluster on that controller was CTRL-P'd before they called me and I discovered that we had a hung controller. The other clusters on that controller were just waiting for the drives to come back.

Last month during a SAN fabric upgrade, when we took down our BLUE fabric, one VMS cluster had a problem with the RED fabric, and so that cluster went into Mount Verify until we had completed our SAN fabric upgrade and had re-established that fabric. The drives came back from the Mount Verify state. Unfortunately, Oracle decided a long time back that it was not going to play anymore, and they had to bounce the Application/Oracle to get everything running again.
VMS SAN mechanic
Uwe Zessin
Honored Contributor

Re: MVTIMEOUT

It's been a while since I had a small chat with 'Mr. Shadowing' from OpenVMS. I don't have the conversation handy right now, but he confirmed my guess that the shadowing driver has no timeouts itself. How the shadow driver handles the situation depends on the underlying error.

I don't recall if we talked about this, but I believe that MVTIMEOUT applies to the disk 'volume' that is mounted to the DSAunit: and not to any underlying members.
.
Cass Witkowski
Trusted Contributor

Re: MVTIMEOUT

We had a situation with a cluster and two pairs of HSG80s, each with a member of a shadow set. We accidentally did a SET HOST/SCSI to the HSG80s while we had SWCC running at the same time. The controller crashed hard and lost the configuration of the stripeset that was being shadowed. Needless to say, the system hung. I can't remember whether the disk was in mount verification or not.

Luckily we were able to recreate the stripeset definition and bring the unit back online. Everything took off from where it paused.

John Eerenberg
Valued Contributor

Re: MVTIMEOUT

fwiw - Set Host/SCSI had problems with multiple users accessing the controller simultaneously (amongst other things) and is not supported for HSG80s. Command Line Scripter is the supported way to go from VMS to HSG80 controllers. Quite useful/good actually . . .
It is better to STQ then LDQ
Wim Van den Wyngaert
Honored Contributor

Re: MVTIMEOUT

John,

I couldn't find the Command Line Scripter tool with Google. Do you know where to find it?

I am using the unsupported tool too. So, if I can avoid it ...
Wim
Mike Naime
Honored Contributor

Re: MVTIMEOUT

If you want to avoid the tool, get some terminal servers (DECserver 700s) and a console manager. Use the manager to connect to the CLI port through the serial port. This is really the best way to monitor all of the console output of the HSGs.

If you are using a product like TDI's Consoleworks, you can have shared access, logging, scans, etc. on those consoles.

If you do not want to spend the money for the management software, the terminal server will allow you to do a telnet connection directly to the console port. This limits access to one person at a time.
VMS SAN mechanic
Uwe Zessin
Honored Contributor

Re: MVTIMEOUT

It is called the 'HP StorageWorks command scripter' these days. Note that it costs money. The link is free;-)

http://h18000.www1.hp.com/products/sanworks/commandscripter/index.html
.
Wim Van den Wyngaert
Honored Contributor

Re: MVTIMEOUT

Uwe and John : no money for the tool. I will stay with set ho/scsi.

Mike : they are connected to a console manager. Only, my watchdog does the unattended monitoring with set ho /scsi. ConsoleWorks is PC-dependent and not unattended.

All : thanks for sharing the info.
Wim
Mike Naime
Honored Contributor

Re: MVTIMEOUT

WIM:

The concept of connecting to a DECserver 700 and doing a Telnet to the port will work with or without ConsoleWorks to monitor the port. I actually set up the terminal servers and telnet to the individual ports to test them prior to using Consoleworks on them.

We ran Consoleworks on a VMS DS10L until the internal disk fried. Then we moved it to a DS20 with SAN drives. They may not promote the fact that there is a VMS version of the software that well, but it is out there! It does require $$$ for the licenses and scan files, but I think that it is money well spent for the functionality that you get.

If you are a small shop, I would not spend the money unless you really need the realtime pager/alert notifications.


Mike
VMS SAN mechanic
Wim Van den Wyngaert
Honored Contributor

Re: MVTIMEOUT

Today I had the real thing.

A cluster of 2 4100 nodes and 8 alphastations that have a local pagefile.
The 2 servers are connected with a dual HSZ70.

MVTIMEOUT is 180 seconds.

%%%%%%%%%%% OPCOM 10-MAR-2004 14:23:48.70 %%%%%%%%%%%
Device $45$DKA401: (SALPV1 PKC, SALPV2) is offline.
Mount verification is in progress.

...


%%%%%%%%%%% OPCOM 10-MAR-2004 14:27:00.45 %%%%%%%%%%%
Mount verification has aborted for device $45$DKA401: (SALPV1 PKC, SALPV2)

Then I found the disk in mntverifytimeout.
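
As a rough check on the timing in the two OPCOM messages above, here is a small Python sketch (not part of the original post; the helper name `parse_opcom_time` and the `MONTHS` table are made up for illustration). It assumes the OPCOM timestamp layout shown above, `DD-MON-YYYY HH:MM:SS.cc`, and compares the elapsed time with the 180-second MVTIMEOUT quoted in the post.

```python
from datetime import datetime

# English month abbreviations as they appear in the OPCOM banner lines.
MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}

def parse_opcom_time(stamp):
    """Parse 'DD-MON-YYYY HH:MM:SS.cc' (hypothetical helper, hundredths of a second)."""
    date_part, time_part = stamp.split()
    day, mon, year = date_part.split("-")
    hh, mm, ss = time_part.split(":")
    sec, hundredths = ss.split(".")
    return datetime(int(year), MONTHS[mon.upper()], int(day),
                    int(hh), int(mm), int(sec), int(hundredths) * 10_000)

MVTIMEOUT = 180  # seconds, as stated in the post

started = parse_opcom_time("10-MAR-2004 14:23:48.70")  # "Mount verification is in progress"
aborted = parse_opcom_time("10-MAR-2004 14:27:00.45")  # "Mount verification has aborted"

elapsed = (aborted - started).total_seconds()
print(f"{elapsed:.2f} s between the messages, {elapsed - MVTIMEOUT:.2f} s beyond MVTIMEOUT")
# prints "191.75 s between the messages, 11.75 s beyond MVTIMEOUT"
```

The sketch only measures the gap between the two logged messages; it says nothing about why the abort was reported roughly 12 seconds after the nominal limit.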

I tried dismount/abort on all cluster nodes. Not all nodes did the dismount due to open files.

I did stop/id of all processes accessing the disk. Then dismount/abort worked.

I tried to mount the disk on one of the 2 servers. "Medium is offline". Same on the other server.

I did a reboot of both disk controllers (one at a time). After that the disk could be mounted.

Why isn't dismount/abort aborting the processes?
How can a disk become "offline"? BTW it is a raid5 disk and the hsz70 reported all disks "normal".
Wim
Uwe Zessin
Honored Contributor

Re: MVTIMEOUT

Can't verify right now, but I think you have to use:
$ dismount /abort /override=checks ...

But be careful: if you omit /abort, it will mark the disk 'dismount pending' when there are open files - reboot, here we come...

Have you checked the server's error-logs for entries?
.
Wim Van den Wyngaert
Honored Contributor

Re: MVTIMEOUT

Uwe,

Sorry for the late reply.
The error log shows :

Logging OS 1. OpenVMS
System Architecture 2. Alpha
OS version V7.3
Event sequence number 16321.
Timestamp of occurrence 10-MAR-2004 14:27:38
Time since reboot 12 Day(s) 1:08:51
Host name SALPV2

System Model AlphaServer 4100 5/466 4MB

Entry Type 98. Asynchronous Device Attention

---- Device Profile ----
Unit SALPV2$PKC0
Product Name ISP1020/1040 PCI-SCSI Adapter
----- SCSI Port -----
Long Word Count x00000005
Error Log Revision x01
Error Type x060B 0x0B, Controller Error 0x06, Transport Timeout
SCSI ID x04

Command Length x06
Command & Data
x12
x00
x00
x00
xFF
x00

SCSI Status xFF No Status Received

----- Software Info -----
UCB$x_ERTCNT 16. Retries Remaining
UCB$x_ERTMAX 16. Retries Allowable
IRP$Q_IOSB x0000000000000000
UCB$x_STS x10000010 Online
IRP$L_PID x00000000 Requestor "PID"
IRP$x_BOFF 0. Byte Page Offset
IRP$x_BCNT 0. Transfer Size In Byte(s)
UCB$x_ERRCNT 15. Errors This Unit
UCB$L_OPCNT 219. QIO's This Unit
ORB$L_OWNER x00010004 Owners UIC
UCB$L_DEVCHAR1 x0C440000 Available Error Logging
Capable of Input
Capable of Output
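
For readers who want to pick the entry apart, here is a hedged Python sketch (not from the original post) that splits the `Error Type` word into its two bytes and looks at the six `Command & Data` bytes. The labels for 0x0B and 0x06 are copied from the log line itself. Reading the command bytes as a SCSI INQUIRY is my own interpretation, based on 0x12 being the standard INQUIRY opcode; the function name `split_error_type` is hypothetical.

```python
def split_error_type(word):
    """Split a 16-bit error-type word into (high byte, low byte)."""
    return (word >> 8) & 0xFF, word & 0xFF

# Labels exactly as printed next to "Error Type x060B" in the log entry.
LABELS = {0x0B: "Controller Error", 0x06: "Transport Timeout"}

high, low = split_error_type(0x060B)
for value in (low, high):
    print(f"0x{value:02X}: {LABELS[value]}")

# The six "Command & Data" bytes from the entry, in order.
cdb = bytes([0x12, 0x00, 0x00, 0x00, 0xFF, 0x00])

SCSI_INQUIRY = 0x12           # standard SCSI opcode for INQUIRY
opcode = cdb[0]
allocation_length = cdb[4]    # byte 4 of a 6-byte INQUIRY command

if opcode == SCSI_INQUIRY:
    print(f"INQUIRY, allocation length {allocation_length} bytes")
```

Read this way, the adapter timed out on a plain INQUIRY to SCSI ID 4 and received no status back, which matches the "No Status Received" line but does not by itself say whether the disk, the controller, or the bus was at fault.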
Wim
Uwe Zessin
Honored Contributor

Re: MVTIMEOUT

Looks like there was a problem with the disk or the bus. See if you can find somebody who can tell you what 'Transport Timeout' means in this context. I am afraid that my guess would be as good as yours.

BTW, you asked: "Why isn't dismount/abort aborting the processes ?"
It is not supposed to do that. /ABORT is to terminate the mount verification _and_ cancel all outstanding I/O requests.

A process cannot be 'aborted' from the outside anyway. If you use "$STOP/IDENTIFICATION=pid" or "$STOP processname" or "SYS$DELPRC()", the process is 'tapped' on its shoulder and told to terminate itself. Part of this is closing all I/O channels, which cannot be done if there are outstanding I/Os. Many system managers know the 'RWAST effect'.
.