Operating System - OpenVMS

SOLVED
Kevin Raven (UK)
Frequent Advisor

Disk IO retry - OpenVMS 7.3-2

We have a four-node ES45 cluster: a shared system disk and nine other shared disks. We do not shadow any disks.
We have an EMC storage array that does the RAID for us and presents OpenVMS with 10 disks.
We are running OpenVMS 7.3-2 with Update V8 (yes, I know we are a little behind with the updates). We access the disks from all four nodes in the cluster with 2 HBA cards per server. Both cards are single-port, with fibre cables connected to them. We do not MSCP-serve disks between nodes.
A few weeks ago someone did a reconfiguration of some type on the EMC storage. As a result, IO to the disks on all 4 nodes stalled for 4.9 seconds, then things resumed.
The EMC support team claim that no other connected systems were affected, these other systems being Windows and Solaris. They also claim that the stall in IO would have been only approximately 1 second.
My point is the following:
- The other systems might not measure anything more than a second of stall as an outage.

- If the outage was only approximately 1 second, then IO would have stalled for only 1 second, not 4.9.
Would this be the case?
Could a 1-second stall in IO cause VMS to stall IO for approximately 4.9 seconds?
I checked the operator logs and other logs; no multipath switching took place during the IO stall.

We consider a 0.5-second outage to mean the application is unavailable. Our cluster is about as close to real time as you can get: the cluster RECNXINTERVAL is set as low as 4 seconds, along with the associated parameters.
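
For reference, that setting can be read back with SYSGEN; a minimal check, assuming RECNXINTERVAL is the parameter meant by "RECN Interval" above:

$ MC SYSGEN SHOW RECNXINTERVAL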

Comments ?
Volker Halle
Honored Contributor
Solution

Re: Disk IO retry - OpenVMS 7.3-2

Kevin,

could an IO error have triggered mount-verification on the disks ?

Mount-verifications might not be logged to OPCOM, see the MVSUPMSG_INTVL and MVSUPMSG_NUM sysgen parameters.
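
A minimal sketch of checking those settings and, since both parameters are dynamic, temporarily relaxing the suppression so that future mount-verification events show up in OPCOM (the value 100 is just an illustrative choice):

$! Inspect the current message-suppression settings
$ MC SYSGEN SHOW MVSUPMSG
$! Allow more mount-verification messages per interval before OPCOM
$! suppression kicks in; WRITE ACTIVE makes the change effective immediately
$ MC SYSGEN
SYSGEN> SET MVSUPMSG_NUM 100
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT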

Volker.
Steve-Thompson
Regular Advisor

Re: Disk IO retry - OpenVMS 7.3-2

Hi Kevin

My response to this situation...
The word glib comes to mind.

ANY IO delay is unacceptable!

I would tell the people managing the EMC box to fix it
(i.e. if the EMC box was working before, then it can work again).

So what did they change?
Have these delays occurred on ALL systems since the EMC revision?
Does the change to the EMC imply that the fabric configuration needs revising?

You say there's no path switching going on, which could otherwise account for a delay; there's obviously a problem with the new configuration.

To confirm this, do a:
$ SHO DEVICE /FULL

Look and see whether all the "operations completed" counts are where you expect them to be.
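
For a single fibre device that check might look like the following, where $1$DGA100: is just a placeholder device name; the fields of interest are the error count and the operation counts shown for each path:

$ SHOW DEVICE /FULL $1$DGA100:
$! Compare "Error count" and "Operations completed" across nodes and paths;
$! unexpected errors or an idle path can show where the stall was handled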

Regards
Steven


Kevin Raven (UK)
Frequent Advisor

Re: Disk IO retry - OpenVMS 7.3-2

The storage guys will never do a reconfiguration during production time again, so we will not see any further 1-second or 4.9-second IO delays. I just wanted to get to the bottom of how a 1-second delay in IO on the EMC storage (if that was the case!) can translate into a 4.9-second stall on the VMS servers.

Kevin Raven (UK)
Frequent Advisor

Re: Disk IO retry - OpenVMS 7.3-2

"Kevin,

could an IO error have triggered mount-verification on the disks ?

Mount-verifications might not be logged to OPCOM, see the MVSUPMSG_INTVL and MVSUPMSG_NUM sysgen parameters.

Volker."

$ mc sysgen show MVSUPMSG
Parameter Name      Current    Default     Min.     Max.   Unit         Dynamic
--------------      -------    -------     ----     ----   ----         -------
MVSUPMSG_INTVL         3600       3600        0       -1   Seconds         D
MVSUPMSG_NUM              5          5        0       -1   Pure-numbe      D
$
Ian Miller.
Honored Contributor

Re: Disk IO retry - OpenVMS 7.3-2

Could the error have led to a QUEUE FULL status being reported back to VMS for that storage controller port, with VMS then backing off from sending I/O for a while?
____________________
Purely Personal Opinion
Wim Van den Wyngaert
Honored Contributor

Re: Disk IO retry - OpenVMS 7.3-2

Read this too
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=1066685

Maybe the minimum recovery time is about 5 seconds?

Wim
James Cristofero
Occasional Advisor

Re: Disk IO retry - OpenVMS 7.3-2

I also have EMC storage on Alpha, running the UPDATE V7 and FIBRE_SCSI V9 kits. We have a dedicated DMX3000 attached to 19 Alphas, each with either 2 or 4 HBAs.

While deploying additional storage (HDS) we had a cluster hang. Any attempt to do I/O to the EMC would "hang" that server.

No mount verification ever came back from the frame. So apparently, not all the communication you would expect to see is available on the EMC paths.

We backed out the HDS changes, crashed/rebooted, and all the I/O was restored.
Jur van der Burg
Respected Contributor

Re: Disk IO retry - OpenVMS 7.3-2

The big question is whether the EMC controllers returned an error or not. If they just stalled for one second, then there's nothing in VMS that would stall the request for more than that time. If the controller returned an error, then mount verification would have kicked in, which may have been suppressed. Mount verification will stall all I/Os and issue a PACKACK to DKDRIVER every second until it gets a response or the mount verification timeout expires (3600 seconds by default). The PACKACK issues a SCSI TEST UNIT READY command, so if that command was delayed by the controller it may explain the delay. From a VMS perspective it could be that multipath may add some additional seconds as it participates in the error recovery.
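
As a quick check on the VMS side, the relevant timeout and the message-suppression settings can be read with SYSGEN; a minimal sketch, assuming MVTIMEOUT is the mount verification timeout referred to above:

$! Mount verification timeout (default 3600 seconds) and the OPCOM
$! message-suppression parameters discussed earlier in the thread
$ MC SYSGEN SHOW MVTIMEOUT
$ MC SYSGEN SHOW MVSUPMSG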

Bottom line is that I think that the controller has returned an error, and that recovery on such a serious event may take a couple of seconds.

Jur.
Robert Brooks_1
Honored Contributor

Re: Disk IO retry - OpenVMS 7.3-2

From a VMS perspective it could be that multipath may add some additional seconds as it participates in the error recovery.

--

Multipath can add some additional time, but since multipath does its work in the context of mount verification, you'd expect to see the OPCOM messages. However, if mount verification message suppression is enabled (as it is by default), then it's difficult to figure out what's going on.

Attempting to troubleshoot this after the fact is nearly impossible. A tool to use *while the problem is happening* would be the DKLOG SDA extension, which will log all the SCSI commands and the SCSI statuses coming back from the controller.
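
A rough sketch of getting at it, assuming the extension is invoked by name from SDA as usual (the exact DKLOG subcommands for starting and dumping the trace are version-dependent, so check the extension's own usage text first):

$ ANALYZE/SYSTEM
SDA> DKLOG
SDA> EXIT

The trace only has value if it is collected while the stall is in progress, as noted above.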

-- Rob