Operating System - OpenVMS

Mike Lievre
Occasional Contributor

Troubleshooting Mount Verification

I am currently experiencing a large number of mount verifications in an Alpha ES47 VMS 8.3 environment connected to a couple of HSZ80s hosting older 18 GB and 36 GB disks.

The mount verification messages are related to one volume only and only when under load. I understand the mount verification will happen when there is a delay in I/O somewhere along the path.

The mount verifications are nearly instant, but there are many. Below is an example of a typical OPCOM message.

%%%%%%%%%%% OPCOM 18-FEB-2008 08:56:10.17 %%%%%%%%%%%
Device $5$DKA3: (SCORPO PKD) is offline.
Mount verification is in progress.

%%%%%%%%%%% OPCOM 18-FEB-2008 08:56:10.18 %%%%%%%%%%%
Mount verification has completed for device $5$DKA3: (SCORPO PKD)

The fact that the mount verifications only happen on one volume (there are similar load characteristics on other volumes hosted on the same HSZ80) suggests that there is a problem with a disk or possibly a shelf.

Does anyone have any suggestions on how I might troubleshoot this? The HSZ80 doesn't show any obvious errors when I log in.

Apart from the mount verification messages there appears to be no other impact, though I haven't done an analysis of the performance of this volume vs others.
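
To put numbers on "nearly instant, but there are many", the OPCOM messages can be tallied per device. The following is a minimal Python sketch, assuming the operator log has been copied off as plain text and that the messages look like the two quoted above; the function names and regular expressions are illustrative, not part of any OpenVMS tool.

```python
import re
from collections import defaultdict
from datetime import datetime

MONTHS = {name: i for i, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# Patterns modelled on the two OPCOM messages quoted in this post (assumed layout).
HEADER = re.compile(
    r"%+\s+OPCOM\s+(\d{1,2})-([A-Z]{3})-(\d{4})\s+"
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{2})\s+%+")
START = re.compile(r"Device\s+(\S+:)\s.*is offline")
DONE = re.compile(r"Mount verification has completed for device\s+(\S+:)")


def parse_timestamp(match):
    """Turn a HEADER match into a datetime (hundredths become microseconds)."""
    day, mon, year, hh, mm, ss, cc = match.groups()
    return datetime(int(year), MONTHS[mon], int(day),
                    int(hh), int(mm), int(ss), int(cc) * 10000)


def summarize(log_text):
    """Return {device: [duration in seconds of each mount verification]}."""
    pending = {}                    # device -> time it went offline
    durations = defaultdict(list)
    stamp = None                    # timestamp of the current OPCOM banner
    for line in log_text.splitlines():
        header = HEADER.search(line)
        if header:
            stamp = parse_timestamp(header)
            continue
        if stamp is None:
            continue
        started = START.search(line)
        if started:
            pending[started.group(1)] = stamp
            continue
        done = DONE.search(line)
        if done and done.group(1) in pending:
            delta = stamp - pending.pop(done.group(1))
            durations[done.group(1)].append(delta.total_seconds())
    return dict(durations)


SAMPLE = """\
%%%%%%%%%%% OPCOM 18-FEB-2008 08:56:10.17 %%%%%%%%%%%
Device $5$DKA3: (SCORPO PKD) is offline.
Mount verification is in progress.

%%%%%%%%%%% OPCOM 18-FEB-2008 08:56:10.18 %%%%%%%%%%%
Mount verification has completed for device $5$DKA3: (SCORPO PKD)
"""

for device, times in summarize(SAMPLE).items():
    print(device, len(times), "verifications, longest", max(times), "s")
```

Run against a full day of operator log, the per-device counts should show whether $5$DKA3: really is the only volume affected, and the durations should show whether any verification takes noticeably longer than the rest.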
Volker Halle
Honored Contributor

Re: Troubleshooting Mount Verification

Mike,

A disk enters mount verification when an I/O completes with one of a certain class of retryable I/O error status values. Are any errors being logged against the SCSI adapter (PKD) or the disk itself?

Volker.
Mike Lievre
Occasional Contributor

Re: Troubleshooting Mount Verification

Yes, the disk itself. There are a handful of errors logged against the PKD device, but thousands against the disk.

After I made the original post I actually started to go down the path of trying to dig up more information on the disk errors. Unfortunately I found that my SEA web interface wouldn't connect correctly to the server to bring up any details. I thought I had verified the WEBES toolset after our 8.3 upgrade, but perhaps I need to move to the newest version of WEBES. I'm too easily sidetracked it appears.

Using analyze/error/elv to translate the errorlog I see the errors. Nothing from the actual report jumps out at me, but in a full translate there's a lot of untranslatable binary data. Within that data I see references to $5$DKA, HSZ80, and what might be serial numbers. I'll have to see if those match up with anything in the storage system.

I've attached a text file with an example error.



Hoff
Honored Contributor
Solution

Re: Troubleshooting Mount Verification

Troubleshoot? I'd immediately assume a bad disk here. Archive its contents real soon now, and prepare to swap it.

As for the posted error log entry, the "Dump untranslatable event body" means that one of the other error-reporting tools will be needed here. This is usually one of ANALYZE /ERROR /ELV, SEA, or DIAGNOSE DECevent tools.

The device type code in the posted dump is that of a 36 GB Ultra(3) SCSI disk. Old. Probably 180726-003 or related Universal brick, in a 4314R or 4354R series or similar shelf, IIRC. At somewhere between about US$25 and US$75 for a spare on the used-disk market (plus shipping), I'd preemptively swap it.

I might well look to replace the shelf, and retire the whole lot of bricks with something slightly newer.

FWIW, here's a page of links to disk MTBF patterns found in various large-scale device surveys:

http://64.223.189.234/node/93

Stephen Hoffman
HoffmanLabs LLC

Mike Lievre
Occasional Contributor

Re: Troubleshooting Mount Verification

Yes. My assumption was also that there was a bad disk that just hadn't outright failed yet. However, the VMS volume is a raidset. How would I determine which disk in the raidset is causing the problem? Is there a SCSI target somewhere there that I'm not seeing?

There appears to be a serial or drive #, but it's not unique to drives in the raidset.

Perhaps when I get SEA back up and running it will provide further information.

As for replacements, there are a great many shelves and bricks to be replaced. The replacement for this particular storage is already spec'd to come in the form of FATA disks in an EVA.
Bill Hall
Honored Contributor

Re: Troubleshooting Mount Verification

Mike,

Do these mount verifications come at a regular interval, maybe every 5 minutes? Do you have the latest fibre-scsi ECO installed?
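
One way to answer the "regular interval" question is to look at the gaps between successive mount-verification start times. This is a rough Python sketch, assuming the start times have already been pulled out of the operator log as datetime values; `interval_report` and its `tolerance` threshold are made up for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def interval_report(start_times, tolerance=0.1):
    """Summarize gaps between events and flag them as periodic when the
    spread of the gaps is small relative to their mean."""
    times = sorted(start_times)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    if len(gaps) < 2:
        # Too few events to say anything about regularity.
        return {"gaps_s": gaps, "mean_gap_s": None, "periodic": False}
    avg = mean(gaps)
    spread = pstdev(gaps)
    periodic = avg > 0 and (spread / avg) <= tolerance
    return {"gaps_s": gaps, "mean_gap_s": avg, "periodic": periodic}


# Hypothetical data: six events exactly five minutes apart.
base = datetime(2008, 2, 18, 8, 0, 0)
every_five_minutes = [base + timedelta(seconds=300 * i) for i in range(6)]
print(interval_report(every_five_minutes))
```

A mean gap near 300 seconds with a small spread would point toward a timer-driven cause; gaps that bunch up only while the volume is busy would fit the load-related pattern described in the original post.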

I seem to recall seeing a periodic mount verification of either HSx devices or maybe devices on certain models of SCSI controllers (maybe the shared differential controllers). I was thinking this was a V8.2 with/without ECO kind of a "feature".

Does this sound familiar to anyone else?

Bill
Bill Hall
Hoff
Honored Contributor

Re: Troubleshooting Mount Verification

Using one or more of rztools_alpha (http://64.223.189.234/node/761), or cddvd /inquire (http://64.223.189.234/node/746) or scsi_info (http://64.223.189.234/node/746) tools (all three of which are in sys$etc: in typical OpenVMS distros), you can ask each of the disks you can see for its serial number in turn, and see if it matches something embedded in that big wad of error data. (I coded the cddvd /inquire tool to grab the serial number whenever it could find it. Not all disks have serial numbers, though.)

Some substring of that ZG91400368V83Z string looks good as a serial number.
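
Once the serial numbers have been collected, the comparison against the error data is a plain substring search. A minimal Python sketch follows; the device names and serial numbers in `inventory` are hypothetical placeholders, and only the ZG91400368V83Z string comes from the posted dump.

```python
def match_serials(dump_text, inventory, min_len=6):
    """Return {device: serial} for every serial found inside dump_text.

    inventory maps a device name to the serial number that device reported.
    Serials shorter than min_len are skipped to limit accidental matches.
    """
    haystack = dump_text.upper()
    hits = {}
    for device, serial in inventory.items():
        needle = serial.strip().upper()
        if len(needle) >= min_len and needle in haystack:
            hits[device] = serial
    return hits


# The string from the posted error entry, plus a made-up inventory.
dump = "... ZG91400368V83Z ..."
inventory = {
    "DISK10000": "91400368",   # hypothetical serial, chosen to match
    "DISK10100": "3FP0ABCD",   # hypothetical serial, no match expected
}
print(match_serials(dump, inventory))
```

A hit only narrows things down. If none of the member disks match, the string may well identify something else in the path, such as the controller.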

Also see if the HSZ80 error log and device configuration data has any relevant data and any related error messages. (SWCC stuff is linked here: http://64.223.189.234/node/564)

I've also seen bad slots in shelves, flaky firmware, bad controllers, and I know of a failing (failed?) disk that was so far out of balance that it shook so badly that it caused seek problems across other disks in the same mounting. But an old 36 GB disk brick looks like the best of the usual suspects here.

Or get somebody in to sort this out for you.

Stephen Hoffman
HoffmanLabs LLC
Jon Pinkley
Honored Contributor

Re: Troubleshooting Mount Verification

The latest HSZ controller I ever worked with was the HSZ70. If you have physical access to the controller, you should be able to connect to the preferred controller for the unit that the raid storageset is on.

If you are getting hardware errors, they should be generating event logs on the console port of the controller that is the "master" for the raid storage set. And those event logs will have the P T L for the device that is getting errors (if that is the cause).

Is SCORPO part of a cluster? Is $5$DKA3: the quorum disk? When we had our quorum disk on the HSZ70, it was not uncommon to get mount verification messages when the quorum disk was backed up using the "old" recommendation to give BACKUP as many resources as possible, so the disk queues could get quite long while backup was in use.

Jon
it depends
Hoff
Honored Contributor

Re: Troubleshooting Mount Verification

re: quorum disk I/O and BACKUP disk I/O

An OpenVMS engineer somewhere back in the mists of time had decided that the quorum I/O would be queued with the lowest priority, which meant it was politely queued up behind the other typical I/O flying around.

For a system-level cluster coordination function that involves one I/O every three seconds or so, and that leads to badness when sequential quorum I/Os are missed or otherwise delayed during a BACKUP or other I/O storm, that degree of I/O queue deference didn't seem particularly sensible, given the high cost and repercussions of missed quorum I/O and the infrequency of the quorum I/O. That then led to a conversation with the then-maintainer of the quorum watcher.

Off-hand, I don't know if the priority in the IRP has changed.
Volker Halle
Honored Contributor

Re: Troubleshooting Mount Verification

Mike,

DECevent V3.4 is still the tool of choice for translating SCSI-related errors. Or install WEBES (SEA) on your laptop or PC and copy over ERRLOG.SYS for analysis.

ANAL/ERR/ELV is useless in most cases, as it does not translate the most interesting part of most errlog entries.

The 'ZGxxx' serial number is most likely from the HSZ80 itself.

The reason for the mount-verifications seems to have been explained.

Volker.