Operating System - OpenVMS

Re: Problem with replacement LTO4 tapedrive.

 
P Muralidhar Kini
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Hi Brit,

>> $ Mount/Foreign/NoAssist $2$MGA5:
>> %MOUNT-I-MOUNTED, BACKA mounted on _$2$MGA5: (BUD)
The first time, the volume was mounted with the label BACKA.

>> $ Mount/Foreign $2$MGA5:
>> %MOUNT-I-MOUNTED, mounted on _$2$MGA5: (BUD)
The retry mount attempt on the volume is not showing any label.


>> %BACKUP-E-FATALERR, fatal error on $2$MGA5:[]OPENVMS.BCK;
>> -SYSTEM-F-VOLINV, volume is not software enabled
Looks like the volume valid bit is not set for the volume.

DCL help for VOLINV -

VOLINV, volume is not software enabled
Facility: SYSTEM, System Services

Explanation: The volume valid bit is not set for the volume. All physical
and logical I/O operations will be rejected until the bit is
set.

User Action: Check for a programming error. Verify that the volume is
mounted and loaded. Check to see that the power is on before
retrying the program.


Check the following link for general troubleshooting techniques for the
SYSTEM-F-VOLINV error.
http://bizsupport1.austin.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=110&prodSeriesId=459923&prodTypeId=18964&prodSeriesId=459923&objectID=c01508750

>> $ show dev/full mga5
>> Error count 3
Looks like the error count on the device has increased due to the problem.

>> (previously, the initialize was giving a parity error).
>> I am going to run another test using a different cassette.
It looks like the problem might be with the volume. Earlier, when you tried to
use the same volume, you were getting the parity error (at least sometimes).
The volume may be faulty, and that might be the cause of the problem.

Yes, the way forward would be to use a different volume for the backups.
Ideally, use a volume that you know is good and that was used recently
without any problems.


>> $ Robot Show Robot
...
Drives: 2
>>
There are 2 drives in total.
How is the other drive working? Is it also facing a similar problem during backups?

Regards,
Murali
Let There Be Rock - AC/DC
Shriniketan Bhagwat
Trusted Contributor

Re: Problem with replacement LTO4 tapedrive.

Hi,

>> %MOUNT-I-MOUNTED, mounted on _$2$MGA5: (BUD)
>> The retry mount attempt on the volume is not showing any label.

I have also faced this problem. I then initialized the tape with a label, and after that my BACKUPs started running fine. Try initializing the tape just before the BACKUP operation.

Regards,
Ketan
P Muralidhar Kini
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Ketan,

>> I have also faced this problem. then I did initialize the tape with
>> some label and then my BACKUPs started running fine.
If you look at the attachment, once the backup fails with "SYSTEM-F-VOLINV"
error, the error handling logic does a "Dismount/NoUnLoad" and then a
"MOUNT/FOR" for retry. Did you also try the same "Dismount/NoUnLoad"
command, by any chance?

In any case, the first error encountered by backup still needs investigation.
The results of backups run with a different volume would be interesting.

Regards,
Murali
Let There Be Rock - AC/DC
Volker Halle
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Dave,

Your SDA output shows that the 'hanging' operation was a MOUNT command. $2$MGA5: being 'busy' indicates that a QIO had been issued but had not yet finished.

If a MOUNT/FOR returns no label, this indicates either an unlabeled tape or an error reading the tape label.
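
For anyone scanning long batch logs for this symptom, here is a minimal Python sketch that reports the device and the label from a %MOUNT-I-MOUNTED line, returning an empty label for the unlabeled case. It is not part of the original backup job. The function name `mounted_label` and the regular expression are assumptions, based only on the two MOUNT messages quoted in this thread, so other message layouts may not match.

```python
import re

# Pattern inferred from the two messages quoted in this thread (an assumption):
#   %MOUNT-I-MOUNTED, BACKA mounted on _$2$MGA5: (BUD)
#   %MOUNT-I-MOUNTED, mounted on _$2$MGA5: (BUD)
MOUNTED = re.compile(r"%MOUNT-I-MOUNTED,\s*(?:(\S+)\s+)?mounted on\s+(\S+)")


def mounted_label(line):
    """Return (device, label) for a MOUNTED message, or None for other lines.

    The label is an empty string when the message shows no label, which
    suggests either an unlabeled tape or an error reading the tape label.
    """
    match = MOUNTED.search(line)
    if match is None:
        return None
    label, device = match.groups()
    return (device, label or "")


if __name__ == "__main__":
    for text in (
        "%MOUNT-I-MOUNTED, BACKA mounted on _$2$MGA5: (BUD)",
        "%MOUNT-I-MOUNTED, mounted on _$2$MGA5: (BUD)",
    ):
        print(mounted_label(text))
```

An empty label only flags the line for a closer look; it does not say which of the two causes applies, so the error log still has to be checked.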

Volker.
The Brit
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Just to answer a few questions which have arisen and not been answered.

1. The library is an MSL4048 (F/W 6.90)
2. Library contains 2 x HP Ultrium 4-SCSI tape drives (both drives at F/W H58W).
3. MRU is Version 1.8B.
4. OpenVMS version is 8.3-1H1, patched up to Update 7.
5. The script has been used successfully, without modification for over two years. (successfully used the day before replacement).
6. The script was used successfully yesterday to obtain the required backup USING THE OTHER DRIVE (MGA4).
7. After replacement, the first backup attempt was 90% completed before the process crapped out (2hr 20m into a 2hr 35m job).
8. The log file for that job indicated that the previous 21 savesets had backed up successfully.

7 & 8 imply that the new tape drive was successfully connected and accessible from the VMS side.

9. The first attempt ended with the process hung, (no IO or CPU for over 1 hour).
I terminated the job by deleting the entry and waiting for VMS to cleanup.

Since that first attempt, I have not managed to complete the backup of even 1 saveset. Every attempt has failed either with (see attachment)

----------------------------------
$ Init/OverRide=(Access,Expiration,Owner) $2$MGA5: BACKA
%INIT-F-PARITY, parity error

followed by

$ Mount/Foreign/NoAssist $2$MGA5:
%MOUNT-I-MOUNTED, mounted on _$2$MGA5: (BUD)

(Note no label)

and then

$ Backup/Image/NoAssist/NoCRC/Ignore=(Label,InterLock) -
DSA101: $2$MGA5:OpenVMS.Bck/Save_Set
%BACKUP-F-LABELERR, error in tape label processing on $2$MGA5:[000000]OPENVMS.BCK;
-SYSTEM-F-MEDOFL, medium is offline

---------------------------------------

or (as shown in my earlier attachment)


The other drive in the enclosure is working fine; in fact, I used that drive yesterday morning to redo the failed backup (making sure I had a good backup in the can).

Based on items 7/8, it seems to me that the drive is correctly attached. The fact that I can communicate with it at all implies that. And since there have been no changes to the process, I am inclined to reject any suggestions that indicate it might be some kind of programming or DCL error.

Dave

The Brit
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.


Volker,

On occasion, the job appeared to "get into" the backup of the first saveset, but examination of the process stats showed that although it was running the BACKUP image, no IO or CPU consumption was taking place.

Is there a series of SDA commands that I could use to get some information on what the process is actually doing? Even though I am not experienced with SDA, I would be more than happy to post the output for analysis by others.

Dave.
Volker Halle
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Dave,

If that process seems to 'hang', look at the following with SDA:

SDA> SET PROC/ID=
SDA> SHOW PROC/CHAN ! look for 'busy'
SDA> SHOW PROC/LOCK ! look for 'waiting' locks
SDA> SHOW DEV xxx ! any device that was shown as busy

SDA> SHOW RES/LOCK= ! of any waiting lock

If there is an IO outstanding (i.e. 'busy') to the magtape and it doesn't finish, there must be some kind of hardware/firmware/connectivity problem.

Volker.
Hoff
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Asking software folks for hardware help?

Following my preferred field circus approach, review the error logs and compare them to the previous errors. If you're seeing the same errors for the original and the new drive, then look at the other common components in the I/O path.

You may well have issues upstream from the drive.

For giggles, I'd swap the two tape drives in the chassis and see if the problem moves.

In the absence of logged packet data within the driver and in the absence of errors in the error log, the current state of the process isn't very helpful for these; using SDA to peek at BACKUP provides little output, as you're looking for specific grains of error data in that flood of I/O (or sometimes the lack of that returning data), and that's not easily visible outside of driver-level logging or the error logs. Sure, you can see BACKUP is wedged, but the antecedents won't usually be obvious.

As for peeking into the drivers, I don't recall off-hand if there's an SDA extension for the tape drivers, but I don't recall one. Here are the common SDA extensions, FWIW:

http://labs.hoffmanlabs.com/node/546

Parity errors are usually media errors. They can also be drive controller errors and cable errors; the other components that are involved with reading and writing parity. IIRC, you also have a SCSI connection in play here inside this MSL.

Being professionally distrustful of hardware, I'd also suspect the fibre and the HBAs, though the other drive appears to be functional, which tends to rule out common components. (Confirm you're using the same path for your drives; I don't remember if the MSL tapes can select different FC paths, but if they can, make sure you're on the same path for both drives. Swapping the two tapes within the cabinet will get you there, too.)

There are various MSL drives and parts available from a number of sources.

Review the error logs. Swap the two drives. Then (if the problem stays with the slot) move back up the I/O chain. If the problem moves with the drive, swap that drive for (another) spare.
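
The swap test above is a small decision procedure, and it can be written down as a Python sketch. This is only an illustration: the function name `swap_verdict`, the bay numbers, and the "inconclusive" branch for the case where the fault does not recur are assumptions that are not stated in the post.

```python
from typing import Optional


def swap_verdict(failing_bay_before: int, failing_bay_after: Optional[int]) -> str:
    """Interpret the result of swapping the two drives between bays.

    failing_bay_before: bay that showed the errors before the swap.
    failing_bay_after:  bay that shows the errors after the swap,
                        or None if the errors do not recur (assumed case).
    """
    if failing_bay_after is None:
        # Assumption: reseating may have cleared the fault, so nothing is proven.
        return "inconclusive: no failure after the swap; review the error logs again"
    if failing_bay_after == failing_bay_before:
        # Same bay still fails with a different drive in it.
        return "fault stays with the slot: move back up the I/O chain"
    # The errors followed the drive into the other bay.
    return "fault moved with the drive: swap that drive for another spare"


if __name__ == "__main__":
    # Hypothetical bay numbers 1 and 2 for the two drives in the library.
    print(swap_verdict(2, 2))
    print(swap_verdict(2, 1))
```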

P Muralidhar Kini
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Hi Brit,

>> The other Drive in the enclosure is working fine,
So the problem is only with the new LTO4 tape drive that you have replaced.

>> Since that first attempt, I have not managed to complete the backup of
>> even 1 saveset.
>> Based on items 7/8, it seems to me that the drive is correctly
>> attached. The fact that I can communicate with it at all implies that.
It seems to me that the connection must be OK, because the first time you
ran your backups, the job got 90% of the way and wrote about 21 savesets.
If there were a connection problem, the first backup would not have
progressed so far. Even though it is less likely, the cause could still be
related to tape drive connection issues.

>> although it was running the BACKUP image, no IO or CPU consumption
>> was taking place.
Using the SDA commands that Volker has suggested, it would be interesting
to know what is going on with the Backup at the time of the hang.

First Backup,
>> $ Robot Show Slot 12
>> SLOT: 12 100164L3

Second Backup,
>> $ Robot Show Slot 12
>> SLOT: 12 100167L3

I guess you ran the second backup using a different volume from the
first one and are still seeing the same problem.
In fact, in the second backup, the first INIT attempt on the volume is
giving you the "INIT-F-PARITY" error. This is strange.
So the problematic-volume explanation is ruled out.

A VMS parity error can have multiple causes when reported from a SCSI tape
drive. Looking at the VMS errorlog gives more details. One typical reason
for a SYSTEM-F-PARITY on a SCSI tape drive is a SCSI Blank Check, which means
that the software tried to read into a not-yet-written portion of the tape.
This would indicate an unexpected format on tape. For example, if a BACKUP
operation is cancelled in the middle of a save, the tape is left without
the trailing ANSI labels. The next attempt to append another save set to
the tape then fails with a parity error, and the errorlog shows a blank check.

What events are reported in the VMS ERRORLOG when the problem is seen?

Regards,
Murali
Let There Be Rock - AC/DC
Hoff
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

FWIW...

The WEBES tool for OpenVMS was recently removed from the HP service tools web site. The available path for analyzing hardware errors on OpenVMS now involves a tool chain and the transfer of the error data over to a Microsoft Windows box.

Which means you can either get those tools and that path set up, or you can download and use an older version of the WEBES tool and the DECevent DIAGNOSE tool (and given the MSL has been around for a while, you probably don't need the newest versions of either of these tools), or you can see if the integrated ANALYZE /ERROR /ELV tool and its TRANSLATE bit-to-text command gets you enough details around this tape device. (ELV knows about and can translate the core system devices, but isn't as good a choice when you're further afield. That's where you can need WEBES or DECevent DIAGNOSE.)

I have pointers and additional details posted.