Operating System - OpenVMS

Problem with replacement LTO4 tapedrive.

 
The Brit
Honored Contributor

Problem with replacement LTO4 tapedrive.

Yesterday I replaced an LTO4 tapedrive in a MSL4048 library.

After installing the drive, I loaded a scratch tape, initialized it, mounted it, and all seemed well.

Last night I ran a normal system backup (batch job) which appeared to be running fine until the last saveset, at which point the process appeared to hang.

Show Proc/Continuous showed the job was still running the Backup image; however, there was no I/O, buffered or direct, for over an hour. At that point, I killed the job.

It took the system ~5 minutes to clean up the process and exit.

The problem now is that the tapedrive keeps returning "medium is offline" messages whenever I try to initialize or mount a tape.

I am going to go onsite and cycle the power on the Library and drive to see if that helps.

If anyone has any other suggestions, I would appreciate them.

Dave.
22 REPLIES
P Muralidhar Kini
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Hi Brit,

>> Last night I ran a normal system backup (batch job) which appeared to be
>> running fine until the last saveset, at which point the process appeared
>> to hang.
I guess you mean VMS Backup itself and not ABS/MDMS.

>> The problem now is that the tapedrive keeps returning "medium is offline"
>> messages whenever I try to initialize or mount a tape.
I assume you have tried mounting different volumes on the tape drive,
just to rule out a bad volume.

Also, is the volume compatible with the drive?

You can use MRU (Media Robotic Utility) commands such as
ROBO SHOW DRIVE and ROBO LOAD/UNLOAD to check whether MRU
also has problems accessing the drive.

If the problem persists for a while, you should consider cleaning the
tape drive.

Regards,
Murali
Let There Be Rock - AC/DC
Hoff
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Based upon the description, this could be inferred to be a third-party or otherwise unsupported LTO tape drive, and the behavior here would then imply a compatibility issue.

Or this is a supported drive with incorrect firmware, or with a failure of some sort.

A bad SCSI connection.

Or bad media.

Or a command error within the procedure. Unfortunately, the last known latent bug has not yet been identified.

Cycling the drive might or might not clear the underlying error, though it may well clear the "medium is offline" stuff.

Check the error logs.

Check the batch log.

Check the OPCOM log.

Having a stuck process doesn't, by itself, mean all that much.
Shriniketan Bhagwat
Trusted Contributor

Re: Problem with replacement LTO4 tapedrive.

Hi,

Is the BACKUP going to a single tape or to multiple tapes? Does the BACKUP span over to another tape at any point?
Did you try the BACKUP with the /IGNORE=LABEL qualifier? What is the exact BACKUP command?

Regards,
Ketan
P Muralidhar Kini
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Hi Brit,

>> You can use the MRU (Media Robotic Utility) commands like
When I said this, I assumed you had MRU installed. If MRU is not installed,
these commands cannot be used.

Also, please provide the output of "$ SHOW DEVICE/FULL" for the tape drive.
Does it show the status as ONLINE or OFFLINE?

Once the backup failed and you killed the job, were you able to unload
that volume from the drive successfully, or did the unload fail with the
medium offline error?
Maybe the volume got stuck in the drive for some reason.

Regards,
Murali
Let There Be Rock - AC/DC
The Brit
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

When I arrived on site, the clean drive light was on and the Attention LED was flashing.

I inserted a cleaning tape in to the mailslot and used the front panel to initiate a clean on the new drive. The Clean LED subsequently went out.

The Attention LED was still flashing; however, when I moved a cassette into the drive, the Attention LED went out (although I can't say for sure that the two events were related).

Anyway, the good news is that I was able to initialize a tape and mount it. (previously, the initialize was giving a parity error).

I seem to be back at the point I was at after installing the replacement yesterday.

I will now run a test of the backup I tried last night to see if it is truly OK, or if I get the same outcome as last night.

I would like to thank the contributors for being connected on a Saturday. I will close the thread if all works out OK.

thanks

Dave.


P Muralidhar Kini
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Hi Brit,

>>When I arrived on site, the clean drive light was on and the Attention
>> LED was flashing.
>> Anyway, the good news is that I was able to initialize a tape and mount it.
Cool. So looks like cleaning the drive did the trick.

>> I will now run a test of the backup I tried last night to see if it is
>> truly OK, or if I get the same outcome as last night.
Yes. You might also want to use the same volume (or set of volumes, if it's a
multi-tape backup) for the backups.

>> I would like to thank the contributors for being connected on a Saturday.
You are welcome. This forum is always ON !!

Good luck with your backups.

Regards,
Murali
Let There Be Rock - AC/DC
Shriniketan Bhagwat
Trusted Contributor

Re: Problem with replacement LTO4 tapedrive.

Hi,
>> previously, the initialize was giving a parity error
A parity error indicates some problem with the tape. Please check the online help for the message: $ HELP/MESSAGE PARITY. You may want to check whether the same tape gives further parity errors by initializing it several times. If the parity error recurs, it's time to retire the tape.

Regards,
Ketan
The Brit
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Problem not resolved.

Although I was able to initialize and mount the tape cassette, and all seemed well, when I ran my backup test (using the same tape cassette as in the above test) I got the result shown in the attachment.

Notice that after the error occurs, the script jumps to an ERROR handling subroutine which remounts the tape and dismounts it. When the Error routine mounts the tape, it no longer shows a label, although it was initially initialized and mounted OK.

I am going to run another test using a different cassette.

A couple of other observations. I ran the LTT utility while the Batchjob was running and did a scan of MGA5. It said there was no media in the drive. ???

I entered SDA and did a "show proc /id=nnn /chan" and it indicated that MGA5 was in fact Open, and "Busy" even though the Process was showing no CPU use or IO.

SDA> show proc/id=2027CB7E/chan

Process index: 037E Name: BKUP_PHASE1 Extended PID: 2027CB7E
--------------------------------------------------------------------


Process active channels
-----------------------

Channel  CCB       Window    Status  Device/file accessed
-------  ---       ------    ------  --------------------
0010     7FF08000  00000000          DSA10:
0020     7FF08020  8DC81A00          DSA101:[VMS$COMMON.SYSEXE]VMOUNT.EXE;1
0030     7FF08040  8A9A6E40          DSA101:[VMS$COMMON.SYSLIB]LIBOTS.EXE;1 (section file)
0040     7FF08060  8A9A6DC0          DSA101:[VMS$COMMON.SYSLIB]LIBRTL.EXE;1 (section file)
0050     7FF08080  8A9B6280          DSA101:[VMS$COMMON.SYSEXE]DCL.EXE;1 (section file)
0060     7FF080A0  8A9A6C40          DSA101:[VMS$COMMON.SYSLIB]DCLTABLES.EXE;140 (section file)
0070     7FF080C0  8B488780          DSA10:[TESSCO.LOG_FILES.BACKUP]BKUP_PHASE1.LOG;556
0080     7FF080E0  8D6E3EC0          DSA10:[TESSCO.EON_COM_FILES]BKUP_PHASE1.COM;39
0090     7FF08100  8A9AB640          DSA101:[VMS$COMMON.SYSLIB]DECC$SHR.EXE;1 (section file)
00A0     7FF08120  8A9AAEC0          DSA101:[VMS$COMMON.SYSLIB]DPML$SHR.EXE;1 (section file)
00B0     7FF08140  8A9A9740          DSA101:[VMS$COMMON.SYSLIB]CMA$TIS_SHR.EXE;1 (section file)
00C0     7FF08160  8A9A58C0          DSA101:[VMS$COMMON.SYSLIB]MOUNTSHR.EXE;1 (section file)
00D0     7FF08180  00000000  Busy    $2$MGA5:

Total number of open channels : 13.
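
For a longer channel listing, the "Busy" entries can be picked out mechanically rather than by eye. The following is only a minimal Python sketch, not part of SDA: it assumes the listing has been captured as plain text, that each row starts with a 4-digit channel and two 8-digit hex fields as above, and that "Busy" is the only value that appears in the Status column. The names parse_channels and busy_targets are made up for illustration.

```python
import re
from typing import List, NamedTuple


class Channel(NamedTuple):
    channel: str
    ccb: str
    window: str
    status: str   # "Busy" or "" (assumption: no other status values)
    target: str   # device or file, plus any trailing note


# Row layout assumed from the listing above: channel, CCB, window, then the rest.
_ROW = re.compile(r"^([0-9A-F]{4})\s+([0-9A-F]{8})\s+([0-9A-F]{8})\s+(.*)$")


def parse_channels(text: str) -> List[Channel]:
    """Return one Channel per data row; headers and rulers are skipped."""
    rows = []
    for line in text.splitlines():
        m = _ROW.match(line.strip())
        if not m:
            continue
        chan, ccb, window, rest = m.groups()
        status = ""
        if rest.startswith("Busy "):
            status, rest = "Busy", rest[len("Busy "):].strip()
        rows.append(Channel(chan, ccb, window, status, rest))
    return rows


def busy_targets(text: str) -> List[str]:
    """Devices or files whose channel has an I/O outstanding."""
    return [c.target for c in parse_channels(text) if c.status == "Busy"]


SAMPLE = """\
Channel CCB Window Status Device/file accessed
0010 7FF08000 00000000 DSA10:
00C0 7FF08160 8A9A58C0 DSA101:[VMS$COMMON.SYSLIB]MOUNTSHR.EXE;1 (section file)
00D0 7FF08180 00000000 Busy $2$MGA5:
"""

if __name__ == "__main__":
    print(busy_targets(SAMPLE))
```

Run against the three sample rows, this reports only the tape device, which matches what the listing shows: the one channel with an I/O outstanding is the one to $2$MGA5:.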



Bob Blunt
Respected Contributor

Re: Problem with replacement LTO4 tapedrive.

Dave, the usual questions and recommendations:
VMS Version
Connection method (noted that it's a SAN-cnx drive)
Patches?

I presume that after swapping the drive you performed the other steps required to bring it online: the WWID will have changed when the drive was replaced, and VMS needs you to update the device structures for the new drive to work properly. Power cycling doesn't usually reset the device connection in the operating system.

bob
P Muralidhar Kini
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Hi Brit,

>> $ Mount/Foreign/NoAssist $2$MGA5:
>> %MOUNT-I-MOUNTED, BACKA mounted on _$2$MGA5: (BUD)
First time, the volume got mounted with Label BACKA.

>> $ Mount/Foreign $2$MGA5:
>> %MOUNT-I-MOUNTED, mounted on _$2$MGA5: (BUD)
The retry mount attempt on the volume is not showing any label.


>> %BACKUP-E-FATALERR, fatal error on $2$MGA5:[]OPENVMS.BCK;
>> -SYSTEM-F-VOLINV, volume is not software enabled
Looks like the volume valid bit is not set for the volume.

DCL help for VOLINV -

VOLINV, volume is not software enabled
Facility: SYSTEM, System Services

Explanation: The volume valid bit is not set for the volume. All physical
and logical I/O operations will be rejected until the bit is
set.

User Action: Check for a programming error. Verify that the volume is
mounted and loaded. Check to see that the power is on before
retrying the program.


check the following link for general troubleshooting techniques for
SYSTEM-F-VOLINV error.
http://bizsupport1.austin.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=110&prodSeriesId=459923&prodTypeId=18964&prodSeriesId=459923&objectID=c01508750

>> $ show dev/full mga5
>> Error count 3
Looks like the error count on the device has increased due to the problem.

>> (previously, the initialize was giving a parity error).
>> I am going to run another test using a different cassette.
Looks like the problem might be with the volume. Earlier, when you tried to
use the same volume, you were getting the parity error (maybe only sometimes).
The volume may be faulty, and that might be the cause of the problem.

Yes, the way forward would be to use a different volume for the backups.
Preferably use a volume that you know is good and that has been used recently
without any problems.


>> $ Robot Show Robot
...
Drives: 2
>>
There are 2 drives in total.
How is the other drive working? Is it also showing a similar problem during backups?

Regards,
Murali
Let There Be Rock - AC/DC
Shriniketan Bhagwat
Trusted Contributor

Re: Problem with replacement LTO4 tapedrive.

Hi,

>> %MOUNT-I-MOUNTED, mounted on _$2$MGA5: (BUD)
>> The retry mount attempt on the volume is not showing any label.

I have also faced this problem. I then initialized the tape with a label, and my BACKUPs started running fine. Try initializing the tape just before the BACKUP operation.

Regards,
Ketan
P Muralidhar Kini
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Ketan,

>> I have also faced this problem. then I did initialize the tape with
>> some label and then my BACKUPs started running fine.
If you look at the attachment, once the backup fails with "SYSTEM-F-VOLINV"
error, the error handling logic does a "Dismount/NoUnLoad" and then a
"MOUNT/FOR" for retry. Did you also try the same "Dismount/NoUnLoad"
command by any chance ?

In any case, the first error encountered by backup still needs investigation.
The results of backups run with a different volume would be interesting.

Regards,
Murali
Let There Be Rock - AC/DC
Volker Halle
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Dave,

Your SDA output shows that the 'hanging' operation was a MOUNT command. $2$MGA5: being 'busy' indicates that a QIO had been issued but had not yet finished.

If a MOUNT/FOR returns no label, this indicates either an unlabeled tape or an error reading the tape label.

Volker.
The Brit
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Just to answer a few questions which have arisen and not been answered.

1. The library is an MSL4048 (F/W 6.90)
2. Library contains 2 x HP Ultrium 4-SCSI tape drives (both drives at F/W H58W).
3. MRU is Version 1.8B.
4. OpenVMS version is 8.3-1H1, patched up to Update 7.
5. The script has been used successfully, without modification for over two years. (successfully used the day before replacement).
6. The script was used successfully yesterday to obtain the required backup USING THE OTHER DRIVE (MGA4).
7. After replacement, the first backup attempt was 90% completed before the process crapped out (2hr 20m into a 2hr 35m job).
8. The log file for that job indicated that the previous 21 savesets had backed up successfully.

7 & 8 imply that the new tape drive was successfully connected and accessible from the VMS side.

9. The first attempt ended with the process hung, (no IO or CPU for over 1 hour).
I terminated the job by deleting the entry and waiting for VMS to cleanup.
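
The "no IO or CPU for over 1 hour" test in item 9 can be stated precisely: take periodic samples of the process counters and measure how long they have been unchanged. A minimal Python sketch of that check follows. It assumes the samples (time in seconds, direct I/O count, buffered I/O count, CPU ticks) are collected by some other means; the names stalled_for and is_hung and the one-hour threshold are illustrative choices, not anything VMS provides.

```python
from typing import Sequence, Tuple

# (seconds since start, direct I/O count, buffered I/O count, CPU ticks)
Sample = Tuple[float, int, int, int]


def stalled_for(samples: Sequence[Sample]) -> float:
    """Seconds for which the counters have been unchanged at the end of the series."""
    if not samples:
        return 0.0
    last_t, *last_counters = samples[-1]
    start = last_t
    # Walk backwards until a sample with different counters is found.
    for t, *counters in reversed(samples[:-1]):
        if counters != last_counters:
            break
        start = t
    return last_t - start


def is_hung(samples: Sequence[Sample], threshold_s: float = 3600.0) -> bool:
    """True if no counter has moved for at least threshold_s seconds."""
    return stalled_for(samples) >= threshold_s


if __name__ == "__main__":
    history = [
        (0, 100, 50, 10),
        (600, 200, 60, 20),
        (1200, 200, 60, 20),
        (4800, 200, 60, 20),
    ]
    print(stalled_for(history), is_hung(history))
```

The measured stall is a lower bound, since the last real activity happened somewhere before the first unchanged sample.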

Since that first attempt, I have not managed to complete the backup of even 1 saveset. Every attempt has failed either with (see attachment)

----------------------------------
$ Init/OverRide=(Access,Expiration,Owner) $2$MGA5: BACKA
%INIT-F-PARITY, parity error

followed by

$ Mount/Foreign/NoAssist $2$MGA5:
%MOUNT-I-MOUNTED, mounted on _$2$MGA5: (BUD)

(Note no label)

and then

$ Backup/Image/NoAssist/NoCRC/Ignore=(Label,InterLock) -
DSA101: $2$MGA5:OpenVMS.Bck/Save_Set
%BACKUP-F-LABELERR, error in tape label processing on $2$MGA5:[000000]OPENVMS.BCK;
-SYSTEM-F-MEDOFL, medium is offline

---------------------------------------
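
When comparing several batch logs, it can help to extract just the error and fatal messages. OpenVMS status messages have the form %FACILITY-L-IDENT, text, where L is the severity (S, I, W, E or F), and linked secondary messages start with "-" instead of "%". The following Python sketch filters a log on that format; it assumes the log has been copied off as plain text, and the function name find_messages is hypothetical.

```python
import re
from typing import List, Tuple

# %FACILITY-L-IDENT, text   (primary message)
# -FACILITY-L-IDENT, text   (linked secondary message)
_MSG = re.compile(r"^[%-]([A-Z0-9$_]+)-([SIWEF])-([A-Z0-9$_]+),\s*(.*)$")


def find_messages(log_text: str, severities: str = "EF") -> List[Tuple[str, str, str, str]]:
    """Return (facility, severity, ident, text) for messages of the given severities."""
    hits = []
    for line in log_text.splitlines():
        m = _MSG.match(line.strip())
        if m and m.group(2) in severities:
            hits.append((m.group(1), m.group(2), m.group(3), m.group(4)))
    return hits


SAMPLE_LOG = """\
$ Init/OverRide=(Access,Expiration,Owner) $2$MGA5: BACKA
%INIT-F-PARITY, parity error
$ Mount/Foreign/NoAssist $2$MGA5:
%MOUNT-I-MOUNTED, mounted on _$2$MGA5: (BUD)
%BACKUP-F-LABELERR, error in tape label processing on $2$MGA5:[000000]OPENVMS.BCK;
-SYSTEM-F-MEDOFL, medium is offline
"""

if __name__ == "__main__":
    for facility, severity, ident, text in find_messages(SAMPLE_LOG):
        print(f"{facility}-{severity}-{ident}: {text}")
```

With the default filter, the informational MOUNT message is dropped and the three fatal messages from the excerpt above remain.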

or (as shown in my earlier attachment)


The other drive in the enclosure is working fine; in fact, I used that drive yesterday morning to redo the failed backup (making sure I had a good backup in the can).

Based on items 7/8, it seems to me that the drive is correctly attached. The fact that I can communicate with it at all implies that. And since there have been no changes to the process, I am inclined to reject any suggestions that indicate it might be some kind of programming or DCL error.

Dave

The Brit
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.


Volker,

On occasion, the job appeared to "get into" the backup of the first saveset, but examination of the process stats showed that although it was running the BACKUP image, no IO or CPU consumption was taking place.

Is there a series of SDA commands that I could use to get some information on what the process is actually doing? Even though I am not experienced with SDA, I would be more than happy to post the output for analysis by others.

Dave.
Volker Halle
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Dave,

if that process seems to 'hang', look at the following with SDA:

SDA> SET PROC/ID=
SDA> SHOW PROC/CHAN ! look for 'busy'
SDA> SHOW PROC/LOCK ! look for 'waiting' locks
SDA> SHOW DEV xxx ! any device that was shown as busy

SDA> SHOW RES/LOCK= ! of any waiting lock

If there is an IO outstanding (i.e. 'busy') to the magtape and it doesn't finish, there must be some kind of hardware/firmware/connectivity problem.

Volker.
Hoff
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Asking software folks for hardware help?

Following my preferred field circus approach, review the error logs and compare them to the previous errors. If you're seeing the same errors for the original and the new drive, then look at the other common components in the I/O path.

You may well have issues upstream from the drive.

For giggles, I'd swap the two tape drives in the chassis and see if the problem moves.

In the absence of logged packet data within the driver and in the absence of errors in the error log, the current state of the process isn't very helpful for these; using SDA to peek at BACKUP provides little output, as you're looking for specific grains of error data in that flood of I/O (or sometimes the lack of that returning data), and that's not easily visible outside of driver-level logging or the error logs. Sure, you can see BACKUP is wedged, but the antecedence won't usually be obvious.

As for peeking into the drivers, I don't know off-hand whether there's an SDA extension for the tape drivers; I don't recall one. Here are the common SDA extensions, FWIW:

http://labs.hoffmanlabs.com/node/546

Parity errors are usually media errors. They can also be drive controller errors and cable errors; the other components that are involved with reading and writing parity. IIRC, you also have a SCSI connection in play here inside this MSL.

Being professionally distrustful of hardware, I'd also suspect the fibre and the HBAs, though the other drive appears to be functional, which tends to rule out common components. (Confirm you're using the same path for your drives; I don't remember if the MSL tapes can select different FC paths, but if they can, make sure you're on the same path for both drives. Swapping the two tapes within the cabinet will get you there, too.)

There are various MSL drives and parts available from a number of sources.

Review the error logs. Swap the two drives. Then (if the problem stays with the slot) move back up the I/O chain. If the problem moves with the drive, swap that drive for (another) spare.
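
The swap test above is a two-way decision, and it can be written down so that each outcome maps to a next step. This is only an illustrative Python sketch of that reasoning; the function name interpret_swap and the wording of the two extra outcomes (both slots failing, or neither) are additions for completeness rather than anything stated above.

```python
def interpret_swap(fails_in_original_slot: bool, fails_in_other_slot: bool) -> str:
    """Interpret the result of physically swapping the two drives.

    fails_in_original_slot: the slot that held the suspect drive still fails
                            (it now holds the known-good drive).
    fails_in_other_slot:    the slot that now holds the suspect drive fails.
    """
    if fails_in_original_slot and fails_in_other_slot:
        # Not covered in the post: added as an assumption.
        return "both slots fail: suspect media or a component common to both drives"
    if fails_in_original_slot:
        return "problem stayed with the slot: move back up the I/O chain"
    if fails_in_other_slot:
        return "problem moved with the drive: swap that drive for another spare"
    # Not covered in the post: added as an assumption.
    return "no failure reproduced: inconclusive, keep checking the error logs"


if __name__ == "__main__":
    print(interpret_swap(fails_in_original_slot=False, fails_in_other_slot=True))
```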

P Muralidhar Kini
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Hi Brit,

>> The other Drive in the enclosure is working fine,
So the problem is only with the new LTO4 tape drive that you have replaced.

>> Since that first attempt, I have not managed to complete the backup of
>> even 1 saveset.
>> Based on items 7/8, it seems to me that the drive is correctly
>> attached. The fact that I can communicate with it at all implies that.
It seems to me that the connection must be OK, because the first time
you ran your backup it got 90% of the way and wrote about 21 savesets.
If there were a connection problem, the first backup would not have
progressed so far. Even though the chances are small, the cause could still
be related to tape drive connection issues.

>> although it was running the BACKUP image, no IO or CPU consumption
>> was taking place.
From the SDA commands that Volker has suggested, it would be interesting
to know what is going on with the Backup at the time of the hang.

First Backup,
>> $ Robot Show Slot 12
>> SLOT: 12 100164L3

Second Backup,
>> $ Robot Show Slot 12
>> SLOT: 12 100167L3

I guess you ran the second backup using a different volume from the
first one and are still seeing the same problem.
In fact, in the second backup, the first INIT attempt on the volume
gives you the "INIT-F-PARITY" error. This is strange.
So the problematic-volume option is ruled out.

VMS parity error can have multiple causes when reported from a SCSI tape
drive. Looking at the VMS errorlog gives more details. One typical reason
for a SYSTEM-F-PARITY on a SCSI tape drive is a SCSI Blank Check. It means
that the software tried to read into a yet unwritten portion of the tape.
This would indicate an unexpected format on tape. For example if a BACKUP
operation is cancelled in the middle of a save the tape is left without
the trailing ANSI labels. The next attempt to append another save set to
the tape then fails with a parity error and the errorlog shows a blank check.

What events are reported in the VMS ERRORLOG when the problem is seen?

Regards,
Murali
Let There Be Rock - AC/DC
Hoff
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

FWIW...

The WEBES tool for OpenVMS was recently removed from the HP service tools web site. The available path for analyzing hardware errors on OpenVMS now involves a tool chain on and the transfer of the error data over to a Microsoft Windows box.

Which means you can either get those tools and that path set up, or you can download and use an older version of the WEBES tool and the DECevent DIAGNOSE tool (and given the MSL has been around for a while, you probably don't need the newest versions of either of these tools), or you can see if the integrated ANALYZE /ERROR /ELV tool and its TRANSLATE bit-to-text command gets you enough details around this tape device. (ELV knows about and can translate the core system devices, but isn't as good a choice when you're further afield. That's where you can need WEBES or DECevent DIAGNOSE.)

I have pointers and additional details posted.
Shriniketan Bhagwat
Trusted Contributor

Re: Problem with replacement LTO4 tapedrive.

Hi,

After replacing the LTO4 tape drive in the MSL4048 library, did you execute the commands below?

MC sysman io list
MC sysman io find
MC sysman io Auto

OR

Mc sysman io replace /WWID


Regards,
Ketan
P Muralidhar Kini
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

Hi Brit,

The problem of hang when performing a tape backup can also happen
because of a faulty SCSI terminator.

Check the following link -
http://forums11.itrc.hp.com/service/forums/questionanswer.do?threadId=1426068

A similar kind of problem was seen there: the backups hung for a long
period of time, after which they had to be aborted manually.

Regards,
Murali
Let There Be Rock - AC/DC
The Brit
Honored Contributor

Re: Problem with replacement LTO4 tapedrive.

First I would like to thank you all for your suggestions and assistance over the weekend.

I have not found a clear explanation for the problem described in my original post (sad to say), but on the semi-good side, the problem may have gone away.

Summary of yesterday.

Arrived to find the library in the same state as Saturday, that is, Error LED On, Attention LED flashing, Clean LED On.

Steps performed.
1. Unloaded the Drive
2. Power Cycled the Library (Error LED now Out)
3. Ran a cleaning tape (Clean LED now Out) (Note: this is the second clean in 2 days, without a successful backup in between)

Attention LED is still flashing, and Library Status shows (!). Checked all of the slots and found one cassette with the (!) alongside. Removed this cassette from the Library and discarded.

Attention LED now Out, and Library status is good.

Incidentally, the cassette that I removed was one of those being used for testing over the weekend, and may have been the cause of many of the "Parity" and other errors. In retrospect, the "scratch" tapes which are kept in reserved slots in the library, and are used for automatic recovery when a "severe error" occurs in the backup job, have been in the library for close to 2 years. While the lifetime for tape media should be longer than that (5 years I believe), the age of the tapes may have been a contributing factor.

At ~9am yesterday I submitted a test of the failing backup, using this problem drive (Drive 1, MGA5), and it ran successfully to completion. In addition, the normally scheduled backup ran last night without any problems.

I am not totally convinced that I am off the hook yet; I will need a couple of days of success before I start to relax. The main problem is that I do not have a clear picture of what caused my problem, or what caused it to go away. I expect that I will just have to accept the situation and move on.

I just wish I had an explanation for the initial hang which occurred on Friday evening.

Oh Well!

Thanks again for your help.

Dave.