StoreEver Tape Storage

I/O errors on DLT 7000 drives

 
SOLVED
John de Villiers
Frequent Advisor

We have a StorageTek library with 8 DLT 7000 drives. The drives are connected to a SAN via two FC-SCSI bridges. The SAN switch is a Brocade br16.

These all connect to 5 different HP-UX hosts. Omniback 4.0 is used for the backups.

We have had several drives replaced due to consistent I/O errors during backups. Most of the time we wait to see if it happens again on the same drive, and if it does we log a call to have it looked at. HP then normally replaces the drive.

What I find very suspicious is that most of the I/O errors happen on drives 1, 2 & 8.
Besides an actually faulty or overworked drive, what else can I look at to try and find this problem?
Vincent Farrugia
Honored Contributor

Re: I/O errors on DLT 7000 drives

Hello,

Are drives 1,2 and 8 used more than the others? If so, that might explain your problem.

HTH,
Vince
Tape Drives RULE!!!

Re: I/O errors on DLT 7000 drives

1. Has the EMS dm_stape process been disabled on all five HP-UX hosts?
2. Has the kernel parameter st_ats_enabled been set to 0?
3. Have all 'rewind' type tape device files been removed from /dev/rmt?
4. Have you set a consistent lock name for the tape devices? (The output from omnidownload would be useful to determine this.)
5. Have you checked that all your SAN connections are in fabric-logon mode rather than quickloop?

HTH

Duncan

I am an HPE Employee
John de Villiers
Frequent Advisor

Re: I/O errors on DLT 7000 drives

Vincent: All the tapes have similar usage. I tried to spread the tape allocation in Omniback as much as possible. None of the drives are idle between 18:00 and 06:00 the next morning. They're all running, doing a total of 14 online SAP/Oracle backups (one of which is 1 TB in size).

Duncan: 1) Where do I check this?
2) Yes
3) No. Is this an issue?
4) Yes, and index numbers
5) All SAN connections are Fabric. Currently only tape devices occupy the switch, so there aren't even any zones. Everything can see all the drives down all the paths, which means I get to see 20 drives instead of 10. That is not an issue for me: I split access to the drives across 2 paths (2 FC cards in each host dedicated to tapes). The first 4 drives go down one path and the next 4 go down the other.

Re: I/O errors on DLT 7000 drives

1) Where do I check this?
Just run ps -ef | grep stape. You want to see no dm_stape processes. If you do find some, they can cause intermittent I/O errors on the tape devices. The procedure to stop them running is a little convoluted (and also slightly out of date if you have the latest version of EMS installed), but I have attached what I have used in the past. If this doesn't work, HP should have a more up-to-date procedure.
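For anyone wanting to repeat the check on several hosts, here is a minimal sketch of the same ps test wrapped in two shell functions. The function names are made up for illustration, and the bracketed pattern is only there so that grep does not count its own entry in the process list.

```shell
#!/bin/sh
# Count dm_stape processes in `ps -ef` style text read from stdin.
# The pattern [d]m_stape matches "dm_stape" but not grep's own command line.
count_dm_stape() {
    grep -c '[d]m_stape'
}

# Print a one-line verdict for the process list read from stdin.
# (report_dm_stape is a hypothetical helper name, not an HP tool.)
report_dm_stape() {
    n=$(count_dm_stape)
    if [ "$n" -eq 0 ]; then
        echo "OK: no dm_stape processes"
    else
        echo "WARNING: $n dm_stape process(es) running"
    fi
}
```

On each host this would be run as: ps -ef | report_dm_stape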

3) No. Is this an issue?
Yes. Basically, when certain tools (Ignite/UX, for instance) scan the device files of rewind tape devices, they can cause the tape to rewind even though it is in use. This only happens in SAN environments. I knocked up a startup script to always take care of this (use at your own risk, of course!). I'll have to attach it separately.

HTH

Duncan



Solution

Re: I/O errors on DLT 7000 drives

Here's the script...

HTH

Duncan

David Ruska
Honored Contributor

Re: I/O errors on DLT 7000 drives

Duncan,
Your EMS disabling instructions mention:

7. Rename the dm_stape binary (optional; makes sure dm_stape will not run at all)
as follows:

You don't want to do this, as dm_stape is also used as a decoder for logtool.

As you mentioned, that procedure is a bit out of date (and we wrote it, so the above error is no fault of yours :-).

Here's the preferred method to resolve the EMS issue:

Workaround #1: Prevent dm_stape polling - set POLLING_INTERVAL configuration to zero.

This is the preferred and simplest workaround to implement. This feature was introduced in the September 2001 HWE (IPR0109) bundle release. Also, in the December 2001 HWE (IPR0112) bundle this configuration is defaulted to zero (but only if the dm_stape.cfg file has not been modified by the customer after initial install).

IMPORTANT NOTE: With HWE releases previous to September 2001, simply setting the polling interval to zero can have negative side effects and require the installation of an Interim (unofficial) Patch.

Set the "POLL_INTERVAL" value in the /var/stm/config/tools/monitor/dm_stape.cfg file to 0 to stop the monitor from polling.
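To see what the file currently holds before editing it, something like the sketch below may help. It only reads the file. The function name is hypothetical, the key is matched loosely because it is written both as POLLING_INTERVAL and POLL_INTERVAL above, and treating lines that start with # as comments is an assumption about the file format.

```shell
#!/bin/sh
# Print the non-comment lines that mention polling in a dm_stape config file.
# Default path is the one quoted in the post; pass another path to override.
show_poll_setting() {
    cfg=${1:-/var/stm/config/tools/monitor/dm_stape.cfg}
    if [ ! -r "$cfg" ]; then
        echo "cannot read $cfg" >&2
        return 1
    fi
    # Assumption: comment lines start with '#'; they are filtered out.
    grep -i 'POLL' "$cfg" | grep -v '^[[:space:]]*#'
}
```

Run show_poll_setting with no arguments on each host and compare the value shown with 0.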

Duncan, if you send your email address to ltt_team@hp.com, I can get you an up-to-date procedure.
The journey IS the reward.
John de Villiers
Frequent Advisor

Re: I/O errors on DLT 7000 drives

2 of the 5 servers in the SAN show this process running. On both of them the polling interval has already been set to 0.

Should the process show up or not?

The two systems it shows up on are relatively new HP-UX 11.11 (N4000-55) boxes, whereas the rest are older HP-UX 11.0 (V & K Class) boxes.
John de Villiers
Frequent Advisor

Re: I/O errors on DLT 7000 drives

Duncan, with regard to your script for removing the auto-rewind devices.

I'd like to remove all the devices and only use the ones I created manually.
I use /dev/rmt/D1...10 for my drives so that I can easily tie a drive to its position in the StorageTek.

Some of the device names that get created are the normal /dev/rmt/8mn type and others look like /dev/rmt/C1T1D4BEST, which makes for a really confusing list when you run ioscan -fnC tape.

I'll write a script that searches for a D1..10 designator for each hardware path and, if one is found, removes all other device files but the D1..10 one. That way new drives will show up, but ones with a custom name will only have the custom device file.
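A rough sketch of that idea, as a dry run that only prints candidates and deletes nothing. It assumes the usual ioscan -fnC tape layout (one line starting with "tape" per device, followed by lines of device file paths) and assumes the custom D files are listed alongside the others for the same device. The function name list_removable is hypothetical.

```shell
#!/bin/sh
# Read `ioscan -fnC tape` style text on stdin. For every tape device that has
# a custom /dev/rmt/D<number> file, print every other /dev/rmt file listed for
# that device. Devices without a custom file are left alone.
list_removable() {
    awk '
        function emit(   i) {
            if (has_custom)
                for (i = 1; i <= n; i++)
                    if (files[i] !~ /^\/dev\/rmt\/D[0-9]+$/) print files[i]
            n = 0; has_custom = 0
        }
        $1 == "tape" { emit(); next }   # a new device entry starts here
        {
            for (f = 1; f <= NF; f++)
                if ($f ~ /^\/dev\/rmt\//) {
                    files[++n] = $f
                    if ($f ~ /^\/dev\/rmt\/D[0-9]+$/) has_custom = 1
                }
        }
        END { emit() }
    '
}
```

Usage would be ioscan -fnC tape | list_removable, with the output reviewed by hand before anything is passed to rm.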

We don't use anything other than Omniback and tar to access the tapes, and tar only on very rare occasions (like sending a core dump to HP ;-) ).

Make sense?

John

Re: I/O errors on DLT 7000 drives

John,

With regards to dm_stape still running... I'm not sure as I've never implemented this the way David describes.

Re the script to remove rewind devices - the process you describe sounds fine... You do need to have this running at boot time as otherwise the device files may get re-created by insf. At some sites where I've implemented this, we have also introduced automated checks for rewind tape devices as administrators or CEs may run 'insf -e' without realising the consequences.
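As an illustration of what such an automated check might look like, here is a minimal name-based sketch. It assumes the standard HP-UX naming where a trailing "n" or "nb" marks a no-rewind device file (0mn, 0mnb) and anything else (0m, 0mb, and names ending in BEST) rewinds on close. Custom names such as D1 do not follow that convention, so they would need to be handled separately. The function name is hypothetical.

```shell
#!/bin/sh
# Print entries under a directory whose names lack the no-rewind suffix.
# Default directory is /dev/rmt; pass another directory to override.
find_rewind_names() {
    dir=${1:-/dev/rmt}
    for f in "$dir"/*; do
        [ -e "$f" ] || continue        # directory empty or glob unmatched
        case ${f##*/} in
            *n|*nb) ;;                 # treated as no-rewind, skip
            *) echo "$f" ;;            # assumed to rewind on close
        esac
    done
}
```

Run from cron or a startup script, any output from find_rewind_names would mean rewind device files have reappeared, for example after someone ran insf -e.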

David,

Thanks for the update! I knew there was a better way of doing this - I will drop an e-mail for your attention to the address you gave.

Thanks

Duncan

