Rick Garland
Honored Contributor

Numerous disk errors

Hi all:

I have a problem with disk drives in an XP256 that some folks want to associate with the backups performed by Data Protector. I am hesitant to do so.

First, a description of the configuration:

All servers run HP-UX 11.00 and are either L class or RP class. Each server has 2 HBA cards (A5158), one for the data traffic and one for the backup traffic. The primary data is stored on an XP256 disk array. Backups use Data Protector 5.0 to an HP 70/700 silo.

From one HBA on each server, the data routes to the XP via an HP Hub S10. From the other HBA on the servers, the traffic is routed through a Brocade Silkworm 2800, then to either a SCSI Bridge FC 4/2 or a Bridge 2/1 LV (for DLT or LTO tape drives).

I have attached a portion of a syslog file from one of the servers.

Granted, the times match the backup in this example, but this is not true of all servers.

The end result is that the system crashes, the backup fails, or the backup takes far longer than usual to complete.

This happens only with the backup of the Oracle database. If I back up the OS or the archives on the same system, none of these errors appear and things work OK.

Hopefully someone out there has encountered this issue before.

Many thanks!

6 REPLIES
T G Manikandan
Honored Contributor

Re: Numerous disk errors

SCSI errors are due to:

1. Improper SCSI termination or SCSI cabling.

2. The disk not responding within the timeout period. The timeout should be increased with pvchange -t.

3. As your messages relate to read errors, I suspect the hard disk 0x06b400.

Find out which hard disk that is using

ll /dev/dsk|grep 0x06b400

You can then just run

diskinfo

on it to check whether the disk is bad.

The message reporting 0x1f07b400 as POWERFAILED is probably because the devices were not responding, so the other disk gave those messages.
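
As a rough aid to the lookup above, the minor number can also be decoded by hand. This is a minimal sketch, assuming the usual HP-UX 11.x disk minor-number layout (8 bits controller instance, 4 bits target, 4 bits LUN, 8 bits options) and a shell such as bash that accepts 0x-prefixed numbers in arithmetic. The function name decode_minor is made up for illustration; the result should still be confirmed against the ll output.

```shell
# Decode an HP-UX style disk minor number (0xIITLOO) into a cXtYdZ name.
# Assumed layout: II = controller instance, T = target, L = LUN, OO = options.
decode_minor() {
    minor=$(( $1 & 0xffffff ))          # keep the low 24 bits, dropping any major number
    card=$(( (minor >> 16) & 0xff ))    # controller instance
    target=$(( (minor >> 12) & 0xf ))   # SCSI target
    lun=$(( (minor >> 8) & 0xf ))       # LUN
    printf 'c%dt%dd%d\n' "$card" "$target" "$lun"
}

decode_minor 0x06b400      # minor number from the read errors
decode_minor 0x1f07b400    # full device number from the POWERFAILED message
```

Under that assumption the two values decode to the same target and LUN behind two different controller instances, which may be worth checking against the actual device files.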


Revert
T G Manikandan
Honored Contributor

Re: Numerous disk errors

Also make sure that you have all the LVM patches updated on the system
T G Manikandan
Honored Contributor

Re: Numerous disk errors

Steven E. Protter
Exalted Contributor

Re: Numerous disk errors

Are there any unused disks on the SCSI chain?

I've seen problems like this when a disk with no data on it has failed. It can affect other disks up and down the chain.

The lack of lbolts shows nothing is critical yet, though that could soon change.

I had problems like this for months on a D box, with disks that had data on them failing about once a month. Finally, I had time to do an inventory and test all the disks, data or not, with xstm. I exercised the disks and found the bad one.

After hot swapping the bad disk out, the messages stopped.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Rita C Workman
Honored Contributor

Re: Numerous disk errors

Have you checked for errors on your HBAs?

fcmsutil /dev/fcms<#> stat > fcms<#>.txt

Then go and check that file. I look for Link Failed Count, Loss of Signal Count, and the like. If you see some non-zero numbers there, then you may have a problem with your HBA getting flaky.
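
A quick way to scan such a saved file might look like the sketch below. The counter names are the ones mentioned above; the assumption that each counter sits on its own line with the value as the last field is mine and may not match the real fcmsutil output on every card, so treat the function (nonzero_link_counters is a made-up name) as a starting point only.

```shell
# Print the counter lines of interest whose value is greater than zero.
# ASSUMPTION: one counter per line, numeric value in the last field.
nonzero_link_counters() {
    grep -E 'Link Failed Count|Loss of Signal Count' "$1" |
        awk '$NF ~ /^[0-9]+$/ && $NF + 0 > 0'
}
```

It would be called on the file written by the redirect above; no output means neither counter was non-zero under the assumed layout.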

Not sure this is your problem, but it is something to check. Besides, I just got to use these commands this past weekend...wondered how long before I found a thread that might apply.

Rgrds,
Rita

...had similar problem with some very old A3404A cards once...

Rick Garland
Honored Contributor

Re: Numerous disk errors

Still hacking at it. Have verified and exercised the disks in the XP, have previously set the timeout (pvchange -t) to 180, have removed some fiber cables that were suspect, etc.

Still nothing.

At this point it is thought that we have too many LUNs for the HW to handle and this could be causing a LIP storm. I have made up some quick and dirty scripts for the systems to collect the fcmsutil data over time and see what may be happening.

Right now, I sure don't.

Any other suggestions are greatly welcome.
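
For anyone wanting to do the same kind of collection, a minimal sketch of such a script is below. It is not the script actually used here; the function name collect, the log path, and the instance number in the usage line are all placeholders.

```shell
# Run a command repeatedly and append its timestamped output to a log file.
# Usage: collect <logfile> <samples> <interval-seconds> <command> [args...]
collect() {
    log=$1; count=$2; interval=$3
    shift 3
    i=0
    while [ "$i" -lt "$count" ]; do
        # Timestamp header so samples can be lined up against syslog entries
        printf '=== %s ===\n' "$(date '+%Y-%m-%d %H:%M:%S')" >> "$log"
        "$@" >> "$log" 2>&1
        i=$((i + 1))
        # Sleep between samples, but not after the last one
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `collect /tmp/fcms0_stat.log 288 300 fcmsutil /dev/fcms0 stat` would take a sample every 5 minutes for a day, assuming /dev/fcms0 is the HBA device file in question. Comparing the timestamps of counter jumps with the syslog errors should show whether they line up with the backup window.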