
scsi problems causing backup to fail; what's going on?

 
Mark Vollmers
Esteemed Contributor

Hi, all. I've got a problem here that I could use help on. I'll try to include everything that's been going on.
On May 4, our backup failed. One of the NFS-mounted NT stations was powered off, so I assumed that this was the problem. Monday came, and I tried to run a backup (manually) in the morning. I got errors saying that the device output file for /dev/rmt/0m was bad. Sorry I don't have the exact error. I told it to keep going (without backing up the NFS mounts), and it worked fine. That night, the regular backup (all files plus the NFS stuff) died again. The br_log has an exit code of 2 for the backup command. I tried the backup again during the day, and it did not work. It would start with the output file errors and exit in the middle. I finally rebooted the server (thinking that the NT station being turned off had seriously screwed it up) and let backup run. It appears to have run successfully last night.

The kicker is the syslog. It is peppered with SCSI errors. Every time I ran a backup for the last two days, these errors appeared. It looks like a new set is there from last night, even though the backup succeeded. I have attached the syslog entries and one of the mails to root from backup. I have no idea what these all mean. I assume that they appear when backup fails and show why, but what about the ones for the successful backup (May 8)?

We had similar messages a while back, and HP replaced the SCSI card on the server. Could it be the RAID drive or controller? What about the server? What is backup doing to trigger this? I'd like an idea of which tree to bark up. Any ideas are welcome; I'm all out of my own. Thanks!
"We apologize for the inconvience" -God's last message to all creation, from Douglas Adams "So Long and Thanks for all the Fish"
20 REPLIES
Vincenzo Restuccia
Honored Contributor

Re: scsi problems causing backup to fail; what's going on?

Verify SCSI termination, disk, BCC and controller.
Thierry Poels_1
Honored Contributor

Re: scsi problems causing backup to fail; what's going on?

hi,
Was anything changed on your server? This does indeed sound like SCSI problems: duplicate SCSI address, bad terminator, ...
good luck,
Thierry.
All unix flavours are exactly the same . . . . . . . . . . for end users anyway.
Mark Vollmers
Esteemed Contributor

Re: scsi problems causing backup to fail; what's going on?

The RAID drive has all slots either terminated or with cables in them. Likewise with the server. I didn't notice anything odd with the controller. We recently had to reconfigure it because it lost the SCSI channel info after a shutdown (we replaced the controller casing and reconfigured), but that was a month ago and it had worked fine afterwards. I don't understand where all the disconnects are coming from, since I can access /home (on the RAID) fine, so that has not crashed on me.
"We apologize for the inconvience" -God's last message to all creation, from Douglas Adams "So Long and Thanks for all the Fish"
A. Clay Stephenson
Acclaimed Contributor

Re: scsi problems causing backup to fail; what's going on?

Hi Mark,

Definitely looks like SCSI problems; I would make sure that the bus is properly terminated on both ends. Did HP install the resistor packs in the controller (if the controller is at the end of the bus)? I once had a problem like this on a K-box, and it turned out that one of the terminators was bad, but replacing it didn't fix the problem: the bad terminator had blown the on-board term power fuse.
Also, are you anywhere near maximum cable length?

My 2 cents, Clay
If it ain't broke, I can fix that.
Mark Vollmers
Esteemed Contributor

Re: scsi problems causing backup to fail; what's going on?

I can try to track down a new terminator and cable to check those. I'm not sure that I understand what's going on, though. If there is a bad cable (or terminator), why is backup the only thing that is having problems? If the cable were bad, for example, wouldn't that take out the whole drive? And why did backup work last night but still give SCSI errors?

The only thing on the server that was different was the one NT that got turned off but was still mounted, so the server would look for it but not find it. It was powered back on Mon, and the server rebooted Tues. night. The NFS mount shouldn't be interacting with the RAID, but I could be wrong. Thanks for the ideas so far.

Mark
"We apologize for the inconvience" -God's last message to all creation, from Douglas Adams "So Long and Thanks for all the Fish"
A. Clay Stephenson
Acclaimed Contributor

Re: scsi problems causing backup to fail; what's going on?

Mark,

One other thought. You didn't mention the tape device type. If it's a DLT (and especially a DLT7000 or 8000, or an Ultrium) it shouldn't be on the same bus.

Clay
If it ain't broke, I can fix that.
paul courry
Honored Contributor

Re: scsi problems causing backup to fail; what's going on?

Clay, this is what I was thinking. He alluded to a change in configuration about a month ago. Since DLTs are bandwidth hogs, they should be given their own SCSI card. However, problems with placing other devices on the same card won't necessarily show up immediately; they'll show up either when you start to use them or when you get spikes in the bandwidth usage.
Mark Vollmers
Esteemed Contributor

Re: scsi problems causing backup to fail; what's going on?

Clay-
It is a HP DDS-3 Dat24 external drive.
"We apologize for the inconvience" -God's last message to all creation, from Douglas Adams "So Long and Thanks for all the Fish"
Mark Vollmers
Esteemed Contributor

Re: scsi problems causing backup to fail; what's going on?

Sorry, sent that off a little early. I don't think that they are on the same card. I could be wrong (hardware is not my specialty), but the drive cable is near the top and oriented horizontally, and the tape cable is on the left side and oriented vertically. The drive card was replaced in Jan (I think). The backup has been running daily successfully for a while.

Mark
"We apologize for the inconvience" -God's last message to all creation, from Douglas Adams "So Long and Thanks for all the Fish"
Volker Borowski
Honored Contributor

Re: scsi problems causing backup to fail; what's going on?

Mark,

if you have some space on two different disks,
I would put some disk-to-disk traffic on the bus to check it.

For example, copy a 500 MB lvol using dd or similar.
If this works, I think it has to be the tape (or tape cable).

If you get faults during this copy as well, maybe even with the tape disconnected, it points more toward controller trouble or a disk going bad.
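
A rough sketch of that copy test, in case it helps. The function name copy_test and the paths in the usage comment are placeholders I made up, not names from this system. Be careful with the destination: writing to a raw lvol destroys whatever is on it, so point it at a scratch lvol or a scratch file on the second disk.

```shell
#!/bin/sh
# Disk-to-disk copy test: read MB megabytes from SRC, write them to DST,
# and report whether dd finished without an I/O error.
# SRC, DST and the function name are placeholders for illustration only.
copy_test() {
    src=$1
    dst=$2
    mb=${3:-500}            # default to the 500 MB suggested above
    if dd if="$src" of="$dst" bs=1024k count="$mb" 2>/dev/null; then
        echo "copy OK"
    else
        echo "copy FAILED"
        return 1
    fi
}

# Hypothetical usage (source lvol on one disk, scratch file on another):
#   copy_test /dev/vgXX/rlvolN /scratch/ddtest.out 500
```

Watch syslog while it runs. A clean copy with no new SCSI messages would point away from the disk bus.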

In addition, I would recommend installing the diagnostics and taking a look at the error logs of each disk. HP Support has to give you a password to log in to the tools (diagmon or so) and coach you through the menus, but it is fairly simple. I did it twice with HP on the phone.

Volker
Volker Borowski
Honored Contributor

Re: scsi problems causing backup to fail; what's going on?

Mark,

to get clear information about the tape and the disks, can you run

ioscan -C tape
ioscan -C disk

and give us the output?
Thanks
Volker
Mark Vollmers
Esteemed Contributor

Re: scsi problems causing backup to fail; what's going on?

No problem. Here it is. The RAID has /home and /download. The Seagate drives should be mirrored and have everything else (/var, /usr, etc.).

# ioscan -C tape
H/W Path    Class  Description
============================================
8/16/5.3.0  tape   HP C1537A


# ioscan -C disk
H/W Path    Class  Description
============================================
8/0.0.0     disk   ARTECON LynxRAID
8/0.5.0     disk   SEAGATE ST34573WC
8/0.8.0     disk   SEAGATE ST34573WC
8/16/5.2.0  disk   TOSHIBA CD-ROM XM-5701TA
"We apologize for the inconvience" -God's last message to all creation, from Douglas Adams "So Long and Thanks for all the Fish"
Volker Borowski
Honored Contributor

Re: scsi problems causing backup to fail; what's going on?

Well, this shows us that the tape and CD are on one bus and all the disks are on the other.

Since most (all?) of the SCSI messages refer to cd013000, and the following line
SCSI TAPE: dev = 0xcd013000 I/O error during close
identifies this one as the tape,

I think your disks and the corresponding bus are OK. The disk copy test I mentioned before (if possible) should go fine.
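
To check the "most (all?)" part rather than eyeballing it, something like the following could count the syslog lines that mention a given device number. This is only a sketch: count_dev_errors is a name I made up, and the default log path is the usual HP-UX location, which I am assuming applies here.

```shell
#!/bin/sh
# Count the log lines that mention a given device number.
#   $1 = device number as printed in syslog, with or without the 0x prefix
#   $2 = log file (assumed default: the usual HP-UX syslog location)
count_dev_errors() {
    dev=${1#0x}                                  # strip a leading 0x if present
    log=${2:-/var/adm/syslog/syslog.log}
    grep -c -i "$dev" "$log"                     # prints the number of matching lines
}

# Hypothetical usage:
#   count_dev_errors 0xcd013000
#   count_dev_errors 0xcd013000 /var/adm/syslog/OLDsyslog.log
```

Comparing that count with the total number of SCSI lines in the log shows whether any other device is involved.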

Replace the SCSI-cable for the tape first (easiest and cheapest shot).
Try another tape if available.
Check terminator on the tape.

Good hunting
Volker
A. Clay Stephenson
Acclaimed Contributor

Re: scsi problems causing backup to fail; what's going on?

Hi Mark,
If everything else checks out, since it appears that you are on maintenance, I would have the tape drive replaced. Regardless of the status of individual backup runs, you shouldn't be seeing all those syslog errors.

However, all of this could be termination. Don't overlook the internal termination. It's actually amazing how well SE SCSI does with no termination. It works just well enough to drive you crazy.

Clay
If it ain't broke, I can fix that.
Mark Vollmers
Esteemed Contributor

Re: scsi problems causing backup to fail; what's going on?

Does anyone know how to make sense of the scsi errors that are in the log (see the first one I attached)? I see them and know that there is a scsi problem, but all the lbolts and lbps and everything mean nothing to me. Anyone got a good way to figure out some of the stuff so I can pinpoint problem areas? Or is that just one of those things you just pick up after years of working with it?
"We apologize for the inconvience" -God's last message to all creation, from Douglas Adams "So Long and Thanks for all the Fish"
MANOJ SRIVASTAVA
Honored Contributor

Re: scsi problems causing backup to fail; what's going on?

Hi Mark

Generally, the reported errors give a description and perhaps also the status of the device's CSR. In this case it looks like the device is reporting lots of parity errors. It looks to me as though the device itself is fine, so please do the following:

1. Check the terminations on the bus and devices (which should be fine, as I assume that the system was working).
2. Check for bent pins on the SCSI buses, as they can cause lots of intermittent problems.
3. Finally, you can try changing the controller, which could be your solution.


Manoj Srivastava
A. Clay Stephenson
Acclaimed Contributor

Re: scsi problems causing backup to fail; what's going on?

Hi Mark, I'll try to answer a portion of your last question. Probably the most important thing to note in the LBOLT messages is the device, in your case 'cd013000'. This breaks down as:
cd - major device number 0xcd = 205 decimal
if you run lsdev you will see that 205 is stape, so it must be a SCSI tape drive

01 - bus (controller number) c1

3 - SCSI Target ID t3

0 - LUN d0

00 - the last 2 hex digits are device driver specific flags; (they set things like norewind, density, compression on a tape drive but the same values might do completely different things on a disk drive - it depends on the driver)

In your case we know it's a tape drive c1t3d0.
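
If you would rather not do the hex in your head, the breakdown above can be sketched as a small shell function. The name decode_dev is made up for illustration, and the field widths are simply the ones listed above (2 hex digits major, 2 bus, 1 target, 1 LUN, 2 driver flags).

```shell
#!/bin/sh
# Split an 8-hex-digit device number into the fields described above:
#   major (8 bits) | bus (8 bits) | target (4 bits) | LUN (4 bits) | flags (8 bits)
decode_dev() {
    dev=$(( $1 ))                       # accepts hex input such as 0xcd013000
    major=$((  (dev >> 24) & 0xff ))
    bus=$((    (dev >> 16) & 0xff ))
    target=$(( (dev >> 12) & 0xf ))
    lun=$((    (dev >> 8)  & 0xf ))
    flags=$((   dev        & 0xff ))
    printf 'major=%d c%dt%dd%d flags=0x%02x\n' \
        "$major" "$bus" "$target" "$lun" "$flags"
}

decode_dev 0xcd013000    # prints: major=205 c1t3d0 flags=0x00
```

The major number can then be looked up in the lsdev output to find the driver.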

When that fails, we use the force.

Hope this helps a bit, Clay
If it ain't broke, I can fix that.
Mark Vollmers
Esteemed Contributor

Re: scsi problems causing backup to fail; what's going on?

Hello, all. Sorry to keep going with this, but I have another question. Based on everything that I have been told, the tape drive is nuts. I am going to try to get a new cable and then go bug HP for a new drive. Incidentally, it failed again last night. I went to run it this morning, hoping to get lucky. The errors that I saw make me wonder about the RAID vs. the tape. Could the tape be causing the new errors, or is there a bigger problem? I have attached a file with the relevant messages from fbackup and syslog. When prompted, I entered a new output file (/home/temp/tempfile) for it to write to. I really appreciate the help! Thanks!

Mark
"We apologize for the inconvience" -God's last message to all creation, from Douglas Adams "So Long and Thanks for all the Fish"
Carlos Fernandez Riera
Honored Contributor

Re: scsi problems causing backup to fail; what's going on?


Clearly it is a hardware error. Run stm for both the tape and the disks.

You said your tape drive is DDS-3... For this device the tapes must be 125 m (DDS-3) tapes.

Try my program at:
http://forums.itrc.hp.com/cm/QuestionAnswer/1,1150,0x298bee3e323bd5118fef0090279cd0f9,00.html


It gets full statistics from the DDS driver.
unsupported
A. Clay Stephenson
Acclaimed Contributor

Re: scsi problems causing backup to fail; what's going on?

Hi Mark,
Now it looks as though you have two independent problems going on. Since the tape drive is on one bus and your RAID is on another, the problems are unrelated (unless there is a common thread like the system board).
But all in all, your machine seems to be working too well for a system board failure.
I didn't see the type of RAID you are using. Do you have any monitoring software for it?

If it ain't broke, I can fix that.