SCSI problem help needed identifying disk

 
Tony Horton
Frequent Advisor

Hi,

I have a SCSI problem, but I'm not sure which disk is causing it. The output from dmesg is as follows.

Oct 29 12:04
...
- lbolt: 1061326722, bus: 6
scb->cdb: 28 00 03 c5 f3 80 00 00 20 00
scb->cdb: 28 00 00 1a 3d a0 00 00 60 00
SCSI: Resetting SCSI -- lbolt: 1069236335, bus: 6
SCSI: Reset detected -- lbolt: 1069236335, bus: 6
scb->cdb: 2a 00 00 00 0c 40 00 00 02 00
scb->cdb: 28 00 00 5d f9 00 00 00 60 00
SCSI: Resetting SCSI -- lbolt: 1069793138, bus: 6
SCSI: Reset detected -- lbolt: 1069793138, bus: 6
scb->cdb: 28 00 00 13 76 b0 00 00 50 00
scb->cdb: 28 00 00 10 38 20 00 00 60 00
SCSI: Resetting SCSI -- lbolt: 1074152496, bus: 6
SCSI: Reset detected -- lbolt: 1074152496, bus: 6
scb->cdb: 28 00 00 3a c1 80 00 00 10 00
scb->cdb: 28 00 00 08 5d 00 00 00 80 00
SCSI: Resetting SCSI -- lbolt: 1093971586, bus: 6
SCSI: Reset detected -- lbolt: 1093971586, bus: 6
scb->cdb: 2a 00 00 bb 9b 3c 00 00 02 00
scb->cdb: 28 00 00 2d 25 80 00 00 80 00

I'm assuming that bus 6 refers to the SCSI devices with device files such as c6t2d0. There are only three disks on this channel, at SCSI IDs 2, 12 and 13.

Any help in identifying the culprit would be greatly appreciated.

When these messages appear, the system seems to freeze for 20-30 seconds and then resumes normal operation.

Regards,

Tony.
No man is an isthmus
KCS_1
Respected Contributor

Re: SCSI problem help needed identifying disk

Hi,

Are there any related messages in the syslog?

If the STM diagnostic tools are installed, have a look at the event log in the /var/opt/resmon/log directory.

It should show more detailed symptoms.


Easy going at all.
Tony Horton
Frequent Advisor

Re: SCSI problem help needed identifying disk

Hi Patrick,

The messages also appear in the syslog, but it contains nothing beyond what dmesg shows (although in a different order).

Oct 29 11:12:06 ceeng vmunix: scb->cdb: 2a 00 00 00 cf f0 00 00 10 00
Oct 29 11:12:06 ceeng vmunix: scb->cdb: 28 00 00 5d f9 00 00 00 60 00
Oct 29 11:12:07 ceeng vmunix: SCSI: Resetting SCSI -- lbolt: 1171439936, bus: 6
Oct 29 11:12:07 ceeng vmunix: SCSI: Reset detected -- lbolt: 1171439936, bus: 6

I checked the event log as suggested, but the last entry was from August 3rd. It is however an entry for one of the disks I thought might be the problem, and was a recoverable read error.

I'll do some non-destructive STM tests on that disk and see if anything shows up. Thanks for the tip.

Regards,

Tony.
No man is an isthmus
Rajeev  Shukla
Honored Contributor

Re: SCSI problem help needed identifying disk

It appears to me that a disk is going offline and coming back online. After it comes back online it resynchronizes, and that is what freezes the system.

In parallel, have a look at vgdisplay -v to see whether any disk shows as unavailable or any LV as stale.

ioscan can also show the disk's state while it is happening.

Or have a look at the STM logs if the online diagnostics are installed.
KCS_1
Respected Contributor

Re: SCSI problem help needed identifying disk

Hi Tony,

Then what is the output of the following commands?

# ioscan -funCdisk

Does any disk show NO_HW or UNCLAIMED?

# vgdisplay -v | more

Is any LV stale, or does any disk show other odd symptoms?

If anything in the output looks abnormal, run this test:

# dd if=/dev/rdsk/cXtYdZ of=/dev/null bs=1024

- Sequential read test of the specified disk (it only reads; nothing is written to the disk)
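
To save paging through the full vgdisplay output, the check for stale or unavailable entries can be scripted. This is only a rough sketch: it assumes vgdisplay -v prints "LV Name" / "LV Status" and "PV Name" / "PV Status" lines with the value in the third field, and flag_lvm_trouble is a made-up helper name, not an HP-UX command.

```shell
#!/bin/sh
# Read `vgdisplay -v` text on stdin and print only the LVs/PVs whose
# status line mentions "stale" or "unavailable".
# Assumption: lines look like "LV Name  /dev/vg01/lvol1" followed by
# "LV Status  available/stale" (same pattern for PV Name / PV Status).
flag_lvm_trouble() {
    awk '
        /^ *(LV|PV) Name/   { name = $3 }
        /^ *(LV|PV) Status/ && /stale|unavailable/ { print name ": " $3 }
    '
}
```

Typical use would be `vgdisplay -v | flag_lvm_trouble`. No output means nothing was flagged at that moment, so it is worth running while the pauses are actually happening.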


Good luck!




Easy going at all.
Tony Horton
Frequent Advisor

Re: SCSI problem help needed identifying disk

Thanks Guys,

I tried all the suggestions but turned up blank. I couldn't trigger the messages with the dd either. We seem to get the pauses multiple times a day, but the messages in the log are less frequent. I suspect one of the disks is on the way out but is managing to read/write after a number of retries, succeeding most times before the timeout period. I had a similar problem on a Linux box before; changing the offending disk fixed it.

There is actually a swap LV on this disk, which could be the cause of the pauses. Maybe I should move the swap to another disk!

I couldn't see anywhere in the error messages where it specifically points to a particular disk, which made me wonder whether there was in fact a problem with the controller.
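
On why the messages never name a disk: the scb->cdb lines are raw 10-byte SCSI command blocks, and that format carries an opcode, a logical block address and a transfer length, but no target ID. A minimal shell sketch for decoding them, assuming the standard READ(10)/WRITE(10) layout (opcode 0x28 or 0x2a, LBA in bytes 2-5 and length in bytes 7-8, counting from 0). decode_cdb is a hypothetical helper name, not an HP-UX tool.

```shell
#!/bin/sh
# Decode a 10-byte SCSI CDB as printed on the "scb->cdb:" lines.
# Usage: decode_cdb 28 00 03 c5 f3 80 00 00 20 00
decode_cdb() {
    if [ "$#" -ne 10 ]; then
        echo "expected 10 hex bytes" >&2
        return 1
    fi
    case "$1" in
        28)    op="READ(10)" ;;
        2a|2A) op="WRITE(10)" ;;
        *)     op="opcode 0x$1" ;;
    esac
    # Bytes 2-5 (counting from 0) are the logical block address and
    # bytes 7-8 the transfer length in blocks, both big-endian.
    lba=$(( 0x$3$4$5$6 ))
    len=$(( 0x$8$9 ))
    echo "$op lba=$lba blocks=$len"
}
```

For the first line of the dmesg output, `decode_cdb 28 00 03 c5 f3 80 00 00 20 00` prints `READ(10) lba=63304576 blocks=32`. If the disks use 512-byte blocks (an assumption), that address is roughly 32 GB into the disk, so it could not belong to a small disk. That is a hint about which disk a command was aimed at, not proof of which disk caused the reset.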

Regards,

Tony.
No man is an isthmus
Rajeev  Shukla
Honored Contributor

Re: SCSI problem help needed identifying disk

Hi Tony,
Since this is a production machine, my best suggestion is not to take a chance and wait until the disk fails completely. Install the "Event Monitoring Service", which is part of the online diagnostics. It will report any powerfail or I/O error on a disk along with its hardware address, so you can locate the disk easily, schedule some downtime and replace it.
Don't let users get annoyed by slow performance of the server.
Tony Horton
Frequent Advisor

Re: SCSI problem help needed identifying disk

Hi Rajeev,

Unless it has died for some reason, EMS should be running, although it hasn't reported any problems since August. I agree with you about replacing the disk before it dies; it's just a matter of knowing which one :-) I ran STM, and two of the 3 disks on that channel showed read and write errors when I did an info. The ones playing up are Quantum Atlas 10K III drives. I have another four of them on the other channel, and all of them show zero errors. I checked the SCSI cables, and one connector, which goes to the last disk on the chain, didn't have its thumb screws tightened. I have done that now just in case, but I suspect it is something more than that.

Regards,

Tony.
No man is an isthmus
Joe Short
Super Advisor

Re: SCSI problem help needed identifying disk

It could also be the HBA. You said STM shows errors on 2 of the 3 disks. If the cables are all properly connected, the fault may be somewhere else. Have you tried reseating the HBA? What model server is this? Is the HBA in the correct slot? Some servers have priorities for the expansion slots, and higher-performance cards should be installed in specific slots.
Tony Horton
Frequent Advisor

Re: SCSI problem help needed identifying disk

Looks like it is one particular disk. I finally got some more meaningful messages last night: a whole pile of write errors on one of the three disks.

I did pull the Controller out and reseat it, no change. It is in the correct slot (it's an L1000) so I don't have any choice but to put it in the "turbo" slots.

I suspect that the disk has been marginal for some time, and it's finally started to get to the point where the errors aren't recoverable. It's one of two disks in a mirror, so I should be able to reduce the mirror and do some more extensive tests on it now that I know which one :-)

Regards,

Tony.
No man is an isthmus
Tony Horton
Frequent Advisor

Re: SCSI problem help needed identifying disk

Now that I know which disk is suspect, what's the best way of making sure it's a media problem?

Should I run mediainit or the STM exerciser (which doesn't ever seem to do much)? I'd like to get some hard evidence for a warranty claim.

Regards,

Tony.
No man is an isthmus
Tony Horton
Frequent Advisor

Re: SCSI problem help needed identifying disk

Hmmm, it helps if you increase the timeout in the exerciser tool options and set it to maximum coverage :-)
No man is an isthmus
Michael Steele_2
Honored Contributor

Re: SCSI problem help needed identifying disk

Attach LOGTOOL report and 'ioscan -fnkC disk':

STM > TOOLS > UTILITY > RUN > LOGTOOL > FILE > VIEW > RAW SUMMARY.
Support Fatherhood - Stop Family Law
Tony Horton
Frequent Advisor

Re: SCSI problem help needed identifying disk

Hi Michael,

I thought the problem was probably termination or cable or controller before, as all of the disks on the 0/4/0/1 path were showing similar numbers of errors.

I'm getting a new Ultrium on Monday, and the IBM drive has already been replaced (it died completely). It's SCSI ID 13 that seems to be the main culprit, i.e. the Atlas 10K3 73GB drive (based on this morning's syslog messages).

I have added the syslog messages from this morning on the end of the summary.

Regards,

Tony.
No man is an isthmus
Tony Horton
Frequent Advisor

Re: SCSI problem help needed identifying disk

Sorry, once again with a .txt extension.
No man is an isthmus
Steven E. Protter
Exalted Contributor

Re: SCSI problem help needed identifying disk

If you have not blown away /etc/lvmtab, the script I'm attaching will help you identify the disk. It must have been configured into a volume group for this script to work.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Tony Horton
Frequent Advisor

Re: SCSI problem help needed identifying disk

Sorry, I forgot the ioscan before.
No man is an isthmus
Michael Steele_2
Honored Contributor

Re: SCSI problem help needed identifying disk

LOGTOOL didn't have enough errors to indicate a failure, but when c6t13d0 is reported as powerfailed ("...LVM: vg[1]: pvnum=5 (dev_t=0x1f06d000) is POWERFAILED..."), usually the firmware needs upgrading or the timeout needs to be increased.

Verify the disk with this command to find vg1 and pv 5:

# strings /etc/lvmtab

First vg is vg1.
5th disk is pv 5.
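
As a cross-check on lvmtab, the dev_t in the POWERFAILED message can be decoded directly. A minimal sketch, assuming the legacy HP-UX layout for sdisk devices: the top 8 bits are the major number, followed by 8 bits of card instance, 4 bits of target and 4 bits of LUN. decode_dev_t is a hypothetical helper name.

```shell
#!/bin/sh
# Split an HP-UX dev_t into its major number and the cXtYdZ name.
# Assumed layout (legacy sdisk devices): 8-bit major, 8-bit card
# instance, 4-bit target, 4-bit LUN, 8 low bits of driver options.
decode_dev_t() {
    dev=$(( $1 ))
    major=$(( (dev >> 24) & 0xff ))
    card=$(( (dev >> 16) & 0xff ))
    target=$(( (dev >> 12) & 0xf ))
    lun=$(( (dev >> 8) & 0xf ))
    echo "major=$major c${card}t${target}d${lun}"
}
```

`decode_dev_t 0x1f06d000` prints `major=31 c6t13d0`, which agrees with the c6t13d0 named in the message above.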

######################################

(239) 0/4/0/1.13.0 = QUANTUMATLAS10K3_73_WLS
(208) 0/4/0/1.2.0 = SEAGATEST318203LW
(216) 0/4/0/1.12.0 = QUANTUMATLAS10K3_73_WLS

Check the firmware on these. Increase the timeout:

# pvchange -t 160 /dev/dsk/cXtYd0
Support Fatherhood - Stop Family Law
Michael Steele_2
Honored Contributor

Re: SCSI problem help needed identifying disk

Since all of your errors are isolated to the c6 controller, I'd be interested to know what's on c6. Use 'pvdisplay -v' and 'bdf' to see what's on these disks.

Use 'diskinfo' to find the firmware version. See if there are upgrades.

# diskinfo -v /dev/rdsk/cXtYdZ

Please attach:

sar -d 5 5
sar -u 5 5
Support Fatherhood - Stop Family Law
Tony Horton
Frequent Advisor

Re: SCSI problem help needed identifying disk

Hi Michael,

There are only 3 devices on that controller. Two are Quantum Atlas 10K3 drives, one 18GB and one 73GB (both in a Kingston Data Silo case; they aren't "HP" disks). The third device is a Seagate 18GB disk, which is a genuine HP external disk.

The 73GB drive has one 70GB lvol on it, which is one half of a Lotus Notes mirror. When the errors started this morning, database compacting was occurring.

The 18GB Quantum drive has secondary swap, Progress 4GL compiled programs, spool files from the Progress system, and a D3 (Pick) raw partition (note that all of these are mirror copies as well).

The 18GB Seagate disk has an lvol for squid, an lvol for samba, and an lvol for the test Progress database. None of these are particularly high volume, and none are mirrored.

The sar output is probably a bit pointless at the moment as I'm running a non-destructive read/write exerciser on the 73GB disk, but I've attached it anyway.

Unfortunately since the disks are non-hp, firmware upgrade is out of the question. Maxtor don't seem to release new versions.

Regards,

Tony.
No man is an isthmus
Tony Horton
Frequent Advisor

Re: SCSI problem help needed identifying disk

Thanks Steven,

That's a handy script; I might put it in as a cron job. Unfortunately no disks have actually failed at the moment, so it didn't pick anything up, but it would have been very handy a few months ago, when I went for about 2 months not realising one of the disks in a mirror had completely failed!!!

Regards,

Tony.
No man is an isthmus
Tony Horton
Frequent Advisor

Re: SCSI problem help needed identifying disk

Michael,

Just on the I/O timeout, I'm wondering why the other identical disks (the other half of the mirror) on the other channel of the controller aren't getting any errors. There are actually four disks on the other channel, so theoretically it should be even busier.

Regards,

Tony.
No man is an isthmus
Michael Steele_2
Honored Contributor

Re: SCSI problem help needed identifying disk

When you post the sar output and post the file system layout we will know.
Support Fatherhood - Stop Family Law
Tony Horton
Frequent Advisor

Re: SCSI problem help needed identifying disk

Hi Michael,

When you say the filesystem layout, do you want the sizes and relative positions on the disks, or just, for instance, bdf output? I'll do a second sar output now that the exerciser has finished, but it's a bit quiet now (6:35PM).

I've done the sar output and a bdf. It would probably be more meaningful, though, if I did it at a peak time like 11:30AM.

Regards,

Tony.
No man is an isthmus
Tony Horton
Frequent Advisor

Re: SCSI problem help needed identifying disk

Sorry, I got an error then and the attachment didn't stick.
No man is an isthmus