Re: SCSI error, performance related?

Mark J McDonald · ‎06-07-2007

Hi All

We have a D380, which we look after hardware-wise, but we are not sure about the application running on it (an Alcatel/Lucent app with an Informix DB).

Recently we got a disk error, where the disk was showing as NO_HW in an ioscan. The disk is the only disk in vg02, which looks like a temporary dump area for the application.

We replaced the disk, and all looked OK for about a day, when the same error occurred. We have since replaced the machine with another D class; again this ran for about a day before giving the same error.

Is it possible that the application is writing far more than the SCSI bus / disk can cope with? Are there some kernel parameters I can look at?

A sar on the disk is showing very high %wio (99%)

Any ideas appreciated

Thanks

Mark

Duncan Edmonstone · ‎06-07-2007

Mark,

Can you post the actual errors you are getting?

HTH

Duncan

I am an HPE Employee

Mark J McDonald · ‎06-07-2007

Duncan

After about 18 hours of up time the syslog shows:
Jun 6 18:12:28 lucluc20 vmunix: SCSI: Request Timeout -- lbolt: 6229688, dev: 1f009000
Jun 6 18:12:28 lucluc20 vmunix: lbp->state: 20
Jun 6 18:12:28 lucluc20 vmunix: lbp->offset: ffffffff
Jun 6 18:12:28 lucluc20 vmunix: lbp->uPhysScript: 480000
Jun 6 18:12:28 lucluc20 vmunix: From most recent interrupt:
Jun 6 18:12:28 lucluc20 vmunix: ISTAT: 22, SIST0: 00, SIST1: 04, DSTAT: 00, DSPS: 00480580
Jun 6 18:12:28 lucluc20 vmunix: lsp: 000000004281c600
Jun 6 18:12:28 lucluc20 vmunix: bp->b_dev: 1f009000
Jun 6 18:12:28 lucluc20 vmunix: scb->io_id: 7e895
Jun 6 18:12:28 lucluc20 vmunix: scb->cdb: 2a 00 00 1f 48 40 00 00 10 00
Jun 6 18:12:28 lucluc20 vmunix: lbolt_at_timeout: 6226560, lbolt_at_start: 6226560
Jun 6 18:12:28 lucluc20 vmunix: lsp->state: 10d
Jun 6 18:12:28 lucluc20 vmunix: lbp->owner: 000000004281c600
Jun 6 18:12:28 lucluc20 vmunix: scratch_lsp: 0000000000000000
Jun 6 18:12:28 lucluc20 vmunix: Pre-DSP script dump [0000000041001020]:
Jun 6 18:12:28 lucluc20 vmunix: fbf44810 004807c8 41090000 00480290
Jun 6 18:12:28 lucluc20 vmunix: 78347200 0000000a 78350800 00000000
Jun 6 18:12:28 lucluc20 vmunix: Script dump [0000000041001040]:
Jun 6 18:12:28 lucluc20 vmunix: 0e000004 00480580 80000000 00000000
Jun 6 18:12:28 lucluc20 vmunix: 870b0000 004802d8 0a000000 00480588
Jun 6 18:12:28 lucluc20 vmunix: SCSI: Abort abandoned -- lbolt: 6229688, dev: 1f009000, io_id: 7e895, status: 200
Jun 6 18:12:28 lucluc20 vmunix:
Jun 6 18:12:28 lucluc20 vmunix: SCSI: Read error -- dev: b 31 0x009000, errno: 126, resid: 2048,
Jun 6 18:12:28 lucluc20 vmunix: blkno: 8, sectno: 16, offset: 8192, bcount: 2048.
Jun 6 18:12:28 lucluc20 vmunix: LVM: vg[2]: pvnum=0 (dev_t=0x1f009000) is POWERFAILED

AND

We are continually getting SCSI read errors like these:
Jun 7 15:08:37 lucluc20 vmunix:
Jun 7 15:08:37 lucluc20 vmunix: SCSI: Read error -- dev: b 31 0x009000, errno: 126, resid: 2048,
Jun 7 15:15:51 lucluc20 vmunix:
Jun 7 15:16:01 lucluc20 above message repeats 42 times
Jun 7 15:15:51 lucluc20 vmunix: SCSI: Read error -- dev: b 31 0x009000, errno: 126, resid: 2048,
Jun 7 15:16:01 lucluc20 above message repeats 42 times
Jun 7 15:15:51 lucluc20 vmunix: blkno: 8, sectno: 16, offset: 8192, bcount: 2048.
Jun 7 15:16:01 lucluc20 above message repeats 115 times
Jun 7 15:16:01 lucluc20 vmunix:
Jun 7 15:16:01 lucluc20 vmunix: SCSI: Read error -- dev: b 31 0x009000, errno: 126, resid: 2048,
Jun 7 15:16:01 lucluc20 vmunix: blkno: 8, sectno: 16, offset: 8192, bcount: 2048.
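Some of the raw fields in the first timeout record can be read directly. The sketch below decodes the `scb->cdb` bytes using the standard 10-byte SCSI READ(10)/WRITE(10) layout and converts the `lbolt` values to seconds. The function names are illustrative, and the tick rate of 100 per second is an assumption, not something the log states (it does agree with the "about 18 hours of up time" quoted above).

```python
def decode_rw10_cdb(cdb_hex: str) -> dict:
    """Decode a 10-byte SCSI READ(10)/WRITE(10) command descriptor block."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    return {
        "opcode": opcodes.get(cdb[0], hex(cdb[0])),
        "lba": int.from_bytes(cdb[2:6], "big"),     # bytes 2-5: logical block address
        "blocks": int.from_bytes(cdb[7:9], "big"),  # bytes 7-8: transfer length
    }


def ticks_to_seconds(ticks: int, hz: int = 100) -> float:
    """Convert lbolt ticks to seconds. hz=100 is an assumption."""
    return ticks / hz


# Values copied from the "Request Timeout" record in the syslog excerpt.
cmd = decode_rw10_cdb("2a 00 00 1f 48 40 00 00 10 00")
waited = ticks_to_seconds(6229688 - 6226560)     # lbolt at report minus lbolt_at_start
uptime_hours = ticks_to_seconds(6229688) / 3600  # lbolt at report, as hours since boot

print(cmd, waited, round(uptime_hours, 1))
```

Read this way, the command that timed out was a 16-block write that had been outstanding for roughly 31 seconds, while the errors logged afterwards are all reads of block 8.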

A. Clay Stephenson · ‎06-07-2007

When you say that you replaced the disk, I assume that that means you did more than simply pulling a disk out and replacing it.

In any event, you are having problems with /dev/rdsk/c0t9d0. If this is an internal disk, it's the second from bottom of a D3xx.
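The device name can be derived from the `dev: 1f009000` value in the log. A minimal sketch, assuming the usual layout for HP-UX legacy disk device numbers (top byte is the major number, then card instance, target and LUN in the minor number); the function name is hypothetical and the layout should be checked against the system's own `ll /dev/dsk` output:

```python
def decode_hpux_disk_dev(dev_hex: str) -> dict:
    """Split a dev_t such as '1f009000' into major/minor and a cXtYdZ name.

    Assumed layout: 8-bit major, then a 24-bit minor made of
    8-bit card instance, 4-bit target, 4-bit LUN, 8 option bits.
    """
    dev = int(dev_hex, 16)
    major = (dev >> 24) & 0xFF
    minor = dev & 0xFFFFFF
    card = (minor >> 16) & 0xFF
    target = (minor >> 12) & 0xF
    lun = (minor >> 8) & 0xF
    return {
        "major": major,
        "minor": f"0x{minor:06x}",
        "name": f"c{card}t{target}d{lun}",
    }


info = decode_hpux_disk_dev("1f009000")
print(info)
```

Under that assumption the result is major 31, minor 0x009000 and c0t9d0, which lines up with the `dev: b 31 0x009000` field in the read-error messages.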

The fact that a disk is nearly 100% busy should be invisible to the application; it simply issues a read() or write() system call and waits until that operation is completed --- and the disk being busy could easily be a symptom of a failing disk.

If it ain't broke, I can fix that.

Mark J McDonald · ‎06-07-2007

Yes we can see which disk is causing the problems from the ioscan.

The disk was replaced, and the volume etc re-created.

Our customer is now telling us to transfer everything to an L class, but I'm not convinced it is a hardware issue. Why does it work for the best part of a day after a reboot?

A. Clay Stephenson · ‎06-07-2007

Without more to go on, it's difficult to say, but I have seen flaky/failing disks behave exactly this way. In any event, when you see NO_HW in ioscan and POWERFAILED in syslog, it means that the disk is no longer responding properly; it is not an application error. Note that I have not said that you might not have an application error; there could be many, but those are independent of this symptom.

If it ain't broke, I can fix that.

Mark J McDonald · ‎06-07-2007

The annoying thing is, we do not know the history of this.

Apparently HP changed the disk shortly before we took over maintenance, but they will not tell us how long HP's new disk lasted.

We have also changed the disk.

Sandman! · ‎06-07-2007

If the disk you replaced was /dev/dsk/c0t9d0 then imho you should look into reseating the drive, make sure it's properly terminated, check cables and connectors. Any of these things may cause the drive to behave strangely.

Steven E. Protter · ‎06-07-2007

Shalom Mark,

I have a couple of D boxes in my home. Kinda strange, but I learned things from them.

One thing I learned is that the D class boxes route power and bandwidth to the disk through something called a drive cage.

One of my D boxes had a problem around the turn of the century. Disks just kept going bad. It was eating disks like popcorn, every couple of weeks.

Until I got HP to replace the drive cage, disks kept going bad. A few of those disks turned out not to have actually gone bad; I kept one and tested it after the drive cage was replaced.

Sometimes you have to argue with hardware support to get them to replace this part, but it's worth looking into.

These D systems were Oracle development servers for a number of years. They were severely stressed both in CPU and I/O. They never lost a disk after the drive cage replacement in spite of being pushed very, very hard for a number of years.

So no, I don't believe performance issues can cause disks to fail.

SEP

Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

Mark J McDonald · ‎06-07-2007

Thanks Steven

The disks were changed twice and showed the same error, so that's 3 different disks in the 1st D class. Then all the current internal disks from the system were put into a whole different D class.

Would the fact that the disk has been in a damaged D class mean it could stop working in another D class?

Strange one, this!

I'm off work now until Tuesday, but I'm sure one of the other lads will check this.

Thanks

Mark

A. Clay Stephenson · ‎06-07-2007

The most common cause of repeated drive failures is poor/inadequate cooling, and if memory serves there is a dedicated fan just for the internal drives on a D-3xx (or a D-2xx with the hot-plug option). If that fan were underperforming and/or the air passages were blocked with dust bunnies on the original server, then I would expect drives that had been operating in that high-temperature environment to have a dramatically shortened life even after being moved; a drive can be damaged (to the point that it becomes flaky) by a one-time exposure to over-temperature conditions. It's very common for such drives to fail, for you to pull them out and immediately reseat them, and for them to function again for hours or a few days before the symptoms repeat.

If I were you, I would carefully check cooling and power supply voltages but I'm betting on poor cooling. I suspect that you have actually fixed the problem by moving to the second D-box and now you are simply dealing with the artifacts of the drives having been previously operated in a harsh environment.

Of course, the newest HVD SCSI drives are probably 8 or 9 years old now --- so what do you expect?

If it ain't broke, I can fix that.

stevebennett · ‎06-07-2007

Hi,

Just to add some more details to this fault. The original configuration had vg02, which contained 2 disks: 1 internal (c0t9d0) and 1 external (c3t2d0).

The D380 always failed in the same way: first the external disk would fail with the syslog messages that have already been posted, then several hours later the internal disk would fail with the same messages.

Both disks have been swapped out 3 times, and the entire D380 has also been swapped out. I have used STM to exercise the disks, CPU and memory of the original D class, and everything tests OK.

To try and narrow the fault to a single disk we removed the external disk, however it made no difference. We then removed the entire external array from the config.

We know little of what the application does; however, the D380 is running 11i version 1.

Are there any patches which may cause this?

Thanks for your help
steve

Mark J McDonald · ‎06-13-2007

I have visited the site to put the L class in place. Overheating is looking more likely. The D classes are in cabinets with glass doors, which must be hampering the air flow through the machines. I have removed the door from the cab where the other D class lives, and where the L class is installed. Luckily the L class is too big to have the doors on, so they cannot be replaced.

Mark
