
Model 12H Problem

SOLVED

Everyone,

We are experiencing a problem on our Model 12H array. Attached is a portion of our syslog.log file. We tried replacing the X and Y controllers (96MB each) and the cables, and we are still having problems. One idea was to have the whole backplane of the Model 12H replaced, but somebody told us that the reason we're having this problem is that we have less than 10% unallocated space left for LUNs. Any help or ideas about this problem will be very much appreciated. Thanks.
12 REPLIES
Vincent Farrugia
Honored Contributor

Re: Model 12H Problem

Hello,

Having less than 10% unallocated space is always bad: when this happens, the AutoRAID keeps migrating data back and forth between RAID 0/1 and RAID 5, which causes severe performance problems.

Try increasing the free space by inserting a new drive. If the drive is bigger than the ones you already have, you have to insert two such drives to obtain their full capacity; otherwise, the new drive will be treated as the same size as the other drives.

Dunno whether this will help regarding your problem, but in any case, this should be done.
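
To make the 10% rule of thumb concrete, here is a minimal Python sketch of the check. The function names and the capacity figures are made up for illustration; the real numbers would have to come from the array's own status report.

```python
def unallocated_percent(total_gb: float, allocated_gb: float) -> float:
    """Return the share of array capacity not yet bound to any LUN, in percent."""
    if total_gb <= 0:
        raise ValueError("total capacity must be positive")
    if not 0 <= allocated_gb <= total_gb:
        raise ValueError("allocated capacity must be between 0 and total")
    return 100.0 * (total_gb - allocated_gb) / total_gb


def below_threshold(total_gb: float, allocated_gb: float,
                    threshold_pct: float = 10.0) -> bool:
    """True when unallocated space is under the rule-of-thumb threshold."""
    return unallocated_percent(total_gb, allocated_gb) < threshold_pct


# Hypothetical figures: 200 GB usable, 190 GB already allocated to LUNs.
print(below_threshold(200, 190))
```

With these made-up figures only 5% is unallocated, so the check flags the array as being under the threshold.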

HTH,
Vince
Tape Drives RULE!!!
David Bell_1
Honored Contributor

Re: Model 12H Problem

Debbie,

I agree with Vincent. However, this problem may be related to timeouts caused by longer I/O times because you have such a small amount of free space. It can also be cables, patches, termination, or disks. Please see the following:

http://support1.itrc.hp.com/service/cki/docDisplay.do?docLocale=en_US&docId=200000015663097

While this article relates to dmesg, the result is the same. Also, be sure that you have the latest level of firmware on the controllers, HBAs, and disks, as well as the latest SCSI patches. Do a search in the technical knowledge base on SCSI + lbolt and look at the patches for your O/S.

HTH,

Dave
A. Clay Stephenson
Acclaimed Contributor

Re: Model 12H Problem

Looking at your attachment, I found at least one case where a LUN switched to the alternate path (probably from X to Y) and still had problems. This would tend to rule out the host controllers or cables or terminators unless BOTH external paths are flaky.

A possible source of the problem is missing resistor packs (or disabled termination by DIP switch) on both host controllers. Typically SCSI buses terminated on only one end will almost work well - the worst kind of problem. I would have your local HP Mr. Goodwrench come out and examine both SCSI controllers in your host computer. (You can pull these yourself, if you like).

I have seen AutoRAIDs completely allocated that did not exhibit this behavior, so I very much doubt that this is your problem. The default timeout of 30 seconds is generally too short; I would immediately set it to 120 seconds (or so) for each LUN using pvchange. You received good advice about upgrading the firmware; don't forget the ARMserver and arraymgr software on the host.
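
As a rough sketch of the timeout change, the Python snippet below only builds and prints one `pvchange -t` command line per LUN, so the list can be reviewed before anything is run on the host. The device file names are hypothetical placeholders, and the helper name is invented for this example; substitute the device files of your own LUNs.

```python
import shlex


def pvchange_timeout_commands(device_files, timeout_s=120):
    """Build one 'pvchange -t <seconds> <device>' command line per LUN device file.

    Nothing is executed here; the strings are meant to be reviewed first.
    """
    if timeout_s <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return [f"pvchange -t {int(timeout_s)} {shlex.quote(dev)}"
            for dev in device_files]


# Hypothetical device files, for illustration only.
luns = ["/dev/dsk/c4t0d0", "/dev/dsk/c4t0d1"]
for cmd in pvchange_timeout_commands(luns):
    print(cmd)
```

Printing rather than executing is deliberate: the commands change LVM settings, so it seems safer to eyeball them before running them as root.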

If you still see problems after all this, I would have the backplane replaced.

One final thought - very noisy power.
If it ain't broke, I can fix that.
Steve Labar
Valued Contributor

Re: Model 12H Problem

I have also seen multiple SCSI timeouts when a bad disk is installed in the 12H. Try running
arraylog -d {slot_id} {array_id} for each disk installed in the array. Most notably, check the end of the report for the "Grown Defect List". This list should be nearly, if not completely, empty. If the list is too long, you might need to replace a disk in your array.
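
If there are many disks, a small Python sketch like the one below can generate the command line for each slot so that none is skipped. The slot IDs and the array ID shown are placeholders, not values from this system, and the helper name is invented for the example; the command form is the one given above.

```python
def arraylog_commands(slot_ids, array_id):
    """Build one 'arraylog -d <slot_id> <array_id>' command line per disk slot."""
    return [f"arraylog -d {slot} {array_id}" for slot in slot_ids]


# Placeholder values: use the slot IDs and array ID reported for your own array.
slots = ["A1", "A2", "B1"]
array_id = "ARRAY_ID"
for cmd in arraylog_commands(slots, array_id):
    print(cmd)
```

The report for each disk still has to be read by hand; this only saves typing the commands.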

Good Luck.

Steve

Re: Model 12H Problem

Guys,

Thanks for the very swift replies. We have already had a number of HP guys in here, when the controllers were replaced and when the power supplies and cables were replaced. HP shipped us two 36GB drives last night and we swapped one in place of an 18GB drive. This gave us more than 18GB of unallocated space for the LUNs. It still gave us all the messages that I included in the syslog. The rebuild finished at around 8:00 this morning and the array has balanced about 4 times already. It only showed ready status for about 30 seconds and then went back to balancing. This has happened 3 times already, and right now it is still balancing. I have already made a call to HP and it has been escalated. Any thoughts about this will surely be appreciated. Thanks.

Re: Model 12H Problem

Everyone,

We've also tried using pvchange to change the timeout to 180 on each individual LUN (disk device file) and it did not help.
We also ran a patch tool with the HP CEs; they did not find anything wrong with our patches, which are current. We also checked the arraymgr and armserver software and they are current too. I will run arraylog and see what happens. Again, thanks a lot to all of you.
Bill McNAMARA_1
Honored Contributor

Re: Model 12H Problem

you're wasting your time diagnosing from syslogs..
what you're seeing is a symptom.

Get the autoraid logs via

logprint
(see the man)

Send the output to your hp rep.

He will load it into the AutoRAID log tool (which he gets from the WTEC/lab). This tool will identify common problems.

I would suggest backing up a LUN, deleting it, recreating it, and then restoring the backup.

Perf will increase after this.

Later,
Bill
It works for me (tm)

Re: Model 12H Problem

Bill,

Why do you think we should re-create the LUNs? Is that just to defrag the array? A CE from HP will come in and change the backplane and the two SCSI cards, and upgrade the firmware on our disks to HP04.
Do you think we should re-create the LUNs before or after this process? Thanks again.
Bill McNAMARA_1
Honored Contributor
Solution

Re: Model 12H Problem

Well, with all this replacement going on, backup is a good idea.

The AutoRAID loses its head if it operates for a long time; the maps need to be 'refreshed'.
It's not as if the AutoRAID is really defragging, but the maps that are maintained in the controllers become more or less fragmented. The fragmentation of these maps causes perf problems at the controller level, especially when moving data around and calculating the free space for it... fragged-up disks sure don't help either.

I'd do all my backups and just test it before pulling things apart. The backplane replacement is a long operation; only do it if you see from logprint that a certain disk is hot, i.e. lots of SCSI retries. One screwy disk in the AutoRAID could mess the whole thing up... I think HP are trying to rule out the backplane before they ask you to replace every disk....

DO THE LOGPRINT!

Later,
Bill
It works for me (tm)

Re: Model 12H Problem

Bill,

Thanks a lot. I did send the logprint output to HP and they found a lot of resets and lbolts, pointing to just about everywhere. They went with changing the backplane, the 2 SCSI cards that the RAID array is connected to, and one 36GB disk, and up to now we are still error free. I will be watching the system closely and keeping my fingers crossed. I will update you on what happens.

Re: Model 12H Problem

Hi everyone! The HP CE decided to replace the whole path to the disk array again; that includes the array controllers, cables, terminators, HBA cards and the I/O backplane on the N4000. They made the change on 7/17 and up to now we've been error free. We're back to having 1GB free for LUNs and it is still error free. Thanks to everyone who responded.
Bill McNAMARA_1
Honored Contributor

Re: Model 12H Problem

sounds like that should do it!
I can vaguely recall that HBAs can be prone to errors of a common kind.. I would doubt that _everything_ failed!.. and the solution taken was a little commando!.. you must have been making some noise!

Good luck!
Later,
Bill
It works for me (tm)