finding which disk inside the array is bad?

I have a fairly general problem, though it may seem an odd question to ask. I have a K-class 9000 connected to an HP AutoRAID 12H, which started giving me errors about not being able to read the headers from the VGDA, followed by the hardware path 10.0.?.?, which refers to that array. I unmounted the file system and ran a full scan, which fixed it for a day. Now I am getting the same error again, and it has basically disabled the whole file system.

The odd part is: how do I find out which hard drive in that array is bad? Is there an easier way than pulling one drive at a time and keeping an eye on the syslog? Any ideas or help would be greatly appreciated.

18 REPLIES
Eugeny Brychkov
Honored Contributor

Re: finding which disk inside the array is bad?

The 12H is a disk array, and its data is protected with CRC at all times. If data gets corrupted, the array either corrects it or returns nothing at all.
The most likely suspect is the SCSI bus. Check the cables and terminators (the SCSI bus must be terminated with FWD terminators).
How to check the array?
arraydsp -i
arraydsp -a
logprint -t All -v
Zip all these outputs and attach them to your next reply
Eugeny
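
For anyone who would rather script the collection step above, here is a minimal sketch. It assumes a Python interpreter is available on the host, which the thread does not say. The three commands are copied verbatim from the reply; any array ID argument they may need on a real system is not shown in the thread, so none is added. The file names, `run_command`, and `collect` are hypothetical names chosen for illustration.

```python
import subprocess
import zipfile

# Commands exactly as written in the reply above. The output file names are
# hypothetical and chosen only for this sketch.
COMMANDS = {
    "arraydsp_i.txt": ["arraydsp", "-i"],
    "arraydsp_a.txt": ["arraydsp", "-a"],
    "logprint_all.txt": ["logprint", "-t", "All", "-v"],
}


def run_command(argv):
    """Run one command and return its combined stdout/stderr as text."""
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        return result.stdout
    except OSError as exc:
        # Command not installed or not executable: record that in the archive
        # instead of aborting the whole collection.
        return "could not run {}: {}\n".format(" ".join(argv), exc)


def collect(zip_path, runner=run_command):
    """Write the output of every command into one zip file and return its path."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, argv in COMMANDS.items():
            archive.writestr(name, runner(argv))
    return zip_path


if __name__ == "__main__":
    print("wrote", collect("array_diag.zip"))
```

The `runner` parameter exists only so the function can be exercised on a machine without the array utilities installed.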

Re: finding which disk inside the array is bad?

Thank you for the reply. I am monitoring the syslog to see if it outputs any errors. I will post them as soon as I get those errors again.

Thanks again.
Eugeny Brychkov
Honored Contributor
Solution

Re: finding which disk inside the array is bad?

The errors you see in syslog are most likely consequences; you need to find the root cause ASAP. If you attach what I asked for, I can check the 12H's health and whether it is the cause or not
Eugeny

Re: finding which disk inside the array is bad?

 
Eugeny Brychkov
Honored Contributor

Re: finding which disk inside the array is bad?

Your 12H has a very strange configuration. There are only 3 disks: 2x18G and 1x50G. The 50G one is not a native HP disk; it has 'BA11' firmware (Seagate's native firmware, I guess).
In addition, you should install disks of the same capacity in groups of 2 or more (examples: 2x18G, 3x36G, 5x9G). In the current config you lose 32G of the 50G disk (50G-18G=32G).
The largest supported disk in the 12H is 36G. All disks should have HP firmware.
What can I tell you? The 12H is in an unsupported/untested config, with an unsupported disk of unsupported capacity...
Anyway, you should look into the logs (logprint -t All -v) to see if there are any suspicious events
Eugeny

Re: finding which disk inside the array is bad?

Well, I had to delete some of the logs as they were filling up the drive. After some more troubleshooting, I removed one of the drives and rebuilt the array. This is what I have in the logs.

System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333497
Event code = 163
Event code description = Disk Set Detached
Event count = 1
FRU ID = 129
FRU description = Reporting Controller


System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333498
Event code = 151
Event code description = Disk Set Attached
Event count = 1
FRU ID = 129
FRU description = Reporting Controller


System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333498
Event code = 163
Event code description = Disk Set Detached
Event count = 1
FRU ID = 129
FRU description = Reporting Controller


System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333499
Event code = 151
Event code description = Disk Set Attached
Event count = 1
FRU ID = 129
FRU description = Reporting Controller


System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333499
Event code = 186
Event code description = Drive Missing At Power On
Event count = 1
FRU ID = 129
FRU description = Reporting Controller


System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333499
Event code = 137
Event code description = Rebuild Started
Event count = 1
FRU ID = 0
FRU description = Disk in slot A1


Disk error record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333647
Event code = 22
Event code description = Recovered By Disk Drive
Event count = 1
FRU number = 10
FRU description = Disk in slot A6

Slot number = A6
Sense key = 0x1
Additional Sense code = 0x18
Additional Sense code qualifier = 0x3
LBA = 16916039

System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333653
Event code = 147
Event code description = Disk Drive Deleted From Disk Set
Event count = 1
FRU ID = 129
FRU description = Reporting Controller


System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333653
Event code = 138
Event code description = Rebuild Complete
Event count = 1
FRU ID = 0
FRU description = Disk in slot A1


Usage record for Subsystem 0000000DFE24 at Fri Jul 18 15:46:54 2003
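
Since the original question was how to tell which disk is at fault, one way to read output like the above is to tally events per disk slot. The following is a rough sketch that assumes the log keeps the layout shown in this post (a "... record for Subsystem ... at ..." header followed by "key = value" lines). The function names are hypothetical, and the sample text is copied from the log above.

```python
import re
from collections import Counter

# Header line that starts every record in the log excerpt above.
HEADER = re.compile(
    r"^(?P<kind>.+?) record for Subsystem (?P<subsystem>\S+) at (?P<when>.+)$"
)

SAMPLE = """\
System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333499
Event code = 137
Event code description = Rebuild Started
Event count = 1
FRU ID = 0
FRU description = Disk in slot A1

Disk error record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333647
Event code = 22
Event code description = Recovered By Disk Drive
Event count = 1
FRU number = 10
FRU description = Disk in slot A6

Slot number = A6
Sense key = 0x1
Additional Sense code = 0x18
Additional Sense code qualifier = 0x3
LBA = 16916039

System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333653
Event code = 138
Event code description = Rebuild Complete
Event count = 1
FRU ID = 0
FRU description = Disk in slot A1
"""


def parse_records(text):
    """Split log text into one dict per record, keyed by the field names."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            # A header line opens a new record; blank lines do not close it,
            # because disk error records contain a blank line in the middle.
            current = match.groupdict()
            records.append(current)
        elif current is not None and " = " in line:
            key, value = line.split(" = ", 1)
            current[key.strip()] = value.strip()
    return records


def disk_events_by_slot(records):
    """Count events per (slot, event description) for records naming a disk."""
    prefix = "Disk in slot "
    counts = Counter()
    for rec in records:
        fru = rec.get("FRU description", "")
        if fru.startswith(prefix):
            key = (fru[len(prefix):], rec.get("Event code description", "?"))
            counts[key] += int(rec.get("Event count", "1"))
    return counts


if __name__ == "__main__":
    for (slot, event), n in sorted(disk_events_by_slot(parse_records(SAMPLE)).items()):
        print(slot, event, n)
```

A slot that keeps collecting "Disk error" records is the natural first suspect, although a single recovered error, as logged for A6 here, does not prove on its own that the disk is bad.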
Eugeny Brychkov
Honored Contributor

Re: finding which disk inside the array is bad?

Did you remove A6? And what does 'arraydsp -a' show now (please attach, don't post)? Is the data accessible from the host?
Eugeny

Re: finding which disk inside the array is bad?

No, I didn't remove the A6 disk, but I did change the cable. The logprint output shows only a one-time error on disk A6. What happened next is that, after I changed the cable and started to back up about 20 clients, it crashed with the following error. I forgot to mention that it's a NetBackup server.

Reboot after panic: kalloc: out of kernel virtual space

Should I remove the A6 disk and rebuild the array? I am totally confused by this server.

Thanks
Eugeny Brychkov
Honored Contributor

Re: finding which disk inside the array is bad?

I do not think it's related to storage. It looks like a patching issue. Try patching the system with the latest OnlineDiags, then the GR patch bundle, and then the HWE patch bundle. Most probably your system lacks some critical SCSI/OS patches
Eugeny

Re: finding which disk inside the array is bad?

Thank you so much for the reply. I have attached yesterday's errors from the syslog. Please let me know what you think of them. It seems to be complaining about the first physical volume of vg02, which is the array.

Any help would be appreciated.

Thanks again.
Eugeny Brychkov
Honored Contributor

Re: finding which disk inside the array is bad?

I see you daisy-chained the 12H's controllers. Your problems occur because, even when the links are switched, the alternate dies too and the data to be written is lost. This should not be a 12H problem. If you wish to solve (or at least try to solve) the problems here, please:
1. Replace all SCSI cables. Make sure the daisy-chain cable between X and Y is not shorter than 1m;
2. Replace the terminator. Make sure the replacement terminator is also HVD (C2905A, A1658-63013 or A1658-62024);
3. Patch your system (see my reply above).
And then we will see
Eugeny

Re: finding which disk inside the array is bad?

Eugeny,

Did you mean that the daisy-chain cable shouldn't be shorter than 1 meter? Because what I have right now is certainly shorter than 1 meter, and I have been using it for all these years. I just patched the box with the May 2003 QPack and am going to install the HWE bundle next. When I rebooted the machine after the patch was installed, I got an error for inode 999, went through the fix process, and continued starting the machine. I am going to remove hard drive A6, which happens to be the 50G drive, and rebuild the array. It was a regular Seagate drive put inside an HP array hard drive case. I can't find the cable C2905A, but I have C2981A instead. I will also change the terminators later on.

Thanks.
Eugeny Brychkov
Honored Contributor

Re: finding which disk inside the array is bad?

C2981A is a 0.5 meter SCSI cable...
Anyway, you need to act systematically to find the defective component (software or hardware). Write an action plan and replace components one by one. This way you can find the bad component. There's also another way: replace everything at once and see if that fixes it. If it does, put the old components back one at a time and see if the problem appears again. As soon as it appears, the last component you put back is the bad one.
These are just basic troubleshooting hints...
About cable length: I've heard something about it, but I believe it may cause SCSI events, not PV powerfails
Eugeny
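
The second approach in the reply above (replace everything, then put old parts back one at a time) can be written out as a short procedure. This is only an illustrative sketch: `find_first_bad` and the `problem_appears` callback are hypothetical names, and on a real system the "check" is a person running the failing workload and watching syslog, not a function call.

```python
def find_first_bad(old_components, problem_appears):
    """Return the first old component whose reinstallation brings the fault back.

    old_components:  names of the parts that were all swapped for new ones.
    problem_appears: hypothetical check; it receives the set of old parts
                     currently reinstalled and returns True if the fault shows.
    Returns None if no reinstalled part reproduces the fault.
    """
    reinstalled = set()

    # Step 1: with every part replaced, the fault must be gone. If it is not,
    # the cause lies outside the replaced parts and this method cannot help.
    if problem_appears(frozenset(reinstalled)):
        raise RuntimeError("fault persists with all components replaced")

    # Step 2: put the old parts back one at a time and test after each one.
    for component in old_components:
        reinstalled.add(component)
        if problem_appears(frozenset(reinstalled)):
            return component  # the last part put back is the suspect
    return None


if __name__ == "__main__":
    parts = ["host cable", "daisy-chain cable", "terminator"]
    # Simulated check for demonstration only: pretend one part is bad.
    simulated_bad = "daisy-chain cable"
    print(find_first_bad(parts, lambda installed: simulated_bad in installed))
```

Because the parts go back one at a time, the procedure stops at the first one that reproduces the fault. A second bad part further down the list would only show up on a later pass.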

Re: finding which disk inside the array is bad?

Thanks for the reply. This is basically what I am doing right now. I have installed both patches (QPK and HWE). I removed disk A6; the array is now rebuilding and is also complaining about losing redundancy. I am replacing the daisy-chain cable with the C2981A I have, since I don't have another at the moment. Actually, I believe the C2981A came with the array. This is something I inherited from someone and, not being too savvy with HP-UX, I am figuring out ways to find and fix this problem. I will send you an update. You can e-mail me at kumardeuja@yahoo.com if you have other suggestions. I greatly appreciate it.

Re: finding which disk inside the array is bad?

Eugeny,

So far, I haven't seen any errors. Basically, I removed disk A6 and replaced the cable that connects the array and the K-class. The system has been up for about 20 hours and looks good. Thanks for all the help and advice; I really appreciate it. I will let you know if anything new comes up.

Thanks again.

-K

Re: finding which disk inside the array is bad?

Well, here is what happened, and it is driving me crazy. As soon as I try to back up that particular mount point, I get this PV failed error, but when I unmount it and run a full scan, it doesn't show any errors. Please see the attached message. Any thoughts?


Thanks again.
Eugeny Brychkov
Honored Contributor

Re: finding which disk inside the array is bad?

If your server has one more HVD adapter, try splitting the controllers between them, i.e. connect X to SCSI controller1 and Y to SCSI controller2. It will require VG reconfiguration (the alternate links to the PV will change), but in this case the PV will be switching not just between the 12H controllers but between SCSI HBAs, and we will see whether the host reports both SCSI HBAs as 'power failed'
Eugeny

Re: finding which disk inside the array is bad?

Eugeny,

I don't have another HVD adapter on that server. But after I replaced the terminator and all the cables, it seems to be working fine. It's still a mystery to me. Thanks for all the advice.

-K