Disk Enclosures

Kumar Deuja_1
Advisor

finding which disk inside the array is bad?

I have a very general problem, though it may seem like an odd question to ask. I have a K-class 9000 connected to an HP AutoRAID 12H, which started giving me errors about not being able to read the headers from the VGDA; it then gives me the hardware path 10.0.?.?, which refers to that array. I unmounted the file system and did a full scan, which fixed it for a day. Now I am getting the same error again, which has basically disabled the whole file system.

The weird thing is, how do I find out which hard drive on that array is bad? Is there an easier way than pulling one drive at a time and keeping an eye on the syslog? Any ideas/help would be greatly appreciated.

Eugeny Brychkov
Honored Contributor

Re: finding which disk inside the array is bad?

The 12H is a disk array, and its data is protected with CRC all the time. If data gets corrupted, the array either corrects it or does not return anything at all.
The most likely suspect is the SCSI bus. Check the cables and terminators (the SCSI bus should be terminated with FWD terminators).
How to check the array?
arraydsp -i
arraydsp -a
logprint -t All -v
Zip all these outputs and attach them to your next reply.
Eugeny
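
For anyone who wants to gather those outputs in one step, here is a minimal Python sketch. It only wraps the three commands exactly as quoted above; whether they need an extra array identifier argument on a given system is an assumption to check, and the helper name collect_outputs and the archive name are made up for illustration.

```python
import os
import subprocess
import zipfile

# Commands as quoted in the reply above. Whether they need an extra
# array identifier argument on a given system is an assumption to check.
DEFAULT_COMMANDS = [
    ["arraydsp", "-i"],
    ["arraydsp", "-a"],
    ["logprint", "-t", "All", "-v"],
]


def collect_outputs(commands, zip_path):
    """Run each command and store its output as one text file in a zip.

    Returns the list of file names written into the archive.
    """
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for index, command in enumerate(commands):
            try:
                result = subprocess.run(
                    command, capture_output=True, text=True, timeout=300
                )
                text = result.stdout + result.stderr
            except (OSError, subprocess.TimeoutExpired) as error:
                # Keep going so one missing or hung tool does not lose the rest.
                text = "could not run {}: {}\n".format(" ".join(command), error)
            tool = os.path.basename(command[0])
            name = "{:02d}_{}.txt".format(index, tool)
            archive.writestr(name, text)
            names.append(name)
    return names
```

Calling collect_outputs(DEFAULT_COMMANDS, "autoraid_diag.zip") on the host would leave a single file to attach. A command that cannot be started is recorded as a note in the archive instead of aborting the run.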
Kumar Deuja_1
Advisor

Re: finding which disk inside the array is bad?

Thank you for the reply. I am monitoring the syslog to see if any errors show up. I will post them when I get those errors again.

Thanks again.
Eugeny Brychkov
Honored Contributor
Solution

Re: finding which disk inside the array is bad?

The errors you see in syslog are most likely consequences, and you need to find the root cause ASAP. If you attach what I asked for, I can check the 12H's health and whether it is the cause or not.
Eugeny
Kumar Deuja_1
Advisor

Re: finding which disk inside the array is bad?

 
Eugeny Brychkov
Honored Contributor

Re: finding which disk inside the array is bad?

Your 12H has a very strange configuration. There are only 3 disks, 2x18G and 1x50G. The 50G one is not a native HP disk: it has 'BA11' firmware (I guess Seagate's native).
In addition, you should install disks of the same capacity in groups of 2 or more (examples: 2x18G, 3x36G, 5x9G). In the current config you lose 32G of this 50G disk (50G-18G=32G).
The largest supported disk in the 12H is 36G. All disks should have HP firmware.
What can I tell you? The 12H is in an unsupported/untested config, with an unsupported disk of unsupported capacity...
Anyway, you should look into the logs (logprint -t All -v) to see if there are any suspicious events.
Eugeny
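
The 50G-18G=32G arithmetic above can be written as a tiny Python sketch. It assumes, as a simplification drawn only from that example, that whatever the single largest disk holds beyond the next-largest disk has no partner and is lost; the function name is hypothetical and this is not a description of how the 12H controller actually allocates space.

```python
def unusable_capacity_gb(disk_sizes_gb):
    """Capacity on the largest disk that no other disk can pair with.

    Simplified model based only on the example in the post: the part
    of the largest disk above the second-largest disk is lost.
    """
    if len(disk_sizes_gb) < 2:
        raise ValueError("need at least two disks")
    ordered = sorted(disk_sizes_gb, reverse=True)
    return ordered[0] - ordered[1]


# The configuration discussed in this thread: 2x18G and 1x50G.
print(unusable_capacity_gb([18, 18, 50]))  # 32

# A matched set loses nothing under this model.
print(unusable_capacity_gb([36, 36, 36]))  # 0
```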
Kumar Deuja_1
Advisor

Re: finding which disk inside the array is bad?

Well, I had to delete some of the logs as they were filling up the drive space. After some more troubleshooting, I removed one of the drives and rebuilt the array. This is what I have for the logs.

System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333497
Event code = 163
Event code description = Disk Set Detached
Event count = 1
FRU ID = 129
FRU description = Reporting Controller


System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333498
Event code = 151
Event code description = Disk Set Attached
Event count = 1
FRU ID = 129
FRU description = Reporting Controller


System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333498
Event code = 163
Event code description = Disk Set Detached
Event count = 1
FRU ID = 129
FRU description = Reporting Controller


System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333499
Event code = 151
Event code description = Disk Set Attached
Event count = 1
FRU ID = 129
FRU description = Reporting Controller


System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333499
Event code = 186
Event code description = Drive Missing At Power On
Event count = 1
FRU ID = 129
FRU description = Reporting Controller


System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333499
Event code = 137
Event code description = Rebuild Started
Event count = 1
FRU ID = 0
FRU description = Disk in slot A1


Disk error record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333647
Event code = 22
Event code description = Recovered By Disk Drive
Event count = 1
FRU number = 10
FRU description = Disk in slot A6

Slot number = A6
Sense key = 0x1
Additional Sense code = 0x18
Additional Sense code qualifier = 0x3
LBA = 16916039

System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333653
Event code = 147
Event code description = Disk Drive Deleted From Disk Set
Event count = 1
FRU ID = 129
FRU description = Reporting Controller


System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Controller timestamp = 1333653
Event code = 138
Event code description = Rebuild Complete
Event count = 1
FRU ID = 0
FRU description = Disk in slot A1


Usage record for Subsystem 0000000DFE24 at Fri Jul 18 15:46:54 2003
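
A long listing like this is easier to scan when the disk error records are counted per slot. Below is a minimal Python sketch that does that. The record layout (a "... record for Subsystem ... at ..." header followed by "key = value" lines) is inferred only from the excerpt above, so treat it as an assumption, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Header layout inferred from the excerpt in this post (an assumption).
HEADER = re.compile(
    r"^(?P<kind>.+?) record for Subsystem (?P<subsystem>\S+) at (?P<when>.+)$"
)


def parse_records(text):
    """Split logprint-style text into a list of dicts, one per record."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            current = match.groupdict()
            records.append(current)
        elif current is not None and " = " in line:
            key, value = line.split(" = ", 1)
            current[key.strip()] = value.strip()
    return records


def disk_errors_by_slot(records):
    """Count 'Disk error' records per slot, falling back to the FRU text."""
    return Counter(
        record.get("Slot number", record.get("FRU description", "unknown"))
        for record in records
        if record["kind"] == "Disk error"
    )


SAMPLE = """\
System change record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Event code = 137
Event code description = Rebuild Started
FRU description = Disk in slot A1

Disk error record for Subsystem 0000000DFE24 at Fri Jul 18 15:39:22 2003
Event code = 22
Event code description = Recovered By Disk Drive
FRU description = Disk in slot A6

Slot number = A6
Sense key = 0x1
"""

print(disk_errors_by_slot(parse_records(SAMPLE)))  # Counter({'A6': 1})
```

A slot that keeps showing up in the count is the first candidate to look at; a single recovered error, as in this log, says much less.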
Eugeny Brychkov
Honored Contributor

Re: finding which disk inside the array is bad?

Did you remove A6? And what does 'arraydsp -a' show now (please attach, not post)? Is the data accessible from the host?
Eugeny
Kumar Deuja_1
Advisor

Re: finding which disk inside the array is bad?

No, I didn't remove the A6 disk, but I did change the cable. The logprint output only shows a one-time error on disk A6. What happened next is that, after I changed the cable and started to back up about 20 clients, the server crashed with the following error. I forgot to mention, it's a NetBackup server.

Reboot after panic: kalloc: out of kernel virtual space

Should I remove the A6 disk and rebuild the array? I am totally confused by this server.

Thanks
Eugeny Brychkov
Honored Contributor

Re: finding which disk inside the array is bad?

I do not think it's related to storage. It looks like a patch issue. Try patching the system with the latest OnlineDiags, then the GR patch bundle, and then the HWE patch bundle. Most probably your system lacks some critical SCSI/OS patches.
Eugeny