SOLVED
stephen peng
Valued Contributor

va7400 disk failed, leading to IO

Dear all,
I've got a failing disk in a VA7400. Before the disk failed, armdsp reported the following:
Redundancy Group:_____________________1
Total Disks:________________________15
Total Physical Size:________________500.679 GB
Allocated to Regular LUNs:__________167.05 GB
Allocated as Business Copies:_______0 bytes
Used as Active Hot Spare:___________66.757 GB
Used for Redundancy:________________187.035 GB
Unallocated (Available for LUNs):___79.835 GB

Redundancy Group:_____________________2
Total Disks:________________________14
Total Physical Size:________________467.3 GB
Allocated to Regular LUNs:__________217.519 GB
Allocated as Business Copies:_______0 bytes
Used as Active Hot Spare:___________66.757 GB
Used for Redundancy:________________167.527 GB
Unallocated (Available for LUNs):___15.496 GB
After the disk (M/D10) failed, armdsp reported the following:
Redundancy Group:_____________________1
Total Disks:________________________15
Total Physical Size:________________500.679 GB
Allocated to Regular LUNs:__________167.05 GB
Allocated as Business Copies:_______0 bytes
Used as Active Hot Spare:___________0 bytes
Used for Redundancy:________________253.792 GB
Unallocated (Available for LUNs):___79.835 GB

Redundancy Group:_____________________2
Total Disks:________________________13
Total Physical Size:________________433.921 GB
Allocated to Regular LUNs:__________217.519 GB
Allocated as Business Copies:_______0 bytes
Used as Active Hot Spare:___________0 bytes
Used for Redundancy:________________216.402 GB
Unallocated (Available for LUNs):___0 bytes
The Sybase database's I/O then failed, and I saw the following messages in syslog.log:
Aug 15 09:13:09 timsa vmunix: SCSI: Read error -- dev: b 31 0x0b0400, errno: 126, resid: 2048,
Aug 15 09:13:09 timsa vmunix: blkno: 37842266, sectno: 75684532, offset: 95774720, bcount: 2048.
Aug 15 09:13:09 timsa vmunix: blkno: 37842290, sectno: 75684580, offset: 95799296, bcount: 2048.
Aug 15 09:13:09 timsa vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0x0000000048a9e000), from raw device 0x1f0b0400 (with priority: 0, and current flags: 0x40) to raw device 0x1f0a0400 (with priority: 1, and current flags: 0x0).
Aug 15 09:13:09 timsa vmunix: blkno: 8, sectno: 16, offset: 8192, bcount: 2048.
Aug 15 09:13:09 timsa vmunix:
Aug 15 09:13:09 timsa above message repeats 2 times
Aug 15 09:13:09 timsa vmunix: LVM: vg[3]: pvnum=1 (dev_t=0x1f0a0400) is POWERFAILED
Aug 15 09:13:09 timsa vmunix: SCSI: Read error -- dev: b 31 0x0b0400, errno: 126, resid: 2048,
Aug 15 09:13:14 timsa above message repeats 2 times
Aug 15 09:13:14 timsa vmunix: LVM: Recovered Path (device 0x1f0b0400) to PV 1 in VG 3.
Aug 15 09:13:39 timsa vmunix: LVM: Performed a switch for Lun ID = 0 (pv = 0x0000000048a9e000), from raw device 0x1f0a0400 (with priority: 1, and current flags: 0xc0) to raw device 0x1f0b0400 (with priority: 0, and current flags: 0x80).
Aug 15 09:13:44 timsa vmunix: LVM: vg[3]: pvnum=1 (dev_t=0x1f0b0400) is POWERFAILED
Aug 15 09:14:12 timsa vmunix:
Aug 15 09:13:09 timsa vmunix: LVM: vg[3]: pvnum=1 (dev_t=0x1f0a0400) is POWERFAILED
Aug 15 09:14:12 timsa vmunix: SCSI: Read error -- dev: b 31 0x0b0400, errno: 126, resid: 2048,
Aug 15 09:14:12 timsa vmunix: blkno: 37842264, sectno: 75684528, offset: 95772672, bcount: 2048.
Aug 15 09:16:25 timsa vmunix: LVM: Recovered Path (device 0x1f0b0400) to PV 1 in VG 3.
Aug 15 09:16:25 timsa vmunix: LVM: Recovered Path (device 0x1f0a0400) to PV 1 in VG 3.
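(For reference, the two raw devices in the switch messages, 0x1f0b0400 and 0x1f0a0400, are the primary and alternate LVM links to the same LUN; both paths show up under the PV in a standard vgdisplay. Assuming the vg[3] in the messages is the group named vg03 on my system:

vgdisplay -v /dev/vg03 | grep -i "PV Name")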

I don't think a failed disk in a VA7400 should cause I/O errors, but I notice there was a hot-spare problem on that VA7400. Could that have caused the I/O problem? Before the failure there was 66 GB of hot spare in RG 1; afterwards RG 1 had no hot spare, and its redundancy grew from 187 GB to 253 GB. Why? Would a failed disk in RG 2 affect RG 1? I cannot figure out the relationship between them. Could you offer me some hints?

thanks a lot
5 REPLIES
Denver Osborn
Honored Contributor

Re: va7400 disk failed, leading to IO

Run an lvdisplay on the lvols that reside on your va7400... what is bad block set to?

Bad block relocation should be "NONE".
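To check what each lvol is currently set to, something like this should work (vgXX/lvolYY below are placeholders for your actual VG and LV names):

lvdisplay /dev/vgXX/lvolYY | grep -i "bad block"

If it shows "on", turn relocation off: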


lvchange -r N /dev/vgXX/lvolYY


-denver
tkc
Esteemed Contributor

Re: va7400 disk failed, leading to IO

hi,

A failed disk won't cause I/O errors by itself, but the error you saw in syslog ('SCSI: Read error -- dev: b 31 0x0b0400, errno: 126, resid: 2048') was due to an I/O timeout from the VA7400. When you have a failed disk, especially when free disk space is low, there will be some performance impact because of the rebuild process.
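If the host has the Command View SDM tools installed, you can also keep an eye on the rebuild from there; a sketch (armdsp -i lists the array IDs to substitute in):

armdsp -i
armdsp -a <ArrayID>

and watch the disk and redundancy group state while the rebuild runs.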
goldboy
Trusted Contributor

Re: va7400 disk failed, leading to IO

Stephen,
I got a chance to work on a VA7400 two months ago; one failed disk caused the UNIX server to completely lose sight of all the LUNs in the ioscan results.
The minute we pulled the disk out, the VA started rebuilding and the LUNs became visible to the UNIX server again.

I never thought a single disk could cause so many failures on a VA with 3 JBODs, 50 disks in total.

I suggest you remove that disk from the VA and see if it resolves your issue.
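Once the disk is out and the rebuild starts, a rescan from the host should show the LUNs again (standard HP-UX commands; adjust the class if needed):

ioscan -fnC disk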

Tal
"Life is what you make out of them!"
Arend Lensen
Trusted Contributor

Re: va7400 disk failed, leading to IO

Please check the rebuild rate; if it is set too high, host I/Os will get (too) low a priority.
HP also recommends that you leave the space of two disks unallocated for optimal performance. Is the PV timeout set to 60 seconds on the hosts, or is it still at the default (30), which is incorrect for a VA?
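To check the current timeout and raise it to 60 seconds on each PV that lives on the VA, a sketch (the device file below is a placeholder; use your own paths):

pvdisplay /dev/dsk/c11t0d4 | grep -i timeout
pvchange -t 60 /dev/dsk/c11t0d4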
Owen_15
Valued Contributor
Solution

Re: va7400 disk failed, leading to IO

Hi Stephen,

You are quite right, a failure in RG2 should not affect RG1.

With the issue you raised about RG1, all the space that was in Active Hot Spare was moved into Used for Redundancy: 187.035 + 66.757 = 253.792 GB, which is exactly the figure you saw after the failure.

With the actual fault that occurred in RG2, if you calculate it through, everything is displayed correctly. Let me run you through it:

1. A disk fails.
2. The array takes all the space in Unallocated and Active Hot Spare and assigns it to the rebuild process for the failing disk.
3. Add up Active Hot Spare and Unallocated: 66.757 + 15.496 = 82.253 GB.
Then subtract the failed disk's size (467.3 - 433.921 = 33.379 GB, from the drop in Total Physical Size): 82.253 - 33.379 = 48.874 GB.
Then add that to the pre-failure Redundancy: 167.527 + 48.874 = 216.401 GB, which matches the post-failure Redundancy size of 216.402 GB within display rounding.

So, during a rebuild process the array places all space from Unallocated and Active Hot Spare into the Redundancy category.

Once the rebuild and any other balancing or leveling finishes, it will restore what it can into the Active Hot Spare and Unallocated groups.

So when any disk fails, no matter which RG it is in, the array groups these values together as described, and that is what armdsp then displays.

Hope this helps explain it.

Regards
Owen



