HPE EVA Storage

Disk Timeouts on CA LUNs

 
Stiwi Wondrusch
Trusted Contributor

Hi all

We have some serious problems with our EVAs.

This is the setup:

We have 4 EVAs
SAN6_EVA4000___ESX: 6100 CA (10km) to SAN4_EVA4000___Coloc: 6100
SAN3_EVA5000___Mix: 3028 CA (10km) to SAN4_EVA4000___Coloc: “
SAN5_EVA4100___UX: 6110 No CA

Connected to SAN6_EVA4000___ESX we have VMware ESX 3.0 running Windows 2003 with file server, Exchange, …

Last Monday we received errors from Windows 2003 about disk timeouts greater than 30 sec.
We narrowed it down:
The disk timeouts occurred only on the SAN6 LUNs that are CA replicated.
No problems with other SAN6 LUNs.
No problems with the SAN3 LUNs that replicate to the same DG on SAN4 as the SAN6 LUNs.

In the SAN6 Controller Log I see:

11:51:52:053 15-Oct-2007 Controller A upper 0c1b5f0c #8781
A Data Replication Group has transitioned to the Logging state because the alternate Storage System is not accessible.
Corrective action code: 5f

11:51:51:780 15-Oct-2007 Controller A upper 0c345f0c #8779
The Data Replication Path between this Storage System and the Peer Storage System has closed, due to slow response on the connection between the specified host port and the Peer Storage System.
Corrective action code: 5f

11:51:51:780 15-Oct-2007 Controller A upper 0c18640c #8778
Conditions on the Data Replication Destination Storage System are preventing acceptable replication throughput: Initiating temporary logging on the affected Data Replication Group that is failsafe mode disabled.
Corrective action code: 64


I opened a case last Monday (15 Oct) and escalated it last Friday. Today they tell me that the expert is on holiday until Wednesday. I am speaking with people who know less about EVAs than I do. We have 7x24 4h support.

What I know:
Timeouts on EVA LUNs reported by ESX are common. We see them sometimes and I don't worry about them. Timeouts greater than 30 sec that are reported by the OS are another matter.

My questions:
-Should I be worried about my data?
-Can somebody explain the Controller Events?
-How to proceed with HP Support?
-Any other input is highly appreciated


Thx & rgds Stiwi Wondrusch
Uwe Zessin
Honored Contributor

Re: Disk Timeouts on CA LUNs

It looks like you have a problem with the intersite links. I suggest checking the switch counters, too.


The disk timeout for Windows in a SAN or in a VM should be at least 60 seconds:
> reg query "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeOutValue
> reg add HKLM\SYSTEM\CurrentControlSet\Services\Disk /v TimeOutValue /t REG_DWORD /d 60
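To check the result on several guests, the `reg query` output can be parsed with a short script. This is only a sketch: the sample text is an assumed example of what `reg query` prints (the DWORD shown in hexadecimal), and `parse_timeout_value` and `timeout_ok` are made-up helper names, not part of any tool.

```python
# Sketch: read the TimeOutValue from captured `reg query` output.
# SAMPLE_OUTPUT is an assumed example, not output taken from this thread.
SAMPLE_OUTPUT = r"""
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk
    TimeOutValue    REG_DWORD    0x3c
"""


def parse_timeout_value(reg_output):
    """Return the disk timeout in seconds, or None if the value is not listed."""
    for line in reg_output.splitlines():
        fields = line.split()
        if (
            len(fields) == 3
            and fields[0].lower() == "timeoutvalue"
            and fields[1] == "REG_DWORD"
        ):
            # The value is assumed to be printed in hexadecimal, e.g. 0x3c.
            return int(fields[2], 16)
    return None


def timeout_ok(reg_output, minimum=60):
    """True if a TimeOutValue is present and at least `minimum` seconds."""
    value = parse_timeout_value(reg_output)
    return value is not None and value >= minimum


if __name__ == "__main__":
    print(parse_timeout_value(SAMPLE_OUTPUT))  # 60
    print(timeout_ok(SAMPLE_OUTPUT))           # True
```

A missing value is reported as not OK here, so a guest where the key was never set shows up in the check instead of being skipped.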


I would also check whether any .ISO files are stored on a VMFS together with VMs and mapped. I've been told that VMware ESX Server can run into SCSI reservation issues if such a file is permanently polled by a VM. The suggestion is to store them locally or on a separate VMFS.


> -How to proceed with HP Support?

Yes, that is a BIG problem these days :-( :-( :-( :-(

I suggest that you involve your management and have them talk to HP's upper layers. Don't get angry with those poor souls at layer 0 + 1.
Stiwi Wondrusch
Trusted Contributor

Re: Disk Timeouts on CA LUNs

Hi Uwe

I checked the switch counters:
The ISLs to the colocation site are on port 15.

sanswitch7:admin> portErrShow
      frames      enc   crc   too   too   bad   enc  disc  link  loss  loss  frjt  fbsy
       tx    rx    in   err  shrt  long   eof   out    c3  fail  sync   sig
=======================================================================================
 0:  1.0g  1.2g     0     0     0     0     0   146  3.9k     0     1     1     0     0
 1:   40m   12m     0     0     0     0     0   296     3     0    11    12     0     0
 2:  704m  3.2g     0     0     0     0     0   565    12    23    31    32     0     0
 3:  1.7g  2.5g     0     0     0     0     0  4.2k    29    11    40    40     0     0
 4:  3.8g  1.1g     0     0     0     0     0   183   16k     0     1     1     0     0
 5:     0     0     0     0     0     0     0     0     0     0     0     2     0     0
 6:  6.7m  3.9m     0     0     0     0     0  5.2k     0     2     6     6     0     0
 7:  613m  876m     0     0     0     0     0  225m     0     3   101   290     0     0
 8:  3.3g  489m     0     0     0     0     0  155k     9     2     4     4     0     0
 9:  696m  808m     0     0     0     0     0   102     9     5    12    12     0     0
10:  725m  1.7g     0     0     0     0     0   662     0    12     6     4     0     0
11:  350m  650m     0     0     0     0     0  939k     0     0     4     4     0     0
12:  203m  481m     0     0     0     0     0  1.1k   36k     7     1    44     0     0
13:   63m   36m     0     0     0     0     0     0     9     2     7     8     0     0
14:     0     0     0     0     0     0     0     0     0     0     0     2     0     0
15:  2.3g  2.5g     0     0     0     0     0    14  4.8k     1     2     2     0     0

sanswitch8:admin> portErrShow
      frames      enc   crc   too   too   bad   enc  disc  link  loss  loss  frjt  fbsy
       tx    rx    in   err  shrt  long   eof   out    c3  fail  sync   sig
=======================================================================================
 0:  155m  175m     0     0     0     0     0   188    16     0     1     1     0     0
 1:     0     0     0     0     0     0     0     0     0     0     0     2     0     0
 2:  894m  1.2g     0     0     0     0     0  2.0k    11    15    16    17     0     0
 3:  1.7g  2.5g     0     0     0     0     0   12k    20    14    72    72     0     0
 4:  2.7g  686m     0     0     0     0     0   155     0     0     1     1     0     0
 5:  1.4g  4.1g     0     0     0     0     0  278m    19     0    69    93     0     0
 6:  2.7g  1.5g     0     0     0     0     0  577k     0     4    10    14     0     0
 7:  530k  528k     0     0     0     0     0  208m     0     0   123   441     0     0
 8:  538m  3.5g     0     0     0     0     0  264k     0     3     6     6     0     0
 9:  4.1g  1.3g     0     0     0     0     0     0     0     5    12    12     0     0
10:  2.1g  1.9g     0     0     0     0     0   24m     0    14     8     4     0     0
11:   86m  3.3g     0     0     0     0     0  192k    25     0     4     4     0     0
12:  170m   53m     0     0     0     0     0  759k     2     3    15    16     0     0
13:  149m  664m     0     0     0     0     0     0     0     2     5     6     0     0
14:     0     0     0     0     0     0     0     0     0     0     0     2     0     0
15:  3.8g  3.3g     0     0     0     0     0    12   414     0     3     2     0     0
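
To compare ports in listings like the two above, the counters can be turned into plain numbers and sorted. The following is a rough sketch only: the column names in `COLUMNS` are labels chosen here from the two header lines, the k/m/g suffixes are assumed to mean decimal thousand/million/billion, and the function names are hypothetical. The sample rows are ports 7, 11 and 15 from sanswitch7.

```python
# Sketch: parse portErrShow rows and rank ports by one counter.
# Assumption: k/m/g are decimal multipliers (10^3, 10^6, 10^9).
SUFFIX = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}

# Labels derived from the two header lines of the listing.
COLUMNS = [
    "frames_tx", "frames_rx", "enc_in", "crc_err", "too_shrt", "too_long",
    "bad_eof", "enc_out", "disc_c3", "link_fail", "loss_sync", "loss_sig",
    "frjt", "fbsy",
]

SAMPLE = """\
sanswitch7:admin> portErrShow
7: 613m 876m 0 0 0 0 0 225m 0 3 101 290 0 0
11: 350m 650m 0 0 0 0 0 939k 0 0 4 4 0 0
15: 2.3g 2.5g 0 0 0 0 0 14 4.8k 1 2 2 0 0
"""


def to_count(token):
    """Convert a counter such as '4.8k' or '14' to an integer."""
    token = token.strip().lower()
    if token and token[-1] in SUFFIX:
        return round(float(token[:-1]) * SUFFIX[token[-1]])
    return int(token)


def parse_port_err_show(text):
    """Return {port: {column: count}} for every data row in the listing."""
    rows = {}
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        if not sep or not head.strip().isdigit():
            continue  # prompt, header or separator line
        values = rest.split()
        if len(values) != len(COLUMNS):
            continue  # not a complete counter row
        rows[int(head)] = dict(zip(COLUMNS, (to_count(v) for v in values)))
    return rows


def worst_ports(rows, column, top=3):
    """Ports with the highest value in `column`, highest first."""
    return sorted(rows, key=lambda port: rows[port][column], reverse=True)[:top]


if __name__ == "__main__":
    rows = parse_port_err_show(SAMPLE)
    print(worst_ports(rows, "enc_out", top=2))  # [7, 11]
    print(rows[15]["disc_c3"])                  # 4800
```

Because the switch rounds large counters, the converted values are approximate and only useful for ranking ports against each other.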

-What about these figures? Do you think they are too high?
-Do you know a command to find out when they were last cleared?

-We don't have any ISO files on VMFS partitions on the SAN.

-60sec timeouts:
I'm speaking with our ESX guys. They have also been discussing this 60 sec timeout setting since last Monday. Your suggestion implies that a timeout of 30-60 sec is absolutely normal. Is there no way to eliminate these?

-Support:
Management is already involved all the way up. Nowadays this does not help anymore.

thx a lot Uwe
rgds Stiwi