HPE EVA Storage

Adam Garsha
Valued Contributor

Microsoft clusters and physical disk settings in a SAN

Does anyone ever adjust the "IsAlive" or "LooksAlive" polling frequencies of physical disk resources, or do you find that the defaults are good enough?

I am looking for parameters that would make our MS clusters more forgiving of events in our SAN.

Currently, if you sneeze on one of our EVAs, you may see an automatic failover on one of our clusters (R2|x64|SP2). The failover itself works great, but it is too sensitive to SAN perturbations.

Perhaps these two parameters are not related to my issue. Are there other timeouts that you have specifically adjusted to make your Windows clusters as forgiving on a SAN as Unix hosts are? Our HBA disk timeouts are set to 2 minutes. IsAlive is at the default of 60 seconds, and LooksAlive is at 5 seconds.

Yesterday I pulled a bad drive (already ungrouped, and pulled only after a "remove"), and it triggered a failover event...

Adam Garsha
Valued Contributor

Re: Microsoft clusters and physical disk settings in a SAN

P.S. We are on all the latest HBA firmware/driver/MPIOFF/PSP7.9/firmware7.9, etc. etc.

Also, our switches have been reviewed and no issues stand out.

Also, latest and greatest VCS.

Also, no outstanding errors in controller logs.

Also, not using load balancing with MPIO, but manually setting default path to managing controller.

thanks.
Uwe Zessin
Honored Contributor

Re: Microsoft clusters and physical disk settings in a SAN

I remember a note or advisory (but don't have a pointer at hand right now, sorry) that claims that the cluster software shrinks the default disk timeout and that it should be set back to at least 60 seconds.
Watch out - command lines might wrap.

> reg query "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeOutValue

> reg add HKLM\SYSTEM\CurrentControlSet\Services\Disk /v TimeOutValue /t REG_DWORD /d 60

A reboot is required...
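
One thing to keep in mind when rechecking the value: `reg query` prints REG_DWORD data in hexadecimal, so 60 seconds shows up as 0x3c and 2 minutes as 0x78. Below is a minimal Python sketch for converting captured output back to seconds. The function name `parse_timeout_value` and the sample text are made up for illustration, and it assumes the usual `name  type  data` column layout of `reg query`.

```python
import re

def parse_timeout_value(reg_query_output):
    """Return TimeOutValue in seconds from captured `reg query` text, or None."""
    # Assumed layout: "TimeOutValue    REG_DWORD    0x3c" (data printed in hex).
    match = re.search(r"TimeOutValue\s+REG_DWORD\s+(0x[0-9a-fA-F]+)",
                      reg_query_output)
    return int(match.group(1), 16) if match else None

# Illustrative sample only, not output taken from a real node.
sample = r"""
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Disk
    TimeOutValue    REG_DWORD    0x3c
"""

print(parse_timeout_value(sample))  # 0x3c is 60 seconds
```

Running this against the output from every node makes it easy to spot a node where the value is not the one you expect.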
Adam Garsha
Valued Contributor

Re: Microsoft clusters and physical disk settings in a SAN

Thanks, Uwe. We set that to 2 minutes as part of our standard procedure, but I'll recheck it. Please keep the timeouts coming...

We also set that 360 setting that is in the release notes for VCS 4.04/07+.
Urban Petry
Valued Contributor
Solution

Re: Microsoft clusters and physical disk settings in a SAN

Adam,

What does the System event log say at failover time? And have you looked at %SystemRoot%\Cluster\cluster.log (on the node that "lost" the disk) for any clues? Maybe you can post both, together with the timestamp of a failover, so we can take a look.

Are you using the storport or the scsiport driver model?
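
Since cluster.log can get large, it may help to cut it down to the minutes around the failover before posting. Here is a rough Python sketch of that. It assumes each line carries a stamp shaped like `::YYYY/MM/DD-HH:MM:SS.mmm` after the process and thread IDs, so check that against your own log first. The name `lines_near` and the sample lines are invented for illustration. Also, as far as I know cluster.log stamps are in GMT, so convert the event log time before comparing.

```python
import re
from datetime import datetime, timedelta

# Assumed line shape: "<pid>.<tid>::YYYY/MM/DD-HH:MM:SS.mmm LEVEL message"
STAMP = re.compile(r"::(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2})\.\d{3}\s")

def lines_near(log_lines, failover_time, window_seconds=120):
    """Keep only the lines stamped within window_seconds of failover_time."""
    window = timedelta(seconds=window_seconds)
    selected = []
    for line in log_lines:
        match = STAMP.search(line)
        if not match:
            continue  # skip lines without a recognizable stamp
        stamp = datetime.strptime(match.group(1), "%Y/%m/%d-%H:%M:%S")
        if abs(stamp - failover_time) <= window:
            selected.append(line.rstrip("\n"))
    return selected

# Invented sample lines, only to show the filtering.
sample_log = [
    "00000a4c.00000b10::2007/08/14-15:10:00.000 INFO heartbeat ok",
    "00000a4c.00000b10::2007/08/14-15:32:10.123 ERR  disk check failed",
    "00000a4c.00000b10::2007/08/14-15:33:05.456 WARN moving group",
]

for line in lines_near(sample_log, datetime(2007, 8, 14, 15, 32, 0)):
    print(line)
```

In practice you would pass the open log file instead of `sample_log`, and widen the window if the trouble started well before the failover.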

Is it always the same disk that causes the failover (if there is more than one clustered disk) or is it random?

What you might try is letting your physical disk resource(s) run in a separate resource monitor (configurable checkbox on the physical disk's property page), which prevents "hanging" checks on other cluster resources (e.g. lots of file shares, third party cluster resources) from influencing the disk check. Unfortunately you have to take the physical disk resource offline and bring it back online once for this change to take effect!

If you use storport drivers you might want to upgrade to the latest version in KB935561 (http://support.microsoft.com/default.aspx?scid=kb;en-us;935561).
Also check out KB912593 and KB923424, but since we're still on SP1 I don't know whether they're still needed on SP2 machines.

Urban