
w2k cluster with exchange 5.5 problem

 
SOLVED
Marta
Advisor

w2k cluster with exchange 5.5 problem

Hello everybody!!!
I need your help. I have a W2K cluster with Exchange 5.5 installed. Yesterday I moved the Exchange group from one node to the other (from Cluster Administrator), and almost all the Exchange services failed with error 997. I know the description of this error is "Overlapped I/O operation is in progress", but this doesn't tell me anything. I have searched for this error in the Microsoft Knowledge Base and in Premier Support, but what I found doesn't help me. The cluster is in production, so I can't run any tests on it.
Do you have any idea that could help me?
Customer wants to know what is happening.

Thank you very much in advance.

Cheers,

Marta
8 REPLIES
Enes Dizdarevic
Honored Contributor

Re: w2k cluster with exchange 5.5 problem

It seems that you have more than one partition on one RAID logical drive on the shared disks. Please give more information about your configuration (shared disk setup, logical drives, cluster groups, etc.).
Marta
Advisor

Re: w2k cluster with exchange 5.5 problem

Hello Enes!!!
Thank you for your help. I don't understand very well what you need to know about the cluster, but here is more detail. We have a Fibre Channel cluster with two shared logical arrays: one RAID 5 drive (where Exchange is installed) and one RAID 1 drive (where the backup is installed). On the Exchange drive we have configured only one W2K partition, and its drive letter is F:. I don't know if this is what you need, so please tell me if you need something else.

Again, thank you very much for your interest.
Enes Dizdarevic
Honored Contributor

Re: w2k cluster with exchange 5.5 problem

If your cluster is well configured (and it seems it is), this error is "almost normal". When you perform a cluster move group operation, the cluster service sometimes cannot properly determine the status of the group (it takes the group offline, checks the status, and thinks it is still online). This error tends to appear on systems with write cache enabled on the shared disks. Microsoft recommends moving the group with the command-line version of Cluster Administrator, cluster.exe, with the option wait:10 (which takes the group offline, waits 10 seconds, and continues). If you want to use the GUI version of Cluster Administrator, do not use the Move Group command. Instead, take the group offline, wait 10 seconds, move the group to the other node, and bring it online. For more information, see
http://support.microsoft.com/support/kb/articles/Q248/4/08.ASP?LN=EN-US&SD=gn&FR=0&qry=overlapped%20I/O%20operation%20&rnk=2&src=DHCS_MSPSS_gn_SRCH&SPR=WIN2000
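
The manual sequence described above (take the group offline, wait, move, bring online) could be scripted roughly as follows. This is only a sketch under assumptions: the function names are made up for illustration, and the cluster.exe switches used here (/offline, /moveto:, /online) should be checked against `cluster group /?` on your own system before use. The command runner is passed in, so nothing is executed against a cluster unless you supply a real runner.

```python
import time
from typing import Callable, List


def build_move_steps(group: str, target_node: str) -> List[List[str]]:
    """Return the three cluster.exe invocations: offline, move, online.

    The switch spellings are an assumption; verify with `cluster group /?`.
    """
    return [
        ["cluster", "group", group, "/offline"],
        ["cluster", "group", group, f"/moveto:{target_node}"],
        ["cluster", "group", group, "/online"],
    ]


def move_group_with_pause(
    group: str,
    target_node: str,
    run: Callable[[List[str]], object],
    pause_seconds: float = 10.0,
    sleep: Callable[[float], object] = time.sleep,
) -> None:
    """Take the group offline, pause, move it, then bring it online."""
    offline, move, online = build_move_steps(group, target_node)
    run(offline)
    # Pause so the disk controller can finish flushing its write cache.
    sleep(pause_seconds)
    run(move)
    run(online)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    move_group_with_pause(
        "Exchange Group",  # hypothetical group name
        "NODE2",           # hypothetical node name
        run=lambda cmd: print(" ".join(cmd)),
        pause_seconds=0,
    )
```

To run it for real, one option is to pass `run=lambda cmd: subprocess.run(cmd, check=True)` so that a failing step stops the sequence.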
Marta
Advisor

Re: w2k cluster with exchange 5.5 problem

Thank you very much for your answer, I will do it as you say. Just one thing: can you explain in more detail what you said about the "write cache" of the shared disks? What is the right configuration for the disks?
At what point do the ten seconds of waiting happen?
Excuse me for being so "slow". My customer wants to know everything in detail, so if you don't mind, can you explain it to me more slowly, please?
Once more, thank you very much for your precise answer.
Enes Dizdarevic
Honored Contributor
Solution

Re: w2k cluster with exchange 5.5 problem

The proper caching policy for shared disks depends on the type of shared storage. If your shared storage has two controllers and the cache in these controllers is mirrored (for example VA and XP), you can enable write cache on the controllers, because if one controller fails (or on an administrative failover) the second controller takes over the LUNs and the data in cache. If your storage does not have mirrored cache (for example RS12FC and all SCSI configurations with 3SI or 2M/4M), you should disable write cache, because on failover the data in cache will be lost. (Some controllers do not allow write cache to be enabled in cluster mode.) Disabling write cache degrades performance, so some people enable write cache even when the cache is not mirrored.
When a cluster group is moved, the cluster service waits for a message from the controller saying that the cache has been flushed to disk. If the cache is mirrored or the cache is disabled, this message comes instantly. If not, the message can be delayed and you may get the error. That is why you have to wait before the next cluster command. Microsoft says to wait 10 seconds; in many cases less than 10 is OK.
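
The rule of thumb above can be written down as a small decision helper. This is just an illustrative sketch of the reasoning in this post, not a vendor tool: the class and field names are hypothetical, and the real capabilities of a given array should come from its documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedStorage:
    """Hypothetical description of a shared storage array."""
    name: str
    dual_controllers: bool
    mirrored_cache: bool


def write_cache_recommended(storage: SharedStorage) -> bool:
    """Write cache is safe only if a second controller holds a mirror of it."""
    return storage.dual_controllers and storage.mirrored_cache


def pause_needed_before_next_command(
    storage: SharedStorage, write_cache_enabled: bool
) -> bool:
    """True when the 'cache flushed' message may be delayed during a move.

    With the cache disabled or mirrored the message arrives at once, so no
    pause is needed; otherwise wait before the next cluster command.
    """
    if not write_cache_enabled:
        return False
    return not write_cache_recommended(storage)


if __name__ == "__main__":
    array = SharedStorage("example array", dual_controllers=False,
                          mirrored_cache=False)
    print(write_cache_recommended(array))
    print(pause_needed_before_next_command(array, write_cache_enabled=True))
```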
Marta
Advisor

Re: w2k cluster with exchange 5.5 problem

Hello Enes,
Thank you very much for your help. It is a pleasure to work with people who have as much knowledge as you do (and who want to share it). Tomorrow I will try to put all the ideas in order and explain them to my customer. If I have any questions, I will ask you, if you don't mind.
Again, thank you very much for your help, this Exchange cluster is driving me crazy.
Thanks a lot and regards,

Marta
Marta
Advisor

Re: w2k cluster with exchange 5.5 problem

Hello Enes,
If you read this message, I would be very grateful if you could answer a question my customer asked me. I told him what you said, and he wants to know whether this error will also occur when one node fails by itself, because if it does, the failover will fail, and that is a very serious problem. What he wants to know is whether the cluster, internally, uses the same procedure to move the Exchange group when one node has a problem.
Thank you again very much
Enes Dizdarevic
Honored Contributor

Re: w2k cluster with exchange 5.5 problem

Hi,
There is no timing problem if failover occurs because one node crashes. Before it starts taking over the cluster groups that were active on the crashed node, the surviving node spends more than a minute making sure that the other node is really dead.
When one cluster node detects that there is no heartbeat signal from the other node, it does not start failover immediately. It first starts challenge/response checks on the quorum disk to make sure that the other node is really dead. (This eliminates the situation where the other node is actually alive and only the heartbeat network is dead.) This process takes time (the node unlocks the quorum disk if it was locked, waits and tries to lock it again, repeats this, etc.). Sometimes it takes more than one minute. During that time the disk controller reaches its "stable" state, and the disk resources can be brought online without any error messages.
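
The challenge/response idea can be illustrated with a toy model. This is only a simplified sketch of the description above, not the actual cluster service algorithm: the class, the function, and the number of rounds are all hypothetical, and the real waiting intervals are left out.

```python
from typing import Optional


class QuorumDisk:
    """Toy quorum disk that at most one node can hold locked at a time."""

    def __init__(self, owner: Optional[str] = None) -> None:
        self.owner = owner

    def release(self) -> None:
        """Break the current lock (the challenger 'unlocks' the disk)."""
        self.owner = None

    def try_reserve(self, node: str) -> bool:
        """Lock the disk for `node` unless another node already holds it."""
        if self.owner is None or self.owner == node:
            self.owner = node
            return True
        return False


def should_take_over(
    disk: QuorumDisk,
    challenger: str,
    defender: str,
    defender_alive: bool,
    rounds: int = 3,  # hypothetical number of repeats
) -> bool:
    """Return True only if the defender never re-locks the quorum disk."""
    for _ in range(rounds):
        disk.release()
        # During the waiting interval a live defender locks the disk again.
        if defender_alive:
            disk.try_reserve(defender)
        if not disk.try_reserve(challenger):
            return False  # the other node answered, so no failover
    return True


if __name__ == "__main__":
    disk = QuorumDisk(owner="NODE1")
    print(should_take_over(disk, "NODE2", "NODE1", defender_alive=False))
```

In the model, a defender whose heartbeat network is down but which is otherwise alive still re-locks the disk, so the challenger backs off instead of starting a failover.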