
Richard Perez
Valued Contributor

Dual cluster lock, split brain

I have a geographically distant SG/RAC cluster with two nodes, both connected through a SAN and an IP connection. Each site has an EVA5000 and the VGs are mirrored between both EVAs using LVM. The cluster lock is in the EVA5000 at site 1.

I understand that an IP communication break (while the SAN connection remains up) will provoke a TOC in one node: the one that doesn't get the cluster lock located at site 1. If site 1 suffers a catastrophic failure, the node at site 2 will not get the cluster lock, so it will TOC and the cluster will never come up with that node.

Will a dual cluster lock avoid such a situation?
If I set up a dual cluster lock, what will happen if the IP connection is lost and each node gets a cluster lock?

Regards
Rick
Steven E. Protter
Exalted Contributor

Re: Dual cluster lock, split brain

There is a concept in the SG manual of setting up a shared tie-breaking disk. When split brain occurs, both systems try to get control of it, and the system that loses goes TOC.

This can probably be a disk on your SAN, providing the reliability you need in the scenario you present.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Steven E. Protter
Exalted Contributor

Re: Dual cluster lock, split brain

It's called the quorum disk, not tie breaker.

I think I have split brain syndrome.

SEP
Richard Perez
Valued Contributor

Re: Dual cluster lock, split brain

Steven
thanks for your reply. However, I am still wondering how I can avoid the situation where one of the cluster nodes AND the cluster lock die at the same time.

Regards
Brian M Rawlings
Honored Contributor
Solution

Re: Dual cluster lock, split brain

Rick: you have one of the classic clustering issues, and you are on the right track with respect to your question and your guess about the probable outcome.

MCSG requires a single cluster lock disk for two-node clusters, suggests it for 3- and 4-node clusters, and doesn't allow it for 5+ node clusters. The cluster lock "disk" is just any one of the Volume Groups that are under MC/SG control.

The only information used by the cluster lock "disk" is a tiny piece in the VGRA, the Volume Group Reserved Area (space which is always set aside in any LVM disk anyway). This means that the cluster lock disk doesn't take up any actual room, and also that all you have to do is designate one of the VGs under MCSG control to be the lock disk, and those bits in the VGRA get set up for cluster lock function. It is ideal to do it this way, so that you are sure the lock disk's underlying hardware (RAID LUN or actual disk drive) is working correctly.
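To make that concrete, here is a minimal sketch of what the relevant lines of the cluster ASCII file (the one you feed to cmcheckconf/cmapplyconf) look like for a single lock disk -- the VG name, device files and addresses are just placeholders for your environment:

    FIRST_CLUSTER_LOCK_VG /dev/vglock          # any VG already under SG control

    NODE_NAME node1
      NETWORK_INTERFACE lan0
        HEARTBEAT_IP 10.0.1.1
      FIRST_CLUSTER_LOCK_PV /dev/dsk/c4t0d0    # the lock PV as node1 sees it

    NODE_NAME node2
      NETWORK_INTERFACE lan0
        HEARTBEAT_IP 10.0.1.2
      FIRST_CLUSTER_LOCK_PV /dev/dsk/c6t0d0    # same LUN, node2's device path

Running cmcheckconf and then cmapplyconf against the edited file is what actually sets up those bits in the VGRA.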

With a geographically separated cluster ("campus cluster" or "extended campus topology"), you have exactly the problem that you have described. This is the only case where HP recommends using dual cluster lock disks, one at each site. The good news is that this will allow one site to take over all cluster functions if a site is lost. The bad news is that both sites will attempt to run all cluster packages (and succeed), if the only thing lost is the network between the two (split-brain syndrome, with apps up on both sides, all databases open twice, generally very bad news).
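For reference, a dual lock in your setup would (as I understand it) just add a second set of lock entries, one lock PV on the EVA at each site -- again, names and device files are only placeholders, and whether the two lock PVs live in one mirrored VG or in two small dedicated VGs is something to check against the configuration guide:

    FIRST_CLUSTER_LOCK_VG  /dev/vglock
    SECOND_CLUSTER_LOCK_VG /dev/vglock

    NODE_NAME node1
      FIRST_CLUSTER_LOCK_PV  /dev/dsk/c4t0d1   # lock PV on the site-1 EVA
      SECOND_CLUSTER_LOCK_PV /dev/dsk/c5t0d1   # lock PV on the site-2 EVA

    NODE_NAME node2
      FIRST_CLUSTER_LOCK_PV  /dev/dsk/c6t0d1
      SECOND_CLUSTER_LOCK_PV /dev/dsk/c7t0d1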

The best way to reduce the likelihood of this happening is to have two (or more) completely redundant networks in place, unbridged (no points of connection other than the clustered servers). This means that the wires or fiber need to be in separate places (trenches, ceilings, plenums, etc.), so that no single event can drop the whole network. There is almost no way to be extreme enough about this... just like the two computer rooms (separate power sources, even different utility companies, if possible). As long as any network is still functional, split-brain syndrome is avoided, and users can still access their apps and data.
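On the SG side, that simply means every one of those physically separate LANs is configured as a heartbeat network for each node, along the lines of (interface names and addresses are placeholders):

    NODE_NAME node1
      NETWORK_INTERFACE lan0
        HEARTBEAT_IP 10.0.1.1                  # heartbeat LAN routed through path A
      NETWORK_INTERFACE lan1
        HEARTBEAT_IP 10.0.2.1                  # second heartbeat LAN on a separate physical path

(and the same pattern for node2). As long as one of those heartbeat LANs survives, the cluster never has to fall back on lock-disk arbitration at all.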

There are two other ways to avoid any possibility of split-brain syndrome, but they (naturally) have their own headaches. One is to avoid the use of cluster locks altogether (simply don't assign any), and to instead use a small HP-UX server in a third location, a third node in the cluster. HP calls this an "Arbiter Node" or "arbitrator system", and it is a fully supported configuration (specifically intended for this exact situation). The arbiter node only needs to be on the network, it has no apps, no connection to any shared storage, and no role in running any packages (none are ever assigned to it).

Its sole purpose is to provide a quorum for whichever system is still in communication with it if a network fails. Since more than 50% of the nodes are present, quorum is established, the cluster is re-formed as a two-node cluster, and MCSG restarts all packages as per its control scripts (normally all start on the surviving node). The reason this works is that a cluster lock is not required for a 3-node cluster, and any two nodes can establish the quorum.
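If it helps to visualize it: the arbiter is configured like any other cluster member in the ASCII file, just with no lock entries anywhere and no packages that list it -- hostname and address below are made up:

    # no FIRST_CLUSTER_LOCK_* lines at all in a three-node arbiter configuration
    NODE_NAME arbiter1                         # the small box in the third location
      NETWORK_INTERFACE lan0
        HEARTBEAT_IP 10.0.1.3

and in each package ASCII file you only list the two real nodes:

    NODE_NAME node1
    NODE_NAME node2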

The downside is two-fold: first, a complete network failure will still leave no node able to establish a quorum, and all nodes will TOC and halt. This means no split-brain syndrome, but also no automated/lights-out failover. Somebody has to manually restart one or both nodes and convince them to resume cluster operations. In a lot of cases, this is the preferred outcome, compared to split-brain syndrome. The second downside is that not everybody actually has a campus with 3 or more buildings over which they have ownership or control. A lot of people have their main site, and a co-location center or a second building, and nothing else.

For this scheme to work, a disaster in either main site cannot affect the arbiter node -- so it needs to be in a third location, or some sort of specially powered, air conditioned, and otherwise hardened space in one of the two main sites. This can get tricky and costly, as can the requirement for the arbiter node to be on all network segments in use by the cluster. But this is the optimal way around your dilemma.

The second way to avoid split-brain syndrome is much simpler, but is unacceptable for some high-uptime requirement situations: only install one cluster lock, as you are doing, and simply be OK with the fact that you have partial failover automation. If the disaster hits the site with the cluster lock disk, and the remaining site cannot establish a quorum, it will TOC. For lots of people, this is preferable to split-brain syndrome, as mentioned earlier. Manual intervention is required to get things back up and running at the functional site, and MCSG has to be convinced to come up and run as the only node. But your data is all intact and uncorrupted, and the tasks to whip MCSG into action are merely procedural, easily documented, and even scriptable (for the most part).
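The manual part typically boils down to a couple of commands on the surviving node, once you have satisfied yourself that the other site really is down -- syntax from memory, so double-check against your SG version:

    cmruncl -v -n node2     # form the cluster with only the surviving node
    cmviewcl -v             # verify cluster and package status

SG should warn you before forming a cluster without all configured nodes; that warning is exactly the data-integrity protection you are overriding, so be certain the other node is really dead before you confirm.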

So, for a site with 7x24 operations, most people prefer some manual intervention to the data corruption or other issues surrounding split-brain syndrome. If there is an actual disaster, 30 to 60 minutes to get things back on-line is an unbelievable blessing and miracle, and nobody will be fussing or complaining. If there was just a network outage and things are down for an hour for manual restart, some questions might come up later, but the answers come down to the costs of the special schemes described above. The costs are substantial, even overwhelming. Most of the time, everybody shuts up about the outage when the cost of total and complete automation and redundancy is finally established.

Sorry for the novel; it is a complex subject. Hope it helps... and I hope you get some other answers. I don't profess to know everything about this specialized field, in which the rules change from time to time as technology or new features add additional wrinkles or solutions.

Best Regards, --bmr

We must indeed all hang together, or, most assuredly, we shall all hang separately. (Benjamin Franklin)
Carsten Krege
Honored Contributor

Re: Dual cluster lock, split brain

Brian,

nice summary. Most things are basically correct; however, there is one major mistake in your description of the dual cluster lock.

> With a geographically separated cluster
> ("campus cluster" or "extended campus
> topology"), you have exactly the problem
> that you have described. This is the only
> case where HP recommends using dual
> cluster lock disks, one at each site. The
> good news is that this will allow one
> site to take over all cluster functions
> if a site is lost. The bad news is that
> both sites will attempt to run all
> cluster packages (and succeed), if the
> only thing lost is the network between
> the two (split-brain syndrome, with apps
> up on both sides, all databases open
> twice, generally very bad news).

Dual cluster lock works a little bit differently. Dual cluster lock is a compound lock, i.e. in a campus (= extended) cluster each side needs to access BOTH cluster lock disks, primary and secondary, in the case of lost heartbeat connectivity.

Generally speaking, an SG cluster member in a 2-node cluster will perform a TOC if there is no heartbeat anymore and it cannot access both cluster lock disks.

Only in the case that the ioctl system call used by the SG node to write to the cluster lock disk returns specific values (e.g. I/O error or power failure) does SG assume that the cluster lock disk is dead; it will then form the cluster accessing only one of the two disks. This is useful for the case that a complete datacenter fails (including one of the lock disks).

Therefore you run the risk of split-brain syndrome only in the case of a loss of heartbeat connectivity AND if each node sees only its own cluster lock disk, with access to the other disk returning an I/O error.

The standard example is: all heartbeat LANs AND the cables for storage are cut by an excavator at the same time. The cluster will then re-form, but each side of the campus cluster will only see the lock disk in the local datacenter, whereas access to the other cluster lock disk returns an I/O error. Then each datacenter will form its own cluster, and that is a split-brain situation.

Therefore, for maximum availability, you should make sure that LAN cables and storage cables are routed to the other datacenter along different paths.

Hope this makes things clearer.

Carsten
-------------------------------------------------------------------------------------------------
In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. -- HhGttG
pitogochik
New Member

Re: Dual cluster lock, split brain

An HP-supported option for such a case is to put another node at a 3rd site. This is termed the arbiter. We've been through a similar scenario before, and this is the more cost-effective solution.

Frederic Sevestre
Honored Contributor

Re: Dual cluster lock, split brain

Hi,

Have a look here :
http://www.docs.hp.com/hpux/onlinedocs/B7660-90014/B7660-90014.html

Regards
Frederic
Crime doesn't pay...does that mean that my job is a crime?
Brian M Rawlings
Honored Contributor

Re: Dual cluster lock, split brain

Carsten:

Thanks for the details and the correction. It is good to know that HP added some additional smarts for the dual lock. I guess a lot of people run all the cables through one trench, or whatever, because I've heard of sites that ended up with split-brain syndrome (not many, thankfully).

I think the 3rd node/arbiter system is the best way to go, if possible. It's nice to know that dual lock disks are a better option than I thought, for those who don't have access to a 3rd site.

Regards, --bmr
Richard Perez
Valued Contributor

Re: Dual cluster lock, split brain

Thanks a lot for the answers. They gave me useful pointers and clues.

Regards
Carsten Krege
Honored Contributor

Re: Dual cluster lock, split brain

One sidenote.

As an alternative to using an arbitrator, you can also use a quorum server (QS), which also needs to be in a 3rd datacenter. The QS is a node that acts like a cluster lock disk on the network. The difference between a QS and an arbitrator is that a QS is not configured as a cluster member. It can also provide quorum service for more than one cluster.
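For completeness, the QS is configured via a few parameters in the cluster ASCII file instead of the cluster lock entries -- hostname and values below are only examples, and the defaults and paths should be checked against the QS documentation:

    QS_HOST               qshost1              # node in the 3rd datacenter running the quorum server daemon
    QS_POLLING_INTERVAL   300000000            # microseconds
    QS_TIMEOUT_EXTENSION  2000000              # microseconds, optional

On the QS host itself, the quorum server daemon has to be running and the cluster nodes have to be listed in its authorization file (/etc/cmcluster/qs_authfile, if I remember correctly).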

Because the QS is also supported on the Linux server hardware that is supported with SG, it might be an interesting alternative.

More details on:
http://www.docs.hp.com/hpux/ha/#Quorum%20Server

See also the White Paper "Arbitration For Data Integrity in ServiceGuard Clusters".

Carsten
Kent Ostby
Honored Contributor

Re: Dual cluster lock, split brain

Here is a document that discusses Quorum vs cluster lock:

http://www6.itrc.hp.com/service/cki/search.do?category=c0&mode=id&searchString=UMCSGKBRC00012642&searchCrit=allwords&docType=EngineerNotes&search.x=22&search.y=7
"Well, actually, she is a rocket scientist" -- Steve Martin in "Roxanne"