Dual cluster lock, split brain
08-05-2004 07:22 AM
I understand that an IP communication break (while the SAN connection remains up) will provoke a TOC in one node: the one that doesn't get the cluster lock, which is located at site 1. If site 1 suffers a catastrophic failure, the node at site 2 will not get the cluster lock, so it will TOC and the cluster will never come up with that node.
Will a dual cluster lock avoid such a situation?
If I set up a dual cluster lock, what will happen if the IP connection is lost and each node gets a cluster lock?
Regards
Rick
08-05-2004 07:42 AM
Re: Dual cluster lock, split brain
This can probably be a disk on your SAN, providing you the reliability you need in the scenario you present.
SEP
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
08-05-2004 07:42 AM
Re: Dual cluster lock, split brain
I think I have split brain syndrome.
SEP
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
08-05-2004 10:14 AM
Re: Dual cluster lock, split brain
Thanks for your reply. However, I'm still wondering how I can avoid the situation where one of the cluster nodes AND the cluster lock die at the same time.
Regards
08-05-2004 03:04 PM
Re: Dual cluster lock, split brain (Solution)
MCSG requires a single cluster lock disk for two-node clusters, suggests it for 3- and 4-node clusters, and doesn't allow it for 5+ node clusters. The cluster lock "disk" is just any one of the Volume Groups that are under MC/SG control.
The only information used by the cluster lock "disk" is a tiny piece of the VGRA, the Volume Group Reserved Area (space which is always set aside on any LVM disk anyway). This means that the cluster lock disk doesn't take up any actual room, and also that all you have to do is designate one of the VGs under MCSG control to be the lock disk, and those bits in the VGRA get set up for cluster lock function. It is ideal to do it this way, so that you are sure the lock disk's underlying hardware (RAID LUN or actual disk drive) is working correctly.
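As a rough sketch, designating the lock VG looks something like this (the VG name, node names, and device files here are hypothetical, and the exact ASCII file layout varies by ServiceGuard version):

# Let cmquerycl generate a cluster configuration template:
cmquerycl -v -C /etc/cmcluster/clconfig.ascii -n node1 -n node2
# In the ASCII file, point the cluster lock at one of the MCSG-controlled VGs:
FIRST_CLUSTER_LOCK_VG /dev/vglock
NODE_NAME node1
  FIRST_CLUSTER_LOCK_PV /dev/dsk/c1t2d0
# Verify and apply; applying the config initializes the lock area in the VGRA:
cmcheckconf -C /etc/cmcluster/clconfig.ascii
cmapplyconf -C /etc/cmcluster/clconfig.ascii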
With a geographically separated cluster ("campus cluster" or "extended campus topology"), you have exactly the problem that you have described. This is the only case where HP recommends using dual cluster lock disks, one at each site. The good news is that this will allow one site to take over all cluster functions if a site is lost. The bad news is that both sites will attempt to run all cluster packages (and succeed), if the only thing lost is the network between the two (split-brain syndrome, with apps up on both sides, all databases open twice, generally very bad news).
The best way to reduce the likelihood of this happening is to have two (or more) completely redundant networks in place, unbridged (no points of connection other than the clustered servers). This means that the wires or fiber need to be in separate places (trenches, ceilings, plenums, etc.), so that no single event can drop the whole network. There is almost no way to be extreme enough about this... just like the two computer rooms (separate power sources, even different utility companies, if possible). As long as any network is still functional, split-brain syndrome is avoided, and users can still access their apps and data.
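In the cluster ASCII file, that redundancy shows up as multiple heartbeat interfaces per node; a hypothetical fragment (interface names and addresses are made up):

NODE_NAME node1
  NETWORK_INTERFACE lan0
    HEARTBEAT_IP 10.1.1.10     # heartbeat LAN routed through trench A
  NETWORK_INTERFACE lan1
    HEARTBEAT_IP 10.2.1.10     # heartbeat LAN routed through trench B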
There are two other ways to avoid any possibility of split-brain syndrome, but they (naturally) have their own headaches. One is to avoid the use of cluster locks altogether (simply don't assign any), and to instead use a small HP-UX server in a third location, a third node in the cluster. HP calls this an "Arbiter Node" or "arbitrator system", and it is a fully supported configuration (specifically intended for this exact situation). The arbiter node only needs to be on the network, it has no apps, no connection to any shared storage, and no role in running any packages (none are ever assigned to it).
Its sole purpose is to provide a quorum for whichever system is still in communication with it if a network fails. Since more than 50% of the nodes are present, quorum is established, the cluster is rebuilt as a two-node cluster, and MCSG restarts all packages as per its control scripts (normally all start on the surviving node). The reason this works is that a cluster lock is not required for a 3-node cluster, and any two nodes can establish the quorum.
The downside is two-fold: first, a complete network failure will still leave no node able to establish a quorum, and all nodes will TOC and halt. This means no split-brain syndrome, but also no automated/lights-out failover. Somebody has to manually restart one or both nodes and convince them to resume cluster operations. In a lot of cases, this is the preferred outcome, compared to split-brain syndrome. The second downside is that not everybody actually has a campus with 3 or more buildings over which they have ownership or control. A lot of people have their main site, and a co-location center or a second building, and nothing else.
For this scheme to work, a disaster in either main site cannot affect the arbiter node -- so it needs to be in a third location, or some sort of specially powered, air conditioned, and otherwise hardened space in one of the two main sites. This can get tricky and costly, as can the requirement for the arbiter node to be on all network segments in use by the cluster. But this is the optimal way around your dilemma.
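Configuration-wise, the arbiter is nothing special, just a third node in the cluster definition; a hypothetical sketch (node names are made up):

# Include the arbiter when generating the configuration:
cmquerycl -v -C /etc/cmcluster/clconfig.ascii -n node1 -n node2 -n arbiter1
# With three nodes, no FIRST_CLUSTER_LOCK_VG / FIRST_CLUSTER_LOCK_PV entries
# are configured, and arbiter1 is simply never listed in any package's
# NODE_NAME list, so no packages can ever run there.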
The second way to avoid split-brain syndrome is much simpler, but is unacceptable for some high-uptime situations: only install one cluster lock, as you are doing, and simply accept the fact that you have partial failover automation. If the disaster hits the site with the cluster lock disk, and the remaining site cannot establish a quorum, it will TOC. For lots of people, this is preferable to split-brain syndrome, as mentioned earlier. Manual intervention is required to get things back up and running at the functional site, and MCSG has to be convinced to come up and run as the only node. But your data is all intact and uncorrupted, and the tasks to whip MCSG into action are merely procedural, easily documented, and even scriptable (for the most part).
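The manual part is short; a sketch of the usual steps, assuming node2 is the surviving node (first verify, by some out-of-band means, that the other site really is down, or you create exactly the split-brain you were trying to avoid):

cmruncl -v -n node2      # form the cluster with only the surviving node
cmviewcl -v              # confirm cluster membership and package status
cmrunpkg -n node2 pkg1   # start any package that did not autostart (hypothetical name)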
So, for a site with 7x24 operations, most people prefer some manual intervention to the data corruption and other issues surrounding split-brain syndrome. If there is an actual disaster, 30 to 60 minutes to get things back on-line is an unbelievable blessing and miracle, and nobody will be fussing or complaining. If there was just a network outage and things were down for an hour for a manual restart, some questions might come up later, but the answers come down to the costs of the special schemes described above. The costs are substantial, even overwhelming. Most of the time, everybody shuts up about the outage when the cost of total and complete automation and redundancy is finally established.
Sorry for the novel, it is a complex subject. Hope it helps... and I hope you get some other answers, I don't profess to know everything about this specialized field, in which the rules change from time to time as technology or new features add additional wrinkles or solutions.
Best Regards, --bmr
08-05-2004 05:22 PM
Re: Dual cluster lock, split brain
Nice summary. Most of it is basically correct; however, there is one major mistake in your description of the dual cluster lock.
> With a geographically separated cluster
> ("campus cluster" or "extended campus
> topology"), you have exactly the problem
> that you have described. This is the only
> case where HP recommends using dual
> cluster lock disks, one at each site. The
> good news is that this will allow one
> site to take over all cluster functions
> if a site is lost. The bad news is that
> both sites will attempt to run all
> cluster packages (and succeed), if the
> only thing lost is the network between
> the two (split-brain syndrome, with apps
> up on both sides, all databases open
> twice, generally very bad news).
The dual cluster lock works a little differently. It is a compound lock, i.e. in a campus (= extended) cluster each side needs to access BOTH cluster lock disks, primary and secondary, in the case of lost heartbeat connectivity.
Generally speaking, an SG cluster member in a 2-node cluster will perform a TOC if there is no heartbeat anymore and it cannot access both cluster lock disks.
Only if the ioctl system call used by the SG node to write to a cluster lock disk returns specific values (e.g. I/O error or power failure) does SG assume that this lock disk is dead, and it will then form the cluster having acquired only one of the two disks. This is useful for the case that a complete datacenter fails (including one of the lock disks).
Therefore you run the risk of split-brain syndrome only in the case of loss of heartbeat connectivity AND when each node sees only its own cluster lock disk, with access to the other disk returning an I/O error.
The standard example: all heartbeat LANs AND the storage cables are cut by an excavator at the same time. The cluster will then reform, but each side of the campus cluster will see only the lock disk in its local datacenter, whereas access to the other cluster lock disk returns an I/O error. Each datacenter will then form its own cluster, and that is a split-brain situation.
Therefore, for maximum availability, you should make sure that the LAN cables and the storage cables run to the other datacenter along different paths.
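For reference, the dual lock is configured with a second lock VG and a second per-node lock PV in the cluster ASCII file; a hypothetical fragment (VG names and device files are made up, one lock disk per datacenter):

FIRST_CLUSTER_LOCK_VG /dev/vglockA        # lock disk in datacenter A
SECOND_CLUSTER_LOCK_VG /dev/vglockB       # lock disk in datacenter B
NODE_NAME node1
  FIRST_CLUSTER_LOCK_PV /dev/dsk/c4t0d0
  SECOND_CLUSTER_LOCK_PV /dev/dsk/c5t0d0
NODE_NAME node2
  FIRST_CLUSTER_LOCK_PV /dev/dsk/c4t0d0
  SECOND_CLUSTER_LOCK_PV /dev/dsk/c5t0d0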
Hope this makes it clearer.
Carsten
In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. -- HhGttG
08-05-2004 07:48 PM
Re: Dual cluster lock, split brain
08-05-2004 09:31 PM
Re: Dual cluster lock, split brain
Have a look here:
http://www.docs.hp.com/hpux/onlinedocs/B7660-90014/B7660-90014.html
Regards
Frederic
08-06-2004 04:53 AM
Re: Dual cluster lock, split brain
Thanks for the details and the correction. It is good to know that HP added some additional smarts for the dual lock. I guess a lot of people run all the cables through one trench, or whatever, because I've heard of sites that ended up with split-brain syndrome (not many, thankfully).
I think the 3rd node/arbiter system is the best way to go, if possible. It's nice to know that dual lock disks are a better option than I thought, for those who don't have access to a 3rd site.
Regards, --bmr
08-08-2004 01:40 PM
Re: Dual cluster lock, split brain
Regards
08-08-2004 07:49 PM
Re: Dual cluster lock, split brain
As an alternative to using an arbitrator, you can also use a quorum server (QS), which also needs to be in a 3rd datacenter. The QS is a node that acts like a cluster lock disk on the network. The difference between a QS and an arbitrator is that a QS is not configured as a cluster member. It can also provide quorum service for more than one cluster.
Because the QS is also supported on Linux server hardware that is supported with SG, it might be an interesting alternative.
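A sketch of the two pieces involved, assuming a QS host named qshost1 (names and values here are hypothetical; see the documents below for the real procedure):

# In the cluster ASCII file, the QS takes the place of the lock disk parameters:
QS_HOST qshost1
QS_POLLING_INTERVAL 300000000     # microseconds
QS_TIMEOUT_EXTENSION 2000000      # microseconds, optional
# On qshost1 itself, the quorum server daemon must be running, and the
# cluster nodes must be listed in its authorization file,
# /etc/cmcluster/qs_authfile, one node name per line (node1, node2).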
More details on:
http://www.docs.hp.com/hpux/ha/#Quorum%20Server
See also the White Paper "Arbitration For Data Integrity in ServiceGuard Clusters".
Carsten
In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move. -- HhGttG
08-09-2004 02:38 AM
Re: Dual cluster lock, split brain
http://www6.itrc.hp.com/service/cki/search.do?category=c0&mode=id&searchString=UMCSGKBRC00012642&searchCrit=allwords&docType=EngineerNotes&search.x=22&search.y=7