
MC/ServiceGuard Standards

 
Geoff Wild
Honored Contributor

MC/ServiceGuard Standards

I'm building a document on MC/ServiceGuard Standards.

I'm looking for input from others as to best practices in managing a clustered environment: for example, patching, upgrades, kernel changes, application changes, etc.

If someone has a whitepaper on this already - then that would be great.

Thanks...Geoff
Proverbs 3:5,6 Trust in the Lord with all your heart and lean not on your own understanding; in all your ways acknowledge him, and he will make all your paths straight.
Kent Ostby
Honored Contributor

Re: MC/ServiceGuard Standards

Geoff --

There are groupings of ServiceGuard documents within the tree which both address how-to issues and give a general flavor of best practices.

You can also search on UXMCSG* and UXHA* documents in the Knowledge Base for a series of documents on ServiceGuard.

Best regards,

Kent M. Ostby
"Well, actually, she is a rocket scientist" -- Steve Martin in "Roxanne"
Geoff Wild
Honored Contributor

Re: MC/ServiceGuard Standards

Maybe it would be better to give you what I have so far:

MC/ServiceGuard Standards

What Is High Availability?

High Availability makes a service or application as resilient as possible against hardware, software, network and power failure, allowing critical systems to continue to operate with minimal disruption and hence lower risk to a business.


How Can High Availability Be Achieved?

High availability is achieved by identifying and removing all possible single points of failure within a system and its environment, and by having well-rehearsed procedures in place to recover as quickly and securely as possible, with minimal disruption, utilising redundant components.

Best Practices:

Have a test cluster in addition to the production cluster.

Run a utility like CCMON on all nodes in a cluster.

Keep kernels the same.

Keep patch levels the same.

If applications reside on non-clustered volumes, ensure all files are the same (version, permissions).

Apply changes to one node at a time and test.



Test Cluster - The reasons for a test cluster are the same as for a test server. A test cluster should be used for testing patching, kernel changes, OS upgrades, MC/ServiceGuard upgrades, application upgrades, and any other changes destined for production. Ideally, the test cluster should match the production cluster physically as closely as possible.

CCMON - The basic principle of the Cluster Consistency Monitor is to compare resource configurations of nodes.

Kernel - Changes to the kernel need to be applied to all nodes for consistency.

Patches - All nodes in a cluster need to be at the same patch level for consistency.
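
A quick sketch of how the kernel and patch checks might be scripted (this is not the actual CCMON tool - node names nodeA/nodeB are illustrative, remsh trust between the nodes is assumed, and kmtune applies to 11.x; use kctune on 11i v2 and later):

#!/usr/bin/sh
# Gather kernel tunables and installed patch lists from each node.
for NODE in nodeA nodeB
do
    remsh $NODE "kmtune -l" > /tmp/kernel.$NODE
    remsh $NODE "swlist -l product | egrep 'PH(CO|KL|NE|SS)'" > /tmp/patches.$NODE
done

# Any output from these diffs is a consistency violation to investigate.
diff /tmp/kernel.nodeA /tmp/kernel.nodeB
diff /tmp/patches.nodeA /tmp/patches.nodeB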

Applications - If applications reside on clustered volumes, then they will be the same no matter which node they run on. If this is not possible, then a procedure needs to be in place to ensure that the application configuration and files are the same on both nodes.
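
One way such a procedure could be scripted (the path /opt/app1 is just an example, and remsh trust is again assumed):

# Compare file contents and permissions for a non-clustered
# application directory between the local node and nodeB.
APPDIR=/opt/app1

find $APPDIR -type f -exec cksum {} \; | sort > /tmp/app.local
remsh nodeB "find $APPDIR -type f -exec cksum {} \; | sort" > /tmp/app.nodeB
diff /tmp/app.local /tmp/app.nodeB     # empty output = contents match

ls -lR $APPDIR > /tmp/perm.local
remsh nodeB "ls -lR $APPDIR" > /tmp/perm.nodeB
diff /tmp/perm.local /tmp/perm.nodeB   # empty output = permissions match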

Changes - A basic principle must be followed when making any change to a cluster. Apply the change on one node, then test the package(s) to ensure they run. If something doesn't work - back out the change. If everything works fine, apply the same change on the next node, then test the package(s) again. If something doesn't work at that stage - back out all changes on all nodes.

For example: a 2-node cluster has a single package - called package1 - which is running on nodeA. A kernel parameter change has been recommended by the Application Support Team. Change the parameter on nodeB, then test package1 on nodeB. If everything checks out, apply the same change to nodeA and test package1 on nodeA.
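
In ServiceGuard commands, that test could look something like the following (package1, nodeA and nodeB as in the example above; cmrunpkg leaves package switching disabled, hence the cmmodpkg):

# Move package1 to nodeB to test the changed parameter there.
cmhaltpkg package1
cmrunpkg -n nodeB package1
cmmodpkg -e package1    # re-enable package switching
cmviewcl -v             # confirm package1 is up on nodeB

# After applying the same change to nodeA, test there too.
cmhaltpkg package1
cmrunpkg -n nodeA package1
cmmodpkg -e package1
cmviewcl -v             # confirm package1 is up on nodeA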
Proverbs 3:5,6 Trust in the Lord with all your heart and lean not on your own understanding; in all your ways acknowledge him, and he will make all your paths straight.
Colin Topliss
Esteemed Contributor

Re: MC/ServiceGuard Standards

Don't know about anyone else, but I'd have a section detailing known issues or 'gotchas' that are all too easily overlooked.

Once you build your cluster, it's usually stable. The problem comes when a change is made. Things like:

1) If your applications people change a script called by SG, make sure it is deployed to both nodes.
2) If you have to migrate cluster data from one array to another, make sure the lock disk (if you use one) is catered for (the device name will change and will not be reflected in your SG configuration - SG will keep running until you shut down/restart the cluster).
3) Extending filesystems and adding more disk means that the VG updates need to be applied to the other cluster members.
4) Don't let your cluster data span more than one array (it makes upgrades more problematic).
5) There are occasional problems where someone is logged in and cd'd into a directory that is acting as your mount point. The cluster will fail to start until that person's process is removed (see the sketch below).
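
For (5), a quick check before starting the cluster or package might be (the mount point /shared01 is just an example):

fuser -cu /shared01     # list the processes (and users) holding the mount point
# fuser -ck /shared01   # ...and kill them, once you're sure nothing vital is listed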

My pet hate is lvmrc. If you have a mixed environment (some cluster-controlled VGs and some local VGs), the recommended change to lvmrc (disabling automatic VG activation) means that the local VGs won't be activated at boot. That one is often overlooked (everything is OK until you reboot, then you lose half your filesystems).

Whatever you put in there, the most important thing to stress is to test, after EVERY change, that your cluster fails over between nodes successfully. I've been caught out a few times by that little gem, where changes have been made on one part of the cluster and not tested on the other.
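
A minimal failover test for a two-node cluster could be as simple as this (nodeA and nodeB as in Geoff's example):

cmhaltnode -f nodeA    # -f forces the packages on nodeA to fail over
cmviewcl -v            # packages should now be running on nodeB
cmrunnode nodeA        # bring nodeA back into the cluster
cmviewcl -v            # confirm the node and packages are healthy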
malay boy
Trusted Contributor

Re: MC/ServiceGuard Standards

Geoff,
I think you left out one important issue: user administration. Cover how users are maintained so that the switchover process is transparent to them.


regards
mB
There are three person in my team-Me ,myself and I.
Karthik S S
Honored Contributor

Re: MC/ServiceGuard Standards

Geoff,

Links to some whitepapers related to SG:

Volume Manager Considerations:
http://h21007.www2.hp.com/dspp/files/unprotected/HP_High_Availability_White_Papers_Jay/ServiceGuardandVolumeManagerWhitePaper.pdf

SG Config for partitioned systems:
ftp://ftp.hp.com/pub/enterprise/partitioning.pdf

HP-UX 11i Availability Features (like APA etc):
http://www.nasi.com/hp-ux_availability.htm

You could also include the various versions of SG available and the differences between them.

Thanks,
Karthik S S
For a list of all the ways technology has failed to improve the quality of life, please press three. - Alice Kahn
Geoff Wild
Honored Contributor

Re: MC/ServiceGuard Standards

Colin, good input. The /etc/lvmrc - you just have to make sure you add all local VGs to the custom_vg_activation() section.
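
For anyone reading along, the relevant piece of /etc/lvmrc looks roughly like this (vg01 and vg02 stand in for whatever your local, non-cluster VGs are called):

AUTO_VG_ACTIVATE=0      # required when cluster VGs are present

custom_vg_activation()
{
        # Activate ONLY the local VGs here - ServiceGuard
        # activates the clustered VGs itself.
        parallel_vg_sync "/dev/vg01 /dev/vg02"

        return 0
}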

Malay Boy - switchover is not a big deal, as long as the application can handle it.

Karthik - nice find on the "Volume Manager Considerations".

Rgds...Geoff
Proverbs 3:5,6 Trust in the Lord with all your heart and lean not on your own understanding; in all your ways acknowledge him, and he will make all your paths straight.