Re: How to quantify uptime of MC Service Guard cluster

Geoff Wild · ‎01-28-2005

I'm looking for a doc or white paper on what kind of numbers one can expect to achieve with an application in an MC/SG cluster.

99.5
99.99

or what?

I need an actual doc - one of the clusters I have has an SLA of 99.3, but last year we were at 99.91

Thanks...Geoff

Proverbs 3:5,6 Trust in the Lord with all your heart and lean not on your own understanding; in all your ways acknowledge him, and he will make all your paths straight.

Bill Hassell · ‎01-28-2005

This is really easy: it all depends on how you define uptime. I have had non-MCSG systems keep uptime at 100% for several years. Of course, that meant no changes (like disks, tapes, memory) and no patches (which is a bad thing). However, if you bundle the application into the uptime, the numbers can become very unstable. When SG is run with in-place patching and upgrading, it is fairly easy to get to four 9's, even five 9's. But it very much depends on the application and the data center environment. For every doc that says 99.99, there will be reports about 90.1 with an explanation...

Bill Hassell, sysadmin

Steven E. Protter · ‎01-28-2005

Basically you keep a log.

Unplanned downtime should be zero, but if it occurs, note the start, stop and length. Also note the reason so you can make it preventable next time.

Planned downtime for upgrades and such does not count.

If it's not downtime, it's uptime for calculation purposes.
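A minimal sketch of such a log and the resulting calculation, assuming each outage is recorded with a start, a stop, a planned/unplanned flag and a reason. The entries and the name `availability` are made up for illustration; the `count_planned` switch is there because whether planned downtime counts is a matter of definition.

```python
from datetime import datetime, timedelta

# Hypothetical outage log: (start, stop, planned?, reason). Entries are invented.
OUTAGE_LOG = [
    (datetime(2004, 3, 14, 2, 0), datetime(2004, 3, 14, 2, 40), True, "application patch"),
    (datetime(2004, 7, 2, 15, 5), datetime(2004, 7, 2, 15, 30), False, "package failover"),
]

def availability(outages, period_start, period_end, count_planned):
    """Return percent availability over the period.

    Planned outages are only counted as downtime when count_planned is True.
    """
    period = period_end - period_start
    down = sum(
        (stop - start
         for start, stop, planned, _reason in outages
         if count_planned or not planned),
        timedelta(),
    )
    return 100.0 * (1 - down / period)

year_start, year_end = datetime(2004, 1, 1), datetime(2005, 1, 1)
print(availability(OUTAGE_LOG, year_start, year_end, count_planned=False))
print(availability(OUTAGE_LOG, year_start, year_end, count_planned=True))
```

The same log gives two different numbers depending on the flag, so the chosen definition should be stated alongside any reported figure.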

SEP

Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com

Geoff Wild · ‎01-28-2005

Bill and Steve - thanks for the replies - I do all that now and more...we have monitors logging all kinds of things...It's just that I've been asked to evaluate the current SLA, and I can't just say - "we've been at 99.91 so let us make the SLA 99.8" - unless I can back it up with theories, real world indicators, statistics, etc.

Environment:

Data centre has a redundant power grid, redundant PDUs, as well as diesel.

The servers themselves are completely redundant as well: 2-node cluster, multiple LAN cards, redundant networks, multiple paths to SAN, etc., as well as Mission Critical support with HP.

We also have in place a 48 hour DR plan to another city - that is, we will be up and running in less than 48 hours with this system in the event of a disaster (tested every year).

Thanks...Geoff

Duncan Edmonstone · ‎01-28-2005

Geoff,

It's unlikely you'll find any kind of guarantee anywhere... the only reference I could find was in this rather old doc:

http://docs.hp.com/en/223/sgdtwb.pdf

This indicates that 99.8 - 99.998 could be achievable with Serviceguard - of course it all depends on how you measure your uptime...

One automated way of keeping track of this is the little-used foundation monitor toolkit, which is (or was?) part of the Enterprise Cluster Master Toolkit:

http://docs.hp.com/en/B5139-90038/B5139-90038.pdf

HTH

Duncan

I am an HPE Employee

A. Clay Stephenson · ‎01-28-2005

Well, the only HP Document I have (and I have no idea if it is available online) is my Student Workbook for H6487S "Hands On with MC/ServiceGuard" dated from 1999. I actually took the course after I was up and running MC/SG. I needed a vacation.

Anyhow, Slide 1-7 lists an availability of 99.95% for a 2-node cluster running MC/SG assuming a 10-minute package failover. When you get up to these levels, one of the most significant factors is the failover times of the packages themselves.
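To see why failover time matters so much at these levels, here is a rough sketch that turns an availability figure into a yearly downtime budget and counts how many failovers of a given length fit inside it. The function names are illustrative, and treating every outage as exactly one failover is a simplifying assumption, not something the slide states.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year

def downtime_budget_minutes(availability_pct, minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed per year at the given availability."""
    return minutes * (1 - availability_pct / 100)

def failovers_within_budget(availability_pct, failover_minutes):
    """Whole failovers of the given length that fit in the yearly budget."""
    return int(downtime_budget_minutes(availability_pct) // failover_minutes)

# 99.95% leaves roughly 263 minutes a year, i.e. 26 ten-minute failovers.
print(downtime_budget_minutes(99.95))
print(failovers_within_budget(99.95, 10))
```

Halving the failover time doubles the number of incidents the same target can absorb, which is why package start-up time is worth tuning.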

I can say that I am at 5.5+ years of zero unplanned downtime using MC/SG, redundant networks, redundant HVAC, generator, ...

If it ain't broke, I can fix that.

Tim D Fulford · ‎01-29-2005

No docs sorry

So last year you had 7hr 54mins 20sec downtime! This is quite a lot of time; does it include all the planned downtime? The SLA requires no more than 61 hours of downtime per year. So I'd say job done.
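The conversion from an availability percentage to a downtime figure can be sketched as below. The 7hr 54min 20sec figure works out when the year is taken as 366 days (2004 was a leap year); the function names are illustrative only.

```python
HOURS_IN_2004 = 366 * 24  # 2004 was a leap year: 8784 hours

def allowed_downtime_seconds(availability_pct, hours_in_period=HOURS_IN_2004):
    """Downtime, in whole seconds, implied by an availability percentage."""
    return round(hours_in_period * 3600 * (100 - availability_pct) / 100)

def as_hms(seconds):
    """Split a number of seconds into (hours, minutes, seconds)."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return hours, minutes, secs

print(as_hms(allowed_downtime_seconds(99.91)))  # (7, 54, 20): measured last year
print(as_hms(allowed_downtime_seconds(99.3)))   # a little over 61 hours: the SLA budget
```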

To me ServiceGuard is only one element in achieving high availability. External to SG:
o Network. No matter how resilient SG is, if you have a network broadcast storm you are stuffed.
o SCSI devices. Most environments have a database involved. If so, there is usually some form of logs and log archiving. SG may not be set up to detect this, and so it is quite easy for a DB to freeze because the log archive mechanism has failed.
o Storage devices. Most SG clusters make use of some form of shared or SAN storage. Again, a failure on this will cause the whole thing to be out of service.
o Application outage due to peak loading etc.
o Assurances about the backup generator and batteries
o Floods, fire, earthquakes, terrorists etc.
o and so on...

So in your search for a document you need to find out what all the other things are, to be able to derive a meaningful and achievable SLA. You may want to exclude certain items above (if so, explain why!!), but there are enough items on the list to mean that it is not just the availability of SG that determines the SLA but the combination.
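A rough sketch of how the combination works, assuming the elements fail independently and the service needs all of them at once (a series model). The 99.95 for the cluster is the figure quoted from the course slide earlier in this thread; every other number is a placeholder for illustration, not a measurement.

```python
from math import prod

# Percent availability per element. Only the cluster figure comes from the
# thread; the rest are placeholders chosen for illustration.
ELEMENTS = {
    "Serviceguard cluster": 99.95,
    "network": 99.9,
    "SAN storage": 99.9,
    "application": 99.8,
}

def series_availability(percentages):
    """Availability of a service that needs every element up at once,
    assuming the elements fail independently of each other."""
    return 100 * prod(p / 100 for p in percentages)

combined = series_availability(ELEMENTS.values())
print(f"combined availability: {combined:.2f}%")
```

The combined figure is always lower than the weakest element, so an SLA based on the cluster number alone would be optimistic.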

To get to my point... if you have measured an availability of 99.91, then this is more meaningful than an academic exercise of gathering all the numbers for other potential sources of outage. You might want to analyse the source of these outages (e.g. 1 failover took 25 mins; 10 application patches of 40 mins each; 55 mins network storm) then concentrate on reducing the worst (10x 40 mins of application patching would be the one!!!)
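That analysis can be sketched in a few lines, using the example figures from the paragraph above (they are illustrative, not measured data). The function name is an assumption for this sketch.

```python
# Example figures from the post: (source, number of events, minutes per event)
OUTAGE_SOURCES = [
    ("failover", 1, 25),
    ("application patching", 10, 40),
    ("network storm", 1, 55),
]

def rank_by_downtime(outages):
    """Total the minutes per source and sort with the worst offender first."""
    totals = {name: count * minutes for name, count, minutes in outages}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

ranking = rank_by_downtime(OUTAGE_SOURCES)
total = sum(minutes for _name, minutes in ranking)
for name, minutes in ranking:
    print(f"{name}: {minutes} min ({100 * minutes / total:.0f}% of downtime)")
```

With these figures the total is 480 minutes, and application patching accounts for 400 of them.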

Regards

Tim

Jan van den Ende · ‎01-29-2005

Steven,

"Planned downtime for upgrades and such does not count."

I sincerely beg to differ on that!!

Try running a police call-room and tell the callroom manager: "Well, next weekend we will be doing an upgrade. Expect your callroom systems to be unavailable for 6 hours minimum, 12 hours max."
In the last 10 years, we (actually, the NETWORK department) had to do that once, and it took 3 months of planning, all kinds of temporary measures, LOTs of extra manpower, and still the callroom manager had the right to veto up until the last minute.

_WE_ have to come up with good explanations for hiccups of minutes, and then plans for how to prevent/circumvent that next time!

I do recognise the reasoning.
Lately we had a national Interex symposium day on uninterrupted computing, and one session was by a bank, about how they set up their environment.
When it transpired that they defined "100% uptime" as Monday-Saturday 07:00-22:00, with batch processing and backups at night and maintenance on Sundays, the man was violently attacked by the majority of the audience.
In the discussion afterward the audience agreed that 90 hours out of 168 does not even constitute 60% uptime!

Then again, 90 hours is all they needed, and if that is satisfied, GREAT! Only, do NOT try to sell that as 100%.
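The two ways of counting can be put side by side in a short sketch: availability inside the agreed service window versus coverage of the calendar week. The function names are illustrative.

```python
HOURS_PER_WEEK = 7 * 24  # 168

def window_hours(days, start_hour, end_hour):
    """Hours per week inside a daily service window."""
    return days * (end_hour - start_hour)

def calendar_coverage_pct(hours_in_window):
    """Share of the full calendar week covered by the service window."""
    return 100 * hours_in_window / HOURS_PER_WEEK

bank_window = window_hours(days=6, start_hour=7, end_hour=22)
print(bank_window)  # 90 hours: Monday-Saturday, 07:00-22:00
print(f"{calendar_coverage_pct(bank_window):.1f}% of the week")
```

A system can be up for 100% of its service window and still cover well under 60% of the calendar week, so a figure is only meaningful together with the window it was measured against.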

Geoff,

by my math, 48 hours is about 0.55% of a year (48 out of 8760 hours). If you do a yearly rehearsal, is it so realistic that it takes down your production? In that case, your attainable upper limit is about 99.45%, barring any other unavailability.

Like Bill started the first answer in this stream:
it depends HOW you define downtime.

Recently, in another stream,
http://forums1.itrc.hp.com/service/forums/questionanswer.do?threadId=793109
I did a breakdown of various (we hope: most, or all) points of view on downtime.

Especially when running many (sometimes interlinked) apps, accessed from a wide area over a segmented network, from WBTs via Citrix desktop servers, UPTIME is in the eye of the beholder!

The only thing strictly measurable is SERVER uptime, but to remote users of some application, that is NOT what THEY perceive!

hth,

Proost.

Have one on me.

Jan

Don't rust yours pelled jacker to fine doll missed aches.

Steve Lewis · ‎01-29-2005

I just want to say that I have one customer whose cluster and only package have been up for so long on the primary node that they are now _afraid_ to switch it over. This is despite the fact that it passed the failover tests. The package isn't even required to be up 24x7.

Maybe they suspect that they haven't been maintaining it properly, or don't trust their own testing.

So maybe you need a schedule to switch it over now and again to keep the management's confidence in availability.

But as has been said above, ultimately it's the end user who defines availability.

Geoff Wild · ‎01-31-2005

Thanks all for the valuable information - The unplanned downtime on this cluster was 0 - I do include planned.

Thanks to the pointers to the docs and the course material (I have that too - didn't think to check it).

Rgds...Geoff

Florian Heigl (new acc) · ‎01-31-2005

Geoff,

have a look at
http://www.amazon.com/exec/obidos/ASIN/1587130173

Up to now I have not found a more professional book on calculating system availability, and even though it's a Cisco book it is not at all focused on routers.
They also write about how to include part MTBFs and the like, so in the end you'll have a really good calculation.
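As a rough idea of what such a calculation looks like, here is a sketch using the standard steady-state relation availability = MTBF / (MTBF + MTTR), plus the usual formula for two redundant parts. It is not taken from the book; the numbers are placeholders, and the redundancy formula assumes the two parts fail independently and that either one alone is enough.

```python
def availability_from_mtbf(mtbf_hours, mttr_hours):
    """Steady-state availability of one part from its mean time between
    failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def redundant_pair(part_availability):
    """Availability of two identical parts where either one suffices,
    assuming they fail independently."""
    return 1 - (1 - part_availability) ** 2

# Placeholder figures for illustration only.
single = availability_from_mtbf(mtbf_hours=9990, mttr_hours=10)
print(f"single part:    {100 * single:.4f}%")
print(f"redundant pair: {100 * redundant_pair(single):.4f}%")
```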

yesterday I stood at the edge. Today I'm one step ahead.

Chad Brindley · ‎02-24-2005

Geoff

In my experience writing and being involved in SLAs, it all depends on what the business wants. If you are a bank then you would want no 'unplanned' downtime - i.e. 100% SLA. But if you are a 9-5 office-hours-only business then 99.5% is more appropriate and realistic.
The important thing is to 'plan' all changes and downtime/maintenance slots, and not to make changes on the fly, as this will only come back to bite you and cause an unplanned outage.
Chad.
