
% Uptime

 
Scott Frye_1
Super Advisor

% Uptime

We need to report the % of uptime every 90 days. Has anyone had a need to do such a thing? What management wants to know is something like "the box has been up 99.99% for the past 90 days."

Any thoughts or ideas?

Scott
13 REPLIES
Sanjay_6
Honored Contributor

Re: % Uptime

Hi Scott,

We always report any downtime that is recorded, and use it to calculate the uptime percentage of each server and then of all the servers in the organization. The uptime calculation is a pretty important parameter in defining the performance of the IT team. It is like meeting an SLA (service level agreement), where the IT team agrees to maintain system uptime at a certain percentage to meet the business needs.

Hope this helps.

Regds
Scott Frye_1
Super Advisor

Re: % Uptime

Sanjay,

I see your point but management doesn't want to rely on us to tell them downtimes, they want this automated...
Patrick Wallek
Honored Contributor

Re: % Uptime

Well, you first must determine what exactly you mean by "uptime".

Is it:

1) Just the machine being up?
2) What about databases or applications? If the DBs or applications are down but the machine is up (i.e. a database crash or an application crash), how is that figured?
3) Do you count scheduled downtime (patch installs, etc.) as a negative against the uptime?
4) What about network problems? If a router or switch breaks, does that figure into your machine's uptime since it affects users' ability to access it?

In my opinion, the uptime really needs to be calculated manually, as there are lots of factors to take into consideration.
Jeff Schussele
Honored Contributor

Re: % Uptime

Hi Scott,

One idea could be to keep copies of all syslog.log & OLDsyslog.log for a reporting period and then parse them to get
A) The initial boot time from the first entry
B) The last entry could then be used as the reboot time
Then compare these times to the start & end of the reporting period. Any gaps would be "official" downtime, and you could calculate the % from the total time.
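The gap calculation described above can be sketched in Python. This is only a minimal sketch under stated assumptions: the syslog files have already been parsed into (first entry, last entry) timestamp pairs, one pair per boot, and the function name `uptime_percent` is made up for illustration. Parsing the log lines themselves is left out because the timestamp format depends on the system.

```python
from datetime import datetime, timedelta

def uptime_percent(up_intervals, window_start, window_end):
    """Percent of the reporting window covered by the given up intervals.

    up_intervals: (first_entry, last_entry) datetime pairs, one per boot
    (hypothetical input, assumed to come from the parsed syslog files).
    """
    # Clip every interval to the window and drop those entirely outside it.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in up_intervals
        if end > window_start and start < window_end
    )
    covered = timedelta(0)
    cursor = window_start
    for start, end in clipped:
        start = max(start, cursor)  # do not count overlapping time twice
        if end > start:
            covered += end - start
            cursor = end
    # Anything in the window that is not covered counts as downtime.
    return 100.0 * (covered / (window_end - window_start))
```

Note that the last syslog entry before a crash can be well before the actual crash, so this approach tends to overstate downtime rather than understate it.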

My $0.02,
Jeff
PERSEVERANCE -- Remember, whatever does not kill you only makes you stronger!
Kent Ostby
Honored Contributor

Re: % Uptime

I agree with the manual ideas listed by others.

If people want to know about downtime, you can run "uptime" every hour, keep a log of the output, and parse it at the end of 90 days for any anomalies.

This could be written as a script.
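One way such a script could spot the anomalies, sketched in Python: if each hourly sample is reduced to a (sample time, seconds up) pair, the implied boot time stays constant until the box reboots. The tuple layout, the `find_reboots` name, and the 120-second tolerance are assumptions for illustration; turning raw `uptime` output into seconds is not shown because its format varies.

```python
def find_reboots(samples, tolerance=120):
    """Return one (last_seen_up, booted_at) epoch pair per detected reboot.

    samples: list of (sample_epoch, seconds_up) tuples, oldest first
    (hypothetical layout for the hourly log).
    """
    reboots = []
    for (prev_t, prev_up), (cur_t, cur_up) in zip(samples, samples[1:]):
        prev_boot = prev_t - prev_up
        cur_boot = cur_t - cur_up
        # The implied boot time only moves forward when the box rebooted;
        # the tolerance absorbs rounding in the recorded uptime.
        if cur_boot - prev_boot > tolerance:
            reboots.append((prev_t, cur_boot))
    return reboots
```

The difference `booted_at - last_seen_up` is only an upper bound on the outage, since the box may have stayed up for most of the hour after the last good sample.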

Best regards,

Oz
"Well, actually, she is a rocket scientist" -- Steve Martin in "Roxanne"
Scott Frye_1
Super Advisor

Re: % Uptime

When uptime is referred to here, it is specific to the box. Applications / databases / networks do not play into this equation. They are specifically looking for the time the box is powered on. Any downtime, be it reboots or scheduled maintenance (which we don't have), is to be included.
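With that definition, the percentage itself is simple arithmetic over the 90-day period. A small Python sketch, where the function names are illustrative only: 90 days is 7,776,000 seconds, so a 99.99% figure leaves room for roughly 13 minutes of total downtime.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def percent_up(downtime_seconds, period_days=90):
    """Percent of the period the box was up, given total downtime in seconds."""
    period_seconds = period_days * SECONDS_PER_DAY
    return 100.0 * (period_seconds - downtime_seconds) / period_seconds

def downtime_budget_seconds(target_percent, period_days=90):
    """Total downtime (seconds) that still meets the target percentage."""
    period_seconds = period_days * SECONDS_PER_DAY
    return period_seconds * (100.0 - target_percent) / 100.0
```

For example, a single one-hour reboot in the period already pulls the figure down to about 99.95%.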
Ken Penland_1
Trusted Contributor

Re: % Uptime

We actually have a daemon running on each of our boxes that checks whether it has been running for over a minute and writes a timestamp to a file. If it has just started, it checks the timestamp in the file to see when it last ran and notes the amount of time in between. You could do something similar using cron every minute, but then it would report the system as down whenever cron is stopped for any reason, which may not be what is considered "downtime".
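A minimal Python sketch of that heartbeat idea, not the actual daemon described here: the file path, the function names, and the one-timestamp-per-file layout are all assumptions for illustration.

```python
import os

def beat(path, now):
    """Overwrite the heartbeat file with the current epoch time."""
    with open(path, "w") as f:
        f.write(f"{now:.0f}\n")

def gap_since_last_beat(path, now):
    """On startup: seconds since the last recorded heartbeat.

    Returns None if no heartbeat file exists yet (first run).
    """
    if not os.path.exists(path):
        return None
    with open(path) as f:
        last = float(f.read().strip())
    return now - last
```

A daemon built on this would call `gap_since_last_beat` once at startup, log any gap much larger than the beat interval as downtime, and then call `beat` every minute.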
Geoff Wild
Honored Contributor

Re: % Uptime

Have a look at Big Brother:

http://www.bb4.org/

It has built in availability reports.

Rgds...Geoff
Proverbs 3:5,6 Trust in the Lord with all your heart and lean not on your own understanding; in all your ways acknowledge him, and he will make all your paths straight.
Geoff Wild
Honored Contributor

Re: % Uptime

The attached is one of my server's availability reports in text format - html is also available....


Rgds...Geoff
Geoff Wild
Honored Contributor

Re: % Uptime

Of course - better to attach the file....
A. Clay Stephenson
Acclaimed Contributor

Re: % Uptime

You need to have a serious talk with your management because I can absolutely assure you that if you don't have planned downtime you are going to have unplanned downtime. Your statistics also penalize you when, for example, you move an MC/SG package to another node and then down the host for maintenance.

If your performance is being measured by uptime then I would really get the terms clarified.

A much more reasonable metric is the amount of unplanned downtime over a given period. Another is by how much you have exceeded your planned downtime period during the same interval.
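Those two figures could be tallied along these lines in Python. The record layout (actual seconds, planned seconds) and the `summarize` name are hypothetical, chosen only to illustrate the split between unplanned downtime and overrun of a planned window.

```python
def summarize(outages):
    """Return (unplanned_seconds, overrun_seconds) for a reporting period.

    outages: list of (actual_seconds, planned_seconds) tuples, where
    planned_seconds is 0 for an outage nobody scheduled (assumed layout).
    """
    unplanned = sum(actual for actual, planned in outages if planned == 0)
    # Only time beyond the agreed window counts against a planned outage.
    overrun = sum(max(0, actual - planned)
                  for actual, planned in outages if planned > 0)
    return unplanned, overrun
```

With this split, a maintenance window that finishes early costs nothing, while one that runs long is charged only for the excess.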
If it ain't broke, I can fix that.
Steven E. Protter
Exalted Contributor

Re: % Uptime

I suppose to automate it you could task a box to do the following things:

1) Ping every server
2) Run a network connectivity test such as an ssh command.

You do it at intervals and record whether you get a response code 0 or non-zero. Non-zero means down.
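A rough Python sketch of that probe loop's core, assuming the probe is any command whose exit status signals success. The command is passed in rather than hard-coded because ping and ssh options differ between platforms; `is_up` and `availability` are illustrative names.

```python
import subprocess

def is_up(command, timeout=10):
    """Run one probe command; exit status 0 counts as up, anything else as down."""
    try:
        result = subprocess.run(command, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        # A hung probe or a missing command is treated as "down".
        return False
    return result.returncode == 0

def availability(results):
    """Percent of probes that succeeded; results is a list of booleans."""
    if not results:
        return None
    return 100.0 * sum(results) / len(results)
```

Recording each result with a timestamp, rather than just the running percentage, makes it possible to go back later and work out why a given probe failed.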

The problem with automating it is figuring out why you were down. The ping won't travel if the network switch is unplugged. Was the server down? No, the network was.

The other problem with my methodology is that it doesn't use real applications. ssh can be up, but if adabase is down my users call the help desk and say "the system's down".

So there you have it.

An independent help desk that reports actual downtimes manually is how my performance as an admin is measured.

SEP
Steven E Protter
Owner of ISN Corporation
http://isnamerica.com
http://hpuxconsulting.com
Sponsor: http://hpux.ws
Twitter: http://twitter.com/hpuxlinux
Founder http://newdatacloud.com
Jan van den Ende
Honored Contributor

Re: % Uptime

Hi,

I most thoroughly agree with Patrick,


Well, you first must determine what exactly you mean by "uptime".


You just have to elaborate on that:

--- If a node leaves your cluster planned, no user would ever notice.
Could hardly count as downtime.

--- If a node crashes out, the users happening to be using that node have to reconnect.
From the user's point of view, this looks like a network glitch (see below).
Accountable as downtime?? Short, to some.

--- If an application fails on one node, similar to users of that applic on that node only (subset of previous).

--- If an application (or its DB) fails, it is experienced as downtime by users of THAT applic. Users of other applics will not notice (or maybe some more resource availability. Do not count on THAT ever being reported). How to account one missing app towards "downtime"?

--- If one SITE fails, users connected to that site have to reconnect. Processing continues, but, diminished resource availability will probably be noticed, and maybe some application(s) need to be given priority over other(s).

--- If part of the network fails, those having connections via that part will be re-routed. They experience interrupts, might have to reconnect, and if the affected network parts are large or vital enough, several users might experience downtime, others will not. How to account?

--- If a "User Interface" (PC, WBT, Terminal Emulator, Terminal,... ) fails, from THAT location there is NO service. Maybe another workplace is near, maybe not. To THAT user it is downtime, but to all others?

--- If you have to upgrade system or software, does it support Rolling Upgrade? If yes, no downtime, if no, planned downtime.
Depending on the organisation, that counts as downtime or not. (In any 9-to-5, and even in Monday-Friday 0-24 operation, you can schedule off-hour downtime. In a police call-room, you had better not!)


The system I am maintaining DOES include a 365*24 callroom, but also some 3-shift, and some 9-to-5 applics. The discussion never ends.


So, although it LOOKS like a simple question, even in one and the same system a lot of answers are possible, all very valid, depending on the perspective.

(and SYSTEM downtimes are easy to report: boottime April 13 1997 10:35 GMT.)

fwiw,

Proost.

Have one on me.

Jan
Don't rust yours pelled jacker to fine doll missed aches.