Operating System - OpenVMS

Measuring system uptime

Dave Gudewicz
Valued Contributor

Measuring system uptime

Sounds simple, doesn't it?

After I spent some time trying to understand the question, the waters got deeper.

Some say:

1. If the system is running, it's up.
2. If the system and network are OK, it's up.
3. If the system, network, and app are OK, it's up.

What did I miss?

I like door #3 btw.
10 REPLIES
Uwe Zessin
Honored Contributor

Re: Measuring system uptime

It is certainly simple: just make up your criteria for 'being up' and then you can measure ;-)

What if you go through door number 3, but find out that everything is awfully slow?
Dave Gudewicz
Valued Contributor

Re: Measuring system uptime

Slow wasn't in the equation.

That's yet another whole ballgame and the waters get deeper still.
Mike Naime
Honored Contributor

Re: Measuring system uptime

At work, they have the following classifications.

- Outage.
- Performance degradation.
- Incomplete functionality.

Sometimes they classify a really slow system as an "Outage", but it's really an application/network problem, not an OS/hardware problem.

So, if a network switch dies and your system is still up, is this an outage of the VMS system? I say no. The end user says yes. If I have to reboot the VMS system before the stupid Cisco switch will talk to it again, then it's a VMS problem.


Or....

If there is a power outage on the East Coast, and the hosted system in Missouri is still up and running, is this an outage? (Not from where I am sitting, but definitely an outage to the end user! :-)
VMS SAN mechanic

Re: Measuring system uptime

Hi Dave,

We tend to take a more system-centric view. If the system and the application on it are up, we're up. It doesn't matter whether users can get to the application or not.

Our users don't always like this idea, but it is tied to a contract with our vendor that contains performance and availability guarantees.

I guess this is one of those "your mileage may vary" questions. :-)

Dave Harrold
Mobeen_1
Esteemed Contributor

Re: Measuring system uptime

Dave,
You have brought up a really cool topic that can be debated on and on. How uptime is defined for a system really varies from manager to manager and from company to company.

I have worked for various companies in my career, and the definition of system uptime differed from one to another:

1. System uptime (Definition 1)
Some companies define this as the time for which your system has been up.

2. System uptime (Definition 2)
Some companies define this as the time for which the systems were up and so were the applications running on them.

3. System uptime (Definition 3)
Some companies define this as the time for which the systems were up and so were the applications, as per the SLA.

We can debate this for a long time. As a systems manager, I always had issues with item #2 and part of item #3 (I was not comfortable with the application portion).

The majority of systems folks understand that sometimes the applications will have their own issues, and it would be unfair to say the system was unavailable during that time, even though your servers were up and running.
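As an illustrative sketch only, the three definitions above can be expressed as intersections of "up" intervals. All interval data below is invented for illustration, not taken from any real monitor:

```python
# Illustrative sketch: measuring "uptime" under the three definitions above.
# The interval data is hypothetical.

def overlap_hours(a, b):
    """Total hours in the intersection of two lists of (start, end) intervals."""
    return sum(max(0.0, min(e1, e2) - max(s1, s2))
               for s1, e1 in a for s2, e2 in b)

PERIOD = 24.0                              # one day, in hours
system_up = [(0.0, 24.0)]                  # OS/hardware up all day
app_up    = [(0.0, 10.0), (12.0, 24.0)]    # two-hour application outage
sla_ok    = [(0.0, 9.0), (12.0, 24.0)]     # response times within the SLA

def1 = overlap_hours(system_up, [(0.0, PERIOD)]) / PERIOD  # system only
def2 = overlap_hours(system_up, app_up) / PERIOD           # system + application
# system_up covers the whole period here, so the triple intersection
# reduces to app_up intersected with sla_ok:
def3 = overlap_hours(app_up, sla_ok) / PERIOD              # system + app + SLA

print(f"Definition 1: {def1:.1%}")   # 100.0%
print(f"Definition 2: {def2:.1%}")   # 91.7%
print(f"Definition 3: {def3:.1%}")   # 87.5%
```

The same day scores three different "uptimes" depending on which definition the company picked, which is exactly the debate above.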

Regards
Mobeen
Jan van den Ende
Honored Contributor

Re: Measuring system uptime

Dave,

In our uptime talk at ENSA@WORK we went somewhat deeper into this topic.

Our story started with the hardware configuration:
-- two-site, 4-node cluster with CLUSTER uptime >7 years.
-- 11 major applications.
-- users spread over 58 locations, about 10 of them redundantly connected (the number of redundancies is growing).
-- MOST users on WBTs, connected to a Citrix farm, from which, among other things, sessions to VMS apps run.
-- several (typically 'heavier') users on PCs, some with connections to VMS apps.
-- a VERY VITAL app, using VTs on terminal servers.
-- a VERY VITAL app, connected via terminal server to a radio transmitter, for Mobile Terminal Data (MDT) connections.
-- several of the apps have (sometimes MAJOR) connectivity with external systems via a firewall and a (private, countrywide) network.

Then we were challenged with: "If the user cannot run his/her app, then the system is down, so how can you claim 7 years up?"

So, we spent some time breaking this down:

- If a WBT is down, ONE user is down, but he/she can take the WBT at the next desk and work on.
Does this mean THE SYSTEM IS DOWN? Not many takers here.
- If the network segment to a location is down, NO services are offered in that location.
Is the system down? YES to those in that location; NO to everybody else.
- Citrix is down or malfunctioning.
Is the system down? YES to WBT workers; NO to all others. Also NO to WBT workers who took the trouble to learn about short-circuiting from the WBT straight to VMS or UNIX (but at user level: non-trivial).
- If the network connectivity to the cluster fails?
Well, we DID lose connectivity to ONE site, but until now, VMS has remained reachable along at least one route. It would mean, though, that system availability for MANY users would cease.
- If a VT breaks down? A nuisance to have to move to another desk.
- A terminal server breaks down? Half of the VT-based workplaces will fail, but application functionality can continue. The desks in one room are spread over (at least) two servers.
- The radio system, or the terminal server it is connected to, fails?
NO MDT communication until manually failed over to the cold standby; somebody will have to go on site.
System down? YES for that app; NO for all other apps.
- Then, an application can have issues of its own. Some applications (the DBMS and the RMS ones) support rolling upgrades, and in principle they show uptimes comparable with VMS itself. Some use Unix-ported DBMSes.
The DB engine inherently runs single-instance, so any upgrade or node failover implies application downtime.
If one application is down, is the system down? Not to those using other applications.
- If the (connectivity with one of the) remote systems is down, the app runs in a reduced-functionality mode. Is this to be considered UP or DOWN? It depends on what functionality our target user needs at that moment...
- And we had a real special case in January 2002, with the Euro introduction.
One app has a rather large financial aspect,
and in the Euro version a prohibitive bug was detected mid-December.
So, the Euro version wasn't available until the end of January.
First but... the Guilders version was. Only, it could not be used for any new transaction.
Second but... it HAD to be available for statistics and accounting. So, at the request of application management, the app WAS made available, but ONLY to application management, the statistics department, and auditing (totalling about a dozen people). The app was blocked for the normal users (about 3000). Now, do you consider the app UP or DOWN?
To us, the app was running according to specs set by application management, but not many users agreed...


We have come to the conclusion that WE will consider "the system" to be "up" when at least "some" application is available to "some" application users, because that seems to be the only thing measurable.
The only other option would be to measure each individual application (or even each application function) for each individual user. And even with this approach one purist remarked: "How do you define an app that WOULD NOT be available IF the user tried, at a time he does not try?"

Then again, if your system is running only ONE app, which either IS or IS NOT running, and every user accesses it via one and the same path, maybe THEN you can define application availability as system uptime.
Not every system has that monocultural approach.
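The two measurable criteria above, "some app available to some user" versus the purist's per-user, per-app matrix, can be sketched as follows. The availability matrix and all names are invented for illustration:

```python
# Hypothetical sketch of the criterion "the system is up when at least
# SOME application is available to SOME of its users".
# The availability matrix below is invented.

availability = {
    # app name: {user: can this user reach the app right now?}
    "finance":  {"alice": False, "bob": False},   # app blocked for normal users
    "dispatch": {"carol": True,  "dave": False},  # one route still works
}

def system_up(matrix):
    """True if any application is reachable by any of its users."""
    return any(up for users in matrix.values() for up in users.values())

def per_user_view(matrix, user):
    """The purist's alternative: availability per app, for one user."""
    return {app: users.get(user, False) for app, users in matrix.items()}

print(system_up(availability))             # True: carol can reach "dispatch"
print(per_user_view(availability, "bob"))  # {'finance': False, 'dispatch': False}
```

The same matrix yields "up" under the coarse criterion and "down" from bob's point of view, which is the disagreement the thread keeps circling back to.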

So in the end, back at square one: "it depends".


It really starts to look like some philosophic essay, eh? Hoping this holds some useful points for some.

Jan
Don't rust yours pelled jacker to fine doll missed aches.
Mobeen_1
Esteemed Contributor

Re: Measuring system uptime

Jan,
Thanks for posting. Your post reminded me of the application/system component dependencies we were trying to define at one of my workplaces to address BCP and DR.

rgds
Mobeen
Martin P.J. Zinser
Honored Contributor

Re: Measuring system uptime

Hello Dave,

Well, the simple answer is: uptime is whatever you have sold your "customer" as such.

Let me give an example. I work for an exchange. For us, the system is up if banks etc. can trade. In most cases this also includes network connectivity. If connectivity is interrupted, there is a (pro-rated) effect on system availability. This is "fair" since we are selling a full service, including network management up to the banks' sites. OTOH, if a bank decides to use the Internet to connect to the exchange, any outage beyond our routers is not counted, since we have no influence on the Internet at large ;-)

All this is obviously part of SLAs.
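A small sketch of that pro-rating idea, with all figures hypothetical: only outages inside the scope the provider has sold (e.g. the managed network up to the bank's site) count against the availability number.

```python
# Hypothetical sketch of pro-rated availability accounting.
# Outages are (duration_hours, in_scope) pairs; only in-scope outages,
# e.g. on the managed network we sold, count against our figure.

def availability(period_hours, outages):
    counted = sum(hours for hours, in_scope in outages if in_scope)
    return (period_hours - counted) / period_hours

month = 30 * 24.0
outages = [
    (2.0, True),    # our router failed: counts against us
    (5.0, False),   # outage beyond our routers, on the Internet: excluded
]
print(f"{availability(month, outages):.3%}")   # 99.722%
```

Which outages are "in scope" is exactly what the SLA has to spell out; the arithmetic itself is trivial once that line is drawn.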

I do recommend having SLAs, by the way. While they might appear to be a pain from a technical point of view at first, they are really important tools for expectation management and for justifying effort.

Greetings, Martin
Anton van Ruitenbeek
Trusted Contributor

Re: Measuring system uptime

Martin,

An SLA is only fun for managers. It is not really uptime. For the system manager, uptime means the system is up and the apps are running. If the network isn't functioning, that is no uptime for the user, but (internally) it is uptime for the system managers. If the SLA lets you have 12 hours of downtime a day (e.g. a stock exchange front end that is only up from 9 to 9), that is in my opinion an uptime of 50% and not 100%, even though the SLA specifies it as such. For the manager, YES, this is 100% SLA!
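Anton's point can be made concrete with a tiny sketch (numbers hypothetical): the same system scores 50% against the wall clock but 100% against the SLA window.

```python
# Hypothetical sketch: wall-clock uptime vs SLA-window uptime.
# A front end contracted to run 12 of 24 hours a day (e.g. 9 to 9),
# and actually up for exactly those 12 hours.

sla_window_hours = 12.0    # contracted service window per day
actual_up_hours  = 12.0    # hours the system was really up
day = 24.0

wall_clock_uptime = actual_up_hours / day   # the user's (and Anton's) view
sla_uptime = min(actual_up_hours, sla_window_hours) / sla_window_hours  # the manager's view

print(f"wall clock: {wall_clock_uptime:.0%}")   # 50%
print(f"per SLA:    {sla_uptime:.0%}")          # 100%
```

Both numbers are "correct"; they just answer different questions, which is why the SLA has to say which denominator is being used.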

AvR
NL: Meten is weten, maar je moet weten hoe te meten! - UK: Measurement is knowledge, but you need to know how to measure!
Jan van den Ende
Honored Contributor

Re: Measuring system uptime

Martin,
I think Anton did score a point there.

Not too long ago, at a Dutch DECUS day (it still was DECUS, so it must be some years ago), there was a session with a very interesting-looking title: "Achieving never-down for the xxx-bank IBM mainframe" (I forgot which bank).
During the session it slowly turned out that "never-down" was defined as: "running five days a week from 8:00 till 19:00".
And they achieved it mainly by meticulously IPL-ing EACH night!
This was a plenary session without Q&A, and by the concluding plenary technical Q&A the speaker had already left, but there was quite a lot of negative comment about pretending this was "never-down". It _WAS_ fully compliant with the new "heavy-duty" SLA, though.

It's not only "it depends", but also "just pretend"!

Jan
Don't rust yours pelled jacker to fine doll missed aches.