
Re: Two VMS servers are coming up as Duty

 
Sk Noorul  Hassan
Regular Advisor

Two VMS servers are coming up as Duty

Hi all,

We have three VAX 4105 servers running in Duty, Hot and Warm modes, and all of them have been up for around 90 days. The problem is that whenever we switch Duty over to the Hot server, the Hot server appears to struggle and the Warm server also comes up as Duty, which leads to a two-Duty scenario. When we did very frequent switchovers last year (with only a few days of uptime), there were no such issues.

My question: Is it because of the very long uptime that these old VAX servers are behaving like this? Do we need to reboot the servers at certain intervals?
Robert Gezelter
Honored Contributor

Re: Two VMS servers are coming up as Duty

Hassan,

More details are needed. Looking at the information in this posting, my conclusion is that there is some problem with the local procedures used to switch between the different roles.

The posting does not include any information about the details of how the roles are managed. I would not expect a problem with OpenVMS itself to be the issue.

It is possible that a resource usage problem is affecting things, but the reason for that problem should be tracked down. I would also avoid just re-booting, as that will likely destroy the evidence of what the problem actually is.

- Bob Gezelter, http://www.rlgsc.com
labadie_1
Honored Contributor

Re: Two VMS servers are coming up as Duty

Are those three VAXes in the same cluster?

Can you post a
$ show cluster
from each node?

It may be related to the uptime if, for example, the nonpaged pool is too small and expands until it can't.

Usually VMS servers need to be rebooted every 18 years (the famous Irish Railways!) or every 22 years (a node in a restricted area, so HP will refuse to confirm it) or more.
Sk Noorul  Hassan
Regular Advisor

Re: Two VMS servers are coming up as Duty

The servers are on three different networks and are not in a VMS cluster. Application processes on each server stay in sync with each other and update the database based on updates from the Duty server. As per the design, when Duty goes down, Hot comes up as the new Duty and Warm comes up as Hot. State transitions are passed among the servers by an application process. In my case, during the last few switchovers, the process responsible for the state transitions goes into the MWAIT state on the new Duty immediately after the switchover, which disconnects the application link between the remaining two servers.
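
As a rough illustration of the promotion rule described above (Duty goes down, Hot becomes Duty, Warm becomes Hot), here is a minimal Python sketch. The role names come from this post; the function name, the node names and the "Down" label are hypothetical, and the real application logic is not shown in this thread.

```python
from typing import Dict

# Promotion rule as described in the post: Hot -> Duty, Warm -> Hot.
PROMOTION = {"Hot": "Duty", "Warm": "Hot"}


def roles_after_duty_failure(roles: Dict[str, str]) -> Dict[str, str]:
    """Return the role of every node after the current Duty node goes down."""
    duty_nodes = [node for node, role in roles.items() if role == "Duty"]
    if len(duty_nodes) != 1:
        raise ValueError(f"expected exactly one Duty node, found {duty_nodes}")

    new_roles = {}
    for node, role in roles.items():
        if role == "Duty":
            new_roles[node] = "Down"  # hypothetical label for the failed node
        else:
            # Roles without a promotion rule (e.g. "Down") stay unchanged.
            new_roles[node] = PROMOTION.get(role, role)

    # Guard against the symptom in this thread: the result must have one Duty.
    if list(new_roles.values()).count("Duty") != 1:
        raise ValueError(f"switchover would not leave exactly one Duty: {new_roles}")
    return new_roles
```

Called with `{"A": "Duty", "B": "Hot", "C": "Warm"}`, it returns `{"A": "Down", "B": "Duty", "C": "Hot"}`. The final check is the point of the sketch: whatever the transition logic does, the outcome is verified to contain exactly one Duty.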
labadie_1
Honored Contributor

Re: Two VMS servers are coming up as Duty

It would be interesting to know the MWAIT state in more detail (susp, rwxxx...).

Depending on your VMS version (before 7.3 or after), you can do
$ monitor rlock
and you have the great SDA extension
$ ana/sys
lck
to get more info.
If you can install AMDS on the 3 nodes, it could help you a lot.

Depending on your VMS version, check if you can do
$ ana/sys
sh lock/waiting
sh lock/blocking
sh resource/contention

Hoff
Honored Contributor

Re: Two VMS servers are coming up as Duty

Rebooting? There are VAX nodes with up-times of a decade or more.

I would prefer to better understand the trigger, as it is quite possible that the trigger is not sensitive to the application or system uptime. Rebooting might not cure the problem and -- applying Murphy's Law -- rebooting probably won't work exactly when you really need it to work.

If something like the distributed lock manager (DLM) is not used to coordinate the roles, it's potentially easy to get the applications into the wrong states. Proper use of the DLM greatly eases the effort of ensuring each node is in exactly one state. Having coded this stuff manually -- outside a configuration where DLM is available -- it's not easy to get this right, and there are all manner of odd corner cases.

But as Bob G. says, there's nowhere near enough here to go on. And I concur, this looks to be an application or application coordination issue. In particular, take a detailed look at how the applications are coordinating the roles. If it is not using the DLM or if this is not a cluster, then the first spot I'd look is for race conditions and sequencing errors within this area.
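
To make the "exactly one state per node" point concrete, here is a minimal Python sketch of the idea: the Duty role is something a node must acquire from a single arbiter, and the check-and-grant step is atomic. This is only an in-process stand-in. `RoleArbiter`, `race_for_duty` and the node names are hypothetical, it is not the DLM interface, and nothing in this thread says how the poster's application actually coordinates roles.

```python
import threading
from typing import List, Optional


class RoleArbiter:
    """Grants the Duty role to at most one node at a time."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._duty_holder: Optional[str] = None

    @property
    def duty_holder(self) -> Optional[str]:
        with self._guard:
            return self._duty_holder

    def try_acquire_duty(self, node: str) -> bool:
        # Check and grant happen under one lock, so two nodes cannot both
        # see "no Duty yet" and both promote themselves.
        with self._guard:
            if self._duty_holder is None or self._duty_holder == node:
                self._duty_holder = node
                return True
            return False

    def release_duty(self, node: str) -> bool:
        # Only the current holder may give the role up.
        with self._guard:
            if self._duty_holder == node:
                self._duty_holder = None
                return True
            return False


def race_for_duty(arbiter: RoleArbiter, nodes: List[str]) -> List[str]:
    """Let all nodes try to become Duty concurrently; return the winners."""
    winners: List[str] = []
    winners_guard = threading.Lock()

    def attempt(node: str) -> None:
        if arbiter.try_acquire_duty(node):
            with winners_guard:
                winners.append(node)

    threads = [threading.Thread(target=attempt, args=(n,)) for n in nodes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners
```

However the threads interleave, `race_for_duty(RoleArbiter(), ["HOT", "WARM"])` returns a list with a single winner. Without a shared arbiter (as with three standalone nodes on separate networks), the same guarantee has to come from the message protocol itself, which is where the race conditions and sequencing errors tend to hide.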

Stephen Hoffman
HoffmanLabs

Andy Bustamante
Honored Contributor

Re: Two VMS servers are coming up as Duty

All I can offer is more questions. You mention a database being updated. Is that running on another system? Has the size of the database increased over the last year? Is the number of users increasing over time?

I'll also second Labadie's idea of looking at local resources. Collect feedback, run AUTOGEN and look at AGEN$PARAMS.REPORT. Don't run the SETPARAMS or REBOOT phases until you've looked over the recommendations.

Andy
If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
Hoff
Honored Contributor

Re: Two VMS servers are coming up as Duty

The application voting code certainly appears to contain a bug. This is regardless of the system and application and process tuning. If the voting code allows two primaries, that's clearly bad. (Been there, saw that, got to troubleshoot it. It's not fun finding all the corner cases.)

In an OpenVMS Cluster, having two primaries is the logical equivalent of a partitioned cluster.

There may well be performance issues here in the underlying system, or some other hardware or software problem. This could be anything from system settings to RMS file internal fragmentation to disk fragmentation to process quotas to, well, you name it. And these can most certainly stretch the timing or stress the error paths and can open up a case where you have multiple primaries.

I'd clean out the cruft in MODPARAMS.DAT (and particularly look for any parameter settings where there is no identified reason for the value, and cases where absolute settings were used and where ADD_, MIN_ or MAX_ should be used) and perform a full AUTOGEN pass as a start, and take a look at the Performance Management manual to try to get a handle on what is going on. Start up a MONITOR recording task to see what's happening over time. Record most stuff at, say, ten or fifteen minute intervals, and look at the trends. Look at the error logs. And if the application(s) are dropping into MWAIT states, find out what particular mutex is involved. (Having the IDSM internals and data structures manual can be very helpful here, as it details the implementation of mutexes on OpenVMS.)

Stephen Hoffman
HoffmanLabs
Colin Butcher
Esteemed Contributor

Re: Two VMS servers are coming up as Duty

90 days is nothing. There are many high-availability sites with uptimes of years.

The MWAIT type issue indicates some kind of performance problem or resource constraint. It's almost impossible to tell what it could be without taking a careful look at the systems and the application. It may be one of the triggers for a sequence of events which leads to the "multiple systems coming up as Duty" symptom.

3 way automatic determination of status is not easy. The state machine and the transitions are complex. 2 way is hard enough to get right under all possible circumstances. Here's the typical state transitions for 2 way to help you understand the kind of logic that's required:

Machine A Machine B
--------------------------------------
Off to Master Off
Master Off to Standby
Master Standby to Off
Master to Off Off to Master
Master to Off Standby to Master
Master to Standby Standby to Master

And so on. Of course, these states actually represent the application - not the physical machine. In complex cases there's sometimes a time lag between the start of a transition and the completion of a transition.

You can see the thinking - it's not just the states they're in, but the states they're transitioning to and what action needs to be taken. You also have to cater for machines that hang rather than die.
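
As a sketch of how such a table can be turned into a guard, the Python below encodes the six rows above as joint (Machine A, Machine B) transitions and rejects anything else, including any result with two Masters. Reading each row as two columns is an interpretation of the flattened table, the names `ALLOWED` and `check_transition` are hypothetical, and a real system would need the rows hidden behind "and so on" plus handling for hung machines.

```python
from typing import Set, Tuple

Joint = Tuple[str, str]  # (state of Machine A, state of Machine B)

# Each entry is (joint state before, joint state after), one per table row.
ALLOWED: Set[Tuple[Joint, Joint]] = {
    (("Off", "Off"), ("Master", "Off")),             # A: Off to Master
    (("Master", "Off"), ("Master", "Standby")),      # B: Off to Standby
    (("Master", "Standby"), ("Master", "Off")),      # B: Standby to Off
    (("Master", "Off"), ("Off", "Master")),          # A to Off, B Off to Master
    (("Master", "Standby"), ("Off", "Master")),      # A to Off, B takes over
    (("Master", "Standby"), ("Standby", "Master")),  # roles swap
}


def check_transition(before: Joint, after: Joint) -> Joint:
    """Return `after` if the joint transition is allowed, else raise ValueError."""
    # Invariant first: never end up with two Masters, whatever the table says.
    if after.count("Master") > 1:
        raise ValueError(f"refusing transition to two Masters: {before} -> {after}")
    if (before, after) not in ALLOWED:
        raise ValueError(f"transition not in table: {before} -> {after}")
    return after
```

For example, `check_transition(("Master", "Standby"), ("Standby", "Master"))` is accepted, while `check_transition(("Master", "Standby"), ("Master", "Master"))` raises. Treating every unlisted transition as an error is a deliberate choice here: it forces each new case to be thought through and added explicitly.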

I've designed quite a few real-time control systems in my time - and this is probably the most difficult area of the whole system design to get right and to test. I once found code that I thought was correct to have a small timing flaw that hardly ever showed up - and it finally happened 7 years after the system went into operation. It's now "perfect" because we revisited the design and went through the entire state machine, the transitions and the actions to be performed very very carefully (again).

If you have trouble understanding how the application works then you may need to involve the original supplier / designers, or seek external help.

Good luck.

Cheers, Colin (www.xdelta.co.uk).
Entia non sunt multiplicanda praeter necessitatem (Occam's razor).
Colin Butcher
Esteemed Contributor

Re: Two VMS servers are coming up as Duty

Oh dear, that broke my formatting by removing spaces. Here it is again with the 'retain formatting' check box ticked.

Machine A            Machine B
--------------------------------------
Off to Master        Off
Master               Off to Standby
Master               Standby to Off
Master to Off        Off to Master
Master to Off        Standby to Master
Master to Standby    Standby to Master

and so on. There are also transitions that shouldn't happen which you need to guard against - as you've seen.
Entia non sunt multiplicanda praeter necessitatem (Occam's razor).
Jan van den Ende
Honored Contributor

Re: Two VMS servers are coming up as Duty

Hassan,

from your Forum Profile:


I have assigned points to 239 of 321 responses to my questions.

Most of the threads with unassigned answers date back to 2005.

Maybe you can find some time to do some assigning?

http://forums1.itrc.hp.com/service/forums/helptips.do?#33

Mind, I do NOT say you necessarily need to give lots of points. It is fully up to _YOU_ to decide how many. If you consider an answer is not deserving any points, you can also assign 0 ( = zero ) points, and then that answer will no longer be counted as unassigned.
Consider, that every poster took at least the trouble of posting for you!

To easily find your streams with unassigned points, click your own name somewhere.
This will bring up your profile.
Near the bottom of that page, under the caption "My Question(s)" you will find "questions or topics with unassigned points " Clicking that will give all, and only, your questions that still have unassigned postings.
If you have closed some of those streams, you must "Reopen" them to "Submit points". (After which you can "Close" again)

Thanks on behalf of your Forum colleagues.

PS. - nothing personal in this. I try to post it to everyone with this kind of assignment ratio in this forum. If you have received a posting like this before - please do not take offence - none is intended!

PPS. - Zero points for this.

Proost.

Have one on me.

jpe
Don't rust yours pelled jacker to fine doll missed aches.
Robert Gezelter
Honored Contributor

Re: Two VMS servers are coming up as Duty

Hassan,

I would amplify Colin's comment. There is a significant possibility that the communications problem is caused by the MWAIT. The MWAIT itself would be caused by some completely unrelated bug in the code.

As was noted in your last post, these systems are not in an OpenVMS cluster.

My experience in situations like this is the same as that of Colin and Hoff: this type of logic is very sensitive to small errors, and it requires painstaking, careful analysis to ensure that all of the cases are properly taken care of.

- Bob Gezelter, http://www.rlgsc.com
Sk Noorul  Hassan
Regular Advisor

Re: Two VMS servers are coming up as Duty

Thanks all,

We have found the bug. The problem happens only when the current Duty server is serving the external interfaces.