RWCLU

Wim Van den Wyngaert · ‎03-23-2005

Yesterday we had a development cluster (with 6 system disk less stations in it) that had processes in RWCLU. All login trials went into RWCLU. All active processes too (not sure of the word all). I still had an open sesssion and could do a show system and a search. The search worked but eventually blocked too.

This took over 15 minutes before I restarted the cluster (I didn't take a system dump, yes I was stupid).

The restart was done for 1 server node but it got a fatal bugcheck in the lock manager.
Then I restarted both nodes. The stations didn't reboot.

I checked the operator log file and found network interruptions but not between the 2 server nodes.

I know RWCLU is notmal an indication of lock remastering. Performance advisor didn't find any anomalies.

Anyone an idea of what happened ?

VMS 7.3 not fully patched on Alphaserver 4100.

Wim

Wim

Volker Halle · ‎03-23-2005

Wim,

RWCLU is also used during cluster state transitions.

What do you mean with 'network interruptions' ? Connection lost ?

Would you want to post the CLUE file from the lock manager crash ?

Volker.

Wim Van den Wyngaert · ‎03-23-2005

Volker,

The operator log file only mentioned "node station lost connection to station or server".

Wim

Wim

Volker Halle · ‎03-23-2005

Wim,

so you had network problems...

The following OPCOM message will indicate, that NODEA did NOT receive a cluster hello (multicast) message from NODEB since about 8-10 seconds:

Node NODEA (csid 00010074) lost connection to node NODEB

Each node in the cluster sends a multicast-message to all other cluster-members (using a MC address based on the cluster group number) every 3 seconds.

If the problem is intermittent, once the next hello message is received from that node, the following OPCOM message is shown:

Node NODEA (csid 00010074) re-established connection to node NODEB

If no hello message is received from NODEB for more than RECNXINTERVAL seconds, the following message is shown:

Node NODEA (csid 00010074) timed-out lost connection to node NODEB

and NODEB is removed from the cluster by NODEA.

While in this state, any process trying to communicate with the lock manager on a remote node may be put in RWCLU.

Volker.

Volker Halle · ‎03-23-2005

Wim,

the LOCKMGRERR crash of node SALPV2 indicates that a bad/corrupted LOCK message has been received from node ALM12. That node should have crashed as well (with LOCKMGRERR) - as indicated by R0=0000223C %SYSTEM-F-NODELEAVE.

This type of crash can happen, when messages are being corrupted (either in the sending or the receiving node OR on the 'wire' i.e. LAN).

Together with your network problems (connection lost), this is another indication of a possible network problem.

Volker.

Wim Van den Wyngaert · ‎03-23-2005

Volker,

You're hot. ALM12 did reboot. But it is the only station that rebooted.

At 15:57:30 there were PEA device errors on SALPV1 (the server). In operator log there were "lost connection" messages.
At 15:59:30 my monitoring found processes in RWCLU (a batch job and a RSH to start with).
At 16:00:30 DSM gave circuit timeout messages
At 16:05: I saw the problem.
At 16:12:30 I rebooted the server.

To survive network outages recnxinterval is on 900 seconds, so 15 minutes. I guess that I rebooted the system just to fast. The problem would have solved itself a little bit later.

What I don't understand :
1) search is also locking local files. Why did it work until a certain level ? Cache ?
2) why I never get this with network outages during the night ? There are batch jobs all the time ! Outages too short ?

Wim

Wim

Volker Halle · ‎03-23-2005

Wim,

I'm assuming that you'll get the RWCLU state, if the lock request somehow involves an operation with a remote node, to which the local node has lost connection.

This would highly depend on which resources your processes touch at that time.

Volker.

Wim Van den Wyngaert · ‎03-24-2005

Only the 2 cluster servers have a non-zero LOCKDIRWT. And the network between them was not interrupted. So, why did I get RWCLU on login ? A lock remastering taking almost 15 minutes ?

Wim

Ian Miller. · ‎03-24-2005

if only two nodes have nonzero lockdirwt then they will be the lock directory nodes. Other nodes talk to them to find out who is the master node for a particular node. Parhaps, due to your network problems, the lock directory lookup operation was what the processes in RWCLU where waiting for.

____________________
Purely Personal Opinion

Wim Van den Wyngaert · ‎03-24-2005

Today the same thing happened.
But I didn't reboot and it freed itself without intervention.
A number of cluster stations where powered off.
The cluster was locked for 15 minutes, as expected.

Wim

Volker Halle · ‎03-24-2005

Wim,

yes, if you power off a satellite without proper shutdown, connections to this node will be lost and it will take RECNXINTERVAL seconds until the node is timed out and removed from the cluster.

Once the node is removed from the cluster, everything (e.g. processes put in RWCLU) will continue.

It's still not 100% clear to me, which kind of LOCK/RESOURCE operation in this scenario causes a process to be put in RWCLU.

Volker.

Volker Halle · ‎03-24-2005

Wim,

one way for getting into the RWCLU state (RSN$_CLUSTRAN) is when a cluster state transtition has been started and locks on the local node are being stalled (LCK$GL_STALLREQS .ne. 0). But a cluster state transition is not started until the connection to the remote node times out (which is not yet true in your case).

If the connection is lost (but not yet timed-out) and a process is involved in a lock operation, which needs to send a lock request to the remote node (to which the connection has been lost), the process is put in RWSCS.

So the only remaining scenario I can think of (for entering RWCLU), is that a remastering operation may happen (between your 2 servers), which also involves a lock on the remote node.

Volker.

Wim Van den Wyngaert · ‎03-28-2005

Lockdirwt is on 0 on all stations, 1 on the servers.

I would expect that the lock manager stayed on the 2 servers but now I notice that ana/sys show lock/sum results in moves of the lm TO the stations and this very frequently (about every minute a move is done).

???

Wim

Wim

Volker Halle · ‎03-28-2005

Wim,

MONI RLOCK (or SDA> SHOW LOCK/SUMM) will list the 3 reasons of lock tree outbound movements:

Tree moved due to higher Activity
Tree moved due to higher LOCKDIRWT
Tree moved due to Single Node Locks

What does it say on your servers and satellites ?

Volker.

Wim Van den Wyngaert · ‎03-28-2005

Servers : 70% due to higher activity, 0% due to higher lockdirwt, 30% due to single node locks.

Stations : 30% due to higher activity, 3% due to higher lockdirwt, 40% due to single node locks. On another station I found 70% due to lockdirwt.

Wim

Wim

Volker Halle · ‎03-28-2005

Wim,

lock remastering involves individual RESOURCE TREES (i.e. a root resource and it's sub-resources) - depending on lock activity in this tree on the different nodes in a cluster.

Depending on your version of OpenVMS Alpha, there will be a SYS$SHARE:LCK$SDA SDA extension with various interesting commands (SDA> LCK provides some basic help).

SDA> LCK SHOW ACTIVE will show resource tress with lock activity

SDA> LCK STAT/NOALL/TREE/TOPTREE=n will display the n most active lock trees

Now as you see the amount of lock remastering in your cluster, one of the possible scenarios for RWCLU (remastering a tree which includes a lock on a machine, which has lost connection to the cluster), becomes more plausible.

If you really want to find out about the reason of a process going into RWCLU, force a crash with a process in that state and then we'll find out...

Volker.

Wim Van den Wyngaert · ‎03-29-2005

Still not clear after a lot of reading.

1) Why do stations receive management of a resource while lockdirwt is 0 ?

2) Is there no way to see what resource is exactly remastered instead of statistics ? I'm afraid it is something ordinary such as the sysuaf.

3) How can I see how many bandwith is eaten by the remastering ? As I understand it, the packets exchanged can be very big and I only can see the number of packets (=messages).

3) How many remasterings per minute is normal ? I see up to 30 remasterings per minute on my GS160 (running Sybase and DSM, DSM in cluster config).

And Volker, I can not ask for a crash. I have to wait ...

Wim

Wim

Volker Halle · ‎03-29-2005

Wim,

1) sole interest. If locks on this resource tree only exist on this station

2) doesn't SDA> LCK SHOW ACTIVE at least show the most active resource trees to give you an idea about the resources involved.

3) Lock re-mastering is a trade-off between on-going remote lock messages versus moving a resource tree once to the most active node and thereby trying to increase local locking.
Heavy lock remastering will increase Interrupt Stack Time.

4) You can limit the maximum size of a resource tree being moved by setting PE1 = n
PE1=-1 will disabled remastering.

Volker.

Wim Van den Wyngaert · ‎03-29-2005

Volker,

I don't agree.

On 1) : I found 1200 trees moved to a station in 6 days. How can a lock move a tree TO the station if it is the only one locking the resource ?

On 2) : HP implemented lots of show commands (LCK...) but not the most important one : monitoring of remastering in detail.
SDA>show remastering
08:01:01.12 resource xxx moved from node aaa (bbb requests in 8sec) to node ccc(ddd requests in 8sec). Moved yyy K in zzz sec.

On 3) : HP shows the number of messages. What importance does it have it the size is between 0 and 64 K ? Shouldn't the number of MB be shown ?

On 4) : this could be suicide. I should have a statistic of all resources with their size and number of remasterings before I could set this PE1. Better would have been a set file command to modify 1 resource setting.

5) Machines are getting quicker all the time. Why is are the parameters for remastering hardcoded ? This is very un-VMS.

Wim

Wim

Volker Halle · ‎04-01-2005

Wim,

re: 1) if a resource tree is first used on multiple nodes and then only generates activity on one stations, it should move there, shouldn't it ?

re: others - I'll try to ask these questions during the OpenVMS Bootcamp in June.

Volker.

Wim Van den Wyngaert · ‎04-06-2005

I made a procedure that can monitor the persistent locks on the move.
Still have to wait for monitoring results but rightslist.dat is on the move all the time. Simply @ it on a cluster.

I found also that in peak times, there are 500 remasterings in 10 minutes generating up to 3000 send and receive messages (of unknown size).

Wim

Wim

Wim Van den Wyngaert · ‎04-06-2005

On the other hand : with the procedure I saw 6 remasterings during 10 minutes while the counter of trees moving out increased with 400.

Wim

Wim

Keith Parris · ‎04-11-2005

>Still not clear after a lot of reading.
>
> 1) Why do stations receive management of
> a resource while lockdirwt is 0 ?

This may happen if they are the only node in the cluster with interest in a particular resource tree at a given point in time. As soon as there is a node with non-zero LOCKDIRWT which begins sharing the tree, VMS will tend to remaster the tree to the node with the higher value of LOCKDIRWT (unless it is artificially prevented from doing so).

> 2) Is there no way to see what resource
> is exactly remastered instead of
> statistics ? I'm afraid it is something
> ordinary such as the sysuaf.

SDA can show lock mastership for a tree, as others pointed out. An easy way is the LOCK_ACTV*.COM tool from the [KP_LOCKTOOLS] directory of the V6 Freeware. It shows, in descending order by activity level, all active resource trees, indicating the present master node with an asterisk.

> 3) How can I see how many bandwith is
> eaten by the remastering ? As I
> understand it, the packets exchanged can
> be very big and I only can see the number
> of packets (=messages).

I don't know of a good way at present. You can use SHOW CLUSTER/CONTINUOUS to get counts of ALL block data transfers, which could at least give you an upper bound value.

And then you could temporarily disable rematering with PE1=(a very large value, or -1) and compare the rates.

> 3) How many remasterings per minute is
> normal ? I see up to 30 remasterings per
> minute on my GS160 (running Sybase and
> DSM, DSM in cluster config).

If it's the same set of trees being remastered all the time, that sounds excessive, like it is thrashing between nodes. I've addressed the causes and workarounds for lock mastership thrashing in some user-group presentations, such as
http://www.geocities.com/keithparris/decus_presentations/s2001dfw_lock_manager.ppt

Your options to avoid thrashing at this point are basically:
o Unbalanced node power rather than a set of equal-powered nodes
o Unequal workloads (bias the load distribution to put more load on one machine than the others)
o Unequal values of LOCKDIRWT
o Non-zero values of PE1 (and since PE1 is dynamic, you could use different values at different times, perhaps allowing remastering for short times periodically to avoid trees getting stranded on sub-optimal nodes)
o Raise value in VMS data cell LCK$GL_SYS_THRSH to require a higher delta in activity between nodes before a tree will be remastered

> 4) You can limit the maximum size of a
> resource tree being moved by setting
> PE1 = n
> PE1=-1 will disabled remastering.

Correct. Note that PE1=-1 disables all remastering, and also disables keeping of the statistics that SDA> LCK SHOW ACTIVE and my LOCK_ACTV* tools look at.

> On 2) : HP implemented lots of show
> commands (LCK...) but not the most
> important one : monitoring of remastering
> in detail.

MONITOR RLOCK is also available. That gives general statistics on remastering in a bit more-readable format than SDA> SHOW RESOURCE/SUMMARY does.

To get detail on which trees are moving, you might have to do something like process the output from a tool like LOCK_ACTV* looking for changes in lock mastership for specific trees.

> On 3) : HP shows the number of messages.
> What importance does it have it the size
> is between 0 and 64 K ? Shouldn't the
> number of MB be shown ?

This is an artifact of history. Since lock remastering used to only use sequenced messages, that is what was counted and reported. Now block data transfer counts would be more interesting. As I noted above, you may be able to get some idea of the magnitude using SHOW CLUSTER/CONTINUOUS to get SCS-level block data transfer statistics.

> On 4) : this could be suicide. I should
> have a statistic of all resources with
> their size and number of remasterings
> before I could set this PE1. Better would
> have been a set file command to modify 1
> resource setting.

Since PE1 is dynamic, it's fairly easy to play with it and observe the behavior.

> 5) Machines are getting quicker all the
> time. Why is are the parameters for
> remastering hardcoded ? This is very
> un-VMS.

I agree. I have a problem report in the system on the issue of LCK$GL_ACT_THRSH being hard-coded at 80 (per 8-second rematering scan interval, or 10 lock operations per second) as the threshold of difference in locking activity between nodes which will trigger remastering of a tree. I'd rather see this be a percentage. Some folks use a program to modify this cell to a higher value.

Wim Van den Wyngaert · ‎04-11-2005

Keith,

How do you explain that sda show stat/tree shows only a few remasterings (show+diff in a loop) while the counters indicate at leat 10 times more remasterings ?

I keep my point that you can consult statistical info but that there is nothing to find what is exactly happening.

I can not play with the system because it is production and rather critical. No other system has comparable activities.

Wim

Wim

Keith Parris · ‎04-11-2005

> How do you explain that sda show stat/tree
> shows only a few remasterings (show+diff
> in a loop) while the counters indicate at
> least 10 times more remasterings ?

SDA> LCK SHOW STATISTIC/TREE will only grab a snapshot of the state of lock mastership at a given point in time. Starting in 7.3 (and also included in 7.2-2), VMS is very efficient at remastering trees, so it is quite possible that a lock tree could be remastered over and remastered back and you might not catch it by polling with SDA> LCK SHOW commands.

> I can not play with the system because it
> is production and rather critical. No
> other system has comparable activities.

Are we still talking about the same development cluster as in the initial note -- the one where you were prepared to spend 15 minutes in RECNXINTERVAL wait time to ride through network outages?

If scheduled outages are easier to do than changes during operation, one possible approach might be to use a scheduled outage to change LOCKDIRWT to a higher value on one system, preferably one of your fastest. This is a very successful strategy for many sites, but the limit of this approach is that it only works up to the point where that single node becomes saturated with all the work of mastering all the shared lock trees.

I think modifying PE1 dynamically would be a much less risky approach.

I once worked with a VMS cluster that grew to 19 nodes, all the largest available models at the time. Prior to 7.3, remastering would cause user-visible pauses of 10 to 50 seconds in duration. But to confirm that this was actually the cause of the pauses, I had to temporarily change PE1 to a non-zero value (I think I started with 10,000 or some other very-large value, to gain confidence as we exercised the new-to-us code paths, and gradually moved to smaller values). As with your situation, there was no test environment in which the symptoms could be duplicated, and the test had to be done under full load on a typical production morning. In this sort of situation, if you take the action you really need to take, you know you'll either be a hero, or be unemployed, one or the other. :-)

In the case of modifying PE1, the risk is low, and any change can be backed out quickly from the active parameters. Since PE1 disables remastering OFF OF a node, be aware that if you think you're starting cautiously by setting it to positive value on only one node at a time, the effect will be that large lock trees that are thrashing around will tend to "stick" on that one node instead of moving freely about, so you'll want to monitor it closely, particularly its interrupt-state time on the Primary CPU. If you were to change PE1 on all nodes at once, you wouldn't have this concentrating effect, even though it sounds scarier intuitively.

I'd recommend starting with relative large positive values for PE1 (perhaps 10K to start). As you gain confidence, you can set PE1 to values gradually lower and lower over time until they are low enough to catch the majority of thrashing trees.

Once you have a non-zero value of PE1 in place, then to prevent trees from sticking permanement on nodes such that lock mastership becomes sub-optimal over time, resulting in lower performance, you can use a DCL command procedure in a batch job to change PE1 back to zero for maybe a minute or two every so often. Be prepared for some remastering to take place at these times, of course.

Categories

Company

Local Language

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Forums

Discussions

Discussions

Forums

Forums

Discussions

Forums

Discussions

Forums

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Discussion Boards

Community

Resources

Other HPE Sites

Discussions

Forums

Blogs

RWCLU

RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU

Re: RWCLU