Re: Problem with Locks

John McL · ‎02-09-2011

Occasionally we've had a cluster-wide problem with locks in that a lock with a PID of 0 blocks multiple processes that are trying to obtain exclusive access to a lock for just a few moments.

The very undesirable work around is to kill some processes that, as far as we can see, have nothing to do with the lock that the other processes are trying to access.

The attached text file shows output from SDA and you'll see that the last lock examined has a conversion queue listed.

Any advice would be appreciated because when this lock-jam happens it stops some significant processing for our business.

We're running VMS 8.3 on Alpha on a 6-node cluster.

Andy Bustamante · ‎02-09-2011

Install Availability Manager (http://h71000.www7.hp.com/openvms/products/availman/index.html?jumpid=/products/openvms/availabilitymanager) on all nodes. It can provide a global view of lock conflicts.

Are you current on patches for 8.3?

If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net

Hein van den Heuvel · ‎02-09-2011

It is a 4 byte exec mode lock held in CR mode, owned by system.

That is 99% sure to be an RMS bucket lock for a bucket in the global buffers. It might be Rdb.

You need to look at the Parent Lock (12005760) resource.
Is it named RMS$ ?

If that's what it is, then these locks are supposed to be transient, with no end-user program control.

Attached you'll find a PowerPoint presentation I made with some details on those kinds of locks.

Somehow the process to lower the lock did not get cycles. Busy box? Are folks playing with process priorities over wide ranges?

>> "kill some processes that, as far as we can see, have nothing to do with the lock "

Hmmm, not very scientific :-).

There are tools out there to help figure this out, notably DECamds, but others like myself have written some helpers as well.

You may want to engage a specialist in this area.

Good luck!
Hein van den Heuvel ( at gmail )

John McL · ‎02-09-2011

Hein,

A couple of brief answers before I discuss this with other people in-house...

1 - We don't run Rdb, but we do use a lot of indexed files and they very likely have global buffers.

2 - The Powerpoint presentation looks good, so thanks a lot.

3 - >> "kill some processes that, as far as we can see, have nothing to do with the lock "

You said "Hmmm, not very scientific :-)".

We were in the process of shutting down all the images that are part of our business system and we were going to reboot when that was done, but by accident we found that as soon as a certain image died the problem disappeared. We restarted that image and had no immediate problems, but the "lock-jam" still happens occasionally.

Let me have a chat to people and then maybe get back to you.

Hein van den Heuvel · ‎02-09-2011

I don't see a timestamp in the SDA output,
but if it is relatively recent then you may well still be able to verify that parent lock.
The lock ID will stick around as long as someone has the file open,
so maybe you can still confirm which file it was?
You can use lib$fid_to_name, or DUMP dev:/ID=id, to find the name of the file in question.
Or use SDA and SHOW PROC/RMS, of course.

fwiw,
Hein.

John McL · ‎02-09-2011

We don't have further data for the Lock problem in the attachment but we do have a (forced) crash dump from the previous occurrence. [see attachment to this posting]

The parent lock is a RMS lock (see "part 1" of attachment).

We did use the file ID at the time to identify the file; we suspected it anyway because the problem is that a section of that file is locked or being locked. (see "part 2")

Last time, though, the lock looked different. There's still a CR lock with no PID, but there's also a PW lock with a PID.

Also, at the time of the hangs, I'd say the boxes weren't that busy - it occurred late morning rather than during heavy batch load overnight - and as far as I know we wouldn't have any processes at extreme priorities (all our processes would be in the range of 4 to 8).

The real question is what's causing it?

Hein van den Heuvel · ‎02-09-2011

The PPT presentation I attached earlier does NOT show a new (2000 :-) PW lock usage.

PW used to be used only while holding a dirty deferred-write local buffer.

With the Concurrent Read (CR) global buffer improvement, PW locks are also used to make the transition of the global buffer system lock in and out from NL to CR 'safe'.
It is trickier still to handle a classic EX request, and again PW locks are used to deal with GBD$V_WRITE_TRANSITION being set for EX.

The goal for CR was not just to allow concurrent read of (notably index) buckets, but to avoid having to take out the lock in the first place by using a reference count in the shared memory section.

So if a process needs CR and a CR system lock is found... use it.
If not, get a PW lock first, such that only one process instantiates the CR lock at a time.
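
As a rough illustration of that "use the CR lock if it exists, otherwise serialize its creation behind PW" idea, here is a minimal Python sketch. It is only an analogy using threads in one process: the class and attribute names are hypothetical, and it does not reproduce RMS internals or the distributed lock manager.

```python
import threading

class GlobalBufferSketch:
    """Toy model of one global buffer. All names are hypothetical, not RMS internals."""

    def __init__(self):
        self.cr_lock_present = False    # stands in for the system-owned CR lock
        self.readers = 0                # stands in for the reference count in shared memory
        self.instantiations = 0         # how many times the CR lock was created
        self._pw = threading.Lock()     # stands in for the PW lock
        self._count = threading.Lock()  # protects the counter in this toy model only

    def start_read(self):
        if not self.cr_lock_present:
            # No CR system lock yet: take PW so only one reader creates it.
            with self._pw:
                if not self.cr_lock_present:  # re-check after winning PW
                    self.cr_lock_present = True
                    self.instantiations += 1
        # CR lock exists: just bump the reference count, no new lock request.
        with self._count:
            self.readers += 1

    def end_read(self):
        with self._count:
            self.readers -= 1


buf = GlobalBufferSketch()
threads = [threading.Thread(target=buf.start_read) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(buf.instantiations, buf.readers)
```

However many readers arrive at once, the re-check under the PW stand-in means the CR lock is created once, and later readers only touch the counter.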

I suspect that the PW requests in the list are all from processes trying to also change the global buffer lock... but they _could_ be deferred write attempts. I think you can tell them apart, as the DFW PW locks will have a blocking AST address.

I'm a little surprised at the high Seqnum: 000C5EE7.
That suggests almost a million updates to that bucket... best I know with the information so far.
This might have happened over hours, days or even years.
In a cluster with global buffers, if there is always someone with the file open, then the bucket lock and its value block could have a very long life!
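
The Seqnum is shown in hexadecimal, so the "almost a million" estimate can be checked with a quick conversion. A small Python check, assuming only the value quoted above:

```python
# Seqnum as quoted from the SDA output, interpreted as hexadecimal.
seqnum = int("000C5EE7", 16)
print(seqnum)  # 810727
```

That is roughly 0.8 million, on the assumption that the sequence number is bumped once per bucket update.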

However long it took, this bucket usage is potentially 'special'. If this was my problem I would want to poke at the bucket in question to see what is in there!
Maybe it is just a boring SYSUAF data bucket holding a few popular usernames, like for Samba or SQLservice, and their last-login date is updated 'all the time'.
Or maybe it is a highly contentious 'master' record holding a rapidly increasing next object ID, for which the algorithm can be improved.
( Or maybe this is just an artifact of the CR lock usage that I have not figured out yet. )

This may well all lead to an OpenVMS support case, but it behooves you to get the best possible definition of the problem and its conditions. In the process of establishing that you may well find application improvement opportunities. For example, maybe DFW is contraindicated due to a low re-dirty hit rate.

Cheers,
Hein

Hein van den Heuvel · ‎02-20-2011

Hey John, any updates?
I talked to John AtoZ Friday night and he indicated you switched off global buffers for the file in question, which apparently has a very high write rate. No deferred write in play, right?
He also mentioned some dirty bit being set? (Details were lost as we were supposed to be focusing on our poker game.)
Anyway, any update you can share for the benefit of the community would be appreciated.
Please send me an email if you feel there are details that have no place in a public forum.
Regards,
Hein

John McL · ‎02-21-2011

Hein, John AtoZ will be more up to date on this than I am. He's much closer to the centre of what's happening. I'm just the channel because I have an ITRC forum account.

If I hear anything that may be useful to other people I'll post it here. I sometimes use this forum to search for solutions to problems so I appreciate how useful it is to have good information here.

Robert Brooks_1 · ‎02-21-2011

>> "John AtoZ will be more up to date on this than I am. He's much closer to the centre of what's happening. I'm just the channel because I have an ITRC forum account."

--

Ah, quit holding AtoZ's hand and tell him to get his own ITRC account :-)

-- Rob
