Re: Way to identify which process has a lock grant...

Mark Corcoran · ‎01-28-2009

Hi, I'm trying to produce a workaround for a situation whereby a process "forgets" to unlock a log file that many other processes also use (other processes then hang whilst waiting to write to the log file - this relates to my other posting about WCBs).

Essentially, the workaround is to identify a process which has got the log file open, determine using various criteria that the file is not legitimately open, and $FORCEX the offending process.

I can PIPE the output of SHOW DEVICE /FILES, to search for the log file in question.

Although there aren't generally a large number of files open on the volume, this seemed to me to be an unnecessarily laborious and possibly I/O intensive method of finding the offending process.
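As a rough illustration of the post-processing step, here is a minimal Python sketch that pulls the PIDs for one file out of captured SHOW DEVICE /FILES text. The sample listing, its column layout, and the process names are assumptions made up for the example (only the PID and file name echo values quoted elsewhere in this thread), so check the pattern against real output before relying on it.

```python
import re

# Hypothetical sample of SHOW DEVICE /FILES output. The exact column layout
# is an assumption; compare it with real output from your own system.
SAMPLE = """\
Files accessed on device $1$DGA101: on 28-JAN-2009 10:15:02.11

Process name      PID     File name
                00000000  [000000]INDEXF.SYS;1
BATCH_42        77B54CC7  [LOG]DUMMY.LOG;1
LOGGER_A        77B54D10  [LOG]OTHER.LOG;3
"""

# A data line is assumed to hold an 8-digit hex PID followed by a file spec
# that starts with a directory in square brackets.
LINE_RE = re.compile(r"^(?P<name>.*?)\s*(?P<pid>[0-9A-F]{8})\s+(?P<file>\[.*)$")


def pids_with_file_open(listing, wanted):
    """Return the PIDs of processes that have `wanted` open (version ignored)."""
    wanted = wanted.upper()
    pids = []
    for line in listing.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # header or blank line
        filespec = match.group("file").strip().upper()
        # Drop the ;version suffix before comparing, and skip PID 00000000.
        if filespec.split(";")[0] == wanted and match.group("pid") != "00000000":
            pids.append(match.group("pid"))
    return pids


print(pids_with_file_open(SAMPLE, "[LOG]DUMMY.LOG"))
```

The "various criteria" for deciding whether the file is legitimately open would then be applied to each returned PID before any $FORCEX.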

This is because the process - along with (supposedly) all others - co-operates on access to the file by using a named lock resource, which they all (again, supposedly) $ENQW before attempting to open the file, write to it, close it, and then $DEQ the lock resource afterwards.

I had thought it would be a simple case of using SDA to do SHOW LOCK /NAME=resource_name (admittedly, it would still require the output to be post-processed to extract the PID).

However, it seems that some processes are using $DEQ to convert the lock to a NULL-mode lock.

Consequently, I can get any number of copies of the lock reported, before I actually find the one that says "Granted at EX" rather than "Granted at NL".

Whilst SDA allows a /STATUS on the SHOW LOCK command, all (or virtually all) of the lock copies have exactly the same bits set in the status.

What would be ideal, is to specify /FLAGS= - this would allow me to specify /NOFLAGS=CONVERT (or /FLAGS=NOCONVERT if you prefer), but unfortunately, SDA doesn't appear to have this option unless anyone knows a suitable way of restricting the search criteria that's not obvious from HELP??

[I was loath to process the output of all known copies of locks with the same resource name, because I did notice a sizeable pause at some point in the output.

I'm guessing that this is because the SHOW LOCK /NAME= command is effectively the equivalent of having an SQL database, with an unindexed table, and doing SELECT * FROM table WHERE RESOURCENAME=lock_being_sought

...i.e. it has to traverse the entire table (or in this case, the VMS lock database) to find locks with matching resource names, and that the database is rather full of named lock resources.

[To be honest, I'm not even sure how to find out how many named resources there are in the lock database - SDA> SHOW LOCK /SUMM gives me plenty of figures, but the scant details in the SDA manual don't really help interpret it.

I'm not even sure in hindsight how I would determine the PID from the information that SHOW LOCK /NAME= returns - maybe I /am/ better off sticking with SHOW DEV /FILES?]]

Mark

Hein van den Heuvel · ‎01-28-2009

Just stick with SHOW DEV/FILES.
It is crude, brute force, but effective.

The only catch would be if the waiters are waiting for the application lock, not the file lock, because then the file _might_ be closed but the application lock still held.

Another not-too-hard-to-code solution could be calling SYS$GETLKI (get lucky) in a loop. Unfortunately it does not have any selection criteria. You'll have to look at each lock returned.

You may need to call GETLKI in KERNEL mode, if you need to learn about the FILE LOCK.

Also... do you need to worry about the lock being held on a different cluster member?

Hein.

Mark Corcoran · ‎01-28-2009

>The only catch would be if the waiters are waiting for the application lock, not the file lock, because then the file _might_ be closed but the application lock still held.

As far as I've been made aware, the code should be using $ENQW everywhere, so it should be relying on the application lock, rather than a badly behaved app locking/unlocking the file without reference to the application lock.

>Also... do you need to worry about the lock being held on a different cluster member?
Apparently not.

I've never really had to look too deeply into locks before, so I would guess that cluster-wide locking is possible, but I'm not sure how this works.

Whilst things work, I don't need to get my hands dirty and increase my knowledge (don't have the time, for one thing).

Only when it goes wrong, and I'm determined to get to the root of the problem, do I start learning more than would otherwise be necessary.

Depending on what the developers have found about this potential bug, it might be possible to identify the circumstances under which it occurs, and hence make the workaround a lot simpler.

Fingers Xed!

Jess Goodman · ‎01-28-2009

Unless you are running on IA64 you could just use Edward Heinrich's FILES_INFO program, available at
http://www.tmk.com/ftp/vms-freeware/fileserv/files_info.zip

Just pass it the name of the log file, and you will get:

FILE: _$1$DGA101:[LOG]DUMMY.LOG;1
Total access count of 1, XQP access 1, writers 1, size 10/147

PID      USERNAME        READS   WRITES ACCESS CHARACTERISTICS
-------- ------------ -------- -------- ----------------------
77B54CC7 SYSTEM              0       12 Write, Sequential, NoWriteShr

You must run it on every node where the log file might be open.

I have one, but it's personal.

Hoff · ‎01-28-2009

DECamds (and its follow-on Availability Manager tool) have this mechanism, and it's intuitive and trivial to use.

Once you go through the somewhat unintuitive installation and configuration process, that is.

John Gillings · ‎01-28-2009

Mark,

I'm a bit confused. Is the blocking lock your application lock, or an RMS lock? If RMS is it a record lock or the whole file?

The application lock case should be fairly simple, just $ENQ yourself against the lock, then $GETLKI to find the blocking lock. For RMS, you may be able to do something similar with a ROP=WAT option, perhaps even from DCL with READ/WAIT?
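To see why only the EX holder matters among all those NL copies, here is a small Python model of the lock manager's mode compatibility table. It is a sketch of the reasoning only, not a call to $ENQ or $GETLKI; the PIDs and the `blockers` helper are hypothetical.

```python
# Standard lock-mode compatibility table: for each requested mode, the set
# of already-granted modes it can coexist with.
COMPATIBLE = {
    "NL": {"NL", "CR", "CW", "PR", "PW", "EX"},
    "CR": {"NL", "CR", "CW", "PR", "PW"},
    "CW": {"NL", "CR", "CW"},
    "PR": {"NL", "CR", "PR"},
    "PW": {"NL", "CR"},
    "EX": {"NL"},
}


def blockers(granted, requested_mode):
    """Return the PIDs whose granted locks are incompatible with the request."""
    return [pid for pid, mode in granted if mode not in COMPATIBLE[requested_mode]]


# Hypothetical granted queue for one resource: (PID, granted mode).
granted_queue = [
    ("2020011A", "NL"),
    ("2020011B", "NL"),
    ("2020011C", "EX"),
    ("2020011D", "NL"),
]

print(blockers(granted_queue, "EX"))  # ['2020011C']
```

A new EX request is compatible with every NL lock, so the one process returned here is the one worth investigating; a real implementation would ask $GETLKI for the blocking lock instead of scanning a list.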

If you can go back a step to the application design and work on the locking mechanism, perhaps implement a blocking AST? Maybe some kind of timeout: if the BLAST fires after you've held the lock for "too long", just kill the process?

A crucible of informative mistakes

David B Sneddon · ‎01-28-2009

Mark,

If it is RMS file/record locks then use google
to find "DBS-SCANLOCKS" which I have used for a
while now to locate processes that are locking
files...

Dave

Mark Corcoran · ‎01-29-2009

Thanks others, for their suggestions regarding software which will help analyse this; unfortunately, Sarbanes-Oxley controls would initially prevent these being installed (and I need to go to another team to get approval on it anyway).

John, in answer to your question:

>I'm a bit confused. Is the blocking lock your application lock, or an RMS lock? If RMS is it a record lock or the whole file?

It's an application lock. Let's for argument's sake call the resource name MIRROR_ON_THE_WALL.

One process $ENQWs a lock request for exclusive mode for this resource, and (for whatever reason - maybe it is stuck looping around doing nothing, waiting for something that will never happen, bug in the code where there's no call to $DEQ, or there is a call to it but under some circumstances the logic path avoids this bit of code) never releases it.

Each process supposedly does a $ENQW for MIRROR_ON_THE_WALL, then when the lock is granted, calls LIB$GET_LUN, Fortran OPEN, Fortran WRITE, Fortran CLOSE, LIB$FREE_LUN and $DEQ.

I would presume there will be RMS locks associated with the Fortran OPEN, but the code in hanging processes doesn't (read: shouldn't) get that far because it is still waiting on the $ENQW for MIRROR_ON_THE_WALL.

[The developers have indicated that there is a generic function that does this (it is an error handler), although looking through the CMS libraries, there seems to be umpteen copies of the handler in different modules.

I'm not sure whether or not they are all still in use and behave exactly the same.

Therein lies the problem in copying useful code into different modules/projects, rather than sticking it in one place...]

>The application lock case should be fairly simple, just $ENQ yourself against the lock, then $GETLKI to find the blocking lock.

As I mentioned, I haven't previously had to look at locking at this kind of level, but I have been doing development on VMS for almost 20yrs, so this won't be a problem once I've read the details of these 2 particular system services (I have the manuals in hard and soft copy form, don't worry!).

In all honesty, any solution I implement as an automatic "workaround" will require it to go thru Sarbanes-Oxley audit controls, but developing it myself might be easier than the pain of getting another team to approve third party (to the company's point of view, rather than to that team) software first of all.

>For RMS, you may be able to do something similar with a ROP=WAT option, perhaps even from DCL with READ/WAIT?
I've not seen the /WAIT qualifier for READ before, and it's not listed in the help library on our system. Is this only available from a particular version?

>If you can go back a step to the application design and work on the locking mechanism, perhaps implement a blocking AST?

Alas, development of the application was outsourced a long time ago, and ££££ is payable for any change, which may take a long time to be delivered (I'm not sure that the developers in the outsource company are hardcore VMSers; probably quite adept at the various languages that the application is written in, but it's not a place one would tend to associate with VMS systems to have worked on for years before winning an outsourcing contract).

They have given more of an update on the bug that they thought they'd found; they have seen it cause "a problem", but not the same problem as we are seeing.

Thinking about it from their description (a new version of the log file is created, rather than the existing one being appended to), it sounds to me like some processes are actually not using the application lock at all, and thus:

a) create a new version of the file if the file is already locked open by an offending process
b) hang on getting access to the existing file if it is already locked open.

Mark

David B Sneddon · ‎01-29-2009

Mark,

In the DBS-SCANLOCKS package there is a program
called GETLKI, it is fairly small and there
would be very little involved if you were to
look at that code then "develop" your own
version...

Dave

Ian Miller. · ‎01-29-2009

For many reasons, getting Availability Manager/AMDS installed would be a good idea, so you should start the process to gain approval.

Download from
http://h71000.www7.hp.com/openvms/products/availman/index.html

or you will find it on the VMS CDs.

____________________
Purely Personal Opinion

Hein van den Heuvel · ‎01-29-2009

>> Each process supposedly does a $ENQW for MIRROR_ON_THE_WALL, then when the lock is granted, calls LIB$GET_LUN, Fortran OPEN, Fortran WRITE, Fortran CLOSE, LIB$FREE_LUN and $DEQ.

That sounds rather lame and is a terrible construction from a resource consumption perspective. It does not scale. But as long as you do this no more than a few times per second I guess it'll work.

But the good news is that it makes your 'who is holding the lock' easy.
When you have a blocked program (phone call, or your own test), just use ANAL/SYS... SET PROC ... SHOW PROC/LOCK.... the blocked lock will be the first shown. Drill down (resource) from there.

>> Thinking about it from their description (a new version of the log file is created,

That would require one of those 'opportunistic' error handlers that tries to 'help' by creating a fresh file upon an error, with disrespect for the actual error.

>> rather than the existing one being appended to),

APPEND (sys$connect + rab$v_eof) to a relative file is a SCARY prospect. RMS will actually start at the end and read backwards looking for the first valid record. That scales even worse than repeated opens and closes.

This may well be the reason 'they' came up with the application lock hack.

There are relatively simple solutions / workarounds...

As I suggested in a reply to an other topic ... get help!

Hein.

Hoff · ‎01-29-2009

That design looks similar to some Unix code I've been working with lately; that design works acceptably on various Unix implementations but that definitely does not scale well on OpenVMS. One example of that design I worked on went from an overnight run (and somewhat flaky) to about twenty minutes.

I'd work on the SarbOx chain here both to get the tools loaded and to retrofit better and more stable I/O into this application. Potential IT jujitsu available here is the statement that these tools and these code changes are intended in aggregate to avoid SarbOx-scale events; that you're reporting an upcoming event that will adversely affect reporting, and working to avoid it.

Mark Corcoran · ‎01-29-2009

>For many reasons, getting Availability Manager/AMDS installed

My memory's vague on the subject - does AMDS require licensing? Costa Plenty?

>That sounds rather lame and is a terrible construction from a resource consumption perspective
The code was written 10+yrs ago; I've no idea of the original programmers (who no longer work for the outsource company) or their skills, but it certainly wouldn't have been the way I would have done it, and I too was shocked to say the least.

>That would require one of those 'opportunistic' error handlers that tries to 'help' by creating a fresh file upon an error
I'm not sure how it happens. In the /one/ example of the multitude of copies of the error handler that I have found, it always specifies ACCESS=APPEND in the Fortran OPEN call.

I would presume that APPEND in the Fortran OPEN calls works much the same as elsewhere - if the file doesn't exist, then create it.

I can't see why it would create a new file, if one is already in existence - unless of course, one of the myriad versions of the error handler behaves differently to the rest, which is I fear, most likely.

>APPEND (sys$connect + rab$v_eof) to a relative file is a SCARY prospect.
I think you might be getting confused with one of my other posts, Hein; in this case, it is just a "plain" "bog-standard" sequential file.

>As I suggested in a reply to an other topic ... get help!
You wouldn't be biased towards yourself, would you? ;-)

AIUI, HP are being brought in to analyse the cluster for a general performance audit; I'm not sure when this is happening, as that's way over my manager's manager's manager's budget level.

>I'd work on the SarbOx chain here
This in itself isn't a problem; the system has several teams that support different parts of it, so it's persuading other departments to find resource in order to test software on test systems which are already booked 24/7, and to budget for resource, kit & licensing costs.

Perhaps more than that, is a justification for the software in the first place, as to why it should be purchased (if relevant) - what does it do, why do we need it to do that, why this software rather than XYZ other software, how much system resource (if any) will it use up, will it interfere with day-to-day running, will it worsen performance etc.

Well, for some of the latter, you won't know until you try, because even with the best will in the world, test systems (particularly where budget constraints mean that it's not even the same hardware) will never be able to reproduce the same real-life load conditions as the live systems.

I had thought years ago that moving to $VBC would at least solve the budget issues; if anything, it's worse :-(

Volker Halle · ‎01-29-2009

Mark,

the right to use AMDS (Availability Manager) is included in the OpenVMS license. Just download, install and configure.

http://h71000.www7.hp.com/openvms/products/availman/index.html

Volker.

Hoff · ‎01-29-2009

>My memory's vague on the subject - does AMDS require licensing? Costa Plenty?

Free.

Quanto es? Gratis!

The (older) AMDS and (newer) Availability Manager tools are licensed with the clustering license in older versions, and are included with OpenVMS in later versions. IIRC, the cut-over was at V7.1.

>AIUI, HP are being brought in to analyse the cluster for a general performance audit; I'm not sure when this is happening, as that's way over my manager's manager's manager's budget level.

I'd not expect a typical cluster performance audit to look at application source code. Not unless that was included in the original discussion and original request. A good cluster audit will typically get as far as identifying the applications that are processor or memory or I/O intensive, or otherwise contributing to the aggregate load or to issues of contention. Application performance and source code profiling (and source code reviews and changes) are a different service offering and a different skill set in my experience; I certainly end up exercising different brain cells when I perform these two tasks.

Stephen Hoffman
HoffmanLabs LLC

Hein van den Heuvel · ‎01-29-2009

Ok, so this isn't going to help me, but I can't stop myself from commenting on the point assignments.

Yeah I know points-schmoints, and I do have plenty myself, but Mark seems to choose them carefully and very low. Typically a 1 or 2.

Personally I interpret a first 1 point assignment as an 'oops', and further ones as 'please go away you are wasting my time.'

Now in this case I did not go away because the subject is interesting and one I happen to know a little bit about.

My potential explanations for 0/1 point are
1) In your culture number 1 is the best!
2) Pi*^ Off
3) I'm not smart enough to understand your answer
4) It is a great answer, maybe even the best possible answer, and I appreciate the time and trouble you invested to try to help me, but it is not the answer I want to hear so 'Up Yours'
5) I'm about to run out of points. Times are tough, make do.

>> would presume that APPEND in the Fortran OPEN calls works much the same as elsewhere - if the file doesn't exist, then create it.

$ del *.tmp.*
$ open/APPEND tmp tmp.tmp
%DCL-E-OPENIN, error opening SYS$SYSDEVICE:[HEIN]TMP.TMP; as input
-RMS-E-FNF, file not found

Fortran will do whatever the STATUS option tells it to do. IMHO a serious application should not allow files to be created haphazardly, and thus STATUS='OLD' gets my vote.

>>> As I suggested in a reply to an other topic ... get help!
> You wouldn't be biased towards yourself, would you? ;-)

Sure, that would be an excellent idea, but I did not write that. There are many fine folks and companies out there, and several participating here in the ITRC forums, that can readily help.

>> AIUI, HP are being brought in to analyse the cluster for a general performance audit; I'm not sure when this is happening, as that's way over my manager's manager's manager's budget level.

Excellent! There are many fine folks (and friends) at HP that can execute on this. But it is not the only, and maybe not the best option.

I'm outa here,
Good luck!
Hein.

Hoff · ‎01-29-2009

Hein, please don't rail on the users of a dumb UI design for the inevitable results of a dumb UI design. ITRC is fundamentally broken in various ways, and the newbie design mistakes rampant within the points implementation are just one aspect of UI stinkage.

Mark Corcoran · ‎01-30-2009

>Ok, so this isn't going to help me, but I can't stop myself from commenting on the point assignments.

>Yeah I know points-schmoints, and I do have plenty myself, but Mark seems to choose them carefully and very low. Typically a 1 or 2.

>Personally I interpret a first 1 point assignment as an 'oops', and further ones as 'please go away you are wasting my time.'

>Now in this case I did not go away because the subject is interesting and one I happen to know a little bit about.

>My potential explanations for 0/1 point are

One isn't obliged to ascribe points to answers, and if one is already hard pressed for time, taking the time to ascribe points gives you even less time.

As Hoff has pointed out, the UI isn't great - if I select a value, then use the scroll button on the mouse, it alters the value - once I've selected it, I expect that to be a selected-and-I've-moved-on-from-there value.

In my case, that's not what has happened.

If someone has taken the time to respond, I think that they at least deserve a point (for some folks, I guess, points == prizes, and they're more interested in getting a "high score" than actually doing something constructive).

If the post doesn't (to me) give me any more information than I already knew (or had provided myself, or been provided by others), then I don't see the justification for say 8 points.

If someone is asking me something I've already given an answer for or repeating themselves, the same goes (e.g. as regards to installing more and more software that I've already explained the hoops I need to go through to get done).

If they are asking me a question rather than saying "Do X, and it fixes it", that doesn't necessarily mean it's of no value, but if the question points the problem in a different direction (and possible solution), then it's of more value.

I'm sorry if my point scoring isn't to your liking, but ascribing values based on "merit" is always subjective, and what you feel an answer is worth may not be the same as myself or the poster.

If it's really a big problem, you don't need to answer, or I can simply not ascribe any point values.

On a non-sniping front, I had potentially a bit of an epiphany this morning after reading an email from one of the developers.

So, I now have another question to ask (related to this whole problem)...

When you specify a resource name in $ENQ or $ENQW, is the resource name subject to logical name translation?

[I did look in the system services manual, but couldn't see any reference to it]
I don't propose to comment further on the point scoring system; if you don't like what I'm doing, then either don't respond, or I can just not bother.

Volker Halle · ‎01-30-2009

Mark,

>When you specify a resource name in $ENQ or $ENQW, is the resource name subject to logical name translation?

No.

Volker.

Mark Corcoran · ‎01-30-2009

Volker, thanks for your reply.

I think then, that I've hit the nail on the head.

Half of the modules use one resource name, half use another.

One of the resource names happens to be defined as a logical name, whose equivalence name is the same as the other resource name.
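A toy Python model makes the consequence concrete. It assumes only what Volker confirmed: resource names are compared as literal strings, with no logical name translation. The `ToyLockManager` class, the logical name LOG_LOCK, and the PIDs are hypothetical, and where this model refuses a request a real $ENQW would wait instead.

```python
class ToyLockManager:
    """Toy model: one exclusive holder per resource name, matched as a literal string."""

    def __init__(self):
        self.holders = {}  # resource name -> PID holding it exclusively

    def try_exclusive(self, resource, pid):
        """Grant the lock if nobody holds this exact name; return True on success."""
        if resource in self.holders:
            return False
        self.holders[resource] = pid
        return True

    def release(self, resource, pid):
        """Release the lock if `pid` is the current holder."""
        if self.holders.get(resource) == pid:
            del self.holders[resource]


# Hypothetical logical name table. The lock manager never consults it,
# which is exactly the point of the example.
logical_names = {"LOG_LOCK": "MIRROR_ON_THE_WALL"}

mgr = ToyLockManager()
# Module A uses the equivalence string; module B uses the logical name itself.
a_granted = mgr.try_exclusive("MIRROR_ON_THE_WALL", pid="2020011A")
b_granted = mgr.try_exclusive("LOG_LOCK", pid="2020011B")
# Module C uses the same literal name as module A and is refused.
c_granted = mgr.try_exclusive("MIRROR_ON_THE_WALL", pid="2020011C")

print(a_granted, b_granted, c_granted)  # True True False
```

Modules A and B both believe they hold the log file lock at the same time, so the two halves of the code base never exclude each other.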

I think that (along with unfortunate timing windows) is the problem, though the code must've been like that for years, and we've just been very fortunate (or nobody has made as big a fuss of the problem until now).

I'm ~2hrs late in leaving, so will get around to giving people points on Monday, along with replies to some of the other points that people raised yesterday.

Thanks *EVERYONE* for giving your time and replies.

Mark

Mark Corcoran · ‎02-02-2009

>Fortran will do whatever the STATUS option tells it to do. IMHO a serious application should not allow files to be created haphazardly, and thus STATUS='OLD' gets my vote.

In the one example of the error handler I have seen, it uses STATUS=UNKNOWN

After eventually managing to find a copy of the Digital/Compaq/HP Fortran Language Reference Manual, I see that this essentially means (because the file is opened in APPEND mode), if the file exists, it gets appended to; if the file doesn't exist, it gets created and then appended to.

Logically, this is what I would expect to want to do in these circumstances - even if LIB$FIND_FILE is callable (and I've no reason to believe it isn't), I don't see any benefit of doing an individual call to see if the file exists, and then have alternative OPEN statements depending on the result.

Particularly as this is (I would guess) in essence what the OPEN call is doing with a STATUS=UNKNOWN.

I would guess the real reason for using a STATUS=UNKNOWN in this particular code, is because there isn't the equivalent of an ERRFMT process which receives through a mailbox (or other means) messages requiring to be written to a log file - i.e. no single process that is responsible for writing all messages to the log file.

Since each individual process has the ability to write to the file separately (using the locking mechanism, which, it appears, hasn't been implemented properly), it cannot know whether or not a.n.other process has already created the file.

A single OPEN call vs 1xLIB$FIND_FILE, 1xLIB$FIND_FILE_END, and 2xOPEN, gets my vote, but ymmv.

Hein van den Heuvel · ‎02-02-2009

>> the OPEN call is doing with a STATUS=UNKNOWN.

That's a good sign. That means that the developers were aware of that option and made the conscious choice to allow the fresh create.

To do a find-file to see whether to create, or open existing, would be silly, because my point is that from my experience it is best NOT to have programs create SERIOUS files 'on the fly'.

The application should just try the open for the existing file. If it is not there, then the error handler can decide whether to fail the application or whether to provision a fresh file, possibly using FDL$CREATE, or just an open statement.
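The same policy can be sketched in Python as an analogy; this is not Fortran or RMS behaviour, and the function names and the `allow_create` switch are made up for the illustration. Opening with mode "r+" fails when the file is missing (roughly STATUS='OLD'), while mode "a" creates it (roughly STATUS='UNKNOWN' with append access).

```python
import os
import tempfile


def append_existing(path, text):
    """Append to a file that must already exist (roughly STATUS='OLD')."""
    # Mode "r+" raises FileNotFoundError rather than creating the file.
    with open(path, "r+") as handle:
        handle.seek(0, os.SEEK_END)
        handle.write(text)


def append_or_create(path, text):
    """Append, creating the file if needed (roughly STATUS='UNKNOWN')."""
    with open(path, "a") as handle:
        handle.write(text)


def log_message(path, text, allow_create=False):
    """Try the existing file first; creation is an explicit policy decision."""
    try:
        append_existing(path, text)
        return "appended"
    except FileNotFoundError:
        if not allow_create:
            raise  # fail the application instead of provisioning a file
        append_or_create(path, text)
        return "created"


workdir = tempfile.mkdtemp()
logfile = os.path.join(workdir, "dummy.log")
first = log_message(logfile, "first\n", allow_create=True)
second = log_message(logfile, "second\n")
print(first, second)
```

Keeping the create path inside the error handler means a missing log file is noticed and handled deliberately, rather than silently replaced by whichever process gets there first.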

In my world, critical files should be there, well designed with all the options RMS has for that. A language (Fortran) might not have all the desirable options under its control.

Now a quick output/report file is probably 'not serious' (It is for the application functionality, but not from a potential resource consumption).

A log file may or might not be critical.

If it may grow to have millions of records, then it should be PRE-ALLOCATED, with a reasonable EXTEND.

Relative files, which I thought we were talking about here, really should be pre-created with a deliberately chosen BUCKET SIZE and ALLOCATION, and potentially (often!) with some application header records.

Good luck,
Hein.

Way to identify which process has a lock granted in EXclusive mode?
