
Mark Corcoran
Frequent Advisor

Delete-pending Global Section with erroneous non-zero Reference Count?

Whilst trying to get the debugger & heap analyzer working on the development cluster to get to the bottom of a memory leak issue (see my other thread), I've found a problem as per the thread subject header.

When I initially tried to run up the executable (in non-debug mode), it terminated, complaining it couldn't map to a particular Global Section.

I logged on to the production cluster, to ensure that such a Global Section did in fact exist.

Using INSTALL LIST /GLOBAL /FULL, I could see that not only did it exist, but it actually existed twice, with the second instance being listed under the "Delete Pending Global Sections".

I was concerned that some process(es) might still be using the old Global Section, and thus out-of-date data.

Having researched the fact that there's no backtracking of Global Section to mappers, I tracked this the long way around, by doing:

$ ANA /SYS
SDA> SET OUTPUT A.A
SDA> SHOW GST
SDA> SET OUTPUT B.B
SDA> SHOW PROCESS ALL /PST
SDA> ^Z

Take the decimal PGLTCNT value reported by INSTALL LIST, and convert it to hex.

Search A.A for the named Global Section, where the Pagelets column value matches the hex value obtained above.

Now take the GPTE Addr column value from the same line, and search B.B for it, then scroll backwards as far as the top of the FormFeed-separated page, to obtain the Process ID.
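The decimal-to-hex step and the search of A.A could be scripted rather than done by eye. A minimal Python sketch, assuming the SDA output was saved as plain text and that values appear as 8-digit zero-padded hex. The section name, the sample lines and their column layout are invented for illustration and are not real SHOW GST output.

```python
def to_hex(decimal_value: int) -> str:
    """Return the value as 8-digit uppercase hex (assumed display width)."""
    return format(decimal_value, "08X")


def find_lines(lines, section_name, pagelets_decimal):
    """Return lines that mention both the section name and the hex pagelet count."""
    needle = to_hex(pagelets_decimal)
    return [ln for ln in lines if section_name in ln and needle in ln.upper()]


# Hypothetical stand-in for lines read from A.A; not real SDA output.
sample = [
    "MY_GSEC  00000400  81234560",
    "MY_GSEC  00000800  81234A00",
]

# 2048 pagelets (decimal, as INSTALL LIST would report it) is 800 in hex.
matches = find_lines(sample, "MY_GSEC", 2048)
print(matches)
```

The same helper could then be pointed at B.B with the GPTE address as the search string.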

Unfortunately, in my case, whilst it does show an internal PID, the Extended PID is set to 00000000, and the Process Name is "--".

Just to be sure, I check for the process in SDA using /ID=internal_PID, but alas, it's not there.

I'm not sure if the Extended PID of 00000000 is indicative of it having been created during system startup, rather than someone manually creating the Global Section.

I've checked the accounting files, but no Extended PID ending in the same internal PID exists as far back as the accounting records go.

I've noticed that the command file which creates and deletes the Global Section just does a RUN image.EXE (but I'm not sure how it creates the sections - the (offshore, time-difference) developers have left for the day, and the source is a bit Byzantine to navigate).

What I have noticed however, is that the Reference Count for this delete-pending Global Section, is 10676.

As I understand it, this is incremented whenever a process maps to the Global Section, and that theoretically, this value means that there are 10676 processes mapped to it.

This is fairly unlikely - MAXPROCESSCNT is set at 3200, but I would have expected the node to have run out of resources and collapsed way before it got as high as that anyway.

[In any case, the SHOW PROCESS ALL /PST only indicated one process as having an entry pointing at the delete-pending Global Section]

By definition, when a process terminates, it will no longer be mapping the Global Section, but under what circumstances is the reference count decremented?

I've not been able to readily find any information on the mechanism by which the count is altered, e.g. is it use of $DELTVA that decrements it, or is it solely image rundown?

[I could quite believe that the various processes that map to the Global Section don't bother to do $DELTVA, and just expect it to be taken care of during image rundown.]

What may be of more concern, is how they map to the Global Section in the first place - absence of an ident parameter to $MGBLSC will result in it defaulting to SEC$K_MATALL, to match all versions of the Global Section.

Could this mean that $MGBLSC might be permitted to map to the old (delete-pending) Global Section?

[Assuming that there's no programmer-defined byte/word in the Gsec which indicates that the Gsec has been invalidated, and that the code should examine and act upon it]

Given that it appears that there are no processes mapping the delete-pending Gsec, how (aside from a reboot) can the reference count be decremented to zero, to allow the section to be deleted?

Thoughts on the back of a posting...


Mark
H.Becker
Honored Contributor
Solution

Re: Delete-pending Global Section with erroneous non-zero Reference Count?

What's the VMS version (and if that isn't unique, what's the platform)?

No process/program can map to a global section on the delete-pending list. A section is put on this list by a $dgblsc. A second section with the same name is put on the normal list by a $crmpsc while the named section is on the delete-pending list. All subsequent $mgblsc calls will map to the one on the normal list.
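The behaviour described above can be pictured with a toy model. This is a Python sketch of the bookkeeping only, not of how VMS implements it: the class and method names are invented, and each method merely stands in for the system service named in its comment.

```python
class Section:
    def __init__(self, name):
        self.name = name
        self.refcnt = 0


class SectionTable:
    """Toy model: a normal list keyed by name, plus a delete-pending list."""

    def __init__(self):
        self.normal = {}
        self.delete_pending = []

    def create(self, name):
        # Stands in for $CRMPSC: creates a section on the normal list.
        if name not in self.normal:
            self.normal[name] = Section(name)
        return self.normal[name]

    def delete(self, name):
        # Stands in for $DGBLSC: a section that is still mapped moves to
        # the delete-pending list; an unmapped one simply goes away.
        sec = self.normal.pop(name)
        if sec.refcnt > 0:
            self.delete_pending.append(sec)

    def map(self, name):
        # Stands in for $MGBLSC: only the normal list is searched.
        sec = self.normal[name]
        sec.refcnt += 1
        return sec


table = SectionTable()
old = table.create("DATA")
table.map("DATA")            # a long-running server maps the old section
table.delete("DATA")         # old section becomes delete-pending
new = table.create("DATA")   # same name, fresh section on the normal list
mapped = table.map("DATA")   # later mappers get the new section
print(mapped is new, old in table.delete_pending, old.refcnt)
```

In this model the old section stays on the delete-pending list with its one reference until that mapper lets go of it, while every later map call resolves to the new section.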

From a VMS viewpoint, other than wasting some resources, there is no problem with a section on the delete-pending list.

I can't follow your analysis. A global section will not show in the PST. There was a thread in this forum that showed the SDA commands for finding out which process(es) have mapped a section on the delete-pending list.

The high reference count is odd. It is the count of successful mappings to the section. Usually it is the number of processes which mapped it.

If the count is incorrectly maintained, for example in a race condition where a mapping and an unmapping happen at the same time, then there is a bug in your VMS version.

Is the system at the current (SYS) ECO level?
John Gillings
Honored Contributor

Re: Delete-pending Global Section with erroneous non-zero Reference Count?

Mark,

>As I understand it, this is incremented
>whenever a process maps to the Global
>Section, and that theoretically, this value
>means that there are 10676 processes mapped
>to it.

I don't think this is the correct interpretation. I believe it's the number of page references into the section from all processes (actually pageLET). So, you could have anything from 10676 processes, each of which has a single page mapped, up to, if the section is that large, a single process mapping the whole section, with anything in between.
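To put numbers on that interpretation, here is a small Python sketch that lists the combinations of process count and pagelets per process that would add up to a given reference count. It assumes every process maps the same number of pagelets, and the section size of 5338 pagelets is a made-up figure, since the real size is not given in this thread.

```python
def mapper_scenarios(refcnt, section_pagelets):
    """Return (processes, pagelets_each) pairs that total refcnt exactly,
    assuming each process maps the same number of pagelets."""
    return [
        (refcnt // each, each)
        for each in range(1, min(refcnt, section_pagelets) + 1)
        if refcnt % each == 0
    ]


# 10676 is the reference count from the original post;
# 5338 is a hypothetical section size in pagelets.
scenarios = mapper_scenarios(10676, 5338)
print(scenarios[0])    # many processes, one pagelet each
print(scenarios[-1])   # few processes, whole section each
```

With those assumed figures the extremes are 10676 processes with one pagelet each and 2 processes mapping the whole section, which is the range of possibilities described above.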

>Given that it appears that there are no
>processes mapping the delete-pending Gsec,

You have a non-zero reference count, therefore there are process(es) which have the section mapped. Use the SDA magic spell Hartmut referred to, to find them. Once the reference count gets to zero, the section will be deleted.
A crucible of informative mistakes
Murali L.R.
Advisor

Re: Delete-pending Global Section with erroneous non-zero Reference Count?

Mark,

>>I've not been able to readily find any information on the mechanism
>>by which the count is altered, e.g. is it use of $DELTVA that decrements
>>it, or is it solely image rundown?

Both image rundown and $DELTVA can alter the reference count.
An implicit $DELTVA call is made when an image exits. If other
processes still have outstanding references to the global section,
the REFCNT will be positive and the deletion will not happen.
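A rough Python sketch of that rule, for illustration only. It counts one reference per mapping for simplicity (the real count may be kept per page), and the class and method names are invented; each method only stands in for the mechanism named in its comment.

```python
class Section:
    def __init__(self, name):
        self.name = name
        self.refcnt = 0
        self.delete_pending = False
        self.deleted = False

    def release(self):
        # Drop one reference; a delete-pending section goes away at zero.
        self.refcnt -= 1
        if self.refcnt == 0 and self.delete_pending:
            self.deleted = True


class Process:
    def __init__(self):
        self.mapped = []

    def map(self, sec):
        # Stands in for $MGBLSC.
        sec.refcnt += 1
        self.mapped.append(sec)

    def unmap(self, sec):
        # Stands in for an explicit $DELTVA.
        self.mapped.remove(sec)
        sec.release()

    def image_rundown(self):
        # Stands in for the implicit unmapping done when the image exits.
        for sec in list(self.mapped):
            self.unmap(sec)


sec = Section("DATA")
a, b = Process(), Process()
a.map(sec)
b.map(sec)
sec.delete_pending = True    # as if $DGBLSC had been called

a.unmap(sec)                 # explicit unmap: one reference remains
still_there = not sec.deleted
b.image_rundown()            # last mapper exits: section is deleted
print(still_there, sec.deleted, sec.refcnt)
```

Either path drops the count, but the section only disappears once the last mapper has let go.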

>>Could this mean that $MGBLSC might be permitted to map to the
>> old (delete-pending) Global Section?
No, it's not possible.

As Hartmut said, knowing the VMS version and patch level will help.

Regards,
Murali
Ian Miller.
Honored Contributor

Re: Delete-pending Global Section with erroneous non-zero Reference Count?

I had a go at listing global sections and the processes - see GBLSEC$SDA at

http://eisner.encompasserve.org/~miller/

it does try for delete pending ones too :-)
SDA> GBLSEC section-name /DELETE_PENDING
____________________
Purely Personal Opinion
Mark Corcoran
Frequent Advisor

Re: Delete-pending Global Section with erroneous non-zero Reference Count?

I think my efforts to provide information here, were based on clutching at straws, and trying to find addresses that appeared to match from different SDA commands :$

Following Hartmut's suggestion, I've done perhaps what I should have done initially, and looked a bit further for information on how to obtain details of processes which are mapped to the Global Section.

I did find a thread in the System Management forum, which discussed this, but the initial SDA commands were more for a file-backed Gsec with a WCB.

However, later in the thread, Volker Halle indicated the commands required:

SDA> SHOW GSD /DEL
SDA> SHOW PROC /PAGE /GSTX=gstx_value

As Hoff has alluded to in a comp.os.vms posting on 07-JAN-2000, "this page table search is a very slow process even on the best of days".

It has already found a number of processes which are still mapped to the section.

It does appear to be hung at the moment - ^Y not having any effect, ^T not outputting anything either, and SHOW PROC /CONT of the process from another session shows it to be in a HIB state, with no CPU time incrementing.

I think I read something about this search causing outswapped processes to be swapped back in, which may not be helping, so I've just killed it.

However, it has identified a series of server processes (used for spreading load).

Checking the process creation time of these, it's clear that they were restarted the last time we had a code release (which is good).

However, we had a data release about 2 days later, and looking at the install instructions, of the two types of each server group, only one of the group types is requested to refresh their data.

Quite what that means from an application perspective, I don't know, but it clearly doesn't involve unmapping and remapping the global section which gets recreated a few steps prior to this.

I'll need to go back to the supplier, to see what these servers use the global section for (they may unnecessarily still map it, even though the code no longer uses it, I don't know).

It may be that whatever the data changes were (that increased the size of the global section, and possibly modified some of the other contents), were to data that are not used by these servers.

I'll endeavour to look harder next time before posting, apologies.
It's also not clear what
Mark Corcoran
Frequent Advisor

Re: Delete-pending Global Section with erroneous non-zero Reference Count?

Oops, an extraneous "It's also not clear what" at the end of the last posting.

What I should also say is that what John said:

I don't think this is the correct interpretation. I believe it's the number of page references into the section from all processes (actually pageLET). So, you could have anything from 10676 processes, each of which has a single page mapped, up to, if the section is that large, a single process mapping the whole section, with anything in between.

appears to be correct...

Although SDA hung whilst I was trying to list the processes connected to the Global Section, so I didn't find ALL of them, I suspect that the (hopefully) limited numbers that are connected to the delete-pending section account for the pagelet reference count, based on the size of the section, and the likely number of these processes, so thanks John, for that.
Hoff
Honored Contributor

Re: Delete-pending Global Section with erroneous non-zero Reference Count?

Are y'all mixing system services and INSTALL commands on the same sections?
Mark Corcoran
Frequent Advisor

Re: Delete-pending Global Section with erroneous non-zero Reference Count?

>Are y'all mixing system services and INSTALL commands on the same sections?

The Global Section in question is deleted and created by an executable image (which presumably determines the size of the section and populates it, based on the number of rows in table X in the database), so I would guess it is using $CRMPSC.

I don't think we have file-backed Global Sections, so I would guess INSTALL is out of the question (other than for INSTALL LIST).

If there were a mixture, would there be a caveat to be aware of in the unlikely event I come across it in the future?
Hoff
Honored Contributor

Re: Delete-pending Global Section with erroneous non-zero Reference Count?

> ... so I would guess...

> ...I don't think we have...

Um, OK.

Do call in somebody to investigate and to debug the application code. If that's you, then crack open the DCL procedures and the source code files and tools such as the OpenVMS debugger, and start digging around in the code.

I'd start the debugging with the quantification of the error, and with investigations of the design of the memory management, and the arguments and the error paths from the memory and section-related services.

Whether or not the application code is the proximate trigger here, the source code is usually the first best suspect, until the bug is proved to lurk elsewhere.

There were updates to the shared memory chapter in the programming concepts manual a while back (V7.3-2?, V8.2?), so have a read through that if the general topic area of shared memory and synchronization is unfamiliar.

And yes, mixing INSTALL and the global section services on the same ranges of virtual memory could trigger untoward behavior.