Re: losing ASTs rapidly

tim lloyd_1 · ‎02-26-2009

Hi,

This follows on from a problem I reported earlier which I haven’t been able to resolve. We support an IA64 system under VMS which runs a racetrack. Besides selling bets, we have to send messages to TV screens every 10s. The messages leave the host computer, head to the Ethernet, then go to our communications devices and finally to the TVs.

The problem I am getting relates to process quotas being exceeded. I have added some diagnostics to the program and I can see that the AST count is the quota in question. The count is initially 200 (which ties in with software limits) and is drained to 0 in 80s. All other quotas remain untouched.

The process in question hibernates for 10s and checks to see which of 6 TV channels to update. It then decides which individual communications devices to broadcast a message to (there may be multiple TVs connected to 1 comms device). I would expect one AST to be used for each broadcast.

I have calculated that there would be 11 devices receiving each message. The drop I see in the AST count every 10s is 11, 44, 11, 33, 11, 33, 55 = 198. When I check the activity of that process over the same period it sends out 18 messages (11 * 18 = 198).
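As a quick sanity check of those figures, here is a small sketch. The variable names are only illustrative; the numbers are the ones quoted above.

```python
# Drops in the remaining AST count, one value per 10 s interval (from the post).
drops = [11, 44, 11, 33, 11, 33, 55]
devices_per_message = 11   # comms devices receiving each broadcast
initial_quota = 200        # AST count at the start of the trace

total_used = sum(drops)
messages = total_used // devices_per_message
left_over = initial_quota - total_used

# Every drop should be a whole number of 11-device broadcasts.
whole_broadcasts = all(d % devices_per_message == 0 for d in drops)

print(total_used, messages, left_over, whole_broadcasts)  # 198 18 2 True
```

So the trace accounts for 198 of the 200 units, which fits the idea that each broadcast to a device takes one AST and that none of them are being returned.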

I have the Ethernet logs for the day and they tie in with what I have already found, i.e. every broadcast is sent in groups of 11, and I have matched this up to the comms devices.

So the behaviour of the system is consistent throughout the day, but at some point my AST count is not getting refreshed. I cannot track back and find whether the TVs were updated or not.

Is there something I am missing here where an individual process is not able to communicate with the Ethernet? I should point out that the rest of the system is ticking over just fine!

Using an analogy: after an hour or 2 in the pub the bar person has decided simply to ignore your requests while carrying on serving other customers! And you are not drunk :)

Is this a stupid question: is there a limit to how much data one process can pass to the Ethernet during its lifetime? Or a similar restriction which can only be cleared once the process is restarted? FYI when the process is restarted it carries on as normal.

Robert Gezelter · ‎02-26-2009

Tim,

In a word, there is no limit on Ethernet connectivity (if there was, one would be able to hear it without the benefit of electronic amplification). Large numbers of sites move gigabytes (or more) per day through their Ethernets without incident.

What is happening is that the AST limit for the process is being reached. When the process is terminated, all ASTs are ended. The new process starts over with a clean slate.

There are many misconceptions floating around about ASTs. Having taught many courses on their use, I can assure you that they are one of the soundest and safest mechanisms in OpenVMS (see "Introduction to AST Programming" at http://www.rlgsc.com/cets/2000/index.html ).

Guessing is dangerous. There are two ways to locate this problem:

- a review of the sources
- using ANALYZE/SYSTEM to see where all of the AST quotas have disappeared to.

It is unlikely that they have disappeared. More likely, they are being tied up with some operation that is taking a long time to expire. There are several system requests that can specify ASTs (e.g., locks, timers, IO). I would suspect that some code path related to the write cycle is doing something incorrectly and tying up pending ASTs.

Without reviewing the sources, it is difficult to analyze.

- Bob Gezelter, http://www.rlgsc.com

John Gillings · ‎02-26-2009

Tim,

Given the quota is so low, as a first cut diagnostic, I'd increase it significantly, say 4096 (I tend to use powers of 2 for quotas, probably more voodoo than science). I'd then track the process to make sure it's not a continuous leak.

It's possible the program is running faster than whenever this worked, so it's racing with the ASTs (and winning ;-).

It's also possible that there are more things consuming ASTs. As long as the process plateaus in its maximum usage of ASTs, quotas in that order of magnitude really don't matter.

A crucible of informative mistakes

tim lloyd_1 · ‎02-26-2009

Thanks for the responses. I appreciate the AST limit is low, but I would expect a max of 60-70 messages every 10s and the Ethernet trace tends to back this up.

The activity of this program should not change much during the day.

So, I am treating this as a continuous leak. I am guessing that (as per Bob's email) there is something else in the program which is causing this, and which is not evident in the logs I am reviewing.

Richard J Maher · ‎02-26-2009

Hi Tim,

Is one/some of your ASTs doing synchronous i/o? Waiting for something else like a lock in synchronous mode? Preventing other ASTs from being delivered and ending up queued instead?

Cheers Richard Maher

tim lloyd_1 · ‎02-26-2009

That's a very valid point. We do issue a $QIOW in a different section of the same program. There should be an error message logged if the directive fails, but it is a lead that is worth chasing up.

I thought I had ventured into all code sections of this system. Here is another completely new area for me.

Thanks

Robert Gezelter · ‎02-26-2009

Tim,

Do consider using ANALYZE/SYSTEM to examine the subject process and determine why the AST Limit is depleted.

In concept, there are two possibilities:

- Some operation of extended (possibly essentially unlimited) duration is tying up potential ASTs.
- AST delivery is being blocked and the IO synchronization is being resolved somewhere else (correctly or incorrectly).

Looking at the process in ANALYZE/SYSTEM will tell whether the ASTs are queued for delivery (the latter) or whether the pending count is the issue.
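To make the two cases concrete, here is a toy model in Python. It is only a sketch and not how the executive is implemented: it assumes the simplified rule that one quota unit is charged when a request specifying an AST is issued and returned when that AST is delivered. The class and function names are invented for illustration.

```python
class AstQuotaModel:
    """Toy model: one quota unit is charged per request, returned on delivery."""

    def __init__(self, limit):
        self.remaining = limit        # analogous to the remaining AST count
        self.in_progress = 0          # issued, operation not yet complete
        self.queued = 0               # complete, AST awaiting delivery
        self.delivery_enabled = True  # False models blocked AST delivery

    def issue(self):
        if self.remaining == 0:
            raise RuntimeError("AST quota exhausted")
        self.remaining -= 1
        self.in_progress += 1

    def complete_one(self):
        """The operation finishes; its AST is queued to the process."""
        if self.in_progress:
            self.in_progress -= 1
            self.queued += 1
        self.deliver()

    def deliver(self):
        """Deliver queued ASTs, returning quota, unless delivery is blocked."""
        if self.delivery_enabled:
            self.remaining += self.queued
            self.queued = 0


def run(model, requests, complete):
    for _ in range(requests):
        model.issue()
        if complete:
            model.complete_one()
    return model.remaining, model.in_progress, model.queued


# Case 1: operations never complete, so quota is tied up in pending requests.
stuck = run(AstQuotaModel(200), 198, complete=False)

# Case 2: operations complete but delivery is blocked, so ASTs pile up queued.
blocked_model = AstQuotaModel(200)
blocked_model.delivery_enabled = False
blocked = run(blocked_model, 198, complete=True)

# Healthy case: every AST is delivered and its quota unit is returned.
healthy = run(AstQuotaModel(200), 198, complete=True)

print(stuck, blocked, healthy)
```

In both failure cases the remaining count ends up the same; what differs is where the missing units are sitting, which is the distinction the SDA inspection is meant to expose.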

Admittedly, this does not resolve where in the code the problem is. However, it does tell you the (first) problem you are looking for.

I would not necessarily stop with the first problem. A review of the code is in order. In my experience, if AST management is not set up correctly in one aspect, the odds are not insignificant that other aspects have not been done properly either.

The concern "if this is wrong, it is not likely the only aspect" is not specific to ASTs. If one finds a piece of code with odd pointer problems, it is not likely a single occurrence.

One common practice that I proscribe is the use of SYS$QIOW and other [W] (synchronous) forms of system services from within code that is called at AST level. It can be done, but it is most often an accident waiting to happen.

- Bob Gezelter, http://www.rlgsc.com

Robert Gezelter · ‎02-26-2009

Tim,

Just a thought.

Was this code ported to Itanium recently?

If so, then be on the lookout for timing presumptions that may no longer be valid. The Itanium platforms are faster, as well as often multi-processor.

Thus, a gimlet eye towards the re-use of data structures is in order. Even if the code is single-threaded (one kernel thread), it may be possible for QIO to complete very fast (since the processing of the actual IO could take place on the other CPU). This can cause an unrealized latent race condition to manifest.

- Bob Gezelter, http://www.rlgsc.com

Richard J Maher · ‎02-26-2009

And the other side of the coin where you're doing *A*synchronous stuff but some condition is causing you to repeat it many times within a single AST invocation.

So vague as to be next to useless I know, but seeing as how you're firing a shotgun anyway :-)

Cheers Richard

Jur van der Burg · ‎02-27-2009

Another thing to look for is that if you issue a $qio[w] you should not only check the return status from the qio but also the status in the iosb (you DO specify an iosb, don't you?). Failure to do so is a frequent cause of weird problems.
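The shape of that double check, sketched in Python rather than in the application's own language. The function names are invented for illustration; the only OpenVMS facts relied on are that success statuses have the low bit set and that SS$_NORMAL is 1.

```python
SS_NORMAL = 1  # SS$_NORMAL; OpenVMS success statuses have the low bit set
FAILED = 44    # illustrative even value standing in for any failure status


def is_success(status):
    """OpenVMS convention: an odd condition value means success."""
    return (status & 1) == 1


def qio_ok(return_status, iosb_status):
    """Check both the service return status and the IOSB status.

    The return status only says the request was accepted; the IOSB
    status says how the I/O itself finished.
    """
    return is_success(return_status) and is_success(iosb_status)


accepted_and_done = qio_ok(SS_NORMAL, SS_NORMAL)
accepted_but_failed = qio_ok(SS_NORMAL, FAILED)   # missed if only one is checked
rejected = qio_ok(FAILED, SS_NORMAL)

print(accepted_and_done, accepted_but_failed, rejected)
```

The middle case is the one that gets missed when only the return status is tested: the request was queued successfully, but the I/O itself did not complete normally.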

Jur.

Volker Halle · ‎02-27-2009

Tim,

Did this process ever work correctly on OpenVMS I64? When did it start to behave like this? What was done to the system prior to the first 'failure'?

Volker.

Volker Halle · ‎02-27-2009

Tim,

Consider using ANAL/SYS and providing the following information from this process once it has 'lost' a couple of ASTs:

$ ANAL/SYS
SDA> READ SYSDEF
SDA> SET PROC/ID=
SDA> SHOW PROC
SDA> SHOW PROC/PHD
SDA> FORM PCB
SDA> EXIT

Collect the output into a .TXT file and attach it to your next reply.

Volker.

tim lloyd_1 · ‎03-01-2009

Thanks for all the advice, folks. This system was ported to IA64 and went live in January 2007. The system worked fine until we dropped on a new software release in April 2008.

The problem I face is that this issue occurs sporadically (on average about once a month; the system is up 7 days a week). Due to the nature of the application I cannot interrogate the problem when it occurs.

I am going down the IOSB path. In other places we do use this facility. I can't see why we don't do it here, in one of the most critical parts of the whole system!

Cheers

Volker Halle · ‎03-01-2009

Tim,

Does the process crash if the problem happens? Or does it just hang? I assume you have mechanisms in place to restart the process if the problem happens.

If it crashes, run it with /DUMP or issue a SET PROC/DUMP before starting the image.

If it hangs, include a SET PROC/DUMP=NOW before stopping and re-starting the process.

If it just issues an error message and exits by itself, call a LIB$STOP(SS$_IMGDMP), this will force an image dump.

You can then do the analysis offline in the image dump with ANAL/PROC. Most of the process-related system data is also available in the image dump.

Do you disable AST delivery somewhere in the application ? And not re-enable it ?

Volker.

tim lloyd_1 · ‎03-01-2009

Volker, the program hangs rather than crashing. I will integrate the suggestions you make.

I had not thought about disabling ASTs. Obviously this is not intentional but possible. Any ideas how I would do this?

John Gillings · ‎03-01-2009

Tim,

>So, I am treating this as a continuous leak

I'd still recommend testing your program with a higher limit, just to make sure you're not experiencing a spike in load. It's unlikely to cause any resource problems, and you may find your program recovers itself.

Instead of assuming it's a leak, make sure!

A crucible of informative mistakes

Robert Gezelter · ‎03-01-2009

Tim,

Concerning Disabling ASTs.

In short, read the code. Also, scan the code base for references to $SETAST or SYS$SETAST.

Obviously, also check the routines which call the routines which invoke those routines, particularly error paths.

I make several recommendations about how to do AST programming with a fair degree of safety in my DECUS presentation [mentioned earlier in this thread].

One good rule: Always use an IOSB that cannot be recycled before the AST is processed AND never use event flags in conjunction with ASTs.

Another good rule is to include a logic check in the program to ensure that a buffer/IOSB combination is not recycled while it has a pending operation. Such a logic check often identifies an incorrect set of logic in the program long before the evidence is disturbed.
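A minimal sketch of such a logic check, written in Python for brevity. The class and method names are hypothetical and not taken from any application; the point is only the bookkeeping: mark a buffer/IOSB slot as pending when the operation is issued, clear it in the completion routine, and fail loudly on any reuse in between.

```python
class IosbPool:
    """Tracks buffer/IOSB slots and refuses to reuse one that is pending."""

    def __init__(self, size):
        self.pending = [False] * size

    def start(self, slot):
        # Logic check: a slot with an outstanding operation must not be reused.
        if self.pending[slot]:
            raise RuntimeError(f"slot {slot} reused while operation pending")
        self.pending[slot] = True

    def complete(self, slot):
        # Called from the completion routine once the IOSB has been examined.
        if not self.pending[slot]:
            raise RuntimeError(f"slot {slot} completed but was not pending")
        self.pending[slot] = False

    def outstanding(self):
        return sum(self.pending)


pool = IosbPool(4)
pool.start(0)
pool.start(1)
pool.complete(0)
pool.start(0)            # fine: the earlier operation on slot 0 has completed

caught = None
try:
    pool.start(1)        # bug: slot 1 still has an operation pending
except RuntimeError as err:
    caught = str(err)

print(caught, pool.outstanding())
```

A count of outstanding slots that only ever grows is also a cheap early indicator of the kind of leak being discussed in this thread.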

- Bob Gezelter, http://www.rlgsc.com

GuentherF · ‎03-02-2009

What about the AST queue hanging off the PCB when the process hangs?

SDA> READ SYSDEF
SDA> SHOW SUMMARY ! to get the process index
SDA> SET PROCESS/INDEX=...
SDA> VALIDATE QUEUE PCB+PCB$L_ASTQFL_U
SDA> VALIDATE QUEUE PCB+PCB$L_ASTQFL_E

And a couple of..
SDA> FORMAT PCB+PCB$L_ASTQFL_U
SDA> FORMAT @.

Also...
SDA> SHOW CALL
SDA> SHOW CALL/NEXT

Get a linker map of the program image and find where the PCs are in the source code.

This is all under the assumption that the process is indeed running out of ASTLM.

/Guenther

tim lloyd_1 · ‎03-02-2009

Hi All, Robert's paper on ASTs talks about "access modes", e.g. "queueing by access mode", and then 5 queues: special kernel, kernel, executive, supervisor, user.

My system has about 35 sub processes hanging off a main process. These sub processes have varying priorities from 4 to 15.

Basically, what I am trying to establish is this: if one of these processes calls $SETAST to halt ASTs, does that affect the whole bunch rather than just the process itself?

Volker Halle · ‎03-02-2009

Tim,

AST quota is NOT a pooled quota. It is PER PROCESS not PER JOB.

You said that 'the program hangs'. What is the state of this or these processes as reported by SHOW SYSTEM/PROC=xxx ?

As this problem shows up only very intermittently, capturing a process dump is the most important work item. Then you can check and answer all the questions about where the outstanding ASTs may be pending, whether ASTs are disabled, etc.

Volker.

Richard J Maher · ‎03-03-2009

Hi Tim,

Do you really mean priorities 4 to *15*? How many CPUs have you got?

sys$setast allows you to explicitly disable/enable ASTs. You also implicitly disable ASTs while you're in an AST at the same access mode. Most of your ASTs will be in User Mode, although RMS, Rdb, Tier3 etc. operate mainly in Exec mode.

So if you've got a pyramid game where one AST generates more than one additional AST (such as itself) per invocation, then you can run out of quota pretty quickly.
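A rough illustration of how quickly that goes, as a Python toy model. It assumes a simplified accounting rule (delivering an AST returns one quota unit, each request it issues charges one), and the function name is invented.

```python
def deliveries_until_exhausted(limit, requests_per_ast):
    """Count AST deliveries before a request fails for lack of quota.

    Toy rule: each delivery returns one unit, then the AST routine issues
    `requests_per_ast` new requests, each charging one unit. Starts with a
    single request outstanding.
    """
    if requests_per_ast <= 1:
        return None                # returns as much as it charges: steady state
    remaining = limit - 1          # one request is already outstanding
    deliveries = 0
    while True:
        remaining += 1             # an AST is delivered, its unit is returned
        deliveries += 1
        if requests_per_ast > remaining:
            return deliveries      # cannot issue all the requests: quota gone
        remaining -= requests_per_ast


steady = deliveries_until_exhausted(200, 1)
doubling = deliveries_until_exhausted(200, 2)
fan_out = deliveries_until_exhausted(200, 12)

print(steady, doubling, fan_out)
```

With a limit of 200, even one extra request per invocation exhausts the quota after a couple of hundred deliveries, and a larger fan-out gets there in a handful.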

OTOH if a server is unable to respond to an AST 'cos it can't get a time slice at its curr priority then that might clog things up as well.

Not much help, but there's not a lot of hard evidence and with that architecture many, many things could go wrong - sorry.

Perhaps it's time to get someone in?

Cheers Richard Maher

Robert Gezelter · ‎03-03-2009

Tim,

I am glad that the session notes were helpful. They are not a substitute for being there. They are an attempt to summarize good rules that I have used for many years in using AST-based mechanisms successfully, and resolving problems with existing uses of ASTs.

As Volker has already noted, AST quotas are definitely per process (a sub process has its own process control block).

There is no pooling between processes in a job or group. For completeness, I will note that it is possible to run out of non-paged pool, but with today's pool sizes that is extraordinarily improbable (but must be mentioned for completeness).

I mentioned previously that there are two paths here, and it remains so:
- get a dump at the point of failure; and
- analyze the logic.

These are not exclusive forks. The process dump tells one precisely what happened, to wit the proximal cause of the failure. If nothing else, it can rule out certain possibilities. A careful code review is necessary in any event. Unless the cause turns out to be a small case of insufficient quota, where a code base has one error in AST management, it is not unlikely that there are more lurking about.

- Bob Gezelter, http://www.rlgsc.com
