
$WAITFR behaviour

 
SOLVED
Barry Alford
Frequent Advisor

$WAITFR behaviour

I am trying to synchronize two processes using event flags. One process (the Master) runs a wait loop, using one event flag with a timer ($SETIMR). Once the timer event flag is set, it is cleared and a second event flag is set at the end of the loop. This second flag is then cleared at the top of the loop when the timer is reset:

repeat
$SETIMR(timerFlag, cycleTime)
$CLREF(eventFlag)
$WAITFR(timerFlag)
$CLREF(timerFlag)
$SETEF(eventFlag)

This is intended to form an "escape mechanism" - the two flags are never both true.

A slave process implements this loop:

repeat
$WAITFR(timerFlag)
$WAITFR(eventFlag)

This is intended to keep the slave in step with the master process, and relies on $WAITFR having this (documented) behaviour:
"$WAITFR - Tests a specific event flag and returns immediately if the flag is set; otherwise, the process is placed in a wait state until the event flag is set."

I took that to mean that when the flag is set, the process will definitely be woken up, run, and see the flag as set. However, I do not see this happening; it seems that, because the flags are only set for a short period, these events are "lost" and the slave processes run very erratically (>1 in 10 master cycles), as if the flags are being polled by the processes rather than being triggered by an event from the OS.
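
For reference, in C the two loops would look roughly like this (a sketch only: the flag numbers, the cluster name and the 20 ms cycle time are illustrative, and all error checking is omitted):

    #include <descrip.h>
    #include <starlet.h>

    #define TIMER_EFN 69                  /* common event flags (cluster 2) */
    #define EVENT_EFN 70

    static $DESCRIPTOR(cluster_name, "TIMER");
    static $DESCRIPTOR(cycle_ascii, "0 00:00:00.02");   /* 20 ms delta time */

    int main(void)
    {
        long long cycle_time;             /* binary delta-time quadword */

        sys$ascefc(TIMER_EFN, &cluster_name, 0, 0);      /* map the common cluster */
        sys$bintim(&cycle_ascii, (void *) &cycle_time);  /* ASCII delta -> binary  */

        for (;;)                          /* master loop */
        {
            sys$setimr(TIMER_EFN, (void *) &cycle_time, 0, 0, 0);
            sys$clref(EVENT_EFN);
            sys$waitfr(TIMER_EFN);
            sys$clref(TIMER_EFN);
            sys$setef(EVENT_EFN);
        }
    }

    /* The slave, after the same sys$ascefc() call, just loops:
           sys$waitfr(TIMER_EFN);
           sys$waitfr(EVENT_EFN);                                          */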

Have I missed something here?
27 REPLIES
Richard Whalen
Honored Contributor

Re: $WAITFR behaviour

Assuming that timerFlag and eventFlag are in a common efn cluster, then I would expect this to work. If you find that it is erratic, then I suspect there is a bug in the version of VMS that you are using.

What version of VMS are you using?
Are all the patches installed?
Is this a multiprocessor system?
Hein van den Heuvel
Honored Contributor

Re: $WAITFR behaviour

May we assume these are some sort of common event flags?
Is this a new design? Testing on a multi-cpu system?

>> Have I missed something here?

Yes, you missed a glaring timing window.
Once the timer flag is set, the slave is officially runnable.
Its waitfr on the timer flag is done, and the waitfr on the other flag is about to be requested.
But in the meantime the master sets the other flag, arms the timer, clears the other flag again and waits for the timer.
The scheduler now starts working on the slave.
It finally actually executes the waitfr on the other flag, which at this point is cleared, so it really sits waiting through the next timer cycle.
One cycle missed!

KISS!

Even one event flag is generally hard enough to deal with already.
Be sure to check out $HIBER / $WAKE instead.
Unlike event flags, pending wakes are remembered.
Much easier!
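
A minimal sketch of that shape (assuming the master has already learned the slave's PID somehow):

    #include <starlet.h>

    /* Slave: sleep until woken. A wake issued while we are still busy is
       remembered, so the next sys$hiber() returns immediately.            */
    void slave_loop(void)
    {
        for (;;)
        {
            sys$hiber();
            /* one cycle of work goes here */
        }
    }

    /* Master: once per timer cycle, wake each registered slave by PID.    */
    void master_tick(unsigned int slave_pid)
    {
        sys$wake(&slave_pid, 0);
    }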

Good luck,
Hein.


Barry Alford
Frequent Advisor

Re: $WAITFR behaviour

I have tried it on v7.2-1 and v7.3-2 (both on a DS10, single processor) so far and am about to try v8.1 (PersonalAlpha). All are pretty much unpatched :-o. (I am reluctant to believe that something as fundamental as this would be broken!)

I am using a common event cluster:
[in Fortran]
$ASCEFC(%VAL(iEflag), %DESCR("TIMER"), %VAL(0), %VAL(0))
...for both flags (69 & 70 in fact)

Hein, I see your point but I think that would make the slave miss at most one firing of the eventFlag. Consider the flags are doors into and out of a room - once the slave enters the room (on the timerFlag) it's waiting for the exit to open (on the eventFlag). Meanwhile, the master has opened the exit many times but the slave doesn't come out!

The problem with using $HIBER/$WAKE is that slaves will have to register with the master to get woken up; I wanted to keep things more ad hoc...

(The processes will, in fact, map to a shared region of memory, but I wanted to find a general algorithm. Back to my old college text books and refresh my hazy memories of p & v and co-routines?)
Hein van den Heuvel
Honored Contributor

Re: $WAITFR behaviour

Yabut... the one missed 'other wait' is but the beginning. The slave will also miss a timer event, while waiting for the other flag. So that's 2!

Be happy that you tested this on a single CPU system. You might not have found the design problem on a multi-CPU system until way too late, but it would have been equally broken!

>> (I am reluctant to believe that something as fundamental as this would be broken!)

Ah, give yourself 8 points!
That would have been 10 points if you had written 'refuse to believe'.
[Yeah, I know you can not give points to yourself]

When reading the base topic, I half expected to read 'waitfr is broken', and was pleased to see that was not claimed but replaced by a 'have I missed something'. Excellent.
Now I see it was fully intentional, and it pleases me.
There are too many daft individuals out there who think their first dabbles must have uncovered a major flaw in fundamental stuff. Not!

Cheers,
Hein.
Barry Alford
Frequent Advisor

Re: $WAITFR behaviour

Ah Hein! I'm not awarding any points just yet!

Well, well! When I monitor the slave process with:
$ SHOW/PROC/CONT/ID=
.. it all works _perfectly_! Stop monitoring, and it all goes sticky again.

How d'you like them apples? :-)
Hoff
Honored Contributor

Re: $WAITFR behaviour

Have you missed something? Yes, you've missed that event flags are an OpenVMS analog of "die Lorelei", or of Homer's Sirens. A construct that serves to lure unsuspecting programmers onto the rocks of pain and suffering. By sheer coincidence, I posted up a similar statement to this one -- and a description of why you're headed for the rocks -- just last night.

http://64.223.189.234/node/613

Event flags only look simple. They can get to be very nasty, this in terms of spurious triggers, problems with scaling, limits on the numbers of parallel events, and otherwise.

With no details on the application, I might tend to use locks and potentially lock value blocks here. Mayhap shared memory. I'd work to keep time with the master, whatever that means here -- and some details and some background on the application synchronization requirements would be useful.

Stephen Hoffman
HoffmanLabs LLC
Barry Alford
Frequent Advisor

Re: $WAITFR behaviour

Thanks, Hoff, for the warning. I will look up that link tonight (restricted access here).

All I asked was for clarification of how $WAITFR worked; it seems from your words and me scraping my ship on the rocks that the Land of Event Flags is not the place for me!

The application does simulation of various machines; currently each machine is processed serially in a time step. This makes changes a problem, in that the whole app has to be rebuilt. We have toyed with shared libraries and late binding, and my aim with this exercise is to experiment with multiple processes, each simulating one machine, but running in step with each other in time. (Did someone say "threads" out there?)

Anyhow, I will now try Plan B: use the master timer to wake up processes, then a cycle number in shared memory to ensure only one processing step per master cycle.
John Gillings
Honored Contributor

Re: $WAITFR behaviour

Barry,

I agree with Hein and Hoff. Event flags are very hard to get right. They tend to have nasty timing windows, and because there are so few of them they often get overloaded, so you have to deal with spurious wakeups.

Consider using a pair of locks, maybe called TICK and TOCK. You can lock step your processes by cycling the locks converting to EX then NL in sequence. Put your cycle number in the lock value block.

Now, since your locks are exclusive to the specific pair of processes you avoid any logic for spurious "wakes", and you're guaranteed handshaking. Moreover, since they're locks, the mechanism will work across a cluster (and it can be scaled up to multiple processes fairly easily - just add another TICK for each slave). With some extra logic on the lock value block, you could also build in a way to monitor the presence of the other process.

Master
repeat
$ENQ TOCK CVT->EX ; wait for slave
$ENQ TICK CVT->EX ; block slave
$SETIMR
wait for timer
$ENQ TICK CVT->NL ; release timer
prepare for slave to be released
$ENQ TOCK CVT->NL ; release slave
slave is now executing
next

Slave
repeat
$ENQ TICK CVT->EX ; wait for timer
timer complete
$ENQ TOCK CVT->EX ; wait for master
do something
$ENQ TOCK CVT->NL ; signal complete
$ENQ TICK CVT->NL ;
next
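
In C, the individual steps above might map onto $ENQW calls roughly like this (a sketch only: the synchronous $ENQW form, a simplified lock status block, and just the TICK lock shown; TOCK follows the same pattern):

    #include <descrip.h>
    #include <lckdef.h>
    #include <starlet.h>

    /* Simplified lock status block; see lksbdef.h for the real definition. */
    typedef struct
    {
        unsigned short status;       /* completion status                  */
        unsigned short reserved;
        unsigned int   lock_id;      /* filled in by the first $ENQ        */
        char           valblk[16];   /* lock value block (cycle number)    */
    } LKSB;

    static LKSB tick_lksb;
    static $DESCRIPTOR(tick_name, "TICK");

    /* Queue the lock once, at NL, so that it can be converted afterwards.  */
    void tick_init(void)
    {
        sys$enqw(0, LCK$K_NLMODE, (void *) &tick_lksb, LCK$M_VALBLK,
                 &tick_name, 0, 0, 0, 0, 0, 0, 0);
    }

    /* "CVT->EX": convert the existing lock to EX, waiting for the grant.   */
    void tick_to_ex(void)
    {
        sys$enqw(0, LCK$K_EXMODE, (void *) &tick_lksb,
                 LCK$M_CONVERT | LCK$M_VALBLK, 0, 0, 0, 0, 0, 0, 0, 0);
    }

    /* "CVT->NL": down-convert to NL, letting any waiter through.           */
    void tick_to_nl(void)
    {
        sys$enqw(0, LCK$K_NLMODE, (void *) &tick_lksb,
                 LCK$M_CONVERT | LCK$M_VALBLK, 0, 0, 0, 0, 0, 0, 0, 0);
    }
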
A crucible of informative mistakes
Robert Gezelter
Honored Contributor

Re: $WAITFR behaviour

Barry,

I would agree with John, except it is not clear to me that you actually need two locks to accomplish this.

Personally, I would do this with locking and ASTs to make things maximally safe.

Doing this with event flags is tricky, as has been commented on.

- Bob Gezelter, http://www.rlgsc.com
Hoff
Honored Contributor

Re: $WAITFR behaviour

To add one other oft-overlooked interprocess communications mechanism that's available on OpenVMS: RMS. RMS can be a silly-fast communications channel for many applications, and with very minimal configuration requirements.
Claus Olesen
Advisor
Solution

Re: $WAITFR behaviour

To me it looks like a problem of who gets the CPU. OpenVMS will put the slaves on the runnable queue when the master sets the event flag, but the slaves may not actually get the CPU for a while after that, all while the master is free to continue in its endless loop of setting and clearing the event flags, whether or not the slaves have seen them. You may not be able to use realtime priorities, but if the master could be set at a lower priority than the slaves, that could ensure that the slaves get to run all the way up to the next event flag wait before the master again gets the CPU.
Hein van den Heuvel
Honored Contributor

Re: $WAITFR behaviour

Absolutely Claus.

That's a much better and simpler way to explain it. Occam's razor.
As presented the master never waits for the slave.
There is no formal synchronization at all.
It's just 'likely' that the slave gets scheduled when the master goes to wait for the timer, but it is not guaranteed.

The $show proc/cont just changed the priorities and scheduling.

Cheers,
Hein.
John Gillings
Honored Contributor

Re: $WAITFR behaviour

Re: Robert "except it is not clear to me that you actually need two locks to accomplish this."

If there are only two processes, you're absolutely correct. One lock will suffice, but it seems to me there may be several "Slaves" here. For maximum control, and to guarantee each slave gets a look in, have one lock for the master, and one for each slave. The slaves synch against master and their own lock. The master controls them all.

re: Hoff's comment about RMS - absolutely true. RMS files can be a simple way of using locks from DCL.
A crucible of informative mistakes
Barry Alford
Frequent Advisor

Re: $WAITFR behaviour

Thanks to all for your replies -- I didn't think that this would elicit such a discussion; I thought I had just misinterpreted how event flags, and especially $WAITFR, work.

Claus solved the problem by mentioning priorities; it works if I run the master at base priority 3 and the slaves at the default of 4 (tried it with two slaves so far). Hein suggested that the $show proc/cont changed the priorities, and this looks like it may be the case - but (like in quantum physics) I can't measure this without changing it!

I have not tried locks/ASTs as suggested by John and Robert; I may explore that later.

By the way, my Windows programming colleagues tell me this is really simple in their world! However, I believe their event-driven code is like the slaves registering with the master, which Hein suggested with his $HIBER/$WAKE, albeit with some more help from the OS.
Hein van den Heuvel
Honored Contributor

Re: $WAITFR behaviour

>> Claus solved the problem by mentioning priorities; it works if I run the master at base priority 3 and the slaves at the default of 4 (tried it with two slaves so far).

NO NO NO NO NO

Assigning priorities does not solve problems, it merely hides problems.

IF strict synchronization is a requirement, then the master somehow has to listen to the slave.

IF it's ok to miss a cycle or two then priorities can be good enough, but they are never a solution.
For example, at some point the master may do an I/O and get a priority boost from that. Or PIXSCAN kicks in, or whatever.

Since you have a timer going off anyway, this could be acceptable in your case, as long as the slaves look for all possible work, not just one item.
Folks with just one event flag and no timer may hang too long waiting for the action if a wakeup is missed.

Cheers,
Hein.
Barry Alford
Frequent Advisor

Re: $WAITFR behaviour

Hein, maybe I should have said that Claus had solved the problem of the slaves not seeing the setting of the event flag -- I don't mean to say that this is a solution for my application.

It /could/ be a solution, as some missed frames may not be critical - the slaves are intended to run at 50Hz (20ms cycle time) as their position info is monitored by a graphics process (running on a pc). (Currently, the monolithic application does all that the slaves do in this period.) I have not yet run at this rate or given the slaves any work to do; this may then show up the flakiness of the scheduling problem again...

If so, I would probably try the $HIBER/$WAKE solution suggested by Hein; then I have control of the "wakeup list" of process IDs, rather than depending on what I thought $WAITFR would do. (I will need a mechanism for the slaves to register their process ID with the master.)

Finally, I may also try setting the slaves free and running them at 50Hz but asynchronously to each other. This may be no worse than tolerating them missing a cycle when running under the master.
Hoff
Honored Contributor

Re: $WAITFR behaviour

Why are the secondaries even having this problem?

(There's an ancient software design rule I ascribe to: if it hurts, don't do it.)

Consider another approach... Have a scheduled AST or $schdwk or other such in each of the secondaries (repeating TQE), and free-run the secondaries pulling in the frames from whatever the shared storage is.

The other approach I've used in this environment is an Ethernet multicast. I have used 60-06 here when I first started with this scheme an eon ago (as that stays on-LAN), but I'd also look to use a UDP multicast datagram for various environments.

http://h71000.www7.hp.com/DOC/82final/6529/6529pro_005.html

Basically, the primary multicasts the data (or just a ping) periodically, and if the secondaries pick up on the multicast, they display the data. So long as the next datagram is captured and the data in the datagrams are independent, if a UDP multicast datagram is occasionally dropped on the floor, well, oh, well...

I've successfully used this or a similar scheme for any number of monitoring tasks over the years. (It's also entirely platform-independent; it works on OpenVMS and on anything else that can receive a UDP multicast.) (If you want to chat off-line, I can describe a boo-boo or two I've made when designing these sorts of factory-floor and other such comm systems over the years.)
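
A rough sketch of the sending side, with the usual BSD socket calls (the group address and port are arbitrary picks, and error handling is omitted):

    #include <string.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    /* Send one small datagram per frame to a multicast group. */
    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in group;
        const char ping[] = "tick";

        memset(&group, 0, sizeof group);
        group.sin_family = AF_INET;
        group.sin_addr.s_addr = inet_addr("239.1.2.3");   /* example group */
        group.sin_port = htons(5000);                     /* example port  */

        for (;;)
        {
            /* ...wait out the 20 ms frame timer here... */
            sendto(s, ping, sizeof ping, 0,
                   (struct sockaddr *) &group, sizeof group);
        }
    }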

As for coordination and the election of a primary process, that should use locks. Coordinating processes within a cluster through other means is perilous at best, and a whole lot more work to get right. I'd not use $hiber and $wake for this task, and would definitely not use event flags here.

John Gillings
Honored Contributor

Re: $WAITFR behaviour

Barry,

Fairly simple rule... any synchronization mechanism that depends on differential priorities is wrong! It WILL break somewhere, sometime, when you least expect it, and you won't be able to reproduce the failure.

>By the way, my Windows programming
>collegues tell me this is really simple in
>their world!

Synchronization between processes in OpenVMS is also really simple. Any operating system is fundamentally just a big synchronization engine, so it can't be any other way.

I suppose it could be argued that on OpenVMS it's made slightly harder because you have so much choice as to what mechanism you pick. So, rather than having to bend your design to fit what the operating system offers, you choose the best design for your application and then select the most appropriate mechanism to implement it.

Event flags can be good for "blind" signaling. "Wait here until something happens". There can be multiple processes all waiting on a common event flag, and they all get released at once, BUT you get no feedback, and no way to know all processes are ready before lowering the gate. They can be a good, lightweight mechanism, but unless you're very careful you may introduce timing windows, race conditions or false triggers.

Other mechanisms include HIBER/WAKE, locks, mailboxes, global sections, ICC, IPL, spinlocks, mutexes, ASTs, etc... See the Programming Concepts Manual.

Your first step is to make sure you can clearly describe what you want to do. I'm not sure I know that yet. I THINK it's like this:

master process
timed loop
send an event to each slave
confirm slave has received event
endloop

slave (multiple processes?)
loop
wait for event
confirm receipt of event
endloop

There are lots of possible ways to do this, but event flags probably aren't the best choice.
A crucible of informative mistakes
Claus Olesen
Advisor

Re: $WAITFR behaviour

Hoff suggested using locks. I tried it using this

int x, mode;
lksb A, B, C;                       /* one lock per resource A, B and C */

if (!strcmp(argv[1], "master"))
    mode = LCK$K_EXMODE;            /* master cycles through EX mode */
else                                /* slaves */
    mode = LCK$K_CRMODE;            /* slaves cycle through CR mode  */

for (x = 0; ; x++)
{
    printf("%d\n", x);
    sys_enq(A, mode);               /* grab A */
    sys_enq(C, LCK$K_NLMODE);       /* drop C */
    sys_enq(B, mode);               /* grab B */
    sys_enq(A, LCK$K_NLMODE);       /* drop A */
    sys_enq(C, mode);               /* grab C */
    sys_enq(B, LCK$K_NLMODE);       /* drop B */
    sleep(atoi(argv[2]));           /* your work here */
}

(sys_enq is pseudocode for a convenience wrapper around sys$enq) and my test run with one master and 2 slaves showed lock step without missteps. It has the characteristic you mentioned, that the parties do not need to know about one another. And they can come and go.
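
Such a wrapper might look roughly like this (a sketch only: it keeps the lock ID in the status block, so the first call queues a new lock and later calls convert it; with this shape the loop above would pass the locks by address, i.e. sys_enq(&A, mode)):

    #include <descrip.h>
    #include <lckdef.h>
    #include <starlet.h>
    #include <stdlib.h>

    /* Each "lksb" bundles a resource name with a lock status block. */
    typedef struct
    {
        struct dsc$descriptor_s *name;
        struct {
            unsigned short status, reserved;
            unsigned int   lock_id;
            char           valblk[16];
        } sb;
    } lksb;

    static $DESCRIPTOR(name_a, "CLAUS_A");    /* B and C named similarly */
    static lksb A = { &name_a };

    static void sys_enq(lksb *lk, unsigned int mode)
    {
        /* First call creates the lock; later calls convert the existing one. */
        unsigned int flags  = (lk->sb.lock_id != 0) ? LCK$M_CONVERT : 0;
        unsigned int status = sys$enqw(0, mode, (void *) &lk->sb, flags,
                                       lk->name, 0, 0, 0, 0, 0, 0, 0);
        if (!(status & 1) || !(lk->sb.status & 1))
            exit(status);                      /* bail out on any lock error */
    }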
Barry Alford
Frequent Advisor

Re: $WAITFR behaviour

Thanks for more food for thought, everyone. I will not have the time to experiment further for a while, but I will try to clarify what the application does now and how we would like to refactor it.

The current app simulates several (~10) machines which move independently:

repeat
$SETIMR(timerEventFlag, 20 milliseconds)
moveMachine1
moveMachine2
moveMachine3
...
$WAITFR(timerEventFlag)

Currently, when the characteristics of a machine are revised, the whole app is rebuilt.

I have prototyped using sharable libraries for the moveMachineX functions which allows late binding, but I had the idea of running each machine as a separate process, i.e.

repeat
$SETIMR(timerEventFlag, 20 milliseconds)
moveMachine1
$WAITFR(timerEventFlag)

repeat
$SETIMR(timerEventFlag, 20 milliseconds)
moveMachine2
$WAITFR(timerEventFlag)

...
(each process accesses its own data in one shared image)

There was some concern about each process running asynchronously, albeit at the same frequency, so that's when I tried my original master/slaves design; using one event flag for the master timer allowed each slave to free-run between setting the flag and resetting the timer, so I introduced the second flag to act like a clock's escapement mechanism.

My assumption was that when this flag was set, any process waiting for it would _definitely_ run without rechecking the flag. This assumes that VMS would keep a list of processes waiting for the flag so they would be made runnable when the flag was set. I now don't believe that is what happens...

There is no need for the slaves to inform the master that they have run - the master is meant to behave as a master clock rather than each machine process keeping its own time.

One final note: this application will be ported to Windows (but still live on under VMS), and so I am sadly reluctant to adopt a solution which is very VMS-centric.

Thank you all once more for all your help and suggestions. I will leave this thread open for a while as I welcome any more comments and I will report on any further experimentation I do.
Barry Alford
Frequent Advisor

Re: $WAITFR behaviour

On re-reading my last post, I thought I should clarify:

The timerEventFlag is a local flag in:
$SETIMR(timerEventFlag, 20 milliseconds)

but in the Master/Slave configuration, both the timer and the "escapement" flag are common event flags.
Hoff
Honored Contributor

Re: $WAITFR behaviour

Clock-style escapements don't work with event flags, as the flywheel you're basing all this on here is rotating at variable speeds, and the intervals between the individual teeth aren't consistent.

Sorry, but Foucault doesn't work here. Schrodinger and Murphy both do visit rather regularly, as you've found.

As for how to implement this with locks: in each of the two processes involved here, you could queue and receive an exclusive grant on the resource "ESCAPEMENT" with a blocking AST; when the blocking AST arrives, it triggers a down-convert to null followed by a requeue for exclusive. The other process is effectively offset one half cycle. Rinse, lather, repeat.

If you need to have a synch, I'd still use the multicast. But you could have the coordinating process queue for and immediately release an incompatible lock such as an exclusive lock. This is the ping. The client processes would receive blocking ASTs and would dequeue and then requeue the shared request on the ping, and the grant AST for the clients would probably provide the update processing.

Another thing you get with the lock is the lock value block; you get a data storage area that's passed around with the lock.

But for a typical data display application requiring a heartbeat and a periodic display, a UDP or a LAN-level multicast datagram is a very sweet solution.


Barry Alford
Frequent Advisor

Re: $WAITFR behaviour

Right now we are thinking of setting the slaves free and running them in their own timed loop as separate processes - they do not interact with each other that much. Where they do, we'll ensure the code is thread-safe.

We already use multicast for communicating with other tasks running on PCs, e.g. a graphics process which displays the positions of the machines; this doesn't need to synchronise and it doesn't matter if a few frames are dropped occasionally.

It would be fun if someone with access to the VMS sources could confirm my suspicions on $WAITFR. Does the process go into some table ready to be run when the EF is set? I don't think so! Probably, the scheduler looks at the EF on each cycle, causing it to miss my narrow window.
Hoff
Honored Contributor

Re: $WAITFR behaviour

No need for access to the OpenVMS source code here. There's a whole chapter on event flags in the Internals and Data Structures Manual; in the IDSM.

In far too few words and details, when a shared event flag state is changed or when any number of other system events of significance occur, the process itself deals with scheduling and a rescheduling event can be declared, and the scheduler then wakes up and goes looking for which process is available and most deserving of the events.

In the case of a common event flag, the scheduling-related processing goes looking for (other) processes residing in an event flag wait state, and particularly looking for any other processes that might have been released by having raised this event flag.

As for this case, there's an expected "hole" where an event flag that's quickly toggled can lead to event flag hangs; where the event flag is seen to change and the waiting process(es) can be detected and started, but the change back when the event flag is lowered effectively triggers an event flag stall; a "lost" event flag toggle. (This "hole" is an inherent part of how ASTs can arrive when in an event flag wait state and can themselves set event flags; having ASTs able to arrive during an event flag wait and to potentially reset the event flag -- which can easily be the event flag that the mainline can be waiting on -- is central to many application designs.)

Again, I've learned to avoid event flags in all but the trivial cases. Until and unless you're willing to work through a state diagram of the transitions and the sequencing of each, communications involving event flags are far harder to get "right" than they might appear, and the effort involved is usually better spent on other (and easier and more reliable and more flexible) approaches.