Operating System - OpenVMS

Re: How wide is SYS$SETRWM

 

Regarding SMP: yes, we got the problem when we moved to SMP. We also run a cluster. I'm thinking of rewriting it to use RDB... :D

Is it possible to invoke a "CPU CACHE FLUSH"? I'm not used to Alpha asm, only M68K and 6811.
If I know that I have the lock, getting it to flush the cache would perhaps solve the problem. We get this on rare occasions, and never in the same place.
If we reset everything and run "exactly" as before, we don't get the fault. We are not alone on the machine.
Hoff
Honored Contributor

Re: How wide is SYS$SETRWM

[[[[Well it's hard to predict....]]]]

The best Heisenbugs always are.

[[[[We use it when writing to mailbox.
By placing the SETRWM the old programmer (not me) wanted to catch if the mailbox was full and then write it to an overflow.
Like:
If SYS_STATUS = SS$_MBFULL then
!Overflow
else
! else everything good..... but is it?
end if]]]]

If that's the actual code, it's badly broken. Everything other than MBFULL is most definitely NOT success.

You will want to look at the low bit of the status. If it is set, the call worked. If clear, the call failed. I usually test for specific condition values of interest first (eg: MBFULL) and then fall through to a more generalized low-bit check.
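
A minimal C sketch of that ordering, under stated assumptions: DEMO_SS_NORMAL and DEMO_SS_MBFULL are stand-in names so the fragment compiles off-VMS (real code would take SS$_NORMAL and SS$_MBFULL from <ssdef.h>), and classify_write_status is an illustrative helper, not part of any OpenVMS API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for SS$_NORMAL and SS$_MBFULL. Real code includes <ssdef.h>;
   the numbers here exist only so the sketch builds anywhere. */
#define DEMO_SS_NORMAL 1u
#define DEMO_SS_MBFULL 0x8D8u   /* even value: low bit clear, i.e. a failure */

typedef enum { WRITE_OK, WRITE_OVERFLOW, WRITE_FAILED } write_outcome;

/* Low bit set: the call worked. Low bit clear: the call failed. */
static bool status_ok(uint32_t status)
{
    return (status & 1u) != 0;
}

/* Test the specific condition of interest first, then fall through
   to the general low-bit check. */
write_outcome classify_write_status(uint32_t status)
{
    if (status == DEMO_SS_MBFULL)
        return WRITE_OVERFLOW;   /* divert the message to the overflow path */
    if (status_ok(status))
        return WRITE_OK;
    return WRITE_FAILED;         /* anything else is an error, not success */
}
```

The third branch is the one the quoted BASIC fragment is missing: a status that is neither MBFULL nor a success gets reported instead of being treated as "everything good".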

[[[[We use it VERY close to the mailbox call.]]]]


Mailboxes have the ability to do an IO$M_NORSWAIT, which is the usual trigger for the MBFULL you're already checking for. That means the code is likely already skipping the resource wait related to the mailbox, so another resource wait here would largely be meaningless.

And the mailbox isn't tied to the shared memory, so there is obviously rather more going on here.

How many readers for this mailbox? Zero or one is best; zero leads to a stall, one is the typical choice. More than one is often a real problem as you're not sure which way the message is going, and traffic tends to get wedged or resequenced.

Are ASTs in use here?

Are all calls specifying an IOSB and either an explicit and non-shared event flag, or the EFN$C_ENF don't-care event flag?

[[[[Regarding SMP: yes, we got the problem when we moved to SMP. We also run a cluster.]]]]

Ok, some more details here, please? Are you sharing a common or a global section across nodes?

[[[[I'm thinking of rewriting it to use RDB... :D]]]]

In all seriousness, RMS with global buffers enabled is a surprisingly good choice.

[[[[Is it possible to invoke a "CPU CACHE FLUSH"? I'm not used to Alpha asm, only M68K and 6811.]]]]

There are gratuitous cache flushes here with the system service calls, but it's feasible that if you have somebody looking at the contents of the structure without benefit of the lock (while there's a parallel write going) you could get stale or inconsistent data.

As for invoking memory barriers, sure. No need for assembler. There are interlocked calls, or you can call the barrier routines yourself directly or via a C wrapper.

Here's an intro to the concepts:

http://64.223.189.234/node/407
http://64.223.189.234/node/638

[[[[If I know that I have the lock, getting it to flush the cache would perhaps solve the problem. We get this on rare occasions, and never in the same place.
If we reset everything and run "exactly" as before, we don't get the fault. We are not alone on the machine.]]]]

Yep, that's typical of this class of error, and of most shared-memory Heisenbugs. The way out usually involves desk-checking the code and building a state table. That, and usually simplifying the associated code, as the usual trigger I've seen in what I've debugged is a very complex interface into the shared memory area.

John Gillings
Honored Contributor

Re: How wide is SYS$SETRWM

>Then I do an ENQW for a resource I've just
>locked via another window.
>The ENQW still waits until I do another
>lock on the same resource. And then I get a
>normal deadlock.

Correct! The ENQW isn't waiting for a resource, it's waiting for the lock. Apologies if my previous explanation wasn't clear enough.

If all you're trying to do is detect a full mailbox, then please remove all $SETRWM calls from your code and add the modifiers IO$M_NOW and IO$M_NORSWAIT to your write function code. Check the I/O User's Guide for the exact behaviour of mailbox I/Os.

You may also want to review your allocation of buffer space when the mailbox is created. Memory is MUCH more abundant on modern systems. Allocating more may help smooth out application flow control and synchronisation.
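
The flow-control idea can be modelled off-VMS. This is a rough sketch only: demo_mailbox, demo_mbx_write and send_or_overflow are made-up names, the 64-byte capacity is arbitrary, and the write routine merely imitates the behaviour described above for a write with IO$M_NOW and IO$M_NORSWAIT (return at once, report a full mailbox instead of waiting). It is not the $QIO interface itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SS_NORMAL 1u
#define DEMO_SS_MBFULL 0x8D8u   /* stand-in; real code uses SS$_MBFULL from <ssdef.h> */
#define MBX_CAPACITY   64       /* hypothetical buffer quota, in bytes */

typedef struct {
    unsigned char bytes[MBX_CAPACITY];
    size_t used;
} demo_mailbox;

/* Imitates a non-blocking mailbox write: never waits for buffer space,
   returns the "full" status instead. */
uint32_t demo_mbx_write(demo_mailbox *mbx, const void *msg, size_t len)
{
    assert(mbx != NULL && msg != NULL);
    if (len > MBX_CAPACITY - mbx->used)
        return DEMO_SS_MBFULL;
    memcpy(mbx->bytes + mbx->used, msg, len);
    mbx->used += len;
    return DEMO_SS_NORMAL;
}

/* Returns 1 if delivered, 0 if diverted to the overflow path,
   -1 on any other failure. No resource-wait mode is involved. */
int send_or_overflow(demo_mailbox *mbx, const void *msg, size_t len,
                     size_t *overflow_count)
{
    uint32_t status = demo_mbx_write(mbx, msg, len);
    if (status == DEMO_SS_MBFULL) {
        ++*overflow_count;       /* real code would queue the message elsewhere */
        return 0;
    }
    return (status & 1u) ? 1 : -1;
}
```

A larger MBX_CAPACITY in this model plays the role of the bigger buffer allocation suggested above: the overflow path is taken less often.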

$SETRWM is far more likely to cause you problems than resolve them.
A crucible of informative mistakes

Re: How wide is SYS$SETRWM

Thanks for all the answers.

1. I'll check the mailbox handling and rewrite some of it. I may have some questions later.

2. On the use of memory barriers:
I tried to download some manuals, but it failed. So am I right to think like this, in every process?

Get the lock via ENQW
Process the shared memory
Before releasing the lock, do an __MB()
Release the lock

Thanks again
BR
Roger

Re: How wide is SYS$SETRWM

Hi

I was wrong in my last post: I need to do the MB before, i.e.:
enqw
MB
do my stuff

But I read something that destroys my plan to use this from BASIC:
$type vms_mb.c
/* Memory barrier */
#include <builtins.h>

long VMS_MB()
{
    __MB();
    return -1;
}

The release notes for 7.3 say:
"In addition, a memory barrier in a subroutine call between the Read FLAG and the Read/Use of the DATA will not prevent speculation. The memory barrier must be in line."

So my "fancy" C function will not help me?

BR
Roger
Hoff
Honored Contributor

Re: How wide is SYS$SETRWM

I'd review the whole of the source code before starting to make changes. If there is a subtle synchronization error lurking, charging in and making changes is a strategy I've found largely futile.

I tend to follow a code review with looking for and fixing coding errors first. Some of the usual coding errors I look for are listed here:

http://h71000.www7.hp.com/wizard/wiz_1661.html

This includes proper handling of return status values, as well as uniform use and verification of the IOSB, etc. No data from an asynchronous call can be trusted until and unless the return status and the non-shared IOSB are both checked.
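
As a sketch of that two-step check, assuming the common IOSB layout with the completion status in the first 16-bit word: demo_iosb and demo_io_ok are illustrative names, and the remaining IOSB fields are device dependent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative IOSB layout; one private IOSB per outstanding request. */
typedef struct {
    uint16_t status;     /* completion status, valid only once the I/O is done */
    uint16_t count;      /* transfer count */
    uint32_t dev_info;   /* device-dependent information */
} demo_iosb;

/* The data is trustworthy only if BOTH statuses have the low bit set. */
bool demo_io_ok(uint32_t call_status, const demo_iosb *iosb)
{
    assert(iosb != NULL);
    if ((call_status & 1u) == 0)
        return false;            /* request was never queued; IOSB means nothing */
    return (iosb->status & 1u) != 0;
}
```

For an asynchronous call this check belongs after the completion event (event flag or AST), not right after the call returns.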

Next I look at the existing synchronization mechanisms, and at the details of what is being protected, and how it is accessed.

I then look at how the messages are sequenced (explicitly and implicitly), and then at the memory barriers and at the word tearing.

And if you're using the bitlock PALcode calls or the lock manager calls, memory barriers are not typically required. MBs are used when you are changing directions with your memory accesses to a cell (eg: write, write, write, write, read), and you want all the writes to complete and coalesce before you read from a cell. With shared memory, other key issues are cache visibility and access coordination across the processors, and this involves bitlocks or interlocked queues, or other constructs.

Non-interlocked reads don't necessarily read from memory; they can and often do read from local processor cache, so a write to that same memory cell from another processor can be missed. Accordingly, shared memory flags typically need to be interlocked. The interlock notifies the processors to reload their caches.
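
For readers more at home in portable C, the flag-then-data pattern can be sketched with C11 atomics. This is an analogy, not OpenVMS code: the <stdatomic.h> release and acquire operations stand in for the interlocked instructions and in-line barriers discussed in this thread, and demo_cell, demo_cell_init, demo_publish and demo_consume are invented names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int payload;          /* plain data, guarded by the flag protocol */
    atomic_bool ready;    /* the shared flag; only ever accessed atomically */
} demo_cell;

void demo_cell_init(demo_cell *c)
{
    assert(c != NULL);
    c->payload = 0;
    atomic_init(&c->ready, false);
}

/* Writer: store the data first, then set the flag with release ordering,
   so the payload write cannot become visible after the flag. */
void demo_publish(demo_cell *c, int value)
{
    c->payload = value;
    atomic_store_explicit(&c->ready, true, memory_order_release);
}

/* Reader: load the flag with acquire ordering, so the payload read
   cannot be performed ahead of the flag read. */
bool demo_consume(demo_cell *c, int *out)
{
    if (!atomic_load_explicit(&c->ready, memory_order_acquire))
        return false;
    *out = c->payload;
    return true;
}
```

The ordering sits inside the same functions that touch the flag and the data, which is the C counterpart of keeping the barrier in line rather than behind a separate subroutine call.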

And I'm still not sure what these mailbox messages and these lock management calls and other such have to do with the shared memory. I'm seeing lots of pieces here, and not much of a picture of how the pieces fit together in this application. And it's a coherent view of the whole that is needed when dealing with synchronization.

With one selection of memory management code I remember well, I ended up looking at it and its occasional and transient crashes for some months, then (getting nowhere and getting frustrated with the application stability) full-time for a week or so and ended up re-writing the whole thing. The resulting code ran far faster, and was stable -- thirty-some pages of memory management source code were reduced down to two pages, too.

What I would do here is similar to what I have described above. I'd first go for the so-called "low-hanging fruit" and desk-check the code (for common coding errors), and (failing that) I'd then look to analyze the footprint of the current error (yes, I'd go for coding bugs before looking at the details of the synchronization code), and would then look to simplify the source code into stability.

Stephen Hoffman
HoffmanLabs LLC

Re: How wide is SYS$SETRWM

Hi
Thx for answer.

More info:

While searching for the fault I ran across the SETRWM, which was used together with the mailbox.

It has almost nothing to do with the shared memory.

The shared memory consists of an array of structs and a control struct.
The control struct holds a pointer to the free part of the array. Somehow this pointer gets overwritten, but only when we run on an SMP system.

Everything is coded in BASIC, with some external functions such as ENQW.

We take a lock, via the lock manager, on the pointer in the shared area, and when we get it we do (in C style):
ptr = shared->pointer
shared->pointer = *ptr->pointer
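
A portable C sketch of that take-the-head-of-the-free-list step, under several assumptions: array indices replace the pointers (so the layout does not depend on where the section is mapped), a C11 atomic_flag spin lock stands in for the lock manager call, and every demo_ name is hypothetical. The point is only that each read and write of the free-list head happens between taking and releasing the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SLOT_COUNT 4
#define NO_SLOT    (-1)

typedef struct {
    int next_free;        /* index of the next free slot, or NO_SLOT */
    int in_use;
} demo_slot;

typedef struct {
    atomic_flag lock;     /* stand-in for the lock taken with ENQW */
    int first_free;       /* the control struct's "pointer to the free part" */
    demo_slot slots[SLOT_COUNT];
} demo_shared;

void demo_shared_init(demo_shared *s)
{
    assert(s != NULL);
    atomic_flag_clear(&s->lock);
    for (int i = 0; i < SLOT_COUNT; ++i) {
        s->slots[i].next_free = (i + 1 < SLOT_COUNT) ? i + 1 : NO_SLOT;
        s->slots[i].in_use = 0;
    }
    s->first_free = 0;
}

/* Takes one slot off the free list; returns its index, or NO_SLOT if empty. */
int demo_alloc_slot(demo_shared *s)
{
    while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
        ;                                   /* spin until the lock is ours */

    int idx = s->first_free;                /* read the head only under the lock */
    if (idx != NO_SLOT) {
        s->first_free = s->slots[idx].next_free;
        s->slots[idx].next_free = NO_SLOT;
        s->slots[idx].in_use = 1;
    }

    atomic_flag_clear_explicit(&s->lock, memory_order_release);
    return idx;
}
```

If any code path reads or caches the head before the lock is granted, or writes it after the release, two processors can end up owning the same slot, which would look much like an overwritten pointer.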

So we now have a place in the array that is ours. But because of pre-execution, the cache might already have fetched the data from memory. So when I get the lock, it might be old data.

I've been desk-checking the code... a lot of paper. But the code is not big, nor is it complex. It does what it should and nothing more. It waits for its lock, then processes, then releases the lock.

I'll read your text again a few more times...
BR

Re: How wide is SYS$SETRWM

Hi

After reading TONS of guides and manuals, it seems hopeless.
HP BASIC does not give the proper tools to handle SMP with respect to instruction ordering. If it were written in C it would be a different matter.

Then the only right thing to do is to rewrite it to use an RMS file and place that file in memory. That would solve the cache problem.

Unless something else exists to force a "flush" of the cache to memory, I'll close this thread at the end of this week.

Thx for all answers

BR
Roger
Hein van den Heuvel
Honored Contributor

Re: How wide is SYS$SETRWM

>> Then the only right thing to do is to rewrite it to use an RMS file and place that file in memory. That would solve the cache problem.

Admittedly I did not read this whole stream, but suddenly the word RMS showed up, and I must say the above sentence looks odd (as in someone is clueless! That someone could be me, or...)

If memory barriers are a concern, one (or at least I) typically thinks about timing problems within dozens of instructions, often protected by a spinlock.

RMS record operations take millions of instructions (OK... many thousands) and often use several locks.

OpenVMS locking sits nicely in the middle and is probably the safe and easy solution to your problem. It takes hundreds (low thousands) of instructions and 'does the right thing' up to 500,000 times per second on a fast box.

An optimal solution with buffers in shared memory can probably be found using LIB$INSQHI and friends:

"When you use these routines, cooperating processes can communicate without
further synchronization and without danger of being interrupted, either on a
single processor or in a multiprocessor environment. The queue access routines
are also useful in an AST environment; they allow you to add or remove an entry
from a queue without being interrupted by an AST."

>> But because of pre-execution, the cache might already have fetched the data from memory.

NO.


Hope this helps some,
Hein.

Hoff
Honored Contributor

Re: How wide is SYS$SETRWM

I tend to go over to use of RMS (with global buffers) fairly quickly when presented with these sorts of issues, as it deals with this stuff for me. If I really need speed -- more than I can get with tuning the RMS -- then I start looking at shared memory and at solutions other than RMS.

Sure, RMS is fairly heavyweight. Conversely, code that deals with the same sorts of cases will also be heavyweight, and you'll end up supporting it. (TANSTAAFL here. Sure, direct shared memory and bitlocks is usually fairly lightweight. But it never seems to end there...)

The combination of using existing code (eg: RMS) and throwing hardware at the problem can be a cheap solution.

Now as for reviewing and desk-checking the existing code (the "low-hanging fruit" before more work on the code, or before considering a rewrite), that's something best discussed off-line. I'm getting the distinct impression I do not know what's (really) going on with the code in question, as the more I read the responses here, the more confused I get.