Re: How wide is SYS$SETRWM

Roger Strandberg SEB · ‎07-10-2008

Hi

My question is about how wide the effect of SYS$SETRWM is. Does it affect SYS$ENQW?
Can it affect other programs started from the same GROUP/SCHEDULER, etc.?

Hoff · ‎07-10-2008

AFAIK, everything.

This is arguably a nuclear application self-destruct lever outside of an application where you control everything -- and those basically don't exist any more. You're almost inevitably calling into an RTL somewhere.

If you'd like to risk weird and potentially unhandled errors within most anything you might call most anywhere in the call chain (directly or indirectly, too), have at.

Now if you'd like to discuss the real problem here -- that there's a nasty little trade-off where you inevitably have to decide to hang or to drop a packet when operating in traffic spikes or otherwise past the application throughput limits -- that's another matter.

I'd tend to look to design such an application to have enough quotas to avoid running into the quota blade guard; the process quotas are the means by which an application error is (usually) prevented from triggering or escalating into a system-wide failure.

Yep, probably not the answer you wanted. :-)

Stephen Hoffman
HoffmanLabs LLC

John Gillings · ‎07-10-2008

Roger,

Whatever you think SYS$SETRWM will do for you, it won't!

You should only turn off RWM if you know the entire instruction stream up to the point it's turned back on. When it's used correctly it would be for a short duration critical region. I'd go so far as to say there are NO valid applications where RWM is disabled permanently at the process level.

Yes, it affects all system services, including SYS$ENQ(W). If the $ENQ request requires allocation of a resource which cannot be satisfied immediately, RWM determines behaviour.

Example - an $ENQ for a new resource will need to allocate a RSB and LKB from non-paged pool. With RWM enabled, $ENQ will wait if the resource isn't available, with it disabled $ENQ will return immediately with some kind of failure status.

If there are no resource issues, $ENQ(W) will behave in exactly the same manner with RWM on or off. RWM disabled will NOT prevent $ENQW from waiting for a lock if the request is incompatible with an existing lock.

$SETRWM is intended to be used in time critical code, where you need to avoid any wait states. I wouldn't expect to see a $ENQW in such a code path!

Zen answer... if you have to ask questions about using $SETRWM, you shouldn't be using it! ;-)

A crucible of informative mistakes

Roger Strandberg SEB · ‎07-10-2008

Hi. Thx for answers....

I tested SETRWM in a BASIC program.
First I call SETRWM(1) to disable resource wait mode, and on return I check for SS$_WASSET just to make sure.

Then I do an ENQW for a resource I've just locked from another window.
The ENQW still waits until I take another lock on the same resource. And then I get a normal deadlock.

Ian Miller. · ‎07-11-2008

So you are looking to lock a resource using $ENQ but do not wish to wait if the resource is not available?

Perhaps you need the LCK$M_NOQUEUE flag.
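
For illustration only, the difference between a request that waits and one that refuses to queue can be sketched with a portable C11 try-lock. This is an analogy and not lock manager code: `demo_lock`, `demo_try_lock` and `demo_unlock` are hypothetical names, and an `atomic_flag` stands in for the resource. With LCK$M_NOQUEUE the real $ENQ hands back a not-queued status (SS$_NOTQUEUED) instead of waiting; here the try-lock simply returns false.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lockable resource. */
struct demo_lock {
    atomic_flag held;
};

#define DEMO_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Returns true if the lock was free and is now ours; returns false
 * immediately, without waiting, if somebody else already holds it. */
static bool demo_try_lock(struct demo_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Releases a lock previously obtained with demo_try_lock(). */
static void demo_unlock(struct demo_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The caller decides what to do on a false return (retry later, take another path, report busy), which is the decision a no-queue lock request pushes back to the application.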

____________________
Purely Personal Opinion

Roger Strandberg SEB · ‎07-11-2008

Well no....

I do:
"Some handling with enqw"
SETRWM()
QIO
SETRWM()
"Some handling with enqw"

The ENQWs are used to lock a shared area.
And somehow this area gets wrong data in it.
I suspect that the locking is not right.
I do handle deadlocks.

I have several programs that handle this shared area in the same way.
=>
That's why I asked how wide SETRWM is, whether it's just local to the program....

Ian Miller. · ‎07-11-2008

It is local to the process.

What are you trying to achieve by turning off RWM around a QIO?

____________________
Purely Personal Opinion

Hoff · ‎07-11-2008

So you're not using any calls into language RTLs nor any external libraries nor any (other) system services? Cool. It'll do just what you want. Have at...

If, on the other hand, you're using language RTLs or (other) system services or such, well, those typically are not coded to expect nor to contend with having resource wait disabled.

In most every case I've seen over the years where this has been proposed, the application is non-trivial and the call can be potentially hazardous; it's better to code the application with either an AST or a timeout or a NOW flag or other such feature, and to code it to explicitly react appropriately under load: to stall, to back-pressure or to drop messages.

FWIW, the potential RWM weirdnesses can be subtle and very hard to replicate, too.

Hoff · ‎07-11-2008

[[[The ENQWs are used to lock a shared area.
And somehow this area gets wrong data in it.
I suspect that the locking is not right.
I do handle deadlocks.]]]

Some application code is broken, yes.

Are any SMP systems involved here?

Some code is not contending correctly with the shared memory around the processor caching (a particular factor when Alpha and shared memory and SMP are ill-mixed together), is going around or otherwise ignoring the locking, or other such arcana.

There are a wide variety of ways to go off the rails here.

Use of RWM will almost certainly make things worse, too.

In recent years, I've tended to avoid using or migrate away from home-grown shared memory code -- even my own code -- and move to RMS files with global buffers. Shared memory looks good right up until you start dealing with these sorts of cases.

Roger Strandberg SEB · ‎07-11-2008

Well it's hard to predict....

We use it when writing to a mailbox.
By placing the SETRWM there, the old programmer (not me) wanted to catch the case where the mailbox was full and then write the message to an overflow.
Like:
If SYS_STATUS = SS$_MBFULL then
!Overflow
else
! else everything good..... but is it?
end if

We use it VERY close to the mailbox call.
At most we have a simple print function in between.......

Roger Strandberg SEB · ‎07-11-2008

Regarding SMP...... Yes, we got the problem when we moved to SMP. Also we run a cluster...... I'm thinking of rewriting it to use RDB... :D

Is it possible to invoke a "CPU CACHE FLUSH"? I'm not used to Alpha asm... only M68K and 6811...
Because if I know that I have the lock, getting it to flush the cache would perhaps solve the problem... We get this on rare occasions, and never in the same place.
If we reset everything and run "exactly" as before, we don't get the fault. We are not alone on the machine.....

Hoff · ‎07-11-2008

[[[Well it's hard to predict....]]]

The best Heisenbugs always are.

[[[We use it when writing to a mailbox.
By placing the SETRWM there, the old programmer (not me) wanted to catch the case where the mailbox was full and then write the message to an overflow.
Like:
If SYS_STATUS = SS$_MBFULL then
!Overflow
else
! else everything good..... but is it?
end if]]]

If that's the actual code, it's badly broken. Everything other than MBFULL is most definitely NOT success.

You will want to look at the low bit of the status. If it is set, the call worked. If clear, the call failed. I usually test for specific condition values of interest first (eg: MBFULL) and then fall through to a more generalized low-bit check.
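
A minimal sketch of that check order, in C rather than BASIC so it can be compiled anywhere. The `DEMO_` constants are stand-ins (on OpenVMS the real values come from `<ssdef.h>`; the number used for MBFULL here is an assumption), and `classify_status` and the `write_outcome` names are hypothetical.

```c
/* Stand-in condition values so the sketch builds off VMS.
 * On OpenVMS, use SS$_NORMAL and SS$_MBFULL from <ssdef.h> instead. */
#define DEMO_SS_NORMAL 1u      /* odd value: success */
#define DEMO_SS_MBFULL 2264u   /* even value: failure (assumed number) */

enum write_outcome {
    WRITE_OK,        /* low bit set: the call worked */
    WRITE_OVERFLOW,  /* mailbox full: divert the message to the overflow */
    WRITE_FAILED     /* any other even status: a real error to report */
};

/* Test the specific condition of interest first, then fall through
 * to the generic low-bit check for everything else. */
static enum write_outcome classify_status(unsigned int status)
{
    if (status == DEMO_SS_MBFULL)
        return WRITE_OVERFLOW;
    if (status & 1u)
        return WRITE_OK;
    return WRITE_FAILED;
}
```

The point of the third outcome is that "not MBFULL" and "success" are different things; the quoted BASIC code treats them as the same.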

[[[We use it VERY close to the mailbox call.]]]

Mailboxes have the ability to do a IO$M_NORSWAIT, which is the usual trigger for the MBFULL you're already checking for. Which means the code is likely already skipping the resource wait related to the mailbox, so another resource wait here would largely be meaningless.

And the mailbox isn't tied to the shared memory, so there is obviously rather more going on here.

How many readers for this mailbox? Zero or one is best; zero leads to a stall, one is the typical choice. More than one is often a real problem as you're not sure which way the message is going, and traffic tends to get wedged or resequenced.

Are ASTs in use here?

Are all calls specifying an IOSB and either an explicit and non-shared event flag, or the EFN$C_ENF don't-care event flag?

[[[Regarding SMP...... Yes we got the problem when we moved to SMP. Also we run a cluster......]]]

Ok, some more details here, please? Are you sharing a common or a global section across nodes?

[[[I'm thinking of rewriting it to use RDB... :D]]]

In all seriousness, RMS with global buffers enabled is a surprisingly good choice.

[[[Is it possible to invoke a "CPU CACHE FLUSH"? I'm not used to Alpha asm... only M68K and 6811...]]]

There are gratuitous cache flushes here with the system service calls, but it's feasible that if you have somebody looking at the contents of the structure without benefit of the lock (while there's a parallel write going) you could get stale or inconsistent data.

As for invoking memory barriers, sure. No need for assembler. There are interlocked calls, or you can call the barrier routines yourself directly or via a C wrapper.

Here's an intro to the concepts:

http://64.223.189.234/node/407
http://64.223.189.234/node/638

[[[Because if I know that I have the lock, getting it to flush the cache would perhaps solve the problem... We get this on rare occasions, and never in the same place.
If we reset everything and run "exactly" as before, we don't get the fault. We are not alone on the machine.....]]]

Yep, that's typical of this class of error; to most of the shared-memory Heisenbugs. The way out of this usually involves desk-checking the code, too. A state table. That, and usually simplifying the associated code, as the usual trigger I've seen on what I've debugged is a very complex interface into the shared memory area.

John Gillings · ‎07-13-2008

>Then i do a ENQW for a resouce i've just
>lock via another window.
>The ENQW still waits untill i do a another
>look to same resource. And then i get a
>normal deadlock.

Correct! The ENQW isn't waiting for a resource, it's waiting for the lock. Apologies if my previous explanation wasn't clear enough.

If all you're trying to do is detect a mailbox full, then please remove all $SETRWM calls from your code and add modifiers IO$M_NOW and IO$M_NORSWAIT to your write function code. Check out the I/O User's Guide to find the exact behaviour of mailbox I/Os.

You may also want to review your allocation of buffer space when the mailbox is created. Memory is MUCH more abundant on modern systems. Allocating more may help smooth out application flow control and synchronisation.

$SETRWM is far more likely to cause you problems than resolve them.

A crucible of informative mistakes

Roger Strandberg SEB · ‎07-14-2008

Thx for all answers.

1. Well, I'll check the mailbox handling and will rewrite it some. I'll perhaps have some questions later.

2.
The memory barrier use...
I tried to download some manuals but it failed. So am I right if I think like this in every process?

Get lock via ENQW
Do the processing of shared memory
Before release of the lock do a __MB()
Release lock

thx again
BR
Roger

Roger Strandberg SEB · ‎07-14-2008

Hi

I was wrong in my last.... I need to do the MB before.... =>
enqw
MB
do my stuff.

But I read something that destroys my plan to use this from BASIC:
$type vms_mb.c
/* Memory barrier */
#include

long VMS_MB()
{
__MB();
return -1;
}

The release notes of 7.3 say:
". In addition, a memory barrier in a subroutine call between the Read FLAG and the Read/Use of the DATA will not prevent speculation. The memory barrier must be in line. "

So my "fancy" C function will not help me?

BR
Roger

Hoff · ‎07-14-2008

I'd review the whole of the source code before starting to make changes. If there is a subtle synchronization error lurking, charging in and making changes is a strategy I've found largely futile.

I tend to follow a code review with looking for and fixing coding errors first. Some of the usual coding errors I look for are listed here:

http://h71000.www7.hp.com/wizard/wiz_1661.html

This includes proper handling of return status values, as well as uniform use and verification of the IOSB, etc. No data from an asynchronous call can be trusted until and unless the return status and the non-shared IOSB are both checked.
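
A rough sketch of that two-stage check, written as portable C. `demo_iosb` is a hypothetical mirror of the usual quadword I/O status block layout (status word, byte count, device-dependent longword), and `status_ok` and `io_completed_ok` are made-up helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of a quadword I/O status block. */
struct demo_iosb {
    uint16_t status;      /* final completion status of the I/O */
    uint16_t byte_count;  /* bytes transferred */
    uint32_t dev_info;    /* device-dependent information */
};

/* VMS convention: an odd condition value means success. */
static bool status_ok(uint32_t status)
{
    return (status & 1u) != 0;
}

/* Call only after the I/O has completed (synchronous call returned,
 * or the event flag / AST for an asynchronous call has fired).
 * Both the service return status and the IOSB status must be good. */
static bool io_completed_ok(uint32_t service_status,
                            const struct demo_iosb *iosb)
{
    if (!status_ok(service_status))
        return false;  /* request was never queued; the IOSB says nothing */
    return status_ok(iosb->status);
}
```

Each outstanding request needs its own IOSB for this to mean anything; a shared one can be overwritten by a different completion.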

Next I look at the existing synchronization mechanisms, and at the details of what is being protected, and how it is accessed.

I then look at how the messages are sequenced (explicitly and implicitly), and then at the memory barriers and at the word tearing.

And if you're using the bitlock PALcode calls or the lock manager calls, memory barriers are not typically required. MBs are used when you are changing directions with your memory accesses to a cell (eg: write, write, write, write, read), and you want all the writes to complete and coalesce before you read from a cell. With shared memory, other key issues are cache visibility and access coordination across the processors, and this involves bitlocks or interlocked queues, or other constructs.

Non-interlocked reads don't necessarily read from memory, they can and often do read from local processor cache, so a write to that same memory cell from another processor can be missed. Accordingly, shared memory flags typically need be interlocked. The interlock notifies the processors to reload their caches.
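
As a portable illustration of the flag-and-data ordering problem, here is a sketch using C11 atomics. It is not Alpha-specific code: `atomic_thread_fence` stands in for an inline memory barrier, and `shared_cell`, `publish` and `try_consume` are hypothetical names. The writer makes the data visible before the flag; the reader checks the flag and only then reads the data.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical cell living in memory shared by writer and reader. */
struct shared_cell {
    long data;         /* payload, plain (non-atomic) storage */
    atomic_int ready;  /* flag: nonzero once data is valid */
};

/* Writer: store the data, fence, then raise the flag. The release
 * fence keeps the data store from being ordered after the flag store. */
static void publish(struct shared_cell *c, long value)
{
    c->data = value;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&c->ready, 1, memory_order_relaxed);
}

/* Reader: check the flag, fence, then read the data. The acquire
 * fence keeps the data load from being satisfied before the flag load. */
static bool try_consume(struct shared_cell *c, long *out)
{
    if (atomic_load_explicit(&c->ready, memory_order_relaxed) == 0)
        return false;
    atomic_thread_fence(memory_order_acquire);
    *out = c->data;
    return true;
}
```

Note that the fence sits in the same function as the accesses it orders, between the flag access and the data access, and that the flag itself is an atomic object and not a plain variable.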

And I'm still not sure what these mailbox messages and these lock management calls and other such have to do with the shared memory. I'm seeing lots of pieces here, and not much of a picture of how the pieces fit together in this application. And it's a coherent view of the whole that is needed when dealing with synchronization.

With one selection of memory management code I remember well, I ended up looking at it and its occasional and transient crashes for some months, then (getting no where and getting frustrated with the application stability) full-time for a week or so and ended up re-writing the whole thing. The resulting code ran far faster, and was stable -- thirty-some pages of memory management source code were reduced down to two pages, too.

What I would do here is similar to what I have described above. I'd first go for the so-called "low-hanging fruit" and desk-check the code (for common coding errors), and (failing that) I'd then look to analyze the footprint of the current error (yes, I'd go for coding bugs before looking at the details of the synchronization code), and would then look to simplify the source code into stability.

Stephen Hoffman
HoffmanLabs LLC

Roger Strandberg SEB · ‎07-14-2008

Hi
Thx for answer.

More info:

When searching for the fault I ran across the SETRWM, and that was used together with the mailbox.....

It has almost nothing to do with the shared memory.

The shared memory consists of an array containing structs, and a control struct.
The control struct holds a pointer to the free part of the array. Somehow this pointer gets overwritten, but only when we run on SMP.

Everything is coded in BASIC, with some external functions such as ENQW.

We take a lock via the lock manager on the pointer in the shared area, and when we get it we do (in C style):
ptr = shared->pointer
shared->pointer = *ptr->pointer

So we now have a place in the array that is ours. But because of pre-execution, the cache might already have fetched the data from memory. So when I get the lock it might be old data.

I've been reviewing the code at the desk.... a lot of paper. But the code is not big, nor is it complex. It does what it should and nothing more. It waits for its lock, then processes, then releases the lock.

I'll read your text again a few more times...
BR

Roger Strandberg SEB · ‎07-14-2008

Hi

After reading TONS of guides/manuals, it seems hopeless.
HP BASIC does not give the proper tools to handle SMP with respect to instruction ordering. If it were written in C it would be a different matter.

Then the only right thing to do is to rewrite it to use an RMS file and place that file in memory. That would solve the cache problem.

If nothing else exists to force a "flush" of cache to memory, then I'll close this thread at the end of this week.

Thx for all answers

BR
Roger

Hein van den Heuvel · ‎07-14-2008

>> Then only right this todo is to rewrite it to use RMS file and place that file in memory. That would solve the cache problem.

Admittedly i did not read this whole stream, but suddenly the word RMS showed up, and I must say the above sentence looks odd (as in someone is clueless!.. That someone could be me, or...)

If Memory Barriers are a concern one (or, me) typically thinks about timing problems with dozens of instructions, often protected by a spinlock.

RMS record operations take millions of instructions (ok... many thousands) and often use several locks.

OpenVMS locking sits nicely in the middle and is probably the safe and easy solution to your problem. It takes hundreds (low thousands) of instructions and 'does the right things' for up to 500,000 times per second on a fast box.

An optimal solution with buffers in shared memory can probably be found using LIB$INSQHI and friends:

"When you use these routines, cooperating processes can communicate without
further synchronization and without danger of being interrupted, either on a
single processor or in a multiprocessor environment. The queue access routines
are also useful in an AST environment; they allow you to add or remove an entry
from a queue without being interrupted by an AST."

>> But because or pre execution the cache might have already got the date a from memory to the cache.

NO.

Hope this helps some,
Hein.

Hoff · ‎07-14-2008

I tend to go over to use of RMS (with global buffers) fairly quickly when presented with these sorts of issues, as it deals with this stuff for me. If I really need speed -- more than I can get with tuning the RMS -- then I start looking at shared memory and at solutions other than RMS.

Sure, RMS is fairly heavyweight. Conversely, code that deals with the same sorts of cases will also be heavyweight, and you'll end up supporting it. (TANSTAAFL here. Sure, direct shared memory and bitlocks is usually fairly lightweight. But it never seems to end there...)

The combination of using existing code (eg: RMS) and throwing hardware at the problem can be a cheap solution.

Now as for reviewing and desk-checking the existing code (the "low-hanging fruit" before more work on the code, or before considering a rewrite), that's something best discussed off-line. I'm getting the distinct impression I do not know what's (really) going on with the code in question, as the more I read the responses here, the more confused I get.

John Gillings · ‎07-14-2008

Roger,

If you're using the lock manager, or RMS, you can forget about all the stuff about memory barriers, it's all handled for you.
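
To make that concrete, here is a portable sketch of a free-list pop done entirely under a lock. It is an analogy under stated assumptions, not the application's code: a small C11 spin lock stands in for the $ENQW/$DEQ pair, and `shared_area`, `take_free_entry`, `return_entry` and the other names are hypothetical. The acquire on lock and the release on unlock are what order the accesses to the free-list head, so the code between them needs no barriers of its own.

```c
#include <stdatomic.h>

#define POOL_SIZE 4
#define NO_ENTRY  (-1)

struct entry {
    int  next_free;  /* index of the next free entry, or NO_ENTRY */
    long payload;
};

struct shared_area {
    atomic_flag  lock;       /* stand-in for a lock manager lock */
    int          free_head;  /* control field: first free entry */
    struct entry pool[POOL_SIZE];
};

static void area_init(struct shared_area *a)
{
    atomic_flag_clear(&a->lock);
    for (int i = 0; i < POOL_SIZE; i++)
        a->pool[i].next_free = (i + 1 < POOL_SIZE) ? i + 1 : NO_ENTRY;
    a->free_head = 0;
}

static void area_lock(struct shared_area *a)
{
    while (atomic_flag_test_and_set_explicit(&a->lock, memory_order_acquire))
        ;  /* spin until the holder releases */
}

static void area_unlock(struct shared_area *a)
{
    atomic_flag_clear_explicit(&a->lock, memory_order_release);
}

/* Pops one entry off the free list; both the read of free_head and
 * its update happen while the lock is held. */
static int take_free_entry(struct shared_area *a)
{
    area_lock(a);
    int idx = a->free_head;
    if (idx != NO_ENTRY)
        a->free_head = a->pool[idx].next_free;
    area_unlock(a);
    return idx;
}

/* Pushes an entry back onto the free list, again under the lock. */
static void return_entry(struct shared_area *a, int idx)
{
    area_lock(a);
    a->pool[idx].next_free = a->free_head;
    a->free_head = idx;
    area_unlock(a);
}
```

Two callers can only get the same index if some path touches `free_head` without holding the lock, or keeps using a value read before the lock was taken; that is the kind of thing a desk-check should look for.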

In theory, if you have an extremely limited model of accessing your shared memory, you can implement interlocking at a lower level, which (again, in theory) might be slightly faster than using locks or RMS because you can make assumptions and take short cuts. However, as soon as you start to generalise the way the section is accessed, you'll find the short cuts don't work.

The most common support conversation with folk writing their own shared memory code goes like this:

Programmer: I'm having trouble with transient corruptions in my shared memory

Support: Why aren't you using locks or RMS?

Programmer: Because shared memory is faster!

Support: but your bugs indicate you've missed something. Maybe SMP race conditions, Intercluster timing, priority mismatch, starvation, word tearing, etc...

Programmer: I'll have to add code to deal with that.

(iterate numerous times with different and increasingly subtle issues, getting harder and harder to diagnose...)

Programmer: There! I've finally got it all working.

Support: How is your benchmark against RMS?

Programmer: I'll just check... Gee, RMS wins! :-(

Moral: The RMS folk are not just good, they're exceptional. If your shared memory access code is faster than theirs, you've missed something!

A crucible of informative mistakes

Roger Strandberg SEB · ‎07-14-2008

Hi all..

Thx for the Answers....

Some input....
The stuff ran stable for 15 years on VAX.
Then it was ported to Alpha (single CPU) and ran for 10 years rock solid.
Then after running on SMP Alpha the queue system caught a cold.

When I woke up today I had a fresh mind =>
My first idea was to call a VMS_MB function from BASIC. But that would not help me, because the MB has to be in line. But what if I write my own PEEK and POKE functions, like

long ROGERS_OWN_MB_PEEK(long *ptr)
{
__MB();
return *ptr;
}

void ROGERS_OWN_MB_POKE(long *ptr, long val)
{
__MB();
*ptr = val;
}

The thing is that I don't just read it as a flag; I use what I read, and from that info I get a pointer into the array, and it's catastrophic if someone else gets the same pointer. The pointer is just the *ptr => array[*ptr] = my queue entry.

Regarding RMS......... Yes, of course issuing like 500000 instructions solves the problem.
Like when we placed setlogical directly after the LOCK. Then the CPU stopped caching the area in front....
From basic/lis/mac
asm:
To get the free place from a common shared area to my own area for the process.

TRAPB
MOV 37, R0
STL R0, 32(FP)
TRAPB
LDQ R16, 64(R2)
LDL R16, (R16)
SLL R16, 40, R16
SRL R16, 48, R16
LDQ R17, -16(R2)
LDL R18, TS
ZAP R18, 3, R18
INSWL R16, 0, R16
BIS R18, R16, R18
STL R18, TS

L$65:
LDAH R16, 27(R31)
LDA R16, -32326(R16)
GENTRAP

Yesterday I was a pessimist. Today I'm an optimistic realist :D

Anyway, my hope is to skip those GENTRAPs by calling my own peek and poke.... what say you?

Though I have another place where we copy memory to a temp place, then work with it, then move it back. All in BASIC style; this could also lead to problems.... if my peek and poke will not solve it.... then RMS.......

BR
Roger

Roger Strandberg SEB · ‎07-14-2008

Extra info.

We do use the lock manager.
But that does not help...
What clearly happens is that an entry on the queue disappears.
The only way this can happen is that 2 processes cache the same value of the free entry.
Somehow, while the lock manager releases the lock for one process and the other process gets the lock, the value of the free entry is not updated; the second process gets the same free entry and overwrites the entry on cache update.

We have checks and do handle deadlocks; we even have several exit handlers to clean up in case of a program failure.
It has been ROCK SOLID.... but not on SMP machines.......

Jon Pinkley · ‎07-15-2008

As you have rediscovered, synchronization is much harder to get right on a multiprocessor.

Get a copy of the Alpha Architecture Reference Manual. I found a pdf of the fourth edition using a google search for

alpha architecture reference manual pdf

at this url:

http://download.majix.org/dec/alpha_arch_ref.pdf

Read section 5.5 and after you understand that, section 5.6 (this isn't light reading!)

And if something other than shared memory is fast enough, re-consider using something else. For example, if you use RMS indexed files with global buffers, you will only need to write it once; the implementation will work across all VMS platforms, i.e. you won't have to re-engineer for IA64.

Good Luck,

Jon

it depends

Hein van den Heuvel · ‎07-15-2008

>> Some how while the lock manager release the lock for on process, and the other process gets the lock, the value of the free entry is not updated, the second process gets the same free entry and over writes then then entry on cache update.

There is no way that's happening as described.

In my mind it is 100% certain that this code has been broken forever: on VAX as well.

Just your luck to be around when it was finally exposed to be incorrect.
Do not search too deep.
The problem is likely to be a really big one. Some 'dirty' queue header still used after the lock is released. Some IOSB still active pointing to a long-since-released shared memory block. Maybe even a $ENQ call where a $ENQW was intended, or just a stupid event flag overload.

So what is this peek & poke all about?
Can you attach a .TXT file with a couple of .BAS listings + machine code around a typical queue entry acquire and release?

Good luck!
Hein.
