
How wide is SYS$SETRWM

 
John Gillings
Honored Contributor

Re: How wide is SYS$SETRWM

Roger,

If you're using the lock manager, or RMS, you can forget about all the stuff about memory barriers, it's all handled for you.

In theory, if you have an extremely limited model of accessing your shared memory, you can implement interlocking at a lower level, which (again, in theory) might be slightly faster than using locks or RMS because you can make assumptions and take short cuts. However, as soon as you start to generalise the way the section is accessed, you'll find the short cuts don't work.

The most common support conversation with folk writing their own shared memory code goes like this:

Programmer: I'm having trouble with transient corruptions in my shared memory

Support: Why aren't you using locks or RMS?

Programmer: Because shared memory is faster!

Support: But your bugs indicate you've missed something. Maybe SMP race conditions, intercluster timing, priority mismatch, starvation, word tearing, etc...

Programmer: I'll have to add code to deal with that.

(iterate numerous times with different and increasingly subtle issues, getting harder and harder to diagnose...)

Programmer: There! I've finally got it all working.

Support: How is your benchmark against RMS?

Programmer: I'll just check... Gee, RMS wins! :-(

Moral: The RMS folk are not just good, they're exceptional. If your shared memory access code is faster than theirs, you've missed something!
A crucible of informative mistakes

Re: How wide is SYS$SETRWM

Hi all..

Thx for the Answers....

Some input....
The stuff ran stable for 15 years on VAX.
Then it was ported to Alpha (single CPU) and ran for 10 years rock solid.
Then, after running on SMP Alpha, the queue system caught a cold.

When I woke up today I had a fresh mind =>
My first idea was to call a VMS_MB function from BASIC, but that would not help because the MB has to be inline. But what if I write my own PEEK and POKE functions, like:

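#include <builtins.h> /* DEC C header for built-ins such as __MB() */

long ROGERS_OWN_MB_PEEK(long *ptr)
{
__MB(); /* memory barrier before the load */
return *ptr;
}

void ROGERS_OWN_MB_POKE(long *ptr, long val)
{
__MB(); /* memory barrier before the store */
*ptr = val;
}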


The thing is that I don't just read it as a flag: I use the value I read as an index into the array, and it's catastrophic if someone else gets the same index. The pointer is just the *ptr => array[*ptr] = my queue entry.
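A minimal sketch of why barrier-wrapped peek and poke may not be enough on their own. It assumes portable C11 atomics as a stand-in for the Alpha __MB() builtin, and it simplifies the free-entry value to a plain counter; the names fenced_peek, fenced_poke, fenced_interleaving_collides and claim_slot are made up for illustration. A barrier orders memory accesses, but it does not make "read the index, then write the new index" a single atomic step, so two callers can still read the same value unless a lock (or an atomic read-modify-write) covers the whole sequence.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for the PEEK/POKE wrappers: full fence, then the access. */
static long fenced_peek(_Atomic long *p)
{
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(p, memory_order_relaxed);
}

static void fenced_poke(_Atomic long *p, long v)
{
    atomic_thread_fence(memory_order_seq_cst);
    atomic_store_explicit(p, v, memory_order_relaxed);
}

/* Replays one bad interleaving by hand: "process A" and "process B" both
   read the free index before either writes it back.
   Returns 1 if both ended up with the same slot. */
int fenced_interleaving_collides(void)
{
    _Atomic long next_free = 5;
    long a = fenced_peek(&next_free);   /* A reads 5 */
    long b = fenced_peek(&next_free);   /* B reads 5 as well */
    fenced_poke(&next_free, a + 1);     /* A writes 6 */
    fenced_poke(&next_free, b + 1);     /* B writes 6 again */
    return a == b;
}

/* An atomic read-modify-write hands out each value exactly once. */
long claim_slot(_Atomic long *next_free)
{
    return atomic_fetch_add(next_free, 1);
}
```

The hand-replayed interleaving gives both callers slot 5 even though every access is fenced, while repeated claim_slot calls return 5, 6, 7 and so on. A real free list is a linked structure rather than a counter, so in practice the whole reserve sequence would sit under the lock.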

Regarding RMS... Yes, of course issuing something like 500000 instructions solves the problem.
Like when we placed a setlogical directly after the LOCK: then the CPU stopped caching the area in front....
From basic/lis/mac, the asm that gets the free place from the common shared area into my own area for the process:

TRAPB
MOV 37, R0
STL R0, 32(FP)
TRAPB
LDQ R16, 64(R2)
LDL R16, (R16)
SLL R16, 40, R16
SRL R16, 48, R16
LDQ R17, -16(R2)
LDL R18, TS
ZAP R18, 3, R18
INSWL R16, 0, R16
BIS R18, R16, R18
STL R18, TS



L$65:
LDAH R16, 27(R31)
LDA R16, -32326(R16)
GENTRAP


Yesterday I was a pessimist. Today I'm an optimistic realist :D

Anyway, my hope is to skip those GENTRAB by calling my own peek and poke.... what say you?

Though I have another place where we copy memory to a temp place, work with it, then move it back, all in BASIC style. This could also lead to problems.... if my peek and poke do not solve it.... then RMS.......

BR
Roger

Re: How wide is SYS$SETRWM

Extra info.

We do use the lock manager.
But that does not help...
What clearly happens is that an entry on the queue disappears.
The only way this can happen is that 2 processes cache the same value of the free entry.
Somehow, while the lock manager releases the lock for one process and the other process gets the lock, the value of the free entry is not updated; the second process gets the same free entry and then overwrites the entry on cache update.

We check for and handle deadlocks, and we even have several exit handlers to clean up in case of a program failure.
It has been ROCK SOLID.... but not on SMP machines.......
Jon Pinkley
Honored Contributor
Solution

Re: How wide is SYS$SETRWM

As you have rediscovered, synchronization is much harder to get right on a multiprocessor.

Get a copy of the Alpha Architecture Reference Manual. I found a pdf of the fourth edition using a google search for

alpha architecture reference manual pdf

at this url:

http://download.majix.org/dec/alpha_arch_ref.pdf

Read section 5.5 and after you understand that, section 5.6 (this isn't light reading!)

And if something other than shared memory is fast enough, re-consider using something else. For example, if you use RMS indexed files with global buffers, you will only need to write it once; the implementation will work across all VMS platforms, i.e. you won't have to re-engineer for IA64.

Good Luck,

Jon
it depends
Hein van den Heuvel
Honored Contributor

Re: How wide is SYS$SETRWM

>> Somehow, while the lock manager releases the lock for one process and the other process gets the lock, the value of the free entry is not updated; the second process gets the same free entry and then overwrites the entry on cache update.

There is no way that's happening as described.

In my mind it is 100% certain that this code has been broken forever: on VAX as well.

Just your luck to be around when it was finally exposed to be incorrect.
Do not search too deep.
The problem is likely to be a really big one. Some 'dirty' queue header still used after the lock is released. Some IOSB still active pointing to a long since released shared memory block. Maybe even a $ENQ call where a $ENQW was intended, or just a stupid event flag overload.

So what is this peek & poke all about?
Can you attach a .TXT file with a couple of .BAS listings + machine code around a typical queue entry acquire and release?

Good luck!
Hein.


Hoff
Honored Contributor

Re: How wide is SYS$SETRWM

[[[The stuff ran stable for 15 years on VAX. Then it was ported to Alpha (single CPU) and ran for 10 years rock solid]]]

Um, so? A thirty year old bug in yacc -- a very commonly-used tool -- was recently identified. There have been day-one bugs in OpenVMS itself.

I'd be willing to bet your circa 25 year old code is broken.

There are three reasons why I suspect your code.

First, faster processors and particularly SMP are notorious for exposing latent bugs. Back when I wrote that ATW (1661) topic, I started a list of the reasons why the code can be broken; a list of the bugs I'd encountered over the years. The second is that this code is 25 years old and started out as VAX code, and the implementation details of the memory controllers and the memory caching have changed radically. (There was one VAX around that had fairly aggressive memory caching that broke some code, but most code never encountered an Aquarius; a VAX 9000 SMP box.) The third (and sorry about this) is that you're so completely convinced that this application code is bug free.

Even Knuth's TeX code wasn't bug-free, and the Professor is a whole lot better at this programming stuff than most folks.

My own rule of thumb when something detonates within one of my applications: it's my bug until I prove it's not my bug. The definition of "rock solid" being "no known bugs", after all.

Approach this problem systematically, and approach this without the concept of ownership nor even of familiarity; without having any code conceit.

[[[What clearly happens is that an entry on the queue disappears. The only way this can happen is that 2 processes cache the same value of the free entry.]]]

That certainly looks like a problem with how the queue is accessed (details not yet in evidence), with the queue locking (not yet in evidence) or with the processor caching that manifests itself on SMP (not yet in evidence.)

Stephen Hoffman
HoffmanLabs LLC
John Gillings
Honored Contributor

Re: How wide is SYS$SETRWM

Customer support latin motto #2

"Insunt interdum menda in eo quod est efficax"

Literal translation...

"There are sometimes flaws in that which is efficacious"

In other words:

Just because it (apparently) "works" doesn't mean it's correct.

I've seen literally hundreds of cases where code "that's been working for years" breaks when moved onto an SMP system, a different architecture, better optimizer or even just a faster system. In all those cases, not a single instance was due to a bug in the underlying hardware or OS. It's ALWAYS a bug in the synchronization assumptions made in the failing code, which have been protected from exposure by the architecture, speed or uniprocessor environment.

You need to step back, analyze the problem and implement proper synchronization.

My advice is to abstract the shared data structure into a module where the details of the implementation are hidden, and the operations required by the application are exported as a callable interface. First cut implementation should use something you know you can rely on, like RMS. Once you have the application working, you can then fiddle with, tune, or completely change the implementation without having to change any code in the application.
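A rough sketch of what such a module boundary could look like in C. Everything here is hypothetical: the workq_* names, the fixed capacity, and the integer payload are made up for illustration, and the private lock is a C11 atomic_flag spin used only as a placeholder for whatever trusted mechanism (RMS, the lock manager) sits behind the interface in a real first cut. The point is that callers only ever see the four exported functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* ---- public interface: all the application ever sees ---- */
typedef struct workq workq;              /* opaque handle */
workq *workq_open(void);
int    workq_put(workq *q, int item);    /* 0 on success, -1 if full  */
int    workq_get(workq *q, int *item);   /* 0 on success, -1 if empty */
void   workq_close(workq *q);

/* ---- private implementation: free to change later ---- */
#define WORKQ_CAP 4                      /* hypothetical capacity */

struct workq {
    atomic_flag busy;                    /* placeholder lock */
    int items[WORKQ_CAP];
    int head;                            /* index of the oldest item */
    int count;                           /* number of queued items */
};

static void workq_lock(workq *q)
{
    while (atomic_flag_test_and_set_explicit(&q->busy, memory_order_acquire)) {
        /* spin until the holder clears the flag */
    }
}

static void workq_unlock(workq *q)
{
    atomic_flag_clear_explicit(&q->busy, memory_order_release);
}

workq *workq_open(void)
{
    workq *q = calloc(1, sizeof *q);
    if (q != NULL)
        atomic_flag_clear(&q->busy);
    return q;
}

int workq_put(workq *q, int item)
{
    int rc = -1;
    workq_lock(q);
    if (q->count < WORKQ_CAP) {
        q->items[(q->head + q->count) % WORKQ_CAP] = item;
        q->count++;
        rc = 0;
    }
    workq_unlock(q);
    return rc;
}

int workq_get(workq *q, int *item)
{
    int rc = -1;
    workq_lock(q);
    if (q->count > 0) {
        *item = q->items[q->head];
        q->head = (q->head + 1) % WORKQ_CAP;
        q->count--;
        rc = 0;
    }
    workq_unlock(q);
    return rc;
}

void workq_close(workq *q)
{
    free(q);
}
```

Because the struct layout and the locking live entirely below the interface, swapping the placeholder lock or the in-memory array for something else should not touch any calling code.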
A crucible of informative mistakes

Re: How wide is SYS$SETRWM

Well.

As you write:
[I'd be willing to bet your circa 25 year old code is broken]
It was my first assumption (and still is).
The code lives in 5 functions; only those functions use the struct.
We have:
*Reserve entry - Get a free place
*Enqueue entry - Place the entry in the free place
*Dequeue entry - Get data from pointed entry
*Requeue entry - Move data from one queue to another queue (data entry stays in the same place; just the forward and backward links are updated)
*Final entry (remove entry from array and point out that this part of array is free)

So if I understand you right, the cache problem can't be as I describe. Hmmm.
We previously added a setlogical in the lock request, and that showed us yesterday that there aren't any deadlocks that we missed when this happens. It means that we hold a lock when we enter the code that changes the data. The only difference when using the setlogical directly after the enqw is that it took about 60 days before it happened; before, it happened 2-3 times every week.

simple view of the struct:


struct qhead {
int first; /* the first in queue */
int last; /* the last in queue */
};

struct qentry {
int que; /* the queue id */
int forward; /* next; if 0, last in queue */
int backward; /* previous; if 0, first in queue */
int in_use; /* set if deq is using it */
};

struct share {
struct qhead qh[num_of_queues];
struct qentry *que[max_size];
int free_queue_entry;
};

So qh is an array of about 30 queue heads.
The que[] array is first set up as a linked list at init, where free_queue_entry = 1.
We lock on the free_queue_entry and on qid.
Very simple...
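For reference, a small sketch of an index-linked free list along the lines described above, with a sanity check that reports when an entry is handed out twice or released twice. This is not the code from the application: struct pool, pool_init, pool_reserve, pool_final, MAX_ENTRIES and the allocated field are all hypothetical names, and the functions assume the caller already holds whatever lock guards the free-entry index.

```c
#include <assert.h>

#define MAX_ENTRIES 8                   /* hypothetical size; index 0 means "none" */

struct entry {
    int forward;                        /* next free index while on the free list */
    int allocated;                      /* 1 once handed out by pool_reserve */
};

struct pool {
    struct entry e[MAX_ENTRIES + 1];    /* slot 0 is never used */
    int free_head;                      /* plays the role of free_queue_entry */
};

/* Chain every slot into the free list: 1 -> 2 -> ... -> MAX_ENTRIES -> 0. */
void pool_init(struct pool *p)
{
    for (int i = 1; i <= MAX_ENTRIES; i++) {
        p->e[i].forward = (i < MAX_ENTRIES) ? i + 1 : 0;
        p->e[i].allocated = 0;
    }
    p->free_head = 1;
}

/* Caller must hold the lock.
   Returns the reserved index, 0 if the pool is full,
   or -1 if the free list looks corrupted. */
int pool_reserve(struct pool *p)
{
    int idx = p->free_head;
    if (idx == 0)
        return 0;
    if (idx < 0 || idx > MAX_ENTRIES || p->e[idx].allocated)
        return -1;                      /* free list points at a live entry */
    p->free_head = p->e[idx].forward;
    p->e[idx].forward = 0;
    p->e[idx].allocated = 1;
    return idx;
}

/* Caller must hold the lock.
   Returns 0 on success, -1 if idx is out of range or not currently reserved. */
int pool_final(struct pool *p, int idx)
{
    if (idx < 1 || idx > MAX_ENTRIES || !p->e[idx].allocated)
        return -1;                      /* double release or stray index */
    p->e[idx].allocated = 0;
    p->e[idx].forward = p->free_head;
    p->free_head = idx;
    return 0;
}
```

Logging the -1 cases would at least tell you whether an entry "disappears" because the free list handed it out twice or because something released it while it was still queued.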

Tomorrow starts my vacation..... I'll be back in August.... thx for answers.
Ian Miller.
Honored Contributor

Re: How wide is SYS$SETRWM

I'm not clear if you are using one lock ($ENQ) or more than one. Can you outline your code, and is the same code used in all processes that access this queue?

____________________
Purely Personal Opinion
Hoff
Honored Contributor

Re: How wide is SYS$SETRWM

If you're maintaining your own queue structures in shared memory (one interpretation for your struct reply), I'd look to migrate to lib$insqhi and friends. (There are considerations on where you can remove items from within these queues; the requirements for these calls might or might not map into your application design.)