Re: How wide is SYS$SETRWM

Hoff · ‎07-15-2008

[[[The stuff ran stable for 15 years on VAX. Then it was ported to Alpha (single CPU) and ran for 10 years rock solid]]]

Um, so? A thirty year old bug in yacc -- a very commonly-used tool -- was recently identified. There have been day-one bugs in OpenVMS itself.

I'd be willing to bet your circa 25 year old code is broken.

There are three reasons why I suspect your code.

First, faster processors and particularly SMP are notorious for exposing latent bugs. Back when I wrote that ATW (1661) topic, I started a list of the reasons why the code can be broken; a list of the bugs I'd encountered over the years. The second is that this code is 25 years old and started out as VAX code, and the implementation details of the memory controllers and the memory caching have changed radically. (There was one VAX around that had fairly aggressive memory caching that broke some code, but most code never encountered an Aquarius; a VAX 9000 SMP box.) The third (and sorry about this) is that you're so completely convinced that this application code is bug free.

Even Knuth's TeX code wasn't bug-free, and the Professor is a whole lot better at this programming stuff than most folks.

My own rule of thumb when something detonates within one of my applications: it's my bug until I prove it's not my bug. The definition of "rock solid" being "no known bugs", after all.

Approach this problem systematically, and approach it without any sense of ownership or even of familiarity; without any code conceit.

[[[What clearly happens is that an entry on the queue disappears. The only way this can happen is that 2 processes cache the same value of the free entry.]]]

That certainly looks like a problem with how the queue is accessed (details not yet in evidence), with the queue locking (not yet in evidence) or with the processor caching that manifests itself on SMP (not yet in evidence.)

Stephen Hoffman
HoffmanLabs LLC

John Gillings · ‎07-15-2008

Customer support Latin motto #2

"Insunt interdum menda in eo quod est efficax"

Literal translation...

"There are sometimes flaws in that which is efficacious"

In other words:

Just because it (apparently) "works" doesn't mean it's correct.

I've seen literally hundreds of cases where code "that's been working for years" breaks when moved onto an SMP system, a different architecture, a better optimizer or even just a faster system. In all those cases, not a single instance was due to a bug in the underlying hardware or OS. It's ALWAYS a bug in the synchronization assumptions made in the failing code, which have been protected from exposure by the architecture, speed or uniprocessor environment.

You need to step back, analyze the problem and implement proper synchronization.

My advice is to abstract the shared data structure into a module where the details of the implementation are hidden, and the operations required by the application are exported as a callable interface. First cut implementation should use something you know you can rely on, like RMS. Once you have the application working, you can then fiddle with, tune, or completely change the implementation without having to change any code in the application.
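To make the "hidden implementation, exported interface" idea concrete, here is a minimal sketch in C. Everything in it is hypothetical: the `wq_*` names, the fixed capacity, and the in-memory array that stands in for whatever the first-cut storage would be (RMS, in the suggestion above). `wq_lock` and `wq_unlock` are empty placeholders marking where the real synchronization would go.

```c
#include <assert.h>
#include <stddef.h>

/* ---- Interface: the only names the application would see ---- */
int wq_put(int item);    /* returns 1 on success, 0 if the queue is full  */
int wq_get(int *item);   /* returns 1 on success, 0 if the queue is empty */
int wq_length(void);     /* number of items currently queued              */

/* ---- Implementation: private to the module, free to change later ---- */
#define WQ_CAPACITY 8                 /* hypothetical size */

static int    wq_items[WQ_CAPACITY];  /* stand-in for the real storage */
static size_t wq_head;                /* index of the oldest item */
static size_t wq_count;               /* number of items stored */

/* Placeholders: the real lock acquire/release calls would go here. */
static void wq_lock(void)   { }
static void wq_unlock(void) { }

int wq_put(int item)
{
    int ok = 0;
    wq_lock();
    if (wq_count < WQ_CAPACITY) {
        wq_items[(wq_head + wq_count) % WQ_CAPACITY] = item;
        wq_count++;
        ok = 1;
    }
    wq_unlock();
    return ok;
}

int wq_get(int *item)
{
    int ok = 0;
    assert(item != NULL);
    wq_lock();
    if (wq_count > 0) {
        *item = wq_items[wq_head];
        wq_head = (wq_head + 1) % WQ_CAPACITY;
        wq_count--;
        ok = 1;
    }
    wq_unlock();
    return ok;
}

int wq_length(void)
{
    int n;
    wq_lock();
    n = (int)wq_count;
    wq_unlock();
    return n;
}
```

Because callers only ever use `wq_put`, `wq_get` and `wq_length`, the storage and the locking can be swapped out later without touching application code.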

A crucible of informative mistakes

Roger Strandberg SEB · ‎07-15-2008

Well.

As you write:
[I'd be willing to bet your circa 25 year old code is broken]
It was my first assumption (and still is).
The code lives in 5 functions; only those functions use the struct.
We have:
*Reserve entry - Get a free place
*Enqueue entry - Place the entry in the free place
*Dequeue entry - Get data from the pointed-to entry
*Requeue entry - Move data from one queue to another queue (the data entry stays in the same place; only the forward and backward links are updated)
*Final entry - Remove the entry from the array and mark that part of the array as free

So if I understand you right, the cache problem can't be as I describe. Hmmm.
We previously added a Setlogical call in the lock request, and yesterday that showed us that there aren't any deadlocks that we missed when this happens. I mean that we do hold a lock when we enter the code that changes the data. The only difference when using the setlogical directly after the enqw is that it took about 60 days before the problem happened; before that, it happened 2-3 times every week.

simple view of the struct:

struct share {
    struct qhead qh[num_of_queues];
    struct qentry *que[max_size];
    int free_queue_entry;
};

struct qentry {
    int que; /*the queue id*/
    int forward; /* next if 0 last in queue*/
    int backward; /*previous if 0 first in queue*/
    int in_use; /*set if deq is using it */
};

struct qhead {
    int first; /*the first in queue */
    int last; /*the last in queue */
};

So qh is an array of about 30 queue heads.
The que[] array is first set up as a linked list at init, with free_queue_entry = 1.
We lock on the free_queue_entry and on the qid.
Very simple...
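For readers following along, the free list described above can be sketched roughly as follows. This is not the poster's code. It assumes that entries are stored directly in the array (the original holds pointers), that index 0 means "none", that `forward` doubles as the free-list link, and `MAX_SIZE` is a made-up value. The function names `share_init`, `reserve_entry` and `final_entry` are hypothetical.

```c
#include <assert.h>

#define MAX_SIZE 8   /* hypothetical; the real max_size is not shown */

struct qentry {
    int que;       /* the queue id */
    int forward;   /* next entry, 0 if last (also used as free-list link) */
    int backward;  /* previous entry, 0 if first */
    int in_use;
};

struct share {
    struct qentry que[MAX_SIZE + 1];  /* slot 0 unused: index 0 means "none" */
    int free_queue_entry;             /* index of first free slot, 0 if none */
};

/* Chain every slot onto the free list: 1 -> 2 -> ... -> MAX_SIZE -> 0. */
static void share_init(struct share *s)
{
    for (int i = 1; i <= MAX_SIZE; i++) {
        s->que[i].que = 0;
        s->que[i].forward = (i < MAX_SIZE) ? i + 1 : 0;
        s->que[i].backward = 0;
        s->que[i].in_use = 0;
    }
    s->free_queue_entry = 1;
}

/* Caller must hold the lock that protects free_queue_entry.
 * The read of free_queue_entry and the write of its new value must not
 * be interleaved with another process doing the same thing. */
static int reserve_entry(struct share *s)
{
    int idx = s->free_queue_entry;
    if (idx == 0)
        return 0;                          /* no free slot */
    s->free_queue_entry = s->que[idx].forward;
    s->que[idx].forward = 0;
    return idx;
}

/* Caller must hold the same lock. Pushes the slot back on the free list. */
static void final_entry(struct share *s, int idx)
{
    assert(idx >= 1 && idx <= MAX_SIZE);
    s->que[idx].forward = s->free_queue_entry;
    s->free_queue_entry = idx;
}
```

If two processes both read the same `free_queue_entry` before either writes the new value, both get the same slot and one entry is lost, which matches the symptom described earlier in the thread.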

Tomorrow starts my vacation..... I'll be back in August.... thx for answers.

Ian Miller. · ‎07-16-2008

I'm not clear if you are using one lock ($ENQ) or more than one. Can you outline your code, and is the same code used in all processes that access this queue?

____________________
Purely Personal Opinion

Hoff · ‎07-16-2008

If you're maintaining your own queue structures in shared memory (one interpretation for your struct reply), I'd look to migrate to lib$insqhi and friends. (There are considerations on where you can remove items from within these queues; the requirements for these calls might or might not map into your application design.)

Jon Pinkley · ‎07-16-2008

Make sure the compiler is told these shared memory locations are "volatile", otherwise it may be caching the values in registers instead of returning to memory.
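A small illustration of what the qualifier does, as a sketch only: the struct and the `wait_for_change` helper are hypothetical, not taken from the application. Note that in standard C, volatile only forces the compiler to perform each read and write against memory; by itself it does not make a read-modify-write atomic and it is not a memory barrier, so it complements the locking rather than replacing it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of one shared field, declared volatile so the
 * compiler may not keep its value cached in a register. */
struct share_view {
    volatile int free_queue_entry;
};

/* Poll a shared location until it differs from 'old', giving up after
 * max_spins reads. Returns the last value seen.
 * Without volatile, an optimizer could legally hoist the load out of
 * the loop and spin on a stale register copy. */
static int wait_for_change(const volatile int *addr, int old, int max_spins)
{
    assert(addr != NULL);
    for (int i = 0; i < max_spins; i++) {
        int now = *addr;   /* volatile read: a real load on every pass */
        if (now != old)
            return now;
    }
    return old;
}
```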

The fact that adding code reduced the frequency of problems implies that there is a race condition.

You may want to consider using some of the techniques that are used to debug shared structures. For example, write a known value into the entry before it is released to the free list, and check for this value when you reserve a new entry, before it is used.
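A minimal sketch of that marker technique, under assumptions: the pool layout, the marker values and the `pool_*` names are all invented for illustration, index 0 means "none", and the caller is assumed to hold whatever lock protects the free list.

```c
#include <assert.h>

#define POOL_SIZE 4              /* hypothetical pool size */
#define FREE_MARK 0x5A5A5A5A     /* known value written on release */
#define USED_MARK 0x1EAF1EAF     /* known value written on reserve */

struct entry {
    int marker;      /* FREE_MARK while on the free list */
    int next_free;   /* free-list link, 0 if last */
    int payload;
};

static struct entry pool[POOL_SIZE + 1];  /* slot 0 unused */
static int free_head;                     /* 0 if the free list is empty */

static void pool_init(void)
{
    for (int i = 1; i <= POOL_SIZE; i++) {
        pool[i].marker = FREE_MARK;
        pool[i].next_free = (i < POOL_SIZE) ? i + 1 : 0;
        pool[i].payload = 0;
    }
    free_head = 1;
}

/* Returns the slot index, 0 if no slot is free, or -1 if the slot at the
 * head of the free list does not carry FREE_MARK (someone is still using
 * it, or it was handed out twice). */
static int pool_reserve(void)
{
    int idx = free_head;
    if (idx == 0)
        return 0;
    if (pool[idx].marker != FREE_MARK)
        return -1;
    free_head = pool[idx].next_free;
    pool[idx].marker = USED_MARK;
    return idx;
}

/* Returns 0 on success, -1 if the slot was not marked as in use
 * (for example, a double release). */
static int pool_release(int idx)
{
    assert(idx >= 1 && idx <= POOL_SIZE);
    if (pool[idx].marker != USED_MARK)
        return -1;
    pool[idx].marker = FREE_MARK;
    pool[idx].next_free = free_head;
    free_head = idx;
    return 0;
}
```

In a real system the `-1` paths are where you would log the process, the slot and a timestamp, so the failure is caught at the moment of corruption instead of when the entry is later found missing.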

Have fun on your vacation.

Jon

it depends

Hein van den Heuvel · ‎07-16-2008

Hmmm,

Thanks for sharing the structures and 5 functions.

It does help but, how to put this kindly, it is NOT confidence-inspiring.

>> int in_use; /*set if deq is using it */

Scary thought that you might need that.
Sounds like a hack to try to prevent the worst of the timing problems.
If DEQ is using it, then it is locked. Case closed. No flag can help.

>> /* next if 0 last in queue*/
That's sort of non-standard, making the end different from the rest. Typically you just make it point onwards to the header! Typically an empty queue points to itself, not to nothing (although admittedly for self-relative queues, 'nothing' is hard to distinguish from 'self' :-)
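As an illustration of the conventional layout, here is a sketch of a circular doubly linked queue whose header points to itself when empty. The names are hypothetical, and it uses absolute pointers for brevity; a queue in shared memory that may be mapped at different addresses would use offsets or indices instead. No locking is shown.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *flink;   /* forward link */
    struct node *blink;   /* backward link */
};

/* An empty queue is a header whose links point back at the header. */
static void q_init(struct node *head)
{
    head->flink = head;
    head->blink = head;
}

static int q_empty(const struct node *head)
{
    return head->flink == head;
}

/* Insert at the tail. No special case for "first" or "last" entry,
 * because the header is just another node in the ring. */
static void q_insert_tail(struct node *head, struct node *e)
{
    assert(e != head);
    e->flink = head;
    e->blink = head->blink;
    head->blink->flink = e;
    head->blink = e;
}

/* Remove from the head; returns NULL if the queue is empty. */
static struct node *q_remove_head(struct node *head)
{
    struct node *e = head->flink;
    if (e == head)
        return NULL;
    head->flink = e->flink;
    e->flink->blink = head;
    e->flink = e;
    e->blink = e;
    return e;
}
```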

The queue header struct suggests that there may be a 'memory barrier/cache line' issue. It has to do with whether a variable can be declared volatile or not.
I would suggest bumping the size of each queue header with enough fill bytes to make it larger than a cache line. It will help low-level performance. Make each header 128 bytes or so (or was it 64, I forget... you look it up!). Can't hurt, might fix it.
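A sketch of the padding, with assumptions flagged: the 128-byte stride is a guess taken from the figure above and should be checked against the actual processor, and the names `padded_qhead` and `queue_head` are hypothetical. Padding only keeps neighbouring headers out of the same cache line; it does not replace the locking, and the table itself must also start on a suitably aligned address for the separation to hold.

```c
#include <assert.h>
#include <stddef.h>

#define QHEAD_STRIDE 128   /* assumed stride; verify for the target CPU */

struct qhead {
    int first;   /* the first in queue */
    int last;    /* the last in queue */
};

/* One queue header plus fill bytes, so consecutive headers in an array
 * sit QHEAD_STRIDE bytes apart. */
struct padded_qhead {
    struct qhead h;
    char fill[QHEAD_STRIDE - sizeof(struct qhead)];
};

/* Compile-time check: a negative array size is an error if the padded
 * struct is not exactly QHEAD_STRIDE bytes. */
typedef char padded_qhead_size_check[
    (sizeof(struct padded_qhead) == QHEAD_STRIDE) ? 1 : -1];

/* Accessor so the rest of the code never sees the fill bytes. */
static struct qhead *queue_head(struct padded_qhead *table, int n, int qid)
{
    assert(qid >= 0 && qid < n);
    return &table[qid].h;
}
```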

Enjoy the vacation,
Hein.
