07-15-2008 05:27 AM
Re: How wide is SYS$SETRWM
Um, so? A thirty year old bug in yacc -- a very commonly-used tool -- was recently identified. There have been day-one bugs in OpenVMS itself.
I'd be willing to bet your circa 25 year old code is broken.
There are three reasons why I suspect your code.
First, faster processors and particularly SMP are notorious for exposing latent bugs. Back when I wrote that ATW (1661) topic, I started a list of the reasons why the code can be broken; a list of the bugs I'd encountered over the years. The second is that this code is 25 years old and started out as VAX code, and the implementation details of the memory controllers and the memory caching have changed radically. (There was one VAX around that had fairly aggressive memory caching that broke some code, but most code never encountered an Aquarius; a VAX 9000 SMP box.) The third (and sorry about this) is that you're so completely convinced that this application code is bug free.
Even Knuth's TeX code wasn't bug-free, and the Professor is a whole lot better at this programming stuff than most folks.
My own rule of thumb when something detonates within one of my applications: it's my bug until I prove it's not my bug. The definition of "rock solid" being "no known bugs", after all.
Approach this problem systematically, and approach this without the concept of ownership nor even of familiarity; without having any code conceit.
[[[What clearly happens is that an entry on the queue disappears. The only way this can happen is that two processes cache the same value of the free entry.]]]
That certainly looks like a problem with how the queue is accessed (details not yet in evidence), with the queue locking (not yet in evidence) or with the processor caching that manifests itself on SMP (not yet in evidence.)
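To make the quoted theory concrete, here is a minimal single-threaded sketch of how two reservers that each looked at the free-list head before taking the lock could end up holding the same entry. The array layout and the names `next_free`, `free_head`, `reserve_with_stale_head` and `reserve_under_lock` are hypothetical, not taken from the application, and this is only one of the possible failure modes listed above.

```c
#define NENTRIES 4

/* Hypothetical free list: next_free[i] is the entry after i; 0 ends the list. */
static int next_free[NENTRIES + 1];
static int free_head;

void freelist_init(void) {
    for (int i = 1; i <= NENTRIES; i++)
        next_free[i] = (i < NENTRIES) ? i + 1 : 0;
    free_head = 1;
}

/* Broken: seen_head was read before the lock was taken, so it may be stale. */
int reserve_with_stale_head(int seen_head) {
    /* the lock would be held from here on */
    free_head = next_free[seen_head];
    return seen_head;
}

/* Safer: the head is read only while the lock is held. */
int reserve_under_lock(void) {
    /* the lock would be held from here on */
    int idx = free_head;
    if (idx != 0)
        free_head = next_free[idx];
    return idx;
}
```

If two callers both sample `free_head` and then call `reserve_with_stale_head` in turn, both are handed the same entry, even though each update ran under the lock. On real SMP hardware the stale value may also come from a compiler-cached register copy, which this sketch does not model.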
Stephen Hoffman
HoffmanLabs LLC
07-15-2008 03:36 PM
Re: How wide is SYS$SETRWM
"Insunt interdum menda in eo quod est efficax"
Literal translation...
"There are sometimes flaws in that which is efficacious"
In other words:
Just because it (apparently) "works" doesn't mean it's correct.
I've seen literally hundreds of cases where code "that's been working for years" breaks when moved onto an SMP system, a different architecture, better optimizer or even just a faster system. In all those cases, not a single instance was due to a bug in the underlying hardware or OS. It's ALWAYS a bug in the synchronization assumptions made in the failing code, which have been protected from exposure by the architecture, speed or uniprocessor environment.
You need to step back, analyze the problem and implement proper synchronization.
My advice is to abstract the shared data structure into a module where the details of the implementation are hidden, and the operations required by the application are exported as a callable interface. A first-cut implementation should use something you know you can rely on, like RMS. Once you have the application working, you can then fiddle with, tune, or completely change the implementation without having to change any code in the application.
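As a rough sketch of that kind of module boundary: the application sees only an opaque handle and a few calls, and everything behind them can be swapped later. All names here (`shq_t`, `shq_open`, `shq_put`, `shq_get`) are hypothetical, the in-memory ring buffers stand in for whatever the first cut really uses (RMS in the advice above), and `shq_lock`/`shq_unlock` are empty placeholders for the real lock calls.

```c
/* ---- interface: all the application is meant to see ---- */
typedef struct shq shq_t;                 /* opaque handle */
shq_t *shq_open(void);
int shq_put(shq_t *q, int queue_id, int value);   /* 1 = ok, 0 = full or bad id */
int shq_get(shq_t *q, int queue_id, int *value);  /* 1 = ok, 0 = empty or bad id */

/* ---- implementation: private, free to change ---- */
#define SHQ_QUEUES 4                      /* assumed sizes for the sketch */
#define SHQ_DEPTH  8

struct shq {
    int data[SHQ_QUEUES][SHQ_DEPTH];
    int head[SHQ_QUEUES];
    int count[SHQ_QUEUES];
};

static struct shq the_queue;

static void shq_lock(void)   { /* placeholder: acquire the real lock here */ }
static void shq_unlock(void) { /* placeholder: release the real lock here */ }

shq_t *shq_open(void) { return &the_queue; }

int shq_put(shq_t *q, int queue_id, int value) {
    int ok = 0;
    shq_lock();
    if (queue_id >= 0 && queue_id < SHQ_QUEUES && q->count[queue_id] < SHQ_DEPTH) {
        int tail = (q->head[queue_id] + q->count[queue_id]) % SHQ_DEPTH;
        q->data[queue_id][tail] = value;
        q->count[queue_id]++;
        ok = 1;
    }
    shq_unlock();
    return ok;
}

int shq_get(shq_t *q, int queue_id, int *value) {
    int ok = 0;
    shq_lock();
    if (queue_id >= 0 && queue_id < SHQ_QUEUES && q->count[queue_id] > 0) {
        *value = q->data[queue_id][q->head[queue_id]];
        q->head[queue_id] = (q->head[queue_id] + 1) % SHQ_DEPTH;
        q->count[queue_id]--;
        ok = 1;
    }
    shq_unlock();
    return ok;
}
```

Because every access goes through these calls, the locking lives in exactly one place and can be audited or replaced without touching the callers.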
07-15-2008 11:53 PM
Re: How wide is SYS$SETRWM
As you write:
[I'd be willing to bet your circa 25 year old code is broken]
It was my first assumption (and still is).
The code lives in 5 functions, and only those functions use the struct.
We have:
*Reserve entry - Get a free place
*Enqueue entry - Place the entry in the free place
*Dequeue entry - Get data from pointed entry
*Requeue entry - Move data from one queue to another queue (the data entry stays in the same place; only the forward and backward links are updated)
*Final entry - Remove the entry from the array and mark that part of the array as free
So if I understand you right, the cache problem can't be as I describe. Hmmm.
We previously added a Setlogical in the lock request, and yesterday that showed us that there aren't any deadlocks that we missed when this happens. I mean that we do hold a lock when we enter the code that changes the data. The only difference with the Setlogical placed directly after the enqw is that it took about 60 days before the problem happened; before, it happened 2-3 times every week.
A simplified view of the struct:
struct qhead {
    int first; /*the first in queue */
    int last; /*the last in queue */
};
struct qentry {
    int que; /*the queue id*/
    int forward; /* next if 0 last in queue*/
    int backward; /*previous if 0 first in queue*/
    int in_use; /*set if deq is using it */
};
struct share {
    struct qhead qh[num_of_queues];
    struct qentry *que[max_size];
    int free_queue_entry;
};
So qhead is an array of about 30 queues.
The que[] array is first set up as a linked list at init, with free_queue_entry = 1.
We take locks on free_queue_entry and on the qid.
Very simple...
Tomorrow starts my vacation..... I'll be back in August.... thx for answers.
07-16-2008 01:39 AM
Re: How wide is SYS$SETRWM
Purely Personal Opinion
07-16-2008 05:12 AM
Re: How wide is SYS$SETRWM
07-16-2008 12:11 PM
Re: How wide is SYS$SETRWM
The fact that adding code reduced the frequency of problems implies that there is a race condition.
You may want to consider using some of the techniques that are used to debug shared structures. For example, write a known value into the entry before it is released to the free list, and check for this value when you reserve a new entry, before it is used.
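A minimal sketch of that known-value check, assuming an index-linked free list where 0 means "none". The `guard` field, the `FREE_PATTERN` value and the function names are made up for illustration, and the double-release check is an extra that goes slightly beyond the suggestion. The caller is assumed to hold the free-list lock around each call.

```c
#define POOL_SIZE    8
#define FREE_PATTERN 0x46524545   /* arbitrary marker value ("FREE" in ASCII) */

struct entry {
    int guard;      /* FREE_PATTERN while the entry sits on the free list */
    int next_free;  /* index of the next free entry, 0 = none */
    int payload;
};

static struct entry pool[POOL_SIZE + 1];  /* index 0 is never handed out */
static int free_head;
static int corruption_count;              /* bumped whenever a check fails */

void pool_init(void) {
    for (int i = 1; i <= POOL_SIZE; i++) {
        pool[i].guard = FREE_PATTERN;
        pool[i].next_free = (i < POOL_SIZE) ? i + 1 : 0;
    }
    free_head = 1;
    corruption_count = 0;
}

/* Returns an entry index, 0 if the list is empty, -1 if the guard is wrong. */
int reserve_entry(void) {
    int idx = free_head;
    if (idx == 0)
        return 0;
    if (pool[idx].guard != FREE_PATTERN) {   /* somebody is still using it */
        corruption_count++;
        return -1;
    }
    pool[idx].guard = 0;
    free_head = pool[idx].next_free;
    pool[idx].next_free = 0;
    return idx;
}

/* Returns 0 on success, -1 if the entry was already on the free list. */
int release_entry(int idx) {
    if (pool[idx].guard == FREE_PATTERN) {   /* double release */
        corruption_count++;
        return -1;
    }
    pool[idx].guard = FREE_PATTERN;
    pool[idx].next_free = free_head;
    free_head = idx;
    return 0;
}
```

In a real run you would log the process ID and a timestamp at the point where a check fails, since the failure is rare and the evidence is gone a moment later.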
Have fun on your vacation.
Jon
07-16-2008 04:14 PM
Re: How wide is SYS$SETRWM
Thanks for sharing the structures and 5 functions.
It does help but, how to put this kindly, it is NOT confidence inspiring.
>> int in_use; /*set if deq is using it */
Scary thought that you might need that.
Sounds like a hack to try to prevent the worst of a timing problem.
If DEQ is using it, then it is locked. Case closed. No flag can help.
>> /* next if 0 last in queue*/
That's sort of non-standard, making the end different from the rest. Typically you just make it point onwards to the header! Typically an empty queue points to itself, not to nothing (although admittedly for self relative, 'nothing' is hard to distinguish from 'self' :-)
The queue header struct suggests that there may be a 'memory barrier/cache line' issue. It has to do with being able to declare a variable volatile or not.
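For what it's worth, here is a small sketch of the conventional circular layout using array indices rather than pointers. It assumes the queue headers live in the first slots of the same array as the entries, which is not how the posted struct is laid out; all names are hypothetical, and locking is left to the caller.

```c
#define NUM_QUEUES  2
#define NUM_ENTRIES 4

struct link {
    int forward;    /* next node; the last entry points at the header */
    int backward;   /* previous node; the first entry points at the header */
};

/* Slots 0..NUM_QUEUES-1 are headers, the remaining slots are entries. */
static struct link node[NUM_QUEUES + NUM_ENTRIES];

/* An empty queue's header points to itself in both directions. */
void queue_init(int q) {
    node[q].forward = q;
    node[q].backward = q;
}

int queue_empty(int q) {
    return node[q].forward == q;
}

void insert_tail(int q, int idx) {
    int last = node[q].backward;
    node[idx].forward = q;
    node[idx].backward = last;
    node[last].forward = idx;
    node[q].backward = idx;
}

/* Works for first, middle and last entries alike: no end-of-queue case. */
void unlink_entry(int idx) {
    int f = node[idx].forward;
    int b = node[idx].backward;
    node[b].forward = f;
    node[f].backward = b;
    node[idx].forward = idx;
    node[idx].backward = idx;
}

/* Returns the removed entry's index, or -1 if the queue is empty. */
int remove_head(int q) {
    int idx = node[q].forward;
    if (idx == q)
        return -1;
    unlink_entry(idx);
    return idx;
}
```

A requeue then becomes `unlink_entry(idx)` followed by `insert_tail(other_q, idx)`, with no branches for "first" or "last", which leaves fewer places for a link update to be missed.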
I would suggest bumping the size of each queue header with enough fill bytes to make it larger than a cache line. It will help low level performance. You make each header 128 bytes or so (or was it 64, I forget... you look it up!). Can't hurt, might fix it.
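A sketch of the fill-byte idea, assuming a 64-byte cache line (check the figure for the actual processor; `CACHE_LINE`, `padded_qhead` and `NUM_QUEUES` are names invented here):

```c
#define CACHE_LINE 64   /* assumption: look up the real line size */
#define NUM_QUEUES 30   /* "like 30 queues" per the earlier post */

struct qhead {
    int first;  /* the first in queue */
    int last;   /* the last in queue */
};

/* Each header is followed by fill bytes so that consecutive headers
   start CACHE_LINE bytes apart instead of sitting side by side. */
struct padded_qhead {
    struct qhead head;
    char fill[CACHE_LINE - sizeof(struct qhead)];
};
```

Note that padding alone only spaces the headers one line apart. For each header to occupy a line of its own, the array must also start on a cache-line boundary, which depends on how the shared section is mapped.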
Enjoy the vacation,
Hein.