Operating System - OpenVMS
Brian Reiter
Valued Contributor

Shared interlocked queues

Good Morning Folks,

I've been looking at the various routines for interlocked queues (LIB$INSQHI, LIB$REMQHI, etc.) and according to the manuals they can be used for interprocess communication. Is this actually possible, or did I misread/misunderstand what the manual was saying?

cheers

Brian
P Muralidhar Kini
Honored Contributor

Re: Shared interlocked queues

Hi Brian,

>> Is this actually possible or did I misread/misunderstand what the manual
>> was saying.
Using interlocked queues ensures coordinated access to the queues by
multiple processes. You get synchronization between multiple processes
accessing the queue at the same time.

Check these links -
* Interlocks, Queues and Reentrancy?
http://h71000.www7.hp.com/wizard/wiz_6643.html

* LIB$INSQHI
http://h71000.www7.hp.com/doc/82final/5932/5932pro_031.html

* LIB$INSQHI programs
http://www.eight-cubed.com/examples/framework.php?file=lib_que.c

Hope this helps.

Regards,
Murali
Let There Be Rock - AC/DC
Ian Miller.
Honored Contributor

Re: Shared interlocked queues

Having a global section with a shared queue manipulated by the INSQHI/REMQHI routines is certainly one way of doing interprocess communication.

See also HP OpenVMS Programming Concepts Manual
http://h71000.www7.hp.com/doc/82final/5841/5841pro.html
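
To illustrate, here's a minimal sketch in C of creating such a section - a pagefile-backed global section via SYS$CRMPSC. The section name and size are invented for the example, error handling is abbreviated, and the first caller creates the section (other processes can map it by name, e.g. with SYS$MGBLSC).

/* Sketch only: create and map a pagefile-backed global section. */
#include <descrip.h>
#include <secdef.h>
#include <ssdef.h>
#include <starlet.h>
#include <stdio.h>

int main(void)
{
    $DESCRIPTOR(gsdnam, "MYAPP_QUEUE_SECTION");  /* invented name */
    unsigned int inadr[2] = {0x10000, 0x10000};  /* P0 space; SEC$M_EXPREG picks the address */
    unsigned int retadr[2];
    unsigned int status;

    status = sys$crmpsc((void *) inadr, (void *) retadr,
                        0,                         /* acmode: caller's mode */
                        SEC$M_GBL | SEC$M_PAGFIL | /* global, pagefile-backed */
                        SEC$M_WRT | SEC$M_EXPREG,  /* writable, system-chosen address */
                        &gsdnam,                   /* global section name */
                        0, 0, 0,                   /* ident, relpag, chan */
                        16,                        /* pagcnt: 16 pagelets = 8KB */
                        0, 0, 0);                  /* vbn, prot, pfc */
    if (!(status & 1))
        return status;

    printf("section mapped at %08X..%08X\n", retadr[0], retadr[1]);
    return SS$_NORMAL;
}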

____________________
Purely Personal Opinion
Hoff
Honored Contributor

Re: Shared interlocked queues

These interlocked queues are re-entrant and operate within shared-memory global sections. There's a code example of queue calls here:

http://h71000.www7.hp.com/wizard/wiz_6984.html

It's common to see these primitives used while maintaining work and free queues across ASTs and mainlines and across processes within global sections, for instance.

If you're using C, there are compiler built-ins for interlocked operations that might be of interest here. (That avoids the RTL call.)

Shared memory does not provide event notification:

http://h71000.www7.hp.com/wizard/wiz_2637.html

A common scheme has a free queue and a work queue, and ASTs flying around to field I/O into or out of buffers in a process section or in shared memory (there's not a significant difference between process and global memory) allowing you to operate without additional interlocks. It's common to see this mixed with a $hiber and $wake scheme.

I've posted 64-bit section example C code here:

http://labs.hoffmanlabs.com/node/1413

Using 64-bit address space keeps (big) sections out of the very limited P0 address space on OpenVMS Alpha and OpenVMS I64.

If you're going to roll your own shared memory, this /ASTs and interlocked queues/ scheme would be a common solution to asynchronous requirements, rather than the use of attention ASTs and such. The intent is to grab the data and enqueue it (or dequeue the data and transmit it), avoiding the extra effort and complication of an attention AST. This applies to design questions such as this recent example:

http://h30499.www3.hp.com/t5/Languages-and-Scripting/terminal-QIO-test-for-read/m-p/4758206#M8142


In the current era, I'd tend to look to a higher-level and preferably network-capable library or interface for resolving these and related requirements. Preferably portable, too. The memory- and cache-level interfaces are fairly fussy around alignment and atomicity and bugs tend to be obscure, there's the lack of notifications mentioned, and the single-host nature of this interface. (The interlocked calls help with this, but you're still rolling your own communications protocol.)

Anyway, the power just failed. Again. Posting this from batteries. So this is a little terse.

Hein van den Heuvel
Honored Contributor

Re: Shared interlocked queues


Hi Brian,

As the prior replies note, the interlocked queue tools are just a small (but critical) component of an interprocess communication method using shared memory. Potentially fast, but limiting!

What problem are you really trying to solve?

Before you code anything, please be sure to check out the totally under-recognized but very powerful OpenVMS "Intra-Cluster Communication" tools:

"Intra-cluster communication (ICC), available through ICC system services, forms an application program interface (API) for process-to-process communications. For large data transfers, intra-cluster communication is the highest performance OpenVMS application communication mechanism, better than standard network transports and mailboxes."

http://h71000.www7.hp.com/doc/73final/5841/5841pro_008.html

Better still, step back and articulate your need ... for speed and functionality.

- How many messages/second?
- How many MB/second ?
- Within the world? sockets!
- Within a cluster? ICC?, RMS shared file Records?
- Strictly within the node? Global section, Mailbox, RMS,
- Within a Numa domain? Global section

Good luck!

Hein van den Heuvel ( at gmail )
HvdH Performance Consulting
John Gillings
Honored Contributor

Re: Shared interlocked queues

Brian,

Adding my 2c: yes, it's possible to use interlocked queues for interprocess communication, but it's very fiddly. It's a bit like saying "These are bricks, you can use them to build a house".

The statement is true, but it omits to mention all the other work required to achieve the goal.

The basic principle is that the queue is in shared memory. Part of the queue element structure is used as an interlock, the mechanics of which are hidden by the INSQHI and REMQHI routines (which are themselves really just jackets around the corresponding VAX machine language instructions - on Alpha and Integrity they are implemented using lower-level primitives).
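
To make that concrete, here's a rough sketch of the shared structures and calls in C. The names are invented; per the LIB$ documentation, the header and every element must be quadword-aligned, and the links are self-relative offsets maintained by the routines - which is why the section does not have to be mapped at the same address in every process.

/* Sketch: a self-relative interlocked queue living in shared memory.
   'head' is assumed to sit at a quadword-aligned offset (offset 0 of
   a global section works) and to have been zeroed exactly once. */
#include <libdef.h>
#include <lib$routines.h>

typedef struct qentry {
    struct qentry *flink;    /* self-relative links, owned by the   */
    struct qentry *blink;    /* queue primitives - never touch them */
    int            length;   /* application payload from here on... */
    char           data[52]; /* ...sized to keep the element a      */
} qentry;                    /* multiple of a quadword              */

typedef struct qheader {
    struct qentry *flink;    /* zeroed once, before first use */
    struct qentry *blink;
} qheader;

/* Producer: insert a filled-in element at the head of the shared queue. */
int enqueue(qheader *head, qentry *ent)
{
    return lib$insqhi(ent, head);   /* retries the interlock internally */
}

/* Consumer: remove an element; LIB$_QUEWASEMP means nothing to do. */
int dequeue(qheader *head, qentry **ent)
{
    int status = lib$remqhi(head, (void **) ent);
    if (status == LIB$_QUEWASEMP)
        *ent = 0;
    return status;
}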

As Hein has suggested, you're probably better off using something higher up the process synchronisation food chain. Memory based mechanisms are limited to processes that can see the same memory, which puts some fairly severe constraints on scaling.

If you're happy to use VMS specific code, I'll second Hein's recommendation for ICC - an underrated and underused feature for building cluster wide applications. It's fast and cluster transparent, but, you have to accept that it's seriously non-portable, and a bit tricky to get your head around initially.

I'd suggest designing a process synchronisation layer that presents the most appropriate API for your application. Hide the details of how you implement it from your application code; that way you can change the mechanism, if necessary, without affecting the application logic.
A crucible of informative mistakes
John McL
Trusted Contributor

Re: Shared interlocked queues

Brian, what exactly are you trying to do? Are you trying to co-ordinate access to volatile data, or trying to communicate between processes?

Are you running on a cluster or on a standalone machine? The latter can optionally use some simple interprocess comms that apply to a single machine rather than a cluster.

I'm guessing that you are on a standalone machine because $INSQHI etc. don't operate across a cluster.

Also how much data, if any, is involved? Co-ordinating access to a small amount of data can be done via some methods that aren't available to larger amounts of data.

Please tell us more about what you are trying to do so that an appropriate course of action can be suggested.
Robert Gezelter
Honored Contributor

Re: Shared interlocked queues

Brian,

Yes, INSQHI and REMQHI do provide the tools to implement shared queues. However, for many years (actually decades) I have been recommending that they be approached with extreme caution.

Anytime that two programs are sharing an address space, there is a serious potential for subtle and painful problems.

As Hein and the two Johns have noted, ICC and other methods fit a wide variety of needs. Is there really justification for the potential hazards?

My personal favorite is often DECnet logical links, even within a single system. They are relatively fast (even on older hardware), and they provide a pre-packaged set of mechanisms to deal with related task terminations and other events. If the needed efficiency can be obtained, there is no reason to increase complexity.

In any event, when I implement systems, I hide that level deep under several layers of abstraction. Thus, if there is a need to change the underlying implementation, it can be done without code rework at higher levels.

- Bob Gezelter, http://www.rlgsc.com
Hoff
Honored Contributor

Re: Shared interlocked queues

Do not use ICC. Do not use DECnet.

Not until after you look carefully at your options and alternatives.

...At higher-level middleware libraries (commercial and open source).

...At a language with higher-level command and control and communications support than the socket- or channel-like primitives.

...And at network communications with IP and sockets.

Then look at ICC and DECnet.

Why? The ICC and DECnet interfaces are not portable, and these (and IP sockets, too) are all low-level network interfaces. Primitives. Networking tosses all manner of odd timing values and random disconnections into a design.

Longer-term, you'll need to rewrite or replicate the logic from most or all of the ICC and DECnet pieces you create here, if you want to add other hosts into your environment, or as part of a port.

Entirely your call and your project and your budget, of course. But don't walk into the creation of your own low-level parallel processing and networking code thinking "don't worry, be happy" thoughts. It is entirely possible to do this, sure. But if there's a race or a deadlock or a cache handling error or other bug in your design, you'll almost certainly get to find it. And some of these bugs can be really nasty to find.

And if you do start down this roll-your-own course, integrate debugging and tracing from the onset.

And details including directory services and resource location come into play too, particularly if you're creating an arbitrary and abstract library. (Think DECdns, DNS or DNS-SD, for instance.)

I've written this middleware. Abstracting communications and using DECnet and shared memory for same- and multi-host communications. It is entirely possible. Realize that this effort can easily grow past a small project. And definitely build in tracing and debugging.
Richard J Maher
Trusted Contributor

Re: Shared interlocked queues

Will VMS support InfiniBand (or a.n.other high-speed interconnect) and the socket(ish) API that appears to go with it?

Discussing the outside world again, sorry.

Cheers Richard Maher
Brian Reiter
Valued Contributor

Re: Shared interlocked queues

Hi Folks,

Well, after Monday's plumbing fiasco, I can now try and answer your questions as best I can (although the forum software makes it tricky to see the responses without resorting to notepad).

To be honest, I was just curious as to how it could be done (and wanted to see if there are any examples about); the LIB$ and Programming Concepts manuals mention that it could be done but were a tad hazy about the specifics.

The situation I have is that under extreme loads the rate of messages coming into the system can swamp the processes. All interprocess communication is done via mailboxes, and these mailboxes filling up causes delays across the board and the eventual loss of messages. This situation is meant to be rare, but it is also transient and may only last a few minutes. The system will cope until the last minute or so.

The plan is to use ASTs and the queue routines to keep the mailboxes empty and feed the main process from the new queue. This (as long as memory allows) should allow the system to get over the processing hump without losing any data.

Many thanks for your help - interesting as always


Brian
Hein van den Heuvel
Honored Contributor

Re: Shared interlocked queues

Hmmm,

But in a sense, mailboxes ARE an in-memory shared queue managed by interlocked queue instructions.

So if you plan to keep using the mailbox as base communication and then add a layer to it, things may just get worse!

Maybe you can just create the troublesome mailboxes with much more mailbox quota and/or stop waiting for the messages to be consumed?

Did you check Bruce Ellis's writeup?
http://h71000.www7.hp.com/openvms/journal/v9/mailboxes.pdf

Now, the QIO mechanism is not cheap, performance-wise.

So if you were to replace the QIO with an alternative method then you may well come out ahead, but it may be a significant investment to get there.

Proof of the price of a QIO?
Check out the NULL device. Cheap, right? No!
The 5MB example file below takes 10x longer to copy to NL: than to a disk file.
(1.3 GHz RX2600, cached file, timings in deciseconds)

Cheers,
Hein


$ creat/fdl="file; allo 10000; reco; size 10; form fix" fix.tmp
$ set file/end fix.tmp
$ @time copy fix.tmp nl:
Dirio=512082 Bufio= 11 Kernel= 549 RMS= 297 DCL=0 User= 37 Elapsed= 941
$ @time copy fix.tmp tmp.tmp
Dirio= 167 Bufio= 16 Kernel= 2 RMS= 0 DCL=0 User= 0 Elapsed= 87
$ @time copy fix.tmp nl:
Dirio=512082 Bufio= 11 Kernel= 534 RMS= 315 DCL=0 User= 27 Elapsed= 938

Robert Gezelter
Honored Contributor

Re: Shared interlocked queues

Hein,

With all due respect, the timing comparison between NL and a disk is flawed.

NL is a record oriented device. The disk copy is done in multiblock mode. This is comparing apples and oranges.

What was the format of the data in the file?

- Bob Gezelter, http://www.rlgsc.com
Hein van den Heuvel
Honored Contributor

Re: Shared interlocked queues

>> the timing comparison between NL and a disk is flawed.

I was not comparing them.
I was just showing the price of a QIO (to the NL: device).
About 1 ms of kernel time per QIO!
A mailbox QIO takes a fraction longer. See below.

>> NL is a record oriented device. The disk copy is done in multiblock mode. This is comparing apples and oranges.

Both fruits. Both copy all the data.
Some folks still believe using NL: speeds things up.
It might not... indeed because it is a record device.

>> What was the format of the data in the file?

The whole test was shown.
File as per FDL in the test: 512,000 records of 10 bytes.

Silly mailbox test below.
It just shows that adding a post-processor to a mailbox-based communication method is not likely to address fundamental issues.

fwiw,

Hein


$ cre/mail my_mbx
$ spaw/nowait/proc=hein_mbx @time copy my_mbx: nl:
%DCL-S-SPAWNED, process HEIN_MBX spawned
$ @time copy fix.tmp my_mbx
Dirio= 82 Bufio=512014 Kernel= 634 RMS= 269 DCL=0 User= 46 Elapsed= 1908
$
Dirio=512001 Bufio=512011 Kernel=1142 RMS= 450 DCL=0 User= 35 Elapsed=n.a.


Hein van den Heuvel
Honored Contributor

Re: Shared interlocked queues

Ooops, too quickly.
Got my math wrong.
It didn't feel right, but needed to get back to work.

Dirio=512082 ... Kernel= 534 Deci-second = 5340 ms.

So that is about 100 QIOs per ms.
Or 10 micro-second per QIO.
Much better!
Sorry.

Hein


John Gillings
Honored Contributor

Re: Shared interlocked queues

Brian,

As Hein has pointed out, a mailbox is pretty much what you're talking about implementing. Indeed, if you can find the sources for the mailbox driver, I'd expect you'll find some excellent examples of how to use the INSQHI and REMQHI instructions ;-)

If you're having trouble dealing with spikes in load, make sure your mailboxes have plenty of headroom. See "bufquo" parameter for $CREMBX. This used to be limited to absurdly low values (64K?), but since circa V7.3 it's now 32 bit, limited only by process BYTLM and system NPAGEDYN.

If your $CREMBX doesn't specify a buffer quota, it inherits DEFMBXBUFQUO which defaults to 1K (yes, *K*). If you were going to just shovel the messages out of a mailbox and into your own mailbox, you may as well make room for them in the system mailbox and save yourself the work. The only caveat is the system allocates mailboxes from NPAGEDYN, so the resource isn't quite as cheap as pageable virtual memory.
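
For instance, here's a sketch of specifying a generous quota explicitly; the logical name and the numbers are purely illustrative.

/* Sketch: create a temporary mailbox with an explicit buffer quota
   instead of inheriting DEFMBXBUFQUO. Error handling omitted. */
#include <descrip.h>
#include <ssdef.h>
#include <starlet.h>

int make_work_mbx(unsigned short *chan)
{
    $DESCRIPTOR(mbxnam, "MYAPP_WORK_MBX");  /* invented logical name */

    return sys$crembx(0,          /* prmflg: temporary mailbox        */
                      chan,       /* returned channel                 */
                      512,        /* maxmsg: largest single message   */
                      512 * 1024, /* bufquo: roughly a thousand       */
                                  /*   512-byte messages of headroom  */
                      0,          /* promsk: default protection       */
                      0,          /* acmode                           */
                      &mbxnam,    /* logical name other processes use */
                      0,          /* flags */
                      0);         /* nullarg */
}

Remember the caveat above: the quota comes out of NPAGEDYN, and it is also limited by the process BYTLM.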

That said, if you have a chain of processes that pass messages through mailboxes, as you've no doubt discovered, you don't want them synchronising with RWMBX (very expensive!).

The most obvious process design:

loop
    $QIOW mailbox READVBLK into buffer
    process buffer
endloop

can be a problem if processing potentially exceeds message interarrival time. You can move the spikes from the mailbox into local process virtual memory using a work queue design. It then becomes two threads, like this:


MailboxRead
    $QIO mailbox READVBLK into buffer AST MailboxAST
End MailboxRead

MailboxAST
    put buffer onto work queue
    MailboxRead
    $WAKE
End MailboxAST

MAIN
    MailboxRead
    Loop
        $HIBER
        $SETAST 0 ! block ASTs
        remove buffer from work queue
        $SETAST restore
        If gotbuffer THEN process buffer
    Endloop

Note there's no need for the work queue to be in shared memory and you don't need to use INSQHI/REMQHI, as you're using AST blocks to synchronise the threads. The AST thread can interrupt processing to add more buffers to the work queue.
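
Translated into C against the actual services, that skeleton might look roughly like this. All names are invented, mailbox channel assignment and error checking are omitted, and the work queue is ordinary process-private memory, per the note above.

/* Sketch: the AST thread appends buffers at the tail of a plain FIFO,
   the mainline removes from the head with ASTs blocked. */
#include <iodef.h>
#include <iosbdef.h>
#include <ssdef.h>
#include <starlet.h>
#include <stdlib.h>

#define BUFSIZE 256

typedef struct buf {
    struct buf  *next;
    struct _iosb iosb;
    char         data[BUFSIZE];
} buf;

static buf *q_head = 0, *q_tail = 0; /* plain FIFO - no INSQHI needed */
static unsigned short mbx_chan;      /* assigned elsewhere via sys$assign */

static void mailbox_read(void);

static void mailbox_ast(buf *b)      /* runs at AST level */
{
    b->next = 0;                     /* put buffer onto the work queue */
    if (q_tail) q_tail->next = b; else q_head = b;
    q_tail = b;
    mailbox_read();                  /* post the next mailbox read */
    sys$wake(0, 0);                  /* poke the mainline */
}

static void mailbox_read(void)
{
    buf *b = malloc(sizeof *b);
    sys$qio(0, mbx_chan, IO$_READVBLK, &b->iosb,
            (void (*)()) mailbox_ast, (__int64) b,
            b->data, BUFSIZE, 0, 0, 0, 0);
}

int main(void)
{
    buf *b;
    /* ... sys$assign the mailbox to mbx_chan here ... */
    mailbox_read();
    for (;;) {
        sys$hiber();
        do {
            sys$setast(0);           /* block ASTs around the removal */
            b = q_head;
            if (b) {
                q_head = b->next;
                if (!q_head) q_tail = 0;
            }
            sys$setast(1);           /* re-enable ASTs */
            if (b) { /* process b->data here, then */ free(b); }
        } while (b);
    }
}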
A crucible of informative mistakes
John McL
Trusted Contributor

Re: Shared interlocked queues

I think there may have been an example of this in some VMS documentation many years ago. The example might have been about an airline reservation system. I'm thinking the documentation could have been from around the 1985-88 timeframe, because I used the concept at a site that I worked at in late 1989.

The principles of that code are as follows:

Start by creating a mailbox, grabbing a bunch of buffers and putting them on the "free" list, then setting up a QIO read on the mailbox using one of the free buffers and with an AST routine.

The QIO AST routine puts the mailbox buffer onto an "active" list, then sets up the next QIO read with AST (using one of the "free" buffers) before exiting.

The main code takes the next buffer off the active list, processes it and then puts the used buffer onto the free list.

You'll see some similarity between the initial code and the AST code but because the AST code operates after that initial code (i.e. there's no chance of simultaneous access) there's no reason why the same routine - expand freelist if required, take buffer from freelist, set-up QIO with AST - can't be used for both.

I used LIB$INSQTI (NB. "tail") to handle putting buffers onto the two lists (and LIB$REMQHI to take them off) because I didn't want the main code halfway through putting a buffer onto the free list when the AST routine jumped in and wanted to take a buffer off that list or the other way around with the active list. (As John G says, the other way to do this is to have the main code disable ASTs while taking buffers off lists or putting them on.)
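
For illustration, those two lists might look like this in C (invented names; per the LIB$ documentation, the list heads must be quadword-aligned and zero-initialized):

/* Sketch: FIFO free/active lists - LIB$INSQTI inserts at the tail,
   LIB$REMQHI removes at the head, so message order is preserved. */
#include <libdef.h>
#include <lib$routines.h>

typedef struct qbuf {
    struct qbuf *flink, *blink; /* self-relative links, owned by the RTL */
    char         data[56];
} qbuf;

static __int64 free_list[2]   = {0, 0}; /* quadword-aligned list heads */
static __int64 active_list[2] = {0, 0};

/* AST routine: append a filled buffer to the tail of the active list. */
void make_active(qbuf *b)
{
    lib$insqti(b, active_list);
}

/* Main code: take the oldest active buffer, or NULL if the list is empty. */
qbuf *next_active(void)
{
    qbuf *b;
    if (lib$remqhi(active_list, (void **) &b) == LIB$_QUEWASEMP)
        return 0;
    return b;
}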
GuentherF
Trusted Contributor

Re: Shared interlocked queues

"...extreme loads...mailboxes filling up..."

That means your configuration doesn't have enough oomph to handle the burst. Find out which part of your configuration is the (current) bottleneck. CPU? I/O? Artificial waits?

"...loss of messages."
That's not caused by the OS's mailbox facility. Your application has a logic error that causes this.

/Guenther
Hoff
Honored Contributor

Re: Shared interlocked queues

This reeks of an underpowered and/or overloaded system, and a case where latent application bugs are revealed by the load.

Use DECset PCA to profile the application code, looking for wall-clock and processor usage. (I'd probably also look at bigger hunks of data; processing individual records from a mailbox or from a file is a slow technique.)

Look for synchronization and coding bugs. Omitting IOSBs or mishandling IOSBs and omitting return and IOSB status checks are a common trigger of these cases.

http://h71000.www7.hp.com/wizard/wiz_1661.html

Mechanisms that provide guaranteed message delivery can generate application-wide wedgies, too :-) - and this happens when the slowest part of the application configuration is overrun. Your job: find the message loss (that's likely a bug), and find the slowest part of the application.
Richard J Maher
Trusted Contributor

Re: Shared interlocked queues

Hi Brian,

For those cases (and there are many) when the consumer just cannot keep up with the producer, I/we have opted for a lightweight write-to-disk consumer that does nothing more than record/persist the phone call/transaction/trade and maintain the last-available number in a lock value block.

The ultimate downstream consumer can then read sequentially through this work queue, which can cater for the highest peaks and deepest troughs.

Mailboxes are very limited! Interactive users are fine but PABX or Switch traffic is too much. Horses for courses. Just make sure the event (trade, txn, call) is persistent (unlike the ASX :-) and everything else is lazyable.
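
For the lock-value-block part, a rough sketch in C - the resource name is invented, error checks are abbreviated, and per the $ENQW/$DEQ documentation the value block is written back when the lock is released from EX mode with LCK$M_VALBLK:

/* Sketch: publish/read a sequence number through a lock value block. */
#include <descrip.h>
#include <lckdef.h>
#include <ssdef.h>
#include <starlet.h>

typedef struct {
    unsigned short status;    /* completion status        */
    unsigned short reserved;
    unsigned int   lock_id;
    unsigned int   valblk[4]; /* 16-byte lock value block */
} lksb_t;

/* Writer: take the lock EX, stamp the number, release writing the block. */
int publish_seqno(unsigned int seqno)
{
    $DESCRIPTOR(resnam, "MYAPP_LAST_SEQNO");  /* invented resource name */
    lksb_t lksb;
    int status = sys$enqw(0, LCK$K_EXMODE, (void *) &lksb, LCK$M_VALBLK,
                          &resnam, 0, 0, 0, 0, 0, 0, 0);
    if (!(status & 1)) return status;
    lksb.valblk[0] = seqno;
    return sys$deq(lksb.lock_id, (void *) lksb.valblk, 0, LCK$M_VALBLK);
}

/* Reader: take the lock PR; the granted lock carries the latest block. */
int read_seqno(unsigned int *seqno)
{
    $DESCRIPTOR(resnam, "MYAPP_LAST_SEQNO");
    lksb_t lksb;
    int status = sys$enqw(0, LCK$K_PRMODE, (void *) &lksb, LCK$M_VALBLK,
                          &resnam, 0, 0, 0, 0, 0, 0, 0);
    if (!(status & 1)) return status;
    *seqno = lksb.valblk[0];
    return sys$deq(lksb.lock_id, 0, 0, 0);
}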

Cheers Richard Maher
Robert Gezelter
Honored Contributor

Re: Shared interlocked queues

Brian,

In many ways I agree with Hoff (with respect to hiding the actual implementation below main level visibility). I must also agree with Richard (with respect to persistent queues, to wit disk queuing rather than in memory queuing).

On a VERY SHORT TERM basis, consider looking at the present implementation's creation of mailboxes. What are the parameters specified? $CREMBX allows control of both message size and the bytes allocated to the mailbox. It is possible that one or more mailboxes were created with insufficient values for the present workload. In that case, it is possible to increase the existing system's instantaneous surge capacity to tolerate a higher spike (this is somewhat akin to surge tanks in plumbing and capacitors in electrical circuits).

Longer term, I would want to look somewhat in the persistent queuing direction, to ameliorate the dangers posed by a crash at an inopportune moment. I would also want to establish some instrumentation so that the actual activity of the system can be measured and recorded (probably in a format compatible with T4, so that it can be integrated with overall system performance information).

I would also consider whether some outside technical review would be appropriate. Choosing a sound direction is an important determinant of long-term costs, a fact that I noted in many of my architectural presentations for the IEEE Computer Society under the auspices of its Distinguished Visitor Program (see "Architectural Techniques for Interoperability and Coexistence"; slides at http://www.rlgsc.com/ieee/ottawa/2006-11/swarch.html). [Disclosure: We provide services in this area, as do several other active participants in this forum.]

- Bob Gezelter, http://www.rlgsc.com