Operating System - OpenVMS

Using LIB$BBSSI and BBCCI for locking

 
John McL
Trusted Contributor

Using LIB$BBSSI and BBCCI for locking

I'd like to move certain code from using VMS locks to using LIB$BBSSI and LIB$BBCCI. (In this case the locking operates on a single node, not cluster-wide across our 4-node cluster.) I have an old copy of the internals book for VAX/VMS V5.2, which might not be entirely accurate for Alpha or IA64, so I want to throw some questions to the forum.
(a) What are the overheads?
(b) What's the performance gain over normal locks? (Quantified if possible, e.g. "about 100x faster", rather than subjective.)
(c) Any gotchas I should be aware of?

We run several hundred images and have a large number of users, so contention can be a real issue.
29 REPLIES
John Gillings
Honored Contributor
Solution

Re: Using LIB$BBSSI and BBCCI for locking

Hi John,

I assume you're talking about implementing your own spin locks?

LIB$BBSSI works the same on all architectures, but the underlying instructions are different.

Performance gains or losses depend on contention. If you have very little contention and short critical regions (that is, most requests for the lock are granted immediately and locks are only held briefly), spinlocks can be exceptionally fast. But if you have high levels of contention and/or long critical regions, performance can be terrible, with waiting processes burning CPU, and the more processes join the mix, the worse it gets.

It's impossible to give you a simple number as it's entirely dependent on load and contention.
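For reference, the basic shape is small. Here's a minimal sketch in C (lockword, lock_bit and do_critical_work are names I've invented, not anything in your application):

    #include <lib$routines.h>           /* lib$bbssi, lib$bbcci */

    static int lockword = 0;            /* shared cell, e.g. in a global section */
    static int lock_bit = 0;            /* which bit of the cell is the lock */

    static void do_critical_work(void) { /* your critical region here */ }

    void with_bitlock(void)
    {
        /* LIB$BBSSI returns the bit's previous state: 0 means we just set it
           and now own the lock; 1 means someone else holds it, so we spin
           (burning CPU) until it comes free. */
        while (lib$bbssi(&lock_bit, &lockword))
            ;                           /* busy wait - keep the region short! */

        do_critical_work();

        /* LIB$BBCCI clears the bit, releasing the lock. */
        lib$bbcci(&lock_bit, &lockword);
    }

Simple enough, but the gotchas below are all about why that spin loop can bite you.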

Gotchas? There are lots and lots of them. Just looking at one - priority equalisation. Spinlocks don't work well with processes at different priorities. On uniprocessors this is fatal, as a higher priority requesting process will starve out a lower priority process holding the lock. Deadlock and dead system. On multiprocessors this isn't necessarily fatal, unless you get a low priority process holding the lock and N higher priority processes requesting it (where N is the number of available processors), but you can still end up with strange behaviour.

To work around this you should equalise priorities while requesting the lock, so there's your first overhead. Two calls to $SETPRI for each lock request. If you decide to skip this, assuming all your processes will always be at the same priority, you'd better make sure it's clearly and LOUDLY documented, or someone somewhere down the track will start a batch job at priority 3 and everything will break!
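Sketched out, that overhead looks something like this (LOCK_PRI and the wrapper routine are my own inventions; $SETPRI and the bitlock routines are the real calls):

    #include <starlet.h>                /* sys$setpri */
    #include <lib$routines.h>           /* lib$bbssi, lib$bbcci */

    #define LOCK_PRI 6                  /* agreed "lock" priority - an assumption */

    static int lockword = 0, lock_bit = 0;

    void locked_region(void (*body)(void))
    {
        unsigned int old_pri;

        /* First $SETPRI: move to the agreed priority so a holder can't be
           starved by a higher-priority spinner (pidadr = 0 means self). */
        sys$setpri(0, 0, LOCK_PRI, &old_pri);

        while (lib$bbssi(&lock_bit, &lockword))
            ;                           /* spin */

        body();                         /* critical region */

        lib$bbcci(&lock_bit, &lockword);

        /* Second $SETPRI: restore the caller's original priority. */
        sys$setpri(0, 0, old_pri, 0);
    }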

NUMA can also do odd things, causing asymmetries.

Typically synchronisation mechanisms are layered, so that the lowest level, "busy wait" mechanisms are used only for very short duration locking of data structures that implement higher level mechanisms like semaphores or VMS style locks. You can use this principle to build your own mechanisms, but you'll soon discover you're just reimplementing the lock manager.

This is non-trivial stuff... The worst issue is you won't necessarily know if there's some tiny timing window waiting to catch you at the least opportune time.

I'd want to see some very strong evidence that there is a real problem with the lock manager before delving into lower level synchronisation mechanisms. What do you think can be improved?

If you really want to do this, I'd recommend building a layer of code that implements an "ideal" locking API for your application, without revealing the underlying mechanism.

Implement it first using the lock manager, for simplicity and robustness. Once that's working, implement a version using spin locks and see if you get any measurable improvement.
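As a rough sketch of what I mean (app_lock/app_unlock and the struct name are mine; $ENQW, $DEQ and the lock status block layout are the standard ones):

    #include <string.h>
    #include <descrip.h>
    #include <lckdef.h>
    #include <starlet.h>

    /* Standard lock status block layout. */
    struct lksb {
        unsigned short status;
        unsigned short reserved;
        unsigned int   lock_id;
        char           value[16];
    };

    /* Application-facing API. Today it sits on the lock manager; later the
       bodies could be swapped for spin locks without touching any caller. */
    int app_lock(const char *name, struct lksb *lksb)
    {
        struct dsc$descriptor_s resnam;

        resnam.dsc$w_length  = (unsigned short) strlen(name);
        resnam.dsc$b_dtype   = DSC$K_DTYPE_T;
        resnam.dsc$b_class   = DSC$K_CLASS_S;
        resnam.dsc$a_pointer = (char *) name;

        /* Exclusive mode, waiting until granted. */
        return sys$enqw(0, LCK$K_EXMODE, lksb, 0, &resnam,
                        0, 0, 0, 0, 0, 0, 0);
    }

    int app_unlock(struct lksb *lksb)
    {
        return sys$deq(lksb->lock_id, 0, 0, 0);
    }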
A crucible of informative mistakes
Hoff
Honored Contributor

Re: Using LIB$BBSSI and BBCCI for locking

ITRC glitched again; this is a second attempt to post this.

I'm with John G. here.

Though there are no details on the complexity of the application, a switch-over like this is likely to be a large project in an application with several hundred images and masses of active users, and you'll need to ensure you're removing the right roadblocks.

Some evidence that locks are the limiting issue here would be a prerequisite, and a look at moving toward sharding or toward finer granularity of the locks would (also) be in order, as would re-architecting the locks required and the lock sequences. In particular, make sure you don't actually have a critical-path problem here; a case where a serial code sequence gates everything else. (qv: Amdahl's Law, et al. http://labs.hoffmanlabs.com/node/900 and http://labs.hoffmanlabs.com/node/638 among others.)
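For reference, the back-of-the-envelope form of Amdahl's Law: if a fraction p of the work can proceed in parallel and the rest is serialized (say, under one lock), the best possible speedup over N streams is

    speedup = 1 / ((1 - p) + p / N)

With p = 0.9 and N = 4 that tops out around 3.1x, no matter how cheap each individual lock operation gets; the serial fraction sets the ceiling, not the lock primitive.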

In addition to the priority inversion deadlocks John G mentioned, there are more direct deadlocks that can (also) arise here and you'll want or need to code deadlock scans for those. (The lock manager does these scans for you.)

Various applications use sequences of lock acquisition and conversions and releases, and use locks as notification "doorbells" in various designs; features that aren't available with bitlocks. You'll need to find any of those, and figure out how to implement the notifications.

BBCCI and BBSSI also involve the memory controller on some of the boxes; you'll need to be careful around the controller granularity and the bitlock memory locations here as you can end up with subtle lock contention. (With OpenVMS on Alpha or Itanium, you are hitting far fewer instructions than with the lock management calls, though you're getting a memory barrier or a memory fence; these calls are lighter-weight.)
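One mitigation sketch, in C; the 64-byte figure and the structure name here are assumptions, so check the granule size for your particular hardware:

    #define GRANULE 64                  /* assumed interlock/cache granule size */

    /* One bitlock per protected object, each alone in its own granule, so
       two hot locks never ping-pong the same block of memory between CPUs
       or memory controllers. */
    struct padded_bitlock {
        int  lockword;                  /* bit 0 is the lock */
        char pad[GRANULE - sizeof(int)];
    };

    static struct padded_bitlock table_locks[4];   /* e.g. one per table/shard */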

This migration to bitlocks also means you're single-host from here on out, or that you are now rolling your own distributed synchronization. Or both.

For an application structural change this fundamental, I'd likely want to look at the whole design of the application, and instrument the current environment. (This work is a sizable chunk of a full platform port, in practical terms.) If a locking rewrite is on the table, then the whole design of the application is (also) on the table. And I'd also look at where I wanted to end up longer-term, whether that's an application locking layer, or a redesign of how the data rolls and roils through the application environment.

The abstraction layering John G. mentions is a classic OpenVMS application design. That's entirely reasonable here, and I might well look to go further, given the scale of the changes involved.

Robert Gezelter
Honored Contributor

Re: Using LIB$BBSSI and BBCCI for locking

John,

I have to agree with John and Hoff: Be careful. There is much potential for a large cost with debatable gain.

The first questions that I would ask are:

- Is a lot of time being spent processing locks?
- Is there a lot of contention?

If the delays are being caused by contention, then the gains by changing mechanisms are limited. The solution to contention is not to change the locking mechanism, but to take a careful look at what is protected by what lock and break that into different locks. This was seen in the changes in recent TCP/IP Services releases relating to IOLOCK8. At the user level, the concept is the same.
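As a sketch of what splitting a lock can look like at the user level (the bucket count, the resource-name prefix, and the routine name are all illustrative, not from any real application):

    #include <stdio.h>
    #include <string.h>
    #include <descrip.h>
    #include <lckdef.h>
    #include <starlet.h>

    #define NBUCKETS 16

    struct lksb { unsigned short status, reserved;
                  unsigned int lock_id; char value[16]; };

    /* Lock only the bucket covering this key, leaving the other buckets
       free for other processes to use concurrently. */
    int lock_table_bucket(unsigned int key, struct lksb *lksb)
    {
        char name[32];
        struct dsc$descriptor_s resnam;

        sprintf(name, "MYAPP_TABLE_%02u", key % NBUCKETS);

        resnam.dsc$w_length  = (unsigned short) strlen(name);
        resnam.dsc$b_dtype   = DSC$K_DTYPE_T;
        resnam.dsc$b_class   = DSC$K_CLASS_S;
        resnam.dsc$a_pointer = name;

        return sys$enqw(0, LCK$K_EXMODE, lksb, 0, &resnam,
                        0, 0, 0, 0, 0, 0, 0);
    }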

Performance monitoring using T4 or similar tools to gather statistics is paramount as a first step. If the performance monitoring shows that locking is an issue, the sequence of steps is:

- Tune Lock Manager performance
- Consider the use of Dedicated Lock Manager (a CPU in a multiprocessor dedicated to running the Lock Manager).
- Review the relationships between Lock Manager resources and whether this is creating contention needlessly
- Only then consider whether one should use low level spin-lock mechanisms

The above sequence also roughly corresponds to the cost and risk associated with each set of measures. Tuning is the lowest-risk and least expensive; a full restructuring of the code and debugging of spin-lock mechanisms can be expensive and very demanding.

- Bob Gezelter, http://www.rlgsc.com
John McL
Trusted Contributor

Re: Using LIB$BBSSI and BBCCI for locking

John, Hoff and Bob, thanks for your comments and I'll certainly reflect on them.

Maybe if I tell you a little more about the situation you'll better understand where I'm coming from and why I'm considering bitlocks to replace some, but certainly not all, of our locking.

We have a 4-node Alpha cluster whose CPUs all run at about 100% for roughly 6 hours during the evening batch processing. The ENQ/DEQ rate for much of this time seems to average about 50,000 per second, and MPSYNCH on one processor (or maybe one node, I can't recall right now) is around 90%.

There are instances where the locking is trivial - e.g. to assign space in a table in a global section (obviously on one node) - so I'm investigating whether situations like this would be better as bitlocks rather than bouncing lock information around the cluster.
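To make the "trivial" case concrete, the sort of thing I have in mind looks roughly like this (the table layout, MAXSLOTS and claim_slot are just my sketch; the lockword would live in the same global section as the table):

    #include <lib$routines.h>           /* lib$bbssi, lib$bbcci */

    #define MAXSLOTS 1000

    /* Sketch of the shared table held in the global section. */
    struct shared_table {
        int lockword;                   /* bit 0 guards the slot array, this node only */
        int owner_pid[MAXSLOTS];        /* 0 = free, else PID owning the slot */
    };

    /* Claim the first free slot; returns its index, or -1 if the table is full. */
    int claim_slot(struct shared_table *t, int my_pid)
    {
        int lock_bit = 0, i, slot = -1;

        while (lib$bbssi(&lock_bit, &t->lockword))
            ;                           /* held only for the scan below */

        for (i = 0; i < MAXSLOTS; i++) {
            if (t->owner_pid[i] == 0) {
                t->owner_pid[i] = my_pid;
                slot = i;
                break;
            }
        }

        lib$bbcci(&lock_bit, &t->lockword);
        return slot;
    }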

One issue not mentioned yet is the release of locks should a process die, but that's something we can handle through our process monitoring tools, which identify dead processes and release their resources.

Yes, John G, I was planning on having this in functions that other code calls, rather than scattered across and/or duplicated in several images. That's the only sensible way to do it, both for maintenance and for tweaking internal monitoring code.
John McL
Trusted Contributor

Re: Using LIB$BBSSI and BBCCI for locking

I forgot to add ... re-architecting the application is not possible in the short term because it's just too big, and I doubt the cost and effort of such a change could be justified in the long term either. I'm looking to just plug in a new, faster module and then switch certain components to use it rather than the old one.
John Gillings
Honored Contributor

Re: Using LIB$BBSSI and BBCCI for locking

John,

If you're already getting high levels of MPSYNCH using locks, chances are it would only get WORSE with spin locks covering wider critical regions. Why? At the moment, a process that is waiting for a lock request is not consuming CPU. The MPSYNCH you see is the time spent spinning on OpenVMS spinlocks, waiting to access the lock structures. Spinning for the whole duration of the lock request would be much worse.

Look at the granularity of the locks, and try to subdivide the objects of contention. Reduce MPSYNCH and increase parallelism.

Another thing to consider... if these are all batch jobs, what would happen if you ran them sequentially? That might eliminate the contention altogether. It's entirely possible you will complete the sequence faster than running them all in parallel.
A crucible of informative mistakes
Hoff
Honored Contributor

Re: Using LIB$BBSSI and BBCCI for locking

Fire up DECset PCA or analogous, and see where the applications are actually spending most of their time. Find your critical (slowest) code paths.

Only knowingly replicate "dumb"; don't blindly do so.

And don't blindly replicate an older application design.

There have been cases I've worked where it was far faster to load the whole data store into memory and run with it; disks and files are a convenience for restricted virtual and physical memory, after all. Much of this stuff was designed prior to 64-bit addressing, when a couple of gigabytes was Big Physical Memory.

Ensure you've properly segmented your cluster and your host-local locks, too. If your global sections are host-local, then embed the host name or such into the lock resource name. Keep the locks and lock trees local.
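A quick sketch of that (the MYAPP_GS_ prefix is made up; LIB$GETSYI with SYI$_NODENAME is the real call, and buf is assumed to be at least 32 bytes):

    #include <stdio.h>
    #include <descrip.h>
    #include <lib$routines.h>
    #include <syidef.h>

    /* Build "MYAPP_GS_<nodename>" so processes on different hosts lock
       different resources and the lock tree stays mastered locally. */
    int make_local_resname(char *buf)
    {
        char node[16];
        unsigned short nodelen = 0;
        unsigned int item = SYI$_NODENAME;
        struct dsc$descriptor_s nodedsc;
        int status;

        nodedsc.dsc$w_length  = (unsigned short) sizeof(node);
        nodedsc.dsc$b_dtype   = DSC$K_DTYPE_T;
        nodedsc.dsc$b_class   = DSC$K_CLASS_S;
        nodedsc.dsc$a_pointer = node;

        status = lib$getsyi(&item, 0, &nodedsc, &nodelen, 0, 0);
        if (!(status & 1))
            return status;              /* failed - return the condition value */

        sprintf(buf, "MYAPP_GS_%.*s", (int) nodelen, node);
        return status;
    }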

Before starting a locking rewrite, I'd look to spend time increasing the scope of what is locked or reducing the critical-path code (once that is known); that is, tweaking the current model.

Then look to get rid of the allocation of space if you can, or reduce the number of times the application needs to go after it. This could mean sharded or cached allocation of storage, going to interlocked queues and lookaside lists of allocated or deallocated blocks, or going after bigger hunks.

There are tools around beyond PCA, such as the LCK extension in SDA, and DECamds/AvailMan that can be useful, too.

And do the due diligence involved with tuning; look for overloaded disk spindles and such.
John McL
Trusted Contributor

Re: Using LIB$BBSSI and BBCCI for locking

How I miss good telephone conversations with knowledgeable Digital TSC people!

The trivial instance that I mentioned involves looking through a table of 1000 entries. Since space is assigned once to each process, the potential for contention is minor until the system is heavily loaded, but the Lock Manager always sends its information around the cluster.

Modifying the granularity sounds useful but there could be significant work in modifying the code and in testing. If we merely split something into smaller portions it might be necessary to lock and access multiple portions until the desired item is found.

I'm already considering our options for reducing that batch processing load and trying to identify the costs and benefits of each.

One point you've not commented on is whether 50,000 ENQ/DEQ operations per second is high, normal or low when running a whole heap of batch jobs.
John Gillings
Honored Contributor

Re: Using LIB$BBSSI and BBCCI for locking

>50,000 ENQ/DEQ operations per second is
>high, normal or low when running a whole
>heap of batch jobs.

On a VAX it might be an issue, but on an Itanium it's no big deal, especially if the locking activity is local. See MONITOR DLOCK.

On the other hand, are they really ENQ/DEQ? If a single process deals with the same resource multiple times, you might consider an ENQ NL at the start, then use lock conversions to synchronize. DEQ when you've completely finished with the resource.
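In outline, something like this (sketch only; the resource name is invented, the services and flags are the standard ones):

    #include <descrip.h>
    #include <lckdef.h>
    #include <starlet.h>

    struct lksb { unsigned short status, reserved;
                  unsigned int lock_id; char value[16]; };

    static $DESCRIPTOR(resnam, "MYAPP_SOME_RESOURCE");   /* invented name */
    static struct lksb lksb;

    void example(void)
    {
        /* Once, early on: queue the lock at NL so the resource stays known
           (and locally mastered) for the life of the process. */
        sys$enqw(0, LCK$K_NLMODE, &lksb, 0, &resnam, 0, 0, 0, 0, 0, 0, 0);

        /* Each time the data is touched: convert up... */
        sys$enqw(0, LCK$K_PWMODE, &lksb, LCK$M_CONVERT, 0, 0, 0, 0, 0, 0, 0, 0);
        /* ... read/update the shared data ... */

        /* ...then convert straight back down instead of dequeueing. */
        sys$enqw(0, LCK$K_NLMODE, &lksb, LCK$M_CONVERT, 0, 0, 0, 0, 0, 0, 0, 0);

        /* Only when completely finished with the resource: */
        sys$deq(lksb.lock_id, 0, 0, 0);
    }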

>but Lock Manager always sends its
>information around the cluster.

Not true. You need to check the resource name cluster wide, but once you have a lock on a locally mastered resource, there is no further external activity.

Keeping all the interested processes on a single node should keep the resource local. If the resource is a global section, then that should already be true.
A crucible of informative mistakes