Vin T
New Member

Locking Performance


I'm investigating lock performance issues in our system and how to improve lock throughput. We use locking extensively, so locking rates have a significant impact on our system's capacity and throughput.

We have two lock libraries that implement locking:

1. For system locks, a shared image library that implements user-written system services that run in exec mode. The locks here also have parent locks and belong to the system resource domain.

2. For UIC-specific locks, an object module that other services can link against. The locks here have no parent lock and belong to a UIC-specific resource domain.

We have seen very poor performance from the UIC-specific, non-exec-mode locks. For instance, for one of the locks, when I re-implemented it in the exec-mode routine with a parent lock, the average sys$enq time went down from ~4 ms to 0.4 ms, a 10x improvement! This caught me by surprise and has left me confused as well.

I have a few questions regarding this.

1. Does having a parent lock speed up performance? Are locks that have parent locks faster?

2. Are lock conversions significantly faster for locks acquired from exec-mode, installed shared routines?

3. Does the use of a non-system resource domain slow a lock down?

These are the only three differences between the two libraries. However, the performance we get out of them differs dramatically.

I'd really appreciate any clues/pointers/replies.

Thanks!
8 REPLIES
comarow
Trusted Contributor

Re: Locking Performance

I assume you are grabbing a null lock on any resource that will be used, converting it to the least restrictive lock mode possible, and then converting it back to a null lock as soon as you're done.
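
Roughly this pattern, as a sketch (the resource name, the PW working mode, and the simplified lock status block below are placeholders, not your code):

#include <starlet.h>    /* sys$enqw */
#include <lckdef.h>     /* LCK$K_NLMODE, LCK$K_PWMODE, LCK$M_CONVERT */
#include <efndef.h>     /* EFN$C_ENF */
#include <descrip.h>    /* $DESCRIPTOR */

/* Simplified lock status block; real code would use the LKSB layout from the system headers. */
static struct {
    unsigned short status;     /* completion status of the lock request */
    unsigned short reserved;
    unsigned int   lkid;       /* lock ID, filled in by the first $ENQW */
} lksb;

int use_resource(void)
{
    $DESCRIPTOR(resnam, "MYAPP_EXAMPLE_RESOURCE");   /* placeholder resource name */
    int status;

    /* One-time setup: take a NL lock so the lock and resource block stay around. */
    status = sys$enqw(EFN$C_ENF, LCK$K_NLMODE, &lksb, 0,
                      &resnam, 0, 0, 0, 0, 0, 0);
    if (!(status & 1)) return status;
    if (!(lksb.status & 1)) return lksb.status;

    /* Per use: convert up to the least restrictive mode that protects the work. */
    status = sys$enqw(EFN$C_ENF, LCK$K_PWMODE, &lksb, LCK$M_CONVERT,
                      0, 0, 0, 0, 0, 0, 0);
    if (!(status & 1)) return status;
    if (!(lksb.status & 1)) return lksb.status;

    /* ... do the work protected by the PW lock ... */

    /* As soon as the work is done, convert back down to NL. */
    return sys$enqw(EFN$C_ENF, LCK$K_NLMODE, &lksb, LCK$M_CONVERT,
                    0, 0, 0, 0, 0, 0, 0);
}

The point of the pattern is that the conversions avoid repeatedly creating and deleting the lock and its resource block.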

Are you getting lock tree remastering? If so, consider reducing PE2 to 0 or a small number.

You can dedicate a CPU to do all the lock management. It will all show up as kernel-mode time, but with a lot of CPUs it can help.

Up to date on ECOs?
Ian Miller.
Honored Contributor

Re: Locking Performance

You may find the LCK SDA extension useful.

It works in a similar fashion to the FLT extension.

http://h71000.www7.hp.com/doc/82final/6549/6549pro_030.html#sda_flt

Enter a LCK command in SDA to see the brief help.
____________________
Purely Personal Opinion
Jon Pinkley
Honored Contributor

Re: Locking Performance

Vin T,

You have provided more info than many posters do, but there are some important things that are not specified.

Is this in a VMS cluster? Working with a lock resource that has all of its activity on a single node is much faster than having to send requests to another node that is the resource master.

What type of processor(s) do you have?

What version of VMS is running?

How are you measuring the sys$enq time?

Are the $enq's all doing conversions (using the LCK$M_CONVERT flag)?

If not conversions, are these for the first lock on the resource name, or does the RSB already exist?

Can you reproduce your results in a small sample program?

Please provide the specific $enq calls that you are making in the two cases.



I will try to address your questions:

>>>1. Does having a parent lock speed up performance? Are locks that have parent locks faster?

If the LCK$M_CONVERT flag is not set, the $enq will create a new lock. If the parent lock is specified, a lock resource master lookup is not needed; and if in addition to the parent lock id being specified, the requesting process is on the node that is the resource master, then the new lock can be created without any internode lock messages.

So for new sublocks in a cluster, I would expect it to be faster. This is especially true if the parent lock is mastered on the node requesting the new lock. But for conversions, I don't think it matters.
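
For illustration only (untested; the names and the simplified lock status block are placeholders), a new sub-lock request just passes the parent's lock ID as the parid argument:

#include <starlet.h>    /* sys$enqw */
#include <lckdef.h>     /* LCK$K_NLMODE */
#include <efndef.h>     /* EFN$C_ENF */
#include <descrip.h>    /* $DESCRIPTOR */

/* Simplified lock status block for the sketch. */
struct simple_lksb { unsigned short status, reserved; unsigned int lkid; };

/* Take a new NL sub-lock under an already granted parent lock.
   parent_lkid is the lock ID returned in the parent's lock status block. */
int take_sublock(unsigned int parent_lkid, struct simple_lksb *sub_lksb)
{
    $DESCRIPTOR(sub_resnam, "MYAPP_SUB_RESOURCE");   /* placeholder name */

    return sys$enqw(EFN$C_ENF, LCK$K_NLMODE, sub_lksb, 0,
                    &sub_resnam,
                    parent_lkid,    /* parid: makes this a sub-lock of that parent */
                    0, 0, 0, 0, 0);
}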

>>>2. Are lock conversion speeds for locks that are acquired from exec-mode, installed shared routines significantly higher?

I can't think of any reason they should be. The mode still has to change to kernel.

>>>3. Does the use of a non-system resource domain slow a lock down?

I can't think of anything that should cause a significant difference in the amount of time a $enq takes. Are you specifying something in the rsdm_id argument to $enq? Using the system resource id may be slightly faster than the UIC group domain, but we are talking about a short code sequence here. Are you using $set_resource_domain? If not, I doubt you would be able to measure a difference; and even if you are, unless the process is joining many resource domains, it is unlikely you will notice a performance difference.

These answers are not based on testing. YMMV.

Here are some additional ITRC threads that you may want to read.

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1337312

http://forums.itrc.hp.com/service/forums/questionanswer.do?threadId=1178555

One final observation: 4 ms is a long time on recent processors. That's 250 $enq operations per second. 2500/sec is still pretty low if these are being handled on a single node. If the requests are being handled by a resource master on a different node, across a high-latency link, 4 ms may be considered fast.

Jon
it depends
Hoff
Honored Contributor

Re: Locking Performance

Here, the cause could be anything from excessive overhead due to alignment faults (Itanium), to the aforementioned (lack of) use of null-mode locks, to (re)mastering, to lock collisions within a cluster, to simply the best that can be expected from a slow Alpha or VAX box or slow disks or...

More recent OpenVMS releases are (usually) faster around lock-related activity, too.

The other approach here is to work toward reducing or eliminating the locking traffic: toward data segmentation and sharding, and toward larger granularity with locking.
Andy Bustamante
Honored Contributor

Re: Locking Performance

Vin T,

Would you provide more detail on the system? Is this a cluster? What's the underlying hardware: VAX, AlphaServer, or Integrity? Which model systems are in place, and with how many CPUs? What version of OpenVMS?

Starting in 7.3, the dedicated lock manager became available. The initial recommendation was to test it on systems with more than 4 CPUs; later advice recommended testing on 4-CPU systems with your application. I saw noticeable results in testing on GS-80 and GS-1280 systems.

You may also want to install Availability Manager (http://h71000.www7.hp.com/openvms/products/availman/), which can provide some insight into locking behavior.

Andy Bustamante
If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
Vin T
New Member

Re: Locking Performance

Some more information about this:

We're running in a VMS cluster of four nodes (4 CPUs each): two Alpha and two Itanium machines. This test, however, was run on an Itanium. The VMS version is 8.3-1H1.

I'm basically measuring the sys$enq time by timing the call, i.e., measuring the current time before and after the call and computing the delta.
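
Something like this simplified sketch (not the actual code; the conversion call stands in for the real one, and SYS$GETTIM returns the system time in 100-ns units):

#include <starlet.h>    /* sys$enqw, sys$gettim */
#include <lckdef.h>     /* LCK$K_PWMODE, LCK$M_CONVERT */
#include <efndef.h>     /* EFN$C_ENF */

/* Simplified lock status block; lkid is assumed to hold an already granted lock. */
static struct { unsigned short status, reserved; unsigned int lkid; } lksb;

long long time_one_conversion(void)
{
    unsigned long long t0, t1;
    int status;

    sys$gettim((void *)&t0);                          /* system time before the call */
    status = sys$enqw(EFN$C_ENF, LCK$K_PWMODE, &lksb, LCK$M_CONVERT,
                      0, 0, 0, 0, 0, 0, 0);
    sys$gettim((void *)&t1);                          /* system time after the call */

    if (!(status & 1) || !(lksb.status & 1))
        return -1;                                    /* the conversion failed */

    return (long long)(t1 - t0);                      /* elapsed time, in 100-ns units */
}

(The delta therefore includes any time spent waiting for the grant, not just the service overhead.)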

PE2 value is 0.

All the enq's are lock conversions. They are not new lock requests.

I've listed the two enq calls below. The first one gives us extremely fast lock speeds, while the second is anywhere between 3x and 10x slower.

The lock conversions are between NL and PW modes.

1. sys$enqw(EFN$C_ENF, lock_mode, &lksb, LCK$M_VALBLK|LCK$M_CONVERT|LCK$M_SYSTEM, resnam, parent_lock_id, 0, 0, 0, PSL$C_EXEC, resource_domain_id);

2. SYS$ENQW(0, lock_mode, &lksb, LCK$M_SYNCSTS|LCK$M_CONVERT, resnam, 0, 0, 0, 0, 0, resource_domain_id);

The first call is from an exec-mode routine in an installed shared library, and it is the one that gives us very good performance.

I'm not sure if all the locks are mastered locally; however, since they are part of a resource domain based on the UIC, I believe they might be.
Hoff
Honored Contributor

Re: Locking Performance

Tweak (the code), test (time) the change, and repeat.

I'd start with the EF here, personally. I'd move off of EF0 and to EFN$C_ENF. EF0 has contention all over the place, and you're probably not (actually) using EF0 here.
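
Concretely, that's just the first argument of your second call, assuming everything else stays as you posted it:

#include <efndef.h>   /* EFN$C_ENF */

/* before: event flag 0, shared with anything else in the process that uses EF0 */
status = SYS$ENQW(0, lock_mode, &lksb, LCK$M_SYNCSTS|LCK$M_CONVERT, resnam, 0, 0, 0, 0, 0, resource_domain_id);

/* after: no real event flag is consumed */
status = SYS$ENQW(EFN$C_ENF, lock_mode, &lksb, LCK$M_SYNCSTS|LCK$M_CONVERT, resnam, 0, 0, 0, 0, 0, resource_domain_id);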

I'd also look at where the lock trees are located. Having these trees local to the activity is better.

And rather than looking at the lock mastering (up front), I'd be looking at the code and at the lock contention more generally, and at where the lock trees are located and the trade-off with the use of the DLM here; the DLM is wonderful and elegant and (usually) convenient, but data sharding tends to scale better here.

(OpenVMS itself has gone to some effort in recent releases to split up the spinlocks and device locks and such; to identify and reduce or eliminate contention for various kernel data structures.)

The PE1 system parameter was (is) one of the (undocumented) means used to exercise some control over the mastering of locks over the years and particularly to avoid tree thrashing. These system parameters (PE1, PE2, PE3, PE4, PE5, PE6) tend to be undocumented and can also be version-specific.

With OpenVMS V8.3 and later releases, there are tools and knobs specifically intended to assist in monitoring and controlling the trees, and changes to speed the remastering. See both LOCKDIRWT and LOCKRMWT, as a start. MONITOR DLOCK. The SDA locking mechanisms.

Frequent and fine-grained locking (locks frequently acquired and released) can be a serious performance problem, without even needing to look at the difference of 0.4 ms and 4.0 ms per $enqw call.
Jon Pinkley
Honored Contributor

Re: Locking Performance

Vin,

My previous comment was based on $enq, not $enqw which your calls show.

$enq puts the request in the queue and doesn't wait, whereas $enqw puts the request into the queue and waits for the lock to be granted.

Those two will have vastly different timings, and the process calling $enqw has little control over how long it will have to wait, unless it is converting to NL, or it specifically tells the service not to wait if the request can't be granted immediately via the LCK$M_NOQUEUE flag; but in that case the request will not be entered in the queue, so the lock will never be granted. There are some undocumented parameters and flags that can affect whether the $enq can "cut in line" ahead of other existing queued requests. The best those can do is get the request to the front of the queue; they don't force other holders to drop any incompatible locks they hold, so they cannot guarantee that processes holding incompatible locks will release them. The examples you provided are not using any of those features, so that can't explain the differences you reported.

Even if a process holding an incompatible lock has a blocking AST specified, there is no guarantee that it will get scheduled in a timely manner so it can convert its lock to a compatible mode, especially if it is running at a low priority on a busy system. This is classic priority inversion, i.e., a process executing at priority 0 can be holding a lock needed by a priority 16 process, and it can block the high-priority process. If there are medium-priority processes starving the priority 0 process of CPU, then even if the priority 0 process has a blocking AST and is willing to release the lock, it won't be able to do so if it never gets scheduled. The PIXSCAN mechanism will eventually grant some CPU to the starved process, but that can take a long time (tens of seconds).

What is the purpose of the locks? Are they to coordinate access to shared memory, as a signaling mechanism for another process, or some other purpose?

What else is using the resource names in the same resource domain? What lock modes (PW,PR,EX, etc.) are being used by the other processes that are using the same resources?

Since your second $enqw is specifying the SYNCSTS flag, are you checking the return status for SS$_SYNCH vs. SS$_NORMAL? In cases where SS$_NORMAL is returned when the LCK$M_SYNCSTS flag is specified, the lock request could not be granted immediately, and your process was forced to wait. Any conversion to NL should always return SS$_SYNCH, but if there is a currently granted lock that is incompatible with PW, then the process will have to wait until whatever is holding the lock converts to a compatible mode or issues a $DEQ, and the process requesting the lock has no control over other processes holding the lock. That is one place where an EXEC acmode lock has some advantage over a user-mode lock, as only processes executing in exec or kernel mode can request exec acmode locks.
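
As a sketch of that check (the lock status block is simplified and the wait counter is just for illustration):

#include <starlet.h>   /* sys$enqw */
#include <ssdef.h>     /* SS$_SYNCH, SS$_NORMAL */
#include <lckdef.h>    /* LCK$M_SYNCSTS, LCK$M_CONVERT */
#include <efndef.h>    /* EFN$C_ENF */

static struct { unsigned short status, reserved; unsigned int lkid; } lksb;  /* simplified */
static unsigned int waited;   /* count of conversions that could not be granted immediately */

int convert_with_syncsts(unsigned int lock_mode)
{
    int status = sys$enqw(EFN$C_ENF, lock_mode, &lksb,
                          LCK$M_SYNCSTS | LCK$M_CONVERT,
                          0, 0, 0, 0, 0, 0, 0);

    if (status == SS$_SYNCH) {
        /* Granted immediately; the process did not have to wait at all. */
    } else if (status == SS$_NORMAL) {
        waited++;        /* request was queued and the process had to wait for the grant */
    } else if (!(status & 1)) {
        return status;   /* the service call itself failed */
    }

    return lksb.status;  /* final status of the lock request */
}

If the slow calls come back SS$_NORMAL and the fast ones SS$_SYNCH, the difference is time spent waiting for another lock holder rather than $ENQW overhead.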

Is your use of exec acmode locks attempting to synchronize with RMS? Or is the reason for using exec acmode locks so they will survive image rundown?

Several comments about the calls listed above.

1. When a parent lock is specified, the acmode is ignored, and the acmode of the parent lock is used.
2. When the LCK$M_CONVERT flag is specified, the resnam is ignored, as are LCK$M_SYSTEM and rsdm_id, since these can all be determined from the lock ID, which must be specified in the lksb for a conversion.

It doesn't hurt to specify them, other than possibly causing someone reading the code to make false assumptions about where the information is coming from.

However, since these are all ignored for conversions, I would expect the times to enqueue the lock request to be nearly the same if they were using the same resource. The setting of a local event flag is fast. So that leaves either the one resource being mastered on a different node, or contention for the resource name (or the delay associated with another process releasing a blocking lock). If you have a standalone box to test on and you still see a difference, then it is most likely due to the waiting time, not the queuing time.

Note that the acmode is part of the resource "identification". So even within the same resource domain, there can be multiple resources with the same resnam.

The resource is uniquely identified by the following combination: resnam, UIC group (resource domain), access mode, address of parent RSB.

So if the resources being used by the EXEC mode routines are actually EXEC mode resources, then user mode code will not be able to take new locks on them. The point is that there may be more contention and less control over what can lock user mode resources than exec mode resources.

Also, as Hoff noted, you should probably be using EFN$C_ENF instead of 0. Although it is unlikely to make a noticeable difference in time, it ensures you are not causing unintended side effects; for example, some other part of the program may be using event flag 0 as well.

Also, you should be checking the status values returned by SYS$ENQW and in the lock status block. Even for $ENQW these can have different types of status return values. Perhaps you are checking these; we can't tell.

To find out where the time is being spent, follow Ian's advice and look at the SDA LCK extension, specifically the trace facility. This will give you high-resolution timestamps of when a conversion was requested and when it was granted. Be aware that the trace facility will affect performance and can generate a lot of info that you will then have to sift through. Also, the

SDA> lck show trace

command displays most recent first, which may not be what you would expect.

Jon
it depends