
SOLVED
Thomas Thacker
Occasional Advisor

High MPSYNC - help

We have recently been experiencing periods of very high MPSYNC activity. I'd like to know the primary (typical) causes of high MPSYNC activity.

We are running OpenVMS 7.3-2 on a 3-node cluster of GS1280 systems. The system in question is a 16-CPU system with 96GB of memory. We are running Cerner's Millennium software suite, using an Oracle database.

We use host-based shadowing and have HBMM enabled. The system in question has 68 shadow disk sets.

Cluster communication is through dual CIPCA adapters (the second one is for failover).

The disks are dual-fiber connected to a Storageworks SAN (HSG80s).

The activity has gone from approximately 100% (which was normal) to periods of 800-900% in MPSYNC mode (as reported by MONITOR MODES).

Any information and/or hints where I should start looking would be appreciated.

There have been no major changes in hardware or software configuration (including VMS updates) since last August.
Jim_McKinney
Honored Contributor
Solution

Re: High MPSYNC - help

Oracle... lots of I/Os, lots of locking, lots of CPUs. Wild guess here... are you using the dedicated CPU lock manager? If not, check the LCKMGR_MODE parameter:

$ MCR SYSGEN SHOW LCKMGR_MODE

It's dynamic, so you might experiment... and if you're not already using FAST_PATH for the CIPCAs, take a look at that SYSGEN parameter as well (it's not dynamic).
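
To actually turn it on for the running system, something like the following should do it (a sketch only - the LCKMGR_MODE value semantics are from the 7.3-2 documentation, so verify against your version, and add the change to MODPARAMS.DAT if you want it to survive a reboot):

$ MCR SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SET LCKMGR_MODE 2     ! dedicate a CPU when at least 2 CPUs are active; 0 disables
SYSGEN> WRITE ACTIVE          ! apply to the running system (dynamic parameter)
SYSGEN> EXIT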
Andy Bustamante
Honored Contributor

Re: High MPSYNC - help


Along with Jim's comment, make sure to look at the SYSGEN parameter LCKMGR_CPUID. You don't want the dedicated lock manager running on a FAST_PATH CPU. I've heard 8.3 will check for conflicts before assigning a lock manager CPU.

For our in-house database application, the dedicated lock manager makes a noticeable performance improvement.

LCKMGR_CPUID and LCKMGR_MODE are dynamic and can be modified on the fly. We tested the dedicated lock manager on GS-80s and GS-1280s.
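
A quick way to check for that conflict (assuming the SHOW FASTPATH command is available on your version, which it should be on 7.3-2):

$ SHOW FASTPATH                    ! which CPUs the Fast Path ports are assigned to
$ MCR SYSGEN SHOW LCKMGR_CPUID     ! which CPU the dedicated lock manager will claim

If they collide, LCKMGR_CPUID is dynamic too:

$ MCR SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SET LCKMGR_CPUID 15    ! example value only - pick a non-Fast-Path CPU
SYSGEN> WRITE ACTIVE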

Andy Bustamante
If you don't have time to do it right, when will you have time to do it over? Reach me at first_name + "." + last_name at sysmanager net
John Gillings
Honored Contributor

Re: High MPSYNC - help

Thomas,

>what are the primary (typical) causes
>of high MPSYNC activity.

Generic answer - contention on spinlocks between CPUs. The question is WHICH spinlocks.

You can use the SDA Spinlock Tracing Utility to see which spinlocks are being hit. See

$ ANALYZE/SYSTEM
SDA> SPL

for some cursory documentation. See Chapter 8 of the "HP OpenVMS System Analysis Tools Manual" for more details.
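
A rough outline of a tracing session (the exact subcommands and qualifiers are in the manual chapter above - treat this as a sketch, not gospel):

$ ANALYZE/SYSTEM
SDA> SPL LOAD            ! load the spinlock tracing execlet
SDA> SPL START TRACE     ! start collecting spinlock activity
SDA> SPL SHOW TRACE      ! see which spinlocks are hot, and who is spinning on them
SDA> SPL STOP TRACE
SDA> SPL UNLOAD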

Talk to your local CSC if you need further assistance in driving the tool, or interpreting the results.

What you do depends on which spinlocks are most contentious.

If it's the LCKMGR spinlock, using a dedicated lock manager CPU is a good idea. A 16P system is a likely candidate. The obvious question is "so why should I pay for an entire CPU to run the VMS lock manager?". The answer is "so lock management doesn't cost you MORE than an entire CPU!"

At the moment you're using 8 or 9 CPUs just spinning waiting for spinlocks. You could STOP/CPU up to 6 or 7 CPUs and expect IMPROVED performance from the system as a whole!

It's all about how code scales to multiple CPUs. Starting from the workload that a single CPU can perform, adding a second CPU gives "almost" 2x performance, because of resource contention between the multiple streams of execution. Each additional CPU adds slightly less than the last one, until you reach a plateau where an extra CPU adds nothing. This is the "knee" in the scaling curve for your particular workload. Beyond that point, adding more CPUs will REDUCE overall throughput, because contention increases by more than the additional compute power added. At 800-900% MPSYNCH, your workload could be past the knee and well on its way to the ankle ;-)
A crucible of informative mistakes
Volker Halle
Honored Contributor

Re: High MPSYNC - help

Thomas,

there is a procedure SYS$EXAMPLES:SPL.COM which collects SPINLOCK information and also has some comments and background information included.
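
It needs suitable privileges and is simply invoked as a command procedure; the comment block at the top of the file describes the available options:

$ @SYS$EXAMPLES:SPL.COM    ! collects a spinlock trace and reports contention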

Volker.
Thomas Thacker
Occasional Advisor

Re: High MPSYNC - help

Thanks to all for the great responses. It's exactly the kind of info I was looking for.

FYI - we have FAST_PATH enabled, but not dedicated lock manager for this node. I'll try that today.

Based on the spinlock data I saw, it looks like the MQ interface that Cerner uses is the biggest spinlock user...

Regards,
Tom
Art Wiens
Respected Contributor

Re: High MPSYNC - help

Thomas, such expert, timely, support level advice (for free!) is surely worth assigning some points! Don't worry, they're free as well ;-)

Cheers,
Art
Thomas Thacker
Occasional Advisor

Re: High MPSYNC - help

I dedicated a CPU to the lock manager this morning.

So far, I've noticed no MPSYNC improvement. Still seeing periods of over 800% MPSYNC time... not good. At times, MPSYNC spikes to consume almost all 16 CPUs!
Volker Halle
Honored Contributor

Re: High MPSYNC - help

Thomas,

maybe it's time to use SYS$EXAMPLES:SPL.COM and provide the output for us to look at...

T4 would also be a very useful tool to collect system performance information in such a situation. OpenVMS engineering is using this tool for performance analysis.

http://h71000.www7.hp.com/OpenVMS/products/t4/index.html

Volker.
Thomas Thacker
Occasional Advisor

Re: High MPSYNC - help

We do collect T4 data. I don't see anything in T4 that might help determine the cause of the MPSYNC issue. The data does show that the worst MPSYNC activity occurs between 10 and 11 AM. I'll fire up the SPL procedure during that window tomorrow morning and post the results here.

I suspect that MQ may be part of the problem, but I have no proof. We are a bit behind in VMS patches; the last update was V7.3-2 Update 4. I was trying to determine whether MQ V3.0 was included in Update 4 or not. If not, we are two patches behind on MQ. The last MQ update I can see in the patch history on our system is MQ V2.0. Unfortunately, I've not been able to find information online about previous updates.

Maybe there are other patches that address spin-lock issues that we have not applied yet?

Regards,
Tom